rfc:replace_parse_url

Differences

This shows you the differences between two versions of the page.

rfc:replace_parse_url [2016/10/04 18:14] bp1222rfc:replace_parse_url [2021/03/27 14:57] (current) – Move to inactive ilutov
Line 1: Line 1:
-====== PHP RFC: Replace parse_url() ====== +====== PHP RFC: Create RFC Compliant URL Parser ====== 
-  * Version: 0.1 +  * Version: 0.3
   * Date: 2016-10-04   * Date: 2016-10-04
   * Author: David Walker (dave@mudsite.com)   * Author: David Walker (dave@mudsite.com)
   * Proposed version: PHP 7.2+   * Proposed version: PHP 7.2+
-  * Status: Under Discussion +  * Status: Inactive
   * First Published at: http://wiki.php.net/rfc/replace_parse_url   * First Published at: http://wiki.php.net/rfc/replace_parse_url
  
 ===== Introduction ===== ===== Introduction =====
-This RFC came about from an attempt to resolve [[https://bugs.php.net/bug.php?id=72811|Bug #72811]].  In the attempt, discussion shifted from trying to patch the current implementation to more generally replacing it.  The current implementation of ''parse_url()'' makes a bunch of exceptions to [[https://tools.ietf.org/html/rfc3986|RFC 3986]].  I do not know if these are conscious exceptions, or if ''parse_url()'' was never based on the RFC. +This RFC came about from an attempt to resolve [[https://bugs.php.net/bug.php?id=72811|Bug #72811]].  In the attempt, discussion shifted from trying to patch the current implementation of ''parse_url()'' to more generally replacing it.  The discussion then shifted to the inability to remove ''parse_url()'' due to BC issues.  Ideas formed on creating an immutable class that will take a URL and parse it, exposing the pieces by getters.
  
-So, this RFC proposes replacing the current implementation of ''parse_url()'' with an re2c-based parser that will be strict to the RFC when parsing URIs. +The current implementation of ''parse_url()'' makes a bunch of exceptions to [[https://tools.ietf.org/html/rfc3986|RFC 3986]].  I do not know if these are conscious exceptions, or if ''parse_url()'' was never based on the RFC.  After raising this RFC, I was alerted that the RFC is complemented by the [[https://url.spec.whatwg.org|WHATWG]] spec on URLs.  The aim of WHATWG is to combine RFC 3986 and [[https://tools.ietf.org/html/rfc3987|RFC 3987]].  However, WHATWG is a "Living Standard", which makes it subject to change, however frequent.  Although it does some good in combining the two RFCs, a single PHP parser that requires constant maintenance to track the evolving standard is not exactly practical.
  
-===== Reasoning ===== +So, this RFC proposes creating a new parser that adheres to the two RFCs.  In doing so, if PHP is compiled with mbstring support, it would be able to properly support multibyte characters in a URL.
-The bug described an issue where using ''parse_url()'' with an IPv4 address would correctly parse the host, but with IPv6 it would not. +
  
 +===== Proposal =====
 <file php> <file php>
 <?php <?php
-var_dump(parse_url("127.0.0.1:80", PHP_URL_HOST)); 
-var_dump(parse_url("[::1]:80", PHP_URL_HOST)); 
  
-/* Outputs: +class URL { 
-string(9) "127.0.0.1" +    public function __construct(string $url, string|URL $base); 
-NULL +     
-*/ +    /** 
-</file>+     * $input - The string to be parsed 
 +     * $base - (optional) If $input is relative, this is what it is relative to 
 +     * $encoding_override - (optional) we assume $input is a UTF-8 encoded string, you may change it here 
 +     * $url - (optional) A URL object that should be modified by the parsing of $input.  The return value will be this variable as well 
 +     * $state_override - (optional) begin parsing the $input from a specific state
 +     */ 
 +    static public function parse(string $input[, URL $base[, int $encoding_override[, URL $url[, int $state_override]]]]) : URL; 
 +     
 +    public function getScheme() : ?string; 
 +    public function getUsername() : ?string; 
 +    public function getPassword() : ?string; 
 +    public function getHostname() : ?string; 
 +    public function getPort() : ?int; 
 +    public function getPath() : ?string; 
 +    public function getQuery() : ?string; 
 +    public function getFragment() : ?string; 
 +     
 +    public function getAll() : array; 
 +}
  
-While we may agree that the former line is sensible and maybe expected, the behavior is contrary to how the RFC defines parsing a URI.  To be compliant it should parse as a single PATH element, ''string(12) "127.0.0.1:80"''.  Why?  The RFC defines the ''host'' as a component of the ''authority''.  The authority is only parsed if it's preceded by a double-slash.  Since the above example lacks a double-slash, the ''authority'' portion of the ''hier-part'' should not be processed, and the example would match into the ''path-rootless'' portion. 
- 
-The bug does state that IPv4 and IPv6 addresses are parsed differently (in the sense that the IPv4 parsing isn't standards compliant).  However, according to the RFC, the result for the IPv6 case the user reported in the bug is accurate per the spec.  None of the path elements permit a ''['' as the first character of the path, so the IPv6 formatted line should be NULL. 
- 
-An accurate example of standards compliant parsing: 
-<file php> 
-<?php 
-var_dump(parse_url("127.0.0.1:80", PHP_URL_PATH)); 
-var_dump(parse_url("[::1]:80", PHP_URL_PATH)); 
- 
-/* Outputs: 
-string(12) "127.0.0.1:80" 
-NULL 
-*/ 
 </file> </file>
- 
-With that in mind, a correct example of parsing URIs to acquire the host portion, per the bug's request, would look similar to the following: 
-<file php> 
-<?php 
-var_dump(parse_url("127.0.0.1:80/index.php", PHP_URL_HOST)); 
-var_dump(parse_url("[::1]:80/index.php", PHP_URL_HOST)); 
- 
-var_dump(parse_url("//127.0.0.1:80/index.php", PHP_URL_HOST)); 
-var_dump(parse_url("//[::1]:80/index.php", PHP_URL_HOST)); 
- 
-/* Outputs: 
-NULL 
-NULL 
-string(9) "127.0.0.1" 
-string(5) "[::1]" 
-*/ 
-</file> 
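
The double-slash rule described above can be illustrated with the generic component-splitting regular expression from RFC 3986, Appendix B.  This is only a rough sketch: the helper name ''splitAuthority()'' is hypothetical, the regular expression is looser than the RFC's ABNF, and it is not the re2c parser this revision proposed.

```php
<?php
// Hypothetical helper for illustration; not part of PHP or of this RFC.
// It applies the component-splitting regex from RFC 3986, Appendix B,
// which only recognises an authority when it is introduced by "//".
function splitAuthority(string $ref): array
{
    $pattern = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';
    preg_match($pattern, $ref, $m);

    // Group 3 is the "//authority" part; it is empty when no "//" is present.
    $hasAuthority = ($m[3] ?? '') !== '';

    return [
        'authority' => $hasAuthority ? $m[4] : null,
        'path'      => $m[5] ?? '',
    ];
}

var_dump(splitAuthority('//[::1]:80/index.php')); // authority "[::1]:80", path "/index.php"
var_dump(splitAuthority('/index.php'));           // authority NULL, path "/index.php"
```

Splitting the authority further into host and port, and validating each component against the ABNF, is left out of this sketch.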
- 
-===== Proposal ===== 
-The proposal of this RFC is twofold.  One, replace the current parser used for ''parse_url()'' with one generated by re2c.  Two, ensure ''parse_url()'' more closely follows the RFC.  The function signature will not change; however, the return value will be more consistent. 
- 
-The function can return 
-  * An array consisting of each component of the URI found. 
-  * A string|int of the component requested by the 2nd argument 
-  * NULL when we cannot parse the URI, or the requested component contains no value 
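
To make the getter-based idea from the newer revision concrete, here is a minimal sketch of an immutable value object.  The class name ''UrlSketch'' is hypothetical, and it delegates to the existing ''parse_url()'' purely as a stand-in; the RFC proposes a new RFC 3986/3987 parser, which this sketch does not implement.

```php
<?php
// Illustration only: UrlSketch is a hypothetical name, and parse_url() is
// used here as a placeholder for the new parser the RFC has in mind.
final class UrlSketch
{
    /** @var array<string, int|string> components found in the URL */
    private $parts;

    public function __construct(string $url)
    {
        $parts = parse_url($url);
        if ($parts === false) {
            throw new InvalidArgumentException("Unparseable URL: $url");
        }
        $this->parts = $parts;
    }

    // No setters: once constructed, the object never changes.
    public function getScheme(): ?string   { return $this->parts['scheme'] ?? null; }
    public function getUsername(): ?string { return $this->parts['user'] ?? null; }
    public function getPassword(): ?string { return $this->parts['pass'] ?? null; }
    public function getHostname(): ?string { return $this->parts['host'] ?? null; }
    public function getPort(): ?int        { return $this->parts['port'] ?? null; }
    public function getPath(): ?string     { return $this->parts['path'] ?? null; }
    public function getQuery(): ?string    { return $this->parts['query'] ?? null; }
    public function getFragment(): ?string { return $this->parts['fragment'] ?? null; }

    public function getAll(): array        { return $this->parts; }
}

$url = new UrlSketch('https://user:secret@example.com:8080/index.php?x=1#top');
var_dump($url->getHostname()); // string(11) "example.com"
var_dump($url->getPort());     // int(8080)
```

Because every getter returns ''null'' for a missing component, callers do not need to distinguish between an array, a scalar, and ''false'' as they do with ''parse_url()'' today.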
  
 ===== Backward Incompatible Changes ===== ===== Backward Incompatible Changes =====
-Many of the tests that were developed for the current implementation of ''parse_url()'' have been changed to reflect more standards-compliant behavior.  This change will break anyone who is using the function with a non-standards-compliant URI format.  This is the most problematic part in terms of a BC break.  By this point, many people who use ''parse_url()'' might expect it to work in a, let's say, forgiving manner.  The example provided in the bug report is a perfect example of what I feel is a common use of this function that will no longer work once parsing is standards compliant. +None
- +
-This function will no longer return false.+
  
 ===== RFC Impact ===== ===== RFC Impact =====
 ==== To Existing Extensions ==== ==== To Existing Extensions ====
-standard +standard
  
 ===== Open Issues ===== ===== Open Issues =====
-  * Deprecate ''parse_url()'' and create a new function with new parsing +  * Deprecate ''parse_url()''?  Try and push people into using the new URLParser class. 
-  * Allow for certain breaks in the RFC to provide more lenient parsing? (i.e. allow 'example.com:80' to parse as host & port, not a path) +  * Should ''parse_url()'' have a sunset date of PHP 8, or PHP 9?
  
 ===== Proposed Voting Choices ===== ===== Proposed Voting Choices =====
-Vote to replace ''parse_url()'' with an re2c parser, and require standard compliant URI formats. 
 Requires 2/3 Requires 2/3
  
Line 91: Line 65:
  
 ===== References ===== ===== References =====
-PR with working Implementation: [[https://github.com/php/php-src/pull/2079]] 
rfc/replace_parse_url.1475604853.txt.gz · Last modified: 2017/09/22 13:28 (external edit)