rfc:replace_parse_url

Differences

This shows you the differences between two versions of the page.

rfc:replace_parse_url [2016/10/10 02:48] bp1222rfc:replace_parse_url [2021/03/27 14:57] (current) – Move to inactive ilutov
Line 1: Line 1:
-====== PHP RFC: Create URLParser and URLBuilder Classes ====== +====== PHP RFC: Create RFC Compliant URL Parser ====== 
-  * Version: 0.2+  * Version: 0.3
   * Date: 2016-10-04   * Date: 2016-10-04
   * Author: David Walker (dave@mudsite.com)   * Author: David Walker (dave@mudsite.com)
   * Proposed version: PHP 7.2+   * Proposed version: PHP 7.2+
-  * Status: Draft+  * Status: Inactive
   * First Published at: http://wiki.php.net/rfc/replace_parse_url   * First Published at: http://wiki.php.net/rfc/replace_parse_url
  
Line 10: Line 10:
 This RFC came about from an attempt to resolve [[https://bugs.php.net/bug.php?id=72811|Bug #72811]].  In the attempt, discussion shifted from trying to patch the current implementation of ''parse_url()'' to more generally replacing the current one.  The discussion then shifted to the inability to remove ''parse_url()'' due to BC issues.  Ideas formed on creating an immutable class that will take a URL and parse it, exposing the pieces by getters. This RFC came about from an attempt to resolve [[https://bugs.php.net/bug.php?id=72811|Bug #72811]].  In the attempt, discussion shifted from trying to patch the current implementation of ''parse_url()'' to more generally replacing the current one.  The discussion then shifted to the inability to remove ''parse_url()'' due to BC issues.  Ideas formed on creating an immutable class that will take a URL and parse it, exposing the pieces by getters.
  
-The current implementation of ''parse_url()'' makes a bunch of exceptions to [[https://tools.ietf.org/html/rfc3986|RFC 3986]].  I do not know if these are conscious exceptions, or, if ''parse_url()'' was never based off of following the RFC.  After raising this RFC, I was alerted that the RFC, is itself, generally superseded by [[https://url.spec.whatwg.org|WHATWG]] spec on URLs.  This is a more practical specification to how URLs exist in the real-world.+The current implementation of ''parse_url()'' makes a bunch of exceptions to [[https://tools.ietf.org/html/rfc3986|RFC 3986]].  I do not know if these are conscious exceptions, or if ''parse_url()'' was never based on the RFC.  After raising this RFC, I was alerted that the RFC is complemented by the [[https://url.spec.whatwg.org|WHATWG]] spec on URLs.  The aim of WHATWG is to combine RFC 3986 and [[https://tools.ietf.org/html/rfc3987|RFC 3987]].  However, the WHATWG spec is a "Living Standard", which makes it subject to change at any time.  Although it does some good in combining the two RFCs, a single PHP parser that requires constant maintenance to track the evolving standard is not exactly practical.
  
-So, this RFC proposes creating two new classes, URLParser and URLBuilder.  The former will be an immutable class that is constructed with a URL to be parsed.  There will be methods to access each piece of the URL, as well as a general getter, that will accept a string of flags that will return requested portions in an array.  The complement to this will be URLBuilder, which will expose methods to set, or add, pieces to a URL, and a method to get the built value. +So, this RFC proposes creating a new parser that adheres to the two RFCs.  In doing so, PHP, if compiled with mbstring support, would be able to properly support multibyte characters in a URL.
- +
-===== Reasoning ===== +
-The bug described an issue where using ''parse_url()'' with an IPv4 address would correctly parse the host, but with IPv6 it would not.+
  
 +===== Proposal =====
 <file php> <file php>
 <?php <?php
-var_dump(parse_url("127.0.0.1:80", PHP_URL_HOST)); 
-var_dump(parse_url("[::1]:80", PHP_URL_HOST)); 
  
-/* Outputs: +class URL { 
-string(9) "127.0.0.1" +    public function __construct(string $url, string|URL $base); 
-NULL +     
-*/ +    /** 
-</file>+     * $input - The string to be parsed 
 +     * $base - (optional) If $url is relative, this is what it is relative to 
 +     * $encoding_override - (optional) we assume $url is a UTF-8 encoded string, you may change it here 
 +     * $url - (optional) A URL object that should be modified by the parsing of $input.  The return value will be this variable as well 
 +     * $state_override - (optional) begin parsing the $input from a specific state
 +     */ 
 +    static public function parse(string $input[, URL $base[, int $encoding_override[, URL $url[, int $state_override]]]]) : URL; 
 +     
 +    public function getScheme() : ?string; 
 +    public function getUsername() : ?string; 
 +    public function getPassword() : ?string; 
 +    public function getHostname() : ?string; 
 +    public function getPort() : ?int; 
 +    public function getPath() : ?string; 
 +    public function getQuery() : ?string; 
 +    public function getFragment() : ?string; 
 +     
 +    public function getAll() : array; 
 +}
  
-While we may agree that the former line is sensible and maybe expected, the behavior is contrary to how the RFC defines parsing a URI.  To be compliant it should parse as a single PATH element, ''string(12) "127.0.0.1:80"''.  Why?  The RFC defines the ''host'' as a component of the ''authority''.  The authority is only parsed if it's preceded by a double-slash.  Since the above example lacks a double-slash, the ''authority'' portion of the ''hier-part'' should not be processed, and the example would match into the ''path-rootless'' portion. 
- 
-The bug does state that IPv4 addresses and IPv6 addresses are parsed differently (in the sense that the IPv4 parsing isn't standards compliant).  However, according to the RFC, the IPv6 result the user reported in the bug is correct per the spec.  None of the path elements permit a ''['' as the first character of the path, so the IPv6 formatted line should be NULL. 
- 
-An accurate example of standards compliant parsing: 
-<file php> 
-<?php 
-var_dump(parse_url("127.0.0.1:80", PHP_URL_PATH)); 
-var_dump(parse_url("[::1]:80", PHP_URL_PATH)); 
- 
-/* Outputs: 
-string(12) "127.0.0.1:80" 
-NULL 
-*/ 
 </file> </file>
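
To make the getter-based shape of the proposed class concrete, here is a minimal userland sketch.  It is not the RFC's implementation: the class name ''UrlSketch'' is hypothetical, it takes no ''$base'' argument, and it simply wraps the existing ''parse_url()'' rather than a new RFC 3986/3987 parser, so it inherits the current function's quirks.  It assumes PHP 7.4 or later for the typed property.

```php
<?php
// Hypothetical stand-in for the proposed URL class (assumption: wraps parse_url()).
final class UrlSketch
{
    /** @var array<string, int|string> Components found by parse_url(). */
    private array $parts;

    public function __construct(string $url)
    {
        $parts = parse_url($url);
        if ($parts === false) {
            // parse_url() returns false for seriously malformed input.
            throw new InvalidArgumentException("Unparseable URL: $url");
        }
        $this->parts = $parts;
    }

    // Each getter returns null when the component is absent.
    public function getScheme(): ?string   { return $this->parts['scheme'] ?? null; }
    public function getUsername(): ?string { return $this->parts['user'] ?? null; }
    public function getPassword(): ?string { return $this->parts['pass'] ?? null; }
    public function getHostname(): ?string { return $this->parts['host'] ?? null; }
    public function getPort(): ?int        { return $this->parts['port'] ?? null; }
    public function getPath(): ?string     { return $this->parts['path'] ?? null; }
    public function getQuery(): ?string    { return $this->parts['query'] ?? null; }
    public function getFragment(): ?string { return $this->parts['fragment'] ?? null; }

    // All components that were found, keyed as parse_url() keys them.
    public function getAll(): array { return $this->parts; }
}

$url = new UrlSketch('https://user:secret@example.com:8443/docs/index.php?lang=en#top');
var_dump($url->getHostname()); // string(11) "example.com"
var_dump($url->getPort());     // int(8443)
```

Because the object exposes no setters, it is effectively immutable once constructed, which matches the idea described in the introduction.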
- 
-With that in mind, a correct example of parsing URIs to acquire the host portion, per the bug's request, would look similar to the following:
-<file php> 
-<?php 
-var_dump(parse_url("127.0.0.1:80/index.php", PHP_URL_HOST)); 
-var_dump(parse_url("[::1]:80/index.php", PHP_URL_HOST)); 
- 
-var_dump(parse_url("//127.0.0.1:80/index.php", PHP_URL_HOST)); 
-var_dump(parse_url("//[::1]:80/index.php", PHP_URL_HOST)); 
- 
-/* Outputs: 
-NULL 
-NULL 
-string(9) "127.0.0.1" 
-string(5) "[::1]" 
-*/ 
-</file> 
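
The role of the double-slash can be illustrated with the component-splitting regular expression given in RFC 3986, Appendix B.  This is only a sketch: the helper name ''split_uri_reference()'' is hypothetical, and the expression splits a reference into components without validating them, so it is not a substitute for a full parser.

```php
<?php
// Hypothetical helper: split a URI reference using the RFC 3986 Appendix B regex.
function split_uri_reference(string $uri): array
{
    // Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
    $pattern = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';
    preg_match($pattern, $uri, $m, PREG_UNMATCHED_AS_NULL);

    return [
        'scheme'    => $m[2] ?? null,
        'authority' => $m[4] ?? null, // only present after a leading "//"
        'path'      => $m[5] ?? '',
        'query'     => $m[7] ?? null,
        'fragment'  => $m[9] ?? null,
    ];
}

var_dump(split_uri_reference('//127.0.0.1:80/index.php')['authority']); // string(12) "127.0.0.1:80"
var_dump(split_uri_reference('//[::1]:80/index.php')['authority']);     // string(8) "[::1]:80"
var_dump(split_uri_reference('index.php')['authority']);                // NULL
```

Without the leading ''%%//%%'' the authority group never matches, so everything up to a ''?'' or ''#'' falls into the path.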
- 
-===== Proposal ===== 
-The proposal of this RFC is two fold.  One, replace the current parser used for ''parse_url()'' to utilize re2c.  Two, ensure ''parse_url()'' more closely follows the RFC.  The function signature will not change, however, the return value will be more consistent. 
- 
-The function can return 
-  * An array consisting of each component of the URI found. 
-  * A string|int of the component requested by the 2nd argument 
-  * NULL when we can not parse the URI, or, the component request contains no value 
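
For comparison, the following sketch shows the return shapes of the ''parse_url()'' that ships today, including the ''false'' return for seriously malformed input.  It reflects the existing function's documented behavior, not the proposed parser; the example URLs are illustrative only.

```php
<?php
$url = 'http://example.com:8080/index.php?lang=en';

$all      = parse_url($url);                   // array of every component found
$host     = parse_url($url, PHP_URL_HOST);     // string
$port     = parse_url($url, PHP_URL_PORT);     // int
$fragment = parse_url($url, PHP_URL_FRAGMENT); // null: the URL has no fragment
$bad      = parse_url('http:///example.com');  // false: seriously malformed URL

var_dump($host, $port, $fragment, $bad);
```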
  
 ===== Backward Incompatible Changes ===== ===== Backward Incompatible Changes =====
-Many of the tests that were developed for the current implementation of ''parse_url()'' have been changed to reflect a more standards compliant test.  This change will break anyone who is using the function with a non-standards compliant URI format.  This is the most problematic in terms of a BC break.  By this point, many people who use ''parse_url()'' might expect it to work in a, let's say, forgiving manner.  The example provided in the bug report is a perfect example of what I feel is a common use case of this function that will no longer work, since it is not a standards compliant format. +None
- +
-This function will no longer return false.+
  
 ===== RFC Impact ===== ===== RFC Impact =====
 ==== To Existing Extensions ==== ==== To Existing Extensions ====
-standard +standard
  
 ===== Open Issues ===== ===== Open Issues =====
-  * Deprecate ''parse_url()'' and create a new function with new parsing +  * Deprecate ''parse_url()''?  Try and push people into using the new URLParser class. 
-  * Allow for certain breaks in the RFC to provide more lenient parsing? (i.e. allow 'example.com:80' to parse as host & port, not a path) +  * Should ''parse_url()'' have a sunset date of PHP 8 or PHP 9?
  
 ===== Proposed Voting Choices ===== ===== Proposed Voting Choices =====
-Vote to replace ''parse_url()'' with an re2c parser, and require standard compliant URI formats. 
 Requires 2/3 Requires 2/3
  
Line 93: Line 65:
  
 ===== References ===== ===== References =====
-PR with working Implementation: [[https://github.com/php/php-src/pull/2079]] 
rfc/replace_parse_url.1476067715.txt.gz · Last modified: 2017/09/22 13:28 (external edit)