rfc:replace_parse_url

Differences

This shows you the differences between two versions of the page.

rfc:replace_parse_url [2016/10/03 23:06] bp1222rfc:replace_parse_url [2021/03/27 14:57] (current) – Move to inactive ilutov
Line 1: Line 1:
-====== PHP RFC: Replace parse_url() ====== +====== PHP RFC: Create RFC Compliant URL Parser ====== 
-  * Version: 0.1 +  * Version: 0.3 
-  * Date: 2016-10-03+  * Date: 2016-10-04
   * Author: David Walker (dave@mudsite.com)   * Author: David Walker (dave@mudsite.com)
-  * Status: Draft+  * Proposed version: PHP 7.2+ 
 +  * Status: Inactive
   * First Published at: http://wiki.php.net/rfc/replace_parse_url   * First Published at: http://wiki.php.net/rfc/replace_parse_url
  
 ===== Introduction ===== ===== Introduction =====
-This RFC came about from an attempt to resolve [[https://bugs.php.net/bug.php?id=72811|Bug #72811]].  In the attempt, discussion shifted from trying to patch the current implementation of ''parse_url()'' to replacing it with an re2c based parser.  The current implementation of ''parse_url()'' does not respect [[https://tools.ietf.org/html/rfc3986|RFC 3986]] with regard to most of the components of a URL.  The bug in question noted that+This RFC came about from an attempt to resolve [[https://bugs.php.net/bug.php?id=72811|Bug #72811]].  In the attempt, discussion shifted from trying to patch the current implementation of ''parse_url()'' to replacing it more generally.  The discussion then shifted to the inability to remove ''parse_url()'' due to BC issues.  The idea then formed of creating an immutable class that takes a URL and parses it, exposing the pieces through getters.
  
-<file php> +The current implementation of ''parse_url()'' makes a number of exceptions to [[https://tools.ietf.org/html/rfc3986|RFC 3986]].  I do not know whether these are conscious exceptions or whether ''parse_url()'' was never based on the RFC.  After raising this RFC, I was alerted that RFC 3986 is complemented by the [[https://url.spec.whatwg.org|WHATWG]] spec on URLs.  The aim of WHATWG is to combine RFC 3986 and [[https://tools.ietf.org/html/rfc3987|RFC 3987]].  However, WHATWG is a "Living Standard", which makes it subject to change, however frequent.  Although it does some good in combining the two RFCs, a single PHP parser that requires constant maintenance to adhere to an evolving standard is not exactly practical.
-<?php +
-var_dump(parse_url("127.0.0.1:80", PHP_URL_HOST));+
  
-/* Outputs: +So, this RFC proposes creating a new parser that adheres to the two RFC's In doing so, if PHP is compiled with mbstring support, would be able to properly support multibyte characters in a URL.
-string(9) "127.0.0.1" +
-*/ +
-</file>+
  
-While we all may agree that this is sensible, and totally expected, it is actually a lie.  That is not how the RFC defines how that string should be interpreted.  It should parse as a single PATH element ''string(12) "127.0.0.1:80"''.  Why?  Well, the RFC defines the ''hier-part'' of the URI, which contains the host portion, as following a double slash, which the example lacks.  This would result in the ''path-noscheme'' portion of the parsing matching from the first ''1'' and filling the path until a ''?'' or ''#'' is found. +===== Proposal =====
- 
-So an RFC-compliant implementation should parse it as such:
 <file php> <file php>
 <?php <?php
-var_dump(parse_url("127.0.0.1:80", PHP_URL_HOST)); 
-var_dump(parse_url("127.0.0.1:80", PHP_URL_PATH)); 
  
-/* Outputs: +class URL { 
-NULL +    public function  __construct(string $url, string|URL $base); 
-string(12) "127.0.0.1:80" +     
-*/ +    /** 
-</file>+     * $input - The string to be parsed 
 +     * $base - (optional) If $input is relative, this is what it is relative to 
 +     * $encoding_override - (optional) we assume $input is a UTF-8 encoded string, you may change it here 
 +     * $url - (optional) A URL object that should be modified by the parsing of $input.  The return value will be this variable as well 
 +     * $state_override - (optional) begin parsing the $input from a specific state
 +     */ 
 +    static public function parse(string $input[, URL $base[, int $encoding_override[, URL $url[, int $state_override]]]]) : URL; 
 +     
 +    public function getScheme() : ?string; 
 +    public function getUsername() : ?string; 
 +    public function getPassword() : ?string; 
 +    public function getHostname() : ?string; 
 +    public function getPort() : ?int; 
 +    public function getPath() : ?string; 
 +    public function getQuery() : ?string; 
 +    public function getFragment() : ?string; 
 +     
 +    public function getAll() : array; 
 +}
  
-The bug does state that IPv4 and IPv6 addresses are parsed differently (in the sense that the IPv4 parsing isn't standards compliant).  However, according to the RFC, the difference the user reported in the simple case is what the spec prescribes. 
- 
-<file php> 
-<?php 
-var_dump(parse_url("127.0.0.1:80", PHP_URL_PATH)); 
-var_dump(parse_url("[::1]:80", PHP_URL_PATH)); 
- 
-/* Outputs: 
-string(12) "127.0.0.1:80" 
-NULL 
-*/ 
 </file> </file>
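
As a rough illustration of the getter-based, immutable design outlined above, the sketch below wraps the existing ''parse_url()'' in a small class.  ''UrlSketch'' is a hypothetical name, and delegating to ''parse_url()'' is an assumption made purely to keep the example runnable.  It is not the RFC compliant parser this RFC proposes, and it leaves out ''parse()'', ''$base'' handling and the encoding override.

<file php>
<?php

// Hypothetical stand-in, NOT the proposed implementation:
// it simply delegates to the existing parse_url().
final class UrlSketch
{
    /** @var array components as returned by parse_url() */
    private $parts;

    public function __construct(string $url)
    {
        $parts = parse_url($url);
        if ($parts === false) {
            throw new InvalidArgumentException("Unparseable URL: $url");
        }
        $this->parts = $parts;
    }

    // No setters: once constructed, the object never changes.
    public function getScheme(): ?string   { return $this->parts['scheme'] ?? null; }
    public function getUsername(): ?string { return $this->parts['user'] ?? null; }
    public function getPassword(): ?string { return $this->parts['pass'] ?? null; }
    public function getHostname(): ?string { return $this->parts['host'] ?? null; }
    public function getPort(): ?int        { return $this->parts['port'] ?? null; }
    public function getPath(): ?string     { return $this->parts['path'] ?? null; }
    public function getQuery(): ?string    { return $this->parts['query'] ?? null; }
    public function getFragment(): ?string { return $this->parts['fragment'] ?? null; }

    public function getAll(): array { return $this->parts; }
}

// example.com is a placeholder host.
$url = new UrlSketch('https://example.com:8080/index.php?x=1');
var_dump($url->getHostname(), $url->getPort(), $url->getFragment());
</file>

Components that are absent from the input come back as ''null'', which is what the nullable return types in the proposed signatures suggest.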
- 
-This is due to the rules of the path, which allow a path to begin with a ''/'' or an alphanumeric character.  Paths cannot begin with a ''['', and so the IPv6-formatted URI fails.  But with the above, we are basing our expected output on a format that follows the standard poorly.  A proper look at getting the correct HOST output for both an IPv4 and an IPv6 example follows:
- 
-<file php> 
-<?php 
-var_dump(parse_url("127.0.0.1:80/index.php", PHP_URL_HOST)); 
-var_dump(parse_url("[::1]:80/index.php", PHP_URL_HOST)); 
- 
-var_dump(parse_url("//127.0.0.1:80/index.php", PHP_URL_HOST)); 
-var_dump(parse_url("//[::1]:80/index.php", PHP_URL_HOST)); 
- 
-/* Outputs: 
-NULL 
-NULL 
-string(9) "127.0.0.1" 
-string(5) "[::1]" 
-*/ 
-</file> 
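
The role of the leading ''//'' can be checked with the component-splitting regular expression given in RFC 3986, Appendix B.  The sketch below is only an illustration: ''authorityOf()'' is a hypothetical helper, the regex is non-validating, and it yields the whole authority (host plus port) rather than the host alone.  It is neither what ''parse_url()'' does today nor the parser discussed in this RFC.

<file php>
<?php

// RFC 3986, Appendix B: splits a URI reference into its components.
// Group 4 is the authority, which is only captured after a leading "//".
const RFC3986_SPLIT = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';

// Hypothetical helper: returns the authority component, or null if absent.
function authorityOf(string $ref): ?string
{
    preg_match(RFC3986_SPLIT, $ref, $m, PREG_UNMATCHED_AS_NULL);
    return $m[4] ?? null;
}

var_dump(authorityOf('127.0.0.1:80/index.php'));
var_dump(authorityOf('[::1]:80/index.php'));
var_dump(authorityOf('//127.0.0.1:80/index.php'));
var_dump(authorityOf('//[::1]:80/index.php'));
</file>

Only the two inputs that start with ''//'' produce an authority; the other two produce ''null'', in line with the HOST results listed above.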
- 
-===== Proposal ===== 
-The proposal of this RFC is twofold.  One, replace the current parser used for ''parse_url()'' with one that utilizes re2c.  Two, ensure ''parse_url()'' more closely follows the RFC.  The function signature will not change; however, the return value will be more consistent. 
- 
-The function can return 
-  * An array consisting of each component of the URI found. 
-  * A string|int of the component requested by the 2nd argument 
-  * NULL when we cannot parse the URI, or the requested component contains no value 
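
For comparison, the existing ''parse_url()'' already produces these three shapes for a well-formed absolute URL, as the short sketch below shows (''example.com'' is a placeholder).  One difference from the list above: today the function returns ''false'', not NULL, for seriously malformed URLs.

<file php>
<?php

$url = 'http://example.com:8080/index.php?q=1';

// No component argument: an array of every component found.
$all = parse_url($url);

// Component argument: a string, or an int for the port.
$port = parse_url($url, PHP_URL_PORT);

// Requested component is absent from the URL: NULL.
$fragment = parse_url($url, PHP_URL_FRAGMENT);

var_dump(is_array($all), $port, $fragment);
</file>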
  
 ===== Backward Incompatible Changes ===== ===== Backward Incompatible Changes =====
-Many of the tests that were developed for the current implementation of ''parse_url()'' have been changed to reflect more standards compliant behaviour.  This change will break anyone who is using the function with a non-standards compliant URI format.  This is the most problematic in terms of a BC break.  By this point, many people who use ''parse_url()'' might expect it to work in a, let's say, forgiving manner.  The example provided in the bug report is a perfect example of what I feel is a common use case of this function that will no longer work once parsing is standards compliant. +None
- +
-===== Proposed PHP Version(s) ===== +
-PHP 7.2, or later+
  
 ===== RFC Impact ===== ===== RFC Impact =====
 ==== To Existing Extensions ==== ==== To Existing Extensions ====
-standard +standard
  
 ===== Open Issues ===== ===== Open Issues =====
-Make sure there are no open issues when the vote starts!+  * Deprecate ''parse_url()''?  Try to push people into using the new ''URL'' class. 
 +  * Should ''parse_url()'' have a sunset date of PHP 8 or PHP 9?
  
 ===== Proposed Voting Choices ===== ===== Proposed Voting Choices =====
-Vote to replace ''parse_url()'' with an re2c parser, and require standard compliant URI formats. 
 Requires 2/3 Requires 2/3
  
Line 95: Line 65:
  
 ===== References ===== ===== References =====
-PR with working Implementation: [[https://github.com/php/php-src/pull/2079]] 
rfc/replace_parse_url.1475536006.txt.gz · Last modified: 2017/09/22 13:28 (external edit)