This is an old revision of the document!


PHP RFC: Create URLParser and URLBuilder Classes

Introduction

This RFC came out of an attempt to resolve Bug #72811. During that attempt, discussion shifted from patching the current implementation of parse_url() to replacing it altogether, and then to the fact that parse_url() cannot be removed because of BC issues. From there, the idea formed of creating an immutable class that takes a URL, parses it, and exposes the pieces through getters.

The current implementation of parse_url() deviates from RFC 3986 in a number of ways. I do not know whether these deviations are deliberate or whether parse_url() was simply never written to follow the RFC. After raising this RFC, I was alerted that RFC 3986 is itself generally superseded by the WHATWG URL specification, which is a more practical description of how URLs exist in the real world.

So, this RFC proposes creating two new classes, URLParser and URLBuilder. The former will be an immutable class that is constructed with the URL to be parsed. It will have methods to access each piece of the URL, as well as a general getter that accepts a string of flags and returns the requested portions in an array. Its complement will be URLBuilder, which will expose methods to set, or add, pieces of a URL, and a method to get the built value.
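To make the shape of the two classes more concrete, here is a minimal sketch. It is only an illustration under stated assumptions: the class names (UrlParserSketch, UrlBuilderSketch) and every method name are hypothetical, the parser simply delegates to the existing parse_url(), the flag-based general getter is left out, and the typed properties assume PHP 7.4 or later.

```php
<?php
// Hypothetical sketch: names are illustrative, not the API this RFC will define.
final class UrlParserSketch
{
    /** @var array<string, int|string> parsed components, never modified after construction */
    private array $parts;

    public function __construct(string $url)
    {
        // Delegates to parse_url() purely for illustration.
        $parts = parse_url($url);
        if ($parts === false) {
            throw new InvalidArgumentException("Unparseable URL: $url");
        }
        $this->parts = $parts;
    }

    // Getters only: the object exposes no way to change its state.
    public function getHost(): ?string { return $this->parts['host'] ?? null; }
    public function getPort(): ?int    { return $this->parts['port'] ?? null; }
    public function getPath(): ?string { return $this->parts['path'] ?? null; }
}

final class UrlBuilderSketch
{
    private string $scheme = '';
    private string $host = '';
    private ?int $port = null;
    private string $path = '';

    public function setScheme(string $scheme): self { $this->scheme = $scheme; return $this; }
    public function setHost(string $host): self     { $this->host = $host; return $this; }
    public function setPort(int $port): self        { $this->port = $port; return $this; }
    public function setPath(string $path): self     { $this->path = $path; return $this; }

    public function build(): string
    {
        $url = $this->scheme !== '' ? $this->scheme . ':' : '';
        if ($this->host !== '') {
            // The authority is always introduced by a double slash.
            $url .= '//' . $this->host;
            if ($this->port !== null) {
                $url .= ':' . $this->port;
            }
        }
        return $url . $this->path;
    }
}

$url = (new UrlBuilderSketch())
    ->setScheme('http')
    ->setHost('[::1]')
    ->setPort(80)
    ->setPath('/index.php')
    ->build();

$parser = new UrlParserSketch($url);
var_dump($url, $parser->getHost(), $parser->getPort(), $parser->getPath());
```

The builder is mutable and fluent while the parser is read-only, which mirrors the split described above. A real implementation would not depend on parse_url().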

Reasoning

The bug describes an issue where parse_url() correctly parses the host from an IPv4 address, but not from an IPv6 address.

<?php
var_dump(parse_url("127.0.0.1:80", PHP_URL_HOST));
var_dump(parse_url("[::1]:80", PHP_URL_HOST));
 
/* Outputs:
string(9) "127.0.0.1"
NULL
*/

While we may agree that the first result is sensible, and maybe even expected, the behavior is contrary to how the RFC defines parsing a URI. To be compliant, the input should parse as a single PATH element, string(12) "127.0.0.1:80". Why? The RFC defines the host as a component of the authority, and the authority is only parsed if it is preceded by a double slash. Since the above example lacks a double slash, the authority portion of the hier-part should not be processed, and the example would instead match the path-rootless production.

The bug report does state that IPv4 and IPv6 addresses are parsed differently (in the sense that the IPv4 parsing isn't standards compliant). However, according to the RFC, the result the user reported for the IPv6 case is correct. None of the path productions permit a [ as the first character of the path, so the IPv6-formatted line should return NULL.
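The double-slash rule can be sketched as a small check. The helper name hasAuthority() is hypothetical and is not part of PHP or of the patch for this RFC. It only reports whether an authority component is present and does not validate the rest of the input.

```php
<?php
// Hypothetical helper: reports whether a URI reference carries an
// authority component under the RFC 3986 grammar.
function hasAuthority(string $uri): bool
{
    // Strip an optional scheme: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) ":"
    // "127.0.0.1:" is not a scheme because a scheme must start with a letter.
    $rest = preg_replace('/^[A-Za-z][A-Za-z0-9+.\-]*:/', '', $uri, 1);

    // The authority (and therefore the host) exists only after a leading "//".
    return strncmp($rest, '//', 2) === 0;
}

var_dump(hasAuthority('127.0.0.1:80'));     // bool(false)
var_dump(hasAuthority('[::1]:80'));         // bool(false)
var_dump(hasAuthority('//127.0.0.1:80'));   // bool(true)
var_dump(hasAuthority('http://[::1]:80/')); // bool(true)
```

Only the inputs that return true have a host to report. For the others, a compliant parser has nothing to put in the host component.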

An accurate example of standards compliant parsing:

<?php
var_dump(parse_url("127.0.0.1:80", PHP_URL_PATH));
var_dump(parse_url("[::1]:80", PHP_URL_PATH));
 
/* Outputs:
string(12) "127.0.0.1:80"
NULL
*/

With that in mind, a correct example of parsing URIs to acquire the host portion, per the bug's request, would look similar to the following:

<?php
var_dump(parse_url("127.0.0.1:80/index.php", PHP_URL_HOST));
var_dump(parse_url("[::1]:80/index.php", PHP_URL_HOST));
 
var_dump(parse_url("//127.0.0.1:80/index.php", PHP_URL_HOST));
var_dump(parse_url("//[::1]:80/index.php", PHP_URL_HOST));
 
/* Outputs:
NULL
NULL
string(9) "127.0.0.1"
string(5) "[::1]"
*/

Proposal

The proposal of this RFC is twofold. One, replace the current parser used by parse_url() with one built with re2c. Two, ensure parse_url() more closely follows RFC 3986. The function signature will not change; however, the return value will be more consistent.

The function can return:

  • An array consisting of each component of the URI found.
  • A string|int holding the component requested by the 2nd argument.
  • NULL when the URI cannot be parsed, or when the requested component contains no value.
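As an illustration of how calling code might handle these return values, here is a small sketch. The helper name componentOrDefault() is hypothetical. It treats both NULL and false as "no value", so it behaves the same with current PHP releases (which can return false for seriously malformed input) and with the behavior proposed here.

```php
<?php
// Hypothetical helper: fetch one URI component, falling back to a default.
function componentOrDefault(string $uri, int $component, $default)
{
    $value = parse_url($uri, $component);

    // null:  the component is absent (and, under this proposal, unparseable input).
    // false: how current PHP releases report seriously malformed input.
    if ($value === null || $value === false) {
        return $default;
    }

    return $value; // string, or int for PHP_URL_PORT
}

$port = componentOrDefault('//127.0.0.1:80/index.php', PHP_URL_PORT, 0);
$host = componentOrDefault('/index.php', PHP_URL_HOST, 'localhost');

var_dump($port); // int(80)
var_dump($host); // string(9) "localhost"
```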

Backward Incompatible Changes

Many of the tests that were developed for the current implementation of parse_url() have been changed to reflect more standards compliant behavior. This change will break anyone who is using the function with a URI format that is not standards compliant, and it is the most problematic part of the proposal in terms of BC. By this point, many people who use parse_url() might expect it to work in a, let's say, forgiving manner. The example provided in the bug report is a perfect example of what I feel is a common use of this function that will no longer work as before, because the input is not standards compliant.

This function will no longer return false.
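Callers that currently pass bare host:port strings could adapt by adding the double slash themselves before parsing. The sketch below shows one way to do that. The helper name withAuthorityMarker() is hypothetical, and it assumes the input is already known to be of the form host[:port][/path]. It would mangle inputs such as mailto: URIs, so it is not a general-purpose fix.

```php
<?php
// Hypothetical migration helper: make the authority explicit so that a
// standards compliant parser treats the leading part as host and port.
function withAuthorityMarker(string $input): string
{
    // Leave the input alone if it already starts with "//" or "scheme://".
    if (preg_match('~^(?:[A-Za-z][A-Za-z0-9+.\-]*:)?//~', $input) === 1) {
        return $input;
    }

    return '//' . $input;
}

var_dump(withAuthorityMarker('127.0.0.1:80'));         // string(14) "//127.0.0.1:80"
var_dump(withAuthorityMarker('[::1]:80/index.php'));   // string(20) "//[::1]:80/index.php"
var_dump(withAuthorityMarker('http://example.com/'));  // string(19) "http://example.com/"

// The normalized form is what gets handed to parse_url().
var_dump(parse_url(withAuthorityMarker('127.0.0.1:80'), PHP_URL_HOST));
```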

RFC Impact

To Existing Extensions

standard

Open Issues

  • Deprecate parse_url() and create a new function with new parsing
  • Allow for certain breaks from the RFC to provide more lenient parsing? (e.g. allow 'example.com:80' to parse as a host & port, not a path)

Proposed Voting Choices

Vote on whether to replace the parse_url() parser with an re2c-based parser and require standards compliant URI formats. Requires a 2/3 majority.

Implementation

After the project is implemented, this section should contain

  1. the version(s) it was merged to
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature

References

PR with working Implementation: https://github.com/php/php-src/pull/2079

rfc/replace_parse_url.1476067715.txt.gz · Last modified: 2017/09/22 13:28 (external edit)