
PHP RFC: Replace parse_url()

Introduction

This RFC came about from an attempt to resolve Bug #72811. During that attempt, discussion shifted from patching the current implementation of parse_url() to replacing it with an re2c-based parser. The current implementation of parse_url() does not respect RFC 3986 for most of the components of a URL. The bug in question noted the following:

<?php
var_dump(parse_url("127.0.0.1:80", PHP_URL_HOST));
 
/* Outputs:
string(9) "127.0.0.1"
*/

While we may all agree that this is sensible and entirely expected, it is not what the RFC prescribes. The string should parse as a single path component, string(12) "127.0.0.1:80". Why? The RFC places the authority, which contains the host, inside the hier-part, and only after a double slash, which the example lacks. As a result, the path-noscheme rule begins matching at the 1 and fills the path until a ? or # is found.

So a standards-compliant implementation should parse it as follows:

<?php
var_dump(parse_url("127.0.0.1:80", PHP_URL_HOST));
var_dump(parse_url("127.0.0.1:80", PHP_URL_PATH));
 
/* Outputs:
NULL
string(12) "127.0.0.1:80"
*/
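The reasoning above can be sketched in plain PHP. The helper name sketch_split_uri() is hypothetical and is not part of the proposed patch, which uses re2c. It is a deliberately simplified splitter: it only recognizes an authority after a double slash and does not validate the characters inside any component.

```php
<?php
// Hypothetical helper, for illustration only: a simplified RFC 3986 style split.
function sketch_split_uri(string $uri): array
{
    $parts = ['scheme' => null, 'authority' => null, 'path' => '', 'query' => null, 'fragment' => null];

    // The fragment is everything after the first "#".
    $pos = strpos($uri, '#');
    if ($pos !== false) {
        $parts['fragment'] = (string) substr($uri, $pos + 1);
        $uri = (string) substr($uri, 0, $pos);
    }

    // The query is everything after the first "?" that remains.
    $pos = strpos($uri, '?');
    if ($pos !== false) {
        $parts['query'] = (string) substr($uri, $pos + 1);
        $uri = (string) substr($uri, 0, $pos);
    }

    // A scheme must start with a letter, so "127.0.0.1:" is not a scheme.
    if (preg_match('/^([A-Za-z][A-Za-z0-9+.\-]*):/', $uri, $m) === 1) {
        $parts['scheme'] = $m[1];
        $uri = (string) substr($uri, strlen($m[0]));
    }

    // An authority (and therefore a host) exists only after "//".
    if (strncmp($uri, '//', 2) === 0) {
        $end = strpos($uri, '/', 2);
        if ($end === false) {
            $end = strlen($uri);
        }
        $parts['authority'] = (string) substr($uri, 2, $end - 2);
        $uri = (string) substr($uri, $end);
    }

    // Whatever is left is the path.
    $parts['path'] = $uri;

    return $parts;
}

var_dump(sketch_split_uri("127.0.0.1:80")['authority']);
var_dump(sketch_split_uri("127.0.0.1:80")['path']);

/* Outputs:
NULL
string(12) "127.0.0.1:80"
*/
```

With a leading double slash, as in "//127.0.0.1:80/index.php", the same sketch reports an authority of "127.0.0.1:80" and a path of "/index.php".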

The bug report does note that IPv4 and IPv6 addresses are parsed differently (in the sense that the IPv4 parsing isn't standards compliant). However, in the simple case the user reported, that difference is exactly what the RFC prescribes.

<?php
var_dump(parse_url("127.0.0.1:80", PHP_URL_PATH));
var_dump(parse_url("[::1]:80", PHP_URL_PATH));
 
/* Outputs:
string(12) "127.0.0.1:80"
NULL
*/

This is due to the rules of the path, which allow a path to begin with a / or an alphanumeric character. Paths cannot begin with a [, and so the IPv6-formatted URI fails. But with the above, we are basing our expectations on a non-standard input format. A proper look at the output for both an IPv4 and an IPv6 example follows:

<?php
var_dump(parse_url("127.0.0.1:80/index.php", PHP_URL_PATH));
var_dump(parse_url("[::1]:80/index.php", PHP_URL_PATH));
 
var_dump(parse_url("//127.0.0.1:80/index.php", PHP_URL_PATH));
var_dump(parse_url("//[::1]:80/index.php", PHP_URL_PATH));
 
/* Outputs:
string(22) "127.0.0.1:80/index.php"
NULL
string(10) "/index.php"
string(10) "/index.php"
 
*/
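The path rule that rejects the bracketed form can be sketched as a character check. The helper name sketch_is_valid_path() is hypothetical. It assumes the pchar set from RFC 3986 (unreserved characters, percent-encoded octets, sub-delims, ":" and "@") plus "/" as the segment separator, and it is simplified: it does not apply the additional per-rule restrictions on the first segment.

```php
<?php
// Hypothetical helper: checks that every character is allowed in a path.
// "[" and "]" are gen-delims, so they are not in the allowed set.
function sketch_is_valid_path(string $path): bool
{
    $pattern = '#^(?:[A-Za-z0-9\-._~!$&\'()*+,;=:@/]|%[0-9A-Fa-f]{2})*$#D';

    return preg_match($pattern, $path) === 1;
}

var_dump(sketch_is_valid_path("127.0.0.1:80/index.php"));
var_dump(sketch_is_valid_path("[::1]:80/index.php"));

/* Outputs:
bool(true)
bool(false)
*/
```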

Proposal

The proposal of this RFC is twofold. One, replace the current parser used by parse_url() with one generated by re2c. Two, ensure parse_url() follows RFC 3986 more closely. The function signature will not change; however, the return value will be more consistent.

The function can return

  • An array consisting of each component of the URI found.
  • A string|int holding the component requested by the second argument
  • NULL when the URI cannot be parsed, or when the requested component contains no value
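A caller that wants to handle these return shapes uniformly might wrap the call as below. This is only a sketch: the helper name sketch_component_or_default() is hypothetical, and it also checks for false because the current implementation returns false for seriously malformed URLs.

```php
<?php
// Hypothetical helper: returns the requested component, or $default when
// the component is missing (NULL) or the URL could not be parsed.
function sketch_component_or_default(string $url, int $component, $default = null)
{
    $value = parse_url($url, $component);

    if ($value === null || $value === false) {
        return $default;
    }

    return $value; // string for most components, int for PHP_URL_PORT
}

var_dump(sketch_component_or_default("http://example.net:8080/index.php", PHP_URL_HOST));
var_dump(sketch_component_or_default("http://example.net:8080/index.php", PHP_URL_PORT));
var_dump(sketch_component_or_default("/index.php", PHP_URL_HOST, "localhost"));

/* Outputs:
string(11) "example.net"
int(8080)
string(9) "localhost"
*/
```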

Discussion Points

RFC Break

I make a single exception and break with the RFC in one place. The RFC does not permit curly braces within a query component. Take, for instance, http://example.net/index.php?q={fullname}, where the RFC would define the query as being q=. I don't feel this is accurate, as { and } are not special markers within a URI and should otherwise be treated as part of the string.
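Under this exception, the braces are expected to stay in the query component. The example below is a sketch of that expectation; the output shown is what the current implementation already returns for this input, and the proposal is assumed to preserve it.

```php
<?php
// The braces are kept as ordinary characters of the query component.
var_dump(parse_url("http://example.net/index.php?q={fullname}", PHP_URL_QUERY));

/* Outputs:
string(12) "q={fullname}"
*/
```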

Backward Incompatible Changes

Many of the tests that were developed for the current implementation of parse_url() have been changed to reflect more standards-compliant behavior. This change will break code that passes the function a URI in a non-standard format, which is the most problematic part of the BC break. By this point, many people who use parse_url() might expect it to work in a, let's say, forgiving manner. The example provided in the bug report is a perfect example of what I feel is a common use case of this function, and one that will no longer behave as those users expect once the parser is standards compliant.
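Code that relies on the forgiving behavior for bare host:port strings could adapt by adding the double slash itself, so that the input carries an authority. This is a sketch of one possible migration, not something the patch provides; the helper name sketch_split_host_port() is hypothetical.

```php
<?php
// Hypothetical helper: prefixes "//" so the host and port are parsed as an
// authority rather than as a path.
function sketch_split_host_port(string $hostPort): array
{
    $parts = parse_url('//' . $hostPort);

    if (!is_array($parts)) {
        // parse_url() could not parse the input at all.
        return ['host' => null, 'port' => null];
    }

    return [
        'host' => $parts['host'] ?? null,
        'port' => $parts['port'] ?? null,
    ];
}

var_dump(sketch_split_host_port("127.0.0.1:80"));

/* Outputs:
array(2) {
  ["host"]=>
  string(9) "127.0.0.1"
  ["port"]=>
  int(80)
}
*/
```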

Proposed PHP Version(s)

PHP 7.2, or later

RFC Impact

To Existing Extensions

standard

Open Issues

Make sure there are no open issues when the vote starts!

Unaffected PHP Functionality

List existing areas/features of PHP that will not be changed by the RFC.

This helps avoid any ambiguity, shows that you have thought deeply about the RFC's impact, and helps reduce mailing list noise.

Proposed Voting Choices

Vote to replace parse_url() with an re2c parser and require standards-compliant URI formats. Requires a 2/3 majority.

Implementation

After the project is implemented, this section should contain

  1. the version(s) it was merged to
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature

References

PR with working Implementation: https://github.com/php/php-src/pull/2079

rfc/replace_parse_url.1475535417.txt.gz · Last modified: 2017/09/22 13:28 (external edit)