PHP RFC: Replace parse_url()

Introduction

This RFC came out of an attempt to resolve Bug #72811. During that attempt, the discussion shifted from patching the current implementation to replacing it outright. The current implementation of parse_url() does not follow RFC 3986 for most of the components of a URL.

This RFC proposes replacing the current implementation of parse_url() with a re2c-based parser that adheres strictly to RFC 3986 when parsing URIs.

Reasoning

The bug described an issue where using parse_url() with an IPv4 address would correctly parse the host, but with IPv6 it would not.

<?php
var_dump(parse_url("127.0.0.1:80", PHP_URL_HOST));
var_dump(parse_url("[::1]:80", PHP_URL_HOST));
 
/* Current output:
string(9) "127.0.0.1"
NULL
*/

While we may agree that the former line is sensible and perhaps even expected, the behavior is contrary to how RFC 3986 defines parsing a URI. To be compliant, it should parse as a single path element: string(12) "127.0.0.1:80". Why? RFC 3986 defines the host as a component of the authority, and the authority is only parsed if it is preceded by a double slash. Since the above example lacks a double slash, the authority portion of the hier-part should not be processed, and the example would match the path-rootless production instead.

The bug report does state that IPv4 and IPv6 addresses are handled differently (in the sense that the IPv4 parsing isn't standards compliant). However, the result for the IPv6 case reported in the bug is correct per the spec: none of the path productions permit a [ as the first character of the path, so the IPv6-formatted line should return NULL.
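The double-slash rule described above can be sketched in a few lines of plain PHP. This is only an illustration of the rule, not the re2c parser from the proposed patch: sketch_authority() is a hypothetical helper, and it returns the raw authority (userinfo, host, and port together) without validating or splitting it.

```php
<?php
// Hypothetical helper for illustration only; it is not part of the proposed patch.
// Returns the raw authority (userinfo, host and port together), or NULL when
// the input has no "//" introducing an authority.
function sketch_authority(string $uri): ?string
{
    // Drop an optional scheme: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) ":"
    $rest = (string) preg_replace('/^[A-Za-z][A-Za-z0-9+.\-]*:/', '', $uri, 1);

    // Without a leading double slash there is no authority to parse.
    if (strncmp($rest, '//', 2) !== 0) {
        return null;
    }

    // The authority runs up to the next "/", "?" or "#", or to the end of the input.
    $length = strcspn($rest, '/?#', 2);

    return substr($rest, 2, $length);
}

var_dump(sketch_authority("127.0.0.1:80"));             // NULL
var_dump(sketch_authority("//127.0.0.1:80/index.php")); // string(12) "127.0.0.1:80"
var_dump(sketch_authority("//[::1]:80/index.php"));     // string(8) "[::1]:80"
```

Splitting the authority further into host and port is deliberately left out, since that is where the IPv6 bracket handling lives and the sketch only aims to show when an authority exists at all.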

An accurate example of standards-compliant parsing:

<?php
var_dump(parse_url("127.0.0.1:80", PHP_URL_PATH));
var_dump(parse_url("[::1]:80", PHP_URL_PATH));
 
/* Output with standards-compliant parsing:
string(12) "127.0.0.1:80"
NULL
*/

With that in mind, a correct way to parse URIs to obtain the host portion, as the bug requests, would look similar to the following:

<?php
var_dump(parse_url("127.0.0.1:80/index.php", PHP_URL_HOST));
var_dump(parse_url("[::1]:80/index.php", PHP_URL_HOST));
 
var_dump(parse_url("//127.0.0.1:80/index.php", PHP_URL_HOST));
var_dump(parse_url("//[::1]:80/index.php", PHP_URL_HOST));
 
/* Output with standards-compliant parsing:
NULL
NULL
string(9) "127.0.0.1"
string(5) "[::1]"
*/

Proposal

The proposal of this RFC is twofold. First, replace the current parser used by parse_url() with a re2c-based one. Second, ensure parse_url() follows RFC 3986 more closely. The function signature will not change; however, the return value will be more consistent.

The function can return:

  • An array consisting of each component of the URI found.
  • A string|int holding the component requested by the second argument.
  • NULL when the URI cannot be parsed, or when the requested component contains no value.
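A caller that asks for a single component has to deal with the "no value" case listed above. The following is a minimal sketch of one way to do that; url_component() and its $default parameter are hypothetical names introduced for illustration. It also treats false as "no value", on the assumption that released PHP versions may return false for seriously malformed URLs.

```php
<?php
// Hypothetical wrapper for illustration; the name and $default are assumptions.
function url_component(string $url, int $component, $default = null)
{
    $value = parse_url($url, $component);

    // The proposal reports both "cannot parse" and "component absent" as NULL.
    // Released PHP versions may also return false for seriously malformed
    // URLs, so both are treated as "no value" here.
    if ($value === null || $value === false) {
        return $default;
    }

    return $value; // a string for most components, an int for PHP_URL_PORT
}

$url = "http://example.com:8080/index.php";

var_dump(url_component($url, PHP_URL_HOST));      // string(11) "example.com"
var_dump(url_component($url, PHP_URL_PORT));      // int(8080)
var_dump(url_component($url, PHP_URL_QUERY, "")); // string(0) ""
```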

Backward Incompatible Changes

Many of the tests that were developed for the current implementation of parse_url() have been changed to reflect more standards-compliant behavior. This change will break any code that uses the function with a URI format that is not standards compliant. This is the most problematic aspect in terms of a BC break. By this point, many people who use parse_url() might expect it to work in a, let's say, forgiving manner. The example provided in the bug report is a perfect example of what I feel is a common use of this function that will no longer work once parsing is standards compliant.
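Code that feeds bare "host:port" strings to parse_url() could be adapted by adding the double slash itself before parsing. The following is a rough sketch of that idea, not part of the RFC: with_authority_marker() is a hypothetical helper, and it is only suitable when the input is known to be host-based, since something like "mailto:user@example.com" would be prefixed as well.

```php
<?php
// Hypothetical migration helper; the name is an assumption, not part of the RFC.
// Prepends "//" so that a bare "host:port/path" string carries an authority.
function with_authority_marker(string $input): string
{
    // Already a network-path reference such as "//host/path".
    $hasMarker = strncmp($input, '//', 2) === 0;

    // Already a full URI with a scheme followed by "//", such as "http://host/".
    $hasSchemeAndMarker = preg_match('/^[A-Za-z][A-Za-z0-9+.\-]*:\/\//', $input) === 1;

    if ($hasMarker || $hasSchemeAndMarker) {
        return $input;
    }

    return '//' . $input;
}

var_dump(with_authority_marker("127.0.0.1:80"));        // string(14) "//127.0.0.1:80"
var_dump(with_authority_marker("[::1]:80/index.php"));  // string(20) "//[::1]:80/index.php"
var_dump(with_authority_marker("http://example.com/")); // string(19) "http://example.com/"
```

The result can then be passed to parse_url() in place of the original string.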

Proposed PHP Version(s)

PHP 7.2 or later

RFC Impact

To Existing Extensions

standard

Open Issues

Make sure there are no open issues when the vote starts!

Proposed Voting Choices

Vote to replace parse_url() with a re2c parser and require standards-compliant URI formats. Requires a 2/3 majority.

Implementation

After the project is implemented, this section should contain

  1. the version(s) it was merged to
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature

References

PR with working Implementation: https://github.com/php/php-src/pull/2079

rfc/replace_parse_url.1475591950.txt.gz · Last modified: 2017/09/22 13:28 (external edit)