PHP RFC: Add RFC3986 and WHATWG compliant URI parsing support
- Version: 1.0
- Date: 2024-06-11
- Author: Máté Kocsis, kocsismate@php.net
- Status: Under Discussion
- First Published at: https://wiki.php.net/rfc/url_parsing_api
- Implementation: https://github.com/php/php-src/pull/14461
Introduction
URIs and URLs are among the most fundamental concepts of the web because they make it possible to reference specific resources on a network. URLs were originally defined by Tim Berners-Lee in RFC 1738, but other specifications have emerged since then, of which RFC 3986 and the WHATWG URL specification are the most notable. The former updates the original RFC 1738 and defines URIs, while the latter specifies how browsers should treat URLs.
Despite their ubiquity, URLs and URIs are not as unambiguous as people may think, because different clients treat and parse them differently: some follow one of the standards, while others, even worse, follow none at all. Unfortunately, PHP falls into the latter category: the parse_url() function is offered for parsing URLs, but it isn't compliant with any standard. Even the PHP manual contains the following warning:
This function may not give correct results for relative or invalid URLs, and the results may not even match common behavior of HTTP clients. ...
Incompatibility with current standards is a serious issue, as it hinders interoperability with other tools (e.g. HTTP clients), and it can result in bugs which are difficult to notice. For example, cURL's URL parsing implementation is based on RFC 3986, which is why URLs validated by FILTER_VALIDATE_URL are not necessarily accepted when passed to cURL.
Proposal
In order to address the above-mentioned problems, a new, always available URI extension is to be added to the standard library. The extension would support parsing, validating, and recomposing (converting the parsed structures back to strings) URIs based on both the RFC 3986 and the WHATWG URL specifications. For this purpose, the following internal classes and methods are added:
namespace Uri;

readonly abstract class Uri implements Stringable
{
    private ?string $scheme;
    private ?string $user;
    private ?string $password;
    private ?string $host;
    private ?int $port;
    private ?string $path;
    private ?string $query;
    private ?string $fragment;

    public static function fromRfc3986(string $uri, ?string $baseUrl = null): ?static {}

    /** @param array<int, WhatWgError> $errors */
    public static function fromWhatWg(string $uri, ?string $baseUrl = null, &$errors = null): ?static {}

    public function getScheme(): ?string {}
    public function getUser(): ?string {}
    public function getPassword(): ?string {}
    public function getHost(): ?string {}
    public function getPort(): ?int {}
    public function getPath(): ?string {}
    public function getQuery(): ?string {}
    public function getFragment(): ?string {}
    public function __toString(): string {}
}

readonly class Rfc3986Uri extends Uri
{
    public function __construct(string $uri, ?string $baseUrl = null) {}
}

readonly class WhatWgUri extends Uri
{
    /** @param array<int, WhatWgError> $errors */
    public function __construct(string $uri, ?string $baseUrl = null, &$errors = null) {}
}

final readonly class WhatWgError
{
    public const int ERROR_TYPE_DOMAIN_TO_ASCII = UNKNOWN;
    public const int ERROR_TYPE_DOMAIN_TO_UNICODE = UNKNOWN;
    public const int ERROR_TYPE_DOMAIN_INVALID_CODE_POINT = UNKNOWN;
    public const int ERROR_TYPE_HOST_INVALID_CODE_POINT = UNKNOWN;
    public const int ERROR_TYPE_IPV4_EMPTY_PART = UNKNOWN;
    public const int ERROR_TYPE_IPV4_TOO_MANY_PARTS = UNKNOWN;
    public const int ERROR_TYPE_IPV4_NON_NUMERIC_PART = UNKNOWN;
    public const int ERROR_TYPE_IPV4_NON_DECIMAL_PART = UNKNOWN;
    public const int ERROR_TYPE_IPV4_OUT_OF_RANGE_PART = UNKNOWN;
    public const int ERROR_TYPE_IPV6_UNCLOSED = UNKNOWN;
    public const int ERROR_TYPE_IPV6_INVALID_COMPRESSION = UNKNOWN;
    public const int ERROR_TYPE_IPV6_TOO_MANY_PIECES = UNKNOWN;
    public const int ERROR_TYPE_IPV6_MULTIPLE_COMPRESSION = UNKNOWN;
    public const int ERROR_TYPE_IPV6_INVALID_CODE_POINT = UNKNOWN;
    public const int ERROR_TYPE_IPV6_TOO_FEW_PIECES = UNKNOWN;
    public const int ERROR_TYPE_IPV4_IN_IPV6_TOO_MANY_PIECES = UNKNOWN;
    public const int ERROR_TYPE_IPV4_IN_IPV6_INVALID_CODE_POINT = UNKNOWN;
    public const int ERROR_TYPE_IPV4_IN_IPV6_OUT_OF_RANGE_PART = UNKNOWN;
    public const int ERROR_TYPE_IPV4_IN_IPV6_TOO_FEW_PARTS = UNKNOWN;
    public const int ERROR_TYPE_INVALID_URL_UNIT = UNKNOWN;
    public const int ERROR_TYPE_SPECIAL_SCHEME_MISSING_FOLLOWING_SOLIDUS = UNKNOWN;
    public const int ERROR_TYPE_MISSING_SCHEME_NON_RELATIVE_URL = UNKNOWN;
    public const int ERROR_TYPE_INVALID_REVERSE_SOLIDUS = UNKNOWN;
    public const int ERROR_TYPE_INVALID_CREDENTIALS = UNKNOWN;
    public const int ERROR_TYPE_HOST_MISSING = UNKNOWN;
    public const int ERROR_TYPE_PORT_OUT_OF_RANGE = UNKNOWN;
    public const int ERROR_TYPE_PORT_INVALID = UNKNOWN;
    public const int ERROR_TYPE_FILE_INVALID_WINDOWS_DRIVE_LETTER = UNKNOWN;
    public const int ERROR_TYPE_FILE_INVALID_WINDOWS_DRIVE_LETTER_HOST = UNKNOWN;

    public string $position;
    public int $errorCode;

    public function __construct(string $position, int $errorCode) {}
}
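To make the intended usage of the stubs more concrete, the following is a minimal sketch that relies only on the signatures listed above. The helper names describeRfc3986() and collectWhatWgErrorCodes() are hypothetical, and the sketch assumes that a null return value signals input that could not be parsed (the stubs only declare a nullable return type). Because the proposed classes are not part of any released PHP version, both helpers fall back gracefully when Uri\Uri is missing.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: summarizes a URI parsed according to RFC 3986.
function describeRfc3986(string $input): string
{
    // The classes proposed in this RFC may not exist in the running PHP build.
    if (!class_exists(\Uri\Uri::class)) {
        return "Uri extension not available";
    }

    // Assumption: null means the input was rejected by the parser.
    $uri = \Uri\Uri::fromRfc3986($input);
    if ($uri === null) {
        return "not parseable as an RFC 3986 URI";
    }

    return sprintf(
        "scheme=%s host=%s port=%s path=%s",
        $uri->getScheme() ?? "-",
        $uri->getHost() ?? "-",
        $uri->getPort() ?? "-",
        $uri->getPath() ?? "-"
    );
}

// Hypothetical helper: returns the WhatWgError codes reported for an input.
function collectWhatWgErrorCodes(string $input): array
{
    if (!class_exists(\Uri\Uri::class)) {
        return [];
    }

    $errors = null; // filled by reference, per the stub signature
    \Uri\Uri::fromWhatWg($input, null, $errors);

    $codes = [];
    foreach ($errors ?? [] as $error) {
        $codes[] = $error->errorCode;
    }
    return $codes;
}

echo describeRfc3986("https://example.com:8080/path?x=1"), "\n";
```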
Parser Library Choice
Adding a WHATWG compliant URL parser to the standard library was originally attempted in 2023. The implementation used the ADA URL parser as its backend, which is known for its high performance. In the end, the proof of concept was abandoned due to technical limitations that could not be resolved.
Specifically, ADA is written in C++ and requires a compiler supporting at least C++17. Even though it has a C wrapper, its strict compiler requirements would make it unprecedented, and practically impossible, to add the URI extension to PHP as a required extension: PHP has never had a C++ compiler dependency for the always enabled extensions, and only optional extensions (like Intl) can be written in C++.
The firm position of this RFC is that a URL parser extension should always be available, therefore a different parser backend written in pure C had to be found. Fortunately, Niels Dossche proposed the PHP RFC: DOM HTML5 parsing and serialization not long after the experiment with ADA, and his work required bundling parts of the Lexbor browser engine. This library is written in C and happens to contain a WHATWG compliant URL parsing submodule, which makes it a suitable library of choice.
For parsing URIs according to RFC 3986, the URIParser library was chosen. It is a lightweight and relatively fast library with no dependencies. It uses the “new BSD license” which is compatible with the current PHP license as well as the PHP License Update RFC.
Pluggability
The capability provided by parse_url() is used for multiple purposes in the internal PHP source:
- SoapClient::__doRequest(): parsing the $location parameter as well as the value of the Location header
- FTP/FTPS stream wrapper: parse_url() is used for connecting to a URL, renaming a file, following the Location header
- FILTER_VALIDATE_URL: validating URLs
- SSL/TLS socket communication: parsing the target URL
- GET/POST session: accepting the session ID from the query string, manipulating the output URL to automatically include the session ID (Deprecate GET/POST sessions RFC)
It would cause an inconsistency if parsing of URIs based on the two specifications referred to above was supported in userland, but the legacy parse_url() based behavior was kept internally without the possibility to use the new API. That's why the current RFC was designed with pluggability in mind.
Specifically, supported parser backends would have to be registered in a way similar to how password hashing algorithms are registered. On the one hand, this approach would make it possible for 3rd party extensions to leverage URI parser backends other than the built-in ones (e.g. support for ADA URL could also be added). But more importantly, an internal “interface” for parsing and handling URIs is defined this way, so that it becomes possible to configure how URIs should be parsed throughout PHP's codebase.
For this purpose, a uri.default_handler PHP INI option is added with the following valid values:
- rfc3986: URL parsing based on the RFC 3986 standard via the uriparser library
- whatwg: URL parsing based on the WHATWG specification via the Lexbor library
- parse_url: URL parsing based on the legacy parse_url() based method (default value)
ini_set("uri.default_handler", "rfc3986");
filter_var("https://example.com", FILTER_VALIDATE_URL);
The above piece of code validates https://example.com based on the RFC 3986 standard, while the below piece of code validates it based on the WHATWG specification:
ini_set("uri.default_handler", "whatwg");
filter_var("https://example.com", FILTER_VALIDATE_URL);
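The registration mechanism itself lives inside the engine and is not spelled out in this RFC, so the following is only a userland analogy of the idea: named handlers are registered once, and each feature looks up the configured handler instead of calling parse_url() directly. The UriHandler interface, the LegacyParseUrlHandler class and the UriHandlerRegistry class are hypothetical names invented for this sketch; only parse_url() is an existing PHP function.

```php
<?php
declare(strict_types=1);

// Hypothetical contract that every parser backend would fulfil.
interface UriHandler
{
    /** Returns the URI components, or null if the input is rejected. */
    public function parse(string $uri): ?array;
}

// Backend wrapping the legacy parse_url() function.
final class LegacyParseUrlHandler implements UriHandler
{
    public function parse(string $uri): ?array
    {
        $parts = parse_url($uri);
        return $parts === false ? null : $parts;
    }
}

// Hypothetical registry, loosely modelled on how named algorithms are looked up.
final class UriHandlerRegistry
{
    /** @var array<string, UriHandler> */
    private array $handlers = [];

    public function register(string $name, UriHandler $handler): void
    {
        if (isset($this->handlers[$name])) {
            throw new InvalidArgumentException("Handler '$name' is already registered");
        }
        $this->handlers[$name] = $handler;
    }

    public function get(string $name): UriHandler
    {
        return $this->handlers[$name]
            ?? throw new InvalidArgumentException("Unknown URI handler '$name'");
    }
}

$registry = new UriHandlerRegistry();
$registry->register("parse_url", new LegacyParseUrlHandler());

// A feature would resolve the handler by its configured name.
$handler = $registry->get("parse_url");
$parts = $handler->parse("https://example.com/path?x=1");
echo $parts["host"], "\n";
```

A third party backend would, under this analogy, only need to implement the interface and register itself under a new name.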
Performance Considerations
The implementation of parse_url() is optimized for performance. This also means that it doesn't deal with validation properly and disregards some edge cases. A fully standards-compliant parser will generally be slower than parse_url(), because it has to execute more code.
According to the rough initial benchmarks, the following relative results were measured:
Time of parsing of a basic URL (smaller % is faster)
- parse_url(): 100%
- Uri\Uri::fromRfc3986(): 150%
- Uri\Uri::fromWhatWg(): 300%
Time of parsing of a complex URL (smaller % is faster)
- parse_url(): 100%
- Uri\Uri::fromRfc3986(): 50%
- Uri\Uri::fromWhatWg(): 100%
More accurate benchmarks are to be performed later.
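One way such relative figures could be measured is a small micro-benchmark like the sketch below. It is not the benchmark used for the numbers above: the helper averageParseTimeNs(), the sample URL and the iteration count are assumptions made for illustration, and the proposed factory methods are only measured when the Uri\Uri class exists in the running PHP build.

```php
<?php
declare(strict_types=1);

/** Returns the average time, in nanoseconds, of one call to $parser. */
function averageParseTimeNs(callable $parser, string $uri, int $iterations = 100_000): float
{
    $start = hrtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $parser($uri);
    }
    return (hrtime(true) - $start) / $iterations;
}

// Arbitrary sample input chosen for this sketch.
$uri = "https://user:pass@example.com:8080/a/b/../c?x=1&y=2#frag";

$candidates = ["parse_url()" => 'parse_url'];
if (class_exists(\Uri\Uri::class)) {
    // Only available once the extension proposed in this RFC is present.
    $candidates["Uri\\Uri::fromRfc3986()"] = [\Uri\Uri::class, 'fromRfc3986'];
    $candidates["Uri\\Uri::fromWhatWg()"] = [\Uri\Uri::class, 'fromWhatWg'];
}

$times = [];
foreach ($candidates as $label => $parser) {
    $times[$label] = averageParseTimeNs($parser, $uri);
}

// Report every candidate relative to parse_url() (smaller % is faster).
foreach ($times as $label => $time) {
    printf("%s: %.0f%%\n", $label, $time / $times["parse_url()"] * 100);
}
```

Results from a loop like this depend heavily on the machine, the build options and the chosen input, so they are only meaningful as rough relative values.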
API Design
The new URI parsing API consists of multiple classes, where Uri\Uri is the (abstract) base class containing two factory methods: Uri\Uri::fromRfc3986() and Uri\Uri::fromWhatWg(). These methods are generally expected to be used for instantiating the concrete implementations: Uri\Rfc3986Uri and Uri\WhatWgUri.
Having separate classes for the separate URI implementations makes it possible to refer to a specific implementation, or to require one at the type level via type declarations. Doing so is important when one has to work with only one of the URI implementations. However, for use cases where it is not important which implementation is passed, the non-instantiable Uri\Uri base class is at hand.
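As an illustration of the type-level distinction, the sketch below declares two functions against the proposed classes. Both function names are hypothetical, and the bodies only use methods listed in the stubs of this RFC. The file loads on current PHP versions, but the functions can only be called once the proposed classes exist.

```php
<?php
declare(strict_types=1);

use Uri\Rfc3986Uri;
use Uri\Uri;

// Hypothetical helper: a consumer that must follow RFC 3986 (such as a
// cURL based client) requires that specific implementation.
function buildCurlTarget(Rfc3986Uri $uri): string
{
    return (string) $uri;
}

// Hypothetical helper: code that does not care about the specification
// accepts the abstract base class instead.
function describeHost(Uri $uri): string
{
    return $uri->getHost() ?? "(no host)";
}
```

Passing a Uri\WhatWgUri instance to buildCurlTarget() would then be rejected with a TypeError, while describeHost() would accept either implementation.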
Currently, the Uri\Uri class does not support mutation. This is because the underlying libraries do not support changing the individual URI components once the input URI is parsed. Support for the missing features may be added to uriparser and Lexbor in the future, though.
Relation to PSR-7
PSR-7 UriInterface is currently the de facto interface in userland for representing URIs. That's why it seems a good candidate for adoption. However, the current RFC does not pursue this for the following reasons:
- PSR-7's UriInterface supports mutation of the URI objects via “wither” methods (e.g. UriInterface::withScheme()). This is not currently possible to achieve for the URI extension due to the technical limitations mentioned above.
- PSR-7 strictly follows the RFC 3986 standard, and therefore only has a notion of "userinfo", rather than the "user" and "password" used by the WhatWG specification.
- PSR-7's UriInterface has non-nullable method return types except for UriInterface::getPort(), whereas WhatWG specifically allows null values.
Due to its immutability, the Uri\Uri class is not compatible with the PSR-7 UriInterface.
Examples in Other Languages
Go
Even though Go's standard library ships with a net/url package containing a url.Parse() function along with some utility functions, unfortunately the documentation does not highlight which standard it conforms to. It is not very promising, however, that the manual contains the following sentence:
Trying to parse a hostname and path without a scheme is invalid but may not necessarily return an error, due to parsing ambiguities.
Java
In Java, a URL class has been available from the beginning. Unfortunately, it's unclear whether it adheres to any URL standard. As for its design, URL itself is immutable and, somewhat peculiarly, it contains some methods which can open a connection to the URL or get its content.
Since Java 20, all of the URL constructors are deprecated in favor of using URI.toURL(). The URI class conforms to the RFC 2396 standard.
NodeJS
NodeJS recently added support for a decent WHATWG URL compliant URL parser, built on top of the ADA URL parser project.
Python
Python also comes with built-in support for parsing URLs, made available by the urllib.parse.urlparse and urllib.parse.urlsplit functions. According to the documentation, “these functions incorporate some aspects of both [the WHATWG URL and the RFC 3986 specifications], but cannot be claimed compliant with either”.
Backward Incompatible Changes
None.
Proposed PHP Version(s)
The next minor PHP version (either PHP 8.5 or 9.0, whichever comes first).
Future Scope
- In addition to the configuration of the globally used URI parser via the uri.default_handler INI option, the URI parser could also be set at the individual feature level
- Support for mutation of the URI components could be added, similarly to the PSR-7 UriInterface (e.g. UriInterface::withScheme())
- Support for an abstraction for manipulating query parameters, like URLSearchParams defined by WhatWg
- The parse_url() function can be deprecated at some distant point in time
References
Discussion thread: https://externals.io/message/123997
Vote
The vote requires 2/3 majority in order to be accepted.