
PHP RFC: Add RFC3986 and WHATWG compliant URI parsing support

Introduction

URIs and URLs are one of the most fundamental concepts of the web because they make it possible to reference specific resources on a network. URLs were originally defined by Tim Berners-Lee in RFC 1738, but since then other specifications have also emerged, out of which RFC 3986 and WHATWG URL are the most notable ones. The former one updates the original RFC 1738 and defines URIs, while the latter one specifies how browsers should treat URLs.

Despite the ubiquitous nature of URLs and URIs, they are not as unambiguous as people may think, because different clients treat and parse them differently, either by following one of the standards or, even worse, by not following any at all. Unfortunately, PHP falls into the latter category: the parse_url() function is offered for parsing URLs, but it isn't compliant with any standard. Even the PHP manual contains the following warning:

This function may not give correct results for relative or invalid URLs, and the results may not even match common behavior of HTTP clients. ...

Incompatibility with current standards is a serious issue, as it hinders interoperability with different tools (e.g. HTTP clients), and it can result in bugs which are difficult to notice. For example, cURL's URL parsing implementation is based on RFC 3986, so URLs validated by FILTER_VALIDATE_URL may not necessarily be accepted when passed to cURL.
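As a small illustration of the gap, the snippet below shows that parse_url() only splits its input into components: it does not lowercase the scheme or host and leaves dot segments in the path untouched, whereas a standards-based parser may treat these parts differently. The sample URL is an arbitrary assumption chosen for the demonstration.

```php
<?php
// parse_url() splits the string into components but performs no normalization:
// the case of the scheme and host is preserved and "/../" stays in the path.
$parts = parse_url("HTTPS://EXAMPLE.COM/foo/../bar?x=1#top");

var_dump($parts["scheme"]); // string(5) "HTTPS"
var_dump($parts["host"]);   // string(11) "EXAMPLE.COM"
var_dump($parts["path"]);   // string(11) "/foo/../bar"
```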

Proposal

In order to address the above-mentioned problems, a new, always available URI extension is to be added to the standard library. The extension would support parsing, validating, modifying, and recomposing (converting the parsed structures back to strings) URIs based on both the RFC 3986 and the WHATWG URL specifications, as well as resolving references (turning a relative URI into an absolute one by applying a base URI to it). For this purpose, the following internal classes and methods are added:

namespace Uri;
 
const URI_PARSER_RFC3986 = "rfc3986";
const URI_PARSER_WHATWG = "whatwg";
 
abstract class UriException extends \Exception
{
}
 
class UninitializedUriException extends \Uri\UriException
{
}
 
class UriOperationException extends \Uri\UriException
{
}
 
class InvalidUriException extends \Uri\UriException
{
    public readonly array $errors;
}
 
readonly class WhatWgError
{
    public const int ERROR_TYPE_DOMAIN_TO_ASCII = UNKNOWN;
    public const int ERROR_TYPE_DOMAIN_TO_UNICODE = UNKNOWN;
    public const int ERROR_TYPE_DOMAIN_INVALID_CODE_POINT = UNKNOWN;
    public const int ERROR_TYPE_HOST_INVALID_CODE_POINT = UNKNOWN;
    public const int ERROR_TYPE_IPV4_EMPTY_PART = UNKNOWN;
    public const int ERROR_TYPE_IPV4_TOO_MANY_PARTS = UNKNOWN;
    public const int ERROR_TYPE_IPV4_NON_NUMERIC_PART = UNKNOWN;
    public const int ERROR_TYPE_IPV4_NON_DECIMAL_PART = UNKNOWN;
    public const int ERROR_TYPE_IPV4_OUT_OF_RANGE_PART = UNKNOWN;
    public const int ERROR_TYPE_IPV6_UNCLOSED = UNKNOWN;
    public const int ERROR_TYPE_IPV6_INVALID_COMPRESSION = UNKNOWN;
    public const int ERROR_TYPE_IPV6_TOO_MANY_PIECES = UNKNOWN;
    public const int ERROR_TYPE_IPV6_MULTIPLE_COMPRESSION = UNKNOWN;
    public const int ERROR_TYPE_IPV6_INVALID_CODE_POINT = UNKNOWN;
    public const int ERROR_TYPE_IPV6_TOO_FEW_PIECES = UNKNOWN;
    public const int ERROR_TYPE_IPV4_IN_IPV6_TOO_MANY_PIECES = UNKNOWN;
    public const int ERROR_TYPE_IPV4_IN_IPV6_INVALID_CODE_POINT = UNKNOWN;
    public const int ERROR_TYPE_IPV4_IN_IPV6_OUT_OF_RANGE_PART = UNKNOWN;
    public const int ERROR_TYPE_IPV4_IN_IPV6_TOO_FEW_PARTS = UNKNOWN;
    public const int ERROR_TYPE_INVALID_URL_UNIT = UNKNOWN;
    public const int ERROR_TYPE_SPECIAL_SCHEME_MISSING_FOLLOWING_SOLIDUS = UNKNOWN;
    public const int ERROR_TYPE_MISSING_SCHEME_NON_RELATIVE_URL = UNKNOWN;
    public const int ERROR_TYPE_INVALID_REVERSE_SOLIDUS = UNKNOWN;
    public const int ERROR_TYPE_INVALID_CREDENTIALS = UNKNOWN;
    public const int ERROR_TYPE_HOST_MISSING = UNKNOWN;
    public const int ERROR_TYPE_PORT_OUT_OF_RANGE = UNKNOWN;
    public const int ERROR_TYPE_PORT_INVALID = UNKNOWN;
    public const int ERROR_TYPE_FILE_INVALID_WINDOWS_DRIVE_LETTER = UNKNOWN;
    public const int ERROR_TYPE_FILE_INVALID_WINDOWS_DRIVE_LETTER_HOST = UNKNOWN;
 
    public string $position;
    public int $errorCode;
 
    public function __construct(string $position, int $errorCode) {}
}
 
interface UriInterface extends \Stringable
{
    public function getScheme(): ?string {}
 
    public function withScheme(?string $scheme): static {}
 
    public function getUser(): ?string {}
 
    public function withUser(?string $user): static {}
 
    public function getPassword(): ?string {}
 
    public function withPassword(?string $password): static {}
 
    public function getHost(): ?string {}
 
    public function withHost(?string $host): static {}
 
    public function getPort(): ?int {}
 
    public function withPort(?int $port): static {}
 
    public function getPath(): ?string {}
 
    public function withPath(?string $path): static {}
 
    public function getQuery(): ?string {}
 
    public function withQuery(?string $query): static {}
 
    public function getFragment(): ?string {}
 
    public function withFragment(?string $fragment): static {}
 
    public function equalsTo(\Uri\UriInterface $uri, bool $excludeFragment = true): bool {}
 
    public function normalize(): static {}
 
    public function toNormalizedString(): string {}
 
    public function __toString(): string {}
}
 
readonly class Rfc3986Uri implements \Uri\UriInterface
{
    private ?string $scheme;
    private ?string $user;
    private ?string $password;
    private ?string $host;
    private ?int $port;
    private ?string $path;
    private ?string $query;
    private ?string $fragment;
 
    public static function parse(string $uri, ?string $baseUrl = null): ?static {}
 
    public function __construct(string $uri, ?string $baseUrl = null) {}
 
    public function getScheme(): ?string {}
 
    public function withScheme(?string $scheme): static {}
 
    public function getUser(): ?string {}
 
    public function withUser(?string $user): static {}
 
    public function getPassword(): ?string {}
 
    public function withPassword(?string $password): static {}
 
    public function getHost(): ?string {}
 
    public function withHost(?string $host): static {}
 
    public function getPort(): ?int {}
 
    public function withPort(?int $port): static {}
 
    public function getPath(): ?string {}
 
    public function withPath(?string $path): static {}
 
    public function getQuery(): ?string {}
 
    public function withQuery(?string $query): static {}
 
    public function getFragment(): ?string {}
 
    public function withFragment(?string $fragment): static {}
 
    public function equalsTo(\Uri\UriInterface $uri, bool $excludeFragment = true): bool {}
 
    public function normalize(): static {}
 
    public function toNormalizedString(): string {}
 
    public function __toString(): string {}
 
    public function __serialize(): array {}
 
    public function __unserialize(array $data): void {}
}
 
readonly class WhatWgUri implements \Uri\UriInterface
{
    private ?string $scheme;
    private ?string $user;
    private ?string $password;
    private ?string $host;
    private ?int $port;
    private ?string $path;
    private ?string $query;
    private ?string $fragment;
 
    public static function parse(string $uri, ?string $baseUrl = null): static|array {}
 
    public function __construct(string $uri, ?string $baseUrl = null) {}
 
    public function getScheme(): ?string {}
 
    public function withScheme(?string $scheme): static {}
 
    public function getUser(): ?string {}
 
    public function withUser(?string $user): static {}
 
    public function getPassword(): ?string {}
 
    public function withPassword(?string $password): static {}
 
    public function getHost(): ?string {}
 
    public function withHost(?string $host): static {}
 
    public function getPort(): ?int {}
 
    public function withPort(?int $port): static {}
 
    public function getPath(): ?string {}
 
    public function withPath(?string $path): static {}
 
    public function getQuery(): ?string {}
 
    public function withQuery(?string $query): static {}
 
    public function getFragment(): ?string {}
 
    public function withFragment(?string $fragment): static {}
 
    public function equalsTo(\Uri\UriInterface $uri, bool $excludeFragment = true): bool {}
 
    public function normalize(): static {}
 
    public function toNormalizedString(): string {}
 
    public function __toString(): string {}
 
    public function __serialize(): array {}
 
    public function __unserialize(array $data): void {}
}

API Design

First and foremost, the new URI parsing API contains a Uri\UriInterface interface which is implemented by two classes, Uri\Rfc3986Uri and Uri\WhatWgUri, representing RFC 3986 and WHATWG URIs, respectively. Having separate classes for the two standards makes it possible to indicate explicit intent at the type level that one specific standard is required. Actually, it may cause a security vulnerability to have wrong assumptions about the origin of a URI, as Daniel Stenberg (author of cURL) writes in one of his blog posts. That's why it's recommended to rely on one of the concrete URI implementations rather than the Uri\UriInterface interface itself.

Both built-in URI implementations support instantiation via two methods:

  • the constructor: It expects a required URI and an optional base URI parameter in order to support reference resolution. In case of an invalid URI, a Uri\InvalidUriException is thrown.
  • a parse() factory method: It expects the same parameters as the constructor does, but in case of an invalid URI, the failure is reported via the return value instead of throwing an exception (null for Uri\Rfc3986Uri, an array of Uri\WhatWgError objects for Uri\WhatWgUri). Using this method is recommended for validating URIs.
$uri = new Uri\Rfc3986Uri("https://example.com"); // An RFC 3986 URI instance is created
$uri = Uri\Rfc3986Uri::parse("https://example.com"); // An RFC 3986 URI instance is created
 
$uri = new Uri\Rfc3986Uri("invalid uri"); // A Uri\InvalidUriException is thrown
$uri = Uri\Rfc3986Uri::parse("invalid uri"); // null is returned in case of an invalid URI
 
$uri = new Uri\WhatWgUri("https://example.com"); // A WHATWG URL instance is created
$uri = Uri\WhatWgUri::parse("https://example.com"); // A WHATWG URL instance is created
 
$uri = new Uri\WhatWgUri("invalid uri"); // A Uri\InvalidUriException is thrown
$uri = Uri\WhatWgUri::parse("invalid uri"); // An array of WhatWgError objects is returned in case of an invalid URI
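
The two instantiation styles can be tried out on current PHP versions with a userland stand-in. The following is only a sketch: SketchUri and SketchInvalidUriException are hypothetical names, and the validity rule (parse_url() must succeed and find a scheme) is a toy assumption, not the behavior of the proposed classes.

```php
<?php
// Hypothetical userland stand-in; NOT the proposed Uri\Rfc3986Uri class.
final class SketchInvalidUriException extends Exception {}

final class SketchUri
{
    private array $components;

    public function __construct(string $uri)
    {
        // Toy validity rule (an assumption): parse_url() must succeed and find a scheme.
        $components = parse_url($uri);
        if ($components === false || !isset($components["scheme"])) {
            throw new SketchInvalidUriException("Invalid URI: $uri");
        }
        $this->components = $components;
    }

    // Same input as the constructor, but failure is reported via the return value.
    public static function parse(string $uri): ?static
    {
        try {
            return new static($uri);
        } catch (SketchInvalidUriException $e) {
            return null;
        }
    }

    public function getHost(): ?string
    {
        return $this->components["host"] ?? null;
    }
}

$ok = SketchUri::parse("https://example.com"); // a SketchUri instance
$bad = SketchUri::parse("invalid uri");        // null, no exception is thrown
```

The factory form keeps validation code free of try/catch blocks, which is why it is the more convenient choice when the caller only wants to know whether a string is acceptable.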

The two built-in Uri\UriInterface implementations are readonly, and they have a respective private virtual property for each URI component. These URI components can be retrieved via getters, and immutable modification is possible via “wither” methods. While property hooks and/or asymmetric visibility would make it possible to get rid of the getters, the position of this RFC is to still go with regular get*() method calls as the conservative option, consistent with other internal interfaces. Hooked properties could also be declared in interfaces, but since they cannot be readonly, this possibility was rejected: the author of this RFC believes that it's more important to guarantee the immutability of URI implementations than to optimize performance by eliminating (getter) method calls.

$uri1 = new Uri\Rfc3986Uri("https://example.com");
$uri2 = $uri1->withHost("test.com");
 
echo $uri1->getHost();                            // example.com
echo $uri2->getHost();                            // test.com
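
For readers unfamiliar with the “wither” pattern, a minimal userland sketch follows. SketchHostHolder is a hypothetical class that only tracks a scheme and a host; the proposed classes are implemented internally and cover every component.

```php
<?php
// Hypothetical illustration class; the proposed classes are implemented internally.
final class SketchHostHolder
{
    public function __construct(
        private readonly string $scheme,
        private readonly string $host,
    ) {}

    public function getHost(): string
    {
        return $this->host;
    }

    // Returns a modified copy; $this is left untouched.
    public function withHost(string $host): static
    {
        return new static($this->scheme, $host);
    }

    public function __toString(): string
    {
        return $this->scheme . "://" . $this->host;
    }
}

$a = new SketchHostHolder("https", "example.com");
$b = $a->withHost("test.com");

echo $a, "\n"; // https://example.com
echo $b, "\n"; // https://test.com
```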

Besides accessors, Uri\UriInterface contains a __toString() method too, which recomposes the URI components back into a string. Why is such a method necessary at all? Because the recomposition process doesn't necessarily return the input URI unchanged; it may apply some modifications to it. The WHATWG standard specifically mandates quite a few transformations (e.g. removal of extraneous / characters after the scheme, lowercasing some URI components, application of IDNA encoding). While some transformations are also required by default for RFC 3986, they are less frequent than for WHATWG.

$uri = new Uri\WhatWgUri("https://////example.com");
 
echo $uri->__toString();                         // https://example.com

On the other hand, RFC 3986 has the notion of normalization, which is an optional process for canonicalizing different URIs that identify the same resource into the same URI. Therefore, URI implementations may support normalization via the normalize() method. E.g. the https:///////EXAMPLE.com and the HTTPS://example.com/ URIs identify the same resource, so implementations may normalize both of them to https://example.com. If an implementation supports this process, it should apply some kind of normalization technique to the URI (e.g. case normalization, percent-encoding normalization, etc.) and return a new instance; otherwise the current, unmodified object can be returned. The toNormalizedString() method is a shorthand for $uri->normalize()->__toString(), and it's useful when one needs the normalized string representation, but the URI components themselves don't have to be modified.

// Uri\Rfc3986Uri supports normalization
$uri = new Uri\Rfc3986Uri("https://EXAMPLE.COM/foo/../bar");
 
echo $uri->__toString();                        // https://EXAMPLE.COM/foo/../bar
echo $uri->normalize()->__toString();           // https://example.com/bar
echo $uri->toNormalizedString();                // https://example.com/bar
 
// Uri\WhatWgUri normalizes the URI by default, therefore normalize() doesn't change anything
$uri = new Uri\WhatWgUri("https://EXAMPLE.COM/foo/../bar");
 
echo $uri->__toString();                        // https://example.com/bar
echo $uri->normalize()->__toString();           // https://example.com/bar
echo $uri->toNormalizedString();                // https://example.com/bar
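
To make the normalization steps more tangible, here is a userland sketch of two of them: case normalization of the scheme and host, and dot-segment removal following the algorithm in RFC 3986 section 5.2.4. The function names are hypothetical, parse_url() is used only to split the input, and userinfo, port, query, fragment and percent-encoding are deliberately ignored, so this is not how the proposed extension is implemented.

```php
<?php
// Hypothetical helper: remove "." and ".." segments (RFC 3986, section 5.2.4).
function sketch_remove_dot_segments(string $path): string
{
    $input = $path;
    $output = "";
    while ($input !== "") {
        if (str_starts_with($input, "../")) {
            $input = substr($input, 3);
        } elseif (str_starts_with($input, "./")) {
            $input = substr($input, 2);
        } elseif (str_starts_with($input, "/./")) {
            $input = substr($input, 2);          // "/./x" becomes "/x"
        } elseif ($input === "/.") {
            $input = "/";
        } elseif (str_starts_with($input, "/../") || $input === "/..") {
            $input = $input === "/.." ? "/" : substr($input, 3);
            // Drop the last segment that was already moved to the output.
            $pos = strrpos($output, "/");
            $output = $pos === false ? "" : substr($output, 0, $pos);
        } elseif ($input === "." || $input === "..") {
            $input = "";
        } else {
            // Move the first segment (with its leading "/") to the output.
            $next = strpos($input, "/", 1);
            if ($next === false) {
                $output .= $input;
                $input = "";
            } else {
                $output .= substr($input, 0, $next);
                $input = substr($input, $next);
            }
        }
    }
    return $output;
}

// Hypothetical helper: only scheme, host and path are handled in this sketch.
function sketch_normalize(string $uri): string
{
    $p = parse_url($uri);
    return strtolower($p["scheme"]) . "://" . strtolower($p["host"])
        . sketch_remove_dot_segments($p["path"] ?? "");
}

echo sketch_normalize("https://EXAMPLE.COM/foo/../bar"), "\n"; // https://example.com/bar
```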

Normalization is especially important when it comes to comparing URIs because it reduces the likelihood of false negative results, since URI comparison is based on checking whether the string representations of the URIs are the same. The UriInterface::equalsTo() method can be used for comparing URIs. First, this method checks whether the called object and the URI instance received as the argument have any parent-child relation, since it doesn't make sense to compare URIs of different standards. Then it normalizes and recomposes both URIs to strings, and checks whether the two strings match. By default, the fragment component is disregarded.

// A URI equals to another URI of the same standard that has the same string representation after normalization
new Uri\Rfc3986Uri("https://example.COM")->equalsTo(new Uri\Rfc3986Uri("https://EXAMPLE.COM"));  // true
 
// A URI doesn't equal to another URI of a different standard even though they have the same string representation
new Uri\Rfc3986Uri("https://example.com/")->equalsTo(new Uri\WhatWgUri("https://example.com/"));  // false
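
The comparison steps described above (relation check, normalization, string comparison, optional fragment exclusion) can be sketched in userland as follows. All class names are hypothetical, and the normalization is a toy assumption that only lowercases the scheme and host.

```php
<?php
// Hypothetical base class sketching the equalsTo() steps; not the proposed implementation.
abstract class SketchComparableUri
{
    public function __construct(protected string $uri) {}

    // Toy normalization (an assumption): lowercase scheme and host only.
    protected function normalized(bool $excludeFragment): string
    {
        $p = parse_url($this->uri);
        $s = strtolower($p["scheme"] ?? "") . "://" . strtolower($p["host"] ?? "") . ($p["path"] ?? "");
        if (isset($p["query"])) {
            $s .= "?" . $p["query"];
        }
        if (!$excludeFragment && isset($p["fragment"])) {
            $s .= "#" . $p["fragment"];
        }
        return $s;
    }

    public function equalsTo(SketchComparableUri $other, bool $excludeFragment = true): bool
    {
        // Step 1: objects without a parent-child relation (different standards) never match.
        if (!($other instanceof static) && !($this instanceof $other)) {
            return false;
        }
        // Step 2: compare the normalized string representations.
        return $this->normalized($excludeFragment) === $other->normalized($excludeFragment);
    }
}

final class SketchRfcUri extends SketchComparableUri {}
final class SketchWhatWgUri extends SketchComparableUri {}

$same = (new SketchRfcUri("https://example.COM"))->equalsTo(new SketchRfcUri("https://EXAMPLE.COM"));
$mixed = (new SketchRfcUri("https://example.com/"))->equalsTo(new SketchWhatWgUri("https://example.com/"));

var_dump($same);  // bool(true)
var_dump($mixed); // bool(false)
```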

It should be noted that the equalsTo() method only accepts a Uri\UriInterface instance, even though it could also accept string URIs. It was a deliberate decision not to allow such arguments, because it would be unclear how the comparison works in this case: Should the passed-in URI also be normalized, or should an exact string match be performed? Would the URI be parsed based on the same standard as the object the method is called on? These questions don't have to be answered when only the Uri\UriInterface parameter type is supported. Furthermore, the equality operator is not overloaded because this construct is not supported in userland.

Relation to PSR-7

PSR-7 UriInterface is currently the de facto interface for representing URIs in userland. That's why it seems a good candidate for adoption. However, the current RFC does not pursue this, mainly for the following reasons:

  • PSR-7 strictly follows the RFC 3986 standard, and therefore only has a notion of "userinfo", rather than "user" and "password" which is used by the WHATWG specification.
  • PSR-7's UriInterface has non-nullable method return types except for UriInterface::getPort(), whereas WHATWG specifically allows null values.

Why is query parameter manipulation not supported?

It would be very useful for a URI implementation to support direct query parameter manipulation. Actually, the WHATWG URL specification contains a URLSearchParams interface that could be used for the purpose. However, the position of this RFC is not to include this interface *yet* for the following reasons:

  • Query string parsing is a fuzzy area, since there are no established rules for how to parse query strings
  • The URLSearchParams interface doesn't follow either RFC 1738, or RFC 3986
  • The already large scope of the RFC would increase even more

For all these reasons, the topic of query parameter manipulation should be discussed as a followup to the current RFC.
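
How fuzzy this area is can be seen from PHP's existing parse_str() function, which applies its own conventions: a repeated name overwrites the earlier value, a [] suffix collects values into an array, and a dot in a name is replaced by an underscore. WHATWG's URLSearchParams, by contrast, keeps repeated names as separate name-value pairs. The query strings below are arbitrary samples.

```php
<?php
// parse_str() follows PHP-specific rules rather than any URL standard.
parse_str("x=1&x=2", $dup);      // the later duplicate wins
parse_str("x[]=1&x[]=2", $list); // "[]" collects the values into an array
parse_str("a.b=1", $dot);        // "." in a name becomes "_"

var_dump($dup);  // ["x" => "2"]
var_dump($list); // ["x" => ["1", "2"]]
var_dump($dot);  // ["a_b" => "1"]
```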

Parser Library Choice

Adding a WHATWG compliant URL parser to the standard library was originally attempted in 2023. The implementation used the ADA URL parser as its backend, which is known for its high performance. In the end, the proof of concept was abandoned due to technical limitations that could not be resolved.

Specifically, ADA is written in C++ and requires a compiler supporting at least C++17. Even though it has a C wrapper, its strict compiler requirements would make it unprecedented, and practically impossible, to add the URI extension to PHP as a required extension: PHP has never had a C++ compiler dependency for always-enabled extensions; only optional extensions (like Intl) can be written in C++.

The firm position of this RFC is that a URL parser extension should always be available, therefore a different parser backend written in pure C had to be found. Fortunately, Niels Dossche proposed the PHP RFC: DOM HTML5 parsing and serialization not long after the experiment with ADA, and his work required bundling parts of the Lexbor browser engine. This library is written in C and coincidentally contains a WHATWG compliant URL parsing submodule, which makes it suitable as the library of choice.

For parsing URIs according to RFC 3986, the uriparser library was chosen. It is a lightweight and fast C library with no dependencies. It uses the “new BSD license” which is compatible with the current PHP license as well as the PHP License Update RFC.

Pluggability

The capability provided by parse_url() is used for multiple purposes in the internal PHP source:

  • SoapClient::_doRequest(): parsing the $location parameter as well as the value of the Location header
  • FTP/FTPS stream wrapper: parse_url() is used for connecting to a URL, renaming a file, following the Location header
  • FILTER_VALIDATE_URL: validating URLs
  • SSL/TLS socket communication: parsing the target URL
  • GET/POST session: accepting the session ID from the query string, manipulating the output URL to automatically include the session ID (Deprecate GET/POST sessions RFC)

It would cause inconsistency and security vulnerabilities if parsing of URIs based on the two specifications referred to above was supported in userland, but the legacy parse_url() based behavior was kept internally without the possibility to use the new API. That's why the current RFC was designed with pluggability in mind.

Specifically, supported parser backends would have to be registered in a way similar to how password hashing algorithms are registered. On the one hand, this approach makes it possible for 3rd party extensions to leverage URI parser backends other than the built-in ones (e.g. support for ADA URL could also be added). But more importantly, an internal “interface” for parsing and handling URIs is defined this way, so that it becomes possible to configure the backend used for each use-case. Please note that URI parser backend registration is only supported for internal code: registering custom user-land implementations is not possible for now, mainly in order to prevent a possible new attack surface.
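
For reference, the password hashing registry mentioned above is observable from userland through the existing password_algos() function, which lists the identifiers of the algorithms registered by internal code. The snippet only demonstrates that precedent; whether URI parser backends would be listable in a similar way is not specified by this RFC.

```php
<?php
// password_algos() lists the identifiers of the registered hashing algorithms.
$algos = password_algos();
print_r($algos);

// bcrypt is always available and is registered under its string identifier.
var_dump(in_array(PASSWORD_BCRYPT, $algos, true)); // bool(true)
```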

While it would sound natural to add a php.ini configuration option to configure the used parser backend globally, this option was rejected during the discussion period of the RFC because it would result in unsafe code that is controlled by global state: since any invoked piece of code can change the used parser backend, one should always check the current value of the config option before parsing URIs (and in case of libraries, the original option should also be reset after usage). Instead, the RFC proposes to add the following configuration options that only affect a single use-case:

  • SoapClient::_doRequest(): a new optional $uriParserName parameter is added accepting string or null arguments. null selects the original, parse_url() based behavior, while the new backends are used when passing either URI_PARSER_RFC3986 or URI_PARSER_WHATWG.
  • FTP/FTPS stream wrapper: a new uri_parser_name stream context option is added
  • FILTER_VALIDATE_URL: filter_* functions can be configured by passing a uri_parser_name key to the $options array
  • SSL/TLS socket communication: a new uri_parser_name stream context option is added
  • GET/POST session: since this feature is deprecated by the Deprecate GET/POST sessions RFC, no configuration is added.
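
As a sketch of how the proposed stream context option might be passed, the snippet below sets uri_parser_name for the FTP wrapper. The option name and the "rfc3986" value come from this RFC; on PHP versions without the proposal the option is merely stored in the context and has no effect, so the example only shows the shape of the configuration.

```php
<?php
// "uri_parser_name" is the option proposed above; PHP versions without this RFC
// store unknown context options without acting on them.
$context = stream_context_create([
    "ftp" => ["uri_parser_name" => "rfc3986"], // value of the proposed Uri\URI_PARSER_RFC3986 constant
]);

var_dump(stream_context_get_options($context)["ftp"]["uri_parser_name"]); // string(7) "rfc3986"
```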

Performance Considerations

The implementation of parse_url() is optimized for performance. This also means that it doesn't deal with validation properly and disregards some edge cases. A fully standard compliant parser will generally be slower than parse_url(), because it has to execute more code. Fortunately, this overhead is usually minimal thanks to the huge efforts of the maintainers of the Lexbor and the uriparser libraries.

According to the rough benchmarks, the following results were measured:

Time of parsing of a basic URL (1000 times)

  • parse_url(): 0.000208 sec
  • Uri\Rfc3986Uri: 0.000311 sec
  • Uri\WhatWgUri: 0.000387 sec

Time of parsing of a complex URL (1000 times)

  • parse_url(): 0.000962 sec
  • Uri\Rfc3986Uri: 0.000911 sec
  • Uri\WhatWgUri: 0.000962 sec
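
A measurement of this kind can be reproduced roughly with a loop around hrtime(). The helper below is a hypothetical sketch, not the script used for the figures above, and it only times parse_url() because the proposed classes may not be available; absolute numbers will differ between machines.

```php
<?php
// Hypothetical micro-benchmark helper; not the script used for the figures above.
function sketch_time_parse(string $url, int $iterations = 1000): float
{
    $start = hrtime(true);                // monotonic clock, in nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        parse_url($url);
    }
    return (hrtime(true) - $start) / 1e9; // elapsed time in seconds
}

printf("parse_url(): %.6f sec\n", sketch_time_parse("https://example.com/"));
```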

Examples in Other Languages

Go

Even though Go's standard library ships with a net/url package containing a url.Parse() function along with some utility functions, unfortunately it's not highlighted in the documentation which standard it conforms to. However, it's not very promising that the manual mentions the following sentence:

Trying to parse a hostname and path without a scheme is invalid but may not necessarily return an error, due to parsing ambiguities.

Java

In Java, a URL class has been available from the beginning. Unfortunately, it's unclear whether it adheres to any URL standards. Speaking about its design, URL itself is immutable, and somewhat peculiarly, it contains some methods which can open a connection to the URL, or get its content.

Since Java 20, all of the URL constructors are deprecated in favor of using URI.toURL(). The URI class conforms to the RFC 2396 standard.

NodeJS

NodeJS recently added support for a decent WHATWG URL compliant URL parser, built on top of the ADA URL parser project.

Python

Python also comes with built-in support for parsing URLs, made available by the urllib.parse.urlparse and urllib.parse.urlsplit functions. According to the documentation, “these functions incorporate some aspects of both [the WHATWG URL and the RFC 3986 specifications], but cannot be claimed compliant with either”.

Backward Incompatible Changes

None.

Proposed PHP Version(s)

The next minor PHP version (either PHP 8.5 or 9.0, whichever comes first).

RFC Impact

To SAPIs

SAPIs should adopt the new internal API for parsing URIs instead of using the existing php_url_parse*() API. Additionally, they should add support for configuring the URI parsing backend.

To Existing Extensions

Extensions should adopt the new internal API for parsing URIs instead of using the existing php_url_parse*() API. Additionally, they should add support for configuring the URI parsing backend.

To Opcache

None.

Future Scope

  • Support for new parser backends so that other libraries (like Ada URL, or cURL) could also be used in addition to uriparser and Lexbor.
  • Support for an abstraction for manipulating query parameters, like URLSearchParams defined by WHATWG
  • The parse_url() function can be deprecated at some distant point of time

References

Vote

The vote requires 2/3 majority in order to be accepted.

Add the RFC 3986 and the WHATWG compliant URI API described above?
Final result: Yes 0, No 0. This poll has been closed.
rfc/url_parsing_api.txt · Last modified: 2024/11/25 22:02 by kocsismate