PHP RFC: Add RFC 3986 and WHATWG URL compliant API

Version: 3.0
Date: 2024-06-11
Author: Máté Kocsis, kocsismate@php.net
Status: Accepted
First Published at: https://wiki.php.net/rfc/url_parsing_api
Implementation: https://github.com/php/php-src/pull/14461

Introduction

URIs and URLs are one of the most fundamental concepts of the web because they make it possible to reference specific resources on a network. URLs were originally defined by Tim Berners-Lee in RFC 1738, but since then other specifications have also emerged, out of which RFC 3986 and WHATWG URL are the most notable ones. The former one updates the original RFC 1738 and defines URIs, while the latter one specifies how browsers should treat URLs.

Despite the ubiquitous nature of URLs and URIs, they are not as unambiguous as people may think, because different clients parse and treat them differently: some follow one of the standards, while others, even worse, follow none at all. Unfortunately, PHP falls into the latter category: the parse_url() function is offered for parsing URLs, but it isn't compliant with any standard. Even the PHP manual contains the following warning:

This function may not give correct results for relative or invalid URLs, and the results may not even match common behavior of HTTP clients. ...

Incompatibility with current standards is a serious issue, as it hinders interoperability with different tools (e.g. HTTP clients), and it can result in subtle bugs. For example, cURL's URL parsing implementation is based on RFC 3986, which is why URLs validated by FILTER_VALIDATE_URL may not necessarily be accepted when passed to cURL. This is exactly what the parsing confusion security vulnerability exploits.

URIs, IRIs, URLs, URNs

First of all, we should define what URIs, IRIs, URLs, and URNs are, and what their relation is to each other, in order to have a better understanding of the terms used in the current RFC. It should be noted that different specifications use different definitions, so there is not a single definitive answer. However, the RFC tries to use these terms consistently according to the definitions below:

URI: A unique identifier that relates to an abstract or physical resource (e.g. www.google.com)
IRI: A superset of URIs defined by RFC 3987 which allows Unicode characters to be used, therefore supporting IDNA (Internationalized Domain Names in Applications)
URL: A subset of URIs that specify their location (e.g. https://www.google.com)
URN: A subset of URIs that are globally unique within defined namespaces (e.g. urn:isbn:0451450523)

Their relation can be best illustrated via a Venn diagram:

* The image is reused from https://wiki.selfhtml.org/wiki/URN.

Relevant URI specifications

Before discussing the proposal itself, we should also briefly touch on the URI specifications that the present RFC implements.

RFC 3986

RFC 3986 is a generic specification for URIs. Therefore, it is relatively permissive in the sense that it doesn't include scheme-specific processing rules. For example, the LDAP specification builds upon RFC 3986 and extends it with additional rules (e.g. the ? and the , characters have to be percent-encoded at certain positions).

WHATWG URL

It is a fairly new specification that is mostly relevant in the web browser context. It is a living specification, meaning it changes from time to time. One of its fundamental differences compared to RFC 3986 is that it only deals with URLs, rather than URIs.

Important concepts related to URIs

URIs have some important concepts and capabilities that are needed to effectively work with them.

Parsing

Parsing is the single most important URI operation: it decomposes a URI string into multiple components.

While RFC 3986 leaves the input URI string intact during parsing, WHATWG URL automatically transforms it (removes superfluous “/” characters after the scheme, lowercases the host, etc.).
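
As a rough illustration of the decomposition step, the following sketch splits a URI string into the five generic components by using the regular expression given in Appendix B of RFC 3986. The splitUri() function is a hypothetical helper and not part of this proposal. It only separates the components: it doesn't validate them, and it doesn't perform any of the transformations that WHATWG URL applies.

```php
<?php

// Hypothetical helper, not part of the proposed API: splits a URI reference
// into its generic components with the regular expression from RFC 3986,
// Appendix B. The input is neither validated nor transformed.
function splitUri(string $uri): array
{
    $pattern = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~s';
    preg_match($pattern, $uri, $m, PREG_UNMATCHED_AS_NULL);

    return [
        'scheme'    => $m[2] ?? null,  // null means the component is absent
        'authority' => $m[4] ?? null,
        'path'      => $m[5] ?? '',    // the path is always present, but may be empty
        'query'     => $m[7] ?? null,
        'fragment'  => $m[9] ?? null,
    ];
}

$components = splitUri("HTTPS://EXAMPLE.com/foo?a=b#top");

echo $components['scheme'], "\n";     // HTTPS
echo $components['authority'], "\n";  // EXAMPLE.com
echo $components['path'], "\n";       // /foo
```

Note that the scheme and the authority keep their original casing, which mirrors how RFC 3986 parsing leaves the input intact.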

Reference resolution

Reference resolution is a process which turns a potentially relative URI reference into a URL by applying it to an absolute URL (a URL that has no fragment component): resolving “/foo” on https://example.com/ results in https://example.com/foo. Both RFC 3986 and WHATWG URL support this concept.

Component recomposition

It is the process of recomposing the distinct URI components to a URI string. While RFC 3986 uses the following algorithm: https://datatracker.ietf.org/doc/html/rfc3986#section-5.3, WHATWG URL applies the algorithm described at https://url.spec.whatwg.org/#url-serializing for the purpose.

An important question that needs to be elaborated upon is whether the recomposed URI equals the input URI string. The two specifications work differently in this regard again: by default, RFC 3986 doesn't require any transformations to be performed during parsing, but it makes some recommendations on how to canonicalize the parsed URI string (see the next section). That's why - by default - the recomposed URI is the same as the originally supplied URI string.

On the other hand, WHATWG URL performs quite a few transformations on the input during parsing, so the recomposed URI may not be the same as the original one.
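
The RFC 3986 recomposition algorithm is short enough to be sketched directly. The recomposeUri() function below is a hypothetical helper that follows the pseudocode of section 5.3; it is not the proposed implementation, and it doesn't cover the WHATWG URL serializer.

```php
<?php

// Hypothetical helper following the pseudocode of RFC 3986, section 5.3.
// A null argument means that the component is undefined, while an empty
// string means that it is present but empty, so its delimiter is still emitted.
function recomposeUri(
    ?string $scheme,
    ?string $authority,
    string $path,
    ?string $query,
    ?string $fragment
): string {
    $result = '';

    if ($scheme !== null) {
        $result .= $scheme . ':';
    }
    if ($authority !== null) {
        $result .= '//' . $authority;
    }

    $result .= $path;

    if ($query !== null) {
        $result .= '?' . $query;
    }
    if ($fragment !== null) {
        $result .= '#' . $fragment;
    }

    return $result;
}

echo recomposeUri('https', 'example.com', '/foo', 'a=b', null), "\n";       // https://example.com/foo?a=b
echo recomposeUri('mailto', null, 'user@example.com', null, null), "\n";    // mailto:user@example.com
echo recomposeUri(null, null, '/foo', '', null), "\n";                      // /foo?
```

The last call shows why distinguishing between a missing and an empty component matters: dropping the empty query would produce a different string than the one that was parsed.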

Normalization

Normalization is an optional process supported by RFC 3986 for canonicalizing different URIs identifying the same resource to the same URI. E.g. the https://EXAMPLE.com and the HTTPS://example.com/ URIs both refer to the same resource, so they can be normalized to https://example.com. As we will see, normalization is very useful in multiple cases.

Although WHATWG URL doesn't acknowledge this concept, it still applies very similar transformations during parsing (e.g. lowercasing of the scheme and hostname components, removal of superfluous path segments).
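
Two of the normalization steps mentioned above can be sketched in a few lines: lowercasing the scheme and host, and removing “.” and “..” path segments as described in section 5.2.4 of RFC 3986. The removeDotSegments() function is a hypothetical helper written for illustration only (it assumes PHP 8.0+ because of str_starts_with()); the proposed implementation may work differently.

```php
<?php

// Hypothetical helper: removes "." and ".." segments from a path by following
// the steps of RFC 3986, section 5.2.4.
function removeDotSegments(string $input): string
{
    $output = '';

    while ($input !== '') {
        if (str_starts_with($input, '../')) {
            $input = substr($input, 3);
        } elseif (str_starts_with($input, './')) {
            $input = substr($input, 2);
        } elseif (str_starts_with($input, '/./')) {
            $input = substr($input, 2);              // "/./x" becomes "/x"
        } elseif ($input === '/.') {
            $input = '/';
        } elseif (str_starts_with($input, '/../') || $input === '/..') {
            $input = $input === '/..' ? '/' : substr($input, 3);
            // Drop the last segment that was already moved to the output
            $slash = strrpos($output, '/');
            $output = $slash === false ? '' : substr($output, 0, $slash);
        } elseif ($input === '.' || $input === '..') {
            $input = '';
        } else {
            // Move the first segment (including its leading "/") to the output
            $next = strpos($input, '/', 1);
            $length = $next === false ? strlen($input) : $next;
            $output .= substr($input, 0, $length);
            $input = substr($input, $length);
        }
    }

    return $output;
}

echo strtolower('HTTPS'), '://', strtolower('EXAMPLE.COM'), removeDotSegments('/foo/../bar/'), "\n";
// https://example.com/bar/
```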

Percent-encoding & decoding

Encoding and decoding special characters is a crucial aspect of URI parsing. For this purpose, both RFC 3986 and WHATWG URL use percent-encoding (e.g. the % character is encoded as %25). However, the two standards slightly differ in the details.

WHATWG URL associates a character set for each component, defining the characters that must be percent-encoded in the context of the given component. For example, the query percent-encode set is associated with the query component, containing the “#” character (among others), while the path percent-encode set includes the “?” character in addition (among others). It's easy to see the pattern: if a character has special meaning after the given component, then it must be percent-encoded. That's why the userinfo percent-encode set contains the “/” character (among others), but the “query percent-encode set” doesn't include it anymore, since “/” characters don't have a special meaning after the path component.

Similarly, RFC 3986 also assigns a list of allowed characters to each component. For example, the query component may contain unreserved and any percent-encoded characters, as well as some reserved characters that are categorized as “sub-delimiters” (e.g. “&”, “!”, “'”), and also some “generic delimiters” (“:”, and “@”) that don't have any special meaning in the context of the query.

These two approaches are very similar, but there is a key difference between them. WHATWG URL automatically tries to percent-encode characters in the associated percent-encode set when possible, and it also percent-encodes any characters that are illegal in a URL (that are not “URL units”), in which case a warning is emitted. RFC 3986, in contrast, rejects invalid characters and stops parsing with a failure.

RFC 3986 also specifies a set of reserved characters (“#”, “?”, “/”, etc.) that must not be percent-decoded, so that they can be safely used as delimiters by scheme-specific syntaxes. The specification puts it as follows:

Thus, characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI.

WHATWG URL simply doesn't do any percent-decoding because of reasons that are discussed in the following section.
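
The RFC 3986 rule about protecting reserved characters can be illustrated with a small sketch. The decodeUnreserved() function below is a hypothetical helper, not part of the proposed API: it decodes only those percent-encoded octets that represent unreserved characters, and it uppercases the hexadecimal digits of the remaining ones, which is the case normalization recommended by RFC 3986.

```php
<?php

// Hypothetical helper, not part of the proposed API. Percent-encoded octets
// are decoded only when they represent unreserved characters
// (ALPHA / DIGIT / "-" / "." / "_" / "~"); all other percent-encoded octets
// are kept, with their hexadecimal digits uppercased.
function decodeUnreserved(string $component): string
{
    return (string) preg_replace_callback(
        '/%[0-9A-Fa-f]{2}/',
        static function (array $match): string {
            $char = rawurldecode($match[0]);

            return preg_match('/\A[A-Za-z0-9._~-]\z/', $char) === 1
                ? $char
                : strtoupper($match[0]);
        },
        $component
    );
}

echo decodeUnreserved('/t%65st'), "\n";     // /test
echo decodeUnreserved('/a%2fb%3fc'), "\n";  // /a%2Fb%3Fc
echo decodeUnreserved('100%25'), "\n";      // 100%25
```

In the second call, the encoded “/” and “?” characters stay encoded, because decoding them would change how the string is split into components.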

Equivalence

Normalization and transformations during parsing are especially important when it comes to comparing URIs to each other because they reduce the likelihood of false negative results, as URI comparison is effectively checking whether two URIs represent the same resource.

In practice, this means that two URIs are normalized (when applicable) and then the components are recomposed. If the resulting URI strings are equal, then the two URIs are also equal. Usually, the fragment component is disregarded, since it refers to a secondary resource within the primary one that is identified by the URI.

To complicate things, there is also a nuanced difference in how the two specifications treat equivalence. RFC 3986 defines that “URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent”, which effectively means that percent-encoded unreserved characters and their decoded form are equivalent (e.g. character “e” is equivalent to “%65”).

On the contrary, WHATWG URL defines URL equivalence by the equality of the recomposed URL string, and never decodes percent-encoded characters, except in the host. This implies that percent-encoded characters are not equivalent to their percent-decoded form (except in the host).

The difference between RFC 3986 and WHATWG URL stems from the view of a maintainer of the latter specification that web servers may legitimately choose to consider encoded and decoded paths distinct, and a standard cannot force them not to do so. This is a substantial BC break compared to RFC 3986, and judging by the large number of tickets related to this question, it is actually a source of confusion among users of the WHATWG URL specification.
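
The two notions of equivalence can be contrasted with a simplified sketch. All function names below are hypothetical, and the comparison works on plain strings: it assumes that both inputs were already parsed and recomposed, and it deliberately leaves out case normalization of the scheme and host as well as dot-segment removal. It is only meant to show the effect of percent-encoded unreserved characters and of disregarding the fragment.

```php
<?php

// Hypothetical helpers for illustration only; they operate on URI strings
// that are assumed to be already parsed and recomposed.

function withoutFragment(string $uri): string
{
    $pos = strpos($uri, '#');

    return $pos === false ? $uri : substr($uri, 0, $pos);
}

// Decodes percent-encoded unreserved characters, and uppercases the
// hexadecimal digits of all other percent-encoded octets.
function decodeUnreservedOctets(string $uri): string
{
    return (string) preg_replace_callback(
        '/%[0-9A-Fa-f]{2}/',
        static function (array $m): string {
            $char = rawurldecode($m[0]);

            return preg_match('/\A[A-Za-z0-9._~-]\z/', $char) === 1 ? $char : strtoupper($m[0]);
        },
        $uri
    );
}

// RFC 3986 style: "e" and "%65" are considered equivalent
function equalsRfc3986Style(string $a, string $b): bool
{
    return decodeUnreservedOctets(withoutFragment($a)) === decodeUnreservedOctets(withoutFragment($b));
}

// WHATWG URL style: the recomposed strings are compared as they are
function equalsWhatWgStyle(string $a, string $b): bool
{
    return withoutFragment($a) === withoutFragment($b);
}

var_dump(equalsRfc3986Style('https://example.com/test', 'https://example.com/t%65st#top')); // bool(true)
var_dump(equalsWhatWgStyle('https://example.com/test', 'https://example.com/t%65st#top'));  // bool(false)
```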

Unicode & IDNA

IDNA (Internationalized Domain Names in Applications) allows people around the world to register domain names in their native languages and scripts. This is made possible by encoding Unicode characters using the Punycode transcription.

RFC 3986 supports neither IDNA nor non-ASCII characters. WHATWG URL supports IDNA and Unicode characters, and it explicitly suggests that browsers should render the host component by displaying Unicode characters.

The recommendation is not just for user-friendliness: it's necessary for security reasons, alleviating the human risk factor in exploits. E.g. “xn--google.com” could deceive the uninitiated reader into believing that it is a Google domain, while in fact the IDNA domain decodes to “䕮䕵䕶䕱.com”.

Proposal

A new, always available URI extension is to be added to the standard library. The extension would support parsing, validating, modifying, and recomposing URIs, as well as resolving references based on both RFC 3986 and the WHATWG URL specifications. For this purpose, the following internal classes and methods are added:

namespace Uri {
    class UriException extends \Exception
    {
    }
 
    class InvalidUriException extends \Uri\UriException
    {
    }
 
    enum UriComparisonMode
    {
        case IncludeFragment;
        case ExcludeFragment;
    }
}

namespace Uri\Rfc3986 {
    final readonly class Uri
    {
        public static function parse(string $uri, ?Uri $baseUrl = null): ?static {}
 
        /** @throws Uri\InvalidUriException */
        public function __construct(string $uri, ?Uri $baseUrl = null) {}
 
        public function getScheme(): ?string {}
 
        public function getRawScheme(): ?string {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function withScheme(?string $scheme): static {}
 
        public function getUserInfo(): ?string {}
 
        public function getRawUserInfo(): ?string {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function withUserInfo(#[\SensitiveParameter] ?string $userInfo): static {}

        public function getUsername(): ?string {}
 
        public function getRawUsername(): ?string {}
 
        public function getPassword(): ?string {}
 
        public function getRawPassword(): ?string {}
 
        public function getHost(): ?string {}
 
        public function getRawHost(): ?string {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function withHost(?string $host): static {}
 
        public function getPort(): ?int {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function withPort(?int $port): static {}
 
        public function getPath(): string {}
 
        public function getRawPath(): string {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function withPath(string $path): static {}
 
        public function getQuery(): ?string {}
 
        public function getRawQuery(): ?string {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function withQuery(?string $query): static {}
 
        public function getFragment(): ?string {}
 
        public function getRawFragment(): ?string {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function withFragment(?string $fragment): static {}
 
        public function equals(Uri $uri, \Uri\UriComparisonMode $comparisonMode = \Uri\UriComparisonMode::ExcludeFragment): bool {}
 
        public function toString(): string {}
 
        public function toRawString(): string {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function resolve(string $uri): static {}
 
        /**
         * @throws Exception
         */
        public function __serialize(): array {}
 
        /**
         * @throws Exception
         */
        public function __unserialize(array $data): void {}
 
        public function __debugInfo(): array {}
    }
}

namespace Uri\WhatWg {
    final readonly class Url
    {
        /** @param array<int, UrlValidationError> $errors */
        public static function parse(string $uri, ?Url $baseUrl = null, &$errors = null): ?static {}
 
        /**
         * @param array<int, UrlValidationError> $softErrors
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function __construct(string $uri, ?Url $baseUrl = null, &$softErrors = null) {}
 
        public function getScheme(): string {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function withScheme(string $scheme): static {}
 
        public function getUsername(): ?string {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function withUsername(?string $username): static {}
 
        public function getPassword(): ?string {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function withPassword(#[\SensitiveParameter] ?string $password): static {}

        public function getAsciiHost(): ?string {}
 
        public function getUnicodeHost(): ?string {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function withHost(?string $host): static {}
 
        public function getPort(): ?int {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function withPort(?int $port): static {}
 
        public function getPath(): string {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function withPath(string $path): static {}
 
        public function getQuery(): ?string {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function withQuery(?string $query): static {}
 
        public function getFragment(): ?string {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function withFragment(?string $fragment): static {}
 
        public function equals(Url $url, \Uri\UriComparisonMode $comparisonMode = \Uri\UriComparisonMode::ExcludeFragment): bool {}
 
        public function toAsciiString(): string {}
 
        public function toUnicodeString(): string {}
 
        /**
         * @param array<int, UrlValidationError> $softErrors
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function resolve(string $uri, &$softErrors = null): static {}
 
        /**
         * @throws Exception
         */
        public function __serialize(): array {}
 
        /**
         * @throws Exception
         */
        public function __unserialize(array $data): void {}
 
        public function __debugInfo(): array {}
    }
 
    enum UrlValidationErrorType {
        case DomainToAscii;
        case DomainToUnicode;
        case DomainInvalidCodePoint;
        case HostInvalidCodePoint;
        case Ipv4EmptyPart;
        case Ipv4TooManyParts;
        case Ipv4NonNumericPart;
        case Ipv4NonDecimalPart;
        case Ipv4OutOfRangePart;
        case Ipv6Unclosed;
        case Ipv6InvalidCompression;
        case Ipv6TooManyPieces;
        case Ipv6MultipleCompression;
        case Ipv6InvalidCodePoint;
        case Ipv6TooFewPieces;
        case Ipv4InIpv6TooManyPieces;
        case Ipv4InIpv6InvalidCodePoint;
        case Ipv4InIpv6OutOfRangePart;
        case Ipv4InIpv6TooFewParts;
        case InvalidUrlUnit;
        case SpecialSchemeMissingFollowingSolidus;
        case MissingSchemeNonRelativeUrl;
        case InvalidReverseSolidus;
        case InvalidCredentials;
        case HostMissing;
        case PortOutOfRange;
        case PortInvalid;
        case FileInvalidWindowsDriveLetter;
        case FileInvalidWindowsDriveLetterHost;
    }
 
    final readonly class UrlValidationError
    {
        public string $context;
        public UrlValidationErrorType $type;
        public bool $failure;
 
        public function __construct(string $context, UrlValidationErrorType $type, bool $failure) {}
    }
 
    class InvalidUrlException extends \Uri\InvalidUriException
    {
        /** @var array<int, UrlValidationError> */
        public readonly array $errors;
    }
}

API Design

First and foremost, the new URI parsing API contains two URI implementations: Uri\Rfc3986\Uri and Uri\WhatWg\Url, representing RFC 3986 URIs and WHATWG URLs, respectively. Having separate classes for the two specifications makes it possible to properly model URIs with all their details and nuances. In fact, wrong assumptions about the origin of a URI could cause a security vulnerability, as Daniel Stenberg (author of cURL) writes in one of his blog posts. That's why it's important to explicitly express which specification is used, at least in security-sensitive applications.

Parsing

Both built-in URI implementations are readonly classes, and support parsing URI strings via two methods:

the constructor: It expects a URI, and optionally, a base URL in order to support reference resolution. When parsing is unsuccessful, a Uri\InvalidUriException is thrown. The WHATWG implementation throws a Uri\WhatWg\InvalidUrlException containing the individual errors in the $errors property as an array of Uri\WhatWg\UrlValidationError instances.
a parse() factory method: It expects the same parameters as the constructor does, but when parsing is unsuccessful, null is returned instead of throwing an exception. Using this method is recommended for validating URIs and/or parsing URIs from untrusted input.

$uri = new Uri\Rfc3986\Uri("https://example.com");          // An RFC 3986 URI instance is created
$uri = Uri\Rfc3986\Uri::parse("https://example.com");       // An RFC 3986 URI instance is created
 
$uri = new Uri\Rfc3986\Uri("invalid uri");                  // Throws Uri\InvalidUriException
$uri = Uri\Rfc3986\Uri::parse("invalid uri");               // null is returned in case of an invalid URI
 
$url = new Uri\WhatWg\Url("https://example.com");           // A WHATWG URL instance is created
$url = Uri\WhatWg\Url::parse("https://example.com");        // A WHATWG URL instance is created
 
$url = new Uri\WhatWg\Url("invalid url");                   // Throws Uri\WhatWg\InvalidUrlException
 
$errors = [];
$url = Uri\WhatWg\Url::parse("invalid url", null, $errors); // null is returned, and an array of UrlValidationError objects is passed by reference to $errors

As can be seen, Uri\WhatWg\Url can pass additional information about the triggered validation errors, as specified by WHATWG URL. In the example above, the $errors variable and the $errors property of the thrown Uri\WhatWg\InvalidUrlException will contain the following value:

array(1) {
  [0]=>
  object(Uri\WhatWg\UrlValidationError)#1 (3) {
    ["context"]=>
    string(11) "invalid url"
    ["type"]=>
    enum(Uri\WhatWg\UrlValidationErrorType::MissingSchemeNonRelativeUrl)
    ["failure"]=>
    bool(true)
  }
}

The $context property refers to the substring where the error happened, while the $type property is a Uri\WhatWg\UrlValidationErrorType enum storing the exact cause of the error. Last, the $failure property stores whether the error caused a failure, or processing could continue. Therefore, the true value refers to a hard error, while the false value means a soft error.

When trying to instantiate a WHATWG Url via its constructor, a Uri\WhatWg\InvalidUrlException is thrown when parsing results in a failure. In this case, the $errors property will contain an array of Uri\WhatWg\UrlValidationError instances. When parsing is successful, but soft errors were triggered, an array of Uri\WhatWg\UrlValidationError will be passed by reference to the $softErrors parameter.

When trying to instantiate a WHATWG Url via its parse() method, a null return value indicates that parsing results in a failure. In this case, the $errors by-ref parameter will contain an array of Uri\WhatWg\UrlValidationError instances. When parsing is successful, but soft errors were triggered, the $errors by-ref parameter will contain an array of Uri\WhatWg\UrlValidationError instances referring to only soft errors. The following example demonstrates how a soft error is triggered:

// Soft error due to the leading " " character when using the parse() method
$errors = [];
 
$url = Uri\WhatWg\Url::parse(" https://example.org", null, $errors);
echo $url->toAsciiString();                       // https://example.org
var_dump($errors[0]->type);                       // enum(Uri\WhatWg\UrlValidationErrorType::InvalidUrlUnit)
 
// Soft error due to the leading " " character when using the constructor
$softErrors = [];
 
$url = new Uri\WhatWg\Url(" https://example.org", null, $softErrors);
echo $url->toAsciiString();                       // https://example.org
var_dump($softErrors[0]->type);                   // enum(Uri\WhatWg\UrlValidationErrorType::InvalidUrlUnit)
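
Since the same array may carry both hard and soft errors, callers may want to separate them based on the $failure property. The following sketch shows one possible way to do so. The partitionValidationErrors() function is a hypothetical helper, and the stand-in objects (including their context values) are made up so that the example runs without the proposed extension; real code would pass the array of Uri\WhatWg\UrlValidationError instances instead.

```php
<?php

// Hypothetical helper: splits a list of validation errors into hard and soft
// ones. It only reads the public "failure" property, so any object exposing
// that property can be used here.
function partitionValidationErrors(array $errors): array
{
    $hard = [];
    $soft = [];

    foreach ($errors as $error) {
        if ($error->failure) {
            $hard[] = $error;   // parsing could not continue
        } else {
            $soft[] = $error;   // parsing continued despite the error
        }
    }

    return ['hard' => $hard, 'soft' => $soft];
}

// Made-up stand-ins for Uri\WhatWg\UrlValidationError instances
$errors = [
    (object) ['context' => ' https://example.org', 'type' => 'InvalidUrlUnit', 'failure' => false],
    (object) ['context' => 'invalid url', 'type' => 'MissingSchemeNonRelativeUrl', 'failure' => true],
];

$result = partitionValidationErrors($errors);
echo count($result['hard']), ' hard, ', count($result['soft']), " soft\n"; // 1 hard, 1 soft
```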

Even though pass by reference is not a very desirable language construct, it is actually the least bad option to use with WHATWG errors which can happen even when parsing is successful. As PHP doesn't have native support for monads, reimplementing something similar in advance would be an unwise choice (e.g. a ParsingResult interface with three implementations: Success, PartialSuccess, Error).

If successful parsing and errors were mutually exclusive, then it would be possible to make the method return either a Uri\WhatWg\Url in case of success, or an array of Uri\WhatWg\UrlValidationErrors in case of failure, but since it's not the case, we had to reject the idea.

Reference resolution

Primarily, reference resolution is implemented via the $baseUrl parameter of the constructor and parse(). If the argument has a non-null value, and the $uri parameter is a relative URI, then $uri is resolved against $baseUrl.

$baseRfc3986Url = new Uri\Rfc3986\Uri("https://example.com");
 
$uri = new Uri\Rfc3986\Uri("/foo", $baseRfc3986Url);
echo $uri->toString();                                                // https://example.com/foo
 
$uri = new Uri\Rfc3986\Uri("https://test.com/foo", $baseRfc3986Url);
echo $uri->toString();                                                // https://test.com/foo
 
$baseWhatWgUrl = new Uri\WhatWg\Url("https://example.com");
 
$url = Uri\WhatWg\Url::parse("/foo", $baseWhatWgUrl);
echo $url->toAsciiString();                                           // https://example.com/foo
 
$url = Uri\WhatWg\Url::parse("https://test.com/foo", $baseWhatWgUrl);
echo $url->toAsciiString();                                           // https://test.com/foo

Additionally, URIs support a resolve() method that is able to resolve potentially relative URI strings with the current object as the base URL:

$uri = new Uri\Rfc3986\Uri("https://example.com");
echo $uri->resolve("/foo")->toString();                 //  https://example.com/foo
 
$url = new Uri\WhatWg\Url("https://example.com");
echo $url->resolve("/foo")->toAsciiString();            //  https://example.com/foo

This method is a shorthand for new (get_class($uri))("/foo", $uri) for RFC 3986, and new (get_class($url))("/foo", $url, $softErrors) for WHATWG URL.

Component retrieval

The individual URI components can be retrieved via getters. While property hooks and/or asymmetric visibility could be a modern replacement for getters, the RFC still chooses the more conservative getter-based approach because URI components may be available in multiple forms in order to best serve the vastly different needs users may have.

Supported representations

Most RFC 3986 URI components can be retrieved in two formats:

“raw” representation: It's how components are natively represented by the URI parser without any post-processing after parsing.
“normalized-decoded” representation: The URI is normalized (when applicable), and components are percent-decoded.

The “raw” representation is very straightforward and doesn't need much explanation: it reflects components as closely to their original form as possible. That's why it is mostly suitable for use cases where one has to work with RFC 3986 URIs opaquely. API clients and signers usually fall into this category, as they want to avoid introducing any unnecessary changes to URIs that could cause subtle bugs.

On the other hand, the “normalized-decoded” representation is useful in a whole lot of other cases, including application routers and HTTP cache implementations. This representation should be used when one wants to make sure that URI components are in their most canonical form. For example, in application routers, all URIs that represent the same resource should be routed to the same controller action: both https://example.com/test and https://example.com/t%65st should trigger the same piece of code, otherwise the application may fail to serve some traffic using a slightly abnormal URI for any reason.

The “normalized-decoded” form should do post-processing in such a way that the result can still be safely used for modification of the same component of another valid URI (the data is “roundtripable”):

$uri1 = new Uri\Rfc3986\Uri("HTTPS://example.com");
$uri2 = new Uri\Rfc3986\Uri("http://test.com");
 
// The scheme of $uri2 is successfully modified with the
// "normalized-decoded" representation of the scheme of $uri1
$uri2 = $uri2->withScheme($uri1->getScheme());

This attribute is important for usability - it would be inconvenient to always do additional checks when the “normalized-decoded” representation is used for building or modifying an RFC 3986 URI.

The “normalized-decoded” representation also guarantees equivalence with the “raw” representation of the same component of the same RFC 3986 URI because the equivalence semantics of the specification are compatible with the “normalized-decoded” representation.

Unlike Uri\Rfc3986\Uri, Uri\WhatWg\Url only supports the “raw” representation because WHATWG URL doesn't specify percent-decoding rules for most components. More information and reasoning for not implementing any custom logic for percent-decoding is available in the following discussion thread: https://externals.io/message/123997#127102.

Basic examples

Given the https://%61pple:p%61ss@ex%61mple.com:433/foob%61r?%61bc=%61bc#%61bc URI (the percent-encoded variant of https://apple:pass@example.com:433/foobar?abc=abc#abc), let's see how the individual components can be represented in case of Uri\Rfc3986\Uri:

$uri = new Uri\Rfc3986\Uri("https://%61pple:p%61ss@ex%61mple.com:433/foob%61r?%61bc=%61bc#%61bc");
 
echo $uri->getRawScheme();                       // https
echo $uri->getScheme();                          // https
 
echo $uri->getRawUserInfo();                     // %61pple:p%61ss
echo $uri->getUserInfo();                        // apple:pass
 
echo $uri->getRawUsername();                     // %61pple
echo $uri->getUsername();                        // apple
 
echo $uri->getRawPassword();                     // p%61ss
echo $uri->getPassword();                        // pass
 
echo $uri->getRawHost();                         // ex%61mple.com
echo $uri->getHost();                            // example.com
 
echo $uri->getPort();                            // 433
 
echo $uri->getRawPath();                         // /foob%61r
echo $uri->getPath();                            // /foobar
 
echo $uri->getRawQuery();                        // %61bc=%61bc
echo $uri->getQuery();                           // abc=abc
 
echo $uri->getRawFragment();                     // %61bc
echo $uri->getFragment();                        // abc

Let's have a look at another example which involves normalization:

$uri = new Uri\Rfc3986\Uri("HTTPS://EXAMPLE.COM/foo/../bar/");
 
echo $uri->getRawScheme();                       // HTTPS
echo $uri->getScheme();                          // https
 
echo $uri->getRawHost();                         // EXAMPLE.COM
echo $uri->getHost();                            // example.com
 
echo $uri->getRawPath();                         // /foo/../bar/
echo $uri->getPath();                            // /bar/

In case of Uri\WhatWg\Url, we'll get the following results for the first example:

$url = new Uri\WhatWg\Url("HTTPS://%61pple:p%61ss@ex%61mple.com:433/foob%61r?%61bc=%61bc#%61bc");

// $url->getRawScheme() does not exist, because WHATWG URL always normalizes the scheme
echo $url->getScheme();                          // https

// $url->getRawUsername() does not exist, because WHATWG URL doesn't specify percent-decoding for this component
echo $url->getUsername();                        // %61pple

// $url->getRawPassword() does not exist, because WHATWG URL doesn't specify percent-decoding for this component
echo $url->getPassword();                        // p%61ss

// $url->getRawHost() does not exist, because WHATWG URL always normalizes the host
echo $url->getAsciiHost();                       // example.com
echo $url->getUnicodeHost();                     // example.com

echo $url->getPort();                            // 433

// $url->getRawPath() does not exist, because WHATWG URL doesn't specify percent-decoding for this component
echo $url->getPath();                            // /foob%61r

// $url->getRawQuery() does not exist, because WHATWG URL doesn't specify percent-decoding for this component
echo $url->getQuery();                           // %61bc=%61bc

// $url->getRawFragment() does not exist, because WHATWG URL doesn't specify percent-decoding for this component
echo $url->getFragment();                        // %61bc

The results differ from those of Uri\Rfc3986\Uri in a few ways. Most notably, no getters with a “raw” prefix exist. This is because WHATWG URL doesn't specify percent-decoding rules for most components, so only one representation is available for them. For the rest of the components (scheme and host), normalization - including percent-decoding - is done at parse time, so these components also have a single representation.

If WHATWG URL wasn't a living specification, we could assume that the “normalized-decoded” representation was available for the scheme and host components (getScheme(), getHost()), while the rest of the components had a single “raw” representation (getRawUsername(), getRawPassword(), getRawPath(), getRawQuery(), getRawFragment()). But since the WHATWG URL specification is subject to constant updates, it's possible that normalization or percent-decoding rules change in the future. That's why getter names don't reflect any currently valid assumptions about the representation of WHATWG URL components. More information and reasoning for not implementing any custom logic for percent-decoding is available in the following discussion thread: https://externals.io/message/123997#127102.

In addition to the getAsciiHost() method that returns the host component as ASCII characters, the getUnicodeHost() method comes in handy to retrieve the Unicode (IDNA) form of the host, which is best suited for display:

$url = new Uri\WhatWg\Url("https://🐘.com");
echo $url->getAsciiHost();                         // xn--go8h.com
echo $url->getUnicodeHost();                       // 🐘.com

Getters of Uri\WhatWg\Url have a few gotchas for those who are familiar with the WHATWG URL specification: they don't (entirely) follow the “getter steps” defined by the specification. Instead, the individual components are returned directly, without the additional changes that the “getter steps” would otherwise apply. This choice has a few consequences:

Most Uri\WhatWg\Url getters are nullable, as most WHATWG URL components are also nullable. However, most “getter steps” convert null to an empty string (e.g. the search getter steps say the following: “If this’s URL’s query is either null or the empty string, then return the empty string”).
Uri\WhatWg\Url::getScheme() doesn't return the trailing “:” character as opposed to how WHATWG URL specifies the protocol getter steps (i.e. “https” is returned, rather than “https:”).
Uri\WhatWg\Url::getQuery() doesn't return the leading “?” character as opposed to how WHATWG URL specifies the search getter steps (i.e. “foo=bar” is returned, rather than “?foo=bar”).
Uri\WhatWg\Url::getFragment() doesn't return the leading “#” character as opposed to how WHATWG URL specifies the hash getter steps (i.e. “foo” is returned, rather than “#foo”).

These deviations from the WHATWG URL “getter steps” are necessary in order to have at least a baseline compatibility between the getters of Uri\WhatWg\Url and Uri\Rfc3986\Uri, even if the implementation details (e.g. percent-decoding rules) differ. The rough compatibility is required to add support for a very basic internal API that is needed for example to prevent the parsing confusion vulnerability.
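
For readers who need the values that the WHATWG “getter steps” would produce, the differences listed above can be bridged with a few lines of userland code. The following is only a sketch: the three function names are hypothetical and not part of the proposal, and they operate on plain component values such as those returned by getScheme(), getQuery(), and getFragment().

```php
<?php
// Hypothetical helpers (not part of the proposed API): they rebuild the values
// that the WHATWG "getter steps" would produce from plain component values.

function whatwgProtocol(string $scheme): string
{
    // The protocol getter steps return the scheme followed by ":".
    return $scheme . ':';
}

function whatwgSearch(?string $query): string
{
    // The search getter steps return "" for a null or empty query,
    // otherwise "?" followed by the query.
    return ($query === null || $query === '') ? '' : '?' . $query;
}

function whatwgHash(?string $fragment): string
{
    // The hash getter steps work the same way, with a leading "#".
    return ($fragment === null || $fragment === '') ? '' : '#' . $fragment;
}

echo whatwgProtocol('https'), "\n";   // https:
echo whatwgSearch('foo=bar'), "\n";   // ?foo=bar
echo whatwgHash('foo'), "\n";         // #foo
```

Keeping this conversion in userland leaves the built-in getters free to return null, which is what makes them roughly comparable with the getters of Uri\Rfc3986\Uri.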

Advanced examples

There is an edge-case which needs to be highlighted with more examples: it's the percent-decoding of reserved characters.

$uri = new Uri\Rfc3986\Uri("https://example.com/foo/bar%2Fbaz");
 
echo $uri->getPath();                            // /foo/bar%2Fbaz
echo $uri->getRawPath();                         // /foo/bar%2Fbaz

In the example above, the second path segment contains %2F, which is the percent-encoded form of the / character. But why does it have to be percent-encoded at all? It's because there is a semantic difference between / and %2F in the path: the former separates the individual path segments, while %2F means that the respective / is part of a single segment (using the percent-decoded form would be ambiguous). Therefore the example URI has two path segments: “foo” and “bar/baz”.
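
The difference between / and %2F can be illustrated with a small userland sketch. The helper name below is hypothetical and the code uses only rawurldecode() and explode(), not the proposed classes: it splits the raw path into segments first and decodes each segment afterwards, which is the only order that preserves the segment boundaries.

```php
<?php
// Hypothetical helper: splits a raw (still percent-encoded) path into segments
// first, and only then percent-decodes each segment.
function decodedPathSegments(string $rawPath): array
{
    // Drop the single leading "/" so that it doesn't yield an empty first segment.
    $path = str_starts_with($rawPath, '/') ? substr($rawPath, 1) : $rawPath;

    return array_map('rawurldecode', explode('/', $path));
}

// Splitting before decoding keeps two segments: "foo" and "bar/baz".
print_r(decodedPathSegments('/foo/bar%2Fbaz'));

// Decoding before splitting loses the distinction and yields three segments.
print_r(explode('/', substr(rawurldecode('/foo/bar%2Fbaz'), 1)));
```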

Let's also have a look at another tricky example:

$uri = new Uri\Rfc3986\Uri("https://[2001:0db8:0001:0000:0000:0ab9:C0A8:0102]/foo/bar%3Fbaz?foo=bar%26baz%3Dqux");
 
echo $uri->getRawHost();                        // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
echo $uri->getHost();                           // [2001:0db8:0001:0000:0000:0ab9:c0a8:0102]
echo $uri->getRawPath();                        // /foo/bar%3Fbaz
echo $uri->getPath();                           // /foo/bar%3Fbaz
echo $uri->getRawQuery();                       // foo=bar%26baz%3Dqux
echo $uri->getQuery();                          // foo=bar%26baz%3Dqux

What happens here? The host is an IPv6 address, which must be enclosed within a [] pair. The path contains a “%3F” (percent-encoded form of “?”), while the query string contains a “%26” (percent-encoded form of “&”) and a “%3D” (percent-encoded form of “=”). These are all reserved characters whose percent-decoding is disallowed by RFC 3986, so not even the “normalized-decoded” representation can convert them.

For reference, Uri\WhatWg\Url produces a similar result, although the IPv6 address is normalized, and as mentioned above, some of the methods are not available:

$url = new Uri\WhatWg\Url("https://[2001:0db8:0001:0000:0000:0ab9:C0A8:0102]/foo/bar%3Fbaz?foo=bar%26baz%3Dqux");
 
echo $url->getAsciiHost();                      // [2001:db8:1::ab9:c0a8:102]
echo $url->getPath();                           // /foo/bar%3Fbaz
echo $url->getQuery();                          // foo=bar%26baz%3Dqux
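
The kind of IPv6 normalization seen in the host above (lowercasing, dropping leading zeros, compressing the longest run of zero groups) can be approximated in userland. The sketch below is only an approximation under stated assumptions: it uses inet_pton() and inet_ntop() as a stand-in, the helper name is hypothetical, and it is not what the proposed classes do internally.

```php
<?php
// Approximation only: inet_pton()/inet_ntop() serve as a stand-in for the
// WHATWG IPv6 serializer. The helper name is hypothetical.
function compressIpv6(string $address): ?string
{
    $packed = @inet_pton($address);   // suppress the warning for invalid input
    if ($packed === false || strlen($packed) !== 16) {
        return null;                  // not a valid IPv6 address
    }

    return inet_ntop($packed) ?: null;
}

echo compressIpv6('2001:0db8:0001:0000:0000:0ab9:C0A8:0102'), "\n";   // 2001:db8:1::ab9:c0a8:102
```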

Component modification

Immutable modification of the individual URI components is possible via “wither” methods. Let's see a very basic example first for modifying and retrieving the host URI component:

$uri1 = new Uri\Rfc3986\Uri("https://example.com");
$uri2 = $uri1->withHost("test.com");
 
echo $uri1->getHost();                             // example.com
echo $uri2->getHost();                             // test.com

The above example demonstrates that withers create a new instance for each modification, leaving the original object intact. However, an exception is thrown if a modification resulted in an invalid URI. This way, URIs always stay valid:

$uri = new Uri\Rfc3986\Uri("https://example.com");
$uri->withHost("/");                               // Throws Uri\InvalidUriException

Withers of Uri\WhatWg\Url follow the relevant “setter steps” that are defined by WHATWG URL. Unfortunately, these algorithms sometimes have surprising behavior where modification fails silently, and the original values are kept. For example:

$url = new Uri\WhatWg\Url("https://example.com");
$url = $url->withHost("2001:db8:0:0:0:0:0:1");    // invalid IPv6 host, but no exception is triggered and the original host is kept
 
echo $url->getAsciiHost();                        // example.com

Even though this RFC acknowledges the fact that the WHATWG URL “setter steps” have gotchas, it doesn't try to prevent them, as doing so would violate the specification.

Another thing to keep in mind is that WHATWG URL is inconsistent with itself when it comes to the naming of the “setter steps” of the “host” component: the host setter steps can also modify the “port” component, while the “hostname setter steps” are the ones that are exclusively related to the “host” component.

That's why Uri\WhatWg\Url::withHost() could also have been named Uri\WhatWg\Url::withHostname() to be more in line with the terminology used by WHATWG URL. After some consideration, the method name remained Uri\WhatWg\Url::withHost(), because it's the most consistent one with the host getters (Uri\WhatWg\Url::getAsciiHost() and Uri\WhatWg\Url::getUnicodeHost()).

It's also important to know how withers handle percent-encoding and reserved characters. WHATWG explicitly declares an algorithm for modifying components, which is discussed in the https://url.spec.whatwg.org/#url-class section (listed below as “setter steps”). But generally speaking, withers accept URI components in percent-encoded form:

$url = new Uri\WhatWg\Url("https://example.com");
$url = $url->withUsername("use%72%3A");            // percent-encoded form of "user:"
 
echo $url->getUsername();                       // use%72%3A

Additionally, the WHATWG URL algorithm automatically percent-encodes characters that fall into the percent-encoding character set of a component (i.e. “?” and “#” for the path in the example below) as well as any invalid characters, as discussed in the Percent-encoding & decoding section:

$url = new Uri\WhatWg\Url("https://example.com");
$url = $url->withPath("/?#:");
 
echo $url->getPath();                          // /%3F%23:

The wither invocation above triggers a Uri\WhatWg\UrlValidationErrorType::InvalidUrlUnit (soft) validation error, because the “#” character is not a URL code point. This error is not exposed to userland, however: withers would need a $softErrors parameter similar to the one that Uri\WhatWg\Url::__construct() has in order to expose such soft errors.

Let's see another special case where the input contains delimiters:

$url = new Uri\WhatWg\Url("https://example.com/");
$url = $url->withQuery("?foo");
$url = $url->withFragment("#bar");
 
echo $url->getQuery();                          // foo
echo $url->getFragment();                       // bar

The above example makes it clear that withers optionally accept leading delimiter characters (“?” for the query, and “#” for the fragment), even if they are not returned by the relevant getters. This is possible because “?” doesn't have any special meaning in the context of the query component, and neither does “#” in the context of the fragment.

All the above examples showed how Uri\WhatWg\Url handles percent-encoding during component modification, but Uri\Rfc3986\Uri has not been discussed yet. It's because RFC 3986 doesn't have an explicitly specified algorithm for component modification. That's why the present RFC has to set the rules:

In order to offer consistent behavior with the parsing rules of RFC 3986, withers of Uri\Rfc3986\Uri also only accept properly formatted input, meaning characters that are not allowed to be present in a component must be percent-encoded. Let's see what this means in practice through the following example:

$uri = new Uri\Rfc3986\Uri("https://example.com/");
$uri = $uri->withQuery("foo%5B%5D=%23");           // percent-encoded form of "foo[]=#"
 
echo $uri->getRawQuery();                          // foo%5B%5D=%23
echo $uri->getQuery();                             // foo%5B%5D=%23

Characters “[”, “]”, and “#” are disallowed in the query component, therefore they have to be percent-encoded, otherwise a Uri\InvalidUriException will be thrown. Since the characters in question are reserved, they cannot be normalized (and therefore percent-decoded), so even the “normalized-decoded” representation of “[]” is “%5B%5D”. This is also important to achieve the “roundtripable data” property.
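
Since the withers of Uri\Rfc3986\Uri expect already percent-encoded input, callers have to encode raw data themselves. The following is a minimal sketch of one way to do this, assuming a hypothetical buildRfc3986Query() helper that is not part of the proposal. It relies only on rawurlencode(), which percent-encodes every character except letters, digits, and “-”, “_”, “.”, “~”.

```php
<?php
// Hypothetical helper: percent-encodes keys and values with rawurlencode() so
// that the result only contains characters allowed in an RFC 3986 query.
function buildRfc3986Query(array $pairs): string
{
    $parts = [];
    foreach ($pairs as $key => $value) {
        $parts[] = rawurlencode((string) $key) . '=' . rawurlencode((string) $value);
    }

    return implode('&', $parts);
}

echo buildRfc3986Query(['foo[]' => '#']), "\n";   // foo%5B%5D=%23
```

The resulting string has the same form as the argument passed to withQuery() in the example above.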

Unlike how WHATWG URLs work, RFC 3986 URIs don't allow leading delimiter characters during component modification simply because it's the most consistent with the more “rigid” parsing rules of RFC 3986:

$uri = new Uri\Rfc3986\Uri("https://example.com");
$uri = $uri->withQuery("?foo");                    // throws Uri\InvalidUriException
$uri = $uri->withFragment("#bar");                 // throws Uri\InvalidUriException

Component recomposition

Besides accessors, URI implementations contain various “toString()” methods as well. They can be used for recomposing the URI components back to a string. Why are such methods necessary at all, instead of simply returning the input URI string? It's because URI parsers may have applied some modifications to the input during parsing. This is specifically the case for the WHATWG URL specification, since it mandates the usage of quite a few parse-time transformations.

Uri\WhatWg\Url has two “toString()” methods to provide both an ASCII format mostly suitable for machine processing and a Unicode format mostly suitable for display:

$url = new Uri\WhatWg\Url("HTTPS://////EXAMPLE.com");
echo $url->toAsciiString();                        // https://example.com/
 
$url = new Uri\WhatWg\Url("HTTPS://////你好你好.com");
echo $url->toAsciiString();                        // https://xn--6qqa088eba.com/
echo $url->toUnicodeString();                      // https://你好你好.com/

The toAsciiString() method recomposes the URI in a format which is most suitable for machine processing (host names using IDNA characters are translated to ASCII characters via punycode), while the toUnicodeString() method is a user-friendly representation that displays the host as a Unicode string.

As RFC 3986 doesn't support IDNA, its two “toString()” methods don't differentiate based on the target audience, but rather whether normalization is performed:

$uri = new Uri\Rfc3986\Uri("HTTPS://EXAMPLE.com");
echo $uri->toRawString();                             // HTTPS://EXAMPLE.com
 
$uri = new Uri\Rfc3986\Uri("HTTPS://EXAMPLE.com");
echo $uri->toString();                                // https://example.com

The Uri\Rfc3986\Uri::toRawString() returns the unnormalized URI string, while Uri\Rfc3986\Uri::toString() normalizes its return value.

Another example showcasing that Uri\Rfc3986\Uri doesn't support IDNA:

$uri = Uri\Rfc3986\Uri::parse("https://你好你好.com");
var_dump($uri);                                    // NULL
 
$uri = Uri\Rfc3986\Uri::parse("https://%e4%bd%a0%e5%a5%bd%e4%bd%a0%e5%a5%bd.com"); // percent-encoded form of https://你好你好.com
echo $uri->toString();                             // https://%E4%BD%A0%E5%A5%BD%E4%BD%A0%E5%A5%BD.com

Since WHATWG normalizes IPv6 addresses in the host component during parsing, only the normalized value is accessible after recomposition:

$url = new Uri\WhatWg\Url("https://[0:0::1]/");
 
echo $url->toAsciiString();                        // https://[::1]/

The attentive reader may have noticed that neither URI implementation contains a __toString() magic method. This is a deliberate design decision not to add this method to the built-in URI classes, as doing so could cause incorrect results when using equality comparison (==). Given the following example:

$uri = new Uri\Rfc3986\Uri("https://EXAMPLE.com");
 
var_dump($uri == 'HTTPS://example.com');

The output would be bool(false) if Uri\Rfc3986\Uri contained a __toString() method, because the $uri object would be automatically converted to its string representation (https://example.com), which would then be compared against HTTPS://example.com. However, the two URIs should indeed be equal as a result of normalization. Furthermore, equality of URIs disregards the fragment component by default, thus comparing against a https://example.com#foo URI would also incorrectly yield false in the example.
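
The pitfall can be reproduced today with any class that implements __toString(). The class below is a hypothetical stand-in written only for this illustration; it is not one of the proposed classes, and it simply stores an already normalized string.

```php
<?php
// Hypothetical stand-in class, only to show what "==" does when an object has
// a __toString() method. It is not one of the proposed URI classes.
final class StringableUriSketch
{
    public function __construct(private string $normalized) {}

    public function __toString(): string
    {
        return $this->normalized;
    }
}

$uri = new StringableUriSketch('https://example.com');

// The object is converted to a string, then a plain string comparison runs,
// so URIs that differ only in letter case are reported as different.
var_dump($uri == 'HTTPS://example.com');   // bool(false)
var_dump($uri == 'https://example.com');   // bool(true)
```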

Equivalence

The equals() method only accepts URI objects of the same specification, as it doesn't make sense to compare URIs of different standards. Then it normalizes (if applicable) and recomposes the URI represented by the object as well as the URI received in the argument list to a string, and checks whether the two strings match. By default, the fragment component is disregarded.

// An RFC 3986 URI equals another RFC 3986 URI that has the same string representation after normalization.
$uri = new Uri\Rfc3986\Uri("https://example.COM#foo");
$uri->equals(new Uri\Rfc3986\Uri("https://EXAMPLE.COM"));       // true
 
// The fragment component of Uri\Rfc3986\Uri can also be taken into account
$uri = new Uri\Rfc3986\Uri("https://example.com#foo");
$uri->equals(new Uri\Rfc3986\Uri("https://example.com"), Uri\UriComparisonMode::IncludeFragment); // false
 
// A WHATWG URL equals another WHATWG URL that has the same string representation
$url = new Uri\WhatWg\Url("https:////example.COM/");
$url->equals(new Uri\WhatWg\Url("https://EXAMPLE.COM"));       // true
 
// The fragment component of Uri\WhatWg\Url can also be taken into account
$url = new Uri\WhatWg\Url("https://example.com#foo");
$url->equals(new Uri\WhatWg\Url("https://example.com"), Uri\UriComparisonMode::ExcludeFragment); // true
 
// A URI cannot be compared against another URI of a different specification
$uri = new Uri\Rfc3986\Uri("https://example.com/");
$uri->equals(new Uri\WhatWg\Url("https://example.com/"));      // throws TypeError

It should be noted that the equals() method could also accept URI strings. It was a deliberate decision not to allow such arguments, because it would be unclear how the comparison works in this case: should the passed-in string also be normalized, or should an exact string match be performed? This is a question that doesn't have to be answered when only a URI object parameter type is supported.

The same question - combined with the fact that operator overloading is not supported in userland - led us not to overload the equality operator.

Cloning

Cloning of URIs is supported. Withers also clone the object they are invoked on before actually performing component modification in order to guarantee immutability.

$uri1 = new Uri\Rfc3986\Uri("https://example.com/foo/");
$uri2 = clone $uri1;                                       // creates a new Uri instance
 
$url1 = new Uri\WhatWg\Url("https://example.com/foo/");
$url2 = clone $url1;                                       // creates a new Url instance
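
The clone-then-modify pattern that withers follow can be sketched in userland as well. The class below is hypothetical and much simpler than the proposed implementation (it holds a single host string and performs no validation), but it shows why the original instance stays intact.

```php
<?php
// Hypothetical userland value object (not the proposed implementation) showing
// the clone-then-modify pattern that withers follow.
final class HostHolderSketch
{
    public function __construct(private string $host) {}

    public function getHost(): string
    {
        return $this->host;
    }

    public function withHost(string $host): static
    {
        $copy = clone $this;   // the original instance is never touched
        $copy->host = $host;

        return $copy;
    }
}

$first = new HostHolderSketch('example.com');
$second = $first->withHost('test.com');

echo $first->getHost(), "\n";    // example.com
echo $second->getHost(), "\n";   // test.com
```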

Serialization

Both built-in URI classes support serialization and deserialization - albeit a little bit differently than most classes usually do: the serialized form only includes the recomposed URI itself in its “raw” representation, exposed as the uri field, while the individual properties or URI components are not present. This approach keeps deserialization simple: it only has to perform regular URI string parsing, and doesn't need the individual URI components.

$uri = new Uri\Rfc3986\Uri("HTTPS://example.com/foo/bar");
echo serialize($uri);
 
// O:15:"Uri\Rfc3986\Uri":2:{i:0;a:1:{s:3:"uri";s:27:"HTTPS://example.com/foo/bar";}i:1;a:0:{}}

The same example with Uri\WhatWg\Url:

$url = new Uri\WhatWg\Url("HTTPS://example.com/foo/bar");
echo serialize($url);
 
// O:14:"Uri\WhatWg\Url":2:{i:0;a:1:{s:3:"uri";s:27:"https://example.com/foo/bar";}i:1;a:0:{}}

The approach of using the above mentioned “meta” field would normally require the $uri property to be reserved from usage by any 3rd party URI implementations and any possible child classes (should the internal URI classes become extensible in the future). In order to avoid this problem, the serialization format of ext/random is adopted, which uses two different “buckets” (arrays in practice): one for the “meta” field(s), and one for the possible properties (which is always empty for now). This way, it is not possible to have any property name collision.

// The following line throws Exception because of the missing "uri" field ("url" is used instead)
unserialize('O:15:"Uri\Rfc3986\Uri":2:{i:0;a:1:{s:3:"url";s:27:"HTTPS://example.com/foo/bar";}i:1;a:0:{}}');
 
// The following line throws Exception because the "uri" field has a wrong type
unserialize('O:14:"Uri\WhatWg\Url":2:{i:0;a:1:{s:3:"uri";i:1;}i:1;a:0:{}}');
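
The two-bucket layout can be imitated in userland with the __serialize() and __unserialize() magic methods. The following is only a sketch under stated assumptions: the class name, the exception type, and the error message are made up for the illustration and are not taken from the implementation.

```php
<?php
// Hypothetical userland class imitating the two-bucket layout described above.
// The class name and the exception details are assumptions made for this sketch.
final class SerializableUriSketch
{
    public function __construct(private string $uri) {}

    public function __serialize(): array
    {
        // Bucket 0 holds the "meta" field, bucket 1 the regular properties
        // (always empty here).
        return [['uri' => $this->uri], []];
    }

    public function __unserialize(array $data): void
    {
        $meta = $data[0] ?? null;
        if (!is_array($meta) || !is_string($meta['uri'] ?? null)) {
            throw new Exception('Invalid serialization data: missing or malformed "uri" field');
        }

        $this->uri = $meta['uri'];
    }

    public function toRawString(): string
    {
        return $this->uri;
    }
}

$serialized = serialize(new SerializableUriSketch('HTTPS://example.com/foo/bar'));
$restored = unserialize($serialized);

echo $restored->toRawString(), "\n";   // HTTPS://example.com/foo/bar
```

Because the “uri” key lives in its own bucket, a regular property named $uri could be added later without colliding with it.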

Debugging

The two built-in URI classes implement the __debugInfo() magic method in order to expose their internal state for debugging purposes:

$uri = new Uri\Rfc3986\Uri("https://example.com/foo/");
var_dump($uri);
 
/*
object(Uri\Rfc3986\Uri)#1 (9) {
  ["scheme"]=>
  string(5) "https"
  ["userinfo"]=>
  NULL
  ["username"]=>
  NULL
  ["password"]=>
  NULL
  ["host"]=>
  string(11) "example.com"
  ["port"]=>
  NULL
  ["path"]=>
  string(5) "/foo/"
  ["query"]=>
  NULL
  ["fragment"]=>
  NULL
}
*/

The method returns the “raw representation” of each component in order not to skew the natively stored values. The purpose of exposing the individual components rather than just the single recomposed string is to provide a deeper look into the anatomy of the URI. In some cases, it's not trivial to decide what value each component has. Just one example: one could naively assume that the “mailto:kocsismate@php.net” URI has a user(info) component of “kocsismate” and a hostname of “php.net”. The representation provided by __debugInfo() can quickly highlight that “kocsismate@php.net” is in fact the path.

When trying to understand how a URI is exactly composed, being able to see the individual components at once is very helpful. Without this, one would have to call the getters one by one to find out what value each of them has.

Even though the example above used Uri\Rfc3986\Uri, Uri\WhatWg\Url behaves the same way.

Exceptions

The proposal adds 3 new exceptions:

Uri\UriException: It's the base exception, and isn't directly thrown anywhere.
Uri\InvalidUriException: It extends Uri\UriException, and it's thrown when URI string parsing is unsuccessful in case of __construct(), all wither methods, and resolve().
Uri\WhatWg\InvalidUrlException: It extends Uri\InvalidUriException, and it's thrown by the WHATWG implementation so that additional error details can be added.

Pluggability

The capability provided by parse_url() is used for multiple purposes in the internal PHP source:

SoapClient::_doRequest(): parsing the $location parameter as well as the value of the Location header
FTP/FTPS stream wrapper: parse_url() is used for connecting to a URL, renaming a file, following the Location header
FILTER_VALIDATE_URL: validating URLs
SSL/TLS socket communication: parsing the target URL
GET/POST session: accepting the session ID from the query string, manipulating the output URL to automatically include the session ID (Deprecate GET/POST sessions RFC)

It would cause inconsistency and a security vulnerability if parsing of URI strings based on the two specifications referred to above were supported in userland, but the legacy parse_url() based behavior was kept internally without the possibility to use the new API. That's why the current RFC was designed with pluggability in mind.

Specifically, supported parser backends would have to be registered by using a similar method to how password hashing algorithms are registered. On the one hand, this approach makes it possible for 3rd party extensions to leverage URI parser backends other than the built-in ones (e.g. support for ADA URL could also be added). But more importantly, an internal “interface” for parsing and handling URIs is defined this way so that it now becomes possible to configure the used backend for each use-case. Please note that URI parser backend registration is only supported by internal code: registering custom userland implementations is not possible for now, mainly in order to prevent a possible new attack surface.

While it would sound natural to add a php.ini configuration option to configure the used parser backend globally, this option was rejected during the discussion period of the RFC because it would result in unsafe code that is controlled by global state: since any invoked piece of code can change the used parser backend, one should always check the current value of the config option before parsing URI strings (and in case of libraries, the original option should also be reset after usage). Instead, the RFC proposes to add the following configuration options that only affect a single use-case:

SoapClient::_doRequest(): a new optional $uriParserClass parameter is added accepting string or null arguments. Null represents the original (parse_url()) based method, while the new backends will be used when passing either Uri\Rfc3986\Uri::class or Uri\WhatWg\Url::class.
FTP/FTPS stream wrapper: a new uri_parser_class stream context option is added
FILTER_VALIDATE_URL: filter_* functions can be configured by passing a uri_parser_class key to the $options array
SSL/TLS socket communication: a new uri_parser_class stream context option is added
GET/POST session: since this feature is deprecated by the Deprecate GET/POST sessions RFC, no configuration is added.

There are certain file-handling functions that can already accept URIs as strings: these include file_get_contents(), file(), fopen(). As per the current proposal, the URI parser can be supplied in the $context parameter to these functions, but this approach is somewhat tedious, especially if the URI already had to be parsed previously (e.g. for validation purposes). Let's consider the following example:

$url = $_GET['url'];
validate_url($url);
 
$context = stream_context_create([
    "uri_parser_class" => \Uri\Rfc3986\Uri::class,
]);
$contents = file_get_contents($url, context: $context);

Even though there are other, much more convenient approaches, the current RFC still goes with the less ergonomic one, as going any other way would need more discussion, resulting in scope creep. The improvement possibilities include passing URI instances to the functions in question, or converting URIs to streams based on Java's example.

Parser Library Choice

Adding a WHATWG compliant URL parser to the standard library was originally attempted in 2023. The implementation used ADA URL parser as its parser backend which is known for its high performance. The proof of concept was ultimately abandoned due to some technical limitations that weren't possible to resolve.

Specifically, ADA is written in C++, and requires a compiler supporting at least C++17. Despite the fact that it has a C wrapper, its tight compiler requirements would make it unprecedented, and practically impossible, to add the URI extension to PHP as a required extension: PHP has never had a C++ compiler dependency for the always-enabled extensions, and only optional extensions (like Intl) can be written in C++.

The firm position of this RFC is that a URL parser extension should always be available, therefore a different parser backend written in pure C had to be found. Fortunately, Niels Dossche proposed PHP RFC: DOM HTML5 parsing and serialization not long after the experiment with ADA, and his work required bundling parts of the Lexbor browser engine. This library is written in C, and coincidentally contains a WHATWG compliant URL parsing submodule, which makes it suitable as the library of choice.

For parsing URIs according to RFC 3986, the uriparser library was chosen. It is a lightweight and fast C library with no dependencies. It uses the “new BSD license” which is compatible with the current PHP license as well as the PHP License Update RFC.

Performance Considerations

The implementation of parse_url() is optimized for performance. This also means that it doesn't deal with validation properly and disregards some edge cases. A fully standards-compliant parser will generally be slower than parse_url(), because it has to execute more code. Fortunately, this overhead is acceptable thanks to the efforts of the maintainers of the Lexbor and uriparser libraries.

According to the rough benchmarks performed on a Linux instance in GitHub Actions, the following results were measured:

Time of parsing of a basic URL (1000 times)

parse_url(): 0.000233 sec
Uri\Rfc3986\Uri: 0.000298 sec
Uri\WhatWg\Url: 0.000394 sec

Time of parsing of a complex URL (1000 times)

parse_url(): 0.000538 sec
Uri\Rfc3986\Uri: 0.000817 sec
Uri\WhatWg\Url: 0.000917 sec
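
The RFC doesn't include the script that produced these numbers. The following is only a rough sketch of how a comparable measurement could be taken: the timeParser() helper and the sample URL are assumptions, and the new classes are only measured when they are available in the running PHP build.

```php
<?php
// Rough timing sketch; the helper name and the sample URL are assumptions,
// not the script that produced the numbers above.
function timeParser(callable $parse, string $input, int $iterations = 1000): float
{
    $start = hrtime(true);                  // monotonic clock, in nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        $parse($input);
    }

    return (hrtime(true) - $start) / 1e9;   // elapsed time in seconds
}

$input = 'https://example.com/foo/bar?baz=qux#frag';

printf("parse_url(): %.6f sec\n", timeParser('parse_url', $input));

// The new classes are only measured when they exist in the running PHP build.
if (class_exists('Uri\Rfc3986\Uri')) {
    printf("Uri\\Rfc3986\\Uri: %.6f sec\n", timeParser(fn (string $u) => new Uri\Rfc3986\Uri($u), $input));
}
```

Absolute numbers from such a loop depend heavily on the machine and the build, so only the relative differences are meaningful.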

Discussion

The following sections give some additional context and explanation for the questions that had to be answered during the discussion phase of the RFC.

Naming considerations

After multiple iterations, the RFC settled on using the Uri\Rfc3986\Uri and the Uri\WhatWg\Url class names. By having different subnamespaces for the two specifications, it became possible to group together all the WHATWG related classes (Uri\WhatWg\UrlValidationErrorType, Uri\WhatWg\UrlValidationError). Additionally, the chosen class names (Uri and Url) try to disambiguate how the two specifications actually work:

RFC 3986 works with actual relative URIs which don't have a scheme
WHATWG can only work with URLs (URIs that have a scheme)

The additional benefit of using different class names is that there is no clash when both classes are imported into the same PHP file.

Why isn't a common URI interface supported?

PSR-7 UriInterface is currently the de-facto interface for representing URIs in userland. That's why it seemed a good candidate for adoption at first glance. However, the current RFC didn't pursue reusing it for the following reasons:

PSR-7 strictly follows the RFC 3986 standard, and therefore only has a notion of "userinfo", rather than "user" and "password" which is used by the WHATWG URL specification.
PSR-7's UriInterface has non-nullable method return types except for UriInterface::getPort(), whereas WHATWG URL specifically allows null values for the majority of the components.

As an alternative, the RFC attempted to define a new common URI interface (called Uri\Uri), but I had to realize late in the RFC process that the RFC 3986 and WHATWG URL specifications have so many smaller or bigger differences between them that a common URI interface is just not feasible to define. These differences are called out thoroughly throughout the RFC, so they are not new for the careful reader.

Why does the "user:password" format of the "User Information" component of RFC 3986 have special support?

RFC 3986 states the following when discussing the format of the “userinfo” component:

The userinfo subcomponent may consist of a user name and, optionally, scheme-specific information about how to gain authorization to access the resource. The user information, if present, is followed by a commercial at-sign (”@“) that delimits it from the host.

The definition is then extended with the following warning:

Use of the format “user:password” in the userinfo field is deprecated. Applications should not render as clear text any data after the first colon (”:“) character found within a userinfo subcomponent unless the data after the colon is the empty string (indicating no password)

The above sentences have always served as a source of contention: should the Uri\Rfc3986\Uri class handle the userinfo component in strict conformance with the RFC, or is it acceptable to add dedicated methods for the “user:password” format as “syntactic sugar”?

The position of the RFC is that the “user:password” format deserves special attention in spite of the fact that it's deprecated, because it's still the most often used format in the wild by far. That's why the dedicated getters (getUsername(), getRawUsername(), getPassword(), getRawPassword()) are added to Uri\Rfc3986\Uri. Dedicated withers are not added, because Uri\Rfc3986\Uri::withUserInfo() is trivial to use with passwords:

$uri = new Uri\Rfc3986\Uri("https://user@example.com");
$uri = $uri->withUserInfo($uri->getUsername() . ":password");   // withers return a new instance, so the result has to be assigned
echo $uri->toString();                                           // https://user:password@example.com

Previously, UriInterface of PSR-7 only added special support for the password component in its withUserInfo() method. Unfortunately, rather than setting user and password separately, the most recurring problem people face is to retrieve these two components separately. Not to mention the fact that setting a new password with the same user is still very cumbersome to achieve with the approach of PSR-7:

$uri = new \Laminas\Diactoros\Uri("https://user:password@example.com");
 
$userInfo = explode(":", $uri->getUserInfo());
$username = $userInfo[0];
 
$uri = $uri->withUserInfo($username, "new_password");

That's why the current RFC doesn't try to follow the solution chosen by PSR-7, but rather solves working with passwords the other way around.
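
When only a combined userinfo string is available, the split has to be done by hand. The sketch below assumes a hypothetical splitUserInfo() helper and splits at the first “:” only, in line with the RFC 3986 wording quoted above, so that colons inside the password are preserved.

```php
<?php
// Hypothetical helper: splits a userinfo string at the first ":" only, so that
// colons inside the password are preserved.
function splitUserInfo(string $userInfo): array
{
    $parts = explode(':', $userInfo, 2);

    return [$parts[0], $parts[1] ?? null];
}

[$user, $password] = splitUserInfo('user:pass:word');
var_dump($user);       // string(4) "user"
var_dump($password);   // string(9) "pass:word"

[$user, $password] = splitUserInfo('user');
var_dump($password);   // NULL
```

The dedicated getters of Uri\Rfc3986\Uri make this kind of manual splitting unnecessary.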

Why isn't query parameter manipulation supported?

It would be very useful for a URI implementation to support direct query parameter manipulation. Actually, the WHATWG URL specification contains a URLSearchParams interface that could be used for the purpose. However, the position of this RFC is not to include this interface yet for the following reasons:

Query string parsing is a fuzzy area, since there are no established rules for parsing
The URLSearchParams interface doesn't follow either RFC 1738, or RFC 3986
The already large scope of the RFC would increase even more

For all these reasons, the topic of query parameter manipulation should be discussed as a followup to the current RFC.

How should URI modification work?

Since URIs are value objects inherently, this RFC models them as immutable classes that support modification through withers. The usage of withers comes with some performance penalty (as a new instance is created for each modification), but this is a necessity in order to hold identity constraints.

Alternatively, it would be possible to make URIs completely immutable by using the builder pattern to construct and modify URIs (e.g. by having a Uri\Rfc3986\UriBuilder and a Uri\WhatWg\UrlBuilder class). This way, new Uri instances would only be created once: after the very last modification. This is especially true when one wants to construct a completely new URI. That's why this solution seems more optimized than the wither based approach.

However, this is not always true. When one wants to modify only a single detail of a URI, then withers are not only easier to use but are more efficient as well:

// Redirection of HTTP traffic to HTTPS by using withers
 
$uri = new Uri\Rfc3986\Uri("http://example.com");
 
$uri = $uri->withScheme("https");                        // a new URI instance is created at this point

In contrast, the following piece of code would have to be used if URIs didn't support modification (given a hypothetical Uri\Rfc3986\UriBuilder class):

// Redirection of HTTP traffic to HTTPS by using the builder pattern
 
$uri = new Uri\Rfc3986\Uri("http://example.com/foo");
 
$builder = Uri\Rfc3986\UriBuilder::fromUri($uri);
$builder->setScheme("https");                           // overwrites the URI scheme
$uri = $builder->build();                               // a new URI instance is created at this point

The above example makes it clear that the builder pattern mostly shines when it can save multiple instance creations, which is especially true if a URI has to be constructed from scratch:

// Construction of a new URI from scratch by using the builder pattern
 
$builder = new Uri\Rfc3986\UriBuilder();
$builder->setScheme("https")
        ->setHost("example.com")
        ->setPath("/foo");
 
$uri = $builder->build();                               // a new URI instance is created at this point

Builder classes are not offered by the present RFC just yet. They definitely have their use-case, as they can help write more optimized code, but they are not essential from the get-go. Therefore, this feature is one of the top candidates for a followup RFC.
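
To make the trade-off between withers and builders more tangible, here is a minimal userland sketch. SimpleUri and SimpleUriBuilder are hypothetical classes invented for this illustration; they perform no parsing or validation and do not reflect how the built-in classes are implemented. The static counter exists only to make the number of created instances visible.

```php
<?php

/** Hypothetical immutable value object; NOT the built-in Uri\Rfc3986\Uri. */
final class SimpleUri
{
    /** Counts constructed instances, only to make the cost visible. */
    public static int $instances = 0;

    public function __construct(
        public readonly string $scheme,
        public readonly string $host,
        public readonly string $path = ""
    ) {
        self::$instances++;
    }

    /** Wither: leaves $this untouched and returns a modified copy. */
    public function withScheme(string $scheme): self
    {
        return new self($scheme, $this->host, $this->path);
    }

    public function withHost(string $host): self
    {
        return new self($this->scheme, $host, $this->path);
    }

    public function withPath(string $path): self
    {
        return new self($this->scheme, $this->host, $path);
    }

    public function toString(): string
    {
        return $this->scheme . "://" . $this->host . $this->path;
    }
}

/** Hypothetical mutable builder; no such class is part of this RFC. */
final class SimpleUriBuilder
{
    private string $scheme = "";
    private string $host = "";
    private string $path = "";

    public function setScheme(string $scheme): self
    {
        $this->scheme = $scheme;
        return $this;
    }

    public function setHost(string $host): self
    {
        $this->host = $host;
        return $this;
    }

    public function setPath(string $path): self
    {
        $this->path = $path;
        return $this;
    }

    /** The only place where a SimpleUri instance is created. */
    public function build(): SimpleUri
    {
        return new SimpleUri($this->scheme, $this->host, $this->path);
    }
}

// Three modifications through withers: 1 initial + 3 new instances.
$uri = new SimpleUri("http", "example.org");
$viaWithers = $uri->withScheme("https")->withHost("example.com")->withPath("/foo");
$witherInstances = SimpleUri::$instances;    // 4

// The same result through the builder: a single instance.
SimpleUri::$instances = 0;
$viaBuilder = (new SimpleUriBuilder())
    ->setScheme("https")
    ->setHost("example.com")
    ->setPath("/foo")
    ->build();
$builderInstances = SimpleUri::$instances;   // 1

echo $viaWithers->toString(), PHP_EOL;       // https://example.com/foo
echo $viaBuilder->toString(), PHP_EOL;       // https://example.com/foo
```

For a single change such as the scheme swap shown earlier, the wither creates one new instance and the builder does too, so the builder only adds ceremony in that case.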

Why should the Uri\Rfc3986\Uri and the Uri\WhatWg\Url classes be final?

The new API has to conform to three contradictory expectations:

Security & Reliability: URI handling is a very complex topic, and generally, it's difficult to do right. Therefore the new API has to provide reliable behavior that offers the least amount of surprise. This is especially crucial because URI handling has a severe impact on security, since there are multiple related vulnerabilities (e.g. parsing confusion).
Extendability & Interoperability: URI handling is also a widespread problem people often face. As there may be vastly different use-cases, people need some way to customize the built-in features for their own purposes. On top of this, multiple libraries may need to work together on the same URI object. That's why interoperability is also an important factor.
Evolvability: There are multiple planned features in future scope that should be supported while introducing the least amount of breaking changes.

Unfortunately, it's not easy to meet all expectations at the same time. On the one hand, a customizable URI implementation allows bugs to be introduced (either accidentally or deliberately) that may result in issues with spec-compatibility. Additionally, it hinders evolvability, since internal methods added later may collide with methods added by userland implementations. On the other hand, a strictly locked implementation may make it difficult for the ecosystem to play together nicely, and does not encourage creativity of userland implementations.

Let's list the possible solutions:

Making the classes open for extension: This solution has acknowledged technical challenges (https://github.com/php/php-src/pull/14461#discussion_r1847316607), and it limits our possibilities of adding changes the most, but users can effectively add or modify any behavior they need. Naturally, this also invites bugs or incompatibilities, but these bugs usually cannot be avoided entirely anyway.

Making the classes final: By making the built-in URI classes final, we effectively forbid any alternative implementations. For example, an ADA URL based URL class or any other userland implementation cannot be used instead of the current Uri\WhatWg\Url class. This way, we can eliminate some edge cases (e.g. handling of uninitialized state).

Making the methods final: By making all URI methods final, we can guarantee that only new methods can be added, but the existing behavior cannot be modified. This has the same benefits as the above solution, but leaves somewhat more freedom for userland implementations, as child classes could extend the parent classes with new methods. Consequently, adding new methods to the parent classes (for example as the result of a followup RFC) would be more difficult, since we should respect all child classes that possibly already implement the methods to be added.

Making the classes final, but adding a separate interface for each: The negative impacts of making the built-in classes final would be mitigated by adding one interface for each specification that could serve as a common ground for interoperable implementations. Just like the above solution, it would be more difficult to add new methods to the interfaces, since such changes would be considered backward incompatibility breaks.

As can be seen, each and every solution has multiple advantages and disadvantages. Let's see then how each approach would affect userland code:

If we had two final built-in URI classes, userland code could primarily customize them via composition. When a custom URI implementation is passed to a 3rd party library, the built-in URI has to be extracted from the composing class, and the library can either use the pure internal implementation or recompose it as its own URI class. Let's see an example of a hypothetical use-case for composition:

class MyUri
{
    public function __construct(
        private readonly Uri\Rfc3986\Uri $uri
    ) {
    }
 
    /** Proxy the getHost() method of the built-in URI */
    public function getHost(): ?string
    {
        return $this->uri->getHost();
    }
 
    /** Custom feature that is not yet supported by the built-in URI */
    public function isIpv6Host(): bool
    {
        // getHost() may return null, which must not be passed to str_starts_with()
        return str_starts_with($this->getHost() ?? "", "[");
    }
 
    // ...
 
    /** Support extraction of the built-in URI */
    public function getUri(): Uri\Rfc3986\Uri
    {
        return $this->uri;
    }
}
 
// Instantiate the custom URI implementation via composition
// by passing the URI of the current request as input.
$myUri = new MyUri(new Uri\Rfc3986\Uri("https://" . $_SERVER["HTTP_HOST"] . $_SERVER["REQUEST_URI"]));
 
// Call a 3rd party library that needs a URI to be passed.
// Let's extract the built-in URI class for this purpose.
$libraryCode = new Library1();
$libraryCode->doSomethingWithUri($myUri->getUri());
 
// Call another 3rd party library that uses their own custom URI implementation.
$otherLibraryCode = new Library2();
$otherLibraryCode->doSomethingWithUri(new Library2Uri($myUri->getUri()));

As it can be seen, composition is a relatively expensive pattern to use, since it involves the instantiation of multiple classes, and all the methods of the composed class have to be duplicated in order to be able to call them. That probably explains why Nikita was working on adding native support of the "decorator" pattern.

Furthermore, interoperability between multiple libraries is automatically supported by the built-in classes. It's just inconvenient and wasteful to extract (and possibly recompose) a built-in class every time it has to be passed to a 3rd party library using a different abstraction. Of course, having a PSR-7 like UriInterface in userland for both built-in URI implementations would improve the situation, but composition would still always be necessary, with all its overhead, as the built-in classes don't implement any interfaces.

By also adding an interface for each built-in class, interoperability would become simpler and less wasteful. As the built-in URI implementations would implement their respective interface, it wouldn't matter what kind of implementation is passed to a function that expects a URI interface.

/** A function expecting a UriInterface implementation */
function doSomethingWithUri(Uri\Rfc3986\UriInterface $uri)
{
    // ...
}
 
// The built-in Uri class is passed to the function
doSomethingWithUri(new Uri\Rfc3986\Uri("http://example.com"));
 
// The custom Uri class is passed to the function
// (assuming MyUri is declared to implement Uri\Rfc3986\UriInterface)
doSomethingWithUri(new MyUri(new Uri\Rfc3986\Uri("http://example.com")));

This way, composition would still be necessary to use for custom implementations, but at least it could be avoided for the native implementations.

If we made it possible to extend the built-in classes, then URI customization would be much easier and would result in more optimized code, as the built-in classes could be substituted with their children. On the other hand, the implementation of the built-in classes is somewhat special, as it relies on some features that are not possible in userland code. This may potentially make child classes work unintuitively, or inconsistently with regular userland code.

Note: It should also be taken into account that Uri\Rfc3986\Uri implements the generic URI syntax upon which scheme-specific syntax can be built (probably one of the most well-known examples is LDAP URIs). When designing an implementation for a scheme-specific URI syntax, it would seem very straightforward to extend Uri\Rfc3986\Uri, and adapt it to the specific purpose. But this is a wrong choice based on the Liskov-Substitution Principle, which states that a child class should be able to substitute its parent without breaking the program. Clearly, an implementation for a scheme-specific URI syntax is more specific than the generic one, so by definition, it's not applicable as a substitute. This attribute of Uri\Rfc3986\Uri is important to know when trying to customize its behavior.

(Of course, people are not forced to use composition, they could just add their own UriHelper classes which could host the necessary static helper methods. Unless a popular UriHelper library emerges, everyone has to write their own custom code, which is a major regression.)
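
Such a helper could look like the following sketch. The UriHelper class and its isIpv6Host() method are hypothetical names chosen for this illustration. The method takes the host as a string (for example the value returned by getHost()), so that it does not depend on any particular URI class.

```php
<?php

/** Hypothetical userland helper; the class and method names are illustrative. */
final class UriHelper
{
    /**
     * Tells whether a host is an IPv6 literal.
     * RFC 3986 encloses IP literals in square brackets, e.g. "[2001:db8::1]".
     */
    public static function isIpv6Host(?string $host): bool
    {
        if ($host === null || !str_starts_with($host, "[") || !str_ends_with($host, "]")) {
            return false;
        }

        // Validate the address between the brackets.
        $address = substr($host, 1, -1);

        return filter_var($address, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false;
    }
}

var_dump(UriHelper::isIpv6Host("[2001:db8::1]"));   // bool(true)
var_dump(UriHelper::isIpv6Host("example.com"));     // bool(false)
var_dump(UriHelper::isIpv6Host(null));              // bool(false)
```

Compared to the composition example, no wrapper class and no proxy methods are needed, but the helper is not discoverable through the URI object itself.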

Based on the above factors, the future plans, the apparent resistance against opening the built-in classes for extension, and generally the unforeseeable effects of the other choices, the current RFC chooses to make the built-in URI implementations final without adding any internal interfaces for the time being. This is the most restrictive of all options, and is mostly chosen as a safety measure until the new API becomes mature enough and has been tested in practice.

The rationale behind this choice is that there are immediate plans to add new capabilities to the new API, so it's better not to provide an extension point for userland code in order to avoid negative consequences of any backward compatibility breaks introduced later on. Once the API settles, we plan to lift these restrictions to some extent.

Possibly, providing two interfaces and final implementations (option 4) could be a good choice in the future, especially if we made composition more ergonomic to use in the meantime. However, given the heated discussion around the topic with totally opposing opinions, we should discuss the question on its own, not just as a smaller detail of a huge proposal. Choosing the most restrictive option can cut the debate short, and put the focus back on the big picture for now.

Examples in Other Languages

Go

Even though Go's standard library ships with a net/url package containing a url.Parse() function along with some utility functions, unfortunately the documentation doesn't highlight which specification it conforms to. However, it's not very promising that the manual contains the following sentence:

Trying to parse a hostname and path without a scheme is invalid but may not necessarily return an error, due to parsing ambiguities.

Java

In Java, a URL class has been available from the beginning. Unfortunately, it's unclear whether it adheres to any URI specification. Speaking about its design, URL itself is immutable, and somewhat peculiarly, it contains some methods which can open a connection to the URL, or get its content.

Since Java 20, all of the URL constructors are deprecated in favor of using URI.toURL(). The URI class conforms to the RFC 2396 standard.

C#

C# has extensive support for URIs and IRIs, as the documentation states that its percent-encoding and decoding rules conform to RFC 3987. Uniquely, the standard library offers advanced features such as a UriBuilder, and customizable URI Parsers.

NodeJS

NodeJS recently added support for a decent WHATWG URL compliant URL parser, built on top of the ADA URL parser project.

Python

Python also comes with built-in support for parsing URLs, made available by the urllib.parse.urlparse and urllib.parse.urlsplit functions. According to the documentation, “these functions incorporate some aspects of both [the WHATWG URL and the RFC 3986 specifications], but cannot be claimed compliant with either”.

Backward Incompatible Changes

A new parameter is added to SoapClient::__doRequest(). When this method is overridden, the $uriParserClass parameter has to be added to the parameter list.

Proposed PHP Version(s)

The next minor PHP version (PHP 8.5).

RFC Impact

To SAPIs

SAPIs should adopt the new internal API for parsing URIs instead of using the existing php_url_parse*() API. Additionally, they should add support for configuring the URI parsing backend.

To Existing Extensions

Extensions should adopt the new internal API for parsing URIs instead of using the existing php_url_parse*() API. Additionally, they should add support for configuring the URI parsing backend.

To Opcache

None.

Future Scope

Support for a UriBuilder class, similar to the one implemented by C#
Support for RFC 3987 (Internationalized Resource Identifiers)
Support for new parser backends so that other libraries (like Ada URL, or cURL) could also be used in addition to uriparser and Lexbor.
Support for an abstraction for manipulating query parameters, like URLSearchParams defined by WHATWG
Support for retrieving/modifying path segments as an array
The parse_url() function can be deprecated at some distant point in time

References

Discussion thread: https://externals.io/message/123997
RFC 3986: https://datatracker.ietf.org/doc/html/rfc3986
RFC 3987: https://datatracker.ietf.org/doc/html/rfc3987
WHATWG URL specification: https://url.spec.whatwg.org/

Vote

The vote started on 2025-05-08, ends on 2025-05-22, and requires 2/3 majority to be accepted.

Add the RFC 3986 and the WHATWG URL compliant API described above?
Real name	yes	no
adiel (adiel)
alcaeus (alcaeus)
beberlei (beberlei)
cpriest (cpriest)
crell (crell)
daniels (daniels)
derick (derick)
edorian (edorian)
ericmann (ericmann)
galvao (galvao)
girgias (girgias)
jimw (jimw)
josh (josh)
jwage (jwage)
kalle (kalle)
kguest (kguest)
kinncj (kinncj)
kocsismate (kocsismate)
mbeccati (mbeccati)
nicolasgrekas (nicolasgrekas)
nielsdos (nielsdos)
ocramius (ocramius)
petk (petk)
pmjones (pmjones)
reywob (reywob)
santiagolizardo (santiagolizardo)
sergey (sergey)
theodorejb (theodorejb)
thorstenr (thorstenr)
timwolla (timwolla)
weierophinney (weierophinney)
Final result:	30	1
This poll has been closed.

Errata

The following paragraph is factually incorrect in the RFC:

“Unfortunately, these algorithms sometimes have surprising behavior where modification fails silently, and the original values are kept.”

This sentence was added to the RFC as a result of a bug in lexbor. Therefore, the following example will trigger a Uri\WhatWg\InvalidUrlException exception as opposed to what's written in the comment.

$url = new Uri\WhatWg\Url("https://example.com");
$url = $url->withHost("2001:db8:0:0:0:0:0:1");    // invalid IPv6 host, but no exception is triggered and the original host is kept
 
echo $url->getAsciiHost();                        // example.com
