PHP RFC: Add RFC 3986 and WHATWG compliant URI parsing support
- Version: 3.0
- Date: 2024-06-11
- Author: Máté Kocsis
- Status: Under Discussion
- First Published at:
- Implementation:
URIs and URLs are among the most fundamental concepts of the web because they make it possible to reference specific resources on a network. URLs were originally defined by Tim Berners-Lee in RFC 1738, but other specifications have emerged since then, of which RFC 3986 and WHATWG URL are the most notable. The former updates the original RFC 1738 and defines URIs, while the latter specifies how browsers should treat URLs.
Despite the ubiquity of URLs and URIs, they are not as unambiguous as people may think, because different clients parse and treat them differently: some follow one of the standards, while others, even worse, follow none at all. Unfortunately, PHP falls into the latter category: the parse_url() function is offered for parsing URLs, but it isn't compliant with any standard. Even the PHP manual contains the following warning:
This function may not give correct results for relative or invalid URLs, and the results may not even match common behavior of HTTP clients. ...
Incompatibility with current standards is a serious issue, as it hinders interoperability with other tools (e.g. HTTP clients) and can result in subtle bugs. For example, cURL's URL parsing implementation is based on RFC 3986, so URLs validated by FILTER_VALIDATE_URL may not necessarily be accepted when passed to cURL. This is exactly the kind of mismatch that the parsing confusion security vulnerability exploits.
First of all, we should define what URIs, IRIs, URLs, and URNs are, and what their relation is to each other, in order to have a better understanding of the terms used in the current RFC. It should be noted that different specifications use different definitions, so there is not a single definitive answer. However, the RFC tries to use these terms consistently according to the definitions below:
- URI: A unique identifier that relates to an abstract or physical resource
- IRI: A superset of URIs defined by RFC 3987 that allows Unicode characters, therefore supporting IDNA (internationalized domain names)
- URL: A subset of URIs that specify their location
- URN: A subset of URIs that are globally unique within defined namespaces
Their relation can be best illustrated via a Venn diagram:
* The image is reused from
Relevant URI specifications
Before discussing the proposal itself, we should also briefly touch on the URI specifications that the present RFC implements.
RFC 3986
RFC 3986 is a generic specification for URIs. Therefore, it is relatively permissive in the sense that it doesn't include scheme-specific processing rules. For example, the LDAP specification builds upon RFC 3986 and extends it with additional rules (e.g. the ? and the , characters have to be percent-encoded at certain positions).
WHATWG URL
The WHATWG URL specification is fairly new and mostly relevant in the web browser context. It is a living specification, meaning that it changes from time to time. One of its fundamental differences compared to RFC 3986 is that it only deals with URLs, rather than URIs.
Important concepts related to URIs
URIs have some important concepts and capabilities that are needed to effectively work with them.
Parsing is the single most important operation performed on URIs: it decomposes a URI string into its individual components.
While RFC 3986 leaves the input URI string intact during parsing, WHATWG automatically transforms it (it removes superfluous / characters after the scheme, lowercases the host, etc.).
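As a rough illustration of what decomposition means, the sketch below uses PHP's existing parse_url() function, which, as noted earlier, follows neither specification. The URI is a made-up example and the output only shows which components a URI string can be split into.

```php
<?php
// A made-up URI used only for illustration.
$input = 'https://user:secret@example.com:8080/a/b?x=1#top';

// parse_url() returns an associative array with one entry per component
// found in the string. It is NOT compliant with RFC 3986 or WHATWG.
$components = parse_url($input);

foreach ($components as $name => $value) {
    echo $name, ' => ', $value, PHP_EOL;
}
```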
Reference resolution
Reference resolution is the process that turns a potentially relative URI reference into a URL by applying it to an absolute URL (a URL that has no fragment component). For example, resolving “/foo” against a base URL yields a URL that keeps the scheme and the authority of the base but has “/foo” as its path. Both RFC 3986 and WHATWG support this concept.
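The idea can be sketched in plain PHP. The following is a deliberately simplified take on the RFC 3986 resolution algorithm (section 5.2), not the proposed implementation: the function names are hypothetical, only non-empty, path-only references are handled (no scheme, authority, query, or fragment in the reference), and the userinfo, query, and fragment of the base are dropped.

```php
<?php
// Removes "." and ".." segments from a path (in the spirit of RFC 3986, section 5.2.4).
function removeDotSegments(string $path): string
{
    $isAbsolute = str_starts_with($path, '/');
    $segments = explode('/', $isAbsolute ? substr($path, 1) : $path);
    $last = end($segments);
    $output = [];
    foreach ($segments as $segment) {
        if ($segment === '.') {
            continue;               // "." refers to the current directory
        }
        if ($segment === '..') {
            array_pop($output);     // ".." removes the previous segment
            continue;
        }
        $output[] = $segment;
    }
    // A trailing "." or ".." names a directory, so keep the trailing slash.
    if ($last === '.' || $last === '..') {
        $output[] = '';
    }
    return ($isAbsolute ? '/' : '') . implode('/', $output);
}

// Hypothetical helper: resolves a path-only reference against an absolute base URL.
function resolvePathReference(string $base, string $reference): string
{
    $b = parse_url($base);
    $prefix = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    if (str_starts_with($reference, '/')) {
        $path = $reference;         // absolute-path reference replaces the base path
    } else {
        // Merge: keep the base path up to and including its last "/".
        $path = substr($basePath, 0, (int) strrpos($basePath, '/') + 1) . $reference;
    }
    return $prefix . removeDotSegments($path);
}

echo resolvePathReference('https://example.com/a/b?x=1', '/foo'), PHP_EOL; // https://example.com/foo
echo resolvePathReference('https://example.com/a/b?x=1', '../c'), PHP_EOL; // https://example.com/c
```

The base URL and the base paths used here are illustrative; the assertions used to check the sketch follow the reference resolution examples of RFC 3986, section 5.4.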
Component recomposition
Component recomposition is the process of recomposing the distinct URI components into a URI string. RFC 3986 and WHATWG each define their own algorithm for this purpose.
An important question is whether the recomposed URI equals the input URI string. The two specifications differ in this regard as well: by default, RFC 3986 doesn't require any transformations to be performed during parsing, although it makes some recommendations on how to canonicalize the parsed URI string (see the next section). That's why, by default, the recomposed URI may be the same as the input URI string.
On the other hand, WHATWG performs quite a few transformations on the input during parsing, so the recomposed URL may not be the same as the original one. In addition, the recomposition process contains a step where IPv4 and IPv6 hosts are canonicalized (e.g. “[0:0::1]” becomes “[::1]”).
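For RFC 3986, recomposition can be sketched as follows. This is a minimal transcription of the pseudocode in RFC 3986, section 5.3, where a null value stands for an undefined component; the function name and the sample values are made up for illustration.

```php
<?php
// Recomposes a URI string from its components (RFC 3986, section 5.3).
// A null component is "undefined" and is left out entirely, while an
// empty string is "defined but empty" and still emits its delimiter.
function recompose(?string $scheme, ?string $authority, string $path, ?string $query, ?string $fragment): string
{
    $result = '';
    if ($scheme !== null) {
        $result .= $scheme . ':';
    }
    if ($authority !== null) {
        $result .= '//' . $authority;
    }
    $result .= $path;
    if ($query !== null) {
        $result .= '?' . $query;
    }
    if ($fragment !== null) {
        $result .= '#' . $fragment;
    }
    return $result;
}

echo recompose('https', 'example.com', '/a/b', 'x=1', 'top'), PHP_EOL; // https://example.com/a/b?x=1#top
echo recompose(null, null, '/a/b', '', null), PHP_EOL;                  // /a/b?
```

Because nothing is transformed along the way, feeding the untouched components of a parsed RFC 3986 URI into such a function gives back the original string.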
Normalization is an optional process supported by RFC 3986 for canonicalizing different URIs that identify the same resource into the same URI. For example, two URIs that differ only in the case of the scheme (one of them written with HTTPS://) refer to the same resource, so they can be normalized to the same URI. As we will see, normalization is very useful in multiple cases.
WHATWG doesn't have this concept, as all transformations are applied during parsing.
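One of the normalization steps, case normalization of the scheme and the host (RFC 3986, section 6.2.2.1), can be sketched with a regular expression. This is an illustrative helper with a hypothetical name, not the proposed implementation: it ignores the other normalization steps and assumes a URI that has an authority component.

```php
<?php
// Lowercases the scheme and the host of a URI, leaving the case-sensitive
// parts (userinfo, path, query, fragment) untouched.
function normalizeCase(string $uri): string
{
    $result = preg_replace_callback(
        '~^([A-Za-z][A-Za-z0-9+.-]*://)([^/?#]*)~',
        static function (array $m): string {
            // The authority is "[userinfo@]host[:port]"; only the part
            // after the last "@" is case-insensitive.
            $at = strrpos($m[2], '@');
            $userInfo = $at === false ? '' : substr($m[2], 0, $at + 1);
            $hostPort = $at === false ? $m[2] : substr($m[2], $at + 1);
            return strtolower($m[1]) . $userInfo . strtolower($hostPort);
        },
        $uri
    );
    return $result ?? $uri;
}

echo normalizeCase('HTTPS://User@EXAMPLE.COM/Foo?Bar'), PHP_EOL; // https://User@example.com/Foo?Bar
```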
Percent-encoding & decoding
Encoding and decoding special characters is a crucial aspect of URI parsing. For this purpose, both RFC 3986 and WHATWG use percent-encoding (e.g. the % character is encoded as %25). However, the two standards differ slightly in the details.
WHATWG associates a percent-encode set with each component, defining the characters that must be percent-encoded in the context of the given component. For example, the query percent-encode set is associated with the query component and contains the “#” character (among others), while the path percent-encode set additionally includes the “?” character (among others). It's easy to see the pattern: if a character has a special meaning after the given component, then it must be percent-encoded. That's why the userinfo percent-encode set also has to contain the “/” character (among others), while the query percent-encode set doesn't include it, since “/” characters don't have a special meaning after the path component.
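The per-component approach can be sketched as follows. The three constants are tiny, illustrative subsets limited to the characters mentioned above (the real WHATWG sets are larger), and the function name is hypothetical.

```php
<?php
// Illustrative subsets of the WHATWG percent-encode sets (NOT the full sets).
const QUERY_SET = ' #';
const PATH_SET = QUERY_SET . '?';
const USERINFO_SET = PATH_SET . '/';

// Percent-encodes every byte of $input that is in $set, is a control
// character, or lies outside the printable ASCII range.
function percentEncodeWith(string $input, string $set): string
{
    $output = '';
    for ($i = 0, $length = strlen($input); $i < $length; $i++) {
        $byte = $input[$i];
        $code = ord($byte);
        $mustEncode = $code < 0x20 || $code > 0x7E || str_contains($set, $byte);
        $output .= $mustEncode ? sprintf('%%%02X', $code) : $byte;
    }
    return $output;
}

echo percentEncodeWith('a?b#c', QUERY_SET), PHP_EOL;  // a?b%23c
echo percentEncodeWith('a?b#c', PATH_SET), PHP_EOL;   // a%3Fb%23c
echo percentEncodeWith('a/b', USERINFO_SET), PHP_EOL; // a%2Fb
```

The same input is encoded differently depending on the component it is destined for, which is the pattern the paragraph above describes.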
Similarly, RFC 3986 assigns a list of allowed characters to each component. For example, the query component can contain unreserved and percent-encoded characters, as well as some reserved characters that are categorized as “sub-delimiters” (e.g. “&”, “!”, “'”), and also some “generic delimiters” (“:” and “@”) that don't have any special meaning in that context.
These two approaches are very similar, but there is a key difference between them: WHATWG automatically tries to percent-encode the characters of the associated percent-encode set when possible, as well as any characters that are illegal in a URL (those that are not “URL units”), and only emits a warning while doing so, whereas RFC 3986 rejects invalid characters and stops parsing with a failure.
RFC 3986 also specifies a set of reserved characters (“#”, “?”, “/”, etc.) that must not be percent-decoded, so that scheme-specific syntaxes can safely use them as delimiters. The specification puts it as follows:
Thus, characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI.
WHATWG simply doesn't do any percent-decoding, for reasons that are discussed in the following section.
Equivalence
Normalization and transformations during parsing are especially important when comparing URIs to each other, because they reduce the likelihood of false negatives: URI comparison effectively checks whether two URIs represent the same resource.
In practice, this means that the two URIs are normalized (when applicable) and then their components are recomposed. If the resulting URI strings are equal, then the two URIs are equal too. Usually, the fragment component is disregarded, since it refers to a secondary resource within the primary one that is identified by the URI.
To complicate things, there is also a nuanced difference in how the two specifications treat equivalence. RFC 3986 defines that “URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent”, which effectively means that percent-encoded unreserved characters and their decoded forms are equivalent (e.g. the character “e” is equivalent to “%65”).
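The RFC 3986 rule quoted above can be sketched as a small normalization helper: percent-encoded unreserved characters are decoded, while every other percent-encoded octet is kept and only has its hexadecimal digits uppercased (RFC 3986, section 6.2.2). The function name is hypothetical and the inputs are illustrative.

```php
<?php
// Decodes percent-encoded unreserved characters and uppercases the hex
// digits of the remaining percent-encoded octets.
function normalizePercentEncoding(string $component): string
{
    $result = preg_replace_callback(
        '/%([0-9A-Fa-f]{2})/',
        static function (array $m): string {
            $char = chr((int) hexdec($m[1]));
            // unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
            if (preg_match('/^[A-Za-z0-9._~-]$/D', $char) === 1) {
                return $char;
            }
            // Reserved and other octets stay encoded.
            return '%' . strtoupper($m[1]);
        },
        $component
    );
    return $result ?? $component;
}

echo normalizePercentEncoding('/foob%61r'), PHP_EOL;   // /foobar
echo normalizePercentEncoding('/foo%2fbar'), PHP_EOL;  // /foo%2Fbar

// Under RFC 3986 semantics, these two paths compare as equivalent:
var_dump(normalizePercentEncoding('/t%65st') === normalizePercentEncoding('/test')); // bool(true)
```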
In contrast, WHATWG defines URL equivalence by the equality of the recomposed URL strings, and never decodes percent-encoded characters, except in the host. This implies that percent-encoded characters are not equivalent to their percent-decoded form (except in the host).
The difference between RFC 3986 and WHATWG comes from the view of a maintainer of the WHATWG specification that web servers may legitimately choose to consider encoded and decoded paths distinct, and a standard cannot force them not to do so. This is a substantial BC break compared to RFC 3986, and judging by the large number of tickets related to this question, it is a source of confusion among users of the WHATWG specification.
Unicode & IDNA
IDNA (internationalized domain names) allows people around the world to register domain names in their native languages and scripts. This is made possible by encoding Unicode characters using the Punycode transcription.
RFC 3986 supports neither IDNA nor non-ASCII characters. WHATWG supports IDNA and Unicode characters, and it explicitly suggests that browsers should render the host component by displaying Unicode characters.
The recommendation is not just for user-friendliness: it's necessary for security reasons, alleviating the human risk factor in exploits. For example, a Punycode-encoded host could deceive the uninitiated reader into believing that it is a Google domain, while the IDNA domain in fact decodes to “䕮䕵䕶䕱.com”.
A new, always available URI extension is to be added to the standard library. The extension would support parsing, validating, modifying, and recomposing URIs based on both the RFC 3986 and the WHATWG URL specifications, as well as resolving references. For this purpose, the following internal classes and methods are added:
namespace Uri {
    class UriException extends \Exception
    {
    }

    class UninitializedUriException extends \Uri\UriException
    {
    }

    class UriOperationException extends \Uri\UriException
    {
    }

    class InvalidUriException extends \Uri\UriException
    {
        public readonly array $errors;
    }
}
namespace Uri\Rfc3986 {
    readonly class Uri
    {
        public static function parse(string $uri, ?string $baseUrl = null): ?static {}

        /** @throws Uri\InvalidUriException */
        public function __construct(string $uri, ?string $baseUrl = null) {}

        /** @throws Uri\UninitializedUriException */
        public function getScheme(): ?string {}

        /** @throws Uri\UninitializedUriException */
        public function getRawScheme(): ?string {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         * @throws Uri\InvalidUriException
         */
        public function withScheme(?string $scheme): static {}

        /** @throws Uri\UninitializedUriException */
        public function getUserInfo(): ?string {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         */
        public function getRawUserInfo(): ?string {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         * @throws Uri\InvalidUriException
         */
        public function withUserInfo(?string $userInfo): static {}

        /** @throws Uri\UninitializedUriException */
        public function getUser(): ?string {}

        /** @throws Uri\UninitializedUriException */
        public function getRawUser(): ?string {}

        /** @throws Uri\UninitializedUriException */
        public function getPassword(): ?string {}

        /** @throws Uri\UninitializedUriException */
        public function getRawPassword(): ?string {}

        /** @throws Uri\UninitializedUriException */
        public function getHost(): ?string {}

        /** @throws Uri\UninitializedUriException */
        public function getRawHost(): ?string {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         * @throws Uri\InvalidUriException
         */
        public function withHost(?string $host): static {}

        /** @throws Uri\UninitializedUriException */
        public function getPort(): ?int {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         * @throws Uri\InvalidUriException
         */
        public function withPort(?int $port): static {}

        /** @throws Uri\UninitializedUriException */
        public function getPath(): ?string {}

        /** @throws Uri\UninitializedUriException */
        public function getRawPath(): ?string {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         * @throws Uri\InvalidUriException
         */
        public function withPath(?string $path): static {}

        /** @throws Uri\UninitializedUriException */
        public function getQuery(): ?string {}

        /** @throws Uri\UninitializedUriException */
        public function getRawQuery(): ?string {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         * @throws Uri\InvalidUriException
         */
        public function withQuery(?string $query): static {}

        /** @throws Uri\UninitializedUriException */
        public function getFragment(): ?string {}

        /** @throws Uri\UninitializedUriException */
        public function getRawFragment(): ?string {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         * @throws Uri\InvalidUriException
         */
        public function withFragment(?string $fragment): static {}

        /** @throws Uri\UninitializedUriException */
        public function equals(Uri $uri, bool $excludeFragment = true): bool {}

        /** @throws Uri\UninitializedUriException */
        public function toNormalizedString(): string {}

        /** @throws Uri\UninitializedUriException */
        public function toString(): string {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\InvalidUriException
         */
        public function resolve(string $uri): static {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         */
        public function __serialize(): array {}

        /**
         * @throws Uri\UriOperationException
         * @throws Uri\InvalidUriException
         */
        public function __unserialize(array $data): void {}

        public function __debugInfo(): array {}
    }
}
namespace Uri\WhatWg {
    readonly class Url
    {
        /** @param array<int, WhatWgError> $errors */
        public static function parse(string $uri, ?string $baseUrl = null, &$errors = null): ?static {}

        /**
         * @param array<int, WhatWgError> $softErrors
         * @throws Uri\InvalidUriException
         */
        public function __construct(string $uri, ?string $baseUrl = null, &$softErrors = null) {}

        /** @throws Uri\UninitializedUriException */
        public function getScheme(): string {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         * @throws Uri\InvalidUriException
         */
        public function withScheme(string $scheme): static {}

        /** @throws Uri\UninitializedUriException */
        public function getUser(): ?string {}

        /** @throws Uri\UninitializedUriException */
        public function getRawUser(): ?string {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         * @throws Uri\InvalidUriException
         */
        public function withUser(?string $user): static {}

        /** @throws Uri\UninitializedUriException */
        public function getPassword(): ?string {}

        /** @throws Uri\UninitializedUriException */
        public function getRawPassword(): ?string {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         * @throws Uri\InvalidUriException
         */
        public function withPassword(?string $password): static {}

        /** @throws Uri\UninitializedUriException */
        public function getHost(): string {}

        /** @throws Uri\UninitializedUriException */
        public function getHostForDisplay(): string {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         * @throws Uri\InvalidUriException
         */
        public function withHost(string $host): static {}

        /** @throws Uri\UninitializedUriException */
        public function getPort(): ?int {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         * @throws Uri\InvalidUriException
         */
        public function withPort(?int $port): static {}

        /** @throws Uri\UninitializedUriException */
        public function getPath(): ?string {}

        /** @throws Uri\UninitializedUriException */
        public function getRawPath(): ?string {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         * @throws Uri\InvalidUriException
         */
        public function withPath(?string $path): static {}

        /** @throws Uri\UninitializedUriException */
        public function getQuery(): ?string {}

        /** @throws Uri\UninitializedUriException */
        public function getRawQuery(): ?string {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         * @throws Uri\InvalidUriException
         */
        public function withQuery(?string $query): static {}

        /** @throws Uri\UninitializedUriException */
        public function getFragment(): ?string {}

        /** @throws Uri\UninitializedUriException */
        public function getRawFragment(): ?string {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         * @throws Uri\InvalidUriException
         */
        public function withFragment(?string $fragment): static {}

        /** @throws Uri\UninitializedUriException */
        public function equals(Url $uri, bool $excludeFragment = true): bool {}

        /** @throws Uri\UninitializedUriException */
        public function toString(): string {}

        /** @throws Uri\UninitializedUriException */
        public function toDisplayString(): string {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         */
        public function resolve(string $uri): static {}

        /**
         * @throws Uri\UninitializedUriException
         * @throws Uri\UriOperationException
         */
        public function __serialize(): array {}

        /**
         * @throws Uri\UriOperationException
         * @throws Uri\InvalidUriException
         */
        public function __unserialize(array $data): void {}

        public function __debugInfo(): array {}
    }

    enum WhatWgErrorType
    {
        case DomainToAscii;
        case DomainToUnicode;
        case DomainInvalidCodePoint;
        case HostInvalidCodePoint;
        case Ipv4EmptyPart;
        case Ipv4TooManyParts;
        case Ipv4NonNumericPart;
        case Ipv4NonDecimalPart;
        case Ipv4OutOfRangePart;
        case Ipv6Unclosed;
        case Ipv6InvalidCompression;
        case Ipv6TooManyPieces;
        case Ipv6MultipleCompression;
        case Ipv6InvalidCodePoint;
        case Ipv6TooFewPieces;
        case Ipv4InIpv6TooManyPieces;
        case Ipv4InIpv6InvalidCodePoint;
        case Ipv4InIpv6OutOfRangePart;
        case Ipv4InIpv6TooFewParts;
        case InvalidUrlUnit;
        case SpecialSchemeMissingFollowingSolidus;
        case MissingSchemeNonRelativeUrl;
        case InvalidReverseSoldius;
        case InvalidCredentials;
        case HostMissing;
        case PortOfOfRange;
        case PortInvalid;
        case FileInvalidWindowsDriveLetter;
        case FileInvalidWindowsDriveLetterHost;
    }

    readonly class WhatWgError
    {
        public string $context;
        public WhatWgErrorType $type;
        public bool $failure;

        public function __construct(string $context, WhatWgErrorType $type, bool $failure) {}
    }
}
API Design
First and foremost, the new URI parsing API contains two URI implementations, Uri\Rfc3986\Uri and Uri\WhatWg\Url, representing RFC 3986 URIs and WHATWG URLs, respectively. Having separate classes for the two specifications makes it possible to properly model URIs with all their details and nuances. Moreover, having wrong assumptions about the origin of a URI could cause a security vulnerability, as Daniel Stenberg (author of cURL) writes in one of his blog posts. That's why, at least in security-sensitive applications, it's important to explicitly express which specification is used.
Both built-in URI implementations are readonly classes, and support parsing URI strings via two methods:
- the constructor: It expects a URI and, optionally, a base URL in order to support reference resolution. When parsing is unsuccessful, a Uri\InvalidUriException is thrown, containing the errors.
- a parse() factory method: It expects the same parameters as the constructor does, but when parsing is unsuccessful, null is returned instead of throwing an exception. Using this method is recommended for validating URIs and/or parsing URIs from untrusted input.
$uri = new Uri\Rfc3986\Uri("");               // An RFC 3986 URI instance is created
$uri = Uri\Rfc3986\Uri::parse("");            // An RFC 3986 URI instance is created
$uri = new Uri\Rfc3986\Uri("invalid uri");    // Throws Uri\InvalidUriException
$uri = Uri\Rfc3986\Uri::parse("invalid uri"); // null is returned in case of an invalid URI

$url = new Uri\WhatWg\Url("");                // A WHATWG URL instance is created
$url = Uri\WhatWg\Url::parse("");             // A WHATWG URL instance is created
$url = new Uri\WhatWg\Url("invalid uri");     // Throws Uri\InvalidUriException
$url = Uri\WhatWg\Url::parse("invalid uri", null, $errors); // null is returned, and an array of WhatWgError objects is passed by reference to $errors
As can be seen, Uri\WhatWg\Url::parse() can pass additional information about the triggered validation errors by reference, as specified by WHATWG. In the example above, $errors will contain the following value:
array(1) {
  [0]=>
  object(Uri\WhatWg\WhatWgError)#1 (3) {
    ["context"]=>
    string(11) "invalid uri"
    ["type"]=>
    enum(Uri\WhatWg\WhatWgErrorType::MissingSchemeNonRelativeUrl)
    ["failure"]=>
    bool(true)
  }
}
The $context property refers to the substring where the error happened, while the $type property is a Uri\WhatWg\WhatWgErrorType enum storing the exact cause of the error. Finally, the $failure property stores whether the error caused a failure or processing could continue. Therefore, true refers to a hard error, while false means a soft error.
When trying to instantiate a WHATWG Url via its constructor, a Uri\InvalidUriException is thrown when parsing results in a failure. In this case, the Uri\InvalidUriException::$errors property will contain an array of Uri\WhatWg\WhatWgError instances. When parsing is successful, but soft errors were triggered, an array of Uri\WhatWg\WhatWgError instances will be passed by reference to the $softErrors parameter.
When trying to instantiate a WHATWG Url via its parse() method, a null return value indicates that parsing resulted in a failure. In this case, the $errors by-ref parameter will contain an array of Uri\WhatWg\WhatWgError instances. When parsing is successful, but soft errors were triggered, the $errors by-ref parameter will contain an array of Uri\WhatWg\WhatWgError instances referring only to soft errors. The following example demonstrates how a soft error is triggered:
// Soft error due to the leading " " character when using the parse() method
$errors = [];
$url = Uri\WhatWg\Url::parse("", null, $errors);
echo $url->toString();          //
var_dump($errors[0]->type);     // enum(Uri\WhatWg\WhatWgErrorType::InvalidUrlUnit)

// Soft error due to the leading " " character when using the constructor
$softErrors = [];
$url = new Uri\WhatWg\Url("", null, $softErrors);
echo $url->toString();          //
var_dump($softErrors[0]->type); // enum(Uri\WhatWg\WhatWgErrorType::InvalidUrlUnit)
Even though pass by reference is not a very desirable language construct, it is the least bad option for WHATWG errors, which can happen even when parsing is successful. As PHP doesn't have native support for monads, reimplementing something similar in advance would be an unwise choice (e.g. a ParsingResult interface with three implementations: Success, PartialSuccess, and Error).
If successful parsing and errors were mutually exclusive, then it would be possible to make the method return either a Uri\WhatWg\Url in case of success, or an array of Uri\WhatWg\WhatWgError instances in case of failure. Since that's not the case, we had to reject the idea.
Reference resolution
Primarily, reference resolution is implemented via the $baseUrl parameter of the constructor and parse(). If the argument has a non-null value, and the $uri parameter is a relative URI, then an attempt is made to resolve $uri against $baseUrl:
$uri = new Uri\Rfc3986\Uri("/foo", "");
echo $uri->toString(); //

$uri = new Uri\Rfc3986\Uri("", "");
echo $uri->toString(); //

$uri = new Uri\Rfc3986\Uri("/foo", ".com"); // Throws Uri\InvalidUriException because $baseUrl is invalid

$url = Uri\WhatWg\Url::parse("/foo", "");
echo $url->toString(); //

$url = Uri\WhatWg\Url::parse("", "");
echo $url->toString(); //

$url = Uri\WhatWg\Url::parse("/foo", ".com"); // null is returned because $baseUrl is invalid
Additionally, URIs support a resolve() method that is able to resolve potentially relative URI strings with the current object as the base URL:
$uri = new Uri\Rfc3986\Uri("");
echo $uri->resolve("/foo")->toString(); //

$url = new Uri\WhatWg\Url("");
echo $url->resolve("/foo")->toString(); //
This method is a shorthand for new (get_class($uri))("/foo", $uri->toString()), where $uri is the object on which resolve() is called.
Component retrieval
The individual URI components can be retrieved via getters. While property hooks and/or asymmetric visibility could be a modern replacement for getters, the RFC still chooses the more conservative getter-based approach because each URI component actually has to be available in multiple forms in order to best serve the vastly different needs users may have.
All URI components - with the exception of the host - can be retrieved in two formats:
- “raw” representation: It's how components are natively represented by URI parsers without any post-processing after parsing.
- “normalized-decoded” representation: The URI is normalized (when applicable), and components are percent-decoded.
The “raw” representation is very straightforward and doesn't need much explanation: it reflects components as closely to their original form as possible. That's why it is mostly suitable for use-cases where one has to work with URIs opaquely. API clients and signers usually fall into this category, as they want to avoid introducing any unnecessary changes to URIs that could cause subtle bugs.
On the other hand, the “normalized-decoded” representation is useful in a whole lot of other cases, including application routers and HTTP cache implementations. This representation should be used when one wants to make sure that URI components are in their most canonical form. For example, in case of application routers, all URIs that represent the same resource should be routed to the same controller action: a URI and its slightly unusual but equivalent variant should trigger the same piece of code, otherwise the application may fail to serve traffic that uses the unusual variant for any reason.
The “normalized-decoded” form should do post-processing in such a way that the result can still be safely used for modification of the same component of another valid URI (the data is “roundtripable”):
$uri1 = new Uri\Rfc3986\Uri("HTTPS://");
$uri2 = new Uri\Rfc3986\Uri("");

// The scheme of $uri2 is successfully modified with the
// "normalized-decoded" representation of the scheme of $uri1
$uri2 = $uri2->withScheme($uri1->getScheme());
This attribute is important for usability - it would be inconvenient to always do additional checks when the “normalized-decoded” representation is used for building or modifying a URI.
On the other hand, the “normalized-decoded” representation doesn't always guarantee equivalence with the “raw” representation of the same component of the same URI. As outlined in the Equivalence section, the WHATWG specification considers the percent-encoded and decoded forms of the same string different everywhere except in the host component. That's why one shouldn't assume that the two representations are completely interchangeable in case of Uri\WhatWg\Url:
$url1 = new Uri\WhatWg\Url("");

// The "normalized-decoded" representation of the path is retrieved, containing the "test" value
$path = $url1->getPath();

// $url2 is constructed as
$url2 = $url1->withPath($path);

// false, !=
$url1->equals($url2);
Uri\Rfc3986\Uri isn't subject to the above problem though, since its equivalence semantics are compatible with the “normalized-decoded” representation.
Given a URI in which some unreserved characters are percent-encoded, let's see how the individual components can be represented in case of Uri\Rfc3986\Uri:
$uri = new Uri\Rfc3986\Uri("");
echo $uri->getScheme();      // https
echo $uri->getRawScheme();   // https
echo $uri->getUserInfo();    // apple:pass
echo $uri->getRawUserInfo(); // %61pple:p%61ss
echo $uri->getUser();        // apple
echo $uri->getRawUser();     // %61pple
echo $uri->getPassword();    // pass
echo $uri->getRawPassword(); // p%61ss
echo $uri->getHost();        //
echo $uri->getRawHost();     //
echo $uri->getPort();        // 433
echo $uri->getPath();        // /foobar
echo $uri->getRawPath();     // /foob%61r
echo $uri->getQuery();       // abc=abc
echo $uri->getRawQuery();    // %61bc=%61bc
echo $uri->getFragment();    // abc
echo $uri->getRawFragment(); // %61bc
Let's have a look at another example which involves normalization:
$uri = new Uri\Rfc3986\Uri("HTTPS://EXAMPLE.COM/foo/../bar/");
echo $uri->getScheme();    // https
echo $uri->getRawScheme(); // HTTPS
echo $uri->getHost();      //
echo $uri->getRawHost();   // EXAMPLE.COM
echo $uri->getPath();      // /bar/
echo $uri->getRawPath();   // /foo/../bar/
In case of Uri\WhatWg\Url, we'll get the following results for the first example:
$url = new Uri\WhatWg\Url("HTTPS://");
echo $url->getScheme();         // https
echo $url->getRawScheme();      // method does not exist, because Uri\WhatWg\Url always normalizes the scheme
echo $url->getUser();           // apple
echo $url->getRawUser();        // %61pple
echo $url->getPassword();       // pass
echo $url->getRawPassword();    // p%61ss
echo $url->getHost();           //
echo $url->getHostForDisplay(); //
echo $url->getRawHost();        // method does not exist, because Uri\WhatWg\Url always normalizes the host
echo $url->getPort();           // 433
echo $url->getPath();           // /foobar
echo $url->getRawPath();        // /foob%61r
echo $url->getQuery();          // abc=abc
echo $url->getRawQuery();       // %61bc=%61bc
echo $url->getFragment();       // abc
echo $url->getRawFragment();    // %61bc
This script gives very similar results to the previous one, except for the scheme and the host components. For one, Uri\WhatWg\Url automatically lowercases the scheme during parsing, so it has no getRawScheme() method. For similar reasons, Uri\WhatWg\Url doesn't have a getRawHost() method either: WHATWG automatically percent-decodes the host during parsing, so there is no “raw” representation. On the other hand, the getHostForDisplay() method comes in handy for retrieving the IDNA host in the form that is best suited for display:
$url = new Uri\WhatWg\Url("https://🐘.com");
echo $url->getHost();           //
echo $url->getHostForDisplay(); // 🐘.com
There is an edge-case which needs to be highlighted with more examples: it's the percent-decoding of reserved characters.
$uri = new Uri\Rfc3986\Uri("");
echo $uri->getPath();    // /foo/bar%2Fbaz
echo $uri->getRawPath(); // /foo/bar%2Fbaz
In the example above, the second path segment contains %2F, which is the percent-encoded form of the / character. But why does it have to be percent-encoded at all? It's because there is a semantic difference between / and %2F in the path: the former separates the individual path segments, while %2F means that the respective / is part of a single segment (using the percent-decoded form would be ambiguous). Therefore, the example URI has two path segments: “foo” and “bar/baz”.
Let's see the same example with WHATWG:
$url = new Uri\WhatWg\Url("");
echo $url->getPath();    // /foo/bar%2Fbaz
echo $url->getRawPath(); // /foo/bar%2Fbaz
Let's have a look at another tricky example with Uri\Rfc3986\Uri:
$uri = new Uri\Rfc3986\Uri("https://[2001:0db8:0001:0000:0000:0ab9:C0A8:0102]/?foo=bar%26baz%3Dqux");
echo $uri->getHost();     // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
echo $uri->getRawHost();  // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
echo $uri->getQuery();    // foo=bar%26baz%3Dqux
echo $uri->getRawQuery(); // foo=bar%26baz%3Dqux
What happens here? The host is an IPv6 address that must be enclosed within a [] pair. The query string contains “%26” (the percent-encoded form of “&”) and “%3D” (the percent-encoded form of “=”), both of which are reserved characters whose percent-decoding is disallowed by RFC 3986. WHATWG doesn't explicitly specify such a thing, but the proposed implementation follows the behavior set by RFC 3986 for consistency.
Component modification
Immutable modification of the individual URI components is possible via “wither” methods. Let's see a very basic example first for modifying and retrieving the host URI component:
$uri1 = new Uri\Rfc3986\Uri("");
$uri2 = $uri1->withHost("");
echo $uri1->getHost(); //
echo $uri2->getHost(); //
The above example demonstrates that withers create a new instance for each modification, leaving the original object intact. However, an exception is thrown if a modification would result in an invalid URI. This way, URIs always stay valid:
$uri = new Uri\Rfc3986\Uri("");
$uri->withHost("/"); // Throws Uri\InvalidUriException
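The wither pattern itself can be sketched in userland PHP. The class below is purely hypothetical (SketchHostHolder and its simplistic validation rule are assumptions made for illustration). It only shows the clone-then-modify idea and the "validate before returning" behavior described above.

```php
<?php
// Hypothetical value object, not part of the proposal.
final class SketchHostHolder
{
    public function __construct(private string $host) {}

    public function getHost(): string
    {
        return $this->host;
    }

    public function withHost(string $host): static
    {
        // Simplistic stand-in for real validation: reject empty hosts and "/".
        if ($host === "" || str_contains($host, "/")) {
            throw new InvalidArgumentException("Invalid host");
        }

        // Modify a copy, so the object the method was called on stays intact.
        $clone = clone $this;
        $clone->host = $host;

        return $clone;
    }
}

$first = new SketchHostHolder("example.com");
$second = $first->withHost("example.org");

echo $first->getHost(), "\n";  // example.com
echo $second->getHost(), "\n"; // example.org
```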
It's also important to know how withers handle percent-encoding and reserved characters. WHATWG explicitly declares an algorithm for modifying the components, which is discussed in the section (listed below as “setters steps”). But generally speaking, withers accept URI components in an appropriately percent-encoded form:
$url = new Uri\WhatWg\Url("");
$url = $url->withUser("use%72%3A"); // percent-encoded form of "user:"
echo $url->getUser();    // user:
echo $url->getRawUser(); // use%72%3A
As can be seen above, not only reserved characters (”:“) but also unreserved ones (“r”) can be percent-encoded.
Additionally, the WHATWG algorithm automatically percent-encodes characters that fall into the percent-encoding character set of the given component (”?“ and ”#“ in the example below) as well as any invalid characters, as discussed in the Percent-encoding & decoding section:
$url = new Uri\WhatWg\Url("");
$url = $url->withPath("/?#:");
echo $url->getPath();    // /%3F%23:
echo $url->getRawPath(); // /%3F%23:
Let's see another special case where the input contains delimiters:
$url = new Uri\WhatWg\Url("");
$url = $url->withQuery("?foo");
$url = $url->withFragment("#bar");
echo $url->getQuery();    // foo
echo $url->getFragment(); // bar
The above example makes it clear that withers optionally accept leading delimiter characters (”?“ for the query, and ”#“ for the fragment), even if they are not returned by the relevant getters. This is possible because ”?“ doesn't have any special meaning in the context of the query component, and neither does ”#“ in the context of the fragment.
The above examples showed how Uri\WhatWg\Url handles percent-encoding during component modification, but Uri\Rfc3986\Uri has not been discussed yet. It's because RFC 3986 doesn't have an explicitly specified algorithm for component modification. That's why the present RFC has to set the rules:
In order to offer consistent behavior with RFC 3986 parsing rules, withers of Uri\Rfc3986\Uri also only accept properly formatted input, meaning characters that are not allowed to be present in a component must be percent-encoded. Let's see what this means in practice through the following example:
$uri = new Uri\Rfc3986\Uri("");
$uri = $uri->withQuery("foo%5B%5D=%23"); // percent-encoded form of "foo[]=#"
echo $uri->getQuery();    // foo%5B%5D=%23
echo $uri->getRawQuery(); // foo%5B%5D=%23
Characters ”[“, ”]“, and ”#“ are disallowed in the query component, therefore they have to be percent-encoded. Since all of them are reserved characters, they cannot be normalized - and therefore percent-decoded -, so even the “normalized-decoded” representation of ”[]“ is ”%5B%5D“. This is also important to achieve the “roundtripable data” property.
$url = new Uri\WhatWg\Url("");
$url = $url->withQuery("foo%5B%5D=%23"); // percent-encoded form of "foo[]=#"
echo $url->getQuery();    // foo%5B%5D=%23
echo $url->getRawQuery(); // foo%5B%5D=%23
The same example for WHATWG gives the same result, but for different reasons. According to WHATWG, ”[“ and ”]“ are not allowed to be present in a URL (they are not URL code points), that's why they shouldn't be percent-decoded either. Character ”#“ is in the query percent-encode set (in order to be able to distinguish the fragment component after the query), that's why it should also stay percent-encoded. All these rules make it possible to satisfy the “roundtripable data” property of the “normalized-decoded” representation for WHATWG.
Component recomposition
Besides accessors, URI implementations contain various “toString” methods as well. They can be used for recomposing the URI components back to a string. Why are such methods necessary at all instead of simply returning the input URI string? It's because URI parsers may have applied some modifications to the input during parsing. This is specifically the case for the WHATWG specification, since it mandates quite a few transformations.
Uri\WhatWg\Url has two “toString” methods to provide both a machine-friendly and a human-friendly format:
$url = new Uri\WhatWg\Url("HTTPS://////");
echo $url->toString(); //
$url = new Uri\WhatWg\Url("HTTPS://////你好你好.com");
echo $url->toString();        // https://xn--6qqa088eba/
echo $url->toDisplayString(); // https://你好你好/
The toString() method recomposes the URI in a format which is most suitable for machine processing (host names using IDNA characters are translated to ASCII characters), while the toDisplayString() method is a user-friendly representation that displays the host as a Unicode string.
As RFC 3986 doesn't support IDNA, its two “toString” methods don't differentiate based on the target audience, but rather whether normalization is performed:
$uri = new Uri\Rfc3986\Uri("HTTPS://");
echo $uri->toString(); // HTTPS://
$uri = new Uri\Rfc3986\Uri("HTTPS://");
echo $uri->toNormalizedString(); //
The Uri\Rfc3986\Uri::toString() method returns the unnormalized URI string, while Uri\Rfc3986\Uri::toNormalizedString() normalizes its return value.
Another example showcasing that Uri\Rfc3986\Uri doesn't support IDNA:
$uri = Uri\Rfc3986\Uri::parse("https://你好你好.com");
var_dump($uri); // NULL
$uri = Uri\Rfc3986\Uri::parse(""); // percent-encoded form of https://你好你好.com
echo $uri->toString(); //
Furthermore, as mentioned in the Component recomposition section, WHATWG normalizes the IPv4 and IPv6 addresses in the host component:
$url = new Uri\WhatWg\Url("https://[0:0::1]/");
echo $url->toDisplayString(); // https://[::1]
The attentive reader may have noticed that neither URI implementation contains a __toString() magic method. This is a deliberate design decision not to add this method to the built-in URI classes, as doing so could cause incorrect results when using equality comparison (==). Given the following example:
$uri = new Uri\Rfc3986\Uri("");
var_dump($uri == 'HTTPS://');
The output would be bool(false)
if Uri\Rfc3986\Uri
contained a __toString()
method, because of the $uri
object being automatically converted to its string representation (
) which is then compared against HTTPS://
. However, the two URIs should indeed be equal as a result of normalization. Furthermore, equality of URIs disregards the fragment component by default, thus a
URI would also yield a false positive result in the example.
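The pitfall can be demonstrated with any userland class that defines __toString(). The StringableUri class below is hypothetical and exists only to show how == degrades to a plain string comparison. The host name used is an arbitrary example.

```php
<?php
// Hypothetical class, only to illustrate the pitfall; the proposed URI classes
// deliberately do NOT define __toString().
final class StringableUri
{
    public function __construct(private string $uri) {}

    public function __toString(): string
    {
        return $this->uri;
    }
}

$uri = new StringableUri("https://example.com");

// The object is converted to a string, so this is a byte-for-byte comparison
// that knows nothing about URI normalization.
var_dump($uri == "HTTPS://EXAMPLE.COM"); // bool(false)
var_dump($uri == "https://example.com"); // bool(true)
```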
The equals()
method only accepts URI objects of the same specification, since it doesn't make sense to compare URIs of different standards. Then it normalizes (if applicable) and recomposes the URI represented by the object as well as the URI received in the argument list to a string, and checks whether the two strings match. By default, the fragment component is disregarded.
// An RFC 3986 URI equals another RFC 3986 URI that has the same string representation after normalization.
$uri = new Uri\Rfc3986\Uri("https://example.COM#foo");
$uri->equals(new Uri\Rfc3986\Uri("https://EXAMPLE.COM")); // true
// The fragment component of Uri\Rfc3986\Uri can also be taken into account
$uri = new Uri\Rfc3986\Uri("");
$uri->equals(new Uri\Rfc3986\Uri(""), true); // false
// A WHATWG URL equals another WHATWG URL that has the same string representation
$url = new Uri\WhatWg\Url("https:////example.COM/");
$url->equals(new Uri\WhatWg\Url("https://EXAMPLE.COM")); // true
// The fragment component of Uri\WhatWg\Url can also be taken into account
$url = new Uri\WhatWg\Url("");
$url->equals(new Uri\WhatWg\Url(""), true); // false
// A URI cannot be compared against another URI of a different specification
$url = new Uri\Rfc3986\Uri("");
$url->equals(new Uri\WhatWg\Url("")); // throws TypeError
It should be noted that the equals() method could also accept URI strings. It was a deliberate decision not to allow such arguments, because it would be unclear how the comparison works in this case: should the passed-in string also be normalized, or should an exact string match be performed? This is a question that doesn't have to be answered when only a URI object parameter type is supported.
The same question - combined with the fact that the construct is not supported in userland - led us not to overload the equality operator.
Cloning of URIs is supported. Actually, withers also clone the object they are invoked on before actually performing component modification in order to guarantee immutability.
$uri1 = new Uri\Rfc3986\Uri("");
$uri2 = clone $uri1; // creates a new Uri instance
$url1 = new Uri\WhatWg\Url("");
$url2 = clone $url1; // creates a new Url instance
Both built-in URI classes support serialization and deserialization, albeit a little differently than most classes do: the serialized form only includes the recomposed URI itself, exposed as the __uri field, while the individual properties or URI components are not present. This approach simplifies deserialization: it only has to perform regular URI string parsing, and the individual URI components are not needed.
$uri = new Uri\Rfc3986\Uri("");
echo serialize($uri); // O:15:"Uri\Rfc3986\Uri":1:{s:5:"__uri";s:27:"";}
The same example with Uri\WhatWg\Url:
$url = new Uri\WhatWg\Url("");
echo serialize($url); // O:14:"Uri\WhatWg\Url":1:{s:5:"__uri";s:27:"";}
The approach of using the above mentioned “meta” field requires reserving the $__uri property from being used by any 3rd party URI implementations. When such a property is present during serialization, a Uri\UriOperationException is thrown. Conversely, Uri\UriOperationException is also thrown when the string being used for deserialization doesn't contain the __uri field, or it's not a string.
// The following line throws Uri\UriOperationException because of the missing __uri field
unserialize('O:15:"Uri\Rfc3986\Uri":1:{s:3:"uri";s:19:"";}');
// The following line throws Uri\UriOperationException because the __uri field has a wrong type
unserialize('O:14:"Uri\WhatWg\Url":1:{s:5:"__uri";i:1;}');
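The "serialize only the recomposed string" idea can be approximated in userland with the __serialize() and __unserialize() magic methods. The SketchUri class below is an assumption made for illustration: it throws the generic UnexpectedValueException rather than the proposed Uri\UriOperationException, and it performs no real URI parsing.

```php
<?php
// Hypothetical class, not the proposed implementation.
final class SketchUri
{
    public function __construct(private string $uri) {}

    public function toString(): string
    {
        return $this->uri;
    }

    public function __serialize(): array
    {
        // Only the recomposed string is stored, never the individual components.
        return ["__uri" => $this->uri];
    }

    public function __unserialize(array $data): void
    {
        if (!isset($data["__uri"]) || !is_string($data["__uri"])) {
            throw new UnexpectedValueException('Missing or non-string "__uri" field');
        }

        // A real implementation would re-parse the string at this point.
        $this->uri = $data["__uri"];
    }
}

$serialized = serialize(new SketchUri("https://example.com/foo"));
echo $serialized, "\n";

$copy = unserialize($serialized);
echo $copy->toString(), "\n"; // https://example.com/foo
```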
The two built-in URI classes implement the __debugInfo() magic method in order to expose their internal state for debugging purposes:
$uri = new Uri\Rfc3986\Uri("");
var_dump($uri);
/*
object(Uri\Rfc3986\Uri)#1 (9) {
  ["scheme"]=>
  string(5) "https"
  ["userinfo"]=>
  NULL
  ["user"]=>
  NULL
  ["password"]=>
  NULL
  ["host"]=>
  string(11) ""
  ["port"]=>
  NULL
  ["path"]=>
  string(5) "/foo/"
  ["query"]=>
  NULL
  ["fragment"]=>
  NULL
}
*/
Even though the example above uses Uri\Rfc3986\Uri, Uri\WhatWg\Url behaves the same way.
The proposal adds 4 new exceptions that are triggered under the following circumstances:
- Uri\UninitializedUriException: Any method call that tries to use the internally stored URI triggers this exception when a URI instance was created without actually invoking any of the following: the constructor, the parse() or the __unserialize() method. This can happen for example when the object is instantiated via ReflectionClass::newInstanceWithoutConstructor()
- Uri\UriOperationException: It's thrown when any operation involving URIs fails, specifically in the following scenarios: when cloning a URI (which may fail due to memory issues), or during deserialization (as discussed in the Serialization section). Theoretically, URI component reading may also trigger this exception, but in practice this can never happen with the built-in URI implementations, because they never fail.
- Uri\InvalidUriException: It's thrown when URI string parsing is unsuccessful in case of __construct(), all wither methods, resolve(), as well as during deserialization.
All of them extend the fourth exception, Uri\UriException.
The capability provided by parse_url() is used for multiple purposes in the internal PHP source:
- SoapClient::__doRequest(): parsing the $location parameter as well as the value of the Location header
- FTP/FTPS stream wrapper: is used for connecting to a URL, renaming a file, following the Location
- FILTER_VALIDATE_URL: validating URLs
- SSL/TLS socket communication: parsing the target URL
- GET/POST session: accepting the session ID from the query string, manipulating the output URL to automatically include the session ID (Deprecate GET/POST sessions RFC)
It would cause inconsistency and a security vulnerability if parsing of URI strings based on the two specifications referred above were supported in userland, but the legacy parse_url() based behavior was kept internally without the possibility to use the new API. That's why the current RFC was designed with pluggability in mind.
Specifically, supported parser backends would have to be registered in a way similar to how password hashing algorithms are registered. On one hand, this approach makes it possible for 3rd party extensions to leverage URI parser backends other than the built-in ones (e.g. support for ADA URL could also be added). But more importantly, an internal “interface” for parsing and handling URIs is defined this way so that it now becomes possible to configure the used backend for each use-case. Please note that URI parser backend registration is only supported by internal code: registering custom user-land implementations is not possible for now, mainly in order to prevent a possible new attack surface.
While it would sound natural to add a php.ini configuration option to configure the used parser backend globally, this option was rejected during the discussion period of the RFC because it would result in unsafe code that is controlled by global state: since any invoked piece of code can change the used parser backend, one should always check the current value of the config option before parsing URI strings (and in case of libraries, the original option should also be reset after usage). Instead, the RFC proposes to add the following configuration options that only affect a single use-case:
- SoapClient::__doRequest(): a new optional $uriParserClass parameter is added accepting string or null arguments. Null represents the original (parse_url()) based method, while the new backends will be used when passing either Uri\Rfc3986\Uri::class or Uri\WhatWg\Url::class.
- FTP/FTPS stream wrapper: a new uri_parser_class stream context option is added
- FILTER_VALIDATE_URL: functions can be configured by passing a uri_parser_class key to the $options array
- SSL/TLS socket communication: a new uri_parser_class stream context option is added
- GET/POST session: since this feature is deprecated by the Deprecate GET/POST sessions RFC, no configuration is added.
There are certain file-handling functions that can already accept URIs as strings: these include file_get_contents(), file(), and fopen(). As per the current proposal, the URI parser can be supplied in the $context parameter to these functions, but this approach is somewhat tedious, especially if the URI already had to be parsed previously (e.g. for validation purposes). Let's consider the following example:
$url = $_GET['url'];
validate_url($url);
$context = stream_context_create([
    "uri_parser_class" => \Uri\Rfc3986\Uri::class,
]);
$contents = file_get_contents($url, context: $context);
Even though there are other, much more convenient approaches, the current RFC still goes with the current, less ergonomic one, as going either way would need more discussion, resulting in scope creep. The improvement possibilities include passing URI instances to the functions in question, or converting URIs to streams based on Java's example.
Parser Library Choice
Adding a WHATWG compliant URL parser to the standard library was originally attempted in 2023. The implementation used the ADA URL parser as its parser backend, which is known for its high performance. In the end, the proof of concept was abandoned due to some technical limitations that couldn't be resolved.
Specifically, ADA is written in C++ and requires a compiler supporting at least C++17. Despite the fact that it has a C wrapper, its tight compiler requirements would make it unprecedented, and practically impossible, to add the URI extension to PHP as a required extension: PHP has never had a C++ compiler dependency for the always enabled extensions, and only optional extensions (like Intl) can be written in C++.
The firm position of this RFC is that a URL parser extension should always be available, therefore a different parser backend written in pure C had to be found. Fortunately, Niels Dossche proposed PHP RFC: DOM HTML5 parsing and serialization not long after the experiment with ADA, and his work required bundling parts of the Lexbor browser engine. This library is written in C, and coincidentally contains a WHATWG compliant URL parsing submodule, which makes it suitable to be used as the library of choice.
For parsing URIs according to RFC 3986, the URIParser library was chosen. It is a lightweight and fast C library with no dependencies. It uses the “new BSD license” which is compatible with the current PHP license as well as the PHP License Update RFC.
Performance Considerations
The implementation of parse_url() is optimized for performance. This also means that it doesn't deal with validation properly and disregards some edge cases. A fully standard compliant parser will generally be slower than parse_url(), because it has to execute more code. Fortunately, this overhead is acceptable thanks to the efforts of the maintainers of the Lexbor and the uriparser libraries.
According to the rough benchmarks performed on a Linux instance in GitHub Actions, the following results were measured:
Time of parsing of a basic URL (1000 times)
:0.000233 sec
:0.000298 sec
:0.000394 sec
Time of parsing of a complex URL (1000 times)
:0.000817 sec
:0.000917 sec
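A rough measurement of this kind can be reproduced with hrtime(). The sketch below only times the existing parse_url() function. The URL is an arbitrary example rather than the one used for the figures above, and the resulting numbers depend entirely on the machine.

```php
<?php
// Arbitrary example input; not the URL used for the figures above.
$url = "https://example.com/foo?bar=baz";
$iterations = 1000;

$start = hrtime(true); // monotonic clock, in nanoseconds
for ($i = 0; $i < $iterations; $i++) {
    parse_url($url);
}
$elapsedSeconds = (hrtime(true) - $start) / 1e9;

printf("parse_url(): %.6f sec for %d iterations\n", $elapsedSeconds, $iterations);
```

Timing another parser with the same loop and input keeps the comparison consistent.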
The following sections give some additional context and explanation for the questions that had to be answered during the discussion phase of the RFC.
Naming considerations
After multiple iterations, the RFC finally settled on using the Uri\Rfc3986\Uri and the Uri\WhatWg\Url class names. By having different subnamespaces for the two specifications, it became possible to group together all the WHATWG related classes (Uri\WhatWg\WhatWgErrorType, Uri\WhatWg\WhatWgError). Additionally, the chosen class names (Uri and Url) try to disambiguate how the two specifications actually work:
- RFC 3986 works with actual relative URIs which don't have a scheme
- WHATWG can only work with URLs (URIs that have a scheme)
The additional benefit of using different class names is that there is no clash when both classes are imported into the same PHP file.
Why is a common URI interface not supported?
PSR-7 UriInterface is currently the de-facto interface for representing URIs in userland. That's why it seemed a good candidate for adoption at first glance. However, the current RFC didn't pursue reusing it for the following reasons:
- PSR-7 strictly follows the RFC 3986 standard, and therefore only has a notion of "userinfo", rather than "user" and "password" which is used by the WHATWG specification.
- PSR-7's UriInterface has non-nullable method return types except for UriInterface::getPort(), whereas WHATWG specifically allows null values for the majority of the components.
As an alternative, the RFC attempted to define a new common URI interface (called Uri\Uri), but it turned out late in the RFC process that the RFC 3986 and WHATWG specifications have so many smaller or bigger differences between them that a common URI interface is not really feasible to define.
Why is the "user:password" format of the "User Information" component of RFC 3986 supported?
RFC 3986 states the following when discussing the format of the “userinfo” component:
The userinfo subcomponent may consist of a user name and, optionally, scheme-specific information about how to gain authorization to access the resource. The user information, if present, is followed by a commercial at-sign (”@“) that delimits it from the host.
The definition is then extended with the following warning:
Use of the format “user:password” in the userinfo field is deprecated. Applications should not render as clear text any data after the first colon (”:“) character found within a userinfo subcomponent unless the data after the colon is the empty string (indicating no password)
The above sentences have always been a source of contention over whether the Uri\Rfc3986\Uri class should handle the userinfo component in strict conformance with the RFC, or whether it is possible to add dedicated methods for the “user:password” format as “syntactic sugar”.
The position of the RFC is that the “user:password” format deserves special attention in spite of the fact that it's deprecated, because it's still the most often used format in the wild by far. That's why the dedicated getters (getUser(), getRawUser(), getPassword()) are added to Uri\Rfc3986\Uri. Dedicated withers are not added, because Uri\Rfc3986\Uri::withUserInfo() is trivial to use with passwords:
$uri = new Uri\Rfc3986\Uri("");
$uri = $uri->withUserInfo($uri->getUser() . ":password");
echo $uri->toString(); //
Previously, UriInterface of PSR-7 only added special support for the password component in its withUserInfo() method. Unfortunately, rather than setting user and password separately, the most recurring problem people face is to retrieve these two components separately. Not to mention the fact that setting a new password with the same user is still very cumbersome to achieve with the approach of PSR-7:
$uri = new \Laminas\Diactoros\Uri("");
$userInfo = explode(":", $uri->getUserInfo());
$username = $userInfo[0];
$uri = $uri->withUserInfo($username, "new_password");
That's why the current RFC doesn't try to follow the solution chosen by PSR-7, but rather solves the problem of dealing with passwords the other way around.
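For comparison, retrieving the user and the password from a userinfo string by hand could look like the following sketch. The splitUserInfo() helper is hypothetical. It splits at the first colon only, so colons inside the password are preserved, and it performs no percent-decoding.

```php
<?php
// Hypothetical helper, not part of the proposal or of PSR-7.
function splitUserInfo(string $userInfo): array
{
    // A limit of 2 splits at the first ":" only.
    $parts = explode(":", $userInfo, 2);

    return [
        "user" => $parts[0],
        "password" => $parts[1] ?? null, // null when there is no ":" at all
    ];
}

var_dump(splitUserInfo("user:pa:ss")); // user => "user", password => "pa:ss"
var_dump(splitUserInfo("user"));       // user => "user", password => NULL
```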
Why Is Query Parameter Manipulation Not Supported?
It would be very useful for a URI implementation to support direct query parameter manipulation. Actually, the WHATWG specification contains a URLSearchParams interface that could be used for the purpose. However, the position of this RFC is not to include this interface yet for the following reasons:
- Query string parsing is a fuzzy area, since there are no established rules for parsing
- The URLSearchParams interface doesn't follow either RFC 1738 or RFC 3986
- The already large scope of the RFC would increase even more
For all these reasons, the topic of query parameter manipulation should be discussed as a followup to the current RFC.
How should URI modification work?
Since URIs are value objects inherently, this RFC models them as immutable classes that support modification through withers. The usage of withers comes with some performance penalty - as a new instance is created for each modification -, but this is a necessity in order to hold identity constraints.
Alternatively, it would be possible to make URIs completely immutable by using the builder pattern to construct and modify URIs (i.e. by having a Uri\Rfc3986\UriBuilder and a Uri\WhatWg\UrlBuilder class). This way, new Uri instances would only be created once: after the very last modification. This is especially true when one wants to construct a completely new URI. That's why this solution seems more optimized than the wither based approach.
However, this is not always true. When one wants to modify only a single detail of a URI, then withers are not only easier to use but are more efficient as well:
// Redirection of HTTP traffic to HTTPS by using withers
$uri = new Uri\Rfc3986\Uri("");
$uri = $uri->withScheme("https"); // a new URI instance is created at this point
In contrast, the following piece of code would have to be used if URIs didn't support modification (given a hypothetical Uri\Rfc3986\UriBuilder class):
// Redirection of HTTP traffic to HTTPS by using the builder pattern
$uri = new Uri\Rfc3986\Uri("");
$builder = Uri\Rfc3986\UriBuilder::fromUri($uri);
$builder->setScheme("https"); // overwrites the URI scheme
$uri = $builder->build(); // a new URI instance is created at this point
The above example makes it clear that the builder pattern mostly shines when it can save multiple instance creations, and it's especially true if a URI has to be constructed from scratch:
// Construction of a new URI from scratch by using the builder pattern
$builder = new Uri\Rfc3986\UriBuilder();
$builder->setScheme("https")
    ->setHost("")
    ->setPath("/foo");
$uri = $builder->build(); // a new URI instance is created at this point
Builder classes are not offered by the present RFC just yet. They definitely have their use-case, as they can help write more optimized code, but they are not essential from the get-go. Therefore, this feature is one of the top candidates for a followup RFC.
Examples in Other Languages
Even though Go's standard library ships with a net/url package containing a url.Parse() function along with some utility functions, unfortunately the documentation doesn't highlight which specification it conforms to. However, it's not very promising that the manual contains the following sentence:
Trying to parse a hostname and path without a scheme is invalid but may not necessarily return an error, due to parsing ambiguities.
In Java, a URL class has been available from the beginning. Unfortunately, it's unclear whether it adheres to any URI specification. Speaking about its design, URL itself is immutable, and somewhat peculiarly, it contains some methods which can open a connection to the URL, or get its content.
Since Java 20, all of the URL constructors are deprecated in favor of using URI.toURL(). The URI class conforms to the RFC 2396 standard.
C# has extensive support for URIs, although the documentation doesn't mention which specification it uses. Uniquely, the standard library offers advanced features such as a UriBuilder, and customizable URI Parsers.
NodeJS recently added support for a decent WHATWG URL compliant URL parser, built on top of the ADA URL parser project.
Python also comes with built-in support for parsing URLs, made available by the urllib.parse.urlparse and urllib.parse.urlsplit functions. According to the documentation, “these functions incorporate some aspects of both [the WHATWG URL and the RFC 3986 specifications], but cannot be claimed compliant with either”.
Backward Incompatible Changes
A new parameter is added to SoapClient::__doRequest(). When this method is overridden, the $uriParserClass parameter has to be added to the parameter list.
Proposed PHP Version(s)
The next minor PHP version (either PHP 8.5 or 9.0, whichever comes first).
RFC Impact
SAPIs should adopt the new internal API for parsing URIs instead of using the existing php_url_parse*() API. Additionally, they should add support for configuring the URI parsing backend.
To Existing Extensions
Extensions should adopt the new internal API for parsing URIs instead of using the existing php_url_parse*() API. Additionally, they should add support for configuring the URI parsing backend.
To Opcache
Future Scope
- Support for a UriBuilder class, similar to the one implemented by C#
- Support for an abstraction for manipulating query parameters, like URLSearchParams defined by WHATWG
- The parse_url() function can be deprecated at some distant point in time
Discussion thread:
The vote requires 2/3 majority in order to be accepted.