URIs and URLs are among the most fundamental concepts of the web because they make it possible to reference specific resources on a network. URLs were originally defined by Tim Berners-Lee in RFC 1738, but other specifications have emerged since then, of which RFC 3986 and WHATWG URL are the most notable. The former updates the original RFC 1738 and defines URIs, while the latter specifies how browsers should treat URLs.
Despite the ubiquitous nature of URLs and URIs, they are not as unambiguous as people may think, because different clients treat and parse them differently, either by following one of the standards or, even worse, by not following any at all. Unfortunately, PHP falls into the latter category: the parse_url() function is offered for parsing URLs, but it isn't compliant with any standard. Even the PHP manual contains the following warning:
This function may not give correct results for relative or invalid URLs, and the results may not even match common behavior of HTTP clients. ...
Incompatibility with current standards is a serious issue, as it hinders interoperability with different tools (e.g. HTTP clients), or it can result in bugs that are difficult to notice. For example, cURL's URL parsing implementation is based on RFC 3986, so URLs validated by FILTER_VALIDATE_URL may not necessarily be accepted when passed to cURL.
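The lenient behavior of parse_url() is easy to observe. The following is a small sketch with illustrative input URLs (they are assumptions chosen for the demonstration); it shows that the function only splits the string and neither validates nor normalizes it:

```php
<?php
// parse_url() only splits the string: the scheme and host keep their case,
// and percent-encoded octets are left untouched.
$parts = parse_url("HTTPS://EXAMPLE.COM/%7Euser?x=1#top");
var_dump($parts["scheme"]); // string(5) "HTTPS"
var_dump($parts["host"]);   // string(11) "EXAMPLE.COM"
var_dump($parts["path"]);   // string(8) "/%7Euser"

// Without a scheme or a leading "//", the whole input is treated as a path,
// so the intended host name never shows up as a "host" component.
$schemeless = parse_url("example.com/path");
var_dump($schemeless["path"]);           // string(16) "example.com/path"
var_dump(isset($schemeless["host"]));    // bool(false)
```

A standards-based parser has to make an explicit decision in both cases, which is exactly what the two specifications disagree on.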
In order to address the above mentioned problems, a new, always available URI extension is to be added to the standard library. The extension would support parsing, validating, modifying, and recomposing (converting the parsed structures back to strings) URIs based on both the RFC 3986 and the WHATWG URL specifications, as well as resolving references (turning a relative URI into an absolute one by applying a base URI to it). For this purpose, the following internal classes and methods are added:
namespace Uri {
    class UriException extends \Exception
    {
    }

    class UninitializedUriException extends \Uri\UriException
    {
    }

    class UriOperationException extends \Uri\UriException
    {
    }

    class InvalidUriException extends \Uri\UriException
    {
        public readonly array $errors;
    }
}
namespace Uri\Rfc3986 {
    readonly class Uri
    {
        public static function parse(string $uri, ?string $baseUrl = null): ?static {}

        public function __construct(string $uri, ?string $baseUrl = null) {}

        public function getScheme(): ?string {}
        public function getRawScheme(): ?string {}
        public function withScheme(?string $encodedScheme): static {}

        public function getUser(): ?string {}
        public function getRawUser(): ?string {}
        public function withUser(?string $encodedUser): static {}

        public function getPassword(): ?string {}
        public function getRawPassword(): ?string {}
        public function withPassword(?string $encodedPassword): static {}

        public function getHost(): ?string {}
        public function getRawHost(): ?string {}
        public function withHost(?string $encodedHost): static {}

        public function getPort(): ?int {}
        public function withPort(?int $port): static {}

        public function getPath(): ?string {}
        public function getRawPath(): ?string {}
        public function withPath(?string $encodedPath): static {}

        public function getQuery(): ?string {}
        public function getRawQuery(): ?string {}
        public function withQuery(?string $encodedQuery): static {}

        public function getFragment(): ?string {}
        public function getRawFragment(): ?string {}
        public function withFragment(?string $encodedFragment): static {}

        public function equals(Uri $uri, bool $excludeFragment = true): bool {}

        public function toNormalizedString(): string {}
        public function toString(): string {}

        public function resolve(string $uri): static {}

        public function __serialize(): array {}
        public function __unserialize(array $data): void {}
        public function __debugInfo(): array {}
    }
}
namespace Uri\WhatWg {
    readonly class Url
    {
        /** @param array<int, WhatWgError> $errors */
        public static function parse(string $uri, ?string $baseUrl = null, &$errors = null): ?static {}

        /** @param array<int, WhatWgError> $softErrors */
        public function __construct(string $uri, ?string $baseUrl = null, &$softErrors = null) {}

        public function getScheme(): string {}
        public function getRawScheme(): string {}
        public function withScheme(string $encodedScheme): static {}

        public function getUser(): ?string {}
        public function getRawUser(): ?string {}
        public function withUser(?string $encodedUser): static {}

        public function getPassword(): ?string {}
        public function getRawPassword(): ?string {}
        public function withPassword(?string $encodedPassword): static {}

        public function getHost(): string {}
        public function getHumanFriendlyHost(): string {}
        public function withHost(string $encodedHost): static {}

        public function getPort(): ?int {}
        public function withPort(?int $encodedPort): static {}

        public function getPath(): ?string {}
        public function getRawPath(): ?string {}
        public function withPath(?string $encodedPath): static {}

        public function getQuery(): ?string {}
        public function getRawQuery(): ?string {}
        public function withQuery(?string $encodedQuery): static {}

        public function getFragment(): ?string {}
        public function getRawFragment(): ?string {}
        public function withFragment(?string $encodedFragment): static {}

        public function equals(Url $uri, bool $excludeFragment = true): bool {}

        public function toMachineFriendlyString(): string {}
        public function toHumanFriendlyString(): string {}

        public function resolve(string $uri): static {}

        public function __serialize(): array {}
        public function __unserialize(array $data): void {}
        public function __debugInfo(): array {}
    }

    enum WhatWgErrorType
    {
        case DomainToAscii;
        case DomainToUnicode;
        case DomainInvalidCodePoint;
        case HostInvalidCodePoint;
        case Ipv4EmptyPart;
        case Ipv4TooManyParts;
        case Ipv4NonNumericPart;
        case Ipv4NonDecimalPart;
        case Ipv4OutOfRangePart;
        case Ipv6Unclosed;
        case Ipv6InvalidCompression;
        case Ipv6TooManyPieces;
        case Ipv6MultipleCompression;
        case Ipv6InvalidCodePoint;
        case Ipv6TooFewPieces;
        case Ipv4InIpv6TooManyPieces;
        case Ipv4InIpv6InvalidCodePoint;
        case Ipv4InIpv6OutOfRangePart;
        case Ipv4InIpv6TooFewParts;
        case InvalidUrlUnit;
        case SpecialSchemeMissingFollowingSolidus;
        case MissingSchemeNonRelativeUrl;
        case InvalidReverseSolidus;
        case InvalidCredentials;
        case HostMissing;
        case PortOutOfRange;
        case PortInvalid;
        case FileInvalidWindowsDriveLetter;
        case FileInvalidWindowsDriveLetterHost;
    }

    readonly class WhatWgError
    {
        public string $context;
        public WhatWgErrorType $type;

        public function __construct(string $context, WhatWgErrorType $type) {}
    }
}
First and foremost, the new URI parsing API contains two URI implementations, Uri\Rfc3986\Uri and Uri\WhatWg\Url, representing RFC 3986 URIs and WHATWG URLs, respectively. Having separate classes for the two specifications makes it possible to properly model URIs with all their details and nuances. In fact, wrong assumptions about the origin of a URI could cause a security vulnerability, as Daniel Stenberg (author of cURL) writes in one of his blog posts. That's why, at least in security-sensitive applications, it's very important to explicitly require the usage of one specific standard.
Both built-in URI implementations support instantiation via two methods:
- the constructor: it expects the URI string and an optional base URL, and in case of an invalid URI, a Uri\InvalidUriException is thrown.
- the parse() factory method: it expects the same parameters as the constructor does, but in case of an invalid URI, null is returned instead of throwing an exception. Using this method is recommended for validating URIs and/or parsing URIs from untrusted input.

$uri = new Uri\Rfc3986\Uri("https://example.com");            // An RFC 3986 URI instance is created
$uri = Uri\Rfc3986\Uri::parse("https://example.com");         // An RFC 3986 URI instance is created
$uri = new Uri\Rfc3986\Uri("invalid uri");                    // A Uri\InvalidUriException is thrown
$uri = Uri\Rfc3986\Uri::parse("invalid uri");                 // null is returned in case of an invalid URI

$url = new Uri\WhatWg\Url("https://example.com");             // A WHATWG URL instance is created
$url = Uri\WhatWg\Url::parse("https://example.com");          // A WHATWG URL instance is created
$url = new Uri\WhatWg\Url("invalid uri");                     // A Uri\InvalidUriException is thrown
$url = Uri\WhatWg\Url::parse("invalid uri", null, $errors);   // null is returned, and an array of WhatWgError objects is passed by reference to $errors
As can be seen, Uri\WhatWg\Url::parse() can pass additional information about the triggered validation errors by reference, as specified by WHATWG. In the example above, $errors will contain the following value:
array(1) {
  [0]=>
  object(Uri\WhatWg\WhatWgError)#1 (2) {
    ["context"]=>
    string(11) "invalid uri"
    ["type"]=>
    enum(Uri\WhatWg\WhatWgErrorType::MissingSchemeNonRelativeUrl)
  }
}
However, it is also possible that a WHATWG URL can be parsed successfully with some validation errors. When using the constructor, only soft errors are passed by reference, while hard errors are thrown. The following example demonstrates a soft error:
$softErrors = [];
$url = new Uri\WhatWg\Url(" https://example.org", null, $softErrors);

var_dump($url->toString());      // https://example.org
var_dump($softErrors[0]->type);  // enum(Uri\WhatWg\WhatWgErrorType::InvalidUrlUnit)
The two built-in URI implementations are readonly, and they have a respective private property for each URI component. These URI components can be retrieved via getters, and immutable modification is possible via “wither” methods. While property hooks and/or asymmetric visibility would make it possible to get rid of the getters, the position of this RFC is to still go with regular get*() method calls as the conservative option, especially because hooked properties cannot be readonly: the author of this RFC believes that it's more important to guarantee the immutability of URI implementations than to optimize performance by eliminating (getter) method calls. Moreover, getters may benefit from additional optional parameters in the future, should more control over the encoding of the output be needed.
$uri1 = new Uri\Rfc3986\Uri("https://example.com");
$uri2 = $uri1->withHost("test.com");

echo $uri1->getHost(); // example.com
echo $uri2->getHost(); // test.com
The above example demonstrates that withers create a new instance for each modification, leaving the original object intact. However, an exception is thrown if a modification resulted in an invalid URI. This way, URIs can always stay valid:
$uri1 = new Uri\Rfc3986\Uri("https://example.com");
$uri1->withHost("/"); // A Uri\InvalidUriException is thrown
Besides accessors, URI implementations contain a toString() method too. This can be used for recomposing the URI components back to a string. Why is such a method necessary at all? Because the recomposition process doesn't necessarily return the input URI unchanged, but applies some modifications to it. The WHATWG standard specifically mandates quite a few transformations (e.g. removal of extraneous / characters after the scheme, lowercasing some URI components, application of IDNA encoding). While some transformations are also required by default for RFC 3986, they are less frequent than for WHATWG.
$url = new Uri\WhatWg\Url("https://////example.com");
echo $url->toString(); // https://example.com
The attentive reader may have noticed that the examples used toString() instead of __toString(). It is a deliberate design decision not to add a __toString() method to the built-in URI classes, as doing so would cause incorrect results when using equality comparison (==). Given the following example:
$url = new Uri\WhatWg\Url("https://example.com");
var_dump($url == 'HTTPS://example.com');
The output would be bool(false) if Uri\WhatWg\Url contained a __toString() method, because the $url object would be automatically converted to its string representation (https://example.com), which is then compared against HTTPS://example.com. However, as we will see in the following paragraphs, the two URIs should indeed be equal as a result of normalization. Furthermore, equality of URIs usually disregards the fragment component, so a https://example.com#foo URI would also yield a false negative result in the example.
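The string-conversion pitfall can be reproduced today with a userland class. The following is a minimal sketch; StringableUrl is a hypothetical stand-in that only stores the string it was given, and it is not part of the proposed API:

```php
<?php
// Hypothetical stand-in for a URL class that exposes __toString().
final class StringableUrl
{
    public function __construct(private string $url) {}

    public function __toString(): string
    {
        return $this->url;
    }
}

$url = new StringableUrl("https://example.com");

// The object is converted to a string first, then a plain string comparison runs,
// so differences that URI normalization would ignore make the comparison fail.
var_dump($url == "HTTPS://example.com");     // bool(false)
var_dump($url == "https://example.com#foo"); // bool(false)
var_dump($url == "https://example.com");     // bool(true)
```

Leaving out __toString() avoids this silent fallback to string comparison altogether.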
As mentioned above, RFC 3986 has the notion of normalization, which is an optional process for canonicalizing different URIs identifying the same resource to the same URI. Therefore, URI implementations may support normalization via the normalize() method. E.g. the https:///////EXAMPLE.com and the HTTPS://example.com/ URIs identify the same resource, so implementations may normalize both of them to https://example.com. Implementations should apply some kind of normalization techniques on the current URI (e.g. case normalization, percent-encoding normalization) and return a new instance. The toNormalizedString() method is a shorthand for $uri->normalize()->toString(), and it's useful when one needs the normalized string representation, but the URI components themselves don't have to be modified.
Let's see an example for retrieving the normalized path component (foo/../bar becomes bar):
$uri1 = new Uri\Rfc3986\Uri("https://EXAMPLE.COM/foo/../bar");
$uri2 = $uri1->normalize();

echo $uri1->getPath(); // foo/../bar
echo $uri2->getPath(); // bar
Another example for the two ways to return the normalized string representation of a URI:
$uri = new Uri\Rfc3986\Uri("https://EXAMPLE.COM/foo/../bar");

echo $uri->toString();               // https://EXAMPLE.COM/foo/../bar
echo $uri->normalize()->toString();  // https://example.com/bar
echo $uri->toNormalizedString();     // https://example.com/bar
Please note that only Uri\Rfc3986\Uri supports this capability, since the WHATWG specification doesn't have the concept of optional normalization.
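One of the normalization steps, collapsing a path such as /foo/../bar, corresponds to the remove_dot_segments algorithm in section 5.2.4 of RFC 3986. The following is a minimal userland sketch of that algorithm for PHP 8.0 or later; the removeDotSegments() and dropLastSegment() names are assumptions made for illustration, and this is not how the proposed extension implements it:

```php
<?php
// Hypothetical helper: removes the last segment of $output together with its preceding "/".
function dropLastSegment(string $output): string
{
    $pos = strrpos($output, '/');
    return $pos === false ? '' : substr($output, 0, $pos);
}

// Sketch of RFC 3986, section 5.2.4 (remove_dot_segments).
function removeDotSegments(string $path): string
{
    $input = $path;
    $output = '';

    while ($input !== '') {
        if (str_starts_with($input, '../')) {           // step A
            $input = substr($input, 3);
        } elseif (str_starts_with($input, './')) {      // step A
            $input = substr($input, 2);
        } elseif (str_starts_with($input, '/./')) {     // step B: "/./" becomes "/"
            $input = substr($input, 2);
        } elseif ($input === '/.') {                    // step B
            $input = '/';
        } elseif (str_starts_with($input, '/../')) {    // step C: "/../" becomes "/"
            $input = substr($input, 3);
            $output = dropLastSegment($output);
        } elseif ($input === '/..') {                   // step C
            $input = '/';
            $output = dropLastSegment($output);
        } elseif ($input === '.' || $input === '..') {  // step D
            $input = '';
        } else {                                        // step E: move the first segment to the output
            $pos = strpos($input, '/', 1);
            if ($pos === false) {
                $output .= $input;
                $input = '';
            } else {
                $output .= substr($input, 0, $pos);
                $input = substr($input, $pos);
            }
        }
    }

    return $output;
}

echo removeDotSegments("/foo/../bar"), "\n";       // /bar
echo removeDotSegments("/a/b/c/./../../g"), "\n";  // /a/g
echo removeDotSegments("mid/content=5/../6"), "\n"; // mid/6
```

The last two inputs are the worked examples from the RFC 3986 text itself.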
Normalization is especially important when it comes to comparing URIs because it reduces the likelihood of false negative results, since URI comparison is based on checking whether the URIs represent the same resource. The equals() method can be used for comparing URIs. First, the method only accepts URI objects of the same specification, since it doesn't make sense to compare URIs of different standards. Then it normalizes (if applicable) and recomposes both the URI represented by the object and the URI received as the argument to a string, and checks whether the two strings match. By default, the fragment component is disregarded.
// An RFC 3986 URI equals another RFC 3986 URI that has the same string representation after normalization
new Uri\Rfc3986\Uri("https://example.COM")->equals(new Uri\Rfc3986\Uri("https://EXAMPLE.COM")); // true

// A WHATWG URL equals another WHATWG URL that has the same string representation after normalization
new Uri\WhatWg\Url("https:////example.COM/")->equals(new Uri\WhatWg\Url("https://EXAMPLE.COM")); // true

// A URI cannot be compared against another URI of a different standard
new Uri\Rfc3986\Uri("https://example.com/")->equals(new Uri\WhatWg\Url("https://example.com/")); // throws TypeError
It should be noted that the equals() method could also accept URI strings. It was a deliberate decision not to allow such arguments, because it would be unclear how the comparison works in this case: should the passed in string also be normalized, or should an exact string match be performed? This question doesn't have to be answered when only a URI object parameter type is supported.
The same question, combined with the fact that operator overloading is not supported in userland, led us not to overload the equality operator.
Last but not least, URIs support a resolve() method that is able to resolve potentially relative URIs with the current object as the base URI:
$uri = new Uri\Rfc3986\Uri("https://example.com");
echo $uri->resolve("/foo")->toString(); // https://example.com/foo

$url = new Uri\WhatWg\Url("https://example.com");
echo $url->resolve("/foo")->toString(); // https://example.com/foo
This method is a shorthand for new ($uri::class)("/foo", $uri->toString()).
After multiple iterations, the RFC finally settled on the Uri\Rfc3986\Uri and Uri\WhatWg\Url class names. By having different subnamespaces for the two specifications, it became possible to group together all the WHATWG related classes (Uri\WhatWg\WhatWgErrorType, Uri\WhatWg\WhatWgError). Additionally, the chosen class names (Uri and Url) try to disambiguate how the two specifications actually work:
The additional benefit of using different class names is that there is no clash when both classes are imported in the same PHP file.
Encoding and decoding special characters is a crucial aspect of URI parsing. For this purpose, both RFC 3986 and WHATWG use percent-encoding (e.g. the % character is encoded as %25). However, the two standards differ significantly in this regard:
RFC 3986 defines that “URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent”, which means that percent-encoded characters and their decoded form are equivalent. On the contrary, WHATWG defines URL equivalence by the equality of the serialized URLs, and never decodes percent-encoded characters, except in the host. This implies that percent-encoded characters are not equivalent to their decoded form (except in the host).
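The RFC 3986 notion of equivalence quoted above can be illustrated in userland. The following is a minimal sketch, assuming a hypothetical decodeUnreserved() helper (not part of the proposed API) that decodes only percent-encoded unreserved characters (ALPHA, DIGIT, "-", ".", "_", "~") and uppercases the hex digits of all other escapes:

```php
<?php
// Hypothetical helper sketching RFC 3986 percent-encoding normalization:
// only escapes that stand for unreserved characters are decoded.
function decodeUnreserved(string $component): string
{
    return preg_replace_callback(
        '/%([0-9A-Fa-f]{2})/',
        static function (array $match): string {
            $char = chr(hexdec($match[1]));
            if (preg_match('/^[A-Za-z0-9\-._~]$/', $char) === 1) {
                return $char; // unreserved: the decoded form is equivalent
            }
            return '%' . strtoupper($match[1]); // keep encoded, normalize hex case
        },
        $component
    );
}

// Under RFC 3986 rules, both paths normalize to the same string.
var_dump(decodeUnreserved("/%7euser") === decodeUnreserved("/~user")); // bool(true)

// Reserved or other characters stay encoded: "%2f" is not turned into "/".
var_dump(decodeUnreserved("/%7euser/%2fa%20b")); // string(15) "/~user/%2Fa%20b"

// A purely string-based comparison, in contrast, sees two different paths.
var_dump("/%7euser" === "/~user"); // bool(false)
```

The last comparison mirrors the WHATWG position for paths, where the serialized forms are compared without decoding.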
The difference between RFC 3986 and WHATWG stems from the view of a maintainer of the WHATWG specification that web servers may legitimately choose to consider encoded and decoded paths distinct, and a standard cannot force them not to do so. This is a substantial BC break compared to RFC 3986, and judging by the large number of tickets related to this question, it is a source of confusion among users of the WHATWG specification.
PSR-7 UriInterface is currently the de-facto interface for representing URIs in userland. That's why it seemed a good candidate for adoption at first glance. However, the current RFC does not reuse it for the following reasons:
- UriInterface has non-nullable method return types, except for UriInterface::getPort(), whereas WHATWG specifically allows null values.
As an alternative, the RFC attempted to define a new URI interface (called Uri\Uri), but it turned out late in the RFC process that the RFC 3986 and WHATWG specifications have so many smaller or bigger differences between them that a common URI interface is not really feasible to define.
It would be very useful for a URI implementation to support direct query parameter manipulation. Actually, the WHATWG URL specification contains a URLSearchParams interface that could be used for the purpose. However, the position of this RFC is not to include this interface yet for the following reasons:
- The URLSearchParams interface follows neither RFC 1738 nor RFC 3986

For all these reasons, the topic of query parameter manipulation should be discussed as a follow-up to the current RFC.
Adding a WHATWG compliant URL parser to the standard library was originally attempted in 2023. The implementation used the ADA URL parser as its parser backend, which is known for its high performance. In the end, the proof of concept was abandoned due to some technical limitations that couldn't be resolved.
Specifically, ADA is written in C++ and requires a compiler supporting at least C++17. Despite the fact that it has a C wrapper, its tight compiler requirements would make it unprecedented, and practically impossible, to add the URI extension to PHP as a required extension: PHP has never had a C++ compiler dependency for the always enabled extensions, and only optional extensions (like Intl) can be written in C++.
The firm position of this RFC is that a URL parser extension should always be available, therefore a different parser backend written in pure C had to be found. Fortunately, Niels Dossche proposed PHP RFC: DOM HTML5 parsing and serialization not long after the experiment with ADA, and his work required bundling parts of the Lexbor browser engine. This library is written in C, and coincidentally contains a WHATWG compliant URL parsing submodule, which makes it suitable as the library of choice.
For parsing URIs according to RFC 3986, the URIParser library was chosen. It is a lightweight and fast C library with no dependencies. It uses the “new BSD license” which is compatible with the current PHP license as well as the PHP License Update RFC.
The capability provided by parse_url() is used for multiple purposes in the internal PHP source:
- SoapClient::_doRequest(): parsing the $location parameter as well as the value of the Location header
- parse_url() is used for connecting to a URL, renaming a file, following the Location header
- FILTER_VALIDATE_URL: validating URLs
It would cause inconsistency and a security vulnerability if parsing of URIs based on the two specifications referred to above were supported in userland, but the legacy parse_url() based behavior was kept internally without the possibility to use the new API. That's why the current RFC was designed with pluggability in mind.
Specifically, supported parser backends would have to be registered in a way similar to how password hashing algorithms are registered. On one hand, this approach makes it possible for 3rd party extensions to leverage URI parser backends other than the built-in ones (e.g. support for ADA URL could also be added). But more importantly, an internal “interface” for parsing and handling URIs is defined this way, so that it now becomes possible to configure the used backend for each use-case. Please note that URI parser backend registration is only supported for internal code: registering custom user-land implementations is not possible for now, mainly in order to prevent a possible new attack surface.
While it would sound natural to add a php.ini configuration option to configure the used parser backend globally, this option was rejected during the discussion period of the RFC because it would result in unsafe code that is controlled by global state: since any invoked piece of code can change the used parser backend, one should always check the current value of the config option before parsing URIs (and in case of libraries, the original option should also be reset after usage). Instead, the RFC proposes to add the following configuration options that only affect a single use-case:
- SoapClient::_doRequest(): a new optional $uriParserClass parameter is added accepting string or null arguments. null represents the original (parse_url()) based method, while the new backends will be used when passing either Uri\Rfc3986\Uri::class or Uri\WhatWg\Url::class.
- uri_parser_class stream context option is added
- FILTER_VALIDATE_URL: filter_* functions can be configured by passing a uri_parser_class key to the $options array
- uri_parser_class stream context option is added
There are certain file-handling functions that can already accept URIs as strings: these include file_get_contents(), file(), fopen(). As per the current proposal, the URI parser can be supplied in the $context parameter to these functions, but this approach is somewhat tedious, especially if the URI already had to be parsed previously (e.g. for validation purposes). Let's consider the following example:
$url = $_GET['url'];
validate_url($url);

$context = stream_context_create([
    "uri_parser_class" => \Uri\Rfc3986\Uri::class,
]);

$contents = file_get_contents($url, context: $context);
There are other, much more convenient approaches, but the current RFC still goes with the current, less ergonomic one, as any of the alternatives would need more discussion and would cause scope creep. The improvement possibilities include passing URI instances to the functions in question, or converting URIs to streams based on Java's example.
The implementation of parse_url() is optimized for performance. This also means that it doesn't deal with validation properly and disregards some edge cases. A fully standards compliant parser will generally be slower than parse_url(), because it has to execute more code. Fortunately, this overhead is usually minimal thanks to the huge efforts of the maintainers of the Lexbor and the uriparser libraries.
According to the rough benchmarks, the following results were measured:
- parse_url(): 0.000208 sec
- Uri\Rfc3986\Uri: 0.000311 sec
- Uri\WhatWg\Url: 0.000387 sec

- parse_url(): 0.000962 sec
- Uri\Rfc3986\Uri: 0.000911 sec
- Uri\WhatWg\Url: 0.000962 sec
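The measurement method is not shown here. As a rough illustration of how such a timing could be taken, the following is a minimal sketch that measures parse_url() with hrtime(); the benchmark() helper, the iteration count, and the sample URL are all assumptions, and absolute numbers will differ from machine to machine:

```php
<?php
// Hypothetical helper: runs $callback $iterations times and returns the elapsed seconds.
function benchmark(callable $callback, int $iterations): float
{
    $start = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        $callback();
    }
    return (hrtime(true) - $start) / 1e9;
}

// Illustrative input; the URLs used for the figures above are not known.
$url = "https://user:pass@example.com:8080/path/to/file.html?foo=bar#top";

$elapsed = benchmark(static fn () => parse_url($url), 1000);
printf("parse_url(): %.6f sec\n", $elapsed);
```

The same helper could wrap the constructors of the new classes once they are available, so that all three parsers are timed under identical conditions.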
Even though Go's standard library ships with a net/url package containing a url.Parse() function along with some utility functions, the documentation unfortunately doesn't state which specification it conforms to. It's not very promising, however, that the manual contains the following sentence:
Trying to parse a hostname and path without a scheme is invalid but may not necessarily return an error, due to parsing ambiguities.
In Java, a URL class has been available from the beginning. Unfortunately, it's unclear whether it adheres to any URI specification. As for its design, URL itself is immutable and, somewhat peculiarly, contains some methods which can open a connection to the URL or get its content.
Since Java 20, all of the URL constructors are deprecated in favor of using URI.toURL(). The URI class conforms to the RFC 2396 standard.
C# has extensive support for URIs, although the documentation doesn't mention which specification it uses. Uniquely, the standard library offers advanced features such as a UriBuilder and customizable URI parsers.
NodeJS recently added support for a decent WHATWG URL compliant URL parser, built on top of the ADA URL parser project.
Python also comes with built-in support for parsing URLs, made available by the urllib.parse.urlparse and urllib.parse.urlsplit functions. According to the documentation, “these functions incorporate some aspects of both [the WHATWG URL and the RFC 3986 specifications], but cannot be claimed compliant with either”.
None.
The next minor PHP version (either PHP 8.5 or 9.0, whichever comes first).
SAPIs should adopt the new internal API for parsing URIs instead of using the existing php_url_parse*() API. Additionally, they should add support for configuring the URI parsing backend.
Extensions should adopt the new internal API for parsing URIs instead of using the existing php_url_parse*() API. Additionally, they should add support for configuring the URI parsing backend.
None.
- The parse_url() function can be deprecated at some distant point in time

Discussion thread: https://externals.io/message/123997
The vote requires a 2/3 majority in order to be accepted.