
PHP RFC: Add WHATWG compliant URL parsing API

Introduction

URLs are one of the most fundamental concepts of the web because they make it possible to reference specific resources on a network. URLs were originally defined by Tim Berners-Lee in RFC 1738, but other specifications have emerged since then. The most notable of these is the WHATWG URL specification, because most browsers implement it today.

Despite their ubiquity, URLs are not as unambiguous as people may think: different clients parse and treat them differently, either by following one of the standards or, even worse, by following none at all. Unfortunately, PHP falls into the latter category: it offers the parse_url() function for parsing URLs, but that function isn't compliant with any standard. Even the PHP manual contains the following warning:

This function may not give correct results for relative or invalid URLs, and the results may not even match common behavior of HTTP clients. ...

Incompatibility with current standards is a serious issue, as it hinders interoperability with other tools (e.g. HTTP clients) and can result in bugs that are difficult to notice.
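As a small illustration of this gap, the sketch below runs parse_url() on a URL with an uppercase scheme and host and a dot segment in the path. The input URL is made up for this example. parse_url() only splits the string, whereas a parser following the WHATWG URL specification would be expected to lowercase the scheme and host and resolve the dot segment, serializing the same input as http://example.com/b.

```php
<?php
// Hypothetical input chosen for illustration only.
$parts = parse_url('HTTP://EXAMPLE.com/a/../b');

// parse_url() splits the string but performs no normalization.
var_dump($parts['scheme']); // string(4) "HTTP"
var_dump($parts['host']);   // string(11) "EXAMPLE.com"
var_dump($parts['path']);   // string(7) "/a/../b"
```

Two clients that disagree on whether these are the same URL can end up caching, routing, or validating the same request differently.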

Proposal

To address the above-mentioned problems, a new, always available URL extension is to be added to the standard library. Initially, it would only support the WHATWG URL specification, with the explicit intention of adding support for other standards when the need arises.

The extension would support parsing, as well as manipulating URLs. For this purpose, the following internal classes and methods are added:

namespace Url;
 
enum UrlComponent: int
{
    case Scheme = 0;
    case Host = 1;
    case Port = 2;
    case User = 3;
    case Password = 4;
    case Path = 5;
    case Query = 6;
    case Fragment = 7;
}
 
final readonly class UrlParser
{
    public static function parseUrl(string $url): ?Url {}
 
    /** @return array<string, int|string> */
    public static function parseUrlToArray(string $url): ?array {}
 
    public static function parseUrlComponent(string $url, UrlComponent $component): string|int|null {}
}
 
final readonly class Url implements \Stringable
{
    public function __construct(
        public string $scheme,
        public string $host,
        public ?int $port,
        public string $user,
        public string $password,
        public string $path,
        public string $query,
        public string $fragment
    ) {}
 
    public function getScheme(): string {}
 
    public function withScheme(string $scheme): static {}
 
    public function getAuthority(): string {}
 
    public function getUserInfo(): string {}
 
    public function withUserInfo(string $user, string $password): static {}
 
    public function getHost(): string {}
 
    public function withHost(string $host): static {}
 
    public function getPort(): ?int {}
 
    public function withPort(int $port): static {}
 
    public function getPath(): string {}
 
    public function withPath(string $path): static {}
 
    public function getQuery(): string {}
 
    public function withQuery(string $query): static {}
 
    public function getFragment(): string {}
 
    public function withFragment(string $fragment): static {}
 
    public function __toString(): string {}
}
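
The with*() methods above follow the immutable "wither" pattern: each call returns a new instance and leaves the original untouched. The following is a minimal userland sketch of that pattern, not the proposed implementation. The class name SketchUrl, its reduced set of components, and its naive serialization are assumptions made for this example, and it performs no parsing or validation. It requires PHP 8.2 for readonly classes.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for illustration; NOT the proposed Url\Url class.
final readonly class SketchUrl implements \Stringable
{
    public function __construct(
        public string $scheme,
        public string $host,
        public ?int $port,
        public string $path,
    ) {}

    // A readonly object cannot be modified, so each wither builds a new one.
    public function withHost(string $host): static
    {
        return new static($this->scheme, $host, $this->port, $this->path);
    }

    public function withPort(int $port): static
    {
        return new static($this->scheme, $this->host, $port, $this->path);
    }

    // Naive serialization, sufficient only for this sketch.
    public function __toString(): string
    {
        $port = $this->port !== null ? ':' . $this->port : '';

        return $this->scheme . '://' . $this->host . $port . $this->path;
    }
}

$original = new SketchUrl('https', 'example.com', null, '/docs');
$modified = $original->withHost('php.net')->withPort(8443);

echo $original, "\n"; // https://example.com/docs
echo $modified, "\n"; // https://php.net:8443/docs
```

Because instances never change after construction, they can be shared freely between components without defensive copying.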

Parser Library Choice

Adding a WHATWG compliant URL parser to the standard library was originally attempted in 2023. The implementation used the ADA URL parser as its backend, which is known for its high performance. Ultimately, the proof of concept was abandoned because of technical limitations that could not be resolved.

Specifically, ADA is written in C++ and requires a compiler that supports at least C++17. Although it has a C wrapper, its strict compiler requirements would make it unprecedented, and practically impossible, to add the URL extension to PHP as a required extension: PHP has never had a C++ compiler dependency for its always-enabled extensions, and only optional extensions (like Intl) can be written in C++.

The firm position of this RFC is that a WHATWG compliant URL parser extension should always be available, so a parser backend written in pure C had to be found. Fortunately, Niels Dossche proposed PHP RFC: DOM HTML5 parsing and serialization not long after the experiment with ADA, and his work required bundling the Lexbor browser engine. This library is written in C and happens to contain a WHATWG compliant URL parsing submodule, which makes it a suitable library of choice.

Performance Considerations

The implementation of parse_url() is optimized for performance. This also means that it doesn't validate its input properly and disregards some edge cases. A WHATWG compliant parser will always be slower than parse_url(), because it has to execute more code. According to the initial benchmarks, Url\UrlParser::parseUrl() is ~3.6x slower and Url\UrlParser::parseUrlToArray() is ~3x slower than parse_url(). The new functions still have some room for optimization, but we shouldn't expect a significant performance improvement.
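
For readers who want to reproduce this kind of comparison, the following is a minimal micro-benchmark sketch. It is not the benchmark used for the figures above: the benchmark() helper, the sample URL, and the iteration count are assumptions made for this example. It only times parse_url(), and the same helper could be pointed at the new methods once they are available.

```php
<?php
declare(strict_types=1);

/** Returns the average wall-clock time per call, in nanoseconds. */
function benchmark(callable $fn, int $iterations): float
{
    $start = hrtime(true); // monotonic clock, nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }

    return (hrtime(true) - $start) / $iterations;
}

// Hypothetical sample input; real measurements should use a varied URL set.
$url = 'https://user:secret@example.com:8080/path?q=1#top';

$baseline = benchmark(static fn () => parse_url($url), 100_000);
printf("parse_url(): %.1f ns per call\n", $baseline);
```

A single URL in a tight loop mostly measures the best case, so ratios obtained this way should be read as rough indications only.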

API Design

The new URL parsing API consists of two classes: Url\UrlParser and Url\Url. Separating parsing from representation is a deliberate design decision: this way URLs can be parsed into multiple representations (class, array, scalar values). However, it is still to be decided which of these representations are actually necessary.
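
For comparison, the existing parse_url() already exposes two such representations through a single function: an array of components, or one scalar component selected by a constant. The sketch below shows both modes with a made-up URL. Whether parseUrlToArray() and parseUrlComponent() would return identical keys and values is not specified here, so the parallel is only an assumption.

```php
<?php
// Hypothetical input chosen for illustration only.
$url = 'https://user:secret@example.com:8080/path?q=1#top';

// Array representation: every component that is present, keyed by name.
$parts = parse_url($url);
var_dump($parts['host']); // string(11) "example.com"
var_dump($parts['port']); // int(8080)

// Scalar representation: a single component selected by a constant.
$port = parse_url($url, PHP_URL_PORT);
var_dump($port); // int(8080)

// A component that is absent from the input yields null.
$host = parse_url('/only/path', PHP_URL_HOST);
var_dump($host); // NULL
```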

Additionally, this design makes it easier to support custom parser backends. For example, one can then create an extension which uses the ADA URL parser library to parse URLs.

Relation to PSR-7

The Url\Url class is intentionally compatible with the PSR-7 UriInterface. This makes it possible for a next iteration of the PSR-7 standard to use Url\Url directly instead of requiring implementations to provide their own Psr\Http\Message\UriInterface implementation.

Examples in Other Languages

Go

Although Go's standard library ships with a net/url package containing a url.Parse() function along with some utility functions, its documentation does not state which standard it conforms to. The following sentence from the manual is not very promising either:

Trying to parse a hostname and path without a scheme is invalid but may not necessarily return an error, due to parsing ambiguities.

Java

In Java, a URL class has been available from the beginning. Unfortunately, it's unclear whether it adheres to any URL standard. As for its design, URL itself is immutable and, somewhat peculiarly, contains methods that can open a connection to the URL or fetch its content.

Since Java 20, all of the URL constructors are deprecated in favor of URI.toURL(). The URI class conforms to the RFC 2396 standard.

NodeJS

NodeJS recently added a WHATWG URL compliant URL parser built on top of the ADA URL parser project.

Python

Python also comes with built-in support for parsing URLs, made available by the urllib.parse.urlparse and urllib.parse.urlsplit functions. According to the documentation, “these functions incorporate some aspects of both, but cannot be claimed compliant with either [the WHATWG URL or the RFC 3986 specifications]”.

Backward Incompatible Changes

None.

Proposed PHP Version(s)

Either PHP 8.5 or 9.0.

Future Scope

Support for other parser backends may be added so that other libraries (like Ada URL) could also be used in addition to Lexbor.

The parse_url() function can be deprecated at some distant point in time.

References

Vote

The vote requires a 2/3 majority in order to be accepted.

Add the WHATWG compliant URL API described above?
Final result: 0 yes, 0 no. This poll has been closed.
rfc/url_parsing_api.txt · Last modified: 2024/06/28 20:14 by kocsismate