
PHP RFC: Add WHATWG compliant URL parsing API

Introduction

URLs are one of the most fundamental concepts of the web because they make it possible to reference specific resources on a network. URLs were invented by Tim Berners-Lee, and his work was published as RFC 1738. Since then, other specifications have emerged, of which the WHATWG URL Standard is the most notable because most browsers implement it today.

Despite the ubiquity of URLs, they are not as unambiguous as people may think: different clients treat and parse them differently, either by following one of the standards or, even worse, by not following any at all. Unfortunately, PHP falls into the latter category: the parse_url() function is offered for parsing URLs, but it isn't compliant with any standard. Even the PHP manual contains the following warning:

This function may not give correct results for relative or invalid URLs, and the results may not even match common behavior of HTTP clients. ...

Incompatibility with current standards is a serious issue because it hinders interoperability with other tools (e.g. HTTP clients) and can result in bugs that are difficult to notice.
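To make the incompatibility concrete, the following minimal sketch shows how parse_url() splits a URL without normalizing it. The input URL is made up for illustration. A WHATWG compliant parser would be expected to lowercase the scheme and host, drop the default port 80, and collapse the "/a/../b" path to "/b", whereas parse_url() returns every component verbatim.

```php
<?php
// parse_url() only splits the string; it performs no normalization.
$components = parse_url('HTTP://EXAMPLE.COM:80/a/../b?x=1#top');

var_dump($components['scheme']); // string(4) "HTTP": letter case is preserved
var_dump($components['host']);   // string(11) "EXAMPLE.COM": letter case is preserved
var_dump($components['port']);   // int(80): the default port is kept
var_dump($components['path']);   // string(7) "/a/../b": dot segments are not resolved
```

An HTTP client that follows the WHATWG URL specification would treat the same input as http://example.com/b?x=1#top, so code that compares or routes on the parse_url() output may disagree with the client that actually performs the request.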

Proposal

To address the above-mentioned problems, a new, always-available URL extension is to be added to the PHP core. Initially, it would only support the WHATWG URL specification, with the explicit intention of adding support for other standards when the need arises.

The extension would support both parsing and manipulating URLs. For this purpose, the following internal classes and methods are added:

namespace Url;
 
enum UrlComponent: int
{
    case Scheme = 0;
    case Host = 1;
    case Port = 2;
    case User = 3;
    case Password = 4;
    case Path = 5;
    case Query = 6;
    case Fragment = 7;
}
 
final readonly class Url implements \Stringable
{
    public function __construct(
        public ?string $scheme,
        public ?string $host,
        public ?int $port,
        public ?string $user,
        public ?string $password,
        public ?string $path,
        public ?string $query,
        public ?string $fragment
    ) {}
 
    public function getScheme(): ?string {}
 
    public function withScheme(?string $scheme): static {}
 
    public function getAuthority(): ?string {}
 
    public function getUserInfo(): ?string {}
 
    public function withUserInfo(?string $user, ?string $password): static {}
 
    public function getHost(): ?string {}
 
    public function withHost(?string $host): static {}
 
    public function getPort(): ?int {}
 
    public function withPort(?int $port): static {}
 
    public function getPath(): ?string {}
 
    public function withPath(?string $path): static {}
 
    public function getQuery(): ?string {}
 
    public function withQuery(?string $query): static {}
 
    public function getFragment(): ?string {}
 
    public function withFragment(?string $fragment): static {}
 
    public function __toString(): string {}
}
 
final readonly class UrlParser
{
    public static function parseUrl(string $url): ?Url {}
 
    /** @return array<string, int|string>|null */
    public static function parseUrlToArray(string $url): ?array {}
 
    public static function parseUrlComponent(string $url, UrlComponent $component): string|int|null {}
}
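Because Url\Url is declared readonly and its with*() methods return static, the stubs suggest immutable "wither" semantics: each call would return a modified copy and leave the original untouched. The following userland sketch illustrates that pattern only. SketchUrl is a hypothetical stand-in with a reduced set of properties, not the proposed internal class, and it assumes PHP 8.2 or later for readonly classes.

```php
<?php
declare(strict_types=1);

// Hypothetical userland stand-in, NOT the proposed Url\Url class.
final readonly class SketchUrl
{
    public function __construct(
        public ?string $scheme,
        public ?string $host,
        public ?int $port = null,
    ) {}

    // Readonly properties cannot be reassigned, so a new instance is built.
    public function withHost(?string $host): static
    {
        return new static($this->scheme, $host, $this->port);
    }

    public function withPort(?int $port): static
    {
        return new static($this->scheme, $this->host, $port);
    }
}

$original = new SketchUrl('https', 'example.com');
$changed  = $original->withHost('php.net')->withPort(8080);

echo $original->host, "\n";                     // example.com
echo $changed->host, ':', $changed->port, "\n"; // php.net:8080
```

How the internal implementation validates or normalizes values passed to the with*() methods is not specified by the stubs above, so the sketch deliberately performs no validation.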

Performance Considerations

The implementation of parse_url() is optimized for performance, which also means that it performs no validation and disregards some edge cases. A WHATWG compliant parser is expected to be slower than parse_url() because it has to execute considerably more code. According to the initial benchmarks, Url\UrlParser::parseUrl() is ~3.6x slower and Url\UrlParser::parseUrlToArray() is ~3x slower than parse_url(). The new functions still have some room for optimization, but a significant performance improvement shouldn't be expected.
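Relative figures like these can be gathered with a simple timing loop. The sketch below is not the benchmark used for this RFC: averageNanoseconds() is a hypothetical helper, the sample URL and iteration count are arbitrary, and only the parse_url() baseline is measured because the proposed functions are not available in released PHP versions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the average wall-clock time of one
 * $parser($url) call in nanoseconds, measured over $iterations calls.
 */
function averageNanoseconds(callable $parser, string $url, int $iterations = 100_000): float
{
    $start = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        $parser($url);
    }

    return (hrtime(true) - $start) / $iterations;
}

$url = 'https://user:pass@example.com:8080/path/to/file?query=1#fragment';

$baseline = averageNanoseconds('parse_url', $url);
printf("parse_url(): %.1f ns per call\n", $baseline);
```

On a build that includes the proposed extension, passing another callable with the same single-argument shape to the helper would give a comparable figure. Absolute numbers depend heavily on the machine and on the input URL, so only the ratio between two measurements is meaningful.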

Parser Library Choice

Adding a WHATWG compliant URL parser to the PHP core was originally attempted in 2023. The implementation used the Ada URL parser as its backend, which is known for its high performance. In the end, the proof of concept was abandoned due to technical limitations that could not be resolved.

Specifically, Ada is written in C++ and requires a compiler that supports at least C++17. Although it has a C wrapper, its strict compiler requirements would make it unprecedented, and practically impossible, to add the URL extension to the PHP core as a required extension: PHP has never had a C++ compiler dependency for its always-enabled extensions, and only optional extensions (like Intl) can be written in C++.

The firm position of this RFC is that a WHATWG compliant URL parser extension should always be available, therefore a different parser backend written in pure C had to be found. Fortunately, Niels Dossche proposed PHP RFC: DOM HTML5 parsing and serialization not long after the experiment with Ada, and his work required bundling the Lexbor browser engine. This library is written in C and contains a WHATWG compliant URL parsing submodule, which makes it suitable as the library of choice.

Backward Incompatible Changes

None.

Proposed PHP Version(s)

Next minor or major version (either 8.5 or 9.0).

Future Scope

Support for other parser backends may be added so that other libraries (like Ada URL) could also be used in addition to Lexbor.

The parse_url() function could be deprecated at some distant point in the future.

References

Discussion thread:

Vote

The vote requires a 2/3 majority in order to be accepted.

Add the WHATWG compliant URL API described above?
rfc/url_parsing_api.txt · Last modified: 2024/06/15 07:11 by kocsismate