PHP RFC: Followup Improvements for ext/uri

Introduction

This RFC proposes a set of follow-up improvements to the URL Parsing API RFC. It extends the Uri\Rfc3986\Uri and Uri\WhatWg\Url classes with additional capabilities, several of which were already discussed or requested during the original RFC’s review process.

While these features improve the overall usability of the URI extension, they were intentionally left out of the initial proposal. In some cases, they required further discussion; in others, they were considered non-essential at the time. Deferring them allowed the original RFC to remain focused and avoided further increasing its scope.

Proposal

The following new functionality is introduced in this proposal: URI building, URI type detection, host type detection, and percent-encoding support.

Originally, the proposal included two more topics:

Each proposed feature is voted on separately and requires a 2/3 majority.

URI Building

Currently, only existing (and already validated) URIs can be manipulated via wither methods. These calls always create a new instance so that the immutability of URIs is preserved. Even though this behavior has plenty of advantages, there's at least one disadvantage: instance creation has a performance overhead. This is especially problematic if many URI components have to be modified at the same time, because many objects are “wasted” through intermediate instantiations.

$uri1 = Uri\Rfc3986\Uri::parse("http://example.com");
 
$uri2 = $uri1
    ->withScheme("https")
    ->withHost("example.net")
    ->withPath("/foo/bar");                // This creates 3 objects altogether!

Besides its suboptimal performance, another drawback of the current wither-based solution is that creating a URI from scratch is not possible: one always has to create a valid URI first. The empty string is a valid RFC 3986 URI, so it may seem like a good candidate for an initial URI, but unfortunately, it's not valid for WHATWG URL. Moreover, the success of some transformations depends on the current state (which is a form of temporal coupling):

$uri1 = Uri\Rfc3986\Uri::parse("");
 
$uri2 = $uri1
    ->withScheme("https")
    ->withUserInfo("user:pass")            // throws Uri\InvalidUriException: Cannot set a userinfo without having a host
    ->withHost("example.com");
 
$uri2 = $uri1
    ->withScheme("https")
    ->withHost("example.com")
    ->withUserInfo("user:pass");           // No exception is thrown

In order to provide a more ergonomic and efficient solution for URI building, a fluent API is proposed that implements the Builder pattern.

$uriBuilder = new Uri\Rfc3986\UriBuilder()
    ->setScheme("https")
    ->setUserInfo("user:pass")
    ->setHost("example.com")
    ->setPort(8080)
    ->setPath("/foo/bar")
    ->setQuery("a=1&b=2")
    ->setFragment("section1");
 
$uri = $uriBuilder->build();               // URI instance creation is only done at this point
 
echo $uri->toRawString();                  // https://user:pass@example.com:8080/foo/bar?a=1&b=2#section1

The same works for WHATWG URL:

$urlBuilder = new Uri\WhatWg\UrlBuilder()
    ->setScheme("https")
    ->setUsername("user")
    ->setPassword("pass")
    ->setHost("example.com")
    ->setPort(8080)
    ->setPath("/foo/bar")
    ->setQuery("a=1&b=2")
    ->setFragment("section1");
 
$url = $urlBuilder->build();               // URL instance creation is only done at this point
 
echo $url->toAsciiString();                // https://user:pass@example.com:8080/foo/bar?a=1&b=2#section1

When a Builder instance is not instantiated by ourselves or a trusted party, one cannot be sure whether it already has any components set. Therefore, it's highly recommended to reset the instance state before usage, if a completely clean state is needed:

function buildUri(Uri\Rfc3986\UriBuilder $builder): void
{
    // Was there any component set before?
 
    $builder->reset();
 
    // Further usage is safe now...
}
 
function buildUrl(Uri\WhatWg\UrlBuilder $builder): void
{
    // Was there any component set before?
 
    $builder->reset();
 
    // Further usage is safe now...
}

The reset() method also comes in handy when the same Builder instance is reused to instantiate multiple URIs/URLs in a row.

The complete class signatures to be added are the following:

namespace Uri\Rfc3986 {
    final class UriBuilder
    {
        public function __construct() {}
 
        public function reset(): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setScheme(?string $scheme): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setUserInfo(#[\SensitiveParameter] ?string $userInfo): static {}

        /**
         * @throws Uri\InvalidUriException
         */
        public function setHost(?string $host): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setPath(string $path): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setQuery(?string $query): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setFragment(?string $fragment): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function build(?\Uri\Rfc3986\Uri $baseUrl = null): \Uri\Rfc3986\Uri {}
    }
}
namespace Uri\WhatWg {
    final class UrlBuilder
    {
        public function __construct() {}
 
        public function reset(): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setScheme(?string $scheme): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setUsername(?string $username): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setPassword(#[\SensitiveParameter] ?string $password): static {}

        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setHost(?string $host): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setPath(string $path): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setQuery(?string $query): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setFragment(?string $fragment): static {}
 
        /**
         * @param array $errors
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function build(?\Uri\WhatWg\Url $baseUrl = null, &$errors = null): \Uri\WhatWg\Url {}
    }
}

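The builder objects perform validation at two distinct levels: the syntax of each component is validated when the corresponding setter is called, and the global state (the consistency of the components with each other) is validated when build() is called.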

An example of component syntax validation:

$uriBuilder = new Uri\Rfc3986\UriBuilder()
    ->setScheme("http%80");                // Throws a Uri\InvalidUriException because the scheme is not well formed

An example of global state validation:

$uriBuilder = new Uri\Rfc3986\UriBuilder()
    ->setScheme("https")
    ->setUserInfo("user:pass");            // Doesn't throw an exception yet
 
$uri = $uriBuilder->build();               // Throws a Uri\InvalidUriException because the host is not present, but the userinfo is
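
The two validation levels can be approximated in userland. The following is only a minimal sketch, not the proposed implementation: the SketchUriBuilder class, its regular expression, and the exception classes it throws are assumptions made for illustration, and build() returns a plain string instead of a Uri instance.

```php
<?php

// Hypothetical userland sketch; not the proposed Uri\Rfc3986\UriBuilder.
final class SketchUriBuilder
{
    private ?string $scheme = null;
    private ?string $userInfo = null;
    private ?string $host = null;

    public function setScheme(?string $scheme): static
    {
        // Level 1: component syntax is checked as soon as the setter is called.
        // RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
        if ($scheme !== null && preg_match('/^[A-Za-z][A-Za-z0-9+.\-]*$/', $scheme) !== 1) {
            throw new InvalidArgumentException("The scheme is not well formed");
        }
        $this->scheme = $scheme;
        return $this;
    }

    public function setUserInfo(?string $userInfo): static
    {
        // No cross-component check here, so the call order does not matter.
        $this->userInfo = $userInfo;
        return $this;
    }

    public function setHost(?string $host): static
    {
        $this->host = $host;
        return $this;
    }

    public function build(): string
    {
        // Level 2: the global state is checked only when the URI is assembled.
        if ($this->userInfo !== null && $this->host === null) {
            throw new LogicException("Cannot set a userinfo without having a host");
        }
        $uri = $this->scheme !== null ? $this->scheme . ":" : "";
        if ($this->host !== null) {
            $uri .= "//" . ($this->userInfo !== null ? $this->userInfo . "@" : "") . $this->host;
        }
        return $uri;
    }
}

$builder = (new SketchUriBuilder())
    ->setScheme("https")
    ->setUserInfo("user:pass");            // Accepted, although no host is set yet

echo $builder->setHost("example.com")->build(), "\n";   // https://user:pass@example.com
```

Because the cross-component check is deferred to build(), the setters can be called in any order, which avoids the temporal coupling of the wither-based approach.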

Design considerations

Builder design pattern

Why is a complex Builder pattern based approach proposed instead of a much simpler Factory Method based one? The factory method could be as simple as the following:

namespace Uri\Rfc3986 {
    final readonly class Uri
    {
        ...
 
        public static function fromComponents(
            ?string $scheme = null, ?string $host = null, string $path = "",
            ?string $userInfo = null, ?string $queryString = null, ?string $fragment = null
        ) {}
 
        ...
    }
}
 
namespace Uri\WhatWg {
    final readonly class Url
    {
        ...
 
        public static function fromComponents(
            string $scheme, ?string $host = "", string $path = "",
            ?string $username = null, ?string $password = null,
            ?string $queryString = null, ?string $fragment = null
        ) {}
 
        ...
    }
}

The current RFC proposes the Builder pattern based approach because of its flexibility: it makes it possible to add more convenience methods in the future.

Dedicated classes

This RFC proposes a dedicated Builder class for both RFC 3986 and WHATWG URL, instead of a single, unified implementation with 2 build() methods (e.g. buildUri() and buildUrl()). This decision has the following reasons:

Mutability

The UriBuilder and UrlBuilder classes are intentionally designed as mutable objects.

A Builder’s primary purpose is incremental construction and modification. In such workflows, immutability would require creating a new instance after every state change (e.g. setting the scheme, host, etc.).

In contrast to value objects such as Uri\Rfc3986\Uri or Uri\WhatWg\Url, which represent fully parsed and normalized identifiers and therefore benefit from immutability, a Builder is inherently a transitional construct. It is not meant to represent a stable value but to facilitate step-by-step assembly.

Making the Builder classes immutable would:

Setter naming convention

Setter methods of the UriBuilder and UrlBuilder classes follow the naming convention that is already widespread among internal functions: they use a set prefix, e.g. setScheme(), setHost(). The current RFC rejects the usage of any other naming convention, most notably the omission of the set prefix (e.g. scheme(), host()), for the following reasons:

Voting

Add URI building support as outlined in the RFC?
Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

URI Type Detection

RFC 3986 distinguishes different URI “types” based on what they begin with. Actually, the RFC 3986 specification collectively refers to these as URI-references.

In order to better support granular RFC 3986 URI type detection, the following enums and methods are proposed to be added:

namespace Uri\Rfc3986 {
    enum UriType
    {
        case AbsolutePathReference;
        case RelativePathReference;
        case NetworkPathReference;
        case Uri;
    }
 
    final readonly class Uri
    {
        ...
 
        public function getUriType(): \Uri\Rfc3986\UriType {}
 
        ...
    }
}

This way, it becomes easier to detect URI types:

$uri = new Uri\Rfc3986\Uri("https://example.com");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::Uri
 
$uri = new Uri\Rfc3986\Uri("https:");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::Uri
 
$uri = new Uri\Rfc3986\Uri("/foo");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::AbsolutePathReference
 
$uri = new Uri\Rfc3986\Uri("foo");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::RelativePathReference
 
$uri = new Uri\Rfc3986\Uri("//host.com/foo");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::NetworkPathReference
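
The classification shown above depends only on how the reference begins. As a rough userland illustration (not the proposed implementation), the hypothetical sketchUriType() helper below returns the proposed case names as strings. It assumes that its input is already a valid RFC 3986 URI-reference, and it treats the empty reference as a relative-path reference, which is also an assumption.

```php
<?php

// Hypothetical helper; the returned strings mirror the proposed UriType case names.
function sketchUriType(string $reference): string
{
    // A leading "scheme:" makes the reference a URI.
    if (preg_match('/^[A-Za-z][A-Za-z0-9+.\-]*:/', $reference) === 1) {
        return "Uri";
    }
    // "//" introduces an authority component.
    if (str_starts_with($reference, "//")) {
        return "NetworkPathReference";
    }
    if (str_starts_with($reference, "/")) {
        return "AbsolutePathReference";
    }
    return "RelativePathReference";
}

echo sketchUriType("https:"), "\n";            // Uri
echo sketchUriType("//host.com/foo"), "\n";    // NetworkPathReference
echo sketchUriType("/foo"), "\n";              // AbsolutePathReference
echo sketchUriType("foo"), "\n";               // RelativePathReference
```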

The position of this RFC is that distinguishing between URIs and absolute URIs (URIs that don't include a fragment component) doesn't need special support. Therefore, a dedicated Uri\Rfc3986\UriType::AbsoluteUri enum case is omitted from the proposal.

The WHATWG URL specification defines some special schemes (http, https, ftp, file, ws, wss), which have distinct parsing and serialization rules. In order to make checks for special URLs easier to perform, a new Uri\WhatWg\Url::isSpecialScheme() method is added:

namespace Uri\WhatWg {
    final readonly class Url
    {
        ...
 
        public function isSpecialScheme(): bool {}
 
        ...
    }
}

This enables low-level control for applications that need to mirror WHATWG behavior in parsing or normalization.

$url = new Uri\WhatWg\Url("https://example.com");
var_dump($url->isSpecialScheme());                // true
 
$url = new Uri\WhatWg\Url("custom:example");
var_dump($url->isSpecialScheme());                // false
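
Without the proposed method, a comparable check has to be written by hand. The sketch below is a hypothetical userland approximation: the sketchIsSpecialScheme() function and the SKETCH_SPECIAL_SCHEMES constant are made up for illustration, and the function works on a raw string without performing any WHATWG parsing or validation.

```php
<?php

// The special schemes listed by the WHATWG URL specification.
const SKETCH_SPECIAL_SCHEMES = ["http", "https", "ftp", "file", "ws", "wss"];

// Hypothetical helper; takes everything before the first ":" as the scheme.
function sketchIsSpecialScheme(string $url): bool
{
    $scheme = strstr($url, ":", true);

    return $scheme !== false
        && in_array(strtolower($scheme), SKETCH_SPECIAL_SCHEMES, true);
}

var_dump(sketchIsSpecialScheme("https://example.com"));   // bool(true)
var_dump(sketchIsSpecialScheme("custom:example"));        // bool(false)
```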

Add support for detecting URI type as outlined in the RFC?
Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

Host Type Detection

Both the RFC 3986 and WHATWG URL specifications distinguish different types of the host component because each of them has different parsing and formatting rules. Probably the most notable example is the IPv6 host type, which requires the IPv6 address to be written between a [ and ] pair.

In order to support returning information about the host type, the following enums and methods are proposed to be added:

namespace Uri\Rfc3986 {
    enum UriHostType
    {
        case IpV4;
        case IpV6;
        case IpVFuture;
        case RegisteredName;
    }
 
    final readonly class Uri
    {
        ...
 
        public function getHostType(): ?\Uri\Rfc3986\UriHostType {}
 
        ...
    }
}
namespace Uri\WhatWg {
    enum UrlHostType
    {
        case IpV4;
        case IpV6;
        case Domain;
        case Opaque;
        case Empty;
    }
 
    final readonly class Url
    {
        ...
 
        public function getHostType(): ?\Uri\WhatWg\UrlHostType {}
 
        ...
    }
}

The new getHostType() methods return the type of the host component for both specifications.

Let's see a few examples for RFC 3986:

$uri = new Uri\Rfc3986\Uri("https://192.168.0.1/");
var_dump($uri->getHostType());             // UriHostType::IpV4

$uri = new Uri\Rfc3986\Uri("https://[2001:db8::1]/");
var_dump($uri->getHostType());             // UriHostType::IpV6

$uri = new Uri\Rfc3986\Uri("https://[v1.1.2.3]/");
var_dump($uri->getHostType());             // UriHostType::IpVFuture

$uri = new Uri\Rfc3986\Uri("https://example.com/");
var_dump($uri->getHostType());             // UriHostType::RegisteredName

$uri = new Uri\Rfc3986\Uri("file:///C:/a.txt");
var_dump($uri->getHostType());             // null

$uri = new Uri\Rfc3986\Uri("foo:bar/baz");
var_dump($uri->getHostType());             // null

$uri = new Uri\Rfc3986\Uri("/foo/bar");
var_dump($uri->getHostType());             // null

$uri = new Uri\Rfc3986\Uri("mailto:john.doe@example.com");
var_dump($uri->getHostType());             // null

According to RFC 3986, the host can be either an IPv4 or an IPv6 address, a so-called IPvFuture address (a potential IP address type that might be developed after IPv6), or a registered name (usually but not exclusively a DNS name). RFC 3986 also allows the host to be empty (“https://”), in which case, UriHostType::RegisteredName is returned, since the reg-name syntax of RFC 3986 supports empty strings. When the host is missing, Uri\Rfc3986\Uri::getHostType() returns null.
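
The categories described above can be approximated in userland for a host string that has already been extracted from a URI. This is a hedged sketch rather than the proposed implementation: sketchUriHostType() is a hypothetical helper, it relies on filter_var() for the IP address checks, and it assumes that any non-IP input is a syntactically valid registered name.

```php
<?php

// Hypothetical helper; returns the proposed UriHostType case name,
// or null when the host is missing.
function sketchUriHostType(?string $host): ?string
{
    if ($host === null) {
        return null;
    }

    // IP literals are enclosed in square brackets.
    if (str_starts_with($host, "[") && str_ends_with($host, "]")) {
        $literal = substr($host, 1, -1);

        if (filter_var($literal, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false) {
            return "IpV6";
        }

        // IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
        if (preg_match('/^v[0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&\'()*+,;=:]+$/', $literal) === 1) {
            return "IpVFuture";
        }

        throw new InvalidArgumentException("Invalid IP literal");
    }

    if (filter_var($host, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false) {
        return "IpV4";
    }

    // Everything else, including the empty string, is treated as a registered name.
    return "RegisteredName";
}

var_dump(sketchUriHostType("[v1.1.2.3]"));   // string(9) "IpVFuture"
var_dump(sketchUriHostType(""));             // string(14) "RegisteredName"
var_dump(sketchUriHostType(null));           // NULL
```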

Some examples for WHATWG URL:

$url = new Uri\WhatWg\Url("https://192.168.0.1/");
var_dump($url->getHostType());             // UrlHostType::IpV4

$url = new Uri\WhatWg\Url("https://[2001:db8::1]/");
var_dump($url->getHostType());             // UrlHostType::IpV6

$url = new Uri\WhatWg\Url("https://example.com/");
var_dump($url->getHostType());             // UrlHostType::Domain

$url = new Uri\WhatWg\Url("file:///C:/a.txt");
var_dump($url->getHostType());             // UrlHostType::Empty

As can be seen, the behavior of the WHATWG URL specification is straightforward in the case of special URLs: the host can be either an IPv4 or an IPv6 address, a domain, or it can be empty (when the host is missing).

As a side note, WHATWG URL accepts many more IPv4 address formats than RFC 3986:

$url = new Uri\WhatWg\Url("https://127.1/");
echo $url->getAsciiHost();                 // 127.0.0.1
 
$url = new Uri\WhatWg\Url("https://0x7f.0x0.0x0.0x1");
echo $url->getAsciiHost();                 // 127.0.0.1
 
$url = new Uri\WhatWg\Url("https://2130706433/");
echo $url->getAsciiHost();                 // 127.0.0.1
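
All three inputs above resolve to the same 32-bit number. The following simplified sketch shows the arithmetic behind this; it is not the actual parser. The sketchParseIpv4Number() and sketchParseWhatwgIpv4() helpers are hypothetical, they assume 64-bit integers, and they skip the specification's preliminary check that decides whether a host should be parsed as an IPv4 address at all.

```php
<?php

// Hypothetical helper: parses one dot-separated part as hex ("0x" prefix),
// octal (leading "0"), or decimal. Returns null for anything else.
function sketchParseIpv4Number(string $part): ?int
{
    if ($part === "") {
        return null;
    }
    if (preg_match('/^0[xX]([0-9A-Fa-f]*)$/', $part, $m) === 1) {
        return $m[1] === "" ? 0 : intval($m[1], 16);
    }
    if (strlen($part) >= 2 && $part[0] === "0") {
        return preg_match('/^[0-7]+$/', $part) === 1 ? intval(substr($part, 1), 8) : null;
    }
    return preg_match('/^[0-9]+$/', $part) === 1 ? intval($part, 10) : null;
}

// Hypothetical helper: returns the dotted-decimal form, or null on failure.
function sketchParseWhatwgIpv4(string $host): ?string
{
    $parts = explode(".", $host);
    if (count($parts) > 1 && end($parts) === "") {
        array_pop($parts);                       // A single trailing dot is tolerated
    }
    if (count($parts) > 4) {
        return null;
    }

    $numbers = array_map('sketchParseIpv4Number', $parts);
    if (in_array(null, $numbers, true)) {
        return null;
    }

    // The last number fills all remaining bytes of the 32-bit address.
    $last = array_pop($numbers);
    if ($last >= 256 ** (4 - count($numbers))) {
        return null;
    }

    $address = $last;
    foreach ($numbers as $i => $n) {
        if ($n > 255) {
            return null;
        }
        $address += $n * 256 ** (3 - $i);
    }

    return long2ip($address);
}

echo sketchParseWhatwgIpv4("127.1"), "\n";              // 127.0.0.1
echo sketchParseWhatwgIpv4("0x7f.0x0.0x0.0x1"), "\n";   // 127.0.0.1
echo sketchParseWhatwgIpv4("2130706433"), "\n";         // 127.0.0.1
```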

Things get more complicated when we look at non-special URLs:

$url = new Uri\WhatWg\Url("git://example.com/whatwg/url.git");
var_dump($url->getHostType());             // UrlHostType::Opaque

$url = new Uri\WhatWg\Url("scheme://127.0.0.1/");
var_dump($url->getHostType());             // UrlHostType::Opaque

$url = new Uri\WhatWg\Url("mailto:john.doe@example.com");
var_dump($url->getHostType());             // null

$url = new Uri\WhatWg\Url("foo:/bar/baz");
var_dump($url->getHostType());             // null

Hosts of non-special URLs can be either opaque (note the various opaque hosts!), or null (when the host is missing). While treating the host of any non-special URL as opaque may seem unusual at first, this follows directly from the design principles of the WHATWG URL specification: the specification intentionally avoids making assumptions about the syntax of schemes it does not define. For example, it cannot know how a hypothetical foo scheme structures its host component. Therefore, such hosts are treated as opaque and are not subject to further parsing or validation.

These considerations explain why this RFC defines two separate enums (UriHostType and UrlHostType) for the two specifications, even though they contain similar or partially overlapping cases.

RFC 3986 defines a generic URI syntax. Its host categorization reflects this generality as it uses the “registered name” phrasing, without assuming any particular name resolution mechanism. In particular, a registered name is a syntactic production and does not imply a DNS domain.

In contrast, the WHATWG URL specification defines a host model mostly tailored to web interoperability. Because these two specifications operate at different abstraction levels and assign different semantics to superficially similar host forms, their host type systems are not compatible. For example, an RFC 3986 registered name cannot in general be mapped to a WHATWG domain, and WHATWG’s opaque host concept has no direct equivalent in RFC 3986.

For example, consider the URI app://my-application/resource using the app scheme as specified by https://www.w3.org/TR/app-uri/. According to RFC 3986, the host my-application is a valid registered name, even though it does not represent a DNS domain and is not expected to be resolved via DNS.

Therefore, a unified host type enum would either blur these semantic distinctions or incorrectly suggest full compatibility where only partial compatibility exists. Using separate enums ensures that the host classification faithfully reflects the underlying specification being applied.

Add support for host type detection as outlined in the RFC?
Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

Percent-Encoding Support

Contrary to the common belief, which is probably reinforced by the urlencode() function, percent-encoding is a context-sensitive process. Context-sensitivity means that different characters need to be percent-encoded depending on which URI component is being processed.

It should also be mentioned that urlencode() is in fact meant for the application/x-www-form-urlencoded media type, while rawurlencode() more closely implements RFC 3986.
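
The difference between the two existing functions is easiest to see on an input that contains a space. The snippet below uses only functions that already exist in PHP:

```php
<?php

$input = "a b+c";

echo urlencode($input), "\n";      // a+b%2Bc   (form encoding: the space becomes "+")
echo rawurlencode($input), "\n";   // a%20b%2Bc (RFC 3986 style: the space becomes "%20")
```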

For example, the path component assigns special meaning to the / character. Therefore, this character doesn't necessarily have to be percent-encoded in the path component. There are some cases, though, when it makes sense to percent-encode it, as highlighted by the first example within the “Advanced examples” section of the original URI RFC. Unfortunately, rawurlencode() doesn't take the component into account, and replaces the “/” with “%2F” unconditionally.

echo rawurlencode("/foo/bar/baz");                // %2Ffoo%2Fbar%2Fbaz

In order to correctly handle percent-encoding based on the rules of RFC 3986 and WHATWG URL, the following methods and enums are proposed to be added:

namespace Uri\Rfc3986 {
    enum UriPercentEncodingMode
    {
        case UserInfo;
        case RegisteredNameHost;
        case Path;
        case PathSegment;
        case Query;
        case FormQuery;
        case Fragment;
        case AllReservedCharacters;
        case AllButUnreservedCharacters;
    }
 
    function uri_percent_encode(string $input, \Uri\Rfc3986\UriPercentEncodingMode $mode): string {}
}
namespace Uri\WhatWg {
    enum UrlPercentEncodingMode
    {
        case Username;
        case Password;
        case OpaqueHost;
        case Path;
        case OpaquePath;
        case PathSegment;
        case Query;
        case SpecialQuery;
        case FormQuery;
        case Fragment;
    }
 
    function url_percent_encode(string $input, \Uri\WhatWg\UrlPercentEncodingMode $mode): string {}
}

The uri_percent_encode() and url_percent_encode() functions percent-encode the $input parameter according to the $mode percent-encoding mode. These functions are infallible.

The following modes are supported:

For the complete ABNF syntax of each component, consult Appendix A of RFC 3986.

Since neither RFC 3986 nor WHATWG URL supports percent-encoded characters inside the scheme component, none of the enums contains a Scheme case. WHATWG URL automatically percent-decodes the host for special URLs, so Uri\WhatWg\UrlPercentEncodingMode doesn't contain a Host case. For the opaque hosts of non-special URLs, the Uri\WhatWg\UrlPercentEncodingMode::OpaqueHost case can be used.

With the proposed percent-encoding capabilities, many use cases that were difficult to achieve before can be implemented in a specification-compliant way.

For example, paths can be properly percent-encoded when they contain various special characters:

$uri = new Uri\Rfc3986\Uri("https://example.com");
 
$uri = $uri->withPath(
    Uri\Rfc3986\uri_percent_encode("/foo/bar/[baz]", Uri\Rfc3986\UriPercentEncodingMode::Path)
);
 
echo $uri->getPath();                                         // /foo/bar/%5Bbaz%5D
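
Until such a function is available, component-aware encoding of a path can be approximated in userland. The following is a sketch under stated assumptions, not the proposed implementation: sketchEncodePath() is a hypothetical helper that keeps the characters of the RFC 3986 pchar production and the "/" separator, and percent-encodes every other byte. Encoding a literal "%" as "%25" is an assumption of this sketch.

```php
<?php

// Hypothetical userland approximation of a "Path" encoding mode.
function sketchEncodePath(string $path): string
{
    // Keep pchar characters (unreserved, sub-delims, ":", "@") and "/";
    // percent-encode every other byte, including "%" itself.
    return preg_replace_callback(
        '/[^A-Za-z0-9\-._~!$&\'()*+,;=:@\/]/',
        static fn (array $m): string => rawurlencode($m[0]),
        $path
    );
}

echo sketchEncodePath("/foo/bar/[baz]"), "\n";   // /foo/bar/%5Bbaz%5D
```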

The current RFC doesn't propose the percent-decoding counterpart, because this functionality may cause confusion. Let's take an example:

$uri = new Uri\Rfc3986\Uri("https://example.com/?a=b%26c");  // The query is the percent-encoded form of "a=b&c"
 
echo Uri\Rfc3986\uri_percent_decode(
    $uri->getQuery(),
    Uri\Rfc3986\UriPercentEncodingMode::Query
);                                                            // a=b&c

The result is probably not what we expected, because percent-decoding changed the meaning of the component:

In order to avoid such situations, the present RFC only includes percent-encoding capabilities.
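
The effect can be reproduced with functions that already exist in PHP. The snippet below is only an illustration: it uses parse_str(), which follows form-encoding rules, to show how decoding the whole component before splitting it changes the number of parameters.

```php
<?php

$query = "a=b%26c";

// Parsing first and decoding each value afterwards keeps "&" as data.
parse_str($query, $parsedFirst);

// Decoding the whole component first turns "%26" into a delimiter.
parse_str(rawurldecode($query), $decodedFirst);

var_dump($parsedFirst);    // One parameter: "a" => "b&c"
var_dump($decodedFirst);   // Two parameters: "a" => "b" and "c" => ""
```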

The WHATWG URL specification defines the allowed code points for each component indirectly, by stating which code points should be percent-encoded automatically: thus, the rest of the URL code points are allowed. This is the opposite of what RFC 3986 does, which specifies the exact syntax with the allowed characters for each component.

Add support for percent-encoding as outlined in the RFC?
Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

Backward Incompatible Changes

All the proposed changes are completely backward compatible because the affected classes are all final.

Proposed PHP Version(s)

Next minor version (PHP 8.6)

RFC Impact

To the Ecosystem

The ecosystem can build upon the additional capabilities this RFC introduces.

To Existing Extensions

Existing extensions can continue to use the existing URI API without any changes. Some of the features are exposed as PHPAPI functions through public headers.

To SAPIs

None.

Open Issues

None.

Future Scope

None.

Patches and Tests

https://github.com/kocsismate/php-src/pull/9

Implementation

After the RFC is implemented, this section should contain:

  1. the version(s) it was merged into
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature

References

Rejected Features

None.

Changelog