PHP RFC: Followup Improvements for ext/uri

Version: 0.1
Date: 2025-10-17
Author: Máté Kocsis, kocsismate@php.net
Status: Under Discussion
Target version: next minor version (PHP 8.6)
Implementation: https://github.com/kocsismate/php-src/pull/9

Introduction

This RFC proposes various follow-up improvements to the URL Parsing API RFC, extending the Uri\Rfc3986\Uri and Uri\WhatWg\Url classes with additional capabilities that were requested during the discussion phase of the original RFC. These capabilities were deemed not to be essential from the get-go, therefore they were postponed in order not to increase scope even further.

Proposal

The following new functionality is introduced in this proposal:

URI Building
URI Type Detection
Host Type Detection
Accessing Path Segments as an Array
Percent-Encoding and Decoding Support

Originally, the proposal included Query Parameter manipulation support, but it was later split out into its own RFC at https://wiki.php.net/rfc/query_params due to its complexity.

Each feature proposed is voted separately and requires a 2/3 majority.

URI Building

Currently, only existing (and already validated) URIs can be manipulated, via wither methods. These calls always create a new instance so that the immutability of URIs is preserved. Even though this behavior has plenty of advantages, there's at least one disadvantage: instance creation has a performance overhead. This is especially problematic if many URI components have to be modified at the same time, because a lot of objects are “wasted” through intermediate instantiations.

$uri1 = Uri\Rfc3986\Uri::parse("http://example.com");
 
$uri2 = $uri1
    ->withScheme("https")
    ->withHost("example.net")
    ->withPath("/foo/bar");                // This creates 3 objects altogether!

Besides its suboptimal performance, another drawback of the current wither-based solution is that creating a URI from scratch is not possible: one always has to create a valid URI first. The empty string is a valid RFC 3986 URI, so it may seem a good candidate for an initial URI for URI building, but unfortunately, it's not valid for WHATWG URL. In any case, the success of some transformations depends on the current state (which is a form of temporal coupling):

$uri1 = Uri\Rfc3986\Uri::parse("");
 
$uri2 = $uri1
    ->withScheme("https")
    ->withUserInfo("user:pass")            // throws Uri\InvalidUriException: Cannot set a userinfo without having a host
    ->withHost("example.com");
 
$uri2 = $uri1
    ->withScheme("https")
    ->withHost("example.com")
    ->withUserInfo("user:pass");           // No exception is thrown

In order to provide a more ergonomic and efficient solution for URI building, a fluent API is proposed that implements the Builder pattern.

$uriBuilder = new Uri\Rfc3986\UriBuilder()
    ->setScheme("https")
    ->setUserInfo("user:pass")
    ->setHost("example.com")
    ->setPort(8080)
    ->setPath("/foo/bar")
    ->setQuery("a=1&b=2")
    ->setQueryParams(Uri\Rfc3986\UriQueryParams::fromArray(["a" => 1, "b" => 2])) // Has the same effect as the setQuery() call above
    ->setFragment("section1");
 
$uri = $uriBuilder->build();               // URI instance creation is only done at this point
 
echo $uri->toRawString();                  // https://user:pass@example.com:8080/foo/bar?a=1&b=2#section1

The same works for WHATWG URL:

$urlBuilder = new Uri\WhatWg\UrlBuilder()
    ->setScheme("https")
    ->setUsername("user")
    ->setPassword("pass")
    ->setHost("example.com")
    ->setPort(8080)
    ->setPath("/foo/bar")
    ->setQuery("a=1&b=2")
    ->setQueryParams(Uri\WhatWg\UrlQueryParams::fromArray(["a" => 1, "b" => 2])) // Has the same effect as the setQuery() call above
    ->setFragment("section1");
 
$url = $urlBuilder->build();               // URL instance creation is only done at this point
 
echo $url->toAsciiString();                // https://user:pass@example.com:8080/foo/bar?a=1&b=2#section1

When a Builder instance is not instantiated by ourselves or a trusted party, one cannot be sure whether it already has any components set. Therefore, it's highly recommended to reset the instance state before usage:

function buildUri(Uri\Rfc3986\UriBuilder $builder): void
{
    // Was there any component set before?
 
    $builder->reset();
 
    // Further usage is safe now...
}
 
function buildUrl(Uri\WhatWg\UrlBuilder $builder): void
{
    // Was there any component set before?
 
    $builder->reset();
 
    // Further usage is safe now...
}

The reset() method also comes in handy when the same Builder instance is reused to instantiate multiple URIs/URLs in a row.

The complete class signatures to be added are the following:

namespace Uri\Rfc3986 {
    final class UriBuilder
    {
        public function __construct() {}
 
        public function reset(): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setScheme(?string $scheme): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setUserInfo(#[\SensitiveParameter] ?string $userInfo): static {}

        /**
         * @throws Uri\InvalidUriException
         */
        public function setHost(?string $host): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setPath(string $path): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setPathSegments(array $segments): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setQuery(?string $query): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setQueryParams(\Uri\Rfc3986\UriQueryParams $queryParams): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setFragment(?string $fragment): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function build(?\Uri\Rfc3986\Uri $baseUrl = null): \Uri\Rfc3986\Uri {}
    }
}

namespace Uri\WhatWg {
    final class UrlBuilder
    {
        public function __construct() {}
 
        public function reset(): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setScheme(string $scheme): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setUsername(?string $username): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setPassword(#[\SensitiveParameter] ?string $password): static {}

        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setHost(?string $host): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setPath(string $path): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setPathSegments(array $segments): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setQuery(?string $query): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setQueryParams(\Uri\WhatWg\UrlQueryParams $queryParams): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setFragment(?string $fragment): static {}
 
        /**
         * @param array $errors
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function build(?\Uri\WhatWg\Url $baseUrl = null, &$errors = null): \Uri\WhatWg\Url {}
    }
}

The builder objects would perform validation in two places:

Validation of pure component syntax: The individual setter methods would immediately validate if the input is syntactically correct. For example, the scheme component cannot contain percent-encoded octets, therefore the setScheme() method would throw whenever a “%” character is encountered.
Validation of global state: There are a few validation rules that depend on the “global state”. For example, RFC 3986 requires the host component to be present when the userinfo is set. Any such validation would be delayed until the build() method call to avoid the temporal coupling problem mentioned at the beginning of the section.

An example for component syntax validation:

$uriBuilder = new Uri\Rfc3986\UriBuilder()
    ->setScheme("http%80");                // Throws a Uri\InvalidUriException because the scheme is not well formed

An example for validation of the global state:

$uriBuilder = new Uri\Rfc3986\UriBuilder()
    ->setScheme("https")
    ->setUserInfo("user:pass");            // Doesn't throw an exception yet
 
$uri = $uriBuilder->build();               // Throws a Uri\InvalidUriException because the host is not present, but the userinfo is
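
The two validation phases can be illustrated with a minimal userland sketch. The SketchUriBuilder class below is hypothetical and is not the proposed Uri\Rfc3986\UriBuilder: it only handles three components, skips most syntax checks, returns a plain string, and throws InvalidArgumentException instead of Uri\InvalidUriException.

```php
<?php
// Hypothetical userland sketch; NOT the proposed Uri\Rfc3986\UriBuilder.
final class SketchUriBuilder
{
    private ?string $scheme = null;
    private ?string $userInfo = null;
    private ?string $host = null;

    public function setScheme(?string $scheme): static
    {
        // Component syntax is checked immediately (RFC 3986 "scheme" rule).
        if ($scheme !== null && preg_match('/^[A-Za-z][A-Za-z0-9+.\-]*$/', $scheme) !== 1) {
            throw new InvalidArgumentException("The scheme is not well formed");
        }
        $this->scheme = $scheme;
        return $this;
    }

    public function setUserInfo(?string $userInfo): static
    {
        $this->userInfo = $userInfo; // syntax check omitted in this sketch
        return $this;
    }

    public function setHost(?string $host): static
    {
        $this->host = $host; // syntax check omitted in this sketch
        return $this;
    }

    public function build(): string
    {
        // Global state is only checked here, once all components are known.
        if ($this->userInfo !== null && $this->host === null) {
            throw new InvalidArgumentException("Cannot set a userinfo without having a host");
        }
        $uri = $this->scheme !== null ? $this->scheme . ":" : "";
        if ($this->host !== null) {
            $uri .= "//" . ($this->userInfo !== null ? $this->userInfo . "@" : "") . $this->host;
        }
        return $uri;
    }
}

$uri = (new SketchUriBuilder())
    ->setUserInfo("user:pass")             // allowed before the host is known
    ->setHost("example.com")
    ->setScheme("https")
    ->build();

echo $uri, "\n";                           // https://user:pass@example.com
```

Because the userinfo/host rule is only evaluated in build(), the order of the setter calls no longer matters in this sketch.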

Design considerations

Builder design pattern

Why is a complex Builder-pattern-based approach proposed instead of a much simpler Factory-Method-based one? The factory method could be as simple as the following:

namespace Uri\Rfc3986 {
    final readonly class Uri
    {
        ...
 
        public static function fromComponents(
            ?string $scheme = null, ?string $host = null, string $path = "",
            ?string $userInfo = null, ?string $queryString = null, ?string $fragment = null
        ) {}
 
        ...
    }
}
 
namespace Uri\WhatWg {
    final readonly class Url
    {
        ...
 
        public static function fromComponents(
            string $scheme, ?string $host = "", string $path = "",
            ?string $username = null, ?string $password = null,
            ?string $queryString = null, ?string $fragment = null
        ) {}
 
        ...
    }
}

The current RFC proposes the Builder pattern based approach because of its flexibility: it makes it possible to add more convenience methods in the future. Actually, the setQueryParams() method that expects a Query parameter list object instead of the query string representation is already one.

Dedicated classes

This RFC proposes a dedicated Builder class for both RFC 3986 and WHATWG URL, instead of a single, unified implementation with 2 build() methods (e.g. buildUri() and buildUrl()). This decision has the following reasons:

The two specifications don't recognize the same components. RFC 3986 has the userinfo component, while WHATWG URL has a separate username and password component instead. Even though it's probably possible to work around these incompatibilities, the position of this RFC is that it's better not to try to maintain compatibility artificially.
RFC 3986 only requires the path component to be present (that's why the empty string is a valid RFC 3986 URI), while WHATWG URL mandates the presence of the scheme component too. This distinction is visible from the proposed signatures: while the Uri\Rfc3986\UriBuilder::setScheme() method accepts a string or null, Uri\WhatWg\UrlBuilder::setScheme() only accepts a string parameter. The same distinction is already present in the Uri\Rfc3986\Uri::withScheme() and the Uri\WhatWg\Url::withScheme() methods.
Setter methods validate the input based on the rules of the specification they implement. For example, RFC 3986 URIs cannot contain Unicode characters, so all setters fail when such a character is passed to them. On the other hand, WHATWG URL can handle Unicode characters, and setters won't fail when they encounter one. If a single, unified Builder class were proposed, performing validations early during the setter calls wouldn't be possible, only during the build*() method calls. In the view of this proposal, that would lead to counterintuitive behavior because of the delayed feedback loop.

Mutability

The UriBuilder and UrlBuilder classes are intentionally designed as mutable objects.

A Builder’s primary purpose is incremental construction and modification. In such workflows, immutability would require creating a new instance after every state change (e.g. setting the scheme, host, etc.).

In contrast to value objects such as Uri\Rfc3986\Uri or Uri\WhatWg\Url, which represent fully parsed and normalized identifiers and therefore benefit from immutability, a Builder is inherently a transitional construct. It is not meant to represent a stable value but to facilitate step-by-step assembly.

Making the Builder classes immutable would:

introduce avoidable performance overhead due to repeated allocations
complicate usage
provide limited practical safety benefits, since the Builder is not intended for concurrent sharing.

Setter naming convention

Setter methods of the UriBuilder and UrlBuilder classes follow the naming convention which is already widespread among internal functions: they use a set prefix, e.g. setScheme(), setHost(). The current RFC rejects the usage of any other naming convention, most notably the omission of the set prefix (e.g. scheme(), host()) due to the following reasons:

The set prefix adds additional context about the intended behavior: all proposed setters completely overwrite the related component. E.g. setQuery() and setQueryParams() neither prepend nor append their input to the existing query string, but they both overwrite the whole component. If set were omitted from the method name, this additional context would be missing entirely, and people would have even less idea about what happens when they call these methods.
Using the set prefix for the setters would allow the addition of other convenience methods in the future more naturally: e.g. appendQueryParams(), appendPathSegments() etc.

Voting

Add URI building support as outlined in the RFC?

URI Type Detection

RFC 3986 distinguishes different URI “types” based on what they begin with. Actually, the RFC 3986 specification collectively refers to these as URI-references.

Relative-reference: Starts with a path, and the scheme is therefore omitted. Relative-references can be further grouped into the following types:
- Absolute-path reference: Starts with a single slash (“/”), e.g.: “/foo”
- Relative-path reference: Starts without a slash (“/”), e.g.: “foo”
- Network-path reference: Starts with a double slash (“//”) followed by an authority, e.g.: //host/foo
URI: Starts with the scheme component, and then continues with either the authority, or the path
- Absolute URI: A subtype of URI that doesn't include the fragment component.

In order to better support granular RFC 3986 URI type detection, the following enums and methods are proposed to be added:

namespace Uri\Rfc3986 {
    enum UriType
    {
        case AbsolutePathReference;
        case RelativePathReference;
        case NetworkPathReference;
        case Uri;
    }
 
    final readonly class Uri
    {
        ...
 
        public function getUriType(): \Uri\Rfc3986\UriType {}
 
        ...
    }
}

This way, it becomes easier to detect URI types:

$uri = new Uri\Rfc3986\Uri("https://example.com");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::Uri
 
$uri = new Uri\Rfc3986\Uri("https:");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::Uri
 
$uri = new Uri\Rfc3986\Uri("/foo");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::AbsolutePathReference
 
$uri = new Uri\Rfc3986\Uri("foo");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::RelativePathReference
 
$uri = new Uri\Rfc3986\Uri("//host.com/foo");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::NetworkPathReference

The position of this RFC is that distinguishing between URIs and absolute URIs (URIs that don't include a fragment component) doesn't need special support, therefore a dedicated Uri\Rfc3986\UriType::AbsoluteUri enum case is omitted from the proposal.
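
To make the classification rules more tangible, here is a rough userland approximation that only looks at how a reference begins. The sketchReferenceType() function is hypothetical, returns the enum case names as plain strings, and assumes its input is already a syntactically valid URI-reference; the proposed getUriType() method works on the parsed URI instead.

```php
<?php
// Hypothetical helper; assumes $reference is a valid RFC 3986 URI-reference.
function sketchReferenceType(string $reference): string
{
    // A leading "scheme:" marks a URI. The first segment of a relative-path
    // reference cannot contain ":", so this check is unambiguous.
    if (preg_match('/^[A-Za-z][A-Za-z0-9+.\-]*:/', $reference) === 1) {
        return "Uri";
    }
    if (str_starts_with($reference, "//")) {
        return "NetworkPathReference";
    }
    if (str_starts_with($reference, "/")) {
        return "AbsolutePathReference";
    }
    return "RelativePathReference";
}

echo sketchReferenceType("https://example.com"), "\n";  // Uri
echo sketchReferenceType("//host.com/foo"), "\n";       // NetworkPathReference
echo sketchReferenceType("/foo"), "\n";                 // AbsolutePathReference
echo sketchReferenceType("foo"), "\n";                  // RelativePathReference
```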

The WHATWG URL specification defines some special schemes (http, https, ftp, file, ws, wss), which have distinct parsing and serialization rules. In order to make checks for special URLs easier to perform, a new Uri\WhatWg\Url::isSpecialScheme() method is added:

namespace Uri\WhatWg {
    final readonly class Url
    {
        ...
 
        public function isSpecialScheme(): bool {}
 
        ...
    }
}

This enables low-level control for applications that need to mirror WHATWG behavior in parsing or normalization.

$url = new Uri\WhatWg\Url("https://example.com");
var_dump($url->isSpecialScheme());                // true
 
$url = new Uri\WhatWg\Url("custom:example");
var_dump($url->isSpecialScheme());                // false
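
For comparison, the check that applications currently have to write by hand might look like the sketch below. The sketchIsSpecialScheme() helper is hypothetical; it takes the scheme as a plain string and lowercases it first, on the assumption that the caller didn't obtain it from an already normalized URL.

```php
<?php
// Hypothetical helper; the list is the set of special schemes named above.
function sketchIsSpecialScheme(string $scheme): bool
{
    return in_array(strtolower($scheme), ["http", "https", "ftp", "file", "ws", "wss"], true);
}

var_dump(sketchIsSpecialScheme("https"));   // bool(true)
var_dump(sketchIsSpecialScheme("custom"));  // bool(false)
```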

Add support for detecting URI type as outlined in the RFC?

Host Type Detection

Both the RFC 3986 and WHATWG URL specifications distinguish different types of the host component because each of them has different parsing and formatting rules. Probably the most notable example is the IPv6 host type that requires the IPv6 address to be written between a [ and ] pair.

In order to support returning information about the host type, the following enums and methods are proposed to be added:

namespace Uri\Rfc3986 {
    enum UriHostType
    {
        case IPv4;
        case IPv6;
        case IPvFuture;
        case RegisteredName;
    }
 
    final readonly class Uri
    {
        ...
 
        public function getHostType(): ?\Uri\Rfc3986\UriHostType {}
 
        ...
    }
}

namespace Uri\WhatWg {
    enum UrlHostType
    {
        case IPv4;
        case IPv6;
        case Domain;
        case Opaque;
        case Empty;
    }
 
    final readonly class Url
    {
        ...
 
        public function getHostType(): ?\Uri\WhatWg\UrlHostType {}
 
        ...
    }
}

The new getHostType() methods return the type of the host component for both specifications.

Let's see a few examples for RFC 3986:

$uri = new Uri\Rfc3986\Uri("https://192.168.0.1/");
var_dump($uri->getHostType());             // UriHostType::IPv4
 
$uri = new Uri\Rfc3986\Uri("https://[2001:db8::1]/");
var_dump($uri->getHostType());             // UriHostType::IPv6
 
$uri = new Uri\Rfc3986\Uri("https://[v1.1.2.3]/");
var_dump($uri->getHostType());             // UriHostType::IPvFuture
 
$uri = new Uri\Rfc3986\Uri("https://example.com/");
var_dump($uri->getHostType());             // UriHostType::RegisteredName
 
$uri = new Uri\Rfc3986\Uri("file:///C:/a.txt");
var_dump($uri->getHostType());             // null
 
$uri = new Uri\Rfc3986\Uri("foo:bar/baz");
var_dump($uri->getHostType());             // null
 
$uri = new Uri\Rfc3986\Uri("/foo/bar");
var_dump($uri->getHostType());             // null
 
$uri = new Uri\Rfc3986\Uri("mailto:john.doe@example.com");
var_dump($uri->getHostType());             // null

According to RFC 3986, the host can be either an IPv4 or an IPv6 address, a so-called IPvFuture address (a potential IP address type that might be developed after IPv6), or a registered name (usually but not exclusively a DNS name). RFC 3986 also allows the host to be missing (/foo/bar) or empty (“https://”), and in these cases Uri\Rfc3986\Uri::getHostType() returns null.
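
The RFC 3986 categories can be approximated in userland as follows. This is only a sketch under stated assumptions: sketchRfc3986HostType() is a hypothetical function, it receives the host as it appears in the URI (IP literals still in brackets), it returns the case names as strings, and it delegates IPv4 and IPv6 validation to filter_var(), whose rules are not guaranteed to match the RFC 3986 grammar in every edge case.

```php
<?php
// Hypothetical helper; not the proposed getHostType() implementation.
function sketchRfc3986HostType(?string $host): ?string
{
    if ($host === null || $host === "") {
        return null; // missing or empty host
    }
    if ($host[0] === "[" && str_ends_with($host, "]")) {
        $literal = substr($host, 1, -1);
        // IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
        if (preg_match('/^v[0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&\'()*+,;=:]+$/', $literal) === 1) {
            return "IPvFuture";
        }
        if (filter_var($literal, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false) {
            return "IPv6";
        }
        throw new InvalidArgumentException("Malformed IP literal");
    }
    if (filter_var($host, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false) {
        return "IPv4";
    }
    return "RegisteredName";
}

var_dump(sketchRfc3986HostType("[2001:db8::1]"));  // string(4) "IPv6"
var_dump(sketchRfc3986HostType("example.com"));    // string(14) "RegisteredName"
var_dump(sketchRfc3986HostType(null));             // NULL
```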

Some examples for WHATWG URL:

$url = new Uri\WhatWg\Url("https://192.168.0.1/");
var_dump($url->getHostType());             // UrlHostType::IPv4
 
$url = new Uri\WhatWg\Url("https://[2001:db8::1]/");
var_dump($url->getHostType());             // UrlHostType::IPv6
 
$url = new Uri\WhatWg\Url("https://example.com/");
var_dump($url->getHostType());             // UrlHostType::Domain
 
$url = new Uri\WhatWg\Url("file:///C:/a.txt");
var_dump($url->getHostType());             // UrlHostType::Empty

As can be seen, the behavior of the WHATWG URL specification is straightforward in case of special URLs: the host can be either an IPv4 or an IPv6 address, a domain, or it can be empty (when the host is missing).

Things get more complicated when we look at non-special URLs:

$url = new Uri\WhatWg\Url("git://example.com/whatwg/url.git");
var_dump($url->getHostType());             // UrlHostType::Opaque
 
$url = new Uri\WhatWg\Url("scheme://127.0.0.1/");
var_dump($url->getHostType());             // UrlHostType::Opaque
 
$url = new Uri\WhatWg\Url("mailto://john.doe@example.com");
var_dump($url->getHostType());             // UrlHostType::Opaque
 
$url = new Uri\WhatWg\Url("mailto:john.doe@example.com");
var_dump($url->getHostType());             // null
 
$url = new Uri\WhatWg\Url("foo:/bar/baz");
var_dump($url->getHostType());             // null

Hosts of non-special URLs can be either opaque (note the various opaque hosts!), or null (when the host is missing). While treating the host of any non-special URL as opaque may seem unusual at first, this follows directly from the design principles of the WHATWG URL specification: the specification intentionally avoids making assumptions about the syntax of schemes it does not define. For example, it cannot know how a hypothetical foo scheme structures its host component. Therefore, such hosts are treated as opaque and are not subject to further parsing or validation.

These considerations explain why this RFC defines two separate enums (UriHostType and UrlHostType) for the two specifications, even though they contain similar or partially overlapping cases.

RFC 3986 defines a generic URI syntax. Its host categorization reflects this generality as it uses the “registered name” phrasing, without assuming any particular name resolution mechanism. In particular, a registered name is a syntactic production and does not imply a DNS domain.

In contrast, the WHATWG URL specification defines a host model mostly tailored to web interoperability. Because these two specifications operate at different abstraction levels and assign different semantics to superficially similar host forms, their host type systems are not compatible. For example, an RFC 3986 registered name cannot in general be mapped to a WHATWG domain, and WHATWG’s opaque and empty host concepts have no direct equivalents in RFC 3986.

Therefore, a unified host type enum would either blur these semantic distinctions or incorrectly suggest compatibility where none exists. Using separate enums ensures that the host classification faithfully reflects the underlying specification being applied.

Add support for host type detection as outlined in the RFC?

Accessing Path Segments as an Array

Sometimes, it is necessary to access path segments rather than the whole path as a string. In that case, splitting the path into segments manually after retrieval is both inconvenient and disadvantageous performance-wise, especially considering the fact that Uri\Rfc3986\Uri internally stores the path as a list of segments.

That's why the following methods are proposed to be added:

namespace Uri\Rfc3986 {
    final readonly class Uri
    {
        ...
 
        public function getRawPathSegments(): array {}
 
        public function getPathSegments(): array {}
 
        public function getDecodedPathSegments(): array {}
 
        public function withPathSegments(array $segments, \Uri\Rfc3986\LeadingSlashPolicy $leadingSlashPolicy = \Uri\Rfc3986\LeadingSlashPolicy::AddForNonEmptyRelative): static {}
 
        ...
    }
 
    enum LeadingSlashPolicy
    {
        case AddForNonEmptyRelative;
        case NeverAdd;
    }
}
 
namespace Uri\WhatWg {
    final readonly class Url
    {
        ...
 
        public function getPathSegments(): array|string {}
 
        public function withPathSegments(array $segments): static {}
 
        ...
    }
}

This way, it is possible to write the following code:

$uri = new Uri\Rfc3986\Uri("https://example.com/foo/bar/baz");
$segments = $uri->getPathSegments();        // ["foo", "bar", "baz"]
 
$uri = $uri->withPathSegments(["a", "b"]);
echo $uri->getPath();                       // /a/b

The same also works for WHATWG URL:

$url = new Uri\WhatWg\Url("https://example.com/foo/bar/baz");
$segments = $url->getPathSegments();        // ["foo", "bar", "baz"]
 
$url = $url->withPathSegments(["a", "b"]);
echo $url->getPath();                       // /a/b

Uri\Rfc3986\Uri::getPathSegments() and Uri\WhatWg\Url::getPathSegments() split the path according to the rules of their respective specification.

Uri\Rfc3986\Uri::withPathSegments() and Uri\WhatWg\Url::withPathSegments() internally concatenate the input segments separated by a / character, and then trigger the respective withPath() method to update the path.

Segment definition of RFC 3986

To better understand why and exactly how this functionality works, we should look more closely at how RFC 3986 defines the path and path segments: according to the specification, path segments start after the leading “/” due to the following ABNF rule:

path-abempty  = *( "/" segment )

That is, the path-abempty syntax only applies in case of URIs containing an authority component, and it declares that the path is either empty, or contains a “/” followed by a segment one or multiple times. Then segments have the following syntax:

segment       = *pchar
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

That is, segments are composed of zero or multiple characters in the “pchar” charset (the exact values don't matter in this case). It should be mentioned that there are some additional special-case segment syntaxes (they are marked with segment-nz and segment-nz-nc in the ABNF syntax), but let's disregard them now for ease of understanding.

The above definitions imply that an empty path has zero segments:

$uri = new Uri\Rfc3986\Uri("https://example.com");
$segments = $uri->getPathSegments();        // []

When the path consists of a leading “/” and a string matching the segment syntax (e.g. /foo), the path has one segment:

$uri = new Uri\Rfc3986\Uri("https://example.com/foo");
$segments = $uri->getPathSegments();        // ["foo"]

We can easily see based on the above example that the URI https://example.com/ also has a single segment - but it's empty:

$uri = new Uri\Rfc3986\Uri("https://example.com/");
$segments = $uri->getPathSegments();        // [""]

This is perfectly valid, because segments can be empty (at least in the above case when the URI has an authority). Another interesting question is how segments are represented when the path has a trailing slash (e.g. /foo/)? Consistent with the above rules, it's the following:

$uri = new Uri\Rfc3986\Uri("https://example.com/foo/");
$segments = $uri->getPathSegments();        // ["foo", ""]

A few other special cases are also collected below:

“https://#foo”: It means that the URI has an empty authority starting after the “//” characters, and the path is also empty, and therefore this URI has zero path segments
“https:/”: It means that the URI has no authority and the path starts after the “:” character (it is “/”), therefore this URI has one empty path segment
“https:”: It means that the URI has no authority, and the path starts after the “:” character (it is “”), therefore this URI has zero path segments
“” (empty string): It means that the relative reference consists of a single path component which is empty, and therefore this relative reference has zero path segments
“/foo”: It means that the relative reference consists of a single path component which is “/foo”, and therefore this relative reference has one path segment “foo”
“foo”: It means that the relative reference consists of a single path component which is “foo”, and therefore this relative reference has one path segment “foo”
“foo/”: It means that the relative reference consists of a single path component which is “foo/”, and therefore this relative reference has two path segments “foo” and “”
“/”: It means that the relative reference consists of a single path component which is “/”, and therefore this relative reference has one empty path segment
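
The segment counts listed above can be reproduced with a few lines of userland code. The sketchRfc3986PathSegments() function below is a hypothetical illustration that operates on the raw path string only; it performs no validation or normalization.

```php
<?php
// Hypothetical helper; splits an RFC 3986 path string into its segments.
function sketchRfc3986PathSegments(string $path): array
{
    if ($path === "") {
        return []; // an empty path has zero segments
    }
    // The leading "/" introduces the first segment; it isn't a segment itself.
    if ($path[0] === "/") {
        $path = substr($path, 1);
    }
    return explode("/", $path);
}

var_dump(sketchRfc3986PathSegments("") === []);                  // bool(true)
var_dump(sketchRfc3986PathSegments("/") === [""]);               // bool(true)
var_dump(sketchRfc3986PathSegments("/foo/") === ["foo", ""]);    // bool(true)
var_dump(sketchRfc3986PathSegments("foo/") === ["foo", ""]);     // bool(true)
```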

Segment definition of WHATWG URL

As always, WHATWG URL has similar, but somewhat different rules. First of all, the definition of a path segment is the following:

A URL path segment is an ASCII string. It commonly refers to a directory or a file, but has no predefined meaning.

Even though it's not a very specific definition, it aligns with the segment definition of RFC 3986. Then WHATWG URL defines the path component based on segments:

A URL path is either a URL path segment or a list of zero or more URL path segments.

This is a major shift from RFC 3986, because it states that the path can be either a single segment or a list of segments, depending on the following cases:

Opaque URLs have an opaque path, which cannot be divided any further into segments, therefore the path consists of a single path segment, represented as a string
The path of special URLs can be split into segments, therefore the path consists of a list of zero or more path segments, represented as an array of strings

This behavior is consistent with how WHATWG URL categorizes host types: only special hosts which are known are attempted to be inspected, and no assumptions are made about the unknown ones. The same happens in case of paths.

Let's see a few typical examples, first related to special URLs where WHATWG URL behaves the same way as RFC 3986:

$url = new Uri\WhatWg\Url("https://example.com/");
$segments = $url->getPathSegments();           // [""]
 
$url = new Uri\WhatWg\Url("https://example.com/foo");
$segments = $url->getPathSegments();           // ["foo"]
 
$url = new Uri\WhatWg\Url("https://example.com/foo/");
$segments = $url->getPathSegments();           // ["foo", ""]
 
$url = new Uri\WhatWg\Url("https://example.com/foo/bar");
$segments = $url->getPathSegments();           // ["foo", "bar"]

Now let's see how non-special URLs behave:

$url = new Uri\WhatWg\Url("scheme://example.com/");
$segments = $url->getPathSegments();           // "/"
 
$url = new Uri\WhatWg\Url("scheme://example.com/foo");
$segments = $url->getPathSegments();           // "/foo"
 
$url = new Uri\WhatWg\Url("scheme://example.com/foo/");
$segments = $url->getPathSegments();           // "/foo/"
 
$url = new Uri\WhatWg\Url("scheme://example.com/foo/bar");
$segments = $url->getPathSegments();           // "/foo/bar"

Consistent with WHATWG URL's definition, Uri\WhatWg\Url::getPathSegments() returns the whole path as a string in case of non-special URLs.
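
Since the return type is array|string, callers have to branch on it before iterating. The sketch below mimics the behavior shown in the examples above with a hypothetical sketchWhatWgPathSegments() function; it takes the scheme and the path as plain strings and assumes that the path of a special URL always starts with “/”.

```php
<?php
// Hypothetical helper mirroring the array|string return type described above.
function sketchWhatWgPathSegments(string $scheme, string $path): array|string
{
    $special = ["http", "https", "ftp", "file", "ws", "wss"];
    if (!in_array($scheme, $special, true)) {
        return $path; // returned whole, as a string
    }
    // Assumption: the path starts with "/", which precedes the first segment.
    return explode("/", substr($path, 1));
}

$segments = sketchWhatWgPathSegments("https", "/foo/bar");

// Branch on the union type before treating the result as a list.
$count = is_array($segments) ? count($segments) : 1;

echo $count, "\n";                             // 2
```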

$url = new Uri\WhatWg\Url("scheme://example.com/");
$url = $url->withPathSegments(["foo"]);
echo $url->getPath();                          // "/foo"
 
$url = new Uri\WhatWg\Url("scheme://example.com/");
$url = $url->withPathSegments(["foo", ""]);
echo $url->getPath();                          // "/foo/"

Even though opaque paths cannot be split into segments, path modification via Uri\WhatWg\Url::withPathSegments() still works the same way as for non-opaque paths: the $segments argument is concatenated into a path string, and the path component is overwritten with this value.

Ambiguity

There's one edge case which needs disambiguation in relation to the RFC 3986 specification and the withPathSegments() method. Let's consider the following example:

$uri = new Uri\Rfc3986\Uri("/foo");            // absolute-path reference
 
$uri = $uri->withPathSegments(["bar"]);        // should the result be "/bar" or "bar"?

In this case, it would be ambiguous whether the resulting URI is an absolute- or a relative-path reference.

That's why Uri\Rfc3986\Uri::withPathSegments() has a second parameter $leadingSlashPolicy, which can be used to decide if a relative reference should become an absolute- or a relative-path reference:

$uri = new Uri\Rfc3986\Uri("/foo");            // absolute-path reference
 
$uri = $uri->withPathSegments(["bar"], Uri\Rfc3986\LeadingSlashPolicy::NeverAdd); // The leading slash is not prepended
 
echo $uri->getPath();                          // bar
 
$uri = new Uri\Rfc3986\Uri("foo");             // relative-path reference
 
$uri = $uri->withPathSegments(["bar"], Uri\Rfc3986\LeadingSlashPolicy::AddForNonEmptyRelative);  // The leading slash is prepended
 
echo $uri->getPath();                          // /bar

The Uri\Rfc3986\LeadingSlashPolicy::AddForNonEmptyRelative enum case only has an effect when the URI is a relative reference and the first path segment is not empty. All other cases are unambiguous.
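
The effect of the two policy cases used in the examples can be approximated in userland. The following is only a minimal sketch, not the proposed implementation: the SketchLeadingSlashPolicy enum and the sketch_join_segments() function are hypothetical names, and the sketch assumes that the URI is a relative reference (no scheme and no authority), which is the only situation where the choice matters.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the two policy cases shown in the examples.
enum SketchLeadingSlashPolicy
{
    case NeverAdd;
    case AddForNonEmptyRelative;
}

// Hypothetical helper: joins segments into a path string for a relative reference.
function sketch_join_segments(array $segments, SketchLeadingSlashPolicy $policy): string
{
    $path = implode('/', $segments);

    // Prepend "/" only when requested and when the first segment is not empty.
    if (
        $policy === SketchLeadingSlashPolicy::AddForNonEmptyRelative
        && $segments !== []
        && $segments[0] !== ''
    ) {
        return '/' . $path;
    }

    return $path;
}

echo sketch_join_segments(['bar'], SketchLeadingSlashPolicy::NeverAdd), "\n";               // bar
echo sketch_join_segments(['bar'], SketchLeadingSlashPolicy::AddForNonEmptyRelative), "\n"; // /bar
```

How the real method treats URIs that have an authority is not covered by this sketch.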

Since WHATWG URL doesn't support relative references, there's no case which needs disambiguation, and that's why the $leadingSlashPolicy parameter is not needed when modifying path segments.

Percent-Encoding and Decoding

Path segment retrieval works the same way as path retrieval does. In case of RFC 3986, the getRawPathSegments() and getPathSegments() methods follow the percent-decoding behavior of getRawPath() and getPath(), respectively. Furthermore, a getDecodedPathSegments() method is added to improve user experience.

getRawPathSegments(): Returns the path segments non-normalized, without any post-processing.
getPathSegments(): Returns the normalized path segments, without any post-processing.
getDecodedPathSegments(): Returns the normalized path segments, with the percent-encoded octets decoded in the path segment context.

Let's see what the distinction is between the above methods in practice:

$uri = new Uri\Rfc3986\Uri("/fo%6F/bar%2fbaz");  // percent-encoded form of "/foo/bar/baz"
 
$segments = $uri->getRawPathSegments();          // ["fo%6F", "bar%2fbaz"]
$segments = $uri->getPathSegments();             // ["foo", "bar%2Fbaz"]
$segments = $uri->getDecodedPathSegments();      // ["foo", "bar/baz"]

Uri\Rfc3986\Uri::getRawPathSegments() returns the path separated into segments as-is.

Uri\Rfc3986\Uri::getPathSegments() returns the normalized path separated into segments, consistent with the other Uri\Rfc3986\Uri getters, as seen in the Advanced Examples of the original ext/uri RFC. That is, reserved characters are not percent-decoded, as mentioned in the Generic percent-decoding introduction of the original ext/uri RFC.

Finally, Uri\Rfc3986\Uri::getDecodedPathSegments() returns the normalized path separated into segments whose contents are percent-decoded according to the segment ABNF rule. Unlike Uri\Rfc3986\Uri::getPathSegments(), this method also percent-decodes the “%2F” octet representing the “/” character. Although “/” is a reserved character in the generic URI syntax, once the path has already been split, it is no longer syntactically ambiguous within an individual segment. As a result, this method intentionally goes beyond a strict application of the generic syntax defined by RFC 3986. By decoding reserved characters within segment boundaries, it provides an application-level interpretation of the path segments rather than a purely syntactic representation.
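
The difference between the three views can be illustrated with a small userland approximation. This is only a sketch under stated assumptions, not the extension's algorithm: the sketch_split_path() and sketch_decode_octets() helpers and the SKETCH_* constants are hypothetical, dot-segment removal is left out, and the set of characters decoded for the third view (the pchar characters plus “/”) is one reading of the description above.

```php
<?php
declare(strict_types=1);

const SKETCH_UNRESERVED = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~';
// Assumption: unreserved + sub-delims + ":" + "@" + "/" are decodable inside a segment.
const SKETCH_SEGMENT_CHARS = SKETCH_UNRESERVED . '!$&\'()*+,;=:@/';

// Hypothetical helper: splits an absolute path on "/" without touching the octets.
function sketch_split_path(string $path): array
{
    if (str_starts_with($path, '/')) {
        $path = substr($path, 1); // drop the single leading slash
    }

    return explode('/', $path);
}

// Hypothetical helper: decodes only octets whose character is listed in
// $decodable and uppercases the hex digits of the octets that stay encoded.
function sketch_decode_octets(string $segment, string $decodable): string
{
    return (string) preg_replace_callback(
        '~%([0-9A-Fa-f]{2})~',
        static function (array $m) use ($decodable): string {
            $char = chr((int) hexdec($m[1]));

            return strspn($char, $decodable) === 1 ? $char : strtoupper($m[0]);
        },
        $segment
    );
}

$raw = sketch_split_path('/fo%6F/bar%2fbaz');
$normalized = array_map(fn (string $s): string => sketch_decode_octets($s, SKETCH_UNRESERVED), $raw);
$decoded = array_map(fn (string $s): string => sketch_decode_octets($s, SKETCH_SEGMENT_CHARS), $raw);

echo json_encode($raw, JSON_UNESCAPED_SLASHES), "\n";        // ["fo%6F","bar%2fbaz"]
echo json_encode($normalized, JSON_UNESCAPED_SLASHES), "\n"; // ["foo","bar%2Fbaz"]
echo json_encode($decoded, JSON_UNESCAPED_SLASHES), "\n";    // ["foo","bar/baz"]
```

Because the decoding happens after the split, the “/” produced from “%2F” can no longer be mistaken for a separator.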

Design considerations

Should Uri\Rfc3986\Uri and Uri\WhatWg\Url really support path segment handling?

Some schemes don't use “/” to express the hierarchy inside the path according to their scheme-specific processing rules: e.g. in case of the “mailto” scheme, the “@” separates the “local name” and the “domain name” in the path (nobody@example.com). There are even schemes which don't support hierarchy in the path component at all. This leads to the question whether Uri\Rfc3986\Uri should really support path segments as described: Uri\Rfc3986\Uri is supposed to implement the generic URI syntax, so its functionality should apply to all URIs.

The answer is yes, it should, because the generic URI syntax uses path segments and the “/” separator to define the path component (remember the ABNF rules above!). It's possible that separating the path into segments is semantically incorrect in case of some schemes, but syntactically speaking, path segments are first-class citizens of the generic URI syntax.

WHATWG URL goes one step further: it explicitly defines how the path is separated into segments only in case of some specific schemes (special URLs), and it explicitly leaves them undefined for the rest of the URLs (opaque URLs). This way, there's no gap between the syntactic and semantic interpretation of path segments.

Why isn't there a PathSegments class?

Path segments could be modeled as a dedicated class (e.g. PathSegments) rather than simple arrays. Mainly, this would improve their extensibility: new features could be added to this class easily in the future. However, the current RFC still chooses the array model for a few reasons.

First, it's unclear how and when validation should happen:

Should the PathSegments class always be valid just like the rest of the ext/uri classes? If the answer is no, then there will be a discrepancy, and it would possibly go against user expectations.
However, if the answer is yes, then there are a few big hurdles:
- Should we add a dedicated class per specification? WHATWG URL and RFC 3986 have vastly different path definitions, so ideally there should indeed be two classes, or at the very least, a dedicated factory method per specification.
- To complicate things, both specifications use a context-sensitive algorithm for path validation: WHATWG URL has the notion of opaque paths which depend on the scheme and the presence of the host. RFC 3986 has different segment parsing rules for relative-path references, absolute-path references, and URIs (see segment, segment-nz, segment-nz-nc).

Add support for accessing path segments as an array as outlined in the RFC?
Real name	Yes	No	Abstain
Final result:	0	0	0
This poll has been closed.

Percent-Encoding and Decoding Support

Contrary to the common belief, which is probably further reinforced by the urlencode() and urldecode() functions, percent-encoding and decoding are both context-sensitive processes. Context-sensitivity means that different characters need to be percent-encoded/percent-decoded depending on which URI component is being processed.

It should also be mentioned that urlencode() and urldecode() are in fact meant for the application/x-www-form-urlencoded media type, while rawurlencode() and rawurldecode() implement RFC 3986 more closely.

For example, the path component assigns special meaning to the / character. Therefore, this character doesn't necessarily have to be percent-encoded in the path component. There are some cases though when it makes sense to percent-encode it, as highlighted by the first example within the “Advanced examples” section of the original URI RFC. Unfortunately, rawurlencode() doesn't take the component into account, and replaces the “/” with “%2F” unconditionally.

echo rawurlencode("/foo/bar/baz");                // %2Ffoo%2Fbar%2Fbaz
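
With the functions available today, a common workaround is to apply rawurlencode() to each segment separately and to join the results with “/”. The sketch below shows this; sketch_encode_path() is a hypothetical helper name. It is stricter than RFC 3986 requires (sub-delimiters, “:” and “@” are encoded as well), and it would double-encode any “%” that is already part of a percent-encoded octet.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: percent-encodes a path while keeping "/" as the separator.
function sketch_encode_path(string $path): string
{
    // rawurlencode() leaves only unreserved characters intact,
    // so it is applied per segment and the separators are restored afterwards.
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

echo rawurlencode('/foo/bar baz'), "\n";       // %2Ffoo%2Fbar%20baz
echo sketch_encode_path('/foo/bar baz'), "\n"; // /foo/bar%20baz
```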

In order to correctly handle percent-encoding and decoding based on the rules of RFC 3986 and WHATWG URL, the following methods and enums are proposed to be added:

namespace Uri\Rfc3986 {
    enum UriPercentEncodingMode
    {
        case UserInfo;
        case Host;
        case Path;
        case PathSegment;
        case Query;
        case FormQuery;
        case Fragment;
        case AllReservedCharacters;
        case AllButUnreservedCharacters;
    }
 
    final readonly class Uri
    {
        ...
 
        public static function percentEncode(string $input, \Uri\Rfc3986\UriPercentEncodingMode $mode): string {}
 
        public static function percentDecode(string $input, \Uri\Rfc3986\UriPercentEncodingMode $mode): string {}
 
        ...
    }
}

namespace Uri\WhatWg {
    enum UrlPercentEncodingMode
    {
        case UserInfo;
        case Host;
        case OpaqueHost;
        case Path;
        case PathSegment;
        case OpaquePath;
        case OpaquePathSegment;
        case Query;
        case SpecialQuery;
        case FormQuery;
        case Fragment;
    }
 
    final readonly class Url
    {
        ...
 
        public static function percentEncode(string $input, \Uri\WhatWg\UrlPercentEncodingMode $mode): string {}
 
        public static function percentDecode(string $input, \Uri\WhatWg\UrlPercentEncodingMode $mode): string {}
 
        ...
    }
}

The percentEncode() and percentDecode() methods both require an input string and a PercentEncodingMode enum to be passed. The enums make the context of the encoding/decoding processes fully explicit and clear. The following modes are supported:

Uri\Rfc3986\UriPercentEncodingMode
- UserInfo: Besides unreserved characters, percent-encoded octets, as well as sub-delimiters, it also allows the following characters to be present: “:”. Any other characters are percent-encoded.
- Host: If the input string is a valid IPv4, an IPv6 or an IPvFuture address, no percent-encoding is performed, since these host types do not support the process. Otherwise (for registered names), unreserved characters, percent-encoded octets, as well as sub-delimiters are allowed to be present. Any other characters are percent-encoded.
- Path: Besides unreserved characters, percent-encoded octets, as well as sub-delimiters, it also allows the following characters to be present: “/”, “:”, “@”. Any other characters are percent-encoded.
- PathSegment: Besides unreserved characters, percent-encoded octets, as well as sub-delimiters, it also allows the following characters to be present: “:”, “@”. Any other characters are percent-encoded.
- Query: Besides unreserved characters, percent-encoded octets, as well as sub-delimiters, it also allows the following characters to be present: “:”, “@”, “/”, and “?”. Any other characters are percent-encoded.
- FormQuery: It is mostly the same as Uri\Rfc3986\UriPercentEncodingMode::Query, but it behaves according to the application/x-www-form-urlencoded media type rather than RFC 3986. The only difference between the two is that “ ” is encoded as “+”.
- Fragment: Besides unreserved characters, percent-encoded octets, as well as sub-delimiters, it also allows the following characters to be present: “:”, “@”, “/”, and “?”. Any other characters are percent-encoded.
- AllReservedCharacters: All reserved characters are percent-encoded. The rest of the characters are left as-is.
- AllButUnreservedCharacters: Besides unreserved characters and percent-encoded octets, all other characters are percent-encoded.

For the complete ABNF syntax of each component, consult Appendix A of RFC 3986.
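
The allow-list character of these modes can be sketched in userland as follows. This is an illustration under stated assumptions rather than the proposed implementation: the SKETCH_EXTRA_ALLOWED table and the sketch_percent_encode() function are hypothetical, modes are passed as plain strings, and the Host, FormQuery, AllReservedCharacters and AllButUnreservedCharacters modes are left out.

```php
<?php
declare(strict_types=1);

// Hypothetical lookup table: characters allowed in addition to unreserved
// characters, sub-delimiters and already percent-encoded octets.
const SKETCH_EXTRA_ALLOWED = [
    'UserInfo'    => ':',
    'Path'        => '/:@',
    'PathSegment' => ':@',
    'Query'       => '/:@?',
    'Fragment'    => '/:@?',
];

function sketch_percent_encode(string $input, string $mode): string
{
    // Punctuation that may stay as-is: unreserved marks, sub-delims, mode extras.
    $literal = preg_quote('-._~' . '!$&\'()*+,;=' . SKETCH_EXTRA_ALLOWED[$mode], '~');

    // Alternatives, tried in order: an existing percent-encoded octet,
    // an allowed character, or any other single byte (captured in group 1).
    $pattern = '~%[0-9A-Fa-f]{2}|[A-Za-z0-9' . $literal . ']|(.)~s';

    return (string) preg_replace_callback(
        $pattern,
        static fn (array $m): string => isset($m[1]) ? sprintf('%%%02X', ord($m[1])) : $m[0],
        $input
    );
}

echo sketch_percent_encode('bar/baz', 'PathSegment'), "\n"; // bar%2Fbaz
echo sketch_percent_encode('/foo bar', 'Path'), "\n";       // /foo%20bar
echo sketch_percent_encode('a%2Fb', 'PathSegment'), "\n";   // a%2Fb
```

The same input therefore produces different output depending on the mode: “/” is encoded for a single path segment but kept for a whole path.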

Uri\WhatWg\UrlPercentEncodingMode
- UserInfo: Besides the code points percent-encoded by Uri\WhatWg\UrlPercentEncodingMode::Path, the following code points are percent-encoded: U+002F (/), U+003A (:), U+003B (;), U+003D (=), U+0040 (@), U+005B ([) to U+005D (]), inclusive, and U+007C (|).
- OpaqueHost: Control characters, and all code points greater than ~ are percent-encoded.
- Path: Besides the code points percent-encoded by Uri\WhatWg\UrlPercentEncodingMode::Query, the following code points are percent-encoded: U+003F (?), U+005E (^), U+0060 (`), U+007B ({), and U+007D (}).
- PathSegment: Besides the code points percent-encoded by Uri\WhatWg\UrlPercentEncodingMode::Query, the following code points are percent-encoded: U+003F (?), U+005E (^), U+0060 (`), U+007B ({), U+007D (}), and U+002F (/).
- OpaquePathSegment:
- Query: Besides Control characters, and all code points greater than ~, the following code points are percent-encoded: U+0020 SPACE, U+0022 ("), U+0023 (#), U+003C (<), and U+003E (>).
- SpecialQuery: Besides the code points percent-encoded by Uri\WhatWg\UrlPercentEncodingMode::Query, the following code points are percent-encoded: U+0027 (')
- FormQuery: Besides the code points percent-encoded by Uri\WhatWg\UrlPercentEncodingMode::UserInfo, the following code points are percent-encoded: U+0024 ($) to U+0026 (&), inclusive, U+002B (+), U+002C (,), U+0021 (!), U+0027 (') to U+0029 RIGHT PARENTHESIS, inclusive, and U+007E (~).
- Fragment: Besides Control characters, and all code points greater than ~, the following code points are percent-encoded: U+0020 SPACE, U+0022 ("), U+003C (<), U+003E (>), and U+0060 (`).

Since neither RFC 3986 nor WHATWG URL supports percent-encoded characters inside the scheme component, neither of the enums contains a Scheme case. WHATWG URL automatically percent-decodes the host when it's special, so Uri\WhatWg\UrlPercentEncodingMode doesn't contain a Host case.
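
In contrast to the allow-lists of RFC 3986, the WHATWG URL modes are layered deny-lists: each set extends another one. The sketch below illustrates this layering for the Query, Path and UserInfo sets listed above. The SKETCH_* constants and the sketch_whatwg_encode() function are hypothetical names, and the sketch works on UTF-8 bytes, so every byte outside the printable ASCII range is encoded.

```php
<?php
declare(strict_types=1);

// Hypothetical constants mirroring the code points listed for each mode.
const SKETCH_QUERY_SET    = " \"#<>";
const SKETCH_PATH_SET     = SKETCH_QUERY_SET . "?^`{}";
const SKETCH_USERINFO_SET = SKETCH_PATH_SET . "/:;=@[\\]|";

function sketch_whatwg_encode(string $input, string $encodeSet): string
{
    $output = '';

    for ($i = 0, $length = strlen($input); $i < $length; $i++) {
        $byte = $input[$i];
        $code = ord($byte);

        // Control characters and everything greater than "~" are always encoded.
        $mustEncode = $code < 0x20 || $code > 0x7E || strspn($byte, $encodeSet) === 1;

        $output .= $mustEncode ? sprintf('%%%02X', $code) : $byte;
    }

    return $output;
}

echo sketch_whatwg_encode('a b?c', SKETCH_QUERY_SET), "\n";        // a%20b?c
echo sketch_whatwg_encode('a b?c', SKETCH_PATH_SET), "\n";         // a%20b%3Fc
echo sketch_whatwg_encode('user:name', SKETCH_USERINFO_SET), "\n"; // user%3Aname
```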

The percentDecode() methods perform the inverse operation of percentEncode(): they decode only those percent-encoded octets that represent characters allowed by the given percent-encoding mode.

$uri = new Uri\Rfc3986\Uri("https://example.com#_%23%2F"); // The fragment is the percent-encoded form of "_#/"
 
echo Uri\Rfc3986\Uri::percentDecode(
    $uri->getFragment(),
    Uri\Rfc3986\UriPercentEncodingMode::Fragment
);                                                         // _%23/

The “/” character is allowed in the fragment, so it's needlessly percent-encoded in the URI - that's why it can be percent-decoded by percentDecode(). On the other hand, “#” is not allowed in the context of the fragment, so it's kept in the percent-encoded octet form.

RFC 3986 has a sentence that apparently contradicts the behavior of Uri\Rfc3986\Uri::percentDecode():

Thus, characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI.

According to this rule, reserved characters - even if they are allowed in the context of a component - should not be percent-decoded during normalization. Even though the Uri\Rfc3986\Uri getters respect this rule, the percentDecode() method intentionally disregards it so that it can serve in use-cases where those getters cannot. Let's see an example:

$uri = new Uri\Rfc3986\Uri("https://example.com/?q=%3A%29"); // The query is the percent-encoded form of "q=:)"
 
echo $uri->getQuery();                            // q=%3A%29
 
echo Uri\Rfc3986\Uri::percentDecode(
    $uri->getQuery(),
    Uri\Rfc3986\UriPercentEncodingMode::Query
);                                                // q=:)

As can be seen above, the getQuery() getter leaves the percent-encoded octets of the two reserved characters (“:” and “)”) as-is, even though both characters are allowed in the context of the query (so they don't need to be percent-encoded at all). By using percentDecode(), one can make the input directly consumable, while scheme-specific or producer-specific algorithms should continue to use the getters should they need to perform any kind of custom processing.
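
A userland approximation of this selective decoding for the query context could look like the sketch below. It is not the proposed implementation: the SKETCH_QUERY_CHARS constant and the sketch_percent_decode_query() function are hypothetical, and the character set is taken from the Query mode description above.

```php
<?php
declare(strict_types=1);

// Assumption: unreserved + sub-delims + "/", ":", "@", "?" may appear literally in a query.
const SKETCH_QUERY_CHARS = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
    . '-._~' . '!$&\'()*+,;=' . '/:@?';

function sketch_percent_decode_query(string $input): string
{
    return (string) preg_replace_callback(
        '~%([0-9A-Fa-f]{2})~',
        static function (array $m): string {
            $char = chr((int) hexdec($m[1]));

            // Decode only if the character may appear literally; keep the octet otherwise.
            return strspn($char, SKETCH_QUERY_CHARS) === 1 ? $char : $m[0];
        },
        $input
    );
}

echo sketch_percent_decode_query('q=%3A%29'), "\n";  // q=:)
echo sketch_percent_decode_query('a%20b%2Fc'), "\n"; // a%20b/c
```

Octets such as “%20” or “%25” stay encoded because the characters they represent cannot appear literally in the query.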

By using the proposed percent-encoding and decoding capabilities, many use-cases that were difficult to achieve before can be implemented in a specification-compliant way.

For example, path segments can be properly percent-encoded when they contain the / character:

$uri = new Uri\Rfc3986\Uri("https://example.com");
$uri = $uri->withPathSegments(
    [
        "foo",
        Uri\Rfc3986\Uri::percentEncode("bar/baz", Uri\Rfc3986\UriPercentEncodingMode::PathSegment)
    ]
);
 
$uri->toRawString();                              // https://example.com/foo/bar%2Fbaz

Add support for percent-encoding and decoding as outlined in the RFC?
Real name	Yes	No	Abstain
Final result:	0	0	0
This poll has been closed.

Backward Incompatible Changes

All the proposed changes are completely backward compatible because the affected classes are all final.

Proposed PHP Version(s)

Next minor version (PHP 8.6)

RFC Impact

To the Ecosystem

What effect will the RFC have on IDEs, Language Servers (LSPs), Static Analyzers, Auto-Formatters, Linters and commonly used userland PHP libraries?

To Existing Extensions

Existing extensions can continue to use the existing URI API without any changes. Some of the features are exposed as PHPAPI functions through public headers.

To SAPIs

None.

Open Issues

None.

Future Scope

None.

Patches and Tests

https://github.com/kocsismate/php-src/pull/9

Implementation

After the RFC is implemented, this section should contain:

the version(s) it was merged into
a link to the git commit(s)
a link to the PHP manual entry for the feature

References

Add RFC 3986 and WHATWG URL compliant API
Discussion thread: https://externals.io/message/129486
RFC 3986: https://datatracker.ietf.org/doc/html/rfc3986
RFC 3987: https://datatracker.ietf.org/doc/html/rfc3987
WHATWG URL specification: https://url.spec.whatwg.org/

Rejected Features

None.
