PHP RFC: Followup Improvements for ext/uri

Introduction

This RFC proposes various follow-up improvements to the URL Parsing API RFC, extending the Uri\Rfc3986\Uri and Uri\WhatWg\Url classes with additional capabilities that were requested during the discussion phase of the original RFC. These capabilities were not deemed essential from the start, so they were postponed in order not to increase the scope even further.

Proposal

The following new functionality is introduced in this proposal:

Each proposed feature is voted on separately and requires a 2/3 majority.

URI Building

Currently, only already existing (and validated) URIs can be manipulated via wither methods. These calls always create a new instance so that the immutability of URIs is preserved. Even though this behavior has plenty of advantages, it has at least one disadvantage: instance creation has a performance overhead. This is especially problematic when many URI components have to be modified at the same time, because many objects are “wasted” on intermediate instantiations.

$uri1 = Uri\Rfc3986\Uri::parse("http://example.com");
 
$uri2 = $uri1
    ->withScheme("https")
    ->withHost("example.net")
    ->withPath("/foo/bar");                // This creates 3 objects altogether!

Besides its suboptimal performance, another drawback of the current wither-based solution is that URI creation from scratch is not possible: one always has to create a valid URI first. The empty string is a valid RFC 3986 URI, which may make it seem like a good initial URI for URI building, but unfortunately it is not a valid WHATWG URL. In any case, the success of some transformations depends on the current state (which is a form of temporal coupling):

$uri1 = Uri\Rfc3986\Uri::parse("");
 
$uri2 = $uri1
    ->withScheme("https")
    ->withUserInfo("user:pass")            // throws Uri\InvalidUriException: Cannot set a userinfo without having a host
    ->withHost("example.com");
 
$uri2 = $uri1
    ->withScheme("https")
    ->withHost("example.com")
    ->withUserInfo("user:pass");           // No exception is thrown

In order to provide a more ergonomic and efficient solution for URI building, a fluent API is proposed that implements the Builder pattern.

$uriBuilder = new Uri\Rfc3986\UriBuilder()
    ->setScheme("https")
    ->setUserInfo("user:pass")
    ->setHost("example.com")
    ->setPort(8080)
    ->setPath("/foo/bar")
    ->setQuery("a=1&b=2")
    ->setQueryParams(Uri\Rfc3986\UriQueryParams::fromArray(["a" => 1, "b" => 2])) // Has the same effect as the setQuery() call above
    ->setFragment("section1");
 
$uri = $uriBuilder->build();               // URI instance creation is only done at this point
 
echo $uri->toRawString();                  // https://user:pass@example.com:8080/foo/bar?a=1&b=2#section1

The same works for WHATWG URL:

$urlBuilder = new Uri\WhatWg\UrlBuilder()
    ->setScheme("https")
    ->setUsername("user")
    ->setPassword("pass")
    ->setHost("example.com")
    ->setPort(8080)
    ->setPath("/foo/bar")
    ->setQuery("a=1&b=2")
    ->setQueryParams(Uri\WhatWg\UrlQueryParams::fromArray(["a" => 1, "b" => 2])) // Has the same effect as the setQuery() call above
    ->setFragment("section1");
 
$url = $urlBuilder->build();               // URL instance creation is only done at this point
 
echo $url->toAsciiString();                // https://user:pass@example.com:8080/foo/bar?a=1&b=2#section1

When a Builder instance is not instantiated by ourselves or a trusted party, one cannot be sure whether it already has any components set. Therefore, it's highly recommended to clear the instance state before usage:

function buildUri(Uri\Rfc3986\UriBuilder $builder): void
{
    // Was there any component set before?
 
    $builder->clear();
 
    // Further usage is safe now...
}
 
function buildUrl(Uri\WhatWg\UrlBuilder $builder): void
{
    // Was there any component set before?
 
    $builder->clear();
 
    // Further usage is safe now...
}

The clear() method also comes in handy when the same Builder instance is reused to instantiate multiple URIs/URLs in a row.

The complete class signatures to be added are the following:

namespace Uri\Rfc3986 {
    final class UriBuilder
    {
        public function __construct() {}
 
        public function clear(): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setScheme(?string $scheme): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setUserInfo(#[\SensitiveParameter] ?string $userInfo): static {}

        /**
         * @throws Uri\InvalidUriException
         */
        public function setHost(?string $host): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setPath(string $path): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setPathSegments(array $segments): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setQuery(?string $query): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setQueryParams(\Uri\Rfc3986\UriQueryParams $queryParams): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function setFragment(?string $fragment): static {}
 
        /**
         * @throws Uri\InvalidUriException
         */
        public function build(?\Uri\Rfc3986\Uri $baseUrl = null): \Uri\Rfc3986\Uri {}
    }
}
namespace Uri\WhatWg {
    final class UrlBuilder
    {
        public function __construct() {}
 
        public function clear(): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setScheme(?string $scheme): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setUsername(?string $username): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setPassword(#[\SensitiveParameter] ?string $password): static {}

        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setHost(?string $host): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setPath(string $path): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setPathSegments(array $segments): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setQuery(?string $query): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setQueryParams(\Uri\WhatWg\UrlQueryParams $queryParams): static {}
 
        /**
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function setFragment(?string $fragment): static {}
 
        /**
         * @param array $errors
         * @throws Uri\WhatWg\InvalidUrlException
         */
        public function build(?\Uri\WhatWg\Url $baseUrl = null, &$errors = null): \Uri\WhatWg\Url {}
    }
}

The builder objects would perform validation in two places:

- Validation of pure component syntax: The individual setter methods would immediately validate whether the input is syntactically correct. For example, the scheme component cannot contain percent-encoded octets, therefore the setScheme() method would throw whenever a “%” character is encountered.
- Validation of global state: There are a few validation rules that depend on the “global state”. For example, RFC 3986 requires the host component to be present when the userinfo is set. Any such validations would be delayed until the build() method call to avoid the problem with temporal coupling that was mentioned at the beginning of the section.

An example for component syntax validation:

$uriBuilder = new Uri\Rfc3986\UriBuilder()
    ->setScheme("http%80");                // Throws a Uri\InvalidUriException because the scheme is not well formed

An example for validation of the global state:

$uriBuilder = new Uri\Rfc3986\UriBuilder()
    ->setScheme("https")
    ->setUserInfo("user:pass");            // Doesn't throw an exception yet
 
$uri = $uriBuilder->build();               // Throws a Uri\InvalidUriException because the host is not present, but the userinfo is
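
The two validation stages described above could be approximated in userland roughly as follows. This is only a minimal sketch under stated assumptions: the SketchUriBuilder class, its simplified scheme check, and the use of InvalidArgumentException are hypothetical stand-ins and not the proposed native implementation. It only illustrates that setters check the syntax of a single component immediately, while cross-component rules are checked in build(), so the order of the setter calls does not matter.

```php
<?php
declare(strict_types=1);

// Hypothetical userland sketch; SketchUriBuilder is NOT the proposed native class.
final class SketchUriBuilder
{
    private ?string $scheme = null;
    private ?string $userInfo = null;
    private ?string $host = null;

    public function setScheme(?string $scheme): static
    {
        // Stage 1: the syntax of a single component is checked immediately
        // (simplified RFC 3986 scheme grammar: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )).
        if ($scheme !== null && preg_match('/^[A-Za-z][A-Za-z0-9+.\-]*\z/', $scheme) !== 1) {
            throw new InvalidArgumentException("The scheme is not well formed");
        }
        $this->scheme = $scheme;
        return $this;
    }

    public function setUserInfo(?string $userInfo): static
    {
        // No cross-component check here, so the call order does not matter.
        $this->userInfo = $userInfo;
        return $this;
    }

    public function setHost(?string $host): static
    {
        $this->host = $host;
        return $this;
    }

    public function build(): string
    {
        // Stage 2: rules that depend on several components are checked only now.
        if ($this->userInfo !== null && $this->host === null) {
            throw new InvalidArgumentException("Cannot set a userinfo without having a host");
        }
        $uri = $this->scheme !== null ? $this->scheme . ":" : "";
        if ($this->host !== null) {
            $uri .= "//" . ($this->userInfo !== null ? $this->userInfo . "@" : "") . $this->host;
        }
        return $uri;
    }
}

// The userinfo is set before the host, and no exception is thrown.
$builder = (new SketchUriBuilder())
    ->setScheme("https")
    ->setUserInfo("user:pass")
    ->setHost("example.com");

echo $builder->build(), "\n";              // https://user:pass@example.com
```

For brevity, the sketch returns a string rather than a URI object and only covers three components.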

Design considerations

Builder design pattern

Why is a complex Builder pattern based approach proposed instead of a much simpler Factory Method based one? The factory method could be as simple as the following:

namespace Uri\Rfc3986 {
    final readonly class Uri
    {
        ...
 
        public static function fromComponents(
            ?string $scheme = null, ?string $host = null, string $path = "",
            ?string $userInfo = null, ?string $queryString = null, ?string $fragment = null
        ) {}
 
        ...
    }
}
 
namespace Uri\WhatWg {
    final readonly class Url
    {
        ...
 
        public static function fromComponents(
            string $scheme, ?string $host = "", string $path = "",
            ?string $username = null, ?string $password = null,
            ?string $queryString = null, ?string $fragment = null
        ) {}
 
        ...
    }
}

The current RFC proposes the Builder pattern based approach because of its flexibility: it makes it possible to add more convenience methods in the future. In fact, the setQueryParams() method, which expects a query parameter list object instead of the query string representation, is already one such method.

Dedicated classes

This RFC proposes a dedicated Builder class for both RFC 3986 and WHATWG URL, instead of a single, unified implementation with 2 build() methods (e.g. buildUri() and buildUrl()). This decision has the following reasons:

Mutability

The proposed classes are mutable in order to avoid the performance overhead that cloning before each modification (thus an immutable behavior) would cause. If it turns out that the performance overhead is possible to optimize away, then this design decision can be reevaluated.

Setter naming convention

Setter methods of the UriBuilder and UrlBuilder classes follow the naming convention which is already widespread among internal functions: they use a set prefix, e.g. setScheme(), setHost(). The current RFC rejects the usage of any other naming convention, most notably the omission of the set prefix (e.g. scheme(), host()) due to the following reasons:

Voting

Add URI building support as outlined in the RFC?
Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

Query Parameter Manipulation

Query parameter manipulation is an integral part of URI handling. WHATWG URL even dedicates a separate section to the URLSearchParams class, which implements advanced query parameter handling. Unfortunately, RFC 3986 doesn't define any such capability, so ultimately both proposed classes closely follow the design of the WHATWG URL specification.

Therefore, the following classes and methods are proposed for addition:

namespace Uri\Rfc3986 {
    final class UriQueryParams implements \Countable
    {
        public static function parseRfc3986(string $queryString): ?\Uri\Rfc3986\UriQueryParams {}
 
        public static function parseFormData(string $queryString): \Uri\Rfc3986\UriQueryParams {}
 
        public static function fromArray(array $queryParams): \Uri\Rfc3986\UriQueryParams {}
 
        public function __construct() {}
 
        public function append(string $name, mixed $value): static {}
 
        public function delete(string $name): static {}
 
        public function deleteValue(string $name, mixed $value): static {}
 
        public function has(string $name): bool {}
 
        public function hasValue(string $name, mixed $value): bool {}
 
        public function getFirst(string $name): ?string {}
 
        public function getLast(string $name): ?string {}
 
        public function getAll(?string $name = null): array {}
 
        public function count(): int {}
 
        public function set(string $name, mixed $value): static {}
 
        public function sort(): static {}
 
        public function toRfc3986String(): string {}
 
        public function toFormDataString(): string {}
 
        public function __serialize(): array {}
 
        public function __unserialize(array $data): void {}
 
        public function __debugInfo(): array {}
    }
    final readonly class Uri
    {
        ...
 
        public function getRawQueryParams(): ?\Uri\Rfc3986\UriQueryParams {}
 
        public function getQueryParams(): ?\Uri\Rfc3986\UriQueryParams {}
 
        public function withQueryParams(?\Uri\Rfc3986\UriQueryParams $queryParams): static {}
 
        ...
    }
}
namespace Uri\WhatWg {
    final class UrlQueryParams implements \Countable
    {
        public static function parse(string $queryString): \Uri\WhatWg\UrlQueryParams {}
 
        public static function fromArray(array $queryParams): \Uri\WhatWg\UrlQueryParams {}
 
        public function __construct() {}
 
        public function append(string $name, mixed $value): static {}
 
        public function delete(string $name): static {}
 
        public function deleteValue(string $name, mixed $value): static {}
 
        public function has(string $name): bool {}
 
        public function hasValue(string $name, string $value): bool {}
 
        public function getFirst(string $name): ?string {}
 
        public function getLast(string $name): ?string {}
 
        public function getAll(?string $name = null): array {}
 
        public function count(): int {}
 
        public function set(string $name, mixed $value): static {}
 
        public function sort(): static {}
 
        public function toString(): string {}
 
        public function __serialize(): array {}
 
        public function __unserialize(array $data): void {}
 
        public function __debugInfo(): array {}
    }
    final readonly class Url
    {
        ...
 
        public function getQueryParams(): ?\Uri\WhatWg\UrlQueryParams {}
 
        public function withQueryParams(?\Uri\WhatWg\UrlQueryParams $queryParams): static {}
 
        ...
    }
}

Construction

UriQueryParams supports the following methods for instantiation:

UrlQueryParams supports the following methods for instantiation:

$params = Uri\Rfc3986\UriQueryParams::parse("abc=foo&abc=bar"); // Successful instantiation
$params = Uri\Rfc3986\UriQueryParams::fromArray(
    [
        ["abc" => "foo"],
        ["abc" => "bar"],
    ]
);                                                              // Successful instantiation - same result as above
 
$params = new Uri\Rfc3986\UriQueryParams();                     // Successful instantiation - creates an empty query parameter list
 
$params = Uri\WhatWg\UrlQueryParams::parse("abc=foo&abc=bar");  // Successful instantiation
$params = Uri\WhatWg\UrlQueryParams::fromArray(
    [
        ["abc" => "foo"],
        ["abc" => "bar"],
    ]
);                                                              // Successful instantiation - same result as above
 
$params = new Uri\WhatWg\UrlQueryParams();                      // Successful instantiation - creates an empty query parameter list

It is also possible to create a UriQueryParams or UrlQueryParams instance from a Uri\Rfc3986\Uri or a Uri\WhatWg\Url object, respectively:

$uri = new Uri\Rfc3986\Uri("https://example.com/?foo=bar");
 
$params = $uri->getRawQueryParams();       // Creates a Uri\Rfc3986\UriQueryParams instance
$params = $uri->getQueryParams();          // Creates a Uri\Rfc3986\UriQueryParams instance
 
$url = new Uri\WhatWg\Url("https://example.com/?foo=bar");
 
$params = $url->getQueryParams();          // Creates a Uri\WhatWg\UrlQueryParams instance

The difference between Uri\Rfc3986\Uri::getRawQueryParams() and Uri\Rfc3986\Uri::getQueryParams() is that the former passes the “raw” (non-normalized) query string as input when instantiating Uri\Rfc3986\UriQueryParams.

The Uri\Rfc3986\Uri::getRawQueryParams(), Uri\Rfc3986\Uri::getQueryParams(), Uri\WhatWg\Url::getQueryParams() methods return null if the query string is missing (e.g. https://example.com/), and an empty query parameter list is returned if the query string is empty (e.g. https://example.com/?).

$uri = new Uri\Rfc3986\Uri("https://example.com/");
var_dump($uri->getRawQueryParams());       // NULL
var_dump($uri->getQueryParams());          // NULL
 
$uri = new Uri\Rfc3986\Uri("https://example.com/?");
var_dump($uri->getRawQueryParams());       // A new Uri\Rfc3986\UriQueryParams containing zero items
var_dump($uri->getQueryParams());          // A new Uri\Rfc3986\UriQueryParams containing zero items

The same example with Uri\WhatWg\UrlQueryParams:

$url = new Uri\WhatWg\Url("https://example.com/");
var_dump($url->getQueryParams());          // NULL
 
$url = new Uri\WhatWg\Url("https://example.com/?");
var_dump($url->getQueryParams());          // A new Uri\WhatWg\UrlQueryParams containing zero items

It's important to note that neither UriQueryParams nor UrlQueryParams validates the query parameters during construction. This behavior is by design: WHATWG URL's URLSearchParams class is meant to be lenient when reading, and UriQueryParams and UrlQueryParams follow the same principle. Validation still happens when the serialized query parameters are written to a URI (via Uri\Rfc3986\Uri::withQueryParams() and Uri\WhatWg\Url::withQueryParams()). However, invalid characters are automatically percent-encoded during recomposition, which is done under the hood when calling withQueryParams(), before the new query is validated.

$uri = new Uri\Rfc3986\Uri("https://example.com/");
 
$params = Uri\Rfc3986\UriQueryParams::parse("#foo=bar"); // Parses an invalid parameter name "#foo"
 
$uri = $uri->withQueryParams($params);                   // Success: the query is automatically percent-encoded to "%23foo=bar"

The same example with Uri\WhatWg\UrlQueryParams:

$url = new Uri\WhatWg\Url("https://example.com/");
 
$params = Uri\WhatWg\UrlQueryParams::parse("#foo=bar");  // Parses an invalid parameter name "#foo"
 
$url = $url->withQueryParams($params);                   // Success: the query is automatically percent-encoded to "%23foo=bar"

Neither the parse(), nor the fromArray() factory methods can fail in practice: they only have memory-related failure cases which are handled by the PHP engine as a fatal error.

According to the WHATWG URL algorithm, the leading “?” character is removed during parsing. As opposed to this behavior, the leading “?” becomes part of the first query parameter name for RFC 3986 query params.

$params = Uri\Rfc3986\UriQueryParams::parse("?abc=foo");
 
// $params internally contains the ["?abc" => "foo"] key-value pair
 
$params = Uri\WhatWg\UrlQueryParams::parse("?abc=foo");
 
// $params internally contains the ["abc" => "foo"] key-value pair

Another difference between the two classes is how they parse percent-encoded characters. While UriQueryParams doesn't transform any of the input, UrlQueryParams percent-decodes it automatically as per the WHATWG URL specification:

$params = Uri\Rfc3986\UriQueryParams::parse("foo%5B%5D=b%61r"); // Percent-encoded form of "foo[]=bar"
 
// $params internally contains the ["foo%5B%5D" => "b%61r"] key-value pair
 
$params = Uri\WhatWg\UrlQueryParams::parse("foo%5B%5D=b%61r");  // Percent-encoded form of "foo[]=bar"
 
// $params internally contains the ["foo[]" => "bar"] key-value pair

Parameter Retrieval

The has() and hasValue() methods can be used to find out if a parameter exists:

$params = Uri\Rfc3986\UriQueryParams::parse("foo=bar&baz=qux&baz=baz");
 
echo $params->has("baz");                 // true
echo $params->has("non-existent");        // false
 
echo $params->hasValue("foo", "bar");     // true 
echo $params->hasValue("foo", "baz");     // false

The has() method returns true if there is at least one parameter in the parameter list with the given name, false otherwise. On the other hand, hasValue() returns true if the given name and value both match at least one parameter; otherwise it returns false.

The number of query parameters can be retrieved by calling the count() method:

$params = Uri\Rfc3986\UriQueryParams::parse("foo=bar&baz=qux&baz=baz");
 
echo $params->count();                 // 3

There are also a number of methods that can return a query parameter or an array of query parameters:

$params = Uri\Rfc3986\UriQueryParams::parse("foo=bar&foo=baz&qux=quux");
 
echo $params->getFirst("foo");            // bar
echo $params->getFirst("non-existent");   // null
 
echo $params->getLast("foo");             // baz
echo $params->getLast("non-existent");    // null
 
echo $params->getAll("foo");              // [["foo", "bar"], ["foo", "baz"]]
echo $params->getAll("non-existent");     // []
 
echo $params->getAll(null);               // [["foo", "bar"], ["foo", "baz"], ["qux", "quux"]]
echo $params->getAll();                   // [["foo", "bar"], ["foo", "baz"], ["qux", "quux"]]

All these methods return the natively stored values without applying any transformations. That is, neither percent-encoding nor percent-decoding is applied to the input or to the output.

$params = Uri\Rfc3986\UriQueryParams::parse("foo%5B%5D=b%61r");  // Percent-encoded form of "foo[]=bar"
 
echo $params->getFirst("foo%5B%5D");     // null
echo $params->getFirst("foo[]");         // bar
 
echo $params->getLast("foo%5B%5D");      // null
echo $params->getLast("foo[]");          // bar
 
echo $params->getAll("foo%5B%5D");       // []
echo $params->getAll("foo[]");           // [["foo[]" => "bar"]]

As mentioned in the previous section, UrlQueryParams automatically performs percent-decoding during the parse() method call, so it's only possible to retrieve parameters containing percent-encoded code points if the class is instantiated via the fromArray() method, or if new items are added to the query parameter list after construction.

$params = Uri\WhatWg\UrlQueryParams::fromArray(
    [
        ["foo%5B%5D" => "b%61r"],        // Percent-encoded form of "foo[]=bar"
    ],
);
 
echo $params->getFirst("foo%5B%5D");     // b%61r
echo $params->getFirst("foo[]");         // null
 
echo $params->getLast("foo%5B%5D");      // b%61r
echo $params->getLast("foo[]");          // null
 
echo $params->getAll("foo%5B%5D");       // [["foo%5B%5D" => "b%61r"]]
echo $params->getAll("foo[]");           // []

Percent-Encoding and Decoding

UriQueryParams and UrlQueryParams each have a distinct way of percent-encoding and decoding, which is mostly similar to the behavior of RFC 3986 URIs and WHATWG URLs, respectively, but doesn't work quite the same way. This section discusses the specific details.

UriQueryParams builds upon the uriparser library just like RFC 3986 URIs do. Uriparser has its custom query parameter list implementation that follows RFC 1866 in the absence of any clarification in RFC 3986 about how this component should be processed apart from a description of the basic syntax. According to RFC 1866, space characters are replaced by the plus character (+) during percent-encoding, and any characters that fall outside of the unreserved character set are percent-encoded just like how RFC 3986 does so. Percent-decoding inverts these operations.

This behavior clearly deviates from the percent-encoding rules of the query component of RFC 3986 which allows quite a few reserved characters to be present without percent-encoding (a few examples: “:”, “@”, “?”, “/”), not to mention the difference in how the space character is handled.

On the other hand, UrlQueryParams relies on the URLSearchParams class specified by WHATWG URL, which in turn builds upon the application/x-www-form-urlencoded media type for historical reasons, albeit slightly differently than how RFC 1866 specifies it. As usual, WHATWG URL defines a dedicated percent-encode set:

The application/x-www-form-urlencoded percent-encode set contains all code points, except the ASCII alphanumeric, U+002A (*), U+002D (-), U+002E (.), and U+005F (_).

Also, a dedicated algorithm for “serialization” is defined (in this context, serialization means recomposition - converting the list to a string): the space code point is percent-encoded as the plus code point (+), and the rest of the code points in the percent-encoding set are encoded how WHATWG URL normally does so.

This behavior deviates from the percent-encoding rules of the query component of WHATWG URL, as the query percent-encode set contains far fewer characters, and the space code point is handled differently again.

It's also important to compare how the percent-encoding rules of UriQueryParams and UrlQueryParams differ. They handle the asterisk (*) and the tilde (~) differently: UriQueryParams percent-encodes the asterisk, but UrlQueryParams doesn't, whereas UriQueryParams leaves the tilde as-is, but UrlQueryParams percent-encodes it.
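
The difference between the two rule sets can be illustrated with a small userland approximation. Both helper functions below are hypothetical and only follow the rules as they are described in this section (unreserved characters kept and space mapped to "+" on the one side, the application/x-www-form-urlencoded percent-encode set on the other); they are not the uriparser or WHATWG implementations themselves and operate on raw bytes.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: RFC 1866 style encoding as described above
// (RFC 3986 unreserved characters are kept, space becomes "+").
function sketchEncodeRfc1866(string $value): string
{
    // rawurlencode() keeps A-Z a-z 0-9 - _ . ~ and encodes a space as %20.
    return str_replace("%20", "+", rawurlencode($value));
}

// Hypothetical helper: WHATWG application/x-www-form-urlencoded rules
// (ASCII alphanumerics, "*", "-", ".", "_" are kept, space becomes "+").
function sketchEncodeWhatWgForm(string $value): string
{
    return preg_replace_callback(
        '/[^A-Za-z0-9*\-._]/',
        static fn (array $m): string => $m[0] === " " ? "+" : sprintf("%%%02X", ord($m[0])),
        $value
    );
}

echo sketchEncodeRfc1866("a b*~"), "\n";    // a+b%2A~
echo sketchEncodeWhatWgForm("a b*~"), "\n"; // a+b*%7E
```

Only the asterisk and the tilde are treated differently in this input; the space is mapped to "+" by both helpers.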

Even though it follows directly from the percent-encoding definition, it may still be easy to overlook that the application/x-www-form-urlencoded media type also percent-encodes “%” itself, even when it is part of an existing percent-encoded octet. This is counterintuitive (normally, RFC 3986 and WHATWG URL do not percent-encode “%” twice) and quite unsafe, because it leads to double encoding.

$params = Uri\Rfc3986\UriQueryParams::parse("foo=b%61r");
echo $params->toString();                                 // foo=b%2561r
 
$params = Uri\WhatWg\UrlQueryParams::fromArray(
    [
        ["foo" => "b%61r"],
    ]
);
 
echo $params->toString();                                 // foo=b%2561r

Surprising as it is, the toString() method percent-encodes “%” itself (thus “%” becomes “%25” first, and then “61r” is appended), rather than leaving the already percent-encoded octet alone.
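
The same effect can be reproduced today with PHP's existing urlencode() and urldecode() functions, which also treat “%” as an ordinary character to be encoded. The snippet below is only an illustration of the double-encoding hazard with these existing functions and makes no claim about how the proposed classes are implemented: a single decoding step no longer yields the originally intended value.

```php
<?php
declare(strict_types=1);

$value = "b%61r";                          // Already percent-encoded form of "bar"

$encodedOnce = urlencode($value);          // "%" itself is encoded to "%25"
$decodedOnce = urldecode($encodedOnce);    // Only one encoding layer is removed

echo $encodedOnce, "\n";                   // b%2561r
echo $decodedOnce, "\n";                   // b%61r
echo urldecode($decodedOnce), "\n";        // bar
```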

Recomposition

In order to be consistent with the design of Uri\Rfc3986\Uri and the Uri\WhatWg\Url classes, neither UriQueryParams, nor UrlQueryParams have a __toString() magic method. Instead, they contain a custom toString() method that recomposes the query string from the parameters.

$params = Uri\Rfc3986\UriQueryParams::parse("foo=bar&foo=baz");
echo $params->toString();                // foo=bar&foo=baz
 
$params = Uri\WhatWg\UrlQueryParams::parse("foo=bar&foo=baz");
echo $params->toString();                // foo=bar&foo=baz

Both Uri\Rfc3986\UriQueryParams::toString() and Uri\WhatWg\UrlQueryParams::toString() automatically percent-encode the output according to the rules outlined in the previous section.

$params = Uri\Rfc3986\UriQueryParams::fromArray([["foo[]" => "bar baz"]]);
echo $params->toString();                // foo%5B%5D=bar+baz
 
$params = Uri\WhatWg\UrlQueryParams::fromArray([["foo[]" => "bar baz"]]);
echo $params->toString();                // foo%5B%5D=bar+baz

Unlike Uri\Rfc3986\Uri, the Uri\Rfc3986\UriQueryParams class doesn't have a toRawString() method, because its name would be misleading: it could not really provide a “raw” representation of the query string, since automatic percent-encoding must happen anyway to make the produced query string valid.

Relation to the query component

After learning about the details of the percent-encoding and decoding behavior of UriQueryParams and UrlQueryParams, it should be clarified how the new classes interoperate with the existing Uri\Rfc3986\Uri and Uri\WhatWg\Url classes.

The short answer is they won't have 100% compatibility. But let's see an example where things can go wrong:

$uri = new Uri\Rfc3986\Uri("https://example.com?foo=a b");
$params = $uri->getQueryParams();
$uri = $uri->withQueryParams($params);
 
echo $uri->getQuery();                     // foo=a+b

The above example illustrates how the differing percent-encoding mechanisms of Uri\Rfc3986\Uri and Uri\Rfc3986\UriQueryParams affect the results: the original “foo=a b” query component is percent-encoded to “foo=a+b” during the $uri->withQueryParams($params) call. That's why the workflow is not round-trippable. Uri\WhatWg\UrlQueryParams and Uri\WhatWg\Url have the very same problem, and this behavior is even part of the WHATWG URL specification itself.

Modification

The append() method can be used to append a parameter to the end of the list. As usual, the same query parameter can be added multiple times:

$params = Uri\Rfc3986\UriQueryParams::parse("foo=bar");
$params->append("baz", "qux");
$params->append("baz", "qaz");             // Appends "baz" twice
 
echo $params->toString();                  // foo=bar&baz=qux&baz=qaz

Updating a parameter is possible via the set() method:

$params = Uri\Rfc3986\UriQueryParams::parse("foo=bar&foo=baz");
$params->set("foo", "baz");                // Overwrites the first item "foo", and removes the second one
$params->set("qux", "qaz");                // Appends a new item "qux"
 
echo $params->toString();                  // foo=baz&qux=qaz

Actually, the set() method has a hybrid behavior: if a parameter is not present in the list, then it adds it just like append() does. Otherwise, it overwrites the first item, and removes the rest of the occurrences.
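
The hybrid behavior described above can be sketched on a plain list of name-value pairs. The sketchSetParam() function is a hypothetical illustration of these semantics only; it is not part of the proposal and ignores type mapping and encoding.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper mirroring the set() semantics described above.
 *
 * @param list<array{string, string}> $pairs
 * @return list<array{string, string}>
 */
function sketchSetParam(array $pairs, string $name, string $value): array
{
    $result = [];
    $found = false;

    foreach ($pairs as [$currentName, $currentValue]) {
        if ($currentName !== $name) {
            $result[] = [$currentName, $currentValue]; // Unrelated parameters are kept
        } elseif (!$found) {
            $result[] = [$name, $value];               // The first occurrence is overwritten in place
            $found = true;
        }
        // Any further occurrence of $name is dropped
    }

    if (!$found) {
        $result[] = [$name, $value];                   // Not present yet: behaves like append()
    }

    return $result;
}

$pairs = [["foo", "bar"], ["foo", "baz"]];
$pairs = sketchSetParam($pairs, "foo", "baz");
$pairs = sketchSetParam($pairs, "qux", "qaz");

echo json_encode($pairs), "\n";                        // [["foo","baz"],["qux","qaz"]]
```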

Neither append() nor set() performs any percent-encoding or decoding of its arguments. This was never a question for RFC 3986 (because it never does so), but WHATWG URL usually post-processes its input automatically.

$params = Uri\WhatWg\UrlQueryParams::parse("");
$params->append("foo%5B%5D", "ab%63");     // Percent-encoded form of "foo[]=abc"
$params->set("bar%5B%5D", "de%66");        // Percent-encoded form of "bar[]=def"
 
echo $params->toString();                  // foo%255B%255D=ab%2563&bar%255B%255D=de%2566

As can be seen, percent-encoded octets received during the append() and set() method calls were double-encoded in the final output, as warned in the Percent-Encoding and Decoding section. This means that foo%5B%5D and the rest of the input were accepted as-is, and then percent-encoded again during the toString() call.

Removing parameters is possible via either the delete() or the deleteValue() method: the former one removes all occurrences of the given parameter name, while the latter one removes all occurrences of a parameter whose name and value both match the given ones, as demonstrated below:

$params = Uri\Rfc3986\UriQueryParams::parse("foo=bar&foo=baz&foo=qux");
$params->deleteValue("foo", "baz");    // Deletes the "foo=baz" parameter
$params->delete("foo");                    // Deletes the rest of the occurrences: "foo=bar" and "foo=qux"
$params->delete("non-existent");           // The parameter is not present: nothing happens

The last method that can modify the list is sort(), which sorts the parameters alphabetically:

$params = Uri\Rfc3986\UriQueryParams::parse("foo=bar&baz=qux&baz=baz");
$params->sort();
 
echo $params->toString();                  // baz=baz&baz=qux&foo=bar

Type support

Another important question is how non-string values are mapped. PHP's http_build_query() function (https://www.php.net/manual/en/function.http-build-query.php) can map basically any type to query params; however, the exact behavior is not specified by either RFC 3986 or WHATWG URL: RFC 3986 completely omits any information about how query parameters should be built, while WHATWG URL's URLSearchParams only accepts and returns string data.

The position of this RFC is that it's important to follow the road that http_build_query() has already paved because of better developer experience and better interoperability with the existing ecosystem. That's why the following type mapping behavior is proposed when a query parameter is added/updated:

The above conversion rules work for both UriQueryParams and UrlQueryParams. However, Uri\Rfc3986\UriQueryParams can additionally handle null values properly: a null input is mapped to a query component in which only the parameter name is present, while the “=” and the parameter value are omitted. On the other hand, Uri\WhatWg\UrlQueryParams converts null values to an empty string. Alternatively, it could omit parameters with null values completely, the same way as http_build_query() does.

$params = Uri\Rfc3986\UriQueryParams::parse("");
 
$params->append("param_null", null);
$params->append("param_bool", true);
$params->append("param_int", 123);
$params->append("param_float", 3.14);
 
var_dump($params->getFirst("param_null"));  // NULL
var_dump($params->getFirst("param_bool"));  // string(1) "1"
var_dump($params->getFirst("param_int"));   // string(3) "123"
var_dump($params->getFirst("param_float")); // string(4) "3.14"
 
echo $params->toString();                   // param_null&param_bool=1&param_int=123&param_float=3.14

Note how UrlQueryParams works differently with regard to null values:

$params = Uri\WhatWg\UrlQueryParams::parse("");
 
$params->append("param_null", null);
$params->append("param_bool", true);
$params->append("param_int", 123);
$params->append("param_float", 3.14);
 
var_dump($params->getFirst("param_null"));  // string(0) ""
var_dump($params->getFirst("param_bool"));  // string(1) "1"
var_dump($params->getFirst("param_int"));   // string(3) "123"
var_dump($params->getFirst("param_float")); // string(4) "3.14"
 
echo $params->toString();                   // param_null=&param_bool=1&param_int=123&param_float=3.14
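
For comparison, the snippet below shows how the already existing http_build_query() function treats the same scalar and null inputs. It only uses current PHP functionality; the $query variable name is just illustrative:

```php
<?php
// http_build_query() omits null values and converts the other scalars to strings.
$query = http_build_query([
    "param_null"  => null,
    "param_bool"  => true,
    "param_int"   => 123,
    "param_float" => 3.14,
]);

echo $query, "\n";                          // param_bool=1&param_int=123&param_float=3.14
```

The null parameter is dropped entirely here, which is the alternative behavior mentioned above for Uri\WhatWg\UrlQueryParams.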

Exact array and object casting rules are still to be decided.

Implemented Interfaces

The UriQueryParams and UrlQueryParams classes could implement the IteratorAggregate interface in theory. However, it's not possible to do so due to query components that share the same name, e.g.: param=foo&param=bar&param=baz. In this case, the same key (param) would be repeated 3 times, which cannot be properly supported with iterators.

Cloning

Cloning of UriQueryParams and UrlQueryParams is supported.

$params1 = Uri\Rfc3986\UriQueryParams::parse("foo=bar&foo=baz");
$params2 = clone $params1;
 
echo $params1->toString();               // foo=bar&foo=baz
echo $params2->toString();               // foo=bar&foo=baz

UrlQueryParams works the same way:

$params1 = Uri\WhatWg\UrlQueryParams::parse("foo=bar&foo=baz");
$params2 = clone $params1;
 
echo $params1->toString();                // foo=bar&foo=baz
echo $params2->toString();                // foo=bar&foo=baz

Serialization

Both classes are serializable and deserializable. The only implementation gotcha is that the serialized format is slightly unexpected: instead of recomposing the query params into a query string, the individual key-value pairs are serialized as an array. This is necessary because both toString implementations automatically percent-encode the input, so using these algorithms would skew the original data, not to mention the fact that Uri\WhatWg\UrlQueryParams::parse() performs automatic percent-decoding too.

Debugging

Both classes contain a __debugInfo() method that returns all items in the query parameter list in order to make debugging easier.

$params = Uri\Rfc3986\UriQueryParams::parse("foo=bar&foo=baz&foo=qux");
var_dump($params);
 
/*
object(Uri\Rfc3986\UriQueryParams)#1 (1) {
  ["params"]=> array(3) {
    [0]=>
    array(1) {
      ["foo"]=>
      string(3) "bar"
    }
    [1]=>
    array(1) {
      ["foo"]=>
      string(3) "baz"
    }
    [2]=>
    array(1) {
      ["foo"]=>
      string(3) "qux"
    }
  }
}
*/
 
$params = Uri\WhatWg\UrlQueryParams::parse("foo=bar&foo=baz&foo=qux");
var_dump($params);
 
/*
object(Uri\WhatWg\UrlQueryParams)#1 (1) {
  ["params"]=> array(3) {
    [0]=>
    array(1) {
      ["foo"]=>
      string(3) "bar"
    }
    [1]=>
    array(1) {
      ["foo"]=>
      string(3) "baz"
    }
    [2]=>
    array(1) {
      ["foo"]=>
      string(3) "qux"
    }
  }
}
*/

Vote

Add support for query parameter manipulation as outlined in the RFC?
Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

Accessing Path Segments as an Array

Sometimes, the individual path segments are needed rather than the whole path as a string. In this case, manually splitting the path into segments after retrieval is both inconvenient and disadvantageous performance-wise, especially considering the fact that Uri\Rfc3986\Uri internally stores the path as a list of segments.

That's why the following methods are proposed to be added:

namespace Uri\Rfc3986 {
    final readonly class Uri
    {
        ...
 
        public function getRawPathSegments(): array {}
 
        public function getPathSegments(): array {}
 
        public function withPathSegments(array $segments, bool $addLeadingSlashForNonEmptyRelativeUri = true): static {}
 
        ...
    }
}
namespace Uri\WhatWg {
    final readonly class Url
    {
        ...
 
        public function getPathSegments(): array {}
 
        public function withPathSegments(array $segments): static {}
 
        ...
    }
}

This way, it is possible to write the following code:

$uri = new Uri\Rfc3986\Uri("https://example.com/foo/bar/baz");
$segments = $uri->getPathSegments();        // ["foo", "bar", "baz"]
 
$uri = $uri->withPathSegments(["a", "b"]);
echo $uri->getPath();                       // /a/b

The same also works for WHATWG URL:

$url = new Uri\WhatWg\Url("https://example.com/foo/bar/baz");
$segments = $url->getPathSegments();        // ["foo", "bar", "baz"]
 
$url = $url->withPathSegments(["a", "b"]);
echo $url->getPath();                       // /a/b

To better understand why and how exactly this functionality works, we should take a closer look at how RFC 3986 defines the path and path segments: according to the specification, path segments start after the leading “/” in the path due to the following ABNF rule:

path-abempty  = *( "/" segment )

That is, the path-abempty syntax only applies in case of URIs containing an authority component, and it declares that the path is either empty, or contains a “/” followed by a segment one or multiple times. Then segments have the following syntax:

segment       = *pchar
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

That is, segments are composed of zero or multiple characters in the “pchar” charset (the exact values don't matter in this case). It should be mentioned that there are some additional special-case segment syntaxes (they are marked with segment-nz and segment-nz-nc in the ABNF syntax), but let's disregard them now for ease of understanding.

The above definitions imply that an empty path has zero segments:

$uri = new Uri\Rfc3986\Uri("https://example.com");
$segments = $uri->getPathSegments();        // []

When the path consists of a leading “/” and a string matching the segment syntax (e.g. /foo), the path has one segment:

$uri = new Uri\Rfc3986\Uri("https://example.com/foo");
$segments = $uri->getPathSegments();        // ["foo"]

We can easily see based on the above example that the URI https://example.com/ also has a single segment - but it's empty:

$uri = new Uri\Rfc3986\Uri("https://example.com/");
$segments = $uri->getPathSegments();        // [""]

This is perfectly valid, because segments can be empty (at least in the above case when the URI has an authority). Another interesting question is how segments are represented when the path has a trailing slash (e.g. /foo/). Consistent with the above rules, the result is the following:

$uri = new Uri\Rfc3986\Uri("https://example.com/foo/");
$segments = $uri->getPathSegments();        // ["foo", ""]
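
The rules demonstrated so far can be approximated in userland with a few lines of code. The following is a minimal sketch, assuming the path belongs to a URI that has an authority (so the path is either empty or starts with “/”); the splitPathIntoSegmentsSketch() helper is hypothetical and is not part of the proposed API:

```php
<?php
// Hypothetical helper that mirrors the path-abempty rule: *( "/" segment )
function splitPathIntoSegmentsSketch(string $path): array
{
    if ($path === "") {
        return [];                       // An empty path has zero segments
    }

    if ($path[0] === "/") {
        $path = substr($path, 1);        // Segments start after the leading "/"
    }

    return explode("/", $path);          // Every further "/" starts a new (possibly empty) segment
}

var_dump(splitPathIntoSegmentsSketch("") === []);                  // bool(true)
var_dump(splitPathIntoSegmentsSketch("/") === [""]);               // bool(true)
var_dump(splitPathIntoSegmentsSketch("/foo/") === ["foo", ""]);    // bool(true)
```

The sketch works on the raw path string only, so it doesn't perform any percent-decoding or normalization.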

A few other special cases are also collected below:

The above described behavior satisfies the definitions of RFC 3986. However, one case needs disambiguation in relation to the withPathSegments() method: “/foo” vs “foo”.

That's why Uri\Rfc3986\Uri::withPathSegments() has a second parameter, $addLeadingSlashForNonEmptyRelativeUri, which can be used to decide if a relative reference should become an absolute-path or a relative-path reference:

$uri = new Uri\Rfc3986\Uri("/foo");            // absolute-path reference
 
$uri = $uri->withPathSegments(["bar"], false); // The leading slash is not prepended
 
echo $uri->getPath();                          // bar
 
$uri = new Uri\Rfc3986\Uri("foo");             // relative-path reference
 
$uri = $uri->withPathSegments(["bar"], true);  // The leading slash is prepended
 
echo $uri->getPath();                          // /bar

The $addLeadingSlashForNonEmptyRelativeUri parameter only has an effect when the URI is a relative reference and the first path segment is not empty; all other cases are unambiguous.

Uri\Rfc3986\Uri::withPathSegments() and Uri\WhatWg\Url::withPathSegments() internally concatenate the input segments separated by a / character, and then trigger Uri\Rfc3986\Uri::withPath() and Uri\WhatWg\Url::withPath(), respectively.
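
The concatenation step can be sketched in userland as follows. This is a simplified illustration only: the joinPathSegmentsSketch() helper and its $addLeadingSlash parameter are hypothetical, and the sketch ignores the authority-related special cases that the real implementation has to handle:

```php
<?php
// Hypothetical helper: joins the segments with "/" and optionally prepends a leading slash.
function joinPathSegmentsSketch(array $segments, bool $addLeadingSlash = true): string
{
    if ($segments === []) {
        return "";                               // Zero segments mean an empty path
    }

    $path = implode("/", $segments);

    return $addLeadingSlash ? "/" . $path : $path;
}

echo joinPathSegmentsSketch(["a", "b"]), "\n";       // /a/b
echo joinPathSegmentsSketch(["bar"], false), "\n";   // bar
echo joinPathSegmentsSketch(["foo", ""]), "\n";      // /foo/
```

The resulting string would then be passed to the respective withPath() method, which is where validation takes place.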

Add support for accessing path segments as an array as outlined in the RFC?
Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

Host Type Detection

Both the RFC 3986 and WHATWG URL specifications distinguish different types of the host component because each of them has different parsing and formatting rules. Probably the most notable example is the IPv6 host type, which requires the IPv6 address to be written between a [ and ] pair.

In order to support returning information about the host type, the following enums and methods are proposed to be added:

namespace Uri\Rfc3986 {
    enum UriHostType
    {
        case IPv4;
        case IPv6;
        case IPvFuture;
        case RegisteredName;
    }
 
    final readonly class Uri
    {
        ...
 
        public function getHostType(): ?\Uri\Rfc3986\UriHostType {}
 
        ...
    }
}
namespace Uri\WhatWg {
    enum UrlHostType
    {
        case IPv4;
        case IPv6;
        case Domain;
        case Opaque;
        case Empty;
    }
 
    final readonly class Url
    {
        ...
 
        public function getHostType(): ?\Uri\WhatWg\UrlHostType {}
 
        ...
    }
}

The new getHostType() methods return the type of the host component for both specifications:

$uri = new Uri\Rfc3986\Uri("https://192.168.0.1/");
var_dump($uri->getHostType());             // Uri\Rfc3986\UriHostType::IPv4
 
$uri = new Uri\Rfc3986\Uri("https://[2001:db8::1]/");
var_dump($uri->getHostType());             // Uri\Rfc3986\UriHostType::IPv6
 
$uri = new Uri\Rfc3986\Uri("https://[v1.1.2.3]/");
var_dump($uri->getHostType());             // Uri\Rfc3986\UriHostType::IPvFuture
 
$uri = new Uri\Rfc3986\Uri("https://example.com/");
var_dump($uri->getHostType());             // Uri\Rfc3986\UriHostType::RegisteredName
 
$uri = new Uri\Rfc3986\Uri("/foo/bar");
var_dump($uri->getHostType());             // null
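
Without the proposed method, the host type has to be guessed from the host string in userland. The following is a rough sketch of such a classification for RFC 3986 hosts, assuming the host was already extracted from a valid URI; the HostTypeSketch enum and the classifyHostSketch() function are hypothetical names, and the sketch doesn't validate the contents of IP literals:

```php
<?php
// Hypothetical userland counterpart of the proposed enum.
enum HostTypeSketch
{
    case IPv4;
    case IPv6;
    case IPvFuture;
    case RegisteredName;
}

function classifyHostSketch(string $host): HostTypeSketch
{
    // IP literals are enclosed in square brackets.
    if (str_starts_with($host, "[") && str_ends_with($host, "]")) {
        $literal = substr($host, 1, -1);

        // IPvFuture literals start with "v", one or more hex digits, and a ".".
        return preg_match('/^v[0-9a-f]+\./i', $literal) === 1
            ? HostTypeSketch::IPvFuture
            : HostTypeSketch::IPv6;
    }

    // filter_var() is an existing PHP function for validating IP addresses.
    if (filter_var($host, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false) {
        return HostTypeSketch::IPv4;
    }

    return HostTypeSketch::RegisteredName;
}

var_dump(classifyHostSketch("[2001:db8::1]") === HostTypeSketch::IPv6);   // bool(true)
```

Such heuristics duplicate work that the parser has already done, which is the main motivation for exposing the host type directly.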

The same for WHATWG URL:

$url = new Uri\WhatWg\Url("https://192.168.0.1/");
var_dump($url->getHostType());             // Uri\WhatWg\UrlHostType::IPv4
 
$url = new Uri\WhatWg\Url("https://[2001:db8::1]/");
var_dump($url->getHostType());             // Uri\WhatWg\UrlHostType::IPv6
 
$url = new Uri\WhatWg\Url("https://example.com/");
var_dump($url->getHostType());             // Uri\WhatWg\UrlHostType::Domain
 
$url = new Uri\WhatWg\Url("scheme://example.com/");
var_dump($url->getHostType());             // Uri\WhatWg\UrlHostType::Opaque
 
$url = new Uri\WhatWg\Url("mailto://john.doe@example.com");
var_dump($url->getHostType());             // Uri\WhatWg\UrlHostType::Empty
 
$url = new Uri\WhatWg\Url("scheme://john.doe@example.com");
var_dump($url->getHostType());             // null

Add support for host type detection as outlined in the RFC?
Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

URI Type Detection

RFC 3986 distinguishes different URI “types” based on what they begin with. Actually, the RFC 3986 specification collectively refers to these as URI-references.

In order to better support granular RFC 3986 URI type detection, the following enums and methods are proposed to be added:

namespace Uri\Rfc3986 {
    enum UriType
    {
        case AbsolutePathReference;
        case RelativePathReference;
        case NetworkPathReference;
        case Uri;
    }
 
    final readonly class Uri
    {
        ...
 
        public function getUriType(): \Uri\Rfc3986\UriType {}
 
        ...
    }
}

This way, it becomes easier to detect URI types:

$uri = new Uri\Rfc3986\Uri("https://example.com");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::Uri
 
$uri = new Uri\Rfc3986\Uri("https:");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::Uri
 
$uri = new Uri\Rfc3986\Uri("/foo");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::AbsolutePathReference
 
$uri = new Uri\Rfc3986\Uri("foo");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::RelativePathReference
 
$uri = new Uri\Rfc3986\Uri("//host.com/foo");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::NetworkPathReference
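
The distinction between these types follows from what the reference begins with, so it can be approximated in userland as well. The following is only a sketch that assumes its input is already a valid RFC 3986 URI-reference; the ReferenceTypeSketch enum and the classifyReferenceSketch() function are hypothetical names:

```php
<?php
// Hypothetical userland counterpart of the proposed enum.
enum ReferenceTypeSketch
{
    case Uri;
    case NetworkPathReference;
    case AbsolutePathReference;
    case RelativePathReference;
}

function classifyReferenceSketch(string $reference): ReferenceTypeSketch
{
    // scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
    if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $reference) === 1) {
        return ReferenceTypeSketch::Uri;
    }

    // "//" must be checked before "/" because it is the more specific prefix.
    if (str_starts_with($reference, "//")) {
        return ReferenceTypeSketch::NetworkPathReference;
    }

    if (str_starts_with($reference, "/")) {
        return ReferenceTypeSketch::AbsolutePathReference;
    }

    return ReferenceTypeSketch::RelativePathReference;
}

var_dump(classifyReferenceSketch("//host.com/foo") === ReferenceTypeSketch::NetworkPathReference); // bool(true)
```

The order of the checks matters: a scheme takes precedence over everything else, and the network-path check has to come before the absolute-path check.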

The position of this RFC is that identifying the distinction between URIs and absolute URIs doesn't need special support, therefore a dedicated Uri\Rfc3986\UriType enum case is omitted.

The WHATWG URL specification defines some special schemes (http, https, ftp, file, ws, wss), which have distinct parsing and serialization rules. In order to make checks for special URLs easier to perform, a new Uri\WhatWg\Url::isSpecialScheme() method is added:

namespace Uri\WhatWg {
    final readonly class Url
    {
        ...
 
        public function isSpecialScheme(): bool {}
 
        ...
    }
}

This enables low-level control for applications that need to mirror WHATWG behaviors in parsing or normalization.

$url = new Uri\WhatWg\Url("https://example.com");
var_dump($url->isSpecialScheme());                // true
 
$url = new Uri\WhatWg\Url("custom:example");
var_dump($url->isSpecialScheme());                // false

Add support for detecting URI type as outlined in the RFC?
Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

Percent-Encoding and Decoding Support

Contrary to the common belief that's probably further affirmed by the urlencode() and urldecode() functions, percent-encoding and decoding are both context-sensitive processes. Context-sensitivity means that different characters need to be percent-encoded/percent-decoded depending on which URI component is being processed.

It should also be mentioned that urlencode() and urldecode() should in fact rather be used for the application/x-www-form-urlencoded media type, while rawurlencode() and rawurldecode() more closely implement RFC 3986.

For example, the path component assigns special meaning to the / character. Therefore, this character doesn't necessarily have to be percent-encoded in the path component. There are some cases though when it makes sense to percent-encode it, as highlighted by the first example within the “Advanced examples” section of the original URI RFC. Unfortunately, rawurlencode() doesn't take the component into account, and replaces the “/” with “%2F” unconditionally.

echo rawurlencode("/foo/bar/baz");                // %2Ffoo%2Fbar%2Fbaz
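
A common userland workaround today is to split the path first and encode each segment separately. The following sketch only uses existing PHP functions; the encodePathPerSegmentSketch() name is hypothetical. Note that rawurlencode() still encodes more characters than RFC 3986 requires inside a path segment (e.g. “:” and “@”), so this is only an approximation:

```php
<?php
// Hypothetical helper: encodes each segment separately so that the "/" delimiters survive.
function encodePathPerSegmentSketch(string $path): string
{
    return implode("/", array_map('rawurlencode', explode("/", $path)));
}

echo encodePathPerSegmentSketch("/foo/bar baz"), "\n";   // /foo/bar%20baz
```

This approach cannot express a literal “/” inside a segment, because the splitting step already treats every “/” as a delimiter.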

In order to correctly handle percent-encoding and decoding based on the rules of RFC 3986 and WHATWG URL, the following methods and enums are proposed to be added:

namespace Uri\Rfc3986 {
    enum UriPercentEncodingMode
    {
        case UserInfo;
        case Host;
        case RelativeReferencePath;
        case RelativeReferenceFirstPathSegment;
        case Path;
        case PathSegment;
        case Query;
        case FormQuery;
        case Fragment;
        case AllReservedCharacters;
        case All;
    }
 
    final readonly class Uri
    {
        ...
 
        public static function percentEncode(string $input, \Uri\Rfc3986\UriPercentEncodingMode $mode): string {}
 
        public static function percentDecode(string $input, \Uri\Rfc3986\UriPercentEncodingMode $mode): string {}
 
        ...
    }
}
namespace Uri\WhatWg {
    enum UrlPercentEncodingMode
    {
        case UserInfo;
        case Host;
        case OpaqueHost;
        case Path;
        case PathSegment;
        case OpaquePath;
        case OpaquePathSegment;
        case Query;
        case SpecialQuery;
        case FormQuery;
        case Fragment;
    }
 
    final readonly class Url
    {
        ...
 
        public static function percentEncode(string $input, \Uri\WhatWg\UrlPercentEncodingMode $mode): string {}
 
        public static function percentDecode(string $input, \Uri\WhatWg\UrlPercentEncodingMode $mode): string {}
 
        ...
    }
}

The percentEncode() and percentDecode() methods both require an input string and a PercentEncodingMode enum to be passed. The enums make the context of the encoding/decoding processes fully explicit and clear. The following modes are supported:

For the complete ABNF syntax of each component, consult Appendix A of RFC 3986.

Since neither RFC 3986 nor WHATWG URL supports percent-encoded characters inside the scheme component, neither of the enums contains a Scheme case. WHATWG URL automatically percent-decodes the host when it's special, so Uri\WhatWg\UrlPercentEncodingMode doesn't contain a Host case.

The percentDecode() methods perform the inverse operation of percentEncode(): they decode every percent-encoded character that is otherwise allowed by the current percent-encoding mode.

$uri = new Uri\Rfc3986\Uri("https://example.com#_%23%2F"); // The fragment is the percent-encoded form of "_#/"
 
echo Uri\Rfc3986\Uri::percentDecode(
    $uri->getFragment(),
    Uri\Rfc3986\UriPercentEncodingMode::Fragment
);                                                         // _%23/

The “/” character is allowed in the fragment, so it's needlessly percent-encoded in the URI - that's why it can be percent-decoded by percentDecode(). On the other hand, “#” is not allowed in the context of the fragment, so it's kept in the percent-encoded octet form.

RFC 3986 has a sentence that apparently contradicts the behavior of Uri\Rfc3986\Uri::percentDecode():

Thus, characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI.

According to this rule, reserved characters - even if they are allowed in the context of a component - should not be percent-decoded during normalization. Even though the Uri\Rfc3986\Uri getters respect this rule, the percentDecode() method intentionally disregards it so that it can serve in use-cases where those getters cannot. Let's see an example:

$uri = new Uri\Rfc3986\Uri("https://example.com/?q=%3A%29"); // The value of "q" is the percent-encoded form of ":)"
 
echo $uri->getQuery();                            // q=%3A%29
 
echo Uri\Rfc3986\Uri::percentDecode(
    $uri->getQuery(),
    Uri\Rfc3986\UriPercentEncodingMode::Query
);                                                // q=:)

As can be seen above, the getQuery() getter leaves the two percent-encoded reserved characters (“:” and “)”) as-is, even though both of them are allowed in the context of the query (so they don't have to be percent-encoded at all). By using percentDecode(), one can make the input directly consumable, while scheme-specific or producer-specific algorithms should continue to use the getters should they need to perform any kind of custom processing.

The proposed percent-encoding and decoding capabilities make it possible to implement many use-cases in a specification-compliant way, which was difficult to achieve before.

For example, path segments can be properly percent-encoded when they contain the / character:

$uri = new Uri\Rfc3986\Uri("https://example.com");
$uri = $uri->withPathSegments(
    [
        "foo",
        Uri\Rfc3986\Uri::percentEncode("bar/baz", Uri\Rfc3986\UriPercentEncodingMode::PathSegment)
    ]
);
 
$uri->toRawString();                              // https://example.com/foo/bar%2Fbaz
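
To illustrate what a component-aware encoder has to do for a single path segment, the following userland sketch percent-encodes every byte that is outside the “pchar” character set of RFC 3986. The encodePathSegmentSketch() function is hypothetical and only approximates the proposed PathSegment mode; it treats its input as raw data, so an already present “%” is encoded as well:

```php
<?php
// Hypothetical helper: percent-encodes every byte that is not allowed in a path segment.
function encodePathSegmentSketch(string $segment): string
{
    return preg_replace_callback(
        // Negated character class: unreserved / sub-delims / ":" / "@"
        '/[^A-Za-z0-9\-._~!$&\'()*+,;=:@]/',
        static fn (array $match): string => "%" . strtoupper(bin2hex($match[0])),
        $segment
    );
}

echo encodePathSegmentSketch("bar/baz"), "\n";    // bar%2Fbaz
echo encodePathSegmentSketch("a b:c@d"), "\n";    // a%20b:c@d
```

Unlike rawurlencode(), this sketch keeps “:” and “@” intact, because both are allowed in a path segment.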

Add support for percent-encoding and decoding as outlined in the RFC?
Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

Backward Incompatible Changes

All the proposed changes are completely backward compatible because the affected classes are all final.

Proposed PHP Version(s)

Next minor version (PHP 8.6 most likely)

RFC Impact

To the Ecosystem

What effect will the RFC have on IDEs, Language Servers (LSPs), Static Analyzers, Auto-Formatters, Linters and commonly used userland PHP libraries?

To Existing Extensions

Existing extensions can continue to use the existing URI API without any changes. Some of the features are exposed as PHPAPI functions through public headers.

To SAPIs

None.

Open Issues

None.

Future Scope

None.

Patches and Tests

https://github.com/kocsismate/php-src/pull/9

Implementation

After the RFC is implemented, this section should contain:

  1. the version(s) it was merged into
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature

References

Rejected Features

None.

Changelog