
PHP RFC: ext/uri follow-up

Introduction

This RFC proposes a follow-up to the URL Parsing API RFC, extending the Uri\Rfc3986\Uri and Uri\WhatWg\Url classes with additional capabilities that came up during the discussion of the original RFC. These capabilities were not considered essential for the initial version, so they were postponed to keep the scope of the original RFC manageable.

Proposal

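The following new functionality is introduced in this proposal:

  • URI Building
  • Query Parameter Manipulation
  • Accessing Path Segments as an Array
  • Host Type Detection
  • URI Type Detection
  • Percent-Encoding and Decoding Support

Each proposed feature is voted on separately and requires a 2/3 majority.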

URI Building

Currently, only already existing (and validated) URIs can be manipulated via wither methods. These calls always create a new instance so that the immutability of URIs is preserved. Even though this behavior has plenty of advantages, it has at least one disadvantage: instance creation has a performance overhead which is unnecessary in some cases. This is especially problematic if many URI components have to be modified at the same time, because many objects are “wasted” through intermediate instantiations.

$uri1 = Uri\Rfc3986\Uri::parse("http://example.com");
 
$uri2 = $uri1
    ->withScheme("https")
    ->withHost("example.net")
    ->withPath("/foo/bar");                // This creates 3 objects altogether!

Besides its suboptimal performance, another drawback of the current wither-based solution is that creating a URI from scratch is not possible: one always has to have a valid URI first. The empty string is a valid RFC 3986 URI, so it may seem like a good initial URI for building, but unfortunately it's not a valid WHATWG URL. Moreover, the success of some transformations depends on the current state (which is a form of temporal coupling):

$uri1 = Uri\Rfc3986\Uri::parse("");
 
$uri2 = $uri1
    ->withScheme("https")
    ->withUserInfo("user:pass")            // throws Uri\InvalidUriException: Cannot set a userinfo without having a host
    ->withHost("example.com");
 
$uri2 = $uri1
    ->withScheme("https")
    ->withHost("example.com")
    ->withUserInfo("user:pass");           // No exception is thrown

In order to provide a more ergonomic and efficient solution for URI building, a fluent API is introduced that implements the Builder pattern.

$uriBuilder = new Uri\Rfc3986\UriBuilder();
$uriBuilder
    ->setScheme("https")
    ->setUserInfo("user:pass")
    ->setHost("example.com")
    ->setPort(8080)
    ->setPath("/foo/bar")
    ->setQuery("a=1&b=2")
    ->setQueryParams(["a" => 1, "b" => 2]) // Has the same effect as the setQuery() call above
    ->setFragment("section1");
 
$uri = $uriBuilder->build();               // Validation and instance creation is only done at this point
 
echo $uri->toRawString();                  // https://user:pass@example.com:8080/foo/bar?a=1&b=2#section1
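
One way to picture the string that build() ultimately has to validate is the component recomposition algorithm from RFC 3986, Section 5.3. The function below is only a userland sketch of that algorithm, not the proposed implementation: recomposeUri() is a hypothetical name, the authority is passed as one pre-assembled string, and no validation or percent-encoding is performed.

```php
<?php
/**
 * Hypothetical userland sketch of component recomposition as described in
 * RFC 3986, Section 5.3. No validation or percent-encoding is performed.
 */
function recomposeUri(
    ?string $scheme,
    ?string $authority,
    string $path,
    ?string $query,
    ?string $fragment
): string {
    $result = "";

    if ($scheme !== null) {
        $result .= $scheme . ":";
    }
    if ($authority !== null) {
        $result .= "//" . $authority;
    }
    $result .= $path;                  // the path is always present, possibly empty
    if ($query !== null) {
        $result .= "?" . $query;
    }
    if ($fragment !== null) {
        $result .= "#" . $fragment;
    }

    return $result;
}

echo recomposeUri("https", "user:pass@example.com:8080", "/foo/bar", "a=1&b=2", "section1"), "\n";
// https://user:pass@example.com:8080/foo/bar?a=1&b=2#section1
```

Because all components are available at once, the order in which they were set no longer matters, which is what removes the temporal coupling shown earlier.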

The same works for WHATWG URL:

$urlBuilder = new Uri\WhatWg\UrlBuilder();
$urlBuilder
    ->setScheme("https")
    ->setUserInfo("user:pass")
    ->setHost("example.com")
    ->setPort(8080)
    ->setPath("/foo/bar")
    ->setQuery("a=1&b=2")
    ->setQueryParams(["a" => 1, "b" => 2]) // Has the same effect as the setQuery() call above
    ->setFragment("section1");
 
$url = $urlBuilder->build();               // Validation and instance creation is only done at this point
 
echo $url->toAsciiString();                // https://user:pass@example.com:8080/foo/bar?a=1&b=2#section1

The complete class signatures to be added are the following:

namespace Uri\Rfc3986 {
    final class UriBuilder
    {
        public function __construct() {}
 
        public function setScheme(?string $scheme): static {}
 
        public function setUsername(?string $username): static {}
 
        public function setPassword(?string $password): static {}
 
        public function setUserInfo(?string $userInfo): static {}
 
        public function setHost(?string $host): static {}
 
        public function setPath(string $path): static {}
 
        public function setQuery(?string $query): static {}
 
        public function setQueryParams(mixed $queryParams): static {}
 
        public function setFragment(?string $fragment): static {}
 
        public function build(?\Uri\Rfc3986\Uri $baseUrl = null): \Uri\Rfc3986\Uri {}
    }
}
namespace Uri\WhatWg {
    final class UrlBuilder
    {
        public function __construct() {}
 
        public function setScheme(?string $scheme): static {}
 
        public function setUsername(?string $username): static {}
 
        public function setPassword(?string $password): static {}
 
        public function setUserInfo(?string $userInfo): static {}
 
        public function setHost(?string $host): static {}
 
        public function setPath(string $path): static {}
 
        public function setQuery(?string $query): static {}
 
        public function setQueryParams(mixed $queryParams): static {}
 
        public function setFragment(?string $fragment): static {}
 
        /** @param array $errors */
        public function build(?\Uri\WhatWg\Url $baseUrl = null, &$errors = null): \Uri\WhatWg\Url {}
    }
}

Design considerations

Builder pattern vs static factory method

Why is a complex Builder-pattern-based approach proposed instead of a much simpler factory-method-based one? The factory method could be as simple as the following:

namespace Uri\Rfc3986 {
    final readonly class Uri
    {
        ...
 
        public static function fromComponents(
            ?string $scheme = null, ?string $host = null, string $path = "",
            ?string $userInfo = null, ?string $queryString = null, ?string $fragment = null
        ) {}
 
        ...
    }
}
 
namespace Uri\WhatWg {
    final readonly class Url
    {
        ...
 
        public static function fromComponents(
            string $scheme, ?string $host = "", string $path = "",
            ?string $username = null, ?string $password = null,
            ?string $queryString = null, ?string $fragment = null
        ) {}
 
        ...
    }
}

This RFC proposes the Builder-pattern-based approach because of its flexibility: it makes it possible to add more convenience methods in the future. In fact, setQueryParams(), which accepts an array of query parameters instead of a query string, is already one such method.

Add URI building support as outlined in the RFC?
Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

Query Parameter Manipulation

Query parameter manipulation is an integral part of URI handling. WHATWG URL even dedicates a separate section to the URLSearchParams class, which implements advanced query parameter handling. Unfortunately, RFC 3986 doesn't define any such capability, so both proposed classes closely follow the design of the WHATWG URL specification.

Therefore, the following classes and methods are proposed for addition:

namespace Uri\Rfc3986 {
    final class UriQueryParams
    {
        public static function parse(string $queryString): \Uri\Rfc3986\UriQueryParams {}
 
        public static function fromArray(array $queryParams): \Uri\Rfc3986\UriQueryParams {}
 
        private function __construct() {}
 
        public function append(string $name, mixed $value): void {}
 
        public function delete(string $name): void {}
 
        public function deleteWithValue(string $name, mixed $value): bool {}
 
        public function has(string $name): bool {}
 
        public function hasWithValue(string $name, mixed $value): bool {}
 
        public function getRawFirst(string $name): mixed {}
 
        public function getFirst(string $name): mixed {}
 
        public function getRawLast(string $name): mixed {}
 
        public function getLast(string $name): mixed {}
 
        public function getRawAll(): array {}
 
        public function getAll(): array {}
 
        public function getSize(): int {}
 
        public function set(string $name, mixed $value): void {}
 
        public function sort(): void {}
 
        public function toRawString(): string {}
 
        public function toString(): string {}
 
        public function __serialize(): array {}
 
        public function __unserialize(array $data): void {}
 
        public function __debugInfo(): array {}
    }
    final readonly class Uri
    {
        ...
 
        public function getRawQueryParams(): ?\Uri\Rfc3986\UriQueryParams {}
 
        public function getQueryParams(): ?\Uri\Rfc3986\UriQueryParams {}
 
        #[\NoDiscard(message: "as Uri\Rfc3986\Uri::withQueryParams() does not modify the object itself")]
        public function withQueryParams(?\Uri\Rfc3986\UriQueryParams $queryParams): static {}
 
        ...
    }
}
namespace Uri\WhatWg {
    final class UrlQueryParams
    {
        public static function parse(string $queryString): \Uri\WhatWg\UrlQueryParams {}
 
        public static function fromArray(array $queryParams): \Uri\WhatWg\UrlQueryParams {}
 
        private function __construct() {}
 
        public function append(string $name, mixed $value): void {}
 
        public function delete(string $name): void {}
 
        public function deleteWithValue(string $name, mixed $value): void {}
 
        public function has(string $name): bool {}
 
        public function hasWithValue(string $name, string $value): bool {}
 
        public function getRawFirst(string $name): mixed {}
 
        public function getFirst(string $name): mixed {}
 
        public function getRawLast(string $name): mixed {}
 
        public function getLast(string $name): mixed {}
 
        public function getRawAll(): array {}
 
        public function getAll(): array {}
 
        public function getSize(): int {}
 
        public function set(string $name, mixed $value): void {}
 
        public function sort(): void {}
 
        public function toRawString(): string {}
 
        public function toString(): string {}
 
        public function __serialize(): array {}
 
        public function __unserialize(array $data): void {}
 
        public function __debugInfo(): array {}
    }
    final readonly class Url
    {
        ...
 
        public function getQueryParams(): ?\Uri\WhatWg\UrlQueryParams {}
 
        #[\NoDiscard(message: "as Uri\WhatWg\Url::withQueryParams() does not modify the object itself")]
        public function withQueryParams(?\Uri\WhatWg\UrlQueryParams $queryParams): static {}
 
        ...
    }
}

Construction

Both UriQueryParams and UrlQueryParams support two factory methods for instantiation:

  • parse() method: It parses a query string into a list of query parameters.
  • fromArray() method: It takes an array of query parameters and directly composes the query parameter list object based on it. It may be counter-intuitive, but a multi-dimensional array is expected ([["key1" => "value1"], ["key2" => "value2"]]) instead of a single array of key-value pairs (["key1" => "value1", "key2" => "value2"]). This is needed to support repeated query parameter names.
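
To see why a flat array cannot represent repeated names, note that PHP array keys are unique, so a later entry silently overwrites an earlier one. The snippet below is plain PHP and does not use the proposed classes; the variable names are purely illustrative.

```php
<?php
// A flat key-value array cannot hold the same parameter name twice:
// the second "abc" entry overwrites the first one.
$flat = ["abc" => "foo", "abc" => "bar"];

// A list of single-pair arrays keeps both occurrences and their order.
$pairs = [
    ["abc" => "foo"],
    ["abc" => "bar"],
];

var_dump(count($flat));   // int(1)
var_dump(count($pairs));  // int(2)
```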

The constructor of both classes is private and even throws when invoked, in order to encourage the use of the factory methods mentioned above. Some examples for instantiation:

$params = Uri\Rfc3986\UriQueryParams::parse("abc=foo&abc=bar"); // Successful instantiation
$params = Uri\Rfc3986\UriQueryParams::fromArray(
    [
        ["abc" => "foo"],
        ["abc" => "bar"],
    ]
);                                                              // Successful instantiation - same result as above
 
$params = new Uri\Rfc3986\UriQueryParams();                     // Throws an exception
 
$params = Uri\WhatWg\UrlQueryParams::parse("abc=foo&abc=bar");  // Successful instantiation
$params = Uri\WhatWg\UrlQueryParams::fromArray(
    [
        ["abc" => "foo"],
        ["abc" => "bar"],
    ]
);                                                              // Successful instantiation - same result as above
 
$params = new Uri\WhatWg\UrlQueryParams();                      // Throws an exception

Additionally, it is possible to create a UriQueryParams or UrlQueryParams instance from a Uri or Url, respectively:

$uri = new Uri\Rfc3986\Uri("https://example.com/?foo=bar");
 
$params = $uri->getRawQueryParams();       // Creates a Uri\Rfc3986\UriQueryParams instance
$params = $uri->getQueryParams();          // Creates a Uri\Rfc3986\UriQueryParams instance
 
$url = new Uri\WhatWg\Url("https://example.com/?foo=bar");
 
$params = $url->getQueryParams();          // Creates a Uri\WhatWg\UrlQueryParams instance

The difference between Uri\Rfc3986\Uri::getRawQueryParams() and Uri\Rfc3986\Uri::getQueryParams() is that the former uses the “raw” (non-normalized) query string when instantiating Uri\Rfc3986\UriQueryParams.

It's important to note that none of the above methods validates the query parameters, so an exception is only thrown if some other error happens (which is mostly a theoretical scenario, e.g. in case of memory errors). This behavior is by design: the idea of WHATWG URL's URLSearchParams class is that it's tolerant when reading, and UriQueryParams and UrlQueryParams follow the same principle. Validation happens anyway when the serialized query parameters are written to a URI (via Uri\Rfc3986\Uri::withQueryParams() and Uri\WhatWg\Url::withQueryParams()).

$uri = new Uri\Rfc3986\Uri("https://example.com/?foo=bar");
 
$params = $uri->getRawQueryParams();       // Creates a Uri\Rfc3986\UriQueryParams instance 
$params->append("#baz", "qux");            // Appends an invalid parameter containing "#"
 
$uri = $uri->withQueryParams($params);     // Throws Uri\InvalidUriException: The specified query is malformed

The Uri\Rfc3986\Uri::getRawQueryParams(), Uri\Rfc3986\Uri::getQueryParams(), Uri\WhatWg\Url::getQueryParams() methods return null if the query string is missing (e.g. https://example.com/), and an empty query parameter list is returned if the query string is empty (e.g. https://example.com/?).

$uri = new Uri\Rfc3986\Uri("https://example.com/");
var_dump($uri->getRawQueryParams());       // null
var_dump($uri->getQueryParams());          // null
 
$uri = new Uri\Rfc3986\Uri("https://example.com/?");
var_dump($uri->getRawQueryParams());       // A new Uri\Rfc3986\UriQueryParams containing zero items
var_dump($uri->getQueryParams());          // A new Uri\Rfc3986\UriQueryParams containing zero items

The same example with Uri\WhatWg\UrlQueryParams:

$url = new Uri\WhatWg\Url("https://example.com/");
var_dump($url->getQueryParams());          // null
 
$url = new Uri\WhatWg\Url("https://example.com/?");
var_dump($url->getQueryParams());          // A new Uri\WhatWg\UrlQueryParams containing zero items

Modification

As an earlier example demonstrated, the append() method can be used to append a parameter to the end of the list. As usual, the same query parameter name can be added multiple times:

$params = Uri\Rfc3986\UriQueryParams::parse("foo=bar");
$params->append("baz", "qux");
$params->append("baz", "qaz");             // Appends "baz" twice
 
echo $params->toString();                  // foo=bar&baz=qux&baz=qaz

Updating a parameter is possible via the set() method:

$params = Uri\Rfc3986\UriQueryParams::parse("foo=bar&foo=baz");
$params->set("foo", "baz");                // Overwrites the first item "foo", and removes the second one
$params->set("qux", "qaz");                // Appends a new item "qux"
 
echo $params->toString();                  // foo=baz&qux=qaz

Actually, the set() method has a hybrid behavior: if a parameter is not present in the list, then it adds it just like append() does. Otherwise, it overwrites the first item, and removes the rest of the occurrences.
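
The hybrid behavior can be illustrated on a plain list of [name, value] pairs. The following is a userland sketch under that assumed representation; setParam() is a hypothetical helper and is not part of the proposed API.

```php
<?php
/**
 * Hypothetical helper mirroring the described set() semantics on a plain
 * list of [name, value] pairs. Not part of the proposed API.
 */
function setParam(array $pairs, string $name, string $value): array
{
    $result = [];
    $replaced = false;

    foreach ($pairs as [$n, $v]) {
        if ($n !== $name) {
            $result[] = [$n, $v];      // unrelated parameter: keep as-is
        } elseif (!$replaced) {
            $result[] = [$n, $value];  // first occurrence: overwrite in place
            $replaced = true;
        }
        // any further occurrence of $name is dropped
    }

    if (!$replaced) {
        $result[] = [$name, $value];   // not present: behaves like append()
    }

    return $result;
}

$pairs = [["foo", "bar"], ["foo", "baz"]];
$pairs = setParam($pairs, "foo", "baz");
$pairs = setParam($pairs, "qux", "qaz");

echo implode("&", array_map(fn (array $p): string => $p[0] . "=" . $p[1], $pairs)), "\n";
// foo=baz&qux=qaz
```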

Removing parameters is possible via either the delete() or the deleteWithValue() method: the former removes all occurrences of the given parameter name, while the latter removes only those occurrences where both the given name and value match, as demonstrated below:

$params = Uri\Rfc3986\UriQueryParams::parse("foo=bar&foo=baz&foo=qux");
$params->deleteWithValue("foo", "baz");    // Deletes the "foo=baz" parameter
$params->delete("foo");                    // Deletes the rest of the occurrences: "foo=bar" and "foo=qux"
$params->delete("non-existent");           // The parameter is not present: nothing happens

The last method that can modify the list is sort(), which can sort the parameters alphabetically:

$params = Uri\Rfc3986\UriQueryParams::parse("foo=bar&baz=qux&baz=baz");
$params->sort();
 
echo $params->toString();                  // baz=baz&baz=qux&foo=bar

Getters

To find out if a parameter exists, the has() and hasWithValue() methods can be used:

$params = Uri\Rfc3986\UriQueryParams::parse("foo=bar&baz=qux&baz=baz");
 
var_dump($params->has("baz"));                 // true
var_dump($params->has("non-existent"));        // false
 
var_dump($params->hasWithValue("foo", "bar")); // true
var_dump($params->hasWithValue("foo", "baz")); // false

The has() method returns true if there is at least one parameter in the parameter list with the given name, and false otherwise. On the other hand, hasWithValue() returns true only if the given name and value both match at least one parameter; otherwise it returns false.

The number of query parameters can be retrieved by calling the getSize() method:

$params = Uri\Rfc3986\UriQueryParams::parse("foo=bar&baz=qux&baz=baz");
 
echo $params->getSize();                  // 3

There are also a number of methods that can return a query parameter or an array of query parameters:

  • getFirst(): Retrieves the first parameter with the given name. This actually implements the get() method in the WHATWG URL specification.
  • getLast(): Retrieves the last parameter with the given name. It's a custom addition to the WHATWG URL specification.
  • getAll(): Retrieves all parameters with the given name.

$params = Uri\Rfc3986\UriQueryParams::parse("foo=bar&foo=baz&foo=qux");
 
var_dump($params->getFirst("foo"));           // bar
var_dump($params->getFirst("non-existent"));  // null
 
var_dump($params->getLast("foo"));            // qux
var_dump($params->getLast("non-existent"));   // null
 
var_dump($params->getAll("foo"));             // [["foo", "bar"], ["foo", "baz"], ["foo", "qux"]]
var_dump($params->getAll("non-existent"));    // []

Percent-encoding and decoding

Neither UriQueryParams nor UrlQueryParams handles percent-encoding and decoding natively.

Implemented Interfaces

The UriQueryParams and UrlQueryParams classes could implement all of the following interfaces: Countable, ArrayAccess, IteratorAggregate. However, the position of this RFC is that doing so would lead to counterintuitive behavior, because it wouldn't be immediately clear whether the “raw” or the normalized representation is being iterated or accessed.

$params = Uri\Rfc3986\UriQueryParams::parse("foo=b%61r");
 
echo $params["foo"];                      // Is it "b%61r" or "bar"?
echo $params->getRawFirst("foo");         // b%61r
echo $params->getFirst("foo");            // bar
 
foreach ($params->getIterator() as $value) {
    echo $value;                          // Is it "b%61r" or "bar"?
}
 
foreach ($params->getRawAll() as $value) {
    echo $value;                          // b%61r
}
 
foreach ($params->getAll() as $value) {
    echo $value;                          // bar
}

Since the Countable interface is less useful on its own, it is not implemented by UriQueryParams and UrlQueryParams either.

Cloning

Cloning of UriQueryParams and UrlQueryParams is supported.

Serialization

The classes are both serializable and deserializable.

Debugging

They also contain a __debugInfo() method that returns all items in the query parameter list, just like how the getRawAll() method does.

Add support for query parameter manipulation as outlined in the RFC?
Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

Accessing Path Segments as an Array

Sometimes, the path segments are needed rather than the whole path as a string. In that case, splitting the path into segments manually after retrieval is both inconvenient and disadvantageous performance-wise, especially considering that Uri\Rfc3986\Uri internally stores the path as a list of segments.

In order to better support the related use-cases, the following methods are proposed to be added:

namespace Uri\Rfc3986 {
    final readonly class Uri
    {
        ...
 
        public function getRawPathSegments(): ?array {}
 
        public function getPathSegments(): ?array {}
 
        #[\NoDiscard(message: "as Uri\Rfc3986\Uri::withPathSegments() does not modify the object itself")]
        public function withPathSegments(array $segments): static {}
 
        ...
    }
}
namespace Uri\WhatWg {
    final readonly class Url
    {
        ...
 
        public function getPathSegments(): array {}
 
        #[\NoDiscard(message: "as Uri\WhatWg\Url::withPathSegments() does not modify the object itself")]
        public function withPathSegments(array $segments): static {}
 
        ...
    }
}

This way, it is possible to write the following code:

$uri = new Uri\Rfc3986\Uri("https://example.com/foo/bar/baz");
$segments = $uri->getPathSegments();        // ["foo", "bar", "baz"]
 
$uri = $uri->withPathSegments(["a", "b"]);
echo $uri->getPath();                       // /a/b

The same for WHATWG URL:

$url = new Uri\WhatWg\Url("https://example.com/foo/bar/baz");
$segments = $url->getPathSegments();        // ["foo", "bar", "baz"]
 
$url = $url->withPathSegments(["a", "b"]);
echo $url->getPath();                       // /a/b

The getter methods return null if the path is empty (https://example.com), an empty array when the path consists of a single slash (https://example.com/), and a non-empty array otherwise.

Uri\Rfc3986\Uri::withPathSegments() and Uri\WhatWg\Url::withPathSegments() internally concatenate the input segments separated by a / character, and then trigger Uri\Rfc3986\Uri::withPath() and Uri\WhatWg\Url::withPath(), respectively.
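
The segment handling described above can be approximated on plain strings as follows. This is a userland sketch only: segmentsToPath() and pathToSegments() are hypothetical helpers, no validation is performed, and the leading slash added when joining is an assumption based on the /a/b result shown in the examples.

```php
<?php
/**
 * Hypothetical helpers illustrating the described segment handling on plain
 * strings. They are not the proposed methods and perform no validation.
 */
function segmentsToPath(array $segments): string
{
    // Assumption: the result is an absolute path, as in the /a/b example.
    return "/" . implode("/", $segments);
}

function pathToSegments(string $path): ?array
{
    if ($path === "") {
        return null;            // empty path
    }
    if ($path === "/") {
        return [];              // path consisting of a single slash
    }
    // Drop one leading slash, if any, then split on the separators.
    $relative = $path[0] === "/" ? substr($path, 1) : $path;

    return explode("/", $relative);
}

echo segmentsToPath(["a", "b"]), "\n";                    // /a/b
echo json_encode(pathToSegments("/foo/bar/baz")), "\n";   // ["foo","bar","baz"]
```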

Add support for accessing path segments as an array as outlined in the RFC?
Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

Host Type Detection

Both the RFC 3986 and WHATWG URL specifications distinguish different types of the host component because each type has different parsing and formatting rules. Probably the most notable example is the IPv6 host type, which requires the IPv6 address to be enclosed in a [ and ] pair.
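
Without such support, userland code has to guess the host type from the string itself. The sketch below shows roughly what that looks like for RFC 3986 hosts; guessHostType() is a hypothetical helper that returns plain string labels, and it assumes the input is already a syntactically valid host (a bracketed value that is not IPvFuture is simply labeled IPv6 without further checks).

```php
<?php
/**
 * Hypothetical userland approximation of RFC 3986 host classification.
 * Returns a plain string label; assumes the host is already valid.
 */
function guessHostType(string $host): string
{
    // IP-literal = "[" ( IPv6address / IPvFuture ) "]"
    if (str_starts_with($host, "[") && str_ends_with($host, "]")) {
        $inner = substr($host, 1, -1);

        // IPvFuture = "v" 1*HEXDIG "." ...
        if (preg_match('/^v[0-9a-f]+\./i', $inner) === 1) {
            return "IPvFuture";
        }

        return "IPv6";
    }

    if (filter_var($host, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false) {
        return "IPv4";
    }

    return "RegisteredName";
}

echo guessHostType("192.168.0.1"), "\n";     // IPv4
echo guessHostType("[2001:db8::1]"), "\n";   // IPv6
echo guessHostType("[v1.1.2.3]"), "\n";      // IPvFuture
echo guessHostType("example.com"), "\n";     // RegisteredName
```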

In order to support returning information about the host type, the following enums and methods are proposed to be added:

namespace Uri\Rfc3986 {
    enum UriHostType
    {
        case IPv4;
        case IPv6;
        case IPvFuture;
        case RegisteredName;
    }
 
    final readonly class Uri
    {
        ...
 
        public function getHostType(): ?\Uri\Rfc3986\UriHostType {}
 
        ...
    }
}
namespace Uri\WhatWg {
    enum UrlHostType
    {
        case IPv4;
        case IPv6;
        case Domain;
        case Opaque;
        case Empty;
    }
 
    final readonly class Url
    {
        ...
 
        public function getHostType(): ?\Uri\WhatWg\UrlHostType {}
 
        ...
    }
}

The new getHostType() methods return the type of the host component for both specifications:

$uri = new Uri\Rfc3986\Uri("https://192.168.0.1/");
var_dump($uri->getHostType());             // UriHostType::IPv4
 
$uri = new Uri\Rfc3986\Uri("https://[2001:db8::1]/");
var_dump($uri->getHostType());             // UriHostType::IPv6
 
$uri = new Uri\Rfc3986\Uri("https://[v1.1.2.3]/");
var_dump($uri->getHostType());             // UriHostType::IPvFuture
 
$uri = new Uri\Rfc3986\Uri("https://example.com/");
var_dump($uri->getHostType());             // UriHostType::RegisteredName

The same for WHATWG URL:

$url = new Uri\WhatWg\Url("https://192.168.0.1/");
var_dump($url->getHostType());             // UrlHostType::IPv4
 
$url = new Uri\WhatWg\Url("https://[2001:db8::1]/");
var_dump($url->getHostType());             // UrlHostType::IPv6
 
$url = new Uri\WhatWg\Url("https://example.com/");
var_dump($url->getHostType());             // UrlHostType::Domain
 
$url = new Uri\WhatWg\Url("scheme://example.com/");
var_dump($url->getHostType());             // UrlHostType::Opaque
 
$url = new Uri\WhatWg\Url("mailto://john.doe@example.com");
var_dump($url->getHostType());             // UrlHostType::Empty

Add support for host type detection as outlined in the RFC?
Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

URI Type Detection

RFC 3986 distinguishes different URI “types” based on what they begin with.

  • Relative-reference: Starts with a path, and the scheme is therefore omitted. Relative-references can be further grouped into the following types:
    • Absolute-path reference: Starts with a single slash (“/”), e.g.: “/foo”
    • Relative-path reference: Starts without a slash (“/”), e.g.: “foo”
    • Network-path reference: Starts with a double slash (“//”) followed by an authority, e.g.: //host/foo
  • URI: Starts with the scheme component, and then continues with either the authority, or the path.
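
The prefix rules listed above can be sketched in userland PHP as follows. classifyReference() is a hypothetical helper that returns plain string labels; it only inspects the start of the input and performs no validation of the rest of the string.

```php
<?php
/**
 * Hypothetical userland classifier following the prefix rules listed above.
 * It only inspects the start of the string and performs no validation.
 */
function classifyReference(string $input): string
{
    // scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
    if (preg_match('/^[A-Za-z][A-Za-z0-9+.\-]*:/', $input) === 1) {
        return "Uri";
    }
    if (str_starts_with($input, "//")) {
        return "NetworkPathReference";
    }
    if (str_starts_with($input, "/")) {
        return "AbsolutePathReference";
    }

    return "RelativePathReference";
}

echo classifyReference("https://example.com"), "\n";   // Uri
echo classifyReference("//host/foo"), "\n";            // NetworkPathReference
echo classifyReference("/foo"), "\n";                  // AbsolutePathReference
echo classifyReference("foo"), "\n";                   // RelativePathReference
```

The order of the checks matters: the double-slash test has to run before the single-slash test, otherwise every network-path reference would be reported as an absolute-path reference.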

In order to better support granular RFC 3986 URI type detection, the following enums and methods are proposed to be added:

namespace Uri\Rfc3986 {
    enum UriType
    {
        case AbsolutePathReference;
        case RelativePathReference;
        case NetworkPathReference;
        case Uri;
    }
 
    final readonly class Uri
    {
        ...
 
        public function getUriType(): \Uri\Rfc3986\UriType {}
 
        ...
    }
}

This way, it becomes easier to detect the URI type:

$uri = new Uri\Rfc3986\Uri("https://example.com");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::Uri
 
$uri = new Uri\Rfc3986\Uri("/foo");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::AbsolutePathReference
 
$uri = new Uri\Rfc3986\Uri("foo");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::RelativePathReference
 
$uri = new Uri\Rfc3986\Uri("//host.com/foo");
var_dump($uri->getUriType());                     // Uri\Rfc3986\UriType::NetworkPathReference

The WHATWG URL specification defines some special schemes (http, https, ftp, file, ws, wss), which have distinct parsing and serialization rules. In order to make checks for special URLs easier to perform, a new Uri\WhatWg\Url::isSpecial() method is added:

namespace Uri\WhatWg {
    final readonly class Url
    {
        ...
 
        public function isSpecial(): bool {}
 
        ...
    }
}

This enables low-level control for applications that need to mirror WHATWG behaviors in parsing or normalization.

$url = new Uri\WhatWg\Url("https://example.com");
var_dump($url->isSpecial());                      // true
 
$url = new Uri\WhatWg\Url("custom:example");
var_dump($url->isSpecial());                      // false

Add support for detecting URI type as outlined in the RFC?
Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

Percent-Encoding and Decoding Support

Contrary to the common belief, which is probably reinforced by the urlencode() and urldecode() functions, percent-encoding and decoding are both context-sensitive processes. Context-sensitivity means that different characters need to be percent-encoded or percent-decoded depending on which URI component is being processed.

It should also be mentioned that urlencode() and urldecode() are really meant for the application/x-www-form-urlencoded media type, while rawurlencode() and rawurldecode() more closely implement RFC 3986.

For example, the path component assigns a special meaning to the / character. Therefore, this character doesn't necessarily have to be percent-encoded in the path component. There are some cases though when it makes sense to percent-encode it, as highlighted by the first example within the “Advanced examples” section of the original URI RFC. Unfortunately, rawurlencode() doesn't take the component into account, and replaces “/” with “%2F” unconditionally.

echo rawurlencode("/foo/bar/baz");         // %2Ffoo%2Fbar%2Fbaz
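
A common userland workaround today is to encode each path segment separately so that the separators survive. The sketch below uses only existing PHP functions; encodePathSegments() is a hypothetical helper, and because it relies on rawurlencode() it encodes more characters than the RFC 3986 path rules strictly require (for example sub-delimiters such as "!" or ",").

```php
<?php
/**
 * Hypothetical helper: percent-encodes each path segment separately so that
 * the "/" separators are preserved. Relies on rawurlencode(), which is
 * stricter than the RFC 3986 path rules.
 */
function encodePathSegments(string $path): string
{
    return implode("/", array_map("rawurlencode", explode("/", $path)));
}

echo encodePathSegments("/foo/bar baz/qux"), "\n";  // /foo/bar%20baz/qux
echo rawurlencode("/foo/bar baz/qux"), "\n";        // %2Ffoo%2Fbar%20baz%2Fqux
```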

In order to correctly handle percent-encoding and decoding based on the rules of RFC 3986 and WHATWG URL, the following methods and enums are proposed to be added:

namespace Uri\Rfc3986 {
    enum UriPercentEncodingMode
    {
        case UserInfo;
        case Host;
        case RelativeReferencePath;
        case RelativeReferenceFirstPathSegment;
        case Path;
        case PathSegment;
        case Query;
        case FormQuery;
        case Fragment;
        case AllReservedCharacters;
        case All;
    }
 
    final readonly class Uri
    {
        ...
 
        public static function percentEncode(string $input, \Uri\Rfc3986\UriPercentEncodingMode $mode): string {}
 
        public static function percentDecode(string $input, \Uri\Rfc3986\UriPercentEncodingMode $mode): string {}
 
        ...
    }
}
namespace Uri\WhatWg {
    enum UrlPercentEncodingMode
    {
        case UserInfo;
        case Host;
        case OpaqueHost;
        case Path;
        case PathSegment;
        case OpaquePath;
        case OpaquePathSegment;
        case Query;
        case SpecialQuery;
        case FormQuery;
        case Fragment;
    }
 
    final readonly class Url
    {
        ...
 
        public static function percentEncode(string $input, \Uri\WhatWg\UrlPercentEncodingMode $mode): string {}
 
        public static function percentDecode(string $input, \Uri\WhatWg\UrlPercentEncodingMode $mode): string {}
 
        ...
    }
}

The percentEncode() and percentDecode() methods both require an input string and a PercentEncodingMode enum to be passed. The enums make the context of the encoding/decoding processes fully explicit and clear. The following modes are supported:

For the complete ABNF syntax of each component, consult Appendix A of RFC 3986.

  • Uri\WhatWg\UrlPercentEncodingMode
    • UserInfo: Besides the code points percent-encoded by Uri\WhatWg\UrlPercentEncodingMode::Path, the following code points are percent-encoded: U+002F (/), U+003A (:), U+003B (;), U+003D (=), U+0040 (@), U+005B ([) to U+005D (]), inclusive, and U+007C (|).
    • OpaqueHost: Control characters, and all code points greater than ~ are percent-encoded.
    • Path: Besides the code points percent-encoded by Uri\WhatWg\UrlPercentEncodingMode::Query, the following code points are percent-encoded: U+003F (?), U+005E (^), U+0060 (`), U+007B ({), and U+007D (}).
    • PathSegment: Besides the code points percent-encoded by Uri\WhatWg\UrlPercentEncodingMode::Query, the following code points are percent-encoded: U+003F (?), U+005E (^), U+0060 (`), U+007B ({), U+007D (}), and U+002F (/).
    • OpaquePathSegment:
    • Query: Besides Control characters, and all code points greater than ~, the following code points are percent-encoded: U+0020 SPACE, U+0022 ("), U+0023 (#), U+003C (<), and U+003E (>).
    • SpecialQuery: Besides the code points percent-encoded by Uri\WhatWg\UrlPercentEncodingMode::Query, the following code points are percent-encoded: U+0027 (').
    • FormQuery: Besides the code points percent-encoded by Uri\WhatWg\UrlPercentEncodingMode::UserInfo, the following code points are percent-encoded: U+0024 ($) to U+0026 (&), inclusive, U+002B (+), U+002C (,), U+0021 (!), U+0027 (') to U+0029 RIGHT PARENTHESIS, inclusive, and U+007E (~).
    • Fragment: Besides Control characters, and all code points greater than ~, the following code points are percent-encoded: U+0020 SPACE, U+0022 ("), U+003C (<), U+003E (>), and U+0060 (`).

Since neither RFC 3986, nor WHATWG URL support percent-encoded characters inside the scheme component, none of the enums contain a Scheme case. WHATWG URL automatically percent-decodes the host when it's special, so Uri\WhatWg\UrlPercentEncodingMode doesn't contain a Host case.

Even path segments could be percent-encoded/decoded in a specification compliant way:

$encodedComponent = Uri\Rfc3986\Uri::percentEncode(
    "bar/baz",
    Uri\Rfc3986\UriPercentEncodingMode::PathSegment
);                                                      // bar%2Fbaz
 
$uri = new Uri\Rfc3986\Uri("https://example.com");
$uri = $uri->withPathSegments(["foo", $encodedComponent]);
 
echo $uri->toRawString();                               // https://example.com/foo/bar%2Fbaz

Add support for percent-encoding and decoding as outlined in the RFC?
Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

Backward Incompatible Changes

All the proposed changes are completely backward compatible because the affected classes are all final.

Proposed PHP Version(s)

Next minor version (PHP 8.6 most likely)

RFC Impact

To the Ecosystem

What effect will the RFC have on IDEs, Language Servers (LSPs), Static Analyzers, Auto-Formatters, Linters and commonly used userland PHP libraries?

To Existing Extensions

Existing extensions can continue to use the existing URI API without any changes. Some of the features are exposed as PHPAPI functions through public headers.

To SAPIs

None.

Open Issues

None.

Future Scope

None.

Patches and Tests

Implementation

After the RFC is implemented, this section should contain:

  1. the version(s) it was merged into
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature

References

Rejected Features

None.

Changelog

rfc/uri_followup.txt · Last modified: by kocsismate