====== PHP RFC: ext/uri follow-up ======
* Version: 0.1
* Date: 2025-10-17
* Author: Máté Kocsis, kocsismate@php.net
* Status: Draft
* Target version: next minor version (PHP 8.6)
* Implementation: https://github.com/kocsismate/php-src/pull/9
===== Introduction =====
This RFC proposes a follow-up to the [[rfc:url_parsing_api|URL Parsing API RFC]], extending the ''Uri\Rfc3986\Uri'' and ''Uri\WhatWg\Url'' classes with additional capabilities that came up during the discussion of the original RFC. These capabilities were deemed not to be essential from the get-go, therefore they were postponed in order not to increase scope.
===== Proposal =====
The following new functionality is introduced in this proposal:
- [[#uri_building|URI Building]]
- [[#query_parameter_manipulation|Query Parameter Manipulation]]
- [[#accessing_path_segments_as_an_array|Accessing Path Segments as an Array]]
- [[#host_type_detection|Host Type Detection]]
- [[#uri_type_detection|URI Type Detection]]
- [[#percent_encoding_decoding_support|Percent-Encoding/Decoding Support]]
Each proposed feature is voted on separately and requires a 2/3 majority.
==== URI Building ====
Currently, only **already existing (and validated)** URIs can be manipulated via [[https://wiki.php.net/rfc/url_parsing_api#component_modification|wither methods]]. These calls always create a new instance so that the immutability of URIs is preserved. Even though this behavior has plenty of advantages, it has at least one disadvantage: instance creation has a performance overhead that is unnecessary in some cases. This is especially problematic when many URI components have to be modified at the same time, because several objects are "wasted" through intermediate instantiations.
$uri1 = Uri\Rfc3986\Uri::parse("http://example.com");
$uri2 = $uri1
->withScheme("https")
->withHost("example.net")
->withPath("/foo/bar"); // This creates 3 objects altogether!
Besides its suboptimal performance, another drawback of the current wither-based solution is that creating a URI from scratch is not possible: one always has to have a valid URI first. The empty string is a valid RFC 3986 URI, so it may seem like a good initial URI for URI building, but unfortunately, it is not a valid WHATWG URL. In any case, the success of some transformations depends on the current state (which is a form of temporal coupling):
$uri1 = Uri\Rfc3986\Uri::parse("");
$uri2 = $uri1
->withScheme("https")
->withUserInfo("user:pass") // throws Uri\InvalidUriException: Cannot set a userinfo without having a host
->withHost("example.com");
$uri2 = $uri1
->withScheme("https")
->withHost("example.com")
->withUserInfo("user:pass"); // no exception is thrown
In order to provide a more ergonomic and efficient solution for URI building, a fluent API is introduced that implements the [[https://refactoring.guru/design-patterns/builder|Builder pattern]].
$uriBuilder = new Uri\Rfc3986\UriBuilder();
$uriBuilder
->setScheme("https")
->setUserInfo("user:pass")
->setHost("example.com")
->setPort(8080)
->setPath("/foo/bar")
->setQuery("a=1&b=2")
->setQueryParams(["a" => 1, "b" => 2]) // Has the same effect as the setQuery() call above
->setFragment("section1");
$uri = $uriBuilder->build(); // Validation and instance creation are only done at this point
echo $uri->toRawString(); // https://user:pass@example.com:8080/foo/bar?a=1&b=2#section1
The same works for WHATWG URL:
$urlBuilder = new Uri\WhatWg\UrlBuilder();
$urlBuilder
->setScheme("https")
->setUserInfo("user:pass")
->setHost("example.com")
->setPort(8080)
->setPath("/foo/bar")
->setQuery("a=1&b=2")
->setQueryParams(["a" => 1, "b" => 2]) // Has the same effect as the setQuery() call above
->setFragment("section1");
$url = $urlBuilder->build(); // Validation and instance creation are only done at this point
echo $url->toAsciiString(); // https://user:pass@example.com:8080/foo/bar?a=1&b=2#section1
The complete class signatures to be added are the following:
namespace Uri\Rfc3986 {
final class UriBuilder
{
public function __construct() {}
public function setScheme(?string $scheme): static {}
public function setUsername(?string $username): static {}
public function setPassword(?string $password): static {}
public function setUserInfo(?string $userInfo): static {}
public function setHost(?string $host): static {}
public function setPath(string $path): static {}
public function setQuery(?string $query): static {}
public function setQueryParams(mixed $queryParams): static {}
public function setFragment(?string $fragment): static {}
public function build(?\Uri\Rfc3986\Uri $baseUrl = null): \Uri\Rfc3986\Uri {}
}
}
namespace Uri\WhatWg {
final class UrlBuilder
{
public function __construct() {}
public function setScheme(?string $scheme): static {}
public function setUsername(?string $username): static {}
public function setPassword(?string $password): static {}
public function setUserInfo(?string $userInfo): static {}
public function setHost(?string $host): static {}
public function setPath(string $path): static {}
public function setQuery(?string $query): static {}
public function setQueryParams(mixed $queryParams): static {}
public function setFragment(?string $fragment): static {}
/** @param array $errors */
public function build(?\Uri\WhatWg\Url $baseUrl = null, &$errors = null): \Uri\WhatWg\Url {}
}
}
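The deferred validation described above can be illustrated with a minimal userland sketch. It is not the proposed implementation: the ''SketchUriBuilder'' class, its two validation rules and the fact that ''build()'' returns a plain string are all assumptions made for illustration. The setters only record values, and the cross-component checks run once against the final state, so the order of the calls no longer matters.

```php
<?php
// Hypothetical userland stand-in; the class name and internals are assumptions,
// not the proposed ext/uri implementation.
final class SketchUriBuilder
{
    private ?string $scheme = null;
    private ?string $userInfo = null;
    private ?string $host = null;
    private ?int $port = null;
    private string $path = "";
    private ?string $query = null;
    private ?string $fragment = null;

    // Setters only record values and return the same instance: no validation, no new objects.
    public function setScheme(?string $scheme): static { $this->scheme = $scheme; return $this; }
    public function setUserInfo(?string $userInfo): static { $this->userInfo = $userInfo; return $this; }
    public function setHost(?string $host): static { $this->host = $host; return $this; }
    public function setPort(?int $port): static { $this->port = $port; return $this; }
    public function setPath(string $path): static { $this->path = $path; return $this; }
    public function setQuery(?string $query): static { $this->query = $query; return $this; }
    public function setFragment(?string $fragment): static { $this->fragment = $fragment; return $this; }

    // Cross-component checks run once, against the final state, so call order is irrelevant.
    public function build(): string
    {
        if ($this->host === null && ($this->userInfo !== null || $this->port !== null)) {
            throw new InvalidArgumentException("Cannot set a userinfo or port without having a host");
        }
        if ($this->host !== null && $this->path !== "" && $this->path[0] !== "/") {
            throw new InvalidArgumentException("The path must be empty or start with '/' when a host is present");
        }

        // Recomposition in the order given by RFC 3986, section 5.3.
        $uri = $this->scheme !== null ? $this->scheme . ":" : "";
        if ($this->host !== null) {
            $uri .= "//";
            $uri .= $this->userInfo !== null ? $this->userInfo . "@" : "";
            $uri .= $this->host;
            $uri .= $this->port !== null ? ":" . $this->port : "";
        }
        $uri .= $this->path;
        $uri .= $this->query !== null ? "?" . $this->query : "";
        $uri .= $this->fragment !== null ? "#" . $this->fragment : "";

        return $uri;
    }
}

// The userinfo is set before the host; the final state is what gets validated.
$uri = (new SketchUriBuilder())
    ->setScheme("https")
    ->setUserInfo("user:pass")
    ->setHost("example.com")
    ->setPort(8080)
    ->setPath("/foo/bar")
    ->setQuery("a=1&b=2")
    ->setFragment("section1")
    ->build();

echo $uri, PHP_EOL; // https://user:pass@example.com:8080/foo/bar?a=1&b=2#section1
```

Only one object is created besides the builder itself, regardless of how many components are set.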
=== Design considerations ===
== Builder pattern vs static factory method ==
Why is a complex Builder pattern based approach proposed instead of a much simpler [[https://refactoring.guru/design-patterns/factory-method|Factory Method]] based one? The factory method could be as simple as the following:
namespace Uri\Rfc3986 {
final readonly class Uri
{
...
public static function fromComponents(
?string $scheme = null, ?string $host = null, string $path = "",
?string $userInfo = null, ?string $queryString = null, ?string $fragment = null
) {}
...
}
}
namespace Uri\WhatWg {
final readonly class Url
{
...
public static function fromComponents(
string $scheme, ?string $host = "", string $path = "",
?string $username = null, ?string $password = null,
?string $queryString = null, ?string $fragment = null
) {}
...
}
}
The current RFC proposes the Builder pattern based approach because of its flexibility: it makes it possible to add more convenience methods in the future. Actually, the ''setQueryParams()'' method that expects an array of query params instead of the query string representation is already one.
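To illustrate why an array-based convenience method is useful, the following sketch uses the existing ''http_build_query()'' function to turn an array into the query string used in the examples above. Whether ''setQueryParams()'' serializes its input in the same way is an assumption; the RFC text does not specify it here.

```php
<?php
// Existing PHP function; it is used here only as a stand-in for whatever
// serialization setQueryParams() ends up performing (an assumption).
$params = ["a" => 1, "b" => 2];
$query = http_build_query($params, "", "&", PHP_QUERY_RFC3986);

echo $query, PHP_EOL; // a=1&b=2

// With PHP_QUERY_RFC3986, spaces are encoded as %20 rather than "+".
$encoded = http_build_query(["q" => "a b"], "", "&", PHP_QUERY_RFC3986);

echo $encoded, PHP_EOL; // q=a%20b
```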
==== Query Parameter Manipulation ====
namespace Uri\Rfc3986 {
final readonly class UriQueryParams
{
public static function parse(string $queryString): ?\Uri\Rfc3986\UriQueryParams {}
public static function fromArray(array $queryParams): \Uri\Rfc3986\UriQueryParams {}
public function append(string $name, mixed $value): void {}
public function delete(string $name): void {}
public function has(string $name): bool {}
public function getFirst(string $name): mixed {}
public function getLast(string $name): mixed {}
public function getAll(): mixed {}
public function set(string $name, mixed $value): mixed {}
public function sort(): mixed {}
public function toString(): string {}
public function __serialize(): array {}
public function __unserialize(array $data): void {}
public function __debugInfo(): array {}
}
}
namespace Uri\WhatWg {
final readonly class UrlQueryParams
{
public static function parse(string $queryString): ?\Uri\WhatWg\UrlQueryParams {}
public static function fromArray(array $queryParams): ?\Uri\WhatWg\UrlQueryParams {}
public function append(string $name, mixed $value): void {}
public function delete(string $name): void {}
public function deleteWithValue(string $name, string $value): void {}
public function has(string $name): bool {}
public function hasWithValue(string $name, string $value): bool {}
public function getFirst(string $name): mixed {}
public function getLast(string $name): mixed {}
public function getAll(): mixed {}
public function set(string $name, mixed $value): mixed {}
public function sort(): mixed {}
public function toString(): string {}
public function __serialize(): array {}
public function __unserialize(array $data): void {}
public function __debugInfo(): array {}
}
}
$uri = new Uri\Rfc3986\Uri('https://example.com/?foo=bar&x=1');
$params = $uri->getQueryParams();
$params->append("y", "2");
$uri = $uri->withQueryParams($params);
echo $uri->getQuery(); // foo=bar&x=1&y=2
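The reason a dedicated query parameter type is needed can be shown with a small userland sketch. The existing ''parse_str()'' function collapses repeated names (''a=1&a=2'' yields a single ''a'' entry), whereas an ordered list of name/value pairs keeps them. The function names below are hypothetical, and the use of ''rawurlencode()'' and ''rawurldecode()'' is a simplification rather than the encoding the proposed classes would apply.

```php
<?php
/**
 * Hypothetical helper: parses a query string into an ordered list of
 * [name, value] pairs, keeping repeated names.
 *
 * @return list<array{0: string, 1: string}>
 */
function sketchParseQuery(string $query): array
{
    $pairs = [];
    foreach (explode("&", $query) as $part) {
        if ($part === "") {
            continue; // Skip empty parts such as the one produced by "a=1&&b=2".
        }
        // A part without "=" is treated as a name with an empty value.
        [$name, $value] = array_pad(explode("=", $part, 2), 2, "");
        $pairs[] = [rawurldecode($name), rawurldecode($value)];
    }

    return $pairs;
}

/** Hypothetical helper: serializes the pairs back in their original order. */
function sketchSerializeQuery(array $pairs): string
{
    return implode("&", array_map(
        static fn (array $pair): string => rawurlencode($pair[0]) . "=" . rawurlencode($pair[1]),
        $pairs
    ));
}

$pairs = sketchParseQuery("foo=bar&x=1");
$pairs[] = ["y", "2"]; // Equivalent of appending a parameter.

echo sketchSerializeQuery($pairs), PHP_EOL; // foo=bar&x=1&y=2
```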
==== Accessing Path Segments as an Array ====
namespace Uri\Rfc3986 {
final readonly class Uri
{
...
public function getRawPathSegments(): ?array {}
public function getPathSegments(): ?array {}
#[\NoDiscard(message: "as Uri\Rfc3986\Uri::withPathSegments() does not modify the object itself")]
public function withPathSegments(array $segments): static {}
...
}
}
namespace Uri\WhatWg {
final readonly class Url
{
...
public function getPathSegments(): array {}
#[\NoDiscard(message: "as Uri\WhatWg\Url::withPathSegments() does not modify the object itself")]
public function withPathSegments(array $segments): static {}
...
}
}
This way, it is possible to write the following code:
$uri = new Uri\Rfc3986\Uri("https://example.com/foo/bar/baz");
$segments = $uri->getPathSegments(); // ["foo", "bar", "baz"]
$uri = $uri->withPathSegments(["a", "b"]);
echo $uri->getPath(); // /a/b
The same for WHATWG URL:
$url = new Uri\WhatWg\Url("https://example.com/foo/bar/baz");
$segments = $url->getPathSegments(); // ["foo", "bar", "baz"]
$url = $url->withPathSegments(["a", "b"]);
echo $url->getPath(); // /a/b
The getter methods return ''null'' if the path is empty ("https://example.com"), an empty array when the path consists of a single slash ("https://example.com/"), and a non-empty array otherwise.
''Uri\Rfc3986\Uri::withPathSegments()'' and ''Uri\WhatWg\Url::withPathSegments()'' internally concatenate the input segments separated by a ''/'' character, and then trigger ''Uri\Rfc3986\Uri::withPath()'' and ''Uri\WhatWg\Url::withPath()'', respectively.
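The return values described above can be mirrored with a short userland sketch operating on plain path strings. The function names are hypothetical, and the leading ''/'' added when joining segments is an assumption derived from the ''/a/b'' result shown in the examples.

```php
<?php
/**
 * Hypothetical helper: splits a path into segments, following the rules
 * described above (null for an empty path, [] for "/").
 */
function sketchPathSegments(string $path): ?array
{
    if ($path === "") {
        return null;
    }
    if ($path === "/") {
        return [];
    }
    // Drop only the single leading slash, then split on the remaining ones.
    $withoutLeadingSlash = str_starts_with($path, "/") ? substr($path, 1) : $path;

    return explode("/", $withoutLeadingSlash);
}

/** Hypothetical helper: joins the segments with "/" into an absolute path. */
function sketchPathFromSegments(array $segments): string
{
    return "/" . implode("/", $segments);
}

var_dump(sketchPathSegments(""));             // NULL
var_dump(sketchPathSegments("/"));            // empty array
var_dump(sketchPathSegments("/foo/bar/baz")); // ["foo", "bar", "baz"]
echo sketchPathFromSegments(["a", "b"]), PHP_EOL; // /a/b
```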
==== Host Type Detection ====
namespace Uri\Rfc3986 {
enum UriHostType
{
case IPv4;
case IPv6;
case IPvFuture;
case RegisteredName;
}
final readonly class Uri
{
...
public function getHostType(): ?\Uri\Rfc3986\UriHostType {}
...
}
}
namespace Uri\WhatWg {
enum UrlHostType
{
case IPv4;
case IPv6;
case Domain;
case Opaque;
case Empty;
}
final readonly class Url
{
...
public function getHostType(): ?\Uri\WhatWg\UrlHostType {}
...
}
}
The new ''getHostType()'' methods return the type of the host component for both specifications:
$uri = new Uri\Rfc3986\Uri("https://192.168.0.1/");
var_dump($uri->getHostType()); // Uri\Rfc3986\UriHostType::IPv4
$uri = new Uri\Rfc3986\Uri("https://[2001:db8::1]/");
var_dump($uri->getHostType()); // Uri\Rfc3986\UriHostType::IPv6
$uri = new Uri\Rfc3986\Uri("https://[v1.1.2.3]/");
var_dump($uri->getHostType()); // Uri\Rfc3986\UriHostType::IPvFuture
$uri = new Uri\Rfc3986\Uri("https://example.com/");
var_dump($uri->getHostType()); // Uri\Rfc3986\UriHostType::RegisteredName
The same for WHATWG URL:
$url = new Uri\WhatWg\Url("https://192.168.0.1/");
var_dump($url->getHostType()); // Uri\WhatWg\UrlHostType::IPv4
$url = new Uri\WhatWg\Url("https://[2001:db8::1]/");
var_dump($url->getHostType()); // Uri\WhatWg\UrlHostType::IPv6
$url = new Uri\WhatWg\Url("https://example.com/");
var_dump($url->getHostType()); // Uri\WhatWg\UrlHostType::Domain
$url = new Uri\WhatWg\Url("scheme://example.com/");
var_dump($url->getHostType()); // Uri\WhatWg\UrlHostType::Opaque
$url = new Uri\WhatWg\Url("file:///foo/bar");
var_dump($url->getHostType()); // Uri\WhatWg\UrlHostType::Empty
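Without ''getHostType()'', user code has to classify the host string itself. The sketch below shows one way to do this for the RFC 3986 host types using the existing ''filter_var()'' function. The ''SketchHostType'' enum and the ''sketchHostType()'' function are hypothetical names, and the IPvFuture check only tests the ''v<hex>.'' prefix instead of validating the whole literal.

```php
<?php
// Hypothetical userland enum mirroring the cases of the proposed UriHostType.
enum SketchHostType
{
    case IPv4;
    case IPv6;
    case IPvFuture;
    case RegisteredName;
}

/** Hypothetical helper: classifies an RFC 3986 host string. */
function sketchHostType(string $host): SketchHostType
{
    // IP literals (IPv6 and IPvFuture) are enclosed in square brackets.
    if (str_starts_with($host, "[") && str_ends_with($host, "]")) {
        $literal = substr($host, 1, -1);

        // Simplified IPvFuture check: "v", at least one hex digit, then ".".
        if (preg_match('/^v[0-9a-f]+\./i', $literal) === 1) {
            return SketchHostType::IPvFuture;
        }
        if (filter_var($literal, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false) {
            return SketchHostType::IPv6;
        }

        throw new InvalidArgumentException("Invalid IP literal: " . $host);
    }

    if (filter_var($host, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false) {
        return SketchHostType::IPv4;
    }

    // Anything else is treated as a registered name in this sketch.
    return SketchHostType::RegisteredName;
}

var_dump(sketchHostType("192.168.0.1"));   // enum(SketchHostType::IPv4)
var_dump(sketchHostType("[2001:db8::1]")); // enum(SketchHostType::IPv6)
var_dump(sketchHostType("[v1.1.2.3]"));    // enum(SketchHostType::IPvFuture)
var_dump(sketchHostType("example.com"));   // enum(SketchHostType::RegisteredName)
```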
==== URI Type Detection ====
namespace Uri\Rfc3986 {
enum UriType
{
case AbsolutePathReference;
case RelativePathReference;
case NetworkPathReference;
case Uri;
}
final readonly class Uri
{
...
public function getUriType(): \Uri\Rfc3986\UriType {}
...
}
}
This way, it becomes easier to detect the URI type:
$uri = new Uri\Rfc3986\Uri("https://example.com");
var_dump($uri->getUriType()); // Uri\Rfc3986\UriType::Uri
$uri = new Uri\Rfc3986\Uri("/foo");
var_dump($uri->getUriType()); // Uri\Rfc3986\UriType::AbsolutePathReference
$uri = new Uri\Rfc3986\Uri("foo");
var_dump($uri->getUriType()); // Uri\Rfc3986\UriType::RelativePathReference
$uri = new Uri\Rfc3986\Uri("//host.com/foo");
var_dump($uri->getUriType()); // Uri\Rfc3986\UriType::NetworkPathReference
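The four cases can be told apart by looking only at the beginning of the reference, following the grammar in section 4.2 of RFC 3986. The following userland sketch shows the order of the checks; ''SketchUriType'' and ''sketchUriType()'' are hypothetical names, and the function assumes that its input is already a syntactically valid URI reference.

```php
<?php
// Hypothetical userland enum mirroring the cases of the proposed UriType.
enum SketchUriType
{
    case Uri;
    case NetworkPathReference;
    case AbsolutePathReference;
    case RelativePathReference;
}

/** Hypothetical helper: classifies an already valid URI reference. */
function sketchUriType(string $reference): SketchUriType
{
    // A scheme is a letter followed by letters, digits, "+", "-" or ".", then ":".
    if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $reference) === 1) {
        return SketchUriType::Uri;
    }
    // The "//" check must come before the "/" check.
    if (str_starts_with($reference, "//")) {
        return SketchUriType::NetworkPathReference;
    }
    if (str_starts_with($reference, "/")) {
        return SketchUriType::AbsolutePathReference;
    }

    return SketchUriType::RelativePathReference;
}

var_dump(sketchUriType("https://example.com")); // enum(SketchUriType::Uri)
var_dump(sketchUriType("/foo"));                // enum(SketchUriType::AbsolutePathReference)
var_dump(sketchUriType("foo"));                 // enum(SketchUriType::RelativePathReference)
var_dump(sketchUriType("//host.com/foo"));      // enum(SketchUriType::NetworkPathReference)
```

Note that ''this:that'' is classified as a URI by this check, which is why a relative-path reference with a colon in its first segment has to be written as ''./this:that''.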
=== Determining if the WHATWG URL is special ===
The WHATWG URL specification defines some special schemes (''http'', ''https'', ''ftp'', ''file'', ''ws'', ''wss''), which have distinct parsing and serialization rules. In order to make checks for special URLs easier to perform, a new ''Uri\WhatWg\Url::isSpecial()'' method is added:
namespace Uri\WhatWg {
final readonly class Url
{
...
public function isSpecial(): bool {}
...
}
}
This enables low-level control for applications that need to mirror WHATWG behaviors in parsing or normalization.
$url = new Uri\WhatWg\Url("https://example.com");
var_dump($url->isSpecial()); // true
$url = new Uri\WhatWg\Url("custom:example");
var_dump($url->isSpecial()); // false
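For comparison, the check that user code has to perform today boils down to a lookup in the list of special schemes given above. The constant and function names in this sketch are hypothetical.

```php
<?php
// The special schemes listed above; the constant name is hypothetical.
const SKETCH_SPECIAL_SCHEMES = ["http", "https", "ftp", "file", "ws", "wss"];

/** Hypothetical helper: tells whether a scheme (without the ":") is special. */
function sketchIsSpecialScheme(string $scheme): bool
{
    // Schemes are compared case-insensitively.
    return in_array(strtolower($scheme), SKETCH_SPECIAL_SCHEMES, true);
}

var_dump(sketchIsSpecialScheme("https"));  // bool(true)
var_dump(sketchIsSpecialScheme("custom")); // bool(false)
```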
==== Percent-Encoding/Decoding Support ====
It should also be mentioned that ''urlencode()'' and ''urldecode()'' are intended for the ''application/x-www-form-urlencoded'' media type, while ''rawurlencode()'' and ''rawurldecode()'' implement RFC 3986 more closely. For example, the path component assigns special meaning to the ''/'' character. Therefore, this character doesn't necessarily have to be percent-encoded in the path component. There are some cases, though, when it makes sense to percent-encode it, as highlighted by the [[https://wiki.php.net/rfc/url_parsing_api#advanced_examples|first example]] within the "Advanced examples" section of the original URI RFC. Unfortunately, ''rawurlencode()'' doesn't take the component into account, and replaces "/" with "%2F" unconditionally.
echo rawurlencode("/foo/bar/baz"); // %2Ffoo%2Fbar%2Fbaz
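A common workaround today is to split the path on ''/'', encode each segment separately, and join the result again, as in the sketch below. The function name is hypothetical. The approach keeps the separators intact, but it still over-encodes characters that RFC 3986 allows in a path segment (for example '':'' and ''@''), and it cannot express a segment that itself contains a ''/''.

```php
<?php
/** Hypothetical helper: encodes each path segment but keeps the "/" separators. */
function sketchEncodePath(string $path): string
{
    return implode("/", array_map("rawurlencode", explode("/", $path)));
}

echo sketchEncodePath("/foo/bar baz"), PHP_EOL; // /foo/bar%20baz
echo sketchEncodePath("/foo/a:b"), PHP_EOL;     // /foo/a%3Ab
```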
In order to correctly handle percent-encoding and decoding based on the rules of RFC 3986 and WHATWG URL, the following methods and enums are proposed to be added:
namespace Uri\Rfc3986 {
enum UriPercentEncodingMode
{
case UserInfo;
case Host;
case RelativeReferencePath;
case RelativeReferenceFirstPathSegment;
case Path;
case PathSegment;
case Query;
case FormQuery;
case Fragment;
case AllReservedCharacters;
case All;
}
final readonly class Uri
{
...
public static function percentEncode(string $input, \Uri\Rfc3986\UriPercentEncodingMode $mode): string {}
public static function percentDecode(string $input, \Uri\Rfc3986\UriPercentEncodingMode $mode): string {}
...
}
}
namespace Uri\WhatWg {
enum UrlPercentEncodingMode
{
case UserInfo;
case Host;
case OpaqueHost;
case Path;
case PathSegment;
case OpaquePath;
case OpaquePathSegment;
case Query;
case SpecialQuery;
case FormQuery;
case Fragment;
}
final readonly class Url
{
...
public static function percentEncode(string $input, \Uri\WhatWg\UrlPercentEncodingMode $mode): string {}
public static function percentDecode(string $input, \Uri\WhatWg\UrlPercentEncodingMode $mode): string {}
...
}
}
The ''percentEncode()'' and ''percentDecode()'' methods both require an input string and a ''PercentEncodingMode'' enum to be passed. The enums make the context of the encoding/decoding processes fully explicit and clear. The following modes are supported:
* **Uri\Rfc3986\UriPercentEncodingMode**
* **UserInfo:** Besides [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]], [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], as well as [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|sub-delimiters]], it also allows the following characters to be present: "**:**". Any other characters are percent-encoded.
* **Host:** If the input string is a valid IPv4, an IPv6 or an IPvFuture address, no percent-encoding is performed, since these host types do not support the process. Otherwise (for registered names), [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]], [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], as well as [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|sub-delimiters]] are allowed to be present. Any other characters are percent-encoded.
* **AbsolutePathReferenceFirstSegment:** The first segment of absolute-path references cannot start with "**%%//%%**" characters (e.g. ''%%//foo%%''), otherwise the path [[https://datatracker.ietf.org/doc/html/rfc3986#section-4.2|would be confusable]] with a network-path reference. Therefore, besides [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]], [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], as well as [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|sub-delimiters]], it also allows the following characters to be present: "**:**", "**@**". Any other characters are percent-encoded.
* **RelativePathReferenceFirstSegment:** The first segment of relative-path references cannot contain a "**:**" character (e.g. ''this:that''), otherwise the path [[https://datatracker.ietf.org/doc/html/rfc3986#section-4.2|would be confusable]] with a scheme name. Therefore, besides [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]], [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], as well as [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|sub-delimiters]], it also allows the following characters to be present: "**@**". Any other characters are percent-encoded.
* **RelativeReferencePath:**
* **Path:** Besides [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]], [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], as well as [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|sub-delimiters]], it also allows the following characters to be present: "**/**", "**:**", "**@**". Any other characters are percent-encoded.
* **PathSegment:** Besides [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]], [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], as well as [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|sub-delimiters]], it also allows the following characters to be present: "**:**", "**@**". Any other characters are percent-encoded.
* **Query:** Besides [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]], [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], as well as [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|sub-delimiters]], it also allows the following characters to be present: "**:**", "**@**", "**/**", and "**?**". Any other characters are percent-encoded.
* **FormQuery:** It is mostly the same as ''Uri\Rfc3986\UriPercentEncodingMode::Query'', but it behaves according to the ''application/x-www-form-urlencoded'' media type rather than RFC 3986. The only difference between the two is that " " is encoded as "**+**".
* **Fragment:** Besides [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]], [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], as well as [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|sub-delimiters]], it also allows the following characters to be present: "**:**", "**@**", "**/**", and "**?**". Any other characters are percent-encoded.
* **AllReservedCharacters:** All [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|reserved characters]] are percent-encoded. The rest of the characters are left as-is.
* **AllButUnreservedCharacters:** Besides [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]] and [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], all other characters are percent-encoded.
For the complete ABNF syntax of each component, consult [[https://datatracker.ietf.org/doc/html/rfc3986#appendix-A|Appendix A]] of RFC 3986.
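The rules listed above for the ''Path'', ''PathSegment'' and ''Query'' modes can be approximated in userland as follows. This is only a sketch under stated assumptions: ''SketchEncodingMode'' and ''sketchPercentEncode()'' are hypothetical names, only three modes are covered, and the input is processed byte by byte. Existing percent-encoded octets are left untouched, while a stray ''%'' is encoded as ''%25''.

```php
<?php
// Hypothetical enum; each value lists the characters a mode allows on top of
// unreserved characters and sub-delimiters.
enum SketchEncodingMode: string
{
    case PathSegment = ":@";
    case Path = "/:@";
    case Query = "/?:@";
}

/** Hypothetical helper approximating the encoding rules listed above. */
function sketchPercentEncode(string $input, SketchEncodingMode $mode): string
{
    // "-._~" completes the unreserved set, "!$&'()*+,;=" are the sub-delimiters.
    $allowed = preg_quote('-._~!$&\'()*+,;=' . $mode->value, "/");

    // Match either an existing percent-encoded octet or one disallowed byte.
    $pattern = "/%[0-9A-Fa-f]{2}|[^A-Za-z0-9{$allowed}]/";

    return preg_replace_callback(
        $pattern,
        // Three-byte matches are existing percent-encoded octets: keep them as-is.
        static fn (array $m): string => strlen($m[0]) === 3 ? $m[0] : sprintf("%%%02X", ord($m[0])),
        $input
    );
}

echo sketchPercentEncode("bar/baz", SketchEncodingMode::PathSegment), PHP_EOL; // bar%2Fbaz
echo sketchPercentEncode("bar/baz", SketchEncodingMode::Path), PHP_EOL;        // bar/baz
echo sketchPercentEncode("a b?c", SketchEncodingMode::Query), PHP_EOL;         // a%20b?c
```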
* **Uri\WhatWg\UrlPercentEncodingMode**
* **UserInfo:** Besides the code points percent-encoded by ''Uri\WhatWg\UrlPercentEncodingMode::Path'', the following code points are percent-encoded: U+002F (**/**), U+003A (**:**), U+003B (**;**), U+003D (**=**), U+0040 (**@**), U+005B (**[**) to U+005D (**]**), inclusive, and U+007C (**|**).
* **OpaqueHost:** [[https://infra.spec.whatwg.org/#c0-control|Control characters]], and all [[https://url.spec.whatwg.org/#c0-control-percent-encode-set|code points greater than ~]] are percent-encoded.
* **Path:** Besides the code points percent-encoded by ''Uri\WhatWg\UrlPercentEncodingMode::Query'', the following code points are percent-encoded: U+003F (**?**), U+005E (**^**), U+0060 (**`**), U+007B (**{**), and U+007D (**}**).
* **PathSegment:** Besides the code points percent-encoded by ''Uri\WhatWg\UrlPercentEncodingMode::Query'', the following code points are percent-encoded: U+003F (**?**), U+005E (**^**), U+0060 (**`**), U+007B (**{**), U+007D (**}**), and U+002F (**/**).
* **OpaquePathSegment:**
* **Query:** Besides [[https://infra.spec.whatwg.org/#c0-control|Control characters]], and all [[https://url.spec.whatwg.org/#c0-control-percent-encode-set|code points greater than ~]], the following code points are percent-encoded: U+0020 SPACE, U+0022 (**"**), U+0023 (**#**), U+003C (**<**), and U+003E (**>**).
* **SpecialQuery:** Besides the code points percent-encoded by ''Uri\WhatWg\UrlPercentEncodingMode::Query'', the following code points are percent-encoded: U+0027 (**'**).
* **FormQuery:** Besides the code points percent-encoded by ''Uri\WhatWg\UrlPercentEncodingMode::UserInfo'', the following code points are percent-encoded: U+0024 (**$**) to U+0026 (**&**), inclusive, U+002B (**+**), U+002C (**,**), U+0021 (**!**), U+0027 (**'**) to U+0029 RIGHT PARENTHESIS, inclusive, and U+007E (**~**).
* **Fragment:** Besides [[https://infra.spec.whatwg.org/#c0-control|Control characters]], and all [[https://url.spec.whatwg.org/#c0-control-percent-encode-set|code points greater than ~]], the following code points are percent-encoded: U+0020 SPACE, U+0022 (**"**), U+003C (**<**), U+003E (**>**), and U+0060 (**`**).
Since neither RFC 3986 nor WHATWG URL supports percent-encoded characters inside the scheme component, none of the enums contain a ''Scheme'' case. WHATWG URL automatically percent-decodes the host when [[https://wiki.php.net/rfc/uri_followup#determining_if_the_whatwg_url_is_special|it's special]], so ''Uri\WhatWg\UrlPercentEncodingMode'' doesn't contain a ''Host'' case.
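The difference between the ''Query'' and ''SpecialQuery'' modes described in the list above can be sketched in userland like this. The function name and the boolean parameter are hypothetical; the sketch works on the bytes of the input, which is assumed to be UTF-8 encoded.

```php
<?php
/**
 * Hypothetical helper: percent-encodes a query according to the query
 * encode set described above, optionally extended for special URLs.
 */
function sketchWhatWgQueryEncode(string $input, bool $special): string
{
    // Space, ", #, < and > are always encoded; ' only for special URLs.
    $encodeSet = " \"#<>" . ($special ? "'" : "");

    $output = "";
    for ($i = 0, $length = strlen($input); $i < $length; $i++) {
        $byte = $input[$i];
        $code = ord($byte);

        // C0 controls (up to 0x1F) and everything above "~" (0x7E) are encoded too.
        $mustEncode = $code <= 0x1F || $code > 0x7E || str_contains($encodeSet, $byte);
        $output .= $mustEncode ? sprintf("%%%02X", $code) : $byte;
    }

    return $output;
}

echo sketchWhatWgQueryEncode("it's a <b>", false), PHP_EOL; // it's%20a%20%3Cb%3E
echo sketchWhatWgQueryEncode("it's a <b>", true), PHP_EOL;  // it%27s%20a%20%3Cb%3E
```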
Even path segments can be percent-encoded/decoded in a specification-compliant way:
$encodedComponent = Uri\Rfc3986\Uri::percentEncode(
"bar/baz",
Uri\Rfc3986\UriPercentEncodingMode::PathSegment
); // bar%2Fbaz
$uri = new Uri\Rfc3986\Uri("https://example.com");
$uri = $uri->withPathSegments(["foo", $encodedComponent]);
echo $uri->toRawString(); // https://example.com/foo/bar%2Fbaz