====== PHP RFC: Followup Improvements for ext/uri ======
* Version: 0.1
* Date: 2025-10-17
* Author: Máté Kocsis, kocsismate@php.net
* Status: Under Discussion
* Target version: next minor version (PHP 8.6)
* Implementation: https://github.com/kocsismate/php-src/pull/9
===== Introduction =====
This RFC proposes various follow-up improvements to the [[rfc:url_parsing_api|URL Parsing API RFC]], extending the ''Uri\Rfc3986\Uri'' and ''Uri\WhatWg\Url'' classes with additional capabilities that were requested during the discussion phase of the original RFC. These capabilities were deemed not to be essential from the get-go, therefore they were postponed in order not to increase scope even further.
===== Proposal =====
The following new functionality is introduced in this proposal:
- [[#uri_building|URI Building]]
- [[#query_parameter_manipulation|Query Parameter Manipulation]]
- [[#accessing_path_segments_as_an_array|Accessing Path Segments as an Array]]
- [[#host_type_detection|Host Type Detection]]
- [[#uri_type_detection|URI Type Detection]]
- [[#percent-encoding_and_decoding_support|Percent-Encoding and Decoding Support]]
Each feature proposed is voted separately and requires a 2/3 majority.
==== URI Building ====
Currently, only **already existing (and validated)** URIs can be manipulated via [[https://wiki.php.net/rfc/url_parsing_api#component_modification|wither methods]]. These calls always create a new instance so that immutability of URIs is preserved. Even though this behavior has plenty of advantages, there's at least one disadvantage: instance creation has a performance overhead. This is especially problematic if many URI components have to be modified at the same time, because many objects are "wasted" through intermediate instantiations.
$uri1 = Uri\Rfc3986\Uri::parse("http://example.com");
$uri2 = $uri1
->withScheme("https")
->withHost("example.net")
->withPath("/foo/bar"); // This creates 3 objects altogether!
Besides its suboptimal performance, another drawback of the current wither-based solution is that creating a URI from scratch is not possible: one always has to create a valid URI first. The empty string is a valid RFC 3986 URI, so it may seem a good candidate for an initial URI for URI building, but unfortunately, it's not a valid WHATWG URL. Moreover, the success of some transformations depends on the current state (which is a form of temporal coupling):
$uri1 = Uri\Rfc3986\Uri::parse("");
$uri2 = $uri1
->withScheme("https")
->withUserInfo("user:pass") // throws Uri\InvalidUriException: Cannot set a userinfo without having a host
->withHost("example.com");
$uri2 = $uri1
->withScheme("https")
->withHost("example.com")
->withUserInfo("user:pass"); // No exception is thrown
In order to provide a more ergonomic and efficient solution for URI building, a fluent API is proposed that implements the [[https://refactoring.guru/design-patterns/builder|Builder pattern]].
$uriBuilder = new Uri\Rfc3986\UriBuilder()
->setScheme("https")
->setUserInfo("user:pass")
->setHost("example.com")
->setPort(8080)
->setPath("/foo/bar")
->setQuery("a=1&b=2")
->setQueryParams(Uri\Rfc3986\UriQueryParams::fromArray(["a" => 1, "b" => 2])) // Has the same effect as the setQuery() call above
->setFragment("section1");
$uri = $uriBuilder->build(); // URI instance creation is only done at this point
echo $uri->toRawString(); // https://user:pass@example.com:8080/foo/bar?a=1&b=2#section1
The same works for WHATWG URL:
$urlBuilder = new Uri\WhatWg\UrlBuilder()
->setScheme("https")
->setUsername("user")
->setPassword("pass")
->setHost("example.com")
->setPort(8080)
->setPath("/foo/bar")
->setQuery("a=1&b=2")
->setQueryParams(Uri\WhatWg\UrlQueryParams::fromArray(["a" => 1, "b" => 2])) // Has the same effect as the setQuery() call above
->setFragment("section1");
$url = $urlBuilder->build(); // URL instance creation is only done at this point
echo $url->toAsciiString(); // https://user:pass@example.com:8080/foo/bar?a=1&b=2#section1
When a Builder instance is not instantiated by ourselves or a trusted party, one cannot be sure whether it already has any components set. Therefore, it's highly recommended to clear the instance state before usage:
function buildUri(Uri\Rfc3986\UriBuilder $builder): void
{
// Was there any component set before?
$builder->clear();
// Further usage is safe now...
}
function buildUrl(Uri\WhatWg\UrlBuilder $builder): void
{
// Was there any component set before?
$builder->clear();
// Further usage is safe now...
}
The ''clear()'' method also comes in handy when the same Builder instance is reused to instantiate multiple URIs/URLs in a row.
The complete class signatures to be added are the following:
namespace Uri\Rfc3986 {
final class UriBuilder
{
public function __construct() {}
public function clear(): static {}
/**
* @throws Uri\InvalidUriException
*/
public function setScheme(?string $scheme): static {}
/**
* @throws Uri\InvalidUriException
*/
public function setUserInfo(#[\SensitiveParameter] ?string $userInfo): static {}
/**
* @throws Uri\InvalidUriException
*/
public function setHost(?string $host): static {}
/**
* @throws Uri\InvalidUriException
*/
public function setPath(string $path): static {}
/**
* @throws Uri\InvalidUriException
*/
public function setPathSegments(array $segments): static {}
/**
* @throws Uri\InvalidUriException
*/
public function setQuery(?string $query): static {}
/**
* @throws Uri\InvalidUriException
*/
public function setQueryParams(\Uri\Rfc3986\UriQueryParams $queryParams): static {}
/**
* @throws Uri\InvalidUriException
*/
public function setFragment(?string $fragment): static {}
/**
* @throws Uri\InvalidUriException
*/
public function build(?\Uri\Rfc3986\Uri $baseUrl = null): \Uri\Rfc3986\Uri {}
}
}
namespace Uri\WhatWg {
final class UrlBuilder
{
public function __construct() {}
public function clear(): static {}
/**
* @throws Uri\WhatWg\InvalidUrlException
*/
public function setScheme(string $scheme): static {}
/**
* @throws Uri\WhatWg\InvalidUrlException
*/
public function setUsername(?string $username): static {}
/**
* @throws Uri\WhatWg\InvalidUrlException
*/
public function setPassword(#[\SensitiveParameter] ?string $password): static {}
/**
* @throws Uri\WhatWg\InvalidUrlException
*/
public function setHost(?string $host): static {}
/**
* @throws Uri\WhatWg\InvalidUrlException
*/
public function setPath(string $path): static {}
/**
* @throws Uri\WhatWg\InvalidUrlException
*/
public function setPathSegments(array $segments): static {}
/**
* @throws Uri\WhatWg\InvalidUrlException
*/
public function setQuery(?string $query): static {}
/**
* @throws Uri\WhatWg\InvalidUrlException
*/
public function setQueryParams(\Uri\WhatWg\UrlQueryParams $queryParams): static {}
/**
* @throws Uri\WhatWg\InvalidUrlException
*/
public function setFragment(?string $fragment): static {}
/**
* @param array $errors
* @throws Uri\WhatWg\InvalidUrlException
*/
public function build(?\Uri\WhatWg\Url $baseUrl = null, &$errors = null): \Uri\WhatWg\Url {}
}
}
The builder objects would perform validation in two places:
- **Validation of pure component syntax**: The individual setter methods would immediately validate if the input is syntactically correct. For example, the scheme component cannot contain percent-encoded octets, therefore the ''setScheme()'' method would throw whenever a "%" character is encountered.
- **Validation of global state**: There are a few validation rules that depend on the "global state". For example, RFC 3986 requires the host component to be present when the userinfo is set. Any such validations would be delayed until the ''build()'' method call to avoid the problem with temporal coupling that was mentioned at the beginning of the section.
An example for component syntax validation:
$uriBuilder = new Uri\Rfc3986\UriBuilder()
->setScheme("http%80"); // Throws a Uri\InvalidUriException because the scheme is not well formed
An example for validation of the global state:
$uriBuilder = new Uri\Rfc3986\UriBuilder()
->setScheme("https")
->setUserInfo("user:pass"); // Doesn't throw an exception yet
$uri = $uriBuilder->build(); // Throws a Uri\InvalidUriException because the host is not present, but the userinfo is
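The split between immediate and deferred validation can be illustrated with a small userland sketch. It is not the proposed implementation: ''SketchUriBuilder'' is a hypothetical class that handles only three components, throws standard SPL exceptions instead of ''Uri\InvalidUriException'', and returns a plain string rather than a URI object.

```php
<?php
declare(strict_types=1);

// Hypothetical userland sketch; NOT the proposed Uri\Rfc3986\UriBuilder.
final class SketchUriBuilder
{
    private ?string $scheme = null;
    private ?string $userInfo = null;
    private ?string $host = null;

    public function setScheme(?string $scheme): static
    {
        // Pure component syntax is checked immediately
        // (RFC 3986: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )).
        if ($scheme !== null && preg_match('/^[A-Za-z][A-Za-z0-9+.-]*\z/', $scheme) !== 1) {
            throw new InvalidArgumentException("Malformed scheme");
        }
        $this->scheme = $scheme;
        return $this;
    }

    public function setUserInfo(?string $userInfo): static
    {
        // No host check here: that rule depends on global state, so it is deferred to build().
        $this->userInfo = $userInfo;
        return $this;
    }

    public function setHost(?string $host): static
    {
        $this->host = $host;
        return $this;
    }

    public function build(): string
    {
        // Cross-component rule, evaluated only once all setters have run.
        if ($this->userInfo !== null && $this->host === null) {
            throw new LogicException("Cannot set a userinfo without having a host");
        }
        $uri = $this->scheme !== null ? $this->scheme . ":" : "";
        if ($this->host !== null) {
            $uri .= "//" . ($this->userInfo !== null ? $this->userInfo . "@" : "") . $this->host;
        }
        return $uri;
    }
}

// The order of setter calls does not matter, because the userinfo/host rule runs in build().
$builder = (new SketchUriBuilder())
    ->setScheme("https")
    ->setUserInfo("user:pass")
    ->setHost("example.com");
echo $builder->build(), "\n"; // https://user:pass@example.com
```

In this sketch, calling ''setUserInfo()'' before ''setHost()'' succeeds, which is exactly the ordering that fails with the wither-based approach.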
=== Design considerations ===
== Builder design pattern ==
Why is a complex Builder pattern based approach proposed instead of a much simpler [[https://refactoring.guru/design-patterns/factory-method|Factory Method]] based one? The factory method could be as simple as the following:
namespace Uri\Rfc3986 {
final readonly class Uri
{
...
public static function fromComponents(
?string $scheme = null, ?string $host = null, string $path = "",
?string $userInfo = null, ?string $queryString = null, ?string $fragment = null
) {}
...
}
}
namespace Uri\WhatWg {
final readonly class Url
{
...
public static function fromComponents(
string $scheme, ?string $host = "", string $path = "",
?string $username = null, ?string $password = null,
?string $queryString = null, ?string $fragment = null
) {}
...
}
}
The current RFC proposes the Builder pattern based approach because of its flexibility: it makes it possible to add more convenience methods in the future. Actually, the ''setQueryParams()'' method that expects a Query parameter list object instead of the query string representation is already one.
== Dedicated classes ==
This RFC proposes a dedicated Builder class for both RFC 3986 and WHATWG URL, instead of a single, unified implementation with 2 ''build()'' methods (e.g. ''buildUri()'' and ''buildUrl()''). This decision has the following reasons:
* The two specifications don't recognize the same components. RFC 3986 has the userinfo component, while WHATWG URL has a separate ''username'' and ''password'' component instead. Even though it is probably possible to work around these incompatibilities, the position of this RFC is that it's better not to try to maintain compatibility artificially.
* RFC 3986 only requires the ''path'' component to be present (that's why the empty string is a valid RFC 3986 URI), while WHATWG URL mandates the presence of the ''scheme'' component too. This distinction is visible from the proposed signatures: while the ''Uri\Rfc3986\UriBuilder::setScheme()'' method accepts a ''string'' or ''null'', ''Uri\WhatWg\UrlBuilder::setScheme()'' only accepts a ''string'' parameter. The same distinction is already present in the ''Uri\Rfc3986\Uri::withScheme()'' and the ''Uri\WhatWg\Url::withScheme()'' methods.
* Setter methods validate the input based on the rules of the specification they implement. For example, RFC 3986 URIs cannot contain Unicode characters, so all setters fail when such a character is passed to them. On the other hand, WHATWG URL can handle Unicode characters, and setters won't fail when they encounter one. If a single, unified Builder class were proposed, performing validations early during the setter calls wouldn't be possible, only during the ''build*()'' method calls. In the view of this proposal, this would lead to counterintuitive behavior because of the delayed feedback loop.
== Mutability ==
The proposed classes are mutable in order to avoid the performance overhead that cloning before each modification (thus an immutable behavior) would cause. If it turns out that the performance overhead can be optimized away, then this design decision can be reevaluated.
== Setter naming convention ==
Setter methods of the ''UriBuilder'' and ''UrlBuilder'' classes follow the naming convention which is already widespread among internal functions: they use a ''set'' prefix, e.g. ''setScheme()'', ''setHost()''. The current RFC rejects the usage of any other naming convention, most notably the omission of the ''set'' prefix (e.g. ''scheme()'', ''host()'') due to the following reasons:
* The ''set'' prefix adds additional context about the intended behavior: all proposed setters completely overwrite the related component. E.g. ''setQuery()'' and ''setQueryParams()'' neither prepend nor append their input to the existing query string; they both overwrite the whole component. If ''set'' were omitted from the method name, this additional context would be missing entirely, and people would have even less idea of what happens when they use these methods.
* Using the ''set'' prefix for the setters would allow the addition of other convenience methods in the future more naturally: e.g. ''appendQueryParams()'', ''appendPathSegments()'' etc.
=== Voting ===
==== Query Parameter Manipulation ====
namespace Uri\Rfc3986 {
final class UriQueryParams implements \IteratorAggregate, \Countable
{
public static function parseRfc3986(string $queryString): ?\Uri\Rfc3986\UriQueryParams {}
public static function parseFormData(string $queryString): \Uri\Rfc3986\UriQueryParams {}
public static function fromArray(array $queryParams): \Uri\Rfc3986\UriQueryParams {}
public function __construct() {}
public function append(string $name, mixed $value): static {}
public function appendArray(string $name, array $value): static {}
public function delete(string $name): static {}
public function deleteValue(string $name, mixed $value): static {}
public function has(string $name): bool {}
public function hasValue(string $name, mixed $value): bool {}
public function getFirst(string $name): ?string {}
public function getLast(string $name): ?string {}
public function getAll(string $name): array {}
public function getArray(string $name): array {}
public function list(): array {}
public function getIterator(): \Traversable {}
public function count(): int {}
public function set(string $name, mixed $value): static {}
public function setArray(string $name, array $value): static {}
public function sort(): static {}
public function toRfc3986String(): string {}
public function toFormDataString(): string {}
public function __serialize(): array {}
public function __unserialize(array $data): void {}
public function __debugInfo(): array {}
}
final readonly class Uri
{
...
public function getQueryParams(): ?\Uri\Rfc3986\UriQueryParams {}
...
}
}
namespace Uri\WhatWg {
final class UrlQueryParams implements \IteratorAggregate, \Countable
{
public static function parse(string $queryString): \Uri\WhatWg\UrlQueryParams {}
public static function fromArray(array $queryParams): \Uri\WhatWg\UrlQueryParams {}
public function __construct() {}
public function append(string $name, mixed $value): static {}
public function appendArray(string $name, array $value): static {}
public function delete(string $name): static {}
public function deleteValue(string $name, mixed $value): static {}
public function has(string $name): bool {}
public function hasValue(string $name, string $value): bool {}
public function getFirst(string $name): ?string {}
public function getLast(string $name): ?string {}
public function getAll(string $name): array {}
public function getArray(string $name): array {}
public function list(): array {}
public function getIterator(): \Traversable {}
public function count(): int {}
public function set(string $name, mixed $value): static {}
public function setArray(string $name, array $value): static {}
public function sort(): static {}
public function toString(): string {}
public function __serialize(): array {}
public function __unserialize(array $data): void {}
public function __debugInfo(): array {}
}
final readonly class Url
{
...
public function getQueryParams(): ?\Uri\WhatWg\UrlQueryParams {}
...
}
}
=== Construction ===
''UriQueryParams'' supports the following methods for instantiation:
* **''parseFormData()''**: It parses a query string into a list of query parameters according to the processing and percent-decoding rules of the ''application/x-www-form-urlencoded'' media type, as defined by [[https://datatracker.ietf.org/doc/html/rfc1866#section-8.2.1|RFC 1866]]. This specification regards query parameters as a list of name-value pairs, where the two parts are separated by a "=" character, and the individual parameters are separated from each other by a "&" character (e.g. ''name1=value1&name2=value2'').
* **''parseRfc3986()''**: It parses a query string into a list of query parameters according to the percent-decoding rules of RFC 3986, with the caveat that this specification in fact does not specify exactly how query parameters are composed. That's why the implementation defines query parameters based on the definition of [[https://datatracker.ietf.org/doc/html/rfc1866#section-8.2.1|RFC 1866]].
* **''fromArray()''**: It takes an array of query parameters and directly composes the query parameter list object based on it. Besides scalar values, it can also accept complex types such as arrays according to the rules discussed in the [[https://wiki.php.net/rfc/uri_followup#supported_types|"Supported types" section]].
* **''%%__construct()%%''**: It accepts an empty parameter list, and results in an empty query parameter list. This method allows building query parameters by starting from scratch.
''UrlQueryParams'' supports the following methods for instantiation:
* **''parse()''**: It parses a query string into a list of query parameters according to the percent-decoding rules of the ''application/x-www-form-urlencoded'' media type, as defined by the WHATWG URL specification.
* **''fromArray()''**: It takes an array of query parameters and directly composes the query parameter list object based on it. Besides scalar values, it can also accept complex types such as arrays according to the rules discussed in the [[https://wiki.php.net/rfc/uri_followup#supported_types|"Supported types" section]].
* **''%%__construct()%%''**: It accepts an empty parameter list, and results in an empty query parameter list. This method allows building query parameters by starting from scratch.
$params = Uri\Rfc3986\UriQueryParams::parseRfc3986("a=foo&b=bar"); // Successful instantiation
$params = Uri\Rfc3986\UriQueryParams::parseFormData("a=foo&b=bar"); // Successful instantiation
$params = Uri\Rfc3986\UriQueryParams::fromArray(
[
"a" => "foo",
"b" => "bar",
]
); // Successful instantiation - same result as above
$params = new Uri\Rfc3986\UriQueryParams(); // Successful instantiation - creates an empty query parameter list
$params = Uri\WhatWg\UrlQueryParams::parse("a=foo&b=bar"); // Successful instantiation
$params = Uri\WhatWg\UrlQueryParams::fromArray(
[
"a" => "foo",
"b" => "bar",
]
); // Successful instantiation - same result as above
$params = new Uri\WhatWg\UrlQueryParams(); // Successful instantiation - creates an empty query parameter list
It is also possible to create a ''UriQueryParams'' or ''UrlQueryParams'' instance from a ''Uri\Rfc3986\Uri'' or a ''Uri\WhatWg\Url'' object, respectively:
$uri = new Uri\Rfc3986\Uri("https://example.com/?foo=bar");
$params = $uri->getQueryParams(); // First call creates a Uri\Rfc3986\UriQueryParams instance
$params = $uri->getQueryParams(); // Subsequent calls reuse the already existing Uri\Rfc3986\UriQueryParams instance
$uri = $uri->withQuery("foo=baz"); // Modification of the query string invalidates the Uri\Rfc3986\UriQueryParams instance
$url = new Uri\WhatWg\Url("https://example.com/?foo=bar");
$params = $url->getQueryParams(); // First call creates a Uri\WhatWg\UrlQueryParams instance
$params = $url->getQueryParams(); // Subsequent calls reuse the already existing Uri\WhatWg\UrlQueryParams instance
$url = $url->withQuery("foo=baz"); // Modification of the query string invalidates the already existing Uri\WhatWg\UrlQueryParams instance
''Uri\Rfc3986\Uri::getQueryParams()'' uses the normalized query string to instantiate ''UriQueryParams'' when possible. If the URI has not been normalized before, then the non-normalized query string is used. In practice, this doesn't make a big difference, because ''UriQueryParams'' itself also normalizes (percent-decodes) the input — you can read more on this topic later.
''Uri\Rfc3986\Uri::getQueryParams()'' and ''Uri\WhatWg\Url::getQueryParams()'' return ''null'' if the query string is missing (e.g. https://example.com/), and an empty query parameter list is returned if the query string is empty (e.g. https://example.com/?).
$uri = new Uri\Rfc3986\Uri("https://example.com/");
echo $uri->getQueryParams(); // null
$uri = new Uri\Rfc3986\Uri("https://example.com/?");
echo $uri->getQueryParams(); // A new Uri\Rfc3986\UriQueryParams containing zero items
The same example with ''Uri\WhatWg\UrlQueryParams'':
$url = new Uri\WhatWg\Url("https://example.com/");
echo $url->getQueryParams(); // null
$url = new Uri\WhatWg\Url("https://example.com/?");
echo $url->getQueryParams(); // A new Uri\WhatWg\UrlQueryParams containing zero items
It's important to note that neither ''UriQueryParams'' nor ''UrlQueryParams'' validates query parameters during construction. This behavior is by design: WHATWG URL's ''URLSearchParams'' class is meant to be tolerant when reading, and ''UriQueryParams'' and ''UrlQueryParams'' follow the same principle. Validation happens anyway when the recomposed query parameters are written to a URI (via ''Uri\Rfc3986\Uri::withQuery()'' and ''Uri\WhatWg\Url::withQuery()''). However, as we'll see, invalid characters are [[https://wiki.php.net/rfc/url_parsing_api#percent-encoding_decoding|automatically percent-encoded]] during query parameter recomposition, so the ''withQuery()'' calls won't fail in practice either.
$params = Uri\Rfc3986\UriQueryParams::parseRfc3986("#foo=bar"); // Parses an invalid parameter name "#foo"
$uri = new Uri\Rfc3986\Uri("https://example.com/");
$uri = $uri->withQuery($params->toRfc3986String()); // Success: the query is automatically percent-encoded to "%23foo=bar"
The same example with ''Uri\WhatWg\UrlQueryParams'':
$params = Uri\WhatWg\UrlQueryParams::parse("#foo=bar"); // Parses an invalid parameter name "#foo"
$url = new Uri\WhatWg\Url("https://example.com/");
$url = $url->withQuery($params->toString()); // Success: the query is automatically percent-encoded to "%23foo=bar"
Please note that this RFC doesn't propose a ''Uri\Rfc3986\Uri::withQueryParams()'' method for updating the query string directly based on a query parameter list, because ''Uri\Rfc3986\UriQueryParams'' supports multiple recomposition formats, and the user should choose from them explicitly. The example above recomposes the query parameters according to RFC 3986. Even though ''Uri\WhatWg\UrlQueryParams'' only supports a single recomposition algorithm (WHATWG URL), no ''Uri\WhatWg\Url::withQueryParams()'' method is proposed either, in order to be consistent with ''Uri\Rfc3986\Uri''.
Neither the ''parse*()'', nor the ''fromArray()'' factory methods can fail in practice: they only have memory-related failure cases which are handled by the PHP engine as a fatal error.
According to the WHATWG URL algorithm, the leading "?" character is removed during parsing. As opposed to this behavior, the leading "?" becomes part of the first query parameter name for RFC 3986 query params.
$params = Uri\Rfc3986\UriQueryParams::parseRfc3986("?abc=foo");
// $params internally contains the ["?abc" => "foo"] key-value pair
$params = Uri\WhatWg\UrlQueryParams::parse("?abc=foo");
// $params internally contains the ["abc" => "foo"] key-value pair
All ''parse*()'' variants percent-decode the input automatically when constructing the ''UriQueryParams'' or ''UrlQueryParams'' instances. This is necessary so that the classes can work with the unencoded query parameters.
$params = Uri\Rfc3986\UriQueryParams::parseRfc3986("foo%5B%5D=b%61r"); // Percent-encoded form of "foo[]=bar"
// $params internally contains the ["foo[]" => "bar"] key-value pair
$params = Uri\Rfc3986\UriQueryParams::parseFormData("foo%5B%5D=b%61r"); // Percent-encoded form of "foo[]=bar"
// $params internally contains the ["foo[]" => "bar"] key-value pair
$params = Uri\WhatWg\UrlQueryParams::parse("foo%5B%5D=b%61r"); // Percent-encoded form of "foo[]=bar"
// $params internally contains the ["foo[]" => "bar"] key-value pair
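The decoding step can be approximated in userland with functions that already exist in PHP. The following is only a rough sketch under stated assumptions: ''sketchParseQuery()'' is a hypothetical helper, ''rawurldecode()'' and ''urldecode()'' merely stand in for the RFC 3986 and form-data decoding rules, and skipping empty sequences or treating a missing "=" as an empty value are choices made for the illustration, not behavior specified by this RFC.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: splits a query string into an ordered list of [name, value] pairs.
 * With $plusAsSpace = true, "+" is decoded to a space (form-data style);
 * with false, "+" is kept literally (RFC 3986 style).
 */
function sketchParseQuery(string $query, bool $plusAsSpace): array
{
    $decode = $plusAsSpace ? 'urldecode' : 'rawurldecode';
    $pairs = [];
    foreach (explode("&", $query) as $part) {
        if ($part === "") {
            continue; // assumption: empty sequences such as in "a=1&&b=2" are ignored
        }
        // Split on the first "=" only; a missing "=" yields an empty value (assumption).
        [$name, $value] = array_pad(explode("=", $part, 2), 2, "");
        $pairs[] = [$decode($name), $decode($value)];
    }
    return $pairs;
}

var_dump(sketchParseQuery("foo%5B%5D=b%61r", false) === [["foo[]", "bar"]]); // bool(true)
var_dump(sketchParseQuery("a=b+c", true) === [["a", "b c"]]);                // bool(true)
var_dump(sketchParseQuery("a=b+c", false) === [["a", "b+c"]]);               // bool(true)
```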
=== Parameter Retrieval ===
The ''has()'' and ''hasValue()'' methods can be used to find out if a parameter exists:
$params = Uri\Rfc3986\UriQueryParams::parseRfc3986("foo=bar&baz=qux&baz=quux");
echo $params->has("baz"); // true
echo $params->has("non-existent"); // false
echo $params->hasValue("foo", "bar"); // true
echo $params->hasValue("foo", "baz"); // false
The ''has()'' method returns ''true'' if there is at least one parameter in the parameter list with the given name, ''false'' otherwise. On the other hand, ''hasValue()'' returns ''true'' if the given name and value both match at least one parameter, otherwise it returns ''false''.
The number of query parameters can be retrieved by calling the ''count()'' method:
$params = Uri\Rfc3986\UriQueryParams::parseRfc3986("foo=bar&baz=qux&baz=quux");
echo $params->count(); // 3
There are also a number of methods that can return a query parameter or an array of query parameters:
* ''getFirst()'': Retrieves the first parameter with the given name. This actually implements the [[https://url.spec.whatwg.org/#dom-urlsearchparams-get|get() method]] from the WHATWG URL specification.
* ''getLast()'': Retrieves the last parameter with the given name. It's a custom, PHP-specific method which doesn't have a WHATWG URL equivalent.
* ''getAll()'': Retrieves all parameters with the given name. This actually implements the [[https://url.spec.whatwg.org/#dom-urlsearchparams-getall|getAll() method]] from the WHATWG URL specification.
* ''list()'': Retrieves all query parameters. It's also a custom, PHP-specific method which doesn't have a WHATWG URL equivalent.
$params = Uri\Rfc3986\UriQueryParams::parseRfc3986("foo=bar&foo=baz&qux=quux");
echo $params->getFirst("foo"); // bar
echo $params->getFirst("non-existent"); // null
echo $params->getLast("foo"); // baz
echo $params->getLast("non-existent"); // null
echo $params->getAll("foo"); // ["bar", "baz"]
echo $params->getAll("non-existent"); // []
echo $params->list(); // [["foo", "bar"], ["foo", "baz"], ["qux", "quux"]]
All these methods return the natively stored values without applying any transformations. That is, neither the input nor the output is percent-encoded or percent-decoded.
$params = Uri\Rfc3986\UriQueryParams::parseRfc3986("foo%5B%5D=b%61r"); // Internally stored as "foo[]=bar"
echo $params->getFirst("foo%5B%5D"); // null
echo $params->getFirst("foo[]"); // bar
echo $params->getLast("foo%5B%5D"); // null
echo $params->getLast("foo[]"); // bar
echo $params->getAll("foo%5B%5D"); // []
echo $params->getAll("foo[]"); // ["bar"]
echo $params->list(); // [["foo[]", "bar"]]
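The retrieval semantics can be modeled on a plain ordered list of name-value pairs. The sketch below is only an illustration: the ''sketchGet*()'' functions are hypothetical, and the list-of-pairs storage is an assumption derived from the output of ''list()'' above, not a statement about the internal representation.

```php
<?php
declare(strict_types=1);

// Assumed storage model: an ordered list of [name, value] pairs, already percent-decoded.
$pairs = [["foo", "bar"], ["foo", "baz"], ["qux", "quux"]];

// Collects every value whose name matches exactly (no percent-decoding of $name).
function sketchGetAll(array $pairs, string $name): array
{
    $values = [];
    foreach ($pairs as [$n, $v]) {
        if ($n === $name) {
            $values[] = $v;
        }
    }
    return $values;
}

function sketchGetFirst(array $pairs, string $name): ?string
{
    return sketchGetAll($pairs, $name)[0] ?? null;
}

function sketchGetLast(array $pairs, string $name): ?string
{
    $values = sketchGetAll($pairs, $name);
    return $values === [] ? null : $values[count($values) - 1];
}

var_dump(sketchGetFirst($pairs, "foo"));          // string(3) "bar"
var_dump(sketchGetLast($pairs, "foo"));           // string(3) "baz"
var_dump(sketchGetFirst($pairs, "non-existent")); // NULL
```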
=== Percent-Encoding and Decoding ===
''UriQueryParams'' and ''UrlQueryParams'' only perform percent-encoding when query parameters are recomposed to a query string (via ''to*String()'' methods), and they only perform percent-decoding when a query string is parsed into a query parameter list (via ''parse*()'' methods). The remaining functionality doesn't use percent-encoding or decoding.
''UriQueryParams'' supports percent-encoding and decoding according to two specifications: [[https://datatracker.ietf.org/doc/html/rfc1866#section-8.2.1|RFC 1866]] which specifies the percent-encoding and decoding rules of the ''application/x-www-form-urlencoded'' media type, and [[https://datatracker.ietf.org/doc/html/rfc3986|RFC 3986]] which defines the generic query string syntax. On the other hand, ''UrlQueryParams'' relies on the ''URLSearchParams'' class specified by WHATWG URL, which yet again builds upon the ''application/x-www-form-urlencoded'' media type for historic reasons, albeit slightly differently than how RFC 1866 specifies it. The current section gives an overview of the percent-encoding and decoding details, as well as the differences between the aforementioned specifications.
According to RFC 1866, space characters are replaced by the plus character (''+'') during percent-encoding, and any characters that fall outside of the unreserved character set are percent-encoded. Percent-decoding inverts these operations.
This behavior clearly deviates from the percent-encoding rules of the query component of RFC 3986 which allows quite a few reserved characters to be present in the query component without percent-encoding (a few examples: ":", "@", "?", "/"), not to mention the difference in how the space character is handled.
Regarding WHATWG URL's ''URLSearchParams'' class, as usual, a [[https://url.spec.whatwg.org/#application-x-www-form-urlencoded-percent-encode-set|dedicated percent-encoding set]] is defined:
> The application/x-www-form-urlencoded percent-encode set contains all code points, except the ASCII alphanumeric, U+002A (*), U+002D (-), U+002E (.), and U+005F (_).
WHATWG URL also defines a [[https://url.spec.whatwg.org/#urlencoded-serializing|dedicated algorithm]] for "serialization" (in this context, serialization means recomposition - converting the list to a query string): the space code point is percent-encoded as the plus code point (''+''), and the rest of the code points in the percent-encoding set are encoded how WHATWG URL normally does so. This behavior deviates from the percent-encoding rules of the query component of WHATWG URL, as the [[https://url.spec.whatwg.org/#query-percent-encode-set|query percent-encode set]] contains far fewer characters, and the space code point is handled differently again.
It's also important to compare how the percent-encoding rules of RFC 1866's and WHATWG URL's ''application/x-www-form-urlencoded'' media type differ: they handle the asterisk (''*'') and the tilde (''~'') symbols differently. ''UriQueryParams'' percent-encodes the asterisk, but ''UrlQueryParams'' doesn't; conversely, ''UriQueryParams'' doesn't percent-encode the tilde, but ''UrlQueryParams'' does.
Even though it follows directly from the percent-encoding definition, it may still be difficult to realize that the ''application/x-www-form-urlencoded'' media type (both RFC 1866's and WHATWG URL's definition) also percent-encodes "%" itself, even when it's part of an existing percent-encoded octet. This is counterintuitive (normally, RFC 3986 and WHATWG URL do not percent-encode "%" twice) and quite unsafe behavior due to the [[https://en.wikipedia.org/wiki/Double_encoding|double encoding vulnerability]].
$params = Uri\Rfc3986\UriQueryParams::fromArray(
[
["foo" => "b%61r"],
]
);
echo $params->toFormDataString(); // foo=b%2561r
$params = Uri\WhatWg\UrlQueryParams::fromArray(
[
["foo" => "b%61r"],
]
);
echo $params->toString(); // foo=b%2561r
As surprising as it is, the ''Uri\Rfc3986\UriQueryParams::toFormDataString()'' and ''Uri\WhatWg\UrlQueryParams::toString()'' methods percent-encode **"%" itself** (thus "%" becomes "%25" first, and then "61r" is appended), rather than leaving the already percent-encoded octet alone. Another conclusion is that it's very important to pass unencoded input to the ''UriQueryParams'' and ''UrlQueryParams'' classes so that double-encoding cannot happen (the only exceptions are the ''parse*()'' methods, because they automatically percent-decode their input).
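The differences described in this section (the asterisk, the tilde, and the double encoding of "%") can be reproduced with PHP's existing ''rawurlencode()'' and ''urlencode()'' functions. The two helpers below are hypothetical approximations written for this illustration; they follow the rules as described above and are not the code the proposed classes would use.

```php
<?php
declare(strict_types=1);

// RFC 1866 style as described above: unreserved characters are kept, space becomes "+".
// rawurlencode() keeps A-Z a-z 0-9 "-" "_" "." "~" and encodes a space as "%20".
function sketchFormDataEncode(string $s): string
{
    return str_replace("%20", "+", rawurlencode($s));
}

// WHATWG style: urlencode() keeps A-Z a-z 0-9 "-" "_" "." and encodes a space as "+";
// the WHATWG percent-encode set additionally leaves "*" alone.
function sketchWhatWgEncode(string $s): string
{
    return str_replace("%2A", "*", urlencode($s));
}

echo sketchFormDataEncode("a b*~"), "\n"; // a+b%2A~
echo sketchWhatWgEncode("a b*~"), "\n";   // a+b*%7E

// "%" is always encoded, even when it already starts a percent-encoded octet.
echo sketchFormDataEncode("b%61r"), "\n"; // b%2561r
echo sketchWhatWgEncode("b%61r"), "\n";   // b%2561r
```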
=== Recomposition ===
In order to be consistent with the design of ''Uri\Rfc3986\Uri'' and the ''Uri\WhatWg\Url'' classes, neither ''UriQueryParams'', nor ''UrlQueryParams'' have a ''%%__toString()%%'' magic method. Instead, they contain custom ''to*String()'' methods to recompose the query string from the parsed query parameters.
$params = Uri\Rfc3986\UriQueryParams::parseRfc3986("foo=bar&foo=baz");
echo $params->toRfc3986String(); // foo=bar&foo=baz
echo $params->toFormDataString(); // foo=bar&foo=baz
$params = Uri\WhatWg\UrlQueryParams::parse("foo=bar&foo=baz");
echo $params->toString(); // foo=bar&foo=baz
All ''to*String()'' methods (''Uri\Rfc3986\UriQueryParams::toRfc3986String()'', ''Uri\Rfc3986\UriQueryParams::toFormDataString()'', ''Uri\WhatWg\UrlQueryParams::toString()'') automatically percent-encode their output according to the rules outlined in the [[https://wiki.php.net/rfc/uri_followup#percent-encoding_and_decoding|previous section]], otherwise it would be possible that an invalid output is returned.
$params = Uri\Rfc3986\UriQueryParams::fromArray([["foo[]" => "bar baz"]]);
echo $params->toRfc3986String(); // foo%5B%5D=bar%20baz
echo $params->toFormDataString(); // foo%5B%5D=bar+baz
$params = Uri\WhatWg\UrlQueryParams::fromArray([["foo[]" => "bar baz"]]);
echo $params->toString(); // foo%5B%5D=bar+baz
Unlike ''Uri\Rfc3986\Uri'', the ''Uri\Rfc3986\UriQueryParams'' class doesn't have a ''toRawString()'' method because it could be misleading what exactly it does: ''toRawString()'' cannot really provide a "raw" representation of the query string, since automatic percent-encoding must happen anyway to make the produced query string valid.
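For comparison, the existing ''http_build_query()'' function already exposes a similar distinction through its encoding type argument. The following is a rough analogy using only existing PHP functionality; it is not meant to define how the proposed ''to*String()'' methods are implemented:

```php
<?php
$data = ["foo[]" => "bar baz"];

// PHP_QUERY_RFC3986 encodes the space as "%20" (rawurlencode() rules).
$rfc3986 = http_build_query($data, "", "&", PHP_QUERY_RFC3986);

// PHP_QUERY_RFC1738 (the default) encodes the space as "+" (urlencode() rules).
$formData = http_build_query($data, "", "&", PHP_QUERY_RFC1738);

echo $rfc3986, "\n";  // foo%5B%5D=bar%20baz
echo $formData, "\n"; // foo%5B%5D=bar+baz
```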
=== Relation to the query component ===
After learning about the details of the percent-encoding and decoding behavior of ''UriQueryParams'' and ''UrlQueryParams'', it should be clarified how the new classes interoperate with the existing ''Uri\Rfc3986\Uri'' and ''Uri\WhatWg\Url'' classes.
In case of ''UriQueryParams'', full compatibility with ''Uri\Rfc3986\Uri'' can be achieved via the ''parseRfc3986()'' and ''toRfc3986String()'' methods:
$uri = new Uri\Rfc3986\Uri("https://example.com?foo=a%20b");
$params = $uri->getQueryParams();
// The above line is effectively the same as the following one:
$params = Uri\Rfc3986\UriQueryParams::parseRfc3986($uri->getQuery());
$uri = $uri->withQuery($params->toRfc3986String());
echo $uri->getQuery(); // foo=a%20b
As can be seen in the example above, the behavior is round-trippable: parsing a query string into a ''UriQueryParams'' instance and then replacing the original query string with the recomposed one results in the original query string. Of course, this won't necessarily be the case when using ''parseFormData()'' or ''toFormDataString()'' if the query string contains some specific characters (most notably, the space character):
$uri = new Uri\Rfc3986\Uri("https://example.com?foo=a%20b");
$params = $uri->getQueryParams();
$uri = $uri->withQuery($params->toFormDataString());
echo $uri->getQuery(); // foo=a+b
''Uri\WhatWg\UrlQueryParams'' and ''Uri\WhatWg\Url'' have the very same incompatibility due to their different percent-encoding and decoding algorithms. This is even encoded in the WHATWG URL specification itself, so it's not possible to work around it on PHP's side:
$url = new Uri\WhatWg\Url("https://example.com?foo=a b");
$params = $url->getQueryParams();
$url = $url->withQuery($params->toString());
echo $url->getQuery(); // foo=a+b
=== Modification ===
The ''append()'' method can be used to append a parameter to the end of the list. As usual, the same query parameter can be added multiple times:
$params = Uri\Rfc3986\UriQueryParams::parseRfc3986("foo=bar");
$params->append("baz", "qux");
$params->append("baz", "qaz"); // Appends "baz" twice
echo $params->toRfc3986String(); // foo=bar&baz=qux&baz=qaz
Updating a parameter is possible via the ''set()'' method:
$params = Uri\Rfc3986\UriQueryParams::parseRfc3986("foo=bar&foo=baz");
$params->set("foo", "baz"); // Overwrites the first item "foo", and removes the second one
$params->set("qux", "qaz"); // Appends a new item "qux"
echo $params->toRfc3986String(); // foo=baz&qux=qaz
Actually, the ''set()'' method has a hybrid behavior: if a parameter is not present in the list, then it adds it just like ''append()'' does. Otherwise, it overwrites the first item, and removes the rest of the occurrences.
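The hybrid behavior described above can be sketched in userland on a plain list of name-value pairs. The ''setPair()'' helper is hypothetical and exists only for illustration; it is not part of the proposal:

```php
<?php
/**
 * Hypothetical helper: applies the described set() semantics
 * to a list of [name, value] pairs and returns the new list.
 */
function setPair(array $pairs, string $name, string $value): array
{
    $result = [];
    $replaced = false;

    foreach ($pairs as [$currentName, $currentValue]) {
        if ($currentName !== $name) {
            // Unrelated parameters are kept as-is.
            $result[] = [$currentName, $currentValue];
        } elseif (!$replaced) {
            // The first occurrence is overwritten in place.
            $result[] = [$name, $value];
            $replaced = true;
        }
        // Any further occurrence of $name is dropped.
    }

    if (!$replaced) {
        // The parameter was not present: behave like append().
        $result[] = [$name, $value];
    }

    return $result;
}

$pairs = [["foo", "bar"], ["foo", "baz"]];
$pairs = setPair($pairs, "foo", "baz");
$pairs = setPair($pairs, "qux", "qaz");
// $pairs is now [["foo", "baz"], ["qux", "qaz"]]
```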
Neither ''append()'' nor ''set()'' does any percent-encoding or decoding of its arguments.
$params = new Uri\WhatWg\UrlQueryParams();
$params->append("foo%5B%5D", "ab%63"); // Percent-encoded form of "foo[]=abc"
$params->set("bar%5B%5D", "de%66"); // Percent-encoded form of "bar[]=def"
echo $params->getFirst("foo%5B%5D"); // ab%63
echo $params->getFirst("bar%5B%5D"); // de%66
Removing parameters is possible via either the ''delete()'' or the ''deleteValue()'' method: the former removes all occurrences of the given parameter name, while the latter removes all occurrences of a parameter whose name and value both match the given ones, as demonstrated below:
$params = Uri\Rfc3986\UriQueryParams::parseRfc3986("foo=bar&foo=baz&foo=qux");
$params->deleteValue("foo", "baz"); // Deletes the "foo=baz" parameter
$params->delete("foo"); // Deletes the rest of the occurrences: "foo=bar" and "foo=qux"
$params->delete("non-existent"); // The parameter is not present: nothing happens
Finally, ''sort()'' sorts the query parameter list alphabetically:
$params = Uri\Rfc3986\UriQueryParams::parseRfc3986("foo=bar&baz=qux&baz=quux");
$params->sort();
echo $params->toRfc3986String(); // baz=qux&baz=quux&foo=bar
The main purpose of ''sort()'' is to provide a consistent order of the key-value pairs (e.g. to increase cache hits), therefore more advanced features such as sorting in descending order, or user-provided comparison methods are not proposed.
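A minimal userland sketch of such a sort on a list of name-value pairs is shown below. It assumes a plain byte-wise comparison of the names via ''strcmp()'', which may differ from the comparison the actual implementation uses:

```php
<?php
$pairs = [["foo", "bar"], ["baz", "qux"], ["baz", "quux"]];

// usort() is stable since PHP 8.0, so pairs with equal names
// keep their original relative order.
usort($pairs, fn (array $a, array $b): int => strcmp($a[0], $b[0]));

// $pairs is now [["baz", "qux"], ["baz", "quux"], ["foo", "bar"]]
```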
=== Supported types ===
It is also important to clarify how non-string values are mapped to query parameters, which inherently have string type. PHP's [[https://www.php.net/manual/en/function.http-build-query.php|http_build_query()]] function can map basically any type to query parameters. However, this is completely PHP-specific behavior, as no such type mapping is specified by either RFC 3986 or WHATWG URL: RFC 3986 completely omits any information on how query parameters should be built, while WHATWG URL's ''URLSearchParams'' only accepts and returns string data.
The position of this RFC is that it's important to follow the road that ''http_build_query()'' has already paved because of better developer experience and better interoperability with the existing ecosystem. That's why the following type mapping behavior is proposed **when a query parameter is added/updated**:
* **bool:** Becomes string "0" (in case of ''false'') or string "1" (in case of ''true'')
* **int:** Becomes a numeric string (123 -> "123")
* **float:** Becomes a decimal string (3.14 -> "3.14")
* **resource:** Invalid mapping, an exception is thrown
* **array:**
* **empty array**: An empty array has zero items, therefore empty arrays are omitted from the query parameter list.
* **list**: An array is a list if its keys are consecutive integers starting from 0. Lists are converted to query parameters by repeating the given query parameter name appended by a bracket pair (''[]'') along with each value in the list mapped recursively according to the currently described type mapping rules. E.g. adding a query parameter with the ''array'' name and the ''[1, false, "foo"]'' value will result in an ''array[]=1&array[]=0&array[]=foo'' query string.
* **map**: An array is a map if it is not a list. Maps are converted to query parameters by appending the array keys contained within brackets (''[]'') to the given query parameter name along with each value in the map mapped recursively according to the currently described type mapping rules. E.g. adding a query parameter with an ''array'' name and the ''[1 => 1, 2 => true, 3 => "foo"]'' value will result in an ''array[1]=1&array[2]=1&array[3]=foo'' query string.
* **enum:**
* **backed enums** are converted to their backing value
* **enums without backing type** are invalid, and an exception is thrown
* **object:** invalid mapping, an exception is thrown
The above conversion rules work for both ''UriQueryParams'' and ''UrlQueryParams''. However, ''Uri\Rfc3986\UriQueryParams'' can additionally properly handle ''null'' values: a ''null'' input is mapped to a query component so that only the parameter name is present, while the "=" and the parameter value are omitted. On the other hand, ''Uri\WhatWg\UrlQueryParams'' converts ''null'' values to an empty string. For reference, ''http_build_query()'' omits parameters with ''null'' values.
A few examples demonstrating how ''UriQueryParams'' handles scalar types:
$params = new Uri\Rfc3986\UriQueryParams();
$params->append("null", null);
$params->append("bool", true);
$params->append("int", 123);
$params->append("float", 3.14);
var_dump($params->getFirst("null")); // NULL
var_dump($params->getFirst("bool")); // string(1) "1"
var_dump($params->getFirst("int")); // string(3) "123"
var_dump($params->getFirst("float")); // string(4) "3.14"
echo $params->toRfc3986String(); // null&bool=1&int=123&float=3.14
Let's also see a few examples about how ''UrlQueryParams'' handles scalar types. Note how ''null'' is represented differently than in case of ''UriQueryParams'':
$params = new Uri\WhatWg\UrlQueryParams();
$params->append("null", null);
$params->append("bool", true);
$params->append("int", 123);
$params->append("float", 3.14);
var_dump($params->getFirst("null")); // string(0) ""
var_dump($params->getFirst("bool")); // string(1) "1"
var_dump($params->getFirst("int")); // string(3) "123"
var_dump($params->getFirst("float")); // string(4) "3.14"
echo $params->toString(); // null=&bool=1&int=123&float=3.14
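For reference, this is how the existing ''http_build_query()'' function maps the same values today. Note that it omits ''null'' and uses explicit numeric indexes for lists, which differs from the ''[]'' notation proposed above:

```php
<?php
// Scalars: booleans become "1"/"0", null is omitted entirely.
$scalars = http_build_query([
    "null" => null,
    "bool" => true,
    "false" => false,
    "int" => 123,
    "float" => 3.14,
]);
echo $scalars, "\n"; // bool=1&false=0&int=123&float=3.14

// Lists: http_build_query() emits explicit numeric indexes.
$list = http_build_query(["array" => [1, false, "foo"]]);
echo urldecode($list), "\n"; // array[0]=1&array[1]=0&array[2]=foo
```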
=== Array API ===
In order to better support arrays (which is a completely PHP-specific feature), the current RFC proposes a dedicated API. This way, the rest of the methods can follow WHATWG URL without any customization, and the Array API can have its custom behavior.
In order to add arrays to ''UriQueryParams'' or ''UrlQueryParams'', one can use the ''fromArray()'' factory methods:
$params = Uri\Rfc3986\UriQueryParams::fromArray(
[
"empty" => [],
"list" => ["a", "b", "c"],
"map" => ["a" => 0, "b" => 1, "c" => 2],
]
);
In order to retrieve an array of query parameters, the ''getArray()'' method can be used. It behaves similarly to the ''getAll()'' method, but it retrieves all query parameters whose name starts with the supplied ''$name'' argument and differs from it at most by a ''[...]'' suffix. Let's see an example:
$params = Uri\Rfc3986\UriQueryParams::fromArray(
[
"empty" => [],
"list" => ["a", "b", "c"],
"map" => ["a" => 0, "b" => 1, "c" => 2],
]
);
/*
Internally, this results in the following array:
array(2) {
["list"]=>
array(3) {
[0]=>
string(1) "a"
[1]=>
string(1) "b"
[2]=>
string(1) "c"
}
["map"]=>
array(3) {
["a"]=>
string(1) "0"
["b"]=>
string(1) "1"
["c"]=>
string(1) "2"
}
}
*/
echo $params->getFirst("empty"); // null
echo $params->getAll("empty"); // []
echo $params->getArray("empty"); // []
echo $params->getFirst("list"); // "a"
echo $params->getAll("list"); // []
echo $params->getAll("list[]"); // ["a", "b", "c"]
echo $params->getArray("list"); // ["a", "b", "c"]
echo $params->getFirst("map"); // 0
echo $params->getAll("map"); // []
echo $params->getArray("map"); // ["a" => "0", "b" => "1", "c" => "2"]
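The bracket suffix convention that ''getArray()'' relies on is the same one that PHP already applies when populating ''$_GET''. The existing ''parse_str()'' function can be used to observe it; the snippet is only meant as a point of comparison:

```php
<?php
// parse_str() interprets the "[...]" suffixes the same way PHP does for $_GET.
parse_str("list[]=a&list[]=b&list[]=c&map[a]=0&map[b]=1&map[c]=2", $result);

var_dump($result["list"]); // a list with "a", "b", "c"
var_dump($result["map"]);  // a map with "a" => "0", "b" => "1", "c" => "2"
```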
Similarly to the ''append()'' and ''set()'' methods, there are ''appendArray()'' and ''setArray()'' methods:
$params = new Uri\Rfc3986\UriQueryParams();
$params->appendArray("empty", []);
$params->appendArray("list", ["a", "b", "c"]);
$params->appendArray("map", ["a" => 0, "b" => 1, "c" => 2]);
echo $params->getFirst("empty"); // null
echo $params->getAll("empty"); // []
echo $params->getArray("empty"); // []
echo $params->getFirst("list"); // "a"
echo $params->getAll("list"); // []
echo $params->getAll("list[]"); // ["a", "b", "c"]
echo $params->getArray("list"); // ["a", "b", "c"]
echo $params->getFirst("map"); // 0
echo $params->getAll("map"); // []
echo $params->getArray("map"); // ["a" => "0", "b" => "1", "c" => "2"]
echo $params->toRfc3986String(); // list%5B%5D=a&list%5B%5D=b&list%5B%5D=c&map%5Ba%5D=0&map%5Bb%5D=1&map%5Bc%5D=2
And a few examples demonstrating how ''UrlQueryParams'' handles complex types:
$params = new Uri\WhatWg\UrlQueryParams();
$params->appendArray("empty", []);
$params->appendArray("list", ["a", "b", "c"]);
$params->appendArray("map", ["a" => 0, "b" => 1, "c" => 2]);
echo $params->getFirst("empty"); // null
echo $params->getAll("empty"); // []
echo $params->getArray("empty"); // []
echo $params->getFirst("list"); // "a"
echo $params->getAll("list"); // []
echo $params->getAll("list[]"); // ["a", "b", "c"]
echo $params->getArray("list"); // ["a", "b", "c"]
echo $params->getFirst("map"); // 0
echo $params->getAll("map"); // []
echo $params->getArray("map"); // ["a" => "0", "b" => "1", "c" => "2"]
echo $params->toString(); // list%5B%5D=a&list%5B%5D=b&list%5B%5D=c&map%5Ba%5D=0&map%5Bb%5D=1&map%5Bc%5D=2
Finally, let's see how multi-dimensional arrays are represented:
$params = new Uri\Rfc3986\UriQueryParams();
$params->appendArray(
"array",
[
"list" => [1, 2, 3],
"map" => ["foo" => 1, "bar" => 2, "baz" => 3]
]
);
var_dump($params->getArray("array"));
/*
array(4) {
["array[list]"]=>
array(3) {
[0]=>
string(1) "1"
[1]=>
string(1) "2"
[2]=>
string(1) "3"
}
["array[map][foo]"]=>
string(1) "1"
["array[map][bar]"]=>
string(1) "2"
["array[map][baz]"]=>
string(1) "3"
}
*/
=== Class signature ===
The ''UriQueryParams'' and ''UrlQueryParams'' classes are ''final'' [[https://wiki.php.net/rfc/url_parsing_api#why_should_the_uri_rfc3986_uri_and_the_uri_whatwg_url_classes_be_final|for the same reason]] as all the other URI classes are final: mainly, in order to make followup changes possible without breaking backward compatibility.
Additionally, ''UriQueryParams'' and ''UrlQueryParams'' could be ''readonly'' classes, but this still has to be decided.
The ''UriQueryParams'' and ''UrlQueryParams'' classes implement the ''IteratorAggregate'' and the ''Countable'' interfaces. Implementing ''IteratorAggregate'' seems straightforward at first sight (query parameter names could be returned as iterator keys, while query parameter values could be returned as iterator values). Unfortunately, it's trickier than that due to query parameters that share the same name, e.g. ''param=foo&param=bar&param=baz''. In this case, the same key (''param'') would be repeated 3 times by default, and this is actually not possible to support with iterators.
That's why the iterator returns each query parameter name and value as a list of pairs. Similarly to the ''get*()'' methods, the iterator returns the "raw" parameter names and values without percent-encoding. Let's see an example:
$params = Uri\Rfc3986\UriQueryParams::parseRfc3986("param=foo&param=bar&param=baz");
foreach ($params as $key => $value) {
echo "$key => $value[0], $value[1]\n";
}
/*
0 => param, foo
1 => param, bar
2 => param, baz
*/
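The list-of-pairs iteration can be sketched in userland as follows. The ''PairList'' class is a hypothetical stand-in for the proposed classes, and delegating to ''ArrayIterator'' is just one possible way to implement ''getIterator()'':

```php
<?php
// Hypothetical stand-in for the proposed query parameter classes.
final class PairList implements IteratorAggregate, Countable
{
    /** @param list<array{string, string}> $pairs */
    public function __construct(private array $pairs) {}

    public function getIterator(): Iterator
    {
        // Keys are list positions, values are [name, value] pairs,
        // so repeated parameter names never collide.
        return new ArrayIterator($this->pairs);
    }

    public function count(): int
    {
        return count($this->pairs);
    }
}

$list = new PairList([["param", "foo"], ["param", "bar"], ["param", "baz"]]);

$lines = [];
foreach ($list as $index => [$name, $value]) {
    $lines[] = "$index => $name, $value";
}
echo implode("\n", $lines), "\n";
```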
=== Cloning ===
Cloning of ''UriQueryParams'' and ''UrlQueryParams'' is supported.
$params1 = Uri\Rfc3986\UriQueryParams::parseRfc3986("foo=bar&foo=baz");
$params2 = clone $params1;
$params2->append("foo", "qux");
echo $params1->toRfc3986String(); // foo=bar&foo=baz
echo $params2->toRfc3986String(); // foo=bar&foo=baz&foo=qux
''UrlQueryParams'' works the same way:
$params1 = Uri\WhatWg\UrlQueryParams::parse("foo=bar&foo=baz");
$params2 = clone $params1;
$params2->append("foo", "qux");
echo $params1->toString(); // foo=bar&foo=baz
echo $params2->toString(); // foo=bar&foo=baz&foo=qux
=== Serialization ===
Both classes support serialization and deserialization via the [[https://wiki.php.net/rfc/custom_object_serialization|new serialization API]]. The only implementation gotcha is that the serialized format is slightly unexpected: instead of recomposing the query parameters into a query string, the individual query parameter name and value pairs are serialized as an array of key-value pairs, similarly to the output of the ''list()'' method. During deserialization, the query parameter list is directly created from this array without any transformation (the same way as the ''fromArray()'' method works).
The main advantage of this choice is that the query parameters can be serialized and deserialized as-is, without any modifications (remember, the recomposition algorithms must percent-encode their output, and percent-decoding is needed during parsing, both of which processes modify the original data). Additionally, this behavior is more efficient than the former one, because it eliminates the overhead of parsing, including percent-encoding and decoding.
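The idea can be illustrated with a small userland class using the ''%%__serialize()%%'' and ''%%__unserialize()%%'' hooks. The class name, the ''pairs()'' accessor and the ''params'' array key are assumptions made for this sketch only; the actual serialized format is defined by the implementation:

```php
<?php
// Hypothetical class that serializes its pairs as-is, without recomposition.
final class SerializablePairList
{
    public function __construct(private array $pairs = []) {}

    public function __serialize(): array
    {
        // No percent-encoding happens here: the pairs are stored unchanged.
        return ["params" => $this->pairs];
    }

    public function __unserialize(array $data): void
    {
        // No parsing or percent-decoding happens here either.
        $this->pairs = $data["params"];
    }

    public function pairs(): array
    {
        return $this->pairs;
    }
}

$original = new SerializablePairList([["foo", "b%61r"], ["foo", "a b"]]);
$restored = unserialize(serialize($original));
```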
=== Debugging ===
Both classes contain a ''%%__debugInfo()%%'' method that returns all items in the query parameter list in order to make debugging easier. Effectively, this has a similar output to the ''list()'' method.
$params = Uri\Rfc3986\UriQueryParams::parseRfc3986("foo=bar&foo=baz&foo=qux");
var_dump($params);
/*
object(Uri\Rfc3986\UriQueryParams)#1 (1) {
["params"]=>
array(3) {
[0]=>
array(2) {
[0]=>
string(3) "foo"
[1]=>
string(3) "bar"
}
[1]=>
array(2) {
[0]=>
string(3) "foo"
[1]=>
string(3) "baz"
}
[2]=>
array(2) {
[0]=>
string(3) "foo"
[1]=>
string(3) "qux"
}
}
}
*/
$params = Uri\WhatWg\UrlQueryParams::parse("foo=bar&foo=baz&foo=qux");
var_dump($params);
/*
object(Uri\WhatWg\UrlQueryParams)#1 (1) {
["params"]=>
array(3) {
[0]=>
array(2) {
[0]=>
string(3) "foo"
[1]=>
string(3) "bar"
}
[1]=>
array(2) {
[0]=>
string(3) "foo"
[1]=>
string(3) "baz"
}
[2]=>
array(2) {
[0]=>
string(3) "foo"
[1]=>
string(3) "qux"
}
}
}
*/
=== Relation to $_GET ===
The ''$_GET'' superglobal stores the query parameters of the current request, percent-decoded according to RFC 1866. That's why the proposed ''UriQueryParams'' and ''UrlQueryParams'' classes are its direct alternatives when it comes to processing the current request. In theory, it would also be possible to populate ''$_GET'' according to the other relevant specifications (RFC 3986 and WHATWG URL). For example, this could be achieved by adding support for a new php.ini configuration option.
The position of this RFC though is that ''$_GET'' (and superglobals in general) shouldn't be changed in any way, but rather gradually phased out on the long term by offering better alternatives. In this case, ''UriQueryParams'' and ''UrlQueryParams'' can be used directly instead of ''$_GET'', so migrating away from the superglobal usage should be straightforward in most cases.
It should also be noted that introducing a php.ini option for controlling the rules by which ''$_GET'' is populated is not a safe solution: it could cause security vulnerabilities due to parsing confusion, not to mention the headache for libraries, which would have to prepare for all possible configuration values. That's why the current RFC leaves ''$_GET'' out of its scope.
=== Vote ===
==== Accessing Path Segments as an Array ====
namespace Uri\Rfc3986 {
final readonly class Uri
{
...
public function getRawPathSegments(): array {}
public function getPathSegments(): array {}
public function withPathSegments(array $segments, bool $addLeadingSlashForNonEmptyRelativeUri = true): static {}
...
}
}
namespace Uri\WhatWg {
final readonly class Url
{
...
public function getPathSegments(): array {}
public function withPathSegments(array $segments): static {}
...
}
}
This way, it is possible to write the following code:
$uri = new Uri\Rfc3986\Uri("https://example.com/foo/bar/baz");
$segments = $uri->getPathSegments(); // ["foo", "bar", "baz"]
$uri = $uri->withPathSegments(["a", "b"]);
echo $uri->getPath(); // /a/b
The same also works for WHATWG URL:
$url = new Uri\WhatWg\Url("https://example.com/foo/bar/baz");
$segments = $url->getPathSegments(); // ["foo", "bar", "baz"]
$url = $url->withPathSegments(["a", "b"]);
echo $url->getPath(); // /a/b
In order to understand better why and exactly how this functionality works, we should more carefully understand how RFC 3986 defines the path and path segments: according to the specification, path segments start after the leading "/" in the path due to the following ABNF rule:
path-abempty = *( "/" segment )
That is, the ''path-abempty'' syntax only applies in case of URIs containing an [[https://datatracker.ietf.org/doc/html/rfc3986#section-3.2|authority]] component, and it declares that the path is either empty, or contains a "/" followed by a segment one or multiple times. Then segments have the following syntax:
segment = *pchar
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
That is, segments are composed of zero or multiple characters in the "pchar" charset (the exact values don't matter in this case). It should be mentioned that there are some additional special-case segment syntaxes (they are marked with ''segment-nz'' and ''segment-nz-nc'' in the ABNF syntax), but let's disregard them now for ease of understanding.
The above definitions imply that an empty path has zero segments:
$uri = new Uri\Rfc3986\Uri("https://example.com");
$segments = $uri->getPathSegments(); // []
When the path consists of a leading "/" and a string matching the ''segment'' syntax (e.g. ''/foo''), the path has one segment:
$uri = new Uri\Rfc3986\Uri("https://example.com/foo");
$segments = $uri->getPathSegments(); // ["foo"]
We can easily see based on the above example that the URI ''https://example.com/'' also has a single segment - but it's empty:
$uri = new Uri\Rfc3986\Uri("https://example.com/");
$segments = $uri->getPathSegments(); // [""]
This is perfectly valid, because segments can be empty (at least in the above case, when the URI has an authority). Another interesting question is how segments are represented when the path has a trailing slash (e.g. ''/foo/''). Consistent with the above rules, the result is the following:
$uri = new Uri\Rfc3986\Uri("https://example.com/foo/");
$segments = $uri->getPathSegments(); // ["foo", ""]
A few other special cases are also collected below:
* "%%https://%%": It means that the URI has an empty authority starting after the "%%//%%" characters, therefore the path is also empty, and therefore this URI has zero path segments
* "https:/": It means that the URI has no authority and the path starts after the ":" character (it is "/"), therefore this URI has one empty path segment
* "https:": It means that the URI has no authority, and the path starts after the ":" character (it is ""), therefore this URI has zero path segments
* "" (empty string): It means that the relative reference consists of a single path component which is empty, and therefore this relative reference has zero path segments
* "/foo": It means that the relative reference consists of a single path component which is "/foo", and therefore this relative reference has one path segment "foo"
* "foo": It means that the relative reference consists of a single path component which is "foo", and therefore this relative reference has one path segment "foo"
* "foo/": It means that the relative reference consists of a single path component which is "foo/", and therefore this relative reference has two path segments "foo" and ""
* "/": It means that the relative reference consists of a single path component which is "/", and therefore this relative reference has one empty path segment
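The rules above can be summarized with a small userland sketch that operates on an already extracted path string. The ''splitPathIntoSegments()'' function is hypothetical and performs no validation or percent-decoding:

```php
<?php
/**
 * Hypothetical helper: splits an RFC 3986 path into its segments
 * according to the rules listed above.
 */
function splitPathIntoSegments(string $path): array
{
    if ($path === "") {
        // An empty path has zero segments.
        return [];
    }

    if ($path[0] === "/") {
        // Segments start after the leading slash.
        $path = substr($path, 1);
    }

    // explode() keeps empty segments, e.g. the one after a trailing slash.
    return explode("/", $path);
}

var_dump(splitPathIntoSegments("/foo/")); // two segments: "foo" and ""
```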
The above described behavior satisfies the definitions of RFC 3986. However, one case needs disambiguation in relation to the ''withPathSegments()'' method: "/foo" vs "foo".
That's why ''Uri\Rfc3986\Uri::withPathSegments()'' has a second parameter ''$addLeadingSlashForNonEmptyRelativeUri'', which can be used to decide if a relative reference should become an absolute-path or a relative-path reference:
$uri = new Uri\Rfc3986\Uri("/foo"); // absolute-path reference
$uri = $uri->withPathSegments(["bar"], false); // The leading slash is not prepended
echo $uri->getPath(); // bar
$uri = new Uri\Rfc3986\Uri("foo"); // relative-path reference
$uri = $uri->withPathSegments(["bar"], true); // The leading slash is prepended
echo $uri->getPath(); // /bar
The ''$addLeadingSlashForNonEmptyRelativeUri'' parameter only has an effect when the URI is a relative reference and the first path segment is not empty; all other cases are unambiguous.
''Uri\Rfc3986\Uri::withPathSegments()'' and ''Uri\WhatWg\Url::withPathSegments()'' internally concatenate the input segments separated by a ''/'' character, and then trigger ''Uri\Rfc3986\Uri::withPath()'' and ''Uri\WhatWg\Url::withPath()'', respectively.
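The concatenation step can be pictured with the following simplified sketch. The ''joinPathSegments()'' function is hypothetical: it ignores the URI type, treats the leading slash as a plain boolean choice, and does not percent-encode anything:

```php
<?php
/**
 * Hypothetical helper: recomposes a path from its segments.
 * An empty segment list yields an empty path.
 */
function joinPathSegments(array $segments, bool $addLeadingSlash = true): string
{
    if ($segments === []) {
        return "";
    }

    return ($addLeadingSlash ? "/" : "") . implode("/", $segments);
}

echo joinPathSegments(["a", "b"]), "\n";    // /a/b
echo joinPathSegments(["bar"], false), "\n"; // bar
```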
==== Host Type Detection ====
namespace Uri\Rfc3986 {
enum UriHostType
{
case IPv4;
case IPv6;
case IPvFuture;
case RegisteredName;
}
final readonly class Uri
{
...
public function getHostType(): ?\Uri\Rfc3986\UriHostType {}
...
}
}
namespace Uri\WhatWg {
enum UrlHostType
{
case IPv4;
case IPv6;
case Domain;
case Opaque;
case Empty;
}
final readonly class Url
{
...
public function getHostType(): ?\Uri\WhatWg\UrlHostType {}
...
}
}
The new ''getHostType()'' methods return the type of the host component for both specifications:
$uri = new Uri\Rfc3986\Uri("https://192.168.0.1/");
var_dump($uri->getHostType()); // UriHostType::IPv4
$uri = new Uri\Rfc3986\Uri("https://[2001:db8::1]/");
var_dump($uri->getHostType()); // UriHostType::IPv6
$uri = new Uri\Rfc3986\Uri("https://[v1.1.2.3]/");
var_dump($uri->getHostType()); // UriHostType::IPvFuture
$uri = new Uri\Rfc3986\Uri("https://example.com/");
var_dump($uri->getHostType()); // UriHostType::RegisteredName
$uri = new Uri\Rfc3986\Uri("/foo/bar");
var_dump($uri->getHostType()); // null
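Without such a method, userland code has to approximate the classification itself. The sketch below shows one rough way to do so for RFC 3986 hosts using ''filter_var()'' and a simplified IPvFuture pattern; the ''SketchHostType'' enum and the ''classifyHost()'' function are hypothetical, and the checks are not guaranteed to match the parser's behavior in every edge case:

```php
<?php
// Hypothetical enum mirroring the proposed cases, for illustration only.
enum SketchHostType
{
    case IPv4;
    case IPv6;
    case IPvFuture;
    case RegisteredName;
}

function classifyHost(string $host): SketchHostType
{
    if (filter_var($host, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false) {
        return SketchHostType::IPv4;
    }

    // IP literals are enclosed in square brackets.
    if (str_starts_with($host, "[") && str_ends_with($host, "]")) {
        $literal = substr($host, 1, -1);

        if (filter_var($literal, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false) {
            return SketchHostType::IPv6;
        }

        // IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
        if (preg_match('/^v[0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&\'()*+,;=:]+$/', $literal) === 1) {
            return SketchHostType::IPvFuture;
        }

        throw new InvalidArgumentException("Unrecognized IP literal: $host");
    }

    return SketchHostType::RegisteredName;
}
```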
The same for WHATWG URL:
$url = new Uri\WhatWg\Url("https://192.168.0.1/");
var_dump($url->getHostType()); // UrlHostType::IPv4
$url = new Uri\WhatWg\Url("https://[2001:db8::1]/");
var_dump($url->getHostType()); // UrlHostType::IPv6
$url = new Uri\WhatWg\Url("https://example.com/");
var_dump($url->getHostType()); // UrlHostType::Domain
$url = new Uri\WhatWg\Url("scheme://example.com/");
var_dump($url->getHostType()); // UrlHostType::Opaque
$url = new Uri\WhatWg\Url("mailto://john.doe@example.com");
var_dump($url->getHostType()); // UrlHostType::Empty
$url = new Uri\WhatWg\Url("scheme://john.doe@example.com");
var_dump($url->getHostType()); // null
==== URI Type Detection ====
namespace Uri\Rfc3986 {
enum UriType
{
case AbsolutePathReference;
case RelativePathReference;
case NetworkPathReference;
case Uri;
}
final readonly class Uri
{
...
public function getUriType(): \Uri\Rfc3986\UriType {}
...
}
}
This way, it becomes easier to detect URI types:
$uri = new Uri\Rfc3986\Uri("https://example.com");
var_dump($uri->getUriType()); // Uri\Rfc3986\UriType::Uri
$uri = new Uri\Rfc3986\Uri("https:");
var_dump($uri->getUriType()); // Uri\Rfc3986\UriType::Uri
$uri = new Uri\Rfc3986\Uri("/foo");
var_dump($uri->getUriType()); // Uri\Rfc3986\UriType::AbsolutePathReference
$uri = new Uri\Rfc3986\Uri("foo");
var_dump($uri->getUriType()); // Uri\Rfc3986\UriType::RelativePathReference
$uri = new Uri\Rfc3986\Uri("//host.com/foo");
var_dump($uri->getUriType()); // Uri\Rfc3986\UriType::NetworkPathReference
The position of this RFC is that identifying the distinction between URIs and absolute URIs doesn't need special support, therefore a dedicated ''Uri\Rfc3986\UriType'' enum case is omitted.
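The distinction between these types follows from the first few characters of the reference. A simplified userland sketch is shown below; the ''SketchUriType'' enum and the ''classifyUriType()'' function are hypothetical, and the input is assumed to be an already valid URI reference:

```php
<?php
// Hypothetical enum mirroring the proposed cases, for illustration only.
enum SketchUriType
{
    case Uri;
    case NetworkPathReference;
    case AbsolutePathReference;
    case RelativePathReference;
}

function classifyUriType(string $reference): SketchUriType
{
    // scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
    if (preg_match('/^[A-Za-z][A-Za-z0-9+.\-]*:/', $reference) === 1) {
        return SketchUriType::Uri;
    }

    if (str_starts_with($reference, "//")) {
        return SketchUriType::NetworkPathReference;
    }

    if (str_starts_with($reference, "/")) {
        return SketchUriType::AbsolutePathReference;
    }

    return SketchUriType::RelativePathReference;
}
```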
The WHATWG URL specification defines some [[https://url.spec.whatwg.org/#is-special|special schemes]] (''http'', ''https'', ''ftp'', ''file'', ''ws'', ''wss''), which have distinct parsing and serialization rules. In order to make checks for special URLs easier to perform, a new ''Uri\WhatWg\Url::isSpecialScheme()'' method is added:
namespace Uri\WhatWg {
final readonly class Url
{
...
public function isSpecialScheme(): bool {}
...
}
}
This enables low-level control for applications that need to mirror WHATWG behaviors in parsing or normalization.
$url = new Uri\WhatWg\Url("https://example.com");
var_dump($url->isSpecialScheme()); // true
$url = new Uri\WhatWg\Url("custom:example");
var_dump($url->isSpecialScheme()); // false
==== Percent-Encoding and Decoding Support ====
It should also be mentioned that ''urlencode()'' and ''urldecode()'' should in fact rather be used for the ''application/x-www-form-urlencoded'' media type, while ''rawurlencode()'' and ''rawurldecode()'' more closely implement RFC 3986. For example, the path component assigns special meaning to the ''/'' character. Therefore, this character doesn't necessarily have to be percent-encoded in the path component. There are some cases though when it makes sense to percent-encode it, as highlighted by the [[https://wiki.php.net/rfc/url_parsing_api#advanced_examples|first example]] within the "Advanced examples" section of the original URI RFC. Unfortunately, ''rawurlencode()'' doesn't take the component into account, and replaces "/" with "%2F" unconditionally.
echo rawurlencode("/foo/bar/baz"); // %2Ffoo%2Fbar%2Fbaz
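A common userland workaround today is to encode the path segment by segment. The ''encodePathKeepingSlashes()'' function below is a hypothetical helper for illustration; note that ''rawurlencode()'' still encodes characters that RFC 3986 would allow in a path (such as ":" or "@"), so the result is more conservative than necessary:

```php
<?php
/**
 * Hypothetical helper: percent-encodes each path segment separately
 * so that the "/" separators are preserved.
 */
function encodePathKeepingSlashes(string $path): string
{
    return implode("/", array_map("rawurlencode", explode("/", $path)));
}

echo encodePathKeepingSlashes("/foo/bar baz"), "\n"; // /foo/bar%20baz
```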
In order to correctly handle percent-encoding and decoding based on the rules of RFC 3986 and WHATWG URL, the following methods and enums are proposed to be added:
namespace Uri\Rfc3986 {
enum UriPercentEncodingMode
{
case UserInfo;
case Host;
case RelativeReferencePath;
case RelativeReferenceFirstPathSegment;
case Path;
case PathSegment;
case Query;
case FormQuery;
case Fragment;
case AllReservedCharacters;
case All;
}
final readonly class Uri
{
...
public static function percentEncode(string $input, \Uri\Rfc3986\UriPercentEncodingMode $mode): string {}
public static function percentDecode(string $input, \Uri\Rfc3986\UriPercentEncodingMode $mode): string {}
...
}
}
namespace Uri\WhatWg {
enum UrlPercentEncodingMode
{
case UserInfo;
case Host;
case OpaqueHost;
case Path;
case PathSegment;
case OpaquePath;
case OpaquePathSegment;
case Query;
case SpecialQuery;
case FormQuery;
case Fragment;
}
final readonly class Url
{
...
public static function percentEncode(string $input, \Uri\WhatWg\UrlPercentEncodingMode $mode): string {}
public static function percentDecode(string $input, \Uri\WhatWg\UrlPercentEncodingMode $mode): string {}
...
}
}
The ''percentEncode()'' and ''percentDecode()'' methods both require an input string and a ''PercentEncodingMode'' enum to be passed. The enums make the context of the encoding/decoding processes fully explicit and clear. The following modes are supported:
* **Uri\Rfc3986\UriPercentEncodingMode**
* **UserInfo:** Besides [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]], [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], as well as [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|sub-delimiters]], it also allows the following characters to be present: "**:**". Any other characters are percent-encoded.
* **Host:** If the input string is a valid IPv4, an IPv6 or an IPvFuture address, no percent-encoding is performed, since these host types do not support the process. Otherwise (for registered names), [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]], [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], as well as [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|sub-delimiters]] are allowed to be present. Any other characters are percent-encoded.
* **AbsolutePathReferenceFirstSegment:** The first segment of absolute-path references cannot start with "**%%//%%**" characters (e.g. ''%%//foo%%''), otherwise the path [[https://datatracker.ietf.org/doc/html/rfc3986#section-4.2|would be confusable]] with a network-path reference. Therefore, besides [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]], [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], as well as [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|sub-delimiters]], it also allows the following characters to be present: "**:**", "**@**". Any other characters are percent-encoded.
* **RelativePathReferenceFirstSegment:** The first segment of relative-path references cannot contain a "**:**" character (e.g. ''this:that''), otherwise the path [[https://datatracker.ietf.org/doc/html/rfc3986#section-4.2|would be confusable]] with a scheme name. Therefore, besides [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]], [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], as well as [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|sub-delimiters]], it also allows the following characters to be present: "**@**". Any other characters are percent-encoded.
* **RelativeReferencePath:**
* **Path:** Besides [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]], [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], as well as [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|sub-delimiters]], it also allows the following characters to be present: "**/**", "**:**", "**@**". Any other characters are percent-encoded.
* **PathSegment:** Besides [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]], [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], as well as [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|sub-delimiters]], it also allows the following characters to be present: "**:**", "**@**". Any other characters are percent-encoded.
* **Query:** Besides [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]], [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], as well as [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|sub-delimiters]], it also allows the following characters to be present: "**:**", "**@**", "**/**", and "**?**". Any other characters are percent-encoded.
* **FormQuery:** It is mostly the same as ''Uri\Rfc3986\UriPercentEncodingMode::Query'', but it behaves according to the ''application/x-www-form-urlencoded'' media type rather than RFC 3986. The only difference between the two is that the space character (" ") is encoded as "**+**".
* **Fragment:** Besides [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]], [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], as well as [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|sub-delimiters]], it also allows the following characters to be present: "**:**", "**@**", "**/**", and "**?**". Any other characters are percent-encoded.
* **AllReservedCharacters:** All [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.2|reserved characters]] are percent-encoded. The rest of the characters are left as-is.
* **AllButUnreservedCharacters:** Besides [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.3|unreserved characters]] and [[https://datatracker.ietf.org/doc/html/rfc3986#section-2.1|percent-encoded octets]], all other characters are percent-encoded.
For the complete ABNF syntax of each component, consult [[https://datatracker.ietf.org/doc/html/rfc3986#appendix-A|Appendix A]] of RFC 3986.
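The RFC 3986 rules above can be approximated in userland as follows. This is only a minimal sketch and not the proposed native implementation: the ''sketchPercentEncode()'' function and its ''$extraAllowed'' parameter are hypothetical names, and the extra characters passed in are taken from the corresponding case descriptions above.

```php
<?php
/**
 * Hypothetical helper: percent-encodes every byte that is not an unreserved
 * character, a sub-delimiter, one of the $extraAllowed characters, or part of
 * an already percent-encoded octet.
 */
function sketchPercentEncode(string $input, string $extraAllowed): string
{
    $allowed = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~' // unreserved
        . '!$&\'()*+,;='                                                            // sub-delimiters
        . $extraAllowed;                                                            // mode-specific

    $output = '';
    $length = strlen($input);
    for ($i = 0; $i < $length; $i++) {
        $byte = $input[$i];
        $hex = (string) substr($input, $i + 1, 2);

        // Keep an existing percent-encoded octet ("%" plus two hex digits) untouched.
        if ($byte === '%' && strlen($hex) === 2 && ctype_xdigit($hex)) {
            $output .= '%' . $hex;
            $i += 2;
            continue;
        }

        $output .= strpos($allowed, $byte) !== false
            ? $byte
            : sprintf('%%%02X', ord($byte));
    }

    return $output;
}

// Extra characters per mode, as described by the cases above (assumed mapping).
echo sketchPercentEncode('bar/baz', ':@'), "\n";     // PathSegment: bar%2Fbaz
echo sketchPercentEncode('this:that', '@'), "\n";    // RelativePathReferenceFirstSegment: this%3Athat
echo sketchPercentEncode('a b?c#d', ':@/?'), "\n";   // Query: a%20b?c%23d
```

Modelling each mode as "the common base set plus a few extra characters" keeps the differences between the cases visible at a glance.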
* **Uri\WhatWg\UrlPercentEncodingMode**
* **UserInfo:** Besides the code points percent-encoded by ''Uri\WhatWg\UrlPercentEncodingMode::Path'', the following code points are percent-encoded: U+002F (**/**), U+003A (**:**), U+003B (**;**), U+003D (**=**), U+0040 (**@**), U+005B (**[**) to U+005D (**]**), inclusive, and U+007C (**|**).
* **OpaqueHost:** [[https://infra.spec.whatwg.org/#c0-control|Control characters]], and all [[https://url.spec.whatwg.org/#c0-control-percent-encode-set|code points greater than ~]] are percent-encoded.
* **Path:** Besides the code points percent-encoded by ''Uri\WhatWg\UrlPercentEncodingMode::Query'', the following code points are percent-encoded: U+003F (**?**), U+005E (**^**), U+0060 (**`**), U+007B (**{**), and U+007D (**}**).
* **PathSegment:** Besides the code points percent-encoded by ''Uri\WhatWg\UrlPercentEncodingMode::Query'', the following code points are percent-encoded: U+003F (**?**), U+005E (**^**), U+0060 (**`**), U+007B (**{**), U+007D (**}**), and U+002F (**/**).
* **OpaquePathSegment:**
* **Query:** Besides [[https://infra.spec.whatwg.org/#c0-control|Control characters]], and all [[https://url.spec.whatwg.org/#c0-control-percent-encode-set|code points greater than ~]], the following code points are percent-encoded: U+0020 SPACE, U+0022 (**"**), U+0023 (**#**), U+003C (**<**), and U+003E (**>**).
* **SpecialQuery:** Besides the code points percent-encoded by ''Uri\WhatWg\UrlPercentEncodingMode::Query'', the following code point is percent-encoded: U+0027 (**'**).
* **FormQuery:** Besides the code points percent-encoded by ''Uri\WhatWg\UrlPercentEncodingMode::UserInfo'', the following code points are percent-encoded: U+0024 (**$**) to U+0026 (**&**), inclusive, U+002B (**+**), U+002C (**,**), U+0021 (**!**), U+0027 (**'**) to U+0029 RIGHT PARENTHESIS, inclusive, and U+007E (**~**).
* **Fragment:** Besides [[https://infra.spec.whatwg.org/#c0-control|Control characters]], and all [[https://url.spec.whatwg.org/#c0-control-percent-encode-set|code points greater than ~]], the following code points are percent-encoded: U+0020 SPACE, U+0022 (**"**), U+003C (**<**), U+003E (**>**), and U+0060 (**`**).
Since neither RFC 3986 nor WHATWG URL supports percent-encoded characters inside the scheme component, neither enum contains a ''Scheme'' case. WHATWG URL automatically percent-decodes the host when [[https://wiki.php.net/rfc/uri_followup#determining_if_the_whatwg_url_is_special|it's special]], so ''Uri\WhatWg\UrlPercentEncodingMode'' doesn't contain a ''Host'' case.
The ''percentDecode()'' methods perform the inverse operation of ''percentEncode()'': they decode every percent-encoded octet whose decoded character is allowed by the given percent-encoding mode, and leave all other percent-encoded octets untouched.
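The way the WHATWG percent-encode sets build on each other can be sketched in userland as follows. This is a hedged illustration and not the proposed implementation: the ''$...Set'' variables and the ''sketchUrlPercentEncode()'' function are hypothetical names, the sets are copied from the case descriptions above, and the input is assumed to be UTF-8 so that encoding it byte by byte is equivalent to encoding code points.

```php
<?php
// Hypothetical sets mirroring the enum cases described above.
$querySet        = [' ', '"', '#', '<', '>'];
$specialQuerySet = array_merge($querySet, ["'"]);
$pathSet         = array_merge($querySet, ['?', '^', '`', '{', '}']);
$pathSegmentSet  = array_merge($pathSet, ['/']);
$userInfoSet     = array_merge($pathSet, ['/', ':', ';', '=', '@', '[', '\\', ']', '|']);

/**
 * Hypothetical helper: percent-encodes C0 control characters, every byte
 * greater than "~" (0x7E), and every character listed in $encodeSet.
 */
function sketchUrlPercentEncode(string $input, array $encodeSet): string
{
    $output = '';
    for ($i = 0, $length = strlen($input); $i < $length; $i++) {
        $byte = $input[$i];
        $code = ord($byte);
        $mustEncode = $code <= 0x1F || $code > 0x7E || in_array($byte, $encodeSet, true);
        $output .= $mustEncode ? sprintf('%%%02X', $code) : $byte;
    }

    return $output;
}

echo sketchUrlPercentEncode("a b'c", $querySet), "\n";          // a%20b'c
echo sketchUrlPercentEncode("a b'c", $specialQuerySet), "\n";   // a%20b%27c
echo sketchUrlPercentEncode('foo/bar?', $pathSegmentSet), "\n"; // foo%2Fbar%3F
echo sketchUrlPercentEncode('user:p@ss', $userInfoSet), "\n";   // user%3Ap%40ss
```

The ''FormQuery'' case is left out of the sketch on purpose, because the ''application/x-www-form-urlencoded'' serialization involves more than a larger encode set.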
$uri = new Uri\Rfc3986\Uri("https://example.com#_%23%2F"); // The fragment is the percent-encoded form of "_#/"
echo Uri\Rfc3986\Uri::percentDecode(
$uri->getFragment(),
Uri\Rfc3986\UriPercentEncodingMode::Fragment
); // _%23/
The "/" character is allowed in the fragment, so it's needlessly percent-encoded in the URI; that's why it can be percent-decoded by ''percentDecode()''. On the other hand, "#" is not allowed in the fragment, so it's kept in its percent-encoded octet form.
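The mode-aware decoding rule can be approximated in userland as follows. This is a minimal sketch, assuming the allowed character set of the ''Fragment'' case as listed earlier; ''sketchPercentDecode()'' and its ''$extraAllowed'' parameter are hypothetical names and not part of the proposed API.

```php
<?php
/**
 * Hypothetical helper: decodes a percent-encoded octet only when the decoded
 * character is allowed in the target component; everything else stays as-is.
 */
function sketchPercentDecode(string $input, string $extraAllowed): string
{
    $allowed = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~' // unreserved
        . '!$&\'()*+,;='                                                            // sub-delimiters
        . $extraAllowed;                                                            // mode-specific

    return preg_replace_callback(
        '/%([0-9A-Fa-f]{2})/',
        static function (array $match) use ($allowed): string {
            $char = chr((int) hexdec($match[1]));

            // Decode only if the resulting character may appear literally.
            return strpos($allowed, $char) !== false ? $char : $match[0];
        },
        $input
    );
}

// Fragment mode additionally allows ":", "@", "/" and "?" (assumed mapping).
echo sketchPercentDecode('_%23%2F', ':@/?'), "\n"; // _%23/
echo sketchPercentDecode('a%20b%3Fc', ':@/?'), "\n"; // a%20b?c
```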
RFC 3986 contains a sentence that appears to contradict the behavior of ''Uri\Rfc3986\Uri::percentDecode()'':
> Thus, characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI.
According to this rule, reserved characters - even if they are allowed in the context of a component - should not be percent-decoded during normalization. Even though the ''Uri\Rfc3986\Uri'' getters respect this rule, the ''percentDecode()'' method intentionally disregards it so that it can serve in use-cases where those getters cannot. Let's see an example:
$uri = new Uri\Rfc3986\Uri("https://example.com/?q=%3A%29"); // The query is the percent-encoded form of "q=:)"
echo $uri->getQuery(); // q=%3A%29
echo Uri\Rfc3986\Uri::percentDecode(
$uri->getQuery(),
Uri\Rfc3986\UriPercentEncodingMode::Query
); // q=:)
As can be seen above, the ''getQuery()'' getter leaves the two percent-encoded reserved characters ("%3A" and "%29") as-is, even though both ":" and ")" are allowed in the context of the query (so they don't need to be percent-encoded at all). By using ''percentDecode()'', one can make the input directly consumable, while scheme-specific or producer-specific algorithms should continue to use the getters when they need to perform any kind of custom processing.
The proposed percent-encoding and decoding capabilities make it possible to implement many use-cases in a specification-compliant way that were difficult to get right before.
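For contrast, the normalization rule quoted from RFC 3986 only permits decoding percent-encoded //unreserved// characters. The following sketch illustrates that rule in isolation; ''sketchDecodeUnreservedOnly()'' is a hypothetical userland function and is not meant to reproduce the exact behavior of the getters.

```php
<?php
/**
 * Hypothetical helper: decodes percent-encoded octets only when they
 * represent unreserved characters, so reserved characters stay protected.
 */
function sketchDecodeUnreservedOnly(string $component): string
{
    $unreserved = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~';

    return preg_replace_callback(
        '/%([0-9A-Fa-f]{2})/',
        static function (array $match) use ($unreserved): string {
            $char = chr((int) hexdec($match[1]));

            return strpos($unreserved, $char) !== false ? $char : $match[0];
        },
        $component
    );
}

// "%7E" is the unreserved "~" and is decoded; "%3A" and "%29" are reserved and stay encoded.
echo sketchDecodeUnreservedOnly('q=%7Euser%3A%29'), "\n"; // q=~user%3A%29
```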
For example, path segments can be properly percent-encoded when they contain the ''/'' character:
$uri = new Uri\Rfc3986\Uri("https://example.com");
$uri = $uri->withPathSegments(
[
"foo",
Uri\Rfc3986\Uri::percentEncode("bar/baz", Uri\Rfc3986\UriPercentEncodingMode::PathSegment)
]
);
$uri->toRawString(); // https://example.com/foo/bar%2Fbaz
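For comparison, a rough approximation with functions that already exist in PHP is sketched below. It relies on ''rawurlencode()'', which percent-encodes everything except unreserved characters and is therefore stricter than the ''PathSegment'' mode (it also encodes sub-delimiters, ":" and "@"). The ''$segments'', ''$path'' and ''$decoded'' variable names are only illustrative.

```php
<?php
$segments = ['foo', 'bar/baz'];

// Encode each segment separately so that a literal "/" inside a segment
// cannot be mistaken for a segment separator.
$path = '/' . implode('/', array_map('rawurlencode', $segments));
echo $path, "\n"; // /foo/bar%2Fbaz

// Splitting on "/" first and decoding afterwards restores the original segments.
$decoded = array_map('rawurldecode', explode('/', ltrim($path, '/')));
var_dump($decoded === $segments); // bool(true)
```

Encoding before joining and decoding after splitting is what keeps the round trip lossless; doing it the other way around would turn ''bar/baz'' into two separate segments.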