PHP RFC: URI/URL Path Segment Support

Introduction

Sometimes it is necessary to access individual path segments rather than the whole path as a string. In that case, manually splitting the path into segments after retrieving it is both inconvenient and slower than necessary, especially since Uri\Rfc3986\Uri already stores the path internally as a list of segments.

Proposal

This RFC therefore proposes adding the following methods:

namespace Uri\Rfc3986 {
    final readonly class Uri
    {
        ...
 
        public function getRawPathSegments(): array {}
 
        public function getPathSegments(): array {}
 
        public function getDecodedPathSegments(): array {}
 
        public function withPathSegments(array $segments, \Uri\Rfc3986\LeadingSlashPolicy $leadingSlashPolicy = \Uri\Rfc3986\LeadingSlashPolicy::AddForNonEmptyRelative): static {}
 
        ...
    }
 
    enum LeadingSlashPolicy
    {
        case AddForNonEmptyRelative;
        case NeverAdd;
    }
}
 
namespace Uri\WhatWg {
    final readonly class Url
    {
        ...
 
        public function getPathSegments(): array|string {}
 
        public function getDecodedPathSegments(): array|string {}
 
        public function withPathSegments(array $segments): static {}
 
        ...
    }
}

This way, it is possible to write the following code:

$uri = new Uri\Rfc3986\Uri("https://example.com/foo/bar/baz");
$segments = $uri->getPathSegments();        // ["foo", "bar", "baz"]
 
$uri = $uri->withPathSegments(["a", "b"]);
echo $uri->getPath();                       // /a/b

The same also works for WHATWG URL:

$url = new Uri\WhatWg\Url("https://example.com/foo/bar/baz");
$segments = $url->getPathSegments();        // ["foo", "bar", "baz"]
 
$url = $url->withPathSegments(["a", "b"]);
echo $url->getPath();                       // /a/b

Uri\Rfc3986\Uri::getPathSegments() and Uri\WhatWg\Url::getPathSegments() split the path according to the rules of their respective specification.

Uri\Rfc3986\Uri::withPathSegments() and Uri\WhatWg\Url::withPathSegments() internally concatenate the input segments separated by a / character, and then trigger the respective withPath() method to update the path.
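The concatenation step can be sketched in userland PHP. This is only an illustration, not the proposed implementation: joinPathSegments() is a hypothetical helper, and it assumes a URI with an authority, where every segment is preceded by a “/”.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of ext/uri: builds a path string from
 * segments, assuming a URI with an authority (every segment follows a "/").
 */
function joinPathSegments(array $segments): string
{
    $path = '';
    foreach ($segments as $segment) {
        $path .= '/' . $segment;
    }

    return $path;
}

echo joinPathSegments(['a', 'b']), "\n";    // /a/b
echo joinPathSegments(['foo', '']), "\n";   // /foo/
echo joinPathSegments([]), "\n";            // (empty path)
```

In the proposal, the resulting string would then be handed to withPath(), so the usual path validation still applies.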

Segment definition of RFC 3986

To understand why and how this functionality works, it helps to look more closely at how RFC 3986 defines the path and its segments. According to the specification, path segments start after the leading “/”, as the following ABNF rule shows:

path-abempty  = *( "/" segment )

That is, the path-abempty syntax only applies to URIs containing an authority component, and it declares that the path is either empty, or consists of one or more repetitions of a “/” followed by a segment. Segments in turn have the following syntax:

segment       = *pchar
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

That is, segments are composed of zero or more characters from the “pchar” character set (the exact values don't matter in this case). There are some additional special-case segment syntaxes (marked with segment-nz and segment-nz-nc in the ABNF syntax), but let's disregard them for now for ease of understanding.
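For readers who want to experiment with the “segment” rule, it can be approximated with a regular expression. The following is a minimal sketch: isValidSegment() is a hypothetical helper, not part of ext/uri, and the character classes are taken from the unreserved, pct-encoded and sub-delims rules of RFC 3986.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: checks a string against "segment = *pchar".
 * pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
 */
function isValidSegment(string $segment): bool
{
    $pattern = '/^(?:[A-Za-z0-9._~!$&\'()*+,;=:@-]|%[0-9A-Fa-f]{2})*$/D';

    return preg_match($pattern, $segment) === 1;
}

var_dump(isValidSegment(''));        // bool(true): segments may be empty
var_dump(isValidSegment('foo'));     // bool(true)
var_dump(isValidSegment('a%2Fb'));   // bool(true): "/" is percent-encoded
var_dump(isValidSegment('a/b'));     // bool(false): a raw "/" separates segments
var_dump(isValidSegment('a b'));     // bool(false): a space must be percent-encoded
```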

The above definitions imply that an empty path has zero segments:

$uri = new Uri\Rfc3986\Uri("https://example.com");
$segments = $uri->getPathSegments();        // []

When the path consists of a leading “/” and a string matching the segment syntax (e.g. /foo), the path has one segment:

$uri = new Uri\Rfc3986\Uri("https://example.com/foo");
$segments = $uri->getPathSegments();        // ["foo"]

We can easily see based on the above example that the URI https://example.com/ also has a single segment - but it's empty:

$uri = new Uri\Rfc3986\Uri("https://example.com/");
$segments = $uri->getPathSegments();        // [""]

This is perfectly valid, because segments can be empty (at least in the above case, when the URI has an authority). Another interesting question is how segments are represented when the path has a trailing slash (e.g. /foo/). Consistent with the above rules, the result is the following:

$uri = new Uri\Rfc3986\Uri("https://example.com/foo/");
$segments = $uri->getPathSegments();        // ["foo", ""]

A few other special cases are also collected below:

  • “https://#foo”: It means that the URI has an empty authority starting after the “//” characters, and the path is also empty, and therefore this URI has zero path segments
  • “https:/”: It means that the URI has no authority and the path starts after the “:” character (it is “/”), therefore this URI has one empty path segment
  • “https:”: It means that the URI has no authority, and the path starts after the “:” character (it is “”), therefore this URI has zero path segments
  • “” (empty string): It means that the relative reference consists of a single path component which is empty, and therefore this relative reference has zero path segments
  • “/foo”: It means that the relative reference consists of a single path component which is “/foo”, and therefore this relative reference has one path segment “foo”
  • “foo”: It means that the relative reference consists of a single path component which is “foo”, and therefore this relative reference has one path segment “foo”
  • “foo/”: It means that the relative reference consists of a single path component which is “foo/”, and therefore this relative reference has two path segments “foo” and “”
  • “/”: It means that the relative reference consists of a single path component which is “/”, and therefore this relative reference has one empty path segment
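The cases listed above can be reproduced with a small userland sketch that operates on the path string alone. splitPathIntoSegments() is a hypothetical helper written for illustration only; it assumes the path has already been extracted from the URI and it doesn't validate the segments.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: splits an already extracted path into segments
 * following the rules described above.
 */
function splitPathIntoSegments(string $path): array
{
    if ($path === '') {
        return [];                    // an empty path has zero segments
    }

    if ($path[0] === '/') {
        $path = substr($path, 1);     // segments start after the leading "/"
    }

    return explode('/', $path);
}

echo json_encode(splitPathIntoSegments('')), "\n";       // []
echo json_encode(splitPathIntoSegments('/')), "\n";      // [""]
echo json_encode(splitPathIntoSegments('/foo')), "\n";   // ["foo"]
echo json_encode(splitPathIntoSegments('foo')), "\n";    // ["foo"]
echo json_encode(splitPathIntoSegments('foo/')), "\n";   // ["foo",""]
```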

Segment definition of WHATWG URL

As always, WHATWG URL has similar, but somewhat different rules. First of all, the definition of a path segment is the following:

A URL path segment is an ASCII string. It commonly refers to a directory or a file, but has no predefined meaning.

Even though it's not a very specific definition, it aligns with the segment definition of RFC 3986. Then WHATWG URL defines the path component based on segments:

A URL path is either a URL path segment or a list of zero or more URL path segments.

This is a major shift from RFC 3986, because it states that the path can be either a single segment or a list of segments, depending on the following cases:

  • Opaque URLs have an opaque path, which cannot be divided any further into segments, therefore the path consists of a single path segment, represented as a string
  • The path of special URLs can be split into segments, therefore the path consists of a list of zero or more path segments, represented as an array of strings

This behavior is consistent with how WHATWG URL categorizes host types: only known, special hosts are inspected, and no assumptions are made about unknown ones. The same applies to paths.

Let's see a few typical examples, first related to special URLs where WHATWG URL behaves the same way as RFC 3986:

$url = new Uri\WhatWg\Url("https://example.com/");
$segments = $url->getPathSegments();           // [""]
 
$url = new Uri\WhatWg\Url("https://example.com/foo");
$segments = $url->getPathSegments();           // ["foo"]
 
$url = new Uri\WhatWg\Url("https://example.com/foo/");
$segments = $url->getPathSegments();           // ["foo", ""]
 
$url = new Uri\WhatWg\Url("https://example.com/foo/bar");
$segments = $url->getPathSegments();           // ["foo", "bar"]

Now let's see how non-special URLs behave:

$url = new Uri\WhatWg\Url("scheme://example.com/");
$segments = $url->getPathSegments();           // "/"
 
$url = new Uri\WhatWg\Url("scheme://example.com/foo");
$segments = $url->getPathSegments();           // "/foo"
 
$url = new Uri\WhatWg\Url("scheme://example.com/foo/");
$segments = $url->getPathSegments();           // "/foo/"
 
$url = new Uri\WhatWg\Url("scheme://example.com/foo/bar");
$segments = $url->getPathSegments();           // "/foo/bar"

Consistent with WHATWG URL's definition, Uri\WhatWg\Url::getPathSegments() returns the whole path as a string in case of non-special URLs.
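Because of the array|string return type, calling code has to handle both shapes. The sketch below emulates the behavior described in this section in userland so that it can be tried without the proposed methods. sketchWhatWgPathSegments() and describeSegments() are hypothetical helpers, and the list of special schemes is the one defined by the WHATWG URL specification.

```php
<?php
declare(strict_types=1);

// Special schemes as defined by the WHATWG URL specification.
const SPECIAL_SCHEMES = ['ftp', 'file', 'http', 'https', 'ws', 'wss'];

/**
 * Hypothetical emulation of the behavior described in this section:
 * a list of segments for special schemes, the unsplit path otherwise.
 */
function sketchWhatWgPathSegments(string $scheme, string $path): array|string
{
    if (!in_array(strtolower($scheme), SPECIAL_SCHEMES, true)) {
        return $path;
    }

    // Assumption: the path of a special URL always starts with "/".
    return $path === '' ? [] : explode('/', substr($path, 1));
}

/** Hypothetical caller that handles both return shapes. */
function describeSegments(array|string $segments): string
{
    return is_array($segments)
        ? count($segments) . ' segment(s)'
        : 'unsplit path: ' . $segments;
}

echo describeSegments(sketchWhatWgPathSegments('https', '/foo/bar')), "\n";    // 2 segment(s)
echo describeSegments(sketchWhatWgPathSegments('scheme', '/foo/bar')), "\n";   // unsplit path: /foo/bar
```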

$url = new Uri\WhatWg\Url("scheme://example.com/");
$url = $url->withPathSegments(["foo"]);
echo $url->getPath();                          // "/foo"
 
$url = new Uri\WhatWg\Url("scheme://example.com/");
$url = $url->withPathSegments(["foo", ""]);
echo $url->getPath();                          // "/foo/"

Even though opaque paths cannot be split into segments, path modification via Uri\WhatWg\Url::withPathSegments() still works the same way as for non-opaque paths: the $segments argument is concatenated into a path string, and the path component is overwritten with this value.

Ambiguity

There's one edge case which needs disambiguation in relation to the RFC 3986 specification and the withPathSegments() method. Let's consider the following example:

$uri = new Uri\Rfc3986\Uri("/foo");            // absolute-path reference
 
$uri = $uri->withPathSegments(["bar"]);        // should the result be "/bar" or "bar"?

In this case, it would be ambiguous whether the resulting URI is an absolute- or a relative-path reference.

That's why Uri\Rfc3986\Uri::withPathSegments() has a second parameter $leadingSlashPolicy, which can be used to decide if a relative reference should become an absolute- or a relative-path reference:

$uri = new Uri\Rfc3986\Uri("/foo");            // absolute-path reference
 
$uri = $uri->withPathSegments(["bar"], Uri\Rfc3986\LeadingSlashPolicy::NeverAdd); // The leading slash is not prepended
 
echo $uri->getPath();                          // bar
 
$uri = new Uri\Rfc3986\Uri("foo");             // relative-path reference
 
$uri = $uri->withPathSegments(["bar"], Uri\Rfc3986\LeadingSlashPolicy::AddForNonEmptyRelative);  // The leading slash is prepended
 
echo $uri->getPath();                          // /bar

The Uri\Rfc3986\LeadingSlashPolicy::AddForNonEmptyRelative enum case only has an effect when the URI is a relative reference and the first path segment is not empty. All other cases are unambiguous.
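The decision can be sketched for relative references as follows. This is a guess at the logic based only on the description above, not the proposed implementation: SketchLeadingSlashPolicy and buildRelativeReferencePath() are hypothetical names, and the handling of an empty first segment is an assumption.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the proposed Uri\Rfc3986\LeadingSlashPolicy enum.
enum SketchLeadingSlashPolicy
{
    case AddForNonEmptyRelative;
    case NeverAdd;
}

/**
 * Hypothetical helper: builds the path of a relative reference
 * (no scheme, no authority) from its segments.
 */
function buildRelativeReferencePath(
    array $segments,
    SketchLeadingSlashPolicy $policy = SketchLeadingSlashPolicy::AddForNonEmptyRelative
): string {
    $path = implode('/', $segments);
    $firstIsNonEmpty = $segments !== [] && $segments[array_key_first($segments)] !== '';

    // Assumption: the slash is only prepended when the first segment is non-empty.
    if ($policy === SketchLeadingSlashPolicy::AddForNonEmptyRelative && $firstIsNonEmpty) {
        return '/' . $path;
    }

    return $path;
}

echo buildRelativeReferencePath(['bar']), "\n";                                       // /bar
echo buildRelativeReferencePath(['bar'], SketchLeadingSlashPolicy::NeverAdd), "\n";   // bar
```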

Since WHATWG URL doesn't support relative references, there's no case which needs disambiguation, and that's why the $leadingSlashPolicy parameter is not needed when modifying path segments.

Percent-Encoding and Decoding

Path segment retrieval works the same way as path retrieval does. In case of RFC 3986, the getRawPathSegments() and getPathSegments() methods follow the percent-decoding behavior of getRawPath() and getPath(), respectively. Furthermore, a getDecodedPathSegments() method is added to improve user experience.

  • getRawPathSegments(): Returns the path segments non-normalized, without any post-processing.
  • getPathSegments(): Returns the normalized path segments, without any post-processing.
  • getDecodedPathSegments(): Returns the normalized path segments, with the percent-encoded octets decoded.

Let's see what the distinction is between the above methods in practice:

$uri = new Uri\Rfc3986\Uri("/fo%6F/bar%2fbaz/qux%20quux");  // percent-encoded form of "/foo/bar/baz/qux quux"
 
$segments = $uri->getRawPathSegments();          // ["fo%6F", "bar%2fbaz", "qux%20quux"]
$segments = $uri->getPathSegments();             // ["foo", "bar%2Fbaz", "qux%20quux"]
$segments = $uri->getDecodedPathSegments();      // ["foo", "bar/baz", "qux quux"]

Uri\Rfc3986\Uri::getRawPathSegments() returns the path separated into segments as-is.

Uri\Rfc3986\Uri::getPathSegments() returns the normalized path separated into segments, consistent with other Uri\Rfc3986\Uri getters, as seen in the Advanced Examples of the original ext/uri RFC. That is, reserved characters are not percent-decoded, as mentioned in the Generic percent-decoding introduction of the original ext/uri RFC.
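The effect of normalization on a single segment can be approximated in userland. The sketch below applies the two percent-encoding normalization steps of RFC 3986 (uppercasing hexadecimal digits and decoding unreserved characters). normalizeSegmentEncoding() is a hypothetical helper, and the assumption is that the normalized getters behave this way for percent-encoded octets.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: normalizes the percent-encoding of one segment.
 * Unreserved characters are decoded, everything else stays encoded
 * with uppercase hexadecimal digits.
 */
function normalizeSegmentEncoding(string $segment): string
{
    $result = preg_replace_callback(
        '/%([0-9A-Fa-f]{2})/',
        static function (array $match): string {
            $char = chr((int) hexdec($match[1]));

            // unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
            if (preg_match('/^[A-Za-z0-9._~-]$/D', $char) === 1) {
                return $char;
            }

            return '%' . strtoupper($match[1]);
        },
        $segment
    );

    return $result ?? $segment;
}

echo normalizeSegmentEncoding('fo%6F'), "\n";        // foo
echo normalizeSegmentEncoding('bar%2fbaz'), "\n";    // bar%2Fbaz
echo normalizeSegmentEncoding('qux%20quux'), "\n";   // qux%20quux
```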

Finally, Uri\Rfc3986\Uri::getDecodedPathSegments() returns the normalized path separated into segments whose contents are percent-decoded. Going beyond Uri\Rfc3986\Uri::getPathSegments(), it percent-decodes every percent-encoded octet, including the “%2F” percent-encoded octet representing the “/” character.

Although “/” is a reserved character in the generic URI syntax, once the path has already been split, it is no longer syntactically ambiguous within an individual segment. As a result, this method intentionally goes beyond a strict application of the generic syntax defined by RFC 3986. By decoding reserved characters, it provides an application-level interpretation of the path segments rather than a purely syntactic representation.

Uri\WhatWg\Url::getDecodedPathSegments() also decodes all percent-encoded octets the same way as Uri\Rfc3986\Uri::getDecodedPathSegments() does.
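The order of operations is what makes full decoding safe here. The following userland sketch, which uses only explode() and rawurldecode() rather than the proposed methods, compares splitting before decoding with decoding before splitting:

```php
<?php
declare(strict_types=1);

$path = '/foo/bar%2Fbaz/qux%20quux';

// Split first, then decode each segment: the encoded "/" stays inside its segment.
$splitThenDecode = array_map('rawurldecode', explode('/', substr($path, 1)));

// Decode first, then split: the decoded "/" is mistaken for a separator.
$decodeThenSplit = explode('/', substr(rawurldecode($path), 1));

echo json_encode($splitThenDecode, JSON_UNESCAPED_SLASHES), "\n";   // ["foo","bar/baz","qux quux"]
echo json_encode($decodeThenSplit, JSON_UNESCAPED_SLASHES), "\n";   // ["foo","bar","baz","qux quux"]
```

The first result matches what getDecodedPathSegments() is described to return, while the second one shows why manually decoding the whole path before splitting it gives a different segment count.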

Design considerations

Should Uri\Rfc3986\Uri and Uri\WhatWg\Url really support path segment handling?

Some schemes don't use “/” to express the hierarchy inside the path according to their scheme-specific processing rules: e.g. in case of the “mailto” scheme, the “@” separates the “local name” and the “domain name” in the path (nobody@example.com). Some schemes don't support hierarchy in the path component at all. This raises the question of whether Uri\Rfc3986\Uri should really support path segments as described: since Uri\Rfc3986\Uri is supposed to implement the generic URI syntax, its functionality should apply to all URIs.

The answer is yes, because the generic URI syntax itself uses path segments and the “/” separator to define the path component (remember the ABNF rules above!). Separating the path into segments may be semantically incorrect for some schemes, but syntactically speaking, path segments are first-class citizens of the generic URI syntax.

WHATWG URL goes one step further: it explicitly defines how the path is separated into segments only in case of some specific schemes (special URLs), and it explicitly leaves this undefined for the rest of the URLs (opaque URLs). This way, there's no gap between the syntactic and semantic interpretation of path segments.

Why isn't there a PathSegments class?

Path segments could be modeled as a dedicated class (e.g. PathSegments) rather than simple arrays. Mainly, this would improve their extensibility: new features could easily be added to this class in the future. However, the current RFC still chooses the array model for a few reasons.

First, it's unclear how and when validation should happen:

  • Should the PathSegments class always be valid, just like the rest of the ext/uri classes? If the answer is no, then there will be a discrepancy, which would possibly go against user expectations.
  • However, if the answer is yes, then there are a few big hurdles:
    • Should we add a dedicated class per specification? WHATWG URL and RFC 3986 have vastly different path definitions, so ideally there should indeed be two classes, or at the very least, a dedicated factory method per specification.
    • To complicate things, both specifications use a context-sensitive algorithm for path validation: WHATWG URL has the notion of opaque paths which depend on the scheme and the presence of the host. RFC 3986 has different segment parsing rules for relative-path references, absolute-path references, and URIs (see segment, segment-nz, segment-nz-nc).

Backward Incompatible Changes

None.

Proposed PHP Version(s)

Next minor version (PHP 8.6)

RFC Impact

To the Ecosystem

What effect will the RFC have on IDEs, Language Servers (LSPs), Static Analyzers, Auto-Formatters, Linters and commonly used userland PHP libraries?

To Existing Extensions

None.

To SAPIs

None.

Future Scope

Voting Choices

The vote requires a 2/3 majority to be accepted.

Add support for accessing path segments as an array as outlined in the RFC?

Implementation

After the RFC is implemented, this section should contain:

  1. the version(s) it was merged into
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature

References

Rejected Features

Keep this updated with features that were discussed on the mail lists.

Changelog

If there are major changes to the initial proposal, please include a short summary with a date or a link to the mailing list announcement here, as not everyone has access to the wikis' version history.

rfc/uri_path_segments.txt · Last modified: by kocsismate