PHP does not currently expose a small shared primitive for iterating over a string as UTF-8 code points. This RFC proposes such a primitive in the form of str_iter(). Its scope is limited to traversal. Validation behavior for ill-formed UTF-8 and grapheme-cluster handling are left for separate discussion.
This RFC proposes adding the following function to PHP:
str_iter(string $str): Traversable
Example:
foreach (str_iter("héllo🙂") as $index => $char) { var_dump($index, $char); }
Output:
int(0)
string(1) "h"
int(1)
string(2) "é"
int(2)
string(1) "l"
int(3)
string(1) "l"
int(4)
string(1) "o"
int(5)
string(4) "🙂"
str_iter() returns a Traversable that iterates over the input string as a sequence of UTF-8 code points.
For well-formed input, each iteration yields a string containing the UTF-8 encoding of a single code point. Keys are zero-based integers that count iterations; they are not byte offsets.
str_iter() does not validate whether the input is well-formed UTF-8, does not report encoding errors, and does not define replacement behavior for ill-formed byte sequences. For ill-formed input, traversal still guarantees forward progress by consuming at least one byte per iteration.
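The forward-progress guarantee can be illustrated with a userland generator. This is a minimal sketch, not the proposed implementation. The name str_iter_sketch() is hypothetical, and the way ill-formed bytes are grouped here (a stray byte is yielded alone, a truncated sequence is yielded as the bytes seen so far) is an assumption, since this RFC deliberately leaves that behavior unspecified. The sketch classifies sequences by lead byte only and does not reject overlong or surrogate encodings.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical userland approximation of str_iter() (illustration only).
 * Assumption: ill-formed bytes are passed through unchanged, one chunk per
 * iteration, so that concatenating all chunks reproduces the input.
 */
function str_iter_sketch(string $str): Generator
{
    $len = strlen($str);
    $pos = 0;
    while ($pos < $len) {
        $lead = ord($str[$pos]);

        // Expected sequence length, judged from the lead byte alone.
        if ($lead < 0x80) {
            $want = 1;                       // ASCII
        } elseif ($lead >= 0xC2 && $lead <= 0xDF) {
            $want = 2;
        } elseif ($lead >= 0xE0 && $lead <= 0xEF) {
            $want = 3;
        } elseif ($lead >= 0xF0 && $lead <= 0xF4) {
            $want = 4;
        } else {
            $want = 1;                       // stray continuation or invalid lead
        }

        // Take continuation bytes (10xxxxxx) while they are available.
        $take = 1;
        while (
            $take < $want
            && $pos + $take < $len
            && (ord($str[$pos + $take]) & 0xC0) === 0x80
        ) {
            $take++;
        }

        // Keys are assigned automatically: 0, 1, 2, ...
        yield substr($str, $pos, $take);

        // $take is always >= 1, so every iteration consumes at least one byte.
        $pos += $take;
    }
}

// A truncated three-byte sequence (E2 82) followed by "b".
foreach (str_iter_sketch("a\xE2\x82b") as $i => $chunk) {
    printf("%d: %s\n", $i, bin2hex($chunk));
}
// 0: 61
// 1: e282
// 2: 62
```

Because the loop advances by at least one byte on every pass, it terminates for any input, which is the only property this RFC guarantees for ill-formed data.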
The returned object supports repeated iteration. Its concrete internal representation is not part of this RFC.
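Repeated iteration is worth stating explicitly because a plain Generator cannot be traversed again once it has finished. One way a userland approximation could meet this requirement is an IteratorAggregate that creates a fresh generator for each loop. The sketch below is illustrative only: the class name CodePointSequence is hypothetical, it assumes well-formed UTF-8 input, and it says nothing about how the proposed implementation is structured.

```php
<?php
declare(strict_types=1);

// Hypothetical class; the RFC does not specify the concrete return type.
final class CodePointSequence implements IteratorAggregate
{
    public function __construct(private string $str)
    {
    }

    // Called once per foreach, so every loop starts from the first byte.
    public function getIterator(): Generator
    {
        $len = strlen($this->str);
        $pos = 0;
        while ($pos < $len) {
            $lead = ord($this->str[$pos]);
            // Assumes well-formed input: the lead byte determines the width.
            $width = $lead < 0x80 ? 1 : ($lead < 0xE0 ? 2 : ($lead < 0xF0 ? 3 : 4));
            yield substr($this->str, $pos, $width);
            $pos += $width;
        }
    }
}

$seq = new CodePointSequence("h\u{E9}llo");

$first = [];
foreach ($seq as $char) {
    $first[] = $char;
}

$second = [];
foreach ($seq as $char) {
    $second[] = $char;
}

var_dump($first === $second); // bool(true)
```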
None.
This RFC is limited to UTF-8 code-point traversal. It does not attempt to define validation behavior, grapheme-cluster handling, or code-point-aware length and substring APIs.
PHP already exposes several APIs that operate at or near UTF-8 code-point boundaries, but it does not expose traversal itself as a small shared primitive. As a result, userland code that needs code-point traversal must either depend on extension-specific APIs or reconstruct traversal indirectly through other operations. This RFC proposes exposing traversal itself as a core building block.
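As an illustration of reconstructing traversal indirectly, one common userland idiom splits on an empty pattern in PCRE's UTF-8 mode. The helper name code_points_via_pcre() below is hypothetical. Note that this approach materializes the whole array up front and fails outright on ill-formed input, rather than traversing it.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper showing an existing workaround, not the proposed API.
 * Returns null when PCRE rejects the subject as ill-formed UTF-8.
 */
function code_points_via_pcre(string $str): ?array
{
    // The /u modifier makes PCRE treat the subject as UTF-8, so the empty
    // pattern matches between code points rather than between bytes.
    $parts = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

    return $parts === false ? null : $parts;
}

var_dump(code_points_via_pcre("h\u{E9}llo") === ['h', "\u{E9}", 'l', 'l', 'o']); // bool(true)

// Ill-formed input: preg_split() returns false and records the error.
var_dump(code_points_via_pcre("a\xFFb"));                  // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);       // bool(true)
```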
WordPress 6.9 introduces a UTF-8 fallback parser that is used to answer higher-level questions such as string length and the spans of ill-formed byte sequences.1)
Symfony Polyfill provides mbstring compatibility in environments where mbstring may be unavailable. Its Mbstring implementation describes itself as “iconv based, UTF-8 centric,” and is organized around compatibility with existing mbstring-style APIs.
PHP already contains internal UTF-8 traversal logic in ext/standard/html.c, including php_next_utf8_char. This shows that UTF-8 code-point traversal is not a new concern for PHP itself.
The mbstring extension already exposes APIs that operate at or near code-point boundaries, including mb_ord(), mb_chr(), and mb_str_split(). These APIs indicate existing demand for code-point-oriented string operations, but they do not expose traversal itself as a shared primitive.
The intl extension already exposes code-point-aware boundary iteration through IntlCodePointBreakIterator. More generally, IntlBreakIterator exposes text-boundary iteration in terms of successive UTF-8 byte offsets. This shows that code-point-level traversal is already a recognized operation in PHP.
Traversal is naturally expressed as sequential iteration over a string. An iterator-based API exposes that operation directly without requiring array materialization. It also keeps this RFC focused on traversal semantics, rather than introducing additional questions about indexing, offsets, argument behavior, and compatibility with existing substring or length APIs.
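The point about avoiding array materialization can be sketched with a consumer that stops early. Both function names below are hypothetical, and code_points_lazy() is a stand-in that assumes well-formed UTF-8; it is not the proposed implementation. The consumer returns at the first multi-byte code point, and the rest of the string is never decoded.

```php
<?php
declare(strict_types=1);

// Hypothetical lazy stand-in for str_iter(); assumes well-formed UTF-8.
function code_points_lazy(string $str): Generator
{
    $len = strlen($str);
    $pos = 0;
    while ($pos < $len) {
        $lead = ord($str[$pos]);
        $width = $lead < 0x80 ? 1 : ($lead < 0xE0 ? 2 : ($lead < 0xF0 ? 3 : 4));
        yield substr($str, $pos, $width);
        $pos += $width;
    }
}

// Returns the first code point encoded with more than one byte, or null.
function first_non_ascii(iterable $codePoints): ?string
{
    foreach ($codePoints as $codePoint) {
        if (strlen($codePoint) > 1) {
            return $codePoint; // stop here; later code points are never produced
        }
    }

    return null;
}

$gen = code_points_lazy("a\u{E9}bc");
var_dump(first_non_ascii($gen) === "\u{E9}"); // bool(true)

// The generator is still paused at index 1; "b" and "c" were never produced.
var_dump($gen->key()); // int(1)
```

An array-returning API would have to decode and allocate the entire input before the caller could inspect the first element.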
This RFC does not treat code-point traversal as an mbstring-specific feature. As the survey shows, related traversal needs already appear across core, extensions, frameworks, and CMS fallbacks. For that reason, this RFC proposes str_iter() as a core shared primitive rather than an mbstring-specific API.
This RFC defines traversal semantics, not validation policy.
Detecting ill-formed UTF-8 and deciding how to handle it are separate questions with their own API and compatibility implications.
For that reason, str_iter() does not report encoding errors and does not define replacement behavior. It only guarantees forward progress during traversal.
This RFC defines traversal over UTF-8 code points because that is the smaller primitive. Grapheme-cluster handling involves separate boundary rules and different user-facing expectations. It is therefore left for separate discussion.
Validation remains a distinct question outside the scope of this RFC. Whether PHP should expose a shared validation primitive, and what form that API should take, can be discussed separately.
Grapheme-cluster traversal involves different boundary semantics and different user-facing expectations. It therefore remains outside the scope of this RFC.
Code-point-aware length and substring APIs raise additional questions around indexing, offsets, argument behavior, and compatibility with existing APIs. They are therefore left for separate discussion.