====== Add str_iter() for UTF-8 code-point iteration ======

  * Date: 2026-03-24
  * Author: Masaki Kagaya masakielastic@gmail.com
  * Status: Draft
  * Implementation

===== Introduction =====

PHP does not currently expose a small shared primitive for iterating over a string as UTF-8 code points. This RFC proposes such a primitive in the form of ''str_iter()''. Its scope is limited to traversal. Validation behavior for ill-formed UTF-8 and grapheme-cluster handling are left for separate discussion.

===== Proposal =====

This RFC proposes adding the following function to PHP:

<PHP>
str_iter(string $str): Traversable
</PHP>

Example:

<PHP>
foreach (str_iter("héllo🙂") as $index => $char) {
    var_dump($index, $char);
}
</PHP>

Output:

<PHP>
int(0)
string(1) "h"
int(1)
string(2) "é"
int(2)
string(1) "l"
int(3)
string(1) "l"
int(4)
string(1) "o"
int(5)
string(4) "🙂"
</PHP>

''str_iter()'' returns a ''Traversable'' that iterates over the input string as a sequence of UTF-8 code points.

Each iteration yields a string containing a single UTF-8 code point. Keys are zero-based integers representing the iteration index rather than byte offsets.

''str_iter()'' does not validate whether the input is well-formed UTF-8, does not report encoding errors, and does not define replacement behavior for ill-formed byte sequences. For ill-formed input, traversal still guarantees forward progress by consuming at least one byte per iteration.

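The returned object supports repeated iteration: it can be passed to more than one ''foreach'' loop, and each loop starts again from the beginning of the string. Its concrete internal representation is not part of this RFC.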
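
The semantics described above can be approximated in userland. The following is a minimal sketch, not the proposed implementation. The class name ''StrIterSketch'' is hypothetical, and its handling of ill-formed input (yielding the offending lead byte as a one-byte string) is only one possible behavior that satisfies the forward-progress guarantee, since this RFC leaves that behavior unspecified.

<PHP>
final class StrIterSketch implements IteratorAggregate
{
    public function __construct(private string $str) {}

    // A fresh generator is created per loop, so the object can be iterated repeatedly.
    public function getIterator(): Generator
    {
        $len = strlen($this->str);
        $offset = 0;
        $index = 0;

        while ($offset < $len) {
            $lead = ord($this->str[$offset]);

            // Sequence length implied by the lead byte.
            if ($lead < 0x80) {
                $width = 1;
            } elseif ($lead >= 0xC2 && $lead <= 0xDF) {
                $width = 2;
            } elseif ($lead >= 0xE0 && $lead <= 0xEF) {
                $width = 3;
            } elseif ($lead >= 0xF0 && $lead <= 0xF4) {
                $width = 4;
            } else {
                $width = 1; // stray continuation byte or invalid lead byte
            }

            // Assumption of this sketch: if a trailing byte is missing or is not
            // a continuation byte, consume only the lead byte. Overlong and
            // surrogate encodings are not detected here.
            for ($i = 1; $i < $width; $i++) {
                if ($offset + $i >= $len
                    || (ord($this->str[$offset + $i]) & 0xC0) !== 0x80) {
                    $width = 1;
                    break;
                }
            }

            // Keys are iteration indexes, not byte offsets.
            yield $index++ => substr($this->str, $offset, $width);
            $offset += $width;
        }
    }
}

$chars = new StrIterSketch("héllo🙂");
$first = iterator_to_array($chars);
$second = iterator_to_array($chars); // a second pass restarts from the beginning
</PHP>

Implementing ''IteratorAggregate'' rather than returning a bare generator is what allows the second pass in this sketch, because a generator cannot be rewound once it has advanced.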

===== Backward Incompatible Changes =====

None.

===== Scope of this RFC =====

This RFC is limited to UTF-8 code-point traversal. It does not attempt to define validation behavior, grapheme-cluster handling, or code-point-aware length and substring APIs.

===== Motivation =====

PHP already exposes several APIs that operate at or near UTF-8 code-point boundaries, but it does not expose traversal itself as a small shared primitive. As a result, userland code that needs code-point traversal must either depend on extension-specific APIs or reconstruct traversal indirectly through other operations. This RFC proposes exposing traversal itself as a core building block.
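
As an illustration of the indirect approach, one common userland idiom splits a string on an empty pattern in PCRE's UTF-8 mode. The sketch below uses only the existing ''preg_split()'' function; the helper name ''code_points_via_pcre()'' is hypothetical.

<PHP>
function code_points_via_pcre(string $str): array|false
{
    // With the "u" modifier the subject is treated as UTF-8, so the empty
    // pattern matches between code points rather than between bytes.
    return preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
}

$chars = code_points_via_pcre("héllo🙂"); // list of six one-code-point strings
$bad   = code_points_via_pcre("a\xC3");   // false: PCRE rejects ill-formed UTF-8
</PHP>

This works, but it materializes the whole result as an array and fails outright on ill-formed input instead of traversing it.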

===== Survey of existing practice and project needs =====

==== WordPress ====

WordPress 6.9 introduces a UTF-8 fallback parser that is used to answer higher-level questions such as string length and the spans of ill-formed byte sequences.((https://make.wordpress.org/core/2025/11/18/modernizing-utf-8-support-in-wordpress-6-9/))

==== Symfony Polyfill ====

Symfony Polyfill provides mbstring compatibility in environments where mbstring may be unavailable. Its Mbstring implementation describes itself as “iconv based, UTF-8 centric,” and is organized around compatibility with existing mbstring-style APIs.


==== PHP core ====

PHP already contains internal UTF-8 traversal logic in ''ext/standard/html.c'', including ''php_next_utf8_char()''. This shows that UTF-8 code-point traversal is not a new concern for PHP itself.

==== mbstring ====

The mbstring extension already exposes APIs that operate at or near code-point boundaries, including ''mb_ord()'', ''mb_chr()'', and ''mb_str_split()''. These APIs indicate existing demand for code-point-oriented string operations, but they do not expose traversal itself as a shared primitive.


==== intl ====

The intl extension already exposes code-point-aware boundary iteration through ''IntlCodePointBreakIterator''. More generally, ''IntlBreakIterator'' exposes text-boundary iteration in terms of successive UTF-8 byte offsets. This shows that code-point-level traversal is already a recognized operation in PHP.


===== Design Rationale =====

==== Why define traversal as an iterator-based API? ====

Traversal is naturally expressed as sequential iteration over a string.
An iterator-based API exposes that operation directly without requiring array materialization.
It also keeps this RFC focused on traversal semantics, rather than introducing additional questions about indexing, offsets, argument behavior, and compatibility with existing substring or length APIs.
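
To illustrate the point about array materialization, the sketch below reads only a short prefix of a large string through a lazy iterator. Because ''str_iter()'' does not exist yet, a hypothetical generator ''lazy_code_points()'' stands in for it. It assumes well-formed UTF-8 and is not the proposed implementation; ''first_code_points()'' is likewise a name chosen for this example.

<PHP>
function lazy_code_points(string $str): Generator
{
    $len = strlen($str);
    for ($offset = 0; $offset < $len; $offset += $width) {
        $lead = ord($str[$offset]);
        // Width taken from the lead byte alone; assumes well-formed input.
        $width = match (true) {
            $lead >= 0xF0 => 4,
            $lead >= 0xE0 => 3,
            $lead >= 0xC0 => 2,
            default => 1,
        };
        yield substr($str, $offset, $width);
    }
}

function first_code_points(iterable $chars, int $limit): string
{
    $out = '';
    foreach ($chars as $i => $char) {
        if ($i >= $limit) {
            break; // the remainder of the string is never decoded
        }
        $out .= $char;
    }
    return $out;
}

$large = str_repeat("héllo🙂", 100000);
echo first_code_points(lazy_code_points($large), 3), "\n"; // hél
</PHP>

An array-returning API would have to split all 600,000 code points before the caller could discard most of them.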

==== Why introduce str_iter() in core rather than mb_iter() in mbstring? ====

This RFC does not treat code-point traversal as an mbstring-specific feature. As the survey shows, related traversal needs already appear across core, extensions, frameworks, and CMS fallbacks. For that reason, this RFC proposes ''str_iter()'' as a core shared primitive rather than an mbstring-specific API.

==== Why does this RFC leave validation behavior unspecified for ill-formed UTF-8? ====

This RFC defines traversal semantics, not validation policy.
Detecting ill-formed UTF-8 and deciding how to handle it are separate questions with their own API and compatibility implications.
For that reason, ''str_iter()'' does not report encoding errors and does not define replacement behavior. It only guarantees forward progress during traversal.
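
Callers that need validation can already apply it as a separate step before traversal. The following is a minimal sketch of that separation, relying on the existing behavior of ''preg_match()'' with the ''u'' modifier, which fails on ill-formed subjects. The helper name ''is_well_formed_utf8()'' is hypothetical and is not proposed by this RFC.

<PHP>
function is_well_formed_utf8(string $str): bool
{
    // With the "u" modifier PCRE checks the subject's UTF-8 validity before
    // matching; preg_match() returns false for ill-formed input and 1 otherwise,
    // because the empty pattern always matches a valid subject.
    return preg_match('//u', $str) === 1;
}

var_dump(is_well_formed_utf8("héllo🙂")); // bool(true)
var_dump(is_well_formed_utf8("a\xC3"));   // bool(false)
</PHP>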

==== Why does this RFC iterate over code points rather than grapheme clusters? ====

This RFC defines traversal over UTF-8 code points because that is the smaller primitive. Grapheme-cluster handling involves separate boundary rules and different user-facing expectations. It is therefore left for separate discussion.
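
The difference is visible with a combining sequence. The sketch below uses existing PCRE features as stand-ins only: splitting on an empty pattern with the ''u'' modifier approximates code-point traversal, while ''\X'' matches one extended grapheme cluster. It is illustrative and not part of the proposal.

<PHP>
// "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
$text = "e\u{0301}";

// Code-point view: two elements, "e" and the combining mark.
$codePoints = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

// Grapheme-cluster view: one element containing both code points.
preg_match_all('/\X/u', $text, $matches);
$graphemes = $matches[0];

var_dump(count($codePoints)); // int(2)
var_dump(count($graphemes));  // int(1)
</PHP>

Under this RFC, ''str_iter()'' would correspond to the first view.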

===== Future Scope =====

==== Validation primitives ====

Validation remains a distinct question outside the scope of this RFC. Whether PHP should expose a shared validation primitive, and what form that API should take, can be discussed separately.

==== Grapheme cluster support ====

Grapheme-cluster traversal involves different boundary semantics and different user-facing expectations. It therefore remains outside the scope of this RFC.

==== Length and substring APIs ====

Code-point-aware length and substring APIs raise additional questions around indexing, offsets, argument behavior, and compatibility with existing APIs. They are therefore left for separate discussion.