
Add str_iter() for UTF-8 code-point iteration

  • Date: 2026-03-24
  • Author: Masaki Kagaya <masakielastic@gmail.com>
  • Status: Draft
  • Implementation

Introduction

PHP does not currently expose a small shared primitive for iterating over a string as UTF-8 code points. This RFC proposes such a primitive in the form of str_iter(). Its scope is limited to traversal. Validation behavior for ill-formed UTF-8 and grapheme-cluster handling are left for separate discussion.

Proposal

This RFC proposes adding the following function to PHP:

str_iter(string $str): Traversable

Example:

foreach (str_iter("héllo🙂") as $index => $char) {
    var_dump($index, $char);
}

Output:

int(0)
string(1) "h"
int(1)
string(2) "é"
int(2)
string(1) "l"
int(3)
string(1) "l"
int(4)
string(1) "o"
int(5)
string(4) "🙂"

str_iter() returns a Traversable that iterates over the input string as a sequence of UTF-8 code points.

Each iteration yields a string containing a single UTF-8 code point. Keys are zero-based integers representing the iteration index rather than byte offsets.

str_iter() does not validate whether the input is well-formed UTF-8, does not report encoding errors, and does not define replacement behavior for ill-formed byte sequences. For ill-formed input, traversal still guarantees forward progress by consuming at least one byte per iteration.
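The forward-progress guarantee can be illustrated with a userland sketch. The function name str_iter_sketch() is hypothetical, and the rule it uses for ill-formed input (fall back to a single byte whenever the lead byte or its continuation bytes do not fit) is an assumption made for this example only. The RFC itself promises nothing beyond at least one byte per iteration.

```php
<?php

/**
 * Hypothetical userland approximation of the proposed traversal.
 * Assumption: the lead byte decides the expected width, and any
 * sequence that does not fit is emitted one byte at a time.
 */
function str_iter_sketch(string $str): Generator
{
    $length = strlen($str);
    $offset = 0;

    while ($offset < $length) {
        $lead = ord($str[$offset]);

        // Expected sequence width derived from the lead byte.
        if ($lead >= 0xC2 && $lead <= 0xDF) {
            $width = 2;
        } elseif ($lead >= 0xE0 && $lead <= 0xEF) {
            $width = 3;
        } elseif ($lead >= 0xF0 && $lead <= 0xF4) {
            $width = 4;
        } else {
            $width = 1; // ASCII, stray continuation byte, or invalid lead byte
        }

        // Fall back to one byte if the sequence is truncated or a
        // trailing byte is not a continuation byte (10xxxxxx).
        for ($i = 1; $i < $width; $i++) {
            if ($offset + $i >= $length || (ord($str[$offset + $i]) & 0xC0) !== 0x80) {
                $width = 1;
                break;
            }
        }

        // Without an explicit key, generator keys count up from 0,
        // so they are iteration indices rather than byte offsets.
        yield substr($str, $offset, $width);
        $offset += $width;
    }
}

foreach (str_iter_sketch("a\xFFb") as $index => $chunk) {
    printf("%d: %s\n", $index, bin2hex($chunk));
}
// 0: 61
// 1: ff
// 2: 62
```

The stray 0xFF byte is yielded on its own and traversal continues with the next byte, so the loop always terminates.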

The returned object supports repeated iteration. Its concrete internal representation is not part of this RFC.
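Repeated iteration is worth stating explicitly because a plain PHP generator is single-pass. The sketch below shows one possible way to obtain a re-iterable Traversable in userland: an IteratorAggregate that builds a fresh generator for every loop. The class name ReiterableSketch and the preg_split()-based stand-in traversal (well-formed input only) are assumptions for illustration and say nothing about how the proposed implementation represents its result.

```php
<?php

/**
 * Hypothetical wrapper: every foreach calls getIterator(), which
 * asks the factory for a brand-new iterator.
 */
final class ReiterableSketch implements IteratorAggregate
{
    public function __construct(private Closure $factory)
    {
    }

    public function getIterator(): Iterator
    {
        return ($this->factory)();
    }
}

// Stand-in traversal (assumption: input is well-formed UTF-8).
$codePoints = static function (string $str): Generator {
    yield from preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
};

// A bare generator cannot be traversed a second time.
$plain = $codePoints("a\u{F1}");
foreach ($plain as $codePoint) {
    // first pass consumes the generator
}
$secondPassFailed = false;
try {
    foreach ($plain as $codePoint) {
        // never reached
    }
} catch (Throwable $e) {
    $secondPassFailed = true;
}

// The wrapper yields the same sequence on every pass.
$reiterable = new ReiterableSketch(static fn (): Generator => $codePoints("a\u{F1}"));
$first = iterator_to_array($reiterable);
$second = iterator_to_array($reiterable);
```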

Backward Incompatible Changes

None.

Scope of this RFC

This RFC is limited to UTF-8 code-point traversal. It does not attempt to define validation behavior, grapheme-cluster handling, or code-point-aware length and substring APIs.

Motivation

PHP already exposes several APIs that operate at or near UTF-8 code-point boundaries, but it does not expose traversal itself as a small shared primitive. As a result, userland code that needs code-point traversal must either depend on extension-specific APIs or reconstruct traversal indirectly through other operations. This RFC proposes exposing traversal itself as a core building block.

Survey of existing practice and project needs

WordPress

WordPress 6.9 introduces a UTF-8 fallback parser that is used to answer higher-level questions such as string length and the spans of ill-formed byte sequences.1)

Symfony Polyfill

Symfony Polyfill provides mbstring compatibility in environments where mbstring may be unavailable. Its Mbstring implementation describes itself as “iconv based, UTF-8 centric,” and is organized around compatibility with existing mbstring-style APIs.

PHP core

PHP already contains internal UTF-8 traversal logic in ext/standard/html.c, including php_next_utf8_char. This shows that UTF-8 code-point traversal is not a new concern for PHP itself.

mbstring

The mbstring extension already exposes APIs that operate at or near code-point boundaries, including mb_ord(), mb_chr(), and mb_str_split(). These APIs indicate existing demand for code-point-oriented string operations, but they do not expose traversal itself as a shared primitive.

intl

The intl extension already exposes code-point-aware boundary iteration through IntlCodePointBreakIterator. More generally, IntlBreakIterator exposes text-boundary iteration in terms of successive UTF-8 byte offsets. This shows that code-point-level traversal is already a recognized operation in PHP.

Design Rationale

Why define traversal as an iterator-based API?

Traversal is naturally expressed as sequential iteration over a string. An iterator-based API exposes that operation directly without requiring array materialization. It also keeps this RFC focused on traversal semantics, rather than introducing additional questions about indexing, offsets, argument behavior, and compatibility with existing substring or length APIs.
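The practical difference from an array-returning API shows up when a caller stops early. In the sketch below, a hypothetical lazy_code_points() generator looks for the first non-ASCII code point in a long string and stops there; the rest of the string is never split. The byte-oriented regular expression is an assumption chosen to keep the example short: it checks only lead-byte and continuation-byte ranges, so it is not a validator.

```php
<?php

/**
 * Hypothetical lazy traversal: one preg_match() call per code point,
 * anchored at the current byte offset with \G. The pattern runs in
 * byte mode (no /u) and falls back to a single byte when no
 * multi-byte alternative fits.
 */
function lazy_code_points(string $str): Generator
{
    $pattern = '/\G(?:[\xC2-\xDF][\x80-\xBF]'
        . '|[\xE0-\xEF][\x80-\xBF]{2}'
        . '|[\xF0-\xF4][\x80-\xBF]{3}'
        . '|.)/s';
    $offset = 0;

    while (preg_match($pattern, $str, $match, 0, $offset) === 1) {
        yield $match[0];
        $offset += strlen($match[0]);
    }
}

$gen = lazy_code_points("ab\u{E9}" . str_repeat('x', 1000));

$found = null;
foreach ($gen as $index => $codePoint) {
    if (strlen($codePoint) > 1) {
        $found = [$index, $codePoint]; // first multi-byte code point
        break;
    }
}

// The generator is suspended, not exhausted: the trailing 1000 bytes
// were never visited.
var_dump($gen->valid()); // bool(true)
```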

Why introduce str_iter() in core rather than mb_iter() in mbstring?

This RFC does not treat code-point traversal as an mbstring-specific feature. As the survey shows, related traversal needs already appear across core, extensions, frameworks, and CMS fallbacks. For that reason, this RFC proposes str_iter() as a core shared primitive rather than an mbstring-specific API.

Why does this RFC leave validation behavior unspecified for ill-formed UTF-8?

This RFC defines traversal semantics, not validation policy. Detecting ill-formed UTF-8 and deciding how to handle it are separate questions with their own API and compatibility implications. For that reason, str_iter() does not report encoding errors and does not define replacement behavior. It only guarantees forward progress during traversal.
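Because validation stays separate, a caller that needs well-formed input can check the string before traversing it. A minimal sketch using an existing core facility: preg_match() with the /u modifier fails on an ill-formed subject. The helper name is_well_formed_utf8() is hypothetical and not part of this proposal.

```php
<?php

/**
 * Hypothetical helper: true if $str is well-formed UTF-8.
 * With /u, preg_match() returns false for an ill-formed subject;
 * the empty pattern matches any well-formed one.
 */
function is_well_formed_utf8(string $str): bool
{
    return preg_match('//u', $str) === 1;
}

var_dump(is_well_formed_utf8("h\u{E9}llo")); // bool(true)
var_dump(is_well_formed_utf8("a\xFFb"));     // bool(false)
```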

Why does this RFC iterate over code points rather than grapheme clusters?

This RFC defines traversal over UTF-8 code points because that is the smaller primitive. Grapheme-cluster handling involves separate boundary rules and different user-facing expectations. It is therefore left for separate discussion.

Future Scope

Validation primitives

Validation remains a distinct question outside the scope of this RFC. Whether PHP should expose a shared validation primitive, and what form that API should take, can be discussed separately.

Grapheme cluster support

Grapheme-cluster traversal involves different boundary semantics and different user-facing expectations. It therefore remains outside the scope of this RFC.

Length and substring APIs

Code-point-aware length and substring APIs raise additional questions around indexing, offsets, argument behavior, and compatibility with existing APIs. They are therefore left for separate discussion.
