====== Add FILTER_VALIDATE_STRLEN for UTF-8 code point length validation ======

  * Date: 2026-03-19
  * Author: Masaki Kagaya masakielastic@gmail.com
  * Status: Draft
  * Implementation: https://github.com/php/php-src/pull/21429


===== Introduction =====
This RFC treats code-point length validation as part of PHP’s built-in validation baseline rather than as a specialized text-processing feature.

===== Proposal =====
This RFC proposes adding a new validation filter, ''FILTER_VALIDATE_STRLEN''.

The filter validates the length of a string measured in Unicode code points derived from UTF-8 input.

Example:

<PHP>
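<?php
$result = filter_var("hello😀", FILTER_VALIDATE_STRLEN, [
    "options" => [
        "min_len" => 6, // lower bound, in code points
        "max_len" => 6, // upper bound, in code points
    ],
]);
// Per this RFC, $result is the original string "hello😀":
// 5 ASCII letters plus 1 emoji are 6 code points (9 bytes in UTF-8).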
</PHP>

In this example, validation succeeds because the input length is exactly 6 Unicode code points, so the original string is returned.

''FILTER_VALIDATE_STRLEN'' accepts ''min_len'' and ''max_len'' as length bounds. At least one of these options must be specified. If both are specified, ''min_len'' must be less than or equal to ''max_len''.

On success, the filter returns the original string. On failure, it returns false, or null if ''FILTER_NULL_ON_FAILURE'' is used.

This filter performs length validation only. It does not validate whether the input is well-formed UTF-8.
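
The rules above can be approximated in userland on current PHP versions. The following is a minimal sketch, not the implementation from the pull request. The function name ''strlen_filter_sketch()'' is hypothetical. Counting code points as non-continuation bytes is an assumption about how malformed input would be treated, and throwing ''ValueError'' for invalid options is likewise an assumption, since this RFC does not specify how option errors are reported.

<PHP>
<?php
// Hypothetical userland approximation of the proposed filter (requires PHP 8.0+).
function strlen_filter_sketch(string $value, array $options, bool $nullOnFailure = false): string|false|null
{
    $min = $options['min_len'] ?? null;
    $max = $options['max_len'] ?? null;

    // Assumption: option errors are reported by throwing ValueError.
    if ($min === null && $max === null) {
        throw new ValueError('At least one of min_len or max_len must be specified');
    }
    if ($min !== null && $max !== null && $min > $max) {
        throw new ValueError('min_len must be less than or equal to max_len');
    }

    // Count every byte that is not a UTF-8 continuation byte (0x80-0xBF).
    // For well-formed UTF-8 this equals the number of code points.
    // No well-formedness check is performed.
    $length = 0;
    for ($i = 0, $n = strlen($value); $i < $n; $i++) {
        if ((ord($value[$i]) & 0xC0) !== 0x80) {
            $length++;
        }
    }

    if (($min !== null && $length < $min) || ($max !== null && $length > $max)) {
        return $nullOnFailure ? null : false;
    }

    return $value;
}

var_dump(strlen_filter_sketch("hello😀", ['min_len' => 6, 'max_len' => 6])); // the original string
var_dump(strlen_filter_sketch("hello😀", ['max_len' => 5]));                 // bool(false)
var_dump(strlen_filter_sketch("hello😀", ['max_len' => 5], true));           // NULL
var_dump(strlen_filter_sketch("", ['max_len' => 0]));                        // string(0) ""
</PHP>

The third parameter stands in for ''FILTER_NULL_ON_FAILURE''; with the proposed filter, that flag would be passed through the usual ''flags'' entry of the options array.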

===== Backward Incompatible Changes =====
None. This RFC adds a new validation filter and does not change the behavior of existing filters.

===== Proposed PHP Version(s) =====
Next PHP 8.x (PHP 8.6).

===== FAQ =====
==== Why add a length validator to filter? ====

Validation baseline consistency.

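Length validation is a common rule, but it is currently expressed through fragmented approaches: ''strlen()'' counts bytes, ''mb_strlen()'' depends on the mbstring extension, ''grapheme_strlen()'' depends on the intl extension, and many applications use their own userland checks. Placing length validation in the filter API provides a shared validation baseline for this class of check.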

==== Why does this validator assume UTF-8 input? ====

UTF-8 keeps the feature narrowly scoped and semantically clear.

A code-point-based validator needs a defined encoding context. Multi-encoding support would raise additional questions about API shape and behavior, which are outside the scope of this RFC.

==== Why measure length in Unicode code points instead of grapheme clusters? ====

Predictable validation semantics.

Code-point length provides a stable and deterministic measure suitable for validation. Grapheme-cluster-based validation raises additional boundary and API design questions, so this RFC keeps the scope limited to code-point length.
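
To make the difference between the units concrete, the sketch below counts bytes, code points, and grapheme clusters for one string using only functions available in a default PHP build. It illustrates the units, not the proposed filter. The helper names are hypothetical, and the PCRE ''\X'' escape is used here as a stand-in for grapheme segmentation.

<PHP>
<?php
// Hypothetical helper: counts code points in well-formed UTF-8.
function count_code_points(string $s): int
{
    // With the "u" modifier, "." matches one code point;
    // the "s" modifier lets it match newlines as well.
    $n = preg_match_all('/./su', $s);
    if ($n === false) {
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }
    return $n;
}

// Hypothetical helper: counts extended grapheme clusters via PCRE's \X.
function count_graphemes(string $s): int
{
    $n = preg_match_all('/\X/u', $s);
    if ($n === false) {
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }
    return $n;
}

// "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
$s = "e\u{0301}";

var_dump(strlen($s));            // int(3): bytes
var_dump(count_code_points($s)); // int(2): code points, the unit used by this RFC
var_dump(count_graphemes($s));   // int(1): user-perceived characters
</PHP>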

==== Why not reuse min_range and max_range? ====

''min_len'' and ''max_len'' make the option semantics specific to string length.

''min_range'' and ''max_range'' are associated with numeric validation in filter. Using length-specific names keeps the constraint explicit and avoids reusing numeric range terminology for a different validation model.

==== Why is max_len = 0 allowed? ====

Allowing ''max_len = 0'' keeps the boundary rules complete and consistent.

A zero upper bound has a clear meaning: only the empty string can satisfy it. Disallowing it would add a special case without improving the model.

==== Why not validate UTF-8? ====

Ambiguous failure semantics.

The filter API returns a single validation result, which cannot distinguish between length failure and encoding failure. Keeping this filter limited to length validation avoids combining two different validation concerns into one result.
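
A caller that needs to tell the two failure causes apart can run an encoding check first and a length check second. The sketch below shows one way to do that on current PHP versions. The function name and the returned labels are hypothetical, ''preg_match()'' with an empty pattern and the ''u'' modifier is used as a common idiom for checking UTF-8 well-formedness, and the code point count stands in for the proposed filter.

<PHP>
<?php
// Hypothetical helper that reports encoding and length failures separately.
function classify_input(string $value, int $maxLen): string
{
    // With the "u" modifier, preg_match() returns false for malformed UTF-8 subjects.
    if (preg_match('//u', $value) !== 1) {
        return 'invalid_encoding';
    }

    // The input is now known to be well-formed, so "." matches exactly one code point.
    $length = preg_match_all('/./su', $value);

    return $length > $maxLen ? 'too_long' : 'ok';
}

var_dump(classify_input("hello😀", 6)); // string(2) "ok"
var_dump(classify_input("hello😀", 5)); // string(8) "too_long"
var_dump(classify_input("\xFF", 6));    // string(16) "invalid_encoding"
</PHP>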

==== Is a UTF-8 encoding validator needed? ====

Separate design question.

This RFC does not address UTF-8 validity. Whether encoding validation should be introduced, and how it should be exposed, remains a separate design question.