This RFC treats code-point length validation as part of PHP’s built-in validation baseline rather than as a specialized text-processing feature.
This RFC proposes adding a new validation filter, FILTER_VALIDATE_STRLEN.
The filter validates the length of a string measured in Unicode code points derived from UTF-8 input.
Example:
<?php
filter_var("hello😀", FILTER_VALIDATE_STRLEN, [
    "options" => [
        "min_len" => 6,
        "max_len" => 6,
    ],
]);
In this example, validation succeeds: the input is 9 bytes long but contains exactly 6 Unicode code points (five ASCII letters plus U+1F600), so the original string is returned.
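The difference between byte length and code-point length can be reproduced with functions that already exist today. The snippet below is only an illustration: counting the bytes that are not UTF-8 continuation bytes is an assumed way to obtain the code-point count, not the implementation proposed by this RFC.

```php
<?php
// Same input as the example above, written with an escape sequence
// so the snippet does not depend on the file's encoding.
$input = "hello\u{1F600}";

// strlen() counts bytes: 5 ASCII bytes + 4 bytes for U+1F600.
$bytes = strlen($input);

// Every code point in well-formed UTF-8 starts with exactly one byte
// outside the continuation range 0x80-0xBF, so counting those bytes
// yields the code-point length (assumed counting rule).
$codePoints = preg_match_all('/[^\x80-\xBF]/', $input);

var_dump($bytes);      // int(9)
var_dump($codePoints); // int(6)
```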
FILTER_VALIDATE_STRLEN accepts min_len and max_len as length bounds. At least one of these options must be specified. If both are specified, min_len must be less than or equal to max_len.
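The option rules above can be sketched in userland. Everything in this sketch is hypothetical: the function name `validate_strlen_sketch`, its parameters, and the choice to throw a `ValueError` for invalid options are assumptions made for illustration, since this section does not state how the filter reports an invalid option combination.

```php
<?php
// Hypothetical userland emulation of the proposed option rules.
function validate_strlen_sketch(string $value, ?int $minLen = null, ?int $maxLen = null): string|false
{
    // At least one bound must be given (error style is an assumption).
    if ($minLen === null && $maxLen === null) {
        throw new ValueError('At least one of min_len or max_len must be specified');
    }
    // When both bounds are given, min_len must not exceed max_len.
    if ($minLen !== null && $maxLen !== null && $minLen > $maxLen) {
        throw new ValueError('min_len must be less than or equal to max_len');
    }

    // Assumed counting rule: bytes that are not UTF-8 continuation bytes.
    $length = preg_match_all('/[^\x80-\xBF]/', $value);

    if ($minLen !== null && $length < $minLen) {
        return false;
    }
    if ($maxLen !== null && $length > $maxLen) {
        return false;
    }

    // On success the original string is returned unchanged.
    return $value;
}

$exact    = validate_strlen_sketch("hello\u{1F600}", 6, 6); // returns the input
$tooShort = validate_strlen_sketch("hello", minLen: 6);      // false
$fits     = validate_strlen_sketch("hello", maxLen: 5);      // "hello"
```

Named arguments are used here only to make it obvious which bound is being set; they stand in for the `min_len` and `max_len` keys of the `options` array.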
On success, the filter returns the original string. On failure, it returns false, or null if FILTER_NULL_ON_FAILURE is used.
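The return convention is the one existing validation filters already follow. The snippet below demonstrates it with `FILTER_VALIDATE_INT`, which is available today; the proposed filter is expected to behave the same way with respect to `false` and `null`, returning the original string instead of an integer on success.

```php
<?php
// Failure without the flag: false.
$plainFailure = filter_var("abc", FILTER_VALIDATE_INT);

// Failure with FILTER_NULL_ON_FAILURE: null.
$nullFailure = filter_var("abc", FILTER_VALIDATE_INT, FILTER_NULL_ON_FAILURE);

// Success is unaffected by the flag.
$success = filter_var("42", FILTER_VALIDATE_INT, FILTER_NULL_ON_FAILURE);

var_dump($plainFailure); // bool(false)
var_dump($nullFailure);  // NULL
var_dump($success);      // int(42)
```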
This filter performs length validation only. It does not validate whether the input is well-formed UTF-8.
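To illustrate that counting and well-formedness are independent questions, the sketch below applies both to a malformed input. The counting rule (bytes outside the continuation range) is an assumption; this section does not specify how the proposed filter counts malformed sequences, so the number shown applies to this sketch only.

```php
<?php
// 0xFF never appears in well-formed UTF-8.
$malformed = "abc\xFF";

// Assumed counting rule: bytes that are not continuation bytes (0x80-0xBF).
$length = preg_match_all('/[^\x80-\xBF]/', $malformed);

// A separate well-formedness check: an empty pattern in UTF-8 mode
// matches only if the subject is valid UTF-8.
$isWellFormed = preg_match('//u', $malformed) === 1;

var_dump($length);       // int(4)
var_dump($isWellFormed); // bool(false)
```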
Backward incompatible changes: none. This RFC adds a new validation filter and does not change the behavior of existing filters.
Proposed PHP version: next PHP 8.x (PHP 8.6).
Validation baseline consistency.
Length validation is a common rule, but it is currently expressed through fragmented extension-specific or userland approaches. Placing it in the filter API provides a shared validation baseline for this class of check.
Restricting the filter to UTF-8 keeps the feature narrowly scoped and semantically clear.
A code-point-based validator needs a defined encoding context. Multi-encoding support would raise additional questions about API shape and behavior, which are outside the scope of this RFC.
Predictable validation semantics.
Code-point length provides a stable and deterministic measure suitable for validation. Grapheme-cluster-based validation raises additional boundary and API design questions, so this RFC keeps the scope limited to code-point length.
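As a small illustration of why the two measures differ, the snippet below counts code points and extended grapheme clusters for the same string using PCRE, which ships with PHP. It is not part of the proposal; it only shows the kind of divergence a grapheme-based design would have to define.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT; displayed as a single "é".
$text = "e\u{0301}";

// In UTF-8 mode, "." matches one code point at a time.
$codePoints = preg_match_all('/./su', $text);

// \X matches one extended grapheme cluster.
$graphemes = preg_match_all('/\X/u', $text);

var_dump($codePoints); // int(2)
var_dump($graphemes);  // int(1)
```

Under the proposed filter, this string would be measured as 2, not 1.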
min_len and max_len make the option semantics specific to string length.
min_range and max_range are associated with numeric validation in filter. Using length-specific names keeps the constraint explicit and avoids reusing numeric range terminology for a different validation model.
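For comparison, this is how `min_range` and `max_range` are used with the existing `FILTER_VALIDATE_INT` filter, where they constrain the numeric value itself rather than a length:

```php
<?php
$range = ["options" => ["min_range" => 1, "max_range" => 10]];

// The bounds apply to the integer value.
$inRange    = filter_var("5", FILTER_VALIDATE_INT, $range);
$outOfRange = filter_var("50", FILTER_VALIDATE_INT, $range);

var_dump($inRange);    // int(5)
var_dump($outOfRange); // bool(false)
```

Note that "50" fails here because of its value, even though it is only two characters long; `min_len` and `max_len` would instead constrain the number of code points.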
Allowing max_len = 0 keeps the boundary rules complete and consistent.
A zero upper bound has a clear meaning: only the empty string can satisfy it. Disallowing it would add a special case without improving the model.
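The zero upper bound falls out of the ordinary comparison with no special handling. The helper below is hypothetical (the name `$withinMax` and the counting rule are assumptions) and checks only the upper bound:

```php
<?php
// Hypothetical helper: true when the code-point count does not exceed $maxLen.
// Assumed counting rule: bytes that are not UTF-8 continuation bytes.
$withinMax = static fn (string $value, int $maxLen): bool =>
    preg_match_all('/[^\x80-\xBF]/', $value) <= $maxLen;

$emptyPasses  = $withinMax("", 0);  // 0 <= 0
$letterFails  = $withinMax("a", 0); // 1 <= 0 is false

var_dump($emptyPasses); // bool(true)
var_dump($letterFails); // bool(false)
```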
Ambiguous failure semantics.
The filter API returns a single validation result, which cannot distinguish between length failure and encoding failure. Keeping this filter limited to length validation avoids combining two different validation concerns into one result.
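A caller who needs to tell the two failure reasons apart can run the checks as separate steps. The sketch below is a hypothetical userland composition: the function name, the bounds of 3 and 8, and the error labels are all invented for illustration, and the length count uses an assumed rule rather than the proposed filter.

```php
<?php
// Hypothetical two-step validation that reports each concern separately.
function check_username_sketch(string $value): array
{
    $errors = [];

    // Step 1: well-formedness, checked independently of length.
    if (preg_match('//u', $value) !== 1) {
        $errors[] = 'encoding';
    }

    // Step 2: length in code points (assumed rule: non-continuation bytes).
    $length = preg_match_all('/[^\x80-\xBF]/', $value);
    if ($length < 3 || $length > 8) {
        $errors[] = 'length';
    }

    return $errors;
}

$ok          = check_username_sketch("alice");   // []
$tooShort    = check_username_sketch("ab");      // ['length']
$badEncoding = check_username_sketch("abc\xFF"); // ['encoding']
$both        = check_username_sketch("\xFF");    // ['encoding', 'length']
```

A single filter result could not express the last three outcomes as distinct cases, which is the ambiguity described above.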
Separate design question.
This RFC does not address UTF-8 validity. Whether encoding validation should be introduced, and how it should be exposed, remains a separate design question.