====== Add FILTER_VALIDATE_STRLEN for UTF-8 code point length validation ======

  * Date: 2026-03-19
  * Author: Masaki Kagaya masakielastic@gmail.com
  * Status: Draft
  * Implementation: https://github.com/php/php-src/pull/21429

===== Introduction =====

This RFC treats code-point length validation as part of PHP’s built-in validation baseline rather than as a specialized text-processing feature.

===== Proposal =====

This RFC proposes adding a new validation filter, ''FILTER_VALIDATE_STRLEN''. The filter validates the length of a string measured in Unicode code points derived from UTF-8 input.

Example:

  [
      "min_len" => 6,
      "max_len" => 6,
  ],
  ]);

In this example, validation succeeds because the input length is exactly 6 Unicode code points, so the original string is returned.

''FILTER_VALIDATE_STRLEN'' accepts ''min_len'' and ''max_len'' as length bounds. At least one of these options must be specified. If both are specified, ''min_len'' must be less than or equal to ''max_len''.

On success, the filter returns the original string. On failure, it returns false, or null if ''FILTER_NULL_ON_FAILURE'' is used.

This filter performs length validation only. It does not validate whether the input is well-formed UTF-8.

===== Backward Incompatible Changes =====

None. This RFC adds a new validation filter and does not change the behavior of existing filters.

===== Proposed PHP Version(s) =====

Next PHP 8.x (PHP 8.6).

===== FAQ =====

==== Why add a length validator to filter? ====

Validation baseline consistency. Length validation is a common rule, but it is currently expressed through fragmented extension-specific or userland approaches. Placing it in the filter API provides a shared validation baseline for this class of check.

==== Why does this validator require UTF-8 input? ====

UTF-8 keeps the feature narrowly scoped and semantically clear. A code-point-based validator needs a defined encoding context.
Multi-encoding support would raise additional questions about API shape and behavior, which are outside the scope of this RFC.

==== Why measure length in Unicode code points instead of grapheme clusters? ====

Predictable validation semantics. Code-point length provides a stable and deterministic measure suitable for validation. Grapheme-cluster-based validation raises additional boundary and API design questions, so this RFC keeps the scope limited to code-point length.

==== Why not reuse min_range and max_range? ====

''min_len'' and ''max_len'' make the option semantics specific to string length. ''min_range'' and ''max_range'' are associated with numeric validation in filter. Using length-specific names keeps the constraint explicit and avoids reusing numeric range terminology for a different validation model.

==== Why is max_len = 0 allowed? ====

Allowing ''max_len = 0'' keeps the boundary rules complete and consistent. A zero upper bound has a clear meaning: only the empty string can satisfy it. Disallowing it would add a special case without improving the model.

==== Why not validate UTF-8? ====

Ambiguous failure semantics. The filter API returns a single validation result, which cannot distinguish between length failure and encoding failure. Keeping this filter limited to length validation avoids combining two different validation concerns into one result.

==== Is a UTF-8 encoding validator needed? ====

Separate design question. This RFC does not address UTF-8 validity. Whether encoding validation should be introduced, and how it should be exposed, remains a separate design question.
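
For illustration only, the rules in the Proposal section can be approximated in userland. The sketch below is not the implementation in the linked pull request. The function name ''strlen_filter_sketch'' is hypothetical, and two behaviors are assumptions that this RFC does not specify: invalid options are reported with an ''InvalidArgumentException'', and code points are counted as bytes that are not UTF-8 continuation bytes, which leaves the result for ill-formed input unspecified.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical userland approximation of the proposed length check.
 * Returns the original string on success and false on failure.
 */
function strlen_filter_sketch(string $value, array $options): string|false
{
    $min = $options['min_len'] ?? null;
    $max = $options['max_len'] ?? null;

    // Assumption: option errors raise an exception; the RFC text does not
    // say how the real filter reports them.
    if ($min === null && $max === null) {
        throw new InvalidArgumentException('min_len or max_len must be given');
    }
    if ($min !== null && $max !== null && $min > $max) {
        throw new InvalidArgumentException('min_len must be <= max_len');
    }

    // Count code points without validating UTF-8: every byte that is not a
    // continuation byte (bit pattern 10xxxxxx) starts a new code point.
    $length = 0;
    $bytes = strlen($value);
    for ($i = 0; $i < $bytes; $i++) {
        if ((ord($value[$i]) & 0xC0) !== 0x80) {
            $length++;
        }
    }

    if ($min !== null && $length < $min) {
        return false;
    }
    if ($max !== null && $length > $max) {
        return false;
    }

    return $value;
}

// Five hiragana plus "!" are 6 code points (16 bytes), so the bounds are met.
var_dump(strlen_filter_sketch("こんにちは!", ["min_len" => 6, "max_len" => 6]));

// "e" followed by U+0301 is one grapheme cluster but two code points.
var_dump(strlen_filter_sketch("e\u{301}", ["max_len" => 1])); // bool(false)
```

The second call shows why the unit of measure matters: a string that renders as a single character can still exceed a code-point bound of 1. The sketch also covers the ''max_len = 0'' case, where only the empty string is accepted.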