Add FILTER_VALIDATE_STRLEN for UTF-8 code point length validation

Date: 2026-03-19
Author: Masaki Kagaya <masakielastic@gmail.com>
Status: Draft
Implementation: https://github.com/php/php-src/pull/21429

Introduction

This RFC treats code-point length validation as part of PHP’s built-in validation baseline rather than as a specialized text-processing feature.

Proposal

This RFC proposes adding a new validation filter, FILTER_VALIDATE_STRLEN.

The filter validates the length of a string measured in Unicode code points derived from UTF-8 input.

Example:

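<?php
// Require exactly 6 code points by setting both bounds to 6.
$result = filter_var("hello😀", FILTER_VALIDATE_STRLEN, [
    "options" => [
        "min_len" => 6, // lower bound, in code points
        "max_len" => 6, // upper bound, in code points
    ],
]);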

In this example, the input is exactly 6 Unicode code points long (9 bytes in UTF-8), which satisfies both bounds, so validation succeeds and the original string is returned.

FILTER_VALIDATE_STRLEN accepts two options, min_len and max_len, as inclusive lower and upper bounds on the length. At least one of these options must be specified. If both are specified, min_len must be less than or equal to max_len.

On success, the filter returns the original string. On failure, it returns false, or null if FILTER_NULL_ON_FAILURE is used.

This filter performs length validation only. It does not validate whether the input is well-formed UTF-8.
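The rules above can be approximated in userland today. The following is a minimal sketch, not the implementation from the pull request. The function name validate_strlen_sketch and its parameters are hypothetical. Two details are assumptions, because this RFC does not specify them: invalid option combinations throw a ValueError, and code points are counted by skipping UTF-8 continuation bytes (0x80-0xBF), which needs no well-formedness check.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical userland approximation of the proposed filter (PHP 8.0+).
 * Returns the original string on success, false on failure,
 * or null on failure when $nullOnFailure is true.
 */
function validate_strlen_sketch(
    string $value,
    ?int $minLen = null,
    ?int $maxLen = null,
    bool $nullOnFailure = false
): string|false|null {
    // Assumption: option errors are reported with ValueError.
    if ($minLen === null && $maxLen === null) {
        throw new ValueError('At least one of min_len or max_len is required');
    }
    if ($minLen !== null && $maxLen !== null && $minLen > $maxLen) {
        throw new ValueError('min_len must be less than or equal to max_len');
    }

    // Assumption: every byte that is not a continuation byte (0x80-0xBF)
    // starts a code point. The pattern runs bytewise (no /u modifier),
    // so ill-formed input is counted rather than rejected.
    $continuationBytes = (int) preg_match_all('/[\x80-\xBF]/', $value);
    $length = strlen($value) - $continuationBytes;

    $ok = ($minLen === null || $length >= $minLen)
        && ($maxLen === null || $length <= $maxLen);

    if ($ok) {
        return $value;
    }
    return $nullOnFailure ? null : false;
}

// "hello" plus U+1F600 is 6 code points and 9 bytes.
$input = "hello\u{1F600}";
var_dump(validate_strlen_sketch($input, 6, 6) === $input); // bool(true)
var_dump(validate_strlen_sketch($input, null, 5));         // bool(false)
var_dump(validate_strlen_sketch($input, 7, null, true));   // NULL
```

The last two calls show the two failure modes described above: false by default, and null when the caller opts in to null-on-failure behavior.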

Backward Incompatible Changes

None. This RFC adds a new validation filter and does not change the behavior of existing filters.

Proposed PHP Version(s)

Next PHP 8.x (PHP 8.6).

FAQ

Why add a length validator to filter?

Validation baseline consistency.

Length validation is a common rule, but it is currently expressed through fragmented extension-specific or userland approaches. Placing it in the filter API provides a shared validation baseline for this class of check.

Why does this validator require UTF-8 input?

UTF-8 keeps the feature narrowly scoped and semantically clear.

A code-point-based validator needs a defined encoding context. Multi-encoding support would raise additional questions about API shape and behavior, which are outside the scope of this RFC.

Why measure length in Unicode code points instead of grapheme clusters?

Predictable validation semantics.

Code-point length provides a stable and deterministic measure suitable for validation. Grapheme-cluster-based validation raises additional boundary and API design questions, so this RFC keeps the scope limited to code-point length.
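To illustrate the difference between the two measures, the sketch below counts bytes, code points, and grapheme clusters for the same string using only PCRE functions that exist in PHP today. It is an illustration rather than the counting method used by the proposed filter; in particular, the /u modifier used here rejects ill-formed UTF-8, which the filter does not check.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT is displayed as a single "é".
$text = "e\u{0301}";

// Code points: with the /u modifier, "." matches one code point at a time.
$codePoints = preg_match_all('/./us', $text);

// Grapheme clusters: PCRE's \X matches one extended grapheme cluster.
$graphemes = preg_match_all('/\X/u', $text);

var_dump(strlen($text)); // int(3), bytes
var_dump($codePoints);   // int(2), code points
var_dump($graphemes);    // int(1), grapheme clusters
```

A bound of max_len = 1 would therefore reject this string under code-point counting even though a reader sees one character.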

Why not reuse min_range and max_range?

min_len and max_len make the option semantics specific to string length.

min_range and max_range are associated with numeric validation in filter. Using length-specific names keeps the constraint explicit and avoids reusing numeric range terminology for a different validation model.

Why is max_len = 0 allowed?

Allowing max_len = 0 keeps the boundary rules complete and consistent.

A zero upper bound has a clear meaning: only the empty string can satisfy it. Disallowing it would add a special case without improving the model.

Why not validate UTF-8?

Ambiguous failure semantics.

The filter API returns a single validation result, which cannot distinguish between length failure and encoding failure. Keeping this filter limited to length validation avoids combining two different validation concerns into one result.
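The distinction between the two failure kinds can be seen in a userland sketch that runs the checks separately and reports which one failed. The function name check_input_sketch and its return values are hypothetical and are not part of this RFC. The encoding check relies on preg_match() returning false for ill-formed UTF-8 subjects when the /u modifier is used.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: reports encoding and length failures separately.
 * Returns 'ok', 'invalid_encoding', or 'invalid_length'.
 */
function check_input_sketch(string $value, int $minLen, int $maxLen): string
{
    // Step 1: encoding. An empty pattern with /u matches any well-formed
    // UTF-8 subject; ill-formed input makes preg_match() return false.
    if (preg_match('//u', $value) !== 1) {
        return 'invalid_encoding';
    }

    // Step 2: length in code points. The input is known to be well-formed
    // here, so "." with /u visits each code point exactly once.
    $length = (int) preg_match_all('/./us', $value);
    if ($length < $minLen || $length > $maxLen) {
        return 'invalid_length';
    }

    return 'ok';
}

var_dump(check_input_sketch("hello\u{1F600}", 6, 6)); // string(2) "ok"
var_dump(check_input_sketch("hello", 6, 10));         // string(14) "invalid_length"
var_dump(check_input_sketch("\xFF\xFE", 1, 10));      // string(16) "invalid_encoding"
```

A single filter result of false or null cannot carry this three-way outcome, which is why the two concerns are kept apart.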

Is a UTF-8 encoding validator needed?

Separate design question.

This RFC does not address UTF-8 validity. Whether encoding validation should be introduced, and how it should be exposed, remains a separate design question.
