PHP RFC: Add locale for case insensitive grapheme functions
- Version: 2.0
- Date: 2025-06-07
- Author: Yuya Hamada, youkidearitai@gmail.com
- Status: Accepted
- First Published at: https://wiki.php.net/rfc/grapheme_add_locale_for_case_insensitive
Introduction
Grapheme functions do not currently take the locale into account. This RFC adds a locale parameter to the grapheme functions listed below, including the case-insensitive ones, so that their Unicode handling can follow locale-specific rules.
- grapheme_strpos
- grapheme_stripos
- grapheme_strrpos
- grapheme_strripos
- grapheme_strstr
- grapheme_stristr
- grapheme_levenshtein
With this RFC, these functions can take the locale into account. For example:
var_dump(grapheme_stripos("i", "\u{0130}", 0, "tr_TR")); // Result is 0
var_dump(grapheme_stripos("i", "\u{0130}", 0, "en_US")); // Result is false
Proposal
Add a $locale parameter to these functions.
function grapheme_strpos(string $haystack, string $needle, int $offset = 0, string $locale = ""): int|false {}
function grapheme_stripos(string $haystack, string $needle, int $offset = 0, string $locale = ""): int|false {}
function grapheme_strrpos(string $haystack, string $needle, int $offset = 0, string $locale = ""): int|false {}
function grapheme_strripos(string $haystack, string $needle, int $offset = 0, string $locale = ""): int|false {}
function grapheme_substr(string $string, int $offset, ?int $length = null, string $locale = ""): string|false {}
function grapheme_strstr(string $haystack, string $needle, bool $beforeNeedle = false, string $locale = ""): string|false {}
function grapheme_stristr(string $haystack, string $needle, bool $beforeNeedle = false, string $locale = ""): string|false {}
function grapheme_levenshtein(string $string1, string $string2, int $insertion_cost = 1, int $replacement_cost = 1, int $deletion_cost = 1, string $locale = ""): int|false {}
$locale is based on LDML (Unicode Locale Data Markup Language): https://www.unicode.org/reports/tr35
Specifying the collation strength through $locale (for example with the -u-ks-identic keyword) can change how CJK characters match. For example:
$nabe = '邊';
$nabe_E0101 = "邊\u{E0101}";
var_dump(grapheme_levenshtein($nabe, $nabe_E0101)); // result is 0
var_dump(grapheme_levenshtein($nabe, $nabe_E0101, locale: "ja_JP-u-ks-identic")); // result is 1
var_dump(grapheme_strpos($nabe, $nabe_E0101)); // result is 0
var_dump(grapheme_strpos($nabe, $nabe_E0101, locale: "ja_JP-u-ks-identic")); // result is false
If $locale is not valid, the function returns false and sets the intl error state through intl_error_code_set and intl_error_set_custom_msg. Userland code can therefore call intl_get_error_code and intl_get_error_message to find out the reason.
$ sapi/cli/php -r 'var_dump(grapheme_levenshtein("abc", "def", locale: "defaaaaaaaaaaaaa"), intl_get_error_message());'
bool(false)
string(44) "Error on ucol_open: U_ILLEGAL_ARGUMENT_ERROR"
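The error handling described above can be wrapped so that callers do not have to check for false themselves. This is a minimal sketch, assuming PHP 8.5 or later with ext/intl loaded; grapheme_levenshtein_or_throw is a hypothetical helper name and is not part of this RFC.

```php
<?php
declare(strict_types=1);

// Hypothetical helper (not part of the RFC): converts the false return
// value into an exception that carries the intl error details.
function grapheme_levenshtein_or_throw(string $a, string $b, string $locale = ""): int
{
    // Assumes PHP 8.5+, where grapheme_levenshtein() accepts $locale.
    $distance = grapheme_levenshtein($a, $b, locale: $locale);

    if ($distance === false) {
        // The intl error state is populated when the call fails.
        throw new RuntimeException(sprintf(
            'grapheme_levenshtein failed (%d): %s',
            intl_get_error_code(),
            intl_get_error_message()
        ));
    }

    return $distance;
}

try {
    // Same invalid locale string as in the shell example above.
    grapheme_levenshtein_or_throw("abc", "def", "defaaaaaaaaaaaaa");
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```

Throwing keeps the int return type free of false, which may be convenient in strictly typed code. Whether that suits a given code base is a design choice, not something the RFC prescribes.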
The strength of the grapheme_stri* functions remains unchanged: UCOL_SECONDARY is always used.
var_dump(grapheme_stripos("邊", "邊\u{E0101}", locale: "ja_JP-u-ks-identic")); // result is 0 (matched) because strength is UCOL_SECONDARY.
A separate $strength parameter was removed from this proposal to avoid complexity, since the strength can already be specified through $locale.
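Because strength now travels inside the locale string, callers may want a small utility to build that string. The sketch below is illustrative only: locale_with_strength is a hypothetical helper, not part of this RFC, and it assumes the base locale does not already carry a -u- extension. The accepted values are the LDML "ks" keyword values.

```php
<?php
declare(strict_types=1);

// Hypothetical helper (not part of the RFC): appends an LDML "ks"
// (collation strength) keyword to a base locale identifier.
// Assumes $baseLocale has no existing "-u-" extension.
function locale_with_strength(string $baseLocale, string $strength): string
{
    // Values defined by LDML for the "ks" collation keyword.
    $allowed = ['level1', 'level2', 'level3', 'level4', 'identic'];

    if (!in_array($strength, $allowed, true)) {
        throw new InvalidArgumentException("Unknown collation strength: $strength");
    }

    return $baseLocale . '-u-ks-' . $strength;
}

echo locale_with_strength('ja_JP', 'identic'), "\n"; // ja_JP-u-ks-identic
```

The resulting string can be passed as the $locale argument of the functions listed in this RFC. The grapheme_stri* functions keep using UCOL_SECONDARY.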
Backward Incompatible Changes
None, as long as the added parameter is left at its default value.
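As an illustration, calls written without the new argument look the same as before. This is a small sketch using arbitrary ASCII strings and assuming ext/intl is loaded; none of the inputs come from the RFC.

```php
<?php
// Existing calls that omit $locale; the new parameter falls back to
// its default value of "".
var_dump(grapheme_strpos("abcdef", "cd"));   // int(2)
var_dump(grapheme_stripos("ABCDEF", "cd"));  // int(2)
var_dump(grapheme_strstr("abcdef", "cd"));   // string(4) "cdef"
```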
Proposed PHP Version(s)
8.5
RFC Impact
To SAPIs
No effects.
To Existing Extensions
No effects.
To Opcache
No effects.
New Constants
None.
Open Issues
Nothing.
Future Scope
Proposed Voting Choices
Patches and Tests
Implementation
References
Rejected Features