====== PHP RFC: Add locale for case insensitive grapheme functions ======
* Version: 2.0
* Date: 2025-06-07
* Author: Yuya Hamada, youkidearitai@gmail.com
* Status: Under Discussion
* First Published at: https://wiki.php.net/rfc/grapheme_add_locale_for_case_insensitive
===== Introduction =====
The grapheme functions are currently not locale-aware. This RFC adds a locale parameter to the grapheme functions listed below, including the case-insensitive ones, so that matching can follow locale-specific Unicode rules.
* grapheme_strpos
* grapheme_stripos
* grapheme_strrpos
* grapheme_strripos
* grapheme_strstr
* grapheme_stristr
* grapheme_levenshtein
With this RFC, these functions take the locale into account. For example:
var_dump(grapheme_stripos("i", "\u{0130}", 0, "tr_TR")); // Result is 0
var_dump(grapheme_stripos("i", "\u{0130}", 0, "en_US")); // Result is false
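The difference between these two results can be approximated today with the existing ''Collator'' class. The following is a minimal sketch, not the RFC's implementation: the helper name ''equals_ignoring_case'' is hypothetical, and the sketch assumes that the proposed $locale parameter selects ICU collation rules in the same way ''Collator'' does, at secondary strength.

```php
<?php
// Hypothetical helper (not part of the RFC): compares two strings with the
// collation rules of $locale at secondary strength, which ignores case.
function equals_ignoring_case(string $a, string $b, string $locale): bool
{
    $collator = new Collator($locale);
    $collator->setStrength(Collator::SECONDARY);
    return $collator->compare($a, $b) === 0;
}

// Turkish pairs "i" with the dotted capital U+0130; English rules do not.
var_dump(equals_ignoring_case("i", "\u{0130}", "tr_TR"));
var_dump(equals_ignoring_case("i", "\u{0130}", "en_US"));
```

Building the collator per call keeps the sketch short; code that compares many strings would normally create it once and reuse it.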
===== Proposal =====
Add a $locale parameter to the following functions.
function grapheme_strpos(string $haystack, string $needle, int $offset = 0, string $locale = ""): int|false {}
function grapheme_stripos(string $haystack, string $needle, int $offset = 0, string $locale = ""): int|false {}
function grapheme_strrpos(string $haystack, string $needle, int $offset = 0, string $locale = ""): int|false {}
function grapheme_strripos(string $haystack, string $needle, int $offset = 0, string $locale = ""): int|false {}
function grapheme_substr(string $string, int $offset, ?int $length = null, string $locale = ""): string|false {}
function grapheme_strstr(string $haystack, string $needle, bool $beforeNeedle = false, string $locale = ""): string|false {}
function grapheme_stristr(string $haystack, string $needle, bool $beforeNeedle = false, string $locale = ""): string|false {}
function grapheme_levenshtein(string $string1, string $string2, int $insertion_cost = 1, int $replacement_cost = 1, int $deletion_cost = 1, string $locale = ""): int|false {}
Specifying the collation strength through $locale can change how CJK characters match. For example:
$nabe = '邊';
$nabe_E0101 = "邊\u{E0101}";
var_dump(grapheme_levenshtein($nabe, $nabe_E0101)); // result is 0
var_dump(grapheme_levenshtein($nabe, $nabe_E0101, locale: "ja_JP-u-ks-identic")); // result is 1
var_dump(grapheme_strpos($nabe, $nabe_E0101)); // result is 0
var_dump(grapheme_strpos($nabe, $nabe_E0101, locale: "ja_JP-u-ks-identic")); // result is false
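The effect of strength on variation selectors can also be observed with the existing ''Collator'' class. This sketch is an illustration under assumptions, not the patch itself: the helper ''same_at_strength'' is hypothetical, and ''Collator::IDENTICAL'' is assumed to correspond to the ''ks-identic'' keyword used above.

```php
<?php
// Hypothetical helper: true when $a and $b compare equal under $locale
// at the given ICU collation strength.
function same_at_strength(string $a, string $b, string $locale, int $strength): bool
{
    $collator = new Collator($locale);
    $collator->setStrength($strength);
    return $collator->compare($a, $b) === 0;
}

$nabe = '邊';
$nabe_E0101 = "邊\u{E0101}"; // same ideograph followed by a variation selector

// Tertiary is ICU's default strength; variation selectors are ignored there.
var_dump(same_at_strength($nabe, $nabe_E0101, 'ja_JP', Collator::TERTIARY));
// Identical strength also compares code points, so the selector counts.
var_dump(same_at_strength($nabe, $nabe_E0101, 'ja_JP', Collator::IDENTICAL));
```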
If $locale is not valid, the function returns false and records the error through ''intl_error_set_code'' and ''intl_error_set_custom_msg''. Userland code can therefore call ''intl_get_error_code'' and ''intl_get_error_message'' to find out the reason.
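A caller could separate "not found" from an intl error roughly as follows. This is a sketch under assumptions: the wrapper ''stripos_in_locale'' is hypothetical, the proposed parameter is detected at runtime through reflection rather than assumed to exist, and no stale error from an earlier intl call is assumed to be pending.

```php
<?php
// Hypothetical wrapper: calls grapheme_stripos() with a locale when the running
// PHP build accepts one, and turns an intl error into an exception.
function stripos_in_locale(string $haystack, string $needle, string $locale): int|false
{
    // Detect the proposed parameter at runtime instead of assuming a version.
    $hasLocale = false;
    foreach ((new ReflectionFunction('grapheme_stripos'))->getParameters() as $param) {
        if ($param->getName() === 'locale') {
            $hasLocale = true;
        }
    }
    if (!$hasLocale) {
        // This build has no $locale parameter; fall back to the existing call.
        return grapheme_stripos($haystack, $needle);
    }

    $pos = grapheme_stripos($haystack, $needle, 0, $locale);
    // false alone may just mean "not found"; check the intl error state too.
    // Assumption: no stale error from an earlier intl call is pending.
    if ($pos === false && intl_is_failure(intl_get_error_code())) {
        throw new RuntimeException(intl_get_error_message());
    }
    return $pos;
}

var_dump(stripos_in_locale("abc", "B", "en_US"));
```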
For the grapheme_stri* functions, the strength remains unchanged: UCOL_SECONDARY is used.
The $strength parameter was removed to avoid complexity, and because the strength can be specified through $locale.
===== Backward Incompatible Changes =====
None, as long as the added parameter is left at its default value.
===== Proposed PHP Version(s) =====
8.5
===== RFC Impact =====
==== To SAPIs ====
No effects.
==== To Existing Extensions ====
No effects.
==== To Opcache ====
No effects.
==== New Constants ====
No effects.
===== Open Issues =====
Nothing.
===== Future Scope =====
This section details areas where the feature might be improved in future, but that are not currently proposed in this RFC.
===== Proposed Voting Choices =====
I apologize for stopping the vote. I will fix this RFC.
===== Patches and Tests =====
https://github.com/php/php-src/pull/18792
===== Implementation =====
https://github.com/php/php-src/pull/18792
===== References =====
https://unicode-org.github.io/icu/userguide/transforms/casemappings.html#full-language-specific-case-mapping
===== Rejected Features =====
Keep this updated with features that were discussed on the mail lists.