The grapheme functions are not locale-dependent. This RFC adds a locale parameter to the grapheme functions, including the case-insensitive ones, so that better Unicode support can be expected.
With this RFC, the functions can take the locale into account. For example:
var_dump(grapheme_stripos("i", "\u{0130}", 0, "tr_TR")); // Result is 0
var_dump(grapheme_stripos("i", "\u{0130}", 0, "en_US")); // Result is false
Add a $locale parameter to these functions.
function grapheme_strpos(string $haystack, string $needle, int $offset = 0, string $locale = ""): int|false {}
function grapheme_stripos(string $haystack, string $needle, int $offset = 0, string $locale = ""): int|false {}
function grapheme_strrpos(string $haystack, string $needle, int $offset = 0, string $locale = ""): int|false {}
function grapheme_strripos(string $haystack, string $needle, int $offset = 0, string $locale = ""): int|false {}
function grapheme_substr(string $string, int $offset, ?int $length = null, string $locale = ""): string|false {}
function grapheme_strstr(string $haystack, string $needle, bool $beforeNeedle = false, string $locale = ""): string|false {}
function grapheme_stristr(string $haystack, string $needle, bool $beforeNeedle = false, string $locale = ""): string|false {}
function grapheme_levenshtein(string $string1, string $string2, int $insertion_cost = 1, int $replacement_cost = 1, int $deletion_cost = 1, string $locale = ""): int|false {}
$locale follows the Unicode Locale Data Markup Language (LDML): https://www.unicode.org/reports/tr35
Specifying a collation strength through $locale can change how CJK characters match. For example:
$nabe = '邊';
$nabe_E0101 = "邊\u{E0101}";
var_dump(grapheme_levenshtein($nabe, $nabe_E0101)); // result is 0
var_dump(grapheme_levenshtein($nabe, $nabe_E0101, locale: "ja_JP-u-ks-identic")); // result is 1
var_dump(grapheme_strpos($nabe, $nabe_E0101)); // result is 0
var_dump(grapheme_strpos($nabe, $nabe_E0101, locale: "ja_JP-u-ks-identic")); // result is false
If $locale is not valid, the function returns false and records the error through intl_error_code_set and intl_error_set_custom_msg. PHP userland code can therefore call intl_get_error_code and intl_get_error_message to find out the reason.
$ sapi/cli/php -r 'var_dump(grapheme_levenshtein("abc", "def", locale: "defaaaaaaaaaaaaa"), intl_get_error_message());'
bool(false)
string(44) "Error on ucol_open: U_ILLEGAL_ARGUMENT_ERROR"
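The same check can be written in a script. This is a minimal sketch that assumes a PHP build where grapheme_levenshtein() accepts the $locale parameter proposed in this RFC. The invalid locale string is the one from the command above, and the message format is only illustrative.

```php
<?php
// Sketch only: assumes grapheme_levenshtein() accepts the $locale
// parameter proposed in this RFC.

$distance = grapheme_levenshtein("abc", "def", locale: "defaaaaaaaaaaaaa");

if ($distance === false) {
    // intl_get_error_code() and intl_get_error_message() are existing
    // intl functions that report the last error.
    echo "grapheme_levenshtein failed (", intl_get_error_code(), "): ",
        intl_get_error_message(), "\n";
} else {
    echo "distance: ", $distance, "\n";
}
```

Comparing against false with === matters here, because 0 is a valid return value for several of these functions.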
For the grapheme_stri* functions, the strength remains unchanged: UCOL_SECONDARY is used.
var_dump(grapheme_stripos("邊", "邊\u{E0101}", locale: "ja_JP-u-ks-identic")); // result is 0 (matched) because strength is UCOL_SECONDARY.
The $strength parameter was removed to avoid complexity, and because the strength can be specified through $locale.
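To see which strength a locale identifier selects, one option is to open a Collator with it and read the strength back. The sketch below uses only the existing Collator class from the intl extension, not the functions changed by this RFC. It assumes the underlying ICU version accepts the same identifiers as the examples in this RFC. The grapheme functions may apply their own defaults (such as UCOL_SECONDARY for grapheme_stri*), so this shows the locale keyword only.

```php
<?php
// Sketch only: inspects the collation strength a locale identifier selects.
// Collator is existing intl API; the identifiers are taken from this RFC.

$locales = ["ja_JP", "ja_JP-u-ks-identic"];

foreach ($locales as $locale) {
    $collator = new Collator($locale);
    $strength = $collator->getStrength();

    // Collator::IDENTICAL is the constant for the "identic" strength level.
    $isIdentical = ($strength === Collator::IDENTICAL) ? "yes" : "no";

    printf("%s: strength=%d identical=%s\n", $locale, $strength, $isIdentical);
}
```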
None, as long as the added parameter is left at its default value.
PHP 8.5
No effects.
No effects.
No effects.
No effects.
Nothing.
This section details areas where the feature might be improved in future, but that are not currently proposed in this RFC.