
PHP RFC: Add locale for case insensitive grapheme functions

Introduction

The grapheme functions are currently not locale-aware. This RFC adds a locale parameter to the grapheme functions listed below, including the case-insensitive ones, so that matching can follow locale-specific rules. This is expected to improve Unicode support.

  • grapheme_strpos
  • grapheme_stripos
  • grapheme_strrpos
  • grapheme_strripos
  • grapheme_strstr
  • grapheme_stristr
  • grapheme_levenshtein

With this RFC, these functions can take the locale into account. For example:

var_dump(grapheme_stripos("i", "\u{0130}", 0, "tr_TR")); // Result is 0
var_dump(grapheme_stripos("i", "\u{0130}", 0, "en_US")); // Result is false

Proposal

Add a $locale parameter to the following functions.

function grapheme_strpos(string $haystack, string $needle, int $offset = 0, string $locale = ""): int|false {}
function grapheme_stripos(string $haystack, string $needle, int $offset = 0, string $locale = ""): int|false {}
function grapheme_strrpos(string $haystack, string $needle, int $offset = 0, string $locale = ""): int|false {}
function grapheme_strripos(string $haystack, string $needle, int $offset = 0, string $locale = ""): int|false {}
function grapheme_substr(string $string, int $offset, ?int $length = null, string $locale = ""): string|false {}
function grapheme_strstr(string $haystack, string $needle, bool $beforeNeedle = false, string $locale = ""): string|false {}
function grapheme_stristr(string $haystack, string $needle, bool $beforeNeedle = false, string $locale = ""): string|false {}
function grapheme_levenshtein(string $string1, string $string2, int $insertion_cost = 1, int $replacement_cost = 1, int $deletion_cost = 1, string $locale = ""): int|false {}

Specifying the collation strength through $locale can change how CJK characters match. For example:

$nabe = '邊';
$nabe_E0101 = "邊\u{E0101}";
var_dump(grapheme_levenshtein($nabe, $nabe_E0101)); // result is 0
var_dump(grapheme_levenshtein($nabe, $nabe_E0101, locale: "ja_JP-u-ks-identic")); // result is 1
var_dump(grapheme_strpos($nabe, $nabe_E0101)); // result is 0
var_dump(grapheme_strpos($nabe, $nabe_E0101, locale: "ja_JP-u-ks-identic")); // result is false

If $locale is not valid, the function returns false and records the error through intl_error_set_code and intl_error_set_custom_msg. Userland code can therefore call intl_get_error_code and intl_get_error_message to find the reason.
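Because false is also the normal "not found" result, callers may want to tell the two cases apart. The following is a minimal sketch, assuming a PHP build where the proposed $locale parameter is available. The helper name findWithLocale is hypothetical and not part of this RFC, and the RFC does not list which locale strings are rejected, so the second locale below is only a guess.

```php
<?php
// Hypothetical helper (not part of the RFC): wraps grapheme_strpos() and
// separates "needle not found" from "the call failed"; both return false.
// Assumption: the global intl error state reflects this call, i.e. it was
// clean before the call or is reset by the function itself.
function findWithLocale(string $haystack, string $needle, string $locale): int|false
{
    $pos = grapheme_strpos($haystack, $needle, 0, $locale);
    if ($pos === false && intl_is_failure(intl_get_error_code())) {
        throw new RuntimeException('grapheme_strpos failed: ' . intl_get_error_message());
    }
    return $pos;
}

// The second string is only a guess at a value the implementation rejects.
foreach (['en_US', 'xx-invalid-locale-string-for-illustration'] as $locale) {
    try {
        var_dump(findWithLocale('abc', 'c', $locale));
    } catch (Throwable $e) {
        // Also reached on builds without ext/intl or without the $locale parameter.
        echo get_class($e), ': ', $e->getMessage(), PHP_EOL;
    }
}
```

Throwing an exception is only one possible design; returning null for failures would work equally well for callers that prefer not to use exceptions.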

For the grapheme_stri* functions, the default strength remains unchanged: UCOL_SECONDARY is used.

The $strength parameter was removed from this proposal to avoid complexity, and because the strength can already be specified through $locale.
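As an illustration of specifying the strength through $locale, the sketch below builds a locale string with the Unicode "ks" collation-strength keyword, in the same style as the "ja_JP-u-ks-identic" example. The helper name localeWithStrength and the strength labels used as array keys are hypothetical. The keyword values (level1 to level4, identic) are the ones defined by Unicode LDML.

```php
<?php
// Hypothetical helper (not part of the RFC): appends the "ks" collation
// strength keyword to a base locale that has no "-u-" extension yet.
function localeWithStrength(string $baseLocale, string $strength): string
{
    // "ks" keyword values as defined by Unicode LDML (UTS #35).
    $ksValues = [
        'primary'    => 'level1',
        'secondary'  => 'level2',
        'tertiary'   => 'level3',
        'quaternary' => 'level4',
        'identical'  => 'identic',
    ];
    if (!isset($ksValues[$strength])) {
        throw new InvalidArgumentException("Unknown strength: $strength");
    }
    return $baseLocale . '-u-ks-' . $ksValues[$strength];
}

echo localeWithStrength('ja_JP', 'identical'), PHP_EOL; // ja_JP-u-ks-identic
echo localeWithStrength('en_US', 'primary'), PHP_EOL;   // en_US-u-ks-level1
```

The resulting string can then be passed as the $locale argument. The helper does not handle a base locale that already carries a "-u-" extension.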

Backward Incompatible Changes

None, as long as the added parameter is left at its default value.
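Code that has to run on PHP versions both with and without this change could detect the new parameter at runtime. This is a minimal sketch using Reflection. The helper name supportsLocaleParam is hypothetical and not part of this RFC.

```php
<?php
// Hypothetical helper (not part of the RFC): reports whether the given
// function exists and declares a parameter named "locale".
function supportsLocaleParam(string $function): bool
{
    if (!function_exists($function)) {
        return false; // e.g. ext/intl is not loaded
    }
    foreach ((new ReflectionFunction($function))->getParameters() as $param) {
        if ($param->getName() === 'locale') {
            return true;
        }
    }
    return false;
}

if (function_exists('grapheme_stripos')) {
    // Existing call sites need no change: $locale keeps its default value.
    var_dump(grapheme_stripos('ABC', 'b'));

    if (supportsLocaleParam('grapheme_stripos')) {
        // Only executed when the installed version accepts $locale.
        var_dump(grapheme_stripos('ABC', 'b', locale: 'en_US'));
    }
}
```

Checking by parameter name rather than by PHP version keeps the check independent of which release ships the change.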

Proposed PHP Version(s)

8.5

RFC Impact

To SAPIs

No effects.

To Existing Extensions

No effects.

To Opcache

No effects.

New Constants

None.

Open Issues

Nothing.

Future Scope

This section details areas where the feature might be improved in future, but that are not currently proposed in this RFC.

Proposed Voting Choices

I apologize for stopping the vote. I will fix this RFC.

Patches and Tests

Implementation

References

Rejected Features

Keep this updated with features that were discussed on the mail lists.

rfc/grapheme_add_locale_for_case_insensitive.txt · Last modified: by youkidearitai