I am working on the mb_levenshtein RFC: https://wiki.php.net/rfc/mb_levenshtein. However, during the discussion it was suggested that a Levenshtein function operating on grapheme clusters would be more logical. I agree, so I created a proof of concept (PoC).
ref: https://github.com/php/php-src/issues/16428
For example, combining characters are handled correctly.
var_dump(grapheme_levenshtein("\u{0065}\u{0301}", "\u{00e9}")); // Result is 0 with grapheme_levenshtein; mb_levenshtein does not handle this case well.
Variation selectors are also handled correctly.
// $nabe and $nabe_E0100 look identical when rendered.
// However, $nabe_E0100 is U+908A followed by the variation selector U+E0100.
// So the grapheme_levenshtein result should be 0.
$nabe = '邊';
$nabe_E0100 = "\u{908A}\u{E0100}";
var_dump(grapheme_levenshtein($nabe, $nabe_E0100)); // Result is 0 with grapheme_levenshtein; mb_levenshtein returns 1, which is not the desired result.
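The difference between the two approaches comes from the unit being counted: code points versus grapheme clusters. The following is a minimal sketch of that difference using only PCRE. The helper names count_code_points() and count_graphemes() are hypothetical and are not part of this RFC. It only illustrates segmentation, not how the proposed function decides whether two clusters are equal.

```php
<?php
// Hypothetical helpers for illustration only; not part of the RFC.

// Counts Unicode code points: with /u, "." matches one code point
// (/s lets it match newlines too).
function count_code_points(string $s): int {
    return (int) preg_match_all('/./su', $s);
}

// Counts extended grapheme clusters: \X matches one cluster.
function count_graphemes(string $s): int {
    return (int) preg_match_all('/\X/u', $s);
}

$decomposed = "\u{0065}\u{0301}";   // "e" + COMBINING ACUTE ACCENT
$selector   = "\u{908A}\u{E0100}";  // U+908A + variation selector U+E0100

var_dump(count_code_points($decomposed)); // int(2)
var_dump(count_graphemes($decomposed));   // int(1)
var_dump(count_code_points($selector));   // int(2)
var_dump(count_graphemes($selector));     // int(1)
```

A code-point-based distance sees two units in each of these strings, while a grapheme-based distance sees one.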
Add the grapheme_levenshtein() function:
function grapheme_levenshtein(string $string1, string $string2, int $insertion_cost = 1, int $replacement_cost = 1, int $deletion_cost = 1): int|false {}
$string1 and $string2 must be valid UTF-8. The function returns false if either string cannot be parsed as UTF-8.
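To show how the three cost parameters and the false return value could interact, here is a minimal userland sketch of a weighted Levenshtein distance over grapheme clusters. It is not the proposed implementation: the name sketch_grapheme_levenshtein() is hypothetical, clusters are split with PCRE's \X, and clusters are compared byte for byte. Because of that last assumption, it does not treat "\u{0065}\u{0301}" and "\u{00e9}" as equal, unlike the result reported for grapheme_levenshtein() above.

```php
<?php
// Hypothetical userland sketch; NOT the implementation proposed in this RFC.
function sketch_grapheme_levenshtein(
    string $string1,
    string $string2,
    int $insertion_cost = 1,
    int $replacement_cost = 1,
    int $deletion_cost = 1
): int|false {
    // \X matches one extended grapheme cluster. With /u,
    // preg_match_all() returns false when the subject is not valid UTF-8.
    if (preg_match_all('/\X/u', $string1, $m1) === false
        || preg_match_all('/\X/u', $string2, $m2) === false) {
        return false;
    }
    $a = $m1[0];
    $b = $m2[0];
    $lenA = count($a);
    $lenB = count($b);

    // $prev[$j]: cost of building the first $j clusters of $b from nothing.
    $prev = [];
    for ($j = 0; $j <= $lenB; $j++) {
        $prev[$j] = $j * $insertion_cost;
    }

    for ($i = 1; $i <= $lenA; $i++) {
        // Deleting the first $i clusters of $a yields the empty string.
        $curr = [$i * $deletion_cost];
        for ($j = 1; $j <= $lenB; $j++) {
            // Assumption: clusters are equal only if byte-identical.
            $same = $a[$i - 1] === $b[$j - 1];
            $curr[$j] = min(
                $prev[$j - 1] + ($same ? 0 : $replacement_cost),
                $curr[$j - 1] + $insertion_cost,
                $prev[$j] + $deletion_cost
            );
        }
        $prev = $curr;
    }

    return $prev[$lenB];
}

var_dump(sketch_grapheme_levenshtein("kitten", "sitting"));  // int(3)
var_dump(sketch_grapheme_levenshtein("e\u{0301}", "o"));     // int(1)
var_dump(sketch_grapheme_levenshtein("a", "ab", 5, 1, 1));   // int(5)
var_dump(sketch_grapheme_levenshtein("\xFF", "a"));          // bool(false)
```

In the second call the accented "e" is a single cluster, so one replacement is enough. A code-point-based distance would need a replacement plus a deletion.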
This could break userland code that already defines a function with the same name.
PHP 8.5
To SAPIs: The aforementioned function will be added to all PHP environments.
Adds grapheme_levenshtein() to the intl extension.
No effect.
No new constants.
No changed php.ini settings.
This section details areas where the feature might be improved in future, but that are not currently proposed in this RFC.
TBD.
Nothing.