PHP RFC: Grapheme cluster for levenshtein, grapheme_levenshtein function
- Version: 0.1
- Date: 2024-10-14
- Author: Yuya Hamada, youkidearitai@gmail.com
- Status: Draft
- First Published at: http://wiki.php.net/rfc/grapheme_levenshtein
Introduction
I previously proposed mb_levenshtein (https://wiki.php.net/rfc/mb_levenshtein). However, the discussion suggested that a Levenshtein function operating on grapheme clusters might be more logical. I agree, so I created a PoC.
ref: https://github.com/php/php-src/issues/16428
For example, combining character sequences are handled correctly.
var_dump(grapheme_levenshtein("\u{0065}\u{0301}", "\u{00e9}")); // grapheme_levenshtein returns 0. mb_levenshtein does not handle this case well.
Variation selectors are also handled correctly.
// $nabe and $nabe_E0100 look identical when rendered.
// However, $nabe_E0100 is U+908A followed by the variation selector U+E0100.
// So the grapheme_levenshtein result should be 0.
$nabe = '邊';
$nabe_E0100 = "\u{908A}\u{E0100}";
var_dump(grapheme_levenshtein($nabe, $nabe_E0100)); // grapheme_levenshtein returns 0. mb_levenshtein returns 1, which is not the desired result.
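The difference between the two approaches comes from what each one counts as a "character". As a small illustration (assuming the mbstring and intl extensions are loaded; the variable names are only for this example), the existing mb_strlen() and grapheme_strlen() functions already disagree on the lengths of the strings above:

```php
<?php
// Requires the mbstring and intl extensions.
$decomposed  = "\u{0065}\u{0301}";  // "e" followed by U+0301 COMBINING ACUTE ACCENT
$precomposed = "\u{00e9}";          // precomposed U+00E9
$nabe_E0100  = "\u{908A}\u{E0100}"; // U+908A followed by variation selector U+E0100

// mb_strlen() counts code points; grapheme_strlen() counts grapheme clusters.
var_dump(mb_strlen($decomposed, 'UTF-8'));  // int(2)
var_dump(grapheme_strlen($decomposed));     // int(1)
var_dump(mb_strlen($precomposed, 'UTF-8')); // int(1)
var_dump(grapheme_strlen($precomposed));    // int(1)
var_dump(mb_strlen($nabe_E0100, 'UTF-8'));  // int(2)
var_dump(grapheme_strlen($nabe_E0100));     // int(1)
```

A distance computed per code point therefore sees extra units to insert or replace, while a distance computed per grapheme cluster sees a single unit on each side.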
Proposal
Add the grapheme_levenshtein() function.
function grapheme_levenshtein(string $string1, string $string2, int $insertion_cost = 1, int $replacement_cost = 1, int $deletion_cost = 1): int|false {}
$string1 and $string2 must be valid UTF-8. The function returns false if either string cannot be parsed as UTF-8.
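To illustrate the intended semantics, here is a minimal userland sketch. It is not the proposed C implementation, and the name sketch_grapheme_levenshtein() is hypothetical. It assumes that grapheme clusters can be split with PCRE's \X, and that two clusters count as equal when a root-locale Collator at its default strength compares them as equal. The PoC may compare clusters differently.

```php
<?php
// Requires the intl extension (Collator). Hypothetical helper, for illustration only.
function sketch_grapheme_levenshtein(
    string $string1,
    string $string2,
    int $insertion_cost = 1,
    int $replacement_cost = 1,
    int $deletion_cost = 1
): int|false {
    // \X matches one extended grapheme cluster. With the /u modifier,
    // preg_match_all() returns false when the subject is not valid UTF-8.
    if (preg_match_all('/\X/u', $string1, $m1) === false
        || preg_match_all('/\X/u', $string2, $m2) === false) {
        return false;
    }
    $a = $m1[0];
    $b = $m2[0];
    $lenA = count($a);
    $lenB = count($b);

    // Assumption: clusters are "equal" when the collator says so.
    $collator = new Collator('root');

    // $prev[$j] is the cost of turning the clusters of $a seen so far
    // into the first $j clusters of $b.
    $prev = [];
    for ($j = 0; $j <= $lenB; $j++) {
        $prev[$j] = $j * $insertion_cost;
    }
    for ($i = 1; $i <= $lenA; $i++) {
        $curr = [$i * $deletion_cost];
        for ($j = 1; $j <= $lenB; $j++) {
            $same = $collator->compare($a[$i - 1], $b[$j - 1]) === 0;
            $curr[$j] = min(
                $prev[$j] + $deletion_cost,                       // delete from $string1
                $curr[$j - 1] + $insertion_cost,                  // insert into $string1
                $prev[$j - 1] + ($same ? 0 : $replacement_cost)   // keep or replace
            );
        }
        $prev = $curr;
    }
    return $prev[$lenB];
}

var_dump(sketch_grapheme_levenshtein("kitten", "sitting"));             // int(3)
var_dump(sketch_grapheme_levenshtein("\u{0065}\u{0301}", "\u{00e9}"));  // int(0)
var_dump(sketch_grapheme_levenshtein("\xFF", "a"));                     // bool(false)
```

The cost parameters follow the same order as the existing levenshtein() function: insertion, replacement, deletion.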
Backward Incompatible Changes
This could break userland code that already defines a global function with the same name.
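Userland code that ships its own function of this name can usually avoid the conflict by defining it conditionally. The following is only a sketch of that pattern. The fallback body is a placeholder that delegates to the byte-based levenshtein(), so it is correct for ASCII input only and does not reproduce the proposed behavior.

```php
<?php
// Hypothetical polyfill guard: define the function only if PHP does not provide it.
if (!function_exists('grapheme_levenshtein')) {
    function grapheme_levenshtein(string $string1, string $string2): int
    {
        // Placeholder fallback: byte-based, correct for ASCII input only.
        return levenshtein($string1, $string2);
    }
}

var_dump(grapheme_levenshtein("kitten", "sitting")); // int(3)
```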
Proposed PHP Version(s)
PHP 8.5
RFC Impact
To SAPIs
Adds the aforementioned function to all PHP environments that have the intl extension enabled.
To Existing Extensions
Adds grapheme_levenshtein() to the intl extension.
To Opcache
No effect.
New Constants
No new constants.
php.ini Defaults
No php.ini settings are changed.
Open Issues
Future Scope
This section details areas where the feature might be improved in future, but that are not currently proposed in this RFC.
Proposed Voting Choices
Include these so readers know where you are heading and can discuss the proposed voting options.
Voting
TBD.
Implementation
References
Nothing.
Rejected Features
Keep this updated with features that were discussed on the mail lists.