PHP RFC: Multibyte for levenshtein, mb_levenshtein function
- Version: 0.1
- Date: 2024-09-25
- Author: Yuya Hamada, youkidearitai@gmail.com
- Status: Voting
- First Published at: http://wiki.php.net/rfc/mb_levenshtein
Introduction
A multibyte-aware Levenshtein distance has been the subject of feature requests in the past. This RFC therefore proposes adding an mb_levenshtein function to implement it.
Difference between mb_levenshtein and grapheme_levenshtein
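The mb_levenshtein function computes the Levenshtein distance in units of code points. This is useful when strings should be compared one Unicode code point at a time. For example, it can be used to tell a precomposed character apart from its decomposed form (a base letter followed by a combining character).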
var_dump(mb_levenshtein("\u{0065}\u{0301}", "\u{00e9}")); // Both render as "é", but the result is int(2): one replacement plus one deletion.
Of course, there are times when these two forms should be considered the same character. For that case, grapheme_levenshtein will be proposed separately.
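To illustrate why the two spellings of "é" differ, the following minimal sketch uses only functions that already exist in PHP (strlen, levenshtein, and mb_str_split from mbstring). It assumes UTF-8 input and PHP 7.4 or later; the variable names are illustrative only.

```php
<?php
// Two spellings of "é": decomposed (e + combining acute accent) and precomposed.
$decomposed  = "\u{0065}\u{0301}";
$precomposed = "\u{00e9}";

// Code point view: the unit that mb_levenshtein() is proposed to compare.
var_dump(count(mb_str_split($decomposed, 1, 'UTF-8')));  // int(2)
var_dump(count(mb_str_split($precomposed, 1, 'UTF-8'))); // int(1)

// Byte view: the unit that the existing levenshtein() compares.
var_dump(strlen($decomposed));   // int(3)
var_dump(strlen($precomposed));  // int(2)

// The byte sequences share no byte, so the byte-based distance is 3.
var_dump(levenshtein($decomposed, $precomposed)); // int(3)
```

The byte-based result depends on how UTF-8 happens to encode each code point, which is why a code point based function gives a more meaningful answer for multibyte text.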
Proposal
Add the mb_levenshtein function.
function mb_levenshtein(string $string1, string $string2, int $insertion_cost = 1, int $replacement_cost = 1, int $deletion_cost = 1, ?string $encoding = null): int {}
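The intended semantics of this signature can be sketched in userland. The function below is not the proposed implementation; sketch_mb_levenshtein is a hypothetical name, and the sketch assumes that the cost parameters behave as they do in the existing levenshtein() function and that a null $encoding falls back to the mbstring internal encoding. It requires PHP 8.0 or later with mbstring.

```php
<?php
declare(strict_types=1);

// Hypothetical userland sketch: Levenshtein distance over code points.
function sketch_mb_levenshtein(
    string $string1,
    string $string2,
    int $insertion_cost = 1,
    int $replacement_cost = 1,
    int $deletion_cost = 1,
    ?string $encoding = null
): int {
    // Split both strings into arrays of single characters (code points).
    $a = mb_str_split($string1, 1, $encoding);
    $b = mb_str_split($string2, 1, $encoding);
    $lenA = count($a);
    $lenB = count($b);

    if ($lenA === 0) {
        return $lenB * $insertion_cost;
    }
    if ($lenB === 0) {
        return $lenA * $deletion_cost;
    }

    // $prev[$j]: cost of turning the first $i - 1 characters of $a
    // into the first $j characters of $b.
    $prev = [];
    for ($j = 0; $j <= $lenB; $j++) {
        $prev[$j] = $j * $insertion_cost;
    }

    for ($i = 1; $i <= $lenA; $i++) {
        $curr = [$i * $deletion_cost];
        for ($j = 1; $j <= $lenB; $j++) {
            $replace = $prev[$j - 1]
                + ($a[$i - 1] === $b[$j - 1] ? 0 : $replacement_cost);
            $delete = $prev[$j] + $deletion_cost;
            $insert = $curr[$j - 1] + $insertion_cost;
            $curr[$j] = min($replace, $delete, $insert);
        }
        $prev = $curr;
    }

    return $prev[$lenB];
}

// Two of the five characters differ, so two replacements are needed.
var_dump(sketch_mb_levenshtein("こんにちは", "こんばんは")); // int(2)

// Building "あい" from an empty string takes two insertions at cost 3 each.
var_dump(sketch_mb_levenshtein("", "あい", 3, 1, 1)); // int(6)
```

Keeping only two rows of the distance table holds memory proportional to the length of the second string, which is a common choice for this algorithm.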
Backward Incompatible Changes
This could break userland code that already defines a function named mb_levenshtein in the global namespace.
Proposed PHP Version(s)
PHP 8.5
RFC Impact
To SAPIs
Adds the aforementioned function to all PHP environments.
To Existing Extensions
Adds mb_levenshtein() to the mbstring extension.
To Opcache
No effect.
New Constants
No new constants.
php.ini Defaults
No changed php.ini settings.
Open Issues
Future Scope
Proposed Voting Choices
Voting
Implementation
References
Userland implementation is here:
Rejected Features