PHP RFC: Multibyte for levenshtein, mb_levenshtein function

Introduction

A multibyte-aware Levenshtein distance has been requested in the past. This RFC therefore proposes a new mb_levenshtein() function that implements it.

ref: https://github.com/php/php-src/issues/10180

Difference between mb_levenshtein and grapheme_levenshtein

The mb_levenshtein function computes the Levenshtein distance in Unicode code points rather than in bytes. This is useful when the code points themselves matter, for example to tell a combining character sequence apart from its precomposed form.

var_dump(mb_levenshtein("\u{0065}\u{0301}", "\u{00e9}")); // both render as "é"; result is 2 (one replacement plus one deletion).

Of course, there are cases where these two strings should be treated as identical. For those cases, a grapheme_levenshtein function will be proposed separately.
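To make the unit of comparison concrete, the following minimal sketch uses only functions that already exist in PHP. It shows how the byte-wise levenshtein() sees the two spellings of "é" from the example above, and how many code points each string contains. The variable names are illustrative only and are not part of the proposal.

```php
<?php
// "é" written two ways: base letter + combining acute accent, and precomposed.
$decomposed  = "\u{0065}\u{0301}"; // U+0065 U+0301, 3 bytes in UTF-8
$precomposed = "\u{00e9}";         // U+00E9, 2 bytes in UTF-8

// The existing levenshtein() compares bytes: 3 bytes vs 2 bytes, none shared.
$byteDistance = levenshtein($decomposed, $precomposed);

// mb_str_split() exposes the code-point view that a multibyte distance works on.
$codePointsA = mb_str_split($decomposed, 1, 'UTF-8');
$codePointsB = mb_str_split($precomposed, 1, 'UTF-8');

var_dump($byteDistance);       // int(3)
var_dump(count($codePointsA)); // int(2)
var_dump(count($codePointsB)); // int(1)
```

Because the two strings share no code point, turning the two-code-point string into the one-code-point string takes one replacement and one deletion.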

Proposal

Add the mb_levenshtein() function. Its signature mirrors levenshtein(), with an additional optional $encoding parameter.

function mb_levenshtein(string $string1, string $string2, int $insertion_cost = 1, int $replacement_cost = 1, int $deletion_cost = 1, ?string $encoding = null): int {}
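For readers who want to experiment before an implementation is available, here is a userland sketch of what the signature above could compute. It is an assumption-laden illustration, not the proposed C implementation: the name mb_levenshtein_sketch is hypothetical, it splits the input with mb_str_split(), and it runs the classic two-row dynamic-programming algorithm with the three costs interpreted as in levenshtein() (transforming $string1 into $string2).

```php
<?php
/**
 * Hypothetical userland sketch of a code-point Levenshtein distance.
 * The function name is illustrative and not part of the RFC.
 */
function mb_levenshtein_sketch(
    string $string1,
    string $string2,
    int $insertion_cost = 1,
    int $replacement_cost = 1,
    int $deletion_cost = 1,
    ?string $encoding = null
): int {
    // Split both strings into single characters (code points for UTF-8).
    $a = mb_str_split($string1, 1, $encoding);
    $b = mb_str_split($string2, 1, $encoding);
    $m = count($a);
    $n = count($b);

    // Row 0: building the first $j characters of $string2 from an empty prefix.
    $prev = [];
    for ($j = 0; $j <= $n; $j++) {
        $prev[$j] = $j * $insertion_cost;
    }

    for ($i = 0; $i < $m; $i++) {
        // Column 0: deleting the first $i + 1 characters of $string1.
        $curr = [($i + 1) * $deletion_cost];
        for ($j = 0; $j < $n; $j++) {
            $replace = $prev[$j] + ($a[$i] === $b[$j] ? 0 : $replacement_cost);
            $insert  = $curr[$j] + $insertion_cost;
            $delete  = $prev[$j + 1] + $deletion_cost;
            $curr[$j + 1] = min($replace, $insert, $delete);
        }
        $prev = $curr;
    }

    return $prev[$n];
}

var_dump(mb_levenshtein_sketch("\u{0065}\u{0301}", "\u{00e9}")); // int(2)
var_dump(mb_levenshtein_sketch("kitten", "sitting"));            // int(3)
```

The sketch keeps only two rows of the distance matrix, so memory grows with the length of $string2 rather than with the product of both lengths.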

Backward Incompatible Changes

Userland code that already defines a global function named mb_levenshtein() will break, because the name would then be taken by the extension.

Proposed PHP Version(s)

PHP 8.5

RFC Impact

To SAPIs

Adds the aforementioned function to all PHP environments.

To Existing Extensions

Adds mb_levenshtein() to the mbstring extension.

To Opcache

No effect.

New Constants

No new constants.

php.ini Defaults

No changed php.ini settings.

Open Issues

Future Scope

This section details areas where the feature might be improved in future, but that are not currently proposed in this RFC.

Proposed Voting Choices


Voting

Add mb_levenshtein function
Real name              Yes   No
cschneid (cschneid)          x
nielsdos (nielsdos)          x
timwolla (timwolla)          x
Count:                 0     3

Implementation

References

Userland implementation is here:

Rejected Features


rfc/mb_levenshtein.txt · Last modified: 2025/02/21 00:37 by youkidearitai