PHP RFC: Grapheme cluster for levenshtein, grapheme_levenshtein function

Introduction

I previously created the mb_levenshtein RFC: https://wiki.php.net/rfc/mb_levenshtein. However, the discussion suggested that a Levenshtein function operating on grapheme clusters would be more logical. I agreed, so I created a proof of concept (PoC).

ref: https://github.com/php/php-src/issues/16428

For example, combining characters are handled correctly.

var_dump(grapheme_levenshtein("\u{0065}\u{0301}", "\u{00e9}")); // Result is 0 with grapheme_levenshtein. mb_levenshtein does not handle this case well.

Variation selectors are also handled correctly.

// $nabe and $nabe_E0100 look identical when rendered.
// However, $nabe_E0100 carries a variation selector: U+908A U+E0100.
// So grapheme_levenshtein is expected to return 0.
$nabe = '邊';
$nabe_E0100 = "邊󠄀";
var_dump(grapheme_levenshtein($nabe, $nabe_E0100)); // Result is 0 with grapheme_levenshtein. mb_levenshtein returns 1, which is not the desired result.
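
To make the idea of "edit distance per grapheme cluster" concrete, here is a minimal userland sketch. It is not the implementation proposed in this RFC. The names sketch_clusters() and sketch_grapheme_levenshtein() are hypothetical, and the approach (NFC normalization plus PCRE's \X) is an assumption chosen for illustration. It covers canonical equivalence such as the combining-character example, but it does not try to treat variation selectors as ignorable, so it counts the 邊 example as one replacement.

```php
<?php
// Illustrative userland sketch only; NOT the implementation proposed by the RFC.
// Requires ext/intl (Normalizer) and PCRE with Unicode support.

/** Split a UTF-8 string into extended grapheme clusters after NFC normalization. */
function sketch_clusters(string $s): array|false
{
    // Normalizer::normalize() returns false for input that is not valid UTF-8.
    $nfc = Normalizer::normalize($s, Normalizer::FORM_C);
    if ($nfc === false) {
        return false;
    }
    // \X matches one extended grapheme cluster; /u enables UTF-8 mode.
    if (preg_match_all('/\X/u', $nfc, $m) === false) {
        return false;
    }
    return $m[0];
}

/** Levenshtein distance where the unit of comparison is a grapheme cluster. */
function sketch_grapheme_levenshtein(
    string $a,
    string $b,
    int $ins = 1,
    int $rep = 1,
    int $del = 1
): int|false {
    $x = sketch_clusters($a);
    $y = sketch_clusters($b);
    if ($x === false || $y === false) {
        return false;
    }

    // $prev[$j]: cost of turning the clusters of $a seen so far into
    // the first $j clusters of $b. Start with zero clusters of $a.
    $prev = [0];
    foreach ($y as $j => $_) {
        $prev[$j + 1] = $prev[$j] + $ins;
    }

    foreach ($x as $cx) {
        $cur = [$prev[0] + $del];
        foreach ($y as $j => $cy) {
            $cur[$j + 1] = min(
                $prev[$j] + ($cx === $cy ? 0 : $rep), // keep or replace
                $prev[$j + 1] + $del,                 // delete a cluster of $a
                $cur[$j] + $ins                       // insert a cluster of $b
            );
        }
        $prev = $cur;
    }

    return $prev[count($y)];
}

var_dump(sketch_grapheme_levenshtein("\u{0065}\u{0301}", "\u{00e9}")); // int(0)
var_dump(sketch_grapheme_levenshtein("kitten", "sitting"));            // int(3)
```

The sketch keeps only two rows of the dynamic-programming table, so memory grows with the number of clusters in the second string rather than with the product of both lengths.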

Proposal

Add the grapheme_levenshtein() function.

function grapheme_levenshtein(string $string1, string $string2, int $insertion_cost = 1, int $replacement_cost = 1, int $deletion_cost = 1): int|false {}

$string1 and $string2 must be valid UTF-8. The function returns false if either string cannot be parsed as UTF-8.
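
A caller might use the signature above as in the following sketch. The wrapper grapheme_distance_or_null() is a hypothetical helper, not part of the proposal. The function_exists() guard is there because the function is only available on builds that include this RFC, and the false check follows the return type described above.

```php
<?php
// Hypothetical helper, not part of the RFC: wraps the proposed function so
// the snippet also runs on PHP builds where it is not available.
function grapheme_distance_or_null(string $a, string $b): ?int
{
    if (!function_exists('grapheme_levenshtein')) {
        return null; // proposed function not present in this build
    }

    // Costs are passed explicitly in signature order:
    // insertion, replacement, deletion.
    $distance = grapheme_levenshtein($a, $b, 1, 1, 1);

    if ($distance === false) {
        // Per the proposal, false signals input that could not be parsed as UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $distance;
}

var_dump(grapheme_distance_or_null("\u{0065}\u{0301}", "\u{00e9}"));
```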

Backward Incompatible Changes

This could break userland code that already defines a function with the same name.

Proposed PHP Version(s)

PHP 8.5

RFC Impact

To SAPIs

Adds the aforementioned function to all PHP environments.

To Existing Extensions

Adds grapheme_levenshtein() to the intl extension.

To Opcache

No effect.

New Constants

No new constants.

php.ini Defaults

No changed php.ini settings.

Open Issues

https://github.com/php/php-src/issues/16428

Future Scope

This section details areas where the feature might be improved in future, but that are not currently proposed in this RFC.

Proposed Voting Choices


Voting

TBD.

Implementation

https://github.com/php/php-src/pull/16043

References

Nothing.

Rejected Features
