====== PHP RFC: Grapheme cluster for str_split function: grapheme_str_split ======
* Version: 0.1
* Date: 2024-03-04
* Author: Yuya Hamada, youkidearitai@gmail.com
* Status: Implemented
* Target Version: PHP 8.4
* Implementation: https://github.com/php/php-src/pull/13580
* First Published at: http://wiki.php.net/rfc/grapheme_str_split
===== Introduction =====
PHP does not have a str_split-style function that operates on grapheme clusters. This RFC proposes one, grapheme_str_split(), implemented with [[https://unicode-org.github.io/icu/userguide/icu4c/|ICU]]. Adding this function to the intl extension would strengthen PHP's support for grapheme clusters and make it possible to handle emoji and variation selectors correctly.
grapheme_str_split() keeps a grapheme cluster together and returns it as a single element:
$ sapi/cli/php -r 'var_dump(grapheme_str_split("๐โโ๏ธ"));'
array(1) {
[0]=>
string(13) "๐โโ๏ธ"
}
By comparison, mb_str_split() is str_split() for Unicode code points, so it breaks the same cluster into its individual code points. (Of course, this is sometimes more convenient.)
$ sapi/cli/php -r 'var_dump(mb_str_split("๐โโ๏ธ"));'
array(4) {
[0]=>
string(4) "๐"
[1]=>
string(3) "โ" // U+200D ZERO WIDTH JOINER
[2]=>
string(3) "โ"
[3]=>
string(3) "๏ธ" // U+FE0F VARIATION SELECTOR-16
}
Until now, handling grapheme clusters required the PCRE functions:
$ sapi/cli/php -r 'preg_match_all("/(\X)/u", "๐โโ๏ธ", $matches, PREG_OFFSET_CAPTURE); var_dump($matches[1]);'
array(1) {
[0]=>
array(2) {
[0]=>
string(13) "๐โโ๏ธ"
[1]=>
int(0)
}
}
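The PCRE workaround above can be wrapped in a small helper. The following is a minimal sketch, not part of the RFC: the function name pcre_grapheme_clusters() is hypothetical, and how \X treats emoji ZWJ sequences depends on the PCRE2 version PHP was built against, so the example sticks to a combining mark.

```php
<?php
// Hypothetical helper (not part of the RFC): split a UTF-8 string into
// grapheme clusters with PCRE's \X, the workaround available before PHP 8.4.
function pcre_grapheme_clusters(string $string): array
{
    // preg_match_all() returns false on failure, e.g. for invalid UTF-8 input.
    if (preg_match_all('/\X/u', $string, $matches) === false) {
        return [];
    }
    // $matches[0] holds one full match per grapheme cluster.
    return $matches[0];
}

// "a" followed by U+0308 COMBINING DIAERESIS forms a single cluster.
$clusters = pcre_grapheme_clusters("a\u{0308}-pqr");
var_dump($clusters);
```

Unlike the PREG_OFFSET_CAPTURE call shown earlier, this variant returns only the cluster strings, which is closer to the shape grapheme_str_split() produces.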
Other languages already offer this. For example, Ruby supports grapheme clusters through [[https://ruby-doc.org/3.2.2/String.html#method-i-grapheme_clusters|String#grapheme_clusters]]:
s = "\u0061\u0308-pqr-\u0062\u0308-xyz-\u0063\u0308" # => "ä-pqr-b̈-xyz-c̈"
s.grapheme_clusters
# => ["ä", "-", "p", "q", "r", "-", "b̈", "-", "x", "y", "z", "-", "c̈"]
grapheme_str_split() likewise keeps each base character together with its combining marks:
$ sapi/cli/php -r 'var_dump(grapheme_str_split("ä-pqr-b̈-xyz-c̈"));'
array(13) {
[0]=>
string(2) "ä"
[1]=>
string(1) "-"
[2]=>
string(1) "p"
[3]=>
string(1) "q"
[4]=>
string(1) "r"
[5]=>
string(1) "-"
[6]=>
string(3) "b̈"
[7]=>
string(1) "-"
[8]=>
string(1) "x"
[9]=>
string(1) "y"
[10]=>
string(1) "z"
[11]=>
string(1) "-"
[12]=>
string(3) "c̈"
}
===== Proposal =====
Add the grapheme_str_split() function:
function grapheme_str_split(string $string, int $length = 1): array|false {}
$string must be UTF-8 encoded; no other encoding is supported. $length is the number of grapheme clusters in each element of the returned array.
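As an illustration of the $length parameter, the following sketch splits a string into chunks of two grapheme clusters. It assumes PHP 8.4 or later with the intl extension loaded; the function_exists() guard is only there so the script still runs on older builds.

```php
<?php
// Assumes PHP 8.4+ with the intl extension; on older builds $chunks stays null.
// The input has five grapheme clusters but seven code points, because two of
// the letters carry U+0308 COMBINING DIAERESIS.
$text = "a\u{0308}bc\u{0308}de";

$chunks = null;
if (function_exists('grapheme_str_split')) {
    // Each array element holds up to two grapheme clusters.
    $chunks = grapheme_str_split($text, 2);
}
var_dump($chunks);
```

Because the chunk size is counted in grapheme clusters rather than bytes or code points, a combining mark is never separated from its base character at a chunk boundary.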
===== Backward Incompatible Changes =====
This could break userland code that already defines a function named grapheme_str_split().
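Userland code that ships its own function of this name can avoid the clash by declaring it conditionally. The following is a sketch under assumptions, not the RFC's implementation: the fallback body relies on PCRE's \X and may differ from ICU in edge cases such as emoji ZWJ sequences.

```php
<?php
// Sketch only: declare a userland fallback when the intl function is absent,
// so the declaration does not collide with the built-in on PHP 8.4+.
if (!function_exists('grapheme_str_split')) {
    function grapheme_str_split(string $string, int $length = 1): array|false
    {
        // Approximate grapheme clusters with PCRE's \X (an assumption of this
        // sketch, not how the intl extension works internally).
        if (preg_match_all('/\X/u', $string, $matches) === false) {
            return false;
        }
        // array_chunk() throws a ValueError when $length is less than 1.
        $chunks = array_chunk($matches[0], $length);
        // Join the clusters of each chunk back into one string.
        return array_map(fn(array $chunk): string => implode('', $chunk), $chunks);
    }
}

var_dump(grapheme_str_split("a\u{0308}bc", 2));
```

With the guard in place the same call works whether the built-in or the fallback is used.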
===== Proposed PHP Version(s) =====
PHP 8.4
===== RFC Impact =====
==== To SAPIs ====
The aforementioned function will be added to all PHP environments.
==== To Existing Extensions ====
Add grapheme_str_split() to the intl extension.
==== To Opcache ====
No effect.
==== New Constants ====
No new constants.
==== php.ini Defaults ====
No changed php.ini settings.
===== Open Issues =====
No open issues.
===== Future Scope =====
This section details areas where the feature might be improved in future, but that are not currently proposed in this RFC.
===== Proposed Voting Choices =====
* Yes
* No
===== Implementation =====
https://github.com/php/php-src/pull/13580
===== Rejected Features =====
Keep this updated with features that were discussed on the mail lists.