====== PHP RFC: Grapheme cluster for str_split function: grapheme_str_split ======

  * Version: 0.1
  * Date: 2024-03-04
  * Author: Yuya Hamada, youkidearitai@gmail.com
  * Status: Implemented
  * Target Version: PHP 8.4
  * Implementation: https://github.com/php/php-src/pull/13580
  * First Published at: http://wiki.php.net/rfc/grapheme_str_split

===== Introduction =====

PHP does not have a grapheme-cluster-based counterpart to str_split. This RFC proposes one, grapheme_str_split, implemented with [[https://unicode-org.github.io/icu/userguide/icu4c/|ICU]]. Adding this function to the Intl extension would strengthen PHP's support for grapheme clusters and make it possible to handle emoji and variation selectors correctly.

grapheme_str_split splits a string on grapheme cluster boundaries:

  $ sapi/cli/php -r 'var_dump(grapheme_str_split("🙇‍♂️"));'
  array(1) {
    [0]=>
    string(13) "🙇‍♂️"
  }

By comparison, mb_str_split splits on Unicode code points. (Of course, sometimes this is more convenient.)

  $ sapi/cli/php -r 'var_dump(mb_str_split("🙇‍♂️"));'
  array(4) {
    [0]=>
    string(4) "🙇"
    [1]=>
    string(3) "‍" // U+200D Zero Width Joiner
    [2]=>
    string(3) "♂"
    [3]=>
    string(3) "️" // U+FE0F VARIATION SELECTOR
  }

Until now, splitting on grapheme clusters required the PCRE functions:

  $ sapi/cli/php -r 'preg_match_all("/(\X)/u", "🙇‍♂️", $matches, PREG_OFFSET_CAPTURE); var_dump($matches[1]);'
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(13) "🙇‍♂️"
      [1]=>
      int(0)
    }
  }

Other languages already offer this. Ruby supports grapheme clusters through [[https://ruby-doc.org/3.2.2/String.html#method-i-grapheme_clusters|String#grapheme_clusters]]:

  s = "\u0061\u0308-pqr-\u0062\u0308-xyz-\u0063\u0308" # => "ä-pqr-b̈-xyz-c̈"
  s.grapheme_clusters
  # => ["ä", "-", "p", "q", "r", "-", "b̈", "-", "x", "y", "z", "-", "c̈"]

grapheme_str_split handles such grapheme clusters (base characters followed by combining marks) in the same way.
  $ sapi/cli/php -r 'var_dump(grapheme_str_split("ä-pqr-b̈-xyz-c̈"));'
  array(13) {
    [0]=>
    string(2) "ä"
    [1]=>
    string(1) "-"
    [2]=>
    string(1) "p"
    [3]=>
    string(1) "q"
    [4]=>
    string(1) "r"
    [5]=>
    string(1) "-"
    [6]=>
    string(3) "b̈"
    [7]=>
    string(1) "-"
    [8]=>
    string(1) "x"
    [9]=>
    string(1) "y"
    [10]=>
    string(1) "z"
    [11]=>
    string(1) "-"
    [12]=>
    string(3) "c̈"
  }

===== Proposal =====

Add the grapheme_str_split function.

  function grapheme_str_split(string $string, int $length = 1): array|false {}

$string must be UTF-8 encoded; no other encoding is supported. $length is the number of grapheme clusters in each element of the returned array.

===== Backward Incompatible Changes =====

This could break existing userland code that defines a function with the same name.

===== Proposed PHP Version(s) =====

PHP 8.4

===== RFC Impact =====

==== To SAPIs ====

Adds the aforementioned function to all PHP environments.

==== To Existing Extensions ====

Adds grapheme_str_split() to the intl extension.

==== To Opcache ====

No effect.

==== New Constants ====

No new constants.

==== php.ini Defaults ====

No php.ini settings are changed.

===== Open Issues =====

No issues.

===== Future Scope =====

This section details areas where the feature might be improved in future, but that are not currently proposed in this RFC.

===== Proposed Voting Choices =====

  * Yes
  * No

===== Implementation =====

https://github.com/php/php-src/pull/13580

===== Rejected Features =====

Keep this updated with features that were discussed on the mailing lists.
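
===== Appendix: Userland sketch of the $length parameter =====

The Proposal section describes $length as the number of grapheme clusters per array element. The following is a minimal userland sketch of that chunking, built on the PCRE ''\X'' approach shown in the Introduction. It is not the implementation from the pull request: the function name grapheme_chunks_sketch is hypothetical, and the error handling (exceptions rather than a false return) is an assumption made for illustration only.

```php
<?php
/**
 * Hypothetical helper, not part of this RFC or the intl extension.
 * Splits a UTF-8 string into chunks of $length grapheme clusters via PCRE \X.
 */
function grapheme_chunks_sketch(string $string, int $length = 1): array
{
    if ($length < 1) {
        // Assumption: reject non-positive lengths with an exception.
        throw new InvalidArgumentException('$length must be greater than 0');
    }

    // \X matches one extended grapheme cluster; the /u flag enables UTF-8 mode.
    // preg_match_all() returns false on failure, e.g. for invalid UTF-8 input.
    if (preg_match_all('/\X/u', $string, $matches) === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // Group the clusters $length at a time, then join each group into one string.
    return array_map(
        static fn(array $chunk): string => implode('', $chunk),
        array_chunk($matches[0], $length)
    );
}

// "e" followed by U+0301 (combining acute accent) counts as a single cluster.
var_dump(grapheme_chunks_sketch("e\u{0301}xyz", 2));
```

On PHP 8.4 with the intl extension, calling grapheme_str_split($string, $length) directly would be the natural choice; a sketch like this is mainly useful for reasoning about the chunking, or as a starting point for a fallback on older versions.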