
PHP RFC: Grapheme cluster for str_split function: grapheme_str_split

Introduction

PHP does not have a str_split variant that operates on grapheme clusters. This RFC proposes one, grapheme_str_split, implemented with ICU. Adding the function to the Intl extension would strengthen PHP's support for grapheme clusters and make it possible to split strings containing emoji and variation selectors correctly.

grapheme_str_split splits a string on grapheme cluster boundaries:

$ sapi/cli/php -r 'var_dump(grapheme_str_split("🙇‍♂️"));'
array(1) {
  [0]=>
  string(13) "🙇‍♂️"
}

By comparison, mb_str_split splits a string on Unicode code points. (Sometimes that is the more convenient behavior.)

$ sapi/cli/php -r 'var_dump(mb_str_split("🙇‍♂️"));'
array(4) {
  [0]=>
  string(4) "🙇"
  [1]=>
  string(3) "‍" // U+200D ZERO WIDTH JOINER
  [2]=>
  string(3) "♂"
  [3]=>
  string(3) "️" // U+FE0F VARIATION SELECTOR-16
}

Until now, handling grapheme clusters required the PCRE functions and the \X escape sequence:

$ sapi/cli/php -r 'preg_match_all("/(\X)/u", "🙇‍♂️", $matches, PREG_OFFSET_CAPTURE); var_dump($matches[1]);'
array(1) {
  [0]=>
  array(2) {
    [0]=>
    string(13) "🙇‍♂️"
    [1]=>
    int(0)
  }
}

Other languages already offer this. Ruby, for example, supports grapheme clusters through String#grapheme_clusters:

s = "\u0061\u0308-pqr-\u0062\u0308-xyz-\u0063\u0308" # => "ä-pqr-b̈-xyz-c̈"
s.grapheme_clusters
# => ["ä", "-", "p", "q", "r", "-", "b̈", "-", "x", "y", "z", "-", "c̈"]

grapheme_str_split likewise keeps each base character together with its combining marks:

$ sapi/cli/php -r 'var_dump(grapheme_str_split("ä-pqr-b̈-xyz-c̈"));'
array(13) {
  [0]=>
  string(2) "ä"
  [1]=>
  string(1) "-"
  [2]=>
  string(1) "p"
  [3]=>
  string(1) "q"
  [4]=>
  string(1) "r"
  [5]=>
  string(1) "-"
  [6]=>
  string(3) "b̈"
  [7]=>
  string(1) "-"
  [8]=>
  string(1) "x"
  [9]=>
  string(1) "y"
  [10]=>
  string(1) "z"
  [11]=>
  string(1) "-"
  [12]=>
  string(3) "c̈"
}

Proposal

Add the grapheme_str_split function.

function grapheme_str_split(string $string, int $length = 1): array|false {}

$string must be UTF-8 encoded. $length is the number of grapheme clusters in each element of the returned array.
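
To make the role of $length concrete, here is a minimal userland sketch that emulates it with PCRE. The helper name chunk_graphemes is hypothetical, and the sketch assumes that $length groups consecutive clusters the way str_split groups bytes and that values below 1 are rejected; it is not the ICU-based implementation proposed here.

```php
<?php
// Hypothetical helper: emulates the proposed $length parameter in userland.
// Assumption: each array element holds $length consecutive grapheme clusters,
// with a shorter final element when the count does not divide evenly.
function chunk_graphemes(string $string, int $length = 1): array|false
{
    if ($length < 1) {
        // Assumed behavior, mirroring str_split(); the RFC text does not specify it.
        throw new ValueError('$length must be greater than 0');
    }

    // \X matches one extended grapheme cluster; the /u flag enables UTF-8 mode.
    // preg_match_all() returns false on failure, for example malformed UTF-8.
    if (preg_match_all('/\X/u', $string, $matches) === false) {
        return false;
    }

    // Group consecutive clusters, then glue each group back into one string.
    return array_map(
        fn(array $chunk): string => implode('', $chunk),
        array_chunk($matches[0], $length)
    );
}

// "a" followed by U+0308 is one cluster, so five clusters are split into groups of two.
$parts = chunk_graphemes("a\u{0308}-pqr", 2);
```

With $length = 2 the five clusters of the sample string end up in three elements, the last one holding the single leftover cluster "r".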

Backward Incompatible Changes

This could break userland code that already defines a global function with the same name.
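
One way to check whether a codebase is affected is to ask PHP where a global grapheme_str_split comes from. The sketch below uses function_exists and ReflectionFunction::isInternal(); the helper name describe_grapheme_str_split is hypothetical, and the check is only meaningful once all userland function files have been loaded.

```php
<?php
// Hypothetical audit helper: reports whether a global grapheme_str_split()
// is absent, provided by PHP itself, or defined by userland code.
function describe_grapheme_str_split(): string
{
    if (!function_exists('grapheme_str_split')) {
        return 'missing';
    }

    $function = new ReflectionFunction('grapheme_str_split');

    // isInternal() is true for functions supplied by PHP or an extension.
    return $function->isInternal() ? 'internal' : 'userland';
}

echo describe_grapheme_str_split(), PHP_EOL;
```

A result of 'userland' on a PHP version without the new function points at a definition that would clash after upgrading, unless that definition is already wrapped in a function_exists check.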

Proposed PHP Version(s)

PHP 8.4

RFC Impact

To SAPIs

Adds the aforementioned function to all PHP environments.

To Existing Extensions

Add grapheme_str_split() to the intl extension.

To Opcache

No effect.

New Constants

No new constants.

php.ini Defaults

No changed php.ini settings.

Open Issues

No open issues.

Future Scope

This section details areas where the feature might be improved in future, but that are not currently proposed in this RFC.

Proposed Voting Choices

Add grapheme cluster for str_split function: grapheme_str_split
Real name                        Yes  No
ashnazg (ashnazg)                x
beberlei (beberlei)              x
crell (crell)                    x
dams (dams)                      x
derick (derick)                  x
devnexen (devnexen)              x
galvao (galvao)                  x
imsop (imsop)                    x
jimw (jimw)                      x
jwage (jwage)                    x
mcmic (mcmic)                    x
nicolasgrekas (nicolasgrekas)    x
nielsdos (nielsdos)              x
ocramius (ocramius)              x
petk (petk)                      x
saki (saki)                      x
sergey (sergey)                  x
theodorejb (theodorejb)          x
weierophinney (weierophinney)    x
Final result:                    19   0
This poll has been closed.

Implementation

Rejected Features

Keep this updated with features that were discussed on the mail lists.

rfc/grapheme_str_split.txt · Last modified: 2024/04/10 19:21 by youkidearitai
