PHP RFC: grapheme_limit_codepoints

While reading Unicode Standard Annex #29 (UAX #29), I noticed that it places no limit on the number of code points in a grapheme cluster. Computer resources are finite but the length of a grapheme cluster is not, so a sufficiently long grapheme cluster can exhaust those resources and crash a program.

Introduction

This proposal makes grapheme cluster handling safer by adding a function that checks the number of code points in each grapheme cluster against a limit.

<?php
$f = "あい👨‍👨‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦‍👦うえお";
var_dump(grapheme_limit_codepoints($f)); // returns false because the 3rd grapheme cluster has more than 32 code points
 
$f = "あいうえお👨‍👨‍👦";
var_dump(grapheme_limit_codepoints($f)); // returns true
 
$f = "あいうえおH̵̛͕̞̦̰̜͍̰̥̟͆̏͂̌͑ͅä̷͔̟͓̬̯̟͍̭͉͈̮͙̣̯̬͚̞̭̍̀̾͠m̴̡̧̛̝̯̹̗̹̤̲̺̟̥̈̏͊̔̑̍͆̌̀̚͝͝b̴̢̢̫̝̠̗̼̬̻̮̺̭͔̘͑̆̎̚ư̵̧̡̥̙̭̿̈̀̒̐̊͒͑r̷̡̡̲̼̖͎̫̮̜͇̬͌͘g̷̹͍͎̬͕͓͕̐̃̈́̓̆̚͝ẻ̵̡̼̬̥̹͇̭͔̯̉͛̈́̕r̸̮̖̻̮̣̗͚͖̝̂͌̾̓̀̿̔̀͋̈́͌̈́̋͜👨‍👨‍👦";
var_dump(grapheme_limit_codepoints($f)); // returns true: this is Zalgo text for "Hamburger", but each grapheme cluster has fewer than 32 code points
?>

Proposal

Checks whether the number of code points in every grapheme cluster of $string is lower than $limit.

<?php
function grapheme_limit_codepoints(string $string, int $limit = GRAPHEME_LIMIT_CODEPOINTS): bool {}
?>

GRAPHEME_LIMIT_CODEPOINTS is 32, a value based on the Stream-Safe Text Format described in UAX #15. Unicode's official answer is not to rely on the Stream-Safe Text Format, but I think it makes sense as a basis here.
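
To give a rough idea of the check described above, here is a minimal userland sketch. It is not the proposed implementation. The names sketch_grapheme_limit_codepoints() and SKETCH_GRAPHEME_LIMIT_CODEPOINTS are hypothetical and exist only for this illustration. The sketch splits the string with PCRE's \X instead of the intl extension, so its cluster boundaries may differ from what ICU reports. It also assumes that a cluster fails the check when it has more than $limit code points, which matches the first example in the Introduction.

```php
<?php
// Hypothetical stand-in for the proposed GRAPHEME_LIMIT_CODEPOINTS constant.
const SKETCH_GRAPHEME_LIMIT_CODEPOINTS = 32;

// Hypothetical userland approximation of the proposed function.
// Assumption: a cluster with more than $limit code points fails the check.
function sketch_grapheme_limit_codepoints(
    string $string,
    int $limit = SKETCH_GRAPHEME_LIMIT_CODEPOINTS
): bool {
    // \X matches one extended grapheme cluster as PCRE understands it.
    // With the /u modifier, preg_match_all() returns false on invalid UTF-8.
    if (preg_match_all('/\X/u', $string, $matches) === false) {
        return false;
    }

    foreach ($matches[0] as $cluster) {
        // Count the code points in this cluster: with /su, "." matches
        // exactly one code point, including newlines.
        $codepoints = preg_match_all('/./su', $cluster);
        if ($codepoints === false || $codepoints > $limit) {
            return false;
        }
    }

    return true;
}

// One base letter followed by 40 combining acute accents (U+0301)
// forms a single grapheme cluster of 41 code points.
$long = "a" . str_repeat("\u{0301}", 40);

var_dump(sketch_grapheme_limit_codepoints("abc"));
var_dump(sketch_grapheme_limit_codepoints($long));
var_dump(sketch_grapheme_limit_codepoints($long, 50));
```

Passing a larger $limit, as in the last call, shows how a caller could relax the default when longer clusters are expected.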

Examples

First check the number of code points per grapheme cluster, then measure the string with grapheme_strlen().

Simple example:

<?php
 
$f = "あいうえお👨‍👨‍👦";
var_dump(grapheme_limit_codepoints($f)); // true
var_dump(grapheme_strlen($f)); // result is 6
 
?>

Backward Incompatible Changes

This could break userland code that already defines a function with the same name.

Proposed PHP Version(s)

The next version after PHP 8.5 (PHP 8.6 or PHP 9.0)

RFC Impact

To the Ecosystem

None

To Existing Extensions

Adds grapheme_limit_codepoints() to the intl extension.

To SAPIs

None

Future Scope

None

Voting Choices

Please consult the php/policies repository for the current voting guidelines.


Primary Vote requiring a 2/3 majority to accept the RFC:

Add grapheme_limit_codepoints
Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

Patches and Tests

Implementation

References

Rejected Features

Keep this updated with features that were discussed on the mail lists.

Changelog

If there are major changes to the initial proposal, please include a short summary with a date or a link to the mailing list announcement here, as not everyone has access to the wikis' version history.

rfc/grapheme_limit_codepoints.txt · Last modified: by youkidearitai