====== PHP RFC: Multibyte for trim function mb_trim, mb_ltrim and mb_rtrim ======
* Version: 0.1
* Date: 2023-10-18
* Author: Yuya Hamada (https://github.com/youkidearitai), youkidearitai@gmail.com, based on 8ctopus (https://github.com/8ctopus), hello@octopuslabs.io
* Status: Implemented
* First Published at: http://wiki.php.net/rfc/mb_trim
===== Introduction =====
PHP does not have a multibyte equivalent of the trim() function. Similar behavior can be obtained with preg_replace("/^\s+|\s+$/u", '', $string), but a built-in function improves the readability and clarity of PHP code. It also standardizes an operation that is easy to get wrong. This feature would be useful to PHP developers of all experience levels and would fill a gap in the mbstring extension.
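As a rough illustration of the workaround mentioned above, the regular expression can be wrapped in a userland helper. The name regex_trim() is hypothetical and is not part of this RFC. The sketch also shows one reason the idiom is easy to get wrong: with the /u modifier, preg_replace() returns null when the input is not valid UTF-8.

```php
<?php
// Hypothetical userland helper wrapping the regex workaround from the introduction.
// Returns null when $string is not valid UTF-8, because preg_replace() fails under /u.
function regex_trim(string $string): ?string
{
    return preg_replace('/^\s+|\s+$/u', '', $string);
}

// Whitespace around multibyte text is removed.
var_dump(regex_trim("  こんにちは \n") === "こんにちは");

// Invalid UTF-8 input makes preg_replace() return null, which callers must handle.
var_dump(regex_trim("\xFF") === null);
```

A dedicated function avoids both the pattern and the null check at every call site.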
One use case is trimming a Byte Order Mark (BOM), for which mb_ltrim() would work:
mb_ltrim($string, "\u{FEFF}\u{FFFE}");
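A minimal sketch of how this might look in application code follows. The helper name strip_bom() and the sample data are assumptions made for illustration. The fallback branch exists only so that the example also runs on builds without mb_ltrim(), and it handles only U+FEFF.

```php
<?php
// Hypothetical helper: removes leading U+FEFF (byte order mark) from a UTF-8 string.
function strip_bom(string $string): string
{
    if (function_exists('mb_ltrim')) {
        // Function proposed by this RFC; the encoding defaults to the internal encoding.
        return mb_ltrim($string, "\u{FEFF}");
    }

    // Fallback for builds without mb_ltrim(). preg_replace() returns null on
    // invalid UTF-8, in which case the input is returned unchanged.
    return preg_replace('/^\x{FEFF}+/u', '', $string) ?? $string;
}

// Sample data (made up): a CSV payload that starts with a BOM.
$csv = "\u{FEFF}id,name\n1,Alice\n";
var_dump(strip_bom($csv) === "id,name\n1,Alice\n");
```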
===== Proposal =====
Add the mb_trim(), mb_ltrim() and mb_rtrim() functions:
function mb_trim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}", ?string $encoding = null): string {}
function mb_ltrim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}", ?string $encoding = null): string {}
function mb_rtrim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}", ?string $encoding = null): string {}
The following characters are trimmed by default:
Same as trim():
U+0020 SPACE (also in Separator category)
U+0009 \t
U+000A \n
U+000B \v
U+000D \r
Not removed by trim(), probably because it was not common enough, but reasonable for mb_trim():
U+000C \f
Removed by trim(), but not matched by regex \s:
U+0000 \0
Whole Separator (Z) category (19 code points), covered by regex \s:
U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+1680 OGHAM SPACE MARK
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE
Other symbols (included in regex \s):
U+0085 NEXT LINE (NEL)
U+180E MONGOLIAN VOWEL SEPARATOR
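To make the difference from trim() concrete, the following sketch trims by code point using the default character list from the proposed signatures. It is an approximation built on preg_replace(), not the RFC's implementation. The name sketch_mb_trim() is hypothetical, and the sketch assumes UTF-8 input only.

```php
<?php
// Hypothetical approximation of the proposed default behavior, UTF-8 only.
function sketch_mb_trim(string $string, ?string $characters = null): string
{
    // Default set copied from the proposed function signatures.
    $characters ??= " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}";
    if ($characters === '') {
        return $string; // Nothing to trim; also avoids building an empty class.
    }

    // Escape the set so every character is literal inside a character class.
    $class = '[' . preg_quote($characters, '/') . ']+';

    // /u makes the class match whole code points instead of single bytes.
    // preg_replace() returns null on invalid UTF-8; return the input unchanged then.
    return preg_replace('/^' . $class . '|' . $class . '$/u', '', $string) ?? $string;
}

$text = "\u{3000}東京\u{3000}";

// trim() works on bytes and does not know U+3000 IDEOGRAPHIC SPACE.
var_dump(trim($text) === $text);

// The code-point-aware sketch removes it on both sides.
var_dump(sketch_mb_trim($text) === "東京");
```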
The ".." range notation that trim() accepts in $characters (for example \u{0000}..\u{FFFF}) is not supported, for the following reasons:
* The Unicode code space is very large
* Ranges are difficult to search
* Ranges are difficult to store in memory
* Mappings to other character encodings may be incompatible
* For example, the Hiragana range is expressed as [あ-ゞ] in UTF-8, [あ-ゝゞ] in EUC-JP, and [あ-ん] in Shift_JIS.
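Code that does need to trim a range of code points can fall back to a Unicode regular expression. The sketch below is an alternative technique, not something this RFC provides. The helper name trim_hiragana() is hypothetical, and the range U+3042 to U+309E corresponds to the UTF-8 example [あ-ゞ] above.

```php
<?php
// Hypothetical helper, not part of this RFC: trims a code-point range with a regex.
function trim_hiragana(string $string): string
{
    // U+3042 (あ) through U+309E (ゞ), matched by code point because of /u.
    $range = '[\x{3042}-\x{309E}]+';

    // preg_replace() returns null on invalid UTF-8; return the input unchanged then.
    return preg_replace('/^' . $range . '|' . $range . '$/u', '', $string) ?? $string;
}

var_dump(trim_hiragana("あいうABCえお") === "ABC");

// For comparison, the ".." notation in trim() describes a range of single bytes.
var_dump(trim("abcHELLOcba", "a..c") === "HELLO");
```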
===== Backward Incompatible Changes =====
Userland code that already defines a global function named mb_trim, mb_ltrim or mb_rtrim will conflict with the new functions.
===== Proposed PHP Version(s) =====
next PHP 8.x
===== RFC Impact =====
==== To SAPIs ====
The aforementioned functions will be added to all PHP environments.
==== To Existing Extensions ====
Adds mb_trim(), mb_ltrim() and mb_rtrim() to the mbstring extension.
==== To Opcache ====
No effect.
==== New Constants ====
No new constants.
==== php.ini Defaults ====
No changed php.ini settings.
===== Open Issues =====
https://github.com/php/php-src/issues/9216
===== Future Scope =====
This section details areas where the feature might be improved in future, but that are not currently proposed in this RFC.
===== Proposed Voting Choices =====
Include these so readers know where you are heading and can discuss the proposed voting options.
===== Voting =====
* Yes
* No
===== Implementation =====
https://github.com/php/php-src/pull/12459
===== Rejected Features =====
Keep this updated with features that were discussed on the mail lists.