
PHP RFC: Multibyte for trim function mb_trim, mb_ltrim and mb_rtrim

Introduction

PHP does not have a multibyte equivalent of the trim() function. Similar behavior is possible with preg_replace('/^\s+|\s+$/u', '', $string), but a built-in function would make PHP code more readable and would standardize an operation that is easy to get wrong. The feature would be useful to PHP developers at every level of experience and would fill a gap in the mbstring extension.
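As a minimal sketch of the gap described above, the following compares trim() with the preg_replace() workaround. The sample string, padded with U+3000 IDEOGRAPHIC SPACE, is an assumption chosen for illustration.

```php
<?php
// Sample input padded with U+3000 IDEOGRAPHIC SPACE on both sides.
$string = "\u{3000}こんにちは\u{3000}";

// trim() works on bytes and its default list has no multibyte characters,
// so the padding is left in place.
$byteTrimmed = trim($string);

// With the /u modifier the subject is treated as UTF-8, so \s can match
// Unicode white space such as U+3000.
$regexTrimmed = preg_replace('/^\s+|\s+$/u', '', $string);

var_dump($byteTrimmed === $string);       // bool(true)
var_dump($regexTrimmed === 'こんにちは'); // bool(true)
```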

One use case is trimming a Byte Order Mark (BOM), which mb_ltrim() should be able to handle:

mb_ltrim($string, "\u{FEFF}\u{FFFE}");

Proposal

Add the mb_trim(), mb_ltrim() and mb_rtrim() functions:

function mb_trim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}", ?string $encoding = null): string {}
function mb_ltrim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}", ?string $encoding = null): string {}
function mb_rtrim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}", ?string $encoding = null): string {}
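To make the intended behavior concrete, here is a rough userland approximation. It is a sketch only, not the proposed implementation: mb_trim_sketch() and TRIM_SKETCH_DEFAULT_CHARS are hypothetical names, the sketch assumes valid UTF-8 input, and it leaves out the $encoding parameter.

```php
<?php
// Default character list copied from the proposed signatures.
const TRIM_SKETCH_DEFAULT_CHARS = " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}"
    . "\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}"
    . "\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}";

// Hypothetical helper: UTF-8 only, no $encoding parameter.
function mb_trim_sketch(string $string, string $characters = TRIM_SKETCH_DEFAULT_CHARS): string
{
    if ($characters === '') {
        return $string; // nothing to trim
    }
    // preg_quote() escapes the characters that are special inside a character class.
    $class = '[' . preg_quote($characters, '/') . ']';
    // \A and \z anchor to the absolute start and end of the subject.
    $result = preg_replace('/\A' . $class . '+|' . $class . '+\z/u', '', $string);
    // preg_replace() returns null on failure (for example invalid UTF-8);
    // in that case the input is returned unchanged.
    return $result ?? $string;
}

var_dump(mb_trim_sketch("\u{3000}\u{00A0} PHP \u{2003}\n") === 'PHP'); // bool(true)
var_dump(mb_trim_sketch('ーーPHPーー', 'ー') === 'PHP');               // bool(true)
```

The sketch treats every character in $characters literally, so a ".." sequence has no special meaning here.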

Here's the list of characters trimmed:

Same as trim:

U+0020 SPACE (also in Separator category)
U+0009 \t
U+000A \n
U+000B \v
U+000D \r

Not removed by trim(), probably because it was not common enough, but reasonable for mb_trim():

U+000C \f

Removed by trim(), but not matched by regex \s:

U+0000 \0

The whole Separator (Z) category (the 19 code points listed below), matched by regex \s:

U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+1680 OGHAM SPACE MARK
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE

Other symbols (included in regex \s):

U+0085 NEXT LINE (NEL)
U+180E MONGOLIAN VOWEL SEPARATOR
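Two of the differences listed above can be checked with existing functions. This is an illustrative sketch using only trim() and preg_replace(); the sample strings are assumptions.

```php
<?php
// U+000C FORM FEED is not in trim()'s default list, so it survives trim()...
var_dump(trim("\fPHP\f") === "\fPHP\f"); // bool(true)
// ...but \s matches it.
var_dump(preg_replace('/^\s+|\s+$/u', '', "\fPHP\f") === 'PHP'); // bool(true)

// U+0000 NUL is in trim()'s default list...
var_dump(trim("\0PHP\0") === 'PHP'); // bool(true)
// ...but \s does not match it.
var_dump(preg_replace('/^\s+|\s+$/u', '', "\0PHP\0") === "\0PHP\0"); // bool(true)
```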

Unlike trim(), mb_trim() does not support the ".." range notation in $characters (for example, "\u{0000}..\u{FFFF}"), for the following reasons:

  • The Unicode code space is very large
    • Ranges are difficult to search
    • Ranges are difficult to store in memory
    • Ranges may not map consistently onto other character encodings
      • For example, the Hiragana range is [あ-ゞ] in UTF-8, [あ-ゝゞ] in EUC-JP, and [あ-ん] in Shift_JIS.
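For comparison, this is the byte-based range notation that trim() accepts, followed by one way a caller could spell out a range explicitly instead. code_point_list_sketch() is a hypothetical helper, not part of this proposal, and the Hiragana bounds U+3041 to U+3096 are an assumption chosen for illustration.

```php
<?php
// trim() expands "a..c" to the bytes a, b and c.
var_dump(trim('abcHELLOcba', 'a..c') === 'HELLO'); // bool(true)

// Hypothetical helper: expands an inclusive code point range into an
// explicit UTF-8 character list, one character per code point.
function code_point_list_sketch(int $first, int $last): string
{
    $characters = '';
    for ($codePoint = $first; $codePoint <= $last; $codePoint++) {
        $characters .= mb_chr($codePoint, 'UTF-8');
    }
    return $characters;
}

// 0x3096 - 0x3041 + 1 = 86 characters.
$hiragana = code_point_list_sketch(0x3041, 0x3096);
var_dump(mb_strlen($hiragana, 'UTF-8') === 86); // bool(true)
```

An explicit list like $hiragana could then be passed as $characters, at the cost of building the whole list in memory.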

Backward Incompatible Changes

This could break userland code that already defines a function named mb_trim, mb_ltrim or mb_rtrim.
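Userland code that ships its own helper under one of these names can avoid a redeclaration error by guarding the definition. A minimal sketch of such a guard follows; the fallback body is a simplified, UTF-8-only assumption and not the proposed implementation.

```php
<?php
// Define a fallback only when the runtime does not already provide mb_trim().
if (!function_exists('mb_trim')) {
    function mb_trim(string $string): string
    {
        // Simplified fallback: trims only what \s matches in UTF-8 mode.
        return preg_replace('/\A\s+|\s+\z/u', '', $string) ?? $string;
    }
}

var_dump(mb_trim("\u{3000}PHP\u{3000}") === 'PHP'); // bool(true)
```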

Proposed PHP Version(s)

next PHP 8.x

RFC Impact

To SAPIs

Adds the aforementioned functions to all PHP environments.

To Existing Extensions

Adds mb_trim(), mb_ltrim() and mb_rtrim() to the mbstring extension.

To Opcache

No effect.

New Constants

No new constants.

php.ini Defaults

No changed php.ini settings.

Open Issues

Future Scope


Proposed Voting Choices


Voting

Multibyte for trim function mb_trim, mb_ltrim and mb_rtrim
Real name                       Yes  No
ashnazg (ashnazg)               ✓
brzuchal (brzuchal)             ✓
bukka (bukka)                   ✓
derick (derick)                 ✓
girgias (girgias)               ✓
heiglandreas (heiglandreas)     ✓
kguest (kguest)                 ✓
mbeccati (mbeccati)             ✓
mcmic (mcmic)                   ✓
nicolasgrekas (nicolasgrekas)   ✓
nielsdos (nielsdos)             ✓
ocramius (ocramius)             ✓
petk (petk)                     ✓
sergey (sergey)                 ✓
theodorejb (theodorejb)         ✓
Final result:                   15   0
This poll has been closed.

Implementation

Rejected Features


rfc/mb_trim.txt · Last modified: 2024/04/15 08:40 by youkidearitai