PHP RFC: Multibyte for trim function mb_trim, mb_ltrim and mb_rtrim

Introduction

PHP does not have a multibyte equivalent of the trim() function. Similar behavior can be approximated with preg_replace('/^\s+|\s+$/u', '', $string), but a built-in function would make PHP code more readable and would standardize an operation that is easy to get wrong. This feature would be useful to PHP developers of all experience levels and would fill a gap in the mbstring extension.
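As a minimal sketch of the regex workaround described above (the helper name regex_trim is hypothetical and not part of PHP or of this RFC):

```php
<?php
// Hypothetical helper: trims Unicode whitespace with the regex workaround.
function regex_trim(string $string): string
{
    // The /u modifier makes PCRE treat the pattern and the subject as UTF-8.
    $result = preg_replace('/^\s+|\s+$/u', '', $string);

    // preg_replace() returns null on failure, for example on malformed UTF-8.
    if ($result === null) {
        throw new RuntimeException('preg_replace() failed: ' . preg_last_error_msg());
    }
    return $result;
}

// trim() works on bytes and leaves IDEOGRAPHIC SPACE (U+3000) in place;
// the regex version removes it.
var_dump(trim("\u{3000}日本語\u{3000}") === "\u{3000}日本語\u{3000}"); // bool(true)
var_dump(regex_trim("\u{3000}日本語\u{3000}") === "日本語");            // bool(true)
```

The null check is one of the details that is easy to miss when this pattern is written inline.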

Proposal

Add the mb_trim(), mb_ltrim() and mb_rtrim() functions:

function mb_trim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}", ?string $encoding = null): string {}
function mb_ltrim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}", ?string $encoding = null): string {}
function mb_rtrim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}", ?string $encoding = null): string {}
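
The following userland sketch approximates the proposed behavior for illustration only. It assumes UTF-8 input, does not model the $encoding parameter, and all sketch_* names are hypothetical; it is not the implementation proposed by this RFC.

```php
<?php
// Default character list, copied from the proposed signatures.
const SKETCH_DEFAULT_CHARACTERS = " \f\n\r\t\v\x00\u{00A0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200A}\u{2028}\u{2029}\u{202F}\u{205F}\u{3000}\u{0085}\u{180E}";

// Hypothetical helper shared by the three sketches below (UTF-8 only).
function sketch_strip(string $string, string $characters, bool $left, bool $right): string
{
    if ($characters === '') {
        return $string; // nothing to trim
    }
    // preg_quote() escapes regex metacharacters, so every character is literal.
    $class = '[' . preg_quote($characters, '/') . ']+';
    $parts = [];
    if ($left) {
        $parts[] = '\A' . $class;   // run of listed characters at the very start
    }
    if ($right) {
        $parts[] = $class . '\z';   // run of listed characters at the very end
    }
    $result = preg_replace('/' . implode('|', $parts) . '/u', '', $string);
    if ($result === null) {
        throw new RuntimeException('preg_replace() failed: ' . preg_last_error_msg());
    }
    return $result;
}

function sketch_mb_trim(string $string, string $characters = SKETCH_DEFAULT_CHARACTERS): string
{
    return sketch_strip($string, $characters, true, true);
}

function sketch_mb_ltrim(string $string, string $characters = SKETCH_DEFAULT_CHARACTERS): string
{
    return sketch_strip($string, $characters, true, false);
}

function sketch_mb_rtrim(string $string, string $characters = SKETCH_DEFAULT_CHARACTERS): string
{
    return sketch_strip($string, $characters, false, true);
}

var_dump(sketch_mb_trim("\u{3000}\u{00A0} PHP \u{2003}\n"));         // string(3) "PHP"
var_dump(sketch_mb_ltrim("\u{3000}PHP\u{3000}") === "PHP\u{3000}");  // bool(true)
var_dump(sketch_mb_rtrim("\u{3000}PHP\u{3000}") === "\u{3000}PHP");  // bool(true)
```

The sketch anchors with \A and \z rather than ^ and $, because $ can also match just before a trailing newline, which matters once $characters no longer contains "\n".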

Here's the list of characters trimmed:

Same as trim():

U+0020 SPACE (also in Separator category)
U+0009 \t
U+000A \n
U+000B \v
U+000D \r

Not removed by trim(), probably because it was not common enough, but reasonable to remove in mb_trim():

U+000C \f

Removed by trim(), but not matched by regex \s:

U+0000 \0

The whole Separator (Z) category (19 code points), matched by regex \s:

U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+1680 OGHAM SPACE MARK
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE

Other symbols:

U+0085 NEXT LINE (NEL)
U+180E MONGOLIAN VOWEL SEPARATOR
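
The grouping above can be checked with a short script. This is a sketch only: the helper name sketch_is_separator is hypothetical, and the results depend on the Unicode data shipped with the PCRE library that PHP is built against.

```php
<?php
// The code points listed under the Separator (Z) category, as UTF-8 literals.
$separators = [
    "\u{0020}", "\u{00A0}", "\u{1680}", "\u{2000}", "\u{2001}", "\u{2002}", "\u{2003}",
    "\u{2004}", "\u{2005}", "\u{2006}", "\u{2007}", "\u{2008}", "\u{2009}", "\u{200A}",
    "\u{2028}", "\u{2029}", "\u{202F}", "\u{205F}", "\u{3000}",
];

// Hypothetical helper: true if $char is exactly one code point in category Z.
function sketch_is_separator(string $char): bool
{
    return preg_match('/\A\p{Z}\z/u', $char) === 1;
}

// How many listed characters there are, and how many PCRE places in category Z.
var_dump(count($separators));
var_dump(count(array_filter($separators, 'sketch_is_separator')));

// U+0085 and U+180E are listed separately; with current Unicode data they are
// control and format characters, so they are not expected to be in category Z.
var_dump(sketch_is_separator("\u{0085}"), sketch_is_separator("\u{180E}"));

// ASCII differences between trim() and \s, as described in the lists above.
var_dump(trim("\fabc\f") === "\fabc\f");                                  // bool(true): trim() keeps \f
var_dump(preg_replace('/^\s+|\s+$/u', '', "\fabc\f") === "abc");          // bool(true): \s matches \f
var_dump(trim("\0abc\0") === "abc");                                      // bool(true): trim() removes \0
var_dump(preg_replace('/^\s+|\s+$/u', '', "\0abc\0") === "\0abc\0");      // bool(true): \s ignores \0
```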

Unlike trim(), these functions do not support the ".." range notation in $characters (for example, "\u{0000}..\u{FFFF}"), for the following reasons:

  • The Unicode code space is very large, so a range would be:
    • Difficult to search
    • Difficult to store in memory
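
To illustrate the difference, the sketch below contrasts the range notation in trim() with literal-only handling. trim() can afford ranges because it works on single bytes, so any range fits in a 256-entry lookup table, whereas "\u{0000}..\u{FFFF}" alone would span 65,536 code points. The helper sketch_literal_trim is hypothetical and assumes that, without range support, every character in $characters is simply taken literally.

```php
<?php
// trim() expands "a..c" to the bytes a, b and c.
var_dump(trim("abcHelloabc", "a..c")); // string(5) "Hello"

// Hypothetical literal-only trimming (UTF-8 only): "." is just another character.
function sketch_literal_trim(string $string, string $characters): string
{
    if ($characters === '') {
        return $string; // nothing to trim
    }
    // preg_quote() escapes "." and "-", so no range is ever formed.
    $class = '[' . preg_quote($characters, '/') . ']+';
    $result = preg_replace('/\A' . $class . '|' . $class . '\z/u', '', $string);
    if ($result === null) {
        throw new RuntimeException('preg_replace() failed: ' . preg_last_error_msg());
    }
    return $result;
}

// Only "a", "." and "c" are trimmed, so the "b" characters stop the trimming.
var_dump(sketch_literal_trim("abcHelloabc", "a..c")); // string(9) "bcHelloab"
var_dump(sketch_literal_trim("..Hello..", "a..c"));   // string(5) "Hello"
```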

Backward Incompatible Changes

Userland code that already defines a global function named mb_trim(), mb_ltrim() or mb_rtrim() would break, because a function with the same name cannot be redeclared.
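
Userland code can avoid such a clash by guarding its own definition with function_exists(). The fallback below is a hypothetical, simplified stand-in (UTF-8 only, fixed character set via \s), not the behavior specified by this RFC:

```php
<?php
// Guarded definition: skipped when the engine already provides mb_trim().
if (!function_exists('mb_trim')) {
    // Hypothetical fallback with simplified semantics.
    function mb_trim(string $string): string
    {
        // Fall back to the unmodified input if the regex fails.
        return preg_replace('/\A\s+|\s+\z/u', '', $string) ?? $string;
    }
}

var_dump(mb_trim("\u{3000}PHP\u{3000}")); // string(3) "PHP"
```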

Proposed PHP Version(s)

next PHP 8.x

RFC Impact

To SAPIs

The aforementioned functions will be available in all PHP environments where the mbstring extension is enabled.

To Existing Extensions

Adds mb_trim(), mb_ltrim() and mb_rtrim() to the mbstring extension.

To Opcache

No effect.

New Constants

No new constants.

php.ini Defaults

No changed php.ini settings.

Open Issues

Unaffected PHP Functionality


Future Scope


Proposed Voting Choices


Patches and Tests


Implementation

References


Rejected Features


rfc/mb_trim.1697612582.txt.gz · Last modified: 2023/10/18 07:03 by youkidearitai