

PHP RFC: Change the default of $characters in mb_trim function

  • Version: 0.1
  • Date: 2024-04-03
  • Author: Yuya Hamada, youkidearitai@gmail.com
  • Status: Draft
  • First Published at: http://wiki.php.net/rfc/mb_trim

Introduction

We found problems with the default value of $characters, the second argument of mb_trim. This RFC proposes a solution to them.

First, mbstring_arginfo.h, which is generated from mbstring.stub.php, now contains UTF-8 string literals because of the default value introduced with the mb_trim function. This prevents it from compiling in some Visual C++ environments.

https://github.com/php/php-src/issues/13789

Second, with the default value of $characters, the mb_trim function cannot trim strings in other character encodings.

https://github.com/php/php-src/issues/13815
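
The encoding problem can be illustrated with a small sketch. It is not taken from the linked issue; it only assumes that the mbstring extension is loaded and uses Shift_JIS as one example of another encoding. The same character has different byte sequences in different encodings, so a character list written as a UTF-8 literal does not describe the same characters once the subject string is in another encoding.

```php
<?php
// U+3000 IDEOGRAPHIC SPACE written as a UTF-8 literal.
$utf8Space = "\u{3000}";

// The same character converted to Shift_JIS (needs the mbstring extension).
$sjisSpace = mb_convert_encoding($utf8Space, 'SJIS', 'UTF-8');

echo bin2hex($utf8Space), "\n"; // e38080
echo bin2hex($sjisSpace), "\n"; // 8140

// A Shift_JIS string padded with ideographic spaces never contains the
// UTF-8 byte sequence, so a UTF-8 character list cannot match it as-is.
$sjisText = $sjisSpace . mb_convert_encoding('日本語', 'SJIS', 'UTF-8') . $sjisSpace;
$containsUtf8Space = strpos($sjisText, $utf8Space) !== false;
var_dump($containsUtf8Space); // bool(false)
```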

Taking both issues together, this RFC proposes that null is a more appropriate default for $characters in mb_trim.

Proposal

Change the default of $characters in the mb_trim, mb_ltrim and mb_rtrim functions to null:

function mb_trim(string $string, ?string $characters = null, ?string $encoding = null): string
function mb_ltrim(string $string, ?string $characters = null, ?string $encoding = null): string
function mb_rtrim(string $string, ?string $characters = null, ?string $encoding = null): string
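
A short usage sketch under the proposed signatures follows. It assumes a PHP build in which mb_trim, mb_ltrim and mb_rtrim are available with the null default (the version proposed in this RFC), so the calls are guarded with function_exists. The sample strings are made up for illustration.

```php
<?php
if (function_exists('mb_trim')) {
    // Sample input: IDEOGRAPHIC SPACE + SPACE, text, SPACE + NO-BREAK SPACE.
    $s = "\u{3000} Hello \u{00A0}";

    // Omitting $characters and passing null explicitly mean the same thing:
    // the default set of characters is trimmed.
    var_dump(mb_trim($s) === mb_trim($s, null)); // bool(true)
    var_dump(mb_trim($s));                       // string(5) "Hello"

    // An explicit $characters replaces the default set.
    var_dump(mb_trim("xxHixx", "x"));            // string(2) "Hi"
    var_dump(mb_ltrim("xxHixx", "x"));           // string(4) "Hixx"
    var_dump(mb_rtrim("xxHixx", "x"));           // string(4) "xxHi"
}
```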

If $characters is null, the following characters are trimmed:

Same as trim:

U+0020 SPACE (also in Separator category)
U+0009 \t
U+000A \n
U+000B \v
U+000D \r

Not removed by trim() (probably because it was not common enough), but acceptable for mb_trim:

U+000C \f

Removed by trim(), but not included in regex \s:

U+0000 \0

The whole Separator (Z) category (the 19 code points listed here), covered by regex \s:

U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+1680 OGHAM SPACE MARK
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE

Other symbols (included in regex \s):

U+0085 NEXT LINE (NEL)
U+180E MONGOLIAN VOWEL SEPARATOR
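
The list above can be written down as a userland sketch. This is not the mbstring implementation: sketch_default_trim_chars and sketch_mb_trim are hypothetical names, and the sketch handles UTF-8 input only and has no $encoding parameter. It is meant only to show which code points the null default stands for.

```php
<?php
/** Hypothetical helper: the default code points listed in this RFC. */
function sketch_default_trim_chars(): array
{
    return array_merge(
        [0x0020, 0x0009, 0x000A, 0x000B, 0x000D], // same as trim()
        [0x000C],                                 // \f
        [0x0000],                                 // \0
        [0x00A0, 0x1680],                         // Separator category ...
        range(0x2000, 0x200A),
        [0x2028, 0x2029, 0x202F, 0x205F, 0x3000],
        [0x0085, 0x180E]                          // other symbols
    );
}

/** Hypothetical UTF-8-only emulation of the proposed null default. */
function sketch_mb_trim(string $string, ?string $characters = null): string
{
    if ($characters === null) {
        // Build a character class such as \x{0020}\x{0009}...
        $class = '';
        foreach (sketch_default_trim_chars() as $cp) {
            $class .= sprintf('\\x{%04X}', $cp);
        }
    } else {
        // Escape regex metacharacters so each character is taken literally.
        $class = preg_quote($characters, '/');
    }
    if ($class === '') {
        return $string;
    }
    // Remove runs of the listed characters at the start and at the end.
    $result = preg_replace('/^[' . $class . ']+|[' . $class . ']+$/Du', '', $string);

    // preg_replace returns null on error (for example invalid UTF-8).
    return $result ?? $string;
}

var_dump(sketch_mb_trim("\u{3000}\t abc \u{2028}\0")); // string(3) "abc"
var_dump(sketch_mb_trim("--abc--", "-"));              // string(3) "abc"
```

Building the class from code point numbers keeps the source file pure ASCII, which sidesteps the kind of compiler problem that UTF-8 literals caused in the generated header.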

Backward Incompatible Changes

This could break a function existing in userland with the same name.

Proposed PHP Version(s)

PHP 8.4

RFC Impact

To SAPIs

Will add the aforementioned functions to all PHP environments.

To Existing Extensions

Fixes mb_trim(), mb_ltrim() and mb_rtrim() in the mbstring extension.

To Opcache

No effect.

New Constants

No new constants.

php.ini Defaults

No php.ini settings are changed.

Open Issues

Future Scope

This section details areas where the feature might be improved in future, but that are not currently proposed in this RFC.

Proposed Voting Choices

Include these so readers know where you are heading and can discuss the proposed voting options.

Implementation

References

Rejected Features

Keep this updated with features that were discussed on the mail lists.

rfc/mb_trim_change_characters.1712127934.txt.gz · Last modified: 2024/04/03 07:05 by youkidearitai