PHP RFC: mb_str_pad

Introduction

In PHP, various string functions are available in two variants: one for byte strings and another for multibyte strings. However, the multibyte string functions notably lack an mbstring equivalent of str_pad(). The str_pad() function has no multibyte character support, which causes issues when working with languages that use multibyte encodings such as UTF-8. This RFC proposes adding such a function to PHP, called mb_str_pad().

Proposal

The proposal is to introduce a new mbstring function, mb_str_pad(). Both the input string and the padding string may be multibyte strings. The function has the same signature as str_pad(), plus one additional argument for the string encoding. The encoding argument works the same way as in the other mbstring functions: it applies to both $string and $pad_string, and if it is null, the default mbstring encoding is used. The $pad_type argument takes one of three values: STR_PAD_LEFT, STR_PAD_RIGHT, or STR_PAD_BOTH. These are the same constants that str_pad() uses.

function mb_str_pad(string $string, int $length, string $pad_string = " ", int $pad_type = STR_PAD_RIGHT, ?string $encoding = null): string {}
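To make the intended semantics concrete, here is a minimal userland sketch built from existing mbstring functions. It is an illustration only, not the RFC's implementation: the name mb_str_pad_sketch() is hypothetical, PHP 8.0 or later is assumed, and the repeat-and-truncate approach assumes a stateless encoding such as UTF-8. The split of padding for STR_PAD_BOTH follows str_pad(), which puts the smaller half on the left.

```php
<?php
// Hypothetical userland sketch of the proposed behaviour (assumes PHP >= 8.0).
function mb_str_pad_sketch(
    string $string,
    int $length,
    string $pad_string = " ",
    int $pad_type = STR_PAD_RIGHT,
    ?string $encoding = null
): string {
    $encoding ??= mb_internal_encoding();

    if ($pad_string === '') {
        throw new ValueError('$pad_string must be a non-empty string');
    }
    if (!in_array($pad_type, [STR_PAD_LEFT, STR_PAD_RIGHT, STR_PAD_BOTH], true)) {
        throw new ValueError('$pad_type must be STR_PAD_LEFT, STR_PAD_RIGHT, or STR_PAD_BOTH');
    }

    // Count characters, not bytes. mb_strlen() rejects unknown encodings itself.
    $missing = $length - mb_strlen($string, $encoding);
    if ($missing <= 0) {
        return $string;
    }

    // Produce exactly $count characters of padding by repeating and truncating.
    $fill = static function (int $count) use ($pad_string, $encoding): string {
        $repeats = intdiv($count, max(1, mb_strlen($pad_string, $encoding))) + 1;
        return mb_substr(str_repeat($pad_string, $repeats), 0, $count, $encoding);
    };

    $left = intdiv($missing, 2); // only used for STR_PAD_BOTH

    return match ($pad_type) {
        STR_PAD_LEFT  => $fill($missing) . $string,
        STR_PAD_RIGHT => $string . $fill($missing),
        STR_PAD_BOTH  => $fill($left) . $string . $fill($missing - $left),
    };
}

var_dump(mb_str_pad_sketch('Français', 10, '_', STR_PAD_BOTH, 'UTF-8'));
```

The sketch measures everything with mb_strlen() and cuts the padding with mb_substr(), so a multi-character $pad_string is truncated on a character boundary rather than in the middle of a byte sequence.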

Error conditions

mb_str_pad() has the same error conditions as str_pad():

  • $pad_string must not be an empty string. Otherwise a ValueError is thrown.
  • $pad_type must be one of STR_PAD_LEFT, STR_PAD_RIGHT, STR_PAD_BOTH. Otherwise a ValueError is thrown.

There is one additional error condition that str_pad() doesn't have:

  • $encoding must be a valid and supported character encoding, if provided. Otherwise a ValueError is thrown, just as the other mbstring functions do when given an invalid encoding.
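The error conditions listed above can be observed today with functions that already exist. The following sketch assumes PHP 8.0 or later and uses str_pad() and mb_strlen() as stand-ins, since mb_str_pad() is only being proposed here; thrown_by() is a hypothetical helper introduced for this illustration.

```php
<?php
// Hypothetical helper: returns the class name of whatever $call throws,
// or null if it completes without throwing.
function thrown_by(callable $call): ?string
{
    try {
        $call();
        return null;
    } catch (Throwable $e) {
        return get_class($e);
    }
}

// Existing behaviour that mb_str_pad() is described as mirroring:
$emptyPad    = thrown_by(fn() => str_pad('abc', 5, ''));                // empty padding string
$badPadType  = thrown_by(fn() => str_pad('abc', 5, '_', 99));           // 99 is not a STR_PAD_* constant
$badEncoding = thrown_by(fn() => mb_strlen('abc', 'no-such-encoding')); // unknown encoding name

var_dump($emptyPad, $badPadType, $badEncoding);
```

On PHP 8.0 or later, all three calls are expected to report ValueError.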

Examples and Comparison Against str_pad()

This section compares the output of str_pad() and mb_str_pad() for UTF-8 strings. str_pad() counts bytes rather than characters, so it miscounts any string that contains characters encoded as more than one byte. The first example demonstrates this with the word “Français”, which is 8 characters long but 9 bytes long, because the letter ç is encoded as two bytes. In the following example, str_pad() therefore produces the wrong result, whereas mb_str_pad() produces the correct one. Note that the length reported by var_dump() is a byte count.

var_dump(str_pad('Français', 10, '_', STR_PAD_RIGHT));   // BAD: string(10) "Français_"
var_dump(str_pad('Français', 10, '_', STR_PAD_LEFT));    // BAD: string(10) "_Français"
var_dump(str_pad('Français', 10, '_', STR_PAD_BOTH));    // BAD: string(10) "Français_"
 
var_dump(mb_str_pad('Français', 10, '_', STR_PAD_RIGHT));// GOOD: string(11) "Français__"
var_dump(mb_str_pad('Français', 10, '_', STR_PAD_LEFT)); // GOOD: string(11) "__Français"
var_dump(mb_str_pad('Français', 10, '_', STR_PAD_BOTH)); // GOOD: string(11) "_Français_"
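The discrepancy in the example above comes down to which length each function measures. The short sketch below makes the arithmetic explicit. It assumes the source file is saved as UTF-8 with a precomposed ç; the variable names are illustrative.

```php
<?php
$word   = 'Français';
$target = 10;

$bytes = strlen($word);             // byte count: what str_pad() compares against $target
$chars = mb_strlen($word, 'UTF-8'); // character count: what mb_str_pad() compares against $target

// str_pad() adds ($target - $bytes) padding bytes,
// mb_str_pad() adds ($target - $chars) padding characters.
$bytePadding = $target - $bytes;
$charPadding = $target - $chars;

printf(
    "bytes=%d chars=%d, str_pad adds %d, mb_str_pad adds %d\n",
    $bytes,
    $chars,
    $bytePadding,
    $charPadding
);
// prints "bytes=9 chars=8, str_pad adds 1, mb_str_pad adds 2"
```

This is why str_pad() appends only one underscore to “Français” when asked for a width of 10, leaving the visible string one character short.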

The problems with str_pad() become even more prominent for languages that use a non-Latin alphabet.

var_dump(str_pad('Δεν μιλάω ελληνικά.', 21, '_', STR_PAD_RIGHT));    // BAD: string(35) "Δεν μιλάω ελληνικά."
var_dump(str_pad('Δεν μιλάω ελληνικά.', 21, '_', STR_PAD_LEFT));     // BAD: string(35) "Δεν μιλάω ελληνικά."
var_dump(str_pad('Δεν μιλάω ελληνικά.', 21, '_', STR_PAD_BOTH));     // BAD: string(35) "Δεν μιλάω ελληνικά."
 
var_dump(mb_str_pad('Δεν μιλάω ελληνικά.', 21, '_', STR_PAD_RIGHT)); // GOOD: string(37) "Δεν μιλάω ελληνικά.__"
var_dump(mb_str_pad('Δεν μιλάω ελληνικά.', 21, '_', STR_PAD_LEFT));  // GOOD: string(37) "__Δεν μιλάω ελληνικά."
var_dump(mb_str_pad('Δεν μιλάω ελληνικά.', 21, '_', STR_PAD_BOTH));  // GOOD: string(37) "_Δεν μιλάω ελληνικά._"

We can also use emojis and symbols, which may be useful for some CLI applications. This is an example from the original issue report.

var_dump(str_pad('▶▶', 6, '❤❓❇', STR_PAD_RIGHT));    // BAD: string(6) "▶▶"
var_dump(str_pad('▶▶', 6, '❤❓❇', STR_PAD_LEFT));     // BAD: string(6) "▶▶"
var_dump(str_pad('▶▶', 6, '❤❓❇', STR_PAD_BOTH));     // BAD: string(6) "▶▶"
 
var_dump(mb_str_pad('▶▶', 6, '❤❓❇', STR_PAD_RIGHT)); // GOOD: string(18) "▶▶❤❓❇❤"
var_dump(mb_str_pad('▶▶', 6, '❤❓❇', STR_PAD_LEFT));  // GOOD: string(18) "❤❓❇❤▶▶"
var_dump(mb_str_pad('▶▶', 6, '❤❓❇', STR_PAD_BOTH));  // GOOD: string(18) "❤❓▶▶❤❓"

Backward Incompatible Changes

Since this is a new function and no existing functions change, the risk of a BC break is minimal. The only break occurs when a userland PHP project declares its own mb_str_pad() function without first checking whether PHP already defines one. TODO
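Userland code that ships its own mb_str_pad() can avoid the redeclaration error by guarding the definition with function_exists(). The following is a sketch of that pattern, assuming PHP 8.0 or later; the fallback body is a simplified illustration that omits argument validation, not a complete polyfill.

```php
<?php
// Declare the userland fallback only when PHP does not already provide mb_str_pad().
if (!function_exists('mb_str_pad')) {
    function mb_str_pad(
        string $string,
        int $length,
        string $pad_string = ' ',
        int $pad_type = STR_PAD_RIGHT,
        ?string $encoding = null
    ): string {
        $encoding ??= mb_internal_encoding();

        $missing = $length - mb_strlen($string, $encoding);
        if ($missing <= 0) {
            return $string;
        }

        // Number of padding characters that go on the left side.
        $left = match ($pad_type) {
            STR_PAD_LEFT => $missing,
            STR_PAD_BOTH => intdiv($missing, 2),
            default      => 0,
        };

        // Simplified: repeat generously, then cut to exactly $n characters.
        $pad = fn(int $n): string => mb_substr(str_repeat($pad_string, $n), 0, $n, $encoding);

        return $pad($left) . $string . $pad($missing - $left);
    }
}

var_dump(mb_str_pad('Français', 10, '_', STR_PAD_LEFT, 'UTF-8'));
```

With the guard in place, the same file loads on PHP versions with and without a built-in mb_str_pad(), and the built-in takes precedence where it exists.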

Proposed PHP Version(s)

Next PHP 8.x (at the time of writing this is PHP 8.3).

RFC Impact

To SAPIs

None.

To Existing Extensions

mbstring: A new function mb_str_pad() will be added to mbstring. The implementation of this function will leverage the existing internal functions of mbstring. No modifications will be made to any existing functions, and no new internal functions will be added. By reusing existing internal functions, the maintenance burden of mb_str_pad() stays quite low.

To Opcache

None.

New Constants

None.

php.ini Defaults

None.

Open Issues

Make sure there are no open issues when the vote starts!

Unaffected PHP Functionality

Everything outside of mbstring.

Future Scope

None.

Proposed Voting Choices

One primary yes/no vote to decide if the function may be introduced.

Patches and Tests

Implementation

After the project is implemented, this section should contain

  1. the version(s) it was merged into
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature
  4. a link to the language specification section (if any)

References

Original issue report suggesting this feature: https://github.com/php/php-src/issues/10203

Rejected Features

Keep this updated with features that were discussed on the mail lists.

rfc/mb_str_pad.1684522846.txt.gz · Last modified: 2023/05/19 19:00 by nielsdos