rfc:mb_str_pad

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
rfc:mb_str_pad [2023/05/19 18:49] – proposal work nielsdosrfc:mb_str_pad [2023/11/13 19:55] (current) – link to docs nielsdos
Line 1: Line 1:
 ====== PHP RFC: mb_str_pad ====== ====== PHP RFC: mb_str_pad ======
-  * Version: 0.1+  * Version: 0.1.2
   * Date: 2023-05-19   * Date: 2023-05-19
   * Author: Niels Dossche (nielsdos), dossche.niels@gmail.com   * Author: Niels Dossche (nielsdos), dossche.niels@gmail.com
-  * Status: Draft+  * Status: [[https://github.com/php/php-src/commit/68591632b22289962127cf777b4c3aeaea768bb6|Implemented]] 
 +  * Target Version: PHP 8.3 
 +  * Implementation: https://github.com/php/php-src/pull/11284
   * First Published at: http://wiki.php.net/rfc/mb_str_pad   * First Published at: http://wiki.php.net/rfc/mb_str_pad
  
 ===== Introduction ===== ===== Introduction =====
-Many string functions in PHP come in two different variants: one for byte stringsand one for multibyte strings. This RFC proposes the addition of mb_str_pad() to PHP as a multibyte equivalent of the existing str_pad() function. The current str_pad() function lacks multibyte character support, causing issues when working with languages that utilize multibyte encodings like UTF-8.+In PHP, various string functions are available in two variants: one for byte strings and another for multibyte strings. However, notable absence among the multibyte string functions is a mbstring equivalent of str_pad(). The str_pad() function lacks multibyte character support, causing issues when working with languages that utilize multibyte encodings like UTF-8. This RFC proposes the addition of such a function to PHP, which we will call mb_str_pad().
  
 ===== Proposal ===== ===== Proposal =====
-The proposal is to introduce a new mbstring function mb_str_pad(). The function follows the same signature as the str_pad() function, except that it has an additional argument for the string encoding. The encoding argument works analogously to the other mbstring functions. The argument applies on both the $input string and the $pad_string. If null, the default mbstring encoding is used. The $pad_type argument can take three possible values: STR_PAD_LEFT, STR_PAD_RIGHT, STR_PAD_BOTH. str_pad() uses the same constants.+This proposal aims to introduce a new mbstring function mb_str_pad(). Both the input string and the padding string may be multibyte strings. The function follows the same signature as the str_pad() function, except that it has an additional argument for the string encoding. The encoding argument works analogously to the encoding argument of other mbstring functions. The encoding applies to both $input and $pad_string. If the encoding is null, the default mbstring encoding is used. The $pad_type argument can take three possible values: STR_PAD_LEFT, STR_PAD_RIGHT, STR_PAD_BOTH. str_pad() uses the same constants.
  
 <code php> <code php>
 function mb_str_pad(string $string, int $length, string $pad_string = " ", int $pad_type = STR_PAD_RIGHT, ?string $encoding = null): string {} function mb_str_pad(string $string, int $length, string $pad_string = " ", int $pad_type = STR_PAD_RIGHT, ?string $encoding = null): string {}
 </code> </code>
 +
 +This proposal defines character as code point, which is how the other mbstring functions define characters as well.
  
 ==== Error conditions ==== ==== Error conditions ====
Line 23: Line 27:
  
 There is one additional error condition that str_pad() doesn't have: There is one additional error condition that str_pad() doesn't have:
-  * $encoding must be a valid and supported character encoding, if provided.  Otherwise it will output a value error just like the other mbstring functions do with an invalid encoding. This error condition is typical for mbstring functions.+  * $encoding must be a valid and supported character encoding, if provided.  Otherwise it will output a value error just like the other mbstring functions do with an invalid encoding. You'll find this error condition in many mbstring functions.
  
 ==== Examples and Comparison Against str_pad() ==== ==== Examples and Comparison Against str_pad() ====
  
-This section shows some examples and comparison between str_pad() and mb_str_pad() output for UTF-8 strings. +This section shows some examples and comparisons between str_pad() and mb_str_pad() output for multibyte strings. 
-str_pad() already has trouble with special characters in some languagesbecause they are encoded in multiple bytes. The first example demonstrates this by using the word "Français". The word "Français" is 8 characters long, but 9 bytes long because the letter ç is encoded as two bytes. Therefore, using str_pad() will produce the wrong resultwhereas mb_str_pad() produces the correct result.+str_pad() has trouble with special characters or letters used in some languages because those are encoded in multiple bytes. The first example demonstrates this by using the word "Français". The word "Français" is 8 //characters// long, but 9 //bytes// long because the letter ç is encoded as two bytes. Therefore, in the following example, str_pad() will produce the wrong result whereas mb_str_pad() will produce the correct result.
  
 <code php> <code php>
-var_dump(mb_str_pad('Français', 9, '_', STR_PAD_RIGHT)); // GOOD: string(10) "Français_" +// This will pad such that the string will become 10 bytes long. 
-var_dump(mb_str_pad('Français', 9, '_', STR_PAD_LEFT));  // GOOD: string(10) "Français_+var_dump(str_pad('Français', 10, '_', STR_PAD_RIGHT));   // BAD: string(10) "Français_" 
-var_dump(mb_str_pad('Français', 9, '_', STR_PAD_BOTH));  // GOOD: string(10) "Français_"+var_dump(str_pad('Français', 10, '_', STR_PAD_LEFT));    // BAD: string(10) "_Français
 +var_dump(str_pad('Français', 10, '_', STR_PAD_BOTH));    // BAD: string(10) "Français_"
  
-var_dump(str_pad('Français', 9, '_', STR_PAD_RIGHT));    // BAD: string(9) "Français+// This will pad such that the string will become 10 characters long, and in this case 11 bytes. 
-var_dump(str_pad('Français', 9, '_', STR_PAD_LEFT));     // BAD: string(9) "Français+var_dump(mb_str_pad('Français', 10, '_', STR_PAD_RIGHT));// GOOD: string(11) "Français__
-var_dump(str_pad('Français', 9, '_', STR_PAD_BOTH));     // BAD: string(9) "Français"+var_dump(mb_str_pad('Français', 10, '_', STR_PAD_LEFT)); // GOOD: string(11) "__Français
 +var_dump(mb_str_pad('Français', 10, '_', STR_PAD_BOTH)); // GOOD: string(11) "_Français_"
 </code> </code>
  
-We can also use emojis and symbols, which may be useful for some CLI applications. This is an example from the original issue report.+The problems with str_pad() become even more prominent for languages which use a non-latin alphabet (like Greek for example). 
 + 
 +<code php> 
 +var_dump(str_pad('Δεν μιλάω ελληνικά.', 21, '_', STR_PAD_RIGHT));    // BAD: string(35) "Δεν μιλάω ελληνικά." 
 +var_dump(str_pad('Δεν μιλάω ελληνικά.', 21, '_', STR_PAD_LEFT));     // BAD: string(35) "Δεν μιλάω ελληνικά." 
 +var_dump(str_pad('Δεν μιλάω ελληνικά.', 21, '_', STR_PAD_BOTH));     // BAD: string(35) "Δεν μιλάω ελληνικά." 
 + 
 +var_dump(mb_str_pad('Δεν μιλάω ελληνικά.', 21, '_', STR_PAD_RIGHT)); // GOOD: string(37) "Δεν μιλάω ελληνικά.__" 
 +var_dump(mb_str_pad('Δεν μιλάω ελληνικά.', 21, '_', STR_PAD_LEFT));  // GOOD: string(37) "__Δεν μιλάω ελληνικά." 
 +var_dump(mb_str_pad('Δεν μιλάω ελληνικά.', 21, '_', STR_PAD_BOTH));  // GOOD: string(37) "_Δεν μιλάω ελληνικά._" 
 +</code> 
 + 
 +We can also use emojis and symbols, which may have some uses in CLI applications. This is an example from the original feature request report.
  
 <code php> <code php>
Line 53: Line 71:
  
 ===== Backward Incompatible Changes ===== ===== Backward Incompatible Changes =====
-Since this is a new functionand no existing functions change, the risk of a BC break is minimal. The only break occurs when a userland PHP project defines their own mb_str_pad() function without first checking if PHP itself does not define one+Since this is a new function and no existing functions change, there is no behavioural backwards incompatibility. The only backwards compatible break occurs when a userland PHP project declares their own mb_str_pad() function without first checking if PHP doesn't already declare it. In that case, a fatal error "Cannot redeclare mb_str_pad()" will be thrown. 
-TODO+ 
 +I did a quick search using GitHub's [[https://github.com/search?q=mb_str_pad+lang%3Aphp&type=code|code search]] on "mb_str_pad" in PHP files, and found 326 matches (as of 2023-05-19). This also gives us an opportunity to look at how many correct vs incorrect implementations there are. 
 + 
 +Looking at the function / method //declarations// for "mb_str_pad": 
 +  * 47 in classes 
 +  * 12 free functions, checked if PHP doesn't already declare it 
 +  * 42 free functions, not checked (correctly) 
 + 
 +This means that for 42 implementations, the introduction of mb_str_pad() will cause a fatal error as described above. Fortunately, the users can simply remove their implementation, or guard it with a check to resolve the error
 + 
 +Let's also take a look at correctness: 
 +  * 36 likely correct implementations. I did not test or read them thoroughly, I just ran some inputs through them automatically. 
 +  * 65 implementations which break if the padding string is a multibyte string. Almost all these implementations are very similar to each other. 
 + 
 +As we can see it appears to be a function that's a little tricky to implement correctly. 
 +Note that these results don't include numbers for inline implementations or for implementations under a different name. Hence the reported numbers are quite low. It is very likely more implementations exist under different names, but that doesn't matter for a backwards compatibility check.
  
 ===== Proposed PHP Version(s) ===== ===== Proposed PHP Version(s) =====
Line 82: Line 115:
  
 ===== Future Scope ===== ===== Future Scope =====
-None.+In the future we could add a string padding function that works on grapheme clusters instead of code points: grapheme_str_pad(). This should be added to ext/intl. This will of course require another RFC.
  
 ===== Proposed Voting Choices ===== ===== Proposed Voting Choices =====
-One primary yes/no vote to decide if the function may be introduced.+One primary yes/no vote to decide if the function may be introduced, requires 2/3 majority. 
 + 
 +Voting starts on 2023-06-05 20:00 GMT+2, and ends on 2023-06-19 20:00 GMT+2. 
 + 
 +<doodle title="mb_str_pad" auth="nielsdos" voteType="single" closed="true" closeon="2023-06-19T20:00:00+02:00"> 
 +   * Yes 
 +   * No 
 +</doodle>
  
 ===== Patches and Tests ===== ===== Patches and Tests =====
Line 92: Line 132:
 ===== Implementation ===== ===== Implementation =====
 After the project is implemented, this section should contain  After the project is implemented, this section should contain 
-  - the version(s) it was merged into +  - the version(s) it was merged into: PHP 8.3 
-  - a link to the git commit(s) +  - a link to the git commit(s): https://github.com/php/php-src/commit/68591632b22289962127cf777b4c3aeaea768bb6 
-  - a link to the PHP manual entry for the feature +  - a link to the PHP manual entry for the feature: https://www.php.net/manual/en/function.mb-str-pad 
-  - a link to the language specification section (if any)+  - a link to the language specification section (if any): N/A
  
 ===== References ===== ===== References =====
Line 102: Line 142:
 ===== Rejected Features ===== ===== Rejected Features =====
 Keep this updated with features that were discussed on the mail lists. Keep this updated with features that were discussed on the mail lists.
 +
 +===== Changelog =====
 +
 +  * 0.1.2: Clarify that we use the mbstring definition of character (i.e. code point) instead of grapheme cluster.
 +  * 0.1.1: Initial version placed under discussion
rfc/mb_str_pad.1684522177.txt.gz · Last modified: 2023/05/19 18:49 by nielsdos