rfc:saner-numeric-strings [2020/07/09 23:42] – Rewriting girgias | rfc:saner-numeric-strings [2020/11/25 12:46] (current) – Add implementation version number girgias |
---|
====== PHP RFC: Saner numeric strings ====== | ====== PHP RFC: Saner numeric strings ====== |
* Version: 1.3 | * Version: 1.4 |
* Date: 2020-06-28 | * Date: 2020-06-28 |
| * Original Author: Andrea Faulds <ajf@ajf.me> |
| * Original RFC: [[http://wiki.php.net/rfc/trailing_whitespace_numerics|PHP RFC: Permit trailing whitespace in numeric strings]] |
* Author: George Peter Banyard <girgias@php.net> | * Author: George Peter Banyard <girgias@php.net> |
* Status: Under Discussion | * Status: Implemented in PHP 8.0 |
* First Published at: http://wiki.php.net/rfc/saner-numeric-strings | * First Published at: http://wiki.php.net/rfc/saner-numeric-strings |
* Implementation: https://github.com/php/php-src/pull/5762 | * Implementation: https://github.com/php/php-src/pull/5762 |
A string can be categorised in three ways according to its numeric-ness, as [[https://github.com/php/php-langspec/blob/be010b4435e7b0801737bb66b5bbdd8f9fb51dde/spec/05-types.md#the-string-type|described by the language specification]]: | A string can be categorised in three ways according to its numeric-ness, as [[https://github.com/php/php-langspec/blob/be010b4435e7b0801737bb66b5bbdd8f9fb51dde/spec/05-types.md#the-string-type|described by the language specification]]: |
| |
* A //numeric string// is a string containing only a [[https://github.com/php/php-langspec/blob/be010b4435e7b0801737bb66b5bbdd8f9fb51dde/spec/05-types.md#grammar-str-number|number]], optionally preceded by white-space characters. For example, <php>"123"</php> or <php>" 1.23e2"</php>. | * A //numeric string// is a string containing only a [[https://github.com/php/php-langspec/blob/be010b4435e7b0801737bb66b5bbdd8f9fb51dde/spec/05-types.md#grammar-str-number|number]], optionally preceded by whitespace characters. For example, <php>"123"</php> or <php>" 1.23e2"</php>. |
* A //leading-numeric string// is a string that begins with a numeric string but is followed by non-number characters (including white-space characters). For example, <php>"123abc"</php> or <php>"123 "</php>. | * A //leading-numeric string// is a string that begins with a numeric string but is followed by non-number characters (including whitespace characters). For example, <php>"123abc"</php> or <php>"123 "</php>. |
* A //non-numeric string// is a string which is neither a numeric string nor a leading-numeric string. | * A //non-numeric string// is a string which is neither a numeric string nor a leading-numeric string. |
| |
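The three categories above can be sketched as a small classifier. This is only an illustration of the pre-8.0 definitions: the function name ''classify_numeric_string'' is hypothetical, and the regular expression is an assumed approximation of the language specification's number grammar, not the engine's actual parser.

```php
<?php
// Hypothetical helper, not part of PHP: classifies a string according to the
// three categories described above (pre-8.0 definitions).
// ASSUMPTION: the regex only approximates the specification's number grammar.
function classify_numeric_string(string $s): string
{
    $ws = '[ \t\n\r\x0B\f]*'; // leading whitespace is allowed
    $number = '[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?';

    // Whole string is optional whitespace followed by a number.
    if (preg_match("/^{$ws}{$number}\$/D", $s)) {
        return 'numeric';
    }
    // String starts with a number but has something after it.
    if (preg_match("/^{$ws}{$number}/", $s)) {
        return 'leading-numeric';
    }
    return 'non-numeric';
}

var_dump(classify_numeric_string("123"));     // string(7) "numeric"
var_dump(classify_numeric_string(" 1.23e2")); // string(7) "numeric"
var_dump(classify_numeric_string("123abc"));  // string(15) "leading-numeric"
var_dump(classify_numeric_string("123 "));    // string(15) "leading-numeric"
var_dump(classify_numeric_string("abc"));     // string(11) "non-numeric"
```

Note that trailing whitespace alone is enough to push a string into the leading-numeric category under these definitions.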
A fourth way PHP might deal with numeric strings is when using an //integer// string for an array index. | A fourth way PHP might deal with numeric strings is when using an //integer// string for an array index. |
An integer string is stricter than a numeric string as it has the following additional constraints: | An integer string is stricter than a numeric string as it has the following additional constraints: |
* It doesn't accept leading white-spaces | * It doesn't accept leading whitespace |
* It doesn't accept leading zeros (''0'') | * It doesn't accept leading zeros (''0'') |
| |
"03" => "Integer index with leading 0/octal", | "03" => "Integer index with leading 0/octal", |
"2str" => "leading numeric string", | "2str" => "leading numeric string", |
" 1" => "leading white-space", | " 1" => "leading whitespace", |
"5.5" => "Float", | "5.5" => "Float", |
]; | ]; |
string(22) "leading numeric string" | string(22) "leading numeric string" |
[" 1"]=> | [" 1"]=> |
string(19) "leading white-space" | string(19) "leading whitespace" |
["5.5"]=> | ["5.5"]=> |
string(5) "Float" | string(5) "Float" |
| |
This RFC does not affect how array indexes behave, and thus won't mention them again. | This RFC does not affect how array indexes behave, and thus won't mention them again. |
| |
| Another aspect which should be noted is that arithmetic/bitwise operators convert all operands to their numeric/integer equivalent and emit a notice/warning on malformed/invalid numeric strings. The exceptions are the <php>&</php>, <php>|</php>, and <php>^</php> bitwise operators when both operands are strings, and the <php>~</php> operator: in these cases the operation is performed on the ASCII values of the characters that make up the strings and the result is a string, as per the [[https://www.php.net/manual/en/language.operators.bitwise.php|documentation on bitwise operators]]. |
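A short illustration of the string-versus-integer distinction for bitwise operators. The variable names are only for this example; the values follow from the ASCII codes of the characters involved.

```php
<?php
// Both operands are strings: the operation is applied byte by byte.
// '1' (0x31) & '1' (0x31) = 0x31 ('1'); '2' (0x32) & '4' (0x34) = 0x30 ('0')
$bothStrings = "12" & "14";
var_dump($bothStrings); // string(2) "10"

// Only one operand is a string: it is converted to an integer first.
$mixed = 12 & "14";
var_dump($mixed); // int(12)

// ~ on a string flips the bits of each byte and yields a string.
$negated = ~"12";
var_dump(bin2hex($negated)); // string(4) "cecd"
```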
| |
One final behaviour of PHP which needs to be presented is how PHP performs weak comparisons, i.e. a comparison with one of the following binary operators: <php>==</php>, <php>!=</php>, <php><></php>, <php><</php>, <php>></php>, <php><=</php>, and <php>>=</php>, in the string-to-string case and in the string-to-int/float case. | One final behaviour of PHP which needs to be presented is how PHP performs weak comparisons, i.e. a comparison with one of the following binary operators: <php>==</php>, <php>!=</php>, <php><></php>, <php><</php>, <php>></php>, <php><=</php>, and <php>>=</php>, in the string-to-string case and in the string-to-int/float case. |
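As a minimal sketch of the string-to-string and string-to-int cases, the comparisons below were chosen because their results do not depend on the changes discussed in this RFC; they are examples for illustration, not an exhaustive list of the rules.

```php
<?php
// Two numeric strings are compared as numbers, not character by character.
var_dump("1e3" == "1000"); // bool(true)
var_dump("10" == "10.0");  // bool(true)

// If either side is not numeric, a plain string comparison is used.
var_dump("abc" == "ABC");  // bool(false)

// A numeric string compared with an int is compared as a number.
var_dump(123 == "123");    // bool(true)
```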
var_dump((float) "string"); // float(0) | var_dump((float) "string"); // float(0) |
</PHP> | </PHP> |
* Implicit conversions of strings to number types in weak typing mode (i.e. no ''strict_types'' declare statement or ''strict_types=0'') due to type declarations [note: internal function behave similarly in PHP 8], e.g.<PHP> | * Implicit conversions of strings to number types in weak typing mode (i.e. no ''strict_types'' declare statement or ''strict_types=0'') due to type declarations [note: internal functions behave similarly in PHP 8], e.g.<PHP> |
function foo(int $i) { var_dump($i); } | function foo(int $i) { var_dump($i); } |
foo("123"); // int(123) | foo("123"); // int(123) |
$str = 'The world'; | $str = 'The world'; |
var_dump($str['4']); // string(1) "w" | var_dump($str['4']); // string(1) "w" |
var_dump($str['03']); // string(1) " " | var_dump($str['04']); // string(1) "w" |
var_dump($str['2str']); // string(1) "e" with E_NOTICE "A non well formed numeric value encountered" | var_dump($str['4str']); // string(1) "w" with E_NOTICE "A non well formed numeric value encountered" |
var_dump($str[' 1']); // string(1) "h" | var_dump($str[' 4']); // string(1) "w" |
var_dump($str['5.5']); // string(1) "o" with E_WARNING "Illegal string offset '5.5'" | var_dump($str['4.5']); // string(1) "w" with E_WARNING "Illegal string offset '4.5'" |
var_dump($str['string']); // string(1) "T" with E_WARNING "Illegal string offset 'string'" | var_dump($str['string']); // string(1) "T" with E_WARNING "Illegal string offset 'string'" |
</PHP> | </PHP> |
var_dump(123 + "string"); // int(123) with E_WARNING "A non-numeric value encountered" | var_dump(123 + "string"); // int(123) with E_WARNING "A non-numeric value encountered" |
</PHP> | </PHP> |
* Increment/Decrement operators, i.e. <php>++</php> and <php>--</php>, e.g.<PHP> | * Increment/decrement operators, i.e. <php>++</php> and <php>--</php>, e.g.<PHP> |
$a = "5"; | $a = "5"; |
var_dump(++$a); // int(6) | var_dump(++$a); // int(6) |
var_dump("123" == "123abc"); // bool(false) | var_dump("123" == "123abc"); // bool(false) |
</PHP> | </PHP> |
| * Bitwise operations, e.g.<PHP> |
| var_dump(123 & "123"); // int(123) |
| var_dump(123 & " 123"); // int(123) |
| var_dump(123 & "123 "); // int(123) with E_NOTICE "A non well formed numeric value encountered" |
| var_dump(123 & "123abc"); // int(123) with E_NOTICE "A non well formed numeric value encountered" |
| var_dump(123 & "abc"); // int(0) with E_WARNING "A non-numeric value encountered" |
| </PHP> |
| |
===== The Problem ===== | ===== The Problem ===== |
| |
The current behaviour of numerical strings has various issues: | The current behaviour of numerical strings has various issues: |
* numeric strings with leading white-space are considered more numeric than numeric strings with trailing white-space | * Numeric strings with leading whitespace are considered more numeric than numeric strings with trailing whitespace. |
* strings which happen to start with a digit, e.g. hashes, may at times be interpreted as numbers, which can lead to bugs | * Strings which happen to start with a digit, e.g. hashes, may at times be interpreted as numbers, which can lead to bugs. |
* <php>\is_numeric()</php> is misleading, as it will reject values that a weak-mode parameter check will accept | * <php>\is_numeric()</php> is misleading, as it will reject values that a weak-mode parameter check will accept. |
* leading-numeric strings is a rather strange concept and an unintuitive/surprising behaviour. | * The leading-numeric string is a rather strange concept with unintuitive/surprising behaviour. |
| |
===== Proposal ===== | ===== Proposal ===== |
Unify the various numeric string modes into a single concept: Numeric characters only with both leading and trailing white-spaces allowed. | Unify the various numeric string modes into a single concept: Numeric characters only with both leading and trailing whitespace allowed. Any other type of string is non-numeric and will throw <php>TypeError</php>s when used in a numeric context. |
| |
| This means that all strings which currently emit the <php>E_NOTICE</php> “A non well formed numeric value encountered” will be reclassified into the <php>E_WARNING</php> “A non-numeric value encountered”, //except// if the leading-numeric string contained only trailing whitespace. The various cases which currently emit an <php>E_WARNING</php> will be promoted to <php>TypeError</php>s. |
| |
| One exception to this is type declarations, as they only accept proper numeric strings; thus some cases which currently emit an <php>E_NOTICE</php> will result in a <php>TypeError</php>. See below for an example. |
| |
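The reclassification for arithmetic operands can be summarised in a small sketch. The function ''proposed_arithmetic_outcome'' is hypothetical and only predicts which diagnostic the proposal assigns to a string operand; the regular expression is an assumed approximation of what counts as a number, and type declarations (which are stricter, as noted above) are not modelled.

```php
<?php
// Hypothetical helper, not part of PHP: predicts the diagnostic that a string
// operand of an arithmetic operation would produce under the proposal.
// ASSUMPTION: the regex only approximates the engine's number grammar.
function proposed_arithmetic_outcome(string $operand): string
{
    $ws = '[ \t\n\r\x0B\f]*';
    $number = '[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?';

    // Numeric: a number with optional leading AND trailing whitespace.
    if (preg_match("/^{$ws}{$number}{$ws}\$/D", $operand)) {
        return 'no diagnostic';
    }
    // Leading-numeric: starts with a number, followed by other characters.
    if (preg_match("/^{$ws}{$number}/", $operand)) {
        return 'E_WARNING';
    }
    // Everything else is non-numeric.
    return 'TypeError';
}

var_dump(proposed_arithmetic_outcome("123 "));   // string(13) "no diagnostic"
var_dump(proposed_arithmetic_outcome("123abc")); // string(9) "E_WARNING"
var_dump(proposed_arithmetic_outcome("string")); // string(9) "TypeError"
```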
This means, all strings which currently emit the <php>E_NOTICE</php> “A non well formed numeric value encountered” will emit the <php>E_WARNING</php> “A non-numeric value encountered” //except// if the leading-numeric string contained only trailing white-spaces. | |
| |
For string offsets accessed using numeric strings the following changes will be made: | For string offsets accessed using numeric strings the following changes will be made: |
* Leading numeric strings will emit the “Illegal string offset” instead of the “A non well formed numeric value encountered” notice, and continue to evaluate to their respective values. | * Leading numeric strings will emit the “Illegal string offset” warning instead of the “A non well formed numeric value encountered” notice, and continue to evaluate to their respective values. |
* Non-numeric strings which emitted the “Illegal string offset” warning will throw an “Illegal offset type” TypeError | * Non-numeric strings which emitted the “Illegal string offset” warning will throw an “Illegal offset type” TypeError. |
* A secondary implementation vote will decide if: numeric strings which correspond to well formed floats will emit the more usual “String offset cast occurred” warning instead of the “Illegal string offset” warning. | * There is a secondary implementation vote to decide the following: should numeric strings which correspond to well-formed floats remain a warning (by emitting the same “String offset cast occurred” warning that occurs when a float is used for a string offset), or should the current “Illegal string offset” warning simply be promoted to a <php>TypeError</php>? Our position is that this case should be a TypeError, as it simplifies the implementation and is consistent with the handling of other strings (see this [[https://github.com/php/php-src/pull/5762/commits/897c37727b1ee393f04f57a88fc48d69c3cf0d1d|commit]]). |
| |
| |
foo("123abc"); // TypeError | foo("123abc"); // TypeError |
</PHP> | </PHP> |
* <php>\is_numeric</php> will return <php>true</php> for numeric strings with trailing white-spaces<PHP> | * <php>\is_numeric</php> will return <php>true</php> for numeric strings with trailing whitespace<PHP> |
var_dump(is_numeric("123 ")); // bool(true) | var_dump(is_numeric("123 ")); // bool(true) |
</PHP> | </PHP> |
* String offsets<PHP> | * String offsets<PHP> |
$str = 'The world'; | $str = 'The world'; |
var_dump($str['2str']); // string(1) "e" with E_WARNING "Illegal string offset '2str'" | var_dump($str['4str']); // string(1) "w" with E_WARNING "Illegal string offset '4str'" |
var_dump($str['5.5']); // string(1) "o" with E_WARNING "String offset cast occurred" if the secondary vote is accepted | var_dump($str['4.5']); // string(1) "w" with E_WARNING "String offset cast occurred" if the secondary vote is accepted, otherwise TypeError |
var_dump($str['string']); // TypeError | var_dump($str['string']); // TypeError |
</PHP> | </PHP> |
* Arithmetic operations<PHP> | * Arithmetic operations<PHP> |
var_dump(123 + "123 "); // int(246) | var_dump(123 + "123 "); // int(246) |
var_dump(123 + "123abc"); // int(123) with E_WARNING "A non-numeric value encountered" | var_dump(123 + "123abc"); // int(246) with E_WARNING "A non-numeric value encountered" |
var_dump(123 + "string"); // int(123) with E_WARNING "A non-numeric value encountered" | var_dump(123 + "string"); // TypeError |
</PHP> | </PHP> |
* The <php>++</php> and <php>--</php> operators would convert numeric strings with trailing white-space to integers or floats, as appropriate, rather than applying the alphanumeric increment rules<PHP> | * The <php>++</php> and <php>--</php> operators would convert numeric strings with trailing whitespace to integers or floats, as appropriate, rather than applying the alphanumeric increment rules<PHP> |
$d = "5 "; | $d = "5 "; |
var_dump(++$d); // int(6) | var_dump(++$d); // int(6) |
* String-to-string comparisons<PHP> | * String-to-string comparisons<PHP> |
var_dump("123" == "123 "); // bool(true) | var_dump("123" == "123 "); // bool(true) |
| </PHP> |
| * Bitwise operations, e.g.<PHP> |
| var_dump(123 & "123 "); // int(123) |
| var_dump(123 & "123abc"); // int(123) with E_WARNING "A non-numeric value encountered" |
| var_dump(123 & "abc"); // TypeError |
</PHP> | </PHP> |
| |
| |
===== Backward Incompatible Changes ===== | ===== Backward Incompatible Changes ===== |
There are two backward incompatible changes: | There are three backward incompatible changes: |
* code relying on numerical strings with trailing white-spaces to be considered non-well-formed | * Code relying on numerical strings with trailing whitespace to be considered non-well-formed. |
* code with liberal use of leading-numerical strings might need to use explicit type casts | * Code with liberal use of leading-numeric strings might need to use explicit type casts. |
| * Code relying on the fact that <php>''</php> (an empty string) evaluates to <php>0</php> for arithmetic/bitwise operations. |
| |
The first reason is a precise requirement and therefore should be checked explicitly. A small poly-fill to check for the previous <php>is_numeric()</php> behaviour: | The first reason is a precise requirement and therefore should be checked explicitly. A small poly-fill to check for the previous <php>is_numeric()</php> behaviour: |
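One possible shape for such a check is sketched below. It is an illustration only and not necessarily the RFC's own polyfill; the function name ''is_numeric_without_trailing_whitespace'' is hypothetical, and the list of whitespace characters is an assumption about which characters are permitted.

```php
<?php
// Hypothetical helper, not part of PHP: mimics the previous is_numeric()
// behaviour by rejecting numeric strings that end in whitespace.
// ASSUMPTION: the permitted whitespace characters are " \t\n\r\v\f".
function is_numeric_without_trailing_whitespace($value): bool
{
    if (!is_string($value)) {
        // Ints and floats are unaffected by the whitespace change.
        return is_numeric($value);
    }
    // Numeric, and stripping trailing whitespace must not change the string.
    return is_numeric($value) && rtrim($value, " \t\n\r\v\f") === $value;
}

var_dump(is_numeric_without_trailing_whitespace("123"));  // bool(true)
var_dump(is_numeric_without_trailing_whitespace(" 123")); // bool(true)
var_dump(is_numeric_without_trailing_whitespace("123 ")); // bool(false)
var_dump(is_numeric_without_trailing_whitespace("abc"));  // bool(false)
```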
The break described by the second reason allows various bugs to be caught ahead of time, and the previous behaviour can be obtained by adding explicit casts, e.g.: | The break described by the second reason allows various bugs to be caught ahead of time, and the previous behaviour can be obtained by adding explicit casts, e.g.: |
<PHP> | <PHP> |
var_dump((int) "2px"); // int(2) | var_dump((int) "2px"); // int(2) |
var_dump((float) "2px"); // float(2) | var_dump((float) "2px"); // float(2) |
var_dump((int) "2.5px"); // int(2) | var_dump((int) "2.5px"); // int(2) |
var_dump((float) "2.5px"); // float(2.5) | var_dump((float) "2.5px"); // float(2.5) |
</PHP> | </PHP> |
| |
| The third reason already emitted an <php>E_WARNING</php>. We considered special-casing this to evaluate to <php>0</php>, but this would be inconsistent with how type declarations deal with an empty string, namely throwing a TypeError. Therefore a TypeError will also be thrown in this case. The error can be avoided by explicitly checking for an empty string and changing it to <php>0</php>. |
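The explicit empty-string check could look like the following minimal sketch. The helper name ''empty_string_to_zero'' and the variable names are hypothetical and chosen only for this example.

```php
<?php
// Hypothetical helper, not part of PHP: maps '' to 0 and returns any other
// value unchanged, so the arithmetic below never sees an empty string.
function empty_string_to_zero($value)
{
    return $value === '' ? 0 : $value;
}

$quantity = '';                               // e.g. an unfilled form field
$total = empty_string_to_zero($quantity) + 5; // evaluated as 0 + 5
var_dump($total);                             // int(5)
```

The check is done with <php>===</php> so that only a genuinely empty string is replaced; other values are passed through and keep their usual numeric-string handling.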
| |
===== Proposed PHP Version ===== | ===== Proposed PHP Version ===== |
===== Future Scope ===== | ===== Future Scope ===== |
* Nikita Popov's [[rfc:string_to_number_comparison|PHP RFC: Saner string to number comparisons]] | * Nikita Popov's [[rfc:string_to_number_comparison|PHP RFC: Saner string to number comparisons]] |
* Adding an E_NOTICE for numerical strings with leading/trailing white-spaces | * Adding an E_NOTICE for numerical strings with leading/trailing whitespace |
* Adding a flag to <php>\is_numeric</php> to accept or reject numerical strings with leading/trailing white-spaces | * Adding a flag to <php>\is_numeric</php> to accept or reject numeric strings with leading/trailing whitespace |
* Align string offset behaviour with array offsets | * Align string offset behaviour with array offsets |
* Promote remaining "Illegal string offset" warnings to Type Errors in PHP 9 | * Promote remaining warnings to Type Errors in PHP 9 |
* Warn on illegal offsets when used within <php>isset()</php> or <php>empty()</php> | * Warn on illegal offsets when used within <php>isset()</php> or <php>empty()</php> |
| |
===== Proposed Voting Choices ===== | ===== Vote ===== |
Per the Voting RFC, there would be a single Yes/No vote requiring a 2/3 majority. | Per the Voting RFC, there is a single Yes/No vote requiring a 2/3 majority for the main proposal. A secondary Yes/No vote requiring a 50%+1 majority will decide whether float strings used as string offsets should continue to produce a warning (with different wording) instead of consistently becoming a TypeError. |
| |
| Primary vote: |
| <doodle title="Accept Saner numeric string RFC proposal" auth="girgias" voteType="single" closed="true"> |
| * Yes |
| * No |
| </doodle> |
| |
| Secondary vote: |
| <doodle title="Should valid float strings for string offsets remain a warning" auth="girgias" voteType="single" closed="true"> |
| * Yes |
| * No |
| </doodle> |
| |
===== Patches and Tests ===== | ===== Patches and Tests ===== |
A pull request for a complete PHP interpreter patch, including a test file, can be found here: https://github.com/php/php-src/pull/5762 | A pull request for a complete PHP interpreter patch, including test files, can be found here: https://github.com/php/php-src/pull/5762 |
| |
A language specification patch still needs to be done. | A language specification patch still needs to be done. |
| |
===== Changelog ===== | ===== Changelog ===== |
| * 2020-07-13: Tweak inconsistency in regards to Arithmetic/Bitwise ops |
* 2020-07-10: Major rewrite | * 2020-07-10: Major rewrite |
* 2020-07-02: Explain difference between array and string offsets, and how the RFC will impact string offsets | * 2020-07-02: Explain difference between array and string offsets, and how the RFC will impact string offsets |
* 2020-07-01: Add explicit cast behaviour for leading numeric strings | * 2020-07-01: Add explicit cast behaviour for leading numeric strings |
* 2020-06-28: Initial version | * 2020-06-28: Initial version |