This is an old revision of the document!


PHP RFC: Saner numeric strings

Technical Background

The PHP language has a concept of numeric strings, strings which can be interpreted as numbers.

A string can be categorised in three ways according to its numeric-ness, as described by the language specification:

  • A numeric string is a string containing only a number, optionally preceded by white-space characters. For example, "123" or " 1.23e2".
  • A leading-numeric string is a string that begins with a numeric string but is followed by non-number characters (including white-space characters). For example, "123abc" or "123 ".
  • A non-numeric string is a string which is neither a numeric string nor a leading-numeric string.
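The three categories above can be approximated in userland. The following is only a sketch: the function name classify_numeric_string() is hypothetical, and the regular expression is an assumption about what counts as a number (optional sign, integer or float literal, optional exponent), not the Engine's actual implementation.

```php
<?php
// Hypothetical helper, not part of PHP: classifies a string using the three
// categories listed above (the rules in force before this RFC).
function classify_numeric_string(string $s): string
{
    // Optional leading white-space, optional sign, then an integer or float
    // literal with an optional exponent. Anchored at the start only.
    $pattern = '/^[ \t\n\r\x0B\x0C]*[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?/';

    if (!preg_match($pattern, $s, $m)) {
        return 'non-numeric';
    }

    // If the match consumed the whole string, nothing follows the number.
    return strlen($m[0]) === strlen($s) ? 'numeric' : 'leading-numeric';
}

foreach (['123', ' 1.23e2', '123abc', '123 ', 'abc'] as $s) {
    printf("%-10s => %s\n", var_export($s, true), classify_numeric_string($s));
}
```

Note that "123 " is reported as leading-numeric: under these rules trailing white-space counts as non-number characters.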

A fourth way PHP might deal with numeric strings is when using an integer string for an array index. An integer string is stricter than a numeric string as it has the following additional constraints:

  • It doesn't accept leading white-spaces
  • It doesn't accept leading zeros (0)

How PHP deals with array indexes is shown in the following code snippet:

$a = [
    "4" => "Integer index",
    "03" => "Integer index with leading 0/octal",
    "2str" => "leading numeric string",
    " 1" => "leading white-space",
    "5.5" => "Float",
];
var_dump($a);

Which results in the following output:

array(5) {
  [4]=>
  string(13) "Integer index"
  ["03"]=>
  string(34) "Integer index with leading 0/octal"
  ["2str"]=>
  string(22) "leading numeric string"
  [" 1"]=>
  string(19) "leading white-space"
  ["5.5"]=>
  string(5) "Float"
}
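As a rough userland check for whether a string counts as an integer string, one can round-trip it through an int cast. This is a sketch only; the helper name is_integer_string() is hypothetical, and the loop compares its answer with the key type PHP actually stores.

```php
<?php
// Hypothetical helper: true when PHP would store $key as an int array key.
function is_integer_string(string $key): bool
{
    // Round-tripping through int only reproduces the input for canonical
    // decimal integers (no white-space, no leading zeros, no "+", in range).
    return (string) (int) $key === $key;
}

foreach (['4', '03', '2str', ' 1', '5.5', '-7', '-0'] as $key) {
    // array_key_first() (PHP 7.3+) returns the key as PHP stored it.
    $stored = array_key_first([$key => true]);
    printf(
        "%-7s helper=%s stored key type=%s\n",
        var_export($key, true),
        var_export(is_integer_string($key), true),
        gettype($stored)
    );
}
```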

This RFC does not affect how array indexes behave, and thus won't mention them again.

Another aspect which should be noted is that arithmetic and bitwise operators convert all operands to their numeric/integer equivalent and emit a notice or warning on malformed or invalid numeric strings. The exceptions are the &, |, and ^ bitwise operators when both operands are strings, and the ~ operator when its operand is a string: in these cases the operation is performed on the ASCII values of the characters that make up the strings and the result is a string, as per the documentation on bitwise operators.
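The difference between the integer and the string form of the bitwise operators can be illustrated with a few short cases. The operands are arbitrary values chosen for illustration:

```php
<?php
// int & numeric string: the string is converted to int first.
var_dump(12 & "9");       // int(8), i.e. 0b1100 & 0b1001

// string & string: operates on the bytes of both operands.
var_dump("12" & "9");     // string(1) "1", i.e. chr(0x31 & 0x39)
var_dump("a" | "b");      // string(1) "c", i.e. chr(0x61 | 0x62)
var_dump("A" ^ " ");      // string(1) "a", i.e. chr(0x41 ^ 0x20)

// ~ on a string flips every bit of every byte.
var_dump(bin2hex(~"a"));  // string(2) "9e"
```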

One final behaviour of PHP which needs to be presented is how PHP performs weak comparisons, i.e. a comparison with one of the following binary operators: ==, !=, <>, <, >, <=, and >=, in the string-to-string case and in the string-to-int/float case.

String-to-string comparisons are performed numerically if and only if both strings are numeric strings.

String-to-int/float comparisons are always performed numerically; the string is therefore type-juggled silently regardless of its numeric-ness.

This RFC does not propose to modify this behaviour; see PHP RFC: Saner string to number comparisons instead.
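A short illustration of the string-to-string rule described above, using arbitrary sample values. Only cases without leading or trailing white-space are shown; string-to-int/float cases are left out on purpose.

```php
<?php
// Both operands are numeric strings, so they are compared as numbers.
var_dump("1e3" == "1000");       // bool(true)
var_dump("0e1234" == "0e5678");  // bool(true), both are numerically 0

// At least one operand is not a numeric string: plain string comparison.
var_dump("abc" == "ABC");        // bool(false)
var_dump("1e3" == "1e3abc");     // bool(false)

// The strict operator never applies numeric semantics to strings.
var_dump("1e3" === "1000");      // bool(false)
```

The second case is the reason why hash-like strings should not be compared with the weak operators.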

The concept of numeric strings is used in a few places, and the distinction between a numeric string and a leading-numeric string is significant as certain operations distinguish between these:

  • Explicit conversions of strings to number types, such as (int) and (float) type casts or settype(), convert numeric and leading-numeric strings and produce 0 for non-numeric strings silently, e.g.:
    var_dump((float) "123");    // float(123)
    var_dump((float) "   123"); // float(123)
    var_dump((float) "123   "); // float(123)
    var_dump((float) "123abc"); // float(123)
    var_dump((float) "string"); // float(0)
  • Implicit conversions of strings to number types in weak typing mode (i.e. no strict_types declare statement or strict_types=0) due to type declarations [note: internal functions behave similarly in PHP 8], e.g.
    function foo(int $i) { var_dump($i); }
    foo("123");    // int(123)
    foo("   123"); // int(123)
    foo("123   "); // int(123) with E_NOTICE "A non well formed numeric value encountered"
    foo("123abc"); // int(123) with E_NOTICE "A non well formed numeric value encountered"
    foo("string"); // TypeError
  • \is_numeric() returns true only for numeric strings, e.g.
    var_dump(is_numeric("123"));     // bool(true)
    var_dump(is_numeric("   123"));  // bool(true)
    var_dump(is_numeric("123   "));  // bool(false)
    var_dump(is_numeric("123abc"));  // bool(false)
  • String offsets, e.g.
    $str = 'The world';
    var_dump($str['4']);      // string(1) "w"
    var_dump($str['04']);     // string(1) "w"
    var_dump($str['4str']);   // string(1) "w" with E_NOTICE "A non well formed numeric value encountered"
    var_dump($str[' 4']);     // string(1) "w"
    var_dump($str['4.5']);    // string(1) "w" with E_WARNING "Illegal string offset '4.5'"
    var_dump($str['string']); // string(1) "T" with E_WARNING "Illegal string offset 'string'"
  • Arithmetic operations, i.e. -, +, *, /, %, or **: strings are converted to int/float and an E_NOTICE/E_WARNING is emitted as needed, e.g.
    var_dump(123 + "123");    // int(246)
    var_dump(123 + "   123"); // int(246)
    var_dump(123 + "123   "); // int(246) with E_NOTICE "A non well formed numeric value encountered"
    var_dump(123 + "123abc"); // int(246) with E_NOTICE "A non well formed numeric value encountered"
    var_dump(123 + "string"); // int(123) with E_WARNING "A non-numeric value encountered"
  • Increment/Decrement operators, i.e. ++ and --, e.g.
    $a = "5";
    var_dump(++$a); // int(6)
    $b = " 5";
    var_dump(++$b); // int(6)
    $c = "5z";
    var_dump(++$c); // string(2) "6a"
    $d = "5 ";
    var_dump(++$d); // string(2) "5 "
  • String-to-string comparisons, e.g.
    var_dump("123" == "123.0");  // bool(true)
    var_dump("123" == "   123"); // bool(true)
    var_dump("123" == "123   "); // bool(false)
    var_dump("123" == "123abc"); // bool(false)
  • Bitwise operations, e.g.
    var_dump(123 & "123");    // int(123)
    var_dump(123 & "123  ");  // int(123) with E_NOTICE "A non well formed numeric value encountered"
    var_dump(123 & "123abc"); // int(123) with E_NOTICE "A non well formed numeric value encountered"
    var_dump(123 & "abc");    // int(0) with E_WARNING "A non-numeric value encountered"

The Problem

The current behaviour of numerical strings has various issues:

  • Numeric strings with leading whitespace are considered more numeric than numeric strings with trailing whitespace.
  • Strings which happen to start with a digit, e.g. hashes, may at times be interpreted as numbers, which can lead to bugs.
  • \is_numeric() is misleading, as it will reject values that a weak-mode parameter check will accept.
  • The concept of leading-numeric strings is rather strange and has unintuitive/surprising behaviour.
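The mismatch between \is_numeric() and weak-mode parameter checks can be observed with a small probe. This is a sketch: accepts_as_int_param() is a hypothetical helper, the file is assumed to run in weak typing mode, and no expected output is given for the middle inputs because the result depends on the PHP version in use.

```php
<?php
// Hypothetical helper: reports whether a weak-mode int parameter accepts $s.
function accepts_as_int_param(string $s): bool
{
    $probe = function (int $i): int {
        return $i;
    };

    try {
        // @ hides the notice/warning some versions emit for accepted strings.
        @$probe($s);
        return true;
    } catch (TypeError $e) {
        return false;
    }
}

foreach (['123', '   123', '123   ', '123abc', 'string'] as $s) {
    printf(
        "%-10s is_numeric=%s int param=%s\n",
        var_export($s, true),
        var_export(is_numeric($s), true),
        var_export(accepts_as_int_param($s), true)
    );
}
```

Any row where the two columns differ is an instance of the problem described in the list above.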

Proposal

Unify the various numeric string modes into a single concept: Numeric characters only with both leading and trailing white-spaces allowed. Any other type of string is non-numeric and will throw TypeErrors when used in a numeric context.

This means that all strings which currently emit the E_NOTICE “A non well formed numeric value encountered” will be reclassified to emit the E_WARNING “A non-numeric value encountered”, unless the leading-numeric string only contained trailing white-space. The various cases which currently emit an E_WARNING will be promoted to TypeErrors.

One exception to this is type declarations: as they only accept proper numeric strings, some cases which currently emit an E_NOTICE will result in a TypeError. See below for an example.
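The unified definition of a numeric string proposed above can be approximated in userland as follows. This is a sketch under stated assumptions: is_numeric_under_proposal() is a hypothetical name, and the regular expression only approximates the Engine's number parsing.

```php
<?php
// Hypothetical helper approximating the proposed definition: white-space is
// allowed on both sides of the number, anything else makes it non-numeric.
function is_numeric_under_proposal(string $s): bool
{
    $ws = '[ \t\n\r\x0B\x0C]*';
    $number = '[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?';

    // \z anchors at the very end of the string, even before a final newline.
    return preg_match('/^' . $ws . $number . $ws . '\z/', $s) === 1;
}

var_dump(is_numeric_under_proposal('123   '));  // bool(true)
var_dump(is_numeric_under_proposal('   123'));  // bool(true)
var_dump(is_numeric_under_proposal('123abc'));  // bool(false)
var_dump(is_numeric_under_proposal(''));        // bool(false)
```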

For string offsets accessed using numeric strings the following changes will be made:

  • Leading-numeric strings will emit the “Illegal string offset” warning instead of the “A non well formed numeric value encountered” notice, and continue to evaluate to their respective values.
  • Non-numeric strings which emitted the “Illegal string offset” warning will throw an “Illegal offset type” TypeError.
  • A secondary implementation vote will decide whether numeric strings which correspond to well-formed floats will continue to emit a warning, namely the more usual “String offset cast occurred” warning, instead of the current “Illegal string offset” warning, which is being promoted to a TypeError. The reason for this vote is that adjusting this behaviour requires some additional boilerplate code in the Engine, as can mostly be seen in this commit.

Under the proposal, the following cases will behave as shown:

  • Type declarations
    function foo(int $i) { var_dump($i); }
    foo("123   "); // int(123)
    foo("123abc"); // TypeError
  • \is_numeric will return true for numeric strings with trailing white-spaces
    var_dump(is_numeric("123   "));  // bool(true)
  • String offsets
    $str = 'The world';
    var_dump($str['4str']);   // string(1) "w" with E_WARNING "Illegal string offset '4str'"
    var_dump($str['4.5']);    // string(1) "w" with E_WARNING "String offset cast occurred" if the secondary vote is accepted otherwise TypeError
    var_dump($str['string']); // TypeError
  • Arithmetic operations
    var_dump(123 + "123   "); // int(246)
    var_dump(123 + "123abc"); // int(246) with E_WARNING "A non-numeric value encountered"
    var_dump(123 + "string"); // TypeError
  • The ++ and -- operators would convert numeric strings with trailing white-space to integers or floats, as appropriate, rather than applying the alphanumeric increment rules
    $d = "5 ";
    var_dump(++$d); // int(6)
  • String-to-string comparisons
    var_dump("123" == "123   "); // bool(true)
  • Bitwise operations, e.g.
    var_dump(123 & "123  ");  // int(123)
    var_dump(123 & "123abc"); // int(123) with E_WARNING "A non-numeric value encountered"
    var_dump(123 & "abc");    // TypeError

These changes will be accomplished by modifying the is_numeric_string C function (and its variants) in the Zend Engine.

For the string offset behaviour changes, the following Zend Engine C functions and their JIT equivalents will be modified: zend_check_string_offset() and zend_fetch_dimension_address_read().

The PHP language specification's definition of str-numeric would be modified by the addition of str-whitespaceopt after str-number and the removal of the following sentence: “A leading-numeric string is a string whose initial characters follow the requirements of a numeric string, and whose trailing characters are non-numeric”.

Backward Incompatible Changes

There are three backward incompatible changes:

  • Code relying on numerical strings with trailing white-spaces to be considered non-well-formed.
  • Code with liberal use of leading-numeric strings might need to use explicit type casts.
  • Code relying on the fact that '' (an empty string) evaluates to 0 for arithmetic/bitwise operations.

The first case is a precise requirement and should therefore be checked explicitly. A small polyfill which reproduces the previous is_numeric() behaviour:

if (is_numeric($str) && strlen($str) === strlen(rtrim($str))) { ... }
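The same check can be wrapped in a reusable function. This is a sketch, not part of the RFC: the function name is hypothetical, and the explicit character list passed to rtrim() is an assumption that adds "\f" to, and drops "\0" from, rtrim()'s default list.

```php
<?php
// Hypothetical wrapper: numeric strings only, with no trailing white-space.
function is_numeric_without_trailing_whitespace(string $str): bool
{
    return is_numeric($str)
        && strlen($str) === strlen(rtrim($str, " \t\n\r\v\f"));
}

var_dump(is_numeric_without_trailing_whitespace("123"));     // bool(true)
var_dump(is_numeric_without_trailing_whitespace("   123"));  // bool(true)
var_dump(is_numeric_without_trailing_whitespace("123   "));  // bool(false)
```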

Breaking the second case allows various bugs to be caught ahead of time, and the previous behaviour can be obtained by adding explicit casts, e.g.:

var_dump((int) "2px"); // int(2)
var_dump((float) "2px"); // float(2)
var_dump((int) "2.5px"); // int(2)
var_dump((float) "2.5px"); // float(2.5)

The third case already emitted an E_WARNING. Special-casing the empty string to evaluate to 0 was considered, but this would be inconsistent with how type declarations deal with an empty string, namely by throwing a TypeError; therefore a TypeError will also be thrown in this case. This can be mitigated by checking for an empty string beforehand and changing it to 0.
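The mitigation for the empty string can be sketched as a small normalising helper. The name empty_string_to_zero() and the $input variable are hypothetical, and whether 0 is the right default depends on the calling code.

```php
<?php
// Hypothetical helper: maps the empty string to 0, leaves other values alone.
function empty_string_to_zero($value)
{
    return $value === '' ? 0 : $value;
}

$input = '';                                 // e.g. an empty form field
var_dump(5 + empty_string_to_zero($input));  // int(5)
var_dump(5 + empty_string_to_zero('7'));     // int(12)
```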

Proposed PHP Version

PHP 8.0.

RFC Impact

To Existing Extensions

Any extension using the C is_numeric_string, its variants, or other functions which themselves use it, will be affected.

To Opcache

None that I am aware of.

Unaffected PHP Functionality

This does not affect the filter extension, which handles numeric strings itself in a different fashion.

Future Scope

  • Adding an E_NOTICE for numerical strings with leading/trailing white-spaces
  • Adding a flag to \is_numeric to accept or reject numerical strings with leading/trailing white-spaces
  • Align string offset behaviour with array offsets
  • Promote remaining warnings to TypeErrors in PHP 9
  • Warn on illegal offsets when used within isset() or empty()

Proposed Voting Choices

Per the Voting RFC, there would be a single Yes/No vote requiring a 2/3 majority for the main proposal, and a secondary Yes/No vote requiring a 50%+1 majority for the implementation vote about float strings for string offsets.

Patches and Tests

A pull request for a complete PHP interpreter patch, including test files, can be found here: https://github.com/php/php-src/pull/5762

A language specification patch still needs to be done.

A possible documentation patch still needs to be done.

Implementation

After the project is implemented, this section should contain

  1. the version(s) it was merged to
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature
  4. a link to the language specification section (if any)

Acknowledgement

To Andrea Faulds for the PHP RFC: Permit trailing whitespace in numeric strings, on which this RFC and patch are based.

To Theodore Brown and Larry Garfield for reviewing the RFC.

Changelog

  • 2020-07-13: Tweak inconsistency in regards to Arithmetic/Bitwise ops
  • 2020-07-10: Major rewrite
  • 2020-07-02: Explain difference between array and string offsets, and how the RFC will impact string offsets
  • 2020-07-01: Add explicit cast behaviour for leading numeric strings
  • 2020-06-28: Initial version
rfc/saner-numeric-strings.1594737233.txt.gz · Last modified: 2020/07/14 14:33 by theodorejb