PHP RFC: Saner numeric strings

Technical Background

The PHP language has a concept of numeric strings, strings which can be interpreted as numbers.

A string can be categorised in three ways according to its numeric-ness, as described by the language specification:

A fourth way PHP might deal with numeric strings is when using an integer string for an array index. An integer string is stricter than a numeric string as it has the following additional constraints:

How PHP deals with array indexes is shown in the following code snippet:

$a = [
    "4" => "Integer index",
    "03" => "Integer index with leading 0/octal",
    "2str" => "leading numeric string",
    " 1" => "leading whitespace",
    "5.5" => "Float",
];
var_dump($a);

Which results in the following output:

array(5) {
  [4]=>
  string(13) "Integer index"
  ["03"]=>
  string(34) "Integer index with leading 0/octal"
  ["2str"]=>
  string(22) "leading numeric string"
  [" 1"]=>
  string(18) "leading whitespace"
  ["5.5"]=>
  string(5) "Float"
}

This RFC does not affect how array indexes behave, and thus won't mention them again.

Another aspect worth noting is that arithmetic and bitwise operators convert all operands to their numeric or integer equivalent and emit a notice or warning on malformed or invalid numeric strings. The exceptions are the &, |, and ^ bitwise operators when both operands are strings, and the ~ operator when its operand is a string. In those cases the operation is performed on the ASCII values of the characters that make up the strings and the result is a string, as per the documentation on bitwise operators.
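As a brief illustration of the paragraph above, the following sketch contrasts the integer and string forms of the bitwise operators. The operand values are arbitrary and chosen only for this example:

```php
<?php
// Integer or mixed operands: numeric strings are converted to integers first.
var_dump(12 & 9);     // int(8)
var_dump("12" & 9);   // int(8)

// Both operands are strings: the operation is applied byte by byte
// to the ASCII values, and the result is a string.
var_dump("12" & "9"); // string(1) "1"
var_dump("12" | "9"); // string(2) "92"
```

Here '1' is 0x31 and '9' is 0x39, so the bytewise AND yields 0x31 ('1') and the bytewise OR yields 0x39 ('9') followed by the unmatched '2'.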

One final behaviour of PHP which needs to be presented is how PHP performs weak comparisons, i.e. a comparison with one of the following binary operators: ==, !=, <>, <, >, <=, and >=, in the string-to-string case and in the string-to-int/float case.

String-to-string comparisons are performed numerically if and only if both strings are numeric strings.
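A few string-to-string comparisons may make this rule more concrete. The values are illustrative only:

```php
<?php
// Both operands are numeric strings, so they are compared as numbers.
var_dump("1e3" == "1000");  // bool(true)
var_dump("10" == "10.0");   // bool(true)

// At least one operand is non-numeric, so they are compared as strings.
var_dump("abc" == "ABC");   // bool(false)

// Strict comparison never performs the numeric interpretation.
var_dump("1e3" === "1000"); // bool(false)
```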

String-to-int/float comparisons are always performed numerically, therefore the string will be type-juggled silently regardless of its numeric-ness.

This RFC does not propose to modify this behaviour, see PHP RFC: Saner string to number comparisons instead.

The concept of numeric strings is used in a few places, and the distinction between a numeric string and a leading-numeric string is significant as certain operations distinguish between these:

The Problem

The current behaviour of numeric strings has various issues:

Proposal

Unify the various numeric string modes into a single concept: Numeric characters only with both leading and trailing whitespace allowed. Any other type of string is non-numeric and will throw TypeErrors when used in a numeric context.

This means that all strings which currently emit the E_NOTICE “A non well formed numeric value encountered” will instead emit the E_WARNING “A non-numeric value encountered”, unless the leading-numeric string only has trailing whitespace, in which case it is now numeric. The various cases which currently emit an E_WARNING will be promoted to TypeErrors.
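The reclassification for arithmetic can be observed with a small sketch. It assumes PHP 8.0 or later, where this proposal is implemented. The helper name describeArithmetic() is hypothetical and exists only for this illustration:

```php
<?php
// Hypothetical helper: reports the result of "$value + 1" together with
// the diagnostic (if any) that the engine raised while evaluating it.
function describeArithmetic(string $value): string
{
    $diagnostic = 'none';
    set_error_handler(function (int $severity, string $message) use (&$diagnostic): bool {
        $diagnostic = $severity === E_WARNING ? 'E_WARNING' : 'E_NOTICE';
        return true; // skip PHP's default error handler
    });
    try {
        $result = $value + 1;
        return var_export($result, true) . " ($diagnostic)";
    } catch (TypeError $e) {
        return 'TypeError';
    } finally {
        restore_error_handler();
    }
}

echo describeArithmetic("5"), "\n";        // numeric
echo describeArithmetic("5 "), "\n";       // trailing whitespace: numeric
echo describeArithmetic("5 apples"), "\n"; // leading-numeric
echo describeArithmetic("apples"), "\n";   // non-numeric
```

On PHP 8.0 or later the first two calls should report "6 (none)", the third "6 (E_WARNING)", and the last "TypeError". On PHP 7 the same script would report an E_NOTICE for "5 " and "5 apples" and an E_WARNING for "apples".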

One exception to this is type declarations: as they only accept proper numeric strings, some cases which currently emit an E_NOTICE will result in a TypeError. See below for an example.

For string offsets accessed using numeric strings the following changes will be made:

The following cases will produce this behaviour under the proposal:

These changes will be accomplished by modifying the is_numeric_string C function (and its variants) in the Zend Engine.

For the string offset behaviour changes, the following Zend Engine C functions and their JIT equivalents will be modified: zend_check_string_offset() and zend_fetch_dimension_address_read().

The PHP language specification's definition of str-numeric would be modified by the addition of str-whitespaceopt after str-number and the removal of the following sentence: “A leading-numeric string is a string whose initial characters follow the requirements of a numeric string, and whose trailing characters are non-numeric”.

Backward Incompatible Changes

There are three backward incompatible changes:

The first is a precise requirement and should therefore be checked explicitly. A small polyfill that checks for the previous is_numeric() behaviour:

if (is_numeric($str) && strlen($str) === strlen(rtrim($str)) ){...}
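The check above can also be wrapped in a reusable function. The sketch below is one possible variant, not part of the RFC: the function name is hypothetical, and it passes an explicit character list to rtrim() on the assumption that the engine treats space, \t, \n, \r, \v and \f as whitespace in numeric strings (rtrim()'s default list does not include form feed).

```php
<?php
// Hypothetical helper name; not provided by PHP or by this RFC.
// Returns true only for numeric strings that have no trailing whitespace,
// which approximates the pre-8.0 behaviour of is_numeric().
function is_numeric_without_trailing_whitespace(string $str): bool
{
    // Assumed whitespace set for numeric strings, including form feed.
    $whitespace = " \t\n\r\v\f";

    return is_numeric($str) && $str === rtrim($str, $whitespace);
}

var_dump(is_numeric_without_trailing_whitespace("123"));  // bool(true)
var_dump(is_numeric_without_trailing_whitespace(" 123")); // bool(true)
var_dump(is_numeric_without_trailing_whitespace("123 ")); // bool(false)
var_dump(is_numeric_without_trailing_whitespace("abc"));  // bool(false)
```

Leading whitespace is still accepted, as it was before this proposal.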

Breaking the second allows various bugs to be caught ahead of time, and the previous behaviour can be obtained by adding explicit casts, e.g.:

var_dump((int) "2px");     // int(2)
var_dump((float) "2px");   // float(2)
var_dump((int) "2.5px");   // int(2)
var_dump((float) "2.5px"); // float(2.5)

The third already emitted an E_WARNING. We considered special-casing this to evaluate to 0, but this would be inconsistent with how type declarations deal with an empty string, namely throwing a TypeError. Therefore a TypeError will also be thrown in this case. The error can be avoided by explicitly checking for an empty string and changing it to 0.
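A minimal sketch of such an explicit check follows. The function name numericOrZero() is hypothetical, and the function assumes its input is either empty or a numeric string:

```php
<?php
// Hypothetical helper: maps the empty string to 0 before any arithmetic,
// so that "" never reaches a numeric context.
function numericOrZero(string $input)
{
    return $input === '' ? 0 : $input + 0;
}

var_dump(numericOrZero(''));    // int(0)
var_dump(numericOrZero('12'));  // int(12)
var_dump(numericOrZero('1.5')); // float(1.5)
```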

Proposed PHP Version

PHP 8.0.

RFC Impact

To Existing Extensions

Any extension using the C is_numeric_string, its variants, or other functions which themselves use it, will be affected.

To Opcache

None that I am aware of.

Unaffected PHP Functionality

This does not affect the filter extension, which handles numeric strings itself in a different fashion.

Future Scope

Vote

Per the Voting RFC, there is a single Yes/No vote requiring a 2/3 majority for the main proposal. A secondary Yes/No vote requiring a 50%+1 majority will decide whether float strings used as string offsets should continue to produce a warning (with different wording) instead of consistently becoming a TypeError.

Primary vote:

Accept Saner numeric string RFC proposal
Real name Yes No
ajf (ajf)  
alec (alec)  
ashnazg (ashnazg)  
beberlei (beberlei)  
brzuchal (brzuchal)  
bwoebi (bwoebi)  
carusogabriel (carusogabriel)  
daverandom (daverandom)  
derick (derick)  
galvao (galvao)  
girgias (girgias)  
guilhermeblanco (guilhermeblanco)  
ilutov (ilutov)  
jasny (jasny)  
kalle (kalle)  
kguest (kguest)  
kocsismate (kocsismate)  
marandall (marandall)  
mariano (mariano)  
mauricio (mauricio)  
mcmic (mcmic)  
nicolasgrekas (nicolasgrekas)  
ocramius (ocramius)  
ramsey (ramsey)  
reywob (reywob)  
salathe (salathe)  
sebastian (sebastian)  
sergey (sergey)  
stas (stas)  
svpernova09 (svpernova09)  
tandre (tandre)  
theodorejb (theodorejb)  
trowski (trowski)  
wyrihaximus (wyrihaximus)  
Final result: 30 4
This poll has been closed.

Secondary vote:

Should valid float strings for string offsets remain a warning
Real name Yes No
ajf (ajf)  
alec (alec)  
ashnazg (ashnazg)  
beberlei (beberlei)  
bwoebi (bwoebi)  
carusogabriel (carusogabriel)  
daverandom (daverandom)  
derick (derick)  
galvao (galvao)  
guilhermeblanco (guilhermeblanco)  
ilutov (ilutov)  
jasny (jasny)  
kguest (kguest)  
marandall (marandall)  
mariano (mariano)  
mauricio (mauricio)  
nicolasgrekas (nicolasgrekas)  
ocramius (ocramius)  
ramsey (ramsey)  
reywob (reywob)  
salathe (salathe)  
sebastian (sebastian)  
sergey (sergey)  
stas (stas)  
svpernova09 (svpernova09)  
tandre (tandre)  
theodorejb (theodorejb)  
trowski (trowski)  
wyrihaximus (wyrihaximus)  
Final result: 2 27
This poll has been closed.

Patches and Tests

A pull request for a complete PHP interpreter patch, including test files, can be found here: https://github.com/php/php-src/pull/5762

A language specification patch still needs to be done.

A possible documentation patch still needs to be done.

Implementation

After the project is implemented, this section should contain

  1. the version(s) it was merged to
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature
  4. a link to the language specification section (if any)

Acknowledgement

To Andrea Faulds for the PHP RFC: Permit trailing whitespace in numeric strings, on which this RFC and patch are based.

To Theodore Brown and Larry Garfield for reviewing the RFC.

Changelog