rfc:invalid_strings_in_arithmetic

PHP RFC: Warn about invalid strings in arithmetic

Introduction

PHP's arithmetic operators allow not only numbers, but also numeric strings to be used as operands. For example, not only 1 + 1 produces 2, but also '1' + '1'. This can be a useful feature when dealing with user input, which is often a string when dealing with the web. However, the arithmetic operators do not just accept numeric strings, but any string at all, and do not produce any type of error message if nonsensical input is given, instead simply considering it equivalent to zero. This means, for example, that "not a number" + "12" produces 12, without any warning that the string "not a number" was thrown away. Similarly, strings which only start with a number are accepted, with the remainder silently ignored. For example, "10 apples" + "5 pears" results in 15, also without any error message. The lack of error message produced here can create bugs which are not immediately obvious.

This can be particularly problematic for users dealing with other programming languages. For example, in many other languages the + symbol is used for concatenating strings, whereas in PHP the . operator is used. When a user has been using another language and then writes PHP code, they may mistakenly use the + operator to concatenate strings. Unfortunately, PHP will not warn them of their mistake in this case, instead simply producing 0 as the result of the operation. This is easy to miss when looking at a program's output or, worse, may not even be in the output itself. It may also be hard to figure out where the zero value originated in the source code.

Proposal

New E_NOTICE and E_WARNING

For all arithmetic operators, using a non-numeric string (such as "foobar") as an operand where a number is expected will produce an E_WARNING error, and using a non-well-formed numeric string (such as "10 apples") will produce an E_NOTICE error. For our purposes, the “arithmetic operators” are considered to be the set { + - * / ** % << >> | & ^ }. The bitwise NOT operator ~ is not included, because it does not perform automatic numeric string conversion.

In short, the following code, which currently produces no errors

<?php
 
$numberOfApples = "10 apples" + "5 pears";

will now produce the following errors:

Notice: A non well formed numeric string encountered in example.php on line 3

Notice: A non well formed numeric string encountered in example.php on line 3

Similarly, this code

<?php
 
$numberOfPears = 5 * "orange";

will now produce this error:

Warning: A non-numeric string encountered in example.php on line 3

The “A non well formed numeric string encountered” E_NOTICE is the same as that currently produced when passing such non-well-formed values where a number is expected to a built-in PHP function or userland function with type declarations. It is produced for strings which start with a number (possibly preceded by whitespace), but also contain non-numeric content. This includes strings like “ 123abc” or “ 1.23e3FOOBAR”. It also, perhaps unfortunately, includes strings with trailing whitespace like “ 123 ”.

The “A non-numeric string encountered” E_WARNING is a newly invented error for this RFC, with the wording chosen to match the existing error. It is shown when a string does not start with a number (possibly preceded by whitespace). There is no such E_WARNING for function type checks, because they simply reject non-numeric strings outright. This RFC does not do this, in order to avoid breaking backwards compatibility (see the “Backwards Incompatible Changes” section below).

Note that the operation still completes, since E_NOTICE and E_WARNING do not, by default, stop code execution, though see “Backwards Incompatible Changes” below.

Fractional and scientific notation strings with integer operators

Currently, there is an inconsistency in PHP with how numeric strings are interpreted. This comes down to which function is used internally to convert numeric strings into integers or floats. There are three functions at play:

  • strtol, a C standard library function for converting strings to integers. In its usage in PHP, it is always given a base argument value of 10. It only accepts numbers in the format of optional leading whitespace, followed by an optional leading sign (- or +), followed by a sequence of decimal digits (0 through 9). So, it accepts numbers like -12345 and +42.
  • strtod, a C standard library function for converting strings to floating-point numbers. It accepts anything strtol accepts, but also numbers with a decimal point in them, and numbers written in scientific notation with an e separating the coefficient and exponent. So, it accepts not only numbers like -12345 and +42, but also numbers like 123.45 and 1.2345e9.
  • is_numeric_string_ex, a Zend Engine function for converting strings to integers or floating-point numbers. It accepts anything strtol or strtod accepts, and intelligently chooses whether to convert to an integer or a float. It is the function which produces the “A non well formed numeric string encountered” error.

Operators that can take either an integer or a float (+ - * / **), and type checks for function parameters and return types (both int and float) use is_numeric_string_ex. This means that they can handle numbers in scientific notation, so for example, var_dump("1.2345e9" + 0); results in float(1234500000), and var_dump(intdiv("1.2345e9", 1)); results in int(1234500000).

However, the integer operators (% << >> | & ^) and the integer type casts ((int) and intval()) both use strtol. strtol stops reading a string when it hits a character it doesn't accept, so -123.45 is interpreted correctly, as it stops reading at the unaccepted . and produces -123. But for numbers in scientific notation, this produces the wrong result: when reading 1.2345e9, it will stop at . and produce 1. Thus, var_dump((int)"1.2345e9"); and var_dump("1.2345e9" | 0); both produce int(1).

This inconsistency is unintuitive, and becomes a greater problem if we add warnings and notices for non-numeric strings, as some operators would tell you strings like “1.2345e9” and “-123.45” are non-well-formed, whereas other operators would happily accept such strings.

To avoid this problem, this RFC proposes to use is_numeric_string_ex instead of strtol for the integer operators and integer casts (intval() with $base = 10, (int), settype(), etc.), resolving the inconsistency. This also affects the Zend Engine C functions zval_get_long and convert_to_long, and so PHP functions which uses these internal functions are also affected, including decbin(), decoct() and dechex().

Backward Incompatible Changes

The introduction of a new E_NOTICE and E_WARNING may create backwards-incompatibility issues in projects which use error handlers to convert these types of errors into exceptions, or have other special handling of E_NOTICEs and E_WARNINGs. This is, unfortunately, an unavoidable consequence of producing error messages where it was not done before. There is less risk of breakage with the E_NOTICE, as E_NOTICE is often silenced and ignored in production.

However, the situations where an E_WARNING is produced are likely to be accidental, so the introduction of this error message may be helpful. Furthermore, it is trivial to fix any case where these new errors would be produced, either by using an explicit conversion (e.g. (int)"10 apples" + (int)"5 pears"), suppressing the error (e.g. @("10 apples" + "5 pears")) or fixing whatever issue caused invalid data to be used in the operation.

This RFC specifically chooses to introduce an E_WARNING for using a non-numeric string, rather than produce a TypeError, in order to reduce potential backwards-compatibility issues.

Recognising scientific notation numeric strings when casting to integers (due to now using is_numeric_string_ex across-the-board) may cause backwards-compatibility issues in code which expects the exponent part to be ignored (i.e. expecting that, e.g. “1.2345e9” will be converted to 1). This is very unlikely to cause problems in practice, but it is a possibility.

Proposed PHP Version(s)

This is proposed for the next minor version of PHP, currently PHP 7.1.

RFC Impact

To Internals

This RFC is implemented by modifying how add_function, sub_function etc. coerce strings to numbers. Operator functions which used convert_scalar_to_number now use a private, modified version which instructs is_numeric_string_ex to not silence errors, and additionally produces the “A non-numeric string encountered” warning if is_numeric_string_ex indicates failure. _zval_get_long_func has been modified to use is_numeric_string_ex for conversion, and operator functions which used _zval_get_long_func now use a private, modified version which instructs is_numeric_string_ex to not silence errors, likewise producing the “A non-numeric string encountered” warning if it indicates failure.

To Existing Extensions

Because this RFC affects the operator functions, which are part of the Zend API, any extension which uses them will now produce the new E_NOTICEs and E_WARNINGs detailed above.

Because this RFC affects the Zend Engine's integer conversion functions (_zval_get_long_func, and its wrappers _zval_get_long, zval_get_long, convert_to_long and multi_convert_to_long_ex), any extension which uses them will now convert numeric strings using scientific notation differently, as detailed above.

To SAPIs

No specific impact I am aware of.

To Opcache

I have tested the RFC against opcache, and patched three different Zend Optimizer optimisations which created problems. At the time of writing this, all tests now pass both with Opcache enabled and Opcache disabled.

To Constants

Constant scalar expressions (e.g. const APPLE_COUNT = "10 apples" + "5 pears";) are not excepted from the introduction of this E_NOTICE and E_WARNING. Similarly, they are already subject to the “Undefined offset:” E_NOTICE.

Open Issues

None.

Unaffected PHP Functionality

This does not impact the type conversion rules for functions. It also does not impact the behaviour of type juggling for comparisons.

Future Scope

Ideally, using non-numeric strings where numbers are expected in arithmetic operations would produce a TypeError in the next major version of PHP, currently PHP 8.0. This may be worth adding to the RFC.

This RFC only affects numeric strings with arithmetic operators, but the behaviour of allowing resources to be silently, implicitly converted here is similarly problematic. A separate RFC may wish to get rid of this, although this would be unnecessary if the legacy resource type is phased out.

The fact that is_numeric_string_ex considers numeric strings with trailing whitespace to be “non-well-formed” and "inf", "-inf" and "nan" to be non-numeric may not be ideal, but fixing this would be beyond the scope of this RFC.

At present, the declare(strict_types=1); directive only applies to function calls and return statements, but a future version of PHP may wish to make it also affect operators.

Vote

The vote is a simple Yes/No on whether to accept the RFC for the next minor version of PHP and merge the patch into master.

As this is a language change, the RFC requires a 2/3 majority to pass.

Voting started on 2016-03-20 and ended on 2016-03-28.

Accept the ‘Warn about invalid strings in arithmetic’ v1.1 RFC for PHP 7.1?
Real name Yes No
ajf (ajf)  
bishop (bishop)  
brandon (brandon)  
bwoebi (bwoebi)  
colinodell (colinodell)  
danack (danack)  
daverandom (daverandom)  
davey (davey)  
galvao (galvao)  
gasolwu (gasolwu)  
guilhermeblanco (guilhermeblanco)  
hywan (hywan)  
jwage (jwage)  
krakjoe (krakjoe)  
leigh (leigh)  
levim (levim)  
lstrojny (lstrojny)  
marcio (marcio)  
mariano (mariano)  
mike (mike)  
nikic (nikic)  
ocramius (ocramius)  
patrickallaert (patrickallaert)  
pierrick (pierrick)  
sobak (sobak)  
svpernova09 (svpernova09)  
tpunt (tpunt)  
trowski (trowski)  
yohgaki (yohgaki)  
zimt (zimt)  
Final result: 29 1
This poll has been closed.

Voting had previously opened on 2016-01-23, but it was cancelled due to the Fractional and scientific notation strings with integer operators issue.

Patches and Tests

A complete pull request for the PHP interpreter, including tests, can be found here: https://github.com/php/php-src/pull/1718

A complete pull request for the PHP language specification, including tests, can be found here: https://github.com/php/php-langspec/pull/155

Implementation

The interpreter patch was merged into 7.1 here: https://github.com/php/php-src/commit/1e82ad8038d3100b7e27be870652c1f639a7200a

The UPGRADING file notes can be found here (more extensive than in the previously-linked patch), see sections 1 and 2: https://github.com/php/php-src/blob/0105bd20b706c8ab5b0a71f59f11a9dabe735f6b/UPGRADING

The corresponding mention in the manual can be found here: http://php.net/manual/en/migration71.other-changes.php

The language specification patch was merged into 7.1 here: https://github.com/php/php-langspec/commit/a3ea4e992f43ea9083c3fe3738a5ded03412f6e1

References

Rejected Features

Keep this updated with features that were discussed on the mail lists.

Changelog

  • v1.1.3 (2016-02-14) - Update for opcache compatibility
  • v1.1.2 (2016-02-05) - List more functions affected by support for scientific-notation numeric strings
  • v1.1.1 (2016-02-05) - Clarify that handling of scientific-notation numeric strings with intval() only applies for $base = 10, and that it applies to settype()
  • v1.1 (2016-01-26) - Expanded proposal to change handling of fractional and scientific-notation numeric strings with integer operators
  • v1.0 (2016-01-18) - First public version
rfc/invalid_strings_in_arithmetic.txt · Last modified: 2016/12/28 00:11 by ajf