====== PHP RFC: Warn about invalid strings in arithmetic ======
* Version: 1.1.3
* Date: 2016-01-08
* Author: Andrea Faulds, ajf@ajf.me
* Status: Implemented (PHP 7.1)
* First Published at: http://wiki.php.net/rfc/invalid_strings_in_arithmetic
===== Introduction =====
PHP's arithmetic operators allow not only numbers, but also numeric strings to be used as operands. For example, not only 1 + 1 produces 2, but also '1' + '1'. This can be a useful feature when dealing with user input, which is often a string when dealing with the web. However, the arithmetic operators do not just accept numeric strings, but any string at all, and do not produce any type of error message if nonsensical input is given, instead simply considering it equivalent to zero. This means, for example, that "not a number" + "12" produces 12, without any warning that the string "not a number" was thrown away. Similarly, strings which only start with a number are accepted, with the remainder silently ignored. For example, "10 apples" + "5 pears" results in 15, also without any error message. The lack of error message produced here can create bugs which are not immediately obvious.
This can be particularly problematic for users dealing with other programming languages. For example, in many other languages the ''+'' symbol is used for concatenating strings, whereas in PHP the ''.'' operator is used. When a user has been using another language and then writes PHP code, they may mistakenly use the ''+'' operator to concatenate strings. Unfortunately, PHP will not warn them of their mistake in this case, instead simply producing 0 as the result of the operation. This is easy to miss when looking at a program's output or, worse, may not even be in the output itself. It may also be hard to figure out where the zero value originated in the source code.
===== Proposal =====
==== New E_NOTICE and E_WARNING ====
For all arithmetic operators, using a non-numeric string (such as "foobar") as an operand where a number is expected will produce an [[http://php.net/manual/en/errorfunc.constants.php|E_WARNING]] error, and using a non-well-formed numeric string (such as "10 apples") will produce an [[http://php.net/manual/en/errorfunc.constants.php|E_NOTICE]] error. For our purposes, the "arithmetic operators" are considered to be the set { ''+ - * / *''''* % <''''< >''''> | & ^'' }. The bitwise NOT operator ''~'' is not included, because it does not perform automatic numeric string conversion.
In short, the following code, which currently produces no errors
will now produce the following errors:
Notice: A non well formed numeric string encountered in example.php on line 3
Notice: A non well formed numeric string encountered in example.php on line 3
Similarly, this code
will now produce this error:
Warning: A non-numeric string encountered in example.php on line 3
The "A non well formed numeric string encountered" E_NOTICE is the same as that currently produced when passing such non-well-formed values where a number is expected to a built-in PHP function or userland function with type declarations. It is produced for strings which start with a number (possibly preceded by whitespace), but also contain non-numeric content. This includes strings like ''" 123abc"'' or ''" 1.23e3FOOBAR"''. It also, perhaps unfortunately, includes strings with trailing whitespace like ''" 123 "''.
The "A non-numeric string encountered" E_WARNING is a newly invented error for this RFC, with the wording chosen to match the existing error. It is shown when a string does not start with a number (possibly preceded by whitespace). There is no such E_WARNING for function type checks, because they simply reject non-numeric strings outright. This RFC does not do this, in order to avoid breaking backwards compatibility (see the "Backwards Incompatible Changes" section below).
Note that the operation still completes, since E_NOTICE and E_WARNING do not, by default, stop code execution, though see "Backwards Incompatible Changes" below.
==== Fractional and scientific notation strings with integer operators ====
Currently, there is an inconsistency in PHP with how numeric strings are interpreted. This comes down to which function is used internally to convert numeric strings into integers or floats. There are three functions at play:
* [[http://www.cplusplus.com/reference/cstdlib/strtol/|strtol]], a C standard library function for converting strings to integers. In its usage in PHP, it is always given a ''base'' argument value of 10. It only accepts numbers in the format of optional leading whitespace, followed by an optional leading sign (''-'' or ''+''), followed by a sequence of decimal digits (''0'' through ''9''). So, it accepts numbers like ''-12345'' and ''+42''.
* [[http://www.cplusplus.com/reference/cstdlib/strtod/|strtod]], a C standard library function for converting strings to floating-point numbers. It accepts anything ''strtol'' accepts, but also numbers with a decimal point in them, and numbers written in scientific notation with an ''e'' separating the coefficient and exponent. So, it accepts not only numbers like ''-12345'' and ''+42'', but also numbers like ''123.45'' and ''1.2345e9''.
* [[http://lxr.php.net/s?refs=is_numeric_string_ex&project=PHP_7_0|is_numeric_string_ex]], a Zend Engine function for converting strings to integers or floating-point numbers. It accepts anything ''strtol'' or ''strtod'' accepts, and intelligently chooses whether to convert to an integer or a float. It is the function which produces the "A non well formed numeric string encountered" error.
Operators that can take either an integer or a float (''+ - * / *''''*''), and type checks for function parameters and return types (both ''int'' and ''float'') use ''is_numeric_string_ex''. This means that they can handle numbers in scientific notation, so for example, var_dump("1.2345e9" + 0); results in ''float(1234500000)'', and var_dump(intdiv("1.2345e9", 1)); results in ''int(1234500000)''.
However, the integer operators (''% <''''< >''''> | & ^'') and the integer type casts ((int) and intval()) both use ''strtol''. ''strtol'' stops reading a string when it hits a character it doesn't accept, so ''-123.45'' is interpreted correctly, as it stops reading at the unaccepted ''.'' and produces ''-123''. But for numbers in scientific notation, this produces the wrong result: when reading ''1.2345e9'', it will stop at ''.'' and produce ''1''. Thus, var_dump((int)"1.2345e9"); and var_dump("1.2345e9" | 0); both produce ''int(1)''.
This inconsistency is unintuitive, and becomes a greater problem if we add warnings and notices for non-numeric strings, as some operators would tell you strings like ''"1.2345e9"'' and ''"-123.45"'' are non-well-formed, whereas other operators would happily accept such strings.
To avoid this problem, this RFC proposes to use ''is_numeric_string_ex'' instead of ''strtol'' for the integer operators and integer casts (intval() with $base = 10, (int), settype(), etc.), resolving the inconsistency. This also affects the Zend Engine C functions ''zval_get_long'' and ''convert_to_long'', and so PHP functions which uses these internal functions are also affected, including decbin(), decoct() and dechex().
===== Backward Incompatible Changes =====
The introduction of a new E_NOTICE and E_WARNING may create backwards-incompatibility issues in projects which use error handlers to convert these types of errors into exceptions, or have other special handling of E_NOTICEs and E_WARNINGs. This is, unfortunately, an unavoidable consequence of producing error messages where it was not done before. There is less risk of breakage with the E_NOTICE, as E_NOTICE is often silenced and ignored in production.
However, the situations where an E_WARNING is produced are likely to be accidental, so the introduction of this error message may be helpful. Furthermore, it is trivial to fix any case where these new errors would be produced, either by using an explicit conversion (e.g. (int)"10 apples" + (int)"5 pears"), suppressing the error (e.g. @("10 apples" + "5 pears")) or fixing whatever issue caused invalid data to be used in the operation.
This RFC specifically chooses to introduce an E_WARNING for using a non-numeric string, rather than produce a TypeError, in order to reduce potential backwards-compatibility issues.
Recognising scientific notation numeric strings when casting to integers (due to now using ''is_numeric_string_ex'' across-the-board) may cause backwards-compatibility issues in code which expects the exponent part to be ignored (i.e. expecting that, e.g. ''"1.2345e9"'' will be converted to ''1''). This is very unlikely to cause problems in practice, but it is a possibility.
===== Proposed PHP Version(s) =====
This is proposed for the next minor version of PHP, currently PHP 7.1.
===== RFC Impact =====
==== To Internals ====
This RFC is implemented by modifying how ''add_function'', ''sub_function'' etc. coerce strings to numbers. Operator functions which used ''convert_scalar_to_number'' now use a private, modified version which instructs ''is_numeric_string_ex'' to not silence errors, and additionally produces the "A non-numeric string encountered" warning if ''is_numeric_string_ex'' indicates failure. ''_zval_get_long_func'' has been modified to use ''is_numeric_string_ex'' for conversion, and operator functions which used ''_zval_get_long_func'' now use a private, modified version which instructs ''is_numeric_string_ex'' to not silence errors, likewise producing the "A non-numeric string encountered" warning if it indicates failure.
==== To Existing Extensions ====
Because this RFC affects the operator functions, which are part of the Zend API, any extension which uses them will now produce the new E_NOTICEs and E_WARNINGs detailed above.
Because this RFC affects the Zend Engine's integer conversion functions (''_zval_get_long_func'', and its wrappers ''_zval_get_long'', ''zval_get_long'', ''convert_to_long'' and ''multi_convert_to_long_ex''), any extension which uses them will now convert numeric strings using scientific notation differently, as detailed above.
==== To SAPIs ====
No specific impact I am aware of.
==== To Opcache ====
I have tested the RFC against opcache, and patched three different Zend Optimizer optimisations which created problems. At the time of writing this, all tests now pass both with Opcache enabled and Opcache disabled.
==== To Constants ===
Constant scalar expressions (e.g. const APPLE_COUNT = "10 apples" + "5 pears";) are not excepted from the introduction of this E_NOTICE and E_WARNING. Similarly, they are already subject to the "Undefined offset:" E_NOTICE.
===== Open Issues =====
None.
===== Unaffected PHP Functionality =====
This does not impact the type conversion rules for functions. It also does not impact the behaviour of type juggling for comparisons.
===== Future Scope =====
Ideally, using non-numeric strings where numbers are expected in arithmetic operations would produce a ''TypeError'' in the next major version of PHP, currently PHP 8.0. This may be worth adding to the RFC.
This RFC only affects numeric strings with arithmetic operators, but the behaviour of allowing resources to be silently, implicitly converted here is similarly problematic. A separate RFC may wish to get rid of this, although this would be unnecessary if the legacy resource type is phased out.
The fact that ''is_numeric_string_ex'' considers numeric strings with trailing whitespace to be "non-well-formed" and "inf", "-inf" and "nan" to be non-numeric may not be ideal, but fixing this would be beyond the scope of this RFC.
At present, the declare(strict_types=1); directive only applies to function calls and return statements, but a future version of PHP may wish to make it also affect operators.
===== Vote =====
The vote is a simple Yes/No on **whether to accept the RFC for the next minor version of PHP and merge the patch into ''master''**.
As this is a language change, the RFC requires a 2/3 majority to pass.
Voting started on 2016-03-20 and ended on 2016-03-28.
* Yes
* No
[[https://wiki.php.net/rfc/invalid_strings_in_arithmetic?rev=1453841945|Voting had previously opened on 2016-01-23]], but it was cancelled due to the //Fractional and scientific notation strings with integer operators// issue.
===== Patches and Tests =====
A complete pull request for the PHP interpreter, including tests, can be found here: https://github.com/php/php-src/pull/1718
A complete pull request for the PHP language specification, including tests, can be found here: https://github.com/php/php-langspec/pull/155
===== Implementation =====
The interpreter patch was merged into 7.1 here: https://github.com/php/php-src/commit/1e82ad8038d3100b7e27be870652c1f639a7200a
The UPGRADING file notes can be found here (more extensive than in the previously-linked patch), see sections 1 and 2: https://github.com/php/php-src/blob/0105bd20b706c8ab5b0a71f59f11a9dabe735f6b/UPGRADING
The corresponding mention in the manual can be found here: http://php.net/manual/en/migration71.other-changes.php
The language specification patch was merged into 7.1 here: https://github.com/php/php-langspec/commit/a3ea4e992f43ea9083c3fe3738a5ded03412f6e1
===== References =====
* In a talk I gave at PHP North West in October 2015, I gave a personal anecdote about why I would like this RFC: https://www.youtube.com/watch?v=bYMUbavj9uE&t=15m27s
===== Rejected Features =====
Keep this updated with features that were discussed on the mail lists.
===== Changelog =====
* v1.1.3 (2016-02-14) - Update for opcache compatibility
* v1.1.2 (2016-02-05) - List more functions affected by support for scientific-notation numeric strings
* v1.1.1 (2016-02-05) - Clarify that handling of scientific-notation numeric strings with intval() only applies for $base = 10, and that it applies to settype()
* v1.1 (2016-01-26) - Expanded proposal to change handling of fractional and scientific-notation numeric strings with integer operators
* v1.0 (2016-01-18) - First public version