PHP RFC: Revise trailing character handling for numeric strings

Background

PHP is a dynamically-typed programming language with implicit and explicit type coercions: if a value of one type is required, and a value of another type is given, PHP can in many cases convert from one to the other. One of the most common conversions is from a string to a number type, which has possibly the most complex type conversion rules in PHP. This RFC seeks to further simplify those rules and make them more consistent.

Since PHP 7.1, most parts of PHP that perform string to number conversions use the same definitions of numeric strings, and differ only in the types of errors that non-well-formed and non-numeric strings produce. According to those definitions:

  • A well-formed numeric string contains a number optionally preceded by whitespace. For example, "123" is well-formed (just a number), and " 1.23e2" is also well-formed (a number preceded by whitespace).
  • A non-well-formed numeric string is any string beginning with a well-formed numeric string but followed by other characters, notably including whitespace. For example, "1.23e2abc" is non-well-formed (a number followed by unrelated letters), and " 1.23e2 " (a number both preceded and followed by whitespace) is also non-well-formed.
  • A non-numeric string is a string that is neither a well-formed nor a non-well-formed numeric string. For example, "abc1.23e2" is non-numeric (it doesn't start with a number, nor does it start with whitespace followed by a number).
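
The three categories above can be modelled in userland code. The following is a minimal sketch, not the Zend Engine's implementation: the function name classify_numeric_string and both regular expressions are assumptions made for illustration, approximating "number" as an optionally signed decimal integer or float with an optional exponent.

```php
<?php
// Illustrative model only: approximates the current (PHP 7.1+) definitions.
// The patterns are assumptions of this sketch, not the engine's actual rules.

function classify_numeric_string(string $s): string
{
    // Whitespace characters assumed to be permitted before the number.
    $ws = '[ \t\n\r\x0B\f]*';
    // Optional sign, integer or decimal digits, optional exponent.
    $number = '[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?';

    // Whole string is optional leading whitespace plus a number.
    if (preg_match("/\\A{$ws}{$number}\\z/", $s) === 1) {
        return 'well-formed';
    }
    // String merely begins with a well-formed numeric string.
    if (preg_match("/\\A{$ws}{$number}/", $s) === 1) {
        return 'non-well-formed';
    }
    return 'non-numeric';
}

foreach (['123', ' 1.23e2', '1.23e2abc', ' 1.23e2 ', 'abc1.23e2'] as $s) {
    printf("%-12s %s\n", var_export($s, true), classify_numeric_string($s));
}
// '123'        well-formed
// ' 1.23e2'    well-formed
// '1.23e2abc'  non-well-formed
// ' 1.23e2 '   non-well-formed
// 'abc1.23e2'  non-numeric
```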

There are two problems here:

  1. Whitespace is handled inconsistently: it is accepted as part of a well-formed string when it precedes a number, but makes the string non-well-formed when it follows one. There is no obvious benefit to treating the two cases differently. The current behaviour offers neither the benefit of accepting whitespace (tolerance of user input that may have extra surrounding spaces) nor that of rejecting it (strictly accepting only the numbers themselves), and so pleases nobody.
  2. Having two tiers of numeric string (“well-formed” and “non-well-formed”) complicates error handling, because two different errors must be handled instead of one, possibly through different mechanisms (e.g. TypeError vs E_NOTICE in the case of type declarations on functions). It can also cause bugs if code unintentionally relies on two parts of the language accepting the same string as numeric when one accepts non-well-formed strings and the other does not (e.g. the arithmetic operator - versus the comparison operator <).
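
To illustrate the second problem, the sketch below shows what a caller has to do to observe both failure modes of an int type declaration in weak typing mode: catch a TypeError for outright rejection, and install an error handler to notice diagnostics. The function names takes_int and try_takes_int are hypothetical, and the exact diagnostic raised for a string such as "123abc" depends on the PHP version, so no output is shown for it.

```php
<?php
// Hypothetical helper names. Weak typing mode: no strict_types declaration.

function takes_int(int $n): int
{
    return $n;
}

/**
 * Calls takes_int() and reports how the argument was handled.
 */
function try_takes_int($value): string
{
    // Mechanism 1: notices and warnings are only visible to code through
    // an error handler; here they are converted into ErrorException.
    set_error_handler(function (int $severity, string $message): bool {
        throw new ErrorException($message, 0, $severity);
    });

    try {
        takes_int($value);
        return 'accepted';
    } catch (TypeError $e) {
        // Mechanism 2: rejection of the argument is signalled by an exception.
        return 'TypeError';
    } catch (ErrorException $e) {
        return 'diagnostic: ' . $e->getMessage();
    } finally {
        restore_error_handler();
    }
}

echo try_takes_int('123'), "\n";    // accepted
echo try_takes_int('abc'), "\n";    // TypeError
echo try_takes_int('123abc'), "\n"; // version-dependent diagnostic
```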

Proposal

This RFC proposes to remove both problems by making two changes.

Part 1: Accept trailing whitespace as well-formed in a numeric string

For the next PHP 7.x (currently PHP 7.4), this RFC proposes that trailing whitespace be accepted as part of a well-formed numeric string. This would make PHP more consistent and less surprising, and would save developers from having to trim trailing whitespace from numeric strings before using them.

For the PHP interpreter, this would be accomplished by modifying the is_numeric_string C function (and its variants) in the Zend Engine. This would therefore affect PHP features which make use of this function, including:

  • Arithmetic operators would no longer produce an E_NOTICE-level error when used with a numeric string with trailing whitespace
  • The int and float type declarations would, in weak typing mode, no longer produce an E_NOTICE-level error when passed a numeric string with trailing whitespace
  • Type checks for built-in/extension (“internal”) PHP functions would, in weak typing mode, no longer produce an E_NOTICE-level error when passed a numeric string with trailing whitespace
  • The comparison operators would consider numeric strings with trailing whitespace to be numeric, so that, for example, "123 " == 123 would produce true, much like " 123" == 123 does at present
  • The \is_numeric function would return true for numeric strings with trailing whitespace
  • The ++ and -- operators would convert numeric strings with trailing whitespace to integers or floats, as appropriate, rather than applying the alphanumeric increment rules
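
The features listed above can be exercised with a small probe. This is a sketch for experimentation: the function name probe_numeric_string and the shape of the returned array are assumptions of this example. Results for a string with trailing whitespace such as "123 " depend on whether the interpreter includes this change, so no expected output is given for it.

```php
<?php
// Hypothetical probe: records how several language features treat a string.

function probe_numeric_string(string $s): array
{
    $diagnostics = [];
    // Collect notices and warnings instead of printing them.
    set_error_handler(function (int $severity, string $message) use (&$diagnostics): bool {
        $diagnostics[] = $message;
        return true;
    });

    try {
        $incremented = $s;
        $incremented++; // numeric increment or alphanumeric increment

        $result = [
            'is_numeric'  => is_numeric($s),
            'equals_123'  => ($s == 123), // comparison operator
            'plus_zero'   => $s + 0,      // arithmetic operator
            'incremented' => $incremented,
        ];
    } finally {
        restore_error_handler();
    }

    $result['diagnostics'] = $diagnostics;
    return $result;
}

var_dump(probe_numeric_string(" 123")); // leading whitespace: already well-formed
var_dump(probe_numeric_string("123 ")); // trailing whitespace: the case Part 1 changes
```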

The PHP language specification's definition of str-numeric would be modified by the addition of str-whitespaceopt after str-number.
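
As a rough illustration of that grammar change, the two definitions can be approximated as regular expressions. This is a sketch only: the patterns and the function names is_well_formed_current and is_well_formed_proposed are assumptions of this example and are not taken from the specification text.

```php
<?php
// Illustrative approximation of str-numeric before and after Part 1.

function numeric_pattern(bool $allowTrailingWhitespace): string
{
    $ws = '[ \t\n\r\x0B\f]*'; // stands in for optional str-whitespace
    $number = '[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?';
    $trailing = $allowTrailingWhitespace ? $ws : '';

    return "/\\A{$ws}{$number}{$trailing}\\z/";
}

function is_well_formed_current(string $s): bool
{
    return preg_match(numeric_pattern(false), $s) === 1;
}

function is_well_formed_proposed(string $s): bool
{
    return preg_match(numeric_pattern(true), $s) === 1;
}

foreach ([' 123', '123 ', ' 1.23e2 ', '123abc '] as $s) {
    printf(
        "%-10s current=%s proposed=%s\n",
        var_export($s, true),
        var_export(is_well_formed_current($s), true),
        var_export(is_well_formed_proposed($s), true)
    );
}
```

Only strings that differ from a currently well-formed string by trailing whitespace change category in this model; a string such as "123abc " is rejected by both patterns.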

This change would be almost completely backwards-compatible, as no string that was previously accepted would now be rejected. However, if an application relies on trailing whitespace not being considered well-formed, it would need updating.

Part 2: Remove non-well-formed numeric strings

To follow on from Part 1, for the next PHP x.0 (currently PHP 8.0), this RFC proposes that the concept of the “non-well-formed” numeric string be removed, and that all such strings instead be treated as non-numeric. This change would break backwards compatibility and is therefore proposed for a major rather than a minor PHP version.

The hope is that the backwards compatibility impact would be limited by Part 1's acceptance of trailing whitespace, since that would prevent a large category of currently non-well-formed strings from being affected.

In order to prepare for the backwards-compatibility break in the following major version, the “A non well formed numeric value encountered” notice (where currently produced) should be changed in the next PHP 7.x (currently PHP 7.4) to mention that this behaviour is deprecated, i.e. “A non well formed numeric value encountered (non well formed numeric values are deprecated and will be considered non-numeric in PHP 8.0)”.

For the PHP interpreter, this change would be accomplished by modifying the is_numeric_string C function (and its variants) in the Zend Engine. This would therefore affect PHP features which make use of this function, including:

  • Arithmetic operators would now produce the same E_WARNING as for other non-numeric strings (TBD: and return 0)
  • The int and float type declarations would produce the same TypeError as for other non-numeric strings
  • Type checks for built-in/extension (“internal”) PHP functions would produce the same E_WARNING error and return NULL (weak typing mode) or the same TypeError (strict typing mode) as for other non-numeric strings

It would not affect the following features, since they already treat non-well-formed numeric strings strictly:

  • The comparison operators
  • The \is_numeric function
  • The ++ and -- operators
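
Code that wants the combined behaviour of Parts 1 and 2 regardless of interpreter version could use a helper along the following lines. This is a sketch under stated assumptions: the function name to_number_strict is hypothetical, and the set of characters passed to rtrim is this example's guess at what counts as whitespace.

```php
<?php
// Hypothetical userland helper: accepts surrounding whitespace, rejects
// any other trailing characters, independent of the PHP version in use.

/**
 * @return int|float
 * @throws InvalidArgumentException if the string is not numeric
 */
function to_number_strict(string $s)
{
    // Trailing whitespace is tolerated, as Part 1 proposes.
    $trimmed = rtrim($s, " \t\n\r\v\f");

    // Anything else after the number makes the string non-numeric,
    // as Part 2 proposes. Leading whitespace is already accepted
    // by is_numeric().
    if (!is_numeric($trimmed)) {
        throw new InvalidArgumentException(
            'Not a numeric string: ' . var_export($s, true)
        );
    }

    return $trimmed + 0; // int or float, as appropriate
}

var_dump(to_number_strict("123 "));     // int(123)
var_dump(to_number_strict(" 1.23e2 ")); // float(123)

try {
    to_number_strict("123abc");
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```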

TBD: what about explicit conversions, though?

The PHP language specification's definition of str-numeric would be modified. TBD.

RFC Impact

To Existing Extensions

Any extension using is_numeric_string, its variants, or other functions which themselves use it, will be affected.

To Opcache

In the patch, all tests pass with Opcache enabled. I am not aware of any issues arising here.

Unaffected PHP Functionality

This does not affect the filter extension, which handles numeric strings itself in a different fashion.
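
For comparison, the filter extension can be called as in the sketch below. The wrapper name validate_int is hypothetical. Per the PHP manual, FILTER_VALIDATE_INT trims string values before validating them, which is one way its handling differs from is_numeric_string.

```php
<?php
// Hypothetical wrapper around the filter extension's integer validation.

/**
 * @return int|false the validated integer, or false on failure
 */
function validate_int(string $s)
{
    return filter_var($s, FILTER_VALIDATE_INT);
}

var_dump(validate_int(" 123 "));  // surrounding whitespace
var_dump(validate_int("123abc")); // trailing letters
var_dump(validate_int("1.23e2")); // float notation
```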

Future Scope

None conceivable.

Proposed Voting Choices

These are language changes, and require a 2/3 majority. There will be two votes, held simultaneously, on whether to accept Part 1 and Part 2 individually and apply their changes to the proposed PHP versions, with the proviso that the outcome of the Part 2 vote is ignored if Part 1 is rejected, as these changes build on each other.

Patches and Tests

For Part 1, a pull request for a complete PHP interpreter patch, including a test file, can be found here: https://github.com/php/php-src/pull/2317

FIXME: There is no patch yet for Part 2, nor language specification patches.

Implementation

After the project is implemented, this section should contain

  1. the version(s) it was merged to
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature
  4. a link to the language specification section (if any)

Changelog

  • 2019-02-07, v1.1: Added proposal to remove “non-well-formed” numeric strings at the suggestion of Nikita Popov, renamed to “Revise trailing character handling for numeric strings”
  • 2017-01-18, v1.0: First draft as “Permit trailing whitespace in numeric strings”

rfc/trailing_whitespace_numerics.1549503513.txt.gz · Last modified: 2019/02/07 01:38 by ajf