
PHP RFC: Binary String Comparison

This RFC proposes to change the behavior of non-strict string to string comparison to be binary safe.

Introduction

In PHP, when two strings are compared in non-strict mode, both operands are first cast to numbers. If that succeeds for both, the numbers are compared instead of performing a binary string comparison.

The current behavior is confusing even in a loosely typed language, because both operands have the same type and nothing in the context marks them as numeric. A string is not automatically a number just because its content consists of characters that could represent one. Moreover, on multilingual web pages this causes further errors (e.g. in German the roles of the comma and the dot in numbers are swapped).

The current behavior is largely unknown among PHP developers and newcomers because there is no numeric context. There is a note somewhere in the documentation, but it is not what 99.9% of people would expect.

The current behavior leads to bugs that are very hard to find, and it makes it hard to tell from the code what is going on.

Using strict string comparison works around this behavior, but it ends up with strict comparison being used everywhere, which makes non-strict comparison useless. In addition, some structures such as the switch statement cannot be used, because switch internally uses non-strict comparison.
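
The switch limitation mentioned above can be illustrated with a minimal sketch. The function names match_with_switch and match_with_strict are hypothetical and exist only for this illustration; the outputs shown are for the current (numeric) comparison behavior.

```php
<?php
// Illustrative helper (not part of the RFC): switch compares with ==,
// so two numeric strings are compared as numbers.
function match_with_switch($input)
{
    switch ($input) {
        case '10':
            return "matched case '10'";
        default:
            return 'no match';
    }
}

// The same logic written with ===, which compares the strings byte by byte.
function match_with_strict($input)
{
    if ($input === '10') {
        return "matched case '10'";
    }
    return 'no match';
}

echo match_with_switch('1e1'), "\n"; // matched case '10'
echo match_with_strict('1e1'), "\n"; // no match
```

To get byte-wise matching today, the switch has to be rewritten as an if/elseif chain using ===, which is the workaround the paragraph above describes.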

Since PHP 5.2.1, binary is an alias of the string type, and the prefix b can be used to mark a string as binary. The prefix currently has no effect.

Proposal

This RFC proposes to change the behavior of non-strict string to string comparison to be binary safe (as the strict comparison operator does).

When two numeric strings are compared, they are equal only if their string representations are identical. Otherwise, the first operand is greater if its byte at the first non-matching position is higher, and lower if that byte is lower.
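
The byte-wise ordering described above is what the existing strcmp() function already provides, so it can be used as a rough sketch of the proposed semantics. The helper binary_compare is a hypothetical name used only here; it normalizes the strcmp() result to -1, 0 or 1, because the exact magnitude strcmp() returns differs between PHP versions.

```php
<?php
// Hypothetical helper: byte-wise comparison normalized to -1, 0 or 1.
function binary_compare($a, $b)
{
    $result = strcmp($a, $b);
    if ($result < 0) {
        return -1;
    }
    if ($result > 0) {
        return 1;
    }
    return 0;
}

echo binary_compare('1', '1'), "\n";    // 0  (identical bytes)
echo binary_compare('1e1', '10'), "\n"; // 1  ('e' is higher than '0')
echo binary_compare('02', '1'), "\n";   // -1 ('0' is lower than '1')
echo binary_compare('+1', '1'), "\n";   // -1 ('+' is lower than '1')
```

Under this sketch, '1e1' == '10' would correspond to binary_compare('1e1', '10') === 0, which is false.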

As a side effect, string comparison becomes much faster, and developers are forced to write what they really mean (no guessing) and to cast or filter input once, which also benefits performance.

At the C level, the function zendi_smart_strcmp will no longer be used and will be marked as deprecated.

string == string

(http://3v4l.org/2bIUj)

  <?php
  echo ('1' == '1' ? 'true' : 'false') . " ('1' == '1')\n";
  echo ('2' == '1' ? 'true' : 'false') . " ('2' == '1')\n";
  echo ('0' == '0x0' ? 'true' : 'false') . " ('0' == '0x0')\n";
  echo ('0' == '00' ? 'true' : 'false') . " ('0' == '00')\n";
  echo ('1e1' == '10' ? 'true' : 'false') . " ('1e1' == '10')\n";
  echo ('1E1' == '10' ? 'true' : 'false') . " ('1E1' == '10')\n";
  echo ('1e-1' == '0.1' ? 'true' : 'false') . " ('1e-1' == '0.1')\n";
  echo ('1E-1' == '0.1' ? 'true' : 'false') . " ('1E-1' == '0.1')\n";
  echo ('+1' == '1' ? 'true' : 'false') . " ('+1' == '1')\n";
  echo ('+0' == '-0' ? 'true' : 'false') . " ('+0' == '-0')\n";
  echo ('0.99999999999999994' == '1' ? 'true' : 'false') . " ('0.99999999999999994' == '1')\n";
  echo ('0.99999999999999995' == '1' ? 'true' : 'false') . " ('0.99999999999999995' == '1')\n";
  echo ("\n1" == '1' ? 'true' : 'false') . " (\"\\n1\" == '1')\n";
  echo ("1\n" == '1' ? 'true' : 'false') . " (\"1\\n\" == '1')\n";

Current Behavior (handle both strings as numbers):

  true ('1' == '1')
  false ('2' == '1')
  true ('0' == '0x0')
  true ('0' == '00')
  true ('1e1' == '10')
  true ('1E1' == '10')
  true ('1e-1' == '0.1')
  true ('1E-1' == '0.1')
  true ('+1' == '1')
  true ('+0' == '-0')
  false ('0.99999999999999994' == '1')
  true ('0.99999999999999995' == '1')
  true ("\n1" == '1')
  false ("1\n" == '1')

Changed Behavior (handle both strings as binary):

  true ('1' == '1')
  false ('2' == '1')
  false ('0' == '0x0')
  false ('0' == '00')
  false ('1e1' == '10')
  false ('1E1' == '10')
  false ('1e-1' == '0.1')
  false ('1E-1' == '0.1')
  false ('+1' == '1')
  false ('+0' == '-0')
  false ('0.99999999999999994' == '1')
  false ('0.99999999999999995' == '1')
  false ("\n1" == '1')
  false ("1\n" == '1')

string > string | string >= string | string < string | string <= string

(http://3v4l.org/e7JHg)

  <?php
  echo ('1' > '2' ? 'true' : 'false') . " ('1' > '2')\n";
  echo ('1' < '2' ? 'true' : 'false') . " ('1' < '2')\n";
  echo ('02' > '1' ? 'true' : 'false') . " ('02' > '1')\n";
  echo ('1E1' <= '10' ? 'true' : 'false') . " ('1E1' <= '10')\n";
  echo ('+1' < '1' ? 'true' : 'false') . " ('+1' < '1')\n";
  echo ('+0' < '-0' ? 'true' : 'false') . " ('+0' < '-0')\n";
  echo ('0.99999999999999995' < '1' ? 'true' : 'false') . " ('0.99999999999999995' < '1')\n";
  echo ('0.99999999999999995' > '1' ? 'true' : 'false') . " ('0.99999999999999995' > '1')\n";
  echo ('0.99999999999999995' <= '1' ? 'true' : 'false') . " ('0.99999999999999995' <= '1')\n";
  echo ('0.99999999999999995' >= '1' ? 'true' : 'false') . " ('0.99999999999999995' >= '1')\n";

Current Behavior (handle both strings as numbers):

  false ('1' > '2')
  true ('1' < '2')
  true ('02' > '1')
  true ('1E1' <= '10')
  false ('+1' < '1')
  false ('+0' < '-0')
  false ('0.99999999999999995' < '1')
  false ('0.99999999999999995' > '1')
  true ('0.99999999999999995' <= '1')
  true ('0.99999999999999995' >= '1')

Changed Behavior (handle both strings as binary):

  false ('1' > '2')
  true ('1' < '2')
  false ('02' > '1')
  false ('1E1' <= '10')
  true ('+1' < '1')
  true ('+0' < '-0')
  true ('0.99999999999999995' < '1')
  false ('0.99999999999999995' > '1')
  true ('0.99999999999999995' <= '1')
  false ('0.99999999999999995' >= '1')

binary marked strings (since PHP 5.2.1)

(http://3v4l.org/bWnUG)

  <?php
  var_dump((binary)'1e1' == (binary)'10');
  var_dump(b'1e1' == b'10');

Current Behavior (binary marked strings will be handled numerically):

  bool(true)
  bool(true)

Changed Behavior (all strings will be handled binary without a context):

  bool(false)
  bool(false)

sorting of strings

(http://3v4l.org/mA0Yq)

  <?php 
  
  $arr = array('1', 3, 2, '03', '01', '02');
  
  echo "Sort regular:\n";
  sort($arr);
  var_dump($arr);
  
  echo "Sort numeric:\n"; 
  sort($arr, SORT_NUMERIC);
  var_dump($arr);
  
  echo "Sort binary:\n";
  sort($arr, SORT_STRING);
  var_dump($arr);

Current Behavior:

  Sort regular:
  array(6) {
    [0] =>
    string(2) "01"
    [1] =>
    string(1) "1"
    [2] =>
    string(2) "02"
    [3] =>
    int(2)
    [4] =>
    int(3)
    [5] =>
    string(2) "03"
  }
  Sort numeric:
  array(6) {
    [0] =>
    string(2) "01"
    [1] =>
    string(1) "1"
    [2] =>
    int(2)
    [3] =>
    string(2) "02"
    [4] =>
    string(2) "03"
    [5] =>
    int(3)
  }
  Sort binary:
  array(6) {
    [0] =>
    string(2) "01"
    [1] =>
    string(2) "02"
    [2] =>
    string(2) "03"
    [3] =>
    string(1) "1"
    [4] =>
    int(2)
    [5] =>
    int(3)
  }

Changed Behavior:

  Sort regular:
  array(6) {
    [0]=>
    string(2) "01"
    [1]=>
    string(2) "02"
    [2]=>
    string(1) "1"
    [3]=>
    int(2)
    [4]=>
    int(3)
    [5]=>
    string(2) "03"
  }
  Sort numeric:
  array(6) {
    [0]=>
    string(2) "01"
    [1]=>
    string(1) "1"
    [2]=>
    string(2) "02"
    [3]=>
    int(2)
    [4]=>
    string(2) "03"
    [5]=>
    int(3)
  }
  Sort binary:
  array(6) {
    [0]=>
    string(2) "01"
    [1]=>
    string(2) "02"
    [2]=>
    string(2) "03"
    [3]=>
    string(1) "1"
    [4]=>
    int(2)
    [5]=>
    int(3)
  }

Backward Incompatible Changes

Existing code that relies on the current behavior of non-strict string-to-string comparison will only produce the originally expected result if the string representations are identical. This can easily be resolved by explicitly casting one of the operands to an integer or float or, for sorting, by explicitly specifying the sort flag.
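
A minimal sketch of such a migration follows. The helper names equal_as_numbers and sorted_with_flag are hypothetical, and casting both operands (rather than only one) is a choice made here for readability. The code behaves the same with or without the proposed change, because it never relies on non-strict string-to-string comparison.

```php
<?php
// Hypothetical helper: states the numeric intent explicitly by casting.
function equal_as_numbers($a, $b)
{
    return (float) $a == (float) $b;
}

// Hypothetical helper: sorts a copy of the array with an explicit flag
// instead of relying on the default SORT_REGULAR.
function sorted_with_flag(array $values, $flag)
{
    sort($values, $flag);
    return $values;
}

var_dump(equal_as_numbers('1e1', '10')); // bool(true)

$values = array('10', '9', '2');
echo implode(',', sorted_with_flag($values, SORT_NUMERIC)), "\n"; // 2,9,10
echo implode(',', sorted_with_flag($values, SORT_STRING)), "\n";  // 10,2,9
```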

Proposed PHP Version(s)

As this is a backwards-incompatible change, this RFC targets PHP.next.

Affected PHP Functionality

Only non-strict string-to-string comparison will be affected, i.e. the operators ==, !=, <, >, <=, >= and related sorting functions that use the default sorting flag SORT_REGULAR.

Proposed Voting Choices

Voting Choices: Yes or No

This RFC requires a 2/3 majority as it changes the language itself.

Patches and Tests

Implementation

After the project is implemented, this section should contain

  1. the version(s) it was merged to
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature

References

Rejected Features

None so far.

rfc/binary_string_comparison.txt · Last modified: 2017/09/22 13:28 by 127.0.0.1