This is an old revision of the document!


PHP RFC: Big Integer Support

  • Version: 0.1.7
  • Date: 2014-06-20 (Initial Draft; Put Under Discussion 2014-10-10, Last updated 2015-01-10)
  • Author: Andrea Faulds, ajf@ajf.me
  • Status: Under Discussion
  • First Published at: http://wiki.php.net/rfc/bigint

Introduction

Since the beginning, PHP has had only two numeric types: integer, and float. The former has been a platform-dependent C long, usually either 32-bit or 64-bit, and the latter has been a platform-dependent C double, usually an IEEE 754 double-precision floating-point number.

Both work relatively well, but beyond the maximum integer value on a specific platform, things get a bit messy. Typically, PHP will have integers overflow to floats, resulting in a loss of precision. Integer size is platform-specific, so code dealing with large integers won't act the same on a 32-bit machine versus a 64-bit machine.

Some applications need to deal with very large integers beyond 32-bit or 64-bit, and for this they can resort to extensions like gmp. However, dealing with these so-called “big integers” or “bigints” is rather clumsy. You must write all your code to deal with them specifically, and you must create objects for them rather than simply using numeric literals as you can for the built-in integer and float types.

Hence, this RFC proposes the addition of built-in bigint support to PHP. Now, you can do operations with integers of any size, so long as you have enough memory. While there are now two types internally (long and bigint), userland code will continue to see only “integers”, and the two types will be indistinguishable.

The advantages of doing this are numerous. Now integers will always be consistent across platforms, with programmers not needing to worry about the size of a long (32-bit, 64-bit or otherwise) on their platform. Operations, too, will always be consistent. This will help the portability of PHP code and mean less time wasted by programmers dealing with platform differences, strengthening PHP's cross-platform guarantees. Dealing with extremely large data sets becomes easier, as you no longer need to anticipate whether your IDs will exceed 32 or 64 bits. Integer overflow is largely relegated to being an issue for internals programmers, as userland code will never have to deal with it, and there is no risk of a loss of precision, as large integers will no longer become floats. All this combined is likely to make for more robust, less buggy applications. Finally, being able to deal with large integers “natively” makes PHP more attractive for web developers needing to do large integer math, such as applications dealing with currency, or perhaps statistical applications.

Proposal

New type

To complement the existing internal IS_LONG and IS_DOUBLE types, a new IS_BIGINT type is introduced. IS_BIGINT is a reference-counted, copy-on-write type which is not garbage collected, much like a string. Behind the scenes, a bigint library (LibTomMath by default, though GMP can also be used) implements it, but this is abstracted behind a new family of zend_bigint_* functions and the zend_bigint type, which is what allows the aforementioned choice of libraries. As stated in the Introduction, no new userland type is added to PHP, and instead “integer” now covers two internal types: IS_LONG and IS_BIGINT. There should be no visible difference to userland code between these types. Internally, a new “fake type” is also added, namely IS_BIGINT_OR_LONG. This is used by a few functions dealing with conversions and casts, and is now the “type” that (integer) will cast to.

Type specifiers for zend_parse_parameters that previously yielded a long will continue to do so. The type specifiers i, for a bigint or a long, and I, for a bigint, are added, along with the corresponding Z_PARAM_BIGINT_OR_LONG(_EX) and Z_PARAM_BIGINT(_EX) FAST_ZPP macros.

Changes to operators for the sake of consistency

In order to make integer arithmetic consistent between longs and bigints, certain changes to existing operator behaviour will be made:

  • Bitwise operators will now deal with integers of any size (i.e. both longs and bigints) instead of being bounded by the size of a long on a machine.
  • Left shifts will promote to bigints rather than overflowing. Similarly, right shifts can deal with bigints, so (1 << 67) >> 66 will result in 2.
  • The pow (**) operator will now error when too large an exponent is used with integer operands. This is because neither GMP nor LibTomMath can handle exponents beyond the size of an unsigned long. This restriction does not apply when either operand is a float.

Standard library changes

  • All math functions are updated to work with bigints.
  • array_sum and array_product are now implemented in the patch using add_function and mul_function, respectively. This means that they support bigints now, but also internal objects with operator overloading (currently only the GMP extension, to the best of my knowledge).

Examples

Currently, if an integer gets too large in PHP, it becomes a float, accuracy is lost, and operations start behaving differently. Take this code for example:

$x = PHP_INT_MAX - 1;
var_dump($x);
$x++;
var_dump($x);
$x++;
var_dump($x);
$x++;
var_dump($x);

Under PHP 5.5 on a 64-bit machine, it produces the following result:

int(9223372036854775806)
int(9223372036854775807)
float(9.2233720368548E+18)
float(9.2233720368548E+18)

The last six digits are lost, and incrementing suddenly does nothing!
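The stalled increment is a property of the C double that backs PHP's float, not of PHP itself. The following C sketch assumes IEEE 754 double precision and the default rounding mode; the helper name increment_is_lost is made up for illustration. It reports whether adding 1 to the double nearest a given integer changes anything.

```c
#include <stdint.h>

/*
 * Returns 1 if adding 1.0 to the double nearest to n yields the same double,
 * i.e. the increment is silently lost. Assumes IEEE 754 double precision.
 * (Hypothetical helper, not part of the patch.)
 */
static int increment_is_lost(int64_t n)
{
    double d = (double)n;            /* rounds to the nearest representable double */
    volatile double next = d + 1.0;  /* volatile forces the sum to be rounded to double width */
    return next == d;
}
```

Under those assumptions, increment_is_lost(INT64_MAX) is 1, because adjacent doubles at that magnitude are at least 1024 apart, whereas increment_is_lost(1000) is 0.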

However, the output would be different with this RFC:

int(9223372036854775806)
int(9223372036854775807)
int(9223372036854775808)
int(9223372036854775809)

No digits are lost, incrementing still works, and it's still an integer. Under the hood, it may technically be a different type (depending on the platform), but from the user's perspective, it's still an integer, and it functions exactly the same.

This means you can do arbitrarily large integer operations with full accuracy, so long as there is enough memory available. For example:

$ php -r 'var_dump(10 ** 100);'
int(10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000)
$ php -r 'var_dump((1 << 67) >> 63);'
int(16)
$ php -r 'var_dump(2 ** 3 ** 4);'
int(2417851639229258349412352)
$ php -r 'var_dump((10 ** 100) % 10);'
int(0)
$ php -r 'var_dump(123098209381029380128301298301298309812098213);'
int(123098209381029380128301298301298309812098213)

This works consistently across platforms. So, it is possible to handle 64-bit integers with full precision on a 32-bit machine with exactly the same code. Indeed, it does not matter how many bits are in the integer, so long as there is sufficient memory to store it. Every example above works on a 64-bit machine running OS X, and would function identically on a 32-bit Windows machine, a 64-bit Linux server, or any other platform.

Backward Incompatible Changes

As mentioned before, the shift left and shift right operators act differently, as does pow for very large exponents.

Longs will no longer overflow to float, but instead become bigints (which, so far as userland cares, are just integers). Code expecting large integer literals to be floats will now end up with bigints instead, which might cause problems. However, if a float is still desired, this can be fixed simply by appending .0.

Internals changes

Some internal APIs, mostly ones dealing with numbers, will necessarily change their signatures or behaviour:

  1. For example, is_numeric_string/_ex now takes a zend_bigint** parameter
  2. The cast_object object handler now has to deal with IS_BIGINT_OR_LONG and IS_BIGINT

Proposed PHP Version(s)

This is proposed for the next PHP X, currently PHP 7. The patch is based off of phpng, and my intention is for it to be merged into phpng.

RFC Impact

Performance

The performance penalties are minor for normal integer and float arithmetic. While left shifts and right shifts now require overflow checks, generally bigints will just take the place of floats in existing overflow checks so the performance impact is minimal.
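To illustrate the kind of overflow check involved, here is a minimal C sketch built on the GCC/clang checked arithmetic builtins. It is not the patch's code: the checked_result type and the checked_add and checked_mul helpers are hypothetical, and the needs_bigint flag merely stands in for the point at which the engine would promote to a bigint instead of a float.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical result type: a machine long, or a flag saying a bigint is needed. */
typedef struct {
    bool needs_bigint;
    long value;        /* meaningful only when needs_bigint is false */
} checked_result;

/* Adds two longs; reports overflow (e.g. LONG_MAX + 1) instead of wrapping. */
static checked_result checked_add(long a, long b)
{
    checked_result r = { false, 0 };
    if (__builtin_add_overflow(a, b, &r.value)) {
        r.needs_bigint = true;  /* a real engine would take the bigint path here */
        r.value = 0;
    }
    return r;
}

/* Multiplies two longs with the same overflow reporting. */
static checked_result checked_mul(long a, long b)
{
    checked_result r = { false, 0 };
    if (__builtin_mul_overflow(a, b, &r.value)) {
        r.needs_bigint = true;
        r.value = 0;
    }
    return r;
}
```

In this sketch the common case costs one checked operation and one branch, and only checked_add(LONG_MAX, 1) or similar takes the slow path, which is the same place where the existing code would have produced a float.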

Fatal errors

Unfortunately, bigints would introduce two new ways to cause fatal errors in PHP.

Firstly, if you do an operation resulting in an extremely large number, you might hit your request memory limit.

Secondly, when trying to calculate a value that would require more memory than size_t can describe, GMP prints an “overflow in mpz type” error to the command line and abort()s. Allowing this to happen and kill the PHP process would be very problematic, so instead, this patch introduces a workaround whereby we try to check ahead of time whether the operation would fall foul of the overflow error and, if so, throw an E_ERROR with the message “Result of integer operation would be too large to represent”. For LibTomMath we don't need to check this ourselves because the library has sensible error handling, but we still produce E_ERROR in this case.
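The exact check used in the patch is not reproduced here. As a rough illustration of what an ahead-of-time check can look like, the following C sketch bounds the size of an integer power before computing it, using the fact that base ** exp needs at most bit_length(base) * exp bits. Both helper names and the max_bits parameter are assumptions made for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Number of significant bits in v (0 for v == 0). Hypothetical helper. */
static unsigned bit_length(unsigned long v)
{
    unsigned bits = 0;
    while (v != 0) {
        bits++;
        v >>= 1;
    }
    return bits;
}

/*
 * Conservative pre-check: returns true if base ** exp might need more than
 * max_bits bits. The bound bit_length(base) * exp is an upper bound, so this
 * can reject a result that would just have fitted, but never accepts one
 * that is too large.
 */
static bool pow_would_be_too_large(unsigned long base, unsigned long exp,
                                   size_t max_bits)
{
    size_t needed;

    if (base <= 1 || exp == 0) {
        return false;  /* result is 0 or 1 */
    }
    if (__builtin_mul_overflow((size_t)bit_length(base), exp, &needed)) {
        return true;   /* the bound itself does not fit in size_t */
    }
    return needed > max_bits;
}
```

A caller would raise the E_ERROR described above when the check returns true, and only otherwise hand the operands to the bigint library.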

Licensing and dependency issues

I am currently porting this to use LibTomMath, a dual-licensed Public Domain/WTFPL arbitrary-length integer library written in C, which is available packaged for several platforms, and is battle-tested as it is used by Tcl. As it is available under both Public Domain and the WTFPL, with the latter an extremely liberal license, it doesn't pose any licensing issues. Its source is contained in the repo and built with the rest of Zend, which avoids an external dependency.

At compile-time, it is also possible to choose to use GMP, which is LGPLv3 licensed, but it is not the default.

Arrays

A problem arising from allowing integers to be arbitrarily large is that array keys using strings for numeric keys beyond the maximum size of a long would probably seem weird. At present, bigints are just dealt with as if they were numeric strings when using them as array keys and indices, but this may not be optimal. This RFC aims for integer consistency across platforms, and this would be a remaining inconsistency. It also doesn't make sense from a user perspective to have integers over a certain value suddenly become string keys, though whether this matters much in practice with PHP's type casting and juggling is a different question.

This also presents a further issue: inconsistency between longs, bigints and doubles, which must be avoided, as cross-platform integer consistency is a key goal of this RFC. Currently in PHP, doubles used as indexes are simply cast to longs, without any regard for size. This means that they overflow if they are larger than the platform's long size, either 32-bit or 64-bit. However, bigints, as implemented, will be treated as strings if they are outside the bounds of a long on the platform. While bigints are likely to break existing code anyway, this would be a particularly bad breakage, as code relying on very large numbers being floats and wrapping when used as indices would break. Hence some sort of solution must be found. Either we cast bigints to longs and let them overflow (not terribly desirable), we don't change the current behaviour (inconsistent), or we change the handling of doubles. Personally, I don't like what PHP does here and would go for this last option.

To SAPIs

This should have no impact on existing SAPIs.

To Existing Extensions

Any extensions which request numeric parameters as zvals rather than longs or doubles from zend_parse_parameters will need changes. Those dealing with numerical operations specifically will require deeper changes. Obviously, ext/standard will need some updating.

ext/gmp will be updated to handle bigints. However, due to behavioural and implementation differences between GMP objects and the bigint type, it won't just pass through to the built-in operator functions. With the addition of bigints, ext/gmp would quickly become irrelevant except for backwards-compatibility with existing applications, and might eventually be moved to PECL.

Extensions dealing with parts of the Zend API that deal with numbers will need to be modified to deal with changes in signatures and behaviour. (See “Internals changes” under “Backward Incompatible Changes”)

To Opcache

Both GMP and LibTomMath can only have one custom allocator, so I weighed the options and chose emalloc rather than malloc. I expect this would pose a problem for opcache, as any bigints would be destroyed at the end of a request, so opcache would need to store bigints persistently. Hence, some sort of import/export mechanism could be added to zend_bigint. It is obviously possible to use strings, but GMP also has its own serialisation format, which would be more efficient, so that might be a good approach.

I have not yet dealt with opcache implementation-wise, and I might need help when the time comes.

New Constants

None.

php.ini Defaults

No changes.

Open Issues

The patch is unfinished. Many tests are still broken, I haven't gotten round to updating the extensions, and it almost certainly does not work with opcache.

Open Questions

  • Should we rework array key handling? (See “Arrays” above)

TODO

Must be done

  • Finish LibTomMath port
    • TODOs
  • Deal with bigint string indices better. Currently we cast to long, but we should check for it being capped at LONG_MAX/_MIN and throw the “uninitialized index” error. Possibly a novel error (“string index too large”?)
    • Numeric string offset thing in zend_language_scanner.l
  • GMP backend needs the segfault fix ext/gmp has (custom allocator switching)
  • Test coverage:
    • Fix remaining broken tests on 64-bit and 32-bit
    • Write more tests for bigints, especially for areas that aren't covered just now
  • Better extension coverage.
    • Fully ported:
      • JSON - Can correctly encode and decode bigints
    • Partially ported:
      • standard
    • Compiles, not necessarily fully working:
      • core, ctype, curl, date, dom, ereg, fileinfo, gd, gettext, hash, iconv, intl, json, libxml, mbstring, mysql, mysqli, mysqlnd, pcre, pgsql, phar, reflection, session, shmop, simplexml, spl, sqlite3, standard, tidy, tokenizer, wddx, xml, xmlreader, xmlwriter, xsl
    • Need doing:
      • Basically everything, but in particular:
      • Important exts (json (wait until jsond?), session, PDO, etc.)
      • Make PHP at least build without --disable-all?
  • Opcache
    • Bigints are allocated in non-persistent memory, so we'll have to create some sort of persistent storage format

Optional, possibly future work

  • IS_BIGINT_OR_LONG should be renamed to _IS_BIGINT_OR_LONG for consistency with _IS_BOOL. That way, it's more obviously a fake type.
  • Optimisations:
    • We currently use clang and GCC 5.0 checked arithmetic builtins to implement faster overflow checks in fast_add_function, fast_sub_function and ZEND_SIGNED_MULTIPLY_LONG, unlike php-src master. For the sake of compilers that aren't GCC 5.0 or clang, some of the old inline assembly routines for this checking could be restored and updated for bigints.
    • That clz/bsr assembly TODO: We need to do a clz/bsr operation for bit shift overflow checking, and currently we do this with a double conversion and frexp. It would be more efficient to use assembly for this.
  • Other optimisations:
    • Possibly mark the zend_bigint_* functions as to be inlined and move them to the header
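As an illustration of the clz-based shift check mentioned in the optimisations above, here is a minimal C sketch. It uses the GCC/clang __builtin_clzl builtin rather than hand-written assembly or the frexp approach, and the function name shift_left_fits_long is hypothetical. It decides whether a left shift still fits in a long, which is the point at which the engine would otherwise need to promote to a bigint.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Returns true if value * 2**shift is representable as a signed long.
 * (Hypothetical helper; assumes two's complement longs.)
 */
static bool shift_left_fits_long(long value, unsigned shift)
{
    const unsigned long_bits = (unsigned)(sizeof(long) * CHAR_BIT);
    unsigned long magnitude;
    unsigned used_bits;

    if (value == 0) {
        return true;               /* zero shifted by anything is still zero */
    }
    if (shift >= long_bits) {
        return false;              /* every significant bit would be shifted out */
    }

    /* For negative values, count the bits of ~value, so -1 uses 0 bits. */
    magnitude = value < 0 ? ~(unsigned long)value : (unsigned long)value;

    /* __builtin_clzl is undefined for 0, so handle that case separately. */
    used_bits = magnitude == 0
        ? 0
        : long_bits - (unsigned)__builtin_clzl(magnitude);

    /* One bit is reserved for the sign. */
    return used_bits + shift <= long_bits - 1;
}
```

With a 64-bit long, shift_left_fits_long(1, 62) is true while shift_left_fits_long(1, 67) is false, so an expression such as 1 << 67 would take the bigint path.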

Unaffected PHP Functionality

As previously mentioned, the handling of array keys might need to be looked at. Otherwise, it shouldn't affect the behaviour of other PHP functionality, but obviously the implementations of anything dealing with integers may need to be changed.

Future Scope

None I can think of particularly.

Proposed Voting Choices

In some respects this is just an implementation detail, but as this would break backwards-compatibility for some apps and arguably changes the language, I think this requires a 2/3 majority. It would be a straight Yes/No vote.

Patches and Tests

A work-in-progress, unfinished pull request is here: https://github.com/php/php-src/pull/876

The branch itself is here: https://github.com/TazeTSchnitzel/php-src/tree/bigint

The LibTomMath backend (the default) is a work-in-progress. Use --enable-bigint-gmp to use the GMP backend.

It is based off the phpng branch. Many tests are still broken, and, as mentioned previously, I still need to deal with extensions and opcache. It is very much unfinished, but it does work to a degree.

See the TODO section in Open Issues (above) for unfinished areas.

Implementation

If/when this is implemented, this section would/will contain

  1. the version(s) it was merged to
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature

References

Inspiration

  • I was inspired in part by Python 2's bigint support with its separate “long” type (different from the machine-dependent “int” type), and how Python 3 unified these into the single “int” type - see http://legacy.python.org/dev/peps/pep-0237/
  • Some other languages also do it: Erlang, Haskell and Smalltalk

Discussion

General

Changelog

  • v0.1.7 - Minor changes, removed some outdated information
  • v0.1.6 - LibTomMath built as part of PHP
  • v0.1.5 - Switchable back-ends
  • v0.1.4 - LibTomMath migration from GMP
  • v0.1.3 - Examples
  • v0.1.2 - Int64 clarifications
  • v0.1.1 - Added stdlib changes
  • v0.1 - Not actually the first version, but I kept no changelog until now
rfc/bigint.1421464714.txt.gz · Last modified: 2017/09/22 13:28 (external edit)