PHP RFC: Numeric Literal Separator
- Date: 2019-05-15
- Author: Theodore Brown theodorejb@outlook.com, Bishop Bettini bishop@php.net
- Based on previous RFC by: Thomas Punt tpunt@php.net
- Status: Implemented (in PHP 7.4)
- Discussion: https://externals.io/message/105714
- Target version: PHP 7.4
- Implementation: https://github.com/php/php-src/pull/4165
Introduction
The human eye is not optimized for quickly parsing long sequences of digits. Thus, a lack of visual separators makes it take longer to read and debug code, and can lead to unintended mistakes.
1000000000; // Is this a billion? 100 million? 10 billion? 107925284.88; // What scale or power of 10 is this?
Additionally, without a visual separator numeric literals fail to convey any additional information, such as whether a financial quantity is stored in cents:
$discount = 13500; // Is this 13,500? Or 135, because it's in cents?
Proposal
Enable improved code readability by supporting an underscore in numeric literals to visually separate groups of digits.
$threshold = 1_000_000_000; // a billion! $testValue = 107_925_284.88; // scale is hundreds of millions $discount = 135_00; // $135, stored as cents
Underscore separators can be used in all numeric literal notations supported by PHP:
6.674_083e-11; // float 299_792_458; // decimal 0xCAFE_F00D; // hexadecimal 0b0101_1111; // binary 0137_041; // octal
Restrictions
The only restriction is that each underscore in a numeric literal must be directly between two digits. This rule means that none of the following usages are valid numeric literals:
_100; // already a valid constant name // these all produce "Parse error: syntax error": 100_; // trailing 1__1; // next to underscore 1_.0; 1._0; // next to decimal point 0x_123; // next to x 0b_101; // next to b 1_e2; 1e_2; // next to e
Unaffected PHP Functionality
Adding an underscore between digits in a numeric literal will not change its value. The underscores are stripped out during the lexing stage, so the runtime is not affected.
var_dump(1_000_000); // int(1000000)
This RFC does not change the behavior of string to number conversion. Numeric separators are intended to improve code readability, not alter how input is processed.
Backward Incompatible Changes
None.
Discussion
Use cases
Digit separators make possible the cognitive process of subitizing. That is, accurately and confidently “telling at a glance” the number of digits, rather than having to count them. This measurably lessens the time to correctly read numbers longer than four digits.
Large numeric literals are commonly used for business logic constants, unit test values, and performing data conversions. For example:
Composer's retry delay when removing a file:
usleep(350000); // without separator usleep(350_000); // with separator
Conversion of an Active Directory timestamp (the number of 100-nanosecond intervals since January 1, 1601) to a Unix timestamp:
$time = (int) ($adTime / 10000000 - 11644473600); // without separator $time = (int) ($adTime / 10_000_000 - 11_644_473_600); // with separator
Working with scientific constants:
const ASTRONOMICAL_UNIT = 149597870700; // without separator const ASTRONOMICAL_UNIT = 149_597_870_700; // with separator
Separating bytes in a binary or hex literal:
0b01010100011010000110010101101111; // without separator 0b01010100_01101000_01100101_01101111; // with separator 0x42726F776E; // without separator 0x42_72_6F_77_6E; // with separator
Use cases to avoid
It may be tempting to use integers for storing data such as phone, credit card, and social security numbers since these values appear numeric. However, this is almost always a bad idea, since such numbers often have prefixes and leading digits that are significant.
A good rule of thumb is that if it doesn't make sense to use mathematical operators on a value (e.g. adding it, multiplying it, etc.), then an integer probably isn't the best way to store it.
// don't do this: $phoneNumber = 345_6789; $creditCard = 231_6547_9081_2543; $socialSecurity = 111_11_1111;
Will it be harder to search for numbers?
A concern that has been raised is whether numeric literal separators will make it more difficult to search for numbers, since the same value can be written in more than one way.
This is already possible, however. The same number can be written in binary, octal, decimal, hexadecimal, or exponential notation. In practice, this isn't problematic as long as a codebase is consistent.
Furthermore, separators can sometimes make it easier to find numbers. To use an earlier example, 13_500 and 135_00 could be differentiated in a find/replace. Another example would be separated bytes in a hex literal, which allows searching for a value like “_6F_” to find only the numbers containing that specific byte.
Should it be the role of an IDE to group digits?
It has been suggested that numeric literal separators aren't needed for better readability, since IDEs could be updated to automatically display large numbers in groups of three digits.
However, it isn't always desirable to group numbers the same way.
For example, a programmer may write 10050000
differently
depending on whether or not it represents a financial quantity stored
as cents:
$total = 100_500_00; // represents $100,500.00 stored as cents $total = 10_050_000; // represents $10,050,000
Binary and hex literals may also be grouped by a varying number of digits to reflect how they are used (e.g. bits may be separated into nibbles, bytes, or words). An IDE cannot do this automatically without knowing the programmer's intent for each numeric literal.
Why resurrect this proposal?
The previous RFC was originally voted on over three years ago (January 2016). While a majority of voters supported it, it did not reach the required 2/3 threshold for acceptance.
Based on reading the discussion at the time, it didn't receive enough positive votes because there weren't many good use cases put forward for it. Also, the RFC had a short voting period of only 1 week.
Since that time, the ability to use underscores in numeric literals has been implemented in additional popular languages (e.g. Python, JavaScript, and TypeScript), and a stronger case can be made for the feature than was made before.
Should I vote for this feature?
Andrea Faulds summarized the considerations as follows:
This feature offers some benefit in some cases. It doesn't introduce much new complexity. There's no new syntax or tokens, it just modifies the form of the existing number tokens. It fits in well [with] what's already there, consistently applying to all number literals. It follows established convention in other languages. Its appearance at least hints that values with these separators are not constants or identifiers, but numbers, reducing potential for confusion. It limits its own application to prevent abuse (no leading, trailing, or repeated separators). And it's relatively intuitive.
Comparison to other languages
Numeric literal separators are widely supported in other programming languages.
- Ada: single, between digits 1
- C#: multiple, between digits 2
- C++: single, between digits (single quote used as separator) 3
- Java: multiple, between digits 4
- JavaScript and TypeScript: single, between digits 5
- Julia: single, between digits 6
- Kotlin: multiple, between digits 7
- Perl: single, between digits 8
- Python: single, between digits 9
- Ruby: single, between digits 10
- Rust: multiple, anywhere 11
- Swift: multiple, between digits 12
Vote
Voting started 2019-05-30 and ended 2019-06-13.
References
Request to revive RFC: https://externals.io/message/105450
Discussion from previous RFC: https://externals.io/message/89925, https://externals.io/message/90626, https://marc.info/?l=php-internals&m=145320709922246&w=2.
Blog post about original implementation: https://phpinternals.net/articles/implementing_a_digit_separator.