Despite the wide and increasing adoption of Unicode (and UTF-8 in particular) in PHP applications, PHP does not yet have a Unicode codepoint escape syntax in string literals, unlike many other languages. This is unfortunate, as in many cases it can be useful to specify Unicode codepoints by number, rather than using the codepoint directly. For example, say you wish to output the UTF-8 encoded Unicode codepoint U+202E RIGHT-TO-LEFT OVERRIDE in order to display text right-to-left. You could embed it in source code directly, but it is an invisible character and would display the rest of the line of code (or indeed entire program) in reverse!
The solution is to add a Unicode codepoint escape sequence syntax to string literals. This would mean you could produce U+202E like so:
echo "\u{202E}Reversed text"; // outputs U+202E followed by "Reversed text", which displays right-to-left
Another use is to distinguish between visually similar or identical, yet differently encoded, Unicode characters, if you need to output one or the other specifically. The following two lines of code actually have slightly different output, but you couldn't tell by looking at them:
echo "mañana";
echo "mañana";
However, by using an escape sequence to produce the ñ, it becomes clearer:
echo "ma\u{00F1}ana"; // pre-composed character
echo "man\u{0303}ana"; // "n" with combining ~ character (U+0303)
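The difference between the two spellings can be made visible by inspecting the bytes. This is a minimal sketch, assuming PHP 7.0 or later (where the proposed escape is available); the variable names are illustrative only.

```php
<?php
// Assumes PHP >= 7.0, where "\u{...}" is recognised in double-quoted strings.
$precomposed = "ma\u{00F1}ana";  // U+00F1 LATIN SMALL LETTER N WITH TILDE
$combining   = "man\u{0303}ana"; // "n" followed by U+0303 COMBINING TILDE

var_dump($precomposed === $combining); // bool(false): different byte sequences
echo strlen($precomposed), "\n";       // 7 (bytes, not characters)
echo strlen($combining), "\n";         // 8
echo bin2hex($precomposed), "\n";      // 6d61c3b1616e61
echo bin2hex($combining), "\n";        // 6d616ecc83616e61
```

Both strings render the same on screen, but a byte-wise comparison treats them as different unless they are normalized first.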
A further use is to produce characters you can't type on your keyboard. If you are unable to type the emoji for FACE WITH TEARS OF JOY, you can use its escape sequence instead:
echo "\u{1F602}"; // outputs 😂
A new escape sequence is added for double-quoted strings and heredocs, with the following syntax:
dq-unicode-escape-sequence::
    \u{ codepoint-digits }
codepoint-digits::
    hexadecimal-digit
    hexadecimal-digit codepoint-digits
It produces the UTF-8 encoding of a Unicode codepoint, specified with hexadecimal digits. If the codepoint is outside the maximum range permissible (beyond U+10FFFF), an error is thrown.
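To illustrate what "produces the UTF-8 encoding" means in practice, the sketch below prints the byte length and hex bytes for a few codepoints. It assumes PHP 7.0 or later; the `$samples` array and its labels are made up for this example.

```php
<?php
// Assumes PHP >= 7.0. Each escape is expanded to the UTF-8 bytes of its codepoint.
$samples = [
    'U+0041'   => "\u{41}",      // 1 byte:  41
    'U+00F1'   => "\u{F1}",      // 2 bytes: c3b1
    'U+202E'   => "\u{202E}",    // 3 bytes: e280ae
    'U+1F602'  => "\u{1F602}",   // 4 bytes: f09f9882
    'U+10FFFF' => "\u{10FFFF}",  // 4 bytes: f48fbfbf (highest valid codepoint)
];

foreach ($samples as $label => $string) {
    printf("%-8s %d byte(s): %s\n", $label, strlen($string), bin2hex($string));
}

// "\u{110000}" is deliberately not included: it is beyond U+10FFFF and
// would make this file fail to compile.
```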
In most languages with a Unicode codepoint escape syntax, it follows the format \uXXXX, where XXXX is four hexadecimal digits. That raises the question of why I didn't decide to follow other languages here.
The first reason is that only allowing four hexadecimal digits restricts the syntax to only representing codepoints in the Basic Multilingual Plane (U+0000 to U+FFFF). However, Unicode has supported codepoints beyond 16 bits (and hence outside the BMP) since UTF-16 in 1996, 18 years ago, and many useful characters are outside of the BMP, so it would be unreasonable to restrict programmers to only using BMP codepoints. We could instead require six hexadecimal digits (which covers the entirety of Unicode), but this would cause bugs, as programmers used to other languages would only expect four to be supported, and expect "\u100000" to produce U+1000 followed by "00", not U+100000. We could also make it variable-length, but this would cause the same problems.
The second reason is that I think non-clearly-delineated escape sequences are harmful for readability and likely to cause bugs. It would not be a stretch for a programmer to expect the octal escape sequence "\10000" to produce a single character with octal value 10000; however, it would actually produce "\100" . "00". Plus, these sequences, if they are variable-length, force programmers to insert awkward breaks in the middle of strings when these escape sequences precede literal numbers or other characters that would be interpreted as part of the escape sequence, but are not intended to be.
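The behaviour of PHP's existing undelimited escapes can be checked directly. The following sketch relies on the fact that an octal escape consumes at most three digits and a \x escape at most two hexadecimal digits; the variable name is illustrative only.

```php
<?php
// An octal escape takes at most three digits, so the trailing "00" is literal text.
var_dump("\10000" === "\100" . "00"); // bool(true)
var_dump("\10000" === "@00");         // bool(true): octal 100 is "@"

// A \x escape takes at most two hex digits, so "42" is literal text here.
var_dump("\x4142" === "A42");         // bool(true)

// The "awkward break": a tab (octal 11) followed by a literal "5"
// cannot be written as "\115", because that is octal 115, i.e. "M".
var_dump("\115" === "M");             // bool(true)
$tabThenFive = "\11" . "5";
var_dump($tabThenFive === "\t5");     // bool(true)
```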
Finally, the \uXXXX syntax is fixed-length and therefore requires leading zeroes to be used for codepoints, which makes some sequences longer than they need to be.
For all these reasons, the \u{xxxxxx} syntax is proposed instead. It can easily represent any valid Unicode character, e.g. "\u{20}", "\u{FF}", "\u{202e}" or "\u{10F602}". It has a clearly delimited start and end, which avoids ambiguity (compare "\u001000" and "\u{10}00") and accidental misinterpretation. Finally, it doesn't require leading zeros; they are entirely optional, so the programmer can write "\u{00FF}" or "\u{FF}" as they see fit.
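These properties of the delimited form can be verified with a few comparisons. This is a small sketch, assuming PHP 7.0 or later; the variable name is illustrative only.

```php
<?php
// Assumes PHP >= 7.0.
// Leading zeros are optional.
var_dump("\u{00FF}" === "\u{FF}");      // bool(true)

// Hex digits may be written in either case.
var_dump("\u{202e}" === "\u{202E}");    // bool(true)

// The closing brace ends the escape, so the following "00" is plain text.
$delimited = "\u{10}00";
var_dump($delimited === "\x10" . "00"); // bool(true): U+0010 is the single byte 0x10
var_dump(strlen($delimited));           // int(3)
```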
ECMAScript 6 will have an identical \u{xxxxxx} syntax to that which is proposed.
Ruby also supports this syntax, but it allows multiple codepoints in one escape, e.g. \u{20AC A3 A5}, which is not proposed in this RFC.
(See References below)
The production of UTF-8 might be controversial, given PHP's strings don't have any specific encoding. However, UTF-8 is now the de facto standard encoding for PHP, with most standard library functions assuming this is used unless told otherwise, and UTF-8 is also now the effective standard encoding of the web. It is, furthermore, highly unlikely that this will change any time soon. I do not expect it will cause problems with other Unicode representations, as UTF-16 and UTF-32 are very rarely used in modern web applications, and this is getting even rarer. Finally, it is worth remembering that applications which aren't using UTF-8 would not be forced to use this.
Double-quoted strings and heredocs that contain sequences beginning with \u will now be interpreted differently, and if what follows does not form a valid Unicode escape sequence, PHP will throw a fatal compile error.
This change would take place in a major version, so some level of backwards-compatibility breakage would be justified. In cases where it caused problems with existing code, fixing it could be done quite trivially by either switching to single-quoted strings, or escaping the backslash.
In order to reduce backwards-compatibility issues, particularly with JSON in string literals, \u which is not followed by an opening { will pass through verbatim (instead of being interpreted as an escape sequence) and not raise an error. This means that existing code like json_decode("\"\u202e\""); will continue to work properly. On the other hand, "\u{foobar" will raise an error.
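The pass-through rule can be observed with the JSON case mentioned above. This is a minimal sketch, assuming PHP 7.0 or later with the standard json extension; the variable names are illustrative only.

```php
<?php
// "\u" without a following "{" is left untouched by PHP,
// so the JSON parser still sees the six characters \u202e.
$json = "\"\u202e\"";
echo strlen($json), "\n";           // 8: two quotes plus \u202e

$decoded = json_decode($json);      // JSON's own \uXXXX escape is decoded here
echo bin2hex($decoded), "\n";       // e280ae
var_dump($decoded === "\u{202E}");  // bool(true)

// A malformed sequence such as "\u{foobar" is not shown as live code,
// because it would stop this file from compiling.
```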
This is proposed for the next major version of PHP, which would be PHP 7 at the time of writing.
Single-quoted strings and nowdocs are unaffected. This produces a UTF-8 encoding of the codepoint as bytes, but it does not change the fact that PHP's strings are byte-strings with no specific encoding.
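The following sketch contrasts the four string syntaxes and shows that the result is still an ordinary byte string. It assumes PHP 7.0 or later; the variable names are illustrative only.

```php
<?php
// Assumes PHP >= 7.0.
$single = '\u{1F602}';   // not interpreted: nine literal characters
$double = "\u{1F602}";   // interpreted: four UTF-8 bytes
$heredoc = <<<EOT
\u{1F602}
EOT;
$nowdoc = <<<'EOT'
\u{1F602}
EOT;

echo strlen($single), "\n";           // 9
echo strlen($double), "\n";           // 4
var_dump($heredoc === $double);       // bool(true)
var_dump($nowdoc === $single);        // bool(true)

// Indexing still works on bytes, not on codepoints.
var_dump($double[0] === "\xF0");      // bool(true)
```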
Alain Williams suggested on the mailing list that we could add a named literal syntax (i.e. something like \U{arabic letter alef}), like Perl's \N.
As this is a language change, a 2/3 majority would be required.
Voting started on 2014-12-08 and ended on 2014-12-18.
A working pull request containing a patch with tests is here: https://github.com/php/php-src/pull/918
A language specification pull request with a patch and tests can be found here: https://github.com/php/php-langspec/pull/92
Provisional HHVM implementation: https://reviews.facebook.net/D30153
No manual entry yet.
\u{xxxxxx} syntax in Ruby: http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/
The name of this RFC ought to have been "unicode codepoint escape sequence", not "unicode codepoint escape syntax".
\u without a following opening { passes through verbatim