rfc:unicode_escape

rfc:unicode_escape [2014/11/24 22:02] – Fixed syntax ajf
rfc:unicode_escape [2017/09/22 13:28] (current) – external edit 127.0.0.1
Line 1:
 ====== PHP RFC: Unicode Codepoint Escape Syntax ======
-  * Version: 0.1
+  * Version: 0.1.3
-  * Date: 2014-01-24
+  * Date: 2014-11-24, Last Updated 2014-12-08
   * Author: Andrea Faulds, ajf@ajf.me
-  * Status: Under Discussion
+  * Status: Implemented (PHP 7.0)
   * First Published at: http://wiki.php.net/rfc/unicode_escape
  
Line 27:
 <code php>
 echo "ma\u{00F1}ana"; // pre-composed character
-echo "man\u{006E}ana"; // "n" with combining ~ character U+006E
+echo "man\u{0303}ana"; // "n" with combining ~ character (U+0303)
 </code>
  
Line 47:
      hexadecimal-digit   codepoint-digits
  
-It produces the UTF-8 encoding of a Unicode codepoint, specified with hexadecimal digits.
+It produces the UTF-8 encoding of a Unicode codepoint, specified with hexadecimal digits. If the codepoint is outside the maximum range permissible (beyond U+10FFFF), an error is thrown.
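
As an illustration of the sentence above, here is a minimal sketch that compares the escape against hand-written UTF-8 bytes. It assumes PHP 7.0 or later. The variable names are hypothetical, and the use of ''eval()'' with ''ParseError'' to observe the out-of-range error at run time is an assumption of this example, not something the RFC prescribes.

```php
<?php
// "\u{...}" inside a double-quoted string expands to the UTF-8 bytes of the codepoint.
$precomposed = "\u{00F1}";    // U+00F1, pre-composed "n with tilde"
$combining   = "n\u{0303}";   // "n" followed by U+0303, the combining tilde
$fourBytes   = "\u{10F602}";  // a codepoint above U+FFFF needs four UTF-8 bytes

var_dump($precomposed === "\xC3\xB1");   // bool(true)
var_dump($combining === "n\xCC\x83");    // bool(true)
var_dump(strlen($fourBytes));            // int(4)

// A codepoint beyond U+10FFFF is rejected when the string literal is compiled.
// The literal is wrapped in eval() so that the rest of this file still parses.
try {
    eval('return "\u{110000}";');
    $rejected = false;
} catch (ParseError $e) {
    $rejected = true;
}
var_dump($rejected);                     // bool(true)
```

The single-quoted argument to ''eval()'' keeps the escape unexpanded until the inner code is compiled, which is where the error is raised.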
  
 ==== Syntax Rationale ====
Line 61:
 For all these reasons, the ''\u{xxxxxx}'' syntax is proposed instead. It can easily represent any valid Unicode character, e.g. ''"\u{20}"'', ''"\u{FF}"'', ''"\u{202e}"'' or ''"\u{10F602}"''. It has a clearly delimited start and end, which avoids ambiguity (compare ''"\u001000"'' and ''"\u{10}00"'') and accidental misinterpretation. Finally, it doesn't require leading zeros, they are entirely optional, so the programmer can write ''"\u{00FF}"'' or ''"\u{FF}"'' as they see fit.
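
The delimiting and leading-zero points in the paragraph above can be checked with a short sketch, assuming PHP 7.0 or later. The variable names are hypothetical and the byte values are written out by hand for comparison.

```php
<?php
// Leading zeros are optional: both literals name U+00FF.
$withZeros    = "\u{00FF}";
$withoutZeros = "\u{FF}";
var_dump($withZeros === $withoutZeros);   // bool(true)
var_dump($withZeros === "\xC3\xBF");      // bool(true), the UTF-8 form of U+00FF

// The closing brace ends the escape, so the trailing "00" is ordinary text.
$delimited = "\u{10}00";
var_dump($delimited === "\x10" . "00");   // bool(true)
var_dump(strlen($delimited));             // int(3)

// Hexadecimal digits may be written in either case.
var_dump("\u{202e}" === "\u{202E}");      // bool(true)
```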
  
-As it happens, ECMAScript 6 will also have this syntax (see References below), in order to allow specifying non-BMP codepoints. This is actually mere coincidence, I came up with this syntax before learning ES 6 would support this.
+=== Prior Art ===
+
+ECMAScript 6 will have an identical ''\u{xxxxxx}'' syntax to that which is proposed.
+
+Ruby supports this syntax also, however it allows for multiple codepoints, e.g. ''\u{20AC A3 A5}'', which is not proposed in this RFC.
+
+(See References below)
  
 ==== Encoding Rationale ====
Line 72: Line 78:
 
 This change would take place in a major version, so some level of backwards-compatibility breakage would be justified. In cases where it caused problems with existing code, fixing it could be done quite trivially by either switching to single-quoted strings, or escaping the backslash.
 +
 +In order to reduce backwards-compatibility issues, particularly with JSON in string literals, ''\u'' which is not followed by an opening ''{'' will pass through verbatim (instead of being interpreted as an escape sequence) and not raise an error. This means that existing code like ''json_decode("\"\u202e\"");'' will continue to work properly. On the other hand, ''"\u{foobar"'' will raise an error.
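
The pass-through rule described above can be sketched as follows, assuming PHP 7.0 or later with the JSON extension enabled. The variable names are hypothetical, and wrapping the malformed literal in ''eval()'' to catch a ''ParseError'' is an assumption of this example.

```php
<?php
// "\u" is not followed by "{", so PHP leaves the backslash and the "u" untouched.
$json = "\"\u202e\"";
var_dump(strlen($json));                 // int(8): quote, backslash, u, 2, 0, 2, e, quote

// json_decode() then interprets the JSON escape itself and returns U+202E as UTF-8.
$decoded = json_decode($json);
var_dump($decoded === "\u{202E}");       // bool(true)
var_dump($decoded === "\xE2\x80\xAE");   // bool(true)

// An opening brace without a valid, closed hexadecimal sequence is an error.
// The literal is wrapped in eval() so that the rest of this file still parses.
try {
    eval('return "\u{foobar";');
    $malformedRejected = false;
} catch (ParseError $e) {
    $malformedRejected = true;
}
var_dump($malformedRejected);            // bool(true)
```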
  
 ===== Proposed PHP Version(s) =====
Line 83: Line 91:
 ===== Future Scope =====
  
-None foreseeable.
+Alain Williams suggested on the mailing list that we could add a named literal syntax (i.e. something like ''\U{arabic letter alef}''), like [[http://perldoc.perl.org/perlreref.html#ESCAPE-SEQUENCES|Perl's \N]].
 
-===== Proposed Voting Choices =====
+===== Vote =====
 
 As this is a language change, a 2/3 majority would be required.
 +
 +Voting started on 2014-12-08 and ended on 2014-12-18.
 +
 +<doodle title="Accept the Unicode Codepoint Escape Syntax RFC and merge into master?" auth="ajf" voteType="single" closed="true">
 +   * Yes
 +   * No
 +</doodle>
  
 ===== Patches and Tests =====
Line 94: Line 109:
 
 A language specification pull request with a patch and tests can be found here: https://github.com/php/php-langspec/pull/92
 +
 +Provisional HHVM implementation: https://reviews.facebook.net/D30153
  
 ===== Implementation =====
-After the project is implemented, this section should contain
-  - the version(s) it was merged to
-  - a link to the git commit(s)
-  - a link to the PHP manual entry for the feature
+
+  * php-src merge: https://github.com/php/php-src/commit/bae46f307c2d0cdef9b8f5426adcc46920776700 (will go into PHP 7)
+  * HHVM merge: https://github.com/facebook/hhvm/commit/b2df7016e63ddcf328dc5bcfdf18760bba8549ec
+
+No manual entry yet.
  
 ===== References =====
 
-  * ECMAScript 6 will have the same ''\u{xxxxxx}'' syntax: https://mathiasbynens.be/notes/javascript-unicode
+  * Ruby supports the same ''\u{xxxxxx}'' syntax: http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/
+  * ECMAScript 6 will also have this syntax: https://mathiasbynens.be/notes/javascript-unicode
 
 ===== Rejected Features =====
 
 Keep this updated with features that were discussed on the mail lists.
 +
 +===== Errata =====
 +
 +The name of this RFC [[https://blog.ajf.me/2015-12-07-poorly-named-rfcs|ought to have been "unicode codepoint escape sequence", not "unicode codepoint escape syntax"]].
 +
 +===== Changelog =====
 +
 +  * (2016-03-13) Added Errata
 +  * v0.1.3 - ''\u'' without a following opening ''{'' passes through verbatim
 +  * v0.1.2 - Ruby support
 +  * v0.1.1 - Added Future Scope note on named literals
 +  * v0.1 - Initial version
rfc/unicode_escape.1416866524.txt.gz · Last modified: 2017/09/22 13:28 (external edit)