rfc:unicode_escape

Differences

This shows you the differences between two versions of the page.

rfc:unicode_escape [2014/11/25 02:36] – Reference provision HHVM implementation pollita
rfc:unicode_escape [2016/03/13 02:05] – Add Errata section ajf
Line 1: Line 1:
 ====== PHP RFC: Unicode Codepoint Escape Syntax ======
-  * Version: 0.1.1
+  * Version: 0.1.3
-  * Date: 2014-01-24
+  * Date: 2014-11-24, Last Updated 2014-12-08
   * Author: Andrea Faulds, ajf@ajf.me
-  * Status: Under Discussion
+  * Status: Implemented (PHP 7.0)
   * First Published at: http://wiki.php.net/rfc/unicode_escape
  
Line 61: Line 61:
 For all these reasons, the ''\u{xxxxxx}'' syntax is proposed instead. It can easily represent any valid Unicode character, e.g. ''"\u{20}"'', ''"\u{FF}"'', ''"\u{202e}"'' or ''"\u{10F602}"''. It has a clearly delimited start and end, which avoids ambiguity (compare ''"\u001000"'' and ''"\u{10}00"'') and accidental misinterpretation. Finally, it doesn't require leading zeros, they are entirely optional, so the programmer can write ''"\u{00FF}"'' or ''"\u{FF}"'' as they see fit.
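The properties claimed for ''\u{xxxxxx}'' above can be checked with a minimal sketch. It assumes PHP 7.0 or later (the version this RFC was merged into) and UTF-8 output for the escape; the variable names are illustrative and do not come from the RFC.

```php
<?php
// Sketch only: needs PHP 7.0+, where "\u{...}" is recognised in double-quoted strings.
// Variable names are illustrative, not from the RFC.

$short  = "\u{FF}";    // U+00FF without leading zeros
$padded = "\u{00FF}";  // the same codepoint with leading zeros

var_dump($short === $padded);   // leading zeros are optional
var_dump(bin2hex($short));      // the UTF-8 bytes of U+00FF: c3bf

// The braces mark where the escape ends: U+0010 followed by the text "00".
$delimited = "\u{10}00";
var_dump($delimited === "\x10" . "00");
var_dump(strlen($delimited));   // 3 bytes
```

Because the closing brace ends the escape, the trailing ''00'' is ordinary text and cannot be read as part of the codepoint.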
  
-As it happens, ECMAScript 6 will also have this syntax (see References below), in order to allow specifying non-BMP codepoints. This is actually mere coincidence, I came up with this syntax before learning ES 6 would support this.
+=== Prior Art ===
+
+ECMAScript 6 will have an identical ''\u{xxxxxx}'' syntax to that which is proposed.
+
+Ruby supports this syntax also, however it allows for multiple codepoints, e.g. ''\u{20AC A3 A5}'', which is not proposed in this RFC.
+
+(See References below)
  
 ==== Encoding Rationale ====
Line 72: Line 78:
  
 This change would take place in a major version, so some level of backwards-compatibility breakage would be justified. In cases where it caused problems with existing code, fixing it could be done quite trivially by either switching to single-quoted strings, or escaping the backslash.
+
+In order to reduce backwards-compatibility issues, particularly with JSON in string literals, ''\u'' which is not followed by an opening ''{'' will pass through verbatim (instead of being interpreted as an escape sequence) and not raise an error. This means that existing code like ''json_decode("\"\u202e\"");'' will continue to work properly. On the other hand, ''"\u{foobar"'' will raise an error.
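The pass-through rule can be illustrated with a short sketch built on the RFC's own ''json_decode'' example. It assumes PHP 7.0 or later; the variable names are illustrative. The malformed ''"\u{foobar"'' case is deliberately left out, since per the text above it raises an error and would stop the file from running.

```php
<?php
// Sketch only: needs PHP 7.0+. Variable names are illustrative, not from the RFC.

// "\u" is not followed by "{", so PHP leaves the backslash and the "u" in place.
$json = "\"\u202e\"";

// The PHP string therefore still holds the JSON escape: "\u202e" (8 bytes).
var_dump($json === '"\u202e"');
var_dump(strlen($json));

// json_decode() is what turns the JSON escape into the character U+202E.
$decoded = json_decode($json);
var_dump($decoded === "\u{202e}");
var_dump(bin2hex($decoded));   // UTF-8 bytes of U+202E: e280ae
```

Existing code that embeds JSON escapes in double-quoted strings keeps its meaning, because the decoding is still done by ''json_decode'' rather than by the PHP lexer.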
  
 ===== Proposed PHP Version(s) =====
Line 85: Line 93:
 Alain Williams suggested on the mailing list that we could add a named literal syntax (i.e. something like ''\U{arabic letter alef}''), like [[http://perldoc.perl.org/perlreref.html#ESCAPE-SEQUENCES|Perl's \N]].
  
-===== Proposed Voting Choices =====
+===== Vote =====
 
 As this is a language change, a 2/3 majority would be required.
+
+Voting started on 2014-12-08 and ended on 2014-12-18.
+
+<doodle title="Accept the Unicode Codepoint Escape Syntax RFC and merge into master?" auth="ajf" voteType="single" closed="true">
+   * Yes
+   * No
+</doodle>
  
 ===== Patches and Tests =====
Line 95: Line 110:
 A language specification pull request with a patch and tests can be found here: https://github.com/php/php-langspec/pull/92
  
-Provisional HHVM implementation (includes name support): https://github.com/sgolemon/hhvm/compare/unicode-escape
+Provisional HHVM implementation: https://reviews.facebook.net/D30153
  
 ===== Implementation ===== ===== Implementation =====
-After the project is implemented, this section should contain
-  - the version(s) it was merged to
-  - a link to the git commit(s)
-  - a link to the PHP manual entry for the feature
+
+  * php-src merge: https://github.com/php/php-src/commit/bae46f307c2d0cdef9b8f5426adcc46920776700 (will go into PHP 7)
+  * HHVM merge: https://github.com/facebook/hhvm/commit/b2df7016e63ddcf328dc5bcfdf18760bba8549ec
+
+No manual entry yet.
  
 ===== References ===== ===== References =====
  
-  * ECMAScript 6 will have the same ''\u{xxxxxx}'' syntax: https://mathiasbynens.be/notes/javascript-unicode
+  * Ruby supports the same ''\u{xxxxxx}'' syntax: http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/
+  * ECMAScript 6 will also have this syntax: https://mathiasbynens.be/notes/javascript-unicode
  
 ===== Rejected Features ===== ===== Rejected Features =====
  
 Keep this updated with features that were discussed on the mail lists. Keep this updated with features that were discussed on the mail lists.
+
+===== Errata =====
+
+The name of this RFC [[https://blog.ajf.me/2015-12-07-poorly-named-rfcs|ought to have been "unicode codepoint escape sequence", not "unicode codepoint escape syntax"]].
  
 ===== Changelog ===== ===== Changelog =====
  
-* v0.1.1 - Added Future Scope note on named literals
-* v0.1 - Initial version
+  * (2016-03-13) Added Errata
+  * v0.1.3 - ''\u'' without a following opening ''{'' passes through verbatim
+  * v0.1.2 - Ruby support
+  * v0.1.1 - Added Future Scope note on named literals
+  * v0.1 - Initial version
rfc/unicode_escape.txt · Last modified: 2017/09/22 13:28 by 127.0.0.1