rfc:intl.char

PHP RFC: IntlChar class

Introduction

ICU exposes a great deal of i18n/l10n functionality beyond what is currently exposed by PHP. This RFC seeks to expose just a little bit more...

Proposal

Expose additional ICU functionality from uchar.h as IntlChar::*() following the ICU API as much as possible.

See hphp/runtime/ext/icu/ext_icu_uchar.php in https://reviews.facebook.net/D30573 for a full breakdown of the functions (complete with docblocks). Constants can be found in either PR's *-enum.h files.

Proposed PHP Version(s)

PHP 7 (or 5.next if there is one)

New Constants

Enumerations of UProperty, UCharNameChoice, UPropertyNameChoice, UCharDirection, UBlockCode, etc... For example:

class IntlChar {
  const PROPERTY_ALPHABETIC = _UCHAR_ALPHABETIC_;
  const PROPERTY_ASCII_HEX_DIGIT = _UCHAR_ASCII_HEX_DIGIT_;
  /* etc... */
}

New Static Methods

Mapping of ICU API to PHP. For example:

class IntlChar {
  static public function hasBinaryProperty(int $codepoint, int $property): bool;
  static public function isAlphabetic(int $codepoint): bool;
  /* etc... */
}

Note that properties taking a codepoint will accept either an integer codepoint value (e.g. 0x2603 for U+2603 SNOWMAN), or the character encoded as UTF-8 (e.g. “\xE2\x98\x83”). For methods which return a codepoint, they will return int unless they accepted a codepoint as a utf-8 string, in which case they remain utf-8.

Notes

I also added IntlChar::chr() and IntlChar::ord() which aren't directly part of the API, but they made sense as wrappers for the U8_*() family of macros.

Some methods take a range in the form ($start, $limit) which the range is INclusive of $start, and EXclusive of $limit. i.e. (0x20, 0x30) => 0x20..0x2F. I kept this meaning for $limit to stay consistent with the ICU API, but changing $limit to have the semantics of $end would probably make more sense in PHP.

Vote

Accept the IntlChar RFC and merge into master?
Real name Yes No
ab (ab)  
aharvey (aharvey)  
ajf (ajf)  
guilhermeblanco (guilhermeblanco)  
indeyets (indeyets)  
irker (irker)  
jedibc (jedibc)  
kinncj (kinncj)  
leigh (leigh)  
mike (mike)  
pollita (pollita)  
rasmus (rasmus)  
salathe (salathe)  
stas (stas)  
Final result: 14 0
This poll has been closed.
  • Voting opened: 2014-12-26 02:20 UTC
  • Voting closed: 2015-01-16 17:20 UTC (3 week voting period)

Implementation

rfc/intl.char.txt · Last modified: 2017/09/22 13:28 by 127.0.0.1