
Request for Comments: ext/intl::UConverter

Exposes ICU's UConverter functions by adding a class to the ext/intl extension

Introduction

The ext/intl extension only exposes some of ICU's powerful i18n functionality. This diff covers the ucnv_* family of functions in ICU4C, exposing both a simple API, UConverter::transcode(), and a more robust class with greater flexibility.

Specification of the Class

class UConverter {
  /* UConverterCallbackReason */
  const UCNV_UNASSIGNED;
  const UCNV_ILLEGAL;
  const UCNV_IRREGULAR;
  const UCNV_RESET;
  const UCNV_CLOSE;
  const UCNV_CLONE;
  
  /* UConverterType */
  const UCNV_UNSUPPORTED_CONVERTER;
  const UCNV_SBCS;
  const UCNV_DBCS;
  const UCNV_MBCS;
  const UCNV_LATIN_1;
  const UCNV_UTF8;
  const UCNV_UTF16_BigEndian;
  const UCNV_UTF16_LittleEndian;
  const UCNV_UTF32_BigEndian;
  const UCNV_UTF32_LittleEndian;
  const UCNV_EBCDIC_STATEFUL;
  const UCNV_ISO_2022;
  const UCNV_LMBCS_1;
  const UCNV_LMBCS_2;
  const UCNV_LMBCS_3;
  const UCNV_LMBCS_4;
  const UCNV_LMBCS_5;
  const UCNV_LMBCS_6;
  const UCNV_LMBCS_8;
  const UCNV_LMBCS_11;
  const UCNV_LMBCS_16;
  const UCNV_LMBCS_17;
  const UCNV_LMBCS_18;
  const UCNV_LMBCS_19;
  const UCNV_LMBCS_LAST;
  const UCNV_HZ;
  const UCNV_SCSU;
  const UCNV_ISCII;
  const UCNV_US_ASCII;
  const UCNV_UTF7;
  const UCNV_BOCU1;
  const UCNV_UTF16;
  const UCNV_UTF32;
  const UCNV_CESU8;
  const UCNV_IMAP_MAILBOX;
  
  __construct(string $toEncoding, string $fromEncoding);
  
  /* Setting/Checking current encoders */
  string getSourceEncoding();
  void setSourceEncoding(string $encoding);
  string getDestinationEncoding();
  void setDestinationEncoding(string $encoding);
  
  /* Introspection for algorithmic conversions */
  UConverterType getSourceType();
  UConverterType getDestinationType();
  
  /* Basic error handling */
  string getSubstChars();
  void setSubstChars(string $chars);
  
  /* Default callback functions */
  string toUCallback  (UConverterCallbackReason $reason, string $source, string $codeUnits, UErrorCode &$error);
  string fromUCallback(UConverterCallbackReason $reason, Array  $source, long   $codePoint, UErrorCode &$error);
  
  /* Primary conversion workhorses */
  string convert(string $str[, bool $reverse = false]);
  static string transcode(string $str, string $toEncoding, string $fromEncoding[, Array $options]);
  
  /* Enumeration and lookup */
  static string reasonText(UConverterCallbackReason $reason);
  static Array getAvailable();
  static Array getAliases(string $encoding);
  static Array getStandards();
}

Simple uses

The usage and purpose of UConverter::transcode() are identical to those of its mbstring counterpart, mb_convert_encoding(), with the exception of an added “options” parameter.

$utf8string = UConverter::transcode($latin1string, 'utf-8', 'latin1');

By default, ICU will substitute a ^Z character (U+001A) in place of any code point which cannot be converted from the original encoding to Unicode, or from Unicode to the target encoding. Note that the former condition is extremely rare compared to the latter.

$asciiString = UConverter::transcode("Espa\xD1ol", 'ascii', 'latin1');
// Yields Espa^Zol

To override the default substitution, the optional fourth parameter may be set to an array of options.

$opts = array('from_subst' => '?', 'to_subst' => '?');
$asciiString = UConverter::transcode("Espa\xD1ol", 'ascii', 'latin1', $opts);
// Yields Espa?ol

Note that substitution characters must represent a single codepoint in the encoding which is being converted from or to.
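
The two substitution behaviours described above can be compared byte for byte. The following is a minimal sketch, assuming ext/intl is built with the UConverter class as specified in this RFC; the variable names are illustrative only.

```php
<?php
// Sketch only: assumes ext/intl provides UConverter::transcode() as described above.
$latin1 = "Espa\xD1ol"; // 0xD1 has no ASCII equivalent

// Default behaviour: the unconvertible character is replaced by ^Z (byte 0x1A).
$default = UConverter::transcode($latin1, 'ascii', 'latin1');
echo bin2hex($default), "\n"; // hex dump makes the control byte visible

// Custom single-codepoint substitution, using the same options as the example above.
$opts   = array('from_subst' => '?', 'to_subst' => '?');
$custom = UConverter::transcode($latin1, 'ascii', 'latin1', $opts);
echo $custom, "\n";
```

Dumping the result with bin2hex() is a convenient way to spot the otherwise invisible ^Z byte when debugging a conversion.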

Object Oriented Use

The OOP use-case allows the caller to reuse the same converter across multiple calls:

$c = new UConverter('utf-8', 'latin1');
echo $c->convert("123 PHPstra\xDFa\n");
echo $c->convert("M\xFCnchen DE\n");

Similar to the functional interface above, basic error handling may be employed using substitution characters:

$c = new UConverter('ascii', 'latin1');
$c->setSubstChars('?');
echo $c->convert("123 PHPstra\xDFa\n");
echo $c->convert("M\xFCnchen DE\n");

The converter may also run the conversion backwards with an optional second parameter to UConverter::convert:

$c = new UConverter('utf-8', 'latin1');
echo $c->convert("123 PHPstra\xC3\x9Fa\n", true);
echo $c->convert("M\xC3\xBCnchen DE\n", true);
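
As a rough illustration of the reverse flag, a single converter instance can take a string through a round trip. This sketch assumes the constructor and convert() signatures given in the class specification above; the variable names are hypothetical.

```php
<?php
// Sketch only: assumes ext/intl provides the UConverter class described above.
$c = new UConverter('utf-8', 'latin1');

$latin1 = "M\xFCnchen";            // ISO-8859-1 bytes
$utf8   = $c->convert($latin1);    // forward: latin1 -> utf-8
$back   = $c->convert($utf8, true); // reverse: utf-8 -> latin1

echo bin2hex($utf8), "\n";         // 0xFC becomes the two bytes c3 bc
var_dump($back === $latin1);       // the round trip restores the original bytes
```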

Advanced Use

The UConverter class actually performs two conversion cycles: one from the source encoding to its internal UChar (Unicode) representation, then another from that representation to the destination encoding. During each cycle, errors are handled by the built-in toUCallback() and fromUCallback() methods, which may be overridden in a child class:

class MyConverter extends UConverter {
  public function fromUCallback($reason, $source, $codepoint, &$error) {
    if (($reason == UConverter::UCNV_UNASSIGNED) && ($codepoint == 0x00F1)) {
      // Basic transliteration 'ñ' to 'n'
      $error = U_ZERO_ERROR;
      return 'n';
    }
  }
}
$c = new MyConverter('ascii', 'latin1');
echo $c->convert("Espa\xF1ol");
// Yields "Espanol"

$reason will be one of the UConverterCallbackReason constants defined in the class definition above. UCNV_RESET, UCNV_CLOSE, and UCNV_CLONE are informational events and do not require any direct action. The remaining events describe some form of exception case which must be handled. See Return Values below.

$source is the context from the original or intermediate string, from the codeunits or codepoint where the exception occurred onward. For toUCallback(), this will be a string of codeunits; for fromUCallback(), this will be an array of codepoints (integers).

$codeUnits is one or more code units from the original string, in its source encoding, which could not be translated to Unicode.

$codepoint is the Unicode character from the intermediate string which could not be converted to the output encoding.

$error is a by-reference value which will contain the specific ICU error encountered on input, and should be modified to U_ZERO_ERROR (or some appropriate value) before returning the replacement codepoint/codeunits.

Return values for these callback methods may be: NULL, Long, String, or Array. A value of NULL indicates that the codepoint/codeunit should be ignored and left out of the destination/intermediate string. A Long return value will be treated as either a Unicode codepoint for toUCallback(), or a single-byte character in the target encoding for fromUCallback(). A String return value will be treated as one (or more) UTF-8 encoded codepoints for toUCallback(), or a multi-byte character (or characters) in the target encoding for fromUCallback().

Error Handling

Any errors encountered while calling UConverter::transcode() are raised as standard E_WARNING warnings and NULL is returned (to conform with non-OOP error handling styles). Errors encountered in OOP usage are thrown as instances of UConverterException.

Enumerators

A few enumeration methods are exposed as a convenience. Hopefully their usage is obvious enough that they don't bear going into beyond the class definition above.
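
For completeness, a brief sketch of how the static lookup methods might be called, assuming they behave as declared in the class definition above. The exact lists returned depend on the ICU build in use, so no particular output is shown.

```php
<?php
// Sketch only: assumes ext/intl provides the static methods declared above.
$available = UConverter::getAvailable();      // names of all converters ICU knows
printf("%d converters available\n", count($available));

$aliases = UConverter::getAliases('utf-8');   // alternative names for one encoding
echo implode(', ', $aliases), "\n";

$standards = UConverter::getStandards();      // naming standards recognised by ICU
echo implode(', ', $standards), "\n";
```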

References

ICU4C ucnv.h documentation: http://icu-project.org/apiref/icu4c/ucnv_8h.html

Patch: An implementation of the above can be found at https://github.com/sgolemon/php-src/compare/master...uconverter

rfc/uconverter.1351635273.txt.gz · Last modified: 2017/09/22 13:28 (external edit)