
PHP RFC: IntlCharsetDetector

Introduction

PHP's intl extension still wraps only part of ICU. One of the features currently missing is a wrapper for UCharsetDetector, which makes an “educated guess” as to the encoding used for a given string.

Proposal

Wrap the ICU UCharsetDetector API in the PHP intl extension. The following API is proposed in the attached patch.

class IntlCharsetDetector {
   /* Initialize a UCharsetDetector, optionally binding an initial text string
    * @throws ErrorException on failure
    */
   public function __construct(?string $str = null) {
     if ($str !== null) {
       $this->setText($str);
     }
   }
   
   /* Bind a text string to the internal object
    * @param string $str - Text to bind
    * @return bool - TRUE on success, FALSE on failure
    */
   public function setText(string $str): bool;
   
   /* Provide a hint to ICU of the expected encoding
    * ICU may choose to entirely ignore this hint.
    * @param string $encoding - High confidence encoding to hint
    * @return bool - TRUE on success, FALSE on failure
    */
   public function setDeclaredEncoding(string $encoding): bool;
   
   /* Return the "best guess" character set detected for the bound string
    * @return array<string,mixed> on success, FALSE on failure
    * array(
    *   'name' => 'iso-8859-1', // Likely character set encoding
    *   'confidence' => 35, // How certain the detector is, as a percentage (0-100)
    *   'language' => 'en', // Associated language code determined during detection
    * )
    * 
    * CAUTION: Per http://icu-project.org/apiref/icu4c/ucsdet_8h.html#a54b1e448b1d9cce1ac017962aaa801aa 
    * 1. Language information is not available for input data encoded in all charsets. In particular, no language is identified for UTF-8 input data.
    * 2. Closely related languages may sometimes be confused.
    * If more accurate language detection is required, a linguistic analysis package should be used.
    */
   public function detect(): array<string,mixed>;
   
   /* Returns all character set detection guesses, rather than just the "best guess"
    * @return array<array<string,mixed>> - Numerically indexed array of guesses, ordered from best to worst, each in the format described by detect() above, or FALSE on failure
    */
   public function detectAll(): array<array<string, mixed>>;
   
   /* @return array<string> - List of detectable character sets associated with this UCharsetDetector object, or FALSE on failure. */
   public function getAllDetectableCharsets(): array<string>;
   
   /* Enables (or disables) input filtering.
    * If filtering is enabled, text within angle brackets ("<" and ">") will be removed before detection, which will remove most HTML or XML markup.
    * @param bool $enable - TRUE to enable filtering, FALSE to disable it
    * @return bool - TRUE on success, or FALSE on failure
    */
   public function enableInputFilter(bool $enable): bool;
   
   /* @return bool - Whether or not input filtering is enabled */
   public function isInputFilterEnabled(): bool;
}

// Functional interface shadowing OOP interface
function ucsdet_create(string $text = null) {
  try {
    return new IntlCharsetDetector($text);
  } catch (\ErrorException $e) {
    return false;
  }
}

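// Schematic only: every other ucsdet_<method>() function forwards to the
// IntlCharsetDetector method of the same name.
function ucsdet_*(IntlCharsetDetector $cs, ...$args) {
  return $cs->*(...$args);
}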
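
The following is a minimal usage sketch of the object interface proposed above, not code taken from the patch. IntlCharsetDetector is not part of any released PHP version, so the call is guarded with class_exists(); the helper name guessCharset() and the sample input are hypothetical and exist only for illustration.

```php
<?php
// Sketch only: IntlCharsetDetector is the class proposed in this RFC and is
// not available in released PHP builds, so every use of it is guarded.

/**
 * Hypothetical helper: return the best-guess array for $text, or null when
 * the proposed class is unavailable or detection fails.
 */
function guessCharset(string $text, ?string $declared = null)
{
    if (!class_exists('IntlCharsetDetector')) {
        return null; // proposed API not present in this build
    }
    try {
        $detector = new IntlCharsetDetector($text);
    } catch (ErrorException $e) {
        return null; // the constructor is documented to throw on failure
    }
    if ($declared !== null) {
        $detector->setDeclaredEncoding($declared); // a hint that ICU may ignore
    }
    $detector->enableInputFilter(true); // strip <...> markup before detection
    $guess = $detector->detect();
    return is_array($guess) ? $guess : null;
}

// ISO-8859-1 bytes for a short German phrase wrapped in markup.
$guess = guessCharset("<p>Gr\xFC\xDFe aus K\xF6ln</p>", 'iso-8859-1');
if ($guess === null) {
    echo "No guess available\n";
} else {
    printf("%s (%d%%, language: %s)\n", $guess['name'], $guess['confidence'], $guess['language']);
}
```

On a build without the proposed class, the helper simply returns null, which keeps calling code portable while the API is under discussion.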

Existing Alternatives

PHP currently delivers a version of this functionality in the mbstring extension as mb_detect_encoding(). However, mbstring is undermaintained, knows of fewer encodings, and is discouraged in favor of ICU for other functions such as mb_convert_encoding().

Other Implementations

HHVM already exposes this feature as EncodingDetector and returns an EncodingMatch object rather than a marshalled array.

This RFC opts to use the Intl* class prefix common to all other ext/intl classes, and to marshal results directly into arrays rather than providing object instances to query.

Proposed PHP Version(s)

7.1

Open Issues

Quite simply, character set detection is /hard/, and the best guess made by UCharsetDetector is often wrong. Users should always consult the confidence metric and act accordingly.
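
One way to act on the confidence metric is to reject guesses below a caller-chosen threshold. The sketch below works on plain arrays in the format documented for detect() and detectAll(), so it runs without the proposed class. The helper name pickConfidentCharset(), the default threshold of 50, and the sample data are assumptions made for illustration, not part of the proposal.

```php
<?php
/**
 * Hypothetical helper: return the name of the first guess at or above
 * $minConfidence from a detectAll()-style list (ordered best to worst),
 * or null if no guess qualifies.
 */
function pickConfidentCharset($guesses, int $minConfidence = 50)
{
    if (!is_array($guesses)) {
        return null; // detection failed (FALSE) or input was not a list
    }
    foreach ($guesses as $guess) {
        if (isset($guess['name'], $guess['confidence'])
            && $guess['confidence'] >= $minConfidence) {
            return $guess['name'];
        }
    }
    return null;
}

// Sample data shaped like the detect() documentation; the values are made up.
$guesses = [
    ['name' => 'iso-8859-1', 'confidence' => 35, 'language' => 'en'],
    ['name' => 'windows-1252', 'confidence' => 20, 'language' => 'en'],
];

var_dump(pickConfidentCharset($guesses));     // NULL
var_dump(pickConfidentCharset($guesses, 30)); // string(10) "iso-8859-1"
var_dump(pickConfidentCharset(false));        // NULL
```

Returning null instead of a low-confidence name forces the caller to choose a fallback explicitly, for example a declared encoding from a protocol header.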

Proposed Voting Choices

A simple 50% + 1 majority will be required.

Patches and Tests

The initial patch ( https://github.com/php/php-src/compare/master...sgolemon:intl.charsetdetector ) still has a TODO and some minor fixes to apply. It was created as a proof of concept before initial discussion of the viability of the library.

References

Rejected Features

Keep this updated with features that were discussed on the mailing lists.

rfc/intl.charset-detector.txt · Last modified: 2017/09/22 13:28 by 127.0.0.1