====== PHP RFC: IntlCharsetDetector ====== * Version: 1.0 * Date: 2016-04-11 * Author: Sara Golemon * Status: Withdrawn * First Published at: https://wiki.php.net/rfc/intl.charset-detector ===== Introduction ===== PHP's implementation of ICU is still incomplete. One of the features currently missing is a wrapping of [[http://icu-project.org/apiref/icu4c/ucsdet_8h.html|UCharsetDetector]] meant to make an "educated guess" as to the encoding used for a given string. ===== Proposal ===== Wrap the ICU UCharsetDetector API in the PHP intl extension. The following API is proposed in the attached patch. class IntlCharsetDetector { /* Initialize a UCharsetDetector, optionally initializing the bound text string * @throws ErrorException on failure */ public function __construct(string $str = null) { if ($str !== null) { $this->setText($str); } } /* Bind a text string to the internal object * @param string - Text to bind * @returns bool - TRUE on success, FALSE on failure */ public function setText(strign $str): bool; /* Provide a hint to ICU of the expected encoding * ICU may choose to entirely ignore this hint. * @param string - High confidence encoding to hint * @return bool - TRUE on success, FALSE on failure */ public function setDeclaredEncoding(string $encoding): bool; /* Return the "best guess" character set detected for the bound string * @return array on success, FALSE on failure * array( * 'name' => 'iso-8859-1', // Likely character set encoing * 'confidence' => 35, // How certain the detector is as a percentage, 0-100 * 'language' => 'en', // Associated language code determined during detection * ) * * CAUTION: Per http://icu-project.org/apiref/icu4c/ucsdet_8h.html#a54b1e448b1d9cce1ac017962aaa801aa * 1. Language information is not available for input data encoded in all charsets. In particular, no language is identified for UTF-8 input data. * 2. Closely related languages may sometimes be confused. * If more accurate language detection is required, a linguistic analysis package should be used. */ public function detect(): array; /* Returns all character set detection guesses, rather than just the "best guess" * @return array> - Numerically indexed array from best to worst guess of guess arrays in the formet describe by detect(), above, or FALSE on failure */ public function detectAll(): array>; /* @return array - List of detectable character sets associated with this UCharsetDetector object, or FALSE on failure. */ public function getAllDetectableCharsets(): array; /* Enables (or disables) input filtering. * If filtering is enabled, text within angle brackets ("<" and ">") will be removed before detection, which will remove most HTML or xml markup. * @param bool $enable - TRUE to enable filtering, FALSE to disable it * @return bool - TRUE on success, or FALSE on failure */ public function enableInputFilter(bool $enable): bool; /* @returns bool - Whether or not input filtering is enabled */ public function isInputFilterEnabled(): bool; } // Functional interface shadowing OOP interface function ucsdet_create(strign $text=null) { try { return new IntlCharsetDetector($text); } catch (\ErrorException $e) { return false; } } function ucsdet_*(IntlCharsetDetector $cs, ...$args) { return $cs->*(...$args); } ===== Existing Alternatives ===== PHP currently delivers a version of this functionality in the mbstring extension as [[http://php.net/manual/en/function.mb-detect-encoding.php|mb_detect_encoding]](), however mbstring is undermaintained, knows of fewer encodings, and is discouraged in favor of ICU in other PHP functions such as [[http://php.net/manual/en/function.mb-convert-encoding.php|mb_convert_encoding]](). ===== Other Implementations ===== HHVM already exposes this feature as [[https://github.com/facebook/hhvm/blob/master/hphp/runtime/ext/icu/ext_icu_ucsdet.php|EncodingDetector]] and returns an EncodingMatch object rather than a marshalled array. This RFC opts to use the Intl* class prefix common to all other ext/intl classes, and directly marshall results rather than providing object instances to query. ===== Proposed PHP Version(s) ===== 7.1 ===== Open Issues ===== Quite simply, character set detection is /hard/, and the best guess made from UCharsetDetector is often wrong. Users should always consult the confidence metric and act accordingly. ===== Proposed Voting Choices ===== Simple 50% + 1 majority will be required. ===== Patches and Tests ===== Initial patch is at https://github.com/php/php-src/compare/master...sgolemon:intl.charsetdetector Note that this has a TODO and some minor fixes to apply yet. It was created as a proof of concept before initial discussion of the viability of the library. ===== References ===== Initial list discussion: https://marc.info/?l=php-internals&m=145981827302414 ===== Rejected Features ===== Keep this updated with features that were discussed on the mail lists.