PHP RFC: IntlCharsetDetector
- Version: 1.0
- Date: 2016-04-11
- Author: Sara Golemon pollita@php.net
- Status: Withdrawn
- First Published at: https://wiki.php.net/rfc/intl.charset-detector
Introduction
PHP's implementation of ICU is still incomplete. One of the features currently missing is a wrapping of UCharsetDetector meant to make an “educated guess” as to the encoding used for a given string.
Proposal
Wrap the ICU UCharsetDetector API in the PHP intl extension. The following API is proposed in the attached patch.
class IntlCharsetDetector { /* Initialize a UCharsetDetector, optionally initializing the bound text string * @throws ErrorException on failure */ public function __construct(string $str = null) { if ($str !== null) { $this->setText($str); } } /* Bind a text string to the internal object * @param string - Text to bind * @returns bool - TRUE on success, FALSE on failure */ public function setText(strign $str): bool; /* Provide a hint to ICU of the expected encoding * ICU may choose to entirely ignore this hint. * @param string - High confidence encoding to hint * @return bool - TRUE on success, FALSE on failure */ public function setDeclaredEncoding(string $encoding): bool; /* Return the "best guess" character set detected for the bound string * @return array<string,mixed> on success, FALSE on failure * array( * 'name' => 'iso-8859-1', // Likely character set encoing * 'confidence' => 35, // How certain the detector is as a percentage, 0-100 * 'language' => 'en', // Associated language code determined during detection * ) * * CAUTION: Per http://icu-project.org/apiref/icu4c/ucsdet_8h.html#a54b1e448b1d9cce1ac017962aaa801aa * 1. Language information is not available for input data encoded in all charsets. In particular, no language is identified for UTF-8 input data. * 2. Closely related languages may sometimes be confused. * If more accurate language detection is required, a linguistic analysis package should be used. */ public function detect(): array<string,mixed>; /* Returns all character set detection guesses, rather than just the "best guess" * @return array<array<string,mixed>> - Numerically indexed array from best to worst guess of guess arrays in the formet describe by detect(), above, or FALSE on failure */ public function detectAll(): array<array<string, mixed>>; /* @return array<string> - List of detectable character sets associated with this UCharsetDetector object, or FALSE on failure. */ public function getAllDetectableCharsets(): array<string>; /* Enables (or disables) input filtering. * If filtering is enabled, text within angle brackets ("<" and ">") will be removed before detection, which will remove most HTML or xml markup. * @param bool $enable - TRUE to enable filtering, FALSE to disable it * @return bool - TRUE on success, or FALSE on failure */ public function enableInputFilter(bool $enable): bool; /* @returns bool - Whether or not input filtering is enabled */ public function isInputFilterEnabled(): bool; } // Functional interface shadowing OOP interface function ucsdet_create(strign $text=null) { try { return new IntlCharsetDetector($text); } catch (\ErrorException $e) { return false; } } function ucsdet_*(IntlCharsetDetector $cs, ...$args) { return $cs->*(...$args); }
Existing Alternatives
PHP currently delivers a version of this functionality in the mbstring extension as mb_detect_encoding(), however mbstring is undermaintained, knows of fewer encodings, and is discouraged in favor of ICU in other PHP functions such as mb_convert_encoding().
Other Implementations
HHVM already exposes this feature as EncodingDetector and returns an EncodingMatch object rather than a marshalled array.
This RFC opts to use the Intl* class prefix common to all other ext/intl classes, and directly marshall results rather than providing object instances to query.
Proposed PHP Version(s)
7.1
Open Issues
Quite simply, character set detection is /hard/, and the best guess made from UCharsetDetector is often wrong. Users should always consult the confidence metric and act accordingly.
Proposed Voting Choices
Simple 50% + 1 majority will be required.
Patches and Tests
Initial patch is at https://github.com/php/php-src/compare/master...sgolemon:intl.charsetdetector Note that this has a TODO and some minor fixes to apply yet. It was created as a proof of concept before initial discussion of the viability of the library.
References
Initial list discussion: https://marc.info/?l=php-internals&m=145981827302414
Rejected Features
Keep this updated with features that were discussed on the mail lists.