PHP RFC: IntlCharsetDetector
- Version: 1.0
- Date: 2016-04-11
- Author: Sara Golemon pollita@php.net
- Status: Withdrawn
- First Published at: https://wiki.php.net/rfc/intl.charset-detector
Introduction
PHP's wrapping of ICU in the intl extension is still incomplete. One of the features currently missing is a wrapper for UCharsetDetector, which makes an “educated guess” as to the encoding used for a given string.
Proposal
Wrap the ICU UCharsetDetector API in the PHP intl extension. The following API is proposed in the attached patch.
class IntlCharsetDetector {
  /* Initialize a UCharsetDetector, optionally initializing the bound text string
   * @throws ErrorException on failure
   */
  public function __construct(string $str = null) {
    if ($str !== null) {
      $this->setText($str);
    }
  }

  /* Bind a text string to the internal object
   * @param string $str - Text to bind
   * @return bool - TRUE on success, FALSE on failure
   */
  public function setText(string $str): bool;

  /* Provide a hint to ICU of the expected encoding
   * ICU may choose to entirely ignore this hint.
   * @param string $encoding - High confidence encoding to hint
   * @return bool - TRUE on success, FALSE on failure
   */
  public function setDeclaredEncoding(string $encoding): bool;

  /* Return the "best guess" character set detected for the bound string
   * @return array<string,mixed> on success, FALSE on failure
   *   array(
   *     'name' => 'iso-8859-1', // Likely character set encoding
   *     'confidence' => 35,     // How certain the detector is as a percentage, 0-100
   *     'language' => 'en',     // Associated language code determined during detection
   *   )
   *
   * CAUTION: Per http://icu-project.org/apiref/icu4c/ucsdet_8h.html#a54b1e448b1d9cce1ac017962aaa801aa
   *   1. Language information is not available for input data encoded in all charsets. In particular, no language is identified for UTF-8 input data.
   *   2. Closely related languages may sometimes be confused.
   * If more accurate language detection is required, a linguistic analysis package should be used.
   */
  public function detect(): array<string,mixed>;

  /* Returns all character set detection guesses, rather than just the "best guess"
   * @return array<array<string,mixed>> - Numerically indexed array, ordered from best to worst guess, of guess arrays in the format described by detect(), above, or FALSE on failure
   */
  public function detectAll(): array<array<string,mixed>>;

  /* @return array<string> - List of detectable character sets associated with this UCharsetDetector object, or FALSE on failure. */
  public function getAllDetectableCharsets(): array<string>;

  /* Enables (or disables) input filtering.
   * If filtering is enabled, text within angle brackets ("<" and ">") will be removed before detection, which will remove most HTML or XML markup.
   * @param bool $enable - TRUE to enable filtering, FALSE to disable it
   * @return bool - TRUE on success, or FALSE on failure
   */
  public function enableInputFilter(bool $enable): bool;

  /* @return bool - Whether or not input filtering is enabled */
  public function isInputFilterEnabled(): bool;
}
// Functional interface shadowing the OOP interface
function ucsdet_create(string $text = null) {
  try {
    return new IntlCharsetDetector($text);
  } catch (\ErrorException $e) {
    return false;
  }
}

// Pseudocode: every other ucsdet_*() function forwards to the method of the same name
function ucsdet_*(IntlCharsetDetector $cs, ...$args) {
  return $cs->*(...$args);
}
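As a rough illustration of how the proposed OOP interface might be called, here is a minimal sketch. It assumes the class exists exactly as declared in this proposal. Because the patch was never merged, the helper first checks class_exists() and returns null when the class is unavailable. The helper name guessCharset() and the sample bytes are hypothetical and not part of the proposal.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the best guess from the proposed
 * IntlCharsetDetector, or null if the class is unavailable or detection fails.
 */
function guessCharset(string $bytes, bool $stripMarkup = false): ?array
{
    // The proposed class is not part of released ext/intl, so guard for it.
    if (!class_exists('IntlCharsetDetector')) {
        return null;
    }

    try {
        // Per the proposal, the constructor may throw ErrorException.
        $detector = new IntlCharsetDetector($bytes);
    } catch (\ErrorException $e) {
        return null;
    }

    if ($stripMarkup) {
        // Ask ICU to ignore text inside angle brackets before detecting.
        $detector->enableInputFilter(true);
    }

    // Per the proposal, detect() returns an array on success, FALSE on failure.
    $guess = $detector->detect();
    return is_array($guess) ? $guess : null;
}

$guess = guessCharset("<p>Na\xEFve caf\xE9</p>", true);
if ($guess === null) {
    echo "IntlCharsetDetector is not available\n";
} else {
    printf("%s (%d%% confidence)\n", $guess['name'], $guess['confidence']);
}
```

Returning null for both "class missing" and "detection failed" keeps the caller simple; code that needs to tell the two apart would have to report them separately.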
Existing Alternatives
PHP currently delivers a version of this functionality in the mbstring extension as mb_detect_encoding(). However, mbstring is undermaintained, knows of fewer encodings, and is already discouraged in favor of ICU for other functions such as mb_convert_encoding().
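For comparison, the existing mbstring route looks roughly like the sketch below. mb_detect_encoding() and its strict flag are real. The candidate list and the sample bytes are illustrative assumptions. It returns a single encoding name with no confidence score, and it only chooses among the candidates passed in.

```php
<?php
// Requires ext/mbstring.
// 0xE9 is "é" in ISO-8859-1 but an incomplete byte sequence in UTF-8.
$latin1 = "caf\xE9";

// Strict mode rejects candidates for which the bytes are invalid,
// so UTF-8 is ruled out for this input.
$encoding = mb_detect_encoding($latin1, ['UTF-8', 'ISO-8859-1'], true);

// The result is one encoding name (or false); there is no confidence value.
var_dump($encoding);
```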
Other Implementations
HHVM already exposes this feature as EncodingDetector and returns an EncodingMatch object rather than a marshalled array.
This RFC opts to use the Intl* class prefix common to many other ext/intl classes, and to marshal results directly into arrays rather than providing object instances to query.
Proposed PHP Version(s)
7.1
Open Issues
Quite simply, character set detection is hard, and the best guess made by UCharsetDetector is often wrong. Users should always consult the confidence metric and act accordingly.
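One way to act on the confidence metric is to accept the top guess only above a threshold and otherwise fall back to a default. The sketch below works on plain arrays in the format proposed for detectAll(), so it runs without the detector itself. The helper name chooseEncoding(), the threshold of 50, the UTF-8 fallback, and the sample data are all assumptions made for illustration, not part of the proposal.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: picks an encoding from a detectAll()-style list
 * (best guess first), accepting it only at or above a confidence threshold.
 *
 * @param array|false $guesses       Result in the proposed detectAll() format
 * @param int         $minConfidence Arbitrary cut-off, 0-100
 * @param string      $fallback      Encoding to assume when detection is unconvincing
 */
function chooseEncoding($guesses, int $minConfidence = 50, string $fallback = 'UTF-8'): string
{
    if (!is_array($guesses) || $guesses === []) {
        return $fallback; // detection failed outright
    }

    $best = $guesses[0]; // the proposal orders guesses from best to worst
    if (($best['confidence'] ?? 0) < $minConfidence) {
        return $fallback; // too uncertain to trust
    }

    return $best['name'];
}

// Made-up sample data in the proposed format, not real detector output.
$sample = [
    ['name' => 'iso-8859-1', 'confidence' => 35, 'language' => 'en'],
    ['name' => 'windows-1252', 'confidence' => 20, 'language' => 'en'],
];

echo chooseEncoding($sample), "\n";     // prints "UTF-8" (35 is below the default 50)
echo chooseEncoding($sample, 30), "\n"; // prints "iso-8859-1"
echo chooseEncoding(false), "\n";       // prints "UTF-8"
```

The right threshold depends on the application: a converter that silently rewrites data probably wants a stricter cut-off than a tool that merely suggests an encoding to a user.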
Proposed Voting Choices
Simple 50% + 1 majority will be required.
Patches and Tests
The initial patch is at https://github.com/php/php-src/compare/master...sgolemon:intl.charsetdetector and still has a TODO and some minor fixes to apply. It was created as a proof of concept before the initial discussion of the library's viability.
References
Initial list discussion: https://marc.info/?l=php-internals&m=145981827302414
Rejected Features
Keep this updated with features that were discussed on the mail lists.