rfc:unicode_text_processing
Differences
This shows you the differences between two versions of the page.
rfc:unicode_text_processing [2022/11/21 15:14] – derick | rfc:unicode_text_processing [2024/09/11 14:16] (current) – derick | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== PHP RFC: Unicode Text Processing ====== | ====== PHP RFC: Unicode Text Processing ====== | ||
- | * Version: 0.9 | + | * Version: 0.9.2 |
- | * Date: 2022-11-09 | + | * Date: 2022-12-21 (Original date: 2022-12-15) |
* Author: Derick Rethans < | * Author: Derick Rethans < | ||
* Status: Draft | * Status: Draft | ||
* First Published at: http:// | * First Published at: http:// | ||
- | |||
===== Introduction ===== | ===== Introduction ===== | ||
Line 14: | Line 13: | ||
create an API that developers can use to do Unicode text processing | create an API that developers can use to do Unicode text processing | ||
correctly, without having to know all the intricacies. | correctly, without having to know all the intricacies. | ||
+ | |||
+ | Although PHP has decent maths features, it is sorely missing performant
+ | Unicode text processing that is always available in the core.
==== Definitions ==== | ==== Definitions ==== | ||
Line 23: | Line 25: | ||
===== Proposal ===== | ===== Proposal ===== | ||
- | To introduce a new " | + | To introduce a new final " |
- | stored in the objects. | + | text stored in the objects. |
Methods on the class will all return a new (immutable) object. | Methods on the class will all return a new (immutable) object. | ||
+ | |||
+ | The proposal is to make the '' | ||
+ | mean that it is therefore always available to users. As the implementation
+ | requires ICU, this would also mean that PHP will depend on the ICU library. | ||
+ | |||
==== Basics ==== | ==== Basics ==== | ||
Line 32: | Line 39: | ||
constructor. | constructor. | ||
- | The '' | + | The '' |
UTF-8 encoded string, which can be used by all existing PHP functions | UTF-8 encoded string, which can be used by all existing PHP functions | ||
that accept strings. | that accept strings. | ||
- | The internal representation | + | The internal representation |
Unlike the PHP 6 approach, the conversion to/from the internal | Unlike the PHP 6 approach, the conversion to/from the internal | ||
representation only happens on the boundaries: UTF-8 to UTF-16 through | representation only happens on the boundaries: UTF-8 to UTF-16 through | ||
- | the constructor, | + | the constructor, |
There are multiple groups of methods indicated below. Some are to | There are multiple groups of methods indicated below. Some are to | ||
Line 51: | Line 58: | ||
* prefer a method per function, instead of allowing the behaviour of a method to be changed through (optional) arguments. | * prefer a method per function, instead of allowing the behaviour of a method to be changed through (optional) arguments. | ||
* operations are on **graphemes** | * operations are on **graphemes** | ||
- | * no redundent | + | * no redundant |
* more as we discuss this... | * more as we discuss this... | ||
Line 64: | Line 71: | ||
If an argument to any of the methods is listed as '' | If an argument to any of the methods is listed as '' | ||
passing in a '' | passing in a '' | ||
- | the passed value with '' | + | the passed value with '' |
- | object that this method is called on is also used for this new wrapped | + | from the Text object that this method is called on is also used for this new |
- | value, if necessary. | + | wrapped |
- | ==== Locales and Internationalisation ==== | + | ==== Locales, Collators, |
- | By default each string will have the " | + | By default each string will have the "root" locale and " |
- | but it is possible to configure a specific collator by using the | + | associated with it, but it is possible to configure a specific |
- | '' | + | collator by using the '' |
- | a string describing an ICU locale name: | + | addition to the locale, and affects sorting and finding operations. |
+ | |||
+ | The '' | ||
+ | name: | ||
https:// | https:// | ||
- | For example, | + | The methods on the Text object all use the '' |
- | case-insensitive sorting for the English locale. This will require | + | |
- | extensive documentation. | + | |
- | Numerical order collation (such as PHP's '' | + | For example, the locale (and collation) name '' |
- | by adding the '' | + | case-insensitive sorting ('' |
- | (case-sensitive German, with numerics in value order). | + | The format of this locale/ |
+ | |||
+ | Numerical order collation (such as PHP's '' | ||
+ | adding the '' | ||
+ | (case-sensitive German | ||
Other options are described in BCP47: | Other options are described in BCP47: | ||
Line 88: | Line 100: | ||
and defaults at http:// | and defaults at http:// | ||
- | Specifying the locale | + | Building a locale/collation string |
- | '' | + | '' |
- | (https:// | + | of collations. The class performs the same function as '' |
- | descritive construction of a locale with all its options. | + | (https:// |
+ | descriptive methods | ||
+ | class is so that you don't have to depend on the '' | ||
+ | make it more developer-friendly. It converts the configured | ||
+ | string, which can then be used in any location where '' | ||
+ | used in the function signatures to the methods on the '' | ||
Line 98: | Line 115: | ||
This section lists all the methods that construct a Text object. | This section lists all the methods that construct a Text object. | ||
- | === __construct(string $text, string $locale | + | === __construct(string $text, string $collation |
The constructor takes a UTF-8 encoded text, and stores this in an internal | The constructor takes a UTF-8 encoded text, and stores this in an internal | ||
structure. The constructor will also convert the given text to Unicode | structure. The constructor will also convert the given text to Unicode | ||
- | Canonical Form. Passing in non-well-formed UTF-8 will result in an | + | Canonical Form (also called Normalisation Form C, or NFC). Passing in |
- | '' | + | non-well-formed UTF-8 will result in an '' |
- | (Byte-Order-Mark) character, if present. | + | The constructor will also strip out a BOM (Byte-Order-Mark) character, |
+ | if present. | ||
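
The three constructor steps described above (reject non-well-formed UTF-8, strip a leading BOM, normalise to NFC) can be illustrated with a short sketch. This is Python rather than the proposed PHP API, which does not exist yet: the helper name `prepare_text` is hypothetical, and `ValueError` merely stands in for whatever exception the RFC specifies.

```python
import unicodedata

BOM = "\ufeff"  # U+FEFF, the Byte-Order-Mark code point


def prepare_text(raw: bytes) -> str:
    """Hypothetical helper mirroring the constructor steps (not the RFC's API)."""
    try:
        # Strict decoding rejects non-well-formed UTF-8.
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError("input is not well-formed UTF-8") from exc
    # Strip a single leading BOM, if present.
    if text.startswith(BOM):
        text = text[len(BOM):]
    # Normalise to Normalisation Form C (NFC).
    return unicodedata.normalize("NFC", text)
```

For instance, passing a BOM followed by `e` and U+0301 (combining acute accent) would yield the single precomposed code point U+00E9, while a stray `0xFF` byte would be rejected.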
- | === static Text::join(array(string|Text) | + | === static Text::create(string $text, string $collation = ' |
- | Creates a new Text object by concatenating the each Text element | + | The Symfony String package, offers a static function to construct a String |
+ | through a single-character function ('' | ||
+ | file scope (with '' | ||
+ | |||
+ | This method solves a similar use, so that you can shorten '' | ||
+ | '' | ||
+ | For example with '' | ||
+ | |||
+ | |||
+ | === static Text:: | ||
+ | |||
+ | Creates a new Text object by concatenating | ||
+ | into a new Text object. | ||
+ | |||
+ | If the '' | ||
+ | '' | ||
+ | |||
+ | |||
+ | === static Text:: | ||
+ | |||
+ | Creates a new Text object by looping over all the string/Text elements | ||
'' | '' | ||
- | Semantics | + | The semantics are like: '' |
+ | |||
+ | If the '' | ||
+ | element from the '' | ||
+ | created object. | ||
+ | |||
+ | If the '' | ||
+ | '' | ||
+ | If the iterator produces a non-string/ | ||
+ | will be thrown. | ||
==== Standard String Operations ==== | ==== Standard String Operations ==== | ||
- | === split(string|Text $separator, int $limit = PHP_INT_MAX): | + | === split(string|Text $separator, int $limit = PHP_INT_MAX) : array(Text) === |
Returns an array of Text objects, each of which is a substring of '' | Returns an array of Text objects, each of which is a substring of '' | ||
Line 133: | Line 180: | ||
https:// | https:// | ||
- | === trimLeft, trimRight, trim === | + | === trimStart, trimEnd, trim : \Text === |
Removes white space at the start of, the end of, or both sides of the text. | Removes white space at the start of, the end of, or both sides of the text. | ||
Line 142: | Line 189: | ||
=== wrap(int $maxWidth, bool $cutLongWords = false) : array(Text) === | === wrap(int $maxWidth, bool $cutLongWords = false) : array(Text) === | ||
- | Wraps a text to a given number of graphemes into an array of Text objects. | + | Wraps a text to a given number of graphemes |
+ | objects. | ||
Like: '' | Like: '' | ||
Line 150: | Line 198: | ||
'' | '' | ||
- | + | === reverse() : \Text === | |
- | === replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) === | + | |
- | + | ||
- | Replaces the first '' | + | |
- | '' | + | |
- | + | ||
- | The '' | + | |
- | items are being replace. The '' | + | |
- | argument that is being replaced (0-indexed), | + | |
- | last item. Positive numbers are counted from the first occurence of | + | |
- | '' | + | |
- | occurrence. | + | |
- | + | ||
- | + | ||
- | === replaceTextCaseInsensitively(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) === | + | |
- | + | ||
- | Replaces every occurrence of '' | + | |
- | the object that the method is called on. The locale of '' | + | |
- | '' | + | |
- | + | ||
- | '' | + | |
- | + | ||
- | + | ||
- | === reverse() | + | |
Reverses a text, taking into account grapheme boundaries. | Reverses a text, taking into account grapheme boundaries. | ||
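
Why grapheme boundaries matter for reversal can be shown with a small Python sketch. It is only an approximation under stated assumptions: it treats a base code point plus any trailing combining marks as one unit, whereas full grapheme segmentation (Unicode UAX #29, as implemented by ICU) covers many more cases such as emoji sequences. The function names are hypothetical and are not part of the proposal.

```python
import unicodedata


def clusters(text: str) -> list:
    """Approximate grapheme clusters: a base code point plus trailing combining marks."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch  # attach the combining mark to the preceding base
        else:
            out.append(ch)
    return out


def reverse_by_cluster(text: str) -> str:
    """Reverse the order of clusters, keeping each cluster intact."""
    return "".join(reversed(clusters(text)))
```

With the input `"noe\u0308l"` (an `e` followed by a combining diaeresis), a naive code-point reversal moves the diaeresis onto the `l`, whereas the cluster-aware version keeps it on the `e`.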
Line 182: | Line 207: | ||
Methods to find text in other text. | Methods to find text in other text. | ||
- | === getPositionOfFirstOccurrence(string|Text $textToFind, int $offset) : int|false === | + | In all these methods, the locale and collator of '' |
+ | sub-strings that match, if it is a '' | ||
+ | collator that are embedded in the object that the method is called on is used. | ||
+ | |||
+ | |||
+ | === getPositionOfFirstOccurrence(string|Text $search, int $offset) : int|false === | ||
Returns the position (in grapheme units) of the first occurrence of | Returns the position (in grapheme units) of the first occurrence of | ||
- | '' | + | '' |
- | Like: '' | + | Like: '' |
https:// | https:// | ||
- | *I think this method name is too long* | + | Alternative suggested names: '' |
- | === getPositionOfLastOccurrence(string|Text $textToFind, int $offset) : int|false === | + | |
+ | === getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false === | ||
Like '' | Like '' | ||
+ | Alternative suggested names: '' | ||
- | === returnFromFirstOccurence(string|Text $textToFind) : Text|false === | ||
- | Returns the '' | + | === returnFromFirstOccurence(string|Text $search) : Text|false === |
+ | |||
+ | Returns the '' | ||
otherwise '' | otherwise '' | ||
- | Like: '' | + | Like: '' |
(https:// | (https:// | ||
+ | Alternative suggested names: '' | ||
- | === returnFromLastOccurence(string|Text $textToFind) : Text|false === | + | |
+ | === returnFromLastOccurence(string|Text $search) : Text|false === | ||
Like '' | Like '' | ||
- | === contains(string|Text $string) === | + | Alternative suggested names: '' |
+ | |||
+ | |||
+ | === contains(string|Text $search) === | ||
- | Returns true if the text '' | + | Returns true if the text '' |
Like '' | Like '' | ||
- | === endsWith(string|Text $string) : bool === | + | === endsWith(string|Text $search) : bool === |
- | Could be constructed from '' | + | Compares the last '' |
+ | |||
+ | Case-insensitive comparison can be achieved by setting the right | ||
+ | '' | ||
+ | |||
+ | Could be constructed from '' | ||
'' | '' | ||
too. | too. | ||
- | === startsWith(string|Text $string) : bool === | + | === startsWith(string|Text $search) : bool === |
- | Compares the first '' | + | Compares the first '' |
- | locale and collator that are configured with '' | + | |
Case-insensitive comparison can be achieved by setting the right | Case-insensitive comparison can be achieved by setting the right | ||
- | '' | + | '' |
Could be constructed from '' | Could be constructed from '' | ||
but it's an often required method, and standard PHP has it | but it's an often required method, and standard PHP has it | ||
too. | too. | ||
+ | |||
+ | === replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) : \Text === | ||
+ | |||
+ | Replaces occurrences of '' | ||
+ | |||
+ | The '' | ||
+ | items are being replaced. The '' | ||
+ | argument that is being replaced (0-indexed), | ||
+ | last item. Positive numbers are counted from the first occurrence of | ||
+ | '' | ||
+ | occurrence. | ||
+ | |||
+ | In order to find sub-strings case-insensitively, | ||
+ | argument to '' | ||
==== Comparing Text Objects ==== | ==== Comparing Text Objects ==== | ||
- | === compareWith(Text $other) : int === | + | === compareWith(Text $other, string $collation = NULL) : int === |
- | Uses the configured '' | + | Uses the configured '' |
- | '' | + | '' |
This same method is also used for comparing two Text objects as " | This same method is also used for comparing two Text objects as " | ||
- | handler" | + | handler" |
+ | taken into account. | ||
+ | |||
+ | === equals(Text $other, string $collation = NULL) : bool === |
+ | |||
+ | Alias for '' | ||
==== Case Conversions ==== | ==== Case Conversions ==== | ||
+ | These operations all use the collation that is configured on the Text object. | ||
- | === toLower === | + | === toLower |
Converts the text to lower case, using the lower case variant of each | Converts the text to lower case, using the lower case variant of each | ||
Unicode code point that makes up the text. | Unicode code point that makes up the text. | ||
+ | Example: '' | ||
+ | |||
+ | |||
+ | === toUpper : \Text === | ||
+ | |||
+ | The same, but then to upper case. | ||
- | === toUpper === | + | Example: '' |
+ | === toTitle : \Text === | ||
- | === toTitle === | + | The same, but then to title case (the first letter of each word). |
+ | Example: '' | ||
- | === firstToLower === | + | === firstToLower |
Converts the first grapheme in the text to a lower case variant. | Converts the first grapheme in the text to a lower case variant. | ||
+ | Example: '' | ||
- | === firstToUpper === | ||
+ | === firstToUpper : \Text === | ||
+ | The same, but then to upper case. | ||
- | === firstToTitle === | + | Example: '' |
+ | |||
+ | |||
+ | === wordsToLower : \Text === | ||
+ | |||
+ | Converts the first grapheme in every word to a lower case variant.
+ | |||
+ | Example: '' | ||
+ | |||
+ | |||
+ | === wordsToUpper : \Text === | ||
+ | |||
+ | The same, but then to upper case. | ||
+ | |||
+ | Example: '' | ||
Line 282: | Line 368: | ||
- | === getByteCount() === | + | === getByteCount() |
Returns the size in bytes that the text will take when converted to UTF-8. | Returns the size in bytes that the text will take when converted to UTF-8. | ||
- | === length(), getCharacterCount() === | + | === length(), getCharacterCount(): int |
Returns the number of characters that make up the text. A character (also | Returns the number of characters that make up the text. A character (also | ||
Line 295: | Line 381: | ||
- | === getCodePointCount() === | + | === getCodePointCount() |
Returns the number of Unicode code points that make up the text. | Returns the number of Unicode code points that make up the text. | ||
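
The difference between the byte count and the code point count can be seen in a minimal Python sketch; it uses Python's built-in string handling rather than the proposed methods, and the variable names are illustrative only.

```python
text = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

byte_count = len(text.encode("utf-8"))  # size of the text when encoded as UTF-8
code_point_count = len(text)            # Python's str length counts code points
```

Here the text occupies three bytes in UTF-8 and consists of two code points, yet a reader perceives it as a single character (one grapheme), which is the unit that the character count is described in.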
Line 301: | Line 387: | ||
- | === countWords() === | + | === getWordCount() : int === |
Pretty much a shortcut for:: | Pretty much a shortcut for:: | ||
Line 316: | Line 402: | ||
The values returned by the iterators are affected by the text's locale. | The values returned by the iterators are affected by the text's locale. | ||
+ | These are inspired by ICU4J' | ||
+ | (https:// | ||
+ | and Intl's create*Instance methods on '' | ||
+ | (https:// | ||
- | === getCharacterIterator === | + | === getCharacterIterator |
+ | Returns an Iterator that locates boundaries between logical characters. | ||
+ | Because of the structure of the Unicode encoding, a logical character may be | ||
+ | stored internally as more than one Unicode code point. (A with an umlaut may | ||
+ | be stored as an ' | ||
+ | example, but the user still thinks of it as one character.) This iterator | ||
+ | allows various processes (especially text editors) to treat as characters the | ||
+ | units of text that a user would think of as characters, rather than the units | ||
+ | of text that the computer sees as " | ||
+ | === getWordIterator : \Iterator === | ||
- | === getLineIterator === | + | Returns an Iterator that locates boundaries between words. This is useful |
+ | for double-click selection or "find whole words" searches. This type of | ||
+ | iterator makes sure there is a boundary position at the beginning and end | ||
+ | of each legal word. (Numbers count as words, too.) Whitespace and punctuation | ||
+ | are kept separate from real words. | ||
+ | === getLineIterator : \Iterator === | ||
+ | Returns an Iterator that locates positions where it is legal for a text | ||
+ | editor to wrap lines. This is similar to word breaking, but not the same: | ||
+ | punctuation and whitespace are generally kept with words (you don't want a | ||
+ | line to start with whitespace, for example), and some special characters can | ||
+ | force a position to be considered a line-break position or prevent a position | ||
+ | from being a line-break position. | ||
- | === getSentenceIterator === | + | === getSentenceIterator |
+ | Returns an Iterator that locates boundaries between sentences. | ||
- | === getTitleIterator | + | === getTitleIterator |
- | + | ||
- | + | ||
- | + | ||
- | === getWordIterator | + | |
+ | Returns an Iterator that locates boundaries between title breaks. | ||
Line 342: | Line 450: | ||
- | === transliterate(string $transliterationString) | + | === transliterate(string $transliterationString) |
- | + | ||
- | + | ||
- | + | ||
- | === transliterate(\Intl\Transliterator $transliterator) | + | |
- | + | ||
- | + | ||
- | With the first one being a " | + | |
- | Transliterator for more complex cases. | + | |
- | + | ||
- | Should we add shortcuts for a set of often used ones, such as '' | + | |
- | think so, as it's the majority use case. | + | |
+ | Transliterates the content of the '' | ||
+ | specified in the '' | ||
- | === toLatin === | + | There are a few constants for specific and often used cases, such as creating |
+ | an ASCII transliterated version of any Text: | ||
- | Converts | + | - const Text:: |
+ | any script to Latin, and also strips all the accents. | ||
+ | - const Text:: | ||
+ | any script to Latin, but does not remove the accents. | ||
- | === removeAccents | + | - const Text::removeAccents |
+ | the transliteration string ''" | ||
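
One common way to remove accents is to decompose the text, drop the nonspacing marks, and recompose it. The Python sketch below shows that idea using only the standard library; whether it corresponds to the transliteration string referred to above is an assumption, ICU's transliterator is not involved, and the function name `strip_accents` is hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose (NFD), drop nonspacing marks (category Mn), recompose (NFC)."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", without_marks)
```

Note that this only affects characters that decompose into a base letter plus combining marks; letters such as `ø` have no such decomposition and would pass through unchanged.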
- | Removes the accents from a (latin script) text. | + | ===== Implementation Details ===== |
- | A shortcut for the transliteration string ''" | + | The functionality as is described in this RFC is mostly implemented by using |
- | suitable one, which I believe | + | functionality from the ICU library, which is also used by the Intl extension. |
- | NFC."'' | + | |
+ | In order for PHP to continue to work on as wide a range of platforms and
+ | distributions, | ||
+ | Linux distributions' | ||
+ | which this functionality is implemented. | ||
===== Backward Incompatible Changes ===== | ===== Backward Incompatible Changes ===== | ||
- | Introducing a new class could impact code bases that already use this class | + | Introducing a new '' |
- | name. But as PHP owns the global namespace, this should not deter us from | + | class name. But as PHP owns the global namespace, this should not deter us |
- | adding such a core class. | + | from adding such a core class. |
===== Proposed PHP Version(s) ===== | ===== Proposed PHP Version(s) ===== | ||
Line 387: | Line 494: | ||
===== Open Issues ===== | ===== Open Issues ===== | ||
- | ==== Class Name ==== | + | - Add a method like mb_strcut, to extract a string of at most a given number of bytes from a position, as encoded in UTF-8. |
- | I have currently picked " | + | ===== Questions and Answers ===== |
- | represent single words (strings). Alternatively, | + | |
- | " | + | |
+ | ==== Why is this not a composer package? ==== | ||
+ | |||
+ | The goal of this RFC is that PHP users can always rely on performant text | ||
+ | processing capabilities. | ||
+ | |||
+ | Text processors written in PHP already exist, but suffer from performance | ||
+ | issues (PHP is slower than C), and are sometimes tailored to specific use | ||
+ | cases. By having them written in C, and utilising ICU's well tested and often | ||
+ | updated rules and algorithms, both the performance and correctness issues will | ||
+ | be addressed. | ||
===== Future Scope ===== | ===== Future Scope ===== | ||
Line 421: | Line 536: | ||
Nothing rejected yet. | Nothing rejected yet. | ||
+ | |||
+ | |||
+ | ===== Changes ===== | ||
+ | |||
+ | 0.9.2 — 2022-12-21 | ||
+ | |||
+ | * Tim Düsterhus: Added concat and equals methods; changed join to accept an iterator. | ||
+ | * Enhance explanation of locales and collations, and standardize on using '' | ||
+ | |||
+ | 0.9.1 — 2022-12-16 | ||
+ | |||
+ | * Tim Düsterhus: Removed firstToTitle/ | ||
+ | * Paul Crovella: Clarify which normalisation is being used. | ||
+ | * Daniel Wolfe: Update trimLeft/ |
rfc/unicode_text_processing.1669043671.txt.gz · Last modified: 2022/11/21 15:14 by derick