rfc:unicode_text_processing
Differences
This shows you the differences between two versions of the page.
rfc:unicode_text_processing [2022/11/21 15:14] – derick | rfc:unicode_text_processing [2022/12/15 15:29] – derick | ||
---|---|---|---|
Line 3: | Line 3: | ||
* Date: 2022-11-09 | * Date: 2022-11-09 | ||
* Author: Derick Rethans < | * Author: Derick Rethans < | ||
- | * Status: | + | * Status: |
* First Published at: http:// | * First Published at: http:// | ||
Line 14: | Line 14: | ||
create an API that developers can use to do Unicode text processing | create an API that developers can use to do Unicode text processing | ||
correctly, without having to know all the intricacies. | correctly, without having to know all the intricacies. | ||
+ | |||
+ | Although PHP has decent maths features, it sorely lacks performant | |||
+ | Unicode text processing that is always available in the core. | |||
==== Definitions ==== | ==== Definitions ==== | ||
Line 27: | Line 30: | ||
Methods on the class will all return a new (immutable) object. | Methods on the class will all return a new (immutable) object. | ||
+ | |||
+ | The proposal is to make the '' | ||
+ | mean that it is therefore always available to users. As the implementation | |||
+ | requires ICU, this would also mean that PHP will depend on the ICU library. | ||
+ | |||
==== Basics ==== | ==== Basics ==== | ||
Line 32: | Line 40: | ||
constructor. | constructor. | ||
- | The '' | + | The '' |
UTF-8 encoded string, which can be used by all existing PHP functions | UTF-8 encoded string, which can be used by all existing PHP functions | ||
that accept strings. | that accept strings. | ||
- | The internal representation | + | The internal representation |
Unlike the PHP 6 approach, the conversion to/from the internal | Unlike the PHP 6 approach, the conversion to/from the internal | ||
representation only happens on the boundaries: UTF-8 to UTF-16 through | representation only happens on the boundaries: UTF-8 to UTF-16 through | ||
- | the constructor, | + | the constructor, |
There are multiple groups of methods indicated below. Some are to | There are multiple groups of methods indicated below. Some are to | ||
Line 51: | Line 59: | ||
* prefer a method per function, instead of allowing the behaviour of a method to be changed through (optional) arguments. | * prefer a method per function, instead of allowing the behaviour of a method to be changed through (optional) arguments. | ||
* operations are on **graphemes** | * operations are on **graphemes** | ||
- | * no redundent | + | * no redundant |
* more as we discuss this... | * more as we discuss this... | ||
Line 80: | Line 88: | ||
extensive documentation. | extensive documentation. | ||
- | Numerical order collation (such as PHP's '' | + | Numerical order collation (such as PHP's '' |
by adding the '' | by adding the '' | ||
(case-sensitive German, with numerics in value order). | (case-sensitive German, with numerics in value order). | ||
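As a rough illustration of numeric-order collation, the sketch below uses the existing Intl ''Collator'' class rather than the proposed API. The ''de_DE'' locale, the file names, and the use of the ''Collator::NUMERIC_COLLATION'' attribute (instead of a locale keyword) are assumptions made for this example only.

```php
<?php
// Requires the intl extension. Locale and sample data are illustrative.
$collator = new Collator('de_DE');

// Compare runs of digits by their numeric value instead of digit by digit.
$collator->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);

$files = ['bild10.png', 'bild2.png', 'bild1.png'];
$collator->sort($files);

print_r($files);
```

With numeric collation switched on, ''bild2.png'' should sort before ''bild10.png'', which a plain code point comparison would not give you.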
Line 88: | Line 96: | ||
and defaults at http:// | and defaults at http:// | ||
- | Specifying the locale | + | Building a locale/collation string |
- | '' | + | '' |
- | (https:// | + | of collations. The class performs the same function as '' |
- | descritive construction of a locale with all its options. | + | (https:// |
+ | descriptive methods | ||
+ | class is so that you don't have to depend on the '' | ||
+ | make it more developer-friendly. It converts the configured | ||
+ | string, which can then be used in any location where '' | ||
+ | used in the function signatures to the methods on the '' | ||
Line 98: | Line 111: | ||
This section lists all the methods that construct a Text object. | This section lists all the methods that construct a Text object. | ||
- | === __construct(string $text, string $locale = ' | + | === __construct(string $text, string $locale = ' |
The constructor takes a UTF-8 encoded text, and stores this in an internal | The constructor takes a UTF-8 encoded text, and stores this in an internal | ||
Line 106: | Line 119: | ||
(Byte-Order-Mark) character, if present. | (Byte-Order-Mark) character, if present. | ||
+ | === static Text:: | ||
- | === static | + | The Symfony String package offers a static | ||
+ | through a single-character function | ||
+ | file scope (with '' | ||
- | Creates a new Text object by concatenating the each Text element in | + | This method solves a similar use, so that you can shorten '' |
+ | '' | ||
+ | For example with '' | ||
+ | |||
+ | === static Text:: | ||
+ | |||
+ | Creates a new Text object by concatenating the Text elements in | |||
'' | '' | ||
- | Semantics | + | The semantics are like: '' |
+ | |||
+ | If the '' | ||
+ | element in the '' | ||
+ | object. | ||
+ | |||
+ | If the '' | ||
+ | '' | ||
Line 142: | Line 171: | ||
=== wrap(int $maxWidth, bool $cutLongWords = false) : array(Text) === | === wrap(int $maxWidth, bool $cutLongWords = false) : array(Text) === | ||
- | Wraps a text to a given number of graphemes into an array of Text objects. | + | Wraps a text to a given number of graphemes |
+ | objects. | ||
Like: '' | Like: '' | ||
Line 155: | Line 185: | ||
Replaces the first '' | Replaces the first '' | ||
'' | '' | ||
+ | |||
+ | The locale of '' | ||
+ | match, if it is a '' | ||
+ | that the method is called on. | ||
The '' | The '' | ||
- | items are being replace. The '' | + | items are being replaced. The '' |
argument that is being replaced (0-indexed), | argument that is being replaced (0-indexed), | ||
- | last item. Positive numbers are counted from the first occurence | + | last item. Positive numbers are counted from the first occurrence |
'' | '' | ||
occurrence. | occurrence. | ||
- | + | In order to find sub-strings case-insensitively, you can use the '' | |
- | === replaceTextCaseInsensitively(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) === | + | argument to the constructor |
- | + | ||
- | Replaces every occurrence of '' | + | |
- | the object that the method is called on. The locale of '' | + | |
- | '' | + | |
- | + | ||
- | '' | + | |
=== reverse() === | === reverse() === | ||
Line 182: | Line 209: | ||
Methods to find text in other text. | Methods to find text in other text. | ||
- | === getPositionOfFirstOccurrence(string|Text $textToFind, int $offset) : int|false === | + | In all these methods, the locale of '' |
+ | match, if it is a '' | ||
+ | that the method is called on. | ||
+ | |||
+ | |||
+ | === getPositionOfFirstOccurrence(string|Text $search, int $offset) : int|false === | ||
Returns the position (in grapheme units) of the first occurrence of | Returns the position (in grapheme units) of the first occurrence of | ||
- | '' | + | '' |
- | Like: '' | + | Like: '' |
https:// | https:// | ||
*I think this method name is too long* | *I think this method name is too long* | ||
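To make "position in grapheme units" concrete, here is a small sketch that contrasts byte offsets with grapheme offsets using the existing ''strpos()'' and ''grapheme_strpos()'' functions. It does not use the proposed ''Text'' class, and the sample string is made up for the example.

```php
<?php
// Requires the intl extension.
// "e" followed by U+0301 (combining acute accent) forms a single grapheme.
$text = "cafe\u{0301} au lait";

// strpos() counts bytes: "cafe" (4) + U+0301 (2) + space (1) = 7.
$bytePos = strpos($text, 'au');

// grapheme_strpos() counts grapheme clusters: c, a, f, é, space = 5.
$graphemePos = grapheme_strpos($text, 'au');

var_dump($bytePos, $graphemePos);
```

The two offsets differ as soon as the text before the match contains multi-byte or combining characters, which is why the method is specified in grapheme units.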
- | === getPositionOfLastOccurrence(string|Text $textToFind, int $offset) : int|false === | + | === getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false === |
Line 198: | Line 230: | ||
- | === returnFromFirstOccurence(string|Text $textToFind) : Text|false === | + | === returnFromFirstOccurence(string|Text $search) : Text|false === |
- | Returns the '' | + | Returns the '' |
otherwise '' | otherwise '' | ||
- | Like: '' | + | Like: '' |
(https:// | (https:// | ||
- | === returnFromLastOccurence(string|Text $textToFind) : Text|false === | + | === returnFromLastOccurence(string|Text $search) : Text|false === |
Like '' | Like '' | ||
- | === contains(string|Text $string) === | + | === contains(string|Text $search) === |
- | Returns true if the text '' | + | Returns true if the text '' |
Like '' | Like '' | ||
- | === endsWith(string|Text $string) : bool === | + | === endsWith(string|Text $search) : bool === |
- | Could be constructed from '' | + | Compares the last '' |
+ | |||
+ | Case-insensitive comparison can be achieved by setting the right | ||
+ | '' | ||
+ | |||
+ | Could be constructed from '' | ||
'' | '' | ||
too. | too. | ||
- | === startsWith(string|Text $string) : bool === | + | === startsWith(string|Text $search) : bool === |
- | Compares the first '' | + | Compares the first '' |
- | locale and collator that are configured with '' | + | |
Case-insensitive comparison can be achieved by setting the right | Case-insensitive comparison can be achieved by setting the right | ||
- | '' | + | '' |
Could be constructed from '' | Could be constructed from '' | ||
Line 240: | Line 276: | ||
==== Comparing Text Objects ==== | ==== Comparing Text Objects ==== | ||
- | === compareWith(Text $other) : int === | + | === compareWith(Text $other, string $collator = NULL) : int === |
- | Uses the configured '' | + | Uses the configured '' |
- | '' | + | '' |
This same method is also used for comparing two Text objects as " | This same method is also used for comparing two Text objects as " | ||
- | handler" | + | handler" |
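A comparison driven by a collator can be sketched with the existing Intl ''Collator::compare()'' method. This is only an approximation of what ''compareWith()'' might do internally; the ''en_US'' locale and the chosen strengths are assumptions for the example.

```php
<?php
// Requires the intl extension.
$collator = new Collator('en_US');

// Default (tertiary) strength: a difference in case makes the strings unequal.
$withCase = $collator->compare('resume', 'Resume');

// Primary strength: case and accent differences are ignored.
$collator->setStrength(Collator::PRIMARY);
$baseLettersOnly = $collator->compare('resume', 'Résumé');

var_dump($withCase, $baseLettersOnly);
```

Like ''strcmp()'', the result is negative, zero, or positive, so it can be passed straight to ''usort()''.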
==== Case Conversions ==== | ==== Case Conversions ==== | ||
+ | These operations all use the collation that is configured on the Text object. | ||
=== toLower === | === toLower === | ||
Line 256: | Line 293: | ||
Converts the text to lower case, using the lower case variant of each | Converts the text to lower case, using the lower case variant of each | ||
Unicode code point that makes up the text. | Unicode code point that makes up the text. | ||
- | |||
=== toUpper === | === toUpper === | ||
+ | The same, but then to upper case. | ||
=== toTitle === | === toTitle === | ||
+ | The same, but then to title case (the first letter of each word). | ||
=== firstToLower === | === firstToLower === | ||
Converts the first grapheme in the text to a lower case variant. | Converts the first grapheme in the text to a lower case variant. | ||
- | |||
=== firstToUpper === | === firstToUpper === | ||
+ | The same, but then to upper case. | ||
=== firstToTitle === | === firstToTitle === | ||
+ | The same, but then to title case (the first letter of each word). | ||
+ | |||
+ | |||
+ | === wordsToLower === | ||
+ | |||
+ | Converts the first grapheme in every word to a lower case variant. | |||
+ | |||
+ | === wordsToUpper === | ||
+ | |||
+ | The same, but then to upper case. | ||
+ | |||
+ | === wordsToTitle === | ||
+ | |||
+ | The same, but then to title case (the first letter of each word). | ||
Line 301: | Line 350: | ||
- | === countWords() === | + | === getWordCount() === |
Pretty much a shortcut for:: | Pretty much a shortcut for:: | ||
Line 315: | Line 364: | ||
These functions return an iterator that can be used to iterate over the text. | These functions return an iterator that can be used to iterate over the text. | ||
The values returned by the iterators are affected by the text's locale. | The values returned by the iterators are affected by the text's locale. | ||
+ | |||
+ | These are inspired by ICU4J' | ||
+ | (https:// | ||
+ | and Intl's create*Instance methods on '' | ||
+ | (https:// | ||
=== getCharacterIterator === | === getCharacterIterator === | ||
+ | Returns an Iterator that locates boundaries between logical characters. | ||
+ | Because of the structure of the Unicode encoding, a logical character may be | ||
+ | stored internally as more than one Unicode code point. (A with an umlaut may | ||
+ | be stored as an ' | ||
+ | example, but the user still thinks of it as one character.) This iterator | ||
+ | allows various processes (especially text editors) to treat as characters the | ||
+ | units of text that a user would think of as characters, rather than the units | ||
+ | of text that the computer sees as " | ||
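The logical-character boundaries described above can be observed today with Intl's ''IntlBreakIterator::createCharacterInstance()''. The sketch assumes this is the Intl class the text refers to; the sample string and locale are illustrative, and the proposed ''getCharacterIterator()'' may behave differently.

```php
<?php
// Requires the intl extension.
// 'A' followed by U+0308 (combining diaeresis), then 'b': three code points.
$text = "A\u{0308}b";

$iterator = IntlBreakIterator::createCharacterInstance('en_US');
$iterator->setText($text);

// getPartsIterator() yields the substrings between consecutive boundaries.
$graphemes = iterator_to_array($iterator->getPartsIterator(), false);

var_dump(strlen($text), count($graphemes));
```

The string is four bytes and three code points long, but the iterator reports two user-perceived characters.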
+ | === getWordIterator === | ||
- | === getLineIterator === | + | Returns an Iterator that locates boundaries between words. This is useful |
+ | for double-click selection or "find whole words" searches. This type of | ||
+ | iterator makes sure there is a boundary position at the beginning and end | ||
+ | of each legal word. (Numbers count as words, too.) Whitespace and punctuation | ||
+ | are kept separate from real words. | ||
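A word-boundary iterator of this kind already exists in Intl as ''IntlBreakIterator::createWordInstance()''. The sketch below uses it to show that whitespace and punctuation come back as separate parts. The filtering step and the sample sentence are additions for this example and are not part of the proposal.

```php
<?php
// Requires the intl extension.
$text = 'Order 66, executed.';

$iterator = IntlBreakIterator::createWordInstance('en_US');
$iterator->setText($text);

// Every part between two boundaries, including spaces and punctuation.
$parts = iterator_to_array($iterator->getPartsIterator(), false);

// Keep only parts containing a letter or a digit (numbers count as words).
$words = array_values(array_filter(
    $parts,
    fn(string $part): bool => preg_match('/[\p{L}\p{N}]/u', $part) === 1
));

print_r($words);
```

Concatenating ''$parts'' gives back the original text, so nothing is lost; ''$words'' is just a filtered view of it.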
+ | === getLineIterator === | ||
+ | Returns an Iterator that locates positions where it is legal for a text | ||
+ | editor to wrap lines. This is similar to word breaking, but not the same: | ||
+ | punctuation and whitespace are generally kept with words (you don't want a | ||
+ | line to start with whitespace, for example), and some special characters can | ||
+ | force a position to be considered a line-break position or prevent a position | ||
+ | from being a line-break position. | ||
=== getSentenceIterator === | === getSentenceIterator === | ||
+ | Returns an Iterator that locates boundaries between sentences. | ||
=== getTitleIterator === | === getTitleIterator === | ||
- | + | Returns an Iterator that locates boundaries between title breaks. | |
- | + | ||
- | === getWordIterator === | + | |
Line 344: | Line 415: | ||
=== transliterate(string $transliterationString) === | === transliterate(string $transliterationString) === | ||
+ | Transliterates the content of the '' | ||
+ | specified in the '' | ||
+ | There are a few constants for specific and often used cases, such as creating | ||
+ | an ASCII transliterated version of any Text: | ||
- | === transliterate(\Intl\Transliterator $transliterator) === | + | - const Text:: |
+ | any script to Latin, and also strips all the accents. | ||
+ | - const Text:: | ||
+ | any script to Latin, but does not remove the accents. | ||
- | With the first one being a "simple" | + | - const Text:: |
- | Transliterator for more complex cases. | + | the transliteration string '' |
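For orientation, the kind of transliteration strings such constants could stand for can be tried out with the existing ''transliterator_transliterate()'' function. Both rule strings below are assumptions chosen for this example; the actual values of the proposed constants are not shown in this revision.

```php
<?php
// Requires the intl extension. Both rule strings are illustrative assumptions.
$removeAccents = 'NFD; [:Nonspacing Mark:] Remove; NFC';
$toAscii       = 'Any-Latin; Latin-ASCII';

// Decompose, drop the combining marks, recompose.
$plain = transliterator_transliterate($removeAccents, 'Crème brûlée');

// Convert any script to Latin, then reduce to ASCII.
$ascii = transliterator_transliterate($toAscii, 'Ελλάδα');

var_dump($plain, $ascii);
```

The first call only strips accents from text that is already Latin; the second also converts from other scripts, which is usually what generating ASCII slugs needs.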
- | Should we add shortcuts for a set of often used ones, such as '' | + | ===== Implementation Details ===== |
- | think so, as it's the majority use case. | + | |
+ | The functionality described in this RFC is mostly implemented using | |||
+ | functionality from the ICU library, which is also used by the Intl extension. | ||
- | === toLatin === | + | In order for PHP to continue to work on as wide a range of platforms and | ||
- | + | distributions, | |
- | Converts any script | + | Linux distributions' |
- | + | which this functionality | |
- | + | ||
- | === removeAccents === | + | |
- | + | ||
- | Removes | + | |
- | + | ||
- | A shortcut for the transliteration string | + | |
- | suitable one, which I believe | + | |
- | NFC."'' | + | |
===== Backward Incompatible Changes ===== | ===== Backward Incompatible Changes ===== | ||
- | Introducing a new class could impact code bases that already use this class | + | Introducing a new '' |
- | name. But as PHP owns the global namespace, this should not deter us from | + | class name. But as PHP owns the global namespace, this should not deter us |
- | adding such a code class. | + | from adding such a code class. |
===== Proposed PHP Version(s) ===== | ===== Proposed PHP Version(s) ===== | ||
Line 387: | Line 457: | ||
===== Open Issues ===== | ===== Open Issues ===== | ||
- | ==== Class Name ==== | ||
- | I have currently picked " | + | ===== Questions and Answers ===== |
- | represent single words (strings). Alternatively, | + | |
- | " | + | ==== Why is this not a composer package? ==== |
+ | |||
+ | The goal of this RFC is that PHP users can always rely on performant text | ||
+ | processing capabilities. | ||
+ | Text processors written in PHP already exist, but suffer from performance | ||
+ | issues (PHP is slower than C), and are sometimes tailored to specific use | ||
+ | cases. By having them written in C, and utilising ICU's well-tested and often | |||
+ | updated rules and algorithms, both the performance and correctness issues will | ||
+ | be addressed. | ||
===== Future Scope ===== | ===== Future Scope ===== |
rfc/unicode_text_processing.txt · Last modified: 2022/12/21 11:48 by derick