rfc:unicode_text_processing
Differences
This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
rfc:unicode_text_processing [2022/11/09 16:47] – created first rough draft derick | rfc:unicode_text_processing [2024/09/11 14:16] (current) – derick | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== PHP RFC: Unicode Text Processing ====== | ====== PHP RFC: Unicode Text Processing ====== | ||
- | * Version: 0.9 | + | * Version: 0.9.2 |
- | * Date: 2022-11-09 | + | * Date: 2022-12-21 (Original date: 2022-12-15) |
* Author: Derick Rethans < | * Author: Derick Rethans < | ||
- | * Status: | + | * Status: Draft |
* First Published at: http:// | * First Published at: http:// | ||
- | |||
===== Introduction ===== | ===== Introduction ===== | ||
This RFC proposes to introduce a new class to make using and processing | This RFC proposes to introduce a new class to make using and processing | ||
- | (Unicode) text significantly more developer friendly compared to the wealth of | + | (Unicode) text significantly more developer friendly compared to the |
- | functionality that the intl extension provides. The goal is to make it easy for | + | wealth of functionality that the intl extension provides. The goal is to |
- | developers to do Unicode text processing correctly. The RFC does not aim to | + | create an API that developers |
- | introduce a class that does everything that the intl extension provides with | + | correctly, without having |
- | regards to Unicode | + | |
+ | Although PHP has decent maths features, it is sorely missing performant | ||
+ | Unicode text processing always available in the core. | ||
+ | |||
+ | ==== Definitions ==== | ||
+ | |||
+ | ^ Term ^ Description ^ | ||
+ | | Grapheme | A Unicode | ||
===== Proposal ===== | ===== Proposal ===== | ||
- | To introduce a new " | + | To introduce a new final " |
- | in the objects. | + | text stored in the objects. |
Methods on the class will all return a new (immutable) object. | Methods on the class will all return a new (immutable) object. | ||
+ | |||
+ | The proposal is to make the '' | ||
+ | mean that it is therefore always available to users. As the implementation | ||
+ | requires ICU, this would also mean that PHP will depend on the ICU library. | ||
+ | |||
+ | ==== Basics ==== | ||
+ | |||
+ | Text objects are constructed by passing a UTF-8 encoded string to the | ||
+ | constructor. | ||
+ | |||
+ | The '' | ||
+ | UTF-8 encoded string, which can be used by all existing PHP functions | ||
+ | that accept strings. | ||
+ | |||
+ | The internal representation of the text is UTF-16, as that's what ICU uses. | ||
+ | Unlike the PHP 6 approach, the conversion to/from the internal | ||
+ | representation only happens on the boundaries: UTF-8 to UTF-16 through | ||
+ | the constructor, | ||
+ | |||
+ | There are multiple groups of methods indicated below. Some are to | ||
+ | represent PHP's existing string functions (substr, wordwrap, etc.), but | ||
+ | with meaningful names. | ||
+ | |||
+ | Design Goals: | ||
+ | |||
+ | * keep it simple | ||
+ | * default behaviour should be the most expected | ||
+ | * prefer a method per function, instead of allowing the behaviour of a method to be changed through (optional) arguments. | ||
+ | * operations are on **graphemes** | ||
+ | * no redundant methods that can be constructed from other methods, unless they already exist in PHP, or are frequently used | ||
+ | * more as we discuss this... | ||
+ | |||
+ | Non Design Goals: | ||
+ | |||
+ | * introduce every feature of the intl extension | ||
+ | |||
+ | Each section below contains a list of expected methods. This list is | ||
+ | currently not exhaustive. Please join the discussion on the mailing list | ||
+ | to suggest modifications or additions, keeping the design goals in mind. | ||
+ | |||
+ | If an argument to any of the methods is listed as '' | ||
+ | passing in a '' | ||
+ | the passed value with '' | ||
+ | from the Text object that this method is called on is also used for this new | ||
+ | wrapped value, if necessary. | ||
+ | |||
+ | ==== Locales, Collators, and Internationalisation ==== | ||
+ | |||
+ | By default each string will have the " | ||
+ | associated with it, but it is possible to configure a specific locale and | ||
+ | collator by using the '' | ||
+ | addition to the locale, and affects sorting and finding operations. | ||
+ | |||
+ | The '' | ||
+ | name: | ||
+ | https:// | ||
+ | |||
+ | The methods on the Text object all use the '' | ||
+ | |||
+ | For example, the locale (and collation) name '' | ||
+ | case-insensitive sorting ('' | ||
+ | The format of this locale/ | ||
+ | |||
+ | Numerical order collation (such as PHP's '' | ||
+ | adding the '' | ||
+ | (case-sensitive German (''' | ||
+ | |||
+ | Other options are described in BCP47: | ||
+ | https:// | ||
+ | and defaults at http:// | ||
+ | |||
+ | Building a locale/ | ||
+ | '' | ||
+ | of collations. The class performs the same function as '' | ||
+ | (https:// | ||
+ | descriptive methods to set collation properties. The reason for a separate | ||
+ | class is so that you don't have to depend on the '' | ||
+ | make it more developer-friendly. It converts the configured options to a | ||
+ | string, which can then be used in any location where '' | ||
+ | used in the function signatures to the methods on the '' | ||
+ | |||
+ | |||
+ | ==== Construction ==== | ||
+ | |||
+ | This section lists all the methods that construct a Text object. | ||
+ | |||
+ | === __construct(string $text, string $collation = ' | ||
The constructor takes a UTF-8 encoded text, and stores this in an internal | The constructor takes a UTF-8 encoded text, and stores this in an internal | ||
structure. The constructor will also convert the given text to Unicode | structure. The constructor will also convert the given text to Unicode | ||
- | Canonical Form. Passing in non-well-formed UTF-8 will result in an | + | Canonical Form (also called Normalisation Form C, or NFC). Passing in |
- | `InvalidEncodingException`. The constructor will also strip out a BOM | + | non-well-formed UTF-8 will result in an '' |
- | (Byte-Order-Mark) character, if present. | + | The constructor will also strip out a BOM (Byte-Order-Mark) character, |
+ | if present. | ||
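The constructor steps described above (reject malformed UTF-8, strip a BOM, normalise to NFC) can be approximated today with the intl extension. This is only a sketch: ''prepareText()'' is a hypothetical helper name, and ''InvalidArgumentException'' stands in for the dedicated exception that the RFC names.

```php
<?php
/**
 * Hypothetical helper that mirrors the constructor steps described above.
 * Requires the intl extension for the Normalizer class.
 */
function prepareText(string $utf8): string
{
    // Reject input that is not well-formed UTF-8. preg_match() with the
    // "u" modifier does not return 1 for malformed input.
    if (preg_match('//u', $utf8) !== 1) {
        // Stand-in for the exception type proposed in the RFC.
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }

    // Strip a leading UTF-8 byte-order mark (EF BB BF), if present.
    if (strncmp($utf8, "\xEF\xBB\xBF", 3) === 0) {
        $utf8 = substr($utf8, 3);
    }

    // Convert to Normalisation Form C (NFC).
    return Normalizer::normalize($utf8, Normalizer::FORM_C);
}
```

With this helper, ''"e\u{0301}"'' (an "e" followed by a combining acute accent) comes back as the single precomposed code point U+00E9.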
- | By default each string will have the " | ||
- | is possible to configure a specific locale by using the `$locale` argument in | ||
- | the constructor. | ||
- | The ``__toString()`` method collapses the internally stored text into a | + | === static Text:: |
- | UTF-8 encoded | + | |
- | accept strings. | + | |
- | Methods fall into multiple groups. Some to implement PHP's existing | + | The Symfony String package offers a static function |
- | string functions | + | through a single-character function |
- | design goal is to rather create more methods, than allowing | + | file scope (with '' |
- | methods to be changed through | + | |
- | The internal representation would be UTF-16, as that's what ICU uses. Unlike | + | This method solves a similar use, so that you can shorten '' |
- | the PHP 6 approach, the conversion | + | '' |
- | happens on the boundaries: UTF-8 to UTF-16 through the constructor, | + | For example with '' |
- | reverse through the ``__toString()`` method. | + | |
- | ==== Groups of Methods ==== | ||
- | Each section will contain a list of expected methods, which from the start | + | === static Text:: |
- | might not be exhaustive. Please join the discussion on the mailing list to | + | |
- | suggest modifications or additions, keeping the design goals in mind. | + | |
- | === Construction === | + | Creates a new Text object by concatenating all the given string/Text arguments |
+ | into a new Text object. | ||
- | ``__construct(string | + | If the '' |
+ | '' | ||
- | === Standard String Operations === | ||
- | All string | + | === static Text:: |
- | character, a character with diacritics, a character with space modifiers, or | + | |
- | an emojis. | + | |
- | I am not sure if these should accept `string|Text` or only `Text` as | + | Creates a new Text object by looping over all the string/Text elements in |
- | `$textToFind`. Accepting a string makes for a easier to use API, but with the | + | '' |
- | caveat that we internally need to convert it pretty much to a `Text` object | + | |
- | any way. | + | |
- | ``splitByText(Text $separator, | + | The semantics are like: '' |
- | Returns an array of Text objects, each of which is a substring of `$this`, | + | |
- | formed by splitting it on boundaries formed by the text `$separator`. | + | |
- | Like `explode($separator, $limit)`. | + | If the '' |
+ | element from the '' | ||
+ | created object. | ||
- | ``static Text:: | + | If the '' |
- | Creates a new Text object | + | '' |
- | `$elements`, | + | |
- | Semantics like `implode(string | + | If the iterator produces a non-string/Text element, then a '' |
+ | will be thrown. | ||
- | ``subString(int $offset, int $length) : Text|false`` | + | ==== Standard String Operations ==== |
- | Returns a sub-string, starting at `$offset` for `$length` graphemes. | + | |
- | Like: `grapheme_substr($this, | ||
- | https:// | ||
- | ``trimLeft`` | + | === split(string|Text $separator, int $limit = PHP_INT_MAX) : array(Text) === |
- | ``trimRight`` | + | |
- | ``trim`` | + | |
- | Removes white space at the start of, the end of, or both sides of the text. | + | |
- | Like: `ltrim`, `rtrim`, and `trim`, but with using the unicode definition | + | Returns an array of Text objects, each of which is a substring of '' |
- | of what white space is. https:// | + | formed by splitting it on boundaries formed by the text '' |
- | ``wrap(int $maxWidth, bool $cutLongWords = false) : array(Text)`` | + | Like '' |
- | Wraps a text to a given number of graphemes into an array of Text objects. | + | |
- | Like: `wordwrap`, but based on graphemes and returning an array instead of | ||
- | inserting a break character. | ||
- | If `$cutLongWords` is set, no Text element will be larger than | + | === subString(int |
- | `$maxWidth`. | + | |
- | ``replaceText(Text $search, Text $replace)`` ?? | + | Returns a sub-string, starting at '' |
- | ``replaceTextCaseInsensitively(Text $search, Text $replace)`` ?? | + | Like: '' |
- | Will have to use locales too. | + | https:// |
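To illustrate why the offset and length are counted in graphemes, the following sketch compares the existing ''grapheme_substr()'' function from the intl extension with the byte-based ''substr()''. It shows the behaviour the method is modelled on, not the proposed implementation.

```php
<?php
// "a", then "e" + COMBINING ACUTE ACCENT (one grapheme, two code points),
// then "b" and "c".
$text = "ae\u{0301}bc";

// Grapheme-based: offset 1, length 2 yields the accented "e" followed by "b".
$byGrapheme = grapheme_substr($text, 1, 2);

// Byte-based: offset 1, length 2 yields "e" plus the first byte of the
// two-byte accent, which is no longer well-formed UTF-8.
$byByte = substr($text, 1, 2);
```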
- | ``reverse()`` | + | === trimStart, trimEnd, trim : \Text === |
- | Reverses a text, taking into account grapheme boundaries. | + | |
- | === Finding | + | Removes white space at the start of, the end of, or both sides of the text. |
+ | |||
+ | Like: '' | ||
+ | of what white space is. https:// | ||
+ | |||
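A rough sketch of Unicode-aware trimming with PCRE follows. The function names are hypothetical, and the character class ''[\p{Z}\s]'' only approximates the Unicode White_Space property; it is an assumption of this example, not the definition the RFC links to.

```php
<?php
/** Hypothetical helper: remove Unicode white space at the start. */
function trimStartUnicode(string $s): string
{
    // \p{Z} covers Unicode separators such as NO-BREAK SPACE and
    // IDEOGRAPHIC SPACE; \s adds tabs and line breaks.
    return preg_replace('/^[\p{Z}\s]+/u', '', $s) ?? $s;
}

/** Hypothetical helper: remove Unicode white space at the end. */
function trimEndUnicode(string $s): string
{
    return preg_replace('/[\p{Z}\s]+\z/u', '', $s) ?? $s;
}

/** Hypothetical helper: remove Unicode white space on both sides. */
function trimUnicode(string $s): string
{
    return trimEndUnicode(trimStartUnicode($s));
}
```

PHP's built-in ''trim()'' leaves a leading NO-BREAK SPACE (U+00A0) in place, because its default character list only contains ASCII white space and the NUL byte.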
+ | === wrap(int $maxWidth, bool $cutLongWords = false) : array(Text) === | ||
+ | |||
+ | Wraps a text to a given number of graphemes per line, into an array of Text | ||
+ | objects. | ||
+ | |||
+ | Like: '' | ||
+ | inserting a break character. | ||
+ | |||
+ | If '' | ||
+ | '' | ||
+ | |||
+ | === reverse() : \Text === | ||
+ | |||
+ | Reverses a text, taking into account grapheme boundaries. | ||
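Reversing on grapheme boundaries can be sketched with PCRE's ''\X'' escape, which matches one extended grapheme cluster. ''reverseGraphemes()'' is a hypothetical name used for illustration only.

```php
<?php
/** Hypothetical helper: reverse a UTF-8 string grapheme by grapheme. */
function reverseGraphemes(string $s): string
{
    // \X matches one extended grapheme cluster, so a base character
    // stays together with its combining marks.
    preg_match_all('/\X/u', $s, $matches);

    return implode('', array_reverse($matches[0]));
}
```

The byte-based ''strrev()'' would instead reverse the individual bytes, detaching the accent from its base character and producing invalid UTF-8.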
+ | |||
+ | |||
+ | ==== Finding Text in Text ==== | ||
Methods to find text in other text. | Methods to find text in other text. | ||
- | ``getPositionOfFirstOccurrence(string|Text $textToFind, int $offset) : int|false`` | + | In all these methods, the locale and collator of '' |
- | Returns | + | sub-strings that match, if it is a '' |
- | `$textToFind` starting at the (grapheme) `$offset`, or false if not found. | + | collator that are embedded |
- | Like: `grapheme_strpos($this, | ||
- | https:// | ||
- | ``getPositionOfLastOccurrence(string|Text $textToFind, int $offset) : int|false`` | + | === getPositionOfFirstOccurrence(string|Text $search, int $offset) : int|false |
- | Like `getPositionOfFirstOccurrence` but then from the end of the text. | + | |
- | ``returnFromFirstOccurence(string|Text $textToFind) : Text|false`` | + | Returns the position |
- | Returns | + | '' |
- | otherwise `false`. | + | |
- | Like: `grapheme_strstr($this, $textToFind)` | + | Like: '' |
- | (https:// | + | https:// |
- | ``returnFromLastOccurence(string|Text $textToFind) | + | Alternative suggested names: '' |
- | Like `returnFromFirstOccurence` but then from the end of the text. | + | |
- | `compareWith(Text $other) : int` (or also the Text's compare handler) | ||
- | Needs to use a locale, and sorting text strength (to avoid all the many | ||
- | options)... perhaps use Intl's collator instead? Or have two methods? | ||
- | `compareWithNaturalOrder(Text $other) : int` | + | === getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false === |
- | Like `strnatcmp`/ | + | |
- | `compareWithCollator` with a `$collator` with the NUMERIC_COLLATION option | + | |
- | turned on. | + | |
- | `compareWithCollator(Text $other, \Intl\Collator $collator) : int` | ||
- | ``contains(Text $string)`` | + | Like '' |
- | Returns true if the text `$string` can be found in the text. | + | |
- | Like `str_contains`. | + | Alternative suggested names: '' |
- | ``endsWith(Text $string)`` | ||
- | ``startsWith(Text $string)`` | + | === returnFromFirstOccurence(string|Text $search) : Text|false === |
+ | Returns the '' | ||
+ | otherwise '' | ||
- | Case-insensitive variants are not included. If you need this, convert the | + | Like: '' |
- | text(s) with ``toLower`` first. Or allow for using Intl's Collator? That'd be | + | (https:// |
- | nicer... | + | |
- | === Case Conversions === | + | Alternative suggested names: '' |
- | ``toLower`` | ||
- | Converts the text to lower case, using the lower case variant of each | ||
- | Unicode code point that makes up the text. | ||
- | ``toUpper`` | + | === returnFromLastOccurence(string|Text $search) : Text|false === |
- | ``toTitle`` | + | Like '' |
- | i | + | |
- | ``firstToLower`` | + | |
- | Converts | + | |
- | ``firstToUpper`` | + | Alternative suggested names: '' |
- | ``firstToTitle`` | ||
+ | === contains(string|Text $search) === | ||
- | === Counting === | + | Returns true if the text '' |
- | `getByteCount()` | + | Like '' |
- | Returns the size in bytes that the text will take when converted to UTF-8. | + | |
- | `length()` | ||
- | `getCharacterCount()` | ||
- | Returns the number of characters that make up the text. A character (also | ||
- | sometimes call a grapheme) consists of the base-character, | ||
- | combining diacritics. Unicode calls these " | ||
- | http:// | ||
- | `getCodePointCount()` | + | === endsWith(string|Text $search) : bool === |
- | Returns the number of Unicode code points that make up the text. | + | |
- | (Not sure if we should add this, as it doesn' | + | |
- | `countWords()` | + | Compares the last '' |
- | Pretty much a shortcut for:: | + | |
+ | Case-insensitive comparison can be achieved by setting the right | ||
+ | '' | ||
+ | |||
+ | Could be constructed from '' | ||
+ | '' | ||
+ | too. | ||
+ | |||
+ | |||
+ | === startsWith(string|Text $search) : bool === | ||
+ | |||
+ | Compares the first '' | ||
+ | |||
+ | Case-insensitive comparison can be achieved by setting the right | ||
+ | '' | ||
+ | |||
+ | Could be constructed from '' | ||
+ | but it's an often required method, and standard PHP has it | ||
+ | too. | ||
+ | |||
+ | === replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) : \Text === | ||
+ | |||
+ | Replaces occurrences of '' | ||
+ | |||
+ | The '' | ||
+ | items are being replaced. The '' | ||
+ | argument that is being replaced (0-indexed), | ||
+ | last item. Positive numbers are counted from the first occurrence of | ||
+ | '' | ||
+ | occurrence. | ||
+ | |||
+ | In order to find sub-strings case-insensitively, | ||
+ | argument to '' | ||
+ | |||
+ | |||
+ | ==== Comparing Text Objects ==== | ||
+ | |||
+ | === compareWith(Text $other, string $collation = NULL) : int === | ||
+ | |||
+ | Uses the configured '' | ||
+ | '' | ||
+ | |||
+ | This same method is also used for comparing two Text objects as " | ||
+ | handler" | ||
+ | taken into account. | ||
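As a sketch of how a collation changes the outcome of a comparison, the example below uses the existing ''Collator'' class from the intl extension. The locale ''en'' and the chosen strength and attribute settings are assumptions for illustration; the RFC itself configures collation through a string, which is not shown here.

```php
<?php
// Default (tertiary) strength: case differences are significant.
$default = new Collator('en');

// Secondary strength ignores case, but still distinguishes accents.
$caseInsensitive = new Collator('en');
$caseInsensitive->setStrength(Collator::SECONDARY);

// Numeric collation compares runs of digits by value, like strnatcmp().
$numeric = new Collator('en');
$numeric->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);

$caseSensitiveResult   = $default->compare('apple', 'Apple');         // non-zero
$caseInsensitiveResult = $caseInsensitive->compare('apple', 'Apple'); // 0
$numericResult         = $numeric->compare('file2', 'file10');        // negative
$plainResult           = $default->compare('file2', 'file10');        // positive
```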
+ | |||
+ | === equals(Text $other, string $collation = NULL) : boolean === | ||
+ | |||
+ | Alias for '' | ||
+ | |||
+ | |||
+ | ==== Case Conversions ==== | ||
+ | |||
+ | These operations all use the collation that is configured on the Text object. | ||
+ | |||
+ | === toLower : \Text === | ||
+ | |||
+ | Converts the text to lower case, using the lower case variant of each | ||
+ | Unicode code point that makes up the text. | ||
+ | |||
+ | Example: '' | ||
+ | |||
+ | |||
+ | === toUpper : \Text === | ||
+ | |||
+ | The same, but then to upper case. | ||
+ | |||
+ | Example: '' | ||
+ | |||
+ | |||
+ | === toTitle : \Text === | ||
+ | |||
+ | The same, but then to title case (the first letter of each word). | ||
+ | |||
+ | Example: '' | ||
+ | |||
+ | |||
+ | === firstToLower : \Text === | ||
+ | |||
+ | Converts the first grapheme in the text to a lower case variant. | ||
+ | |||
+ | Example: '' | ||
+ | |||
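A minimal sketch of lower-casing only the first grapheme follows, assuming the mbstring extension is available. ''firstGraphemeToLower()'' is a hypothetical name, and unlike the proposed method this version ignores the configured locale and collation.

```php
<?php
/** Hypothetical helper: lower-case the first grapheme of a UTF-8 string. */
function firstGraphemeToLower(string $s): string
{
    // Split the first extended grapheme cluster (\X) from the rest.
    if (preg_match('/^(\X)(.*)$/us', $s, $m) !== 1) {
        return $s; // empty or malformed input is returned unchanged
    }

    return mb_strtolower($m[1], 'UTF-8') . $m[2];
}
```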
+ | |||
+ | === firstToUpper : \Text === | ||
+ | |||
+ | The same, but then to upper case. | ||
+ | |||
+ | Example: '' | ||
+ | |||
+ | |||
+ | |||
+ | === wordsToLower : \Text === | ||
+ | |||
+ | Converts the first grapheme in every word to a lower case variant. | ||
+ | |||
+ | Example: '' | ||
+ | |||
+ | |||
+ | === wordsToUpper : \Text === | ||
+ | |||
+ | The same, but then to upper case. | ||
+ | |||
+ | Example: '' | ||
+ | |||
+ | |||
+ | ==== Counting ==== | ||
+ | |||
+ | |||
+ | === getByteCount() : int === | ||
+ | |||
+ | Returns the size in bytes that the text will take when converted to UTF-8. | ||
+ | |||
+ | |||
+ | === length(), getCharacterCount(): | ||
+ | |||
+ | Returns the number of characters that make up the text. A character (also | ||
+ | sometimes called a grapheme) consists of the base-character, | ||
+ | combining diacritics. Unicode calls these " | ||
+ | http:// | ||
+ | |||
+ | |||
+ | === getCodePointCount() : int === | ||
+ | |||
+ | Returns the number of Unicode code points that make up the text. | ||
+ | (Not sure if we should add this, as it doesn' | ||
+ | |||
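The three counts in this section can differ for the same text. The sketch below shows this with standard PHP functions only; it illustrates the distinction and is not the proposed implementation (the intl function ''grapheme_strlen()'' is the existing equivalent for the grapheme count).

```php
<?php
// "e" + COMBINING ACUTE ACCENT (one grapheme), followed by a plain "x".
$text = "e\u{0301}x";

// Size in bytes of the UTF-8 encoding: 1 + 2 + 1.
$bytes = strlen($text);

// Number of Unicode code points: "e", the combining accent, and "x".
$codePoints = preg_match_all('/./us', $text);

// Number of extended grapheme clusters: the accented "e" and "x".
$graphemes = preg_match_all('/\X/u', $text);
```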
+ | |||
+ | === getWordCount() : int === | ||
+ | |||
+ | Pretty much a shortcut for:: | ||
$count = 0; | $count = 0; | ||
foreach ($text-> | foreach ($text-> | ||
- | Uses the locale, just like the iterators. | + | Uses the locale, just like the iterators. |
- | === Iterators === | + | ==== Iterators |
These functions return an iterator that can be used to iterate over the text. | These functions return an iterator that can be used to iterate over the text. | ||
The results of the iterators are affected by the text's locale. | The results of the iterators are affected by the text's locale. | ||
- | ``getCharacterIterator`` | + | These are inspired by ICU4J' |
+ | (https:// | ||
+ | and Intl's create*Instance methods on '' | ||
+ | (https:// | ||
- | ``getLineIterator`` | + | === getCharacterIterator : \Iterator === |
- | ``getSentenceIterator`` | + | Returns an Iterator that locates boundaries between logical characters. |
+ | Because of the structure of the Unicode encoding, a logical character may be | ||
+ | stored internally as more than one Unicode code point. (A with an umlaut may | ||
+ | be stored as an ' | ||
+ | example, but the user still thinks of it as one character.) This iterator | ||
+ | allows various processes (especially text editors) to treat as characters the | ||
+ | units of text that a user would think of as characters, rather than the units | ||
+ | of text that the computer sees as " | ||
- | ``getTitleIterator`` | + | === getWordIterator : \Iterator === |
- | ``getWordIterator`` | + | Returns an Iterator that locates boundaries between words. This is useful |
+ | for double-click selection or "find whole words" searches. This type of | ||
+ | iterator makes sure there is a boundary position at the beginning and end | ||
+ | of each legal word. (Numbers count as words, too.) Whitespace and punctuation | ||
+ | are kept separate from real words. | ||
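The word-boundary behaviour described above can be observed with the existing ''IntlBreakIterator'' class from the intl extension. This is a sketch under the assumption of an ''en'' locale; it shows how whitespace and punctuation end up as separate parts, and how the rule status can be used to count only real words.

```php
<?php
$text = 'Hello, world';

$iterator = IntlBreakIterator::createWordInstance('en');
$iterator->setText($text);

// Each part is the text between two consecutive word boundaries.
$parts = iterator_to_array($iterator->getPartsIterator(), false);

// Count only the parts whose rule status marks them as words or numbers;
// whitespace and punctuation have a status below WORD_NONE_LIMIT.
$iterator->first();
$wordCount = 0;
while ($iterator->next() !== IntlBreakIterator::DONE) {
    if ($iterator->getRuleStatus() >= IntlBreakIterator::WORD_NONE_LIMIT) {
        $wordCount++;
    }
}
```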
+ | === getLineIterator : \Iterator === | ||
- | === Transliteration === | + | Returns an Iterator that locates positions where it is legal for a text |
+ | editor to wrap lines. This is similar to word breaking, but not the same: | ||
+ | punctuation and whitespace are generally kept with words (you don't want a | ||
+ | line to start with whitespace, for example), and some special characters can | ||
+ | force a position to be considered a line-break position or prevent a position | ||
+ | from being a line-break position. | ||
+ | |||
+ | === getSentenceIterator : \Iterator === | ||
+ | |||
+ | Returns an Iterator that locates boundaries between sentences. | ||
+ | |||
+ | |||
+ | === getTitleIterator : \Iterator === | ||
+ | |||
+ | Returns an Iterator that locates boundaries between title breaks. | ||
+ | |||
+ | |||
+ | ==== Transliteration | ||
Converts text between scripts and other properties. | Converts text between scripts and other properties. | ||
- | ``transliterate(string $transliterationString)`` | ||
- | ``transliterate(\Intl\Transliterator | + | === transliterate(string |
- | With the first one being a " | + | Transliterates |
- | Transliterator for more complex cases. | + | specified in the '' |
- | Should we add shortcuts | + | There are a few constants |
- | think so, as it's the majority use case. | + | an ASCII transliterated version of any Text: |
- | ``toLatin`` | + | - const Text:: |
- | Converts | + | |
- | ``removeAccents`` | + | - const Text:: |
- | Removes | + | any script to Latin, but does not remove |
- | A shortcut for the transliteration string | + | - const Text:: |
- | suitable one, which I believe is `"NFD; [: | + | the transliteration string |
- | NFC."`. | + | |
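As an illustration of transliteration strings, the sketch below uses the existing ''Transliterator'' class from the intl extension. The two rule strings are commonly used ICU transform IDs chosen for this example; they are assumptions, and not necessarily the values behind the constants listed above.

```php
<?php
// Decompose, drop the combining marks, then recompose.
$removeAccents = Transliterator::create('NFD; [:Nonspacing Mark:] Remove; NFC');

// Convert any script to Latin, then reduce Latin text to plain ASCII.
$toAscii = Transliterator::create('Any-Latin; Latin-ASCII');

$plain = $removeAccents->transliterate('Crème brûlée');
$ascii = $toAscii->transliterate('Ærøskøbing');
```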
+ | ===== Implementation Details ===== | ||
+ | |||
+ | The functionality as is described in this RFC is mostly implemented by using | ||
+ | functionality from the ICU library, which is also used by the Intl extension. | ||
+ | |||
+ | In order for PHP to continue to work on as wide a range of platforms and | ||
+ | distributions, | ||
+ | Linux distributions' | ||
+ | which this functionality is implemented. | ||
===== Backward Incompatible Changes ===== | ===== Backward Incompatible Changes ===== | ||
- | Introducing a new class could impact code bases that already use this class | + | Introducing a new '' |
- | name. But as PHP owns the global namespace, this should not deter us from | + | class name. But as PHP owns the global namespace, this should not deter us |
- | adding such a code class. | + | from adding such a class. |
===== Proposed PHP Version(s) ===== | ===== Proposed PHP Version(s) ===== | ||
Line 261: | Line 494: | ||
===== Open Issues ===== | ===== Open Issues ===== | ||
- | ==== Class Name ==== | + | - Add a method like mb_strcut, to extract a string of at most a given number of bytes from a position, as encoded in UTF-8. |
- | I have currently picked " | + | ===== Questions and Answers ===== |
- | represent single words (strings). Alternatively, | + | |
- | " | + | |
+ | ==== Why is this not a composer package? ==== | ||
+ | |||
+ | The goal of this RFC is that PHP users can always rely on performant text | ||
+ | processing capabilities. | ||
+ | |||
+ | Text processors written in PHP already exist, but suffer from performance | ||
+ | issues (PHP is slower than C), and are sometimes tailored to specific use | ||
+ | cases. By having them written in C, and utilising ICU's well tested and often | ||
+ | updated rules and algorithms, both the performance and correctness issues will | ||
+ | be addressed. | ||
===== Future Scope ===== | ===== Future Scope ===== | ||
Line 283: | Line 524: | ||
===== Implementation ===== | ===== Implementation ===== | ||
- | After the project is implemented, | + | After the project is implemented, |
- the version(s) it was merged into | - the version(s) it was merged into | ||
- a link to the git commit(s) | - a link to the git commit(s) | ||
Line 295: | Line 536: | ||
Nothing rejected yet. | Nothing rejected yet. | ||
+ | |||
+ | |||
+ | ===== Changes ===== | ||
+ | |||
+ | 0.9.2 — 2022-12-21 | ||
+ | |||
+ | * Tim Düsterhus: Added concat and equals methods; changed join to accept an iterator. | ||
+ | * Enhance explanation of locales and collations, and standardize on using '' | ||
+ | |||
+ | 0.9.1 — 2022-12-16 | ||
+ | |||
+ | * Tim Düsterhus: Removed firstToTitle/ | ||
+ | * Paul Crovella: Clarify which normalisation is being used. | ||
+ | * Daniel Wolfe: Update trimLeft/ |
rfc/unicode_text_processing.1668012433.txt.gz · Last modified: 2022/11/09 16:47 by derick