rfc:unicode_text_processing
Differences
This shows you the differences between two versions of the page.
rfc:unicode_text_processing [2022/11/09 16:47] – created first rough draft derick | rfc:unicode_text_processing [2022/11/21 15:14] – derick | ||
---|---|---|---|
Line 3: | Line 3: | ||
* Date: 2022-11-09 | * Date: 2022-11-09 | ||
* Author: Derick Rethans < | * Author: Derick Rethans < | ||
- | * Status: | + | * Status: Draft |
* First Published at: http:// | * First Published at: http:// | ||
Line 10: | Line 10: | ||
This RFC suggests to introduce a new class to make using and processing | This RFC suggests to introduce a new class to make using and processing | ||
- | (Unicode) text significantly more developer friendly compared to the wealth of | + | (Unicode) text significantly more developer friendly compared to the |
- | functionality that the intl extension provides. The goal is to make it easy for | + | wealth of functionality that the intl extension provides. The goal is to |
- | developers to do Unicode text processing correctly. The RFC does not aim to | + | create an API that developers |
- | introduce a class that does everything that the intl extension provides with | + | correctly, without having |
- | regards to Unicode | + | |
+ | ==== Definitions ==== | ||
+ | |||
+ | ^ Term ^ Description ^ | ||
+ | | Grapheme | A Unicode | ||
===== Proposal ===== | ===== Proposal ===== | ||
- | To introduce a new " | + | To introduce a new " |
- | in the objects. | + | stored |
Methods on the class will all return a new (immutable) object. | Methods on the class will all return a new (immutable) object. | ||
+ | ==== Basics ==== | ||
+ | |||
+ | Text objects are constructed by passing a UTF-8 encoded string to the | ||
+ | constructor. | ||
+ | |||
+ | The '' | ||
+ | UTF-8 encoded string, which can be used by all existing PHP functions | ||
+ | that accept strings. | ||
+ | |||
+ | The internal representation would be UTF-16, as that's what ICU uses. | ||
+ | Unlike the PHP 6 approach, the conversion to/from the internal | ||
+ | representation only happens on the boundaries: UTF-8 to UTF-16 through | ||
+ | the constructor, | ||
+ | |||
+ | There are multiple groups of methods indicated below. Some are to | ||
+ | represent PHP's existing string functions (substr, wordwrap, etc.), but | ||
+ | with meaningful names. | ||
+ | |||
+ | Design Goals: | ||
+ | |||
+ | * keep it simple | ||
+ | * default behaviour should be the most expected | ||
+ | * prefer a method per function, instead of allowing the behaviour of a method to be changed through (optional) arguments. | ||
+ | * operations are on **graphemes** | ||
+ | * no redundant methods that can be constructed from other methods, unless they already exist in PHP, or are frequently used | ||
+ | * more as we discuss this... | ||
+ | |||
+ | Non Design Goals: | ||
+ | |||
+ | * introduce every feature of the intl extension | ||
+ | |||
+ | Each section below contains a list of expected methods. This list is | ||
+ | currently not exhaustive. Please join the discussion on the mailing list | ||
+ | to suggest modifications or additions, keeping the design goals in mind. | ||
+ | |||
+ | If an argument to any of the methods is listed as '' | ||
+ | passing in a '' | ||
+ | the passed value with '' | ||
+ | object that this method is called on is also used for this new wrapped | ||
+ | value, if necessary. | ||
+ | |||
+ | ==== Locales and Internationalisation ==== | ||
+ | |||
+ | By default each string will have the " | ||
+ | but it is possible to configure a specific collator by using the | ||
+ | '' | ||
+ | a string describing an ICU locale name: | ||
+ | https:// | ||
+ | |||
+ | For example, the locale (or collation) name '' | ||
+ | case-insensitive sorting for the English locale. This will require | ||
+ | extensive documentation. | ||
+ | |||
+ | Numerical order collation (such as PHP's '' | ||
+ | by adding the '' | ||
+ | (case-sensitive German, with numerics in value order). | ||
+ | |||
+ | Other options are described in BCP47: | ||
+ | https:// | ||
+ | and defaults at http:// | ||
+ | |||
+ | Specifying the locale and collator will also be possible by passing in a | ||
+ | '' | ||
+ | (https:// | ||
+ | descriptive construction of a locale with all its options. | ||
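The collation behaviour described in this section can be tried out today with the existing intl ''Collator'' class. The following is only a sketch under that assumption: it sets the options through ''Collator'' methods rather than through a locale name, and it does not use the proposed ''Text'' API.

```php
<?php
// Sketch only: uses the existing intl Collator class, not the proposed Text API.
$collator = new Collator('en');

// Secondary strength ignores case differences (case is a tertiary difference).
$collator->setStrength(Collator::SECONDARY);
$caseInsensitive = $collator->compare('apple', 'APPLE');

// Numeric collation compares runs of digits by value, much like natural ordering.
$collator->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);
$numeric = $collator->compare('file2', 'file10');

var_dump($caseInsensitive); // 0: equal when case is ignored
var_dump($numeric);         // negative: "file2" sorts before "file10"
```

A locale name carrying the same options would presumably be translated into equivalent collator settings internally.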
+ | |||
+ | |||
+ | ==== Construction ==== | ||
+ | |||
+ | This section lists all the methods that construct a Text object. | ||
+ | |||
+ | === __construct(string $text, string $locale = ' | ||
The constructor takes a UTF-8 encoded text, and stores this in an internal | The constructor takes a UTF-8 encoded text, and stores this in an internal | ||
structure. The constructor will also convert the given text to Unicode | structure. The constructor will also convert the given text to Unicode | ||
Canonical Form. Passing in non-well-formed UTF-8 will result in an | Canonical Form. Passing in non-well-formed UTF-8 will result in an | ||
- | `InvalidEncodingException`. The constructor will also strip out a BOM | + | '' |
(Byte-Order-Mark) character, if present. | (Byte-Order-Mark) character, if present. | ||
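The constructor steps above (validation, BOM removal, normalisation) can be approximated with existing functions. This is a minimal sketch, not the RFC's implementation: ''prepareText'' is a hypothetical helper, ''InvalidArgumentException'' stands in for the exception the RFC would define, and "Canonical Form" is assumed to mean NFC.

```php
<?php
// Hypothetical helper, not part of the RFC: mimics the described constructor steps.
function prepareText(string $text): string
{
    // Reject input that is not well-formed UTF-8.
    if (!mb_check_encoding($text, 'UTF-8')) {
        // Stand-in exception type; the RFC would define its own.
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }

    // Strip a leading UTF-8 byte-order mark (bytes EF BB BF), if present.
    if (str_starts_with($text, "\xEF\xBB\xBF")) {
        $text = substr($text, 3);
    }

    // Assumption: "Canonical Form" is taken to mean NFC here.
    return Normalizer::normalize($text, Normalizer::FORM_C);
}

// BOM, then "e" followed by U+0301 COMBINING ACUTE ACCENT.
$prepared = prepareText("\xEF\xBB\xBFe\u{0301}");
var_dump($prepared === "\u{00E9}"); // true: BOM removed, composed to U+00E9
```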
- | By default each string will have the " | ||
- | is possible to configure a specific locale by using the `$locale` argument in | ||
- | the constructor. | ||
- | The ``__toString()`` method collapses the internally stored text into a | + | === static Text:: |
- | UTF-8 encoded | + | |
- | accept strings. | + | |
- | Methods fall into multiple groups. Some to implement PHP's existing | + | Creates a new Text object by concatenating |
- | string functions (substr, wordwrap, etc.), but with meaningful names. A | + | '' |
- | design goal is to rather create more methods, than allowing | + | |
- | methods to be changed through (optional) arguments. | + | |
- | The internal representation would be UTF-16, as that's what ICU uses. Unlike | + | Semantics like: '' |
- | the PHP 6 approach, the conversion to/from the internal representation only | + | |
- | happens on the boundaries: UTF-8 to UTF-16 through the constructor, | + | |
- | reverse through the ``__toString()`` method. | + | |
- | ==== Groups of Methods ==== | ||
- | Each section will contain a list of expected methods, which from the start | + | ==== Standard String Operations ==== |
- | might not be exhaustive. Please join the discussion on the mailing list to | + | |
- | suggest modifications or additions, keeping the design goals in mind. | + | |
- | === Construction === | ||
- | ``__construct(string $text, string | + | === split(string|Text $separator, int $limit = PHP_INT_MAX): array(Text) === |
- | === Standard String Operations === | + | Returns an array of Text objects, each of which is a substring of '' |
+ | formed by splitting it on boundaries formed by the text '' | ||
- | All string operators operate on **graphemes**, which are generally: a normal | + | Like '' |
- | character, a character with diacritics, a character with space modifiers, or | + | |
- | an emojis. | + | |
- | I am not sure if these should accept `string|Text` or only `Text` as | ||
- | `$textToFind`. Accepting a string makes for a easier to use API, but with the | ||
- | caveat that we internally need to convert it pretty much to a `Text` object | ||
- | any way. | ||
- | ``splitByText(Text $separator, int $limit = PHP_INT_MAX): array(Text)`` | + | === subString(int $offset, int $length) : Text|false === |
- | Returns an array of Text objects, each of which is a substring of `$this`, | + | |
- | formed by splitting it on boundaries formed by the text `$separator`. | + | |
- | Like `explode($separator, $limit)`. | + | Returns a sub-string, starting at '' |
- | ``static Text:: | + | Like: '' |
- | Creates a new Text object by concatenating the each Text element in | + | https:// |
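To illustrate why sub-string offsets are counted in graphemes, the sketch below compares the existing ''grapheme_substr'' with ''mb_substr'' on a word containing a combining mark. It uses only current intl and mbstring functions, not the proposed ''Text'' class.

```php
<?php
// "noël" written with a decomposed ë: "e" followed by U+0308 COMBINING DIAERESIS.
$word = "noe\u{0308}l";

// Counting in graphemes keeps the base letter and its combining mark together.
$byGrapheme = grapheme_substr($word, 0, 3);

// Counting in code points cuts between "e" and the combining mark.
$byCodePoint = mb_substr($word, 0, 3, 'UTF-8');

var_dump($byGrapheme === "noe\u{0308}"); // true
var_dump($byCodePoint === "noe");        // true: the diaeresis was lost
```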
- | `$elements`, inserting `$separator` in between each element. | + | |
- | Semantics like `implode(string $separator, array(string) $array);` | + | === trimLeft, trimRight, trim === |
- | ``subString(int $offset, int $length) : Text|false`` | + | Removes white space at the start of, the end of, or both sides of the text. |
- | Returns a sub-string, starting at `$offset` for `$length` graphemes. | + | |
- | Like: `grapheme_substr($this, $offset, $length)` | + | Like: '' |
- | https://www.php.net/manual/en/function.grapheme-substr.php | + | of what white space is. https://unicode.org/reports/tr44/# |
- | ``trimLeft`` | + | === wrap(int $maxWidth, bool $cutLongWords = false) : array(Text) === |
- | ``trimRight`` | + | |
- | ``trim`` | + | |
- | Removes white space at the start of, the end of, or both sides of the text. | + | |
- | Like: `ltrim`, `rtrim`, and `trim`, but with using the unicode definition | + | Wraps a text to a given number |
- | of what white space is. https:// | + | |
- | ``wrap(int $maxWidth, bool $cutLongWords = false) : array(Text)`` | + | Like: '' |
- | Wraps a text to a given number of graphemes | + | inserting a break character. |
- | Like: `wordwrap`, but based on graphemes and returning an array instead of | + | If '' |
- | inserting a break character. | + | '' |
- | If `$cutLongWords` is set, no Text element will be larger than | ||
- | `$maxWidth`. | ||
- | ``replaceText(Text $search, Text $replace)`` ?? | + | === replaceText(string|Text $search, |
- | ``replaceTextCaseInsensitively(Text | + | Replaces the first '' |
- | Will have to use locales too. | + | '' |
- | ``reverse()`` | + | The '' |
- | Reverses a text, taking into account grapheme boundaries. | + | items are being replaced. The '' |
+ | argument that is being replaced | ||
+ | last item. Positive numbers are counted from the first occurrence of | ||
+ | '' | ||
+ | occurrence. | ||
- | === Finding | + | |
+ | === replaceTextCaseInsensitively(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) === | ||
+ | |||
+ | Replaces every occurrence of '' | ||
+ | the object that the method is called on. The locale of '' | ||
+ | '' | ||
+ | |||
+ | '' | ||
+ | |||
+ | |||
+ | === reverse() === | ||
+ | |||
+ | Reverses a text, taking into account grapheme boundaries. | ||
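A grapheme-aware reverse can be sketched in current PHP with PCRE's ''\X'' escape, which matches one extended grapheme cluster. ''reverseGraphemes'' is a hypothetical helper for illustration; the RFC does not specify how ''reverse()'' would be implemented.

```php
<?php
// Hypothetical helper, not part of the RFC.
function reverseGraphemes(string $text): string
{
    // \X matches one extended grapheme cluster; the /u flag enables UTF-8 mode.
    preg_match_all('/\X/u', $text, $matches);

    return implode('', array_reverse($matches[0]));
}

// "noël" with a decomposed ë ("e" + U+0308 COMBINING DIAERESIS).
$reversed = reverseGraphemes("noe\u{0308}l");
var_dump($reversed === "le\u{0308}on"); // true: the mark stays on its base letter
```

A byte-wise ''strrev'' on the same input would split the multi-byte combining mark and produce invalid UTF-8.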
+ | |||
+ | |||
+ | ==== Finding Text in Text ==== | ||
Methods to find text in other text. | Methods to find text in other text. | ||
- | ``getPositionOfFirstOccurrence(string|Text $textToFind, | + | === getPositionOfFirstOccurrence(string|Text $textToFind, |
- | Returns the position (in grapheme units) of the first occurrence of | + | |
- | `$textToFind` starting at the (grapheme) `$offset`, or false if not found. | + | |
- | Like: `grapheme_strpos($this, | + | Returns the position |
- | https:// | + | '' |
- | ``getPositionOfLastOccurrence(string|Text | + | Like: '' |
- | Like `getPositionOfFirstOccurrence` but then from the end of the text. | + | https:// |
- | ``returnFromFirstOccurence(string|Text $textToFind) : Text|false`` | + | *I think this method name is too long* |
- | Returns the `Text` starting with the `$textToFind` if found, and | + | |
- | otherwise `false`. | + | |
- | Like: `grapheme_strstr($this, $textToFind)` | + | === getPositionOfLastOccurrence(string|Text |
- | (https:// | + | |
- | ``returnFromLastOccurence(string|Text $textToFind) : Text|false`` | ||
- | Like `returnFromFirstOccurence` but then from the end of the text. | ||
- | `compareWith(Text $other) : int` (or also the Text's compare handler) | + | Like '' |
- | Needs to use a locale, and sorting text strength (to avoid all the many | + | |
- | options)... perhaps use Intl's collator instead? Or have two methods? | + | |
- | `compareWithNaturalOrder(Text $other) : int` | ||
- | Like `strnatcmp`/ | ||
- | `compareWithCollator` with a `$collator` with the NUMERIC_COLLATION option | ||
- | turned on. | ||
- | `compareWithCollator(Text $other, \Intl\Collator $collator) : int` | + | === returnFromFirstOccurence(string|Text $textToFind) : Text|false === |
- | ``contains(Text $string)`` | + | Returns the '' |
- | Returns true if the text `$string` can be found in the text. | + | otherwise '' |
- | Like `str_contains`. | + | Like: '' |
+ | (https:// | ||
- | ``endsWith(Text $string)`` | ||
- | ``startsWith(Text $string)`` | + | === returnFromLastOccurence(string|Text $textToFind) : Text|false === |
+ | Like '' | ||
- | Case-insensitive variants are not included. If you need this, convert the | + | === contains(string|Text $string) === |
- | text(s) with ``toLower`` first. Or allow for using Intl's Collator? That'd be | + | |
- | nicer... | + | |
- | === Case Conversions === | + | Returns true if the text '' |
- | ``toLower`` | + | Like '' |
- | Converts the text to lower case, using the lower case variant of each | + | |
- | Unicode code point that makes up the text. | + | |
- | ``toUpper`` | ||
- | ``toTitle`` | + | === endsWith(string|Text $string) : bool === |
- | i | + | |
- | ``firstToLower`` | + | |
- | Converts the first grapheme in the text to a lower case variant. | + | |
- | ``firstToUpper`` | + | Could be constructed from '' |
+ | '' | ||
+ | too. | ||
- | ``firstToTitle`` | ||
+ | === startsWith(string|Text $string) : bool === | ||
- | === Counting === | + | Compares the first '' |
+ | locale and collator that are configured with '' | ||
- | `getByteCount()` | + | Case-insensitive comparison can be achieved by setting the right |
- | Returns the size in bytes that the text will take when converted to UTF-8. | + | '' |
- | `length()` | + | Could be constructed from '' |
- | `getCharacterCount()` | + | but it's an often required method, and standard PHP has it |
- | Returns the number of characters that make up the text. A character (also | + | too. |
- | sometimes call a grapheme) consists of the base-character, and all | + | |
- | combining diacritics. Unicode calls these " | + | |
- | http:// | + | |
- | `getCodePointCount()` | ||
- | Returns the number of Unicode code points that make up the text. | ||
- | (Not sure if we should add this, as it doesn' | ||
- | `countWords()` | + | ==== Comparing Text Objects ==== |
- | Pretty much a shortcut for:: | + | |
+ | === compareWith(Text $other) : int === | ||
+ | |||
+ | Uses the configured '' | ||
+ | '' | ||
+ | |||
+ | This same method is also used for comparing two Text objects as " | ||
+ | handler" | ||
+ | |||
+ | |||
+ | ==== Case Conversions ==== | ||
+ | |||
+ | |||
+ | === toLower === | ||
+ | |||
+ | Converts the text to lower case, using the lower case variant of each | ||
+ | Unicode code point that makes up the text. | ||
+ | |||
+ | |||
+ | === toUpper === | ||
+ | |||
+ | |||
+ | |||
+ | === toTitle === | ||
+ | |||
+ | |||
+ | |||
+ | === firstToLower === | ||
+ | |||
+ | Converts the first grapheme in the text to a lower case variant. | ||
+ | |||
+ | |||
+ | === firstToUpper === | ||
+ | |||
+ | |||
+ | |||
+ | === firstToTitle === | ||
+ | |||
+ | |||
+ | |||
+ | ==== Counting ==== | ||
+ | |||
+ | |||
+ | === getByteCount() === | ||
+ | |||
+ | Returns the size in bytes that the text will take when converted to UTF-8. | ||
+ | |||
+ | |||
+ | === length(), getCharacterCount() === | ||
+ | |||
+ | Returns the number of characters that make up the text. A character (also | ||
+ | sometimes called a grapheme) consists of the base-character, | ||
+ | combining diacritics. Unicode calls these " | ||
+ | http:// | ||
+ | |||
+ | |||
+ | === getCodePointCount() === | ||
+ | |||
+ | Returns the number of Unicode code points that make up the text. | ||
+ | (Not sure if we should add this, as it doesn' | ||
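The three counts in this section differ as soon as a text contains combining marks. The sketch below shows the difference using existing functions only; how the proposed methods map onto them is an assumption.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
$text = "e\u{0301}";

$bytes      = strlen($text);             // UTF-8 bytes
$codePoints = mb_strlen($text, 'UTF-8'); // Unicode code points
$graphemes  = grapheme_strlen($text);    // grapheme clusters

var_dump($bytes);      // int(3)
var_dump($codePoints); // int(2)
var_dump($graphemes);  // int(1)
```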
+ | |||
+ | |||
+ | === countWords() === | ||
+ | |||
+ | Pretty much a shortcut for:: | ||
$count = 0; | $count = 0; | ||
foreach ($text-> | foreach ($text-> | ||
- | Uses the locale, just like the iterators. | + | Uses the locale, just like the iterators. |
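For comparison, a locale-aware word count can be written against the existing ''IntlBreakIterator''. This is a sketch under the assumption that the proposed word iterator would behave similarly; ''countWords'' here is a hypothetical free function, not the proposed method.

```php
<?php
// Hypothetical helper, not part of the RFC.
function countWords(string $text, string $locale = 'en'): int
{
    $iterator = IntlBreakIterator::createWordInstance($locale);
    $iterator->setText($text);

    $count = 0;
    $iterator->first();
    while ($iterator->next() !== IntlBreakIterator::DONE) {
        // Segments made of spaces or punctuation report a status below
        // WORD_NONE_LIMIT; everything else counts as a word.
        if ($iterator->getRuleStatus() >= IntlBreakIterator::WORD_NONE_LIMIT) {
            $count++;
        }
    }

    return $count;
}

$words = countWords('The quick brown fox.');
var_dump($words); // int(4)
```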
- | === Iterators === | + | ==== Iterators ==== |
These functions return an iterator that can be used to iterate over the text. | These functions return an iterator that can be used to iterate over the text. | ||
The results of the iterators are affected by the text's locale. | The results of the iterators are affected by the text's locale. | ||
- | ``getCharacterIterator`` | ||
- | ``getLineIterator`` | + | === getCharacterIterator === |
- | ``getSentenceIterator`` | ||
- | ``getTitleIterator`` | ||
- | ``getWordIterator`` | + | === getLineIterator === |
- | === Transliteration === | + | |
+ | === getSentenceIterator === | ||
+ | |||
+ | |||
+ | |||
+ | === getTitleIterator === | ||
+ | |||
+ | |||
+ | |||
+ | === getWordIterator === | ||
+ | |||
+ | |||
+ | |||
+ | ==== Transliteration ==== | ||
Converts text between scripts and other properties. | Converts text between scripts and other properties. | ||
- | ``transliterate(string $transliterationString)`` | ||
- | ``transliterate(\Intl\Transliterator $transliterator)`` | + | === transliterate(string $transliterationString) === |
+ | |||
+ | |||
+ | |||
+ | === transliterate(\Intl\Transliterator $transliterator) === | ||
With the first one being a " | With the first one being a " | ||
Transliterator for more complex cases. | Transliterator for more complex cases. | ||
- | Should we add shortcuts for a set of often used ones, such as `Any-Latin`? I | + | Should we add shortcuts for a set of often used ones, such as '' |
think so, as it's the majority use case. | think so, as it's the majority use case. | ||
- | ``toLatin`` | ||
- | Converts any script to Latin. | ||
- | ``removeAccents`` | + | === toLatin === |
- | Removes the accents from a (latin script) text. | + | |
+ | Converts any script to Latin. | ||
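What such a shortcut would do can be sketched with the existing intl ''Transliterator'' and the ''Any-Latin'' identifier mentioned above. Whether ''toLatin'' would be implemented exactly this way is an assumption.

```php
<?php
// Sketch only: uses the existing intl Transliterator, not the proposed Text API.
$transliterator = Transliterator::create('Any-Latin');

// Cyrillic input; the result is the same text written in Latin script.
$latin = $transliterator->transliterate('Москва');

var_dump($latin);
```

The result may still contain accented Latin letters for some scripts.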
+ | |||
+ | |||
+ | === removeAccents === | ||
+ | |||
+ | Removes the accents from a (Latin script) text. | ||
- | A shortcut for the transliteration string | + | A shortcut for the transliteration string |
- | suitable one, which I believe is `"NFD; [: | + | suitable one, which I believe is '' |
- | NFC."`. | + | NFC."'' |
Line 283: | Line 409: | ||
===== Implementation ===== | ===== Implementation ===== | ||
- | After the project is implemented, | + | After the project is implemented, |
- the version(s) it was merged into | - the version(s) it was merged into | ||
- a link to the git commit(s) | - a link to the git commit(s) |
rfc/unicode_text_processing.txt · Last modified: 2022/12/21 11:48 by derick