rfc:unicode_text_processing
rfc:unicode_text_processing [2022/12/15 15:28] – Argue the case for a C-based implementation, and mention ICU implementation details derick | rfc:unicode_text_processing [2024/09/11 14:16] (current) – derick | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== PHP RFC: Unicode Text Processing ====== | ====== PHP RFC: Unicode Text Processing ====== | ||
- | * Version: 0.9 | + | * Version: 0.9.2 |
- | * Date: 2022-11-09 | + | * Date: 2022-12-21 (Original date: 2022-12-15) |
* Author: Derick Rethans < | * Author: Derick Rethans < | ||
* Status: Draft | * Status: Draft | ||
* First Published at: http:// | * First Published at: http:// | ||
- | |||
===== Introduction ===== | ===== Introduction ===== | ||
Line 26: | Line 25: | ||
===== Proposal ===== | ===== Proposal ===== | ||
- | To introduce a new " | + | To introduce a new final " |
- | stored in the objects. | + | text stored in the objects. |
Methods on the class will all return a new (immutable) object. | Methods on the class will all return a new (immutable) object. | ||
Line 72: | Line 71: | ||
If an argument to any of the methods is listed as '' | If an argument to any of the methods is listed as '' | ||
passing in a '' | passing in a '' | ||
- | the passed value with '' | + | the passed value with '' |
- | object that this method is called on is also used for this new wrapped | + | from the Text object that this method is called on is also used for this new |
- | value, if necessary. | + | wrapped |
- | ==== Locales and Internationalisation ==== | + | ==== Locales, Collators, |
- | By default each string will have the " | + | By default each string will have the "root" locale and " |
- | but it is possible to configure a specific collator by using the | + | associated with it, but it is possible to configure a specific |
- | '' | + | collator by using the '' |
- | a string describing an ICU locale name: | + | addition to the locale, and affects sorting and finding operations. |
+ | |||
+ | The '' | ||
+ | name: | ||
https:// | https:// | ||
- | For example, the locale (or collation) name '' | + | The methods on the Text object all use the '' |
- | case-insensitive sorting for the English locale. | + | |
- | extensive documentation. | + | For example, the locale (and collation) name '' |
+ | case-insensitive sorting | ||
+ | The format of this locale/ | ||
- | Numerical order collation (such as PHP's '' | + | Numerical order collation (such as PHP's '' |
- | by adding the '' | + | adding the '' |
- | (case-sensitive German, with numerics in value order). | + | (case-sensitive German |
Other options are described in BCP47: | Other options are described in BCP47: | ||
Line 111: | Line 115: | ||
This section lists all the methods that construct a Text object. | This section lists all the methods that construct a Text object. | ||
- | === __construct(string $text, string $locale | + | === __construct(string $text, string $collation |
The constructor takes a UTF-8 encoded text, and stores this in an internal | The constructor takes a UTF-8 encoded text, and stores this in an internal | ||
structure. The constructor will also convert the given text to Unicode | structure. The constructor will also convert the given text to Unicode | ||
- | Canonical Form. Passing in non-well-formed UTF-8 will result in an | + | Canonical Form (also called Normalisation Form C, or NFC). Passing in |
- | '' | + | non-well-formed UTF-8 will result in an '' |
- | (Byte-Order-Mark) character, if present. | + | The constructor will also strip out a BOM (Byte-Order-Mark) character, |
+ | if present. | ||
- | === static Text:: | + | |
+ | === static Text:: | ||
The Symfony String package offers a static function to construct a String | The Symfony String package offers a static function to construct a String | ||
Line 129: | Line 135: | ||
For example with '' | For example with '' | ||
- | === static Text:: | ||
- | Creates a new Text object by concatenating the Text element | + | === static Text:: |
+ | |||
+ | Creates a new Text object by concatenating | ||
+ | into a new Text object. | ||
+ | |||
+ | If the '' | ||
+ | '' | ||
+ | |||
+ | |||
+ | === static Text:: | ||
+ | |||
+ | Creates a new Text object by looping over all the string/Text elements | ||
'' | '' | ||
The semantics are like: '' | The semantics are like: '' | ||
- | If the '' | + | If the '' |
- | element | + | element |
- | object. | + | created |
- | If the '' | + | If the '' |
- | '' | + | '' |
+ | If the iterator produces a non-string/ | ||
+ | will be thrown. | ||
==== Standard String Operations ==== | ==== Standard String Operations ==== | ||
- | === split(string|Text $separator, int $limit = PHP_INT_MAX): | + | === split(string|Text $separator, int $limit = PHP_INT_MAX) : array(Text) === |
Returns an array of Text objects, each of which is a substring of '' | Returns an array of Text objects, each of which is a substring of '' | ||
Line 162: | Line 180: | ||
https:// | https:// | ||
- | === trimLeft, trimRight, trim === | + | === trimStart, trimEnd, trim : \Text === |
Removes white space at the start of, the end of, or both sides of the text. | Removes white space at the start of, the end of, or both sides of the text. | ||
Line 180: | Line 198: | ||
'' | '' | ||
- | + | === reverse() : \Text === | |
- | === replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) === | + | |
- | + | ||
- | Replaces the first '' | + | |
- | '' | + | |
- | + | ||
- | The locale of '' | + | |
- | match, if it is a '' | + | |
- | that the method is called on. | + | |
- | + | ||
- | The '' | + | |
- | items are being replaced. The '' | + | |
- | argument that is being replaced (0-indexed), | + | |
- | last item. Positive numbers are counted from the first occurrence of | + | |
- | '' | + | |
- | occurrence. | + | |
- | + | ||
- | In order to find sub-strings case-insensitively, | + | |
- | argument to the constructor of the '' | + | |
- | + | ||
- | === reverse() | + | |
Reverses a text, taking into account grapheme boundaries. | Reverses a text, taking into account grapheme boundaries. | ||
Line 209: | Line 207: | ||
Methods to find text in other text. | Methods to find text in other text. | ||
- | In all these methods, the locale of '' | + | In all these methods, the locale |
- | match, if it is a '' | + | sub-strings that match, if it is a '' |
- | that the method is called on. | + | collator that are embedded in the object that the method is called on is used. |
Line 222: | Line 220: | ||
https:// | https:// | ||
- | *I think this method name is too long* | + | Alternative suggested names: '' |
=== getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false === | === getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false === | ||
Line 228: | Line 227: | ||
Like '' | Like '' | ||
+ | |||
+ | Alternative suggested names: '' | ||
Line 237: | Line 238: | ||
Like: '' | Like: '' | ||
(https:// | (https:// | ||
+ | |||
+ | Alternative suggested names: '' | ||
Line 242: | Line 245: | ||
Like '' | Like '' | ||
+ | |||
+ | Alternative suggested names: '' | ||
+ | |||
=== contains(string|Text $search) === | === contains(string|Text $search) === | ||
Line 255: | Line 261: | ||
Case-insensitive comparison can be achieved by setting the right | Case-insensitive comparison can be achieved by setting the right | ||
- | '' | + | '' |
Could be constructed from '' | Could be constructed from '' | ||
Line 267: | Line 273: | ||
Case-insensitive comparison can be achieved by setting the right | Case-insensitive comparison can be achieved by setting the right | ||
- | '' | + | '' |
Could be constructed from '' | Could be constructed from '' | ||
but it's an often required method, and standard PHP has it | but it's an often required method, and standard PHP has it | ||
too. | too. | ||
+ | |||
+ | === replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) : \Text === | ||
+ | |||
+ | Replaces occurrences of '' | ||
+ | |||
+ | The '' | ||
+ | items are being replaced. The '' | ||
+ | argument that is being replaced (0-indexed), | ||
+ | last item. Positive numbers are counted from the first occurrence of | ||
+ | '' | ||
+ | occurrence. | ||
+ | |||
+ | In order to find sub-strings case-insensitively, | ||
+ | argument to '' | ||
==== Comparing Text Objects ==== | ==== Comparing Text Objects ==== | ||
- | === compareWith(Text $other, string $collator | + | === compareWith(Text $other, string $collation |
- | Uses the configured '' | + | Uses the configured '' |
- | '' | + | '' |
This same method is also used for comparing two Text objects as " | This same method is also used for comparing two Text objects as " | ||
- | handler" | + | handler" |
+ | taken into account. | ||
+ | |||
+ | === equals(Text $other, string $collation = NULL) : boolean === | ||
+ | |||
+ | Alias for '' | ||
Line 289: | Line 314: | ||
These operations all use the collation that is configured on the Text object. | These operations all use the collation that is configured on the Text object. | ||
- | === toLower === | + | === toLower |
Converts the text to lower case, using the lower case variant of each | Converts the text to lower case, using the lower case variant of each | ||
Unicode code point that makes up the text. | Unicode code point that makes up the text. | ||
- | === toUpper === | + | Example: '' |
+ | |||
+ | |||
+ | === toUpper | ||
The same, but then to upper case. | The same, but then to upper case. | ||
- | === toTitle === | + | Example: '' |
+ | |||
+ | |||
+ | === toTitle | ||
The same, but then to title case (the first letter of each word). | The same, but then to title case (the first letter of each word). | ||
- | === firstToLower === | + | Example: '' |
+ | |||
+ | |||
+ | === firstToLower | ||
Converts the first grapheme in the text to a lower case variant. | Converts the first grapheme in the text to a lower case variant. | ||
- | === firstToUpper === | + | Example: '' |
+ | |||
+ | |||
+ | === firstToUpper | ||
The same, but then to upper case. | The same, but then to upper case. | ||
- | === firstToTitle === | + | Example: '' |
- | The same, but then to title case (the first letter of each word). | ||
- | === wordsToLower === | + | === wordsToLower |
Converts the first grapheme in every word to a lower case variant. | Converts the first grapheme in every word to a lower case variant. | ||
- | === wordsToUpper === | + | Example: '' |
- | The same, but then to upper case. | ||
- | === wordsToTitle | + | === wordsToUpper : \Text === |
- | The same, but then to title case (the first letter of each word). | + | The same, but then to upper case. |
+ | |||
+ | Example: '' | ||
Line 331: | Line 368: | ||
- | === getByteCount() === | + | === getByteCount() |
Returns the size in bytes that the text will take when converted to UTF-8. | Returns the size in bytes that the text will take when converted to UTF-8. | ||
- | === length(), getCharacterCount() === | + | === length(), getCharacterCount(): int |
Returns the number of characters that make up the text. A character (also | Returns the number of characters that make up the text. A character (also | ||
Line 344: | Line 381: | ||
- | === getCodePointCount() === | + | === getCodePointCount() |
Returns the number of Unicode code points that make up the text. | Returns the number of Unicode code points that make up the text. | ||
Line 350: | Line 387: | ||
- | === getWordCount() === | + | === getWordCount() |
Pretty much a shortcut for: | Pretty much a shortcut for: | ||
Line 364: | Line 401: | ||
These functions return an iterator that can be used to iterate over the text. | These functions return an iterator that can be used to iterate over the text. | ||
The results of the iterators are affected by the text's locale. | The results of the iterators are affected by the text's locale. | ||
- | i | + | |
These are inspired by ICU4J' | These are inspired by ICU4J' | ||
(https:// | (https:// | ||
Line 370: | Line 407: | ||
(https:// | (https:// | ||
- | === getCharacterIterator === | + | === getCharacterIterator |
Returns an Iterator that locates boundaries between logical characters. | Returns an Iterator that locates boundaries between logical characters. | ||
Line 381: | Line 418: | ||
of text that the computer sees as " | of text that the computer sees as " | ||
- | === getWordIterator === | + | === getWordIterator |
Returns an Iterator that locates boundaries between words. This is useful | Returns an Iterator that locates boundaries between words. This is useful | ||
Line 389: | Line 426: | ||
are kept separate from real words. | are kept separate from real words. | ||
- | === getLineIterator === | + | === getLineIterator |
Returns an Iterator that locates positions where it is legal for a text | Returns an Iterator that locates positions where it is legal for a text | ||
Line 398: | Line 435: | ||
from being a line-break position. | from being a line-break position. | ||
- | === getSentenceIterator === | + | === getSentenceIterator |
Returns an Iterator that locates boundaries between sentences. | Returns an Iterator that locates boundaries between sentences. | ||
- | === getTitleIterator === | + | === getTitleIterator |
Returns an Iterator that locates boundaries between title breaks. | Returns an Iterator that locates boundaries between title breaks. | ||
Line 413: | Line 450: | ||
- | === transliterate(string $transliterationString) === | + | === transliterate(string $transliterationString) |
Transliterates the content of the '' | Transliterates the content of the '' | ||
Line 457: | Line 494: | ||
===== Open Issues ===== | ===== Open Issues ===== | ||
+ | - Add a method like mb_strcut, to extract a string of at most a given number of bytes from a position, as encoded in UTF-8. |
===== Questions and Answers ===== | ===== Questions and Answers ===== | ||
Line 498: | Line 536: | ||
Nothing rejected yet. | Nothing rejected yet. | ||
+ | |||
+ | |||
+ | ===== Changes ===== | ||
+ | |||
+ | 0.9.2 — 2022-12-21 | ||
+ | |||
+ | * Tim Düsterhus: Added concat and equals methods; changed join to accept an iterator. | ||
+ | * Enhance explanation of locales and collations, and standardize on using '' | ||
+ | |||
+ | 0.9.1 — 2022-12-16 | ||
+ | |||
+ | * Tim Düsterhus: Removed firstToTitle/ | ||
+ | * Paul Crovella: Clarify which normalisation is being used. | ||
+ | * Daniel Wolfe: Update trimLeft/ |
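
As a rough illustration of the distinction the RFC draws between getByteCount(), getCodePointCount() and getCharacterCount(), and of the NFC normalisation the constructor is described as applying, the following sketch uses functions from PHP's existing intl and mbstring extensions. It does not use the proposed Text class, and the `$counts` helper is a hypothetical name introduced only for this example.

```php
<?php
// Sketch only: relies on the existing intl and mbstring extensions,
// not on the Text class proposed in the RFC.

// "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
$decomposed = "e\u{0301}";

// Normalisation Form C composes the pair into the single code point U+00E9.
$composed = Normalizer::normalize($decomposed, Normalizer::FORM_C);

// Hypothetical helper: the three ways of measuring the same text.
$counts = static fn (string $s): array => [
    'bytes'      => strlen($s),             // size when encoded as UTF-8
    'codePoints' => mb_strlen($s, 'UTF-8'), // number of Unicode code points
    'graphemes'  => grapheme_strlen($s),    // user-perceived characters
];

$before = $counts($decomposed); // bytes: 3, codePoints: 2, graphemes: 1
$after  = $counts($composed);   // bytes: 2, codePoints: 1, graphemes: 1
```

Both forms count as one character, but their byte and code point counts differ, which is presumably why the RFC specifies normalising on construction and exposes the three counts as separate methods.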
rfc/unicode_text_processing.1671118113.txt.gz · Last modified: 2022/12/15 15:28 by derick