rfc:unicode_text_processing
Differences
This shows you the differences between two versions of the page.
rfc:unicode_text_processing [2022/12/15 15:28] – Argue the case for a C-based implementation, and mention ICU implementation details derick | rfc:unicode_text_processing [2022/12/18 17:29] – Fix several typos or difficult wording theodorejb | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== PHP RFC: Unicode Text Processing ====== | ====== PHP RFC: Unicode Text Processing ====== | ||
* Version: 0.9 | * Version: 0.9 | ||
- | * Date: 2022-11-09 | + | * Date: 2022-12-16 (Original date: 2022-12-15) |
* Author: Derick Rethans < | * Author: Derick Rethans < | ||
* Status: Draft | * Status: Draft | ||
Line 26: | Line 26: | ||
===== Proposal ===== | ===== Proposal ===== | ||
- | To introduce a new " | + | To introduce a new final " |
- | stored in the objects. | + | text stored in the objects. |
Methods on the class will all return a new (immutable) object. | Methods on the class will all return a new (immutable) object. | ||
The proposal is to make the '' | The proposal is to make the '' | ||
- | mean that it is therefore always available to user. As the implementation | + | mean that it is therefore always available to users. As the implementation |
requires ICU, this would also mean that PHP will depend on the ICU library. | requires ICU, this would also mean that PHP will depend on the ICU library. | ||
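As a rough illustration of the "every method returns a new (immutable) object" behaviour described above, the following sketch uses a hypothetical `ImmutableText` class. It is not the proposed class: the class name and the `append()` method are assumptions made purely for illustration, and the sketch relies on readonly properties (PHP 8.1 or later).

```php
<?php
// Hypothetical stand-in for the proposed class; the name and methods are illustrative only.
final class ImmutableText
{
    public function __construct(private readonly string $value) {}

    // Returns a new object instead of modifying the current one.
    public function append(string $suffix): self
    {
        return new self($this->value . $suffix);
    }

    public function __toString(): string
    {
        return $this->value;
    }
}

$a = new ImmutableText('Unicode');
$b = $a->append(' text'); // $a is left unchanged
```

Because the class is final and its only property is readonly, callers can share instances without defensive copying.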
Line 109: | Line 109: | ||
==== Construction ==== | ==== Construction ==== | ||
- | This section lists all the method | + | This section lists all the methods |
- | === __construct(string $text, string $locale = ' | + | === __construct(string $text, string $locale = ' |
The constructor takes a UTF-8 encoded text, and stores this in an internal | The constructor takes a UTF-8 encoded text, and stores this in an internal | ||
structure. The constructor will also convert the given text to Unicode | structure. The constructor will also convert the given text to Unicode | ||
- | Canonical Form. Passing in non-well-formed UTF-8 will result in an | + | Canonical Form (also called Normalisation Form C, or NFC). Passing in |
- | '' | + | non-well-formed UTF-8 will result in an '' |
- | (Byte-Order-Mark) character, if present. | + | The constructor will also strip out a BOM (Byte-Order-Mark) character, |
+ | if present. | ||
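The constructor steps described above (validation, BOM stripping, NFC normalisation) can be approximated in userland with the existing mbstring and intl extensions. This is only a sketch: `prepareText()` is a hypothetical helper, and the exception type thrown for malformed input is an assumption, not something taken from the RFC.

```php
<?php
// Hypothetical helper that mimics the described constructor steps.
// Requires the mbstring and intl extensions and PHP 8.0+ (str_starts_with).
function prepareText(string $text): string
{
    // Reject input that is not well-formed UTF-8 (exception type is an assumption).
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }

    // Strip a leading BOM (U+FEFF, encoded as EF BB BF in UTF-8), if present.
    if (str_starts_with($text, "\xEF\xBB\xBF")) {
        $text = substr($text, 3);
    }

    // Normalise to Normalisation Form C (NFC).
    $normalised = Normalizer::normalize($text, Normalizer::FORM_C);
    if ($normalised === false) {
        throw new RuntimeException('Normalisation failed');
    }

    return $normalised;
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT composes to U+00E9 under NFC.
$result = prepareText("\xEF\xBB\xBF" . "e\u{0301}");
```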
- | === static Text:: | + | === static Text:: |
- | The Symfony String package, offers a static function to construct a String | + | The Symfony String package offers a static function to construct a String |
through a single-character function ('' | through a single-character function ('' | ||
file scope (with '' | file scope (with '' | ||
This method serves a similar purpose, so that you can shorten '' | This method serves a similar purpose, so that you can shorten '' | ||
- | '' | + | '' |
- | For example with '' | + | '' |
- | === static Text:: | + | === static Text:: |
Creates a new Text object by concatenating the Text element in | Creates a new Text object by concatenating the Text element in | ||
Line 137: | Line 138: | ||
If the '' | If the '' | ||
- | element in the '' | + | element in the '' |
object. | object. | ||
Line 147: | Line 148: | ||
- | === split(string|Text $separator, int $limit = PHP_INT_MAX): | + | === split(string|Text $separator, int $limit = PHP_INT_MAX) : array(Text) === |
Returns an array of Text objects, each of which is a substring of '' | Returns an array of Text objects, each of which is a substring of '' | ||
Line 162: | Line 163: | ||
https:// | https:// | ||
- | === trimLeft, trimRight, trim === | + | === trimStart, trimEnd, trim : \Text === |
Removes white space at the start of, the end of, or both sides of the text. | Removes white space at the start of, the end of, or both sides of the text. | ||
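To show how Unicode-aware trimming differs from PHP's byte-oriented `trim()`, here is a minimal sketch built on `preg_replace()`. The helper names are hypothetical, and the definition of white space (ASCII white space plus the Unicode separator category `\p{Z}`) is an assumption; the RFC text shown here does not say which characters the proposed methods strip.

```php
<?php
// Hypothetical helpers; "white space" is assumed to mean ASCII white space
// plus any character in the Unicode separator category (\p{Z}).
function trimStartUnicode(string $text): string
{
    // Fall back to the input if the regex fails (for example on invalid UTF-8).
    return preg_replace('/\A[\s\p{Z}]+/u', '', $text) ?? $text;
}

function trimEndUnicode(string $text): string
{
    return preg_replace('/[\s\p{Z}]+\z/u', '', $text) ?? $text;
}

function trimUnicode(string $text): string
{
    return trimEndUnicode(trimStartUnicode($text));
}

// U+00A0 NO-BREAK SPACE, U+3000 IDEOGRAPHIC SPACE and U+2003 EM SPACE
// are not removed by PHP's built-in trim() with its default character list.
$trimmed = trimUnicode("\u{00A0}\u{3000}text \u{2003}");
```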
Line 181: | Line 182: | ||
- | === replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) === | + | === replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) : \Text === |
- | Replaces | + | Replaces occurrences of '' |
- | '' | + | |
The locale of '' | The locale of '' | ||
Line 198: | Line 198: | ||
In order to find sub-strings case-insensitively, | In order to find sub-strings case-insensitively, | ||
- | argument to the constructor | + | argument to '' |
- | === reverse() === | + | === reverse() |
Reverses a text, taking into account grapheme boundaries. | Reverses a text, taking into account grapheme boundaries. | ||
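The reason grapheme boundaries matter can be sketched with the existing intl grapheme functions. `reverseGraphemes()` is a hypothetical helper, not the proposed implementation; it keeps a base letter and its combining marks together, which a byte-wise `strrev()` would not.

```php
<?php
// Hypothetical helper: reverses a UTF-8 string one grapheme cluster at a time.
// Requires the intl extension. Quadratic in the worst case, which is fine for a sketch.
function reverseGraphemes(string $text): string
{
    $clusters = [];
    $length = grapheme_strlen($text);

    for ($i = 0; $i < $length; $i++) {
        // Each cluster may consist of several code points (base + combining marks).
        $clusters[] = grapheme_substr($text, $i, 1);
    }

    return implode('', array_reverse($clusters));
}

// "a", then "e" + U+0301 COMBINING ACUTE ACCENT, then "b".
$reversed = reverseGraphemes("ae\u{0301}b");
```

The accent stays attached to the "e" in the result, whereas reversing code points or bytes would move it onto a different letter or produce invalid UTF-8.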
Line 222: | Line 222: | ||
https:// | https:// | ||
- | *I think this method name is too long* | + | Alternative suggested names: '' |
=== getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false === | === getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false === | ||
Line 228: | Line 229: | ||
Like '' | Like '' | ||
+ | |||
+ | Alternative suggested names: '' | ||
Line 237: | Line 240: | ||
Like: '' | Like: '' | ||
(https:// | (https:// | ||
+ | |||
+ | Alternative suggested names: '' | ||
Line 248: | Line 253: | ||
Like '' | Like '' | ||
+ | |||
+ | Alternative suggested names: '' | ||
Line 289: | Line 296: | ||
These operations all use the collation that is configured on the Text object. | These operations all use the collation that is configured on the Text object. | ||
- | === toLower === | + | === toLower |
Converts the text to lower case, using the lower case variant of each | Converts the text to lower case, using the lower case variant of each | ||
Unicode code point that makes up the text. | Unicode code point that makes up the text. | ||
- | === toUpper === | + | Example: '' |
+ | |||
+ | |||
+ | === toUpper | ||
The same, but then to upper case. | The same, but then to upper case. | ||
- | === toTitle === | + | Example: '' |
+ | |||
+ | |||
+ | === toTitle | ||
The same, but then to title case (the first letter of each word). | The same, but then to title case (the first letter of each word). | ||
- | === firstToLower === | + | Example: '' |
+ | |||
+ | |||
+ | === firstToLower | ||
Converts the first grapheme in the text to a lower case variant. | Converts the first grapheme in the text to a lower case variant. | ||
- | === firstToUpper === | + | Example: '' |
+ | |||
+ | |||
+ | === firstToUpper | ||
The same, but then to upper case. | The same, but then to upper case. | ||
- | === firstToTitle === | + | Example: '' |
- | The same, but then to title case (the first letter of each word). | ||
- | === wordsToLower === | + | === wordsToLower |
Converts the first grapheme in every word to a lower case variant. | Converts the first grapheme in every word to a lower case variant. | ||
- | === wordsToUpper === | + | Example: '' |
- | The same, but then to upper case. | ||
- | === wordsToTitle | + | === wordsToUpper : \Text === |
- | The same, but then to title case (the first letter of each word). | + | The same, but then to upper case. |
+ | |||
+ | Example: '' | ||
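The case conversions in this section can be approximated today with mbstring and intl. The snippet below is a sketch under stated assumptions: it ignores the locale, which the proposed methods would take from the Text object, and the sample strings are made up for illustration.

```php
<?php
// Sketch using the mbstring and intl extensions; locale-specific rules are ignored.
$text = "\u{E9}lan vital"; // "élan vital"

$lower = mb_strtolower("\u{C9}LAN VITAL", 'UTF-8');       // whole text to lower case
$upper = mb_strtoupper($text, 'UTF-8');                   // whole text to upper case
$title = mb_convert_case($text, MB_CASE_TITLE, 'UTF-8');  // first letter of each word

// Upper-case only the first grapheme, leaving the rest untouched.
$first = grapheme_substr($text, 0, 1);
$rest = grapheme_substr($text, 1);
$firstUpper = mb_strtoupper($first, 'UTF-8') . $rest;
```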
Line 331: | Line 350: | ||
- | === getByteCount() === | + | === getByteCount() |
Returns the size in bytes that the text will take when converted to UTF-8. | Returns the size in bytes that the text will take when converted to UTF-8. | ||
- | === length(), getCharacterCount() === | + | === length(), getCharacterCount(): int |
Returns the number of characters that make up the text. A character (also | Returns the number of characters that make up the text. A character (also | ||
Line 344: | Line 363: | ||
- | === getCodePointCount() === | + | === getCodePointCount() |
Returns the number of Unicode code points that make up the text. | Returns the number of Unicode code points that make up the text. | ||
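The three counts (bytes, code points, characters) differ as soon as a text contains combining marks. The sketch below uses existing PHP functions to show this; `describeLength()` is a hypothetical helper and only approximates what the proposed methods would return.

```php
<?php
// Hypothetical helper comparing the three notions of "length".
// Requires the mbstring and intl extensions.
function describeLength(string $text): array
{
    return [
        'bytes' => strlen($text),                  // UTF-8 code units
        'codePoints' => mb_strlen($text, 'UTF-8'), // Unicode code points
        'characters' => grapheme_strlen($text),    // grapheme clusters
    ];
}

// "x" followed by U+0301 COMBINING ACUTE ACCENT; there is no precomposed form,
// so the text stays two code points even after NFC normalisation.
$counts = describeLength("x\u{0301}");
```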
Line 350: | Line 369: | ||
- | === getWordCount() === | + | === getWordCount() |
Roughly a shortcut for: | Roughly a shortcut for: | ||
Line 370: | Line 389: | ||
(https:// | (https:// | ||
- | === getCharacterIterator === | + | === getCharacterIterator |
Returns an Iterator that locates boundaries between logical characters. | Returns an Iterator that locates boundaries between logical characters. | ||
Line 381: | Line 400: | ||
of text that the computer sees as " | of text that the computer sees as " | ||
- | === getWordIterator === | + | === getWordIterator |
Returns an Iterator that locates boundaries between words. This is useful | Returns an Iterator that locates boundaries between words. This is useful | ||
Line 389: | Line 408: | ||
are kept separate from real words. | are kept separate from real words. | ||
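Word boundaries of this kind are already exposed by intl's `IntlBreakIterator`, so the idea can be sketched with it. `extractWords()` is a hypothetical helper; the default locale and the choice to treat numbers as words are assumptions, not behaviour defined by the RFC.

```php
<?php
// Hypothetical helper: collects the "real" words between word boundaries.
// Requires the intl extension; boundary offsets are byte offsets into the UTF-8 text.
function extractWords(string $text, string $locale = 'en_US'): array
{
    $iterator = IntlBreakIterator::createWordInstance($locale);
    $iterator->setText($text);

    $words = [];
    $start = $iterator->first();

    for ($end = $iterator->next(); $end !== IntlBreakIterator::DONE; $end = $iterator->next()) {
        // A rule status below WORD_NONE_LIMIT marks spaces and punctuation, not words.
        if ($iterator->getRuleStatus() >= IntlBreakIterator::WORD_NONE_LIMIT) {
            $words[] = substr($text, $start, $end - $start);
        }
        $start = $end;
    }

    return $words;
}

$words = extractWords("Hello, w\u{F6}rld!"); // "Hello, wörld!"
```

Counting the returned array gives a word count in the spirit of `getWordCount()`.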
- | === getLineIterator === | + | === getLineIterator |
Returns an Iterator that locates positions where it is legal for a text | Returns an Iterator that locates positions where it is legal for a text | ||
Line 398: | Line 417: | ||
from being a line-break position. | from being a line-break position. | ||
- | === getSentenceIterator === | + | === getSentenceIterator |
Returns an Iterator that locates boundaries between sentences. | Returns an Iterator that locates boundaries between sentences. | ||
- | === getTitleIterator === | + | === getTitleIterator |
Returns an Iterator that locates boundaries between title breaks. | Returns an Iterator that locates boundaries between title breaks. | ||
Line 413: | Line 432: | ||
- | === transliterate(string $transliterationString) === | + | === transliterate(string $transliterationString) |
Transliterates the content of the '' | Transliterates the content of the '' | ||
Line 457: | Line 476: | ||
===== Open Issues ===== | ===== Open Issues ===== | ||
+ | - Add a method like mb_strcut, to extract a string of at most a given number of bytes from a position, as encoded in UTF-8. | ||
+ | - Tidy up language related to locale/ | ||
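For reference, this is how the existing `mb_strcut()` behaves, which is the behaviour the first open issue refers to. The sample string is made up for illustration; no signature for a possible Text method is implied.

```php
<?php
// mb_strcut() works with byte offsets and lengths but never splits a character.
// Requires the mbstring extension.
$text = "h\u{E9}llo"; // "héllo": "é" occupies two bytes in UTF-8

$cut2 = mb_strcut($text, 0, 2, 'UTF-8'); // a 2-byte limit would split "é", so it is dropped
$cut3 = mb_strcut($text, 0, 3, 'UTF-8'); // a 3-byte limit fits "h" and "é"

// A plain substr() with the same arguments cuts through the middle of "é".
$broken = substr($text, 0, 2);
```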
===== Questions and Answers ===== | ===== Questions and Answers ===== | ||
Line 498: | Line 519: | ||
Nothing rejected yet. | Nothing rejected yet. | ||
+ | |||
+ | |||
+ | ===== Changes ===== | ||
+ | |||
+ | 0.9.1 — 2022-12-16 | ||
+ | |||
+ | * Tim Düsterhus: Removed firstToTitle/ | ||
+ | * Paul Crovella: Clarify which normalisation is being used. | ||
+ | * Daniel Wolfe: Update trimLeft/ |
rfc/unicode_text_processing.txt · Last modified: 2022/12/21 11:48 by derick