Differences

This shows you the differences between two versions of the page.

--- rfc:unicode_text_processing [2022/11/21 15:14] – derick
+++ rfc:unicode_text_processing [2022/12/21 11:48] (current) – Added concat and equals methods; changed join to accept an iterator; Enhance explanation of locales and collations, and standardize on using ''$collator'' as an argument name everywhere. derick
@@ Line 1: / Line 1: @@
 ====== PHP RFC: Unicode Text Processing ======
-  * Version: 0.9
+  * Version: 0.9.2
-  * Date: 2022-11-09
+  * Date: 2022-12-21 (Original date: 2022-12-15)
   * Author: Derick Rethans <derick@php.net>
   * Status: Draft
@@ Line 14: / Line 14: @@
 create an API that developers can use to do Unicode text processing
 correctly, without having to know all the intricacies.
+Although PHP has decent maths features, it lacks performant Unicode text
+processing that is always available in the core.
 ==== Definitions ====
@@ Line 23: / Line 26: @@
 ===== Proposal =====
-To introduce a new "Text" class, with methods to operate on the text
+To introduce a new final "Text" class, with methods to operate on the
-stored in the objects.
+text stored in the objects.
 Methods on the class will all return a new (immutable) object.
+The proposal is to make the ''Text'' class part of the PHP core. This would
+mean that it is always available to users. As the implementation
+requires ICU, this would also mean that PHP will depend on the ICU library.
 ==== Basics ====
@@ Line 32: / Line 40: @@
 constructor.
-The ''toString()'' method collapses the internally stored text into a
+The ''__toString()'' method collapses the internally stored text into a
 UTF-8 encoded string, which can be used by all existing PHP functions
 that accept strings.
-The internal representation would be UTF-16, as that's what ICU uses.
+The internal representation of the text is UTF-16, as that's what ICU uses.
 Unlike the PHP 6 approach, the conversion to/from the internal
 representation only happens on the boundaries: UTF-8 to UTF-16 through
-the constructor, and the reverse through the ''toString()'' method.
+the constructor, and the reverse through the ''__toString()'' method.
 There are multiple groups of methods indicated below. Some are to
@@ Line 51: / Line 59: @@
   * prefer a method per function, instead of allowing the behaviour of a method to be changed through (optional) arguments.
   * operations are on **graphemes**
-  * no redundent methods that can be constructed from other methods, unless they already exist in PHP, or are frequently used
+  * no redundant methods that can be constructed from other methods, unless they already exist in PHP, or are frequently used
   * more as we discuss this...
@@ Line 64: / Line 72: @@
 If an argument to any of the methods is listed as ''string|Text'',
 passing in a ''string'' value will have the same semantics as replacing
-the passed value with ''new Text($string)''. The locale from the Text
+the passed value with ''new Text($string)''. The locale and default collator
-object that this method is called on is also used for this new wrapped
+from the Text object that this method is called on are also used for this new
-value, if necessary.
+wrapped value, if necessary.
-==== Locales and Internationalisation ====
+==== Locales, Collators, and Internationalisation ====
-By default each string will have the "root" collator associated with it,
+By default each string will have the "root" locale and "standard" collator
-but it is possible to configure a specific collator by using the
+associated with it, but it is possible to configure a specific locale and
-''$collator'' argument in the constructor. The ''$collator'' is specified as
+collator by using the ''$collation'' argument in the constructor. Collation is in
-a string describing an ICU locale name:
+addition to the locale, and affects sorting and finding operations.
+The ''$collation'' is specified as a string describing an ICU locale/collation
+name:
 https://unicode-org.github.io/icu/userguide/collation/api.html#instantiating-the-predefined-collators
-For example, the locale (or collation) name ''en-u-ks-level1'' means
+The methods on the Text object all use the ''$collation'' argument name.
-case-insensitive sorting for the English locale. This will require
-extensive documentation.
-Numerical order collation (such as PHP's ''natsort()'') can be achived
+For example, the locale (and collation) name ''en-u-ks-level1'' means
-by adding the ''kn'' flag to the locale name, such as in ''de-u-kn''
+case-insensitive sorting (''ks-level1'') for the English locale (''en''); the ''-u-'' part introduces the Unicode extension keywords.
-(case-sensitive German, with numerics in value order).
+The format of this locale/collation name needs extensive documentation.
+Numerical order collation (such as PHP's ''natsort()'') can be achieved by
+adding the ''kn'' flag to the collator specification, such as in ''de-u-kn''
+(case-sensitive German (''de''), with numerics in value order (''kn'')).
 Other options are described in BCP47:
@@ Line 88: / Line 101: @@
 and defaults at http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings
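The effect of the two collation settings discussed above can be tried today with the existing Intl ''Collator'' class. This is only a sketch, not the proposed ''Text'' API: it assumes the ''intl'' extension is loaded and sets the attributes explicitly instead of through the ''ks-level1'' and ''kn'' keywords. The variable names are illustrative.

```php
<?php
// Requires the intl extension. Collator is the existing Intl class,
// used here as a stand-in for the collation behaviour described in the RFC.

// Primary strength ignores case differences, which is what the
// "ks-level1" keyword asks for.
$caseInsensitive = new Collator('en');
$caseInsensitive->setStrength(Collator::PRIMARY);
$comparison = $caseInsensitive->compare('apple', 'APPLE'); // 0 means "equal"

// Numeric collation orders digit sequences by value, which is what the
// "kn" keyword asks for (similar in spirit to natsort()).
$numeric = new Collator('de');
$numeric->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);
$files = ['img12', 'img10', 'img2'];
$numeric->sort($files); // $files is now ordered img2, img10, img12
```

Recent ICU versions should also accept the keyword form directly, as in ''new Collator('en-u-ks-level1')'', but that depends on the ICU version PHP was built against.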
-Specifying the locale and collator will also be possible by passing in a
+Building a locale/collation string will also be possible by using a
-''Intl\\Collator'' object
+''TextCollator'' object, to allow for better and easier-to-read customization
-(https://www.php.net/manual/en/class.collator.php) to allow for more
+of collations. The class performs the same function as ''\Intl\Collator''
-descritive construction of a locale with all its options.
+(https://www.php.net/manual/en/class.collator.php), except that it has
+descriptive methods to set collation properties. The reason for a separate
+class is so that you don't have to depend on the ''Intl'' extension, and to
+make it more developer-friendly. It converts the configured options to a
+string, which can then be used in any location where ''string $collation'' is
+used in the function signatures to the methods on the ''Text'' class.
@@ Line 98: / Line 116: @@
 This section lists all the methods that construct a Text object.
-=== __construct(string $text, string $locale = 'root/standard'), __construct(string $text, \\Intl\\Collator $collator = new \\Intl\\Collator('root/standard')) ===
+=== __construct(string $text, string $collation = 'root/standard') : \Text ===
 The constructor takes a UTF-8 encoded text, and stores this in an internal
 structure. The constructor will also convert the given text to Unicode
-Canonical Form. Passing in non-well-formed UTF-8 will result in an
+Canonical Form (also called Normalisation Form C, or NFC). Passing in
-''InvalidEncodingException''. The constructor will also strip out a BOM
+non-well-formed UTF-8 will result in an ''InvalidEncodingException''.
-(Byte-Order-Mark) character, if present.
+The constructor will also strip out a BOM (Byte-Order-Mark) character,
+if present.
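The input handling described for the constructor (validation, BOM stripping, NFC normalisation) can be approximated with existing functions. The helper name ''prepareText'' is hypothetical, and ''InvalidArgumentException'' stands in for the ''InvalidEncodingException'' named above, which does not exist yet. The sketch assumes PHP 8 and the ''intl'' extension.

```php
<?php
// Hypothetical helper that mirrors the constructor's described input handling.
// Requires PHP 8.0+ (str_starts_with) and the intl extension (Normalizer).
function prepareText(string $text): string
{
    // An empty pattern with the /u flag only matches well-formed UTF-8.
    if (preg_match('//u', $text) !== 1) {
        // Stand-in for the RFC's InvalidEncodingException.
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }

    // Strip a leading UTF-8 byte-order mark (bytes EF BB BF), if present.
    if (str_starts_with($text, "\xEF\xBB\xBF")) {
        $text = substr($text, 3);
    }

    // Normalise to Normalisation Form C (NFC).
    $normalized = Normalizer::normalize($text, Normalizer::FORM_C);
    if ($normalized === false) {
        throw new InvalidArgumentException('Input could not be normalised');
    }

    return $normalized;
}

// "e" followed by a combining acute accent, preceded by a BOM.
$prepared = prepareText("\xEF\xBB\xBFe\u{0301}");
```

After the call, ''$prepared'' holds the single precomposed code point U+00E9.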
-=== static Text::join(array(string|Text) $elements, string|Text $separator) ===
+=== static Text::create(string $text, string $collation = 'root/standard') : \Text ===
-Creates a new Text object by concatenating the each Text element in
+The Symfony String package offers a static function to construct a String
+through a single-character function (''u''), which you can import into the
+file scope (with ''use'').
+This method serves a similar purpose, so that you can shorten ''new Text(…)'' to
+''t(…)'' after having imported the method into the file's scope, for
+example with ''use \Text::create as t''.
+=== static Text::concat(string|Text ...$elements) : \Text ===
+Creates a new Text object by concatenating all the given string/Text
+arguments.
+If the ''$elements'' array is empty, an empty ''Text'' object with the
+''root'' locale and ''standard'' collation is created.
+=== static Text::join(iterable<string|Text> $elements, string|Text $separator, string $collation = NULL) : \Text ===
+Creates a new Text object by looping over all the string/Text elements in
 ''$elements'', inserting ''$separator'' in between each element.
-Semantics like: ''implode(string $separator, array(string) $array)''
+The semantics are like: ''implode(string $separator, array(string) $array)''
+If the ''$collation'' is not specified, it uses the collation of the first
+element from the ''$elements'' iterable. That collation is then also set on the
+created object.
+If the ''$elements'' iterator has no items, an empty ''Text'' object with the
+''root'' locale and ''standard'' collation is created.
+If the iterator produces a non-string/Text element, then a ''\ValueException''
+will be thrown.
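The iteration semantics of ''join'' can be sketched on plain strings. The function name ''joinStrings'' is hypothetical, collation handling is left out, and ''InvalidArgumentException'' is used as a stand-in for the exception named above.

```php
<?php
// Hypothetical stand-in for Text::join(), operating on plain strings.
// Stringable objects are accepted in place of Text objects (PHP 8.0+).
function joinStrings(iterable $elements, string $separator): string
{
    $result = '';
    $first = true;

    foreach ($elements as $element) {
        if (!is_string($element) && !($element instanceof Stringable)) {
            // Stand-in for the exception type named in the RFC.
            throw new InvalidArgumentException('Element is not a string or Text');
        }
        // Insert the separator between elements, never before the first one.
        if (!$first) {
            $result .= $separator;
        }
        $result .= $element;
        $first = false;
    }

    // An iterable without items yields an empty result.
    return $result;
}

// Any iterable works, including a generator.
$words = (function () {
    yield 'one';
    yield 'two';
    yield 'three';
})();
$joined = joinStrings($words, ', ');
```

Accepting an ''iterable'' instead of an array means the elements can be produced lazily, which ''implode'' does not allow.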
 ==== Standard String Operations ====
-=== split(string|Text $separator, int $limit = PHP_INT_MAX): array(Text) ===
+=== split(string|Text $separator, int $limit = PHP_INT_MAX) : array(Text) ===
 Returns an array of Text objects, each of which is a substring of ''$this'',
@@ Line 133: / Line 181: @@
 https://www.php.net/manual/en/function.grapheme-substr.php
-=== trimLeft, trimRight, trim ===
+=== trimStart, trimEnd, trim : \Text ===
 Removes white space at the start of, the end of, or both sides of the text.
@@ Line 142: / Line 190: @@
 === wrap(int $maxWidth, bool $cutLongWords = false) : array(Text) ===
-Wraps a text to a given number of graphemes into an array of Text objects.
+Wraps a text to a given number of graphemes per line, into an array of Text
+objects.
 Like: ''wordwrap'', but based on graphemes and returning an array instead of
@@ Line 150: / Line 199: @@
 ''$maxWidth''.
+=== reverse() : \Text ===
-=== replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) ===
-Replaces the first ''$maxReplacements'' occurrences of ''$search'' with
-''$replace''.
-The ''$replaceFrom'' and ''$replaceTo'' arguments control which found
-items are being replace. The ''$replaceFrom'' argument is the first
-argument that is being replaced (0-indexed), and ''$replaceTo'' is the
-last item. Positive numbers are counted from the first occurence of
-''$search'' in the Text, and negative numbers from the last found
-occurrence.
-=== replaceTextCaseInsensitively(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) ===
-Replaces every occurrence of ''$search'' with ''$replace'' using the locale of
-the object that the method is called on. The locale of ''$search'' and
-''$replace'' is ignored.
-''$replaceFrom'' and ''$replaceTo'' behave as with ''replaceText''.
-=== reverse() ===
 Reverses a text, taking into account grapheme boundaries.
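To illustrate why grapheme boundaries matter when reversing, here is a minimal sketch using PCRE's ''\X'' (one extended grapheme cluster). The function name ''reverseGraphemes'' is hypothetical and not part of the proposal.

```php
<?php
// Hypothetical helper: reverse a UTF-8 string grapheme by grapheme.
function reverseGraphemes(string $text): string
{
    // \X matches one extended grapheme cluster; /u enables UTF-8 mode.
    preg_match_all('/\X/u', $text, $matches);

    return implode('', array_reverse($matches[0]));
}

// "b" is followed by a combining acute accent (U+0301).
$input = "ab\u{0301}c";
$reversed = reverseGraphemes($input);

// A byte-wise reversal splits the accent from its base letter and
// scrambles the accent's own UTF-8 bytes.
$byteReversed = strrev($input);
```

In ''$reversed'' the accent stays attached to the "b", whereas ''$byteReversed'' is no longer valid UTF-8.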
@@ Line 182: / Line 208: @@
 Methods to find text in other text.
-=== getPositionOfFirstOccurrence(string|Text $textToFind, int $offset) : int|false ===
+In all these methods, if ''$search'' is a ''Text'' object, its locale and
+collator are used to find matching sub-strings. Otherwise, the locale and
+collator of the object that the method is called on are used.
+=== getPositionOfFirstOccurrence(string|Text $search, int $offset) : int|false ===
 Returns the position (in grapheme units) of the first occurrence of
-''$textToFind'' starting at the (grapheme) ''$offset'', or false if not found.
+''$search'' starting at the (grapheme) ''$offset'', or false if not found.
-Like: ''grapheme_strpos($this, $textToFind, $offset)''
+Like: ''grapheme_strpos($this, $search, $offset)''
 https://www.php.net/manual/en/function.grapheme-strpos.php
-*I think this method name is too long*
+Alternative suggested names: ''findOffset''
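The difference between byte offsets and grapheme offsets can be seen with the existing ''grapheme_strpos'' function that this method is modelled on. The sketch assumes the ''intl'' extension; the sample text is illustrative.

```php
<?php
// Requires the intl extension for grapheme_strpos().
// The "ï" (U+00EF) takes two bytes in UTF-8 but is one grapheme.
$text = "na\u{00EF}ve caf\u{00E9}";

$byteOffset = strpos($text, 'caf');              // counts bytes
$graphemeOffset = grapheme_strpos($text, 'caf'); // counts graphemes

// Both functions return false when nothing is found.
$missing = grapheme_strpos($text, 'tea');
```

Here the byte offset is 7 and the grapheme offset is 6, because the two-byte "ï" precedes the match.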
-=== getPositionOfLastOccurrence(string|Text $textToFind, int $offset) : int|false ===
+=== getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false ===
 Like ''getPositionOfFirstOccurrence'' but then from the end of the text.
+Alternative suggested names: ''findOffsetLast''
-=== returnFromFirstOccurence(string|Text $textToFind) : Text|false ===
-Returns the ''Text'' starting with the ''$textToFind'' if found, and
+=== returnFromFirstOccurence(string|Text $search) : Text|false ===
+Returns the ''Text'' starting with the ''$search'' if found, and
 otherwise ''false''.
-Like: ''grapheme_strstr($this, $textToFind)''
+Like: ''grapheme_strstr($this, $search)''
 (https://www.php.net/manual/en/function.grapheme-strstr.php)
+Alternative suggested names: ''startingWith'', ''startingAt''
-=== returnFromLastOccurence(string|Text $textToFind) : Text|false ===
+=== returnFromLastOccurence(string|Text $search) : Text|false ===
 Like ''returnFromFirstOccurence'' but then from the end of the text.
-=== contains(string|Text $string) ===
+Alternative suggested names: ''startingWithLast'', ''startingAtLast''
-Returns true if the text ''$string'' can be found in the text.
+=== contains(string|Text $search) : bool ===
+Returns true if the text ''$search'' can be found in the text.
 Like ''str_contains''.
-=== endsWith(string|Text $string) : bool ===
+=== endsWith(string|Text $search) : bool ===
+Compares the last ''$search->length()'' graphemes of ''$this'' against ''$search''.
+Case-insensitive comparison can be achieved by setting the right
+''$collation'' on ''$search''.
-Could be constructed from ''getPositionOfFirstOccurrence()'' and
+Could be constructed from ''getPositionOfLastOccurrence()'' and
 ''length()'', but it's an often required method, and standard PHP has it
 too.
-=== startsWith(string|Text $string) : bool ===
+=== startsWith(string|Text $search) : bool ===
-Compares the first ''$string.Length()'' graphemes of ''$this'' using the
+Compares the first ''$search->length()'' graphemes of ''$this'' against ''$search''.
-locale and collator that are configured with ''$this''.
 Case-insensitive comparison can be achieved by setting the right
-''$locale'' and ''$collator'' on ''$this''.
+''$collation'' on ''$search''.
 Could be constructed from ''getPositionOfFirstOccurrence()'',
 but it's an often required method, and standard PHP has it
 too.
+=== replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) : \Text ===
+Replaces occurrences of ''$search'' with ''$replace''.
+The ''$replaceFrom'' and ''$replaceTo'' arguments control which found
+items are being replaced. The ''$replaceFrom'' argument is the first
+occurrence that is being replaced (0-indexed), and ''$replaceTo'' is the
+last one. Positive numbers are counted from the first occurrence of
+''$search'' in the Text, and negative numbers from the last found
+occurrence.
+In order to find sub-strings case-insensitively, construct the ''$search''
+argument as a ''Text'' object with a suitable ''$collation''.
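One possible reading of the ''$replaceFrom'' and ''$replaceTo'' semantics is sketched below on plain byte strings. The function name ''replaceOccurrences'' is hypothetical, and collation and grapheme handling are left out. The sketch assumes that negative indexes count back from the last occurrence, so ''-1'' is the last one.

```php
<?php
// Hypothetical sketch of the occurrence-range semantics, on plain strings.
function replaceOccurrences(
    string $text,
    string $search,
    string $replace,
    int $replaceFrom = 0,
    int $replaceTo = -1
): string {
    if ($search === '') {
        return $text;
    }

    // Collect the byte offset of every non-overlapping occurrence.
    $positions = [];
    $offset = 0;
    while (($position = strpos($text, $search, $offset)) !== false) {
        $positions[] = $position;
        $offset = $position + strlen($search);
    }

    // Negative indexes count back from the last occurrence (-1 is the last).
    $count = count($positions);
    $from = $replaceFrom < 0 ? $replaceFrom + $count : $replaceFrom;
    $to = $replaceTo < 0 ? $replaceTo + $count : $replaceTo;

    // Rebuild the text, replacing only occurrences inside [from, to].
    $result = '';
    $cursor = 0;
    foreach ($positions as $index => $position) {
        $result .= substr($text, $cursor, $position - $cursor);
        $result .= ($index >= $from && $index <= $to) ? $replace : $search;
        $cursor = $position + strlen($search);
    }

    return $result . substr($text, $cursor);
}

$all = replaceOccurrences('a-a-a-a', 'a', 'b');           // defaults: every occurrence
$middle = replaceOccurrences('a-a-a-a', 'a', 'b', 1, -2); // second up to second-to-last
```

With the defaults every occurrence is replaced, while ''1'' and ''-2'' leave the first and last occurrence untouched.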
 ==== Comparing Text Objects ====
-=== compareWith(Text $other) : int ===
+=== compareWith(Text $other, string $collation = NULL) : int ===
-Uses the configured ''$locale'' of ''$this'' to compare it against
+Uses the configured ''$collation'' of ''$this'' to compare it against
-''$other''. The locale of ''$other'' is ignored.
+''$other'', unless the ''$collation'' argument is specified as an override.
 This same method is also used for comparing two Text objects as "compare
-handler".
+handler" (an overloaded ''=='' operator). Here only the locale on ''$this'' is
+taken into account.
+=== equals(Text $other, string $collation = NULL) : bool ===
+Alias for ''compareWith($other, $collation) === 0''.
 ==== Case Conversions ====
+These operations all use the collation that is configured on the Text object.
-=== toLower ===
+=== toLower : \Text ===
 Converts the text to lower case, using the lower case variant of each
 Unicode code point that makes up the text.
+Example: ''Het Ĳsselmeer is vol met ideëen'' to ''het ĳsselmeer is vol met ideëen''.
+=== toUpper : \Text ===
+The same, but then to upper case.
-=== toUpper ===
+Example: ''Het Ĳsselmeer is vol met ideëen'' to ''HET ĲSSELMEER IS VOL MET IDEËEN''.
+=== toTitle : \Text ===
-=== toTitle ===
+The same, but then to title case (the first letter of each word).
+Example: ''Het Ĳsselmeer is vol met ideëen'' to ''Het Ĳsselmeer is Vol met Ideëen''.
-=== firstToLower ===
+=== firstToLower : \Text ===
 Converts the first grapheme in the text to a lower case variant.
+Example: ''Het Ĳsselmeer is vol met ideëen'' to ''het Ĳsselmeer is vol met ideëen''.
-=== firstToUpper ===
+=== firstToUpper : \Text ===
+The same, but then to upper case.
-=== firstToTitle ===
+Example: ''Het Ĳsselmeer is vol met ideëen'' to ''Het Ĳsselmeer is vol met ideëen''.
+=== wordsToLower : \Text ===
+Converts the first grapheme in every word to a lower case variant.
+Example: ''Het Ĳsselmeer is vol met ideëen'' to ''het ĳsselmeer is vol met ideëen''.
+=== wordsToUpper : \Text ===
+The same, but then to upper case.
+Example: ''Het Ĳsselmeer is vol met ideëen'' to ''Het Ĳsselmeer Is Vol Met Ideëen''.
@@ Line 282: / Line 369: @@
-=== getByteCount() ===
+=== getByteCount() : int ===
 Returns the size in bytes that the text will take when converted to UTF-8.
-=== length(), getCharacterCount() ===
+=== length(), getCharacterCount() : int ===
 Returns the number of characters that make up the text. A character (also
@@ Line 295: / Line 382: @@
-=== getCodePointCount() ===
+=== getCodePointCount() : int ===
 Returns the number of Unicode code points that make up the text.
@@ Line 301: / Line 388: @@
-=== countWords() ===
+=== getWordCount() : int ===
 Pretty much a shortcut for::
@@ Line 316: / Line 403: @@
 The return values of the iterators are affected by the text's locale.
+These are inspired by ICU4J's BreakIterators
+(https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/BreakIterator.html)
+and Intl's create*Instance methods on ''Intl\BreakIterator''
+(https://www.php.net/manual/en/class.intlbreakiterator.php).
-=== getCharacterIterator ===
+=== getCharacterIterator : \Iterator ===
+Returns an Iterator that locates boundaries between logical characters.
+Because of the structure of the Unicode encoding, a logical character may be
+stored internally as more than one Unicode code point. (A with an umlaut may
+be stored as an 'a' followed by a separate combining umlaut character, for
+example, but the user still thinks of it as one character.) This iterator
+allows various processes (especially text editors) to treat as characters the
+units of text that a user would think of as characters, rather than the units
+of text that the computer sees as "characters".
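The behaviour described here matches what Intl's existing ''IntlBreakIterator'' already offers, so it can serve as an illustration. This sketch assumes the ''intl'' extension; the proposed method may expose a different interface.

```php
<?php
// Requires the intl extension. IntlBreakIterator is the existing Intl class
// that the proposed iterators are inspired by.

// "a" followed by a combining diaeresis (U+0308), then "b":
// three code points, but two logical characters.
$text = "a\u{0308}b";

$iterator = IntlBreakIterator::createCharacterInstance('en');
$iterator->setText($text);

// getPartsIterator() yields the text between consecutive boundaries.
$characters = iterator_to_array($iterator->getPartsIterator(), false);
```

''$characters'' contains two entries: the "a" together with its combining mark, and the "b".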
+=== getWordIterator : \Iterator ===
-=== getLineIterator ===
+Returns an Iterator that locates boundaries between words. This is useful
+for double-click selection or "find whole words" searches. This type of
+iterator makes sure there is a boundary position at the beginning and end
+of each legal word. (Numbers count as words, too.) Whitespace and punctuation
+are kept separate from real words.
+=== getLineIterator : \Iterator ===
+Returns an Iterator that locates positions where it is legal for a text
+editor to wrap lines. This is similar to word breaking, but not the same:
+punctuation and whitespace are generally kept with words (you don't want a
+line to start with whitespace, for example), and some special characters can
+force a position to be considered a line-break position or prevent a position
+from being a line-break position.
-=== getSentenceIterator ===
+=== getSentenceIterator : \Iterator ===
+Returns an Iterator that locates boundaries between sentences.
-=== getTitleIterator ===
+=== getTitleIterator : \Iterator ===
-=== getWordIterator ===
+Returns an Iterator that locates boundaries between title breaks.
@@ Line 342: / Line 451: @@
-=== transliterate(string $transliterationString) ===
+=== transliterate(string $transliterationString) : \Text ===
-=== transliterate(\Intl\Transliterator $transliterator) ===
-With the first one being a "simple" one to use, and the second using Intl's
-Transliterator for more complex cases.
-Should we add shortcuts for a set of often used ones, such as ''Any-Latin''? I
-think so, as it's the majority use case.
+Transliterates the content of the ''Text'' object according to the rules as
+specified in the ''$transliterationString''.
-=== toLatin ===
+There are a few constants for specific and often used cases, such as creating
+an ASCII transliterated version of any Text:
-Converts any script to Latin.
+ - const Text::toAscii : A shortcut for a transliteration string that converts
+   any script to Latin, and also strips all the accents.
+ - const Text::toLatin : A shortcut for a transliteration string that converts
+   any script to Latin, but does not remove the accents.
-=== removeAccents ===
+ - const Text::removeAccents : Removes the accents from a Text. A shortcut for
+   the transliteration string ''"NFD; [:Nonspacing Mark:] Remove; NFC"''.
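The accent-removing rule string can be tried with Intl's existing ''Transliterator'' class. This is a sketch of the underlying ICU behaviour, assuming the ''intl'' extension, and not the proposed ''transliterate()'' method itself.

```php
<?php
// Requires the intl extension. Transliterator is the existing Intl class.

// Decompose (NFD), drop all nonspacing marks, then recompose (NFC).
$rules = 'NFD; [:Nonspacing Mark:] Remove; NFC';
$transliterator = Transliterator::create($rules);

// "ë" and "é" lose their accents; unaccented letters are untouched.
$plain = $transliterator->transliterate("Ide\u{00EB}en caf\u{00E9}");
```

The result in ''$plain'' is ''Ideeen cafe''.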
-Removes the accents from a (latin script) text.
+===== Implementation Details =====
-A shortcut for the transliteration string ''"Latin-ASCII"'' (or a more
+The functionality as is described in this RFC is mostly implemented by using
-suitable one, which I believe is ''"NFD; [:Nonspacing Mark:] Remove;
+functionality from the ICU library, which is also used by the Intl extension.
-NFC."''.
+In order for PHP to continue to work on as wide a range of platforms and
+distributions as possible, the minimum ICU version will be chosen according to
+the lowest version shipped by common Linux distributions that would include
+the version of PHP in which this functionality is implemented.
 ===== Backward Incompatible Changes =====
-Introducing a new class could impact code bases that already use this class
+Introducing a new ''Text'' class could impact code bases that already use this
-name. But as PHP owns the global namespace, this should not deter us from
+class name. But as PHP owns the global namespace, this should not deter us
-adding such a code class.
+from adding such a core class.
 ===== Proposed PHP Version(s) =====
@@ Line 387: / Line 495: @@
 ===== Open Issues =====
-==== Class Name ====
+  - Add a method like ''mb_strcut'', to extract a string of a maximum number of bytes from a position, as encoded through UTF-8.
-I have currently picked "Text", as it describes that the object does not only
+===== Questions and Answers =====
-represent single words (strings). Alternatively, we can pick something like
-"Utext" (for Unicode Text), but I find that a distraction.
+==== Why is this not a composer package? ====
+The goal of this RFC is that PHP users can always rely on performant text
+processing capabilities.
+Text processors written in PHP already exist, but suffer from performance
+issues (PHP is slower than C), and are sometimes tailored to specific use
+cases. By having them written in C, and utilising ICU's well tested and often
+updated rules and algorithms, both the performance and correctness issues will
+be addressed.
 ===== Future Scope =====
@@ Line 421: / Line 537: @@
 Nothing rejected yet.
+===== Changes =====
+0.9.2 - 2022-12-21
+  * Tim Düsterhus: Added concat and equals methods; changed join to accept an iterator.
+  * Enhance explanation of locales and collations, and standardize on using ''$collation'' as an argument name everywhere.
+0.9.1 - 2022-12-16
+  * Tim Düsterhus: Removed firstToTitle/wordsToTitle; added examples for toUpper and friends; added return types everywhere; added suggested other names for getPosition... methods; marked class as final.
+  * Paul Crovella: Clarify which normalisation is being used.
+  * Daniel Wolfe: Update trimLeft/trimRight to trimStart/trimEnd.
