Differences

This shows you the differences between two versions of the page.

--- rfc:unicode_text_processing [2022/12/15 15:29] – derick
+++ rfc:unicode_text_processing [2022/12/21 11:48] (current) – Added concat and equals methods; changed join to accept an iterator; Enhance explanation of locales and collations, and standardize on using ''$collator'' as an argument name everywhere. derick
@@ Line 1: / Line 1: @@
 ====== PHP RFC: Unicode Text Processing ======
-  * Version: 0.9
+  * Version: 0.9.2
-  * Date: 2022-11-09
+  * Date: 2022-12-21 (Original date: 2022-12-15)
   * Author: Derick Rethans <derick@php.net>
-  * Status: Under Discussion
+  * Status: Draft
   * First Published at: http://wiki.php.net/rfc/unicode_text_processing
@@ Line 26: / Line 26: @@
 ===== Proposal =====
-To introduce a new "Text" class, with methods to operate on the text
+To introduce a new final "Text" class, with methods to operate on the
-stored in the objects.
+text stored in the objects.
 Methods on the class will all return a new (immutable) object.
@@ Line 72: / Line 72: @@
 If an argument to any of the methods is listed as ''string|Text'',
 passing in a ''string'' value will have the same semantics as replacing
-the passed value with ''new Text($string)''. The locale from the Text
+the passed value with ''new Text($string)''. The locale and default collator
-object that this method is called on is also used for this new wrapped
+from the Text object that this method is called on are also used for this new
-value, if necessary.
+wrapped value, if necessary.
-==== Locales and Internationalisation ====
+==== Locales, Collators, and Internationalisation ====
-By default each string will have the "root" collator associated with it,
+By default each string will have the "root" locale and "standard" collator
-but it is possible to configure a specific collator by using the
+associated with it, but it is possible to configure a specific locale and
-''$collator'' argument in the constructor. The ''$collator'' is specified as
+collator by using the ''$collation'' argument in the constructor. Collation is in
-a string describing an ICU locale name:
+addition to the locale, and affects sorting and finding operations.
+The ''$collation'' is specified as a string describing an ICU locale/collation
+name:
 https://unicode-org.github.io/icu/userguide/collation/api.html#instantiating-the-predefined-collators
-For example, the locale (or collation) name ''en-u-ks-level1'' means
+The methods on the Text object all use the ''$collation'' argument name.
-case-insensitive sorting for the English locale. This will require
-extensive documentation.
+For example, the locale (and collation) name ''en-u-ks-level1'' means
+case-insensitive sorting (''ks-level1'') for the English locale (''en-u'').
+The format of this locale/collation name needs extensive documentation.
-Numerical order collation (such as PHP's ''natsort()'') can be achieved
+Numerical order collation (such as PHP's ''natsort()'') can be achieved by
-by adding the ''kn'' flag to the locale name, such as in ''de-u-kn''
+adding the ''kn'' flag to the collator specification, such as in ''de-u-kn''
-(case-sensitive German, with numerics in value order).
+(case-sensitive German (''de-u''), with numerics in value order (''kn'')).
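As a rough illustration of the two collation behaviours described above, the sketch below uses the existing ''Collator'' class from PHP's intl extension, not the proposed ''Text'' class. It assumes that primary strength is an acceptable stand-in for ''ks-level1'' (it ignores accents as well as case) and that the ''NUMERIC_COLLATION'' attribute corresponds to the ''kn'' flag. The sample words are made up for the example.

```php
<?php
// Stand-in for "en-u-ks-level1": English collation at primary strength.
// Assumption: primary strength ignores case (and also accents).
$caseInsensitive = new Collator('en');
$caseInsensitive->setStrength(Collator::PRIMARY);
$sameLetters = $caseInsensitive->compare('Straat', 'straat') === 0;

// Stand-in for "de-u-kn": German collation with digits compared by value.
$numeric = new Collator('de');
$numeric->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);
$files = ['datei10', 'datei2', 'datei1'];
$numeric->sort($files);
```

With numeric collation switched on, ''datei2'' sorts before ''datei10'', which is the ''natsort()''-like order the text refers to.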
 Other options are described in BCP47:
@@ Line 111: / Line 116: @@
 This section lists all the methods that construct a Text object.
-=== __construct(string $text, string $locale = 'root/standard') ===
+=== __construct(string $text, string $collation = 'root/standard') : \Text ===
 The constructor takes a UTF-8 encoded text, and stores this in an internal
 structure. The constructor will also convert the given text to Unicode
-Canonical Form. Passing in non-well-formed UTF-8 will result in an
+Canonical Form (also called Normalisation Form C, or NFC). Passing in
-''InvalidEncodingException''. The constructor will also strip out a BOM
+non-well-formed UTF-8 will result in an ''InvalidEncodingException''.
-(Byte-Order-Mark) character, if present.
+The constructor will also strip out a BOM (Byte-Order-Mark) character,
+if present.
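The constructor steps described here (BOM removal, well-formedness check, NFC normalisation) can be approximated with functions that exist today. This is only a sketch: ''prepareText'' is a hypothetical helper name, it returns a plain string rather than a ''Text'' object, and ''InvalidArgumentException'' stands in for the proposed ''InvalidEncodingException''. The order of the steps is also an assumption.

```php
<?php
// Hypothetical helper approximating what the proposed constructor does.
function prepareText(string $text): string
{
    // Strip a leading UTF-8 byte-order mark (EF BB BF), if present.
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        $text = substr($text, 3);
    }

    // An empty pattern with the /u flag only matches well-formed UTF-8.
    if (preg_match('//u', $text) !== 1) {
        // Stand-in for the RFC's InvalidEncodingException.
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }

    // Convert to Normalisation Form C using the intl extension.
    return Normalizer::normalize($text, Normalizer::FORM_C);
}
```

For example, passing ''"e\u{0301}"'' (an "e" followed by a combining acute accent) yields the single precomposed code point U+00E9.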
-=== static Text::create(string $text, string $locale = 'root/standard') ===
+=== static Text::create(string $text, string $collation = 'root/standard') : \Text ===
 The Symfony String package offers a static function to construct a String
@@ Line 129: / Line 136: @@
 For example with ''use \Text::create as t''.
-=== static Text::join(array(string|Text) $elements, string|Text $separator, string $collator = NULL) ===
-Creates a new Text object by concatenating the Text element in
+=== static Text::concat(string|Text ...$elements) : \Text ===
+Creates a new Text object by concatenating all the given string/Text arguments
+into a new Text object.
+If the ''$elements'' array is empty, an empty ''Text'' object with the
+''root'' locale and ''standard'' collation is created.
+=== static Text::join(iterable<string|Text> $elements, string|Text $separator, string $collation = NULL) : \Text ===
+Creates a new Text object by looping over all the string/Text elements in
 ''$elements'', inserting ''$separator'' in between each element.
 The semantics are like: ''implode(string $separator, array(string) $array)''
-If the ''$collator'' is not specified, it uses the collection of the first
+If the ''$collation'' is not specified, it uses the collation of the first
-element in the ''$elements'' array. This will also be then set on the created
+element from the ''$elements'' iterable. This will also be then set on the
-object.
+created object.
-If the ''$elements'' array is empty, an empty ''Text'' object with the
+If the ''$elements'' iterator has no items, an empty ''Text'' object with the
-''root'' locale is created.
+''root'' locale and ''standard'' collation is created.
+If the iterator produces a non-string/Text element, then a ''\ValueException''
+will be thrown.
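The ''join'' semantics described above (any iterable, ''implode''-like output, an error for elements of the wrong type) could look roughly like the following sketch on plain strings. ''joinElements'' is a hypothetical name, collation handling is left out, ''Stringable'' objects stand in for ''Text'' elements, and PHP's built-in ''ValueError'' stands in for the ''\ValueException'' named in the text.

```php
<?php
// Hypothetical plain-string approximation of the proposed Text::join().
function joinElements(iterable $elements, string $separator): string
{
    $parts = [];
    foreach ($elements as $element) {
        // Only strings and string-like objects are accepted.
        if (!is_string($element) && !($element instanceof Stringable)) {
            // Stand-in for the \ValueException mentioned in the RFC.
            throw new ValueError('Every element must be a string or string-like object');
        }
        $parts[] = (string) $element;
    }

    // An iterable without items produces an empty string.
    return implode($separator, $parts);
}
```

Because the parameter is typed ''iterable'', both arrays and generators can be passed in.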
 ==== Standard String Operations ====
-=== split(string|Text $separator, int $limit = PHP_INT_MAX): array(Text) ===
+=== split(string|Text $separator, int $limit = PHP_INT_MAX) : array(Text) ===
 Returns an array of Text objects, each of which is a substring of ''$this'',
@@ Line 162: / Line 181: @@
 https://www.php.net/manual/en/function.grapheme-substr.php
-=== trimLeft, trimRight, trim ===
+=== trimStart, trimEnd, trim : \Text ===
 Removes white space at the start of, the end of, or both sides of the text.
@@ Line 180: / Line 199: @@
 ''$maxWidth''.
+=== reverse() : \Text ===
-=== replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) ===
-Replaces the first ''$maxReplacements'' occurrences of ''$search'' with
-''$replace''.
-The locale of ''$search'' is used to find sub-strings that
-match, if it is a ''Text'' object, otherwise the locale embedded in the object
-that the method is called on.
-The ''$replaceFrom'' and ''$replaceTo'' arguments control which found
-items are being replaced. The ''$replaceFrom'' argument is the first
-argument that is being replaced (0-indexed), and ''$replaceTo'' is the
-last item. Positive numbers are counted from the first occurrence of
-''$search'' in the Text, and negative numbers from the last found
-occurrence.
-In order to find sub-strings case-insensitively, you can use the ''$collator''
-argument to the constructor of the ''$search'' argument.
-=== reverse() ===
 Reverses a text, taking into account grapheme boundaries.
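To show why grapheme boundaries matter when reversing, here is a minimal sketch on plain strings. ''reverseGraphemes'' is a hypothetical helper, and it relies on PCRE's ''\X'' escape to split the input into extended grapheme clusters. The proposed method may well be implemented differently.

```php
<?php
// Hypothetical helper: reverse a UTF-8 string one grapheme cluster at a time.
function reverseGraphemes(string $text): string
{
    // \X matches one extended grapheme cluster; the u flag enables UTF-8 mode.
    preg_match_all('/\X/u', $text, $matches);

    // Reverse the list of clusters and glue them back together.
    return implode('', array_reverse($matches[0]));
}
```

A byte-wise ''strrev()'' would split the combining accent in ''"ae\u{0301}"'' from its base letter and produce invalid UTF-8, whereas this version keeps the accented "e" together.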
@@ Line 209: / Line 208: @@
 Methods to find text in other text.
-In all these methods, the locale of ''$search'' is used to find sub-strings that
+In all these methods, the locale and collator of ''$search'' are used to find
-match, if it is a ''Text'' object, otherwise the locale embedded in the object
+sub-strings that match, if it is a ''Text'' object, otherwise the locale and
-that the method is called on.
+collator that are embedded in the object that the method is called on are used.
@@ Line 222: / Line 221: @@
 https://www.php.net/manual/en/function.grapheme-strpos.php
-*I think this method name is too long*
+Alternative suggested names: ''findOffset''
 === getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false ===
@@ Line 228: / Line 228: @@
 Like ''getPositionOfFirstOccurrence'' but then from the end of the text.
+Alternative suggested names: ''findOffsetLast''
@@ Line 237: / Line 239: @@
 Like: ''grapheme_strstr($this, $search)''
 (https://www.php.net/manual/en/function.grapheme-strstr.php)
+Alternative suggested names: ''startingWith'', ''startingAt''
@@ Line 242: / Line 246: @@
 Like ''returnFromFirstOccurence'' but then from the end of the text.
+Alternative suggested names: ''startingWithLast'', ''startingAtLast''
 === contains(string|Text $search) ===
@@ Line 255: / Line 262: @@
 Case-insensitive comparison can be achieved by setting the right
-''$collator'' on ''$search''.
+''$collation'' on ''$search''.
 Could be constructed from ''getPositionOfLastOccurrence()'' and
@@ Line 267: / Line 274: @@
 Case-insensitive comparison can be achieved by setting the right
-''$collator'' on ''$search''.
+''$collation'' on ''$search''.
 Could be constructed from ''getPositionOfFirstOccurrence()'',
 but it's an often required method, and standard PHP has it
 too.
+=== replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) : \Text ===
+Replaces occurrences of ''$search'' with ''$replace''.
+The ''$replaceFrom'' and ''$replaceTo'' arguments control which found
+items are being replaced. The ''$replaceFrom'' argument is the first
+occurrence that is being replaced (0-indexed), and ''$replaceTo'' is the
+last one. Positive numbers are counted from the first occurrence of
+''$search'' in the Text, and negative numbers from the last found
+occurrence.
+In order to find sub-strings case-insensitively, you can use the ''$collation''
+argument to ''Text::__construct'' of the ''$search'' argument.
 ==== Comparing Text Objects ====
-=== compareWith(Text $other, string $collator = NULL) : int ===
+=== compareWith(Text $other, string $collation = NULL) : int ===
-Uses the configured ''$collator'' of ''$this'' to compare it against
+Uses the configured ''$collation'' of ''$this'' to compare it against
-''$other'', unless the ''$collator'' argument is specified as an override.
+''$other'', unless the ''$collation'' argument is specified as an override.
 This same method is also used for comparing two Text objects as "compare
-handler". Here only the locale on ''$this'' is taken into account.
+handler" (an overloaded ''=='' operator). Here only the locale on ''$this'' is
+taken into account.
+=== equals(Text $other, string $collation = NULL) : bool ===
+Alias for ''compareWith($other, $collation) === 0''.
@@ Line 289: / Line 315: @@
 These operations all use the collation that is configured on the Text object.
-=== toLower ===
+=== toLower : \Text ===
 Converts the text to lower case, using the lower case variant of each
 Unicode code point that makes up the text.
-=== toUpper ===
+Example: ''Het Ĳsselmeer is vol met ideëen'' to ''het ĳsselmeer is vol met ideëen''.
+=== toUpper : \Text ===
 The same, but then to upper case.
-=== toTitle ===
+Example: ''Het Ĳsselmeer is vol met ideëen'' to ''HET ĲSSELMEER IS VOL MET IDEËEN''.
+=== toTitle : \Text ===
 The same, but then to title case (the first letter of each word).
-=== firstToLower ===
+Example: ''Het Ĳsselmeer is vol met ideëen'' to ''Het Ĳsselmeer is Vol met Ideëen''.
+=== firstToLower : \Text ===
 Converts the first grapheme in the text to a lower case variant.
-=== firstToUpper ===
+Example: ''Het Ĳsselmeer is vol met ideëen'' to ''het Ĳsselmeer is vol met ideëen''.
+=== firstToUpper : \Text ===
 The same, but then to upper case.
-=== firstToTitle ===
+Example: ''Het Ĳsselmeer is vol met ideëen'' to ''Het Ĳsselmeer is vol met ideëen''.
-The same, but then to title case (the first letter of each word).
-=== wordsToLower ===
+=== wordsToLower : \Text ===
 Converts the first grapheme in every word to a lower case variant.
-=== wordsToUpper ===
+Example: ''Het Ĳsselmeer is vol met ideëen'' to ''het ĳsselmeer is vol met ideëen''.
-The same, but then to upper case.
-=== wordsToTitle ===
+=== wordsToUpper : \Text ===
-The same, but then to title case (the first letter of each word).
+The same, but then to upper case.
+Example: ''Het Ĳsselmeer is vol met ideëen'' to ''Het Ĳsselmeer Is Vol Met Ideëen''.
@@ Line 331: / Line 369: @@
-=== getByteCount() ===
+=== getByteCount() : int ===
 Returns the size in bytes that the text will take when converted to UTF-8.
-=== length(), getCharacterCount() ===
+=== length(), getCharacterCount() : int ===
 Returns the number of characters that make up the text. A character (also
@@ Line 344: / Line 382: @@
-=== getCodePointCount() ===
+=== getCodePointCount() : int ===
 Returns the number of Unicode code points that make up the text.
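The difference between the three counts can be seen with existing functions on a plain string. This sketch does not use the proposed ''Text'' class. The sample is a "q" followed by a combining acute accent, chosen because that pair has no precomposed form, so NFC normalisation would leave it unchanged. ''mb_strlen()'' needs the mbstring extension and ''grapheme_strlen()'' needs intl.

```php
<?php
// "q" followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
$text = "q\u{0301}";

// Size in bytes of the UTF-8 encoding (1 byte for "q", 2 for U+0301).
$byteCount = strlen($text);

// Number of Unicode code points.
$codePointCount = mb_strlen($text, 'UTF-8');

// Number of grapheme clusters, which is what a reader sees as characters.
$characterCount = grapheme_strlen($text);
```

Here the byte count is 3, the code point count is 2, and the character count is 1.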
@@ Line 350: / Line 388: @@
-=== getWordCount() ===
+=== getWordCount() : int ===
 Pretty much a shortcut for::
@@ Line 364: / Line 402: @@
 These functions return an iterator that can be used to iterate over the text.
 The results of the iterators are affected by the text's locale.
-i
 These are inspired by ICU4J's BreakIterators
 (https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/BreakIterator.html)
@@ Line 370: / Line 408: @@
 (https://www.php.net/manual/en/class.intlbreakiterator.php).
-=== getCharacterIterator ===
+=== getCharacterIterator : \Iterator ===
 Returns an Iterator that locates boundaries between logical characters.
@@ Line 381: / Line 419: @@
 of text that the computer sees as "characters".
-=== getWordIterator ===
+=== getWordIterator : \Iterator ===
 Returns an Iterator that locates boundaries between words. This is useful
@@ Line 389: / Line 427: @@
 are kept separate from real words.
-=== getLineIterator ===
+=== getLineIterator : \Iterator ===
 Returns an Iterator that locates positions where it is legal for a text
@@ Line 398: / Line 436: @@
 from being a line-break position.
-=== getSentenceIterator ===
+=== getSentenceIterator : \Iterator ===
 Returns an Iterator that locates boundaries between sentences.
-=== getTitleIterator ===
+=== getTitleIterator : \Iterator ===
 Returns an Iterator that locates boundaries between title breaks.
@@ Line 413: / Line 451: @@
-=== transliterate(string $transliterationString) ===
+=== transliterate(string $transliterationString) : \Text ===
 Transliterates the content of the ''Text'' object according to the rules as
@@ Line 457: / Line 495: @@
 ===== Open Issues =====
+  - Add a method like mb_strcut, to extract a string of a maximum number of bytes from a position, as encoded in UTF-8.
 ===== Questions and Answers =====
@@ Line 498: / Line 537: @@
 Nothing rejected yet.
+===== Changes =====
+.9.2 — 2022-12-21
+  * Tim Düsterhus: Added concat and equals methods; changed join to accept an iterator.
+  * Enhance explanation of locales and collations, and standardize on using ''$collation'' as an argument name everywhere.
+.9.1 — 2022-12-16
+  * Tim Düsterhus: Removed firstToTitle/wordsToTitle; added examples for toUpper and friends; added return types everywhere; added suggested other names for getPosition... methods; marked class as final.
+  * Paul Crovella: Clarify which normalisation is being used.
+  * Daniel Wolfe: Update trimLeft/trimRight to trimStart/trimEnd.
