rfc:unicode_text_processing

Differences

This shows you the differences between two versions of the page.

rfc:unicode_text_processing [2022/12/15 15:29] derick
rfc:unicode_text_processing [2024/09/11 14:16] (current) derick
Line 1: Line 1:
 ====== PHP RFC: Unicode Text Processing ====== ====== PHP RFC: Unicode Text Processing ======
-  * Version: 0.9 +  * Version: 0.9.2 
-  * Date: 2022-11-09+  * Date: 2022-12-21 (Original date: 2022-12-15)
   * Author: Derick Rethans <derick@php.net>   * Author: Derick Rethans <derick@php.net>
-  * Status: Under Discussion+  * Status: Draft
   * First Published at: http://wiki.php.net/rfc/unicode_text_processing   * First Published at: http://wiki.php.net/rfc/unicode_text_processing
- 
  
 ===== Introduction ===== ===== Introduction =====
Line 26: Line 25:
 ===== Proposal ===== ===== Proposal =====
  
-To introduce a new "Text" class, with methods to operate on the text +To introduce a new final "Text" class, with methods to operate on the 
-stored in the objects.+text stored in the objects.
  
 Methods on the class will all return a new (immutable) object. Methods on the class will all return a new (immutable) object.
Line 72: Line 71:
 If an argument to any of the methods is listed as ''string|Text'', If an argument to any of the methods is listed as ''string|Text'',
 passing in a ''string'' value will have the same semantics as replacing passing in a ''string'' value will have the same semantics as replacing
-the passed value with ''new Text($string)''. The locale from the Text +the passed value with ''new Text($string)''. The locale and default collator 
-object that this method is called on is also used for this new wrapped +from the Text object that this method is called on is also used for this new 
-value, if necessary.+wrapped value, if necessary.
  
-==== Locales and Internationalisation ====+==== Locales, Collators, and Internationalisation ====
  
-By default each string will have the "root" collator associated with it, +By default each string will have the "root" locale and "standard" collator 
-but it is possible to configure a specific collator by using the +associated with it, but it is possible to configure a specific locale and 
-''$collator'' argument in the constructor. The ''$collator'' is specified as +collator by using the ''$collation'' argument in the constructor. Collation is in 
-a string describing an ICU locale name:+addition to the locale, and affects sorting and finding operations. 
 + 
 +The ''$collation'' is specified as a string describing an ICU locale/collation 
 +name:
 https://unicode-org.github.io/icu/userguide/collation/api.html#instantiating-the-predefined-collators https://unicode-org.github.io/icu/userguide/collation/api.html#instantiating-the-predefined-collators
  
-For example, the locale (or collation) name ''en-u-ks-level1'' means +The methods on the Text object all use the ''$collation'' argument name. 
-case-insensitive sorting for the English locale. This will require + 
-extensive documentation.+For example, the locale (and collation) name ''en-u-ks-level1'' means 
 +case-insensitive sorting (''ks-level1'') for the English locale (''en-u'').
 +The format of this locale/collation name needs extensive documentation.
  
-Numerical order collation (such as PHP's ''natsort()'') can be achieved +Numerical order collation (such as PHP's ''natsort()'') can be achieved by 
-by adding the ''kn'' flag to the locale name, such as in ''de-u-kn'' +adding the ''kn'' flag to the collator specification, such as in ''de-u-kn'' 
-(case-sensitive German, with numerics in value order).+(case-sensitive German (''de-u''), with numerics in value order (''kn'')).
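
The two behaviours described above can be tried today with the existing intl ''Collator'' class, since the ''Text'' class is only proposed. This is a minimal sketch, assuming the intl extension is loaded. It sets the strength and numeric ordering through explicit attributes rather than through the ''ks-level1'' and ''kn'' keywords in the locale name; that substitution is an assumption made for illustration and is not part of the RFC.

```php
<?php
// Sketch only: uses intl's existing Collator, not the proposed Text class.

// Primary strength ignores case (and accent) differences, which is the
// comparison level that the "ks-level1" keyword requests.
$caseInsensitive = new Collator('en');
$caseInsensitive->setStrength(Collator::PRIMARY);
$sameIgnoringCase = $caseInsensitive->compare('Unicode', 'UNICODE') === 0;

// Numeric collation orders runs of digits by value, which is what the
// "kn" keyword requests.
$numeric = new Collator('de');
$numeric->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);
$files = ['bild12', 'bild2', 'bild1'];
$numeric->sort($files);
$files = array_values($files);
```

Note that primary strength also treats accented and unaccented letters as equal, so it is broader than plain case-insensitivity.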
  
 Other options are described in BCP47: Other options are described in BCP47:
Line 111: Line 115:
 This section lists all the methods that construct a Text object. This section lists all the methods that construct a Text object.
  
-=== __construct(string $text, string $locale = 'root/standard') ===+=== __construct(string $text, string $collation = 'root/standard') : \Text ===
  
 The constructor takes a UTF-8 encoded text, and stores this in an internal The constructor takes a UTF-8 encoded text, and stores this in an internal
 structure. The constructor will also convert the given text to Unicode structure. The constructor will also convert the given text to Unicode
-Canonical Form. Passing in non-well-formed UTF-8 will result in an +Canonical Form (also called Normalisation Form C, or NFC). Passing in 
-''InvalidEncodingException''. The constructor will also strip out a BOM +non-well-formed UTF-8 will result in an ''InvalidEncodingException''. 
-(Byte-Order-Mark) character, if present.+The constructor will also strip out a BOM (Byte-Order-Mark) character, 
 +if present.
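
The input handling described for the constructor could look roughly like the sketch below. It assumes PHP 8 with the intl and mbstring extensions. The helper name ''prepareText'' is hypothetical, and a standard ''InvalidArgumentException'' stands in for the ''InvalidEncodingException'' named above, which does not exist yet.

```php
<?php
// Sketch only: a hypothetical helper mirroring the constructor's described
// input handling, built on the existing intl and mbstring extensions.
function prepareText(string $text): string
{
    // Reject input that is not well-formed UTF-8. The RFC names an
    // InvalidEncodingException; a standard exception stands in for it here.
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }

    // Strip a leading BOM (U+FEFF, encoded as EF BB BF in UTF-8).
    if (str_starts_with($text, "\xEF\xBB\xBF")) {
        $text = substr($text, 3);
    }

    // Normalise to NFC, so "e" + U+0301 becomes the single code point U+00E9.
    return Normalizer::normalize($text, Normalizer::FORM_C);
}

$decomposed = "\u{FEFF}cafe\u{0301}";
$prepared = prepareText($decomposed);
```

Here ''$prepared'' holds the text without the BOM and with the accent composed into a single code point.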
  
-=== static Text::create(string $text, string $locale = 'root/standard') ===+ 
 +=== static Text::create(string $text, string $collation = 'root/standard') : \Text ===
  
 The Symfony String package offers a static function to construct a String The Symfony String package offers a static function to construct a String
Line 129: Line 135:
 For example with ''use \Text::create as t''. For example with ''use \Text::create as t''.
  
-=== static Text::join(array(string|Text) $elements, string|Text $separator, string $collator = NULL) === 
  
-Creates a new Text object by concatenating the Text element in+=== static Text::concat(string|Text ...$elements) : \Text === 
 + 
 +Creates a new Text object by concatenating all the given string/Text arguments 
 +into a new Text object.  
 + 
 +If the ''$elements'' array is empty, an empty ''Text'' object with the 
 +''root'' locale and ''standard'' collation is created. 
 + 
 + 
 +=== static Text::join(iterable<string|Text> $elements, string|Text $separator, string $collation = NULL) : \Text === 
 + 
 +Creates a new Text object by looping over all the string/Text elements in
 ''$elements'', inserting ''$separator'' in between each element. ''$elements'', inserting ''$separator'' in between each element.
  
 The semantics are like: ''implode(string $separator, array(string) $array)'' The semantics are like: ''implode(string $separator, array(string) $array)''
  
-If the ''$collator'' is not specified, it uses the collection of the first +If the ''$collation'' is not specified, it uses the collation of the first 
-element in the ''$elements'' array. This will also be then set on the created +element from the ''$elements'' iterable. This will also be then set on the 
-object.+created object.
  
-If the ''$elements'' array is empty, an empty ''Text'' object with the +If the ''$elements'' iterator has no items, an empty ''Text'' object with the 
-''root'' locale is created.+''root'' locale and ''standard'' collation is created.
  
 +If the iterator produces a non-string/Text element, then a ''\ValueException''
 +will be thrown.
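
The iteration and error behaviour described for ''join'' can be sketched on plain strings. This is an illustration under stated assumptions, not the proposed implementation: the function name ''joinElements'' is hypothetical, ''Stringable'' objects stand in for ''Text'', PHP's built-in ''ValueError'' stands in for the exception type named above, and collation handling is left out.

```php
<?php
// Sketch only: plain strings and Stringable objects stand in for Text, and
// ValueError stands in for the exception type named in the RFC.
function joinElements(iterable $elements, string $separator): string
{
    $result = '';
    $first = true;
    foreach ($elements as $element) {
        if (!is_string($element) && !($element instanceof Stringable)) {
            throw new ValueError('Every element must be a string or Text');
        }
        if (!$first) {
            $result .= $separator;
        }
        $result .= $element;
        $first = false;
    }
    return $result; // An iterable with no items yields an empty result.
}

// Works with any iterable, including a generator.
$words = (function () {
    yield 'één';
    yield 'twee';
    yield 'drie';
})();
$joined = joinElements($words, ', ');
```

Accepting ''iterable'' rather than ''array'' means generators can be joined without first collecting them into an array.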
  
 ==== Standard String Operations ==== ==== Standard String Operations ====
  
  
-=== split(string|Text $separator, int $limit = PHP_INT_MAX): array(Text) ===+=== split(string|Text $separator, int $limit = PHP_INT_MAX) : array(Text) ===
  
 Returns an array of Text objects, each of which is a substring of ''$this'', Returns an array of Text objects, each of which is a substring of ''$this'',
Line 162: Line 180:
 https://www.php.net/manual/en/function.grapheme-substr.php https://www.php.net/manual/en/function.grapheme-substr.php
  
-=== trimLeft, trimRight, trim ===+=== trimStart, trimEnd, trim : \Text ===
  
 Removes white space at the start of, the end of, or both sides of the text. Removes white space at the start of, the end of, or both sides of the text.
Line 180: Line 198:
 ''$maxWidth''. ''$maxWidth''.
  
- +=== reverse() : \Text ===
-=== replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 === +
- +
-Replaces the first ''$maxReplacements'' occurrences of ''$search'' with +
-''$replace''+
- +
-The locale of ''$search'' is used to find sub-strings that +
-match, if it is a ''Text'' object, otherwise the locale embedded in the object +
-that the method is called on. +
- +
-The ''$replaceFrom'' and ''$replaceTo'' arguments control which found +
-items are being replaced. The ''$replaceFrom'' argument is the first +
-argument that is being replaced (0-indexed), and ''$replaceTo'' is the +
-last item. Positive numbers are counted from the first occurrence of +
-''$search'' in the Text, and negative numbers from the last found +
-occurrence. +
- +
-In order to find sub-strings case-insensitively, you can use the ''$collator'' +
-argument to the constructor of the ''$search'' argument. +
- +
-=== reverse() ===+
  
 Reverses a text, taking into account grapheme boundaries. Reverses a text, taking into account grapheme boundaries.
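
Why grapheme boundaries matter when reversing can be shown with a small sketch. It assumes PCRE's ''\X'' escape (an extended grapheme cluster) as a stand-in for the ICU-based segmentation the proposal would use; the function name ''reverseGraphemes'' is hypothetical.

```php
<?php
// Sketch only: splits on extended grapheme clusters with PCRE's \X and
// reverses the clusters, instead of reversing bytes or code points.
function reverseGraphemes(string $text): string
{
    preg_match_all('/\X/u', $text, $matches);
    return implode('', array_reverse($matches[0]));
}

$text = "ne\u{0301}e"; // "n", "e" + combining acute accent, "e"
$reversed = reverseGraphemes($text);
```

The combining accent stays attached to its base letter, whereas a byte-wise ''strrev()'' would split the accent's UTF-8 sequence and produce invalid text.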
Line 209: Line 207:
 Methods to find text in other text. Methods to find text in other text.
  
-In all these methods, the locale of ''$search'' is used to find sub-strings that  +In all these methods, the locale and collator of ''$search'' are used to find 
-match, if it is a ''Text'' object, otherwise the locale embedded in the object +sub-strings that match, if it is a ''Text'' object, otherwise the locale and 
-that the method is called on.+collator that are embedded in the object that the method is called on are used.
  
  
Line 222: Line 220:
 https://www.php.net/manual/en/function.grapheme-strpos.php https://www.php.net/manual/en/function.grapheme-strpos.php
  
-*I think this method name is too long*+Alternative suggested names: ''findOffset'' 
  
 === getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false === === getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false ===
Line 228: Line 227:
  
 Like ''getPositionOfFirstOccurrence'' but then from the end of the text. Like ''getPositionOfFirstOccurrence'' but then from the end of the text.
 +
 +Alternative suggested names: ''findOffsetLast''
  
  
Line 237: Line 238:
 Like: ''grapheme_strstr($this, $search)'' Like: ''grapheme_strstr($this, $search)''
 (https://www.php.net/manual/en/function.grapheme-strstr.php) (https://www.php.net/manual/en/function.grapheme-strstr.php)
 +
 +Alternative suggested names: ''startingWith'', ''startingAt''
  
  
Line 242: Line 245:
  
 Like ''returnFromFirstOccurence'' but then from the end of the text. Like ''returnFromFirstOccurence'' but then from the end of the text.
 +
 +Alternative suggested names: ''startingWithLast'', ''startingAtLast''
 +
  
 === contains(string|Text $search) === === contains(string|Text $search) ===
Line 255: Line 261:
  
 Case-insensitive comparison can be achieved by setting the right Case-insensitive comparison can be achieved by setting the right
-''$collator'' on ''$search''.+''$collation'' on ''$search''.
  
 Could be constructed from ''getPositionOfLastOccurrence()'' and Could be constructed from ''getPositionOfLastOccurrence()'' and
Line 267: Line 273:
  
 Case-insensitive comparison can be achieved by setting the right Case-insensitive comparison can be achieved by setting the right
-''$collator'' on ''$search''.+''$collation'' on ''$search''.
  
 Could be constructed from ''getPositionOfFirstOccurrence()'', Could be constructed from ''getPositionOfFirstOccurrence()'',
 but it's an often required method, and standard PHP has it but it's an often required method, and standard PHP has it
 too. too.
 +
 +=== replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) : \Text ===
 +
 +Replaces occurrences of ''$search'' with ''$replace''.
 +
 +The ''$replaceFrom'' and ''$replaceTo'' arguments control which found
 +occurrences are replaced. ''$replaceFrom'' is the index (0-indexed) of the
 +first occurrence to replace, and ''$replaceTo'' is the index of the last
 +one. Positive numbers are counted from the first occurrence of
 +''$search'' in the Text, and negative numbers from the last found
 +occurrence.
 +
 +In order to find sub-strings case-insensitively, you can use the ''$collation''
 +argument to ''Text::__construct'' of the ''$search'' argument.
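
The indexing rules for ''$replaceFrom'' and ''$replaceTo'' can be sketched on plain strings. This is an illustration only: the function name ''replaceOccurrences'' is hypothetical, matching is byte-wise and non-overlapping, and the locale- and collation-aware matching of the proposed method is not modelled.

```php
<?php
// Sketch only: byte-wise matching on plain strings; the proposed method
// would match according to locale and collation instead.
function replaceOccurrences(
    string $text,
    string $search,
    string $replace,
    int $replaceFrom = 0,
    int $replaceTo = -1
): string {
    if ($search === '') {
        return $text;
    }

    // Collect the byte offset of every non-overlapping occurrence.
    $offsets = [];
    $pos = 0;
    while (($pos = strpos($text, $search, $pos)) !== false) {
        $offsets[] = $pos;
        $pos += strlen($search);
    }

    // Map negative indexes onto positions counted from the last occurrence.
    $count = count($offsets);
    $from = $replaceFrom < 0 ? $count + $replaceFrom : $replaceFrom;
    $to = $replaceTo < 0 ? $count + $replaceTo : $replaceTo;

    // Replace from the back so that earlier offsets stay valid.
    for ($i = min($to, $count - 1); $i >= max($from, 0); $i--) {
        $text = substr_replace($text, $replace, $offsets[$i], strlen($search));
    }
    return $text;
}

$all = replaceOccurrences('a-b-c-d', '-', '+');          // defaults: every occurrence
$second = replaceOccurrences('a-b-c-d', '-', '+', 1, 1); // only the second occurrence
$last = replaceOccurrences('a-b-c-d', '-', '+', -1, -1); // only the last occurrence
```

With the defaults (''0'' and ''-1'') the range spans from the first to the last occurrence, so everything is replaced.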
  
  
 ==== Comparing Text Objects ==== ==== Comparing Text Objects ====
  
-=== compareWith(Text $other, string $collator = NULL) : int ===+=== compareWith(Text $other, string $collation = NULL) : int ===
  
-Uses the configured ''$collator'' of ''$this'' to compare it against +Uses the configured ''$collation'' of ''$this'' to compare it against 
-''$other'', unless the ''$collator'' argument is specified as an override.+''$other'', unless the ''$collation'' argument is specified as an override.
  
 This same method is also used for comparing two Text objects as "compare This same method is also used for comparing two Text objects as "compare
-handler". Here only the locale on ''$this'' is taken into account.+handler" (an overloaded ''=='' operator). Here only the locale on ''$this'' is 
 +taken into account.
 + 
 +=== equals(Text $other, string $collation = NULL) : bool ===
 + 
 +Alias for ''compareWith($other, $collation) === 0''.
  
  
Line 289: Line 314:
 These operations all use the collation that is configured on the Text object. These operations all use the collation that is configured on the Text object.
  
-=== toLower ===+=== toLower : \Text ===
  
 Converts the text to lower case, using the lower case variant of each Converts the text to lower case, using the lower case variant of each
 Unicode code point that makes up the text. Unicode code point that makes up the text.
  
-=== toUpper ===+Example: ''Het IJsselmeer is vol met ideëen'' to ''het ijsselmeer is vol met ideëen''
 + 
 + 
 +=== toUpper : \Text ===
  
 The same, but then to upper case. The same, but then to upper case.
  
-=== toTitle ===+Example: ''Het IJsselmeer is vol met ideëen'' to ''HET IJSSELMEER IS VOL MET IDEËEN''
 + 
 + 
 +=== toTitle : \Text ===
  
 The same, but then to title case (the first letter of each word). The same, but then to title case (the first letter of each word).
  
-=== firstToLower ===+Example: ''Het IJsselmeer is vol met ideëen'' to ''Het IJsselmeer is Vol met Ideëen''
 + 
 + 
 +=== firstToLower : \Text ===
  
 Converts the first grapheme in the text to a lower case variant. Converts the first grapheme in the text to a lower case variant.
  
-=== firstToUpper ===+Example: ''Het IJsselmeer is vol met ideëen'' to ''het IJsselmeer is vol met ideëen''
 + 
 + 
 +=== firstToUpper : \Text ===
  
 The same, but then to upper case. The same, but then to upper case.
  
-=== firstToTitle ===+Example: ''Het IJsselmeer is vol met ideëen'' to ''Het IJsselmeer is vol met ideëen''.
  
-The same, but then to title case (the first letter of each word). 
  
  
-=== wordsToLower ===+=== wordsToLower : \Text ===
  
 Converts the first grapheme in every word to a lower case variant. Converts the first grapheme in every word to a lower case variant.
  
-=== wordsToUpper ===+Example: ''Het IJsselmeer is vol met ideëen'' to ''het ijsselmeer is vol met ideëen''.
  
-The same, but then to upper case. 
  
-=== wordsToTitle ===+=== wordsToUpper : \Text ===
  
-The same, but then to title case (the first letter of each word).+The same, but then to upper case.
 + 
 +Example: ''Het IJsselmeer is vol met ideëen'' to ''Het IJsselmeer Is Vol Met Ideëen''.
  
  
Line 331: Line 368:
  
  
-=== getByteCount() ===+=== getByteCount() : int ===
  
 Returns the size in bytes that the text will take when converted to UTF-8. Returns the size in bytes that the text will take when converted to UTF-8.
  
  
-=== length(), getCharacterCount() ===+=== length(), getCharacterCount(): int  ===
  
 Returns the number of characters that make up the text. A character (also Returns the number of characters that make up the text. A character (also
Line 344: Line 381:
  
  
-=== getCodePointCount() ===+=== getCodePointCount() : int ===
  
 Returns the number of Unicode code points that make up the text. Returns the number of Unicode code points that make up the text.
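
The difference between the byte, character, and code point counts can be seen with existing functions. This is a rough sketch that assumes PCRE's ''\X'' escape as a stand-in for ICU's grapheme segmentation; the variable names are illustrative only.

```php
<?php
// Sketch only: existing functions that approximate the three counts.
$text = "x\u{0301}a"; // "x" + combining acute accent, then "a"

$byteCount = strlen($text);                        // bytes in the UTF-8 encoding
$codePointCount = preg_match_all('/./su', $text);  // Unicode code points
$characterCount = preg_match_all('/\X/u', $text);  // grapheme clusters
```

For this input the three counts are 4, 3, and 2: the combining accent takes two bytes and one code point, but belongs to the same character as the preceding letter.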
Line 350: Line 387:
  
  
-=== getWordCount() ===+=== getWordCount() : int ===
  
 Pretty much a shortcut for:: Pretty much a shortcut for::
Line 364: Line 401:
 These functions return an iterator that can be used to iterate over the text. These functions return an iterator that can be used to iterate over the text.
 The results of the iterators are affected by the text's locale. The results of the iterators are affected by the text's locale.
-i+
 These are inspired by ICU4J's BreakIterators These are inspired by ICU4J's BreakIterators
 (https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/BreakIterator.html) (https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/BreakIterator.html)
Line 370: Line 407:
 (https://www.php.net/manual/en/class.intlbreakiterator.php). (https://www.php.net/manual/en/class.intlbreakiterator.php).
  
-=== getCharacterIterator ===+=== getCharacterIterator : \Iterator ===
  
 Returns an Iterator that locates boundaries between logical characters. Returns an Iterator that locates boundaries between logical characters.
Line 381: Line 418:
 of text that the computer sees as "characters". of text that the computer sees as "characters".
  
-=== getWordIterator ===+=== getWordIterator : \Iterator ===
  
 Returns an Iterator that locates boundaries between words. This is useful Returns an Iterator that locates boundaries between words. This is useful
Line 389: Line 426:
 are kept separate from real words.  are kept separate from real words. 
  
-=== getLineIterator ===+=== getLineIterator : \Iterator ===
  
 Returns an Iterator that locates positions where it is legal for a text Returns an Iterator that locates positions where it is legal for a text
Line 398: Line 435:
 from being a line-break position.  from being a line-break position. 
  
-=== getSentenceIterator ===+=== getSentenceIterator : \Iterator ===
  
 Returns an Iterator that locates boundaries between sentences. Returns an Iterator that locates boundaries between sentences.
  
  
-=== getTitleIterator ===+=== getTitleIterator : \Iterator ===
  
 Returns an Iterator that locates boundaries between title breaks.  Returns an Iterator that locates boundaries between title breaks. 
Line 413: Line 450:
  
  
-=== transliterate(string $transliterationString) ===+=== transliterate(string $transliterationString) : \Text ===
  
 Transliterates the content of the ''Text'' object according to the rules as Transliterates the content of the ''Text'' object according to the rules as
Line 457: Line 494:
 ===== Open Issues ===== ===== Open Issues =====
  
 +  - Add a method like mb_strcut, to extract a string of at most a given number of bytes from a position, as encoded in UTF-8.
  
 ===== Questions and Answers ===== ===== Questions and Answers =====
Line 498: Line 536:
  
 Nothing rejected yet. Nothing rejected yet.
 +
 +
 +===== Changes =====
 +
 +0.9.2 — 2022-12-21
 +
 +  * Tim Düsterhus: Added concat and equals methods; changed join to accept an iterator.
 +  * Enhance explanation of locales and collations, and standardize on using ''$collation'' as an argument name everywhere.
 +
 +0.9.1 — 2022-12-16
 +
 +  * Tim Düsterhus: Removed firstToTitle/wordsToTitle; added examples for toUpper and friends; added return types everywhere; added suggested other names for getPosition... methods; marked class as final.
 +  * Paul Crovella: Clarify which normalisation is being used.
 +  * Daniel Wolfe: Update trimLeft/trimRight to trimStart/trimEnd.
rfc/unicode_text_processing.1671118174.txt.gz · Last modified: 2022/12/15 15:29 by derick