rfc:unicode_text_processing

rfc:unicode_text_processing [2022/12/15 15:31] – Set date for initial announcement derick
rfc:unicode_text_processing [2022/12/16 13:54] derick
Line 1: Line 1:
 ====== PHP RFC: Unicode Text Processing ====== ====== PHP RFC: Unicode Text Processing ======
   * Version: 0.9   * Version: 0.9
-  * Date: 2022-12-15+  * Date: 2022-12-16 (Original date: 2022-12-15)
   * Author: Derick Rethans <derick@php.net>   * Author: Derick Rethans <derick@php.net>
-  * Status: Under Discussion+  * Status: Draft
   * First Published at: http://wiki.php.net/rfc/unicode_text_processing   * First Published at: http://wiki.php.net/rfc/unicode_text_processing
  
Line 26: Line 26:
 ===== Proposal ===== ===== Proposal =====
  
-To introduce a new "Text" class, with methods to operate on the text +To introduce a new final "Text" class, with methods to operate on the 
-stored in the objects.+text stored in the objects.
  
 Methods on the class will all return a new (immutable) object. Methods on the class will all return a new (immutable) object.
Line 111: Line 111:
 This section lists all the methods that construct a Text object. This section lists all the methods that construct a Text object.
  
-=== __construct(string $text, string $locale = 'root/standard') ===+=== __construct(string $text, string $locale = 'root/standard') : \Text ===
  
 The constructor takes a UTF-8 encoded text, and stores this in an internal The constructor takes a UTF-8 encoded text, and stores this in an internal
 structure. The constructor will also convert the given text to Unicode structure. The constructor will also convert the given text to Unicode
-Canonical Form. Passing in non-well-formed UTF-8 will result in an +Canonical Form (also called Normalisation Form C, or NFC). Passing in 
-''InvalidEncodingException''. The constructor will also strip out a BOM +non-well-formed UTF-8 will result in an ''InvalidEncodingException''. 
-(Byte-Order-Mark) character, if present.+The constructor will also strip out a BOM (Byte-Order-Mark) character, 
 +if present.
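
The input handling described above can be sketched in plain PHP. This is only an illustration under stated assumptions: ''prepare_text_input'' is a hypothetical helper and not part of the RFC, the order of the steps is a guess, ''InvalidArgumentException'' stands in for the proposed ''InvalidEncodingException'', and NFC normalisation is applied only when ext/intl is loaded.

```php
<?php
/**
 * Hypothetical helper (not part of the RFC): prepares raw input roughly the
 * way the constructor is described as doing.
 */
function prepare_text_input(string $text): string
{
    // Strip a leading UTF-8 BOM (bytes EF BB BF), if present.
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        $text = substr($text, 3);
    }

    // An empty pattern with the /u modifier only succeeds on well-formed UTF-8.
    if (preg_match('//u', $text) !== 1) {
        // The RFC names InvalidEncodingException; a built-in stands in here.
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }

    // Normalise to NFC when ext/intl is available; otherwise leave unchanged.
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($text, Normalizer::FORM_C);
        if ($normalized !== false) {
            $text = $normalized;
        }
    }

    return $text;
}
```

Normalising up front means two inputs that differ only in how an accented letter is composed end up stored identically, which is presumably why the proposal does it in the constructor.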
  
-=== static Text::create(string $text, string $locale = 'root/standard') ===+=== static Text::create(string $text, string $locale = 'root/standard') : \Text ===
  
 The Symfony String package offers a static function to construct a String The Symfony String package offers a static function to construct a String
Line 129: Line 130:
 For example with ''use \Text::create as t''. For example with ''use \Text::create as t''.
  
-=== static Text::join(array(string|Text) $elements, string|Text $separator, string $collator = NULL) ===+=== static Text::join(array(string|Text) $elements, string|Text $separator, string $collator = NULL) : \Text ===
  
 Creates a new Text object by concatenating the Text element in Creates a new Text object by concatenating the Text element in
Line 147: Line 148:
  
  
-=== split(string|Text $separator, int $limit = PHP_INT_MAX): array(Text) ===+=== split(string|Text $separator, int $limit = PHP_INT_MAX) : array(Text) ===
  
 Returns an array of Text objects, each of which is a substring of ''$this'', Returns an array of Text objects, each of which is a substring of ''$this'',
Line 162: Line 163:
 https://www.php.net/manual/en/function.grapheme-substr.php https://www.php.net/manual/en/function.grapheme-substr.php
  
-=== trimLeft, trimRight, trim ===+=== trimStart, trimEnd, trim : \Text ===
  
 Removes white space at the start of, the end of, or both sides of the text. Removes white space at the start of, the end of, or both sides of the text.
Line 181: Line 182:
  
  
-=== replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) ===+=== replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) : \Text ===
  
-Replaces the first ''$maxReplacements'' occurrences of ''$search'' with +Replaces occurrences of ''$search'' with ''$replace''.
-''$replace''.+
  
 The locale of ''$search'' is used to find sub-strings that The locale of ''$search'' is used to find sub-strings that
Line 198: Line 198:
  
 In order to find sub-strings case-insensitively, you can use the ''$collator'' In order to find sub-strings case-insensitively, you can use the ''$collator''
-argument to the constructor of the ''$search'' argument.+argument to ''Text::__construct'' of the ''$search'' argument.
  
-=== reverse() ===+=== reverse() : \Text ===
  
 Reverses a text, taking into account grapheme boundaries. Reverses a text, taking into account grapheme boundaries.
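
To illustrate why grapheme boundaries matter when reversing, here is a minimal sketch that works on a plain UTF-8 string. ''reverse_graphemes'' is a hypothetical helper and not the RFC's implementation; it relies on PCRE's ''\X'' escape, which matches one extended grapheme cluster.

```php
<?php
// Hypothetical helper, not the RFC's implementation.
function reverse_graphemes(string $text): string
{
    // \X matches one extended grapheme cluster; /u enables UTF-8 mode.
    if (preg_match_all('/\X/u', $text, $matches) === false) {
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }

    // Reverse the list of clusters, keeping each cluster's bytes intact.
    return implode('', array_reverse($matches[0]));
}

// "noél" written with a combining acute accent (U+0301) after the "e".
$reversed = reverse_graphemes("noe\u{0301}l");
```

A byte-wise ''strrev'' on the same input would reverse the bytes of the combining accent and produce malformed UTF-8, whereas the cluster-based version keeps the accent attached to its base letter.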
Line 222: Line 222:
 https://www.php.net/manual/en/function.grapheme-strpos.php https://www.php.net/manual/en/function.grapheme-strpos.php
  
-*I think this method name is too long*+Alternative suggested names: ''findOffset'' 
  
 === getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false === === getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false ===
Line 228: Line 229:
  
 Like ''getPositionOfFirstOccurrence'' but then from the end of the text. Like ''getPositionOfFirstOccurrence'' but then from the end of the text.
 +
 +Alternative suggested names: ''findOffsetLast''
  
  
Line 237: Line 240:
 Like: ''grapheme_strstr($this, $search)'' Like: ''grapheme_strstr($this, $search)''
 (https://www.php.net/manual/en/function.grapheme-strstr.php) (https://www.php.net/manual/en/function.grapheme-strstr.php)
 +
 +Alternative suggested names: ''startingWith''
  
  
Line 248: Line 253:
  
 Like ''str_contains''. Like ''str_contains''.
 +
 +Alternative suggested names: ''startingWithLast''
  
  
Line 289: Line 296:
 These operations all use the collation that is configured on the Text object. These operations all use the collation that is configured on the Text object.
  
-=== toLower ===+=== toLower : \Text ===
  
 Converts the text to lower case, using the lower case variant of each Converts the text to lower case, using the lower case variant of each
 Unicode code point that makes up the text. Unicode code point that makes up the text.
  
-=== toUpper ===+Example: ''Het IJsselmeer is vol met ideëen'' to ''het ijsselmeer is vol met ideëen''
 + 
 + 
 +=== toUpper : \Text ===
  
 The same, but then to upper case. The same, but then to upper case.
  
-=== toTitle ===+Example: ''Het IJsselmeer is vol met ideëen'' to ''HET IJSSELMEER IS VOL MET IDEËEN''
 + 
 + 
 +=== toTitle : \Text ===
  
 The same, but then to title case (the first letter of each word). The same, but then to title case (the first letter of each word).
  
-=== firstToLower ===+Example: ''Het IJsselmeer is vol met ideëen'' to ''Het IJsselmeer is Vol met Ideëen''
 + 
 + 
 +=== firstToLower : \Text ===
  
 Converts the first grapheme in the text to a lower case variant. Converts the first grapheme in the text to a lower case variant.
  
-=== firstToUpper ===+Example: ''Het IJsselmeer is vol met ideëen'' to ''het IJsselmeer is vol met ideëen''
 + 
 + 
 +=== firstToUpper : \Text ===
  
 The same, but then to upper case. The same, but then to upper case.
  
-=== firstToTitle ===+Example: ''Het IJsselmeer is vol met ideëen'' to ''Het IJsselmeer is vol met ideëen''.
  
-The same, but then to title case (the first letter of each word). 
  
  
-=== wordsToLower ===+=== wordsToLower : \Text ===
  
 Converts the first grapheme in every word to a lower case variant. Converts the first grapheme in every word to a lower case variant.
  
-=== wordsToUpper ===+Example: ''Het IJsselmeer is vol met ideëen'' to ''het ijsselmeer is vol met ideëen''.
  
-The same, but then to upper case. 
  
-=== wordsToTitle ===+=== wordsToUpper : \Text ===
  
-The same, but then to title case (the first letter of each word).+The same, but then to upper case.
 + 
 +Example: ''Het IJsselmeer is vol met ideëen'' to ''Het IJsselmeer Is Vol Met Ideëen''.
  
  
Line 331: Line 350:
  
  
-=== getByteCount() ===+=== getByteCount() : int ===
  
 Returns the size in bytes that the text will take when converted to UTF-8. Returns the size in bytes that the text will take when converted to UTF-8.
  
  
-=== length(), getCharacterCount() ===+=== length(), getCharacterCount() : int ===
  
 Returns the number of characters that make up the text. A character (also Returns the number of characters that make up the text. A character (also
Line 344: Line 363:
  
  
-=== getCodePointCount() ===+=== getCodePointCount() : int ===
  
 Returns the number of Unicode code points that make up the text. Returns the number of Unicode code points that make up the text.
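
The three counts above can differ for the same text. The following sketch shows this on a plain UTF-8 string; ''text_counts'' is a hypothetical helper, and because it skips the NFC normalisation that the proposed constructor performs, its code point count for decomposed input may differ from what a ''Text'' object would report.

```php
<?php
// Hypothetical helper for illustration; operates on a raw UTF-8 string.
function text_counts(string $text): array
{
    return [
        // Size of the UTF-8 encoding in bytes.
        'bytes'      => strlen($text),
        // With /u, "." matches one code point; /s lets it match newlines too.
        'codePoints' => preg_match_all('/./su', $text),
        // \X matches one extended grapheme cluster (a user-perceived character).
        'characters' => preg_match_all('/\X/u', $text),
    ];
}

// "aé" with the accent written as a separate combining code point (U+0301).
$counts = text_counts("ae\u{0301}");
// $counts is ['bytes' => 4, 'codePoints' => 3, 'characters' => 2]
```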
Line 350: Line 369:
  
  
-=== getWordCount() ===+=== getWordCount() : int ===
  
 Pretty much a shortcut for:: Pretty much a shortcut for::
Line 370: Line 389:
 (https://www.php.net/manual/en/class.intlbreakiterator.php). (https://www.php.net/manual/en/class.intlbreakiterator.php).
  
-=== getCharacterIterator ===+=== getCharacterIterator : \Iterator ===
  
 Returns an Iterator that locates boundaries between logical characters. Returns an Iterator that locates boundaries between logical characters.
Line 381: Line 400:
 of text that the computer sees as "characters". of text that the computer sees as "characters".
  
-=== getWordIterator ===+=== getWordIterator : \Iterator ===
  
 Returns an Iterator that locates boundaries between words. This is useful Returns an Iterator that locates boundaries between words. This is useful
Line 389: Line 408:
 are kept separate from real words.  are kept separate from real words. 
  
-=== getLineIterator ===+=== getLineIterator : \Iterator ===
  
 Returns an Iterator that locates positions where it is legal for a text Returns an Iterator that locates positions where it is legal for a text
Line 398: Line 417:
 from being a line-break position.  from being a line-break position. 
  
-=== getSentenceIterator ===+=== getSentenceIterator : \Iterator ===
  
 Returns an Iterator that locates boundaries between sentences. Returns an Iterator that locates boundaries between sentences.
  
  
-=== getTitleIterator ===+=== getTitleIterator : \Iterator ===
  
 Returns an Iterator that locates boundaries between title breaks.  Returns an Iterator that locates boundaries between title breaks. 
Line 413: Line 432:
  
  
-=== transliterate(string $transliterationString) ===+=== transliterate(string $transliterationString) : \Text ===
  
 Transliterates the content of the ''Text'' object according to the rules as Transliterates the content of the ''Text'' object according to the rules as
Line 457: Line 476:
 ===== Open Issues ===== ===== Open Issues =====
  
 +  - Add a method like mb_strcut, to extract a string of a maximum number of bytes from a position, as encoded in UTF-8.
 +  - Tidy up language related to locale/collator. As Tim Starling says: "If the input is an ICU locale string, then I think you should just call it locale. Then the user will be armed with the correct terminology when they go looking for more information in the ICU manual. In ICU, case conversion and BreakIterator need a locale, not a collator."
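
As a rough idea of what a byte-limited extraction could look like, the sketch below keeps whole grapheme clusters until a byte budget is used up. It is an assumption-laden illustration only: ''cut_to_byte_limit'' is a hypothetical name, it always starts at the beginning of the string rather than at a given position, and cutting on grapheme boundaries (rather than on code point boundaries, as ''mb_strcut'' does) is a design choice the RFC has not made.

```php
<?php
// Hypothetical helper; the RFC does not define this method yet.
function cut_to_byte_limit(string $text, int $maxBytes): string
{
    // Split into extended grapheme clusters (\X) in UTF-8 mode (/u).
    if (preg_match_all('/\X/u', $text, $matches) === false) {
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }

    $result = '';
    foreach ($matches[0] as $grapheme) {
        // Stop before the first cluster that would exceed the byte budget.
        if (strlen($result) + strlen($grapheme) > $maxBytes) {
            break;
        }
        $result .= $grapheme;
    }

    return $result;
}

// "a" is 1 byte; "e" plus U+0301 is 3 bytes, so a 3-byte budget keeps only "a".
$short = cut_to_byte_limit("ae\u{0301}b", 3);
```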
  
 ===== Questions and Answers ===== ===== Questions and Answers =====
Line 498: Line 519:
  
 Nothing rejected yet. Nothing rejected yet.
 +
 +
 +===== Changes =====
 +
 +0.9.1 — 2022-12-16
 +
 +  * Tim Düsterhus: Removed firstToTitle/wordsToTitle; added examples for toUpper and friends; added return types everywhere; added suggested other names for getPosition... methods; marked class as final.
 +  * Paul Crovella: Clarify which normalisation is being used.
 +  * Daniel Wolfe: Update trimLeft/trimRight to trimStart/trimEnd.
rfc/unicode_text_processing.txt · Last modified: 2022/12/21 11:48 by derick