rfc:unicode_text_processing

rfc:unicode_text_processing [2022/12/15 15:28] – Argue the case for a C-based implementation, and mention ICU implementation details derick
rfc:unicode_text_processing [2022/12/18 17:29] – Fix several typos or difficult wording theodorejb
Line 1: Line 1:
 ====== PHP RFC: Unicode Text Processing ======
   * Version: 0.9
-  * Date: 2022-11-09
+  * Date: 2022-12-16 (Original date: 2022-12-15)
   * Author: Derick Rethans <derick@php.net>
   * Status: Draft
Line 26: Line 26:
 ===== Proposal =====
 
-To introduce a new "Text" class, with methods to operate on the text
-stored in the objects.
+To introduce a new final "Text" class, with methods to operate on the
+text stored in the objects.
 
 Methods on the class will all return a new (immutable) object.
 
 The proposal is to make the ''Text'' class part of the PHP core. This would
-mean that it is therefore always available to user. As the implementation
+mean that it is therefore always available to users. As the implementation
 requires ICU, this would also mean that PHP will depend on the ICU library.
  
Line 109: Line 109:
 ==== Construction ====
 
-This section lists all the method that construct a Text object.
+This section lists all the methods that construct a Text object.
 
-=== __construct(string $text, string $locale = 'root/standard') ===
+=== __construct(string $text, string $locale = 'root/standard') : \Text ===
 
 The constructor takes a UTF-8 encoded text, and stores this in an internal
 structure. The constructor will also convert the given text to Unicode
-Canonical Form. Passing in non-well-formed UTF-8 will result in an
-''InvalidEncodingException''. The constructor will also strip out a BOM
-(Byte-Order-Mark) character, if present.
+Canonical Form (also called Normalisation Form C, or NFC). Passing in
+non-well-formed UTF-8 will result in an ''InvalidEncodingException''.
+The constructor will also strip out a BOM (Byte-Order-Mark) character,
+if present.
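The input handling described for the constructor can be approximated today with existing extensions. The following is only a sketch, assuming ext/mbstring and ext/intl are loaded. The helper name ''prepareTextInput'' is hypothetical, the order of the steps is an assumption, and a built-in exception stands in for the proposed ''InvalidEncodingException''.

```php
<?php
// Hypothetical helper, not part of the RFC: mimics the constructor's
// described input handling with ext/mbstring and ext/intl.
function prepareTextInput(string $text): string
{
    // Reject input that is not well-formed UTF-8. The RFC names
    // InvalidEncodingException; a built-in exception stands in for it here.
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }

    // Strip a leading BOM (U+FEFF, encoded in UTF-8 as EF BB BF).
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        $text = substr($text, 3);
    }

    // Convert to Normalisation Form C (NFC).
    $normalised = Normalizer::normalize($text, Normalizer::FORM_C);
    if ($normalised === false) {
        throw new RuntimeException('Normalisation failed');
    }

    return $normalised;
}

// "e" followed by U+0301 (combining acute accent) composes to U+00E9.
var_dump(prepareTextInput("\xEF\xBB\xBFe\u{0301}") === "\u{00E9}"); // bool(true)
```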
  
-=== static Text::create(string $text, string $locale = 'root/standard') ===
+=== static Text::create(string $text, string $locale = 'root/standard') : \Text ===
 
-The Symfony String packageoffers a static function to construct a String
+The Symfony String package offers a static function to construct a String
 through a single-character function (''u''), which you can import into the
 file scope (with ''use'').
 
 This method solves a similar use, so that you can shorten ''new Text(…)'' to
-''t'' after having imported the method into the file's scope with:
-For example with ''use \Text::create as t''.
+''t'' after having imported the method into the file's scope with (for example)
+''use \Text::create as t''.
 
-=== static Text::join(array(string|Text) $elements, string|Text $separator, string $collator = NULL) ===
+=== static Text::join(array(string|Text) $elements, string|Text $separator, string $collator = NULL) : \Text ===
 
 Creates a new Text object by concatenating the Text element in
Line 137: Line 138:
  
 If the ''$collator'' is not specified, it uses the collation of the first
-element in the ''$elements'' array. This will also be then set on the created
+element in the ''$elements'' array. This will then also be set on the created
 object.
  
Line 147: Line 148:
  
  
-=== split(string|Text $separator, int $limit = PHP_INT_MAX): array(Text) ===
+=== split(string|Text $separator, int $limit = PHP_INT_MAX) : array(Text) ===
 
 Returns an array of Text objects, each of which is a substring of ''$this'',
Line 162: Line 163:
 https://www.php.net/manual/en/function.grapheme-substr.php
 
-=== trimLeft, trimRight, trim ===
+=== trimStart, trimEnd, trim : \Text ===
 
 Removes white space at the start of, the end of, or both sides of the text.
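As a rough illustration of Unicode-aware trimming, the sketch below uses ''preg_replace()'' with the ''u'' modifier. The function names are hypothetical, and the definition of white space used here (Unicode separators plus ''\s'') is an assumption, since the text above does not say which characters the proposed methods would remove.

```php
<?php
// Hypothetical stand-ins for trimStart/trimEnd/trim, built on preg_replace().
// [\p{Z}\s] covers Unicode separators (such as U+00A0) plus ASCII white space.
function trimStartSketch(string $text): string
{
    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace('/^[\p{Z}\s]+/u', '', $text) ?? $text;
}

function trimEndSketch(string $text): string
{
    // \z anchors at the very end of the subject.
    return preg_replace('/[\p{Z}\s]+\z/u', '', $text) ?? $text;
}

function trimSketch(string $text): string
{
    return trimEndSketch(trimStartSketch($text));
}

// U+00A0 is a no-break space, U+2003 is an em space.
var_dump(trimSketch("\u{00A0} hello\t\u{2003}") === 'hello'); // bool(true)
```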
Line 181: Line 182:
  
  
-=== replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) ===
+=== replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) : \Text ===
 
-Replaces the first ''$maxReplacements'' occurrences of ''$search'' with
-''$replace''.
+Replaces occurrences of ''$search'' with ''$replace''.
 
 The locale of ''$search'' is used to find sub-strings that
Line 198: Line 198:
  
 In order to find sub-strings case-insensitively, you can use the ''$collator''
-argument to the constructor of the ''$search'' argument.
+argument to ''Text::__construct'' of the ''$search'' argument.
 
-=== reverse() ===
+=== reverse() : \Text ===
 
 Reverses a text, taking into account grapheme boundaries.
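To show why grapheme boundaries matter when reversing, here is a minimal sketch that splits on PCRE's ''\X'' (one extended grapheme cluster) and reverses the pieces. The function name ''reverseGraphemes'' is hypothetical, and this is not necessarily how the proposed ''reverse()'' would be implemented.

```php
<?php
// Hypothetical helper: reverse a UTF-8 string one grapheme cluster at a time.
function reverseGraphemes(string $text): string
{
    // \X matches one extended grapheme cluster, so a base letter stays
    // together with the combining marks that follow it.
    if (preg_match_all('/\X/u', $text, $matches) === false) {
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }

    return implode('', array_reverse($matches[0]));
}

$text = "abq\u{0303}"; // "a", "b", then "q" with a combining tilde

// The combining tilde stays attached to its "q".
var_dump(reverseGraphemes($text) === "q\u{0303}ba"); // bool(true)

// A byte-wise strrev() gives a different, broken result.
var_dump(strrev($text) === reverseGraphemes($text)); // bool(false)
```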
Line 222: Line 222:
 https://www.php.net/manual/en/function.grapheme-strpos.php
 
-*I think this method name is too long*
+Alternative suggested names: ''findOffset''
 
 === getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false ===
Line 228: Line 229:
  
 Like ''getPositionOfFirstOccurrence'' but then from the end of the text.
+
+Alternative suggested names: ''findOffsetLast''
  
  
Line 237: Line 240:
 Like: ''grapheme_strstr($this, $search)''
 (https://www.php.net/manual/en/function.grapheme-strstr.php)
+
+Alternative suggested names: ''startingWith''
  
  
Line 248: Line 253:
  
 Like ''str_contains''.
+
+Alternative suggested names: ''startingWithLast''
  
  
Line 289: Line 296:
 These operations all use the collation that is configured on the Text object.
 
-=== toLower ===
+=== toLower : \Text ===
 
 Converts the text to lower case, using the lower case variant of each
 Unicode code point that makes up the text.
 
-=== toUpper ===
+Example: ''Het IJsselmeer is vol met ideëen'' to ''het ijsselmeer is vol met ideëen''
+
+
+=== toUpper : \Text ===
 
 The same, but then to upper case.
 
-=== toTitle ===
+Example: ''Het IJsselmeer is vol met ideëen'' to ''HET IJSSELMEER IS VOL MET IDEËEN''
+
+
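The ''toLower'' and ''toUpper'' examples above can be reproduced with ext/mbstring, as in the sketch below. This assumes the source file is saved as UTF-8. Note that mbstring applies locale-independent case mappings, so it is only an approximation of the locale-aware behaviour proposed for the ''Text'' class.

```php
<?php
// Sketch only: reproduces the two examples with ext/mbstring.
$text = 'Het IJsselmeer is vol met ideëen';

$lower = mb_strtolower($text, 'UTF-8');
$upper = mb_strtoupper($text, 'UTF-8');

echo $lower, "\n"; // het ijsselmeer is vol met ideëen
echo $upper, "\n"; // HET IJSSELMEER IS VOL MET IDEËEN
```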
+=== toTitle : \Text ===
 
 The same, but then to title case (the first letter of each word).
 
-=== firstToLower ===
+Example: ''Het IJsselmeer is vol met ideëen'' to ''Het IJsselmeer is Vol met Ideëen''
+
+
+=== firstToLower : \Text ===
 
 Converts the first grapheme in the text to a lower case variant.
 
-=== firstToUpper ===
+Example: ''Het IJsselmeer is vol met ideëen'' to ''het IJsselmeer is vol met ideëen''
+
+
+=== firstToUpper : \Text ===
 
 The same, but then to upper case.
 
-=== firstToTitle ===
+Example: ''Het IJsselmeer is vol met ideëen'' to ''Het IJsselmeer is vol met ideëen''.
 
-The same, but then to title case (the first letter of each word).
 
 
-=== wordsToLower ===
+=== wordsToLower : \Text ===
 
 Converts the first grapheme in every word to a lower case variant.
 
-=== wordsToUpper ===
+Example: ''Het IJsselmeer is vol met ideëen'' to ''het ijsselmeer is vol met ideëen''.
 
-The same, but then to upper case.
 
-=== wordsToTitle ===
+=== wordsToUpper : \Text ===
 
-The same, but then to title case (the first letter of each word).
+The same, but then to upper case.
+
+Example: ''Het IJsselmeer is vol met ideëen'' to ''Het IJsselmeer Is Vol Met Ideëen''.
 
  
Line 331: Line 350:
  
  
-=== getByteCount() ===
+=== getByteCount() : int ===
 
 Returns the size in bytes that the text will take when converted to UTF-8.
 
 
-=== length(), getCharacterCount() ===
+=== length(), getCharacterCount() : int ===
 
 Returns the number of characters that make up the text. A character (also
Line 344: Line 363:
  
  
-=== getCodePointCount() ===
+=== getCodePointCount() : int ===
 
 Returns the number of Unicode code points that make up the text.
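The byte, character and code point counts can differ for the same text. The sketch below uses existing functions (''strlen'', ''grapheme_strlen'' from ext/intl, ''mb_strlen'' from ext/mbstring) as stand-ins for the proposed methods. The sample string is an assumption chosen because it has no precomposed form, so NFC normalisation leaves it as two code points.

```php
<?php
// "q" followed by U+0303 (combining tilde): one visible character,
// two code points, three bytes in UTF-8.
$text = "q\u{0303}";

$byteCount      = strlen($text);             // 3
$characterCount = grapheme_strlen($text);    // 1
$codePointCount = mb_strlen($text, 'UTF-8'); // 2

printf("%d bytes, %d character(s), %d code points\n",
    $byteCount, $characterCount, $codePointCount);
```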
Line 350: Line 369:
  
  
-=== getWordCount() ===
+=== getWordCount() : int ===
 
 Pretty much a shortcut for::
Line 370: Line 389:
 (https://www.php.net/manual/en/class.intlbreakiterator.php).
 
-=== getCharacterIterator ===
+=== getCharacterIterator : \Iterator ===
 
 Returns an Iterator that locates boundaries between logical characters.
Line 381: Line 400:
 of text that the computer sees as "characters".
 
-=== getWordIterator ===
+=== getWordIterator : \Iterator ===
 
 Returns an Iterator that locates boundaries between words. This is useful
Line 389: Line 408:
 are kept separate from real words.
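Word boundaries of this kind are already available through ''IntlBreakIterator'' in ext/intl. The sketch below counts only the segments that ICU classifies as words, skipping spaces and punctuation. The function name ''countWordsSketch'' and the default locale are hypothetical, and this is not the snippet the RFC itself gives for ''getWordCount()''.

```php
<?php
// Hypothetical helper: count the "real" words in a text with ext/intl.
function countWordsSketch(string $text, string $locale = 'en'): int
{
    $iterator = IntlBreakIterator::createWordInstance($locale);
    $iterator->setText($text);

    $count = 0;
    $parts = $iterator->getPartsIterator();
    foreach ($parts as $part) {
        // The rule status describes the segment that was just returned.
        // WORD_NONE marks spaces and punctuation, which are not counted.
        $status = $parts->getBreakIterator()->getRuleStatus();
        if ($status !== IntlBreakIterator::WORD_NONE) {
            $count++;
        }
    }

    return $count;
}

echo countWordsSketch('Hello, world!'), "\n";
```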
  
-=== getLineIterator ===
+=== getLineIterator : \Iterator ===
 
 Returns an Iterator that locates positions where it is legal for a text
Line 398: Line 417:
 from being a line-break position.
 
-=== getSentenceIterator ===
+=== getSentenceIterator : \Iterator ===
 
 Returns an Iterator that locates boundaries between sentences.
 
 
-=== getTitleIterator ===
+=== getTitleIterator : \Iterator ===
 
 Returns an Iterator that locates boundaries between title breaks.
Line 413: Line 432:
  
  
-=== transliterate(string $transliterationString) ===
+=== transliterate(string $transliterationString) : \Text ===
 
 Transliterates the content of the ''Text'' object according to the rules as
Line 457: Line 476:
 ===== Open Issues =====
 
+  - Add a method like mb_strcut, to extract a string of a maximum number of bytes from a position, as encoded through UTF-8.
+  - Tidy up language related to locale/collator. As Tim Starling says: "If the input is an ICU locale string, then I think you should just call it locale. Then the user will be armed with the correct terminology when they go looking for more information in the ICU manual. In ICU, case conversion and BreakIterator need a locale, not a collator."
 
 ===== Questions and Answers =====
Line 498: Line 519:
  
 Nothing rejected yet.
 +
 +
 +===== Changes =====
 +
 +0.9.1 — 2022-12-16
 +
 +  * Tim Düsterhus: Removed firstToTitle/wordsToTitle; added examples for toUpper and friends; added return types everywhere; added suggested other names for getPosition... methods; marked class as final.
 +  * Paul Crovella: Clarify which normalisation is being used.
 +  * Daniel Wolfe: Update trimLeft/trimRight to trimStart/trimEnd.
rfc/unicode_text_processing.txt · Last modified: 2022/12/21 11:48 by derick