Differences

This shows you the differences between two versions of the page.

--- rfc:unicode_text_processing [2022/12/15 15:31] – Set date for initial announcement derick
+++ rfc:unicode_text_processing [2022/12/16 13:54] – derick
@@ Line 1: / Line 1: @@
 ====== PHP RFC: Unicode Text Processing ======
   * Version: 0.9
-  * Date: 2022-12-15
+  * Date: 2022-12-16 (Original date: 2022-12-15)
   * Author: Derick Rethans <derick@php.net>
-  * Status: Under Discussion
+  * Status: Draft
   * First Published at: http://wiki.php.net/rfc/unicode_text_processing
@@ Line 26: / Line 26: @@
 ===== Proposal =====
-To introduce a new "Text" class, with methods to operate on the text
+To introduce a new final "Text" class, with methods to operate on the
-stored in the objects.
+text stored in the objects.
 Methods on the class will all return a new (immutable) object.
@@ Line 111: / Line 111: @@
 This section lists all the methods that construct a Text object.
-=== __construct(string $text, string $locale = 'root/standard') ===
+=== __construct(string $text, string $locale = 'root/standard') : \Text ===
 The constructor takes a UTF-8 encoded text, and stores this in an internal
 structure. The constructor will also convert the given text to Unicode
-Canonical Form. Passing in non-well-formed UTF-8 will result in an
+Canonical Form (also called Normalisation Form C, or NFC). Passing in
-''InvalidEncodingException''. The constructor will also strip out a BOM
+non-well-formed UTF-8 will result in an ''InvalidEncodingException''.
-(Byte-Order-Mark) character, if present.
+The constructor will also strip out a BOM (Byte-Order-Mark) character,
+if present.
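
The constructor behaviour described above (validation, BOM stripping, NFC normalisation) can be approximated today with the existing ''intl'' and ''mbstring'' extensions. The following is only a sketch under those assumptions: the helper name ''normaliseInput'' is hypothetical, and a built-in exception stands in for the proposed ''InvalidEncodingException''. It is not the RFC's implementation.

```php
<?php
// Sketch only: mimics the steps the proposed Text constructor is described as taking.
// Requires the intl (Normalizer) and mbstring (mb_check_encoding) extensions.
// normaliseInput() is a hypothetical helper, not part of the RFC.
function normaliseInput(string $text): string
{
    // Reject input that is not well-formed UTF-8. The RFC names an
    // InvalidEncodingException; InvalidArgumentException stands in for it here.
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }

    // Strip a leading UTF-8 BOM (bytes EF BB BF), if present.
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        $text = substr($text, 3);
    }

    // Convert to Normalisation Form C (NFC).
    return Normalizer::normalize($text, Normalizer::FORM_C);
}

// "e" followed by COMBINING ACUTE ACCENT, with a BOM in front.
$decomposed = "\xEF\xBB\xBF" . "e\u{0301}";
$composed   = normaliseInput($decomposed);

// After NFC the two code points are composed into the single code point U+00E9.
var_dump($composed === "\u{00E9}"); // bool(true)
```

Normalising once at construction time means later comparisons and searches do not have to account for composed and decomposed spellings of the same text.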
-=== static Text::create(string $text, string $locale = 'root/standard') ===
+=== static Text::create(string $text, string $locale = 'root/standard') : \Text ===
 The Symfony String package offers a static function to construct a String
@@ Line 129: / Line 130: @@
 For example with ''use \Text::create as t''.
-=== static Text::join(array(string|Text) $elements, string|Text $separator, string $collator = NULL) ===
+=== static Text::join(array(string|Text) $elements, string|Text $separator, string $collator = NULL) : \Text ===
 Creates a new Text object by concatenating the Text element in
@@ Line 147: / Line 148: @@
-=== split(string|Text $separator, int $limit = PHP_INT_MAX): array(Text) ===
+=== split(string|Text $separator, int $limit = PHP_INT_MAX) : array(Text) ===
 Returns an array of Text objects, each of which is a substring of ''$this'',
@@ Line 162: / Line 163: @@
 https://www.php.net/manual/en/function.grapheme-substr.php
-=== trimLeft, trimRight, trim ===
+=== trimStart, trimEnd, trim : \Text ===
 Removes white space at the start of, the end of, or both sides of the text.
@@ Line 181: / Line 182: @@
-=== replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) ===
+=== replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) : \Text ===
-Replaces the first ''$maxReplacements'' occurrences of ''$search'' with
+Replaces occurrences of ''$search'' with ''$replace''.
-''$replace''.
 The locale of ''$search'' is used to find sub-strings that
@@ Line 198: / Line 198: @@
 In order to find sub-strings case-insensitively, you can use the ''$collator''
-argument to the constructor of the ''$search'' argument.
+argument to ''Text::__construct'' of the ''$search'' argument.
-=== reverse() ===
+=== reverse() : \Text ===
 Reverses a text, taking into account grapheme boundaries.
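
To illustrate why grapheme boundaries matter when reversing, here is a minimal sketch that uses PCRE's ''\X'' (extended grapheme cluster) escape. The function name ''reverseGraphemes'' is hypothetical and this is not how the RFC says ''reverse()'' will be implemented; it only shows the idea.

```php
<?php
// Sketch only: reverse a UTF-8 string one grapheme cluster at a time.
// reverseGraphemes() is a hypothetical helper, not part of the RFC.
function reverseGraphemes(string $text): string
{
    // With the /u modifier, \X matches one extended grapheme cluster,
    // so a base letter stays together with its combining marks.
    preg_match_all('/\X/u', $text, $matches);

    return implode('', array_reverse($matches[0]));
}

// "q" + COMBINING TILDE has no precomposed form, so it stays two code points.
$text = "aq\u{0303}b";

// The tilde stays attached to the "q" after reversing.
var_dump(reverseGraphemes($text) === "bq\u{0303}a"); // bool(true)
```

A reversal by code point would instead move the combining tilde in front of the "q", which attaches it to the wrong letter.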
@@ Line 222: / Line 222: @@
 https://www.php.net/manual/en/function.grapheme-strpos.php
-*I think this method name is too long*
+Alternative suggested names: ''findOffset''
 === getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false ===
@@ Line 228: / Line 229: @@
 Like ''getPositionOfFirstOccurrence'' but then from the end of the text.
+Alternative suggested names: ''findOffsetLast''
@@ Line 237: / Line 240: @@
 Like: ''grapheme_strstr($this, $search)''
 (https://www.php.net/manual/en/function.grapheme-strstr.php)
+Alternative suggested names: ''startingWith''
@@ Line 248: / Line 253: @@
 Like ''str_contains''.
+Alternative suggested names: ''startingWithLast''
@@ Line 289: / Line 296: @@
 These operations all use the collation that is configured on the Text object.
-=== toLower ===
+=== toLower : \Text ===
 Converts the text to lower case, using the lower case variant of each
 Unicode code point that makes up the text.
-=== toUpper ===
+Example: ''Het Ĳsselmeer is vol met ideeën'' to ''het ĳsselmeer is vol met ideeën''.
+=== toUpper : \Text ===
 The same, but then to upper case.
-=== toTitle ===
+Example: ''Het Ĳsselmeer is vol met ideeën'' to ''HET ĲSSELMEER IS VOL MET IDEEËN''.
+=== toTitle : \Text ===
 The same, but then to title case (the first letter of each word).
-=== firstToLower ===
+Example: ''Het Ĳsselmeer is vol met ideeën'' to ''Het Ĳsselmeer Is Vol Met Ideeën''.
+=== firstToLower : \Text ===
 Converts the first grapheme in the text to a lower case variant.
-=== firstToUpper ===
+Example: ''Het Ĳsselmeer is vol met ideeën'' to ''het Ĳsselmeer is vol met ideeën''.
+=== firstToUpper : \Text ===
 The same, but then to upper case.
-=== firstToTitle ===
+Example: ''Het Ĳsselmeer is vol met ideeën'' to ''Het Ĳsselmeer is vol met ideeën''.
-The same, but then to title case (the first letter of each word).
-=== wordsToLower ===
+=== wordsToLower : \Text ===
 Converts the first grapheme in every word to a lower case variant.
-=== wordsToUpper ===
+Example: ''Het Ĳsselmeer is vol met ideeën'' to ''het ĳsselmeer is vol met ideeën''.
-The same, but then to upper case.
-=== wordsToTitle ===
+=== wordsToUpper : \Text ===
-The same, but then to title case (the first letter of each word).
+The same, but then to upper case.
+Example: ''Het Ĳsselmeer is vol met ideeën'' to ''Het Ĳsselmeer Is Vol Met Ideeën''.
@@ Line 331: / Line 350: @@
-=== getByteCount() ===
+=== getByteCount() : int ===
 Returns the size in bytes that the text will take when converted to UTF-8.
-=== length(), getCharacterCount() ===
+=== length(), getCharacterCount() : int ===
 Returns the number of characters that make up the text. A character (also
@@ Line 344: / Line 363: @@
-=== getCodePointCount() ===
+=== getCodePointCount() : int ===
 Returns the number of Unicode code points that make up the text.
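
The byte, character, and code point counts described above can all differ for the same text. The sketch below uses existing PHP functions (''strlen'', ''mb_strlen'' from ''mbstring'', ''grapheme_strlen'' from ''intl'') as stand-ins for the proposed methods; mapping them onto ''getByteCount()'', ''getCodePointCount()'' and ''getCharacterCount()'' is an assumption made for illustration.

```php
<?php
// Sketch only: the three kinds of "length" for one UTF-8 string, using
// existing functions as stand-ins for the proposed Text methods.
// Requires the mbstring and intl extensions.

// "q" + COMBINING TILDE: there is no precomposed form, so NFC keeps two code points.
$text = "q\u{0303}";

$bytes      = strlen($text);             // bytes in the UTF-8 encoding
$codePoints = mb_strlen($text, 'UTF-8'); // Unicode code points
$characters = grapheme_strlen($text);    // grapheme clusters ("characters")

var_dump($bytes);      // int(3): 1 byte for "q", 2 bytes for U+0303
var_dump($codePoints); // int(2)
var_dump($characters); // int(1)
```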
@@ Line 350: / Line 369: @@
-=== getWordCount() ===
+=== getWordCount() : int ===
 Pretty much a shortcut for:
@@ Line 370: / Line 389: @@
 (https://www.php.net/manual/en/class.intlbreakiterator.php).
-=== getCharacterIterator ===
+=== getCharacterIterator : \Iterator ===
 Returns an Iterator that locates boundaries between logical characters.
@@ Line 381: / Line 400: @@
 of text that the computer sees as "characters".
-=== getWordIterator ===
+=== getWordIterator : \Iterator ===
 Returns an Iterator that locates boundaries between words. This is useful
@@ Line 389: / Line 408: @@
 are kept separate from real words.
-=== getLineIterator ===
+=== getLineIterator : \Iterator ===
 Returns an Iterator that locates positions where it is legal for a text
@@ Line 398: / Line 417: @@
 from being a line-break position.
-=== getSentenceIterator ===
+=== getSentenceIterator : \Iterator ===
 Returns an Iterator that locates boundaries between sentences.
-=== getTitleIterator ===
+=== getTitleIterator : \Iterator ===
 Returns an Iterator that locates boundaries between title breaks.
@@ Line 413: / Line 432: @@
-=== transliterate(string $transliterationString) ===
+=== transliterate(string $transliterationString) : \Text ===
 Transliterates the content of the ''Text'' object according to the rules as
@@ Line 457: / Line 476: @@
 ===== Open Issues =====
+  - Add a method like mb_strcut, to extract a string of a maximum number of bytes from a position, as encoded in UTF-8.
+  - Tidy up language related to locale/collator. As Tim Starling says: "If the input is an ICU locale string, then I think you should just call it locale. Then the user will be armed with the correct terminology when they go looking for more information in the ICU manual. In ICU, case conversion and BreakIterator need a locale, not a collator."
 ===== Questions and Answers =====
@@ Line 498: / Line 519: @@
 Nothing rejected yet.
+===== Changes =====
+0.9.1 - 2022-12-16
+  * Tim Düsterhus: Removed firstToTitle/wordsToTitle; added examples for toUpper and friends; added return types everywhere; added suggested other names for getPosition... methods; marked class as final.
+  * Paul Crovella: Clarify which normalisation is being used.
+  * Daniel Wolfe: Update trimLeft/trimRight to trimStart/trimEnd.
