rfc:unicode_text_processing

Differences

This shows you the differences between two versions of the page.

rfc:unicode_text_processing [2022/11/21 15:14] derick   rfc:unicode_text_processing [2022/12/15 15:29] derick
Line 3: Line 3:
   * Date: 2022-11-09   * Date: 2022-11-09
   * Author: Derick Rethans <derick@php.net>   * Author: Derick Rethans <derick@php.net>
-  * Status: Draft+  * Status: Under Discussion
   * First Published at: http://wiki.php.net/rfc/unicode_text_processing   * First Published at: http://wiki.php.net/rfc/unicode_text_processing
  
Line 14: Line 14:
 create an API that developers can use to do Unicode text processing create an API that developers can use to do Unicode text processing
 correctly, without having to know all the intricacies. correctly, without having to know all the intricacies.
 +
 +Although PHP has decent maths features, it is sorely missing performant
 +Unicode text processing that is always available in the core.
  
 ==== Definitions ==== ==== Definitions ====
Line 27: Line 30:
  
 Methods on the class will all return a new (immutable) object. Methods on the class will all return a new (immutable) object.
 +
 +The proposal is to make the ''Text'' class part of the PHP core. This would
 +mean that it is always available to users. As the implementation
 +requires ICU, this would also mean that PHP will depend on the ICU library.
 +
 ==== Basics ==== ==== Basics ====
  
Line 32: Line 40:
 constructor. constructor.
  
-The ''toString()'' method collapses the internally stored text into a+The ''__toString()'' method collapses the internally stored text into a
 UTF-8 encoded string, which can be used by all existing PHP functions UTF-8 encoded string, which can be used by all existing PHP functions
 that accept strings. that accept strings.
  
-The internal representation would be UTF-16, as that's what ICU uses.+The internal representation of the text is UTF-16, as that's what ICU uses.
 Unlike the PHP 6 approach, the conversion to/from the internal Unlike the PHP 6 approach, the conversion to/from the internal
 representation only happens on the boundaries: UTF-8 to UTF-16 through representation only happens on the boundaries: UTF-8 to UTF-16 through
-the constructor, and the reverse through the ''toString()'' method.+the constructor, and the reverse through the ''__toString()'' method.
  
 There are multiple groups of methods indicated below. Some are to There are multiple groups of methods indicated below. Some are to
Line 51: Line 59:
   * prefer a method per function, instead of allowing the behaviour of a method to be changed through (optional) arguments.   * prefer a method per function, instead of allowing the behaviour of a method to be changed through (optional) arguments.
   * operations are on **graphemes**   * operations are on **graphemes**
-  * no redundent methods that can be constructed from other methods, unless they already exist in PHP, or are frequently used+  * no redundant methods that can be constructed from other methods, unless they already exist in PHP, or are frequently used
   * more as we discuss this...   * more as we discuss this...
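
To illustrate why the design goals specify graphemes, the following sketch measures the same text in three ways, using functions that exist in PHP today (''strlen'', ''mb_strlen'' from mbstring, and ''grapheme_strlen'' from Intl). It is not part of the proposed API, and the sample string is only an illustration.

```php
<?php
// "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
$text = "e\u{0301}";

$bytes      = strlen($text);              // UTF-8 bytes
$codePoints = mb_strlen($text, 'UTF-8');  // Unicode code points
$graphemes  = grapheme_strlen($text);     // user-perceived characters

printf("%d bytes, %d code points, %d grapheme(s)\n", $bytes, $codePoints, $graphemes);
// prints "3 bytes, 2 code points, 1 grapheme(s)"
```

Only the grapheme count matches what a reader would call the length of this text.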
  
Line 80: Line 88:
 extensive documentation. extensive documentation.
  
-Numerical order collation (such as PHP's ''natsort()'') can be achived+Numerical order collation (such as PHP's ''natsort()'') can be achieved
 by adding the ''kn'' flag to the locale name, such as in ''de-u-kn'' by adding the ''kn'' flag to the locale name, such as in ''de-u-kn''
 (case-sensitive German, with numerics in value order). (case-sensitive German, with numerics in value order).
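
As a rough illustration of what numeric collation does, the sketch below uses the existing Intl ''Collator'' class and its ''NUMERIC_COLLATION'' attribute rather than the proposed ''Text'' API or the ''kn'' locale flag. The ''de'' locale and the sample file names are assumptions chosen for the example.

```php
<?php
$files = ['file10.txt', 'file9.txt', 'file100.txt'];

// Default collation compares digit by digit, so "file10" sorts before "file9".
$plain = new Collator('de');
$alphabetical = $files;
$plain->sort($alphabetical);

// With numeric collation enabled, digit sequences are compared by value.
$numeric = new Collator('de');
$numeric->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);
$byValue = $files;
$numeric->sort($byValue);

print_r($byValue); // file9.txt, file10.txt, file100.txt
```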
Line 88: Line 96:
 and defaults at http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings and defaults at http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings
  
-Specifying the locale and collator will also be possible by passing in +Building a locale/collation string will also be possible by using a 
-''Intl\\Collator'' object +''TextCollator'' object, to allow for better and easier-to-read customization 
-(https://www.php.net/manual/en/class.collator.php) to allow for more +of collations. The class performs the same function as ''\Intl\Collator'' 
-descritive construction of locale with all its options.+(https://www.php.net/manual/en/class.collator.php), except that it has 
 +descriptive methods to set collation properties. The reason for a separate 
 +class is so that you don't have to depend on the ''Intl'' extension, and to 
 +make it more developer-friendly. It converts the configured options to a 
 +string, which can then be used in any location where ''string $collator'' is 
 +used in the function signatures to the methods on the ''Text'' class.
  
  
Line 98: Line 111:
 This section lists all the methods that construct a Text object. This section lists all the methods that construct a Text object.
  
-=== __construct(string $text, string $locale = 'root/standard'), __construct(string $text, \\Intl\\Collator $collator = new \\Intl\\Collator('root/standard')) ===+=== __construct(string $text, string $locale = 'root/standard') ===
  
 The constructor takes a UTF-8 encoded text, and stores this in an internal The constructor takes a UTF-8 encoded text, and stores this in an internal
Line 106: Line 119:
 (Byte-Order-Mark) character, if present. (Byte-Order-Mark) character, if present.
  
 +=== static Text::create(string $text, string $locale = 'root/standard') ===
  
-=== static Text::join(array(string|Text) $elements, string|Text $separator) ===+The Symfony String package offers a static function to construct a String 
 +through a single-character function (''u''), which you can import into the 
 +file scope (with ''use'').
  
-Creates a new Text object by concatenating the each Text element in+This method serves a similar purpose: after importing it into the file's scope, 
 +for example with ''use \Text::create as t'', you can shorten ''new Text(…)'' 
 +to ''t''.
 +=== static Text::join(array(string|Text) $elements, string|Text $separator, string $collator = NULL) === 
 + 
 +Creates a new Text object by concatenating each Text element in
 ''$elements'', inserting ''$separator'' in between each element. ''$elements'', inserting ''$separator'' in between each element.
  
-Semantics like: ''implode(string $separator, array(string) $array)''+The semantics are like: ''implode(string $separator, array(string) $array)'' 
 + 
 +If the ''$collator'' is not specified, it uses the collation of the first 
 +element in the ''$elements'' array. This collation is then also set on the 
 +created object. 
 + 
 +If the ''$elements'' array is empty, an empty ''Text'' object with the 
 +''root'' locale is created.
  
  
Line 142: Line 171:
 === wrap(int $maxWidth, bool $cutLongWords = false) : array(Text) === === wrap(int $maxWidth, bool $cutLongWords = false) : array(Text) ===
  
-Wraps a text to a given number of graphemes into an array of Text objects.+Wraps a text to a given number of graphemes per line, into an array of Text 
 +objects.
  
 Like: ''wordwrap'', but based on graphemes and returning an array instead of Like: ''wordwrap'', but based on graphemes and returning an array instead of
Line 155: Line 185:
 Replaces the first ''$maxReplacements'' occurrences of ''$search'' with Replaces the first ''$maxReplacements'' occurrences of ''$search'' with
 ''$replace''. ''$replace''.
 +
 +If ''$search'' is a ''Text'' object, its locale is used to find matching
 +sub-strings; otherwise, the locale embedded in the object that the method is
 +called on is used.
  
 The ''$replaceFrom'' and ''$replaceTo'' arguments control which found The ''$replaceFrom'' and ''$replaceTo'' arguments control which found
-items are being replace. The ''$replaceFrom'' argument is the first+items are being replaced. The ''$replaceFrom'' argument is the first
 argument that is being replaced (0-indexed), and ''$replaceTo'' is the argument that is being replaced (0-indexed), and ''$replaceTo'' is the
-last item. Positive numbers are counted from the first occurence of+last item. Positive numbers are counted from the first occurrence of
 ''$search'' in the Text, and negative numbers from the last found ''$search'' in the Text, and negative numbers from the last found
 occurrence. occurrence.
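
The indexing rules for ''$replaceFrom'' and ''$replaceTo'' can be sketched in userland PHP. The helper name ''replace_occurrences'' is hypothetical and not part of the RFC, and this sketch matches on raw bytes without any locale or collation, so it only illustrates how positive and negative occurrence indexes select a range.

```php
<?php
/**
 * Hypothetical helper: replace the occurrences of $search numbered
 * $from..$to (0-indexed, negative values count from the last occurrence).
 * Matching is done on raw bytes, with no locale or collation involved.
 */
function replace_occurrences(string $text, string $search, string $replace, int $from = 0, int $to = -1): string
{
    if ($search === '') {
        return $text;
    }

    // Collect the byte offset of every non-overlapping occurrence.
    $offsets = [];
    $pos = 0;
    while (($pos = strpos($text, $search, $pos)) !== false) {
        $offsets[] = $pos;
        $pos += strlen($search);
    }

    // Normalise negative indexes: -1 is the last occurrence, -2 the one before.
    $count = count($offsets);
    $from = $from < 0 ? $count + $from : $from;
    $to   = $to < 0 ? $count + $to : $to;
    $from = max($from, 0);
    $to   = min($to, $count - 1);

    // Replace from the back so that earlier offsets stay valid.
    for ($i = $to; $i >= $from; $i--) {
        $text = substr_replace($text, $replace, $offsets[$i], strlen($search));
    }

    return $text;
}

echo replace_occurrences('a-b-c-d', '-', '+', 1, -1), "\n"; // prints "a-b+c+d"
echo replace_occurrences('a-b-c-d', '-', '+', 0, 0), "\n";  // prints "a+b-c-d"
```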
  
- +In order to find sub-strings case-insensitively, you can use the ''$collator'' 
-=== replaceTextCaseInsensitively(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) === +argument to the constructor of the ''$search'' argument.
- +
-Replaces every occurrence of ''$search'' with ''$replace'' using the locale of +
-the object that the method is called on. The locale of ''$search'' and +
-''$replace'' is ignored. +
- +
-''$replaceFrom'' and ''$replaceTo'' behave as with ''replaceText''+
  
 === reverse() === === reverse() ===
Line 182: Line 209:
 Methods to find text in other text. Methods to find text in other text.
  
-=== getPositionOfFirstOccurrence(string|Text $textToFind, int $offset) : int|false ===+In all these methods, if ''$search'' is a ''Text'' object, its locale is used 
 +to find matching sub-strings; otherwise, the locale embedded in the object 
 +that the method is called on is used.
 + 
 + 
 +=== getPositionOfFirstOccurrence(string|Text $search, int $offset) : int|false ===
  
 Returns the position (in grapheme units) of the first occurrence of Returns the position (in grapheme units) of the first occurrence of
-''$textToFind'' starting at the (grapheme) ''$offset'', or false if not found.+''$search'' starting at the (grapheme) ''$offset'', or false if not found.
  
-Like: ''grapheme_strpos($this, $textToFind, $offset)''+Like: ''grapheme_strpos($this, $search, $offset)''
 https://www.php.net/manual/en/function.grapheme-strpos.php https://www.php.net/manual/en/function.grapheme-strpos.php
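
To show what "position in grapheme units" means in practice, this sketch compares the existing ''strpos'', ''mb_strpos'' and ''grapheme_strpos'' functions on one sample string. The string is an assumption made for the example, and the proposed method itself is not used.

```php
<?php
// "café au lait", with the é written as 'e' + U+0301 COMBINING ACUTE ACCENT.
$text = "cafe\u{0301} au lait";

$bytePos      = strpos($text, 'au');                // offset in bytes
$codePointPos = mb_strpos($text, 'au', 0, 'UTF-8'); // offset in code points
$graphemePos  = grapheme_strpos($text, 'au');       // offset in graphemes

var_dump($bytePos, $codePointPos, $graphemePos); // int(7), int(6), int(5)
```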
  
 *I think this method name is too long* *I think this method name is too long*
  
-=== getPositionOfLastOccurrence(string|Text $textToFind, int $offset) : int|false ===+=== getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false ===
  
  
Line 198: Line 230:
  
  
-=== returnFromFirstOccurence(string|Text $textToFind) : Text|false ===+=== returnFromFirstOccurence(string|Text $search) : Text|false ===
  
-Returns the ''Text'' starting with the ''$textToFind'' if found, and+Returns the ''Text'' starting with the ''$search'' if found, and
 otherwise ''false''. otherwise ''false''.
  
-Like: ''grapheme_strstr($this, $textToFind)''+Like: ''grapheme_strstr($this, $search)''
 (https://www.php.net/manual/en/function.grapheme-strstr.php) (https://www.php.net/manual/en/function.grapheme-strstr.php)
  
  
-=== returnFromLastOccurence(string|Text $textToFind) : Text|false ===+=== returnFromLastOccurence(string|Text $search) : Text|false ===
  
 Like ''returnFromFirstOccurence'' but then from the end of the text. Like ''returnFromFirstOccurence'' but then from the end of the text.
  
-=== contains(string|Text $string) ===+=== contains(string|Text $search) ===
  
-Returns true if the text ''$string'' can be found in the text.+Returns true if the text ''$search'' can be found in the text.
  
 Like ''str_contains''. Like ''str_contains''.
  
  
-=== endsWith(string|Text $string) : bool ===+=== endsWith(string|Text $search) : bool ===
  
-Could be constructed from ''getPositionOfFirstOccurrence()'' and+Compares the last ''$search.Length()'' graphemes of ''$this''.
 + 
 +Case-insensitive comparison can be achieved by setting the right 
 +''$collator'' on ''$search''.
 + 
 +Could be constructed from ''getPositionOfLastOccurrence()'' and
 ''length()'', but it's an often required method, and standard PHP has it ''length()'', but it's an often required method, and standard PHP has it
 too. too.
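
A minimal sketch of that construction, using the existing ''grapheme_strrpos'' and ''grapheme_strlen'' functions in place of the proposed methods. The function name ''ends_with_graphemes'' is hypothetical, and no collator is applied here.

```php
<?php
// Hypothetical userland equivalent, built from existing Intl functions.
function ends_with_graphemes(string $text, string $search): bool
{
    // Treat an empty search text as always matching, like str_ends_with().
    if ($search === '') {
        return true;
    }

    // Grapheme position of the last occurrence, or false if there is none.
    $pos = grapheme_strrpos($text, $search);
    if ($pos === false) {
        return false;
    }

    // The match is a suffix when it ends exactly at the end of the text.
    return $pos + grapheme_strlen($search) === grapheme_strlen($text);
}

var_dump(ends_with_graphemes("menu du cafe\u{0301}", "cafe\u{0301}")); // bool(true)
var_dump(ends_with_graphemes("menu du cafe\u{0301}", "menu"));         // bool(false)
```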
  
  
-=== startsWith(string|Text $string) : bool ===+=== startsWith(string|Text $search) : bool ===
  
-Compares the first ''$string.Length()'' graphemes of ''$this'' using the +Compares the first ''$search.Length()'' graphemes of ''$this''.
-locale and collator that are configured with ''$this''.+
  
 Case-insensitive comparison can be achieved by setting the right Case-insensitive comparison can be achieved by setting the right
-''$locale'' and ''$collator'' on ''$this''.+''$collator'' on ''$search''.
  
 Could be constructed from ''getPositionOfFirstOccurrence()'', Could be constructed from ''getPositionOfFirstOccurrence()'',
Line 240: Line 276:
 ==== Comparing Text Objects ==== ==== Comparing Text Objects ====
  
-=== compareWith(Text $other) : int ===+=== compareWith(Text $other, string $collator = NULL) : int ===
  
-Uses the configured ''$locale'' of ''$this'' to compare it against +Uses the configured ''$collator'' of ''$this'' to compare it against 
-''$other''. The locale of ''$other'' is ignored.+''$other'', unless the ''$collator'' argument is specified as an override.
  
 This same method is also used for comparing two Text objects as "compare This same method is also used for comparing two Text objects as "compare
-handler".+handler". Here only the locale on ''$this'' is taken into account.
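
How a collator changes the outcome of a comparison can be seen with the existing Intl ''Collator'' class. This is only an illustration of the underlying ICU behaviour; the ''en'' locale and the choice of strength are assumptions, and the proposed ''compareWith()'' method is not used.

```php
<?php
// Tertiary strength (the default) distinguishes case; secondary ignores it.
$caseSensitive = new Collator('en');

$caseInsensitive = new Collator('en');
$caseInsensitive->setStrength(Collator::SECONDARY);

var_dump($caseSensitive->compare('Text', 'text') !== 0);    // bool(true)
var_dump($caseInsensitive->compare('Text', 'text') === 0);  // bool(true)
var_dump($caseInsensitive->compare('apple', 'banana') < 0); // bool(true)
```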
  
  
 ==== Case Conversions ==== ==== Case Conversions ====
  
 +These operations all use the collation that is configured on the Text object.
  
 === toLower === === toLower ===
Line 256: Line 293:
 Converts the text to lower case, using the lower case variant of each Converts the text to lower case, using the lower case variant of each
 Unicode code point that makes up the text. Unicode code point that makes up the text.
- 
  
 === toUpper === === toUpper ===
  
 +The same, but then to upper case.
  
 === toTitle === === toTitle ===
  
 +The same, but then to title case (the first letter of each word).
  
 === firstToLower === === firstToLower ===
  
 Converts the first grapheme in the text to a lower case variant. Converts the first grapheme in the text to a lower case variant.
- 
  
 === firstToUpper === === firstToUpper ===
  
 +The same, but then to upper case.
  
 === firstToTitle === === firstToTitle ===
  
 +The same, but then to title case (the first letter of each word).
 +
 +
 +=== wordsToLower ===
 +
 +Converts the first grapheme in every word to a lower case variant.
 +
 +=== wordsToUpper ===
 +
 +The same, but then to upper case.
 +
 +=== wordsToTitle ===
 +
 +The same, but then to title case (the first letter of each word).
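
For comparison, the closest existing equivalents of the whole-text conversions are the mbstring functions shown below. They are locale-independent, whereas the methods in this section are described as using what is configured on the ''Text'' object, so this sketch only approximates the behaviour. The sample text is an assumption.

```php
<?php
$text = 'élan VITAL';

$lower = mb_strtolower($text, 'UTF-8');                  // every letter to lower case
$upper = mb_strtoupper($text, 'UTF-8');                  // every letter to upper case
$title = mb_convert_case($text, MB_CASE_TITLE, 'UTF-8'); // first letter of each word

echo $lower, "\n"; // prints "élan vital"
echo $upper, "\n"; // prints "ÉLAN VITAL"
echo $title, "\n"; // prints "Élan Vital"
```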
  
  
Line 301: Line 350:
  
  
-=== countWords() ===+=== getWordCount() ===
  
 Pretty much a shortcut for:: Pretty much a shortcut for::
Line 315: Line 364:
 These functions return an iterator that can be used to iterate over the text. These functions return an iterator that can be used to iterate over the text.
 The results of the iterators are affected by the text's locale. The results of the iterators are affected by the text's locale.
 +
 +These are inspired by ICU4J's BreakIterators 
 +(https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/BreakIterator.html) 
 +and Intl's create*Instance methods on ''Intl\BreakIterator'' 
 +(https://www.php.net/manual/en/class.intlbreakiterator.php).
  
 === getCharacterIterator === === getCharacterIterator ===
  
 +Returns an Iterator that locates boundaries between logical characters.
 +Because of the structure of the Unicode encoding, a logical character may be
 +stored internally as more than one Unicode code point. (A with an umlaut may
 +be stored as an 'a' followed by a separate combining umlaut character, for
 +example, but the user still thinks of it as one character.) This iterator
 +allows various processes (especially text editors) to treat as characters the
 +units of text that a user would think of as characters, rather than the units
 +of text that the computer sees as "characters".
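
The existing ''IntlBreakIterator'' class already exposes this kind of boundary detection, so the following sketch shows the idea with ''createCharacterInstance()''. The ''en'' locale and the sample text are assumptions, and the proposed ''getCharacterIterator'' method may behave differently in detail.

```php
<?php
// 'a' + U+0308 COMBINING DIAERESIS forms one user-perceived character ("ä").
$text = "a\u{0308}b";

$iterator = IntlBreakIterator::createCharacterInstance('en');
$iterator->setText($text);

// Collect the text between each pair of consecutive boundaries.
$characters = [];
foreach ($iterator->getPartsIterator() as $part) {
    $characters[] = $part;
}

var_dump(count($characters));       // int(2)
var_dump(mb_strlen($text, 'UTF-8')); // int(3)
```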
  
 +=== getWordIterator ===
  
-=== getLineIterator ===+Returns an Iterator that locates boundaries between words. This is useful 
 +for double-click selection or "find whole words" searches. This type of 
 +iterator makes sure there is a boundary position at the beginning and end 
 +of each legal word. (Numbers count as words, too.) Whitespace and punctuation 
 +are kept separate from real words. 
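
Word boundaries can be explored today with ''IntlBreakIterator::createWordInstance()''. In this sketch the ''en'' locale, the sample text, and the regular expression used to pick out "real" words are all assumptions made for illustration.

```php
<?php
$text = 'Hello, world 42!';

$iterator = IntlBreakIterator::createWordInstance('en');
$iterator->setText($text);

// Every segment between two boundaries, including whitespace and punctuation.
$parts = [];
foreach ($iterator->getPartsIterator() as $part) {
    $parts[] = $part;
}

// Keep only segments that contain at least one letter or digit.
$words = array_values(array_filter(
    $parts,
    fn (string $part): bool => preg_match('/[\p{L}\p{N}]/u', $part) === 1
));

print_r($words); // Hello, world, 42
```

Joining ''$parts'' back together yields the original text, because whitespace and punctuation are returned as separate segments rather than being dropped.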
  
 +=== getLineIterator ===
  
 +Returns an Iterator that locates positions where it is legal for a text
 +editor to wrap lines. This is similar to word breaking, but not the same:
 +punctuation and whitespace are generally kept with words (you don't want a
 +line to start with whitespace, for example), and some special characters can
 +force a position to be considered a line-break position or prevent a position
 +from being a line-break position. 
  
 === getSentenceIterator === === getSentenceIterator ===
  
 +Returns an Iterator that locates boundaries between sentences.
  
  
 === getTitleIterator === === getTitleIterator ===
  
- +Returns an Iterator that locates boundaries between title breaks. 
- +
-=== getWordIterator === +
  
  
Line 344: Line 415:
 === transliterate(string $transliterationString) === === transliterate(string $transliterationString) ===
  
 +Transliterates the content of the ''Text'' object according to the rules as
 +specified in the ''$transliterationString''.
  
 +There are a few constants for specific and often used cases, such as creating
 +an ASCII transliterated version of any Text:
  
-=== transliterate(\Intl\Transliterator $transliterator) ===+ - const Text::toAscii : A shortcut for a transliteration string that converts 
 +   any script to Latin, and also strips all the accents.
  
 + - const Text::toLatin : A shortcut for a transliteration string that converts
 +   any script to Latin, but does not remove the accents.
  
-With the first one being a "simple" one to use, and the second using Intl's+ - const Text::removeAccents : Removes the accents from Text. A shortcut for 
-Transliterator for more complex cases.+   the transliteration string ''"NFD; [:Nonspacing Mark:] Remove; NFC."''.
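
The accent-removal rule can be tried with the existing ''transliterator_transliterate()'' function from Intl. In this sketch the rule string is written without the trailing period shown above, which is an assumption about the intended rule, and the sample text is arbitrary.

```php
<?php
// Decompose, drop the combining marks, then recompose what is left.
$rules = 'NFD; [:Nonspacing Mark:] Remove; NFC';

$withoutAccents = transliterator_transliterate($rules, 'Crème Brûlée');

var_dump($withoutAccents); // string(12) "Creme Brulee"
```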
  
-Should we add shortcuts for a set of often used ones, such as ''Any-Latin''?+===== Implementation Details =====
-I think so, as it's the majority use case.
  
 +The functionality described in this RFC is mostly implemented using the ICU
 +library, which is also used by the Intl extension.
  
-=== toLatin === +In order for PHP to continue to work on as wide a range of platforms and 
- +distributions as possible, the minimum ICU version will be chosen according to 
-Converts any script to Latin. +the lowest ICU version shipped by the common Linux distributions that would 
- +include the version of PHP in which this functionality is implemented.
- +
-=== removeAccents === +
- +
-Removes the accents from a (latin script) text. +
- +
-A shortcut for the transliteration string ''"Latin-ASCII"'' (or a more +
-suitable one, which I believe is ''"NFD; [:Nonspacing Mark:] Remove; +
-NFC."''+
  
 ===== Backward Incompatible Changes ===== ===== Backward Incompatible Changes =====
  
-Introducing a new class could impact code bases that already use this class +Introducing a new ''Text'' class could impact code bases that already use this 
-name. But as PHP owns the global namespace, this should not deter us from +class name. But as PHP owns the global namespace, this should not deter us 
-adding such a code class.+from adding such a core class.
  
 ===== Proposed PHP Version(s) ===== ===== Proposed PHP Version(s) =====
Line 387: Line 457:
 ===== Open Issues ===== ===== Open Issues =====
  
-==== Class Name ==== 
  
-I have currently picked "Text", as it describes that the object does not only +===== Questions and Answers ===== 
-represent single words (strings). Alternatively, we can pick something like + 
-"Utext" (for Unicode Text), but I find that a distraction.+==== Why is this not a composer package? ==== 
 + 
 +The goal of this RFC is that PHP users can always rely on performant text 
 +processing capabilities.
  
 +Text processors written in PHP already exist, but suffer from performance
 +issues (PHP is slower than C), and are sometimes tailored to specific use
 +cases. By having them written in C, and utilising ICU's well tested and often
 +updated rules and algorithms, both the performance and correctness issues will
 +be addressed.
  
 ===== Future Scope ===== ===== Future Scope =====
rfc/unicode_text_processing.txt · Last modified: 2022/12/21 11:48 by derick