Differences

This shows you the differences between two versions of the page.

--- rfc:unicode_text_processing [2022/11/21 15:14] – derick
+++ rfc:unicode_text_processing [2022/12/15 15:29] – derick
@@ Line 3: / Line 3: @@
   * Date: 2022-11-09
   * Author: Derick Rethans <derick@php.net>
-  * Status: Draft
+  * Status: Under Discussion
   * First Published at: http://wiki.php.net/rfc/unicode_text_processing
@@ Line 14: / Line 14: @@
 create an API that developers can use to do Unicode text processing
 correctly, without having to know all the intricacies.
+Although PHP has decent maths features, it is sorely missing performant
+Unicode text processing always available in the core.
 ==== Definitions ====
@@ Line 27: / Line 30: @@
 Methods on the class will all return a new (immutable) object.
+The proposal is to make the ''Text'' class part of the PHP core. This would
+mean that it is always available to users. As the implementation
+requires ICU, this would also mean that PHP will depend on the ICU library.
 ==== Basics ====
@@ Line 32: / Line 40: @@
 constructor.
-The ''toString()'' method collapses the internally stored text into a
+The ''__toString()'' method collapses the internally stored text into a
 UTF-8 encoded string, which can be used by all existing PHP functions
 that accept strings.
-The internal representation would be UTF-16, as that's what ICU uses.
+The internal representation of the text is UTF-16, as that's what ICU uses.
 Unlike the PHP 6 approach, the conversion to/from the internal
 representation only happens on the boundaries: UTF-8 to UTF-16 through
-the constructor, and the reverse through the ''toString()'' method.
+the constructor, and the reverse through the ''__toString()'' method.
 There are multiple groups of methods indicated below. Some are to
@@ Line 51: / Line 59: @@
   * prefer a method per function, instead of allowing the behaviour of a method to be changed through (optional) arguments.
   * operations are on **graphemes**
-  * no redundent methods that can be constructed from other methods, unless they already exist in PHP, or are frequently used
+  * no redundant methods that can be constructed from other methods, unless they already exist in PHP, or are frequently used
   * more as we discuss this...
@@ Line 80: / Line 88: @@
 extensive documentation.
-Numerical order collation (such as PHP's ''natsort()'') can be achived
+Numerical order collation (such as PHP's ''natsort()'') can be achieved
 by adding the ''kn'' flag to the locale name, such as in ''de-u-kn''
 (case-sensitive German, with numerics in value order).
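As a rough illustration of numeric-order collation, the sketch below uses the existing ''Collator'' class from the Intl extension, not the ''Text'' API proposed here. The file names are made up for the example, and ''Collator::NUMERIC_COLLATION'' is used as the attribute-based counterpart of the ''kn'' locale flag.

```php
<?php
// Requires the Intl extension. Illustrative only: this uses today's
// Collator class, not the Text class proposed in this RFC.
$files = ['img12.png', 'img10.png', 'img2.png', 'img1.png'];

// Plain German collation: digits are compared one by one, like other text.
$plain = new Collator('de');
$plainSorted = $files;
$plain->sort($plainSorted);

// Numeric collation switched on: runs of digits are compared by value,
// which is what the "kn" flag asks for.
$numeric = new Collator('de');
$numeric->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);
$numericSorted = $files;
$numeric->sort($numericSorted);

print_r($plainSorted);
print_r($numericSorted);
```

With numeric collation enabled, ''img2.png'' sorts before ''img10.png'', which matches what ''natsort()'' users would expect.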
@@ Line 88: / Line 96: @@
 and defaults at http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings
-Specifying the locale and collator will also be possible by passing in a
+Building a locale/collation string will also be possible by using a
-''Intl\\Collator'' object
+''TextCollator'' object, to allow for better and easier-to-read customization
-(https://www.php.net/manual/en/class.collator.php) to allow for more
+of collations. The class performs the same function as ''\Intl\Collator''
-descritive construction of a locale with all its options.
+(https://www.php.net/manual/en/class.collator.php), except that it has
+descriptive methods to set collation properties. The reason for a separate
+class is so that you don't have to depend on the ''Intl'' extension, and to
+make it more developer-friendly. It converts the configured options to a
+string, which can then be used in any location where ''string $collator'' is
+used in the function signatures of the methods on the ''Text'' class.
@@ Line 98: / Line 111: @@
 This section lists all the methods that construct a Text object.
-=== __construct(string $text, string $locale = 'root/standard'), __construct(string $text, \\Intl\\Collator $collator = new \\Intl\\Collator('root/standard')) ===
+=== __construct(string $text, string $locale = 'root/standard') ===
 The constructor takes a UTF-8 encoded text, and stores this in an internal
@@ Line 106: / Line 119: @@
 (Byte-Order-Mark) character, if present.
+=== static Text::create(string $text, string $locale = 'root/standard') ===
-=== static Text::join(array(string|Text) $elements, string|Text $separator) ===
+The Symfony String package offers a static function to construct a String
+through a single-character function (''u''), which you can import into the
+file scope (with ''use'').
-Creates a new Text object by concatenating the each Text element in
+This method serves a similar purpose, so that you can shorten ''new Text(…)'' to
+''t'' after having imported the method into the file's scope, for example
+with ''use \Text::create as t''.
+=== static Text::join(array(string|Text) $elements, string|Text $separator, string $collator = NULL) ===
+Creates a new Text object by concatenating each Text element in
 ''$elements'', inserting ''$separator'' in between each element.
-Semantics like: ''implode(string $separator, array(string) $array)''
+The semantics are like: ''implode(string $separator, array(string) $array)''
+If the ''$collator'' is not specified, it uses the collation of the first
+element in the ''$elements'' array. This collation is then also set on the
+created object.
+If the ''$elements'' array is empty, an empty ''Text'' object with the
+''root'' locale is created.
@@ Line 142: / Line 171: @@
 === wrap(int $maxWidth, bool $cutLongWords = false) : array(Text) ===
-Wraps a text to a given number of graphemes into an array of Text objects.
+Wraps a text to a given number of graphemes per line, into an array of Text
+objects.
 Like: ''wordwrap'', but based on graphemes and returning an array instead of
@@ Line 155: / Line 185: @@
 Replaces the first ''$maxReplacements'' occurrences of ''$search'' with
 ''$replace''.
+The locale of ''$search'' is used to find sub-strings that
+match, if it is a ''Text'' object, otherwise the locale embedded in the object
+that the method is called on.
 The ''$replaceFrom'' and ''$replaceTo'' arguments control which found
-items are being replace. The ''$replaceFrom'' argument is the first
+items are being replaced. The ''$replaceFrom'' argument is the first
 item that is being replaced (0-indexed), and ''$replaceTo'' is the
-last item. Positive numbers are counted from the first occurence of
+last item. Positive numbers are counted from the first occurrence of
 ''$search'' in the Text, and negative numbers from the last found
 occurrence.
+In order to find sub-strings case-insensitively, you can use the ''$collator''
-=== replaceTextCaseInsensitively(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) ===
+argument to the constructor of the ''$search'' argument.
-Replaces every occurrence of ''$search'' with ''$replace'' using the locale of
-the object that the method is called on. The locale of ''$search'' and
-''$replace'' is ignored.
-''$replaceFrom'' and ''$replaceTo'' behave as with ''replaceText''.
 === reverse() ===
@@ Line 182: / Line 209: @@
 Methods to find text in other text.
-=== getPositionOfFirstOccurrence(string|Text $textToFind, int $offset) : int|false ===
+In all these methods, the locale of ''$search'' is used to find sub-strings that
+match, if it is a ''Text'' object, otherwise the locale embedded in the object
+that the method is called on.
+=== getPositionOfFirstOccurrence(string|Text $search, int $offset) : int|false ===
 Returns the position (in grapheme units) of the first occurrence of
-''$textToFind'' starting at the (grapheme) ''$offset'', or false if not found.
+''$search'' starting at the (grapheme) ''$offset'', or false if not found.
-Like: ''grapheme_strpos($this, $textToFind, $offset)''
+Like: ''grapheme_strpos($this, $search, $offset)''
 https://www.php.net/manual/en/function.grapheme-strpos.php
 *I think this method name is too long*
-=== getPositionOfLastOccurrence(string|Text $textToFind, int $offset) : int|false ===
+=== getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false ===
@@ Line 198: / Line 230: @@
-=== returnFromFirstOccurence(string|Text $textToFind) : Text|false ===
+=== returnFromFirstOccurence(string|Text $search) : Text|false ===
-Returns the ''Text'' starting with the ''$textToFind'' if found, and
+Returns the ''Text'' starting with the ''$search'' if found, and
 otherwise ''false''.
-Like: ''grapheme_strstr($this, $textToFind)''
+Like: ''grapheme_strstr($this, $search)''
 (https://www.php.net/manual/en/function.grapheme-strstr.php)
-=== returnFromLastOccurence(string|Text $textToFind) : Text|false ===
+=== returnFromLastOccurence(string|Text $search) : Text|false ===
 Like ''returnFromFirstOccurence'' but then from the end of the text.
-=== contains(string|Text $string) ===
+=== contains(string|Text $search) ===
-Returns true if the text ''$string'' can be found in the text.
+Returns true if the text ''$search'' can be found in the text.
 Like ''str_contains''.
-=== endsWith(string|Text $string) : bool ===
+=== endsWith(string|Text $search) : bool ===
-Could be constructed from ''getPositionOfFirstOccurrence()'' and
+Compares the last ''$search.Length()'' graphemes of ''$this''.
+Case-insensitive comparison can be achieved by setting the right
+''$collator'' on ''$search''.
+Could be constructed from ''getPositionOfLastOccurrence()'' and
 ''length()'', but it's an often required method, and standard PHP has it
 too.
-=== startsWith(string|Text $string) : bool ===
+=== startsWith(string|Text $search) : bool ===
-Compares the first ''$string.Length()'' graphemes of ''$this'' using the
+Compares the first ''$search.Length()'' graphemes of ''$this''.
-locale and collator that are configured with ''$this''.
 Case-insensitive comparison can be achieved by setting the right
-''$locale'' and ''$collator'' on ''$this''.
+''$collator'' on ''$search''.
 Could be constructed from ''getPositionOfFirstOccurrence()'',
@@ Line 240: / Line 276: @@
 ==== Comparing Text Objects ====
-=== compareWith(Text $other) : int ===
+=== compareWith(Text $other, string $collator = NULL) : int ===
-Uses the configured ''$locale'' of ''$this'' to compare it against
+Uses the configured ''$collator'' of ''$this'' to compare it against
-''$other''. The locale of ''$other'' is ignored.
+''$other'', unless the ''$collator'' argument is specified as an override.
 This same method is also used for comparing two Text objects as "compare
-handler".
+handler". Here only the locale on ''$this'' is taken into account.
 ==== Case Conversions ====
+These operations all use the collation that is configured on the Text object.
 === toLower ===
@@ Line 256: / Line 293: @@
 Converts the text to lower case, using the lower case variant of each
 Unicode code point that makes up the text.
 === toUpper ===
+The same, but then to upper case.
 === toTitle ===
+The same, but then to title case (the first letter of each word).
 === firstToLower ===
 Converts the first grapheme in the text to a lower case variant.
 === firstToUpper ===
+The same, but then to upper case.
 === firstToTitle ===
+The same, but then to title case (the first letter of each word).
+=== wordsToLower ===
+Converts the first grapheme in every word to a lower case variant.
+=== wordsToUpper ===
+The same, but then to upper case.
+=== wordsToTitle ===
+The same, but then to title case (the first letter of each word).
@@ Line 301: / Line 350: @@
-=== countWords() ===
+=== getWordCount() ===
 Pretty much a shortcut for::
@@ Line 315: / Line 364: @@
 These functions return an iterator that can be used to iterate over the text.
 The values returned by the iterators are affected by the text's locale.
+These are inspired by ICU4J's BreakIterators
+(https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/BreakIterator.html)
+and Intl's create*Instance methods on ''Intl\BreakIterator''
+(https://www.php.net/manual/en/class.intlbreakiterator.php).
 === getCharacterIterator ===
+Returns an Iterator that locates boundaries between logical characters.
+Because of the structure of the Unicode encoding, a logical character may be
+stored internally as more than one Unicode code point. (A with an umlaut may
+be stored as an 'a' followed by a separate combining umlaut character, for
+example, but the user still thinks of it as one character.) This iterator
+allows various processes (especially text editors) to treat as characters the
+units of text that a user would think of as characters, rather than the units
+of text that the computer sees as "characters".
+=== getWordIterator ===
-=== getLineIterator ===
+Returns an Iterator that locates boundaries between words. This is useful
+for double-click selection or "find whole words" searches. This type of
+iterator makes sure there is a boundary position at the beginning and end
+of each legal word. (Numbers count as words, too.) Whitespace and punctuation
+are kept separate from real words.
+=== getLineIterator ===
+Returns an Iterator that locates positions where it is legal for a text
+editor to wrap lines. This is similar to word breaking, but not the same:
+punctuation and whitespace are generally kept with words (you don't want a
+line to start with whitespace, for example), and some special characters can
+force a position to be considered a line-break position or prevent a position
+from being a line-break position.
 === getSentenceIterator ===
+Returns an Iterator that locates boundaries between sentences.
 === getTitleIterator ===
+Returns an Iterator that locates boundaries between title breaks.
-=== getWordIterator ===
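To give a feel for the word-boundary behaviour described above, here is a minimal sketch that uses the existing ''IntlBreakIterator'' class from the Intl extension. It is not the proposed ''getWordIterator'' method. The sample text and the filter that drops whitespace and punctuation segments are assumptions made for the example.

```php
<?php
// Requires the Intl extension; uses the existing IntlBreakIterator class,
// not the Text iterators proposed in this RFC.
$text = "Hello, wörld! 123";

$iterator = IntlBreakIterator::createWordInstance('en');
$iterator->setText($text);

$words = [];
foreach ($iterator->getPartsIterator() as $part) {
    // Each $part is the text between two adjacent boundaries. Keep only
    // the parts that contain a letter or a digit, so that whitespace and
    // punctuation segments are skipped.
    if (preg_match('/[\p{L}\p{N}]/u', $part)) {
        $words[] = $part;
    }
}

print_r($words);
```

The number ''123'' is reported as a word of its own, and the comma, exclamation mark and spaces come back as separate segments that the filter discards.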
@@ Line 344: / Line 415: @@
 === transliterate(string $transliterationString) ===
+Transliterates the content of the ''Text'' object according to the rules as
+specified in the ''$transliterationString''.
+There are a few constants for specific and often used cases, such as creating
+an ASCII transliterated version of any Text:
-=== transliterate(\Intl\Transliterator $transliterator) ===
+ - const Text::toAscii : A shortcut for a transliteration string that converts
+   any script to Latin, and also strips all the accents.
+ - const Text::toLatin : A shortcut for a transliteration string that converts
+   any script to Latin, but does not remove the accents.
-With the first one being a "simple" one to use, and the second using Intl's
+ - const Text::removeAccents : Removes the accents from a Text. A shortcut for
-Transliterator for more complex cases.
+   the transliteration string ''"NFD; [:Nonspacing Mark:] Remove; NFC."''.
-Should we add shortcuts for a set of often used ones, such as ''Any-Latin''? I
+===== Implementation Details =====
-think so, as it's the majority use case.
+The functionality as is described in this RFC is mostly implemented by using
+functionality from the ICU library, which is also used by the Intl extension.
-=== toLatin ===
+In order for PHP to continue to work on as wide a range of platforms and
+distributions as possible, the minimum ICU version will be chosen according to common
-Converts any script to Latin.
+Linux distributions' lowest version, which would include the version of PHP in
+which this functionality is implemented.
-=== removeAccents ===
-Removes the accents from a (latin script) text.
-A shortcut for the transliteration string ''"Latin-ASCII"'' (or a more
-suitable one, which I believe is ''"NFD; [:Nonspacing Mark:] Remove;
-NFC."''.
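The accent-removing transliteration string quoted above can be tried out today with the Intl extension's ''Transliterator'' class. The sketch below is only an illustration with the existing API, not the proposed ''Text::removeAccents'' constant. It assumes that the trailing period in the quoted string is sentence punctuation and leaves it out, and the sample text is made up.

```php
<?php
// Requires the Intl extension; uses the existing Transliterator class.
// Decompose (NFD), drop the combining marks, then recompose (NFC).
$rules = 'NFD; [:Nonspacing Mark:] Remove; NFC';

$removeAccents = Transliterator::create($rules);
if ($removeAccents === null) {
    // create() returns null when the rule string cannot be parsed.
    throw new RuntimeException(intl_get_error_message());
}

$result = $removeAccents->transliterate('Crème brûlée à la carte');
echo $result, "\n";
```

For this input the script prints ''Creme brulee a la carte''. Letters that are not built from a base letter plus a combining mark, such as ''ø'', are left untouched by these rules.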
 ===== Backward Incompatible Changes =====
-Introducing a new class could impact code bases that already use this class
+Introducing a new ''Text'' class could impact code bases that already use this
-name. But as PHP owns the global namespace, this should not deter us from
+class name. But as PHP owns the global namespace, this should not deter us
-adding such a code class.
+from adding such a core class.
 ===== Proposed PHP Version(s) =====
@@ Line 387: / Line 457: @@
 ===== Open Issues =====
-==== Class Name ====
-I have currently picked "Text", as it describes that the object does not only
+===== Questions and Answers =====
-represent single words (strings). Alternatively, we can pick something like
-"Utext" (for Unicode Text), but I find that a distraction.
+==== Why is this not a composer package? ====
+The goal of this RFC is that PHP users can always rely on performant text
+processing capabilities.
+Text processors written in PHP already exist, but suffer from performance
+issues (PHP is slower than C), and are sometimes tailored to specific use
+cases. By having them written in C, and utilising ICU's well tested and often
+updated rules and algorithms, both the performance and correctness issues will
+be addressed.
 ===== Future Scope =====
