PHP RFC: Unicode Text Processing

Introduction

This RFC proposes introducing a new class that makes using and processing (Unicode) text significantly more developer-friendly than the wealth of functionality that the intl extension provides. The goal is to create an API that developers can use to do Unicode text processing correctly, without having to know all the intricacies.

Although PHP has decent maths features, it sorely lacks performant Unicode text processing that is always available in the core.

Definitions

Term Description
Grapheme A Unicode “character”. A single character includes: a normal character (p), a character with diacritics (ô), a character with space modifiers, or an emoji (☺).

Proposal

This RFC proposes a new final “Text” class, with methods to operate on the text stored in its objects.

Text objects are immutable: methods that change the text return a new object instead of modifying the one they are called on.

The proposal is to make the Text class part of the PHP core, so that it is always available to users. As the implementation requires ICU, this also means that PHP will depend on the ICU library.

Basics

Text objects are constructed by passing a UTF-8 encoded string to the constructor.

The __toString() method collapses the internally stored text into a UTF-8 encoded string, which can be used by all existing PHP functions that accept strings.

The internal representation of the text is UTF-16, as that's what ICU uses. Unlike the PHP 6 approach, the conversion to/from the internal representation only happens on the boundaries: UTF-8 to UTF-16 through the constructor, and the reverse through the __toString() method.
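The boundary conversion described above can be illustrated with the existing mbstring extension. This is only a sketch of the idea, assuming mbstring is loaded. The helper names toInternal() and fromInternal() are hypothetical, and the proposed class would presumably use ICU's own converters instead.

```php
<?php
// Hypothetical helpers: convert only at the boundaries, as the RFC describes.
function toInternal(string $utf8): string
{
    // UTF-8 in, UTF-16 (little endian, without BOM) out.
    return mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
}

function fromInternal(string $utf16): string
{
    // The reverse direction, as __toString() would do.
    return mb_convert_encoding($utf16, 'UTF-8', 'UTF-16LE');
}

$utf8      = "idee\u{00EB}n";          // 7 bytes in UTF-8: U+00EB takes two bytes
$internal  = toInternal($utf8);        // 12 bytes in UTF-16: six 16-bit code units
$roundTrip = fromInternal($internal);  // identical to the original UTF-8 string
```

Everything between the two conversions would operate on the UTF-16 form, so regular PHP string functions only ever see UTF-8.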

There are multiple groups of methods indicated below. Some mirror PHP's existing string functions (substr, wordwrap, etc.), but with more meaningful names.

Design Goals:

  • keep it simple
  • default behaviour should be the most expected
  • prefer a method per function, instead of allowing the behaviour of a method to be changed through (optional) arguments.
  • operations are on graphemes
  • no redundant methods that can be constructed from other methods, unless they already exist in PHP, or are frequently used
  • more as we discuss this...

Non Design Goals:

  • introduce every feature of the intl extension

Each section below contains a list of expected methods. This list is currently not exhaustive. Please join the discussion on the mailing list to suggest modifications or additions, keeping the design goals in mind.

If an argument to any of the methods is listed as string|Text, passing in a string value will have the same semantics as replacing the passed value with new Text($string). The locale from the Text object that this method is called on is also used for this new wrapped value, if necessary.

Locales and Internationalisation

By default each string will have the “root” collator associated with it, but it is possible to configure a specific collator by using the $collator argument in the constructor. The $collator is specified as a string describing an ICU locale name: https://unicode-org.github.io/icu/userguide/collation/api.html#instantiating-the-predefined-collators

For example, the locale (or collation) name en-u-ks-level1 selects primary-strength comparison for the English locale, which ignores case (and accent) differences when sorting. This will require extensive documentation.

Numerical order collation (such as PHP's natsort()) can be achieved by adding the kn flag to the locale name, such as in de-u-kn (case-sensitive German, with numerics in value order).

Other options are described in BCP47: https://github.com/unicode-org/cldr/blob/main/common/bcp47/collation.xml and defaults at http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings
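To give an impression of what these collation settings do, the following sketch uses the existing Collator class from the intl extension. It assumes that setting the strength to PRIMARY corresponds to the ks-level1 keyword and that NUMERIC_COLLATION corresponds to the kn keyword. The proposed Text class would take the locale string instead.

```php
<?php
// Assumption: these Collator settings mirror the "ks-level1" and "kn" keywords.
$caseInsensitive = new Collator('en');
$caseInsensitive->setStrength(Collator::PRIMARY);  // ignore case and accent differences

$numeric = new Collator('de');
$numeric->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);  // order digits by value

// compare() returns 0 when the collator considers both strings equal.
$sameWord = $caseInsensitive->compare('Resume', "r\u{00E9}sum\u{00E9}");

// With numeric collation "img2" sorts before "img10", much like natsort().
$files = ['img10.png', 'img2.png', 'img1.png'];
$numeric->sort($files);
```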

Building a locale/collation string will also be possible by using a TextCollator object, to allow for better and easier-to-read customization of collations. The class performs the same function as the intl extension's \Collator (https://www.php.net/manual/en/class.collator.php), except that it has descriptive methods to set collation properties. The reason for a separate class is so that you don't have to depend on the Intl extension, and to make it more developer-friendly. It converts the configured options to a string, which can then be used in any location where string $collator is used in the signatures of the methods on the Text class.

Construction

This section lists all the methods that construct a Text object.

__construct(string $text, string $locale = 'root/standard') : \Text

The constructor takes a UTF-8 encoded text, and stores this in an internal structure. The constructor will also convert the given text to Unicode Canonical Form (also called Normalisation Form C, or NFC). Passing in non-well-formed UTF-8 will result in an InvalidEncodingException. The constructor will also strip out a BOM (Byte-Order-Mark) character, if present.
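The input handling described for the constructor can be approximated with existing functions. This is a minimal sketch, assuming the mbstring and intl extensions are loaded. The helper name prepareText() is hypothetical, and a built-in exception stands in for the proposed InvalidEncodingException.

```php
<?php
// Hypothetical helper that mimics the input handling described for the constructor.
function prepareText(string $text): string
{
    // Reject input that is not well-formed UTF-8.
    if (!mb_check_encoding($text, 'UTF-8')) {
        // The RFC names InvalidEncodingException; a built-in exception stands in here.
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }

    // Strip a leading byte-order mark (U+FEFF, encoded as the bytes EF BB BF).
    if (str_starts_with($text, "\u{FEFF}")) {
        $text = substr($text, 3);
    }

    // Normalise to NFC, so that "e" + combining acute becomes a single U+00E9.
    return Normalizer::normalize($text, Normalizer::FORM_C);
}

$clean = prepareText("\u{FEFF}e\u{0301}");
```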

static Text::create(string $text, string $locale = 'root/standard') : \Text

The Symfony String package offers a static function to construct a String through a single-character function (u), which you can import into the file scope (with use).

This method serves a similar purpose, so that you can shorten new Text(…) to t after having imported the method into the file's scope with (for example): use \Text::create as t.

static Text::join(array(string|Text) $elements, string|Text $separator, string $collator = NULL) : \Text

Creates a new Text object by concatenating the Text elements in $elements, inserting $separator in between each element.

The semantics are like: implode(string $separator, array(string) $array)

If the $collator is not specified, it uses the collation of the first element in the $elements array. This will then also be set on the created object.

If the $elements array is empty, an empty Text object with the root locale is created.

Standard String Operations

split(string|Text $separator, int $limit = PHP_INT_MAX) : array(Text)

Returns an array of Text objects, each of which is a substring of $this, formed by splitting it on boundaries formed by the text $separator.

Like explode($separator, $this, $limit).

subString(int $offset, int $length) : Text|false

Returns a sub-string, starting at $offset for $length graphemes.

Like: grapheme_substr($this, $offset, $length) https://www.php.net/manual/en/function.grapheme-substr.php
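The difference between byte-based and grapheme-based extraction can be seen with the existing functions. This sketch assumes the intl extension is loaded, and the sample text is only an illustration.

```php
<?php
// "é" is written here as "e" followed by a combining acute accent (U+0301).
$text = "cafe\u{0301} noir";

// substr() counts bytes, so the cut lands between "e" and its accent.
$byBytes = substr($text, 0, 4);

// grapheme_substr() counts graphemes, so the accent stays with its base letter.
$byGraphemes = grapheme_substr($text, 0, 4);
```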

trimStart, trimEnd, trim : \Text

Removes white space at the start of, the end of, or both sides of the text.

Like: ltrim, rtrim, and trim, but using the Unicode definition of what white space is. https://unicode.org/reports/tr44/#White_Space
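As an illustration of why the Unicode definition matters, the sketch below trims with a regular expression. The helper name trimUnicode() is hypothetical, and the character class [\p{Z}\s] is only an approximation of the White_Space property, not the exact set that the proposed methods would use.

```php
<?php
// Hypothetical helper; [\p{Z}\s] with the /u modifier approximates White_Space.
function trimUnicode(string $text): string
{
    // Remove a run of separators at the very start or the very end of the text.
    return preg_replace('/^[\p{Z}\s]+|[\p{Z}\s]+\z/u', '', $text);
}

// No-break space, em space, text, ideographic space, and a newline.
$padded = "\u{00A0}\u{2003}hello world\u{3000}\n";

$unicodeTrimmed = trimUnicode($padded);
$byteTrimmed    = trim($padded);  // only strips the ASCII newline
```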

wrap(int $maxWidth, bool $cutLongWords = false) : array(Text)

Wraps a text to a given number of graphemes per line, into an array of Text objects.

Like: wordwrap, but based on graphemes and returning an array instead of inserting a break character.

If $cutLongWords is set, no Text element will be longer than $maxWidth graphemes.

replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) : \Text

Replaces occurrences of $search with $replace.

The locale of $search is used to find sub-strings that match, if it is a Text object, otherwise the locale embedded in the object that the method is called on.

The $replaceFrom and $replaceTo arguments control which of the found occurrences are replaced. $replaceFrom is the first occurrence to be replaced (0-indexed), and $replaceTo is the last. Positive numbers are counted from the first occurrence of $search in the Text, and negative numbers from the last found occurrence.

To find sub-strings case-insensitively, construct $search as a Text object and pass a suitable $collator to its constructor.
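One possible reading of the occurrence-range arguments is sketched below on plain strings. The function name replaceOccurrences() is hypothetical, matching is done byte-for-byte without any collation, and the treatment of negative indexes is an interpretation of the description above rather than a settled part of the proposal.

```php
<?php
// Hypothetical sketch: replace only occurrences $from..$to (0-indexed, inclusive).
function replaceOccurrences(string $text, string $search, string $replace, int $from = 0, int $to = -1): string
{
    if ($search === '') {
        return $text;
    }

    // Collect the byte offsets of all non-overlapping occurrences.
    $offsets = [];
    $pos = 0;
    while (($pos = strpos($text, $search, $pos)) !== false) {
        $offsets[] = $pos;
        $pos += strlen($search);
    }

    // Negative indexes count back from the last occurrence (-1 is the last one).
    $count = count($offsets);
    $from = max($from < 0 ? $from + $count : $from, 0);
    $to   = min($to < 0 ? $to + $count : $to, $count - 1);

    // Rebuild the text, swapping in $replace only inside the selected range.
    $result = '';
    $cursor = 0;
    foreach ($offsets as $index => $offset) {
        $result .= substr($text, $cursor, $offset - $cursor);
        $result .= ($index >= $from && $index <= $to) ? $replace : $search;
        $cursor = $offset + strlen($search);
    }

    return $result . substr($text, $cursor);
}

$all      = replaceOccurrences('a-b-c-d', '-', '+');          // every occurrence
$fromOne  = replaceOccurrences('a-b-c-d', '-', '+', 1, -1);   // second to last occurrence
$lastOnly = replaceOccurrences('a-b-c-d', '-', '+', -1, -1);  // only the last occurrence
```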

reverse() : \Text

Reverses a text, taking into account grapheme boundaries.
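A grapheme-aware reversal can be sketched with PCRE's \X escape, which matches one extended grapheme cluster. The helper name reverseGraphemes() is hypothetical, and the proposed method would rely on ICU rather than PCRE.

```php
<?php
// Hypothetical helper: split into grapheme clusters, then reverse their order.
function reverseGraphemes(string $text): string
{
    preg_match_all('/\X/u', $text, $matches);

    return implode('', array_reverse($matches[0]));
}

// The combining accent stays attached to its "e" instead of moving to the "b".
$reversed = reverseGraphemes("abe\u{0301}");
```

A byte-based strrev() on the same input would split the two-byte accent and produce invalid UTF-8.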

Finding Text in Text

Methods to find text in other text.

In all these methods, the locale of $search is used to find sub-strings that match, if it is a Text object, otherwise the locale embedded in the object that the method is called on.

getPositionOfFirstOccurrence(string|Text $search, int $offset) : int|false

Returns the position (in grapheme units) of the first occurrence of $search starting at the (grapheme) $offset, or false if not found.

Like: grapheme_strpos($this, $search, $offset) https://www.php.net/manual/en/function.grapheme-strpos.php
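The three existing ways of expressing a position give different answers for the same text, which is why the unit needs to be fixed. This sketch assumes the mbstring and intl extensions are loaded.

```php
<?php
// "ë" is written here as "e" followed by a combining diaeresis (U+0308).
$text = "Zoe\u{0308} speelt";

$bytePosition      = strpos($text, 'speelt');                  // 6, counted in bytes
$codePointPosition = mb_strpos($text, 'speelt', 0, 'UTF-8');   // 5, counted in code points
$graphemePosition  = grapheme_strpos($text, 'speelt');         // 4, counted in graphemes
```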

Alternative suggested names: findOffset

getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false

Like getPositionOfFirstOccurrence but then from the end of the text.

Alternative suggested names: findOffsetLast

returnFromFirstOccurence(string|Text $search) : Text|false

Returns the Text starting with the $search if found, and otherwise false.

Like: grapheme_strstr($this, $search) (https://www.php.net/manual/en/function.grapheme-strstr.php)

Alternative suggested names: startingWith

returnFromLastOccurence(string|Text $search) : Text|false

Like returnFromFirstOccurence but then from the end of the text.

Alternative suggested names: startingWithLast

Returns true if the text $search can be found in the text.

Like str_contains.

endsWith(string|Text $search) : bool

Compares the last $search->length() graphemes of $this with $search.

Case-insensitive comparison can be achieved by setting the right $collator on $search.

Could be constructed from getPositionOfLastOccurrence() and length(), but it is a frequently needed method, and standard PHP has it too.

startsWith(string|Text $search) : bool

Compares the first $search->length() graphemes of $this with $search.

Case-insensitive comparison can be achieved by setting the right $collator on $search.

Could be constructed from getPositionOfFirstOccurrence(), but it's an often required method, and standard PHP has it too.
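The prefix and suffix comparisons can be sketched with the existing grapheme functions and a Collator. The function names are hypothetical. Note that the sketch assumes a match has the same number of graphemes as $search, which does not hold for every collation.

```php
<?php
// Hypothetical sketch: compare the first graphemes of $text with $search.
function startsWithCollated(string $text, string $search, Collator $collator): bool
{
    $prefix = grapheme_substr($text, 0, grapheme_strlen($search));

    return $prefix !== false && $collator->compare($prefix, $search) === 0;
}

// Hypothetical sketch: compare the last graphemes of $text with $search.
function endsWithCollated(string $text, string $search, Collator $collator): bool
{
    $length = grapheme_strlen($search);
    if ($length === 0) {
        return true;  // an offset of -0 would select the whole text instead
    }
    $suffix = grapheme_substr($text, -$length);

    return $suffix !== false && $collator->compare($suffix, $search) === 0;
}

// Primary strength ignores case and accent differences.
$collator = new Collator('en');
$collator->setStrength(Collator::PRIMARY);

$starts = startsWithCollated("\u{00C9}clair au chocolat", 'eclair', $collator);
$ends   = endsWithCollated("\u{00C9}clair au CHOCOLAT", 'chocolat', $collator);
```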

Comparing Text Objects

compareWith(Text $other, string $collator = NULL) : int

Uses the configured $collator of $this to compare it against $other, unless the $collator argument is specified as an override.

This same method is also used as the “compare handler” when two Text objects are compared with each other. In that case only the locale on $this is taken into account.

Case Conversions

These operations all use the collation that is configured on the Text object.

toLower : \Text

Converts the text to lower case, using the lower case variant of each Unicode code point that makes up the text.

Example: “Het IJsselmeer is vol met ideeën” becomes “het ijsselmeer is vol met ideeën”.

toUpper : \Text

The same, but then to upper case.

Example: “Het IJsselmeer is vol met ideeën” becomes “HET IJSSELMEER IS VOL MET IDEEËN”.

toTitle : \Text

The same, but then to title case (the first letter of each word).

Example: “Het IJsselmeer is vol met ideeën” becomes “Het IJsselmeer is Vol met Ideeën”.

firstToLower : \Text

Converts the first grapheme in the text to a lower case variant.

Example: “Het IJsselmeer is vol met ideeën” becomes “het IJsselmeer is vol met ideeën”.

firstToUpper : \Text

The same, but then to upper case.

Example: “Het IJsselmeer is vol met ideeën” becomes “Het IJsselmeer is vol met ideeën”.

wordsToLower : \Text

Converts the first grapheme in every word to a lower case variant.

Example: “Het IJsselmeer is vol met ideeën” becomes “het ijsselmeer is vol met ideeën”.

wordsToUpper : \Text

The same, but then to upper case.

Example: “Het IJsselmeer is vol met ideeën” becomes “Het IJsselmeer Is Vol Met Ideeën”.
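For comparison, the sketch below shows what is available today through mbstring, which is not locale-aware. It assumes the mbstring and intl extensions are loaded, and the helper name firstToUpper() merely mirrors the proposed method. The Dutch “IJ” digraph is not treated as a unit here, which is the kind of detail a locale-aware implementation would handle.

```php
<?php
$text = "het ijsselmeer is vol met idee\u{00EB}n";

// Locale-independent conversions from mbstring.
$upper = mb_strtoupper($text, 'UTF-8');
$title = mb_convert_case($text, MB_CASE_TITLE, 'UTF-8');  // yields "Ijsselmeer", not "IJsselmeer"

// Hypothetical helper: upper-case only the first grapheme of the text.
function firstToUpper(string $text): string
{
    if ($text === '') {
        return '';
    }
    $first = grapheme_substr($text, 0, 1);
    $rest  = grapheme_substr($text, 1);

    return mb_strtoupper($first, 'UTF-8') . $rest;
}

$capitalised = firstToUpper("\u{00E9}lan vital");
```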

Counting

getByteCount() : int

Returns the size in bytes that the text will take when converted to UTF-8.

length(), getCharacterCount(): int

Returns the number of characters that make up the text. A character (also sometimes called a grapheme) consists of the base character and all its combining diacritics. Unicode calls these “extended grapheme clusters”. http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

getCodePointCount() : int

Returns the number of Unicode code points that make up the text. (Not sure if we should add this, as it doesn't really have any use).
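The three counts can differ for the same text, as this sketch with existing functions shows. It assumes the mbstring and intl extensions are loaded, and the mapping of each function to a proposed method is given in the comments for illustration only.

```php
<?php
// "ï" is written here as "i" followed by a combining diaeresis (U+0308).
$text = "nai\u{0308}ve";

$byteCount      = strlen($text);              // 7, compare getByteCount()
$codePointCount = mb_strlen($text, 'UTF-8');  // 6, compare getCodePointCount()
$graphemeCount  = grapheme_strlen($text);     // 5, compare length()
```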

getWordCount() : int

Pretty much a shortcut for:

$count = 0;
foreach ($text->getWordIterator() as $word) {
    $count++;
}

Uses the locale, just like the iterators.
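With today's intl extension, a comparable word count can be sketched with IntlBreakIterator. The function name countWords() is hypothetical. Segments that consist only of white space or punctuation are skipped by checking the rule status, which is an assumption about how the proposed method would count.

```php
<?php
// Hypothetical sketch: count the segments that ICU classifies as words or numbers.
function countWords(string $text, string $locale): int
{
    $iterator = IntlBreakIterator::createWordInstance($locale);
    $iterator->setText($text);

    $count = 0;
    $iterator->first();
    while ($iterator->next() !== IntlBreakIterator::DONE) {
        // The rule status describes the segment that ends at the current boundary;
        // values below WORD_NONE_LIMIT mark white space and punctuation.
        if ($iterator->getRuleStatus() >= IntlBreakIterator::WORD_NONE_LIMIT) {
            $count++;
        }
    }

    return $count;
}

$words = countWords("Het IJsselmeer is vol met idee\u{00EB}n.", 'nl');
```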

Iterators

These methods return an iterator that can be used to iterate over the text. The results of the iterators are affected by the text's locale. They are inspired by ICU4J's BreakIterators (https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/BreakIterator.html) and the create*Instance methods on intl's IntlBreakIterator (https://www.php.net/manual/en/class.intlbreakiterator.php).

getCharacterIterator : \Iterator

Returns an Iterator that locates boundaries between logical characters. Because of the structure of the Unicode encoding, a logical character may be stored internally as more than one Unicode code point. (A with an umlaut may be stored as an 'a' followed by a separate combining umlaut character, for example, but the user still thinks of it as one character.) This iterator allows various processes (especially text editors) to treat as characters the units of text that a user would think of as characters, rather than the units of text that the computer sees as “characters”.

getWordIterator : \Iterator

Returns an Iterator that locates boundaries between words. This is useful for double-click selection or “find whole words” searches. This type of iterator makes sure there is a boundary position at the beginning and end of each legal word. (Numbers count as words, too.) Whitespace and punctuation are kept separate from real words.

getLineIterator : \Iterator

Returns an Iterator that locates positions where it is legal for a text editor to wrap lines. This is similar to word breaking, but not the same: punctuation and whitespace are generally kept with words (you don't want a line to start with whitespace, for example), and some special characters can force a position to be considered a line-break position or prevent a position from being a line-break position.

getSentenceIterator : \Iterator

Returns an Iterator that locates boundaries between sentences.

getTitleIterator : \Iterator

Returns an Iterator that locates boundaries between title breaks.
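The boundary-based iteration described in this section can be tried out with the existing IntlBreakIterator class. The helper name segments() is hypothetical. It turns the boundaries reported by any break iterator into the pieces of text between them.

```php
<?php
// Hypothetical helper: collect the text between consecutive boundaries.
function segments(IntlBreakIterator $iterator, string $text): array
{
    $iterator->setText($text);

    $parts = [];
    $start = $iterator->first();
    while (($end = $iterator->next()) !== IntlBreakIterator::DONE) {
        // Boundaries are reported as byte offsets into the UTF-8 string.
        $parts[] = substr($text, $start, $end - $start);
        $start = $end;
    }

    return $parts;
}

// Logical characters: the combining accent stays with its base letter.
$characters = segments(IntlBreakIterator::createCharacterInstance('en'), "e\u{0301}a");

// Sentences: the same helper works with a different kind of iterator.
$sentences = segments(IntlBreakIterator::createSentenceInstance('en'), 'Text is hard. Unicode helps.');
```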

Transliteration

Converts text between scripts and other properties.

transliterate(string $transliterationString) : \Text

Transliterates the content of the Text object according to the rules as specified in the $transliterationString.

There are a few constants for specific and often used cases, such as creating an ASCII transliterated version of any Text:

  • const Text::toAscii : A shortcut for a transliteration string that converts any script to Latin, and also strips all the accents.
  • const Text::toLatin : A shortcut for a transliteration string that converts any script to Latin, but does not remove the accents.
  • const Text::removeAccents : Removes the accents from a Text. A shortcut for the transliteration string "NFD; [:Nonspacing Mark:] Remove; NFC".
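The transliteration strings can already be tried out with the intl extension's Transliterator class. In this sketch the accent-removal string is the one quoted above, while "Any-Latin; Latin-ASCII" is only an assumption about what the toAscii shortcut might expand to.

```php
<?php
// The rule string for removing accents, as quoted in the list above.
$removeAccents = Transliterator::create('NFD; [:Nonspacing Mark:] Remove; NFC');

// Assumption: a compound ID that converts to Latin first and then to plain ASCII.
$toAscii = Transliterator::create('Any-Latin; Latin-ASCII');

$plain = $removeAccents->transliterate("Cr\u{00E8}me br\u{00FB}l\u{00E9}e");
$ascii = $toAscii->transliterate("\u{00C6}r\u{00F8}sk\u{00F8}bing");
```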

Implementation Details

The functionality as is described in this RFC is mostly implemented by using functionality from the ICU library, which is also used by the Intl extension.

So that PHP continues to work on as wide a range of platforms and distributions as possible, the minimum required ICU version will be the lowest ICU version shipped by the common Linux distributions that would include the version of PHP in which this functionality is implemented.

Backward Incompatible Changes

Introducing a new Text class could impact code bases that already use this class name. But as PHP owns the global namespace, this should not deter us from adding such a core class.

Proposed PHP Version(s)

Next PHP 8.x

RFC Impact

There will be no impact on SAPIs, existing extensions, or Opcache.

Open Issues

  1. Add a method like mb_strcut, to extract a string of a maximum number of bytes from a position, as encoded in UTF-8.
  2. Tidy up language related to locale/collator. As Tim Starling says: “If the input is an ICU locale string, then I think you should just call it locale. Then the user will be armed with the correct terminology when they go looking for more information in the ICU manual. In ICU, case conversion and BreakIterator need a locale, not a collator.”

Questions and Answers

Why is this not a composer package?

The goal of this RFC is that PHP users can always rely on performant text processing capabilities.

Text processors written in PHP already exist, but suffer from performance issues (PHP is slower than C), and are sometimes tailored to specific use cases. By having them written in C, and utilising ICU's well tested and often updated rules and algorithms, both the performance and correctness issues will be addressed.

Future Scope

More methods than described in this RFC can be added in the future.

Proposed Voting Choices

Either “yes” or “no” on including the proposed class.

Patches and Tests

There is no patch yet.

Implementation

After the project is implemented, this section should contain

  1. the version(s) it was merged into
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature
  4. a link to the language specification section (if any)

References

Rejected Features

Nothing rejected yet.

Changes

0.9.1 — 2022-12-16

  • Tim Düsterhus: Removed firstToTitle/wordsToTitle; added examples for toUpper and friends; added return types everywhere; added suggested other names for getPosition... methods; marked class as final.
  • Paul Crovella: Clarify which normalisation is being used.
  • Daniel Wolfe: Update trimLeft/trimRight to trimStart/trimEnd.
rfc/unicode_text_processing.1671384551.txt.gz · Last modified: 2022/12/18 17:29 by theodorejb