====== PHP RFC: Unicode Text Processing ======
  * Version: 0.9.2
  * Date: 2022-12-21 (Original date: 2022-12-15)
  * Author: Derick Rethans <derick@php.net>
  * Status: Draft
  * First Published at: http://wiki.php.net/rfc/unicode_text_processing

===== Introduction =====

This RFC suggests to introduce a new class to make using and processing
(Unicode) text significantly more developer friendly compared to the
wealth of functionality that the intl extension provides. The goal is to
create an API that developers can use to do Unicode text processing
correctly, without having to know all the intricacies.

Although PHP has decent maths features, it is solely missing performant
Unicode text processing always available in the core.

==== Definitions ====

^ Term ^ Description ^
| Grapheme | A Unicode "character". A **single** character includes: a normal character (p), a character with diacritics (ô), a character with space modifiers, or an emoji (☺). |


===== Proposal =====

To introduce a new final "Text" class, with methods to operate on the
text stored in the objects.

Methods on the class will all return a new (immutable) object.

The proposal is to make the ''Text'' class part of the PHP core. This would
mean that it is therefore always available to user. As the implementation
requires ICU, this would also mean that PHP will depend on the ICU library.

==== Basics ====

Text objects are constructed by passing a UTF-8 encoded string to the
constructor.

The ''_****_toString()'' method collapses the internally stored text into a
UTF-8 encoded string, which can be used by all existing PHP functions
that accept strings.

The internal representation of the text is UTF-16, as that's what ICU uses.
Unlike the PHP 6 approach, the conversion to/from the internal
representation only happens on the boundaries: UTF-8 to UTF-16 through
the constructor, and the reverse through the ''_****_toString()'' method.

There are multiple groups of methods indicated below. Some are to
represent PHP's existing string functions (substr, wordwrap, etc.), but
with meaningful names.

Design Goals:

  * keep it simple
  * default behaviour should be the most expected
  * prefer a method per function, instead of allowing the behaviour of a method to be changed through (optional) arguments.
  * operations are on **graphemes**
  * no redundant methods that can be constructed from other methods, unless they already exist in PHP, or are frequently used
  * more as we discuss this...

Non Design Goals:

  * introduce every feature of the intl extension

Each section below contains a list of expected methods. This list is
currently not exhaustive. Please join the discussion on the mailing list
to suggest modifications or additions, keeping the design goals in mind.

If an argument to any of the methods is listed as ''string|Text'',
passing in a ''string'' value will have the same semantics as replacing
the passed value with ''new Text($string)''. The locale and default collator
from the Text object that this method is called on is also used for this new
wrapped value, if necessary.

==== Locales, Collators, and Internationalisation ====

By default each string will have the "root" locale and "standard" collator
associated with it, but it is possible to configure a specific locale and
collator by using the ''$collation'' argument in the constructor. Collation is in
addition to the locale, and affects sorting and finding operations.

The ''$collation'' is specified as a string describing an ICU locale/collation
name:
https://unicode-org.github.io/icu/userguide/collation/api.html#instantiating-the-predefined-collators

The methods on the Text object all use the ''$collation'' argument name.

For example, the locale (and collation) name ''en-u-ks-level1'' means
case-insensitive sorting (''ks-level1'') for the English locale (''en-u'').
The format of this locale/collation name needs extensive documentation.

Numerical order collation (such as PHP's ''natsort()'') can be achieved by
adding the ''kn'' flag to the collator specification, such as in ''de-u-kn''
(case-sensitive German ('''de-u''), with numerics in value order (''kn'')).

Other options are described in BCP47:
https://github.com/unicode-org/cldr/blob/main/common/bcp47/collation.xml
and defaults at http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings

Building a locale/collation string will also be possible by using a
''TextCollator'' object, to allow for better and easier-to-read customization
of collations. The class performs the same function as ''\Intl\Collator''
(https://www.php.net/manual/en/class.collator.php), except that it has
descriptive methods to set collation properties. The reason for a separate
class is so that you don't have to depend on the ''Intl'' extension, and to
make it more developer-friendly. It converts the configured options to a
string, which can then be used in any location where ''string $collator'' is
used in the function signatures to the methods on the ''Text'' class.


==== Construction ====

This section lists all the method that construct a Text object.

=== __construct(string $text, string $collation = 'root/standard') : \Text ===

The constructor takes a UTF-8 encoded text, and stores this in an internal
structure. The constructor will also convert the given text to Unicode
Canonical Form (also called Normalisation Form C, or NFC). Passing in
non-well-formed UTF-8 will result in an ''InvalidEncodingException''.
The constructor will also strip out a BOM (Byte-Order-Mark) character,
if present.


=== static Text::create(string $text, string $collation = 'root/standard') : \Text ===

The Symfony String package, offers a static function to construct a String
through a single-character function (''u''), which you can import into the
file scope (with ''use'').

This method solves a similar use, so that you can shorten ''new Text(…)'' to
''t'' after having imported the method into the file's scope with:
For example with ''use \Text::create as t''.


=== static Text::concat(string|Text ...$elements) : \Text ===

Creates a new Text object by concatenating all the given string/Text arguments
into a new Text object. 

If the ''$elements'' array is empty, an empty ''Text'' object with the
''root'' locale and ''standard'' collation is created.


=== static Text::join(iterable<string|Text> $elements, string|Text $separator, string $collation = NULL) : \Text ===

Creates a new Text object by looping over all the string/Text elements in
''$elements'', inserting ''$separator'' in between each element.

The semantics are like: ''implode(string $separator, array(string) $array)''

If the ''$collation'' is not specified, it uses the collation of the first
element from the ''$elements'' iterable. This will also be then set on the
created object.

If the ''$elements'' iterator has no items, an empty ''Text'' object with the
''root'' locale and ''standard'' collation is created.

If the iterator produces a non-string/Text element, then a ''\ValueException''
will be thrown.

==== Standard String Operations ====


=== split(string|Text $separator, int $limit = PHP_INT_MAX) : array(Text) ===

Returns an array of Text objects, each of which is a substring of ''$this'',
formed by splitting it on boundaries formed by the text ''$separator''.

Like ''explode($separator, $limit)''.


=== subString(int $offset, int $length) : Text|false ===

Returns a sub-string, starting at ''$offset'' for ''$length'' graphemes.

Like: ''grapheme_substr($this, $offset, $length)''
https://www.php.net/manual/en/function.grapheme-substr.php

=== trimStart, trimEnd, trim : \Text ===

Removes white space at the start of, the end of, or both sides of the text.

Like: ''ltrim'', ''rtrim'', and ''trim'', but with using the Unicode definition
of what white space is. https://unicode.org/reports/tr44/#White_Space

=== wrap(int $maxWidth, bool $cutLongWords = false) : array(Text) ===

Wraps a text to a given number of graphemes per line, into an array of Text
objects.

Like: ''wordwrap'', but based on graphemes and returning an array instead of
inserting a break character.

If ''$cutLongWords'' is set, no Text element will be larger than
''$maxWidth''.

=== reverse() : \Text ===

Reverses a text, taking into account grapheme boundaries.


==== Finding Text in Text ====

Methods to find text in other text.

In all these methods, the locale and collator of ''$search'' are used to find
sub-strings that match, if it is a ''Text'' object, otherwise the locale and
collator that are embedded in the object that the method is called on is used.


=== getPositionOfFirstOccurrence(string|Text $search, int $offset) : int|false ===

Returns the position (in grapheme units) of the first occurrence of
''$search'' starting at the (grapheme) ''$offset'', or false if not found.

Like: ''grapheme_strpos($this, $search, $offset)''
https://www.php.net/manual/en/function.grapheme-strpos.php

Alternative suggested names: ''findOffset''


=== getPositionOfLastOccurrence(string|Text $search, int $offset) : int|false ===


Like ''getPositionOfFirstOccurrence'' but then from the end of the text.

Alternative suggested names: ''findOffsetLast''


=== returnFromFirstOccurence(string|Text $search) : Text|false ===

Returns the ''Text'' starting with the ''$search'' if found, and
otherwise ''false''.

Like: ''grapheme_strstr($this, $search)''
(https://www.php.net/manual/en/function.grapheme-strstr.php)

Alternative suggested names: ''startingWith'', ''startingAt''


=== returnFromLastOccurence(string|Text $search) : Text|false ===

Like ''returnFromFirstOccurence'' but then from the end of the text.

Alternative suggested names: ''startingWithLast'', ''startingAtLast''


=== contains(string|Text $search) ===

Returns true if the text ''$search'' can be found in the text.

Like ''str_contains''.


=== endsWith(string|Text $search) : bool ===

Compares the last ''$search.Length()'' graphemes of ''$this''.

Case-insensitive comparison can be achieved by setting the right
''$collation'' on ''$search''.

Could be constructed from ''getPositionOflastOccurrence()'' and
''length()'', but it's an often required method, and standard PHP has it
too.


=== startsWith(string|Text $search) : bool ===

Compares the first ''$search.Length()'' graphemes of ''$this''.

Case-insensitive comparison can be achieved by setting the right
''$collation'' on ''$search''.

Could be constructed from ''getPositionOfFirstOccurrence()'',
but it's an often required method, and standard PHP has it
too.

=== replaceText(string|Text $search, string|Text $replace, int $replaceFrom = 0, int $replaceTo = -1 ) : \Text ===

Replaces occurrences of ''$search'' with ''$replace''.

The ''$replaceFrom'' and ''$replaceTo'' arguments control which found
items are being replaced. The ''$replaceFrom'' argument is the first
argument that is being replaced (0-indexed), and ''$replaceTo'' is the
last item. Positive numbers are counted from the first occurrence of
''$search'' in the Text, and negative numbers from the last found
occurrence.

In order to find sub-strings case-insensitively, you can use the ''$collation''
argument to ''Text::__construct'' of the ''$search'' argument.


==== Comparing Text Objects ====

=== compareWith(Text $other, string $collation = NULL) : int ===

Uses the configured ''$collation'' of ''$this'' to compare it against
''$other'', unless the ''$collation'' argument is specified as an override.

This same method is also used for comparing two Text objects as "compare
handler" (an overloaded ''=='' operator). Here only the locale on ''$this'' is
taken into account.

=== equals(Text $other, string $collation = NULL) : boolean ===

Alias for ''compareWith($other, $collation) === 0''.


==== Case Conversions ====

These operations all use the collation that is configured on the Text object.

=== toLower : \Text ===

Converts the text to lower case, using the lower case variant of each
Unicode code point that makes up the text.

Example: ''Het Ĳsselmeer is vol met ideëen'' to ''het ĳsselmeer is vol met ideëen''.


=== toUpper : \Text ===

The same, but then to upper case.

Example: ''Het Ĳsselmeer is vol met ideëen'' to ''HET ĲSSELMEER IS VOL MET IDEËEN''.


=== toTitle : \Text ===

The same, but then to title case (the first letter of each word).

Example: ''Het Ĳsselmeer is vol met ideëen'' to ''Het Ĳsselmeer is Vol met Ideëen''.


=== firstToLower : \Text ===

Converts the first grapheme in the text to a lower case variant.

Example: ''Het Ĳsselmeer is vol met ideëen'' to ''het Ĳsselmeer is vol met ideëen''.


=== firstToUpper : \Text ===

The same, but then to upper case.

Example: ''Het Ĳsselmeer is vol met ideëen'' to ''Het Ĳsselmeer is vol met ideëen''.


=== wordsToLower : \Text ===

Converts the first grapheme in every word to an lower case variant.

Example: ''Het Ĳsselmeer is vol met ideëen'' to ''het ĳsselmeer is vol met ideëen''.


=== wordsToUpper : \Text ===

The same, but then to upper case.

Example: ''Het Ĳsselmeer is vol met ideëen'' to ''Het Ĳsselmeer Is Vol Met Ideëen''.


==== Counting ====


=== getByteCount() : int ===

Returns the size in bytes that the text will take when converted to UTF-8.


=== length(), getCharacterCount(): int  ===

Returns the number of characters that make up the text. A character (also
sometimes call a grapheme) consists of the base-character, and all
combining diacritics. Unicode calls these "extended grapheme clusters".
http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries


=== getCodePointCount() : int ===

Returns the number of Unicode code points that make up the text.
(Not sure if we should add this, as it doesn't really have any use).


=== getWordCount() : int ===

Pretty much a shortcut for::

	$count = 0;
	foreach ($text->getWordIterator as $word) { $count++ };

Uses the locale, just like the iterators.


==== Iterators ====

These functions return an iterator that can be used to iterator over the text.
The return of the iterators are effected by the text's locale.

These are inspired by ICU4J's BreakIterators
(https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/BreakIterator.html)
and Intl's create*Instance methods on ''Intl\BreakIterator''
(https://www.php.net/manual/en/class.intlbreakiterator.php).

=== getCharacterIterator : \Iterator ===

Returns an Iterator that locates boundaries between logical characters.
Because of the structure of the Unicode encoding, a logical character may be
stored internally as more than one Unicode code point. (A with an umlaut may
be stored as an 'a' followed by a separate combining umlaut character, for
example, but the user still thinks of it as one character.) This iterator
allows various processes (especially text editors) to treat as characters the
units of text that a user would think of as characters, rather than the units
of text that the computer sees as "characters".

=== getWordIterator : \Iterator ===

Returns an Iterator that locates boundaries between words. This is useful
for double-click selection or "find whole words" searches. This type of
iterator makes sure there is a boundary position at the beginning and end
of each legal word. (Numbers count as words, too.) Whitespace and punctuation
are kept separate from real words. 

=== getLineIterator : \Iterator ===

Returns an Iterator that locates positions where it is legal for a text
editor to wrap lines. This is similar to word breaking, but not the same:
punctuation and whitespace are generally kept with words (you don't want a
line to start with whitespace, for example), and some special characters can
force a position to be considered a line-break position or prevent a position
from being a line-break position. 

=== getSentenceIterator : \Iterator ===

Returns an Iterator that locates boundaries between sentences.


=== getTitleIterator : \Iterator ===

Returns an Iterator that locates boundaries between title breaks. 


==== Transliteration ====

Converts text between scripts and other properties.


=== transliterate(string $transliterationString) : \Text ===

Transliterates the content of the ''Text'' object according to the rules as
specified in the ''$transliterationString''.

There are a few constants for specific and often used cases, such as creating
an ASCII transliterated version of any Text:

 - const Text::toAscii : A shortcut for a transliteration string that converts
   any script to Latin, and also strips all the accents.

 - const Text::toLatin : A shortcut for a transliteration string that converts
   any script to Latin, but does not remove the accents.

 - const Text::removeAccents : Removes the accents from a Text. A shortcut for
   the transliteration string ''"NFD; [:Nonspacing Mark:] Remove; NFC."''.

===== Implementation Details =====

The functionality as is described in this RFC is mostly implemented by using
functionality from the ICU library, which is also used by the Intl extension.

In order for PHP to continue to work on an as widest range of platforms and
distributions, the minimum ICU version will be chosen accordingly to common
Linux distributions' lowest version, which would include the version of PHP in
which this functionality is implemented.

===== Backward Incompatible Changes =====

Introducing a new ''Text'' class could impact code bases that already use this
class name. But as PHP owns the global namespace, this should not deter us
from adding such a code class.

===== Proposed PHP Version(s) =====

Next PHP 8.x

===== RFC Impact =====

There will be no impact to SAPIs, existing extensions, nor Opcache.


===== Open Issues =====

  - Add a method a like mb_strcut, to extract a string of a maximum amount of bytes from a position, as encoded through UTF-8.

===== Questions and Answers =====

==== Why is this not a composer package? ====

The goal of this RFC is that PHP users can always rely on performant text
processing capabilities.

Text processors written in PHP already exist, but suffer from performance
issues (PHP is slower than C), and are sometimes tailored to specific use
cases. By having them written in C, and utilising ICU's well tested and often
updated rules and algorithms, both the performance and correctness issues will
be addressed.

===== Future Scope =====

More methods than described in this RFC can be added in the future.

===== Proposed Voting Choices =====

Either "yes" or "no" on including the proposed class.


===== Patches and Tests =====

There is no patch yet.

===== Implementation =====

After the project is implemented, this section should contain
  - the version(s) it was merged into
  - a link to the git commit(s)
  - a link to the PHP manual entry for the feature
  - a link to the language specification section (if any)

===== References =====


===== Rejected Features =====

Nothing rejected yet.


===== Changes =====

0.9.2 — 2022-12-21

  * Tim Düsterhus: Added concat and equals methods; changed join to accept an iterator.
  * Enhance explanation of locales and collations, and standardize on using ''$collator'' as an argument name everywhere.

0.9.1 — 2022-12-16

  * Tim Düsterhus: Removed firstToTitle/wordsToTitle; added examples for toUpper and friends; added return types everywhere; added suggested other names for getPosition... methods; marked class as final.
  * Paul Crovella: Clarify which normalisation is being used.
  * Daniel Wolfe: Update trimLeft/trimRight to trimStart/trimEnd.