
Request for Comments: Alternative implementation of mbstring using ICU

Version: 1.0
Date: 2009-07-27
Author: Moriyoshi Koizumi moriyoshi@php.net
Status: Under Discussion
First Published at: http://wiki.php.net/rfc/altmbstring

Introduction

This RFC discusses an alternative implementation of the mbstring extension that uses ICU instead of libmbfl.

Rationale

Ever since its introduction in PHP 4, the mbstring extension has been controversial, arguably for the following reasons:

Lack of understanding -- Until recently, those who don't use Unicode or other multibyte character sets had little sense of how essential the functionality covered by this extension is.
Huge bundled libraries -- One of the bundled libraries, libmbfl, consists of a large set of mapping tables between Unicode and legacy character sets. This may look redundant to those who aren't interested in manipulating multibyte strings.
Limited support for locales -- libmbfl has a setting called “NLS” that determines the defaults for several functions, but only a seemingly arbitrary list of locales is supported: Armenian, Chinese (simplified and traditional), English, German, Japanese, Korean, Russian and Turkish (yes, French is not there...).
Non-compliance with standards -- Character case is not handled well in the case-insensitive matches performed by mb_stripos(), mb_strripos() and so on, because libmbfl doesn't implement Unicode collation.

To overcome these issues, a complete rewrite of the extension has long been wanted. It never materialized because no suitable Unicode library was available. Now that ICU is stable and we already rely on it (intl in 5.3), why not make it happen?

Preliminary stuff

A preliminary implementation is currently hosted on GitHub:

http://github.com/moriyoshi/mbstring-ng/

Implemented functions

mb_convert_encoding()
mb_detect_encoding()
mb_ereg()
mb_ereg_replace()
mb_internal_encoding()
mb_list_encodings()
mb_output_handler()
mb_parse_str()
mb_preferred_mime_name()
mb_regex_set_options()
mb_split()
mb_strcut()
mb_strimwidth()
mb_stripos()
mb_stristr()
mb_strlen()
mb_strpos()
mb_strripos()
mb_strrpos()
mb_strstr()
mb_strtolower()
mb_strtotitle()
mb_strtoupper()
mb_strwidth()
mb_substr()
mb_substr_count()

Removed (deprecated) functions and the reasons behind their removal

mb_check_encoding() -- Not as useful as advertised. Validating a string's encoding is the same as passing it through the converter with the same value for the input and output encoding, so just use mb_convert_encoding().
mb_convert_case() -- Use mb_strtoupper(), mb_strtolower() and mb_strtotitle()
mb_convert_kana() -- This can't be standard-compliant. In addition, part of the functionality is already covered by the Normalizer class of the intl extension, so we need to reconsider carefully what is actually needed here.
mb_convert_variables() -- This can be implemented as a script.
mb_decode_mimeheader() and mb_encode_mimeheader() -- Not standards-compliant.
mb_decode_numericentity() -- Removed in favor of html_entity_decode().
mb_encode_numericentity() -- Removed in favor of htmlentities() and htmlspecialchars().
mb_encoding_aliases() -- Just unnecessary.
mb_ereg_match() -- Use mb_ereg()
mb_ereg_search(), mb_ereg_search_getpos(), mb_ereg_search_getregs(), mb_ereg_search_init(), mb_ereg_search_pos(), mb_ereg_search_regs() and mb_ereg_search_setpos() -- I have rarely heard of a script that actively uses these functions. They rely on internal state that is not visible to users, which is likely to cause confusion when that state carries over between function calls. They need to be reimplemented as a class.
mb_eregi() -- Use mb_regex_set_options() and mb_ereg()
mb_eregi_replace() -- I wonder why this function was added in the first place, because passing the 'i' option to mb_ereg_replace() has the same effect.
mb_detect_order(), mb_get_info(), mb_http_input(), mb_http_output(), mb_language() and mb_substitute_character() -- ini_set() and ini_get() are your friends, I guess...
mb_regex_encoding() -- It is really confusing that the current mbstring allows two different encoding defaults for regex functions and the rest. Those settings are unified in the alternative version and so this is no longer necessary.
mb_send_mail() -- The behavior of this function relies on the pseudo-locale setting called “mbstring.language”, which supports only a limited set of locales. As not everyone can benefit from the function and most significant applications implement their own mail functions, I suppose this is no longer wanted.
mb_strrchr() -- Use mb_strrpos().
mb_strrichr() -- Use mb_strripos().
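
To make a few of the suggested replacements above concrete, here is a minimal userland sketch. The compat_* function names are hypothetical and not part of mbstring or of the proposed extension, and the code assumes the converter substitutes or drops invalid byte sequences, as the current libmbfl-based mbstring does; the ICU-based version may behave differently. compat_convert_variables() is deliberately simpler than mb_convert_variables(): it takes a single variable and a single source encoding and does no encoding detection.

```php
<?php
// Hypothetical userland stand-ins for some of the removed functions.
// None of the compat_* names exist in mbstring; they are for illustration only.

// mb_check_encoding(): a string is considered valid when converting it
// from an encoding to that same encoding leaves it byte-for-byte unchanged.
function compat_check_encoding($string, $encoding)
{
    return mb_convert_encoding($string, $encoding, $encoding) === $string;
}

// mb_convert_variables(): convert a string, or every string nested inside
// an array, in place. Assumes one known source encoding (no detection).
function compat_convert_variables($to, $from, &$var)
{
    if (is_array($var)) {
        foreach ($var as &$item) {
            compat_convert_variables($to, $from, $item);
        }
        unset($item); // break the reference left over from the loop
    } elseif (is_string($var)) {
        $var = mb_convert_encoding($var, $to, $from);
    }
}

// mb_strrchr(): find the last occurrence with mb_strrpos() and return
// the rest of the string from there, or false when there is no match.
function compat_strrchr($haystack, $needle, $encoding = 'UTF-8')
{
    $pos = mb_strrpos($haystack, $needle, 0, $encoding);
    if ($pos === false) {
        return false;
    }
    return mb_substr($haystack, $pos, null, $encoding);
}

// Usage
var_dump(compat_check_encoding("\xff\xfe", 'UTF-8')); // invalid UTF-8 bytes
var_dump(compat_strrchr('a/b/c', '/'));               // string(2) "/c"
```

The same pattern as compat_strrchr(), with mb_strripos() in place of mb_strrpos(), would cover mb_strrichr().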

Known / remaining limitations and incompatibilities

mb_detect_encoding() doesn't work well anymore due to the inaccuracy of ICU's encoding detection.
The request encoding translator now takes advantage of the SAPI filter, so the name parts of the query components are no longer converted.
The features supported by ICU's regular expression engine are not as rich as those of Oniguruma, which resulted in a reduced set of options for mb_regex_set_options(). Because of this, I also extracted the regex functions from the former mbstring and repackaged them as the oniguruma extension.
The group reference placeholders for mb_ereg_replace() are now $0, $1, $2... instead of \0, \1, \2. This can be avoided if we don't use uregex_replaceAll() and implement our own.
ILP64
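
For scripts affected by the changed group reference placeholders in mb_ereg_replace(), one possible stopgap is to translate old-style replacement strings before passing them in. The sketch below is only an illustration: compat_translate_backrefs() is a hypothetical helper, and it assumes that in the old syntax only a backslash followed by a single digit is special, and that in the new syntax a backslash escapes the following character (as in ICU replacement text). It does not handle a reference that is immediately followed by a literal digit.

```php
<?php
// Hypothetical helper: rewrite \0..\9 group references to $0..$9 and
// escape characters that would otherwise become special in the new syntax.
function compat_translate_backrefs($replacement)
{
    return preg_replace_callback(
        // Either a backslash followed by one digit (an old-style reference),
        // or a lone backslash or dollar sign that must be escaped.
        '/\\\\([0-9])|[\\\\$]/',
        function ($m) {
            if (isset($m[1]) && $m[1] !== '') {
                return '$' . $m[1];   // \1 becomes $1
            }
            return '\\' . $m[0];      // literal $ or \ gets a backslash prefix
        },
        $replacement
    );
}

// Usage
echo compat_translate_backrefs('[\1]'), "\n"; // prints [$1]
```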

Changelog

2009-07-27 Moriyoshi Koizumi: Initial