
Request for Comments: Alternative implementation of mbstring using ICU

Note: This RFC is consolidated into https://wiki.php.net/rfc/multibyte_char_handling

Introduction

This RFC discusses an alternative implementation of the mbstring extension that uses ICU instead of libmbfl.

Rationale

Ever since its introduction in the PHP 4 series, the mbstring extension has been controversial, arguably for the following reasons:

  • LGPL license -- libmbfl (multibyte filter) and Oniguruma (multibyte regular expressions) are licensed under the LGPL. Users who compile PHP statically may run into licensing problems.
  • Lack of understanding -- Until recently, those who don't use Unicode or other multibyte character sets were slow to recognize how essential the functionality covered by this extension is.
  • Huge bundled libraries -- One of the bundled libraries, libmbfl, includes a large set of mapping tables between Unicode and legacy charsets. This may look redundant to those who aren't interested in manipulating multibyte strings.
  • Limited support for locales -- libmbfl has a setting called “NLS” that determines the defaults for several functions, but only an arbitrary set of locales is supported: Armenian, Chinese (simplified and traditional), English, German, Japanese, Korean, Russian and Turkish (yes, French is not there...).
  • Noncompliance with standards -- Character case is not handled well in the case-insensitive matches performed by mb_stripos(), mb_strripos() and so on, because libmbfl doesn't implement Unicode collation.
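
To make the last point concrete, here is a minimal sketch contrasting byte-oriented and multibyte-aware case-insensitive search. It assumes a UTF-8 source file and a stock mbstring build; the sample strings are made up for illustration. Cases that need full Unicode case folding or collation (for example matching "ß" against "SS") are deliberately left out, since their behaviour depends on the library and version in use.

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.
$haystack = "GRANDE ÉCOLE";   // illustrative sample text
$needle   = "école";

// stripos() folds case byte by byte, so the two-byte UTF-8 sequences
// for "É" and "é" never compare equal.
$bytePos = stripos($haystack, $needle);
var_dump($bytePos);           // bool(false)

// mb_stripos() works on characters and reports a character offset.
$charPos = mb_stripos($haystack, $needle, 0, "UTF-8");
var_dump($charPos);           // int(7)
```

Simple one-to-one case pairs such as "É"/"é" are the easy part; the bullet above is about the harder cases where a correct answer depends on Unicode-defined folding and locale rules.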

To overcome these issues, a complete rewrite of the extension has long been wanted, but it never materialized because there was no good Unicode library. Now that ICU is stable and we already rely on it (intl in 5.3), why not make it happen?

Preliminary stuff

The preliminary implementation is currently hosted on GitHub.

http://github.com/moriyoshi/mbstring-ng/

Implemented functions

  • mb_convert_encoding()
  • mb_detect_encoding()
  • mb_ereg()
  • mb_ereg_replace()
  • mb_internal_encoding()
  • mb_list_encodings()
  • mb_output_handler()
  • mb_parse_str()
  • mb_preferred_mime_name()
  • mb_regex_set_options()
  • mb_split()
  • mb_strcut()
  • mb_strimwidth()
  • mb_stripos()
  • mb_stristr()
  • mb_strlen()
  • mb_strpos()
  • mb_strripos()
  • mb_strrpos()
  • mb_strstr()
  • mb_strtolower()
  • mb_strtotitle()
  • mb_strtoupper()
  • mb_strwidth()
  • mb_substr()
  • mb_substr_count()
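
A short usage sketch of a few of the functions listed above, written against the documented behaviour of the existing mbstring API, which mbstring-ng aims to preserve. The sample string and the choice of encodings are assumptions for illustration; the output comments reflect stock mbstring and have not been checked against mbstring-ng.

```php
<?php
// Assumes a UTF-8 source file; the sample text is illustrative.
$text = "日本語テキスト";

// Byte length versus character length.
echo strlen($text), "\n";                             // 21 (7 characters x 3 bytes in UTF-8)
echo mb_strlen($text, "UTF-8"), "\n";                 // 7

// Character-based substring and search.
echo mb_substr($text, 0, 3, "UTF-8"), "\n";           // 日本語
echo mb_strpos($text, "テキスト", 0, "UTF-8"), "\n";  // 3

// Convert to a legacy encoding and back.
$sjis = mb_convert_encoding($text, "SJIS", "UTF-8");
echo strlen($sjis), "\n";                             // 14 (two bytes per character in Shift_JIS)
$back = mb_convert_encoding($sjis, "UTF-8", "SJIS");
echo $back === $text ? "ok" : "mismatch", "\n";       // ok
```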

Features to be implemented

  • All features that exist in mbstring will be ported to mbstring-ng unless there is technical difficulty.

Known / remaining limitations and incompatibilities

  • mb_detect_encoding() doesn't work well anymore due to the inaccuracy of ICU's encoding detection.
  • The request encoding translator now takes advantage of the SAPI filter, so the name parts of query components are no longer converted.
  • ICU's regular expression engine is not as feature-rich as Oniguruma, which results in a reduced set of options for mb_regex_set_options(). For this reason, I also extracted the regex functions from the former mbstring and repackaged them as a separate oniguruma extension.
  • The group reference placeholders for mb_ereg_replace() are now $0, $1, $2... instead of \0, \1, \2. This can be avoided if we don't use uregex_replaceAll() and implement our own replacement routine.
  • ILP64 :-P
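
The placeholder change in mb_ereg_replace() means existing replacement strings would need rewriting. One way to bridge the gap in userland is to translate the old backslash references into the dollar form before calling the function. The helper below is a hypothetical sketch: the name legacy_refs_to_dollar() is not part of either extension, it handles only single-digit references \0 to \9, and it assumes ICU-style replacement text in which a backslash escapes the following character, so literal "$" and "\" get a backslash prefix.

```php
<?php
/**
 * Hypothetical helper: rewrite \0..\9 group references as $0..$9.
 * Literal "$" and "\" characters are backslash-escaped so that an
 * ICU-style replacement parser does not treat them as special.
 */
function legacy_refs_to_dollar($replacement)
{
    return preg_replace_callback(
        '/\\\\([0-9])|[\\\\$]/',      // backslash+digit, or a lone "\" or "$"
        function ($m) {
            if (isset($m[1]) && $m[1] !== '') {
                return '$' . $m[1];   // \1 becomes $1
            }
            return '\\' . $m[0];      // escape the literal character
        },
        $replacement
    );
}

echo legacy_refs_to_dollar('<b>\1</b> (\0)'), "\n";  // <b>$1</b> ($0)
echo legacy_refs_to_dollar('price: $\1'), "\n";      // price: \$$1
```

A shim like this keeps old call sites working at the cost of one extra pass over each replacement string; implementing the replacement loop inside the extension, as the bullet suggests, would remove the incompatibility altogether.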

Proposal

Introduce mbstring-ng as an EXPERIMENTAL module, with the aim of replacing mbstring. Compiling a multibyte-aware module by default is important for eliminating vulnerabilities related to character encoding.

PHP Version

PHP 5.6 and up

VOTE

Voting has not started.

Include mbstring-ng in PHP 5.6 as an EXPERIMENTAL module
Real name Yes No
Final result: 0 0
This poll has been closed.

Changelog

  1. 2014-01-27 Yasuo Ohgaki: Updated to replace existing mbstring
  2. 2009-07-27 Moriyoshi Koizumi: Initial
rfc/altmbstring.1391308882.txt.gz · Last modified: 2017/09/22 13:28 (external edit)