ideas:php6:unicode

ideas:php6:unicode [2014/02/20 06:36] – created pajoye
ideas:php6:unicode [2017/09/22 13:28] (current) – external edit 127.0.0.1
 ====== Unicode Support ======
 +Author: Pierre Joye
 +
 +Status: Under discussion
 +
 +Unicode support remains one of the most requested features in PHP.
 +
 +However, as Rasmus and others stated earlier, it is not a trivial job.
 +Some of the key points we need to take care of are:
 +
 +  * UTF-8 storage
 +  * UTF-8 support for almost all (if not all) existing string APIs
 +  * Performance
 +
 +As of today, I have not found any library covering at least two of
 +these key points.
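
To make the second key point concrete, here is a minimal sketch of why byte-oriented string functions need a UTF-8 aware counterpart. The helper name `utf8_codepoint_count` is hypothetical (it is not an existing PHP or ICU function), and the sketch assumes the input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not an existing PHP or ICU function.
 * Counts code points in a NUL-terminated string that is assumed to be
 * valid UTF-8: every byte that is not a continuation byte (10xxxxxx)
 * starts a new code point. */
static size_t utf8_codepoint_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}

/* Byte length, as the existing byte-oriented APIs would report it. */
static size_t byte_length(const char *s)
{
    return strlen(s);
}
```

For a string such as "naïve", where the ï is encoded as the two bytes C3 AF, `byte_length` reports 6 while `utf8_codepoint_count` reports 5. Every existing string API that takes or returns offsets and lengths would have to decide which of the two it means.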
 +
 +Please keep in mind that I am by no means a Unicode expert, and this
 +summary is what I gathered by reading the documentation and discussion
 +archives of ICU and other projects. Experiments still have to be
 +done. However, I would rather discuss the options before going wild
 +with an implementation (a huge task, even for basic feature coverage).
 +
 +If any of the following statements is wrong or inaccurate, please fix
 +it. I will keep a dedicated wiki page to summarize the discussions and
 +options about Unicode support.
 +
 +====== ICU ======
 +
 +U_CHARSET_IS_UTF8 forces ICU to use UTF-8 as its default charset. It
 +is an ICU compile-time setting, so it cannot be set at PHP configure
 +time. This means that users would have to create their own ICU
 +build. Alternatively we could bundle ICU, but this would be awkward: a
 +maintenance nightmare for both PHP and the distros.
 +
 +Alternatively, UText can be used to create UTF-8 strings. APIs
 +accepting UText allow almost everything we need. However, the drawback
 +is that a UTF-8 UText is read-only. Any operation altering its content
 +will require duplication, clones or conversions. That may kill all the
 +gains we get from using UTF-8 only.
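
The cost of a read-only text handle can be sketched as follows. This is only an illustration under stated assumptions: `ro_utf8_view` and `ro_utf8_ascii_upper` are hypothetical names, the struct is a simplified stand-in and not ICU's UText, and the transformation handles ASCII letters only.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a read-only UTF-8 text handle.
 * This is NOT ICU's UText, only a simplified model of the constraint. */
typedef struct {
    const char *bytes; /* borrowed buffer, must not be modified */
    size_t len;        /* length in bytes */
} ro_utf8_view;

/* Upper-cases ASCII letters only; bytes >= 0x80 are copied unchanged.
 * Because the view is read-only, the result has to go into a freshly
 * allocated buffer of len + 1 bytes, which the caller must free.
 * Returns NULL if the allocation fails. */
static char *ro_utf8_ascii_upper(const ro_utf8_view *v)
{
    char *out = malloc(v->len + 1);
    if (out == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < v->len; i++) {
        unsigned char c = (unsigned char)v->bytes[i];
        out[i] = (c >= 'a' && c <= 'z') ? (char)(c - 'a' + 'A') : (char)c;
    }
    out[v->len] = '\0';
    return out;
}
```

Even this trivial modification pays for one allocation and one full copy per call, which is the kind of overhead the paragraph above is worried about.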
 +
 +The U_CHARSET_IS_UTF8 flag is very appealing, but bundling ICU is
 +actually a show stopper. Asking users to custom-build ICU is not an
 +option either. I do not know whether the distros would be ready to
 +provide two different builds of ICU; it could cause a lot of issues
 +for all projects using ICU.
 +
 +I have asked the ICU mailing list about this flag; here is an interesting first answer:
 +
 +http://sourceforge.net/p/icu/mailman/message/32031609/
 +
 +It sounds like this flag may be very useful for PHP after all.
 +
 +Performance comparison using UTF-8 or UTF-16 with ICU:
 +
 +http://site.icu-project.org/design/collation/v2/perf
 +
 +====== UTF8proc ======
 +
 +utf8proc is very attractive, small and relatively fast. I see it as a
 +good starting point. However, its features cover only a very small
 +part of what PHP needs. It is easy to bundle but would require a fork
 +and a lot of work to add all the missing features.
 +
 +====== librope ======
 +
 +Same comments as for utf8proc, with even fewer features.
 +
 +I would like to start discussing our options now. I am not asking to
 +get into all the implementation details from a userland point of view
 +(like u"some text", or whether to add new APIs) but only to see what
 +we can do internally to work with UTF-8 strings.
  
 ====== References ======
ideas/php6/unicode.1392878199.txt.gz · Last modified: 2017/09/22 13:28 (external edit)