Unicode Support
Author: Pierre Joye
Status: Under discussion
Unicode support remains one of the most requested features in PHP.
However, as Rasmus and others stated earlier, it is not a trivial job. Some of the key points we need to take care of are:
- UTF-8 storage
- UTF-8 support for almost all (if not all) existing string APIs
- Performance
As of today, I have not found any library covering at least two of these key points.
Please keep in mind that I am by no means a Unicode expert, and this summary is what I gathered by reading the documentation and discussion archives of ICU and other projects. Experiments still have to be done. However, I would rather discuss the options before going wild with an implementation (a huge task, even for basic feature coverage).
If any of the following statements is wrong or inaccurate, please fix it. I will keep a dedicated wiki page to summarize the discussions and options about Unicode support.
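To make the second and third points concrete, here is a minimal C sketch of why byte-oriented string functions cannot simply be reused on UTF-8 data. The names utf8_codepoint_count and utf8_byte_offset are hypothetical (they are not PHP or ICU functions), and the code assumes well-formed UTF-8 input without validating it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count code points in a UTF-8 buffer by counting every byte that is
 * not a continuation byte (bit pattern 10xxxxxx).
 * Assumption: the input is well-formed UTF-8; nothing is validated.
 */
size_t utf8_codepoint_count(const char *s, size_t byte_len)
{
    size_t count = 0;
    assert(s != NULL);
    for (size_t i = 0; i < byte_len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}

/*
 * Return the byte offset of the code point at index cp_index,
 * or byte_len if the index is out of range.
 * This is a linear scan: indexing by code point is O(n), not O(1).
 */
size_t utf8_byte_offset(const char *s, size_t byte_len, size_t cp_index)
{
    size_t seen = 0;
    assert(s != NULL);
    for (size_t i = 0; i < byte_len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            if (seen == cp_index) {
                return i;
            }
            seen++;
        }
    }
    return byte_len;
}
```

For the string "h\xC3\xA9llo" ("héllo" encoded as UTF-8), strlen() reports 6 bytes while the sketch counts 5 code points, and the third code point starts at byte offset 3. Any function taking a length or an offset has to decide which of the two units it means, and the linear scan is where the performance concern comes from.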
ICU
U_CHARSET_IS_UTF8 forces ICU to use UTF-8 as its default charset. It is an ICU compile-time setting, so it cannot be set at PHP configure time. This means that users would have to create their own ICU build. Alternatively, we could bundle ICU, but this would be awkward and a maintenance nightmare for both PHP and the distros.
Alternatively, UText can be used to wrap UTF-8 strings. APIs accepting UText allow almost everything we need. However, the drawback is that a UTF-8 UText is read-only. Any operation altering its content requires duplication, cloning, or conversion. That may kill all the gains we get from using UTF-8 only.
The U_CHARSET_IS_UTF8 flag is very appealing, but bundling ICU is actually a show stopper. Asking users to build a custom ICU is not an option either. I do not know whether the distros would be ready to provide two different builds of ICU; it may cause a lot of issues for all projects using ICU.
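The cost of a read-only representation can be sketched without ICU. The following C example is an illustration only: utf8_view and utf8_view_insert are hypothetical names, not ICU types or functions, and the sketch does not reproduce how UText works internally. It only shows that when the underlying bytes cannot be changed in place, every modification has to allocate and copy the whole string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical read-only view over UTF-8 bytes (not an ICU type). */
typedef struct {
    const char *bytes; /* borrowed, never modified through the view */
    size_t len;        /* length in bytes */
} utf8_view;

/*
 * Insert ins_len bytes from ins at byte offset pos.
 * The view cannot be changed in place, so the result is a fresh
 * NUL-terminated heap buffer that the caller must free().
 * Returns NULL if the allocation fails.
 */
char *utf8_view_insert(const utf8_view *v, size_t pos,
                       const char *ins, size_t ins_len, size_t *out_len)
{
    assert(v != NULL && ins != NULL && out_len != NULL);
    assert(pos <= v->len);

    char *out = malloc(v->len + ins_len + 1);
    if (out == NULL) {
        return NULL;
    }
    memcpy(out, v->bytes, pos);                                /* prefix   */
    memcpy(out + pos, ins, ins_len);                           /* new data */
    memcpy(out + pos + ins_len, v->bytes + pos, v->len - pos); /* suffix   */
    out[v->len + ins_len] = '\0';
    *out_len = v->len + ins_len;
    return out;
}
```

Appending two bytes to a three-byte view copies all five bytes into a new buffer. On long strings or in loops, this per-edit copy is the kind of overhead that could offset the gains of staying in UTF-8.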
I have asked the ICU mailing list about this flag, here is an interesting first answer:
http://sourceforge.net/p/icu/mailman/message/32031609/
It sounds like this flag may be very useful for PHP after all.
Performance comparison using UTF-8 or UTF-16 with ICU:
utf8proc
utf8proc is very attractive: small and relatively fast. I see it as a good starting point. However, its features cover only a small part of what PHP needs. It is easy to bundle but will require a fork and a lot of work to add all the missing features.
librope
Same comments as for utf8proc, with even fewer features.
I would like to begin discussing our options now. I am not asking to get into all the implementation details from a userland point of view (like u"some text", or whether to add new APIs), but only to see what we can do internally to work with UTF-8 strings.