ideas:php6:unicode
Differences
This shows you the differences between two versions of the page.
ideas:php6:unicode [2014/02/20 06:36] – created pajoye | ideas:php6:unicode [2014/02/27 06:10] – add reply from ICU pajoye | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== Unicode Support ====== | ====== Unicode Support ====== | ||
+ | Author: Pierre Joye | ||
+ | |||
+ | Status: Under discussion | ||
+ | |||
+ | Unicode remains one of the most requested features in PHP. | ||
+ | |||
+ | However, as Rasmus and others stated earlier, it is not a trivial job. | ||
+ | Some of the key points we need to take care of are: | ||
+ | |||
+ | * UTF-8 storage | ||
+ | * UTF-8 support for almost all (if not all) existing string APIs | ||
+ | * Performance | ||
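To make the first two points concrete, here is a minimal C sketch of why byte-oriented string functions are not enough once strings are stored as UTF-8. The helper name `utf8_codepoint_count` is hypothetical (it is not a PHP, ICU or utf8proc API), and the sketch assumes the input is already valid, NUL-terminated UTF-8; it does no validation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only: count the code points in a
 * NUL-terminated string that is assumed to be valid UTF-8.
 * Continuation bytes have the bit pattern 10xxxxxx, so every byte that
 * does NOT match that pattern starts a new code point. */
static size_t utf8_codepoint_count(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the string `"h" "\xC3\xA9" "llo"` ("héllo" encoded as UTF-8), `strlen()` reports the byte length while the helper reports the number of code points. Every existing string API that works in terms of offsets or lengths would need a similar distinction, which is where the performance concern comes from: counting code points this way is linear in the length of the string.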
+ | |||
+ | As of today, I have not found any library that covers at least two of these | ||
+ | key points. | ||
+ | |||
+ | Please keep in mind that I am by no means a Unicode expert, and this | ||
+ | summary is what I gathered by reading the documentation and discussion | ||
+ | archives of ICU and other projects. Experiments still have to be | ||
+ | done. However, I would rather discuss the options before going wild | ||
+ | with an implementation (a huge task, even to cover basic features). | ||
+ | |||
+ | If any of the following statements is wrong or inaccurate, please fix | ||
+ | it. I will keep a dedicated wiki page to summarize the discussions and | ||
+ | options about Unicode support. | ||
+ | |||
+ | ====== ICU ====== | ||
+ | |||
+ | U_CHARSET_IS_UTF8 makes it possible to force ICU to use UTF-8 by default. It is an | ||
+ | ICU compile-time setting; it cannot be set at PHP | ||
+ | configure time. This means that users would have to create their own | ||
+ | ICU build. Alternatively, we could bundle ICU, but this would be awkward and a | ||
+ | maintenance nightmare for both PHP and the distros. | ||
+ | |||
+ | Alternatively, UText can be used to create UTF-8 strings. APIs accepting | ||
+ | UText allow almost everything we need. The drawback, however, is that | ||
+ | a UTF-8 UText is read-only. Any operation altering its content will | ||
+ | require duplication, | ||
+ | got from using UTF-8 only. | ||
+ | |||
+ | U_CHARSET_IS_UTF8 is very appealing, but having to bundle ICU is actually a | ||
+ | show stopper. Asking users to build a custom ICU is not an option | ||
+ | either. I do not know whether the distros would be ready to provide two | ||
+ | different builds of ICU; it could cause a lot of issues for all | ||
+ | projects using ICU. | ||
+ | |||
+ | I have asked the ICU mailing list about this flag. Here is an interesting first answer: | ||
+ | |||
+ | http:// | ||
+ | |||
+ | It sounds like this flag may be very useful for PHP after all. | ||
+ | |||
+ | ====== UTF8proc ====== | ||
+ | |||
+ | utf8proc is very attractive: small and relatively fast. I see it as a | ||
+ | good starting point. However, its features cover only a very small part of | ||
+ | what PHP needs. It is easy to bundle but would require a fork and a lot | ||
+ | of work to add all the missing features. | ||
+ | |||
+ | ====== librope ====== | ||
+ | |||
+ | Same comments as for utf8proc, with even fewer features. | ||
+ | |||
+ | I would like to start discussing our options now. I am not | ||
+ | asking to get into all the implementation details from a userland point of | ||
+ | view (like u"some text", or whether to add new APIs) but only to see what | ||
+ | we can do internally to work with UTF-8 strings. | ||
====== References ====== | ====== References ====== |
ideas/php6/unicode.txt · Last modified: 2017/09/22 13:28 by 127.0.0.1