Re: Re: [php6] Unicode support, options?

From: Lester Caine
Date: Thu, 27 Feb 2014 09:57:12 +0000
Subject: Re: Re: [php6] Unicode support, options?
Groups: php.internals
Pierre Joye wrote:
On Thu, Feb 20, 2014 at 6:54 AM, Pierre Joye <[email protected]> wrote:
* ICU: U_CHARSET_IS_UTF8 forces ICU to use UTF-8 by default. It is an ICU compile-time setting, so it is not possible to set it at PHP configure time. That means users would have to create their own ICU build. Alternatively we could bundle ICU, but that would be awkward and a maintenance nightmare for both PHP and the distros. Alternatively, UText can be used to create UTF-8 strings. The APIs accepting UText allow almost everything we need. The drawback, however, is that a UTF-8 UText is read-only: any operation altering its content will require duplication, clones or conversions. That may kill all the gains we get from using UTF-8 only. U_CHARSET_IS_UTF8 is very appealing, but having to bundle ICU is actually a show stopper. Asking users to custom-build ICU is not an option either. I do not know whether the distros would be ready to provide two different builds of ICU; it may cause a lot of issues for all projects using ICU.
Here is a first reply from ICU: http://sourceforge.net/p/icu/mailman/message/32031609/ It sounds like this flag could be a good option for PHP's Unicode support.
Reading between the lines, it would seem that a switch to a UTF-8 base is their preferred path, but the core code is too ingrained as UTF-16? Since there is really no alternative to ICU for the heavy grunt, I do see this as the right starting point. Any 'bells and whistles' should use the ICU UTF-8 style rather than pulling in yet more variations? The main problem in all of this is how it dovetails into Windows. The reliance on 'UTF-16' style WCHAR seems to be the real problem there?
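To make the UTF-8 versus UTF-16 storage difference concrete, here is a minimal sketch in Python. It is an illustration only, not PHP or ICU code, and the helper name code_units is hypothetical. It counts how many units the same text needs in each encoding, assuming little-endian UTF-16 without a BOM as a stand-in for a Windows-style WCHAR buffer.

```python
# Illustration only: the same text measured in UTF-8 bytes and UTF-16 code units.
# `code_units` is a hypothetical helper, not part of PHP or ICU.

def code_units(text: str) -> dict:
    """Return the code point count, UTF-8 byte count and UTF-16 code unit count."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # little-endian, no BOM (assumed WCHAR-like layout)
    return {
        "code_points": len(text),
        "utf8_bytes": len(utf8),
        "utf16_units": len(utf16) // 2,  # each UTF-16 code unit is 2 bytes
    }

if __name__ == "__main__":
    # ASCII, Latin-1 accent, a BMP symbol (euro sign), and a non-BMP emoji.
    for sample in ["php", "caf\u00e9", "\u20ac", "\U0001F600"]:
        print(ascii(sample), code_units(sample))
```

Neither encoding is fixed-width: UTF-8 needs one to four bytes per code point, and UTF-16 needs a surrogate pair (two units) for anything outside the Basic Multilingual Plane, so converting at the Windows boundary changes lengths in both directions.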
Btw, I created a sub page for Unicode support: https://wiki.php.net/ideas/php6/unicode
Thoughts, comments or ideas?
Like you, Pierre, I'm no Unicode expert, and digging deeper simply reinforces the at times irritating compromises that Unicode contains. Obviously designed by committee? :( Currently I'm trying to work out just what is required at the core to support UTF-8, and while it is not a trivial problem, the bulk of the code is designed to handle strings of variable length, and in its basic form UTF-8 just creates longer strings? So isn't the next question quite simply 'case'? And how we handle case insensitivity in the core will determine what core Unicode functions are required?
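As a rough illustration of why 'case' is the awkward part, here is a minimal sketch in Python. It uses Python's built-in Unicode tables rather than ICU, and the helper names (utf8_len, equal_exact, equal_caseless) are hypothetical. It shows that an exact match can be done on raw bytes, while a case-insensitive match needs Unicode case folding, and that case conversion can change the byte length of a UTF-8 string.

```python
# Illustration only: Python's built-in Unicode data stands in for ICU here.
# All helper names are hypothetical.

def utf8_len(text: str) -> int:
    """Length of the text in UTF-8 bytes."""
    return len(text.encode("utf-8"))

def equal_exact(a: str, b: str) -> bool:
    """Exact match: a plain byte-for-byte comparison of the UTF-8 forms."""
    return a.encode("utf-8") == b.encode("utf-8")

def equal_caseless(a: str, b: str) -> bool:
    """Case-insensitive match via full Unicode case folding (no normalization)."""
    return a.casefold() == b.casefold()

if __name__ == "__main__":
    upper = "\u023a"       # LATIN CAPITAL LETTER A WITH STROKE
    lower = upper.lower()  # its lowercase form lives in a different block
    print("bytes before/after lowercasing:", utf8_len(upper), utf8_len(lower))

    a, b = "Stra\u00dfe", "STRASSE"
    print("exact match:", equal_exact(a, b))
    print("caseless match:", equal_caseless(a, b))
```

The byte comparison needs no Unicode knowledge at all, which is why much of the core could ignore the encoding. The caseless comparison cannot be done in place on a fixed-size buffer, because the folded form may be longer or shorter than the original.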
I found another C++ library to do the basic UTF-8 operations, easl: https://code.google.com/p/easl/ It could be a nice one to use in combination with ICU, as it is small and fast (in first tests).
C++? That whatever is used will need to be both tailored for PHP and transparent as far as ICU is concerned is, as you have identified, a given. ICU is still built using 32-bit string lengths (I think?), which does add to the fun, but I don't see any reason not to be using functions like compareUTF8() and ucasemap_utf8ToLower() from ICU, in which case the strings need to be standard ICU UTF-8 strings?

I can see the advantage of the 'fast' compare that I have been banging on about elsewhere, which looks for a simple match between two raw strings of bytes. UTF-8 only comes into that when you need to add 'rank'? But much of the core processing CAN simply ignore that, as long as the generic calls don't have dead tails which activate it?

Given the complexity of case conversion, I can see the possible need for a mirror string holding a 'lower case' version, which may be a different length, and so 'string' could become a more complex object? But is this aspect what you are looking for the 'small fast library' to provide? easl would seem only to be trying to smooth the edges between Windows and other platforms?

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

