Re: Re: [php6] Unicode support, options?

Date: Thu, 27 Feb 2014 10:28:38 +0000
Subject: Re: Re: [php6] Unicode support, options?
Groups: php.internals
On Thu, Feb 27, 2014 at 10:57 AM, Lester Caine <[email protected]> wrote:
> Pierre Joye wrote:
>>
>> On Thu, Feb 20, 2014 at 6:54 AM, Pierre Joye <[email protected]> wrote:
>>
>>> * ICU:
>>> U_CHARSET_IS_UTF8 allows forcing ICU to use UTF-8 by default. It is an
>>> ICU compile-time setting. It is not possible to set it at PHP
>>> configure time. It means that users will have to create their own
>>> build. Alternatively we can bundle ICU but this will be awkward, a
>>> maintenance nightmare for both PHP and the distros.
>>>
>>> Alternatively UText can be used to create UTF-8 strings. APIs accepting
>>> UText allow almost everything we need. However, the downside is that
>>> a UTF-8 UText is read-only. Any operation altering its content will
>>> require duplication, clones or conversions. That may kill all the gains
>>> we got from using UTF-8 only.
>>>
>>> U_CHARSET_IS_UTF8 is very appealing, but bundling ICU is actually
>>> a show stopper. Asking users to custom build ICU is not an option
>>> either. I do not know if the distros will be ready to provide two
>>> different builds of ICU; it may add a lot of issues with all
>>> projects using ICU.
>>
>>
>> Here is a 1st reply from ICU:
>>
>> http://sourceforge.net/p/icu/mailman/message/32031609/
>>
>> It sounds like this flag could be a good option for PHP's Unicode support.
>
>
> Reading between the lines, it would seem that a switch to UTF-8 base is
> their preferred path, but the core code is too engrained as UTF-16? Since
> there is really no alternative to ICU for the heavy grunt, I do see this as
> the right starting point. Any 'bells and whistles' should use the ICU UTF-8
> style rather than pulling in yet more variations?

There are optimizations when this flag is used. Not all operations are
possible using UTF-8; in these cases a conversion will be done
first.
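
To make the cost of such a conversion concrete, here is a minimal,
standalone sketch of a UTF-8 to UTF-16 pass. The helper name
utf8_to_utf16() is hypothetical and is not part of ICU or PHP; ICU has
its own routine for this (u_strFromUTF8() in unicode/ustring.h), which
is what real code would more likely call. The sketch only shows that
every converted string needs a full decoding pass and a second buffer.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Convert a UTF-8 byte string to UTF-16 code units.
 * Returns the number of UTF-16 units written, or -1 on malformed input
 * or when the destination buffer is too small.
 * Hypothetical helper for illustration only; not an ICU or PHP API.
 */
long utf8_to_utf16(const unsigned char *src, size_t src_len,
                   uint16_t *dst, size_t dst_cap)
{
    size_t i = 0, out = 0;

    while (i < src_len) {
        uint32_t cp;
        size_t need; /* number of continuation bytes expected */
        unsigned char b = src[i++];

        if (b < 0x80)                { cp = b;        need = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; need = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; need = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; need = 3; }
        else return -1; /* stray continuation byte or invalid lead byte */

        if (need > src_len - i) return -1; /* truncated sequence */
        for (size_t k = 0; k < need; k++) {
            unsigned char c = src[i++];
            if ((c & 0xC0) != 0x80) return -1; /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        /* Reject overlong forms, surrogates and out-of-range values. */
        if ((need == 1 && cp < 0x80) || (need == 2 && cp < 0x800) ||
            (need == 3 && cp < 0x10000) || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;

        if (cp < 0x10000) {
            if (out + 1 > dst_cap) return -1;
            dst[out++] = (uint16_t)cp;
        } else {
            /* Code points above the BMP need a surrogate pair. */
            if (out + 2 > dst_cap) return -1;
            cp -= 0x10000;
            dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return (long)out;
}
```

The caller has to provide the destination buffer; at most one UTF-16
unit is produced per input byte, so a buffer of src_len units is
always large enough.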

There is not much to read between the lines here :)

> The main problem in all of this is how it dovetails into Windows? The
> reliance on 'UTF-16' style WCHAR seems to be the real problem there?

wchar is not UTF-16, nor Unicode. It is something we have to deal with
no matter which road we take. Conversions between UTF-* and wchar
will be required anyway on Windows, for any *W API call.


>> Btw, I created a sub page for Unicode support:
>>
>> https://wiki.php.net/ideas/php6/unicode
>>
>>> Thoughts, comments or ideas?
>
>
> Like you Pierre I'm no Unicode expert, and digging deeper simply reinforces
> the at times irritating compromises that Unicode contains. Obviously
> designed by committee? :(
>
> Currently I'm trying to work out just what is required at the core to
> support UTF-8 and while it is not a trivial problem, the bulk of the code is
> designed to handle strings of variable length and in its basic form UTF-8
> just creates longer strings? So isn't the next question quite simply 'case'?
> And how we handle case insensitivity in the core will determine what core
> Unicode functions are required?

I do not care about case insensitivity yet, nor about Unicode
function/method/constant/etc. names. This is a secondary issue at this
stage.


>> I found another C++ library to do the basic UTF-8 operations, easl:
>>
>> https://code.google.com/p/easl/
>>
>> It could be a nice one to use in combination with ICU, small and fast
>> (1st tests).
>
>
> C++ ?

Yes, with C helpers.

> That whatever is used will need to be both tailored for PHP and transparent
> as far as ICU is concerned is, as you have identified, a given. ICU is still
> built using 32-bit string lengths ( I think? ) which does add to the fun, but
> I don't see any reason not to be using functions like compareUTF8() and
> ucasemap_utf8ToLower() from ICU in which case the strings need to be
> standard ICU UTF-8 strings? I can see the advantage of the 'fast' compare
> that I have been banging on about elsewhere, which looks for a simple match
> between two raw strings of bytes. UTF-8 only comes into that when you need
> to add 'rank'? But much of the core processing CAN simply ignore that as
> long as the generic calls don't have dead tails which activate it?

We may use our own functions (or another lib) to cover operations not
implemented in ICU or too slow because of the conversions. That's why
investigating other tools is still a good thing to do.
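
As a rough illustration of what such "own functions" could look like,
here is a small sketch of two operations that work directly on the
UTF-8 bytes, with no conversion and no ICU call. Both names
(utf8_bytes_equal, utf8_codepoint_count) are made up for this example.
The counting helper assumes the input is already well-formed UTF-8,
and the equality check is purely byte-wise: it does no normalization
and no case folding, so it is only the 'fast' compare discussed above.

```c
#include <stddef.h>
#include <string.h>

/*
 * Byte-wise equality of two length-prefixed strings.
 * No decoding, no normalization, no case folding.
 * Hypothetical helper name, for illustration only.
 */
int utf8_bytes_equal(const char *a, size_t a_len,
                     const char *b, size_t b_len)
{
    return a_len == b_len && memcmp(a, b, a_len) == 0;
}

/*
 * Count code points in a well-formed UTF-8 string by counting every
 * byte that is not a continuation byte (10xxxxxx).
 * Assumes the input has already been validated.
 */
size_t utf8_codepoint_count(const char *s, size_t len)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}
```

Anything beyond this (collation, case mapping, normalization) would
still go through ICU.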

Cheers,
Pierre

