Re: [php6] Unicode support, options?
hi,
I have been a PHP developer for a long time but have only a little
knowledge of C/C++, so I don't know some of the deeper internals of the
engine.
From my perspective the internal datatype "string" should be a binary
string (byte array), and only in a specific context should this binary
string be interpreted as a more specialized string. As far as I
understand, that is already the case about 90% of the time.
Unicode support (and other specialized string types) could be provided
by a String class, as is done in Java, that implements the magic method
"__toString" to return the raw binary string. We already have "(binary)"
as an alias for "(string)". This should be almost fully compatible with
current behavior and provide a very clean API as sugar.
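
A minimal sketch of what I mean, not a proposal for the final API: the
class name Utf8String and its length() method are made up for
illustration, and I am assuming ext/mbstring is available for the
validation and character counting.

```php
<?php
// Hypothetical wrapper: the name Utf8String and the method length() are
// illustrative only and do not exist in PHP.
final class Utf8String
{
    /** @var string raw bytes, validated as UTF-8 */
    private $bytes;

    public function __construct($bytes)
    {
        // mb_check_encoding() comes from ext/mbstring (assumed to be loaded).
        if (!mb_check_encoding($bytes, 'UTF-8')) {
            throw new InvalidArgumentException('Not a valid UTF-8 byte sequence');
        }
        $this->bytes = $bytes;
    }

    // Length in code points, as opposed to bytes.
    public function length()
    {
        return mb_strlen($this->bytes, 'UTF-8');
    }

    // Hands back the raw binary string, as the plain "string" type would.
    public function __toString()
    {
        return $this->bytes;
    }
}

$s = new Utf8String("h\xC3\xA9llo");  // "héllo", the é takes two bytes
var_dump(strlen((string) $s));        // int(6), bytes
var_dump($s->length());               // int(5), code points
```

The plain string stays a byte array everywhere; only code that opts in
to the class pays for the Unicode interpretation.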
Only the places where the current string type is not handled as a binary
string without context would need to be updated, like
var_dump("1e1" == "10");. However, var_dump("1e1" == 10); should work as
before, because the integer operand would switch the binary string into
the context of a numeric (ASCII) string.
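
For reference, this is how those comparisons behave today as far as I
know. Only the first line is the kind of case I think would need to
change; the byte-wise results in the last two lines are what a pure
binary string would give.

```php
<?php
// Both operands are numeric strings, so they are compared as numbers.
var_dump("1e1" == "10");              // bool(true)

// The integer operand puts the string into a numeric context.
var_dump("1e1" == 10);                // bool(true)

// Byte-wise comparisons never look at the numeric meaning.
var_dump("1e1" === "10");             // bool(false)
var_dump(strcmp("1e1", "10") === 0);  // bool(false)
```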
Thoughts?
Marc
On 20.02.2014 06:54, Pierre Joye wrote:
> hi,
>
> Unicode still remains one of the top requested features in PHP.
>
> However, as Rasmus and others stated earlier, it is not a trivial job.
> Some of the key points we need to take care of are:
>
> - UTF-8 storage
> - UTF-8 support for almost (if not all) existing string APIs
> - Performance
>
> As of today, I did not find any library covering at least two of these
> key points.
>
> Please keep in mind that I am by no means a Unicode expert, and this
> summary is what I gathered by reading the documentation and discussion
> archives of ICU and other projects. Experiments still have to be
> done. However, I would rather discuss the options before going wild
> with an implementation (a huge task, even for basic feature coverage).
>
> If any of the following statements is wrong or inaccurate, please
> correct it. I will keep a dedicated wiki page to summarize the
> discussions and options about Unicode support.
>
> * ICU:
> U_CHARSET_IS_UTF8 forces ICU to use UTF-8 by default. It is an ICU
> compile-time setting; it is not possible to set it at PHP configure
> time. This means that users would have to create their own ICU
> build. Alternatively we could bundle ICU, but this would be awkward and
> a maintenance nightmare for both PHP and the distros.
>
> Alternatively, UText can be used to create UTF-8 strings. APIs accepting
> UText allow almost everything we need. However, the drawback is that
> a UTF-8 UText is read-only. Any operation altering its content will
> require duplication, clones or conversions. That may kill all the gains
> we get from using UTF-8 only.
>
> U_CHARSET_IS_UTF8 is very appealing, but having to bundle ICU is
> actually a show stopper. Asking users to custom-build ICU is not an
> option either. I do not know whether the distros would be ready to
> provide two different builds of ICU; it may cause a lot of issues
> with all projects using ICU.
>
> * utf8proc
> utf8proc is very attractive, small and relatively fast. I see it as a
> good starting point. However, its features cover only a very small part
> of what PHP needs. It is easy to bundle but will require a fork and a
> lot of work to add all the missing features.
>
> * librope
> Same comments as for utf8proc, with even fewer features.
>
> I would like to begin discussing our options now. I am not
> asking to get into all the implementation details from a userland point
> of view (like u"some text" or whether to add new APIs) but only to see
> what we can do internally to work with UTF-8 strings.
>
> Thoughts, comments or ideas?
>
>
>
> Links & references
> https://github.com/josephg/librope
> http://userguide.icu-project.org/strings/utf-8
>
>
> Cheers,
>