Re: [php6] Unicode support, options?

From: Marc Bennewitz Date: Fri, 21 Feb 2014 19:49:08 +0000

Subject: Re: [php6] Unicode support, options?

References: 1 Groups: php.internals

hi,

I have been a PHP developer for a long time but have only a little
knowledge of C/C++, so I don't know some of the deeper internals of the
engine.

From my perspective the internal datatype "string" should be a binary
string (byte array), and only in a specific context should this binary
string be interpreted as a more specialized string. As I understand it,
this is already the case about 90% of the time.

Unicode support (and other encodings) could be done as a String class,
like it's done in Java, implementing the magic method "__toString" to get
the raw binary string. We already have "(binary)" as an alias for
"(string)".
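
To make the idea concrete, here is a minimal userland sketch of such a
class. The class name Utf8String and its length() method are hypothetical
and only for illustration; a real implementation would probably live in
the engine or an extension. It assumes only the bundled PCRE functions.

```php
<?php
// Hypothetical wrapper class: the name and methods are illustrative only.
final class Utf8String
{
    private $bytes;

    public function __construct($bytes)
    {
        // Validate once, so other methods can assume well-formed UTF-8.
        // An empty pattern with the /u modifier matches only valid UTF-8.
        if (preg_match('//u', $bytes) !== 1) {
            throw new InvalidArgumentException('Not valid UTF-8');
        }
        $this->bytes = $bytes;
    }

    // Length in code points, not in bytes.
    public function length()
    {
        return preg_match_all('/./su', $this->bytes);
    }

    // The raw binary string, as suggested above.
    public function __toString()
    {
        return $this->bytes;
    }
}

$s = new Utf8String("h\xC3\xA9llo"); // "héllo" encoded as UTF-8

var_dump(strlen((string) $s)); // int(6): bytes of the raw binary string
var_dump($s->length());        // int(5): code points
```

The plain string stays a byte array everywhere, and only code that asks
for the Unicode view pays for validation and code point counting.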

This should be almost compatible with current behavior and provide a
very clean API as sugar.

Only the places where the current string type is not handled as a binary
string without context would need to be updated, like
var_dump("1e1" == "10");. But var_dump("1e1" == 10); should work as
before, because the integer type would switch the binary string into
the context of a numeric (ASCII) string.
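
For reference, a short sketch of how these comparisons behave in current
PHP. The variable names are only for illustration. Loose comparison treats
two numeric strings as numbers, which is the context-free case that would
have to change. Strict comparison already compares the raw bytes.

```php
<?php
// Both operands are numeric strings, so "==" compares them as numbers.
$bothStrings = ("1e1" == "10");
var_dump($bothStrings); // bool(true)

// The integer operand puts the string into a numeric context.
$withInteger = ("1e1" == 10);
var_dump($withInteger); // bool(true)

// Strict comparison already treats both operands as plain byte strings.
$strict = ("1e1" === "10");
var_dump($strict); // bool(false)
```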

Thoughts?

Marc

On 20.02.2014 06:54, Pierre Joye wrote:
> hi,
> 
> Unicode still remains one of the top requested features in PHP.
> 
> However, as Rasmus and others stated earlier, it is not a trivial job.
> Some of the key points we need to take care of are:
> 
> - UTF-8 storage
> - UTF-8 support for almost (if not all) existing string APIs
> - Performance
> 
> As of today, I did not find any library covering at least two of these
> key points.
> 
> Please keep in mind that I am by no means a Unicode expert, and this
> summary is what I gathered by reading the documentation and discussion
> archives of ICU and other projects. Experiments still have to be
> done. However, I would rather discuss the options before going wild
> with an implementation (a huge task, even for basic feature coverage).
> 
> If one of the following statements is wrong or inaccurate, please correct
> it. I will keep a dedicated wiki page to summarize the discussions and
> options about Unicode support.
> 
> * ICU:
> U_CHARSET_IS_UTF8 forces ICU to use UTF-8 by default. It is an
> ICU compile-time setting; it is not possible to set it at PHP
> configure time. It means that users will have to create their own
> build. Alternatively, we can bundle ICU, but this will be awkward: a
> maintenance nightmare for both PHP and the distros.
> 
> Alternatively, UText can be used to create UTF-8 strings. APIs accepting
> UText allow almost everything we need. However, the downside is that
> a UTF-8 UText is read-only. Any operation altering its content will
> require duplication, clones or conversions. That may kill all the gains
> we get from using UTF-8 only.
> 
> U_CHARSET_IS_UTF8 is very appealing, but bundling ICU is actually a
> show stopper. Asking users to custom-build ICU is not an option
> either. I do not know whether the distros will be ready to provide two
> different builds of ICU; it may cause a lot of issues for all
> projects using ICU.
> 
> * utf8proc
> utf8proc is very attractive, small and relatively fast. I see it as a
> good starting point. However, its features cover only a very small part
> of what PHP needs. It is easy to bundle but will require a fork and a lot
> of work to add all the missing features.
> 
> * librope
> Same comments as for utf8proc, with even fewer features.
> 
> I would like to start discussing our options now. I am not
> asking to get into all the implementation details from a userland point
> of view (like u"some text" or adding new APIs or not) but only to see what
> we can do internally to work with UTF-8 strings.
> 
> Thoughts, comments or ideas?
> 
> 
> 
> Links & references
> https://github.com/josephg/librope
> http://userguide.icu-project.org/strings/utf-8
> 
> 
> Cheers,
>
