Re: What "Unicode support" really means

From: Lester Caine Date: Wed, 05 Mar 2014 08:29:14 +0000

Subject: Re: What "Unicode support" really means

References: 1 Groups: php.internals


Quick reply as I've got to get out to another two sick Windows machines ;)

Rowan Collins wrote:
> The push to use UTF-8 as the internal representation stems from the likelihood
> that most of these conversions will be from/to UTF-8. Conceptually, these would
> still be conversions, since error-handling and type labelling needs to happen
> *somewhere*, they would just be conversions which happen to be very efficient
> under the hood.

Since a large part of the system has already made the move to UTF-8, won't many people running native UTF-8 never actually need any conversions at all? My web systems only supply UTF-8 encoded pages these days and everything internally is stored the same way. I only hit problems where legacy Windows storage is accessed.
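To make the "still a conversion, just a cheap one" point concrete, here is a minimal sketch in Python rather than PHP. The function names `accept_utf8` and `accept_legacy` are hypothetical, and `cp1252` is only an assumed stand-in for "legacy Windows storage". Note also that Python does not keep UTF-8 as its internal representation, so this only illustrates the conceptual step (validate, then label as text), not the efficiency argument.

```python
def accept_utf8(raw: bytes) -> str:
    """Treat raw as UTF-8 input: validate it and label the result as text.

    Even when input and internal encodings match, malformed bytes have to be
    rejected somewhere, so this is still a conversion step in principle.
    """
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"input is not valid UTF-8: {exc.reason}") from exc


def accept_legacy(raw: bytes, encoding: str = "cp1252") -> str:
    """Treat raw as legacy single-byte data (encoding is an assumption)."""
    return raw.decode(encoding)


if __name__ == "__main__":
    print(accept_utf8("héllo".encode("utf-8")))  # well-formed UTF-8 passes through
    print(accept_legacy(b"\xe9"))                # same byte is fine as cp1252
    try:
        accept_utf8(b"\xe9")                     # but is not valid UTF-8 on its own
    except ValueError as err:
        print("rejected:", err)
```

The only place a real transcoding happens is `accept_legacy`; the UTF-8 path does nothing but check the bytes, which is the case described above where "no conversion" is needed in practice.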

> There are almost certainly yet more issues I haven't thought of here. Somebody
> said Unicode felt like it was designed by committee; that may be, but it's also
> complicated *because it has to be*. If you really want to support all the
> world's languages properly, there is a limit to how far you can simplify the
> problems. PHP should aim to simplify them as far as possible, and no farther.

As the one who said 'designed by committee', I'll just say that it is not necessarily a bad thing, but it does account for a lot of the complexity. Yes, much of it is needed, but rather than a preferred standard base we end up with every method being equally valid?

The first question, which still has not been answered, is whether there is a general consensus that the whole core should cleanly support UTF-8, or whether it should be restricted to a single-byte subset. Don't mix 'locale' up with this! We need an acceptable 'global' set of rules which the core works to, and a 'single byte locale' MAY turn out to be the preferred option for everyone?

Conversion is a layer outside that main core, and even things like Windows file processing fall into the outer shell rather than the main core. I'm comfortable if that main core becomes a clean 'single byte' environment, but I don't think it is the correct answer! What I currently see is UTF-8 strings, but with a 'locale' which only allows a simple 'case insensitive' mapping (no string length changes!). However, there IS still an open question over 'case insensitive', and dropping it would remove one of the straitjackets for the UTF-8 side ...
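The "no string length changes" restriction can be illustrated with a short sketch, again in Python rather than PHP. The helpers `ascii_lower`, `ascii_ci_equal` and `utf8_len` are hypothetical names for this illustration, and folding only A-Z is just one assumed reading of a "simple" mapping. It contrasts full Unicode case mapping, which can change the length of a string, with a fold that leaves every non-ASCII code point untouched.

```python
def utf8_len(s: str) -> int:
    """Number of bytes s occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


def ascii_lower(s: str) -> str:
    """Fold only A-Z to a-z; every other code point passes through unchanged,
    so the UTF-8 byte length of the result always equals that of the input."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def ascii_ci_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison under the simple ASCII-only mapping."""
    return ascii_lower(a) == ascii_lower(b)


if __name__ == "__main__":
    # Full Unicode mapping: one character becomes two.
    print(len("ß"), len("ß".upper()))
    # Full Unicode mapping: U+017F (two UTF-8 bytes) becomes 'S' (one byte).
    print(utf8_len("\u017f"), utf8_len("\u017f".upper()))
    # Simple mapping: byte length is preserved, but non-ASCII case is ignored.
    print(utf8_len("STRAßE") == utf8_len(ascii_lower("STRAßE")))
    print(ascii_ci_equal("Straße", "STRAßE"), ascii_ci_equal("É", "é"))
```

The trade-off is visible in the last line: the simple mapping never changes string length, but it also treats "É" and "é" as different, which is part of why 'case insensitive' remains an open question.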

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
