Quick reply, as I've got to get out to another two sick Windows machines ;)
Rowan Collins wrote:
The push to use UTF-8 as the internal representation stems from the likelihood
that most of these conversions will be from/to UTF-8. Conceptually, these would
still be conversions, since error-handling and type labelling needs to happen
*somewhere*, they would just be conversions which happen to be very efficient
under the hood.
Since a large part of the system has already made the move to UTF-8, surely many people running native UTF-8 will never actually need any conversions? My web systems only supply UTF-8 encoded pages these days, and everything is stored internally the same way; I only hit problems where legacy Windows storage is accessed.
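To illustrate the point about conversions only mattering at the legacy edge, here is a minimal sketch in Python (used only for brevity, not as a proposal for how PHP should do it). The function name `to_utf8` and the choice of Windows-1252 as the legacy encoding are assumptions for the example; real legacy storage may use a different code page.

```python
def to_utf8(raw: bytes, legacy_encoding: str = "cp1252") -> bytes:
    """Return raw as UTF-8 bytes, converting only when it is not already valid UTF-8.

    legacy_encoding is an assumption: Windows-1252 stands in for
    whatever the legacy storage actually uses.
    """
    try:
        # Fast path: data that is already valid UTF-8 passes through unchanged.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Slow path: treat the bytes as legacy-encoded and transcode them.
        return raw.decode(legacy_encoding).encode("utf-8")


# Bytes 0x93 / 0xE9 / 0x94 are curly quotes and e-acute in Windows-1252,
# but are not valid as standalone bytes in UTF-8.
legacy_bytes = b"\x93caf\xe9\x94"
converted = to_utf8(legacy_bytes)
```

Data that is already UTF-8 never takes the conversion branch, which is the case for most of a system that stores and serves UTF-8 throughout.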
There are almost certainly yet more issues I haven't thought of here. Somebody
said Unicode felt like it was designed by committee; that may be, but it's also
complicated *because it has to be*. If you really want to support all the
world's languages properly, there is a limit to how far you can simplify the
problems. PHP should aim to simplify them as far as possible, and no farther.
As the one who said 'designed by committee', I'll just say that it is not necessarily a bad thing, but it does result in a lot of the complexity. YES, there is a need for much of it, but rather than a preferred standard base, we end up with every method being equally valid?
The first question that still has not been answered is whether there is a general consensus that the whole core should cleanly support UTF-8, or be restricted to a single-byte subset. Don't mix 'locale' up with this! We need an acceptable 'global' set of rules which the core works to, for which a 'single-byte locale' MAY be the preferred option for everyone?
Conversion is a layer outside that main core, and even things like Windows file processing fall into the outer shell rather than the main core. I'm comfortable if that main core becomes a clean 'single-byte' environment, but I don't think it is the correct answer! I currently see UTF-8 strings, but with a 'locale' which only allows a simple 'case-insensitive' mapping (no string length changes!). However, there IS still an open question on 'case insensitive', and dropping that would remove one of the straitjackets for the UTF-8 side ...
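The 'no string length changes' restriction can be made concrete with a small sketch, again in Python purely for illustration. Full Unicode case mapping can change both the character count and the UTF-8 byte count; the hypothetical helper `simple_fold` below only applies a lowercase mapping when the UTF-8 byte length of the character is preserved, and otherwise leaves the character alone. This is one possible reading of a 'simple' mapping, not an agreed design.

```python
def utf8_len(s: str) -> int:
    """Number of bytes s occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


def simple_fold(s: str) -> str:
    """Lowercase each character only if its UTF-8 byte length is unchanged.

    Hypothetical helper: characters whose lowercase form would grow or
    shrink in UTF-8 are passed through untouched.
    """
    out = []
    for ch in s:
        low = ch.lower()
        out.append(low if utf8_len(low) == utf8_len(ch) else ch)
    return "".join(out)


# Full case mapping can change lengths:
sharp_s_upper = "\u00df".upper()    # sharp s uppercases to two letters, "SS"
dotless_i_upper = "\u0131".upper()  # dotless i (2 bytes) uppercases to "I" (1 byte)

# The restricted mapping never changes the byte length of the string.
folded = simple_fold("\u0130stanbul ABC")
```

With such a rule, a buffer can be case-folded in place, at the cost of leaving a handful of characters unmapped.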
--
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk