Re: PHP6 wiki page
Rowan Collins wrote:
If somebody comes up with an implementation proposal of Unicode strings, whether
to have a mode that doesn't use it can be discussed, but right now there doesn't
seem to be such a live proposal.
I think it is now accepted that the mistake was choosing UTF-16?
Personally I always thought it was the wrong choice, as other projects had already shown, and so it was not likely to work.
If we look at UTF-8 as a starting point, then in the large majority of places all that results is longer strings. Modern tools will just display them, and the bulk of PHP already works without a problem. I've mentioned one point already in other threads: if you are simply looking to match a string, then equal/not-equal is all that is required. The current compare also supplies an 'order' as well, but in many cases this is simply not needed. Drop the 'order' and it does not matter if the string has strange characters in it.
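The equal/not-equal point above can be sketched as follows. It is illustrated in Python rather than PHP, since the behaviour comes from UTF-8 itself and not from any particular API; the key is that byte equality decides matching, while byte order is not a linguistically meaningful sort order.

```python
# Equality of two UTF-8 strings (in the same normalization form) can be
# decided byte-for-byte, with no locale knowledge at all:
a = "café".encode("utf-8")
b = "café".encode("utf-8")
assert a == b          # plain byte compare is enough for match/no-match

# Byte order, however, is not dictionary order: 'é' encodes as 0xC3 0xA9,
# which sorts after 'z' (0x7A) byte-wise.
words = sorted(["zebra".encode("utf-8"), "éclair".encode("utf-8")])
print(words[0])        # b'zebra' comes first — not what a French sort wants
```

This is why dropping the 'order' result makes the simple case so much easier: only the collating layer ever needs to know what the bytes mean.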
The slide show recognises that converting to UTF-8 and then back again is something that simply slots in on the periphery, and so should just work. I would throw out any of the date and currency styling: that is a different problem, is already covered well, and can return Unicode strings if required! To be honest I can't see why these are bundled with the Unicode problem at all.
This just leaves 'sort' and more complex string handling. I accept that the major brick wall here is case-insensitivity, since the length of a Unicode string may well change when a case conversion is made. That is not exactly a PHP problem, but a fact of life with the languages we are dealing with. strtolower already has problems with the more complex single-byte character sets, but there is no reason it could not follow the Unicode-defined rules as a starting point.
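The length-change problem can be demonstrated in a few lines. Python is used here only as a convenient illustration; its str.upper() and str.casefold() follow the Unicode-defined rules mentioned above.

```python
# Unicode case conversion can change the number of characters, so any
# fixed-length in-place conversion scheme cannot cover the general case.
s = "straße"                       # German 'ß' has no single upper-case letter
print(len(s), len(s.upper()))      # 6 7 — 'ß' uppercases to 'SS'
assert s.upper() == "STRASSE"

# For case-insensitive matching, full case folding handles these pairs:
assert "STRASSE".casefold() == "straße".casefold()
```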
Moving to 'character handling' specifically, I have always viewed that as a place where, under the hood, the string being handled becomes UTF-32, so that we are back to looking at individual characters rather than bytes. But I am getting out of my own depth when a character is fabricated from more than one Unicode code point, the bit that Yasuo outlined earlier re NFC/NFD normalization. This is another variation on the case-insensitivity problem. If an accent is added as a separate combining character after a base character, then I would tend to default to combining them when processing a string, but that may not be correct ...
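The NFC/NFD issue Yasuo raised can be sketched concretely, again in Python via its unicodedata module: the same visible character can be one precomposed code point or a base character plus a combining accent, and a naive compare sees them as different.

```python
import unicodedata

nfc = "\u00e9"     # 'é' as a single precomposed code point (NFC form)
nfd = "e\u0301"    # 'e' followed by U+0301 COMBINING ACUTE ACCENT (NFD form)

# Both render as 'é', yet code-point (or byte) comparison calls them different:
assert nfc != nfd
assert len(nfc) == 1 and len(nfd) == 2

# Normalizing both sides to the same form makes them compare equal —
# composing (NFC) here, matching the default suggested above:
assert unicodedata.normalize("NFC", nfd) == nfc
assert unicodedata.normalize("NFD", nfc) == nfd
```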
The bottom line is, I think, that we already know what works and what does not, and we can define a 'default' simple sort, with case-insensitivity restricted to fixed-length string conversions, to get a base system. Sort routines that respect different locales are then a layer on top. But all of the groundwork for a default system does already exist: we have the filters to convert to and from UTF-8, and we have all the basic processes for handling UTF-8 strings. It is just a matter of agreeing a method of pulling it together.
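The "convert at the periphery" idea amounts to a simple pipeline, sketched here in Python with ISO-8859-1 standing in as an example legacy single-byte encoding: transcode incoming bytes to UTF-8 once, work internally, transcode back on output.

```python
# Incoming bytes in a legacy single-byte encoding ('é' is 0xE9 in ISO-8859-1):
legacy = "café".encode("iso-8859-1")       # b'caf\xe9'

# Periphery: convert once to the internal UTF-8 form.
internal = legacy.decode("iso-8859-1").encode("utf-8")
assert internal == b"caf\xc3\xa9"          # 'é' is 0xC3 0xA9 in UTF-8

# ... all string handling happens on the UTF-8 form ...

# Periphery again: round-trip back out to the legacy encoding.
assert internal.decode("utf-8").encode("iso-8859-1") == legacy
```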
--
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk