Re: PHP6 wiki page

From:
Date: Sun, 16 Feb 2014 13:08:42 +0000
Subject: Re: PHP6 wiki page
Groups: php.internals
Lester Caine wrote:
> If one simply ignores the transcoding in and out, leaving the core only to handle clean UTF8 strings what non-trivial things are left? Could this be a candidate for a SOC project?

Has anybody looked at the U_CHARSET_IS_UTF8 flag in ICU? The only reference I've found to it is on the UTF-8 page of the ICU site, http://userguide.icu-project.org/strings/utf-8, but it would seem to be their attempt to remove the overhead of UTF-16 conversions when the base character set is already UTF-8. The bit I'm having trouble with is its link to UCONFIG_NO_CONVERSION, which would seem to disable any conversion filter. We still want to convert into and export from UTF-8 at the boundary with the outside world, so I don't see why that is appropriate.

U_WCHAR_IS_UTF32 would seem to simplify code-point-based activity by using UTF-32 'strings' for character-based processing. This is how I've been viewing 'character'-based string handling anyway, rather than introducing the problems UTF-16 seems to create, but I'm not sure what happens on Windows-based platforms. It seems UTF-16 is the default for the Windows API in ICU.

My simplistic view of things is that there are basically three string lengths ...
1/ Number of bytes for the buffer
2/ Number of code points ( characters + control and embellishment ? )
3/ Number of glyphs ( option to display or hide control codes as in ASCII )

But this has now been confused by the introduction of NFD/NFC/NFKC/NFKD, which will vary all of the above in some cases? Being somewhat linguistically challenged, while I understand concepts such as accents, would standardising on, say, NFD help with actions like lower/upper conversion, or does accenting a character sometimes change its alphabetical order, so that collations need the 'NFC' form to sort by?

I think that what is clear is that while there may be a single 'UTF-8' writing standard, sorting collations are even more diverse than the previous codesets. Firebird has always managed COLLATION as a separate filter to CHARACTER SET, and allows individual fields to have their own COLLATION so we can index on different languages within the one table.
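
To make the three lengths concrete, here is a rough sketch of how they differ for the same word in NFC and NFD. It uses Python's standard unicodedata module rather than ICU or anything in PHP, purely for illustration. The helper name string_lengths is made up, and the "glyph" count is only an approximation that skips combining marks; it is not a real grapheme cluster count.

```python
import unicodedata


def string_lengths(text):
    """Return (bytes, code points, approximate glyphs) for a string.

    Hypothetical helper for illustration only.
    """
    # 1/ Number of bytes needed for a UTF-8 buffer.
    byte_count = len(text.encode("utf-8"))
    # 2/ Number of code points (Python strings are code point sequences).
    code_points = len(text)
    # 3/ Approximate number of glyphs: count only code points that are
    #    not combining marks. Real grapheme segmentation has more rules.
    glyphs = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    return byte_count, code_points, glyphs


word = "caf\u00e9"  # 'cafe' with a precomposed e-acute (U+00E9)
nfc = unicodedata.normalize("NFC", word)  # keeps the precomposed form
nfd = unicodedata.normalize("NFD", word)  # splits into 'e' + U+0301

print(string_lengths(nfc))  # (5, 4, 4)
print(string_lengths(nfd))  # (6, 5, 4)
```

In this one case the byte and code point counts change with the normalization form while the approximate glyph count stays the same, and converting the NFD form back to NFC gives the original string again.
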
I'm thinking that this may be required when adding sorting in a UTF-8 based setup? Rather than specifying 'encoding', one simply specifies 'collation' where it varies from the basic rules?

--
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
