Re: PHP6 wiki page

From: Lester Caine Date: Sun, 16 Feb 2014 13:08:42 +0000

Subject: Re: PHP6 wiki page

References: 1 2 3 4 5 6 Groups: php.internals

Lester Caine wrote:
If one simply ignores the transcoding in and out, leaving the core to handle
only clean UTF-8 strings, what non-trivial things are left? Could this be a
candidate for a SoC project?

Has anybody looked at the U_CHARSET_IS_UTF8 flag in ICU?

The only reference I've found to it is on the UTF-8 page of the ICU site, http://userguide.icu-project.org/strings/utf-8, but it would seem to be their attempt to remove the overhead of UTF-16 conversions when the base character set is already UTF-8? The bit I'm having trouble with is its link to UCONFIG_NO_CONVERSION, which would seem to disable any conversion filter. We still want to convert into UTF-8 on input and export from it to the outside world, so I don't see why that is appropriate?

U_WCHAR_IS_UTF32 would seem to simplify code point based activity by using UTF-32 'strings' for character based processing. This is how I've been viewing 'character' based string handling anyway, rather than introducing the problems UTF-16 seems to create. I'm not sure what happens on Windows platforms, though, since it seems UTF-16 is the default for the Windows API in ICU.
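
To make the UTF-16 versus UTF-32 concern concrete, here is a minimal sketch in Python (not PHP or ICU code) that counts code units for the same string in both encodings. The helper names are invented for illustration, and nothing here says anything about what the ICU flags themselves do.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def utf32_units(text: str) -> int:
    """Number of 32-bit code units needed to store text as UTF-32."""
    return len(text.encode("utf-32-le")) // 4


# 'a' followed by U+1F600, a code point outside the Basic Multilingual Plane.
sample = "a\U0001F600"

code_points = len(sample)        # 2 code points
units_16 = utf16_units(sample)   # 3: U+1F600 needs a surrogate pair
units_32 = utf32_units(sample)   # 2: one unit per code point
```

In UTF-32 the unit count always equals the code point count, which is the simplification being described; in UTF-16 the two diverge as soon as a surrogate pair appears.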

My simplistic view is that there are basically three string lengths:
1/ Number of bytes for buffer
2/ Number of code points ( characters + control and embellishment ? )
3/ Number of glyphs ( option to display or hide control codes as in ASCII )
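
As a rough illustration of those three lengths, here is a Python sketch. The function names are hypothetical, and the 'glyph' count is only an approximation that skips combining marks; it is not a full grapheme cluster segmentation, and control codes are still counted as if they were displayed.

```python
import unicodedata


def byte_length(text: str) -> int:
    """1/ Number of bytes needed for a UTF-8 buffer."""
    return len(text.encode("utf-8"))


def code_point_length(text: str) -> int:
    """2/ Number of code points, including combining marks and controls."""
    return len(text)


def approx_glyph_length(text: str) -> int:
    """3/ Approximate glyph count: code points that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# 'cafe' followed by U+0301 COMBINING ACUTE ACCENT, displayed as four glyphs.
sample = "cafe\u0301"

lengths = (
    byte_length(sample),          # 6
    code_point_length(sample),    # 5
    approx_glyph_length(sample),  # 4
)
```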

But this has now been confused by the introduction of NFD/NFC/NFKC/NFKD, which will vary all of the above in some cases? I understand concepts such as accents, but I am somewhat linguistically challenged. Would standardising on, say, NFD help with actions like lower/upper conversion, or does accenting a character sometimes change its alphabetical order, so that collations need the NFC form to sort by?
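
A small Python sketch of how the normalization forms move the byte and code point counts, using the standard `unicodedata` module. The `lengths` helper is invented for illustration, and the case conversion check covers only this one character, so it should not be read as a general answer to the NFD question.

```python
import unicodedata


def lengths(text: str) -> tuple:
    """Return (UTF-8 byte count, code point count) for text."""
    return (len(text.encode("utf-8")), len(text))


composed = "\u00e9"                                  # precomposed e-acute
decomposed = unicodedata.normalize("NFD", composed)  # 'e' + U+0301

nfc_lengths = lengths(composed)    # (2, 1)
nfd_lengths = lengths(decomposed)  # (3, 2)

# Upper-casing the NFD form and recomposing gives the same result as
# upper-casing the NFC form directly, at least for this character.
upper_matches = (
    unicodedata.normalize("NFC", decomposed.upper()) == composed.upper()
)

# The compatibility forms can change the text itself: the 'fi' ligature
# U+FB01 becomes the two letters 'f' and 'i' under NFKC.
ligature = "\ufb01"
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
```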

I think what is clear is that while there may be a single 'UTF-8' writing standard, sorting collations are even more diverse than the previous codesets? Firebird has always managed COLLATION as a separate filter from CHARACTER SET, and allows individual fields to have their own COLLATION, so we can index on different languages within the one table. I'm thinking that this may be required when adding sorting in a UTF-8 based setup? Rather than specifying 'encoding', one simply specifies 'collation' where it varies from the basic rules?
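
To show collation as something separate from encoding, here is a toy Python sketch in which the same UTF-8 strings are sorted under two different rules. Both key functions are deliberately simplified assumptions (a German-style rule that sorts umlauted vowels with their base letter, and a Swedish-style alphabet that places å, ä, ö after z); they are not real ICU or Firebird collations.

```python
import unicodedata

# Hypothetical Swedish-style alphabet: a-z followed by a-ring, a-umlaut, o-umlaut.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"


def german_key(word: str) -> tuple:
    """Sort accented letters with their base letter; break ties on the original."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, word)


def swedish_key(word: str) -> list:
    """Sort by position in SWEDISH_ALPHABET; unknown characters go last."""
    composed = unicodedata.normalize("NFC", word.lower())
    return [
        SWEDISH_ALPHABET.index(ch) if ch in SWEDISH_ALPHABET else len(SWEDISH_ALPHABET)
        for ch in composed
    ]


words = ["zebra", "\u00e4pple", "apple"]  # the middle word starts with a-umlaut

german_order = sorted(words, key=german_key)    # apple, äpple, zebra
swedish_order = sorted(words, key=swedish_key)  # apple, zebra, äpple
```

The encoding is identical in both cases; only the key function changes, which is the per-field COLLATION idea in miniature.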

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
