On Fri, Feb 21, 2014 at 1:04 PM, Lester Caine <[email protected]> wrote:
Pierre Joye wrote:
What do you understand by "storage"?
To have string stored as UTF-8 only, no conversion required for 99% of our
use.
I think that the first thing that needs to be agreed on is if there will be
support for UTF-8 in the core? As has already been said, in many places this
currently just works and so blocking that may be more of a problem now? The
question surely is "What is the 1% that needs some extra work?"
I think we pretty much agree already that we need UTF-8 as the base,
meaning strings are stored in UTF-8. Conversions may be needed for advanced
usages provided by ICU (or maybe not, I just do not know for sure
now).
A light library would be most appropriate for filling the gaps currently
created by use of UTF-8 strings in the core? It is not until one starts
adding the mbstring level of string processing that a more powerful library
is required. Something that simply ensures UTF-8 strings are valid and can
carry out comparisons as required?
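To make the "light" end of that scale concrete, here is a minimal sketch in Python (used only for illustration, not a proposal for the core) of the two operations mentioned: checking that a byte string is well-formed UTF-8, and comparing two UTF-8 strings. The function names are hypothetical, and the choice to normalize to NFC before comparing is an assumption; the thread does not say which notion of equality is wanted.

```python
import unicodedata


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8.

    Python's strict decoder rejects truncated sequences, overlong
    encodings and encoded surrogates.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def utf8_equal(a: bytes, b: bytes) -> bool:
    """Compare two UTF-8 byte strings after NFC normalization.

    Assumption: canonically equivalent strings should compare equal.
    Raises UnicodeDecodeError if either input is not valid UTF-8.
    """
    sa = unicodedata.normalize("NFC", a.decode("utf-8"))
    sb = unicodedata.normalize("NFC", b.decode("utf-8"))
    return sa == sb
```

If plain byte-for-byte equality is all that is required, the normalization step can be dropped and the comparison needs no Unicode data tables at all.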
It is more than only comparison. If only comparison, additions and the
like, utf8proc is enough, or librope with some additions.
Only thing putting me off utf8proc is that it only supports Unicode 5.0.0.
librope does not seem to understand any of the fine detail of the Unicode standards? What I've been looking for is the case-switching actions, and currently all I can find is ICU to handle that?
The black hole is still 'case sensitivity' and it is perhaps laying down a
'light' set of rules for this which would allow a path forward? As I have
indicated, I'd prefer simply dropping case insensitivity, but a compromise
might be to retain it where a string length does not change, and a clean
reverse transform exists? So a library that provides that comparison as part
of the core package?
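As a rough illustration of that compromise, the sketch below (Python, for illustration only) folds case only for characters whose mapping is one-to-one, round-trips cleanly, and keeps the same UTF-8 byte length; everything else is compared case-sensitively. The function names are hypothetical, and reading "string length" as UTF-8 byte length is an assumption. It relies on the case mappings built into Python's `str.lower()` and `str.upper()`.

```python
def has_simple_case_mapping(ch: str) -> bool:
    """True if ch has a length-preserving, reversible case mapping.

    Caseless characters (digits, punctuation, ...) also return True,
    since folding them is a no-op.
    """
    lower, upper = ch.lower(), ch.upper()
    # Reject mappings that expand to several code points (e.g. sharp s).
    if len(lower) != 1 or len(upper) != 1:
        return False
    if ch != lower:
        other, back = lower, lower.upper()
    elif ch != upper:
        other, back = upper, upper.lower()
    else:
        return True  # caseless character
    # The reverse transform must return the original character, and the
    # UTF-8 encoded length must not change (assumed meaning of "length").
    same_bytes = len(other.encode("utf-8")) == len(ch.encode("utf-8"))
    return back == ch and same_bytes


def light_casefold_equal(a: str, b: str) -> bool:
    """Case-insensitive compare restricted to 'simple' case pairs."""
    def fold(s: str) -> str:
        return "".join(
            c.lower() if has_simple_case_mapping(c) else c for c in s
        )
    return fold(a) == fold(b)
```

Under these rules "Hello" matches "hELLO", while "straße" does not match "STRASSE", because the sharp s expands to two characters when uppercased; the dotless "ı" is likewise left alone because its uppercase "I" lowercases back to a different character.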
I do not care much about language support for UTF-8 names for
methods, functions, variables etc. My take on it is that we should
stick to ASCII for it and be done with that. But that's only my
opinion :)
While I have no intention of using more than ASCII myself, I can see the argument for supporting use of more user-friendly names for functions and the like. I see the complaints about our current 'English' names and how they need improving, while at the same time I am dealing with customer sites where we provide simple aliases for all text in a local translation. Easy enough in a relational database where you simply select the right set of entries from a table, but not so easy for PHP ...
We may end up writing our own library for the core operations... But I
would prefer to avoid that as it is really not a trivial task.
Totally agree ... but I don't see a good path yet?
While ICU creates its own complications with its ready-bundled versions, it is by far the cleanest code for both UTF-8 and, actually, UTF-32, if one simply ditches all the UTF-16 mess. I'd much rather start from that code than any of the other libraries so far identified. In any case I don't see any option for the conversion process to and from UTF-8?
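For reference, the conversion between UTF-8 and UTF-32 is mechanically simple once UTF-16 is out of the picture, since UTF-32 is just one code point per unit. A minimal sketch in Python (illustration only; the function names are hypothetical and this is not how ICU does it):

```python
def utf8_to_code_points(data: bytes) -> list:
    """Decode UTF-8 bytes into a list of code points (UTF-32 values).

    Raises UnicodeDecodeError on malformed input.
    """
    return [ord(c) for c in data.decode("utf-8")]


def code_points_to_utf8(code_points) -> bytes:
    """Encode a sequence of code points back into UTF-8 bytes.

    Raises UnicodeEncodeError for surrogate code points and
    ValueError for values outside the Unicode range.
    """
    return "".join(chr(cp) for cp in code_points).encode("utf-8")
```

The round trip is lossless for any valid input; the cost is the four bytes per character that UTF-32 needs while a string is held in that form.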