What "Unicode support" really means

From: Rowan Collins
Date: Tue, 04 Mar 2014 22:57:48 +0000
Subject: What "Unicode support" really means
Groups: php.internals
Hi,

There's been a lot of discussion recently about how to "implement Unicode" in a future version of PHP, but the focus seems to be almost entirely on the implementation details, without a clear statement of what problem is actually being solved. I want to go back to basics and define what we mean by "Unicode support", because I fear there are aspects of the problem that aren't being considered.

0) tl;dr

- the internal representation of strings is an implementation detail, not a goal, although it might affect performance
- users often want to think in terms of "grapheme clusters", not "code points"; userland string functions need to be thought through carefully
- using UTF-8 won't make encoding/decoding problems go away; the userland API for that needs to be designed too
- Unicode (or, rather, supporting all the writing in the world) is fundamentally hard; let's not settle for a quick fix

1) Types of string

Firstly, let's be clear what PHP currently has. It does not have a type representing an ASCII string, or ISO 8859-1 "extended ASCII", or anything like that. A PHP string is what network standards sometimes call an "octet stream" - a bunch of binary with a length that's a known multiple of 8 bits. This is a useful type in its own right, and we do not want to replace it - for instance file_get_contents() can't return any kind of text if called on a JPEG file.

However, most strings are intended to represent text, and some *interpretation* of that "octet stream" is needed. Some functions do this almost by accident - strlen() returns the number of bytes in a string, which happens to be the number of characters of text if encoded at one-byte-per-character. Others make assumptions, often poorly-documented - e.g. strtoupper() and sort() - or allow explicit hints from the programmer, such as htmlentities(), or the mbstring and intl extensions.
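[Editor's aside, not part of the original post.] The byte-versus-character mismatch described above can be illustrated in Python 3, whose string model the post compares against below:

```python
# PHP's strlen() counts bytes, which only equals the number of
# characters when the text is encoded one byte per character.
# Python 3 separates the two views explicitly:

text = "no\u00ebl"           # "noël" with precomposed ë (U+00EB)
utf8_bytes = text.encode("utf-8")

print(len(text))             # 4 code points
print(len(utf8_bytes))       # 5 bytes: 'ë' takes two bytes in UTF-8
```

A byte-oriented strlen() applied to the UTF-8 form would report 5, even though the user sees 4 characters.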

What is really needed for these cases is a type which represents a string of *characters* rather than *bytes*, so that interpretation isn't up to each function, but is universally agreed. The actual internal representation of this type doesn't matter *from a user's point of view* - it might be UTF-16, as in the previous Unicode implementation, or UTF-8, as widely proposed at the moment; it might not look like a byte array at all, but some structure optimised for expected manipulations. The only concrete requirement is that it be able to represent the whole of Unicode, since that is the accepted standard covering all the languages anyone will need.

As I understand it, Python 3 has exactly this - its str type is an opaque "Unicode string", and to treat it as a series of bytes requires explicitly encoding it in some form such as UTF-8, as a binary bytes object. Perl 5 [1] (and I think Python 2) takes a "softer" approach, where strings are automatically "upgraded" to Unicode "where necessary", using defined or defaulted input and output encodings. Ruby has a rather different take on the problem [2] which it calls "m17n" ("multilingualization"), where every string carries an encoding with it, and string functions can be implemented independently for different encodings rather than all strings being converted to "one true encoding" internally.
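[Editor's aside.] A short sketch of the Python 3 model just described - str and bytes are distinct types, and crossing between them always goes through an explicit, named encoding:

```python
# str is an opaque sequence of code points; bytes is raw binary.
s = "caf\u00e9"                  # text: "café", 4 code points
b = s.encode("utf-8")            # binary: b'caf\xc3\xa9', 5 bytes

print(type(s).__name__)          # str
print(type(b).__name__)          # bytes
print(b.decode("utf-8") == s)    # True: the round trip is lossless

# Mixing the two types is an error rather than an implicit upgrade:
try:
    s + b
except TypeError:
    print("cannot concatenate str and bytes")
```

This is the opposite of the Perl 5 "soft upgrade" approach: the programmer must always say which encoding is in play.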

One other language to mention is Perl 6; like much of that language, the Unicode plan is apparently still in draft, but it has some interesting ideas, including an internal form which is *not* a standard Unicode encoding. [3]

In all cases, the aim is to have a set of functions which "do the right thing" with text strings.

2) What can we do with text strings?

What do we mean by "do the right thing"? Acting on "multi-byte characters" is a start, but Unicode's basic unit - a "code point" - isn't something most code needs to care about. To quote an official Unicode Annex [4]:

It is important to recognize that what the user thinks of as a "character"—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + acute-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.
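[Editor's aside.] The quoted distinction can be made concrete with a minimal sketch; the `graphemes` helper below is a deliberate simplification of my own that only attaches combining marks to their base character - full grapheme cluster segmentation is defined by UAX #29 [4]:

```python
import unicodedata

# "noël" written with a COMBINING DIAERESIS (U+0308): 5 code points,
# but only 4 user-perceived characters.
s = "noe\u0308l"

# Naive reversal by code points moves the diaeresis onto the 'l':
print(s[::-1])  # 'l' + U+0308 + 'eon', which renders as "l̈eon"

# Simplified grapheme grouping: glue each combining mark onto the
# code point before it. (Handles this case only, not full UAX #29.)
def graphemes(text):
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Reversing the clusters keeps the diaeresis on the 'e': "lëon"
print("".join(reversed(graphemes(s))))
```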
A nice example I came across recently [5] is reversing the string "noël", with a diaeresis on the 'e'; the expected output would normally be "lëon" (the diaeresis staying on the 'e') but reversing the code points would give something else if the diaeresis was a combining diacritic. In case you're thinking this is a normalisation problem, remember that not every combination has a composed form; that's the point of combining marks having their own code points. The Annex quoted goes on to mention much more complex examples of "grapheme clusters".

So, strrev() should probably work on grapheme clusters (by default). But what about, say, strlen()? You might want to know how many "user-perceived characters" are in the string; you might want to know how many Unicode code points you are passing to some other Unicode-aware system; or you might want to know how many bytes it will take up in UTF-8, or UTF-16. Do we supply all those options somehow? substr() might have similar requirements, but in some cases what you'd actually want to say is "trim this string down so that it will fit in $b bytes when encoded as $e (e.g. UTF-16), but ensuring that no grapheme cluster is cut in half"...

Meanwhile, other functions, like strtoupper(), or sort(), need to additionally be *locale* aware, as can be seen from the Collation support in the intl extension. Do we assume some global locale (the not-even-thread-safe setlocale()?), or do we build explicit locale support into the design of those functions?

**IMHO, how we answer these questions is more important to most users of the language than what the implementation looks like internally.**

3) Input, output, decoding, encoding, and normalisation

This seems to be the part of the problem that has got the most attention so far, so I'll just list out a few things that need considering:

- known vs unknown encodings (e.g. an HTTP POST request may specify the encoding of its body, but a %-encoded URL can represent any string of bytes)
- automatic vs manual decoding (should $_POST contain byte arrays or decoded strings?)
- implicit decoding (as used in Perl 5; is this a good idea, or does it just lead to programmers getting into trouble when their assumptions fail?)
- error handling (if a program states, or PHP assumes, that a string is in a particular encoding, but it's not valid, what should happen?)
- file access (default to binary strings and allow an encoding to be specified, or provide separate functions for "binary" vs "text" access?)
- file system access (the need to call Win32 APIs with UTF-16 arguments in Windows builds)
- database connectors (how should PDO, mysqli et al negotiate character encodings with the engine?)
- extensions in general (what facilities do they need?)

The push to use UTF-8 as the internal representation stems from the likelihood that most of these conversions will be from/to UTF-8. Conceptually, these would still be conversions, since error handling and type labelling need to happen *somewhere*; they would just be conversions which happen to be very efficient under the hood.

There are almost certainly yet more issues I haven't thought of here. Somebody said Unicode felt like it was designed by committee; that may be, but it's also complicated *because it has to be*. If you really want to support all the world's languages properly, there is a limit to how far you can simplify the problems. PHP should aim to simplify them as far as possible, and no farther.

Refs:
[1] http://perldoc.perl.org/perluniintro.html
[2] http://yokolet.blogspot.co.uk/2009/07/design-and-implementation-of-ruby-m17n.html
[3] https://raw.github.com/perl6/specs/master/S15-unicode.pod (The HTML version should be at http://perlcabal.org/syn/S15.html but appears not to be rendered atm.)
[4] http://unicode.org/reports/tr29/
[5] http://mortoray.com/2013/11/27/the-string-type-is-broken/

Regards,
--
Rowan Collins [IMSoP]
