Unicode string literals and casting

Date: Tue, 14 Feb 2006 11:55:42 +0000
Subject: Unicode string literals and casting
Groups: php.internals 

The Unicode support design document in README.UNICODE discusses three types of
strings, IS_UNICODE, IS_STRING, and IS_BINARY, and specifies two new casts,
(unicode) and (binary). The spec allows Unicode and string types to be
implicitly concatenated and explicitly cast to one another, while the binary
type is a black hole that requires a conversion function call to get out of.

According to the meeting notes from November, this has since been reduced to
just Unicode and binary types:
http://www.php.net/~derick/meeting-notes.html#different-string-types

I've been prodding some strings from user code to see how they react, and I'm
wondering whether they're working as intended or whether I'm just seeing side
effects of this merge not being finished yet...

Both the implicit coercions and the explicit casts seem to have vanished, and
behavior is worryingly inconsistent:

With unicode_semantics off:
* (unicode) cast fails on binary strings
* (string) converts things, including Unicode strings, to binary strings
* Binary and Unicode strings can't be concatenated.
* There's no available cast from string literals and variables to Unicode strings.

With unicode_semantics on:
* (unicode) cast fails on binary strings
* (string) behaves as (unicode), converting things to Unicode strings
* Binary and Unicode strings can't be concatenated.
* There is no available cast from Unicode string variables to binary strings.
(For literals you can use b"blah".)


This looks like a pretty painful place to be as far as writing portable
Unicode-friendly code goes, because there is no way to write Unicode literals
that will reliably work. Even if your in-code literals are all ASCII, you can't
mix them with runtime Unicode strings, because the concatenation throws a fatal
error with unicode_semantics off.

This is particularly bad if unicode_semantics can't be changed on a per-request
basis; this virtually guarantees that many hosting providers will turn it off
"for compatibility" or "for speed", and individual users won't be able to do a
darn thing about it.


Wrapping every string literal in a conditional call to unicode_decode() sounds
less than ideal; even if (unicode) casts worked, they would still be pretty ugly.
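
A minimal sketch of what such a conditional wrapper might look like. The helper
name u() is hypothetical, and two things are assumed rather than confirmed: that
the setting can be read with ini_get('unicode_semantics'), and that
unicode_decode() takes a binary string plus an encoding name and returns a
Unicode string. On a build without Unicode support the helper simply returns
the literal unchanged.

```php
<?php
// Hypothetical helper, not part of PHP: wrap a source literal so that it ends
// up as a Unicode string whichever way unicode_semantics is set.
function u($literal, $encoding = 'UTF-8')
{
    // Assumption: the setting is visible through ini_get(). ini_get() returns
    // false for settings the running build does not know about.
    $semanticsOn = (bool) ini_get('unicode_semantics');

    // With unicode_semantics on, plain literals are already Unicode strings.
    // Without unicode_decode() there is nothing to convert with, so the
    // literal is passed through untouched in both cases.
    if ($semanticsOn || !function_exists('unicode_decode')) {
        return $literal;
    }

    // Assumption: unicode_decode(binary, encoding) returns a Unicode string.
    return unicode_decode($literal, $encoding);
}

// Every literal that may meet a runtime Unicode string has to be wrapped.
$greeting = u('Hello, ') . u('world');
```

Having to write u('...') around each literal is exactly the noise described
above; the sketch only shows that the workaround is possible, not that it is
pleasant.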

I would *love* a pragma setting along the lines of declare(encoding="UTF-8") to
say "I'm going to use Unicode string literals in this file, whatever
unicode_semantics may be." Would there be any interest in supporting a mode
like this?

A Python-style modifier like u"blah" could go along with the b"blah" binary
string literal as well, though I'd rather not have to put a sigil on every string...

-- brion vibber (brion @ pobox.com)

