Re: default charset confusion
Hi
I think the following PHP 5.4.0 NEWS entry is misleading.
. Changed default value of "default_charset" php.ini option from ISO-8859-1 to
UTF-8. (Rasmus)
I took this to mean that default_charset is now UTF-8, so I was expecting
the following HTTP header:
Content-Type: text/html; charset=UTF-8
However, I got an empty charset (the 'charset=UTF-8' part was missing).
So I looked at the source and found this line in SAPI.h (line 293):
#define SAPI_DEFAULT_CHARSET ""
Shouldn't the empty string be "UTF-8"?
BTW, an empty charset in the HTTP header does not mean the default will
be ISO-8859-1; it lets the browser guess which encoding is used.
Guessing the encoding may cause XSS under certain conditions.
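To illustrate why an empty default leaves the header without a charset, here is a minimal sketch. It is not PHP's actual SAPI code; build_content_type() is a hypothetical helper I made up for this mail, and it simply appends "; charset=..." only when the configured charset is non-empty.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, NOT PHP's SAPI implementation.
 * Writes a Content-Type header value into buf and returns the
 * snprintf() result. The charset parameter is appended only when
 * a non-empty charset is supplied. */
static int build_content_type(char *buf, size_t size,
                              const char *mimetype, const char *charset)
{
    assert(buf != NULL && mimetype != NULL);

    if (charset != NULL && strlen(charset) > 0) {
        return snprintf(buf, size, "%s; charset=%s", mimetype, charset);
    }
    /* Empty or missing charset: the browser is left to guess. */
    return snprintf(buf, size, "%s", mimetype);
}
```

With a default of "" this yields plain "text/html", and with "UTF-8" it yields "text/html; charset=UTF-8", which is the header I expected.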
Anyway, I was curious so I've checked ext/standard/html.c and found
/* {{{ entity_charset determine_charset
 * returns the charset identifier based on current locale or a hint.
 * defaults to UTF-8 */
static enum entity_charset determine_charset(char *charset_hint TSRMLS_DC)
{
	int i;
	enum entity_charset charset = cs_utf_8;
	int len = 0;
	const zend_encoding *zenc;

	/* Default is now UTF-8 */
	if (charset_hint == NULL)
		return cs_utf_8;
There are 2 problems.
- php.ini's default_charset should default to UTF-8.
- determine_charset() should not blindly default to UTF-8 when there
is no hint.
The old htmlentities/htmlspecialchars actually determined the charset from
default_charset/mbstring.internal_encoding/etc. I think the old behavior
is better than the current one.
How about making determine_charset() behave as it did in 5.3 and setting
SAPI_DEFAULT_CHARSET to "UTF-8"?
Then PHP will behave as NEWS says: the default encoding for
htmlentities/htmlspecialchars becomes 'UTF-8', and users will have control
over the default htmlentities/htmlspecialchars encoding.
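The fallback I have in mind could be sketched roughly as follows. This is only an illustration, not the code in ext/standard/html.c; pick_charset() and its parameters are hypothetical names. The caller passes the explicit hint plus the configured settings (e.g. default_charset, mbstring.internal_encoding) in whatever priority order is chosen, and "UTF-8" is used only when none of them is set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT the real determine_charset().
 * Returns the explicit hint if given, otherwise the first non-empty
 * configured setting, otherwise "UTF-8" as the last resort. */
static const char *pick_charset(const char *hint,
                                const char *const *configured, size_t n)
{
    size_t i;

    assert(configured != NULL || n == 0);

    /* 1. An explicit hint from the caller always wins. */
    if (hint != NULL && strlen(hint) > 0) {
        return hint;
    }
    /* 2. Otherwise consult the configured settings in order. */
    for (i = 0; i < n; i++) {
        if (configured[i] != NULL && strlen(configured[i]) > 0) {
            return configured[i];
        }
    }
    /* 3. Nothing configured: fall back to UTF-8. */
    return "UTF-8";
}
```

This way a user who sets default_charset in php.ini gets that encoding by default, and UTF-8 applies only when nothing else is configured.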
Regards,
--
Yasuo Ohgaki
[email protected]