Re: [RFC] Multibyte char handling

From: Pierre Joye
Date: Mon, 20 Jan 2014 06:38:12 +0000
Subject: Re: [RFC] Multibyte char handling
Groups: php.internals
On Jan 19, 2014 11:48 PM, "Yasuo Ohgaki" <[email protected]> wrote:
>
> Hi Pierre,
>
> On Mon, Jan 20, 2014 at 1:09 AM, Pierre Joye <[email protected]> wrote:
>>
>> On Thu, Jan 16, 2014 at 11:47 PM, Yasuo Ohgaki <[email protected]> wrote:
>> > Hi Nikita,
>> >
>> > On Fri, Jan 17, 2014 at 7:38 AM, Nikita Popov <[email protected]> wrote:
>> >
>> >> No, I don't want a locale-based approach. I want the string functions
>> >> to stay as is. Multibyte variants of the functions can be added to the
>> >> multibyte extension.
>> >
>> >
>> > Creating mb_*() functions would not solve the security issues of
>> > multibyte char handling, since multibyte-aware functions are an
>> > optional feature.
>>
>> We never supported, nor claimed, that these functions are multibyte
>> safe. However, I fully understand that we should solve this
>> problem, one way or another.
>>
>> > However, it may work if PHP compiles mbstring by default and
>> > discourages use of addslashes()/var_export()/stripslashes()
>> > in favor of mb_*() variants.
>>
>> I do not think we should discourage the use of these functions, but
>> we should clearly document that the mb_* APIs are the ones to rely on
>> whenever multibyte support is required.
>>
>> I agree with the others about not adding optional arguments to the
>> existing APIs, for a couple of reasons:
>>
>> 1. It does not solve anything, as people still have to update their
>> code, and they won't unless, maybe, they read the doc/changelog.
>> 2. It is really not a clean solution.
>> 3. We already have many duplicate functions in mb; it has worked well
>> so far, and we can add the ones discussed here.
>
>
> I'll leave existing ext/standard functions alone.

:)

>> The last question was about relying on locale. This is absolutely not
>> a solution. Locale has been proven to be totally unreliable, buggy and
>> unsafe, not to mention the total lack of real POSIX locale support on
>> Windows.
>
>
> mb_escape_shell_arg()/mb_escape_shell_cmd() need a locale-based
> solution, since there is no good way to detect the terminal encoding.
> I'll make the mb versions override this behavior when the encoding is
> specified explicitly.
>

Sounds good

> On UNIXes, UTF-8 is a popular terminal encoding, but there would be
> systems using other encodings such as EUC, or even SJIS or BIG5.

Right, and as far as I remember UTF-8 does not suffer from this problem.
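
To illustrate the difference, here is a minimal Python sketch (not PHP's actual implementation; `bytewise_addslashes` is a hypothetical stand-in for any escaping routine that works one byte at a time). In Shift_JIS the trail byte of some characters is 0x5C, the same value as an ASCII backslash, whereas UTF-8 only uses bytes >= 0x80 inside a multibyte sequence:

```python
def bytewise_addslashes(data: bytes) -> bytes:
    """Hypothetical stand-in for a byte-oriented escaper.

    Prefixes ', " and \\ with a backslash, one byte at a time, with no
    knowledge of the character encoding. (NUL handling is left out.)
    """
    special = b"'\"\\"
    out = bytearray()
    for byte in data:
        if byte in special:
            out.append(0x5C)  # insert an escaping backslash
        out.append(byte)
    return bytes(out)


text = "表"  # U+8868

# Shift_JIS encodes this character as 0x95 0x5C: the trail byte has
# the same value as an ASCII backslash.
sjis = text.encode("shift_jis")
sjis_escaped = bytewise_addslashes(sjis)

# UTF-8 uses only bytes >= 0x80 inside a multibyte sequence, so the
# byte-wise escaper finds nothing to touch.
utf8 = text.encode("utf-8")
utf8_escaped = bytewise_addslashes(utf8)

print(sjis.hex(), "->", sjis_escaped.hex())
print(utf8.hex(), "->", utf8_escaped.hex())
```

In the Shift_JIS case the escaper inserts an extra 0x5C before the trail byte, so the character is broken up and a stray backslash is left over that can swallow whatever follows it. The UTF-8 bytes pass through unchanged.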

> Windows uses different terminal encodings depending on the locale,
> so it's much more complex.
>

Let me provide a function to detect it, but we need something to normalize
the names. Do we have such a thing in mbstring?

> This is the reason why I would use locale. However, this implementation
> is debatable.
>

Yes :)

> We could say "Users should explicitly specify terminal encoding
> by themselves". In fact, I prefer this even if I am about to implement
> mb_escape_shell_*() using locale for automatic encoding detection.
>
> It may be better to raise at least an E_NOTICE if the encoding
> parameter is omitted for mb_escape_shell_*().

Notice sounds good too.
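
A rough Python sketch of that behaviour, purely for discussion: `escape_shell_arg_mb` is a hypothetical name, the warning stands in for E_NOTICE, and the POSIX-style single-quote escaping and the locale fallback are assumptions rather than the proposed implementation.

```python
import locale
import warnings
from typing import Optional


def escape_shell_arg_mb(arg: str, encoding: Optional[str] = None) -> bytes:
    """Quote `arg` for a POSIX shell and encode it for the terminal.

    Hypothetical sketch: if `encoding` is omitted, emit a warning
    (standing in for E_NOTICE) and fall back to the locale's encoding.
    """
    if encoding is None:
        warnings.warn(
            "encoding not specified; falling back to the locale encoding",
            UserWarning,
            stacklevel=2,
        )
        encoding = locale.getpreferredencoding(False)

    # Quote on characters, not bytes, so a trail byte that happens to
    # look like a quote or backslash is never touched.
    quoted = "'" + arg.replace("'", "'\\''") + "'"
    return quoted.encode(encoding)


# Explicit encoding: no warning is raised.
print(escape_shell_arg_mb("it's", "utf-8"))
```

Because the quoting happens before encoding, the caller's choice of encoding only affects the final bytes, not which characters get escaped.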

>
>> For anything related to locale, formats or encoding, we should rely on
>> intl (ICU) and not on the system's locale. This is the only way to be
>> portable, safe and up to date.
>
>
> I agree.
> I also would like to propose
>
> https://wiki.php.net/rfc/altmbstring - ICU version of mbstring
>

Oh, very nice.

> for a future release. Most of the work has been done by Moriyoshi. We
> may try to switch to it now, but I suppose there is not enough time
> for 5.6.

What's the status? We still have some time :)

Cheers,
Pierre

