Re: [Discussion] Should All String Functions Become Multi-Byte Safe?

Date: Mon, 12 Aug 2024 10:33:47 +0000
Subject: Re: [Discussion] Should All String Functions Become Multi-Byte Safe?
Groups: php.internals
On Mon, Aug 12, 2024 at 18:54, Daniel Haber <[email protected]> wrote:
>
> On 8/12/2024 9:53 AM, Rowan Tommins [IMSoP] wrote:
> >
> >
> > On 11 August 2024 16:50:52 BST, Nick Lockheart <[email protected]> wrote:
> >> It seems that if everything on the Internet is multi-byte encoded now,
> >> then all of the PHP string functions should be multi-byte safe.
> >
> > The phrase "multibyte safe" may have made sense about 30 years ago, when it was
> > thought that a "universal character set" could just be a "wide ASCII", encoding
> > a straightforward list of characters, just more of them.
> >
> > Modern Unicode is so much more than that, because the world's writing systems
> > don't all work the same way. Should strlen() measure bytes, code points, or graphemes? Should
> > strtoupper() accept a locale, so it can handle cases like Turkish "dotless i" where
> > "I" is not the uppercase of "i"? And so on, and so on.
> >
> > I've seen plenty of languages boast that they are "Unicode aware" but few
> > actually engaging with the question of what that actually means. Often they equate
> > "character" with "code point" and stop there, which leads to results that are
> > just as useless to most of the world as if they'd equated it with "byte".
> >
> > Regards,
> > Rowan Tommins
> > [IMSoP]
>
> Feels appropriate to link to this:
> "The Absolute Minimum Every Software Developer Must Know About Unicode
> in 2023 (Still No Excuses!)"
> https://tonsky.me/blog/unicode/

Hi there,

> Feels appropriate to link to this:
> "The Absolute Minimum Every Software Developer Must Know About Unicode
> in 2023 (Still No Excuses!)"
> https://tonsky.me/blog/unicode/

I think the same as the linked article.
However, in programming there are times when you want to operate on
bytes, on code points, or on grapheme clusters.
UTF-8 alone can't solve everything; what matters is which of these the
programmer is working with (byte-level programming, character-level
programming, etc.).
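
As a rough sketch of the three levels in PHP (this assumes the mbstring
and intl extensions are loaded; the sample string is only an illustration):

```php
<?php
// Sample: "e" followed by U+0301 COMBINING ACUTE ACCENT (renders as one "é").
$s = "e\u{0301}";

$bytes      = strlen($s);              // counts bytes
$codePoints = mb_strlen($s, 'UTF-8');  // counts code points (ext/mbstring)
$graphemes  = grapheme_strlen($s);     // counts grapheme clusters (ext/intl)

printf("bytes=%d code_points=%d graphemes=%d\n", $bytes, $codePoints, $graphemes);
// prints: bytes=3 code_points=2 graphemes=1
```

None of the three numbers is "the" length; which one is right depends on
what the program is doing with the string.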

Also, other character encodings are still important, mainly for CJK text.
There are many things to consider when dealing with character sets.
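
For example, a minimal sketch (assuming ext/mbstring; the sample text and
the choice of SJIS and EUC-JP are only illustrations) showing that the same
three characters have different byte lengths depending on the encoding:

```php
<?php
// "日本語" written with escapes so the result does not depend on the
// encoding of this source file.
$utf8  = "\u{65E5}\u{672C}\u{8A9E}";
$sjis  = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');
$eucjp = mb_convert_encoding($utf8, 'EUC-JP', 'UTF-8');

$lenUtf8  = strlen($utf8);            // 9 bytes (3 per character)
$lenSjis  = strlen($sjis);            // 6 bytes (2 per character)
$lenEucjp = strlen($eucjp);           // 6 bytes (2 per character)
$chars    = mb_strlen($sjis, 'SJIS'); // 3 characters, once the encoding is given

printf("utf8=%d sjis=%d eucjp=%d chars=%d\n", $lenUtf8, $lenSjis, $lenEucjp, $chars);
// prints: utf8=9 sjis=6 eucjp=6 chars=3
```

So a function that assumes UTF-8 is not automatically safe for this kind
of data; the encoding has to be known.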

Regards
Yuya

-- 
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-----------------------------

