Re: Potential RFC: mb_rawurlencode() ?

From: Rowan Tommins [IMSoP]
Date: Sat, 22 Mar 2025 15:20:15 +0000
Subject: Re: Potential RFC: mb_rawurlencode() ?
Groups: php.internals

On 21/03/2025 11:17, Tim Düsterhus wrote:

> I am not sure if that signature makes sense and if the proposed functionality fits into mbstring for that reason. IRIs are defined as UTF-8, any other encoding results in invalid output / results that are not interoperable.


This confirms a nagging feeling I had when I first saw the thread: the name "mb_rawurlencode" implies "do the same things as rawurlencode, but for multi-byte strings", but that's not what is being proposed.
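To make that concrete, here is a minimal sketch of what the existing function already does with multi-byte input. rawurlencode() works on bytes, so a UTF-8 string comes out as the percent-encoded UTF-8 byte sequence without any mbstring involvement. The sample string is my own illustration, not something from the proposal.

```php
<?php
// "é" (U+00E9) is the two bytes 0xC3 0xA9 in UTF-8.
$utf8 = "caf\u{E9}"; // "café"

// rawurlencode() has no notion of characters; it percent-encodes
// every byte outside the unreserved set (A-Z a-z 0-9 - _ . ~).
$encoded = rawurlencode($utf8);

echo $encoded, "\n"; // prints caf%C3%A9
```

So "the same thing, but for multi-byte strings" is arguably what rawurlencode() gives you today, as long as the input is already UTF-8.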


Notably, a similar feature is actually slated for removal; to quote https://www.php.net/manual/en/migration82.deprecated.php#migration82.deprecated.mbstring

> Usage of the QPrint, Base64, Uuencode, and HTML-ENTITIES 'text encodings' is deprecated for all MBString functions. Unlike all the other text encodings supported by MBString, these do not encode a sequence of Unicode codepoints, but rather a sequence of raw bytes. It is not clear what the correct return values for most MBString functions should be when one of these non-encodings is specified.

The same applies here: if you write mb_rawurlencode($my_string, 'SHIFT-JIS'), does that mean convert what you can to ASCII, and percent-encode the rest for a URI; or does it mean convert to UTF-8, and percent-encode as necessary for an IRI? If the input contains sequences which are not valid SHIFT-JIS, are those bytes treated as unencodable (producing errors or substitution characters), or are they directly percent-encoded?
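As a rough illustration of the ambiguity, the sketch below produces two different outputs for the same Shift-JIS input using only functions that exist today (it assumes ext/mbstring is loaded). Neither reading is claimed to be what the proposal intends; the variable names and the sample text are mine.

```php
<?php
// Build a Shift-JIS byte string for "日本" from a UTF-8 literal.
$utf8 = "\u{65E5}\u{672C}";
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');

// Reading 1: percent-encode the Shift-JIS bytes exactly as they are.
$asRawBytes = rawurlencode($sjis);

// Reading 2: transcode to UTF-8 first, then percent-encode.
$asUtf8 = rawurlencode(mb_convert_encoding($sjis, 'UTF-8', 'SJIS'));

echo $asRawBytes, "\n";
echo $asUtf8, "\n"; // prints %E6%97%A5%E6%9C%AC
```

Both results are syntactically valid percent-encoded strings, but a receiver that assumes UTF-8 will only decode the second one back to the original text.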


> The correct solution to me is to build a proper thought-through API as part of the proposed new Uri namespace and not adding new standalone functions without a clear vision.


I completely agree.

For instance, the IRI standard does include an algorithm for converting a non-Unicode IRI representation to a URI - but it requires a Unicode Normalization step, which is a complex algorithm not included in ext/standard or ext/mbstring, only ext/intl. However, a function in the URI namespace that only handled the UTF-8 input case might still be useful.
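A minimal sketch of what a UTF-8-only helper could look like is below. The function name encode_utf8_component() is hypothetical; it is not an existing function and not part of any RFC. It rejects malformed UTF-8 and deliberately skips normalization, which is an assumption on my part that the caller has already supplied the code points they want encoded.

```php
<?php
/**
 * Hypothetical helper: percent-encode one component, accepting only
 * well-formed UTF-8. Not an existing or proposed PHP function.
 */
function encode_utf8_component(string $value): string
{
    // With the /u modifier, preg_match() returns false on malformed
    // UTF-8; the empty pattern matches any well-formed string.
    if (preg_match('//u', $value) !== 1) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // Assumption: no normalization is applied here, so nothing from
    // ext/intl is needed. The bytes are encoded as given.
    return rawurlencode($value);
}

echo encode_utf8_component("na\u{EF}ve"), "\n"; // prints na%C3%AFve
```

Handling other input encodings would need the transcoding and normalization steps on top of this, which is where the ext/intl dependency comes in.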


> Along those lines, I think there might need to be two additional changes/additions to help with encoding for RFC 3987 and WHATWG-URL component values:
>
> - http_build_query() would need PHP_QUERY_3987 and PHP_QUERY_WHATWG flags and corresponding logic (or entirely new functions); and
> - parse_str() would need a corresponding mb_parse_str().


I haven't followed the other URI thread at all, but isn't replacing the scattered standard library functions with a consistent API the whole point of that effort?

parse_str() in particular has a non-descriptive name, and a weird function signature because it used to directly overwrite variables by name.
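For anyone who hasn't used it recently, this is roughly what the signature looks like in practice. The query string is just an example of mine.

```php
<?php
// parse_str() returns nothing; the parsed data is written into the
// second argument by reference. Since PHP 8.0 that argument is
// mandatory; before that, omitting it created variables in the
// calling scope.
parse_str('name=caf%C3%A9&tags[]=a&tags[]=b', $result);

// $result is now ['name' => 'café', 'tags' => ['a', 'b']]
print_r($result);
```

Nothing in the name says "query string", and the by-reference output parameter is a leftover from the variable-overwriting days.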

As a comparison, we didn't extend the shuffle() function with an algorithm parameter; we added a shuffleArray() method to the new Randomizer class.
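A short sketch of that comparison, assuming PHP 8.2 or later; the seed and the sample array are arbitrary.

```php
<?php
// The engine (the "algorithm") is chosen when the object is built,
// not passed as an extra parameter to a standalone function.
$randomizer = new \Random\Randomizer(new \Random\Engine\Mt19937(1234));

$items = ['a', 'b', 'c', 'd'];

// shuffleArray() returns a new array and leaves $items untouched,
// unlike shuffle(), which reorders its argument in place.
$shuffled = $randomizer->shuffleArray($items);

print_r($shuffled);
```

The same shape would work for URI handling: the choice of standard lives in the object or class, and the methods get descriptive names.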


-- 
Rowan Tommins
[IMSoP]

Thread (6 messages)

youkidearitai, Thu, 20 Mar 2025 06:31:38 +0000
Paul M. Jones, Thu, 20 Mar 2025 16:46:46 +0000, Re: Potential RFC: mb_rawurlencode() ?
Tim Düsterhus, Fri, 21 Mar 2025 11:17:32 +0000
Rowan Tommins [IMSoP], Sat, 22 Mar 2025 15:20:15 +0000
Paul M. Jones, Sat, 22 Mar 2025 16:08:54 +0000
Rowan Tommins [IMSoP], Sun, 23 Mar 2025 12:04:10 +0000
