Re: Potential RFC: mb_rawurlencode() ?

From: Rowan Tommins [IMSoP]
Date: Sat, 22 Mar 2025 15:20:15 +0000
Subject: Re: Potential RFC: mb_rawurlencode() ?
Groups: php.internals
On 21/03/2025 11:17, Tim Düsterhus wrote:
> I am not sure if that signature makes sense and if the proposed functionality fits into mbstring for that reason. IRIs are defined as UTF-8, any other encoding results in invalid output / results that are not interoperable.
This confirms a nagging feeling I had when I first saw the thread: the name "mb_rawurlencode" implies "do the same thing as rawurlencode, but for multi-byte strings", but that's not what is being proposed. Notably, a similar feature is actually slated for removal; to quote https://www.php.net/manual/en/migration82.deprecated.php#migration82.deprecated.mbstring:
> Usage of the QPrint, Base64, Uuencode, and HTML-ENTITIES 'text encodings' is deprecated for all MBString functions. Unlike all the other text encodings supported by MBString, these do not encode a sequence of Unicode codepoints, but rather a sequence of raw bytes. It is not clear what the correct return values for most MBString functions should be when one of these non-encodings is specified.
The same applies here: if you write mb_rawurlencode($my_string, 'SHIFT-JIS'), does that mean convert what you can to ASCII, and percent encode the rest for a URI; or does it mean convert to UTF-8, and percent encode as necessary for an IRI? If the input contains sequences which are not valid SHIFT-JIS, are those bytes treated as unencodable (producing errors or substitution characters), or are they directly percent encoded?
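To make the ambiguity concrete, here is a rough sketch of the two readings using only functions that exist today. Since mb_rawurlencode() itself does not exist, nothing here claims to be its behaviour; the variable names are purely illustrative, and the handling of invalid input is left out entirely.

```php
<?php
// Illustration only: spells out two possible readings of the hypothetical
// mb_rawurlencode($string, 'SHIFT-JIS') with existing functions (needs ext/mbstring).

$utf8 = "\u{65E5}\u{672C}";                           // "日本" as UTF-8 bytes
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');  // the same text as Shift-JIS bytes

// Reading 1: percent-encode the Shift-JIS bytes exactly as they are.
$legacyBytes = rawurlencode($sjis);

// Reading 2: convert to UTF-8 first, then percent-encode the UTF-8 bytes.
$viaUtf8 = rawurlencode(mb_convert_encoding($sjis, 'UTF-8', 'SJIS'));

echo $legacyBytes, "\n";  // %93%FA%96%7B
echo $viaUtf8, "\n";      // %E6%97%A5%E6%9C%AC
```

Both results are valid percent-encoded ASCII, but they identify different byte sequences, so a receiver that assumes UTF-8 will only decode the second one back to the original text.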
> The correct solution to me is to build a proper thought-through API as part of the proposed new Uri namespace and not adding new standalone functions without a clear vision.
I completely agree. For instance, the IRI standard does include an algorithm for converting a non-Unicode IRI representation to a URI - but it requires a Unicode Normalization step, which is a complex algorithm not included in ext/standard or ext/mbstring, only ext/intl. However, a function in the URI namespace that only handled the UTF-8 input case might still be useful.
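As a minimal sketch of what "UTF-8 input only" could look like for a single component value: the function name below is made up for illustration and is not part of any existing or proposed API. It assumes ext/mbstring for the validity check and deliberately skips the normalization step, which would need ext/intl.

```php
<?php
// Hypothetical helper, not an existing or proposed PHP function.
// Accepts only well-formed UTF-8 and percent-encodes one component value.
function utf8_component_to_uri(string $value): string
{
    // Refuse malformed input rather than guessing at a legacy encoding.
    if (!mb_check_encoding($value, 'UTF-8')) {
        throw new InvalidArgumentException('Input must be valid UTF-8');
    }

    // rawurlencode() percent-encodes every byte except ASCII letters,
    // digits, and "-", "_", ".", "~".
    return rawurlencode($value);
}

echo utf8_component_to_uri("caf\u{E9} au lait"), "\n"; // caf%C3%A9%20au%20lait
```

Restricting the input to UTF-8 sidesteps both the question of which encoding the bytes are in and the need for normalization.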
> Along those lines, I think there might need to be two additional changes/additions to help with encoding for RFC 3987 and WHATWG-URL component values:
>
> - http_build_query() would need PHP_QUERY_3987 and PHP_QUERY_WHATWG flags and corresponding logic (or entirely new functions); and
> - parse_str() would need a corresponding mb_parse_str().
I haven't followed the other URI thread at all, but isn't replacing the scattered standard library functions with a consistent API the whole point of that effort? parse_str() in particular has a non-descriptive name, and a weird function signature because it used to directly overwrite variables by name. As a comparison, we didn't extend the shuffle() function with an algorithm parameter; we added a shuffleArray() method to the new Randomizer class.

--
Rowan Tommins [IMSoP]
