---------- Forwarded message ---------
From: youkidearitai <[email protected]>
Date: Thu, Mar 20, 2025, 14:41
Subject: Re: [PHP-DEV] Potential RFC: mb_rawurlencode() ?
To: Paul M. Jones <[email protected]>
On Wed, Mar 19, 2025 at 2:52, Paul M. Jones <[email protected]> wrote:
>
> Hi all,
>
> The discussion around WHATWG-URL on this list, as well as my work coordinating Uri-Interop
> <https://github.com/uri-interop/interface>,
> led me to think PHP needs a multibyte equivalent of rawurlencode().
>
> Broadly speaking, as far as I can tell:
>
> - For an RFC 3986 URI, delimiters need to be percent-encoded, as well as non-ASCII characters.
> - For an RFC 3987 IRI, delimiters need to be percent-encoded, but UCS characters do not.
>
> (There are other details but I think you get the idea.)
>
> The rawurlencode() function does fine for URIs, but not for IRIs. Using rawurlencode() for an
> IRI will encode multibyte characters when it should leave them alone. For example:
>
> ```
> $val = 'fü bar';
>
> $uriPath = '/heads/' . rawurlencode($val) . '/tails/';
> assert($uriPath === '/heads/f%C3%BC%20bar/tails/'); // true
>
> $iriPath = '/heads/' . rawurlencode($val) . '/tails/';
> assert($iriPath === '/heads/fü bar/tails/'); // false
> ```
>
> (This might apply to WHATWG-URL component construction as well.)
>
> Have I missed something, either in the specs or in PHP itself?
>
> If not, how do we feel about an RFC for mb_rawurlencode()? A naive userland implementation
> might look something like the code below.
>
> Thoughts?
>
> * * *
>
> ```php
> function mb_rawurlencode(string $string) : string
> {
>     $encoded = '';
>
>     foreach (mb_str_split($string) as $char) {
>         $encoded .= match ($char) {
>             chr(0) => "%00",
>             chr(1) => "%01",
>             chr(2) => "%02",
>             chr(3) => "%03",
>             chr(4) => "%04",
>             chr(5) => "%05",
>             chr(6) => "%06",
>             chr(7) => "%07",
>             chr(8) => "%08",
>             chr(9) => "%09",
>             chr(10) => "%0A",
>             chr(11) => "%0B",
>             chr(12) => "%0C",
>             chr(13) => "%0D",
>             chr(14) => "%0E",
>             chr(15) => "%0F",
>             chr(16) => "%10",
>             chr(17) => "%11",
>             chr(18) => "%12",
>             chr(19) => "%13",
>             chr(20) => "%14",
>             chr(21) => "%15",
>             chr(22) => "%16",
>             chr(23) => "%17",
>             chr(24) => "%18",
>             chr(25) => "%19",
>             chr(26) => "%1A",
>             chr(27) => "%1B",
>             chr(28) => "%1C",
>             chr(29) => "%1D",
>             chr(30) => "%1E",
>             chr(31) => "%1F",
>             chr(127) => "%7F",
>             "!" => '%21',
>             "#" => '%23',
>             "$" => '%24',
>             "%" => '%25',
>             "&" => '%26',
>             "'" => '%27',
>             "(" => '%28',
>             ")" => '%29',
>             "*" => '%2A',
>             "+" => '%2B',
>             "," => '%2C',
>             "/" => '%2F',
>             ":" => '%3A',
>             ";" => '%3B',
>             "=" => '%3D',
>             "?" => '%3F',
>             "[" => '%5B',
>             "]" => '%5D',
>             default => $char,
>         };
>     }
>
>     return $encoded;
> }
> ```
>
> * * *
>
>
> -- pmj
Hi, Paul.
I think the signature should be as follows:
```php
function mb_rawurlencode(string $string, string $encode): string {}
```
Because mbstring functions also handle encodings other than Unicode
(ISO-8859-1 through ISO-8859-16, Shift_JIS, EUC-*, etc.).
Other than that, I don't have an opinion yet.
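To illustrate why the encoding parameter matters, here is a rough sketch,
not a finished proposal. The name mb_rawurlencode_sketch(), the nullable
default for the encoding, and delegating ASCII characters to
rawurlencode() are all assumptions made for illustration; it also assumes
an ASCII-compatible encoding (it would not behave sensibly for UTF-16 or
UTF-32), and it encodes more ASCII characters than the match table quoted
above.
```php
<?php
// Hypothetical sketch; mb_rawurlencode_sketch() is not an existing function.
function mb_rawurlencode_sketch(string $string, ?string $encoding = null): string
{
    // Assumption: fall back to the internal encoding, like other mb_* functions.
    $encoding ??= mb_internal_encoding();
    $encoded = '';

    // Split into characters of the given encoding, so that a trail byte of a
    // multibyte character is never mistaken for an ASCII delimiter.
    foreach (mb_str_split($string, 1, $encoding) as $char) {
        // Single-byte ASCII characters go through rawurlencode() (RFC 3986);
        // every other character is passed through unchanged.
        $encoded .= (strlen($char) === 1 && ord($char) < 0x80)
            ? rawurlencode($char)
            : $char;
    }

    return $encoded;
}

// In Shift_JIS, "\x83\x5C" is one character whose second byte is 0x5C,
// the same byte value as an ASCII backslash.
$sjis = "\x83\x5C/";
var_dump(mb_rawurlencode_sketch($sjis, 'SJIS') === "\x83\x5C%2F"); // bool(true)
var_dump(rawurlencode($sjis) === "%83%5C%2F");                     // bool(true)
```
Without the encoding, the function cannot tell that 0x5C belongs to the
preceding byte, which is the case I am worried about.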
Oops, I forgot to send this to internals.
Sorry, resending it.
Yuya
--
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-----------------------------