Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API

From: Ignace Nyamagana Butera Date: Mon, 31 Mar 2025 19:15:47 +0000

Subject: Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API

Groups: php.internals



On 30/03/2025 22:53, Ignace Nyamagana Butera wrote:


On 30/03/2025 14:42, Tim Düsterhus wrote:
Hi

On 2025-03-27 23:49, Ignace Nyamagana Butera wrote:
Hi Máté,

   for RFC 3986:
   https://datatracker.ietf.org/doc/html/rfc3986#section-5.3), and then
   this string is parsed and validated. Unfortunately, I recently
   realized that this approach may leave room for some kind of parsing
   confusion attack, namely when the scheme is for example "https", the
   authority is empty, and the path is "example.com". This will result
   in a https://example.com URI. I believe a similar bug is not possible
   with the rest of the components because they have their delimiters.
   So possibly some other solution will be needed, or maybe adding some
   additional validation (?).

This is not correct according to RFC3986 (https://datatracker.ietf.org/doc/html/rfc3986#section-3):

*When authority is present, the path must either be empty or begin with a slash ("/") character. When authority is not present, the path cannot begin with two slash characters ("//").*

So in your example it should throw a Uri\InvalidUriException 🙂 for RFC3986, and in the case of the WhatwgUrl algorithm it should trigger a soft error and correct the behaviour for the http(s) schemes.
This is also one of the many reasons why, at least for RFC3986, the path component can never be null, but that's another discussion. Like I said, having a fromComponents named constructor would "remove" the need for a UriBuilder (in your future section) and would IMHO be useful outside of the context of the http(s) scheme. I can understand it being left out of the current implementation, though; it might be brought back in a future improvement.

I just tested this with the implementation and it also appears to not yet be correct:

    var_dump((new Uri\Rfc3986\Uri("example.com"))->getHost()); // NULL
    var_dump((new Uri\Rfc3986\Uri("example.com"))->withScheme('https')->getHost()); // string(11) "example.com"
    var_dump((new Uri\Rfc3986\Uri("example.com"))->withScheme('https')->toRawString()); // string(19) "https://example.com"

and

    var_dump((new Uri\Rfc3986\Uri("foo/bar"))->withPath('//foo/bar')->getHost()); // string(3) "foo"
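
For illustration, the RFC 3986 rule quoted above can be sketched as a standalone check. The function name is hypothetical and not part of the proposed API; the sketch assumes that a null authority means "no authority component" while an empty string means an authority that is present but empty.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of the proposed Uri API.
 *
 * Assumption: $authority === null means the authority component is absent;
 * an empty string means it is present but empty (as in "file:///etc").
 */
function isPathValidForAuthority(?string $authority, string $path): bool
{
    if ($authority !== null) {
        // Authority present: the path must be empty or begin with "/".
        return $path === '' || $path[0] === '/';
    }

    // Authority absent: the path must not begin with "//".
    return !str_starts_with($path, '//');
}

var_dump(isPathValidForAuthority('', 'example.com'));   // bool(false)
var_dump(isPathValidForAuthority(null, '//foo/bar'));   // bool(false)
var_dump(isPathValidForAuthority('example.com', '/a')); // bool(true)
```

A wither or a component-based constructor could run a check like this before recomposing the string, so that neither withScheme('https') on "example.com" nor withPath('//foo/bar') silently produces a host.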

Best regards
Tim Düsterhus

Hi Tim and Máté, upon further inspection and verification of RFC3986, I also see an issue with the example used for normalization in your RFC. According to RFC3986 (https://www.rfc-editor.org/rfc/rfc3986.html#section-3.2.2):

   The reg-name syntax allows percent-encoded octets in order to
   represent non-ASCII registered names in a uniform way that is
   independent of the underlying name resolution technology.  Non-ASCII
   characters must first be encoded according to UTF-8 [STD63], and then
   each octet of the corresponding UTF-8 sequence must be percent-
   encoded to be represented as URI characters.  URI producing
   applications must not use percent-encoding in host unless it is used
   to represent a UTF-8 character sequence.  When a non-ASCII registered
   name represents an internationalized domain name intended for
   resolution via the DNS, the name must be transformed to the IDNA
   encoding [RFC3490] prior to name lookup.

From this we can infer that:

- Percent-encoding in the host may only be used to represent a UTF-8 character sequence, but your example uses "ex%61mple.com", which does not conform to this rule (i.e. IMHO it should throw an InvalidUriException for the Uri class). I presume that for WHATWG URL it will get correctly converted with a soft error (??).

- When available, IDNA is preferred over percent-encoded sequences.
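
Under the reading given above (which is one interpretation of the quoted paragraph, not something the sketch settles), a host check could flag percent-encoded octets that decode to plain ASCII, such as %61. The helper name below is hypothetical and not part of the proposed API.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of the proposed Uri API.
 *
 * Returns true when the host contains a percent-encoded octet below 0x80,
 * i.e. an octet that encodes an ASCII character rather than a byte of a
 * multi-byte UTF-8 sequence.
 */
function hostHasAsciiPercentEncoding(string $host): bool
{
    // Collect the two hex digits of every percent-encoded triplet.
    if (preg_match_all('/%([0-9A-Fa-f]{2})/', $host, $matches) === 0) {
        return false;
    }

    foreach ($matches[1] as $hex) {
        if (hexdec($hex) < 0x80) {
            return true;
        }
    }

    return false;
}

var_dump(hostHasAsciiPercentEncoding('ex%61mple.com'));                  // bool(true)
var_dump(hostHasAsciiPercentEncoding('%e4%bd%a0%e5%a5%bd.com'));         // bool(false)
```

Whether such a host should be rejected or simply normalized is exactly the open question in this thread; the sketch only shows how the two cases can be told apart.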

Best regards

Ignace Nyamagana Butera


Hi Máté and all,

I spotted another inconsistency in the normalization under RFC3986.

According to the RFC (https://www.rfc-editor.org/rfc/rfc3986.html#section-2.1):

   For consistency, URI producers and normalizers should use uppercase hexadecimal
   digits for all percent-encodings.

So during normalization, uppercase percent-encodings should be used for every component, which is not the case for the example in your RFC. See for instance:

    $uri = Uri\Rfc3986\Uri::parse("https://%e4%bd%a0%e5%a5%bd%e4%bd%a0%e5%a5%bd.com"); // percent-encoded form of https://你好你好.com
    echo $uri->toString(); // https://%e4%bd%a0%e5%a5%bd%e4%bd%a0%e5%a5%bd.com

The toString method should return
https://%E4%BD%A0%E5%A5%BD%E4%BD%A0%E5%A5%BD.com instead.
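
As a minimal sketch of the uppercase rule, independent of how the extension implements normalization, the hex digits of each percent-encoded triplet can be uppercased with a regular expression. The function name is hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of the proposed Uri API.
 *
 * Uppercases the hex digits of every well-formed percent-encoded triplet
 * and leaves all other characters untouched.
 */
function uppercasePercentEncoding(string $component): string
{
    return preg_replace_callback(
        '/%[0-9a-fA-F]{2}/',
        static fn (array $m): string => strtoupper($m[0]),
        $component
    ) ?? $component; // fall back to the input if the regex engine fails
}

echo uppercasePercentEncoding('%e4%bd%a0%e5%a5%bd%e4%bd%a0%e5%a5%bd.com'), "\n";
// %E4%BD%A0%E5%A5%BD%E4%BD%A0%E5%A5%BD.com
```

Only the triplets are touched, so case normalization of the rest of the component (for example lowercasing the host) would remain a separate step.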


Best regards

Ignace Nyamagana Butera


      
        
