Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API

Date: Mon, 31 Mar 2025 19:15:47 +0000
Subject: Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API
Groups: php.internals

On 30/03/2025 22:53, Ignace Nyamagana Butera wrote:
On 30/03/2025 14:42, Tim Düsterhus wrote:
Hi

On 2025-03-27 23:49, Ignace Nyamagana Butera wrote:

Hi Máté,

    for RFC 3986: https://datatracker.ietf.org/doc/html/rfc3986#section-5.3), and then
    this string is parsed and validated. Unfortunately, I recently
    realized that this approach may leave room for some kind of parsing
    confusion attack, namely when the scheme is for example "https", the
    authority is empty, and the path is "example.com". This will result
    in a https://example.com URI. I believe a similar bug is not possible
    with the rest of the components because they have their delimiters.
    So possibly some other solution will be needed, or maybe adding some
    additional validation (?).

This is not correct according to RFC3986 (https://datatracker.ietf.org/doc/html/rfc3986#section-3):

    When authority is present, the path must either be empty or begin
    with a slash ("/") character. When authority is not present, the
    path cannot begin with two slash characters ("//").

So in your example it should throw a Uri\InvalidUriException 🙂 for RFC3986, and in the case of the WhatwgUrl algorithm it should trigger a soft error and correct the behaviour for the http(s) schemes.

This is also one of the many reasons why, at least for RFC3986, the path component can never be null, but that's another discussion.

Like I said, having a fromComponents named constructor would allow the "removal" of the need for a UriBuilder (in your future section) and would IMHO be useful outside of the context of the http(s) scheme. I can understand it being left out of the current implementation; it might be brought back for future improvements.

I just tested this with the implementation and it also appears to not yet be correct:
    var_dump((new Uri\Rfc3986\Uri("example.com"))->getHost()); // NULL
    var_dump((new Uri\Rfc3986\Uri("example.com"))->withScheme('https')->getHost()); // string(11) "example.com"
    var_dump((new Uri\Rfc3986\Uri("example.com"))->withScheme('https')->toRawString()); // string(19) "https://example.com"
and
    var_dump((new Uri\Rfc3986\Uri("foo/bar"))->withPath('//foo/bar')->getHost()); // string(3) "foo"
Best regards
Tim Düsterhus

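The section 3 rule quoted above can be checked before the components are recomposed as described in RFC 3986 section 5.3. The following is only a minimal sketch of that idea: `recomposeUri()` is a hypothetical helper written for illustration, it is not part of the proposed Uri extension, and it assumes a `null` authority means "not present" while an empty string means "present but empty".

```php
<?php

/**
 * Hypothetical helper: recompose a URI from its components (RFC 3986,
 * section 5.3) after rejecting the authority/path combinations that
 * section 3 forbids. Not part of the proposed Uri extension.
 */
function recomposeUri(
    ?string $scheme,
    ?string $authority,
    string $path,
    ?string $query = null,
    ?string $fragment = null
): string {
    // Authority present: the path must be empty or begin with "/".
    if ($authority !== null && $path !== '' && $path[0] !== '/') {
        throw new InvalidArgumentException('With an authority, the path must be empty or begin with "/".');
    }

    // Authority absent: the path cannot begin with "//".
    if ($authority === null && str_starts_with($path, '//')) {
        throw new InvalidArgumentException('Without an authority, the path cannot begin with "//".');
    }

    $uri = '';
    if ($scheme !== null) {
        $uri .= $scheme . ':';
    }
    if ($authority !== null) {
        $uri .= '//' . $authority;
    }
    $uri .= $path;
    if ($query !== null) {
        $uri .= '?' . $query;
    }
    if ($fragment !== null) {
        $uri .= '#' . $fragment;
    }

    return $uri;
}

// No authority: "example.com" stays a path and is not promoted to a host.
echo recomposeUri('https', null, 'example.com'), "\n"; // https:example.com
```

Under these assumptions, an empty authority combined with the path "example.com" throws instead of yielding https://example.com, and the path "//foo/bar" without an authority throws as well.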
Hi Tim and Máté,

Upon further inspection and verification of RFC3986, I also see an issue with the example used for normalization in the RFC. According to RFC3986 (https://www.rfc-editor.org/rfc/rfc3986.html#section-3.2.2):

    The reg-name syntax allows percent-encoded octets in order to
    represent non-ASCII registered names in a uniform way that is
    independent of the underlying name resolution technology.  Non-ASCII
    characters must first be encoded according to UTF-8 [STD63], and then
    each octet of the corresponding UTF-8 sequence must be percent-
    encoded to be represented as URI characters.  URI producing
    applications must not use percent-encoding in host unless it is used
    to represent a UTF-8 character sequence.  When a non-ASCII registered
    name represents an internationalized domain name intended for
    resolution via the DNS, the name must be transformed to the IDNA
    encoding [RFC3490] prior to name lookup.

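To make the quoted passage concrete, here is a small sketch using only PHP's built-in `rawurlencode()` and `rawurldecode()`. It shows a non-ASCII label being percent-encoded octet by octet from its UTF-8 form, and what the `%61` in "ex%61mple.com" decodes to. The variable names are illustrative, and the IDNA conversion is left out because it would depend on the intl extension being available.

```php
<?php

// "你好" written with code point escapes, so the example does not depend
// on the encoding of the source file.
$label = "\u{4F60}\u{597D}";

// Each octet of the UTF-8 sequence is percent-encoded.
$encoded = rawurlencode($label);
echo $encoded, "\n"; // %E4%BD%A0%E5%A5%BD

// %61 is a single octet: the ASCII letter "a".
$decoded = rawurldecode('ex%61mple.com');
echo $decoded, "\n"; // example.com
```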
From this we can infer that:

- Percent-encoding in the host may only be used to represent a UTF-8 sequence, but your example uses "ex%61mple.com", which does not conform to this rule (i.e. IMHO it should throw an InvalidUriException for the Uri class). I presume that for WhatWg URL it will get correctly converted with a soft error (??).
- When available, IDNA is preferred to percent-encoded sequences.

Best regards
Ignace Nyamagana Butera

Hi Máté and all,

I spotted another inconsistency in the normalization under RFC3986. According to the RFC (https://www.rfc-editor.org/rfc/rfc3986.html#section-2.1):

    For consistency, URI producers and normalizers should use uppercase
    hexadecimal digits for all percent-encodings.

So during normalization, uppercase percent-encodings should be used for any component, which is not the case for the example in the RFC. See for instance:

    $uri = Uri\Rfc3986\Uri::parse("https://%e4%bd%a0%e5%a5%bd%e4%bd%a0%e5%a5%bd.com"); // percent-encoded form of https://你好你好.com
    echo $uri->toString();                             // https://%e4%bd%a0%e5%a5%bd%e4%bd%a0%e5%a5%bd.com

The toString method should return https://%E4%BD%A0%E5%A5%BD%E4%BD%A0%E5%A5%BD.com instead.

Best regards
Ignace Nyamagana Butera
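
The case normalization asked for here can be sketched independently of the extension with a single regular expression. `uppercasePercentEncoding()` is a hypothetical helper name used for illustration only; it uppercases the hex digits of every percent-encoded triplet and leaves everything else untouched.

```php
<?php

/**
 * Hypothetical helper: uppercase the hexadecimal digits of every
 * percent-encoded triplet, as RFC 3986 section 2.1 recommends for
 * URI producers and normalizers.
 */
function uppercasePercentEncoding(string $uri): string
{
    $normalized = preg_replace_callback(
        '/%[0-9A-Fa-f]{2}/',
        static fn (array $match): string => strtoupper($match[0]),
        $uri
    );

    // preg_replace_callback() returns null on failure; fall back to the input.
    return $normalized ?? $uri;
}

echo uppercasePercentEncoding('https://%e4%bd%a0%e5%a5%bd%e4%bd%a0%e5%a5%bd.com'), "\n";
// https://%E4%BD%A0%E5%A5%BD%E4%BD%A0%E5%A5%BD.com
```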
