Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API

From: Dennis Snell Date: Wed, 05 Mar 2025 22:45:37 +0000

Subject: Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API

References: 1 2 3 4 5 6 7 Groups: php.internals

Request: Send a blank email to [email protected] to get a copy of this message

> On Feb 16, 2025, at 3:01 PM, Máté Kocsis <[email protected]> wrote:
> 
> 
> Hi Dennis,
>> 
>> I only harp on the WhatWG spec so much because for many people this will be the only one
>> they are aware of, if they are aware of any spec at all, and this is a sizable vector of attack
>> targeting servers from user-supplied content. I’m curious to hear from folks here what fraction of
>> the actual PHP code deals with RFC3986 URLs, and of those, if the systems using them are truly
>> RFC3986 systems or if the common-enough URLs are valid in both specs.
>> 
> 
> I think Ignace's examples already highlighted that the two specifications differ in
> nuances so much that even I had to admit after months of trying to squeeze them into the same
> interface that doing so would be irresponsible.
> The Uri\Rfc3986\Uri will be useful for many use cases (e.g. representing URNs or URIs with
> scheme-specific behavior - like ldap apparently), but even the UriInterface of PSR-7 can build upon
> it. On the other hand, Uri\WhatWg\Url will be useful for representing browser links and any other
> URLs for the web (e.g. an HTTP application router component should use this class).
>  
>> Just to enlighten me and possibly others with less familiarity, how and when are RFC3986
>> URLs used and what are those systems supposed to do when an invalid URL appears, such as when
>> dealing with percent-encodings as you brought up in response to Tim?
>> 
> 
> I am not 100% sure what I brought up to Tim, but certainly, the biggest difference between the
> two specs regarding percent-encoding was recently documented in the RFC: https://wiki.php.net/rfc/url_parsing_api#percent-encoding
> . The other main difference is how the host component is stored: WHATWG automatically
> percent-decodes it, while RFC3986 doesn't. This is summarized in the https://wiki.php.net/rfc/url_parsing_api#component_retrieval
> section (a bit below).
>   
>> This would be fine, knowing in hindsight that it was originally a relative path. Of course,
>> this would mean that it’s critical that `https://example.com` does not replace the actual
>> host part if one is provided in `$url`. For example, this code should work.
>> 
>> ```
>>     $url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc', 'https://example.com' );
>>     $url->domain === 'wiki.php.net'
>> ```
>> 
> 
> Yes, that's the case. Both classes only use the base URL for relative URIs.
>  
>> Hopefully this won’t be too controversial, even though the concept was new to me when I
>> started having to reliably work with URLs. I chose the example I did because of human risk factors
>> in security exploits. "xn--google.com" is not in fact a Google domain, but an IDNA domain
>> decoding to "䕮䕵䕶䕱.com".
>> 
> 
> I got your point, so I implemented your suggestion. Actually, I made yet another larger API
> change in the meantime, but in any case, the WHATWG implementation now supports IDNA in the
> following way:
> $url = Uri\WhatWg\Url::parse("https://🐘.com/🐘?🐘=🐘", null);
> 
> echo $url->getHost();                // xn--go8h.com
> 
> echo $url->getHostForDisplay();      // 🐘.com
> echo $url->toString();               // https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
> 
> echo $url->toDisplayString();        // https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
> 
> 
> Unfortunately, RFC3986 doesn't support IDNA (as Ignace already pointed out at the end
> of https://externals.io/message/126182#126184
> ), and adding support for RFC3987 (therefore IRIs) would be a very large amount of
> work; it's just not feasible within this RFC :( To make things worse, its code would have to be
> written from scratch, since I haven't found any suitable C library yet for this purpose.
> That's why I'll leave them for
> 
> 
> On other notes, let me share some of the changes since my previous message to the mailing list:
> 
> 
> - First and foremost, I removed the Uri\Rfc3986\Uri::normalize() method from the proposal after
> Arnaud's feedback. Now, both the normalized (and decoded), as well as the non-normalized
> representation can equally be retrieved from the same URI instance. This was necessary to change in
> order for users to be able to consistently use URIs. Now, if someone needs an exact URI component
> value, they can use the getRaw*() getter. If they want the normalized and percent-decoded form then
> a get*() getter should be used. For more information, the
> https://wiki.php.net/rfc/url_parsing_api#component_retrieval
> section should be consulted.
> 
> 

This seems like a good change.

> - I made a few less important API changes, like converting the WhatWgError class to an enum,
> adding a Uri\Rfc3986\IUri::getUserInfo() method, changing the return type of some getters (removing
> nullability) etc.
> 
> 

Love this.

> - I fixed quite a few smaller details of the implementation along with a very important spec
> incompatibility: until now, the "path" component didn't contain the leading
> "/" character when it should have. Now, both classes conform to their respective
> specifications with regard to path handling.
> 
> 

This is a late thought, and surely amenable to a later RFC, but I was thinking about the get/set
path methods and the issue of the / and %2F.

 - If we exposed getPathIterator() or getPathSegments() could we not
report these in their fully-decoded forms? That is, because the path segments are separated by some
invocation or array element, they could be decoded?
 - Probably more valuably, if withPath() accepted an array, could we not allow fully
non-escaped PHP strings as path segments, with the URL class safely handling the escaping for
the caller by default?

Right now, if someone haphazardly joins path segments in order to set withPath() they
will likely be unaware of that nuance and get the path wrong. On the grand scale of things, I
suspect this is a really minor risk. However, if they could send in an array then they would never
need to be aware of that nuance in order to provide a fully-reliable URL, up to the class rejecting
path segments which cannot be represented.
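
To make the array idea concrete, here is a minimal sketch in JavaScript against the browser/Node URL class, not the proposed PHP API. The helpers buildPath() and getPathSegments() are hypothetical names invented for illustration; the only real calls are URL, encodeURIComponent() and decodeURIComponent(). Rejecting "." and ".." is an assumption about what "cannot be represented" might mean.

```javascript
// Hypothetical helper: builds a path from unescaped segments.
// Each segment is escaped on its own, so a literal "/" inside a
// segment becomes %2F instead of acting as a separator.
function buildPath(segments) {
  for (const segment of segments) {
    // "." and ".." would be treated as dot segments by the parser,
    // so this sketch rejects them instead of guessing.
    if (segment === '.' || segment === '..') {
      throw new RangeError(`Cannot represent path segment: ${segment}`);
    }
  }
  return '/' + segments.map((s) => encodeURIComponent(s)).join('/');
}

// Hypothetical inverse: split on "/" first, then decode each segment.
function getPathSegments(url) {
  return url.pathname.split('/').slice(1).map((s) => decodeURIComponent(s));
}

const url = new URL('https://example.com/');
url.pathname = buildPath(['files', 'a/b', 'report 1.txt']);

console.log(url.pathname);         // /files/a%2Fb/report%201.txt
console.log(getPathSegments(url)); // files, a/b, report 1.txt (as an array)
```

Joining the same three strings with "/" by hand would have produced four segments instead of three, which is the nuance a caller is likely to miss.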

> 
> 
> I think the RFC is now mature enough to consider voting in the foreseeable future, since most
> of the concerns which came up until now are addressed some way or another. However, the only
> remaining question that I still have is whether the Uri\Rfc3986\Uri and the Uri\WhatWg\Url classes
> should be final? Personally, I don't see much problem with opening them for extension (other
> than some technical challenges that I already shared a few months ago), and I think people will have
> legitimate use cases for extending these classes. On the other hand, having final classes may allow
> us to make slightly more significant changes without BC concerns until we have a more battle-tested
> API, and of course completely eliminate the need to overcome the said technical challenges.
> According to Tim, it may also result in safer code because spec-compliant base classes cannot be
> extended by possibly non-spec compliant/buggy children. I don't necessarily fully agree with
> this specific concern, but here it is.
> 
> 

I’ve taken another fresh and full review of the RFC and I just want to share my appreciation for
how well-written it seems, and how meticulously you have taken everyone’s feedback and
incorporated it. It seems mature enough to me as well, and I think it’s in a good place. Still,
here are some additional thoughts (and a previous one again) related to some aspects, mostly
naming.

The HTML5 library has ::createFromString() instead of parse(). Did you
consider following this form? It doesn’t seem that important, but could be a nice improvement in
consistency among the newer spec-compliant APIs. Further, I think createFromString() is
a little more obvious in intent, as parse() is so generic.

Given the issues around equivalence, what about isEquivalent() instead of
equals()? In the RFC I think you have been careful to use the “equivalence”
terminology, but then in the actual interface we fall back to equals() and lose some of
the nuance.
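
As a rough illustration of why "equivalent" says more than "equal", here is a sketch using the browser/Node URL class. isEquivalent() is a hypothetical helper that compares normalized serializations; it is not the RFC's method and ignores any options the real API might offer.

```javascript
// Hypothetical helper: two URLs are "equivalent" when their
// normalized serializations match, even if the input strings differ.
function isEquivalent(a, b) {
  return new URL(a).href === new URL(b).href;
}

const a = 'HTTPS://Example.COM:443/a/../b';
const b = 'https://example.com/b';

console.log(a === b);            // false: the strings are not equal
console.log(isEquivalent(a, b)); // true: both serialize to https://example.com/b
```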

Something about not implementing getRawScheme() and friends in the WHATWG class seems
off. Your rationale makes sense, but then I wonder what the problem is in exposing the raw
untranslated components, particularly since the “raw” part of the name already suggests some
kind of danger or risk in using it as some semantic piece.

Tim brought up the naming of getHost() and getHostForDisplay() as well as
the correspondence with the toString() methods. I’m not sure if it was overlooked or
I missed the followup, but I wonder what your thoughts are on passing an enum to these methods
indicating the rendering context. Here’s why: I see developers reach for the first method that
looks right. In this case, that would almost always be getHost(), yet
getHost() or toString() or whatever is going to be inappropriate in many
common cases. I see two ways of baking in education into the API surface: creating two symmetric
methods (e.g. getDisplayableHost() and getNonDisplayableHost()); or
requiring an enum forcing the choice (e.g. getHost( ForDisplay | ForNonDisplay )). In
the case of an enum this could be equally applied across all of the relevant methods where such a
distinction exists. On one hand this could be seen as forcing callers to make a choice, but on the
other hand it can also be seen as a safeguard against an extremely-common foot-gun, making such an
easy oversight impossible.
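
The enum variant could look roughly like the sketch below. It is written in JavaScript for Node, not against the proposed PHP classes: HostContext and getHost() are hypothetical names, and the frozen object only stands in for a real enum. domainToUnicode() is the existing function from Node's url module.

```javascript
const { domainToUnicode } = require('node:url');

// Hypothetical stand-in for the enum: the caller must pick one.
const HostContext = Object.freeze({
  ForDisplay: 'display',
  ForNonDisplay: 'non-display',
});

// Hypothetical getter with no default context, so the "first method
// that looks right" cannot silently return the wrong form.
function getHost(url, context) {
  switch (context) {
    case HostContext.ForNonDisplay:
      return url.hostname;                  // ASCII (Punycode) form
    case HostContext.ForDisplay:
      return domainToUnicode(url.hostname); // Unicode form
    default:
      throw new TypeError('A rendering context is required');
  }
}

const url = new URL('https://münchen.de/');
console.log(getHost(url, HostContext.ForNonDisplay)); // xn--mnchen-3ya.de
console.log(getHost(url, HostContext.ForDisplay));    // münchen.de
```

Calling getHost(url) without a context throws, which is the point: the choice cannot be skipped by accident.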

Just this week I stumbled upon an issue with escaping the hash/fragment part of a URL. I think that
browsers used to decode percent-encodings in the fragment but they all stopped and this was removed
from the WHATWG HTML spec [no-percent-escaping]. The RFC currently shows getFragment()
decoding percent-encoded fragments. However, I believe that the WHATWG URL spec only indicates
percent-encoding when _setting_ the fragment. You can test this in a browser with the following
example: Chrome, Firefox, and Safari exhibit the same behavior.

    const u = new URL(window.location);
    u.hash = 'one and two';
    u.hash === '#one%20and%20two';
    u.toString() === '….#one%20and%20two';

So I think it may be more accurate and consistent to handle WhatWg\Url::getFragment() in
the same way as getScheme(). When setting a fragment we should percent-encode the
appropriate characters, but when reading it, we should never interpret those characters — it
should always return the “raw” value of the fragment.
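
The read side can be checked outside a browser too. This is a small sketch using the URL class in Node; the example.com URL and the "a%41b" fragment are made-up inputs chosen so that a decoding getter would visibly change the value.

```javascript
const u = new URL('https://example.com/#a%41b');

// Reading: the percent-encoded bytes come back as stored, not as "aAb".
console.log(u.hash);                              // #a%41b

// Writing: characters in the fragment percent-encode set are escaped.
u.hash = 'one and two';
console.log(u.hash);                              // #one%20and%20two

// Decoding is an explicit step the caller opts into.
console.log(decodeURIComponent(u.hash.slice(1))); // one and two
```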

[no-percent-escaping]: https://github.com/whatwg/url/issues/344

Once again, thank you for the great work you’ve put into this. I’m so excited to have it. All my
comments should be understood exclusively within the WHATWG domain as I don’t have the same
experience with the RFC3986 side.

Dennis Snell

> 
> 
> Regards,
> Máté
> 
> 
>

