> On Feb 16, 2025, at 3:01 PM, Máté Kocsis <[email protected]> wrote:
>
>
> Hi Dennis,
>>
>> I only harp on the WhatWG spec so much because for many people this will be the only one
>> they are aware of, if they are aware of any spec at all, and this is a sizable vector of attack
>> targeting servers from user-supplied content. I’m curious to hear from folks here hat fraction of
>> the actual PHP code deals with RFC3986 URLs, and of those, if the systems using them are truly
>> RFC3986 systems or if the common-enough URLs are valid in both specs.
>>
>
> I think Ignace's examples already highlighted that the two specifications differ in
> nuances so much that even I had to admit after months of trying to squeeze them into the same
> interface that doing so would be irresponsible.
> The Uri\Rfc3986\Uri will be useful for many use-case (i.e. representing URNs or URIs with
> scheme-specific behavior - like ldap apparently), but even the UriInterface of PSR-7 can build upon
> it. On the other hand, Uri\WhatWg\Url will be useful for representing browser links and any other
> URLs for the web (i.e. an HTTP application router component should use this class).
>
>> Just to enlighten me and possibly others with less familiarity, how and when are RFC3986
>> URLs used and what are those systems supposed to do when an invalid URL appears, such as when
>> dealing with percent-encodings as you brought up in response to Tim?
>>
>
> I am not 100% sure what I brought up to Tim, but certainly, the biggest difference between the
> two specs regarding percent-encoding was recently documented in the RFC: https://wiki.php.net/rfc/url_parsing_api#percent-encoding
> . The other main difference is how the host component is stored: WHATWG automatically
> percent-decodes it, while RFC3986 doesn't. This is summarized in the https://wiki.php.net/rfc/url_parsing_api#component_retrieval
> section (a bit below).
>
>> This would be fine, knowing in hindsight that it was originally a relative path. Of course,
>> this would mean that it’s critical that `https://example.com
>> does not replace the actual host part if one is provided in
$url`. For
>> example, this code should work.
>>
>> ```
>> $url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc
>> ’, ‘https://example.com
>> ’ );
>> $url->domain === 'wiki.php.net
>> '
>>
>
> Yes. it's the case. Both classes only use the base URL for relative URIs.
>
>> Hopefully this won’t be too controversial, even though the concept was new to me when I
>> started having to reliably work with URLs. I choose the example I did because of human risk factors
>> in security exploits. "xn--google.com
>> " is not in fact a Google domain, but an IDNA domain decoding to
>> "䕮䕵䕶䕱.com
>> ”
>>
>
> I got your point, so I implemented your suggestion. Actually, I made yet another larger API
> change in the meanwhile, but in any case, the WHATWG implementation now supports IDNA the following
> way:
> $url = Uri\WhatWg\Url::parse("https://🐘.com/🐘?🐘=🐘", null);
>
> echo $url->getHost(); // xn--go8h.com
>
> echo $url->getHostForDisplay(); // 🐘.com
> echo $url->toString(); // https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
>
> echo $url->toDisplayString(); / https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
>
>
> Unfortunately, RFC3986 doesn't support IDNA (as Ignace already pointed out at the end
> of https://externals.io/message/126182#126184
> ), and adding support for RFC3987 (therefore IRIs) would be a very heavy amount of
> work, it's just not feasible within this RFC :( To make things worse, its code should be
> written from scratch, since I haven't found any suitable C library yet for this purpose.
> That's why I'll leave them for
>
>
> On other notes, let me share some of the changes since my previous message to the mailing list:
>
>
> - First and foremost, I removed the Uri\Rfc3986\Uri::normalize() method from the proposal after
> Arnaud's feedback. Now, both the normalized (and decoded), as well as the non-normalized
> representation can equally be retrieved from the same URI instance. This was necessary to change in
> order for users to be able to consistently use URIs. Now, if someone needs an exact URI component
> value, they can use the getRaw*() getter. If they want the normalized and percent-decoded form then
> a get*() getter should be used. For more information, the
>
> > https://wiki.php.net/rfc/url_parsing_api#component_retrieval
> section should be consulted.
>
>
This seems like a good change.
> - I made a few less important API changes, like converting the WhatWgError class to an enum,
> adding a Uri\Rfc3986\IUri::getUserInfo() method, changing the return type of some getters (removing
> nullability) etc.
>
>
Love this.
> - I fixed quite some smaller details of the implementation along with a very important spec
> incompatibility: until now, the "path" component didn't contain the leading
> "/" character when it should have. Now, both classes conform to their respective
> specifications with regards to path handling.
>
>
This is a late thought, and surely amenable to a later RFC, but I was thinking about the get/set
path methods and the issue of the / and %2F.
- If we exposed getPathIterator()
or getPathSegments()
could we not
report these in their fully-decoded forms? That is, because the path segments are separated by some
invocation or array element, they could be decoded?
- Probably more valuably, if withPath()
accepted an array, could we not allow fully
non-escaped PHP strings as path segments which the URL class could safely and by-default handle the
escaping for the caller?
Right now, if someone haphazardly joins path segments in order to set withPath()
they
will likely be unaware of that nuance and get the path wrong. On the grand scale of things, I
suspect this is a really minor risk. However, if they could send in an array then they would never
need to be aware of that nuance in order to provide a fully-reliable URL, up to the class rejecting
path segments which cannot be represented.
>
>
> I think the RFC is now mature enough to consider voting in the foreseeable future, since most
> of the concerns which came up until now are addressed some way or another. However, the only
> remaining question that I still have is whether the Uri\Rfc3986\Uri and the Uri\WhatWg\Url classes
> should be final? Personally, I don't see much problem with opening them for extension (other
> than some technical challenges that I already shared a few months ago), and I think people will have
> legitimate use cases for extending these classes. On the other hand, having final classes may allow
> us to make slightly more significant changes without BC concerns until we have a more battle-tested
> API, and of course completely eliminate the need to overcome the said technical challenges.
> According to Tim, it may also result in safer code because spec-compliant base classes cannot be
> extended by possibly non-spec compliant/buggy children. I don't necessarily fully agree with
> this specific concern, but here it is.
>
>
I’ve taken another fresh and full review of the RFC and I just want to share my appreciation for
how well-written it seems, and how meticulously you have taken everyone’s feedback and
incorporated it. It seems mature enough to me as well, and I think it’s in a good place. Still,
here are some additional thoughts (and a previous one again) related to some of aspects, mostly
naming.
The HTML5 library has ::createFromString()
instead of parse()
. Did you
consider following this form? It doesn’t seem that important, but could be a nice improvement in
consistency among the newer spec-compliant APIs. Further, I think createFromString()
is
a little more obvious in intent, as parse()
is so generic.
Given the issues around equivalence, what about isEquivalent()
instead of
equals()
? In the RFC I think you have been careful to use the “equivalence”
terminology, but then in the actual interface we fall back to equals()
and lose some of
the nuance.
Something about not implementing getRawScheme()
and friends in the WHATWG class seems
off. Your rationale makes sense, but then I wonder what the problem is in exposing the raw
untranslated components, particularly since the “raw” part of the name already suggests some
kind of danger or risk in using it as some semantic piece.
Tim brought up the naming of getHost()
and getHostForDisplay()
as well as
the correspondence with the toString()
methods. I’m not sure if it was overlooked or
I missed the followup, but I wonder what your thoughts are on passing an enum to these methods
indicating the rendering context.. Here’s why: I see developers reach for the first method that
looks right. In this case, that would almost always be getHost()
, yet
getHost()
or toString()
or whatever is going to be inappropriate in many
common cases. I see two ways of baking in education into the API surface: creating two symmetric
methods (e.g. getDisplayableHost()
and getNonDisplayableHost()
); or
requiring an enum forcing the choice (e.g. getHost( ForDisplay | ForNonDisplay )
). In
the case on an enum this could be equally applied across all of the relevant methods where such a
distinction exists. On one hand this could be seen as forcing callers to make a choice, but on the
other hand it can also be seen as a safeguard against an extremely-common foot-gun, making such an
easy oversight impossible.
Just this week I stumbled upon an issue with escaping the hash/fragment part of a URL. I think that
browsers used to decode percent-encodings in the fragment but they all stopped and this was removed
from the WHATWG HTML spec [no-percent-escaping]. The RFC currently shows getFragment()
decoding percent-encoded fragments, However, I believe that the WHATWG URL spec only indicates
percent-encoding when _setting_ the fragment. You can test this in a browser with the following
example: Chrome, Firefox, and Safari exhibit the same behavior.
u = new URL(window.location)
u.hash = ‘one and two’;
u.hash === ‘#one%20and%20two’;
u.toString() === ‘….#one%20and%20two’;
So I think it may be more accurate and consistent to handle Whatwg\Url::getFragment
in
the same way as getScheme()
. When setting a fragment we should percent-encode the
appropriate characters, but when reading it, we should never interpret those characters — it
should always return the “raw” value of the fragment.
[no-percent-escaping]: https://github.com/whatwg/url/issues/344
Once again, thank you for the great work you’ve put into this. I’m so excited to have it. All my
comments should be understood exclusively within the WHATWG domain as I don’t have the same
experience with the RFC3986 side.
Dennis Snell
>
>
> Regards,
> Máté
>
>
>