Skip to content

Conversation

kocsismate
Copy link
Member

Split from #19970

Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to give this a proper look tomorrow. Looking at the tests the output makes sense though.

Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks right to me.
As a follow-up (likely master-only), you can use ZVAL_STRINGL_FAST which will avoid allocating empty strings (and 1-byte-long strings) by using interned strings.

Copy link
Member

@TimWolla TimWolla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I see, the WHATWG URL standard treats missing and empty queries and fragments the same with regard to the getters. The difference is however preserved when recomposing the URL (which is also the case for the toAsciiString() method in PHP already) - unless explicitly overwritten. Which makes for funky behavior:

> u = new URL('https://example.com/?#');
URL {
  href: 'https://example.com/?#',
  origin: 'https://example.com/',
  protocol: 'https:',
  username: '',
  password: '',
  host: 'example.com',
  hostname: 'example.com',
  port: '',
  pathname: '/',
  search: '',
  searchParams: URLSearchParams {},
  hash: ''
}
> u.search
''
> u.hash
''
> u.toString()
'https://example.com/?#'
> u.hash = u.hash
''
> u.search = u.search
''
> u.toString()
'https://example.com/'

It thus makes sense to me to consider this another case of:

Getters of Uri\WhatWg\Url have a few gotchas for the ones who are inherently familiar with the WHATWG URL specification: they don't (entirely) follow the “getter steps” that are defined by the specification, but the individual components are returned directly without any other changes that the “getter steps” would otherwise specify.

And make them differ between missing and empty to be consistent with the recomposition and with Rfc3986.

Copy link
Member

@TimWolla TimWolla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which makes for funky behavior:

I am seeing that this also exists with PHP after applying the patch:

$url1 = Uri\WhatWg\Url::parse("https://example.com/?#");
$url2 = $url1->withFragment($url1->getFragment());

var_dump($url1->getFragment()); // string(0) ""
var_dump($url2->getFragment()); // NULL
var_dump($url2->toAsciiString()); // string(21) "https://example.com/?"

So perhaps the correct solution is to not make getFragment() and getQuery() nullable and always return an empty string? That should still be okay for “baseline compatibility” between RFC 3986 and WHATWG.


Selecting “Request Changes” for visibility.

@TimWolla
Copy link
Member

Similarly for the username:

<?php

$url1 = Uri\WhatWg\Url::parse("https://:[email protected]");

var_dump($url1->getUsername()); // NULL
var_dump($url1->getPassword()); // string(3) "asd"
var_dump($url1->toAsciiString()); // string(25) "https://:[email protected]/"

I'd argue that the username is empty in this case, not missing. It currently returns null for getUsername(), though. However:

<?php

$url1 = Uri\WhatWg\Url::parse("https://example.com/");

var_dump($url1->getUsername()); // NULL
var_dump($url1->getPassword()); // NULL
var_dump($url1->toAsciiString()); // string(20) "https://example.com/"

Here the @ is actually gone after parsing. So NULL for the username is fine. Should this case also just be "" for both NULL and empty string?

@kocsismate
Copy link
Member Author

I am seeing that this also exists with PHP after applying the patch:

BTW It's possible to "fix" this case by the following changes: :D

-	zval_string_or_null_to_lexbor_str(value, &str);
+	if (Z_TYPE_P(value) == IS_STRING && Z_STRLEN_P(value) > 0) {
+		str.data = (lxb_char_t *) Z_STRVAL_P(value);
+		str.length = Z_STRLEN_P(value);
+	} else if (Z_TYPE_P(value) == IS_STRING && Z_STRLEN_P(value) == 0) {
+		str.data = (lxb_char_t *) "?";
+		str.length = 1;
+	} else {
+		ZEND_ASSERT(Z_ISNULL_P(value));
+		str.data = NULL;
+		str.length = 0;
+	}

I love that there's this loophole TBH... 😅 And I find it important to have a distinction between empty/missing, especially if toString() returns the leading ?/# characters depending on if these components are missing or empty.... I have never understood this part of the WHATWG spec: why getter steps don't expose the distinction :( Neither I would enjoy losing consistency with the RFC3986 query and fragment getters/withers :( but overall, I'm not sure what the best solution would be.

Here the @ is actually gone after parsing. So NULL for the username is fine. Should this case also just be "" for both NULL and empty string?

As sad as it is, I'm OK to get rid of nullability here.. It doesn't look right after seeing what happens in case of https://:@example.com. I see no way to make nullability work properly. :(

@kocsismate
Copy link
Member Author

I am seeing that this also exists with PHP after applying the patch:

$url1 = Uri\WhatWg\Url::parse("https://example.com/?#");
$url2 = $url1->withFragment($url1->getFragment());

var_dump($url1->getFragment()); // string(0) ""
var_dump($url2->getFragment()); // NULL
var_dump($url2->toAsciiString()); // string(21) "https://example.com/?"

After thinking about this a little bit more.. I'm not sure anymore if it's an issue at all. Apparently, WHATWG URL doesn't give a big importance for the distinction between missing and empty: this is best observed in the getter steps.

When passing an empty string to the withFragment(), we can imagine the process as the resulting URL is "normalized" so that the unnecessary # is removed. Therefore, I don't think it's a big issue if "roundtripability" is not satisfied in this edge case.

What matters more for me is that:

  • The empty vs missing distinction for the current URL is kept. I think this is especially important if someone wants to rebuild the component recomposition logic from scratch for whatever reason (e.g. slightly modify it)
  • The getters remains consistent with the toString() method: the # is present in the toString() output if and only if the getter returns a string.
  • Consistency is kept with RFC3986 so that at least the getter return types remain the same (with the exception of the scheme component)

All these seem more important to me than completely symmetrical results when an empty query/fragment is passed to a wither.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants