Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API

Date: Wed, 19 Mar 2025 15:13:42 +0000
Subject: Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API
Groups: php.internals
Hi Máté,

> On Mar 18, 2025, at 15:15, Máté Kocsis <[email protected]> wrote:
>
> There's no way I would have written an implementation from scratch. I'm using the url
> module of the Lexbor C library (https://github.com/lexbor/lexbor/) for handling WHATWG URLs.
> It's already bundled in core, and it's also battle tested, and it has exceptional
> maintenance.

I did not mean to imply writing a parser from scratch; my apologies for phrasing it poorly.

> All I had to implement is the glue between userland and the C library.

That is more what I was getting at. Rowbot has a lot of what looks to be good design work on
structures that come out of the parsing, in addition to a separate parser class.

The RFC might benefit from an explicit and intentional review of, and maybe incorporation of, some
of the pre-existing Rowbot design work. At least one thing from Rowbot is absolutely not applicable
to the RFC (e.g. the PSR-3 logging); maybe none of the rest of it will be applicable either, but as
prior art from someone acknowledged in the WHATWG-URL spec, I think it bears your close attention.

As an overview, the following is a brief comparison between Rowbot and the RFC; any missed or
misrepresented functionality is unintentional.

* * *

## RFC

One non-final readonly Url class:

- 5 getRaw...() methods, 8 get...() methods, and one get...ForDisplay() method
- immutability via 8 with...() methods, broadly expecting properly-encoded arguments, and
soft-erroring on invalid characters
- a static parse() method, with relative parsing capability and a place to capture errors
- equals() to compare two URLs
- toString() for machine-friendly string recomposition
- toDisplayString() for human-friendly string recomposition
- resolve() to resolve a relative URL using the current URL as the base
- serialize/deserialize; "the serialized form only includes the recomposed URI itself exposed
as the __uri field, but the individual properties or URI components are not present."
- no URLSearchParams implementation
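
To make the immutable with...() style concrete, here is a minimal sketch. It is not the RFC's PHP API: `ImmutableUrl` and all of its methods are hypothetical names, and the parsing is delegated to the WHATWG `URL` class built into JavaScript runtimes, purely as a stand-in for a WHATWG-compliant parser.

```typescript
// Hypothetical wrapper, not the RFC's PHP API: it only illustrates the
// immutable "with-er" style on top of the runtime's built-in WHATWG URL class.
class ImmutableUrl {
  private readonly inner: URL;

  private constructor(inner: URL) {
    this.inner = inner;
  }

  // Returns null instead of throwing when the input cannot be parsed.
  static parse(input: string, base?: string): ImmutableUrl | null {
    try {
      return new ImmutableUrl(new URL(input, base));
    } catch {
      return null;
    }
  }

  getHost(): string {
    return this.inner.hostname;
  }

  // Copies the underlying URL, changes the copy, and leaves `this` untouched.
  withHost(host: string): ImmutableUrl {
    const copy = new URL(this.inner.href);
    copy.hostname = host;
    return new ImmutableUrl(copy);
  }

  // Resolves a relative reference using the current URL as the base.
  resolve(relative: string): ImmutableUrl | null {
    return ImmutableUrl.parse(relative, this.inner.href);
  }

  // Compares the serialized forms of the two URLs.
  equals(other: ImmutableUrl): boolean {
    return this.inner.href === other.inner.href;
  }

  toString(): string {
    return this.inner.href;
  }
}

const original = ImmutableUrl.parse("https://example.com/a/b?x=1")!;
const changed = original.withHost("example.org");
const resolved = original.resolve("../c")!;

console.log(original.toString()); // https://example.com/a/b?x=1
console.log(changed.toString());  // https://example.org/a/b?x=1
console.log(resolved.toString()); // https://example.com/c
```

The point of the sketch is only that every "modification" yields a new object, so `original` can be shared freely without defensive copies.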

## Rowbot

(None of the classes are readonly or final; these look to hew closely to the WHATWG-URL spec.)

A BasicURLParser class:

- affords relative parsing capability and an optional parameter for the target URLRecord
- returns a URLRecord

A URLRecord class:

- public mutable properties for the URL components
- $scheme is a Scheme implementation with equals() and other is...() methods
- $host is a HostInterface (and implementations) with equals() and other is...() methods
- $path is a PathInterface (and PathList implementation) with PathSegment manipulation methods
- setUsername() and setPassword() mutators
- serializing
- getOrigin(), includesCredentials(), isEqual()

A URL class:

- Composed of a URLRecord and a URLSearchParams object
- Constructor takes a string, parses it to a URLRecord, and retains the URLRecord
- a static parse() method with relative parsing, as a convenience method
- __toString() and toString() return the serialized URLRecord
- Virtual properties for $href, $origin, $protocol, $username, $password, $host, $hostname, $port,
$pathname, $search, $searchParams, $hash
- Mutability of virtual properties via magic __set()
- Readability of virtual properties via magic __get()
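
For readers less familiar with the model these virtual properties appear to follow, the sketch below shows the same property names on the WHATWG `URL` class built into JavaScript runtimes. This is an assumption-laden illustration of the spec's interface, not Rowbot's PHP code; the example URL is made up.

```typescript
// Stand-in for the WHATWG URL interface; the URL itself is an invented example.
const url = new URL("https://user:secret@example.com:8080/docs/index.html?q=php#intro");

// Reading components through properties.
const before = {
  protocol: url.protocol, // "https:"
  host: url.host,         // "example.com:8080"
  hostname: url.hostname, // "example.com"
  port: url.port,         // "8080"
  pathname: url.pathname, // "/docs/index.html"
  search: url.search,     // "?q=php"
  hash: url.hash,         // "#intro"
  origin: url.origin,     // "https://example.com:8080"
};

// Writing to a property mutates the object in place and re-serializes href.
url.hostname = "example.org";
url.port = "443"; // 443 is the default port for https, so it is dropped

console.log(url.href);
// https://user:secret@example.org/docs/index.html?q=php#intro
```

This in-place mutation is the main contrast with the with...() approach in the RFC.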

A URLSearchParams class:

- search params manipulation methods
- implements Countable, Iterator, Stringable
- composed of a QueryList implementation and (optionally) the originating URLRecord
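
As a rough picture of what a search-params object tied to its originating URL buys you, here is a short sketch using the `URLSearchParams` class built into JavaScript runtimes. It stands in for the spec's interface only; it is not Rowbot's implementation, and the URL and parameter names are invented.

```typescript
// Stand-in for the WHATWG URLSearchParams interface; values are invented.
const searchUrl = new URL("https://example.com/search?q=php&page=1");
const params = searchUrl.searchParams;

params.set("page", "2");             // replaces the existing value
params.append("tag", "url parsing"); // adds a new pair; the space is encoded as "+"

// Collect the keys in order, without relying on iterator support.
const keys: string[] = [];
params.forEach((_value, key) => {
  keys.push(key);
});

console.log(keys);             // [ 'q', 'page', 'tag' ]
console.log(params.toString()); // q=php&page=2&tag=url+parsing
console.log(searchUrl.href);    // https://example.com/search?q=php&page=2&tag=url+parsing
```

Because the params object keeps a link back to its URL, edits to the query show up in the serialized URL without a separate recomposition step.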

* * *


-- pmj

