Re: [RFC] Decoding HTML and the Ambiguous Ampersand

From: Dennis Snell Date: Sat, 24 Aug 2024 20:34:40 +0000

Subject: Re: [RFC] Decoding HTML and the Ambiguous Ampersand

References: 1 2 3 4 5 Groups: php.internals




> On Aug 24, 2024, at 7:47 AM, Christoph M. Becker <[email protected]> wrote:
> 
> On 23.08.2024 at 01:02, Dennis Snell wrote:
> 
>>> If we could have a single implementation, that would be great. I do understand of
>>> course your concern that DOM is not a required extension, and therefore basing the internals on
>>> Lexbor makes it tied to the DOM extension which may not be available. I however suspect that a large
>>> chunk of people needing a function like this have DOM available (as DOM is required by many
>>> HTML-processing-related packages). I can also look into it sometime soon if you want; anyway feel
>>> free to ping me.
>> 
>> I’m also very open to lexbor-based approaches but I’ve so far found it more complicated
>> than I expected. In some part this is because it involves setting up the parser and state machine
>> for the HTML specification and much of the actual decoding can be safely done without this.
>> 
>> The other part is the extension aspect. I hear you, that you would expect calling code to
>> have the DOM extensions available, but that’s simply not the case when developing a platform like
>> WordPress, which I do. We don’t have control over the servers or environments where people are
>> deploying this, and the availability of the DOM extensions is low enough that WordPress code simply
>> cannot use DOMDocument (even though it shouldn’t because of the wild problems that
>> has for attempting to parse HTML).
>> 
>> People resort to html_entity_decode() because that’s the only option. In
>> WordPress we now have a spec-compliant decoder, but as it’s in user-space PHP its performance is
>> far below what’s possible.
>> 
>> I’d love your help in setting up lexbor’s state machine to decode text nodes. I’d
>> love it even more if this could be part of the PHP language. It constantly surprises me that _the
>> language of the web_ (PHP) doesn’t have the tools to speak _the language of the web_ (HTML). This
>> RFC is all about taking a step towards ensuring that PHP developers can rely on PHP to be a reliable
>> middle-man between the HTML domain and the PHP domain.
>> 
>> In other words, requiring the DOM extension or DOM\HtmlDocument would be such
>> a non-starter for WordPress (accounting for 43% of the web today) that it would be completely
>> unavailable.
> 
> Well, I don't think it would be a big deal to move the bundled lexbor to
> somewhere where it is always available.  I mean, so far it's only used
> by ext/dom so it's bundled there, but if other parts of the php-src code
> base would use it, we could put it elsewhere.

Having a DOM parser for HTML in PHP itself without requiring an extension would open up many new
possibilities. For example, WordPress test suites don’t have any functional
“assertEquivalentMarkup()” functions because there’s no spec-compliant parser in PHP. We
finally wrote our own user-space HTML parser, but relying on DOM\HtmlDocument would be
much easier.

These test suites need to run on a variety of environments and PHP versions, so there’s little point in
thinking we could adopt a native class quickly. But if it remains locked inside an
optional extension, it may be borderline impossible to ever migrate to it.

> 
> Christoph
> 

Dennis Snell

