Re: [RFC] Decoding HTML and the Ambiguous Ampersand

From: Dennis Snell
Date: Sat, 24 Aug 2024 20:34:40 +0000
Subject: Re: [RFC] Decoding HTML and the Ambiguous Ampersand
Groups: php.internals


> On Aug 24, 2024, at 7:47 AM, Christoph M. Becker wrote:
> 
> On 23.08.2024 at 01:02, Dennis Snell wrote:
> 
>>> If we could have a single implementation, that would be great. I do understand of
>>> course your concern that DOM is not a required extension, and therefore basing the internals on
>>> Lexbor makes it tied to the DOM extension which may not be available. I however suspect that a large
>>> chunk of people needing a function like this have DOM available (as DOM is required by many
>>> HTML-processing-related packages). I can also look into it sometime soon if you want; anyway feel
>>> free to ping me.
>> 
>> I’m also very open to lexbor-based approaches but I’ve so far found it more complicated
>> than I expected. In some part this is because it involves setting up the parser and state machine
>> for the HTML specification and much of the actual decoding can be safely done without this.
>> 
>> The other part is the extension aspect. I hear you, that you would expect calling code to
>> have the DOM extensions available, but that’s simply not the case when developing a platform like
>> WordPress, which I do. We don’t have control over the servers or environments where people are
>> deploying this, and the availability of the DOM extensions is low enough that WordPress code simply
>> cannot use DOMDocument (even though it shouldn’t because of the wild problems that
>> has for attempting to parse HTML).
>> 
>> People resort to html_entity_decode() because that’s the only option. In
>> WordPress we now have a spec-compliant decoder, but as it’s in user-space PHP its performance is
>> far below what’s possible.
>> 
>> I’d love your help in setting up lexbor’s state machine to decode text nodes. I’d
>> love it even more if this could be part of the PHP language. It constantly surprises me that _the
>> language of the web_ (PHP) doesn’t have the tools to speak _the language of the web_ (HTML). This
>> RFC is all about taking a step towards ensuring that PHP developers can rely on PHP to be a reliable
>> middle-man between the HTML domain and the PHP domain.
>> 
>> In other words, requiring the DOM extension or DOM\HtmlDocument would be such
>> a non-starter for WordPress (accounting for 43% of the web today) that it would be completely
>> unavailable.
> 
> Well, I don't think it would be a big deal to move the bundled lexbor to
> somewhere where it is always available.  I mean, so far it's only used
> by ext/dom so it's bundled there, but if other parts of the php-src code
> base would use it, we could put it elsewhere.

Having a DOM parser for HTML in PHP itself without requiring an extension would open up many new
possibilities. For example, WordPress test suites don’t have any functional
“assertEquivalentMarkup()” functions because there’s no spec-compliant parser in PHP. We
finally wrote our own user-space HTML parser, but relying on DOM\HtmlDocument would be
much easier.

These test suites need to run on a variety of environments and PHP versions, so there’s no point
in thinking we could quickly adopt a native class to get the job done. But if that class remains
locked inside an optional extension, it may be borderline impossible to ever migrate to it.

> 
> Christoph
> 

Dennis Snell

