Re: [RFC] Decoding HTML and the Ambiguous Ampersand

From: Date: Sun, 25 Aug 2024 15:25:07 +0000
Subject: Re: [RFC] Decoding HTML and the Ambiguous Ampersand
References: 1 2 3 4 5 6 7  Groups: php.internals 
Request: Send a blank email to [email protected] to get a copy of this message

> On Aug 25, 2024, at 3:15 AM, Jakob Givoni <[email protected]> wrote:
> 
> 
> On Sat, Aug 24, 2024 at 10:31 PM Dennis Snell <[email protected] <mailto:[email protected]>> wrote:
>> On Aug 24, 2024, at 2:56 PM, Jakob Givoni <[email protected] <mailto:[email protected]>> wrote:
>>> 
>>> Hi Dennis,
>>> 
>>> Overall it sounds like a reasonable RFC.
>>>   
>>> > Dennis:
>>> >
>>> > > Niels:
>>> > >
>>> > > I'm not so sure that the name "decode_html" is
>>> > > self-descriptive enough, it sounds very generic.
>>> >
>>> > The name is not very important to me. For the sake of history, the reason I have
>>> > chosen “decode HTML” is because, unlike an HTML parser, this is focused on taking a snippet of
>>> > HTML “text” content and decoding it into a “plain PHP string.”
>>> 
>>> Why not make it two methods called "decode_html_text" and
>>> "decode_html_attribute"?
>>> Consider the following reasons:
>>> 1. The function doesn't actually decode html as such, it decodes either an html
>>> text node string or an html attribute string.
>> 
>> Thanks Jakob. In WordPress I did just this.
>> https://developer.wordpress.org/reference/classes/wp_html_decoder/
>> 
>> Part of the reason for that was the inability to require something like an enum (due to PHP
>> version support requirements). The Enum solution feels very nice too.
>> 
>>> 2. Saves the $context parameter and the constants/enums, making the call significantly
>>> shorter. 
>> 
>> In my PR I’ve actually expanded the Enum to include a few other contexts. I feel like
>> there’s a balance we have to do if we want to ride the line between fully reliable and fully
>> convenient. On one hand, we could say “don’t send the text content of a SCRIPT element to this
>> function!” But on the other hand, that kind of forces people to expect that SCRIPT content is
>> different.
>> 
>> With the Enum there is that in-built training material when someone looks and finds
>> Attribute | BodyText | ForeignText | Script | Style (the contexts I’ve explored in my
>> PR). 
>> 
>> We could make the same argument for decode_html_script() and
>> decode_foreign_text_node() and decode_html_style(). Somehow the context
>> feels cleaner to me, and like a single entry point for learning instead of five.
>> 
> 
> Yes. With 5 different contexts it's starting to shift in favor of a single function :-)
> I only saw the RFC which from what I can tell still only features 2 of them. I haven't
> seen the PR (RFC Implementation section says "Yet to come"). 

Oops, I’ll get to this!

>>> 3. It feels like decoding either text or attribute are two significantly different
>>> things. I admit I could be wrong, if code like decode_html($e->isAttritbute() ?
>>> HtmlContext::Attribute : HtmlContext::Text, $e->getContent()) is likely to be seen.
>> 
>> None of these contexts are significantly different, which is one of the major dangers of
>> using html_entity_decode(). The results will look just about right most of the time.
>> It’s the subtle differences that matter most, I suppose.
> 
> Well, that was kind of what I meant - even if the differences are usually absent or subtle,
> they are significant (i.e. not necessarily big, but meaningful), meaning using it wrong would give
> the wrong result, right? Saying that they are not significantly different to me means that the
> result would just be a little less good sometimes, not directly wrong.

In hindsight I think I misunderstood what you were saying and got it backwards. I meant that the
algorithms are subtly different, but as you point out, yes, the outcomes can be significant. ln the
better cases we get data corruption, but these do lead to misidentification of unsafe content.

For example, “&#x6a&#x61;\x00avascript” should decode as “javascript” when rendered
by a browser when found inside the BODY of a page, but an attribute should read “ja�vascript.”

>> 
>> The lesson I have drawn is that people frequently have what they understand to be a text
>> node or an attribute value, but they aren’t aware that they are supposed to decode differently,
>> and they also aren’t reaching to interact with a full parser to get these values. If PHP could
>> train people as they use these functions, purely through their interfaces, I think that could help
>> elevate the level of reliability out there in the wild, as long as they aren’t too cumbersome
>> (hence explicitly no default context argument _or_ using separately-named functions).
>> 
>> Having the Enum I think enhances the ease with which people can reliably also decode things
>> like SCRIPT and STYLE nodes. “I know html_decode_text() but I don’t know what the
>> rules for SCRIPT are or if they’re different so I’ll just stick with that.” vs “My IDE
>> suggests that Script is a different context, that’s interesting, I’ll try that and
>> see how it’s different."
>> 
> 
> That is a good point and using enums favours that learning push since they are inherently
> grouped together.
>>> 
>>> Best,
>>> Jakob
>>>  
>> 
>> Thanks for your input. I’m grateful for the discussions and that people are sharing.
>> 
> 
> Cheers!

Warmly,
Dennis Snell



Thread (12 messages)

« previous php.internals (#125215) next »