Re: [RFC] Decoding HTML and the Ambiguous Ampersand

Date: Sat, 24 Aug 2024 19:56:21 +0000
Subject: Re: [RFC] Decoding HTML and the Ambiguous Ampersand
Groups: php.internals

Hi Dennis,

Overall it sounds like a reasonable RFC.

> Dennis:
>
> > Niels:
> >
> > I'm not so sure that the name "decode_html" is self-descriptive enough,
> > it sounds very generic.
>
> The name is not very important to me. For the sake of history, the reason
> I have chosen “decode HTML” is because, unlike an HTML parser, this is
> focused on taking a snippet of HTML “text” content and decoding it into a
> “plain PHP string.”

Why not make it two functions called "decode_html_text" and
"decode_html_attribute"?
Consider the following reasons:
1. The function doesn't decode HTML as such; it decodes either an HTML
text node string or an HTML attribute string.
2. It removes the need for the $context parameter and the constants/enums,
making the call significantly shorter.
3. Decoding text and decoding an attribute feel like two significantly
different things. I admit I could be wrong if code like
decode_html($e->isAttribute() ? HtmlContext::Attribute :
HtmlContext::Text, $e->getContent()) is likely to be seen. But I
don't foresee many situations where text and attribute strings end up
in the same code path.
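
To make the comparison of call sites concrete, here is a rough sketch.
Everything in it is an assumption for illustration: the enum and the
function bodies are hypothetical, and html_entity_decode() is only a
stand-in that does not apply the context-specific decoding rules the RFC
describes.

```php
<?php
// Hypothetical sketch only: none of these names exist in PHP today.
enum HtmlContext
{
    case Text;
    case Attribute;
}

// Stand-in for the proposed function. html_entity_decode() is used purely
// as a placeholder; it ignores $context and is NOT the RFC's algorithm.
function decode_html(HtmlContext $context, string $html): string
{
    return html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// The two-function alternative: the context is part of the name.
function decode_html_text(string $html): string
{
    return decode_html(HtmlContext::Text, $html);
}

function decode_html_attribute(string $html): string
{
    return decode_html(HtmlContext::Attribute, $html);
}

// Call sites side by side.
$viaContext = decode_html(HtmlContext::Text, 'Fish &amp; Chips');
$title      = decode_html_text('Fish &amp; Chips');
$href       = decode_html_attribute('?a=1&amp;b=2');
```

The single-function form only pays off when the context is chosen at
runtime; when it is known at the call site, the dedicated names read
shorter and need no enum.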

A couple of other options that would silence anyone opposed to implicitly
favouring UTF-8:
html_text_to_utf8 and html_attribute_to_utf8

Best,
Jakob

