> On Jul 9, 2024, at 4:55 PM, Dennis Snell <[email protected]> wrote:
>
> Greetings all,
>
> The html_entity_decode( … ENT_HTML5 … ) function has a number of issues that I’d like to correct.
>
> - It’s missing 720 of HTML5’s specified named character references.
> - 106 of these are named character references which do not require a trailing semicolon, such
>   as &acute
> - It’s unaware of the ambiguous ampersand rule, which allows these 106 in special circumstances.
>
> HTML5 asserts that the list of named character references will not expand in the future. It can
> be found authoritatively at the following URL:
>
> https://html.spec.whatwg.org/entities.json
>
> The ambiguous ampersand rule smooths over legacy behavior from before HTML5 where ampersands
> were not properly encoded in attribute values, specifically in URL values. For example, in a query
> string for a search, one might find ?q=dog&not=cat. The &not in that value would decode to
> U+AC (¬), but since it’s in an attribute value it will be left as plaintext. Inside normal HTML
> markup it would transform into ?q=dog¬=cat. There are related nuances when numeric character
> references are found at the end of a string or boundary without the semicolon.
>
> The function signature of html_entity_decode() does not currently allow for correcting this
> behavior. I’d like to propose an RFC or a bug fix which either extends the function (perhaps by
> adding a new flag like ENT_AMBIGUOUS_AMPERSAND) or preferably creates a new function. For the
> missing character references I wonder if it would be enough to add them to the list of default
> translatable references.
>
> One challenge with the existing function is that the concept of the translation table stands in
> contrast with the fixed and static nature of HTML5’s replacement tables. A new function or set of
> functions could open up spec-compliant decoding while providing helpful methods that are necessary
> in many common server-side operations:
>
> - html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' )
> - html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )
> - html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )
>
> These methods are handy for inspecting things like encoded attribute values in a
> memory-efficient and processing-efficient way, when it’s not necessary to decode the entire value.
> In common situations, one encounters data-URIs with potentially megabytes of image data and
> processing only the first few or tens of bytes can save a lot of overhead.
>
> We’re exploring pure-PHP solutions to these problems in WordPress in attempts to improve the
> reliability and safety of handling HTML. I’d love to hear your thoughts and know if anyone is
> willing to work with me to create an RFC or directly propose patches. We’ve created a step
> function which allows finding the next character reference and decoding it separately, enabling some
> novel features like highlighting the character references in source text.
>
> Should I propose an RFC for this?
>
> Warmly,
> Dennis Snell
> Automattic Inc.

Thanks everyone for your feedback so far on the decode_html() RFC
[https://wiki.php.net/rfc/decode_html].
I’ve updated it, replacing the new constants with a new HtmlContext enum, and the
interface seems much nicer this way. I particularly like how PHP enforces passing a valid value, vs.
hoping that the right flag is used.
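
For illustration only, here is a rough sketch of how an enum parameter can select between the attribute and text rules for references that lack a trailing semicolon. The names HtmlContextSketch and decode_legacy_references_sketch are hypothetical, the case names are assumptions rather than the RFC's, and only three of the legacy references are handled. It needs PHP 8.1 or later for enums.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the RFC's enum; the case names are assumptions.
enum HtmlContextSketch
{
    case Attribute;
    case Text;
}

/**
 * Decodes a tiny, illustrative subset of the named character references
 * that HTML allows without a trailing semicolon.
 */
function decode_legacy_references_sketch(HtmlContextSketch $context, string $raw): string
{
    // Three of the semicolon-optional names; not the full list.
    $legacy = ['amp' => '&', 'not' => "\u{AC}", 'acute' => "\u{B4}"];

    // match is exhaustive over the enum, so an unhandled case fails loudly.
    $pattern = match ($context) {
        // In text, the reference decodes with or without the semicolon.
        HtmlContextSketch::Text => '/&(amp|not|acute);?/',
        // In attributes, a semicolon-less reference followed by "=" or an
        // ASCII alphanumeric is left as plain text.
        HtmlContextSketch::Attribute => '/&(amp|not|acute)(?:;|(?![=A-Za-z0-9]))/',
    };

    return preg_replace_callback(
        $pattern,
        fn (array $m): string => $legacy[$m[1]],
        $raw
    ) ?? $raw;
}

$text_result      = decode_legacy_references_sketch(HtmlContextSketch::Text, '?q=dog&not=cat');
$attribute_result = decode_legacy_references_sketch(HtmlContextSketch::Attribute, '?q=dog&not=cat');
```

Passing anything other than an HtmlContextSketch case as the first argument raises a TypeError at the call site, which is the kind of enforcement a bitmask flag cannot offer.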
Additionally I added a section that I previously forgot, which highlights the source of the infamous
mojibake/gremlins. HTML has special rules for remapping numeric character references that point at
the C1 control characters, treating them as if they were Windows-1252 bytes.
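
As a small illustration of that remapping, the sketch below holds the replacements the HTML standard lists for numeric character references in the C1 range. The function name html_c1_replacement_sketch is hypothetical and is not part of PHP or the RFC; it only looks up the replacement and leaves all other decoding aside.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative helper: returns the replacement the HTML standard assigns
 * to a numeric character reference in the C1 range, or null when the
 * code point is not remapped.
 */
function html_c1_replacement_sketch(int $code_point): ?string
{
    // 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no entry in the standard's table.
    static $remap = [
        0x80 => "\u{20AC}", // euro sign
        0x82 => "\u{201A}", // single low-9 quotation mark
        0x83 => "\u{0192}", // latin small f with hook
        0x84 => "\u{201E}", // double low-9 quotation mark
        0x85 => "\u{2026}", // horizontal ellipsis
        0x86 => "\u{2020}", // dagger
        0x87 => "\u{2021}", // double dagger
        0x88 => "\u{02C6}", // modifier letter circumflex accent
        0x89 => "\u{2030}", // per mille sign
        0x8A => "\u{0160}", // latin capital S with caron
        0x8B => "\u{2039}", // single left-pointing angle quotation mark
        0x8C => "\u{0152}", // latin capital ligature OE
        0x8E => "\u{017D}", // latin capital Z with caron
        0x91 => "\u{2018}", // left single quotation mark
        0x92 => "\u{2019}", // right single quotation mark
        0x93 => "\u{201C}", // left double quotation mark
        0x94 => "\u{201D}", // right double quotation mark
        0x95 => "\u{2022}", // bullet
        0x96 => "\u{2013}", // en dash
        0x97 => "\u{2014}", // em dash
        0x98 => "\u{02DC}", // small tilde
        0x99 => "\u{2122}", // trade mark sign
        0x9A => "\u{0161}", // latin small s with caron
        0x9B => "\u{203A}", // single right-pointing angle quotation mark
        0x9C => "\u{0153}", // latin small ligature oe
        0x9E => "\u{017E}", // latin small z with caron
        0x9F => "\u{0178}", // latin capital Y with diaeresis
    ];

    return $remap[$code_point] ?? null;
}

// "&#x80;" names a C1 control, yet HTML decodes it as the euro sign.
$euro = html_c1_replacement_sketch(0x80);
```

A decoder would consult a table like this before falling back to the code point itself, which is how "&#x99;" ends up as a trade mark sign rather than an invisible control character.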
Warmly,
Dennis Snell