Re: [RFC] Decoding HTML and the Ambiguous Ampersand

From: Dennis Snell
Date: Mon, 19 Aug 2024 22:45:53 +0000
Subject: Re: [RFC] Decoding HTML and the Ambiguous Ampersand
References: 1
Groups: php.internals

Request: Send a blank email to [email protected] to get a copy of this message


> On Jul 9, 2024, at 4:55 PM, Dennis Snell <[email protected]> wrote:
> 
> Greetings all,
> 
> The html_entity_decode( … ENT_HTML5 … ) function has a number of issues that
> I’d like to correct.
> 
>  - It’s missing 720 of HTML5’s specified named character references.
>  - 106 of these are named character references which do not require a trailing semicolon,
> such as &acute.
>  - It’s unaware of the ambiguous ampersand rule, which determines when these 106 are decoded
> and when they are left as plaintext.
> 
> HTML5 asserts that the list of named character references will not expand in the future. It can
> be found authoritatively at the following URL:
> 
> https://html.spec.whatwg.org/entities.json
> 
> The ambiguous ampersand rule smooths over legacy behavior from before HTML5 where ampersands
> were not properly encoded in attribute values, specifically in URL values. For example, in a query
> string for a search, one might find ?q=dog&not=cat. The &not in
> that value would normally decode to U+00AC (¬), but since it’s in an attribute value, lacks the
> semicolon, and is followed by “=”, it is left as plaintext. Inside normal HTML markup it would
> transform into ?q=dog¬=cat. There are related nuances when numeric character references are
> found at the end of a string or boundary without the semicolon.
> 
> The function signature of html_entity_decode() does not currently allow for
> correcting this behavior. I’d like to propose an RFC or a bug fix which either extends the
> function (perhaps by adding a new flag like ENT_AMBIGUOUS_AMPERSAND) or preferably
> creates a new function. For the missing character references I wonder if it would be enough to add
> them to the list of default translatable references.
> 
> One challenge with the existing function is that the concept of the translation table stands in
> contrast with the fixed and static nature of HTML5’s replacement tables. A new function or set of
> functions could open up spec-compliant decoding while providing helpful methods that are necessary
> in many common server-side operations:
> 
>   - html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' )
>   - html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )
>   - html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )
> 
> These methods are handy for inspecting things like encoded attribute values in a
> memory-efficient and processing-efficient way, when it’s not necessary to decode the entire value.
> In common situations, one encounters data-URIs with potentially megabytes of image data and
> processing only the first few or tens of bytes can save a lot of overhead.
> 
> We’re exploring pure-PHP solutions to these problems in WordPress in attempts to improve the
> reliability and safety of handling HTML. I’d love to hear your thoughts and know if anyone is
> willing to work with me to create an RFC or directly propose patches. We’ve created a step
> function which allows finding the next character reference and decoding it separately, enabling some
> novel features like highlighting the character references in source text.
> 
> Should I propose an RFC for this?
> 
> Warmly,
> Dennis Snell
> Automattic Inc.

Thanks, everyone, for your feedback so far on the decode_html() RFC
[https://wiki.php.net/rfc/decode_html].

I’ve updated it, replacing the new constants with a new HtmlContext enum, and the
interface seems much nicer this way. I particularly like that PHP enforces passing a valid value,
rather than hoping that the right flag is used.
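
To make the enum point concrete, here is a minimal sketch of how a context enum can drive the
ambiguous ampersand rule described in my original message. Everything in it is an assumption for
illustration: SketchHtmlContext, its two cases, sketch_decode(), and the three-entry table are
hypothetical stand-ins, not the RFC's API and not the full HTML5 list. It needs PHP 8.1 or newer
for enums.

```php
<?php

// Hypothetical enum for illustration; the RFC's actual case names may differ.
enum SketchHtmlContext
{
    case Attribute;
    case Text;
}

/**
 * Toy decoder that only knows three of the legacy named references
 * (the ones that may appear without a trailing semicolon).
 */
function sketch_decode(SketchHtmlContext $context, string $raw): string
{
    // Tiny stand-in for the real table of named character references.
    $legacy = [
        'acute' => "\u{00B4}",
        'amp'   => '&',
        'not'   => "\u{00AC}",
    ];

    // Group 2 captures an optional semicolon. Group 3 sits inside a lookahead,
    // so it peeks at the next character without consuming it.
    $pattern = '/&(acute|amp|not)(;?)(?=([=A-Za-z0-9]?))/';

    $decoded = preg_replace_callback(
        $pattern,
        function (array $m) use ($legacy, $context): string {
            $hasSemicolon = $m[2] === ';';
            $next         = $m[3] ?? '';

            // Ambiguous ampersand rule: inside an attribute value, a reference
            // without a semicolon that is followed by "=" or an ASCII
            // alphanumeric is left as plaintext.
            if (!$hasSemicolon && $context === SketchHtmlContext::Attribute && $next !== '') {
                return $m[0];
            }

            return $legacy[$m[1]];
        },
        $raw
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $decoded ?? $raw;
}
```

With this, sketch_decode( SketchHtmlContext::Attribute, '?q=dog&not=cat' ) returns the string
untouched, while SketchHtmlContext::Text turns the &not into ¬. Passing a plain string such as
'attribute' as the first argument raises a TypeError, which is the enforcement I mean.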

Additionally, I added a section that I previously forgot, which highlights the source of the infamous
mojibake/gremlins. HTML has special rules for remapping the C1 control characters, as if they had
been stored or recorded for Windows-1252 (for example, &#x80; decodes to € rather than to U+0080).
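
The remapping can be sketched as a lookup table. The pairs below are the ones I understand the
HTML standard to list for numeric character references in the 0x80 to 0x9F range; the constant
and function names are hypothetical, and this is only an illustration, not the RFC's
implementation.

```php
<?php

// Replacements for numeric character references in the C1 control range.
// The values match what Windows-1252 stores at the same byte positions.
const SKETCH_C1_REMAP = [
    0x80 => 0x20AC, // EURO SIGN
    0x82 => 0x201A, // SINGLE LOW-9 QUOTATION MARK
    0x83 => 0x0192, // LATIN SMALL LETTER F WITH HOOK
    0x84 => 0x201E, // DOUBLE LOW-9 QUOTATION MARK
    0x85 => 0x2026, // HORIZONTAL ELLIPSIS
    0x86 => 0x2020, // DAGGER
    0x87 => 0x2021, // DOUBLE DAGGER
    0x88 => 0x02C6, // MODIFIER LETTER CIRCUMFLEX ACCENT
    0x89 => 0x2030, // PER MILLE SIGN
    0x8A => 0x0160, // LATIN CAPITAL LETTER S WITH CARON
    0x8B => 0x2039, // SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    0x8C => 0x0152, // LATIN CAPITAL LIGATURE OE
    0x8E => 0x017D, // LATIN CAPITAL LETTER Z WITH CARON
    0x91 => 0x2018, // LEFT SINGLE QUOTATION MARK
    0x92 => 0x2019, // RIGHT SINGLE QUOTATION MARK
    0x93 => 0x201C, // LEFT DOUBLE QUOTATION MARK
    0x94 => 0x201D, // RIGHT DOUBLE QUOTATION MARK
    0x95 => 0x2022, // BULLET
    0x96 => 0x2013, // EN DASH
    0x97 => 0x2014, // EM DASH
    0x98 => 0x02DC, // SMALL TILDE
    0x99 => 0x2122, // TRADE MARK SIGN
    0x9A => 0x0161, // LATIN SMALL LETTER S WITH CARON
    0x9B => 0x203A, // SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    0x9C => 0x0153, // LATIN SMALL LIGATURE OE
    0x9E => 0x017E, // LATIN SMALL LETTER Z WITH CARON
    0x9F => 0x0178, // LATIN CAPITAL LETTER Y WITH DIAERESIS
];

/**
 * Returns the code point that a numeric character reference should decode to,
 * applying the C1 replacement when one exists and passing everything else through.
 */
function sketch_remap_c1(int $codePoint): int
{
    return SKETCH_C1_REMAP[$codePoint] ?? $codePoint;
}
```

Here sketch_remap_c1( 0x80 ) returns 0x20AC (€) and sketch_remap_c1( 0x99 ) returns 0x2122 (™),
while 0x81 has no replacement in the table and is returned unchanged.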

Warmly,
Dennis Snell

Thread (12 messages)

Dennis Snell - Mon, 19 Aug 2024 22:45:53 +0000
Niels Dossche - Thu, 22 Aug 2024 22:01:47 +0000
Dennis Snell - Thu, 22 Aug 2024 23:02:13 +0000
Bruce Weirdan - Thu, 22 Aug 2024 23:32:57 +0000
Christoph M. Becker - Sat, 24 Aug 2024 12:47:43 +0000
Dennis Snell - Sat, 24 Aug 2024 20:34:40 +0000
Máté Kocsis - Sun, 25 Aug 2024 21:17:40 +0000
Dennis Snell - Sun, 25 Aug 2024 21:56:06 +0000
Jakob Givoni - Sat, 24 Aug 2024 19:56:21 +0000
Dennis Snell - Sat, 24 Aug 2024 20:31:17 +0000
Jakob Givoni - Sun, 25 Aug 2024 08:15:26 +0000
Dennis Snell - Sun, 25 Aug 2024 15:25:07 +0000

