Decoding HTML and the Ambiguous Ampersand

From: Dennis Snell Date: Wed, 10 Jul 2024 00:00:24 +0000

Subject: Decoding HTML and the Ambiguous Ampersand

Groups: php.internals


Greetings all,


The html_entity_decode( … ENT_HTML5 … ) function has a number of issues that I’d
like to correct.


 - It’s missing 720 of HTML5’s specified named character references.
 - 106 of these are named character references which do not require a trailing semicolon, such as &acute.
 - It’s unaware of the ambiguous ampersand rule, which allows these 106 in special circumstances.
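A small check of the semicolon behavior, as a sketch rather than a full test case. It only uses html_entity_decode() itself, and assumes a UTF-8 target encoding:

```php
<?php
// With the trailing semicolon the reference decodes to U+00B4 (ACUTE ACCENT).
$with_semicolon = html_entity_decode('&acute;', ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Without the semicolon the input is expected to come back unchanged,
// even though HTML5 lists "&acute" as a valid named character reference.
$without_semicolon = html_entity_decode('&acute', ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($with_semicolon === "\u{B4}");
var_dump($without_semicolon === '&acute');
```

A browser parsing the same text as HTML would decode both forms.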


HTML5 asserts that the list of named character references will not expand in the future. It can be
found authoritatively at the following URL:


https://html.spec.whatwg.org/entities.json
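For anyone who wants to verify the count, here is a minimal sketch that lists the names lacking a trailing semicolon. The helper name is hypothetical, and the inline sample is a three-entry excerpt written in the same shape as the published file, not the file itself:

```php
<?php
// Hypothetical helper: returns the names in an entities.json-style document
// that do not end in a semicolon.
function sketch_semicolonless_names(string $json): array
{
    $entities = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

    return array_values(array_filter(
        array_keys($entities),
        static fn (string $name): bool => substr($name, -1) !== ';'
    ));
}

// Small excerpt for illustration; the real file maps each name to
// "codepoints" and "characters" in the same way.
$sample = '{'
    . '"&acute":{"codepoints":[180],"characters":"\u00B4"},'
    . '"&acute;":{"codepoints":[180],"characters":"\u00B4"},'
    . '"&notin;":{"codepoints":[8713],"characters":"\u2209"}'
    . '}';

print_r(sketch_semicolonless_names($sample));
```

Passing the contents of the real file instead of $sample should yield the 106 names mentioned above.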


The ambiguous ampersand rule smooths over legacy behavior from before HTML5 where ampersands were
not properly encoded in attribute values, specifically in URL values. For example, in a query string
for a search, one might find ?q=dog&not=cat. The &not in that
value would normally decode to U+00AC (¬), but since it’s in an attribute value and is followed by
= (an ASCII alphanumeric has the same effect), it will be left as plaintext. Inside normal HTML
markup it would transform into ?q=dog¬=cat. There are related nuances when numeric character
references are found at the end of a string or boundary without the semicolon.
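To make the rule concrete, here is a rough sketch of the attribute-versus-data distinction. The function name and the 'attribute' | 'data' argument are assumptions for illustration, and the table holds only two names, so this is nowhere near a complete decoder:

```php
<?php
// Hypothetical helper, not an existing PHP function. The table holds only two of
// the names HTML5 accepts without a semicolon; a real decoder needs the full list
// and longest-match lookup (so that "&notin;" is not read as "&not" + "in;").
function sketch_decode_legacy_references(string $context, string $text): string
{
    $legacy = ['not' => "\u{AC}", 'acute' => "\u{B4}"];

    $decoded = preg_replace_callback(
        // Name, optional semicolon, and a peek at the following byte (not consumed).
        '/&(not|acute)(;?)(?=(.?))/s',
        static function (array $m) use ($context, $legacy): string {
            $whole     = $m[0];
            $name      = $m[1];
            $semicolon = $m[2];
            $next      = $m[3] ?? '';

            // In an attribute, a semicolon-less reference followed by "=" or an
            // ASCII alphanumeric is left as plain text.
            $is_ambiguous = $semicolon === '' && preg_match('/^[=A-Za-z0-9]$/', $next) === 1;
            if ($context === 'attribute' && $is_ambiguous) {
                return $whole;
            }

            return $legacy[$name];
        },
        $text
    );

    return $decoded ?? $text;
}

var_dump(sketch_decode_legacy_references('attribute', '?q=dog&not=cat'));
var_dump(sketch_decode_legacy_references('data', '?q=dog&not=cat'));
```

The first call leaves the query string untouched, while the second turns &not into ¬.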


The function signature of html_entity_decode() does not currently allow for correcting
this behavior. I’d like to propose an RFC or a bug fix which either extends the function (perhaps
by adding a new flag like ENT_AMBIGUOUS_AMPERSAND) or preferably creates a new
function. For the missing character references I wonder if it would be enough to add them to the
list of default translatable references.


One challenge with the existing function is that the concept of the translation table stands in
contrast with the fixed and static nature of HTML5’s replacement tables. A new function or set of
functions could open up spec-compliant decoding while providing helpful methods that are necessary
in many common server-side operations:


  - html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' )
  - html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )
  - html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )


These methods are handy for inspecting things like encoded attribute values in a memory-efficient
and processing-efficient way, when it’s not necessary to decode the entire value. It is common to
encounter data URIs carrying potentially megabytes of image data, and processing only the first
few bytes (or few tens of bytes) can save a lot of overhead.
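As a rough illustration of the early-exit idea, the following sketch checks a decoded prefix without decoding the whole value. The function name is hypothetical, it omits the context and encoding parameters from the proposed signature, and it leans on html_entity_decode() for each reference, so it inherits that function's limitations (semicolon-terminated references only):

```php
<?php
// Hypothetical sketch: does the decoded form of $raw_haystack start with $needle?
// Decoding stops as soon as the answer is known.
function sketch_html_text_starts_with(string $raw_haystack, string $needle): bool
{
    $at            = 0; // byte offset into the raw (still encoded) haystack
    $matched       = 0; // bytes of $needle matched so far
    $end           = strlen($raw_haystack);
    $needle_length = strlen($needle);

    while ($matched < $needle_length) {
        if ($at >= $end) {
            return false; // the haystack ran out before the needle did
        }

        // Next chunk: one semicolon-terminated reference, or a single raw byte.
        $chunk    = $raw_haystack[$at];
        $consumed = 1;
        if ($chunk === '&' && preg_match('/\G&#?[A-Za-z0-9]+;/', $raw_haystack, $m, 0, $at)) {
            $chunk    = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            $consumed = strlen($m[0]);
        }

        // Compare as much of the chunk as the remaining needle needs.
        $compare = min(strlen($chunk), $needle_length - $matched);
        if (substr($needle, $matched, $compare) !== substr($chunk, 0, $compare)) {
            return false;
        }

        $matched += $compare;
        $at      += $consumed;
    }

    return true;
}

var_dump(sketch_html_text_starts_with('d&#x61;ta:image/png;base64,AAAA', 'data:'));
```

Only the bytes needed to settle the comparison are touched, however long the rest of the value is.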


We’re exploring pure-PHP solutions to these problems in WordPress in an attempt to improve the
reliability and safety of handling HTML. I’d love to hear your thoughts and know if anyone is
willing to work with me to create an RFC or directly propose patches. We’ve created a step
function which allows finding the next character reference and decoding it separately, enabling some
novel features like highlighting the character references in source text.
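The step-function idea can be sketched roughly as follows. This is not the WordPress code; both function names are made up for illustration, and the lookup again relies on html_entity_decode(), so only semicolon-terminated references that it recognizes are found:

```php
<?php
// Hypothetical step function: finds the next character reference at or after
// $offset and returns its position, byte length, and decoded text, or null.
function sketch_next_character_reference(string $text, int $offset = 0): ?array
{
    while (preg_match('/&#?[A-Za-z0-9]+;/', $text, $m, PREG_OFFSET_CAPTURE, $offset)) {
        [$token, $at] = $m[0];
        $decoded = html_entity_decode($token, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        if ($decoded !== $token) {
            return ['at' => $at, 'length' => strlen($token), 'decoded' => $decoded];
        }
        $offset = $at + 1; // looked like a reference but is not one; keep scanning
    }

    return null;
}

// Example use: wrap each reference in brackets without decoding the source text.
function sketch_highlight_references(string $text): string
{
    $output = '';
    $offset = 0;
    while (null !== ($ref = sketch_next_character_reference($text, $offset))) {
        $output .= substr($text, $offset, $ref['at'] - $offset);
        $output .= '[' . substr($text, $ref['at'], $ref['length']) . ']';
        $offset  = $ref['at'] + $ref['length'];
    }

    return $output . substr($text, $offset);
}

echo sketch_highlight_references('a &lt; b &amp; c'), "\n";
```

Because each step reports where the reference sits in the raw text, callers can decode, skip, or annotate it as they see fit.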


Should I propose an RFC for this?


Warmly,
Dennis Snell
Automattic Inc.

