gh-69426: only unescape properly terminated character entities in attribute values #95215

@sissbruecker sissbruecker commented Jul 24, 2022

Fixes HTMLParser to only unescape named character references in attribute values if they are properly terminated.

According to the HTML5 spec, a named character reference in an attribute value that is not terminated by a semicolon should only be processed if it is not followed by an ASCII alphanumeric or an equals sign. So the following references should be unescaped:

  • &cent
  • &cent foo
  • &cent-foo

While the following should not:

  • &center
  • &cent=

This change adds unescaping logic specific to attribute values that should cover these cases.
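The rule above can be sketched roughly as follows. This is assembled from the fragments quoted in the review threads of this PR, not copied from the merged commit; in particular, checking the matched name against `html.entities.html5` is an assumption about what "exact match" means.

```python
import re
from html import unescape
from html.entities import html5

# Same pattern as the attr_charref regex quoted in the diff.
attr_charref = re.compile(
    r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*)[;=]?')

def replace_attr_charref(match):
    ref = match.group(0)
    # Numeric / hex char refs are always unescaped.
    if ref.startswith('&#'):
        return unescape(ref)
    # Assumption: "exact match" means the matched name (with or without
    # its trailing semicolon) is a key of html.entities.html5.
    if not ref.endswith('=') and ref[1:] in html5:
        return unescape(ref)
    # Anything else is left untouched.
    return ref

def unescape_attrvalue(s):
    return attr_charref.sub(replace_attr_charref, s)

for value in ['&cent', '&cent foo', '&cent-foo', '&center', '&cent=']:
    print(repr(value), '->', repr(unescape_attrvalue(value)))
```

Because the name part of the pattern is greedy, `&center` is matched as a whole and looked up as `center`, which is not a known entity, so it is left alone rather than being split into `&cent` plus `er`.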

Fixes: #69426

ghost commented Jul 24, 2022

All commit authors signed the Contributor License Agreement.

@@ -57,6 +58,26 @@
# </ and the tag name, so maybe this should be fixed
endtagfind = re.compile(r'</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)\s*>')

# Character reference processing logic specific to attribute values
# See: https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state
attr_charref = re.compile(r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*)[;=]?')
Contributor Author

This partially duplicates an existing Regex, but I was not able to reuse the existing one for this purpose.

Member

Since the issue only seems to affect named character references, is there a reason to include numeric charrefs too in this regex?

Contributor Author

Yes, the new _unescape_attrvalue is effectively a wrapper for html.unescape that only delegates to html.unescape if the attribute-specific conditions are met. Since we still want to unescape numeric and hex char refs in attributes, we need to include them in the regex.
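To make that concrete, here is a small check of what the quoted pattern picks up in a single pass. `matched_refs` is a hypothetical helper written for this illustration, not part of the PR.

```python
import re

# Pattern as quoted in the diff: numeric, hex and named references,
# optionally followed by ';' or '='.
attr_charref = re.compile(
    r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*)[;=]?')

def matched_refs(value):
    """Return every substring the substitution callback would receive."""
    return [m.group(0) for m in attr_charref.finditer(value)]

print(matched_refs('a&#34;b &#x22; c&quot=d'))
# ['&#34;', '&#x22;', '&quot=']
```

Since the attribute value goes through one `sub` call, anything the pattern does not match is never unescaped at all, which is why the numeric and hex alternatives have to be present.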

Member

It would be better to move this immediately after the definition of entityref and charref. If we change one regexp, we will not forget to change the other.

Contributor Author

Done

Comment on lines 350 to 360
-        expected = [('starttag', 'a', [('href', 'foo"zar')]),
+        expected = [('starttag', 'a', [('href', 'foo " zar')]),
                     ('data', 'a"z'), ('endtag', 'a')]
         for charref in charrefs:
-            self._run_check('<a href="foo{0}zar">a{0}z</a>'.format(charref),
+            self._run_check('<a href="foo {0} zar">a{0}z</a>'.format(charref),
                             expected, collector=collector())
-        # check charrefs at the beginning/end of the text/attributes
+        # check charrefs at the beginning/end of the text
         expected = [('data', '"'),
-                    ('starttag', 'a', [('x', '"'), ('y', '"X'), ('z', 'X"')]),
+                    ('starttag', 'a', []),
                     ('data', '"'), ('endtag', 'a'), ('data', '"')]
         for charref in charrefs:
-            self._run_check('{0}<a x="{0}" y="{0}X" z="X{0}">'
+            self._run_check('{0}<a>'
Contributor Author

Changed the existing tests to remove flawed assumptions about how the unescaping in attribute values should work.

Member

It might be better to remove all attribute-related checks from this test and move them to the next one.

Contributor Author

Done

Comment on lines 418 to 423
# do unescape char refs at beginning and end of text attributes
charrefs = ['&quot;', '&#34;', '&#x22;', '&quot', '&#34', '&#x22']
expected = [('starttag', 'a', [('x', '"'), ('y', '"-X'), ('z', 'X-"')]), ('endtag', 'a')]
for charref in charrefs:
self._run_check('<a x="{0}" y="{0}-X" z="X-{0}"></a>'.format(charref),
expected, collector=collector())
Contributor Author

Extracted this from test_convert_charrefs

@sissbruecker
Contributor Author

@ezio-melotti I see you are marked as code owner. Would there be any interest in moving ahead with this?

Member

@ezio-melotti ezio-melotti left a comment

Thanks for the PR!
I left a few inline comments, but if you prefer I could also make the suggested changes myself and push them to your branch.

    return ref

def unescape_attrvalue(s):
    return attr_charref.sub(replace_attr_charref, s)
Member

Both functions should be private, and their name prefixed by an _.

Contributor Author

Done

def replace_attr_charref(match):
    ref = match.group(0)
    # Numeric / hex char refs must always be unescaped
    if ref[1] == '#':
Member

Suggested change
-    if ref[1] == '#':
+    if ref.startswith('&#'):

I think this is clearer.

Contributor Author

Done

        return unescape(ref)
    # Named character / entity references must only be unescaped
    # if they are an exact match, and they are not followed by an equals sign
    terminates_with_equals = ref[-1:] == '='
Member

Suggested change
-    terminates_with_equals = ref[-1:] == '='
+    terminates_with_equals = ref.endswith('=')

Contributor Author

Done

expected = [('starttag', 'a',
[('href', 'https://example.com?foo¢=123')]),
('endtag', 'a')]
self._run_check('<a href="https://example.com?foo&cent;=123"></a>', expected, collector=collector())
Member

If possible, it would be better to match the style of the previous test, creating different lists of charrefs (e.g. valid, invalid, named, numeric, etc.) and add them in different places in the attribute (beginning, end, before an alnum/space/semicolon/equal).

Also try to keep the lines shorter than 80 chars (you can remove the initial part of the URLs, since they are not necessary).

Contributor Author

Thanks. Looking at it again, combining multiple cases in a single attribute is indeed hard to read. I restructured the test to have two scenarios:

  • terminated entity, numeric and hex char refs
  • unterminated entity char refs

Both include cases for start, middle, end, as well as followed by alphanumeric, non-alphanumeric and equals sign. I hope it's a bit clearer now.

Also updated formatting to respect the 80 char limit.
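For readers outside the CPython test suite, the list-driven layout discussed in this thread might look roughly like the sketch below. `AttrCollector` and `parse_attrs` are hypothetical helpers standing in for the suite's `_run_check` and `collector`; the charref list and expected values are the ones from the extracted test quoted earlier, where every reference sits at the start or end of the value or is followed by a non-alphanumeric.

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Hypothetical helper: records the attribute list of each start tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.append(attrs)

def parse_attrs(source):
    parser = AttrCollector()
    parser.feed(source)
    parser.close()
    return parser.attrs

# Terminated and unterminated spellings of the same character.
charrefs = ['&quot;', '&#34;', '&#x22;', '&quot', '&#34', '&#x22']
expected = [[('x', '"'), ('y', '"-X'), ('z', 'X-"')]]
for charref in charrefs:
    source = '<a x="{0}" y="{0}-X" z="X-{0}"></a>'.format(charref)
    assert parse_attrs(source) == expected, charref
```

Cases where an unterminated named reference is followed by an alphanumeric or an equals sign are deliberately left out of this sketch, since their result depends on whether the interpreter includes this fix.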

@sissbruecker
Contributor Author

Thanks for taking the time to review @ezio-melotti . I have addressed all comments. Could you please take another look when you find some time?

@kurtqq

kurtqq commented Jun 15, 2023

Ping @ezio-melotti on this one; it would be nice to get it fixed.

@serhiy-storchaka serhiy-storchaka self-requested a review May 6, 2025 19:17
Member

@serhiy-storchaka serhiy-storchaka left a comment

LGTM. 👍

Thank you for your contribution, @sissbruecker.

@@ -23,6 +24,7 @@

entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')
Member

I wonder why there are . and - symbols in the name here? It may not be related to this issue.
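As a quick illustration of the difference being asked about, the snippet below compares what the two quoted patterns capture as the reference name. `captured_name` is a hypothetical helper for this example only.

```python
import re

# Both patterns exactly as quoted in the diffs of this PR.
entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')
attr_charref = re.compile(
    r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*)[;=]?')

def captured_name(pattern, text):
    """Return group 1 of a match anchored at the start of text, or None."""
    m = pattern.match(text)
    return m.group(1) if m else None

sample = '&foo.bar-baz;'
print(captured_name(entityref, sample))     # foo.bar-baz
print(captured_name(attr_charref, sample))  # foo
```

So `entityref` accepts `.` and `-` inside a name, while the new `attr_charref` stops at the first non-alphanumeric character.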

@serhiy-storchaka serhiy-storchaka enabled auto-merge (squash) May 7, 2025 06:23
@serhiy-storchaka serhiy-storchaka added the "needs backport to 3.13" and "needs backport to 3.14" labels May 7, 2025
@serhiy-storchaka serhiy-storchaka merged commit 77b14a6 into python:main May 7, 2025
47 checks passed
@miss-islington-app

Thanks @sissbruecker for the PR, and @serhiy-storchaka for merging it 🌮🎉.. I'm working now to backport this PR to: 3.13, 3.14.
🐍🍒⛏🤖

@miss-islington-app

Sorry @sissbruecker and @serhiy-storchaka, I had trouble checking out the 3.14 backport branch.
Please retry by removing and re-adding the "needs backport to 3.14" label.
Alternatively, you can backport using cherry_picker on the command line.

cherry_picker 77b14a6d58e527f915966446eb0866652a46feb5 3.14

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 7, 2025
…er entities in attribute values (pythonGH-95215)

According to the HTML5 spec, named character references in attribute values
should only be processed if they are not followed by an ASCII alphanumeric,
or an equals sign.
(cherry picked from commit 77b14a6)

Co-authored-by: Sascha Ißbrücker <[email protected]>
https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state
@bedevere-app

bedevere-app bot commented May 7, 2025

GH-133586 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the "needs backport to 3.13" label May 7, 2025
@serhiy-storchaka serhiy-storchaka added and removed the "needs backport to 3.14" label May 8, 2025
@miss-islington-app

Thanks @sissbruecker for the PR, and @serhiy-storchaka for merging it 🌮🎉.. I'm working now to backport this PR to: 3.14.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 8, 2025
…er entities in attribute values (pythonGH-95215)

According to the HTML5 spec, named character references in attribute values
should only be processed if they are not followed by an ASCII alphanumeric,
or an equals sign.
(cherry picked from commit 77b14a6)

Co-authored-by: Sascha Ißbrücker <[email protected]>
https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state
@bedevere-app

bedevere-app bot commented May 8, 2025

GH-133704 is a backport of this pull request to the 3.14 branch.

@bedevere-app bedevere-app bot removed the "needs backport to 3.14" label May 8, 2025
serhiy-storchaka pushed a commit that referenced this pull request May 9, 2025
…ter entities in attribute values (GH-95215) (GH-133704)

According to the HTML5 spec, named character references in attribute values
should only be processed if they are not followed by an ASCII alphanumeric,
or an equals sign.
(cherry picked from commit 77b14a6)


https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state

Co-authored-by: Sascha Ißbrücker <[email protected]>
serhiy-storchaka pushed a commit that referenced this pull request May 9, 2025
…ter entities in attribute values (GH-95215) (GH-133586)

According to the HTML5 spec, named character references in attribute values
should only be processed if they are not followed by an ASCII alphanumeric,
or an equals sign.
(cherry picked from commit 77b14a6)


https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state

Co-authored-by: Sascha Ißbrücker <[email protected]>
Development

Successfully merging this pull request may close these issues.

HTMLParser handle_starttag replaces entity references in attribute value even without semicolon
5 participants