HTMLParser handle_starttag replaces entity references in attribute value even without semicolon #69426

Closed
frogcoder mannequin opened this issue Sep 26, 2015 · 2 comments · Fixed by #95215
Labels
  • 3.13: bugs and security fixes
  • 3.14: bugs and security fixes
  • 3.15: new features, bugs and security fixes
  • stdlib: Python modules in the Lib dir
  • type-bug: An unexpected behavior, bug, or error

Comments

frogcoder mannequin commented Sep 26, 2015

BPO 25239
Nosy @ezio-melotti
Files
  • parserentity.py: an example of the behavior described
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = None
    created_at = <Date 2015-09-26.16:46:39.294>
    labels = ['type-bug', 'library']
    title = 'HTMLParser handle_starttag replaces entity references in attribute value even without semicolon'
    updated_at = <Date 2015-09-26.17:06:12.433>
    user = 'https://bugs.python.org/frogcoder'

    bugs.python.org fields:

    activity = <Date 2015-09-26.17:06:12.433>
    actor = 'ezio.melotti'
    assignee = 'ezio.melotti'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2015-09-26.16:46:39.294>
    creator = 'frogcoder'
    dependencies = []
    files = ['40588']
    hgrepos = []
    issue_num = 25239
    keywords = []
    message_count = 2.0
    messages = ['251654', '251657']
    nosy_count = 2.0
    nosy_names = ['ezio.melotti', 'frogcoder']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'test needed'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue25239'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5', 'Python 3.6']


    frogcoder mannequin commented Sep 26, 2015

    The documentation for HTMLParser.handle_starttag states: "All entity references from html.entities are replaced in the attribute values." However, the parser also replaces any text that matches an ampersand followed by an entity name, even when the terminating semicolon is missing.

    For example, <a href="go?t=buy&currency=usd">foo</a> produces an href attribute value containing "t=buy¤cy=usd", because "curren" is the entity name for the currency sign.
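
A minimal way to reproduce this, assuming a small HTMLParser subclass that just records attributes (the class and helper names here are illustrative, not taken from the attached parserentity.py). The printed value depends on the Python version: interpreters without the fix show the currency sign, while fixed ones keep "&currency" intact.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records every (name, value) pair seen in start tags."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


def parse_attrs(markup):
    """Feed markup to a fresh parser and return the collected attributes."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.attrs


# The href path "go" is an assumed placeholder; only the query string matters.
attrs = parse_attrs('<a href="go?t=buy&currency=usd">foo</a>')
print(attrs)
```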

    @frogcoder frogcoder mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Sep 26, 2015

    ezio-melotti commented Sep 26, 2015

    This does indeed seem to be a bug. The relevant bit is at http://www.w3.org/TR/html5/syntax.html#consume-a-character-reference:

    If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or an alphanumeric ASCII character, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned. However, if this next character is in fact a "=" (U+003D) character, then this is a parse error, because some legacy user agents will misinterpret the markup in those cases.
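
The rule quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that was eventually merged: `unescape_attr_value` is a hypothetical helper, it handles named references only (numeric ones such as `&#164;` are left untouched), and it relies on the `html.entities.html5` table, which lists legacy names both with and without the trailing semicolon.

```python
import re
from html.entities import html5

# "&" followed by an alphanumeric name and an optional ";".
# The name is matched greedily, so an unterminated match can never be
# followed by another ASCII alphanumeric; only "=" needs an explicit check.
_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")


def unescape_attr_value(value):
    """Hypothetical sketch of the attribute-value rule for named references."""

    def replace(match):
        name, semicolon = match.groups()
        if semicolon:
            # Terminated reference: replace it only if the name is known.
            return html5.get(name + ";", match.group(0))
        # Unterminated reference: replace only an exact legacy name that
        # is not followed by "="; anything else stays as written.
        followed_by_equals = value.startswith("=", match.end())
        if name in html5 and not followed_by_equals:
            return html5[name]
        return match.group(0)

    return _NAMED_REF.sub(replace, value)


print(unescape_attr_value("go?t=buy&currency=usd"))
print(unescape_attr_value("go?t=buy&amp;currency=usd"))
```

Under this sketch, "&currency" is left alone because "curren" would be followed by the alphanumeric "c", while a properly terminated "&amp;" is still decoded.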

    Off the top of my head, this paragraph is not implemented in HTMLParser (and it should be).
    Also note that <a href="go?t=buy&currency=usd">foo</a> is not valid HTML: the & should have been escaped as &amp;.
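
For authors generating such markup, one way to produce the escaped form is `html.escape` from the standard library. The sketch below assumes an illustrative URL and a hypothetical `HrefCollector` class; with the ampersand escaped, the parser returns the original URL whether or not the interpreter includes the fix.

```python
from html import escape
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: records the value of every href attribute."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        self.hrefs.extend(value for name, value in attrs if name == "href")


# Assumed example URL; the path "go" is a placeholder.
url = "go?t=buy&currency=usd"

# escape() turns "&" into "&amp;" (and quotes into entities with quote=True).
markup = f'<a href="{escape(url, quote=True)}">foo</a>'
print(markup)  # <a href="go?t=buy&amp;currency=usd">foo</a>

parser = HrefCollector()
parser.feed(markup)
parser.close()
print(parser.hrefs)  # ['go?t=buy&currency=usd']
```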

    @ezio-melotti ezio-melotti self-assigned this Sep 26, 2015
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    sissbruecker added a commit to sissbruecker/cpython that referenced this issue Jul 24, 2022
    @serhiy-storchaka serhiy-storchaka added needs backport to 3.13 bugs and security fixes needs backport to 3.14 bugs and security fixes 3.13 bugs and security fixes 3.14 bugs and security fixes 3.15 new features, bugs and security fixes and removed needs backport to 3.13 bugs and security fixes needs backport to 3.14 bugs and security fixes labels May 7, 2025
    serhiy-storchaka pushed a commit that referenced this issue May 7, 2025
    …ities in attribute values (GH-95215)
    
    According to the HTML5 spec, named character references in attribute values
    should only be processed if they are not followed by an ASCII alphanumeric,
    or an equals sign.
    
    https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state
    @github-project-automation github-project-automation bot moved this from Todo to Done in html.parser issues May 7, 2025
    miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 7, 2025
    …er entities in attribute values (pythonGH-95215)
    
    According to the HTML5 spec, named character references in attribute values
    should only be processed if they are not followed by an ASCII alphanumeric,
    or an equals sign.
    (cherry picked from commit 77b14a6)
    
    Co-authored-by: Sascha Ißbrücker <[email protected]>
    https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state
    miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 8, 2025
    …er entities in attribute values (pythonGH-95215)
    
    According to the HTML5 spec, named character references in attribute values
    should only be processed if they are not followed by an ASCII alphanumeric,
    or an equals sign.
    (cherry picked from commit 77b14a6)
    
    Co-authored-by: Sascha Ißbrücker <[email protected]>
    https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state
    serhiy-storchaka pushed a commit that referenced this issue May 9, 2025
    …ter entities in attribute values (GH-95215) (GH-133704)
    
    According to the HTML5 spec, named character references in attribute values
    should only be processed if they are not followed by an ASCII alphanumeric,
    or an equals sign.
    (cherry picked from commit 77b14a6)
    
    
    https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state
    
    Co-authored-by: Sascha Ißbrücker <[email protected]>
    serhiy-storchaka pushed a commit that referenced this issue May 9, 2025
    …ter entities in attribute values (GH-95215) (GH-133586)
    
    According to the HTML5 spec, named character references in attribute values
    should only be processed if they are not followed by an ASCII alphanumeric,
    or an equals sign.
    (cherry picked from commit 77b14a6)
    
    
    https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state
    
    Co-authored-by: Sascha Ißbrücker <[email protected]>