gh-135462: Fix quadratic complexity in processing special input in HTMLParser #135464

Merged

Conversation

serhiy-storchaka
Member

@serhiy-storchaka serhiy-storchaka commented Jun 13, 2025

@serhiy-storchaka serhiy-storchaka force-pushed the htmlparser-quadratic-complexity branch from 77d5125 to c87cb49 Compare June 13, 2025 13:05
@serhiy-storchaka
Member Author

The solution has been written in a way that simplifies backporting. There are other issues, and the code will be refactored in new versions after fixing them.

Comment on lines +724 to +727
    @support.requires_resource('cpu')
    def test_eof_no_quadratic_complexity(self):
        # Each of these examples used to take about an hour.
        # Now they take a fraction of a second.
Member

If they now take a fraction of a second, is there a reason to require the cpu resource?

My understanding is that:

  • with the requires_resource('cpu') decorator:
    • this test would normally be skipped
    • in case of regression, we won't notice unless the cpu resource is enabled
  • without the decorator:
    • this test is always run and completes quickly
    • in case of regression, the test will time out or fail and expose the problem

Member Author

In total, they take 1.3 seconds on my computer. All other tests take 0.1-0.2 seconds. It is a waste of time to run this test several times for every update of any PR. Some buildbots are slower than my computer.

I think that it is enough to run this test only on the fastest buildbots. We already use requires_resource('cpu') in similar tests.
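
To make the trade-off discussed in this thread concrete, here is a minimal sketch of how a resource gate behaves. It is not the real test.support.requires_resource: ENABLED_RESOURCES, requires_resource, Demo, and run_demo are hypothetical stand-ins built only on unittest, and the real decorator is driven by regrtest's resource options rather than a module-level set.

```python
import functools
import unittest

# Hypothetical registry of enabled resources (an assumption for this sketch;
# the real test runner keeps its own state).
ENABLED_RESOURCES = set()


def requires_resource(name):
    """Skip the decorated test unless `name` is in ENABLED_RESOURCES."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if name not in ENABLED_RESOURCES:
                raise unittest.SkipTest(f"resource {name!r} is not enabled")
            return func(*args, **kwargs)
        return wrapper
    return decorator


class Demo(unittest.TestCase):
    @requires_resource('cpu')
    def test_expensive(self):
        # Placeholder for a slow check such as the complexity test.
        self.assertEqual(sum(range(1000)), 499500)


def run_demo():
    """Run Demo once and return the unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(Demo)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With the resource disabled the test is reported as skipped, so a regression would go unnoticed on that run; with it enabled the test executes and can fail or time out.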

('data', '\n<img src="URL>'),
('comment', '/img'),
('endtag', 'html<')])
('data', '\n')])
Member

It seems that now everything after the first </html> is ignored (except the \n). This is technically a change in behavior, which should be fine if the new behavior matches the HTML5 specs, but maybe should be noted in the whatsnew.

There also seem to be other minor changes in behavior that -- if they follow the specs -- might not need to be documented (a generic "Some additional invalid constructs are now handled according to the HTML5 specs." might be enough)

Member Author

In this case, a double-quoted attribute value is never closed. This is the eof-in-tag parse error: https://html.spec.whatwg.org/multipage/parsing.html#parse-error-eof-in-tag

I have updated the NEWS entry.
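
For readers who want to observe the end-of-file handling on their own interpreter, a small event collector along these lines may help. EventCollector and collect are names made up for this sketch (they are not the helpers from the test suite), and the events printed for the unterminated inputs depend on whether the Python build includes this fix, so no expected output is given for them.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record parser callbacks as tuples (hypothetical helper for this sketch)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('starttag', tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(('endtag', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_comment(self, data):
        self.events.append(('comment', data))


def collect(source):
    """Feed `source`, signal end of input with close(), return the events."""
    parser = EventCollector()
    parser.feed(source)
    parser.close()
    return parser.events


# An attribute value whose double quote is never closed (EOF inside a tag).
print(collect('<img src="URL>'))
# A comment that is never closed before EOF.
print(collect('<!--unterminated'))
```

Comparing the printed events before and after upgrading is one way to see the change described in the NEWS entry.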

Contributor

@sethmlarson sethmlarson left a comment

Approach seems sensible to me.

@serhiy-storchaka serhiy-storchaka merged commit 6eb6c5d into python:main Jun 13, 2025
43 checks passed
@miss-islington-app

Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.9, 3.10, 3.11, 3.12, 3.13, 3.14.
🐍🍒⛏🤖

@serhiy-storchaka serhiy-storchaka deleted the htmlparser-quadratic-complexity branch June 13, 2025 16:57
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jun 13, 2025
… in HTMLParser (pythonGH-135464)

End-of-file errors are now handled according to the HTML5 specs --
comments and declarations are automatically closed, tags are ignored.
(cherry picked from commit 6eb6c5d)

Co-authored-by: Serhiy Storchaka <[email protected]>
@bedevere-app

bedevere-app bot commented Jun 13, 2025

GH-135481 is a backport of this pull request to the 3.14 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.14 bugs and security fixes label Jun 13, 2025
@miss-islington-app

Sorry, @serhiy-storchaka, I could not cleanly backport this to 3.12 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 6eb6c5dbfb528bd07d77b60fd71fd05d81d45c41 3.12

@bedevere-app

bedevere-app bot commented Jun 13, 2025

GH-135482 is a backport of this pull request to the 3.13 branch.

@miss-islington-app

Sorry, @serhiy-storchaka, I could not cleanly backport this to 3.11 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 6eb6c5dbfb528bd07d77b60fd71fd05d81d45c41 3.11

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Jun 13, 2025
@miss-islington-app

Sorry, @serhiy-storchaka, I could not cleanly backport this to 3.10 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 6eb6c5dbfb528bd07d77b60fd71fd05d81d45c41 3.10

@miss-islington-app

Sorry, @serhiy-storchaka, I could not cleanly backport this to 3.9 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 6eb6c5dbfb528bd07d77b60fd71fd05d81d45c41 3.9

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this pull request Jun 13, 2025
…l input in HTMLParser (pythonGH-135464)

End-of-file errors are now handled according to the HTML5 specs --
comments and declarations are automatically closed, tags are ignored.
(cherry picked from commit 6eb6c5d)

Co-authored-by: Serhiy Storchaka <[email protected]>
@bedevere-app

bedevere-app bot commented Jun 13, 2025

GH-135483 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 only security fixes label Jun 13, 2025
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this pull request Jun 13, 2025
…l input in HTMLParser (pythonGH-135464)

End-of-file errors are now handled according to the HTML5 specs --
comments and declarations are automatically closed, tags are ignored.
(cherry picked from commit 6eb6c5d)

Co-authored-by: Serhiy Storchaka <[email protected]>
@bedevere-app

bedevere-app bot commented Jun 13, 2025

GH-135484 is a backport of this pull request to the 3.11 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.11 only security fixes label Jun 13, 2025
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this pull request Jun 13, 2025
…l input in HTMLParser (pythonGH-135464)

End-of-file errors are now handled according to the HTML5 specs --
comments and declarations are automatically closed, tags are ignored.
(cherry picked from commit 6eb6c5d)

Co-authored-by: Serhiy Storchaka <[email protected]>
@bedevere-app

bedevere-app bot commented Jun 13, 2025

GH-135485 is a backport of this pull request to the 3.10 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.10 only security fixes label Jun 13, 2025
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this pull request Jun 13, 2025
… input in HTMLParser (pythonGH-135464)

End-of-file errors are now handled according to the HTML5 specs --
comments and declarations are automatically closed, tags are ignored.
(cherry picked from commit 6eb6c5d)

Co-authored-by: Serhiy Storchaka <[email protected]>
@bedevere-app

bedevere-app bot commented Jun 13, 2025

GH-135486 is a backport of this pull request to the 3.9 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.9 only security fixes label Jun 13, 2025
serhiy-storchaka added a commit that referenced this pull request Jun 13, 2025
…t in HTMLParser (GH-135464) (GH-135482)

End-of-file errors are now handled according to the HTML5 specs --
comments and declarations are automatically closed, tags are ignored.
(cherry picked from commit 6eb6c5d)

Co-authored-by: Serhiy Storchaka <[email protected]>
serhiy-storchaka added a commit that referenced this pull request Jun 13, 2025
…t in HTMLParser (GH-135464) (GH-135481)

End-of-file errors are now handled according to the HTML5 specs --
comments and declarations are automatically closed, tags are ignored.
(cherry picked from commit 6eb6c5d)

Co-authored-by: Serhiy Storchaka <[email protected]>
Labels
type-security A security issue