Skip to content

regrtest didn't respect the timeout when running test_subprocess on AMD64 Windows11 3.x #90872

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
vstinner opened this issue Feb 10, 2022 · 13 comments
Labels
3.11 only security fixes 3.12 only security fixes 3.13 bugs and security fixes OS-windows type-bug An unexpected behavior, bug, or error

Comments

@vstinner
Copy link
Member

vstinner commented Feb 10, 2022

BPO 46716
Nosy @vstinner, @jkloth, @eryksun
PRs
  • gh-90872: Handle negative timeouts for wait on Windows #32079
  • bpo-46716: Use strict unsigned long conversion for DWORD values #32081
  • Files
  • process.py: Reproducer script for test_timeout()
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2022-02-10.23:09:17.270>
    labels = ['tests', '3.11']
    title = "regrtest didn't respect the timeout when running test_subprocess on AMD64 Windows11 3.x"
    updated_at = <Date 2022-03-23.19:22:09.443>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2022-03-23.19:22:09.443>
    actor = 'jkloth'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Tests']
    creation = <Date 2022-02-10.23:09:17.270>
    creator = 'vstinner'
    dependencies = []
    files = ['50623']
    hgrepos = []
    issue_num = 46716
    keywords = ['patch']
    message_count = 9.0
    messages = ['413028', '413033', '413104', '413165', '413172', '413189', '413215', '413219', '413649']
    nosy_count = 4.0
    nosy_names = ['vstinner', 'jkloth', 'jeremy.kloth', 'eryksun']
    pr_nums = ['32079', '32081']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue46716'
    versions = ['Python 3.11']

    Linked PRs

    @vstinner
    Copy link
    Member Author

    regrtest was run with --timeout 900 on AMD64 Windows11 3.x: timeout confirmed by "(timeout: 15 min, worker timeout: 20 min)" log. But then test_subprocss was only stopped after "4 hour 55 min".

    If the regrtest main process is able to display an update 2x per minutes (every 30 sec), it should be able to stop the test worker process (test_subprocess) after 20 minutes. How is it possible that the process took so long?

    There are multiple guards:

    • (1) in the worker process: _runtest() calls faulthandler.dump_traceback_later(ns.timeout, exit=True)
    • (2) libregrtest/runtest_mp.py: TestWorkerProcess._run_process() thread uses popen.communicate(timeout=self.timeout)
    • (3) faulthandler.dump_traceback_later(MAIN_PROCESS_TIMEOUT, exit=True): kill the parent process if it is blocked for longer than 5 minutes

    Guards (1) and (2) didn't work.

    Maybe the parent process should implement a 4th guard using the 20 minute timeout: almost 5 hours is way longer than 20 minutes!

    C:\buildbot\3.x.kloth-win11\build>"C:\buildbot\3.x.kloth-win11\build\PCbuild\amd64\python_d.exe" -u -Wd -E -bb -m test -uall -rwW --slowest --timeout 1200 --fail-env-changed -j1 -j2 --junit-xml test-results.xml -j40 --timeout 900
    == CPython 3.11.0a5+ (main, Feb 10 2022, 04:03:24) [MSC v.1930 64 bit (AMD64)]
    == Windows-10-10.0.22000-SP0 little-endian
    == cwd: C:\buildbot\3.x.kloth-win11\build\build\test_python_5732�
    == CPU count: 32
    == encodings: locale=cp1252, FS=utf-8
    Using random seed 6320493
    0:00:00 Run tests in parallel using 40 child processes (timeout: 15 min, worker timeout: 20 min)
    (...)
    0:03:13 load avg: 0.76 [431/432] test_multiprocessing_spawn passed (3 min 13 sec) -- running: test_subprocess (3 min 11 sec)
    0:03:43 load avg: 0.46 running: test_subprocess (3 min 41 sec)
    (...)
    4:53:17 load avg: 0.00 running: test_subprocess (4 hour 53 min)
    4:53:47 load avg: 0.00 running: test_subprocess (4 hour 53 min)
    4:54:17 load avg: 0.09 running: test_subprocess (4 hour 54 min)
    4:54:47 load avg: 0.35 running: test_subprocess (4 hour 54 min)
    4:55:17 load avg: 0.48 running: test_subprocess (4 hour 55 min)
    4:55:46 load avg: 0.50 [432/432/1] test_subprocess timed out (4 hour 55 min) (4 hour 55 min)

    == Tests result: FAILURE ==

    397 tests OK.

    10 slowest tests:

    • test_subprocess: 4 hour 55 min
    • test_multiprocessing_spawn: 3 min 13 sec
    • test_concurrent_futures: 2 min 46 sec
    • test_peg_generator: 2 min 32 sec
    • test_compileall: 1 min 34 sec
    • test_unparse: 1 min 31 sec
    • test_distutils: 1 min 23 sec
    • test_asyncio: 1 min 22 sec
    • test_tokenize: 1 min 8 sec
    • test_io: 1 min 5 sec

    1 test failed:
    test_subprocess

    @vstinner vstinner added 3.11 only security fixes tests Tests in the Lib/test dir labels Feb 10, 2022
    @vstinner vstinner changed the title regrtest didn't respect the timeout on AMD64 Windows11 3.x regrtest didn't respect the timeout when running test_subprocess on AMD64 Windows11 3.x Feb 10, 2022
    @vstinner vstinner changed the title regrtest didn't respect the timeout on AMD64 Windows11 3.x regrtest didn't respect the timeout when running test_subprocess on AMD64 Windows11 3.x Feb 10, 2022
    @jkloth
    Copy link
    Contributor

    jkloth commented Feb 11, 2022

    The test only completed once I purposefully terminated the offending Python process. The only identifying information I noticed was the command-line of -c "while True: pass", indicating it was stuck in either
    test_call_timeout() or test_timeout() in test_subprocess.py.

    Something to note is that Windows does not, by default, have a concept of process trees whereas terminating a parent automatically kills the children. Eryk Sun may have additional ideas on how this desired behavior could be accomplished.

    @eryksun
    Copy link
    Contributor

    eryksun commented Feb 11, 2022

    test_call_timeout() or test_timeout() in test_subprocess.py.

    These tests don't override the standard files, and they only spawn a single child with no descendants. I don't see why this would hang. It shouldn't be a problem with leaked pipe handles (see bpo-43346). It probably will need to be diagnosed by attaching a debugger, or offline with a dump file.

    process trees whereas terminating a parent automatically kills the children

    One can use a job object to manage a child process and all of its descendants, including resource usage and termination. A process can belong to multiple job objects in Windows 8+, which is required by Python 3.9+.

    For reliability, the child has to be created in a suspended state via CREATE_SUSPENDED. It can be resumed with ResumeThread() after adding it to the job with AssignProcessToJobObject().

    You can try to terminate a job cleanly, which is similar in effect to sending SIGTERM to a process group in POSIX. In Windows, this has to be approached differently for console vs graphical processes.

    To handle console apps, assuming the child inherits the current console, spawn it as a new process group via creationflags=subprocess.CREATE_NEW_PROCESS_GROUP. You can request an exit by sending a Ctrl+Break event to the group via os.kill(p.pid, signal.CTRL_BREAK_EVENT) [1]. The request might be ignored, but typically the default handler is called, which calls ExitProcess().

    To handle GUI apps, assuming the child inherits the current desktop (usually "WinSta0\Default"), first enumerate the top-level and message-only windows on the current desktop via EnumWindows() and FindWindowExW(). Use GetWindowThreadProcessId() to filter the list to include only windows that belong to the job. Post WM_CLOSE to each window in the job. A process might ignore a request to close. It could keep the window open or continue running in the background.

    After an internal timeout, you can call TerminateJobObject() to kill any process in the job that remains alive. This is a forced and abrupt termination, which is similar to sending SIGKILL to a process group in POSIX.

    ---

    [1] This usage of os.kill() is what we're stuck with. Rightfully, we should be using os.killpg(p.pid, signal.SIGBREAK) or os.kill(-p.pid, signal.SIGBREAK) (note the negative pid value).

    @jkloth
    Copy link
    Contributor

    jkloth commented Feb 13, 2022

    I've been able locally to reproduce the test_subprocess hang. The responsible function is subprocess.run(). The test case, test_timeout(), uses a small timeout value (0.0001), which, when given enough load, can cause the run() call to hang.

    A judicious use of prints in subprocess.py, reveals that the timeout passed to wait() ends up being negative. That value, once cast to a DWORD, ultimately causes a very long wait (0xfffffff2, in my testing).

    It does seem that only the Windows Popen._wait() cannot handle negative timeout values, so the fix should be as simple as coercing the timeout values to >= 0.

    @eryksun
    Copy link
    Contributor

    eryksun commented Feb 13, 2022

    the fix should be as simple as coercing the timeout values to >= 0.

    Popen._remaining_time() should return max(endtime - _time(), 0).

    Popen._wait() should raise OverflowError if the timeout is too large for the implementation. In Windows, the upper limit in milliseconds is _winapi.INFINITE - 1 (about 49.7 days). It's important to only allow the timeout in milliseconds to be _winapi.INFINITE when timeout is None.

    The DWORD converter in _winapi needs to subclass unsigned_long_converter. The current implementation based on the legacy format unit "k" is too lenient. Negative values and values that are too large should fail. I updated it to use the following definition:

        class DWORD_converter(unsigned_long_converter):
            type = 'DWORD'

    This produces the following improved results:

        >>> h = _winapi.GetCurrentProcess()
        >>> _winapi.WaitForSingleObject(h, _winapi.INFINITE + 1)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        OverflowError: Python int too large to convert to C unsigned long
    
        >>> _winapi.WaitForSingleObject(h, -1)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        ValueError: value must be positive

    @jeremykloth
    Copy link
    Mannequin

    jeremykloth mannequin commented Feb 13, 2022

    > the fix should be as simple as coercing the timeout values to >= 0.

    Popen._remaining_time() should return max(endtime - _time(), 0).

    That was my first initial instinct as well, however, that change would
    also affect more of the Popen behavior and need a much more thorough
    investigation of the POSIX side of Popen.

    Popen._wait() should raise OverflowError if the timeout is too large for the implementation. In Windows, the upper limit in milliseconds is _winapi.INFINITE - 1 (about 49.7 days). It's important to only allow the timeout in milliseconds to be _winapi.INFINITE when timeout is None.

    I agree.

    The DWORD converter in _winapi needs to subclass unsigned_long_converter. The current implementation based on the legacy format unit "k" is too lenient. Negative values and values that are too large should fail.

    Whilst I agree this is a correct solution, I fear the potential
    3rd-party breakage alone should bump this to its own issue.

    I believe that this then leads to the following action items for this issue:

    1. modify Windows Popen._wait() to raise on out of bounds values [< 0
      or >= INFINITE]
    2. cap Popen._remaining_time() return value to >= 0
    3. change _winapi DWORD converter be unsigned long
    4. use Job objects to group Windows processes for termination

    Have I missed anything? I should be able to knock out PRs for these today.

    Jeremy Kloth

    @eryksun
    Copy link
    Contributor

    eryksun commented Feb 14, 2022

    I fear the potential 3rd-party breakage alone should bump
    this to its own issue.

    The change to the DWORD converter in _winapi should only be in 3.11+. If this causes problems for other projects, they're probably depending on undefined behavior in the standard library. No third-party package or application should use _winapi directly.

    1. modify Windows Popen._wait() to raise on out of bounds
      values [< 0 or >= INFINITE]

    I think raising ValueError would be best at this level. For example:

                if timeout is None:
                    timeout_millis = _winapi.INFINITE
                else:
                    timeout_millis = int(timeout * 1000)
                    if timeout_millis >= _winapi.INFINITE:
                        raise ValueError("'timeout' exceeds the platform limit")
                    if timeout_millis < 0:
                        raise ValueError("'timeout' must be non-negative")

    @eryksun
    Copy link
    Contributor

    eryksun commented Feb 14, 2022

    1. use Job objects to group Windows processes for termination

    I think a separate issue should be created for this enhancement.

    _winapi wrappers would be needed for CreateJobObjectW(), SetInformationJobObject(), AssignProcessToJobObject(), TerminatejobObject(), and ResumeThread(), plus the constant CREATE_SUSPENDED. I'd also prefer to make related changes to send_signal(), which would require GetConsoleProcessList() and GenerateConsoleCtrlEvent().

    A new Popen option would be needed to configure whether the job allows descendants to break away via the process creation flag CREATE_BREAKAWAY_FROM_JOB. This should be allowed by default.

    ---

    send_signal(): SIGKILL, SIGTERM, SIBREAK, SIGINT

    Support Popen.kill(group=False) and Popen.terminate(group=False) on all platforms as Popen.send_signal(signal.SIGKILL, group=group) and Popen.send_signal(signal.SIGTERM, group=group).

    The Universal CRT (ucrt) in Windows doesn't define SIGKILL. Even when it's not defined by the platform, signal.SIGKILL should always be defined, preferably as 9 (unused by ucrt), else as an unused value in the range up to NSIG, else as NSIG + 1.

    The group keyword-only option broadens the scope to the process group or job. A process is a group leader if it was created with the flag CREATE_NEW_PROCESS_GROUP (save self._creationflags) and its process ID is in GetConsoleProcessList().

    For SIGKILL, always use forced termination. For SIGTERM, use forced termination either if group is false or if group is true and the process is not a group leader. To force termination, call TerminateJobObject(self._job_handle, 1) if group is true, else TerminateProcess(self._handle, 1).

    For SIGTERM, SIGBREAK, and SIGINT, call GenerateConsoleCtrlEvent() if group is true and the process is a group leader. For SIGTERM and SIGBREAK, send CTRL_BREAK_EVENT. For SIGINT, send CTRL_C_EVENT.

    Behavior to deprecate:

    When group is false and sig is CTRL_C_EVENT (0) or CTRL_BREAK_EVENT (1), send_signal() always calls os.kill(). This legacy behavior tries to handle a process as if it's a process group, and it combines POSIX kill() with Windows API constants. subprocess should distance itself from the fundamental problems with os.kill() in Windows.

    @vstinner
    Copy link
    Member Author

    It does seem that only the Windows Popen._wait() cannot handle negative timeout values, so the fix should be as simple as coercing the timeout values to >= 0.

    Oh. This function should maybe raise an exception if the timeout is negative, and ther caller must replace negative timeout with zero.

    A judicious use of prints in subprocess.py, reveals that the timeout passed to wait() ends up being negative. That value, once cast to a DWORD, ultimately causes a very long wait (0xfffffff2, in my testing).

    This sounds dangerous and must be fixed. Python must not convert negative values to very large positive values.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @vstinner
    Copy link
    Member Author

    No activity for 1 year, I just close the issue.

    @serhiy-storchaka serhiy-storchaka added 3.12 only security fixes 3.13 bugs and security fixes type-bug An unexpected behavior, bug, or error and removed tests Tests in the Lib/test dir labels Feb 23, 2024
    @serhiy-storchaka
    Copy link
    Member

    The problem is not only in tests, there is a bug in the code. Maybe more than one bug.

    vstinner added a commit to vstinner/cpython that referenced this issue Mar 19, 2024
    On Windows, subprocess.Popen.wait() no longer calls
    WaitForSingleObject() with a negative timeout: pass 0 ms if the
    timeout is negative.
    vstinner added a commit to vstinner/cpython that referenced this issue Mar 19, 2024
    On Windows, subprocess.Popen.wait() no longer calls
    WaitForSingleObject() with a negative timeout: pass 0 ms if the
    timeout is negative.
    @vstinner
    Copy link
    Member Author

    The problem is not only in tests, there is a bug in the code. Maybe more than one bug.

    I wrote PR gh-116989 to avoid passing a negative timeout to WaitForSingleObject() in Popen.wait().

    vstinner added a commit that referenced this issue Mar 19, 2024
    On Windows, subprocess.Popen.wait() no longer calls
    WaitForSingleObject() with a negative timeout: pass 0 ms if the
    timeout is negative.
    vstinner added a commit to vstinner/cpython that referenced this issue Mar 19, 2024
    …hon#116989)
    
    On Windows, subprocess.Popen.wait() no longer calls
    WaitForSingleObject() with a negative timeout: pass 0 ms if the
    timeout is negative.
    
    (cherry picked from commit 27cf3ed)
    @vstinner
    Copy link
    Member Author

    Let me bet that the fix 27cf3ed fixed the root issue. If the issue comes back, please open a new issue (referring to this one).

    To change DWORD behavior, please see the discussion: https://discuss.python.org/t/conversion-between-python-and-c-integers/44117

    vstinner added a commit that referenced this issue Mar 19, 2024
    …16989) (#117002)
    
    gh-90872: Fix subprocess.Popen.wait() for negative timeout (#116989)
    
    On Windows, subprocess.Popen.wait() no longer calls
    WaitForSingleObject() with a negative timeout: pass 0 ms if the
    timeout is negative.
    
    (cherry picked from commit 27cf3ed)
    vstinner added a commit that referenced this issue Mar 19, 2024
    …16989) (#117003)
    
    gh-90872: Fix subprocess.Popen.wait() for negative timeout (#116989)
    
    On Windows, subprocess.Popen.wait() no longer calls
    WaitForSingleObject() with a negative timeout: pass 0 ms if the
    timeout is negative.
    
    (cherry picked from commit 27cf3ed)
    vstinner added a commit to vstinner/cpython that referenced this issue Mar 20, 2024
    …hon#116989)
    
    On Windows, subprocess.Popen.wait() no longer calls
    WaitForSingleObject() with a negative timeout: pass 0 ms if the
    timeout is negative.
    adorilson pushed a commit to adorilson/cpython that referenced this issue Mar 25, 2024
    …hon#116989)
    
    On Windows, subprocess.Popen.wait() no longer calls
    WaitForSingleObject() with a negative timeout: pass 0 ms if the
    timeout is negative.
    diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
    …hon#116989)
    
    On Windows, subprocess.Popen.wait() no longer calls
    WaitForSingleObject() with a negative timeout: pass 0 ms if the
    timeout is negative.
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes 3.12 only security fixes 3.13 bugs and security fixes OS-windows type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants