regrtest didn't respect the timeout when running test_subprocess on AMD64 Windows11 3.x #90872
regrtest was run with --timeout 900 on AMD64 Windows11 3.x; the timeout is confirmed by the "(timeout: 15 min, worker timeout: 20 min)" log line. But test_subprocess was only stopped after "4 hour 55 min". If the regrtest main process is able to display an update twice per minute (every 30 seconds), it should be able to stop the test worker process (test_subprocess) after 20 minutes. How is it possible that the process took so long? There are multiple guards:
Guards (1) and (2) didn't work. Maybe the parent process should implement a 4th guard using the 20 minute timeout (a rough sketch follows this comment): almost 5 hours is way longer than 20 minutes!

C:\buildbot\3.x.kloth-win11\build>"C:\buildbot\3.x.kloth-win11\build\PCbuild\amd64\python_d.exe" -u -Wd -E -bb -m test -uall -rwW --slowest --timeout 1200 --fail-env-changed -j1 -j2 --junit-xml test-results.xml -j40 --timeout 900

== Tests result: FAILURE ==

397 tests OK.

10 slowest tests:

1 test failed:
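A minimal sketch of what such a fourth, parent-side guard could look like, assuming the worker is an ordinary subprocess.Popen object; run_worker and WORKER_TIMEOUT are illustrative names, not regrtest's actual API:

import subprocess

WORKER_TIMEOUT = 20 * 60  # worker timeout: 20 minutes, in seconds

def run_worker(cmd):
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=WORKER_TIMEOUT)
    except subprocess.TimeoutExpired:
        # Parent-side guard: if the worker ignores its own timeout,
        # kill it here instead of letting it run for hours.
        proc.kill()
        proc.wait()
        raise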
The test only completed once I purposefully terminated the offending Python process. The only identifying information I noticed was the command line of

Something to note is that Windows does not, by default, have a concept of process trees in which terminating a parent automatically kills its children. Eryk Sun may have additional ideas on how this desired behavior could be accomplished.
These tests don't override the standard files, and they only spawn a single child with no descendants. I don't see why this would hang. It shouldn't be a problem with leaked pipe handles (see bpo-43346). It probably will need to be diagnosed by attaching a debugger, or offline with a dump file.
One can use a job object to manage a child process and all of its descendants, including resource usage and termination. A process can belong to multiple job objects in Windows 8+, which is required by Python 3.9+. For reliability, the child has to be created in a suspended state via CREATE_SUSPENDED. It can be resumed with ResumeThread() after adding it to the job with AssignProcessToJobObject().

You can try to terminate a job cleanly, which is similar in effect to sending SIGTERM to a process group in POSIX. In Windows, this has to be approached differently for console vs. graphical processes.

To handle console apps, assuming the child inherits the current console, spawn it as a new process group via creationflags=subprocess.CREATE_NEW_PROCESS_GROUP. You can request an exit by sending a Ctrl+Break event to the group via os.kill(p.pid, signal.CTRL_BREAK_EVENT) [1]. The request might be ignored, but typically the default handler is called, which calls ExitProcess().

To handle GUI apps, assuming the child inherits the current desktop (usually "WinSta0\Default"), first enumerate the top-level and message-only windows on the current desktop via EnumWindows() and FindWindowExW(). Use GetWindowThreadProcessId() to filter the list to include only windows that belong to the job. Post WM_CLOSE to each window in the job. A process might ignore a request to close. It could keep the window open or continue running in the background.

After an internal timeout, you can call TerminateJobObject() to kill any process in the job that remains alive. This is a forced and abrupt termination, which is similar to sending SIGKILL to a process group in POSIX.

---

[1] This usage of os.kill() is what we're stuck with. Rightfully, we should be using os.killpg(p.pid, signal.SIGBREAK) or os.kill(-p.pid, signal.SIGBREAK) (note the negative pid value).
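A short sketch of the console-app path just described, assuming the child shares the current console; the 10-second grace period is an arbitrary choice (Windows-only):

import os
import signal
import subprocess

# Spawn the child in its own process group so the Ctrl+Break event
# targets only it and the descendants that stay in that group.
p = subprocess.Popen(
    ["python", "-c", "import time; time.sleep(600)"],
    creationflags=subprocess.CREATE_NEW_PROCESS_GROUP,
)

# Politely request an exit; the default handler calls ExitProcess(),
# but the child can install its own handler and ignore the request.
os.kill(p.pid, signal.CTRL_BREAK_EVENT)

try:
    p.wait(timeout=10)
except subprocess.TimeoutExpired:
    # Forced, abrupt termination of this process only (not descendants).
    p.kill()
    p.wait()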
I've been able to reproduce the test_subprocess hang locally. The responsible function is subprocess.run(). The test case, test_timeout(), uses a small timeout value (0.0001), which, given enough load, can cause the run() call to hang. A judicious use of prints in subprocess.py reveals that the timeout passed to wait() ends up being negative. That value, once cast to a DWORD, ultimately causes a very long wait (0xfffffff2, in my testing). It seems that only the Windows Popen._wait() cannot handle negative timeout values, so the fix should be as simple as coercing the timeout values to >= 0.
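A tiny demonstration of the wraparound described above, plus the suggested coercion; ctypes is only used here to mimic the C-level DWORD conversion, and remaining_time mirrors the private Popen._remaining_time() helper but is not the actual patch:

import ctypes
import time

# A remaining time of -14 ms, reinterpreted as an unsigned 32-bit DWORD,
# becomes 0xfffffff2 ms -- a wait of roughly 49.7 days.
print(hex(ctypes.c_ulong(-14).value))  # 0xfffffff2

def remaining_time(endtime):
    # Suggested coercion: never return a negative remaining time.
    return max(endtime - time.monotonic(), 0)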
Popen._remaining_time() should return max(endtime - _time(), 0). Popen._wait() should raise OverflowError if the timeout is too large for the implementation. In Windows, the upper limit in milliseconds is

The DWORD converter in _winapi needs to subclass unsigned_long_converter. The current implementation based on the legacy format unit "k" is too lenient. Negative values and values that are too large should fail. I updated it to use the following definition:

class DWORD_converter(unsigned_long_converter):
    type = 'DWORD'

This produces the following improved results:

>>> h = _winapi.GetCurrentProcess()
>>> _winapi.WaitForSingleObject(h, _winapi.INFINITE + 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C unsigned long
>>> _winapi.WaitForSingleObject(h, -1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: value must be positive
That was my initial instinct as well; however, that change would
I agree.
Whilst I agree this is a correct solution, I fear the potential

I believe that this then leads to the following action items for this issue:
Have I missed anything? I should be able to knock out PRs for these today.
The change to the DWORD converter in _winapi should only be in 3.11+. If this causes problems for other projects, they're probably depending on undefined behavior in the standard library. No third-party package or application should use _winapi directly.
I think raising ValueError would be best at this level. For example:

if timeout is None:
    timeout_millis = _winapi.INFINITE
else:
    timeout_millis = int(timeout * 1000)
    if timeout_millis >= _winapi.INFINITE:
        raise ValueError("'timeout' exceeds the platform limit")
    if timeout_millis < 0:
        raise ValueError("'timeout' must be non-negative")
I think a separate issue should be created for this enhancement. _winapi wrappers would be needed for CreateJobObjectW(), SetInformationJobObject(), AssignProcessToJobObject(), TerminateJobObject(), and ResumeThread(), plus the constant CREATE_SUSPENDED. I'd also prefer to make related changes to send_signal(), which would require GetConsoleProcessList() and GenerateConsoleCtrlEvent().

A new Popen option would be needed to configure whether the job allows descendants to break away via the process creation flag CREATE_BREAKAWAY_FROM_JOB. This should be allowed by default.

---

send_signal(): SIGKILL, SIGTERM, SIGBREAK, SIGINT

Support Popen.kill(group=False) and Popen.terminate(group=False) on all platforms as Popen.send_signal(signal.SIGKILL, group=group) and Popen.send_signal(signal.SIGTERM, group=group).

The Universal CRT (ucrt) in Windows doesn't define SIGKILL. Even when it's not defined by the platform, signal.SIGKILL should always be defined, preferably as 9 (unused by ucrt), else as an unused value in the range up to NSIG, else as NSIG + 1. The

For SIGKILL, always use forced termination. For SIGTERM, use forced termination either if

For SIGTERM, SIGBREAK, and SIGINT, call GenerateConsoleCtrlEvent() if

Behavior to deprecate: When
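Since the _winapi wrappers listed above don't exist yet, the following is a rough, Windows-only ctypes sketch of the job-object flow. It skips CREATE_SUSPENDED, so there is a brief race in which the child could spawn a process before being assigned to the job, and error handling is omitted:

import ctypes
import subprocess

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.CreateJobObjectW.restype = ctypes.c_void_p
kernel32.OpenProcess.restype = ctypes.c_void_p
PROCESS_SET_QUOTA = 0x0100  # access rights needed by AssignProcessToJobObject
PROCESS_TERMINATE = 0x0001

def run_in_job(args, timeout):
    job = kernel32.CreateJobObjectW(None, None)
    proc = subprocess.Popen(args)
    handle = kernel32.OpenProcess(
        PROCESS_SET_QUOTA | PROCESS_TERMINATE, False, proc.pid)
    kernel32.AssignProcessToJobObject(job, handle)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Kill the child and every descendant still in the job,
        # comparable to sending SIGKILL to a POSIX process group.
        kernel32.TerminateJobObject(job, 1)
        proc.wait()
    finally:
        kernel32.CloseHandle(handle)
        kernel32.CloseHandle(job)
    return proc.returncode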
Oh. This function should maybe raise an exception if the timeout is negative, and the caller should replace a negative timeout with zero.
This sounds dangerous and must be fixed. Python must not convert negative values to very large positive values.
No activity for 1 year, so I'm just closing the issue.
The problem is not only in the tests; there is a bug in the code. Maybe more than one bug.
On Windows, subprocess.Popen.wait() no longer calls WaitForSingleObject() with a negative timeout: pass 0 ms if the timeout is negative.
I wrote PR gh-116989 to avoid passing a negative timeout to WaitForSingleObject() in Popen.wait().
…hon#116989) On Windows, subprocess.Popen.wait() no longer calls WaitForSingleObject() with a negative timeout: pass 0 ms if the timeout is negative. (cherry picked from commit 27cf3ed)
I bet that the fix 27cf3ed addressed the root issue. If the issue comes back, please open a new issue (referring to this one). To change the DWORD behavior, please see the discussion: https://discuss.python.org/t/conversion-between-python-and-c-integers/44117