-
-
Notifications
You must be signed in to change notification settings - Fork 33.2k
gh-139109: A new tracing JIT compiler frontend for CPython #140310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Closing because this is triggering some UB in CI and wasting resources. Will reopen once I fix it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is exciting stuff.
I've done a preliminary review. More to come later.
@Fidget-Spinner do you have any ideas as to why the go benchmark is 20% slower?
Objects/codeobject.c
Outdated
PyMem_Free(co_extra); | ||
} | ||
#ifdef _Py_TIER2 | ||
_Py_JITTracer_InvalidateDependency(tstate, self); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct, but it made me curious.
OOI, did you just spot this, or did the JIT manage to end up running code from a deallocated code object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main problem isn't running deallocated code. The problem was that the JIT actually traced then tried to compile code that happened to deallocate while tracing. Thus leading to segfaults while optimizing.
This came up in a few tests from the test suite.
Python/optimizer_analysis.c
Outdated
if (length == 0) { | ||
return length; | ||
} | ||
optimize_uops( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may still want to prune the trace in the optimizer. For example, if the optimizer's confidence drops too low.
So leave optimize_uops
returning the length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current code is wrong. The optimizer never errors. So length can never be 0.
The optimizer currently just returns the length of the unoptimized trace if it doesn't optimize. So this entire code is dead code.
However, I'll leave it there in case you need it in the future.
Python/ceval_macros.h
Outdated
DISPATCH_TABLE_VAR = TRACING_DISPATCH_TABLE; | ||
# define LEAVE_TRACING() \ | ||
DISPATCH_TABLE_VAR = DISPATCH_TABLE; | ||
# define BAIL_TRACING_NO_DISPATCH() \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've already got too many macros here, and it makes the code really hard to follow.
It also creates code bloat as we replicate all this code 256 times.
At the very least, factor out as much as you can into helper functions.
Almost everything in this macro after LEAVE_TRACING()
could go in a helper function.
} | ||
#endif | ||
|
||
/* _PyEval_EvalFrameDefault is too large to optimize for speed with PGO on MSVC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never more true 🙂
} while (0) | ||
|
||
#define GOTO_TIER_ONE(TARGET) \ | ||
#define GOTO_TIER_ONE(TARGET, SHOULD_CONTINUE_TRACING) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SHOULD_CONTINUE_TRACING
doesn't seem to be used anywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used by _COLD_EXIT
to tell the interpreter to resume tracing upon exiting to tier 1. This is what allows it to form the side exit trace.
Python/optimizer.c
Outdated
ADD_TO_TRACE(_SET_IP, 0, (uintptr_t)instr, target); | ||
DPRINTF(2, "%p %d: %s(%d) %d\n", old_code, target, _PyOpcode_OpName[opcode], oparg, progress_needed); | ||
|
||
bool needs_guard_ip = _PyOpcode_NeedsGuardIp[opcode] && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we fix up this table, so we don't need all the special cases afterwards?
@markshannon I suspect the greatest slowdown is due to the code being rather branchy. Our side exits are very costly right now still. I am planning to fix this thru #140434. Other sources of slowdown: we even trace through CALL_ALLOC_AND_ENTER_INIT (simple object initialization and creation) right now. The optimizer can't deal with that, so it stops optimizing there. I have another branch that adds this to the uop optimizer. Finally, note that I found that the current design slows down the base computed goto interpreter by a lot!!! (Roughly 6% pyperformance it seems). I suspect it's because we trace when we PGO, so we have no choice but to do a normal call to another interpreter when do computed goto rather than the dispatch table idea. Alternatively, just mandate tail calling for JIT builds. |
This PR changes the current JIT model from trace projection to trace recording. Benchmarking: 0.6% slower pyperformance versus TC+current JIT, However, 70% faster Richards on the most improved benchmark versus the current JIT. https://github.com/facebookexperimental/free-threading-benchmarking/blob/main/results/bm-20251020-3.15.0a1%2B-ec2971f-CLANG%2CJIT/bm-20251020-vultr-x86_64-Fidget%252dSpinner-tracing_jit-3.15.0a1%2B-ec2971f-vs-base.md
This new JIT frontend is already able to record/execute significantly more instructions than the previous JIT frontend. In this PR, we are now able to record through custom dunders, simple object creation, generators, etc. None of these were done by the old JIT frontend. Some custom dunders uops were discovered to be broken as part of this work gh-140277
Some tests for the optimizers are disabled for now not because the optimizer doesn't work, but that it's actually unrolling recursion, thus leading to different results in the trace!
The full test suite passes on my system.
Not close enough to a holiday, so no poem this time!
Pros:
optimizer.c
is now significantly simpler, as we don't have to do strange things to recover the bytecode from a trace.Cons:
Design:
_DYNAMIC_EXIT
. This new instruction will cause a re-trace when hot enough starting from the target instruction.