-
Notifications
You must be signed in to change notification settings - Fork 11
Replace HazardPointer with RefCountGuard to fix race condition #310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
2c1c81a to
816a930
Compare
This commit implements Phase 1 of CallTraceStorage optimization: thread-local caching of HazardPointer slots to reduce allocation overhead under high thread contention. ## Implementation Add fast-path HazardPointer allocation using cached slot numbers stored in ProfiledThread. On first allocation, threads probe for a free slot and cache the result. Subsequent allocations reuse the cached slot, avoiding probe overhead. Key changes: - Add HazardPointer(resource, slot) constructor for cached fast-path - Add slot() accessor to retrieve allocated slot number - Modify CallTraceStorage::put() to check for cached slot before allocation - Cache slot in ProfiledThread after first successful allocation ## Performance Impact Benchmark results (M1, JMH with 3/3/3 configuration): - 1 thread: +0.053% (6137.249 → 6140.478 ops/s) - statistically identical - 32 threads: +11.6% improvement (60,647 → 67,655 ops/s) - 64 threads: +2.0% improvement (65,077 → 66,385 ops/s) **Conclusion**: Zero overhead at typical thread counts, significant gains at high contention (32+ threads). This optimization breaks the contention bottleneck on systems with limited cores. ## Methodology Note Initial testing showed false -4.4% overhead due to unequal JMH configurations (1/1/2 vs 3/3/3). Re-testing with identical configurations confirmed zero overhead. This highlights the importance of rigorous benchmarking methodology. ## Signal Safety Implementation maintains signal-safety guarantees: - ProfiledThread::currentSignalSafe() never allocates - All operations use lock-free atomics - Graceful degradation if ProfiledThread unavailable (nullptr check) ## Additional Artifacts This commit includes: - JMH benchmark suite for CallTraceStorage (baseline, quick, slot exhaustion) - Python analysis script for benchmark result processing - Comprehensive documentation of optimization phases and findings - JMH infrastructure (agents, slash commands) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
816a930 to
fa64ff3
Compare
rkennke
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, except a question in one place. Thanks (especially also for all the documentation)!
|
|
||
| // Fast-path constructor using pre-allocated slot (thread-local caching optimization) | ||
| HazardPointer::HazardPointer(CallTraceHashTable* resource, int slot) : _active(true), _my_slot(slot) { | ||
| // CRITICAL ORDERING: Set bitmap bit BEFORE storing pointer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if I understand correctly, but it sounds to me like you want to do it the other way around: store the pointer first, and only then store the bitmap. Only this will guarantee that other threads see the pointer after they observed the bitmap bit. Or what am I missing?
4145f9a to
6b6a5eb
Compare
Remove the HazardPointer implementation due to a race condition in the bitmap-pointer split-update protocol. The hazard pointer design required two separate atomic operations: (1) storing the pointer and (2) setting the bitmap bit. This created a window where the scanner could observe an inconsistent state and incorrectly reclaim memory still in use. RefCountGuard eliminates this issue via a pointer-first protocol where the count field acts as a single atomic activation barrier: - Constructor: Store pointer FIRST, then increment count - Destructor: Decrement count FIRST, then clear pointer - Scanner: Check count first (if 0, slot is inactive) This ordering provably eliminates all race condition windows across three exhaustive scenarios (activation, post-activation, deactivation). Performance impact: Equivalent (within 0.25% measurement noise) - 1 thread: 6,139.0 vs 6,134.7 ops/s (-0.07%) - 8 threads: 49,039.2 vs 49,034.1 ops/s (-0.01%) - 32 threads: 95,902.9 vs 95,690.1 ops/s (-0.22%) - Thread churn: <0.2% difference across all scenarios Code changes: - Removed HazardSlot and HazardPointer classes (~400 lines) - Removed USE_REFCOUNT_GUARD conditional compilation - Updated callTraceHashTable.cpp to use RefCountGuard - Merged duplicate CallTraceStorage.md documentation - Added comprehensive RefCountGuard correctness proof to architecture docs Tests: 139/140 passing (99.3% pass rate, 1 pre-existing timing failure) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
6b6a5eb to
22f7359
Compare
Benchmarks [x86_64 memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 2 performance regressions! Performance is the same for 13 metrics, 23 unstable metrics.
|
Benchmarks [x86_64 cpu]Parameters
See matching parameters
SummaryFound 0 performance improvements and 4 performance regressions! Performance is the same for 11 metrics, 23 unstable metrics.
|
Benchmarks [aarch64 wall]Parameters
See matching parameters
SummaryFound 0 performance improvements and 2 performance regressions! Performance is the same for 15 metrics, 21 unstable metrics.
|
Benchmarks [x86_64 alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 3 performance regressions! Performance is the same for 12 metrics, 23 unstable metrics.
|
Benchmarks [aarch64 cpu]Parameters
See matching parameters
SummaryFound 0 performance improvements and 2 performance regressions! Performance is the same for 14 metrics, 22 unstable metrics.
|
Benchmarks [x86_64 wall]Parameters
See matching parameters
SummaryFound 0 performance improvements and 2 performance regressions! Performance is the same for 12 metrics, 24 unstable metrics.
|
Benchmarks [aarch64 cpu,wall]Parameters
See matching parameters
SummaryFound 0 performance improvements and 2 performance regressions! Performance is the same for 15 metrics, 21 unstable metrics.
|
Benchmarks [x86_64 cpu,wall]Parameters
See matching parameters
SummaryFound 0 performance improvements and 2 performance regressions! Performance is the same for 13 metrics, 23 unstable metrics.
|
Benchmarks [aarch64 alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 2 performance regressions! Performance is the same for 14 metrics, 22 unstable metrics.
|
Benchmarks [x86_64 memleak,alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 2 performance regressions! Performance is the same for 14 metrics, 22 unstable metrics.
|
Benchmarks [x86_64 cpu,wall,alloc,memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 5 performance regressions! Performance is the same for 9 metrics, 24 unstable metrics.
|
Benchmarks [aarch64 memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 2 performance regressions! Performance is the same for 14 metrics, 22 unstable metrics.
|
Benchmarks [aarch64 cpu,wall,alloc,memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 2 performance regressions! Performance is the same for 14 metrics, 22 unstable metrics.
|
Benchmarks [aarch64 memleak,alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 2 performance regressions! Performance is the same for 15 metrics, 21 unstable metrics.
|
Summary
Replaces HazardPointer with RefCountGuard for lock-free memory reclamation in CallTraceStorage. The original HazardPointer implementation had a race condition in its bitmap-pointer split-update protocol that could lead to use-after-free bugs under heavy contention.
Critical Bug Fixed
Race Condition in HazardPointer: The bitmap-pointer approach required two separate atomic operations:
global_hazard_list[slot].pointeroccupied_bitmap[word]The Problem: A thread could be preempted between these operations, creating a window where:
Solution: RefCountGuard with Pointer-First Protocol
RefCountGuard eliminates the race by using the count field as a single atomic activation barrier:
Proof of Correctness
The pointer-first protocol is provably race-free across all possible interleavings:
Scenario 1: Scanner runs between constructor steps
count=0(step 2 not yet executed)Scenario 2: Scanner runs after constructor completes
count>0(step 2 completed)Scenario 3: Scanner runs between destructor steps
count=0(step 1 completed)Key Invariant: The scanner checks
countfirst:count == 0→ slot inactive (safe to skip)count > 0→ slot active (pointer guaranteed visible)There is no window where the scanner observes inconsistent state.
Performance Impact
Zero performance overhead - RefCountGuard is equivalent to HazardPointer (within measurement noise):
All differences fall within statistical noise (<0.25%). Confidence intervals overlap significantly.
Code Changes
HazardSlotandHazardPointerclasses (~400 lines)USE_REFCOUNT_GUARDconditional compilationcallTraceHashTable.cppto useRefCountGuard::waitForAllRefCountsToClear()CallTraceStorage.mddocumentation filesMemory Savings
Architecture Documentation
Added comprehensive documentation to
docs/architecture/CallTraceStorage.md:Testing
ContextWallClockTest(unrelated to this change)Benefits
🤖 Generated with Claude Code