@pandurangpatil commented Jul 28, 2025

Summary by Bito

This pull request enhances logging across the dataflow analysis components, including the `ExtendedCfgNode`, `Engine`, and `HeldTaskCompletion` classes. It introduces detailed logging for task management, sources, sinks, and deduplication, improving observability and facilitating debugging, performance monitoring, and optimization.

pandurangpatil and others added 3 commits July 24, 2025 23:10
…letion system

Addresses performance issues where dataflow analysis gets stuck in HeldTaskCompletion.completeHeldTasks()
for 10+ hours, particularly in large Python codebases with complex interprocedural dependencies.

## Key Changes

### HeldTaskCompletion.scala
- Added detailed logging for fixed-point iteration progress and timing
- Circuit breaker protection: max 1000 iterations to prevent infinite loops
- Performance warnings for slow iterations (>1 minute) and operations (>30 seconds)
- Sample held task information logging (sink types, call depths, source paths)
- Memory usage estimation and collection size monitoring
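The iteration logging and circuit breaker described above might look roughly like this sketch; the `step` function, the state type, and the threshold values are placeholders, not Joern's actual API:

```scala
// Hedged sketch of fixed-point iteration logging with a circuit breaker.
// `step` and the Set[Int] state are illustrative stand-ins for Joern's
// held-task state; thresholds mirror the values named in the commit message.
object FixedPointLoggingSketch {
  val MaxIterations   = 1000    // circuit breaker: cap the fixed-point loop
  val SlowIterationMs = 60000L  // warn when a single iteration exceeds 1 minute

  def completeHeldTasks(initial: Set[Int], step: Set[Int] => Set[Int]): Set[Int] = {
    var state     = initial
    var iteration = 0
    var changed   = true
    while (changed && iteration < MaxIterations) {
      val start = System.currentTimeMillis()
      val next  = step(state)
      changed = next != state
      state = next
      iteration += 1
      val elapsed = System.currentTimeMillis() - start
      if (elapsed > SlowIterationMs)
        println(s"WARN: iteration $iteration took ${elapsed}ms (state size ${state.size})")
      else
        println(s"DEBUG: iteration $iteration done in ${elapsed}ms")
    }
    if (iteration == MaxIterations)
      println(s"WARN: circuit breaker tripped after $MaxIterations iterations")
    state
  }
}
```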

### deduplicateTableEntries() Enhancement
- Granular timing breakdown for groupBy, sortBy, and tie-breaking operations
- Large collection size warnings (>10,000 entries) with performance impact analysis
- Identification of largest groups causing hash computation bottlenecks
- Reduction ratio tracking and memory usage estimates
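A minimal sketch of that granular timing breakdown, assuming a simplified entry type and tie-breaking by a `weight` field (both placeholders for Joern's actual table entries):

```scala
// Hedged sketch of per-phase timing in deduplication: measure groupBy,
// tie-breaking, and sortBy separately to find the bottleneck. The Entry
// type and the 10,000-entry warning threshold mirror the commit message
// but are otherwise assumptions.
object DedupTimingSketch {
  final case class Entry(key: String, weight: Int)

  def deduplicate(entries: Seq[Entry]): (Seq[Entry], Map[String, Long]) = {
    def timed[A](body: => A): (A, Long) = {
      val t0 = System.nanoTime()
      val r  = body
      (r, (System.nanoTime() - t0) / 1000000)
    }
    val (groups, groupByMs) = timed(entries.groupBy(_.key))
    // Tie-break inside each group: keep the entry with the highest weight.
    val (deduped, tieBreakMs) = timed(groups.values.map(_.maxBy(_.weight)).toSeq)
    val (sorted, sortByMs)    = timed(deduped.sortBy(_.key))
    val timings = Map("groupBy" -> groupByMs, "tieBreak" -> tieBreakMs, "sortBy" -> sortByMs)
    if (entries.size > 10000)
      println(s"WARN: large dedup input (${entries.size} entries); timings=$timings")
    (sorted, timings)
  }
}
```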

### Engine.scala
- Enhanced backwards() method logging with source-sink context
- Task submission ratio tracking (held vs executed tasks)
- Sample source/sink node information for debugging problematic combinations
- Performance warnings for slow analysis (>2 minutes)

### ExtendedCfgNode.scala
- Query scale warnings for large source×sink combinations (>100,000)
- Sample source/sink logging for identifying problematic node combinations
- Total analysis timing and result count tracking
- Early warning system for potentially expensive queries

## Performance Monitoring Features

- **Multi-level thresholds**: Debug, Info, Warn levels for different performance characteristics
- **Circuit breakers**: Prevent runaway processes with configurable limits
- **Memory monitoring**: Estimate memory usage for large collections
- **Progress tracking**: Monitor convergence in fixed-point iterations
- **Bottleneck identification**: Pinpoint exact operations causing slowdowns

## Debugging Capabilities

- Identify which specific source-sink combinations cause performance issues
- Track held task processing patterns and result set sizes
- Monitor memory usage patterns during deduplication operations
- Analyze iteration convergence behavior in pathological cases

This logging system provides the visibility needed to diagnose and resolve the exponential
performance degradation observed in large-scale dataflow analysis scenarios.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Value classes cannot have instance fields, so the logger was moved to a companion object.
This resolves the compilation error:
"Value classes may not define non-parameter field"

Changes:
- Removed instance logger field from ExtendedCfgNode value class
- Added companion object with static logger
- Updated all logger references to use ExtendedCfgNode.logger
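The constraint and its fix can be sketched as follows. This is illustrative only: the class here wraps a plain `String` and uses a println-based logger, whereas the real `ExtendedCfgNode` wraps a CFG node and uses an slf4j logger:

```scala
// Hedged sketch of the value-class fix: a value class (extends AnyVal) may
// not define non-parameter fields, so the logger lives in the companion
// object and methods reference it statically.
object ExtendedCfgNodeSketch {
  // Stand-in for an org.slf4j.Logger instance.
  val logger: String => Unit = msg => println(s"[ExtendedCfgNode] $msg")
}

final class ExtendedCfgNodeSketch(val label: String) extends AnyVal {
  def describe: String = {
    // Static access via the companion object; no instance field needed.
    ExtendedCfgNodeSketch.logger(s"describing $label")
    s"node($label)"
  }
}
```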

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
… node logging

This commit addresses the 10+ hour performance bottleneck in Python repository scanning
by implementing comprehensive parallelization and enhanced debugging capabilities.

Key improvements:
- Remove circuit breaker from HeldTaskCompletion, as analysis showed the bottleneck was in deduplicateTableEntries(), not the main loop
- Add multi-level parallelization: parallel table deduplication, parallel groupBy operations, and parallel group processing
- Implement hash key caching to avoid repeated expensive calculations during deduplication
- Add configurable performance thresholds via system properties for optimal tuning
- Replace unhelpful class name/ID logging with meaningful source code context (variable names, code snippets, file locations)
- Add comprehensive performance monitoring with timing logs and bottleneck identification
- Enhance memory usage warnings for large collection processing
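The hash key caching idea can be sketched like this, with an illustrative entry type and cost model standing in for Joern's dataflow paths:

```scala
// Hedged sketch of hash key caching during deduplication: compute each
// entry's expensive key once via getOrElseUpdate and reuse it, instead of
// recomputing inside every groupBy pass. Entry and expensiveKey are
// placeholders, not Joern's actual types.
import scala.collection.mutable

object HashKeyCacheSketch {
  final case class Entry(fields: List[String])

  private val cache = mutable.HashMap.empty[Entry, Int]

  // Stand-in for an expensive hash over a dataflow path.
  private def expensiveKey(e: Entry): Int = e.fields.map(_.hashCode).sum

  def cachedKey(e: Entry): Int = cache.getOrElseUpdate(e, expensiveKey(e))

  def groupByCachedKey(entries: Seq[Entry]): Map[Int, Seq[Entry]] =
    entries.groupBy(cachedKey)
}
```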

Performance features:
- Configurable thresholds: joern.dataflow.parallel.table.threshold, joern.dataflow.parallel.dedup.threshold, joern.dataflow.parallel.groups.threshold
- Intelligent processing mode selection (PARALLEL vs SEQUENTIAL) based on data size
- Enhanced logging shows meaningful context: Identifier'username' [username] @ user.py:42
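Reading a threshold from a system property and picking the processing mode might look like this sketch; the property name comes from the commit message, but the default value and the selection logic are assumptions:

```scala
// Hedged sketch of system-property-driven threshold tuning and the
// PARALLEL vs SEQUENTIAL mode selection described above. The 10,000
// default is illustrative, not Joern's actual default.
object ParallelModeSketch {
  def threshold(prop: String, default: Int): Int =
    sys.props.get(prop).flatMap(_.toIntOption).getOrElse(default)

  def processingMode(collectionSize: Int): String = {
    val limit = threshold("joern.dataflow.parallel.dedup.threshold", 10000)
    if (collectionSize >= limit) "PARALLEL" else "SEQUENTIAL"
  }
}
```

Run with, e.g., `-Djoern.dataflow.parallel.dedup.threshold=5000` to lower the cutoff for a particular codebase.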

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>