@pandurangpatil commented Jul 28, 2025

Summary by Bito

This pull request enhances logging across the dataflow analysis components, including the `ExtendedCfgNode`, `Engine`, and `HeldTaskCompletion` classes. It introduces detailed logging for task management, sources, sinks, and deduplication, improving observability and facilitating debugging, performance monitoring, and optimization.

pandurangpatil and others added 3 commits July 24, 2025 23:10
…letion system

Addresses performance issues where dataflow analysis gets stuck in HeldTaskCompletion.completeHeldTasks()
for 10+ hours, particularly in large Python codebases with complex interprocedural dependencies.

## Key Changes

### HeldTaskCompletion.scala
- Added detailed logging for fixed-point iteration progress and timing
- Circuit breaker protection: max 1000 iterations to prevent infinite loops
- Performance warnings for slow iterations (>1 minute) and operations (>30 seconds)
- Sample held task information logging (sink types, call depths, source paths)
- Memory usage estimation and collection size monitoring
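The iteration logging and circuit breaker described above might look roughly like this sketch; the `step` function, the state type, and the threshold values are placeholders, not Joern's actual API:

```scala
// Hedged sketch of fixed-point iteration logging with a circuit breaker.
// `step` and the Set[Int] state are illustrative stand-ins for Joern's
// held-task state; thresholds mirror the values named in the commit message.
object FixedPointLoggingSketch {
  val MaxIterations   = 1000    // circuit breaker: cap the fixed-point loop
  val SlowIterationMs = 60000L  // warn when a single iteration exceeds 1 minute

  def completeHeldTasks(initial: Set[Int], step: Set[Int] => Set[Int]): Set[Int] = {
    var state     = initial
    var iteration = 0
    var changed   = true
    while (changed && iteration < MaxIterations) {
      val start = System.currentTimeMillis()
      val next  = step(state)
      changed = next != state
      state = next
      iteration += 1
      val elapsed = System.currentTimeMillis() - start
      if (elapsed > SlowIterationMs)
        println(s"WARN: iteration $iteration took ${elapsed}ms (state size ${state.size})")
      else
        println(s"DEBUG: iteration $iteration done in ${elapsed}ms")
    }
    if (iteration == MaxIterations)
      println(s"WARN: circuit breaker tripped after $MaxIterations iterations")
    state
  }
}
```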

### deduplicateTableEntries() Enhancement
- Granular timing breakdown for groupBy, sortBy, and tie-breaking operations
- Large collection size warnings (>10,000 entries) with performance impact analysis
- Identification of largest groups causing hash computation bottlenecks
- Reduction ratio tracking and memory usage estimates
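A minimal sketch of that granular timing breakdown, assuming a simplified entry type and tie-breaking by a `weight` field (both placeholders for Joern's actual table entries):

```scala
// Hedged sketch of per-phase timing in deduplication: measure groupBy,
// tie-breaking, and sortBy separately to find the bottleneck. The Entry
// type and the 10,000-entry warning threshold mirror the commit message
// but are otherwise assumptions.
object DedupTimingSketch {
  final case class Entry(key: String, weight: Int)

  def deduplicate(entries: Seq[Entry]): (Seq[Entry], Map[String, Long]) = {
    def timed[A](body: => A): (A, Long) = {
      val t0 = System.nanoTime()
      val r  = body
      (r, (System.nanoTime() - t0) / 1000000)
    }
    val (groups, groupByMs) = timed(entries.groupBy(_.key))
    // Tie-break inside each group: keep the entry with the highest weight.
    val (deduped, tieBreakMs) = timed(groups.values.map(_.maxBy(_.weight)).toSeq)
    val (sorted, sortByMs)    = timed(deduped.sortBy(_.key))
    val timings = Map("groupBy" -> groupByMs, "tieBreak" -> tieBreakMs, "sortBy" -> sortByMs)
    if (entries.size > 10000)
      println(s"WARN: large dedup input (${entries.size} entries); timings=$timings")
    (sorted, timings)
  }
}
```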

### Engine.scala
- Enhanced backwards() method logging with source-sink context
- Task submission ratio tracking (held vs executed tasks)
- Sample source/sink node information for debugging problematic combinations
- Performance warnings for slow analysis (>2 minutes)

### ExtendedCfgNode.scala
- Query scale warnings for large source×sink combinations (>100,000)
- Sample source/sink logging for identifying problematic node combinations
- Total analysis timing and result count tracking
- Early warning system for potentially expensive queries

## Performance Monitoring Features

- **Multi-level thresholds**: Debug, Info, Warn levels for different performance characteristics
- **Circuit breakers**: Prevent runaway processes with configurable limits
- **Memory monitoring**: Estimate memory usage for large collections
- **Progress tracking**: Monitor convergence in fixed-point iterations
- **Bottleneck identification**: Pinpoint exact operations causing slowdowns

## Debugging Capabilities

- Identify which specific source-sink combinations cause performance issues
- Track held task processing patterns and result set sizes
- Monitor memory usage patterns during deduplication operations
- Analyze iteration convergence behavior in pathological cases

This logging system provides the visibility needed to diagnose and resolve the exponential
performance degradation observed in large-scale dataflow analysis scenarios.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Value classes cannot have instance fields, so the logger was moved to a companion object.
This resolves the compilation error:
"Value classes may not define non-parameter field"

Changes:
- Removed instance logger field from ExtendedCfgNode value class
- Added companion object with static logger
- Updated all logger references to use ExtendedCfgNode.logger
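The constraint and its fix can be sketched as follows. This is illustrative only: the class here wraps a plain `String` and uses a println-based logger, whereas the real `ExtendedCfgNode` wraps a CFG node and uses an slf4j logger:

```scala
// Hedged sketch of the value-class fix: a value class (extends AnyVal) may
// not define non-parameter fields, so the logger lives in the companion
// object and methods reference it statically.
object ExtendedCfgNodeSketch {
  // Stand-in for an org.slf4j.Logger instance.
  val logger: String => Unit = msg => println(s"[ExtendedCfgNode] $msg")
}

final class ExtendedCfgNodeSketch(val label: String) extends AnyVal {
  def describe: String = {
    // Static access via the companion object; no instance field needed.
    ExtendedCfgNodeSketch.logger(s"describing $label")
    s"node($label)"
  }
}
```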

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
… node logging

This commit addresses the 10+ hour performance bottleneck in Python repository scanning
by implementing comprehensive parallelization and enhanced debugging capabilities.

Key improvements:
- Remove circuit breaker from HeldTaskCompletion, as analysis showed the bottleneck was in deduplicateTableEntries(), not the main loop
- Add multi-level parallelization: parallel table deduplication, parallel groupBy operations, and parallel group processing
- Implement hash key caching to avoid repeated expensive calculations during deduplication
- Add configurable performance thresholds via system properties for optimal tuning
- Replace unhelpful class name/ID logging with meaningful source code context (variable names, code snippets, file locations)
- Add comprehensive performance monitoring with timing logs and bottleneck identification
- Enhance memory usage warnings for large collection processing
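The hash key caching idea can be sketched like this, with an illustrative entry type and cost model standing in for Joern's dataflow paths:

```scala
// Hedged sketch of hash key caching during deduplication: compute each
// entry's expensive key once via getOrElseUpdate and reuse it, instead of
// recomputing inside every groupBy pass. Entry and expensiveKey are
// placeholders, not Joern's actual types.
import scala.collection.mutable

object HashKeyCacheSketch {
  final case class Entry(fields: List[String])

  private val cache = mutable.HashMap.empty[Entry, Int]

  // Stand-in for an expensive hash over a dataflow path.
  private def expensiveKey(e: Entry): Int = e.fields.map(_.hashCode).sum

  def cachedKey(e: Entry): Int = cache.getOrElseUpdate(e, expensiveKey(e))

  def groupByCachedKey(entries: Seq[Entry]): Map[Int, Seq[Entry]] =
    entries.groupBy(cachedKey)
}
```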

Performance features:
- Configurable thresholds: joern.dataflow.parallel.table.threshold, joern.dataflow.parallel.dedup.threshold, joern.dataflow.parallel.groups.threshold
- Intelligent processing mode selection (PARALLEL vs SEQUENTIAL) based on data size
- Enhanced logging shows meaningful context: Identifier'username' [username] @ user.py:42
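Reading a threshold from a system property and picking the processing mode might look like this sketch; the property name comes from the commit message, but the default value and the selection logic are assumptions:

```scala
// Hedged sketch of system-property-driven threshold tuning and the
// PARALLEL vs SEQUENTIAL mode selection described above. The 10,000
// default is illustrative, not Joern's actual default.
object ParallelModeSketch {
  def threshold(prop: String, default: Int): Int =
    sys.props.get(prop).flatMap(_.toIntOption).getOrElse(default)

  def processingMode(collectionSize: Int): String = {
    val limit = threshold("joern.dataflow.parallel.dedup.threshold", 10000)
    if (collectionSize >= limit) "PARALLEL" else "SEQUENTIAL"
  }
}
```

Run with, e.g., `-Djoern.dataflow.parallel.dedup.threshold=5000` to lower the cutoff for a particular codebase.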

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>