
Conversation

@yizhang-nv (Member) commented on Aug 5, 2025

Summary by CodeRabbit

  • New Features

    • Added support for specifying token counts for piecewise CUDA graph capture during torch compilation via new configuration options.
    • Introduced piecewise CUDA graph warmup with multiple forward passes per batch.
    • Added configuration fields to customize CUDA graph capture behavior.
  • Refactor

    • Unified and clarified warmup state management across model engine and executor components using properties.
    • Simplified warmup logic and replaced explicit CUDA graph state tracking with context managers.
    • Introduced a thread-local flag and context manager to control multi-stream execution, replacing previous graph-capturing checks (see the sketch after this summary).
    • Renamed internal parameters and variables from batch sizes to token capture counts for improved clarity.
    • Updated multi-stream CUDA synchronization logic to use new thread-local flag.
  • Bug Fixes

    • Improved runtime checks for multi-stream scheduling to ensure correct CUDA graph capturing state.
  • Style

    • Renamed a method in the MoE load balancer for clearer, more consistent naming.
  • Tests

    • Updated tests to reflect method renaming in MoE load balancer.
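The summary mentions a thread-local do_multi_stream flag managed through a context manager. The Python sketch below only illustrates that pattern under simplified assumptions; the class, setter, and global names are stand-ins, not the actual contents of tensorrt_llm/_torch/modules/multi_stream_utils.py.

import threading
from contextlib import contextmanager


class _MultiStreamLocal(threading.local):
    """Per-thread flag controlling whether multi-stream execution is enabled."""

    def __init__(self) -> None:
        self.enabled = False  # multi-stream execution is off by default


G_MULTI_STREAM_LOCAL = _MultiStreamLocal()


def do_multi_stream() -> bool:
    """Return the current thread's multi-stream flag."""
    return G_MULTI_STREAM_LOCAL.enabled


def set_do_multi_stream(enable: bool) -> None:
    """Set the current thread's multi-stream flag."""
    G_MULTI_STREAM_LOCAL.enabled = enable


@contextmanager
def with_multi_stream(enable: bool):
    """Temporarily enable or disable multi-stream execution on this thread."""
    prev = do_multi_stream()
    set_do_multi_stream(enable)
    try:
        yield
    finally:
        set_do_multi_stream(prev)  # restore the previous state even on error

With this pattern, callers such as the MoE load balancer can check do_multi_stream() instead of is_graph_capturing(), and the CUDA graph runner can wrap capture in with_multi_stream(...) so the flag is always restored.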

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
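For example, a typical invocation that combines several of the options documented above (the stage and GPU names are the illustrative values from the help text, not a recommendation):

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1" --gpu-type "A30, H100_PCIe"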

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@yizhang-nv yizhang-nv requested review from a team as code owners August 5, 2025 03:30
coderabbitai bot (Contributor) commented on Aug 5, 2025

📝 Walkthrough

This update introduces and propagates support for specifying token counts for piecewise CUDA graph capture in torch compile workflows. It unifies warmup state management using properties, refactors warmup logic for clarity, and updates related configuration and method names. It also replaces the multi-stream scheduling condition from a CUDA graph capturing check to a thread-local flag, adds new context managers for piecewise CUDA graph control, and renames a method in MoeLoadBalancer with corresponding test updates.

Changes

  • Piecewise CUDA Graph Token Support (tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/_torch/pyexecutor/config.py, tensorrt_llm/_torch/compilation/backend.py, tensorrt_llm/_torch/compilation/piecewise_optimizer.py, tensorrt_llm/_torch/pyexecutor/model_engine.py): Adds a capture_num_tokens field to TorchCompileConfig and propagates it to the backend config; renames the cuda_graph_batch_sizes parameter to capture_num_tokens in the backend and piecewise optimizer; updates the model engine to support and use piecewise CUDA graph token counts for warmup and execution.
  • Warmup State Refactor (tensorrt_llm/_torch/pyexecutor/model_engine.py, tensorrt_llm/_torch/pyexecutor/py_executor.py): Replaces the direct warmup flag attribute with an is_warmup property, updates warmup logic to use the property, clarifies state transitions, and synchronizes warmup state across model engines.
  • Multi-Stream Scheduling and CUDA Graph Capturing Control (tensorrt_llm/_torch/modules/multi_stream_utils.py, tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py, tensorrt_llm/_torch/custom_ops/torch_custom_ops.py, tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py): Introduces a thread-local do_multi_stream flag with a context manager to control multi-stream execution, replaces previous is_graph_capturing() checks with do_multi_stream() in MoeLoadBalancer and custom ops, and refactors the CUDA graph capture method to use the new context managers for the multi-stream and piecewise CUDA graph flags.
  • MoeLoadBalancer Method Rename and Test Updates (tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py, tests/unittest/_torch/modules/test_moe_load_balancer.py): Renames set_next_iter_info to set_iter_info in MoeLoadBalancer and updates all test usages accordingly.
  • KvCacheCreator Shutdown Sequence (tensorrt_llm/_torch/pyexecutor/_util.py): Changes the order of setting the is_warmup flag and calling py_executor.shutdown() in estimate_max_tokens.
  • New Context Managers for Piecewise CUDA Graph (tensorrt_llm/_torch/utils.py, tensorrt_llm/_torch/compilation/utils.py): Adds piecewise_cuda_graph(enable: bool) and piecewise_cuda_graph_capture(enable: bool) context managers to temporarily set piecewise CUDA graph flags within a scoped context (a minimal sketch of this pattern follows this list).
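The review comments below describe these context managers as saving the previous flag value, setting the new one, and restoring the old value in a finally block through existing getter/setter helpers. The following Python sketch only illustrates that pattern; the flag and helper names are simplified stand-ins rather than the actual contents of tensorrt_llm/_torch/utils.py or tensorrt_llm/_torch/compilation/utils.py.

from contextlib import contextmanager

# Module-level flag (stand-in name; the real modules keep their own flags).
G_PIECEWISE_CUDA_GRAPH_ENABLED = False


def is_piecewise_cuda_graph_enabled() -> bool:
    return G_PIECEWISE_CUDA_GRAPH_ENABLED


def _set_piecewise_cuda_graph(enable: bool) -> None:
    global G_PIECEWISE_CUDA_GRAPH_ENABLED
    G_PIECEWISE_CUDA_GRAPH_ENABLED = enable


@contextmanager
def piecewise_cuda_graph(enable: bool):
    """Temporarily set the piecewise CUDA graph flag within a scoped context."""
    prev = is_piecewise_cuda_graph_enabled()  # save the previous state
    _set_piecewise_cuda_graph(enable)
    try:
        yield
    finally:
        _set_piecewise_cuda_graph(prev)  # restore it even if an exception occurs

The cuda_graph_runner review later in this thread notes that such managers are nested with the multi-stream one (with with_multi_stream(...), piecewise_cuda_graph(...)), so both flags are restored automatically once capture finishes; the exact boolean values used by the runner are not shown in this thread.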

Sequence Diagram(s)

sequenceDiagram
    participant API as LlmArgs / TorchCompileConfig
    participant Backend as PyTorchConfig
    participant Engine as PyTorchModelEngine

    API->>Backend: Provide capture_num_tokens (torch_compile_piecewise_cuda_graph_num_tokens)
    Backend->>Engine: Pass piecewise CUDA graph token counts in config
    Engine->>Engine: Use token counts for piecewise CUDA graph warmup and execution

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

Community want to contribute

Suggested reviewers

  • chzblych
  • pcastonguay
  • litaotju
  • shaharmor98



📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c838b7e and eab12f9.

📒 Files selected for processing (1)
  • tensorrt_llm/llmapi/llm_args.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/llmapi/llm_args.py

coderabbitai bot (Contributor) left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py (3)

408-412: Split the long initialization line for better readability.

The line exceeds the 120-character limit and contains complex fallback logic that would be clearer if split.

-        self._piecewise_cuda_graph_num_tokens = pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens else pytorch_backend_config.cuda_graph_batch_sizes if pytorch_backend_config.cuda_graph_batch_sizes else []
+        self._piecewise_cuda_graph_num_tokens = (
+            pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens
+            if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens
+            else pytorch_backend_config.cuda_graph_batch_sizes
+            if pytorch_backend_config.cuda_graph_batch_sizes
+            else []
+        )

678-706: Good simplification of torch compile warmup logic.

The removal of contextlib.ExitStack makes the code clearer. Consider splitting the long logging line for consistency.

-                            logger.info(
-                                f"Run warmup for batch size={bs}, pure {'context' if num_tokens_per_request > 1 else 'generation'} phase"
-                            )
+                            phase_type = 'context' if num_tokens_per_request > 1 else 'generation'
+                            logger.info(
+                                f"Run warmup for batch size={bs}, pure {phase_type} phase"
+                            )

726-788: Well-structured CUDA graph warmup with new piecewise support.

The reorganization improves clarity, and the reverse sorting enables memory reuse optimization. The new piecewise CUDA graph warmup section correctly implements multiple forward passes with proper memory cleanup.

-        logger.info(
-            f"Creating CUDA graph instances for {len(self._cuda_graph_batch_sizes)} batch sizes."
-        )
+        num_batch_sizes = len(self._cuda_graph_batch_sizes)
+        logger.info(f"Creating CUDA graph instances for {num_batch_sizes} batch sizes.")
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a178cea and 80e16c3.

📒 Files selected for processing (8)
  • tensorrt_llm/_torch/compilation/backend.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (5 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
  • tensorrt_llm/llmapi/llm_args.py (2 hunks)
  • tests/unittest/_torch/modules/test_moe_load_balancer.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
  • tests/unittest/_torch/modules/test_moe_load_balancer.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/compilation/backend.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
  • tests/unittest/_torch/modules/test_moe_load_balancer.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/compilation/backend.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (5)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/compilation/backend.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/_torch/compilation/backend.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tensorrt_llm/_torch/compilation/backend.py
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.582Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.

Applied to files:

  • tensorrt_llm/_torch/compilation/backend.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • tensorrt_llm/_torch/compilation/backend.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py

408-408: Line too long (296 > 120)

(E501)


700-700: Line too long (137 > 120)

(E501)


733-733: Line too long (133 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (15)
tests/unittest/_torch/modules/test_moe_load_balancer.py (3)

270-270: LGTM! Method rename correctly reflected in test.

The test correctly uses the renamed method set_iter_info instead of set_next_iter_info, maintaining consistency with the implementation changes.


309-309: LGTM! Consistent method rename in test.

The method call correctly reflects the API change from set_next_iter_info to set_iter_info with appropriate parameters for the statistic kernel test.


371-371: LGTM! Final method rename correctly applied.

The test properly uses the renamed set_iter_info method, completing the consistent update across all test cases to align with the implementation changes.

tensorrt_llm/_torch/pyexecutor/config.py (1)

74-74: LGTM! Clean addition of new configuration field.

The new torch_compile_piecewise_cuda_graph_num_tokens field is well-named, follows the existing naming convention, and has an appropriate type annotation for specifying token counts for piecewise CUDA graph capture.

tensorrt_llm/_torch/compilation/backend.py (2)

15-15: LGTM! Import addition supports new functionality.

The import of is_graph_capturing is necessary for the enhanced CUDA graph capture state checking implemented later in the file.


117-118: LGTM! Enhanced condition adds important safeguard.

The addition of is_graph_capturing() check ensures multi-stream scheduling only runs when CUDA graph capturing is active, providing better control over the scheduling behavior and aligning with the unified CUDA graph capture management.

tensorrt_llm/_torch/pyexecutor/_util.py (1)

246-247: LGTM! Improved shutdown sequence ensures proper state management.

Reordering the warmup flag clearing before shutdown ensures consistent state management and prevents potential issues during resource cleanup. This aligns well with the unified warmup state management approach.

tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (2)

795-796: LGTM! Method rename improves API clarity.

The rename from set_next_iter_info to set_iter_info makes the method name more concise and clear while maintaining the same functionality and parameters.


942-943: LGTM! Method call correctly updated to use renamed method.

The method call properly reflects the rename from set_next_iter_info to set_iter_info, maintaining consistency with the method definition change.

tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

212-217: LGTM! Clean warmup state management.

The warmup state is properly encapsulated and the flow is clear: set warmup flag, perform warmup operations, then disable warmup. The comment about profiling behavior during warmup adds helpful context.


272-283: Propagate the warmup flag and validate its type

The is_warmup setter in py_executor.py should drive the corresponding flag on the underlying engines (to trigger torch.compile optimizations and suppress MoE stats) and guard against non-bool inputs.

– File: tensorrt_llm/_torch/pyexecutor/py_executor.py
Lines ~277–283

Suggested change:

 @is_warmup.setter
 def is_warmup(self, value: bool):
+    if not isinstance(value, bool):
+        raise TypeError(f"is_warmup must be bool, got {type(value).__name__}")
     self._is_warmup = value
-    # Set warmup flag in model engine to trigger torch compile and avoid moe load balancer statistics update
-    # self.model_engine.is_warmup = value
-    # if self.draft_model_engine is not None:
-    #     self.draft_model_engine.is_warmup = value
+    # Propagate warmup state to the model engines for compile and MoE behavior
+    self.model_engine.is_warmup = value
+    if self.draft_model_engine is not None:
+        self.draft_model_engine.is_warmup = value
tensorrt_llm/_torch/pyexecutor/model_engine.py (4)

356-356: LGTM! Clean refactoring of warmup state management.

The property-based approach for is_warmup provides better encapsulation and centralizes the logic for toggling torch compile optimizations and MoeLoadBalancer settings.

Also applies to: 465-485


492-497: LGTM! Consistent with property refactoring.

The context manager correctly uses the new is_warmup property while maintaining proper state restoration.


707-725: LGTM! Consistent refactoring of autotuner warmup.

The autotuner warmup follows the same simplified pattern as the torch compile warmup, making the code more maintainable.


2136-2137: LGTM! Simplified MoeLoadBalancer retrieval.

The removal of conditional warmup logic here is correct since that's now handled in the is_warmup property setter.

@yizhang-nv (Member, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #14078 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #14078 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10623 completed with status: 'FAILURE'

@yizhang-nv force-pushed the enhance-torch-compile branch from 80e16c3 to 52b5a78 on August 6, 2025 03:11
@yizhang-nv yizhang-nv requested review from a team as code owners August 6, 2025 03:11
@yizhang-nv (Member, Author)

/bot run

coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 80e16c3 and 52b5a78.

📒 Files selected for processing (13)
  • tensorrt_llm/_torch/compilation/backend.py (4 hunks)
  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py (7 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (11 hunks)
  • tensorrt_llm/_torch/modules/multi_stream_utils.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (7 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
  • tensorrt_llm/_torch/utils.py (1 hunks)
  • tensorrt_llm/llmapi/llm_args.py (2 hunks)
  • tests/unittest/_torch/modules/test_moe_load_balancer.py (3 hunks)
✅ Files skipped from review due to trivial changes (4)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/utils.py
  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py
  • tensorrt_llm/llmapi/llm_args.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tests/unittest/_torch/modules/test_moe_load_balancer.py
  • tensorrt_llm/_torch/compilation/backend.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
  • tensorrt_llm/_torch/modules/multi_stream_utils.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
  • tensorrt_llm/_torch/modules/multi_stream_utils.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
  • tensorrt_llm/_torch/modules/multi_stream_utils.py
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py

322-322: Line too long (296 > 120)

(E501)


699-699: Line too long (137 > 120)

(E501)


732-732: Line too long (133 > 120)

(E501)

🔇 Additional comments (16)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3)

6-6: LGTM! Import addition is correct.

The import of with_multi_stream is properly sourced from the multi_stream_utils module and is used correctly in the capture method.


8-8: LGTM! Import addition is correct.

The import of piecewise_cuda_graph is properly added and used as a context manager in the capture method.


80-84: Excellent refactoring using context managers for state management.

The nested context managers replace manual flag management with a more robust approach that ensures proper cleanup. The warmup and capture logic remains unchanged while improving maintainability.

tensorrt_llm/_torch/modules/multi_stream_utils.py (5)

1-2: LGTM! Required imports for thread-local state management.

The threading and contextmanager imports are necessary and correctly added for the new functionality.


8-14: Thread-local implementation is correct.

The thread-local class properly inherits from threading.local and initializes the flag to a sensible default. The global instance pattern is appropriate for module-level state management.


17-23: LGTM! Clean state management functions.

The setter and getter functions provide proper encapsulation of the thread-local state with clear type hints and focused responsibilities.


25-32: Excellent context manager implementation.

The context manager properly saves and restores the previous state using try/finally, ensuring robust cleanup even if exceptions occur.


60-62: Good refactoring of multi-stream decision logic.

The change from is_graph_capturing() to do_multi_stream() decouples the multi-stream decision from graph capturing state, providing better control. The variable rename to multi_stream correctly avoids shadowing the function name.

tensorrt_llm/_torch/pyexecutor/model_engine.py (8)

338-338: LGTM!

The parameter rename from cuda_graph_batch_sizes to capture_num_tokens improves semantic clarity, better reflecting that it specifies token counts for piecewise CUDA graph capture.


361-361: LGTM!

Clean initialization of the new warmup state property.


464-484: Well-designed property implementation for warmup state.

The property pattern effectively centralizes warmup state management and properly coordinates torch compile optimizations, CUDA graph capture flags, and MoeLoadBalancer settings.

Consider using a more explicit private attribute name in the getter:

 @property
 def is_warmup(self):
-    return getattr(self, "_is_warmup", False)
+    return self._is_warmup

And initialize self._is_warmup = False in the constructor instead of using getattr with a default.
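As a rough illustration of the property pattern discussed in this comment: the class name below is invented, and the side effects are placeholders standing in for the real torch compile and MoE load-balancer hooks.

class EngineSketch:
    """Hypothetical stand-in for the model engine's warmup-state handling."""

    def __init__(self) -> None:
        self._is_warmup = False  # initialized in the constructor, as suggested

    @property
    def is_warmup(self) -> bool:
        return self._is_warmup

    @is_warmup.setter
    def is_warmup(self, value: bool) -> None:
        self._is_warmup = value
        # Centralize side effects here, e.g. toggling torch.compile
        # optimizations and MoE load-balancer statistics collection.
        self._on_warmup_changed(value)

    def _on_warmup_changed(self, value: bool) -> None:
        # Placeholder hook; the real engine coordinates compile and CUDA graph
        # capture flags at this point.
        pass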


491-496: LGTM!

Clean update to use the new is_warmup property while preserving the same context manager behavior.


677-705: Excellent simplification of torch compile warmup logic.

The refactoring removes complex ExitStack management in favor of cleaner context managers. The no_cuda_graph() context manager provides better isolation during warmup.

Address the line length violation on line 699:

-                            logger.info(
-                                f"Run warmup for batch size={bs}, pure {'context' if num_tokens_per_request > 1 else 'generation'} phase"
-                            )
+                            phase_type = 'context' if num_tokens_per_request > 1 else 'generation'
+                            logger.info(
+                                f"Run warmup for batch size={bs}, pure {phase_type} phase"
+                            )

725-787: Well-structured CUDA graph warmup with piecewise support.

The reorganized CUDA graph warmup logic is clearer with explicit sorting and separation of concerns. The new piecewise CUDA graph warmup section properly handles multiple iterations and memory cleanup.

Address the line length violation on line 732:

-        logger.info(
-            f"Creating CUDA graph instances for {len(self._cuda_graph_batch_sizes)} batch sizes."
-        )
+        num_batch_sizes = len(self._cuda_graph_batch_sizes)
+        logger.info(f"Creating CUDA graph instances for {num_batch_sizes} batch sizes.")

The piecewise warmup logic (lines 767-787) looks correct with 3 warmup iterations plus one final iteration, followed by proper GPU memory cleanup.
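As a self-contained sketch of the warmup shape described here (several passes per captured size, one final pass, then GPU memory cleanup); run_forward and capture_num_tokens are placeholders, not TensorRT-LLM APIs:

import torch


def piecewise_warmup(run_forward, capture_num_tokens, warmup_iters: int = 3):
    """Warm up each captured token count, largest first, then free scratch memory."""
    for num_tokens in sorted(capture_num_tokens, reverse=True):
        for _ in range(warmup_iters):
            run_forward(num_tokens)  # warmup passes
        run_forward(num_tokens)      # one final pass at this size
    if torch.cuda.is_available():
        torch.cuda.empty_cache()     # release memory used during warmup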


1525-1525: LGTM!

Correct usage of the new is_warmup property for CUDA graph warmup detection.


2135-2136: LGTM!

Clean formatting of the MoeLoadBalancer reference retrieval.

@yizhang-nv (Member, Author)

/bot run

@yizhang-nv force-pushed the enhance-torch-compile branch from 52b5a78 to 4759be8 on August 6, 2025 05:56
@yizhang-nv (Member, Author)

/bot run

@yizhang-nv changed the title from "refactor: Refactor Torch Compile Backend, MoeLoadBalancer and warmup Logic" to "[None][refactor] Refactor Torch Compile Backend, MoeLoadBalancer and warmup Logic" on Aug 6, 2025
coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

322-327: Refactor complex initialization logic for better readability.

The nested ternary operators make this initialization difficult to read and maintain, and line 322 exceeds the 120-character limit. Consider breaking this into multiple lines with clear variable names.

-        self._piecewise_cuda_graph_num_tokens = pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens else pytorch_backend_config.cuda_graph_batch_sizes if pytorch_backend_config.cuda_graph_batch_sizes else []
-        self._piecewise_cuda_graph_num_tokens = [
-            i for i in self._piecewise_cuda_graph_num_tokens
-            if i <= self.max_num_tokens
-        ]
+        # Get num_tokens from piecewise config or fallback to cuda_graph_batch_sizes
+        raw_num_tokens = (
+            pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens
+            or pytorch_backend_config.cuda_graph_batch_sizes
+            or []
+        )
+        self._piecewise_cuda_graph_num_tokens = [
+            i for i in raw_num_tokens if i <= self.max_num_tokens
+        ]
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

677-788: LGTM: Warmup method refactoring improves clarity.

The refactoring successfully:

  • Uses no_cuda_graph() context manager for cleaner CUDA graph control
  • Adds proper piecewise CUDA graph warmup with piecewise_cuda_graph_capture(True) context manager
  • Reorganizes the logic for better readability

However, consider addressing the line length violations (lines 699, 732 exceed 120 characters) for better code style compliance.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 52b5a78 and 4759be8.

📒 Files selected for processing (14)
  • tensorrt_llm/_torch/compilation/backend.py (4 hunks)
  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py (7 hunks)
  • tensorrt_llm/_torch/compilation/utils.py (2 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (11 hunks)
  • tensorrt_llm/_torch/modules/multi_stream_utils.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (8 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
  • tensorrt_llm/_torch/utils.py (1 hunks)
  • tensorrt_llm/llmapi/llm_args.py (2 hunks)
  • tests/unittest/_torch/modules/test_moe_load_balancer.py (3 hunks)
✅ Files skipped from review due to trivial changes (3)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py
🚧 Files skipped from review as they are similar to previous changes (9)
  • tests/unittest/_torch/modules/test_moe_load_balancer.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/utils.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
  • tensorrt_llm/_torch/modules/multi_stream_utils.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
  • tensorrt_llm/_torch/compilation/backend.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline, and attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/compilation/utils.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/compilation/utils.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T05:46:41.295Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py

322-322: Line too long (296 > 120)

(E501)


474-474: Undefined name set_enable_piecewise_cuda_graph_capture_flag

(F821)


477-477: Undefined name set_enable_piecewise_cuda_graph_capture_flag

(F821)


699-699: Line too long (137 > 120)

(E501)


732-732: Line too long (133 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tensorrt_llm/_torch/compilation/utils.py (2)

1-1: LGTM!

The contextlib import is correctly added to support the new context manager function.


47-54: Excellent implementation of the context manager!

The piecewise_cuda_graph_capture context manager correctly follows the standard pattern:

  • Saves previous state using the existing getter function
  • Sets the new state using the existing setter function
  • Uses try/finally for exception safety to ensure state restoration
  • Leverages existing helper functions rather than directly accessing globals

This provides a clean, safe way to manage the piecewise CUDA graph capture flag in scoped contexts.

tensorrt_llm/_torch/pyexecutor/model_engine.py (6)

41-41: LGTM: Import updated to use new context manager.

The import change from set_enable_piecewise_cuda_graph_capture_flag to piecewise_cuda_graph_capture aligns with the new scoped flag management pattern using context managers.


338-338: LGTM: Backend configuration updated correctly.

The addition of capture_num_tokens=self._piecewise_cuda_graph_num_tokens properly propagates the piecewise CUDA graph configuration to the backend.


361-361: LGTM: Warmup state initialization added.

The initialization of is_warmup = False properly sets up the new property-based warmup state management.


491-496: LGTM: Context manager updated to use new property.

The set_warmup_flag context manager correctly uses the new is_warmup property, maintaining consistency with the property-based warmup state management.


1525-1525: LGTM: Updated to use new warmup property.

The condition correctly uses self.is_warmup property, maintaining consistency with the new property-based warmup state management pattern.


2135-2136: LGTM: Improved MoeLoadBalancer usage pattern.

Using getattr with a default None value makes the MoeLoadBalancer access more robust and handles cases where it might not be present.

@tensorrt-cicd (Collaborator)

PR_Github #14247 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #14247 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10757 completed with status: 'FAILURE'

@yizhang-nv (Member, Author)

/bot run

@yizhang-nv yizhang-nv requested a review from litaotju August 14, 2025 02:57
@Superjomn (Collaborator) left a comment


LGTM on the llmapi changes.

@yizhang-nv yizhang-nv requested a review from yuxianq August 15, 2025 05:24
@yizhang-nv force-pushed the enhance-torch-compile branch from 69df79c to 059d16b on August 15, 2025 05:24
@yizhang-nv (Member, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #15402 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #15402 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11611 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

@yizhang-nv force-pushed the enhance-torch-compile branch from 059d16b to eae8e92 on August 18, 2025 01:55
@yizhang-nv (Member, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #15564 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #15564 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11722 completed with status: 'ABORTED'

@liji-nv (Collaborator) commented on Aug 18, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #15597 [ run ] triggered by Bot

…UDA graph capturing. Update method names for clarity and add new configuration options for piecewise CUDA graph token capture.

Signed-off-by: yizhang-nv <[email protected]>
Signed-off-by: yizhang-nv <[email protected]>
@yizhang-nv force-pushed the enhance-torch-compile branch from c0986ee to c0495d1 on August 18, 2025 09:43
@yizhang-nv (Member, Author)

/bot run

@yizhang-nv yizhang-nv enabled auto-merge (squash) August 18, 2025 09:43
@yizhang-nv (Member, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #15626 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #15597 [ run ] completed with state ABORTED

@tensorrt-cicd (Collaborator)

PR_Github #15626 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11764 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

@yizhang-nv yizhang-nv merged commit a15af87 into NVIDIA:main Aug 19, 2025
4 checks passed