
Conversation

@suyoggupta (Collaborator) commented Oct 2, 2025

Enable autotuning when capturing cudagraphs.

Perf results:

Mixtral 8x7B on H200, ISL/OSL = 128/128 (input/output sequence lengths)

AutoDeploy without autotuning:

===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec):                     9.0903
Total Output Throughput (tokens/sec):             1163.5598
Total Token Throughput (tokens/sec):              2327.1195
Total Latency (ms):                               28161.8536
Average request latency (ms):                     27333.1795
Per User Output Throughput [w/ ctx] (tps/user):   4.6843
Per GPU Output Throughput (tps/gpu):              1163.5598

AutoDeploy with autotuning:

===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec):                     34.8969
Total Output Throughput (tokens/sec):             4466.8075
Total Token Throughput (tokens/sec):              8933.6150
Total Latency (ms):                               7335.8881
Average request latency (ms):                     7166.9759
Per User Output Throughput [w/ ctx] (tps/user):   17.8627
Per GPU Output Throughput (tps/gpu):              4466.8075

PyTorch backend (reference):

===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec):                     35.4328
Total Output Throughput (tokens/sec):             4535.4044
Total Token Throughput (tokens/sec):              9070.8089
Total Latency (ms):                               7224.9345
Average request latency (ms):                     7050.8062
Per User Output Throughput [w/ ctx] (tps/user):   18.1581
Per GPU Output Throughput (tps/gpu):              4535.4044

In short: autotuning improves AutoDeploy throughput roughly 3.8x on this workload and brings it within about 1.5% of the PyTorch backend.

Summary by CodeRabbit

  • New Features
    • Integrated automatic autotuning into CUDA graph warm-up, optimizing execution during graph capture.
    • Improved out-of-the-box performance and stability for Torch-based deployments using CUDA graphs, with no configuration changes required.

@coderabbitai bot (Contributor) commented Oct 2, 2025

📝 Walkthrough

Introduces autotuner integration into CUDA graph capture warm-up: imports autotune from tensorrt_llm._torch.autotuner and wraps the warm-up phase in _capture_one_graph with both CudaGraphWarmUpPhase and autotune().

Changes

Cohort / File(s): CUDA graph warm-up and autotune integration (tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py)
Summary: Added an import of autotune and wrapped the CUDA graph warm-up in _capture_one_graph with both CudaGraphWarmUpPhase and autotune(), so autotuning runs during graph-capture warm-up.
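
For concreteness, a minimal sketch of the wrapped warm-up. Only the `autotune` import and the nesting of the two context managers come from this PR; the helper name, loop shape, and iteration count are illustrative, and `CudaGraphWarmUpPhase` is assumed to already be in scope in `torch_cudagraph.py`:

```python
from tensorrt_llm._torch.autotuner import autotune


def _warm_up_before_capture(model_fn, example_args, num_warmup_iters=3):
    """Illustrative helper; in the PR this logic lives inside _capture_one_graph."""
    # Run warm-up under both contexts so kernel autotuning happens while the
    # shapes to be captured are exercised, before graph capture begins.
    with CudaGraphWarmUpPhase(), autotune():
        for _ in range(num_warmup_iters):
            model_fn(*example_args)
    # CUDA graph capture follows after both contexts exit.
```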

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant TorchCudaGraph as torch_cudagraph backend
  participant WarmUp as CudaGraphWarmUpPhase
  participant Autotune as autotune()

  Caller->>TorchCudaGraph: _capture_one_graph(...)
  activate TorchCudaGraph
  Note over TorchCudaGraph: Begin CUDA graph capture setup
  TorchCudaGraph->>WarmUp: enter
  activate WarmUp
  WarmUp->>Autotune: enter
  activate Autotune
  Note over WarmUp,Autotune: Warm-up iterations with autotuning
  Autotune-->>WarmUp: exit
  deactivate Autotune
  WarmUp-->>TorchCudaGraph: exit
  deactivate WarmUp
  TorchCudaGraph->>TorchCudaGraph: Capture CUDA graph
  TorchCudaGraph-->>Caller: Return captured graph
  deactivate TorchCudaGraph

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks

❌ Failed checks (1 warning)
  • Description Check (⚠️ Warning): The PR description omits the required sections from the repository template, including explicit "## Description", "## Test Coverage", and "## PR Checklist" headings, and does not follow the @coderabbitai summary structure. Resolution: update the description to include a "## Description" section explaining the issue and solution, a "## Test Coverage" section listing relevant tests, and a "## PR Checklist" section per the repository template instructions.
✅ Passed checks (2 passed)
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, above the required threshold of 80.00%.
  • Title Check (✅ Passed): The title succinctly describes that the PR adds autotuning to CUDA graph capture in AutoDeploy, clearly reflecting the primary change.

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (1)

57-58: Persist autotune profiles by specifying cache_path. Default cache_path=None disables loading and saving of tuning results—autotuning will rerun on every graph capture and restart. Pass a filesystem path to autotune(cache_path=…) to enable profiling_cache.load/save and avoid redundant tuning.
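
A hedged sketch of this suggestion. The cache location is illustrative; `cache_path` is the keyword named above, and `CudaGraphWarmUpPhase` is assumed to be in scope:

```python
from tensorrt_llm._torch.autotuner import autotune

# Illustrative location; pick a path that survives restarts in your deployment.
AUTOTUNE_CACHE_PATH = "/var/cache/trtllm/autodeploy_autotune_profiles"

# Passing cache_path enables profiling_cache.load/save, so previously tuned
# results are reused instead of re-tuning on every capture and restart.
with CudaGraphWarmUpPhase(), autotune(cache_path=AUTOTUNE_CACHE_PATH):
    ...  # warm-up iterations as in _capture_one_graph
```

With a path set, cached tuning results are loaded up front and new results are saved, so repeated captures and process restarts skip redundant profiling.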

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc7f78c and 36cbf7e.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
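
As a quick illustration of the namespace-import rule above (all names hypothetical):

```python
# Preferred: import the submodule and keep its namespace at the call site.
from some_package.subpackage import foo

obj = foo.SomeClass()  # the foo. prefix makes the class's origin obvious

# Discouraged by the guideline: importing the class directly drops the namespace.
# from some_package.subpackage.foo import SomeClass
```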
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (1)
tensorrt_llm/_torch/autotuner.py (1)
  • autotune (210-242)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (1)

13-13: LGTM!

The import follows coding guidelines by maintaining the module namespace structure.

@suyoggupta changed the title from "[Perf][AutoDeploy] add autotuning when capturing cudagraphs" to "[feat][AutoDeploy] add autotuning when capturing cudagraphs" on Oct 2, 2025
Signed-off-by: Suyog Gupta <[email protected]>
@suyoggupta changed the title from "[feat][AutoDeploy] add autotuning when capturing cudagraphs" to "[None][feat][AutoDeploy] add autotuning when capturing cudagraphs" on Oct 2, 2025
@suyoggupta changed the title from "[None][feat][AutoDeploy] add autotuning when capturing cudagraphs" to "[None][feat]AutoDeploy: add autotuning when capturing cudagraphs" on Oct 2, 2025
@suyoggupta (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #20547 [ run ] triggered by Bot

@suyoggupta changed the title from "[None][feat]AutoDeploy: add autotuning when capturing cudagraphs" to "[None][feat] AutoDeploy add autotuning when capturing cudagraphs" on Oct 2, 2025
Signed-off-by: Suyog Gupta <[email protected]>
@suyoggupta (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #20548 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #20547 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #15504 (Blue Ocean) completed with status: ABORTED

@suyoggupta added the AutoDeploy (<NV> AutoDeploy Backend) label Oct 2, 2025
@github-project-automation bot moved this from Backlog to In review in AutoDeploy Board Oct 2, 2025
@tensorrt-cicd (Collaborator)

PR_Github #20548 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15505 completed with status: 'FAILURE'

@suyoggupta (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #20554 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #20554 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15511 completed with status: 'FAILURE'

@suyoggupta (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #20558 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #20558 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15515 completed with status: 'FAILURE'

@suyoggupta (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #20570 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #20570 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15525 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@suyoggupta merged commit d821524 into NVIDIA:main Oct 3, 2025
5 checks passed
@github-project-automation bot moved this from In review to Done in AutoDeploy Board Oct 3, 2025
evezhier pushed a commit to evezhier/TensorRT-LLM that referenced this pull request Oct 3, 2025