
Conversation

@suyoggupta (Collaborator) commented Oct 2, 2025

Enable autotuning when capturing cudagraphs.

Perf results:

Mixtral 8x7B on H200, ISL/OSL = 128/128 (input/output sequence lengths)

AutoDeploy without autotuning:

===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec):                     9.0903
Total Output Throughput (tokens/sec):             1163.5598
Total Token Throughput (tokens/sec):              2327.1195
Total Latency (ms):                               28161.8536
Average request latency (ms):                     27333.1795
Per User Output Throughput [w/ ctx] (tps/user):   4.6843
Per GPU Output Throughput (tps/gpu):              1163.5598

AutoDeploy with autotuning:

===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec):                     34.8969
Total Output Throughput (tokens/sec):             4466.8075
Total Token Throughput (tokens/sec):              8933.6150
Total Latency (ms):                               7335.8881
Average request latency (ms):                     7166.9759
Per User Output Throughput [w/ ctx] (tps/user):   17.8627
Per GPU Output Throughput (tps/gpu):              4466.8075

PyTorch backend (reference):

===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec):                     35.4328
Total Output Throughput (tokens/sec):             4535.4044
Total Token Throughput (tokens/sec):              9070.8089
Total Latency (ms):                               7224.9345
Average request latency (ms):                     7050.8062
Per User Output Throughput [w/ ctx] (tps/user):   18.1581
Per GPU Output Throughput (tps/gpu):              4535.4044

In short: autotuning improves AutoDeploy throughput roughly 3.8x on this workload and brings it within about 1.5% of the PyTorch backend.

Summary by CodeRabbit

  • New Features
    • Integrated automatic autotuning into CUDA graph warm-up, optimizing execution during graph capture.
    • Improved out-of-the-box performance and stability for Torch-based deployments using CUDA graphs, with no configuration changes required.

@coderabbitai bot (Contributor) commented Oct 2, 2025

📝 Walkthrough

Introduces autotuner integration into CUDA graph capture warm-up: imports autotune from tensorrt_llm._torch.autotuner and wraps the warm-up phase in _capture_one_graph with both CudaGraphWarmUpPhase and autotune().

Changes

Cohort / File(s): CUDA graph warm-up and autotune integration (tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py)
Summary: Added an import of autotune and wrapped the CUDA graph warm-up in _capture_one_graph with both CudaGraphWarmUpPhase and autotune(), so autotuning runs during graph-capture warm-up.
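
For concreteness, a minimal sketch of the wrapped warm-up. Only the `autotune` import and the nesting of the two context managers come from this PR; the helper name, loop shape, and iteration count are illustrative, and `CudaGraphWarmUpPhase` is assumed to already be in scope in `torch_cudagraph.py`:

```python
from tensorrt_llm._torch.autotuner import autotune


def _warm_up_before_capture(model_fn, example_args, num_warmup_iters=3):
    """Illustrative helper; in the PR this logic lives inside _capture_one_graph."""
    # Run warm-up under both contexts so kernel autotuning happens while the
    # shapes to be captured are exercised, before graph capture begins.
    with CudaGraphWarmUpPhase(), autotune():
        for _ in range(num_warmup_iters):
            model_fn(*example_args)
    # CUDA graph capture follows after both contexts exit.
```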

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant TorchCudaGraph as torch_cudagraph backend
  participant WarmUp as CudaGraphWarmUpPhase
  participant Autotune as autotune()

  Caller->>TorchCudaGraph: _capture_one_graph(...)
  activate TorchCudaGraph
  Note over TorchCudaGraph: Begin CUDA graph capture setup
  TorchCudaGraph->>WarmUp: enter
  activate WarmUp
  WarmUp->>Autotune: enter
  activate Autotune
  Note over WarmUp,Autotune: Warm-up iterations with autotuning
  Autotune-->>WarmUp: exit
  deactivate Autotune
  WarmUp-->>TorchCudaGraph: exit
  deactivate WarmUp
  TorchCudaGraph->>TorchCudaGraph: Capture CUDA graph
  TorchCudaGraph-->>Caller: Return captured graph
  deactivate TorchCudaGraph

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks

❌ Failed checks (1 warning)
  • Description Check (⚠️ Warning): The PR description omits the required sections from the repository template, including explicit "## Description", "## Test Coverage", and "## PR Checklist" headings, and does not follow the @coderabbitai summary structure. Resolution: update the description to include a "## Description" section explaining the issue and solution, a "## Test Coverage" section listing relevant tests, and a "## PR Checklist" section per the repository template instructions.
✅ Passed checks (2 passed)
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, above the required threshold of 80.00%.
  • Title Check (✅ Passed): The title succinctly describes that the PR adds autotuning to CUDA graph capture in AutoDeploy, clearly reflecting the primary change.

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (1)

57-58: Persist autotune profiles by specifying cache_path. Default cache_path=None disables loading and saving of tuning results—autotuning will rerun on every graph capture and restart. Pass a filesystem path to autotune(cache_path=…) to enable profiling_cache.load/save and avoid redundant tuning.
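
A hedged sketch of this suggestion. The cache location is illustrative; `cache_path` is the keyword named above, and `CudaGraphWarmUpPhase` is assumed to be in scope:

```python
from tensorrt_llm._torch.autotuner import autotune

# Illustrative location; pick a path that survives restarts in your deployment.
AUTOTUNE_CACHE_PATH = "/var/cache/trtllm/autodeploy_autotune_profiles"

# Passing cache_path enables profiling_cache.load/save, so previously tuned
# results are reused instead of re-tuning on every capture and restart.
with CudaGraphWarmUpPhase(), autotune(cache_path=AUTOTUNE_CACHE_PATH):
    ...  # warm-up iterations as in _capture_one_graph
```

With a path set, cached tuning results are loaded up front and new results are saved, so repeated captures and process restarts skip redundant profiling.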

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc7f78c and 36cbf7e.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
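
As a quick illustration of the namespace-import rule above (all names hypothetical):

```python
# Preferred: import the submodule and keep its namespace at the call site.
from some_package.subpackage import foo

obj = foo.SomeClass()  # the foo. prefix makes the class's origin obvious

# Discouraged by the guideline: importing the class directly drops the namespace.
# from some_package.subpackage.foo import SomeClass
```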
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (1)
tensorrt_llm/_torch/autotuner.py (1)
  • autotune (210-242)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (1)

13-13: LGTM!

The import follows coding guidelines by maintaining the module namespace structure.

@suyoggupta changed the title from "[Perf][AutoDeploy] add autotuning when capturing cudagraphs" to "[feat][AutoDeploy] add autotuning when capturing cudagraphs" on Oct 2, 2025
Signed-off-by: Suyog Gupta <[email protected]>
@suyoggupta changed the title from "[feat][AutoDeploy] add autotuning when capturing cudagraphs" to "[None][feat][AutoDeploy] add autotuning when capturing cudagraphs" on Oct 2, 2025
@suyoggupta changed the title from "[None][feat][AutoDeploy] add autotuning when capturing cudagraphs" to "[None][feat]AutoDeploy: add autotuning when capturing cudagraphs" on Oct 2, 2025
@suyoggupta (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #20547 [ run ] triggered by Bot

@suyoggupta changed the title from "[None][feat]AutoDeploy: add autotuning when capturing cudagraphs" to "[None][feat] AutoDeploy add autotuning when capturing cudagraphs" on Oct 2, 2025
Signed-off-by: Suyog Gupta <[email protected]>
@suyoggupta (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #20548 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #20547 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #15504 (Blue Ocean) completed with status: ABORTED

@suyoggupta added the AutoDeploy (<NV> AutoDeploy Backend) label Oct 2, 2025
@github-project-automation bot moved this from Backlog to In review in AutoDeploy Board Oct 2, 2025
@tensorrt-cicd (Collaborator)

PR_Github #20548 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15505 completed with status: 'FAILURE'

@suyoggupta (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #20554 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #20554 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15511 completed with status: 'FAILURE'

@suyoggupta (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #20558 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #20558 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15515 completed with status: 'FAILURE'

@suyoggupta (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #20570 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #20570 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15525 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@suyoggupta merged commit d821524 into NVIDIA:main Oct 3, 2025
5 checks passed
@github-project-automation bot moved this from In review to Done in AutoDeploy Board Oct 3, 2025
evezhier pushed a commit to evezhier/TensorRT-LLM that referenced this pull request Oct 3, 2025