
Conversation

syuoni (Collaborator) commented Sep 19, 2025

Summary by CodeRabbit

  • Documentation
    • Added a new Tech Blog entry on combining guided decoding with speculative decoding for smoother CPU–GPU cooperation.
    • Updated the Tech Blogs list with the new post and link.
    • Article covers high-level design, data flows, masking approach, CUDA graph integration, concurrency safeguards, and rollback mechanics.
    • Includes performance highlights with observed speedups in benchmark scenarios and comparisons across model setups.
    • Provides acknowledgements and references for further reading.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

Currently, the image links are relative links for review purposes.

  • Update the image links to use absolute links before merging the PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
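
For example, a run limited to a single test stage with fail-fast disabled (using only the flags documented above) would be:

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast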

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

Signed-off-by: Enwei Zhu <[email protected]>
@syuoni syuoni self-assigned this Sep 19, 2025
@syuoni syuoni requested a review from a team as a code owner September 19, 2025 07:11
coderabbitai bot (Contributor) commented Sep 19, 2025

📝 Walkthrough

Adds a new Tech Blog entry to README and introduces a detailed documentation article on combining guided decoding with speculative decoding, covering design, data flow, CUDA Graph capturability via host callbacks, masking, state management, and benchmarking. No code or public API changes.

Changes

Cohort / File(s) — Summary of Changes

README Tech Blog Index (README.md): Inserts a new Tech Blog list item for 09/18 with a link to the new article; positioned before the 08/29 entry.

Tech Blog Article (docs/source/blogs/tech_blog/blog12_Combining_Guided_Decoding_and_Speculative_Decoding.md): Adds a new blog post detailing guided decoding + speculative decoding integration, grammar/mask flow, one- vs two-model drafting, vocab mapping, CUDA Graph host callbacks, masking kernel assumptions, concurrency considerations, and performance results.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Executor as Decoder Executor
  participant GPU as Model Forward (GPU)
  participant CPU as Grammar Engine (CPU)
  rect rgb(238,245,255)
    note over Executor,GPU: Speculative Decoding Loop
    User->>Executor: Start request
    Executor->>GPU: Draft tokens (1 or 2-model)
    par Overlap
      GPU-->>Executor: Draft logits
      Executor->>CPU: Compute grammar mask for draft/target
      CPU-->>Executor: Token mask(s), updated grammar state
    end
    Executor->>Executor: Apply mask to logits (disallow tokens)
    alt Draft verified
      Executor->>Executor: Accept draft token(s), advance grammar state
    else Draft rejected
      Executor->>Executor: Roll back draft, reuse valid prefix
    end
    Executor->>GPU: Target step with masked logits
    GPU-->>Executor: Next token
    Executor->>CPU: Advance grammar with accepted token
  end
  Executor-->>User: Streamed tokens
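
The "Apply mask to logits (disallow tokens)" step in the diagram above boils down to an elementwise write of negative infinity into the logits of grammar-disallowed tokens, so sampling can never pick them. As a rough illustration only (the kernel name, 32-bit bitmask layout, and single-request shape are assumptions, not the kernel TensorRT-LLM actually ships), such a masking kernel could look like:

#include <math.h>
#include <stdint.h>

// Minimal sketch: one thread per vocabulary entry; bit i of the bitmask says
// whether token i is currently allowed by the grammar for this decoding step.
__global__ void apply_token_bitmask(float* logits, const uint32_t* bitmask, int vocab_size)
{
    int token = blockIdx.x * blockDim.x + threadIdx.x;
    if (token >= vocab_size) return;
    bool allowed = (bitmask[token >> 5] >> (token & 31)) & 1u;
    if (!allowed) logits[token] = -INFINITY;  // disallowed token can no longer be sampled
}

// Example launch (assumed device pointers):
// apply_token_bitmask<<<(vocab_size + 255) / 256, 256, 0, stream>>>(d_logits, d_bitmask, vocab_size);

The mask buffer is produced by the CPU grammar engine and consumed by this GPU-side write, which is why overlapping the two, as shown above, removes the serialization penalty.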
sequenceDiagram
  autonumber
  participant Exec as Executor
  participant CUDAGraph as CUDA Graph
  participant HostCB as cudaLaunchHostFunc Callback
  participant Py as Python HostFunc (capturable)
  participant CPU as Grammar Engine
  participant GPU as Model Stream

  note over CUDAGraph,HostCB: Capturable guided decoding
  Exec->>CUDAGraph: Capture graph (fixed buffers, slots)
  CUDAGraph->>GPU: Enqueue model kernels
  CUDAGraph->>HostCB: Schedule host callback
  HostCB->>Py: Invoke hostfunc (GIL-released)
  Py->>CPU: Compute/restore grammar state, build masks
  CPU-->>Py: Masks + state snapshot
  Py-->>HostCB: Write masks into fixed buffers
  HostCB-->>CUDAGraph: Callback done
  CUDAGraph-->>Exec: Replay complete (overlapped streams)
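
The second diagram hinges on the fact that cudaLaunchHostFunc calls can be recorded as host nodes while a stream is being captured, so the CPU-side grammar work replays together with the model kernels on every graph launch. A minimal, self-contained sketch of that capture pattern (error handling omitted; MaskContext, build_masks, and the commented-out kernel slots are illustrative assumptions, not the actual TensorRT-LLM implementation) might be:

#include <cuda_runtime.h>
#include <stdio.h>

struct MaskContext {
    // Fixed buffers the callback fills on every replay (layout assumed).
    void* token_mask_buffer;
};

// Runs on a CPU thread owned by the CUDA runtime each time the graph replays this node.
void CUDART_CB build_masks(void* user_data)
{
    MaskContext* ctx = (MaskContext*)user_data;
    (void)ctx;  // here the grammar engine would advance its state and write token masks
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    static MaskContext ctx;

    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    // ... model forward kernels would be enqueued on `stream` here ...
    cudaLaunchHostFunc(stream, build_masks, &ctx);   // captured as a host node
    // ... mask-application and sampling kernels would follow here ...
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&graph_exec, graph, NULL, NULL, 0);
    cudaGraphLaunch(graph_exec, stream);             // replay re-runs build_masks
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    printf("graph with host callback replayed\n");
    return 0;
}

On each cudaGraphLaunch, the runtime invokes the host node on a CPU thread, which is where the grammar engine can restore its state and write the fixed mask buffers consumed by the subsequent masking kernels.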

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Description Check — ⚠️ Warning
Explanation: The PR uses the repository's PR template but leaves required sections unfilled: the Description and Test Coverage fields are empty and the body contains template placeholders (including the @coderabbitai marker), so reviewers lack a concise "what/why" and verification guidance. Because the template explicitly requires a short explanation and tests (or a doc-only justification), the current description is incomplete for sign-off. The checked notes about image links do not replace the missing Description and Test Coverage information.
Resolution: Please update the PR body to provide a clear Description summarizing the change and rationale, complete the Test Coverage section (or state "doc-only; no tests required"), and ensure the PR title matches the repository naming convention; also convert any relative image links to absolute URLs before merging. After updating, re-request review and confirm any CODEOWNERS or CI considerations mentioned in the template.
✅ Passed checks (2 passed)
Title Check — ✅ Passed: The PR title "[None][doc] Tech blog: Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly" clearly and specifically summarizes the principal change (adding a tech blog entry), follows the repository's ticket/type prefix convention, and is readable for reviewers scanning history.
Docstring Coverage — ✅ Passed: No functions found in the changes. Docstring coverage check skipped.

coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (8)
docs/source/blogs/tech_blog/blog12_Combining_Guided_Decoding_and_Speculative_Decoding.md (8)

3-3: Avoid “emphasis as heading” (markdownlint MD036) for the author line.

Use plain text or a small “Authors:” prefix instead of italic-only.

-*By NVIDIA TensorRT LLM Team and XGrammar Team*
+Authors: NVIDIA TensorRT LLM Team and XGrammar Team

21-23: Fix invalid ToC anchors (markdownlint MD051).

Section slugs use “troubleshooting”, not “trouble-shooting”.

-    - [Troubleshooting: Data Race between Host and CUDA Callback](#trouble-shooting-data-race-between-host-and-cuda-callback)
-    - [Troubleshooting: Deadlock by GIL and CUDA Mutex](#trouble-shooting-deadlock-by-gil-and-cuda-mutex)
+    - [Troubleshooting: Data Race between Host and CUDA Callback](#troubleshooting-data-race-between-host-and-cuda-callback)
+    - [Troubleshooting: Deadlock by GIL and CUDA Mutex](#troubleshooting-deadlock-by-gil-and-cuda-mutex)

50-56: Add alt text to Figure 1 image (markdownlint MD045).

-  <img src="https://github.com/media/tech_blog12_constrained_decoding_pipeline_overlap.png" width="600">
+  <img src="https://github.com/media/tech_blog12_constrained_decoding_pipeline_overlap.png" width="600" alt="Guided decoding timelines with and without CPU/GPU overlap">

63-69: Add alt text to Figure 2 image (markdownlint MD045).

-  <img src="https://github.com/media/tech_blog12_one_model_vs_two_model.png" width="600">
+  <img src="https://github.com/media/tech_blog12_one_model_vs_two_model.png" width="600" alt="GPU timelines: one-model vs two-model speculative decoding">

140-146: Add alt text to Figure 4 image (markdownlint MD045).

-  <img src="https://github.com/media/tech_blog12_cpu_gpu_synchronization_for_multiple_steps_by_cuda_callback.png" width="800">
+  <img src="https://github.com/media/tech_blog12_cpu_gpu_synchronization_for_multiple_steps_by_cuda_callback.png" width="800" alt="CPU-GPU synchronization across multiple steps via CUDA callbacks">

262-266: Add alt text to Figures 5–8 (markdownlint MD045).

-  <img src="https://github.com/media/tech_blog12_pareto_curve_json_mode_eval_llama_3.1_8b.png" width="600">
+  <img src="https://github.com/media/tech_blog12_pareto_curve_json_mode_eval_llama_3.1_8b.png" width="600" alt="Pareto curve: JSON Mode Eval, LLaMA 3.1 8B on H200">
-  <img src="https://github.com/media/tech_blog12_pareto_curve_json_mode_eval_llama_3.3_70b.png" width="600">
+  <img src="https://github.com/media/tech_blog12_pareto_curve_json_mode_eval_llama_3.3_70b.png" width="600" alt="Pareto curve: JSON Mode Eval, LLaMA 3.3 70B on H200">
-  <img src="https://github.com/media/tech_blog12_pareto_curve_json_schema_bench_llama_3.1_8b.png" width="600">
+  <img src="https://github.com/media/tech_blog12_pareto_curve_json_schema_bench_llama_3.1_8b.png" width="600" alt="Pareto curve: JSON Schema Bench, LLaMA 3.1 8B on H200">
-  <img src="https://github.com/media/tech_blog12_pareto_curve_json_schema_bench_llama_3.3_70b.png" width="600">
+  <img src="https://github.com/media/tech_blog12_pareto_curve_json_schema_bench_llama_3.3_70b.png" width="600" alt="Pareto curve: JSON Schema Bench, LLaMA 3.3 70B on H200">

Also applies to: 271-274, 283-285, 290-292


32-37: Unordered list marker style differs from repo lint expectations (markdownlint MD004).

Switch “*” to “-” in lists or adjust markdownlint config for this doc. Given the linter output, prefer dashes.

Also applies to: 74-79, 81-82, 123-137, 198-201, 221-223, 229-231, 308-309


151-155: Add commit SHA to versioned deep links for reproducibility

Links point to NVIDIA/TensorRT-LLM v1.1.0rc5 — include the exact commit SHA used for the benchmarks to avoid future drift.

Verification: referenced image assets are present; "Troubleshooting" anchors found at lines 21–22.
Applies to lines 151–155 and 190–206.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 451475e and ce64c8f.

⛔ Files ignored due to path filters (8)
  • docs/source/blogs/media/tech_blog12_constrained_decoding_pipeline_overlap.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog12_cpu_gpu_synchronization_for_multiple_steps.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog12_cpu_gpu_synchronization_for_multiple_steps_by_cuda_callback.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog12_one_model_vs_two_model.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog12_pareto_curve_json_mode_eval_llama_3.1_8b.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog12_pareto_curve_json_mode_eval_llama_3.3_70b.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog12_pareto_curve_json_schema_bench_llama_3.1_8b.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog12_pareto_curve_json_schema_bench_llama_3.3_70b.png is excluded by !**/*.png
📒 Files selected for processing (2)
  • README.md (1 hunks)
  • docs/source/blogs/tech_blog/blog12_Combining_Guided_Decoding_and_Speculative_Decoding.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog12_Combining_Guided_Decoding_and_Speculative_Decoding.md

MD036 (no-emphasis-as-heading) — Emphasis used instead of a heading: line 3

MD051 (link-fragments) — Link fragments should be valid: lines 21, 22

MD045 (no-alt-text) — Images should have alternate text (alt text): lines 52, 65, 86, 142, 264, 272, 283, 290

MD004 (ul-style) — Unordered list style (Expected: dash; Actual: asterisk): lines 32-36, 74-77, 81-82, 123-124, 128-138, 198-201, 221-223, 229-231, 308-309

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
README.md (1)

21-22: LGTM — blog file present; verify built docs link.
File exists: docs/source/blogs/tech_blog/blog12_Combining_Guided_Decoding_and_Speculative_Decoding.md. Verify the link resolves after building the docs.

Signed-off-by: Enwei Zhu <[email protected]>
syuoni (Collaborator, Author) commented Sep 19, 2025

/bot skip --comment "doc changes only"

tensorrt-cicd (Collaborator) commented: PR_Github #19346 [ skip ] triggered by Bot

tensorrt-cicd (Collaborator) commented: PR_Github #19346 [ skip ] completed with state SUCCESS. Skipping testing for commit a212b9e

@juney-nvidia juney-nvidia merged commit c8cc16d into NVIDIA:main Sep 19, 2025
6 of 7 checks passed
Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025
…ding: Making CPU and GPU Cooperate Seamlessly (NVIDIA#7864)

Signed-off-by: Enwei Zhu <[email protected]>
MrGeva pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Sep 21, 2025
…ding: Making CPU and GPU Cooperate Seamlessly (NVIDIA#7864)

Signed-off-by: Enwei Zhu <[email protected]>
