[None][doc] Tech blog: Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly #7864
Conversation
📝 Walkthrough

Adds a new Tech Blog entry to README and introduces a detailed documentation article on combining guided decoding with speculative decoding, covering design, data flow, CUDA Graph capturability via host callbacks, masking, state management, and benchmarking. No code or public API changes.
Sequence Diagram(s)

sequenceDiagram
autonumber
actor User
participant Executor as Decoder Executor
participant GPU as Model Forward (GPU)
participant CPU as Grammar Engine (CPU)
rect rgb(238,245,255)
note over Executor,GPU: Speculative Decoding Loop
User->>Executor: Start request
Executor->>GPU: Draft tokens (1 or 2-model)
par Overlap
GPU-->>Executor: Draft logits
Executor->>CPU: Compute grammar mask for draft/target
CPU-->>Executor: Token mask(s), updated grammar state
end
Executor->>Executor: Apply mask to logits (disallow tokens)
alt Draft verified
Executor->>Executor: Accept draft token(s), advance grammar state
else Draft rejected
Executor->>Executor: Roll back draft, reuse valid prefix
end
Executor->>GPU: Target step with masked logits
GPU-->>Executor: Next token
Executor->>CPU: Advance grammar with accepted token
end
Executor-->>User: Streamed tokens
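The "Apply mask to logits" step in the diagram above is, at bottom, a masked fill: every token the grammar disallows has its logit forced to negative infinity before sampling, for both draft and target positions. Below is a minimal CUDA sketch of that step; the kernel name, the packed-bitmask layout, and the launch helper are illustrative assumptions, not the blog's or TensorRT-LLM's actual implementation.

```cuda
#include <cuda_runtime.h>
#include <math.h>  // INFINITY

// Hypothetical kernel: one block per row of logits (one row per draft or
// target position being verified). `bitmask` packs one bit per vocabulary
// token (1 = allowed by the grammar), as a CPU-side grammar engine such as
// XGrammar can produce.
__global__ void applyTokenMask(float* logits,                // [numRows, vocabSize]
                               const unsigned int* bitmask,  // [numRows, ceil(vocabSize/32)]
                               int vocabSize)
{
    const int row = blockIdx.x;
    const int words = (vocabSize + 31) / 32;
    for (int tok = threadIdx.x; tok < vocabSize; tok += blockDim.x)
    {
        const unsigned int word = bitmask[row * words + tok / 32];
        const bool allowed = (word >> (tok % 32)) & 1u;
        if (!allowed)
        {
            // Disallowed token: force the logit to -inf so it can never be sampled.
            logits[row * vocabSize + tok] = -INFINITY;
        }
    }
}

// Host-side launch sketch (names are illustrative, not TensorRT-LLM's API).
void launchApplyTokenMask(float* dLogits, const unsigned int* dBitmask,
                          int numRows, int vocabSize, cudaStream_t stream)
{
    applyTokenMask<<<numRows, 256, 0, stream>>>(dLogits, dBitmask, vocabSize);
}
```

Packing the mask as one bit per token keeps the per-step host-to-device transfer at roughly vocab_size / 8 bytes per row, which helps when a fresh mask must be produced and copied for every draft and target position.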
sequenceDiagram
autonumber
participant Exec as Executor
participant CUDAGraph as CUDA Graph
participant HostCB as cudaLaunchHostFunc Callback
participant Py as Python HostFunc (capturable)
participant CPU as Grammar Engine
participant GPU as Model Stream
note over CUDAGraph,HostCB: Capturable guided decoding
Exec->>CUDAGraph: Capture graph (fixed buffers, slots)
CUDAGraph->>GPU: Enqueue model kernels
CUDAGraph->>HostCB: Schedule host callback
HostCB->>Py: Invoke hostfunc (GIL-released)
Py->>CPU: Compute/restore grammar state, build masks
CPU-->>Py: Masks + state snapshot
Py-->>HostCB: Write masks into fixed buffers
HostCB-->>CUDAGraph: Callback done
CUDAGraph-->>Exec: Replay complete (overlapped streams)
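The second diagram relies on the fact that cudaLaunchHostFunc calls issued during stream capture become host nodes in the CUDA graph, so each replay re-runs the CPU-side grammar work at exactly the captured point in the stream and writes its results into fixed-address buffers. The self-contained sketch below shows only that capture/replay mechanism; the callback body and buffer type are placeholders (the article describes a Python hostfunc invoked with the GIL released), not the actual implementation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Fixed-address slot the callback fills on every graph replay. In the real
// design this would be a pinned token-mask buffer that the Python hostfunc
// populates from the grammar engine; here it is a toy placeholder.
struct MaskSlot
{
    int step;
};

// Host function node: runs on a CUDA-managed thread at replay time, at the
// same position in the stream where it was captured. It must not call any
// CUDA APIs; it may only touch host-visible memory.
void CUDART_CB fillMaskHostFunc(void* userData)
{
    auto* slot = static_cast<MaskSlot*>(userData);
    slot->step += 1;  // stand-in for "advance grammar state and write the mask"
}

int main()
{
    // Error checking omitted for brevity.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    static MaskSlot slot{0};  // address must stay valid for the graph's lifetime
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;

    // Capture: the host callback is recorded as a graph node, not executed now.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    cudaLaunchHostFunc(stream, fillMaskHostFunc, &slot);
    // ... model kernels that read the fixed mask buffer would be captured here ...
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graphExec, graph, 0);  // CUDA 12.x signature

    // Replay: each launch re-runs the host callback at the captured point.
    for (int i = 0; i < 3; ++i)
    {
        cudaGraphLaunch(graphExec, stream);
    }
    cudaStreamSynchronize(stream);
    printf("host callback ran %d times\n", slot.step);  // expected: 3

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    return 0;
}
```

Because the callback runs on a CUDA-internal thread and may not call CUDA APIs, it can only read and write host-visible memory that the captured kernels then consume; this is where the fixed buffers and slot management in the design come from, and it is the boundary that the article's troubleshooting sections (host/callback data races, GIL and CUDA-mutex deadlocks) deal with.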
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (8)
docs/source/blogs/tech_blog/blog12_Combining_Guided_Decoding_and_Speculative_Decoding.md (8)
3-3: Avoid “emphasis as heading” (markdownlint MD036) for the author line. Use plain text or a small “Authors:” prefix instead of italic-only.

-*By NVIDIA TensorRT LLM Team and XGrammar Team*
+Authors: NVIDIA TensorRT LLM Team and XGrammar Team
21-23: Fix invalid ToC anchors (markdownlint MD051). Section slugs use “troubleshooting”, not “trouble-shooting”.

- - [Troubleshooting: Data Race between Host and CUDA Callback](#trouble-shooting-data-race-between-host-and-cuda-callback)
- - [Troubleshooting: Deadlock by GIL and CUDA Mutex](#trouble-shooting-deadlock-by-gil-and-cuda-mutex)
+ - [Troubleshooting: Data Race between Host and CUDA Callback](#troubleshooting-data-race-between-host-and-cuda-callback)
+ - [Troubleshooting: Deadlock by GIL and CUDA Mutex](#troubleshooting-deadlock-by-gil-and-cuda-mutex)
50-56: Add alt text to Figure 1 image (markdownlint MD045).

- <img src="https://github.com/media/tech_blog12_constrained_decoding_pipeline_overlap.png" width="600">
+ <img src="https://github.com/media/tech_blog12_constrained_decoding_pipeline_overlap.png" width="600" alt="Guided decoding timelines with and without CPU/GPU overlap">
63-69: Add alt text to Figure 2 image (markdownlint MD045).

- <img src="https://github.com/media/tech_blog12_one_model_vs_two_model.png" width="600">
+ <img src="https://github.com/media/tech_blog12_one_model_vs_two_model.png" width="600" alt="GPU timelines: one-model vs two-model speculative decoding">
140-146: Add alt text to Figure 4 image (markdownlint MD045).

- <img src="https://github.com/media/tech_blog12_cpu_gpu_synchronization_for_multiple_steps_by_cuda_callback.png" width="800">
+ <img src="https://github.com/media/tech_blog12_cpu_gpu_synchronization_for_multiple_steps_by_cuda_callback.png" width="800" alt="CPU-GPU synchronization across multiple steps via CUDA callbacks">
262-266: Add alt text to Figures 5–8 (markdownlint MD045).

- <img src="https://github.com/media/tech_blog12_pareto_curve_json_mode_eval_llama_3.1_8b.png" width="600">
+ <img src="https://github.com/media/tech_blog12_pareto_curve_json_mode_eval_llama_3.1_8b.png" width="600" alt="Pareto curve: JSON Mode Eval, LLaMA 3.1 8B on H200">
- <img src="https://github.com/media/tech_blog12_pareto_curve_json_mode_eval_llama_3.3_70b.png" width="600">
+ <img src="https://github.com/media/tech_blog12_pareto_curve_json_mode_eval_llama_3.3_70b.png" width="600" alt="Pareto curve: JSON Mode Eval, LLaMA 3.3 70B on H200">
- <img src="https://github.com/media/tech_blog12_pareto_curve_json_schema_bench_llama_3.1_8b.png" width="600">
+ <img src="https://github.com/media/tech_blog12_pareto_curve_json_schema_bench_llama_3.1_8b.png" width="600" alt="Pareto curve: JSON Schema Bench, LLaMA 3.1 8B on H200">
- <img src="https://github.com/media/tech_blog12_pareto_curve_json_schema_bench_llama_3.3_70b.png" width="600">
+ <img src="https://github.com/media/tech_blog12_pareto_curve_json_schema_bench_llama_3.3_70b.png" width="600" alt="Pareto curve: JSON Schema Bench, LLaMA 3.3 70B on H200">

Also applies to: 271-274, 283-285, 290-292
32-37: Unordered list marker style differs from repo lint expectations (markdownlint MD004). Switch “*” to “-” in lists or adjust the markdownlint config for this doc. Given the linter output, prefer dashes.
Also applies to: 74-79, 81-82, 123-137, 198-201, 221-223, 229-231, 308-309
151-155: Add a commit SHA to versioned deep links for reproducibility. Links point to NVIDIA/TensorRT-LLM v1.1.0rc5; include the exact commit SHA used for the benchmarks to avoid future drift.
Verification: referenced image assets are present; "Troubleshooting" anchors found at lines 21–22.
Applies to lines 151–155 and 190–206.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (8)
docs/source/blogs/media/tech_blog12_constrained_decoding_pipeline_overlap.png is excluded by !**/*.png
docs/source/blogs/media/tech_blog12_cpu_gpu_synchronization_for_multiple_steps.png is excluded by !**/*.png
docs/source/blogs/media/tech_blog12_cpu_gpu_synchronization_for_multiple_steps_by_cuda_callback.png is excluded by !**/*.png
docs/source/blogs/media/tech_blog12_one_model_vs_two_model.png is excluded by !**/*.png
docs/source/blogs/media/tech_blog12_pareto_curve_json_mode_eval_llama_3.1_8b.png is excluded by !**/*.png
docs/source/blogs/media/tech_blog12_pareto_curve_json_mode_eval_llama_3.3_70b.png is excluded by !**/*.png
docs/source/blogs/media/tech_blog12_pareto_curve_json_schema_bench_llama_3.1_8b.png is excluded by !**/*.png
docs/source/blogs/media/tech_blog12_pareto_curve_json_schema_bench_llama_3.3_70b.png is excluded by !**/*.png
📒 Files selected for processing (2)
README.md (1 hunks)
docs/source/blogs/tech_blog/blog12_Combining_Guided_Decoding_and_Speculative_Decoding.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog12_Combining_Guided_Decoding_and_Speculative_Decoding.md
3-3: Emphasis used instead of a heading (MD036, no-emphasis-as-heading)

21-21, 22-22: Link fragments should be valid (MD051, link-fragments)

32-36, 74-77, 81-82, 123-124, 128-138, 198-201, 221-223, 229-231, 308-309: Unordered list style. Expected: dash; Actual: asterisk (MD004, ul-style)

52, 65, 86, 142, 264, 272, 283, 290: Images should have alternate text (alt text) (MD045, no-alt-text)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
README.md (1)
21-22: LGTM; blog file present. Verify the built docs link.
File exists: docs/source/blogs/tech_blog/blog12_Combining_Guided_Decoding_and_Speculative_Decoding.md. Verify the link resolves after building the docs.
/bot skip --comment "doc changes only"
PR_Github #19346 [ skip ] triggered by Bot
PR_Github #19346 [ skip ] completed with state
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
Currently, the image links are relative links for review purposes.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with the pull request.
skip
skip --comment COMMENT
Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.