[None][Doc]: Display tech blog for nvidia.github.io domain.
Signed-off-by: nv-guomingz <[email protected]>
nv-guomingz committed Aug 26, 2025
commit c99ac2c405f420d156484bad9e7ac0e1322dea5a
@@ -82,7 +82,7 @@ Firstly let’s have an overview of the overall imbalance issues across layers:

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture1.png">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture1.png">
</figure>
</div>
<p align="center"><sub><em>Figure 1: The routed token count from rank 0 to all the ranks(including rank 0), for decode iteration 1950, and all the MoE layers</em></sub></p>
@@ -93,7 +93,7 @@ If we zoom on the MoE in the layer 36 and record its activated expert rank distr

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture2.png">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture2.png">
</figure>
</div>
<p align="center"><sub><em>Figure 2: The tokens received for each expert rank for layer 36</em></sub></p>
@@ -102,7 +102,7 @@ If we flatten the data to see the routed tokens for each expert, we can see that

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture3.png">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture3.png">
</figure>
</div>
<p align="center"><sub><em>Figure 3: The tokens received for each expert for layer 36</em></sub></p>
@@ -111,7 +111,7 @@ It is also interesting to see that this kind of imbalance issue is very stable a

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture4.png">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture4.png">
</figure>
</div>
<p align="center"><sub><em>Figure 4: The accumulated token counts received for each expert for layer 36, within 50 decode steps, and the local batch size=256.</em></sub></p>
@@ -121,7 +121,7 @@ We have also done the duration-based analysis for local batch size=1 which corre

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture5.png">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture5.png">
</figure>
</div>
<p align="center"><sub><em>Figure 5: The accumulated token counts received for each expert for layer 36, within 400 decode iterations, and the local batch size \= 1\.</em></sub></p>
@@ -139,7 +139,7 @@ And another natural question is whether the above observation can change signifi

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture6.png">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture6.png">
</figure>
</div>
<p align="center"><sub><em>Figure 6: The routed token count from rank 0 to all the ranks, for iteration 1950, and all the MoE layers</em></sub></p>
@@ -148,7 +148,7 @@ In Figure 6, compared with Figure 1, it can be seen that for GSM8K, the hot laye

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture7.png">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture7.png">
</figure>
</div>
<p align="center"><sub><em>Figure 7: routed token counts from EP rank 0 to other EP ranks, still taking the iteration 1950, MoE layer 36 as the example</em></sub></p>
@@ -158,7 +158,7 @@ Based on Figure 8, it can be observed that the workload imbalance is relatively

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture8.png">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture8.png">
</figure>
</div>
<p align="center"><sub><em>Figure 8: The accumulated token counts sent from EP Rank 0 to all the ranks, for MoE layer 57 within 50 decode steps, local batch size=256</em></sub></p>
@@ -167,7 +167,7 @@ If we flatten the EP rank level data to expert-level data, we can have the follo

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture9.png">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture9.png">
</figure>
</div>
<p align="center"><sub><em>Figure 9: The accumulated token counts received for each expert for layer 57, within 50 decode steps, and the local batch size=256.</em></sub></p>
@@ -176,7 +176,7 @@ The similar imbalance pattern also exists for a single request.

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture10.png">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture10.png">
</figure>
</div>
<p align="center"><sub><em>Figure 10: The accumulated token counts received for each expert for layer 57, within 400 decode steps, for a single request</em></sub></p>
@@ -185,7 +185,7 @@ If we use another request, then we can still observe the expert imbalance issue,

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture11.png">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture11.png">
</figure>
</div>
<p align="center"><sub><em>Figure 11: The accumulated token counts received for each expert for layer 57, within 400 decode steps, for a single request</em></sub></p>
@@ -218,7 +218,7 @@ To make sure large-scale EP can run well, careful considerations are needed to m

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture12.png">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture12.png">
</figure>
</div>
<p align="center"><sub><em>Figure 12: the high-level design of TensorRT-LLM large-scale EP</em></sub></p>
@@ -247,7 +247,7 @@ For the **Update Weights \& Placement** component, we identified two design choi

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture13.png">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture13.png">
</figure>
</div>
<p align="center"><sub><em>Figure 13: One example of the layer-wise MoE weight re-distribution</em></sub></p>
@@ -258,7 +258,7 @@ Let’s use GB200 as an example. In Figure 14, we illustrate the communication b

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture14.png" width="500" >
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture14.png" width="500" >
</figure>
</div>
<p align="center"><sub><em>Figure 14: high-level topology of GB200 system</em></sub></p>
@@ -270,7 +270,7 @@ Let's assume that we target **50ms** inter-token-latency (ITL) as our main laten

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture15.png" width="300" >
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture15.png" width="300" >
</figure>
</div>
<p align="center"><sub><em>Figure 15: The theoretical expert count to be updated for each iteration with following 50ms ITL constraints, by using different HW as pools to store the full MoE weight</em></sub></p>
@@ -407,14 +407,14 @@ As shown by Figure 1, on the machine translation dataset, MoE layer 36 suffers f

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture16.png" >
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture16.png" >
</figure>
</div>
<p align="center"><sub><em>Figure 16: The routed token count by receiving ranks (x-axis) and iterations (y-axis) at layer 36 (No EPLB)</em></sub></p>

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture17.png" >
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture17.png" >
</figure>
</div>
<p align="center"><sub><em>Figure 17: The routed token count by experts (x-axis) and iterations (y-axis) at layer 36 (No EPLB)</em></sub></p>
@@ -425,14 +425,14 @@ With the above statistics, we can perform offline EPLB. One potential strategy i

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture18.png" >
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture18.png" >
</figure>
</div>
<p align="center"><sub><em>Figure 18: The routed token count by receiving ranks (x-axis) and iterations (y-axis) at layer 36 (EPLB with 9 per-rank slots and EP 32)</em></sub></p>

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture19.png" >
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture19.png" >
</figure>
</div>
<p align="center"><sub><em>Figure 19: The routed token count by experts (x-axis) and iterations (y-axis) at layer 36 (EPLB with 9 per-rank slots and EP 32)</em></sub></p>
@@ -441,14 +441,14 @@ Another EPLB strategy is to maintain 8 expert slots per rank while increasing ex

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture20.png" >
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture20.png" >
</figure>
</div>
<p align="center"><sub><em>Figure 20: The routed token count by receiving ranks (x-axis) and iterations (y-axis) at layer 36 (EPLB with 8 per-rank slots and EP 36)</em></sub></p>

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture21.png" >
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture21.png" >
</figure>
</div>
<p align="center"><sub><em>Figure 21: The routed token count by experts (x-axis) and iterations (y-axis) at layer 36 (EPLB with 8 per-rank slots and EP 36)</em></sub></p>
@@ -473,7 +473,7 @@ Let’s still use the machine translation dataset, DeepSeek R1 model, layer 36

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture22.png" >
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture22.png" >
</figure>
</div>
<p align="center"><sub><em>Figure 22: The token count sent from rank 0 to all the ranks, run on GB200, with EP32, local batch size=256, with 256 slots(no replication), so each rank hosts 8 experts</em></sub></p>
@@ -484,7 +484,7 @@ In Figure 22, only placement adjustment has been done by the Online EPLB. If we

<div align="center">
<figure>
<img src="../media/tech_blog4_Picture23.png" >
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture23.png" >
</figure>
</div>
<p align="center"><sub><em>Figure 23: The token count sent from rank 0 to all the ranks, run on GB200, with EP32, local batch size=256, with 288 slots(with replication), so each rank hosts 9 experts</em></sub></p>
@@ -498,7 +498,7 @@ Note: all the representative workloads illustrated in this section are from the
Let's use some representative workloads to illustrate the performance impact with large-scale EP.
<div align="center">
<figure>
<img src="../media/tech_blog4_Picture24.png" width="500" >
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture24.png" width="500" >
</figure>
</div>
<p align="center"><sub><em>Figure 24: EP impact over MoE Group GEMM and EP communication</em></sub></p>
@@ -508,7 +508,7 @@ When the EP size increases from 18 to 72, the speed-up diminishes. We are workin
Next, let's use some representative workloads to understand the performance impact with EPLB.
<div align="center">
<figure>
<img src="../media/tech_blog4_Picture25.png" width="500" >
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog4_Picture25.png" width="500" >
</figure>
</div>
<p align="center"><sub><em>Figure 25: EPLB performance impact</em></sub></p>
@@ -53,7 +53,7 @@ spec_config = NGramDecodingConfig(

<div align="center">
<figure>
<img src="../media/tech_blog7_init_sequence_scan.png" width="auto" height="auto">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog7_init_sequence_scan.png" width="auto" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 1. Request initial scan</em></sub></p>
@@ -66,7 +66,7 @@ spec_config = NGramDecodingConfig(

<div align="center">
<figure>
<img src="../media/tech_blog7_per_token_update.png" width="auto" height="auto">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog7_per_token_update.png" width="auto" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 2. Per-token update</em></sub></p>
@@ -107,7 +107,7 @@ For batch size of 1, 4 and 32, we configure the max_batch_size of the model acco

<div align="center">
<figure>
<img src="../media/tech_blog7_speed_up_first_turn.png" width="80%" height="auto">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog7_speed_up_first_turn.png" width="80%" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 3. First Turn Speed-up</em></sub></p>
@@ -127,7 +127,7 @@ Figure 4 shows the distribution of accepted length (AL) with `k=3, v=5`. When `A

<div align="center">
<figure>
<img src="../media/tech_blog7_magpie_accepted_length_distribution.png" width="90%" height="auto">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog7_magpie_accepted_length_distribution.png" width="90%" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 4. Accepted draft token length distribution</em></sub></p>
@@ -136,7 +136,7 @@ In Figure 5, for each iteration, we plot the average of accepted length (AL) for

<div align="center">
<figure>
<img src="../media/tech_blog7_al_over_iteration_magpie.png" width="auto" height="auto">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog7_al_over_iteration_magpie.png" width="auto" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 5. AL over iteration</em></sub></p>
@@ -145,7 +145,7 @@ Figure 6 shows the speed-up with N-Gram speculative decoding for the second turn
N-Gram with `k = 3, v = 5` delivers 96.13% of speed-up with single batch and 63.99% of speed-up with batch size 4. With batch size 32 and N-Gram `k = 5, v = 3`, the speed up is 33.06%.
<div align="center">
<figure>
<img src="../media/tech_blog7_speed_up_second_turn.png" width="80%" height="auto">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog7_speed_up_second_turn.png" width="80%" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 6. Second Turn Speed-up</em></sub></p>
@@ -175,7 +175,7 @@ From the pie chart on the left, among the seven draft tokens proposed by N-Gram,

<div align="center">
<figure>
<img src="../media/tech_blog7_accepted_length_case2.png" width="auto" height="auto">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog7_accepted_length_case2.png" width="auto" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 7. Accepted Tokens from Drafts</em></sub></p>
@@ -42,7 +42,7 @@ Our initial kernel breakdown and analysis revealed several key observations abou

<div align="center">
<figure>
<img src="../media/tech_blog8_kernel_breakdown.png" width="1000">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog8_kernel_breakdown.png" width="1000">
</figure>
</div>
<p align="center"><sub><em>Figure 1: Kernel breakdown when scaling EP without EPLB.</em></sub></p>
@@ -57,7 +57,7 @@ Before MoE group GEMMs, `M` tokens are expanded to `M * topK` tokens, which are

<div align="center">
<figure>
<img src="../media/tech_blog8_moe_aux_kernels1.png" width="400">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog8_moe_aux_kernels1.png" width="400">
</figure>
</div>
<p align="center"><sub><em>Figure 2: Sparsity of valid expanded tokens. For DeepSeek-R1 deployed with EP 32, a batch of 12 tokens are expanded to 96 tokens, but only 3 are valid on rank 0.</em></sub></p>
@@ -70,7 +70,7 @@ This optimization was implemented in [PR 5215](https://github.com/NVIDIA/TensorR

<div align="center">
<figure>
<img src="../media/tech_blog8_moe_aux_kernels2.png">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog8_moe_aux_kernels2.png">
</figure>
</div>
<p align="center"><sub><em>Figure 3: Optimization effect on MoE auxiliary kernels. (Left) Before optimization, kernel time increases with EP size. (Right) After optimization, kernel time remains constant with EP size.</em></sub></p>
@@ -87,7 +87,7 @@ This optimization was implemented in [PR 5570](https://github.com/NVIDIA/TensorR

<div align="center">
<figure>
<img src="../media/tech_blog8_communication_kernel.png">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog8_communication_kernel.png">
</figure>
</div>
<p align="center"><sub><em>Figure 4: Optimization effect on communication kernels.</em></sub></p>
@@ -279,21 +279,21 @@ We explored different workloads including 1k-ISL 1k-OSL, 4k-ISL 1k-OSL, and 8k-I

<div align="center">
<figure>
<img src="../media/tech_blog8_perf-1k-1k-dep.png" width="800">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog8_perf-1k-1k-dep.png" width="800">
</figure>
</div>
<p align="center"><sub><em>Figure 5: DeepSeek R1 throughput on ISL/OSL 1k/1k.</em></sub></p>

<div align="center">
<figure>
<img src="../media/tech_blog8_perf-4k-1k-dep.png" width="800">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog8_perf-4k-1k-dep.png" width="800">
</figure>
</div>
<p align="center"><sub><em>Figure 6: DeepSeek R1 throughput on ISL/OSL 4k/1k.</em></sub></p>

<div align="center">
<figure>
<img src="../media/tech_blog8_perf-8k-1k-dep.png" width="800">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog8_perf-8k-1k-dep.png" width="800">
</figure>
</div>
<p align="center"><sub><em>Figure 7: DeepSeek R1 throughput on ISL/OSL 8k/1k.</em></sub></p>
@@ -302,7 +302,7 @@ When enabling MTP, there is an extra performance boost compared to the baseline.

<div align="center">
<figure>
<img src="../media/tech_blog8_perf-8k-1k-e2e-mtp.png" width="800">
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog8_perf-8k-1k-e2e-mtp.png" width="800">
</figure>
</div>
<p align="center"><sub><em>Figure 8: DeepSeek R1 throughput on ISL/OSL 8k/1k with MTP enabled.</em></sub></p>
5 changes: 2 additions & 3 deletions docs/source/index.rst
@@ -151,15 +151,14 @@ Welcome to TensorRT-LLM's Documentation!
.. toctree::
:maxdepth: 2
:caption: Blogs
:hidden:
:glob:

blogs/H100vsA100.md
blogs/H200launch.md
blogs/Falcon180B-H200.md
blogs/quantization-in-TRT-LLM.md
blogs/XQA-kernel.md
blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md
blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md
blogs/tech_blog/*


Indices and tables