
Conversation

liji-nv
Collaborator

@liji-nv liji-nv commented Aug 27, 2025

  • Let all MoE backends go through the same interface
  • MoE is wrapped with a custom op to improve full-graph torch.compile compatibility (see the sketch below)
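
A minimal sketch of the custom-op wrapping idea (illustrative only; the op name, registry, and register_moe helper are hypothetical, not this PR's actual code):

from typing import Dict

import torch

_MOE_REGISTRY: Dict[int, torch.nn.Module] = {}  # hypothetical per-layer registry


def register_moe(layer_idx: int, module: torch.nn.Module) -> None:
    _MOE_REGISTRY[layer_idx] = module


@torch.library.custom_op("trtllm_example::moe_forward", mutates_args=())
def moe_forward(x: torch.Tensor, layer_idx: int) -> torch.Tensor:
    # Look up the registered MoE layer at runtime and run its real implementation.
    return _MOE_REGISTRY[layer_idx].forward_impl(x)


@moe_forward.register_fake
def _moe_forward_fake(x: torch.Tensor, layer_idx: int) -> torch.Tensor:
    # Shape/dtype-only path used during tracing/compilation.
    return torch.empty_like(x)

Because the op is registered through torch.library, full-graph torch.compile treats the MoE call as a single opaque node instead of tracing into the backend implementation.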

Summary by CodeRabbit

  • New Features

    • Per-layer MoE context with runtime registration and FP4-aware custom-op support.
    • Exposes unpadded hidden size for fused MoE kernels.
  • Refactor

    • Simplified MoE interfaces: removed global max-token parameter from public APIs; padding and max-token decisions now derived from per-rank counts.
    • Internal MoE entry points reorganized to separate public wrapper from implementation.
  • Behavioral

    • Metadata simplified: rank-token counts are now a plain field (no automatic max tracking).
  • Tests

    • Unit tests updated to match the simplified MoE interfaces.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
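
For example, two illustrative invocations using only the flags documented above:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"
/bot run --post-merge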

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@liji-nv liji-nv requested review from a team as code owners August 27, 2025 02:42
@liji-nv liji-nv requested review from 2ez4bz, byshiue and hlu1 August 27, 2025 02:42
Contributor

coderabbitai bot commented Aug 27, 2025

📝 Walkthrough

Walkthrough

Removed all_rank_max_num_tokens across MoE routing paths and metadata; renamed fused-backend entrypoints from forward to forward_impl; added per-layer layer_idx registration and a MoE.forward wrapper that can route through a new moe_custom_op; DP-padding now derives max tokens from all_rank_num_tokens; tests and call-sites updated accordingly.

Changes

Cohort / File(s) Summary
Model MoE routing: drop all_rank_max_num_tokens
tensorrt_llm/_torch/models/modeling_deepseekv3.py, tensorrt_llm/_torch/models/modeling_llama.py, tensorrt_llm/_torch/models/modeling_gpt_oss.py, tensorrt_llm/_torch/models/modeling_mixtral.py, tensorrt_llm/_torch/models/modeling_qwen3_moe.py, tensorrt_llm/_torch/models/modeling_qwen_moe.py
Removed reads/arguments of all_rank_max_num_tokens; updated signatures and call-sites so routing/expert invocations use only all_rank_num_tokens (and do_finalize/use_dp_padding where applicable).
Fused MoE interface & custom-op
tensorrt_llm/_torch/modules/fused_moe/interface.py
Added per-layer layer_idx registration into model_config.extra_attrs, introduced moe_custom_op and fake-op sizing path, added forward_impl abstract and a @final forward wrapper that dispatches to moe_custom_op when registered; added FP4 wrapping and dynamic padding helpers.
Fused MoE backends: rename & param cleanup
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py, .../fused_moe_deepgemm.py, .../fused_moe_triton.py, .../fused_moe_trtllm_gen.py, .../fused_moe_wide_ep.py
Renamed backend entrypoints forward → forward_impl, removed all_rank_max_num_tokens from signatures, forwarded layer_idx to the base MoE, added unpadded_hidden_size (Cutlass), and compute DP padding via max(all_rank_num_tokens) when use_dp_padding is enabled.
Fused MoE tests
tests/unittest/_torch/modules/test_fused_moe.py
Removed all_rank_max_num_tokens from test call-sites; tests now pass only all_rank_num_tokens and use_dp_padding where applicable.
Speculative MTP usage
tensorrt_llm/_torch/speculative/mtp.py, tensorrt_llm/_torch/speculative/interface.py
Removed passing/stored all_rank_max_num_tokens in speculative MTP calls and SpecMetadata; all_rank_num_tokens is now a plain field (no private backing or derived max).
Attention/Spec metadata: remove max tracking
tensorrt_llm/_torch/attention_backend/interface.py, tensorrt_llm/_torch/speculative/interface.py
Converted all_rank_num_tokens from a property to a plain field, removed _all_rank_num_tokens and all_rank_max_num_tokens and associated getter/setter synchronization; cross-attention variants updated similarly.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant ModelLayer as Model Layer
  participant MoEWrapper as MoE.forward (wrapper)
  participant CustomOp as moe_custom_op
  participant Backend as Backend.forward_impl

  Caller->>ModelLayer: layer.forward(x, router_logits, do_finalize, output_dtype, all_rank_num_tokens, use_dp_padding)
  ModelLayer->>MoEWrapper: forward(...)
  alt register_to_config == True
    MoEWrapper->>CustomOp: moe_custom_op(layer_idx, x/x_sf/is_swizzled, router_logits, do_finalize, output_dtype, all_rank_num_tokens, use_dp_padding)
    CustomOp->>Backend: forward_impl(x_or_fp4, router_logits, do_finalize, output_dtype, all_rank_num_tokens, use_dp_padding)
    note right of Backend: if use_dp_padding → padding = max(all_rank_num_tokens)
    Backend-->>CustomOp: outputs (tensor or list)
    CustomOp-->>MoEWrapper: outputs
  else
    MoEWrapper->>Backend: forward_impl(x, router_logits, do_finalize, output_dtype, all_rank_num_tokens, use_dp_padding)
    Backend-->>MoEWrapper: outputs
  end
  MoEWrapper-->>Caller: outputs
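
The padding rule in the diagram note above, as a hedged sketch (the helper name is illustrative, not from the PR): when use_dp_padding is enabled, every rank pads its token count up to the maximum across ranks; otherwise the per-rank counts are used as-is.

from typing import List


def pad_rank_token_counts(all_rank_num_tokens: List[int],
                          use_dp_padding: bool) -> List[int]:
    # With DP padding, all ranks process the same (max) number of tokens.
    if use_dp_padding:
        return [max(all_rank_num_tokens)] * len(all_rank_num_tokens)
    return list(all_rank_num_tokens)


# Example: pad_rank_token_counts([3, 7, 5], True) -> [7, 7, 7]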

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

SW Architecture

Suggested reviewers

  • hlu1
  • yweng0828
  • 2ez4bz
  • byshiue
  • yizhang-nv


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/models/modeling_gpt_oss.py (1)

211-217: Fix attribute name for experts_per_token (will raise at runtime).

GptOss configs use num_experts_per_tok elsewhere; experts_per_token may not exist. Safer to source from the routing method’s top_k.

Apply this diff:

-        return get_cached_perfect_router_logits(
-            num_tokens=num_tokens,
-            num_experts=num_experts,
-            experts_per_token=pretrained_config.experts_per_token,
-            moe_ep_size=self.config.mapping.moe_ep_size,
-            device=device,
-            dtype=pretrained_config.torch_dtype)
+        return get_cached_perfect_router_logits(
+            num_tokens=num_tokens,
+            num_experts=num_experts,
+            experts_per_token=self.routing_method.top_k,
+            moe_ep_size=self.config.mapping.moe_ep_size,
+            device=device,
+            dtype=pretrained_config.torch_dtype)
🧹 Nitpick comments (14)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (3)

1-1: Add NVIDIA copyright header.

Per guidelines, prepend the current-year NVIDIA header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

24-24: Avoid string comparison for version checks.

Use proper semantic version comparison; string compare fails for "3.10.0" vs "3.4.0".

-    assert triton.__version__ >= "3.4.0", "Triton kernels are detected but the Triton wheel is too old"
+    try:
+        from packaging.version import Version
+        assert Version(triton.__version__) >= Version("3.4.0"), \
+            "Triton kernels are detected but the Triton wheel is too old"
+    except Exception:
+        major, minor, patch = (triton.__version__.split(".") + ["0", "0"])[:3]
+        assert (int(major), int(minor), int(patch)) >= (3, 4, 0), \
+            "Triton kernels are detected but the Triton wheel is too old"

1363-1384: Guard DP arguments in forward_impl.

Add a defensive check so use_dp_padding/all_rank_num_tokens aren’t used inconsistently when DP is disabled.

     ) -> torch.Tensor:
         assert do_finalize, "TritonFusedMoE does not support do_finalize=False"
-        assert use_dp_padding is None or not use_dp_padding, \
+        assert use_dp_padding is None or not use_dp_padding, \
             "TritonFusedMoE does not support use_dp_padding=True"
+        if self.use_dp and self.parallel_size > 1:
+            assert all_rank_num_tokens is not None, "all_rank_num_tokens required when DP is enabled"
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (3)

1-1: Add NVIDIA copyright header.

Per guidelines, prepend the current-year NVIDIA header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

664-669: Handle None when DP is disabled.

If callers pass use_dp_padding=True by mistake in non-DP runs, max(all_rank_num_tokens) will crash. Make the padding branch conditional on both flags.

-        if use_dp_padding:
-            all_rank_num_tokens_padded = [max(all_rank_num_tokens)
-                                          ] * len(all_rank_num_tokens)
+        if use_dp_padding and all_rank_num_tokens is not None:
+            all_rank_num_tokens_padded = [max(all_rank_num_tokens)] * len(all_rank_num_tokens)
         else:
             all_rank_num_tokens_padded = all_rank_num_tokens

97-101: Fix assertion message typo.

“cannot be divisible” should be “must be divisible”.

-    ), "the last dimension of `input` cannot be divisible by `group_size`"
+    ), "the last dimension of `input` must be divisible by `group_size`"
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)

1-1: Add NVIDIA copyright header.

Per guidelines, prepend the current-year NVIDIA header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

380-386: Assume non-None all_rank_num_tokens in forward_chunk.

This relies on forward_impl always passing a non-None list. Add a lightweight assert to catch misuses.

-        all_rank_max_num_tokens = max(all_rank_num_tokens)
+        assert all_rank_num_tokens is not None, "all_rank_num_tokens required"
+        all_rank_max_num_tokens = max(all_rank_num_tokens)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)

1-1: Add NVIDIA copyright header.

Per guidelines, prepend the current-year NVIDIA header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)

1-1: Add NVIDIA copyright header.

Per guidelines, prepend the current-year NVIDIA header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

472-476: Handle None when DP is disabled.

Mirror the defensive pattern so max(None) is not evaluated outside DP.

-        if use_dp_padding:
-            all_rank_num_tokens_padded = [max(all_rank_num_tokens)
-                                          ] * len(all_rank_num_tokens)
+        if use_dp_padding and all_rank_num_tokens is not None:
+            all_rank_num_tokens_padded = [max(all_rank_num_tokens)] * len(all_rank_num_tokens)
         else:
             all_rank_num_tokens_padded = all_rank_num_tokens
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)

183-201: Tighten DP preconditions and make dtype check robust.

Mirror Cutlass/DeepGEMM precondition asserts for DP, and avoid crashes when x is not a torch.Tensor.

Apply this diff:

     def forward_impl(
         self,
         x: Union[torch.Tensor, Fp4QuantizedTensor],
         router_logits: torch.Tensor,
         do_finalize: bool = True,
         all_rank_num_tokens: Optional[List[int]] = None,
         use_dp_padding: Optional[bool] = None,
         **kwargs,
     ) -> torch.Tensor:
-
-        assert x.dtype == torch.bfloat16
+        # Require BF16 tensor input for TRTLLM-Gen paths
+        assert isinstance(x, torch.Tensor) and x.dtype == torch.bfloat16, \
+            f"TRTLLMGenFusedMoE expects BF16 torch.Tensor input; got {type(x)} with dtype {getattr(x, 'dtype', None)}"
+
+        # Validate DP inputs when parallelized
+        if self.use_dp and self.parallel_size > 1:
+            assert all_rank_num_tokens is not None and use_dp_padding is not None, \
+                "all_rank_num_tokens and use_dp_padding must be provided for attention-DP with TP>1"
tensorrt_llm/_torch/models/modeling_llama.py (1)

330-332: Avoid assigning lambdas (ruff E731).

Rewrite the lambdas as small local functions for readability and lint compliance.

Apply this diff:

-        fn0 = lambda: self.shared_expert(hidden_states)
-        fn1 = lambda: self.compute_routed_output(
-            hidden_states, all_rank_num_tokens, cutlass_min_latency_mode)
+        def fn0():
+            return self.shared_expert(hidden_states)
+
+        def fn1():
+            return self.compute_routed_output(
+                hidden_states, all_rank_num_tokens, cutlass_min_latency_mode)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

559-565: Use typing.List for Python 3.8 compatibility.

Replace PEP 585 list[int] with List[int] to match the project’s Python 3.8+ guideline.

Apply this diff:

-        all_rank_num_tokens: Optional[list[int]] = None,
+        all_rank_num_tokens: Optional[List[int]] = None,
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e12868b and 9ded5d9.

📒 Files selected for processing (13)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_gpt_oss.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_mixtral.py (0 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py (0 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen_moe.py (0 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/interface.py (5 hunks)
  • tests/unittest/_torch/modules/test_fused_moe.py (0 hunks)
💤 Files with no reviewable changes (4)
  • tests/unittest/_torch/modules/test_fused_moe.py
  • tensorrt_llm/_torch/models/modeling_qwen_moe.py
  • tensorrt_llm/_torch/models/modeling_mixtral.py
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/models/modeling_gpt_oss.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/models/modeling_gpt_oss.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_gpt_oss.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
🧬 Code graph analysis (8)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
  • forward_impl (451-559)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
  • forward_impl (713-859)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (6)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
  • forward_impl (451-559)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)
  • forward_impl (640-771)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (1)
  • forward_impl (1363-1383)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
  • forward_impl (713-859)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • forward_impl (206-213)
tensorrt_llm/_torch/modules/attention.py (2)
  • forward_impl (394-450)
  • forward_impl (926-1020)
tensorrt_llm/_torch/models/modeling_llama.py (2)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • all_rank_num_tokens (167-168)
  • all_rank_num_tokens (171-174)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
  • compute_routed_output (533-556)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (4)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
  • forward_impl (451-559)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)
  • forward_impl (640-771)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
  • forward_impl (183-401)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • forward_impl (206-213)
tensorrt_llm/_torch/modules/fused_moe/interface.py (3)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
  • calculate_tile_tokens_dim (16-26)
tensorrt_llm/_torch/utils.py (4)
  • Fp4QuantizedTensor (97-104)
  • get_model_extra_attrs (52-53)
  • _ (190-196)
  • shape (103-104)
tensorrt_llm/_torch/modules/fused_moe/routing.py (2)
  • BaseMoeRoutingMethod (158-181)
  • DeepSeekV3MoeRoutingMethod (213-223)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (7)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • all_rank_num_tokens (167-168)
  • all_rank_num_tokens (171-174)
tensorrt_llm/_torch/speculative/interface.py (2)
  • all_rank_num_tokens (204-205)
  • all_rank_num_tokens (208-211)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
  • forward_impl (451-559)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)
  • forward_impl (640-771)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (1)
  • forward_impl (1363-1383)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
  • forward_impl (183-401)
tensorrt_llm/_torch/utils.py (1)
  • Fp4QuantizedTensor (97-104)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (5)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)
  • forward_impl (640-771)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (1)
  • forward_impl (1363-1383)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
  • forward_impl (713-859)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • all_rank_num_tokens (167-168)
  • all_rank_num_tokens (171-174)
tensorrt_llm/_torch/speculative/interface.py (2)
  • all_rank_num_tokens (204-205)
  • all_rank_num_tokens (208-211)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • all_rank_num_tokens (167-168)
  • all_rank_num_tokens (171-174)
tensorrt_llm/_torch/speculative/interface.py (2)
  • all_rank_num_tokens (204-205)
  • all_rank_num_tokens (208-211)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_llama.py

330-330: Do not assign a lambda expression, use a def

Rewrite fn0 as a def

(E731)


331-332: Do not assign a lambda expression, use a def

Rewrite fn1 as a def

(E731)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (11)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (1)

1281-1291: LGTM: layer_idx propagation.

Forwarding layer_idx to the base MoE is consistent with the new custom-op path.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)

640-648: LGTM: unified forward_impl signature.

Signature matches the new interface and removes all_rank_max_num_tokens.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)

76-86: LGTM: layer_idx propagation.

Forwarding layer_idx to the base MoE is consistent with the custom-op wrapper.


713-735: LGTM: forward_impl refactor and DP-padding logic.

all_rank_max_num_tokens derivation and internal padding are consistent with other backends.

tensorrt_llm/_torch/modules/fused_moe/interface.py (1)

215-260: LGTM: unified forward wrapper with custom-op path.

Wrapper cleanly delegates to custom op or backend impl; matches new API.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)

88-89: LGTM: layer_idx propagation.

Consistent with the refactor and custom-op registration.


451-460: LGTM: forward_impl refactor.

Signature and DP checks align with other backends.

tensorrt_llm/_torch/models/modeling_gpt_oss.py (1)

278-281: MoE API unification: call site looks correct.

Switch to all_rank_num_tokens and dropping all_rank_max_num_tokens is consistent with the new backend interfaces. Explicit use_dp_padding=False satisfies Triton’s constraint.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)

79-80: Propagating layer_idx into the base MoE is good.

This wires per-layer context for custom op routing and matches the interface changes.

tensorrt_llm/_torch/models/modeling_llama.py (1)

314-319: MoE call updated to the new interface — looks good.

Passing only all_rank_num_tokens and setting use_dp_padding=False aligns with backend expectations.

tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

534-555: API removal of all_rank_max_num_tokens applied correctly.

Routed path now relies solely on all_rank_num_tokens with use_dp_padding=False; consistent with other backends.

@liji-nv liji-nv force-pushed the dev-liji-moe-wrapper branch 2 times, most recently from 66f5277 to 5930e39 on August 27, 2025 03:13
@liji-nv
Collaborator Author

liji-nv commented Aug 27, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16624 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)

97-101: Fix contradictory assertion message

The condition requires divisibility; the message says “cannot be divisible”.

Apply:

-    assert (
-        input.shape[-1] % group_size == 0
-    ), "the last dimension of `input` cannot be divisible by `group_size`"
+    assert (
+        input.shape[-1] % group_size == 0
+    ), "the last dimension of `input` must be divisible by `group_size`"
🧹 Nitpick comments (10)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (3)

191-201: Signature vs. dtype assertion mismatch

The signature advertises Union[torch.Tensor, Fp4QuantizedTensor] but the first assert hard-requires BF16. Consider narrowing the type hint to torch.Tensor or relax the assert with an explicit Fp4QuantizedTensor check.
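
A minimal sketch of the relaxed check suggested here (illustrative only; the helper name is hypothetical):

from typing import Union

import torch

from tensorrt_llm._torch.utils import Fp4QuantizedTensor


def check_moe_input(x: Union[torch.Tensor, Fp4QuantizedTensor]) -> None:
    # Accept either a BF16 tensor or an FP4-quantized wrapper instead of
    # hard-requiring BF16.
    assert isinstance(x, Fp4QuantizedTensor) or (
        isinstance(x, torch.Tensor) and x.dtype == torch.bfloat16
    ), f"Expected a BF16 torch.Tensor or Fp4QuantizedTensor, got {type(x)}"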


397-401: Guard slicing when use_dp_padding is True

This slice assumes all_rank_num_tokens is provided. With the DP guards added above, this is safe; without them it can throw. Please ensure the guards are added.


1-1: Missing NVIDIA copyright header

Per repo guidelines, prepend the current-year NVIDIA header.

Apply (top of file):

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (2)

298-337: Return type annotation mismatch (function returns None)

deepgemm_fp8_group_blockwise_gemm doesn’t return a tensor; callers don’t use a return value.

Apply one of:

-def deepgemm_fp8_group_blockwise_gemm(
+def deepgemm_fp8_group_blockwise_gemm(
@@
-) -> torch.Tensor:
+) -> None:
@@
-    return
+    return

Or return d explicitly and keep the annotation; current usage suggests -> None is cleaner.


1-1: Missing NVIDIA copyright header

Please prepend the header.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/models/modeling_llama.py (2)

330-333: Replace assigned lambdas (ruff E731)

Use defs for readability and lint compliance.

Apply:

-        fn0 = lambda: self.shared_expert(hidden_states)
-        fn1 = lambda: self.compute_routed_output(
-            hidden_states, all_rank_num_tokens, cutlass_min_latency_mode)
+        def _compute_shared_output():
+            return self.shared_expert(hidden_states)
+
+        def _compute_routed_output():
+            return self.compute_routed_output(
+                hidden_states, all_rank_num_tokens, cutlass_min_latency_mode)
@@
-        shared_output, routed_output = maybe_execute_in_parallel(
-            fn0, fn1, self.moe_event[0], self.moe_event[1], self.aux_stream)
+        shared_output, routed_output = maybe_execute_in_parallel(
+            _compute_shared_output, _compute_routed_output,
+            self.moe_event[0], self.moe_event[1], self.aux_stream)

1-1: Missing NVIDIA copyright header

Please add header at file start.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)

380-381: Defensive check for all_rank_num_tokens in forward_chunk

forward_chunk assumes a non-None list; add a lightweight assert to catch misuse.

Apply:

-        all_rank_max_num_tokens = max(all_rank_num_tokens)
+        assert all_rank_num_tokens is not None
+        all_rank_max_num_tokens = max(all_rank_num_tokens)

1-1: Missing NVIDIA copyright header

Please prepend the header.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)

1-1: Missing NVIDIA copyright header

Please add header at top.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9ded5d9 and 5930e39.

📒 Files selected for processing (13)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_gpt_oss.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_mixtral.py (0 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py (0 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen_moe.py (0 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/interface.py (5 hunks)
  • tests/unittest/_torch/modules/test_fused_moe.py (0 hunks)
💤 Files with no reviewable changes (4)
  • tensorrt_llm/_torch/models/modeling_mixtral.py
  • tensorrt_llm/_torch/models/modeling_qwen_moe.py
  • tests/unittest/_torch/modules/test_fused_moe.py
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • tensorrt_llm/_torch/models/modeling_gpt_oss.py
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/models/modeling_llama.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/models/modeling_llama.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/models/modeling_llama.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_llama.py
🧬 Code graph analysis (5)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
  • forward_impl (451-559)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
  • forward_impl (713-859)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (4)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • all_rank_num_tokens (167-168)
  • all_rank_num_tokens (171-174)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
  • forward_impl (451-559)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)
  • forward_impl (640-771)
tensorrt_llm/_torch/utils.py (1)
  • Fp4QuantizedTensor (97-104)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (4)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
  • forward_impl (451-559)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)
  • forward_impl (640-771)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (1)
  • forward_impl (1363-1383)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • forward_impl (211-218)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (5)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)
  • forward_impl (640-771)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
  • forward_impl (713-859)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • forward_impl (211-218)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • all_rank_num_tokens (167-168)
  • all_rank_num_tokens (171-174)
tensorrt_llm/_torch/speculative/interface.py (2)
  • all_rank_num_tokens (204-205)
  • all_rank_num_tokens (208-211)
tensorrt_llm/_torch/models/modeling_llama.py (2)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • all_rank_num_tokens (167-168)
  • all_rank_num_tokens (171-174)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (7)
  • forward (151-168)
  • forward (392-397)
  • forward (558-600)
  • forward (748-788)
  • forward (982-1066)
  • forward (1099-1128)
  • compute_routed_output (533-556)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_llama.py

330-330: Do not assign a lambda expression, use a def

Rewrite fn0 as a def

(E731)


331-332: Do not assign a lambda expression, use a def

Rewrite fn1 as a def

(E731)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (10)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (2)

79-80: layer_idx propagation — LGTM

Forwarding layer_idx to the base MoE keeps per-layer context consistent with other backends.


183-201: Please verify DP guards in forward_impl of fused_moe_trtllm_gen.py

It appears this backend does not currently guard against missing DP inputs when data parallelism is enabled. Without checking for all_rank_num_tokens and use_dp_padding, downstream scatter/reduce calls can misalign tokens across ranks.

Proposed change to mirror other MoE backends (e.g., Cutlass, DeepGEMM):

 def forward_impl(
@@
     ) -> torch.Tensor:
-        assert x.dtype == torch.bfloat16
+        # Guard DP inputs when using data parallelism
+        if self.use_dp and self.parallel_size > 1:
+            assert all_rank_num_tokens is not None, "all_rank_num_tokens must be provided in DP mode"
+            assert use_dp_padding is not None,    "use_dp_padding must be provided in DP mode"
+
+        assert x.dtype == torch.bfloat16

• File: tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
• Lines: around 183–201

Please confirm that

  1. self.use_dp and self.parallel_size are the correct flags for DP mode here.
  2. These assertions align with the guards in other fused MoE backends (e.g., in Cutlass/DeepGEMM modules).
  3. all_rank_num_tokens and use_dp_padding are passed through the higher‐level API in DP scenarios.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)

664-669: DP padding logic — LGTM

Deriving padding via max(all_rank_num_tokens) aligns with other backends and the PR objective to remove all_rank_max_num_tokens.

tensorrt_llm/_torch/models/modeling_llama.py (2)

315-319: MoE call-site update — LGTM

Dropping all_rank_max_num_tokens and passing do_finalize/use_dp_padding matches the new MoE interface.


531-537: Propagation of all_rank_num_tokens — LGTM

Forwarding attn_metadata.all_rank_num_tokens into MoE layers is consistent with the new API.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)

85-86: layer_idx forwarding — LGTM

Consistent with other MoE backends.


713-731: alltoall path selection — LGTM

Computing all_rank_max_num_tokens internally and using can_use_alltoall keeps the wide-EP flow coherent post-API change.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3)

88-93: unpadded_hidden_size addition — LGTM

Keeping unpadded_hidden_size and passing it to the custom op is appropriate for padded hidden sizes.


472-476: DP padding via max(all_rank_num_tokens) — LGTM

Matches the unified interface and other backends.


451-459: forward_impl contract — LGTM

Assert on do_finalize and DP guards match the intended usage and other backends.

@liji-nv liji-nv force-pushed the dev-liji-moe-wrapper branch from 5930e39 to b1977fc on August 27, 2025 06:21
@liji-nv liji-nv requested a review from a team as a code owner August 27, 2025 06:21
@liji-nv liji-nv requested a review from yweng0828 August 27, 2025 06:21
@tensorrt-cicd
Collaborator

PR_Github #16624 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12480 completed with status: 'FAILURE'

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/models/modeling_llama.py (1)

1-1: Add NVIDIA copyright header (2025).

Required by repo guidelines for all source files.

Apply at file top:

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)

1-1: Prepend NVIDIA copyright header (2025).

Keep the upstream DeepSeek notice; add NVIDIA header first.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

560-566: Fix Python 3.8-incompatible type hint (list[int]).

Use typing.List for 3.8 compatibility (no future annotations at top).

-        all_rank_num_tokens: Optional[list[int]] = None,
+        all_rank_num_tokens: Optional[List[int]] = None,
🧹 Nitpick comments (3)
tensorrt_llm/_torch/models/modeling_llama.py (1)

330-333: Replace assigned lambdas with local defs (ruff E731).

Avoid E731 and improves tracebacks.

Apply:

-        fn0 = lambda: self.shared_expert(hidden_states)
-        fn1 = lambda: self.compute_routed_output(
-            hidden_states, all_rank_num_tokens, cutlass_min_latency_mode)
+        def _compute_shared():
+            return self.shared_expert(hidden_states)
+
+        def _compute_routed():
+            return self.compute_routed_output(
+                hidden_states, all_rank_num_tokens, cutlass_min_latency_mode)
+
+        fn0 = _compute_shared
+        fn1 = _compute_routed
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)

380-380: Guard None/empty token lists before max().

Forward_impl asserts non-None, but adding a local assert improves failure locality and error message.

-        all_rank_max_num_tokens = max(all_rank_num_tokens)
+        assert all_rank_num_tokens, "all_rank_num_tokens must be a non-empty list"
+        all_rank_max_num_tokens = max(all_rank_num_tokens)

342-351: Minor naming consistency: all_reduce vs allreduce.

Two different members exist (base all_reduce and local allreduce). Consider renaming the local one to all_reduce_ep to avoid confusion.

-            self.allreduce = AllReduce(mapping=model_config.mapping,
+            self.all_reduce_ep = AllReduce(mapping=model_config.mapping,
                                        strategy=AllReduceStrategy.NCCL)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5930e39 and b1977fc.

📒 Files selected for processing (14)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_gpt_oss.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_mixtral.py (0 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py (0 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen_moe.py (0 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/interface.py (5 hunks)
  • tensorrt_llm/_torch/speculative/mtp.py (0 hunks)
  • tests/unittest/_torch/modules/test_fused_moe.py (0 hunks)
💤 Files with no reviewable changes (5)
  • tensorrt_llm/_torch/speculative/mtp.py
  • tensorrt_llm/_torch/models/modeling_mixtral.py
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py
  • tests/unittest/_torch/modules/test_fused_moe.py
  • tensorrt_llm/_torch/models/modeling_qwen_moe.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • tensorrt_llm/_torch/models/modeling_gpt_oss.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_llama.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/interface.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
🧬 Code graph analysis (4)
tensorrt_llm/_torch/models/modeling_llama.py (3)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • all_rank_num_tokens (167-168)
  • all_rank_num_tokens (171-174)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (7)
  • forward (151-168)
  • forward (392-397)
  • forward (558-600)
  • forward (748-788)
  • forward (982-1066)
  • forward (1099-1128)
  • compute_routed_output (533-556)
tensorrt_llm/_torch/models/modeling_qwen3_moe.py (4)
  • forward (55-58)
  • forward (118-157)
  • forward (204-297)
  • forward (344-370)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • all_rank_num_tokens (167-168)
  • all_rank_num_tokens (171-174)
tensorrt_llm/_torch/speculative/interface.py (2)
  • all_rank_num_tokens (204-205)
  • all_rank_num_tokens (208-211)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
  • forward_impl (451-559)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)
  • forward_impl (640-771)
tensorrt_llm/_torch/utils.py (1)
  • Fp4QuantizedTensor (97-104)
tensorrt_llm/_torch/modules/fused_moe/interface.py (5)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
  • calculate_tile_tokens_dim (16-26)
tensorrt_llm/_torch/utils.py (5)
  • Fp4QuantizedTensor (97-104)
  • get_model_extra_attrs (52-53)
  • is_torch_compiling (37-39)
  • _ (190-196)
  • shape (103-104)
tensorrt_llm/_torch/modules/fused_moe/routing.py (2)
  • BaseMoeRoutingMethod (158-181)
  • DeepSeekV3MoeRoutingMethod (213-223)
tensorrt_llm/_torch/modules/attention.py (3)
  • extract_extra_attrs (30-67)
  • forward_impl (394-450)
  • forward_impl (926-1020)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
  • forward_impl (451-559)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • all_rank_num_tokens (167-168)
  • all_rank_num_tokens (171-174)
tensorrt_llm/_torch/speculative/interface.py (2)
  • all_rank_num_tokens (204-205)
  • all_rank_num_tokens (208-211)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_llama.py

330-330: Do not assign a lambda expression, use a def

Rewrite fn0 as a def

(E731)


331-332: Do not assign a lambda expression, use a def

Rewrite fn1 as a def

(E731)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (10)
tensorrt_llm/_torch/models/modeling_llama.py (1)

314-318: MoE call migrated to forward_impl with unified args — looks good.

Matches the new interface (drops all_rank_max_num_tokens, adds do_finalize/use_dp_padding).

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)

76-86: Propagating layer_idx to base MoE — correct and consistent.

Enables per-layer registration for the custom-op path.


713-731: forward_impl signature alignment — good.

Asserts for all_rank_num_tokens/use_dp_padding match DP usage expectations; derivation of max-token is centralized here.


769-779: Zero-length chunk handling for alltoall — sensible.

Substituting 0 with 1 prevents illegal alltoall sizes in the chunked path.
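
A minimal sketch of the substitution, assuming a helper name that does not exist in the codebase:

def pad_empty_chunks(chunk_sizes):
    # Replace zero-sized chunks with 1 so every rank still passes a
    # non-empty size to the alltoall in the chunked path.
    return [size if size > 0 else 1 for size in chunk_sizes]

assert pad_empty_chunks([4, 0, 7]) == [4, 1, 7]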

tensorrt_llm/_torch/modules/fused_moe/interface.py (4)

26-37: Hardened MoE layer weakref lookup — good.

Clear errors for missing map, missing key, and GC’d refs.


40-71: Custom op wrapper is correct.

FP4 wrapping, do_finalize branching, and arg plumbing align with the new interface.


164-173: Per-layer registration with duplicate guard — looks good.

Avoids key collisions and ensures weakref stored only when layer_idx is concrete.
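
Roughly the pattern under review, shown as a sketch with assumed attribute names rather than the actual interface.py code:

import weakref

def register_moe_layer(extra_attrs: dict, layer_idx, moe_layer) -> None:
    # Store a weakref keyed by the concrete layer index so the custom op can
    # look the module up at call time without keeping it alive.
    if layer_idx is None:
        return  # only register when the index is concrete
    moe_layers = extra_attrs.setdefault("moe_layers", {})
    key = str(layer_idx)
    if key in moe_layers:
        raise KeyError(f"Duplicate MoE layer for layer_idx={key}")
    moe_layers[key] = weakref.ref(moe_layer)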


220-264: Final forward wrapper is correct.

Routes to custom-op under compile, else to forward_impl; FP4 metadata handling is consistent.
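
The dispatch decision it wraps, reduced to a toy sketch (strings stand in for the two call paths):

def dispatch_moe(is_compiling: bool, layer_registered: bool) -> str:
    # Under torch.compile with a registered layer, go through the custom op so
    # the MoE call stays a single graph node; otherwise call forward_impl.
    if is_compiling and layer_registered:
        return "custom_op"
    return "forward_impl"

assert dispatch_moe(True, True) == "custom_op"
assert dispatch_moe(False, True) == "forward_impl"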

tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)

534-555: Dropped all_rank_max_num_tokens from routed path — correct.

Callers now pass only all_rank_num_tokens; use_dp_padding controlled locally.


798-807: MoE callsite updated to new interface — good.

do_finalize plumbed; all_rank_num_tokens sourced from attn_metadata.

@liji-nv liji-nv force-pushed the dev-liji-moe-wrapper branch 2 times, most recently from d2e4ad8 to d19d657 Compare August 28, 2025 09:27
@liji-nv
Collaborator Author

liji-nv commented Aug 28, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16836 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)

1-1: Add NVIDIA copyright header (2025).

This source file is missing the required header at the top.

Apply:

+ # Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

259-264: Fix router scales being None when finalize fusion is enabled (hard crash risk).

With apply_router_weight_on_input=True, token_final_scales is nulled out as a workaround. CUTLASS finalize fusion requires non-null router scales; passing None can crash. Provide a tensor of ones to satisfy the kernel, since the scaling has already been applied to the input.

Apply:

         if self.apply_router_weight_on_input:
             assert x.dtype != torch.float8_e4m3fn, "Current workaround for apply_router_weight_on_input does not support fp8 input"
             x = x * token_final_scales.to(x.dtype)
             # TODO: remove this once we have correct fusedmoe kernel ready
-            token_final_scales = None
+            token_final_scales = (
+                torch.ones_like(token_selected_experts, dtype=torch.float32)
+                if self.use_fused_finalize else None
+            )
tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (1)

1-1: Add NVIDIA copyright header (2025).

Missing at the top of this source.

Apply:

+ # Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/models/modeling_llama.py (1)

1-1: Add NVIDIA copyright header (2025).

Header is required on this source file.

Apply:

+ # Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

1-1: Add NVIDIA copyright header (2025).

Missing at file top.

Apply:

+ # Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
♻️ Duplicate comments (3)
tensorrt_llm/_torch/modules/fused_moe/interface.py (3)

26-37: MoE layer lookup hardening — LGTM.

Clear error messages and GC’d weakref handling are in place.


165-174: Guarded registration prevents None-key collisions — LGTM.

Only registers when layer_idx is provided and checks duplicates.


1-1: Add NVIDIA copyright header (2025).

File is missing the standard header.

Apply:

+ # Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

1-27: Consider adding NVIDIA copyright header

According to the coding guidelines, all source files should have an NVIDIA copyright header with the current year prepended.

Add the NVIDIA copyright header at the beginning of the file:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 # --------------------------------------------------
 # Portions of this code were derived from DeepSeek‑V3:
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b1977fc and d19d657.

📒 Files selected for processing (14)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_gpt_oss.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_mixtral.py (0 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py (0 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen_moe.py (0 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/interface.py (5 hunks)
  • tensorrt_llm/_torch/speculative/mtp.py (0 hunks)
  • tests/unittest/_torch/modules/test_fused_moe.py (0 hunks)
💤 Files with no reviewable changes (5)
  • tensorrt_llm/_torch/models/modeling_mixtral.py
  • tests/unittest/_torch/modules/test_fused_moe.py
  • tensorrt_llm/_torch/models/modeling_qwen_moe.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tensorrt_llm/_torch/models/modeling_gpt_oss.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports: import the subpackage/module, not the symbol (from package.subpackage import foo; foo.SomeClass())
Naming: files snake_case; classes PascalCase; functions/methods snake_case; local variables snake_case (k_ prefix if starting with a number); globals G_ + UPPER_SNAKE_CASE; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; reserve comments for function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; inline docstrings for attributes/variables are allowed
Avoid reflection when straightforward code suffices (e.g., prefer explicit parameters over dict(**locals()))
Use narrow except clauses (e.g., catch FileNotFoundError instead of bare except)
For duck-typing try/except, keep try body minimal and use else for the main logic

Files:

  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header with current year to all source files

Files:

  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
🧠 Learnings (8)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/interface.py
📚 Learning: 2025-08-28T08:07:44.865Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-28T08:07:44.865Z
Learning: Applies to **/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py} : Prepend NVIDIA copyright header with current year to all source files

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/interface.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/interface.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
🧬 Code graph analysis (6)
tensorrt_llm/_torch/modules/fused_moe/interface.py (6)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
  • calculate_tile_tokens_dim (16-26)
tensorrt_llm/_torch/model_config.py (1)
  • ModelConfig (62-529)
tensorrt_llm/_torch/utils.py (5)
  • Fp4QuantizedTensor (97-104)
  • get_model_extra_attrs (52-53)
  • is_torch_compiling (37-39)
  • _ (190-196)
  • shape (103-104)
tensorrt_llm/_torch/modules/fused_moe/routing.py (2)
  • BaseMoeRoutingMethod (158-181)
  • DeepSeekV3MoeRoutingMethod (213-223)
tensorrt_llm/_torch/modules/attention.py (3)
  • extract_extra_attrs (30-67)
  • forward_impl (394-450)
  • forward_impl (926-1020)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
  • forward_impl (183-401)
tensorrt_llm/_torch/models/modeling_llama.py (3)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • all_rank_num_tokens (167-168)
  • all_rank_num_tokens (171-174)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (7)
  • forward (151-168)
  • forward (392-397)
  • forward (558-600)
  • forward (748-788)
  • forward (982-1066)
  • forward (1099-1128)
  • compute_routed_output (533-556)
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (1)
  • compute_routed_output (601-615)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (5)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
  • forward_impl (451-559)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)
  • forward_impl (640-771)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
  • forward_impl (183-401)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
  • forward_impl (713-859)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • forward_impl (212-219)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • all_rank_num_tokens (167-168)
  • all_rank_num_tokens (171-174)
tensorrt_llm/_torch/speculative/interface.py (2)
  • all_rank_num_tokens (204-205)
  • all_rank_num_tokens (208-211)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (8)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)
  • forward_impl (640-771)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (1)
  • forward_impl (1363-1383)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
  • forward_impl (183-401)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
  • forward_impl (713-859)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • forward_impl (212-219)
tensorrt_llm/functional.py (2)
  • max (438-442)
  • max (3228-3250)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • all_rank_num_tokens (167-168)
  • all_rank_num_tokens (171-174)
tensorrt_llm/_torch/speculative/interface.py (2)
  • all_rank_num_tokens (204-205)
  • all_rank_num_tokens (208-211)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • all_rank_num_tokens (167-168)
  • all_rank_num_tokens (171-174)
tensorrt_llm/_torch/speculative/interface.py (2)
  • all_rank_num_tokens (204-205)
  • all_rank_num_tokens (208-211)
tensorrt_llm/_torch/utils.py (1)
  • Fp4QuantizedTensor (97-104)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_llama.py

330-330: Do not assign a lambda expression, use a def

Rewrite fn0 as a def

(E731)


331-332: Do not assign a lambda expression, use a def

Rewrite fn1 as a def

(E731)

🔇 Additional comments (21)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3)

88-89: Layer index plumbing — LGTM.

Forwarding layer_idx to the base MoE keeps the per-layer custom-op routing consistent.


91-93: Exposing unpadded_hidden_size — LGTM.

This is needed by kernels that internally pad hidden size.


469-471: DP padding derived from all_rank_num_tokens — LGTM.

Replacing all_rank_max_num_tokens param with local max(all_rank_num_tokens) is consistent with the new interface.

Also applies to: 472-475
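
In spirit, the derivation reduces to the following (illustrative helper, not the actual code):

def padded_rank_token_counts(all_rank_num_tokens):
    # Derive per-rank padded sizes from the raw counts instead of taking a
    # separate all_rank_max_num_tokens argument.
    max_tokens = max(all_rank_num_tokens)
    return [max_tokens] * len(all_rank_num_tokens)

assert padded_rank_token_counts([3, 5, 2]) == [5, 5, 5]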

tensorrt_llm/_torch/modules/fused_moe/interface.py (3)

40-71: Custom op wrapper aligns with compile-time path — LGTM.

Inputs/outputs match the forward wrapper contract; Optional x_sf and do_finalize list-wrapping are correct.


221-265: Final forward wrapper is sealed and routes through custom-op when compiling — LGTM.

Consistent, prevents subclasses from bypassing the unified API.


73-114: Ignore dtype consistency in the fake non-finalize path: no real fused MoE kernel supports do_finalize=False.
All existing backends (TRT-LLM, Triton, DeepGEMM, Cutlass) assert that do_finalize is true, so the non-finalize outputs are never consumed.

Likely an incorrect or invalid review comment.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (2)

1290-1291: Propagate layer_idx to base — LGTM.

Keeps per-layer custom-op wiring consistent with other backends.


1363-1383: Forward interface rename and DP padding assertion — LGTM.

Triton path rejects use_dp_padding=True explicitly, consistent with backend limitations.

tensorrt_llm/_torch/models/modeling_llama.py (2)

311-319: MoE callsite aligns with new API — LGTM.

Dropped all_rank_max_num_tokens and wired do_finalize via cutlass_min_latency_mode.


531-537: Feed-forward MoE call uses attn_metadata.all_rank_num_tokens — LGTM.

Matches new unified MoE interface.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)

85-86: Layer index propagated to base — LGTM.

Keeps per-layer registration consistent with other MoE backends.


380-381: Deriving all_rank_max_num_tokens locally — LGTM.

Correct for per-chunk dispatch decisions.


725-731: Local max computation and alltoall gating — LGTM.

Compute-and-use pattern matches the updated forward_impl contract.

tensorrt_llm/_torch/models/modeling_deepseekv3.py (8)

534-534: Signature change: Removed all_rank_max_num_tokens parameter

The removal of all_rank_max_num_tokens from the function signature is consistent with the broader API cleanup across the MoE infrastructure. It simplifies the interface; all in-tree callers are updated in the same change.


559-564: LGTM: Updated MoE forward signature

The simplified function signature for Deepseekv3MoE.forward properly removes all_rank_max_num_tokens while maintaining other parameters. The assertion for do_finalize correctly enforces that finalization is required when not using DP mode.


577-581: Consistent parameter removal in internal helper

The _compute_routed_output helper correctly passes only the required parameters without all_rank_max_num_tokens, maintaining consistency with the updated API.


798-807: Correct MoE parameter forwarding in DecoderLayer

The _run_MoE function properly forwards only all_rank_num_tokens from attn_metadata, maintaining consistency with the simplified MoE API.


1045-1051: Consistent MoE call in MTP layer

The MTP layer's MoE call correctly uses only all_rank_num_tokens parameter, maintaining consistency with the updated MoE API.


534-556: DP padding behavior is correct as implemented

fused_moe computes all_rank_max_num_tokens internally as max(all_rank_num_tokens) (e.g. fused_moe_wide_ep.py lines 380/725), and existing unit tests exercise the use_dp_padding=False path with expected outputs.


982-991: Verify removal of all_rank_max_num_tokens from MTP forward calls
No instances of all_rank_max_num_tokens being passed to DeepseekV3MTP.forward were found—manually confirm that all internal and external MTP layer invocations have been updated to the new signature.


546-554: Assert non-null all_rank_num_tokens for data-parallel MoE
Add assert all_rank_num_tokens is not None before the allgather/experts call in compute_routed_output to avoid passing None into backends that expect a list of per-rank token counts.
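
A minimal sketch of the suggested guard (exact placement inside compute_routed_output is left to the author):

# Guard before the DP allgather / experts call.
assert all_rank_num_tokens is not None, (
    "all_rank_num_tokens must be provided for data-parallel MoE")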

@liji-nv liji-nv force-pushed the dev-liji-moe-wrapper branch from d19d657 to 09c1e26 Compare August 28, 2025 11:34
@liji-nv liji-nv requested a review from a team as a code owner August 28, 2025 11:34
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
tensorrt_llm/_torch/speculative/interface.py (1)

129-131: Fix trailing comma: default becomes a tuple instead of SpeculativeDecodingMode.

The trailing comma turns the default into (SpeculativeDecodingMode.NONE,), breaking type checks and comparisons.

Apply:

-    spec_dec_mode: SpeculativeDecodingMode = SpeculativeDecodingMode.NONE,
+    spec_dec_mode: SpeculativeDecodingMode = SpeculativeDecodingMode.NONE
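
A tiny standalone illustration of the pitfall (made-up names, not project code):

from dataclasses import dataclass
from enum import Enum, auto

class Mode(Enum):
    NONE = auto()

@dataclass
class Meta:
    buggy: object = Mode.NONE,   # trailing comma: the default is the tuple (Mode.NONE,)
    fixed: Mode = Mode.NONE

m = Meta()
assert m.buggy == (Mode.NONE,)   # comparisons against Mode.NONE now fail
assert m.fixed is Mode.NONE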
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)

208-213: PEP 604 unions ('|') break Python 3.8 compatibility.

Project targets Python 3.8+; use typing.Optional instead.

Apply:

-    def apply_linear(
-                     input,
-                     bias,
-                     lora_params: Optional[dict] | None = None,
-                     layer_idx: Optional[int] | None = None):
+    def apply_linear(
+                     input,
+                     bias,
+                     lora_params: Optional[dict] = None,
+                     layer_idx: Optional[int] = None):

558-566: Typing: use List[int] for Python 3.8.

Built-in generics (list[int]) require 3.9+. Switch to typing.List.

Apply:

-        all_rank_num_tokens: Optional[list[int]] = None,
+        all_rank_num_tokens: Optional[List[int]] = None,
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

380-386: Harden forward_chunk inputs and simplify dtype selection.

Add explicit None-check for all_rank_num_tokens and avoid no-op assignment; provide fallback dtype for Fp4 path.

Apply:

-        all_rank_max_num_tokens = max(all_rank_num_tokens)
-        if isinstance(x, Fp4QuantizedTensor):
-            assert output_dtype is not None
-            output_dtype = output_dtype
-        else:
-            output_dtype = x.dtype
+        assert all_rank_num_tokens is not None, "all_rank_num_tokens must be provided"
+        all_rank_max_num_tokens = max(all_rank_num_tokens)
+        if isinstance(x, Fp4QuantizedTensor):
+            # Fallback to module dtype when caller doesn't pass output_dtype
+            output_dtype = output_dtype or self.dtype
+        else:
+            output_dtype = x.dtype
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)

1-1: Add NVIDIA copyright header (2025).

Required by repo guidelines for all source files.

+ # Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
 import os

118-129: Chunking stream/event allocation condition can leave None and crash later.

aux_stream/event_dict are created only when self.moe_max_num_tokens < moe_max_num_tokens, but forward_impl uses them whenever num_chunks > 1. If num_rows > self.moe_max_num_tokens while the two values are equal (the default), event_dict is None and access at Line 506 will fail.

Initialize lazily in forward_impl when chunking is actually needed:

@@
-        # in case of num_rows is larger than max_chunk_size, we need to split the input into multiple chunks
-        num_chunks = (num_rows + self.moe_max_num_tokens -
-                      1) // self.moe_max_num_tokens
+        # in case of num_rows is larger than max_chunk_size, we need to split the input into multiple chunks
+        num_chunks = (num_rows + self.moe_max_num_tokens - 1) // self.moe_max_num_tokens
+
+        # Lazily allocate stream/events if chunking is required.
+        if num_chunks > 1 and (self.aux_stream is None or self.event_dict is None):
+            self.aux_stream = torch.cuda.Stream()
+            self.event_dict = {
+                EventType.Main: torch.cuda.Event(),
+                EventType.MoeChunkingOverlap: torch.cuda.Event(),
+            }
♻️ Duplicate comments (2)
tensorrt_llm/_torch/models/modeling_llama.py (1)

330-332: Replace assigned lambdas with defs (Ruff E731).

Apply:

-        fn0 = lambda: self.shared_expert(hidden_states)
-        fn1 = lambda: self.compute_routed_output(
-            hidden_states, all_rank_num_tokens, cutlass_min_latency_mode)
+        def fn0():
+            return self.shared_expert(hidden_states)
+
+        def fn1():
+            return self.compute_routed_output(
+                hidden_states, all_rank_num_tokens, cutlass_min_latency_mode
+            )
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)

1-1: Add NVIDIA copyright header (2025).

This file is missing the required header. (This was flagged earlier; repeating here since it’s still absent.)

+ # Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
 import weakref
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)

108-113: Fake op: make first output dtype configurable instead of hardcoded BF16.

Using x.dtype (or dt) avoids dtype mismatches in ahead-of-time shape/dtype propagation.

-            x.new_empty((max_num_padded_tokens, hidden_size),
-                        dtype=torch.bfloat16),
+            x.new_empty((max_num_padded_tokens, hidden_size), dtype=dt),
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d19d657 and 09c1e26.

📒 Files selected for processing (16)
  • tensorrt_llm/_torch/attention_backend/interface.py (0 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_gpt_oss.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_mixtral.py (0 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py (0 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen_moe.py (0 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/interface.py (5 hunks)
  • tensorrt_llm/_torch/speculative/interface.py (1 hunks)
  • tensorrt_llm/_torch/speculative/mtp.py (0 hunks)
  • tests/unittest/_torch/modules/test_fused_moe.py (0 hunks)
💤 Files with no reviewable changes (6)
  • tests/unittest/_torch/modules/test_fused_moe.py
  • tensorrt_llm/_torch/models/modeling_qwen_moe.py
  • tensorrt_llm/_torch/models/modeling_mixtral.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • tensorrt_llm/_torch/attention_backend/interface.py
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tensorrt_llm/_torch/models/modeling_gpt_oss.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports: import the subpackage/module, not the symbol (from package.subpackage import foo; foo.SomeClass())
Naming: files snake_case; classes PascalCase; functions/methods snake_case; local variables snake_case (k_ prefix if starting with a number); globals G_ + UPPER_SNAKE_CASE; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; reserve comments for function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; inline docstrings for attributes/variables are allowed
Avoid reflection when straightforward code suffices (e.g., prefer explicit parameters over dict(**locals()))
Use narrow except clauses (e.g., catch FileNotFoundError instead of bare except)
For duck-typing try/except, keep try body minimal and use else for the main logic

Files:

  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header with current year to all source files

Files:

  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
🧠 Learnings (7)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/interface.py
📚 Learning: 2025-08-28T08:07:44.885Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-28T08:07:44.885Z
Learning: Applies to **/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py} : Prepend NVIDIA copyright header with current year to all source files

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/interface.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/interface.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
🧬 Code graph analysis (6)
tensorrt_llm/_torch/models/modeling_llama.py (1)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (7)
  • forward (151-168)
  • forward (392-397)
  • forward (558-600)
  • forward (748-788)
  • forward (982-1066)
  • forward (1099-1128)
  • compute_routed_output (533-556)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
  • do_finalize (246-247)
tensorrt_llm/_torch/modules/fused_moe/interface.py (3)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
  • calculate_tile_tokens_dim (16-26)
tensorrt_llm/_torch/utils.py (5)
  • Fp4QuantizedTensor (97-104)
  • get_model_extra_attrs (52-53)
  • is_torch_compiling (37-39)
  • _ (190-196)
  • shape (103-104)
tensorrt_llm/_torch/modules/fused_moe/routing.py (2)
  • BaseMoeRoutingMethod (158-181)
  • DeepSeekV3MoeRoutingMethod (213-223)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (5)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
  • forward_impl (451-559)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)
  • forward_impl (640-771)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
  • forward_impl (183-401)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
  • forward_impl (713-859)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • forward_impl (212-219)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)
  • forward_impl (640-771)
tensorrt_llm/functional.py (2)
  • max (438-442)
  • max (3228-3250)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
  • forward_impl (451-559)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)
  • forward_impl (640-771)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (1)
  • forward_impl (1363-1383)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
  • forward_impl (183-401)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • forward_impl (212-219)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_llama.py

330-330: Do not assign a lambda expression, use a def

Rewrite fn0 as a def

(E731)


331-332: Do not assign a lambda expression, use a def

Rewrite fn1 as a def

(E731)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (16)
tensorrt_llm/_torch/speculative/interface.py (1)

144-145: No stale getters/setters remain for all_rank_num_tokens. Converting it to a plain field is safe: no `@property` or setter for all_rank_num_tokens exists anywhere in the repo.

tensorrt_llm/_torch/models/modeling_llama.py (2)

314-319: MoE call updated to new interface — OK.

Dropping all_rank_max_num_tokens and passing do_finalize/use_dp_padding aligns with the new MoE API.


531-537: Verify attn_metadata.all_rank_num_tokens is always initialized in DP flows
ModelEngine must call _get_all_rank_num_tokens/_get_padding_params and assign attn_metadata.all_rank_num_tokens before feed_forward; missing this will trigger runtime asserts in MoE backends.

tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)

533-555: MoE routed-output path updated — OK.

Uses do_finalize + all_rank_num_tokens only; matches refactored fused MoE interfaces.


798-807: _run_MoE callsite aligned — OK.

all_rank_num_tokens passed; all_rank_max_num_tokens removed.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (2)

1290-1291: Propagate layer_idx to base MoE — OK.

Keeps per-layer context consistent across backends.


1363-1383: forward_impl added and constrained — OK.

Asserts on do_finalize/use_dp_padding align with Triton backend limitations; reduce-scatter wrapper preserved.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)

85-86: Pass layer_idx to base MoE — OK.


713-735: forward_impl API drop of all_rank_max_num_tokens — OK.

Local derivation via max(all_rank_num_tokens) keeps behavior while shrinking public surface.


861-871: Alltoall prepare path derives with all_rank_max_num_tokens — OK.

Matches new internal contract; no external max needed.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (5)

88-93: Good: layer_idx propagation and unpadded_hidden_size tracked.

Forwarding layer_idx into the base and preserving unpadded_hidden_size before padding looks correct and future‑proofs finalize kernels.


469-471: Confirm chunking threshold difference vs DeepGEMM.

This backend chunks whenever num_rows > moe_max_num_tokens, while DeepGEMM waits until num_rows > 2× moe_max_num_tokens so its two workspaces can overlap. Please confirm this divergence is intentional (performance tradeoff).
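
To make the thresholds concrete, a sketch of the two policies as described above (illustrative numbers only):

def cutlass_num_chunks(num_rows: int, moe_max_num_tokens: int) -> int:
    # Chunks as soon as num_rows exceeds the per-chunk budget.
    return (num_rows + moe_max_num_tokens - 1) // moe_max_num_tokens

def deepgemm_should_chunk(num_rows: int, moe_max_num_tokens: int) -> bool:
    # Waits for more than 2x the budget so two workspaces can overlap.
    return num_rows > 2 * moe_max_num_tokens

assert cutlass_num_chunks(9000, 8192) == 2          # splits right past the budget
assert deepgemm_should_chunk(9000, 8192) is False   # still a single pass here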


472-475: Good: DP padding derived from runtime max.

Replacing all_rank_max_num_tokens with [max(...) ]*N is correct and keeps callsites simple.


425-429: Unwrap custom-op return consistently.

Unpacking the first tensor aligns with the “single-tensor except min-latency” contract.


556-559: Good: trim to local rank tokens after collective.

Slicing back to all_rank_num_tokens[tp_rank] matches DP semantics.

tensorrt_llm/_torch/modules/fused_moe/interface.py (1)

221-265: Forward wrapper shape path looks consistent with custom-op contract.

Routing through moe_custom_op only under torch.compile and registered layers is clean; fallback to forward_impl otherwise is appropriate.

@liji-nv liji-nv force-pushed the dev-liji-moe-wrapper branch from 09c1e26 to 9cb9d21 Compare August 28, 2025 12:01
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (4)
tensorrt_llm/_torch/models/modeling_llama.py (1)

330-333: Replace assigned lambdas with defs (Ruff E731).

Keeps style consistent and satisfies linter.

Apply:

-        fn0 = lambda: self.shared_expert(hidden_states)
-        fn1 = lambda: self.compute_routed_output(
-            hidden_states, all_rank_num_tokens, cutlass_min_latency_mode)
+        def fn0():
+            return self.shared_expert(hidden_states)
+
+        def fn1():
+            return self.compute_routed_output(
+                hidden_states, all_rank_num_tokens, cutlass_min_latency_mode
+            )
tensorrt_llm/_torch/modules/fused_moe/interface.py (3)

1-1: Add NVIDIA copyright header (2025).

Required by repo guidelines for source files.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

26-37: Replace asserts with explicit exceptions in extract_extra_attrs.

Prevents failures being stripped under -O and improves debuggability.

Apply:

-def extract_extra_attrs(layer_idx: str):
-    extra_attrs = get_model_extra_attrs()
-    assert extra_attrs is not None, "Model extra attrs are not set"
-
-    moe_layers = extra_attrs.get("moe_layers", None)
-    assert moe_layers is not None, "No MoE layers registered"
-    moe_layer_ref = moe_layers.get(layer_idx)
-    assert moe_layer_ref is not None, f"Cannot find MoE layer for layer_idx={layer_idx}"
-    moe_layer = moe_layer_ref() if callable(moe_layer_ref) else None
-    assert moe_layer is not None, f"MoE layer for layer_idx={layer_idx!r} is no longer alive"
-
-    return moe_layer
+def extract_extra_attrs(layer_idx: str):
+    extra_attrs = get_model_extra_attrs()
+    if extra_attrs is None:
+        raise RuntimeError("Model extra attrs are not set")
+    moe_layers = extra_attrs.get("moe_layers")
+    if moe_layers is None:
+        raise KeyError("No MoE layers registered")
+    moe_layer_ref = moe_layers.get(layer_idx)
+    if moe_layer_ref is None:
+        raise KeyError(f"Cannot find MoE layer for layer_idx={layer_idx!r}; available={list(moe_layers.keys())}")
+    moe_layer = moe_layer_ref() if callable(moe_layer_ref) else None
+    if moe_layer is None:
+        raise ReferenceError(f"MoE layer for layer_idx={layer_idx!r} is no longer alive")
+    return moe_layer

165-174: Avoid assert for duplicate MoE layer registration.

Use a real exception to prevent silent overwrites.

Apply:

-            assert self.layer_idx_str not in model_config.extra_attrs["moe_layers"], \
-                f"Duplicate MoE layer for layer_idx={self.layer_idx_str}"
+            if self.layer_idx_str in model_config.extra_attrs["moe_layers"]:
+                raise KeyError(f"Duplicate MoE layer for layer_idx={self.layer_idx_str}")
🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

370-381: Guard forward_chunk against None all_rank_num_tokens.

Type hints allow Optional but code unconditionally calls max(); add a defensive check or narrow the type.

Apply:

-    def forward_chunk(
+    def forward_chunk(
         self,
         x: Union[torch.Tensor, Fp4QuantizedTensor],
         router_logits: torch.Tensor,
         use_all_to_all: bool,
         output_dtype: Optional[torch.dtype] = None,
-        all_rank_num_tokens: Optional[List[int]] = None,
+        all_rank_num_tokens: List[int],
         use_dp_padding: Optional[bool] = None,
         repeating_info: Tuple = (True, True),
     ) -> torch.Tensor:
-        all_rank_max_num_tokens = max(all_rank_num_tokens)
+        assert all_rank_num_tokens is not None, "all_rank_num_tokens is required in WideEPMoE"
+        all_rank_max_num_tokens = max(all_rank_num_tokens)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)

472-475: Ensure DP-padding path only executes with valid token sizes.

Minor safety guard so the padding path never calls max() on a None all_rank_num_tokens.

Apply:

-        if use_dp_padding:
-            all_rank_num_tokens_padded = [max(all_rank_num_tokens)
-                                          ] * len(all_rank_num_tokens)
+        if use_dp_padding:
+            assert all_rank_num_tokens is not None, "use_dp_padding requires all_rank_num_tokens"
+            all_rank_num_tokens_padded = [max(all_rank_num_tokens)] * len(all_rank_num_tokens)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 09c1e26 and 9cb9d21.

📒 Files selected for processing (16)
  • tensorrt_llm/_torch/attention_backend/interface.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_gpt_oss.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_mixtral.py (0 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py (0 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen_moe.py (0 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/interface.py (5 hunks)
  • tensorrt_llm/_torch/speculative/interface.py (1 hunks)
  • tensorrt_llm/_torch/speculative/mtp.py (0 hunks)
  • tests/unittest/_torch/modules/test_fused_moe.py (0 hunks)
💤 Files with no reviewable changes (5)
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py
  • tensorrt_llm/_torch/models/modeling_mixtral.py
  • tensorrt_llm/_torch/models/modeling_qwen_moe.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • tests/unittest/_torch/modules/test_fused_moe.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • tensorrt_llm/_torch/attention_backend/interface.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/_torch/models/modeling_gpt_oss.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports: import the subpackage/module, not the symbol (from package.subpackage import foo; foo.SomeClass())
Naming: files snake_case; classes PascalCase; functions/methods snake_case; local variables snake_case (k_ prefix if starting with a number); globals G_ + UPPER_SNAKE_CASE; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; reserve comments for function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; inline docstrings for attributes/variables are allowed
Avoid reflection when straightforward code suffices (e.g., prefer explicit parameters over dict(**locals()))
Use narrow except clauses (e.g., catch FileNotFoundError instead of bare except)
For duck-typing try/except, keep try body minimal and use else for the main logic

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/models/modeling_llama.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header with current year to all source files

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/models/modeling_llama.py
🧠 Learnings (7)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/models/modeling_llama.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_llama.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/interface.py
📚 Learning: 2025-08-28T08:07:44.885Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-28T08:07:44.885Z
Learning: Applies to **/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py} : Prepend NVIDIA copyright header with current year to all source files

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/interface.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/interface.py
🧬 Code graph analysis (4)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (6)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)
  • forward_impl (640-771)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (1)
  • forward_impl (1363-1383)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
  • forward_impl (183-401)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
  • forward_impl (713-859)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • forward_impl (212-219)
tensorrt_llm/functional.py (2)
  • max (438-442)
  • max (3228-3250)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (4)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
  • forward_impl (451-559)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)
  • forward_impl (640-771)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • forward_impl (212-219)
tensorrt_llm/_torch/utils.py (1)
  • Fp4QuantizedTensor (97-104)
tensorrt_llm/_torch/modules/fused_moe/interface.py (5)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
  • calculate_tile_tokens_dim (16-26)
tensorrt_llm/_torch/utils.py (5)
  • Fp4QuantizedTensor (97-104)
  • get_model_extra_attrs (52-53)
  • is_torch_compiling (37-39)
  • _ (190-196)
  • shape (103-104)
tensorrt_llm/_torch/modules/fused_moe/routing.py (2)
  • BaseMoeRoutingMethod (158-181)
  • DeepSeekV3MoeRoutingMethod (213-223)
tensorrt_llm/_torch/modules/attention.py (3)
  • extract_extra_attrs (30-67)
  • forward_impl (394-450)
  • forward_impl (926-1020)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
  • forward_impl (183-401)
tensorrt_llm/_torch/models/modeling_llama.py (3)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (7)
  • forward (151-168)
  • forward (392-397)
  • forward (558-600)
  • forward (748-788)
  • forward (982-1066)
  • forward (1099-1128)
  • compute_routed_output (533-556)
tensorrt_llm/_torch/models/modeling_qwen3_moe.py (4)
  • forward (55-58)
  • forward (118-157)
  • forward (204-297)
  • forward (344-370)
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (1)
  • compute_routed_output (601-615)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_llama.py

330-330: Do not assign a lambda expression, use a def

Rewrite fn0 as a def

(E731)


331-332: Do not assign a lambda expression, use a def

Rewrite fn1 as a def

(E731)
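
For reference, a minimal illustration of the rewrite E731 asks for; the body below is a placeholder, since the real fn0/fn1 in modeling_llama.py wrap MoE calls:

```python
# Flagged form (Ruff E731):
#   fn0 = lambda x: x * 2
# Preferred form: a named def with the same behavior.
def fn0(x):
    return x * 2
```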

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (9)
tensorrt_llm/_torch/models/modeling_llama.py (1)

314-318: MoE call aligned with new unified interface (no all_rank_max_num_tokens).

Correctly routes through experts with do_finalize=not cutlass_min_latency_mode and DP padding disabled.
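
For context, a hedged sketch of what that call site implies; the keyword names below come from the review text, not verbatim from modeling_llama.py:

```python
def compute_routed_output(experts, hidden_states, router_logits,
                          cutlass_min_latency_mode, all_rank_num_tokens):
    """Illustrative only: shows the reviewed keyword arguments, not the real signature."""
    return experts(
        hidden_states,
        router_logits,
        do_finalize=not cutlass_min_latency_mode,  # defer finalize in min-latency mode
        all_rank_num_tokens=all_rank_num_tokens,   # per-rank counts; no global max argument
        use_dp_padding=False,                      # DP padding disabled at this call site
    )
```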

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)

85-86: Layer index propagation added.

Forwarding layer_idx to base MoE is correct for custom-op routing.


712-721: Compute max tokens once and reuse.

Deriving all_rank_max_num_tokens here is fine; downstream usage is consistent.
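
A minimal sketch of the derivation being approved here, assuming all_rank_num_tokens is a plain list of per-rank token counts:

```python
all_rank_num_tokens = [7, 12, 9, 12]  # example per-rank token counts, one entry per DP rank
all_rank_max_num_tokens = max(all_rank_num_tokens)
# Each rank can pad its local batch up to the shared maximum so collectives stay aligned.
pad_per_rank = [all_rank_max_num_tokens - n for n in all_rank_num_tokens]
print(all_rank_max_num_tokens, pad_per_rank)  # -> 12 [5, 0, 3, 0]
```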

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)

88-89: Layer index propagation added.

Passing layer_idx to base MoE enables per-layer registration for the custom op.
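
A rough sketch of the per-layer registration idea; class and attribute names are hypothetical stand-ins for the real classes in fused_moe/interface.py and fused_moe_cutlass.py:

```python
import torch

class MoEBase(torch.nn.Module):
    """Illustrative base: records each layer so a custom op can look it up later."""
    _layers = {}

    def __init__(self, layer_idx):
        super().__init__()
        self.layer_idx = layer_idx
        MoEBase._layers[layer_idx] = self  # per-layer registry keyed by index

class CutlassFusedMoE(MoEBase):
    def __init__(self, layer_idx):
        super().__init__(layer_idx=layer_idx)  # forward layer_idx to the base MoE
```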


451-459: forward_impl rename and interface sync look good.

Matches unified MoE.forward wrapper contract.

tensorrt_llm/_torch/modules/fused_moe/interface.py (3)

41-71: Custom op wrapper correctly delegates to forward_impl.

Handles FP4 wrapping and finalized/non-finalized outputs as expected.


73-114: Fake path handles non-DP and DeepSeek dtype nuances.

Shape/dtype inference logic looks sound and matches runtime paths.
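
A compact sketch of the wrapper-plus-fake pattern these two comments cover, using PyTorch's torch.library.custom_op; the op name and registry are hypothetical, and the real fake path additionally handles FP4 wrapping and the DeepSeek dtype cases:

```python
import torch

_layers = {}  # hypothetical registry: layer_idx -> MoE module (see the sketch above)

@torch.library.custom_op("example::moe_forward", mutates_args=())
def moe_forward(x: torch.Tensor, layer_idx: int) -> torch.Tensor:
    # Real path: delegate to the backend implementation registered for this layer.
    return _layers[layer_idx].forward_impl(x)

@moe_forward.register_fake
def _(x: torch.Tensor, layer_idx: int) -> torch.Tensor:
    # Fake (meta) path: produce only shape/dtype; no kernel runs while tracing.
    return torch.empty_like(x)
```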


221-265: Final forward wrapper is correct.

Routes through moe_custom_op under torch.compile; otherwise calls forward_impl.
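
Continuing the sketches above, the public forward only decides which path to take; torch.compiler.is_compiling() and the example op stand in for the real is_torch_compiling helper and the trtllm op namespace:

```python
import torch

def moe_forward_dispatch(module, x):
    """Illustrative dispatch: one opaque custom op under torch.compile, a direct call otherwise."""
    if torch.compiler.is_compiling():
        # torch.compile then sees the whole MoE as a single node in the captured graph.
        return torch.ops.example.moe_forward(x, module.layer_idx)
    return module.forward_impl(x)
```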

tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

534-555: MoE call updated to unified interface.

Dropping all_rank_max_num_tokens and passing output_dtype matches backend expectations.
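
And a hedged sketch of the corresponding DeepSeek-V3 call shape; keyword names follow the review text rather than the exact compute_routed_output signature:

```python
def compute_routed_output(experts, hidden_states, router_logits,
                          all_rank_num_tokens, use_dp_padding):
    """Illustrative only: output_dtype is passed explicitly; the max token count is derived internally."""
    return experts(
        hidden_states,
        router_logits,
        output_dtype=hidden_states.dtype,         # explicit dtype for the routed output
        all_rank_num_tokens=all_rank_num_tokens,  # per-rank counts only
        use_dp_padding=use_dp_padding,
    )
```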

@liji-nv
Collaborator Author

liji-nv commented Sep 4, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17666 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17666 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13280 completed with status: 'FAILURE'

@liji-nv
Collaborator Author

liji-nv commented Sep 5, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #17745 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17745 [ run ] completed with state DISABLED
L0 testing is limited to prioritized users. User liji-nv is not in the prioritized list. L0 testing cannot be triggered.

@liji-nv liji-nv force-pushed the dev-liji-moe-wrapper branch from 68a561e to 3ae903d on September 5, 2025 at 10:59
@liji-nv
Collaborator Author

liji-nv commented Sep 5, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17781 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17781 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13312 completed with status: 'FAILURE'

Member

@yizhang-nv yizhang-nv left a comment

LGTM

@liji-nv liji-nv force-pushed the dev-liji-moe-wrapper branch from 3ae903d to eb1d305 on September 8, 2025 at 06:23
@liji-nv
Collaborator Author

liji-nv commented Sep 8, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17990 [ run ] triggered by Bot

@liji-nv liji-nv enabled auto-merge (squash) September 8, 2025 09:37
@tensorrt-cicd
Collaborator

PR_Github #17990 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13484 completed with status: 'FAILURE'

@liji-nv liji-nv force-pushed the dev-liji-moe-wrapper branch from eb1d305 to ee77b50 on September 8, 2025 at 13:19
* Let all MoE backends go through the same interface
* MoE is wrapped with a custom op to improve full-graph torch.compile compatibility

Signed-off-by: Jin Li <[email protected]>
@liji-nv liji-nv force-pushed the dev-liji-moe-wrapper branch from ee77b50 to b0dcd1e on September 8, 2025 at 13:19
@liji-nv
Collaborator Author

liji-nv commented Sep 8, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #18052 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #18052 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13528 completed with status: 'FAILURE'

@liji-nv
Collaborator Author

liji-nv commented Sep 9, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #18134 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #18134 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13590 completed with status: 'SUCCESS'

@liji-nv liji-nv merged commit d49374b into NVIDIA:main Sep 9, 2025
5 checks passed
Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025