
Conversation

Contributor

@JC-ut0 JC-ut0 commented Oct 14, 2025

What this PR does / why we need it?

In memory of #2610: in the PD disaggregation scenario, the first token inferred after the D node receives the KV cache follows eager mode.

Fixes:
When running MTP torchair graph mode with prefilling-decoding disaggregation, if all requests processed by the D node were just transmitted from the P node, the torchair graph breaks.

Reason: during PD disaggregation, the P node transmits only the KV cache and the prompt to the D node, not the tokens it actually inferred (neither the main-model token nor the MTP tokens). The D node therefore treats such a request as one without MTP tokens (seq_len=1).
Upstream vLLM does not hit this graph-mode issue because its attention runs the decode phase with seq_len=1 for every request in the batch.
We hit it because torchair graph mode pads assuming 2 tokens per request. When a batch mixes seq_len=1 and seq_len=2 requests, padding is appended at the end; but if every request the D node receives has seq_len=1, the batch cannot be padded into a shape that satisfies the attention's FIA operator constraints.
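To make the padding arithmetic concrete, here is a minimal sketch with assumed numbers (the 4 request slots and the 2-tokens-per-request budget are illustrative, not taken from this PR's code):

```python
# A torchair graph captured for 4 request slots at 2 tokens per request
# (1 main-model token + 1 MTP token) expects exactly 8 input tokens.
NUM_REQ_SLOTS = 4
TOKENS_PER_REQ = 2

def pad_tokens_needed(seq_lens: list[int]) -> int:
    """Tokens that must be appended at the end to fill the captured shape."""
    return NUM_REQ_SLOTS * TOKENS_PER_REQ - sum(seq_lens)

# Mixed batch: one request arrived without its MTP token (seq_len=1),
# and a single pad token appended at the end is enough.
print(pad_tokens_needed([2, 2, 2, 1]))  # 1

# Worst case right after a PD transfer: every request is seq_len=1, so half
# of the graph's token budget would have to be padding -- a layout the FIA
# operator's constraints reject.
print(pad_tokens_needed([1, 1, 1, 1]))  # 4
```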

Solution:

  • The KV consumer adds extra torchair graph padding to avoid breaking the FIA graph constraints (the approach this PR implements; see the sketch below).
  • The KV producer transmits the correct tokens to the KV consumer, so our graph-mode constraints are never broken and all logic matches mixed PD deployment. Since we use the community scheduler, this modification requires patching the vLLM scheduler, but in theory performance should be better. (Maybe later.)
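A rough sketch of the implemented option's idea, with hypothetical names and policy (`expand_graph_batch_sizes` and the doubling rule are assumptions for illustration; the PR works on the `torchair_graph_batch_sizes` attribute):

```python
def expand_graph_batch_sizes(base_sizes: list[int],
                             is_kv_consumer: bool) -> list[int]:
    """Return the set of torchair graph batch sizes to capture.

    For a KV consumer, an entire batch may consist of seq_len=1 requests
    fresh from the P node, so extra sizes are captured to give such a
    batch a graph shape it can legally be padded into.
    """
    sizes = set(base_sizes)
    if is_kv_consumer:
        # Illustrative policy: also capture a graph large enough to absorb
        # the all-seq_len=1 case for each base size.
        sizes |= {s * 2 for s in base_sizes}
    return sorted(sizes)

print(expand_graph_batch_sizes([4, 8, 16], is_kv_consumer=True))
# [4, 8, 16, 32]
```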

Does this PR introduce any user-facing change?

How was this patch tested?


👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@JC-ut0 JC-ut0 force-pushed the mtp_torchair_pd_fix branch from 7e2a779 to 23231fe on October 14, 2025 09:39
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a fix for an MTP torchair graph mode issue in prefilling-decoding disaggregation scenarios. The change involves adding extra padding to the torchair graph batch sizes for KV consumers to avoid breaking FIA graph constraints. However, the implementation has a logical flaw where self.torchair_graph_batch_sizes is modified and then reused, leading to incorrect calculations. I've suggested a refactoring to correct this logic.
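For readers without the diff open, the flagged issue is the generic mutate-then-reuse pattern; a hypothetical reduction (class and attribute usage assumed, not the actual vllm-ascend code):

```python
class RunnerSketch:
    def __init__(self, sizes: list[int]) -> None:
        self.torchair_graph_batch_sizes = sizes

    def apply_consumer_padding_buggy(self) -> None:
        # The attribute is overwritten first...
        self.torchair_graph_batch_sizes = [
            s * 2 for s in self.torchair_graph_batch_sizes
        ]
        # ...so anything derived afterwards reads the already-doubled
        # sizes instead of the originals.
        self.max_batch_size = max(self.torchair_graph_batch_sizes)

    def apply_consumer_padding_fixed(self) -> None:
        # Read the original values once and derive everything from them.
        original = self.torchair_graph_batch_sizes
        self.max_batch_size = max(original)
        self.torchair_graph_batch_sizes = [s * 2 for s in original]
```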

@JC-ut0 JC-ut0 force-pushed the mtp_torchair_pd_fix branch from 23231fe to c0aab7f on October 14, 2025 09:54
@MengqingCao MengqingCao added the ready (read for review) and ready-for-test (start test by label for PR) labels on Oct 14, 2025
@JC-ut0 JC-ut0 force-pushed the mtp_torchair_pd_fix branch from c0aab7f to e8f85eb on October 15, 2025 06:13
Signed-off-by: xuyexiong <[email protected]>
@JC-ut0 JC-ut0 force-pushed the mtp_torchair_pd_fix branch from e8f85eb to cc90849 on October 15, 2025 06:23
Collaborator

@linfeng-yuan linfeng-yuan left a comment


LGTM. Please check the logic of max_num_seq updating here.

Currently, DeepSeek can also be run with aclgraph in full graph mode; I think you can submit another PR to update the MTP padding size for aclgraph as well (full graph only).

@yiz-liu yiz-liu merged commit b0ae203 into vllm-project:main Oct 16, 2025
17 checks passed
