[AFD] AFD implementation for dsv3 #3447
base: main
Conversation
Code Review
This pull request introduces a preliminary implementation of Attention/FFN Decoupling (AFD) for DeepSeek V3 models on Ascend NPUs. The changes are extensive: they add new FFN workers, model runners, and communication utilities, and modify the core model and forward context to support the decoupled architecture. While the implementation lays the groundwork for AFD, several critical issues need to be addressed, including hardcoded network configurations and a syntax error caused by duplicated fields. There are also high-severity issues around performance and best practices, such as an inefficient tensor-parallelism implementation and environment variables being set in code. These must be resolved to make the implementation robust, configurable, and performant.
ffn.py
Outdated
# TODO: remove hard code
init_method = 'tcp://127.0.0.1:29505'
The init_method for process group initialization is hardcoded with a local IP address and port; the same is true on lines 94 and 101. This makes the script inflexible and difficult to use in a real distributed environment. These values should be parameterized or read from a configuration file to allow for different network setups.
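A minimal sketch of one way to parameterize the endpoint; the environment-variable names and the helper below are hypothetical, not part of the PR:

import os

def resolve_init_method(default_addr: str = "127.0.0.1",
                        default_port: int = 29505) -> str:
    # Hypothetical AFD_MASTER_ADDR / AFD_MASTER_PORT variables; any other
    # config mechanism (CLI flags, a YAML file) would work equally well.
    addr = os.environ.get("AFD_MASTER_ADDR", default_addr)
    port = int(os.environ.get("AFD_MASTER_PORT", str(default_port)))
    return f"tcp://{addr}:{port}"

# e.g. init_process_group(init_method=resolve_init_method(), ...)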
new_default_group = init_process_group(
    init_method='tcp://127.0.0.1:29500',
    backend='gloo',
    rank=rank,
    world_size=world_size,
    group_name="new_hccl"
)
mm_features: Optional[list[MultiModalFeatureSpec]] = None
# for back-compatibility, will be removed in next major release
mm_kwargs: Optional[list[MultiModalKwargsItem]] = None
mm_positions: Optional[list[PlaceholderRange]] = None
mm_hashes: Optional[list[PlaceholderRange]] = None
ffn.py
Outdated
from vllm_ascend.distributed.afd_communicators import send_object, recv_object, FFNNeedForwardData

import os
os.environ["PYTORCH_NPU_ALLOC_CONF"] = "max_split_size_mb:256"
Setting environment variables within a library or script is generally considered bad practice as it can have unintended side effects on other parts of the application or other libraries. It's better to have the user set this environment variable before running the application. Please move this to the documentation or a startup script.
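One way to act on this suggestion while still catching misconfiguration early is a startup guard; this is an illustrative sketch, not the PR's code:

import os

REQUIRED_VAR = "PYTORCH_NPU_ALLOC_CONF"
if REQUIRED_VAR not in os.environ:
    # Illustrative guard: point the user at the launch-time setting
    # instead of mutating the process environment from library code.
    raise RuntimeError(
        f'{REQUIRED_VAR} is not set; export it before launch, e.g. '
        f'{REQUIRED_VAR}="max_split_size_mb:256" python ffn.py')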
if afd_metadata:
    # Padding for AFD
    num_input_tokens = num_input_tokens
    (num_pad_afd, afd_tokens_start_loc,
     afd_tokens_lens) = self.get_afd_padding(
         afd_metadata.afd_tokens_start_loc,
         afd_metadata.afd_tokens_lens)
    afd_metadata.afd_tokens_start_loc = afd_tokens_start_loc
    afd_metadata.afd_tokens_lens = afd_tokens_lens
    num_input_tokens += num_pad_afd
    num_tokens_across_dp = None
tp_world_size = get_tensor_model_parallel_world_size()
if tp_world_size > 1:
    # All-gather hidden states from all TP ranks
    gathered_hidden_states = tensor_model_parallel_all_gather(
        hidden_states, dim=0)
    ffn_output = self.model.compute_ffn_output(current_layer_idx,
                                               gathered_hidden_states)
    # Extract the output corresponding to current rank
    start_idx = hidden_states.shape[
        0] * get_tensor_model_parallel_rank()
    end_idx = start_idx + hidden_states.shape[0]
    rank_ffn_output = ffn_output[start_idx:end_idx, :]
else:
    # Single TP case
    rank_ffn_output = self.model.compute_ffn_output(
        current_layer_idx, hidden_states)

return rank_ffn_output
The logic for handling tensor parallelism in _execute_eager_mode seems inefficient. It performs an all_gather on the input hidden_states, then computes the FFN output (which likely involves an all_reduce in the final RowParallelLinear layer), and finally manually slices the output. This results in redundant communication (all_gather followed by all_reduce). A more efficient approach would be to use a reduce_scatter operation in the final layer of the FFN computation, which would directly produce the sliced output for each rank, avoiding the unnecessary all_gather and the manual slicing. Since this is a hot path, the inefficiency could significantly impact performance.
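A rough sketch of the reduce_scatter pattern the comment suggests, assuming each TP rank holds a partial FFN output for all gathered tokens; the function name and shapes are illustrative, not taken from the PR:

import torch
import torch.distributed as dist

def final_proj_reduce_scatter(partial_out: torch.Tensor,
                              tp_group: dist.ProcessGroup) -> torch.Tensor:
    # partial_out: [tp_world_size * tokens_per_rank, hidden] partial sums
    # from this rank's shard of the final projection. reduce_scatter sums
    # the partials across ranks and returns only this rank's token block,
    # replacing all_reduce plus manual slicing with a single collective.
    world_size = dist.get_world_size(tp_group)
    out = torch.empty(partial_out.shape[0] // world_size,
                      partial_out.shape[1],
                      dtype=partial_out.dtype,
                      device=partial_out.device)
    dist.reduce_scatter_tensor(out, partial_out.contiguous(), group=tp_group)
    return out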
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Force-pushed from 422cc62 to 1ef3c29
What this PR does / why we need it?
This PR corresponds to the RFC vllm-project#22799 and is a follow-up to vllm-project#25162 and Oliver-ss/vllm#2.
This is a preliminary implementation of DeepSeek V2 Lite AFD for Ascend. It currently supports only the P2P connector and the DeepSeek V2 Lite model.
Later, we are going to support the following features:
How was this patch tested?
Use the following scripts for testing:
online_attn.sh
ffn.sh