draft AFD implementation for step3 #25162
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces a preliminary implementation of Step3 AFD (Attention FFN Disaggregation), a significant new feature for distributed inference. The changes are extensive, adding new configurations, communication connectors, and modifying core model execution logic. While the implementation lays a solid foundation, I've identified several critical issues related to correctness in distributed settings, maintainability, and configuration that should be addressed. These include potential race conditions or deadlocks due to incorrect distributed logic, CUDA graph caching issues, hardcoded values that limit portability, and significant code duplication. Addressing these points will improve the robustness and usability of this new feature.
```python
factors: list[Any] = [
    self.afd_connector,
    self.afd_role,
    self.num_afd_stages,
    self.num_attention_servers,
    self.num_ffn_servers,
]
```
The `compute_hash` method for `AFDConfig` omits several fields that could affect the computation graph, such as `afd_server_rank` and `afd_extra_config`. If these fields alter the model's execution path, their omission can lead to CUDA graph cache collisions, causing incorrect results or crashes. The `afd_extra_config` dictionary should be hashed in a deterministic way (e.g., by sorting keys) to ensure a stable hash.
Suggested change:

```python
factors: list[Any] = [
    self.afd_connector,
    self.afd_role,
    self.num_afd_stages,
    self.num_attention_servers,
    self.num_ffn_servers,
    self.afd_server_rank,
]
if self.afd_extra_config:
    factors.append(json.dumps(self.afd_extra_config, sort_keys=True))
```
```python
def recv_ffn_output(
    self,
    handle: Any,
) -> torch.Tensor:
```
The signature of `recv_ffn_output` in this abstract base class (`handle: Any`) does not match the implementations in `DummyAFDConnector` and `StepMeshAFDConnector`, which use `timeout_ms: Optional[float] = None`. This violates the Liskov Substitution Principle and will lead to runtime errors. The call sites in `step3_text.py` also call this method without arguments. The signatures across the base class and all implementations should be consistent.
Suggested change:

```python
def recv_ffn_output(
    self,
    timeout_ms: Optional[float] = None,
) -> torch.Tensor:
```
```python
self.scheduler_process = subprocess.Popen(
    [
        "python",
        "-c",
        "import torch; import fserver_lib as ps; import os; "
        'os.environ["DMLC_ROLE"] = "scheduler"; '
        'os.environ["DMLC_INTERFACE"] = "brainpf_bond0"; '
        "ps.init(); ps.stop()",
    ],
    env=os.environ.copy(),
)
```
The network interface `DMLC_INTERFACE` is hardcoded to `"brainpf_bond0"` for the scheduler subprocess. This is specific to a particular environment and will cause the scheduler to fail in other environments where this interface does not exist. This should be made configurable, or use a more general default like `"auto"`, which is used elsewhere in this file.
Suggested change:

```python
self.scheduler_process = subprocess.Popen(
    [
        "python",
        "-c",
        "import torch; import fserver_lib as ps; import os; "
        'os.environ["DMLC_ROLE"] = "scheduler"; '
        "ps.init(); ps.stop()",
    ],
    env=os.environ.copy(),
)
```
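Alternatively, a minimal sketch of making the interface explicitly configurable; the `afd_extra_config` key name `"dmlc_interface"` and the `self.config` attribute are illustrative assumptions, with `"auto"` as the fallback:

```python
# Hypothetical: take the interface from afd_extra_config, defaulting to "auto".
dmlc_interface = self.config.afd_extra_config.get("dmlc_interface", "auto")
self.scheduler_process = subprocess.Popen(
    [
        "python",
        "-c",
        "import torch; import fserver_lib as ps; import os; "
        'os.environ["DMLC_ROLE"] = "scheduler"; '
        f'os.environ["DMLC_INTERFACE"] = "{dmlc_interface}"; '
        "ps.init(); ps.stop()",
    ],
    env=os.environ.copy(),
)
```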
```python
def send_attn_output(
    self,
    hidden_states: torch.Tensor,
    metadata: AFDConnectorMetadata,
) -> Any:
```
The `send_attn_output` method is defined in the base class `AFDConnectorBase` to return a handle of type `Any`. However, this implementation does not return any value. This violates the interface contract and can lead to `NoneType` errors if the caller expects a handle. The method should return the `event` handle created by `ps.push_pull`.
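A minimal sketch of the fix; the `ps.push_pull` arguments shown are illustrative, only the returned `event` matters here:

```python
def send_attn_output(
    self,
    hidden_states: torch.Tensor,
    metadata: AFDConnectorMetadata,
) -> Any:
    # ... existing setup; the push_pull arguments below are illustrative.
    event = ps.push_pull(hidden_states, metadata)
    # Return the handle so callers can later wait on it (e.g. in recv_ffn_output).
    return event
```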
```python
stage_num_reqs = stage_end_req - stage_start_req
stage_num_actual_tokens = stage_end_token - stage_start_token
stage_max_seq_len = int(
    seq_lens_cpu[stage_start_req:stage_end_req].max())

stage_max_query_len = min(max_query_len, stage_num_actual_tokens)

if stage_num_actual_tokens == 0 or stage_num_reqs == 0:
    stage_metadatas.append(None)
    continue
```
There's a potential `RuntimeError` here. If a stage has no requests (`stage_num_reqs == 0`), `seq_lens_cpu[stage_start_req:stage_end_req]` will be an empty tensor, and calling `.max()` on an empty tensor will raise an error. The check for `stage_num_reqs == 0` should be performed before attempting to calculate `stage_max_seq_len`.
Suggested change:

```python
stage_num_reqs = stage_end_req - stage_start_req
stage_num_actual_tokens = stage_end_token - stage_start_token
if stage_num_actual_tokens == 0 or stage_num_reqs == 0:
    stage_metadatas.append(None)
    continue
stage_max_seq_len = int(
    seq_lens_cpu[stage_start_req:stage_end_req].max())
stage_max_query_len = min(max_query_len, stage_num_actual_tokens)
```
```python
tp_world_size = get_tensor_model_parallel_world_size()
if tp_world_size > 1:
    # Handle TP case: all-gather tensors from all TP ranks
    gathered_hidden_states = tensor_model_parallel_all_gather(
        hidden_states, dim=0)
    ffn_output = self.model.compute_ffn_output(current_layer_idx,
                                               gathered_hidden_states)

    # Extract the output corresponding to current rank
    start_idx = hidden_states.shape[
        0] * get_tensor_model_parallel_rank()
    end_idx = start_idx + hidden_states.shape[0]
    rank_ffn_output = ffn_output[start_idx:end_idx, :]
```
This logic for handling tensor parallelism appears to be incorrect, similar to the issue in `_execute_eager_mode`. The `tensor_model_parallel_all_gather` will unnecessarily replicate the `hidden_states` tensor across TP ranks, leading to incorrect computations. The FFN layers themselves should manage the sharding, so the input `hidden_states` should be passed directly to `self.model.compute_ffn_output`.
Suggested change:

```python
# The input hidden_states is replicated on all TP ranks.
# The FFN layers with tensor parallelism will handle sharding internally.
rank_ffn_output = self.model.compute_ffn_output(
    current_layer_idx, hidden_states)
```
```python
afd_metadata = forward_context.afd_metadata
if afd_metadata is not None:
    afd_stage_idx = afd_metadata.afd_stage_idx
    if afd_stage_idx < len(attn_metadata):
        attn_metadata = attn_metadata[afd_stage_idx]
    else:
        attn_metadata = None  # padding
```
The logic to extract `attn_metadata` based on `afd_metadata` is duplicated in at least five places in this file (here, in the other `forward` path, `maybe_save_kv_layer_to_connector`, `unified_attention`, and `unified_attention_with_output`). This much code duplication is a significant maintainability risk; a bug fix or logic change would need to be applied in all locations, which is error-prone. I recommend refactoring this logic into a helper function (for example, something like the sketch below) to centralize it and improve code clarity.
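A minimal sketch of such a helper, based on the snippet above (the function name `_select_stage_attn_metadata` is a suggestion, not something in the PR):

```python
def _select_stage_attn_metadata(forward_context, attn_metadata):
    """Return the attention metadata for the current AFD stage, or None for padding."""
    afd_metadata = forward_context.afd_metadata
    if afd_metadata is None:
        return attn_metadata
    afd_stage_idx = afd_metadata.afd_stage_idx
    if afd_stage_idx < len(attn_metadata):
        return attn_metadata[afd_stage_idx]
    return None  # padding stage with no metadata
```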
```python
layer_idx=-1,  # Extract from comm_id
stage_idx=-1,  # Extract from comm_id
```
Using hardcoded placeholder values (`-1`) for `layer_idx` and `stage_idx` is risky. The `TODO` comment indicates this is incomplete. If downstream code does not properly handle these negative indices, it could lead to indexing errors or silent correctness issues. This logic should be fully implemented to extract the correct indices from `comm_id` before this feature is considered complete.
```python
try:
    if len(self.events) > 0:
        event, metadata = self.events.popleft()
        ps.wait(event, timeout_ms=50000)
```
```python
if self.profiler:
    self.profiler.start()
for _ in range(1000):  # FIXME: hardcoded profiler iterations
```
```python
def loader() -> type[AFDConnectorBase]:
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```
Maybe this can directly use `vllm.utils.resolve_obj_by_qualname`.
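A minimal sketch of that alternative, assuming `resolve_obj_by_qualname` resolves a dotted `"module.ClassName"` string as it does in `vllm.utils`:

```python
from vllm.utils import resolve_obj_by_qualname

def loader() -> type[AFDConnectorBase]:
    # module_path and class_name are the same values used in the original loader.
    return resolve_obj_by_qualname(f"{module_path}.{class_name}")
```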
```python
slot_mapping=stage_slot_mapping,
use_cascade=False,
common_prefix_len=0,
scheduler_metadata=stage_scheduler_metadata,
```
If `aot_schedule` is false, `stage_scheduler_metadata` and `FlashAttentionMetadata` cannot be created.
Configs no longer live in `vllm/config/__init__.py`.
Documentation preview: https://vllm--25162.org.readthedocs.build/en/25162/
Hello, I'd like to ask: how does StepMesh implement CUDA Graph? During graph replay, Python code isn't executed, so how is the PushPullWorker thread in StepMesh launched?
Currently the StepMesh connector operations are not captured inside the CUDA graph; piecewise CUDA graph is used instead.
```python
        causal=causal)
    return attn_metadata

def _init_stage_buffers(self, vllm_config: VllmConfig,
```
Please see `vllm/v1/worker/gpu_model_runner.py`, lines 1388 to 1404, at commit `ae9d0e7`:

```python
if ubatch_slices is not None:
    common_attn_metadata_list = split_attn_metadata(
        ubatch_slices, common_attn_metadata
    )
    for ubid, common_attn_metadata in enumerate(
        common_attn_metadata_list
    ):
        attn_metadata_i = attn_group.get_metadata_builder(
            ubatch_id=ubid
        ).build(
            common_prefix_len=common_prefix_len,
            common_attn_metadata=common_attn_metadata,
        )
        for layer_name in kv_cache_group_spec.layer_names:
            assert type(attn_metadata) is list
            attn_metadata[ubid][layer_name] = attn_metadata_i
else:
```
1. Attn

```
vllm fserver /path/step3v -dp 8 --afd-config '{"afd_connector": "dummy", "afd_role": "attention", "afd_host": "127.0.0.0"}' --max-num-batched-tokens 384 --max-num-seqs 384 --compilation-config '{"cudagraph_capture_sizes": [1, 8]}'
```
Should this be `vllm serve`?
Purpose
Design doc: https://docs.google.com/document/d/1GS2g8df7sdPmDvysmsURXN7xDDwnJfM6ERkBryUiTEA/edit?tab=t.0#heading=h.g8s3tkkthjdk
Step Paper: https://arxiv.org/abs/2507.19427
This is the preliminary implementation of Step3 AFD. It currently only supports the StepMesh connector and the Step3 model. In the future, the community will help expand the connectors and add support for the DeepSeek V3 model, as mentioned in the RFC.
The current CUDA Graph implementation for AFD still requires optimization. At present, it involves intrusive modifications to each model and is not compatible with the existing CudaGraphWrapper implementation. Everyone is welcome to join the discussion on finding a more elegant solution.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.