[Refactor] Adapt deepseek-v3.2 to vllm 0.11.0 #3432
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review

This pull request refactors the deepseek-v3.2 model to adapt to vllm 0.11.0, removing obsolete patches and changing the mechanism for detecting sparse attention. The renaming of use_sfa to use_sparse improves code clarity. While the refactoring is extensive and generally well executed, I've identified several critical issues where the old configuration access pattern was not fully updated. These will lead to AttributeError exceptions at runtime and need to be fixed.
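To make the failure mode concrete, here is a minimal sketch (the SparseFlags container is hypothetical; only the use_sfa/use_sparse names come from this PR):

class SparseFlags:
    """Hypothetical stand-in for the refactored config object."""

    def __init__(self, use_sparse: bool) -> None:
        # Renamed in this PR: use_sfa -> use_sparse.
        self.use_sparse = use_sparse


flags = SparseFlags(use_sparse=True)
print(flags.use_sparse)  # updated access pattern: works
try:
    print(flags.use_sfa)  # stale access pattern left by the rename
except AttributeError as err:
    print(f"runtime failure the review flags: {err}")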
Test passes with deepseek-v3.2-w8a8. Test script:

import os
from vllm import LLM, SamplingParams

# Fetch the model from ModelScope rather than the Hugging Face Hub.
os.environ["VLLM_USE_MODELSCOPE"] = "True"
# Enlarge the HCCL communication buffer (value in MB) for 16-way tensor parallelism.
os.environ["HCCL_BUFFSIZE"] = "1024"


def main():
    prompts = [
        "窗前明月光,",
        "The president of the United States is Mr.",
        "The capital of France is",
        "The future of AI is",
        "感时花溅泪,",
        "家书抵万金啥意思?",
        "plz tell me a story: ",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=100,
                                     temperature=0.6,
                                     top_k=40,
                                     top_p=0.95)
    # Create an LLM.
    llm = LLM(model="/vllm-workspace/cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3___2-Exp-W8A8",
              tensor_parallel_size=16,
              enforce_eager=True,
              trust_remote_code=True,
              max_model_len=1024,
              # max_num_seqs=2,
              gpu_memory_utilization=0.9,
              quantization="ascend",
              additional_config={"ascend_scheduler_config": {"enabled": True}})
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()
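One note before the results: per the review above, this PR drops the old patch-based wiring and instead detects sparse attention directly from the model configuration. The diff is not shown in this excerpt, so the following is only a plausible minimal sketch; the index_topk field name is an assumption (a DeepSeek-V3.2-style sparse-indexer config key), not something confirmed by this PR.

def uses_sparse_attention(hf_config) -> bool:
    # Assumed heuristic: DeepSeek-V3.2 configs describe a sparse indexer;
    # treat the presence of its top-k field as the sparse-attention marker.
    return getattr(hf_config, "index_topk", None) is not None

Results:

root@cmq-docker:/vllm-workspace/vllm-ascend# python scripts/run_ds.py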
/vllm-workspace/vllm/vllm/__init__.py:7: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
from .version import __version__, __version_tuple__ # isort:skip
INFO 10-14 03:13:10 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 10-14 03:13:10 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 10-14 03:13:10 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 10-14 03:13:10 [__init__.py:207] Platform plugin ascend is activated
WARNING 10-14 03:13:13 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 10-14 03:13:13 [registry.py:582] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 10-14 03:13:13 [registry.py:582] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
WARNING 10-14 03:13:13 [registry.py:582] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
WARNING 10-14 03:13:13 [registry.py:582] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 10-14 03:13:13 [registry.py:582] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 10-14 03:13:13 [registry.py:582] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 10-14 03:13:13 [registry.py:582] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 10-14 03:13:13 [registry.py:582] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 10-14 03:13:13 [registry.py:582] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
INFO 10-14 03:13:13 [utils.py:233] non-default args: {'trust_remote_code': True, 'max_model_len': 1024, 'tensor_parallel_size': 16, 'disable_log_stats': True, 'quantization': 'ascend', 'enforce_eager': True, 'additional_config': {'ascend_scheduler_config': {'enabled': True}}, 'model': '/vllm-workspace/cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3___2-Exp-W8A8'}
INFO 10-14 03:13:13 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type deepseek_v32 to instantiate a model of type deepseek_v3. This is not supported for all configurations of models and can yield errors.
INFO 10-14 03:13:14 [config.py:388] Replacing legacy 'type' key with 'rope_type'
INFO 10-14 03:13:14 [model.py:547] Resolved architecture: DeepseekV32ForCausalLM
`torch_dtype` is deprecated! Use `dtype` instead!
INFO 10-14 03:13:14 [model.py:1510] Using max model len 1024
INFO 10-14 03:13:14 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 10-14 03:13:14 [config.py:422] Using custom fp8 kv-cache format for DeepSeekV3.2
INFO 10-14 03:13:14 [__init__.py:381] Cudagraph is disabled under eager mode
INFO 10-14 03:13:14 [platform.py:179] Compilation disabled, using eager mode by default
(EngineCore_DP0 pid=610071) INFO 10-14 03:13:15 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=610071) INFO 10-14 03:13:15 [core.py:77] Initializing a V1 LLM engine (vdev) with config: model='/vllm-workspace/cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3___2-Exp-W8A8', speculative_config=None, tokenizer='/vllm-workspace/cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3___2-Exp-W8A8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=16, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=ascend, enforce_eager=True, kv_cache_dtype=bfloat16, device_config=npu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/vllm-workspace/cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3___2-Exp-W8A8, enable_prefix_caching=True, chunked_prefill_enabled=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(EngineCore_DP0 pid=610071) WARNING 10-14 03:13:15 [multiproc_executor.py:720] Reducing Torch parallelism from 320 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=610071) INFO 10-14 03:13:15 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], buffer_handle=(16, 16777216, 10, 'psm_528dbe69'), local_subscribe_addr='ipc:///tmp/4b94b813-95b9-4404-8669-00a689a1219c', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_DP0 pid=610071) WARNING 10-14 03:13:15 [camem.py:64] Failed to import vllm_ascend_C:libvllm_ascend_kernels.so: cannot open shared object file: No such file or directory. Sleep mode will be disabled.
[the same camem.py warning is printed 16 times, once per worker]
(EngineCore_DP0 pid=610071) INFO 10-14 03:13:19 [worker_v1.py:102] custom_ops module loaded successfully. Custom operators like torch.ops.custom.npu_sparse_flash_attention are now available.
(EngineCore_DP0 pid=610071) INFO 10-14 03:13:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_249dc2ae'), local_subscribe_addr='ipc:///tmp/d86fd2ed-d279-40ae-b560-764e15f8399a', remote_subscribe_addr=None, remote_addr_ipv6=False)
[rank1]:[W1014 03:13:20.861616957 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[the custom_ops INFO, shm_broadcast handle, and Gloo hostname warning repeat for each of the 16 ranks]
(EngineCore_DP0 pid=610071) INFO 10-14 03:13:28 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], buffer_handle=(15, 4194304, 6, 'psm_226b3c13'), local_subscribe_addr='ipc:///tmp/0989bbaa-cf89-462b-95ee-1aebb1d18e45', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_DP0 pid=610071) INFO 10-14 03:13:28 [parallel_state.py:1208] rank 0 in world size 16 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
[analogous lines for ranks 1-15: each rank N is DP rank 0, PP rank 0, TP rank N, EP rank N]
(EngineCore_DP0 pid=610071) (Worker_TP0 pid=610078) INFO 10-14 03:13:28 [model_runner_v1.py:2567] Starting to load model /vllm-workspace/cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3___2-Exp-W8A8...
[the same "Starting to load model" line repeats for Worker_TP1 through Worker_TP15]
(EngineCore_DP0 pid=610071) (Worker_TP12 pid=610138) INFO 10-14 03:13:29 [utils.py:64] Using the vLLM Ascend Quantization now!
[the same quantization notice repeats for all 16 workers]
Loading safetensors checkpoint shards: 0% Completed | 0/163 [00:00<?, ?it/s]
[per-shard progress lines omitted; all 163 shards load at roughly 2.3-3.3 it/s]
(EngineCore_DP0 pid=610071) (Worker_TP13 pid=610143) INFO 10-14 03:14:34 [default_loader.py:267] Loading weights took 62.18 seconds
(EngineCore_DP0 pid=610071) (Worker_TP15 pid=610153) INFO 10-14 03:14:36 [default_loader.py:267] Loading weights took 63.84 seconds
(EngineCore_DP0 pid=610071) (Worker_TP13 pid=610143) INFO 10-14 03:14:36 [model_runner_v1.py:2593] Loading model weights took 42.5184 GB
Loading safetensors checkpoint shards: 100% Completed | 163/163 [01:04<00:00, 2.53it/s]
(EngineCore_DP0 pid=610071) (Worker_TP0 pid=610078) INFO 10-14 03:14:37 [default_loader.py:267] Loading weights took 64.60 seconds
(EngineCore_DP0 pid=610071) (Worker_TP0 pid=610078) INFO 10-14 03:14:39 [model_runner_v1.py:2593] Loading model weights took 42.5184 GB
[the remaining workers report weight loading in 62-70 seconds and 42.5184 GB each]
(EngineCore_DP0 pid=610071) (Worker_TP15 pid=610153) WARNING 10-14 03:14:44 [cudagraph_dispatcher.py:106] cudagraph dispatching keys are not initialized. No cudagraph will be used.
[the same cudagraph warning repeats for all 16 workers]
(EngineCore_DP0 pid=610071) (Worker_TP0 pid=610078) INFO 10-14 03:14:52 [worker_v1.py:238] Available memory: 9180513484, total memory: 65464696832
[the other 15 workers report 10.2-10.5 GB available against the same 65464696832-byte total]
(EngineCore_DP0 pid=610071) INFO 10-14 03:14:52 [kv_cache_utils.py:1087] GPU KV cache size: 65,280 tokens
(EngineCore_DP0 pid=610071) INFO 10-14 03:14:52 [kv_cache_utils.py:1091] Maximum concurrency for 1,024 tokens per request: 63.75x
[analogous KV-cache reports follow for the other workers: 72,576-74,368 tokens, 70.88x-72.62x concurrency]
(EngineCore_DP0 pid=610071) INFO 10-14 03:14:52 [core.py:210] init engine (profile, create kv cache, warmup model) took 9.04 seconds
(EngineCore_DP0 pid=610071) WARNING 10-14 03:14:53 [core.py:112] Using configured V1 scheduler class vllm_ascend.core.scheduler.AscendScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=610071) INFO 10-14 03:14:53 [__init__.py:381] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=610071) INFO 10-14 03:14:53 [platform.py:179] Compilation disabled, using eager mode by default
INFO 10-14 03:14:53 [llm.py:306] Supported_tasks: ['generate']
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 329.43it/s]
Processed prompts: 0%| | 0/7 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_DP0 pid=610071) (Worker_TP7 pid=610113) /vllm-workspace/vllm-ascend/vllm_ascend/attention/sfa_v1.py:374: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).
(EngineCore_DP0 pid=610071) (Worker_TP5 pid=610103) /vllm-workspace/vllm-ascend/vllm_ascend/attention/sfa_v1.py:374: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).
(EngineCore_DP0 pid=610071) (Worker_TP7 pid=610113) actual_query_lens = torch.tensor(query_lens[reqs_start:],
(EngineCore_DP0 pid=610071) (Worker_TP5 pid=610103) actual_query_lens = torch.tensor(query_lens[reqs_start:],
(EngineCore_DP0 pid=610071) (Worker_TP14 pid=610148) /vllm-workspace/vllm-ascend/vllm_ascend/attention/sfa_v1.py:374: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).
(EngineCore_DP0 pid=610071) (Worker_TP14 pid=610148) actual_query_lens = torch.tensor(query_lens[reqs_start:],
(EngineCore_DP0 pid=610071) (Worker_TP4 pid=610098) /vllm-workspace/vllm-ascend/vllm_ascend/attention/sfa_v1.py:374: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).
(EngineCore_DP0 pid=610071) (Worker_TP4 pid=610098) actual_query_lens = torch.tensor(query_lens[reqs_start:],
(EngineCore_DP0 pid=610071) (Worker_TP6 pid=610108) /vllm-workspace/vllm-ascend/vllm_ascend/attention/sfa_v1.py:374: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).
(EngineCore_DP0 pid=610071) (Worker_TP6 pid=610108) actual_query_lens = torch.tensor(query_lens[reqs_start:],
(EngineCore_DP0 pid=610071) (Worker_TP3 pid=610093) /vllm-workspace/vllm-ascend/vllm_ascend/attention/sfa_v1.py:374: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).
(EngineCore_DP0 pid=610071) (Worker_TP3 pid=610093) actual_query_lens = torch.tensor(query_lens[reqs_start:],
(EngineCore_DP0 pid=610071) (Worker_TP10 pid=610128) /vllm-workspace/vllm-ascend/vllm_ascend/attention/sfa_v1.py:374: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).
(EngineCore_DP0 pid=610071) (Worker_TP10 pid=610128) actual_query_lens = torch.tensor(query_lens[reqs_start:],
(EngineCore_DP0 pid=610071) (Worker_TP9 pid=610123) /vllm-workspace/vllm-ascend/vllm_ascend/attention/sfa_v1.py:374: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).
(EngineCore_DP0 pid=610071) (Worker_TP9 pid=610123) actual_query_lens = torch.tensor(query_lens[reqs_start:],
(EngineCore_DP0 pid=610071) (Worker_TP12 pid=610138) /vllm-workspace/vllm-ascend/vllm_ascend/attention/sfa_v1.py:374: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).
(EngineCore_DP0 pid=610071) (Worker_TP12 pid=610138) actual_query_lens = torch.tensor(query_lens[reqs_start:],
(EngineCore_DP0 pid=610071) (Worker_TP15 pid=610153) /vllm-workspace/vllm-ascend/vllm_ascend/attention/sfa_v1.py:374: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).
(EngineCore_DP0 pid=610071) (Worker_TP15 pid=610153) actual_query_lens = torch.tensor(query_lens[reqs_start:],
(EngineCore_DP0 pid=610071) (Worker_TP1 pid=610083) /vllm-workspace/vllm-ascend/vllm_ascend/attention/sfa_v1.py:374: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).
(EngineCore_DP0 pid=610071) (Worker_TP1 pid=610083) actual_query_lens = torch.tensor(query_lens[reqs_start:],
(EngineCore_DP0 pid=610071) (Worker_TP11 pid=610133) /vllm-workspace/vllm-ascend/vllm_ascend/attention/sfa_v1.py:374: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).
(EngineCore_DP0 pid=610071) (Worker_TP11 pid=610133) actual_query_lens = torch.tensor(query_lens[reqs_start:],
(EngineCore_DP0 pid=610071) (Worker_TP2 pid=610088) /vllm-workspace/vllm-ascend/vllm_ascend/attention/sfa_v1.py:374: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).
(EngineCore_DP0 pid=610071) (Worker_TP2 pid=610088) actual_query_lens = torch.tensor(query_lens[reqs_start:],
(EngineCore_DP0 pid=610071) (Worker_TP8 pid=610118) /vllm-workspace/vllm-ascend/vllm_ascend/attention/sfa_v1.py:374: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).
(EngineCore_DP0 pid=610071) (Worker_TP8 pid=610118) actual_query_lens = torch.tensor(query_lens[reqs_start:],
(EngineCore_DP0 pid=610071) (Worker_TP0 pid=610078) /vllm-workspace/vllm-ascend/vllm_ascend/attention/sfa_v1.py:374: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).
(EngineCore_DP0 pid=610071) (Worker_TP0 pid=610078) actual_query_lens = torch.tensor(query_lens[reqs_start:],
(EngineCore_DP0 pid=610071) (Worker_TP13 pid=610143) /vllm-workspace/vllm-ascend/vllm_ascend/attention/sfa_v1.py:374: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).
(EngineCore_DP0 pid=610071) (Worker_TP13 pid=610143) actual_query_lens = torch.tensor(query_lens[reqs_start:],
Processed prompts: 100%|████████████████████████████████████████████████████████████████| 7/7 [00:15<00:00, 2.25s/it, est. speed input: 3.31 toks/s, output: 44.52 toks/s]
Prompt: '窗前明月光,', Generated text: '疑是地上霜。\n\n举头望明月,低头思故乡。\n\n李白《静夜思》\n\n床前洒满皎洁的月光,诗人恍惚间以为是地上的秋霜。抬起头来仰望天上的明月,低下头来不由得思念起遥远的故乡。\n\n这首小诗,没有精工华美的辞藻,没有奇特新颖的立意,只是用叙述的语气,写远客思乡之情,然而它却意味深长,耐人寻绎,千百年来,如此广泛地吸引着'
Prompt: 'The president of the United States is Mr.', Generated text: ' Obama.\n\nMr. Obama is the president of the United States.\n\nThe president of the United States is Mr. Obama.\n\nMr. Obama is the president of the United States.\n\nThe president of the United States is Mr. Obama.\n\nMr. Obama is the president of the United States.\n\nThe president of the United States is Mr. Obama.\n\nMr. Obama is the president of the United States.\n\nThe president of the United States is Mr. Obama.\n\nMr. Obama is the president of the United States'
Prompt: 'The capital of France is', Generated text: ' Paris, one of the most important and influential cities in the world. Paris is located in the north-central part of the country, on the banks of the Seine River. It is not only the political center of France but also a global hub for art, fashion, gastronomy, and culture. The city is renowned for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, the Louvre Museum, and the Champs-Élysées. Paris has a rich history that dates'
Prompt: 'The future of AI is', Generated text: " here, and it's changing everything. From healthcare to transportation, AI is revolutionizing industries and transforming the way we live and work. But what does this mean for you? How can you stay ahead of the curve and thrive in this new era of intelligence? In this video, we'll explore the latest advancements in AI and what they mean for the future. We'll dive into the world of machine learning, natural language processing, and computer vision, and see how these technologies are being applied in real-world"
Prompt: '感时花溅泪,', Generated text: '恨别鸟惊心。 烽火连三月,家书抵万金。 白头搔更短,浑欲不胜簪。 4、望岳 杜甫 岱宗夫如何,齐鲁青未了。 造化钟神秀,阴阳割昏晓。 荡胸生层云,决眦入归鸟。 会当凌绝顶,一览众山小。 5、春望 杜甫 国破山河在,城春'
Prompt: '家书抵万金啥意思?', Generated text: '家书抵万金的意思及全诗出处和翻译赏析\n\n家书抵万金,这是一句流传千古的诗句,它表达了家书在人们心中的珍贵和重要性。那么,家书抵万金到底是什么意思呢?本文将从全诗出处、翻译赏析等方面进行探讨。\n\n一、全诗出处\n\n家书抵万金出自唐代诗人杜甫的《春望》。全诗如下:\n\n国破山河在,城春草木深。\n\n感时花溅泪'
Prompt: 'plz tell me a story: ', Generated text: "2nd grade reading level, about a girl who wants to be a scientist when she grows up and she goes to the moon\n\nOf course! Here is a story for you.\n\n### Luna's Big Dream\n\nLuna loved science. While other kids had posters of pop stars or cartoon characters, Luna had a giant poster of the solar system above her bed. Her favorite subject in school was when her class got to go to the library and learn about planets and stars.\n\nOne night, she pointed her"
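A side note on the repeated UserWarning above: PyTorch is flagging that sfa_v1.py:374 copy-constructs a tensor from another tensor. A minimal sketch of the rewrite the warning itself recommends, assuming query_lens is already a torch.Tensor (the values below are illustrative):

import torch

query_lens = torch.tensor([3, 5, 2, 7], dtype=torch.int32)  # illustrative values
reqs_start = 1

# This form triggers the UserWarning (copy-constructing from a tensor slice):
# actual_query_lens = torch.tensor(query_lens[reqs_start:])

# The form recommended by the warning message:
actual_query_lens = query_lens[reqs_start:].detach().clone()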
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: MengqingCao <[email protected]>
class AscendDeepseekV2Model(DeepseekV2Model, nn.Module):

    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        # Rewrite this init func mainly for removing cuda-hard code
We need to add this for vLLM 0.11.0 because of the CUDA hard code upstream, ptal @zzzzwwjj @wangxiyuan
Is there any vLLM PR to fix this hard code?
Yes, it was merged in vllm-project/vllm@302ef40
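For context, a minimal sketch of the fix direction, assuming the offender is a buffer allocated with a hard-coded device="cuda"; the helper name and shape are illustrative, not the exact upstream source:

import torch

from vllm.platforms import current_platform

def make_topk_indices_buffer(max_num_batched_tokens: int,
                             index_topk: int) -> torch.Tensor:
    # Allocate on whatever device the active platform reports
    # ("npu" on Ascend, "cuda" on NVIDIA) instead of hard-coding "cuda".
    return torch.zeros(max_num_batched_tokens,
                       index_topk,
                       dtype=torch.int32,
                       device=current_platform.device_type)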
Test passed with torchair enabled:

import os

from vllm import LLM, SamplingParams

os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["HCCL_BUFFSIZE"] = "1024"


def main():
    prompts = [
        "窗前明月光,",
        "The president of the United States is Mr.",
        "The capital of France is",
        "The future of AI is",
        "感时花溅泪,",
        "家书抵万金啥意思?",
        "plz tell me a story: ",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
    # Create an LLM.
    llm = LLM(model="/vllm-workspace/cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3___2-Exp-W8A8",
              tensor_parallel_size=16,
              trust_remote_code=True,
              max_model_len=1024,
              # max_num_seqs=2,
              gpu_memory_utilization=0.9,
              quantization="ascend",
              additional_config={
                  "ascend_scheduler_config": {"enabled": True},
                  "torchair_graph_config": {"enabled": True, "graph_batch_sizes": [16]},
              })
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()
What this PR does / why we need it?
Adapt deepseek-v3.2 to vLLM 0.11.0 and remove the patches that are no longer needed.
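As a sketch of what the alignment looks like, assuming vLLM 0.11.0 detects DeepSeek V3.2's sparse attention from the model config itself (e.g. the presence of an index_topk field) rather than through a patched flag; the helper below is illustrative, not the exact upstream code:

from transformers import PretrainedConfig

def uses_sparse_attention(config: PretrainedConfig) -> bool:
    # DeepSeek V3.2 configs carry an `index_topk` field for the sparse
    # indexer; treat its presence as the sparse-attention signal.
    return getattr(config, "index_topk", None) is not None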
The final goal is to remove all the patches and align the code architecture with vLLM, so the following work needs to be done in follow-up PRs.
TODO:
Does this PR introduce any user-facing change?
N/A
How was this patch tested?