Eval bug: [CUDA] MoE model (Qwen3-30B-A3B) loads to GPU but does not utilize CUDA for inference in build b5466

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080 SUPER, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from C:\Users\Jayden\Downloads\Compressed\llama-b5466-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Users\Jayden\Downloads\Compressed\llama-b5466-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Users\Jayden\Downloads\Compressed\llama-b5466-bin-win-cuda-12.4-x64\ggml-cpu-alderlake.dll
version: 5466 (9ecf3e6)
built with clang version 18.1.8 for x86_64-pc-windows-msvc

Operating systems

Windows

GGML backends

CUDA

Hardware

CPU: Intel(R) Core(TM) i7-14700KF
RAM: 64.0GB
GPU: NVIDIA GeForce RTX 4080 SUPER

Models

https://huggingface.co/Qwen/Qwen3-30B-A3B-GGUF
Qwen3-30B-A3B-Q4_K_M.gguf

Problem description & steps to reproduce

🧾 Problem Description

When running the MoE model Qwen3-30B-A3B-Q4_K_M.gguf with llama-server.exe from the llama-b5466-bin-win-cuda-12.4-x64 build on an NVIDIA RTX 4080 SUPER (Ada Lovelace architecture, compute capability 8.9), the following issues occur:

The model successfully loads onto the GPU (CUDA0) and allocates memory.
However, no actual CUDA kernel computation is triggered, leading to:
- Extremely slow inference
- Unstable or incomplete generation output
- No visible CUDA-related activity in logs (e.g., no ggml_cuda_assign_buffers, kernel launched, etc.)

In contrast, the same model runs correctly using the llama-b5333 build, with proper GPU acceleration and stable output.

This suggests a regression or compatibility issue in the newer b5466 build, especially with how it handles MoE models or GPU offloading logic.

📋 Steps to Reproduce

✅ Affected Version

llama-b5466-bin-win-cuda-12.4-x64

🔧 Command Used

llama-server.exe --model "C:\Users\Jayden\Downloads\Qwen3-30B-A3B-Q4_K_M.gguf" --n-gpu-layers 30 --threads 16 --repeat-penalty 1.3 --temp 0.7 --mirostat 2 --mlock

🧪 Reproduction Steps

1. Download and extract the llama-b5466-bin-win-cuda-12.4-x64 build.
2. Place the Qwen3-30B-A3B-Q4_K_M.gguf model file in a known location.
3. Run the above command in CMD/PowerShell.
4. Wait for the model to load (you'll see logs indicating successful loading to CUDA).
5. Send a chat completion request via the API endpoint:

POST http://127.0.0.1:8080/v1/chat/completions

With body:

{
  "model": "Qwen3-30B-A3B",
  "messages": [{"role": "user", "content": "Hello"}]
}

Observe:
- Very slow response time or timeout
- Incomplete or unstable output
- No evidence of CUDA kernel usage in logs
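
To put a number on "very slow", the request above can be timed with a short script. This is a minimal sketch using only the Python standard library. The `max_tokens` field is an added assumption so that both builds generate a bounded, comparable number of tokens, and the script assumes the server's OpenAI-compatible response carries a `usage.completion_tokens` field. The measured time is wall-clock time for the whole request, so it includes prompt processing as well as generation.

```python
import json
import time
import urllib.request

# Endpoint and model name are taken from the report above.
URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_request(prompt, max_tokens=128):
    """Build the chat completion request from the report.

    max_tokens is an assumption added here to bound the generation length.
    """
    body = {
        "model": "Qwen3-30B-A3B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(response, elapsed_s):
    """Completion tokens per wall-clock second, or None if not reported."""
    usage = response.get("usage") or {}
    n = usage.get("completion_tokens")
    if n is None or elapsed_s <= 0:
        return None
    return n / elapsed_s


def run_once(prompt="Hello", timeout_s=600):
    """Send one request to a running llama-server and time it."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_request(prompt), timeout=timeout_s) as resp:
        payload = json.load(resp)
    elapsed = time.perf_counter() - start
    return payload, elapsed, tokens_per_second(payload, elapsed)
```

Calling `run_once()` once against b5466 and once against b5333, with the same command line, gives two figures that can be quoted side by side in the report.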

✅ Control Test (Working Version)

Repeat the exact steps above using llama-b5333. You will observe:

- Fast inference
- Coherent output
- Logs clearly indicate CUDA kernel execution
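
Because the comparison rests on what the logs show, it may help to check GPU activity independently of llama-server's output. The sketch below polls `nvidia-smi` while a request is in flight. It assumes `nvidia-smi` is on the PATH and that the driver reports numeric values for both fields (some configurations print `[N/A]`, which this parser does not handle).

```python
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_sample(line):
    """Parse one 'utilization, memory' CSV line into (percent, MiB)."""
    util, mem = (field.strip() for field in line.split(","))
    return int(util), int(mem)


def sample_gpu():
    """Return one (utilization %, memory used MiB) tuple per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_sample(line) for line in out.splitlines() if line.strip()]


def watch(seconds=30, interval_s=1.0):
    """Print samples for the given duration; run this during inference."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        print(sample_gpu())
        time.sleep(interval_s)
```

If utilization stays near zero on b5466 during generation but rises on b5333, that would support the claim that no CUDA kernels are running. If it rises on both, the slowdown likely has another cause.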

⚠️ Observed Behavior in b5466

Despite seeing these log entries:

llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4080 SUPER) - 15035 MiB free
load_tensors: offloaded 30/49 layers to GPU

There is no further CUDA activity during inference.
No ggml_cuda_assign_buffers or kernel launched messages are shown.
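
The load-time lines are the only CUDA-related evidence in the log, so it is worth extracting exactly what they say. The following sketch parses the `load_tensors` lines quoted in this report and computes how much of the model weight ended up in CUDA buffers. The regular expressions are written against the log format shown here and are an assumption for any other build.

```python
import re

# Lines copied from the log in this report.
LOG = """\
load_tensors: offloaded 30/49 layers to GPU
load_tensors:        CUDA0 model buffer size = 10750.86 MiB
load_tensors:   CPU_Mapped model buffer size =  6940.48 MiB
"""

OFFLOAD_RE = re.compile(r"load_tensors: offloaded (\d+)/(\d+) layers to GPU")
BUFFER_RE = re.compile(r"load_tensors:\s+(\S+) model buffer size =\s*([\d.]+) MiB")


def summarize_offload(log_text):
    """Extract offloaded layer counts and per-backend weight buffer sizes."""
    m = OFFLOAD_RE.search(log_text)
    layers = (int(m.group(1)), int(m.group(2))) if m else None
    buffers = {name: float(size) for name, size in BUFFER_RE.findall(log_text)}
    total = sum(buffers.values())
    gpu = sum(size for name, size in buffers.items() if name.startswith("CUDA"))
    return {
        "layers": layers,
        "buffers_mib": buffers,
        "gpu_weight_fraction": gpu / total if total else None,
    }


summary = summarize_offload(LOG)
print(summary["layers"], round(summary["gpu_weight_fraction"], 3))
```

For this log, 30 of 49 layers and roughly 61% of the weights are on CUDA0. The remainder is in the CPU_Mapped buffer, so those layers run on the CPU in any build. Running the same summary on the b5333 log would show whether the two builds split the model identically.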

🛠 Potential Causes

- Regression in how MoE layers are offloaded or executed in b5466.
- Changes in graph execution logic that prevent CUDA kernels from being used.
- Incomplete or incorrect support for MoE models in this particular build.
- Possible missing or broken CUDA kernel bindings for certain ops used by MoE models.

✅ Suggested Fixes / Investigations

- Add better detection of MoE model structures
- Ensure CUDA kernels are properly triggered for MoE layers
- Consider adding a warning when certain features are not supported in current builds
- Compare source code diffs between b5333 and b5466 focusing on:
  - ggml/src/ggml-cuda/ggml-cuda.cu
  - src/llama-model.cpp (offloading logic)
  - MoE-specific handling in load_tensors()
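
Since the first bad commit is not known, the range between the two builds can be narrowed by bisection before reading any diffs. The sketch below is a generic binary search over build numbers. `is_good` is a hypothetical callback (for example, "download that release, run the reproduction, and check the measured speed"). The search assumes that b5333 is good, that b5466 is bad, and that the behavior flips exactly once in between. Not every build number necessarily has a published binary, so a real callback would have to handle missing releases.

```python
def first_bad_build(good, bad, is_good):
    """Return the first build number in (good, bad] for which is_good is False.

    Assumes `good` passes, `bad` fails, and the result flips exactly once.
    """
    lo, hi = good, bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(mid):
            lo = mid  # regression was introduced after mid
        else:
            hi = mid  # regression is at mid or earlier
    return hi


# Example with a stand-in predicate; 5400 is an arbitrary illustrative value,
# not a known regression point.
print(first_bad_build(5333, 5466, lambda build: build < 5400))
```

With 133 builds between b5333 and b5466, this needs at most 8 test runs. When building from source, `git bisect` performs the same search over commits.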

First Bad Commit

No response

Relevant log output

go.bat
@echo off
llama-server.exe --model "C:\Users\Jayden\Downloads\Qwen3-30B-A3B-Q4_K_M.gguf" --n-gpu-layers 30 --threads 16 --repeat-penalty 1.3 --temp 0.7 --mirostat 2 --mlock

C:\Users\Jayden\Downloads\Compressed\llama-b5466-bin-win-cuda-12.4-x64>go
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080 SUPER, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from C:\Users\Jayden\Downloads\Compressed\llama-b5466-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Users\Jayden\Downloads\Compressed\llama-b5466-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Users\Jayden\Downloads\Compressed\llama-b5466-bin-win-cuda-12.4-x64\ggml-cpu-alderlake.dll
build: 5466 (9ecf3e66) with clang version 18.1.8 for x86_64-pc-windows-msvc
system info: n_threads = 16, n_threads_batch = 16, total_threads = 28

system_info: n_threads = 16 (n_threads_batch = 16) / 28 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 27
main: loading model
srv    load_model: loading model 'C:\Users\Jayden\Downloads\Qwen3-30B-A3B-Q4_K_M.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4080 SUPER) - 15035 MiB free
llama_model_loader: loaded meta data with 31 key-value pairs and 579 tensors from C:\Users\Jayden\Downloads\Qwen3-30B-A3B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 30B Gptq Fp16
llama_model_loader: - kv   3:                           general.finetune str              = gptq
llama_model_loader: - kv   4:                           general.basename str              = Qwen3
llama_model_loader: - kv   5:                         general.size_label str              = 30B
llama_model_loader: - kv   6:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv   7:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv   8:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv   9:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  10:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  11:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  15:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  16:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  17:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  18:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,151387]  = ["? ?", "?? ??", "i n", "? t",...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q4_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 17.28 GiB (4.86 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 2048
print_info: n_layer          = 48
print_info: n_head           = 32
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 6144
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 30B.A3B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3 30B Gptq Fp16
print_info: n_ff_exp         = 768
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 '?'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 30 repeating layers to GPU
load_tensors: offloaded 30/49 layers to GPU
load_tensors:        CUDA0 model buffer size = 10750.86 MiB
load_tensors:   CPU_Mapped model buffer size =  6940.48 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.58 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   240.00 MiB
llama_kv_cache_unified:        CPU KV buffer size =   144.00 MiB
llama_kv_cache_unified: size =  384.00 MiB (  4096 cells,  48 layers,  1 seqs), K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_context:      CUDA0 compute buffer size =   544.18 MiB
llama_context:  CUDA_Host compute buffer size =    12.01 MiB
llama_context: graph nodes  = 3222
llama_context: graph splits = 256 (with bs=512), 39 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 4096
main: model loaded
main: chat template, chat_template: {%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0].role == 'system' %}
        {{- messages[0].content + '\n\n' }}
    {%- endif %}
    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0].role == 'system' %}
        {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for index in range(ns.last_query_index, -1, -1) %}
    {%- set message = messages[index] %}
    {%- if ns.multi_step_tool and message.role == "user" and not('<tool_response>' in message.content and '</tool_response>' in message.content) %}
        {%- set ns.multi_step_tool = false %}
        {%- set ns.last_query_index = index %}
    {%- endif %}
{%- endfor %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {%- set content = message.content %}
        {%- set reasoning_content = '' %}
        {%- if message.reasoning_content is defined and message.reasoning_content is not none %}
            {%- set reasoning_content = message.reasoning_content %}
        {%- else %}
            {%- if '</think>' in message.content %}
                {%- set content = message.content.split('</think>')[-1].lstrip('\n') %}
                {%- set reasoning_content = message.content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
            {%- endif %}
        {%- endif %}
        {%- if loop.index0 > ns.last_query_index %}
            {%- if loop.last or (not loop.last and reasoning_content) %}
                {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
            {%- else %}
                {{- '<|im_start|>' + message.role + '\n' + content }}
            {%- endif %}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- endif %}
        {%- if message.tool_calls %}
            {%- for tool_call in message.tool_calls %}
                {%- if (loop.first and content) or (not loop.first) %}
                    {{- '\n' }}
                {%- endif %}
                {%- if tool_call.function %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {{- '<tool_call>\n{"name": "' }}
                {{- tool_call.name }}
                {{- '", "arguments": ' }}
                {%- if tool_call.arguments is string %}
                    {{- tool_call.arguments }}
                {%- else %}
                    {{- tool_call.arguments | tojson }}
                {%- endif %}
                {{- '}\n</tool_call>' }}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- if enable_thinking is defined and enable_thinking is false %}
        {{- '<think>\n\n</think>\n\n' }}
    {%- endif %}
{%- endif %}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 383
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 383, n_tokens = 383, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 383, n_tokens = 383
srv  cancel_tasks: cancel task, id_task = 0
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot      release: id  0 | task 0 | stop processing: n_past = 418, truncated = 0
srv  update_slots: all slots are idle
srv    operator(): operator(): cleaning up before exit...

Eval bug: [CUDA] MoE model (Qwen3-30B-A3B) loads to GPU but does not utilize CUDA for inference in build b5466 #13729
