Description
The 32B and 30B models are similar in size, but there is a huge difference in performance:
x86 CUDA (~4.5x tg difference):
llama.cpp build: 6a2bc8b (5415)
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA A10-24Q, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | CUDA | 99 | 1 | pp512 | 672.53 ± 3.28 |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | CUDA | 99 | 1 | tg128 | 23.22 ± 0.01 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | CUDA | 99 | 1 | pp512 | 1328.67 ± 15.35 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 103.50 ± 0.17 |
ARM (~2.7x tg difference):
llama.cpp build: 814f795 (5307)
| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | BLAS | 64 | 1 | pp512 | 131.40 ± 0.21 |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | BLAS | 64 | 1 | tg128 | 14.44 ± 0.10 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | BLAS | 64 | 1 | pp512 | 383.43 ± 3.12 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | BLAS | 64 | 1 | tg128 | 39.50 ± 0.16 |
What explains this difference?
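For reference, a back-of-the-envelope sketch of the ratios involved, computed from the benchmark numbers above. It assumes the MoE model activates roughly 3.3 B of its 30.53 B parameters per token (as the A3B suffix suggests), and that token generation (tg) is largely bound by how many weight bytes must be read per token; these assumptions are mine, not something stated in the report:

```python
# Observed tokens/s taken directly from the llama-bench tables above.
cuda = {"dense_tg": 23.22, "moe_tg": 103.50, "dense_pp": 672.53, "moe_pp": 1328.67}
arm = {"dense_tg": 14.44, "moe_tg": 39.50, "dense_pp": 131.40, "moe_pp": 383.43}

for name, r in (("CUDA", cuda), ("ARM", arm)):
    tg_ratio = r["moe_tg"] / r["dense_tg"]
    pp_ratio = r["moe_pp"] / r["dense_pp"]
    print(f"{name}: tg speedup {tg_ratio:.2f}x, pp speedup {pp_ratio:.2f}x")

# If tg is memory-bandwidth bound and the MoE model only reads the weights
# of its activated experts, the parameter ratio gives a rough upper bound
# on the achievable tg speedup (assumed 3.3 B active parameters):
active_b, total_b = 3.3, 30.53
print(f"active-parameter ratio ceiling ~{total_b / active_b:.1f}x")
```

The observed tg speedups (roughly 4.5x on CUDA, 2.7x on ARM) sit well below that ceiling, which is consistent with per-token overheads (attention, KV-cache reads, routing, kernel launches) that do not shrink with the expert count.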