Description
The 32B and 30B models are similar in size, but there is a huge difference in performance:
x86 CUDA (~4.5x tg difference):
llama.cpp build: 6a2bc8b (5415)
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA A10-24Q, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | CUDA | 99 | 1 | pp512 | 672.53 ± 3.28 |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | CUDA | 99 | 1 | tg128 | 23.22 ± 0.01 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | CUDA | 99 | 1 | pp512 | 1328.67 ± 15.35 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 103.50 ± 0.17 |
ARM (~2.7x tg difference):
llama.cpp build: 814f795 (5307)
| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | BLAS | 64 | 1 | pp512 | 131.40 ± 0.21 |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | BLAS | 64 | 1 | tg128 | 14.44 ± 0.10 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | BLAS | 64 | 1 | pp512 | 383.43 ± 3.12 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | BLAS | 64 | 1 | tg128 | 39.50 ± 0.16 |
What explains this difference?
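For reference, a back-of-the-envelope sketch of the ratios involved, computed from the benchmark numbers above. It assumes the MoE model activates roughly 3.3 B of its 30.53 B parameters per token (as the A3B suffix suggests), and that token generation (tg) is largely bound by how many weight bytes must be read per token; these assumptions are mine, not something stated in the report:

```python
# Observed tokens/s taken directly from the llama-bench tables above.
cuda = {"dense_tg": 23.22, "moe_tg": 103.50, "dense_pp": 672.53, "moe_pp": 1328.67}
arm = {"dense_tg": 14.44, "moe_tg": 39.50, "dense_pp": 131.40, "moe_pp": 383.43}

for name, r in (("CUDA", cuda), ("ARM", arm)):
    tg_ratio = r["moe_tg"] / r["dense_tg"]
    pp_ratio = r["moe_pp"] / r["dense_pp"]
    print(f"{name}: tg speedup {tg_ratio:.2f}x, pp speedup {pp_ratio:.2f}x")

# If tg is memory-bandwidth bound and the MoE model only reads the weights
# of its activated experts, the parameter ratio gives a rough upper bound
# on the achievable tg speedup (assumed 3.3 B active parameters):
active_b, total_b = 3.3, 30.53
print(f"active-parameter ratio ceiling ~{total_b / active_b:.1f}x")
```

The observed tg speedups (roughly 4.5x on CUDA, 2.7x on ARM) sit well below that ceiling, which is consistent with per-token overheads (attention, KV-cache reads, routing, kernel launches) that do not shrink with the expert count.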