
Qwen3 32B and 30B models are similar in size, but there is a 4x difference in performance!? #13652

Closed
@jagusztinl

Description


The 32B and 30B models are similar in size, but there is a huge difference in performance:

x86 CUDA (4x difference):
llama.cpp build: 6a2bc8b (5415)
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA A10-24Q, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | CUDA | 99 | 1 | pp512 | 672.53 ± 3.28 |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | CUDA | 99 | 1 | tg128 | 23.22 ± 0.01 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | CUDA | 99 | 1 | pp512 | 1328.67 ± 15.35 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 103.50 ± 0.17 |

ARM (2.5x difference):
llama.cpp build: 814f795 (5307)
| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | BLAS | 64 | 1 | pp512 | 131.40 ± 0.21 |
| qwen3 32B Q4_0 | 17.41 GiB | 32.76 B | BLAS | 64 | 1 | tg128 | 14.44 ± 0.10 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | BLAS | 64 | 1 | pp512 | 383.43 ± 3.12 |
| qwen3moe 30B.A3B Q4_0 | 16.42 GiB | 30.53 B | BLAS | 64 | 1 | tg128 | 39.50 ± 0.16 |
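
For reference, the speedup ratios implied by these numbers can be checked with a quick sketch (values are copied directly from the llama-bench tables above; nothing else is assumed):

```python
# Throughput (t/s) for the dense 32B vs MoE 30B.A3B runs, copied from the
# benchmark tables above.
results = {
    "x86 CUDA": {"pp512": (672.53, 1328.67), "tg128": (23.22, 103.50)},
    "ARM BLAS": {"pp512": (131.40, 383.43), "tg128": (14.44, 39.50)},
}

for backend, tests in results.items():
    for test, (dense, moe) in tests.items():
        ratio = moe / dense
        print(f"{backend} {test}: {ratio:.2f}x")
```

This prints roughly a 4.5x tg128 gap on CUDA and a 2.7x gap on ARM, matching the "4x" and "2.5x" figures quoted above.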

What is the explanation for this?
