Closed
Description
I expected that offloading more layers to the GPU (a higher `-ngl`) would improve performance (tokens/sec), but the benchmark results show otherwise.
The following was run on a QCS8550 with this model: https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct-GGUF/blob/main/EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf
llama-bench -m ./EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf -ngl 0,5,10,15,20,31
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) 740'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.42.20.00
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 256 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: A_q_d buffer size reduced from 311164928 to 268435456 due to device limitations.
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| exaone ?B Q4_K - Medium | 1.39 GiB | 2.41 B | OpenCL | 0 | pp512 | 18.92 ± 0.18 |
| exaone ?B Q4_K - Medium | 1.39 GiB | 2.41 B | OpenCL | 0 | tg128 | 3.90 ± 0.10 |
| exaone ?B Q4_K - Medium | 1.39 GiB | 2.41 B | OpenCL | 5 | pp512 | 16.97 ± 0.03 |
| exaone ?B Q4_K - Medium | 1.39 GiB | 2.41 B | OpenCL | 5 | tg128 | 3.37 ± 0.02 |
| exaone ?B Q4_K - Medium | 1.39 GiB | 2.41 B | OpenCL | 10 | pp512 | 16.23 ± 0.02 |
| exaone ?B Q4_K - Medium | 1.39 GiB | 2.41 B | OpenCL | 10 | tg128 | 3.12 ± 0.02 |
| exaone ?B Q4_K - Medium | 1.39 GiB | 2.41 B | OpenCL | 15 | pp512 | 15.87 ± 0.03 |
| exaone ?B Q4_K - Medium | 1.39 GiB | 2.41 B | OpenCL | 15 | tg128 | 2.93 ± 0.01 |
| exaone ?B Q4_K - Medium | 1.39 GiB | 2.41 B | OpenCL | 20 | pp512 | 15.22 ± 0.02 |
| exaone ?B Q4_K - Medium | 1.39 GiB | 2.41 B | OpenCL | 20 | tg128 | 2.80 ± 0.01 |
| exaone ?B Q4_K - Medium | 1.39 GiB | 2.41 B | OpenCL | 31 | pp512 | 13.81 ± 0.03 |
| exaone ?B Q4_K - Medium | 1.39 GiB | 2.41 B | OpenCL | 31 | tg128 | 2.95 ± 0.14 |
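
To make the trend easier to read, here is a small Python sketch that computes the percent change in throughput relative to `-ngl 0`. It is not part of the original run: the numbers are the mean t/s values copied by hand from the table above, and the names `results` and `relative_change` are made up for illustration.

```python
# Mean throughput (t/s) copied from the llama-bench table above, keyed by -ngl.
# The +/- spread reported by llama-bench is ignored here.
results = {
    "pp512": {0: 18.92, 5: 16.97, 10: 16.23, 15: 15.87, 20: 15.22, 31: 13.81},
    "tg128": {0: 3.90, 5: 3.37, 10: 3.12, 15: 2.93, 20: 2.80, 31: 2.95},
}


def relative_change(by_ngl, baseline_ngl=0):
    """Return {ngl: percent change in t/s versus the baseline ngl}."""
    base = by_ngl[baseline_ngl]
    return {ngl: 100.0 * (tps - base) / base for ngl, tps in by_ngl.items()}


for test, by_ngl in results.items():
    for ngl, pct in relative_change(by_ngl).items():
        print(f"{test} ngl={ngl:>2}: {pct:+6.1f}% vs ngl=0")
```

With these values, full offload (`-ngl 31`) comes out roughly 27% slower than `-ngl 0` for pp512 and roughly 24% slower for tg128. tg128 at `-ngl 31` is slightly above `-ngl 20`, but still well below the CPU-only baseline.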