OpenCL: Performance comparison depending on gpu_offloads

I expected more gpu_offloads get better performances(tokens/sec), however the bench-results were different.

The followings were executed on QCS8550 with a model (https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct-GGUF/blob/main/EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf).

llama-bench -m ./EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf  -ngl 0,5,10,15,20,31

ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) 740'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.42.20.00
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 256 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: A_q_d buffer size reduced from 311164928 to 268435456 due to device limitations.
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |   0 |         pp512 |         18.92 ± 0.18 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |   0 |         tg128 |          3.90 ± 0.10 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |   5 |         pp512 |         16.97 ± 0.03 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |   5 |         tg128 |          3.37 ± 0.02 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  10 |         pp512 |         16.23 ± 0.02 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  10 |         tg128 |          3.12 ± 0.02 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  15 |         pp512 |         15.87 ± 0.03 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  15 |         tg128 |          2.93 ± 0.01 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  20 |         pp512 |         15.22 ± 0.02 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  20 |         tg128 |          2.80 ± 0.01 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  31 |         pp512 |         13.81 ± 0.03 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  31 |         tg128 |          2.95 ± 0.14 |

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

OpenCL: Performance comparison depending on gpu_offloads #12810

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

OpenCL: Performance comparison depending on gpu_offloads #12810

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions