
sycl: Add reorder to Q6_K mmvq implementation #13885


Merged: 4 commits merged into ggml-org:master on Jun 9, 2025

Conversation

@s-Nick (Collaborator) commented May 29, 2025

This PR implements quant reordering in mmvq for the Q6_K quantization, following the work done for Q4_0 and Q4_K.
These changes give good results on BMG and do not degrade performance on other GPUs.
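
For readers new to the reorder optimization, the sketch below illustrates the general idea under the assumption that Q6_K follows the same scheme as the existing Q4_0/Q4_K reorders: each component of every block is packed contiguously (all low bits, then all high bits, then all scales, then the per-block d values), so the mmvq kernel can read each region contiguously. The struct name, helper function, and exact layout are illustrative assumptions based on the snippets quoted in the review below, not the PR's actual code.

```cpp
#include <cstdint>
#include <sycl/sycl.hpp>

#ifndef QK_K
#define QK_K 256  // values per K-quant super-block in ggml
#endif

// Hypothetical view of a reordered Q6_K buffer (names assumed for illustration):
//   [ ql of all blocks | qh of all blocks | scales of all blocks | d of all blocks ]
struct q6_k_reordered_view {
    const uint8_t    * ql;      // low 4 bits of the 6-bit quants, QK_K/2 bytes per block
    const uint8_t    * qh;      // high 2 bits of the 6-bit quants, QK_K/4 bytes per block
    const int8_t     * scales;  // QK_K/16 int8 scales per block
    const sycl::half * d;       // one super-block scale per block
};

inline q6_k_reordered_view view_reordered_q6_k(const uint8_t * data, int nblocks) {
    q6_k_reordered_view v{};
    v.ql     = data;
    v.qh     = v.ql + (QK_K / 2) * nblocks;
    v.scales = reinterpret_cast<const int8_t *>(v.qh + (QK_K / 4) * nblocks);
    v.d      = reinterpret_cast<const sycl::half *>(v.scales + (QK_K / 16) * nblocks);
    return v;
}
```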

Performance impact

All numbers were taken with `GGML_SYCL_DISABLE_OPT=0`.

Battlemage B580

| model | size | params | backend | ngl | sm | mmap | test | this PR t/s | master (26b79b6) t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 7421.88 ± 40.31 | 7303.25 ± 190.33 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 134.93 ± 4.28 | 132.17 ± 7.13 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 7508.23 ± 12.32 | 7543.68 ± 52.94 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 124.74 ± 2.57 | 117.75 ± 2.95 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 2164.47 ± 3.86 | 2156.43 ± 4.42 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 65.13 ± 0.55 | 65.65 ± 0.38 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 2206.63 ± 4.49 | 2202.15 ± 5.80 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 55.04 ± 0.19 | 52.57 ± 0.23 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | pp512 | 5722.94 ± 33.50 | 5688.02 ± 29.39 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | tg128 | 93.32 ± 2.52 | 88.34 ± 2.42 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 3043.55 ± 8.11 | 3047.39 ± 6.17 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 95.46 ± 2.23 | 95.79 ± 2.39 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 3140.52 ± 6.44 | 3154.81 ± 5.22 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 72.71 ± 0.89 | 69.35 ± 0.16 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | none | 0 | pp512 | 1472.53 ± 4.66 | 1468.04 ± 0.47 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | none | 0 | tg128 | 23.48 ± 0.03 | 20.30 ± 0.05 |

Lunar Lake

| model | size | params | backend | ngl | test | this PR t/s | master (26b79b6) t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | pp512 | 1559.52 ± 38.29 | 1761.94 ± 15.09 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | tg128 | 58.17 ± 0.59 | 56.42 ± 0.42 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | pp512 | 1775.38 ± 37.60 | 1675.54 ± 32.04 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | tg128 | 42.21 ± 0.19 | 39.81 ± 0.65 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | pp512 | 391.97 ± 4.96 | 433.80 ± 1.21 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | tg128 | 21.61 ± 0.51 | 20.38 ± 0.61 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | pp512 | 492.83 ± 0.55 | 488.92 ± 1.28 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | tg128 | 15.86 ± 0.24 | 14.98 ± 0.16 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | pp512 | 985.69 ± 62.63 | 990.18 ± 10.61 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | tg128 | 29.24 ± 0.15 | 27.48 ± 0.25 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | pp512 | 674.42 ± 0.61 | 665.08 ± 3.18 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | tg128 | 34.34 ± 0.09 | 33.52 ± 0.11 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | pp512 | 743.15 ± 1.86 | 737.78 ± 2.94 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | tg128 | 22.53 ± 0.48 | 22.17 ± 0.30 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | pp512 | 301.50 ± 14.06 | 302.83 ± 1.09 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | tg128 | 7.72 ± 0.05 | 6.06 ± 0.02 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | pp512 | 411.81 ± 4.68 | 418.31 ± 5.50 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | tg128 | 14.46 ± 0.06 | 13.40 ± 0.09 |

Intel Arc A770

| model | size | params | backend | ngl | sm | mmap | test | this PR t/s | master (26b79b6) t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 4456.48 ± 8.69 | 4433.16 ± 9.59 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 45.60 ± 0.21 | 44.82 ± 0.24 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 4499.64 ± 3.28 | 4460.18 ± 5.36 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 44.60 ± 0.16 | 43.98 ± 0.14 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 1716.17 ± 1.29 | 1707.70 ± 1.18 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 34.40 ± 0.03 | 34.06 ± 0.02 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 1732.65 ± 2.73 | 1723.82 ± 1.31 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 32.60 ± 0.27 | 31.71 ± 0.27 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | pp512 | 3661.03 ± 6.91 | 3632.01 ± 6.45 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | tg128 | 39.20 ± 0.31 | 38.16 ± 0.36 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 2464.93 ± 2.58 | 2445.01 ± 2.54 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 39.98 ± 0.01 | 39.41 ± 0.33 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 2513.13 ± 2.47 | 2492.01 ± 1.80 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 34.50 ± 0.30 | 34.14 ± 0.02 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | none | 0 | pp512 | 1031.68 ± 1.04 | 1024.74 ± 1.07 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | none | 0 | tg128 | 17.33 ± 0.11 | 15.12 ± 0.16 |

@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL, GPU programming language) labels on May 29, 2025
@Alcpz (Collaborator) left a comment

This looks good. Most of my comments are minor things or topics for discussion.


```cpp
auto * ql_ptr = data_device;
auto * qh_ptr = ql_ptr + (QK_K / 2) * nblocks;
// scales are after all quants' bits so adding both to get correct offset
```
Collaborator:

This comment describes the reordered structure a bit. Down below you have a similar comment for the high and low bits. I suggest having both in the same place if we want to keep these.

Collaborator (Author):

Thank you for spotting it. I think a better place for this comment is inside the struct where the offset is computed, so I am going to remove it from here.

Collaborator (Author):

Addressed in f66d799

```cpp
using q6_k_block = ggml_sycl_reordered::block_q_t<GGML_TYPE_Q6_K>;
using q6_k_traits = typename q6_k_block::traits;

// contiguous v/x values
```
Collaborator:

Can you explain this comment?

Collaborator (Author):

It was left over from the original vec_dot_q6_K_q8_1_impl_mmvq. It is valid for all K quantizations and simply means that the v values it uses are contiguous. In retrospect it is a bit cryptic and redundant, so I'll remove it.

Collaborator (Author):

Addressed in f66d799


```cpp
float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
                 const std::pair<int, int> d_offset, const block_q8_1 * __restrict__ bq8_1, const int & iqs,
                 int /* n_blocks */) {
```
Collaborator:

Since you found a way to get rid of n_blocks and q6_K is quite similar to q4_K, do you think it's feasible to remove it also from q4_K so we reduce the function signature?

It's not really part of the PR, so if this would require much more work we can add a TODO to deal with that.

Collaborator (Author):

I think it is possible. I read the logic for Q4_K: n_blocks is used only to compute the position after all the qs. It looks like we could put both the block scales and the super-block scales in the d_offset pair and compute them only once.

Collaborator:

Is it a small enough change? If not we can refactor as part of a different PR.

Collaborator (Author):

To me it is small enough, so I made the change in 7c859d0
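
For illustration, here is a minimal sketch of the idea discussed in this thread: a `get_d_offset`-style helper that returns both the block-scales offset and the super-block-scales offset in one pair, so the vec-dot functor no longer needs a separate `n_blocks` argument. The signature mirrors the snippet quoted later in this review; the body and layout constants are assumptions based on the reordered layout sketched earlier, not the PR's actual implementation.

```cpp
#include <utility>

#ifndef QK_K
#define QK_K 256
#endif

// Sketch only: offsets into the reordered buffer described earlier.
// .first  -> byte offset of this block's int8 scales
// .second -> byte offset of the start of the packed half-precision d array
static constexpr std::pair<int, int> get_d_offset(int nrows, int ncols, const int block_index) {
    const int nblocks      = (nrows * ncols) / QK_K;           // blocks in the whole tensor
    const int quants_bytes = (QK_K / 2 + QK_K / 4) * nblocks;  // ql + qh of every block
    return { quants_bytes + (QK_K / 16) * block_index,         // this block's scales
             quants_bytes + (QK_K / 16) * nblocks };           // d values start after all scales
}
```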

```diff
@@ -35,9 +35,8 @@ static void mul_mat_vec_q_reorder(const void * __restrict__ vx, const void * __r
     for (int i = sg.get_local_linear_id() / block_elements_per_subgroup; i < blocks_per_row; i += blocks_per_subgroup) {
         const int ibx = row * blocks_per_row + i; // x block index
         // TODO: Generalize offsets, right now only works for quantizations that don't split high and low bits
```
Collaborator:

This PR deals with this TODO.

Collaborator (Author):

I thought so too, but I wasn't sure I covered all the possible cases. If you are, I am happy to remove the comment.

Collaborator (Author):

Addressed in f66d799
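
As context for the TODO above: for quantizations like Q6_K, the low and high quant bits live in two separate packed regions, so a single block offset is not enough. A hypothetical block-offset helper could return one index per region, matching the `return { low_bits_index, high_bits_index };` shape quoted further down; the body below is an assumption for illustration, not the PR's code.

```cpp
#include <utility>

#ifndef QK_K
#define QK_K 256
#endif

// Sketch only: indices of block `block_index` inside the reordered buffer
// [ ql of all blocks | qh of all blocks | ... ] described earlier.
static constexpr std::pair<int, int> get_block_offset(const int block_index, const int n_blocks) {
    const int low_bits_index  = block_index * (QK_K / 2);                          // into the ql region
    const int high_bits_index = n_blocks * (QK_K / 2) + block_index * (QK_K / 4);  // past all ql, into qh
    return { low_bits_index, high_bits_index };
}
```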

```cpp
const int8_t * scales = reinterpret_cast<const int8_t *>(base + d_offset.first);
const ggml_half * d = (const ggml_half *) (base + d_offset.second) + ib;

const int bq8_offset = 2 * QR6_K * (iqs / (QI6_K / 2)) + (iqs % (QI6_K / 2)) / (QI6_K / 4);
```
Collaborator:

Discussion:

Block traits (traits::qk and such) were introduced to avoid having the QI6_K, QK_K and similar macros lying around. Are we all happy with having the generic traits only in the mmvq entrypoint (mul_mat_vec_q_reorder)?

I used them in Q4_0, but that case has much simpler quant/dequantize algorithms. Just double checking that this is a conscious choice.

Collaborator (Author):

While computing the offset I found it more intuitive to use the macros. I don't mind changing it as long as the SYCL backend style is consistent (the Q4_K reorder still uses macros, too).

Collaborator:

I meant the opposite: to leave them as they are. The macros seem shorter and easier to read, but I wanted to see what others thought about it.
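
For readers following the traits-vs-macros discussion: below is an illustrative sketch of what Q6_K traits could look like if they mirrored the values of the existing QK_K / QR6_K / QI6_K macros (QI6_K is defined as QK_K / (4 * QR6_K) in ggml). The struct is hypothetical; as agreed above, the PR keeps using the macros directly inside the Q6_K vec-dot.

```cpp
#ifndef QK_K
#define QK_K 256
#endif
#ifndef QR6_K
#define QR6_K 2
#endif
#ifndef QI6_K
#define QI6_K (QK_K / (4 * QR6_K))  // 32
#endif

// Hypothetical traits, shown only to contrast with the macro-based style kept in the PR.
struct q6_k_traits_sketch {
    static constexpr int qk = QK_K;   // values per super-block
    static constexpr int qr = QR6_K;  // quant ratio used by the vec-dot indexing
    static constexpr int qi = QI6_K;  // 32-bit integers of quant data per block
};
```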

@s-Nick force-pushed the mmvq_q6_k_reorder branch from 3ec8eb3 to 6d0c2d8 on June 3, 2025 15:46

s-Nick added 2 commits on June 4, 2025 15:50:
The `nblocks` parameter can be removed from the function call by using the `d_offset` pair.

Signed-off-by: nscipione <[email protected]>
Comment on lines 427 to 429:

```cpp
float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
                 const std::pair<int, int> d_offset, const int8_t * q8_1_quant_ptr, const sycl::half2 * q8_1_ds,
                 const int & iqs) {
```

Contributor:

Suggested change:

```diff
-float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
-                 const std::pair<int, int> d_offset, const int8_t * q8_1_quant_ptr, const sycl::half2 * q8_1_ds,
-                 const int & iqs) {
+__dpct_inline__ float operator()(const void * __restrict__ vbq, const std::pair<int, int>& ibx_offset,
+                 const std::pair<int, int>& d_offset, const int8_t * q8_1_quant_ptr, const sycl::half2 * q8_1_ds,
+                 const int & iqs) {
```

Collaborator (Author):

Thank you for your comments @AD2605.
Adding `__dpct_inline__` is a nice improvement, but I don't think I should pass `std::pair<int, int>` by reference. It's a small type, so I think it's better to pass it by value.

Collaborator (Author):

Added in ce751ad

```cpp
    return { low_bits_index, high_bits_index };
}

static constexpr std::pair<int, int> get_d_offset(int nrows, int ncols, const int block_index) {
```
Contributor:

Is there a reason this function, and the one above, are marked as constexpr?

Collaborator (Author):

I don't have any particular reason; I'm just being consistent with the other quants' implementations.

Comment on lines +350 to +352:

```cpp
__dpct_inline__ float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
                                 const std::pair<int, int> d_offset, const int8_t * q8_1_quant_ptr,
                                 const sycl::half2 * q8_1_ds, const int & iqs) {
```

Contributor:

Suggested change:

```diff
-__dpct_inline__ float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
-                                 const std::pair<int, int> d_offset, const int8_t * q8_1_quant_ptr,
-                                 const sycl::half2 * q8_1_ds, const int & iqs) {
+__dpct_inline__ float operator()(const void * __restrict__ vbq, const std::pair<int, int>& ibx_offset,
+                                 const std::pair<int, int>& d_offset, const int8_t * q8_1_quant_ptr,
+                                 const sycl::half2 * q8_1_ds, const int & iqs) {
```

Comment on lines +287 to +289:

```cpp
__dpct_inline__ float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
                                 const std::pair<int, int> d_offset, const int8_t * q8_1_quant_ptr,
                                 const sycl::half2 * q8_1_ds, const int & iqs) {
```

Contributor:

Suggested change:

```diff
-__dpct_inline__ float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
-                                 const std::pair<int, int> d_offset, const int8_t * q8_1_quant_ptr,
-                                 const sycl::half2 * q8_1_ds, const int & iqs) {
+__dpct_inline__ float operator()(const void * __restrict__ vbq, const std::pair<int, int>& ibx_offset,
+                                 const std::pair<int, int>& d_offset, const int8_t * q8_1_quant_ptr,
+                                 const sycl::half2 * q8_1_ds, const int & iqs) {
```

@s-Nick merged commit b460d16 into ggml-org:master on Jun 9, 2025
45 checks passed
Labels: ggml, SYCL
4 participants