
sycl: Add reorder to Q6_K mmvq implementation #13885


Merged: 4 commits merged into ggml-org:master on Jun 9, 2025

Conversation

@s-Nick (Collaborator) commented May 29, 2025

This PR implements quant reordering in mmvq for the Q6_K quantization, following the work done for Q4_0 and Q4_K.
These changes give good results on BMG and do not degrade performance on other GPUs.
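
For readers new to the reorder optimization, the sketch below illustrates the general idea under the assumption that Q6_K follows the same scheme as the existing Q4_0/Q4_K reorders: each component of every block is packed contiguously (all low bits, then all high bits, then all scales, then the per-block d values), so the mmvq kernel can read each region contiguously. The struct name, helper function, and exact layout are illustrative assumptions based on the snippets quoted in the review below, not the PR's actual code.

```cpp
#include <cstdint>
#include <sycl/sycl.hpp>

#ifndef QK_K
#define QK_K 256  // values per K-quant super-block in ggml
#endif

// Hypothetical view of a reordered Q6_K buffer (names assumed for illustration):
//   [ ql of all blocks | qh of all blocks | scales of all blocks | d of all blocks ]
struct q6_k_reordered_view {
    const uint8_t    * ql;      // low 4 bits of the 6-bit quants, QK_K/2 bytes per block
    const uint8_t    * qh;      // high 2 bits of the 6-bit quants, QK_K/4 bytes per block
    const int8_t     * scales;  // QK_K/16 int8 scales per block
    const sycl::half * d;       // one super-block scale per block
};

inline q6_k_reordered_view view_reordered_q6_k(const uint8_t * data, int nblocks) {
    q6_k_reordered_view v{};
    v.ql     = data;
    v.qh     = v.ql + (QK_K / 2) * nblocks;
    v.scales = reinterpret_cast<const int8_t *>(v.qh + (QK_K / 4) * nblocks);
    v.d      = reinterpret_cast<const sycl::half *>(v.scales + (QK_K / 16) * nblocks);
    return v;
}
```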

Performance impact

All numbers were taken with `GGML_SYCL_DISABLE_OPT=0`.

Battlemage B580

| model | size | params | backend | ngl | sm | mmap | test | this PR t/s | master (26b79b6) t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 7421.88 ± 40.31 | 7303.25 ± 190.33 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 134.93 ± 4.28 | 132.17 ± 7.13 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 7508.23 ± 12.32 | 7543.68 ± 52.94 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 124.74 ± 2.57 | 117.75 ± 2.95 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 2164.47 ± 3.86 | 2156.43 ± 4.42 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 65.13 ± 0.55 | 65.65 ± 0.38 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 2206.63 ± 4.49 | 2202.15 ± 5.80 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 55.04 ± 0.19 | 52.57 ± 0.23 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | pp512 | 5722.94 ± 33.50 | 5688.02 ± 29.39 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | tg128 | 93.32 ± 2.52 | 88.34 ± 2.42 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 3043.55 ± 8.11 | 3047.39 ± 6.17 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 95.46 ± 2.23 | 95.79 ± 2.39 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 3140.52 ± 6.44 | 3154.81 ± 5.22 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 72.71 ± 0.89 | 69.35 ± 0.16 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | none | 0 | pp512 | 1472.53 ± 4.66 | 1468.04 ± 0.47 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | none | 0 | tg128 | 23.48 ± 0.03 | 20.30 ± 0.05 |

Lunar Lake

| model | size | params | backend | ngl | test | this PR t/s | master (26b79b6) t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | pp512 | 1559.52 ± 38.29 | 1761.94 ± 15.09 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | tg128 | 58.17 ± 0.59 | 56.42 ± 0.42 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | pp512 | 1775.38 ± 37.60 | 1675.54 ± 32.04 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | tg128 | 42.21 ± 0.19 | 39.81 ± 0.65 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | pp512 | 391.97 ± 4.96 | 433.80 ± 1.21 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | tg128 | 21.61 ± 0.51 | 20.38 ± 0.61 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | pp512 | 492.83 ± 0.55 | 488.92 ± 1.28 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | tg128 | 15.86 ± 0.24 | 14.98 ± 0.16 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | pp512 | 985.69 ± 62.63 | 990.18 ± 10.61 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | tg128 | 29.24 ± 0.15 | 27.48 ± 0.25 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | pp512 | 674.42 ± 0.61 | 665.08 ± 3.18 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | tg128 | 34.34 ± 0.09 | 33.52 ± 0.11 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | pp512 | 743.15 ± 1.86 | 737.78 ± 2.94 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | tg128 | 22.53 ± 0.48 | 22.17 ± 0.30 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | pp512 | 301.50 ± 14.06 | 302.83 ± 1.09 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | tg128 | 7.72 ± 0.05 | 6.06 ± 0.02 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | pp512 | 411.81 ± 4.68 | 418.31 ± 5.50 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | tg128 | 14.46 ± 0.06 | 13.40 ± 0.09 |

Intel Arc A770

| model | size | params | backend | ngl | sm | mmap | test | this PR t/s | master (26b79b6) t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 4456.48 ± 8.69 | 4433.16 ± 9.59 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 45.60 ± 0.21 | 44.82 ± 0.24 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 4499.64 ± 3.28 | 4460.18 ± 5.36 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 44.60 ± 0.16 | 43.98 ± 0.14 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 1716.17 ± 1.29 | 1707.70 ± 1.18 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 34.40 ± 0.03 | 34.06 ± 0.02 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 1732.65 ± 2.73 | 1723.82 ± 1.31 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 32.60 ± 0.27 | 31.71 ± 0.27 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | pp512 | 3661.03 ± 6.91 | 3632.01 ± 6.45 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | tg128 | 39.20 ± 0.31 | 38.16 ± 0.36 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 2464.93 ± 2.58 | 2445.01 ± 2.54 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 39.98 ± 0.01 | 39.41 ± 0.33 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 2513.13 ± 2.47 | 2492.01 ± 1.80 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 34.50 ± 0.30 | 34.14 ± 0.02 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | none | 0 | pp512 | 1031.68 ± 1.04 | 1024.74 ± 1.07 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | none | 0 | tg128 | 17.33 ± 0.11 | 15.12 ± 0.16 |

@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL, GPU programming language) labels on May 29, 2025
@Alcpz (Collaborator) left a comment

This looks good. Most of my comments are minor things or topics for discussion.


```cpp
auto * ql_ptr = data_device;
auto * qh_ptr = ql_ptr + (QK_K / 2) * nblocks;
// scales are after all quants' bits so adding both to get correct offset
```
Collaborator:

This comment describes the reordered structure a bit. Down below you have a similar comment for the high and low bits. I suggest having both in the same place if we want to keep these.

Collaborator (Author):

Thank you for spotting it. I think a better place for this comment is inside the struct where the offset is computed, so I am going to remove it from here.

Collaborator (Author):

Addressed in f66d799

```cpp
using q6_k_block = ggml_sycl_reordered::block_q_t<GGML_TYPE_Q6_K>;
using q6_k_traits = typename q6_k_block::traits;

// contiguous v/x values
```
Collaborator:

Can you explain this comment?

Collaborator (Author):

It was left over from the original vec_dot_q6_K_q8_1_impl_mmvq. It is valid for all K quantizations and simply means that the v values it uses are contiguous. In retrospect it is a bit cryptic and redundant, so I'll remove it.

Collaborator (Author):

Addressed in f66d799


```cpp
float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
                 const std::pair<int, int> d_offset, const block_q8_1 * __restrict__ bq8_1, const int & iqs,
                 int /* n_blocks */) {
```
Collaborator:

Since you found a way to get rid of n_blocks and q6_K is quite similar to q4_K, do you think it's feasible to remove it also from q4_K so we reduce the function signature?

It's not really part of the PR, so if this would require much more work we can add a TODO to deal with that.

Collaborator (Author):

I think it is possible. I read the logic for Q4_K: n_blocks is used only to compute the position after all the qs. It looks like we could put both the block scales and the super-block scales in the d_offset pair and compute them only once.

Collaborator:

Is it a small enough change? If not we can refactor as part of a different PR.

Collaborator (Author):

To me it is small enough, so I made the change in 7c859d0
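
For illustration, here is a minimal sketch of the idea discussed in this thread: a `get_d_offset`-style helper that returns both the block-scales offset and the super-block-scales offset in one pair, so the vec-dot functor no longer needs a separate `n_blocks` argument. The signature mirrors the snippet quoted later in this review; the body and layout constants are assumptions based on the reordered layout sketched earlier, not the PR's actual implementation.

```cpp
#include <utility>

#ifndef QK_K
#define QK_K 256
#endif

// Sketch only: offsets into the reordered buffer described earlier.
// .first  -> byte offset of this block's int8 scales
// .second -> byte offset of the start of the packed half-precision d array
static constexpr std::pair<int, int> get_d_offset(int nrows, int ncols, const int block_index) {
    const int nblocks      = (nrows * ncols) / QK_K;           // blocks in the whole tensor
    const int quants_bytes = (QK_K / 2 + QK_K / 4) * nblocks;  // ql + qh of every block
    return { quants_bytes + (QK_K / 16) * block_index,         // this block's scales
             quants_bytes + (QK_K / 16) * nblocks };           // d values start after all scales
}
```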

```diff
@@ -35,9 +35,8 @@ static void mul_mat_vec_q_reorder(const void * __restrict__ vx, const void * __r
     for (int i = sg.get_local_linear_id() / block_elements_per_subgroup; i < blocks_per_row; i += blocks_per_subgroup) {
         const int ibx = row * blocks_per_row + i; // x block index
         // TODO: Generalize offsets, right now only works for quantizations that don't split high and low bits
```
Collaborator:

This PR deals with this TODO.

Collaborator (Author):

I thought so too, but I wasn't sure I covered all the possible cases. If you are, I am happy to remove the comment.

Collaborator (Author):

Addressed in f66d799
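
As context for the TODO above: for quantizations like Q6_K, the low and high quant bits live in two separate packed regions, so a single block offset is not enough. A hypothetical block-offset helper could return one index per region, matching the `return { low_bits_index, high_bits_index };` shape quoted further down; the body below is an assumption for illustration, not the PR's code.

```cpp
#include <utility>

#ifndef QK_K
#define QK_K 256
#endif

// Sketch only: indices of block `block_index` inside the reordered buffer
// [ ql of all blocks | qh of all blocks | ... ] described earlier.
static constexpr std::pair<int, int> get_block_offset(const int block_index, const int n_blocks) {
    const int low_bits_index  = block_index * (QK_K / 2);                          // into the ql region
    const int high_bits_index = n_blocks * (QK_K / 2) + block_index * (QK_K / 4);  // past all ql, into qh
    return { low_bits_index, high_bits_index };
}
```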

```cpp
const int8_t * scales = reinterpret_cast<const int8_t *>(base + d_offset.first);
const ggml_half * d = (const ggml_half *) (base + d_offset.second) + ib;

const int bq8_offset = 2 * QR6_K * (iqs / (QI6_K / 2)) + (iqs % (QI6_K / 2)) / (QI6_K / 4);
```
Collaborator:

Discussion:

Block traits (traits::qk and such) were introduced to avoid having the QI6_K, QK_K and similar macros lying around. Are we all happy with having the generic traits only in the mmvq entrypoint (mul_mat_vec_q_reorder)?

I used them in Q4_0, but that case has much simpler quant/dequantize algorithms. Just double checking that this is a conscious choice.

Collaborator (Author):

While computing the offset I found it more intuitive to use the macros. I don't mind changing it as long as the SYCL backend style is consistent (the Q4_K reorder still uses macros, too).

Collaborator:

I meant the opposite: to leave them as they are. The macros seem shorter and easier to read, but I wanted to see what others thought about it.
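
For readers following the traits-vs-macros discussion: below is an illustrative sketch of what Q6_K traits could look like if they mirrored the values of the existing QK_K / QR6_K / QI6_K macros (QI6_K is defined as QK_K / (4 * QR6_K) in ggml). The struct is hypothetical; as agreed above, the PR keeps using the macros directly inside the Q6_K vec-dot.

```cpp
#ifndef QK_K
#define QK_K 256
#endif
#ifndef QR6_K
#define QR6_K 2
#endif
#ifndef QI6_K
#define QI6_K (QK_K / (4 * QR6_K))  // 32
#endif

// Hypothetical traits, shown only to contrast with the macro-based style kept in the PR.
struct q6_k_traits_sketch {
    static constexpr int qk = QK_K;   // values per super-block
    static constexpr int qr = QR6_K;  // quant ratio used by the vec-dot indexing
    static constexpr int qi = QI6_K;  // 32-bit integers of quant data per block
};
```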

@s-Nick force-pushed the mmvq_q6_k_reorder branch from 3ec8eb3 to 6d0c2d8 on June 3, 2025 15:46

s-Nick added 2 commits on June 4, 2025 15:50:
The `nblocks` parameter can be removed from the function call by using the `d_offset` pair.

Signed-off-by: nscipione <[email protected]>
Comment on lines 427 to 429:

```cpp
float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
                 const std::pair<int, int> d_offset, const int8_t * q8_1_quant_ptr, const sycl::half2 * q8_1_ds,
                 const int & iqs) {
```

Contributor:

Suggested change:

```diff
-float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
-                 const std::pair<int, int> d_offset, const int8_t * q8_1_quant_ptr, const sycl::half2 * q8_1_ds,
-                 const int & iqs) {
+__dpct_inline__ float operator()(const void * __restrict__ vbq, const std::pair<int, int>& ibx_offset,
+                 const std::pair<int, int>& d_offset, const int8_t * q8_1_quant_ptr, const sycl::half2 * q8_1_ds,
+                 const int & iqs) {
```

Collaborator (Author):

Thank you for your comments @AD2605.
Adding `__dpct_inline__` is a nice improvement, but I don't think I should pass `std::pair<int, int>` by reference. It's a small type, so I think it's better to pass it by value.

Collaborator (Author):

Added in ce751ad

```cpp
    return { low_bits_index, high_bits_index };
}

static constexpr std::pair<int, int> get_d_offset(int nrows, int ncols, const int block_index) {
```
Contributor:

Is there a reason this function, and the one above, are marked as constexpr?

Collaborator (Author):

I don't have any particular reason; I'm just being consistent with the other quants' implementations.

Comment on lines +350 to +352:

```cpp
__dpct_inline__ float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
                                 const std::pair<int, int> d_offset, const int8_t * q8_1_quant_ptr,
                                 const sycl::half2 * q8_1_ds, const int & iqs) {
```

Contributor:

Suggested change:

```diff
-__dpct_inline__ float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
-                                 const std::pair<int, int> d_offset, const int8_t * q8_1_quant_ptr,
-                                 const sycl::half2 * q8_1_ds, const int & iqs) {
+__dpct_inline__ float operator()(const void * __restrict__ vbq, const std::pair<int, int>& ibx_offset,
+                                 const std::pair<int, int>& d_offset, const int8_t * q8_1_quant_ptr,
+                                 const sycl::half2 * q8_1_ds, const int & iqs) {
```

Comment on lines +287 to +289:

```cpp
__dpct_inline__ float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
                                 const std::pair<int, int> d_offset, const int8_t * q8_1_quant_ptr,
                                 const sycl::half2 * q8_1_ds, const int & iqs) {
```

Contributor:

Suggested change:

```diff
-__dpct_inline__ float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
-                                 const std::pair<int, int> d_offset, const int8_t * q8_1_quant_ptr,
-                                 const sycl::half2 * q8_1_ds, const int & iqs) {
+__dpct_inline__ float operator()(const void * __restrict__ vbq, const std::pair<int, int>& ibx_offset,
+                                 const std::pair<int, int>& d_offset, const int8_t * q8_1_quant_ptr,
+                                 const sycl::half2 * q8_1_ds, const int & iqs) {
```

@s-Nick merged commit b460d16 into ggml-org:master on Jun 9, 2025
45 checks passed
Labels: ggml, SYCL
4 participants