
ggml: aarch64: Implement SVE Kernels for Int 8 Quantization #14117

Open · wants to merge 16 commits into master

Conversation

Vithulep
Contributor

This PR adds SVE kernel support for Int8 datatype specific to ARM architecture.
Major code changes:

  1. Implement SVE intrinsics code for quantize_row_q8_0()
  2. Implement SVE intrinsics code for dequantize_row_q8_0()
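
For illustration, below is a minimal sketch of what an SVE-based `quantize_row_q8_0()` can look like. This is not the PR's exact code: the function name is hypothetical, and it assumes ggml's usual definitions (`QK8_0 == 32`, `block_q8_0`, `GGML_FP32_TO_FP16`) and an SVE-capable toolchain. The predicated loop (`svwhilelt_b32`) keeps the code correct for any SVE vector length.

```c
// Hypothetical sketch, not the PR's code. Assumes ggml's block_q8_0
// (ggml_half d; int8_t qs[QK8_0]) with QK8_0 == 32.
#include <arm_sve.h>
#include <math.h>
#include <stdint.h>

static void quantize_row_q8_0_sve_sketch(const float * x, block_q8_0 * y, int64_t k) {
    const int nb = k / QK8_0;      // number of 32-element blocks
    const int vl = svcntw();       // 32-bit lanes per SVE vector (VL-dependent)

    for (int i = 0; i < nb; i++) {
        const float * xb = x + i * QK8_0;

        // absolute maximum of the block
        float amax = 0.0f;
        for (int j = 0; j < QK8_0; j += vl) {
            const svbool_t pg = svwhilelt_b32(j, QK8_0);
            const svfloat32_t v = svld1_f32(pg, xb + j);
            amax = fmaxf(amax, svmaxv_f32(pg, svabs_f32_x(pg, v)));
        }

        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = GGML_FP32_TO_FP16(d);

        // scale, round to nearest, narrow to int8
        for (int j = 0; j < QK8_0; j += vl) {
            const svbool_t pg = svwhilelt_b32(j, QK8_0);
            svfloat32_t v = svld1_f32(pg, xb + j);
            v = svmul_n_f32_x(pg, v, id);      // x * (1/d)
            v = svrinta_f32_x(pg, v);          // round to nearest, ties away (matches roundf)
            const svint32_t q = svcvt_s32_f32_x(pg, v);
            svst1b_s32(pg, y[i].qs + j, q);    // store low byte of each active lane
        }
    }
}
```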

Performance

Performance is nearly the same before and after this PR; the change introduces an SVE intrinsic implementation for the Int8 (Q8_0) path used by Mamba models while maintaining comparable throughput.

Task 1: Prompt Length: 128 tokens, Generated Tokens: 1 token

| Threads | Baseline (pre-PR) (tokens/sec) | This PR (SVE) (tokens/sec) |
|---------|-------------------------------|----------------------------|
| 8       | 28.03                         | 27.78                      |
| 16      | 52                            | 51.61                      |
| 32      | 94.46                         | 93.54                      |
| 64      | 150.05                        | 153.04                     |

Task 2: Prompt Length: 1024 tokens, Generated Tokens: 1 token

| Threads | Baseline (pre-PR) (tokens/sec) | This PR (SVE) (tokens/sec) |
|---------|-------------------------------|----------------------------|
| 8       | 27.45                         | 27.16                      |
| 16      | 49.84                         | 49.47                      |
| 32      | 87.59                         | 86.78                      |
| 64      | 141.09                        | 141.03                     |

Task 3: Prompt Length: 8192 tokens, Generated Tokens: 1 token

| Threads | Baseline (pre-PR) (tokens/sec) | This PR (SVE) (tokens/sec) |
|---------|-------------------------------|----------------------------|
| 8       | 27.49                         | 27.21                      |
| 16      | 51.14                         | 50.61                      |
| 32      | 89.4                          | 88.52                      |
| 64      | 141.97                        | 141.3                      |

The command used to measure performance:

```
./build/bin/llama-bench -m falcon-mamba-7b-q8_0.gguf -t 8,16,32,64 -p 128,1024,8192 -n 0
```

Perplexity

There is no change in model accuracy as a result of this PR; a summary is given below.

| Baseline (pre-PR)  | This PR (SVE)      |
|--------------------|--------------------|
| 7.6508 +/- 0.67260 | 7.6508 +/- 0.67260 |
Command:

```
./build/bin/llama-perplexity -s 0 -np 128 -t 64 -m falcon-mamba-7b-q8_0.gguf -c 128 -b 128 --chunks 16 -f scripts/wikitext-2-raw/wiki.test.raw
```

@Vithulep Vithulep changed the title ggml: aarch64: Implement SVE Q8 kernels for vector functions ggml: aarch64: Implement SVE Int 8 Quantization kernels Jun 11, 2025
@Vithulep Vithulep changed the title ggml: aarch64: Implement SVE Int 8 Quantization kernels ggml: aarch64: Implement SVE Kernels for Int 8 Quantization Jun 11, 2025
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jun 11, 2025
@Vithulep
Contributor Author

The CI failures seem CMake-related and are also occurring in other PRs. Since I haven’t modified any CMake files, they don’t appear to be caused by this PR.

@compilade
Collaborator

compilade commented Jun 13, 2025

> they don’t appear to be caused by this PR.

@Vithulep There seems to be one failure related to Q8_0 quants in one of the ARM runners for the test-quantize-fns test, see https://github.com/ggml-org/llama.cpp/actions/runs/15626597321/job/44021925255?pr=14117#step:6:21067

q8_0 reference implementation error: FAILED (0.000175)

Not sure if it's directly related, but it might be.

@@ -340,20 +340,37 @@ void dequantize_row_q5_1(const block_q5_1 * GGML_RESTRICT x, float * GGML_RESTRI
}
}

// SVE Support added for Scalar Implementation
Collaborator

Dequantization from Q8_0 is not performed during inference, so it's not as time-sensitive (hence the plain scalar code, which is also simpler to maintain).
Only quantization of the intermediate tensors that are matrix-multiplied with types having Q8_0 as their vec_dot_type is exercised by the perplexity and speed benchmarks you've shared.

Did you test the dequantization changes for correctness outside of inference?

Contributor Author

The dequantize_row_q8_0() kernel is called during inference, but only a very small number of times: one call per generated token. Hence its effect is not visible in the speedup.

Collaborator

> The dequantize_row_q8_0() kernel is called during inference, but only a very small number of times: one call per generated token.

Ah yes, I had forgotten that ggml_get_rows dequantizes what it extracts. It's used when converting tokens to embeddings at the beginning of the model graph. It's not really a bottleneck, though.

Thanks, I was wrong about it not being called.
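
For context on the thread above, here is a minimal sketch of what an SVE variant of `dequantize_row_q8_0()` can look like. As with the quantization sketch earlier, this is illustrative rather than the PR's actual code (the function name is hypothetical), and it assumes ggml's `QK8_0`, `block_q8_0`, and `GGML_FP16_TO_FP32` definitions.

```c
// Hypothetical sketch, not the PR's code: Q8_0 -> float dequantization with SVE.
#include <arm_sve.h>
#include <stdint.h>

static void dequantize_row_q8_0_sve_sketch(const block_q8_0 * x, float * y, int64_t k) {
    const int nb = k / QK8_0;                        // number of 32-element blocks
    const int vl = svcntw();                         // 32-bit lanes per SVE vector

    for (int i = 0; i < nb; i++) {
        const float d = GGML_FP16_TO_FP32(x[i].d);   // per-block scale

        for (int j = 0; j < QK8_0; j += vl) {
            const svbool_t pg = svwhilelt_b32(j, QK8_0);
            const svint32_t   q = svld1sb_s32(pg, x[i].qs + j);   // sign-extend int8 -> int32
            const svfloat32_t v = svcvt_f32_s32_x(pg, q);
            svst1_f32(pg, y + i * QK8_0 + j, svmul_n_f32_x(pg, v, d));
        }
    }
}
```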

tjohnman and others added 5 commits June 17, 2025 15:37
… (ggml-org#330)

* Check for reverse prompt by characters instead of tokens (ggml-org#292)

* Update main.cpp

Wording.

* Cleanup.

* Remove unnecessary use of std::stringstream.

---------

Co-authored-by: Johnman <tjohnman@github>
Co-authored-by: Georgi Gerganov <[email protected]>
@wenlujon

Just curious: since there's no performance gain (it even seems like a slight drop) from adding the SVE version, why replace the NEON version?
