Releases · ggml-org/llama.cpp
b6794
b6793
CUDA: use registers instead of smem in topk-moe (#16647)

Uses the same technique as the Vulkan PR #16641. Neat trick!
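For context, a minimal sketch of the register-only pattern this entry refers to, not the llama.cpp kernel itself: a single warp keeps its running best expert in registers and reduces with shuffle intrinsics, so no `__shared__` buffer is needed. The names `expert_logits`, `n_experts`, and the argmax (top-1) scope are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cmath>

__global__ void warp_argmax(const float * expert_logits, int n_experts,
                            int * best_idx, float * best_val) {
    const int lane = threadIdx.x;                 // single warp: 32 lanes
    float v = -INFINITY;
    int   i = -1;

    // each lane scans a strided slice; its running best stays in registers
    for (int e = lane; e < n_experts; e += 32) {
        if (expert_logits[e] > v) { v = expert_logits[e]; i = e; }
    }

    // butterfly reduction across the warp via shuffles: no __shared__ buffer
    for (int off = 16; off > 0; off >>= 1) {
        const float ov = __shfl_xor_sync(0xffffffff, v, off);
        const int   oi = __shfl_xor_sync(0xffffffff, i, off);
        if (ov > v) { v = ov; i = oi; }
    }

    if (lane == 0) { *best_idx = i; *best_val = v; }
}

int main() {
    const float h_logits[8] = {0.1f, 0.7f, 0.3f, 0.9f, 0.2f, 0.5f, 0.05f, 0.4f};
    float *d_logits = nullptr, *d_val = nullptr;
    int   *d_idx = nullptr;
    cudaMalloc((void **) &d_logits, sizeof(h_logits));
    cudaMalloc((void **) &d_val,    sizeof(float));
    cudaMalloc((void **) &d_idx,    sizeof(int));
    cudaMemcpy(d_logits, h_logits, sizeof(h_logits), cudaMemcpyHostToDevice);

    warp_argmax<<<1, 32>>>(d_logits, 8, d_idx, d_val);

    int idx; float val;
    cudaMemcpy(&idx, d_idx, sizeof(int),   cudaMemcpyDeviceToHost);
    cudaMemcpy(&val, d_val, sizeof(float), cudaMemcpyDeviceToHost);
    printf("best expert: %d (logit %.2f)\n", idx, val);   // expected: 3 (0.90)

    cudaFree(d_logits); cudaFree(d_val); cudaFree(d_idx);
    return 0;
}
```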
b6792
opencl: transposed gemm/gemv moe kernel with mxfp4,f32 (#16602)

* opencl: transposed gemm/gemv moe kernel with mxfp4,f32
* add restore kernel for moe transpose
* fix trailing whitespaces
* resolve compilation warnings
b6791
llama-model: fix inconsistent ctxs <-> bufs order (#16581)
b6790
rpc : report actual free memory (#16616)

* rpc : report actual free memory

  Start reporting the free memory on every device instead of using fixed values. Now llama-cli users can get a nice memory breakdown when using RPC devices.

* drop --mem in rpc-server
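A hedged, standalone sketch of the idea rather than the rpc-server patch itself: querying a device's actual free memory at runtime instead of assuming a fixed figure, shown here for a CUDA device. The rpc-server reports whatever its local ggml backend exposes, so treat this only as an analogy.

```cpp
// Minimal sketch, not the rpc-server code: report real free VRAM,
// not a hard-coded value.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed\n");
        return 1;
    }
    printf("free: %zu MiB / total: %zu MiB\n",
           free_bytes / (1024 * 1024), total_bytes / (1024 * 1024));
    return 0;
}
```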
b6789
vulkan: Add State Space Model (SSM) Operations Support (#16463)

* vulkan: implement SSM scan operation

  Add State Space Model scan operation to the Vulkan backend.

* vulkan: implement SSM conv operation

  Add State Space Model conv operation to the Vulkan backend.

Signed-off-by: Giuseppe Scrivano <[email protected]>
b6788
ggml : fix SpaceMit IME array out-of-bounds in task assignment (#16629)

Fix incorrect task-to-batch index calculation in the quantization phase.

The bug caused out-of-bounds access to the qnbitgemm_args array when compute_idx exceeded per_gemm_block_count_m, leading to invalid pointer dereferences and SIGBUS errors.

Correctly map tasks to batches by dividing compute_idx by per_gemm_block_count_m instead of block_size_m.

Example: batch_feature=1, gemm_m=30, block_size_m=4
per_gemm_block_count_m = 8, task_count = 8

Old: gemm_idx = 4/4 = 1 (out of bounds)
New: gemm_idx = 4/8 = 0 (correct)

Tested on SpaceMit K1 RISC-V64 with qwen2.5:0.5b model.

Co-authored-by: muggle <[email protected]>
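The fix comes down to one divisor in the task-to-batch mapping. A standalone sketch of the arithmetic from the example above; variable names follow the commit message, and the surrounding loop is illustrative rather than the actual kernel code.

```cpp
// Worked example of the index fix described above (plain host code).
// With batch_feature = 1 the only valid gemm_idx is 0.
#include <cstdio>

int main() {
    const int gemm_m       = 30;
    const int block_size_m = 4;
    const int per_gemm_block_count_m = (gemm_m + block_size_m - 1) / block_size_m; // 8
    const int task_count   = per_gemm_block_count_m;                               // 8

    for (int compute_idx = 0; compute_idx < task_count; ++compute_idx) {
        const int old_gemm_idx = compute_idx / block_size_m;            // buggy: reaches 1
        const int new_gemm_idx = compute_idx / per_gemm_block_count_m;  // fixed: always 0
        printf("task %d -> old %d, new %d\n", compute_idx, old_gemm_idx, new_gemm_idx);
    }
    return 0;
}
```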
b6786
vulkan: fix debug build (add_rms_len/data not found) (#16624)
b6785
metal : add `CONV_TRANSPOSE_2D` (#16542)

* initial: headers and metal-device.cpp updates
* adding conv_transpose_2d
* fix type
* fix type: int32->int64
* Update ggml/src/ggml-metal/ggml-metal.metal
* add checks for src[0] and src[1]; add type checks
* Update ggml-metal.metal
* add more tests, add optimization to threading
* add dynamic memory allocation in metal

Co-authored-by: Georgi Gerganov <[email protected]>
b6784
grammar : use int64_t to avoid int overflows in int schema to grammar…
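A hedged illustration of the class of bug this title refers to: when an integer JSON-schema bound is near or beyond the 32-bit range, arithmetic in plain int overflows, while int64_t stays exact. The schema value and the check below are illustrative, not the actual schema-to-grammar code.

```cpp
// Illustrative only: why 32-bit int is not enough for schema integer bounds.
#include <cstdint>
#include <cstdio>
#include <limits>

int main() {
    // e.g. "maximum": 3000000000 in a JSON schema
    const int64_t schema_bound = 3000000000LL;
    const bool fits_in_int32 =
        schema_bound >= std::numeric_limits<int32_t>::min() &&
        schema_bound <= std::numeric_limits<int32_t>::max();
    printf("bound %lld fits in int32: %s\n",
           (long long) schema_bound, fits_in_int32 ? "yes" : "no"); // no -> use int64_t
    return 0;
}
```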