@19h 19h commented Jun 21, 2023

No description provided.

ggerganov and others added 30 commits September 22, 2025 10:58
* contrib : update roles

* contrib : merge PR sections + add link to CI instructions

Updated pull request guidelines for contributors and collaborators, and clarified merging practices for maintainers.
…#16124)

* claim responsibility for ci, gguf-py and convert

* add myself to various src/llama- files
* Vulkan: add conv_transpose_2d operation

* Vulkan: fix typo in conv_transpose_2d shader (s0mp, s0L, s1mp, s1L)

* Vulkan: fix incorrect indentation in conv_transpose_2d shader

* Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation

* Vulkan: revert the order of the index calculation and bound check in conv_2d shader

* Vulkan: explicitly check push constants limit in supports_op() for conv_transpose_2d operation.

* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.
* ggml : add ggml_op_is_empty

* ggml : move to ggml-impl.h
* ggml : extend ggml_can_fuse to work with non-sequential nodes in the graph

* cont : fix wrong bounds check condition

* cont : remove unnecessary overload
These two local variables 'arg' and 'arg_prefix' are shadowed by:

  1. for (const auto & arg : opt.args)

  2. for (int i = 1; i < argc; i++) {
        const std::string arg_prefix = "--";

        std::string arg = argv[i];
* common : use the json parser

Signed-off-by: Adrien Gallouët <[email protected]>

* common : enable --offline mode without CURL support

This change refactors the download logic to properly support offline mode
even when the project is built without CURL.

Without this commit, using `--offline` would give the following error:

    error: built without CURL, cannot download model from the internet

even if all the files are already cached.

Signed-off-by: Adrien Gallouët <[email protected]>

---------

Signed-off-by: Adrien Gallouët <[email protected]>
* implement set_rows with i32 index

* template fix

* test quantized path

warnings--

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <[email protected]>

* forgotten name change

* deduplicate cuda/sycl and test-fix

* indent++

* vulkan: support set_rows with i32 index type (#16162)

* disable i32 index for webgpu for now

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Jeff Bolz <[email protected]>
Disable 'performance-enum-size' checking:

Enum 'llama_token_type' uses a larger base type ('unsigned int', size: 4 bytes)
than necessary for its value set, consider using 'std::uint8_t' (1 byte) as the
base type to reduce its size.
…n) (#16177)

This is a configuration of the hparams in the GraniteHybrid architecture
that devolves to the Granite (or GraniteMoe) architecture (i.e. Granite 3.x).
It may be used for some models in the Granite 4 family with the
GraniteHybrid architecture acting as a superset arch. Rather than support
it directly in the C++ graph, we simply coerce the architecture flag back
to the correct "granite" or "granitemoe" architecture.

Branch: gabe-l-hart/GraniteNonHybridConversion

Signed-off-by: Gabe Goodhart <[email protected]>

Co-authored-by: Sigbjørn Skjæret <[email protected]>
* devops: add s390x dockerfile

Signed-off-by: Aaron Teo <[email protected]>

* devops: add missing ninja

Signed-off-by: Aaron Teo <[email protected]>

* devops: move s390x docker into cpu docker

Signed-off-by: Aaron Teo <[email protected]>

* devops: rework s390x docker

Signed-off-by: Aaron Teo <[email protected]>

* devops: copy more tools

Signed-off-by: Aaron Teo <[email protected]>

* devops: add server build step

Signed-off-by: Aaron Teo <[email protected]>

* devops: remove apt clean steps as distroless misses it

Signed-off-by: Aaron Teo <[email protected]>

* devops: remove apt commands from distroless

Signed-off-by: Aaron Teo <[email protected]>

* devops: fix shared libs in distroless

Signed-off-by: Aaron Teo <[email protected]>

* devops: use correct libs path

Signed-off-by: Aaron Teo <[email protected]>

* devops: fix shared libs

Signed-off-by: Aaron Teo <[email protected]>

* devops: add collector stage

Signed-off-by: Aaron Teo <[email protected]>

* devops: fix missing stage ref

Signed-off-by: Aaron Teo <[email protected]>

* devops: fix permission issue

Signed-off-by: Aaron Teo <[email protected]>

* devops: fix unknown model loading failures

Signed-off-by: Aaron Teo <[email protected]>

* devops: attempt at fixing model loading failure

Signed-off-by: Aaron Teo <[email protected]>

* devops: fix missing ggml shared object

failure to load model

Signed-off-by: Aaron Teo <[email protected]>

* devops: remove move shared objects

Signed-off-by: Aaron Teo <[email protected]>

* devops: move libggml-cpu and blas into bin

Signed-off-by: Aaron Teo <[email protected]>

* devops: finalise hardened server stage

Signed-off-by: Aaron Teo <[email protected]>

* devops: add cli target

Signed-off-by: Aaron Teo <[email protected]>

* devops: fix typos

Signed-off-by: Aaron Teo <[email protected]>

* devops: fix missing shared libraries in base

Signed-off-by: Aaron Teo <[email protected]>

* devops: update debian target

Signed-off-by: Aaron Teo <[email protected]>

* devops: formalise llama.cpp loc

Signed-off-by: Aaron Teo <[email protected]>

* Revert "devops: formalise llama.cpp loc"

This reverts commit 0a7664a.

Signed-off-by: Aaron Teo <[email protected]>

* devops: formalise llama.cpp loc

Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 0a7664a)
Signed-off-by: Aaron Teo <[email protected]>

* devops: attempt at fixing missing dir

Signed-off-by: Aaron Teo <[email protected]>

* devops: attempt at making it cache the build

Signed-off-by: Aaron Teo <[email protected]>

* devops: fix copying process

Signed-off-by: Aaron Teo <[email protected]>

* devops: make build dir an argument

Signed-off-by: Aaron Teo <[email protected]>

* Revert "devops: make build dir an argument"

This reverts commit 4386989.

Signed-off-by: Aaron Teo <[email protected]>

* devops: add build stage for gguf-py

Signed-off-by: Aaron Teo <[email protected]>

* devops: move gguf-py installation into build stage

Signed-off-by: Aaron Teo <[email protected]>

* devops: break system packages?

Signed-off-by: Aaron Teo <[email protected]>

* devops: add rust compiler installer

Signed-off-by: Aaron Teo <[email protected]>

* devops: fix rustc not found

Signed-off-by: Aaron Teo <[email protected]>

* devops: remove cache mount to allow rustc to persist

Signed-off-by: Aaron Teo <[email protected]>

* devops: move rustc installation to another layer

Signed-off-by: Aaron Teo <[email protected]>

* devops: move gguf-py installation to full stage, fix copying

Signed-off-by: Aaron Teo <[email protected]>

* devops: remove rustc installation in build

Signed-off-by: Aaron Teo <[email protected]>

* devops: disable full target for now

Signed-off-by: Aaron Teo <[email protected]>

* devops: attempting static build

Signed-off-by: Aaron Teo <[email protected]>

* devops: merge s390x dockerfile into cpu for now

Signed-off-by: Aaron Teo <[email protected]>

* devops: switch to gcc image for build step

Signed-off-by: Aaron Teo <[email protected]>

* devops: remove build essentials

Signed-off-by: Aaron Teo <[email protected]>

* devops: install openblas into base target

Signed-off-by: Aaron Teo <[email protected]>

* devops: go back to s390x dockerfile

Signed-off-by: Aaron Teo <[email protected]>

* devops: remove libggml and libblas

Signed-off-by: Aaron Teo <[email protected]>

* devops: add full target

Signed-off-by: Aaron Teo <[email protected]>

* devops: add break system packages

Signed-off-by: Aaron Teo <[email protected]>

* devops: add libjpeg

Signed-off-by: Aaron Teo <[email protected]>

* devops: add missing cmake dep

Signed-off-by: Aaron Teo <[email protected]>

* devops: finalise docker images for s390x

Signed-off-by: Aaron Teo <[email protected]>

* devops: add custom openblas patch

Signed-off-by: Aaron Teo <[email protected]>

* devops: use libopenblas-dev instead of libopenblas-openmp-dev

Signed-off-by: Aaron Teo <[email protected]>

* devops: add s390x docker build

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
This commit adds examples/model-conversion/ to the CODEOWNERS file and
assigns myself (@danbev) as the code owner for this directory.
* zdnn: initial matmul refactor

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: rm static from funcs

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: update ggml-zdnn.h

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: change header files to hpp

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: switch to common.hpp

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: move mulmat forward around

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: rm inline from utils

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: code cleanup

Signed-off-by: Aaron Teo <[email protected]>

* docs: add zDNN docs

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
)

* fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl

* change initialization to true
* ci : disable AMD workflows + update NVIDIA workflows

* cont : fixes

* cont : update nvidia vulkan workflows
Fix two incorrect make targets in the readme.

Signed-off-by: Jie Fu <[email protected]>
This commit adds a leading slash to the paths of root-level files
in the CODEOWNERS file.

The motivation for this is that, without the leading slash, these patterns
might also match files in subdirectories and override the other/additional
owners defined for those directories.

Refs: #16209 (comment)
* model : add label for LiquidAI LFM2-2.6B model

HF link: [LiquidAI/LFM2-2.6B](https://huggingface.co/LiquidAI/LFM2-2.6B).

Support for GGUF conversion and inference is added in #14620.

However, due to a similar `n_embd`, it was identified as a 1.2B model.
Fix the label by using `n_ff` to identify the model instead.

Output of `llama-bench`:
```
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| lfm2 1.2B F16                  |   2.18 GiB |     1.17 B | CPU        |      10 |           pp512 |        223.97 ± 5.32 |
| lfm2 2.6B F16                  |   4.79 GiB |     2.57 B | CPU        |      10 |           pp512 |         92.53 ± 4.14 |
| lfm2 350M F16                  | 676.25 MiB |   354.48 M | CPU        |      10 |           pp512 |       725.52 ± 11.70 |
| lfm2 700M F16                  |   1.38 GiB |   742.49 M | CPU        |      10 |           pp512 |       336.22 ± 12.93 |
```

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
…15815)

* ggml : make gallocr respect the backend's max buffer size

* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max allocation size in buffer type interface

* fix missing newline, apple-clang warning

* track size of individual chunks in ggml_dyn_tallocr and raise max chunks.
revert to use suballocation_block_size as max chunk size for vulkan.

* track (chunk, offset) pairs instead of "global" offsets through gallocr.

* simpler, don't need loops to map between local/global offsets
* touches more code

* fix dyn_tallocr_max_size and initialization

* fix memory leak when buffers are reused due to same buffer type appearing multiple times

* make vbuffer allocation follow the same logic as backend_buffer did before

* continue to use leftover unallocated space of previous chunks after a new one has been created

* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size

* refactor: move adding new free block and new chunk into separate functions

* allocate chunks individually with a separate free-blocks list for each one

* needs a bit more memory/allocations/indirections, but code is simpler

* fix warnings (missing static) & debug checks
am17an and others added 30 commits October 14, 2025 13:16
* CUDA: use fastdiv + ggml_cuda_mad for mmvf

* use bf16 directly + fix formatting

* Add exception for HIP code
Enable CMP0147 so custom build steps (invoking vulkan-shader-gen) are run in parallel.

Enable /MP so source files are compiled in parallel.
Signed-off-by: Stefan Savic <[email protected]>
Co-authored-by: Stefan Savic <[email protected]>
* metal : avoid using Metal's gpuAddress property

* metal : fix rope kernels buffer check
* CUDA set scheduling strategy to spinning for cc121

* Using prop.major and prop.minor, include HIP and MUSA

* Exclude HIP and MUSA

* Remove trailing whitespace

Co-authored-by: Johannes Gäßler <[email protected]>

* Remove empty line

Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>
* llama-quant: add support for mmproj

* Update src/llama.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* check prefix instead

* small fix

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* optimise GGML_OP_SUM

* add non-contiguous tests by permuting the input

* change tests to require full contiguity of OP_SUM

* cuda : add check GGML_OP_SUM

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* opencl: add mm_q8_0_f32

* opencl: fix data loading for incomplete tile

* opencl: use q8_0 mm for larger matrix

* opencl: add some tests to cover the path
* CPU: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators

- Added the operators to unary op enum
- Implemented API functions
- Implemented forward and unary-op logic in CPU backend
- Updated ggml_get_n_tasks
- Updated operators names array and static_assert
- Updated docs and enabled automatic tests

* docs: add documentation for ggml_trunc and ggml_trunc_inplace in ggml.h

* chore: remove trailing whitespace from ggml.h

* Remove unresolved merge markers

* Apply review suggestions: cleanup formatting, enum order and leftover artifacts

* Regenerate ops.md using create_ops_docs.py
BF16 requires special handling in this script:
it is 2-byte data, but the view is 1-byte by default.
Switch to the correct view before attempting byteswapping.

With this change, correctly byteswapping models like
Meta-Llama-3-8B-Instruct-bf16-GGUF should be possible.
* SYCL: Add GGML_OP_MEAN operator support

* SYCL: Fix formatting for GGML_OP_MEAN case

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
## Why it failed

When compiling with strict compiler flags (-Wwrite-strings -Werror=discarded-qualifiers),
the build fails with the following error:

```
cmake \
  -S . \
  -B ../llama.cpp.build \
  --preset=x64-linux-gcc-debug \
  -DCMAKE_INSTALL_PREFIX=/tmp/local \
  -DCMAKE_C_FLAGS="-Wwrite-strings -Werror=discarded-qualifiers" && \
cmake --build ../llama.cpp.build/
...
/home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c: In function ‘ggml_cpu_init’:
/home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:3572:24: error: passing argument 1 of ‘putenv’ discards ‘const’ qualifier from pointer target type [-Werror=discarded-qualifiers]
 3572 |                 putenv("KMP_BLOCKTIME=200"); // 200ms
      |                        ^~~~~~~~~~~~~~~~~~~
In file included from /home/otegami/work/cpp/llama.cpp/ggml/src/./ggml-impl.h:10,
                 from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu-impl.h:6,
                 from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/traits.h:3,
                 from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:6:
/usr/include/stdlib.h:786:26: note: expected ‘char *’ but argument is of type ‘const char *’
  786 | extern int putenv (char *__string) __THROW __nonnull ((1));
      |                    ~~~~~~^~~~~~~~
cc1: some warnings being treated as errors
ninja: build stopped: subcommand failed.
```

The issue is that putenv() expects a non-const char * but receives a string literal (const char *).

## How to fix

This PR replaces putenv("KMP_BLOCKTIME=200") with setenv("KMP_BLOCKTIME", "200", 0).

Benefits of setenv():
- Accepts const char * parameters (no qualifier warnings)
- Makes copies of the strings (safer memory handling)
- The third parameter (0) ensures we don't overwrite if already set
* Update the docs on -t --threads

* Revert "Update the docs on -t --threads"

This reverts commit eba9734.

* docs: clarify -t/--threads parameter uses CPU threads and defaults to all available cores

* Update arg.cpp
This commit applies .clang-format rules to all source files under the
ggml-cann directory to ensure consistent coding style and readability.
The .clang-format option `SortIncludes: false` has been set to disable
automatic reordering of include directives.
No functional changes are introduced.

Co-authored-by: hipudding <[email protected]>
* SYCL: update element-wise ops and presets

* clean arange

* Re-trigger CI

---------

Co-authored-by: Gitty Burstein <[email protected]>
…iters (#16599)

* fix: added a normalization step for MathJax-style \[\] and \(\) delimiters

So inline and block equations are converted before KaTeX rendering,
enabling proper display of model-generated LaTeX in the WebUI

* chore: update webui build output
* SYCL/SET: implement operator + wire-up; docs/ops updates; element_wise & ggml-sycl changes

* sycl(SET): re-apply post-rebase; revert manual docs/ops.md; style cleanups

* move SET op to standalone file, GPU-only implementation

* Update SYCL SET operator for F32

* ci: fix editorconfig issues (LF endings, trailing spaces, final newline)

* fixed ggml-sycl.cpp

---------

Co-authored-by: Gitty Burstein <[email protected]>
* initial: headers and metal-device.cpp updates

* adding conv_transpose_2d

* fix type

* fix type: int32->int64

* Update ggml/src/ggml-metal/ggml-metal.metal

Co-authored-by: Georgi Gerganov <[email protected]>

* Update ggml/src/ggml-metal/ggml-metal.metal

Co-authored-by: Georgi Gerganov <[email protected]>

* Update ggml/src/ggml-metal/ggml-metal.metal

Co-authored-by: Georgi Gerganov <[email protected]>

* add checks for src[0] and src[1]; add type checks

* Update ggml-metal.metal

Co-authored-by: Georgi Gerganov <[email protected]>

* add more tests, add optimization to threading

* add dynamic memory allocation in metal

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* webui: reorganize settings layout

* chore: update webui build output

* fix: remove unused variable

* chore: update webui build output
Fix incorrect task-to-batch index calculation in the quantization phase.

The bug caused out-of-bounds access to qnbitgemm_args array when
compute_idx exceeded per_gemm_block_count_m, leading to invalid
pointer dereferences and SIGBUS errors.

Correctly map tasks to batches by dividing compute_idx by
per_gemm_block_count_m instead of block_size_m.

Example:
  batch_feature=1, gemm_m=30, block_size_m=4
  per_gemm_block_count_m = 8, task_count = 8

  Old: gemm_idx = 4/4 = 1 (out of bounds)   New: gemm_idx = 4/8 = 0 (correct)

Tested on SpaceMit K1 RISC-V64 with qwen2.5:0.5b model.

Co-authored-by: muggle <[email protected]>