forked from cmp-nct/ggllm.cpp
Float #1
Open · 19h wants to merge 6,086 commits into 19h:master from ggml-org:master
+689,118 −53,734
Conversation
* contrib : update roles
* contrib : merge PR sections + add link to CI instructions

Updated pull request guidelines for contributors and collaborators, and clarified merging practices for maintainers.
…#16124)

* claim responsibility for ci, gguf-py and convert
* add myself to various src/llama- files
* Vulkan: add conv_transpose_2d operation
* Vulkan: fix typo in conv_transpose_2d shader (s0mp, s0L, s1mp, s1L)
* Vulkan: fix incorrect indentation in conv_transpose_2d shader
* Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for the conv_transpose_2d operation
* Vulkan: revert the order of the index calculation and bound check in conv_2d shader
* Vulkan: explicitly check push constants limit in supports_op() for the conv_transpose_2d operation
* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader
* ggml : add ggml_op_is_empty
* ggml : move to ggml-impl.h
* ggml : extend ggml_can_fuse to work with non-sequential nodes in the graph
* cont : fix wrong bounds check condition
* cont : remove unnecessary overload
These two local variables `arg` and `arg_prefix` are shadowed by:

1. `for (const auto & arg : opt.args)`
2. `for (int i = 1; i < argc; i++) { const std::string arg_prefix = "--"; std::string arg = argv[i];`
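A minimal sketch of the pattern being flagged (the enclosing function and types are reconstructed for illustration, not taken from the actual source):

```cpp
#include <string>
#include <vector>

// 'arg' and 'arg_prefix' declared in the outer scope are never the
// variables actually used below; each loop declares its own copy
void parse_args(int argc, char ** argv, const std::vector<std::string> & opt_args) {
    std::string arg;
    std::string arg_prefix;

    for (const auto & arg : opt_args) {       // shadows the outer 'arg'
        (void) arg;
    }
    for (int i = 1; i < argc; i++) {
        const std::string arg_prefix = "--";  // shadows the outer 'arg_prefix'
        std::string arg = argv[i];            // shadows the outer 'arg'
        (void) arg_prefix;
        (void) arg;
    }
}
```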
* common : use the json parser

Signed-off-by: Adrien Gallouët <[email protected]>

* common : enable --offline mode without CURL support

This change refactors the download logic to properly support offline mode even when the project is built without CURL. Without this commit, using `--offline` would give the following error:

    error: built without CURL, cannot download model from the internet

even if all the files are already cached.

Signed-off-by: Adrien Gallouët <[email protected]>

---------

Signed-off-by: Adrien Gallouët <[email protected]>
---------

Co-authored-by: slaren <[email protected]>
* implement set_rows with i32 index
* template fix
* test quantized path, warnings--
* Apply suggestions from code review
* forgotten name change
* deduplicate cuda/sycl and test-fix
* indent++
* vulkan: support set_rows with i32 index type (#16162)
* disable i32 index for webgpu for now

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Jeff Bolz <[email protected]>
Disable 'performance-enum-size' checking:

    Enum 'llama_token_type' uses a larger base type ('unsigned int', size: 4 bytes) than necessary for its value set, consider using 'std::uint8_t' (1 byte) as the base type to reduce its size.
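For context, the change the check asks for would look like this sketch (hypothetical, since the PR disables the check rather than applying it; enumerator names follow the pattern in llama.h):

```cpp
#include <cstdint>

// what 'performance-enum-size' suggests: give the enum an explicit
// 1-byte base type so it no longer defaults to 'unsigned int'
enum llama_token_type : std::uint8_t {
    LLAMA_TOKEN_TYPE_UNDEFINED = 0,
    LLAMA_TOKEN_TYPE_NORMAL    = 1,
    LLAMA_TOKEN_TYPE_UNKNOWN   = 2,
    LLAMA_TOKEN_TYPE_CONTROL   = 3,
};
```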
…n) (#16177)

This is a configuration of the hparams in the GraniteHybrid architecture that devolves to the Granite (or GraniteMoe) architecture (i.e. Granite 3.x). It may be used for some models in the Granite 4 family, with the GraniteHybrid architecture acting as a superset arch. Rather than support it directly in the C++ graph, we simply coerce the architecture flag back to the correct "granite" or "granitemoe" architecture.

Branch: gabe-l-hart/GraniteNonHybridConversion

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* devops: add s390x dockerfile
* devops: add missing ninja
* devops: move s390x docker into cpu docker
* devops: rework s390x docker
* devops: copy more tools
* devops: add server build step
* devops: remove apt clean steps as distroless misses it
* devops: remove apt commands from distroless
* devops: fix shared libs in distroless
* devops: use correct libs path
* devops: fix shared libs
* devops: add collector stage
* devops: fix missing stage ref
* devops: fix permission issue
* devops: fix unknown model loading failures
* devops: attempt at fixing model loading failure
* devops: fix missing ggml shared object failure to load model
* devops: remove move shared objects
* devops: move libggml-cpu and blas into bin
* devops: finalise hardened server stage
* devops: add cli target
* devops: fix typos
* devops: fix missing shared libraries in base
* devops: update debian target
* devops: formalise llama.cpp loc
* Revert "devops: formalise llama.cpp loc" (reverts commit 0a7664a)
* devops: formalise llama.cpp loc (cherry picked from commit 0a7664a)
* devops: attempt at fixing missing dir
* devops: attempt at making it cache the build
* devops: fix copying process
* devops: make build dir an argument
* Revert "devops: make build dir an argument" (reverts commit 4386989)
* devops: add build stage for gguf-py
* devops: move gguf-py installation into build stage
* devops: break system packages?
* devops: add rust compiler installer
* devops: fix rustc not found
* devops: remove cache mount to allow rustc to persist
* devops: move rustc installation to another layer
* devops: move gguf-py installation to full stage, fix copying
* devops: remove rustc installation in build
* devops: disable full target for now
* devops: attempting static build
* devops: merge s390x dockerfile into cpu for now
* devops: switch to gcc image for build step
* devops: remove build essentials
* devops: install openblas into base target
* devops: go back to s390x dockerfile
* devops: remove libggml and libblas
* devops: add full target
* devops: add break system packages
* devops: add libjpeg
* devops: add missing cmake dep
* devops: finalise docker images for s390x
* devops: add custom openblas patch
* devops: use libopenblas-dev instead of libopenblas-openmp-dev
* devops: add s390x docker build

---------

Signed-off-by: Aaron Teo <[email protected]>
This commit adds examples/model-conversion/ to the CODEOWNERS file and assigns myself (@danbev) as the code owner for this directory.
* zdnn: initial matmul refactor
* ggml-zdnn: rm static from funcs
* ggml-zdnn: update ggml-zdnn.h
* ggml-zdnn: change header files to hpp
* ggml-zdnn: switch to common.hpp
* ggml-zdnn: move mulmat forward around
* ggml-zdnn: rm inline from utils
* ggml-zdnn: code cleanup
* docs: add zDNN docs

---------

Signed-off-by: Aaron Teo <[email protected]>
* ci : disable AMD workflows + update NVIDIA workflows
* cont : fixes
* cont : update nvidia vulkan workflows
Fix two incorrect make targets in the readme. Signed-off-by: Jie Fu <[email protected]>
This commit adds a leading slash to the paths of root-level files in the CODEOWNERS file. The motivation is that, without the leading slash, these patterns might also match files in subdirectories and override their other/additional owners.

Refs: #16209 (comment)
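A hypothetical example of the difference (the path and handle below are illustrative, not the actual CODEOWNERS entries):

```
# without a leading slash this pattern also matches the same file name
# anywhere in the tree, e.g. examples/foo/CMakeLists.txt
CMakeLists.txt   @some-owner

# anchored to the repository root: only the top-level file matches
/CMakeLists.txt  @some-owner
```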
Signed-off-by: Jie Fu <[email protected]>
Signed-off-by: Uilian Ries <[email protected]>
…ontaining "." (#16215) Signed-off-by: Jie Fu <[email protected]>
* model : add label for LiquidAI LFM2-2.6B model

HF link: [LiquidAI/LFM2-2.6B](https://huggingface.co/LiquidAI/LFM2-2.6B). Support for GGUF conversion and inference was added in #14620. However, due to a similar `n_embd`, it identifies as a 1.2B model. Fix the label by using `n_ff` to identify the model instead.

Output of `llama-bench`:

```
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| lfm2 1.2B F16                  |   2.18 GiB |     1.17 B | CPU        |      10 |           pp512 |        223.97 ± 5.32 |
| lfm2 2.6B F16                  |   4.79 GiB |     2.57 B | CPU        |      10 |           pp512 |         92.53 ± 4.14 |
| lfm2 350M F16                  | 676.25 MiB |   354.48 M | CPU        |      10 |           pp512 |       725.52 ± 11.70 |
| lfm2 700M F16                  |   1.38 GiB |   742.49 M | CPU        |      10 |           pp512 |       336.22 ± 12.93 |
```

* Update src/llama-model.cpp

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
…15815)

* ggml : make gallocr respect the backend's max buffer size
* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max allocation size in buffer type interface
* fix missing newline, apple-clang warning
* track size of individual chunks in ggml_dyn_tallocr and raise max chunks; revert to use suballocation_block_size as max chunk size for vulkan
* track (chunk, offset) pairs instead of "global" offsets through gallocr: simpler, don't need loops to map between local/global offsets, but touches more code
* fix dyn_tallocr_max_size and initialization
* fix memory leak when buffers are reused due to same buffer type appearing multiple times
* make vbuffer allocation follow the same logic as backend_buffer did before
* continue to use leftover unallocated space of previous chunks after a new one has been created
* treat free blocks of each chunk as separate list: they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size
* refactor: move adding new free block and new chunk into separate functions
* allocate chunks individually with a separate free-blocks list for each one: needs a bit more memory/allocations/indirections, but code is simpler
* fix warnings (missing static) & debug checks
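A sketch of the resulting allocator layout (struct and field names are illustrative, not the actual ggml code):

```c
#include <stddef.h>

#define MAX_FREE_BLOCKS 256
#define MAX_CHUNKS       16

struct free_block { size_t offset, size; };

// one chunk corresponds to one backend buffer, capped by the backend's
// max allocation size (except the last chunk, which may grow beyond it)
struct tallocr_chunk {
    struct free_block free_blocks[MAX_FREE_BLOCKS]; // per-chunk free list
    int    n_free_blocks;
    size_t max_size;
};

// positions are tracked as (chunk, offset) pairs rather than "global" offsets
struct dyn_tallocr {
    struct tallocr_chunk * chunks[MAX_CHUNKS]; // created on demand, starts at 0
    int n_chunks;
};
```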
* CUDA: use fastdiv + ggml_cuda_mad for mmvf
* use bf16 directly + fix formatting
* Add exception for HIP code
Enable CMP0147 so custom build steps (invoking vulkan-shader-gen) are run in parallel. Enable /MP so source files are compiled in parallel.
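In CMake terms the two settings would look roughly like this (a sketch; where exactly they live in the build files may differ):

```cmake
# run custom build commands (e.g. vulkan-shader-gen invocations) in
# parallel under the Visual Studio generators (CMake >= 3.27)
if (POLICY CMP0147)
    cmake_policy(SET CMP0147 NEW)
endif()

# compile source files in parallel with MSVC
if (MSVC)
    add_compile_options(/MP)
endif()
```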
Signed-off-by: Stefan Savic <[email protected]>
Co-authored-by: Stefan Savic <[email protected]>
* metal : avoid using Metal's gpuAddress property
* metal : fix rope kernels buffer check
* CUDA set scheduling strategy to spinning for cc121
* Using prop.major and prop.minor, include HIP and MUSA
* Exclude HIP and MUSA
* Remove trailing whitespace
* Remove empty line

---------

Co-authored-by: Johannes Gäßler <[email protected]>
* llama-quant: add support for mmproj
* Update src/llama.cpp
* check prefix instead
* small fix

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* optimise GGML_OP_SUM
* add non-contiguous tests by permuting the input
* change tests to require full contiguity of OP_SUM
* cuda : add check GGML_OP_SUM

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* opencl: add mm_q8_0_f32
* opencl: fix data loading for incomplete tile
* opencl: use q8_0 mm for larger matrix
* opencl: add some tests to cover the path
* CPU: Add support for FLOOR, CEIL, ROUND and TRUNC unary operators
  - Added the operators to the unary op enum
  - Implemented API functions
  - Implemented forward and unary-op logic in the CPU backend
  - Updated ggml_get_n_tasks
  - Updated the operator names array and static_assert
  - Updated docs and enabled automatic tests
* docs: add documentation for ggml_trunc and ggml_trunc_inplace in ggml.h
* chore: remove trailing whitespace from ggml.h
* Remove unresolved merge markers
* Apply review suggestions: cleanup formatting, enum order and leftover artifacts
* Regenerate ops.md using create_ops_docs.py
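The new operators follow ggml's usual unary API; a minimal usage sketch (only ggml_trunc and ggml_trunc_inplace are named in the commit, the other signatures are inferred from that pattern):

```cpp
#include "ggml.h"

// build the four rounding ops on an existing F32 tensor
static void build_rounding_ops(struct ggml_context * ctx, struct ggml_tensor * x) {
    struct ggml_tensor * f = ggml_floor(ctx, x);
    struct ggml_tensor * c = ggml_ceil (ctx, x);
    struct ggml_tensor * r = ggml_round(ctx, x);
    struct ggml_tensor * t = ggml_trunc(ctx, x); // also: ggml_trunc_inplace(ctx, x)
    (void) f; (void) c; (void) r; (void) t;
}
```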
BF16 requires special handling in this script: it is 2-byte data, but the view is 1-byte by default. Switch to the correct view before attempting byteswapping. With this change, correctly byteswapping models like Meta-Llama-3-8B-Instruct-bf16-GGUF should be possible.
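The underlying issue, illustrated in C++ terms (the actual script is Python/numpy; this sketch only shows why a 2-byte view is required):

```cpp
#include <cstddef>
#include <cstdint>

// byteswap a BF16 tensor: each element is 2 bytes, so the data must be
// processed as 16-bit units; "swapping" individual bytes is a no-op
static void byteswap_bf16(uint16_t * v, size_t n_elems) {
    for (size_t i = 0; i < n_elems; i++) {
        v[i] = (uint16_t) ((v[i] >> 8) | (v[i] << 8));
    }
}
```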
* SYCL: Add GGML_OP_MEAN operator support
* SYCL: Fix formatting for GGML_OP_MEAN case
* Update ggml/src/ggml-sycl/ggml-sycl.cpp

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
## Why it failed

When compiling with strict compiler flags (-Wwrite-strings -Werror=discarded-qualifiers), the build fails with the following error:

```
cmake \
    -S . \
    -B ../llama.cpp.build \
    --preset=x64-linux-gcc-debug \
    -DCMAKE_INSTALL_PREFIX=/tmp/local \
    -DCMAKE_C_FLAGS="-Wwrite-strings -Werror=discarded-qualifiers" && \
cmake --build ../llama.cpp.build/
...
/home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c: In function ‘ggml_cpu_init’:
/home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:3572:24: error: passing argument 1 of ‘putenv’ discards ‘const’ qualifier from pointer target type [-Werror=discarded-qualifiers]
 3572 |                 putenv("KMP_BLOCKTIME=200"); // 200ms
      |                        ^~~~~~~~~~~~~~~~~~~
In file included from /home/otegami/work/cpp/llama.cpp/ggml/src/./ggml-impl.h:10,
                 from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu-impl.h:6,
                 from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/traits.h:3,
                 from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:6:
/usr/include/stdlib.h:786:26: note: expected ‘char *’ but argument is of type ‘const char *’
  786 | extern int putenv (char *__string) __THROW __nonnull ((1));
      |            ~~~~~~^~~~~~~~
cc1: some warnings being treated as errors
ninja: build stopped: subcommand failed.
```

The issue is that putenv() expects a non-const char * but receives a string literal (const char *).

## How to fix

This PR replaces putenv("KMP_BLOCKTIME=200") with setenv("KMP_BLOCKTIME", "200", 0).

Benefits of setenv():
- Accepts const char * parameters (no qualifier warnings)
- Makes copies of the strings (safer memory handling)
- The third parameter (0) ensures we don't overwrite if already set
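The replacement itself is a one-line change, sketched here in a standalone helper (the wrapper function is hypothetical; in the tree the call sits inside ggml_cpu_init):

```c
#include <stdlib.h>

static void set_kmp_blocktime(void) {
    // putenv("KMP_BLOCKTIME=200") passed a string literal (const char *)
    // where putenv() expects char *; setenv() accepts const char * and
    // copies its arguments, and the trailing 0 means "do not overwrite
    // if the variable is already set"
    setenv("KMP_BLOCKTIME", "200", 0);
}
```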
* Update the docs on -t --threads
* Revert "Update the docs on -t --threads" (reverts commit eba9734)
* docs: clarify -t/--threads parameter uses CPU threads and defaults to all available cores
* Update arg.cpp
This commit applies .clang-format rules to all source files under the ggml-cann directory to ensure consistent coding style and readability. The .clang-format option `SortIncludes: false` has been set to disable automatic reordering of include directives. No functional changes are introduced.

Co-authored-by: hipudding <[email protected]>
* SYCL: update element-wise ops and presets
* clean arange
* Re-trigger CI

---------

Co-authored-by: Gitty Burstein <[email protected]>
…iters (#16599)

* fix: added a normalization step for MathJax-style \[\] and \(\) delimiters

  So inline and block equations are converted before KaTeX rendering, enabling proper display of model-generated LaTeX in the WebUI

* chore: update webui build output
* SYCL/SET: implement operator + wire-up; docs/ops updates; element_wise & ggml-sycl changes
* sycl(SET): re-apply post-rebase; revert manual docs/ops.md; style cleanups
* move SET op to standalone file, GPU-only implementation
* Update SYCL SET operator for F32
* ci: fix editorconfig issues (LF endings, trailing spaces, final newline)
* fixed ggml-sycl.cpp

---------

Co-authored-by: Gitty Burstein <[email protected]>
… conversion logic (#16626)
* initial: headers and metal-device.cpp updates
* adding conv_transpose_2d
* fix type
* fix type: int32->int64
* Update ggml/src/ggml-metal/ggml-metal.metal
* Update ggml/src/ggml-metal/ggml-metal.metal
* Update ggml/src/ggml-metal/ggml-metal.metal
* add checks for src[0] and src[1]; add type checks
* Update ggml-metal.metal
* add more tests, add optimization to threading
* add dynamic memory allocation in metal

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* webui: reorganize settings layout
* chore: update webui build output
* fix: remove unused variable
* chore: update webui build output
Fix incorrect task-to-batch index calculation in the quantization phase.

The bug caused out-of-bounds access to the qnbitgemm_args array when compute_idx exceeded per_gemm_block_count_m, leading to invalid pointer dereferences and SIGBUS errors. Correctly map tasks to batches by dividing compute_idx by per_gemm_block_count_m instead of block_size_m.

Example: batch_feature=1, gemm_m=30, block_size_m=4, so per_gemm_block_count_m = 8 and task_count = 8.

Old: gemm_idx = 4/4 = 1 (out of bounds)
New: gemm_idx = 4/8 = 0 (correct)

Tested on SpaceMit K1 RISC-V64 with the qwen2.5:0.5b model.

Co-authored-by: muggle <[email protected]>
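A sketch of the corrected mapping (names taken from the description above; the surrounding qnbitgemm plumbing is omitted and the m_idx line is inferred, not quoted from the patch):

```cpp
#include <cstddef>

// with batch_feature = 1, gemm_m = 30, block_size_m = 4:
//   per_gemm_block_count_m = ceil(30 / 4) = 8, task_count = 8
// old: gemm_idx = compute_idx / block_size_m           -> 4 / 4 = 1 (out of bounds)
// new: gemm_idx = compute_idx / per_gemm_block_count_m -> 4 / 8 = 0 (correct)
static void map_task(size_t compute_idx,
                     size_t per_gemm_block_count_m,
                     size_t block_size_m,
                     size_t * gemm_idx, size_t * m_idx) {
    *gemm_idx = compute_idx / per_gemm_block_count_m;
    *m_idx    = (compute_idx % per_gemm_block_count_m) * block_size_m;
}
```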