Description
Name and Version
version: 5184 (87616f0)
built with MSVC 19.41.34120.0 for x64
Operating systems
Mac, Windows
Which llama.cpp modules do you know to be affected?
Other (Please specify in the next section)
Command line
llama-retrieval.exe --context-file <any_text_file> --chunk-size 1 -c 512 -t 8 -m bge-large-en-v1.5-f32.gguf
Problem description & steps to reproduce
The retrieval example fails to decode the token batches built from the text chunks, so no embeddings are produced.
It looks like we need to skip the KV-cache logic that searches for an unused slot when pooling is active (which is the case for the model above).
The following if in llama-context.cpp was removed, so we now fall into the slot-search logic and hit the decode errors shown in the log below.
// non-causal masks do not use the KV cache
if (hparams.causal_attn) {
    kv_self_update();
    // ...
}
Just adding a guard like "if (!embd_pooling)" appears to fix the issue, but I am not sure how it interacts with the original non-causal-mask handling for gemma-3.
First Bad Commit
Relevant log output
llama-retrieval.exe --context-file <any_text_file> --chunk-size 1 -c 512 -t 8 -m bge-large-en-v1.5-f32.gguf
...
init: CPU KV buffer size = 48.00 MiB
llama_context: KV self size = 48.00 MiB, K (f16): 24.00 MiB, V (f16): 24.00 MiB
llama_context: CPU compute buffer size = 27.01 MiB
llama_context: graph nodes = 825
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 512
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
batch_decode: n_tokens = 2043, n_seq = 118
find_slot: n_tokens = 2043 > size = 512
decode: failed to find KV cache slot for ubatch of size 2043
llama_decode: failed to decode, ret = 1
get_embeddings_ith: invalid embeddings id 0, reason: no embeddings
batch_decode: failed to get embeddings for token 0
get_embeddings_ith: invalid embeddings id 1, reason: no embeddings
batch_decode: failed to get embeddings for token 1
get_embeddings_ith: invalid embeddings id 2, reason: no embeddings
batch_decode: failed to get embeddings for token 2
get_embeddings_ith: invalid embeddings id 3, reason: no embeddings
batch_decode: failed to get embeddings for token 3
...