Misc. bug: Retrieval sample not decoding token successfully #13102

Closed
@HoiV

Name and Version

version: 5184 (87616f0)
built with MSVC 19.41.34120.0 for x64

Operating systems

Mac, Windows

Which llama.cpp modules do you know to be affected?

Other (Please specify in the next section)

Command line

llama-retrieval.exe --context-file <any_text_file> --chunk-size 1 -c 512 -t 8 -m bge-large-en-v1.5-f32.gguf

Problem description & steps to reproduce

The retrieval sample fails to decode any of the tokens produced from the context file, so no embeddings are generated.

It looks like the KV-cache logic that searches for an unused slot needs to be skipped when pooling is active (which is the case for the model above).

The following if-check in llama-context.cpp was removed, so decoding now always enters the slot search and fails with the errors shown in the log output below.

    // non-causal masks do not use the KV cache
    if (hparams.causal_attn) {
        kv_self_update();

Just adding "if (!embd_pooling)" appears to fix the issue, but I am not sure how it affects the original logic for the non-causal mask used with gemma-3.
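To make the failure mode concrete, here is a minimal, self-contained sketch of the behavior described above. It is not the actual llama.cpp code: `sketch_params`, `sketch_find_slot`, and `sketch_decode` are hypothetical names invented for illustration, and the only things taken from this report are the size check seen in the log (`n_tokens > size`) and the proposed `!embd_pooling` guard.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for the context state (not a llama.cpp type).
struct sketch_params {
    bool     embd_pooling; // true when a pooling type is active (e.g. embedding models)
    uint32_t kv_size;      // KV cache size in cells, as set by -c
};

// Mirrors the check behind "find_slot: n_tokens = 2043 > size = 512":
// a batch larger than the cache can never be placed.
bool sketch_find_slot(uint32_t n_tokens, uint32_t kv_size) {
    return n_tokens <= kv_size;
}

// Returns 0 on success and 1 when no KV cache slot is found.
// with_guard == true models the proposed "if (!embd_pooling)" check;
// with_guard == false models the current behavior, which always searches.
int sketch_decode(const sketch_params & p, uint32_t n_tokens, bool with_guard) {
    const bool search_kv = with_guard ? !p.embd_pooling : true;
    if (search_kv && !sketch_find_slot(n_tokens, p.kv_size)) {
        return 1; // "failed to find KV cache slot"
    }
    return 0;
}
```

With the values from the log, `sketch_decode({true, 512}, 2043, false)` returns 1 and the same call with `with_guard = true` returns 0. Whether this guard is also correct for the gemma-3 non-causal path is exactly the open question above, and the sketch does not answer it.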

First Bad Commit

bed4c73

Relevant log output

llama-retrieval.exe --context-file <any_text_file> --chunk-size 1 -c 512 -t 8 -m bge-large-en-v1.5-f32.gguf
...
init:        CPU KV buffer size =    48.00 MiB
llama_context: KV self size  =   48.00 MiB, K (f16):   24.00 MiB, V (f16):   24.00 MiB
llama_context:        CPU compute buffer size =    27.01 MiB
llama_context: graph nodes  = 825
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 512
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
batch_decode: n_tokens = 2043, n_seq = 118
find_slot: n_tokens = 2043 > size = 512
decode: failed to find KV cache slot for ubatch of size 2043
llama_decode: failed to decode, ret = 1
get_embeddings_ith: invalid embeddings id 0, reason: no embeddings
batch_decode: failed to get embeddings for token 0
get_embeddings_ith: invalid embeddings id 1, reason: no embeddings
batch_decode: failed to get embeddings for token 1
get_embeddings_ith: invalid embeddings id 2, reason: no embeddings
batch_decode: failed to get embeddings for token 2
get_embeddings_ith: invalid embeddings id 3, reason: no embeddings
batch_decode: failed to get embeddings for token 3
...
