
Eval bug: RWKV inference issue with llama-server #13018

Closed
@blakkd

Description


Name and Version

build b5155

~/l/b/bin ❯❯❯ ./llama-server --version
version: 5155 (64082100)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 3090 24GB

Models

LatentWanderer/featherless-ai_Qwerky-QwQ-32B-gguf

Problem description & steps to reproduce

llama-cli works as intended.
But when running llama-server, only the first generation works fine.
Once this first generation ends, or is cancelled, the server crashes on any new generation attempt.

What I did exactly:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 16
cd build/bin
./llama-server -m /home/user/Downloads/featherless-ai_Qwerky-QwQ-32B-Q4_K_M.gguf -ngl 65 -c 2048 --port 8082 -n 50
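Then, to reproduce, it is enough to send the same chat completion request twice (illustrative payload only, the exact content shouldn't matter; any OpenAI-compatible client does the same):

curl http://127.0.0.1:8082/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

The first request completes normally; repeating it (or cancelling it and sending a new one) crashes the server.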

First Bad Commit

I can't tell exactly right now, but with a much older version, for example b4616, the bug is not encountered.
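If it helps narrowing it down, a bisect between the two release tags should locate it (sketch only, assuming the b4616 and b5155 tags are available in the clone; I haven't run this end to end):

git bisect start b5155 b4616
# at each step, rebuild and retry the two requests above
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j 16
git bisect good   # or: git bisect bad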

Relevant log output

Here are 2 consecutive generation requests:

./llama-server -m /home/user/Downloads/featherless-ai_Qwerky-QwQ-32B-Q4_K_M.gguf -ngl 65 -c 2048 --port 8082 -n 50

.
.
.

main: server is listening on http://127.0.0.1:8082 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 20
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 20, n_tokens = 20, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 20, n_tokens = 20
slot      release: id  0 | task 0 | stop processing: n_past = 69, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =     110.29 ms /    20 tokens (    5.51 ms per token,   181.35 tokens per second)
       eval time =    1884.07 ms /    50 tokens (   37.68 ms per token,    26.54 tokens per second)
      total time =    1994.36 ms /    70 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 51 | processing task
slot update_slots: id  0 | task 51 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 20
slot update_slots: id  0 | task 51 | need to evaluate at least 1 token to generate logits, n_past = 20, n_prompt_tokens = 20
slot update_slots: id  0 | task 51 | kv cache rm [0, end)
slot update_slots: id  0 | task 51 | prompt processing progress, n_past = 20, n_tokens = 20, progress = 1.000000
slot update_slots: id  0 | task 51 | prompt done, n_past = 20, n_tokens = 20
/home/user/llama.cpp-b5155/src/llama-kv-cache.cpp:599: GGML_ASSERT(empty_cell.is_empty()) failed
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
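Note: no backtrace is printed because gdb can't attach under the Yama ptrace restriction mentioned in the output. If a stack trace would help, attaching can be allowed temporarily (assuming a default Ubuntu setup) with:

sudo sysctl -w kernel.yama.ptrace_scope=0

and then re-running the server and triggering the crash again.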
