Name and Version
build b5155
~/l/b/bin ❯❯❯ ./llama-server --version
version: 5155 (64082100)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
RTX 3090 24GB
Models
LatentWanderer/featherless-ai_Qwerky-QwQ-32B-gguf
Problem description & steps to reproduce
llama-cli works as intended.
But when running llama-server, only the first generation works fine.
Once this first generation finishes, or is cancelled, the server crashes on any new generation attempt.
Exactly what I did:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 16
cd build/bin
./llama-server -m /home/user/Downloads/featherless-ai_Qwerky-QwQ-32B-Q4_K_M.gguf -ngl 65 -c 2048 --port 8082 -n 50
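The generation attempts are ordinary OpenAI-compatible chat-completions calls against the server started above. For reference, something like the following is enough to trigger it (the prompt and max_tokens here are just an illustrative example, not my exact request body): the first call completes, the second one crashes the server.
# example request; repeat it twice against the running server
curl http://127.0.0.1:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello, how are you?"}], "max_tokens": 50}'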
First Bad Commit
I can't tell exactly right now, but with a much older version, for example b4616, the bug is not encountered.
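I haven't bisected yet; if it helps, something along these lines (assuming the bXXXX release tags are fetched locally) should narrow down the first bad commit:
git fetch --tags
git bisect start
git bisect bad b5155
git bisect good b4616
# at each step: rebuild, start the server, send two consecutive chat-completions requests
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j 16
# then mark the build accordingly
git bisect good   # or: git bisect bad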
Relevant log output
Here are 2 consecutive generation requests:
./llama-server -m /home/user/Downloads/featherless-ai_Qwerky-QwQ-32B-Q4_K_M.gguf -ngl 65 -c 2048 --port 8082 -n 50
.
.
.
main: server is listening on http://127.0.0.1:8082 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 20
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 20, n_tokens = 20, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 20, n_tokens = 20
slot release: id 0 | task 0 | stop processing: n_past = 69, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 110.29 ms / 20 tokens ( 5.51 ms per token, 181.35 tokens per second)
eval time = 1884.07 ms / 50 tokens ( 37.68 ms per token, 26.54 tokens per second)
total time = 1994.36 ms / 70 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 51 | processing task
slot update_slots: id 0 | task 51 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 20
slot update_slots: id 0 | task 51 | need to evaluate at least 1 token to generate logits, n_past = 20, n_prompt_tokens = 20
slot update_slots: id 0 | task 51 | kv cache rm [0, end)
slot update_slots: id 0 | task 51 | prompt processing progress, n_past = 20, n_tokens = 20, progress = 1.000000
slot update_slots: id 0 | task 51 | prompt done, n_past = 20, n_tokens = 20
/home/user/llama.cpp-b5155/src/llama-kv-cache.cpp:599: GGML_ASSERT(empty_cell.is_empty()) failed
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.