Name and Version
llama-cli --version
version: 5336 (053367d)
built with gcc-12.4 (GCC) 12.4.0 for x86_64-redhat-linux
Operating systems
Linux
GGML backends
Vulkan
Hardware
AMD RX 7600
Models
Phi-4-mini-reasoning-Q8_0.gguf
Problem description & steps to reproduce
The server crashes after a few minutes of inference when launched with the following command:
./llama-server -m /models/Phi-4-mini-reasoning-Q8_0.gguf -t 8 --batch-size 2048 --ubatch-size 1024 -fa -ctk q8_0 -ctv q8_0 --gpu-layers 99 -c 32768 --temp 0.8 --top-p 0.95 --min-p 0 --jinja
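Any sustained generation load seems to do; for example, a loop like the following (a hypothetical client of mine, assuming the default port 8080 and the server's OpenAI-compatible endpoint; the prompt content is arbitrary) keeps the server generating until it aborts:

# hypothetical load driver used to reproduce, not part of llama.cpp
while true; do
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"Write a long story about a lighthouse."}],"max_tokens":2048}' \
    > /dev/null
done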
First Bad Commit
No response
Relevant log output
/media/build/llama/ggml/src/ggml-backend.cpp:748: pre-allocated tensor (cache_k_l0 (view) (copy of cache_k_l0 (view))) in a buffer (Vulkan0) that cannot run the operation (CPY)
[New LWP 11936]
[New LWP 11937]
[New LWP 11938]
[New LWP 11939]
[New LWP 11940]
[New LWP 11941]
[New LWP 11942]
[New LWP 11943]
[New LWP 11944]
[New LWP 11945]
[New LWP 11946]
[New LWP 11947]
[New LWP 20707]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f51be9db5a6 in waitpid () from /lib64/libpthread.so.0
#0 0x00007f51be9db5a6 in waitpid () from /lib64/libpthread.so.0
#1 0x000000000070a5e8 in ggml_abort ()
#2 0x000000000071dfff in ggml_backend_sched_backend_id_from_cur(ggml_backend_sched*, ggml_tensor*) ()
#3 0x000000000071ee1a in ggml_backend_sched_split_graph(ggml_backend_sched*, ggml_cgraph*) [clone .part.0] ()
#4 0x00000000007227d1 in ggml_backend_sched_alloc_graph ()
#5 0x00000000005089ce in llama_kv_cache_unified::update(llama_context&) ()
#6 0x00000000004e146f in llama_context::kv_self_update() ()
#7 0x00000000004e487e in llama_context::decode(llama_batch&) ()
#8 0x00000000004e62ea in llama_decode ()
#9 0x00000000003676da in server_context::update_slots() ()
#10 0x00000000003323dc in server_queue::start_loop() ()
#11 0x00000000003a1420 in main ()
[Inferior 1 (process 11925) detached]
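Reading the backtrace: the abort fires in ggml_backend_sched_backend_id_from_cur() while llama_kv_cache_unified::update() is rebuilding the graph, i.e. during a KV-cache shift/defragmentation pass rather than the regular decode path. That update schedules a CPY of the q8_0 K cache (cache_k_l0), and the scheduler finds no backend that can run that copy on the pre-allocated Vulkan0 buffer. This would be consistent with the crash only appearing after a few minutes, once the cache needs to be shifted or defragmented. As an untested guess, dropping KV-cache quantization (the -ctk q8_0 -ctv q8_0 flags) so the cache stays f16 may avoid the failing copy:

# untested variant: same command without KV-cache quantization
./llama-server -m /models/Phi-4-mini-reasoning-Q8_0.gguf -t 8 --batch-size 2048 --ubatch-size 1024 -fa --gpu-layers 99 -c 32768 --temp 0.8 --top-p 0.95 --min-p 0 --jinja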