Description
Hello,
My current setup uses Koboldcpp, but the issue also reproduces with Llama.cpp, LMStudio, and the Ollama Vulkan fork; in other words, it occurs whenever the llama.cpp Vulkan runner code is in use.
I've been running the Vulkan runner for ~2 months with Koboldcpp + Llama-Swap + OpenWebUI. The combination works great most of the time, but under load (frequent model switching, large RAG queries, Web Search) the machine sometimes halts for a few seconds, the screen turns black, and then Windows explorer.exe quits and relaunches. Afterwards, Koboldcpp's models are freed from memory, and I almost always have to quit and relaunch Koboldcpp, Llama-Swap, etc. In some cases a full restart is required.
CPU: AMD 7735HS
iGPU: 680m
RAM: 48GB DDR5, 16GB of which is shared with the iGPU
I'm always running the latest Koboldcpp and Llama.cpp builds, with the latest AMD drivers.
Looking at various logs from llama.cpp, koboldcpp, and LMStudio, the issue seems to happen during prompt processing or within the first few generated tokens. CPU-only inference never exhibits this behavior. I tried offloading the embedding model to OpenWebUI for RAG and Web Search, but that only reduced the frequency of the crashes; it didn't stop them.
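For what it's worth, one way to capture more detail from the driver side before the crash is to run a minimal repro under the Khronos validation layer. This is a sketch, assuming the Vulkan SDK is installed on the machine; the binary name `llama-cli`, the model path, and the prompt are placeholders for the actual setup:

```shell
REM Enable the standard Khronos validation layer (shipped with the Vulkan SDK)
REM so driver-level errors are reported before the device is lost.
set VK_INSTANCE_LAYERS=VK_LAYER_KHRONOS_validation

REM Run a short prompt with full GPU offload and capture all output to a log.
REM "model.gguf" is a placeholder; -ngl 99 offloads all layers to the iGPU.
llama-cli -m model.gguf -ngl 99 -p "test prompt" > vulkan-crash.log 2>&1
```

If the validation layer reports anything (out-of-bounds descriptor access, device-lost, allocation failures) right before the hang, that log would likely be useful to attach to this report.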