Description
Name and Version
ghcr.io/ggerganov/llama.cpp:server-cuda d5709176ee6d
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
docker-compose config (with healthchecks and volumes removed for brevity):
services:
  llama-cpp-chat:
    image: ghcr.io/ggerganov/llama.cpp:server-cuda
    pull_policy: always
    restart: always
    ports:
      - 8080:8080
    command:
      - "-a"
      - "llama-cpp-chat"
      - "--tensor-split"
      - "22,22,22,10"
      - "--parallel"
      - "4"
      - "--batch-size"
      - "1024"
      - "--ubatch-size"
      - "512"
      - "--threads-http"
      - "4"
      - "-ngl"
      - "500"
      - "-fa"
      - "-m"
      - "/models/unsloth--Llama-4-Scout-17B-16E-Instruct-GGUF/IQ4_XS/Llama-4-Scout-17B-16E-Instruct-IQ4_XS-00001-of-00002.gguf"
      - "-c"
      - "32768"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ "0", "1", "2" ]
              capabilities: [gpu]
$ docker images | grep server-cuda
ghcr.io/ggerganov/llama.cpp server-cuda d5709176ee6d 7 hours ago 2.76GB
Problem description & steps to reproduce
After the following error (which may or may not be a bug in its own right -- it's possibly a PCIe riser bus integrity issue on my side):
/app/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_cuda_mul_mat_q at /app/ggml/src/ggml-cuda/mmq.cu:145
cudaMemcpyAsync(ids_host.data(), ids->data, ggml_nbytes(ids), cudaMemcpyDeviceToHost, stream)
Completions hang (apparently forever):
$ date; curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'
Sat 3 May 13:41:45 BST 2025
^C
$ date
Sat 3 May 13:42:33 BST 2025
Whereas before such an error occurs (or after the error plus a llama-server restart), completions return normally:
$ date; curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'; date
Sat 3 May 13:40:47 BST 2025
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Hello! How can I assist you today?"}}],"created":1746276047,"model":"llama-cpp-chat","system_fingerprint":"b5269-1d36b367","object":"chat.completion","usage":{"completion_tokens":10,"prompt_tokens":11,"total_tokens":21},"id":"chatcmpl-shzI6fsLwPfFYsQ6SRArjYUKB6lI1mMD","timings":{"prompt_n":11,"prompt_ms":88.858,"prompt_per_token_ms":8.078000000000001,"prompt_per_second":123.79301807378064,"predicted_n":10,"predicted_ms":198.143,"predicted_per_token_ms":19.8143,"predicted_per_second":50.468600959912784}}Sat 3 May 13:40:47 BST 2025
This is not ideal in itself, but understandable.
However, we would at least like to be able to restart the service automatically, since a restart is sufficient for recovery (at least with pcie_aspm=off on the Linux kernel command line).
Docker or Kubernetes could handle that restart, but they need the health endpoint to report that something is wrong. Instead, I get:
$ curl http://localhost:8080/health
{"status":"ok"}
So, at a minimum, switching the "ok" to "error" (or similar) when this CUDA error occurs would be helpful.
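For reference, here is a minimal sketch (hypothetical values, not my actual config) of how an error-reporting /health would be consumed. Both Docker's curl-based healthcheck and a Kubernetes httpGet liveness probe key off the HTTP status code rather than the response body, so ideally the endpoint would also return a non-2xx status (e.g. 503) in the error state, not just a different JSON payload:

    # Hypothetical Compose healthcheck (assumes curl is available inside the image).
    # curl -f exits non-zero on HTTP error statuses, so the container is marked
    # unhealthy once /health stops returning 2xx. Note that "restart: always"
    # ignores health status, so something like docker autoheal or an orchestrator
    # still has to act on the unhealthy state.
    services:
      llama-cpp-chat:
        healthcheck:
          test: ["CMD", "curl", "-fsS", "http://localhost:8080/health"]
          interval: 30s
          timeout: 5s
          retries: 3

The Kubernetes equivalent would be a liveness probe, which does restart the container on its own after repeated failures:

    # Hypothetical liveness probe, placed under the container spec of the pod;
    # again keyed off the HTTP status code of /health.
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      periodSeconds: 30
      failureThreshold: 3

As a stopgap until /health reflects this state, the probe could instead POST a tiny request to /v1/chat/completions with a short curl timeout (e.g. -m 30), which does catch the hang described above, at the cost of occupying one of the parallel slots on every probe.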
First Bad Commit
No response
Relevant log output
(see above)