Misc. bug: Completions hang after CUDA error, but health endpoint reports all OK #13281

Open
@lee-b

Description

Name and Version

ghcr.io/ggerganov/llama.cpp:server-cuda d5709176ee6d

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

docker-compose config (with healthchecks and volumes removed for brevity):

services:
  llama-cpp-chat:
    image: ghcr.io/ggerganov/llama.cpp:server-cuda
    pull_policy: always
    restart: always
    ports:
      - 8080:8080
    command:
      - "-a"
      - "llama-cpp-chat"

      - "--tensor-split"
      - "22,22,22,10"

      - "--parallel"
      - "4"
      - "--batch-size"
      - "1024"
      - "--ubatch-size"
      - "512"
      - "--threads-http"
      - "4"

      - "-ngl"
      - "500"

      - "-fa"

      - "-m"
      - "/models/unsloth--Llama-4-Scout-17B-16E-Instruct-GGUF/IQ4_XS/Llama-4-Scout-17B-16E-Instruct-IQ4_XS-00001-of-00002.gguf"
      - "-c"
      - "32768"

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ "0", "1", "2" ]
              capabilities: [gpu]


$ docker images | grep server-cuda
ghcr.io/ggerganov/llama.cpp                      server-cuda   d5709176ee6d   7 hours ago    2.76GB

Problem description & steps to reproduce

After the following error (which may or may not be a bug in its own right -- it's possibly a PCIe riser bus integrity issue on my side):

/app/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
   CUDA error: an illegal memory access was encountered
   current device: 0, in function ggml_cuda_mul_mat_q at /app/ggml/src/ggml-cuda/mmq.cu:145
   cudaMemcpyAsync(ids_host.data(), ids->data, ggml_nbytes(ids), cudaMemcpyDeviceToHost, stream)

Completions hang (apparently forever):

$ date; curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'
Sat  3 May 13:41:45 BST 2025
^C
$ date
Sat  3 May 13:42:33 BST 2025

Whereas before such an error occurs (or after the error and a llama-server restart), the same request completes normally:

$ date; curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'; date
Sat  3 May 13:40:47 BST 2025
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Hello! How can I assist you today?"}}],"created":1746276047,"model":"llama-cpp-chat","system_fingerprint":"b5269-1d36b367","object":"chat.completion","usage":{"completion_tokens":10,"prompt_tokens":11,"total_tokens":21},"id":"chatcmpl-shzI6fsLwPfFYsQ6SRArjYUKB6lI1mMD","timings":{"prompt_n":11,"prompt_ms":88.858,"prompt_per_token_ms":8.078000000000001,"prompt_per_second":123.79301807378064,"predicted_n":10,"predicted_ms":198.143,"predicted_per_token_ms":19.8143,"predicted_per_second":50.468600959912784}}Sat  3 May 13:40:47 BST 2025

That completions break after such an error is not ideal in itself, but it is understandable.

However, at a minimum, we would want to be able to restart the service automatically, since a restart is sufficient for recovery (at least with pcie_aspm=off set on the Linux kernel command line).

Docker/k8s could handle that restart, but only if the health endpoint reports that something is wrong. Instead, I get:

$ curl http://localhost:8080/health
{"status":"ok"}

So, at a minimum, it would be helpful if the health endpoint switched "ok" to "error" (or similar) when this CUDA error occurs, so that a healthcheck can act on it (sketched below).
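
For illustration, a healthcheck along the following lines could then drive the restart. This is only a sketch: the probe command, timings, and the autoheal label are illustrative (not the exact healthcheck removed from the config above), it presumes curl is available inside the container, and it only helps once /health actually reports the failure, which is what this issue asks for. Docker itself only marks the container unhealthy; an external watcher such as willfarrell/autoheal (or a Kubernetes livenessProbe pointed at the same /health path, which Kubernetes acts on natively) has to perform the restart.

services:
  llama-cpp-chat:
    labels:
      # assumption: https://github.com/willfarrell/autoheal restarts containers
      # that are in the "unhealthy" state and carry this label
      - "autoheal=true"
    healthcheck:
      # curl -f exits non-zero on any HTTP error status, so this probe fails
      # as soon as /health stops returning 200 {"status":"ok"}
      test: ["CMD", "curl", "-sf", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 120s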

First Bad Commit

No response

Relevant log output

(see above)
