Description
Name and Version
ghcr.io/ggerganov/llama.cpp:server-cuda d5709176ee6d
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
docker-compose config (with healthchecks and volumes removed for brevity):
services:
  llama-cpp-chat:
    image: ghcr.io/ggerganov/llama.cpp:server-cuda
    pull_policy: always
    restart: always
    ports:
      - 8080:8080
    command:
      - "-a"
      - "llama-cpp-chat"
      - "--tensor-split"
      - "22,22,22,10"
      - "--parallel"
      - "4"
      - "--batch-size"
      - "1024"
      - "--ubatch-size"
      - "512"
      - "--threads-http"
      - "4"
      - "-ngl"
      - "500"
      - "-fa"
      - "-m"
      - "/models/unsloth--Llama-4-Scout-17B-16E-Instruct-GGUF/IQ4_XS/Llama-4-Scout-17B-16E-Instruct-IQ4_XS-00001-of-00002.gguf"
      - "-c"
      - "32768"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ "0", "1", "2" ]
              capabilities: [gpu]
$ docker images | grep server-cuda
ghcr.io/ggerganov/llama.cpp server-cuda d5709176ee6d 7 hours ago 2.76GB
Problem description & steps to reproduce
After the following error (which may or may not be a bug in its own right -- it's possibly a PCIe riser bus integrity issue on my side):
/app/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_cuda_mul_mat_q at /app/ggml/src/ggml-cuda/mmq.cu:145
cudaMemcpyAsync(ids_host.data(), ids->data, ggml_nbytes(ids), cudaMemcpyDeviceToHost, stream)
Completions hang (apparently forever):
$ date; curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'
Sat 3 May 13:41:45 BST 2025
^C
$ date
Sat 3 May 13:42:33 BST 2025
Whereas before such an error occurs (or after the error plus a llama-server restart), completions return normally:
$ date; curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'; date
Sat 3 May 13:40:47 BST 2025
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Hello! How can I assist you today?"}}],"created":1746276047,"model":"llama-cpp-chat","system_fingerprint":"b5269-1d36b367","object":"chat.completion","usage":{"completion_tokens":10,"prompt_tokens":11,"total_tokens":21},"id":"chatcmpl-shzI6fsLwPfFYsQ6SRArjYUKB6lI1mMD","timings":{"prompt_n":11,"prompt_ms":88.858,"prompt_per_token_ms":8.078000000000001,"prompt_per_second":123.79301807378064,"predicted_n":10,"predicted_ms":198.143,"predicted_per_token_ms":19.8143,"predicted_per_second":50.468600959912784}}Sat 3 May 13:40:47 BST 2025
This is not ideal in itself, but understandable.
However, we would at least like to be able to restart the service automatically, since a restart is sufficient for recovery (at least with pcie_aspm=off on the Linux kernel command line).
Docker or Kubernetes could handle that restart, but they need the health endpoint to report that something is wrong. Instead, I get:
$ curl http://localhost:8080/health
{"status":"ok"}
So, at a minimum, switching the "ok" to "error" (or similar) when this CUDA error occurs would be helpful.
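For reference, here is a minimal sketch (hypothetical values, not my actual config) of how an error-reporting /health would be consumed. Both Docker's curl-based healthcheck and a Kubernetes httpGet liveness probe key off the HTTP status code rather than the response body, so ideally the endpoint would also return a non-2xx status (e.g. 503) in the error state, not just a different JSON payload:

    # Hypothetical Compose healthcheck (assumes curl is available inside the image).
    # curl -f exits non-zero on HTTP error statuses, so the container is marked
    # unhealthy once /health stops returning 2xx. Note that "restart: always"
    # ignores health status, so something like docker autoheal or an orchestrator
    # still has to act on the unhealthy state.
    services:
      llama-cpp-chat:
        healthcheck:
          test: ["CMD", "curl", "-fsS", "http://localhost:8080/health"]
          interval: 30s
          timeout: 5s
          retries: 3

The Kubernetes equivalent would be a liveness probe, which does restart the container on its own after repeated failures:

    # Hypothetical liveness probe, placed under the container spec of the pod;
    # again keyed off the HTTP status code of /health.
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      periodSeconds: 30
      failureThreshold: 3

As a stopgap until /health reflects this state, the probe could instead POST a tiny request to /v1/chat/completions with a short curl timeout (e.g. -m 30), which does catch the hang described above, at the cost of occupying one of the parallel slots on every probe.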
First Bad Commit
No response
Relevant log output
(see above)