Description
Name and Version
ghcr.io/ggerganov/llama.cpp:server-cuda
docker.compose.image=sha256:f608f747701dc4df42f89cab23a6d6b556889f4454737395e988fd3a94e41b45
llama-cpp-chat-1 | build: 5332 (7c28a74) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
image: ghcr.io/ggerganov/llama.cpp:server-cuda
command:
  - "-a"
  - "llama-cpp-chat"
  - "-a"
  - "llama-cpp-chat"
  - "-ctv"
  - "q4_0"
  - "-ctv"
  - "q4_0"
  - "--slot-save-path"
  - "/prompt-cache/slot-saves"
  - "-fa"
  - "-m"
  - "/models/mradermacher--Dolphin-Mistral-24B-Venice-Edition-i1-GGUF/Dolphin-Mistral-24B-Venice-Edition.i1-IQ4_XS.gguf"
  - "-np"
  - "2"
  - "-c"
  - "131072"
  - "-ngl"
  - "500"
  - "--cache-reuse"
  - "25"
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: [ "0", "2", "3" ]
          capabilities: [gpu]
Problem description & steps to reproduce
I'm seeing quite frequent CUDA illegal memory access errors.
I've seen this with Scout and/or Maverick before too, but this particular instance is when running mradermacher's Dolphin-Mistral-24B-Venice-Edition-i1-GGUF/Dolphin-Mistral-24B-Venice-Edition.i1-IQ4_XS.gguf.
I believe I have ruled out a card, VRAM, or PCIe issue: I get this error consistently across various combinations of CUDA_VISIBLE_DEVICES, and I confirmed in nvtop that the cards previously reported as problematic were excluded in the later runs. The cards are on risers, but they are short, high-quality ones, and different cards use different risers, so I don't think that's relevant. The cards also differ in PCIe version and lane count.
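To make the device-exclusion testing repeatable, here is a minimal sketch that lists every non-empty subset of the GPUs as a CUDA_VISIBLE_DEVICES value. The helper name `visible_device_sets` is hypothetical, and the IDs are simply the host indices from `device_ids` in the compose file; inside the container the indices may be numbered differently, so treat the output as a checklist rather than exact values.

```python
from itertools import combinations


def visible_device_sets(device_ids):
    """Return every non-empty subset of device_ids as a CUDA_VISIBLE_DEVICES value."""
    values = []
    for size in range(1, len(device_ids) + 1):
        for combo in combinations(device_ids, size):
            values.append(",".join(combo))
    return values


# Assumption: host GPU indices copied from device_ids in the compose file.
GPU_IDS = ["0", "2", "3"]

for value in visible_device_sets(GPU_IDS):
    print(f"CUDA_VISIBLE_DEVICES={value}")
```

Recording which of these values still reproduce the crash would show whether it follows a particular card or occurs with any multi-GPU combination.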
First Bad Commit
No response
Relevant log output
llama-cpp-chat-1 | /app/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
llama-cpp-chat-1 | CUDA error: an illegal memory access was encountered
llama-cpp-chat-1 | current device: 0, in function ggml_backend_cuda_cpy_tensor_async at /app/ggml/src/ggml-cuda/ggml-cuda.cu:2428
llama-cpp-chat-1 | cudaMemcpyPeerAsync(dst->data, cuda_ctx_dst->device, src->data, cuda_ctx_src->device, ggml_nbytes(dst), cuda_ctx_src->stream())
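Since the crash recurs, it may help to tally which device and function each occurrence reports. The following is a minimal sketch, assuming the "current device" line always has the format shown in the log above; `parse_cuda_error` and the regular expression are illustrative helpers, not part of llama.cpp.

```python
import re

# Assumption: pattern derived only from the "current device" line in the log above.
DEVICE_LINE = re.compile(
    r"current device: (?P<device>\d+), in function (?P<function>\w+) "
    r"at (?P<file>\S+):(?P<line>\d+)"
)


def parse_cuda_error(log_text):
    """Return device, function, file, and line of the first CUDA error, or None."""
    match = DEVICE_LINE.search(log_text)
    if match is None:
        return None
    info = match.groupdict()
    # Convert the numeric fields so they can be counted or compared directly.
    info["device"] = int(info["device"])
    info["line"] = int(info["line"])
    return info


SAMPLE = """\
llama-cpp-chat-1 | /app/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
llama-cpp-chat-1 | CUDA error: an illegal memory access was encountered
llama-cpp-chat-1 | current device: 0, in function ggml_backend_cuda_cpy_tensor_async at /app/ggml/src/ggml-cuda/ggml-cuda.cu:2428
"""

print(parse_cuda_error(SAMPLE))
```

Running this over the container logs from several crashes would show whether the failure always lands in the same peer-copy call or moves between devices and functions.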