Description
Name and Version
build: 5298 (141a908), x86_64-linux-gnu (see the log below)
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./llama-server \
--model /mnt/ssd/models/gguf/Openhands-32B-Q8_0/all-hands_openhands-lm-32b-v0.1-Q8_0.gguf \
-a oh32q8 \
--host 0.0.0.0 \
--port 9000 \
--api-key Llh123456@ \
--ctx-size 128000 \
--no-webui \
--n-gpu-layers 65 \
--mlock \
--tensor-split "0.15,0.15,0.25,0.15,0.15,0.15" \
--main-gpu 0 \
--flash-attn \
--defrag-thold 0.2 \
--split-mode layer
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=4
./llama-server \
--model /mnt/ssd/models/gguf/DSR1Q38BQ8/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf \
-a dsq3q8 \
--host 0.0.0.0 \
--port 9001 \
--api-key Llh123456@ \
--ctx-size 55000 \
--no-webui \
--n-gpu-layers 37 \
--mlock \
--main-gpu 0 \
--flash-attn \
--defrag-thold 0.2 \
--split-mode layer
# --tensor-split "0,0,0,0,0.3"  (disabled; note that a commented line inside a backslash continuation would cut the command short, so it is moved out of the command)
unset CUDA_VISIBLE_DEVICES
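For context on the environment variables used in the launches above (a sketch of what they do, per NVIDIA's CUDA environment-variable documentation): `CUDA_DEVICE_ORDER=PCI_BUS_ID` makes the CUDA runtime enumerate GPUs in PCI bus order, and `CUDA_VISIBLE_DEVICES` then restricts which physical GPUs the process can see; the visible GPUs are renumbered from 0 inside the process, which is why the second log below reports only a single CUDA device.

```shell
# Enumerate GPUs in PCI bus order, then expose only physical GPU 4;
# inside the process that GPU appears as CUDA device 0.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=4
echo "visible: $CUDA_VISIBLE_DEVICES"
```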
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=1,2,3
./llama-server \
--model /mnt/ssd/models/gguf/Qwen_QwQ-32B-Q8_0/qwq-32b-q8_0.gguf \
-a qwq32q8 \
--host 0.0.0.0 \
--port 9000 \
--api-key Llh123456@ \
--ctx-size 110000 \
--no-webui \
--n-gpu-layers 65 \
--mlock \
--tensor-split "0.3,0.3,0.3" \
--main-gpu 0 \
--flash-attn \
--defrag-thold 0.2 \
--split-mode layer
unset CUDA_VISIBLE_DEVICES
Problem description & steps to reproduce
I have tried compiling many versions. The problems started after build b5298; before that, everything was normal. With build b5298 none of the three models loads successfully, but with build b5297 everything runs perfectly.
Here are the logs for the three launches.
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
Device 2: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
Device 3: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
Device 4: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
build: 5298 (141a908) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 32, n_threads_batch = 32, total_threads = 32
system_info: n_threads = 32 (n_threads_batch = 32) / 32 | CUDA : ARCHS = 500,610,700,750,800,860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Web UI is disabled
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 9000, http threads: 31
main: loading model
srv load_model: loading model '/mnt/ssd/models/gguf/Openhands-32B-Q8_0/all-hands_openhands-lm-32b-v0.1-Q8_0.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3080 Ti) - 11567 MiB free
(stuck here; no further output after this line)
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
build: 5298 (141a908) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 32, n_threads_batch = 32, total_threads = 32
system_info: n_threads = 32 (n_threads_batch = 32) / 32 | CUDA : ARCHS = 500,610,700,750,800,860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Web UI is disabled
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 9001, http threads: 31
main: loading model
srv load_model: loading model '/mnt/ssd/models/gguf/DSR1Q38BQ8/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf'
(stuck here; no further output after this line)
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
Device 2: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
build: 5298 (141a908) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 32, n_threads_batch = 32, total_threads = 32
system_info: n_threads = 32 (n_threads_batch = 32) / 32 | CUDA : ARCHS = 500,610,700,750,800,860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Web UI is disabled
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 9000, http threads: 31
main: loading model
srv load_model: loading model '/mnt/ssd/models/gguf/Qwen_QwQ-32B-Q8_0/qwq-32b-q8_0.gguf'
(stuck here; no further output after this line)
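When the load hangs like this with no further output, attaching a debugger to the stuck process can show which call it is blocked in; a sketch, assuming gdb is installed and llama-server is the only matching process:

```shell
# Dump backtraces of all threads of the hung server, then detach
# (-n picks the newest matching process; -batch exits gdb afterwards).
gdb -p "$(pgrep -n llama-server)" -batch -ex 'thread apply all bt'
```

Including such a backtrace in the report would show whether the hang is in CUDA initialization, model mmap/mlock, or elsewhere.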
The commands I use for compilation:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j30 --clean-first
Could the problem be related to these build commands?
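Since b5297 works and b5298 hangs, the first bad commit could be narrowed down with git bisect; a sketch, assuming the llama.cpp release tags b5297 and b5298 are present in the checkout (if the two tags are only one commit apart, bisect reports it immediately):

```shell
# Bisect between the last good (b5297) and first bad (b5298) release tags.
git bisect start b5298 b5297
# At each step, rebuild and test whether a model loads:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j30
# ./build/bin/llama-server --model ...  (hangs on bad commits; Ctrl+C if it loads)
git bisect good    # or: git bisect bad
# When finished, restore the original HEAD:
git bisect reset
```

The resulting commit hash is exactly what the "First Bad Commit" field below asks for.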
First Bad Commit
No response