Description
Name and Version
build: 5298 (141a908), x86_64-linux-gnu (see the log below)
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./llama-server \
--model /mnt/ssd/models/gguf/Openhands-32B-Q8_0/all-hands_openhands-lm-32b-v0.1-Q8_0.gguf \
-a oh32q8 \
--host 0.0.0.0 \
--port 9000 \
--api-key Llh123456@ \
--ctx-size 128000 \
--no-webui \
--n-gpu-layers 65 \
--mlock \
--tensor-split "0.15,0.15,0.25,0.15,0.15,0.15" \
--main-gpu 0 \
--flash-attn \
--defrag-thold 0.2 \
--split-mode layer
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=4
./llama-server \
--model /mnt/ssd/models/gguf/DSR1Q38BQ8/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf \
-a dsq3q8 \
--host 0.0.0.0 \
--port 9001 \
--api-key Llh123456@ \
--ctx-size 55000 \
--no-webui \
--n-gpu-layers 37 \
--mlock \
--main-gpu 0 \
--flash-attn \
--defrag-thold 0.2 \
--split-mode layer
# --tensor-split "0,0,0,0,0.3"  (disabled; note that a commented line inside a backslash continuation would cut the command short, so it is moved out of the command)
unset CUDA_VISIBLE_DEVICES
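For context on the environment variables used in the launches above (a sketch of what they do, per NVIDIA's CUDA environment-variable documentation): `CUDA_DEVICE_ORDER=PCI_BUS_ID` makes the CUDA runtime enumerate GPUs in PCI bus order, and `CUDA_VISIBLE_DEVICES` then restricts which physical GPUs the process can see; the visible GPUs are renumbered from 0 inside the process, which is why the second log below reports only a single CUDA device.

```shell
# Enumerate GPUs in PCI bus order, then expose only physical GPU 4;
# inside the process that GPU appears as CUDA device 0.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=4
echo "visible: $CUDA_VISIBLE_DEVICES"
```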
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=1,2,3
./llama-server \
--model /mnt/ssd/models/gguf/Qwen_QwQ-32B-Q8_0/qwq-32b-q8_0.gguf \
-a qwq32q8 \
--host 0.0.0.0 \
--port 9000 \
--api-key Llh123456@ \
--ctx-size 110000 \
--no-webui \
--n-gpu-layers 65 \
--mlock \
--tensor-split "0.3,0.3,0.3" \
--main-gpu 0 \
--flash-attn \
--defrag-thold 0.2 \
--split-mode layer
unset CUDA_VISIBLE_DEVICES
Problem description & steps to reproduce
I have tried compiling many versions. The problems started after build b5298; before that, everything was normal. With build b5298 none of the three models loads successfully, but with build b5297 everything runs perfectly.
Here are the logs for the three launches.
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
Device 2: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
Device 3: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
Device 4: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
build: 5298 (141a908) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 32, n_threads_batch = 32, total_threads = 32
system_info: n_threads = 32 (n_threads_batch = 32) / 32 | CUDA : ARCHS = 500,610,700,750,800,860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Web UI is disabled
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 9000, http threads: 31
main: loading model
srv load_model: loading model '/mnt/ssd/models/gguf/Openhands-32B-Q8_0/all-hands_openhands-lm-32b-v0.1-Q8_0.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3080 Ti) - 11567 MiB free
(stuck here; no further output after this line)
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
build: 5298 (141a908) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 32, n_threads_batch = 32, total_threads = 32
system_info: n_threads = 32 (n_threads_batch = 32) / 32 | CUDA : ARCHS = 500,610,700,750,800,860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Web UI is disabled
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 9001, http threads: 31
main: loading model
srv load_model: loading model '/mnt/ssd/models/gguf/DSR1Q38BQ8/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf'
(stuck here; no further output after this line)
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
Device 2: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
build: 5298 (141a908) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 32, n_threads_batch = 32, total_threads = 32
system_info: n_threads = 32 (n_threads_batch = 32) / 32 | CUDA : ARCHS = 500,610,700,750,800,860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Web UI is disabled
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 9000, http threads: 31
main: loading model
srv load_model: loading model '/mnt/ssd/models/gguf/Qwen_QwQ-32B-Q8_0/qwq-32b-q8_0.gguf'
(stuck here; no further output after this line)
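When the load hangs like this with no further output, attaching a debugger to the stuck process can show which call it is blocked in; a sketch, assuming gdb is installed and llama-server is the only matching process:

```shell
# Dump backtraces of all threads of the hung server, then detach
# (-n picks the newest matching process; -batch exits gdb afterwards).
gdb -p "$(pgrep -n llama-server)" -batch -ex 'thread apply all bt'
```

Including such a backtrace in the report would show whether the hang is in CUDA initialization, model mmap/mlock, or elsewhere.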
The commands I use for compilation:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j30 --clean-first
Could the problem be related to these build commands?
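Since b5297 works and b5298 hangs, the first bad commit could be narrowed down with git bisect; a sketch, assuming the llama.cpp release tags b5297 and b5298 are present in the checkout (if the two tags are only one commit apart, bisect reports it immediately):

```shell
# Bisect between the last good (b5297) and first bad (b5298) release tags.
git bisect start b5298 b5297
# At each step, rebuild and test whether a model loads:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j30
# ./build/bin/llama-server --model ...  (hangs on bad commits; Ctrl+C if it loads)
git bisect good    # or: git bisect bad
# When finished, restore the original HEAD:
git bisect reset
```

The resulting commit hash is exactly what the "First Bad Commit" field below asks for.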
First Bad Commit
No response