
Misc. bug: Stuck while loading the model #14114

Open
@fgfg54321

Description

Name and Version

b5298

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

./llama-server \
    --model /mnt/ssd/models/gguf/Openhands-32B-Q8_0/all-hands_openhands-lm-32b-v0.1-Q8_0.gguf \
    -a oh32q8 \
    --host 0.0.0.0 \
    --port 9000 \
    --api-key Llh123456@ \
    --ctx-size 128000 \
    --no-webui \
    --n-gpu-layers 65 \
    --mlock \
    --tensor-split "0.15,0.15,0.25,0.15,0.15,0.15" \
    --main-gpu 0 \
    --flash-attn \
    --defrag-thold 0.2 \
    --split-mode layer
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=4
./llama-server \
    --model /mnt/ssd/models/gguf/DSR1Q38BQ8/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf \
    -a dsq3q8 \
    --host 0.0.0.0 \
    --port 9001 \
    --api-key Llh123456@ \
    --ctx-size 55000 \
    --no-webui \
    --n-gpu-layers 37 \
    --mlock \
    --main-gpu 0 \
    --flash-attn \
    --defrag-thold 0.2 \
    --split-mode layer
    # --tensor-split "0,0,0,0,0.3"  (disabled; a commented flag inside the "\" continuation would truncate the command)
unset CUDA_VISIBLE_DEVICES
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=1,2,3
./llama-server \
    --model /mnt/ssd/models/gguf/Qwen_QwQ-32B-Q8_0/qwq-32b-q8_0.gguf \
    -a qwq32q8 \
    --host 0.0.0.0 \
    --port 9000 \
    --api-key Llh123456@ \
    --ctx-size 110000 \
    --no-webui \
    --n-gpu-layers 65 \
    --mlock \
    --tensor-split "0.3,0.3,0.3" \
    --main-gpu 0 \
    --flash-attn \
    --defrag-thold 0.2 \
    --split-mode layer
unset CUDA_VISIBLE_DEVICES
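For context on the `--tensor-split` values above: llama.cpp normalizes the per-device fractions before distributing layers. A rough illustrative sketch of the proportional split (not the project's exact allocation code, which also accounts for free VRAM and buffer overheads), using the third command's 65 layers over three equal fractions:

```python
def split_layers(fractions, n_layers):
    """Proportionally assign n_layers across devices by normalized fractions.

    Illustrative only; llama.cpp's real splitting logic differs in detail.
    """
    total = sum(fractions)
    # Cumulative layer boundary per device, rounded to whole layers.
    cuts = [round(n_layers * sum(fractions[:i + 1]) / total)
            for i in range(len(fractions))]
    # Per-device counts are the differences between consecutive boundaries.
    return [cuts[0]] + [cuts[i] - cuts[i - 1] for i in range(1, len(cuts))]

print(split_layers([0.3, 0.3, 0.3], 65))  # -> [22, 21, 22]
```

Note that the first command passes six fractions while only five CUDA devices are reported in the log; whether the extra value is silently ignored may be worth checking separately.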

Problem description & steps to reproduce

I have tried compiling many versions. The problem first appeared in build b5298; before that, everything was normal. With build b5298 I was unable to load any of the three models, but with build b5297 everything ran perfectly.
Here is the log.


ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
Device 2: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
Device 3: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
Device 4: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
build: 5298 (141a908) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 32, n_threads_batch = 32, total_threads = 32

system_info: n_threads = 32 (n_threads_batch = 32) / 32 | CUDA : ARCHS = 500,610,700,750,800,860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

Web UI is disabled
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 9000, http threads: 31
main: loading model
srv load_model: loading model '/mnt/ssd/models/gguf/Openhands-32B-Q8_0/all-hands_openhands-lm-32b-v0.1-Q8_0.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3080 Ti) - 11567 MiB free

(stuck here; no further output)

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
build: 5298 (141a908) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 32, n_threads_batch = 32, total_threads = 32

system_info: n_threads = 32 (n_threads_batch = 32) / 32 | CUDA : ARCHS = 500,610,700,750,800,860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

Web UI is disabled
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 9001, http threads: 31
main: loading model
srv load_model: loading model '/mnt/ssd/models/gguf/DSR1Q38BQ8/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf'

(stuck here; no further output)

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
Device 2: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
build: 5298 (141a908) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 32, n_threads_batch = 32, total_threads = 32

system_info: n_threads = 32 (n_threads_batch = 32) / 32 | CUDA : ARCHS = 500,610,700,750,800,860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

Web UI is disabled
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 9000, http threads: 31
main: loading model
srv load_model: loading model '/mnt/ssd/models/gguf/Qwen_QwQ-32B-Q8_0/qwq-32b-q8_0.gguf'

(stuck here; no further output)

The commands I use for compilation:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j30 --clean-first

Could it be related to these compilation modifications?
(screenshot of the compile-time modifications attached)

First Bad Commit

No response
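Since b5297 loads fine and b5298 hangs, the first bad commit can usually be pinned with `git bisect` between those two release tags. A sketch of the approach, assuming a hypothetical `bisect-test.sh` wrapper script that rebuilds and attempts a model load; the helper below maps the load outcome to the exit codes `git bisect run` expects:

```shell
# Map a model-load outcome to the exit codes `git bisect run` understands:
#   0 = good (model loaded), 1 = bad (stuck), 125 = untestable (build failed).
bisect_status() {
  case "$1" in
    loaded) return 0 ;;    # model loaded -> commit is good
    hang)   return 1 ;;    # stuck while loading -> commit is bad
    *)      return 125 ;;  # build failure etc. -> skip this commit
  esac
}

# Usage, run inside a llama.cpp clone (bisect-test.sh is hypothetical --
# it would run cmake, launch llama-server under `timeout`, and then call
# bisect_status with the observed outcome):
#   git bisect start b5298 b5297      # first-bad tag, then last-good tag
#   git bisect run ./bisect-test.sh
```

Reporting the commit that bisect identifies would fill in the "First Bad Commit" field above.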
