Description
Name and Version
version: 5298 (141a908)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin24.3.0
Operating systems
Mac
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
./llama-cli -m Qwen3-235B-A22B.i1-IQ3_M.gguf --no-mmap -fa -c 8192
./llama-cli -m Qwen3-235B-A22B.i1-IQ3_M.gguf --no-mmap -fa -c 16384
Problem description & steps to reproduce
- Download https://huggingface.co/mradermacher/Qwen3-235B-A22B-i1-GGUF (IQ3_M)
- `sudo nano /etc/sysctl.conf` and add the line `iogpu.wired_limit_mb=122880`
- Restart the computer
- Run `./llama-cli -m Qwen3-235B-A22B.i1-IQ3_M.gguf --no-mmap -fa -c 8192` (or `-c 16384`)
- Observe slow loading times and unnecessary swapping in the Memory section of Activity Monitor
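For reference, this is the persistent form of the wired-memory limit used in the steps above (a sketch, assuming macOS on Apple Silicon):

```
# /etc/sysctl.conf — raise the GPU wired-memory limit to 120 GiB (applied at boot)
iogpu.wired_limit_mb=122880
```

On recent macOS versions the same value can reportedly also be set at runtime with `sudo sysctl iogpu.wired_limit_mb=122880`, though a runtime setting does not survive a reboot.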
Videos showing extended swap/unswap times:
8k ctx: https://github.com/user-attachments/assets/7cde30ce-3770-4582-85cf-2c4382f527dc
16k ctx: https://github.com/user-attachments/assets/b97a88c5-36f5-4477-8525-fd78050eadc2
(The point where memory pressure drops at the end is when the model finishes loading.)

An 8k-16k context should only use 100,127-101,671 MiB of memory, per the calculations below. An M3 Max with 131,072 MiB of memory and a 122,880 MiB VRAM limit should be able to handle this without the long swap/unswap process.
8k ctx

| Component | GPU (MiB) | CPU (MiB) | Total (MiB) |
|---|---|---|---|
| Model buffer | 98,030.93 | 255.02 | 98,285.95 |
| KV-cache buffer | 1,504.00 | 0.00 | 1,504.00 |
| Compute buffer | 312.75 | 24.01 | 336.76 |
| Output buffer | 0.00 | 0.58 | 0.58 |
| Grand total | 99,847.68 | 279.61 | 100,127.29 |
16k ctx

| Component | GPU (MiB) | CPU (MiB) | Total (MiB) |
|---|---|---|---|
| Model buffer | 98,030.93 | 255.02 | 98,285.95 |
| KV-cache buffer | 3,008.00 | 0.00 | 3,008.00 |
| Compute buffer | 336.75 | 40.01 | 376.76 |
| Output buffer | 0.00 | 0.58 | 0.58 |
| Grand total | 101,375.68 | 295.61 | 101,671.29 |
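The grand totals in the two tables can be cross-checked with a short script (the buffer names and grouping are mine; the figures are copied from the tables above):

```python
# Recompute the per-context-size memory totals claimed in the tables.
# Each entry is (GPU MiB, CPU MiB), copied from the bug report.
buffers = {
    8192:  {"model": (98030.93, 255.02), "kv": (1504.00, 0.00),
            "compute": (312.75, 24.01), "output": (0.00, 0.58)},
    16384: {"model": (98030.93, 255.02), "kv": (3008.00, 0.00),
            "compute": (336.75, 40.01), "output": (0.00, 0.58)},
}

for ctx, parts in buffers.items():
    gpu = sum(g for g, _ in parts.values())
    cpu = sum(c for _, c in parts.values())
    print(f"{ctx:>5} ctx: GPU {gpu:,.2f} + CPU {cpu:,.2f} = {gpu + cpu:,.2f} MiB")
```

Both totals land well under the 122,880 MiB wired limit, and the KV cache doubles (1,504 → 3,008 MiB) as expected when the context doubles from 8k to 16k.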
First Bad Commit
No response