Name and Version
llama-server.exe --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (Intel Corporation) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 32768 | matrix cores: none
version: 4880 (2048b59)
built with MSVC 19.43.34808.0 for x64
Operating systems
Windows
GGML backends
Vulkan
Hardware
Intel(R) Iris(R) Xe Graphics (Intel Corporation) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 32768 | matrix cores: none
Models
gemma3:12b
Problem description & steps to reproduce
llama-server is unable to load the gemma3:12b GGUF model (the blob pulled via Ollama). Loading aborts with: error loading model hyperparameters: key not found in model: gemma3.attention.layer_norm_rms_epsilon (full log below).
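If it helps with triage, whether the key is actually absent from the file can be checked directly with the GGUFReader class from llama.cpp's gguf-py package; a minimal sketch, assuming the package is installed (pip install gguf) and using the blob path from the log below:

# Minimal sketch: check whether the metadata key llama.cpp reports as
# missing exists in the GGUF file. Assumes llama.cpp's gguf-py package
# is installed; the path is the Ollama blob from the log below.
from gguf import GGUFReader

model_path = r"D:\OllamaModels\blobs\sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3"
reader = GGUFReader(model_path)

# reader.fields maps metadata key -> ReaderField
key = "gemma3.attention.layer_norm_rms_epsilon"
print(f"{key} present: {key in reader.fields}")

# For comparison, list the gemma3.attention.* keys the file does contain
for name in reader.fields:
    if name.startswith("gemma3.attention."):
        print(name)

On this file the check prints the five gemma3.attention.* keys shown in the metadata dump below, without layer_norm_rms_epsilon.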
First Bad Commit
No response
Relevant log output
llama-server.exe -m %file_path_gemma3_12b% --no-mmap -c 16384 -np 1 -ngl 50 --temp 0.1 -t 9 -tb 8 -C FF000 --no-perf --host 0.0.0.0 --port 3000
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (Intel Corporation) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 32768 | matrix cores: none
Not enough set bits in CPU mask (8) to satisfy requested thread count: 9
Not enough set bits in CPU mask (8) to satisfy requested thread count: 9
build: 4880 (2048b591) with MSVC 19.43.34808.0 for x64
system info: n_threads = 9, n_threads_batch = 8, total_threads = 20
system_info: n_threads = 9 (n_threads_batch = 8) / 20 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: HTTP server is listening, hostname: 0.0.0.0, port: 3000, http threads: 19
main: loading model
srv load_model: loading model 'D:\OllamaModels\blobs\sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3'
llama_model_load_from_file_impl: using device Vulkan0 (Intel(R) Iris(R) Xe Graphics) - 16224 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 1065 tensors from D:\OllamaModels\blobs\sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: gemma3.attention.head_count u32 = 16
llama_model_loader: - kv 1: gemma3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 2: gemma3.attention.key_length u32 = 256
llama_model_loader: - kv 3: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 4: gemma3.attention.value_length u32 = 256
llama_model_loader: - kv 5: gemma3.block_count u32 = 48
llama_model_loader: - kv 6: gemma3.context_length u32 = 8192
llama_model_loader: - kv 7: gemma3.embedding_length u32 = 3840
llama_model_loader: - kv 8: gemma3.feed_forward_length u32 = 15360
llama_model_loader: - kv 9: gemma3.vision.attention.head_count u32 = 16
llama_model_loader: - kv 10: gemma3.vision.attention.layer_norm_epsilon f32 = 0.000001
llama_model_loader: - kv 11: gemma3.vision.block_count u32 = 27
llama_model_loader: - kv 12: gemma3.vision.embedding_length u32 = 1152
llama_model_loader: - kv 13: gemma3.vision.feed_forward_length u32 = 4304
llama_model_loader: - kv 14: gemma3.vision.image_size u32 = 896
llama_model_loader: - kv 15: gemma3.vision.num_channels u32 = 3
llama_model_loader: - kv 16: gemma3.vision.patch_size u32 = 14
llama_model_loader: - kv 17: general.architecture str = gemma3
llama_model_loader: - kv 18: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 21: tokenizer.ggml.add_padding_token bool = false
llama_model_loader: - kv 22: tokenizer.ggml.add_unknown_token bool = false
llama_model_loader: - kv 23: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 24: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,514906] = ["\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n", ...
llama_model_loader: - kv 26: tokenizer.ggml.model str = llama
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
llama_model_loader: - kv 29: tokenizer.ggml.scores arr[f32,262145] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,262145] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 31: tokenizer.ggml.tokens arr[str,262145] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 32: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: general.file_type u32 = 15
llama_model_loader: - type f32: 563 tensors
llama_model_loader: - type f16: 165 tensors
llama_model_loader: - type q4_K: 290 tensors
llama_model_loader: - type q6_K: 47 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 7.57 GiB (5.34 BPW)
llama_model_load: error loading model: error loading model hyperparameters: key not found in model: gemma3.attention.layer_norm_rms_epsilon
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'D:\OllamaModels\blobs\sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3'
srv load_model: failed to load model, 'D:\OllamaModels\blobs\sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3'
srv operator (): operator (): cleaning up before exit...
main: exiting due to model loading error