Misc. bug: Qwen3 "enable_thinking" parameter not working #13160

Open

Description

@celsowm

Name and Version

llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 5199 (ced44be)
built with MSVC 19.41.34120.0 for x64

Operating systems

Windows 11

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server -m Qwen3-14B-Q5_K_M.gguf

Problem description & steps to reproduce

The "enable_thinking": false parameter has no effect at all when sent in a request to llama-server (despite appearing in Alibaba's examples).
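
For reference, a minimal repro against llama-server's OpenAI-compatible endpoint (a sketch, assuming the default port 8080 and a server started with --jinja so the GGUF's chat template is applied): the response still contains the <think>...</think> reasoning block even with the flag set.

# Hypothetical repro: same request shape as the vLLM example below, pointed at llama-server
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "chat_template_kwargs": {"enable_thinking": false}
}'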

SGLang and vLLM support this via "chat_template_kwargs":

https://qwen.readthedocs.io/en/latest/deployment/sglang.html#thinking-non-thinking-modes
https://qwen.readthedocs.io/en/latest/deployment/vllm.html#thinking-non-thinking-modes

curl http://localhost:30000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "max_tokens": 8192,
  "presence_penalty": 1.5,
  "chat_template_kwargs": {"enable_thinking": false}
}'
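
Until llama-server honors "chat_template_kwargs", Qwen3's documented prompt-level soft switch may serve as a workaround: appending /no_think to the user message disables thinking for that turn, with no server-side template support needed. A sketch, again assuming llama-server's default port 8080; this relies on Qwen3's own soft-switch behavior, not on llama.cpp:

# Possible workaround: Qwen3 soft switch appended to the user message
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models. /no_think"}
  ],
  "temperature": 0.7,
  "top_p": 0.8
}'

Note that with the soft switch the model may still emit an empty <think></think> pair, which the client has to strip.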

First Bad Commit

No response

Relevant log output
