| Name | Modified | Size |
|------|----------|------|
| koboldcpp_cu12.exe | 2025-05-02 | 642.3 MB |
| koboldcpp.exe | 2025-05-02 | 528.1 MB |
| koboldcpp-mac-arm64 | 2025-05-02 | 27.1 MB |
| koboldcpp-linux-x64-nocuda | 2025-05-02 | 78.0 MB |
| koboldcpp-linux-x64-cuda1210 | 2025-05-02 | 736.5 MB |
| koboldcpp-linux-x64-cuda1150 | 2025-05-02 | 645.0 MB |
| koboldcpp_oldcpu.exe | 2025-05-02 | 528.3 MB |
| koboldcpp_nocuda.exe | 2025-05-02 | 77.3 MB |
| koboldcpp-1.90.2 source code.tar.gz | 2025-05-02 | 27.7 MB |
| koboldcpp-1.90.2 source code.zip | 2025-05-02 | 28.1 MB |
| README.md | 2025-05-02 | 5.4 kB |

Totals: 11 items, 3.3 GB

koboldcpp-1.90.2

Qwen of the line edition

  • NEW: Android Termux Auto-Installer - You can now set up KoboldCpp via Termux on Android with a single command, which triggers an automated installation script. Check it out here. Install Termux from F-Droid, then run the command with an internet connection available, and everything will be downloaded, compiled, and configured for instant use with a Gemma3-1B model.
  • Merged support for Qwen3. Also automatically triggers --nobostoken when a model's metadata explicitly indicates no_bos_token; it can still be enabled manually for other models.
  • Fixes for THUDM GLM-4, note that this model enforces --blasbatchsize 16 or smaller in order to get coherent output.
  • Merged the overhaul of the Qwen2.5-VL projector. Both old (HimariO version) and new (ngxson version) mmprojs should work, retaining backwards compatibility; however, you should update to the new projectors.
  • Merged functioning Pixtral support. Note that Pixtral is very token-heavy (roughly 4000 tokens for a 1024px image); you can try increasing the maximum context size with --contextsize or lowering --visionmaxres.
  • Added support for OpenAI Structured Outputs in the chat completions API; the schema is also accepted when sent as a stringified JSON object in the "grammar" field. You can use this to enforce JSON outputs that follow a specific schema (see the sketch after this list).
  • --blasbatchsize -1 now exclusively uses a batch size of 1 when processing the prompt. --blasbatchsize 16 is also permitted, which replicates the old behavior (a batch of 16 does not trigger GEMM).
  • The KCPP API server now correctly handles fields that are explicitly set to null.
  • Fixed Zenity/YAD detection not working correctly in the previous version.
  • Improved input sanitization when launching with a URL passed as the model parameter. Also, for better security, --onready shell commands can still be used as a CLI parameter but can no longer be embedded into a .kcppt or .kcpps file.
  • More robust checks for the system glslc when building Vulkan shaders.
  • Improved automatic GPU layer selection when loading multi-part GGUF models (on a single GPU); also slightly tightened memory estimation and now accounts for quantized KV when guessing layers.
  • Added a new flag, --mmprojcpu, that lets you load and run the projector on the CPU while keeping the main model on the GPU.
  • noscript mode now randomizes generated image names to prevent browser caching.
  • Updated Kobold Lite with multiple fixes and improvements:
  • Increased the default number of generated tokens and the slider limits (can be overridden)
  • Added ChatGLM-4 and Qwen3 (ChatML think/no-thinking) presets. You can disable thinking in Qwen3 by switching between ChatML (No Thinking) and normal ChatML.
  • Added a toggle to disable LaTeX rendering while leaving markdown enabled
  • Merged fixes and improvements from upstream
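
As a rough illustration of the Structured Outputs feature above, the sketch below sends an OpenAI-style chat completions request to a locally running KoboldCpp instance and asks for output matching a JSON schema. The /v1/chat/completions path, the default port 5001, and the exact response_format handling are assumptions based on the OpenAI-compatible API mentioned here; check the wiki or --help for the authoritative parameters.

```python
# Hedged sketch: enforce a JSON schema via OpenAI-style Structured Outputs.
# Assumes a KoboldCpp server on the default port 5001 exposing an
# OpenAI-compatible /v1/chat/completions endpoint; field names follow the
# OpenAI Structured Outputs convention and may differ in practice.
import requests

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "rating": {"type": "integer", "minimum": 1, "maximum": 10},
    },
    "required": ["title", "rating"],
}

payload = {
    "model": "koboldcpp",  # name is informational for a single-model server
    "messages": [{"role": "user", "content": "Review the film Dune as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "review", "schema": schema},
    },
    # Per the note above, the schema can alternatively be sent as a
    # stringified JSON object in a "grammar" field.
}

resp = requests.post("http://localhost:5001/v1/chat/completions",
                     json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```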

  • Hotfix 1.90.1:

  • Reworked thinking-tag handling. ChatML (No Thinking) is removed; instead, thinking can be forced or prevented for all instruct formats (Settings > Tokens > CoT).
  • More GLM4 fixes; it now works fine with larger batches on CUDA. On Vulkan, the GLM4 ubatch size is still limited to 16.
  • Some chat completions parsing fixes.
  • Updated Lite with a new scenario.

  • Hotfix 1.90.2:

  • Pulled further upstream updates. The massive file size increase was caused by https://github.com/ggml-org/llama.cpp/pull/13199; I can't do anything about it. Don't ask me.
  • NEW: Added a Hugging Face model search tool! Now you can find, browse, and download models straight from Hugging Face.
  • Increased --defaultgenamount range
  • Attempted to fix the YAD GUI launcher.
  • Added a rudimentary WebSocket spoof for ComfyUI, increasing ComfyUI compatibility.
  • Fixed a few parsing issues for nulled chat completions params
  • Automatically handle multipart file downloading, up to 9 parts.
  • Fixed the RoPE config sometimes not saving correctly to .kcpps files.
  • Merged fixes for Plamo models, thanks to @CISC

To use, download and run koboldcpp.exe, which is a one-file pyinstaller. If you don't need CUDA, you can use koboldcpp_nocuda.exe, which is much smaller. If you have an Nvidia GPU but an older CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe. If you have a newer Nvidia GPU, you can use the CUDA 12 version, koboldcpp_cu12.exe (much larger, slightly faster). If you're using Linux, select the appropriate Linux binary instead (not an exe). If you're on a modern macOS machine (M1, M2, M3), you can try the koboldcpp-mac-arm64 binary. If you're using AMD, we recommend trying the Vulkan option (available in all releases) first for best support. Alternatively, you can try koboldcpp_rocm from YellowRoseCx's fork here.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI. Once loaded, you can connect at http://localhost:5001 (or use the full KoboldAI client).
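
Once the server is listening, a short script can also hit the API directly. The sketch below is a minimal example, assuming the default port 5001 and the KoboldAI-style /api/v1/generate endpoint with "prompt" and "max_length" fields; verify the exact endpoint and field names against the wiki if they differ.

```python
# Hedged sketch: minimal text generation request to a local KoboldCpp server.
# Assumes the default port 5001 and a KoboldAI-style /api/v1/generate endpoint
# that accepts "prompt"/"max_length" and returns {"results": [{"text": ...}]}.
import requests

payload = {
    "prompt": "Once upon a time,",
    "max_length": 80,    # number of tokens to generate
    "temperature": 0.7,  # sampling temperature
}

resp = requests.post("http://localhost:5001/api/v1/generate",
                     json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["results"][0]["text"])
```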

For more information, be sure to run the program from the command line with the --help flag. You can also refer to the readme and the wiki.

Source: README.md, updated 2025-05-02