Frequently asked questions

Does quantization make the model slower or faster?

Usually faster, though it depends on where the bottleneck is. Smaller weights need less memory bandwidth to move, and modern runtimes have hand-tuned kernels for low-bit math. On compute-bound workloads (large batches, long prompt processing) the speedup is small, and dequantization overhead can cancel it out; on memory-bound workloads (typical for batch size 1 chat) it can be substantial.
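To make the memory-bound case concrete, here is a rough back-of-the-envelope sketch. It assumes that generating one token at batch size 1 streams every weight from memory once, so decode speed is capped at bandwidth divided by model size. The 1 TB/s bandwidth figure is an illustrative assumption, not a measurement of any particular GPU, and the 40 GB figure is simply a point inside the 4-bit range quoted in this article.

```python
def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if each token streams all weights once."""
    return bandwidth_bytes_per_s / model_bytes

GB = 1e9
bandwidth = 1000 * GB  # ASSUMPTION: 1 TB/s memory bandwidth, illustrative only

fp16_bound = max_tokens_per_second(140 * GB, bandwidth)  # 70B model at 16-bit
q4_bound = max_tokens_per_second(40 * GB, bandwidth)     # same model at ~4-bit

print(f"FP16 bound:  {fp16_bound:.1f} tok/s")
print(f"4-bit bound: {q4_bound:.1f} tok/s")
print(f"ratio:       {q4_bound / fp16_bound:.1f}x")
```

Real throughput will land below these bounds, and the gap narrows as batch size grows and the workload becomes compute-bound.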

Is quantized fine-tuning a thing?

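Yes. QLoRA fine-tunes a 4-bit quantized base model by training small low-rank adapters in higher precision while the base weights stay frozen. The original QLoRA work reported fine-tuning a 65B model on a single 48 GB GPU; a single 24 GB card is a more realistic fit for models up to roughly the 30B class, since a 70B model's 4-bit weights alone take 35 GB or more. The results are reported to be competitive with 16-bit LoRA on many tasks.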
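A quick count shows why the adapters are cheap to train. This is a sketch of the general low-rank adapter arithmetic, not QLoRA's actual configuration: the hidden size and rank below are illustrative assumptions.

```python
def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in a rank-r adapter: B is d_out x r, A is r x d_in."""
    return rank * (d_out + d_in)

d = 8192   # ASSUMPTION: hidden size, illustrative
rank = 16  # ASSUMPTION: adapter rank, illustrative

full = d * d                      # parameters in one frozen d x d weight matrix
adapter = lora_params(d, d, rank)  # parameters actually trained for that matrix

print(f"frozen matrix: {full:,} params")
print(f"adapter:       {adapter:,} params ({adapter / full:.2%} of the matrix)")
```

Because only the adapter receives gradients and optimizer state, the memory cost of training is dominated by the frozen 4-bit base weights plus activations.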

How do I tell if a quant level is too aggressive?

Run your eval set at both Q4_K_M and Q6_K and compare the outputs. If you cannot tell them apart on your task, Q4 is fine. If Q4 starts producing different code, different math, or different facts on a noticeable percentage of runs, you have crossed the line for that task and should step up.
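One way to put a number on "a noticeable percentage of runs" is a simple disagreement rate between the two sets of outputs. The sketch below assumes you have already collected the outputs yourself; the sample strings are hypothetical placeholders, and the default exact-match comparison is only sensible for short, checkable answers.

```python
def disagreement_rate(outputs_a, outputs_b, same=lambda a, b: a.strip() == b.strip()):
    """Fraction of prompts on which two quant levels gave different answers."""
    if len(outputs_a) != len(outputs_b):
        raise ValueError("both runs must cover the same prompts")
    if not outputs_a:
        raise ValueError("need at least one prompt")
    diffs = sum(1 for a, b in zip(outputs_a, outputs_b) if not same(a, b))
    return diffs / len(outputs_a)

# Hypothetical outputs for the same three prompts at two quant levels.
q4_outputs = ["42", "def f(): return 1", "Paris"]
q6_outputs = ["42", "def f(): return 2", "Paris "]

rate = disagreement_rate(q4_outputs, q6_outputs)
print(f"disagreement: {rate:.0%}")
```

For free-form text, pass a task-specific `same` function (for example, comparing extracted final answers or test results for generated code), since exact string equality will flag harmless rewording.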

Quantization Explained for LLMs

Quantization is the trick that makes useful local LLMs possible. A 70B-parameter model stored at 16-bit precision has roughly 140 GB of weights; the same model quantized to a 4-bit GGUF (e.g., Q4_K_M) typically lands in the 35–45 GB range (model-dependent, plus a few percent of metadata), which is what lets it run on a pair of 24 GB consumer GPUs instead of needing a small datacenter.

What quantization actually does

Most modern LLMs are trained and checkpointed at 16-bit precision (BF16 or FP16). Two bytes per weight; a 70B model in BF16 is on the order of 140 GB of raw weights, ignoring optimizer state and metadata. Many inference stacks now run at 4-bit or 8-bit by default — full-precision inference is no longer the norm even on hosted setups.

Quantization replaces those 16-bit floats with smaller integer representations, plus a small amount of bookkeeping data (a scale factor, sometimes a zero-point) per group of weights. At 4 bits per weight, the same 70B model becomes roughly a quarter of the original size, give or take.
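The size arithmetic, including the bookkeeping overhead, can be sketched as follows. The group size of 32 and the 16-bit scale are assumptions chosen for illustration; real formats differ in group size, scale precision, and whether a zero-point is stored.

```python
def bits_per_weight(weight_bits: int, group_size: int,
                    scale_bits: int = 16, zero_point_bits: int = 0) -> float:
    """Effective bits per weight once per-group bookkeeping is amortized."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size

def model_size_gb(n_params: float, bpw: float) -> float:
    """Raw weight size in GB (1 GB = 1e9 bytes), ignoring file metadata."""
    return n_params * bpw / 8 / 1e9

n_params = 70e9

fp16_gb = model_size_gb(n_params, 16)
bpw_4bit = bits_per_weight(weight_bits=4, group_size=32)  # ASSUMPTION: 32 weights per scale
q4_gb = model_size_gb(n_params, bpw_4bit)

print(f"16-bit: {fp16_gb:.1f} GB")
print(f"4-bit:  {q4_gb:.1f} GB at {bpw_4bit} effective bits per weight")
```

The per-group scale is why a "4-bit" model comes out somewhat larger than exactly one quarter of its 16-bit size.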

The cost is precision. Each weight is now one of a small set of discrete values inside its group's scale (16 values for 4-bit). For the vast majority of weights this is fine, because neural networks are surprisingly robust to rounding. For the small minority of weights doing critical work, many modern schemes compensate, either by using calibration data to find the weights that matter most or by keeping selected tensors at higher precision.
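A minimal sketch of what happens to one group of weights is shown below. It is a plain min/max scheme with a scale and an offset, written for illustration only; it is not the algorithm used by any specific format, and the sample weights are made up.

```python
def quantize_group(weights, bits=4):
    """Map each weight to one of 2**bits evenly spaced values in [min, max]."""
    levels = 2 ** bits - 1             # 15 steps give 16 representable values
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 if all weights are equal
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Reconstruct approximate weights from integer codes."""
    return [lo + c * scale for c in codes]

weights = [-0.8, -0.1, 0.0, 0.05, 0.3, 0.7]  # hypothetical group of weights
codes, scale, lo = quantize_group(weights)
restored = dequantize_group(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:    ", codes)
print("restored: ", [round(r, 3) for r in restored])
print("max error:", round(max_error, 4), "(at most half a step:", round(scale / 2, 4), ")")
```

The rounding error per weight is bounded by half a quantization step, so a group with one large outlier gets a wide step and loses precision on all its other weights. That is the problem the more elaborate schemes try to work around.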

The quality cliff (heuristic, not a benchmark)

Empirically — and these are heuristics from community testing, not universal results — quality tends to drop gracefully from FP16 down to about 4 bits. Below 4 bits the drop gets sharper:

8-bit: typically near-indistinguishable from full precision on chat and writing tasks.
6-bit: typically very close to full precision in most use cases.
5-bit: usually subtle drop, sometimes visible on the hardest reasoning and code tasks.
4-bit: usually a small drop, broadly fine for chat / writing / summarization. Often the practical default.
3-bit: real and noticeable drop on math, code, and instruction following.
2-bit: heavy quality loss; the model still runs but is significantly degraded for most uses.

Where the cliff is for your model and your task is something you have to test, not something a chart can tell you universally. Pretrained model architecture, MoE vs dense, and the specific quantization scheme all matter.

How to pick a quant level

Two questions to answer:

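What is your VRAM budget? Subtract what the OS and display use, the context-window (KV cache) allocation, and any other GPU users. Divide what remains by the FP16 size of the model to get the fraction of the original size you can afford, then multiply by 16 to turn that into a bits-per-weight budget and see which quantization gets you under the line.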
How hard is your task? Chat and writing tend to tolerate 4-bit. Code generation, math, structured reasoning, and long-context retrieval often benefit from stepping up to Q5_K_M or Q6_K. Pick the quant that fits, then bump up one notch if your task is in the hard column and you have memory headroom — and verify on a small eval rather than trusting the heuristic.
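The VRAM question reduces to a few lines of arithmetic. In this sketch the 6 GB reserve is an assumed placeholder for OS, KV cache, and other GPU users; measure your own. The 140 GB and 40 GB figures are the approximate sizes used elsewhere in this article.

```python
def bits_per_weight_budget(vram_gb: float, reserved_gb: float, fp16_size_gb: float) -> float:
    """Largest average bits per weight that fits in the usable VRAM."""
    usable = vram_gb - reserved_gb
    if usable <= 0:
        raise ValueError("nothing left for weights after reservations")
    return 16 * usable / fp16_size_gb

def fits(file_size_gb: float, vram_gb: float, reserved_gb: float) -> bool:
    """Check a specific quantized file against the usable VRAM."""
    return file_size_gb <= vram_gb - reserved_gb

vram_gb = 48      # two 24 GB GPUs
reserved_gb = 6   # ASSUMPTION: OS, KV cache, other processes

budget = bits_per_weight_budget(vram_gb, reserved_gb, fp16_size_gb=140)
print(f"budget: {budget:.1f} bits per weight")
print("40 GB 4-bit file fits:", fits(40, vram_gb, reserved_gb))
```

Checking the actual file size of the quant you plan to download is more reliable than the bits-per-weight estimate, because effective bits per weight vary by scheme and model.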

The common upgrade path: start at Q4_K_M (or the equivalent 4-bit scheme your runtime supports), run a small battery of your real tasks, and then step up to Q5_K_M or Q6_K if anything fails in a way that smells like precision rather than capability.
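The upgrade path can be sketched as a small loop that walks the ladder from smallest to largest and stops at the first level that clears your bar. Here `pass_rate` is a hypothetical callable standing in for however you run your eval, and the scores in the example are made-up placeholders, not measurements.

```python
LADDER = ["Q4_K_M", "Q5_K_M", "Q6_K"]  # smallest first

def pick_quant(ladder, pass_rate, threshold):
    """Return the first (smallest) level whose eval pass rate meets the threshold."""
    for level in ladder:
        if pass_rate(level) >= threshold:
            return level
    return None  # no level passed: likely a capability problem, not precision

# Placeholder scores for illustration only; replace with your own eval results.
example_scores = {"Q4_K_M": 0.80, "Q5_K_M": 0.92, "Q6_K": 0.93}

choice = pick_quant(LADDER, example_scores.get, threshold=0.90)
print("chosen quant:", choice)
```

If even the highest level on the ladder misses the threshold, a higher-precision quant is unlikely to help and the task probably needs a different model.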

What quantization cannot fix

A quantized model is still the same model. Quantization cannot make a 7B model think like a 70B model; it can only make a 70B model fit. If your task needs a bigger model, you need a bigger model, not a better quant.

Comments 2

u/ml_priya · 1 week ago

Clear explainer. One nit: a quick note that quantization quality varies a lot by model would help. Some degrade gracefully, some fall off a cliff below 4-bit.

u/quant_q3 · 2 weeks ago

the 'some models fall off a cliff below 4-bit' point is so important. quant quality is not uniform and people just assume it is until it bites
