Quantization is the trick that makes useful local LLMs possible. A 70B-parameter model checkpointed at 16-bit precision weighs roughly 140 GB of weights; the same model quantized to a 4-bit GGUF (e.g., Q4_K_M) typically lands in the 35–45 GB range (model-dependent, plus a few percent of metadata), which is what lets it run on a pair of 24 GB consumer GPUs instead of needing a small datacenter.

What quantization actually does

Most modern LLMs are trained and checkpointed at 16-bit precision (BF16 or FP16). Two bytes per weight; a 70B model in BF16 is on the order of 140 GB of raw weights, ignoring optimizer state and metadata. Many inference stacks now run at 4-bit or 8-bit by default — full-precision inference is no longer the norm even on hosted setups.

Quantization replaces those 16-bit floats with smaller integer representations, plus a small amount of bookkeeping data (a scale factor, sometimes a zero-point) per group of weights. At 4 bits per weight, the same 70B model becomes roughly a quarter of the original size, give or take.

The cost is precision. Each weight is now one of a small set of discrete values inside its group's scale (16 values for 4-bit). For the vast majority of weights this is fine — neural networks are surprisingly robust to rounding. For the small minority of weights doing critical work, modern quantization schemes use calibration data to keep those at higher precision.

The quality cliff (heuristic, not a benchmark)

Empirically — and these are heuristics from community testing, not universal results — quality tends to drop gracefully from FP16 down to about 4 bits. Below 4 bits the drop gets sharper:

  • 8-bit: typically near-indistinguishable from full precision on chat and writing tasks.
  • 6-bit: typically very close to full precision in most use cases.
  • 5-bit: usually subtle drop, sometimes visible on the hardest reasoning and code tasks.
  • 4-bit: usually a small drop, broadly fine for chat / writing / summarization. Often the practical default.
  • 3-bit: real and noticeable drop on math, code, and instruction following.
  • 2-bit: heavy quality loss; the model still runs but is significantly degraded for most uses.

Where the cliff is for your model and your task is something you have to test, not something a chart can tell you universally. Pretrained model architecture, MoE vs dense, and the specific quantization scheme all matter.

How to pick a quant level

Two questions to answer:

  1. What is your VRAM budget? Subtract the OS, the context-window allocation, and any other GPU users. Take the rest, divide by the FP16 size of the model, and see which quantization gets you under the line.
  2. How hard is your task? Chat and writing tend to tolerate 4-bit. Code generation, math, structured reasoning, and long-context retrieval often benefit from stepping up to Q5_K_M or Q6_K. Pick the quant that fits, then bump up one notch if your task is in the hard column and you have memory headroom — and verify on a small eval rather than trusting the heuristic.

The common upgrade path: start at Q4_K_M (or the equivalent 4-bit scheme your runtime supports), run a small battery of your real tasks, step up to Q5_K_M or Q6_K if anything is failing in a way that smells like precision rather than capability.

What quantization cannot fix

A quantized model is still the same model. Quantization cannot make a 7B model think like a 70B model; it can only make a 70B model fit. If your task needs a bigger model, you need a bigger model, not a better quant.