Local Models

Running LLMs on your own hardware: how the stack works, which runtimes to pick, what quantization actually changes, and which open-weight models are genuinely usable right now.

As of 2026-05-15

As of 2026-05-16

Running a model locally means downloading the model weights and doing inference on hardware you own. It is more accessible than it sounds; the entire stack is three pieces.

The three pieces

A runtime. The program that loads the model into memory and decodes tokens. The four most commonly used right now:
- Ollama — CLI + daemon, simple workflow, REST API. Repo: ollama/ollama.
- LM Studio — desktop GUI for browsing, downloading, and running models, with an OpenAI-compatible local server.
- llama.cpp — the C/C++ engine that powers a lot of the others. Highest control, lowest overhead.
- vLLM — server-oriented, high-throughput, GPU-focused. Paged-attention for long contexts and concurrency.
A model in a runnable format. GGUF is the dominant format for CPU/GPU-hybrid runtimes (llama.cpp, Ollama, LM Studio). Safetensors is the format you reach for with vLLM or Hugging Face Transformers, and some runtimes can load it directly without converting to GGUF. Neither is universally standard; pick what your runtime supports.
A model that fits. This is the constraint that decides everything. Your usable VRAM (or unified memory on Apple Silicon) caps which model you can run at which quantization. The numbers below are rough rules of thumb; exact memory depends on context length and runtime overhead.
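
As one illustration of the runtime piece, Ollama's REST API can be called with nothing but the Python standard library. This is a minimal sketch, assuming an Ollama daemon listening on its default port (11434) and a model you have already pulled. The helper names are ours, not part of Ollama.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local daemon and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False the reply is one JSON object with a "response" field.
    return body["response"]
```

Calling generate("your-model-tag", "Say hello in five words.") should return the completion as a string; the model tag here is a placeholder for whatever you have downloaded.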

VRAM rough rules of thumb

At 4-bit quantization (Q4_K_M, the common default). Treat these as approximate and confirm against your runtime's reporting:

8 GB VRAM → 7B–8B class models, with short context.
12 GB VRAM → 13B class, or 7B with a long context window.
16 GB VRAM → 13B with room; 27B class only with a lower-bit quant or partial CPU offload.
24 GB VRAM (one 3090, 4090, or 7900 XTX) → 27B–32B class comfortably; 70B class only with CPU offload or aggressive quantization.
48 GB VRAM (two 3090s, an A6000, an RTX 6000 Ada) → 70B class natively, longer contexts possible.
96 GB+ → bigger open-weight models with longer contexts and less compromise on quantization. "Frontier-tier" depends on which models are open at the time; check the open-weight tracker.
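
The arithmetic behind rules of thumb like these can be sketched in a few lines. This is a rough estimator, not what any runtime reports: it assumes memory is roughly weights plus KV cache plus a flat overhead. The 4.5 bits per weight, the 1 GiB overhead, and the layer and head counts in the example are illustrative assumptions for a hypothetical 8B-class model.

```python
GIB = 2**30


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the quantized weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_value: int = 2) -> float:
    """Memory for the KV cache at a given context length, in GiB.

    The leading factor of 2 covers one key and one value tensor per layer;
    bytes_per_value=2 assumes a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value / GIB


def estimate_total_gib(params_billion: float, bits_per_weight: float,
                       n_layers: int, n_kv_heads: int, head_dim: int,
                       context_len: int, overhead_gib: float = 1.0) -> float:
    """Weights + KV cache + a flat runtime overhead (the overhead is a guess)."""
    return (weight_gib(params_billion, bits_per_weight)
            + kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len)
            + overhead_gib)


# Hypothetical 8B-class model: 32 layers, 8 KV heads, head dimension 128,
# 8192-token context, roughly 4.5 effective bits per weight.
total = estimate_total_gib(8, 4.5, 32, 8, 128, 8192)
print(f"estimated total: {total:.2f} GiB")
```

Under those assumptions the estimate lands a little over 6 GiB, which is consistent with the 8 GB tier above. Doubling the context length doubles only the KV-cache term, which is why long contexts eat into the headroom.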

Apple Silicon is a separate case. Unified memory is shared between CPU and GPU and does not map one-to-one to VRAM, so the numbers above are not directly portable. In practice, a Mac with ample unified memory can load larger models than a comparable PC GPU but runs slower per token; the exact gap depends on model, quantization, and software stack. Check the specific Apple Silicon configuration you're considering against current benchmarks rather than trusting a generic multiplier. Apple's official Mac specs list the unified-memory options.

Note: a high-end multi-GPU consumer PC (e.g., dual 24 GB or 32 GB cards) can also run 70B+ models with tensor parallelism in vLLM or distributed inference in llama.cpp. It is no longer accurate to say only a high-memory Apple Silicon Mac can do this.

Quantization in one paragraph

You almost never run weights at full 16-bit precision locally. You quantize them to 8, 6, 5, or 4 bits per parameter. Memory drops roughly proportionally; speed often improves, because token generation is usually limited by memory bandwidth; quality drops a little. Q4_K_M in GGUF is the modern default for chat. Drop to 3-bit only if you must. Stay above 4-bit when you are doing fine-grained reasoning or generation where you notice quality cliffs. See quantization-explained-for-llms for the deeper version.
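
To make the trade-off concrete, here is a toy round trip through a simplified quantizer. It uses plain symmetric absmax quantization over a single block, which is not the K-quant scheme GGUF uses for Q4_K_M; it only shows why fewer bits mean a larger reconstruction error. The Gaussian "weights" are synthetic.

```python
import random


def quantize_block(values, bits):
    """Symmetric absmax quantization of one block to signed `bits`-wide integers."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_block(quantized, scale):
    """Map the stored integers back to approximate float values."""
    return [q * scale for q in quantized]


def mean_abs_error(values, bits):
    """Average absolute difference between original and round-tripped values."""
    quantized, scale = quantize_block(values, bits)
    restored = dequantize_block(quantized, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Synthetic stand-in for a block of model weights.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(256)]

for bits in (8, 6, 5, 4, 3):
    print(bits, "bits -> mean abs error", mean_abs_error(weights, bits))
```

Real formats reduce this error by quantizing small blocks with their own scales and by spending more bits on sensitive tensors, which is part of why a well-designed 4-bit format holds up better than this toy suggests.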

What runs well right now

Generally, the workable open-weight tiers on consumer hardware in mid-2026:

7B–8B for fast chat on consumer hardware.
27B–32B as a "default smart model" tier for 24 GB cards.
70B+ for power users with multi-GPU setups or high-memory Apple Silicon.

For which specific model to run, the answer changes monthly; check the open-weight tracker or the latest snapshot if it has been published.

When local beats API

Privacy. Data never leaves your machine. For regulated work, this is the whole game.
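Latency. There is no network round trip, so short chat-style prompts can start responding almost immediately; generation speed still depends on your hardware. Useful for fast UIs.
Cost at high volume. Hardware is a fixed cost, so the marginal cost per request is mostly electricity. If you have a workload that would cost four figures a month on an API, the math flips fast.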
No-network or low-network environments. Field work, secure facilities, intermittent connections.
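
The cost argument above reduces to a break-even calculation. A minimal sketch follows; every figure in the example is hypothetical, and it ignores your own time, hardware resale value, and depreciation.

```python
def breakeven_months(hardware_cost, api_monthly_cost, local_monthly_running_cost):
    """Months until buying hardware is cheaper than continuing to pay for an API.

    Returns None when running locally saves nothing per month, in which case
    the hardware never pays for itself at this volume.
    """
    monthly_saving = api_monthly_cost - local_monthly_running_cost
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


# Hypothetical numbers: a 3000 workstation, a 1500/month API bill,
# and 100/month in electricity to run the same workload locally.
months = breakeven_months(3000, 1500, 100)
print(f"break-even after about {months:.1f} months")
```

At low volume the same function returns None or a very long payback period, which is the case where the API stays the cheaper option.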

When API beats local

Raw capability per dollar. Frontier APIs tend to lead on the hardest reasoning, the longest context, and the latest skills.
Multimodal. Open-weight is catching up on vision, but audio and video are still typically stronger on hosted APIs.
Operational simplicity. No GPU babysitting. Updates are someone else's problem.
Scaling spikes. APIs absorb bursts; your local GPU does not.

The production stacks we have seen often pair the two: a hosted API for the hardest requests, a local or smaller hosted model for the high-volume or sensitive ones. That observation is anecdotal; pick the mix that matches your specific cost, privacy, and capability constraints.
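
The pairing described above could be sketched as a simple routing rule. This is an illustration only: the ChatRequest type, its flags, and the two backend labels are hypothetical, and a real system would need a way to decide what counts as sensitive or hard.

```python
from dataclasses import dataclass


@dataclass
class ChatRequest:
    prompt: str
    sensitive: bool = False   # must not leave the machine
    hard: bool = False        # needs the strongest available model


def route(request: ChatRequest) -> str:
    """Pick a backend label for a request.

    Privacy wins over capability: sensitive requests stay local even when
    they are also hard. Everything else defaults to the cheaper local model.
    """
    if request.sensitive:
        return "local"
    if request.hard:
        return "hosted"
    return "local"


print(route(ChatRequest("Summarise this patient note", sensitive=True, hard=True)))
print(route(ChatRequest("Prove this lemma", hard=True)))
```

Checking the sensitive flag first is the design choice that matters here; reversing the order would send regulated data to the hosted API whenever a request is also marked hard.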

Forthcoming

Best Local LLMs Snapshot 2026-05
GPU vs CPU for Local LLMs
VRAM Requirements for Popular Models
