DeepSeek V4: 1.6T MoE, 1M Context, $0.87/M Output. Architecture, Benchmarks, Pricing (2026)

TL;DR

DeepSeek V4 released April 24, 2026 with two models. V4-Pro: 1.6T total parameters, 49B active, $0.435/M input, $0.87/M output. V4-Flash: 284B total, 13B active, $0.14/M input, $0.28/M output. Both default to 1M-token context with 384K max output. DeepSeek-V4-Pro-Max scores 80.6% on SWE-bench Verified, the highest open-weights entry, tied with Gemini 3.1 Pro. Per output token, V4-Pro is 28.7x cheaper than Claude Opus 4.8 and 34.5x cheaper than GPT-5.5.

What it is

Two MoE models with 1M-token context. Attention: token-wise compression + DeepSeek Sparse Attention (DSA). Open weights on Hugging Face, MIT license per Lambda's launch analysis. API speaks both OpenAI ChatCompletions and Anthropic formats.

Why it matters

80.6% SWE-bench Verified at $0.87/M output. Opus 4.8 scores 88.6% at $25/M output. A dollar buys 1.15M output tokens from V4-Pro, 40K from Opus 4.8, 33K from GPT-5.5.

Release Date and Models

DeepSeek released V4 on April 24, 2026 as a preview, announced on its API docs news page. Two models shipped the same day:

deepseek-v4-pro: 1.6T total parameters, 49B active per token.
deepseek-v4-flash: 284B total parameters, 13B active per token.

Both run with a 1M-token context window by default across all official DeepSeek services, support JSON output, tool calls, and thinking plus non-thinking modes, and expose 384K max output tokens. The legacy deepseek-chat and deepseek-reasoner endpoints now map to deepseek-v4-flash (non-thinking and thinking mode) and will be fully retired after July 24, 2026 15:59 UTC. The API supports both the OpenAI ChatCompletions format and the Anthropic API format, which is what makes the Claude Code setup below a three-line config.

Weights are on Hugging Face (deepseek-ai/DeepSeek-V4-Pro, deepseek-ai/DeepSeek-V4-Flash), released under the MIT license per Lambda's launch analysis.

V4 Pro vs V4 Flash: Specs and Pricing

1.6T / 49B

V4-Pro total / active params

284B / 13B

V4-Flash total / active params

1M / 384K

Context / max output (both)

Specification	V4-Pro	V4-Flash
Total parameters	1.6 trillion	284 billion
Active parameters per token	49B	13B
Context window	1,000,000 tokens	1,000,000 tokens
Max output tokens	384K	384K
Input, cache miss / 1M tokens	$0.435	$0.14
Input, cache hit / 1M tokens	$0.003625	$0.0028
Output / 1M tokens	$0.87	$0.28
Concurrency limit	500	2500
Release date	April 24, 2026 (preview)	April 24, 2026 (preview)
Weights	Hugging Face, MIT	Hugging Face, MIT

The cache-hit input prices are the standout line items. A cache hit on V4-Pro costs $0.003625/M, 120x less than its cache-miss rate. Agentic coding loops that resend the same system prompt and file context on every turn see most of their input tokens land in cache, so effective per-session cost runs far below the list cache-miss rate.

Which to pick: Pro activates 3.8x more parameters per token and is the variant behind the 80.6% SWE-bench Verified score (as Pro-Max). Flash costs 3.1x less per output token and allows 5x the concurrency (2500 vs 500), which makes it the default for batch pipelines, evals, and high-volume extraction.

DeepSeek V4 Architecture

Per DeepSeek's release notes, V4's attention combines token-wise compression with DSA (DeepSeek Sparse Attention). DeepSeek's own documentation is sparse on internals; the detail below comes from the model card and third-party technical analyses of the weights, and is marked as such.

Attention: token-wise compression + DSA

The official description: each layer compresses the KV cache token-wise, then applies DeepSeek Sparse Attention over the compressed representation. This is what makes a 1M-token default context economical to serve at $0.435/M input.

Third-party analyses (Lambda's launch breakdown and model-card readers) describe the mechanism as Compressed Sparse Attention: KV caches compressed 4x along the sequence dimension, with a lightning indexer selecting the top 1,024 compressed KV entries per query. Treat the 4x and top-1,024 figures as secondary-source; DeepSeek's news page does not publish them.

DeepSeek V4 Pro architecture

V4-Pro is the 1.6T-total-parameter MoE with 49B active per token. Versus V3 (671B total, 37B active), that is 2.4x the total parameter count and 1.3x the active compute per token, with the context window expanded 8x from 128K to 1M. The sparse-attention stack is what keeps the larger model servable: DeepSeek prices Pro at $0.87/M output, 4x the output price of V4-Flash but a fraction of closed-model rates.

DeepSeek V4 Flash architecture

V4-Flash shares the attention design but at 284B total parameters with 13B active per token. The smaller expert pool is why DeepSeek can offer 2500 concurrent requests at $0.14/M input. It also inherits the same 1M context and 384K max output, so the variant choice is about quality per token, not context capability.

What DeepSeek has and has not published

Confirmed by DeepSeek's own release notes: MoE parameter counts (1.6T/49B and 284B/13B), token-wise compression + DSA attention, 1M default context, 384K max output, thinking and non-thinking modes, OpenAI and Anthropic API compatibility. Not published first-party as of June 9, 2026: the CSA compression ratios, indexer internals, optimizer details, and training token counts that circulate in third-party writeups. We cite those as secondary-source where used.

Benchmarks: SWE-bench Verified

The independently tracked number is DeepSeek-V4-Pro-Max at 80.6% on SWE-bench Verified (llm-stats tracker, June 2026). That is the highest open-weights entry, tied with Gemini 3.1 Pro, 0.1 points ahead of MiniMax M3 and 0.2 ahead of Qwen3.7 Max. Closed frontier models score higher: Claude Fable 5 leads at 95.0% (currently suspended, see note).

Model	Score	Output price / 1M tokens
Claude Fable 5	95.0%	$50.00
Claude Mythos Preview	93.9%	restricted access
Claude Opus 4.8	88.6%	$25.00
Claude Opus 4.7	87.6%	$25.00
Claude Opus 4.5	80.9%	$25.00
Claude Opus 4.6	80.8%	$25.00
DeepSeek-V4-Pro-Max	80.6%	$0.87
Gemini 3.1 Pro	80.6%	$12.00
MiniMax M3	80.5%	$1.20
Qwen3.7 Max	80.4%	$2.40 (Qwen3.5-Plus rate)

Read the price column against the score column. The 14.4-point gap between Fable 5 (95.0%) and V4-Pro-Max (80.6%) costs 57x more per output token. The 8-point gap to Opus 4.8 costs 28.7x more. Whether that trade is worth it depends entirely on whether your tasks live in the band those extra points unlock.

DeepSeek-reported launch numbers

At launch, DeepSeek reported V4-Pro-Max scoring 93.5 on LiveCodeBench Pass@1 and a 3206 Codeforces rating. These are vendor-run numbers from the April 2026 release coverage, not independent leaderboard entries; vendor scaffolds routinely score above standardized harnesses (see our SWE-bench Pro breakdown for how large that gap runs).

DeepSeek V4 Flash and SWE-bench: What Exists

No independently run SWE-bench score for deepseek-v4-flash has been published as of June 9, 2026. Specifically:

Scale's SEAL SWE-bench Pro leaderboards (public and private sets) list no DeepSeek V4 entry of any variant. The top open-weights entry there is qwen3-coder-480b-a35b at 38.7% on the public set.
The llm-stats SWE-bench Verified tracker lists DeepSeek-V4-Pro-Max (80.6%) but no Flash entry.

If you see a Flash SWE-bench number quoted, check whether it is a vendor-run scaffold result or a community harness; neither currently appears on the two trackers above. For agentic coding where benchmark evidence exists, the published data points at Pro, not Flash.

Cost Math vs Opus 4.8, GPT-5.5, Gemini 3.1 Pro

Reddit and X discussion of V4 settled on a "17x cheaper" shorthand. The exact ratios depend on which token type you compare. At list prices:

Model	Input / 1M	Output / 1M	Output tokens per $1	Output cost vs V4-Pro
DeepSeek V4-Flash	$0.14	$0.28	3.57M	0.32x
DeepSeek V4-Pro	$0.435	$0.87	1.15M	1x
MiniMax M3 (≤512K)	$0.30	$1.20	833K	1.4x
Gemini 3.1 Pro (≤200K)	$2.00	$12.00	83K	13.8x
GPT-5.4	$2.50	$15.00	67K	17.2x
Claude Sonnet 4.6	$3.00	$15.00	67K	17.2x
Claude Opus 4.8	$5.00	$25.00	40K	28.7x
GPT-5.5	$5.00	$30.00	33K	34.5x
Claude Fable 5	$10.00	$50.00	20K	57.5x

So "17x cheaper" is accurate against GPT-5.4 and Sonnet 4.6 output. Against Opus 4.8 it is 28.7x on output and 11.5x on input. Against Fable 5 it is 57.5x on output. Cached input widens every gap: a V4-Pro cache hit costs $0.003625/M vs $0.50/M for an Opus 4.8 cache hit, 138x apart.

A concrete daily workload, 20 requests of 50K input + 10K output (1M input, 200K output per day), assuming zero cache hits:

V4-Flash: $0.20/day, about $6/month
V4-Pro: $0.61/day, about $18/month
Opus 4.8: $10.00/day, about $300/month
GPT-5.5: $11.00/day, about $330/month
Claude Fable 5: $20.00/day, about $600/month

Full Anthropic-side rates, including cache and batch multipliers, are in our Anthropic API pricing guide.

Where to Run It: First-Party API vs OpenRouter

The first-party API (api.deepseek.com) is the reference deployment: $0.435/$0.87 for Pro, $0.14/$0.28 for Flash, with the cache-hit discounts and the 500/2500 concurrency limits above. OpenRouter lists deepseek/deepseek-v4-pro at the same $0.435/M input and $0.87/M output with 1M context, routing across hosts (DeepInfra and NVIDIA also serve V4 endpoints; per-host rates vary by routing tier, so check the live OpenRouter provider list before committing volume).

Reasons to pick first-party: documented cache-hit pricing at $0.003625/M (Pro) and the Anthropic-format endpoint. Reasons to pick OpenRouter: one key across models, provider failover, and easy A/B against MiniMax M3 or Qwen3.5 at the prices in the table above. Because the weights are open, you can also self-host; Flash at 284B total is the realistic target, Pro at 1.6T is multi-node territory.

Output fidelity is where serverless hosts diverge. Most serverless providers quantize activations to fp8 to cut serving cost, which moves output away from the reference weights. Morph Open Source Models serves DeepSeek with 16-bit (bf16) activations and does not quantize activations to fp8, so output matches the published weights. For coding agents specifically, Morph adds codegen-tuned speculative decoding (draft and ngram tuned on code) plus custom low-level inference kernels built for code generation, which makes it the fastest and highest-quality option for codegen. morph-dsv4flash (DeepSeek V4 Flash) runs at $0.139/M input and $0.278/M output; see pricing for the full list.

Use DeepSeek V4 in Claude Code and OpenCode

Claude Code

DeepSeek's API speaks the Anthropic format natively, so Claude Code needs only environment variables, no proxy:

export ANTHROPIC_BASE_URL="https://api.deepseek.com/anthropic"
export ANTHROPIC_AUTH_TOKEN="sk-your-deepseek-key"
export ANTHROPIC_MODEL="deepseek-v4-pro"
export ANTHROPIC_SMALL_FAST_MODEL="deepseek-v4-flash"
claude

Pointing the small-fast model at deepseek-v4-flash keeps background tasks on the $0.28/M-output tier. For routing multiple providers behind one endpoint instead, see Claude Code with LiteLLM.

OpenCode

OpenCode ships a DeepSeek provider. Run opencode auth login, select DeepSeek, paste your API key, then set the model in opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "model": "deepseek/deepseek-v4-pro"
}

Codex CLI

Codex's custom model_providers config only supports wire_api = "responses", and DeepSeek's API exposes ChatCompletions and Anthropic formats, not the Responses API. To drive V4 from Codex, route through a Responses-compatible proxy or use OpenRouter-backed tooling; for direct use, Claude Code and OpenCode are the paths that work without translation.

What Changed from DeepSeek V3

Dimension	DeepSeek V3	DeepSeek V4-Pro
Total parameters	671B	1.6T (2.4x larger)
Active parameters	37B per token	49B per token
Context window	128K tokens	1M tokens (8x larger)
Attention	Multi-head Latent Attention (MLA)	Token-wise compression + DSA
Max output	8K tokens	384K tokens
API formats	OpenAI-compatible	OpenAI + Anthropic formats
Endpoints	deepseek-chat / deepseek-reasoner	deepseek-v4-pro / deepseek-v4-flash

The structural shift is the attention stack: replacing MLA with token-wise compression plus DSA is what moves the default context from 128K to 1M without 1M-context pricing. The legacy V3-era endpoints survive only as aliases onto deepseek-v4-flash until July 24, 2026.

Limitations

Known limitations

Preview status: the April 24, 2026 release is labeled a preview. Specs and pricing held through June, but GA may change behavior.
8-point gap to the frontier on SWE-bench Verified: V4-Pro-Max scores 80.6% vs Opus 4.8's 88.6% and Fable 5's 95.0%. For tasks where those points matter, the cheap model retries its way into costing you time instead of money.
No independent SWE-bench Pro entry: Scale's SEAL leaderboard has no DeepSeek V4 result, so agentic performance under a standardized harness is unverified.
Sparse first-party architecture docs: compression ratios and indexer internals circulate only in third-party analyses.
Self-hosting Pro is heavy: 1.6T total parameters means multi-node inference even quantized. Flash at 284B is the practical self-host target.

Frequently Asked Questions

When was DeepSeek V4 released?

April 24, 2026, as a preview, with V4-Pro and V4-Flash shipping the same day. The legacy deepseek-chat and deepseek-reasoner endpoints now alias to deepseek-v4-flash and retire after July 24, 2026 15:59 UTC.

What is the DeepSeek V4 architecture?

A mixture-of-experts transformer with token-wise compression plus DeepSeek Sparse Attention (DSA). Pro: 1.6T total / 49B active. Flash: 284B total / 13B active. Both: 1M context, 384K max output. Third-party analyses add 4x KV compression and a top-1,024 lightning indexer, which DeepSeek has not published first-party.

What does DeepSeek V4 cost?

Official API: Flash $0.14/M input (miss), $0.0028/M (cache hit), $0.28/M output. Pro $0.435/M input (miss), $0.003625/M (cache hit), $0.87/M output. OpenRouter lists Pro at the same $0.435/$0.87.

What is deepseek-v4-flash's SWE-bench score?

None published independently as of June 9, 2026. Scale SEAL and llm-stats both lack a Flash entry. The only independent V4 number is Pro-Max at 80.6% SWE-bench Verified.

How does V4-Pro-Max compare to Claude on SWE-bench Verified?

V4-Pro-Max 80.6% vs Opus 4.6 80.8%, Opus 4.8 88.6%, Fable 5 95.0% (llm-stats, June 2026). Per output token V4-Pro costs $0.87 vs $25 for Opus 4.8 and $50 for Fable 5.

Is V4 open source?

Open weights on Hugging Face under MIT (per Lambda's launch analysis; the HF model cards require accepting access terms). Download, run, fine-tune.

Can I use V4 in Claude Code?

Yes. Set ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic, ANTHROPIC_AUTH_TOKEN to your DeepSeek key, and ANTHROPIC_MODEL=deepseek-v4-pro. The full snippet is in the setup section above.

Use WarpGrep with DeepSeek V4 for Better Code Search Context

WarpGrep is an agentic code search tool that works as an MCP server. Connect it to any DeepSeek-powered agent for high-precision codebase context, so V4's 1M-token window gets filled with the right code, not noise. Free for 100k requests, then $1 per 1M.

Try WarpGrep Free

See How It Works

Fast Apply

WarpGrep

Compact

Reflexes

Model Router

DeepSeek

MiniMax

Qwen

Blog

Startup Credits

Students

Contact Us

About

Careers

DeepSeek V4: 1.6T MoE, 1M Context, $0.87/M Output (2026 Guide)