Best AI Model for Coding (June 2026): 12 Models Ranked by SWE-bench Pro Score and Cost per Task

Claude Fable 5 (95.0% SWE-bench Verified, $10/$50), Opus 4.8, GPT-5.5, GPT-5.4, Gemini 3.1 Pro, DeepSeek V4, MiniMax M3 compared. SWE-bench Pro scores, per-token pricing, and output-dollar per solved point for 12 models. Updated June 9, 2026.

June 9, 2026

Best AI Model for Coding: Quick Answer (June 2026)

Every ranking page shows you a benchmark column or a price column. Almost none divide one by the other. This page does: it combines SWE-bench Pro scores from Scale's standardized leaderboard with official per-token prices to get the output-dollar cost per solved benchmark point. Updated June 9, 2026, the day Claude Fable 5 went GA.

Highest scores, any price

Claude Fable 5 (currently suspended, see note)

  • 95.0% SWE-bench Verified
  • 80.3% SWE-bench Pro (Anthropic harness)
  • $10 / $50 per M tokens, 1M context
  • Practical pick while suspended: Opus 4.8

Best on a standardized harness

GPT-5.4

  • 59.10% SWE-bench Pro (Scale SEAL, #1)
  • $2.50 / $15 per M tokens
  • 1M context, 128K output

Cheapest per solved point

Claude Haiku 4.5

  • 39.45% SWE-bench Pro (Scale SEAL)
  • $1 / $5 per M tokens
  • ~$0.13 output per Pro point

New since May 2026

Claude Fable 5 (GA June 9): 95.0% SWE-bench Verified, $10/$50, 1M context, adaptive thinking always on. MiniMax M3 (May 31): 80.5% SWE-bench Verified as an open-weights model at $0.30/$1.20. Claude Opus 4.8 (May 28): 88.6% SWE-bench Verified at the same $5/$25 as Opus 4.7. The open-weights pack (DeepSeek-V4-Pro-Max, MiniMax M3, Qwen3.7 Max) now sits within 0.2 points of Gemini 3.1 Pro on SWE-bench Verified.

12 Models Ranked: SWE-bench Pro x Price x Cost per Solved Point

The table below uses Scale's SEAL public-set leaderboard, which runs every model through the same standardized scaffolding on SWE-bench Pro (1,865 tasks, 41 professional repositories). The last column divides official output price by score: dollars of output tokens per benchmark point solved. Lower is more cost-effective.

| Model | SWE-bench Pro | $/M input / output | Output $ per Pro point |
|---|---|---|---|
| gpt-5.4 (xHigh) | 59.10% | $2.50 / $15 | $0.25 |
| Muse Spark | 55.00% | not published | n/a |
| Claude Opus 4.6 (thinking) | 51.90% | $5 / $25 | $0.48 |
| Gemini 3.1 Pro (thinking) | 46.10% | $2 / $12 (≤200K tokens) | $0.26 |
| Claude Opus 4.5 | 45.89% | $5 / $25 | $0.54 |
| Claude Sonnet 4.5 | 43.60% | $3 / $15 | $0.34 |
| Gemini 3 Pro (preview) | 43.30% | preview | n/a |
| Claude Sonnet 4 (retires Jun 15, 2026) | 42.70% | $3 / $15 | $0.35 |
| GPT-5 (High) | 41.78% | superseded | n/a |
| gpt-5.2-codex | 41.04% | $1.75 / $14 | $0.34 |
| Claude Haiku 4.5 | 39.45% | $1 / $5 | $0.13 |
| Qwen3 Coder 480B (open weights) | 38.70% | self-host | n/a |

Three things fall out of the combined view. GPT-5.4 wins on score and is competitive on cost per point. Haiku 4.5 solves 67% as many tasks as gpt-5.4 at a third of its output price, making it the cost-per-point leader at roughly $0.13 per point. And Opus 4.6, the top Claude entry Scale has tested, pays a 2x cost-per-point premium over gpt-5.4 for 7.2 fewer points on this harness, which is exactly why Anthropic publishes its own numbers (covered below).
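The last column can be reproduced from the two columns before it. A minimal sketch in Python, using only the output prices and Scale SEAL scores listed above; the dictionary layout and the function name are illustrative choices, not something Scale or the vendors publish:

```python
# Output price ($ per million tokens) and Scale SEAL SWE-bench Pro score (%),
# copied from the table above. Models without a published price are omitted.
MODELS = {
    "gpt-5.4 (xHigh)": (15.0, 59.10),
    "Claude Opus 4.6 (thinking)": (25.0, 51.90),
    "Gemini 3.1 Pro (thinking)": (12.0, 46.10),
    "Claude Opus 4.5": (25.0, 45.89),
    "Claude Sonnet 4.5": (15.0, 43.60),
    "Claude Sonnet 4": (15.0, 42.70),
    "gpt-5.2-codex": (14.0, 41.04),
    "Claude Haiku 4.5": (5.0, 39.45),
}


def output_dollars_per_point(output_price: float, score: float) -> float:
    """Output price divided by benchmark score, as in the table's last column."""
    return output_price / score


# Sort from most to least cost-effective (lowest dollars per point first).
ranked = sorted(
    ((name, output_dollars_per_point(price, score))
     for name, (price, score) in MODELS.items()),
    key=lambda item: item[1],
)

for name, cost in ranked:
    print(f"{name:28s} ${cost:.2f} per Pro point")
```

Note that this is a price-to-score ratio, not a measured cost per task: it ignores input tokens and how many output tokens each model spends per attempt.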

  • 59.10%: top score on the standardized harness (gpt-5.4)
  • $0.13: cheapest output-$ per Pro point (Haiku 4.5)
  • 1,865: tasks in SWE-bench Pro across 41 repos

Scale's private set reorders the podium

On SWE-bench Pro's commercial (private) set, Claude Opus 4.6 leads at 47.10%, ahead of Muse Spark (44.70%) and gpt-5.4 (43.40%). Gemini 3.1 Pro drops to 32.20%. Models that top the public set do not automatically top unseen codebases; if your repo looks nothing like open-source Python, weight the private-set ordering.
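To make the reordering concrete, here is a small sketch that ranks the four models for which this page quotes both a public-set and a private-set score. The data structure is an illustrative assumption; the numbers are the ones cited above:

```python
# SWE-bench Pro scores (%) quoted in this article: (public set, private set).
SCORES = {
    "gpt-5.4": (59.10, 43.40),
    "Muse Spark": (55.00, 44.70),
    "Claude Opus 4.6": (51.90, 47.10),
    "Gemini 3.1 Pro": (46.10, 32.20),
}

# Rank the same models twice, once per score set.
public_order = sorted(SCORES, key=lambda m: SCORES[m][0], reverse=True)
private_order = sorted(SCORES, key=lambda m: SCORES[m][1], reverse=True)

for model in private_order:
    public, private = SCORES[model]
    print(
        f"{model:16s} public #{public_order.index(model) + 1} -> "
        f"private #{private_order.index(model) + 1} "
        f"(drop {public - private:.1f} pts)"
    )
```

Among these four, gpt-5.4 loses the most between the two sets and Opus 4.6 the least, which is what flips the podium.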

SWE-bench Verified Leaderboard (June 2026)

SWE-bench Verified is older, Python-only, and partially contaminated, but it is still the number every launch post quotes. The June 2026 top ten, per the llm-stats tracker:

SWE-bench Verified: Top 10 (June 2026)

Source: llm-stats.com tracker, updated June 2026. Higher = more GitHub issues resolved.

1. Claude Fable 5 (new): 95.0%
2. Claude Mythos Preview (restricted): 93.9%
3. Claude Opus 4.8: 88.6%
4. Claude Opus 4.7: 87.6%
5. Claude Opus 4.5: 80.9%
6. Claude Opus 4.6: 80.8%
7. DeepSeek-V4-Pro-Max (open weights): 80.6%
8. Gemini 3.1 Pro: 80.6%
9. MiniMax M3 (open weights): 80.5%
10. Qwen3.7 Max (open weights): 80.4%

Anthropic models hold the top 6 slots. Three open-weights models sit within 0.2 points of Gemini 3.1 Pro.

Two structural facts stand out in this chart. First, the Claude line pulled away from the pack: Fable 5's 95.0% is 6.4 points above Opus 4.8 and 14.4 above the 80-percent cluster. Second, the 80-percent cluster is now half open weights. DeepSeek-V4-Pro-Max (80.6%) ties Gemini 3.1 Pro exactly, and you can download its MIT-licensed weights.

Which Claude Model Is Best for Coding?

Anthropic ships five current coding-relevant models. Exact API IDs, prices, and the decision logic:

| Model (API ID) | Coding benchmarks | $/M in / out | Context / max output |
|---|---|---|---|
| Claude Fable 5 (claude-fable-5) | 95.0% SWE-bench Verified, 80.3% SWE-bench Pro (vendor) | $10 / $50 | 1M / 128K |
| Claude Opus 4.8 (claude-opus-4-8) | 88.6% Verified, 69.2% Pro (vendor), 74.6% Terminal-Bench 2.1 | $5 / $25 | 1M / 128K |
| Claude Opus 4.7 (claude-opus-4-7) | 87.6% Verified | $5 / $25 | 1M / 128K |
| Claude Sonnet 4.6 (claude-sonnet-4-6) | 43.60% Pro for Sonnet 4.5 on Scale; 4.6 untested there | $3 / $15 | 1M / 64K |
| Claude Haiku 4.5 (claude-haiku-4-5) | 39.45% SWE-bench Pro (Scale SEAL) | $1 / $5 | 200K / 64K |

Default: Opus 4.8

claude-opus-4-8 at $5/$25 is the working default for coding agents: 88.6% SWE-bench Verified, 69.2% SWE-bench Pro on Anthropic's harness, 1M context with no long-context surcharge, and the effort parameter defaults to high. A fast-mode research preview runs ~2.5x faster output at $10/$50.

Ceiling: Fable 5 (currently suspended, see note)

claude-fable-5 ($10/$50, GA June 9, 2026) adds 6.4 points of SWE-bench Verified and 11.1 points of vendor SWE-bench Pro over Opus 4.8. Adaptive thinking is always on. The model was suspended on June 12, 2026 under a US export-control directive. While it is suspended, Opus 4.8 is the practical ceiling pick.

Volume: Sonnet 4.6

claude-sonnet-4-6 at $3/$15 carries a 1M context window, the only sub-Opus Claude with one. Use it for high-throughput agent loops where Opus pricing compounds: CI review bots, test generation, batch transforms. Batch API halves it to $1.50/$7.50.

Quick edits and subagents: Haiku 4.5

claude-haiku-4-5 at $1/$5 is the cost-per-point leader on Scale's leaderboard (~$0.13 of output per Pro point). Route single-file edits, lint fixes, and explore-style subagents here; cache hits cost $0.10/M.

Retirements and cost traps

Claude Sonnet 4 (claude-sonnet-4-20250514) and Opus 4 retire June 15, 2026; Opus 4.1 retires August 5, 2026 and still bills at $15/$75. Migrate to claude-sonnet-4-6 / claude-opus-4-8. Also note the tokenizer change: Opus 4.7 and later (including Fable 5) can produce up to 35% more tokens for the same text than pre-4.7 models, so compare per-request costs, not just per-token rates. Full price tables on the Anthropic API pricing page.
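The per-request comparison can be sketched as follows. The prices and the 35% worst case come from the text above; the request size (40K input, 4K output tokens) is a hypothetical example, and applying the same inflation factor to both input and output is a simplifying assumption:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float,
                 token_inflation: float = 1.0) -> float:
    """Dollar cost of one request; prices are $ per million tokens.

    token_inflation scales both token counts to model a tokenizer that
    emits more tokens for the same text (1.35 = the "up to 35%" worst case).
    """
    scaled_in = input_tokens * token_inflation
    scaled_out = output_tokens * token_inflation
    return (scaled_in * input_price + scaled_out * output_price) / 1_000_000


# Hypothetical request, with token counts as measured by a pre-4.7 tokenizer.
IN_TOKENS, OUT_TOKENS = 40_000, 4_000

old = request_cost(IN_TOKENS, OUT_TOKENS, 5.0, 25.0)        # pre-4.7 model at $5/$25
new_worst = request_cost(IN_TOKENS, OUT_TOKENS, 5.0, 25.0,
                         token_inflation=1.35)              # 4.7+ worst case, same rates
legacy = request_cost(IN_TOKENS, OUT_TOKENS, 15.0, 75.0)    # Opus 4.1 at $15/$75

print(f"pre-4.7 at $5/$25:       ${old:.3f}")
print(f"4.7+ worst case (+35%):  ${new_worst:.3f}")
print(f"Opus 4.1 at $15/$75:     ${legacy:.3f}")
```

Under these assumptions the same per-token rate can still mean a per-request bill up to 35% higher, while staying on Opus 4.1 costs three times the baseline.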

One model to ignore: Claude Mythos 5 (93.9% SWE-bench Verified as Mythos Preview) is the same underlying model as Fable 5 with safety classifiers lifted, available only to approved Project Glasswing partners. It is also suspended as of June 12, 2026. There is no self-serve access and no active availability, so it is not a practical coding pick.

Claude Opus 4.8 vs GPT-5.5: The $5-Input Flagships

Both current flagships cost $5/M input. The output gap ($25 vs $30) and the benchmark splits decide it:

| Dimension | Claude Opus 4.8 | GPT-5.5 |
|---|---|---|
| SWE-bench Pro (vendor table) | 69.2% | 58.6% |
| Terminal-Bench 2.1 | 74.6% | 78.2% |
| GDPval-AA (knowledge work Elo) | 1890 | 1769 |
| Pricing ($/M in / out) | $5 / $25 | $5 / $30 |
| Cached input ($/M) | $0.50 (cache hit) | $0.50 |
| Context / max output | 1M / 128K | 1M / 128K |
| Knowledge cutoff | Jan 2026 | Dec 1, 2025 |

Head-to-Head: The Race Card

GPT-5.3 Codex vs Claude Opus 4.6 across 7 dimensions

Tally: Codex leads 3 dimensions, Opus leads 4.

| Dimension | Leader | GPT-5.3 Codex | Claude Opus 4.6 |
|---|---|---|---|
| Raw Speed | Codex | 25% faster execution | Thorough but slower |
| Reasoning Depth | Opus | Strong on algorithms | GPQA Diamond leader |
| Intent Understanding | Opus | Needs detailed prompts | Gets vague requests right |
| Token Efficiency | Codex | 2-4x fewer tokens | Thinks out loud more |
| Multi-file Refactoring | Opus | Good at scoped edits | Handles 10+ files cleanly |
| Code Review | Codex | Finds edge cases fast | Deeper architectural insight |
| Context Window | Opus | 256K tokens | 1M tokens (beta) |

Scores based on benchmarks, developer surveys, and hands-on testing as of February 2026. Neither model "wins" overall — it depends on your workflow.

Opus 4.8 wins repo-scale software engineering (69.2% vs 58.6% SWE-bench Pro on vendor harnesses) and knowledge work (1890 vs 1769 GDPval-AA Elo). GPT-5.5 wins terminal-native workflows (78.2% vs 74.6% Terminal-Bench 2.1) and ships through Codex, where a $20/mo Plus plan covers 15-80 messages per 5-hour window without per-token billing. If your work lives in a CLI agent, Codex's plan pricing changes the math; if it lives in long-horizon repo edits, Opus 4.8 is the stronger pick.

Open-Source Models: 80% SWE-bench Verified at a Tenth of the Price

The open-weights frontier moved twice this spring. DeepSeek V4 (April 24, 2026) put an MIT-licensed model at 80.6% SWE-bench Verified, and MiniMax M3 (May 31, 2026) came within 0.1 points of it at 80.5% with a 1M context window. Official API prices and self-host terms:

| Model | SWE-bench Verified | $/M in / out (official API) | Context / license |
|---|---|---|---|
| DeepSeek-V4-Pro-Max | 80.6% (top open weights) | see V4 Pro tier | 1M / MIT |
| DeepSeek V4 Pro (1.6T / 49B active) | see Pro Max | $0.435 / $0.87 | 1M, 384K out / MIT |
| DeepSeek V4 Flash (284B / 13B active) | n/a | $0.14 / $0.28 | 1M, 384K out / MIT |
| morph-dsv4flash (DeepSeek V4 Flash on Morph) | 16-bit activations, codegen spec decoding + kernels | $0.139 / $0.278 | MIT weights, hosted |
| MiniMax M3 | 80.5% | $0.30 / $1.20 (≤512K) | 1M / open weights |
| Qwen3.7 Max | 80.4% | hosted via Model Studio | see Qwen tiers |
| Qwen3.5-397B-A17B | n/a | $0.60 / $3.60 (0-256K) | open weights |
| GLM-5.1 | n/a | $1.40 / $4.40 | Z.AI flagship |
| Kimi K2.5 | n/a | $0.60 / $3.00 | 262K context |
| Qwen3 Coder 480B | 38.70% SWE-bench Pro (Scale) | self-host | open weights |

The arithmetic that matters: MiniMax M3 produces output at $1.20/M against Opus 4.8's $25/M, a 20.8x gap, while trailing it by 8.1 points on SWE-bench Verified. DeepSeek V4 Flash sets the absolute floor at $0.28/M output with a 1M-token context and 384K max output. The price gap is mostly model size, not provider margin, which is the core fact of how AI inference is priced. For teams with data-sovereignty requirements, DeepSeek V4's MIT license means the 80.6% model is self-hostable outright. Deeper coverage on the best open-source coding model page.
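The ratio and the score gap are simple to recompute, and scaling them to a monthly volume shows what the gap means in dollars. In the sketch below the prices and scores are the ones quoted above, while the 500M-output-token monthly workload is a hypothetical figure chosen for illustration:

```python
# Output price ($ per million tokens) and SWE-bench Verified score (%).
OPUS_PRICE, OPUS_SCORE = 25.00, 88.6   # Claude Opus 4.8
M3_PRICE, M3_SCORE = 1.20, 80.5        # MiniMax M3

price_ratio = OPUS_PRICE / M3_PRICE    # how many times more Opus output costs
score_gap = OPUS_SCORE - M3_SCORE      # points M3 trails by

# Hypothetical workload: 500 million output tokens per month.
MONTHLY_OUTPUT_MTOK = 500
opus_bill = MONTHLY_OUTPUT_MTOK * OPUS_PRICE
m3_bill = MONTHLY_OUTPUT_MTOK * M3_PRICE

print(f"price ratio: {price_ratio:.1f}x, score gap: {score_gap:.1f} points")
print(f"monthly output bill: Opus 4.8 ${opus_bill:,.0f} vs MiniMax M3 ${m3_bill:,.0f}")
```

This covers output tokens only; input pricing ($5 vs $0.30 per M) widens or narrows the gap depending on how prompt-heavy the workload is.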

  • 80.6%: DeepSeek-V4-Pro-Max SWE-bench Verified
  • $0.28/M: DeepSeek V4 Flash output price
  • 20.8x: Opus 4.8 output premium over MiniMax M3

Where you run DeepSeek changes the output

Open weights are identical everywhere; the serving stack is not. Most serverless providers quantize activations to fp8 to cut cost, which degrades output quality. Morph serves DeepSeek with 16-bit (bf16) activations and does not quantize them, so output matches the reference weights. That makes Morph the place to run DeepSeek when fidelity matters. For coding specifically, Morph adds speculative decoding tuned on code (draft/ngram) plus custom low-level inference kernels built for code generation, which is why it is the fastest and highest-quality option for coding agents. morph-dsv4flash (DeepSeek V4 Flash) runs at $0.139/M input and $0.278/M output. See the full catalog on Morph models and pricing.

Best AI Model for Coding at $0

Four paths to real coding capability without a credit card, as of June 2026:

| Option | What you get | Limit |
|---|---|---|
| Codex CLI on ChatGPT Free | Codex CLI with GPT-5.x models, $0/mo | Lowest usage limits; Plus ($20/mo) raises to 15-80 GPT-5.5 messages per 5h window |
| GLM-4.7-Flash / GLM-4.5-Flash | Free models on the Z.AI API | Flash tier only; GLM-5.1 costs $1.40/$4.40 |
| Qwen on Alibaba Model Studio | 1M free tokens for new users across Qwen 3.5 models | 90-day validity |
| DeepSeek V4 open weights | MIT-licensed weights, self-host V4 Flash (284B/13B active) | Your GPU cost; hosted API is $0.14/$0.28 anyway |

The honest framing: free tiers are for evaluation and light use. DeepSeek V4 Flash's paid API at $0.14/M input ($0.0028/M on cache hits) and $0.28/M output is close enough to zero that most teams skip self-hosting unless data cannot leave their network.
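To see how close to zero that is, the sketch below estimates a monthly DeepSeek V4 Flash bill at the three rates quoted above. The workload (2,000M input and 200M output tokens per month) and the 80% cache-hit rate are hypothetical values, not measurements:

```python
INPUT_MISS = 0.14    # $ per million input tokens, cache miss
INPUT_HIT = 0.0028   # $ per million input tokens, cache hit
OUTPUT = 0.28        # $ per million output tokens


def monthly_cost(input_mtok: float, output_mtok: float, cache_hit_rate: float) -> float:
    """Monthly bill in dollars; token volumes are in millions of tokens."""
    hit = input_mtok * cache_hit_rate * INPUT_HIT
    miss = input_mtok * (1 - cache_hit_rate) * INPUT_MISS
    return hit + miss + output_mtok * OUTPUT


# Hypothetical team workload: 2,000M input and 200M output tokens per month.
no_cache = monthly_cost(2_000, 200, cache_hit_rate=0.0)
mostly_cached = monthly_cost(2_000, 200, cache_hit_rate=0.8)

print(f"no cache hits:  ${no_cache:.2f}/month")
print(f"80% cache hits: ${mostly_cached:.2f}/month")
```

At volumes like these the hosted bill stays in the low hundreds of dollars, which is the comparison point for any self-hosted GPU budget.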

Why Vendor Scores Run 20 Points Above Scale's Leaderboard

Anthropic reports Fable 5 at 80.3% on SWE-bench Pro. Scale's standardized leaderboard tops out at 59.10% (gpt-5.4). Both numbers are real. The difference is the harness: Scale runs every model through identical scaffolding; vendors run their own tuned agent stacks.

Same Benchmark, Different Harness: SWE-bench Pro

Vendor-reported (Anthropic scaffold) vs Scale SEAL standardized scaffolding.

Vendor-reported:

  • Fable 5 (vendor): 80.3%
  • Opus 4.8 (vendor): 69.2%
  • GPT-5.5 (vendor table): 58.6%

Scale SEAL standardized:

  • gpt-5.4 (Scale, #1): 59.1%
  • Opus 4.6 (Scale, top Claude): 51.9%
  • Gemini 3.1 Pro (Scale): 46.1%

The vendor-vs-standardized gap is 17-21 points for the same model families. The harness is the variable.
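The two ends of that range can be recovered from the scores quoted on this page. The pairing below is an assumption about how the range was derived, and it compares different model versions because Scale has not tested the newest ones:

```python
# SWE-bench Pro scores (%) quoted in this article.
VENDOR = {"Fable 5": 80.3, "Opus 4.8": 69.2}      # Anthropic scaffold
SCALE = {"gpt-5.4": 59.1, "Opus 4.6": 51.9}       # Scale SEAL scaffold

# Low end: newest vendor-reported Opus vs the top Claude entry on Scale.
low = VENDOR["Opus 4.8"] - SCALE["Opus 4.6"]
# High end: best vendor-reported score vs the best standardized score.
high = VENDOR["Fable 5"] - SCALE["gpt-5.4"]

print(f"vendor-vs-standardized gap: {low:.1f} to {high:.1f} points")
```

Because each pair mixes a harness change with a model change, the result is an upper bound on what the harness alone contributes, not an isolated measurement.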

The practical conclusion has not changed since 2025: the scaffold around the model accounts for more variance than swapping frontier models. Before paying a 2x token premium, fix retrieval, context management, and tool design. Subagent architecture and context engineering move scores more than model choice does.

The implication

A mid-tier model in a strong harness beats a frontier model in a weak one. Tools like WarpGrep (semantic codebase search for terminal agents, $0 for 100k requests) upgrade the harness for every model you route through it.

Per-Task Routing: Which Model for Which Job

The most cost-effective setups in June 2026 route by task, not by loyalty. Numbers-backed defaults:

| Task | Route to | Why (verified numbers) |
|---|---|---|
| Overnight refactor, >50 files | Claude Opus 4.8 | 69.2% SWE-bench Pro (vendor), 1M context, no long-context surcharge |
| Hardest debugging / migration runs | Claude Fable 5 (suspended; use Opus 4.8 now) | 95.0% SWE-bench Verified, 80.3% Pro; worth $10/$50 when reruns cost more |
| Quick edits, lint fixes, subagents | Claude Haiku 4.5 | $1/$5, ~$0.13 output per Pro point, $0.10/M cache hits |
| Terminal / DevOps workflows | GPT-5.5 | 78.2% Terminal-Bench 2.1 vs Opus 4.8's 74.6% |
| Standardized-harness ceiling | gpt-5.4 | 59.10% Scale SEAL SWE-bench Pro at $2.50/$15 |
| High-volume batch / CI bots | DeepSeek V4 Flash or MiniMax M3 | $0.28/M and $1.20/M output, both 1M context |
| Budget proprietary, long prompts | Gemini 3.1 Pro | 46.10% Scale Pro at $2/$12 (≤200K), $4/$18 above |
| Data sovereignty / self-host | DeepSeek V4 (MIT) | 80.6% SWE-bench Verified (Pro Max), weights on Hugging Face |
| Codebase search for any agent | WarpGrep + any model | Model-agnostic retrieval; $0 for 100k requests |
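In code, the table reduces to a lookup with a fallback for unavailable models. This is a minimal sketch, not a production router: the task labels and the `pick_model` helper are hypothetical, and only the model choices come from the table:

```python
# Task labels are hypothetical; the model choices follow the routing table.
ROUTES = {
    "large_refactor": "Claude Opus 4.8",
    "hard_debugging": "Claude Fable 5",
    "quick_edit": "Claude Haiku 4.5",
    "terminal_devops": "GPT-5.5",
    "batch_ci": "DeepSeek V4 Flash",
    "long_prompt_budget": "Gemini 3.1 Pro",
    "self_host": "DeepSeek V4",
}

# Substitutes for models that cannot be called right now (Fable 5 is suspended).
UNAVAILABLE = {"Claude Fable 5": "Claude Opus 4.8"}

DEFAULT_MODEL = "Claude Opus 4.8"


def pick_model(task: str) -> str:
    """Return the routed model for a task, falling back when it is unavailable."""
    model = ROUTES.get(task, DEFAULT_MODEL)
    return UNAVAILABLE.get(model, model)


for task in ("quick_edit", "hard_debugging", "something_unlisted"):
    print(f"{task:20s} -> {pick_model(task)}")
```

Keeping availability in a separate map means a suspension or retirement is a one-line change rather than an edit to every route.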

Cost levers that apply across routes: Anthropic's Batch API is 50% off input and output, prompt-cache reads are 0.1x base input, and DeepSeek cache hits drop input to $0.0028/M. A routing setup that pins 80% of traffic to Haiku 4.5 or DeepSeek V4 Flash and reserves Opus 4.8 / Fable 5 for the hard 20% typically beats any single-model subscription. Doing the split automatically needs a classifier: Morph's Router scores each prompt by difficulty and domain in ~180ms and returns the cheapest capable model, and Claude Code Router makes that per-request routing concrete inside the terminal agent. Which model your editor agent uses matters too; see Claude Code models for harness-side defaults.
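A rough way to check the 80/20 claim is to compute the blended output price. The sketch below uses the Haiku 4.5 and Opus 4.8 output prices and the Batch API discount quoted above; it assumes "80% of traffic" means 80% of output tokens, which overstates savings if hard tasks generate longer outputs:

```python
# Output prices in $ per million tokens, from this article.
HAIKU_OUT = 5.0
OPUS_OUT = 25.0
BATCH_DISCOUNT = 0.5   # Anthropic Batch API: 50% off input and output


def blended_output_price(cheap_share: float, cheap_price: float,
                         premium_price: float) -> float:
    """Average $/M output tokens when cheap_share of tokens go to the cheap model."""
    return cheap_share * cheap_price + (1 - cheap_share) * premium_price


blended = blended_output_price(0.8, HAIKU_OUT, OPUS_OUT)
blended_batch = blended * BATCH_DISCOUNT
savings = 1 - blended / OPUS_OUT

print(f"blended output price:  ${blended:.2f}/M")
print(f"with Batch API:        ${blended_batch:.2f}/M")
print(f"saving vs Opus-only:   {savings:.0%}")
```

The same function works for any cheap/premium pair, for example DeepSeek V4 Flash at $0.28/M in place of Haiku 4.5.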

Frequently Asked Questions

What is the best AI model for coding in June 2026?

Claude Fable 5 (GA June 9, 2026) tops SWE-bench Verified at 95.0% and Anthropic's SWE-bench Pro table at 80.3%, priced at $10/$50 per million tokens, but it is currently suspended (see note above). The practical pick while it is unavailable is Claude Opus 4.8. On Scale's standardized SWE-bench Pro leaderboard, gpt-5.4 leads at 59.10%, ahead of Muse Spark (55.00%) and Claude Opus 4.6 (51.90%). Cost-adjusted, Claude Haiku 4.5 ($1/$5) is the cheapest per solved benchmark point at roughly $0.13 of output per point.

What is the best LLM for coding?

"Best LLM for coding" and "best AI model for coding" are the same question. By raw capability, the top LLM in June 2026 is Claude Fable 5 (95.0% SWE-bench Verified, $10/$50), though it is currently suspended (see note above); Claude Opus 4.8 is the practical pick in the interim. The best open-source LLM is DeepSeek-V4-Pro-Max at 80.6%, with MiniMax M3 (80.5%) and Qwen3.7 Max (80.4%) within 0.2 points. The cheapest per solved point is Claude Haiku 4.5 at about $0.13. For most teams the best answer is not one LLM but a router that sends easy work to a cheap or open model and reserves a frontier model for hard edits.

Which Claude model is best for coding?

Claude Opus 4.8 (API ID claude-opus-4-8) is the default and currently the practical pick: 88.6% SWE-bench Verified, 69.2% SWE-bench Pro on Anthropic's harness, $5/$25, 1M context with no long-context surcharge. Claude Fable 5 (claude-fable-5, $10/$50) is the ceiling at 95.0% Verified but is currently suspended (see note above). Claude Sonnet 4.6 (claude-sonnet-4-6, $3/$15) is the volume pick with a 1M context, and Claude Haiku 4.5 (claude-haiku-4-5, $1/$5) handles quick edits and subagents. Avoid Sonnet 4 and Opus 4 (retire June 15, 2026) and Opus 4.1 (retires August 5, 2026, still $15/$75).

What are the SWE-bench Pro scores for coding models in 2026?

Scale SEAL public set (standardized scaffolding, June 2026): gpt-5.4 xHigh 59.10%, Muse Spark 55.00%, Opus 4.6 thinking 51.90%, Gemini 3.1 Pro thinking 46.10%, Opus 4.5 45.89%, Sonnet 4.5 43.60%, Gemini 3 Pro 43.30%, Sonnet 4 42.70%, GPT-5 High 41.78%, gpt-5.2-codex 41.04%, Haiku 4.5 39.45%, Qwen3 Coder 480B 38.70%. Vendor-reported numbers run higher: Anthropic reports Fable 5 at 80.3% and Opus 4.8 at 69.2% on its own scaffold.

What is the best free AI model for coding?

Four real $0 paths in June 2026: Codex CLI is included with a ChatGPT Free sign-in (lowest usage limits); GLM-4.7-Flash and GLM-4.5-Flash are free on the Z.AI API; Alibaba Model Studio gives new users 1M free tokens (90-day validity) across Qwen 3.5 models; and DeepSeek V4's MIT-licensed weights are self-hostable. DeepSeek V4 Flash's paid API is near-free at $0.14/M input, $0.28/M output.

What is the best open-source AI model for coding?

DeepSeek-V4-Pro-Max leads open weights at 80.6% SWE-bench Verified, tied with Gemini 3.1 Pro. MiniMax M3 scores 80.5% (1M context, $0.30/$1.20) and Qwen3.7 Max 80.4%. DeepSeek V4 ships under MIT: V4 Pro (1.6T total / 49B active) costs $0.435/$0.87 on the official API, V4 Flash (284B/13B) costs $0.14/$0.28 with a 1M context and 384K max output.

How much do the top coding models cost per million tokens?

Output price ladder, June 2026: DeepSeek V4 Flash $0.28, MiniMax M3 $1.20, Qwen3.5-Plus $2.40, Kimi K2.5 $3.00, GLM-5 $3.20, Gemini 3.1 Pro $12, gpt-5.3-codex $14, gpt-5.4 and Claude Sonnet 4.6 $15, Claude Opus 4.8 $25, gpt-5.5 $30, Claude Fable 5 $50. Inputs range from $0.14/M (DeepSeek V4 Flash) to $10/M (Fable 5).

Why do vendor benchmark scores differ from Scale's leaderboard?

Scale runs every model through identical standardized scaffolding on SWE-bench Pro's 1,865 tasks across 41 repositories; vendors run their own tuned harnesses. The same model family scores 51.90% (Opus 4.6 on Scale) versus 69.2% (Opus 4.8 on Anthropic's harness). That 17-to-21-point spread is the harness, which is why agent tooling moves results more than model swaps.

Which AI model is most cost-effective for coding in 2026?

Dividing output price by Scale SEAL SWE-bench Pro score: Claude Haiku 4.5 about $0.13 of output per point, gpt-5.4 $0.25, Gemini 3.1 Pro $0.26, gpt-5.2-codex and Sonnet 4.5 about $0.34, Opus 4.6 $0.48, Opus 4.5 $0.54. For raw per-token cost with a 1M context, DeepSeek V4 Flash at $0.14/$0.28 is the floor.

Stop Debating Models. Start Searching Codebases.

WarpGrep adds semantic codebase search to any terminal agent. Works with Fable 5, Opus 4.8, GPT-5.5, Gemini, DeepSeek, or any model. $0 for 100k requests, $1 per 1M on Pro. The harness matters more than the model.