LLM API Providers (2026): 12 APIs Compared by Price per 1M Tokens, Rate Limits, and Context

Pricing for 12 LLM API providers, verified June 9, 2026: GPT-5.5 $5/$30, Claude Opus 4.8 $5/$25, Gemini 3.1 Pro $2/$12, DeepSeek V4 Flash $0.14/$0.28 per 1M tokens. Plus rate-limit tiers, OpenAI-compatibility matrix, free tiers, and SWE-bench scores.

June 9, 2026 · 1 min read
LLM API Providers (2026): 12 APIs Compared by Price per 1M Tokens, Rate Limits, and Context

What Is an LLM API

An LLM API is an HTTP interface to a hosted large language model. You send a prompt (plus optional tools, images, or files) to an endpoint such as /v1/chat/completions and receive generated tokens back, billed per million tokens (MTok) of input and output. It replaces running model weights on your own GPUs with a metered service: no infrastructure, no model updates to manage, pay only for tokens used.

Every provider on this page works that way. The differences are price, rate limits, context window, and model quality, and they are large. As of June 9, 2026, output tokens cost between $0.28/M (DeepSeek V4 Flash) and $50/M (Claude Fable 5). Five models sit within half a point of each other on SWE-bench Verified (80.4 to 80.9 percent) while spanning a 20x price range.

179x
Output price spread ($0.28 to $50 per MTok)
1M
Context window now standard on flagships
95.0%
Top SWE-bench Verified score (Claude Fable 5, currently suspended)
$0
GLM-4.7-Flash API price (free)

All prices verified June 9, 2026

Every number below comes from the provider's official pricing or docs page as of June 9, 2026. LLM prices change quarterly; check the linked provider before committing to a volume contract.

Pricing Table: Flagship and Cheapest Model per Provider

Prices are $ per 1M tokens, input / output, standard (non-batch) tier. Context is the maximum input window.

ProviderModelInput/MTokOutput/MTokContext
AnthropicClaude Fable 5$10.00$50.001M
OpenAIGPT-5.5$5.00$30.001M
AnthropicClaude Opus 4.8$5.00$25.001M
AnthropicClaude Sonnet 4.6$3.00$15.001M
OpenAIGPT-5.4$2.50$15.001M
OpenAIGPT-5.3-Codex$1.75$14.00n/a
GoogleGemini 3.1 Pro (preview)$2.00$12.001M
GoogleGemini 3.5 Flash$1.50$9.00n/a
AnthropicClaude Haiku 4.5$1.00$5.00200K
OpenAIGPT-5.4-mini$0.75$4.50400K
Z.AIGLM-5.1$1.40$4.40n/a
MoonshotKimi K2.6$0.95$4.00262K
AlibabaQwen3.5-397B (open)$0.60$3.60256K tier
Z.AIGLM-5$1.00$3.20n/a
GoogleGemini 3 Flash (preview)$0.50$3.00n/a
MoonshotKimi K2.5$0.60$3.00262K
AlibabaQwen3.5-Plus$0.40$2.401M (tiered)
Z.AIGLM-4.7$0.60$2.20n/a
GoogleGemini 3.1 Flash-Lite$0.25$1.50n/a
OpenAIGPT-5.4-nano$0.20$1.25n/a
MiniMaxM3 (≤512K prompt)$0.30$1.201M
DeepSeekV4 Pro$0.435$0.871M
DeepSeekV4 Flash$0.14$0.281M
Morphmorph-dsv4flash (DeepSeek V4 Flash, 16-bit)$0.139$0.2781M
Z.AIGLM-4.7-FlashFreeFreen/a

Notes that change effective cost: Gemini 3.1 Pro rises to $4/$18 for prompts over 200K tokens. MiniMax M3 doubles to $0.60/$2.40 past 512K input. Qwen3.5-Plus rises to $0.50/$3.00 for the 256K-1M range. Anthropic charges the same per-token rate at any context length (a 900K-token request bills like a 9K one), but models from Opus 4.7 onward use a new tokenizer that can produce up to 35% more tokens for the same text, which inflates effective per-request cost in cross-provider comparisons.

Batch and cache discounts: OpenAI cached input is 10x cheaper ($0.50/M on GPT-5.5). Anthropic batch is 50% off and cache reads are 0.1x base input. Gemini batch is half price. DeepSeek cache hits drop input to $0.0028/M, a 50x reduction. If your workload re-sends the same system prompt or file context, the cache column matters more than the headline price.

Provider-by-Provider Breakdown

OpenAI

GPT-5.5 ($5/$30, 1M context, 128K max output, Dec 2025 knowledge cutoff) is the flagship. GPT-5.4 ($2.50/$15) covers most workloads at half the price, and GPT-5.4-nano ($0.20/$1.25) handles classification and extraction. GPT-5.3-Codex ($1.75/$14) is the coding-tuned line; it tops Scale's standardized SWE-bench Pro leaderboard in its gpt-5.4 (xHigh) variant at 59.1%. Cached input is billed at 10% of base. Regional data-residency endpoints add 10%. See Codex pricing for the subscription side.

ModelInput/MTokCached InOutput/MTokContext
GPT-5.5$5.00$0.50$30.001M
GPT-5.5-pro$30.00n/a$180.00n/a
GPT-5.4$2.50$0.25$15.001M
GPT-5.3-Codex$1.75$0.175$14.00n/a
GPT-5.4-mini$0.75$0.075$4.50400K
GPT-5.4-nano$0.20$0.02$1.25n/a

Anthropic (Claude)

Claude leads coding benchmarks: Fable 5 scored 95.0% on SWE-bench Verified (currently suspended, see note above), Opus 4.8 scores 88.6%. Fable 5 ($10/$50) was the frontier tier with adaptive thinking always on; Opus 4.8 ($5/$25) is the current price-performance pick, and an Opus 4.8 fast mode (research preview) delivers ~2.5x faster output at $10/$50. All current 1M-context Claude models charge no long-context surcharge. Sonnet 4 and Opus 4 retire June 15, 2026; Opus 4.1 retires August 5, 2026. Full breakdown: Anthropic API pricing.

ModelInput/MTokOutput/MTokContextMax Output
Claude Fable 5$10.00$50.001M128K
Claude Opus 4.8 / 4.7 / 4.6$5.00$25.001M128K
Claude Sonnet 4.6$3.00$15.001M64K
Claude Haiku 4.5$1.00$5.00200K64K

Google (Gemini)

Gemini 3.1 Pro (preview) is the cheapest frontier-tier API at $2/$12 for prompts up to 200K tokens ($4/$18 above). Gemini 3.5 Flash ($1.50/$9) is the stable workhorse Google positions for agentic and coding tasks. Gemini 3.1 Flash-Lite ($0.25/$1.50) is the budget tier. Batch runs at half price across the line; context caching costs $0.20/M plus hourly storage on 3.1 Pro.

DeepSeek

The price floor for frontier-adjacent quality. V4 Flash ($0.14/$0.28) and V4 Pro ($0.435/$0.87) both ship 1M context and 384K max output. V4 (released April 24, 2026) is open weights under MIT: V4 Pro is 1.6T total / 49B active parameters, V4 Flash 284B / 13B. DeepSeek-V4-Pro-Max scores 80.6% on SWE-bench Verified, the highest open-weights entry. The API accepts both OpenAI and Anthropic request formats. Legacy deepseek-chat and deepseek-reasoner endpoints map to V4 Flash and retire after July 24, 2026. Deep dive: DeepSeek V4.

Where you run DeepSeek changes its output. Most serverless providers quantize activations to fp8 to cut cost, which degrades quality away from the reference weights. Morph serves DeepSeek with 16-bit (bf16) activations and does not quantize them, so output matches the released weights. That makes Morph the best place to run DeepSeek when fidelity matters: morph-dsv4flash is $0.139/M input and $0.278/M output. For coding agents specifically, Morph adds codegen-tuned speculative decoding (draft/ngram tuned on code) and custom low-level inference kernels built for code generation, so it is the fastest and highest-fidelity way to run open-source models for codegen. See Morph models and pricing.

Moonshot (Kimi)

Kimi K2.6 ($0.95/$4.00) and K2.5 ($0.60/$3.00), both with a 262,144-token context window, automatic context caching (cache hits: $0.16/M and $0.10/M), and text, image, and video input on K2.5.

Z.AI (GLM)

GLM-5.1 ($1.40/$4.40) is the flagship, GLM-5 ($1.00/$3.20) remains listed, and GLM-4.7 ($0.60/$2.20) covers budget workloads. GLM-4.7-Flash and GLM-4.5-Flash are free, the only free named models from a major provider.

MiniMax

MiniMax M3 ($0.30/$1.20 up to 512K input, $0.60/$2.40 beyond, 1M context) scores 80.5% on SWE-bench Verified. That makes it the cheapest 80%+ SWE-bench model available through a hosted API: 1/20th of Opus 4.5's output price for a score 0.4 points lower.

Alibaba (Qwen)

Qwen3.5-Plus costs $0.40/$2.40 up to 256K input and $0.50/$3.00 from 256K to 1M, with thinking and non-thinking output priced the same. Hosted open-weights variants run from $0.25/$2.00 (qwen3.5-35B-A3B) to $0.60/$3.60 (qwen3.5-397B-A17B). New Model Studio users get 1M free tokens valid 90 days. Batch is half price.

Aggregators and Cloud Resellers

OpenRouter fronts hundreds of models behind one OpenAI-compatible key, useful for evaluation and failover at a small markup. AWS Bedrock and Azure / Microsoft Foundry resell first-party models (Claude Fable 5 is on Bedrock as anthropic.claude-fable-5 and on Foundry; note Opus context drops to 200K on Foundry) with VPC integration, compliance certifications, and consolidated cloud billing. If you need a routing layer rather than a reseller, see LLM gateways and LLM routers.

Morph (specialized)

Morph serves task-specific models rather than general chat: Fast Apply merges code edits at 10,500 tok/s, and WarpGrep does agentic codebase search ($0 for 100k requests, $1/1M on Pro). Covered in Specialized APIs below.

LLM API Rate Limits by Provider and Tier

Rate limits decide whether your launch survives traffic, and almost no comparison page publishes them. Here is what each provider enforces, from official docs as of June 9, 2026.

Anthropic: spend-based tiers, cache-aware token counting

Tiers advance automatically by cumulative credit purchase: Tier 1 at $5, Tier 2 at $40, Tier 3 at $200, Tier 4 at $400. Monthly spend caps are $500 / $500 / $1,000 / $200,000. Limits are per model class, measured in requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM). Cache reads do not count toward ITPM: with a 2M ITPM limit and an 80% cache hit rate you can process 10M total input tokens per minute.

Model classTier 1 (RPM / ITPM / OTPM)Tier 4 (RPM / ITPM / OTPM)
Claude Fable 550 / 100K / 20K4,000 / 4M / 800K
Claude Opus 4.x50 / 500K / 80K4,000 / 10M / 800K
Claude Sonnet 4.x50 / 30K / 8K4,000 / 2M / 400K
Claude Haiku 4.550 / 50K / 10K4,000 / 4M / 800K

Counterintuitive: Opus gets 16x the Tier 1 input throughput of Sonnet (500K vs 30K ITPM). If you are rate-limit-bound on a new Anthropic account, the expensive model is also the one you can call hardest.

OpenAI: spend-unlocked tiers, per-model limits

TierQualificationMonthly usage cap
FreeAllowed geography$100
Tier 1$5 paid$100
Tier 2$50 paid$500
Tier 3$100 paid$1,000
Tier 4$250 paid$5,000
Tier 5$1,000 paid$200,000

Per-model RPM/TPM numbers are published on each model's page in the OpenAI console rather than a single table, and GPT-5.5 carries a separate limit for long-context requests.

DeepSeek: concurrency caps instead of token budgets

DeepSeek publishes no RPM/TPM limits. It caps concurrent requests: 2,500 in flight on V4 Flash, 500 on V4 Pro. For batch-style pipelines this is friendlier than token-per-minute budgets; for bursty single requests it makes no difference.

Others

Moonshot, Z.AI, MiniMax, and Alibaba scale limits with account spend and publish them in their consoles rather than public tables. MiniMax sells a priority tier ($0.45/$1.80 for M3 at ≤512K) for latency-sensitive traffic. Bedrock and Azure enforce cloud-account-level quotas you raise through support tickets.

OpenAI-Compatibility Matrix

Most providers accept OpenAI-format /v1/chat/completions requests, so switching is a base_url and api_key change. The exceptions matter when you build against provider-specific features.

ProviderOpenAI chat formatAnthropic formatStreamingTool calls
OpenAINativeNoYesYes
AnthropicVia OpenAI SDK compat layerNativeYesYes
Google GeminiCompat endpointNoYesYes
DeepSeekYesYesYesYes
Moonshot (Kimi)YesNoYesYes
Z.AI (GLM)YesNoYesYes
MiniMaxYesNoYesYes
Alibaba (Qwen)Compat modeNoYesYes
OpenRouterNative (aggregator)NoYesYes
MorphYesNoYesn/a (task models)

DeepSeek is the only first-party provider that natively speaks both OpenAI and Anthropic formats, which means it drops into Claude-Code-style agents without a proxy. If you point a coding agent at a custom provider, the agent must support it: Codex, for example, configures custom providers in config.toml via model_providers entries with base_url, env_key, and wire_api = "responses" (the only wire API it supports), covered in Codex provider configuration.

Benchmarks vs Price: What You Get per Dollar

SWE-bench Verified (real GitHub issues, verified fixes) is the most cited coding benchmark. Scores below are from the llm-stats tracker, June 2026; prices are official API rates.

ModelSWE-bench VerifiedInput/MTokOutput/MTok
Claude Fable 5 (currently suspended, see note)95.0%$10.00$50.00
Claude Mythos 5 Preview (currently suspended, see note)93.9%restrictedrestricted
Claude Opus 4.888.6%$5.00$25.00
Claude Opus 4.787.6%$5.00$25.00
Claude Opus 4.580.9%$5.00$25.00
Claude Opus 4.680.8%$5.00$25.00
DeepSeek V4 Pro Max80.6%open weights (MIT)open weights (MIT)
Gemini 3.1 Pro80.6%$2.00$12.00
MiniMax M380.5%$0.30$1.20
Qwen3.7 Max80.4%see Model Studiosee Model Studio

The 80% club spans a 20x price range

Five models score between 80.4% and 80.9% on SWE-bench Verified. Within that half-point band, output prices run from $1.20/M (MiniMax M3) to $25/M (Claude Opus 4.5). The next 8 points of accuracy (Opus 4.8 at 88.6%) cost 20x more than M3; the next 14 (Fable 5 at 95.0%) cost 40x. Whether those points are worth it depends entirely on whether your tasks live in the gap.

On the harder SWE-bench Pro (1,865 tasks, 41 professional repos), Scale's standardized-scaffold leaderboard tops out at gpt-5.4 (xHigh) 59.1%, Muse Spark 55.0%, and Claude Opus 4.6 (thinking) 51.9% on the public set. Vendor-reported numbers run far higher (Anthropic reports Fable 5 at 80.3% with its own scaffold), so compare scores only within the same harness.

LLM API Free Tiers: Exact Amounts

ProviderFree offerLimit
Z.AIGLM-4.7-Flash, GLM-4.5-FlashFree models, no token charge
Alibaba Model Studio1M tokens for new users90-day validity
OpenAI APIFree tier in allowed regions$100/month usage cap
OpenAI CodexIncluded with ChatGPT FreeLowest 5-hour-window message limits
Google Gemini APIFree development tierReduced rate limits
AnthropicNoneTier 1 starts at $5 credit
Morph WarpGrep100k requestsThen $1 per 1M requests on Pro

For experimentation, the practical order is: GLM Flash models (unlimited free), Qwen's 1M tokens, then Gemini's free tier. For coding agents specifically, Codex CLI works with a free ChatGPT sign-in, with the lowest usage limits.

Cost Calculator: Real Workloads

Per-token prices mean nothing until mapped to usage. Two reference workloads, 30-day months, no cache discounts applied (caching reduces all of these).

Coding agent: 50M input / 5M output tokens per day

ModelDaily costMonthly cost
Claude Fable 5$750$22,500
GPT-5.5$400$12,000
Claude Opus 4.8$375$11,250
Claude Sonnet 4.6$225$6,750
GPT-5.4$200$6,000
Gemini 3.1 Pro (≤200K prompts)$160$4,800
GLM-5.1$92$2,760
Qwen3.5-Plus$32$960
MiniMax M3$21$630
DeepSeek V4 Flash$8.40$252

Support chatbot: 20M input / 5M output tokens per day

ModelDaily costMonthly cost
Claude Haiku 4.5$45.00$1,350
GPT-5.4-mini$37.50$1,125
Gemini 3 Flash (preview)$25.00$750
GLM-4.7$23.00$690
Gemini 3.1 Flash-Lite$12.50$375
GPT-5.4-nano$10.25$308
DeepSeek V4 Flash$4.20$126

The 89x gap on identical traffic

The same coding-agent workload costs $22,500/month on Claude Fable 5 and $252/month on DeepSeek V4 Flash. The production answer is rarely either extreme: route routine edits to a cheap model, escalate multi-file refactors to a frontier one, and cache aggressively (DeepSeek cache hits bill input at $0.0028/M; Anthropic cache reads are 0.1x and do not count against rate limits). Model the routing split with the LLM cost calculator.

Latency and Throughput

Two metrics matter: time to first token (how fast streaming starts) and tokens per second (how fast it finishes). Provider speed claims vary with load and region, so measure on your own traffic. What the official docs do establish:

  • Reasoning adds latency by design. Claude Fable 5 ran adaptive thinking always on; thinking tokens are generated and billed before visible output. DeepSeek V4 exposes separate thinking and non-thinking modes so you can opt out per request.
  • Anthropic sells speed explicitly: Opus 4.8 fast mode (research preview) delivers roughly 2.5x faster output at $10/$50 per MTok, double the standard rate. The previous generation's fast mode (Opus 4.6/4.7) costs $30/$150.
  • MiniMax's priority tier ($0.45/$1.80 vs $0.30/$1.20 on M3) trades a 50% surcharge for faster scheduling.
  • Specialized models break the general-purpose ceiling: Morph Fast Apply sustains 10,500 tok/s on code-edit application, two orders of magnitude above frontier chat models, because the task (merging an edit into a file) does not need frontier reasoning.

Specialized APIs: When General-Purpose Falls Short

Coding agents spend most of their compute on two operations: searching codebases for context and applying edits to files. Cognition (the team behind Devin) measured 60% of agent time on search alone. Both operations run through general-purpose LLMs by default, at general-purpose prices and speeds.

Morph Fast Apply

Code-edit application at 10,500 tok/s with 98% accuracy. The agent outputs a lazy edit snippet; Fast Apply merges it into the full file in 1-3 seconds. OpenAI-compatible /v1/chat/completions endpoint.

Morph WarpGrep

RL-trained agentic codebase search: 8 parallel tool calls per turn, 4 turns, sub-6s searches, 0.73 F1. 100k requests free, then $1 per 1M requests on Pro. Ships as an MCP server for any agent.

The pattern generalizes: a frontier model reasons and decides what to change; narrow, fast models execute the mechanical steps. The frontier model's output shrinks (edit snippets instead of whole files), which is exactly the token class that costs $12-50/M. See Fast Apply and WarpGrep.

How to Choose an LLM API

Decision framework

  • 1. Establish your quality floor cheaply. Run your real prompts through DeepSeek V4 Flash ($0.28/M out), MiniMax M3 ($1.20/M), and GPT-5.4-nano ($1.25/M). If one passes, you are done at under 5% of frontier cost.
  • 2. Escalate only measured gaps. Move to Gemini 3.1 Pro ($12/M), GPT-5.4 ($15/M), or Opus 4.8 ($25/M) for the tasks where the cheap tier measurably fails. Fable 5 ($50/M, currently suspended) was the residue tier; Opus 4.8 is the current frontier pick.
  • 3. Check the constraint that binds you. Rate-limited on day one? Anthropic Tier 1 gives Opus 500K ITPM vs Sonnet's 30K. Need self-hosting or data control? DeepSeek V4 and Qwen 3.5 are open weights. Compliance? Bedrock or Foundry.
  • 4. Use specialized APIs for mechanical steps. Edit application, search, embeddings, and reranking all have purpose-built models that beat $15-50/M generalists on both speed and cost.

The most common mistake is anchoring on one provider's flagship and never testing down. Five models clear 80% on SWE-bench Verified; only one of them costs $25/M output. The second most common is ignoring caching: at an 80% cache hit rate, Anthropic bills cached reads at 0.1x and exempts them from rate limits, and DeepSeek drops cached input to $0.0028/M. For long-context workloads, compare windows in detail at LLM context window comparison.

Frequently Asked Questions

What is an LLM API?

An HTTP interface to a hosted large language model. You POST a prompt to an endpoint such as /v1/chat/completions and receive generated tokens, billed per million tokens of input and output. It replaces self-hosted GPU inference with a metered service.

What are the best LLM API providers in 2026?

First-party: OpenAI (GPT-5.5), Anthropic (Claude Fable 5, Opus 4.8), Google (Gemini 3.1 Pro), DeepSeek (V4), Moonshot (Kimi K2.6), Z.AI (GLM-5.1), MiniMax (M3), and Alibaba (Qwen3.5). Aggregator: OpenRouter. Cloud resellers: Bedrock and Azure / Microsoft Foundry. Which is best depends on the binding constraint: benchmark score (Anthropic), price (DeepSeek, MiniMax), free tier (Z.AI), or compliance (cloud resellers).

What is the cheapest LLM API in 2026?

DeepSeek V4 Flash: $0.14/M input, $0.28/M output, $0.0028/M on cache hits, with a 1M-token context. MiniMax M3 ($0.30/$1.20) is the cheapest model above 80% on SWE-bench Verified. GLM-4.7-Flash is free outright.

Which LLM API is best for coding?

Claude Fable 5 led SWE-bench Verified at 95.0% (currently suspended, see note above); Opus 4.8 at 88.6% is the current top available pick. On Scale's standardized SWE-bench Pro, gpt-5.4 (xHigh) leads the public set at 59.1%. For budget coding, MiniMax M3 (80.5%) and open-weights DeepSeek V4 Pro Max (80.6%) are the value picks. For applying edits an agent has already decided on, Morph Fast Apply runs at 10,500 tok/s with 98% accuracy.

What rate limits do LLM APIs have?

Anthropic: tiered by cumulative deposit ($5 to $400); Tier 1 gives Fable 5 50 RPM / 100K input tokens per minute, Tier 4 gives 4,000 RPM / 4M ITPM, and cached tokens are exempt. OpenAI: tiers unlock at $5 to $1,000 paid with $100 to $200,000 monthly usage caps; per-model RPM/TPM live on each model page. DeepSeek: concurrency caps of 2,500 (V4 Flash) and 500 (V4 Pro) instead of token budgets.

Which LLM API has the largest context window?

1M tokens is the 2026 flagship standard: Claude Fable 5, Opus 4.8/4.7/4.6, Sonnet 4.6 (all with no long-context surcharge), GPT-5.5, GPT-5.4, Gemini 3.1 Pro (1,048,576), DeepSeek V4 Pro and Flash, MiniMax M3, and Qwen3.5-Plus. Below 1M: GPT-5.4-mini at 400K, Kimi K2.5/K2.6 at 262,144, Claude Haiku 4.5 at 200K.

Are LLM APIs interchangeable / OpenAI-compatible?

DeepSeek, Moonshot, Z.AI, MiniMax, Qwen, OpenRouter, and Morph accept OpenAI-format requests directly; Google exposes a compatibility endpoint; Anthropic offers an OpenAI SDK compatibility layer over its native Messages API. DeepSeek also accepts Anthropic-format requests. In practice, switching providers is a base URL and key change, with re-testing for tool-calling behavior.

Should I use one provider or multiple?

Multiple, behind a router. Send high-volume simple traffic to a sub-$1.20/M model and escalate hard tasks to a frontier model. The 89x spread between Fable 5 and V4 Flash on identical traffic is the budget you are leaving on the table with a single-provider setup. See LLM gateways for the plumbing.

Related Resources

Code Editing at 10,500 tok/s

Frontier LLM APIs bill $12-50 per million output tokens to rewrite whole files. Morph Fast Apply merges edit snippets into files at 10,500 tok/s with 98% accuracy, behind an OpenAI-compatible API.