The Inference Cost Collapse: 10x Per Year and What It Means for AI Architecture
AI inference costs are dropping at roughly 10x per year — faster than Moore's Law ever achieved — and this cost trajectory should be reshaping how you design systems, price products, and think about build-versus-buy decisions.
Jeff Brook
AI Researcher — Founder, AI Daily News
In March 2024, GPT-4-class inference cost approximately $30 per million input tokens through OpenAI's API. By March 2025, equivalent capability was available for $3 per million tokens across multiple providers. As of February 2026, the effective cost for frontier-equivalent inference is $0.30-0.50 per million tokens when using competitive providers, open-weight models on optimised infrastructure, or batch processing discounts.
This is a 10x annual cost reduction, a pace that far exceeds Moore's Law (roughly 2x every two years, or about 1.4x per year). The drivers are a combination of hardware improvements (Blackwell, TPU v6), software optimisation (quantisation, speculative decoding, KV-cache improvements), open-weight models that enable self-hosting, and aggressive pricing competition among providers.
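As a rough check on the headline figure, the price points quoted above can be turned into an annualised reduction factor. This is a minimal sketch: the function name is illustrative, and the 23-month span assumes the March 2024 and February 2026 dates given in the text.

```python
def annualised_factor(start_price: float, end_price: float, months: float) -> float:
    """Per-year reduction factor implied by two price points `months` apart."""
    return (start_price / end_price) ** (12 / months)


# Prices in USD per million tokens, taken from the article's own figures.
# March 2024 -> February 2026 is 23 months.
low = annualised_factor(30.0, 0.50, 23)   # conservative end of the current range
high = annualised_factor(30.0, 0.30, 23)  # aggressive end of the current range

# Moore's Law as stated in the article: 2x every 24 months.
moore = annualised_factor(2.0, 1.0, 24)

print(f"Inference: {low:.1f}x to {high:.1f}x per year")
print(f"Moore's Law: {moore:.2f}x per year")
```

Under these inputs the implied rate lands at roughly 8.5x to 11x per year, which is consistent with the "about 10x" shorthand.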
According to analysis by Epoch AI, this cost trajectory shows no signs of flattening through at least 2027. Understanding what it means — and acting on it — is one of the highest-leverage strategic decisions available to AI practitioners right now.
What is driving the cost down?
Four factors compound.
Hardware efficiency. Each GPU generation delivers roughly 2-3x more inference throughput per dollar than its predecessor. The Blackwell Ultra, announced at GTC 2025, achieves 2.5x the inference performance per watt of the H100. Google's TPU v6 (Trillium) shows comparable gains. The hardware upgrade cycle alone accounts for about 2-3x cost reduction per year.
Quantisation and compression. Running models at lower numerical precision — FP8, INT8, and increasingly FP4 — reduces memory bandwidth and compute requirements with minimal quality impact. A model quantised to FP4 requires roughly 4x less memory and compute than the same model at FP16, with quality retention above 97% for most tasks. Two years ago, aggressive quantisation was a research technique; today it is a production default.
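The memory side of that claim is simple arithmetic on bits per weight. The sketch below assumes a hypothetical 70-billion-parameter model and counts only the weights; it ignores quantisation scale factors, the KV cache, and activations, all of which add overhead in practice.

```python
# Bits used to store one weight in each numeric format.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "INT8": 8, "FP4": 4}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


NUM_PARAMS = 70e9  # hypothetical model size, chosen only for illustration

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: {weight_memory_gb(NUM_PARAMS, fmt):.0f} GB")
```

Going from FP16 to FP4 cuts the weight footprint by 4x, which is what lets a model that needed several accelerators fit on fewer of them.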
Speculative decoding and batching. Speculative decoding uses a small, fast draft model to propose several tokens ahead, then has the large model verify them in a single parallel pass and keep the ones it agrees with. This improves throughput by 2-3x without additional hardware. Combined with intelligent request batching, which amortises the cost of reading the model weights from memory across many concurrent requests, these software optimisations deliver another 2-3x reduction in effective per-token cost.
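A toy sketch of the propose-then-verify loop behind speculative decoding, using greedy decoding so the output is identical to running the large model alone. Everything here is an illustrative assumption: the two "models" are simple arithmetic functions, and the per-position verification loop stands in for what a real system would do in one batched forward pass of the target model.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # context -> next token id


def target_next(ctx: List[int]) -> int:
    """Stand-in for the large, expensive model (toy arithmetic rule)."""
    return (ctx[-1] * 7 + 3) % 101


def draft_next(ctx: List[int]) -> int:
    """Stand-in for the small draft model: agrees with the target most of the time."""
    t = target_next(ctx)
    return t if len(ctx) % 4 else (t + 1) % 101


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target model checks every proposed position; counted as one pass.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)  # always keep the target's own choice
            if expected != proposal[i]:
                break                  # first disagreement ends this round
        else:
            # All k proposals matched, so the same pass yields one bonus token.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], target_passes


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


out, passes = speculative_decode(target_next, draft_next, [1], max_new_tokens=12)
print(out == greedy_decode(target_next, [1], 12), passes)
```

The speed-up comes from the number of target passes being smaller than the number of tokens produced; how much smaller depends on how often the draft model agrees with the target.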
Market competition. The number of viable inference providers has expanded from three (OpenAI, Anthropic, Google) to over a dozen, including DeepSeek, Mistral, Together AI, Fireworks, Groq, and multiple hyperscaler offerings. Price competition among providers with different cost structures (self-hosted open models versus API margins versus loss-leader strategies) drives prices toward marginal cost faster than any of the technical factors would on its own.
What does this mean for system architecture?
The inference cost collapse changes the calculus for several fundamental design decisions.
LLM calls should replace traditional code for tasks where they are more maintainable. When inference was expensive, the decision to use an LLM call versus writing deterministic code was primarily economic. At current prices, a million classification calls of roughly 100 tokens each cost about $30-50, against roughly $3,000 at 2024 prices. The decision should increasingly be about maintainability and adaptability rather than cost. An LLM-based classifier that handles edge cases gracefully and adapts to new categories without code changes may be cheaper to maintain than a rule-based system, even if each individual inference is more expensive than a database lookup.
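The cost of a classification workload depends on tokens per call as much as on price per token, so it is worth running the arithmetic for your own prompt sizes. A minimal sketch follows; the call lengths are hypothetical, and the prices are the article's 2024 figure and the midpoint of its current range.

```python
def batch_cost_usd(calls: int, tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Total cost of `calls` requests that each consume `tokens_per_call` tokens."""
    return calls * tokens_per_call * usd_per_million_tokens / 1_000_000


CALLS = 1_000_000
PRICES = {"2024 ($30/M)": 30.0, "2026 ($0.40/M)": 0.40}

for tokens_per_call in (100, 500):  # hypothetical short and long classification prompts
    for label, price in PRICES.items():
        cost = batch_cost_usd(CALLS, tokens_per_call, price)
        print(f"{tokens_per_call} tokens/call at {label}: ${cost:,.0f}")
```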
Multi-call architectures become economically viable. Agent workflows that chain 10-20 LLM calls per user interaction were prohibitively expensive at $30/M tokens. At $0.30/M tokens, a complex agent workflow costs fractions of a cent. This unlocks architectures that were theoretically sound but economically impractical: generate-then-verify pipelines, multi-perspective analysis (have three different prompts analyse the same problem and synthesise), and recursive refinement loops.
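A generate-then-verify pipeline can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the `LLM` type is a hypothetical prompt-in, text-out callable that you would back with whichever provider or self-hosted model you use, and the PASS/FAIL protocol is just one simple convention for the verifier.

```python
from typing import Callable, Optional, Tuple

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def generate_then_verify(task: str, generate: LLM, verify: LLM,
                         max_attempts: int = 3) -> Tuple[Optional[str], int]:
    """Draft an answer, ask a second call to check it, and retry with feedback."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        draft = generate(f"Task: {task}\n{feedback}".strip())
        verdict = verify(
            f"Task: {task}\nAnswer: {draft}\nReply PASS or FAIL with a reason."
        )
        if verdict.strip().upper().startswith("PASS"):
            return draft, attempt
        # Feed the verifier's objection into the next generation attempt.
        feedback = f"Previous answer was rejected: {verdict}"
    return None, max_attempts


# Stub models so the sketch runs without any API; replace with real calls.
def fake_generate(prompt: str) -> str:
    return "4" if "rejected" in prompt else "5"


def fake_verify(prompt: str) -> str:
    return "PASS" if "Answer: 4" in prompt else "FAIL: arithmetic is wrong"


answer, attempts = generate_then_verify("What is 2 + 2?", fake_generate, fake_verify)
print(answer, attempts)
```

Each attempt costs two calls, so a three-attempt ceiling means at most six calls per task. That overhead is what cheap inference makes tolerable.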
The build-versus-buy calculation shifts toward build. Until recently, self-hosting open-weight models on your own infrastructure was rarely cost-effective while API prices kept dropping. But the combination of capable open models (DeepSeek R2, Llama 4, Mistral Large 3) and optimised serving frameworks (vLLM, TensorRT-LLM) means that at moderate scale (roughly 10M+ tokens per day, though the exact threshold depends on GPU utilisation and on the API price you compare against) self-hosting can now be cheaper than API access, with the additional benefits of data privacy and latency control. The crossover point drops further as inference hardware costs decline.
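The crossover is easiest to reason about as a daily cost comparison. The sketch below is a simplified model, and every number passed in is a placeholder rather than a quoted price: GPU cost per day, tokens a single GPU can serve per day, and the API price all need to come from your own quotes and benchmarks. It also leaves out engineering and operations time, which often dominates at small scale.

```python
import math


def api_cost_per_day(tokens_per_day: float, usd_per_million_tokens: float) -> float:
    """Daily spend when buying tokens from an API provider."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens


def self_host_cost_per_day(tokens_per_day: float, gpu_usd_per_day: float,
                           gpu_tokens_per_day: float) -> float:
    """Daily spend when renting or amortising enough GPUs to serve the load."""
    gpus_needed = max(1, math.ceil(tokens_per_day / gpu_tokens_per_day))
    return gpus_needed * gpu_usd_per_day


def cheaper_option(tokens_per_day: float, usd_per_million_tokens: float,
                   gpu_usd_per_day: float, gpu_tokens_per_day: float) -> str:
    api = api_cost_per_day(tokens_per_day, usd_per_million_tokens)
    hosted = self_host_cost_per_day(tokens_per_day, gpu_usd_per_day, gpu_tokens_per_day)
    return "self-host" if hosted < api else "api"


# Placeholder inputs for illustration only.
for volume in (10e6, 100e6, 1e9):
    choice = cheaper_option(volume, usd_per_million_tokens=0.40,
                            gpu_usd_per_day=50.0, gpu_tokens_per_day=200e6)
    print(f"{volume:,.0f} tokens/day -> {choice}")
```

Because a GPU costs the same whether it is busy or idle, the result is very sensitive to utilisation. Treat any fixed tokens-per-day threshold as a starting estimate to be checked against your own inputs.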
Caching and optimisation matter less, experimentation matters more. When inference was expensive, teams invested heavily in prompt caching, result caching, and minimising unnecessary LLM calls. These are still good engineering practices, but the ROI of aggressive optimisation has decreased. Conversely, the ROI of experimentation — trying different prompts, model sizes, and architectures — has increased because each experiment is cheaper. Shift engineering investment from cost optimisation to quality optimisation.
What does this mean for product pricing?
If your product's value proposition is 'cheaper than hiring a human to do this,' your margin is expanding. AI products priced based on human-equivalent value — a $50 report that would cost $500 from a consultant — see their gross margin improve with every inference cost reduction. This is the strongest pricing position in AI: value-based pricing decoupled from cost-based infrastructure.
If your product is priced based on infrastructure costs, you are in a margin squeeze. Products priced as a markup on API costs face continuous price pressure as the underlying costs drop and competitors adjust. The sustainable response is to move to value-based pricing before the margin erodes.
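The difference between the two pricing positions shows up clearly when inference cost falls 10x per year. In this sketch the $50 report price comes from the example above, while the starting inference cost per report and the 30% markup are hypothetical values chosen for illustration.

```python
from typing import List, Tuple


def gross_margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of price."""
    return (price - cost) / price


def margin_table(value_price: float, start_cost: float, markup: float,
                 years: int, annual_drop: float = 10.0) -> List[Tuple[int, float, float]]:
    """Rows of (year, value-based margin fraction, cost-plus profit per unit in USD)."""
    rows = []
    cost = start_cost
    for year in range(years):
        cost_plus_price = cost * markup
        rows.append((year, gross_margin(value_price, cost), cost_plus_price - cost))
        cost /= annual_drop  # inference cost falls by `annual_drop` each year
    return rows


# $50 report from the text; $5 starting cost and 1.30 markup are assumptions.
for year, value_margin, cost_plus_profit in margin_table(50.0, 5.0, 1.30, years=3):
    print(f"year {year}: value-based margin {value_margin:.1%}, "
          f"cost-plus profit per unit ${cost_plus_profit:.3f}")
```

The value-based seller's margin climbs toward 100% as cost falls, while the cost-plus seller keeps the same percentage markup on a shrinking base, so profit per unit falls 10x each year.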
What should you watch for?
The second-order effects of cheap inference are more important than the first-order cost savings. When inference is effectively free at the margin, AI shifts from a scarce resource to be rationed to an abundant resource to be deployed liberally. This changes not just how much AI you use but where and how you use it. The organisations that adapt their architectures, products, and processes to this new reality will outperform those that simply enjoy lower bills for the same workloads.
The cost curve is your tailwind. Design for it.