Claude Opus 4: What Practitioners Need to Know

Anthropic shipped Claude Opus 4 this month, and the benchmarks tell a clear story: this is the strongest general-purpose model available to practitioners today. But benchmarks are table stakes. The real question is what changes in how you build.

What does the benchmark picture actually look like?

Opus 4 leads on SWE-bench Verified with a 72.5% solve rate, up from 49.0% on Opus 3. On the GPQA Diamond graduate-level reasoning benchmark, it scores 74.9% — a meaningful jump that puts it ahead of GPT-4o and Gemini Ultra on problems that require multi-step scientific reasoning.

The extended thinking capability is where the architectural shift matters most. Opus 4 can use up to 128,000 tokens of internal reasoning before producing a response. According to Anthropic's technical documentation, this scales test-time compute dynamically — the model allocates more thinking to harder problems without manual prompting.

Context window capacity holds at 200,000 tokens with no degradation in recall accuracy across the full window, based on needle-in-a-haystack evaluations.

How does extended thinking change agent architectures?

The practical impact lands hardest on agent systems. Previous models required elaborate chain-of-thought prompting, decomposition frameworks, and multi-turn scaffolding to handle complex reasoning. Opus 4's native extended thinking absorbs much of that complexity into the model itself.

This means:

Simpler agent loops. You can pass harder problems in a single turn instead of building multi-step orchestration. The model's internal reasoning replaces external scaffolding.
Better tool use. Opus 4 shows improved reliability in multi-tool sequences — selecting the right tool, interpreting results, and deciding next steps without the hallucination patterns that plagued earlier models.
Reduced prompt engineering. The model follows complex instructions more faithfully, reducing the need for few-shot examples and guard-rail prompts that consumed context window budget.

For teams running Claude Code or similar coding agents, the SWE-bench improvement translates directly: the model resolves real GitHub issues — complete with test discovery, multi-file edits, and verification — at a rate that was state-of-the-art for specialised coding systems just six months ago.

What about cost and latency?

Opus 4 is priced at the premium tier. For teams already using the Anthropic API, the per-token cost reflects the capability jump. The key economic question is whether Opus 4 in fewer turns outperforms a cheaper model in more turns.

Early production data from teams running both suggests the answer is yes for tasks involving reasoning, debugging, and synthesis. For straightforward extraction, classification, and formatting, Haiku and Sonnet remain more cost-effective.

Latency on extended thinking responses varies with problem complexity. Simple queries return in under two seconds. Complex reasoning tasks with full thinking allocation can take 30-60 seconds. For synchronous user-facing applications, this means you need to design for variable response times — streaming the thinking process or showing progress indicators.

What should you do differently?

Three concrete adjustments for teams adopting Opus 4:

Audit your scaffolding. If you built multi-turn decomposition to compensate for weaker reasoning, test whether Opus 4 handles the full task in a single pass. Many teams will be able to simplify their agent architectures significantly.
Reassess your model routing. The gap between Opus and Sonnet has widened on reasoning tasks but narrowed on simpler ones. A well-configured model router that sends hard problems to Opus and routine tasks to Haiku will outperform a single-model approach on both quality and cost.
Test your evaluation suite. If your evals were calibrated against Opus 3 or Sonnet, they may not discriminate at Opus 4's capability level. Tasks that were previously challenging may now be trivially solved, masking remaining failure modes. Upgrade your hardest test cases.

The model landscape moves fast, but the architectural lesson is consistent: invest in evaluation and routing infrastructure, not model-specific optimisations. The models keep getting better. Your job is to have systems that can absorb that improvement automatically.

What does the benchmark picture actually look like?

How does extended thinking change agent architectures?

What about cost and latency?

What should you do differently?

Share this briefing

Your daily AI update