OpenAI Launches GPT-5.4 with Native Computer Use and 1M-Token Context

OpenAI launched GPT-5.4 on March 5, rolling together three capabilities that have been evolving separately: a 1M-token context window, native computer use, and the integrated coding engine from GPT-5.3-Codex. The model ships in three variants — standard, Thinking, and Pro — each targeting different use cases and cost profiles.

The headline number is the computer use benchmark: 75% on OSWorld-Verified, compared to 72.4% for human operators on the same tasks. That is the first time a frontier model has exceeded the human baseline on a verified computer use benchmark.

Why does native computer use matter at this score?

Computer use has been the bottleneck for practical AI agents. Models could reason, write code, and generate content, but operating a desktop — clicking buttons, navigating menus, filling forms, handling popups — remained unreliable enough to require human supervision on every task.

At 75% on OSWorld-Verified, GPT-5.4 crosses a threshold. Not perfection, but reliable enough that certain categories of desktop automation become viable without a human watching every step. Think form-filling workflows, data entry across multiple applications, repetitive browser-based tasks, and basic QA testing of web interfaces.

The practical ceiling is still defined by the 25% failure rate. For high-stakes operations — financial transactions, medical records, legal filings — human oversight remains essential. But for the long tail of repetitive desktop work that currently occupies knowledge workers for hours each day, the economics have shifted.

What does the 1M-token context window change?

The jump from 128K to 1M tokens is not just a quantitative change. It is a qualitative shift in what you can fit inside a single inference call. Entire codebases, full legal contracts with appendices, multi-year financial reports, or complete documentation sets can now be processed in a single pass without chunking strategies.

For practitioners building retrieval-augmented generation (RAG) systems, this raises an uncomfortable question: when does RAG become unnecessary overhead? If the model can ingest your entire knowledge base in one call, the retrieval step may add latency and complexity without improving the answer. The decision depends on cost and refresh frequency: filling 1M tokens on every call is expensive, and a static corpus that is indexed once does not need to be re-sent with each query. Even so, the architectural calculus has changed.
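The cost side of that trade-off is easy to estimate with rough arithmetic. The sketch below compares the input cost of sending a whole corpus on every query against sending only retrieved chunks. The price per million tokens, the corpus size, and the chunk size are placeholder assumptions for illustration, not published GPT-5.4 pricing, and the function names are hypothetical.

```python
def input_cost(tokens: int, price_per_million: float) -> float:
    """Cost in dollars of sending `tokens` input tokens at the given price."""
    return tokens / 1_000_000 * price_per_million


def compare_strategies(corpus_tokens, retrieved_tokens, queries, price_per_million):
    """Return (full_context_cost, retrieval_cost) for a batch of queries.

    Ignores retrieval infrastructure cost and any prompt-caching discounts,
    both of which would shift the comparison.
    """
    full = queries * input_cost(corpus_tokens, price_per_million)
    rag = queries * input_cost(retrieved_tokens, price_per_million)
    return full, rag


# Assumed numbers: an 800K-token knowledge base, 8K tokens of retrieved
# chunks per query, 1,000 queries, $2.00 per million input tokens.
full, rag = compare_strategies(800_000, 8_000, 1_000, 2.00)
print(f"full context: ${full:,.2f}  retrieval: ${rag:,.2f}")
```

Under these assumed numbers the full-context approach costs about a hundred times more per query, which is why query volume and how often the corpus changes tend to dominate the decision.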

The context window also changes agentic workflows. An agent operating over a long session can keep far more of what it has observed and done in context, reducing the need for the memory management workarounds that current agent frameworks rely on.

How do the three variants differ?

OpenAI ships GPT-5.4 in three configurations:

Standard is the general-purpose variant, optimised for latency and cost. It suits most production workloads where speed matters more than maximum reasoning depth.
Thinking adds extended reasoning traces before generating final output, similar to the approach pioneered by o1 and o3. This variant trades latency for accuracy on complex reasoning tasks such as mathematical proofs, multi-step logic, and code architecture decisions.
Pro combines extended reasoning with maximum compute allocation. It is designed for the hardest problems, where cost is secondary to quality: research, complex analysis, and high-stakes decision support.

The integrated coding engine from GPT-5.3-Codex means all three variants handle code generation, debugging, and refactoring natively rather than routing to a separate model. This eliminates the context-switching overhead that plagued earlier architectures where a reasoning model would hand off to a coding specialist.

What should teams building with the API watch for?

Three practical considerations:

Cost management at 1M tokens. The context window is available, but filling it on every call will be expensive. Teams need to be deliberate about what goes into context. The fact that you can send 1M tokens does not mean you should. Prompt engineering discipline matters more, not less, with larger windows.
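One way to enforce that discipline is to give every call an explicit token budget and fill it in priority order. The sketch below is a minimal illustration: the four-characters-per-token estimate is a rough heuristic rather than a real tokenizer, and `estimate_tokens` and `pack_context` are hypothetical helper names.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (heuristic, not a tokenizer)."""
    return max(1, len(text) // 4)


def pack_context(documents, budget_tokens):
    """Pick documents in priority order until the token budget is used up.

    `documents` is a list of (priority, text) pairs; higher priority goes first.
    A document that does not fit in the remaining budget is skipped.
    """
    selected, used = [], 0
    for priority, text in sorted(documents, key=lambda d: d[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget_tokens:
            selected.append(text)
            used += cost
    return selected, used


docs = [
    (3, "a" * 400),   # about 100 tokens, most relevant
    (1, "b" * 2000),  # about 500 tokens, least relevant
    (2, "c" * 800),   # about 200 tokens
]
selected, used = pack_context(docs, budget_tokens=350)
print(len(selected), used)
```

In practice the priority would come from a relevance score and the estimate from the provider's tokenizer. The point is that the budget is a deliberate parameter rather than whatever happens to fit.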

Computer use integration patterns. The computer use capability works through a screen-observation-action loop. Latency per action is measured in seconds, not milliseconds. Systems designed around this need to account for the pace — it is automation at human speed, not API speed. Batch processing and parallel agent instances are the way to scale throughput.
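The loop and the scaling pattern can be sketched without reference to any particular API. In the example below, `observe`, `decide`, and `act` are hypothetical stand-ins for a screenshot capture, a model call, and an input driver. The toy environment exists only to make the sketch runnable and does not reflect how GPT-5.4 is actually invoked.

```python
from concurrent.futures import ThreadPoolExecutor


def run_agent(task, observe, decide, act, max_steps=20):
    """Run one observe -> decide -> act loop until the model signals completion.

    Returns the number of actions taken, or max_steps if the task never finished.
    """
    for step in range(max_steps):
        screen = observe()
        action = decide(task, screen)
        if action == "done":
            return step
        act(action)
    return max_steps


def run_batch(tasks, make_env, workers=4):
    """Run independent tasks in parallel, one isolated environment per task."""
    def run_one(task):
        observe, decide, act = make_env(task)
        return run_agent(task, observe, decide, act)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, tasks))


def make_fake_env(task):
    """Toy environment: the 'screen' is a click counter; a task needs 3 clicks."""
    state = {"clicks": 0}
    observe = lambda: state["clicks"]
    decide = lambda task, screen: "done" if screen >= 3 else "click"
    act = lambda action: state.update(clicks=state["clicks"] + 1)
    return observe, decide, act


results = run_batch(["form-a", "form-b", "form-c"], make_fake_env)
print(results)
```

Threads are a reasonable fit here because each step spends most of its time waiting on the model and the screen. The `max_steps` cap keeps a stuck agent from looping indefinitely.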

Variant selection per task. Running every request through Pro is wasteful. The practical pattern is to route simple tasks to Standard, flag complex reasoning tasks to Thinking, and reserve Pro for genuinely hard problems. Model routing — dynamically selecting the right variant based on task complexity — becomes a first-class infrastructure concern.
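A router of this kind can start as a few explicit rules. The sketch below assumes tasks arrive already tagged with two flags. The variant labels are placeholders rather than confirmed API model identifiers, and `Task` and `choose_variant` are hypothetical names.

```python
from dataclasses import dataclass

# Placeholder labels; the real API model identifiers may differ.
STANDARD, THINKING, PRO = "standard", "thinking", "pro"


@dataclass
class Task:
    prompt: str
    needs_multistep_reasoning: bool = False
    high_stakes: bool = False


def choose_variant(task: Task) -> str:
    """Pick the cheapest variant that plausibly fits the task."""
    if task.needs_multistep_reasoning and task.high_stakes:
        return PRO
    if task.needs_multistep_reasoning:
        return THINKING
    return STANDARD


print(choose_variant(Task("Summarise this email")))
print(choose_variant(Task("Prove this lemma", needs_multistep_reasoning=True)))
```

Deciding how the flags get set is the harder part. A cheap classifier or simple heuristics on the request could feed the same function.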

GPT-5.4 does not obsolete existing architectures overnight, but it moves the frontier in three dimensions simultaneously. Teams that have been waiting for computer use to become reliable, for context windows to eliminate chunking, or for coding to be native rather than bolted on now have a single model that delivers all three.
