OpenAI Launches GPT-5.4 With 1M Token Context, Native Computer Use, and Record Benchmarks

OpenAI has released GPT-5.4, its most capable foundation model to date, in three variants: standard, Thinking, and Pro. The launch is OpenAI's most significant model update since GPT-5.2, introducing a 1 million token context window, native computer-use capabilities, and substantially improved accuracy on professional benchmarks.

Key Highlights

1 million token context window, the largest OpenAI has ever offered, enabling analysis of entire codebases and extended agent workflows
Native computer use that scores 75% on OSWorld-Verified, surpassing the 72.4% human baseline
33% fewer false claims and 18% fewer overall errors compared to GPT-5.2
83% on GDPval, a new record across 44 professional task categories
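
To give a sense of scale, the Python sketch below estimates whether a codebase would fit in a 1 million token window by counting characters and dividing by an assumed average of four characters per token. That ratio is a rough rule of thumb, not a property of GPT-5.4's tokenizer, and real counts vary with language and code style. The helper names and the optional output reserve are illustrative assumptions.

```python
import math
from pathlib import Path

CONTEXT_WINDOW = 1_000_000  # GPT-5.4 context window, in tokens
CHARS_PER_TOKEN = 4         # assumption: rough average; varies by tokenizer and content


def estimate_tokens(num_chars: int) -> int:
    """Approximate token count from a character count, rounded up."""
    return math.ceil(num_chars / CHARS_PER_TOKEN)


def count_source_chars(root: str, suffixes: tuple[str, ...] = (".py",)) -> int:
    """Total characters across files under `root` whose suffix is in `suffixes`."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            total += len(path.read_text(encoding="utf-8", errors="ignore"))
    return total


def fits_in_context(num_chars: int, reserve_for_output: int = 0) -> bool:
    """True if the estimated prompt plus reserved tokens fits the window.

    Assumption: output tokens share the same context window, so
    `reserve_for_output` sets aside room for the model's response.
    """
    return estimate_tokens(num_chars) + reserve_for_output <= CONTEXT_WINDOW
```

For example, fits_in_context(count_source_chars("my_repo", (".py", ".ts")), reserve_for_output=50_000) would check a hypothetical my_repo directory while leaving room for a 50,000-token response. For an exact count, the text would need to go through the model's actual tokenizer.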

Three Variants, One Architecture

GPT-5.4 ships in three flavors targeting different use cases:

GPT-5.4 Thinking replaces GPT-5.2 Thinking as the default reasoning model in ChatGPT. It brings improvements in six areas: coding and tool use, visual processing, agent workflows, token efficiency, web search synthesis, and business document automation. It is rolling out to Plus, Team, and Pro subscribers, and GPT-5.2 Thinking will be retired in three months.

GPT-5.4 Pro is the high-performance variant optimized for the most demanding workloads, priced at $30 per million input tokens and $180 per million output tokens.

GPT-5.4 standard serves as the general-purpose API model at $2.50 per million input tokens and $15 per million output tokens, with cached input at just $0.25 per million tokens.
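
To make the per-token pricing concrete, here is a minimal Python sketch that estimates the cost of a single GPT-5.4 standard API call from the published rates. It assumes cached input tokens are billed at the $0.25 rate and all other input tokens at $2.50; the function name estimate_cost_usd and the sample token counts are illustrative, not part of any OpenAI SDK.

```python
# Published GPT-5.4 standard rates, in USD per million tokens.
INPUT_PER_M = 2.50
CACHED_INPUT_PER_M = 0.25
OUTPUT_PER_M = 15.00


def estimate_cost_usd(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the cost of one GPT-5.4 standard call.

    Assumption: `cached_tokens` is the subset of `input_tokens` served from
    the prompt cache and billed at the cached rate; the remainder is billed
    at the normal input rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached = input_tokens - cached_tokens
    cost = (
        uncached * INPUT_PER_M
        + cached_tokens * CACHED_INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    ) / 1_000_000
    return round(cost, 6)


# Hypothetical call: a 200,000-token prompt, 150,000 tokens of it cached,
# producing a 4,000-token answer.
print(estimate_cost_usd(200_000, 4_000, cached_tokens=150_000))
```

This prints 0.2225. Without caching, the same call would cost $0.56, so in this example the cached rate cuts the input portion of the bill from $0.50 to $0.1625.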

Computer Use Goes Mainstream

GPT-5.4 is the first general-purpose OpenAI model with built-in computer-use capabilities. Agents can now control mouse and keyboard input, navigate desktop applications, and execute multi-step workflows across software, all without external tooling.

On OSWorld-Verified, which tests desktop navigation tasks, GPT-5.4 achieved a 75% success rate, up 27.7 percentage points from GPT-5.2's 47.3% and 2.6 points above the 72.4% human baseline. On WebArena-Verified, which covers browser-based tasks, it scored 67.3%.

Benchmark Performance

The model sets new records across professional and technical benchmarks:

Benchmark	GPT-5.4	GPT-5.2	Notes
GDPval (knowledge work)	83.0%	—	Record across 44 professions
OSWorld-Verified	75.0%	47.3%	Surpasses 72.4% human baseline
SWE-Bench Pro	57.7%	56.8%	Software engineering tasks
MMMU-Pro (visual)	81.2%	—	Visual understanding
Spreadsheet tasks	87.3%	—	Business automation

GPT-5.4 also claims the top position on Mercor's APEX-Agents benchmark, designed to evaluate agents on sustained professional tasks across investment banking, consulting, and corporate law.
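
The headline gains are easier to compare as percentage-point differences. This short Python sketch computes them from the benchmark table's figures, skipping rows where the article gives no GPT-5.2 score; the dictionary layout and the point_gain function are illustrative.

```python
# Scores from the benchmark table, in percent. None marks a GPT-5.2
# score the article does not report.
SCORES = {
    "GDPval": (83.0, None),
    "OSWorld-Verified": (75.0, 47.3),
    "SWE-Bench Pro": (57.7, 56.8),
    "MMMU-Pro": (81.2, None),
    "Spreadsheet tasks": (87.3, None),
}
HUMAN_BASELINE_OSWORLD = 72.4


def point_gain(new: float, old: float) -> float:
    """Absolute improvement in percentage points, rounded to one decimal."""
    return round(new - old, 1)


for name, (gpt54, gpt52) in SCORES.items():
    if gpt52 is None:
        continue  # no GPT-5.2 baseline to compare against
    print(f"{name}: +{point_gain(gpt54, gpt52)} points over GPT-5.2")

osworld = SCORES["OSWorld-Verified"][0]
print(f"OSWorld-Verified margin over humans: +{point_gain(osworld, HUMAN_BASELINE_OSWORLD)} points")
```

Run as-is, it reports a 27.7-point gain on OSWorld-Verified, a 0.9-point gain on SWE-Bench Pro, and a 2.6-point margin over the human baseline, which shows how much more the update moved desktop navigation than software engineering on these two benchmarks.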

Token Efficiency

Beyond raw performance, OpenAI emphasized efficiency gains. On the MCP Atlas benchmark, GPT-5.4 used 47% fewer tokens while maintaining accuracy, a critical improvement for cost-conscious API users. In Codex, the model supports a /fast mode that delivers up to 1.5x faster token generation.
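
As a rough illustration of what a 47% token reduction means for an API bill, the sketch below applies it to a hypothetical agent workload at GPT-5.4 standard rates. It assumes, as a simplification, that the reduction applies equally to input and output tokens; the figure is reported only for MCP Atlas, so savings on other workloads may differ. The workload size and function name are illustrative.

```python
INPUT_PER_M = 2.50      # USD per million input tokens (GPT-5.4 standard)
OUTPUT_PER_M = 15.00    # USD per million output tokens (GPT-5.4 standard)
TOKEN_REDUCTION = 0.47  # reported MCP Atlas reduction in token usage


def workload_cost(input_tokens: float, output_tokens: float) -> float:
    """Cost in USD at GPT-5.4 standard list prices, with no caching."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000


# Hypothetical baseline: an agent run consuming 10M input and 1M output tokens.
baseline_in, baseline_out = 10_000_000, 1_000_000
before = workload_cost(baseline_in, baseline_out)
# Assumption: the same 47% reduction applies to both input and output tokens.
after = workload_cost(baseline_in * (1 - TOKEN_REDUCTION),
                      baseline_out * (1 - TOKEN_REDUCTION))
print(f"before: ${before:.2f}, after: ${after:.2f}")
```

Because the reduction scales every billed token, the bill shrinks by the same 47% regardless of the input/output mix, from $40.00 to $21.20 in this example.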

Competitive Landscape

The release directly targets Anthropic's Claude Opus 4.6, which currently leads in coding and agentic workflows. At $2.50/$15 per million tokens (input/output), GPT-5.4 standard undercuts Claude Opus 4.6's $5/$25 pricing, though the Pro variant, at $30/$180, costs 6x Opus 4.6's input rate and 7.2x its output rate.
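
To see how the list prices compare on a concrete job, the sketch below prices a hypothetical workload of 2 million input tokens and 500,000 output tokens on each model, using the per-million-token rates quoted in the article and ignoring caching and any other discounts. The workload size and the Pricing class are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    name: str
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """List-price cost in USD, ignoring caching and any other discounts."""
        return (input_tokens * self.input_per_m
                + output_tokens * self.output_per_m) / 1_000_000


# List prices as quoted in the article.
MODELS = [
    Pricing("GPT-5.4 standard", 2.50, 15.00),
    Pricing("Claude Opus 4.6", 5.00, 25.00),
    Pricing("GPT-5.4 Pro", 30.00, 180.00),
]

# Hypothetical workload: 2M input tokens, 500k output tokens.
for m in MODELS:
    print(f"{m.name}: ${m.cost(2_000_000, 500_000):.2f}")
```

This prints $12.50 for GPT-5.4 standard, $22.50 for Claude Opus 4.6, and $150.00 for GPT-5.4 Pro: at this mix, standard costs about 56% as much as Opus, while Pro costs roughly 6.7 times as much.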

The 1 million token context window matches what Anthropic and Google have offered, closing a gap that had put OpenAI at a disadvantage for long-context workloads.

What's Next

GPT-5.4 Thinking is rolling out gradually in ChatGPT and Codex. The API is available immediately for developers. OpenAI has signaled that computer-use capabilities will expand further, with tighter integration into enterprise workflows and autonomous agent platforms.

Source: TechCrunch