The AI updates aren’t slowing down. Just two days after OpenAI launched a new underlying AI model for ChatGPT called GPT-5.3 Instant, the company has unveiled another, even more significant upgrade: GPT-5.4.
Actually, GPT-5.4 comes in two varieties: GPT-5.4 Thinking and GPT-5.4 Pro, the latter designed for the most complex tasks.
Both will be available in OpenAI’s paid application programming interface (API) and its Codex software development application. GPT-5.4 Thinking will be available to all paid ChatGPT subscribers (Plus, the $20-per-month plan, and up), while GPT-5.4 Pro will be reserved for ChatGPT Pro ($200 per month) and Enterprise plan users.
ChatGPT Free users will also get a taste of GPT-5.4, but only when their queries are auto-routed to the model, according to an OpenAI spokesperson.
The big headlines of this release are efficiency, with OpenAI reporting that GPT-5.4 uses far fewer tokens than its predecessors (47% fewer on some tasks), and, arguably even more impressive, a new “native” Computer Use mode, available through the API and Codex, that lets GPT-5.4 navigate a user’s computer like a human and work across applications.
The company is also releasing a new suite of ChatGPT integrations that plug GPT-5.4 directly into users’ Microsoft Excel and Google Sheets spreadsheets and cells, enabling granular analysis and automated task completion that should speed up work across the enterprise. It may also sharpen fears of white-collar layoffs, coming on the heels of similar offerings from Anthropic’s Claude and its new Cowork application.
OpenAI says GPT-5.4 supports up to 1 million tokens of context in the API and Codex, enabling agents to plan, execute, and verify tasks across long horizons. However, it charges double per 1 million input tokens once the input exceeds 272,000 tokens.
Native computer use: a step toward autonomous workflows
The most consequential capability OpenAI highlights is that GPT-5.4 is its first general-purpose model released with native, state-of-the-art computer-use capabilities in Codex and the API, enabling agents to operate computers and carry out multi-step workflows across applications.
OpenAI says the model can both write code to operate computers via libraries like Playwright and issue mouse and keyboard commands in response to screenshots. OpenAI also claims a jump in agentic web browsing.
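The screenshot-driven mode OpenAI describes follows a familiar agent pattern: capture the screen, have the model choose an action, execute it with mouse or keyboard, and repeat. Below is a minimal, self-contained sketch of that observe-act loop; names like `capture_screen` and `choose_action` are placeholders standing in for real screen capture and model calls, not OpenAI API functions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str        # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def run_agent(capture_screen: Callable[[], bytes],
              choose_action: Callable[[bytes], Action],
              execute: Callable[[Action], None],
              max_steps: int = 10) -> int:
    """Loop: screenshot -> model decision -> mouse/keyboard action.

    Returns the number of steps taken before the model signaled "done".
    """
    for step in range(1, max_steps + 1):
        screenshot = capture_screen()
        action = choose_action(screenshot)
        if action.kind == "done":
            return step
        execute(action)
    return max_steps

# Toy run with stubbed callbacks: the "model" clicks once, then reports done.
script = iter([Action("click", 120, 45), Action("done")])
steps = run_agent(lambda: b"png-bytes",
                  lambda shot: next(script),
                  lambda a: None)
print(steps)  # 2
```

In a real deployment, `execute` would drive a browser via Playwright or issue OS-level input events, and `choose_action` would be a model call with the screenshot attached.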
Benchmark results are presented as evidence that this is not merely a UI wrapper.
On BrowseComp, which measures how well AI agents can persistently browse the web to find hard-to-locate information, OpenAI reports GPT-5.4 improving by 17% absolute over GPT-5.2, and GPT-5.4 Pro reaching 89.3%, described as a new state of the art.
On OSWorld-Verified, which measures desktop navigation using screenshots plus keyboard and mouse actions, OpenAI reports GPT-5.4 at 75.0% success, compared to 47.3% for GPT-5.2, and notes reported human performance at 72.4%.
On WebArena-Verified, GPT-5.4 reaches 67.3% success using both DOM- and screenshot-driven interaction, compared to 65.4% for GPT-5.2. On Online-Mind2Web, OpenAI reports 92.8% success using screenshot-based observations alone.
OpenAI also links computer use to improvements in vision and document handling. On MMMU-Pro, GPT-5.4 reaches 81.2% success without tool use, compared with 79.5% for GPT-5.2, and OpenAI says it achieves that result using a fraction of the “thinking tokens.”
On OmniDocBench, GPT-5.4’s average error is reported at 0.109, improved from 0.140 for GPT-5.2. The post also describes expanded support for high-fidelity image inputs, including an “original” detail level up to 10.24M pixels.
OpenAI positions GPT-5.4 as built for longer, multi-step workflows—work that increasingly looks like an agent keeping state across many actions rather than a chatbot responding once.
Tool search and improved tool orchestration
As tool ecosystems get larger, OpenAI argues that the naive approach—dumping every tool definition into the prompt—creates a tax paid on every request: cost, latency, and context pollution.
GPT-5.4 introduces tool search in the API as a structural fix. Instead of receiving all tool definitions upfront, the model receives a lightweight list of tools plus a search capability, and it retrieves full tool definitions only when they’re actually needed.
OpenAI describes the efficiency win with a concrete comparison: on 250 tasks from Scale’s MCP Atlas benchmark, running with 36 MCP servers enabled, the tool-search configuration reduced total token usage by 47% while achieving the same accuracy as a configuration that exposed all MCP functions directly in context.
That 47% figure is specifically about the tool-search setup in that evaluation—not a blanket claim that GPT-5.4 uses 47% fewer tokens for every kind of task.
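The intuition behind tool search can be illustrated with a rough token accounting. The sketch below is hypothetical, not OpenAI's implementation: it compares injecting every tool definition upfront against listing only tool names and fetching full definitions on demand, using a crude 4-characters-per-token proxy.

```python
# Hypothetical illustration of "tool search": rather than putting every
# tool's full schema into the prompt, the model sees a lightweight name
# list and retrieves full definitions only for tools it actually needs.
# Token counts are rough proxies (~4 chars per token), not OpenAI's math.

TOOLS = {
    f"tool_{i}": {"name": f"tool_{i}",
                  "description": "x" * 400,   # stand-in for a long schema
                  "parameters": {"arg": "string"}}
    for i in range(36)                        # e.g. 36 MCP servers enabled
}

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def upfront_cost() -> int:
    """Every tool definition injected into every request."""
    return sum(approx_tokens(str(d)) for d in TOOLS.values())

def search_cost(needed: list[str]) -> int:
    """Name list upfront, full definitions fetched only when needed."""
    listing = sum(approx_tokens(name) for name in TOOLS)
    fetched = sum(approx_tokens(str(TOOLS[n])) for n in needed)
    return listing + fetched

full = upfront_cost()
lazy = search_cost(["tool_3", "tool_17"])
print(full, lazy, lazy < full)
```

When only a handful of the 36 tools are relevant to a task, the lazy configuration's context footprint is a small fraction of the upfront one, which is the mechanism behind the reported 47% savings.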
Improvements for developers and coding workflows
OpenAI’s coding pitch is that GPT-5.4 combines the coding strengths of GPT-5.3-Codex with stronger tool and computer-use capabilities that matter when tasks aren’t single-shot.
GPT-5.4 matches or outperforms GPT-5.3-Codex on SWE-Bench Pro while being lower latency across reasoning efforts.
Codex also gets workflow-level knobs. OpenAI says /fast mode delivers up to 1.5× faster performance across supported models, including GPT-5.4, describing it as the same model and intelligence “just faster.”
And it describes releasing an experimental Codex skill, “Playwright (Interactive)”, meant to demonstrate how coding and computer use can work in tandem—visually debugging web and Electron apps and testing an app as it’s being built.
OpenAI for Microsoft Excel and Google Sheets
Alongside GPT-5.4, OpenAI is announcing a suite of secure AI products in ChatGPT built for enterprises and financial institutions, powered by GPT-5.4 for advanced financial reasoning and Excel-based modeling.
The centerpiece is ChatGPT for Excel and Google Sheets (beta), which OpenAI describes as ChatGPT embedded directly in spreadsheets to build, analyze, and update complex financial models using the formulas and structures teams already rely on.
The suite also includes new ChatGPT app integrations intended to unify market, company, and internal data into a single workflow, naming FactSet, MSCI, Third Bridge, and Moody’s.
And it introduces reusable “Skills” for recurring finance work such as earnings previews, comparables analysis, DCF analysis, and investment memo drafting.
OpenAI anchors the finance push with an internal benchmark claim: model performance increased from 43.7% with GPT-5 to 88.0% with GPT-5.4 Thinking on an OpenAI internal investment banking benchmark.
Measuring AI performance against professional work
OpenAI leans on benchmarks intended to resemble real office deliverables, not just puzzle-solving. On GDPval, an evaluation spanning “well-specified knowledge work” across 44 occupations, OpenAI reports that GPT-5.4 matches or exceeds industry professionals in 83.0% of comparisons, compared to 71.0% for GPT-5.2.
The company also highlights specific improvements in the kinds of artifacts that tend to expose model weaknesses: structured tables, formulas, narrative coherence, and design quality.
In an internal benchmark of spreadsheet modeling tasks modeled after what a junior investment banking analyst might do, GPT-5.4 reaches a mean score of 87.5%, compared to 68.4% for GPT-5.2.
And on a set of presentation evaluation prompts, OpenAI says human raters preferred GPT-5.4’s presentations 68.0% of the time over GPT-5.2’s, citing stronger aesthetics, greater visual variety, and more effective use of image generation.
Improving reliability and reducing hallucinations
OpenAI describes GPT-5.4 as its most factual model yet and connects that claim to a practical dataset: de-identified prompts where users previously flagged factual errors. On that set, OpenAI reports GPT-5.4’s individual claims are 33% less likely to be false and its full responses are 18% less likely to contain any errors compared to GPT-5.2.
In statements provided to VentureBeat by OpenAI and attributed to early GPT-5.4 testers, Daniel Swiecki of Walleye Capital says that on internal finance and Excel evaluations, GPT-5.4 improved accuracy by 30 percentage points, which he links to expanded automation for model updates and scenario analysis.
Brendan Foody, CEO of Mercor, calls GPT-5.4 the best model the company has tried and says it’s now top of Mercor’s APEX-Agents benchmark for professional services work, emphasizing long-horizon deliverables like slide decks, financial models, and legal analysis.
Pricing and availability
In the API, OpenAI says GPT-5.4 Thinking is available as gpt-5.4 and GPT-5.4 Pro as gpt-5.4-pro. Pricing is as follows:
- GPT-5.4: $2.50 / 1M input tokens; $15 / 1M output tokens
- GPT-5.4 Pro: $30 / 1M input tokens; $180 / 1M output tokens
- Batch + Flex: half rate; Priority processing: 2× rate
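The listed rates translate into per-request costs straightforwardly. Here is a back-of-envelope calculator using only the prices above; the tier multipliers follow the stated half-rate for Batch/Flex and 2× for Priority.

```python
# Cost estimate from the listed API prices (USD per 1M tokens).
PRICES = {
    "gpt-5.4":     {"input": 2.50,  "output": 15.00},
    "gpt-5.4-pro": {"input": 30.00, "output": 180.00},
}

def cost(model: str, input_tokens: int, output_tokens: int,
         tier: str = "standard") -> float:
    """tier: 'standard', 'batch'/'flex' (half rate), 'priority' (2x rate)."""
    mult = {"standard": 1.0, "batch": 0.5, "flex": 0.5, "priority": 2.0}[tier]
    p = PRICES[model]
    return mult * (input_tokens / 1e6 * p["input"]
                   + output_tokens / 1e6 * p["output"])

# 100k input / 10k output on GPT-5.4: 0.1 * $2.50 + 0.01 * $15 = $0.40
print(round(cost("gpt-5.4", 100_000, 10_000), 2))  # 0.4
```

The same call on GPT-5.4 Pro would run $4.80, a 12× premium, which is why OpenAI positions Pro for only the most complex tasks.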
This makes GPT-5.4 among the more expensive models to run via API relative to the wider field, as seen in the table below.
| Model | Input | Output | Total Cost | Source |
| --- | --- | --- | --- | --- |
| Qwen 3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud |
| Qwen3.5-Flash | $0.10 | $0.40 | $0.50 | Alibaba Cloud |
| deepseek-chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| deepseek-reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| MiniMax M2.5 | $0.15 | $1.20 | $1.35 | MiniMax |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | $1.75 | Google |
| MiniMax M2.5-Lightning | $0.30 | $2.40 | $2.70 | MiniMax |
| Gemini 3 Flash Preview | $0.50 | $3.00 | $3.50 | Google |
| Kimi-k2.5 | $0.60 | $3.00 | $3.60 | Moonshot |
| GLM-5 | $1.00 | $3.20 | $4.20 | Z.ai |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Baidu |
| Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | Anthropic |
| Qwen3-Max (2026-01-23) | $1.20 | $6.00 | $7.20 | Alibaba Cloud |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google |
| GPT-5.2 | $1.75 | $14.00 | $15.75 | OpenAI |
| GPT-5.4 | $2.50 | $15.00 | $17.50 | OpenAI |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $18.00 | Anthropic |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google |
| Claude Opus 4.6 | $5.00 | $25.00 | $30.00 | Anthropic |
| GPT-5.2 Pro | $21.00 | $168.00 | $189.00 | OpenAI |
| GPT-5.4 Pro | $30.00 | $180.00 | $210.00 | OpenAI |
Another important note: with GPT-5.4, requests that exceed 272,000 input tokens are billed at 2X the normal rate, reflecting the ability to send prompts larger than earlier models supported.
In Codex, compaction defaults to 272k tokens, and the higher long-context pricing applies only when the input exceeds 272k. That means developers can keep sending prompts at or under that size without triggering the higher rate, but can opt into larger prompts by raising the compaction limit; only those larger requests are billed differently.
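The billing boundary is easy to express in code. This sketch assumes, per the description above, that the 2× rate applies to the whole request once the input exceeds 272,000 tokens (rather than only to the overage), using GPT-5.4's $2.50 base input price.

```python
# Long-context surcharge sketch: input at or under 272k tokens bills at
# the base rate; larger requests bill at 2x the base rate. Assumes the
# multiplier covers the whole request, as the article's wording suggests.

THRESHOLD = 272_000
BASE_INPUT_PER_M = 2.50  # GPT-5.4 input price, USD per 1M tokens

def input_cost(tokens: int) -> float:
    rate = BASE_INPUT_PER_M * (2 if tokens > THRESHOLD else 1)
    return tokens / 1e6 * rate

print(round(input_cost(272_000), 3))  # 0.68  (base rate, at the limit)
print(round(input_cost(500_000), 3))  # 2.5   (doubled rate)
```

Staying at or below the default compaction limit thus keeps input billing linear, while a 500k-token prompt costs nearly four times a 272k-token one.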
An OpenAI spokesperson said that in the API the maximum output is 128,000 tokens, the same as previous models.
Finally, on why GPT-5.4 is priced higher at baseline, the spokesperson attributed it to three factors: higher capability on complex tasks (including coding, computer use, deep research, advanced document generation, and tool use), major research improvements from OpenAI’s roadmap, and more efficient reasoning that uses fewer reasoning tokens for comparable tasks—adding that OpenAI believes GPT-5.4 remains below comparable frontier models on pricing even with the increase.
The broader shift
Across the release and the follow-up clarifications, GPT-5.4 is positioned as a model meant to move beyond “answer generation” and into sustained professional workflows—ones that require tool orchestration, computer interaction, long context, and outputs that look like the artifacts people actually use at work.
OpenAI’s emphasis on token efficiency, tool search, native computer use, and reduced user-flagged factual errors all point in the same direction: making agentic systems more viable in production by lowering the cost of retries—whether that retry is a human re-prompting, an agent calling another tool, or a workflow re-running because the first pass didn’t stick.