GPT-5.4: Features, benchmarks, and tradeoffs
OpenAI's GPT-5.4 is designed around a single premise: instead of maintaining separate models optimized for coding, reasoning, and agentic tasks, build one model that handles all of them at a high level. Previous releases like GPT-5.3-Codex were exceptional at code generation but less suited for knowledge work and web research. GPT-5.4 attempts to close that gap by merging the coding strengths of Codex with the broader capabilities of models like GPT-5.2.
The result is a model positioned as a general-purpose workhorse for complex, multi-step tasks that require coding, reasoning, web search, and real-world tool use in combination.
Core features
Native computer use and vision
GPT-5.4 is OpenAI's first general-purpose model with native computer-use capabilities. Rather than only generating code, it can execute that code, interact with running software, and use a mouse and keyboard in a simulated environment. It processes screenshots and responds to visual feedback, which means it can act as both developer and QA tester within a single session, writing code, launching a browser, inspecting the result, identifying issues, and revising accordingly.
Tool search
Providing every tool definition upfront in a prompt has always been expensive in token terms. GPT-5.4 introduces tool search, where the model receives a lightweight index of available tools and fetches a tool's full definition only when it needs it. This keeps the context window leaner and reduces costs for applications that expose a large number of tools or APIs.
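The mechanic can be sketched in a few lines. This is a hypothetical illustration of the pattern, not OpenAI's actual tool-search interface; the tool names and schema shapes are invented for the example:

```python
# A short index is all the model sees up front; full schemas stay out of
# the context window until a tool is actually needed.
TOOL_INDEX = {
    "get_weather": "Look up current weather for a city",
    "search_flights": "Search airline schedules and fares",
    "create_invoice": "Generate a PDF invoice for a customer",
}

# Full definitions live outside the prompt and are fetched on demand.
FULL_DEFINITIONS = {
    "get_weather": {
        "name": "get_weather",
        "parameters": {"city": "string", "units": "celsius|fahrenheit"},
        "description": "Look up current weather for a city",
    },
    # ...full schemas for the remaining tools would live here...
}

def build_system_prompt():
    """Only one short line per tool goes into the context up front."""
    lines = [f"- {name}: {summary}" for name, summary in TOOL_INDEX.items()]
    return "Available tools (fetch full definition before use):\n" + "\n".join(lines)

def fetch_tool_definition(name):
    """Called only when the model decides it needs a specific tool."""
    return FULL_DEFINITIONS[name]
```

The saving scales with the number of tools: an index line might cost a dozen tokens where a full JSON schema costs hundreds.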
Steering
Steering allows a user to intervene mid-generation when the model's output is heading in an unwanted direction. Rather than stopping and restarting, the user can provide corrective input and the model adjusts its trajectory from that point. This makes longer agentic sessions more collaborative and less wasteful when the model makes an early wrong turn.
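The idea can be modeled as a generation loop that checks for user input between steps. This is a toy sketch of the concept, not GPT-5.4's actual mechanism; `model_step` and the queue-based correction channel are invented for illustration:

```python
import queue

def generate_with_steering(model_step, corrections, max_steps=50):
    """Toy generation loop. model_step(context) returns the next chunk, or
    None when finished. Corrections dropped into the queue mid-run are
    spliced into the context, so later steps adjust course instead of the
    whole generation being restarted."""
    context = []
    for _ in range(max_steps):
        try:
            context.append(("user", corrections.get_nowait()))
        except queue.Empty:
            pass  # no correction this step; continue on the current path
        chunk = model_step(context)
        if chunk is None:
            break
        context.append(("model", chunk))
    return context
```

The key property is that everything generated before the correction is kept, which is what makes long agentic sessions less wasteful after an early wrong turn.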
Fast mode
Standard GPT-5.4 trades speed for capability. Fast mode offers the same intelligence at up to 1.5 times the token generation speed, at double the plan usage cost. For applications where latency matters more than cost, this provides a practical middle ground.
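A quick back-of-envelope comparison makes the tradeoff concrete. The 1.5x speed and 2x cost multipliers come from the description above; the baseline throughput and per-token cost are placeholder inputs, not published figures:

```python
def fast_mode_tradeoff(output_tokens, base_tps, base_cost_per_token):
    """Compare standard vs. fast mode for one response, assuming fast mode
    is 1.5x the generation speed at 2x the usage cost (per the figures
    above). base_tps and base_cost_per_token are caller-supplied estimates."""
    standard = {"seconds": output_tokens / base_tps,
                "cost": output_tokens * base_cost_per_token}
    fast = {"seconds": output_tokens / (base_tps * 1.5),
            "cost": output_tokens * base_cost_per_token * 2}
    return standard, fast
```

For a 1,500-token response at an assumed 50 tokens/second baseline, fast mode cuts the wait from 30 seconds to 20 while doubling the spend, which is worthwhile only when someone is watching the cursor blink.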
1 million token context window
The context window supports up to 1 million tokens, making it practical to work with entire codebases, lengthy research documents, or extended multi-turn conversations without losing earlier context. Any input beyond 272,000 tokens is billed at double the standard input rate.
Performance benchmarks
Third-party benchmarks from Artificial Analysis give a concrete picture of where GPT-5.4 stands relative to other frontier models.
On the Artificial Analysis Coding Index, a weighted average across multiple coding benchmarks, GPT-5.4 ranks first among all available models. It also takes the top spot on the Agentic Index, which evaluates a model's ability to complete complex multi-step tasks using tools. On the broader Intelligence Index, which aggregates ten different evaluations, GPT-5.4 scores high enough to be in a statistical tie with Gemini for the top position overall.
Agentic development: building a 3D scene from a single prompt
One of the clearest demonstrations of GPT-5.4's capabilities is using it to build a complex interactive application from a single detailed prompt with no further intervention. The example below uses its native computer-use feature alongside Playwright Interactive for browser-based QA and an Image Gen skill for asset generation, with the goal of producing a hyperrealistic, interactive 3D flyover of Tower Bridge in London.
Prompt structure
Effective prompts for this kind of task go beyond a simple objective. The prompt that produced the Tower Bridge scene covered several distinct areas:
- Tool declarations: explicitly naming Playwright Interactive and Image Gen so the model knows which capabilities to invoke
- Core objective: the specific scene and interaction goal ("fly around freely")
- Environmental detail: lighting, fog, the River Thames, surrounding landmarks like the Tower of London and HMS Belfast, traffic
- UX requirements: intuitive flight controls, multiple viewpoints, close-up structural passes
- Quality threshold: "high fidelity and smooth, almost like a photo" rather than blocky geometry
- Iteration permission: explicitly giving the model time to refine ("this might take an hour if needs be, iterate until perfect")
The level of specificity matters because the model uses these details to structure its planning before writing a single line of code.
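One way to keep that specificity manageable is to assemble the prompt from named sections. The section titles and paraphrased contents below are a hypothetical reconstruction for illustration, not the verbatim prompt from the session:

```python
def build_scene_prompt(sections):
    """Join (title, body) pairs into one structured prompt string."""
    return "\n\n".join(f"{title}:\n{body}" for title, body in sections)

# Paraphrased from the areas listed above; wording is illustrative.
TOWER_BRIDGE_SECTIONS = [
    ("Tools", "Use Playwright Interactive for browser QA and Image Gen for textures."),
    ("Objective", "An interactive 3D flyover of Tower Bridge; let me fly around freely."),
    ("Environment", "Lighting, fog, the River Thames, the Tower of London, HMS Belfast, traffic."),
    ("UX", "Intuitive flight controls, multiple viewpoints, close-up structural passes."),
    ("Quality", "High fidelity and smooth, almost like a photo."),
    ("Iteration", "Take up to an hour if needed; iterate until perfect."),
]
```

Keeping each concern in its own section also makes it easy to reuse the template with a different scene swapped into the Objective and Environment slots.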
Execution flow
Once the prompt is submitted, the model works through a sequence of steps autonomously. It first analyzes the task and confirms which tools it will use. It then checks its environment by running basic shell commands like pwd and ls, verifies that Node.js and npm are available, and confirms that the OPENAI_API_KEY environment variable is set so it can call the image generation skill.
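The same preflight checks the model performs can be expressed as a small helper. This is a sketch of the checks described above, not the model's internal procedure; the specific tool and variable lists are taken from the session description:

```python
import os
import shutil

def check_environment(tools=("node", "npm"), env_vars=("OPENAI_API_KEY",)):
    """Confirm the toolchain is on PATH and required environment variables
    are set before any code is written, mirroring the model's first steps."""
    report = {
        "tools": {t: shutil.which(t) is not None for t in tools},
        "env": {v: v in os.environ for v in env_vars},
    }
    report["ok"] = all(report["tools"].values()) and all(report["env"].values())
    return report
```

Failing fast here is the point: discovering a missing API key after scaffolding the whole project would waste most of the session.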
From there it plans the build, deciding to scaffold a Three.js application and assemble the bridge geometry, river, skyline, and flight camera as a unified scene. It generates texture assets via the image generation skill, installs dependencies, and writes the initial application code.
Iterative QA with Playwright
After the initial build, the model launches a headless Chrome browser using the Playwright Interactive skill, navigates to the local development server, and visually inspects the scene. It identifies issues such as background textures blending incorrectly, then returns to the relevant files (flight-controls.js, scene.js, etc.) to adjust exposure, fog, and image backplates. It then relaunches the browser and verifies the fix. This loop continues until the scene meets the quality bar set in the original prompt.
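Structurally, that cycle is a build-inspect-fix loop with a quality gate. The sketch below captures the shape of it; `render`, `inspect`, and `fix` are placeholders for the Playwright screenshot, the model's visual inspection, and the file edits respectively:

```python
def visual_qa_loop(render, inspect, fix, max_rounds=10):
    """Generic build-inspect-fix cycle. render() produces a snapshot
    (e.g. a screenshot), inspect() returns a list of issues, fix() edits
    the source. Returns the round on which the scene passed, or None if
    the quality bar was not met within the budget."""
    for round_num in range(1, max_rounds + 1):
        issues = inspect(render())
        if not issues:
            return round_num  # passed inspection on this round
        fix(issues)
    return None
```

A bounded round count matters in practice: without it, a quality bar the model cannot reach ("almost like a photo") would loop indefinitely.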
The full session, covering initial scaffolding through iterative visual QA to a finished interactive scene, runs approximately 90 minutes with minimal user input beyond the original prompt.
Tradeoffs
Speed and latency
GPT-5.4 is the slowest model on Artificial Analysis's benchmarks by a significant margin.
It has the longest time to first answer token and the longest end-to-end response time for a 500-token output. For tasks where agentic depth matters more than responsiveness, this is an acceptable tradeoff. For real-time or user-facing applications that need fast replies, it is a meaningful constraint. Fast mode reduces latency at double the cost, which may or may not be practical depending on usage volume.

Pricing
GPT-5.4 is priced at $2.50 per million input tokens and $15.00 per million output tokens.
The gpt-5.4-pro variant is substantially more expensive at $30 per million input tokens and $180 per million output tokens. Combined with the surcharge for context beyond 272,000 tokens, costs can escalate quickly for applications that make heavy use of the full context window or require the pro-tier model's capabilities.
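A small estimator shows how quickly this adds up. The rates and the 272,000-token surcharge threshold are the figures quoted above; treating the surcharge as applying identically at both tiers is an assumption of this sketch:

```python
def estimate_cost_usd(input_tokens, output_tokens, pro=False):
    """Estimate one request's cost from the quoted rates: $2.50/M input and
    $15/M output (standard), $30/M and $180/M (pro), with input beyond
    272,000 tokens billed at double the input rate. Assumes the surcharge
    applies the same way at both tiers."""
    in_rate = 30.00 if pro else 2.50
    out_rate = 180.00 if pro else 15.00
    standard_in = min(input_tokens, 272_000)
    surcharged_in = max(0, input_tokens - 272_000)
    return (standard_in * in_rate
            + surcharged_in * in_rate * 2
            + output_tokens * out_rate) / 1_000_000
```

Under these assumptions, filling the full 1 million token window costs about $4.32 in input alone at standard rates, and roughly $51.84 on the pro tier, before a single output token is generated.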
UI design aesthetic
When generating frontend UI, GPT-5.4 tends toward a particular visual style: frosted glass surfaces, gradient overlays, and layered card components. This aesthetic is coherent and modern but can feel repetitive across different projects. On DesignArena, a platform that benchmarks AI design output, GPT-5.4 does not rank among the top models for UI generation. Developers for whom visual design quality is a priority may find other models produce more varied or refined results.
Final thoughts
GPT-5.4 sets a new bar for coding and agentic performance. Its native computer-use capability changes what's possible in automated development workflows, allowing a single model to plan, build, test, and debug software with minimal human checkpoints. The 1 million token context window and tool search feature make it practical for large-scale projects that would have strained earlier architectures.
The tradeoffs are real: slow generation speed, premium pricing at the pro tier, and a design sensibility that can feel generic. Whether these matter depends on the use case. For complex, long-running agentic tasks where raw capability is the priority, GPT-5.4 is currently the strongest option available. For latency-sensitive or cost-constrained applications, it requires careful evaluation against faster and cheaper alternatives.