GPT-5.4: Features, benchmarks, and tradeoffs
OpenAI's GPT-5.4 is designed around a single premise: instead of maintaining separate models optimized for coding, reasoning, and agentic tasks, build one model that handles all of them at a high level. Previous releases like GPT-5.3-Codex were exceptional at code generation but less suited for knowledge work and web research. GPT-5.4 attempts to close that gap by merging the coding strengths of Codex with the broader capabilities of models like GPT-5.2.
The result is a model positioned as a general-purpose workhorse for complex, multi-step tasks that require coding, reasoning, web search, and real-world tool use in combination.
Core features
Native computer use and vision
GPT-5.4 is OpenAI's first general-purpose model with native computer-use capabilities. Rather than only generating code, it can execute that code, interact with running software, and use a mouse and keyboard in a simulated environment. It processes screenshots and responds to visual feedback, which means it can act as both developer and QA tester within a single session, writing code, launching a browser, inspecting the result, identifying issues, and revising accordingly.
Tool search
Providing every tool definition upfront in a prompt has always been expensive in token terms. GPT-5.4 introduces tool search, where the model receives a lightweight index of available tools and fetches a tool's full definition only when it needs it. This keeps the context window leaner and reduces costs for applications that expose a large number of tools or APIs.
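The mechanic can be sketched in a few lines. This is a hypothetical illustration of the pattern, not OpenAI's actual tool-search interface; the tool names and schema shapes are invented for the example:

```python
# A short index is all the model sees up front; full schemas stay out of
# the context window until a tool is actually needed.
TOOL_INDEX = {
    "get_weather": "Look up current weather for a city",
    "search_flights": "Search airline schedules and fares",
    "create_invoice": "Generate a PDF invoice for a customer",
}

# Full definitions live outside the prompt and are fetched on demand.
FULL_DEFINITIONS = {
    "get_weather": {
        "name": "get_weather",
        "parameters": {"city": "string", "units": "celsius|fahrenheit"},
        "description": "Look up current weather for a city",
    },
    # ...full schemas for the remaining tools would live here...
}

def build_system_prompt():
    """Only one short line per tool goes into the context up front."""
    lines = [f"- {name}: {summary}" for name, summary in TOOL_INDEX.items()]
    return "Available tools (fetch full definition before use):\n" + "\n".join(lines)

def fetch_tool_definition(name):
    """Called only when the model decides it needs a specific tool."""
    return FULL_DEFINITIONS[name]
```

The saving scales with the number of tools: an index line might cost a dozen tokens where a full JSON schema costs hundreds.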
Steering
Steering allows a user to intervene mid-generation when the model's output is heading in an unwanted direction. Rather than stopping and restarting, the user can provide corrective input and the model adjusts its trajectory from that point. This makes longer agentic sessions more collaborative and less wasteful when the model makes an early wrong turn.
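The idea can be modeled as a generation loop that checks for user input between steps. This is a toy sketch of the concept, not GPT-5.4's actual mechanism; `model_step` and the queue-based correction channel are invented for illustration:

```python
import queue

def generate_with_steering(model_step, corrections, max_steps=50):
    """Toy generation loop. model_step(context) returns the next chunk, or
    None when finished. Corrections dropped into the queue mid-run are
    spliced into the context, so later steps adjust course instead of the
    whole generation being restarted."""
    context = []
    for _ in range(max_steps):
        try:
            context.append(("user", corrections.get_nowait()))
        except queue.Empty:
            pass  # no correction this step; continue on the current path
        chunk = model_step(context)
        if chunk is None:
            break
        context.append(("model", chunk))
    return context
```

The key property is that everything generated before the correction is kept, which is what makes long agentic sessions less wasteful after an early wrong turn.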
Fast mode
Standard GPT-5.4 trades speed for capability. Fast mode offers the same intelligence at up to 1.5 times the token generation speed, at double the plan usage cost. For applications where latency matters more than cost, this provides a practical middle ground.
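A quick back-of-envelope comparison makes the tradeoff concrete. The 1.5x speed and 2x cost multipliers come from the description above; the baseline throughput and per-token cost are placeholder inputs, not published figures:

```python
def fast_mode_tradeoff(output_tokens, base_tps, base_cost_per_token):
    """Compare standard vs. fast mode for one response, assuming fast mode
    is 1.5x the generation speed at 2x the usage cost (per the figures
    above). base_tps and base_cost_per_token are caller-supplied estimates."""
    standard = {"seconds": output_tokens / base_tps,
                "cost": output_tokens * base_cost_per_token}
    fast = {"seconds": output_tokens / (base_tps * 1.5),
            "cost": output_tokens * base_cost_per_token * 2}
    return standard, fast
```

For a 1,500-token response at an assumed 50 tokens/second baseline, fast mode cuts the wait from 30 seconds to 20 while doubling the spend, which is worthwhile only when someone is watching the cursor blink.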
1 million token context window
The context window supports up to 1 million tokens, making it practical to work with entire codebases, lengthy research documents, or extended multi-turn conversations without losing earlier context. Any input beyond 272,000 tokens is billed at double the standard input rate.
Performance benchmarks
Third-party benchmarks from Artificial Analysis give a concrete picture of where GPT-5.4 stands relative to other frontier models.
On the Artificial Analysis Coding Index, a weighted average across multiple coding benchmarks, GPT-5.4 ranks first among all available models. It also takes the top spot on the Agentic Index, which evaluates a model's ability to complete complex multi-step tasks using tools. On the broader Intelligence Index, which aggregates ten different evaluations, GPT-5.4 scores high enough to be in a statistical tie with Gemini for the top position overall.
Agentic development: building a 3D scene from a single prompt
One of the clearest demonstrations of GPT-5.4's capabilities is using it to build a complex interactive application from a single detailed prompt with no further intervention. The example below uses its native computer-use feature alongside Playwright Interactive for browser-based QA and an Image Gen skill for asset generation, with the goal of producing a hyperrealistic, interactive 3D flyover of Tower Bridge in London.
Prompt structure
Effective prompts for this kind of task go beyond a simple objective. The prompt that produced the Tower Bridge scene covered several distinct areas:
- Tool declarations: explicitly naming Playwright Interactive and Image Gen so the model knows which capabilities to invoke
- Core objective: the specific scene and interaction goal ("fly around freely")
- Environmental detail: lighting, fog, the River Thames, surrounding landmarks like the Tower of London and HMS Belfast, traffic
- UX requirements: intuitive flight controls, multiple viewpoints, close-up structural passes
- Quality threshold: "high fidelity and smooth, almost like a photo" rather than blocky geometry
- Iteration permission: explicitly giving the model time to refine ("this might take an hour if needs be, iterate until perfect")
The level of specificity matters because the model uses these details to structure its planning before writing a single line of code.
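One way to keep that specificity manageable is to assemble the prompt from named sections. The section titles and paraphrased contents below are a hypothetical reconstruction for illustration, not the verbatim prompt from the session:

```python
def build_scene_prompt(sections):
    """Join (title, body) pairs into one structured prompt string."""
    return "\n\n".join(f"{title}:\n{body}" for title, body in sections)

# Paraphrased from the areas listed above; wording is illustrative.
TOWER_BRIDGE_SECTIONS = [
    ("Tools", "Use Playwright Interactive for browser QA and Image Gen for textures."),
    ("Objective", "An interactive 3D flyover of Tower Bridge; let me fly around freely."),
    ("Environment", "Lighting, fog, the River Thames, the Tower of London, HMS Belfast, traffic."),
    ("UX", "Intuitive flight controls, multiple viewpoints, close-up structural passes."),
    ("Quality", "High fidelity and smooth, almost like a photo."),
    ("Iteration", "Take up to an hour if needed; iterate until perfect."),
]
```

Keeping each concern in its own section also makes it easy to reuse the template with a different scene swapped into the Objective and Environment slots.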
Execution flow
Once the prompt is submitted, the model works through a sequence of steps autonomously. It first analyzes the task and confirms which tools it will use. It then checks its environment by running basic shell commands like pwd and ls, verifies that Node.js and npm are available, and confirms that the OPENAI_API_KEY environment variable is set so it can call the image generation skill.
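The same preflight checks the model performs can be expressed as a small helper. This is a sketch of the checks described above, not the model's internal procedure; the specific tool and variable lists are taken from the session description:

```python
import os
import shutil

def check_environment(tools=("node", "npm"), env_vars=("OPENAI_API_KEY",)):
    """Confirm the toolchain is on PATH and required environment variables
    are set before any code is written, mirroring the model's first steps."""
    report = {
        "tools": {t: shutil.which(t) is not None for t in tools},
        "env": {v: v in os.environ for v in env_vars},
    }
    report["ok"] = all(report["tools"].values()) and all(report["env"].values())
    return report
```

Failing fast here is the point: discovering a missing API key after scaffolding the whole project would waste most of the session.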
From there it plans the build, deciding to scaffold a Three.js application and assemble the bridge geometry, river, skyline, and flight camera as a unified scene. It generates texture assets via the image generation skill, installs dependencies, and writes the initial application code.
Iterative QA with Playwright
After the initial build, the model launches a headless Chrome browser using the Playwright Interactive skill, navigates to the local development server, and visually inspects the scene. It identifies issues such as background textures blending incorrectly, then returns to the relevant files (flight-controls.js, scene.js, etc.) to adjust exposure, fog, and image backplates. It then relaunches the browser and verifies the fix. This loop continues until the scene meets the quality bar set in the original prompt.
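Structurally, that cycle is a build-inspect-fix loop with a quality gate. The sketch below captures the shape of it; `render`, `inspect`, and `fix` are placeholders for the Playwright screenshot, the model's visual inspection, and the file edits respectively:

```python
def visual_qa_loop(render, inspect, fix, max_rounds=10):
    """Generic build-inspect-fix cycle. render() produces a snapshot
    (e.g. a screenshot), inspect() returns a list of issues, fix() edits
    the source. Returns the round on which the scene passed, or None if
    the quality bar was not met within the budget."""
    for round_num in range(1, max_rounds + 1):
        issues = inspect(render())
        if not issues:
            return round_num  # passed inspection on this round
        fix(issues)
    return None
```

A bounded round count matters in practice: without it, a quality bar the model cannot reach ("almost like a photo") would loop indefinitely.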
The full session, covering initial scaffolding through iterative visual QA to a finished interactive scene, runs approximately 90 minutes with minimal user input beyond the original prompt.
Tradeoffs
Speed and latency
GPT-5.4 is the slowest model on Artificial Analysis's benchmarks by a significant margin.
It has the longest time to first answer token and the longest end-to-end response time for a 500-token output. For tasks where agentic depth matters more than responsiveness, this is an acceptable tradeoff. For real-time or user-facing applications that need fast replies, it is a meaningful constraint. Fast mode reduces latency at double the cost, which may or may not be practical depending on usage volume.

Pricing
GPT-5.4 is priced at $2.50 per million input tokens and $15.00 per million output tokens.
The gpt-5.4-pro variant is substantially more expensive at $30 per million input tokens and $180 per million output tokens. Combined with the surcharge for context beyond 272,000 tokens, costs can escalate quickly for applications that make heavy use of the full context window or require the pro-tier model's capabilities.
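A small estimator shows how quickly this adds up. The rates and the 272,000-token surcharge threshold are the figures quoted above; treating the surcharge as applying identically at both tiers is an assumption of this sketch:

```python
def estimate_cost_usd(input_tokens, output_tokens, pro=False):
    """Estimate one request's cost from the quoted rates: $2.50/M input and
    $15/M output (standard), $30/M and $180/M (pro), with input beyond
    272,000 tokens billed at double the input rate. Assumes the surcharge
    applies the same way at both tiers."""
    in_rate = 30.00 if pro else 2.50
    out_rate = 180.00 if pro else 15.00
    standard_in = min(input_tokens, 272_000)
    surcharged_in = max(0, input_tokens - 272_000)
    return (standard_in * in_rate
            + surcharged_in * in_rate * 2
            + output_tokens * out_rate) / 1_000_000
```

Under these assumptions, filling the full 1 million token window costs about $4.32 in input alone at standard rates, and roughly $51.84 on the pro tier, before a single output token is generated.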
UI design aesthetic
When generating frontend UI, GPT-5.4 tends toward a particular visual style: frosted glass surfaces, gradient overlays, and layered card components. This aesthetic is coherent and modern but can feel repetitive across different projects. On DesignArena, a platform that benchmarks AI design output, GPT-5.4 does not rank among the top models for UI generation. Developers for whom visual design quality is a priority may find other models produce more varied or refined results.
Final thoughts
GPT-5.4 sets a new bar for coding and agentic performance. Its native computer-use capability changes what's possible in automated development workflows, allowing a single model to plan, build, test, and debug software with minimal human checkpoints. The 1 million token context window and tool search feature make it practical for large-scale projects that would have strained earlier architectures.
The tradeoffs are real: slow generation speed, premium pricing at the pro tier, and a design sensibility that can feel generic. Whether these matter depends on the use case. For complex, long-running agentic tasks where raw capability is the priority, GPT-5.4 is currently the strongest option available. For latency-sensitive or cost-constrained applications, it requires careful evaluation against faster and cheaper alternatives.