Claude Opus 4.7: Benchmarks, Tokenizer Changes, and Coding Performance

Stanley Ulili
Updated on April 17, 2026

Claude Opus 4.7 is Anthropic's latest release, with measurable improvements in agentic coding, visual reasoning, and UI generation. The upgrade also introduces changes that affect API costs: a new tokenizer that maps the same input to more tokens, and higher default thinking effort that produces more output tokens. Understanding both sides is important before migrating.

What changed

Agentic coding and instruction following

Official announcement page titled "Introducing Claude Opus 4.7"

The model shows substantial gains on coding benchmarks, particularly for multi-step agentic tasks where the model plans, writes code, and executes it autonomously. Alongside this, Opus 4.7 follows instructions more literally than previous versions. Prompts written for Opus 4.6 that relied on loose interpretation may produce unexpected results. Teams should review and re-tune existing prompts after migrating.

Vision and image support

The model accepts images up to 2,576 pixels on the long edge, more than three times the resolution limit of previous Claude models. This matters for computer-use agents reading dense application screenshots, data extraction from complex charts, and tasks requiring fine visual detail.
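Screenshots larger than the long-edge limit will be downscaled (or rejected) upstream, so it can be worth resizing locally before upload. A minimal sketch of the aspect-ratio math, using the 2,576 px figure from above (the function name and default are our own, not from Anthropic's SDK):

```python
def fit_long_edge(width: int, height: int, cap: int = 2576) -> tuple[int, int]:
    """Scale dimensions so the longer edge fits the model's 2,576 px limit.

    Preserves aspect ratio; returns the original size if already within the cap.
    """
    long_edge = max(width, height)
    if long_edge <= cap:
        return width, height
    scale = cap / long_edge
    return round(width * scale), round(height * scale)

# A 4K screenshot scaled down to fit the cap:
print(fit_long_edge(3840, 2160))  # (2576, 1449)
```

Resizing client-side also keeps the request payload small, which matters when a computer-use agent uploads screenshots on every step.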

Tokenizer change and cost implications

The per-token pricing is unchanged from Opus 4.6, but the new tokenizer maps the same input text to more tokens. According to Anthropic, the multiplier ranges from 1.0x to 1.35x depending on the content, meaning a prompt that previously used 10,000 input tokens could now use up to 13,500. Combined with increased thinking output at higher effort levels, the effective cost per task can be meaningfully higher despite the static listed price.
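When budgeting a migration, the worst case is easy to bound from the quoted multiplier range. A small sketch of that arithmetic (the function is illustrative, not an Anthropic API; real counts come from the token-counting endpoint):

```python
def effective_input_tokens(opus_46_tokens: int, multiplier: float) -> int:
    """Estimate an Opus 4.7 input token count from an Opus 4.6 count.

    Anthropic quotes a tokenizer multiplier of 1.0x-1.35x depending on content.
    """
    if not 1.0 <= multiplier <= 1.35:
        raise ValueError("multiplier outside the quoted 1.0-1.35 range")
    return round(opus_46_tokens * multiplier)

# The article's example prompt at the worst-case multiplier:
print(effective_input_tokens(10_000, 1.35))  # 13500
```

Since per-token prices are unchanged, multiplying your current monthly input-token volume by 1.35 gives a ceiling on the input-side cost increase before any change in thinking output.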

Effort levels

Opus 4.7 introduces an xhigh effort level in addition to the existing low, medium, high, and max settings. xhigh is the default in Claude Code.

Graph titled "Agentic coding performance by effort level" comparing score vs. total tokens used for Opus 4.7 and Opus 4.6 across low, medium, high, xhigh, and max settings

The graph shows that Opus 4.7 at high effort outperforms Opus 4.6 at max effort while using fewer tokens. For teams that were previously using maximum effort with Opus 4.6, switching to high with Opus 4.7 can produce better results at lower cost.
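One way to make that comparison concrete when running your own effort-level sweep is to normalize by quality, i.e. tokens spent per benchmark point. A sketch with placeholder numbers (the readings below are hypothetical, not the article's data):

```python
def cost_per_point(total_tokens: int, score: float) -> float:
    """Tokens spent per benchmark point -- lower means cheaper per unit of quality."""
    return total_tokens / score

# Hypothetical readings from an internal effort-level sweep:
runs = {
    "opus-4.6 max":  (450_000, 62.0),
    "opus-4.7 high": (380_000, 65.5),
}
for name, (tokens, score) in sorted(runs.items(),
                                    key=lambda kv: cost_per_point(*kv[1])):
    print(f"{name}: {cost_per_point(tokens, score):,.0f} tokens/point")
```

Ranking configurations this way surfaces the pattern the graph shows: a cheaper effort level on the newer model can dominate a more expensive one on the older model.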

Benchmark results

Comprehensive benchmark comparison table showing Opus 4.7 against Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Mythos Preview across multiple tasks

Opus 4.7 improves on Opus 4.6 across most categories:

  • SWE-bench Pro (agentic coding): 64.3% vs. 53.4% for Opus 4.6, 57.7% for GPT-5.4, and 54.2% for Gemini 3.1 Pro
  • Terminal-Bench 2.0 (agentic terminal coding): 69.4% vs. 65.4%
  • Humanity's Last Exam (multidisciplinary reasoning, no tools): 46.9% vs. 40.0%
  • Visual reasoning (no tools): 82.1% vs. 69.1%

The benchmark table also includes a "Mythos Preview" column for an unreleased future model. Mythos scores 77.8% on SWE-bench Pro, substantially above Opus 4.7, which positions Opus 4.7 as an intermediate release on Anthropic's roadmap.

Cybersecurity benchmark

Opus 4.7 scores slightly lower on cybersecurity vulnerability reproduction (73.1% vs. 73.8%) than Opus 4.6. Anthropic attributes this to new cyber safeguards being tested under an initiative called Project Glasswing, which automatically detects and blocks requests related to prohibited or high-risk cybersecurity uses.

Text from Anthropic's announcement explaining the new cyber safeguards being tested in Opus 4.7 and their connection to the eventual release of Mythos-class models

Cafe website test

A simple UI design test: "Create a cafe website, index.html only."

Opus 4.7 generated a single-file HTML page for "Maple & Bean" with a warm brown color palette, a hero section using a thematic background image from Unsplash, an "Our Story" section, a formatted menu, and a footer.

Full-page view of the "Maple & Bean" cafe website generated by Claude Opus 4.7 showing its clean design and warm color palette

Opus 4.6 produced a similar cafe site with a less polished hero section using a gradient rather than an image. Gemini 3.1 Pro's output was considered the strongest visually. GPT-5.4 produced the weakest result, with a generic card-heavy layout.

Personal finance dashboard test

A more demanding prompt asked for a complete personal finance management dashboard with multiple accounts, send/receive money functionality, budget management, and seeded example data, without authentication.

Opus 4.7 generated a working application in approximately 20 minutes from a single prompt:

  • Frontend: React, Vite, and TypeScript with a dark-mode UI including graphs, transaction lists, budget trackers, and financial goal progress bars
  • Backend: Express.js with in-memory data storage seeded from a file (not a persistent database)
  • Functionality: Interactive, with simulated money transfers and goal contributions that update the dashboard state in real time

Complete dark-mode personal finance dashboard generated by Claude Opus 4.7 showing net worth, cash flow, spending mix, and financial metrics

Opus 4.6 produced a functional light-mode dashboard that used SQLite for persistent storage, a more robust backend choice than the in-memory approach Opus 4.7 chose, though it pulled in outdated versions of React and React Router. Gemini 3.1 Pro required multiple follow-up prompts and produced a less complete result. GPT-5.4 also did not complete the application from a single prompt.

Summary

Opus 4.7 is a meaningful improvement for agentic coding and single-shot full-stack generation. The finance dashboard test in particular shows what the model can produce from one prompt: a working multi-page React application with an API backend and interactive state, in under 20 minutes.

The backend architectural choice (in-memory vs. persistent storage) in that test is worth noting: a more capable model does not always make more robust design decisions. For production use, reviewing and specifying architecture expectations in the prompt will still be necessary.

The cost increase from the tokenizer change and higher default effort is real. Comparing effort levels before settling on defaults for a new application is the most direct way to control this. The data shows that high effort in Opus 4.7 beats max effort in Opus 4.6 on coding tasks while using fewer total tokens, which is the key number to optimize against.

Current pricing and API documentation are available on the Anthropic documentation site.


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.