Claude Opus 4.7: Benchmarks, Tokenizer Changes, and Coding Performance
Claude Opus 4.7 is Anthropic's latest release, with measurable improvements in agentic coding, visual reasoning, and UI generation. The upgrade also introduces changes that affect API costs: a new tokenizer that maps the same input to more tokens, and higher default thinking effort that produces more output tokens. Understanding both sides is important before migrating.
What changed
Agentic coding and instruction following
The model shows substantial gains on coding benchmarks, particularly for multi-step agentic tasks where the model plans, writes code, and executes it autonomously. Alongside this, Opus 4.7 follows instructions more literally than previous versions. Prompts written for Opus 4.6 that relied on loose interpretation may produce unexpected results. Teams should review and re-tune existing prompts after migrating.
Vision and image support
The model accepts images up to 2,576 pixels on the long edge, more than three times the resolution limit of previous Claude models. This matters for computer-use agents reading dense application screenshots, data extraction from complex charts, and tasks requiring fine visual detail.
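Pipelines that capture screenshots larger than this limit will want to downscale before upload. A minimal sketch of the dimension math, assuming the 2,576-pixel cap applies to whichever side is longer; the function name and structure are illustrative, not part of any Anthropic SDK:

```python
MAX_LONG_EDGE = 2576  # Opus 4.7's stated long-edge limit

def fit_dimensions(width: int, height: int, limit: int = MAX_LONG_EDGE) -> tuple:
    """Return (w, h) scaled down so the longer side is at most `limit`.

    Images already within the limit are returned unchanged; aspect
    ratio is preserved when shrinking.
    """
    long_edge = max(width, height)
    if long_edge <= limit:
        return width, height
    scale = limit / long_edge
    return round(width * scale), round(height * scale)

# An ultrawide screenshot, twice the long-edge limit:
print(fit_dimensions(5152, 1000))  # (2576, 500)
```

Resizing client-side this way avoids any server-side downscaling and keeps the upload payload predictable.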
Tokenizer change and cost implications
The per-token pricing is unchanged from Opus 4.6, but the new tokenizer maps the same input text to more tokens. According to Anthropic, the multiplier ranges from 1.0x to 1.35x depending on the content, meaning a prompt that previously used 10,000 input tokens could now use up to 13,500. Combined with increased thinking output at higher effort levels, the effective cost per task can be meaningfully higher even though the listed per-token price has not moved.
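The multiplier range translates directly into a worst-case projection for existing prompts. A minimal sketch; the 1.0x to 1.35x bounds come from Anthropic's statement above, and everything else is illustrative:

```python
def effective_input_tokens(old_tokens: int, multiplier: float) -> int:
    """Project an Opus 4.7 input token count from an Opus 4.6 measurement.

    Anthropic states the new tokenizer's multiplier ranges from 1.0x to
    1.35x depending on the content, so values outside that range are
    rejected as likely measurement errors.
    """
    if not 1.0 <= multiplier <= 1.35:
        raise ValueError("multiplier outside the published 1.0-1.35 range")
    return round(old_tokens * multiplier)

# A prompt measured at 10,000 input tokens under the old tokenizer:
print(effective_input_tokens(10_000, 1.0))   # 10000 (best case: no change)
print(effective_input_tokens(10_000, 1.35))  # 13500 (worst case)
```

Budgeting against the 1.35x worst case gives a safe upper bound; measuring real traffic after migration gives the actual figure.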
Effort levels
Opus 4.7 introduces an xhigh effort level in addition to the existing low, medium, high, and max settings. xhigh is the default in Claude Code.
The graph shows that Opus 4.7 at high effort outperforms Opus 4.6 at max effort while using fewer tokens. For teams that were previously using maximum effort with Opus 4.6, switching to high with Opus 4.7 can produce better results at lower cost.
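The trade-off can be checked with back-of-envelope arithmetic. In the sketch below, the per-million-token prices and per-task token counts are hypothetical placeholders, not Anthropic's actual figures; the point is only that a drop in total output tokens at high effort can outweigh the tokenizer's input-side increase:

```python
def cost_per_task(input_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> float:
    """Dollar cost of one task at flat per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Hypothetical prices in $/M tokens -- substitute real values from the
# pricing page before drawing conclusions.
PRICE_IN, PRICE_OUT = 15.0, 75.0

# Illustrative token counts: the 4.7 input grows under the new tokenizer
# (here the worst-case 1.35x), but thinking output at `high` shrinks.
opus46_max  = cost_per_task(12_000, 30_000, PRICE_IN, PRICE_OUT)
opus47_high = cost_per_task(16_200, 22_000, PRICE_IN, PRICE_OUT)
print(f"4.6 max:  ${opus46_max:.2f}")
print(f"4.7 high: ${opus47_high:.2f}")
```

Because output tokens dominate the bill at these ratios, the effort setting moves cost far more than the tokenizer change does.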
Benchmark results
Opus 4.7 improves on Opus 4.6 across most categories:
- SWE-bench Pro (agentic coding): 64.3% vs. 53.4% for Opus 4.6, 57.7% for GPT-5.4, and 54.2% for Gemini 3.1 Pro
- Terminal-Bench 2.0 (agentic terminal coding): 69.4% vs. 65.4%
- Humanity's Last Exam (multidisciplinary reasoning, no tools): 46.9% vs. 40.0%
- Visual reasoning (no tools): 82.1% vs. 69.1%
The benchmark table also includes a "Mythos Preview" column for an unreleased future model. Mythos scores 77.8% on SWE-bench Pro, substantially above Opus 4.7. The column appears intended to position Opus 4.7 as an intermediate release on Anthropic's roadmap.
Cybersecurity benchmark
Opus 4.7 scores slightly lower on cybersecurity vulnerability reproduction (73.1% vs. 73.8%) than Opus 4.6. Anthropic attributes this to new cyber safeguards being tested under an initiative called Project Glasswing, which automatically detects and blocks requests related to prohibited or high-risk cybersecurity uses.
Cafe website test
A simple UI design test: "Create a cafe website, index.html only."
Opus 4.7 generated a single-file HTML page for "Maple & Bean" with a warm brown color palette, a hero section using a thematic background image from Unsplash, an "Our Story" section, a formatted menu, and a footer.
Opus 4.6 produced a similar cafe site with a less polished hero section using a gradient rather than an image. Gemini 3.1 Pro's output was considered the strongest visually. GPT-5.4 produced the weakest result, with a generic card-heavy layout.
Personal finance dashboard test
A more demanding prompt asked for a complete personal finance management dashboard with multiple accounts, send/receive money functionality, budget management, and seeded example data, without authentication.
Opus 4.7 generated a working application in approximately 20 minutes from a single prompt:
- Frontend: React, Vite, and TypeScript with a dark-mode UI including graphs, transaction lists, budget trackers, and financial goal progress bars
- Backend: Express.js with in-memory data storage seeded from a file (not a persistent database)
- Functionality: Interactive, with simulated money transfers and goal contributions that update the dashboard state in real time
Opus 4.6 produced a functional light-mode dashboard that used SQLite for persistent storage, a more robust backend choice than the in-memory approach Opus 4.7 took, but it pulled in outdated versions of React and React Router. Gemini 3.1 Pro required multiple follow-up prompts and produced a less complete result. GPT-5.4 also failed to complete the application from a single prompt.
Summary
Opus 4.7 is a meaningful improvement for agentic coding and single-shot full-stack generation. The finance dashboard test in particular shows what the model can produce from one prompt: a working multi-page React application with an API backend and interactive state, in under 20 minutes.
The backend architectural choice (in-memory vs. persistent storage) in that test is worth noting: a more capable model does not always make more robust design decisions. For production use, reviewing and specifying architecture expectations in the prompt will still be necessary.
The cost increase from the tokenizer change and higher default effort is real. Comparing effort levels before settling on defaults for a new application is the most direct way to control it. The data shows that high effort in Opus 4.7 beats max effort in Opus 4.6 on coding tasks while using fewer total tokens; that is the comparison to optimize against.
Current pricing and API documentation are available on the Anthropic documentation site.