Claude 4 Performance Comparison | Opus 4 & Sonnet 4 Benchmarks and Pricing
For developers trying to decide whether to run Opus 4 or Sonnet 4 from the Claude 4 generation, this article summarizes the capabilities of both models and the optimal way to use each. We cover the key benchmarks including SWE-bench, the 5x API pricing difference, and feature distinctions like hybrid reasoning and 7-hour continuous operation — organized so you can make confident decisions for coding, long-running batch jobs, and everyday use.
For coding use cases, Sonnet 4 is the top recommendation. It scores 72.7% on SWE-bench Verified — slightly edging out Opus 4 (72.5%) — while the API pricing is 1/5 on both input and output ($3/$15 vs $15/$75), making Sonnet 4 sufficient for most everyday coding assistance.
For autonomous tasks spanning thousands of steps or long-running batch jobs, Opus 4 truly shines. It offers the endurance to sustain up to 7 hours of continuous work and a Terminal-bench score of 43.2% for autonomous system operation, making it the right choice when you need to complete complex report generation or intricate workflows without interruption.
If you already have a Claude.ai paid plan, you can use both Opus 4 and Sonnet 4 without a separate API contract (subject to usage limits). The safe approach is to first try the behavior in the Web UI, then work out your pricing strategy once you reach a scale that requires API migration.
目次 (9)
- What Is Claude 4 — The Next-Generation Model Released in May 2025
- Claude Opus 4 Performance — Top-Tier Coding Benchmarks
- Claude Sonnet 4 Performance — Best Cost-Performance with 72.7% on SWE-bench
- Hybrid Reasoning — A Design That Balances Speed and Depth
- Claude 4 Pricing — Comparison of Opus 4 and Sonnet 4
- What Changed from Claude 3.7 — Comparison with the Previous Generation
- The Model Lineage After Claude 4 — 4.1 → 4.5 → 4.6 → 4.7
- How to Choose Claude 4 — Opus or Sonnet?
- Summary
What Is Claude 4 — The Next-Generation Model Released in May 2025
Claude 4 is the collective name for the model generation Anthropic released in May 2025, consisting of two models: Claude Opus 4 and Claude Sonnet 4.
The biggest change from the previous Claude 3.x generation is the adoption of "hybrid reasoning." The design takes a two-pronged approach — answering simple queries instantly while using Extended Thinking mode for complex questions, thinking deeply before responding — achieving both speed and accuracy.
Additionally, all Claude 4 generation models support parallel tool execution, making them well-suited for complex tasks that involve calling multiple external tools simultaneously. They have earned particularly high praise for coding assistance, autonomous workflow execution, and long-form analysis.
Claude Opus 4 Performance — Top-Tier Coding Benchmarks
Claude Opus 4 was evaluated as the world's best coding model at the time of its release. Let's look at the key benchmark numbers.
| Benchmark | Claude Opus 4 | What It Measures |
|---|---|---|
| SWE-bench Verified | 72.5% | Ability to autonomously resolve real-world software issues |
| Terminal-bench | 43.2% | System operation capability through terminal interactions |
SWE-bench is an industry-standard evaluation using actual bug fix and feature implementation tasks from GitHub. A score of 72.5% represents a top-tier result as of May 2025, demonstrating the ability to "read a problem and fix it" rather than just simple completion.
Terminal-bench measures task execution capability in CLI environments. It evaluates complex tasks combining script creation, command operations, and file management, and Opus 4's 43.2% confirms its high proficiency in autonomous system operation.
In addition, Opus 4 is designed to maintain up to 7 hours of continuous work, with the endurance to complete complex workflows spanning thousands of steps. It is also well-suited for batch processing and report generation that requires extended runtime.
Source: Introducing Claude 4 — Anthropic
Claude Sonnet 4 Performance — Best Cost-Performance with 72.7% on SWE-bench
Sonnet 4 was released as a major upgrade to Sonnet 3.7. The standout feature is its SWE-bench score.
| Benchmark | Claude Sonnet 4 | Notes |
|---|---|---|
| SWE-bench Verified | 72.7% | Slightly surpasses Opus 4 (72.5%) |
In terms of coding performance metrics, it outperforms Opus 4 by 0.2 percentage points, making it a remarkable model in that it offers "flagship-level coding capability at lower cost."
Sonnet 4, like Opus 4, features hybrid reasoning, parallel tool execution, and improved memory capabilities, and there are many situations where Sonnet 4 is more than sufficient for everyday coding assistance and reasoning tasks.
Hybrid Reasoning — A Design That Balances Speed and Depth
The core technology of Claude 4 is Hybrid Reasoning.
Traditional models were designed to "always respond with a fixed level of reasoning depth," but Claude 4 automatically judges the difficulty of a question and switches between two modes.
Standard Mode (Instant Response)
- Simple questions and routine processing receive near-instant responses
- Minimizes latency for real-time use cases
Extended Thinking Mode
- For complex reasoning, math, and code generation, it "expands the thought process" before responding
- Builds logic step by step, resulting in higher accuracy
With this design, users don't need to manually switch modes — the optimal response is automatically selected based on the nature of the task. Via the API, the thinking parameter can also be used to explicitly enable extended thinking.
Source: Anthropic Official News
Claude 4 Pricing — Comparison of Opus 4 and Sonnet 4
The API pricing for the Claude 4 generation is as follows (per million tokens, USD, before tax).
| Model | Input | Output | Intended Use |
|---|---|---|---|
| Claude Opus 4 | $15 | $75 | High-difficulty coding, long-running workflows |
| Claude Sonnet 4 | $3 | $15 | Everyday coding, reasoning, batch processing |
Sonnet 4 is significantly cheaper than Opus 4 — 1/5 the input cost and 1/5 the output cost. Given that coding performance is nearly on par (and sometimes reversed on SWE-bench), Sonnet 4 becomes the leading choice for cost-efficiency-focused use cases.
Note that both Opus 4 and Sonnet 4 are available on Claude.ai paid plans (subject to usage limits). If you don't use the API, you can use them within the scope of your monthly subscription.
What Changed from Claude 3.7 — Comparison with the Previous Generation
Here we compare the improvements in Claude 4 against the previous generation (Claude 3.7/3.5 series).
Dramatic Improvement in Coding Ability The SWE-bench score rose from around 62% for Claude 3.7 Sonnet to 72.7% for Sonnet 4 — an improvement of over 10 percentage points. Accuracy has improved not just for simple code completion but for advanced tasks like identifying the root cause of bugs, auto-generating test code, and suggesting refactoring.
Adoption of Hybrid Reasoning While Claude 3.7 had experimentally introduced extended thinking mode in some models, the Claude 4 generation features it as standard in both Opus and Sonnet.
Support for Extended Work Sessions With Opus 4 achieving up to 7 hours of continuous operation, it is now possible to complete thousands-of-steps workflows without interruption — something that was difficult with the previous generation.
Improved Instruction-Following Sonnet 4 has improved over Sonnet 3.7 in terms of "following instructions more precisely." It has become easier to get responses that capture the nuances of prompts, reducing the cost of prompt engineering.
The Model Lineage After Claude 4 — 4.1 → 4.5 → 4.6 → 4.7
After the release of Claude 4 (Opus 4 and Sonnet 4), Anthropic has continued rapid iterative development.
| Model | Release | Key Features |
|---|---|---|
| Claude Opus 4 / Sonnet 4 | May 2025 | Start of Claude 4 generation, SWE-bench in the 72% range |
| Claude Opus 4.1 | August 2025 | Performance improvements to the Opus line |
| Claude Opus 4.5 / Sonnet 4.5 | November 2025 | Upgrades to both; Sonnet 4.5 achieves 77.2% on SWE-bench |
| Claude Sonnet 4.6 | February 2026 | Context expansion, increased output limits |
| Claude Opus 4.7 | April 2026 | 87.6% on SWE-bench Verified, enhanced vision |
As of May 2026, the latest flagship is Claude Opus 4.7 (87.6% on SWE-bench) and the balanced model is Claude Sonnet 4.6. While Claude 4 (original) is being superseded by its successors, its design philosophy — hybrid reasoning and extended operation — is carried throughout the entire lineage.
Source: Anthropic Official Model Page
How to Choose Claude 4 — Opus or Sonnet?
Here is a summary of how to choose between Claude 4 generation models.
Cases Where Claude Opus 4 Is the Better Fit
- Autonomous tasks that take several hours or more
- Complex architecture design and full system refactoring
- Development projects where accuracy is the top priority and budget is flexible
- Automation involving system operations at the Terminal-bench level
Cases Where Claude Sonnet 4 Is the Better Fit
- Everyday coding assistance and code reviews
- Batch processing with a focus on cost efficiency
- When you need SWE-bench-level coding performance but want to keep costs down
- Real-time use cases where response speed matters
For general development purposes, starting with Sonnet 4 is the rational approach. Since coding performance is on par with Opus 4 at 1/5 the cost, it's efficient to first validate with Sonnet 4, then switch to Opus 4 when you hit its limits for long continuous operation or specific reasoning tasks.
Summary
Claude 4 (Opus 4 and Sonnet 4) is a model generation that achieved industry-leading standards in both coding and reasoning as of May 2025. Opus 4 scored 72.5% and Sonnet 4 scored 72.7% on SWE-bench, achieving both speed and accuracy through hybrid reasoning.
It also serves as the starting point for the evolution that continued with 4.1, 4.5, 4.6, and 4.7, and understanding the characteristics of the Claude 4 generation is useful for making the most of current models.
Official information is updated regularly at the Anthropic news page.