Claude 4 Performance Comparison | Opus 4 & Sonnet 4 Benchmarks and Pricing

For developers trying to decide whether to run Opus 4 or Sonnet 4 from the Claude 4 generation, this article summarizes the capabilities of both models and the optimal way to use each. We cover the key benchmarks including SWE-bench, the 5x API pricing difference, and feature distinctions like hybrid reasoning and 7-hour continuous operation — organized so you can make confident decisions for coding, long-running batch jobs, and everyday use.

AI Chat Article Summarypowered by Claude
結論powered by Claude

For coding use cases, Sonnet 4 is the top recommendation. It scores 72.7% on SWE-bench Verified — slightly edging out Opus 4 (72.5%) — while the API pricing is 1/5 on both input and output ($3/$15 vs $15/$75), making Sonnet 4 sufficient for most everyday coding assistance.

For autonomous tasks spanning thousands of steps or long-running batch jobs, Opus 4 truly shines. It offers the endurance to sustain up to 7 hours of continuous work and a Terminal-bench score of 43.2% for autonomous system operation, making it the right choice when you need to complete complex report generation or intricate workflows without interruption.

If you already have a Claude.ai paid plan, you can use both Opus 4 and Sonnet 4 without a separate API contract (subject to usage limits). The safe approach is to first try the behavior in the Web UI, then work out your pricing strategy once you reach a scale that requires API migration.

目次 (9)

What Is Claude 4 — The Next-Generation Model Released in May 2025

Claude 4 is the collective name for the model generation Anthropic released in May 2025, consisting of two models: Claude Opus 4 and Claude Sonnet 4.

The biggest change from the previous Claude 3.x generation is the adoption of "hybrid reasoning." The design takes a two-pronged approach — answering simple queries instantly while using Extended Thinking mode for complex questions, thinking deeply before responding — achieving both speed and accuracy.

Additionally, all Claude 4 generation models support parallel tool execution, making them well-suited for complex tasks that involve calling multiple external tools simultaneously. They have earned particularly high praise for coding assistance, autonomous workflow execution, and long-form analysis.


Claude Opus 4 Performance — Top-Tier Coding Benchmarks

Claude Opus 4 was evaluated as the world's best coding model at the time of its release. Let's look at the key benchmark numbers.

Benchmark Claude Opus 4 What It Measures
SWE-bench Verified 72.5% Ability to autonomously resolve real-world software issues
Terminal-bench 43.2% System operation capability through terminal interactions

SWE-bench is an industry-standard evaluation using actual bug fix and feature implementation tasks from GitHub. A score of 72.5% represents a top-tier result as of May 2025, demonstrating the ability to "read a problem and fix it" rather than just simple completion.

Terminal-bench measures task execution capability in CLI environments. It evaluates complex tasks combining script creation, command operations, and file management, and Opus 4's 43.2% confirms its high proficiency in autonomous system operation.

In addition, Opus 4 is designed to maintain up to 7 hours of continuous work, with the endurance to complete complex workflows spanning thousands of steps. It is also well-suited for batch processing and report generation that requires extended runtime.

Source: Introducing Claude 4 — Anthropic


Claude Sonnet 4 Performance — Best Cost-Performance with 72.7% on SWE-bench

Sonnet 4 was released as a major upgrade to Sonnet 3.7. The standout feature is its SWE-bench score.

Benchmark Claude Sonnet 4 Notes
SWE-bench Verified 72.7% Slightly surpasses Opus 4 (72.5%)

In terms of coding performance metrics, it outperforms Opus 4 by 0.2 percentage points, making it a remarkable model in that it offers "flagship-level coding capability at lower cost."

Sonnet 4, like Opus 4, features hybrid reasoning, parallel tool execution, and improved memory capabilities, and there are many situations where Sonnet 4 is more than sufficient for everyday coding assistance and reasoning tasks.


Hybrid Reasoning — A Design That Balances Speed and Depth

The core technology of Claude 4 is Hybrid Reasoning.

Traditional models were designed to "always respond with a fixed level of reasoning depth," but Claude 4 automatically judges the difficulty of a question and switches between two modes.

Standard Mode (Instant Response)

  • Simple questions and routine processing receive near-instant responses
  • Minimizes latency for real-time use cases

Extended Thinking Mode

  • For complex reasoning, math, and code generation, it "expands the thought process" before responding
  • Builds logic step by step, resulting in higher accuracy

With this design, users don't need to manually switch modes — the optimal response is automatically selected based on the nature of the task. Via the API, the thinking parameter can also be used to explicitly enable extended thinking.

Source: Anthropic Official News


Claude 4 Pricing — Comparison of Opus 4 and Sonnet 4

The API pricing for the Claude 4 generation is as follows (per million tokens, USD, before tax).

Model Input Output Intended Use
Claude Opus 4 $15 $75 High-difficulty coding, long-running workflows
Claude Sonnet 4 $3 $15 Everyday coding, reasoning, batch processing

Sonnet 4 is significantly cheaper than Opus 4 — 1/5 the input cost and 1/5 the output cost. Given that coding performance is nearly on par (and sometimes reversed on SWE-bench), Sonnet 4 becomes the leading choice for cost-efficiency-focused use cases.

Note that both Opus 4 and Sonnet 4 are available on Claude.ai paid plans (subject to usage limits). If you don't use the API, you can use them within the scope of your monthly subscription.


What Changed from Claude 3.7 — Comparison with the Previous Generation

Here we compare the improvements in Claude 4 against the previous generation (Claude 3.7/3.5 series).

Dramatic Improvement in Coding Ability The SWE-bench score rose from around 62% for Claude 3.7 Sonnet to 72.7% for Sonnet 4 — an improvement of over 10 percentage points. Accuracy has improved not just for simple code completion but for advanced tasks like identifying the root cause of bugs, auto-generating test code, and suggesting refactoring.

Adoption of Hybrid Reasoning While Claude 3.7 had experimentally introduced extended thinking mode in some models, the Claude 4 generation features it as standard in both Opus and Sonnet.

Support for Extended Work Sessions With Opus 4 achieving up to 7 hours of continuous operation, it is now possible to complete thousands-of-steps workflows without interruption — something that was difficult with the previous generation.

Improved Instruction-Following Sonnet 4 has improved over Sonnet 3.7 in terms of "following instructions more precisely." It has become easier to get responses that capture the nuances of prompts, reducing the cost of prompt engineering.


The Model Lineage After Claude 4 — 4.1 → 4.5 → 4.6 → 4.7

After the release of Claude 4 (Opus 4 and Sonnet 4), Anthropic has continued rapid iterative development.

Model Release Key Features
Claude Opus 4 / Sonnet 4 May 2025 Start of Claude 4 generation, SWE-bench in the 72% range
Claude Opus 4.1 August 2025 Performance improvements to the Opus line
Claude Opus 4.5 / Sonnet 4.5 November 2025 Upgrades to both; Sonnet 4.5 achieves 77.2% on SWE-bench
Claude Sonnet 4.6 February 2026 Context expansion, increased output limits
Claude Opus 4.7 April 2026 87.6% on SWE-bench Verified, enhanced vision

As of May 2026, the latest flagship is Claude Opus 4.7 (87.6% on SWE-bench) and the balanced model is Claude Sonnet 4.6. While Claude 4 (original) is being superseded by its successors, its design philosophy — hybrid reasoning and extended operation — is carried throughout the entire lineage.

Source: Anthropic Official Model Page


How to Choose Claude 4 — Opus or Sonnet?

Here is a summary of how to choose between Claude 4 generation models.

Cases Where Claude Opus 4 Is the Better Fit

  • Autonomous tasks that take several hours or more
  • Complex architecture design and full system refactoring
  • Development projects where accuracy is the top priority and budget is flexible
  • Automation involving system operations at the Terminal-bench level

Cases Where Claude Sonnet 4 Is the Better Fit

  • Everyday coding assistance and code reviews
  • Batch processing with a focus on cost efficiency
  • When you need SWE-bench-level coding performance but want to keep costs down
  • Real-time use cases where response speed matters

For general development purposes, starting with Sonnet 4 is the rational approach. Since coding performance is on par with Opus 4 at 1/5 the cost, it's efficient to first validate with Sonnet 4, then switch to Opus 4 when you hit its limits for long continuous operation or specific reasoning tasks.


Summary

Claude 4 (Opus 4 and Sonnet 4) is a model generation that achieved industry-leading standards in both coding and reasoning as of May 2025. Opus 4 scored 72.5% and Sonnet 4 scored 72.7% on SWE-bench, achieving both speed and accuracy through hybrid reasoning.

It also serves as the starting point for the evolution that continued with 4.1, 4.5, 4.6, and 4.7, and understanding the characteristics of the Claude 4 generation is useful for making the most of current models.

Official information is updated regularly at the Anthropic news page.

参考になったら ♡
Clauder Navi 編集部
@clauder_navi

Anthropic の Claude / Claude Code を中心に、日本のエンジニア向けに最新動向と実務 を毎日発信。 運営方針 は メディアについて をご覧ください。