GPT vs Claude Sonnet | Choosing by Coding Performance and Pricing


"Which AI should I use for coding: GPT or Claude Sonnet?" Many developers are wrestling with this question in 2026. This article compares OpenAI's GPT-5 family (represented by GPT-5.3 Codex) and Anthropic's Claude Sonnet 4.6 across four dimensions (benchmarks, pricing, speed, and areas of strength) with concrete figures, and sorts out which to choose by use case.

Conclusion (AI-generated summary, powered by Claude)
Coding accuracy is nearly identical: on SWE-bench Verified, GPT-5.3 Codex scores approximately 80% versus 79.6% for Claude Sonnet 4.6, a gap of about 0.4 percentage points. There is no clear winner in pure bug-fixing ability. Codex leads autonomous terminal execution (Terminal-Bench 2.0) by about 18 points, while Sonnet leads screen-operation tasks (OSWorld) by 8.5 points, so which model has the edge depends on the type of task.

The key to choosing is the nature of the task. GPT-5.3 Codex has the advantage for autonomous terminal execution, CI integration, and high-speed iterative work. Claude Sonnet 4.6 is preferred in NxCode's qualitative evaluation for interpreting ambiguous requirements, multi-file refactoring, and vibe-coding-style prototyping.

On pricing, Codex's input price is about 40% lower ($1.75 versus $3.00 per 1M tokens), but Claude Sonnet 4.6 offers a discount of up to 90% with prompt caching, so in development workflows that repeatedly reference the same context, the effective cost can flip. Rather than committing to one model, using each where it excels is the 2026 best practice.

Conclusion | Coding Is Nearly Tied — Splitting by Strength Is the Right Answer

The conclusion first: in pure coding accuracy, GPT-5.3 Codex and Claude Sonnet 4.6 are nearly neck and neck, with no clear winner. In testing by the third-party comparison site NxCode, the difference between the two shows up in their areas of specialty.

  • Terminal operations and autonomous execution as the main focus → GPT-5 (Codex) has the advantage
  • Interpreting ambiguous requirements, multi-file refactoring, cost-consciousness → Claude Sonnet 4.6 has the advantage

Rather than committing to one or the other, the realistic answer is to use each model according to the nature of the task. Source: GPT-5.3 Codex vs Claude Sonnet 4.6 Comparison (NxCode, 2026)

The Two Models Compared — GPT-5 Codex and Claude Sonnet 4.6

First, let's confirm the positioning of what we're comparing.

Claude Sonnet 4.6 is Anthropic's standard model, positioned as "the best balance of speed and intelligence," released on February 17, 2026. It supports a context window of 200K tokens as standard and 1M tokens in API beta, and supports both Extended Thinking and Adaptive Thinking. It is a model designed with agent-based coding and long autonomous tasks in mind. Source: Anthropic Claude Sonnet

GPT-5 Codex is OpenAI's coding-focused model line, with strengths in terminal-based autonomous execution and VS Code/GitHub integration. This article treats GPT-5.3 Codex, the model that appears most often in top search results, as its representative.

When people hear "Claude," they tend to think of the top-tier Opus, but in day-to-day development, Sonnet — with its balance of speed and cost — is what's actually used. That's precisely why Sonnet is the main player in any comparison with GPT.

Coding Performance Compared Across 3 Benchmarks — SWE-bench Is Nearly Tied

The results of NxCode's three key benchmark tests are as follows.

| Benchmark | GPT-5.3 Codex | Claude Sonnet 4.6 | Advantage |
| --- | --- | --- | --- |
| SWE-bench Verified | ~80% | 79.6% | Essentially tied (0.4pt gap) |
| Terminal-Bench 2.0 | 77.3% | 59.1% | Codex (+18.2pt) |
| OSWorld (computer operation) | 64% | 72.5% | Sonnet (+8.5pt) |

The key takeaway is that SWE-bench Verified, which measures real-world bug-fixing ability, is effectively a tie. On the other hand, Codex leads significantly on Terminal-Bench, which measures autonomous terminal execution, while Sonnet pulls ahead on OSWorld, which involves screen operations. There is no simple ranking of which is better at "coding overall" — the winner changes depending on the type of task. Source: NxCode Comparison (2026)
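The gaps in the table can be recomputed with a few lines of Python. This is only a sketch: the scores are the NxCode figures quoted above, "~80%" is taken as exactly 80.0, and the 1-point tie threshold is an assumption made for illustration, not something NxCode defines.

```python
# Scores in percent, copied from the table above. "~80%" is treated as 80.0.
SCORES = {
    "SWE-bench Verified": {"codex": 80.0, "sonnet": 79.6},
    "Terminal-Bench 2.0": {"codex": 77.3, "sonnet": 59.1},
    "OSWorld": {"codex": 64.0, "sonnet": 72.5},
}

TIE_THRESHOLD = 1.0  # assumption: gaps under 1 point count as a tie


def compare(scores, tie_threshold=TIE_THRESHOLD):
    """Return {benchmark: (leader, gap_in_points)} for each benchmark."""
    result = {}
    for name, s in scores.items():
        gap = round(s["codex"] - s["sonnet"], 1)
        if abs(gap) < tie_threshold:
            leader = "tie"
        elif gap > 0:
            leader = "codex"
        else:
            leader = "sonnet"
        result[name] = (leader, abs(gap))
    return result


if __name__ == "__main__":
    for name, (leader, gap) in compare(SCORES).items():
        print(f"{name}: {leader} ({gap:.1f} pt)")
```

With a different tie threshold the SWE-bench row could be labeled a narrow Codex lead instead of a tie, which is why the threshold is exposed as a parameter.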

Pricing and Context Window — Codex Has Lower Input Costs

API pricing (per 1 million tokens) and context window are as follows.

| Item | GPT-5.3 Codex | Claude Sonnet 4.6 |
| --- | --- | --- |
| Input | $1.75 | $3.00 |
| Output | $14.00 | $15.00 |
| Context window | 400K tokens | 200K (1M in API beta) |

Note: GPT-5.3 Codex pricing and context window are the values reported by NxCode (third-party figures at the time of writing). Claude Sonnet 4.6 pricing is based on official Anthropic information.

Codex's input price is about 40% lower, which makes it look more cost-competitive. However, Claude Sonnet 4.6 offers discounts of up to 90% via prompt caching and 50% via batch processing, so in development workflows that repeatedly reference the same context, the effective cost drops significantly. Source: Anthropic Models overview
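To see roughly where the input cost could flip, here is a minimal sketch of the arithmetic. It rests on simplifying assumptions that are not from the article: only input tokens are considered, the full 90% discount applies to every cached token, any extra charge for writing to the cache is ignored, and Codex is billed at its list price with no caching discount of its own. The function names are hypothetical.

```python
CODEX_INPUT = 1.75     # USD per 1M input tokens (NxCode figure)
SONNET_INPUT = 3.00    # USD per 1M input tokens (Anthropic list price)
CACHE_DISCOUNT = 0.90  # "up to 90%" discount on cached input tokens


def effective_input_price(list_price, cached_fraction, discount=CACHE_DISCOUNT):
    """Blended USD per 1M input tokens when `cached_fraction` of them hit the cache."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    return list_price * (1.0 - discount * cached_fraction)


def break_even_cached_fraction(price, rival_price, discount=CACHE_DISCOUNT):
    """Cached fraction at which `price` with caching equals `rival_price` without it."""
    return (1.0 - rival_price / price) / discount


if __name__ == "__main__":
    f = break_even_cached_fraction(SONNET_INPUT, CODEX_INPUT)
    print(f"Break-even cached fraction: {f:.1%}")
    print(f"Sonnet at 50% cached: ${effective_input_price(SONNET_INPUT, 0.5):.2f} per 1M")
```

Under these assumptions, Sonnet's blended input price drops below Codex's $1.75 once a little under half of the input tokens are served from cache. Output tokens are not discounted by caching, so the overall picture depends on the input/output mix of the workload.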

Codex's standard 400K context window is larger than Sonnet's standard 200K, but Sonnet can be extended to 1M tokens in API beta, which gives it an edge for use cases that need to load an entire large codebase at once.

Speed and Token Efficiency — Codex Is Fast and Token-Efficient

Codex has the speed advantage. In NxCode's measurements, Codex outputs 61.9 tokens per second, approximately 25% faster than the previous generation. It also uses roughly one-half to one-quarter as many tokens per task as Claude-family models, a token-efficient design that makes it easier to keep token costs down for the same work.

That said, token unit price alone does not determine the actual cost: the number of attempts and the rework needed to complete a task also matter. In NxCode's real-task test (reproducing a Figma design), Codex cost approximately $54 while Sonnet 4.6 cost approximately $40–50, so the model with the lower unit price was not cheaper in total. In short, lower unit price ≠ lower total cost.
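The relationship between unit price, token volume, and attempts can be sketched as a small cost function. The prices are the per-1M-token figures from the pricing table. The token counts and attempt counts in the example are invented purely to show the mechanism and say nothing about how either model actually behaves. `task_cost` is a hypothetical helper.

```python
def task_cost(input_tokens, output_tokens, input_price, output_price, attempts=1):
    """USD cost of a task. Prices are per 1M tokens; token counts are per attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    per_attempt = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return per_attempt * attempts


if __name__ == "__main__":
    # Hypothetical workload: 200K input and 20K output tokens per attempt.
    cheaper_unit_price = task_cost(200_000, 20_000, 1.75, 14.00, attempts=3)
    pricier_unit_price = task_cost(200_000, 20_000, 3.00, 15.00, attempts=1)
    print(f"Lower unit price, 3 attempts: ${cheaper_unit_price:.2f}")
    print(f"Higher unit price, 1 attempt: ${pricier_unit_price:.2f}")
```

In this made-up scenario the lower-priced model ends up costing about twice as much because it needs three attempts. The same function shows the opposite outcome if the token counts are reduced to reflect a more token-efficient model.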

Sonnet Has the Edge for Ambiguous Requirements and Multi-File Refactoring

Practical strengths that don't show up in benchmark scores are also worth noting. According to NxCode's qualitative evaluation, developers preferred Sonnet 4.6 about 70% of the time on tasks that required interpreting ambiguous requirements. In NxCode's multi-task head-to-head comparison, Sonnet led 11 to 6 on tasks involving inferring intent from vague specs and translating it into implementation, multi-file refactoring across a codebase, and "vibe coding"-style prototyping that generates an entire app at once.

Conversely, Codex excels at clearly defined terminal tasks, autonomous CI execution, and iterative work requiring speed. A division of labor emerges: Sonnet for "thinking and designing," Codex for "executing defined work quickly."

How to Choose by Use Case

Here is a summary of the comparison organized by use case.

  1. Understanding existing code / large-scale refactoring → Claude Sonnet 4.6 (strong multi-file reasoning and long context)
  2. Implementing from ambiguous specs / prototyping → Claude Sonnet 4.6 (70% preference in requirement interpretation)
  3. Terminal-centric autonomous execution / CI integration → GPT-5 Codex (large margin in Terminal-Bench)
  4. Iterative work where speed and token cost are the top priority → GPT-5 Codex (61.9 tok/s, token-efficient)
  5. Computer operation tasks involving screen manipulation → Claude Sonnet 4.6 (advantage in OSWorld)
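The list above amounts to a lookup from task type to model. A minimal sketch of that routing follows. The task-type keys are hypothetical labels chosen for this example, the model names are display labels rather than API model identifiers, and falling back to Sonnet for unlisted task types is an assumption.

```python
# Recommendations from the use-case list above, keyed by a hypothetical task label.
RECOMMENDED = {
    "refactoring": "Claude Sonnet 4.6",         # existing code, large-scale refactoring
    "prototyping": "Claude Sonnet 4.6",         # ambiguous specs, prototyping
    "terminal_ci": "GPT-5.3 Codex",             # terminal-centric execution, CI
    "fast_iteration": "GPT-5.3 Codex",          # speed and token cost first
    "computer_operation": "Claude Sonnet 4.6",  # screen manipulation
}


def pick_model(task_type, default="Claude Sonnet 4.6"):
    """Return the recommended model label for a task type (default is an assumption)."""
    return RECOMMENDED.get(task_type, default)


if __name__ == "__main__":
    for task in ("terminal_ci", "refactoring", "something_else"):
        print(f"{task}: {pick_model(task)}")
```

A table like this is easy to revise as benchmark results change, which matters because both model families are updated frequently.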

The "Hybrid Approach" — Using Both

Finally, NxCode recommends using both in combination. Going by the results above, that means using Sonnet 4.6 as the default for requirement interpretation, multi-file refactoring, and screen-operation tasks in day-to-day development, then switching to Codex for autonomous terminal execution, CI integration, and speed-critical iterative work. NxCode concludes that this "use each for what it's good at" approach is the most cost-efficient strategy for most developers.

If you can afford two subscriptions, rather than committing to one, keeping both at hand and choosing the model based on the nature of the task is the realistic best practice for 2026.

FAQ — Common Questions About Comparing GPT and Claude Sonnet

Q. Which has higher coding accuracy?

On SWE-bench Verified, GPT-5.3 Codex scores approximately 80% versus Claude Sonnet 4.6 at 79.6% — a gap of just 0.4 points, effectively a tie. There is no significant difference in bug-fixing ability.

Q. Which is cheaper?

The input unit price of GPT-5.3 Codex ($1.75) is about 40% lower than that of Claude Sonnet 4.6 ($3.00). However, Sonnet offers a discount of up to 90% with prompt caching, so for use cases that repeatedly reference the same context, the effective cost can flip.

Q. Should Claude be compared using Opus or Sonnet when benchmarking against GPT?

For practical comparisons with GPT, Sonnet 4.6 — with its balance of cost efficiency and speed — is the realistic comparison target. You only need to consider Opus when the highest-level reasoning is required. For choosing between Claude models, see the Claude model comparison article.


The benchmark figures in this article are based on verification by third-party comparison media NxCode (source). Specifications and pricing for Claude Sonnet 4.6 are based on official Anthropic sources (Sonnet / Models overview). Models are continuously updated, so please check the latest official information for current pricing and performance.

Clauder Navi Editorial Team
@clauder_navi

Daily coverage of the latest developments and hands-on practice around Anthropic's Claude and Claude Code, for engineers in Japan. For our editorial policy, see the About page.