Claude API Pricing | Cost Optimization with Per-Model Rates and Reduction Techniques


Clauder Navi Editorial Team / Last updated 2026-06-08

One of the biggest concerns before integrating Claude API into production is not knowing how much it will cost each month. This article focuses exclusively on Claude API pricing, covering token rates for Opus 4.7, Sonnet 4.6, and Haiku 4.5, monthly estimation formulas, discounts through prompt caching and Batch API, the tiered rate limit structure, and practical budget management — all in one place.


Conclusion

Claude API costs are determined by the simple formula "input tokens × rate + output tokens × rate", and are billed entirely separately from subscriptions (Pro / Max / Team) as pay-as-you-go. Base rates are Opus 4.7 at $5 input / $25 output (per 1 million tokens), Sonnet 4.6 at $3 / $15, and Haiku 4.5 at $1 / $5 (tax-exclusive, USD) — a 5x spread depending on use case. Note that as of June 2026, the top flagship model has been updated to Claude Opus 4.8. For the latest rates, check Anthropic API Pricing.

The primary levers for cost optimization are model selection and prompt caching. Routing heavy reasoning exclusively to Opus while handling routine tasks with Sonnet or Haiku can cut your monthly bill in half. Caching repeated system prompts delivers up to 90% savings on those tokens. Switching eligible workloads to the Batch API yields an additional 50% discount. Stacking both discounts brings the effective rate on cached input tokens to roughly 5% of the listed rate (0.1 × 0.5), although output tokens receive only the 50% batch discount.

For budget management, the standard approach is a two-layer setup: the Usage tab and monthly spending limits in Anthropic Console, plus application-side logging that aggregates the usage response. Rate limits scale progressively from Tier 1 through 4, and catching cost spikes early requires understanding Tier upgrade conditions and Token Bucket behavior. Pro / Max subscriptions are billed separately from API keys, so running both simultaneously is a common real-world configuration.

Contents (12)

Claude API Cost Structure — Understanding Token Billing vs. Subscriptions
Per-Model Token Rates — The Three Tiers: Opus / Sonnet / Haiku
Monthly Cost Formulas and Worked Examples — Chatbots, Summarization, Code Generation
Scenario A: Internal Chatbot (Sonnet 4.6, 100,000 requests/month)
Scenario B: Document Summarization Batch (Haiku 4.5, 10,000 documents/month, avg. 8,000 characters)
Scenario C: Code Generation Agent (Opus 4.7, 1,000 sessions/month)
Up to 90% Savings with Prompt Caching — Reusing System Prompts
An Additional 50% Discount with Batch API — Offloading Asynchronous Workloads
Visualizing Token Usage — Two-Layer Logging with Console and usage Response
Rate Limits and Tier Structure — Three Axes: RPM / ITPM / OTPM
Monthly Budget Management Best Practices — Limits, Alerts, and Model Fallbacks
Choosing Between Pro / Max Subscriptions and API — Running Both in Parallel Is the Practical Choice

Claude API Cost Structure — Understanding Token Billing vs. Subscriptions

Claude API billing works as pay-as-you-go based on API key usage issued through Anthropic Console, charging input tokens and output tokens at separate per-token rates. This is completely separate from the fixed monthly subscriptions (Pro at $20/month, Max, Team) that provide near-unlimited access to Claude.ai's chat UI. Even if you have an active subscription, API usage is billed independently — and conversely, if you only use the API, no subscription is required.

A token is the smallest unit an AI model uses internally to process English words, symbols, and Japanese characters. As a rough guide, "4 English characters ≈ 1 token" and "1 Japanese character ≈ 1–2 tokens." The entire prompt sent in a single request (system + user + conversation history) is counted as input tokens, and the text the model returns is counted as output tokens at a separate rate. A common oversight is that in long sessions, previous exchanges are included in input tokens on every request — a dynamic also discussed in A Practical Guide to Preventing Claude Code Cost Explosions.

Source: Anthropic API Pricing (accessed: 2026-05-28)
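To make the point about conversation history concrete, the following minimal sketch counts the input tokens billed across a multi-turn session. The function name and the session sizes are hypothetical, chosen only for illustration, and the sketch assumes every request resends the system prompt plus all earlier turns with no caching.

```python
def cumulative_input_tokens(system_tokens, turns):
    """Total input tokens billed across a multi-turn session.

    turns is a list of (user_tokens, assistant_tokens) pairs, one per turn.
    Each request resends the system prompt and the full prior history.
    """
    total = 0
    history = 0
    for user_tokens, assistant_tokens in turns:
        # Input for this request: system prompt + prior history + new user message
        total += system_tokens + history + user_tokens
        # Both sides of the exchange become history for the next request
        history += user_tokens + assistant_tokens
    return total


# Hypothetical session: 500-token system prompt, 10 turns of 100 in / 300 out
session = [(100, 300)] * 10
print(cumulative_input_tokens(500, session))  # 24000
```

Without history, the same ten requests would total 6,000 input tokens, so in this example the accumulated history quadruples the billed input.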

Per-Model Token Rates — The Three Tiers: Opus / Sonnet / Haiku

As of May 2026, the three main model families available via API were Claude Opus 4.7, Sonnet 4.6, and Haiku 4.5. As of June 2026, the top flagship has been updated to Claude Opus 4.8 (check Anthropic API Pricing for the latest rates). Input and output token rates are arranged in a stepped structure as shown in the quick-reference table below. The difference between Opus and Haiku is 5x on both input and output, meaning your choice of default model alone can shift your monthly bill by up to a factor of five.

Model	Input (tax-excl., USD / MTok)	Output (tax-excl., USD / MTok)	Intended Use
Claude Opus 4.7	$5.00	$25.00	Complex reasoning, long-form code generation, sophisticated agents
Claude Sonnet 4.6	$3.00	$15.00	General business tasks, coding assistance, summarization, classification
Claude Haiku 4.5	$1.00	$5.00	Routine conversion, classification, preprocessing, first-pass chat responses

MTok = 1 million tokens. All prices are tax-exclusive USD; exchange rate fluctuations and consumption tax apply separately.

Source: Anthropic API Pricing (accessed: 2026-05-28)

The practical selection guideline is straightforward: call Opus only when you can identify a problem that genuinely requires it, and default to Sonnet for everything else. Deterministic tasks like classification, extraction, and summarization are typically handled well by Haiku, so building a routing layer that dispatches across all three tiers from the start gives you room to tune your monthly bill down to one-third later.
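A routing layer of this kind can start as a simple lookup table. The sketch below is one possible shape, not a prescribed design: the task categories are hypothetical, and the Opus and Haiku model IDs are assumed to follow the same naming pattern as the `claude-sonnet-4-6` ID used elsewhere in this article, so confirm them against the official model list.

```python
# Hypothetical task categories mapped to model tiers; adjust to your workload.
# Model IDs other than claude-sonnet-4-6 are assumptions based on its naming pattern.
MODEL_BY_TASK = {
    "classification": "claude-haiku-4-5",
    "extraction": "claude-haiku-4-5",
    "summarization": "claude-haiku-4-5",
    "complex_reasoning": "claude-opus-4-7",
}
DEFAULT_MODEL = "claude-sonnet-4-6"


def pick_model(task_type: str) -> str:
    """Return the model for a task type, defaulting to Sonnet for anything unlisted."""
    return MODEL_BY_TASK.get(task_type, DEFAULT_MODEL)


print(pick_model("classification"))     # claude-haiku-4-5
print(pick_model("coding_assistance"))  # claude-sonnet-4-6
```

Keeping the mapping in one place means a later decision to move a task type down a tier is a one-line change rather than a code search.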

Monthly Cost Formulas and Worked Examples — Chatbots, Summarization, Code Generation

Your cost estimate comes down to this single formula:

Monthly USD ≈ (monthly input tokens × input rate) + (monthly output tokens × output rate)

Applying this to three common workloads gives you a quick sense of the numbers:

Scenario A: Internal Chatbot (Sonnet 4.6, 100,000 requests/month)

Assume 2,000 input tokens per request (system prompt + recent history) and 500 output tokens, for 100,000 requests per month.

Input: 2,000 × 100,000 = 200M tokens = 200 MTok → 200 × $3 = $600
Output: 500 × 100,000 = 50M tokens = 50 MTok → 50 × $15 = $750
Monthly total: $1,350 (tax-excl., USD)

Scenario B: Document Summarization Batch (Haiku 4.5, 10,000 documents/month, avg. 8,000 characters)

Assume 8,000 characters ≈ 12,000 input tokens per document, with 500 output tokens per summary.

Input: 12,000 × 10,000 = 120M tokens = 120 MTok → 120 × $1 = $120
Output: 500 × 10,000 = 5M tokens = 5 MTok → 5 × $5 = $25
Monthly total: $145 (tax-excl., USD) ※ An additional 50% discount applies with Batch API (see below)

Scenario C: Code Generation Agent (Opus 4.7, 1,000 sessions/month)

Assume a heavy agent using a cumulative 50,000 input tokens and 8,000 output tokens per session.

Input: 50,000 × 1,000 = 50M tokens = 50 MTok → 50 × $5 = $250
Output: 8,000 × 1,000 = 8M tokens = 8 MTok → 8 × $25 = $200
Monthly total: $450 (tax-excl., USD)

If you switched Scenario A's Sonnet workload to Opus, input would go from $600 to $1,000 and output from $750 to $1,250, pushing the monthly total from $1,350 to $2,250. Conversely, if half the requests are light classification tasks that could run on Haiku, the bill drops to around $900 ($675 for the Sonnet half plus $225 for the Haiku half). This comparison makes clear that model selection is the single biggest cost optimization lever.
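The estimation formula is easy to turn into a small helper for checking scenarios like the three above. This is a minimal sketch: the function name and the short model keys are hypothetical, and the rates are copied from the table earlier in this article.

```python
# USD per MTok as (input, output), taken from the rate table in this article
RATES = {
    "opus-4.7": (5.00, 25.00),
    "sonnet-4.6": (3.00, 15.00),
    "haiku-4.5": (1.00, 5.00),
}


def monthly_cost(model, requests, input_tokens, output_tokens):
    """Monthly USD = input MTok x input rate + output MTok x output rate."""
    in_rate, out_rate = RATES[model]
    input_mtok = requests * input_tokens / 1_000_000
    output_mtok = requests * output_tokens / 1_000_000
    return input_mtok * in_rate + output_mtok * out_rate


print(monthly_cost("sonnet-4.6", 100_000, 2_000, 500))  # Scenario A: 1350.0
print(monthly_cost("haiku-4.5", 10_000, 12_000, 500))   # Scenario B: 145.0
print(monthly_cost("opus-4.7", 1_000, 50_000, 8_000))   # Scenario C: 450.0
print(monthly_cost("opus-4.7", 100_000, 2_000, 500))    # Scenario A on Opus: 2250.0
```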

Up to 90% Savings with Prompt Caching — Reusing System Prompts

For workloads that repeatedly send the same system prompt or long context, enabling prompt caching dramatically reduces costs for those tokens. The mechanism works as follows: the first request writes to the cache (billed at 1.25x the base rate), and subsequent cache hits are billed at 1/10 (0.1x). TTL options include a standard 5-minute cache and an extended 1-hour cache.

The minimal Python SDK implementation adds a cache_control field to the block you want cached:

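import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "<long system prompt, internal guidelines, etc.>",
            # Marks the prompt up to and including this block as cacheable
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Today's question"}],
)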

In Scenario A (monthly $1,350), if a 1,500-token system prompt hits the cache on every request, that portion of the cost drops to roughly one-tenth: the cached portion falls from $450 to about $45, so the input-side cost goes from $600 to about $195 (ignoring the occasional cache write). The higher the request frequency with the same template, the greater the impact, so the standard approach is to define cacheable blocks before building out the full implementation.

Source: Prompt caching — Anthropic Docs (accessed: 2026-05-28)
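The caching arithmetic for Scenario A can be sketched as follows. The function name and the `hit_rate` parameter are hypothetical. The 1.25x write and 0.1x read multipliers are the ones quoted above, and the model assumes every cache miss pays for one cache write, which is a simplification of real TTL behavior.

```python
def monthly_input_cost(requests, cached_tokens, uncached_tokens, rate_per_mtok,
                       hit_rate=1.0, write_mult=1.25, read_mult=0.10):
    """Estimate monthly input cost in USD when part of each prompt is cached."""
    hits = requests * hit_rate
    misses = requests - hits  # assumed: each miss rewrites the cache
    read_mtok = hits * cached_tokens / 1_000_000
    write_mtok = misses * cached_tokens / 1_000_000
    plain_mtok = requests * uncached_tokens / 1_000_000
    return (read_mtok * read_mult + write_mtok * write_mult + plain_mtok) * rate_per_mtok


# Scenario A input side: 2,000 tokens per request at $3 / MTok
no_cache = monthly_input_cost(100_000, 0, 2_000, 3.00)
all_hits = monthly_input_cost(100_000, 1_500, 500, 3.00)
mostly_hits = monthly_input_cost(100_000, 1_500, 500, 3.00, hit_rate=0.9)
print(round(no_cache, 2), round(all_hits, 2), round(mostly_hits, 2))
```

Under these assumptions the input cost is roughly $600 with no caching, $195 with a hit on every request, and $247 at a 90% hit rate, which shows how quickly cache misses eat into the saving.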

An Additional 50% Discount with Batch API — Offloading Asynchronous Workloads

For batch processing that does not require real-time responses, switching to the Message Batches API cuts the per-token rate by 50% on both input and output for all models. Batches are processed asynchronously with a 24-hour processing window (requests that have not finished by then expire rather than complete), and each batch can hold up to 100,000 requests or 256 MB. It is also compatible with prompt caching, so both discounts can be stacked.

Instead of the standard messages.create, you use messages.batches.create, assign a custom_id to each request, and retrieve results later. Applying this to Scenario B's document summarization batch brings the cost from $145 down to $72.50. For workloads like summarization, classification, tagging, and translation where latency is acceptable, batching is the standard practice.

Conversely, forcing batch processing onto user-facing chat or IDE completions that require sub-second responses will break the user experience. The prerequisite is a design that consciously separates asynchronous-eligible workloads from real-time ones.

Source: Message Batches — Anthropic Docs (accessed: 2026-05-28)
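As a rough illustration of the request shape described above, the sketch below builds the list of batch requests for a summarization job like Scenario B. The helper name, the prompt wording, and the Haiku model ID are assumptions made for this example. The actual submission call is shown only in comments because it needs the `anthropic` package and an API key.

```python
def build_batch_requests(documents, model="claude-haiku-4-5", max_tokens=500):
    """Turn a {doc_id: text} mapping into a request list for a message batch."""
    return [
        {
            "custom_id": doc_id,  # used later to match each result to its document
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [
                    {
                        "role": "user",
                        "content": f"Summarize the following document:\n\n{text}",
                    }
                ],
            },
        }
        for doc_id, text in documents.items()
    ]


requests = build_batch_requests(
    {"doc-001": "First document text", "doc-002": "Second document text"}
)
print([r["custom_id"] for r in requests])  # ['doc-001', 'doc-002']

# Submitting and collecting results (requires the anthropic package and an API key):
#   import anthropic
#   client = anthropic.Anthropic()
#   batch = client.messages.batches.create(requests=requests)
#   ...once processing has ended:
#   for entry in client.messages.batches.results(batch.id):
#       print(entry.custom_id, entry.result.type)
```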

Visualizing Token Usage — Two-Layer Logging with Console and `usage` Response

The main reason costs stay invisible is that per-request token volumes are not being tracked. The Usage tab in Anthropic Console shows daily and per-model consumption and billing, so administrators should bookmark it before going to production. In addition, the application itself should log the usage block returned in each response to persistent storage, enabling cost breakdown by feature and by user.

The minimal Python SDK implementation:

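def usage_record(res, feature, user_id):
    """Flatten one response's usage block into a record for cost aggregation."""
    return {
        "feature": feature,
        "user_id": user_id,
        "model": res.model,  # model that actually served the request
        "input_tokens": res.usage.input_tokens,
        "output_tokens": res.usage.output_tokens,
        # Cache fields may be None when caching is not in use
        "cache_read": res.usage.cache_read_input_tokens or 0,
        "cache_create": res.usage.cache_creation_input_tokens or 0,
    }

res = client.messages.create(...)  # arguments as in any normal request
log = usage_record(res, "support_chat", user.id)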

Aggregating this log daily makes it possible to see on a single dashboard: (1) which users are consuming disproportionately, (2) which features are using several times more than expected, and (3) whether the cache hit rate matches projections. Surprise invoices of $5,000 or more almost always occur when this visibility layer is missing.
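One way to run the daily aggregation is sketched below. It assumes log records with the fields shown above, rates keyed by the model string exactly as it was logged, and the 0.1x cache-read and 1.25x cache-write multipliers from the caching section. The function name and the sample records are hypothetical.

```python
from collections import defaultdict

# USD per MTok as (input, output); keys must match the logged model strings
RATES = {
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5": (1.00, 5.00),
}


def cost_by_feature(logs):
    """Sum estimated USD cost per feature from per-request usage records."""
    totals = defaultdict(float)
    for log in logs:
        in_rate, out_rate = RATES[log["model"]]
        cost = (
            log["input_tokens"] * in_rate
            + log["cache_read"] * in_rate * 0.10     # cache hits at 0.1x
            + log["cache_create"] * in_rate * 1.25   # cache writes at 1.25x
            + log["output_tokens"] * out_rate
        ) / 1_000_000
        totals[log["feature"]] += cost
    return dict(totals)


sample_logs = [
    {"feature": "support_chat", "model": "claude-sonnet-4-6",
     "input_tokens": 500, "output_tokens": 500, "cache_read": 1_500, "cache_create": 0},
    {"feature": "tagging", "model": "claude-haiku-4-5",
     "input_tokens": 12_000, "output_tokens": 500, "cache_read": 0, "cache_create": 0},
]
for feature, usd in cost_by_feature(sample_logs).items():
    print(feature, round(usd, 6))
```

The same loop can be grouped by `user_id` instead of `feature` to find the users consuming disproportionately.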

Rate Limits and Tier Structure — Three Axes: RPM / ITPM / OTPM

The Claude API enforces limits on three axes: Requests Per Minute (RPM), Input Tokens Per Minute (ITPM), and Output Tokens Per Minute (OTPM). Accounts automatically advance from Tier 1 → 2 → 3 → 4 based on cumulative spend and account age. Tier 1 has conservative limits appropriate for new accounts, and most production workloads require Tier 2 or higher. The upgrade conditions are documented as credit and spend thresholds for each tier.

When a limit is hit, the API returns HTTP 429 with a retry-after header indicating the number of seconds to wait before retrying. The SDK includes built-in automatic retries with exponential backoff and jitter, but if bursts remain unavoidable, the practical solution is to understand Token Bucket behavior and smooth the request rate using your own queue (e.g., SQS). Because specific limits and upgrade conditions change frequently, always check the official documentation directly before designing for production.

Source: Rate limits — Anthropic Docs (accessed: 2026-05-28)
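For readers unfamiliar with Token Bucket behavior, the following client-side sketch shows the idea: a bucket holds up to `capacity` tokens, refills at a steady rate, and a request proceeds only if enough tokens are available. The class and parameter names are hypothetical, and this is a smoothing aid for your own queue worker, not a reproduction of the server-side limiter.

```python
import time


class TokenBucket:
    """Minimal token bucket: bursts up to capacity, then a steady refill rate."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock  # injectable so the bucket can be tested without sleeping
        self.tokens = float(capacity)
        self.last = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now

    def try_acquire(self, cost=1):
        """Take `cost` tokens if available; return False so the caller can wait or requeue."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Example: allow bursts of 50 requests, refilling at 50 per minute
bucket = TokenBucket(capacity=50, refill_per_sec=50 / 60)
print(bucket.try_acquire())  # True while the bucket still has tokens
```

A queue worker would call `try_acquire()` before each API request and sleep briefly or requeue the job when it returns False. The same class can track ITPM by passing the estimated input token count as `cost`.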

Monthly Budget Management Best Practices — Limits, Alerts, and Model Fallbacks

Preventing cost incidents requires a three-stage approach: prevention, detection, and graceful degradation.

Prevention (Spend Limits): In Anthropic Console's Billing settings, always configure both a Soft Limit and a Hard Limit so the API stops automatically when the Hard Limit is reached.
Detection (Usage Alerts): Enable email notifications when the Soft Limit is reached, and set up a webhook to forward alerts to your team's Slack, ChatWork, or other messaging platform.
Graceful Degradation (Model Fallback): Implement a fallback chain in your application — "if Opus is unavailable, use Sonnet; if Sonnet is unavailable, use Haiku" — so that a single environment variable change during a cost spike triggers automatic downgrade.
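The fallback chain in the third item could be sketched as below. Everything here is illustrative: the `CLAUDE_MAX_MODEL` environment variable, the function names, and the `send` callback are hypothetical, and the Opus and Haiku model IDs are assumed to follow the naming pattern of `claude-sonnet-4-6`.

```python
import os

# Ordered from most to least expensive; IDs other than Sonnet's are assumptions
FALLBACK_CHAIN = ["claude-opus-4-7", "claude-sonnet-4-6", "claude-haiku-4-5"]


def candidate_models(env=os.environ):
    """Return the chain starting at the ceiling set by CLAUDE_MAX_MODEL (hypothetical)."""
    ceiling = env.get("CLAUDE_MAX_MODEL", FALLBACK_CHAIN[0])
    start = FALLBACK_CHAIN.index(ceiling) if ceiling in FALLBACK_CHAIN else 0
    return FALLBACK_CHAIN[start:]


def call_with_fallback(send, env=os.environ):
    """Try each candidate model in order; `send(model)` performs the actual API call."""
    last_error = None
    for model in candidate_models(env):
        try:
            return send(model)
        except Exception as exc:  # narrow to the SDK's error types in real code
            last_error = exc
    raise RuntimeError("all models in the fallback chain failed") from last_error


# During a cost spike, setting CLAUDE_MAX_MODEL=claude-sonnet-4-6 skips Opus entirely
print(candidate_models({"CLAUDE_MAX_MODEL": "claude-sonnet-4-6"}))
```

Passing the API call in as `send` keeps the chain logic independent of the SDK, so it can be exercised in tests with a stub.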

Additionally, during the prototyping phase of any new feature, always build the first working version on Haiku, then upgrade to Sonnet or Opus based on quality requirements. "Starting prototypes on Opus" reliably accumulates tens to hundreds of dollars in unnecessary costs before the feature reaches production, and is a practice to avoid.

Choosing Between Pro / Max Subscriptions and API — Running Both in Parallel Is the Practical Choice

One final distinction worth clarifying is how Claude.ai subscriptions and the API differ in purpose. Subscriptions (Pro at $20/month, Max starting at $100/month, Team from $20/seat/month) provide a near-unlimited experience on claude.ai and Claude Code CLI at a fixed rate, while the API is pay-as-you-go for integrating Claude into your own products.

If you are using Claude Code for IDE completions as an individual developer while also integrating Claude into your own service, Pro / Max subscriptions and API keys are billed separately and can be active simultaneously. In practice, the most rational setup is "personal work on Max subscription at a fixed rate" combined with "production product calls on API pay-as-you-go." Combining the API cost optimization in this article with the subscription selection guide in The Complete Claude Pricing Comparison lets you minimize costs on both the individual and organizational side.

For the implementation steps to get the API running, Claude API Getting Started | The Fastest Way to Call It with 10 Lines of Python covers the shortest path, so once you have a feel for the pricing, head there to get your connection code working.


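Note: Clauder Navi refers directly to official Anthropic information and strives for accuracy, but cannot accept responsibility for any loss arising from investment decisions, contracts, or use based on the content of this article. When making important decisions, always verify the primary information from Anthropic and claude.com yourself.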
