Prompt Caching
Summary — Key Points of This Lesson
- Prompt Caching is a feature that caches the first portion of a prompt on Anthropic's servers, reducing token processing costs on reuse.
- The cost to read from cache is 10% of normal input tokens (90% reduction). This is especially effective when repeatedly sending long system prompts or reference documents (source: anthropic.com/pricing — USD, tax excluded).
- Setup is as simple as adding
cache_control: {"type": "ephemeral"}to your request. The SDK automatically determines what to cache. - Cache writes incur 1.25x to 2x the normal cost. The structure is designed to pay off through repeated use.
- Code examples are sourced from the official documentation as the primary reference.
目次 (6)
What Is Prompt Caching?
When using Claude via the API, sending long system prompts or large volumes of reference documents with every request causes input token costs to accumulate. Prompt Caching is a feature that caches the leading portion of a request on Anthropic's servers, allowing subsequent requests to reuse that cache (source).
The cost of reading from cache is held to 10% of normal input tokens (USD, tax excluded). Always check the official site for the latest pricing details (anthropic.com/pricing).
Pricing Structure
| Token Type | Cost Multiplier (vs. Normal Input) |
|---|---|
| Normal input tokens | 1× (baseline) |
| Cache write tokens (5-minute retention) | 1.25× |
| Cache write tokens (1-hour retention) | 2× |
| Cache read tokens | 0.1× (90% reduction) |
※ USD, tax excluded. For specific per-model pricing, see anthropic.com/pricing.
Python — Minimal Code Example
The following configuration follows the official documentation code example
(source).
By adding cache_control at the top level of the request,
the SDK automatically sets a cache breakpoint at the last cacheable block.
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-opus-4-7",
max_tokens=1024,
cache_control={"type": "ephemeral"}, # キャッシュを有効化
system="あなたは文学作品を分析する AI アシスタントです。"
"テーマ・登場人物・文体について深い考察を提供してください。",
messages=[
{
"role": "user",
"content": "「源氏物語」の主要テーマを分析してください。",
}
],
)
# 使用トークン数を確認
print(response.usage.model_dump_json())
response.usage includes cache_creation_input_tokens and
cache_read_input_tokens, allowing you to verify how the cache is operating.
When Caching Is Effective
- Long system prompts — when sending hundreds to thousands of tokens of operational instructions with every request
- RAG with reference documents — when referencing the same documents across multiple conversations
- Few-shot example reuse — when repeatedly sending prompts that contain many examples
The cache is invalidated when the matching prefix changes. To maximize the benefit of caching, it is important to design your requests so that the unchanging portions are grouped at the beginning.
Relationship to Claude Opus 4.7 Migration
Prompt Caching continues to be available with Claude Opus 4.7. For notes on migration (breaking changes and deprecated parameters), see Claude Opus 4.7 Migration Practical Guide.
Level 5 Complete
Congratulations. You have completed all 5 lessons of Level 5 "API / SDK." You now have a solid understanding of the fundamentals for using Claude programmatically, covering API key setup, SDK basics, Tool Use, streaming, and Prompt Caching. In Level 6 "Business Use," you will learn practical approaches for integrating Claude into business workflows.