Claude Code Harness | How to Improve Success Rates with a 4-Layer Design

Clauder Navi Editorial Team / Last updated 2026-05-04

Running Claude Code at production quality over the long term depends not on the model itself, but on the design of the surrounding infrastructure — the harness. This article breaks down the 4-layer structure of "Guide, Sensor, Loop, and Persistence" advocated by Anthropic, all the way down to concrete components like CLAUDE.md, Skills, tests, and the Explore→Plan→Code→Commit workflow.

Conclusion

A harness is the totality of the surrounding infrastructure for Claude Code, and the way it is assembled can dramatically change the first-pass success rate even with the same model. Anthropic officially states that "improving the harness often has more impact than updating the model itself," and the real leverage point has shifted to the design of the external infrastructure.

The concrete structure can be organized into 4 layers: Guide, Sensor, Loop, and Persistence. The basic pattern is to pass context via CLAUDE.md and Skills, detect issues via tests and Lint, iterate with Explore→Plan→Code→Commit, and preserve outcomes across sessions using Git and memory.

The key caution is that ROI will not grow if you stick with a generic configuration. Continuously accumulating your organization's conventions, pitfalls, and domain terminology in CLAUDE.md, and keeping commit granularity fine enough to allow rollbacks, are prerequisites for leveraging it at a professional level.

Table of Contents (14)

What Is a "Harness" — Everything Surrounding the Model (Prompts, Tools, FB Loop)
Model Performance and Harness Performance Are Separate Things
Why the Harness Matters — Even Frontier Models Fall Short of Production Quality with Single Prompts Alone
Why "Device Quality" Beats "Instruction Quality"
The 4-Layer Harness Design — Guide / Sensor / Loop / Persistence
Layer 1: Guide (Pre-Control) — CLAUDE.md, Skills, Launch Protocol
Layer 2: Sensor (Post-Control) — Tests, Lint, Screenshot Comparison, CI Results
Layer 3: Loop (Iteration) — Explore→Plan→Code→Commit, Two-Stage Writer/Reviewer
Layer 4: Persistence — Git Commits, progress.txt for Memory Across Sessions
3 Steps for Beginners — Install → CLAUDE.md → Define Verification Criteria
Step 1: Install Claude Code Locally
Step 2: Write a CLAUDE.md
Step 3: Define Verification Criteria Before Making a Request
Sources (Primary Information)

What Is a "Harness" — Everything Surrounding the Model (Prompts, Tools, FB Loop)

A harness is the totality of the surrounding infrastructure that lets Claude Code run long-duration tasks stably. Even with the same underlying model (Claude), output quality and completion rate vary greatly with how that infrastructure is designed. This understanding spread rapidly in the second half of 2025, triggered by Anthropic's official engineering blog. Specifically, the "harness" comprises the following 5 elements:

Instructions (system prompt, slash commands, task descriptions)
Context (conversation history, file loading, reference documents)
Tools (code editing, shell execution, web search, MCP servers)
Feedback Loop (test results, Lint, type checking, user corrections)
Persistence (memory files, progress logs, Git commit history)

How these 5 elements are combined and in what order they are run is what "harness engineering" systematically designs. Just as a horse's harness controls a carriage, the external infrastructure for Claude Code must be designed to "prevent runaway, avoid stopping, and carry it to the destination" — this is said to be the origin of the name.

Model Performance and Harness Performance Are Separate Things

Even if you are using the same Claude Opus, there is a significant difference in the first-pass success rate between a repository with an empty CLAUDE.md and one where conventions, testing policies, and anti-patterns are clearly organized. This is not a difference in "the model's capability" — it is a difference in "the harness's capability." Anthropic itself explains that "improving the harness side often leads to greater quality gains than updating the model."

Why the Harness Matters — Even Frontier Models Fall Short of Production Quality with Single Prompts Alone

Model performance alone does not determine results: even with the same Claude model, output quality varies greatly with the quality of the harness. Anthropic's official engineering blog states that even for frontier models, "a high-level prompt alone is insufficient for building production-quality applications," highlighting that harness design is the practical key. A demo can be made to run with brief instructions, but in development tasks spanning hours to days, problems such as "losing context midway," "rewriting unrelated files," and "continuing to run even when tests fail" accumulate, and the final output fails to reach production quality.

The same blog further notes that "it is not yet clear whether a single general-purpose configuration delivers the best performance across all contexts, or whether a configuration with multiple specialized roles achieves higher performance," indicating that harness design is an important and ongoing area of research. The practical implication is clear: "continuously refining the harness to fit your organization's domain, codebase, and operational flow" is the prerequisite for leveraging a general-purpose model at a professional level.

Why "Device Quality" Beats "Instruction Quality"

In human development too, hiring a talented engineer alone does not raise productivity — quality stabilizes only when CI, code review, Lint, and testing infrastructure are in place. The same applies to Claude Code: rather than raising the intelligence of the model itself, setting up verification mechanisms (tests, screenshot diffs), reference documentation (CLAUDE.md, Skills), and record-keeping (commit granularity, progress logs) tends to yield higher ROI in practice.

Source: Anthropic Engineering: Effective harnesses for long-running agents

The 4-Layer Harness Design — Guide / Sensor / Loop / Persistence

The harness becomes easier to organize when thought of in 4 layers: "how to guide in advance (Guide)," "how to detect issues afterward (Sensor)," "what procedure to follow (Loop)," and "what to preserve across sessions (Persistence)." Cross-referencing Anthropic's official explanations with Claude Code's best practices documentation, these 4 layers emerge as a near-universal common language. Below, we look at "what specifically to set up" for each layer.

Layer 1: Guide (Pre-Control) — CLAUDE.md, Skills, Launch Protocol

Pre-control refers to "the context passed to Claude Code before a task begins." The most important element is CLAUDE.md: consolidating the project's background, conventions, pitfalls, and anti-patterns onto a single page eliminates the need to verbally repeat the same premises every time. Skills are reusable components that extract common workflows — keeping frequently recurring flows like "image generation," "article publishing," and "DB migration" in a single file allows them to be automatically loaded at the right moment.

CLAUDE.md: Project-specific knowledge (conventions, domain terminology, absolute rules)
Skills: Reusable workflows (publishing flow, testing policy, naming conventions)
Launch Protocol: Diffs, locks, and in-progress tasks to check at the start of every session
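As a rough illustration of the Launch Protocol item above, the following sketch condenses the output of git status --porcelain into a one-line note that could be reviewed at the start of a session. The helper name summarize_porcelain and the wording of the summary are assumptions made for this example, not something Claude Code or Anthropic prescribes.

```python
from collections import Counter


def summarize_porcelain(porcelain: str) -> str:
    """Summarize `git status --porcelain` output for a session-start check.

    Hypothetical helper: the categories and wording are illustrative only.
    Pass the raw output; stripping it would drop the leading status column.
    """
    counts = Counter()
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue  # too short to be an "XY path" status line
        code = line[:2]
        if code == "??":
            counts["untracked"] += 1
        elif code[0] != " ":
            # First column set: the change is staged (it may also have
            # further unstaged edits, which this sketch does not split out).
            counts["staged"] += 1
        else:
            counts["unstaged"] += 1
    if not counts:
        return "working tree clean"
    return ", ".join(f"{n} {kind}" for kind, n in sorted(counts.items()))
```

In practice the input would come from something like subprocess.run(["git", "status", "--porcelain"], capture_output=True, text=True).stdout, and the resulting line could be written next to the in-progress task notes.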

Layer 2: Sensor (Post-Control) — Tests, Lint, Screenshot Comparison, CI Results

Post-control is the mechanism for mechanically determining "whether what Claude Code wrote is correct." Without this, code that "looks like it works but breaks in production" gets produced at scale. At minimum, it is recommended to run a trio of unit tests, type checking, and Lint — and to add screenshot diffs when touching the UI. By making CI/CD status readable by Claude Code itself, you can build behavior that detects failures and automatically loops back to fix them.

Automated tests: Create a state where "if something breaks, you notice" with unit, integration, or E2E tests
Lint / Type checking: Run tsc --noEmit, ruff, etc. as routinely as npm test
Screenshot comparison: Detect UI diffs with Playwright or similar
CI/CD results: Capture failure logs with gh run view and feed them into automatic re-fixing
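One way to picture the Sensor layer is a small runner that executes each check and gathers the output of the failures, so that the failure text can be handed back as the next correction request. This is a minimal sketch under assumptions: the names CheckResult, run_checks, and failure_report are invented for illustration, and the commands to run are whatever your project already uses.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str


def run_checks(checks: dict[str, list[str]]) -> list[CheckResult]:
    """Run each command; a check passes only if it exits with status 0."""
    results = []
    for name, cmd in checks.items():
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True)
            results.append(
                CheckResult(name, proc.returncode == 0, proc.stdout + proc.stderr)
            )
        except FileNotFoundError:
            # The tool is not installed; treat that as a failed check.
            results.append(CheckResult(name, False, f"command not found: {cmd[0]}"))
    return results


def failure_report(results: list[CheckResult]) -> str:
    """Join the output of failed checks; an empty string means all passed."""
    return "\n".join(
        f"[{r.name}] FAILED\n{r.output.strip()}" for r in results if not r.passed
    )
```

A project using the tools named above might call run_checks({"tests": ["npm", "test"], "types": ["npx", "tsc", "--noEmit"], "lint": ["ruff", "check", "."]}) and treat any non-empty report as "not done yet".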

Layer 3: Loop (Iteration) — Explore→Plan→Code→Commit, Two-Stage Writer/Reviewer

The golden rule for the execution phase is "don't let it write right away." The basic loop recommended by Anthropic has 4 steps, Explore → Plan → Code → Commit: first read the codebase (Explore), write down the change plan (Plan), write code in minimal units (Code), and commit once the change is logically self-contained (Commit). For longer tasks, a two-stage structure separating the writer and reviewer, or a three-stage structure with planner, implementer, and evaluator roles, is considered effective; the latter is covered in detail in the harness-design-long-running-apps article.

Explore → Plan → Code → Commit: The basic 4 steps. Skipping Plan is a recipe for accidents
Two-stage writer and reviewer: Switch roles within the same session for self-review
Three-stage planner / implementer / evaluator: Detect "direction drift" early in long-running tasks
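The ordering constraint in the basic loop can be sketched as a tiny state machine that refuses to enter a phase out of sequence, which makes "skipping Plan" an explicit error instead of a silent habit. The Phase and Workflow names are hypothetical; Claude Code does not expose such a class, and this is only a way to reason about the loop.

```python
from enum import Enum


class Phase(Enum):
    EXPLORE = 1
    PLAN = 2
    CODE = 3
    COMMIT = 4


class Workflow:
    """Tracks one pass through Explore -> Plan -> Code -> Commit."""

    def __init__(self) -> None:
        self.completed: list[Phase] = []

    def advance(self, phase: Phase) -> None:
        # The next allowed phase is determined by how many are already done.
        expected = Phase(len(self.completed) + 1)
        if phase is not expected:
            raise RuntimeError(f"expected {expected.name}, got {phase.name}")
        self.completed.append(phase)
        if phase is Phase.COMMIT:
            # A commit closes the loop; the next unit starts at Explore.
            self.completed.clear()
```

Calling advance(Phase.CODE) right after advance(Phase.EXPLORE) raises an error, which mirrors the rule that a written plan comes before any code.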

Layer 4: Persistence — Git Commits, progress.txt for Memory Across Sessions

This is the mechanism that allows work to continue even when a session ends. Since Claude Code's conversation history is essentially volatile, it is important to write "what is done and what needs to be done next" to an external file. Making fine-grained Git commits is itself a form of persistence, and using progress files like progress.txt or feature-list.json in tandem can drastically reduce the time needed to get back up to speed on resumption.

Git commits: Commit frequently in logically closed units (Plan complete, tests green, etc.)
progress.txt: Current state, next steps, and blockers in a single file
feature-list.json: Break down large features and structure completed vs. remaining tasks
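The two progress files can be kept very simple. The sketch below assumes a layout that the article does not specify: progress.txt with three headed sections, and feature-list.json as a JSON array of objects with "name" and "done" keys. Both layouts, and the function names, are illustrative assumptions.

```python
import json
from pathlib import Path


def write_progress(
    path: Path, state: str, next_steps: list[str], blockers: list[str]
) -> None:
    """Overwrite the progress file with current state, next steps, blockers."""
    lines = ["# Current state", state, "", "# Next steps"]
    lines += [f"- {s}" for s in next_steps]
    lines += ["", "# Blockers"]
    lines += [f"- {b}" for b in blockers] or ["- none"]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def remaining_features(path: Path) -> list[str]:
    """Return names of features not yet marked done (assumed schema)."""
    features = json.loads(path.read_text(encoding="utf-8"))
    return [f["name"] for f in features if not f.get("done", False)]
```

Writing the file at the end of each session and reading remaining_features at the start of the next gives the resuming session a concrete starting point alongside the Git history.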

3 Steps for Beginners — Install → CLAUDE.md → Define Verification Criteria

Harness engineering is a deep field, but the first 3 steps are extremely simple. Complex concepts can wait — if you first have "Claude Code running on your machine," "the project's premises summarized on one page," and "a mechanical pass/fail judgment for each task," you have the minimum viable harness. Conversely, swapping in the latest model without these 3 elements in place will result in stagnant quality.

Step 1: Install Claude Code Locally

Start by running it in your local environment and throwing a few tasks at your own repository. Taking notes on "moments when it behaved differently than expected" and "moments when instructions didn't get through" feeds directly into the next step of building out your CLAUDE.md. Installation steps and prerequisites are consolidated in the official documentation's Best Practices.

Step 2: Write a CLAUDE.md

Write out the project background, coding conventions, files not to touch, and constraints that must always be followed — on a single page (roughly 200–500 lines as a guideline). Don't aim for perfection; making it an operational rule to "write it in CLAUDE.md the second time you give the same correction" lets it grow naturally.
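The "second time" rule can be made mechanical with a small counter. This is only a sketch of the bookkeeping, with a hypothetical CorrectionLog class; matching corrections by normalized text is a simplifying assumption, since real corrections are rarely worded identically.

```python
from collections import Counter


class CorrectionLog:
    """Counts repeated corrections and flags those worth adding to CLAUDE.md."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.counts = Counter()

    def record(self, correction: str) -> bool:
        """Record a correction; return True exactly when it hits the threshold."""
        # Normalize case and whitespace so trivial variations count as the same.
        key = " ".join(correction.lower().split())
        self.counts[key] += 1
        return self.counts[key] == self.threshold
```

When record returns True, that is the cue to write the rule into CLAUDE.md; later repeats return False because the rule should already be there.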

Step 3: Define Verification Criteria Before Making a Request

State the pass/fail criterion upfront, such as "this task is complete when npm test is green" or "this page is complete when the Playwright screenshot matches the existing one." Instructions without verification criteria tend to succeed only by luck, and the harness won't deliver its value. Even in areas without tests, simply providing the execution command or manual confirmation procedure in writing improves stability.
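A lightweight way to enforce this habit is to make the completion criterion a required field of every request. The TaskRequest class and its prompt format below are hypothetical and shown only to illustrate the idea; the request text itself can take any form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRequest:
    goal: str
    done_when: str  # a mechanical check, e.g. a command that must succeed

    def to_prompt(self) -> str:
        """Render the request, refusing to do so without a criterion."""
        if not self.done_when.strip():
            raise ValueError(
                "state a verification criterion before making the request"
            )
        return f"Task: {self.goal}\nDone when: {self.done_when}"
```

For example, TaskRequest("Add input validation to the signup form", "npm test is green").to_prompt() yields a two-line request whose second line is the pass/fail condition.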

Sources (Primary Information)

The primary sources directly referenced in writing this article are listed below. Please always verify the latest accurate information at each link.

Anthropic Engineering: Effective harnesses for long-running agents — Challenges of long-running agents and harness design methodology (published: 2025-11-26)
Anthropic Engineering: Harness design for long-running application development — Generator-evaluator configuration in frontend/full-stack development (published: 2026-03-24)
Claude Code Best Practices (Official Documentation) — Anthropic's official Claude Code best practices guide
