CLAUDE.md Evaluation Checklist | Diagnose Effectiveness with 5 Metrics

"I wrote a CLAUDE.md, but I can't tell if Claude's behavior has actually changed." — This is a question many people encounter after introducing CLAUDE.md. While there is plenty of information about design principles and how to write it, concrete metrics for verifying whether your own CLAUDE.md is actually working are rarely discussed.

This article organizes 5 evaluation metrics for judging CLAUDE.md effectiveness, 4 signs that indicate it is not working, and perspectives for confronting the structural limitations of CLAUDE.md.

Article Summary by AI Chatpowered by Claude
結論powered by Claude

The evaluation criterion for CLAUDE.md is simple: whether Claude's behavior has actually changed. "Written carefully" and "has many lines" are not evaluation metrics.

There are 5 metrics for confirming effectiveness: ① Is the explanation cost per session nearly zero? ② Has the recurrence rate of the same mistakes decreased? ③ Are constraint rules being followed? ④ Does the line-deletion test confirm each line is truly necessary? ⑤ Is the balance between context consumption and quality maintained?

CLAUDE.md also has structural limitations. The 3 points of manual update burden, inability to handle dynamic information, and context pressure cannot be avoided by design. Constraints that require absolute guarantees should be moved to Hooks, and CLAUDE.md should be pruned monthly as "a file where only indispensable lines remain" — this is the most effective long-term approach.

目次 (10)

1. "Written" and "Working" Are Entirely Different Problems

CLAUDE.md is an instruction file that Claude Code automatically reads at the start of each session. However, simply placing the file is not sufficient. The Claude Code official best practices explicitly state: "Treat CLAUDE.md like code — test changes by whether they actually change Claude's behavior."

In other words, the evaluation criterion comes down to a single point: whether Claude's behavior has changed. "Written carefully," "has many lines," and "has a clean structure" are not evaluation metrics.

Furthermore, "whether CLAUDE.md is working" and "whether it is written correctly" are different questions. Even if the writing is correct, it may not be reaching Claude; and even if the writing is rough, behavioral change may still be occurring. Evaluation must always be performed by observing how Claude behaves in actual sessions.


2. Metric ① Is the Explanation Cost Per Session Nearly Zero?

The simplest evaluation metric is whether you can dive into work with zero explanation when opening a new session.

Write out a list of information you were telling Claude every session before introducing CLAUDE.md — for example, "This project uses TypeScript with ESLint for style," "DB connections read from .env.local," or "Don't touch anything under legacy/."

If CLAUDE.md is working, Claude should behave correctly without you explaining these things at the start of each session. Conversely, if you are re-explaining the same things every time you open a new session, either that information is not in CLAUDE.md or the way it is written is not getting through.

Steps to improve:

  1. Write out the information you "supplemented in the prompt" during the last session
  2. Check whether that information is included in CLAUDE.md
  3. If it is not included, add it; if it is included, review how it is written

3. Metric ② Has the Recurrence Rate of the Same Mistakes Decreased?

The highest-ROI section within CLAUDE.md is the Gotchas section covering project-specific pitfalls. If you write mistakes Claude has repeatedly made in the past here, the same mistakes should decrease in the next session.

How to evaluate:

  1. Give Claude an instruction that intentionally recreates a situation where "it got stuck here last time"
  2. Observe whether behavior contrary to what was written in the CLAUDE.md Gotchas section appears
  3. If the mistake was avoided, that line is working. If the same mistake appears, review the wording or emphasis

For example, if you wrote in Gotchas "call the synchronous version of this function — switching to the async version will break the production batch," try explicitly instructing "switch to the async version." The confirmation point is whether Claude communicates the constraint and refuses, or issues a warning.


4. Metric ③ Actual Compliance Rate for Constraint Rules — The IMPORTANT Test

How well "things you absolutely do not want done" are being followed is also an evaluation axis. Claude Code's official documentation states that adding emphasis markers starting with IMPORTANT: or YOU MUST: increases compliance rates.

Evaluation steps:

  1. Write a prohibition in CLAUDE.md, then give Claude an instruction that intentionally tries to induce it to break that prohibition
  2. Check whether Claude communicates the constraint and responds with "I cannot do that" or "That violates the constraints"
  3. If it fails to comply, change how the emphasis markers are applied or how the prohibition is written

There is an important caveat here. CLAUDE.md is advisory and provides no 100% guarantee. For processing that "absolutely must be followed," the correct approach is to move it to Hooks rather than CLAUDE.md. Hooks operate deterministically and guarantee that actions occur. Rules that you evaluate as "CLAUDE.md cannot handle this" should be actively migrated to Hooks.


5. Metric ④ Line-Deletion Test — Working Backward to Identify Truly Necessary Lines

As CLAUDE.md grows, the number of lines naturally increases. However, as the official documentation emphasizes, a bloated CLAUDE.md causes Claude to ignore your actual instructions ("Bloated CLAUDE.md files cause Claude to ignore your actual instructions").

The line-deletion test is an evaluation where you actually try "would Claude make a mistake if I removed this line?"

Steps:

  1. Temporarily remove the line you want to evaluate from CLAUDE.md
  2. Carry out work related to that line in a session
  3. Observe whether Claude behaves correctly

If Claude continues to behave correctly even after deletion, that line is currently not getting through. Style conventions manageable by a linter and language standards Claude already knows can often be deleted without issue. Conversely, if a mistake appears the moment it is deleted, that is definitive proof the line is working.


6. Metric ⑤ Balance Between Context Consumption and Quality

The longer CLAUDE.md becomes, the more it pressures the context window of each session. When the context fills up, Claude begins to forget instructions from the beginning. The felo.ai CLAUDE.md usage guide also points out that "the larger the file, the less context is available for actual work."

Evaluation perspectives:

  • Are CLAUDE.md constraints being followed toward the end of long work sessions?
  • Is Claude's accuracy declining when the context is more than half full?

If there are rules that stop being followed toward the end of a session, it is better to guarantee those rules with Hooks or an external mechanism rather than writing them in CLAUDE.md. Also, if the file exceeds 300 lines and end-of-session accuracy is declining, migrating to a design that splits the file using @import syntax and keeps the main body lean is effective.


7. Four Signs That Your CLAUDE.md Is Not Working

If the following states persist, your CLAUDE.md needs to be reviewed.

  1. Explaining the same things every session — The information is either not in CLAUDE.md or the writing is not getting through
  2. Problems that were previously fixed are recurring — The Gotchas section content is thin or lacks sufficient emphasis
  3. Prohibited behavior is appearing — Change how or where emphasis markers are placed, and consider migrating to Hooks
  4. Rules stop being followed toward the end of sessions — CLAUDE.md is too long and is pressuring the context

All of these are not "CLAUDE.md is not working" — they are "problems that can be improved by changing the writing, structure, or operation of CLAUDE.md." When you find a sign, return to the corresponding evaluation metric and narrow down the cause.


8. Understanding the Structural Limitations of CLAUDE.md

To evaluate accurately, you also need to understand what CLAUDE.md structurally cannot handle. The felo.ai article lists the following 4 limitations:

  • Manual update burden: You must update it yourself every time the project state changes
  • Inability to handle dynamic information: It is not suited for state management such as "how far did we get today" or "what should we do next"
  • Complexity of managing multiple projects: It becomes impossible to track which project's CLAUDE.md is the most current
  • Context pressure: The larger the file, the less context is available for actual work

CLAUDE.md is powerful as "persistent context passed at the start of a session," but dynamic state tracking and deterministic constraint guarantees require combining it with Hooks and external tools. Correctly classifying areas that you evaluate as "CLAUDE.md cannot handle this" and moving them to appropriate means leads to mature operation.


9. Growing CLAUDE.md with a Monthly Evaluation Flow

The Claude Code official documentation states: "Prune CLAUDE.md once a month, and test changes by whether they actually change Claude's behavior." It is important not to treat evaluation as a one-time event, but to incorporate it into a regular monthly process.

Recommended monthly flow:

  1. Ask "would Claude make a mistake if I deleted this?" for every line, and delete lines where the answer is no
  2. Add one line to Gotchas for each mistake Claude repeated during the previous month
  3. Move constraint rules that require 100% guarantees to Hooks
  4. Open a new session with the pruned CLAUDE.md and confirm that the main constraints are all being followed

Executing this flow once a month allows CLAUDE.md to grow without becoming bloated while maintaining signal density. "Add cautiously, delete boldly" is the fundamental stance.


10. Evaluation Checklist Quick Reference

Metric Check Content Action When There Is a Problem
Explanation cost Can you start work with zero explanation in a new session? Add what you explained each session to CLAUDE.md
Mistake recurrence rate Do Gotchas mistakes recur in the next session? Review wording and emphasis (IMPORTANT)
Constraint compliance rate Are prohibitions followed when tested with induced instructions? Add IMPORTANT, or migrate to Hooks
Line validity Can you identify "safe to delete" lines via the deletion test? Prune invalid lines to increase signal density
Context volume Are rules being followed toward the end of sessions? Split with @import or shorten CLAUDE.md
Dynamic information Are you trying to make CLAUDE.md track state? Delegate to Hooks or external tools

CLAUDE.md is not something you write once and are done with — it is a file you evaluate and grow. Incorporating the above metrics into a monthly routine and maintaining "a file where only indispensable lines remain" is the shortest path to a CLAUDE.md that works most effectively over the long term.

参考になったら ♡
Clauder Navi 編集部
@clauder_navi

Anthropic の Claude / Claude Code を中心に、日本のエンジニア向けに最新動向と実務 を毎日発信。 運営方針 は メディアについて をご覧ください。