
AI Agents Self-Correct: What Actually Works (and What's Just a Written Rule)

Claudio Novaglio

I built an AI agent system with 10 self-correction rules. An audit found 6 of them dead: never activated once. The problem wasn't the code, the framework, or the model. It was me, confusing writing a rule with enforcing it.

This error has a name in research: the value-action gap. A Google Research study (arXiv 2602.11328, 2025) on 25 language models measured declared behavioral dispositions against actual behavior. The result: all models self-evaluate as "not impulsive", but behavioral tests show most exhibit impulsive tendencies more than 50% of the time. Models describe who they want to be, not who they are. And anyone building AI agents does the same: they write instructions on how the system should behave, then never verify whether it actually does.

In this article I recount what I learned building an AI agent system that self-corrects through mutations, adversarial tests, and external predators. I explain which feedback loop patterns work with concrete data, which are theater, and how you can build your own self-correction system starting with concrete tools like CLAUDE.md.

Instructions you write for AI agents probably don't work

88% of AI agent projects never reach production, according to a Digital Applied analysis from 2025. The main cause isn't technology: it's instruction design. People building agents write vague rules ("write clean code", "be accurate") and never verify they're followed.

The data on instruction compliance is ruthless. The IFEval++ study (Microsoft and Salesforce, 2025) on 15 language models measured compliance drops of up to 61.8% when instructions are reworded. After 8 successive directives, models start omitting constraints. A model that follows instructions 90% of the time on the first message drops significantly in longer conversations.

Why instructions get lost

The mechanism is documented: it's called instruction attenuation. As context grows, initial instructions lose influence. The IFEval++ study (Microsoft and Salesforce, 2025) shows a compliance decline proportional to context length, with significant losses beyond 50,000 tokens.

The second mechanism is more insidious: ceremonialization. The model follows the rule in form but loses its substance. If the instruction says "verify every claim", the model writes "I verified" without verifying anything. The first instructions to fail are the meta-cognitive ones: "check your work", "make sure you're accurate". They're the most important and the most fragile.

Key finding: meta-cognitive instructions ("verify", "check", "ensure") are the first to fail in multi-turn conversations. If your agent's only quality mechanism is a prompt saying "recheck before answering", you're building on sand.

If you use AI agents for SEO workflows, the problem multiplies: I documented working patterns in the guide on workflow patterns for AI agents applied to SEO.

An AI agent system that evolves: the biological model

The system I built is called Kha'Zix and uses the biological metaphor as a thinking tool, not as a mechanical equivalence. Mutations correspond to behavioral rules. Hunts are real tasks. Predators are audits that verify the rules actually work. The analogy isn't perfect (biological evolution operates on populations over generations, this operates on a single agent over sessions), but it makes visible dynamics that would otherwise stay abstract.

Three components, three functions

  • Mutations (behavioral rules): each rule enters the genome only with evidence it already worked. You don't add rules "because they seem like a good idea". The rule must have changed observable behavior before being codified.
  • Hunts (real tasks): each application of the system to a concrete task generates data: what worked, what didn't, which instincts served and which betrayed. The format is structured to make failures visible, not hide them.
  • Predators (external audits): a separate module analyzes the genome and hunts for dead instructions: rules written that never modified real behavior. The predator isn't gentle. Its job is to destroy what doesn't work.
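
To make the three components concrete, here is a minimal TypeScript sketch of how they could be represented as data. The field names and the evidence-based audit rule are my assumptions for illustration, not the actual Kha'Zix schema.

```typescript
// Illustrative only: the field names below are assumptions, not the actual
// Kha'Zix schema.

type MutationStatus = "provisional" | "partially-alive" | "proven" | "dead";

interface Mutation {
  name: string;                  // e.g. "Closed-Loop Reflex"
  rule: string;                  // the behavioral instruction itself
  bornFrom: string;              // the hunt where the behavior was first observed
  evidence: string[];            // hunt IDs where the rule visibly changed an output
  status: MutationStatus;
}

interface Hunt {
  id: string;
  task: string;                  // the real task the system was applied to
  worked: string[];              // what worked
  failed: string[];              // what didn't, kept visible instead of hidden
  mutationsActivated: string[];  // mutation names that observably fired
}

// The predator: any mutation with no recorded activation across all hunts is dead.
function audit(genome: Mutation[], hunts: Hunt[]): Mutation[] {
  const activated = new Set(hunts.flatMap(h => h.mutationsActivated));
  return genome.map(m =>
    activated.has(m.name) ? m : { ...m, status: "dead" as MutationStatus }
  );
}
```

The point of the shape is that a mutation carries its own evidence: without at least one hunt that activated it, the predator marks it dead.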

This model echoes an idea from Anthropic's Constitutional AI (Bai et al., 2022): a set of written principles guides the model's self-evaluation. The fundamental difference is that Constitutional AI operates at training level: principles are used to generate training data. A CLAUDE.md file operates at inference level: instructions are read every session. They're complementary mechanisms, not alternatives.

Where the names come from (and why a video game helps design AI agents)

The system's names come from League of Legends, a competitive video game. It's not a quirk: each character has a precise role in the game that corresponds to its function in the system.

  • Kha'Zix (the system): in the game it's a Void predator that evolves by hunting other champions and adapting its body after each kill. The mechanic is identical: the system adds mutations only after proving a behavior works in a real task.
  • Vel'Koz (the auditor): in the game it's a creature that deconstructs matter to understand it. Here it analyzes the genome hunting for dead rules: instructions that never changed an observable output.
  • Rengar (the adversaries): in the game it's Kha'Zix's historical rival: a hunter who hunts the predator. Rengar trials test mutations under pressure, trying to make them fail with input designed to trick the system.
  • Bel'Veth (the export): in the game it's the Void empress who absorbs and transforms. In the system, belveth.js exports the DNA (surviving rules) in a format consumable by other tools.

Using a metaphor from a system you know well has a practical advantage: it makes design decisions immediate. When deciding whether a module should "kill" a rule or "soften" it, the metaphor gives you the answer before you have to formalize the reasoning. Vel'Koz doesn't soften. It deconstructs.

10 rules written, 4 survived: what a real audit teaches

The Kha'Zix system went through 9 generations and 8 real tasks. At peak it had 10 mutations in the genome. Then came the audit: a module called Vel'Koz analyzed each mutation hunting for concrete evidence of activation. Result: 6 mutations killed out of 10. Of the 4 that survived: 1 completely proven, 1 partially alive, 1 provisional pending further tests, 1 imposed by the audit itself.

Dead mutations: why they're dead

For each mutation, what it said and why the audit marked it dead:

  • Pain-Anchored Evolution ("every rule must be born from pain"): a meta-mutation, a rule about rules that never generated a single rule.
  • Honest Self-Predation ("demand proof from yourself"): existed while the system gave itself 9/10 scores without proof.
  • Indigestible Prey Sense ("recognize tasks that are too big"): retroactive attribution, it never prevented a real error.
  • Anti-Fossilization ("break rigid processes"): the task format stayed identical for 8 consecutive iterations.
  • Mortality Awareness ("admit when you don't know"): the system never said "I don't know" in any record.
  • Generative Imperative ("create output for others"): output with no audience, void-dna.md was never consumed by anyone.

Recurring pattern: each dead mutation described desirable behavior the system never actually showed. Writing "admit when you don't know" doesn't produce admissions of ignorance. Writing "break rigid processes" breaks nothing if the format stays identical for 8 iterations.

Living mutations and what keeps them alive

Two mutations passed the audit with concrete evidence.

The first, Anti-Imitation Reflex, asks: "is this the right form for this problem, or am I copying a pattern out of habit?" It works on external sources (proven in two separate tasks) but fails on internalized patterns. It's classified as partially alive.

The second, Closed-Loop Reflex, is the only completely proven mutation. It says: every system must sustain itself. In practice: when the system generated an HTML file with hardcoded data, in the same turn it built the script to update that file automatically (sketched below). It even passed an adversarial test under time pressure.

The third living mutation, Writing Is Not Doing, was imposed by the audit itself. It's the most important: a written rule that's never visibly violated is not a rule, it's a wish. This mutation demands structural enforcement, not documentation.

The fourth, Dual-Skeleton Awareness, is classified as provisional. It says: always distinguish between the project and the meta-project (the rules about the project). It showed signs of usefulness but hasn't yet accumulated enough evidence to be considered proven.
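
To illustrate the Closed-Loop Reflex mentioned above, here is a minimal sketch of the pattern, assuming a Node.js environment: instead of hand-editing an HTML report with hardcoded numbers, the system ships the script that regenerates the report from its data source. The file names and data shape are hypothetical.

```typescript
// A minimal, hypothetical example of the Closed-Loop Reflex: the HTML report is
// never edited by hand; this script regenerates it from the data source.
// "genome-stats.json" and the Metrics shape are invented for illustration.
import { readFileSync, writeFileSync } from "node:fs";

interface Metrics {
  generation: number;
  aliveMutations: number;
  deadMutations: number;
}

const metrics: Metrics = JSON.parse(readFileSync("genome-stats.json", "utf8"));

const html = `<!doctype html>
<html><body>
  <h1>Genome report (generation ${metrics.generation})</h1>
  <p>Alive mutations: ${metrics.aliveMutations} / Dead mutations: ${metrics.deadMutations}</p>
</body></html>`;

writeFileSync("report.html", html);
// Re-run after every hunt: the report can never drift from the data that feeds it.
```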

Feedback loops for AI agents: which work and which are theater

Research is clear on one point: LLMs can't reliably self-correct errors on their own. The CorrectBench 2025 study shows self-correction improves results on complex reasoning tasks (about +5% on MATH), but on simple tasks it's inefficient: plain chain-of-thought produces comparable results at 40% lower compute cost.

Patterns that work

  1. Structured external feedback. Don't ask the model "did you do well?". Use a second agent, automated test, or human that verifies output against specific criteria. Anthropic's Constitutional AI works exactly this way: a set of principles guides self-evaluation, and reinforcement learning from AI feedback replaces human labeling for training.
  2. Structural enforcement. If a rule can be violated without anyone noticing, it's not a rule. Implement automated checks: grep for forbidden patterns, tests that fail if the output doesn't respect constraints, quality gates that block the flow (a minimal sketch follows this list). In 88% of failing AI projects (Digital Applied, 2025), nobody had implemented a pre-production quality gate.
  3. Adversarial predation. Test the system with input designed to make it fail. The AI red teaming market was worth $1.43 billion in 2024 and is projected to reach $4.8 billion by 2029 (CAGR 28.6%, industry estimates). The vulnerabilities are concrete: in 2025, exploits like EchoLeak and ForcedLeak (CVSS 9.4) emerged, showing how prompt injection can extract sensitive data from production AI systems.

To understand how prompt injection works and why it matters for anyone building agents, read my analysis on prompt injection and social engineering in LLMs.

  4. Persistent memory across sessions. Instructions fade beyond 50,000 tokens, but a CLAUDE.md file is reread at the start of every session. A documented mitigation technique is re-injecting critical instructions at strategic points in the conversation, reducing the performance decline observed in multi-turn conversations.
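
As a concrete example of pattern 2 (structural enforcement), here is a minimal Node.js/TypeScript sketch of a pre-commit check that blocks a commit when staged files contain forbidden patterns. The patterns and file filter are examples, not a recommendation for your codebase.

```typescript
// Hypothetical pre-commit check: fail the commit if staged TypeScript files
// contain forbidden patterns. The patterns are examples; adapt them to the
// rules you actually want to make structural.
import { execSync } from "node:child_process";
import { readFileSync, existsSync } from "node:fs";

const forbidden: RegExp[] = [/console\.log\(/, /:\s*any\b/]; // example rules

const staged = execSync("git diff --cached --name-only", { encoding: "utf8" })
  .split("\n")
  .filter(f => f.endsWith(".ts") && existsSync(f));

const violations = staged.flatMap(file => {
  const text = readFileSync(file, "utf8");
  return forbidden.filter(p => p.test(text)).map(p => `${file}: matches ${p}`);
});

if (violations.length > 0) {
  console.error("Rule violations:\n" + violations.join("\n"));
  process.exit(1); // the rule is structural: violating it blocks the flow
}
```

The design choice matters more than the tooling: the check runs before the commit lands, so the rule cannot be violated silently.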

Patterns that don't work

  • Self-evaluation without external constraints: asking an LLM "check if you made errors" produces ceremonial responses. The model says "I verified" without verifying. All 25 models in the Google Research study self-evaluate as accurate in the abstract, but their behavior diverges in concrete contexts.
  • Aspirational rules: a mutation "be honest with yourself" is useless if there's no mechanism making dishonesty visible. The 6 dead mutations in Kha'Zix were all aspirational: they described desirable behavior without enforcement.
  • Manager-worker loops in CrewAI: the documentation promises that a "manager" agent sends subpar work back to workers for revision. A Towards Data Science 2025 analysis shows that in production the manager doesn't actually coordinate, tasks execute sequentially, and the feedback loop "seems to fight the framework".

How to build your own self-correction system with CLAUDE.md

CLAUDE.md is Claude Code's persistent instruction file: it's read at session start and maintains context between conversations. Similar files exist for other tools: .cursorrules for Cursor, AGENTS.md as an open standard, GEMINI.md for Google. The principle is the same: persistent, specific, verifiable instructions.

Five rules for instructions that work

  1. Write the rule only after the behavior has already happened. Don't add "always verify sources" because it sounds wise. Add it the second time the system makes a source error. The rule must be born from observation, not aspiration.
  2. Keep the file under 300 lines. Practical experience suggests a concise instruction file works better. Beyond 300 lines, the risk of instruction attenuation grows. If your CLAUDE.md exceeds 300 lines, you're writing for yourself, not for the model.
  3. Be specific, not generic. "Write clean code" wastes tokens. "Use TypeScript strict, avoid any, prefer type over interface for union types" is linter-verifiable.
  4. Implement automated checks for every critical rule. If the rule says "don't use console.log in production", a pre-commit hook grep makes it structural. If a rule can be violated silently, it's a wish.
  5. Audit periodically. In Kha'Zix, a single audit eliminated 60% of rules. The principle: for each rule, find a case where it changed an output. If you don't find one, remove it. A CLAUDE.md full of dead rules is worse than an empty one: it consumes context without producing results.
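
Rule 5 can itself be made structural. The sketch below assumes a hypothetical evidence log in which every observed rule activation is recorded with a "[rule: ...]" tag; any rule in CLAUDE.md with no matching evidence gets flagged as dead. Both file formats are assumptions for illustration.

```typescript
// Hypothetical audit script for rule 5: flag every CLAUDE.md rule with no
// recorded evidence that it ever changed an output. Assumes one rule per
// "- " bullet in CLAUDE.md and an evidence.log where each observed activation
// is tagged "[rule: <rule text>]". Both formats are assumptions.
import { readFileSync } from "node:fs";

const rules = readFileSync("CLAUDE.md", "utf8")
  .split("\n")
  .filter(line => line.startsWith("- "))
  .map(line => line.slice(2).trim());

const evidence = readFileSync("evidence.log", "utf8");

for (const rule of rules) {
  const alive = evidence.includes(`[rule: ${rule}]`);
  console.log(`${alive ? "ALIVE" : "DEAD "} ${rule}`);
}
// Anything reported DEAD goes back on the wish list until it earns its place.
```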

Metrics for measuring if it works

A self-correction system without metrics is just another aspirational rule. Microsoft defines three measurement levels for AI agents in a 2026 post: model performance, system performance, and business impact.

For each metric, what it measures and the target:

  • Tool-call failure rate (errors using tools): target < 3% (industry best practice).
  • Hallucination rate (verifiable false claims): target < 2% for production agents.
  • Cost per task completed (real cost including failures): if you fail 50% of tasks, the real cost doubles.
  • Active rules / total rules (how many instructions actually change behavior): if below 50%, you're writing for yourself.
  • First Contact Resolution (tasks completed on the first attempt): target > 80% for mature systems.
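
Two of these metrics (tool-call failure rate and active rules / total rules) can be computed directly from a task log. The sketch below assumes a hypothetical log format and is only meant to show how little machinery the measurement requires.

```typescript
// Hypothetical task log and two of the metrics above, computed directly from it.
interface TaskRecord {
  toolCalls: number;          // tool invocations during the task
  toolErrors: number;         // invocations that failed
  rulesActivated: string[];   // rules that observably changed an output
}

function toolCallFailureRate(log: TaskRecord[]): number {
  const calls = log.reduce((n, t) => n + t.toolCalls, 0);
  const errors = log.reduce((n, t) => n + t.toolErrors, 0);
  return calls === 0 ? 0 : errors / calls;            // target: below 0.03
}

function activeRuleRatio(log: TaskRecord[], totalRules: number): number {
  const active = new Set(log.flatMap(t => t.rulesActivated)).size;
  return totalRules === 0 ? 0 : active / totalRules;  // below 0.5: writing for yourself
}
```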

AI agents in Italy: opportunities and real barriers

The Italian AI market is worth €1.8 billion in 2025, growing 50% year-over-year (Junto Space data). 46% of this market is generated by generative AI solutions. Adoption among companies with over 10 employees rose from 5% in 2023 to 16.4% in 2025 (Minsait).

Among SMBs the picture is different. Only 18% use AI tools in any form (9% paid, 9% free), according to the 2025 OECD report. 71% of large companies have started at least one AI project, but this hides a problem: Gartner estimates over 40% of agentic AI projects will be canceled by 2027 for reliability issues.

The main barrier isn't technological: only 45.8% of Italians have basic digital skills (EU average: 55.5%, Minsait 2025 data). To build AI agents that work, you need to understand what the agent should do, monitor its output, and redesign the instructions when it fails. Without these skills, even the best framework produces dead rules.

For a complete picture of tools and real costs of AI for Italian SMBs, read the guide on AI marketing for SMBs in Italy.

Limitations of this approach

Kha'Zix is a case study on a single project, not a scientific benchmark. The 8 hunts happened over about 3 hours, all on the same day, without external pressure, without end users, and without irreversible consequences. The audit itself noted: this is compressed experimentation, not evolution under pressure.

  • The biological model is a useful metaphor for communication, but it has no scientific rigor. Biological evolution operates on populations over generations; this operates on a single agent over a session.
  • All "living" mutations were validated by the same system that created them. There's a selection-bias risk: the "surviving" rules are the ones the author defined as surviving. Independent external validation with different models and third-party evaluators is missing.
  • The system never operated in production with real users, binding deadlines, or economic consequences of failure.
  • The genome was developed on a single task type (software development and agents). Mutations might not generalize to different domains like customer support, data analysis, or content generation.
  • No published comparative studies exist comparing CLAUDE.md performance against .cursorrules or AGENTS.md. The recommendations here are based on community best practice, not experimental data.

For those with enterprise budget and needs, frameworks like LangGraph offer native checkpointing and state persistence that CLAUDE.md can't replace. The approach described here works for individual professionals and small teams wanting to improve their AI agent quality without complex infrastructure.

If you already use multi-agent systems for content generation, the principle is the same: every agent needs structured feedback. I described the complete architecture in the guide on multi-agent system for SEO article generation.

The most important lesson isn't technical: it's cognitive. Writing a rule for an AI agent gives the same satisfaction as implementing it. But the rule doesn't exist until it changes observable behavior. If your CLAUDE.md has 50 rules and you've never verified how many work, you have 50 wishes, not 50 instructions.

The starting point is simple: take your current instructions and, for each rule, find a case where it actually changed an output. Delete those without evidence. What's left is your real genome.

If you want to understand how to apply these principles to your content and SEO workflows, starting from your current AI tool situation, write me from the contact page. We'll analyze together which rules work and which are just written.

Frequently Asked Questions

Can AI agents reliably self-correct their own errors?

Not reliably. The CorrectBench 2025 study shows self-correction helps on complex reasoning (+5% on MATH) but is inefficient on simple tasks. All 25 models in Google Research's study claim to be "not impulsive", but behavioral tests show most exhibit impulsive patterns more than 50% of the time. Models describe who they want to be, not who they are.

What's the difference between a rule and a wish?

A rule changes observable behavior. A wish is a written instruction that sounds good but isn't enforced. If you write "verify every claim" and the system says "I verified" without verifying, it's a wish. A rule either has structural enforcement (grep, automated tests, gates) or it's aspirational.

Why do AI agents stop following instructions in long conversations?

Instruction attenuation: as context grows, initial instructions lose influence. The IFEval++ study (Microsoft and Salesforce, 2025) shows significant compliance drops beyond 50,000 tokens. Meta-cognitive instructions ("check your work", "verify") fail first: they're the most important and the most fragile.

Which feedback loop patterns actually work?

Four patterns work: structured external feedback (not self-evaluation), structural enforcement (automated checks), adversarial testing (input designed to make the system fail), and persistent memory (re-injecting instructions at strategic points). Patterns that don't work: self-evaluation without constraints, aspirational rules, and manager-worker loops in CrewAI.

How long should a CLAUDE.md file be?

Under 300 lines. Experience shows concise instruction files work better. Beyond 300 lines, the risk of attenuation and non-compliance grows. If your file exceeds 300 lines, you're writing for yourself, not the model. Every rule should be removable if you can't find evidence it changed actual behavior.

How many AI agent projects reach production?

12% (88% don't, according to Digital Applied 2025). The main cause isn't technology: it's instruction design and the lack of quality enforcement. 88% of failing projects had no pre-production quality gate. The barrier is cognitive: confusing writing a rule with enforcing it.

About the author

Claudio Novaglio

SEO Specialist, AI Specialist and Data Analyst with over 10 years of experience in digital marketing. I work with companies and professionals in Brescia and across Italy to increase organic visibility, optimize advertising campaigns, and build data-driven measurement systems. Specialized in technical SEO, local SEO, Google Analytics 4, and integrating artificial intelligence into marketing processes.
