Claude Code Skills 2.0: Evaluations, Benchmarks, and Triggering Optimization

Claudio Novaglio
10 min read

Claude Code skills evolved on March 3, 2026. Anthropic introduced automated evaluations, benchmark mode, and triggering optimization—three features that transform skills from static documents into testable, measurable systems.

In a previous article I explained what skills are and how to build them for SEO. This article goes deeper: not "how to create a skill" but "how to test, measure, and optimize the skills you already have".

If you use Claude Code for repetitive SEO tasks—audits, meta tag generation, content review—what you'll read here changes how you work with skills. It's no longer trial and error: it's a structured process of continuous improvement.

Two types of skills: capability uplift vs encoded preference

Before discussing testing, there's a fundamental distinction. Not all skills do the same thing, and understanding the type of skill you have determines how you test and improve it.

Capability uplift: the skill teaches something new

These skills enable Claude to do things the base model doesn't handle consistently. They encode specific techniques that produce superior output compared to standard prompting.

In an SEO context, my Nano Banana skill for image generation is a perfect example. The base model can generate images, but without the skill it doesn't know my site's brand guidelines, doesn't know which API model to use, doesn't apply the visual preamble. The skill adds a capability the model doesn't have by default.

  • SEO audit skill: the model can analyze data, but doesn't know my specific thresholds (title >60 chars = error, not warning).
  • Image generation skill: the model can call APIs, but doesn't know my color palette, prompt architecture, and brand-specific guidelines.
  • Content review skill: the model can evaluate text, but doesn't know my custom E-E-A-T criteria and quality thresholds.

Encoded preference: the skill encodes a process

These skills don't teach new capabilities—they encode workflows where Claude can already do each step, but the skill sequences them according to your specific process.

My weekly SEO report skill is an example. Claude can extract data from DataForSEO, can compare metrics, can format tables. But the skill defines: which KPIs to look at first, how to structure the week-on-week comparison, which report format to use, and who the recommendations are aimed at. It's my process, codified.

Why the distinction matters

| Aspect | Capability Uplift | Encoded Preference |
| --- | --- | --- |
| What it does | Teaches a new technique | Encodes an existing process |
| Without the skill | The model can't do it well | The model can do each step individually |
| Testing focus | Does the technique produce better output? | Is the process followed in the right order? |
| Obsolescence risk | High: the model may learn the technique | Low: the process is yours, not the model's |
| SEO example | Audit with custom thresholds | Weekly report with specific format |

This distinction becomes critical when we introduce evaluations: tests for a capability skill verify output quality, those for a preference skill verify process conformance.
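
To make the testing focus concrete, here's a minimal sketch in plain Python (my own illustration, not an official Claude Code API) of what the two kinds of checks look like: the capability check inspects the output, the preference check inspects the order of steps.

```python
# Illustrative only: plain Python checks, not an official Claude Code API.
# `output` is the text Claude produced; `steps` is the ordered list of
# actions it took while executing the task.

def capability_check(output: str) -> bool:
    """Capability uplift: does the OUTPUT respect the custom threshold?"""
    # A 65-char title must be classified as an error, never a warning.
    return "error" in output.lower() and "warning: title" not in output.lower()

def preference_check(steps: list[str]) -> bool:
    """Encoded preference: was the PROCESS followed in the right order?"""
    expected = ["extract_kpis", "week_on_week_comparison", "format_report"]
    return steps[:len(expected)] == expected
```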

Evaluations: automated testing for skills

What evaluations are

Evaluations are automated tests that verify Claude behaves as expected when a skill is active. They work like software tests: you define an input, describe the expected output, and the system verifies if the skill produces it.

The difference from manual testing I described in my previous article is substantial: evaluations are repeatable, automatable, and quantifiable. You don't have to compare outputs by eye—the system does it for you.

How to structure an SEO evaluation

An evaluation has three components (a code sketch follows the list).

  1. Input prompt: the request you'd normally make to Claude. For an SEO audit: "Analyze this crawl and identify critical issues".
  2. Expected output description: not the exact output, but the characteristics it must have. "The report must classify issues into 4 severity levels, use the thresholds defined in the skill, include specific URLs for each issue".
  3. Pass/fail criteria: binary conditions. "A title of 65 characters MUST be classified as an error, not a warning". "The report MUST include an executive summary section". "Images >200 KB MUST be flagged".
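
Anthropic hasn't published a schema I can reproduce here, so here's a hypothetical Python sketch of how the three components map onto a single test object; `SkillEvaluation` and all field names are my own inventions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SkillEvaluation:
    """Hypothetical structure: input + expected characteristics + binary checks."""
    input_prompt: str                  # the request you'd normally make
    expected_output: str               # characteristics, not exact text
    pass_criteria: list[Callable[[str], bool]] = field(default_factory=list)

    def run(self, output: str) -> bool:
        # The evaluation passes only if every binary criterion holds.
        return all(check(output) for check in self.pass_criteria)

# Example: the title-threshold evaluation from my audit skill
title_eval = SkillEvaluation(
    input_prompt="Analyze this crawl and identify critical issues",
    expected_output="A 65-char title is classified as an error, with its URL",
    pass_criteria=[
        lambda out: "error" in out.lower(),               # MUST be an error...
        lambda out: "warning: title" not in out.lower(),  # ...never a warning
    ],
)
```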

Two fundamental uses of evaluations

Catching quality regression: AI models evolve. A base model update can change how Claude interprets your skill. A skill that worked perfectly last month might behave differently after an update. Evaluations catch these regressions automatically—you run them after every update and see immediately if something changed.

Understanding model progress: here's the counterintuitive part. If your evaluations pass WITHOUT the skill loaded, it means the base model has learned the techniques encoded in your skill. In that case, the skill might not be necessary anymore—or it might only be needed for the parts the model hasn't yet incorporated.
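
Both uses come down to running the same suite twice, with and without the skill. A sketch, reusing the `SkillEvaluation` object above; `run_task` is a hypothetical stub standing in for an actual Claude Code invocation.

```python
def run_task(prompt: str, skill: str | None = None) -> str:
    """Hypothetical stub: in practice this would execute the prompt in
    Claude Code with (or without) the given skill loaded."""
    raise NotImplementedError

def regression_and_progress_check(evals: list[SkillEvaluation], skill: str) -> None:
    with_skill = all(e.run(run_task(e.input_prompt, skill=skill)) for e in evals)
    without_skill = all(e.run(run_task(e.input_prompt, skill=None)) for e in evals)

    if not with_skill:
        print("Regression: the skill no longer produces conforming output.")
    if without_skill:
        print("Model progress: evals pass without the skill; consider retiring it.")
```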

Practical example: evaluations for the SEO audit skill

Here's how I structured evaluations for my audit skill.

| Evaluation | Input | Pass criterion | What it tests |
| --- | --- | --- | --- |
| Title threshold | Crawl with 65-char title | Classified as "error" | Respects >60 chars = error rule |
| Description threshold | Crawl with 85-char description | Classified as "warning" | Respects 80-119 chars = warning rule |
| Executive summary | Crawl with 15 mixed issues | Report has exec summary section | Output format respected |
| Correct priority | Crawl with 5xx + missing titles | 5xx listed before titles | Severity classification correct |
| Zero issues | Perfect crawl with no errors | Report says "no critical issues" | Edge case handling |

Each evaluation is independent and tests one specific aspect of the skill. When you run them all together, you get a complete picture of whether the skill works as intended across every critical dimension.

Benchmark mode: measuring skill performance

Evaluations tell you if the skill works. Benchmark mode tells you how well it works, and at what cost in time and resources.

What the benchmark measures

Pass rate: percentage of evaluations passed. A skill with 95% pass rate is reliable. One with 70% has problems to investigate. Below 50%, the skill likely has ambiguous or conflicting rules.

Execution time: how long it takes Claude to complete the task with the skill active. Useful for comparing different versions of the same skill—a more concise skill might be faster without losing quality.

Token usage: how many tokens the task consumes with the skill. A skill that's too long or verbose can increase consumption without improving results. The benchmark quantifies this cost.
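
The three metrics fit naturally into one result object. A minimal sketch of how I track them (my own structure, not benchmark mode's actual output format):

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    passed: int      # evaluations passed
    total: int       # evaluations run
    seconds: float   # wall-clock execution time
    tokens: int      # tokens consumed during the run

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0

baseline = BenchmarkResult(passed=19, total=20, seconds=240.0, tokens=3500)
print(f"pass rate: {baseline.pass_rate:.0%}")  # pass rate: 95%
```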

How I use benchmark in my workflow

I run a benchmark in three situations.

  1. After creating a new skill: I establish the baseline. Pass rate, time, tokens. These are my reference numbers.
  2. After modifying a skill: I compare against the previous benchmark. If pass rate went up and token usage is stable, the change is positive. If pass rate stayed the same but tokens increased, the change adds cost without benefit (see the sketch after this list).
  3. After a model update: I verify performance is stable. If pass rate drops, I need to understand what changed and adapt the skill.
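
The decision logic in point 2 is mechanical enough to write down. A sketch, reusing the `BenchmarkResult` structure above:

```python
def compare_to_baseline(baseline: BenchmarkResult, current: BenchmarkResult) -> str:
    pass_delta = current.pass_rate - baseline.pass_rate
    token_delta = current.tokens - baseline.tokens

    if pass_delta < 0:
        return "Regression: investigate what changed and adapt the skill."
    if pass_delta > 0 and token_delta <= 0:
        return "Positive change: quality up, token cost stable or lower."
    if pass_delta == 0 and token_delta > 0:
        return "Cost without benefit: same pass rate, more tokens."
    return "Mixed result: weigh the quality gain against the extra tokens."
```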

Comparative benchmark: skill vs no-skill

The most revealing benchmark is the direct comparison: same task executed with and without the skill active.

| Metric | Without skill | With skill | Delta |
| --- | --- | --- | --- |
| Thresholds respected | 60% | 97% | +37 points |
| Report format correct | 40% | 95% | +55 points |
| Execution time | ~3 min | ~4 min | +1 min |
| Tokens consumed | ~2,000 | ~3,500 | +75% |

The numbers speak clearly: the skill costs a bit more in time and tokens, but the quality improvement (thresholds, format) is huge. This is the kind of data that justifies the investment in creating and maintaining skills.

Comparator agents: A/B testing for skills

Benchmark compares skill vs no-skill. Comparator agents do something more powerful: they compare two different versions of the same skill.

How A/B testing works

Two independent agents execute the same task in isolated contexts: one with version A of the skill, the other with version B. No contamination between the two—each agent has its own clean context.

A third agent (the judge) compares the two outputs against the evaluation criteria and determines which version produces better results. The judgment isn't blind (the judge knows which version is which), but it evaluates against objective criteria, not preferences.
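
Conceptually the flow looks like the sketch below. `run_task` is the hypothetical stub from earlier, and reducing the judge to a count of passed criteria is a deliberate simplification of what the third agent actually does.

```python
def ab_compare(evals: list[SkillEvaluation], prompt: str,
               skill_a: str, skill_b: str) -> str:
    # Each version runs in its own isolated context: no shared state,
    # no contamination between the two runs.
    output_a = run_task(prompt, skill=skill_a)
    output_b = run_task(prompt, skill=skill_b)

    # Judge step, simplified here to counting passed criteria; in the real
    # feature a third agent compares outputs against the evaluation criteria.
    score_a = sum(e.run(output_a) for e in evals)
    score_b = sum(e.run(output_b) for e in evals)

    if score_a == score_b:
        return "tie: the versions are equivalent on these criteria"
    return "version A" if score_a > score_b else "version B"
```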

When I use comparators in an SEO context

  • When I rephrase the thresholds in my audit skill and want to verify the change actually improves results, not just modifies them.
  • When I simplify a skill (reducing length and complexity) and want to confirm the shorter version produces equivalent results to the longer version.
  • When I test a new section of the skill—I add it in version B and compare with version A without the section.

The value of isolation

The critical point is isolation. Without comparator agents, testing two versions means running a task, changing the skill, running again, and comparing from memory. The context from the first run contaminates the second. With parallel isolated agents, the comparison is clean: same conditions, only the skill differs.

Triggering optimization: the description problem

In my previous article on skills I discussed the "trigger, don't summarize" principle: the description in the frontmatter should describe when to use the skill, not what it does. Anthropic has now automated this optimization.

The problem scales with skill count

When you have 2-3 skills, triggering works well even with imperfect descriptions. But when you have 10, 15, 20, the precision of the description becomes critical.

  • Description too broad: the skill activates when it shouldn't (false positive). Claude loads irrelevant instructions that confuse the context.
  • Description too narrow: the skill doesn't activate when it should (false negative). Claude works without the rules you've defined.

With an ecosystem of SEO skills (audit, fix, content, meta generation, images, reports), overlap is real. A request like "improve the titles on main pages" could activate the audit skill, the meta generation skill, or the content review skill. The description must be precise enough to activate only the right one.

How automatic optimization works

The skill-creator analyzes the current description alongside sample prompts and suggests improvements. In practice: you give it examples of when the skill should activate and when it shouldn't, and the tool rewrites the description to maximize match precision.

In tests Anthropic published on six public document creation skills, the optimization improved triggering on five of them. The typical improvement: fewer false positives (the skill doesn't activate when it shouldn't) while maintaining the same true positives (the skill always activates when it should).

How I apply it to SEO skills

For each skill in my ecosystem, I define two sets of prompts.

Positive prompts: requests that MUST activate the skill. For the audit skill: "analyze the crawl", "what technical issues does this site have", "do an SEO audit", "check the site health".

Negative prompts: requests that MUST NOT activate the skill. For the audit skill: "write a meta title for this page" (→ meta generation), "generate a cover image" (→ nano banana), "how's traffic this month" (→ report).

The optimizer uses these sets to refine the description, finding the keywords and phrasing that maximize separation between different skills.
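
To make the mechanics tangible, here's an illustrative scoring loop over the two prompt sets. `would_trigger` is a crude keyword-overlap stand-in for the real model-side matching, which I have no visibility into; the point is only how positives and negatives pull the score in opposite directions.

```python
def would_trigger(description: str, prompt: str) -> bool:
    """Hypothetical stand-in for the model's matching logic: a description
    'matches' if any of its comma-separated phrases appears in the prompt."""
    return any(phrase.strip() in prompt.lower()
               for phrase in description.lower().split(","))

positive = ["analyze the crawl", "what technical issues does this site have",
            "do an SEO audit", "check the site health"]
negative = ["write a meta title for this page", "generate a cover image",
            "how's traffic this month"]

def trigger_score(description: str) -> float:
    true_pos = sum(would_trigger(description, p) for p in positive)
    false_pos = sum(would_trigger(description, n) for n in negative)
    # Reward activating on every positive prompt, penalize each false positive.
    return true_pos / len(positive) - false_pos / len(negative)

print(trigger_score("audit, crawl, technical issues, site health"))  # 1.0
```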

The future: when skills become specifications

A fascinating aspect from Anthropic's blog: as models improve, the line between "skill" and "specification" may blur.

Today, a skill's SKILL.md contains both the what (goals, criteria) and the how (operational steps, techniques, workflows). It's an implementation plan that tells the model exactly how to proceed.

In the future, it might suffice to describe the what (quality criteria, thresholds, output format) and let the model autonomously determine the how. Evaluations would describe the desired behavior, and the model would find the best path to get there.

For SEO this would mean: defining audit thresholds, quality criteria for content, and constraints on meta tags, without having to explain step-by-step how to do the analysis. The skill becomes a contract of results, not an operational manual.

For now we're in the intermediate phase: the how still matters, but the what matters increasingly. And evaluations are exactly the tool to define the what in a verifiable way.

From creation to systematic maintenance

Creating a skill is the first step. Testing it with evaluations, measuring it with benchmarks, optimizing it with comparator agents and refining triggering—this is the complete cycle.

The mindset shift is significant: skills are no longer documents you write once and forget. They're living systems that require maintenance, regression testing and continuous optimization. Exactly like software code—because, in essence, skills are code.

For anyone working in SEO with Claude Code, this means your skill ecosystem can be treated with the same discipline as your site's code: versioned, tested, measured, iteratively improved. And that discipline is what separates occasional AI usage from a reliable, scalable system of work.

If you want to start from the basics, read my article on how to create skills for Claude Code in SEO.

To understand how workflow patterns orchestrate skills in complex tasks, read Workflow Patterns for AI Agents Applied to SEO.

To discuss how to optimize your SEO skill ecosystem, reach out for a consultation. I help professionals and companies build measurable, scalable AI systems.

Frequently Asked Questions

What are evaluations for Claude Code skills?

Evaluations are automated tests that verify Claude behaves as expected when a skill is active. You define an input prompt, the expected output characteristics, and binary pass/fail criteria. The system runs the test and reports whether the skill produces conforming results. They work like unit tests in software.

What's the difference between capability uplift and encoded preference skills?

Capability uplift skills teach Claude something the base model doesn't handle well (e.g., audits with custom thresholds, image generation with brand guidelines). Encoded preference skills codify a process Claude can already do step-by-step, but the skill defines the order and specific format of your workflow (e.g., weekly report with fixed structure).

What is benchmark mode?

Benchmark mode is a standardized evaluation tool that measures three metrics: pass rate of evaluations, execution time, and token consumption. It's used to compare different versions of the same skill, monitor model update impact, and quantify the cost-benefit ratio of a skill.

How do comparator agents work?

Two independent agents execute the same task in completely isolated contexts: one with skill version A, the other with version B. A third agent judges the results against objective criteria. It's the equivalent of A/B testing for skills: it lets you verify if a change actually improves results.

Why does triggering optimization matter?

With 2-3 skills, triggering works even with imperfect descriptions. But with 10+ skills, a description that's too broad causes false positives (skill activates when it shouldn't) and one that's too narrow causes false negatives (doesn't activate when needed). Automatic optimization analyzes positive and negative prompts to rewrite the description with maximum precision.

When should you run a benchmark?

In three situations: after creating a new skill (to establish a baseline), after modifying a skill (to verify the change is positive), and after a model update (to catch regressions). For skills you use daily, weekly benchmarking is good practice.

Do skills become obsolete as models improve?

Capability uplift skills carry obsolescence risk: if the base model learns the techniques your skill encodes, it becomes unnecessary. Evaluations will show this—if they pass without the skill loaded, you can retire it. Encoded preference skills (which codify your specific process) have low risk: the model can't learn your personal preferences.

About the author

Claudio Novaglio

SEO Specialist, AI Specialist, and Data Analyst with over 10 years of experience in digital marketing. I work with companies and professionals in Brescia and across Italy to increase organic visibility, optimize advertising campaigns, and build data-driven measurement systems. Specialized in technical SEO, local SEO, Google Analytics 4, and the integration of artificial intelligence into marketing processes.
