
February 2026

Designing Skills That Evolve: From Programs to Training Loops

There are two ways to think about AI skills.

The first treats a skill like a program. You write instructions, the agent executes them, you manually fix what doesn't work. Anthropic's Skill Creator guide is an excellent resource for this approach — it covers directory structure, progressive disclosure, context management, and packaging. The skill format it established is now supported across Claude Code, Codex, Gemini CLI, and others. If you haven't read it, start there.

The second treats a skill like a model being trained. Each execution is a forward pass. The skill's goal measures the gap between actual and desired output — the loss. A reflection step analyzes what went wrong and updates the skill's knowledge and orchestration — the backward pass. The next execution is better.

Building on Anthropic's foundation, we'll walk through each component of a skill designed to improve through its own execution — using a stock analysis workflow as a running example — and show why each one exists.

| Skill Component | Programming Analogy | What Changes |
| --- | --- | --- |
| Input / Output | Params / Return values | Almost nothing |
| Knowledge | Config + domain data | Must be separated from logic |
| Secrets | Environment variables | Almost nothing |
| Orchestration | Algorithm / control flow | Discovered, not designed |
| Workspace | Working directory | Feeds the evolution loop |
| Goal | Tests | Fuzzy, proportional, sometimes delayed |
[Figure: the six components of a skill. Input, Knowledge, and Secrets feed into Orchestration, which produces Output and Workspace; Workspace feeds back into Knowledge through the evolution loop; Goal verifies Output.]
Anatomy of a skill. Amber = reshaped by the intelligent executor. Green = sub-skills called during orchestration.

The “What Changes” column is the story of this article. Let's go through it.

Input & Output: The Parts That Don't Change

Input is what the user wants. Output is what gets delivered back. These concepts are identical whether the executor is a CPU or an LLM, because they describe the caller's intent, not the executor's nature.

```yaml
inputs:
  - name: stock_codes
    description: Stocks to analyze
    example: "NVDA, TSLA, META"
  - name: investment_horizon
    description: Short-term trading or long-term investing
    default: long-term

outputs:
  - name: summary_report
    description: Rankings and investment recommendations
    path: analysis/YYYYMMDD/SUMMARY.md
```

If you stopped here, skills and programs would be interchangeable. The interesting part starts when you look at what happens between input and output. But note: the stability of I/O is itself a design choice. Input and output form the fixed contract — the interface that doesn't change while everything inside evolves. Like a function signature in a library: the implementation improves, the API stays.
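The stable-contract idea can be made concrete as a typed function signature. A minimal Python sketch; the `TypedDict` names and the placeholder date are illustrative, not part of any skill format:

```python
from typing import TypedDict

class AnalyzeStocksInput(TypedDict):
    stock_codes: list[str]      # e.g. ["NVDA", "TSLA", "META"]
    investment_horizon: str     # "short-term" or "long-term"

class AnalyzeStocksOutput(TypedDict):
    summary_report: str         # path to the generated SUMMARY.md

def analyze_stocks(params: AnalyzeStocksInput) -> AnalyzeStocksOutput:
    """The signature is the fixed contract; the body is free to evolve.

    v1, v2, v3 internals change across reflection cycles,
    but callers only ever see this interface.
    """
    run_date = "20260207"  # placeholder for a real run date
    return {"summary_report": f"analysis/{run_date}/SUMMARY.md"}
```

The caller's view never changes, which is exactly what lets everything behind it change freely.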

Knowledge: Why It Must Be a First-Class Citizen

Every non-trivial skill needs domain knowledge to do its job well. A stock analysis skill needs scoring rules. An SEO skill needs keyword strategies. A deployment skill needs environment-specific configs.

Programs have this too — they just call it configuration, lookup tables, or domain data. But there's a critical difference: in a program, code and data are syntactically distinct. You can always tell if (pe < 15) apart from the number 15. In a skill, both the logic and the knowledge are written in natural language. If you put them in the same file, they become indistinguishable.

Consider this excerpt from a skill:

```markdown
### Step 2: Score each stock
For each stock, check: if P/E forward is less than 15,
add 2 points. If ROE is above 15%, add 2 points.
If revenue growth is negative, subtract 1 point.
Then sort by total score and present rankings.
```

Which part is the procedure (“check each stock, add points, sort”) and which part is the knowledge (“P/E < 15 = good, ROE > 15% = good”)? They're fused together. When you need to update the scoring thresholds, you're editing the same paragraph that describes the procedure. When you change the procedure, you risk accidentally changing a threshold.

The fix is structural: put knowledge in separate files.

```
analyze-stocks/
├── SKILL.md                      # Procedure: what to do
├── references/
│   ├── scoring-rules.yaml        # Knowledge: scoring thresholds
│   └── sector-mapping.yaml       # Knowledge: industry → stocks
└── scripts/
    └── get_stock_data.py
```

The SKILL.md references knowledge without inlining it:

```markdown
### Step 2: Score [MUST]
> **Knowledge**: $SKILL_DIR/references/scoring-rules.yaml
> Use the rules in this file. Do not invent your own.
```

Now they evolve independently. Update thresholds without touching the procedure. Change the procedure without accidentally breaking a rule. This separation is trivial in programs (it's just a config file). In skills, it requires conscious discipline, because natural language makes it easy to blur everything together.
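To make the payoff concrete, here is a minimal Python sketch of the scoring step with its thresholds held as data. The rule names are illustrative; in the actual skill the dict would be parsed from references/scoring-rules.yaml rather than defined inline:

```python
# Contents of scoring-rules.yaml, inlined for a self-contained sketch.
# In the real skill: rules = yaml.safe_load(open("references/scoring-rules.yaml"))
SCORING_RULES = {
    "pe_forward_max": 15,          # P/E below this earns points
    "pe_points": 2,
    "roe_min": 15,                 # ROE (%) above this earns points
    "roe_points": 2,
    "negative_growth_penalty": -1,
}

def score_stock(metrics: dict, rules: dict = SCORING_RULES) -> int:
    """Procedure: check metrics, add points. Thresholds live in the rules."""
    score = 0
    if metrics["pe_forward"] < rules["pe_forward_max"]:
        score += rules["pe_points"]
    if metrics["roe"] > rules["roe_min"]:
        score += rules["roe_points"]
    if metrics["revenue_growth"] < 0:
        score += rules["negative_growth_penalty"]
    return score
```

Tuning a threshold now means editing one number in the rules, with zero risk of disturbing the procedure.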

This separation also has a deeper function: it makes the skill amenable to reflection. When /base-reflect analyzes a run and finds that growth stocks are being underrated, it can surgically update scoring-rules.yaml without touching the orchestration in SKILL.md. If knowledge and logic were fused, every improvement would risk breaking the procedure. Separate files are like separate parameter groups in a model — they let the learning signal reach the right place without collateral damage.

The litmus test for knowledge

Not everything that looks like information is knowledge. The test: does this make the skill execute better?

| Data | Category | Reasoning |
| --- | --- | --- |
| “Analyze NVDA and TSLA” | Input | User's intent for this run |
| P/E < 15 → +2 points | Knowledge | Makes scoring more accurate |
| AI news → NVDA, MSFT, GOOGL | Knowledge | Makes news mapping more complete |
| NVDA scored 12, ★★★★★ | Output | Delivered to user, doesn't help the skill |

If you can't articulate how a piece of information improves the skill's execution, it's not knowledge — it's either input, output, or noise.

```yaml
knowledge:
  - file: references/scoring-rules.yaml
    purpose: Makes scoring step produce more accurate results
    used_by: [scoring]
  - file: references/sector-mapping.yaml
    purpose: Makes news mapping cover the right companies
    used_by: [news-mapping]
```

Progressive disclosure: managing knowledge at scale

Separating knowledge from logic creates a new problem: a skill might accumulate dozens of reference files. Loading them all into context every time wastes the context window — which, as Anthropic's Skill Creator guide points out, is a shared resource.

The solution is a three-level loading system:

| Level | What Loads | When |
| --- | --- | --- |
| Metadata | Name + description (~100 words) | Always in context |
| SKILL.md body | Orchestration + goals (<5k words) | When the skill triggers |
| References | Knowledge files (unlimited) | On demand, as needed by the agent |

The SKILL.md references knowledge files and describes when to load them — but doesn't inline their contents. When a user asks about sales metrics, the agent reads references/sales.md without loading references/finance.md or references/product.md.

This matters for evolution too. A skill with well-organized reference files can update one knowledge domain (new scoring rules) without re-loading unrelated ones (sector mappings). Progressive disclosure isn't just about efficiency — it's about making knowledge independently addressable, which is a prerequisite for independent evolution.
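The three-level loading scheme can be sketched in a few lines of Python. The class and method names here are hypothetical, not part of the skill format; the point is that each level enters context only when needed:

```python
from pathlib import Path

class SkillContext:
    """Three-level loading: metadata always, SKILL.md on trigger, references on demand."""

    def __init__(self, skill_dir: str):
        self.dir = Path(skill_dir)
        # Level 1: metadata is tiny and always resident in context.
        self.metadata = {"name": "analyze-stocks",
                         "description": "Score and rank stocks by fundamentals"}
        self._refs: dict[str, str] = {}  # Level 3 cache, filled lazily

    def load_body(self, read=lambda p: p.read_text()) -> str:
        # Level 2: loaded only when the skill actually triggers.
        return read(self.dir / "SKILL.md")

    def reference(self, name: str, read=lambda p: p.read_text()) -> str:
        # Level 3: a knowledge file enters context only when a step needs it,
        # so unrelated references never consume the window.
        if name not in self._refs:
            self._refs[name] = read(self.dir / "references" / name)
        return self._refs[name]
```

Asking for references/sales.md loads that file alone; finance.md and product.md stay out of context until a step actually requests them.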

Secrets: The Part That's Immune to Intelligence

Skills often need API keys or credentials. This is the one component where the intelligent executor makes almost no difference — security is a physical constraint, not an intelligence problem. The principle is the same as in programming: declare dependencies, don't embed them. Secrets live in ~/.claude/.env, user-level, never in the skill directory, never uploaded when sharing. Like I/O, secrets form part of the stable frame around which evolution happens — /base-reflect doesn't touch them.
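The lookup order (process environment first, then the user-level file) can be sketched in dependency-free Python. This is a deliberately minimal parser for illustration; real .env files have more edge cases (export prefixes, multiline values):

```python
import os
from pathlib import Path

def load_env(path: str = "~/.claude/.env") -> dict[str, str]:
    """Minimal .env parser: KEY=VALUE lines, '#' comment lines ignored."""
    env: dict[str, str] = {}
    p = Path(path).expanduser()
    if not p.exists():
        return env
    for line in p.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip().strip('"')
    return env

def get_secret(name: str) -> str:
    # Process environment wins; the user-level .env file is the fallback.
    value = os.environ.get(name) or load_env().get(name)
    if value is None:
        raise KeyError(f"{name} not set — see the skill's secrets setup URL")
    return value
```

The skill directory itself never holds a credential, so packaging and sharing the skill can't leak one.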

Orchestration: Discovered, Not Designed

This is where the intelligent executor changes things most dramatically.

A key design principle from Anthropic's guide: Claude is already very smart. Only add instructions the model doesn't already know. Challenge each piece of information: does this justify its token cost? This seems like practical advice about conciseness, but it has a deeper implication — it means parts of your skill are already obsolete, and more will become obsolete as models improve.

In programming, you design an algorithm, then implement it. The thinking comes first, the code comes second. With skills, it often works the other way around: you run the skill, observe what works, and capture the optimal path through reflection after the fact.

This inversion exists because of a property unique to skills: the orchestration contains two fundamentally different things, and only one of them has lasting value.

Scaffolding

The first kind teaches the model things it should already know:

```
Step 1: Use WebSearch to search "finance news today"
Step 2: Filter results for market-relevant news
Step 3: Map news to affected stocks
```

This compensates for current model limitations. As the agents powering tools like Claude Code and Codex improve, this kind of instruction becomes unnecessary — you don't need to tell a sufficiently capable model to use web search when it needs current news. Scaffolding is a depreciating asset. The better the runtime gets, the less of it you need.

Discovered orchestration

The second kind captures genuine discoveries about what works in a specific domain:

```markdown
### Step 1: Pre-filter [MUST]
> Rationale: Scoring all stocks first, then checking news
> only for survivors, avoids anchoring bias toward stocks
> that happen to be in the headlines.

### Step 2: News for survivors only [MUST]
> Rationale: Checking news before filtering led to
> over-weighting popular stocks with media coverage.
```

This isn't teaching the model how to use tools. It's recording that in this particular problem domain, doing A before B produces better results than doing B before A — a fact that was only discovered through trial and error.

Scaffolding and discovered orchestration have opposite trajectories. Scaffolding depreciates as models improve. Discovered orchestration appreciates, because there will always be new problem domains where the optimal strategy is unknown until someone finds it through iteration. It's the core intellectual property of a well-reflected skill.

Notice the > Rationale: blocks in the example above. These aren't documentation for humans — they're metadata for reflection. When /base-reflect analyzes a run, it needs to know why a step exists to judge whether to keep, modify, or remove it. Without rationale, /base-reflect can see what the skill did but not why, and can't make informed changes. Rationale annotations are the gradient highways of the evolution loop — they let the learning signal reach the right orchestration step.

The two dimensions of orchestration

Anthropic's guide introduces a useful concept: degrees of freedom. Some instructions need to be precise (a narrow bridge with cliffs needs guardrails), while others can be loose (an open field allows many routes). But this is only the horizontal axis. The vertical axis — whether the instruction is scaffolding or a genuine discovery — is equally important.

[Figure: quadrant chart. X-axis runs from specific scripts to general guidance; Y-axis runs from scaffolding (depreciates) to discovered (appreciates). The top half, amber, holds domain insights and precise procedures worth keeping; the bottom half, gray, holds rigid instructions and general guidance that will fade.]
Only the top half is worth investing in. The bottom half evaporates as models improve.

The quadrant chart reveals something that neither axis shows alone: “Use WebSearch for news” and “Score BEFORE checking news” look similar on the page — both are step-by-step instructions. But the first is scaffolding that will fade; the second is a discovered ordering that encodes real domain insight. Knowing which quadrant each instruction belongs to tells you where to invest your effort and what to let go.

Workspace: Why Skills Need a Working Table

During execution, a skill produces data that's neither input nor output — raw API responses before scoring, scoring breakdowns before report generation, baseline metrics for later evaluation. Programs have working directories for this. Skills need the same thing, but for a different reason.

| Data | Category | Reasoning |
| --- | --- | --- |
| Raw JSON from yfinance | Workspace | Intermediate — feeds the scoring step |
| Per-stock score breakdown | Workspace | Intermediate — feeds report generation |
| Baseline metrics at execution time | Workspace | Evidence for future evaluation |
| Evaluation notes | Workspace | Raw material for reflection |

```yaml
workspace: analysis/YYYYMMDD/.workspace/
```

In a program, the working directory is an implementation detail — it exists for the current run and you rarely look at it afterwards. In a skill, workspace serves a much more important function: it's the primary input for the evolution loop.

When you run /base-reflect (available as a Skillbase skill for Claude Code, Codex, Gemini CLI, and other agents) to optimize a skill, it doesn't look at the output (that's for the user). It looks at the workspace — the intermediate steps, the scoring breakdowns, the decisions that were made along the way. This is how it figures out what worked, what didn't, and what orchestration to update.

A program's temp files are disposable. A skill's workspace is the fossil record of its execution — and fossils are how species evolve. The richer the workspace, the more precisely /base-reflect can attribute problems to specific components. A per-stock score breakdown lets /base-reflect see “this loss-making company ranked second because the scoring rules have no negative weight for losses” — a specific, actionable finding. Without that breakdown, all /base-reflect gets is “the final report looks off.”
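A sketch of how a run might persist those fossils, in Python. The function names and JSON layout are hypothetical; what matters is that every intermediate result is written out rather than discarded:

```python
import json
from datetime import date
from pathlib import Path

def workspace_dir(base: str = "analysis") -> Path:
    """analysis/YYYYMMDD/.workspace/ — the fossil record of one run."""
    d = Path(base) / date.today().strftime("%Y%m%d") / ".workspace"
    d.mkdir(parents=True, exist_ok=True)
    return d

def save_artifact(ws: Path, name: str, payload: dict) -> Path:
    # Each intermediate result is preserved so that a later reflection
    # pass can attribute outcomes to the specific step that produced them.
    path = ws / f"{name}.json"
    path.write_text(json.dumps(payload, indent=2))
    return path
```

A run that calls `save_artifact(ws, "score-breakdown", …)` after scoring and `save_artifact(ws, "news-mapping", …)` after news analysis leaves exactly the trail that reflection needs.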

Goal: From Binary Tests to Proportional Verification

Programs have tests. Tests are binary — pass or fail, no in-between. This works because the executor is deterministic: given the same input, a program always produces the same output, so you can write exact assertions.

Skills run on a probabilistic executor. The same skill with the same input might produce slightly different output each time. More importantly, for complex skills, “correct” is often a matter of degree, not a binary state.

So skills have goals instead of tests. A goal states what success looks like, in natural language:

```yaml
goal: >
  Scores should correlate with actual stock
  performance — high-rated stocks should
  outperform low-rated ones.
```

The agent interprets what “correlate” and “outperform” mean in context. This flexibility is a feature, not a bug — it's possible because the executor is intelligent enough to judge nuance. Verification scales with complexity: simple skills need no extra checking (the result speaks for itself), while complex skills may generate child skills for deferred evaluation weeks later.
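One concrete reading the agent might apply to "correlate" is rank correlation between scores and subsequent returns. A self-contained Python sketch (the no-ties Spearman formula; the 0.5 threshold is an arbitrary illustration, not part of the goal):

```python
def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation for the no-ties case: 1.0 = perfect agreement."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i], reverse=True)
        r = [0] * len(vs)
        for rank_value, i in enumerate(order, start=1):
            r[i] = rank_value
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def goal_met(scores: list[float], returns: list[float],
             threshold: float = 0.5) -> bool:
    # The goal, operationalized: high-rated stocks should outperform
    # low-rated ones, i.e. scores should rank-correlate with returns.
    return spearman(scores, returns) >= threshold
```

The skill itself never pins down this formula; a future reflection pass is free to operationalize "correlate" differently if this reading proves too strict or too loose.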

But goals serve a more fundamental purpose than verification. In the evolution loop, the goal is the loss function. Without it, /base-reflect can analyze workspace data and see what happened, but it can't judge whether the outcome was good or bad. Goal converts raw observations into directional evidence: “scores didn't correlate with performance, therefore the scoring rules need revision.” No goal, no direction. No direction, no evolution — just random drift.

The Evolution Loop: What Programs Don't Have

All of the above — knowledge, orchestration, workspace, goals — come together in one mechanism that has no real equivalent in programming: the reflection feedback loop.

In practice, a strong evolution loop uses two feedback channels, not one. Workspace reflection captures what happened in your own runs. User-reported FEEDBACK captures what breaks for real callers after a skill is shared or deployed.

| Feedback Channel | Where It Comes From | What It Improves |
| --- | --- | --- |
| Workspace reflection | Your execution traces and intermediate artifacts | Core orchestration and knowledge rules |
| User FEEDBACK | Reported scenarios, failures, and suggestions from users | Edge cases, reliability, and real-world behavior |
[Figure: circular flow. Execute generates data for Workspace, which provides raw material for Reflect, which extracts lessons that update Knowledge and Orchestration, making the next run better.]

Most approaches to skill improvement follow the traditional programming model: use the skill, notice a problem, manually fix it, test again. Anthropic's Skill Creator guide calls this “Iterate” — and it works, but it's ad-hoc. The improvement happens outside the skill, driven entirely by human observation.

The evolution loop is structurally different, and the analogy to machine learning is instructive. Each execution is a forward pass: the skill runs, workspace captures intermediate states (like saving activations during a forward pass), and the goal measures the gap between actual and desired output (the loss). Reflect is the backward pass: it analyzes the loss, traces it through the workspace evidence, and attributes the error to specific components — updating a scoring threshold in knowledge, reordering steps in orchestration, or both.

User FEEDBACK acts like an additional supervision stream. It often surfaces failure modes your local runs never hit: unusual data distributions, ambiguous prompts, latency-sensitive paths, and integration assumptions. A robust /base-reflect workflow should incorporate these FEEDBACK signals into the same update cycle, with clear attribution between session findings and external reports.

This is why each component is designed the way it is. Knowledge separation creates independently addressable parameter groups, so /base-reflect can update scoring rules without touching orchestration. Rationale annotations create gradient highways, so /base-reflect can trace an outcome back to the step that caused it. Workspace captures the activations that make attribution possible. Goal provides the loss function that gives /base-reflect a direction.

One training step: a concrete example

v1 runs. The skill checks news first, finds “NVIDIA dominates AI chip market,” then scores all stocks. NVDA gets a massive boost from news sentiment. A loss-making company with media buzz ranks second. Workspace captures the per-stock score breakdown, the news-to-stock mapping, and the final rankings.

Reflect analyzes. Goal says “scores should correlate with actual performance.” Workspace shows that media-heavy stocks were systematically overrated and that a company with negative earnings ranked high. Reflect traces two root causes: (1) checking news before scoring introduced anchoring bias (orchestration problem), and (2) the scoring rules have no penalty for negative earnings (knowledge gap).

v2 is born. Reflect updates orchestration: move scoring before news checking, add rationale explaining the anchoring risk. Reflect updates scoring-rules.yaml: add “net income negative → −3 points.” Next run, the same input produces more accurate rankings. One training step complete.
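The training step is easiest to see when the scoring rules are pure data. In this hypothetical Python sketch, the v1-to-v2 update is a one-tuple append to the rules; the scoring procedure never changes (field names and weights are illustrative):

```python
# v2 of scoring-rules.yaml, expressed as (field, comparison, threshold, points).
# The reflection step appended only the last rule; the procedure is untouched.
RULES_V2 = [
    ("pe_forward", "lt", 15, +2),
    ("roe", "gt", 15, +2),
    ("revenue_growth", "lt", 0, -1),
    ("net_income", "lt", 0, -3),   # new in v2: penalize loss-makers
]

OPS = {"lt": lambda a, b: a < b, "gt": lambda a, b: a > b}

def score(metrics: dict, rules=RULES_V2) -> int:
    """Generic procedure: apply whatever rules the knowledge file provides."""
    return sum(points for field, op, threshold, points in rules
               if OPS[op](metrics[field], threshold))
```

Under v1 rules the loss-making, media-buzzed company scores positively; under v2 the same inputs push it down the ranking, which is the one training step made visible.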

Without separated knowledge, /base-reflect couldn't have updated the scoring threshold without risking the procedure. Without workspace, /base-reflect wouldn't have seen the per-stock breakdown that revealed the anchoring pattern. Without the goal, /base-reflect wouldn't have known the rankings were wrong. Every design choice in a good skill serves this loop.

The Full Picture

```
skill-name/
├── SKILL.md              # Orchestration + goal
├── references/           # Knowledge (separate from logic)
│   ├── scoring-rules.yaml
│   └── sector-mapping.yaml
├── scripts/              # Tool code
│   └── get_stock_data.py
└── assets/               # Output resources (templates, images)
    └── report-template.md
```

```yaml
---
name: analyze-stocks

inputs:
  - name: stock_codes
    description: Stocks to analyze
  - name: investment_horizon
    default: long-term

outputs:
  - name: summary_report
    path: analysis/YYYYMMDD/SUMMARY.md

knowledge:
  - file: references/scoring-rules.yaml
    purpose: Makes scoring accurate
    used_by: [scoring]
  - file: references/sector-mapping.yaml
    purpose: Maps news to correct companies
    used_by: [news-mapping]

secrets:
  - env: NEWS_API_KEY
    used_by: [news-fetching]
    setup: "https://newsapi.org/register"

workspace: analysis/YYYYMMDD/.workspace/

goal: >
  Scores should correlate with actual stock performance.
  High-rated stocks should outperform low-rated ones.
---
```

Six components. Each one has a counterpart in programming. Each one is reshaped by the same force: the executor understands what you mean, not just what you say.

That single fact is why knowledge must be separated (natural language logic and natural language data are otherwise indistinguishable), why orchestration is discovered rather than designed (the search space is new every time), why goals are fuzzy (the agent can judge nuance), why workspace feeds an evolution loop (the agent can learn from its own runs), and why scaffolding is temporary (the agent keeps getting smarter).

We're at the beginning of this. The first programs were also just scripts — sequences of instructions with no structure. It took decades to develop functions, modules, type systems, package managers, and testing frameworks. Skills are at the “scripts” stage right now. The concepts in this article — separating knowledge from logic, recording discovered orchestration, proportional verification, and the /base-reflect loop with user FEEDBACK — are early attempts at giving skills the kind of structural discipline that programs took years to develop.

The difference is that this time, the runtime is evolving alongside the programs. Every design choice you make today might be obsolete tomorrow — not because it was wrong, but because the executor learned to do it without being told. The art of writing good skills is knowing which parts to invest in (knowledge, discovered orchestration, goals) and which parts to hold loosely (scaffolding, rigid structure, detailed instructions).

Build for evolution, not for permanence. That's the one principle that won't depreciate.