Agent Skills and Macro-Evals: Turning Agent Failures Into Operating Memory

Agent skills are not valuable because markdown is easy to edit. They matter when macro-evals turn repeated agent failures into operating memory.

Agent skills became easy to write before most teams knew what deserves to be written down.

That is the trap.

A skill can look trivial from the outside: a folder, a SKILL.md, a description, some instructions, maybe a script or a reference file. OpenAI’s Codex docs describe skills as reusable workflows that package instructions, resources and optional scripts; Codex starts with the skill name, description and file path, then loads the full SKILL.md only when the skill is selected.

The structure is deliberately simple.

The judgement behind it is not.

A standing order in a banking app is also simple once someone has designed the interface. Amount. Recipient. Date. Frequency.

The value sits elsewhere: knowing which payments should never rely on a person remembering them manually.

Agent skills need the same lens.

Markdown makes behaviour editable. A team can change a review rule, routing policy, evidence standard, formatting expectation, or repo-specific workflow without rebuilding the platform.

Useful, obviously.

The deeper product question is which judgement should leave live reasoning and become procedural memory.

TL;DR

Agent skills matter when they become the memory of repeated correction.

Macro-evals reveal which problems recur across a population of agent traces. A single output may look plausible while the trace shows a missed handoff, ignored signal, weak review decision, or repeated routing failure.

The serious loop is:

lower-level evals flag local issues inside individual runs;
macro-evals find repeated behaviour patterns across many traces;
operators inspect the pattern, decide what kind of correction is needed, then encode that correction in the right place;
skills are one useful destination, especially when the correction is a reusable workflow rule, evidence standard, verification sequence, or formatting policy.

The file may be markdown. The work is systems judgement.

Agent discourse still over-indexes on the prefrontal cortex

Most agent commentary focuses on the visible intelligence layer:

planning;
reasoning;
tool choice;
delegation;
long-context retrieval;
multi-step execution;
agent-to-agent handoffs.

That layer matters. It is also where people overpay.

The best product operators spend as much time on a second question as they do on the reasoning layer: what should move into the cerebellum?

The reflex layer.

The part of the system that stops spending expensive cognition on decisions it should already know how to handle.

A coding agent should not rediscover a repository’s verification sequence after every change.

A finance agent should not reconstruct approval policy every time a supplier invoice changes bank details.

A support agent should not improvise escalation logic when a complaint carries regulatory risk.

A training-plan agent should not “think carefully” from scratch every time recent load has spiked, recovery markers are falling, and local soreness is present.

Reasoning should remain available for ambiguous work.

Repeated operational judgement should become inspectable, testable, and reusable.

Macro-evals show which judgement keeps failing

The useful macro-eval idea is simple enough to explain without the notebook machinery.

A single agent run answers one question:

Did this workflow succeed this time?

A macro-eval asks a better production question:

Across many runs, which behaviour patterns keep appearing, where do they concentrate, and which part of the workflow should a human inspect first?

OpenAI’s macro-evals cookbook separates lower-level evals from macro-evals. Lower-level evals grade individual agents, handoffs, tools, and completed runs. Macro-evals look across those findings to identify repeated problems and concentration points across the population.

The cookbook’s mental model is useful:

case_type → run_outcome → eval_finding → behaviour_pattern

Translated into operating language:

Case type describes the situation the agent faced.
Run outcome records how the workflow ended.
Eval finding captures the local concern.
Behaviour pattern shows what repeats across many traces.

That last step is where skills become interesting.

A local finding tells you something went wrong.

A repeated behaviour pattern tells you the system may need memory.
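The chain above can be sketched in a few lines of Python. This is a minimal illustration, not the cookbook's implementation: the `RunRecord` fields, the sample labels, and the `behaviour_patterns` helper are all hypothetical names, and exact-match counting stands in for the clustering a real macro-eval would use.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One graded run. Every field name here is a hypothetical label."""
    case_type: str      # the situation the agent faced
    run_outcome: str    # how the workflow ended
    eval_finding: str   # the local concern raised by a lower-level eval


def behaviour_patterns(runs: list[RunRecord], min_count: int = 2) -> dict[tuple[str, str], int]:
    """Count (case_type, eval_finding) pairs that repeat across runs.

    Exact-match counting is a stand-in for real clustering.
    """
    counts = Counter((r.case_type, r.eval_finding) for r in runs)
    return {pair: n for pair, n in counts.items() if n >= min_count}


runs = [
    RunRecord("tired_runner", "plan_delivered", "recovery_signal_dropped"),
    RunRecord("tired_runner", "plan_delivered", "recovery_signal_dropped"),
    RunRecord("new_runner", "plan_delivered", "none"),
]
patterns = behaviour_patterns(runs)
```

Each run on its own ended in a delivered plan. Only the count across runs shows that the same finding keeps attaching to the same case type.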

The final answer is too small a surface

Agentic systems fail inside the workflow, not only at the final response.

A final answer can look coherent while the trace tells a different story: a specialist missed a signal, a tool output was ignored, review happened too late, or the orchestrator routed around a required step. OpenAI’s cookbook makes this explicit: agentic evals need to inspect tool use, delegation, review pauses, grounding in business context, and the workflow behind the final answer.

That matters because plausible output is the most dangerous kind of failure.

Bad output gets caught.

Plausible output gets operationalised.

A training plan can read well while preserving too much intensity for a tired athlete.

A compliance summary can sound professional while failing to carry forward the evidence that should trigger review.

A coding agent can explain a change cleanly while skipping the one verification command that would have exposed the regression.

A finance agent can produce a neat recommendation while failing to route an exception caused by changed supplier bank details.

Macro-evals exist because production failure often lives in the trace.

This is the same reason AI workflow automation breaks when nobody owns the review loop. Review is not a cosmetic step at the end. It is part of the operating surface where quality debt becomes visible.

The fitness-tech example makes the memory problem visible

I’ve built fitness-tech apps and agents, so training-plan generation is the easiest place to see the issue.

A training agent might have access to:

wearable data;
recent sessions;
soreness notes;
sleep history;
resting heart rate;
goal pace;
calendar constraints;
preferred sports;
equipment access;
previous adherence.

One generated plan can look fine.

The week has intervals, aerobic work, strength, mobility, and a rationale. The tone is confident. The sessions match the stated goal. A user who asked for an aggressive plan feels understood.

The problem appears across runs.

When running load rises quickly and recovery markers deteriorate, the system still preserves too much intensity if the user’s stated goal is aggressive. The risk signal exists in the trace, then gets softened before the final plan is produced.

Nobody needs a broad instruction like:

Be careful with tired runners.

That sentence has no operational force.

A useful correction is closer to:

If recent load has spiked, recovery markers are worsening, and local soreness is present, the final plan must carry that evidence forward, reduce high-impact intensity, and explain the trade-off before prescribing speed work.

That rule has four important properties:

a trigger the system can recognise;
evidence that must survive into the final decision;
a behavioural consequence;
an explanation requirement for the user.
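A rough sketch of how those four properties might look if the rule were written as code rather than prose. Everything here is an assumption made for illustration: the `AthleteSignals` fields, the session labels, and the choice to swap speed work for easy aerobic work are one possible reading of "reduce high-impact intensity", not a training recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AthleteSignals:
    """Hypothetical boolean signals already derived upstream."""
    load_spike: bool
    recovery_worsening: bool
    local_soreness: bool


def apply_recovery_rule(signals: AthleteSignals, sessions: list[str]) -> dict:
    """Return adjusted sessions plus the evidence and explanation to surface."""
    # Trigger: all three signals are present.
    triggered = (
        signals.load_spike
        and signals.recovery_worsening
        and signals.local_soreness
    )
    if not triggered:
        return {"sessions": list(sessions), "evidence": [], "explanation": None}
    # Behavioural consequence (assumed): replace speed work with easy aerobic work.
    adjusted = ["easy_aerobic" if s == "speed_work" else s for s in sessions]
    return {
        "sessions": adjusted,
        # Evidence that must survive into the final decision.
        "evidence": ["load_spike", "recovery_worsening", "local_soreness"],
        # Explanation requirement for the user.
        "explanation": (
            "Speed work was replaced with easy aerobic work because load "
            "spiked while recovery worsened and soreness was reported."
        ),
    }


result = apply_recovery_rule(
    AthleteSignals(load_spike=True, recovery_worsening=True, local_soreness=True),
    ["speed_work", "strength", "long_run"],
)
```

The return value carries the evidence and the explanation alongside the adjusted sessions, so a later step cannot keep the prescription change while quietly dropping the reason for it.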

That correction might become a skill. It might become an evaluator. It might become a policy rule inside the training-plan engine. It might become a hard guardrail for speed-work prescription.

The implementation is secondary.

The operating move is the point: repeated judgement leaves the live reasoning stream and becomes memory.

Skills are only one destination for macro-eval findings

This is where the discourse gets lazy.

A macro-eval finding should not automatically become another markdown instruction.

Different failure types need different corrections:

Macro-eval finding	Better correction
The agent ignores a visible risk signal	Skill instruction, evidence-carrying rule, or evaluator check
The wrong specialist gets called repeatedly	Routing policy, handoff rule, or orchestrator update
Review happens too late	Escalation threshold or workflow state change
A calculation is inconsistent	Script, typed function, or deterministic validation
Output format breaks downstream work	Template, schema check, or required artifact structure
The model lacks source context	Retrieval/data pipeline change
A risky action should not happen autonomously	Permission boundary or human approval gate

This is the bridge between macro-evals and skills.

The macro-eval tells you which behaviour repeats.

The operator decides what kind of system change should absorb the lesson.

Skills are powerful when the correction is procedural: “when this kind of work appears, follow this workflow, preserve this evidence, run these checks, and produce this kind of output.”

They are weaker when the real issue is missing data, bad permissions, poor tool design, or an action the agent should not own.

Trace documents define what the macro-eval can notice

OpenAI’s cookbook turns raw traces into compact trace documents before clustering behaviour patterns. That matters because raw traces can contain hundreds of events, long model responses, tool payloads and repeated status updates. The compact document preserves the parts that matter: scenario, routing, state transitions, handoffs, findings, and terminal state.

That is an underrated design choice.

A macro-eval is only as useful as the evidence it can see.

For a training-plan agent, a useful trace document should preserve:

the athlete’s goal and time horizon;
recent training load by sport;
acute changes in sleep, resting heart rate, soreness, and adherence;
the sessions the planner proposed;
any safety or recovery warnings raised mid-run;
the final rationale shown to the user;
whether the plan reduced intensity, preserved it, or escalated to a cautious recommendation.

For a coding agent, the trace document should preserve the task, files changed, commands run, test failures, skipped checks, review comments, and final handoff state.

For a finance workflow, it should preserve the invoice type, supplier history, purchase-order match, payment threshold, bank-detail change, approver route, exception status, and audit packet.

This is not mechanical logging.

It is evaluation design.

The trace document decides which patterns the macro-eval is allowed to discover.
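As a sketch of that design decision, the function below compresses a raw event list into a compact trace document. The event shape (`type`, `detail`), the `KEEP_TYPES` set, and the output field names are assumptions for illustration; the cookbook's actual schema is not reproduced here.

```python
from typing import Any

# Hypothetical event types worth keeping; a real system would choose its own.
KEEP_TYPES = {"routing", "handoff", "finding", "state_change", "terminal"}


def compact_trace(scenario: str, events: list[dict[str, Any]]) -> dict[str, Any]:
    """Reduce a raw event list to a small, comparable trace document."""
    kept = [e for e in events if e.get("type") in KEEP_TYPES]
    # The last terminal event, if any, records how the run ended.
    terminal = next(
        (e.get("detail") for e in reversed(kept) if e["type"] == "terminal"),
        None,
    )
    return {
        "scenario": scenario,
        "steps": [
            f'{e["type"]}:{e.get("detail")}' for e in kept if e["type"] != "terminal"
        ],
        "findings": [e.get("detail") for e in kept if e["type"] == "finding"],
        "terminal_state": terminal,
    }


raw_events = [
    {"type": "status", "detail": "thinking"},
    {"type": "routing", "detail": "recovery_analysis"},
    {"type": "finding", "detail": "soreness_flagged"},
    {"type": "status", "detail": "still thinking"},
    {"type": "handoff", "detail": "plan_builder"},
    {"type": "terminal", "detail": "plan_delivered"},
]
doc = compact_trace("tired_runner_aggressive_goal", raw_events)
```

Whatever is left out of `KEEP_TYPES` is invisible to every later step. If soreness notes were logged as plain status updates, no macro-eval built on this document could notice that they were dropped.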

Pattern discovery gives operators a triage board

Once traces are comparable, macro-evals can cluster risky or failed runs into behaviour patterns. OpenAI’s cookbook ranks patterns using prevalence, severity-weighted prevalence, and impact score; the point is not to prove every pattern is a defect, but to show reviewers where to inspect first.

That distinction matters.

A high-impact behaviour pattern is an inspection target.

It may reveal a defect. It may expose unrealistic test scenarios. It may show that the business policy itself is ambiguous. It may point to a workflow that needs a named owner rather than another prompt patch.
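A minimal sketch of a ranking along those lines. The severity weights and the impact formula are assumptions chosen for illustration; the cookbook's exact definitions are not reproduced here, and the pattern names are hypothetical.

```python
from collections import defaultdict

# Assumed severity weights, not taken from the cookbook.
SEVERITY_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 4.0}


def rank_patterns(findings: list[tuple[str, str]], total_runs: int) -> list[dict]:
    """Rank (pattern, severity) findings by an illustrative impact score."""
    counts: dict[str, int] = defaultdict(int)
    weighted: dict[str, float] = defaultdict(float)
    for pattern, severity in findings:
        counts[pattern] += 1
        weighted[pattern] += SEVERITY_WEIGHT[severity]
    rows = []
    for pattern, n in counts.items():
        prevalence = n / total_runs
        severity_weighted = weighted[pattern] / total_runs
        rows.append({
            "pattern": pattern,
            "prevalence": prevalence,
            "severity_weighted_prevalence": severity_weighted,
            # Assumption: impact is the product of the two measures above.
            "impact": prevalence * severity_weighted,
        })
    return sorted(rows, key=lambda r: r["impact"], reverse=True)


findings = [
    ("intensity_kept_for_tired_runner", "high"),
    ("intensity_kept_for_tired_runner", "high"),
    ("rationale_omits_soreness", "low"),
    ("rationale_omits_soreness", "low"),
    ("rationale_omits_soreness", "low"),
]
ranked = rank_patterns(findings, total_runs=10)
```

In this sample the less frequent pattern ranks first because its findings are more severe. The output is an inspection order for reviewers, not a verdict on either pattern.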

In a training-plan system, a macro-eval might find:

tired-runner cases still receiving high-impact intensity;
multi-sport athletes getting too much total load because each sport is assessed separately;
users with limited calendars receiving plans that look elegant but fail adherence reality;
aggressive goal language causing the system to underweight recovery signals;
soreness evidence appearing in the trace but disappearing from the user-facing rationale.

Those findings are more useful than “the plan was bad”.

They tell the operator where judgement collapsed.

Diagnosis turns a pattern into an inspection target

Pattern discovery says what repeats.

Diagnosis asks where to inspect.

OpenAI’s cookbook uses a lightweight execution graph of trace events, then walks backward from a focus event to score upstream suspects such as agents, tools, handoffs, review markers, or specialist responses. It explicitly treats this as inspection guidance rather than causal proof.

That framing is exactly right.

A macro-eval should not pretend to be omniscient. It should narrow the field.
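One way to picture the backward walk is a breadth-first search over a small graph in which closer upstream events score higher. This is a sketch under stated assumptions: the step names are hypothetical, the `UPSTREAM` graph is hand-written, and distance-based decay is a placeholder for whatever scoring the cookbook actually applies.

```python
from collections import deque

# Hypothetical execution graph: each event maps to the events that fed into it.
UPSTREAM = {
    "final_plan": ["final_rationale", "review_rule"],
    "final_rationale": ["plan_builder"],
    "review_rule": ["plan_builder"],
    "plan_builder": ["recovery_analysis", "goal_parser"],
    "recovery_analysis": [],
    "goal_parser": [],
}


def score_suspects(
    focus: str, upstream: dict[str, list[str]], decay: float = 0.5
) -> dict[str, float]:
    """Walk backward from the focus event; closer upstream events score higher."""
    scores: dict[str, float] = {}
    queue = deque([(focus, 0)])
    seen = {focus}
    while queue:
        node, depth = queue.popleft()
        for parent in upstream.get(node, []):
            if parent in seen:
                continue
            seen.add(parent)
            # Score halves (by default) with each extra hop from the focus event.
            scores[parent] = decay ** (depth + 1)
            queue.append((parent, depth + 1))
    return scores


suspects = score_suspects("final_plan", UPSTREAM)
```

The scores say where to look first. A high score means "near the focus event", not "caused the failure".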

In the training-plan case, the focus event might be:

final plan includes speed work despite load spike, poor recovery, and local soreness.

The backward inspection might find:

the recovery-analysis step flagged risk;
the plan-builder step treated goal ambition as higher priority;
the final-rationale step mentioned fatigue but did not alter the prescription;
the review rule only checked weekly volume, not combined impact intensity;
the soreness note was treated as context rather than a constraint.

Now the team has real choices.

A skill can force the planner to carry recovery evidence into the final plan.

An evaluator can fail outputs that prescribe speed work without explaining the trade-off.

A deterministic check can calculate load spike thresholds.

A review rule can escalate plans where soreness and recovery deterioration appear together.

A product decision can make certain prescriptions unavailable without more user input.

That is the operating loop.

Macro-evals reduce the search space. Operators still decide the correction.

Skills become useful when they encode a specific judgement

A skill should not be a nicer prompt.

A strong skill has a narrow job.

OpenAI’s Codex docs advise keeping each skill focused on one job, preferring instructions unless deterministic behaviour or external tooling is needed, writing imperative steps with explicit inputs and outputs, and testing prompts against the skill description to confirm trigger behaviour.

That maps cleanly to the macro-eval loop.

A weak training skill says:

Generate safe, personalised plans.

A stronger skill says:

Use this skill when a plan includes running intensity and the athlete shows any combination of recent load spike, deteriorating recovery markers, local soreness, or reduced adherence. Before prescribing speed work, compare goal ambition against recovery evidence. If risk is elevated, reduce high-impact intensity, preserve aerobic continuity through lower-impact work where possible, and explain the trade-off in the final rationale.

That instruction is still readable.

It also has operational teeth.

It tells the system when to load the skill, which evidence to inspect, what decision to change, and how the final answer should reflect the decision.

The same pattern applies outside fitness.

A repo skill should specify when code changes require lint, typecheck, build, unit tests, integration tests, or generated asset updates.

A brand skill should specify banned claims, proof requirements, channel constraints, previous winners, and the review criteria for rejecting output before a creative lead sees it.

A finance skill should specify which invoice exceptions require routing, which evidence belongs in the audit packet, and when payment must pause.

The value is not “we have a skills folder”.

The value is that repeated operational judgement has somewhere to live.

Scripts belong where probability should end

Some corrections should not remain in prose.

A skill can include scripts and references. That matters because agents with filesystem access and code execution do not need to reason through every repeatable check in natural language.

Use instructions for judgement.

Use scripts for checks.

A training agent can reason about whether a cautious plan is more appropriate. A script can calculate acute load change, compare session density, or flag missing recovery inputs.
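A sketch of what such a script might look like. The 7-day and 28-day windows and the 1.3 threshold are illustrative assumptions, not validated training guidance, and the function names are hypothetical.

```python
from typing import Optional


def acute_load_ratio(
    daily_load: list[float], recent_days: int = 7, baseline_days: int = 28
) -> Optional[float]:
    """Mean load over the recent window divided by mean load over the baseline window."""
    if len(daily_load) < baseline_days:
        return None  # missing inputs: flag rather than guess
    recent = daily_load[-recent_days:]
    baseline = daily_load[-baseline_days:]
    baseline_mean = sum(baseline) / len(baseline)
    if baseline_mean == 0:
        return None  # no baseline load to compare against
    return (sum(recent) / len(recent)) / baseline_mean


def is_load_spike(daily_load: list[float], threshold: float = 1.3) -> Optional[bool]:
    """True or False when the ratio is computable, None when inputs are missing."""
    ratio = acute_load_ratio(daily_load)
    return None if ratio is None else ratio > threshold


steady = [10.0] * 28
spiked = [10.0] * 21 + [20.0] * 7
too_short = [10.0] * 5
```

With these inputs, `is_load_spike(spiked)` returns `True`, `is_load_spike(steady)` returns `False`, and `is_load_spike(too_short)` returns `None`. That third case lets the agent ask for more data instead of reasoning over a gap.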

A coding agent can decide the right implementation approach. A script can run the repo’s exact verification sequence.

A content agent can judge whether a draft fits a campaign angle. A validator can check required sections, character limits, banned terms, missing citations, and malformed frontmatter.
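A validator of that kind can be very small. The sketch below covers three of those checks (required sections, banned terms, and a character limit); the function name, the section labels, and the sample draft are hypothetical.

```python
def validate_draft(
    text: str,
    required_sections: list[str],
    banned_terms: list[str],
    max_chars: int,
) -> list[str]:
    """Return human-readable problems; an empty list means the draft passes."""
    problems = []
    lowered = text.lower()
    for section in required_sections:
        if section.lower() not in lowered:
            problems.append(f"missing section: {section}")
    for term in banned_terms:
        if term.lower() in lowered:
            problems.append(f"banned term: {term}")
    if len(text) > max_chars:
        problems.append(f"too long: {len(text)} > {max_chars}")
    return problems


draft = "Headline: Faster recovery\nBody: Our guaranteed results speak for themselves."
problems = validate_draft(
    draft,
    required_sections=["Headline:", "Body:", "Proof:"],
    banned_terms=["guaranteed"],
    max_chars=280,
)
```

Substring matching is deliberately crude here. The point is that the same draft gets the same result on every run, which prose instructions cannot promise.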

A finance agent can explain the commercial implication of an exception. A parser can extract invoice fields, compare bank details, and detect threshold breaches.

This split protects the system from two expensive habits:

paying frontier reasoning costs for repeatable checks;
letting probabilistic generation handle work that should be deterministic.

That is where agent skills start to look less like prompt management and more like operating-system design.

Skills also protect the context window from policy sprawl

Teams often respond to agent failures by adding more global instruction.

The system prompt becomes a landfill.

Brand rules sit beside compliance policy. Output formatting sits beside escalation logic. Tool preferences sit beside edge-case warnings. The agent carries everything into every task, whether relevant or not.

Progressive disclosure is the architectural answer.

Codex starts with skill metadata, then loads full instructions only when the skill is selected. That keeps the main context lean while allowing deeper procedural knowledge to exist outside the always-on prompt.

This is the practical reason skills matter for cost.

Large context windows make sloppy operating design easier to tolerate. They do not make it free. Irrelevant instruction still competes for attention, increases latency, raises cost, and creates more ways for rules to conflict.

The better pattern is selective memory:

keep the base agent small enough to reason clearly;
give skills sharp descriptions so the right ones trigger;
load detailed instructions when the task actually needs them;
move deterministic checks into scripts;
evaluate whether the selected skill changed behaviour.

That is a cleaner system than one giant prompt trying to govern every possible case.
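The selective-memory pattern can be sketched as follows. This is not how Codex is implemented: the `SkillEntry` type, the file paths, and the keyword-based `keyword_select` function are hypothetical stand-ins, and in a real system the model itself decides which skill to load from the metadata.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SkillEntry:
    """Always-on metadata for one skill (hypothetical shape)."""
    name: str
    description: str
    path: str


def build_context(
    task: str,
    skills: list[SkillEntry],
    select: Callable[[str, SkillEntry], bool],
    read_file: Callable[[str], str],
) -> str:
    """Include metadata for every skill, but full instructions only when selected."""
    lines = [f"{s.name}: {s.description} ({s.path})" for s in skills]
    for s in skills:
        if select(task, s):
            lines.append(read_file(s.path))
    return "\n".join(lines)


def keyword_select(task: str, skill: SkillEntry) -> bool:
    """Crude stand-in for model-driven selection: match the first word of the name."""
    return skill.name.split("-")[0] in task.lower()


# In-memory stand-in for skill files on disk.
files = {
    "skills/recovery/SKILL.md": "FULL: recovery-aware planning steps",
    "skills/brand/SKILL.md": "FULL: brand claim rules",
}
skills = [
    SkillEntry("recovery-plan", "Use when a plan includes running intensity", "skills/recovery/SKILL.md"),
    SkillEntry("brand-claims", "Use when drafting ads", "skills/brand/SKILL.md"),
]
context = build_context(
    "Adjust recovery week for a runner", skills, keyword_select, files.__getitem__
)
```

Both skill descriptions reach the context, but only the recovery skill's full instructions are loaded for this task. The brand rules cost one metadata line instead of their full length.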

Skills need owners or they become instruction debt

A company can create fifty skills and still have no operating memory.

The failure mode is obvious.

Someone writes a skill after a bad output. Another person adds a caveat after a stakeholder complaint. A third person adds tone guidance because leadership disliked a draft. Six weeks later the skill contains stale examples, duplicate rules, unclear triggers, and instructions that conflict quietly.

That is instruction debt.

Skills need ownership.

The owner protects five things:

Trigger quality: the skill loads for the right work and stays quiet when irrelevant.
Evidence standards: the skill names which inputs must influence the final decision.
Behavioural consequence: the instruction changes output, routing, review, or verification.
Evaluation coverage: sample tasks prove the skill still works after edits.
Lifecycle discipline: stale rules get removed, split, or promoted into harder system controls.

This is product work.

It sits close to the technical product manager role I described in “What a technical product manager should actually do in an AI-heavy company”: defining context, failure modes, review policy, instrumentation, cost, latency, and the handoff into the real workflow.

A skill without an owner is a prompt with a nicer file path.

Public agent work makes the memory stronger

Private agent work creates private acceleration.

Public agent work creates shared learning.

I wrote about this in “Public AI Agents and the Return of the Shop Floor”. When agent work happens in visible spaces, teams can inspect the prompt, trace, correction, artefact, and decision trail. That matters because skills should not only store the rule. They should preserve why the rule exists.

The useful record looks like this:

a macro-eval identified a repeated pattern;
traces showed which signals were dropped or over-weighted;
the team chose a correction type;
the skill changed a specific behaviour;
a small eval set proved the correction held;
future operators can see the failure the rule prevents.

That lineage prevents skills from becoming folklore.

It also makes the team faster.

New operators can inherit judgement instead of reverse-engineering it from old Slack threads, private chats, and unexplained markdown files.

Internal tools are often where this starts

The cleanest skill loops often begin inside internal tools.

An internal reporting agent misses the same commercial caveat.

A support triage agent escalates one class of customer issue too late.

A creative QA agent keeps approving claims that lack proof.

A product research agent summarises interviews but loses the evidence behind objections.

Those workflows already have repeated work, expert reviewers, operational stakes, and observable failures. That makes them ideal places to build the loop from trace to eval to skill.

This connects to “Internal tools are product work when they change how the company operates”. Internal tools stop being side projects when they change review quality, decision speed, routing, accountability, and the company’s ability to learn from repeated work.

Agent skills sit naturally inside that shift.

They are not just developer convenience.

They are one way an internal tool remembers how the organisation wants work done.

The company-building implication is bigger than skills

Skills become more valuable when they are part of a shared operating layer.

I wrote about this in “AI-Native Company Building: Why Portfolio Bets Become Rational on Shared Infrastructure”. The argument there was that adjacent AI products can share primitives: identity, permissions, memory, orchestration, evals, audit logs, review queues, escalation paths, reporting, and billing.

Skills fit into that layer.

One company might ship different vertical workflows for finance ops, compliance review, support triage, creative QA, and product research. Each surface has its own domain rules. The underlying pattern can still rhyme:

traces capture what happened;
evals detect local failure;
macro-evals find repeated behaviour;
skills encode reusable correction;
scripts make repeatable checks deterministic;
review queues preserve human judgement where risk demands it.

That is why “skills are just markdown files” is such a shallow take.

The folder is the visible artefact.

The operating layer is the compounding asset.

The same logic applies at company level in “AI-Native Operating Models: Why Agentic AI Breaks Companies That Only Cut Headcount”. Evals, governance, review paths, and operating memory belong in the system design, not in a pile of one-off prompt fixes.

A practical loop for agent-skill management

A serious team can manage skills with a loop like this:

Capture real traces. Keep the workflow evidence, not only the final answer.
Run lower-level evals. Grade decisions, tool use, handoffs, review timing, grounding, and output quality.
Build trace documents. Compress runs into comparable records that preserve the evidence the macro-eval needs.
Find behaviour patterns. Cluster repeated failures and inspect where they concentrate.
Rank by operational importance. Prevalence matters, but severity changes the order.
Diagnose upstream suspects. Look at agents, tools, handoffs, review markers, and workflow transitions near the focus event.
Choose the correction type. Use a skill, script, eval, route, permission, data change, review rule, or product constraint.
Test the behaviour change. The correction only counts if future runs behave differently.
Version the memory. Keep the reason for the rule visible, so the skill does not become unexplained instruction debt.

That loop is where product operators earn their keep.

Writing markdown is the easy part.

Knowing which operational lesson belongs in memory is the scarce work.

The mobile-app version of this is already familiar

Mobile product work teaches the same lesson in a different form.

I wrote about the trade-offs in “What building mobile apps teaches a product lead about real trade-offs”: the demo is never the whole product. Release cycles, onboarding, analytics, edge cases, paywalls, QA, copy, and user behaviour decide whether the thing survives real use.

Agents compress implementation, but they do not remove product judgement.

A training-plan generator is not valuable because it can produce a week of workouts.

It becomes valuable when it handles messy human inputs, protects users from unsafe recommendations, explains trade-offs, adapts to adherence, and earns trust over repeated use.

Macro-evals and skills belong in that gap.

They turn repeated use into learning.

The cost argument will get louder

Agent costs will rise in strange places.

Teams will add more tools. Context windows will grow. Multi-agent workflows will expand. Reasoning models will be used more often because they produce better work. Traces will get longer. Review loops will mature.

Without memory discipline, companies will pay agents to rediscover rules the business already knows.

A coding agent should not spend premium reasoning budget figuring out whether the repo needs pnpm test after a runtime change.

A support agent should not re-derive the escalation path for regulated complaints.

A training agent should not repeatedly negotiate whether soreness plus load spike plus poor recovery should reduce high-impact intensity.

A creative agent should not rediscover brand claims, proof standards, and banned phrases every time it drafts ads.

The system should know.

That does not mean every decision becomes rigid.

It means expensive reasoning is preserved for work that deserves live judgement.

The operator’s job is deciding what becomes reflex

The reflex layer is where agent systems become cheaper, safer, and more consistent.

Not because everything becomes automated.

Because repeated judgement stops being treated as a fresh act of intelligence every time.

Some findings should become skills.

Some need scripts.

Some require evals.

Some belong in review queues.

Some should change permissions.

Some expose a product constraint the agent should never route around.

Macro-evals give the operator the map. They show which failures repeat, where they concentrate, and which part of the workflow deserves inspection first.

Skills give the operator one place to store the correction.

The companies that handle this well will not win because they have prettier markdown.

They will win because their agents stop making the same expensive mistake twice.

FAQ

What are agent skills?

Agent skills are reusable workflow packages (often a folder with a SKILL.md, description, instructions, and optional scripts) that encode procedural memory for agentic systems. They load when relevant work appears and turn repeated operational judgement into inspectable, testable rules.

What are macro-evals?

Macro-evals analyse many agent traces to find repeated behaviour patterns across runs. Lower-level evals grade individual agents, handoffs, tools, and completed runs; macro-evals cluster those findings to show where failures concentrate and which part of the workflow deserves inspection first.

When should a macro-eval finding become a skill?

A finding should become a skill when the correction is procedural: a reusable workflow rule, evidence standard, verification sequence, routing policy, or formatting requirement. Skills are weaker when the real issue is missing data, bad permissions, poor tool design, or an action the agent should not own.

Why do macro-evals matter for agentic systems?

Agentic systems fail inside the workflow, not only at the final response. A plausible answer can hide missed handoffs, ignored signals, late review, or repeated routing errors. Macro-evals make those trace-level patterns visible across many runs.

What is procedural memory in AI systems?

Procedural memory is the part of an agent system that handles repeated operational judgement without spending expensive reasoning on decisions it should already know how to handle. Skills, scripts, evaluators, review rules, and permission boundaries are all ways to store that memory.

Bring the product, workflow, or growth constraint

Model Operator helps founders, product leaders and commercial teams turn unclear product bets, AI workflows, growth systems and internal operations into shipped systems with ownership, measurement and a sharper definition of quality.

If the question inside the company is where agent failures should become operating memory instead of repeated correction, start with the trace.