An Agent Application is only useful if it improves outcomes in practice. The right eval loop tests both the package contract and the real CLI behavior.

What to evaluate

A good evaluation set covers more than final prose output. Use the categories below to build a comprehensive set of assertions.
Test whether the runtime surfaces your package and skills at the right times:
  • Does the package appear when a relevant task is requested?
  • Are unrelated packages correctly excluded?
  • Does the runtime distinguish between the base application contract and local operating guidance?
Test whether the contract in APP.md is sufficient to operate the application without guessing:
  • Is the entry.command unambiguous?
  • Are all callable commands listed in commands?
  • Is the state model described clearly enough to reason about mutations?
  • Are confirmation rules documented for destructive commands?
Test whether local skills load when they should and stay quiet when they should not:
  • Does the skill activate on relevant operating tasks?
  • Does the skill avoid activating on unrelated prompts?
  • Does the skill complement APP.md rather than contradict it?
Test whether the documented commands actually work:
  • Is entry.command callable?
  • Do all listed commands in APP.md execute without error?
  • Do commands behave consistently across repeated runs?
Test whether the CLI returns machine-readable output reliably:
  • Does every successful command return parseable JSON on stdout?
  • Do failure cases return structured JSON errors with non-zero exit codes?
  • Do expected fields appear in both success and failure responses?
  • Does the output shape remain stable across versions?
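These output checks can be graded mechanically. A minimal Python sketch of a response validator; the field names and the convention that failures carry an `error` field are assumptions for illustration, not part of any specific CLI:

```python
import json

def check_response(stdout: str, exit_code: int, required_fields: set[str]) -> list[str]:
    """Return a list of problems found in one CLI invocation's output."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        return [f"stdout is not valid JSON: {stdout!r}"]
    problems = []
    missing = required_fields - payload.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if exit_code != 0 and "error" not in payload:
        problems.append("non-zero exit without a structured 'error' field")
    if exit_code == 0 and "error" in payload:
        problems.append("exit 0 but payload reports an error")
    return problems

# A clean success case passes; a non-JSON response is flagged:
assert check_response('{"id": "t1", "status": "done"}', 0, {"id", "status"}) == []
assert check_response('oops', 1, {"error"}) != []
```

Running this validator over every invocation in the eval set turns "reliable JSON output" from an impression into a pass/fail count.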
Test whether the application manages its owned state correctly:
  • Does state persist correctly after add, update, and complete operations?
  • Do IDs remain stable across runs?
  • Does the application own its state independently of prompt memory or chat history?
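State ownership can be exercised with a toy model: write through the application's state file, reread it, and confirm IDs stay stable. The file layout and ID scheme below are assumptions, not the contract of any real application:

```python
import json
import pathlib
import tempfile

def add_item(state_file: pathlib.Path, text: str) -> str:
    """Append an item with a stable, monotonically assigned ID to a JSON state file."""
    if state_file.exists():
        state = json.loads(state_file.read_text())
    else:
        state = {"next_id": 1, "items": {}}
    item_id = f"t{state['next_id']}"
    state["items"][item_id] = {"text": text, "done": False}
    state["next_id"] += 1
    state_file.write_text(json.dumps(state))
    return item_id

with tempfile.TemporaryDirectory() as d:
    f = pathlib.Path(d) / "state.json"
    first = add_item(f, "renew certs")
    second = add_item(f, "rotate keys")
    # State lives on disk, independent of prompt memory or chat history:
    saved = json.loads(f.read_text())
    assert first == "t1" and second == "t2"
    assert saved["items"]["t1"]["text"] == "renew certs"
```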
Test whether destructive commands enforce their safety rules:
  • Does a destructive command fail without the required confirmation flag?
  • Does it return a structured error explaining the requirement?
  • Does it succeed when the confirmation flag is provided?
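All three safety assertions can be pinned down with a small model of the command. A sketch, where the `confirm` flag name and the error payload shape are assumptions:

```python
import json

def delete_item(item_id: str, confirm: bool = False) -> tuple[str, int]:
    """Return (json_stdout, exit_code) for a hypothetical destructive command."""
    if not confirm:
        # Fail safely: structured error, non-zero exit, nothing deleted.
        return (json.dumps({
            "error": "confirmation_required",
            "message": f"Re-run with --confirm to delete {item_id}",
        }), 1)
    return (json.dumps({"deleted": item_id}), 0)

# Without confirmation: structured refusal with a non-zero exit code.
out, code = delete_item("task-3")
assert code == 1 and json.loads(out)["error"] == "confirmation_required"

# With confirmation: the command succeeds.
out, code = delete_item("task-3", confirm=True)
assert code == 0 and json.loads(out)["deleted"] == "task-3"
```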
Track the overhead the package introduces per run:
  • How many tokens does loading the full package cost?
  • How does that cost change with and without local skills loaded?
  • Is the extra context cost justified by the improvement in outcomes?
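One way to make "is the cost justified" concrete is a per-extra-pass token budget. This is an illustrative heuristic with an assumed threshold, not an established metric:

```python
def overhead_justified(tokens_with: int, tokens_without: int,
                       passes_with: int, passes_without: int,
                       max_tokens_per_extra_pass: int = 500) -> bool:
    """Crude check: does the extra context cost buy enough extra passing cases?"""
    extra_tokens = tokens_with - tokens_without
    extra_passes = passes_with - passes_without
    if extra_passes <= 0:
        # No outcome improvement: only acceptable if it also costs nothing extra.
        return extra_tokens <= 0
    return extra_tokens / extra_passes <= max_tokens_per_extra_pass

assert overhead_justified(1200, 900, 9, 7)       # 300 tokens for 2 extra passes
assert not overhead_justified(2500, 900, 8, 7)   # 1600 tokens for 1 extra pass
```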

Start with realistic test cases

Build your eval set from cases that reflect how the package will actually be used. Each case should include:
  • a realistic user or operator prompt
  • the package path being evaluated
  • expected commands or behaviors
  • expected JSON fields or side effects
  • optional input files or seed state
Good starting cases:
  • Inspect an Agent Application and summarize its callable commands.
  • Add and complete an item in the example to-do application.
  • Attempt a destructive command without confirmation and verify that it fails safely.
  • Compare a package run with and without its local operating skill.

Compare against a baseline

Run each case at least two ways:
  • with the current package and local skills
  • with the previous package revision or without the local skill
Comparing against a baseline tells you whether the package is improving outcomes rather than only adding more context. A package that costs more tokens without improving pass rate is not an improvement.
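The comparison reduces to two deltas per eval run. A minimal sketch, assuming each run record carries a pass/fail flag and a token count:

```python
def compare_runs(baseline: list[dict], candidate: list[dict]) -> dict:
    """Summarize whether the candidate package beats the baseline.

    Each run record is {"passed": bool, "tokens": int}; both lists cover
    the same eval cases in the same order.
    """
    def pass_rate(runs):
        return sum(r["passed"] for r in runs) / len(runs)
    def avg_tokens(runs):
        return sum(r["tokens"] for r in runs) / len(runs)
    return {
        "pass_rate_delta": pass_rate(candidate) - pass_rate(baseline),
        "token_delta": avg_tokens(candidate) - avg_tokens(baseline),
    }

baseline = [{"passed": False, "tokens": 900}, {"passed": True, "tokens": 950}]
with_skill = [{"passed": True, "tokens": 1200}, {"passed": True, "tokens": 1250}]
delta = compare_runs(baseline, with_skill)
assert delta["pass_rate_delta"] == 0.5   # pass rate improved
assert delta["token_delta"] == 300.0     # at a measurable context cost
```

A positive pass-rate delta with a bounded token delta is the signal you are looking for; a zero pass-rate delta with a positive token delta is the failure mode the paragraph above warns about.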

Write objective assertions first

Start with checks you can grade without human review:
  • APP.md, app/, and skills/ exist
  • the documented entry.command is callable
  • JSON output parses successfully
  • expected fields are present in success and failure cases
  • destructive commands require explicit confirmation
  • state changes match the documented contract
Add human review after objective assertions pass. Human review is most useful for broader questions: Is the skill helping the agent choose the right command? Is the output clear enough to act on?
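The structural checks at the top of that list can be automated directly. A sketch that verifies the package layout described in this guide, demonstrated against a throwaway directory:

```python
import pathlib
import tempfile

def structural_checks(package_root: pathlib.Path) -> list[str]:
    """Objective layout checks that need no human review."""
    failures = []
    for required in ("APP.md", "app", "skills"):
        if not (package_root / required).exists():
            failures.append(f"missing {required}")
    return failures

with tempfile.TemporaryDirectory() as d:
    root = pathlib.Path(d)
    (root / "APP.md").write_text("# contract")
    (root / "app").mkdir()
    (root / "skills").mkdir()
    assert structural_checks(root) == []
    # A nonexistent package fails every check:
    assert structural_checks(root / "nope") == [
        "missing APP.md", "missing app", "missing skills",
    ]
```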

Track cost and drift

Collect per-run data to spot regressions early:
Metric                    | Why it matters
Pass rate                 | Measures overall reliability
Failure category          | Identifies whether failures are in design, instructions, or tooling
Duration                  | Tracks execution time per run
Total tokens              | Measures context cost
CLI vs. APP.md alignment  | Detects when docs drift away from implementation
Package docs can drift away from the implementation over time. Track whether live CLI behavior still matches APP.md as part of your regular eval cycle.
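Drift detection is a set comparison between the commands APP.md documents and the commands the live CLI actually exposes. A sketch, with the command names below purely illustrative:

```python
def detect_drift(documented: set[str], live: set[str]) -> dict[str, set[str]]:
    """Compare documented commands against what the CLI actually exposes."""
    return {
        "documented_but_missing": documented - live,  # docs promise more than the CLI delivers
        "live_but_undocumented": live - documented,   # CLI grew without a docs update
    }

documented = {"add", "complete", "list", "remove"}
live = {"add", "complete", "list", "archive"}
drift = detect_drift(documented, live)
assert drift["documented_but_missing"] == {"remove"}
assert drift["live_but_undocumented"] == {"archive"}
```

How you obtain each set depends on your stack: the documented set parses out of APP.md, and the live set comes from the CLI's own command listing, however it exposes one.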

Use failures to refine the package

Read failures at three levels:
  • Package design: wrong boundary between APP.md, app/, and local skills
  • Instructions: unclear command semantics or missing safety defaults
  • Tooling: weak discovery, weak JSON validation, or poor confirmation UX
If agents reinvent the same logic in every run, something is missing from APP.md or a local skill: improve the package contract or bundle better operating guidance so the logic ships with the package instead of being rediscovered each time.

The evaluation loop

1. Run the eval set against the package
   Execute all test cases against your current package version. Collect outputs, exit codes, and token counts.

2. Grade objective assertions
   Check each case against your defined assertions: JSON parses, fields are present, confirmation rules hold, state matches.

3. Review outputs and execution traces
   Read failures and near-misses. Identify whether the problem is in package design, instructions, or tooling.

4. Tighten APP.md, descriptions, or local skills
   Make targeted changes based on what you found. Improve the contract if commands are ambiguous. Improve descriptions if discovery is missing cases. Improve local skills if operating guidance is missing.

5. Rerun and compare the delta
   Run the full eval set again. Compare pass rate, failure categories, and token cost against the previous version. Stop when the package improves outcomes consistently and the extra context cost is justified.
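The loop's skeleton can be sketched in a few lines. The runner and grader below are stubs standing in for your actual harness; step 4 (tightening the package) happens outside the code, between iterations:

```python
def eval_loop(cases, run_case, grade, max_iterations=3):
    """Minimal run-grade-repeat loop: returns (iterations used, final pass rate)."""
    for iteration in range(1, max_iterations + 1):
        results = [grade(case, run_case(case)) for case in cases]
        pass_rate = sum(results) / len(results)
        if pass_rate == 1.0:
            return iteration, pass_rate
        # In a real loop you would tighten APP.md or skills here before rerunning.
    return max_iterations, pass_rate

# Stub runner and grader so the sketch is self-contained:
cases = ["add item", "complete item"]
run = lambda case: {"ok": True}
grade = lambda case, output: output["ok"]
iterations, rate = eval_loop(cases, run, grade)
assert iterations == 1 and rate == 1.0
```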