What to evaluate
A good evaluation set covers more than final prose output. Use the categories below to build a comprehensive set of assertions.
Discovery quality
Test whether the runtime surfaces your package and skills at the right times:
- Does the package appear when a relevant task is requested?
- Are unrelated packages correctly excluded?
- Does the runtime distinguish between the base application contract and local operating guidance?
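The first two checks can be graded mechanically. A minimal sketch, assuming your runner can report which packages the runtime surfaced for a given prompt (the function and package names are illustrative):

```python
# Hypothetical discovery assertion: given the set of packages the runtime
# surfaced for a prompt, check that the expected package appears and that
# unrelated packages are excluded.
def check_discovery(surfaced, expected, excluded):
    """Return a list of failed assertions (empty means pass)."""
    failures = []
    if expected not in surfaced:
        failures.append(f"expected package not surfaced: {expected}")
    for name in excluded:
        if name in surfaced:
            failures.append(f"unrelated package surfaced: {name}")
    return failures

print(check_discovery(
    surfaced={"todo-app"},
    expected="todo-app",
    excluded={"image-editor"},
))  # → []
```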
APP.md contract clarity
Test whether the contract in `APP.md` is sufficient to operate the application without guessing:
- Is the `entry.command` unambiguous?
- Are all callable commands listed in `commands`?
- Is the state model described clearly enough to reason about mutations?
- Are confirmation rules documented for destructive commands?
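These questions can be turned into a contract lint. A sketch, assuming the contract has been parsed into a dict; the key names (`entry`, `commands`, `state`, `destructive`, `confirmation`) are hypothetical stand-ins for your real contract format:

```python
# Hypothetical lint over a parsed APP.md contract: flag missing entry
# commands, missing command lists, undocumented state, and destructive
# commands without a confirmation rule.
def lint_contract(contract):
    problems = []
    if not contract.get("entry", {}).get("command"):
        problems.append("entry.command missing or empty")
    if not contract.get("commands"):
        problems.append("commands list missing")
    if not contract.get("state"):
        problems.append("state model undocumented")
    for c in contract.get("commands", []):
        if c.get("destructive") and not c.get("confirmation"):
            problems.append(f"no confirmation rule for {c['name']}")
    return problems

contract = {
    "entry": {"command": "todo --json"},
    "commands": [
        {"name": "add"},
        {"name": "delete", "destructive": True, "confirmation": "--confirm"},
    ],
    "state": "Items live in .todo/state.json with stable integer IDs.",
}
print(lint_contract(contract))  # → []
```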
Local skill activation behavior
Test whether local skills load when they should and stay quiet when they should not:
- Does the skill activate on relevant operating tasks?
- Does the skill avoid activating on unrelated prompts?
- Does the skill complement `APP.md` rather than contradict it?
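The activation checks reduce to a table of prompts paired with the expected load decision. A minimal sketch, assuming your harness can query whether the runtime activated the skill for a prompt (the decider below is an illustrative stand-in):

```python
# Pair prompts with whether the local skill should load, then grade the
# runtime's actual activation decisions against that expectation table.
CASES = [
    ("complete the overdue items in the to-do app", True),
    ("write a haiku about autumn", False),
]

def grade_activation(decide):
    """`decide` maps a prompt to the runtime's activation decision.
    Returns (prompt, expected, actual) triples for every mismatch."""
    return [(p, want, decide(p)) for p, want in CASES if decide(p) != want]

# Stand-in decider for illustration: activates only on to-do-related prompts.
mismatches = grade_activation(lambda p: "to-do" in p)
print(mismatches)  # → []
```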
CLI execution quality
Test whether the documented commands actually work:
- Is `entry.command` callable?
- Do all listed commands in `APP.md` execute without error?
- Do commands behave consistently across repeated runs?
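The callability check is a subprocess run plus an exit-code assertion. A sketch; the real command comes from `entry.command` in `APP.md`, and a `python -c` one-liner stands in here so the example runs anywhere:

```python
import json
import subprocess
import sys

# Stand-in for the entry.command documented in APP.md.
entry_command = [sys.executable, "-c",
                 "import json; print(json.dumps({'ok': True}))"]

def check_callable(cmd):
    """Run the command, require a zero exit code, and parse its stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    assert result.returncode == 0, f"entry command failed: {result.stderr}"
    return json.loads(result.stdout)  # stdout must be parseable JSON

print(check_callable(entry_command))  # → {'ok': True}
```

Running the same check several times in a row gives a cheap consistency probe for the "repeated runs" assertion.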
JSON output stability
Test whether the CLI returns machine-readable output reliably:
- Does every successful command return parseable JSON on stdout?
- Do failure cases return structured JSON errors with non-zero exit codes?
- Do expected fields appear in both success and failure responses?
- Does the output shape remain stable across versions?
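The first three checks can be graded from a single run record. A sketch over raw stdout plus the exit code; the field names (`status`, `error`) are illustrative assumptions, not a fixed schema:

```python
import json

# Grade one CLI run for machine-readability: stdout must parse as JSON,
# required fields must be present, and failures must carry a structured
# error alongside their non-zero exit code.
def validate_output(stdout, exit_code, required_fields=("status",)):
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        return ["stdout is not parseable JSON"]
    failures = []
    for field in required_fields:
        if field not in payload:
            failures.append(f"missing field: {field}")
    if exit_code != 0 and "error" not in payload:
        failures.append("failure without structured error field")
    return failures

print(validate_output('{"status": "ok"}', 0))    # → []
print(validate_output('{"status": "fail"}', 1))  # → ['failure without structured error field']
```

Shape stability across versions is then a matter of diffing the set of observed fields between package revisions.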
State correctness after mutations
Test whether the application manages its owned state correctly:
- Does state persist correctly after `add`, `update`, and `complete` operations?
- Do IDs remain stable across runs?
- Does the application own its state independently of prompt memory or chat history?
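A persistence assertion re-reads state from disk after each mutation rather than trusting in-memory values. A sketch using a hypothetical stand-in for the application's real state file layout:

```python
import json
import pathlib
import tempfile

# Simulated state file: the app owns this on disk, not the chat history.
state_path = pathlib.Path(tempfile.mkdtemp()) / "state.json"

def save(items):
    state_path.write_text(json.dumps(items))

def load():
    return json.loads(state_path.read_text())

save([{"id": 1, "text": "write evals", "done": False}])  # add
items = load()
items[0]["done"] = True                                  # complete
save(items)

# Reload from disk: the ID must be stable and the mutation must persist.
reloaded = load()
assert reloaded[0]["id"] == 1 and reloaded[0]["done"] is True
print(reloaded)  # → [{'id': 1, 'text': 'write evals', 'done': True}]
```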
Confirmation behavior for destructive commands
Test whether destructive commands enforce their safety rules:
- Does a destructive command fail without the required confirmation flag?
- Does it return a structured error explaining the requirement?
- Does it succeed when the confirmation flag is provided?
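All three checks can be graded from a pair of run records: one without the flag, one with it. A sketch; the record shape and the `error`/`status` field names are illustrative assumptions:

```python
# Grade both halves of the confirmation contract from run records
# (exit code plus parsed JSON output), not from prose output.
def grade_destructive(without_flag, with_flag):
    failures = []
    if without_flag["exit_code"] == 0:
        failures.append("destructive command succeeded without confirmation")
    if "error" not in without_flag["output"]:
        failures.append("refusal lacked a structured error")
    if with_flag["exit_code"] != 0:
        failures.append("confirmed command failed")
    return failures

print(grade_destructive(
    without_flag={"exit_code": 1,
                  "output": {"error": "pass --confirm to delete"}},
    with_flag={"exit_code": 0, "output": {"status": "deleted"}},
))  # → []
```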
Time and token cost
Track the overhead the package introduces per run:
- How many tokens does loading the full package cost?
- How does that cost change with and without local skills loaded?
- Is the extra context cost justified by the improvement in outcomes?
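The with/without comparison reduces to a simple delta. A sketch; the token counts below are illustrative, and real numbers come from your runner's usage data:

```python
# Compare per-run context cost with and without local skills loaded.
def overhead(with_skills_tokens, without_skills_tokens):
    """Return the absolute and percentage token overhead of local skills."""
    delta = with_skills_tokens - without_skills_tokens
    pct = 100 * delta / without_skills_tokens
    return delta, round(pct, 1)

print(overhead(with_skills_tokens=5200, without_skills_tokens=4000))
# → (1200, 30.0)
```

Pairing that delta with the pass-rate difference between the two configurations answers the "is it justified" question directly.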
Start with realistic test cases
Build your eval set from cases that reflect how the package will actually be used. Each case should include:
- a realistic user or operator prompt
- the package path being evaluated
- expected commands or behaviors
- expected JSON fields or side effects
- optional input files or seed state
Example starting cases:
- Inspect an Agent Application and summarize its callable commands.
- Add and complete an item in the example to-do application.
- Attempt a destructive command without confirmation and verify that it fails safely.
- Compare a package run with and without its local operating skill.
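A case like these can be captured as a plain record mirroring the fields listed above. The path, command names, and field names are hypothetical:

```python
# One eval case as a plain record: prompt, package under test, and the
# expected commands, fields, and seed state.
case = {
    "prompt": "Add 'file taxes' to my to-do list, then mark it complete.",
    "package_path": "packages/todo-app",
    "expected_commands": ["todo add", "todo complete"],
    "expected_fields": ["status", "id"],
    "seed_state": None,  # optional input files or starting state
}

print(case["package_path"])  # → packages/todo-app
```

Keeping cases as data rather than code makes it easy to run the same set against multiple package revisions.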
Compare against a baseline
Run each case at least two ways:
- with the current package and local skills
- with the previous package revision or without the local skill
Write objective assertions first
Start with checks you can grade without human review:
- `APP.md`, `app/`, and `skills/` exist
- the documented `entry.command` is callable
- JSON output parses successfully
- expected fields are present in success and failure cases
- destructive commands require explicit confirmation
- state changes match the documented contract
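Objective checks grade best as named predicates over a run record, so a failure names exactly which assertion broke. A sketch covering three of the checks above; the record shape is a hypothetical stand-in:

```python
import json

def _payload(stdout):
    """Parse stdout as JSON, or return None if it does not parse."""
    try:
        return json.loads(stdout)
    except json.JSONDecodeError:
        return None

def grade(run):
    """Return the names of the objective checks that failed."""
    payload = _payload(run["stdout"])
    checks = {
        "json_parses": payload is not None,
        "fields_present": payload is not None
            and all(f in payload for f in run["expected_fields"]),
        "exit_code_matches": run["exit_code"] == run["expected_exit_code"],
    }
    return [name for name, ok in checks.items() if not ok]

print(grade({
    "stdout": '{"status": "ok", "id": 3}',
    "expected_fields": ["status", "id"],
    "exit_code": 0,
    "expected_exit_code": 0,
}))  # → []
```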
Track cost and drift
Collect per-run data to spot regressions early:
| Metric | Why it matters |
|---|---|
| Pass rate | Measures overall reliability |
| Failure category | Identifies whether failures are in design, instructions, or tooling |
| Duration | Tracks execution time per run |
| Total tokens | Measures context cost |
| CLI vs. APP.md alignment | Detects when docs drift away from implementation |
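The table's metrics roll up into a per-revision summary you can diff to spot drift. A sketch with an illustrative per-run record shape:

```python
from collections import Counter

# Aggregate per-run records into a summary that is easy to diff
# across package revisions.
def summarize(runs):
    return {
        "pass_rate": sum(r["passed"] for r in runs) / len(runs),
        "failure_categories": Counter(
            r["category"] for r in runs if not r["passed"]),
        "avg_duration_s": sum(r["duration_s"] for r in runs) / len(runs),
        "total_tokens": sum(r["tokens"] for r in runs),
    }

runs = [
    {"passed": True,  "category": None,      "duration_s": 2.0, "tokens": 900},
    {"passed": False, "category": "tooling", "duration_s": 3.0, "tokens": 1100},
]
print(summarize(runs))
# → pass_rate 0.5, failure_categories Counter({'tooling': 1}),
#   avg_duration_s 2.5, total_tokens 2000
```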
Use failures to refine the package
Read failures at three levels:
- Package design: wrong boundary between `APP.md`, `app/`, and local skills
- Instructions: unclear command semantics or missing safety defaults
- Tooling: weak discovery, weak JSON validation, or poor confirmation UX
The evaluation loop
1. Run the eval set against the package. Execute all test cases against your current package version. Collect outputs, exit codes, and token counts.
2. Grade objective assertions. Check each case against your defined assertions: JSON parses, fields are present, confirmation rules hold, state matches.
3. Review outputs and execution traces. Read failures and near-misses. Identify whether the problem is in package design, instructions, or tooling.
4. Tighten APP.md, descriptions, or local skills. Make targeted changes based on what you found. Improve the contract if commands are ambiguous. Improve descriptions if discovery is missing cases. Improve local skills if operating guidance is missing.