An Agent Application is only useful if it improves outcomes in practice. The right eval loop tests both the package contract and the real CLI behavior.

What to evaluate

A good evaluation set covers more than final prose output. Use the categories below to build a comprehensive set of assertions.
Test whether the runtime surfaces your package and skills at the right times:
  • Does the package appear when a relevant task is requested?
  • Are unrelated packages correctly excluded?
  • Does the runtime distinguish between the base application contract and local operating guidance?
Test whether the contract in APP.md is sufficient to operate the application without guessing:
  • Is the entry.command unambiguous?
  • Are all callable commands listed in commands?
  • Is the state model described clearly enough to reason about mutations?
  • Are confirmation rules documented for destructive commands?
Test whether local skills load when they should and stay quiet when they should not:
  • Does the skill activate on relevant operating tasks?
  • Does the skill avoid activating on unrelated prompts?
  • Does the skill complement APP.md rather than contradict it?
Test whether the documented commands actually work:
  • Is entry.command callable?
  • Do all listed commands in APP.md execute without error?
  • Do commands behave consistently across repeated runs?
Test whether the CLI returns machine-readable output reliably:
  • Does every successful command return parseable JSON on stdout?
  • Do failure cases return structured JSON errors with non-zero exit codes?
  • Do expected fields appear in both success and failure responses?
  • Does the output shape remain stable across versions?
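These output checks can be graded mechanically. A minimal Python sketch of a response validator; the field names and the convention that failures carry an `error` field are assumptions for illustration, not part of any specific CLI:

```python
import json

def check_response(stdout: str, exit_code: int, required_fields: set[str]) -> list[str]:
    """Return a list of problems found in one CLI invocation's output."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        return [f"stdout is not valid JSON: {stdout!r}"]
    problems = []
    missing = required_fields - payload.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if exit_code != 0 and "error" not in payload:
        problems.append("non-zero exit without a structured 'error' field")
    if exit_code == 0 and "error" in payload:
        problems.append("exit 0 but payload reports an error")
    return problems

# A clean success case passes; a non-JSON response is flagged:
assert check_response('{"id": "t1", "status": "done"}', 0, {"id", "status"}) == []
assert check_response('oops', 1, {"error"}) != []
```

Running this validator over every invocation in the eval set turns "reliable JSON output" from an impression into a pass/fail count.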
Test whether the application manages its owned state correctly:
  • Does state persist correctly after add, update, and complete operations?
  • Do IDs remain stable across runs?
  • Does the application own its state independently of prompt memory or chat history?
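State ownership can be exercised with a toy model: write through the application's state file, reread it, and confirm IDs stay stable. The file layout and ID scheme below are assumptions, not the contract of any real application:

```python
import json
import pathlib
import tempfile

def add_item(state_file: pathlib.Path, text: str) -> str:
    """Append an item with a stable, monotonically assigned ID to a JSON state file."""
    if state_file.exists():
        state = json.loads(state_file.read_text())
    else:
        state = {"next_id": 1, "items": {}}
    item_id = f"t{state['next_id']}"
    state["items"][item_id] = {"text": text, "done": False}
    state["next_id"] += 1
    state_file.write_text(json.dumps(state))
    return item_id

with tempfile.TemporaryDirectory() as d:
    f = pathlib.Path(d) / "state.json"
    first = add_item(f, "renew certs")
    second = add_item(f, "rotate keys")
    # State lives on disk, independent of prompt memory or chat history:
    saved = json.loads(f.read_text())
    assert first == "t1" and second == "t2"
    assert saved["items"]["t1"]["text"] == "renew certs"
```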
Test whether destructive commands enforce their safety rules:
  • Does a destructive command fail without the required confirmation flag?
  • Does it return a structured error explaining the requirement?
  • Does it succeed when the confirmation flag is provided?
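All three safety assertions can be pinned down with a small model of the command. A sketch, where the `confirm` flag name and the error payload shape are assumptions:

```python
import json

def delete_item(item_id: str, confirm: bool = False) -> tuple[str, int]:
    """Return (json_stdout, exit_code) for a hypothetical destructive command."""
    if not confirm:
        # Fail safely: structured error, non-zero exit, nothing deleted.
        return (json.dumps({
            "error": "confirmation_required",
            "message": f"Re-run with --confirm to delete {item_id}",
        }), 1)
    return (json.dumps({"deleted": item_id}), 0)

# Without confirmation: structured refusal with a non-zero exit code.
out, code = delete_item("task-3")
assert code == 1 and json.loads(out)["error"] == "confirmation_required"

# With confirmation: the command succeeds.
out, code = delete_item("task-3", confirm=True)
assert code == 0 and json.loads(out)["deleted"] == "task-3"
```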
Track the overhead the package introduces per run:
  • How many tokens does loading the full package cost?
  • How does that cost change with and without local skills loaded?
  • Is the extra context cost justified by the improvement in outcomes?
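One way to make "is the cost justified" concrete is a per-extra-pass token budget. This is an illustrative heuristic with an assumed threshold, not an established metric:

```python
def overhead_justified(tokens_with: int, tokens_without: int,
                       passes_with: int, passes_without: int,
                       max_tokens_per_extra_pass: int = 500) -> bool:
    """Crude check: does the extra context cost buy enough extra passing cases?"""
    extra_tokens = tokens_with - tokens_without
    extra_passes = passes_with - passes_without
    if extra_passes <= 0:
        # No outcome improvement: only acceptable if it also costs nothing extra.
        return extra_tokens <= 0
    return extra_tokens / extra_passes <= max_tokens_per_extra_pass

assert overhead_justified(1200, 900, 9, 7)       # 300 tokens for 2 extra passes
assert not overhead_justified(2500, 900, 8, 7)   # 1600 tokens for 1 extra pass
```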

Start with realistic test cases

Build your eval set from cases that reflect how the package will actually be used. Each case should include:
  • a realistic user or operator prompt
  • the package path being evaluated
  • expected commands or behaviors
  • expected JSON fields or side effects
  • optional input files or seed state
Good starting cases:
  • Inspect an Agent Application and summarize its callable commands.
  • Add and complete an item in the example to-do application.
  • Attempt a destructive command without confirmation and verify that it fails safely.
  • Compare a package run with and without its local operating skill.

Compare against a baseline

Run each case at least two ways:
  • with the current package and local skills
  • with the previous package revision or without the local skill
Comparing against a baseline tells you whether the package is improving outcomes rather than only adding more context. A package that costs more tokens without improving pass rate is not an improvement.
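The comparison reduces to two deltas per eval run. A minimal sketch, assuming each run record carries a pass/fail flag and a token count:

```python
def compare_runs(baseline: list[dict], candidate: list[dict]) -> dict:
    """Summarize whether the candidate package beats the baseline.

    Each run record is {"passed": bool, "tokens": int}; both lists cover
    the same eval cases in the same order.
    """
    def pass_rate(runs):
        return sum(r["passed"] for r in runs) / len(runs)
    def avg_tokens(runs):
        return sum(r["tokens"] for r in runs) / len(runs)
    return {
        "pass_rate_delta": pass_rate(candidate) - pass_rate(baseline),
        "token_delta": avg_tokens(candidate) - avg_tokens(baseline),
    }

baseline = [{"passed": False, "tokens": 900}, {"passed": True, "tokens": 950}]
with_skill = [{"passed": True, "tokens": 1200}, {"passed": True, "tokens": 1250}]
delta = compare_runs(baseline, with_skill)
assert delta["pass_rate_delta"] == 0.5   # pass rate improved
assert delta["token_delta"] == 300.0     # at a measurable context cost
```

A positive pass-rate delta with a bounded token delta is the signal you are looking for; a zero pass-rate delta with a positive token delta is the failure mode the paragraph above warns about.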

Write objective assertions first

Start with checks you can grade without human review:
  • APP.md, app/, and skills/ exist
  • the documented entry.command is callable
  • JSON output parses successfully
  • expected fields are present in success and failure cases
  • destructive commands require explicit confirmation
  • state changes match the documented contract
Add human review after objective assertions pass. Human review is most useful for broader questions: Is the skill helping the agent choose the right command? Is the output clear enough to act on?
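The structural checks at the top of that list can be automated directly. A sketch that verifies the package layout described in this guide, demonstrated against a throwaway directory:

```python
import pathlib
import tempfile

def structural_checks(package_root: pathlib.Path) -> list[str]:
    """Objective layout checks that need no human review."""
    failures = []
    for required in ("APP.md", "app", "skills"):
        if not (package_root / required).exists():
            failures.append(f"missing {required}")
    return failures

with tempfile.TemporaryDirectory() as d:
    root = pathlib.Path(d)
    (root / "APP.md").write_text("# contract")
    (root / "app").mkdir()
    (root / "skills").mkdir()
    assert structural_checks(root) == []
    # A nonexistent package fails every check:
    assert structural_checks(root / "nope") == [
        "missing APP.md", "missing app", "missing skills",
    ]
```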

Track cost and drift

Collect per-run data to spot regressions early:
Metric                    | Why it matters
Pass rate                 | Measures overall reliability
Failure category          | Identifies whether failures are in design, instructions, or tooling
Duration                  | Tracks execution time per run
Total tokens              | Measures context cost
CLI vs. APP.md alignment  | Detects when docs drift away from implementation
Package docs can drift away from the implementation over time. Track whether live CLI behavior still matches APP.md as part of your regular eval cycle.
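Drift detection is a set comparison between the commands APP.md documents and the commands the live CLI actually exposes. A sketch, with the command names below purely illustrative:

```python
def detect_drift(documented: set[str], live: set[str]) -> dict[str, set[str]]:
    """Compare documented commands against what the CLI actually exposes."""
    return {
        "documented_but_missing": documented - live,  # docs promise more than the CLI delivers
        "live_but_undocumented": live - documented,   # CLI grew without a docs update
    }

documented = {"add", "complete", "list", "remove"}
live = {"add", "complete", "list", "archive"}
drift = detect_drift(documented, live)
assert drift["documented_but_missing"] == {"remove"}
assert drift["live_but_undocumented"] == {"archive"}
```

How you obtain each set depends on your stack: the documented set parses out of APP.md, and the live set comes from the CLI's own command listing, however it exposes one.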

Use failures to refine the package

Read failures at three levels:
  • Package design: wrong boundary between APP.md, app/, and local skills
  • Instructions: unclear command semantics or missing safety defaults
  • Tooling: weak discovery, weak JSON validation, or poor confirmation UX
If agents reinvent the same logic in every run, something is missing from APP.md or a local skill: improve the package contract or bundle better operating guidance so the logic ships with the package instead of being rediscovered each time.

The evaluation loop

1. Run the eval set against the package
   Execute all test cases against your current package version. Collect outputs, exit codes, and token counts.

2. Grade objective assertions
   Check each case against your defined assertions: JSON parses, fields are present, confirmation rules hold, state matches.

3. Review outputs and execution traces
   Read failures and near-misses. Identify whether the problem is in package design, instructions, or tooling.

4. Tighten APP.md, descriptions, or local skills
   Make targeted changes based on what you found. Improve the contract if commands are ambiguous. Improve descriptions if discovery is missing cases. Improve local skills if operating guidance is missing.

5. Rerun and compare the delta
   Run the full eval set again. Compare pass rate, failure categories, and token cost against the previous version. Stop when the package improves outcomes consistently and the extra context cost is justified.
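The loop's skeleton can be sketched in a few lines. The runner and grader below are stubs standing in for your actual harness; step 4 (tightening the package) happens outside the code, between iterations:

```python
def eval_loop(cases, run_case, grade, max_iterations=3):
    """Minimal run-grade-repeat loop: returns (iterations used, final pass rate)."""
    for iteration in range(1, max_iterations + 1):
        results = [grade(case, run_case(case)) for case in cases]
        pass_rate = sum(results) / len(results)
        if pass_rate == 1.0:
            return iteration, pass_rate
        # In a real loop you would tighten APP.md or skills here before rerunning.
    return max_iterations, pass_rate

# Stub runner and grader so the sketch is self-contained:
cases = ["add item", "complete item"]
run = lambda case: {"ok": True}
grade = lambda case, output: output["ok"]
iterations, rate = eval_loop(cases, run, grade)
assert iterations == 1 and rate == 1.0
```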