What to evaluate
- Whether the Skill is triggered correctly.
- Whether the Skill’s instructions steer the agent through the expected flow.
- Whether the Skill improves pass rate, cost, or latency.
- Whether the Skill prevents wrong tools, wrong files, or wrong commands.
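The four questions above can be reduced to numbers that are compared between a with-Skill and a without-Skill group. The following is a minimal, framework-independent sketch: the `RunResult` shape, its field names, and the sample data are all hypothetical and are not part of any real adapter or eval API.

```typescript
// Hypothetical record of a single eval run; every field name is illustrative.
interface RunResult {
  passed: boolean;          // final verification succeeded
  skillTriggered: boolean;  // the agent actually loaded the Skill
  costUsd: number;          // total cost of the run
  latencyMs: number;        // wall-clock duration of the run
  forbiddenActions: number; // wrong tools, wrong files, or wrong commands used
}

interface Summary {
  passRate: number;
  triggerRate: number;
  meanCostUsd: number;
  meanLatencyMs: number;
  forbiddenActions: number;
}

// Arithmetic mean; returns 0 for an empty list to avoid NaN.
function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((a, b) => a + b, 0) / values.length;
}

// Collapse a group of runs into the metrics listed above.
function summarize(runs: RunResult[]): Summary {
  return {
    passRate: mean(runs.map((r) => (r.passed ? 1 : 0))),
    triggerRate: mean(runs.map((r) => (r.skillTriggered ? 1 : 0))),
    meanCostUsd: mean(runs.map((r) => r.costUsd)),
    meanLatencyMs: mean(runs.map((r) => r.latencyMs)),
    forbiddenActions: runs.reduce((n, r) => n + r.forbiddenActions, 0),
  };
}

// Made-up sample data, for illustration only.
const withSkill: RunResult[] = [
  { passed: true, skillTriggered: true, costUsd: 0.25, latencyMs: 1000, forbiddenActions: 0 },
  { passed: true, skillTriggered: true, costUsd: 0.75, latencyMs: 3000, forbiddenActions: 0 },
];
const withoutSkill: RunResult[] = [
  { passed: true, skillTriggered: false, costUsd: 0.5, latencyMs: 2000, forbiddenActions: 1 },
  { passed: false, skillTriggered: false, costUsd: 1.5, latencyMs: 4000, forbiddenActions: 2 },
];

const skillSummary = summarize(withSkill);
const baselineSummary = summarize(withoutSkill);
console.log({ skillSummary, baselineSummary });
```

Keeping the trigger rate separate from the pass rate helps distinguish a Skill that was never loaded from one that was loaded but did not help.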
Define the experiment and install the Skill
username/repo is the GitHub repository that hosts your Skill. The adapter for the selected agent runs npx skill add to install and configure the Skill automatically.
Write the eval
EVAL.ts
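As a rough illustration of the kind of checks an eval for a Skill might encode, here is a framework-independent sketch. The `Trace`, `ToolCall`, and `Expectation` types and the `evaluate` function are hypothetical names introduced for this example; they are assumptions and do not represent the actual eval API.

```typescript
// Hypothetical shapes for a recorded agent run; not a real framework API.
interface ToolCall {
  tool: string;   // name of the tool the agent invoked
  target: string; // file or command the call acted on
}

interface Trace {
  loadedSkills: string[];    // Skills the agent loaded during the run
  calls: ToolCall[];         // tool calls in the order they happened
  finalCheckPassed: boolean; // outcome of the task's verification script
}

interface Expectation {
  skill: string;            // Skill that should be triggered
  requiredOrder: string[];  // tools that must appear in this relative order
  forbiddenTools: string[]; // tools that must never be called
}

// True when `required` appears as an ordered subsequence of the calls made.
function followsFlow(calls: ToolCall[], required: string[]): boolean {
  let next = 0;
  for (const call of calls) {
    if (next < required.length && call.tool === required[next]) next++;
  }
  return next === required.length;
}

// Combine the final result with the behavioral constraints into one verdict.
function evaluate(trace: Trace, exp: Expectation) {
  const triggered = trace.loadedSkills.includes(exp.skill);
  const followedFlow = followsFlow(trace.calls, exp.requiredOrder);
  const avoidedForbidden = trace.calls.every((c) => !exp.forbiddenTools.includes(c.tool));
  const passed = trace.finalCheckPassed && triggered && followedFlow && avoidedForbidden;
  return { triggered, followedFlow, avoidedForbidden, passed };
}

// Made-up expectation and traces, for illustration only.
const expectation: Expectation = {
  skill: "username/repo",
  requiredOrder: ["read", "edit", "test"],
  forbiddenTools: ["deploy"],
};

const goodTrace: Trace = {
  loadedSkills: ["username/repo"],
  calls: [
    { tool: "read", target: "src/index.ts" },
    { tool: "edit", target: "src/index.ts" },
    { tool: "test", target: "npm test" },
  ],
  finalCheckPassed: true,
};

const badTrace: Trace = {
  loadedSkills: [],
  calls: [
    { tool: "edit", target: "src/index.ts" },
    { tool: "deploy", target: "production" },
  ],
  finalCheckPassed: true,
};

const goodVerdict = evaluate(goodTrace, expectation);
const badVerdict = evaluate(badTrace, expectation);
console.log(goodVerdict, badVerdict);
```

Note that `badTrace` fails even though its final check passed: the verdict requires the behavioral constraints to hold as well, which is the point of evaluating a Skill rather than only the task outcome.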
Run
Next steps
- Fixtures — organize tasks and verification scripts.
- Experiments — run with-Skill vs without-Skill controlled experiments.
- Scoring Guide — score both final results and behavioral constraints together.