Anatomy of an eval
Every eval is created with defineEval and has three essential parts: a human-readable description that appears in reports, an agent reference that names the subject under test, and an async test(t) function where you express what success looks like.
defineEval accepts:
You must not provide an id or name field. niceeval derives the ID automatically from the file path.
Path as identity
The file path of an eval is its ID. niceeval strips the evals/ prefix and the .eval.ts suffix to produce a stable, human-readable identifier:
| File path | Eval ID |
|---|---|
| evals/weather/brooklyn.eval.ts | weather/brooklyn |
| evals/sql.eval.ts | sql |
| evals/fixtures/button/ (fixture directory) | fixtures/button |
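The mapping in the table can be sketched as a small pure function. This is an illustrative reimplementation, not niceeval's own code: the name evalIdFromPath is hypothetical, and both the separator normalization and the trailing-slash handling for fixture directories are assumptions inferred from the table rows.

```typescript
// Hypothetical helper, not part of niceeval's API. It mirrors the rule
// described above: drop the evals/ prefix and the .eval.ts suffix.
function evalIdFromPath(filePath: string): string {
  // Assumption: normalize Windows separators so IDs match across platforms.
  let id = filePath.replace(/\\/g, "/");
  // Strip the leading evals/ prefix.
  id = id.replace(/^evals\//, "");
  // Strip the .eval.ts suffix (eval files) and any trailing slash
  // (fixture directories, per the third table row).
  id = id.replace(/\.eval\.ts$/, "").replace(/\/$/, "");
  return id;
}

console.log(evalIdFromPath("evals/weather/brooklyn.eval.ts")); // weather/brooklyn
console.log(evalIdFromPath("evals/sql.eval.ts")); // sql
console.log(evalIdFromPath("evals/fixtures/button/")); // fixtures/button
```

Because the ID is a pure function of the path, renaming or moving a file changes its ID, which is worth keeping in mind when comparing results across runs.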
The eval lifecycle
Discovery
When you run npx niceeval exp ..., the runner recursively scans the evals/ directory for files ending in .eval.ts and for fixture directories (those containing a PROMPT.md). Default exports of defineEval(...) or an array of defineEval(...) calls are registered as individual evals.
Scheduling
The runner dispatches evals up to maxConcurrency at a time. Before dispatching an eval, it checks the fingerprint cache: if the eval’s source, its inputs, and the active agent haven’t changed since the last passing run, the cached result is replayed and the eval is skipped, saving both time and API cost.
agent.send
For each eval, the runner calls agent.send(input, ctx) with the text from your first t.send(...) call. The adapter drives the subject under test and returns a Turn containing the standard event stream. Multi-turn evals call agent.send once per await t.send(...).
Scoring
Once your test(t) function completes, the core evaluates every registered assertion against the collected turn data: value assertions (t.check, t.require), scoped assertions (t.succeeded(), t.calledTool(), etc.), and LLM-as-judge calls.
Outcome
All assertion results are folded into a single outcome by outcome.ts. The rules are deterministic and described in full in the Scoring page.
Report
The outcome and all assertion details stream to the console in real time. When the full run finishes, reporters write structured output (JUnit, JSON, etc.) and the .niceeval/<run>/ directory is populated with artifacts: summary.json, per-eval results, the event stream, transcript, generated-file diffs, and test output.
Outcome types
Each eval ends with exactly one outcome. Understanding what each outcome means helps you interpret run output and configure CI thresholds correctly.
passed
No execution errors. All gate assertions passed. All soft assertions met their thresholds. This is the unambiguous success state.
failed
An execution error occurred (timeout, thrown exception, author mistake), or at least one gate assertion did not pass. A failed eval is a hard signal that something is broken.
scored
All gate assertions passed, but at least one soft assertion fell below its threshold. This means “usable but there is a quality regression.” Scored evals do not fail the run by default, only under --strict.
skipped
Your test(t) function called t.skip("reason"), signaling that a prerequisite was missing or the eval does not apply to the current configuration. Skipped evals are excluded from pass-rate calculations.
When an eval is run more than once (runs > 1), the summary for that eval becomes a pass rate (percentage of runs that produced passed) and an average latency, rather than a single outcome.
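The repeated-run summary can be illustrated with a short sketch. The type and field names here (RunOutcome, RunResult, latencyMs, summarizeRuns) are hypothetical, and the sketch assumes that every run counts toward the denominator; how niceeval treats skipped runs inside a repeated eval is not stated above.

```typescript
// Hypothetical shapes for illustration only; not niceeval's real types.
type RunOutcome = "passed" | "failed" | "scored" | "skipped";

interface RunResult {
  outcome: RunOutcome;
  latencyMs: number; // assumed field name for per-run latency
}

// Collapses several runs of one eval into a pass rate and an average latency.
// Assumption: all runs count toward the denominator, and only "passed"
// counts toward the numerator.
function summarizeRuns(runs: RunResult[]): { passRate: number; avgLatencyMs: number } {
  if (runs.length === 0) {
    throw new Error("summarizeRuns needs at least one run");
  }
  const passedCount = runs.filter((r) => r.outcome === "passed").length;
  const totalLatency = runs.reduce((sum, r) => sum + r.latencyMs, 0);
  return {
    passRate: (passedCount / runs.length) * 100,
    avgLatencyMs: totalLatency / runs.length,
  };
}

const exampleRuns: RunResult[] = [
  { outcome: "passed", latencyMs: 100 },
  { outcome: "passed", latencyMs: 200 },
  { outcome: "scored", latencyMs: 300 },
  { outcome: "passed", latencyMs: 400 },
];
console.log(summarizeRuns(exampleRuns)); // { passRate: 75, avgLatencyMs: 250 }
```

Note that a scored run lowers the pass rate in this sketch, since only runs that produced passed are counted as successes.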
Gate vs soft assertions (brief introduction)
Every assertion carries a severity that determines how it affects the outcome:
- Gate assertions are hard requirements. If a gate assertion fails, the entire eval is failed. Use gate for facts that must be true: “the agent called the right tool”, “the output parsed as valid JSON”, “no shell commands failed.”
- Soft assertions are quality scores with a numeric threshold. If a soft assertion’s score falls below its threshold, the eval becomes scored rather than passed. Use soft for continuous judgments: similarity scoring, LLM-as-judge factuality, efficiency budgets you want to track without blocking CI.
The assertions in niceeval/expect carry sensible defaults (includes and equals default to gate; similarity and judge calls default to soft), and you can override the severity with a chain method.
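To make the interaction between severities and outcomes concrete, here is a minimal sketch of the folding rule described in this section and under Outcome types. It is not the contents of outcome.ts: the AssertionResult shape and the foldOutcome name are hypothetical, and skipped evals are left out for brevity.

```typescript
// Hypothetical result shapes; niceeval's real types may differ.
type AssertionResult =
  | { severity: "gate"; name: string; passed: boolean }
  | { severity: "soft"; name: string; score: number; threshold: number };

type Outcome = "passed" | "failed" | "scored";

// Folds assertion results into one outcome, following the rules above:
// execution error or failed gate -> failed; soft below threshold -> scored.
function foldOutcome(results: AssertionResult[], executionError = false): Outcome {
  if (executionError) return "failed";
  if (results.some((r) => r.severity === "gate" && !r.passed)) return "failed";
  if (results.some((r) => r.severity === "soft" && r.score < r.threshold)) return "scored";
  return "passed";
}

const results: AssertionResult[] = [
  { severity: "gate", name: "called the right tool", passed: true },
  { severity: "soft", name: "similarity", score: 0.62, threshold: 0.8 },
];
console.log(foldOutcome(results)); // scored
```

The ordering of the checks matters: a failed gate takes precedence over any soft score, so an eval with both a failed gate and a low soft score is failed, not scored.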
The *.eval.ts naming convention
niceeval discovers evals by scanning for files that match the *.eval.ts glob. A few conventions help keep a large eval suite organized:
- Files must end in .eval.ts to be discovered. Any other .ts file in evals/ is ignored.
- Use subdirectories to group related evals. evals/billing/refund.eval.ts produces ID billing/refund.
- Dataset files and helper utilities live alongside eval files but do not match *.eval.ts, so they are never mistakenly treated as evals.
- Fixture directories are discovered separately by the presence of PROMPT.md, not by filename pattern.
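The conventions above amount to a simple classification over file paths. The sketch below works on a plain list of paths rather than the file system, so it stays self-contained; classifyPaths and the Discovery shape are hypothetical names, and the real runner's scanning logic may differ in details such as ordering or how nested fixtures are handled.

```typescript
// Hypothetical result shape for illustration.
interface Discovery {
  evalFiles: string[]; // paths ending in .eval.ts
  fixtureDirs: string[]; // directories that contain a PROMPT.md
}

// Classifies a flat list of file paths using the two rules above.
// Everything else (helpers, datasets, fixture sources) is ignored.
function classifyPaths(paths: string[]): Discovery {
  const evalFiles: string[] = [];
  const fixtureDirs: string[] = [];
  const promptSuffix = "/PROMPT.md";
  for (const p of paths) {
    if (p.endsWith(".eval.ts")) {
      evalFiles.push(p);
    } else if (p.endsWith(promptSuffix)) {
      // The fixture is the directory that holds PROMPT.md.
      fixtureDirs.push(p.slice(0, -promptSuffix.length));
    }
  }
  return { evalFiles, fixtureDirs };
}

console.log(
  classifyPaths([
    "evals/sql.eval.ts",
    "evals/sql.dataset.ts",
    "evals/helpers.ts",
    "evals/fixtures/button/PROMPT.md",
    "evals/fixtures/button/src/index.ts",
  ]),
);
// { evalFiles: [ 'evals/sql.eval.ts' ], fixtureDirs: [ 'evals/fixtures/button' ] }
```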
Array exports and dataset fan-out
When a *.eval.ts file’s default export is an array of defineEval(...) calls, niceeval registers each element as a separate eval. This is the canonical way to evaluate an agent against a dataset.
The resulting evals receive IDs of the form <file-id>/<zero-padded-index>, for example sql/0000 and sql/0001, so they are stable and sortable regardless of dataset order changes.
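The ID scheme for fanned-out evals can be sketched in a few lines. The function name fanOutIds is hypothetical, and the four-digit padding width is an assumption taken from the sql/0000 example rather than a documented setting.

```typescript
// Hypothetical helper: builds one ID per array element using the
// <file-id>/<zero-padded-index> pattern described above.
function fanOutIds(fileId: string, count: number, width = 4): string[] {
  return Array.from(
    { length: count },
    (_, index) => `${fileId}/${String(index).padStart(width, "0")}`,
  );
}

console.log(fanOutIds("sql", 3)); // [ 'sql/0000', 'sql/0001', 'sql/0002' ]
```

Zero-padding is what keeps the IDs sortable as plain strings: without it, sql/10 would sort before sql/2.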
Related pages
- Agents & Adapters: how the agent field connects your eval to a subject under test.
- Scoring: the complete assertion vocabulary and outcome rules.
- Overview: the full architecture diagram and layer responsibilities.