defineEval call. The framework follows three core principles: path-as-identity (the file path is the eval’s ID), one file, one eval (or one array for dataset fan-out), and linear writing with inline assertions (you write checks right where the conversation happens). This page walks you through the full surface area of defineEval, from the simplest single-turn check to multi-turn conversations and dataset-driven suites.
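The path-as-identity principle can be sketched as a small pure function. This is only an illustration of the rule as described on this page: `evalIdFromPath` is a hypothetical helper, and the `evals` root directory is an assumption taken from the example paths used here, not a confirmed part of niceeval's API.

```typescript
// Hypothetical helper: models the documented path-to-ID rule,
// not niceeval's actual implementation.
function evalIdFromPath(filePath: string, root: string = "evals"): string | null {
  const suffix = ".eval.ts";
  const normalized = filePath.replace(/\\/g, "/"); // tolerate Windows separators
  if (!normalized.endsWith(suffix)) return null; // not an eval file, so no ID

  // Assumption: IDs are relative to the root directory.
  const prefix = root + "/";
  const relative = normalized.startsWith(prefix)
    ? normalized.slice(prefix.length)
    : normalized;

  return relative.slice(0, -suffix.length);
}

const refundId = evalIdFromPath("evals/billing/refund.eval.ts"); // "billing/refund"
const notAnEval = evalIdFromPath("evals/data/cases.yaml"); // null
```

Because the ID is derived from the path alone, renaming or moving a file changes the eval's ID.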
The defineEval shape
defineEval accepts a configuration object with the following fields:
Only files ending in .eval.ts are discovered by the runner. Use directory structure to express grouping: evals/billing/refund.eval.ts produces the ID billing/refund.
Single-turn evals
A single-turn eval sends one message and asserts on the agent’s response. Use t.send() to drive the conversation, then write scoped assertions (t.succeeded(), t.calledTool()) and value assertions (t.check()) immediately after.
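To make the "send, then assert right away" flow concrete, here is a self-contained sketch that runs against a fake context. Only the method names (send, succeeded, calledTool, check) come from this page. Everything else is an assumption for illustration: `makeFakeContext`, the scripted agent, the `issue_refund` tool name, the predicate form of the assertion, and the synchronous `send` (the real call is likely asynchronous).

```typescript
// Stand-in types: field names follow this page, the rest is assumed.
interface ToolCall { name: string; input?: unknown }
interface Turn {
  message: string;
  toolCalls: ToolCall[];
  status: "completed" | "failed" | "waiting";
}
interface CheckResult { label: string; passed: boolean }

// Hypothetical fake context driven by a scripted agent; NOT niceeval's implementation.
function makeFakeContext(agent: (message: string) => Turn) {
  const results: CheckResult[] = [];
  let last: Turn | undefined;
  return {
    results,
    send(message: string): Turn {
      last = agent(message);
      return last;
    },
    succeeded(): void {
      results.push({ label: "succeeded", passed: last?.status === "completed" });
    },
    calledTool(name: string): void {
      const hit = last?.toolCalls.some((c) => c.name === name) ?? false;
      results.push({ label: `calledTool(${name})`, passed: hit });
    },
    // Assumption: the assertion is modeled as a plain predicate.
    check<T>(value: T, assertion: (v: T) => boolean): void {
      results.push({ label: "check", passed: assertion(value) });
    },
  };
}

const t = makeFakeContext(() => ({
  message: "Your refund has been issued.",
  toolCalls: [{ name: "issue_refund" }],
  status: "completed",
}));

// Linear style: send one message, then assert immediately on that turn.
const turn = t.send("I want a refund for order 1234");
t.succeeded();
t.calledTool("issue_refund");
t.check(turn.message, (m) => m.toLowerCase().includes("refund"));
```

In a real eval the same four lines at the bottom would sit inside the eval body, with the context supplied by the runner instead of a fake.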
The Turn object
t.send(message) returns a Turn — an immutable snapshot of that exchange:
| Property | Type | Description |
|---|---|---|
| turn.events | StreamEvent[] | The normalized event stream, the primary source of truth |
| turn.data | unknown | Structured output, if the agent returned one |
| turn.status | "completed" \| "failed" \| "waiting" | Completion status of the turn |
| turn.usage | Usage \| undefined | Optional token usage for this turn |
| turn.message | string | Convenience: the assistant’s text reply (derived from events) |
| turn.toolCalls | ToolCall[] | Convenience: tool calls made this turn (derived from events) |
t.reply is a shorthand for the last assistant message across the whole session.
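The "derived from events" properties in the table above can be pictured as simple folds over the event stream. The real StreamEvent union is not shown on this page, so the two event shapes below (`text` and `tool_call`), the `lookup_order` tool name, and both helper functions are hypothetical stand-ins.

```typescript
// Hypothetical event shapes; the real StreamEvent type may look different.
type StreamEvent =
  | { type: "text"; text: string }
  | { type: "tool_call"; name: string; input: unknown };

interface ToolCall { name: string; input: unknown }

// Sketch of turn.message: concatenate the text events in order.
function deriveMessage(events: StreamEvent[]): string {
  let out = "";
  for (const e of events) {
    if (e.type === "text") out += e.text;
  }
  return out;
}

// Sketch of turn.toolCalls: keep only the tool-call events, in order.
function deriveToolCalls(events: StreamEvent[]): ToolCall[] {
  const calls: ToolCall[] = [];
  for (const e of events) {
    if (e.type === "tool_call") calls.push({ name: e.name, input: e.input });
  }
  return calls;
}

const sampleEvents: StreamEvent[] = [
  { type: "text", text: "Checking your order. " },
  { type: "tool_call", name: "lookup_order", input: { id: 1234 } },
  { type: "text", text: "Refund issued." },
];

const sampleMessage = deriveMessage(sampleEvents); // "Checking your order. Refund issued."
const sampleCalls = deriveToolCalls(sampleEvents); // one call, to lookup_order
```

Treating the events as the source of truth means the convenience properties can never disagree with the raw stream.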
Key t properties
| Member | Description |
|---|---|
| t.reply | Last assistant message text |
| t.flags | Runtime flags passed via CLI |
| t.log(msg) | Emit a structured log line into the eval’s trace |
| t.skip(reason) | Mark this eval as skipped and halt execution |
Multi-turn evals
For multi-turn conversations, assign each t.send() call to a variable and assert on it right away. This keeps assertions co-located with the turn they describe, making failures easy to trace.
Parallel sessions
When you need independent conversation threads running concurrently within one eval, call t.newSession() to open a fresh session that doesn’t share history with the current one.
The eval context t
The t argument passed to test() is the eval context. Its available methods depend on what capabilities the connected agent declares, but the core interface is always present:
Core t methods and properties
| Member | Description |
|---|---|
| t.send(message) | Send a message to the agent; returns a Turn |
| t.reply | Shorthand for the last assistant message |
| t.check(value, assertion) | Record a value-level assertion immediately |
| t.require(value, assertion) | Like t.check, but throws on failure; use for preconditions |
| t.succeeded() | Scoped: assert the run completed without failure |
| t.calledTool(name, opts?) | Scoped: assert a tool was called (with optional matching) |
| t.judge | LLM-as-judge sub-interface |
| t.flags | CLI flags for this run |
| t.log(msg) | Emit a structured log line |
| t.skip(reason) | Skip this eval |
| t.newSession() | Open a new independent conversation session |
| t.usage | { inputTokens, outputTokens, cacheReadTokens? … } |
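The difference between t.check and t.require in the table above is easiest to see in a toy model. The sketch below assumes the assertion is a plain predicate and that a failed require is both recorded and thrown. `FakeRecorder`, `runBody`, and the labels are hypothetical and exist only to illustrate "record and continue" versus "record and halt".

```typescript
type Assertion<T> = (value: T) => boolean; // assumption: predicate form

// Hypothetical recorder; NOT niceeval's implementation.
class FakeRecorder {
  readonly failures: string[] = [];

  // Records a failure and lets the eval body keep running.
  check<T>(value: T, assertion: Assertion<T>, label: string = "check"): void {
    if (!assertion(value)) this.failures.push(label);
  }

  // Records a failure and throws, so later steps never run.
  require<T>(value: T, assertion: Assertion<T>, label: string = "require"): void {
    if (!assertion(value)) {
      this.failures.push(label);
      throw new Error(`precondition failed: ${label}`);
    }
  }
}

function runBody(rec: FakeRecorder): string[] {
  const steps: string[] = [];
  const orderId: string | undefined = undefined; // pretend the agent returned no ID
  try {
    rec.check(1 + 1, (v) => v === 3, "sum-is-3"); // fails, execution continues
    steps.push("after check");
    rec.require(orderId, (v) => v !== undefined, "order-id-present"); // fails and throws
    steps.push("after require"); // never reached
  } catch {
    steps.push("halted");
  }
  return steps;
}

const recorder = new FakeRecorder();
const steps = runBody(recorder); // ["after check", "halted"]
```

A reasonable rule of thumb under this model: use check for the things you are measuring, and require for values that later steps cannot do without.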
Dataset fan-out
When a .eval.ts file’s default export is an array, niceeval fans it out into one eval per element. This is the idiomatic way to run many test cases from a single file.
The fanned-out evals get IDs such as sql/0000, sql/0001, and so on. You can filter to a single case by passing its full ID after the experiment selector, or run the whole file by passing the file-level prefix (npx niceeval exp local sql).
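The ID scheme and prefix filtering described above can be sketched as follows. The four-digit zero padding is inferred from the sql/0000 example. `fanOutIds` and `selectByPrefix` are hypothetical helpers, and the behavior past 9999 cases is not specified on this page.

```typescript
// Hypothetical helper: one ID per array element, zero-padded to four digits.
function fanOutIds(fileId: string, count: number): string[] {
  return Array.from(
    { length: count },
    (_, i) => `${fileId}/${String(i).padStart(4, "0")}`,
  );
}

// Hypothetical helper: a selector matches an exact ID or a whole path prefix.
function selectByPrefix(ids: string[], selector: string): string[] {
  return ids.filter((id) => id === selector || id.startsWith(selector + "/"));
}

const sqlIds = fanOutIds("sql", 3); // ["sql/0000", "sql/0001", "sql/0002"]
const wholeFile = selectByPrefix(sqlIds, "sql"); // all three cases
const singleCase = selectByPrefix(sqlIds, "sql/0001"); // ["sql/0001"]
```

Matching on `selector + "/"` rather than a bare string prefix keeps a selector like `sql` from accidentally picking up an unrelated file such as `sqlite`.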
loadYaml and loadJson are both available from niceeval/loaders. Both return the parsed document as a plain JavaScript object.
Sandbox fixtures
When evaluating a coding agent, the task lives on disk rather than in code. A fixture is a directory that niceeval discovers automatically, with no .eval.ts wrapper needed.
Any directory containing PROMPT.md is treated as a fixture, including arbitrarily nested ones (fixtures/api/auth/). EVAL.ts is hidden from the agent during execution: it is uploaded only after the agent finishes, so the agent cannot read the answers.
For programmatic control over fixtures, use defineAgentEval instead. See the Fixtures guide for the full picture.
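As a rough model of the fixture rules above, the sketch below works on a flat list of file paths instead of the real file system. `findFixtureDirs` and `uploadPlan` are hypothetical helpers, and the sample paths are made up. How niceeval actually walks directories, and whether PROMPT.md itself is uploaded to the sandbox, is not stated on this page.

```typescript
// Hypothetical path helpers for forward-slash paths.
const baseName = (p: string): string => p.slice(p.lastIndexOf("/") + 1);
const dirName = (p: string): string => p.slice(0, Math.max(p.lastIndexOf("/"), 0));

// A directory counts as a fixture when it directly contains PROMPT.md.
function findFixtureDirs(files: string[]): string[] {
  const dirs = new Set<string>();
  for (const f of files) {
    if (baseName(f) === "PROMPT.md") dirs.add(dirName(f));
  }
  return Array.from(dirs).sort();
}

// Split a fixture's files into what exists while the agent runs
// and what is added only afterwards (EVAL.ts).
function uploadPlan(files: string[], fixtureDir: string) {
  const inFixture = files.filter((f) => f.startsWith(fixtureDir + "/"));
  return {
    beforeRun: inFixture.filter((f) => baseName(f) !== "EVAL.ts"),
    afterRun: inFixture.filter((f) => baseName(f) === "EVAL.ts"),
  };
}

const sampleFiles = [
  "evals/fixtures/api/auth/PROMPT.md",
  "evals/fixtures/api/auth/EVAL.ts",
  "evals/fixtures/api/auth/src/login.ts",
  "evals/fixtures/notes.txt",
];

const fixtureDirs = findFixtureDirs(sampleFiles); // ["evals/fixtures/api/auth"]
const plan = uploadPlan(sampleFiles, fixtureDirs[0]);
```

Here `plan.afterRun` holds only the EVAL.ts path, which mirrors the point that the agent never sees the checks while it is working.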
Naming and organization conventions
File naming
Only files ending in .eval.ts are discovered. Use descriptive names that match the scenario: refund-request.eval.ts, not test1.eval.ts.
Directory grouping
Use directories to express feature areas. For example, evals/billing/refund.eval.ts gets the ID billing/refund. Directories become ID prefixes you can filter on.
Datasets
Store YAML and JSON datasets under evals/data/. This is a convention, not a requirement, but it keeps data files out of the eval index.
Fixtures
Store sandbox fixtures under evals/fixtures/. Again, this is a convention only: niceeval finds any directory with PROMPT.md.