defineEval is the primary building block for writing evals in niceeval. You call it once per file, pass a configuration object describing what to test, and export the result as the default export. The runner discovers your file, derives its ID from the file path, and executes the test function under the selected experiment’s agent.
import { defineEval } from "niceeval";
import { includes } from "niceeval/expect";

export default defineEval({
  description: "Brooklyn weather query",
  async test(t) {
    await t.send("What's the weather like in Brooklyn today?");
    t.succeeded();
    t.calledTool("get_weather", { input: { city: "Brooklyn" }, count: 1 });
    t.check(t.reply, includes("sunny"));
  },
});
You must not provide an id or name field. niceeval derives the eval ID from the file path: evals/weather/brooklyn.eval.ts becomes weather/brooklyn. Rename the file to rename the ID — it never goes stale.
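To make the path-to-ID rule concrete, here is a minimal sketch of how such a derivation could work. The helper name deriveEvalId and the "evals" root default are assumptions for illustration; niceeval's actual implementation is not shown in this reference.

```typescript
// Hypothetical helper: illustrates the documented mapping only,
// not niceeval's real derivation code.
function deriveEvalId(filePath: string, root = "evals"): string {
  // Normalize Windows separators so the ID is the same on every platform.
  const normalized = filePath.replace(/\\/g, "/");
  // Drop the leading root directory when present.
  const prefix = root + "/";
  const relative = normalized.startsWith(prefix)
    ? normalized.slice(prefix.length)
    : normalized;
  // Strip the ".eval.ts" suffix; what remains is the ID.
  return relative.replace(/\.eval\.ts$/, "");
}

console.log(deriveEvalId("evals/weather/brooklyn.eval.ts")); // weather/brooklyn
```

Because the ID is a pure function of the path, moving or renaming the file is the only way to change it.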

defineEval options

description
string
A human-readable label for this eval. Appears in console output and reports. Does not affect the ID — that comes from the file path.
tags
string[]
An array of tag strings used to filter evals via --tag on the CLI. Tags let you group related evals (e.g. "billing", "regression", "slow") without changing the directory structure.
tags: ["billing", "regression"],
judge
JudgeConfig
Overrides the judge model for this specific eval. Takes precedence over the global judge.model set in defineConfig. The model field accepts any model string understood by your judge backend.
judge: { model: "anthropic/claude-opus-4-8" },
reporters
Reporter[]
A list of reporters applied only to this eval, in addition to (or instead of) the globally configured reporters. Useful when a specific eval needs specialized output formats.
timeoutMs
number
Per-eval timeout in milliseconds. Overrides the global timeoutMs in defineConfig. When the timeout elapses, the eval is marked failed with error: timeout and the runner moves on.
metadata
Record<string, unknown>
Arbitrary key-value pairs attached to this eval’s result record. Useful for downstream analysis, custom reporters, or dashboard annotations.
metadata: { owner: "platform-team", jira: "PLAT-1234" },
test
(t: TestContext) => Promise<void>
required
The async function that drives the agent and asserts results. Receives the test context t (see below). All assertions, sends, and judge calls live here.
async test(t) {
  const turn = await t.send("Summarize this document.");
  t.judge.autoevals.summarizes(document).atLeast(0.8);
  t.check(turn.message, includes("key finding"));
},

The test context: t

The t object is assembled by the runner based on the capabilities declared by the agent adapter. You get a different set of methods depending on what the agent supports. At the TypeScript level, methods that require a capability your agent hasn’t declared are simply not present on t — so misconfiguration shows up at compile time, not at runtime.

Always available

These methods are available regardless of which agent or capabilities are configured.
t.send(text)
(text: string) => Promise<Turn>
Sends a message to the agent and waits for the response. Returns a Turn object (see below). Each call to t.send counts as one turn; call it multiple times for multi-turn conversations (requires conversation capability).
const turn = await t.send("What is the capital of France?");
t.check(value, assertion)
(value: unknown, assertion: Assertion) => void
Evaluates assertion against value immediately and records the result. On failure, the eval is marked according to the assertion’s severity: a failed gate assertion marks the eval failed, while a failed soft assertion is recorded and the eval still passes. Does not throw; execution continues.
t.check(t.reply, includes("Paris"));
t.check(turn.data, equals({ intent: "refund" }));
t.require(value, assertion)
(value: unknown, assertion: Assertion) => void
Like t.check, but throws immediately if the assertion fails, aborting the rest of the test. Use this for preconditions where continuing would be meaningless.
t.require(turn.status, equals("completed")); // abort if the agent errored
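The difference between t.check and t.require can be sketched with a small stand-in context. Everything below (SketchContext, includesText, the Recorded shape) is hypothetical and only models the behavior described above, reading the severity note as "gate failures fail the eval, soft failures do not".

```typescript
type Severity = "gate" | "soft";

interface Assertion {
  label: string;
  severity: Severity;
  test(value: unknown): boolean;
}

interface Recorded {
  label: string;
  severity: Severity;
  passed: boolean;
}

// Hypothetical stand-in for the test context; not the real niceeval class.
class SketchContext {
  readonly results: Recorded[] = [];

  private record(value: unknown, assertion: Assertion): Recorded {
    const entry: Recorded = {
      label: assertion.label,
      severity: assertion.severity,
      passed: assertion.test(value),
    };
    this.results.push(entry);
    return entry;
  }

  // check: record the result and keep going, even on failure.
  check(value: unknown, assertion: Assertion): void {
    this.record(value, assertion);
  }

  // require: record the result, then abort the test if it failed.
  require(value: unknown, assertion: Assertion): void {
    if (!this.record(value, assertion).passed) {
      throw new Error(`required assertion failed: ${assertion.label}`);
    }
  }

  // Only failed gate assertions fail the eval in this sketch.
  outcome(): "passed" | "failed" {
    return this.results.some((r) => !r.passed && r.severity === "gate")
      ? "failed"
      : "passed";
  }
}

// Hypothetical matcher, loosely modeled on includes().
function includesText(needle: string, severity: Severity = "gate"): Assertion {
  return {
    label: `includes "${needle}"`,
    severity,
    test: (value) => typeof value === "string" && value.includes(needle),
  };
}
```

In this model a failed check leaves later assertions free to run, so one eval can report several problems at once, whereas require stops at the first broken precondition.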
t.log(msg)
(msg: string) => void
Writes a diagnostic message to the eval’s output log. Appears in .niceeval/ artifacts and in verbose console output. Useful for debugging flaky evals.
t.skip(reason)
(reason: string) => void
Marks this eval as skipped with the given reason and stops execution. Use when a prerequisite isn’t available (e.g. a required environment variable is missing).
if (!process.env.OPENAI_API_KEY) t.skip("OPENAI_API_KEY not set");
t.flags
Readonly<Record<string, unknown>>
The feature flags passed down from the experiment configuration. Read these inside your test to branch on experiment variables.
if (t.flags.useExtendedPrompt) { /* ... */ }
t.model
string | undefined
The model tier string that was passed to the agent for this run. Read-only. Useful for logging or conditional assertions.
t.signal
AbortSignal
The AbortSignal for this eval’s lifetime. Forwarded from the runner’s timeout and early-exit logic. Pass it into any custom async work you do inside test.

With conversation capability

Available when the agent declares capabilities: { conversation: true }.
t.reply
string
The text of the most recent assistant message across all turns. Equivalent to reading the last message event with role: "assistant". Shorthand for the common pattern of checking the final response.
t.check(t.reply, includes("confirmed"));
t.newSession()
() => void
Signals the runner to start a fresh conversation session for subsequent t.send calls. The current session’s history is discarded. Useful when you need to test multiple independent conversation threads within a single eval.
await t.send("Hello, remember my name is Alice.");
t.newSession();
const turn = await t.send("What is my name?");
// The agent should not remember — it's a new session.

With toolObservability capability

Available when the agent declares capabilities: { toolObservability: true }. All of these are scope-level assertions — they are evaluated after the test function returns, reading from the accumulated standard event stream.
t.calledTool(name, opts?)
(name: string, opts?: ToolMatchOpts) => void
Asserts the agent called the named tool. Optional opts narrow the match:
  • input — partial/deep match against call arguments (literal, regex against serialized form, or predicate)
  • count — exact number of times the tool was called
  • status — filter by call outcome ("completed" | "failed")
t.calledTool("get_weather", { input: { city: "Brooklyn" }, count: 1 });
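The three opts can be pictured as successive filters over the recorded tool calls. The sketch below is one plausible reading of "partial/deep match"; the ToolCall and ToolMatchOpts shapes here are simplified stand-ins, and the helper functions are hypothetical rather than part of niceeval.

```typescript
// Simplified stand-ins for the documented types (assumptions, not the real definitions).
type CallStatus = "completed" | "failed";
interface ToolCall { name: string; input: unknown; status: CallStatus }
type InputMatcher = Record<string, unknown> | RegExp | ((input: unknown) => boolean);
interface ToolMatchOpts { input?: InputMatcher; count?: number; status?: CallStatus }

// Partial deep match: every key in `expected` must match in `actual`;
// extra keys in `actual` are ignored. A RegExp is tested against strings
// directly and against the JSON-serialized form of anything else.
function partialDeepMatch(actual: unknown, expected: unknown): boolean {
  if (expected instanceof RegExp) {
    if (actual === undefined) return false;
    return expected.test(typeof actual === "string" ? actual : JSON.stringify(actual));
  }
  if (expected !== null && typeof expected === "object") {
    if (actual === null || typeof actual !== "object") return false;
    const record = actual as Record<string, unknown>;
    return Object.entries(expected).every(([key, value]) =>
      partialDeepMatch(record[key], value),
    );
  }
  return Object.is(actual, expected);
}

function matchesInput(input: unknown, matcher: InputMatcher): boolean {
  return typeof matcher === "function" ? matcher(input) : partialDeepMatch(input, matcher);
}

// Returns true when the calls satisfy the assertion described by name + opts.
function calledTool(calls: ToolCall[], name: string, opts: ToolMatchOpts = {}): boolean {
  const hits = calls.filter(
    (call) =>
      call.name === name &&
      (opts.status === undefined || call.status === opts.status) &&
      (opts.input === undefined || matchesInput(call.input, opts.input)),
  );
  // Without count, one matching call is enough; with count, the number must be exact.
  return opts.count === undefined ? hits.length > 0 : hits.length === opts.count;
}
```

Under this reading, { input: { city: "Brooklyn" } } still matches a call whose arguments also carry extra fields such as units.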
t.notCalledTool(name, opts?)
(name: string, opts?: ToolMatchOpts) => void
Asserts the agent did not call the named tool (with the given input, if provided). Accepts the same opts as t.calledTool.
t.notCalledTool("shell", { input: { command: /npm i/ } });
t.toolOrder(names)
(names: string[]) => void
Asserts the listed tools were called in the given relative order. Other tools may appear between them.
t.toolOrder(["read_file", "write_file"]);
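"Relative order, with other tools allowed in between" amounts to a subsequence test. A minimal sketch, assuming the observed tool names are available as a plain array (the function name inRelativeOrder is hypothetical):

```typescript
// Returns true when `expected` appears within `observed` in the same order,
// possibly with other entries interleaved (a subsequence check).
function inRelativeOrder(observed: string[], expected: string[]): boolean {
  let next = 0; // index of the next expected name still to be found
  for (const name of observed) {
    if (next < expected.length && name === expected[next]) next++;
  }
  return next === expected.length;
}

console.log(inRelativeOrder(["read_file", "grep", "write_file"], ["read_file", "write_file"])); // true
console.log(inRelativeOrder(["write_file", "read_file"], ["read_file", "write_file"])); // false
```

The same check would apply to event types if t.eventOrder follows the same relative-order rule.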
t.usedNoTools()
() => void
Asserts the agent made zero tool calls during this run. Useful for verifying lightweight responses that should not invoke any external actions.
t.maxToolCalls(n)
(n: number) => void
Asserts the total number of tool calls was at most n.
t.maxToolCalls(5);
t.loadedSkill(skillName)
(skillName: string) => void
Syntactic sugar for t.calledTool("load_skill", { input: { skill: skillName } }). Asserts the agent loaded the named skill.
t.loadedSkill("memory-v2");
t.calledSubagent(name, opts?)
(name: string, opts?: SubagentMatchOpts) => void
Asserts the agent delegated to a sub-agent with the given name. opts may include remoteUrl (string or RegExp) and output matchers.
t.calledSubagent("researcher", { remoteUrl: /api\.example/ });
t.noFailedActions()
() => void
Asserts that none of the tool calls, sub-agent calls, or skill loads ended with status: "failed".
t.event(type, opts?)
(type: string, opts?) => void
Asserts a specific event type appears in the raw event stream. opts may include count and data matchers.
t.event("input.requested", { count: 1 });
t.notEvent(type)
(type: string) => void
Asserts that the given event type does not appear anywhere in the event stream.
t.notEvent("error");
t.eventOrder(types)
(types: string[]) => void
Asserts event types appear in the given relative order in the stream.
t.eventOrder(["action.called", "subagent.called"]);
t.eventsSatisfy(label, predicate)
(label: string, predicate: (events: StreamEvent[]) => boolean) => void
Escape hatch for custom event-stream assertions. Receives the full raw event array and must return a boolean.
t.eventsSatisfy("reads before writes", (events) => {
  const readIdx = events.findIndex(e => e.type === "action.called" && e.name === "read_file");
  const writeIdx = events.findIndex(e => e.type === "action.called" && e.name === "write_file");
  // findIndex returns -1 when nothing matches, so require both events to exist.
  return readIdx !== -1 && writeIdx !== -1 && readIdx < writeIdx;
});

With workspace (sandbox) capability

Available for sandbox agents (those using defineSandboxAgent).
t.fileChanged(path)
(path: string) => void
Asserts the agent modified the file at the given workspace-relative path. Derived from git diff HEAD after the agent run.
t.fileDeleted(path)
(path: string) => void
Asserts the agent deleted the file at the given path.
t.sandbox.diff
DiffView
A queryable view of all changes the agent made to the workspace:
  • t.sandbox.diff.get(path) — returns the post-run content of the file at path
  • t.sandbox.diff.isEmpty() — asserts no files were changed
  • t.sandbox.diff.matches(re) — asserts the full diff text matches a regex
  • t.notInDiff(re) — asserts the diff does not match the regex (useful for detecting leaked secrets or banned patterns)
t.check(t.sandbox.diff.get("src/Button.tsx"), includes("onClick"));
t.notInDiff(/sk-[A-Za-z0-9]/); // no API keys in diff
commandSucceeded()
ValueAssertion
Use this matcher with t.check(await t.sandbox.runCommand(...), commandSucceeded()) to assert that a verification command exited with code 0.
t.sandbox.runCommand(command, args, opts)
Promise<CommandResult>
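Runs a command in the sandbox and resolves to a CommandResult. Pair it with commandSucceeded() to assert that a verification step (e.g. an npm script such as "build" or "lint") exited with code 0.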
t.check(await t.sandbox.runCommand("npm", ["run", "build"], { cwd: "/workspace" }), commandSucceeded());

Judge assertions

Available on any eval that has a judge model configured (globally or per-eval).
t.judge.autoevals.factuality(expected, opts?)
(expected: string, opts?) => JudgeAssertion
Uses the judge model to score factual consistency between the agent’s reply and the expected reference text. Returns a JudgeAssertion with .atLeast(threshold) for setting a soft threshold.
t.judge.autoevals.factuality("Paris is the capital of France.").atLeast(0.8);
t.judge.autoevals.closedQA(question, opts?)
(question: string, opts?) => JudgeAssertion
Asks the judge model a yes/no question about the reply. Returns a score between 0 and 1. Use .atLeast(threshold) to set a minimum passing score.
t.judge.autoevals.closedQA("Is the tone professional and on-topic?").atLeast(0.7);
t.judge.autoevals.summarizes(source, opts?)
(source: string, opts?) => JudgeAssertion
Asks the judge whether the agent’s reply faithfully summarizes the source text.
t.judge.autoevals.closedQA(rubric, opts?)
(rubric: string, opts?) => JudgeAssertion
closedQA also accepts a free-form rubric string in place of a yes/no question. Useful when none of the other built-in judge methods fit your evaluation criteria.
t.judge.autoevals.closedQA("Does the answer use simple language suitable for a 10-year-old?", {
  on: turn.message,
}).atLeast(0.6);
opts accepted by all judge methods:
  • on — the value to evaluate (defaults to t.reply)
  • model — overrides the judge model for this single call

Efficiency assertions

t.maxTokens(n)
(n: number) => TokenAssertion
Asserts the total token usage (input + output) for this run did not exceed n. Defaults to gate severity. Chain .atLeast(0.7) to downgrade it to a soft assertion.
t.maxTokens(50_000);          // gate: fails if over
t.maxTokens(80_000).atLeast(0.7);   // soft: recorded if over, but does not fail the eval
t.maxCost(usd)
(usd: number) => TokenAssertion
Asserts the estimated cost of this run (based on a price table) did not exceed usd dollars.
t.maxCost(0.50); // fail if over $0.50
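A cost estimate of this kind is usually token counts multiplied by per-token prices. The sketch below assumes a price table keyed per million tokens; the ModelPrice shape, the function names, and the example prices are all hypothetical, and cache-read tokens are left out because this reference does not say how they are billed.

```typescript
// Mirrors the documented Usage fields.
interface Usage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens?: number;
}

// Hypothetical price-table entry, in USD per one million tokens.
interface ModelPrice {
  inputPerMillion: number;
  outputPerMillion: number;
}

function estimateCostUsd(usage: Usage, price: ModelPrice): number {
  return (
    (usage.inputTokens * price.inputPerMillion +
      usage.outputTokens * price.outputPerMillion) /
    1_000_000
  );
}

// The budget check itself is a simple comparison against the estimate.
function withinBudget(usage: Usage, price: ModelPrice, maxUsd: number): boolean {
  return estimateCostUsd(usage, price) <= maxUsd;
}
```

With made-up prices of $3 and $15 per million input and output tokens, a run using 100,000 input and 20,000 output tokens would come to roughly $0.60 and exceed a $0.50 budget.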
t.usage
Usage
The accumulated token usage for the current run. Available to read at any point inside test.
t.check(t.usage.outputTokens, satisfies(n => n < 10_000, "output not verbose"));
Fields:
inputTokens
number
Total prompt tokens consumed.
outputTokens
number
Total completion tokens produced.
cacheReadTokens
number | undefined
Tokens served from prompt cache (if available).

The Turn return type

await t.send(text) returns a Turn object. It is immutable and carries everything the agent produced for that one round-trip.
events
StreamEvent[]
The raw standard event stream produced by the agent for this turn. All scope-level assertions read from this. See defineAgent reference for the full StreamEvent union type.
data
unknown | undefined
Structured output returned by the agent (e.g. a parsed JSON object). Used by turn.outputEquals() and turn.outputMatches().
status
"completed" | "failed" | "waiting"
The outcome of this turn. "waiting" means the agent stopped at a human-in-the-loop (input.requested) prompt.
message
string
Convenience field: the concatenated text of all message events with role: "assistant" in this turn. Derived from events.
toolCalls
ToolCall[]
Convenience field: the list of tool calls made during this turn, each with name, input, output, and status. Derived from events.
usage
Usage | undefined
Token usage for this specific turn (if reported by the agent).
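Since message and toolCalls are derived from events, their relationship can be sketched as a pair of folds over the event array. The event shapes below are assumptions made for illustration (the real StreamEvent union is defined in the defineAgent reference), and joining message text without a separator is also an assumption.

```typescript
// Hypothetical, simplified event shapes; not the real StreamEvent union.
type SketchEvent =
  | { type: "message"; role: "assistant" | "user"; text: string }
  | { type: "action.called"; name: string; input: unknown };

// Concatenate the text of every assistant message event, in stream order.
function deriveMessage(events: SketchEvent[]): string {
  let message = "";
  for (const event of events) {
    if (event.type === "message" && event.role === "assistant") {
      message += event.text;
    }
  }
  return message;
}

// Collect the names of the tools called during the turn, in stream order.
function deriveToolNames(events: SketchEvent[]): string[] {
  const names: string[] = [];
  for (const event of events) {
    if (event.type === "action.called") names.push(event.name);
  }
  return names;
}
```

The convenience fields save you from writing these loops yourself, while events stays available for anything they do not cover.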

Turn methods

turn.outputEquals(expected)
(expected: unknown) => void
Asserts that turn.data deeply equals expected. Equivalent to t.check(turn.data, equals(expected)) but scoped to the turn object.
const turn = await t.send("Return the intent as JSON.");
turn.outputEquals({ intent: "refund" });
turn.outputMatches(schema)
(schema: StandardSchema) => void
Validates turn.data against a Standard Schema (e.g. a Zod schema).
import { z } from "zod";
turn.outputMatches(z.object({ intent: z.enum(["refund", "ship"]) }));
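For context, a Standard Schema exposes a "~standard" property whose validate function returns either a value or a list of issues. The sketch below hand-writes a tiny schema and a synchronous validator to show that contract; validateSync, StandardSchemaLike, and intentSchema are hypothetical names, and this is not how niceeval necessarily implements outputMatches.

```typescript
interface Issue {
  readonly message: string;
}

type Result<T> =
  | { readonly value: T; readonly issues?: undefined }
  | { readonly issues: ReadonlyArray<Issue> };

// Trimmed-down version of the Standard Schema interface (types metadata omitted).
interface StandardSchemaLike<T> {
  readonly "~standard": {
    readonly version: 1;
    readonly vendor: string;
    readonly validate: (value: unknown) => Result<T> | Promise<Result<T>>;
  };
}

// Runs the schema and returns its result; async schemas are out of scope here.
function validateSync<T>(schema: StandardSchemaLike<T>, data: unknown): Result<T> {
  const result = schema["~standard"].validate(data);
  if (result instanceof Promise) {
    throw new TypeError("async schemas are not handled in this sketch");
  }
  return result;
}

// Hand-written schema equivalent to z.object({ intent: z.enum(["refund", "ship"]) }),
// except that it does not reject non-object input separately.
const intentSchema: StandardSchemaLike<{ intent: "refund" | "ship" }> = {
  "~standard": {
    version: 1,
    vendor: "example",
    validate(value) {
      const intent = (value as { intent?: unknown } | null)?.intent;
      return intent === "refund" || intent === "ship"
        ? { value: { intent } }
        : { issues: [{ message: 'intent must be "refund" or "ship"' }] };
    },
  },
};
```

A result with no issues corresponds to a passing assertion; any reported issue would be the natural failure message.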

defineAgentEval

defineAgentEval is a convenience wrapper for coding-agent evals that you want to define programmatically rather than via a fixture directory. It is equivalent to a fixture (PROMPT.md + EVAL.ts) but expressed entirely in TypeScript.
import { defineAgentEval } from "niceeval";
// Assumption: commandSucceeded is exported from the same module as includes.
import { includes, commandSucceeded } from "niceeval/expect";

export default defineAgentEval({
  description: "Rewrite callbacks to async/await",
  prompt: "Rewrite all callbacks in src/legacy.js to async/await, preserving behavior.",
  files: "./fixtures/legacy-callbacks",  // workspace starter files
  async test(t) {
    await t.run();                        // drives the sandbox agent
    t.fileChanged("src/legacy.js");
    t.check(t.sandbox.diff.get("src/legacy.js"), includes("await"));
    t.check(await t.sandbox.runCommand("npm", ["test"], { cwd: "/workspace" }), commandSucceeded());
  },
});
description
string
Human-readable description. Appears in reports.
prompt
string
required
The task prompt sent to the coding agent. Equivalent to the contents of PROMPT.md in a fixture directory.
files
string
Path to a local directory whose contents are uploaded to the sandbox as the agent’s starting workspace. Equivalent to the non-test files in a fixture directory.
test
(t: TestContext) => Promise<void>
required
The assertion function. Receives the full test context t. In addition to all standard t methods, t.run() is available to trigger the agent run explicitly.
t.run()
() => Promise<void>
Drives the sandbox agent with the configured prompt and files. You must call this before asserting any workspace state (diff, files, tests).

Dataset export (fan-out)

When a single eval file exports an array as its default export, niceeval fans it out into one eval per element. This is the canonical way to write parameterized test suites.
// evals/sql.eval.ts
import { defineEval } from "niceeval";
import { loadYaml } from "niceeval/loaders";
import { equals } from "niceeval/expect";

const doc = await loadYaml("evals/data/sql-cases.yaml");
const rows = doc.cases as { task: string; prompt: string; sql: string }[];

export default rows.map((row) =>
  defineEval({
    description: row.task,
    async test(t) {
      await t.send(row.prompt);
      t.succeeded();
      t.check(t.reply, equals(row.sql));
    },
  }),
);
Generated IDs use the file path as a prefix plus a zero-padded 4-digit index:
| Array index | Generated ID |
| --- | --- |
| 0 | sql/0000 |
| 1 | sql/0001 |
| 12 | sql/0012 |
IDs are stable as long as the array order doesn’t change, making them safe to reference in CI history, dashboards, and issue trackers. loadJson from niceeval/loaders works the same way as loadYaml.
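The ID scheme for fanned-out evals is simple enough to sketch directly. The helper name fanOutIds is hypothetical; it only reproduces the documented format of file ID plus a zero-padded 4-digit index.

```typescript
// Builds the IDs for an array export: "<file id>/<index padded to 4 digits>".
function fanOutIds(fileId: string, count: number): string[] {
  return Array.from(
    { length: count },
    (_, index) => `${fileId}/${String(index).padStart(4, "0")}`,
  );
}

console.log(fanOutIds("sql", 2)); // [ 'sql/0000', 'sql/0001' ]
```

Because the index is positional, inserting a row in the middle of the dataset shifts every later ID, so appending new cases at the end keeps existing IDs stable.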