defineEval: declare, configure, and run evals in niceeval

defineEval is the primary building block for writing evals in niceeval. You typically call it once per file, pass a configuration object describing what to test, and export the result as the default export. The runner discovers your file, derives its ID from the file path, and executes the test function under the selected experiment’s agent.

import { defineEval } from "niceeval";
import { includes } from "niceeval/expect";

export default defineEval({
  description: "Brooklyn weather query",
  async test(t) {
    await t.send("What's the weather like in Brooklyn today?");
    t.succeeded();
    t.calledTool("get_weather", { input: { city: "Brooklyn" }, count: 1 });
    t.check(t.reply, includes("sunny"));
  },
});

You must not provide an id or name field. niceeval derives the eval ID from the file path: evals/weather/brooklyn.eval.ts becomes weather/brooklyn. Rename the file to rename the ID — it never goes stale.
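The path-to-ID rule can be sketched as a small pure function. This is an illustration only: `evalIdFromPath` is a hypothetical helper, and the `evals/` root and `.eval.ts` suffix are assumptions drawn from the single example above, not from niceeval's source.

```typescript
// Hypothetical helper, not part of niceeval. It mirrors the documented
// example: evals/weather/brooklyn.eval.ts -> weather/brooklyn.
function evalIdFromPath(filePath: string, root = "evals"): string {
  const normalized = filePath.replace(/\\/g, "/"); // tolerate Windows separators
  const prefix = root + "/";
  const relative = normalized.startsWith(prefix)
    ? normalized.slice(prefix.length)
    : normalized;
  // Strip the assumed ".eval.ts" suffix.
  return relative.replace(/\.eval\.ts$/, "");
}

console.log(evalIdFromPath("evals/weather/brooklyn.eval.ts")); // weather/brooklyn
```

Because the ID is a pure function of the path, moving or renaming the file is the only way to change it.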

defineEval options

description

string

A human-readable label for this eval. Appears in console output and reports. Does not affect the ID — that comes from the file path.

The test context: `t`

The t object is assembled by the runner based on the capabilities declared by the agent adapter. You get a different set of methods depending on what the agent supports. At the TypeScript level, methods that require a capability your agent hasn’t declared are simply not present on t — so misconfiguration shows up at compile time, not at runtime.

Always available

These methods are available regardless of which agent or capabilities are configured.

t.send(text)

(text: string) => Promise<Turn>

Sends a message to the agent and waits for the response. Returns a Turn object (see below). Each call to t.send counts as one turn; call it multiple times for multi-turn conversations (requires conversation capability).

const turn = await t.send("What is the capital of France?");

t.check(value, assertion)

(value: unknown, assertion: Assertion) => void

Evaluates assertion against value immediately and records the result. On failure, the outcome depends on the assertion’s severity: a gate assertion marks the eval as failed, while a soft assertion is recorded but leaves the eval passing. t.check never throws, so execution continues either way.

t.check(t.reply, includes("Paris"));
t.check(turn.data, equals({ intent: "refund" }));

t.require(value, assertion)

(value: unknown, assertion: Assertion) => void

Like t.check, but throws immediately if the assertion fails, aborting the rest of the test. Use this for preconditions where continuing would be meaningless.

t.require(turn.status, equals("completed")); // abort if the agent errored
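The difference between t.check and t.require can be modeled with a small recorder. This is a simplified sketch, not niceeval's implementation: `Recorder`, its `Assertion` shape, and the local `includes` helper are all hypothetical stand-ins.

```typescript
// Illustrative model only; these names are hypothetical, not niceeval internals.
type Assertion = { label: string; test: (value: unknown) => boolean };
type Recorded = { label: string; passed: boolean };

class Recorder {
  readonly results: Recorded[] = [];

  // Like t.check: record the outcome and keep going.
  check(value: unknown, assertion: Assertion): boolean {
    const passed = assertion.test(value);
    this.results.push({ label: assertion.label, passed });
    return passed;
  }

  // Like t.require: record the outcome, then throw on failure to abort the test.
  require(value: unknown, assertion: Assertion): void {
    if (!this.check(value, assertion)) {
      throw new Error(`required assertion failed: ${assertion.label}`);
    }
  }
}

// Local stand-in for a matcher such as includes(...).
const includes = (needle: string): Assertion => ({
  label: `includes ${needle}`,
  test: (v) => typeof v === "string" && v.includes(needle),
});

const recorder = new Recorder();
recorder.check("Berlin", includes("Paris")); // recorded as failed, execution continues
let aborted = false;
try {
  recorder.require("Berlin", includes("Paris")); // throws, so later steps are skipped
} catch {
  aborted = true;
}
console.log(recorder.results.length, aborted); // 2 true
```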

t.log(msg)

(msg: string) => void

Writes a diagnostic message to the eval’s output log. Appears in .niceeval/ artifacts and in verbose console output. Useful for debugging flaky evals.

t.skip(reason)

(reason: string) => void

Marks this eval as skipped with the given reason and stops execution. Use when a prerequisite isn’t available (e.g. a required environment variable is missing).

if (!process.env.OPENAI_API_KEY) t.skip("OPENAI_API_KEY not set");

t.flags

Readonly<Record<string, unknown>>

The feature flags passed down from the experiment configuration. Read these inside your test to branch on experiment variables.

if (t.flags.useExtendedPrompt) { /* ... */ }

t.model

string | undefined

The model tier string that was passed to the agent for this run. Read-only. Useful for logging or conditional assertions.

t.signal

AbortSignal

The AbortSignal for this eval’s lifetime. Forwarded from the runner’s timeout and early-exit logic. Pass it into any custom async work you do inside test.
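As a sketch of what "passing the signal into custom async work" can look like, the helper below is a delay that settles early when the signal aborts. `sleep` is a hypothetical helper, not a niceeval export; inside a real test you would pass t.signal where this example uses a local AbortController.

```typescript
// Hypothetical helper: a delay that rejects as soon as the signal aborts.
function sleep(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer); // stop the pending timer so nothing lingers
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

// A local controller stands in for the runner's timeout logic here.
const controller = new AbortController();
const pending = sleep(60_000, controller.signal);
controller.abort();
pending.catch((err: Error) => console.log(err.message)); // aborted
```

The same pattern applies to fetch calls, which accept the signal directly in their options.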

With `conversation` capability

Available when the agent declares capabilities: { conversation: true }.

t.reply

string

The text of the most recent assistant message across all turns. Equivalent to reading the last message event with role: "assistant". Shorthand for the common pattern of checking the final response.

t.check(t.reply, includes("confirmed"));

t.newSession()

() => void

Signals the runner to start a fresh conversation session for subsequent t.send calls. The current session’s history is discarded. Useful when you need to test multiple independent conversation threads within a single eval.

await t.send("Hello, remember my name is Alice.");
t.newSession();
const turn = await t.send("What is my name?");
// The agent should not remember — it's a new session.

With `toolObservability` capability

Available when the agent declares capabilities: { toolObservability: true }. All of these are scope-level assertions — they are evaluated after the test function returns, reading from the accumulated standard event stream.

t.calledTool(name, opts?)

(name: string, opts?: ToolMatchOpts) => void

Asserts the agent called the named tool. Optional opts narrow the match:

input — partial/deep match against call arguments (literal, regex against serialized form, or predicate)
count — exact number of times the tool was called
status — filter by call outcome ("completed" | "failed")

t.calledTool("get_weather", { input: { city: "Brooklyn" }, count: 1 });
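The three input forms listed above (literal, regex, predicate) combined with partial matching can be sketched as a recursive function. This is one plausible reading of the description, not niceeval's actual algorithm; `matchesInput` is a hypothetical name.

```typescript
// Hypothetical matcher model; niceeval's real matching rules may differ.
function matchesInput(expected: unknown, actual: unknown): boolean {
  // Predicate: delegate the decision to the caller.
  if (typeof expected === "function") return Boolean(expected(actual));
  // Regex: test against the serialized form of the actual value.
  if (expected instanceof RegExp) {
    const text = typeof actual === "string" ? actual : String(JSON.stringify(actual));
    return expected.test(text);
  }
  // Object: every expected key must match; extra keys on actual are ignored.
  if (expected !== null && typeof expected === "object") {
    if (actual === null || typeof actual !== "object") return false;
    const actualRecord = actual as Record<string, unknown>;
    return Object.entries(expected).every(([key, value]) =>
      matchesInput(value, actualRecord[key]),
    );
  }
  // Literal: strict equality.
  return expected === actual;
}

const callArgs = { city: "Brooklyn", units: "metric" };
console.log(matchesInput({ city: "Brooklyn" }, callArgs)); // true (partial match)
console.log(matchesInput({ city: /^Brook/ }, callArgs)); // true
console.log(matchesInput({ city: "Queens" }, callArgs)); // false
console.log(matchesInput((a: unknown) => a !== null, callArgs)); // true
```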

t.notCalledTool(name, opts?)

(name: string, opts?: ToolMatchOpts) => void

Asserts the agent did not call the named tool (with the given input, if provided). Accepts the same opts as t.calledTool.

t.notCalledTool("shell", { input: { command: /npm i/ } });

t.toolOrder(names)

(names: string[]) => void

Asserts the listed tools were called in the given relative order. Other tools may appear between them.

t.toolOrder(["read_file", "write_file"]);
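"Relative order with other tools allowed in between" amounts to a subsequence check. The sketch below shows that check in isolation; `inRelativeOrder` is a hypothetical helper, and the assumption is that the runner compares against the ordered list of called tool names.

```typescript
// Hypothetical check: do the expected names occur in `called` in this relative order?
function inRelativeOrder(called: string[], expected: string[]): boolean {
  let next = 0; // index of the next expected name still to be seen
  for (const name of called) {
    if (next < expected.length && name === expected[next]) next++;
  }
  return next === expected.length;
}

console.log(inRelativeOrder(["read_file", "grep", "write_file"], ["read_file", "write_file"])); // true
console.log(inRelativeOrder(["write_file", "read_file"], ["read_file", "write_file"])); // false
```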

t.usedNoTools()

() => void

Asserts the agent made zero tool calls during this run. Useful for verifying lightweight responses that should not invoke any external actions.

t.maxToolCalls(n)

(n: number) => void

Asserts the total number of tool calls was at most n.

t.maxToolCalls(5);

t.loadedSkill(skillName)

(skillName: string) => void

Syntactic sugar for t.calledTool("load_skill", { input: { skill: skillName } }). Asserts the agent loaded the named skill.

t.loadedSkill("memory-v2");

t.calledSubagent(name, opts?)

(name: string, opts?: SubagentMatchOpts) => void

Asserts the agent delegated to a sub-agent with the given name. opts may include remoteUrl (string or RegExp) and output matchers.

t.calledSubagent("researcher", { remoteUrl: /api\.example/ });

t.noFailedActions()

() => void

Asserts that none of the tool calls, sub-agent calls, or skill loads ended with status: "failed".

t.event(type, opts?)

(type: string, opts?) => void

Asserts a specific event type appears in the raw event stream. opts may include count and data matchers.

t.event("input.requested", { count: 1 });

t.notEvent(type)

(type: string) => void

Asserts that the given event type does not appear anywhere in the event stream.

t.notEvent("error");

t.eventOrder(types)

(types: string[]) => void

Asserts event types appear in the given relative order in the stream.

t.eventOrder(["action.called", "subagent.called"]);

t.eventsSatisfy(label, predicate)

(label: string, predicate: (events: StreamEvent[]) => boolean) => void

Escape hatch for custom event-stream assertions. Receives the full raw event array and must return a boolean.

t.eventsSatisfy("reads before writes", (events) => {
  const readIdx = events.findIndex(e => e.type === "action.called" && e.name === "read_file");
  const writeIdx = events.findIndex(e => e.type === "action.called" && e.name === "write_file");
  // findIndex returns -1 when nothing matches, so require both calls to exist.
  return readIdx !== -1 && writeIdx !== -1 && readIdx < writeIdx;
});

With `workspace` (sandbox) capability

Available for sandbox agents (those using defineSandboxAgent).

t.fileChanged(path)

(path: string) => void

Asserts the agent modified the file at the given workspace-relative path. Derived from git diff HEAD after the agent run.

t.fileDeleted(path)

(path: string) => void

Asserts the agent deleted the file at the given path.

t.sandbox.diff

DiffView

A queryable view of all changes the agent made to the workspace:

t.sandbox.diff.get(path) — returns the post-run content of the file at path
t.sandbox.diff.isEmpty() — asserts no files were changed
t.sandbox.diff.matches(re) — asserts the full diff text matches a regex
t.notInDiff(re) — asserts the diff does not match the regex (useful for detecting leaked secrets or banned patterns)

t.check(t.sandbox.diff.get("src/Button.tsx"), includes("onClick"));
t.notInDiff(/sk-[A-Za-z0-9]/); // no API keys in diff

commandSucceeded()

ValueAssertion

Use this matcher with t.check(await t.sandbox.runCommand(...), commandSucceeded()) to assert that a verification command exited with code 0.

t.sandbox.runCommand(command, args, opts)

Promise<CommandResult>

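Runs a command inside the sandbox and resolves with a CommandResult. It does not assert anything by itself: pair it with commandSucceeded() to check that, for example, an npm script such as "build" or "lint" exited with code 0.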

t.check(await t.sandbox.runCommand("npm", ["run", "build"], { cwd: "/workspace" }), commandSucceeded());

Judge assertions

Available on any eval that has a judge model configured (globally or per-eval).

t.judge.autoevals.factuality(expected, opts?)

(expected: string, opts?) => JudgeAssertion

Uses the judge model to score factual consistency between the agent’s reply and the expected reference text. Returns a JudgeAssertion with .atLeast(threshold) for setting a soft threshold.

t.judge.autoevals.factuality("Paris is the capital of France.").atLeast(0.8);

t.judge.autoevals.closedQA(question, opts?)

(question: string, opts?) => JudgeAssertion

Asks the judge model a yes/no question about the reply. Returns a score between 0 and 1. Use .atLeast(threshold) to set a minimum passing score.

t.judge.autoevals.closedQA("Is the tone professional and on-topic?").atLeast(0.7);

t.judge.autoevals.summarizes(source, opts?)

(source: string, opts?) => JudgeAssertion

Asks the judge whether the agent’s reply faithfully summarizes the source text.

t.judge.autoevals.closedQA(rubric, opts?)

(rubric: string, opts?) => JudgeAssertion

closedQA also covers free-form scoring against a custom rubric: phrase the rubric as a question. Useful when none of the other built-in judge methods fit your evaluation criteria.

t.judge.autoevals.closedQA("Does the answer use simple language suitable for a 10-year-old?", {
  on: turn.message,
}).atLeast(0.6);

opts accepted by all judge methods:

on — the value to evaluate (defaults to t.reply)
model — overrides the judge model for this single call

Efficiency assertions

t.maxTokens(n)

(n: number) => TokenAssertion

Asserts the total token usage (input + output) for this run did not exceed n. Defaults to gate severity. Chain .atLeast(0.7) to downgrade it to a soft assertion.

t.maxTokens(50_000);          // gate: fails if over
t.maxTokens(80_000).atLeast(0.7);   // soft: recorded, but does not fail the eval

t.maxCost(usd)

(usd: number) => TokenAssertion

Asserts the estimated cost of this run (based on a price table) did not exceed usd dollars.

t.maxCost(0.50); // fail if over $0.50
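A price-table estimate usually reduces to tokens multiplied by a per-token rate. The sketch below shows that arithmetic under stated assumptions: the `Price` shape, the per-million-token units, and the prices themselves are made up for illustration and do not come from niceeval.

```typescript
type Usage = { inputTokens: number; outputTokens: number };
type Price = { inputPerMTok: number; outputPerMTok: number }; // USD per million tokens

// Made-up prices for illustration only.
const examplePrice: Price = { inputPerMTok: 2, outputPerMTok: 10 };

function estimateCost(usage: Usage, price: Price): number {
  const micros =
    usage.inputTokens * price.inputPerMTok +
    usage.outputTokens * price.outputPerMTok;
  return micros / 1_000_000;
}

const cost = estimateCost({ inputTokens: 100_000, outputTokens: 20_000 }, examplePrice);
console.log(cost, cost <= 0.5); // 0.4 true
```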

t.usage

Usage

The accumulated token usage for the current run. Available to read at any point inside test.

t.check(t.usage.outputTokens, satisfies(n => n < 10_000, "output not verbose"));

Fields:

inputTokens

number

Total prompt tokens consumed.

outputTokens

number

Total completion tokens produced.

cacheReadTokens

number | undefined

Tokens served from prompt cache (if available).

The Turn return type

await t.send(text) returns a Turn object. It is immutable and carries everything the agent produced for that one round-trip.

events

StreamEvent[]

The raw standard event stream produced by the agent for this turn. All scope-level assertions read from this. See defineAgent reference for the full StreamEvent union type.

data

unknown | undefined

Structured output returned by the agent (e.g. a parsed JSON object). Used by turn.outputEquals() and turn.outputMatches().

status

"completed" | "failed" | "waiting"

The outcome of this turn. "waiting" means the agent stopped at a human-in-the-loop (input.requested) prompt.

message

string

Convenience field: the concatenated text of all message events with role: "assistant" in this turn. Derived from events.

toolCalls

ToolCall[]

Convenience field: the list of tool calls made during this turn, each with name, input, output, and status. Derived from events.

usage

Usage | undefined

Token usage for this specific turn (if reported by the agent).
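Since message and toolCalls are described as derived from events, the derivation can be sketched as two small folds over the event array. The event shapes below are assumptions for illustration (in particular the `text` field name); the real StreamEvent union is defined in the defineAgent reference.

```typescript
// Assumed event shapes for illustration only.
type StreamEvent =
  | { type: "message"; role: "assistant" | "user"; text: string }
  | { type: "action.called"; name: string; input: unknown };

// Concatenate the text of every assistant message event in the turn.
function deriveMessage(events: StreamEvent[]): string {
  let message = "";
  for (const e of events) {
    if (e.type === "message" && e.role === "assistant") message += e.text;
  }
  return message;
}

// Collect the names of the tools called during the turn, in order.
function deriveToolNames(events: StreamEvent[]): string[] {
  const names: string[] = [];
  for (const e of events) {
    if (e.type === "action.called") names.push(e.name);
  }
  return names;
}

const sampleEvents: StreamEvent[] = [
  { type: "message", role: "assistant", text: "Checking. " },
  { type: "action.called", name: "get_weather", input: { city: "Brooklyn" } },
  { type: "message", role: "assistant", text: "It is sunny." },
];
console.log(deriveMessage(sampleEvents)); // Checking. It is sunny.
console.log(deriveToolNames(sampleEvents)); // [ 'get_weather' ]
```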

Turn methods

turn.outputEquals(expected)

(expected: unknown) => void

Asserts that turn.data deeply equals expected. Equivalent to t.check(turn.data, equals(expected)) but scoped to the turn object.

const turn = await t.send("Return the intent as JSON.");
turn.outputEquals({ intent: "refund" });

turn.outputMatches(schema)

(schema: StandardSchema) => void

Validates turn.data against a Standard Schema (e.g. a Zod schema).

import { z } from "zod";
turn.outputMatches(z.object({ intent: z.enum(["refund", "ship"]) }));
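Any object exposing the Standard Schema `~standard.validate` function should work where a Zod schema does. The sketch below hand-writes such a schema and a validation helper to show the mechanics; `outputMatches` here is a hypothetical stand-alone function, not the turn method, and the interface is a trimmed copy of the Standard Schema v1 shape.

```typescript
// Minimal slice of the Standard Schema v1 interface.
type Issue = { message: string };
type Result<T> = { value: T; issues?: undefined } | { issues: ReadonlyArray<Issue> };
interface StandardSchema<T> {
  readonly "~standard": {
    readonly version: 1;
    readonly vendor: string;
    readonly validate: (value: unknown) => Result<T> | Promise<Result<T>>;
  };
}

// Hand-written schema standing in for z.object({ intent: z.enum(["refund", "ship"]) }).
type Intent = { intent: "refund" | "ship" };
const intentSchema: StandardSchema<Intent> = {
  "~standard": {
    version: 1,
    vendor: "example",
    validate(value) {
      const intent = (value as { intent?: unknown } | null)?.intent;
      return intent === "refund" || intent === "ship"
        ? { value: { intent } }
        : { issues: [{ message: "intent must be refund or ship" }] };
    },
  },
};

// Hypothetical stand-in for what a schema check on turn.data might do.
async function outputMatches<T>(schema: StandardSchema<T>, data: unknown): Promise<boolean> {
  const result = await schema["~standard"].validate(data);
  return result.issues === undefined;
}

outputMatches(intentSchema, { intent: "refund" }).then((ok) => console.log(ok)); // true
outputMatches(intentSchema, { intent: "cancel" }).then((ok) => console.log(ok)); // false
```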

defineAgentEval

defineAgentEval is a convenience wrapper for coding-agent evals that you want to define programmatically rather than via a fixture directory. It is equivalent to a fixture (PROMPT.md + EVAL.ts) but expressed entirely in TypeScript.

import { defineAgentEval } from "niceeval";
// Assumes commandSucceeded is exported from niceeval/expect alongside includes.
import { commandSucceeded, includes } from "niceeval/expect";

export default defineAgentEval({
  description: "Rewrite callbacks to async/await",
  prompt: "Rewrite all callbacks in src/legacy.js to async/await, preserving behavior.",
  files: "./fixtures/legacy-callbacks",  // workspace starter files
  async test(t) {
    await t.run();                        // drives the sandbox agent
    t.fileChanged("src/legacy.js");
    t.check(t.sandbox.diff.get("src/legacy.js"), includes("await"));
    t.check(await t.sandbox.runCommand("npm", ["test"], { cwd: "/workspace" }), commandSucceeded());
  },
});

description

string

Human-readable description. Appears in reports.

prompt

string

required

The task prompt sent to the coding agent. Equivalent to the contents of PROMPT.md in a fixture directory.

files

string

Path to a local directory whose contents are uploaded to the sandbox as the agent’s starting workspace. Equivalent to the non-test files in a fixture directory.

test

(t: TestContext) => Promise<void>

required

The assertion function. Receives the full test context t. In addition to all standard t methods, t.run() is available to trigger the agent run explicitly.

t.run()

() => Promise<void>

Drives the sandbox agent with the configured prompt and files. You must call this before asserting any workspace state (diff, files, tests).

Dataset export (fan-out)

When a single eval file exports an array as its default export, niceeval fans it out into one eval per element. This is the canonical way to write parameterized test suites.

// evals/sql.eval.ts
import { defineEval } from "niceeval";
import { loadYaml } from "niceeval/loaders";
import { equals } from "niceeval/expect";

const doc = await loadYaml("evals/data/sql-cases.yaml");
const rows = doc.cases as { task: string; prompt: string; sql: string }[];

export default rows.map((row) =>
  defineEval({
    description: row.task,
    async test(t) {
      await t.send(row.prompt);
      t.succeeded();
      t.check(t.reply, equals(row.sql));
    },
  }),
);

Generated IDs use the file path as a prefix plus a zero-padded 4-digit index:

Array index	Generated ID
0	`sql/0000`
1	`sql/0001`
12	`sql/0012`
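The ID scheme in the table can be reproduced with a one-line formatter. `fanOutId` is a hypothetical helper written to match the table, not a niceeval export.

```typescript
// Hypothetical helper: file-derived prefix plus a zero-padded 4-digit index.
function fanOutId(fileId: string, index: number): string {
  return `${fileId}/${String(index).padStart(4, "0")}`;
}

console.log([0, 1, 12].map((i) => fanOutId("sql", i)).join(", ")); // sql/0000, sql/0001, sql/0012
```

Because the index is positional, inserting a row in the middle of the dataset shifts every later ID; appending new rows at the end keeps existing IDs intact.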

IDs are stable as long as the array order doesn’t change, making them safe to reference in CI history, dashboards, and issue trackers. loadJson from niceeval/loaders works the same way as loadYaml.