Every agent in niceeval is an adapter — a piece of code you write that knows how to drive a specific backend and translate its output into a standard event stream. The runner knows nothing about your agent’s wire protocol, CLI flags, or authentication; it only calls agent.send(input, ctx) and expects back a Turn. This page covers the two adapter factories: defineAgent for remote and in-process agents, and defineSandboxAgent for coding agents that run inside an isolated sandbox.

defineAgent

Use defineAgent for any agent you can drive in-process or over HTTP. The send function is your responsibility: call your code, fire a fetch, stream from a WebSocket — whatever your backend requires. Map the result to the standard event stream and return a Turn.
import { defineAgent } from "niceeval/adapter";

Options

name
string
required
A unique identifier for this agent. Experiment files reference agent objects directly, and reports use this name for grouping.
name: "my-agent",
capabilities
AgentCapabilities
Declares what the agent can do. The runner uses these flags to decide which methods appear on the eval’s t context. Omitting a capability hides the corresponding t methods at the TypeScript type level, surfacing misconfiguration at compile time rather than runtime.
capabilities: {
  conversation: true,       // allows multi-turn t.send and t.reply
  toolObservability: true,  // enables t.calledTool, t.event, etc.
},
conversation
boolean
The agent supports multi-turn sessions. Enables t.reply and t.newSession().
toolObservability
boolean
The agent produces action.* and subagent.* events. Enables t.calledTool, t.notCalledTool, t.toolOrder, t.usedNoTools, t.maxToolCalls, t.loadedSkill, t.calledSubagent, t.noFailedActions, t.event, t.notEvent, t.eventOrder, and t.eventsSatisfy.
workspace
boolean
The agent works on a file system. Enables t.sandbox.diff, t.fileChanged, t.fileDeleted, t.testsPassed, and t.scriptPassed. This flag is automatically set for defineSandboxAgent adapters.
send
(input: TurnInput, ctx: AgentContext) => Promise<Turn>
required
The core function that drives the agent. Called once per t.send() invocation. See the TurnInput and AgentContext sections below for parameter details. Must return a Turn (see the Turn section).
async send(input, ctx) {
  const res = await myAgent.handle(input.text, { signal: ctx.signal });
  return {
    events: toStreamEvents(res),
    data: res.json,
    status: "completed",
  };
},

The input parameter

text
string
The user message string for this turn. This is the value passed to t.send(text).

The ctx parameter (AgentContext)

signal
AbortSignal
An AbortSignal tied to the eval’s timeout. Pass it to any fetch calls or long-running async work so they cancel cleanly when the eval times out or is aborted by early-exit logic.
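For long-running work that is not a fetch call, the signal can be honored by hand. The sketch below is one possible way to do that: abortableDelay is a hypothetical helper name, not something niceeval provides, and it relies only on the standard AbortSignal API.

```typescript
// Hypothetical helper (not part of niceeval): a delay that honors an AbortSignal.
function abortableDelay(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    // If the eval has already timed out, fail fast without scheduling a timer.
    if (signal.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const timer = setTimeout(() => {
      // Normal completion: stop listening so the handler does not leak.
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    function onAbort(): void {
      // Abort arrived first: cancel the timer and reject immediately.
      clearTimeout(timer);
      reject(new Error("aborted"));
    }
    signal.addEventListener("abort", onAbort, { once: true });
  });
}
```

Inside send, a polling loop could call `await abortableDelay(500, ctx.signal)` between attempts so the loop ends as soon as the eval is cancelled.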
model
string | undefined
The model tier string requested by the experiment (e.g. "claude-opus-4-8"). When present, pass it to your backend’s model selection parameter. When absent, let your backend use its own default.
flags
Readonly<Record<string, unknown>>
Feature flags set by the experiment and transparently forwarded to the agent. Read these to toggle behaviors (e.g. ctx.flags.webResearch). The same flags are available on t.flags in the eval’s test function.
session
{ id?: string; isNew: boolean }
Session state for multi-turn conversations. id is an opaque string you assign after the first turn so subsequent turns can resume the session. isNew is true on the first turn or after the eval calls t.newSession().
if (!ctx.session.isNew && ctx.session.id) {
  // resume existing session
} else {
  // start a fresh session
}
ctx.session.id = responseBody.sessionId; // store for next turn
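To make the resume rule concrete, here is a minimal sketch that builds a request body from the session state. SessionState mirrors the documented shape of ctx.session; ChatRequest and its sessionId field are assumptions about an imaginary backend, not part of niceeval.

```typescript
// Mirrors the documented shape of ctx.session.
interface SessionState {
  id?: string;
  isNew: boolean;
}

// Hypothetical request body for an imaginary chat backend.
interface ChatRequest {
  message: string;
  sessionId?: string;
}

function buildChatRequest(text: string, session: SessionState): ChatRequest {
  // Resume only when this is not a fresh session AND an id was stored on an earlier turn.
  if (!session.isNew && session.id) {
    return { message: text, sessionId: session.id };
  }
  // First turn, or the eval called t.newSession(): ignore any stale id.
  return { message: text };
}
```

Checking isNew before id matters: after t.newSession() an id from the previous session may still be present, and sending it would silently continue the old conversation.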
log
(msg: string) => void
Writes a diagnostic message to the eval’s log. Useful for debugging adapter internals without polluting the test output.

The Turn return type

Your send function must return an object satisfying the Turn interface.
events
StreamEvent[]
required
The normalized standard event stream for this turn. This is the core product of your adapter — every scope-level assertion in the eval reads from it. See the StreamEvent section below for all event types.
data
unknown | undefined
Structured (non-text) output from the agent. Used by turn.outputEquals() and turn.outputMatches(). Set this when your agent returns a parsed object alongside its text response.
status
"completed" | "failed" | "waiting"
The outcome of this turn:
  • "completed" — the agent finished normally
  • "failed" — the agent encountered an error
  • "waiting" — the agent stopped at a human-in-the-loop (input.requested) prompt
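One way to pick the status is to derive it from the events you have already built. The following is a sketch under assumptions: deriveStatus is a hypothetical helper, and letting "failed" take precedence over "waiting" is a design choice made here, not a documented niceeval rule.

```typescript
type TurnStatus = "completed" | "failed" | "waiting";

// Minimal stand-in for StreamEvent: only the discriminant matters here.
interface EventLike {
  type: string;
}

// Hypothetical helper (not part of niceeval).
function deriveStatus(events: EventLike[], backendFailed = false): TurnStatus {
  // Assumption: an error event or a backend failure outranks a pending prompt.
  if (backendFailed || events.some((e) => e.type === "error")) return "failed";
  // The agent stopped at a human-in-the-loop prompt.
  if (events.some((e) => e.type === "input.requested")) return "waiting";
  return "completed";
}
```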
usage
Usage | undefined
Token counts for this turn. Provide these when your backend exposes them so niceeval can report costs and power t.maxTokens() / t.maxCost() assertions.
  • inputTokens: number
  • outputTokens: number
  • cacheReadTokens?: number
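Backends rarely use these exact field names, so a small mapping step is usually needed. The sketch below assumes a hypothetical snake_case payload (RawUsage); only the Usage fields come from this page.

```typescript
// Mirrors the documented Usage fields.
interface Usage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens?: number;
}

// Hypothetical raw usage payload; real backends name these fields differently.
interface RawUsage {
  prompt_tokens?: number;
  completion_tokens?: number;
  cached_tokens?: number;
}

function toUsage(raw: RawUsage | undefined): Usage | undefined {
  // Without both counts, return undefined rather than reporting misleading zeros.
  if (!raw || raw.prompt_tokens === undefined || raw.completion_tokens === undefined) {
    return undefined;
  }
  const usage: Usage = {
    inputTokens: raw.prompt_tokens,
    outputTokens: raw.completion_tokens,
  };
  // Only set the optional field when the backend actually reported it.
  if (raw.cached_tokens !== undefined) usage.cacheReadTokens = raw.cached_tokens;
  return usage;
}
```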

Complete example: in-process agent

// agents/my-agent.ts
import { defineAgent } from "niceeval/adapter";
import { classifyIntent } from "../src/agent.js";

export default defineAgent({
  name: "classify",
  capabilities: {},
  async send(input, ctx) {
    const result = await classifyIntent(input.text);
    return {
      events: [
        { type: "message", role: "assistant", text: JSON.stringify(result) },
      ],
      data: result,
      status: "completed",
    };
  },
});

Complete example: remote HTTP agent

// agents/support-bot.ts
import { defineAgent } from "niceeval/adapter";

export default defineAgent({
  name: "support-bot",
  capabilities: { conversation: true, toolObservability: true },
  async send(input, ctx) {
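    const r = await fetch(`${process.env.SUPPORT_BOT_URL}/chat`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ message: input.text }),
      signal: ctx.signal,             // cancel the request when the eval times out
    });
    const body = await r.json();
    return {
      events: toStreamEvents(body),   // your mapping function
      data: body.output,
      status: r.ok ? "completed" : "failed", // treat non-2xx responses as a failed turn
    };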
  },
});
Authentication (API keys, base URLs, tokens) belongs inside the adapter — read it from environment variables in the send closure. niceeval never sees it and never passes it via ctx. This keeps credential scope tight and lets the same adapter be used across environments simply by changing env vars.
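The toStreamEvents function used in the examples above is left to you, because it depends entirely on what your backend returns. As a rough sketch, assuming a hypothetical response shape (BotResponse, with reply, toolCalls, and error fields that are invented for illustration), the mapping might look like this:

```typescript
type JsonValue = string | number | boolean | null | JsonValue[] | { [key: string]: JsonValue };

// Subset of the StreamEvent union documented on this page.
type StreamEvent =
  | { type: "message"; role: "assistant" | "user"; text: string }
  | { type: "action.called"; callId: string; name: string; input: JsonValue }
  | { type: "action.result"; callId: string; output?: JsonValue;
      status: "completed" | "failed" | "rejected" }
  | { type: "error"; message: string };

// Hypothetical response body of an imaginary backend.
interface BotResponse {
  reply?: string;
  toolCalls?: { id: string; name: string; args: JsonValue; result?: JsonValue; ok: boolean }[];
  error?: string;
}

function toStreamEvents(body: BotResponse): StreamEvent[] {
  const events: StreamEvent[] = [];
  for (const call of body.toolCalls ?? []) {
    // Each tool call becomes a called/result pair linked by callId.
    events.push({ type: "action.called", callId: call.id, name: call.name, input: call.args });
    events.push({
      type: "action.result",
      callId: call.id,
      status: call.ok ? "completed" : "failed",
      ...(call.result !== undefined ? { output: call.result } : {}),
    });
  }
  // The final assistant text comes after the tool activity that produced it.
  if (body.reply) events.push({ type: "message", role: "assistant", text: body.reply });
  if (body.error) events.push({ type: "error", message: body.error });
  return events;
}
```

Emitting tool events before the assistant message is an ordering assumption; if your backend reports timestamps or an ordered log, preserve that order instead.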

defineSandboxAgent

Use defineSandboxAgent for coding agents that run as a CLI inside an isolated sandbox (Docker container or cloud VM). The runner provisions the sandbox and passes it via ctx.sandbox. Your send function installs the CLI, runs the agent with the task prompt, reads back the transcript, and parses it into the standard event stream.
import { defineSandboxAgent, shared } from "niceeval/adapter";
defineSandboxAgent accepts the same options as defineAgent (see above). The difference is that ctx.sandbox is always populated.

The ctx.sandbox field (Sandbox interface)

runCommand(cmd, args?, opts?)
(cmd: string, args?: string[], opts?: RunOpts) => Promise<CommandResult>
Runs a single command inside the sandbox. Returns { stdout, stderr, exitCode }.
const res = await ctx.sandbox.runCommand("npm", ["install"], { cwd: "/workspace" });
opts fields:
  • env?: Record<string, string> — extra environment variables merged into the command’s environment
  • cwd?: string — working directory override for this command
  • root?: boolean — run as root (false by default). Use for privileged setup steps like installing system packages.
// privileged: install a system package
await ctx.sandbox.runCommand("apt-get", ["install", "-y", "openjdk-17-jdk"], { root: true });

// non-privileged (default): run npm
await ctx.sandbox.runCommand("npm", ["install"]);
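Because the result carries an exitCode, a setup step that must succeed probably needs an explicit check. The helper below is a sketch, assuming runCommand resolves rather than throws on a non-zero exit (which the returned exitCode field suggests but this page does not state). expectSuccess is a hypothetical name, not a niceeval export.

```typescript
// Mirrors the documented return shape of runCommand.
interface CommandResult {
  stdout: string;
  stderr: string;
  exitCode: number;
}

// Hypothetical helper: turn a non-zero exit code into a thrown error.
function expectSuccess(result: CommandResult, label: string): CommandResult {
  if (result.exitCode !== 0) {
    // Keep only the last few stderr lines so the message stays readable.
    const tail = result.stderr.trim().split("\n").slice(-5).join("\n");
    throw new Error(`${label} failed with exit code ${result.exitCode}:\n${tail}`);
  }
  return result;
}
```

In an adapter this could wrap a call such as `expectSuccess(await ctx.sandbox.runCommand("npm", ["install"]), "npm install")`.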
runShell(script, opts?)
(script: string, opts?) => Promise<CommandResult>
Runs a multi-line shell script inside the sandbox. Accepts the same opts as runCommand. Useful for complex setup sequences.
await ctx.sandbox.runShell(`
  git config user.email "bot@example.com"
  git config user.name "Bot"
`);
readFile(path)
(path: string) => Promise<string>
Reads a file from the sandbox filesystem and returns its contents as a string.
writeFiles(files)
(files: Record<string, string>) => Promise<void>
Writes one or more files into the sandbox. Keys are paths, values are file contents.
await ctx.sandbox.writeFiles({
  "/workspace/.env": "API_KEY=test",
});
uploadFiles(files)
(files: SandboxFile[]) => Promise<void>
Uploads a batch of files (including binary) to the sandbox. Used internally by shared.prepareWorkspace to upload workspace fixture files.
runCommand(..., { cwd })
() => string
Returns the current working directory path inside the sandbox.
runCommand(..., { cwd: path })
(path: string) => void
Sets the default working directory for subsequent commands.
stop()
() => Promise<void>
Tears down and destroys the sandbox instance. Called automatically by the runner after the eval completes. You generally do not need to call this yourself.

shared helpers

The shared object from niceeval/adapter provides utilities that are common across all sandbox agent adapters, ensuring that workspace preparation, diff collection, validation, and observability injection work consistently regardless of which agent CLI you’re wrapping.
shared.prepareWorkspace(sandbox, fixture)
function
Uploads workspace files to the sandbox (hiding EVAL.ts and other test files to prevent the agent from seeing the answer), then runs git init && git commit to establish a baseline for later diffing.
shared.captureLatestJsonl(sandbox, dir)
function
Locates and reads the most recently modified .jsonl transcript file under dir. Used by adapters like claude-code that write transcripts to a well-known directory.
const raw = await shared.captureLatestJsonl(sb, "~/.claude/projects");
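A JSONL transcript holds one JSON object per line, so the first parsing step is usually a line split. The sketch below assumes the captured value is the file's text as a single string; parseJsonl is a hypothetical helper, and skipping malformed lines is a choice made here rather than documented niceeval behavior.

```typescript
// Hypothetical helper: parse JSONL text into records, skipping blank and malformed lines.
function parseJsonl(raw: string): unknown[] {
  const records: unknown[] = [];
  for (const line of raw.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      records.push(JSON.parse(trimmed));
    } catch {
      // A truncated line can occur if the CLI was killed mid-write; skip it.
    }
  }
  return records;
}
```

A transcript parser would then walk these records and map each one to a StreamEvent.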
shared.runValidation(sandbox, scripts, mode)
function
Uploads the test files (e.g. EVAL.ts) that were hidden during workspace preparation, then runs the Vitest suite and/or npm scripts to validate the agent’s output.
shared.injectO11yContext(sandbox, events)
function
Derives observability data from the standard event stream and writes it to __niceeval__/results.json inside the sandbox. This makes agent behavior visible to assertions in EVAL.ts.
// In EVAL.ts — read what the agent did:
const o11y = JSON.parse(readFileSync("__niceeval__/results.json", "utf-8")).o11y;
expect(o11y.shellCommands.map(c => c.command)).not.toContain("rm -rf /");

Complete example: claude-code adapter

// agents/claude-code.ts
import { defineSandboxAgent, shared } from "niceeval/adapter";
import { requireEnv } from "niceeval";

const auth = () => ({ ANTHROPIC_API_KEY: requireEnv("ANTHROPIC_API_KEY") });

export default defineSandboxAgent({
  name: "claude-code",
  async send(input, ctx) {
    const sb = ctx.sandbox!;

    // Install the CLI (privileged — npm global install)
    await sb.runCommand("npm", ["install", "-g", "@anthropic-ai/claude-code"], {
      root: true,
    });

    // Build the argument list
    const args = ["--print", "--dangerously-skip-permissions"];
    if (ctx.model) args.push("--model", ctx.model);
    if (ctx.flags.webResearch) args.push("--allowedTools", "WebSearch,WebFetch");
    if (!ctx.session.isNew && ctx.session.id) args.push("--resume", ctx.session.id);
    args.push(input.text);

    const res = await sb.runCommand("claude", args, { env: auth() });

    // Capture and parse the transcript
    const raw = await shared.captureLatestJsonl(sb, "~/.claude/projects");
    ctx.session.id = shared.sessionIdFromClaudeTranscript(raw);

    return {
      events: parseClaudeCode(raw),  // your transcript → StreamEvent[] parser
      status: res.exitCode === 0 ? "completed" : "failed",
    };
  },
});

StreamEvent union type

Every adapter must produce StreamEvent[]. This normalized stream is what all scope-level assertions in test(t) read from. If your backend uses a different representation, map it to these types in your send function.
type StreamEvent =
  | { type: "message"; role: "assistant" | "user"; text: string }
  | { type: "action.called"; callId: string; name: string; input: JsonValue }
  | { type: "action.result"; callId: string; output?: JsonValue;
      status: "completed" | "failed" | "rejected" }
  | { type: "subagent.called"; callId: string; name: string; remoteUrl?: string }
  | { type: "subagent.completed"; callId: string; output?: JsonValue;
      status: "completed" | "failed" }
  | { type: "input.requested"; request: InputRequest }
  | { type: "thinking"; text: string }
  | { type: "error"; message: string };
| Event type | Description |
| --- | --- |
| message | A text message from the assistant or user. t.reply is derived from all assistant messages in the stream. |
| action.called | A tool, skill, or action was invoked. callId links to the corresponding action.result. |
| action.result | The result of a tool call. Paired with action.called by callId. |
| subagent.called | The agent delegated to a sub-agent. |
| subagent.completed | A sub-agent delegation finished. |
| input.requested | The agent paused waiting for human input (HITL). Causes status: "waiting" on the Turn. |
| thinking | Reasoning text from a chain-of-thought model. Not counted as a reply message. |
| error | An error emitted by the agent during execution. t.notEvent("error") asserts none occurred. |
Skill loading (load_skill) is represented as an action.called event with name: "load_skill". The t.loadedSkill(name) assertion is syntactic sugar for t.calledTool("load_skill", { input: { skill: name } }) — no separate event type is needed.
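To illustrate how callId pairing and the load_skill convention fit together, here is a small sketch that works on a reduced copy of the event types. pairActions and loadedSkill are hypothetical local functions written for this example; the runner's own matching logic may differ.

```typescript
type JsonValue = string | number | boolean | null | JsonValue[] | { [key: string]: JsonValue };

interface ActionCalled { type: "action.called"; callId: string; name: string; input: JsonValue }
interface ActionResult {
  type: "action.result"; callId: string; output?: JsonValue;
  status: "completed" | "failed" | "rejected";
}
interface Message { type: "message"; role: "assistant" | "user"; text: string }
type ActionEvent = ActionCalled | ActionResult | Message;

interface PairedAction { call: ActionCalled; result?: ActionResult }

// Pair each action.called with its action.result by callId, in call order.
function pairActions(events: ActionEvent[]): PairedAction[] {
  const byCallId = new Map<string, PairedAction>();
  for (const event of events) {
    if (event.type === "action.called") {
      byCallId.set(event.callId, { call: event });
    } else if (event.type === "action.result") {
      const pair = byCallId.get(event.callId);
      if (pair) pair.result = event; // results without a matching call are ignored
    }
  }
  return Array.from(byCallId.values()); // Map preserves insertion order
}

// True when some action.called has name "load_skill" and input.skill equal to `skill`.
function loadedSkill(events: ActionEvent[], skill: string): boolean {
  return events.some(
    (e) =>
      e.type === "action.called" &&
      e.name === "load_skill" &&
      typeof e.input === "object" &&
      e.input !== null &&
      !Array.isArray(e.input) &&
      e.input["skill"] === skill,
  );
}
```

A call that never receives a result stays in the output with result left undefined, which is one way to spot tool calls that were still pending when the turn ended.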

Using agents in experiments

Once you’ve written an adapter, reference it from an experiment file:
// experiments/local.ts
import { defineExperiment } from "niceeval";
import myAgent from "../agents/my-agent.js";

export default defineExperiment({
  agent: myAgent,
  runs: 1,
});
agent
AgentAdapter
Agent adapter instance created with defineAgent or defineSandboxAgent.
runs
number
Number of attempts for each matched eval in this experiment.

Built-in agents

The following coding agent adapters are exported by niceeval and can be referenced from experiment files:

claude-code

Anthropic Claude Code CLI. Requires ANTHROPIC_API_KEY. Uses claude --print --dangerously-skip-permissions.

codex

OpenAI Codex CLI. Requires codex login or API key setup. Uses codex exec --json.

bub

Built-in bub coding agent. Same adapter shape as claude-code — use as a reference when writing your own sandbox adapter.
npx niceeval exp claude-code-local evals/fixtures/button --sandbox docker
npx niceeval exp codex-local evals/fixtures/button --sandbox docker

ctx vs t: two names, same data

The ctx object in your adapter’s send function and the t object in your eval’s test function share the same underlying data — t is the runner’s high-level view built on top of ctx.
| Concept | ctx (agent side) | t (eval side) |
| --- | --- | --- |
| Feature flags | ctx.flags | t.flags |
| Model | ctx.model | t.model |
| Abort signal | ctx.signal | t.signal |
| Logging | ctx.log() | t.log() |
| Session | ctx.session.id / isNew | t.newSession() |
| Sandbox | ctx.sandbox (raw Sandbox handle) | t.sandbox.diff, t.fileChanged, etc. (high-level view) |
Authentication details, CLI flags, and transcript locations are agent-local — they live inside send and are never exposed via ctx or t.