defineAgent and defineSandboxAgent: adapter reference

Every agent in niceeval is an adapter — a piece of code you write that knows how to drive a specific backend and translate its output into a standard event stream. The runner knows nothing about your agent’s wire protocol, CLI flags, or authentication; it only calls agent.send(input, ctx) and expects back a Turn. This page covers the two adapter factories: defineAgent for remote and in-process agents, and defineSandboxAgent for coding agents that run inside an isolated sandbox.

defineAgent

Use defineAgent for any agent you can drive in-process or over HTTP. The send function is your responsibility: call your code, fire a fetch, stream from a WebSocket — whatever your backend requires. Map the result to the standard event stream and return a Turn.

import { defineAgent } from "niceeval/adapter";

Options

name

string

required

A unique identifier for this agent. Experiment files reference agent objects directly, and reports use this name for grouping.

name: "my-agent",

capabilities

AgentCapabilities

Declares what the agent can do. The runner uses these flags to decide which methods appear on the eval’s t context. Omitting a capability hides the corresponding t methods at the TypeScript type level, surfacing misconfiguration at compile time rather than runtime.

capabilities: {
  conversation: true,       // allows multi-turn t.send and t.reply
  toolObservability: true,  // enables t.calledTool, t.event, etc.
},

conversation

boolean

The agent supports multi-turn sessions. Enables t.reply and t.newSession().

toolObservability

boolean

The agent produces action.* and subagent.* events. Enables t.calledTool, t.notCalledTool, t.toolOrder, t.usedNoTools, t.maxToolCalls, t.loadedSkill, t.calledSubagent, t.noFailedActions, t.event, t.notEvent, t.eventOrder, and t.eventsSatisfy.

workspace

boolean

The agent works on a file system. Enables t.sandbox.diff, t.fileChanged, t.fileDeleted, t.testsPassed, and t.scriptPassed. This flag is automatically set for defineSandboxAgent adapters.

send

(input: TurnInput, ctx: AgentContext) => Promise<Turn>

required

The core function that drives the agent. Called once per t.send() invocation. See the TurnInput and AgentContext sections below for parameter details. Must return a Turn (see the Turn section).

async send(input, ctx) {
  // myAgent and toStreamEvents are placeholders for your own backend client and mapping function
  const res = await myAgent.handle(input.text, { signal: ctx.signal });
  return {
    events: toStreamEvents(res),
    data: res.json,
    status: "completed",
  };
},

The `input` parameter

text

string

The user message string for this turn. This is the value passed to t.send(text).

The `ctx` parameter (AgentContext)

signal

AbortSignal

An AbortSignal tied to the eval’s timeout. Pass it to any fetch calls or long-running async work so they cancel cleanly when the eval times out or is aborted by early-exit logic.
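fetch accepts the signal directly, but other long-running work has to watch it by hand. The following is a minimal sketch of that pattern; abortableDelay is a hypothetical helper written for this page, not a niceeval export.

```typescript
// Hypothetical helper (not part of niceeval): wait `ms` milliseconds, but
// reject as soon as the signal aborts so the adapter stops promptly.
function abortableDelay(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted before start"));
      return;
    }
    const timer = setTimeout(() => {
      // Finished normally: stop listening for aborts.
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    function onAbort(): void {
      // Aborted mid-wait: cancel the timer so nothing lingers.
      clearTimeout(timer);
      reject(new Error("aborted while waiting"));
    }
    signal.addEventListener("abort", onAbort, { once: true });
  });
}
```

Inside send, a polling loop could call something like await abortableDelay(500, ctx.signal) between attempts so that a timeout interrupts the wait instead of letting it run to completion.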

model

string | undefined

The model tier string requested by the experiment (e.g. "claude-opus-4-8"). When present, pass it to your backend’s model selection parameter. When absent, let your backend use its own default.

flags

Readonly<Record<string, unknown>>

Feature flags set by the experiment and transparently forwarded to the agent. Read these to toggle behaviors (e.g. ctx.flags.webResearch). The same flags are available on t.flags in the eval’s test function.
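Because flag values are typed as unknown, it is usually worth narrowing them before use. This sketch shows one way to do that; boolFlag and stringFlag are hypothetical helpers, and the region flag is an invented example name.

```typescript
type Flags = Readonly<Record<string, unknown>>;

// Hypothetical helper: read a boolean flag, falling back when the value
// is missing or has the wrong type.
function boolFlag(flags: Flags, name: string, fallback = false): boolean {
  const value = flags[name];
  return typeof value === "boolean" ? value : fallback;
}

// Hypothetical helper: read a string flag, or undefined if it is not a string.
function stringFlag(flags: Flags, name: string): string | undefined {
  const value = flags[name];
  return typeof value === "string" ? value : undefined;
}
```

In an adapter this might read as if (boolFlag(ctx.flags, "webResearch")) { ... }, which avoids treating a stray string such as "false" as truthy.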

session

{ id?: string; isNew: boolean }

Session state for multi-turn conversations. id is an opaque string you assign after the first turn so subsequent turns can resume the session. isNew is true on the first turn or after the eval calls t.newSession().

if (!ctx.session.isNew && ctx.session.id) {
  // resume existing session
} else {
  // start a fresh session
}
ctx.session.id = responseBody.sessionId; // store for next turn
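As a slightly fuller sketch of the same branch, the function below builds a request body that only carries a session id when the turn is a continuation. ChatRequest and its sessionId field are assumptions about a hypothetical backend, not a niceeval type.

```typescript
// Local mirror of the documented ctx.session shape.
interface SessionState {
  id?: string;
  isNew: boolean;
}

// Hypothetical request body for a chat backend; adapt to your own API.
interface ChatRequest {
  message: string;
  sessionId?: string;
}

function buildChatRequest(text: string, session: SessionState): ChatRequest {
  // Resume only when the runner marks this turn as a continuation AND an id
  // was stored on a previous turn.
  if (!session.isNew && session.id) {
    return { message: text, sessionId: session.id };
  }
  // First turn, or the eval called t.newSession(): start fresh.
  return { message: text };
}
```

Note that a stale id is ignored when isNew is true, which is the behavior you want after t.newSession().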

log

(msg: string) => void

Writes a diagnostic message to the eval’s log. Useful for debugging adapter internals without polluting the test output.

The Turn return type

Your send function must return an object satisfying the Turn interface.

events

StreamEvent[]

required

The normalized standard event stream for this turn. This is the core product of your adapter — every scope-level assertion in the eval reads from it. See the StreamEvent section below for all event types.

data

unknown | undefined

Structured (non-text) output from the agent. Used by turn.outputEquals() and turn.outputMatches(). Set this when your agent returns a parsed object alongside its text response.

status

"completed" | "failed" | "waiting"

The outcome of this turn:

"completed" — the agent finished normally
"failed" — the agent encountered an error
"waiting" — the agent stopped at a human-in-the-loop (input.requested) prompt

usage

Usage | undefined

Token counts for this turn. Provide these when your backend exposes them so niceeval can report costs and power t.maxTokens() / t.maxCost() assertions.

inputTokens: number
outputTokens: number
cacheReadTokens?: number
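The sketch below shows one way to fill in status and usage when assembling a Turn. The precedence of error over input.requested is just one possible policy, and BackendUsage with its field names is a hypothetical payload, so treat both as assumptions.

```typescript
// Local mirrors of the documented shapes, so the sketch is self-contained.
type TurnStatus = "completed" | "failed" | "waiting";
interface Usage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens?: number;
}

// Assumed policy: any error event fails the turn; otherwise a pending
// input request means the turn is waiting.
function statusFromEvents(events: { type: string }[]): TurnStatus {
  if (events.some((e) => e.type === "error")) return "failed";
  if (events.some((e) => e.type === "input.requested")) return "waiting";
  return "completed";
}

// Hypothetical usage payload from a backend; field names are assumptions.
interface BackendUsage {
  prompt_tokens?: number;
  completion_tokens?: number;
  cached_tokens?: number;
}

function toUsage(raw?: BackendUsage): Usage | undefined {
  // Report nothing rather than guessing when the backend omits counts.
  if (!raw || raw.prompt_tokens === undefined || raw.completion_tokens === undefined) {
    return undefined;
  }
  const usage: Usage = {
    inputTokens: raw.prompt_tokens,
    outputTokens: raw.completion_tokens,
  };
  if (raw.cached_tokens !== undefined) usage.cacheReadTokens = raw.cached_tokens;
  return usage;
}
```

Returning undefined for missing counts keeps incomplete data out of cost reports instead of recording zeros.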

Complete example: in-process agent

// agents/my-agent.ts
import { defineAgent } from "niceeval/adapter";
import { classifyIntent } from "../src/agent.js";

export default defineAgent({
  name: "classify",
  capabilities: {},
  async send(input, ctx) {
    const result = await classifyIntent(input.text);
    return {
      events: [
        { type: "message", role: "assistant", text: JSON.stringify(result) },
      ],
      data: result,
      status: "completed",
    };
  },
});

Complete example: remote HTTP agent

// agents/support-bot.ts
import { defineAgent } from "niceeval/adapter";

export default defineAgent({
  name: "support-bot",
  capabilities: { conversation: true, toolObservability: true },
  async send(input, ctx) {
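    const r = await fetch(`${process.env.SUPPORT_BOT_URL}/chat`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ message: input.text }),
      signal: ctx.signal, // cancels the request when the eval times out
    });
    const body = await r.json();
    return {
      events: toStreamEvents(body),   // your mapping function
      data: body.output,
      status: r.ok ? "completed" : "failed", // treat HTTP errors as a failed turn
    };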
  },
});
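The toStreamEvents call above is left to you. As a rough sketch of what such a mapping function might look like, assume a hypothetical response body with a reply string and an optional toolCalls array. BotResponse and all its field names are invented for illustration.

```typescript
type JsonValue =
  | null | boolean | number | string
  | JsonValue[]
  | { [key: string]: JsonValue };

// Subset of the StreamEvent union documented on this page.
type StreamEvent =
  | { type: "message"; role: "assistant" | "user"; text: string }
  | { type: "action.called"; callId: string; name: string; input: JsonValue }
  | { type: "action.result"; callId: string; output?: JsonValue;
      status: "completed" | "failed" | "rejected" };

// Hypothetical backend response; adjust to whatever your service returns.
interface BotResponse {
  reply: string;
  toolCalls?: { id: string; name: string; args: JsonValue; result?: JsonValue; ok: boolean }[];
}

function toStreamEvents(body: BotResponse): StreamEvent[] {
  const events: StreamEvent[] = [];
  for (const call of body.toolCalls ?? []) {
    // Each tool call becomes a called/result pair linked by callId.
    events.push({ type: "action.called", callId: call.id, name: call.name, input: call.args });
    events.push({
      type: "action.result",
      callId: call.id,
      output: call.result,
      status: call.ok ? "completed" : "failed",
    });
  }
  // The final assistant text goes last, after the tool activity.
  events.push({ type: "message", role: "assistant", text: body.reply });
  return events;
}
```

Emitting tool events before the final message reflects a backend that runs its tools first and then answers. If yours interleaves text and tool calls, preserve the order it reports.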

Authentication (API keys, base URLs, tokens) belongs inside the adapter — read it from environment variables in the send closure. niceeval never sees it and never passes it via ctx. This keeps credential scope tight and lets the same adapter be used across environments simply by changing env vars.
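As an illustration of keeping credentials adapter-local, the sketch below reads a token from an environment map and fails fast when it is missing. readRequiredEnv is a hypothetical stand-in written for this page, and the SUPPORT_BOT_TOKEN variable and Bearer scheme are assumptions about the backend.

```typescript
// Hypothetical helper, not the niceeval implementation: return the variable
// or throw a clear error so misconfiguration surfaces immediately.
function readRequiredEnv(env: Record<string, string | undefined>, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Build auth headers inside the adapter; nothing here passes through ctx.
// SUPPORT_BOT_TOKEN is an invented variable name for this example.
function authHeaders(env: Record<string, string | undefined>): Record<string, string> {
  return { Authorization: `Bearer ${readRequiredEnv(env, "SUPPORT_BOT_TOKEN")}` };
}
```

In a real adapter you would pass process.env as the env argument and spread the result into the headers of your fetch call.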

defineSandboxAgent

Use defineSandboxAgent for coding agents that run as a CLI inside an isolated sandbox (Docker container or cloud VM). The runner provisions the sandbox and passes it via ctx.sandbox. Your send function installs the CLI, runs the agent with the task prompt, reads back the transcript, and parses it into the standard event stream.

import { defineSandboxAgent, shared } from "niceeval/adapter";

defineSandboxAgent accepts the same options as defineAgent (see above). The difference is that ctx.sandbox is always populated when send runs.

The `ctx.sandbox` field (Sandbox interface)

runCommand(cmd, args?, opts?)

(cmd: string, args?: string[], opts?: RunOpts) => Promise<CommandResult>

Runs a single command inside the sandbox. Returns { stdout, stderr, exitCode }.

const res = await ctx.sandbox.runCommand("npm", ["install"], { cwd: "/workspace" });

opts fields:

env?: Record<string, string> — extra environment variables merged into the command’s environment
cwd?: string — working directory override for this command
root?: boolean — run as root (false by default). Use for privileged setup steps like installing system packages.

// privileged: install a system package
await ctx.sandbox.runCommand("apt-get", ["install", "-y", "openjdk-17-jdk"], { root: true });

// non-privileged (default): run npm
await ctx.sandbox.runCommand("npm", ["install"]);
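Setup steps usually should not continue after a failed command. The sketch below assumes that runCommand resolves with a non-zero exitCode rather than rejecting, which the use of res.exitCode elsewhere on this page suggests but does not guarantee. assertSucceeded is a hypothetical helper.

```typescript
// Local mirror of the documented result shape.
interface CommandResult {
  stdout: string;
  stderr: string;
  exitCode: number;
}

// Hypothetical helper: turn a non-zero exit code into an exception that
// carries the command label and stderr for easier debugging.
function assertSucceeded(label: string, res: CommandResult): CommandResult {
  if (res.exitCode !== 0) {
    throw new Error(`${label} exited with code ${res.exitCode}: ${res.stderr.trim()}`);
  }
  return res;
}
```

A call site might then look like assertSucceeded("npm install", await ctx.sandbox.runCommand("npm", ["install"])).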

runShell(script, opts?)

(script: string, opts?) => Promise<CommandResult>

Runs a multi-line shell script inside the sandbox. Accepts the same opts as runCommand. Useful for complex setup sequences.

await ctx.sandbox.runShell(`
  git config user.email "bot@example.com"
  git config user.name "Bot"
`);

readFile(path)

(path: string) => Promise<string>

Reads a file from the sandbox filesystem and returns its contents as a string.

writeFiles(files)

(files: Record<string, string>) => Promise<void>

Writes one or more files into the sandbox. Keys are paths, values are file contents.

await ctx.sandbox.writeFiles({
  "/workspace/.env": "API_KEY=test",
});

uploadFiles(files)

(files: SandboxFile[]) => Promise<void>

Uploads a batch of files (including binary) to the sandbox. Used internally by shared.prepareWorkspace to upload workspace fixture files.

runCommand(..., { cwd })

() => string

Returns the current working directory path inside the sandbox.

runCommand(..., { cwd: path })

(path: string) => void

Sets the default working directory for subsequent commands.

stop()

() => Promise<void>

Tears down and destroys the sandbox instance. Called automatically by the runner after the eval completes. You generally do not need to call this yourself.

shared helpers

The shared object from niceeval/adapter provides utilities common to all sandbox agent adapters, so that workspace preparation, diff collection, validation, and observability injection work the same way regardless of which agent CLI you wrap.

shared.prepareWorkspace(sandbox, fixture)

function

Uploads workspace files to the sandbox (hiding EVAL.ts and other test files to prevent the agent from seeing the answer), then runs git init && git commit to establish a baseline for later diffing.

shared.captureLatestJsonl(sandbox, dir)

function

Locates and reads the most recently modified .jsonl transcript file under dir. Used by adapters like claude-code that write transcripts to a well-known directory.

const raw = await shared.captureLatestJsonl(sb, "~/.claude/projects");
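Assuming the captured transcript is the raw file text (one JSON object per line), a first parsing step could look like the sketch below. parseJsonl is a hypothetical helper. Mapping the parsed records to StreamEvent values depends on the CLI's transcript format and is not shown.

```typescript
// Hypothetical helper: split JSONL text into parsed records, skipping blank
// lines and any line that is not valid JSON (for example a partial final
// line left by an interrupted process).
function parseJsonl(raw: string): unknown[] {
  const records: unknown[] = [];
  for (const line of raw.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      records.push(JSON.parse(trimmed));
    } catch {
      // Ignore malformed lines instead of failing the whole turn.
    }
  }
  return records;
}
```

Skipping malformed lines is a design choice. If a truncated transcript should fail the turn instead, rethrow from the catch block.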

shared.runValidation(sandbox, scripts, mode)

function

Uploads the test files (e.g. EVAL.ts) that were hidden during workspace preparation, then runs the Vitest suite and/or npm scripts to validate the agent’s output.

shared.injectO11yContext(sandbox, events)

function

Derives observability data from the standard event stream and writes it to __niceeval__/results.json inside the sandbox. This makes agent behavior visible to assertions in EVAL.ts.

// In EVAL.ts — read what the agent did:
const o11y = JSON.parse(readFileSync("__niceeval__/results.json", "utf-8")).o11y;
expect(o11y.shellCommands.map(c => c.command)).not.toContain("rm -rf /");

Complete example: claude-code adapter

// agents/claude-code.ts
import { defineSandboxAgent, shared } from "niceeval/adapter";
import { requireEnv } from "niceeval";

const auth = () => ({ ANTHROPIC_API_KEY: requireEnv("ANTHROPIC_API_KEY") });

export default defineSandboxAgent({
  name: "claude-code",
  async send(input, ctx) {
    const sb = ctx.sandbox!;

    // Install the CLI (privileged — npm global install)
    await sb.runCommand("npm", ["install", "-g", "@anthropic-ai/claude-code"], {
      root: true,
    });

    // Build the argument list
    const args = ["--print", "--dangerously-skip-permissions"];
    if (ctx.model) args.push("--model", ctx.model);
    if (ctx.flags.webResearch) args.push("--allowedTools", "WebSearch,WebFetch");
    if (!ctx.session.isNew && ctx.session.id) args.push("--resume", ctx.session.id);
    args.push(input.text);

    const res = await sb.runCommand("claude", args, { env: auth() });

    // Capture and parse the transcript
    const raw = await shared.captureLatestJsonl(sb, "~/.claude/projects");
    ctx.session.id = shared.sessionIdFromClaudeTranscript(raw);

    return {
      events: parseClaudeCode(raw),  // your transcript → StreamEvent[] parser
      status: res.exitCode === 0 ? "completed" : "failed",
    };
  },
});

StreamEvent union type

Every adapter must produce StreamEvent[]. This normalized stream is what all scope-level assertions in test(t) read from. If your backend uses a different representation, map it to these types in your send function.

type StreamEvent =
  | { type: "message"; role: "assistant" | "user"; text: string }
  | { type: "action.called"; callId: string; name: string; input: JsonValue }
  | { type: "action.result"; callId: string; output?: JsonValue;
      status: "completed" | "failed" | "rejected" }
  | { type: "subagent.called"; callId: string; name: string; remoteUrl?: string }
  | { type: "subagent.completed"; callId: string; output?: JsonValue;
      status: "completed" | "failed" }
  | { type: "input.requested"; request: InputRequest }
  | { type: "thinking"; text: string }
  | { type: "error"; message: string };

Event type details

Event type	Description
`message`	A text message from the assistant or user. `t.reply` is derived from all `assistant` messages in the stream.
`action.called`	A tool, skill, or action was invoked. `callId` links to the corresponding `action.result`.
`action.result`	The result of a tool call. Paired with `action.called` by `callId`.
`subagent.called`	The agent delegated to a sub-agent.
`subagent.completed`	A sub-agent delegation finished.
`input.requested`	The agent paused waiting for human input (HITL). Causes `status: "waiting"` on the Turn.
`thinking`	Reasoning text from a chain-of-thought model. Not counted as a reply message.
`error`	An error emitted by the agent during execution. `t.notEvent("error")` asserts none occurred.
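To make the callId linkage concrete, here is a small sketch that pairs each action.called event with its action.result. pairActions is a hypothetical helper for illustration, not something niceeval exports, and the Ev type is a trimmed copy of the union above.

```typescript
type JsonValue =
  | null | boolean | number | string
  | JsonValue[]
  | { [key: string]: JsonValue };

type ActionCalled = { type: "action.called"; callId: string; name: string; input: JsonValue };
type ActionResult = { type: "action.result"; callId: string; output?: JsonValue;
  status: "completed" | "failed" | "rejected" };
type Message = { type: "message"; role: "assistant" | "user"; text: string };
type Ev = ActionCalled | ActionResult | Message;

// Pair every call with its result (undefined when no result was emitted).
function pairActions(events: Ev[]): { call: ActionCalled; result?: ActionResult }[] {
  const results = new Map<string, ActionResult>();
  for (const e of events) {
    if (e.type === "action.result") results.set(e.callId, e);
  }
  return events
    .filter((e): e is ActionCalled => e.type === "action.called")
    .map((call) => ({ call, result: results.get(call.callId) }));
}
```

A call with no matching result usually means the turn was cut off mid-tool, which is worth checking for when debugging a parser.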

Skill loading (load_skill) is represented as an action.called event with name: "load_skill". The t.loadedSkill(name) assertion is syntactic sugar for t.calledTool("load_skill", { input: { skill: name } }) — no separate event type is needed.
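In practice this means an adapter emits skill loads as ordinary action events. The sketch below builds such an event and reads skill names back out of a stream. Both function names are hypothetical; only the load_skill name and the { skill } input shape come from the description above.

```typescript
type JsonValue =
  | null | boolean | number | string
  | JsonValue[]
  | { [key: string]: JsonValue };

interface ActionCalled {
  type: "action.called";
  callId: string;
  name: string;
  input: JsonValue;
}

// Build the event an adapter would emit when the agent loads a skill.
function skillLoadEvent(callId: string, skill: string): ActionCalled {
  return { type: "action.called", callId, name: "load_skill", input: { skill } };
}

// Collect the names of all skills loaded in a stream of events.
function loadedSkills(events: { type: string; name?: string; input?: JsonValue }[]): string[] {
  const skills: string[] = [];
  for (const e of events) {
    if (e.type !== "action.called" || e.name !== "load_skill") continue;
    const input = e.input;
    if (input && typeof input === "object" && !Array.isArray(input)) {
      const skill = input["skill"];
      if (typeof skill === "string") skills.push(skill);
    }
  }
  return skills;
}
```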

Using agents in experiments

Once you’ve written an adapter, reference it from an experiment file:

// experiments/local.ts
import { defineExperiment } from "niceeval";
import myAgent from "../agents/my-agent.js";

export default defineExperiment({
  agent: myAgent,
  runs: 1,
});

agent

AgentAdapter

Agent adapter instance created with defineAgent or defineSandboxAgent.

runs

number

Number of attempts for each matched eval in this experiment.

Built-in agents

The following coding agent adapters are exported by niceeval and can be referenced from experiment files:

claude-code

Anthropic Claude Code CLI. Requires ANTHROPIC_API_KEY. Uses claude --print --dangerously-skip-permissions.

codex

OpenAI Codex CLI. Requires codex login or API key setup. Uses codex exec --json.

bub

Built-in bub coding agent. Same adapter shape as claude-code — use as a reference when writing your own sandbox adapter.

npx niceeval exp claude-code-local evals/fixtures/button --sandbox docker
npx niceeval exp codex-local evals/fixtures/button --sandbox docker

ctx vs t: two names, same data

The ctx object in your adapter’s send function and the t object in your eval’s test function share the same underlying data — t is the runner’s high-level view built on top of ctx.

Concept	`ctx` (agent side)	`t` (eval side)
Feature flags	`ctx.flags`	`t.flags`
Model	`ctx.model`	`t.model`
Abort signal	`ctx.signal`	`t.signal`
Logging	`ctx.log()`	`t.log()`
Session	`ctx.session.id` / `isNew`	`t.newSession()`
Sandbox	`ctx.sandbox` (raw `Sandbox` handle)	`t.sandbox.diff`, `t.fileChanged`, etc. (high-level view)

Authentication details, CLI flags, and transcript locations are agent-local — they live inside send and are never exposed via ctx or t.
