agent.send(input, ctx) and expects back a Turn. This page covers the two adapter factories: defineAgent for remote and in-process agents, and defineSandboxAgent for coding agents that run inside an isolated sandbox.
defineAgent
Use defineAgent for any agent you can drive in-process or over HTTP. The send function is your responsibility: call your code, fire a fetch, or stream from a WebSocket, whatever your backend requires. Map the result to the standard event stream and return a Turn.
Options
A unique identifier for this agent. Experiment files reference agent objects directly, and reports use this name for grouping.
Declares what the agent can do. The runner uses these flags to decide which methods appear on the eval’s t context. Omitting a capability hides the corresponding t methods at the TypeScript type level, surfacing misconfiguration at compile time rather than at runtime.
The agent supports multi-turn sessions. Enables t.reply and t.newSession().
The agent produces action.* and subagent.* events. Enables t.calledTool, t.notCalledTool, t.toolOrder, t.usedNoTools, t.maxToolCalls, t.loadedSkill, t.calledSubagent, t.noFailedActions, t.event, t.notEvent, t.eventOrder, and t.eventsSatisfy.
The agent works on a file system. Enables t.sandbox.diff, t.fileChanged, t.fileDeleted, t.testsPassed, and t.scriptPassed. This flag is automatically set for defineSandboxAgent adapters.
The core function that drives the agent. Called once per t.send() invocation. See the TurnInput and AgentContext sections below for parameter details. Must return a Turn (see the Turn section).
The input parameter
The user message string for this turn. This is the value passed to t.send(text).
The ctx parameter (AgentContext)
ctx.signal
An AbortSignal tied to the eval’s timeout. Pass it to any fetch calls or long-running async work so they cancel cleanly when the eval times out or is aborted by early-exit logic.
ctx.model
The model tier string requested by the experiment (e.g. "claude-opus-4-8"). When present, pass it to your backend’s model selection parameter. When absent, let your backend use its own default.
ctx.flags
Feature flags set by the experiment and transparently forwarded to the agent. Read these to toggle behaviors (e.g. ctx.flags.webResearch). The same flags are available on t.flags in the eval’s test function.
ctx.session
Session state for multi-turn conversations. id is an opaque string you assign after the first turn so subsequent turns can resume the session. isNew is true on the first turn or after the eval calls t.newSession().
ctx.log()
Writes a diagnostic message to the eval’s log. Useful for debugging adapter internals without polluting the test output.
The Turn return type
Your send function must return an object satisfying the Turn interface.
The normalized standard event stream for this turn. This is the core product of your adapter: every scope-level assertion in the eval reads from it. See the StreamEvent section below for all event types.
Structured (non-text) output from the agent. Used by turn.outputEquals() and turn.outputMatches(). Set this when your agent returns a parsed object alongside its text response.
The outcome of this turn:
- "completed": the agent finished normally
- "failed": the agent encountered an error
- "waiting": the agent stopped at a human-in-the-loop (input.requested) prompt
Token counts for this turn. Provide these when your backend exposes them so niceeval can report costs and power t.maxTokens() / t.maxCost() assertions.
- inputTokens: number
- outputTokens: number
- cacheReadTokens?: number
Complete example: in-process agent
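A minimal sketch of an in-process adapter, assuming the shapes described above. The types and the defineAgent function below are local stand-ins so the block runs on its own; in a real adapter they would come from niceeval. The Turn field names events and usage, the message fields role and text, and the answer() function are assumptions made for illustration. Only status and the token-count names are confirmed by this page.

```typescript
// Local stand-ins so the sketch is self-contained; niceeval's real types may differ.
type StreamEvent =
  | { type: "message"; role: "assistant" | "user"; text: string }
  | { type: "error"; message: string };

interface Turn {
  events: StreamEvent[]; // assumed field name for the event stream
  status: "completed" | "failed" | "waiting";
  usage?: { inputTokens: number; outputTokens: number; cacheReadTokens?: number }; // assumed container name
}

interface AgentContext {
  signal: AbortSignal;
  model?: string;
  flags: Record<string, unknown>;
  log(message: string): void;
}

interface AgentOptions {
  name: string;
  send(input: string, ctx: AgentContext): Promise<Turn>;
}

// Hypothetical stand-in for niceeval's defineAgent: it simply returns the options.
function defineAgent(options: AgentOptions): AgentOptions {
  return options;
}

// The "agent" under test: an ordinary async function living in the same process.
async function answer(question: string, model: string): Promise<string> {
  return `[${model}] You asked: ${question}`;
}

const echoAgent = defineAgent({
  name: "echo-in-process",
  async send(input, ctx) {
    // Fall back to a local default when the experiment requests no model.
    const model = ctx.model ?? "default-model";
    ctx.log(`calling answer() with model=${model}`);
    try {
      const text = await answer(input, model);
      return {
        events: [
          { type: "message", role: "user", text: input },
          { type: "message", role: "assistant", text },
        ],
        status: "completed",
      };
    } catch (err) {
      // Report failures as an error event instead of throwing out of send.
      return {
        events: [{ type: "error", message: String(err) }],
        status: "failed",
      };
    }
  },
});
```

No capabilities are declared in this sketch, so under the rules above an eval using it would see only the base t methods.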
Complete example: remote HTTP agent
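A sketch of a remote adapter under stated assumptions. The /chat endpoint, its request and response bodies, and the MY_AGENT_URL and MY_AGENT_API_KEY variables are all hypothetical, and the types are local stand-ins rather than niceeval's own. The function returns the options object you would pass to defineAgent; the fetch implementation is injected so the adapter can be exercised without a network.

```typescript
// Minimal local stand-ins; the real types come from niceeval and may differ.
type StreamEvent =
  | { type: "message"; role: "assistant" | "user"; text: string }
  | { type: "error"; message: string };

interface Turn {
  events: StreamEvent[]; // assumed field name
  status: "completed" | "failed" | "waiting";
  usage?: { inputTokens: number; outputTokens: number }; // assumed container name
}

interface AgentContext {
  signal: AbortSignal;
  model?: string;
  session: { id?: string; isNew: boolean };
}

// Just enough of the fetch API for this sketch; the global fetch fits this shape.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string; signal: AbortSignal },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Hypothetical response body of the hypothetical /chat endpoint.
interface ChatResponse {
  reply: string;
  sessionId: string;
  usage?: { inputTokens: number; outputTokens: number };
}

function remoteAgentOptions(fetchImpl: FetchLike) {
  return {
    name: "remote-http",
    async send(input: string, ctx: AgentContext): Promise<Turn> {
      // Credentials are read inside the closure and never travel through ctx.
      const baseUrl = process.env.MY_AGENT_URL ?? "http://localhost:8080";
      const apiKey = process.env.MY_AGENT_API_KEY ?? "";

      const res = await fetchImpl(`${baseUrl}/chat`, {
        method: "POST",
        headers: { "content-type": "application/json", authorization: `Bearer ${apiKey}` },
        body: JSON.stringify({
          message: input,
          model: ctx.model, // omitted by JSON.stringify when undefined
          sessionId: ctx.session.isNew ? undefined : ctx.session.id,
        }),
        signal: ctx.signal, // cancels the request when the eval times out
      });

      if (!res.ok) {
        return { events: [{ type: "error", message: `HTTP ${res.status}` }], status: "failed" };
      }

      const data = (await res.json()) as ChatResponse;
      ctx.session.id = data.sessionId; // lets the next turn resume this session
      return {
        events: [
          { type: "message", role: "user", text: input },
          { type: "message", role: "assistant", text: data.reply },
        ],
        status: "completed",
        usage: data.usage,
      };
    },
  };
}
```

In production you would call remoteAgentOptions(fetch); a test can pass a fake that returns canned JSON.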
Authentication (API keys, base URLs, tokens) belongs inside the adapter: read it from environment variables in the send closure. niceeval never sees it and never passes it via ctx. This keeps credential scope tight and lets the same adapter be used across environments simply by changing env vars.
defineSandboxAgent
Use defineSandboxAgent for coding agents that run as a CLI inside an isolated sandbox (Docker container or cloud VM). The runner provisions the sandbox and passes it via ctx.sandbox. Your send function installs the CLI, runs the agent with the task prompt, reads back the transcript, and parses it into the standard event stream.
defineSandboxAgent accepts exactly the same options as defineAgent (see above). The one difference is that ctx.sandbox is always populated.
The ctx.sandbox field (Sandbox interface)
runCommand(cmd, args?, opts?)
(cmd: string, args?: string[], opts?: RunOpts) => Promise<CommandResult>
Runs a single command inside the sandbox. Returns { stdout, stderr, exitCode }.
opts fields:
- env?: Record<string, string>. Extra environment variables merged into the command’s environment.
- cwd?: string. Working directory override for this command.
- root?: boolean. Run as root (false by default). Use for privileged setup steps like installing system packages.
Runs a multi-line shell script inside the sandbox. Accepts the same opts as runCommand. Useful for complex setup sequences.
Reads a file from the sandbox filesystem and returns its contents as a string.
Writes one or more files into the sandbox. Keys are paths, values are file contents.
Uploads a batch of files (including binary) to the sandbox. Used internally by shared.prepareWorkspace to upload workspace fixture files.
Returns the current working directory path inside the sandbox.
Sets the default working directory for subsequent commands.
Tears down and destroys the sandbox instance. Called automatically by the runner after the eval completes. You generally do not need to call this yourself.
shared helpers
The shared object from niceeval/adapter provides utilities that are common across all sandbox agent adapters, ensuring that workspace preparation, diff collection, validation, and observability injection work consistently regardless of which agent CLI you’re wrapping.
Uploads workspace files to the sandbox (hiding EVAL.ts and other test files to prevent the agent from seeing the answer), then runs git init && git commit to establish a baseline for later diffing.
Locates and reads the most recently modified .jsonl transcript file under dir. Used by adapters like claude-code that write transcripts to a well-known directory.
Uploads the test files (e.g. EVAL.ts) that were hidden during workspace preparation, then runs the Vitest suite and/or npm scripts to validate the agent’s output.
Derives observability data from the standard event stream and writes it to __niceeval__/results.json inside the sandbox. This makes agent behavior visible to assertions in EVAL.ts.
Complete example: claude-code adapter
StreamEvent union type
Every adapter must produce StreamEvent[]. This normalized stream is what all scope-level assertions in test(t) read from. If your backend uses a different representation, map it to these types in your send function.
Event type details
| Event type | Description |
|---|---|
| message | A text message from the assistant or user. t.reply is derived from all assistant messages in the stream. |
| action.called | A tool, skill, or action was invoked. callId links to the corresponding action.result. |
| action.result | The result of a tool call. Paired with action.called by callId. |
| subagent.called | The agent delegated to a sub-agent. |
| subagent.completed | A sub-agent delegation finished. |
| input.requested | The agent paused waiting for human input (HITL). Causes status: "waiting" on the Turn. |
| thinking | Reasoning text from a chain-of-thought model. Not counted as a reply message. |
| error | An error emitted by the agent during execution. t.notEvent("error") asserts none occurred. |
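
The event types in the table can be modeled as a discriminated union on type. The sketch below is an illustration, not niceeval's actual definition: the type names and callId come from the table, while every other field (role, text, name, input, output, prompt, message) is an assumption. The pairActions helper is hypothetical and shows how callId links a call to its result.

```typescript
// Assumed shape of the union; only the `type` values and `callId` come from the table above.
type StreamEvent =
  | { type: "message"; role: "assistant" | "user"; text: string }
  | { type: "action.called"; callId: string; name: string; input: unknown }
  | { type: "action.result"; callId: string; output: unknown }
  | { type: "subagent.called"; name: string }
  | { type: "subagent.completed"; name: string }
  | { type: "input.requested"; prompt: string }
  | { type: "thinking"; text: string }
  | { type: "error"; message: string };

type ActionCalled = Extract<StreamEvent, { type: "action.called" }>;
type ActionResult = Extract<StreamEvent, { type: "action.result" }>;
type ActionPair = { call: ActionCalled; result?: ActionResult };

// Hypothetical helper: pair each action.called with its action.result via callId.
// A call whose result never arrived keeps result undefined.
function pairActions(events: StreamEvent[]): ActionPair[] {
  const results = new Map<string, ActionResult>();
  for (const e of events) {
    if (e.type === "action.result") results.set(e.callId, e);
  }
  const pairs: ActionPair[] = [];
  for (const e of events) {
    if (e.type === "action.called") pairs.push({ call: e, result: results.get(e.callId) });
  }
  return pairs;
}
```

Checking e.type narrows the event to a single member of the union, so fields such as callId are only accessible where they exist.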
Skill loading (load_skill) is represented as an action.called event with name: "load_skill". The t.loadedSkill(name) assertion is syntactic sugar for t.calledTool("load_skill", { input: { skill: name } }), so no separate event type is needed.
Using agents in experiments
Once you’ve written an adapter, reference it from an experiment file.
Agent adapter instance created with defineAgent or defineSandboxAgent.
Number of attempts for each matched eval in this experiment.
Built-in agents
The following coding agent adapters are exported by niceeval and can be referenced from experiment files:
claude-code
Anthropic Claude Code CLI. Requires ANTHROPIC_API_KEY. Uses claude --print --dangerously-skip-permissions.
codex
OpenAI Codex CLI. Requires codex login or API key setup. Uses codex exec --json.
bub
Built-in bub coding agent. Same adapter shape as claude-code; use it as a reference when writing your own sandbox adapter.
ctx vs t: two names, same data
The ctx object in your adapter’s send function and the t object in your eval’s test function share the same underlying data: t is the runner’s high-level view built on top of ctx.
| Concept | ctx (agent side) | t (eval side) |
|---|---|---|
| Feature flags | ctx.flags | t.flags |
| Model | ctx.model | t.model |
| Abort signal | ctx.signal | t.signal |
| Logging | ctx.log() | t.log() |
| Session | ctx.session.id / isNew | t.newSession() |
| Sandbox | ctx.sandbox (raw Sandbox handle) | t.sandbox.diff, t.fileChanged, etc. (high-level view) |
send and are never exposed via ctx or t.