Sandbox agents: evaluate Claude Code, Codex, and bub
Use niceeval’s built-in agents (claude-code, codex, bub) or write a custom adapter to run a coding-agent CLI in an isolated Docker or cloud sandbox.
A sandbox agent spawns a coding-agent CLI inside an isolated environment — a Docker container or cloud micro-VM — gives it a workspace, lets it run freely, then reads back the transcript and validates the result. This is how you evaluate tools like Claude Code, Codex, and bub: they need a real filesystem to write code, execute builds, and call tools, so niceeval provides that filesystem in a throwaway container your host machine never sees.
Select the agent in an experiment file, then run that experiment. The optional --sandbox flag only overrides where niceeval spins up the isolated environment (see Sandbox Backends for the full list of options).
# Evaluate the button fixture with Claude Code in a local Docker container
export ANTHROPIC_API_KEY=sk-ant-...
npx niceeval exp local fixtures/button --sandbox docker

# Run 10 times and stop as soon as one pass is recorded
npx niceeval exp local fixtures/button --runs 10 --early-exit
The --sandbox docker flag is optional if Docker is your default backend. Keep the agent choice in experiments/local.ts or another experiment file.
The runner creates the sandbox and commits a baseline, then your eval and adapter decide what to do. Starter files are uploaded explicitly from test(t); validation commands are ordinary t.sandbox.runCommand(...) calls.
createSandbox(backend, timeout)
  → git init && git commit          # baseline for later diff
  → test(t): uploadDirectory/writeFiles and run setup commands
  → adapter.send(input, ctx)        # ← the adapter's only segment
  → test(t): run validation commands and record assertions
  → collectGeneratedFiles()         # git diff HEAD
  → sandbox.stop()                  # destroy the environment
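The ordering above can be sketched as a small driver. This is an illustration only, not niceeval's runner: `FakeSandbox`, `createFakeSandbox`, and `runOnce` are hypothetical names, the sandbox merely records the commands it receives, and the try/finally is an assumption about how teardown could be guaranteed.

```typescript
// Hypothetical stand-in for a sandbox handle; it only records what was run.
interface FakeSandbox {
  log: string[];
  runCommand(cmd: string): Promise<void>;
  stop(): Promise<void>;
}

function createFakeSandbox(): FakeSandbox {
  const log: string[] = [];
  return {
    log,
    async runCommand(cmd: string) { log.push(cmd); },
    async stop() { log.push("stop"); },
  };
}

type Step = (s: FakeSandbox) => Promise<void>;

// Mirrors the documented order: baseline, setup, adapter turn, validation, diff, teardown.
async function runOnce(sandbox: FakeSandbox, setup: Step, send: Step, validate: Step): Promise<string[]> {
  try {
    await sandbox.runCommand("git init && git commit"); // baseline for the later diff
    await setup(sandbox);                               // test(t): upload files, run setup commands
    await send(sandbox);                                // the adapter's only segment
    await validate(sandbox);                            // test(t): validation commands and assertions
    await sandbox.runCommand("git diff HEAD");          // collect generated files
  } finally {
    await sandbox.stop();                               // assumed: runs even if a step throws
  }
  return sandbox.log;
}
```

Because the adapter only sees the middle step, everything before and after it is the same regardless of which agent is under test.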
A sandbox adapter receives a ctx whose ctx.sandbox is the live Sandbox handle for the current isolated environment. Your send function uses that handle to install the CLI, authenticate, run the agent, and read back the transcript.
import { defineSandboxAgent } from "niceeval/adapter";

defineSandboxAgent({
  name: string;
  async send(input: TurnInput, ctx: AgentContext): Promise<Turn>;
  // ↑ ctx.sandbox is the Sandbox handle
});
The five things that differ between coding-agent adapters are:
1. Install the CLI: e.g. npm install -g @anthropic-ai/claude-code
2. Authenticate: read the API key from the environment and pass it to the command
3. Build the command: construct the argument list, including the prompt
4. Pass the model flag: forward ctx.model to the CLI if the experiment specifies one
5. Read and parse the transcript: locate the native JSONL output and convert it to StreamEvent[]
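Put together, the five steps might look like the sketch below. It is self-contained for illustration: `Sandbox`, `Ctx`, and `Turn` are simplified stand-ins rather than niceeval's real types, the `runCommand` and `readFile` signatures are assumptions, `makeSend` is a hypothetical helper, and the transcript path is a placeholder supplied by the caller.

```typescript
// Simplified stand-ins for illustration; niceeval's real types may differ.
interface Sandbox {
  runCommand(cmd: string, args: string[], env?: Record<string, string>): Promise<{ exitCode: number }>;
  readFile(path: string): Promise<string>;
}
interface Ctx { sandbox: Sandbox; model?: string }
interface Turn { events: unknown[] }

// `env` is injected so the sketch runs anywhere; a real adapter would pass process.env.
function makeSend(env: Record<string, string | undefined>, transcriptPath: string) {
  return async function send(prompt: string, ctx: Ctx): Promise<Turn> {
    // 1. Install the CLI
    await ctx.sandbox.runCommand("npm", ["install", "-g", "@anthropic-ai/claude-code"]);

    // 2. Authenticate: the key stays inside the adapter and never travels through ctx
    const apiKey = env.ANTHROPIC_API_KEY;
    if (!apiKey) throw new Error("ANTHROPIC_API_KEY is not set");

    // 3. Build the command, including the prompt
    const args = ["-p", prompt];

    // 4. Pass the model flag only when the experiment sets one
    if (ctx.model) args.push("--model", ctx.model);

    await ctx.sandbox.runCommand("claude", args, { ANTHROPIC_API_KEY: apiKey });

    // 5. Read the native JSONL transcript (parsing is reduced to JSON.parse here)
    const raw = await ctx.sandbox.readFile(transcriptPath);
    const events = raw
      .split("\n")
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line));
    return { events };
  };
}
```

Injecting `env` and the transcript path keeps the function testable against a fake sandbox without Docker or a real API key.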
niceeval provides helpers that all sandbox adapters can reuse. Using them keeps workspace preparation, diff collection, and validation identical across agents, so results are directly comparable.
| Helper | What it does |
| --- | --- |
| shared.prepareWorkspace(sandbox, fixture) | Uploads workspace files (hiding EVAL.ts), runs git init and commits a baseline |
| shared.captureLatestJsonl(sandbox, dir) | Finds and reads the most recent JSONL transcript in the given directory |
| shared.runValidation(sandbox, scripts, mode) | Uploads test files and runs EVAL.ts (Vitest) plus any npm scripts |
| shared.injectO11yContext(sandbox, events) | Derives the o11y summary from the event stream and writes it to __niceeval__/results.json so EVAL.ts can assert on agent behavior |
| shared.captureGeneratedFiles(sandbox) | Runs git diff HEAD and returns { generated, deleted }, the file-level diff used for t.fileChanged, t.fileDeleted, and t.sandbox.diff |
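As an illustration of the last helper, a `{ generated, deleted }` result can be derived from git's name-status output. The sketch below assumes the input looks like the output of `git diff --name-status HEAD`; the function name `parseNameStatus` and the way renames are split into a deletion plus a generated file are hypothetical choices, not niceeval's implementation.

```typescript
interface FileDiff {
  generated: string[];
  deleted: string[];
}

// Parses `git diff --name-status HEAD` output: one "<status>\t<path>" entry per line.
function parseNameStatus(output: string): FileDiff {
  const diff: FileDiff = { generated: [], deleted: [] };
  for (const line of output.split("\n")) {
    if (line.trim() === "") continue;
    const [status, ...paths] = line.split("\t");
    if (status.startsWith("D")) {
      diff.deleted.push(paths[0]);
    } else if (status.startsWith("R")) {
      // Rename lines carry two paths: "R<score>\t<old>\t<new>"
      diff.deleted.push(paths[0]);
      diff.generated.push(paths[1]);
    } else {
      // A (added), M (modified), C (copied; the new path comes last)
      diff.generated.push(paths[paths.length - 1]);
    }
  }
  return diff;
}
```

Diffing against the baseline commit is what lets file-level assertions ignore starter files the agent never touched.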
Each coding agent writes its own native transcript format. Your adapter’s fifth step is converting that format into the standard StreamEvent[] vocabulary that all niceeval assertions understand.
// Minimal transcript parser skeleton
import type { StreamEvent } from "niceeval";

function parseClaudeCode(rawJsonl: string): StreamEvent[] {
  const events: StreamEvent[] = [];
  for (const line of rawJsonl.split("\n")) {
    // Skip blank lines so an empty or newline-padded transcript does not break JSON.parse
    if (line.trim() === "") continue;
    const entry = JSON.parse(line);

    // Assistant entries carry text blocks and tool calls
    if (entry.type === "assistant" && entry.message?.content) {
      for (const block of entry.message.content) {
        if (block.type === "text") {
          events.push({ type: "message", role: "assistant", text: block.text });
        }
        if (block.type === "tool_use") {
          events.push({ type: "action.called", callId: block.id, name: block.name, input: block.input });
        }
      }
    }

    // Tool results are matched back to their call through callId
    if (entry.type === "tool_result") {
      events.push({
        type: "action.result",
        callId: entry.tool_use_id,
        output: entry.content,
        status: "completed",
      });
    }
  }
  return events;
}
Once you normalize the transcript into StreamEvent[], the entire suite of niceeval assertions becomes available: t.calledTool, t.toolOrder, t.noFailedActions, t.messageIncludes, and more — no extra work required.
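To see why normalization is enough, here is how checks of this kind can be expressed over a plain event array. The `AgentEvent` union mirrors only the fields used in the parser skeleton above, and the three functions are illustrative re-implementations that borrow the assertion names; niceeval's own versions may behave differently.

```typescript
// Local stand-in for StreamEvent, limited to the fields the parser skeleton emits.
type AgentEvent =
  | { type: "message"; role: "assistant"; text: string }
  | { type: "action.called"; callId: string; name: string; input: unknown }
  | { type: "action.result"; callId: string; output: unknown; status: string };

// True if any tool call with the given name appears in the stream.
function calledTool(events: AgentEvent[], name: string): boolean {
  return events.some((e) => e.type === "action.called" && e.name === name);
}

// True if the named tools were called in this relative order (other calls may interleave).
function toolOrder(events: AgentEvent[], expected: string[]): boolean {
  let next = 0;
  for (const e of events) {
    if (next < expected.length && e.type === "action.called" && e.name === expected[next]) {
      next++;
    }
  }
  return next === expected.length;
}

// True if every recorded result finished with status "completed".
function noFailedActions(events: AgentEvent[]): boolean {
  return events.every((e) => e.type !== "action.result" || e.status === "completed");
}
```

None of these functions knows which CLI produced the events, which is the point of converting every transcript to one vocabulary.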
Experiments pass a model tier and feature flags through ctx. Your adapter should forward them rather than hardcoding values — this allows the same adapter to serve multiple experiments without modification.
// Forward the experiment's model tier to the CLI (omit if not set → CLI uses its default)
if (ctx.model) args.push("--model", ctx.model);

// Read a feature flag to conditionally enable a tool
if (ctx.flags.webResearch) args.push("--allowedTools", "WebSearch,WebFetch");
ctx.model is set by the experiment configuration, not by the adapter. If an experiment doesn’t specify a model, ctx.model is undefined and the agent CLI uses its built-in default.
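One way to keep this forwarding testable is to isolate it in a pure function. In the sketch below, `AdapterCtx` is a reduced, hypothetical view of the adapter context and `buildArgs` is an illustrative helper; only the `--model` and `--allowedTools` arguments come from the snippet above.

```typescript
// Reduced, hypothetical view of the context: just the fields this helper reads.
interface AdapterCtx {
  model?: string;
  flags: Record<string, boolean | undefined>;
}

// Appends experiment-controlled options to a base argument list without mutating it.
function buildArgs(base: string[], ctx: AdapterCtx): string[] {
  const args = [...base];
  if (ctx.model) args.push("--model", ctx.model);
  if (ctx.flags.webResearch) args.push("--allowedTools", "WebSearch,WebFetch");
  return args;
}
```

When the experiment sets neither value, the function returns a copy of the base list unchanged, so the CLI falls back to its own defaults.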
Built-in agents (claude-code, codex, bub) can be imported into experiment files. Custom adapters follow the same pattern:
// experiments/local.ts
import { defineExperiment } from "niceeval";
import myCustomAgent from "./agents/my-custom-agent.js";

export default defineExperiment({
  agent: myCustomAgent,
  runs: 1,
});
Then run that experiment:
npx niceeval exp local fixtures/my-task --sandbox docker
Never read ANTHROPIC_API_KEY, OPENAI_API_KEY, or other secrets through ctx. Authentication is the adapter’s private responsibility. Read secrets directly from process.env inside your adapter definition — they should never be visible to the experiment or the eval author.
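A small guard can make this rule easy to follow inside an adapter. The helper below is a hypothetical sketch, not part of niceeval: it takes the environment as a parameter (an adapter would pass `process.env`), fails fast when the variable is missing, and never puts the secret's value in the error message.

```typescript
type Env = Record<string, string | undefined>;

// Returns an env fragment to hand to the CLI process, or throws naming the missing variable.
function requireSecret(env: Env, name: string): Record<string, string> {
  const value = env[name];
  if (!value) {
    // Only the variable name is reported; the value itself is never logged.
    throw new Error(`${name} is not set in the adapter's environment`);
  }
  return { [name]: value };
}

// Intended use inside an adapter (shown as a comment so the sketch has no runtime dependencies):
//   const cliEnv = requireSecret(process.env, "ANTHROPIC_API_KEY");
```

The returned fragment is meant to be passed to the command that launches the CLI, so the key never needs to appear on ctx or in experiment configuration.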