Sandbox agents: evaluate Claude Code, Codex, and bub

A sandbox agent spawns a coding-agent CLI inside an isolated environment — a Docker container or cloud micro-VM — gives it a workspace, lets it run freely, then reads back the transcript and validates the result. This is how you evaluate tools like Claude Code, Codex, and bub: they need a real filesystem to write code, execute builds, and call tools, so niceeval provides that filesystem in a throwaway container your host machine never sees.

Built-in sandbox agents

niceeval ships three sandbox adapters out of the box. Import one into an experiment file and run that experiment.

claude-code

Runs Anthropic’s Claude Code CLI. Requires ANTHROPIC_API_KEY.

codex

Runs OpenAI’s Codex CLI. Requires OPENAI_API_KEY.

bub

Runs the bub coding agent. Authentication follows the bub CLI conventions.

Running a sandbox agent

Select the agent in an experiment file, then run that experiment. The optional --sandbox flag only overrides where niceeval spins up the isolated environment (see Sandbox Backends for the full list of options).

# Evaluate the button fixture with Claude Code in a local Docker container
export ANTHROPIC_API_KEY=sk-ant-...
npx niceeval exp local fixtures/button --sandbox docker

# Run 10 times and stop as soon as one pass is recorded
npx niceeval exp local fixtures/button --runs 10 --early-exit

The --sandbox docker flag is optional if Docker is your default backend. The agent itself is chosen in experiments/local.ts (or whichever experiment file you run), not on the command line.

Environment variables by agent

Agent	Required variable
`claude-code`	`ANTHROPIC_API_KEY`
`codex`	`OPENAI_API_KEY`
`bub`	(follows bub CLI auth conventions)
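
A preflight check can catch a missing key before any sandbox is created. The sketch below is only an illustration: `REQUIRED_ENV` and `missingEnv` are hypothetical names, not part of niceeval, and bub has no entry because its authentication follows the bub CLI's own conventions.

```typescript
// Hypothetical preflight helper; not part of niceeval.
type BuiltInAgent = "claude-code" | "codex" | "bub";

// Mirrors the table above. bub is empty because its auth follows the bub CLI.
const REQUIRED_ENV: Record<BuiltInAgent, string[]> = {
  "claude-code": ["ANTHROPIC_API_KEY"],
  codex: ["OPENAI_API_KEY"],
  bub: [],
};

// Returns the names of required variables that are unset or empty.
function missingEnv(
  agent: BuiltInAgent,
  env: Record<string, string | undefined>,
): string[] {
  return REQUIRED_ENV[agent].filter((name) => !env[name]);
}
```

Calling `missingEnv("claude-code", process.env)` before a run returns an empty array when the key is set, and the list of missing names otherwise.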

How a sandbox agent works

The runner creates the sandbox and commits a baseline, then your eval and adapter decide what to do. Starter files are uploaded explicitly from test(t); validation commands are ordinary t.sandbox.runCommand(...) calls.

createSandbox(backend, timeout)
  → git init && git commit             # baseline for later diff
  → test(t): uploadDirectory/writeFiles and run setup commands
  → adapter.send(input, ctx)           # ← the adapter's only segment
  → test(t): run validation commands and record assertions
  → collectGeneratedFiles()            # git diff HEAD
  → sandbox.stop()                     # destroy the environment
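
The ordering above can be sketched with a fake sandbox that only records which phase ran. Everything here is a stand-in: `FakeSandbox`, `createFakeSandbox`, and `runLifecycle` are hypothetical names, and running teardown in a `finally` block is an assumption about the runner rather than something this page states.

```typescript
// Hypothetical stand-ins; the real runner and Sandbox type live inside niceeval.
interface FakeSandbox {
  log: string[];
  run(step: string): Promise<void>;
}

function createFakeSandbox(): FakeSandbox {
  const log: string[] = [];
  return {
    log,
    async run(step: string): Promise<void> {
      log.push(step); // record the phase instead of executing anything
    },
  };
}

// Walks the phases in the order listed above and returns the recorded log.
async function runLifecycle(
  sandbox: FakeSandbox,
  send: (sb: FakeSandbox) => Promise<void>,
): Promise<string[]> {
  try {
    await sandbox.run("git init && git commit"); // baseline for the later diff
    await sandbox.run("test(t): setup");         // upload starter files, run setup commands
    await send(sandbox);                         // the adapter's only segment
    await sandbox.run("test(t): validation");    // validation commands and assertions
    await sandbox.run("git diff HEAD");          // collect generated files
  } finally {
    // Assumption: the environment is destroyed even when a phase throws.
    await sandbox.run("sandbox.stop()");
  }
  return sandbox.log;
}
```

The point of the sketch is that the adapter owns exactly one call in the middle; setup, validation, diffing, and teardown all belong to the runner and the eval.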

The `defineSandboxAgent` shape

A sandbox adapter receives a ctx whose ctx.sandbox is the live Sandbox handle for the current isolated environment. Your send function uses that handle to install the CLI, authenticate, run the agent, and read back the transcript.

import { defineSandboxAgent } from "niceeval/adapter";

defineSandboxAgent({
  name: string;
  async send(input: TurnInput, ctx: AgentContext): Promise<Turn>;
  //                                 ↑ ctx.sandbox is the Sandbox handle
});

The five things that differ between coding-agent adapters are:

1. Install the CLI: for example, npm install -g @anthropic-ai/claude-code
2. Authenticate: read the API key from the environment and pass it to the command
3. Build the command: construct the argument list, including the prompt
4. Pass the model flag: forward ctx.model to the CLI if the experiment specifies one
5. Read and parse the transcript: locate the native JSONL output and convert it to StreamEvent[]

The built-in `claude-code` adapter (full example)

The source for the built-in Claude Code adapter illustrates all five steps and how the shared helpers fit in:

// agents/claude-code.ts  (built-in; custom agents follow the same shape)
import { defineSandboxAgent, shared } from "niceeval/adapter";
import { requireEnv } from "niceeval";

// Authentication is the adapter's private business — never passed through ctx
const auth = () => ({ ANTHROPIC_API_KEY: requireEnv("ANTHROPIC_API_KEY") });

export default defineSandboxAgent({
  name: "claude-code",
  async send(input, ctx) {
    const sb = ctx.sandbox!;

    // Step 1: install the CLI
    await sb.runCommand("npm", ["install", "-g", "@anthropic-ai/claude-code"]);

    // Step 3 & 4: build the command, forwarding model and feature flags
    const args = ["--print", "--dangerously-skip-permissions"];
    if (ctx.model) args.push("--model", ctx.model);   // only when experiment sets it
    if (ctx.flags.webResearch) args.push("--allowedTools", "WebSearch,WebFetch");
    if (!ctx.session.isNew && ctx.session.id) args.push("--resume", ctx.session.id);
    args.push(input.text);

    // Step 2: authenticate via env, run the agent
    const res = await sb.runCommand("claude", args, { env: auth() });

    // Step 5: read the transcript, parse it into StreamEvent[]
    const raw = await shared.captureLatestJsonl(sb, "~/.claude/projects");
    ctx.session.id = shared.sessionIdFromClaudeTranscript(raw);  // enables multi-turn resume

    return {
      // parseClaudeCode is the transcript parser shown under "Transcript parsing" below
      events: parseClaudeCode(raw),   // native JSONL → standard StreamEvent[]
      status: res.exitCode === 0 ? "completed" : "failed",
    };
  },
});

Shared helpers

niceeval provides helpers that all sandbox adapters can reuse. Using them keeps workspace preparation, diff collection, and validation identical across agents, so results from different agents are directly comparable.

Helper	What it does
`shared.prepareWorkspace(sandbox, fixture)`	Uploads workspace files (hiding `EVAL.ts`), runs `git init` and commits a baseline
`shared.captureLatestJsonl(sandbox, dir)`	Finds and reads the most recent JSONL transcript in the given directory
`shared.runValidation(sandbox, scripts, mode)`	Uploads test files and runs `EVAL.ts` (Vitest) plus any npm scripts
`shared.injectO11yContext(sandbox, events)`	Derives the o11y summary from the event stream and writes it to `__niceeval__/results.json` so `EVAL.ts` can assert on agent behavior
`shared.captureGeneratedFiles(sandbox)`	Runs `git diff HEAD` and returns `{ generated, deleted }` — the file-level diff used for `t.fileChanged`, `t.fileDeleted`, and `t.sandbox.diff`
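
To make the `{ generated, deleted }` shape concrete, here is a minimal sketch of how such a file-level diff could be derived from the output of `git diff --name-status HEAD`. This is not niceeval's implementation: `FileDiff` and `parseNameStatus` are hypothetical names, and the real helper may classify renames or copies differently.

```typescript
// Hypothetical sketch; not niceeval's implementation.
interface FileDiff {
  generated: string[]; // added, modified, copied, or renamed-to paths
  deleted: string[];   // removed or renamed-from paths
}

// Parses `git diff --name-status HEAD` output: one "<status>\t<path>[\t<path>]" per line.
function parseNameStatus(output: string): FileDiff {
  const diff: FileDiff = { generated: [], deleted: [] };
  for (const line of output.split("\n")) {
    if (!line.trim()) continue; // ignore blank lines
    const fields = line.split("\t");
    const status = fields[0] ?? "";
    const paths = fields.slice(1);
    const last = paths[paths.length - 1];
    if (last === undefined) continue; // malformed line with no path

    if (status.startsWith("D")) {
      diff.deleted.push(last);
    } else if (status.startsWith("R")) {
      // Renames list the old path first, then the new one.
      diff.deleted.push(paths[0] ?? last);
      diff.generated.push(last);
    } else {
      // A (added), M (modified), C (copied), T (type change): the last path is the result.
      diff.generated.push(last);
    }
  }
  return diff;
}
```

Treating a rename as one deletion plus one generated file is a design choice made for this sketch; it keeps both lists as plain path arrays.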

Transcript parsing: JSONL → `StreamEvent[]`

Each coding agent writes its own native transcript format. Your adapter’s fifth step is converting that format into the standard StreamEvent[] vocabulary that all niceeval assertions understand.

// Minimal transcript parser skeleton
import type { StreamEvent } from "niceeval";

function parseClaudeCode(rawJsonl: string): StreamEvent[] {
  const events: StreamEvent[] = [];

  for (const line of rawJsonl.split("\n")) {
    if (!line.trim()) continue;   // skip blank lines so an empty transcript does not throw
    const entry = JSON.parse(line);

    if (entry.type === "assistant" && entry.message?.content) {
      for (const block of entry.message.content) {
        if (block.type === "text") {
          events.push({ type: "message", role: "assistant", text: block.text });
        }
        if (block.type === "tool_use") {
          events.push({ type: "action.called", callId: block.id, name: block.name, input: block.input });
        }
      }
    }

    if (entry.type === "tool_result") {
      events.push({
        type: "action.result",
        callId: entry.tool_use_id,
        output: entry.content,
        status: "completed",
      });
    }
  }

  return events;
}

Once you normalize the transcript into StreamEvent[], the entire suite of niceeval assertions becomes available: t.calledTool, t.toolOrder, t.noFailedActions, t.messageIncludes, and more — no extra work required.
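
To illustrate why a normalized event stream is enough, the sketch below shows how checks in the spirit of t.calledTool, t.toolOrder, and t.noFailedActions could be computed from events alone. These are standalone, hypothetical functions, not the niceeval implementations; the local `Ev` type only mirrors the fields used by the parser above, and the `"failed"` status value is an assumption.

```typescript
// Local stand-in for the event shapes produced by the parser above (assumption).
type Ev =
  | { type: "message"; role: "assistant"; text: string }
  | { type: "action.called"; callId: string; name: string; input: unknown }
  | { type: "action.result"; callId: string; output: unknown; status: "completed" | "failed" };

// Tool names in the order they were called.
function calledNames(events: Ev[]): string[] {
  const names: string[] = [];
  for (const e of events) {
    if (e.type === "action.called") names.push(e.name);
  }
  return names;
}

// True if the tool was called at least once.
function calledTool(events: Ev[], name: string): boolean {
  return calledNames(events).indexOf(name) !== -1;
}

// True if `expected` appears as a subsequence of the call order (gaps allowed).
function toolOrder(events: Ev[], expected: string[]): boolean {
  let next = 0;
  for (const name of calledNames(events)) {
    if (next < expected.length && name === expected[next]) next++;
  }
  return next === expected.length;
}

// True if no action result reports a failure.
function noFailedActions(events: Ev[]): boolean {
  return events.every((e) => e.type !== "action.result" || e.status !== "failed");
}
```

None of these functions know which CLI produced the events, which is what lets one set of assertions serve every adapter.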

`ctx.model` and `ctx.flags`

Experiments pass a model tier and feature flags through ctx. Your adapter should forward them rather than hardcoding values — this allows the same adapter to serve multiple experiments without modification.

// Forward the experiment's model tier to the CLI (omit if not set → CLI uses its default)
if (ctx.model) args.push("--model", ctx.model);

// Read a feature flag to conditionally enable a tool
if (ctx.flags.webResearch) args.push("--allowedTools", "WebSearch,WebFetch");

ctx.model is set by the experiment configuration, not by the adapter. If an experiment doesn’t specify a model, ctx.model is undefined and the agent CLI uses its built-in default.
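
One way to keep this forwarding logic easy to check is to isolate it in a pure function that can be exercised without a sandbox. The sketch below reuses only the flags already shown in the claude-code adapter on this page; `ArgContext` and `buildClaudeArgs` are hypothetical names, and `ArgContext` is a reduced stand-in for the real context type.

```typescript
// Hypothetical subset of the adapter context needed for argument building.
interface ArgContext {
  model?: string;
  flags: { webResearch?: boolean };
}

// Builds the CLI argument list from the prompt and the experiment's settings.
function buildClaudeArgs(prompt: string, ctx: ArgContext): string[] {
  const args = ["--print", "--dangerously-skip-permissions"];
  if (ctx.model) args.push("--model", ctx.model); // omitted when the experiment sets no model
  if (ctx.flags.webResearch) args.push("--allowedTools", "WebSearch,WebFetch");
  args.push(prompt); // the prompt goes last, as in the adapter above
  return args;
}
```

With no model set, the returned list contains no `--model` entry, so the CLI falls back to its own default.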

Registering a custom sandbox agent

Built-in agents (claude-code, codex, bub) can be imported into experiment files. Custom adapters follow the same pattern:

// experiments/local.ts
import { defineExperiment } from "niceeval";
import myCustomAgent from "./agents/my-custom-agent.js";

export default defineExperiment({
  agent: myCustomAgent,
  runs: 1,
});

Then run that experiment:

npx niceeval exp local fixtures/my-task --sandbox docker

Never read ANTHROPIC_API_KEY, OPENAI_API_KEY, or other secrets through ctx. Authentication is the adapter’s private responsibility. Read secrets directly from process.env inside your adapter definition — they should never be visible to the experiment or the eval author.
