niceeval agents and adapters: connect to any AI system
Learn how niceeval connects to any AI through named agent adapters. Remote agents wrap HTTP or in-process calls; sandbox agents run CLI tools in isolation.
Connecting niceeval to the system you want to evaluate is the part most likely to be misunderstood, so two foundational claims come first: niceeval does not define any agent protocol, and the adapter is the open boundary where you own the integration. Everything else on this page follows from those two facts.
Agent is the abstraction. From niceeval’s perspective, an agent is a named object with capability flags and a single send method. The runner interacts with nothing but this interface. It never branches on if (agent === "claude-code").
Adapter is the concrete implementation. You write it (or use one of niceeval’s built-ins). It knows how to authenticate, how to call your service, and how to translate whatever your AI returns into the standard event stream.
Experiments reference agents directly. The URL, API key, or CLI invocation is the adapter’s private configuration. niceeval never sees it.
Some eval frameworks define a wire protocol and let you point at any compatible endpoint with a URL. niceeval deliberately does not do this, because there is no single protocol that every AI agent speaks. Forcing your agent to conform to an external protocol would mean wrapping it in a compatibility shim instead of testing it as it actually runs.
Instead, you write a small adapter that knows your agent’s protocol and normalizes its output. The adapter’s URL (or any other connection detail) is read from environment variables or a closure, and the core never touches it:
    # Evaluate local and production with the same adapter; the URL is its private config
    npx niceeval exp local weather   # local
    npx niceeval exp prod weather    # production
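As a minimal sketch of keeping the URL private to the adapter, the factory below reads a base URL from an environment-like record and closes over it. The `WEATHER_AGENT_URL` variable, the `/chat` path, the injected `post` function, and the local `SimpleAgent` and `Turn` types are all illustrative assumptions, not part of niceeval.

```typescript
// Local stand-ins so the sketch is self-contained; a real adapter would use niceeval's types.
type Turn = { events: unknown[]; status: "completed" | "failed" };
type SimpleAgent = { name: string; send(text: string): Promise<Turn> };
// Hypothetical transport: POST a body to a URL and resolve with the reply text.
type Post = (url: string, body: string) => Promise<string>;

// Hypothetical factory: the URL comes from the environment and stays inside the closure.
function makeWeatherAgent(env: Record<string, string | undefined>, post: Post): SimpleAgent {
  const baseUrl = env["WEATHER_AGENT_URL"] ?? "http://localhost:3000";
  return {
    name: "weather",
    async send(text) {
      try {
        const reply = await post(`${baseUrl}/chat`, JSON.stringify({ text }));
        return {
          events: [{ type: "message", role: "assistant", text: reply }],
          status: "completed",
        };
      } catch {
        // Transport errors become a failed turn instead of an exception.
        return { events: [], status: "failed" };
      }
    },
  };
}
```

Nothing outside the factory can see `baseUrl`, so pointing the same adapter at a different deployment is a matter of changing the environment, not the experiment.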
Regardless of whether the adapter wraps an in-process function, an HTTP endpoint, or a sandbox CLI, every agent exposes the same interface to the runner:
    interface Agent {
      readonly name: string; // "my-bot" / "claude-code" / "codex"
      readonly capabilities: AgentCapabilities;
      send(input: TurnInput, ctx: AgentContext): Promise<Turn>;
    }

    interface AgentCapabilities {
      conversation?: boolean;      // supports multiple t.send() calls per eval
      toolObservability?: boolean; // can produce action.* events → t.calledTool()
      workspace?: boolean;         // works on a filesystem → t.sandbox.diff / t.sandbox
    }

    interface AgentContext {
      readonly signal: AbortSignal;
      readonly model?: ModelTier; // provided by the experiment; omit → agent's native default
      readonly flags: Readonly<Record<string, unknown>>; // experiment feature flags, forwarded to agent
      readonly sandbox?: Sandbox; // only present for sandbox agents (set by --sandbox)
      readonly session: { id?: string; readonly isNew: boolean }; // for multi-turn resume / newSession
      log(msg: string): void;
    }

    interface Turn {
      readonly events: StreamEvent[]; // ★ standard event stream: the core product of every adapter
      readonly data?: unknown;        // structured output (for outputEquals / outputMatches)
      readonly status: "completed" | "failed" | "waiting"; // waiting = parked on HITL input
      readonly usage?: Usage;         // token usage (input / output / cache tokens)
    }
send is the single verb. Turn.events is the single product. The difference between adapters is entirely in how send translates the raw response into events.
The AgentCapabilities flags you declare in your adapter determine which methods are available in test(t). This is enforced at the TypeScript type level — you cannot call t.calledTool() if the agent hasn’t declared toolObservability: true, and you will get a compile error rather than a runtime surprise.
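One way such type-level gating can work is with a conditional type keyed on the declared capability. The sketch below is an illustration under assumed names (`Caps`, `TestContext`, `makeContext`); it is not niceeval's actual type machinery, which the source does not show.

```typescript
// Hypothetical types illustrating type-level gating; niceeval's real types may differ.
interface Caps { toolObservability?: boolean }
interface BaseT { send(text: string): void }
interface ToolT { calledTool(name: string): boolean }

// Tool assertions are added only when the capability is declared as the literal `true`.
type TestContext<C extends Caps> =
  BaseT & (C["toolObservability"] extends true ? ToolT : unknown);

function makeContext<C extends Caps>(caps: C, toolNames: string[]): TestContext<C> {
  const base: BaseT = { send: () => {} };
  if (caps.toolObservability !== true) return base as unknown as TestContext<C>;
  const tools: ToolT = { calledTool: (name) => toolNames.includes(name) };
  return { ...base, ...tools } as unknown as TestContext<C>;
}

// `as const` keeps the literal type so the conditional type can see it.
const observable = makeContext({ toolObservability: true } as const, ["get_weather"]);
const sawWeatherCall = observable.calledTool("get_weather"); // type-checks

const opaque = makeContext({ toolObservability: false } as const, []);
// opaque.calledTool("get_weather"); // compile error: 'calledTool' does not exist on this type
```

The commented-out line is the point of the design: the mistake is rejected by the compiler before any eval runs.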
The most important thing an adapter does is not connecting to the AI; it is normalizing the AI’s output into the standard event stream. Once that normalization happens, the entire assertion vocabulary works for free, regardless of which AI produced the data.
The core’s deriveRunFacts(events) folds this flat stream into structured facts — toolCalls, subagentCalls, parked, messageCount — and all scoped assertions (t.calledTool, t.succeeded, t.noFailedActions, etc.) read from those derived facts. Your adapter only needs to produce correct events; scoring is handled entirely by the core.
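The fold described above can be pictured roughly as follows. This is a sketch, not the core's implementation: the event shapes are assumptions (the source names only `action.called` and the `action.*` family, so `message`, `action.failed`, and `input.requested` are hypothetical), and `subagentCalls` is left out for brevity.

```typescript
// Hypothetical event shapes, assumed for illustration only.
type StreamEvent =
  | { type: "message"; text: string }
  | { type: "action.called"; id: string; name: string; input: Record<string, unknown> }
  | { type: "action.failed"; id: string; error: string }
  | { type: "input.requested" };

interface ToolCallFact { name: string; input: Record<string, unknown>; failed: boolean }
interface RunFacts { toolCalls: ToolCallFact[]; messageCount: number; parked: boolean }

// Fold the flat stream into structured facts in a single pass.
function deriveRunFactsSketch(events: StreamEvent[]): RunFacts {
  const byId = new Map<string, ToolCallFact>();
  const facts: RunFacts = { toolCalls: [], messageCount: 0, parked: false };
  for (const ev of events) {
    switch (ev.type) {
      case "message":
        facts.messageCount += 1;
        break;
      case "action.called": {
        const call: ToolCallFact = { name: ev.name, input: ev.input, failed: false };
        byId.set(ev.id, call);
        facts.toolCalls.push(call);
        break;
      }
      case "action.failed": {
        // Mark the matching call, if one was recorded, as failed.
        const call = byId.get(ev.id);
        if (call) call.failed = true;
        break;
      }
      case "input.requested":
        facts.parked = true;
        break;
    }
  }
  return facts;
}
```

Because assertions read only from the derived facts, an adapter that emits the right events gets every assertion without writing any scoring code.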
Skill loading (load_skill) is just an action.called event, so t.loadedSkill("memory-v2") is syntactic sugar for t.calledTool("load_skill", { input: { skill: "memory-v2" } }). No special event type is needed.
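To make the sugar relationship concrete, here is a small sketch written as free functions over a list of recorded calls. The `ToolCall` shape and the shallow, field-by-field input matching are assumptions made for the example; the real matcher may compare inputs differently.

```typescript
// Assumed shape of a recorded tool call.
type ToolCall = { name: string; input: Record<string, unknown> };

// True when some call has this name and contains every expected input field (shallow equality).
function calledTool(
  calls: ToolCall[],
  name: string,
  match: { input?: Record<string, unknown> } = {},
): boolean {
  const expected = Object.entries(match.input ?? {});
  return calls.some(
    (call) => call.name === name && expected.every(([key, value]) => call.input[key] === value),
  );
}

// Sugar: loading a skill is just a call to the "load_skill" tool.
function loadedSkill(calls: ToolCall[], skill: string): boolean {
  return calledTool(calls, "load_skill", { input: { skill } });
}
```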
Use defineAgent when your subject under test is a function you can call directly, or a service you reach over HTTP. The send method is entirely in your control: call a local function, fire a fetch, or do anything else. Your only obligation is to map the response to StreamEvent[].
    // agents/my-agent.ts
    import { defineAgent } from "niceeval/adapter";
    import { myAgent } from "../src/agent.js";

    export default defineAgent({
      name: "my-agent",
      capabilities: { conversation: true, toolObservability: true },
      async send(input, ctx) {
        const res = await myAgent.handle(input.text, { signal: ctx.signal });
        return {
          events: toStreamEvents(res), // your mapping function
          data: res.json,
          status: "completed",
        };
      },
    });
toStreamEvents is a small mapping function you write once — it converts “what your service said and which tools it called” into StreamEvent[]. That is the entirety of a remote agent author’s work.
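A mapping function of this kind might look like the sketch below. Both sides are assumptions: `AgentResponse` stands in for whatever your service returns, and the two event shapes are a hypothetical subset of the standard stream (only the `action.called` name comes from this page).

```typescript
// Hypothetical response shape returned by the service under test.
interface AgentResponse {
  reply: string;
  toolCalls: { id: string; name: string; args: Record<string, unknown> }[];
}

// Assumed subset of the event stream, for illustration.
type StreamEvent =
  | { type: "message"; role: "assistant"; text: string }
  | { type: "action.called"; id: string; name: string; input: Record<string, unknown> };

// Tool calls first, in the order they happened, then the final assistant message.
function toStreamEvents(res: AgentResponse): StreamEvent[] {
  const actions = res.toolCalls.map(
    (call): StreamEvent => ({
      type: "action.called",
      id: call.id,
      name: call.name,
      input: call.args,
    }),
  );
  return [...actions, { type: "message", role: "assistant", text: res.reply }];
}
```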
Use defineSandboxAgent when the subject under test is a coding agent CLI (Claude Code, Codex, bub) that needs to run inside an isolated filesystem. The “connection” is not a wire protocol; it is: spawn the CLI inside the sandbox, pass it the prompt, let it work on the sandbox filesystem, read back the transcript.
All coding agent adapters share the same structural skeleton — the parts that differ between claude-code and codex are just five points: which CLI to install, how to authenticate, how to compose the invocation, how the model flag is passed, and where to read the transcript.
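The shared skeleton and the five per-agent points can be sketched as a spec object plus one generic runner. Everything here is hypothetical: `SandboxLike`, `CliAgentSpec`, and `runCliAgent` are local stand-ins invented for the example, and they do not describe the real `defineSandboxAgent` signature or any real CLI's flags.

```typescript
// Hypothetical stand-in for the sandbox handle: run a command, read a file.
interface SandboxLike {
  exec(cmd: string, env?: Record<string, string>): Promise<{ exitCode: number }>;
  readFile(path: string): Promise<string>;
}

// The five per-agent differences, expressed as data. Field names are illustrative.
interface CliAgentSpec {
  installCommand: string;                                        // 1. which CLI to install
  authEnv(): Record<string, string>;                             // 2. how to authenticate
  composeInvocation(prompt: string, modelFlag: string): string;  // 3. how to compose the invocation
  modelFlag(model: string | undefined): string;                  // 4. how the model flag is passed
  transcriptPath: string;                                        // 5. where to read the transcript
}

// Shared skeleton: the same steps for every CLI, parameterized by the spec.
async function runCliAgent(
  spec: CliAgentSpec,
  sandbox: SandboxLike,
  prompt: string,
  model?: string,
): Promise<{ status: "completed" | "failed"; transcript: string }> {
  await sandbox.exec(spec.installCommand);
  const command = spec.composeInvocation(prompt, spec.modelFlag(model));
  const { exitCode } = await sandbox.exec(command, spec.authEnv());
  const transcript = await sandbox.readFile(spec.transcriptPath);
  return { status: exitCode === 0 ? "completed" : "failed", transcript };
}
```

Under this framing, supporting another CLI means writing another spec; the runner itself does not change.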
The experiment’s agent and the sandbox backend are independent:
experiment.agent selects which system under test to connect to
--sandbox <backend> selects where sandbox agents run (docker / vercel / third-party)
Any sandbox agent works with any sandbox backend. claude-code can run in Docker or Vercel; the same Docker sandbox can host claude-code or bub. The runner prepares a Sandbox handle and passes it through ctx.sandbox — agent and sandbox interact only through that interface. Remote agents (defineAgent) ignore --sandbox entirely.
Feature flags (webResearch, which skill to inject, effort level…) are set by the experiment and reach the adapter through ctx.flags.*.
The one-line rule: the adapter only configures “how to reach me”; the experiment configures “which model and which switches.” This lets the same adapter be reused across experiments with different models and flags, without any changes to the adapter itself.
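As a rough illustration of that split, the helper below builds a request from whatever the experiment supplies, without hard-coding a model or a flag value. The `ServiceRequest` shape, the `buildRequest` helper, and treating `ModelTier` as a string are assumptions for the example; only `ctx.model`, `ctx.flags`, and the `webResearch` flag name come from this page.

```typescript
type ModelTier = string; // assumption: the source does not define ModelTier
interface Ctx {
  readonly model?: ModelTier;
  readonly flags: Readonly<Record<string, unknown>>;
}

// Hypothetical request body sent to the service under test.
interface ServiceRequest { text: string; model?: string; webResearch: boolean }

// The adapter forwards experiment choices; it never decides them itself.
function buildRequest(text: string, ctx: Ctx): ServiceRequest {
  const request: ServiceRequest = {
    text,
    webResearch: ctx.flags["webResearch"] === true,
  };
  // Omitting the model lets the agent fall back to its native default.
  if (ctx.model !== undefined) request.model = ctx.model;
  return request;
}
```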
Remote agent
Implement defineAgent with a send that calls your code or service, maps the response to StreamEvent[], and declares the capabilities your agent supports. That is everything.
Sandbox agent
Implement defineSandboxAgent with the five per-agent differences (install CLI, auth, compose invocation, model flag, read transcript), reuse shared utilities for workspace prep and diff capture, and write a transcript parser (o11y/parsers/<name>.ts) that converts raw JSONL to StreamEvent[].
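A transcript parser of this kind is mostly line-by-line JSON decoding. The sketch below assumes a made-up JSONL record format (`kind: "text"` and `kind: "tool_use"`) and a hypothetical subset of the event stream; a real parser would follow the transcript format of the specific CLI.

```typescript
// Assumed subset of the event stream, for illustration.
type StreamEvent =
  | { type: "message"; text: string }
  | { type: "action.called"; id: string; name: string; input: Record<string, unknown> };

// Convert a JSONL transcript into events, skipping blank, malformed, or unknown records.
function parseTranscript(jsonl: string): StreamEvent[] {
  const events: StreamEvent[] = [];
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    let record: unknown;
    try {
      record = JSON.parse(line);
    } catch {
      continue; // e.g. a partially written final line
    }
    if (typeof record !== "object" || record === null) continue;
    const r = record as { kind?: unknown; text?: unknown; id?: unknown; name?: unknown; input?: unknown };
    if (r.kind === "text" && typeof r.text === "string") {
      events.push({ type: "message", text: r.text });
    } else if (r.kind === "tool_use" && typeof r.id === "string" && typeof r.name === "string") {
      const input = typeof r.input === "object" && r.input !== null
        ? (r.input as Record<string, unknown>)
        : {};
      events.push({ type: "action.called", id: r.id, name: r.name, input });
    }
  }
  return events;
}
```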
Neither kind of adapter touches the core. The adapter boundary is the load-bearing element of the design: adding a new agent never requires modifying niceeval itself.