Every system under test in niceeval is an agent — a named object that receives a prompt, drives your code or service, and returns a structured result. You connect your own logic by writing an adapter with defineAgent. The adapter owns all the details of how to call your system: in-process function references, HTTP endpoints, authentication headers, and message formats are entirely private to the adapter. Experiments reference agents directly rather than passing URLs on the CLI, because there is no universal protocol that every agent speaks.

When to use defineAgent

In-process function

Call your own function directly inside send. Zero network overhead — the fastest possible eval loop, ideal for unit-level semantic tests in CI.

Remote HTTP service

Issue a fetch inside send using whatever protocol your service speaks. The URL, auth, and request shape are your business; niceeval never sees them.

The defineAgent shape

defineAgent accepts a plain object with three fields. The send function is the only place you need to write any logic.
import { defineAgent } from "niceeval/adapter";

defineAgent({
  name: string;                          // used for reports and grouping
  capabilities: AgentCapabilities;       // declares what this agent can do
  async send(input: TurnInput, ctx: AgentContext): Promise<Turn>;
});

AgentCapabilities

Declaring capabilities lets niceeval shape the t context that eval authors receive. If a capability is absent, the corresponding assertions are not available at the type level — you get a compile-time error rather than a runtime surprise.
interface AgentCapabilities {
  conversation?: boolean;       // supports multiple send() calls per eval (t.reply, t.newSession)
  toolObservability?: boolean;  // produces action.* events  →  t.calledTool, t.toolOrder, etc.
  sandbox?: boolean;            // sandbox agents only — do not set it on a remote agent
}
The sandbox capability — which enables t.sandbox.diff, t.fileChanged, and related assertions — is only meaningful for sandbox agents that run in an isolated filesystem. Remote and in-process agents should declare only conversation and toolObservability.
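The compile-time gating described above can be illustrated with conditional types keyed on the capabilities object. This page does not show niceeval's real type machinery, so the following is only a hypothetical sketch of the general technique. The names Caps, BaseApi, ConversationApi, ToolApi, EvalContext, and makeContext are invented for illustration and are not part of niceeval.

```typescript
// Hypothetical sketch: none of these names come from niceeval itself.
interface Caps {
  conversation?: boolean;
  toolObservability?: boolean;
}

interface BaseApi { messageIncludes(text: string): boolean }
interface ConversationApi { reply(text: string): void }
interface ToolApi { calledTool(name: string): boolean }

// A capability that is literally `true` adds its methods; anything else adds nothing.
type EvalContext<C extends Caps> = BaseApi &
  (C["conversation"] extends true ? ConversationApi : unknown) &
  (C["toolObservability"] extends true ? ToolApi : unknown);

function makeContext<C extends Caps>(caps: C): EvalContext<C> {
  const api: Record<string, unknown> = {
    messageIncludes: (text: string) => text.length > 0, // stub implementation
  };
  if (caps.conversation === true) api.reply = (_text: string) => {};
  if (caps.toolObservability === true) api.calledTool = (_name: string) => false;
  return api as unknown as EvalContext<C>;
}

const plain = makeContext({});
const chatty = makeContext({ conversation: true } as const);

chatty.reply("hello");     // compiles: conversation was declared
// plain.reply("hello");   // would not compile: 'reply' does not exist on this context

console.log("reply" in plain, "reply" in chatty); // false true
```

The design point is that the capability declaration is the single source of truth: the same object decides both which methods exist at runtime and which ones the type checker lets eval authors call.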

AgentContext

The runner passes ctx into every send call. Use ctx.signal to respect cancellation, ctx.model to forward the experiment’s model tier to your agent, and ctx.flags to read feature flags defined by the experiment.
interface AgentContext {
  readonly signal: AbortSignal;
  readonly model?: string;                             // set by the experiment; omit → agent's default
  readonly flags: Readonly<Record<string, unknown>>;  // experiment feature flags, forwarded to agent
  readonly sandbox?: Sandbox;                         // sandbox agents only
  readonly session: { id?: string; readonly isNew: boolean };
  log(msg: string): void;
}
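
As a rough illustration of how the context fields might be used together, the sketch below builds a request body from a turn's text and ctx. It uses a local stand-in for AgentContext so that it runs on its own. The ChatRequest shape, the "verbose" flag name, and the rule for when to send a session id are all assumptions made for this example, not niceeval or service behavior.

```typescript
// Local stand-in for the AgentContext fields described above (sandbox omitted).
interface SketchContext {
  readonly signal: AbortSignal;
  readonly model?: string;
  readonly flags: Readonly<Record<string, unknown>>;
  readonly session: { id?: string; readonly isNew: boolean };
  log(msg: string): void;
}

// Hypothetical request body; your service defines its own shape.
interface ChatRequest {
  message: string;
  verbose: boolean;
  model?: string;
  sessionId?: string;
}

function buildChatRequest(text: string, ctx: SketchContext): ChatRequest {
  // Stop early if the runner has already cancelled this turn.
  if (ctx.signal.aborted) throw new Error("turn cancelled before sending");

  const request: ChatRequest = {
    message: text,
    // "verbose" is a made-up flag name used only for illustration.
    verbose: ctx.flags["verbose"] === true,
  };
  // Forward the model only when the experiment set one, so the agent's default applies otherwise.
  if (ctx.model !== undefined) request.model = ctx.model;
  // Assumption: only an existing session has an id worth forwarding.
  if (!ctx.session.isNew && ctx.session.id !== undefined) request.sessionId = ctx.session.id;

  ctx.log(`sending turn (model: ${ctx.model ?? "agent default"})`);
  return request;
}

const logged: string[] = [];
const exampleRequest = buildChatRequest("What's the weather in Oslo?", {
  signal: new AbortController().signal,
  model: "small",
  flags: { verbose: true },
  session: { id: "s-1", isNew: false },
  log: (msg) => logged.push(msg),
});
```

In a real adapter the same ctx.signal would also be passed to fetch so that an in-flight request is cancelled along with the turn.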

Turn — what send returns

interface Turn {
  readonly events: StreamEvent[];                        // ★ the normalized event stream
  readonly data?: unknown;                               // structured output for outputEquals / outputMatches
  readonly status: "completed" | "failed" | "waiting";  // "waiting" means parked at a HITL prompt
  readonly usage?: Usage;                                // optional token usage
}
The events array is the heart of every Turn. All assertions — t.calledTool, t.messageIncludes, t.eventOrder, and the rest — derive from this single stream. Populating it correctly is the only real job of a remote adapter.
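
One way to keep status consistent with the event stream is to derive it from the events when assembling the Turn. The helper below is a minimal sketch under a stated assumption: a turn counts as "waiting" when its last event is input.requested. The names SketchEvent, SketchTurn, and finishTurn are hypothetical, and the event union is trimmed to three variants to keep the example short.

```typescript
// Trimmed local copies of the types described on this page; InputRequest is left opaque.
type SketchEvent =
  | { type: "message"; role: "assistant" | "user"; text: string }
  | { type: "input.requested"; request: unknown }
  | { type: "error"; message: string };

interface SketchTurn {
  readonly events: SketchEvent[];
  readonly data?: unknown;
  readonly status: "completed" | "failed" | "waiting";
}

// Assumption: the turn is parked at a HITL prompt when input.requested is the final event.
function finishTurn(events: SketchEvent[], data?: unknown): SketchTurn {
  const last = events[events.length - 1];
  const status = last?.type === "input.requested" ? "waiting" : "completed";
  return data === undefined ? { events, status } : { events, data, status };
}

const answered = finishTurn(
  [{ type: "message", role: "assistant", text: "It is 4 °C in Oslo." }],
  { tempC: 4 },
);
const parked = finishTurn([
  { type: "message", role: "assistant", text: "Shall I send the email?" },
  { type: "input.requested", request: { kind: "confirm" } },
]);

console.log(answered.status, parked.status); // completed waiting
```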

In-process adapter example

Use an in-process adapter when your agent is a TypeScript function you can import directly. There is no network round-trip, and you get full type safety.
// agents/classify.ts
import { defineAgent } from "niceeval/adapter";
import { classifyIntent } from "../src/agent.js";

export default defineAgent({
  name: "classify",
  capabilities: {},
  async send(input, ctx) {
    const result = await classifyIntent(input.text, { signal: ctx.signal });

    return {
      events: [
        { type: "message", role: "assistant", text: result.label },
      ],
      data: result,          // available as turn.data in outputEquals / outputMatches
      status: "completed",
    };
  },
});
You don’t need to declare conversation or toolObservability if your function doesn’t support them. Omitting a capability simply means the corresponding t.* methods won’t appear in eval authors’ type signatures.

Remote HTTP adapter example

When your agent lives behind an HTTP endpoint, send is just a fetch. The URL comes from an environment variable so you can point the same adapter at local or production without changing any code.
// agents/weather-bot.ts
import { defineAgent } from "niceeval/adapter";

export default defineAgent({
  name: "weather-bot",
  capabilities: { conversation: true, toolObservability: true },
  async send(input, ctx) {
    const r = await fetch(`${process.env.AGENT_URL}/chat`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ message: input.text }),
      signal: ctx.signal,
    });
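    if (!r.ok) {
      // Report HTTP failures as a failed turn instead of parsing an error page as JSON.
      return {
        events: [{ type: "error", message: `HTTP ${r.status} from agent` }],
        status: "failed",
      };
    }
    const body = await r.json();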

    return {
      events: toStreamEvents(body),   // map your response shape → StreamEvent[]
      data: body.output,
      status: "completed",
    };
  },
});

toStreamEvents — mapping your response to the standard stream

toStreamEvents is a small mapping function you write. Its job is to translate whatever your service returns into the standard StreamEvent[] vocabulary that niceeval understands. Here is what the standard types look like:
type StreamEvent =
  | { type: "message"; role: "assistant" | "user"; text: string }
  | { type: "action.called"; callId: string; name: string; input: JsonValue }
  | { type: "action.result"; callId: string; output?: JsonValue;
      status: "completed" | "failed" | "rejected" }
  | { type: "subagent.called"; callId: string; name: string; remoteUrl?: string }
  | { type: "subagent.completed"; callId: string; output?: JsonValue;
      status: "completed" | "failed" }
  | { type: "input.requested"; request: InputRequest }  // agent paused at a HITL prompt
  | { type: "thinking"; text: string }
  | { type: "error"; message: string };
A minimal mapper for a service that returns { reply: string, tools?: ToolCall[] } might look like this (tool calls are typed loosely as any[] to keep the example short):
function toStreamEvents(body: { reply: string; tools?: any[] }): StreamEvent[] {
  const events: StreamEvent[] = [];

  for (const call of body.tools ?? []) {
    events.push({ type: "action.called", callId: call.id, name: call.name, input: call.input });
    events.push({ type: "action.result", callId: call.id, output: call.result, status: "completed" });
  }

  events.push({ type: "message", role: "assistant", text: body.reply });
  return events;
}
Emit action.called and action.result pairs with matching callId values. niceeval’s deriveRunFacts stitches them into structured toolCalls facts, which power t.calledTool, t.toolOrder, t.noFailedActions, and more.
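
To see why matching callId values matter, here is a hypothetical sketch of how called/result pairs could be stitched into tool-call records. It is not niceeval's deriveRunFacts, whose internals this page does not show. The names ToolCallFact and pairToolCalls, and the extra "pending" status for calls that never received a result, are assumptions made for illustration.

```typescript
type JsonValue = null | boolean | number | string | JsonValue[] | { [key: string]: JsonValue };

type ActionEvent =
  | { type: "action.called"; callId: string; name: string; input: JsonValue }
  | { type: "action.result"; callId: string; output?: JsonValue;
      status: "completed" | "failed" | "rejected" };

// Hypothetical fact shape; niceeval's real toolCalls facts may differ.
interface ToolCallFact {
  callId: string;
  name: string;
  input: JsonValue;
  output?: JsonValue;
  status: "completed" | "failed" | "rejected" | "pending";
}

function pairToolCalls(events: ActionEvent[]): ToolCallFact[] {
  // A Map keeps insertion order, so facts come out in call order.
  const byId = new Map<string, ToolCallFact>();
  for (const event of events) {
    if (event.type === "action.called") {
      byId.set(event.callId, {
        callId: event.callId, name: event.name, input: event.input, status: "pending",
      });
    } else {
      const fact = byId.get(event.callId);
      if (!fact) continue; // a result with no matching call has nothing to attach to
      fact.status = event.status;
      if (event.output !== undefined) fact.output = event.output;
    }
  }
  return Array.from(byId.values());
}

const facts = pairToolCalls([
  { type: "action.called", callId: "a", name: "get_weather", input: { city: "Oslo" } },
  { type: "action.result", callId: "a", output: { tempC: 4 }, status: "completed" },
  { type: "action.called", callId: "b", name: "send_email", input: {} },
]);
console.log(facts.map((f) => `${f.name}:${f.status}`).join(", "));
// get_weather:completed, send_email:pending
```

A result whose callId matches no earlier call is silently dropped in this sketch, which is the kind of mismatch that would make a tool assertion fail unexpectedly.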

Referencing an agent from an experiment

Import your adapter from an experiment file so the run configuration is checked in and reviewable.
// experiments/local.ts
import { defineExperiment } from "niceeval";
import classifyAgent from "../agents/classify.js";

export default defineExperiment({
  agent: classifyAgent,
  runs: 1,
});

Switching between local and production with environment variables

Because the adapter reads process.env internally, you can point it at any environment without touching config files. Pass the variable inline or export it before running:
# Evaluate against your local dev server
AGENT_URL=http://localhost:3000 npx niceeval exp local

# Evaluate against production
AGENT_URL=https://prod.example.com npx niceeval exp prod
You can also create two experiment files — one for local, one for production — and switch between them with npx niceeval exp local vs npx niceeval exp prod.
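
Because the weather-bot adapter interpolates process.env.AGENT_URL directly, an unset variable would produce a request to "undefined/chat". A small guard can fail fast with a clearer message. The helper below is a sketch, not part of niceeval, and the name resolveBaseUrl is hypothetical. It takes the environment as a plain object so the example runs on its own.

```typescript
// Reads a base URL from an environment-like object and validates it.
function resolveBaseUrl(name: string, env: Record<string, string | undefined>): string {
  const raw = env[name]?.trim();
  if (!raw) {
    throw new Error(`${name} is not set; export it before running the experiment`);
  }
  // new URL throws a TypeError on malformed input, which surfaces typos early.
  const url = new URL(raw);
  // Drop trailing slashes so `${base}/chat` never contains a double slash.
  return url.toString().replace(/\/+$/, "");
}

const localBase = resolveBaseUrl("AGENT_URL", { AGENT_URL: "http://localhost:3000/" });
console.log(`${localBase}/chat`); // http://localhost:3000/chat
```

Inside an adapter the call would look like resolveBaseUrl("AGENT_URL", process.env), made once at the top of send or at module load.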

Standard StreamEvent types at a glance

| Type | When to emit |
| --- | --- |
| message | Any text the agent produces (assistant reply or user echo) |
| action.called | A tool or skill call is initiated |
| action.result | The result of a tool or skill call (pair with action.called via callId) |
| subagent.called | The agent delegates work to a child agent |
| subagent.completed | The child agent finishes |
| input.requested | The agent paused and is waiting for human input (HITL); triggers t.parked() |
| thinking | Internal reasoning text (e.g., extended thinking from Claude) |
| error | A non-fatal error the agent reported |

Skill loads (load_skill) are modeled as action.called events with name: "load_skill". The t.loadedSkill() assertion is therefore just syntactic sugar over t.calledTool("load_skill", …) — no special event type is needed.