Connect a remote or in-process agent to niceeval

Every system under test in niceeval is an agent — a named object that receives a prompt, drives your code or service, and returns a structured result. You connect your own logic by writing an adapter with defineAgent. The adapter owns all the details of how to call your system: in-process function references, HTTP endpoints, authentication headers, and message formats are entirely private to the adapter. Experiments reference agents directly rather than passing URLs on the CLI, because there is no universal protocol that every agent speaks.

When to use `defineAgent`

In-process function

Call your own function directly inside send. Zero network overhead — the fastest possible eval loop, ideal for unit-level semantic tests in CI.

Remote HTTP service

Issue a fetch inside send using whatever protocol your service speaks. The URL, auth, and request shape are your business; niceeval never sees them.

The `defineAgent` shape

defineAgent accepts a plain object with three fields. The send function is the only place you need to write any logic.

import { defineAgent } from "niceeval/adapter";

defineAgent({
  name: string;                          // used for reports and grouping
  capabilities: AgentCapabilities;       // declares what this agent can do
  async send(input: TurnInput, ctx: AgentContext): Promise<Turn>;
});

`AgentCapabilities`

Declaring capabilities lets niceeval shape the t context that eval authors receive. If a capability is absent, the corresponding assertions are not available at the type level — you get a compile-time error rather than a runtime surprise.

interface AgentCapabilities {
  conversation?: boolean;       // supports multiple send() calls per eval (t.reply, t.newSession)
  toolObservability?: boolean;  // produces action.* events  →  t.calledTool, t.toolOrder, etc.
  sandbox?: boolean;            // sandbox agents only — do not set it on a remote agent
}

The sandbox capability — which enables t.sandbox.diff, t.fileChanged, and related assertions — is only meaningful for sandbox agents that run in an isolated filesystem. Remote and in-process agents should declare only conversation and toolObservability.

`AgentContext`

The runner passes ctx into every send call. Use ctx.signal to respect cancellation, ctx.model to forward the experiment’s model tier to your agent, and ctx.flags to read feature flags defined by the experiment.

interface AgentContext {
  readonly signal: AbortSignal;
  readonly model?: string;                             // set by the experiment; omit → agent's default
  readonly flags: Readonly<Record<string, unknown>>;  // experiment feature flags, forwarded to agent
  readonly sandbox?: Sandbox;                         // sandbox agents only
  readonly session: { id?: string; readonly isNew: boolean };
  log(msg: string): void;
}
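
As a rough illustration of how these fields might be consumed, the sketch below builds a request payload from a context object. `AgentContextLike` is a trimmed local copy of the interface above. `buildPayload`, the `Payload` shape, the `verbose` flag, and the `"default-model"` fallback are hypothetical names chosen for this example and are not part of niceeval.

```typescript
// Trimmed local copy of the AgentContext fields this sketch uses.
interface AgentContextLike {
  readonly signal: AbortSignal;
  readonly model?: string;
  readonly flags: Readonly<Record<string, unknown>>;
}

// Hypothetical payload for an imaginary service; adjust to your own API.
interface Payload {
  message: string;
  model: string;
  verbose: boolean;
}

// Builds the request payload, refusing to start work if the run was cancelled.
function buildPayload(text: string, ctx: AgentContextLike): Payload {
  if (ctx.signal.aborted) {
    throw new Error("run was cancelled before the request was sent");
  }
  return {
    message: text,
    // ctx.model is optional: fall back to the agent's own default when absent.
    model: ctx.model ?? "default-model",
    // Flags are typed as unknown, so narrow them before use.
    verbose: ctx.flags["verbose"] === true,
  };
}
```

Inside a real `send`, you would pass `ctx` straight through and also hand `ctx.signal` to whatever performs the work, so cancellation reaches an in-flight request.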

`Turn` — what `send` returns

interface Turn {
  readonly events: StreamEvent[];                        // ★ the normalized event stream
  readonly data?: unknown;                               // structured output for outputEquals / outputMatches
  readonly status: "completed" | "failed" | "waiting";  // "waiting" means parked at a HITL prompt
  readonly usage?: Usage;                                // optional token usage
}

The events array is the heart of every Turn. All assertions — t.calledTool, t.messageIncludes, t.eventOrder, and the rest — derive from this single stream. Populating it correctly is the only real job of a remote adapter.
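
To make the idea of deriving assertions from one stream concrete, here is a minimal sketch of two checks written as plain functions over an event array. It illustrates the principle only and is not niceeval's implementation. The `StreamEventLike` type is a trimmed local copy of the event union, and the exact matching rules (for example, looking only at assistant messages) are assumptions.

```typescript
// Trimmed local copy of the event union; only the members this sketch inspects.
type StreamEventLike =
  | { type: "message"; role: "assistant" | "user"; text: string }
  | { type: "action.called"; callId: string; name: string; input: unknown }
  | { type: "thinking"; text: string };

// True if any assistant message contains `needle`.
function messageIncludes(events: StreamEventLike[], needle: string): boolean {
  return events.some(
    (e) => e.type === "message" && e.role === "assistant" && e.text.includes(needle),
  );
}

// True if some action.called event names the given tool.
function calledTool(events: StreamEventLike[], name: string): boolean {
  return events.some((e) => e.type === "action.called" && e.name === name);
}
```

Because both checks read the same array, an adapter that omits an event silently makes the matching assertion fail, which is why the mapping step deserves care.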

In-process adapter example

Use an in-process adapter when your agent is a TypeScript function you can import directly. There is no network round-trip, and you get full type safety.

// agents/classify.ts
import { defineAgent } from "niceeval/adapter";
import { classifyIntent } from "../src/agent.js";

export default defineAgent({
  name: "classify",
  capabilities: {},
  async send(input, ctx) {
    const result = await classifyIntent(input.text, { signal: ctx.signal });

    return {
      events: [
        { type: "message", role: "assistant", text: result.label },
      ],
      data: result,          // available as turn.data in outputEquals / outputMatches
      status: "completed",
    };
  },
});

You don’t need to declare conversation or toolObservability if your function doesn’t support them. Omitting a capability simply means the corresponding t.* methods won’t appear in eval authors’ type signatures.

Remote HTTP adapter example

When your agent lives behind an HTTP endpoint, send is just a fetch. The URL comes from an environment variable so you can point the same adapter at local or production without changing any code.

// agents/weather-bot.ts
import { defineAgent } from "niceeval/adapter";

export default defineAgent({
  name: "weather-bot",
  capabilities: { conversation: true, toolObservability: true },
  async send(input, ctx) {
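    // AGENT_URL selects the deployment to evaluate (local, staging, production).
    const r = await fetch(`${process.env.AGENT_URL}/chat`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ message: input.text }),
      signal: ctx.signal,   // lets the runner cancel an in-flight request
    });
    // Fail loudly on HTTP errors rather than parsing an error page as JSON.
    if (!r.ok) throw new Error(`agent responded with HTTP ${r.status}`);
    const body = await r.json();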

    return {
      events: toStreamEvents(body),   // map your response shape → StreamEvent[]
      data: body.output,
      status: "completed",
    };
  },
});
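
The adapter above declares `conversation: true`, so follow-up turns need some way to reach the same thread on the service. The sketch below shows one possible way to thread a conversation id through `ctx.session`. It rests on assumptions that the text above does not confirm: that the adapter may write `session.id` and will see the same value on the next turn, and that the service accepts and returns a `conversationId` field. `sendWithSession`, `ChatResponse`, and `Post` are hypothetical names, and the transport is injected so the sketch runs without a network.

```typescript
// Trimmed local copy of the session field from AgentContext.
interface SessionLike {
  id?: string;
  readonly isNew: boolean;
}

// Hypothetical response shape of an imaginary chat service.
interface ChatResponse {
  conversationId: string;
  reply: string;
}

// Injected transport; in a real adapter this would wrap the fetch call shown above.
type Post = (body: { message: string; conversationId?: string }) => Promise<ChatResponse>;

async function sendWithSession(
  text: string,
  session: SessionLike,
  post: Post,
): Promise<string> {
  // On a new session, send no id so the service starts a fresh thread.
  const conversationId = session.isNew ? undefined : session.id;
  const res = await post({ message: text, conversationId });
  // ASSUMPTION: the adapter records the service's id here for later turns.
  session.id = res.conversationId;
  return res.reply;
}
```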

`toStreamEvents` — mapping your response to the standard stream

toStreamEvents is a small mapping function you write. Its job is to translate whatever your service returns into the standard StreamEvent[] vocabulary that niceeval understands. Here is what the standard types look like:

type StreamEvent =
  | { type: "message"; role: "assistant" | "user"; text: string }
  | { type: "action.called"; callId: string; name: string; input: JsonValue }
  | { type: "action.result"; callId: string; output?: JsonValue;
      status: "completed" | "failed" | "rejected" }
  | { type: "subagent.called"; callId: string; name: string; remoteUrl?: string }
  | { type: "subagent.completed"; callId: string; output?: JsonValue;
      status: "completed" | "failed" }
  | { type: "input.requested"; request: InputRequest }  // agent paused at a HITL prompt
  | { type: "thinking"; text: string }
  | { type: "error"; message: string };

A minimal mapper for a service that returns { reply: string, tools: ToolCall[] } might look like this:

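// Shape of one tool call in this example service's response.
interface ToolCall {
  id: string;
  name: string;
  input: JsonValue;
  result?: JsonValue;
}

function toStreamEvents(body: { reply: string; tools?: ToolCall[] }): StreamEvent[] {
  const events: StreamEvent[] = [];

  // Emit each call and its result back to back, linked by the same callId.
  for (const call of body.tools ?? []) {
    events.push({ type: "action.called", callId: call.id, name: call.name, input: call.input });
    events.push({ type: "action.result", callId: call.id, output: call.result, status: "completed" });
  }

  // The final reply comes after the tool activity that produced it.
  events.push({ type: "message", role: "assistant", text: body.reply });
  return events;
}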

Emit action.called and action.result pairs with matching callId values. niceeval’s deriveRunFacts stitches them into structured toolCalls facts, which power t.calledTool, t.toolOrder, t.noFailedActions, and more.
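
The pairing described above can be sketched as a single pass that indexes calls by `callId` and attaches results as they arrive. This is an illustration of the stitching idea, not the actual `deriveRunFacts` code. `pairToolCalls`, the `ToolCallFact` shape, and the `"pending"` status for unanswered calls are hypothetical.

```typescript
type JsonValue = string | number | boolean | null | JsonValue[] | { [k: string]: JsonValue };

// Local copy of the two action events from the StreamEvent union.
type ActionEvent =
  | { type: "action.called"; callId: string; name: string; input: JsonValue }
  | { type: "action.result"; callId: string; output?: JsonValue;
      status: "completed" | "failed" | "rejected" };

// Hypothetical fact shape; niceeval's real toolCalls facts may differ.
interface ToolCallFact {
  callId: string;
  name: string;
  input: JsonValue;
  output?: JsonValue;
  status: "completed" | "failed" | "rejected" | "pending";
}

function pairToolCalls(events: ActionEvent[]): ToolCallFact[] {
  const byId = new Map<string, ToolCallFact>();
  for (const e of events) {
    if (e.type === "action.called") {
      // A call starts out pending until its result shows up.
      byId.set(e.callId, { callId: e.callId, name: e.name, input: e.input, status: "pending" });
    } else {
      // Results whose callId matches no earlier call are ignored.
      const fact = byId.get(e.callId);
      if (fact) {
        fact.output = e.output;
        fact.status = e.status;
      }
    }
  }
  // Map iteration preserves insertion order, so facts come back in call order.
  return Array.from(byId.values());
}
```

A mismatched or missing `callId` leaves a call without its result in this sketch, which shows why the two events must share the same id.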

Referencing an agent from an experiment

Import your adapter from an experiment file so the run configuration is checked in and reviewable.

// experiments/local.ts
import { defineExperiment } from "niceeval";
import classifyAgent from "../agents/classify.js";   // agents/ sits next to experiments/

export default defineExperiment({
  agent: classifyAgent,
  runs: 1,
});

Switching between local and production with environment variables

Because the adapter reads process.env internally, you can point it at any environment without touching config files. Pass the variable inline or export it before running:

# Evaluate against your local dev server (AGENT_URL is the variable the adapter reads)
AGENT_URL=http://localhost:3000 npx niceeval exp local

# Evaluate against production
AGENT_URL=https://prod.example.com npx niceeval exp prod

You can also create two experiment files — one for local, one for production — and switch between them with npx niceeval exp local vs npx niceeval exp prod.
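
Since a missing or mistyped variable would otherwise surface only as a confusing fetch error, an adapter might validate the URL up front. The helper below is a hypothetical sketch, not part of niceeval. It takes the environment as a parameter (you would pass `process.env`) and defaults to the `AGENT_URL` name used in the adapter example.

```typescript
// Hypothetical helper: resolves the agent base URL from an environment map.
// `env` is a parameter rather than a direct process.env read so it is easy to test.
function resolveAgentUrl(
  env: Record<string, string | undefined>,
  name: string = "AGENT_URL",
): string {
  const raw = env[name];
  if (!raw) {
    throw new Error(`${name} is not set; export it or pass it inline before running`);
  }
  // new URL throws on malformed input, which catches typos early.
  const url = new URL(raw);
  // Drop a trailing slash so `${base}/chat` never produces a double slash.
  return url.href.replace(/\/$/, "");
}
```

Inside `send`, the fetch target would then be built from `resolveAgentUrl(process.env)` followed by the route, for example `/chat`.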

Standard `StreamEvent` types at a glance

| Type | When to emit |
| --- | --- |
| `message` | Any text the agent produces (assistant reply or user echo) |
| `action.called` | A tool or skill call is initiated |
| `action.result` | The result of a tool or skill call (pair with `action.called` via `callId`) |
| `subagent.called` | The agent delegates work to a child agent |
| `subagent.completed` | The child agent finishes |
| `input.requested` | The agent paused and is waiting for human input (HITL); triggers `t.parked()` |
| `thinking` | Internal reasoning text (e.g., extended thinking from Claude) |
| `error` | A non-fatal error the agent reported |

Skill loads (load_skill) are modeled as action.called events with name: "load_skill". The t.loadedSkill() assertion is therefore just syntactic sugar over t.calledTool("load_skill", …) — no special event type is needed.
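
The equivalence can be pictured as a filter over `action.called` events. The sketch below is illustrative only. The text above does not specify what the `input` of a `load_skill` call looks like, so the `skill` field used here is an assumption, and `loadedSkill` is a local stand-in for the real assertion.

```typescript
// Trimmed local copies of the two event kinds this sketch needs.
type CalledEvent = { type: "action.called"; callId: string; name: string; input: unknown };
type MessageEvent = { type: "message"; role: "assistant" | "user"; text: string };

// ASSUMPTION: the skill name travels in input.skill; the real input shape may differ.
function loadedSkill(events: Array<CalledEvent | MessageEvent>, skill: string): boolean {
  return events.some((e) => {
    // Only load_skill tool calls are candidates.
    if (e.type !== "action.called" || e.name !== "load_skill") return false;
    const input = e.input as { skill?: unknown } | null;
    return typeof input === "object" && input !== null && input.skill === skill;
  });
}
```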
