Eval Your AI Agent Application

This example shows how to use niceeval to evaluate your own AI agent application. See the full example code: https://github.com/CorrectRoadH/niceeval/tree/main/examples/zh/ai-sdk

The system under test is a general-purpose AI assistant HTTP service built with an AI SDK tool loop. It handles messages, calls tools (weather, calculator, search), understands images, and uses Langfuse for its own observability. The application does not need a sandbox during testing: niceeval reaches it directly over HTTP through an adapter.

Directory structure

examples/zh/ai-sdk/
  ai-sdk-agent/            # web agent under test (POST /api/turn, 3 tools + image understanding)
  agents/web-agent.ts      # niceeval adapter: factory webAgent({ baseUrl })
  evals/                   # conversational evals
    weather-tool.eval.ts        # ask about weather → calls get_weather
    image-understanding.eval.ts # image understanding
  experiments/
    compare-models/        # experiment group: one file per model
      gpt-4o-mini.ts
      gpt-4o.ts
  niceeval.config.ts       # register adapter, judge, concurrency

Define the adapter

The adapter tells niceeval how to send requests to the AI agent and how to read its responses as a standard event stream. It is a factory function: baseUrl (where the service runs) is passed in from outside, so the adapter never hardcodes the address or reads it from the environment.

// agents/web-agent.ts
import { defineAgent } from "niceeval/adapter";
import type { Agent } from "niceeval/adapter";
import type { StreamEvent } from "niceeval";
import type { AgentEvent, AgentResponse } from "../ai-sdk-agent/src/protocol.ts";

export function webAgent(opts: { baseUrl: string }): Agent {
  const baseUrl = opts.baseUrl.replace(/\/$/, "");
  return defineAgent({
    name: "web-agent",
    capabilities: { conversation: true, toolObservability: true, tracing: true },
    async send(input, ctx) {
      const response = await fetch(`${baseUrl}/api/turn`, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify({
          sessionId: ctx.session.id,
          message: input.text,
          model: ctx.model,
          otelEndpoint: ctx.telemetry?.endpoint, // dual observability: let the app send this turn's spans back to niceeval
        }),
        signal: ctx.signal,
      });
      // Shared contract within the same workspace — read as AgentResponse directly, no need to validate as unknown.
      const body = (await response.json()) as AgentResponse;
      ctx.session.id = body.sessionId;
      return {
        events: body.events.map(toStreamEvent),
        data: body.data,
        status: "completed" as const,
      };
    },
  });
}

function toStreamEvent(event: AgentEvent): StreamEvent {
  if (event.type === "action.called") return { ...event, tool: "unknown" };
  return event;
}
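
The request half of the adapter can be factored into a pure function, which makes it easy to check without a running service. The following is a minimal sketch that assumes only the request fields shown in the adapter above. `TurnRequest`, `buildTurnRequest`, and the sample values are illustrative names, not part of niceeval.

```typescript
// Hypothetical request shape, mirroring the body the adapter sends to POST /api/turn.
interface TurnRequest {
  sessionId?: string;
  message: string;
  model?: string;
  otelEndpoint?: string;
}

// Builds the URL and JSON body for one turn. Pure: no network access, no env reads.
function buildTurnRequest(
  baseUrl: string,
  turn: TurnRequest,
): { url: string; body: string } {
  // Strip a single trailing slash so "http://host/" and "http://host" yield the same URL.
  const root = baseUrl.replace(/\/$/, "");
  // JSON.stringify omits properties whose value is undefined, so optional fields
  // (for example otelEndpoint when telemetry is off) drop out of the payload.
  return { url: `${root}/api/turn`, body: JSON.stringify(turn) };
}

const request = buildTurnRequest("http://127.0.0.1:5188/", {
  sessionId: "s-1",
  message: "What's the weather like in Brooklyn today?",
  model: "gpt-4o",
  otelEndpoint: undefined,
});
console.log(request.url);
console.log(request.body);
```

Keeping URL and body construction separate from fetch is one possible design. It lets the trailing-slash handling and the optional telemetry field be verified in isolation.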

Multi-turn messages

Each t.send() call carries ctx.session.id, so consecutive turns continue the same session. After every turn, the adapter writes the sessionId returned by the service back to ctx.session.id. To split traffic by feature flag within an experiment, see Experiments.
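
The session round trip described above can be sketched without a network. This is an illustration under assumptions: `FakeTurnService`, `SessionContext`, and `sendTurn` are hypothetical stand-ins for the real service and for niceeval's ctx, and the id format is invented.

```typescript
// Hypothetical stand-in for the service: assigns a session id on the first turn
// and keeps per-session history, so a reused id continues the same conversation.
class FakeTurnService {
  private sessions = new Map<string, string[]>();
  private nextId = 1;

  turn(req: { sessionId?: string; message: string }): { sessionId: string; turnCount: number } {
    const sessionId = req.sessionId ?? `session-${this.nextId++}`;
    const messages = this.sessions.get(sessionId) ?? [];
    messages.push(req.message);
    this.sessions.set(sessionId, messages);
    return { sessionId, turnCount: messages.length };
  }
}

// Hypothetical per-conversation context, standing in for ctx.session.
interface SessionContext {
  session: { id?: string };
}

// Mirrors the adapter: send the current id (if any), then write the returned id back.
function sendTurn(service: FakeTurnService, ctx: SessionContext, message: string): number {
  const reply = service.turn({ sessionId: ctx.session.id, message });
  ctx.session.id = reply.sessionId;
  return reply.turnCount;
}

const service = new FakeTurnService();
const ctx: SessionContext = { session: {} };
const firstCount = sendTurn(service, ctx, "What's the weather like in Brooklyn today?");
const secondCount = sendTurn(service, ctx, "And tomorrow?");
console.log(ctx.session.id, firstCount, secondCount);
```

The first turn has no id, so the fake service creates one. The second turn reuses it and lands in the same history.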

Define evals

Each eval sends a message and asserts on the reply, tool calls, and image understanding. Deterministic assertions (calledTool, messageIncludes) run without an API key; open-ended scoring with a judge requires a key to be set.

// evals/weather-tool.eval.ts
import { defineEval } from "niceeval";

export default defineEval({
  description: "AI assistant: asking about weather calls get_weather",
  async test(t) {
    const turn = await t.send("What's the weather like in Brooklyn today?");
    turn.expectOk();
    await t.group("calls get_weather with correct city", () => {
      t.calledTool("get_weather", { input: { city: "Brooklyn" } });
      t.messageIncludes(/°[CF]|temperature|weather/i);
    });
    t.judge.autoevals.closedQA("Does the assistant answer based on the tool's returned weather data, not a hallucination?").atLeast(0.7);
  },
});
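
To make the deterministic assertions concrete, here is a rough sketch of the kind of check they could perform over a turn's events. It is not niceeval's implementation: the event shapes, the helper names `hasToolCall` and `anyMessageMatches`, and the partial matching of tool input are all assumptions made for illustration.

```typescript
// Hypothetical event shapes; the real StreamEvent type lives in niceeval and may differ.
type SketchEvent =
  | { type: "message"; text: string }
  | { type: "action.called"; tool: string; input: Record<string, unknown> };

// True if some tool call has the given name and matches every expected input key.
// Extra keys in the actual input are ignored (partial match, an assumption).
function hasToolCall(
  events: SketchEvent[],
  tool: string,
  expected: Record<string, unknown> = {},
): boolean {
  for (const e of events) {
    if (e.type !== "action.called" || e.tool !== tool) continue;
    const input = e.input;
    if (Object.entries(expected).every(([key, value]) => input[key] === value)) {
      return true;
    }
  }
  return false;
}

// True if any assistant message matches the pattern.
function anyMessageMatches(events: SketchEvent[], pattern: RegExp): boolean {
  return events.some((e) => e.type === "message" && pattern.test(e.text));
}

// Invented sample events for one turn.
const sampleEvents: SketchEvent[] = [
  { type: "action.called", tool: "get_weather", input: { city: "Brooklyn", unit: "F" } },
  { type: "message", text: "It is 72°F and sunny in Brooklyn." },
];

console.log(hasToolCall(sampleEvents, "get_weather", { city: "Brooklyn" })); // true
console.log(anyMessageMatches(sampleEvents, /°[CF]|temperature|weather/i)); // true
console.log(hasToolCall(sampleEvents, "get_weather", { city: "Queens" })); // false
```

Checks of this kind inspect only recorded events, which is why they need no API key. The judge step, by contrast, calls a model.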

For image understanding, put the image URL in the message text, because send carries text only. The assistant then uses its multimodal vision model to describe the image:

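// evals/image-understanding.eval.ts
import { defineEval } from "niceeval";
import { SAMPLE_IMAGE_URL } from "../ai-sdk-agent/src/assistant.ts";
export default defineEval({
  description: "AI assistant: describes the sample image",
  async test(t) {
    const turn = await t.send(`What's in this image? ${SAMPLE_IMAGE_URL}`);
    t.messageIncludes(/cat/i); // the sample image is a cat
  },
});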

Define experiments

One experiment file = one configuration (single model). For cross-model comparison, write multiple files in the same experiment group folder:

// experiments/compare-models/gpt-4o.ts
import { defineExperiment } from "niceeval";
import { webAgent } from "../../agents/web-agent.ts";

export default defineExperiment({
  description: "AI assistant: gpt-4o",
  agent: webAgent({ baseUrl: "http://127.0.0.1:5188" }),
  model: "gpt-4o", // single string; copy this file and change one line for each model
  runs: 3,
});
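
With runs: 3 in each file of the group, every model produces several results per eval, and one natural way to compare models is by pass rate. The sketch below only illustrates that arithmetic. `RunResult`, `passRateByModel`, and the sample results are invented for the example and do not reflect niceeval's actual report format.

```typescript
// Hypothetical result record: one entry per (model, run) of a single eval.
interface RunResult {
  model: string;
  passed: boolean;
}

// Groups results by model and returns the fraction of passing runs for each.
function passRateByModel(results: RunResult[]): Map<string, number> {
  const totals = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const t = totals.get(r.model) ?? { passed: 0, total: 0 };
    t.total += 1;
    if (r.passed) t.passed += 1;
    totals.set(r.model, t);
  }
  const rates = new Map<string, number>();
  totals.forEach((t, model) => rates.set(model, t.passed / t.total));
  return rates;
}

// Made-up outcomes for three runs per model, purely for illustration.
const results: RunResult[] = [
  { model: "gpt-4o", passed: true },
  { model: "gpt-4o", passed: true },
  { model: "gpt-4o", passed: true },
  { model: "gpt-4o-mini", passed: true },
  { model: "gpt-4o-mini", passed: false },
  { model: "gpt-4o-mini", passed: true },
];

const rates = passRateByModel(results);
rates.forEach((rate, model) => console.log(`${model}: ${(rate * 100).toFixed(0)}%`));
```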

Start evaluating

First start the service under test (defaults to mock mode — no API key required). This example is a standalone npm project where niceeval is a local dependency:

cd examples/zh/ai-sdk && pnpm install && pnpm dev

Open a second terminal and run evals:

cd examples/zh/ai-sdk
pnpm exec niceeval exp compare-models   # run the model comparison experiment
pnpm exec niceeval exp compare-models weather-tool   # run one eval inside that experiment

Next steps

Remote Agent — full reference for defineAgent.
Authoring Evals — single-turn, multi-turn, and dataset evals.
CI Integration — put agent regression tests in PRs.
