niceeval quickstart: run your first eval in 10 minutes
Install niceeval, scaffold your project, and run your first three evals — function, conversational, and coding-agent — in under 10 minutes.
This guide takes you from a blank project to three working evals: one that calls an in-process function, one that drives a conversational agent over HTTP, and one that drops a coding agent into a Docker sandbox to write real code. By the end you’ll have a runnable eval suite with CI-ready output.

If you already know what you want to evaluate, jump straight to the relevant example:
Eval a Claude Code / Codex plugin
For plugins, MCP servers, and project-level coding-agent extensions.
Eval a Claude Code / Codex Skill
Verify that a Skill is triggered, follows its prescribed flow, and actually improves task success rate.
Eval an AI agent application
For HTTP agents, AI SDK, LangGraph, or any custom agent service.
Install niceeval
Add niceeval as a dev dependency, then run the scaffold command:
```bash
npm install -D niceeval
npx niceeval init
```
npx niceeval init reads your project layout and generates everything you need to run your first eval immediately.
Explore the generated files
After init completes, your project contains several new entries, including niceeval.config.ts and the evals/ directory.
niceeval.config.ts is your central configuration: it sets the judge model, reporters, concurrency, timeout, and sandbox backend. evals/ is where all your eval files live; the file path automatically becomes the eval’s ID.
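To make the path-to-ID rule concrete, here is a small sketch of how such a mapping could work. It is an illustration, not niceeval's actual implementation. The function name `evalIdFromPath` is hypothetical, and the rule (drop the evals/ prefix and a trailing .eval.ts) is inferred from the commands in this guide, where evals/classify.eval.ts is run as `classify` and evals/fixtures/button is run as `fixtures/button`.

```typescript
// Hypothetical helper: NOT part of niceeval's API. It only illustrates the
// "file path becomes the eval ID" rule as inferred from this guide.
function evalIdFromPath(filePath: string): string {
  return filePath
    .replace(/\\/g, "/")            // normalise Windows path separators
    .replace(/^\.?\/?evals\//, "")  // drop the leading evals/ directory
    .replace(/\.eval\.ts$/, "")     // drop the eval file suffix, if present
    .replace(/\/+$/, "");           // drop a trailing slash on fixture directories
}

console.log(evalIdFromPath("evals/classify.eval.ts")); // classify
console.log(evalIdFromPath("evals/fixtures/button/")); // fixtures/button
```

If the real rule differs (for example, for other file extensions), treat the ID printed by niceeval itself as authoritative.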
You can safely delete or replace the generated example files once you’ve read through them. They’re there to illustrate the shape of each eval type, not to run against a real agent.
The fastest way to start is evaluating a TypeScript function that lives in your own codebase. You define an agent adapter that calls your function directly, then write an eval that checks the output. No network, no Docker.

First, create the agent adapter that wraps your function:
```typescript
// agents/classify.ts
import { defineAgent } from "niceeval/adapter";
import { classifyIntent } from "../src/agent.js"; // your own code

export default defineAgent({
  name: "classify",
  async send(input) {
    return { data: await classifyIntent(input.text), status: "completed" };
  },
});
```
Then write the eval file:
```typescript
// evals/classify.eval.ts
import { defineEval } from "niceeval";
import { equals } from "niceeval/expect";

export default defineEval({
  description: "Intent classification: refund request",
  async test(t) {
    const turn = await t.send("I'd like to return my order and get a refund.");
    t.check(turn.data, equals({ intent: "refund" }));
  },
});
```
Add the classify agent to an experiment file such as experiments/local.ts, then run:
```bash
npx niceeval exp local classify
```
In-process evals are ideal for semantic regression testing. Because they call your code directly, they run in milliseconds and slot naturally into a standard CI pipeline alongside unit tests.
To eval an agent that lives behind an HTTP endpoint, you write an adapter that handles the request/response cycle. The URL and any credentials are the adapter’s concern; niceeval has no --url flag and imposes no protocol requirements.

Define the remote agent adapter:
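A minimal sketch of what such an adapter could look like follows. Only the `{ name, send }` shape and the `{ data, status: "completed" }` return value are taken from the in-process example in this guide. Everything else is an assumption for illustration: the `/chat` path, the `{ message }` request body, the bearer-token header, and the helper names `buildRequest` and `makeRemoteAgent` are all hypothetical. The fetch function is injected so the sketch stays independent of any particular runtime.

```typescript
// Types local to this sketch; they mirror the in-process example, not niceeval's own typings.
type AgentInput = { text: string };
type Turn = { data: unknown; status: "completed" };
type RequestInit = { method: string; headers: Record<string, string>; body: string };
type FetchLike = (
  url: string,
  init: RequestInit,
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Build the HTTP request for one turn. Endpoint path and body shape are assumptions.
function buildRequest(baseUrl: string, token: string | undefined, input: AgentInput) {
  const headers: Record<string, string> = { "content-type": "application/json" };
  if (token) headers["authorization"] = `Bearer ${token}`;
  const init: RequestInit = {
    method: "POST",
    headers,
    body: JSON.stringify({ message: input.text }),
  };
  return { url: `${baseUrl.replace(/\/+$/, "")}/chat`, init };
}

// Returns an object with the same { name, send } shape passed to defineAgent above.
function makeRemoteAgent(baseUrl: string, token: string | undefined, fetchImpl: FetchLike) {
  return {
    name: "remote",
    async send(input: AgentInput): Promise<Turn> {
      const { url, init } = buildRequest(baseUrl, token, input);
      const res = await fetchImpl(url, init);
      if (!res.ok) throw new Error(`remote agent returned HTTP ${res.status}`);
      return { data: await res.json(), status: "completed" };
    },
  };
}

// In agents/remote.ts you would then wrap it, for example (env var names are hypothetical):
//   export default defineAgent(
//     makeRemoteAgent(process.env.AGENT_URL!, process.env.AGENT_TOKEN, fetch),
//   );
```

Keeping the URL and token in the adapter, rather than on the command line, matches the point above that credentials are the adapter's concern.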
For coding agents like Claude Code, Codex, or bub that need to write real files, niceeval uses a fixture — a small directory containing a prompt, hidden validation tests, and a minimal project to work in.
```md
<!-- evals/fixtures/button/PROMPT.md -->
Using the project's existing styling system, export a Button component
from src/components/Button.tsx that accepts `label` and `onClick` props
and implements a hover state.
```
The fixture works as follows: PROMPT.md is sent to the agent as its task. EVAL.ts is a Vitest test file that runs after the agent finishes; the agent never sees it. package.json defines the minimal project environment.

Run the coding-agent eval with Docker as the sandbox:
```bash
export ANTHROPIC_API_KEY=sk-ant-...
npx niceeval exp local fixtures/button --sandbox docker
```
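To give a feel for what the hidden EVAL.ts in the fixture might verify, here is a hedged sketch of the check logic for the Button prompt. It is not the fixture shipped by niceeval: the function name `checkButtonSource` and the specific text-based checks are assumptions, and a real validation file would more likely render the component. The logic is kept as a plain function, and the Vitest wiring is shown in comments.

```typescript
// Hypothetical check logic for evals/fixtures/button/EVAL.ts.
// Returns a list of problems; an empty list means the source passes.
function checkButtonSource(source: string): string[] {
  const problems: string[] = [];

  // The prompt asks for an exported Button component.
  if (!/export\s+(default\s+)?(function|const)\s+Button\b/.test(source)) {
    problems.push("Button is not exported");
  }

  // The prompt asks for `label` and `onClick` props.
  for (const prop of ["label", "onClick"]) {
    if (!new RegExp(`\\b${prop}\\b`).test(source)) {
      problems.push(`missing prop: ${prop}`);
    }
  }

  // Crude proxy for "implements a hover state".
  if (!/hover/i.test(source)) problems.push("no hover state found");

  return problems;
}

// Inside a Vitest file this could be wrapped as:
//   import { readFileSync } from "node:fs";
//   import { expect, test } from "vitest";
//   test("Button meets the prompt", () => {
//     const source = readFileSync("src/components/Button.tsx", "utf8");
//     expect(checkButtonSource(source)).toEqual([]);
//   });
```

Because the agent never sees this file, the checks can be specific about the expected result without leaking hints into the prompt.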
To measure pass rate over multiple attempts with early stopping on first success:
```bash
npx niceeval exp local fixtures/button --runs 10 --early-exit
```
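The interaction between a run count and early exit can be pictured with a short sketch. This is an illustration of the general idea only, not niceeval's implementation: `runAttempts` is a hypothetical name, and attempts are modelled as synchronous callbacks for simplicity, whereas real agent runs are asynchronous.

```typescript
// Hypothetical model of "--runs N --early-exit": run up to `runs` attempts,
// and when earlyExit is set, stop as soon as one attempt succeeds.
function runAttempts(
  attempt: (index: number) => boolean,
  runs: number,
  earlyExit: boolean,
): { attempts: number; passes: number } {
  let attempts = 0;
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    attempts++;
    if (attempt(i)) {
      passes++;
      if (earlyExit) break; // first success is enough
    }
  }
  return { attempts, passes };
}

// Example: an attempt that first succeeds on its third try.
const flaky = (index: number) => index >= 2;
console.log(runAttempts(flaky, 10, true));  // { attempts: 3, passes: 1 }
console.log(runAttempts(flaky, 10, false)); // { attempts: 10, passes: 8 }
```

With early exit the sandbox time for the remaining attempts is saved, at the cost of knowing only that the task can pass, not how often it does.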
Sandbox evals require Docker to be running on your machine. If Docker is not available, niceeval will stop with a clear error rather than silently falling back to a different backend.
The --strict flag promotes soft assertion failures (such as LLM judge scores below threshold) to hard failures. Any failed eval causes a non-zero exit code, which marks the workflow step as failed. The --junit flag writes a JUnit XML report that most CI systems can parse natively for test result visualization.
Add OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} to the env block if you’re using Codex as your coding agent.
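The effect of --strict on the process exit code, as described above, can be sketched as a small function. The `Outcome` type, the `exitCodeFor` name, and the choice of `1` as the non-zero code are assumptions for illustration; the guide only states that soft failures are promoted to hard failures and that a failed eval yields a non-zero exit code.

```typescript
// Hypothetical model of eval outcomes; not niceeval's actual types.
type Outcome = "pass" | "soft-fail" | "hard-fail";

// Hard failures always fail the run; soft failures fail it only in strict mode.
function exitCodeFor(outcomes: Outcome[], strict: boolean): number {
  const failed = outcomes.some(
    (o) => o === "hard-fail" || (strict && o === "soft-fail"),
  );
  return failed ? 1 : 0;
}

console.log(exitCodeFor(["pass", "soft-fail"], false)); // 0
console.log(exitCodeFor(["pass", "soft-fail"], true));  // 1
```

In CI this is why a judge score below threshold can leave a step green without --strict and turn it red with it.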