defineEval API.
What you can evaluate
niceeval covers three categories of agent under test, all through a unified interface:
Coding agents
Drop a CLI agent — Claude Code, Codex, bub, or any compatible tool — into an isolated Docker sandbox, give it a task, and verify the result with real tests and file assertions.
HTTP services & deployed agents
Point niceeval at any running HTTP endpoint or deployed agent. Assert on replies, tool calls, and structured output without touching the deployment.
In-process functions
Call your own TypeScript functions directly inside the eval process. Treat evals as semantic unit tests and run them in CI with zero network overhead.
Why “fast”
The name captures three distinct kinds of speed that matter when you’re iterating on agent behavior.
Fast to author. Each eval lives in a single file, and its ID is derived automatically from the file path: evals/weather/brooklyn.eval.ts becomes weather/brooklyn. You write assertions inline in a linear async test(t) function with no callbacks and no boilerplate. A one-line .map fans a dataset out into dozens of cases.
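The path-to-ID rule above can be pictured with a small standalone function. This is only a sketch of the mapping described in the text, not niceeval's actual implementation: the name `evalIdFromPath` and the `root` parameter are hypothetical, and it assumes the ID is simply the path relative to `evals/` with the `.eval.ts` suffix removed.

```typescript
// Hypothetical helper illustrating the documented mapping; not part of niceeval.
function evalIdFromPath(filePath: string, root: string = "evals"): string {
  // Normalize Windows separators so the ID is the same on every platform.
  const normalized = filePath.replace(/\\/g, "/");
  const prefix = root.endsWith("/") ? root : root + "/";
  // Strip the root directory if the path starts with it.
  const relative = normalized.startsWith(prefix)
    ? normalized.slice(prefix.length)
    : normalized;
  // Drop the `.eval.ts` suffix to get the bare ID.
  return relative.replace(/\.eval\.ts$/, "");
}

console.log(evalIdFromPath("evals/weather/brooklyn.eval.ts")); // weather/brooklyn
```

Under this assumption, moving or renaming a file changes its ID, so the directory layout doubles as the naming scheme.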
Fast to run. The runner uses bounded concurrency so evals execute in parallel without overwhelming your agent. A fingerprint-based result cache skips cases that already passed. Sandboxes can be reused and pre-warmed between runs. The earlyExit flag stops retrying a task the moment one attempt succeeds.
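To make bounded concurrency and early exit concrete, here is a minimal sketch of both ideas in plain TypeScript. It is an illustration under stated assumptions rather than niceeval's runner: `runBounded`, `runAttempts`, and the `Attempt` type are hypothetical names, and the real scheduler may be built quite differently.

```typescript
// Hypothetical types and helpers for illustration; not niceeval's API.
type Attempt = () => Promise<boolean>;

// Run up to `maxAttempts` attempts of one task. With `earlyExit` set,
// stop as soon as one attempt passes instead of using every retry.
async function runAttempts(
  attempt: Attempt,
  maxAttempts: number,
  earlyExit: boolean
): Promise<boolean[]> {
  const results: boolean[] = [];
  for (let i = 0; i < maxAttempts; i++) {
    const passed = await attempt();
    results.push(passed);
    if (passed && earlyExit) break;
  }
  return results;
}

// Run all tasks with at most `limit` of them in flight at once.
// Results are returned in the same order as the input tasks.
async function runBounded<T>(
  tasks: Array<() => Promise<T>>,
  limit: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  async function worker(): Promise<void> {
    // Each worker claims the next unstarted task until none remain.
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }
  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

async function main(): Promise<void> {
  let calls = 0;
  const flaky: Attempt = async () => ++calls >= 2; // fails once, then passes
  console.log(await runAttempts(flaky, 5, true)); // [ false, true ]
}
main();
```

The worker-pool shape keeps the limit fixed however many evals are queued, which is one common way to avoid overwhelming the agent under test.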
Fast to read. Console output streams in real time so you see failures as they happen. Every run produces structured artifacts — an event stream, full transcript, file diffs, and assertion results — in a .niceeval/<timestamp>/ directory. A unified trace makes it easy to reconstruct exactly what the agent did.
How niceeval is structured
At a high level, your evals/ directory is the input and .niceeval/ artifacts are the output. Everything in between is owned by three collaborating pieces:
- niceeval core owns everything that is the same regardless of what you’re evaluating: eval discovery, assertion collection, scoring, concurrency scheduling, caching, reporting, and artifact persistence.
- Agent adapters are the open boundary between core and your system under test. Official adapters are included for Claude Code, Codex, and bub; you write your own adapter for any other agent or service.
- Sandbox owns the details of running coding agents in isolation — Docker by default, with support for third-party sandbox providers.
niceeval never hardcodes agent-specific logic in its core. It dispatches entirely through the adapter interface, which means adding a new agent type never requires changes to the runner or scorers.
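The adapter boundary described above can be sketched as a small interface plus a registry that dispatches by name. Everything here is hypothetical: `AgentAdapter`, `AdapterRegistry`, `dispatch`, and the `echo` adapter are illustrative names, and niceeval's real adapter interface may look different. The point is only that the dispatching code never branches on a specific agent.

```typescript
// Hypothetical adapter shape for illustration; not niceeval's actual interface.
interface AgentAdapter {
  name: string;
  run(task: string): Promise<string>;
}

class AdapterRegistry {
  private adapters = new Map<string, AgentAdapter>();

  register(adapter: AgentAdapter): void {
    this.adapters.set(adapter.name, adapter);
  }

  // The caller only knows the adapter's name; no agent-specific logic lives here.
  async dispatch(name: string, task: string): Promise<string> {
    const adapter = this.adapters.get(name);
    if (!adapter) {
      throw new Error(`No adapter registered for "${name}"`);
    }
    return adapter.run(task);
  }
}

// A toy adapter standing in for a real agent or service.
const echoAdapter: AgentAdapter = {
  name: "echo",
  run: async (task) => `echo: ${task}`,
};

const registry = new AdapterRegistry();
registry.register(echoAdapter);
registry.dispatch("echo", "list files").then((reply) => console.log(reply)); // echo: list files
```

In a design like this, supporting a new agent means registering one more adapter; the registry and anything built on top of it stay unchanged.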
Two integration modes
niceeval supports two top-level integration modes depending on whether the agent under test needs an isolated workspace.
Sandbox mode is for coding agents like Codex and Claude Code that must operate on a real filesystem.
What comes next
The Quickstart walks you through installing niceeval, scaffolding your project with npx niceeval init, and running all three eval types (function, conversational, and coding-agent) in under ten minutes.
If you want to go deeper right away, the Installation page covers prerequisites, configuration options, and environment variables in full detail.