niceeval: TypeScript eval framework for AI agents and LLMs

niceeval is a lightweight TypeScript library that brings structured, repeatable evaluation to AI agents. Instead of manually poking at your agent and hoping for the best, you write declarative eval files that describe what a good result looks like — and niceeval takes care of running them, scoring the output, and producing readable reports. Whether you’re shipping a Claude Code plugin, a customer-facing HTTP service, or an internal function wrapped in an LLM call, niceeval handles all three with the same defineEval API.

What you can evaluate

niceeval covers three categories of agent under test, all through a unified interface:

Coding agents

Drop a CLI agent — Claude Code, Codex, bub, or any compatible tool — into an isolated Docker sandbox, give it a task, and verify the result with real tests and file assertions.

HTTP services & deployed agents

Point niceeval at any running HTTP endpoint or deployed agent. Assert on replies, tool calls, and structured output without touching the deployment.

In-process functions

Call your own TypeScript functions directly inside the eval process. Treat evals as semantic unit tests and run them in CI with zero network overhead.
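
As a rough illustration of the in-process case, the sketch below shows what an eval written against a defineEval-style API might look like. This page does not show niceeval's real signatures, so everything here is an assumption: the local defineEval stand-in, the EvalContext shape with its check helper, the summarizeForecast function under test, and the runOnce driver are all hypothetical and exist only so the example compiles on its own.

```typescript
// Local stand-ins so the sketch is self-contained. In a real project these
// would come from the niceeval package; the shapes below are assumptions.
interface EvalContext {
  // Hypothetical assertion helper: records a named pass/fail result.
  check(name: string, passed: boolean): void;
}

interface EvalDefinition {
  description: string;
  test(t: EvalContext): Promise<void>;
}

// Stand-in for the library's defineEval: here it simply returns the definition.
function defineEval(def: EvalDefinition): EvalDefinition {
  return def;
}

// Hypothetical function under test; in practice this might wrap an LLM call.
async function summarizeForecast(city: string): Promise<string> {
  return `Forecast for ${city}: sunny, 72F`;
}

// What an in-process eval might look like: a linear async test(t) body
// with inline assertions.
const brooklynEval = defineEval({
  description: "forecast mentions the requested city",
  async test(t) {
    const reply = await summarizeForecast("Brooklyn");
    t.check("mentions city", reply.includes("Brooklyn"));
    t.check("is non-empty", reply.length > 0);
  },
});

// Tiny driver standing in for the runner: collects the assertion results.
async function runOnce(
  def: EvalDefinition,
): Promise<{ name: string; passed: boolean }[]> {
  const results: { name: string; passed: boolean }[] = [];
  await def.test({
    check: (name, passed) => {
      results.push({ name, passed });
    },
  });
  return results;
}
```

Calling runOnce(brooklynEval) resolves to the list of recorded assertions, which is the kind of data a scorer could then turn into a pass or fail.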

Why “fast”

“Fast” covers three distinct kinds of speed that matter when you’re iterating on agent behavior.

Fast to author. Each eval lives in a single file, and its ID is derived automatically from the file path: evals/weather/brooklyn.eval.ts becomes weather/brooklyn. You write assertions inline in a linear async test(t) function with no callbacks and no boilerplate. A one-line .map fans a dataset out into dozens of cases.

Fast to run. The runner uses bounded concurrency so evals execute in parallel without overwhelming your agent. A fingerprint-based result cache skips cases that already passed. Sandboxes can be reused and pre-warmed between runs. The earlyExit flag stops retrying a task the moment one attempt succeeds.

Fast to read. Console output streams in real time so you see failures as they happen. Every run produces structured artifacts (an event stream, full transcript, file diffs, and assertion results) in a .niceeval/<timestamp>/ directory. A unified trace makes it easy to reconstruct exactly what the agent did.
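
To make the authoring conventions concrete, here is a minimal sketch of path-based ID derivation and a one-line .map fan-out. It follows the evals/weather/brooklyn.eval.ts example from the text, but the helper name evalIdFromPath, the WeatherCase shape, the cities.eval.ts file, and the per-case ID scheme are assumptions made for illustration, not niceeval's actual implementation.

```typescript
// Derive an eval ID from its file path, matching the example in the text:
// "evals/weather/brooklyn.eval.ts" -> "weather/brooklyn".
// The root directory and suffix defaults are assumed conventions.
function evalIdFromPath(
  filePath: string,
  root = "evals",
  suffix = ".eval.ts",
): string {
  const normalized = filePath.replace(/\\/g, "/"); // tolerate Windows separators
  const prefix = root + "/";
  const withoutRoot = normalized.startsWith(prefix)
    ? normalized.slice(prefix.length)
    : normalized;
  return withoutRoot.endsWith(suffix)
    ? withoutRoot.slice(0, -suffix.length)
    : withoutRoot;
}

// Hypothetical case shape for a dataset-driven eval.
interface WeatherCase {
  id: string;
  prompt: string;
  mustMention: string;
}

const cities = ["Brooklyn", "Queens", "Bronx"];
const baseId = evalIdFromPath("evals/weather/cities.eval.ts");

// A single .map turns the dataset into one case per city.
const cases: WeatherCase[] = cities.map((city) => ({
  id: `${baseId}/${city.toLowerCase()}`, // assumed scheme for per-case IDs
  prompt: `What is the weather in ${city}?`,
  mustMention: city,
}));
```

Growing the cities array is all it takes to add more cases, which is the property the "fast to author" claim relies on.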

How niceeval is structured

At a high level, your evals/ directory is the input and .niceeval/ artifacts are the output. Everything in between is owned by three collaborating pieces:

   your evals/ directory            niceeval core               agent adapter
   ─────────────────────           ──────────────             ─────────────────
   weather.eval.ts   ──discover──>  Runner  ──send──>  Agent  ─── in-process
   sql.eval.ts                        │                        ─── remote HTTP
   fixtures/button/  ──fixture──>     │                        ─── sandbox ──> Docker
     PROMPT.md                        ▼
     EVAL.ts                       Scorers ──> Reporters ──> .niceeval/<run>/
                                   (expect / scoped /         (summary.json /
                                    judge / tests)             event stream /
                                                               transcript / diff)

niceeval core owns everything that is the same regardless of what you’re evaluating: eval discovery, assertion collection, scoring, concurrency scheduling, caching, reporting, and artifact persistence.
Agent adapters are the open boundary between core and your system under test. Official adapters are included for Claude Code, Codex, and bub; you write your own adapter for any other agent or service.
Sandbox owns the details of running coding agents in isolation — Docker by default, with support for third-party sandbox providers.
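
The concurrency scheduling that core owns, together with the earlyExit behavior mentioned earlier, can be pictured with a small sketch. This is a generic worker-lane pattern, not niceeval's actual scheduler; the function names runBounded and attemptWithEarlyExit and their parameters are hypothetical.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in input order.
async function runBounded<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unclaimed index until none remain.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Retry a task up to `maxAttempts` times. With earlyExit enabled, stop as
// soon as one attempt succeeds instead of using up the remaining attempts.
async function attemptWithEarlyExit(
  maxAttempts: number,
  attempt: (n: number) => Promise<boolean>,
  earlyExit = true,
): Promise<{ passed: boolean; attemptsUsed: number }> {
  let passed = false;
  let attemptsUsed = 0;
  for (let n = 1; n <= maxAttempts; n++) {
    attemptsUsed = n;
    if (await attempt(n)) {
      passed = true;
      if (earlyExit) break;
    }
  }
  return { passed, attemptsUsed };
}
```

The lane count caps how many requests hit the agent at once, while early exit trims the cost of retried tasks; the two are independent and can be combined per case.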

niceeval never hardcodes agent-specific logic in its core. It dispatches entirely through the adapter interface, which means adding a new agent type never requires changes to the runner or scorers.
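
One way to picture this boundary is a small adapter contract that core-side code depends on exclusively. The sketch below is an assumption about the general shape, not niceeval's published interface: AgentAdapter, inProcessAdapter, httpAdapter, the injected PostJson transport, the { prompt } and { reply } JSON fields, and runCase are all hypothetical names.

```typescript
// Hypothetical adapter contract: the only thing core knows about an agent.
interface AgentAdapter {
  name: string;
  send(prompt: string): Promise<string>;
}

// In-process adapter: wraps a plain (sync or async) function.
function inProcessAdapter(
  name: string,
  fn: (prompt: string) => Promise<string> | string,
): AgentAdapter {
  return {
    name,
    async send(prompt) {
      return await fn(prompt);
    },
  };
}

// Transport is injected so the sketch does not depend on any HTTP client.
type PostJson = (url: string, body: string) => Promise<string>;

// Remote adapter: posts the prompt as JSON and reads a `reply` field back.
// The request and response shapes are assumptions for illustration.
function httpAdapter(name: string, url: string, post: PostJson): AgentAdapter {
  return {
    name,
    async send(prompt) {
      const raw = await post(url, JSON.stringify({ prompt }));
      const data = JSON.parse(raw) as { reply?: string };
      return data.reply ?? "";
    },
  };
}

// Core-side code sees only the interface, never a concrete agent type.
async function runCase(
  adapter: AgentAdapter,
  prompt: string,
  mustInclude: string,
): Promise<boolean> {
  const reply = await adapter.send(prompt);
  return reply.includes(mustInclude);
}
```

Because runCase accepts any AgentAdapter, supporting a new kind of agent means writing one more factory like the two above, with no change to the code that schedules or scores cases.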

Two integration modes

niceeval supports two top-level integration modes depending on whether the agent under test needs an isolated workspace.

Sandbox mode, for coding agents like Codex and Claude Code that must operate on a real filesystem:

   evals/*.eval.ts
        │
        ▼
   ┌─────────────────────┐
   │    niceeval core    │
   │ discover·schedule·  │
   │    score·report     │
   └─────────────────────┘
        │  Agent adapter (official)
        ▼
   ┌──────────────────────────────┐
   │         Docker Sandbox        │
   │   ┌────────────────────────┐  │
   │   │  Codex / Claude Code / │  │
   │   │  apps needing isolation│  │
   │   └────────────────────────┘  │
   └──────────────────────────────┘

Direct mode — for HTTP services and in-process functions that don’t need Docker:

   evals/*.eval.ts
        │
        ▼
   ┌─────────────────────┐
   │    niceeval core    │
   │ discover·schedule·  │
   │    score·report     │
   └─────────────────────┘
        │  Agent adapter (official or custom)
        ▼
   ┌──────────────────────────────┐
   │       your own Web Agent      │
   │  (HTTP / AI SDK · LangGraph · │
   │   and other frameworks —      │
   │        no Docker needed)      │
   └──────────────────────────────┘

What comes next

The Quickstart walks you through installing niceeval, scaffolding your project with npx niceeval init, and running all three eval types — function, conversational, and coding-agent — in under ten minutes. If you want to go deeper right away, the Installation page covers prerequisites, configuration options, and environment variables in full detail.
