What an experiment is
An experiment is a defineExperiment configuration that describes one agent, one model, how many times to run each eval, and which evals to include. When you run npx niceeval exp <group>, the runner executes each config in that group under the same bounded-concurrency scheduler and produces a comparative report.
You don’t supply an id or name in defineExperiment. The experiment’s ID is derived from its file path, exactly like evals.
The defineExperiment shape
agent
The agent adapter instance to run, such as codexAgent() or your own defineAgent(...) result.
model
A single model identifier (e.g. "gpt-5.4" or "anthropic/claude-opus-4-8"). The model is passed to the agent as ctx.model. To compare models, create multiple experiment files in the same group.
runs
How many times to run each (agent, model, eval) cell. With runs: 5, each cell produces 5 attempts. Results are aggregated into a pass rate rather than a single outcome.
earlyExit
Whether to stop remaining retries for an eval as soon as one attempt passes. Defaults to true. Set to false to collect the full pass-rate distribution, which is useful for nightly stability runs.
evals
A string prefix (or array of prefixes) filtering which evals are included in this experiment. Works the same as the positional argument to npx niceeval. Omit to include all discovered evals.
flags
A record of feature flags injected into every attempt as ctx.flags (agent side) and t.flags (eval side). Use flags to toggle features (memory backends, tool allowlists, effort settings) without creating separate agent definitions.
budget
An estimated cost ceiling in USD. The runner stops dispatching new attempts once accumulated cost exceeds this value. Can be overridden at runtime with --budget.
test(t), or put adapter-specific preparation in the agent adapter’s setup.
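As a rough illustration, the fields described above can be written down as a TypeScript shape. This is a sketch only: the interface names (AgentAdapter, ExperimentConfigSketch), the helper resolveEarlyExit, and all example values are hypothetical and are not niceeval's published types. Note that there is no id or name field, since the ID comes from the file path.

```typescript
// Hypothetical local types that mirror the documented fields.
// They are NOT imported from niceeval and may differ from its real typings.
interface AgentAdapter {
  name: string;
}

interface ExperimentConfigSketch {
  agent: AgentAdapter;              // adapter instance, e.g. the result of codexAgent()
  model: string;                    // a single model identifier
  runs: number;                     // attempts per (agent, model, eval) cell
  earlyExit?: boolean;              // documented default: true
  evals?: string | string[];        // prefix filter; omit to include all evals
  flags?: Record<string, unknown>;  // surfaced as ctx.flags and t.flags
  budget?: number;                  // estimated cost ceiling in USD
}

// Example values are illustrative only (flag name and eval prefix are made up).
const nightlyStability: ExperimentConfigSketch = {
  agent: { name: "codex" },
  model: "gpt-5.4",
  runs: 5,
  earlyExit: false,
  evals: ["memory/"],
  flags: { memoryBackend: "sqlite" },
  budget: 25,
};

// Apply the documented default when earlyExit is omitted.
function resolveEarlyExit(config: ExperimentConfigSketch): boolean {
  return config.earlyExit ?? true;
}

console.log(resolveEarlyExit(nightlyStability)); // false
```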
Running experiments
Organizing experiments: directory as group
The directory an experiment file lives in determines its comparison group. Files in the same directory are treated as peer configurations: running niceeval exp compare runs all of them and places their results side by side in the report.
The coding-agent-memory-evals project uses exactly this pattern: its compare group contains two experiment files, each a defineExperiment configuration. Running niceeval exp compare executes both and renders a side-by-side comparison report.
Matrix expansion
The runner expands selected experiment files × evals × runs. For example: 2 experiment files × runs: 5 × 3 evals = 30 attempts. All 30 run through the same bounded-concurrency pool.
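The arithmetic above can be reproduced with a small sketch. It assumes every experiment file uses the same runs value and ignores earlyExit, so it yields the maximum number of attempts. The Attempt type, the expandMatrix function, and the file and eval names are hypothetical and do not reflect the runner's internals.

```typescript
// Hypothetical attempt record; the real runner's data model may differ.
interface Attempt {
  experiment: string; // experiment file, which fixes one agent and one model
  evalId: string;
  run: number;        // 1-based attempt index within the cell
}

// Cartesian expansion: experiment files x evals x runs.
function expandMatrix(experiments: string[], evalIds: string[], runs: number): Attempt[] {
  const attempts: Attempt[] = [];
  for (const experiment of experiments) {
    for (const evalId of evalIds) {
      for (let run = 1; run <= runs; run++) {
        attempts.push({ experiment, evalId, run });
      }
    }
  }
  return attempts;
}

// 2 experiment files x 3 evals x runs: 5
const attempts = expandMatrix(
  ["baseline", "with-memory"],
  ["eval-a", "eval-b", "eval-c"],
  5,
);
console.log(attempts.length); // 30
```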
Pass rate reporting
Results are aggregated per (agent, model, eval) cell into a pass rate rather than a single pass/fail outcome:
- pass@k: how many of the k attempts passed (e.g. pass@5 = 4/5)
- mean time: average wall-clock duration per attempt
- token usage: average tokens consumed
- estimated cost: average USD cost per attempt
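The four metrics listed above amount to a group-by over attempts followed by a count and three averages. The sketch below shows one way to compute them. The AttemptResult and CellSummary types, the summarize function, the cell key format, and the sample numbers are all assumptions made for illustration.

```typescript
// Hypothetical per-attempt record.
interface AttemptResult {
  cell: string;       // e.g. "agent|model|eval"
  passed: boolean;
  durationMs: number;
  tokens: number;
  costUsd: number;
}

interface CellSummary {
  passed: number;
  attempts: number;
  passRate: number;
  meanDurationMs: number;
  meanTokens: number;
  meanCostUsd: number;
}

function summarize(results: AttemptResult[]): Record<string, CellSummary> {
  // Group attempts by cell.
  const groups: Record<string, AttemptResult[]> = {};
  for (const r of results) {
    (groups[r.cell] = groups[r.cell] || []).push(r);
  }
  // Reduce each group to a pass count plus per-attempt averages.
  const summaries: Record<string, CellSummary> = {};
  for (const cell of Object.keys(groups)) {
    const group = groups[cell];
    const n = group.length;
    const mean = (pick: (r: AttemptResult) => number): number =>
      group.reduce((sum, r) => sum + pick(r), 0) / n;
    const passed = group.filter((r) => r.passed).length;
    summaries[cell] = {
      passed,
      attempts: n,
      passRate: passed / n,
      meanDurationMs: mean((r) => r.durationMs),
      meanTokens: mean((r) => r.tokens),
      meanCostUsd: mean((r) => r.costUsd),
    };
  }
  return summaries;
}

// Five attempts in one cell, four of which pass (sample data only).
const results: AttemptResult[] = [true, true, false, true, true].map((passed, i) => ({
  cell: "codex|gpt-5.4|eval-a",
  passed,
  durationMs: (i + 1) * 1000,
  tokens: 2000,
  costUsd: 0.5,
}));

const summary = summarize(results)["codex|gpt-5.4|eval-a"];
console.log(`pass@${summary.attempts} = ${summary.passed}/${summary.attempts}`); // pass@5 = 4/5
```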
Why pass rate matters
A single passing run does not mean an agent is reliable. An agent that passes 1 out of 5 attempts at the same task is fundamentally different from one that passes 5 out of 5, even though both have at least one passing run. Pass rate measures stability, not luck. This is particularly important for:
- Coding agents that interact with real file systems and have inherent non-determinism
- Evaluating whether a new model tier or feature flag genuinely improves reliability, or just got lucky once
Model and feature flag injection
The experiment runner injects ctx.model and ctx.flags into every attempt. Your agent reads these to configure itself.
Evals read the same flags through t.flags.
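To make the agent side concrete, here is a minimal sketch of an adapter deriving its settings from the injected values. Only ctx.model and ctx.flags come from the text above. The AttemptContext and AgentSettings types, the configureAgent function, and the flag names memory and effort are hypothetical.

```typescript
// Hypothetical context shape; the real ctx object likely carries more fields.
interface AttemptContext {
  model: string;
  flags: Record<string, unknown>;
}

interface AgentSettings {
  model: string;
  memoryEnabled: boolean;
  effort: string;
}

// The flag names "memory" and "effort" are illustrative, not built into niceeval.
function configureAgent(ctx: AttemptContext): AgentSettings {
  const effort = ctx.flags["effort"];
  return {
    model: ctx.model,
    // Treat anything other than an explicit `true` as disabled.
    memoryEnabled: ctx.flags["memory"] === true,
    // Fall back to a default when the flag is missing or not a string.
    effort: typeof effort === "string" ? effort : "medium",
  };
}

const settings = configureAgent({ model: "gpt-5.4", flags: { memory: true } });
console.log(settings.memoryEnabled, settings.effort); // true medium
```

Reading flags defensively like this keeps one agent definition usable across experiment files that set different flags or none at all.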
Budget and full-distribution runs
For production stability measurements you typically want two things together: a meaningful budget ceiling to prevent runaway costs, and earlyExit: false to collect the full distribution rather than stopping after the first pass.
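How the two settings interact can be seen in a simplified, sequential simulation. The real runner dispatches attempts concurrently, so this is only a sketch of the documented rules: stop dispatching once accumulated cost exceeds the budget, and skip an eval's remaining retries after a pass when earlyExit is true. The PlannedAttempt and RunOptions types, the dispatch function, and the sample costs are hypothetical.

```typescript
// Hypothetical planned attempt with a pre-baked outcome and cost.
interface PlannedAttempt {
  evalId: string;
  passed: boolean; // stand-in for the attempt's real outcome
  costUsd: number; // stand-in for the attempt's measured cost
}

interface RunOptions {
  budget: number;     // cost ceiling in USD
  earlyExit: boolean;
}

// Returns the attempts that would actually be executed.
function dispatch(planned: PlannedAttempt[], options: RunOptions): PlannedAttempt[] {
  const executed: PlannedAttempt[] = [];
  const passedEvals: Record<string, boolean> = {};
  let spent = 0;
  for (const attempt of planned) {
    // Budget rule: no new attempts once accumulated cost exceeds the ceiling.
    if (spent > options.budget) break;
    // Early-exit rule: skip remaining retries for an eval that already passed.
    if (options.earlyExit && passedEvals[attempt.evalId]) continue;
    executed.push(attempt);
    spent += attempt.costUsd;
    if (attempt.passed) passedEvals[attempt.evalId] = true;
  }
  return executed;
}

const planned: PlannedAttempt[] = [
  { evalId: "eval-a", passed: true, costUsd: 1 },
  { evalId: "eval-a", passed: false, costUsd: 1 },
  { evalId: "eval-a", passed: true, costUsd: 1 },
];

console.log(dispatch(planned, { budget: 10, earlyExit: true }).length);   // 1
console.log(dispatch(planned, { budget: 10, earlyExit: false }).length);  // 3
console.log(dispatch(planned, { budget: 1.5, earlyExit: false }).length); // 2
```

With earlyExit: false the second, failing attempt is recorded, which is exactly the information a stability run is after; the budget then caps how far the full distribution can run.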