.niceeval/<timestamp>/. The console gives you instant feedback while the run is in progress, and the local result viewer lets you drill into exactly what the agent did — which tools it called, what files it changed, what the LLM said turn by turn — when you need to debug a failure or understand why an eval’s score dropped.
Console output
The console streams results in real time as each eval completes. For each failed assertion, it reports the assertion type (gate or soft) and the failure message.
For experiment runs, the summary shows pass rates per (agent, model, eval) cell.
The .niceeval/<timestamp>/ directory
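Every run produces a timestamped output directory containing the artifacts described under "Artifacts explained", such as summary.json and events.jsonl, plus transcript.jsonl, diff.json, and test-output.txt for sandbox evals.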
The niceeval view command
.niceeval/. The viewer lets you browse evals, inspect transcripts, read diffs, explore the event stream, and navigate the assertion results — all without leaving your machine. No data is uploaded anywhere.
Artifacts explained
summary.json
The top-level summary captures aggregate statistics for the entire run.
summary.json also contains the per-cell pass rate table.
Event stream (events.jsonl)
Each line in events.jsonl is a StreamEvent — the normalized, agent-agnostic event representation that niceeval uses as the source of truth for all assertions:
This is the stream that t.calledTool(), t.event(), t.noFailedActions(), and all other scope assertions query. Reading it directly lets you understand exactly what sequence of actions the agent took.
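A quick way to get oriented in an events.jsonl file is to tally its events by type. The sketch below is a minimal example, not niceeval's own code: it assumes only that each line is a JSON object with a string `type` field (the helper names `parseJsonl` and `tallyByType` are made up for illustration).

```typescript
// Assumed minimal shape: only a string `type` field is relied on here.
interface StreamEventLike {
  type: string;
  [key: string]: unknown;
}

// Parse JSONL text (one JSON object per line), skipping blank lines.
function parseJsonl(text: string): StreamEventLike[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as StreamEventLike);
}

// Count how many events of each type appear in the stream.
function tallyByType(events: StreamEventLike[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const event of events) {
    counts.set(event.type, (counts.get(event.type) ?? 0) + 1);
  }
  return counts;
}
```

Feeding it the contents of events.jsonl gives a per-type count, which is often enough to notice a run with no tool calls at all.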
Transcript (transcript.jsonl)
For sandbox evals (coding agents), the transcript is the raw JSONL log produced by the agent CLI, captured before it is parsed into the standard event stream. This is the most detailed view of what the agent did — every tool invocation, every file read, every shell command, every model response.
Generated file diff (diff.json)
For sandbox evals, diff.json captures the git diff between the workspace’s initial state and the state after the agent finished. It includes the list of changed, added, and deleted files along with their content. This is what t.fileChanged(), t.sandbox.diff.get(), and t.notInDiff() assertions query.
Test output (test-output.txt)
For sandbox evals that include an EVAL.ts, test-output.txt contains the full Vitest output from running the validation tests inside the sandbox. This shows exactly which test cases passed and failed, with the same detail you’d see running Vitest locally.
Outcome meanings
- passed: All gate assertions passed and all soft assertions met their thresholds. The agent completed the task correctly and within quality targets.
- failed
- skipped
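The rule for a passed outcome can be written as a small predicate. This is a sketch under stated assumptions rather than niceeval's implementation: the `AssertionResult` shape and its field names are hypothetical, and "met its threshold" is assumed to mean a score greater than or equal to the threshold.

```typescript
// Hypothetical assertion-result shape; the field names are assumptions.
type AssertionResult =
  | { kind: "gate"; ok: boolean }
  | { kind: "soft"; score: number; threshold: number };

// True when every gate assertion passed and every soft assertion
// reached its threshold (assumed here to mean score >= threshold).
// Note that an empty list of results also returns true.
function isPassed(results: AssertionResult[]): boolean {
  return results.every((r) =>
    r.kind === "gate" ? r.ok : r.score >= r.threshold,
  );
}
```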
Reading pass rates for experiments
When you run an experiment with runs > 1, the result viewer shows pass@k rates for each (agent, model, eval) cell.
pass@5 = 4/5 means 4 of the 5 attempts for that cell produced a passed outcome. The mean time, token usage, and cost are averages across those 5 attempts. Use these numbers to compare agent reliability and cost efficiency side by side.
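The arithmetic behind one cell can be sketched as follows. The `Attempt` record and its field names are hypothetical, chosen only to illustrate how a pass rate and per-attempt means would be derived; the viewer's actual data model may differ.

```typescript
// Hypothetical per-attempt record; field names are assumptions.
interface Attempt {
  outcome: string; // e.g. "passed" or "failed"
  durationMs: number;
  costUsd: number;
}

interface CellSummary {
  passed: number;
  attempts: number;
  passRate: number;
  meanDurationMs: number;
  meanCostUsd: number;
}

// Summarize all attempts for one (agent, model, eval) cell.
function summarizeCell(attempts: Attempt[]): CellSummary {
  const n = attempts.length;
  if (n === 0) throw new Error("a cell needs at least one attempt");
  const passed = attempts.filter((a) => a.outcome === "passed").length;
  // Arithmetic mean of one numeric field across all attempts.
  const mean = (pick: (a: Attempt) => number): number =>
    attempts.reduce((sum, a) => sum + pick(a), 0) / n;
  return {
    passed,
    attempts: n,
    passRate: passed / n,
    meanDurationMs: mean((a) => a.durationMs),
    meanCostUsd: mean((a) => a.costUsd),
  };
}
```

With five attempts of which four passed, `passed` is 4 and `attempts` is 5, matching the 4/5 reading above. The means include failed attempts, since they average over all attempts in the cell.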
Using artifacts for debugging
Reading the transcript to understand agent behavior
When an eval fails unexpectedly, the transcript is usually the best place to start. Open transcript.jsonl in the viewer or read it directly to see the full turn-by-turn conversation, the exact inputs to every tool call, and the outputs returned. This tells you whether the agent understood the task, attempted the right approach, or got stuck in a loop.
Checking the diff for unexpected changes
For sandbox (coding agent) evals, diff.json shows every file the agent touched. If an eval fails a t.fileChanged("src/Button.tsx") assertion, the diff tells you whether the agent wrote to a different path, skipped writing entirely, or created the file but under the wrong name. If a t.notInDiff(/sk-[A-Za-z0-9]/) assertion fires, the diff shows you exactly which file contains the leaked secret.
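Both checks can also be done by hand over a parsed diff. The sketch below assumes a hypothetical diff.json layout (a `files` array with `path`, `status`, and optional `content`); the real schema is not shown here and may differ, so treat the field names as placeholders.

```typescript
// Hypothetical diff.json shape; the real layout may differ.
interface DiffFile {
  path: string;
  status: "added" | "changed" | "deleted";
  content?: string;
}
interface Diff {
  files: DiffFile[];
}

// Paths of non-deleted files whose content matches the pattern.
// String.prototype.search ignores the regex's global flag and lastIndex,
// so a /g pattern can be reused safely across files.
function filesMatching(diff: Diff, pattern: RegExp): string[] {
  return diff.files
    .filter(
      (f) =>
        f.status !== "deleted" &&
        f.content !== undefined &&
        f.content.search(pattern) !== -1,
    )
    .map((f) => f.path);
}

// Paths that share the expected file's name but live somewhere else,
// which hints that the agent wrote to the wrong directory.
function sameNameElsewhere(diff: Diff, expectedPath: string): string[] {
  const base = expectedPath.split("/").pop();
  return diff.files
    .map((f) => f.path)
    .filter((p) => p !== expectedPath && p.split("/").pop() === base);
}
```

For example, `filesMatching(diff, /sk-[A-Za-z0-9]/)` lists the files that would trip the secret check, and `sameNameElsewhere(diff, "src/Button.tsx")` lists candidates for a misplaced write.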
Checking the event stream for tool usage issues
If a t.calledTool() assertion fails, read events.jsonl to see what tools the agent actually called. You'll see every action.called and action.result event in order, with the exact inputs and outputs. This makes it easy to spot cases where an agent called the right tool with the wrong arguments, or called a different tool than expected.
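The same inspection can be scripted. In this sketch the `action.called` event type comes from the description above, but the `tool` field that carries the tool name is an assumption, as are the helper names; adjust them to whatever the events in your run actually contain.

```typescript
// `type` values come from the text above; `tool` is an assumed field name.
interface ActionEvent {
  type: string;
  tool?: string;
}

// Return the tool names from action.called events, in call order.
function calledTools(eventsJsonl: string): string[] {
  const tools: string[] = [];
  for (const line of eventsJsonl.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines
    const event = JSON.parse(line) as ActionEvent;
    if (event.type === "action.called" && typeof event.tool === "string") {
      tools.push(event.tool);
    }
  }
  return tools;
}

// Report which expected tools never appear among the calls.
function missingTools(eventsJsonl: string, expected: string[]): string[] {
  const seen = new Set(calledTools(eventsJsonl));
  return expected.filter((name) => !seen.has(name));
}
```

Comparing the output of `calledTools` with the tool named in the failing assertion usually shows right away whether the agent picked a different tool or never called one.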
Using __niceeval__/results.json in EVAL.ts
For sandbox evals, niceeval injects an observability summary into the sandbox at __niceeval__/results.json before running EVAL.ts. Your test code can read it to make assertions about agent behavior rather than just file outcomes:
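As an illustration only, here is a helper that counts calls to a given tool from the injected summary. The schema of results.json is not reproduced here, so the `toolCalls` array and its `name` field are hypothetical, as is the `"shell"` tool name in the usage comment; check the actual file contents before relying on any of these names.

```typescript
// Hypothetical summary shape; the real results.json schema may differ.
interface ResultsSummary {
  toolCalls: { name: string }[];
}

// Count how many recorded tool calls used the given tool name.
function countToolCalls(resultsJson: string, toolName: string): number {
  const summary = JSON.parse(resultsJson) as ResultsSummary;
  return summary.toolCalls.filter((call) => call.name === toolName).length;
}

// Possible use inside EVAL.ts (Vitest), shown as comments so this block
// stays standalone. The relative path assumes the tests run from the
// sandbox workspace root.
//
//   import { readFileSync } from "node:fs";
//   import { expect, test } from "vitest";
//
//   test("agent ran the shell tool", () => {
//     const raw = readFileSync("__niceeval__/results.json", "utf8");
//     expect(countToolCalls(raw, "shell")).toBeGreaterThan(0);
//   });
```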
test-output.txt.