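niceeval scoring: assertions, judges, and outcomes

niceeval scoring aims to turn agent behavior into results that are explainable and comparable. You can mix exact assertions, event-stream assertions, LLM judges, sandbox tests, and cost constraints.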

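The five scoring mechanisms

1. Value assertions

t.check(value, matcher), suited to text, JSON, and structured output.

2. Scoped assertions

t.succeeded(), t.calledTool(), t.fileChanged(), and similar calls, all evaluated together once the run has finished.

3. LLM-as-judge

Use a judge model to evaluate factuality, summary quality, or rubric scores.

4. Test-as-scoring

Run EVAL.ts, Vitest, or project scripts inside the sandbox.

5. Efficiency assertions

Limit tokens, cost, and other usage metrics.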

gate and soft

A gate is a hard threshold. Any gate failure sets the eval's outcome to failed.

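Outcome rules

passed

All gates pass and the run itself did not fail.

failed

Any gate fails, or the run times out, the adapter fails, or a test fails.

passed

The run completed, but the result is expressed mainly through soft scores.

skipped

The eval was skipped.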
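The rules above can be modelled as a small pure function. This is a minimal sketch, not niceeval's implementation: the RunRecord and AssertionResult types, the field names, and the order in which skipped and run failures are checked are all assumptions made for illustration. It also assumes that a failing soft assertion never changes the outcome, which is how the rules above read.

```typescript
// Hypothetical types for illustration only; niceeval's real result shape may differ.
type Outcome = "passed" | "failed" | "skipped";

interface AssertionResult {
  name: string;
  severity: "gate" | "soft";
  passed: boolean;
  score?: number; // soft assertions may carry a score
}

interface RunRecord {
  skipped: boolean;
  runFailed: boolean; // timeout, adapter failure, or test failure
  assertions: AssertionResult[];
}

function resolveOutcome(run: RunRecord): Outcome {
  // Assumption: a skipped eval is reported as skipped before anything else.
  if (run.skipped) return "skipped";
  // A run-level failure fails the eval regardless of assertion results.
  if (run.runFailed) return "failed";
  // Any failing gate fails the eval; soft results are ignored here.
  const gateFailed = run.assertions.some(
    (a) => a.severity === "gate" && !a.passed,
  );
  return gateFailed ? "failed" : "passed";
}
```

Under this model a run with passing gates and a low soft score still resolves to passed, and the soft score is reported alongside the outcome rather than folded into it.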

1. Value assertions

niceeval/expect provides composable matchers:

import { includes, equals, matches, similarity } from "niceeval/expect";

t.check(t.reply, includes("confirmed"));
t.check(turn.data, equals({ intent: "refund" }));
t.check(t.reply, matches(/order #\d+/i));
t.check(t.reply, similarity("The answer should mention Paris").atLeast(0.8));
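To make "composable" concrete, here is a minimal sketch of how matchers of this kind can be modelled as plain functions. Everything in it is a local stand-in: Matcher, MatchResult, includesText, matchesPattern, and allOf are hypothetical names, and the real niceeval/expect matchers may be shaped differently.

```typescript
// Hypothetical stand-ins, not the niceeval/expect implementation.
interface MatchResult {
  pass: boolean;
  message: string;
}

type Matcher<T> = (actual: T) => MatchResult;

// Passes when the text contains the given substring.
const includesText = (needle: string): Matcher<string> => (actual) => ({
  pass: actual.includes(needle),
  message: `expected text to include "${needle}"`,
});

// Passes when the text matches the pattern (avoid the `g` flag, which
// makes RegExp.test stateful across calls).
const matchesPattern = (pattern: RegExp): Matcher<string> => (actual) => ({
  pass: pattern.test(actual),
  message: `expected text to match ${pattern}`,
});

// Combines matchers: passes only if every matcher passes, and collects
// the messages of the ones that failed.
const allOf = <T>(...matchers: Matcher<T>[]): Matcher<T> => (actual) => {
  const failed = matchers.map((m) => m(actual)).filter((r) => !r.pass);
  return {
    pass: failed.length === 0,
    message: failed.map((r) => r.message).join("; "),
  };
};
```

Because each matcher is just a function from a value to a result, combinators such as allOf need no knowledge of what the individual matchers check.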

2. Scoped assertions

Scoped assertions record facts that the run as a whole should satisfy:

t.succeeded();
t.calledTool("get_weather", { count: 1 });
t.notCalledTool("delete_user");
t.toolOrder(["search", "summarize"]);
t.fileChanged("src/components/Button.tsx");
t.check(await t.sandbox.runCommand("npm", ["test"], { cwd: "/workspace" }), commandSucceeded());

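Scoped assertions are normally evaluated together after test() has finished. Do not treat them as functions that return a boolean immediately.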
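The deferred behavior can be pictured as "record now, evaluate later". The sketch below is a hypothetical model, not niceeval's internals: ScopedRecorder, RunTrace, and Verdict are invented names, and treating count as an exact match is an assumption.

```typescript
// Hypothetical model of deferred evaluation; not niceeval's internals.
interface ToolCall {
  name: string;
}

interface RunTrace {
  toolCalls: ToolCall[];
  exitedCleanly: boolean;
}

interface Verdict {
  label: string;
  passed: boolean;
}

class ScopedRecorder {
  private pending: Array<{ label: string; check: (trace: RunTrace) => boolean }> = [];

  // Each method only records an expectation; it returns nothing.
  succeeded(): void {
    this.pending.push({ label: "succeeded", check: (t) => t.exitedCleanly });
  }

  calledTool(name: string, opts: { count?: number } = {}): void {
    this.pending.push({
      label: `calledTool(${name})`,
      check: (t) => {
        const n = t.toolCalls.filter((c) => c.name === name).length;
        // Assumption: `count` means "exactly this many calls".
        return opts.count === undefined ? n > 0 : n === opts.count;
      },
    });
  }

  notCalledTool(name: string): void {
    this.pending.push({
      label: `notCalledTool(${name})`,
      check: (t) => t.toolCalls.every((c) => c.name !== name),
    });
  }

  // Called once, after the run has finished and the full trace is known.
  evaluate(trace: RunTrace): Verdict[] {
    return this.pending.map((p) => ({ label: p.label, passed: p.check(trace) }));
  }
}
```

In this model an assertion can be declared before the agent has done anything, which is why its return value carries no information at call time.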

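3. LLM-as-judge

Judges cover semantic quality that cannot be matched exactly:

t.judge.autoevals.closedQA("Does the answer accurately explain the refund policy?", { on: t.reply }).atLeast(0.7);
t.judge.autoevals.summarizes(sourceText, { on: t.reply }).atLeast(0.8);
t.judge.autoevals.closedQA("Polite, specific, and free of fabricated facts", { on: t.reply }).atLeast(0.75);

The judge model can be configured in defineConfig, on a single eval, or on a single judge call.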
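With three places to configure the judge model, some precedence rule has to decide which one applies. The sketch below assumes the most specific level wins (call, then eval, then config). That ordering is an assumption, not something this page states, and JudgeSettings, resolveJudgeModel, and the model names in the usage note are hypothetical.

```typescript
// Hypothetical precedence model; check niceeval's own docs for the real rule.
interface JudgeSettings {
  model?: string;
}

function resolveJudgeModel(
  config: JudgeSettings, // project-wide settings
  evalLevel: JudgeSettings, // settings on one eval
  call: JudgeSettings, // settings on one judge call
): string {
  // Assumption: the most specific level that sets a model wins.
  const model = call.model ?? evalLevel.model ?? config.model;
  if (model === undefined) {
    throw new Error("no judge model configured at any level");
  }
  return model;
}
```

For example, resolveJudgeModel({ model: "judge-default" }, {}, { model: "judge-strict" }) would pick "judge-strict" under this assumed ordering.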

4. Test-as-scoring

A sandbox fixture uses EVAL.ts to verify the code the agent produced:

import { test, expect } from "vitest";
import { existsSync } from "node:fs";

test("Button exists", () => {
  expect(existsSync("src/components/Button.tsx")).toBe(true);
});

Test output is written to .niceeval/<run>/.../test-output.txt.

5. Efficiency and cost

t.maxTokens(50_000);
t.maxCost(0.25);

These assertions help prevent an agent from brute-forcing a result through excessive model or tool calls.
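As a rough illustration of what such a budget check involves, the sketch below sums usage across the steps of a run and compares the totals with the limits. It is not niceeval's implementation: StepUsage, Budget, and checkBudget are hypothetical, and counting input plus output tokens toward the limit is an assumption.

```typescript
// Hypothetical usage model for illustration only.
interface StepUsage {
  inputTokens: number;
  outputTokens: number;
  cost: number; // same unit as the budget's maxCost
}

interface Budget {
  maxTokens?: number;
  maxCost?: number;
}

interface BudgetReport {
  totalTokens: number;
  totalCost: number;
  violations: string[];
}

function checkBudget(steps: StepUsage[], budget: Budget): BudgetReport {
  // Assumption: both input and output tokens count toward the limit.
  const totalTokens = steps.reduce((sum, s) => sum + s.inputTokens + s.outputTokens, 0);
  const totalCost = steps.reduce((sum, s) => sum + s.cost, 0);

  const violations: string[] = [];
  if (budget.maxTokens !== undefined && totalTokens > budget.maxTokens) {
    violations.push(`tokens ${totalTokens} exceed limit ${budget.maxTokens}`);
  }
  if (budget.maxCost !== undefined && totalCost > budget.maxCost) {
    violations.push(`cost ${totalCost} exceeds limit ${budget.maxCost}`);
  }
  return { totalTokens, totalCost, violations };
}
```

The check runs over the whole run, so a single expensive step and many cheap ones are treated alike once their totals cross a limit.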
