Evaluation Runs & Scoring Engine — 6 scorers, statistical comparison
2026-03-23
What We Built
A scoring engine with 6 pluggable scorers and an async evaluation run system. Run any dataset against any model, score the outputs with multiple strategies, and compare two runs for statistical significance using a paired t-test.
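As a rough illustration of the run comparison, here is a minimal, stdlib-only sketch of a paired t-test over per-item scores from two runs on the same dataset. The function names (`paired_t_test`, `_t_pdf`, `_two_sided_p`) and the numerical integration used for the p-value are assumptions made for this example, not the engine's actual implementation.

```python
import math
from statistics import mean, stdev


def _t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def _two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = _t_pdf(0.0, df) + _t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * _t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * area)


def paired_t_test(scores_a, scores_b):
    """Return (t, degrees_of_freedom, two_sided_p) for paired per-item scores.

    Hypothetical helper: scores_a[i] and scores_b[i] must refer to the
    same dataset item scored in run A and run B respectively.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two pairs")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    df = n - 1
    return t, df, _two_sided_p(t, df)


# Made-up per-item scores for two runs over the same five items.
run_a = [0.60, 0.70, 0.65, 0.80, 0.75]
run_b = [0.70, 0.75, 0.70, 0.85, 0.90]
t_stat, dof, p_value = paired_t_test(run_a, run_b)
print(f"t={t_stat:.3f} df={dof} p={p_value:.4f}")
```

The test is paired because both runs score the same items, so the per-item difference removes item-level difficulty from the comparison. In production code a library routine such as `scipy.stats.ttest_rel` would likely be preferable to hand-rolled integration.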
Scorers: validity (structural quality), exact_match, contains, latency (time-based), cost (USD-based), and LLM judge support.
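To make "pluggable" concrete, the sketch below shows one way a scorer registry could look. Everything in it is hypothetical: the `Sample` fields, the registry layout, the budget parameters, and the 0-to-1 normalization for latency and cost are assumptions for illustration only. The validity and LLM judge scorers are left out because their logic depends on details not described here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    """Hypothetical record for one model response within an eval run."""
    output: str
    expected: str
    latency_ms: float
    cost_usd: float


# A scorer maps a sample to a float score (assumed here to lie in [0, 1]).
Scorer = Callable[[Sample], float]


def exact_match(s: Sample) -> float:
    return 1.0 if s.output.strip() == s.expected.strip() else 0.0


def contains(s: Sample) -> float:
    return 1.0 if s.expected in s.output else 0.0


def make_latency_scorer(budget_ms: float) -> Scorer:
    # Assumed normalization: 1.0 at zero latency, 0.0 at or past the budget.
    return lambda s: max(0.0, 1.0 - s.latency_ms / budget_ms)


def make_cost_scorer(budget_usd: float) -> Scorer:
    # Assumed normalization: 1.0 at zero cost, 0.0 at or past the budget.
    return lambda s: max(0.0, 1.0 - s.cost_usd / budget_usd)


# Registry keyed by scorer name; adding a scorer means adding an entry.
SCORERS: Dict[str, Scorer] = {
    "exact_match": exact_match,
    "contains": contains,
    "latency": make_latency_scorer(budget_ms=1000.0),
    "cost": make_cost_scorer(budget_usd=0.01),
}


def score_sample(sample: Sample, names: List[str]) -> Dict[str, float]:
    """Apply each requested scorer to one sample."""
    return {name: SCORERS[name](sample) for name in names}


sample = Sample(
    output="The answer is 42", expected="42", latency_ms=500.0, cost_usd=0.002
)
print(score_sample(sample, ["exact_match", "contains", "latency", "cost"]))
```

Keeping every scorer behind the same callable signature is what lets a single run be scored with several strategies at once.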
Lockstep Checklist
- [x] API: 5 endpoints at /v1/eval/runs
- [x] DB: eval_runs + eval_run_results tables (migration V42, RLS)
- [x] SDK-TS: EvalRuns resource
- [x] SDK-PY: EvalRuns + AsyncEvalRuns
- [x] MCP: br_start_eval_run, br_get_eval_run
- [x] Tests: 18 scorer tests, aggregate metrics, statistical comparison