Evaluation Runs & Scoring Engine — 6 scorers, statistical comparison

2026-03-23

evaluation

What We Built

A scoring engine with 6 pluggable scorers and an async evaluation run system. Run any dataset against any model, score the outputs with multiple strategies, and test whether the difference between two runs is statistically significant using a paired t-test.
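To illustrate the run comparison, here is a minimal sketch of a paired t-test over per-example scores from two runs on the same dataset. The function name `paired_t_statistic` and the plain-list inputs are assumptions for illustration, not the engine's actual interface. The sketch uses only the standard library, so it returns the t statistic and degrees of freedom and leaves the p-value lookup to a statistics package.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for two runs scored on the same examples.

    Hypothetical helper: scores_a[i] and scores_b[i] must refer to the same
    dataset example, which is what makes the test "paired".
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both runs must score the same examples")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two paired examples")

    # Work on per-example differences rather than on the two runs separately.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")

    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Example: run B versus run A on four shared examples.
t, df = paired_t_statistic([1, 2, 3, 4], [2, 4, 3, 6])
print(f"t = {t:.3f}, df = {df}")
```

A positive t means the second run scored higher on average. If SciPy is available, `scipy.stats.ttest_rel` computes the same statistic together with a p-value.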

Scorers: validity (structural quality), exact_match, contains, latency (time-based), cost (USD-based), and an LLM judge.
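As a rough picture of what "pluggable" can mean here, the sketch below registers two of the simpler scorers by name and applies several of them to one output. The `Scorer` signature, the `SCORERS` registry, and `score_all` are hypothetical, and the whitespace handling in `exact_match` is an assumption; the real engine may differ on all of these.

```python
from typing import Callable, Dict, List

# Assumed scorer shape: (model output, expected answer) -> score in [0, 1].
Scorer = Callable[[str, str], float]


def exact_match(output: str, expected: str) -> float:
    # Assumption: surrounding whitespace is ignored before comparing.
    return 1.0 if output.strip() == expected.strip() else 0.0


def contains(output: str, expected: str) -> float:
    # Scores 1.0 when the expected text appears anywhere in the output.
    return 1.0 if expected in output else 0.0


# Hypothetical registry: adding a scorer means adding one entry here.
SCORERS: Dict[str, Scorer] = {
    "exact_match": exact_match,
    "contains": contains,
}


def score_all(output: str, expected: str, names: List[str]) -> Dict[str, float]:
    """Apply each named scorer to a single output and collect the scores."""
    return {name: SCORERS[name](output, expected) for name in names}


print(score_all("Paris is the capital", "Paris", ["exact_match", "contains"]))
```

Keeping scorers behind a shared signature is what lets a single run be scored with multiple strategies at once.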

Lockstep Checklist

  • [x] API: 5 endpoints at /v1/eval/runs
  • [x] DB: eval_runs + eval_run_results tables (migration V42, RLS)
  • [x] SDK-TS: EvalRuns resource
  • [x] SDK-PY: EvalRuns + AsyncEvalRuns
  • [x] MCP: br_start_eval_run, br_get_eval_run
  • [x] Tests: 18 scorer tests, aggregate metrics, statistical comparison