Evaluation Dashboard — dataset browser, run launcher, comparison
2026-03-23
What We Built
A three-tab evaluation dashboard: Dataset Browser (CRUD), Runs (launcher + results), and Compare (side-by-side with statistical significance).
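The entry does not say which significance test the Compare tab applies. As an illustration only, one option for two runs scored on the same examples is an exact paired sign-flip permutation test. The function name `pairedPermutationPValue` and the 20-example cap are assumptions made for this sketch, not part of the dashboard code.

```typescript
// Hypothetical sketch: exact two-sided paired sign-flip permutation test.
// Assumption: both runs scored the same examples, in the same order.
function pairedPermutationPValue(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("inputs must be non-empty and of equal length");
  }
  if (a.length > 20) {
    throw new Error("exact enumeration is only practical for n <= 20");
  }
  // Per-example score differences between the two runs.
  const diffs = a.map((x, i) => x - b[i]);
  const observed = Math.abs(diffs.reduce((s, d) => s + d, 0));
  const total = 2 ** diffs.length;
  let atLeastAsExtreme = 0;
  // Each bit of `mask` decides whether the sign of one difference is flipped.
  for (let mask = 0; mask < total; mask++) {
    let sum = 0;
    for (let i = 0; i < diffs.length; i++) {
      sum += (mask >> i) & 1 ? -diffs[i] : diffs[i];
    }
    // Small tolerance guards against floating-point noise in the comparison.
    if (Math.abs(sum) >= observed - 1e-12) atLeastAsExtreme++;
  }
  return atLeastAsExtreme / total;
}
```

With four examples where one run wins every time, only 2 of the 16 sign assignments are as extreme as the observed one, so the smallest reachable p-value is 0.125. Larger datasets would need a sampled permutation test or a parametric test instead of full enumeration.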
Lockstep Checklist
- [x] API Routes: 7 auth bridge endpoints at `/auth/eval/*`
- [x] TS SDK: `DashboardEval` resource
- [x] Python SDK: `DashboardEval` + `AsyncDashboardEval`
- [x] Dashboard: `evaluations.ts` with 3 tabs + 4 views
- [x] Tests: Score color + delta formatting
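
The score color and delta formatting helpers covered by the tests are not shown in this entry. A minimal sketch of what such helpers might look like follows. The names `scoreColor` and `formatDelta`, the 0.8 / 0.5 thresholds, and the three-decimal default are all assumptions for illustration.

```typescript
// Hypothetical helpers; names and thresholds are assumptions, not the dashboard's actual code.
type ScoreColor = "green" | "yellow" | "red";

// Map a score in [0, 1] to a traffic-light color. The 0.8 / 0.5 cutoffs are illustrative.
function scoreColor(score: number): ScoreColor {
  if (score >= 0.8) return "green";
  if (score >= 0.5) return "yellow";
  return "red";
}

// Format the difference between two scores with an explicit sign.
function formatDelta(baseline: number, candidate: number, digits = 3): string {
  const delta = candidate - baseline;
  const rounded = delta.toFixed(digits);
  // Values that round to zero are shown unsigned, which avoids "-0.000".
  if (Number(rounded) === 0) return (0).toFixed(digits);
  return delta > 0 ? `+${rounded}` : rounded;
}
```

Under these assumptions, `formatDelta(0.5, 0.75)` yields `"+0.250"` and `formatDelta(0.75, 0.5)` yields `"-0.250"`. The explicit plus sign keeps improvements and regressions visually symmetric in a side-by-side view.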