Evaluation Dashboard — dataset browser, run launcher, comparison

2026-03-23

dashboard, evaluation

What We Built

A three-tab evaluation dashboard: Dataset Browser (CRUD), Runs (launcher + results), and Compare (side-by-side with statistical significance).
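This entry does not say which significance test the Compare tab uses. As one possible illustration, here is a minimal sketch of a paired sign-flip permutation test over per-example scores from two runs. The names `pairedPermutationTest`, `PairedComparison`, and `mulberry32`, along with the default iteration count and seed, are assumptions made for this example and are not taken from the dashboard code.

```typescript
// Hypothetical sketch: none of these names come from the dashboard code.
interface PairedComparison {
  meanDelta: number; // mean of (candidate - baseline) per example
  pValue: number;    // two-sided sign-flip permutation p-value
}

// Small deterministic PRNG (mulberry32 variant) so results are reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function pairedPermutationTest(
  baseline: number[],
  candidate: number[],
  iterations = 10000, // assumed default
  seed = 42,          // assumed default
): PairedComparison {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error("score arrays must be non-empty and of equal length");
  }
  const deltas = candidate.map((c, i) => c - baseline[i]);
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const observed = Math.abs(mean(deltas));
  const rand = mulberry32(seed);

  // Under the null hypothesis the sign of each per-example delta is arbitrary,
  // so flip signs at random and count how often the mean is at least as extreme.
  let atLeastAsExtreme = 0;
  for (let k = 0; k < iterations; k++) {
    let sum = 0;
    for (const d of deltas) sum += rand() < 0.5 ? d : -d;
    if (Math.abs(sum / deltas.length) >= observed - 1e-12) atLeastAsExtreme++;
  }

  // Add-one smoothing keeps the estimated p-value strictly positive.
  return {
    meanDelta: mean(deltas),
    pValue: (atLeastAsExtreme + 1) / (iterations + 1),
  };
}
```

A permutation test is used here only because it needs no distribution tables; a paired t-test or a bootstrap would fit the same interface.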

Lockstep Checklist

[x] API Routes: 7 auth bridge endpoints at /auth/eval/*
[x] TS SDK: DashboardEval resource
[x] Python SDK: DashboardEval + AsyncDashboardEval
[x] Dashboard: evaluations.ts with 3 tabs + 4 views
[x] Tests: Score color + delta formatting
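
The tested helpers are not shown in this entry. As a rough sketch of what score coloring and delta formatting might look like, the following uses hypothetical names (`scoreColor`, `formatDelta`) and assumed thresholds (0.8 and 0.5) and precision (3 decimals); the actual values in evaluations.ts may differ.

```typescript
// Hypothetical sketch: names, thresholds, and precision are assumptions.
type ScoreColor = "green" | "yellow" | "red";

// Map a score in [0, 1] to a traffic-light color.
function scoreColor(score: number, good = 0.8, warn = 0.5): ScoreColor {
  if (score >= good) return "green";
  if (score >= warn) return "yellow";
  return "red";
}

// Format a score difference with an explicit sign and fixed precision.
function formatDelta(delta: number, digits = 3): string {
  const rounded = Number(delta.toFixed(digits));
  // Values that round to zero get no sign, avoiding output like "-0.000".
  if (rounded === 0) return (0).toFixed(digits);
  const sign = rounded > 0 ? "+" : "-";
  return sign + Math.abs(rounded).toFixed(digits);
}

console.log(scoreColor(0.9), formatDelta(0.0523));
console.log(scoreColor(0.3), formatDelta(-0.1));
```

Rounding before choosing the sign keeps tiny negative deltas from rendering as a signed zero, which is the kind of edge case a formatting test would typically pin down.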