Evaluation Dashboard — dataset browser, run launcher, comparison

2026-03-23

dashboard, evaluation

What We Built

A three-tab evaluation dashboard: Dataset Browser (CRUD), Runs (launcher + results), and Compare (side-by-side with statistical significance).
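
This entry does not say which significance test the Compare tab uses. As one possible illustration only, the sketch below runs an exact paired sign-flip (permutation) test on per-example scores from two runs. The function name pairedSignFlipPValue and the 20-example cap are assumptions made for this example, not part of the dashboard or the SDKs.

```typescript
// Hypothetical sketch: exact two-sided sign-flip test for two runs scored
// on the same examples. This is NOT confirmed to be the dashboard's method.
function pairedSignFlipPValue(runA: number[], runB: number[]): number {
  if (runA.length !== runB.length || runA.length === 0) {
    throw new Error("runs must cover the same non-empty set of examples");
  }
  if (runA.length > 20) {
    // 2^n sign patterns are enumerated, so keep n small (assumed cap).
    throw new Error("exact enumeration is only practical for small n");
  }

  // Per-example score differences between the two runs.
  const diffs = runA.map((score, i) => score - runB[i]);
  const observed = Math.abs(diffs.reduce((sum, d) => sum + d, 0));

  const total = 2 ** diffs.length;
  let asExtreme = 0;
  for (let mask = 0; mask < total; mask++) {
    // Each bit of mask decides whether a difference keeps or flips its sign.
    let sum = 0;
    for (let i = 0; i < diffs.length; i++) {
      sum += (mask >> i) & 1 ? diffs[i] : -diffs[i];
    }
    // Small tolerance guards against floating-point noise in the comparison.
    if (Math.abs(sum) >= observed - 1e-12) asExtreme++;
  }
  // Fraction of sign patterns at least as extreme as the observed total.
  return asExtreme / total;
}
```

With four examples where one run wins every time, only 2 of the 16 sign patterns are as extreme as the observed one, so the p-value is 0.125. Larger evaluation sets would need a sampled permutation test or a bootstrap instead of full enumeration.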

Lockstep Checklist

  • [x] API Routes: 7 auth bridge endpoints at /auth/eval/*
  • [x] TS SDK: DashboardEval resource
  • [x] Python SDK: DashboardEval + AsyncDashboardEval
  • [x] Dashboard: evaluations.ts with 3 tabs + 4 views
  • [x] Tests: Score color + delta formatting
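
The tested helpers for score color and delta formatting are not shown in this entry. A minimal sketch of what such helpers might look like follows. The names scoreColor and formatDelta, the color labels, and the 0.8 / 0.5 thresholds are all assumptions chosen for illustration, not the values used in evaluations.ts.

```typescript
// Hypothetical helpers; names, thresholds, and colors are assumptions.
type ScoreColor = "green" | "yellow" | "red";

// Map a score in [0, 1] to a traffic-light color using assumed cutoffs.
function scoreColor(score: number, good = 0.8, warn = 0.5): ScoreColor {
  if (score >= good) return "green";
  if (score >= warn) return "yellow";
  return "red";
}

// Format a score difference with an explicit sign and fixed precision.
function formatDelta(delta: number, digits = 3): string {
  const rounded = delta.toFixed(digits);
  // Values that round to zero are shown unsigned, so "-0.000" never appears.
  if (Number(rounded) === 0) return (0).toFixed(digits);
  return delta > 0 ? `+${rounded}` : rounded;
}

console.log(scoreColor(0.9), formatDelta(0.25, 2));
```

Showing an explicit plus sign for improvements and suppressing the sign on near-zero values keeps the Compare view readable when two runs score almost identically.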