Evaluation Datasets — DB-backed test case management

2026-03-22

evaluation, database

What We Built

Database-backed evaluation dataset management. Two new tables (eval_datasets and eval_dataset_items) with full CRUD, batch item creation (up to 100 items per request), import from production completion audit logs, and export as CSV or JSON. Migration V41 creates both tables with row-level security (RLS) plus explicit tenant isolation in the store.
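Since batch creation accepts at most 100 items per request, a client holding a larger set has to split it first. The following is a minimal Python sketch of that splitting, assuming the {"items": [...]} body shape used in this entry's curl example. The names batch_payloads and MAX_ITEMS_PER_REQUEST are illustrative and are not part of either SDK.

```python
from typing import Any, Iterator

MAX_ITEMS_PER_REQUEST = 100  # batch limit stated in this entry


def batch_payloads(
    items: list[dict[str, Any]], batch_size: int = MAX_ITEMS_PER_REQUEST
) -> Iterator[dict[str, list[dict[str, Any]]]]:
    """Yield request bodies of the form {"items": [...]}, each within the limit."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield {"items": items[start : start + batch_size]}


# 250 hypothetical test cases split into bodies of 100, 100, and 50 items.
cases = [{"input": {"topic": f"case-{i}"}, "expected_output": "..."} for i in range(250)]
payloads = list(batch_payloads(cases))
print([len(p["items"]) for p in payloads])  # [100, 100, 50]
```

Each yielded body can then be sent as one POST to the items endpoint. Whether the SDKs already do this splitting internally is not stated here.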

Why It Matters

Evaluation without structured test data is guesswork. This gives teams a persistent, version-controlled collection of test cases they can run against prompts and models. Importing from production logs turns real traffic into test data in a single API call; no manual curation is needed.

How It Works

# Create a dataset
curl -X POST .../v1/eval/datasets -d '{"name": "Customer Support Cases"}'

# Add test cases
curl -X POST .../v1/eval/datasets/{id}/items -d '{
  "items": [
    {"input": {"topic": "billing"}, "expected_output": "I can help with billing..."},
    {"input": {"topic": "refund"}, "expected_output": "To process a refund..."}
  ]
}'

# Import from production logs
curl -X POST .../v1/eval/datasets/{id}/import-from-logs -d '{"model": "anthropic/claude-sonnet-4", "limit": 50}'

# Export as CSV
curl ".../v1/eval/datasets/{id}/export?format=csv" > test-cases.csv
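The same POST calls can be assembled from Python with only the standard library. This is a sketch under assumptions, not the SDK's implementation. The base URL and dataset id are placeholders, the JSON Content-Type header is assumed, and authentication is left out because this entry does not show it. The helper json_post is a hypothetical name.

```python
import json
from typing import Any
from urllib.request import Request


def json_post(base_url: str, path: str, body: dict[str, Any]) -> Request:
    """Build (but do not send) a JSON POST request for an eval-datasets route."""
    return Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed, not confirmed by this entry
        method="POST",
    )


# Placeholder values: substitute your deployment's base URL and a real dataset id.
BASE_URL = "https://api.example.invalid"
DATASET_ID = "DATASET_ID"

create_req = json_post(BASE_URL, "/v1/eval/datasets", {"name": "Customer Support Cases"})
import_req = json_post(
    BASE_URL,
    f"/v1/eval/datasets/{DATASET_ID}/import-from-logs",
    {"model": "anthropic/claude-sonnet-4", "limit": 50},
)
# Pass either request to urllib.request.urlopen(...) to actually send it.
print(create_req.get_method(), create_req.full_url)
```

Building the request separately from sending it keeps the sketch runnable offline and makes the URL and body easy to inspect before anything goes over the network.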

Lockstep Checklist

[x] API Routes: 10 endpoints at /v1/eval/datasets
[x] TS SDK: evalDatasets resource with full CRUD + import + export
[x] Python SDK: eval_datasets resource (sync + async)
[x] MCP: br_list_eval_datasets, br_create_eval_dataset
[x] DB: Migration V41, Drizzle schema, store with tenant isolation
[x] Tests: Type tests, batch limits, export format