Evaluation Datasets — DB-backed test case management
2026-03-22
What We Built
Database-backed evaluation dataset management. Two new tables (eval_datasets + eval_dataset_items) with full CRUD, batch item creation (up to 100 items per request), import from production completion audit logs, and CSV/JSON export. DB migration V41 ships with row-level security (RLS) + explicit tenant isolation.
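To make the CSV export more concrete, here is a minimal sketch of how dataset items could be flattened into rows and parsed back. The column names (`input`, `expected_output`), the choice to JSON-encode the structured `input`, and the helper names `items_to_csv` / `csv_to_items` are all assumptions for illustration; the actual export layout may differ.

```python
import csv
import io
import json
from typing import Any

# Assumed column layout; the real export format may differ.
CSV_COLUMNS = ["input", "expected_output"]


def items_to_csv(items: list[dict[str, Any]]) -> str:
    """Flatten dataset items into CSV text, JSON-encoding the structured input."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_COLUMNS)
    writer.writeheader()
    for item in items:
        writer.writerow({
            "input": json.dumps(item["input"], sort_keys=True),
            "expected_output": item.get("expected_output", ""),
        })
    return buffer.getvalue()


def csv_to_items(text: str) -> list[dict[str, Any]]:
    """Inverse of items_to_csv: parse CSV text back into dataset items."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"input": json.loads(row["input"]), "expected_output": row["expected_output"]}
        for row in reader
    ]


items = [
    {"input": {"topic": "billing"}, "expected_output": "I can help with billing..."},
    {"input": {"topic": "refund"}, "expected_output": "To process a refund..."},
]
print(csv_to_items(items_to_csv(items)) == items)  # prints True
```

JSON-encoding the `input` column keeps nested fields intact in a flat file, at the cost of a column that is less pleasant to edit by hand in a spreadsheet.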
Why It Matters
Evaluation without structured test data is guesswork. This gives teams a persistent, version-controlled collection of test cases they can run against prompts and models. Import from production logs means real traffic becomes test data automatically — no manual curation needed.
How It Works
# Create a dataset
curl -X POST .../v1/eval/datasets -d '{"name": "Customer Support Cases"}'
# Add test cases
curl -X POST .../v1/eval/datasets/{id}/items -d '{
  "items": [
    {"input": {"topic": "billing"}, "expected_output": "I can help with billing..."},
    {"input": {"topic": "refund"}, "expected_output": "To process a refund..."}
  ]
}'
# Import from production logs
curl -X POST .../v1/eval/datasets/{id}/import-from-logs -d '{"model": "anthropic/claude-sonnet-4", "limit": 50}'
# Export as CSV
curl ".../v1/eval/datasets/{id}/export?format=csv" > test-cases.csv
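Because item creation is capped at 100 items per request, a client holding a larger set of test cases has to split it before posting. The sketch below shows one way to do that, assuming the `{"items": [...]}` request body from the item-creation call. The helper name `batch_payloads` and the sample data are hypothetical, and the actual HTTP call is left out.

```python
import json
from typing import Any, Iterator

MAX_ITEMS_PER_REQUEST = 100  # batch limit stated in this release note


def batch_payloads(
    items: list[dict[str, Any]], batch_size: int = MAX_ITEMS_PER_REQUEST
) -> Iterator[str]:
    """Yield JSON bodies of the form {"items": [...]}, each within the batch limit."""
    if not 1 <= batch_size <= MAX_ITEMS_PER_REQUEST:
        raise ValueError(f"batch_size must be between 1 and {MAX_ITEMS_PER_REQUEST}")
    for start in range(0, len(items), batch_size):
        yield json.dumps({"items": items[start:start + batch_size]})


# Hypothetical data: 250 test cases shaped like the items in the curl example.
cases = [
    {"input": {"topic": f"topic-{i}"}, "expected_output": f"answer {i}"}
    for i in range(250)
]
bodies = list(batch_payloads(cases))
print([len(json.loads(body)["items"]) for body in bodies])  # prints [100, 100, 50]
```

Each yielded string can be sent as the request body of one POST to the items endpoint. Whether the server accepts a partial final batch of any size is assumed here, not confirmed.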
Lockstep Checklist
- [x] API Routes: 10 endpoints at `/v1/eval/datasets`
- [x] TS SDK: `evalDatasets` resource with full CRUD + import + export
- [x] Python SDK: `eval_datasets` resource (sync + async)
- [x] MCP: `br_list_eval_datasets`, `br_create_eval_dataset`
- [x] DB: Migration V41, Drizzle schema, store with tenant isolation
- [x] Tests: Type tests, batch limits, export format