Prompt Evaluation Engine — dataset-based scoring with 4 strategies
2026-03-22
What We Built
A full evaluation engine for prompt templates. Run any prompt against a dataset of test cases with automatic scoring using one of four strategies: exact match, substring containment, structural validity (reusing BrainstormRouter's production validity scorer), or LLM judge (a second model scores the output).
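The two deterministic strategies are simple enough to sketch. The following is a minimal illustration, not the engine's actual code: the function names, the `SCORERS` registry, the 0.0 to 1.0 score range, and the pass threshold are all assumptions, and it assumes comparisons are case-sensitive with no whitespace normalization. The validity and LLM-judge scorers depend on internals this note does not describe, so they are left out.

```python
from typing import Callable, Dict


def exact_match(expected: str, actual: str) -> float:
    """Score 1.0 only when the output equals the expected text exactly."""
    return 1.0 if actual == expected else 0.0


def contains(expected: str, actual: str) -> float:
    """Score 1.0 when the expected text appears anywhere in the output."""
    return 1.0 if expected in actual else 0.0


# Hypothetical registry keyed by the scorer names used in the API payload.
SCORERS: Dict[str, Callable[[str, str], float]] = {
    "exact_match": exact_match,
    "contains": contains,
}


def score_case(scorer: str, expected: str, actual: str, threshold: float = 1.0) -> dict:
    """Return the score and pass/fail status for one test case.

    The threshold is an assumption; the note does not say how pass/fail is derived.
    """
    score = SCORERS[scorer](expected, actual)
    return {"score": score, "passed": score >= threshold}
```

For example, `score_case("contains", "Alice", "Hello Alice!")` yields a passing case, while the same pair under `"exact_match"` fails.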
Results are stored alongside prompt data in the config-backed store (up to 20 runs per prompt, FIFO eviction). Every run captures per-case details: input variables, rendered prompt, expected output, actual model output, score, and pass/fail status.
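The retention rule (20 runs per prompt, oldest evicted first) can be modeled with a bounded queue. This sketch uses an in-memory `RunStore` class as a hypothetical stand-in for the config-backed store; the class, its method names, and the run dictionary shape are assumptions for illustration only.

```python
from collections import deque
from typing import Deque, Dict, List

MAX_RUNS_PER_PROMPT = 20  # limit stated in this note


class RunStore:
    """Hypothetical in-memory stand-in for the config-backed store."""

    def __init__(self, max_runs: int = MAX_RUNS_PER_PROMPT) -> None:
        self._max_runs = max_runs
        self._runs: Dict[str, Deque[dict]] = {}

    def add_run(self, prompt_id: str, run: dict) -> None:
        # deque(maxlen=n) drops the oldest entry once full, which gives FIFO eviction.
        runs = self._runs.setdefault(prompt_id, deque(maxlen=self._max_runs))
        runs.append(run)

    def list_runs(self, prompt_id: str) -> List[dict]:
        """Return stored runs for a prompt, oldest first."""
        return list(self._runs.get(prompt_id, ()))
```

Adding a 21st run for the same prompt silently discards the first one, so callers that need older runs should export them before they age out.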
Export any run as CSV or JSON for offline analysis, version comparison, or compliance documentation.
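To give a feel for what an exported run might contain, here is a rough sketch that serializes per-case results to CSV and JSON. The column names mirror the per-case details listed above, but the real export schema is not documented in this note, so `FIELDS`, `export_csv`, and `export_json` are assumptions.

```python
import csv
import io
import json
from typing import List

# Assumed column names, based on the per-case details described in this note.
FIELDS = ["input", "rendered_prompt", "expected_output", "actual_output", "score", "passed"]


def export_csv(cases: List[dict]) -> str:
    """Serialize per-case results to CSV text, one row per test case."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for case in cases:
        row = dict(case)
        # Input variables are a mapping, so they are packed into a single JSON cell.
        row["input"] = json.dumps(case["input"], sort_keys=True)
        writer.writerow(row)
    return buf.getvalue()


def export_json(cases: List[dict]) -> str:
    """Serialize per-case results to a JSON document."""
    return json.dumps({"cases": cases}, indent=2)
```

CSV suits spreadsheet review, while JSON preserves the nested input variables without flattening.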
Why It Matters
Without measurement, prompt engineering is trial and error. Teams change a system prompt, deploy it, and hope for the best. With evaluation runs, every prompt version has data: pass rate, average score, per-case breakdowns. A/B tests become data-driven: run the same test cases against versions A and B, compare scores, and promote the winner.
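The comparison step can be sketched as follows. This is one possible decision rule, not the platform's promotion logic: `summarize` and `pick_winner` are hypothetical helpers, and ranking by pass rate first and average score second is an assumption.

```python
from statistics import mean
from typing import List


def summarize(cases: List[dict]) -> dict:
    """Aggregate per-case results into pass rate and average score."""
    if not cases:
        return {"pass_rate": 0.0, "avg_score": 0.0}
    return {
        "pass_rate": sum(1 for c in cases if c["passed"]) / len(cases),
        "avg_score": mean(c["score"] for c in cases),
    }


def pick_winner(run_a: List[dict], run_b: List[dict]) -> str:
    """Return "A", "B", or "tie", comparing pass rate first, then average score."""
    a, b = summarize(run_a), summarize(run_b)
    key_a = (a["pass_rate"], a["avg_score"])
    key_b = (b["pass_rate"], b["avg_score"])
    if key_a == key_b:
        return "tie"
    return "A" if key_a > key_b else "B"
```

The comparison is only meaningful when both runs use the same test cases and the same scorer.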
How It Works
# Run evaluation (template-only, no model calls)
curl -X POST https://api.brainstormrouter.com/v1/prompts/customer-support/evaluate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "test_cases": [
      {"input": {"name": "Alice"}, "expected_output": "Hello Alice"},
      {"input": {"name": "Bob"}, "expected_output": "Hello Bob"}
    ],
    "scorer": "exact_match"
  }'
# Run with model-backed scoring
curl -X POST .../v1/prompts/customer-support/evaluate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "test_cases": [{"input": {"topic": "billing"}}],
    "model": "anthropic/claude-haiku-4-5",
    "scorer": "validity"
  }'
# List past runs
curl https://api.brainstormrouter.com/v1/prompts/customer-support/evaluate/results \
  -H "Authorization: Bearer $API_KEY"
# Export as CSV
curl ".../v1/prompts/customer-support/evaluate/results/eval_abc123/export?format=csv" \
  -H "Authorization: Bearer $API_KEY"
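For callers who prefer Python over curl, the first request can be assembled with the standard library alone. This sketch builds the request without sending it. The `build_evaluate_request` helper is hypothetical, the `Content-Type` header is an assumption about what the API expects, and it deliberately avoids the SDK because this note lists only the SDK method names, not their signatures.

```python
import json
import urllib.request
from typing import List, Optional

BASE_URL = "https://api.brainstormrouter.com"  # host taken from the curl examples


def build_evaluate_request(
    prompt_id: str,
    test_cases: List[dict],
    scorer: str,
    api_key: str,
    model: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request to the evaluate endpoint."""
    payload = {"test_cases": test_cases, "scorer": scorer}
    if model is not None:
        # Including a model switches from template-only to model-backed evaluation.
        payload["model"] = model
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/prompts/{prompt_id}/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumed; not stated in this note
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call; the response shape is not described here, so parsing it is left to the reader.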
The Numbers
- 4 scoring strategies (exact_match, contains, validity, llm_judge)
- Up to 50 test cases per evaluation run
- 20 stored runs per prompt (FIFO eviction)
- CSV + JSON export formats
- Full lockstep: API + TS SDK + PY SDK + MCP tools
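Datasets larger than the 50-case limit need to be split across several runs. A small client-side sketch, assuming the caller is responsible for batching (the note does not say how the API responds to oversized requests); `batch_cases` is a hypothetical helper:

```python
from typing import Iterator, List

MAX_CASES_PER_RUN = 50  # limit stated in this note


def batch_cases(cases: List[dict], size: int = MAX_CASES_PER_RUN) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` test cases."""
    for start in range(0, len(cases), size):
        yield cases[start:start + size]
```

Each batch becomes its own stored run, so a large dataset also counts several times against the 20-run retention limit.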
Competitive Edge
No competing AI gateway offers built-in prompt evaluation. Portkey and OpenRouter route traffic but have zero prompt management. Lasso focuses on security, not prompt quality. BrainstormRouter is the only platform where you can version a prompt, evaluate it against a test suite, compare versions, and promote the winner — all through the same API your application already uses.
Lockstep Checklist
- [x] API Routes: src/api/routes/prompts.ts (3 new endpoints)
- [x] TS SDK: packages/sdk-ts/src/resources/prompts.ts (evaluate, evaluateResults, exportEvaluation)
- [x] Python SDK: packages/sdk-py/src/brainstormrouter/resources/prompts.py (evaluate, evaluate_results, export_evaluation)
- [x] MCP Schemas: src/mcp/tool-manifest.ts + src/mcp/server.ts (br_evaluate_prompt, br_get_eval_results)
- [x] Tests: src/api/routes/prompts.test.ts (scoring logic + type tests)