Prompt Evaluation Engine — dataset-based scoring with 4 strategies

2026-03-22

prompts, evaluation

What We Built

A full evaluation engine for prompt templates. Run any prompt against a dataset of test cases and score each output automatically with one of four strategies: exact match, substring containment, structural validity (reusing BrainstormRouter's production validity scorer), or LLM judge (a second model scores the output).
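The two deterministic strategies are simple enough to sketch. The following is a minimal illustration, not the engine's actual code: the `Scorer` signature, the whitespace trimming in `exact_match`, and the binary 0.0/1.0 scores are all assumptions. The validity and LLM judge strategies are left out because they depend on internals and model calls that this note does not describe.

```python
from typing import Callable, Dict

# Hypothetical scorer signature: (expected, actual) -> score in [0.0, 1.0].
Scorer = Callable[[str, str], float]


def exact_match(expected: str, actual: str) -> float:
    """Return 1.0 only when the output equals the expected text.

    Trimming surrounding whitespace is an assumption of this sketch.
    """
    return 1.0 if actual.strip() == expected.strip() else 0.0


def contains(expected: str, actual: str) -> float:
    """Return 1.0 when the expected text appears anywhere in the output."""
    return 1.0 if expected in actual else 0.0


# Only the two deterministic strategies are registered here.
SCORERS: Dict[str, Scorer] = {"exact_match": exact_match, "contains": contains}


def score(scorer_name: str, expected: str, actual: str) -> float:
    """Look up a scorer by the name used in the request body and apply it."""
    return SCORERS[scorer_name](expected, actual)
```

Keeping every scorer behind the same two-argument signature means a request can select one by name, as the `scorer` field in the API examples does.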

Results are stored alongside prompt data in the config-backed store (up to 20 runs per prompt, FIFO eviction). Every run captures per-case details: input variables, rendered prompt, expected output, actual model output, score, and pass/fail status.
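As a rough picture of the storage behavior, the sketch below keeps at most 20 runs per prompt and drops the oldest when a new one arrives. It is an in-memory stand-in, not the config-backed store itself; the class and field names (`CaseResult`, `EvalRun`, `RunStore`) are hypothetical and only mirror the per-case details listed above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional

MAX_RUNS_PER_PROMPT = 20  # limit stated in the text


@dataclass
class CaseResult:
    """Per-case details captured by a run (field names are illustrative)."""
    input: Dict[str, str]
    rendered_prompt: str
    expected_output: Optional[str]
    actual_output: str
    score: float
    passed: bool


@dataclass
class EvalRun:
    run_id: str
    cases: List[CaseResult] = field(default_factory=list)


class RunStore:
    """In-memory stand-in for the real store, keyed by prompt ID."""

    def __init__(self) -> None:
        self._runs: Dict[str, Deque[EvalRun]] = {}

    def add(self, prompt_id: str, run: EvalRun) -> None:
        # A deque with maxlen discards its oldest entry when a new one
        # is appended at capacity, which gives FIFO eviction for free.
        runs = self._runs.setdefault(prompt_id, deque(maxlen=MAX_RUNS_PER_PROMPT))
        runs.append(run)

    def runs_for(self, prompt_id: str) -> List[EvalRun]:
        """Return stored runs for a prompt, oldest first."""
        return list(self._runs.get(prompt_id, ()))
```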

Export any run as CSV or JSON for offline analysis, version comparison, or compliance documentation.
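To show what the two export formats could look like, here is a small sketch that serializes per-case results with Python's standard `csv` and `json` modules. The column names are assumptions; the real export schema is not given here. Nested input variables are flattened into a single JSON-encoded CSV cell, which is one possible choice among several.

```python
import csv
import io
import json
from typing import Dict, List

# Illustrative column names; the actual export schema may differ.
FIELDS = ["input", "rendered_prompt", "expected_output",
          "actual_output", "score", "passed"]


def to_json(cases: List[Dict]) -> str:
    """Serialize per-case results as a JSON array."""
    return json.dumps(cases, indent=2)


def to_csv(cases: List[Dict]) -> str:
    """Serialize per-case results as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for case in cases:
        row = dict(case)
        # Input variables are a nested object, so encode them into one cell.
        row["input"] = json.dumps(case["input"], sort_keys=True)
        writer.writerow(row)
    return buf.getvalue()
```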

Why It Matters

Without measurement, prompt engineering is trial and error: teams change a system prompt, deploy it, and hope for the best. With evaluation runs, every prompt version has data: pass rate, average score, and per-case breakdowns. A/B tests become data-driven: run the same test cases against versions A and B, compare the scores, and promote the winner.
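The comparison described here comes down to simple arithmetic over per-case scores. The sketch below is one way it might look; the 0.5 pass threshold and the tie-breaking rule (pass rate first, then average score, ties keep version A) are assumptions made for illustration, not documented behavior.

```python
from dataclasses import dataclass
from typing import List

PASS_THRESHOLD = 0.5  # hypothetical cutoff for counting a case as passed


@dataclass
class RunSummary:
    pass_rate: float
    average_score: float


def summarize(scores: List[float], threshold: float = PASS_THRESHOLD) -> RunSummary:
    """Reduce per-case scores to a pass rate and an average score."""
    if not scores:
        raise ValueError("a run needs at least one scored case")
    passed = sum(1 for s in scores if s >= threshold)
    return RunSummary(
        pass_rate=passed / len(scores),
        average_score=sum(scores) / len(scores),
    )


def pick_winner(a: RunSummary, b: RunSummary) -> str:
    """Compare on pass rate first, then average score; ties keep version A."""
    if (b.pass_rate, b.average_score) > (a.pass_rate, a.average_score):
        return "B"
    return "A"
```

For example, scores of `[1.0, 0.0, 1.0, 0.0]` for version A and `[1.0, 1.0, 1.0, 0.0]` for version B give pass rates of 0.5 and 0.75, so this rule would promote B.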

How It Works

# Run evaluation (template-only, no model calls)
curl -X POST https://api.brainstormrouter.com/v1/prompts/customer-support/evaluate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "test_cases": [
      {"input": {"name": "Alice"}, "expected_output": "Hello Alice"},
      {"input": {"name": "Bob"}, "expected_output": "Hello Bob"}
    ],
    "scorer": "exact_match"
  }'

# Run with model-backed scoring
curl -X POST .../v1/prompts/customer-support/evaluate \
  -d '{
    "test_cases": [{"input": {"topic": "billing"}}],
    "model": "anthropic/claude-haiku-4-5",
    "scorer": "validity"
  }'

# List past runs
curl https://api.brainstormrouter.com/v1/prompts/customer-support/evaluate/results

# Export as CSV
curl ".../v1/prompts/customer-support/evaluate/results/eval_abc123/export?format=csv"
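
For callers working in Python without the SDK, the first curl call above could be built with the standard library as sketched below. The host and path come from the examples above; the `build_evaluate_request` helper is hypothetical, and the JSON `Content-Type` header is an assumption about what the API expects. The request is only constructed here, not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.brainstormrouter.com"  # host taken from the curl examples


def build_evaluate_request(prompt_id: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the evaluate endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/prompts/{prompt_id}/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumption: the API expects a JSON request body.
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "test_cases": [
        {"input": {"name": "Alice"}, "expected_output": "Hello Alice"},
        {"input": {"name": "Bob"}, "expected_output": "Hello Bob"},
    ],
    "scorer": "exact_match",
}
request = build_evaluate_request("customer-support", payload, api_key="YOUR_API_KEY")
# To send it: urllib.request.urlopen(request)
```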

The Numbers

4 scoring strategies (exact_match, contains, validity, llm_judge)
Up to 50 test cases per evaluation run
20 stored runs per prompt (FIFO eviction)
CSV + JSON export formats
Full lockstep: API + TS SDK + PY SDK + MCP tools
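
The limits above suggest a pre-flight check a client could run before submitting. This is a sketch under stated assumptions: the `validate_request` helper and its error messages are hypothetical, and rejecting an empty `test_cases` list is a guess rather than documented behavior. Only the scorer names and the 50-case cap come from the list above.

```python
from typing import List

ALLOWED_SCORERS = {"exact_match", "contains", "validity", "llm_judge"}
MAX_TEST_CASES = 50  # per-run cap stated above


def validate_request(payload: dict) -> List[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    errors: List[str] = []
    cases = payload.get("test_cases", [])
    if not cases:
        # Assumption: a run with no test cases is not meaningful.
        errors.append("test_cases must not be empty")
    if len(cases) > MAX_TEST_CASES:
        errors.append(f"at most {MAX_TEST_CASES} test cases per run")
    if payload.get("scorer") not in ALLOWED_SCORERS:
        errors.append(f"scorer must be one of {sorted(ALLOWED_SCORERS)}")
    return errors
```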

Competitive Edge

No competing AI gateway offers built-in prompt evaluation. Portkey and OpenRouter route traffic but have zero prompt management. Lasso focuses on security, not prompt quality. BrainstormRouter is the only platform where you can version a prompt, evaluate it against a test suite, compare versions, and promote the winner, all through the same API your application already uses.

Lockstep Checklist

[x] API Routes: src/api/routes/prompts.ts — 3 new endpoints
[x] TS SDK: packages/sdk-ts/src/resources/prompts.ts — evaluate, evaluateResults, exportEvaluation
[x] Python SDK: packages/sdk-py/src/brainstormrouter/resources/prompts.py — evaluate, evaluate_results, export_evaluation
[x] MCP Schemas: src/mcp/tool-manifest.ts + src/mcp/server.ts — br_evaluate_prompt, br_get_eval_results
[x] Tests: src/api/routes/prompts.test.ts — scoring logic + type tests