Tier 1 Semantic Response Quality Scorer — 5 dimensions in <5ms

2026-03-27

quality-scorer, router

LOCKSTEP TRACEABILITY MATRIX
---
api_endpoints: ["none"]
sdk_methods_updated: ["none"]
mcp_tools_updated: ["none"]
---

What We Built

A standalone heuristic quality scorer that evaluates every LLM response across five semantic dimensions: instruction adherence, tool appropriateness, coherence, completeness, and conciseness. Each dimension produces a 0-1 score, combined into a weighted composite.

This is Tier 1 scoring only: pure heuristics, no LLM calls. It runs in under 1ms, well within the <5ms bound stated in the title, and is designed to be called on every response without measurable overhead.

Why It Matters

Customers routing through BrainstormRouter need to know whether responses are actually good, not just whether they arrived. A request for JSON that returns prose is a failure even if the HTTP status was 200. A tool-equipped model that ignores its tools is underperforming. The scorer captures the quality signals that raw status codes miss.

Combined with Thompson sampling and the performance tracker, quality scores feed the router's learning loop — models that consistently produce low-quality responses get routed around automatically.
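
This log does not say how quality scores enter the sampler. One common pattern, sketched here purely as an assumption, treats the composite score as a fractional reward in Beta-Bernoulli Thompson sampling: each model keeps a Beta posterior, every scored response nudges it, and routing picks the model with the highest random draw. The class and function names are hypothetical and are not BrainstormRouter's API.

```typescript
// Hypothetical sketch: composite quality score as a fractional reward for
// Beta-Bernoulli Thompson sampling. All names here are assumptions.

interface Posterior {
  alpha: number; // 1 + accumulated quality
  beta: number; // 1 + accumulated shortfall (1 - quality)
}

/** Standard normal draw via the Box-Muller transform. */
function randNormal(): number {
  const u = 1 - Math.random(); // in (0, 1], so log(u) is finite
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * Math.random());
}

/** Gamma(shape, 1) draw via Marsaglia-Tsang; requires shape >= 1. */
function randGamma(shape: number): number {
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    const x = randNormal();
    const v = Math.pow(1 + c * x, 3);
    if (v <= 0) continue;
    const u = 1 - Math.random();
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v;
  }
}

/** Beta(a, b) draw as a ratio of two Gamma draws. */
function randBeta(a: number, b: number): number {
  const x = randGamma(a);
  return x / (x + randGamma(b));
}

class QualityAwareSampler {
  private posteriors: Record<string, Posterior> = {};

  constructor(models: string[]) {
    // Beta(1, 1) is a uniform prior: no opinion before any feedback.
    for (const m of models) this.posteriors[m] = { alpha: 1, beta: 1 };
  }

  /** Fold one composite quality score in [0, 1] into the model's posterior. */
  record(model: string, quality: number): void {
    const p = this.posteriors[model];
    if (!p) throw new Error("unknown model: " + model);
    const q = Math.min(1, Math.max(0, quality));
    p.alpha += q;
    p.beta += 1 - q;
  }

  /** Draw once from each posterior and route to the highest draw. */
  pick(): string {
    let best = "";
    let bestDraw = -1;
    for (const m of Object.keys(this.posteriors)) {
      const draw = randBeta(this.posteriors[m].alpha, this.posteriors[m].beta);
      if (draw > bestDraw) {
        best = m;
        bestDraw = draw;
      }
    }
    return best;
  }
}
```

Because each pick is a random draw rather than a fixed ranking, a model with few observations still gets occasional traffic, while one with a long record of low scores is selected less and less often.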

How It Works

The scorer accepts request context (messages, tools, format hints) and response metadata (content, stop reason, token counts) and evaluates five dimensions:

  • instruction_adherence (30%): Checks if the response matches requested format (JSON validity, code blocks) and penalises truncation.
  • tool_appropriateness (20%): Scores 1.0 when no tools are available (N/A = perfect). Penalises models that ignore tools when the user's question clearly needs them.
  • coherence (20%): Detects degenerate responses — empty content, very short answers to complex prompts, repeated sentences.
  • completeness (20%): Compares response scope to question count. A one-line answer to multiple questions scores poorly.
  • conciseness (10%): Output/input token ratio with a sweet spot of 0.5-3.0x.
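
To make the bullets above more concrete, here is a minimal sketch of three of the heuristics in isolation. The function names, thresholds, and penalty values are hypothetical. They are not taken from quality-scorer.js and are not expected to reproduce the scores quoted in this log.

```typescript
// Hypothetical sketches of three dimension heuristics; names, thresholds,
// and penalty values are illustrative assumptions, not the module's own.

/** Clamp a raw score into [0, 1]. */
function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

/** instruction_adherence: if JSON was requested, the response must parse. */
function jsonAdherence(responseContent: string, truncated: boolean): number {
  let score = 1.0;
  try {
    JSON.parse(responseContent.trim());
  } catch {
    score = 0.2; // assumed penalty for prose where JSON was requested
  }
  if (truncated) score -= 0.3; // assumed truncation penalty
  return clamp01(score);
}

/** coherence: empty content scores 0; repeated sentences lower the score. */
function repetitionCoherence(responseContent: string): number {
  const sentences = responseContent
    .split(/[.!?]+/)
    .map((s) => s.trim().toLowerCase())
    .filter((s) => s.length > 0);
  if (sentences.length === 0) return 0;
  // Count distinct sentences: keep only the first occurrence of each.
  const unique = sentences.filter((s, i) => sentences.indexOf(s) === i).length;
  return clamp01(unique / sentences.length);
}

/** conciseness: full marks inside the 0.5-3.0x output/input sweet spot. */
function concisenessScore(inputTokens: number, outputTokens: number): number {
  if (inputTokens <= 0) return 1.0; // no basis for a ratio; assume neutral
  const ratio = outputTokens / inputTokens;
  if (ratio >= 0.5 && ratio <= 3.0) return 1.0;
  // Outside the sweet spot, decay in proportion to the distance from it.
  return ratio < 0.5 ? clamp01(ratio / 0.5) : clamp01(3.0 / ratio);
}
```

Each function is pure and does no I/O, which fits the log's framing of Tier 1 as cheap enough to run on every response.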

Example usage:

import { scoreQuality } from "./quality-scorer.js";

const result = scoreQuality({
  messages: [{ role: "user", content: "Return the data as JSON" }],
  responseContent: '{"data": [1, 2, 3]}',
  stopReason: "stop",
  outputTokens: 15,
  inputTokens: 10,
});
// result.composite_score → 0.87
// result.dimensions.instruction_adherence → 0.95
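
The weighted composite can be sketched as follows. The dimension names and default weights come from the list above; the type names, the function name, and the choice to normalise by the total weight (so that custom weights need not sum to 1) are assumptions made for illustration.

```typescript
// Dimension names and default weights come from the list above; the type
// and function names here are hypothetical.
type Dimension =
  | "instruction_adherence"
  | "tool_appropriateness"
  | "coherence"
  | "completeness"
  | "conciseness";

type DimensionScores = Record<Dimension, number>;

const DEFAULT_WEIGHTS: DimensionScores = {
  instruction_adherence: 0.3,
  tool_appropriateness: 0.2,
  coherence: 0.2,
  completeness: 0.2,
  conciseness: 0.1,
};

function compositeScore(
  scores: DimensionScores,
  weights: DimensionScores = DEFAULT_WEIGHTS,
): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const dim of Object.keys(weights) as Dimension[]) {
    const s = Math.min(1, Math.max(0, scores[dim])); // clamp to [0, 1]
    weighted += s * weights[dim];
    totalWeight += weights[dim];
  }
  // Normalise so custom weights need not sum to 1 (an assumption).
  return totalWeight > 0 ? weighted / totalWeight : 0;
}
```

With the default weights, a response that scores 0.5 on instruction adherence and 1.0 on everything else comes out at about 0.85, which shows how heavily a format failure weighs on the composite.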

The Numbers

  • Scoring latency: <1ms (measured via performance.now())
  • Dimensions: 5 independent scores, clamped to [0, 1]
  • Test coverage: 16 test cases covering all dimensions, edge cases, custom weights, and performance
  • Zero external dependencies: Pure computation, no I/O
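
A latency figure like the one above could be checked with a small harness along these lines. This is a sketch only: `meanLatencyMs` and `scoreOnce` are hypothetical names, and the log does not say whether the reported number is a mean, a maximum, or a single call.

```typescript
// Hypothetical timing harness; `scoreOnce` stands in for one scorer call.
function meanLatencyMs(scoreOnce: () => void, iterations = 1000): number {
  // Warm up so JIT compilation does not dominate the measurement.
  for (let i = 0; i < 100; i++) scoreOnce();

  const start = performance.now();
  for (let i = 0; i < iterations; i++) scoreOnce();
  // Average wall-clock time per call, in milliseconds.
  return (performance.now() - start) / iterations;
}
```

Averaging over many iterations smooths out timer resolution limits, which matter when a single call takes well under a millisecond.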

Competitive Edge

To our knowledge, no competing gateway (Portkey, OpenRouter, Lasso) scores response quality at the semantic level. They report latency and token counts; we score whether the response actually answered the question in the requested format. This is the foundation for quality-aware routing that learns which models are best for which tasks.

Lockstep Checklist

> _You MUST check these boxes [x] and verify the corresponding files are updated BEFORE committing this log._

  • [x] API Routes: src/api/routes/ updated (if API changed). — No API routes added (standalone module).
  • [x] TS SDK: packages/sdk-ts updated, tested, and types exported. — N/A (internal module).
  • [x] Python SDK: packages/sdk-py updated and tested. — N/A (internal module).
  • [x] MCP Schemas: Tool schemas updated in codebase (if applicable). — N/A.
  • [x] Master Record: docs/architecture/master-capability-record.md reflects this capability. — Deferred to integration phase.