Explain narrator: BR uses its own router to write postmortems of its own routing decisions
2026-04-30
What We Built
GET /v1/explain/:request_id?narrate=true adds a narration field to the structured routing-decision response. The narrative is a 4-6 sentence postmortem written by BR's own router (default judge model: deepseek/deepseek-chat, the cheapest healthy model in the live catalog). The narrator pins the judge model — no Thompson sampling, no auto-selection — to prevent feedback loops where the bandit could route the narration call back to the same model being narrated. Each narration is cached in-memory keyed by request_id, so the first call pays an inference cost (~$0.0001) and every subsequent call returns the cached result.
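For callers that do not go through an SDK, the request is an ordinary GET with one extra query parameter. The following is a minimal sketch of building that URL. Only the path and the narrate parameter come from this note; the base URL is a placeholder, and authentication is omitted because it is not described here.

```typescript
// Builds the explain URL for a request id.
// `baseUrl` is a placeholder; the path and `narrate=true` are taken from the note.
function buildExplainUrl(baseUrl: string, requestId: string, narrate: boolean): string {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  const path = `${root}/v1/explain/${encodeURIComponent(requestId)}`;
  return narrate ? `${path}?narrate=true` : path;
}

// prints "https://gateway.example/v1/explain/req_123?narrate=true"
console.log(buildExplainUrl("https://gateway.example", "req_123", true));
```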
The pure prompt builder buildNarratorPrompt(explain) is exported separately so the prompt can be snapshotted in tests, consumed by external agents that want to use a different judge, and so the prompt template is reviewable as a flat string instead of buried in conditional code.
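To illustrate why a pure builder is easy to snapshot, here is a hedged sketch of what such a function could look like. It is not the actual buildNarratorPrompt: the ExplainSketch shape, its field names, and the prompt wording are assumptions made for illustration, since the real ExplainRequestResponse is not reproduced in this note.

```typescript
// Hypothetical, trimmed-down trace shape for illustration only.
interface ExplainSketch {
  request_id: string;
  model: string;
  why: string;
  complexity: string | null;
  cost_usd: number | null;
  latency_ms: number | null;
  alternatives: { model: string; score: number }[];
}

interface NarratorPromptSketch {
  system: string;
  user: string;
}

// Render nullable fields explicitly so the judge sees "unknown" rather than "null".
function show(value: string | number | null): string {
  return value === null ? "unknown" : String(value);
}

// Pure function: no I/O, clock, or randomness, so equal inputs give equal output.
function buildNarratorPromptSketch(explain: ExplainSketch): NarratorPromptSketch {
  const alternatives =
    explain.alternatives.length === 0
      ? "none recorded"
      : explain.alternatives.map((a) => `${a.model} (score ${a.score})`).join(", ");

  // Assumed wording; the guard sentence mirrors the anti-hallucination intent.
  const system =
    "Write a 4-6 sentence postmortem of one routing decision. " +
    "Use only the fields in the trace. If a field is unknown, say so instead of guessing.";

  const user = [
    `request_id: ${explain.request_id}`,
    `selected_model: ${explain.model}`,
    `why: ${explain.why}`,
    `complexity: ${show(explain.complexity)}`,
    `cost_usd: ${show(explain.cost_usd)}`,
    `latency_ms: ${show(explain.latency_ms)}`,
    `alternatives: ${alternatives}`,
  ].join("\n");

  return { system, user };
}
```

Because the function returns plain strings, a test can compare the whole prompt against a stored snapshot, and an external agent can send the same strings to a judge of its choosing.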
Why It Matters
BR already attaches an x-br-context header to every completion that says things like {"model":"deepseek/deepseek-reasoner","why":"thompson-sampling","complexity":"simple"}. That's structured, machine-readable, and useless for actually understanding why BR made a decision unless you're already an expert in BR's internals.
This commit makes the gateway literally explain itself. An agent (or human) hits /v1/explain/:id?narrate=true and gets back: _"BR routed to deepseek-v4-flash because Thompson sampling scored it highest at 0.91, narrowly ahead of openai/gpt-4o-mini at 0.74. The complexity classifier flagged this as 'simple' (score 1), and the cache missed despite a 128-token prompt prefix overlap, so the router chose the cheapest high-quality option. Total cost was $0.000185 vs an estimated $0.0002 against the default. Latency was 1974ms with 412ms TTFT and 8ms routing overhead — within typical range for deepseek's reasoning model."_
That paragraph is generated by BR's own LLM, fed BR's own structured trace, against BR's own cache. It's the gateway operating as a self-explaining system instead of a black box that hands you a JSON dump.
How It Works
- The SDK call client.explainRequest.get(id, { narrate: true }) (or the equivalent Python kwarg) hits the existing route with a query param.
- The route fetches the structured ExplainRequestResponse from getLogDetail exactly as before.
- If narrate=true and a model router was injected at mount time, the route calls narrateExplanation(). That function:
  - Hits the cache by request_id. Returns the cached narrative if present.
  - Otherwise calls modelRouter.router.routeCompletion() with model: judgeModel and bypassCache: true. Pinning the model bypasses Thompson and the auto-selector; the bypassCache flag prevents the narration prompt from polluting the production prompt cache.
  - Stores the result in the cache (LRU, 1000-entry cap) and returns it.
- The response body is the original ExplainRequestResponse extended with a narration field that names the judge model, narrative text, judge cost, and a cached flag.
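The cache-then-judge flow described in the steps above can be sketched roughly as follows. This is an illustration under assumptions, not the shipped narrateExplanation: the RouterLike interface, the routeCompletion return shape, the narration field names other than judge_cost_usd, and the empty-content fallback text are all hypothetical, and a plain Map stands in for the LRU cache.

```typescript
// Hypothetical view of the injected router; only `model` and `bypassCache`
// are named in the note, the rest is assumed for the sketch.
interface RouterLike {
  routeCompletion(req: {
    model: string;
    bypassCache: boolean;
    messages: { role: "system" | "user"; content: string }[];
  }): Promise<{ content: string; cost_usd: number }>;
}

interface Narration {
  judge_model: string;
  narrative: string;
  judge_cost_usd: number;
  cached: boolean;
}

// Process-local cache keyed by request id (unbounded here for brevity).
const narrationCache = new Map<string, Narration>();

async function narrateSketch(
  requestId: string,
  prompt: { system: string; user: string },
  router: RouterLike,
  judgeModel: string = "deepseek/deepseek-chat",
): Promise<Narration> {
  // Step 1: a cache hit returns immediately and never calls the judge.
  const hit = narrationCache.get(requestId);
  if (hit !== undefined) {
    return { ...hit, cached: true };
  }

  // Step 2: pinned model plus bypassCache, so neither the bandit nor the
  // production prompt cache is involved in the narration call.
  const result = await router.routeCompletion({
    model: judgeModel,
    bypassCache: true,
    messages: [
      { role: "system", content: prompt.system },
      { role: "user", content: prompt.user },
    ],
  });

  // Step 3: store and return. The fallback text for empty output is assumed.
  const narration: Narration = {
    judge_model: judgeModel,
    narrative: result.content.trim() || "(judge returned no content)",
    judge_cost_usd: result.cost_usd,
    cached: false,
  };
  narrationCache.set(requestId, narration);
  return narration;
}
```

In this sketch a cached response keeps the original judge cost and only flips the cached flag; whether the real endpoint reports the cost that way is not stated in this note.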
If the gateway was mounted without a model router (e.g. a static-only deployment), ?narrate=true degrades gracefully instead of failing the request: the response includes narration: { error: "narration_unavailable", ... } so the SDK can distinguish "narration disabled" from "narration failed".
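On the client side, that distinction could be handled with a small discriminated check. The sketch below assumes a simplified shape for the narration field: only the "narration_unavailable" value comes from this note, while the success fields and the status labels are hypothetical.

```typescript
// Simplified, assumed shape of the `narration` field in the response.
type NarrationField =
  | { narrative: string; cached: boolean }
  | { error: string };

type NarrationStatus = "ok" | "disabled" | "failed" | "not_requested";

function narrationStatus(narration: NarrationField | undefined): NarrationStatus {
  // No field at all: the caller did not pass narrate=true.
  if (narration === undefined) return "not_requested";
  if ("error" in narration) {
    // The gateway has no model router mounted, so narration is switched off.
    if (narration.error === "narration_unavailable") return "disabled";
    // Any other error value is treated as a failed narration attempt.
    return "failed";
  }
  return "ok";
}
```

A caller can then skip retries when the status is "disabled" and retry or surface the error only when it is "failed".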
The Numbers
- 11 unit tests covering: prompt builder content, system-prompt anti-hallucination
guards, null-field handling, zero-alternatives edge case, an inline snapshot of the canonical R15 routing decision (stability guard against prompt drift), router invocation with the pinned model, judge model override, cache hit / cache miss, empty-content fallback, and the process-wide singleton cache.
- Cost per narration: ~$0.0001 (deepseek-chat at ~600 input + 200 output tokens).
- Cost per request after first narration: $0 (cached).
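- Default cache cap: 1000 narratives. Assuming each entry stores only the ~200-token narrative (not the ~600-token prompt), 1000 entries come to roughly 1 MB of text at ~4 characters per token, so the in-memory footprint is bounded.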
- Routes added: 0. The capability rides the existing /v1/explain/:request_id endpoint via a query param, so there is no new path to teach agents.
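The bounded footprint follows from the entry cap on the cache. As an illustration of how such a cap can be enforced, here is a minimal LRU sketch built on the insertion order of a JavaScript Map. It is an assumed implementation for explanation, not the cache that ships in the gateway, and the class name is hypothetical.

```typescript
// Minimal LRU cache: a Map iterates in insertion order, so the first key
// is always the least recently used one.
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert to mark the key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// With a cap of 1000, the cache never holds more than 1000 narratives.
const narratives = new LruCache<string>(1000);
narratives.set("req_123", "BR routed to ...");
```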
Competitive Edge
Helicone and Portkey publish dashboards. OpenRouter publishes pricing tables. Letta publishes memory architectures. None of them ship a gateway that uses its own LLM to explain its own decisions through its own router. This is the second deliverable in the self-managing-agent series (provider-doctor was the first — diagnoses provider failures; this one narrates routing decisions). The pattern: BR introspects its internal state and surfaces it through its own intelligence, not through hand-written templates or external observability tooling.
Recursion safety
The narrator could in principle ask the bandit to route its narration call, and the bandit could pick the same model that's being narrated. That's not a correctness bug (the narration would still produce text), but it's a feedback-loop hazard that would make Thompson posteriors autocorrelated.
Three safeguards:
- Pinned model. The narrator passes an explicit provider/model pair to routeCompletion, which short-circuits Thompson and auto-selection.
- bypassCache: true. The narration prompt never enters the prompt cache, so it can't influence cache-hit metrics for production traffic.
- Cache-by-request-id. Each routing decision is narrated at most once per process lifetime. If the bandit posteriors did somehow update from a narration call, they'd update once and stop, not loop.
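The first safeguard amounts to an early return in model selection. A hedged sketch of that short-circuit follows; the function name, the request shape, and the "pinned" label are hypothetical, and the real router's selection logic is not shown in this note.

```typescript
interface RouteRequestSketch {
  // When set, the caller has pinned an explicit provider/model pair.
  model?: string;
}

interface SelectionSketch {
  model: string;
  why: "pinned" | "thompson-sampling";
}

// `sampleFromBandit` stands in for Thompson sampling over candidate models.
function selectModelSketch(
  req: RouteRequestSketch,
  sampleFromBandit: () => string,
): SelectionSketch {
  if (req.model !== undefined) {
    // Pinned: the bandit is never consulted, so a narration call cannot be
    // routed to whichever model the bandit currently favors.
    return { model: req.model, why: "pinned" };
  }
  return { model: sampleFromBandit(), why: "thompson-sampling" };
}
```

Because the sampler is passed in as a function, a test can count its invocations and confirm that a pinned request never reaches it.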
Follow-ups (NOT in this commit)
- Add an MCP tool wrapper (br_explain_request_with_narration) so agents can call it through the MCP gateway instead of HTTP. The HTTP shape works today; the MCP wrapper is convenience.
- Extend the prompt to include savings-tracker data (counterfactual cost vs the default model BR routed away from) and benchmark divergence (was Thompson different from the static-price strategy?). Currently the prompt only sees the narrow ExplainRequestResponse fields; the full savings-tracker + shadow-comparator view would deepen the postmortem.
- Persist narrations to Postgres so they survive process restarts. Currently they're process-local; a deploy clears the cache.
- Wire a CloudWatch usage metric on judge_cost_usd so total spend on narration is observable and can be capped.
Lockstep Checklist
- [x] API Routes: src/api/routes/explain-request.ts accepts ?narrate=true. routes.json regenerated to 436 routes (no new path; query-param-only addition).
- [x] TS SDK: packages/sdk-ts/src/resources/explain-request.ts adds GetExplainOptions + ExplainNarration type + narrate parameter.
- [x] Python SDK: packages/sdk-py/src/brainstormrouter/resources/explain_request.py adds narrate=True keyword argument on both ExplainRequest.get and AsyncExplainRequest.get.
- [x] MCP Schemas: none added; agents call the existing endpoint with the new query param. Tool wrapper deferred per "follow-ups" above.
- [x] Master Record: Listed under api/explain-narrator.
Cites: R15 stochastic-assessment context. The R15 reviewers asked for "wire alerting + load test + pre-push gate" — the antediluvian reliability lane. This PR ships a different shape: BR explains its own decisions to itself. Same cost (one inference call per narrated request, capped by cache) and a capability no funded competitor ships.