Explain narrator: BR uses its own router to write postmortems of its own routing decisions

2026-04-30

router, api, sdk

What We Built

GET /v1/explain/:request_id?narrate=true adds a narration field to the structured routing-decision response. The narrative is a 4-6 sentence postmortem written by BR's own router (default judge model: deepseek/deepseek-chat, the cheapest healthy model in the live catalog). The narrator pins the judge model — no Thompson sampling, no auto-selection — to prevent feedback loops where the bandit could route the narration call back to the same model being narrated. Each narration is cached in-memory keyed by request_id, so the first call pays an inference cost (~$0.0001) and every subsequent call returns the cached result.
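For callers that do not go through the SDK, the request shape can be sketched as a small URL helper. The helper name and the placeholder host are illustrative assumptions; only the /v1/explain/:request_id path and the narrate=true query parameter come from the description above.

```typescript
// Hypothetical helper: builds the explain URL for a given request id.
// Only the path and the ?narrate=true parameter are taken from the post.
function explainUrl(baseUrl: string, requestId: string, narrate: boolean): string {
  const root = baseUrl.replace(/\/+$/, ""); // drop any trailing slashes
  const path = `${root}/v1/explain/${encodeURIComponent(requestId)}`;
  return narrate ? `${path}?narrate=true` : path;
}

// Example with a placeholder host (not a real BR endpoint).
const narratedUrl = explainUrl("https://gateway.example/", "req_123", true);
```

Leaving narrate off yields the plain structured response, so existing callers are unaffected.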

The pure prompt builder buildNarratorPrompt(explain) is exported separately so the prompt can be snapshotted in tests, consumed by external agents that want to use a different judge, and so the prompt template is reviewable as a flat string instead of buried in conditional code.
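To illustrate why a pure builder is easy to snapshot, here is a minimal sketch of the idea. The ExplainLike fields, the prompt wording, and the function name buildNarratorPromptSketch are all assumptions made for this example; they are not the actual ExplainRequestResponse shape or BR's prompt template.

```typescript
// Hypothetical, trimmed-down view of the explain payload. The real
// ExplainRequestResponse has its own field names.
interface ExplainLike {
  requestId: string;
  model: string;
  strategy: string | null;
  complexity: string | null;
  costUsd: number | null;
  latencyMs: number | null;
  alternatives: { model: string; score: number }[];
}

interface NarratorPrompt {
  system: string;
  user: string;
}

// Render nullable fields explicitly so the judge is told "unknown"
// instead of being left to guess.
function show(value: string | number | null): string {
  return value === null ? "unknown" : String(value);
}

// Pure function: no I/O, no clock, no randomness, so the same input always
// produces the same strings and the output can be snapshotted.
function buildNarratorPromptSketch(explain: ExplainLike): NarratorPrompt {
  const alternatives =
    explain.alternatives.length === 0
      ? "none recorded"
      : explain.alternatives.map((a) => `${a.model} (score ${a.score})`).join(", ");

  // Illustrative wording only; not BR's system prompt.
  const system =
    "Write a 4-6 sentence postmortem of one routing decision. " +
    "Use only the facts in the trace; if a field is unknown, say so.";

  const user = [
    `request_id: ${explain.requestId}`,
    `model: ${explain.model}`,
    `strategy: ${show(explain.strategy)}`,
    `complexity: ${show(explain.complexity)}`,
    `cost_usd: ${show(explain.costUsd)}`,
    `latency_ms: ${show(explain.latencyMs)}`,
    `alternatives: ${alternatives}`,
  ].join("\n");

  return { system, user };
}
```

Because the builder returns plain strings, an external agent could pass the same prompt to a different judge model without touching the route.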

Why It Matters

BR already attaches an x-br-context header to every completion that says things like {"model":"deepseek/deepseek-reasoner","why":"thompson-sampling","complexity":"simple"}. That's structured, machine-readable, and useless for actually understanding why BR made a decision unless you're already an expert in BR's internals.

This commit makes the gateway literally explain itself. An agent (or human) hits /v1/explain/:id?narrate=true and gets back: _"BR routed to deepseek-v4-flash because Thompson sampling scored it highest at 0.91, narrowly ahead of openai/gpt-4o-mini at 0.74. The complexity classifier flagged this as 'simple' (score 1), and the cache missed despite a 128-token prompt prefix overlap, so the router chose the cheapest high-quality option. Total cost was $0.000185 vs an estimated $0.0002 against the default. Latency was 1974ms with 412ms TTFT and 8ms routing overhead — within typical range for deepseek's reasoning model."_

That narrative is generated by BR's own LLM, fed BR's own structured trace, against BR's own cache. It's the gateway operating as a self-explaining system instead of a black box that hands you a JSON dump.

How It Works

  1. The SDK call client.explainRequest.get(id, { narrate: true }) (or the equivalent Python kwarg) hits the existing route with a query param.
  2. The route fetches the structured ExplainRequestResponse from getLogDetail exactly as before.
  3. If narrate=true and a model router was injected at mount time, the route calls narrateExplanation(). That function:
     • Hits the cache by request_id. Returns the cached narrative if present.
     • Otherwise calls modelRouter.router.routeCompletion() with model: judgeModel and bypassCache: true. Pinning the model bypasses Thompson and the auto-selector; the bypassCache flag prevents the narration prompt from polluting the production prompt cache.
     • Stores the result in the cache (LRU, 1000-entry cap) and returns it.
  4. The response body is the original ExplainRequestResponse extended with a narration field that names the judge model, narrative text, judge cost, and a cached flag.
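The cache-then-call flow in step 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual narrateExplanation(): the RouterLike interface, the message format, the Narration field names, and the choice to echo the stored cost on a cache hit are all hypothetical. Only the pinned model, the bypassCache flag, and keying by request id come from the steps above.

```typescript
// All types here are illustrative stand-ins, not BR's real interfaces.
interface ChatMessage { role: "system" | "user"; content: string; }
interface CompletionRequest { model: string; bypassCache: boolean; messages: ChatMessage[]; }
interface CompletionResult { content: string; costUsd: number; }
interface RouterLike { routeCompletion(req: CompletionRequest): Promise<CompletionResult>; }
interface Narration { judgeModel: string; narrative: string; judgeCostUsd: number; cached: boolean; }

const DEFAULT_JUDGE_MODEL = "deepseek/deepseek-chat";

async function narrateSketch(
  requestId: string,
  prompt: { system: string; user: string },
  router: RouterLike,
  cache: Map<string, Narration>,
  judgeModel: string = DEFAULT_JUDGE_MODEL,
): Promise<Narration> {
  // 1. Cache lookup keyed by request id: a hit costs no inference.
  const hit = cache.get(requestId);
  if (hit !== undefined) {
    // Assumption: the stored entry is returned with the cached flag flipped.
    return { ...hit, cached: true };
  }

  // 2. Cache miss: one inference call with the judge model pinned and the
  //    production prompt cache bypassed.
  const result = await router.routeCompletion({
    model: judgeModel,
    bypassCache: true,
    messages: [
      { role: "system", content: prompt.system },
      { role: "user", content: prompt.user },
    ],
  });

  // 3. Store the fresh narration, then return it.
  const fresh: Narration = {
    judgeModel,
    narrative: result.content,
    judgeCostUsd: result.costUsd,
    cached: false,
  };
  cache.set(requestId, fresh);
  return fresh;
}
```

A plain Map stands in for the LRU cache here to keep the sketch short; the empty-content fallback covered by the unit tests is omitted.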

If the gateway was mounted without a model router (e.g. a static-only deployment), ?narrate=true is downgraded rather than rejected: the request still succeeds, and the response includes narration: { error: "narration_unavailable", ... } so the SDK can distinguish "narration disabled" from "narration failed".
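On the client side, that distinction could be handled with a small classifier like the one below. The type names, the success-field names, and the rule that any other error code means "failed" are assumptions for illustration; only the "narration_unavailable" code is taken from the text.

```typescript
// Hypothetical shapes for the narration field.
interface NarrationOk {
  narrative: string;
  judgeModel: string;
  judgeCostUsd: number;
  cached: boolean;
}
interface NarrationError {
  error: string;
}
type NarrationField = NarrationOk | NarrationError;

type NarrationStatus = "ok" | "disabled" | "failed" | "not_requested";

function narrationStatus(narration: NarrationField | undefined): NarrationStatus {
  // Assumption: the field is absent when ?narrate=true was not sent.
  if (narration === undefined) return "not_requested";
  if ("error" in narration) {
    // "narration_unavailable" means no model router was mounted;
    // any other code is treated as a failed narration attempt (assumption).
    return narration.error === "narration_unavailable" ? "disabled" : "failed";
  }
  return "ok";
}
```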

The Numbers

  • 11 unit tests covering: prompt builder content, system-prompt anti-hallucination guards, null-field handling, zero-alternatives edge case, an inline snapshot of the canonical R15 routing decision (stability guard against prompt drift), router invocation with the pinned model, judge model override, cache hit / cache miss, empty-content fallback, and the process-wide singleton cache.
  • Cost per narration: ~$0.0001 (deepseek-chat at ~600 input + 200 output tokens).
  • Cost per request after first narration: $0 (cached).
  • Default cache cap: 1000 narratives. At ~600 input + ~200 output tokens per narration and at most 1000 entries, the in-memory footprint is bounded.
  • Routes added: 0. The capability rides the existing /v1/explain/:request_id endpoint via a query param, so there is no new path to teach agents.
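The bounded footprint follows from the entry cap. A capped LRU cache can be sketched with a Map, which iterates keys in insertion order; this is a generic illustration of the technique, not BR's cache implementation, and the class name is hypothetical.

```typescript
// Generic LRU sketch: the Map's insertion order doubles as recency order.
class LruCacheSketch<V> {
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }

  get size(): number {
    return this.entries.size;
  }
}
```

With a cap of 1000, the cache never holds more than 1000 narratives regardless of how many distinct request ids are narrated.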

Competitive Edge

Helicone and Portkey publish dashboards. OpenRouter publishes pricing tables. Letta publishes memory architectures. None of them ship a gateway that uses its own LLM to explain its own decisions through its own router. This is the second deliverable in the self-managing-agent series (provider-doctor was the first — diagnoses provider failures; this one narrates routing decisions). The pattern: BR introspects its internal state and surfaces it through its own intelligence, not through hand-written templates or external observability tooling.

Recursion safety

The narrator could in principle ask the bandit to route its narration call, and the bandit could pick the same model that's being narrated. That's not a correctness bug (the narration would still produce text), but it's a feedback-loop hazard that would make Thompson posteriors auto-correlated.

Three safeguards:

  1. Pinned model. The narrator passes an explicit provider/model pair to routeCompletion, which short-circuits Thompson and auto-selection.
  2. bypassCache: true. The narration prompt never enters the prompt cache, so it can't influence cache-hit metrics for production traffic.
  3. Cache-by-request-id. Each routing decision is narrated at most once per process lifetime while its entry stays in the cache. If the bandit posteriors did somehow update from a narration call, they'd update once and stop, not loop.
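The first safeguard can be pictured as an early return in model selection. The sketch below is a toy selector, not BR's router: the function name, the request shape, and the "pinned" label are assumptions, while "thompson-sampling" mirrors the why value shown in the x-br-context example.

```typescript
interface RouteRequest {
  model?: string; // explicit provider/model pair, if the caller pinned one
}

interface RouteDecision {
  model: string;
  why: "pinned" | "thompson-sampling";
}

// sampleFromBandit stands in for whatever draws a model from the posteriors.
function selectModelSketch(req: RouteRequest, sampleFromBandit: () => string): RouteDecision {
  if (req.model !== undefined) {
    // Pinned: the bandit is never consulted, so a narration call cannot be
    // routed by the sampler to the model it is describing.
    return { model: req.model, why: "pinned" };
  }
  return { model: sampleFromBandit(), why: "thompson-sampling" };
}
```

Because the sampler is not invoked on the pinned path, this toy selector performs no draw at all for a narration call.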

Follow-ups (NOT in this commit)

  • Add an MCP tool wrapper (br_explain_request_with_narration) so agents can call it through the MCP gateway instead of HTTP. The HTTP shape works today; the MCP wrapper is a convenience.
  • Extend the prompt to include savings-tracker data (counterfactual cost vs the default model BR routed away from) and benchmark divergence (was Thompson different from the static-price strategy?). Currently the prompt only sees the narrow ExplainRequestResponse fields; the full savings-tracker + shadow-comparator view would deepen the postmortem.
  • Persist narrations to Postgres so they survive process restarts. Currently they're process-local; a deploy clears the cache.
  • Wire a CloudWatch usage metric on judge_cost_usd so total spend on narration is observable and can be capped.

Lockstep Checklist

  • [x] API Routes: src/api/routes/explain-request.ts accepts ?narrate=true. routes.json regenerated to 436 routes (no new path; query-param-only addition).
  • [x] TS SDK: packages/sdk-ts/src/resources/explain-request.ts adds GetExplainOptions + ExplainNarration type + narrate parameter.
  • [x] Python SDK: packages/sdk-py/src/brainstormrouter/resources/explain_request.py adds narrate=True keyword argument on both ExplainRequest.get and AsyncExplainRequest.get.
  • [x] MCP Schemas: none added — agents call the existing endpoint with the new query param. Tool wrapper deferred per "follow-ups" above.
  • [x] Master Record: Listed under api/explain-narrator.

Cites: R15 stochastic-assessment context. The R15 reviewers asked for "wire alerting + load test + pre-push gate" — the antediluvian reliability lane. This PR ships a different shape: BR explains its own decisions to itself. Same cost (one inference call per narrated request, capped by cache) and a capability no funded competitor ships.