Wire all disconnected intelligence into production — 4 systems activated
2026-04-01
What We Built
An honest audit of BrainstormRouter's intelligence layer revealed that 9 of 13 systems were actively wired into the production completions path — but 4 were instantiated and never called (or only exposed via advisory endpoints). This session wired all 4 into the production request path:
- Ranking Engine Tiebreaker: 7-day production quality rankings now break ties in the strategy sort fallback, replacing arbitrary ordering when bandit and cost-frontier have no data.
- Savings API + Header: GET /v1/intelligence/savings exposes counterfactual savings data (how much routing intelligence saved vs. the default model). Non-streaming responses include an X-BR-Routing-Savings header.
- Phase Prediction Headers: The ONNX phase predictor now runs on every completions request, returning X-BR-Phases and X-BR-Phase-Confidence headers for agentic workflow clients.
- Streaming Semantic Cache: Cache hits now work for streaming requests by converting cached responses to SSE-compatible chunks. Previously, only non-streaming requests benefited from the semantic cache.
Why It Matters
BrainstormRouter's competitive advantage is intelligence — not just routing. But intelligence that exists as code and never executes is vaporware. This session eliminated the gap between "infrastructure exists" and "infrastructure works in production." Every model selection decision now benefits from ranking data, every response carries phase guidance, and streaming requests are no longer second-class citizens for caching.
How It Works
Ranking tiebreaker (model-router.ts): After sortEndpointsByStrategy() returns candidates, the ranking engine's getRankings() (cached for 5 minutes) provides a Map of per-model ranking data. Models with ranking data are sorted by production quality score; models without data are deprioritized with a debug log.
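The tiebreak described above can be sketched roughly as follows. This is a minimal illustration, not the code in model-router.ts: the function name breakTiesByRanking, the use of plain model ID strings as candidates, and the Map shape (model ID to quality score, higher is better) are all assumptions.

```typescript
/**
 * Hypothetical sketch: order tied candidates by production quality score.
 * `rankings` is assumed to map model ID -> score (higher is better).
 * Candidates without ranking data are pushed after all ranked ones.
 */
function breakTiesByRanking(
  candidates: string[],
  rankings: Map<string, number>,
): string[] {
  // Copy first so the caller's array is not mutated.
  return [...candidates].sort((a, b) => {
    const scoreA = rankings.get(a);
    const scoreB = rankings.get(b);
    if (scoreA === undefined && scoreB === undefined) return 0; // keep input order
    if (scoreA === undefined) return 1; // unranked sorts last
    if (scoreB === undefined) return -1;
    return scoreB - scoreA; // higher score first
  });
}

// Example with made-up model IDs and scores.
const exampleRankings = new Map<string, number>([
  ["model-b", 0.9],
  ["model-d", 0.7],
]);
const ordered = breakTiesByRanking(
  ["model-a", "model-b", "model-c", "model-d"],
  exampleRankings,
);
```

Because Array.prototype.sort is stable in ES2019 and later, unranked candidates keep whatever relative order the strategy sort gave them.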
Savings endpoint (intelligence.ts): Calls savingsTracker.summarize() which aggregates in-memory savings entries — total USD saved, per-source breakdown (bandit/cascade/auto_select), and per-model-path analysis. recordFromDecision() now returns the savings value so it can be attached to the decision trace and response header.
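As a rough illustration of the aggregation described above, the sketch below totals savings per source and per model path. The entry and summary shapes (SavingsEntry, SavingsSummary and their field names) are assumptions made for this example; the real savingsTracker and the SavingsSummaryResponse type may look different.

```typescript
type SavingsSource = "bandit" | "cascade" | "auto_select";

// Hypothetical entry shape: one record per routed request.
interface SavingsEntry {
  source: SavingsSource;
  defaultModel: string; // model that would have served the request without routing
  chosenModel: string; // model the router actually picked
  defaultCostUsd: number; // counterfactual cost
  actualCostUsd: number;
}

// Hypothetical summary shape.
interface SavingsSummary {
  totalSavedUsd: number;
  bySource: Record<SavingsSource, number>;
  byModelPath: Record<string, number>;
}

function summarize(entries: SavingsEntry[]): SavingsSummary {
  const summary: SavingsSummary = {
    totalSavedUsd: 0,
    bySource: { bandit: 0, cascade: 0, auto_select: 0 },
    byModelPath: {},
  };
  for (const entry of entries) {
    // Can be negative if the routed model cost more than the default.
    const saved = entry.defaultCostUsd - entry.actualCostUsd;
    const path = `${entry.defaultModel} -> ${entry.chosenModel}`;
    summary.totalSavedUsd += saved;
    summary.bySource[entry.source] += saved;
    summary.byModelPath[path] = (summary.byModelPath[path] ?? 0) + saved;
  }
  return summary;
}

// Example with made-up costs.
const exampleSummary = summarize([
  { source: "bandit", defaultModel: "large", chosenModel: "small", defaultCostUsd: 1.0, actualCostUsd: 0.25 },
  { source: "cascade", defaultModel: "large", chosenModel: "medium", defaultCostUsd: 0.5, actualCostUsd: 0.25 },
  { source: "bandit", defaultModel: "large", chosenModel: "small", defaultCostUsd: 1.0, actualCostUsd: 0.5 },
]);
```

Returning the per-request savings value from the recording step, as the text describes for recordFromDecision(), lets the same number feed both the summary and the response header.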
Phase prediction (model-router.ts, streaming.ts): Builds a PredictorInput from message text (keyword detection for code, tests, deploy, review, docs, refactor mentions), then calls the ONNX predictor (<2ms inference). Streaming path uses a 5ms Promise.race timeout to guard time-to-first-byte.
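The two ideas in this step, keyword-based feature extraction and a Promise.race time budget, can be sketched as below. The keyword patterns, the PredictorInput fields, and the helper names (buildPredictorInput, withTimeout) are illustrative assumptions; the ONNX predictor itself is left out and would be passed to withTimeout as a promise.

```typescript
// Hypothetical feature shape: one flag per workflow mention.
interface PredictorInput {
  code: boolean;
  tests: boolean;
  deploy: boolean;
  review: boolean;
  docs: boolean;
  refactor: boolean;
}

// Assumed keyword patterns; the real detector may use different terms.
const PHASE_PATTERNS: Record<keyof PredictorInput, RegExp> = {
  code: /\bcode\b/i,
  tests: /\btests?\b/i,
  deploy: /\bdeploy/i,
  review: /\breview/i,
  docs: /\bdocs?\b|\bdocumentation\b/i,
  refactor: /\brefactor/i,
};

function buildPredictorInput(messageText: string): PredictorInput {
  return {
    code: PHASE_PATTERNS.code.test(messageText),
    tests: PHASE_PATTERNS.tests.test(messageText),
    deploy: PHASE_PATTERNS.deploy.test(messageText),
    review: PHASE_PATTERNS.review.test(messageText),
    docs: PHASE_PATTERNS.docs.test(messageText),
    refactor: PHASE_PATTERNS.refactor.test(messageText),
  };
}

/**
 * Resolves with the work's result, or null if `ms` elapses first.
 * The losing promise is not cancelled; its result is simply ignored.
 */
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T | null> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<null>((resolve) => {
    timer = setTimeout(() => resolve(null), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // avoid a dangling timer when the work wins
  }
}

const exampleInput = buildPredictorInput("Please refactor this code and add tests");
```

On the streaming path, a null result from withTimeout would mean the phase headers are skipped so that time-to-first-byte is not held up by the predictor.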
Streaming cache (model-router.ts): convertCachedToStream() extracts content from the cached OpenAI-format response and yields it as SSE-compatible chunks with proper raw, delta, and finishReason fields. Includes a log.warn for format mismatches (non-OpenAI cached responses that extract as empty).
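A conversion along these lines might look like the sketch below. It is not the actual convertCachedToStream(): the function name cachedToChunks, the fixed-size slicing, the chunkSize parameter, and the exact contents of the raw field (an SSE data line carrying an OpenAI-style delta) are assumptions for illustration.

```typescript
// Minimal slice of an OpenAI-format chat completion, as stored in the cache.
interface CachedCompletion {
  choices?: Array<{
    message?: { content?: string | null };
    finish_reason?: string | null;
  }>;
}

// Hypothetical chunk shape using the field names mentioned in the text.
interface StreamChunk {
  raw: string; // assumed: the SSE line sent to the client
  delta: string; // text carried by this chunk
  finishReason: string | null; // set only on the final chunk
}

function toChunk(delta: string, finishReason: string | null): StreamChunk {
  const payload = {
    object: "chat.completion.chunk",
    choices: [
      { index: 0, delta: delta === "" ? {} : { content: delta }, finish_reason: finishReason },
    ],
  };
  return { raw: `data: ${JSON.stringify(payload)}\n\n`, delta, finishReason };
}

function* cachedToChunks(cached: CachedCompletion, chunkSize = 32): Generator<StreamChunk> {
  const choice = cached.choices?.[0];
  const content = choice?.message?.content ?? "";
  if (content === "") {
    // Mirrors the format-mismatch warning described in the text.
    console.warn("cached response extracted as empty; possible non-OpenAI format");
  }
  for (let i = 0; i < content.length; i += chunkSize) {
    yield toChunk(content.slice(i, i + chunkSize), null);
  }
  // Final chunk carries no text, only the finish reason.
  yield toChunk("", choice?.finish_reason ?? "stop");
}

const exampleChunks = Array.from(
  cachedToChunks(
    { choices: [{ message: { content: "hello world" }, finish_reason: "stop" }] },
    5,
  ),
);
```

Emitting a terminal chunk with an empty delta and a finish reason lets streaming clients handle a cache hit the same way they handle a live stream.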
The Numbers
- Intelligence coverage: 9/13 → 13/13 systems active (100%)
- Lines changed: 336 added, 21 removed across 11 files
- Test suite: 7022 tests passing, 0 failures
- Latency impact: <2ms for phase prediction, 0ms for cache hits, negligible for ranking lookup (cached)
Competitive Edge
No other AI gateway (Portkey, OpenRouter, Letta) provides phase-level workflow recommendations, counterfactual savings proof, or streaming semantic cache. These are intelligence signals that only exist because BrainstormRouter runs a full learning loop (Thompson sampling → validity scoring → performance tracking → ranking). The savings API is the "prove it" metric — clients can see exactly how much money intelligent routing saved them.
Lockstep Checklist
- [x] API Routes: src/api/routes/intelligence.ts (added GET /v1/intelligence/savings)
- [x] TS SDK: packages/sdk-ts/src/resources/intelligence.ts (savings() method + SavingsSummaryResponse type)
- [x] Python SDK: packages/sdk-py/src/brainstormrouter/resources/intelligence.py (sync + async savings() methods)
- [x] MCP Schemas: Not applicable (savings is informational, not agent-facing)
- [ ] Master Record: Needs update in follow-up