Wire all disconnected intelligence into production — 4 systems activated

2026-04-01

ranking-engine, savings-tracker, phase-predictor, semantic-cache

What We Built

An honest audit of BrainstormRouter's intelligence layer revealed that 9 of 13 systems were actively wired into the production completions path — but 4 were instantiated and never called (or only exposed via advisory endpoints). This session wired all 4 into the production request path:

Ranking Engine Tiebreaker — 7-day production quality rankings now break ties in the strategy sort fallback, replacing arbitrary ordering when bandit and cost-frontier have no data.
Savings API + Header — GET /v1/intelligence/savings exposes counterfactual savings data (how much routing intelligence saved vs. the default model). Non-streaming responses include X-BR-Routing-Savings header.
Phase Prediction Headers — The ONNX phase predictor now runs on every completions request, returning X-BR-Phases and X-BR-Phase-Confidence headers for agentic workflow clients.
Streaming Semantic Cache — Cache hits now work for streaming requests by converting cached responses to SSE-compatible chunks. Previously, only non-streaming benefited from the semantic cache.

Why It Matters

BrainstormRouter's competitive advantage is intelligence — not just routing. But intelligence that exists as code and never executes is vaporware. This session eliminated the gap between "infrastructure exists" and "infrastructure works in production." Every model selection decision now benefits from ranking data, every response carries phase guidance, and streaming requests are no longer second-class citizens for caching.

How It Works

Ranking tiebreaker (model-router.ts): After sortEndpointsByStrategy() returns candidates, the ranking engine's getRankings() (cached 5min) provides a Map. Models with ranking data are sorted by production quality score; models without data are deprioritized with a debug log.
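As an illustration of the tiebreaker described above, here is a minimal sketch. It assumes the Map returned by getRankings() goes from model ID to a numeric quality score where higher is better. The Endpoint shape, the function name applyRankingTiebreaker, and the debug callback are hypothetical and not taken from model-router.ts.

```typescript
// Hypothetical endpoint shape; the real type in model-router.ts may differ.
interface Endpoint {
  modelId: string;
}

// Assumption: rankings map a model ID to a quality score, higher is better.
type Rankings = Map<string, number>;

function applyRankingTiebreaker(
  candidates: Endpoint[],
  rankings: Rankings,
  debug: (msg: string) => void = () => {},
): Endpoint[] {
  for (const candidate of candidates) {
    if (!rankings.has(candidate.modelId)) {
      debug(`no ranking data for ${candidate.modelId}; deprioritizing`);
    }
  }
  // Copy before sorting so the caller's array is untouched. Array.prototype.sort
  // is stable (ES2019+), so models without data keep their relative order.
  return [...candidates].sort((a, b) => {
    const scoreA = rankings.get(a.modelId);
    const scoreB = rankings.get(b.modelId);
    if (scoreA === undefined && scoreB === undefined) return 0;
    if (scoreA === undefined) return 1; // unranked models go last
    if (scoreB === undefined) return -1;
    return scoreB - scoreA; // higher quality score first
  });
}
```

Keeping unranked models in their incoming order means the tiebreaker only changes positions where ranking data actually exists.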

Savings endpoint (intelligence.ts): Calls savingsTracker.summarize() which aggregates in-memory savings entries — total USD saved, per-source breakdown (bandit/cascade/auto_select), and per-model-path analysis. recordFromDecision() now returns the savings value so it can be attached to the decision trace and response header.
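The aggregation described above could look roughly like the following sketch. The class name SavingsTrackerSketch, the record() signature, and the summary field names are assumptions made for illustration. Only the three source labels and the idea of summing in-memory entries come from this entry. The counterfactual is assumed to be "cost on the default model minus actual cost".

```typescript
type SavingsSource = "bandit" | "cascade" | "auto_select";

// Hypothetical entry shape; field names are not taken from the real tracker.
interface SavingsEntry {
  source: SavingsSource;
  modelPath: string; // assumed format: "<default model>-><chosen model>"
  savedUsd: number;
}

interface SavingsSummarySketch {
  totalSavedUsd: number;
  bySource: Record<SavingsSource, number>;
  byModelPath: Record<string, number>;
}

class SavingsTrackerSketch {
  private readonly entries: SavingsEntry[] = [];

  // Stand-in for recordFromDecision(): stores the entry and returns the
  // savings value so the caller can attach it to a trace or response header.
  record(
    source: SavingsSource,
    defaultModel: string,
    chosenModel: string,
    defaultCostUsd: number,
    actualCostUsd: number,
  ): number {
    const savedUsd = defaultCostUsd - actualCostUsd;
    this.entries.push({ source, modelPath: `${defaultModel}->${chosenModel}`, savedUsd });
    return savedUsd;
  }

  summarize(): SavingsSummarySketch {
    const summary: SavingsSummarySketch = {
      totalSavedUsd: 0,
      bySource: { bandit: 0, cascade: 0, auto_select: 0 },
      byModelPath: {},
    };
    for (const entry of this.entries) {
      summary.totalSavedUsd += entry.savedUsd;
      summary.bySource[entry.source] += entry.savedUsd;
      summary.byModelPath[entry.modelPath] =
        (summary.byModelPath[entry.modelPath] ?? 0) + entry.savedUsd;
    }
    return summary;
  }
}
```

Returning the value from the record step is what lets a single call feed both the aggregate summary and the per-response X-BR-Routing-Savings header.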

Phase prediction (model-router.ts, streaming.ts): Builds a PredictorInput from the message text (keyword detection for mentions of code, tests, deploy, review, docs, and refactor), then calls the ONNX predictor (<2ms inference). The streaming path wraps the call in a 5ms Promise.race timeout so a slow prediction cannot delay time-to-first-byte.
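The keyword detection and the 5ms guard can be sketched as follows. This is an illustration under stated assumptions, not the code in model-router.ts or streaming.ts: PredictorInputSketch, PhasePrediction, the Predictor function type, the substring matching, the header value formats, and the phase names used in the usage example are all hypothetical. The predictor is passed in as a plain async function rather than a real ONNX session.

```typescript
// Keywords named in this entry; plain substring matching is an assumption.
const PHASE_KEYWORDS = ["code", "tests", "deploy", "review", "docs", "refactor"] as const;
type PhaseKeyword = (typeof PHASE_KEYWORDS)[number];

// Hypothetical stand-ins for PredictorInput and the predictor's output.
type PredictorInputSketch = Record<PhaseKeyword, boolean>;
interface PhasePrediction {
  phases: string[];
  confidence: number;
}
type Predictor = (input: PredictorInputSketch) => Promise<PhasePrediction>;

function buildPredictorInput(messageText: string): PredictorInputSketch {
  const lower = messageText.toLowerCase();
  const input = {} as PredictorInputSketch;
  for (const keyword of PHASE_KEYWORDS) {
    input[keyword] = lower.includes(keyword);
  }
  return input;
}

// Resolves to null if the predictor has not answered within deadlineMs,
// so the caller can start streaming without phase headers.
async function predictWithDeadline(
  predict: Predictor,
  input: PredictorInputSketch,
  deadlineMs = 5,
): Promise<PhasePrediction | null> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<null>((resolve) => {
    timer = setTimeout(() => resolve(null), deadlineMs);
  });
  try {
    return await Promise.race([predict(input), deadline]);
  } finally {
    clearTimeout(timer); // avoid leaving a pending timer when the predictor wins
  }
}

// Assumed header formats: comma-joined phases, confidence with two decimals.
function phaseHeaders(prediction: PhasePrediction | null): Record<string, string> {
  if (prediction === null) return {};
  return {
    "X-BR-Phases": prediction.phases.join(","),
    "X-BR-Phase-Confidence": prediction.confidence.toFixed(2),
  };
}
```

A caller would use it along these lines, with a stub predictor standing in for the real one:

```typescript
const stub: Predictor = async () => ({ phases: ["implement", "test"], confidence: 0.8 });
const input = buildPredictorInput("Please refactor this code and add tests");
predictWithDeadline(stub, input).then((p) => console.log(phaseHeaders(p)));
```

Returning null on timeout, instead of rejecting, keeps the late-prediction case on the normal code path: the response simply goes out without the two headers.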

Streaming cache (model-router.ts): convertCachedToStream() extracts content from the cached OpenAI-format response and yields it as SSE-compatible chunks with proper raw, delta, and finishReason fields. Includes a log.warn for format mismatches (non-OpenAI cached responses that extract as empty).
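To make the conversion concrete, here is a minimal sketch of replaying a cached chat completion as stream chunks. It relies on the OpenAI chat completion layout (choices[0].message.content for the cached body, choices[0].delta.content and finish_reason for stream chunks). Everything else is an assumption: the StreamChunkSketch type, the choice to emit the whole content as one chunk followed by a terminal chunk, the synchronous generator, and the warn callback standing in for log.warn.

```typescript
// Hypothetical chunk shape using the three field names mentioned above.
interface StreamChunkSketch {
  raw: string; // one SSE event: "data: <json>\n\n"
  delta: string;
  finishReason: string | null;
}

// Pulls choices[0].message.content out of an OpenAI-format response;
// returns "" for any other shape.
function extractContent(cached: unknown): string {
  const choices = (cached as { choices?: unknown } | null)?.choices;
  if (!Array.isArray(choices) || choices.length === 0) return "";
  const first = choices[0] as { message?: { content?: unknown } } | null;
  const content = first?.message?.content;
  return typeof content === "string" ? content : "";
}

function toSseEvent(payload: unknown): string {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

function* convertCachedToStreamSketch(
  cached: unknown,
  warn: (msg: string) => void = () => {},
): Generator<StreamChunkSketch> {
  const content = extractContent(cached);
  if (content === "") {
    warn("cached response yielded no content; possible non-OpenAI format");
  } else {
    // Single content chunk: a cache hit has the full text available at once.
    yield {
      raw: toSseEvent({ choices: [{ index: 0, delta: { content }, finish_reason: null }] }),
      delta: content,
      finishReason: null,
    };
  }
  // Terminal chunk carrying the finish reason and an empty delta.
  yield {
    raw: toSseEvent({ choices: [{ index: 0, delta: {}, finish_reason: "stop" }] }),
    delta: "",
    finishReason: "stop",
  };
}
```

Because the chunks use the same shape as a live stream, a streaming client should not need a separate code path for cache hits.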

The Numbers

Intelligence coverage: 9/13 → 13/13 systems active (100%)
Lines changed: 336 added, 21 removed across 11 files
Test suite: 7022 tests passing, 0 failures
Latency impact: <2ms for phase prediction, no added latency for cache hits, negligible for the ranking lookup (cached)

Competitive Edge

To our knowledge, no other AI gateway (Portkey, OpenRouter, Letta) provides phase-level workflow recommendations, counterfactual savings proof, or a streaming semantic cache. These intelligence signals exist only because BrainstormRouter runs a full learning loop (Thompson sampling → validity scoring → performance tracking → ranking). The savings API is the "prove it" metric: clients can see exactly how much money intelligent routing saved them.

Lockstep Checklist

[x] API Routes: src/api/routes/intelligence.ts — added GET /v1/intelligence/savings
[x] TS SDK: packages/sdk-ts/src/resources/intelligence.ts — savings() method + SavingsSummaryResponse type
[x] Python SDK: packages/sdk-py/src/brainstormrouter/resources/intelligence.py — sync + async savings() methods
[x] MCP Schemas: Not applicable (savings is informational, not agent-facing)
[ ] Master Record: Needs update in follow-up