Streaming Semantic Cache: Cache Hits for ~50% of Traffic

2026-03-18

semantic-cache streaming model-router

What We Built

Streaming requests now go through the semantic cache. Previously, the !params.stream gate skipped the cache lookup for every streaming request, which excluded ~50% of traffic. Now: (1) cache lookups happen for all requests regardless of streaming mode, (2) after a stream completes, the assembled response is stored in the cache with a fire-and-forget write, so the request never waits on the store, and (3) a subsequent identical streaming request is served from the cache as a non-streaming response carrying the header X-BR-Cache: hit-assembled.
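The three steps above can be sketched as a single handler. This is a minimal illustration, not the code in completions.ts: ChatParams, ResponseCache, MemoryCache, HandlerResult, and handleStreamingRequest are hypothetical names, and the exact-match Map stands in for the real semantic (similarity-based) lookup. Only the X-BR-Cache: hit-assembled header comes from the change itself.

```typescript
// Hypothetical types and names throughout; not the project's actual API.
type ChatParams = {
  model: string;
  messages: { role: string; content: string }[];
  stream?: boolean;
};

interface ResponseCache {
  lookup(key: string): Promise<string | undefined>;
  store(key: string, body: string): Promise<void>;
}

// Exact-match stand-in for the semantic cache, which would match on similarity.
class MemoryCache implements ResponseCache {
  private entries = new Map<string, string>();
  async lookup(key: string): Promise<string | undefined> {
    return this.entries.get(key);
  }
  async store(key: string, body: string): Promise<void> {
    this.entries.set(key, body);
  }
}

interface HandlerResult {
  headers: Record<string, string>;
  streamed: boolean;
  content: string;
}

async function handleStreamingRequest(
  params: ChatParams,
  cache: ResponseCache,
  upstream: (p: ChatParams) => AsyncIterable<string>,
): Promise<HandlerResult> {
  const key = JSON.stringify({ model: params.model, messages: params.messages });

  // (1) The lookup runs even though params.stream is true.
  const cached = await cache.lookup(key);
  if (cached !== undefined) {
    // (3) Hit: return the assembled body as a non-streaming response.
    return { headers: { "X-BR-Cache": "hit-assembled" }, streamed: false, content: cached };
  }

  // Miss: read the upstream stream. A real handler would also forward
  // each chunk to the client as it arrives.
  let content = "";
  for await (const delta of upstream(params)) content += delta;

  // (2) Fire-and-forget store: not awaited, and a failed write is
  // swallowed so it cannot fail the request.
  void cache.store(key, content).catch(() => undefined);

  return { headers: {}, streamed: true, content };
}
```

Swallowing the store error is one reasonable reading of "fire-and-forget"; a production handler would likely log the failure instead of dropping it silently.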

How It Works

A new stream-assembler.ts utility builds a full OpenAI-format chat completion response from accumulated streaming data (content, model, tokens). After streaming completes in completions.ts, the assembled response is stored via router.cacheStore(). On the next identical request (streaming or not), the semantic cache returns the assembled response directly.
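To make the assembly step concrete, here is a rough sketch of what such an assembler could look like. It is not the contents of stream-assembler.ts: the StreamAssembler class, its push and assemble methods, and the StreamChunk and AssembledCompletion types are assumptions, modeled on the publicly documented OpenAI chunk and chat completion shapes. It also assumes "tokens" means the usage counts reported on a stream's final chunk.

```typescript
// Hypothetical shapes; the real stream-assembler.ts may differ.
interface TokenUsage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

interface StreamChunk {
  id?: string;
  model?: string;
  choices: {
    index: number;
    delta: { role?: string; content?: string | null };
    finish_reason?: string | null;
  }[];
  usage?: TokenUsage | null;
}

interface AssembledCompletion {
  id: string;
  object: "chat.completion";
  created: number;
  model: string;
  choices: {
    index: number;
    message: { role: "assistant"; content: string };
    finish_reason: string;
  }[];
  usage: TokenUsage;
}

class StreamAssembler {
  private parts: string[] = [];
  private id = "";
  private model = "";
  private finishReason = "stop";
  private usage: TokenUsage = { prompt_tokens: 0, completion_tokens: 0, total_tokens: 0 };

  // Record whatever this chunk contributes: metadata, usage, content delta.
  push(chunk: StreamChunk): void {
    if (chunk.id) this.id = chunk.id;
    if (chunk.model) this.model = chunk.model;
    if (chunk.usage) this.usage = chunk.usage;
    const choice = chunk.choices[0];
    if (!choice) return; // a usage-only chunk has no choices
    if (choice.delta.content) this.parts.push(choice.delta.content);
    if (choice.finish_reason) this.finishReason = choice.finish_reason;
  }

  // Build a non-streaming chat completion from everything seen so far.
  assemble(nowMs: number = Date.now()): AssembledCompletion {
    return {
      id: this.id,
      object: "chat.completion",
      created: Math.floor(nowMs / 1000), // seconds since the epoch
      model: this.model,
      choices: [
        {
          index: 0,
          message: { role: "assistant", content: this.parts.join("") },
          finish_reason: this.finishReason,
        },
      ],
      usage: this.usage,
    };
  }
}
```

In this sketch the streaming loop would call push for each parsed chunk and, once the stream ends, pass the result of assemble() to router.cacheStore(). The sketch handles a single choice only, and the arguments router.cacheStore() actually takes are not shown in this entry.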

Lockstep Checklist

[x] API Routes: No new endpoints (behavior change on existing).
[x] TS SDK: No changes needed.
[x] Python SDK: No changes needed.
[x] MCP Schemas: No changes needed.
[x] Master Record: N/A — optimization.