Semantic cache: partition entries by (tenantId + model) for O(partition) lookup

2026-05-07

routercacheintelligence

What We Built

SemanticCache.lookup() iterated all entries.values() (up to 10,000) and computed cosine similarity for each, with tenant/model filtering INLINE after the dot-product step. With 10K entries and 1536-dim embeddings: ~15M float multiplications per request. Replaced flat Map with partition Map keyed by tenantId+model — lookups scan only the relevant tenant's bucket. Pre-filter happens before cosine.

Why It Matters

On a multi-tenant warm cache, tenant A's request was paying for cosine against every entry from every tenant. Partitioning makes work proportional to requesting tenant's cache size — typical 100-500x speedup on busy multi-tenant deployments.

How It Works

7 regression tests with a SpyCache subclass intercepting computeSimilarity to count per-entry calls. Asserts: tenant A scans <= tenant-A entry count, tenant A request never matches tenant B's entries (isolation), global maxEntries cap enforced across partitions, modelAgnostic still scans all model buckets within tenant only.

Lockstep Checklist

[x] No API route changes (middleware/internal — lockstep N/A)
[x] No SDK changes
[x] No MCP tool changes
[x] Regression test included (test-first invariant per /quality-fleet protocol)
[x] Linked to /quality-fleet R1 dashboard at .quality/dashboard.md

Provenance

Auto-found by /quality-fleet R1 (2026-05-07) scanner round, fixed in fix-agent batch under "go for all of it" autonomy grant. PR #213 merged to main as commit c03fe451f. Finding(s) tracked at .quality/findings.jsonl (entries: "f2b9e4a17c38"). Production-deployed via ECS task-def revision 732 series.