Semantic cache: partition entries by (tenantId + model) for O(partition) lookup
2026-05-07
What We Built
SemanticCache.lookup() iterated all entries.values() (up to 10,000) and computed cosine similarity for each, with tenant/model filtering INLINE after the dot-product step. With 10K entries and 1536-dim embeddings: ~15M float multiplications per request. Replaced flat Map with partition Map keyed by tenantId+model — lookups scan only the relevant tenant's bucket. Pre-filter happens before cosine.
Why It Matters
On a multi-tenant warm cache, tenant A's request was paying for cosine against every entry from every tenant. Partitioning makes work proportional to requesting tenant's cache size — typical 100-500x speedup on busy multi-tenant deployments.
How It Works
7 regression tests with a SpyCache subclass intercepting computeSimilarity to count per-entry calls. Asserts: tenant A scans <= tenant-A entry count, tenant A request never matches tenant B's entries (isolation), global maxEntries cap enforced across partitions, modelAgnostic still scans all model buckets within tenant only.
Lockstep Checklist
- [x] No API route changes (middleware/internal — lockstep N/A)
- [x] No SDK changes
- [x] No MCP tool changes
- [x] Regression test included (test-first invariant per /quality-fleet protocol)
- [x] Linked to /quality-fleet R1 dashboard at
.quality/dashboard.md
Provenance
Auto-found by /quality-fleet R1 (2026-05-07) scanner round, fixed in fix-agent batch under "go for all of it" autonomy grant. PR #213 merged to main as commit c03fe451f. Finding(s) tracked at .quality/findings.jsonl (entries: "f2b9e4a17c38"). Production-deployed via ECS task-def revision 732 series.