Intelligence Phase 2: Benchmark Priors, Task Classification, Cross-Tenant Ranking
2026-03-19
What We Built
Phase 2 of the Model Intelligence Engine adds three capabilities:
- Task-type classification: Every request is automatically classified into one of 7 task types (code_generation, json_extraction, tool_use, creative_writing, summarization, conversation, general). This drives task-specific rankings and routing priors.
- Benchmark ingestor: A new data source that fetches external benchmark data from Artificial Analysis and LMSYS Chatbot Arena, providing warm-start quality priors before production telemetry accumulates.
- Bayesian blend ranking: The ranking engine now blends production telemetry with benchmark priors using a sigmoid-weighted confidence function. Low-data models get more benchmark influence; high-data models are almost entirely production-driven.
Why It Matters
Without task classification, "which model is best?" has no useful answer: GPT-4o-mini dominates for simple chat, Claude Sonnet 4 dominates for code, and Gemini Flash dominates for cost. Task-specific rankings let agents and dashboards answer the real question: "which model is best for what I'm doing right now?"
How It Works
Request classification runs in classifyTaskType() using zero-cost heuristics (no external calls). The task type flows through the performance tracker as a composite key dimension (tenantId:modelKey:taskType), stored alongside global metrics. The ranking engine accepts ?task=code_generation and returns filtered leaderboards.
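A minimal sketch of what a heuristic classifier and the composite tracker key could look like. The request shape (ChatRequest), every rule inside the function, and the trackerKey() helper are assumptions for illustration; the actual classifyTaskType() rules are not described here. Only the seven task-type names and the tenantId:modelKey:taskType key layout come from this note.

```typescript
// Hypothetical sketch: the request shape and all heuristics are assumptions,
// not the real classifyTaskType() implementation.
type TaskType =
  | "code_generation"
  | "json_extraction"
  | "tool_use"
  | "creative_writing"
  | "summarization"
  | "conversation"
  | "general";

interface ChatRequest {
  messages: { role: string; content: string }[];
  tools?: unknown[];
  response_format?: { type: string };
}

function classifyTaskType(req: ChatRequest): TaskType {
  // Structural signals first: cheap to check and unambiguous.
  if (req.tools !== undefined && req.tools.length > 0) return "tool_use";
  if (req.response_format?.type === "json_object") return "json_extraction";

  // Count user turns and remember the latest user message.
  let userTurnCount = 0;
  let lastUserText = "";
  for (const m of req.messages) {
    if (m.role === "user") {
      userTurnCount += 1;
      lastUserText = m.content;
    }
  }
  const text = lastUserText.toLowerCase();

  // Keyword signals on the latest user message (illustrative lists only).
  if (text.includes("```") || /\b(function|refactor|implement|debug|compile)\b/.test(text)) {
    return "code_generation";
  }
  if (/\b(summarize|summarise|summary)\b/.test(text) || text.includes("tl;dr")) {
    return "summarization";
  }
  if (/\b(poem|story|lyrics|haiku)\b/.test(text)) return "creative_writing";

  // Multi-turn exchanges with no stronger signal count as conversation.
  if (userTurnCount > 1) return "conversation";
  return "general";
}

// Composite key for task-specific metrics. The two-part form used for
// global metrics is an assumption.
function trackerKey(tenantId: string, modelKey: string, taskType?: TaskType): string {
  return taskType !== undefined
    ? `${tenantId}:${modelKey}:${taskType}`
    : `${tenantId}:${modelKey}`;
}
```

Everything here runs on fields already present in the request, which is what keeps classification free of external calls.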
Cross-tenant priors come from aggregatePriors(), which merges every other tenant's task-specific data (the requesting tenant is excluded) to give new tenants a warm-start signal.
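The merge step could be sketched as follows. The TaskStats and Prior shapes, the success-rate metric, and the function signature are assumptions for illustration; this note only states that aggregatePriors() pools task-specific data across tenants and leaves out the requesting tenant.

```typescript
// Hypothetical record shapes; field names are assumptions.
interface TaskStats {
  tenantId: string;
  modelKey: string;
  taskType: string;
  samples: number;
  successes: number;
}

interface Prior {
  modelKey: string;
  taskType: string;
  samples: number;
  successRate: number;
}

function aggregatePriors(rows: TaskStats[], requestingTenantId: string): Prior[] {
  const totals = new Map<string, { modelKey: string; taskType: string; samples: number; successes: number }>();

  for (const row of rows) {
    // Skip the requesting tenant so its own data is not fed back as a prior.
    if (row.tenantId === requestingTenantId) continue;

    const key = `${row.modelKey}:${row.taskType}`;
    const acc = totals.get(key) ?? {
      modelKey: row.modelKey,
      taskType: row.taskType,
      samples: 0,
      successes: 0,
    };
    acc.samples += row.samples;
    acc.successes += row.successes;
    totals.set(key, acc);
  }

  // Pool counts rather than averaging per-tenant rates, so tenants with
  // more traffic carry proportionally more weight.
  return Array.from(totals.values())
    .filter((t) => t.samples > 0)
    .map((t) => ({
      modelKey: t.modelKey,
      taskType: t.taskType,
      samples: t.samples,
      successRate: t.successes / t.samples,
    }));
}
```

Pooling raw counts is one design choice among several; a real implementation might cap any single tenant's contribution so one large tenant cannot dominate the prior.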
The Numbers
- 7 task types classified from request shape
- Bayesian blend sigmoid midpoint at 50 samples (production weight ~50% at 50 samples, ~99% at 200)
- DB migration v40 adds task_type column with composite unique key
- Both SDKs (TS + Python) ship with full Intelligence resource
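The blend figures above can be reproduced with a logistic weight. This is a sketch under stated assumptions: the midpoint of 50 comes from this note, while the steepness is back-solved from the "~99% at 200" figure and may not match the constant the engine actually uses. The names productionWeight and blendedScore are hypothetical.

```typescript
// Midpoint is stated in this note; steepness is an assumption derived from
// the two quoted points (0.5 at 50 samples, 0.99 at 200 samples).
const MIDPOINT = 50;
const STEEPNESS = Math.log(99) / 150;

// Share of the final score that comes from production telemetry.
function productionWeight(samples: number): number {
  return 1 / (1 + Math.exp(-STEEPNESS * (samples - MIDPOINT)));
}

// Convex combination of the production score and the benchmark prior.
function blendedScore(productionScore: number, benchmarkPrior: number, samples: number): number {
  const w = productionWeight(samples);
  return w * productionScore + (1 - w) * benchmarkPrior;
}
```

With these constants a model at 50 samples is scored half from telemetry and half from benchmarks. A plain logistic curve never reaches zero, so at 0 samples the production weight is still roughly 0.18; whether the real engine clamps or rescales that low end is not stated here.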
Competitive Edge
No other gateway (Portkey, OpenRouter, Helicone) provides task-specific model rankings derived from production telemetry. They have static benchmarks or manual tiers. BrainstormRouter's rankings are live, per-task, and automatically improve with every request.
Lockstep Checklist
- [x] API Routes: src/api/routes/intelligence.ts updated with ?task= query param.
- [x] TS SDK: packages/sdk-ts/src/resources/intelligence.ts created with all 5 methods.
- [x] Python SDK: packages/sdk-py/src/brainstormrouter/resources/intelligence.py created.
- [x] MCP Schemas: Not applicable (no new MCP tools).
- [x] Master Record: Intelligence API documented in docs/reference/intelligence-api.md.