Intelligence Phase 2: Benchmark Priors, Task Classification, Cross-Tenant Ranking

2026-03-19

ranking-engine, model-performance, model-auto-selector, benchmark-ingestor

What We Built

Phase 2 of the Model Intelligence Engine adds three capabilities:

  1. Task-type classification: Every request is automatically classified into one of 7 task types (code_generation, json_extraction, tool_use, creative_writing, summarization, conversation, general). This drives task-specific rankings and routing priors.
  2. Benchmark ingestor: A new data source that fetches external benchmark data from Artificial Analysis and LMSYS Chatbot Arena, providing warm-start quality priors before production telemetry accumulates.
  3. Bayesian blend ranking: The ranking engine now blends production telemetry with benchmark priors using a sigmoid-weighted confidence function. Low-data models get more benchmark influence; high-data models are almost entirely production-driven.

Why It Matters

Without task classification, "which model is best?" has no useful answer: GPT-4o-mini dominates simple chat, Claude Sonnet 4 dominates code, and Gemini Flash wins on cost. Task-specific rankings let agents and dashboards answer the real question: "which model is best for what I'm doing right now?"

How It Works

Request classification runs in classifyTaskType() using zero-cost heuristics (no external calls). The task type flows through the performance tracker as a composite key dimension (tenantId:modelKey:taskType), stored alongside global metrics. The ranking engine accepts ?task=code_generation and returns filtered leaderboards.
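To make the classification and keying step concrete, here is a minimal sketch of what a heuristic classifier and the composite key could look like. Only the name classifyTaskType(), the seven task types, and the tenantId:modelKey:taskType key layout come from the text above; the ChatRequest shape, the specific rules, their order, and the performanceKey() helper are assumptions made for illustration, not the shipped implementation.

```typescript
type TaskType =
  | "code_generation"
  | "json_extraction"
  | "tool_use"
  | "creative_writing"
  | "summarization"
  | "conversation"
  | "general";

// Hypothetical request shape; the real gateway payload may differ.
interface ChatRequest {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  tools?: unknown[];
  response_format?: { type: string };
}

// Illustrative heuristics only: they inspect the request shape and the
// last user message, and never make an external call.
function classifyTaskType(req: ChatRequest): TaskType {
  // Structural signals first: they are the least ambiguous.
  if (req.tools && req.tools.length > 0) return "tool_use";
  const format = req.response_format?.type;
  if (format === "json_object" || format === "json_schema") return "json_extraction";

  const userTurns = req.messages.filter((m) => m.role === "user");
  const last = userTurns[userTurns.length - 1]?.content.toLowerCase() ?? "";

  // Keyword signals (assumed vocabulary, easy to tune).
  if (/\b(function|refactor|implement|bug)\b/.test(last)) return "code_generation";
  if (/\b(summarize|summarise|tl;dr)\b/.test(last)) return "summarization";
  if (/\b(story|poem|lyrics)\b/.test(last)) return "creative_writing";

  // Multi-turn chat with no stronger signal is treated as conversation.
  if (userTurns.length > 1) return "conversation";
  return "general";
}

// Composite key dimension used by the performance tracker.
function performanceKey(tenantId: string, modelKey: string, taskType: TaskType): string {
  return `${tenantId}:${modelKey}:${taskType}`;
}
```

Checking structural signals (tools, response format) before keyword matches keeps the classifier cheap and predictable; a request that matches nothing falls through to general rather than being guessed.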

Cross-tenant priors: aggregatePriors() merges task-specific data from all tenants except the requesting one, giving new tenants a warm-start signal before they have telemetry of their own.
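The cross-tenant merge can be sketched roughly as follows. The function name aggregatePriors() and the exclusion of the requesting tenant come from the description above; the row shape (TaskStats), the use of a success rate as the quality signal, and the sample-weighted pooling are assumptions chosen to keep the example small.

```typescript
// Hypothetical per-tenant, per-model, per-task telemetry row.
interface TaskStats {
  tenantId: string;
  modelKey: string;
  taskType: string;
  samples: number;
  successes: number;
}

interface Prior {
  samples: number;      // pooled sample count across contributing tenants
  successRate: number;  // pooled successes / pooled samples
}

// Pools task-specific stats from every tenant except the requester,
// returning one prior per model for the given task type.
function aggregatePriors(
  rows: TaskStats[],
  requestingTenant: string,
  taskType: string,
): Map<string, Prior> {
  const totals = new Map<string, { samples: number; successes: number }>();

  for (const row of rows) {
    // Skip the requester's own data and rows for other task types.
    if (row.tenantId === requestingTenant || row.taskType !== taskType) continue;
    const t = totals.get(row.modelKey) ?? { samples: 0, successes: 0 };
    t.samples += row.samples;
    t.successes += row.successes;
    totals.set(row.modelKey, t);
  }

  const priors = new Map<string, Prior>();
  totals.forEach((t, modelKey) => {
    // Models with no pooled samples get no prior at all.
    if (t.samples > 0) {
      priors.set(modelKey, { samples: t.samples, successRate: t.successes / t.samples });
    }
  });
  return priors;
}
```

Pooling raw counts, rather than averaging per-tenant rates, means a tenant with 10 requests does not carry the same weight as one with 10,000. Whether the production code weights tenants this way is not stated in this post.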

The Numbers

  • 7 task types classified from request shape
  • Bayesian blend sigmoid midpoint at 50 samples (production weight ~50% at 50 samples, ~99% at 200)
  • DB migration v40 adds task_type column with composite unique key
  • Both SDKs (TS + Python) ship with full Intelligence resource
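The blend figures above can be reproduced with a plain logistic curve. This is a sketch under stated assumptions: the midpoint of 50 samples is from this post, while the steepness is back-solved from the "~99% at 200" figure (k = ln(99) / 150, roughly 0.031) and the function names productionWeight() and blendedScore() are hypothetical. The real confidence function may use a different parameterization.

```typescript
const MIDPOINT = 50;                    // samples at which production weight is 50%
const STEEPNESS = Math.log(99) / 150;   // assumed: chosen so weight(200) is 0.99

// Logistic weight given to production telemetry for a model with `samples` requests.
function productionWeight(samples: number): number {
  return 1 / (1 + Math.exp(-STEEPNESS * (samples - MIDPOINT)));
}

// Blends a production score with a benchmark prior (both on the same scale).
// With no production data the prior is used as-is, because a pure logistic
// curve would still assign about 18% weight at zero samples.
function blendedScore(
  productionScore: number | null,
  benchmarkPrior: number,
  samples: number,
): number {
  if (productionScore === null || samples <= 0) return benchmarkPrior;
  const w = productionWeight(samples);
  return w * productionScore + (1 - w) * benchmarkPrior;
}

// Example: a model with 50 production samples sits halfway between the two signals.
const example = blendedScore(0.9, 0.5, 50); // 0.5 * 0.9 + 0.5 * 0.5
console.log(example.toFixed(2));
```

With these parameters, a model needs a few hundred requests before benchmarks stop mattering, which matches the stated intent that low-data models lean on priors and high-data models are almost entirely production-driven.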

Competitive Edge

No other gateway (Portkey, OpenRouter, Helicone) provides task-specific model rankings derived from production telemetry; they rely on static benchmarks or manual tiers. BrainstormRouter's rankings are live, per-task, and improve automatically with every request.

Lockstep Checklist

  • [x] API Routes: src/api/routes/intelligence.ts updated with ?task= query param.
  • [x] TS SDK: packages/sdk-ts/src/resources/intelligence.ts created with all 5 methods.
  • [x] Python SDK: packages/sdk-py/src/brainstormrouter/resources/intelligence.py created.
  • [x] MCP Schemas: Not applicable (no new MCP tools).
  • [x] Master Record: Intelligence API documented in docs/reference/intelligence-api.md.