Thompson Sampling: Principled Model Selection Across 247 Endpoints

2026-02-15

model-bandit, model-performance-tracker

LOCKSTEP TRACEABILITY MATRIX

---
api_endpoints: ["POST /v1/chat/completions (model selection is internal)"]
sdk_methods_updated: ["none (transparent to SDK consumers)"]
mcp_tools_updated: ["none"]
---

What We Built

BrainstormRouter's model selection is not a static configuration or a round-robin scheduler. It is a multi-armed bandit that learns, in real time, which models produce the best results for each request shape. The system uses UCB1 (Upper Confidence Bound) for early exploration when data is sparse, then transitions to full Gaussian Thompson Sampling once it has observed 500+ requests per model-shape combination within a 7-day rolling window.

The ModelBandit class in model-bandit.ts (229 LOC) implements the selection algorithm. The ModelPerformanceTracker in model-performance.ts (533 LOC) maintains Welford online accumulators — a numerically stable algorithm for computing running mean and variance in a single pass — bucketed by day. Every model response is recorded as an observation with reward, validity score, quality score, latency, and cost. These observations feed the bandit's arms, which represent competing models.

Hard quality floors (minimum validity 0.5, minimum quality 0.3) gate which models are even eligible for selection. If no models pass the floors — a cold-start scenario — all candidates are admitted. This prevents the system from getting stuck in a local optimum while ensuring that known-bad models are excluded once sufficient data exists.
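The gating logic described above can be sketched as follows. This is an illustrative reconstruction, not an excerpt from model-bandit.ts; the `ArmStats` shape and `eligibleArms` name are assumptions.

```typescript
interface ArmStats {
  model: string;
  validityMean: number; // rolling mean validity score in [0, 1]
  qualityMean: number;  // rolling mean quality score in [0, 1]
}

// Hard quality floors from the text.
const MIN_VALIDITY = 0.5;
const MIN_QUALITY = 0.3;

function eligibleArms(arms: ArmStats[]): ArmStats[] {
  const passing = arms.filter(
    (a) => a.validityMean >= MIN_VALIDITY && a.qualityMean >= MIN_QUALITY
  );
  // Cold start: if no arm clears the floors, admit all candidates
  // so the bandit can gather data instead of stalling.
  return passing.length > 0 ? passing : arms;
}
```

The fallback branch is what prevents the lockout scenario: a fleet of brand-new models with no observations would otherwise produce an empty candidate set.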

Why It Matters

Every AI API call costs money and delivers variable quality. Enterprises running production workloads across Claude, GPT-4o, Gemini, and dozens of other models need to know: _which model is actually best for this specific type of request?_ Not in a benchmark — in their production traffic.

BrainstormRouter answers this question automatically. As traffic flows, the bandit learns that Claude Sonnet excels at structured output, GPT-4o-mini handles simple classification cheaply, and Gemini Flash is optimal for high-throughput low-complexity work. No human tuning required. The system explores new models when uncertainty is high and exploits known winners when confidence is strong.

How It Works

The bandit operates in two regimes:

Low-data regime (<500 samples per shape): UCB1 with a decaying exploration coefficient. The exploration bonus starts at C=1.5 (aggressive exploration) and decays toward C=0.5 as sample count grows, with a decay constant of 1000 samples:

```ts
const score = arm.rewardMean + C * Math.sqrt(Math.log(totalPulls) / arm.sampleCount);
```
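Putting the decaying coefficient and the score together, the low-data regime might look like the sketch below. The exponential decay shape is an assumption consistent with the stated endpoints (C from 1.5 toward 0.5, decay constant 1000 samples); the actual decay curve in model-bandit.ts may differ.

```typescript
const C_START = 1.5;        // aggressive exploration when data is sparse
const C_FLOOR = 0.5;        // asymptotic coefficient as data accumulates
const DECAY_SAMPLES = 1000; // decay constant from the text

// Assumed exponential decay from C_START toward C_FLOOR.
function explorationCoefficient(totalPulls: number): number {
  return C_FLOOR + (C_START - C_FLOOR) * Math.exp(-totalPulls / DECAY_SAMPLES);
}

// UCB1 score: empirical mean plus an uncertainty bonus that shrinks
// as an arm accumulates samples.
function ucb1Score(rewardMean: number, sampleCount: number, totalPulls: number): number {
  const C = explorationCoefficient(totalPulls);
  return rewardMean + C * Math.sqrt(Math.log(totalPulls) / sampleCount);
}
```

Note the key UCB1 property: two arms with the same mean reward are separated by their sample counts, so the under-sampled arm gets the larger bonus and is pulled more often.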

High-data regime (>=500 samples): Gaussian Thompson Sampling. Each arm samples from N(mean, variance/n), and the arm with the highest sample wins:

```ts
const sample = gaussianRandom(arm.rewardMean, Math.sqrt(arm.rewardVar / arm.sampleCount));
```
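One plausible shape for the sampling step is sketched below: a standard Box-Muller transform for `gaussianRandom`, plus the arm-selection loop. The helper names and the `thompsonPick` function are illustrative, not necessarily what model-bandit.ts contains.

```typescript
// Box-Muller transform: one standard-normal draw scaled to N(mean, stdDev^2).
function gaussianRandom(mean: number, stdDev: number): number {
  const u1 = Math.random() || Number.MIN_VALUE; // avoid log(0)
  const u2 = Math.random();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return mean + stdDev * z;
}

interface Arm { rewardMean: number; rewardVar: number; sampleCount: number; }

// Each arm draws from its posterior N(mean, variance / n);
// the index of the highest draw wins.
function thompsonPick(arms: Arm[]): number {
  let best = -1;
  let bestSample = -Infinity;
  arms.forEach((arm, i) => {
    const sample = gaussianRandom(arm.rewardMean, Math.sqrt(arm.rewardVar / arm.sampleCount));
    if (sample > bestSample) { bestSample = sample; best = i; }
  });
  return best;
}
```

Because the posterior standard deviation shrinks as 1/sqrt(n), heavily sampled arms draw close to their means while sparse arms still occasionally "win" on a lucky draw, which is exactly the explore/exploit balance Thompson Sampling provides.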

The ModelPerformanceTracker maintains daily buckets using Welford's online algorithm for numerically stable variance computation. Buckets are merged across a 7-day window using parallel merge, persisted to Postgres every 300 seconds, and auto-pruned after 30 days. Cold-start models receive 5 synthetic pseudo-observations seeded from pricing data (expensive models get reward 0.65, cheap models get 0.4).
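The accumulator and merge described above can be sketched like this. The update is Welford's standard online algorithm and the merge is the parallel combination due to Chan et al.; the class name and field names are illustrative, not taken from model-performance.ts.

```typescript
class Welford {
  n = 0;     // sample count
  mean = 0;  // running mean
  m2 = 0;    // running sum of squared deviations from the mean

  // Single-pass update: numerically stable, no stored samples.
  add(x: number): void {
    this.n += 1;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
  }

  get variance(): number {
    return this.n > 1 ? this.m2 / (this.n - 1) : 0; // sample variance
  }

  // Parallel (Chan) merge: combines two accumulators in O(1), which is
  // how daily buckets can be rolled up into a 7-day window.
  static merge(a: Welford, b: Welford): Welford {
    const out = new Welford();
    out.n = a.n + b.n;
    if (out.n === 0) return out;
    const delta = b.mean - a.mean;
    out.mean = a.mean + (delta * b.n) / out.n;
    out.m2 = a.m2 + b.m2 + (delta * delta * a.n * b.n) / out.n;
    return out;
  }
}
```

The merge is what makes daily bucketing cheap: the 7-day window is reconstituted from seven small accumulators rather than re-scanning raw observations.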

The Numbers

  • 247 model endpoints across 3 providers (Anthropic, OpenAI, Google) feeding the bandit
  • 7-day rolling window for performance statistics — recent enough to capture model updates, wide enough for statistical confidence
  • 500-sample threshold for transitioning from UCB1 to Thompson Sampling
  • 300-second persist interval for Postgres durability with monotonic upsert (DB bucket wins if it has more samples)
  • 30-day retention with automatic pruning of stale buckets
  • 5 pseudo-observations for cold-start seeding from pricing priors
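The "monotonic upsert" rule from the persistence bullet can be expressed as a small conflict resolver. This is a hedged sketch of the rule as stated (the bucket with more samples wins); the `Bucket` shape and `resolveBucket` name are assumptions, and the real implementation presumably runs inside a Postgres upsert rather than application code.

```typescript
interface Bucket {
  day: string;  // e.g. "2026-02-15"
  n: number;    // sample count in this bucket
  mean: number; // Welford running mean
  m2: number;   // Welford sum of squared deviations
}

// Monotonic resolution: a stale in-memory bucket can never overwrite a
// fuller row already persisted — whichever side has more samples wins.
function resolveBucket(local: Bucket, db: Bucket | null): Bucket {
  if (db === null) return local;
  return db.n > local.n ? db : local;
}
```

Monotonicity matters when multiple router instances flush the same day bucket: without it, a replica that restarted mid-day could clobber a richer bucket with a near-empty one.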

Competitive Edge

No competing AI gateway implements principled exploration/exploitation for model selection. Portkey offers manual routing rules. OpenRouter uses popularity-based defaults. BrainstormRouter is the only platform that _learns_ which models are best for your specific workload, balances the explore/exploit tradeoff mathematically, and transitions automatically from aggressive exploration to confident exploitation as data accumulates. This is not a feature toggle — it is the default behavior of the routing engine.

Lockstep Checklist

  • [x] API Routes: No new routes — Thompson sampling is internal to the routing engine, transparent to API consumers.
  • [x] TS SDK: No changes needed — model selection is server-side.
  • [x] Python SDK: No changes needed — model selection is server-side.
  • [ ] MCP Schemas: Not applicable.
  • [x] Master Record: Listed under "Core Engine & Routing" in master-capability-record.md.