Thompson Sampling
UCB1 → Gaussian Thompson: how BrainstormRouter learns which model performs best for your workload.
Why not rules-based routing
Most AI gateways route by rules: cheapest model, lowest latency, round-robin, or manual priority lists. Rules don't learn. If model A starts outperforming model B for your specific workload, a rules-based router keeps routing traffic exactly as it did before.
BrainstormRouter uses Thompson Sampling, a Bayesian approach to the multi-armed bandit problem. The router maintains a posterior distribution for each model endpoint and samples from it to balance exploration (trying underexplored models) with exploitation (using models that have performed well).
The full implementation is in src/router/model-bandit.ts.
Algorithm overview
```mermaid
%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#d97706', 'lineColor': '#9494a8', 'primaryTextColor': '#e8e8ee'}}}%%
flowchart TD
    Start([Select model]) --> Floor{Hard floor<br/>gating}
    Floor -->|"validity < 0.5<br/>(≥10 samples)"| Exclude[Exclude arm]
    Floor -->|"quality < 0.3<br/>(when data exists)"| Exclude
    Floor -->|Pass| Eligible
    Eligible --> Zero{Any arms with<br/>0 samples?}
    Zero -->|Yes| Random[Random pick<br/>among unexplored]
    Zero -->|No| SampleCheck{Total samples<br/>≥ 500?}
    SampleCheck -->|No| UCB1["UCB1 Selection<br/>score = mean + C × √(ln(total)/n)"]
    SampleCheck -->|Yes| Thompson["Thompson Sampling<br/>sample ~ N(mean, var/n)"]
    UCB1 --> Pick([Pick highest score])
    Thompson --> Pick
    Random --> Pick

    style Exclude fill:#7f1d1d
    style Random fill:#1e3a5f
    style UCB1 fill:#4a3728
    style Thompson fill:#2d4a1e
```
Phase 1: UCB1 (cold start)
While the total sample count across all arms is below 500, the router uses Upper Confidence Bound (UCB1), a deterministic algorithm that balances exploitation and exploration:
ucb_score = reward_mean + C × √(ln(total_samples) / arm_samples)
From model-bandit.ts:131-156:
```typescript
private ucb1Select(arms: BanditArm[], totalSamples: number): BanditSelection {
  // C decays from initialC to minC as total samples grow
  const decayFactor = Math.min(1, totalSamples / 1000);
  const C = this.config.initialC - (this.config.initialC - this.config.minC) * decayFactor;

  let bestArm = arms[0];
  let bestScore = -Infinity;
  let bestBonus = 0;

  for (const arm of arms) {
    const explorationBonus = C * Math.sqrt(Math.log(totalSamples) / arm.sampleCount);
    const ucbScore = arm.rewardMean + explorationBonus;
    if (ucbScore > bestScore) {
      bestScore = ucbScore;
      bestArm = arm;
      bestBonus = explorationBonus;
    }
  }
  // ...
}
```
The exploration coefficient C decays from 1.5 to 0.5 as total samples grow toward 1000. This means:
- Early (few samples): C=1.5, heavy exploration — tries underexplored models aggressively
- Later (many samples): C=0.5, mild exploration — mostly exploits the best model, occasionally probes others
Arms with zero samples are explored first via random selection, before UCB1 kicks in.
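To make the decay concrete, here is a small numeric sketch of the score described above. It is standalone illustration code, not the library source; the `ArmStats` shape and the two example arms are hypothetical, while the constants mirror the documented defaults (initialC=1.5, minC=0.5, decay toward 1000 samples).

```typescript
// Sketch of the decayed-C UCB1 score (illustrative, not the library source).
interface ArmStats {
  name: string;
  rewardMean: number;
  sampleCount: number;
}

function decayedC(totalSamples: number, initialC = 1.5, minC = 0.5): number {
  // C decays linearly from initialC to minC as totals approach 1000
  const decayFactor = Math.min(1, totalSamples / 1000);
  return initialC - (initialC - minC) * decayFactor;
}

function ucbScore(arm: ArmStats, totalSamples: number): number {
  const C = decayedC(totalSamples);
  return arm.rewardMean + C * Math.sqrt(Math.log(totalSamples) / arm.sampleCount);
}

// Two hypothetical arms after 100 total samples:
const total = 100;
const a = { name: "model-a", rewardMean: 0.7, sampleCount: 80 };
const b = { name: "model-b", rewardMean: 0.6, sampleCount: 20 };
// model-b's larger exploration bonus (fewer samples) can outweigh its lower mean,
// so early on the under-sampled arm still gets picked.
```

This is the behavior UCB1 is chosen for: with few samples, the bonus term dominates and forces coverage of every arm.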
Phase 2: Gaussian Thompson Sampling (steady state)
Once total samples reach 500, the router switches to Gaussian Thompson Sampling. Instead of a deterministic formula, it samples from each model's posterior distribution:
sampled_reward ~ N(mean, variance / sample_count)
From model-bandit.ts:159-180:
```typescript
private thompsonSelect(arms: BanditArm[]): BanditSelection {
  let bestArm = arms[0];
  let bestSample = -Infinity;

  for (const arm of arms) {
    // Sample from posterior N(mean, var / n)
    const stddev = arm.sampleCount > 0
      ? Math.sqrt(Math.max(arm.rewardVar, 0.001) / arm.sampleCount)
      : 0.5;
    const sampled = arm.rewardMean + gaussianRandom() * stddev;
    if (sampled > bestSample) {
      bestSample = sampled;
      bestArm = arm;
    }
  }
  // ...
}
```
The beauty of Thompson Sampling: a model with high mean and low variance (consistent performer) will almost always win. But a model with high variance (inconsistent) gets occasional chances to prove itself. The exploration is probability-weighted, not coefficient-tuned.
The gaussianRandom() function uses the Box-Muller transform for standard normal samples (model-bandit.ts:218-228).
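A textbook Box-Muller implementation looks roughly like this. This is a sketch of the standard transform; the library's actual `gaussianRandom()` may differ in details such as caching the second sample.

```typescript
// Box-Muller transform: converts two uniform(0,1) draws into one
// standard normal sample.
function gaussianRandom(): number {
  let u = 0;
  // Guard against Math.random() returning exactly 0 (log(0) is -Infinity)
  while (u === 0) u = Math.random();
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Scale and shift to draw from a posterior N(mean, var / n), with the
// same variance floor the selection code above applies:
function samplePosterior(mean: number, variance: number, n: number): number {
  const stddev = Math.sqrt(Math.max(variance, 0.001) / n);
  return mean + gaussianRandom() * stddev;
}
```

Note how the posterior tightens as `n` grows: a well-sampled arm's draws cluster around its mean, which is exactly why consistent performers win most selections.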
Hard floors
Before any selection runs, arms are filtered by hard floors:
| Floor | Threshold | Condition | Source |
|---|---|---|---|
| Validity | ≥ 0.5 | Only after ≥ 10 samples | model-bandit.ts:64 |
| Quality | ≥ 0.3 | Only when quality data exists and ≥ 10 samples | model-bandit.ts:65 |
If no arms pass the hard floors (cold start scenario), all arms are allowed as a fallback.
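The gating step and its cold-start fallback can be sketched as a simple filter. The thresholds mirror the documented defaults; the `Arm` field names here are assumptions for illustration, not the library's exact types.

```typescript
// Sketch of the hard-floor gating described in the table above.
interface Arm {
  validityMean: number;
  qualityMean: number | null; // null until quality data exists
  sampleCount: number;
}

function applyHardFloors(arms: Arm[], minValidity = 0.5, minQuality = 0.3): Arm[] {
  const passing = arms.filter((arm) => {
    // Floors only apply once an arm has enough samples to judge
    if (arm.sampleCount < 10) return true;
    if (arm.validityMean < minValidity) return false;
    if (arm.qualityMean !== null && arm.qualityMean < minQuality) return false;
    return true;
  });
  // Cold-start fallback: if every arm fails, allow all of them
  return passing.length > 0 ? passing : arms;
}
```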
Reward signal
The reward is a composite score derived from:
- Validity — Did the response complete successfully? (0 or 1)
- Quality — Tool-call accuracy tracked via EWMA
- Cost — Relative to the cheapest available option
- Latency — Relative to the fastest available option
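A composite reward of this shape can be sketched as follows. The weights and field names below are assumptions chosen for the sketch; they are not BrainstormRouter's actual coefficients.

```typescript
// Illustrative composite reward combining the four signals above.
// Weights are assumed for this sketch, not the library's real values.
interface Outcome {
  valid: boolean;      // did the response complete successfully
  quality: number;     // tool-call accuracy (EWMA), 0..1
  costUsd: number;     // cost of this call
  cheapestUsd: number; // cheapest available option
  latencyMs: number;   // latency of this call
  fastestMs: number;   // fastest available option
}

function compositeReward(o: Outcome): number {
  if (!o.valid) return 0; // a failed response earns no reward
  const costScore = o.cheapestUsd / Math.max(o.costUsd, 1e-9);    // 1 = cheapest
  const latencyScore = o.fastestMs / Math.max(o.latencyMs, 1e-9); // 1 = fastest
  // Assumed weighting: quality dominates, cost and latency temper it.
  return 0.6 * o.quality + 0.2 * costScore + 0.2 * latencyScore;
}
```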
Statistics are maintained in a 7-day sliding window using Welford's online algorithm for numerically stable variance computation. The mergeDailyStats() function (model-bandit.ts:187-215) implements parallel Welford merge for combining daily buckets.
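The parallel merge is the standard Chan et al. extension of Welford's algorithm. A minimal sketch of what a merge like `mergeDailyStats()` computes (the summary shape here is illustrative):

```typescript
// Welford-style summary: count, mean, and M2 (sum of squared deviations).
interface WelfordStats {
  count: number;
  mean: number;
  m2: number; // population variance = m2 / count, once count > 0
}

// Update a summary with one observation (standard Welford step).
function welfordAdd(s: WelfordStats, x: number): WelfordStats {
  const count = s.count + 1;
  const delta = x - s.mean;
  const mean = s.mean + delta / count;
  const m2 = s.m2 + delta * (x - mean);
  return { count, mean, m2 };
}

// Numerically stable parallel merge of two summaries (e.g. two daily
// buckets), per Chan et al. The merged stats equal those of one pass
// over the combined stream.
function welfordMerge(a: WelfordStats, b: WelfordStats): WelfordStats {
  if (a.count === 0) return b;
  if (b.count === 0) return a;
  const count = a.count + b.count;
  const delta = b.mean - a.mean;
  const mean = a.mean + (delta * b.count) / count;
  const m2 = a.m2 + b.m2 + (delta * delta * a.count * b.count) / count;
  return { count, mean, m2 };
}
```

This is what makes the 7-day sliding window cheap: each day keeps its own summary, and expiring a day is just re-merging the remaining buckets rather than replaying raw samples.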
Configuration
All thresholds are configurable per-instance:
```typescript
const bandit = new ModelBandit({
  minValidity: 0.5,       // Hard floor: validity
  minQuality: 0.3,        // Hard floor: quality
  initialC: 1.5,          // UCB1 exploration start
  minC: 0.5,              // UCB1 exploration floor
  thompsonThreshold: 500, // Samples before switching to Thompson
});
```
What this means in practice
For a tenant sending 100 requests/day across 4 model endpoints:
- Day 1-2: UCB1 explores all 4 models roughly equally (C=1.5)
- Day 3-5: UCB1 starts favoring the best performer but still probes (C decaying)
- Day 5+: Passes 500 samples, switches to Thompson Sampling
- Steady state: ~80-90% of traffic goes to the best model, ~10-20% explores alternatives
- If a model degrades: Thompson detects it within ~50 samples as the posterior shifts
The router learns your workload without configuration. You don't set weights. You don't pick strategies. The math does it.