Thompson Sampling

UCB1 → Gaussian Thompson: how BrainstormRouter learns which model performs best for your workload.

Why not rules-based routing

Most AI gateways route by rules: cheapest model, lowest latency, round-robin, or manual priority lists. Rules don't learn. If model A starts outperforming model B for your specific workload, a rules-based router keeps serving traffic according to the same static policy.

BrainstormRouter uses Thompson Sampling, a Bayesian approach to the multi-armed bandit problem. The router maintains a posterior distribution for each model endpoint and samples from it to balance exploration (trying underexplored models) with exploitation (using models that have performed well).

The full implementation is in src/router/model-bandit.ts.

Algorithm overview

%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#d97706', 'lineColor': '#9494a8', 'primaryTextColor': '#e8e8ee'}}}%%
flowchart TD
    Start([Select model]) --> Floor{Hard floor<br/>gating}
    Floor -->|"validity < 0.5<br/>(≥10 samples)"| Exclude[Exclude arm]
    Floor -->|"quality < 0.3<br/>(when data exists)"| Exclude
    Floor -->|Pass| Eligible

    Eligible --> Zero{Any arms with<br/>0 samples?}
    Zero -->|Yes| Random[Random pick<br/>among unexplored]
    Zero -->|No| SampleCheck{Total samples<br/>≥ 500?}

    SampleCheck -->|No| UCB1["UCB1 Selection<br/>score = mean + C × √(ln(total)/n)"]
    SampleCheck -->|Yes| Thompson["Thompson Sampling<br/>sample ~ N(mean, var/n)"]

    UCB1 --> Pick([Pick highest score])
    Thompson --> Pick
    Random --> Pick

    style Exclude fill:#7f1d1d
    style Random fill:#1e3a5f
    style UCB1 fill:#4a3728
    style Thompson fill:#2d4a1e

Phase 1: UCB1 (cold start)

When the total sample count across all arms is below 500, the router uses Upper Confidence Bound (UCB1), a deterministic algorithm that balances exploitation and exploration:

ucb_score = reward_mean + C × √(ln(total_samples) / arm_samples)

From model-bandit.ts:131-156:

private ucb1Select(arms: BanditArm[], totalSamples: number): BanditSelection {
  // C decays from initialC to minC as total samples grow
  const decayFactor = Math.min(1, totalSamples / 1000);
  const C = this.config.initialC - (this.config.initialC - this.config.minC) * decayFactor;

  let bestArm = arms[0];
  let bestScore = -Infinity;
  let bestBonus = 0;

  for (const arm of arms) {
    const explorationBonus = C * Math.sqrt(Math.log(totalSamples) / arm.sampleCount);
    const ucbScore = arm.rewardMean + explorationBonus;

    if (ucbScore > bestScore) {
      bestScore = ucbScore;
      bestArm = arm;
      bestBonus = explorationBonus;
    }
  }
  // ...
}

The exploration coefficient C decays from 1.5 to 0.5 as total samples grow toward 1000. This means:

  • Early (few samples): C=1.5, heavy exploration — tries underexplored models aggressively
  • Later (many samples): C=0.5, mild exploration — mostly exploits the best model, occasionally probes others

Arms with zero samples are explored first via random selection, before UCB1 kicks in.
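To make the trade-off concrete, here is a worked sketch of the score. The arm statistics below are illustrative, not taken from the codebase:

```typescript
// Illustrative UCB1 scoring; the arm numbers are made up.
interface ArmStats {
  name: string;
  rewardMean: number;
  sampleCount: number;
}

function ucb1Score(arm: ArmStats, totalSamples: number, C: number): number {
  return arm.rewardMean + C * Math.sqrt(Math.log(totalSamples) / arm.sampleCount);
}

// Decayed coefficient at 100 total samples: 1.5 - 1.0 * 0.1 = 1.4
const total = 100;
const C = 1.5 - (1.5 - 0.5) * Math.min(1, total / 1000);

const arms: ArmStats[] = [
  { name: "model-a", rewardMean: 0.8, sampleCount: 80 }, // strong, well-sampled
  { name: "model-b", rewardMean: 0.6, sampleCount: 5 },  // weaker, barely explored
];

const scores = arms.map((a) => ucb1Score(a, total, C));
// model-b's exploration bonus (~1.34) outweighs model-a's higher mean,
// so the under-sampled arm wins this round.
```

Note how the bonus term, not the mean, decides the winner early on: that is exactly the "heavy exploration" regime described above.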

Phase 2: Gaussian Thompson Sampling (steady state)

Once total samples reach 500, the router switches to Gaussian Thompson Sampling. Instead of a deterministic formula, it samples from each model's posterior distribution:

sampled_reward ~ N(mean, variance / sample_count)

From model-bandit.ts:159-180:

private thompsonSelect(arms: BanditArm[]): BanditSelection {
  let bestArm = arms[0];
  let bestSample = -Infinity;

  for (const arm of arms) {
    // Sample from posterior N(mean, var / n)
    const stddev = arm.sampleCount > 0
      ? Math.sqrt(Math.max(arm.rewardVar, 0.001) / arm.sampleCount)
      : 0.5;
    const sampled = arm.rewardMean + gaussianRandom() * stddev;

    if (sampled > bestSample) {
      bestSample = sampled;
      bestArm = arm;
    }
  }
  // ...
}

The beauty of Thompson Sampling: a model with high mean and low variance (consistent performer) will almost always win. But a model with high variance (inconsistent) gets occasional chances to prove itself. The exploration is probability-weighted, not coefficient-tuned.

The gaussianRandom() function uses the Box-Muller transform for standard normal samples (model-bandit.ts:218-228).
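A toy simulation shows the probability-weighted exploration in action. The posterior statistics are made up, and gaussian() below is a standard Box-Muller sampler like the one described above, standing in for the router's gaussianRandom():

```typescript
// Toy Thompson comparison: a consistent arm vs. a noisy, less-sampled arm.
// All statistics here are illustrative, not real measurements.
function gaussian(): number {
  // Box-Muller transform: two uniforms -> one standard normal sample
  const u = Math.random() || Number.MIN_VALUE; // avoid log(0)
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

interface Posterior { mean: number; variance: number; n: number }

function thompsonDraw(p: Posterior): number {
  // Sample from the posterior N(mean, variance / n)
  return p.mean + gaussian() * Math.sqrt(p.variance / p.n);
}

const steady: Posterior = { mean: 0.8, variance: 0.01, n: 400 }; // consistent performer
const noisy: Posterior  = { mean: 0.7, variance: 0.2,  n: 50 };  // inconsistent, less data

const trials = 10_000;
let noisyWins = 0;
for (let i = 0; i < trials; i++) {
  if (thompsonDraw(noisy) > thompsonDraw(steady)) noisyWins++;
}
// The steady arm wins the vast majority of draws; the noisy arm still
// earns a small, probability-weighted share of exploration traffic.
const noisyShare = noisyWins / trials;
```

With these numbers the noisy arm wins a few percent of draws: its share of traffic is exactly its probability of being the best arm under the current posteriors, with no exploration coefficient to tune.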

Hard floors

Before any selection runs, arms are filtered by hard floors:

  • Validity floor — threshold ≥ 0.5, applied only after ≥ 10 samples (model-bandit.ts:64)
  • Quality floor — threshold ≥ 0.3, applied only when quality data exists and the arm has ≥ 10 samples (model-bandit.ts:65)

If no arms pass the hard floors (cold start scenario), all arms are allowed as a fallback.
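As a sketch, the gating plus fallback might look like this (the applyHardFloors name and the GatedArm shape are assumptions for illustration, not the actual model-bandit.ts code):

```typescript
// Hypothetical sketch of hard-floor gating with a cold-start fallback.
interface GatedArm {
  validity: number;        // success rate, 0..1
  quality: number | null;  // EWMA tool-call accuracy; null until data exists
  sampleCount: number;
}

function applyHardFloors(
  arms: GatedArm[],
  minValidity = 0.5,
  minQuality = 0.3,
): GatedArm[] {
  const passing = arms.filter((arm) => {
    // Floors only bite once an arm has enough samples to judge (>= 10).
    if (arm.sampleCount >= 10 && arm.validity < minValidity) return false;
    if (arm.sampleCount >= 10 && arm.quality !== null && arm.quality < minQuality) return false;
    return true;
  });
  // Cold-start fallback: if every arm is excluded, allow all of them.
  return passing.length > 0 ? passing : arms;
}
```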

Reward signal

The reward is a composite score derived from:

  • Validity — Did the response complete successfully? (0 or 1)
  • Quality — Tool-call accuracy tracked via EWMA
  • Cost — Relative to the cheapest available option
  • Latency — Relative to the fastest available option
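How the four signals combine into one number isn't spelled out above; as a sketch only, with placeholder weights that may differ from the real ones in model-bandit.ts, a composite reward might look like:

```typescript
// Hypothetical composite reward. The weights here are illustrative
// placeholders, not the codebase's actual values.
interface Outcome {
  valid: boolean;     // did the response complete successfully?
  quality: number;    // EWMA tool-call accuracy, 0..1
  relCost: number;    // 1 = cheapest available option; higher = pricier
  relLatency: number; // 1 = fastest available option; higher = slower
}

function compositeReward(o: Outcome): number {
  if (!o.valid) return 0; // a failed response earns no reward
  const costScore = 1 / o.relCost;       // cheaper -> closer to 1
  const latencyScore = 1 / o.relLatency; // faster -> closer to 1
  // Placeholder weights: quality dominates, cost and latency break ties.
  return 0.5 * o.quality + 0.25 * costScore + 0.25 * latencyScore;
}
```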

Statistics are maintained in a 7-day sliding window using Welford's online algorithm for numerically stable variance computation. The mergeDailyStats() function (model-bandit.ts:187-215) implements parallel Welford merge for combining daily buckets.
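The parallel merge follows the standard formula for combining two Welford summaries. A sketch of the idea (not the actual mergeDailyStats() code):

```typescript
// Welford-style summary for one daily bucket:
// n samples, their mean, and m2 = sum of squared deviations from the mean.
interface RunningStats { n: number; mean: number; m2: number }

// Merge two summaries without revisiting raw samples (parallel Welford).
function mergeStats(a: RunningStats, b: RunningStats): RunningStats {
  if (a.n === 0) return b;
  if (b.n === 0) return a;
  const n = a.n + b.n;
  const delta = b.mean - a.mean;
  return {
    n,
    mean: a.mean + delta * (b.n / n),
    m2: a.m2 + b.m2 + delta * delta * (a.n * b.n) / n,
  };
}
// Variance of the merged window: m2 / n (population) or m2 / (n - 1) (sample).
```

Merging summaries instead of raw samples is what makes a 7-day sliding window cheap: expiring a day means re-merging the six remaining daily buckets, never replaying requests.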

Configuration

All thresholds are configurable per-instance:

const bandit = new ModelBandit({
  minValidity: 0.5, // Hard floor: validity
  minQuality: 0.3, // Hard floor: quality
  initialC: 1.5, // UCB1 exploration start
  minC: 0.5, // UCB1 exploration floor
  thompsonThreshold: 500, // Samples before switching to Thompson
});

What this means in practice

For a tenant sending 100 requests/day across 4 model endpoints:

  • Day 1-2: UCB1 explores all 4 models roughly equally (C=1.5)
  • Day 3-5: UCB1 starts favoring the best performer but still probes (C decaying)
  • Day 5+: Passes 500 samples, switches to Thompson Sampling
  • Steady state: ~80-90% of traffic goes to the best model, ~10-20% explores alternatives
  • If a model degrades: Thompson detects it within ~50 samples as the posterior shifts

The router learns your workload without configuration. You don't set weights. You don't pick strategies. The math does it.