Routing
Thompson sampling, model variants, presets, and fallback chains.
How routing works
Every /v1/chat/completions request passes through BrainstormRouter's model router, which selects the best endpoint based on your routing strategy.
Request → Route Resolution → Strategy Selection → Endpoint → Provider
               ↓
          @preset/slug            → Preset config
          provider/model          → Direct routing
          provider/model:variant  → Variant suffix
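The three forms of the model field in the diagram can be told apart with a small parser. This is only a sketch of the idea: `RouteTarget` and `resolveRoute` are hypothetical names chosen for illustration, and the real resolution logic may differ.

```typescript
// Hypothetical types for illustration; not BrainstormRouter's actual API.
type RouteTarget =
  | { kind: "preset"; slug: string }
  | { kind: "direct"; provider: string; model: string }
  | { kind: "variant"; provider: string; model: string; variant: string };

// Classify a `model` field as a preset reference, a direct route,
// or a direct route carrying a :variant suffix.
function resolveRoute(modelField: string): RouteTarget {
  const presetPrefix = "@preset/";
  if (modelField.startsWith(presetPrefix)) {
    return { kind: "preset", slug: modelField.slice(presetPrefix.length) };
  }

  const slash = modelField.indexOf("/");
  if (slash <= 0 || slash === modelField.length - 1) {
    throw new Error(`Expected "provider/model", got "${modelField}"`);
  }
  const provider = modelField.slice(0, slash);
  const rest = modelField.slice(slash + 1);

  // A colon with text on both sides marks a variant suffix such as ":floor".
  const colon = rest.lastIndexOf(":");
  if (colon > 0 && colon < rest.length - 1) {
    return {
      kind: "variant",
      provider,
      model: rest.slice(0, colon),
      variant: rest.slice(colon + 1),
    };
  }
  return { kind: "direct", provider, model: rest };
}
```

For example, `resolveRoute("anthropic/claude-sonnet-4:floor")` yields a `variant` target with provider `anthropic`, model `claude-sonnet-4`, and variant `floor`.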
Strategies
Set a strategy per-request or save it in a preset:
| Strategy | Optimizes for | When to use |
|---|---|---|
| price | Lowest cost | Batch jobs, embeddings |
| latency | Fastest response | Real-time chat, autocomplete |
| throughput | Highest tokens/sec | Long-form generation |
| priority | Preferred model first | Production with specific model preference |
| quality | Best reward score | High-stakes tasks |
| cascade | Try models in order | Budget-constrained with fallbacks |
curl https://api.brainstormrouter.com/v1/chat/completions \
  -H "Authorization: Bearer br_live_..." \
  -H "Content-Type: application/json" \
  -H "X-Model-Router-Strategy: price" \
  -d '{"model": "anthropic/claude-sonnet-4", "messages": [...]}'
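To make the table concrete, here is a rough sketch of how a strategy could order candidate endpoints. The `EndpointStats` fields and the `rankEndpoints` function are assumptions made for illustration. The source does not say how `priority` and `cascade` differ internally, so both simply keep the caller's order here.

```typescript
type Strategy = "price" | "latency" | "throughput" | "priority" | "quality" | "cascade";

// Hypothetical per-endpoint statistics; the field names are illustrative.
interface EndpointStats {
  id: string;
  costPerMTokUsd: number;
  p50LatencyMs: number;
  tokensPerSec: number;
  rewardScore: number;
}

type Comparator = (a: EndpointStats, b: EndpointStats) => number;

// One comparator per metric-driven strategy. Order-driven strategies
// (priority, cascade) have no comparator and keep the input order.
const comparators: Partial<Record<Strategy, Comparator>> = {
  price: (a, b) => a.costPerMTokUsd - b.costPerMTokUsd,   // cheapest first
  latency: (a, b) => a.p50LatencyMs - b.p50LatencyMs,     // fastest first
  throughput: (a, b) => b.tokensPerSec - a.tokensPerSec,  // highest tokens/sec first
  quality: (a, b) => b.rewardScore - a.rewardScore,       // best reward first
};

// Return a new array ordered by the strategy; the input is not mutated.
function rankEndpoints(strategy: Strategy, endpoints: EndpointStats[]): EndpointStats[] {
  const compare = comparators[strategy];
  const ranked = [...endpoints];
  return compare ? ranked.sort(compare) : ranked;
}
```

The first element of the returned array is the endpoint the strategy would try first; the rest form a natural fallback order.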
Thompson sampling
BrainstormRouter uses Thompson sampling (a multi-armed bandit algorithm) to learn which model+provider endpoints perform best (src/router/model-bandit.ts). The algorithm uses UCB1 for cold-start exploration (< 500 samples) and Gaussian Thompson Sampling for steady-state exploitation (≥ 500 samples). The reward signal is a composite of:
- Validity — Did the response complete successfully? (hard floor: ≥ 0.5)
- Quality — Tool-call accuracy (EWMA, hard floor: ≥ 0.3)
- Cost — Relative to the cheapest option
- Latency — Relative to the fastest option
Over time, the router automatically shifts traffic toward endpoints that deliver the best results for your workload. No manual weight configuration is needed.
See Thompson Sampling deep dive for the full algorithm with code.
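As a minimal illustration of the two-phase selection described above (not the code in src/router/model-bandit.ts), the sketch below scores arms with UCB1 during cold start and with a Gaussian posterior sample afterwards. Several details are assumptions: the 500-sample threshold is applied to the total pull count, rewards are taken to be already combined into a single number in [0, 1], and the names `Arm`, `updateArm`, and `selectArm` are hypothetical.

```typescript
// Running statistics for one model+provider endpoint (hypothetical shape).
interface Arm {
  id: string;
  n: number;    // number of observed rewards
  mean: number; // running mean reward
  m2: number;   // running sum of squared deviations (Welford)
}

// Threshold from the text; applying it to the total count is an assumption.
const COLD_START_SAMPLES = 500;

// Welford's online update of mean and variance.
function updateArm(arm: Arm, reward: number): void {
  arm.n += 1;
  const delta = reward - arm.mean;
  arm.mean += delta / arm.n;
  arm.m2 += delta * (reward - arm.mean);
}

// Standard normal draw via the Box-Muller transform.
function gaussian(rng: () => number): number {
  const u1 = 1 - rng(); // in (0, 1], avoids log(0)
  const u2 = rng();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

function selectArm(arms: Arm[], rng: () => number = Math.random): Arm {
  if (arms.length === 0) throw new Error("no arms to select from");
  // Try every arm once before scoring.
  const untried = arms.find((a) => a.n === 0);
  if (untried) return untried;

  const total = arms.reduce((sum, a) => sum + a.n, 0);
  const score = (a: Arm): number => {
    if (total < COLD_START_SAMPLES) {
      // UCB1: mean plus an exploration bonus that shrinks with more pulls.
      return a.mean + Math.sqrt((2 * Math.log(total)) / a.n);
    }
    // Gaussian Thompson sampling: draw from N(mean, variance / n).
    const variance = a.n > 1 ? a.m2 / (a.n - 1) : 1;
    return a.mean + Math.sqrt(variance / a.n) * gaussian(rng);
  };

  let best = arms[0]!;
  let bestScore = -Infinity;
  for (const arm of arms) {
    const s = score(arm);
    if (s > bestScore) {
      best = arm;
      bestScore = s;
    }
  }
  return best;
}
```

Passing `rng` explicitly keeps the sampler deterministic in tests. In production-style use the default `Math.random` would apply.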
Model variants
Append a variant suffix to any model for quick strategy overrides:
| Variant | Effect |
|---|---|
| :floor | Cheapest available endpoint for this model |
| :fast | Lowest latency endpoint |
| :best | Highest quality score endpoint |
| :smart | Thompson sampling auto-select |
await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4:floor",
  messages: [...],
});
Presets
Save a full routing configuration as a reusable slug:
await client.presets.create({
  slug: "production",
  name: "Production Config",
  model: "anthropic/claude-sonnet-4",
  strategy: "quality",
  fallbacks: ["openai/gpt-4o", "google/gemini-2.0-flash"],
  max_cost_usd: 0.50,
});

// Reference in any completion
await client.chat.completions.create({
  model: "@preset/production",
  messages: [...],
});
See Presets API for full CRUD operations.
Fallback chains
When the primary model fails (rate limit, timeout, error), the router automatically tries fallback models in order:
{
  "model": "anthropic/claude-sonnet-4",
  "route": {
    "fallbacks": ["openai/gpt-4o", "google/gemini-2.0-flash"],
    "strategy": "priority"
  }
}
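Conceptually, the chain behaves like the loop below: try the primary, then each fallback in order, and stop at the first success. This is a client-side sketch of the idea only. The router does this server-side, and `completeWithFallbacks` and `CallModel` are hypothetical names.

```typescript
// Hypothetical callback that sends one request to one model.
type CallModel<T> = (model: string) => Promise<T>;

// Try the primary model, then each fallback in order.
// Resolves with the first successful result and the model that produced it.
async function completeWithFallbacks<T>(
  primary: string,
  fallbacks: string[],
  call: CallModel<T>,
): Promise<{ model: string; result: T }> {
  const errors: string[] = [];
  for (const model of [primary, ...fallbacks]) {
    try {
      return { model, result: await call(model) };
    } catch (err) {
      // Remember why this model failed, then move on to the next one.
      errors.push(`${model}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All models failed: ${errors.join("; ")}`);
}
```

If every model in the chain fails, the combined error lists each model with its failure reason, which makes rate limits and timeouts easier to tell apart.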
Circuit breakers (src/router/circuit-breaker.ts) prevent repeated calls to failing endpoints. When an endpoint trips its circuit breaker (5 consecutive failures or 50% error rate over 60s), it's removed from rotation until it recovers via half-open probing with exponential backoff.
See Circuit Breakers for the state machine.
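The trip and recovery rules above can be sketched as a small state machine. This is an illustration under stated assumptions, not the code in src/router/circuit-breaker.ts: the base backoff of 1 second, the doubling factor, and the minimum of 10 samples before the error-rate rule applies are all invented for the example, as are the class and method names.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private samples: { at: number; ok: boolean }[] = [];
  private openedAt = 0;
  private trips = 0;
  private readonly now: () => number;
  private readonly baseBackoffMs: number;    // assumption: not given in the text
  private readonly minWindowSamples: number; // assumption: not given in the text

  constructor(now: () => number = Date.now, baseBackoffMs = 1000, minWindowSamples = 10) {
    this.now = now;
    this.baseBackoffMs = baseBackoffMs;
    this.minWindowSamples = minWindowSamples;
  }

  // True if a request may be sent. An open breaker moves to half-open
  // once its backoff has elapsed, letting probe traffic through.
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.backoffMs()) {
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  // Record the outcome of a request and update the state.
  record(ok: boolean): void {
    const t = this.now();
    this.samples = this.samples.filter((s) => t - s.at < 60000); // 60s window
    this.samples.push({ at: t, ok });

    if (ok) {
      this.consecutiveFailures = 0;
      if (this.state === "half-open") {
        // Probe succeeded: close the breaker and reset the backoff.
        this.state = "closed";
        this.trips = 0;
        this.samples = [];
      }
      return;
    }

    this.consecutiveFailures += 1;
    const failures = this.samples.filter((s) => !s.ok).length;
    const errorRateTripped =
      this.samples.length >= this.minWindowSamples && failures / this.samples.length >= 0.5;
    if (this.state === "half-open" || this.consecutiveFailures >= 5 || errorRateTripped) {
      this.state = "open";
      this.openedAt = t;
      this.trips += 1; // each trip doubles the next backoff
    }
  }

  getState(): BreakerState {
    return this.state;
  }

  private backoffMs(): number {
    return this.baseBackoffMs * 2 ** (this.trips - 1);
  }
}
```

Injecting `now` makes the timing behavior testable without real delays. A failed probe in the half-open state reopens the breaker with a longer wait, which is what produces the exponential backoff.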
Routing replays
Record routing decisions and simulate alternatives:
# What would traffic since 2025-02-20 have cost under the price strategy?
curl https://api.brainstormrouter.com/v1/replays/simulate \
  -H "Authorization: Bearer br_live_..." \
  -H "Content-Type: application/json" \
  -d '{"strategy": "price", "since": "2025-02-20T00:00:00Z"}'
See Replays API for more.