# The Hot Path

Request lifecycle from ingress to response, with millisecond latency budgets grounded in source code.

## Request lifecycle

Every `/v1/chat/completions` request passes through five phases. Each has a latency budget enforced in code.
```mermaid
%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#d97706', 'lineColor': '#9494a8', 'primaryTextColor': '#e8e8ee'}}}%%
sequenceDiagram
    participant C as Client
    participant G as Gateway
    participant GI as Guardian
    participant R as Router
    participant P as Provider
    participant GI2 as Guardian Post
    C->>G: POST /v1/chat/completions
    Note over G: Auth + tenant context (< 1ms)
    G->>GI: Pre-request middleware
    Note over GI: Cost prediction + PII scan (< 5ms p95)
    GI-->>G: X-BR-Estimated-Cost: $0.03
    G->>R: Route request
    Note over R: Bandit selection (< 1ms)
    Note over R: Circuit breaker check (< 0.1ms)
    R-->>G: Selected: anthropic/claude-sonnet-4
    G->>P: Provider call
    Note over P: Streaming response (500ms min retry)
    P-->>GI2: Response complete
    Note over GI2: Quality score + efficiency calc (< 1ms)
    GI2-->>C: Response + X-BR-* headers
```
## Hot-path firewall

The diagram above shows the _request_ lifecycle. For streaming responses, a second security pipeline runs on the _response_ path — the hot-path firewall:
```mermaid
%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#d97706', 'lineColor': '#9494a8', 'primaryTextColor': '#e8e8ee'}}}%%
sequenceDiagram
    participant P as Provider
    participant SGE as StreamingGuardrailEvaluator
    participant Gov as Governance Validator
    participant PII as PII Scanner
    participant SIEM as SIEM Export
    participant C as Client
    P->>SGE: SSE token stream
    Note over SGE: Buffer tokens (window=20)
    SGE->>Gov: validateChunkGovernance()
    Note over Gov: Keyword match (<1ms)
    Gov-->>SGE: clean
    SGE->>PII: scan() on accumulated text
    Note over PII: Regex: email, phone, SSN, CC, IP
    alt PII detected — action: truncate
        PII-->>SGE: PII match (email, ssn)
        SGE->>SGE: truncated = true
        Note over SGE: All future tokens → ""
        SGE->>SIEM: StreamGuardrailVerdict
        Note over SIEM: toCef() or toSiemJson()
        SGE-->>C: Stream severed
    else PII detected — action: redact
        PII-->>SGE: PII match
        SGE-->>C: [REDACTED:email]
        SGE->>SIEM: Verdict (action: redact)
    else Clean
        PII-->>SGE: No PII
        SGE-->>C: Forward buffered tokens
    end
```
The `StreamingGuardrailEvaluator` (`src/security/streaming-guardrails.ts`) is created by `createOutboundStreamGuardrails()` in the guardrails middleware (`src/api/middleware/guardrails.ts:442-463`). Evaluation order: governance rules → blocked patterns → PII scanning. Once `truncated = true`, every subsequent `processToken()` call returns `""` — the stream is permanently severed.
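The permanent-severing behavior can be sketched in a few lines. This is a minimal sketch, not the real evaluator: it collapses the 20-token buffering and the governance and blocked-pattern checks, and a single naive email regex stands in for the full PII scanner.

```typescript
// Minimal sketch of permanent stream severing, assuming a single PII regex.
class StreamingGuardrailSketch {
  private truncated = false;
  // A naive email matcher stands in for the full scanner
  // (email, phone, SSN, credit card, IP).
  private readonly emailRe = /[\w.+-]+@[\w-]+\.[\w.]+/;

  processToken(token: string, accumulated: string): string {
    // Once severed, every subsequent token is swallowed.
    if (this.truncated) return "";
    if (this.emailRe.test(accumulated + token)) {
      this.truncated = true; // sever permanently
      return "";
    }
    return token;
  }
}
```

Because `truncated` never resets, a single match ends output for the remainder of the stream — there is no recovery path within a response.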
## Phase budgets
| Phase | Budget | Source | Line |
|---|---|---|---|
| Guardian pre-request | < 5ms p95 | `src/api/middleware/guardian.ts` | L16 |
| Bandit selection | < 1ms | `src/router/model-bandit.ts` | L77-157 |
| Circuit breaker check | < 0.1ms | `src/router/circuit-breaker.ts` | L52-75 |
| Provider call min retry | 500ms | `src/router/model-router.ts` | `minDelayMs ?? 500` |
| Post-response headers | < 1ms | `src/api/middleware/guardian.ts` | L260-295 |
## What happens at each phase

### 1. Guardian pre-request (`guardian.ts:127-254`)
Guardian middleware runs after auth. It:
- Estimates input tokens — rough heuristic at ~4 chars/token, plus tool schema size
- Predicts output tokens — EWMA ratio tracker per tenant (alpha=0.1, default ratio=1.0)
- Calculates estimated cost — model pricing tables × predicted tokens
- Checks cost seatbelt — if the `X-BR-Max-Estimated-Cost` header is set and mode is `active`, rejects with 402
- Tracks velocity — Redis counter per tenant+key at minute and hour granularity
```typescript
// From guardian.ts — cost prediction in ~3 lines
const ratio = getOutputInputRatio(tenantId); // EWMA ratio
const outputTokenEstimate = Math.ceil(inputTokenEstimate * ratio);
const estimatedCostCents = Math.round((inputCost + outputCost) / 1e7);
```
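The EWMA ratio tracker behind `getOutputInputRatio` can be sketched as follows. The alpha of 0.1 and default ratio of 1.0 come from the description above; the `Map` storage and the `recordObservedRatio` helper name are illustrative, not the source's API.

```typescript
// Sketch of the per-tenant EWMA output/input ratio tracker
// (alpha = 0.1, default ratio = 1.0; storage is illustrative).
const ALPHA = 0.1;
const DEFAULT_RATIO = 1.0;
const ratios = new Map<string, number>();

function getOutputInputRatio(tenantId: string): number {
  return ratios.get(tenantId) ?? DEFAULT_RATIO;
}

function recordObservedRatio(
  tenantId: string,
  inputTokens: number,
  outputTokens: number,
): void {
  const observed = outputTokens / Math.max(inputTokens, 1);
  const prev = getOutputInputRatio(tenantId);
  // Exponentially weighted moving average: new samples get 10% weight,
  // so one unusual response only nudges the prediction.
  ratios.set(tenantId, ALPHA * observed + (1 - ALPHA) * prev);
}
```

A low alpha trades responsiveness for stability: the predicted output length converges slowly toward a tenant's real usage pattern instead of chasing outliers.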
If Guardian throws, it fails open — the request proceeds with a degraded status header. Guardian errors never block requests.
### 2. Bandit selection (`model-bandit.ts:77-157`)
The Thompson Sampling algorithm picks the best model endpoint:
- Cold start (< 500 samples): UCB1 with decaying exploration coefficient (C: 1.5 → 0.5)
- Steady state (≥ 500 samples): Gaussian Thompson Sampling from posterior N(mean, var/n)
- Hard floors: validity ≥ 0.5, quality ≥ 0.3 (when data exists)
- Zero-sample arms: randomly explored first
See Thompson Sampling deep dive for the full algorithm.
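The two-regime selection above can be sketched as follows. Only the 500-sample threshold, the 1.5 → 0.5 coefficient decay, and the explore-zero-sample-arms-first rule come from the text; the linear decay schedule, the `Arm` shape, and the Box–Muller sampling are assumptions, and the hard floors are omitted for brevity.

```typescript
// Sketch: UCB1 cold start, Gaussian Thompson Sampling steady state.
interface Arm {
  name: string;
  n: number;     // samples observed
  mean: number;  // mean reward
  var_: number;  // reward variance
}

function selectArm(arms: Arm[], totalPulls: number): Arm {
  // Zero-sample arms are explored first.
  const unexplored = arms.filter((a) => a.n === 0);
  if (unexplored.length > 0) {
    return unexplored[Math.floor(Math.random() * unexplored.length)];
  }
  const score = (a: Arm): number => {
    if (a.n < 500) {
      // Cold start: UCB1 with an exploration coefficient that decays
      // from 1.5 toward 0.5 as samples accumulate (schedule assumed linear).
      const c = 1.5 - Math.min(a.n / 500, 1) * 1.0;
      return a.mean + c * Math.sqrt(Math.log(totalPulls) / a.n);
    }
    // Steady state: draw from the posterior N(mean, var/n) via Box-Muller.
    const u1 = 1 - Math.random(); // avoid log(0)
    const u2 = Math.random();
    const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
    return a.mean + z * Math.sqrt(a.var_ / a.n);
  };
  let best = arms[0];
  let bestScore = score(best);
  for (const a of arms.slice(1)) {
    const s = score(a);
    if (s > bestScore) {
      best = a;
      bestScore = s;
    }
  }
  return best;
}
```

The key property: both regimes are a constant-time scoring pass over the arms, which is how selection stays under the 1ms budget.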
### 3. Circuit breaker (`circuit-breaker.ts:52-75`)
A single Map lookup + timestamp comparison. Two triggers can open the circuit:
- Consecutive failures ≥ 5 (default)
- Sliding window error rate ≥ 50% over 60s with minimum 10 requests
When a circuit opens, the endpoint is removed from rotation. Recovery is automatic via half-open probing with exponential backoff (base 30s, cap 16×).
See Circuit Breakers for the state machine.
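The hot-path check itself can be sketched as exactly that Map lookup plus a timestamp comparison. The 30s base and 16× cap come from the text; the field names, the doubling schedule, and the helper signatures are assumptions.

```typescript
// Sketch of the O(1) availability check and exponential backoff.
interface BreakerState {
  openedAt: number;  // ms timestamp when the circuit opened
  backoffMs: number; // current probe delay
}

const circuits = new Map<string, BreakerState>();
const BASE_BACKOFF_MS = 30_000; // base 30s
const MAX_BACKOFF_MULT = 16;    // cap 16x

function isAvailable(endpoint: string, now: number): boolean {
  const state = circuits.get(endpoint);
  if (!state) return true; // closed circuit: no entry in the map
  // Open circuit: allow a half-open probe once the backoff elapses.
  return now - state.openedAt >= state.backoffMs;
}

function recordOpen(endpoint: string, now: number): void {
  const prev = circuits.get(endpoint);
  // Exponential backoff: double the probe delay up to the 16x cap.
  const backoffMs = prev
    ? Math.min(prev.backoffMs * 2, BASE_BACKOFF_MS * MAX_BACKOFF_MULT)
    : BASE_BACKOFF_MS;
  circuits.set(endpoint, { openedAt: now, backoffMs });
}
```

Keeping the check to a single lookup is what makes the 0.1ms budget realistic even with many endpoints in rotation.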
### 4. Provider call
The selected endpoint receives the request. On failure:
- Minimum retry delay: 500ms
- Circuit breaker records the failure
- Fallback resolver tries the next endpoint in the chain
- If all endpoints fail, the request returns an error with routing metadata
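The failure path above can be sketched as a loop over the fallback chain. Only the 500ms minimum delay is from the source; `callProvider` and `recordFailure` are hypothetical placeholders for the provider client and the circuit breaker hook.

```typescript
// Sketch of the fallback loop: try each endpoint, record failures,
// wait at least the minimum retry delay between attempts.
const MIN_RETRY_DELAY_MS = 500; // minDelayMs ?? 500

async function callWithFallback<T>(
  chain: string[],
  callProvider: (endpoint: string) => Promise<T>,
  recordFailure: (endpoint: string) => void,
): Promise<T> {
  const errors: string[] = [];
  for (let i = 0; i < chain.length; i++) {
    try {
      return await callProvider(chain[i]);
    } catch (err) {
      recordFailure(chain[i]); // the circuit breaker sees every failure
      errors.push(`${chain[i]}: ${String(err)}`);
      if (i < chain.length - 1) {
        await new Promise<void>((r) => setTimeout(r, MIN_RETRY_DELAY_MS));
      }
    }
  }
  // All endpoints failed: surface the per-endpoint routing metadata.
  throw new Error(`all endpoints failed: ${errors.join("; ")}`);
}
```

The minimum delay only applies between attempts, so a first-try success adds no latency.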
### 5. Post-response headers (`guardian.ts:260-295`)
After the provider responds, Guardian injects response headers:
```text
X-BR-Estimated-Cost: $0.03
X-BR-Actual-Cost: $0.028
X-BR-Efficiency: 0.87
X-BR-Guardian-Overhead-Ms: 1.2
X-BR-Cache: semantic (if cache hit)
X-BR-Savings: $0.028 (if cache hit)
```
For streaming responses, the Guardian payload is appended to the SSE `[DONE]` event as a JSON object.
## Code vs reality
A standard OpenAI SDK request looks identical:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.brainstormrouter.com/v1",
    api_key="br_live_...",
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4",
    messages=[{"role": "user", "content": "Explain circuit breakers."}],
)
```
But the response headers reveal what happened under the hood:
```text
X-BR-Guardian-Status: on
X-BR-Estimated-Cost: $0.02
X-BR-Actual-Cost: $0.018
X-BR-Efficiency: 0.91
X-BR-Guardian-Overhead-Ms: 0.8
```
The request was cost-predicted, routed through Thompson Sampling, circuit-breaker checked, quality-scored, and efficiency-calculated — all in under 2ms of added latency.