The Hot Path

Request lifecycle from ingress to response, with millisecond latency budgets grounded in source code.

Request lifecycle

Every /v1/chat/completions request passes through five phases. Each has a latency budget enforced in code.

%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#d97706', 'lineColor': '#9494a8', 'primaryTextColor': '#e8e8ee'}}}%%
sequenceDiagram
    participant C as Client
    participant G as Gateway
    participant GI as Guardian
    participant R as Router
    participant P as Provider
    participant GI2 as Guardian Post

    C->>G: POST /v1/chat/completions
    Note over G: Auth + tenant context (< 1ms)

    G->>GI: Pre-request middleware
    Note over GI: Cost prediction + PII scan (< 5ms p95)
    GI-->>G: X-BR-Estimated-Cost: $0.03

    G->>R: Route request
    Note over R: Bandit selection (< 1ms)
    Note over R: Circuit breaker check (< 0.1ms)
    R-->>G: Selected: anthropic/claude-sonnet-4

    G->>P: Provider call
    Note over P: Streaming response (500ms min retry)

    P-->>GI2: Response complete
    Note over GI2: Quality score + efficiency calc (< 1ms)
    GI2-->>C: Response + X-BR-* headers

Hot-path firewall

The diagram above shows the _request_ lifecycle. For streaming responses, a second security pipeline runs on the _response_ path — the hot-path firewall:

%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#d97706', 'lineColor': '#9494a8', 'primaryTextColor': '#e8e8ee'}}}%%
sequenceDiagram
    participant P as Provider
    participant SGE as StreamingGuardrailEvaluator
    participant Gov as Governance Validator
    participant PII as PII Scanner
    participant SIEM as SIEM Export
    participant C as Client

    P->>SGE: SSE token stream
    Note over SGE: Buffer tokens (window=20)

    SGE->>Gov: validateChunkGovernance()
    Note over Gov: Keyword match (<1ms)
    Gov-->>SGE: clean

    SGE->>PII: scan() on accumulated text
    Note over PII: Regex: email, phone, SSN, CC, IP

    alt PII detected — action: truncate
        PII-->>SGE: PII match (email, ssn)
        SGE->>SGE: truncated = true
        Note over SGE: All future tokens → ""
        SGE->>SIEM: StreamGuardrailVerdict
        Note over SIEM: toCef() or toSiemJson()
        SGE-->>C: Stream severed
    else PII detected — action: redact
        PII-->>SGE: PII match
        SGE-->>C: [REDACTED:email]
        SGE->>SIEM: Verdict (action: redact)
    else Clean
        PII-->>SGE: No PII
        SGE-->>C: Forward buffered tokens
    end

The StreamingGuardrailEvaluator (src/security/streaming-guardrails.ts) is created by createOutboundStreamGuardrails() in the guardrails middleware (src/api/middleware/guardrails.ts:442-463). Evaluation order: governance rules → blocked patterns → PII scanning. Once truncated = true, every subsequent processToken() call returns "" — the stream is permanently severed.
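The permanent-severing behavior can be sketched in a few lines. This is an illustrative simplification, not the actual StreamingGuardrailEvaluator: it shows only the PII stage and the sticky truncated flag, with a made-up class name and regex.

```typescript
// Sketch of the severing behavior described above. Only the PII stage is
// shown; the real evaluator also runs governance rules and blocked patterns
// first, and buffers tokens in a window of 20.
class StreamEvaluator {
  private truncated = false;
  private buffer = "";

  constructor(private readonly piiPattern: RegExp) {}

  processToken(token: string): string {
    if (this.truncated) return ""; // stream permanently severed
    this.buffer += token; // scan accumulated text, not just this token
    if (this.piiPattern.test(this.buffer)) {
      this.truncated = true; // every later call now returns ""
      return "";
    }
    return token;
  }
}

// Usage: once PII appears in the accumulated text, the stream is severed.
const ev = new StreamEvaluator(/[\w.]+@[\w.]+\.\w{2,}/);
ev.processToken("contact ");       // forwarded as-is
ev.processToken("a@b.com");        // returns "" — PII detected, truncated = true
ev.processToken(" anything else"); // returns "" — permanently severed
```

Scanning the accumulated buffer rather than individual tokens is what catches PII that spans token boundaries.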

Phase budgets

| Phase | Budget | Source | Line |
| --- | --- | --- | --- |
| Guardian pre-request | < 5ms p95 | src/api/middleware/guardian.ts | L16 |
| Bandit selection | < 1ms | src/router/model-bandit.ts | L77-157 |
| Circuit breaker check | < 0.1ms | src/router/circuit-breaker.ts | L52-75 |
| Provider call min retry | 500ms | src/router/model-router.ts | minDelayMs ?? 500 |
| Post-response headers | < 1ms | src/api/middleware/guardian.ts | L260-295 |

What happens at each phase

1. Guardian pre-request (guardian.ts:127-254)

Guardian middleware runs after auth. It:

  1. Estimates input tokens — rough heuristic at ~4 chars/token, plus tool schema size
  2. Predicts output tokens — EWMA ratio tracker per tenant (alpha=0.1, default ratio=1.0)
  3. Calculates estimated cost — model pricing tables × predicted tokens
  4. Checks cost seatbelt — if X-BR-Max-Estimated-Cost header is set and mode is active, rejects with 402
  5. Tracks velocity — Redis counter per tenant+key at minute and hour granularity
// From guardian.ts — cost prediction in ~3 lines
// (inputTokenEstimate, inputCost, and outputCost are computed earlier in the middleware)
const ratio = getOutputInputRatio(tenantId); // per-tenant EWMA output/input ratio
const outputTokenEstimate = Math.ceil(inputTokenEstimate * ratio);
const estimatedCostCents = Math.round((inputCost + outputCost) / 1e7);
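The EWMA ratio tracker behind getOutputInputRatio (alpha=0.1, default ratio 1.0, per the steps above) can be sketched as follows. The class name and method signatures are hypothetical; only the constants come from the text.

```typescript
// Hypothetical sketch of the per-tenant EWMA output/input ratio tracker.
// alpha = 0.1 and default ratio = 1.0 are taken from the description above.
class RatioTracker {
  private readonly ratios = new Map<string, number>();

  get(tenantId: string): number {
    return this.ratios.get(tenantId) ?? 1.0; // default ratio for new tenants
  }

  // Blend each observed output/input ratio into the running estimate.
  observe(tenantId: string, inputTokens: number, outputTokens: number): void {
    const alpha = 0.1;
    const observed = outputTokens / Math.max(1, inputTokens);
    const prev = this.get(tenantId);
    this.ratios.set(tenantId, alpha * observed + (1 - alpha) * prev);
  }
}
```

With alpha=0.1, a single outlier request moves the prediction only 10% of the way toward the observed ratio, so per-tenant estimates stay stable.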

If Guardian throws, it fails open — the request proceeds with a degraded status header. Guardian errors never block requests.
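That fail-open contract can be sketched as a middleware wrapper. The helper name, context shape, and "degraded" header value are illustrative assumptions; only the behavior (errors set a degraded status and never block) comes from the text.

```typescript
// Sketch of the fail-open contract: Guardian errors degrade, never block.
// runGuardianChecks, Ctx, and the "degraded" value are assumptions.
type Ctx = { headers: Record<string, string> };

async function guardianMiddleware(
  ctx: Ctx,
  runGuardianChecks: (ctx: Ctx) => Promise<void>,
  next: () => Promise<void>,
): Promise<void> {
  try {
    await runGuardianChecks(ctx);
    ctx.headers["X-BR-Guardian-Status"] = "on";
  } catch {
    // Fail open: mark the request degraded instead of rejecting it.
    ctx.headers["X-BR-Guardian-Status"] = "degraded";
  }
  await next(); // the request proceeds either way
}
```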

2. Bandit selection (model-bandit.ts:77-157)

The Thompson Sampling algorithm picks the best model endpoint:

  • Cold start (< 500 samples): UCB1 with decaying exploration coefficient (C: 1.5 → 0.5)
  • Steady state (≥ 500 samples): Gaussian Thompson Sampling from posterior N(mean, var/n)
  • Hard floors: validity ≥ 0.5, quality ≥ 0.3 (when data exists)
  • Zero-sample arms: randomly explored first

See Thompson Sampling deep dive for the full algorithm.
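The two regimes can be compressed into a sketch. The arm state, decay schedule, and scoring are simplified reconstructions of the bullets above (the real model-bandit.ts also enforces the validity/quality floors, which are omitted here).

```typescript
// Illustrative two-regime selection: UCB1 during cold start, Gaussian
// Thompson Sampling at steady state. Floors and bookkeeping omitted.
type Arm = { n: number; mean: number; variance: number };

function gaussianSample(mean: number, variance: number): number {
  // Box-Muller transform for a normal draw.
  const u = Math.random() || 1e-12;
  const v = Math.random();
  const z = Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
  return mean + Math.sqrt(variance) * z;
}

function selectArm(arms: Arm[], totalSamples: number): number {
  // Zero-sample arms are explored first.
  const cold = arms.findIndex((a) => a.n === 0);
  if (cold !== -1) return cold;

  let best = 0;
  let bestScore = -Infinity;
  for (let i = 0; i < arms.length; i++) {
    const a = arms[i];
    let score: number;
    if (totalSamples < 500) {
      // Cold start: UCB1 with exploration coefficient decaying 1.5 → 0.5.
      const c = 1.5 - (totalSamples / 500) * 1.0;
      score = a.mean + c * Math.sqrt(Math.log(totalSamples) / a.n);
    } else {
      // Steady state: Thompson sample from the posterior N(mean, var/n).
      score = gaussianSample(a.mean, a.variance / a.n);
    }
    if (score > bestScore) {
      bestScore = score;
      best = i;
    }
  }
  return best;
}
```

Dividing the variance by n is what makes well-sampled arms draw tightly around their mean while sparse arms keep a wide, exploratory posterior.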

3. Circuit breaker (circuit-breaker.ts:52-75)

A single Map lookup + timestamp comparison. Two triggers can open the circuit:

  • Consecutive failures ≥ 5 (default)
  • Sliding window error rate ≥ 50% over 60s with minimum 10 requests

When a circuit opens, the endpoint is removed from rotation. Recovery is automatic via half-open probing with exponential backoff (base 30s, cap 16×).
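The hot-path check itself is cheap enough to sketch directly: one Map lookup and one timestamp comparison. The state shape and function name are illustrative; the 30s base and 16× cap come from the text.

```typescript
// Sketch of the hot-path check: one Map lookup plus a timestamp compare.
// BreakerState and isCallAllowed are illustrative names.
type BreakerState = { openedAt: number; backoffMultiplier: number };

const BASE_BACKOFF_MS = 30_000; // base 30s
const MAX_MULTIPLIER = 16;      // cap 16×

function isCallAllowed(
  breakers: Map<string, BreakerState>,
  endpoint: string,
  now: number,
): boolean {
  const state = breakers.get(endpoint);
  if (!state) return true; // circuit closed — no entry, no overhead
  const backoff =
    BASE_BACKOFF_MS * Math.min(state.backoffMultiplier, MAX_MULTIPLIER);
  // Once the backoff elapses, allow a half-open probe request through.
  return now - state.openedAt >= backoff;
}
```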

See Circuit Breakers for the state machine.

4. Provider call

The selected endpoint receives the request. On failure:

  • Minimum retry delay: 500ms
  • Circuit breaker records the failure
  • Fallback resolver tries the next endpoint in the chain
  • If all endpoints fail, the request returns an error with routing metadata
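The failure path above can be sketched as a fallback loop. This is an assumption-laden simplification: callProvider and recordFailure are stand-ins, and the placement of the 500ms delay between attempts is an interpretation of the minimum retry delay, not the actual model-router.ts logic.

```typescript
// Sketch of the failure path: record each failure for the circuit breaker,
// wait at least the minimum retry delay, then try the next endpoint.
const MIN_RETRY_DELAY_MS = 500; // minDelayMs ?? 500

async function callWithFallback<T>(
  endpoints: string[],
  callProvider: (endpoint: string) => Promise<T>,
  recordFailure: (endpoint: string) => void,
): Promise<T> {
  let lastError: unknown;
  for (const endpoint of endpoints) {
    try {
      return await callProvider(endpoint);
    } catch (err) {
      lastError = err;
      recordFailure(endpoint); // circuit breaker sees the failure
      await new Promise((r) => setTimeout(r, MIN_RETRY_DELAY_MS));
    }
  }
  // All endpoints failed: surface the last error to the caller.
  throw lastError;
}
```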

5. Post-response headers (guardian.ts:260-295)

After the provider responds, Guardian injects response headers:

X-BR-Estimated-Cost: $0.03
X-BR-Actual-Cost: $0.028
X-BR-Efficiency: 0.87
X-BR-Guardian-Overhead-Ms: 1.2
X-BR-Cache: semantic           (if cache hit)
X-BR-Savings: $0.028           (if cache hit)

For streaming responses, the Guardian payload is appended to the SSE [DONE] event as a JSON object.

Code vs reality

A standard OpenAI SDK request looks identical:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.brainstormrouter.com/v1",
    api_key="br_live_...",
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4",
    messages=[{"role": "user", "content": "Explain circuit breakers."}],
)

But the response headers reveal what happened under the hood:

X-BR-Guardian-Status: on
X-BR-Estimated-Cost: $0.02
X-BR-Actual-Cost: $0.018
X-BR-Efficiency: 0.91
X-BR-Guardian-Overhead-Ms: 0.8

The request was cost-predicted, routed through Thompson Sampling, circuit-breaker checked, quality-scored, and efficiency-calculated — all in under 2ms of added latency.