The Hot Path

Request lifecycle from ingress to response, with millisecond latency budgets grounded in source code.

Request lifecycle

Every /v1/chat/completions request passes through five phases. Each has a latency budget enforced in code.

%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#d97706', 'lineColor': '#9494a8', 'primaryTextColor': '#e8e8ee'}}}%%
sequenceDiagram
    participant C as Client
    participant G as Gateway
    participant GI as Guardian
    participant R as Router
    participant P as Provider
    participant GI2 as Guardian Post

    C->>G: POST /v1/chat/completions
    Note over G: Auth + tenant context (< 1ms)

    G->>GI: Pre-request middleware
    Note over GI: Cost prediction + PII scan (< 5ms p95)
    GI-->>G: X-BR-Estimated-Cost: $0.03

    G->>R: Route request
    Note over R: Bandit selection (< 1ms)
    Note over R: Circuit breaker check (< 0.1ms)
    R-->>G: Selected: anthropic/claude-sonnet-4

    G->>P: Provider call
    Note over P: Streaming response (500ms min retry)

    P-->>GI2: Response complete
    Note over GI2: Quality score + efficiency calc (< 1ms)
    GI2-->>C: Response + X-BR-* headers

Hot-path firewall

The diagram above shows the _request_ lifecycle. For streaming responses, a second security pipeline runs on the _response_ path — the hot-path firewall:

%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#d97706', 'lineColor': '#9494a8', 'primaryTextColor': '#e8e8ee'}}}%%
sequenceDiagram
    participant P as Provider
    participant SGE as StreamingGuardrailEvaluator
    participant Gov as Governance Validator
    participant PII as PII Scanner
    participant SIEM as SIEM Export
    participant C as Client

    P->>SGE: SSE token stream
    Note over SGE: Buffer tokens (window=20)

    SGE->>Gov: validateChunkGovernance()
    Note over Gov: Keyword match (<1ms)
    Gov-->>SGE: clean

    SGE->>PII: scan() on accumulated text
    Note over PII: Regex: email, phone, SSN, CC, IP

    alt PII detected — action: truncate
        PII-->>SGE: PII match (email, ssn)
        SGE->>SGE: truncated = true
        Note over SGE: All future tokens → ""
        SGE->>SIEM: StreamGuardrailVerdict
        Note over SIEM: toCef() or toSiemJson()
        SGE-->>C: Stream severed
    else PII detected — action: redact
        PII-->>SGE: PII match
        SGE-->>C: [REDACTED:email]
        SGE->>SIEM: Verdict (action: redact)
    else Clean
        PII-->>SGE: No PII
        SGE-->>C: Forward buffered tokens
    end

The StreamingGuardrailEvaluator (src/security/streaming-guardrails.ts) is created by createOutboundStreamGuardrails() in the guardrails middleware (src/api/middleware/guardrails.ts:442-463). Evaluation order: governance rules → blocked patterns → PII scanning. Once truncated = true, every subsequent processToken() call returns "" — the stream is permanently severed.
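
The truncate and redact branches can be illustrated with a minimal sketch. This is not the actual StreamingGuardrailEvaluator: the class name, constructor parameters, and the email-only regex are assumptions for illustration. Only the 20-token window, the processToken() name, and the permanent truncated flag come from the description above.

```typescript
// Hypothetical sketch; the real evaluator lives in src/security/streaming-guardrails.ts.
type GuardAction = "truncate" | "redact";

class SketchStreamGuard {
  private buffer: string[] = [];
  truncated = false;

  constructor(
    private readonly action: GuardAction,
    private readonly windowSize: number = 20, // window=20 per the diagram
    // Illustrative email-only pattern; the real scanner also covers phone, SSN, CC, IP.
    private readonly pii: RegExp = /[\w.+-]+@[\w-]+\.[\w.-]+/,
  ) {}

  /** Returns the text that may be forwarded to the client after this token. */
  processToken(token: string): string {
    if (this.truncated) return ""; // once severed, always severed
    this.buffer.push(token);
    return this.buffer.length >= this.windowSize ? this.flush() : "";
  }

  /** Scans and releases whatever is buffered; also call once at end of stream. */
  flush(): string {
    if (this.truncated) return "";
    const text = this.buffer.join("");
    this.buffer = [];
    if (!this.pii.test(text)) return text; // clean: forward buffered tokens
    if (this.action === "truncate") {
      this.truncated = true;
      return "";
    }
    return text.replace(new RegExp(this.pii.source, "g"), "[REDACTED:email]");
  }
}
```

Unlike the real scanner, which runs on accumulated text, this sketch scans each window in isolation, so a match split across two windows would slip through. It is meant only to show the state transition, not the detection logic.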

Phase budgets

| Phase | Budget | Source | Line |
| --- | --- | --- | --- |
| Guardian pre-request | < 5ms p95 | src/api/middleware/guardian.ts | L16 |
| Bandit selection | < 1ms | src/router/model-bandit.ts | L77-157 |
| Circuit breaker check | < 0.1ms | src/router/circuit-breaker.ts | L52-75 |
| Provider call min retry | 500ms | src/router/model-router.ts | minDelayMs ?? 500 |
| Post-response headers | < 1ms | src/api/middleware/guardian.ts | L260-295 |

What happens at each phase

1. Guardian pre-request (guardian.ts:127-254)

Guardian middleware runs after auth. It:

  1. Estimates input tokens — rough heuristic at ~4 chars/token, plus tool schema size
  2. Predicts output tokens — EWMA ratio tracker per tenant (alpha=0.1, default ratio=1.0)
  3. Calculates estimated cost — model pricing tables × predicted tokens
  4. Checks cost seatbelt — if X-BR-Max-Estimated-Cost header is set and mode is active, rejects with 402
  5. Tracks velocity — Redis counter per tenant+key at minute and hour granularity

// From guardian.ts: cost prediction in ~3 lines
const ratio = getOutputInputRatio(tenantId); // EWMA ratio
const outputTokenEstimate = Math.ceil(inputTokenEstimate * ratio);
const estimatedCostCents = Math.round((inputCost + outputCost) / 1e7);

If Guardian throws, it fails open — the request proceeds with a degraded status header. Guardian errors never block requests.
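
Steps 1 to 3 can be sketched as follows. This is an illustration under stated assumptions, not the code in guardian.ts. Apart from getOutputInputRatio, the function names are hypothetical, and so are the EWMA update form and the pricing unit (price per one million tokens). The ~4 chars/token heuristic, alpha=0.1, and the default ratio of 1.0 come from the list above.

```typescript
// Hypothetical sketch of Guardian's prediction steps; names and pricing units are assumptions.
const ALPHA = 0.1;         // EWMA smoothing factor from the text
const DEFAULT_RATIO = 1.0; // used until a tenant has history
const ratios = new Map<string, number>();

/** Rough input estimate: ~4 characters per token, plus tool schema size. */
function estimateInputTokens(prompt: string, toolSchemaChars: number = 0): number {
  return Math.ceil((prompt.length + toolSchemaChars) / 4);
}

function getOutputInputRatio(tenantId: string): number {
  return ratios.get(tenantId) ?? DEFAULT_RATIO;
}

/** Folds one observed completion into the tenant's output/input ratio. */
function recordObservation(tenantId: string, inputTokens: number, outputTokens: number): void {
  if (inputTokens <= 0) return;
  const observed = outputTokens / inputTokens;
  const previous = getOutputInputRatio(tenantId);
  ratios.set(tenantId, ALPHA * observed + (1 - ALPHA) * previous);
}

/** Assumed pricing shape: price per one million tokens, separately for input and output. */
function estimateCost(inputTokens: number, ratio: number, inPerMillion: number, outPerMillion: number): number {
  const outputTokens = Math.ceil(inputTokens * ratio);
  return (inputTokens * inPerMillion + outputTokens * outPerMillion) / 1e6;
}
```

With alpha=0.1, a single unusually long completion moves the tenant's ratio by only a tenth of the gap, so the prediction reacts slowly to outliers.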

2. Bandit selection (model-bandit.ts:77-157)

The bandit picks a model endpoint, using UCB1 during cold start and Thompson Sampling once enough samples exist:

  • Cold start (< 500 samples): UCB1 with decaying exploration coefficient (C: 1.5 → 0.5)
  • Steady state (≥ 500 samples): Gaussian Thompson Sampling from posterior N(mean, var/n)
  • Hard floors: validity ≥ 0.5, quality ≥ 0.3 (when data exists)
  • Zero-sample arms: randomly explored first

See Thompson Sampling deep dive for the full algorithm.
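
The four rules above can be sketched as a single selection function. This is a simplified illustration, not the code in model-bandit.ts. The Arm shape, the linear decay of the exploration coefficient, counting the 500-sample threshold across all arms, and the injected random source are all assumptions.

```typescript
// Hypothetical sketch; arm shape, linear decay, and the injected RNG are assumptions.
interface Arm {
  id: string;
  n: number;         // samples observed
  mean: number;      // mean reward
  variance: number;  // reward variance
  validity?: number; // undefined when no data exists
  quality?: number;  // undefined when no data exists
}

const COLD_START_SAMPLES = 500;

/** Standard normal draw via the Box-Muller transform. */
function gaussian(rand: () => number): number {
  const u = 1 - rand(); // in (0, 1], avoids log(0)
  const v = rand();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

function selectArm(arms: Arm[], rand: () => number = Math.random): Arm | undefined {
  // Hard floors apply only when the metric has data.
  const eligible = arms.filter(a =>
    (a.validity === undefined || a.validity >= 0.5) &&
    (a.quality === undefined || a.quality >= 0.3));
  if (eligible.length === 0) return undefined;

  // Zero-sample arms are explored first, chosen at random.
  const unseen = eligible.filter(a => a.n === 0);
  if (unseen.length > 0) return unseen[Math.floor(rand() * unseen.length)];

  const total = eligible.reduce((sum, a) => sum + a.n, 0);
  const coldStart = total < COLD_START_SAMPLES;
  const c = 1.5 - total / COLD_START_SAMPLES; // decays from 1.5 toward 0.5

  let best: Arm | undefined;
  let bestScore = -Infinity;
  for (const a of eligible) {
    const score = coldStart
      ? a.mean + c * Math.sqrt(Math.log(total) / a.n)        // UCB1 bonus
      : a.mean + Math.sqrt(a.variance / a.n) * gaussian(rand); // draw from N(mean, var/n)
    if (score > bestScore) {
      best = a;
      bestScore = score;
    }
  }
  return best;
}
```

Each arm is scored exactly once per call, so the Thompson draw for an arm is not resampled during comparison. Passing a seeded `rand` makes selection reproducible in tests.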

3. Circuit breaker (circuit-breaker.ts:52-75)

A single Map lookup + timestamp comparison. Two triggers can open the circuit:

  • Consecutive failures ≥ 5 (default)
  • Sliding window error rate ≥ 50% over 60s with minimum 10 requests

When a circuit opens, the endpoint is removed from rotation. Recovery is automatic via half-open probing with exponential backoff (base 30s, cap 16×).

See Circuit Breakers for the state machine.
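
As a rough illustration of the two triggers and the backoff, here is a minimal sketch using the default thresholds quoted above. The class, field names, and the reading of "cap 16×" as a multiplier on the 30s base are assumptions. Half-open probing is simplified to "requests are allowed again once the open period elapses".

```typescript
// Hypothetical sketch; names are assumptions, thresholds are the defaults from the text.
interface BreakerState {
  consecutiveFailures: number;
  events: { at: number; ok: boolean }[];
  openUntil: number; // epoch ms; 0 means never opened
  trips: number;     // consecutive opens, drives the backoff
}

class SketchBreaker {
  private states = new Map<string, BreakerState>();

  /** The hot-path check: one Map lookup plus a timestamp comparison. */
  isAvailable(endpoint: string, now: number = Date.now()): boolean {
    const s = this.states.get(endpoint);
    return s === undefined || now >= s.openUntil;
  }

  record(endpoint: string, ok: boolean, now: number = Date.now()): void {
    let s = this.states.get(endpoint);
    if (s === undefined) {
      s = { consecutiveFailures: 0, events: [], openUntil: 0, trips: 0 };
      this.states.set(endpoint, s);
    }
    s.events = s.events.filter(e => now - e.at <= 60000); // 60s sliding window
    s.events.push({ at: now, ok });
    if (ok) {
      s.consecutiveFailures = 0;
      s.trips = 0; // a success resets the backoff
      return;
    }
    s.consecutiveFailures += 1;
    const failures = s.events.filter(e => !e.ok).length;
    const rateTripped = s.events.length >= 10 && failures / s.events.length >= 0.5;
    if (s.consecutiveFailures >= 5 || rateTripped) {
      const multiplier = Math.min(2 ** s.trips, 16); // exponential backoff, capped at 16x
      s.openUntil = now + 30000 * multiplier;        // base 30s
      s.trips += 1;
    }
  }
}
```

All bookkeeping happens in record(), off the selection path, which is what keeps isAvailable() down to a lookup and a comparison.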

4. Provider call

The selected endpoint receives the request. On failure:

  • Minimum retry delay: 500ms
  • Circuit breaker records the failure
  • Fallback resolver tries the next endpoint in the chain
  • If all endpoints fail, the request returns an error with routing metadata

5. Post-response headers (guardian.ts:260-295)

After the provider responds, Guardian injects response headers:

X-BR-Estimated-Cost: $0.03
X-BR-Actual-Cost: $0.028
X-BR-Efficiency: 0.87
X-BR-Guardian-Overhead-Ms: 1.2
X-BR-Cache: semantic           (if cache hit)
X-BR-Savings: $0.028           (if cache hit)

For streaming responses, the Guardian payload is appended to the SSE [DONE] event as a JSON object.

Code vs reality

From the client's side, a request made with the standard OpenAI SDK looks the same as any other; only the base URL and API key change:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.brainstormrouter.com/v1",
    api_key="br_live_...",
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4",
    messages=[{"role": "user", "content": "Explain circuit breakers."}],
)

But the response headers reveal what happened under the hood:

X-BR-Guardian-Status: on
X-BR-Estimated-Cost: $0.02
X-BR-Actual-Cost: $0.018
X-BR-Efficiency: 0.91
X-BR-Guardian-Overhead-Ms: 0.8
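
A client that wants these values as numbers could parse them with a small helper. The sketch below is hypothetical: the header names come from the example above, while the HeaderSource interface (shaped like the fetch Headers object), the result type, and the parsing rules are assumptions.

```typescript
// Hypothetical helper; header names are from the example, parsing rules are assumptions.
interface HeaderSource {
  get(name: string): string | null; // same shape as fetch's Headers.get
}

interface GuardianSummary {
  status: string | null;
  estimatedCost: number | null;
  actualCost: number | null;
  efficiency: number | null;
  overheadMs: number | null;
}

/** Parses values such as "$0.018" or "0.91"; returns null when absent or malformed. */
function parseAmount(raw: string | null): number | null {
  if (raw === null) return null;
  const cleaned = raw.trim().replace(/^\$/, "");
  if (cleaned === "") return null;
  const value = Number(cleaned);
  return Number.isFinite(value) ? value : null;
}

function readGuardianHeaders(headers: HeaderSource): GuardianSummary {
  return {
    status: headers.get("X-BR-Guardian-Status"),
    estimatedCost: parseAmount(headers.get("X-BR-Estimated-Cost")),
    actualCost: parseAmount(headers.get("X-BR-Actual-Cost")),
    efficiency: parseAmount(headers.get("X-BR-Efficiency")),
    overheadMs: parseAmount(headers.get("X-BR-Guardian-Overhead-Ms")),
  };
}
```

Returning null for missing headers keeps the helper safe when Guardian has failed open and some values are absent.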

Per-request DB connection ledger (R18 hardening)

The hot path's binding constraint at scale is DB connection pool occupancy per request. PR #198's load test (50 RPS, pool=50, 2 ECS tasks = 100 connections) produced a 99.45% failure rate because each request held 2-3 connections simultaneously. R18 closed both contributors:

  • Auth lookupTenantStatus N+1 query (PR #204): Every authenticated request fired a second DB query for tenant status, even on a Redis cache hit. Fixed by JOINing tenants into lookupKeyInDb and folding tenantStatus into the Redis-cached blob. Cache-hit requests now hold 0 DB connections for auth.
  • Budget tenant-limits per-request transaction (PR #203): getTenantBudgetLimits opened a full transaction (BEGIN/SET LOCAL/SELECT/COMMIT) on every budgeted request. Fixed by mirroring the existing tenant-config.ts settingsCache pattern with a 30s in-process Map cache. Cache invalidation propagates from the PUT route so dashboard writes aren't masked.

After both fixes, a typical authenticated request holds 0 DB connections through the hot path (assuming Redis cache hits). The actual model call still uses connections downstream (cost recording, audit chain), but those release immediately and don't gate request throughput.
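
The 30s in-process cache with write-side invalidation can be sketched like this. It is an illustration of the pattern only, not the settingsCache in tenant-config.ts. The class name, method names, and injectable clock are assumptions.

```typescript
// Hypothetical sketch of a 30s in-process cache; names and the clock parameter are assumptions.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(
    private readonly ttlMs: number = 30000,
    private readonly now: () => number = () => Date.now(),
  ) {}

  /** Returns the cached value, or undefined when missing or expired. */
  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  /** Called from the write path so that updates are visible immediately. */
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}
```

A read path would call get() first and open a DB transaction only on a miss, storing the result with set(). The PUT handler would call invalidate() for the tenant it just updated. Note that an in-process Map only invalidates the task that served the write; other tasks would still rely on the 30s expiry.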

Redis round-trips on the hot path (R18 hardening)

Upstash Free has ~15ms RTT from us-east-1, so every sequential awaited Redis call adds directly to user-visible latency. R18 collapsed several hot-path patterns:

  • Budget recordSpend (PR #203): 6 sequential round-trips (incrbyfloat+expire pairs for per-key, daily, monthly aggregates) → single pipeline. Saves ~75ms per completion.
  • Community-tier 5-call sequence (PR #212): incr+expire+get+incr+expire collapsed into one pipeline. Saves ~75ms per community-plan request.
  • Agent reputation cold-start (PR #214): Two independent Redis GETs serialized on cache miss → Promise.all. Saves ~15-30ms on ~50% of rate-limit middleware invocations (cross-task cold cache).
  • Agent rate increment (PR #214): incr+expire serialized → pipelined; idempotent expire dropped the conditional branch.

The class-of-bug pattern: each case was an awaited Redis call X immediately followed by another awaited Redis call Y, where X and Y had no data dependency. Reviewer rule going forward: every awaited Redis call immediately followed by another should be questioned for pipelining.
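
The difference in round-trips can be demonstrated with an in-memory stand-in for a Redis client. Everything in this sketch is hypothetical: FakeRedis, its send and pipeline methods, the key names, and the TTLs are invented for illustration and do not reflect any real client's API or the actual recordSpend code.

```typescript
// Hypothetical stand-in for a Redis client that counts round-trips; not a real client API.
type Command = ["incrbyfloat" | "expire", string, number];

class FakeRedis {
  roundTrips = 0;
  readonly values = new Map<string, number>();
  readonly ttls = new Map<string, number>();

  private apply([op, key, arg]: Command): void {
    if (op === "incrbyfloat") this.values.set(key, (this.values.get(key) ?? 0) + arg);
    else this.ttls.set(key, arg);
  }

  /** One command, one network round-trip. */
  async send(cmd: Command): Promise<void> {
    this.roundTrips += 1;
    this.apply(cmd);
  }

  /** Many commands, still one round-trip. */
  async pipeline(cmds: Command[]): Promise<void> {
    this.roundTrips += 1;
    cmds.forEach(c => this.apply(c));
  }
}

/** Six commands: an incrbyfloat+expire pair for per-key, daily, and monthly aggregates. */
function spendCommands(keyId: string, amount: number): Command[] {
  const buckets: [string, number][] = [
    [`spend:key:${keyId}`, 86400],      // assumed key names and TTLs
    [`spend:daily:${keyId}`, 86400],
    [`spend:monthly:${keyId}`, 2678400],
  ];
  const cmds: Command[] = [];
  for (const [key, ttl] of buckets) {
    cmds.push(["incrbyfloat", key, amount]);
    cmds.push(["expire", key, ttl]);
  }
  return cmds;
}

// Before: each await pays a full RTT, so six commands cost six round-trips.
async function recordSpendSequential(redis: FakeRedis, keyId: string, amount: number): Promise<void> {
  for (const cmd of spendCommands(keyId, amount)) await redis.send(cmd);
}

// After: no command depends on another's result, so all six share one round-trip.
async function recordSpendPipelined(redis: FakeRedis, keyId: string, amount: number): Promise<void> {
  await redis.pipeline(spendCommands(keyId, amount));
}
```

At ~15ms per round-trip, going from six round-trips to one accounts for the ~75ms saving quoted for recordSpend.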

Semantic cache: O(partition) instead of O(total)

Semantic cache lookup was O(N) over all entries: every request paid for cosine similarity against every cached prompt, regardless of tenant. PR #213 partitioned the in-memory cache by (tenantId, model) so lookups scan only the relevant bucket. On a multi-tenant warm cache (10K entries, 1536-dim embeddings), this is the difference between ~15M float multiplications per request and ~150K. The shape of the persistence layer (pgvector) is unchanged; only the in-memory layout is partitioned.
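
A minimal sketch of the partitioned layout follows. The class name, the partition key encoding, the entry shape, and the 0.95 similarity threshold are assumptions for illustration; only the (tenantId, model) partitioning comes from the text.

```typescript
// Hypothetical sketch; key encoding, threshold, and entry shape are assumptions.
interface CacheEntry {
  embedding: number[];
  response: string;
}

/** Cosine similarity; assumes both vectors have the same dimension. */
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

class PartitionedSemanticCache {
  private partitions = new Map<string, CacheEntry[]>();

  private key(tenantId: string, model: string): string {
    return JSON.stringify([tenantId, model]); // unambiguous composite key
  }

  put(tenantId: string, model: string, entry: CacheEntry): void {
    const k = this.key(tenantId, model);
    const bucket = this.partitions.get(k) ?? [];
    bucket.push(entry);
    this.partitions.set(k, bucket);
  }

  /** Scans only the (tenantId, model) bucket, never the whole cache. */
  lookup(tenantId: string, model: string, query: number[], threshold: number = 0.95): string | undefined {
    const bucket = this.partitions.get(this.key(tenantId, model)) ?? [];
    let best: CacheEntry | undefined;
    let bestScore = threshold;
    for (const entry of bucket) {
      const score = cosine(query, entry.embedding);
      if (score >= bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best?.response;
  }
}
```

Partitioning also means one tenant's lookup can never return another tenant's cached response, since their entries live in different buckets.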

LRU eviction also went from O(N) to O(1) (PR #215) by leveraging Map's insertion-order iteration: delete + set on access moves the entry to the tail; eviction takes from the head. Same pattern landed as a reusable utility at src/utils/lru-map.ts.
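
The insertion-order trick can be sketched in a few lines. This is an illustration of the technique, not the contents of src/utils/lru-map.ts. The class and method names are assumptions.

```typescript
// Hypothetical sketch of an LRU built on Map's insertion-order iteration.
class LruMap<K, V> {
  private map = new Map<K, V>();

  constructor(private readonly capacity: number) {}

  get(key: K): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key) as V;
    this.map.delete(key);     // delete + set moves the entry to the tail
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    this.map.delete(key);     // re-inserting refreshes recency
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      const oldest = this.map.keys().next(); // head = least recently used
      if (!oldest.done) this.map.delete(oldest.value);
    }
  }

  get size(): number {
    return this.map.size;
  }
}
```

Both get() and set() do a constant number of Map operations, so eviction no longer requires scanning entries for the oldest timestamp.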

Putting it together: each request on the hot path is cost-predicted, routed through Thompson Sampling, circuit-breaker checked, quality-scored, and efficiency-calculated, all in under 2ms of added latency.