Guardian Intelligence: Sub-5ms Cost Prediction on Every Request

2026-02-18

guardian, cost-prediction, usage-recording, velocity-tracking

LOCKSTEP TRACEABILITY MATRIX

---
api_endpoints:
  - "POST /v1/chat/completions (middleware)"
  - "GET /v1/insights/daily"
  - "GET /v1/insights/waste"
  - "GET /v1/insights/optimize"
sdk_methods_updated:
  - "client.observability.daily()"
  - "client.observability.waste()"
mcp_tools_updated: ["none"]
---

What We Built

Guardian Intelligence is a Hono middleware that runs on every completions request, performing pre-request cost prediction and post-response efficiency scoring without adding meaningful latency. It estimates token counts before the LLM is called, predicts output length using an EWMA (Exponentially Weighted Moving Average) ratio tracker, computes expected cost in USD, and optionally enforces a cost seatbelt that rejects requests exceeding a per-request budget.

The system operates in three modes: Active (rejects requests that exceed the cost ceiling with HTTP 402), Passive (warns via headers but never blocks), and Off (bypassed entirely via the X-BR-Guardian: off header). The default is Passive — observe everything, block nothing until the tenant opts in.
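The mode resolution can be sketched as a small pure function; the function name and the tenant-config parameter are assumptions, while the header name, the three modes, and the Passive default come from the text above:

```typescript
type GuardianMode = "active" | "passive" | "off";

// Resolve the effective Guardian mode for a request.
// "X-BR-Guardian: off" bypasses Guardian entirely; otherwise the
// tenant's configured mode applies, defaulting to passive.
function resolveMode(
  header: string | undefined,
  tenantMode?: GuardianMode,
): GuardianMode {
  if (header?.toLowerCase() === "off") return "off";
  return tenantMode ?? "passive"; // observe everything, block nothing by default
}
```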

After the LLM responds, Guardian injects post-response headers with actual cost, cache savings, and an efficiency score (normalized to 0–1, measuring output tokens per dollar relative to a theoretical maximum). These headers are also embedded in the SSE [DONE] payload for streaming responses, giving clients real-time cost visibility.
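A minimal sketch of the efficiency score under that definition; the function name and the normalization baseline (a tokens-per-dollar ceiling) are assumptions, since the source only states the score is output tokens per dollar relative to a theoretical maximum:

```typescript
// Efficiency: output tokens per dollar, normalized to 0–1 against an
// assumed theoretical maximum tokens-per-dollar rate.
function efficiencyScore(
  outputTokens: number,
  actualCostUsd: number,
  maxTokensPerUsd: number,
): number {
  if (actualCostUsd <= 0 || maxTokensPerUsd <= 0) return 0;
  const tokensPerUsd = outputTokens / actualCostUsd;
  return Math.min(1, tokensPerUsd / maxTokensPerUsd);
}
```

With a (hypothetical) ceiling of 1,000,000 tokens per dollar, 870 output tokens costing $0.001 scores 0.87 — the value shown in the X-BR-Efficiency header example below.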

Why It Matters

AI spend is the new cloud bill — and it is harder to predict. A single runaway prompt can cost $50 in tokens. Enterprises need to see costs _before_ they are incurred, not in a monthly invoice. Guardian gives every API consumer a cost estimate in the response headers of the request they just made, and the option to set hard limits that prevent budget overruns.

This is not a billing dashboard you check once a week. This is per-request, per-tenant, per-key cost intelligence with sub-5ms overhead.

How It Works

Pre-request pipeline (6 steps):

  1. Token estimation: ~4 characters per token, plus 16 characters per message overhead, plus the JSON size of any tool schemas
  2. Output prediction: EWMA ratio of output/input tokens (alpha=0.1, seeded at 1.0, clamped to 0.1–20.0)
  3. Cost calculation: Input + output token estimates multiplied by per-token prices. Computed in microcents (no intermediate rounding) then converted to USD
  4. Seatbelt check: If the X-BR-Max-Estimated-Cost header is set and cost exceeds it, Guardian can reject (Active mode) or warn (Passive mode)
  5. Velocity tracking: Redis INCR per tenant+key with minute and hour buckets (2-minute / 2-hour TTL)
  6. Context injection: Guardian state attached to Hono context for downstream use
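Steps 1–4 above can be sketched as follows. The heuristics (4 chars/token, +16 chars per message, tool-schema JSON size, alpha = 0.1 seeded at 1.0 and clamped to 0.1–20.0, microcent accounting, the seatbelt header) come from the text; the function names, the in-memory EWMA class, and the placeholder prices are illustrative assumptions:

```typescript
interface Message { role: string; content: string }

// Step 1: ~4 chars/token, +16 chars per message, plus tool schema JSON size.
function estimateInputTokens(messages: Message[], toolSchemas: object[] = []): number {
  const msgChars = messages.reduce((sum, m) => sum + m.content.length + 16, 0);
  const toolChars = toolSchemas.length ? JSON.stringify(toolSchemas).length : 0;
  return Math.ceil((msgChars + toolChars) / 4);
}

// Step 2: EWMA output/input ratio (alpha = 0.1, seeded at 1.0, clamped to 0.1–20.0).
class EwmaRatio {
  private ratio = 1.0;
  update(outputTokens: number, inputTokens: number): void {
    const observed = outputTokens / Math.max(1, inputTokens);
    const next = 0.1 * observed + 0.9 * this.ratio;
    this.ratio = Math.min(20.0, Math.max(0.1, next));
  }
  predictOutput(inputTokens: number): number {
    return Math.round(inputTokens * this.ratio);
  }
}

// Step 3: accumulate in microcents (no intermediate rounding), then convert.
// 1 USD = 100 cents = 100,000,000 microcents. Prices here are placeholders.
function estimateCostUsd(
  inTok: number, outTok: number,
  inMicrocentsPerTok: number, outMicrocentsPerTok: number,
): number {
  const microcents = inTok * inMicrocentsPerTok + outTok * outMicrocentsPerTok;
  return microcents / 100_000_000;
}

// Step 4: seatbelt — reject (Active) or warn (Passive) when the estimate
// exceeds the X-BR-Max-Estimated-Cost budget.
function seatbelt(
  estimatedUsd: number,
  maxUsd: number | undefined,
  mode: "active" | "passive",
): "allow" | "warn" | "reject" {
  if (maxUsd === undefined || estimatedUsd <= maxUsd) return "allow";
  return mode === "active" ? "reject" : "warn";
}
```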

Post-response headers:

X-BR-Guardian-Status: on|off|degraded
X-BR-Estimated-Cost: 0.0023
X-BR-Actual-Cost: 0.002147
X-BR-Efficiency: 0.87
X-BR-Guardian-Overhead-Ms: 1.2
X-BR-Cache: semantic
X-BR-Savings: 0.001200
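Client-side, these headers can be read straight off a standard fetch Response; a sketch, where the header names come from the list above and the parsed record's shape and function name are assumptions:

```typescript
interface GuardianCostInfo {
  status: string | null;
  estimatedCost: number;
  actualCost: number;
  efficiency: number;
  overheadMs: number;
}

// Parse Guardian headers from a fetch Response.headers object.
// Missing numeric headers default to 0.
function parseGuardianHeaders(headers: Headers): GuardianCostInfo {
  const num = (name: string) => Number(headers.get(name) ?? 0);
  return {
    status: headers.get("X-BR-Guardian-Status"),
    estimatedCost: num("X-BR-Estimated-Cost"),
    actualCost: num("X-BR-Actual-Cost"),
    efficiency: num("X-BR-Efficiency"),
    overheadMs: num("X-BR-Guardian-Overhead-Ms"),
  };
}
```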

Privacy modes: Standard mode stores full telemetry (token counts, model, latency, prompt hashes, embeddings, SIEM content). Strict mode stores only token counts plus model/provider/latency — no content, no embeddings, no cache writes.
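The strict-mode filter can be sketched as dropping everything except the allowed fields; the record shape and function name are assumptions (cache-write suppression would happen elsewhere in the pipeline):

```typescript
interface Telemetry {
  tokenCounts: { input: number; output: number };
  model: string;
  provider: string;
  latencyMs: number;
  promptHash?: string;   // standard mode only
  embedding?: number[];  // standard mode only
}

// Strict mode keeps only token counts plus model/provider/latency:
// no content, no prompt hashes, no embeddings.
function applyPrivacyMode(t: Telemetry, mode: "standard" | "strict"): Telemetry {
  if (mode === "standard") return t;
  const { tokenCounts, model, provider, latencyMs } = t;
  return { tokenCounts, model, provider, latencyMs };
}
```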

The Numbers

  • <5ms p95 total Guardian overhead (measured: 0.5–1.3ms typical)
  • 0.1 EWMA alpha — 10% weight on new observations, 90% on history. Smooths output predictions without ignoring trends
  • 4 chars/token heuristic with +16 chars/message overhead — simple, fast, accurate enough for cost estimation (not billing)
  • 2 privacy modes — Standard (full telemetry) and Strict (token counts only)
  • Fail-open architecture — Guardian errors never block requests. If Redis is down, velocity tracking degrades gracefully
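The fail-open guarantee can be sketched as a wrapper that swallows Guardian-side failures, returns a fallback, and degrades the reported status instead of blocking the request; names are assumptions:

```typescript
type GuardianStatus = "on" | "degraded";

// Fail-open: if a Guardian step throws (e.g. Redis unreachable during
// velocity tracking), record degraded status and let the request proceed.
function failOpen<T>(
  step: () => T,
  fallback: T,
): { value: T; status: GuardianStatus } {
  try {
    return { value: step(), status: "on" };
  } catch {
    return { value: fallback, status: "degraded" };
  }
}
```

A degraded result would then surface as X-BR-Guardian-Status: degraded while the completion itself still runs.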

Competitive Edge

Portkey offers cost tracking as a post-hoc dashboard metric. OpenRouter shows per-request cost in response headers but without prediction, seatbelts, or efficiency scoring. BrainstormRouter's Guardian runs _before_ the LLM call, gives you a cost estimate you can act on, and scores every response for cost-efficiency. The combination of pre-request prediction + post-response scoring + privacy-aware telemetry + per-tenant velocity tracking is unique in the market.

Lockstep Checklist

  • [x] API Routes: Guardian middleware integrated into completions route. Insights endpoints (/v1/insights/*) serve aggregated data.
  • [x] TS SDK: client.observability.daily() and client.observability.waste() consume insights endpoints.
  • [x] Python SDK: client.observability.daily() and client.observability.waste() implemented.
  • [ ] MCP Schemas: Not applicable.
  • [x] Master Record: Listed under "Financial & Key Governance" in master-capability-record.md.