Guardian Intelligence
Pre-request cost prediction, post-response efficiency scoring, and consumption protection middleware.
What Guardian does
Guardian is a Hono middleware that wraps every completion request with cost prediction (pre-request) and efficiency scoring (post-response). It adds X-BR-* headers to every response so your application can make informed decisions about cost and quality — without calling a separate observability API.
The full implementation is in src/api/middleware/guardian.ts.
Design constraints
| Constraint | Value | Source |
|---|---|---|
| Latency budget | < 5ms p95 | guardian.ts:16 |
| Failure mode | Fail-open (never blocks requests) | guardian.ts:17 |
| Bypass | X-BR-Guardian: off header | guardian.ts:18 |
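These are non-negotiable. Guardian is consumption protection, not a gate. Guardian's own failures never block a request: if Guardian errors, the request proceeds with X-BR-Guardian-Status: degraded. The only blocking path is the opt-in cost seatbelt in active mode, described below.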
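A minimal sketch of the fail-open rule, assuming Guardian's pre-request work can be expressed as one function that returns headers. The name guardFailOpen is hypothetical and this is not the code in guardian.ts; it only illustrates that an error inside Guardian turns into a degraded status rather than a failed request.

```typescript
// Hypothetical helper: run a Guardian step and never let its failure escape.
function guardFailOpen(step: () => Record<string, string>): Record<string, string> {
  try {
    // Normal path: keep whatever headers the step produced and mark Guardian as on.
    return { ...step(), "X-BR-Guardian-Status": "on" };
  } catch {
    // Fail open: swallow the error so the request can continue, and flag it.
    return { "X-BR-Guardian-Status": "degraded" };
  }
}
```

The caller merges the returned headers into the response either way, so the request path has no branch that depends on Guardian succeeding.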
Pre-request phase
%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#d97706', 'lineColor': '#9494a8', 'primaryTextColor': '#e8e8ee'}}}%%
flowchart LR
Req[Request] --> Bypass{"X-BR-Guardian\n= off?"}
Bypass -->|Yes| Pass([Skip Guardian])
Bypass -->|No| Mode{"Tenant mode?"}
Mode -->|off| Pass2([Skip + header])
Mode -->|passive/active| Estimate
subgraph Estimate["Cost Estimation"]
Tokens["Estimate input tokens\n~4 chars/token"]
EWMA["EWMA output ratio\nalpha=0.1"]
Cost["Calculate cost\nfrom pricing tables"]
Seatbelt{"Cost > limit?"}
end
Tokens --> EWMA --> Cost --> Seatbelt
Seatbelt -->|"Yes + active"| Block[402 cost_exceeded]
Seatbelt -->|"Yes + passive"| Warn[X-BR-Cost-Warning]
Seatbelt -->|No| Continue([Set headers + continue])
style Block fill:#7f1d1d
style Warn fill:#4a3728
Token estimation (guardian.ts:338-366)
Input tokens are estimated from the request body:
- Chat messages: content.length / 4 (rough English heuristic)
- Multi-part content (vision): text parts summed
- Tool schemas: JSON.stringify(tools).length / 4
- Overhead: ~4 tokens per message (role, formatting)
- Minimum: 10 tokens
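The rules above can be sketched as a single function. The ChatMessage and ContentPart shapes and the name estimateInputTokens are assumptions for illustration, as is rounding up to a whole token; the real types and rounding in guardian.ts may differ.

```typescript
// Hypothetical request shapes; not taken from guardian.ts.
type ContentPart = { type: string; text?: string };

interface ChatMessage {
  role: string;
  content: string | ContentPart[];
}

const CHARS_PER_TOKEN = 4;      // rough English heuristic
const PER_MESSAGE_OVERHEAD = 4; // role and formatting
const MIN_TOKENS = 10;

function estimateInputTokens(messages: ChatMessage[], tools?: unknown[]): number {
  let tokens = 0;
  for (const message of messages) {
    // Multi-part content: only text parts count; image parts contribute nothing here.
    const chars =
      typeof message.content === "string"
        ? message.content.length
        : message.content.reduce((sum, part) => sum + (part.text ? part.text.length : 0), 0);
    tokens += chars / CHARS_PER_TOKEN + PER_MESSAGE_OVERHEAD;
  }
  if (tools && tools.length > 0) {
    // Tool schemas are costed by their serialized length.
    tokens += JSON.stringify(tools).length / CHARS_PER_TOKEN;
  }
  // Assumption: round up, then apply the 10-token floor.
  return Math.max(MIN_TOKENS, Math.ceil(tokens));
}
```

For a single 400-character user message this gives 400 / 4 + 4 = 104 tokens, while a two-character message falls back to the 10-token minimum.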
EWMA output/input ratio (guardian.ts:59-85)
Guardian tracks the output/input token ratio per tenant using an Exponential Weighted Moving Average:
const EWMA_ALPHA = 0.1;
const DEFAULT_RATIO = 1.0; // Conservative seed
const MIN_RATIO = 0.1; // Output ≥ 10% of input
const MAX_RATIO = 20.0; // Output ≤ 20× input
// Update after each response
existing.ratio = EWMA_ALPHA * observed + (1 - EWMA_ALPHA) * existing.ratio;
A new tenant starts with ratio=1.0 (assume output equals input). With alpha=0.1 the seed's weight decays as 0.9^n, so after ~20 requests it contributes only about 12% and the EWMA largely reflects the tenant's actual pattern. A code-generation tenant might converge to ratio=3.0 (3× as much output as input); a classification tenant might converge to ratio=0.1, which is also the MIN_RATIO floor.
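To make the convergence concrete, here is a small simulation built on the constants above. The function names are hypothetical, and clamping the observed ratio before blending is an assumption; guardian.ts may clamp the stored value instead.

```typescript
const EWMA_ALPHA = 0.1;
const DEFAULT_RATIO = 1.0;
const MIN_RATIO = 0.1;
const MAX_RATIO = 20.0;

// Blend one observed output/input ratio into the running average.
function updateRatio(current: number, inputTokens: number, outputTokens: number): number {
  if (inputTokens <= 0) return current; // nothing to learn from an empty prompt
  // Assumption: clamp the observation to [MIN_RATIO, MAX_RATIO] before blending.
  const observed = Math.min(MAX_RATIO, Math.max(MIN_RATIO, outputTokens / inputTokens));
  return EWMA_ALPHA * observed + (1 - EWMA_ALPHA) * current;
}

// Simulate a tenant whose responses are consistently 3x the prompt length.
function simulateCodeGenTenant(requests: number): number {
  let ratio = DEFAULT_RATIO;
  for (let i = 0; i < requests; i++) {
    ratio = updateRatio(ratio, 100, 300);
  }
  return ratio;
}
```

For this tenant the estimate after n requests is 3 - 2 × 0.9^n: 1.2 after the first request and roughly 2.76 after twenty.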
Cost seatbelt
Clients can set the X-BR-Max-Estimated-Cost header to cap the predicted cost of a request. In active mode, Guardian rejects requests whose estimated cost exceeds the limit with a 402 cost_exceeded response. In passive mode, it adds an X-BR-Cost-Warning header but allows the request through.
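The estimate and the seatbelt decision can be sketched together. All names here (Pricing, estimateCostUsd, checkSeatbelt) are hypothetical, and the cost formula assumes per-token pricing with no discounts; the pricing tables in the real implementation may carry more detail.

```typescript
// Hypothetical names and shapes; guardian.ts may differ.
type GuardianMode = "off" | "passive" | "active";

interface Pricing {
  inputCostPerToken: number;  // USD
  outputCostPerToken: number; // USD
}

type SeatbeltDecision =
  | { action: "continue" }
  | { action: "warn"; header: string }
  | { action: "block"; status: number; code: string };

// Predicted cost = input cost + (input tokens x EWMA ratio) priced as output.
function estimateCostUsd(inputTokens: number, ratio: number, pricing: Pricing): number {
  const predictedOutputTokens = inputTokens * ratio;
  return inputTokens * pricing.inputCostPerToken + predictedOutputTokens * pricing.outputCostPerToken;
}

// maxCostUsd is the parsed X-BR-Max-Estimated-Cost header, or undefined if absent.
function checkSeatbelt(mode: GuardianMode, estimatedCostUsd: number, maxCostUsd?: number): SeatbeltDecision {
  if (mode === "off" || maxCostUsd === undefined || estimatedCostUsd <= maxCostUsd) {
    return { action: "continue" };
  }
  if (mode === "active") {
    return { action: "block", status: 402, code: "cost_exceeded" };
  }
  return { action: "warn", header: "X-BR-Cost-Warning" };
}
```

With 1,000 input tokens, a ratio of 3.0, and illustrative prices of $3 and $15 per million input and output tokens, the estimate is 0.003 + 0.045 = $0.048, which a $0.01 limit would block in active mode and only flag in passive mode.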
Velocity tracking (guardian.ts:396-412)
Fire-and-forget Redis increments track request velocity per tenant + API key at minute and hour granularity. This powers rate limiting and anomaly detection without adding latency (the Redis pipeline is non-blocking).
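A sketch of the fire-and-forget pattern, using a small CounterStore interface as a stand-in for the Redis client. The key layout, the interface, and the function names are all assumptions for illustration; key expiry is not shown.

```typescript
// Hypothetical stand-in for a Redis client exposing an increment command.
interface CounterStore {
  incr(key: string): Promise<number>;
}

const MINUTE_MS = 60000;
const HOUR_MS = 3600000;

// One counter per tenant + API key for the current minute and the current hour.
function velocityKeys(tenantId: string, apiKeyId: string, nowMs: number): string[] {
  const minuteBucket = Math.floor(nowMs / MINUTE_MS);
  const hourBucket = Math.floor(nowMs / HOUR_MS);
  const prefix = `velocity:${tenantId}:${apiKeyId}`;
  return [`${prefix}:m:${minuteBucket}`, `${prefix}:h:${hourBucket}`];
}

// Fire-and-forget: start the increments, never await them, swallow failures.
function trackVelocity(store: CounterStore, tenantId: string, apiKeyId: string, nowMs: number): void {
  try {
    const writes = velocityKeys(tenantId, apiKeyId, nowMs).map((key) => store.incr(key));
    void Promise.all(writes).catch(() => undefined);
  } catch {
    // A store that throws synchronously must not affect the request either.
  }
}
```

Because trackVelocity returns before the writes settle, a slow or unavailable store costs the request nothing, at the price of occasionally losing a count.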
Post-response phase
After the provider responds, Guardian injects cost and efficiency data.
Response headers (guardian.ts:260-295)
| Header | Value | Example |
|---|---|---|
| X-BR-Guardian-Status | on, off, or degraded | on |
| X-BR-Estimated-Cost | Pre-request prediction | $0.03 |
| X-BR-Actual-Cost | Real cost from token counts | $0.028 |
| X-BR-Efficiency | Output tokens per dollar, normalized 0-1 | 0.87 |
| X-BR-Guardian-Overhead-Ms | Guardian processing time | 1.2 |
| X-BR-Cache | Cache hit type (if any) | semantic |
| X-BR-Savings | Cache savings in USD | $0.028 |
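On the client side, these headers can be read into a typed object. This is a sketch under the assumption that cost values carry a leading dollar sign as in the examples above; GuardianInfo, HeaderReader, and readGuardianHeaders are hypothetical names, not part of any published SDK.

```typescript
// Minimal read-only view of response headers, so the sketch does not depend
// on a particular HTTP client. The Headers object returned by fetch fits it.
interface HeaderReader {
  get(name: string): string | null;
}

interface GuardianInfo {
  status: "on" | "off" | "degraded" | null;
  estimatedCostUsd: number | null;
  actualCostUsd: number | null;
  efficiency: number | null;
  overheadMs: number | null;
  cache: string | null;
  savingsUsd: number | null;
}

// Parse "0.87" or "$0.028" into a number; missing or malformed values become null.
function parseNumber(raw: string | null): number | null {
  if (raw === null) return null;
  const cleaned = raw.trim().replace(/^\$/, "");
  if (cleaned === "") return null;
  const value = Number(cleaned);
  return isFinite(value) ? value : null;
}

function readGuardianHeaders(headers: HeaderReader): GuardianInfo {
  const status = headers.get("X-BR-Guardian-Status");
  return {
    status: status === "on" || status === "off" || status === "degraded" ? status : null,
    estimatedCostUsd: parseNumber(headers.get("X-BR-Estimated-Cost")),
    actualCostUsd: parseNumber(headers.get("X-BR-Actual-Cost")),
    efficiency: parseNumber(headers.get("X-BR-Efficiency")),
    overheadMs: parseNumber(headers.get("X-BR-Guardian-Overhead-Ms")),
    cache: headers.get("X-BR-Cache"),
    savingsUsd: parseNumber(headers.get("X-BR-Savings")),
  };
}
```

Every field is nullable because a bypassed or degraded response may omit some headers; callers should treat null as "not reported" rather than zero.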
Efficiency scoring (guardian.ts:372-393)
Efficiency is computed as output tokens per dollar, normalized against the model's theoretical maximum:
const tokensPerDollar = outputTokens / actualCostUsd;
const theoreticalMax = 1 / outputCostPerToken;
return Math.min(tokensPerDollar / theoreticalMax, 1);
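Because theoreticalMax is what a dollar would buy if it were spent only on output tokens, the score equals the share of the request's cost that went to output. A score of 1.0 means the entire cost went to output tokens (no input cost). A score of 0.3 means 70% of the cost was input context (long prompts, short responses). Low scores help identify requests where a cheaper model might produce the same result.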
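A worked version of the formula, assuming the actual cost is simply input tokens plus output tokens at per-token prices (no cache discounts). ModelPricing and efficiencyScore are hypothetical names, and returning 0 for a zero-cost or zero-output request is an assumption made here to avoid dividing by zero.

```typescript
// Hypothetical pricing shape; prices are in USD per token.
interface ModelPricing {
  inputCostPerToken: number;
  outputCostPerToken: number;
}

function efficiencyScore(inputTokens: number, outputTokens: number, pricing: ModelPricing): number {
  const actualCostUsd =
    inputTokens * pricing.inputCostPerToken + outputTokens * pricing.outputCostPerToken;
  // Assumption: treat free or empty responses as score 0 rather than dividing by zero.
  if (actualCostUsd <= 0 || outputTokens <= 0) return 0;
  const tokensPerDollar = outputTokens / actualCostUsd;
  const theoreticalMax = 1 / pricing.outputCostPerToken;
  return Math.min(tokensPerDollar / theoreticalMax, 1);
}
```

With illustrative prices of $1 per million input tokens and $10 per million output tokens, a request with 7,000 input tokens and 300 output tokens costs 0.007 + 0.003 = $0.010 and scores 0.003 / 0.010 = 0.3.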
Streaming responses (guardian.ts:301-328)
For SSE streaming, Guardian can't inject response headers after the stream starts. Instead, the Guardian payload is appended to the [DONE] event:
{
  "guardian": {
    "efficiency": 0.87,
    "actualCost": 0.028,
    "outputTokens": 342,
    "estimatedCost": 0.03,
    "cacheHit": null,
    "savings": null
  }
}
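A client-side sketch for picking this payload out of a buffered SSE body. The exact wire framing around [DONE] is not specified here, so the code assumes the payload arrives as a JSON object on a data: line and simply looks for the first such line with a guardian key. GuardianPayload and extractGuardianPayload are hypothetical names.

```typescript
// Field names follow the example payload above; the types are assumptions.
interface GuardianPayload {
  efficiency: number | null;
  actualCost: number | null;
  outputTokens: number | null;
  estimatedCost: number | null;
  cacheHit: string | null;
  savings: number | null;
}

// Scan SSE text for a data: line whose JSON body has a "guardian" key.
function extractGuardianPayload(sseText: string): GuardianPayload | null {
  for (const line of sseText.split(/\r?\n/)) {
    if (line.indexOf("data:") !== 0) continue;
    const data = line.slice(5).trim();
    if (data === "" || data === "[DONE]") continue;
    try {
      const parsed: unknown = JSON.parse(data);
      if (typeof parsed === "object" && parsed !== null && "guardian" in parsed) {
        return (parsed as { guardian: GuardianPayload }).guardian;
      }
    } catch {
      // Not JSON (or a partial chunk); ignore and keep scanning.
    }
  }
  return null;
}
```

If the real framing embeds the payload differently, only the line-matching logic would need to change; the null return keeps callers safe when Guardian is bypassed.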
Streaming guardrails
Guardian handles cost prediction and efficiency scoring. For outbound streaming security — PII detection, governance enforcement, and stream severing — the request pipeline hands off to the StreamingGuardrailEvaluator (src/security/streaming-guardrails.ts).
The evaluator runs inside the SSE response loop, processing tokens through a buffered window:
// From guardrails.ts:442-463 — createOutboundStreamGuardrails()
export function createOutboundStreamGuardrails(
  mode: GuardrailsMode,
  governance?: {
    rules: GovernanceRule[];
    onViolation?: (result: GovernanceCheckResult, chunk: string) => void;
  },
): StreamingGuardrailEvaluator | null {
  if (mode === "off" && (!governance?.rules || governance.rules.length === 0)) {
    return null;
  }
  return new StreamingGuardrailEvaluator(
    { enabled: true, bufferWindowSize: 20, piiMinChars: 50 },
    mode === "redact" || mode === "warn" ? getPiiScanner() : null,
    governance?.rules && governance.rules.length > 0 ? governance : undefined,
  );
}
Actions: allow (forward tokens), truncate (sever the stream), redact (replace PII inline), replace (substitute governance violation message).
For the full streaming security pipeline — including PII air gap, SIEM export, and governance enforcement — see Streaming Security: Code vs Reality.
Privacy modes
Guardian respects tenant privacy settings. From guardian.ts:35-53:
| Mode | token_counts | model_provider_latency | prompt_hash | prompt_embedding | Everything else |
|---|---|---|---|---|---|
| standard | Stored | Stored | Stored | Stored | Stored |
| strict | Stored | Stored | Blocked | Blocked | Blocked |
In strict mode, Guardian still tracks costs and latency (required for billing and routing), but never stores prompt content, embeddings, or anything that could reconstruct user input.
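The table reads naturally as an allowlist: in strict mode only the named billing and routing fields survive. A minimal sketch, treating the column names as field names and using a hypothetical filterForStorage helper; the real record shape in guardian.ts is not shown here.

```typescript
type PrivacyMode = "standard" | "strict";

// Fields that strict mode still stores, taken from the table's column names.
const STRICT_ALLOWED_FIELDS = ["token_counts", "model_provider_latency"];

// Return the subset of a telemetry record that may be persisted.
function filterForStorage(mode: PrivacyMode, record: Record<string, unknown>): Record<string, unknown> {
  if (mode === "standard") {
    return { ...record };
  }
  // Allowlist rather than blocklist: any field not named above is dropped,
  // which covers the "Everything else" column by default.
  const kept: Record<string, unknown> = {};
  for (const field of STRICT_ALLOWED_FIELDS) {
    if (field in record) {
      kept[field] = record[field];
    }
  }
  return kept;
}
```

An allowlist means a newly added telemetry field is blocked in strict mode until someone deliberately permits it.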