Guardian Intelligence

Pre-request cost prediction, post-response efficiency scoring, and consumption protection middleware.

What Guardian does

Guardian is a Hono middleware that wraps every completion request with cost prediction (pre-request) and efficiency scoring (post-response). It adds X-BR-* headers to every response so your application can make informed decisions about cost and quality — without calling a separate observability API.

The full implementation is in src/api/middleware/guardian.ts.

Design constraints

Constraint	Value	Source
Latency budget	< 5ms p95	`guardian.ts:16`
Failure mode	Fail-open (never blocks requests)	`guardian.ts:17`
Bypass	`X-BR-Guardian: off` header	`guardian.ts:18`

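These are non-negotiable. Guardian is consumption protection, not a gate: if Guardian itself has an error, the request proceeds with X-BR-Guardian-Status: degraded. The only request Guardian rejects on purpose is one that exceeds a client-supplied cost limit while the tenant is in active mode (see Cost seatbelt).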
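
The bypass and fail-open rules can be sketched as a framework-agnostic wrapper. This is an illustration under stated assumptions, not the Hono middleware in guardian.ts: `runGuardianSafely`, `GuardianOutcome`, and the choice to return no headers on bypass are all hypothetical.

```typescript
// Hypothetical sketch of the bypass and fail-open rules; not the code in guardian.ts.
type GuardianStatus = "on" | "off" | "degraded";

interface GuardianOutcome {
  status: GuardianStatus;
  headers: Record<string, string>;
}

// Runs one Guardian step and never lets its failure reach the caller.
function runGuardianSafely(
  requestHeaders: Record<string, string>,
  step: () => Record<string, string>,
): GuardianOutcome {
  // Bypass: the client opted out with "X-BR-Guardian: off" (header names are case-insensitive).
  const bypass = Object.keys(requestHeaders).some(
    (name) =>
      name.toLowerCase() === "x-br-guardian" &&
      requestHeaders[name].trim().toLowerCase() === "off",
  );
  if (bypass) {
    // Assumption: a bypassed request gets no Guardian headers at all.
    return { status: "off", headers: {} };
  }
  try {
    const headers = step();
    return { status: "on", headers: { ...headers, "X-BR-Guardian-Status": "on" } };
  } catch {
    // Fail-open: swallow the error and mark the response as degraded.
    return { status: "degraded", headers: { "X-BR-Guardian-Status": "degraded" } };
  }
}
```

The point of the design is that the `try`/`catch` surrounds Guardian's own work only, so a pricing-table lookup failure or a Redis timeout can change the status header but never the outcome of the request.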

Pre-request phase

%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#d97706', 'lineColor': '#9494a8', 'primaryTextColor': '#e8e8ee'}}}%%
flowchart LR
    Req[Request] --> Bypass{"X-BR-Guardian\n= off?"}
    Bypass -->|Yes| Pass([Skip Guardian])
    Bypass -->|No| Mode{"Tenant mode?"}
    Mode -->|off| Pass2([Skip + header])
    Mode -->|passive/active| Estimate

    subgraph Estimate["Cost Estimation"]
        Tokens["Estimate input tokens\n~4 chars/token"]
        EWMA["EWMA output ratio\nalpha=0.1"]
        Cost["Calculate cost\nfrom pricing tables"]
        Seatbelt{"Cost > limit?"}
    end

    Tokens --> EWMA --> Cost --> Seatbelt
    Seatbelt -->|"Yes + active"| Block[402 cost_exceeded]
    Seatbelt -->|"Yes + passive"| Warn[X-BR-Cost-Warning]
    Seatbelt -->|No| Continue([Set headers + continue])

    style Block fill:#7f1d1d
    style Warn fill:#4a3728

Token estimation (`guardian.ts:338-366`)

Input tokens are estimated from the request body:

Chat messages: content.length / 4 (rough English heuristic)
Multi-part content (vision): text parts summed
Tool schemas: JSON.stringify(tools).length / 4
Overhead: ~4 tokens per message (role, formatting)
Minimum: 10 tokens
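
The heuristic above can be sketched as follows. The type names, the function name `estimateInputTokens`, and the decision to round up are assumptions made for illustration; the actual logic is in guardian.ts:338-366.

```typescript
// Hypothetical types and names; a sketch of the heuristic, not the real implementation.
type ContentPart = { type: string; text?: string };

interface ChatMessage {
  role: string;
  content: string | ContentPart[];
}

const CHARS_PER_TOKEN = 4; // rough English heuristic
const PER_MESSAGE_OVERHEAD = 4; // role and formatting
const MIN_TOKENS = 10;

function estimateInputTokens(messages: ChatMessage[], tools?: unknown[]): number {
  let tokens = 0;
  for (const message of messages) {
    // Multi-part content: only text parts contribute; other parts are skipped.
    const text =
      typeof message.content === "string"
        ? message.content
        : message.content.map((part) => part.text ?? "").join("");
    tokens += text.length / CHARS_PER_TOKEN + PER_MESSAGE_OVERHEAD;
  }
  if (tools && tools.length > 0) {
    tokens += JSON.stringify(tools).length / CHARS_PER_TOKEN;
  }
  // Rounding up is an assumption; the floor of 10 tokens comes from the list above.
  return Math.max(MIN_TOKENS, Math.ceil(tokens));
}
```

Under these assumptions, a single 400-character user message estimates to 104 tokens (100 for content plus 4 of overhead), and a two-character message is raised to the 10-token minimum.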

EWMA output/input ratio (`guardian.ts:59-85`)

Guardian tracks the output/input token ratio per tenant using an Exponential Weighted Moving Average:

const EWMA_ALPHA = 0.1;
const DEFAULT_RATIO = 1.0; // Conservative seed
const MIN_RATIO = 0.1; // Output ≥ 10% of input
const MAX_RATIO = 20.0; // Output ≤ 20× input

// Update after each response
existing.ratio = EWMA_ALPHA * observed + (1 - EWMA_ALPHA) * existing.ratio;

A new tenant starts with ratio=1.0 (assume output equals input). With alpha=0.1, the seed's weight decays to about 12% (0.9^20) after 20 requests, so by then the EWMA mostly reflects the tenant's actual pattern. A code-generation tenant might converge to ratio=3.0 (3× as much output as input); a classification tenant might converge to ratio=0.1.
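
To make the convergence concrete, here is a small simulation of a tenant whose true ratio is 3.0. The constants are the ones quoted above; the `updateRatio` helper and the choice to clamp the observed ratio (rather than the stored one) are assumptions for the sketch.

```typescript
const EWMA_ALPHA = 0.1;
const DEFAULT_RATIO = 1.0;
const MIN_RATIO = 0.1;
const MAX_RATIO = 20.0;

// One EWMA step. Clamping the observation before blending is an assumption
// about where the MIN/MAX bounds are applied.
function updateRatio(current: number, inputTokens: number, outputTokens: number): number {
  const raw = outputTokens / Math.max(inputTokens, 1);
  const observed = Math.min(MAX_RATIO, Math.max(MIN_RATIO, raw));
  return EWMA_ALPHA * observed + (1 - EWMA_ALPHA) * current;
}

// Simulate 20 responses that each produce 3x as many output tokens as input tokens.
let ratio = DEFAULT_RATIO;
const history: number[] = [];
for (let i = 0; i < 20; i++) {
  ratio = updateRatio(ratio, 1000, 3000);
  history.push(ratio);
}
// After n steps the ratio is 3 - 2 * 0.9^n, so after 20 steps it is about 2.76.
```

The closed form shows why alpha=0.1 is a slow-moving average: each response closes only 10% of the remaining gap, which keeps a single unusually long completion from skewing the next cost prediction.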

Cost seatbelt

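Clients can cap the spend of a single request by setting the X-BR-Max-Estimated-Cost header. When the estimated cost exceeds that limit, the outcome depends on the tenant mode: in active mode, Guardian rejects the request with a 402 cost_exceeded response; in passive mode, it adds an X-BR-Cost-Warning header and lets the request through.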
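
The decision can be sketched as a pure function. The names `checkSeatbelt` and `SeatbeltDecision` are hypothetical, and two behaviors are assumptions rather than documented facts: the header is read as a plain decimal USD amount, and a missing or unparseable header means no limit.

```typescript
type GuardianMode = "off" | "passive" | "active";

type SeatbeltDecision =
  | { action: "continue" }
  | { action: "warn"; header: "X-BR-Cost-Warning" }
  | { action: "block"; status: 402; error: "cost_exceeded" };

function checkSeatbelt(
  estimatedCostUsd: number,
  maxCostHeader: string | undefined,
  mode: GuardianMode,
): SeatbeltDecision {
  // Assumption: the header carries a plain decimal such as "0.05".
  const raw = (maxCostHeader ?? "").trim();
  const limit = raw === "" ? NaN : Number(raw);
  // Assumption: no header, or a value that does not parse, means no limit.
  if (mode === "off" || !isFinite(limit) || estimatedCostUsd <= limit) {
    return { action: "continue" };
  }
  if (mode === "active") {
    return { action: "block", status: 402, error: "cost_exceeded" };
  }
  return { action: "warn", header: "X-BR-Cost-Warning" };
}
```

For example, an estimate of 0.05 against a limit of "0.03" yields `block` in active mode and `warn` in passive mode, while the same estimate with no header yields `continue` in either mode.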

Velocity tracking (`guardian.ts:396-412`)

Fire-and-forget Redis increments track request velocity per tenant + API key at minute and hour granularity. This powers rate limiting and anomaly detection without adding latency (the Redis pipeline is non-blocking).
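
A sketch of what fire-and-forget counting could look like. Everything specific here is an assumption for illustration: the `CounterStore` interface stands in for a Redis client, and the key layout and TTLs are invented, not taken from guardian.ts:396-412.

```typescript
// Minimal stand-in for the two Redis commands the sketch needs (hypothetical shape).
interface CounterStore {
  incr(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<unknown>;
}

// Builds one key per minute bucket and one per hour bucket (hypothetical key layout).
function velocityKeys(
  tenantId: string,
  apiKeyId: string,
  nowMs: number,
): { minute: string; hour: string } {
  const minuteBucket = Math.floor(nowMs / 60000);
  const hourBucket = Math.floor(nowMs / 3600000);
  const base = `velocity:${tenantId}:${apiKeyId}`;
  return { minute: `${base}:m:${minuteBucket}`, hour: `${base}:h:${hourBucket}` };
}

function trackVelocity(
  store: CounterStore,
  tenantId: string,
  apiKeyId: string,
  nowMs: number,
): void {
  const keys = velocityKeys(tenantId, apiKeyId, nowMs);
  // Not awaited: the request continues immediately, and a store failure is swallowed.
  void Promise.all([
    store.incr(keys.minute),
    store.expire(keys.minute, 120), // assumed TTL: two minutes
    store.incr(keys.hour),
    store.expire(keys.hour, 7200), // assumed TTL: two hours
  ]).catch(() => undefined);
}
```

Because `trackVelocity` returns `void` and attaches its own `catch`, a slow or unavailable store cannot delay or fail the request, which is consistent with the fail-open constraint.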

Post-response phase

After the provider responds, Guardian injects cost and efficiency data.

Response headers (`guardian.ts:260-295`)

Header	Value	Example
`X-BR-Guardian-Status`	`on`, `off`, or `degraded`	`on`
`X-BR-Estimated-Cost`	Pre-request prediction	`$0.03`
`X-BR-Actual-Cost`	Real cost from token counts	`$0.028`
`X-BR-Efficiency`	Output tokens per dollar, normalized 0-1	`0.87`
`X-BR-Guardian-Overhead-Ms`	Guardian processing time	`1.2`
`X-BR-Cache`	Cache hit type (if any)	`semantic`
`X-BR-Savings`	Cache savings in USD	`$0.028`
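
On the client side, these headers can be read straight off the response. The sketch below assumes the value formats shown in the Example column (a leading `$` on cost headers, plain decimals elsewhere); `HeaderReader`, `GuardianReport`, and both function names are hypothetical.

```typescript
// Structural subset of the Fetch API Headers object, so the sketch needs no imports.
interface HeaderReader {
  get(name: string): string | null;
}

interface GuardianReport {
  status: string | null;
  estimatedCostUsd: number | null;
  actualCostUsd: number | null;
  efficiency: number | null;
  overheadMs: number | null;
}

// Parses "$0.03" or "0.87" into a number; returns null when absent or malformed.
function parseNumeric(value: string | null): number | null {
  if (value === null) return null;
  const cleaned = value.trim().replace(/^\$/, "");
  if (cleaned === "") return null;
  const n = Number(cleaned);
  return isFinite(n) ? n : null;
}

function readGuardianHeaders(headers: HeaderReader): GuardianReport {
  return {
    status: headers.get("X-BR-Guardian-Status"),
    estimatedCostUsd: parseNumeric(headers.get("X-BR-Estimated-Cost")),
    actualCostUsd: parseNumeric(headers.get("X-BR-Actual-Cost")),
    efficiency: parseNumeric(headers.get("X-BR-Efficiency")),
    overheadMs: parseNumeric(headers.get("X-BR-Guardian-Overhead-Ms")),
  };
}
```

With a `fetch` response this would be called as `readGuardianHeaders(response.headers)`. Every field is nullable on purpose: when the status is `degraded` or `off`, the cost headers may be missing.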

Efficiency scoring (`guardian.ts:372-393`)

Efficiency is computed as output tokens per dollar, normalized against the model's theoretical maximum:

const tokensPerDollar = outputTokens / actualCostUsd;
const theoreticalMax = 1 / outputCostPerToken;
return Math.min(tokensPerDollar / theoreticalMax, 1);

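Because theoreticalMax is 1 / outputCostPerToken, the score reduces to the share of the request's cost that was spent on output tokens. A score of 1.0 means the entire cost went to output tokens (no input cost). A score of 0.3 means only 30% of the cost went to output and most of it was input context (long prompts, short responses). This helps identify requests where a cheaper model would produce the same result.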
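
A worked example of the formula, using made-up prices of $3 per million input tokens and $15 per million output tokens. The function signature, the way actual cost is derived from the two prices, and the zero-cost guard are assumptions for the sketch.

```typescript
// Hypothetical helper: derives actual cost from token counts, then applies the formula above.
function efficiencyScore(
  inputTokens: number,
  outputTokens: number,
  inputCostPerToken: number,
  outputCostPerToken: number,
): number {
  const actualCostUsd = inputTokens * inputCostPerToken + outputTokens * outputCostPerToken;
  // Guard added for the sketch: avoid dividing by zero on free or fully cached requests.
  if (actualCostUsd <= 0 || outputCostPerToken <= 0) return 0;
  const tokensPerDollar = outputTokens / actualCostUsd;
  const theoreticalMax = 1 / outputCostPerToken;
  return Math.min(tokensPerDollar / theoreticalMax, 1);
}

const INPUT_PRICE = 3 / 1e6; // illustrative price, USD per input token
const OUTPUT_PRICE = 15 / 1e6; // illustrative price, USD per output token

// 7,000 input tokens cost $0.021 and 600 output tokens cost $0.009, so the
// output share of the $0.03 total is 0.009 / 0.03, roughly 0.3.
const promptHeavy = efficiencyScore(7000, 600, INPUT_PRICE, OUTPUT_PRICE);
```

The same 600 output tokens with no input at all would score 1.0, which is why long retrieval contexts or large tool schemas show up as low efficiency.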

Streaming responses (`guardian.ts:301-328`)

For SSE streaming, Guardian can't inject response headers after the stream starts. Instead, the Guardian payload is appended to the [DONE] event:

{
  "guardian": {
    "efficiency": 0.87,
    "actualCost": 0.028,
    "outputTokens": 342,
    "estimatedCost": 0.03,
    "cacheHit": null,
    "savings": null
  }
}
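
A client-side sketch for picking this payload out of a stream. The exact wire framing is not specified here, so the parser makes an assumption: the payload arrives as the JSON body of some `data:` line, and the literal `[DONE]` marker is skipped. `GuardianStreamPayload` and `extractGuardianPayload` are hypothetical names; the field types are inferred from the example above.

```typescript
interface GuardianStreamPayload {
  efficiency: number;
  actualCost: number;
  outputTokens: number;
  estimatedCost: number;
  cacheHit: string | null;
  savings: number | null;
}

// Scans raw SSE text for the first data line whose JSON carries a "guardian" key.
function extractGuardianPayload(sseText: string): GuardianStreamPayload | null {
  for (const line of sseText.split(/\r?\n/)) {
    if (line.indexOf("data:") !== 0) continue;
    const data = line.slice(5).trim();
    if (data === "" || data === "[DONE]") continue;
    try {
      const parsed = JSON.parse(data);
      if (parsed && typeof parsed === "object" && parsed.guardian) {
        return parsed.guardian as GuardianStreamPayload;
      }
    } catch {
      // Not JSON: ignore the line and keep scanning.
    }
  }
  return null;
}
```

Returning `null` rather than throwing keeps the client tolerant of streams where Guardian was bypassed or degraded and no payload was attached.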

Streaming guardrails

Guardian handles cost prediction and efficiency scoring. For outbound streaming security — PII detection, governance enforcement, and stream severing — the request pipeline hands off to the StreamingGuardrailEvaluator (src/security/streaming-guardrails.ts).

The evaluator runs inside the SSE response loop, processing tokens through a buffered window:

// From guardrails.ts:442-463 — createOutboundStreamGuardrails()
export function createOutboundStreamGuardrails(
  mode: GuardrailsMode,
  governance?: {
    rules: GovernanceRule[];
    onViolation?: (result: GovernanceCheckResult, chunk: string) => void;
  },
): StreamingGuardrailEvaluator | null {
  if (mode === "off" && (!governance?.rules || governance.rules.length === 0)) {
    return null;
  }
  return new StreamingGuardrailEvaluator(
    { enabled: true, bufferWindowSize: 20, piiMinChars: 50 },
    mode === "redact" || mode === "warn" ? getPiiScanner() : null,
    governance?.rules && governance.rules.length > 0 ? governance : undefined,
  );
}

Actions: allow (forward tokens), truncate (sever the stream), redact (replace PII inline), replace (substitute governance violation message).

For the full streaming security pipeline — including PII air gap, SIEM export, and governance enforcement — see Streaming Security: Code vs Reality.

Privacy modes

Guardian respects tenant privacy settings. From guardian.ts:35-53:

Mode	`token_counts`	`model_provider_latency`	`prompt_hash`	`prompt_embedding`	Everything else
`standard`	Stored	Stored	Stored	Stored	Stored
`strict`	Stored	Stored	Blocked	Blocked	Blocked

In strict mode, Guardian still tracks costs and latency (required for billing and routing), but never stores prompt content, embeddings, or anything that could reconstruct user input.
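
The table amounts to an allowlist in strict mode. A minimal sketch, assuming telemetry is a flat record keyed by the field names in the table; `filterForStorage` and `STRICT_ALLOWED` are hypothetical names, and the real rules are in guardian.ts:35-53.

```typescript
type PrivacyMode = "standard" | "strict";

// Fields that remain storable in strict mode, per the table above.
const STRICT_ALLOWED = ["token_counts", "model_provider_latency"];

function filterForStorage(
  mode: PrivacyMode,
  record: Record<string, unknown>,
): Record<string, unknown> {
  if (mode === "standard") return { ...record };
  const stored: Record<string, unknown> = {};
  for (const field of STRICT_ALLOWED) {
    if (field in record) stored[field] = record[field];
  }
  return stored;
}
```

An allowlist is used instead of a blocklist so that any field added later falls under "Everything else" and is blocked in strict mode by default.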