Graduated Trust Degradation

Continuous trust evaluation with four levels — manage misbehaving agents, don't kill them.

Overview

Most AI governance systems use a binary approach: an agent is either allowed or blocked. BrainstormRouter takes a different position. Killing a production agent is expensive — it disrupts workflows, creates cascading failures, and wastes the context the agent has built. Instead, BrainstormRouter degrades trust progressively, giving agents a chance to self-correct while limiting their blast radius.

Trust levels

Every agent operates at one of four trust levels, evaluated continuously by the Guardian middleware (src/api/middleware/guardian.ts):

Level       Anomaly score   Behavior
full        0.0 – 0.3       No restrictions. All routing strategies, full budget access.
degraded    0.3 – 0.6       Routing forced to price strategy. Cost warnings emitted.
restricted  0.6 – 0.8       Hard cost cap enforced. Only pre-approved models available.
quarantine  0.8 – 1.0       All requests blocked. Certificate revocation triggered.
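The thresholds above can be sketched as a simple mapping. This is a hypothetical helper, not the actual Guardian code; in particular, which tier the exact boundary values (0.3, 0.6, 0.8) fall into is an assumption:

```typescript
// Hypothetical mapping from anomaly score to trust level.
// Assumption: exact boundary values (0.3, 0.6, 0.8) fall into the
// stricter tier; the real Guardian middleware may handle this differently.
type TrustLevel = "full" | "degraded" | "restricted" | "quarantine";

function trustLevel(anomalyScore: number): TrustLevel {
  if (anomalyScore < 0.3) return "full";
  if (anomalyScore < 0.6) return "degraded";
  if (anomalyScore < 0.8) return "restricted";
  return "quarantine";
}
```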

Trust level is reported in every response via the X-BR-Guardian-Status header.

How anomaly scoring works

The anomaly score is a composite signal derived from the agent's recent behavior:

  • Cost velocity — Is the agent spending faster than its historical baseline?
  • Error rate — Is the agent triggering more failures than normal?
  • Content flags — Has the streaming firewall flagged output patterns?
  • Request volume — Is the agent making requests at an unusual rate?
  • Model distribution — Has the agent suddenly shifted to expensive models?

Each signal contributes to the composite score using exponentially-weighted moving averages (EWMA), which means recent behavior matters more than historical patterns. A sudden spike in cost or error rate raises the score quickly; sustained normal behavior lowers it gradually.
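One way to sketch the composite: keep an EWMA per signal and combine them with a weighted sum. The signal weights and the smoothing factor below are illustrative assumptions, not the production values:

```typescript
// Illustrative EWMA-based composite; ALPHA and all weights are
// assumptions for this sketch, not BrainstormRouter's actual tuning.
const ALPHA = 0.3; // higher alpha => recent samples dominate the average

function ewma(prev: number, sample: number, alpha: number = ALPHA): number {
  return alpha * sample + (1 - alpha) * prev;
}

// Per-signal state, each normalized to [0, 1] by the caller.
interface SignalState {
  costVelocity: number;
  errorRate: number;
  contentFlags: number;
  requestVolume: number;
  modelShift: number;
}

// Hypothetical weights; they sum to 1 so the composite stays in [0, 1].
const WEIGHTS: SignalState = {
  costVelocity: 0.3,
  errorRate: 0.25,
  contentFlags: 0.2,
  requestVolume: 0.15,
  modelShift: 0.1,
};

function compositeScore(state: SignalState): number {
  let score = 0;
  for (const key of Object.keys(WEIGHTS) as (keyof SignalState)[]) {
    score += WEIGHTS[key] * state[key];
  }
  return Math.min(1, Math.max(0, score));
}
```

Because each input is itself an EWMA, a single spike moves the composite quickly (weighted by ALPHA), while the score decays back only as clean samples accumulate.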

Routing enforcement

Trust degradation changes routing behavior in concrete ways:

Degraded (score 0.3 – 0.6)

The router overrides the requested strategy to price, selecting the cheapest available endpoint regardless of what the agent asked for. The agent still gets responses — just from lower-cost models. The X-BR-Guardian-Status: degraded header signals the agent (or its orchestrator) that behavior has been flagged.

Agent requests: anthropic/claude-sonnet-4 (quality strategy)
Router delivers: google/gemini-2.0-flash (price strategy override)
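The override itself reduces to a small decision, sketched here with an assumed strategy enum (the stricter tiers layer their own constraints on top and are not modeled):

```typescript
// Sketch: under degraded trust the requested strategy is ignored and
// "price" is forced. Strategy names are assumptions for this sketch.
type Strategy = "quality" | "price" | "latency";

function effectiveStrategy(requested: Strategy, trust: string): Strategy {
  // Degraded trust always routes by price, whatever the agent asked for.
  return trust === "degraded" ? "price" : requested;
}
```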

Restricted (score 0.6 – 0.8)

A hard cost cap is enforced per-request. The router rejects any request whose estimated cost exceeds the cap. Only models on the tenant's pre-approved list are routable. The agent can still function, but within a tightly controlled envelope.
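The restricted-tier check can be sketched as an admission function. The policy shape, cap field, and function names are assumptions, not the production implementation:

```typescript
// Sketch of restricted-tier admission: a request passes only if the
// model is pre-approved AND the cost estimate is under the hard cap.
interface RestrictedPolicy {
  maxCostUsd: number;          // hard per-request cost cap (assumed unit)
  approvedModels: Set<string>; // tenant's pre-approved model list
}

function admitRestricted(
  model: string,
  estimatedCostUsd: number,
  policy: RestrictedPolicy,
): boolean {
  if (!policy.approvedModels.has(model)) return false; // not pre-approved
  if (estimatedCostUsd > policy.maxCostUsd) return false; // over the cap
  return true;
}
```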

Quarantine (score 0.8 – 1.0)

All requests return 403 Forbidden. If CAF is enabled, the agent's certificate is revoked. The agent is effectively offline until an administrator reviews the anomaly data and manually restores access.
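In middleware terms, quarantine is a hard gate before routing. This sketch returns a plain verdict object rather than tying to a specific HTTP framework; the 403 status and the CAF-conditional revocation follow the behavior described above, but the names are assumed:

```typescript
// Sketch of the quarantine gate. Returns a framework-agnostic verdict;
// type and function names are assumptions for this sketch.
type GateVerdict = { status: number; revokeCertificate: boolean };

function enforceQuarantine(trust: string, cafEnabled: boolean): GateVerdict {
  if (trust === "quarantine") {
    // All requests blocked; the certificate is revoked only when CAF is on.
    return { status: 403, revokeCertificate: cafEnabled };
  }
  return { status: 200, revokeCertificate: false };
}
```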

Trust recovery

An agent recovers trust through consistent clean behavior. After 3 consecutive requests that produce clean verdicts (no anomaly signals), the trust level steps down by one tier:

quarantine → restricted → degraded → full

Recovery is intentionally slower than degradation. A single anomalous request can push an agent from full to degraded, but returning to full requires sustained clean behavior. This asymmetry prevents an adversarial agent from alternating between good and bad behavior to maintain access.
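The asymmetry can be sketched as a small state machine. The 3-clean-request threshold comes from the text above; escalation is simplified here to one tier per anomalous verdict (in the real system escalation is driven by the anomaly score, not per-verdict):

```typescript
// Sketch of asymmetric trust recovery: one bad verdict escalates a tier
// immediately, but stepping back down takes 3 consecutive clean verdicts.
const TIERS = ["full", "degraded", "restricted", "quarantine"] as const;
type Tier = (typeof TIERS)[number];

class TrustTracker {
  private cleanStreak = 0;
  constructor(public tier: Tier = "full") {}

  recordVerdict(clean: boolean): void {
    const i = TIERS.indexOf(this.tier);
    if (!clean) {
      // Any anomalous verdict resets the streak and escalates one tier.
      this.cleanStreak = 0;
      this.tier = TIERS[Math.min(i + 1, TIERS.length - 1)];
      return;
    }
    this.cleanStreak += 1;
    if (this.cleanStreak >= 3) {
      // Sustained clean behavior steps trust down one tier at a time.
      this.tier = TIERS[Math.max(i - 1, 0)];
      this.cleanStreak = 0;
    }
  }
}
```

Note the streak resets after each step-down, so climbing from quarantine back to full costs nine clean requests, while falling the other way takes only three bad ones.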

Philosophy: manage, don't kill

Binary kill switches create perverse incentives. If the penalty for suspicious behavior is total shutdown, operators err on the side of permissiveness — they set thresholds too high or disable detection entirely, because false positives are too costly.

Graduated trust changes the calculus:

  • False positives are cheap — a wrongly-degraded agent still works, just at lower cost. No pages, no incidents, no manual intervention required.
  • True positives are contained — a genuinely misbehaving agent gets progressively constrained before it can do real damage.
  • Recovery is automatic — operators don't need to babysit every alert.

This approach mirrors how human organizations handle trust: probation before termination, warnings before sanctions. The difference is that BrainstormRouter enforces it in real-time, on every request, with sub-5ms overhead.

Comparison to binary approaches

Aspect               Binary kill switch            Graduated trust
False positive cost  Full outage                   Temporary cost reduction
Detection threshold  Set high to avoid outages     Set low — degradation is cheap
Recovery             Manual operator intervention  Automatic after clean behavior
Blast radius         Total shutdown or nothing     Proportional to severity
Agent visibility     No warning before kill        X-BR-Guardian-Status header on every response
Operator burden      Must review every kill        Only quarantine requires review

See Agent Identity & CAF for how trust levels integrate with cryptographic identity and certificate lifecycle.