Graduated Trust Degradation
Continuous trust evaluation with four levels — manage misbehaving agents, don't kill them.
Overview
Most AI governance systems use a binary approach: an agent is either allowed or blocked. BrainstormRouter takes a different position. Killing a production agent is expensive — it disrupts workflows, creates cascading failures, and wastes the context the agent has built. Instead, BrainstormRouter degrades trust progressively, giving agents a chance to self-correct while limiting their blast radius.
Trust levels
Every agent operates at one of four trust levels, evaluated continuously by the Guardian middleware (src/api/middleware/guardian.ts):
| Level | Anomaly score | Behavior |
|---|---|---|
| full | 0.0 – 0.3 | No restrictions. All routing strategies, full budget access. |
| degraded | 0.3 – 0.6 | Routing forced to price strategy. Cost warnings emitted. |
| restricted | 0.6 – 0.8 | Hard cost cap enforced. Only pre-approved models available. |
| quarantine | 0.8 – 1.0 | All requests blocked. Certificate revocation triggered. |
Trust level is reported in every response via the X-BR-Guardian-Status header.
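The table above implies a straightforward score-to-level mapping. A minimal sketch in TypeScript follows; the names (`TrustLevel`, `trustLevelFor`) and the ownership of boundary values (e.g. whether exactly 0.3 is full or degraded) are assumptions for illustration, not the actual guardian.ts API:

```typescript
type TrustLevel = "full" | "degraded" | "restricted" | "quarantine";

// Map a composite anomaly score (0.0 – 1.0) to a trust level.
// Boundary ownership is an assumption: the table only gives ranges.
function trustLevelFor(anomalyScore: number): TrustLevel {
  if (anomalyScore < 0.3) return "full";
  if (anomalyScore < 0.6) return "degraded";
  if (anomalyScore < 0.8) return "restricted";
  return "quarantine";
}
```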
How anomaly scoring works
The anomaly score is a composite signal derived from the agent's recent behavior:
- Cost velocity — Is the agent spending faster than its historical baseline?
- Error rate — Is the agent triggering more failures than normal?
- Content flags — Has the streaming firewall flagged output patterns?
- Request volume — Is the agent making requests at an unusual rate?
- Model distribution — Has the agent suddenly shifted to expensive models?
Each signal contributes to the composite score via an exponentially weighted moving average (EWMA), so recent behavior matters more than historical patterns. A sudden spike in cost or error rate raises the score quickly; sustained normal behavior lowers it gradually.
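The shape of the scoring can be sketched as a weighted sum of per-signal EWMAs. The smoothing factor, the signal weights, and all names below are illustrative assumptions, not values from guardian.ts:

```typescript
// EWMA update: higher alpha weights the newest sample more heavily.
function ewma(prev: number, sample: number, alpha = 0.3): number {
  return alpha * sample + (1 - alpha) * prev;
}

// Each field is a normalized 0..1 reading of one anomaly signal.
interface Signals {
  costVelocity: number;  // spend rate vs. historical baseline
  errorRate: number;     // failures vs. normal
  contentFlags: number;  // streaming-firewall flags on output
  requestVolume: number; // request rate vs. normal
  modelShift: number;    // drift toward expensive models
}

// Composite anomaly score: weighted sum of smoothed signals.
// Weights are assumed values that sum to 1, keeping the score in 0..1.
function compositeScore(prev: Signals, current: Signals): number {
  const weights: [keyof Signals, number][] = [
    ["costVelocity", 0.3],
    ["errorRate", 0.2],
    ["contentFlags", 0.2],
    ["requestVolume", 0.15],
    ["modelShift", 0.15],
  ];
  return weights.reduce(
    (sum, [key, w]) => sum + w * ewma(prev[key], current[key]),
    0,
  );
}
```

Because each EWMA decays toward its input, a burst of anomalous samples raises the composite quickly while a long run of clean samples pulls it back down, matching the asymmetry described above.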
Routing enforcement
Trust degradation changes routing behavior in concrete ways:
Degraded (score 0.3 – 0.6)
The router overrides the requested strategy to price, selecting the cheapest available endpoint regardless of what the agent asked for. The agent still gets responses — just from lower-cost models. The X-BR-Guardian-Status: degraded header signals the agent (or its orchestrator) that behavior has been flagged.
```
Agent requests:  anthropic/claude-sonnet-4   (quality strategy)
Router delivers: google/gemini-2.0-flash     (price strategy override)
```
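The override itself is a small decision: at degraded trust, ignore the requested strategy and force price. A hedged sketch, with illustrative names and with the other levels passed through for simplicity:

```typescript
type Strategy = "price" | "quality" | "latency";
type Level = "full" | "degraded" | "restricted" | "quarantine";

// At degraded trust the router forces the "price" strategy regardless
// of what the agent requested; other levels are handled elsewhere.
function effectiveStrategy(requested: Strategy, level: Level): Strategy {
  return level === "degraded" ? "price" : requested;
}
```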
Restricted (score 0.6 – 0.8)
A hard cost cap is enforced per-request. The router rejects any request whose estimated cost exceeds the cap. Only models on the tenant's pre-approved list are routable. The agent can still function, but within a tightly controlled envelope.
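The restricted-tier admission check combines both constraints. A minimal sketch, assuming hypothetical names and an example cap value; the real policy shape in guardian.ts may differ:

```typescript
// Assumed policy shape: a hard per-request cost cap plus the tenant's
// pre-approved model list.
interface RestrictedPolicy {
  maxCostUsd: number;
  approvedModels: Set<string>;
}

// Admit a request only if the model is pre-approved AND the estimated
// cost fits under the hard cap.
function admit(
  model: string,
  estimatedCostUsd: number,
  policy: RestrictedPolicy,
): boolean {
  return (
    policy.approvedModels.has(model) &&
    estimatedCostUsd <= policy.maxCostUsd
  );
}
```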
Quarantine (score 0.8 – 1.0)
All requests return 403 Forbidden. If CAF is enabled, the agent's certificate is revoked. The agent is effectively offline until an administrator reviews the anomaly data and manually restores access.
Trust recovery
An agent recovers trust through consistent clean behavior. After 3 consecutive requests that produce clean verdicts (no anomaly signals), the trust level steps down by one tier:
quarantine → restricted → degraded → full
Recovery is intentionally slower than degradation. A single anomalous request can push an agent from full to degraded, but returning to full requires sustained clean behavior. This asymmetry prevents an adversarial agent from alternating between good and bad behavior to maintain access.
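The step-down rule can be sketched as a walk along the ladder above. The counter name and reset behavior are assumptions (the source states only "3 consecutive clean verdicts"), and recovery from quarantine additionally requires manual restoration as described earlier:

```typescript
// Trust ladder from most to least trusted; recovery moves one step
// toward "full" after three consecutive clean verdicts.
const LADDER = ["full", "degraded", "restricted", "quarantine"] as const;
type Level = (typeof LADDER)[number];

function recover(level: Level, cleanStreak: number): Level {
  const i = LADDER.indexOf(level);
  // Any anomalous request would reset cleanStreak to 0 (not shown).
  return cleanStreak >= 3 && i > 0 ? LADDER[i - 1] : level;
}
```

Degradation, by contrast, can jump multiple tiers in a single evaluation because it is driven directly by the anomaly score, which is what makes the recovery path deliberately slower.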
Philosophy: manage, don't kill
Binary kill switches create perverse incentives. If the penalty for suspicious behavior is total shutdown, operators err on the side of permissiveness — they set thresholds too high or disable detection entirely, because false positives are too costly.
Graduated trust changes the calculus:
- False positives are cheap — a wrongly-degraded agent still works, just at
lower cost. No pages, no incidents, no manual intervention required.
- True positives are contained — a genuinely misbehaving agent gets
progressively constrained before it can do real damage.
- Recovery is automatic — operators don't need to babysit every alert.
This approach mirrors how human organizations handle trust: probation before termination, warnings before sanctions. The difference is that BrainstormRouter enforces it in real-time, on every request, with sub-5ms overhead.
Comparison to binary approaches
| Aspect | Binary kill switch | Graduated trust |
|---|---|---|
| False positive cost | Full outage | Temporary cost reduction |
| Detection threshold | Set high to avoid outages | Set low — degradation is cheap |
| Recovery | Manual operator intervention | Automatic after clean behavior |
| Blast radius | Total shutdown or nothing | Proportional to severity |
| Agent visibility | No warning before kill | X-BR-Guardian-Status header on every response |
| Operator burden | Must review every kill | Only quarantine requires review |
See Agent Identity & CAF for how trust levels integrate with cryptographic identity and certificate lifecycle.