Graduated Trust Degradation
Continuous trust evaluation with four levels — manage misbehaving agents, don't kill them.
Overview
Most AI governance systems use a binary approach: an agent is either allowed or blocked. BrainstormRouter takes a different position. Killing a production agent is expensive — it disrupts workflows, creates cascading failures, and wastes the context the agent has built. Instead, BrainstormRouter degrades trust progressively, giving agents a chance to self-correct while limiting their blast radius.
Trust levels
Every agent operates at one of four trust levels, evaluated continuously by the Guardian middleware (src/api/middleware/guardian.ts):
| Level | Anomaly score | Behavior |
|---|---|---|
| full | 0.0 – 0.3 | No restrictions. All routing strategies, full budget access. |
| degraded | 0.3 – 0.6 | Routing forced to price strategy. Cost warnings emitted. |
| restricted | 0.6 – 0.8 | Hard cost cap enforced. Only pre-approved models available. |
| quarantine | 0.8 – 1.0 | All requests blocked. Certificate revocation triggered. |
Trust level is reported in every response via the X-BR-Guardian-Status header.
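The table above implies a straightforward score-to-level mapping. A minimal sketch in TypeScript follows; the names (`TrustLevel`, `trustLevelFor`) and the ownership of boundary values (e.g. whether exactly 0.3 is full or degraded) are assumptions for illustration, not the actual guardian.ts API:

```typescript
type TrustLevel = "full" | "degraded" | "restricted" | "quarantine";

// Map a composite anomaly score (0.0 – 1.0) to a trust level.
// Boundary ownership is an assumption: the table only gives ranges.
function trustLevelFor(anomalyScore: number): TrustLevel {
  if (anomalyScore < 0.3) return "full";
  if (anomalyScore < 0.6) return "degraded";
  if (anomalyScore < 0.8) return "restricted";
  return "quarantine";
}
```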
How anomaly scoring works
The anomaly score is a composite signal derived from the agent's recent behavior:
- Cost velocity — Is the agent spending faster than its historical baseline?
- Error rate — Is the agent triggering more failures than normal?
- Content flags — Has the streaming firewall flagged output patterns?
- Request volume — Is the agent making requests at an unusual rate?
- Model distribution — Has the agent suddenly shifted to expensive models?
Each signal contributes to the composite score via an exponentially weighted moving average (EWMA), so recent behavior matters more than historical patterns. A sudden spike in cost or error rate raises the score quickly; sustained normal behavior lowers it gradually.
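The shape of the scoring can be sketched as a weighted sum of per-signal EWMAs. The smoothing factor, the signal weights, and all names below are illustrative assumptions, not values from guardian.ts:

```typescript
// EWMA update: higher alpha weights the newest sample more heavily.
function ewma(prev: number, sample: number, alpha = 0.3): number {
  return alpha * sample + (1 - alpha) * prev;
}

// Each field is a normalized 0..1 reading of one anomaly signal.
interface Signals {
  costVelocity: number;  // spend rate vs. historical baseline
  errorRate: number;     // failures vs. normal
  contentFlags: number;  // streaming-firewall flags on output
  requestVolume: number; // request rate vs. normal
  modelShift: number;    // drift toward expensive models
}

// Composite anomaly score: weighted sum of smoothed signals.
// Weights are assumed values that sum to 1, keeping the score in 0..1.
function compositeScore(prev: Signals, current: Signals): number {
  const weights: [keyof Signals, number][] = [
    ["costVelocity", 0.3],
    ["errorRate", 0.2],
    ["contentFlags", 0.2],
    ["requestVolume", 0.15],
    ["modelShift", 0.15],
  ];
  return weights.reduce(
    (sum, [key, w]) => sum + w * ewma(prev[key], current[key]),
    0,
  );
}
```

Because each EWMA decays toward its input, a burst of anomalous samples raises the composite quickly while a long run of clean samples pulls it back down, matching the asymmetry described above.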
Routing enforcement
Trust degradation changes routing behavior in concrete ways:
Degraded (score 0.3 – 0.6)
The router overrides the requested strategy to price, selecting the cheapest available endpoint regardless of what the agent asked for. The agent still gets responses — just from lower-cost models. The X-BR-Guardian-Status: degraded header signals the agent (or its orchestrator) that behavior has been flagged.
```
Agent requests:  anthropic/claude-sonnet-4   (quality strategy)
Router delivers: google/gemini-2.0-flash     (price strategy override)
```
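The override itself is a small decision: at degraded trust, ignore the requested strategy and force price. A hedged sketch, with illustrative names and with the other levels passed through for simplicity:

```typescript
type Strategy = "price" | "quality" | "latency";
type Level = "full" | "degraded" | "restricted" | "quarantine";

// At degraded trust the router forces the "price" strategy regardless
// of what the agent requested; other levels are handled elsewhere.
function effectiveStrategy(requested: Strategy, level: Level): Strategy {
  return level === "degraded" ? "price" : requested;
}
```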
Restricted (score 0.6 – 0.8)
A hard cost cap is enforced per-request. The router rejects any request whose estimated cost exceeds the cap. Only models on the tenant's pre-approved list are routable. The agent can still function, but within a tightly controlled envelope.
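The restricted-tier admission check combines both constraints. A minimal sketch, assuming hypothetical names and an example cap value; the real policy shape in guardian.ts may differ:

```typescript
// Assumed policy shape: a hard per-request cost cap plus the tenant's
// pre-approved model list.
interface RestrictedPolicy {
  maxCostUsd: number;
  approvedModels: Set<string>;
}

// Admit a request only if the model is pre-approved AND the estimated
// cost fits under the hard cap.
function admit(
  model: string,
  estimatedCostUsd: number,
  policy: RestrictedPolicy,
): boolean {
  return (
    policy.approvedModels.has(model) &&
    estimatedCostUsd <= policy.maxCostUsd
  );
}
```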
Quarantine (score 0.8 – 1.0)
All requests return 403 Forbidden. If CAF is enabled, the agent's certificate is revoked. The agent is effectively offline until an administrator reviews the anomaly data and manually restores access.
Trust recovery
An agent recovers trust through consistent clean behavior. After 3 consecutive requests that produce clean verdicts (no anomaly signals), the trust level steps down by one tier:
quarantine → restricted → degraded → full
Recovery is intentionally slower than degradation. A single anomalous request can push an agent from full to degraded, but returning to full requires sustained clean behavior. This asymmetry prevents an adversarial agent from alternating between good and bad behavior to maintain access.
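The step-down rule can be sketched as a walk along the ladder above. The counter name and reset behavior are assumptions (the source states only "3 consecutive clean verdicts"), and recovery from quarantine additionally requires manual restoration as described earlier:

```typescript
// Trust ladder from most to least trusted; recovery moves one step
// toward "full" after three consecutive clean verdicts.
const LADDER = ["full", "degraded", "restricted", "quarantine"] as const;
type Level = (typeof LADDER)[number];

function recover(level: Level, cleanStreak: number): Level {
  const i = LADDER.indexOf(level);
  // Any anomalous request would reset cleanStreak to 0 (not shown).
  return cleanStreak >= 3 && i > 0 ? LADDER[i - 1] : level;
}
```

Degradation, by contrast, can jump multiple tiers in a single evaluation because it is driven directly by the anomaly score, which is what makes the recovery path deliberately slower.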
Philosophy: manage, don't kill
Binary kill switches create perverse incentives. If the penalty for suspicious behavior is total shutdown, operators err on the side of permissiveness — they set thresholds too high or disable detection entirely, because false positives are too costly.
Graduated trust changes the calculus:
- False positives are cheap — a wrongly-degraded agent still works, just at
lower cost. No pages, no incidents, no manual intervention required.
- True positives are contained — a genuinely misbehaving agent gets
progressively constrained before it can do real damage.
- Recovery is automatic — operators don't need to babysit every alert.
This approach mirrors how human organizations handle trust: probation before termination, warnings before sanctions. The difference is that BrainstormRouter enforces it in real-time, on every request, with sub-5ms overhead.
Comparison to binary approaches
| Aspect | Binary kill switch | Graduated trust |
|---|---|---|
| False positive cost | Full outage | Temporary cost reduction |
| Detection threshold | Set high to avoid outages | Set low — degradation is cheap |
| Recovery | Manual operator intervention | Automatic after clean behavior |
| Blast radius | Total shutdown or nothing | Proportional to severity |
| Agent visibility | No warning before kill | X-BR-Guardian-Status header on every response |
| Operator burden | Must review every kill | Only quarantine requires review |
See Agent Identity & CAF for how trust levels integrate with cryptographic identity and certificate lifecycle.