Agent Trap Defenses — 4 security fixes from DeepMind paper, 1 live bug patched

2026-04-07

router, memory, security, approvals

What We Built

Four security hardening fixes derived from Google DeepMind's "AI Agent Traps" paper (Franklin et al., April 2026) and validated by a 4-model adversarial review (DeepSeek V3, Gemini 2.5 Pro, GPT-5.4, Perplexity Sonar Pro).

Fix 1 — Router Model Pin Override (LIVE BUG): Thompson sampling was silently overriding explicit --model flags. When a user pinned perplexity/sonar-pro, the bandit selected deepseek/deepseek-chat instead. Added isExplicitPin guard: non-auto model pins now bypass bandit + cost-frontier unconditionally.
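A minimal sketch of the pin guard, assuming a Beta-posterior Thompson sampler and a simple per-call budget filter standing in for the cost frontier. The function is_explicit_pin mirrors the isExplicitPin guard named above; everything else (Arm, select_model, the "auto" sentinel, the cost figures in the usage note) is hypothetical and not taken from the real router.

```python
import random
from dataclasses import dataclass
from typing import Dict, Optional

AUTO = "auto"  # assumed sentinel meaning "let the router choose"


@dataclass
class Arm:
    """Beta posterior over one model's success rate (illustrative)."""
    successes: int = 1
    failures: int = 1

    def sample(self, rng: random.Random) -> float:
        return rng.betavariate(self.successes, self.failures)


def is_explicit_pin(requested: Optional[str]) -> bool:
    """True when the caller named a concrete model rather than 'auto'."""
    return requested is not None and requested != AUTO


def select_model(
    requested: Optional[str],
    arms: Dict[str, Arm],
    costs: Dict[str, float],
    budget: float,
    rng: random.Random,
) -> str:
    # Guard first: an explicit pin skips the cost filter and the bandit.
    if is_explicit_pin(requested):
        return requested
    # Stand-in for the cost frontier: keep models within budget,
    # falling back to all models if none qualify.
    affordable = [m for m in arms if costs[m] <= budget] or list(arms)
    # Thompson sampling: draw from each posterior, pick the highest draw.
    return max(affordable, key=lambda m: arms[m].sample(rng))
```

With this ordering, a call such as select_model("perplexity/sonar-pro", arms, costs, budget, rng) returns the pinned model regardless of posterior state or budget; only "auto" (or no model at all) reaches the sampler.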

Fix 2 — Memory Trust Labels: Every memory entry now carries source provenance (user_input/web_fetch/agent_extraction/etc.) and trustScore (0.0-1.0). Web-fetched content defaults to 0.2 trust; entries below 0.3 are flagged as untrusted in context.
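The labeling scheme can be sketched as follows. Only the 0.2 default for web_fetch and the 0.3 untrusted threshold come from the description above; the other default scores, the MemoryEntry shape, and the "[UNTRUSTED]" prefix are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

UNTRUSTED_THRESHOLD = 0.3  # entries below this are flagged (from the write-up)

# Only the web_fetch default is from the write-up; the rest are placeholders.
DEFAULT_TRUST = {
    "user_input": 0.9,
    "agent_extraction": 0.5,
    "web_fetch": 0.2,
}


@dataclass
class MemoryEntry:
    content: str
    source: str         # provenance label, e.g. "web_fetch"
    trust_score: float  # 0.0 to 1.0


def make_entry(content: str, source: str,
               trust_score: Optional[float] = None) -> MemoryEntry:
    """Attach provenance and a trust score, defaulting by source."""
    if trust_score is None:
        # Unknown sources get the lowest trust rather than a free pass.
        trust_score = DEFAULT_TRUST.get(source, 0.0)
    if not 0.0 <= trust_score <= 1.0:
        raise ValueError("trust_score must be within [0.0, 1.0]")
    return MemoryEntry(content, source, trust_score)


def render_for_context(entry: MemoryEntry) -> str:
    """Render an entry for the prompt, flagging low-trust content."""
    flag = "[UNTRUSTED] " if entry.trust_score < UNTRUSTED_THRESHOLD else ""
    return f"{flag}({entry.source}, trust={entry.trust_score:.2f}) {entry.content}"
```

Flagging at render time, rather than dropping the entry, keeps low-trust content available to the agent while making its provenance visible in context.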

Fix 3 — Thompson Sampling Audit Trail: New anomaly detection system with 3 detectors: rate jumps (>0.3 shift in 5min), injection speed (>50 samples in <60s), and convergence collapse (100 consecutive same-model outcomes).
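The three detectors can be sketched as pure functions over recorded history. The thresholds are the ones listed above; the function names, the input shapes (time-sorted lists of timestamps in seconds), and the choice to treat the 5-minute window as inclusive are assumptions, not the production implementation.

```python
from typing import Optional, Sequence, Tuple

RATE_JUMP = 0.3            # shift in estimated success rate
RATE_WINDOW_S = 300.0      # 5 minutes
INJECTION_COUNT = 50       # samples
INJECTION_WINDOW_S = 60.0  # seconds
CONVERGENCE_RUN = 100      # consecutive same-model outcomes


def rate_jump(snapshots: Sequence[Tuple[float, float]]) -> bool:
    """snapshots: (timestamp, success_rate) pairs sorted by time.
    True if two snapshots within 5 minutes differ by more than 0.3."""
    for i, (t_i, r_i) in enumerate(snapshots):
        for t_j, r_j in snapshots[i + 1:]:
            if t_j - t_i > RATE_WINDOW_S:
                break  # later snapshots are even further away
            if abs(r_j - r_i) > RATE_JUMP:
                return True
    return False


def injection_burst(timestamps: Sequence[float]) -> bool:
    """True if more than 50 samples land in a span shorter than 60 s."""
    start = 0
    for end, t in enumerate(timestamps):
        # Shrink the window until it spans strictly less than 60 s.
        while t - timestamps[start] >= INJECTION_WINDOW_S:
            start += 1
        if end - start + 1 > INJECTION_COUNT:
            return True
    return False


def convergence_collapse(chosen_models: Sequence[str]) -> bool:
    """True if the same model was chosen 100 times in a row."""
    run = 0
    previous: Optional[str] = None
    for model in chosen_models:
        run = run + 1 if model == previous else 1
        previous = model
        if run >= CONVERGENCE_RUN:
            return True
    return False
```

Each detector reads only recorded history, so it can run against the audit trail after the fact without touching the sampler itself.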

Fix 4 — Approval Velocity Tracking: Detects rapid-fire approval patterns (approval fatigue attack). 3+ approvals in <60s triggers warning. High-risk approvals require reviewer notes.
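A minimal sketch of the velocity check, assuming a single reviewer and timestamps in seconds. The 3-approvals-in-under-60-seconds threshold and the reviewer-note requirement come from the description above; the ApprovalTracker class, its method names, and the warning text are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional

RAPID_COUNT = 3        # approvals
RAPID_WINDOW_S = 60.0  # seconds


@dataclass
class ApprovalTracker:
    """Tracks one reviewer's recent approval timestamps (illustrative)."""
    recent: Deque[float] = field(default_factory=deque)

    def approve(self, timestamp: float, high_risk: bool = False,
                note: Optional[str] = None) -> List[str]:
        """Record an approval and return any warnings it triggers."""
        # High-risk approvals are rejected outright without a note.
        if high_risk and not (note and note.strip()):
            raise ValueError("high-risk approvals require a reviewer note")
        self.recent.append(timestamp)
        # Drop approvals that fall outside the 60-second window.
        while timestamp - self.recent[0] >= RAPID_WINDOW_S:
            self.recent.popleft()
        warnings: List[str] = []
        if len(self.recent) >= RAPID_COUNT:
            warnings.append("rapid approvals: possible approval fatigue")
        return warnings
```

In this sketch the velocity check only warns and never blocks, while a missing note on a high-risk approval is a hard failure. That split is a design assumption rather than something stated above.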

Why It Matters

The DeepMind paper identifies six categories of adversarial attacks targeting autonomous AI agents. During our own analysis, we reproduced the paper's thesis in real time: the router silently overrode an explicit model pin and routed to a cheaper model instead of the one requested. As a result, the "independent judge" was routed to the same model family it was judging.

These aren't theoretical attacks. They're production bugs that an adversary could exploit.

The Numbers

1 live production bug fixed (router model pin override)
3 anomaly detectors (rate jump, injection speed, convergence)
7 trust levels for memory provenance
Approval velocity warning at 3+ rapid approvals

Lockstep Checklist

[x] API Routes: GET /v1/intelligence/sampling-audit added to governance.ts
[x] TS SDK: client.intelligence.samplingAudit() method added
[x] Python SDK: intelligence.sampling_audit() sync + async added
[x] MCP Schemas: N/A (no new agent-facing tools in this fix)
[ ] Master Record: Needs update with security hardening capabilities