P0a — Agent budget enforcement (canonicalize agentLimits keys; closes Computer Protocol #244)

2026-05-09

api, security

LOCKSTEP TRACEABILITY MATRIX
---
api_endpoints: ["PATCH /v1/agent-limits", "PUT /v1/agent-limits", "POST /v1/chat/completions (response header)"]
sdk_methods_updated: ["none — no SDK contract change (canonicalization is server-side; SDKs that call PATCH continue working with bare or prefixed keys)"]
mcp_tools_updated: ["none"]
---

What We Built

Closes Computer Protocol issue #244. The bug: an agent with enforced_daily_usd: 0.000001 successfully ran a $0.000805 auto:best completion. The response header reported X-BR-Budget-Remaining: 49.95, which is the tenant-level balance rather than an agent-level figure. The adversarial-agent role rated this 8.0 attacker-advantage in the Computer Protocol stochastic review.

Root cause (verified end-to-end in code):

  • src/api/middleware/agent-limits.ts:34 calls resolveAgentOrUserId(c, body).
  • src/api/shared/resolve-user-id.ts:53-62 returns agent:${agentIdentity.agentId} (prefixed) for agent-JWT callers — the prefix is the canonical namespace separator from human user IDs in the Redis budget/rate-limit stores.
  • src/api/routes/agent-limits.ts PUT/PATCH stored whatever string the operator submitted, with no canonicalization. Operators looking at /v1/agents/profiles/me see bare agentId strings everywhere and naturally submit those bare strings as map keys.
  • Result: the middleware looked up agent:<agentId> in agentLimits, found nothing because the entry was stored under the bare <agentId>, and returned next() with no enforcement.

Fix:

  1. New helper src/api/shared/agent-limits-keys.ts exports:
     • canonicalizeAgentLimitsKey(rawKey): maps a bare <agentId> to agent:<agentId>; preserves already-canonical keys and user:-prefixed keys.
     • canonicalizeAgentLimitsMap(input): applies canonicalization across a map.
     • lookupAgentLimit(map, canonicalKey): migration-aware lookup. Tries the canonical key first, then falls back to the bare key for legacy storage. Returns {value, matchedKey, legacy} so callers can log migration progress.
  2. PUT /v1/agent-limits: canonicalizes the entire incoming map before persisting.
  3. PATCH /v1/agent-limits: canonicalizes each entry and only touches the canonical key. Writes go to agent:<agentId>; null-deletes remove agent:<agentId> only. Legacy bare-key entries are left in place to avoid the cross-tenant blast-radius bug codex round-1 flagged (a tenant with both an agent at agent:abc and a human at bare abc would lose the human entry on any PATCH that auto-cleaned bare aliases). Migration converges via the read-side shim (lookupAgentLimit prefers canonical over bare); legacy entries can be cleaned up explicitly with a PATCH submitting the bare key with null.
  4. agentLimitsMiddleware: uses lookupAgentLimit to read the canonical key first and falls back to the bare key for backward compatibility. Logs a [Agent limit matched on legacy bare key …] warning once per legacy match so operator-side migration progress is observable in CloudWatch. Sets the _agentBudgetRemainingUsd context variable.
  5. The non-streaming.ts completion route emits the X-BR-Agent-Budget-Remaining response header (separate from X-BR-Budget-Remaining, which stays tenant-level). It uses 6-decimal precision so micro-budgets like the $0.000001 test case display correctly.
  6. The error response now includes a recovery field with hint + endpoint, pointing operators at PATCH /v1/agent-limits for resolution.
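
The key-canonicalization rules in item 1 can be sketched roughly as follows. This is an illustrative approximation, not the contents of agent-limits-keys.ts: the AgentLimit and AgentLimitsMap types, the prefix constants, and the function bodies are assumptions; only the function names and the described behavior (bare keys gain the agent: prefix, prefixed keys pass through, collisions resolve last-write-wins) come from this entry.

```typescript
// Illustrative sketch only. Types and constants are assumptions.
type AgentLimit = { maxBudgetUsd: number };
type AgentLimitsMap = Record<string, AgentLimit>;

const AGENT_PREFIX = "agent:";
const USER_PREFIX = "user:";

// Bare IDs gain the agent: prefix; already-prefixed keys pass through untouched.
function canonicalizeAgentLimitsKey(rawKey: string): string {
  if (rawKey.startsWith(AGENT_PREFIX) || rawKey.startsWith(USER_PREFIX)) {
    return rawKey;
  }
  return `${AGENT_PREFIX}${rawKey}`;
}

// Canonicalizes every key. If two raw keys collapse to the same canonical key,
// the entry that appears later in the input overwrites the earlier one.
function canonicalizeAgentLimitsMap(input: AgentLimitsMap): AgentLimitsMap {
  const out: AgentLimitsMap = {};
  for (const [rawKey, limit] of Object.entries(input)) {
    out[canonicalizeAgentLimitsKey(rawKey)] = limit;
  }
  return out;
}

// Example: a PUT body mixing bare, agent-prefixed, and user-prefixed keys.
const stored = canonicalizeAgentLimitsMap({
  "abc-123": { maxBudgetUsd: 0.000001 },
  "agent:abc-123": { maxBudgetUsd: 5 },
  "user:u-1": { maxBudgetUsd: 10 },
});
// stored now holds two keys: "agent:abc-123" (from the later entry) and "user:u-1".
```

Keeping the prefix check in one function means the PUT and PATCH handlers and the tests can share a single definition of "canonical".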

Why It Matters

This was a credibility-class gap. The Computer Protocol release-gate items 1 ("Agent budget overrun is blocked before provider execution") and 2 ("Agent identity appears correctly in completion audit, headers, usage, and cost-center attribution") both depended on this enforcement working. Until this PR, both items were red.

The fix also eliminates a class of operator-confusion bug: anyone interacting with /v1/agent-limits via the documented endpoints would silently fail enforcement unless they happened to know the runtime resolver's agent: prefix convention — which was undocumented and visible only by reading resolve-user-id.ts. Server-side canonicalization removes the ambiguity.

How It Works

Operator: PATCH /v1/agent-limits
  body: {"abc-123": {"maxBudgetUsd": 0.000001}}
        ↓ canonicalizeAgentLimitsKey("abc-123") → "agent:abc-123"
Stored: tenantSettings.agentLimits["agent:abc-123"] = {maxBudgetUsd: 0.000001}

Runtime: POST /v1/chat/completions (with agent JWT)
  → resolveAgentOrUserId returns "agent:abc-123"
  → lookupAgentLimit(settings.agentLimits, "agent:abc-123")
  → canonical match → returns {maxBudgetUsd: 0.000001, ..., legacy: false}
  → middleware enforces budget → 402 if spend ≥ cap
  → c.set("_agentBudgetRemainingUsd", remaining)

Response handler:
  → c.header("X-BR-Agent-Budget-Remaining", remaining.toFixed(6))
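
As a rough sketch of the enforcement and header steps above: the check below mirrors "402 if spend ≥ cap", and the formatter mirrors remaining.toFixed(6). The function names, the clamp of the remaining amount at zero, and the error body's error and agentKey fields are hypothetical; the source only states that the error carries a recovery field with a hint and an endpoint pointing at PATCH /v1/agent-limits.

```typescript
// Illustrative sketch only. All names and the body shape are assumptions.
type BudgetCheck = { allowed: boolean; remainingUsd: number };

// Blocks when spend has reached or exceeded the cap (spend >= cap).
function checkAgentBudget(spentUsd: number, capUsd: number): BudgetCheck {
  return {
    allowed: spentUsd < capUsd,
    // Clamping at zero is an assumption made here for display purposes.
    remainingUsd: Math.max(0, capUsd - spentUsd),
  };
}

// Six decimals keep micro-budgets visible; two decimals would print "0.00".
function formatAgentBudgetRemaining(remainingUsd: number): string {
  return remainingUsd.toFixed(6);
}

// Hypothetical 402 body; only the recovery.hint and recovery.endpoint fields
// are described in this entry.
function budgetExceededBody(agentKey: string) {
  return {
    error: "agent_budget_exceeded", // hypothetical error code
    agentKey,
    recovery: {
      hint: "Raise or remove the agent limit to continue.",
      endpoint: "PATCH /v1/agent-limits",
    },
  };
}

const check = checkAgentBudget(0.000805, 0.000001);
const headerValue = formatAgentBudgetRemaining(0.000001);
```

With a cap of 0.000001 and spend of 0.000805, the check reports allowed: false, and a remaining value of 0.000001 formats as "0.000001" rather than rounding away.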

Migration shim (read-side only):

If canonical key not found, lookupAgentLimit strips "agent:" and tries bare:
  agentLimits["abc-123"]  → matched as legacy → log warning + return value
Operator can re-PATCH at any time to converge on canonical storage.
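
A minimal sketch of the read-side shim, assuming the map is a plain object keyed by string. The function name and the {value, matchedKey, legacy} result fields come from this entry; the types, the no-match representation (all fields undefined, legacy false), and the own-property guard are assumptions. The exact warning text is elided in the source, so the log line below is a stand-in.

```typescript
// Illustrative sketch only. Types and the no-match shape are assumptions.
type AgentLimit = { maxBudgetUsd: number };
type AgentLimitsMap = Record<string, AgentLimit>;
type LookupResult = {
  value: AgentLimit | undefined;
  matchedKey: string | undefined;
  legacy: boolean;
};

const AGENT_PREFIX = "agent:";

// Reads only own properties so inherited names never count as a match.
function ownEntry(map: AgentLimitsMap, key: string): AgentLimit | undefined {
  return Object.prototype.hasOwnProperty.call(map, key) ? map[key] : undefined;
}

function lookupAgentLimit(
  map: AgentLimitsMap | undefined,
  canonicalKey: string,
): LookupResult {
  const noMatch: LookupResult = { value: undefined, matchedKey: undefined, legacy: false };
  if (!map) return noMatch;

  // 1. The canonical key always wins when present.
  const canonical = ownEntry(map, canonicalKey);
  if (canonical) return { value: canonical, matchedKey: canonicalKey, legacy: false };

  // 2. Only agent: keys fall back to the bare form; user: keys are never stripped.
  if (!canonicalKey.startsWith(AGENT_PREFIX)) return noMatch;
  const bareKey = canonicalKey.slice(AGENT_PREFIX.length);
  const legacyValue = ownEntry(map, bareKey);
  if (legacyValue) return { value: legacyValue, matchedKey: bareKey, legacy: true };

  return noMatch;
}

const hit = lookupAgentLimit({ "abc-123": { maxBudgetUsd: 0.000001 } }, "agent:abc-123");
if (hit.legacy) {
  // Stand-in message; the real log wording is not reproduced here.
  console.warn(`legacy bare-key match: ${hit.matchedKey}`);
}
```

Returning the matched key alongside the value lets the caller log which stored entry still needs a re-PATCH without a second lookup.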

Test Plan

  • 14 new unit tests in src/api/shared/agent-limits-keys.test.ts covering:
     • Canonicalization: bare → prefixed; already-prefixed left alone; user-prefixed left alone; UUID format
     • Map canonicalization: full normalization; user-prefix preservation; duplicate-collision last-write-wins
     • Lookup: canonical match (legacy=false); bare-key fallback (legacy=true); canonical-preferred-when-both-exist; missing → no match; undefined map; user-prefix isolation (no fallback strip)
  • Existing src/api/middleware/agent-limits.test.ts continues to pass (migration shim is backward-compatible)
  • Full suite: 7676 passed / 0 failed (840 → 840 test files, +14 tests, +1 file)
  • pnpm tsgo clean; oxlint --type-aware 0 warnings 0 errors; oxfmt clean

What This Doesn't Fix

  • Mass migration of existing legacy storage. This PR's migration shim is read-side only — legacy bare-key entries stay in storage until the next PUT/PATCH rewrites them. A separate one-shot migration could canonicalize every tenant's agentLimits map in one pass; deferred because (a) the warning log surfaces which tenants still need a re-PATCH, (b) the production deploy doesn't have hard SLA pressure to drain legacy keys (graceful migration is fine), (c) blast-radius of a bulk rewrite is higher than this PR warrants.
  • Computer Protocol issue #245 (mTLS-behind-ALB acceptance) — separate PR, P0b on queue.
  • Computer Protocol issue #246 (tool-call routing reasoning-model exclusion + cost) — P1 on queue.

Lockstep

  • TypeScript SDK: no change. SDK callers of PATCH /v1/agent-limits continue to work — bare or prefixed keys both succeed; server canonicalizes.
  • Python SDK: no change.
  • MCP tools: no change.
  • API surface: PATCH/PUT /v1/agent-limits request shape unchanged (same Zod schemas); response shape unchanged. New response header X-BR-Agent-Budget-Remaining on completions.
  • site/public/routes.json: unchanged.
  • OpenAPI: no shape changes; consider documenting the new header in a follow-up doc PR.
  • Ship log: this entry.

Closes Release Gate

  • Item 1: Agent budget overrun is blocked before provider execution. ✓
  • Item 2: Agent identity appears correctly in completion audit, headers, usage, and cost-center attribution. ✓ (the new X-BR-Agent-Budget-Remaining header is the agent-identity surface; cost-center attribution is partial — depends on issue #248 TS-2 to fully close.)