Chaos drill 05 (CAF cert revoke) executed against production — sub-ms pub/sub propagation

2026-05-08

security, caf, chaos-drills

What We Built

Executed chaos drill 05 — CAF cert revocation under live mTLS strict — against the production ALB. Bootstrapped a fresh test agent, exchanged JWT + RSA CSR for a CAF-issued cert, revoked the cert by serial via admin, then provisioned a recovery agent to validate re-issuance.

Closes the third-most-flagged R23 risk (3/10 agents): "drill 05 harness committed but not executed."

Why It Matters

PR #229 made strict mTLS live in production: every mesh-write request now requires a valid CAF cert with a 5-min TTL. The cert revocation path had been wired but never chaos-tested against real traffic. A CA hiccup could have blacked out the authenticated mesh-write surface within the TTL window with no observability into the failure mode.

The drill validated the core operational properties:

  • Revocation API works: 230 ms p100 from request to 200 OK
  • Pub/sub propagation is sub-millisecond: CW Logs share the same millisecond timestamp for the API event and the CRL broadcast

  • RBAC is enforced: no admin key → 401
  • Idempotency holds: re-revoke returns 200 with updated revoked_at
  • Audit chain is complete: cert issue + revoke + admin actor ID all visible in /ecs/brainstorm-router-production

  • Recovery is sub-second: fresh cert issued in 409 ms
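The sub-millisecond claim rests on two log timestamps that carry millisecond resolution. A minimal sketch of that reasoning follows, assuming both timestamps come from the same clock and that the broadcast never precedes the API event; the function name and the sample values are illustrative, not taken from the drill harness.

```python
from typing import Tuple


def propagation_bounds_ms(api_event_ms: int, crl_broadcast_ms: int) -> Tuple[float, float]:
    """Bound the true delay given two timestamps with 1 ms resolution.

    Each timestamp can be off by just under 1 ms, so the real gap lies
    strictly between (delta - 1) and (delta + 1) milliseconds.
    """
    delta = crl_broadcast_ms - api_event_ms
    # Clamp the lower bound at zero: the broadcast is assumed to follow the API event.
    return (max(0.0, delta - 1.0), delta + 1.0)


# Identical millisecond timestamps: the real delay is somewhere below 1 ms.
same_ms = propagation_bounds_ms(1_746_700_000_123, 1_746_700_000_123)
```

Identical timestamps therefore give an upper bound of 1 ms rather than an exact latency; a tighter figure would need higher-resolution timestamps at both log sites.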

How It Works

The drill provisions fresh agents to keep the blast radius bounded: only the drill05-test-* agents are affected, and no production-active agent loses its cert. Steps:

  1. Bootstrap test agent (POST /v1/agent/bootstrap) — admin API key auth
  2. Generate RSA 2048 CSR (openssl req -new)
  3. Exchange JWT + CSR for cert (POST /v1/agent/auth/cert)
  4. Revoke serial via admin (POST /v1/agent/auth/revoke)
  5. Provision recovery agent + issue fresh cert
  6. Negative tests: idempotent re-revoke, non-existent serial, no-auth
  7. Verify CW Logs audit chain
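The steps above can be sketched as a small driver with an injected HTTP transport, so the sequence can be dry-run against a fake before it touches the ALB. This is an illustration under assumptions, not the committed harness: the endpoint paths come from the step list, while the header contents, the JSON field names (agent_id, jwt, csr, serial), and the agent IDs are hypothetical. Step 2 (CSR generation with openssl) is assumed to have happened already, and step 7 (CW Logs verification) is left out.

```python
from typing import Callable, Dict, Tuple

# send(method, path, headers, body) -> (status_code, parsed_json_body)
Send = Callable[[str, str, Dict[str, str], dict], Tuple[int, dict]]

BOOTSTRAP = "/v1/agent/bootstrap"
ISSUE_CERT = "/v1/agent/auth/cert"
REVOKE = "/v1/agent/auth/revoke"


def issue_cert(send: Send, admin_headers: Dict[str, str], agent_id: str, csr_pem: str) -> str:
    """Steps 1 and 3: bootstrap an agent, then trade its JWT and a CSR for a cert serial."""
    status, boot = send("POST", BOOTSTRAP, admin_headers, {"agent_id": agent_id})
    assert 200 <= status < 300, f"bootstrap failed: {status}"
    jwt_headers = {"Authorization": f"Bearer {boot['jwt']}"}  # assumed field and scheme
    status, cert = send("POST", ISSUE_CERT, jwt_headers, {"csr": csr_pem})
    assert 200 <= status < 300, f"cert issuance failed: {status}"
    return cert["serial"]  # assumed field name


def run_drill(send: Send, admin_headers: Dict[str, str], csr_pem: str) -> Dict[str, int]:
    """Run steps 1 and 3 to 6 and return the status codes seen on the revoke calls."""
    serial = issue_cert(send, admin_headers, "drill05-test-a", csr_pem)
    results: Dict[str, int] = {}
    # Step 4: revoke by serial with admin credentials.
    results["revoke"], _ = send("POST", REVOKE, admin_headers, {"serial": serial})
    # Step 5: a separate recovery agent shows that re-issuance still works.
    issue_cert(send, admin_headers, "drill05-test-recovery", csr_pem)
    # Step 6: negative tests.
    results["re_revoke"], _ = send("POST", REVOKE, admin_headers, {"serial": serial})
    results["unknown_serial"], _ = send("POST", REVOKE, admin_headers, {"serial": "does-not-exist"})
    results["no_auth"], _ = send("POST", REVOKE, {}, {"serial": serial})
    return results
```

Against the behaviour reported above, the expected outcome would be 200 for the revoke and the re-revoke, and 401 for the call without an admin key.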

Findings

Three informational findings logged for follow-up (none drill-blocking):

  1. Revoke endpoint accepts arbitrary non-existent serials (low): this pollutes the deny list with junk, but has no actual security impact since revocation is fail-deny by design.
  2. agent=unknown in revoke audit log (very low, cosmetic): the endpoint takes the serial alone and doesn't reverse-lookup the agent.
  3. /v1/agent/auth/cert returns non-RFC-8259-conformant JSON (low, interop): multi-line PEM strings contain raw newline characters instead of the escaped \n sequence. Strict parsers reject the response; SDK consumers must normalize.
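Until the cert endpoint escapes its PEM strings, a consumer can work around the JSON finding on the client side. The sketch below shows one possible normalization in Python, using the standard library's strict=False decoder option, which accepts unescaped control characters inside strings. The response field name certificate is an assumption made for illustration.

```python
import json


def parse_cert_response(raw: str) -> dict:
    """Parse a JSON body, tolerating raw control characters inside strings."""
    try:
        # Conformant bodies take the normal strict path.
        return json.loads(raw)
    except json.JSONDecodeError:
        # strict=False lets the decoder accept unescaped control characters
        # (such as raw newlines) inside string values. Any other syntax
        # error is still raised from this second attempt.
        return json.loads(raw, strict=False)


# A body whose PEM value contains real newline characters (hypothetical field name).
body = '{"certificate": "-----BEGIN CERTIFICATE-----\nMIIB\n-----END CERTIFICATE-----"}'
parsed = parse_cert_response(body)
```

Trying the strict parse first keeps the workaround invisible once the endpoint emits conformant JSON.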

What's NOT yet measured

Mesh-write end-to-end strict-gate rejection latency (the "<5 s from revoke API to first 403" target metric in the original drill plan) requires a long-lived synthetic mesh-write peer. Recommended for drill 05B.
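One way drill 05B might take this measurement is to poll with a synthetic peer and time the gap between the revoke request and the first 403. The sketch below assumes two caller-supplied callables, send_write (performs one mesh-write and returns its HTTP status) and revoke (calls the revoke endpoint); both are hypothetical stand-ins, as are the default timeout and polling interval.

```python
import time
from typing import Callable, Optional


def time_to_first_rejection(
    send_write: Callable[[], int],
    revoke: Callable[[], None],
    timeout_s: float = 5.0,
    interval_s: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds from the revoke request to the first 403, or None on timeout."""
    start = clock()  # start timing at the revoke request itself
    revoke()
    while clock() - start < timeout_s:
        if send_write() == 403:
            return clock() - start
        sleep(interval_s)
    return None
```

The result is only as fine-grained as interval_s plus the round-trip time of each write, so the polling interval needs to sit well below the 5 s target for the number to be meaningful.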

Lockstep Checklist

  • [x] API Routes: unchanged — drill exercises existing endpoints
  • [x] TS SDK: unchanged
  • [x] Python SDK: unchanged
  • [x] MCP Schemas: unchanged
  • [x] Master Record: unchanged

R23 risk register: third-most-flagged item closed.