Chaos drill 05 (CAF cert revoke) executed against production — sub-ms pub/sub propagation
2026-05-08
What We Built
Executed chaos drill 05 — CAF cert revocation under live mTLS strict — against the production ALB. Bootstrapped a fresh test agent, exchanged JWT + RSA CSR for a CAF-issued cert, revoked the cert by serial via admin, then provisioned a recovery agent to validate re-issuance.
Closes the third-most-flagged R23 risk (3/10 agents): "drill 05 harness committed but not executed."
Why It Matters
PR #229 made strict mTLS live in production: every mesh-write request now requires a valid CAF cert with a 5-min TTL. The cert revocation path had been wired but never chaos-tested against real traffic. A CA hiccup could have blacked out the authenticated mesh-write surface within the TTL window with no observability into the failure mode.
The drill validated the core operational properties:
- Revocation API works: 230 ms p100 from request to 200 OK
- Pub/sub propagation is sub-millisecond: CW Logs share the same millisecond timestamp for the API event and the CRL broadcast
- RBAC is enforced: no admin key → 401
- Idempotency holds: re-revoke returns 200 with updated revoked_at
- Audit chain is complete: cert issue + revoke + admin actor ID all visible in /ecs/brainstorm-router-production
- Recovery is sub-second: fresh cert issued in 409 ms
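As a small worked check on the propagation claim, the sketch below shows what two millisecond-resolution log timestamps can and cannot prove. It assumes both events are stamped by the same clock and truncated to whole milliseconds; the function name and the sample values are illustrative, not taken from the drill.

```python
def propagation_bounds_ms(api_event_ms: int, crl_event_ms: int) -> tuple[int, int]:
    """Return (lower, upper) bounds in ms on the true delay; the upper bound is exclusive."""
    if crl_event_ms < api_event_ms:
        raise ValueError("CRL broadcast is stamped before the API event; check clock sources")
    delta = crl_event_ms - api_event_ms
    # Each stamp hides up to 1 ms of truncation, so the true delay
    # lies within one millisecond either side of the observed delta.
    return (max(delta - 1, 0), delta + 1)
```

Identical timestamps give bounds of (0, 1), i.e. a delay under 1 ms, which is the strongest statement the log resolution supports.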
How It Works
The drill provisions fresh agents to keep the blast radius bounded: only the drill05-test-* agents are affected, and no production-active agent loses its cert. Steps:
- Bootstrap test agent (POST /v1/agent/bootstrap), admin API key auth
- Generate RSA 2048 CSR (openssl req -new)
- Exchange JWT + CSR for cert (POST /v1/agent/auth/cert)
- Revoke serial via admin (POST /v1/agent/auth/revoke)
- Provision recovery agent + issue fresh cert
- Negative tests: idempotent re-revoke, non-existent serial, no-auth
- Verify CW Logs audit chain
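The steps above can be sketched as a small Python harness. This is a sketch under stated assumptions, not the actual drill script: only the three endpoint paths and the use of openssl come from the drill. The X-Admin-Key header, the Bearer scheme, the request and response field names (agent_id, jwt, csr, serial), and the two agent names are hypothetical. The HTTP transport is injected as send so the flow can be exercised without touching a live environment.

```python
import subprocess
from typing import Callable

# (method, path, headers, json_body) -> parsed JSON response
Send = Callable[[str, str, dict, dict], dict]


def openssl_csr(common_name: str, key_path: str) -> str:
    """Generate an RSA 2048 key at key_path and return a PEM CSR via the openssl CLI."""
    result = subprocess.run(
        ["openssl", "req", "-new", "-newkey", "rsa:2048", "-nodes",
         "-keyout", key_path, "-subj", f"/CN={common_name}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout  # openssl req writes the CSR to stdout when -out is omitted


def issue_cert(send: Send, make_csr: Callable[[str], str],
               admin_headers: dict, agent_id: str) -> dict:
    """Bootstrap an agent, then exchange its JWT + CSR for a cert."""
    boot = send("POST", "/v1/agent/bootstrap", admin_headers, {"agent_id": agent_id})
    auth = {"Authorization": f"Bearer {boot['jwt']}"}  # assumed auth scheme
    return send("POST", "/v1/agent/auth/cert", auth, {"csr": make_csr(agent_id)})


def run_drill(send: Send, make_csr: Callable[[str], str], admin_key: str) -> dict:
    """Issue a cert, revoke it by serial, then issue a cert to a recovery agent."""
    admin = {"X-Admin-Key": admin_key}  # hypothetical header name
    cert = issue_cert(send, make_csr, admin, "drill05-test-victim")
    revoked = send("POST", "/v1/agent/auth/revoke", admin, {"serial": cert["serial"]})
    recovery = issue_cert(send, make_csr, admin, "drill05-test-recovery")
    return {
        "revoked_serial": cert["serial"],
        "revoke_response": revoked,
        "recovery_serial": recovery["serial"],
    }
```

In a real run, make_csr could wrap openssl_csr with a per-agent key path, and send would be a thin wrapper over whichever HTTP client the harness already uses.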
Findings
Three informational findings logged for follow-up (none drill-blocking):
- Revoke endpoint accepts arbitrary non-existent serials (low): pollutes the deny list with junk; no actual security impact since revocation is fail-deny by design.
- agent=unknown in revoke audit log (very low, cosmetic): the endpoint takes the serial alone and doesn't reverse-lookup the agent.
- /v1/agent/auth/cert returns non-RFC-8259-conformant JSON (low, interop): multi-line PEM strings contain raw newline characters instead of the escaped sequence \n. Strict parsers reject the response; SDK consumers must normalize.
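For the JSON finding, one way a consumer might normalize the response before parsing is to escape raw control characters that appear inside string literals. The sketch below is illustrative only: the function name is hypothetical, and the sample payload in the usage note uses placeholder field names and contents rather than the endpoint's real response shape.

```python
import json

# Raw control characters that commonly leak into PEM strings, mapped to JSON escapes.
_ESCAPES = {"\n": "\\n", "\r": "\\r", "\t": "\\t"}


def escape_raw_newlines(raw: str) -> str:
    """Escape raw newline/CR/tab characters found inside JSON string literals."""
    out = []
    in_string = False
    escaped = False  # True when the previous character was a backslash inside a string
    for ch in raw:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            elif ch in _ESCAPES:
                out.append(_ESCAPES[ch])
                continue
        elif ch == '"':
            in_string = True
        out.append(ch)
    return "".join(out)
```

With a placeholder payload such as '{"cert": "-----BEGIN CERTIFICATE-----\nMIIB\n-----END CERTIFICATE-----\n"}' (real newlines inside the string), json.loads rejects the raw text but accepts escape_raw_newlines(raw). In Python specifically, json.loads(raw, strict=False) is a simpler alternative, since it allows control characters inside strings; the explicit pass is mainly useful when the normalized text has to be handed to another strict parser.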
What's NOT yet measured
Mesh-write end-to-end strict-gate rejection latency (the "<5 s from revoke API to first 403" target metric in the original drill plan) requires a long-lived synthetic mesh-write peer. Recommended for drill 05B.
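A measurement for drill 05B could look roughly like the polling loop below. It is a sketch under assumptions: probe stands in for one mesh-write attempt by the long-lived synthetic peer and returns the HTTP status, revoke stands in for the admin revoke call, and the clock starts when the revoke request is issued. None of these names come from the drill plan; the 5-second default mirrors the stated target.

```python
import time
from typing import Callable, Optional


def time_to_first_rejection(
    probe: Callable[[], int],
    revoke: Callable[[], None],
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    interval_s: float = 0.05,
    timeout_s: float = 5.0,
) -> Optional[float]:
    """Return seconds from issuing the revoke to the first 403, or None on timeout."""
    start = clock()
    revoke()
    while True:
        status = probe()
        elapsed = clock() - start
        if status == 403:
            return elapsed
        if elapsed >= timeout_s:
            return None  # target missed: no rejection observed within the window
        sleep(interval_s)
```

The measured value is accurate only to within interval_s plus one probe round trip, so the polling interval should be small relative to the target. Injecting clock and sleep keeps the loop testable without real waiting.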
Lockstep Checklist
- [x] API Routes: unchanged — drill exercises existing endpoints
- [x] TS SDK: unchanged
- [x] Python SDK: unchanged
- [x] MCP Schemas: unchanged
- [x] Master Record: unchanged
R23 risk register: third-most-flagged item closed.