2026-05-08-drill-02-redis-auth-failure

2026-05-08 — Drill 02 executed: Redis auth-failure on boot

Summary

Ran chaos drill 02 (Redis unavailability) against production via AWS Secrets Manager rotation + ECS force redeploy — the boot-time-AUTH-failure variant of the documented manual procedure. Captured one real production-grade bug: unhandled promise rejection on Redis WRONGPASS in the boot path, plus job-queue error spam during Redis reconnect loops.

Headline numbers

Chaos window: 8m12s. AWS Secret rotated to bogus URL at T0 = 11:49:53Z; restored at 11:58:05Z.
Trickle: 292 requests, 100% 200 OK. Latency p50 unchanged from baseline (0.87s vs 0.88s).
Recovery: 3m08s from rollback to 2/2 RUNNING/HEALTHY.
Customer impact: zero. ALB only routed to the one healthy old task (3h uptime); new tasks failed health check and never registered.

Drill setup

The documented drill 02 README labels Redis pull as "manual procedure" requiring Upstash console access. This run executed an automated variant:

Capture original Redis URL from AWS Secrets Manager + 1Password (verified match)
Rotate AWS Secret brainstorm-router/production/redis-url to a bogus password on the same host
Force ECS redeploy — running tasks keep their working URL (loaded into env on boot), new tasks see the bogus value
Background trickle of 1 RPS completions for the duration
After 8 minutes, restore the secret + redeploy
Watch recovery; verify /health.redis = true and 2/2 HEALTHY

This variant tests the boot-time AUTH-failure mode specifically — the under-tested branch in feedback_redis_failure_modes.md. The other branch (live-connection-drop on a running task) still requires Upstash console access for highest fidelity.

Real findings

1. Unhandled promise rejection in boot path

11:57:58 [brainstormrouter] Unhandled promise rejection: ReplyError: WRONGPASS invalid username-password pair
            at parseError (/app/node_modules/.pnpm/redis-parser@3.0.0/node_modules/redis-parser/lib/parser.js:179:12)
            at parseType (/app/node_modules/.pnpm/redis-parser@3.0.0/node_modules/redis-parser/lib/parser.js:302:14)

A Redis call site in the boot path is missing a .catch() handler. Today this didn't crash production because the failing task was already being killed by ECS health check. Under a live-connection-drop event on a _running_ task (Upstash console password reset), this rejection might fire and crash with --unhandled-rejections=strict.

2. Job-queue error spam during reconnect loop

11:57:49 [redis] connection closed
11:57:49 [job-queue] Queue error "cache-cleanup": ReplyError: WRONGPASS
11:57:50 [redis] connected
11:57:50 [redis] error: WRONGPASS
... (~1 cycle/sec, indefinitely) ...

cache-cleanup BullMQ-style worker logs an error per reconnect attempt. In a real outage at 1-second cadence × N workers, this would spike CW Logs ingestion. Need to debounce.

3. Boot fails closed (good for safety, bad for resilience)

The boot path requires working Redis. New tasks with WRONGPASS-returning Redis fail ALB health check and never enter rotation. ECS holds the old task in service indefinitely (or until rollback). This is safer than fail-open-with-wrong-defaults but means a true Upstash outage would block all deploys until Redis is restored. Policy decision: do we want a degraded-boot mode, or is "stuck at old deploy" the correct posture?

4. Debuggability gap

The ECS service event stream shows tasks starting + draining, but doesn't surface "task failed because Redis WRONGPASS." That information lives in the task's CloudWatch log stream — and the failed task's log stream has a different ARN per task, so an operator has to find the right ARN before they can read the failure reason.

Action items

[ ] Fix the unhandled rejection — needs a .catch() somewhere in the Redis bootstrap or job-queue init path. Stack frame above the redis-parser frame would identify it; CW Logs has 1d retention so requires ECS Exec or a fresh repro.
[ ] Debounce [job-queue] cache-cleanup reconnect errors — log first occurrence + count over a 30s window.
[ ] Decide boot-time Redis policy — fail-closed (today) vs degraded-boot. Per feedback_redis_failure_modes.md, today's behavior is the safer default but blocks recovery during real Redis outage.
[ ] Run drill 02 Variant B — Upstash console "Reset Password" to test live-connection-drop mode. Requires Justin at the console for the click + password reveal.

Lockstep checklist

[x] Drill artifact — docs/chaos-drills/results/02-redis-pull-2026-05-08T11-48-21Z.md
[x] Trickle log — docs/chaos-drills/results/02-trickle-2026-05-08T11-48-21Z.txt (292 lines)
[x] Ship log — this file
[ ] Production fix — deferred to a follow-up PR (the unhandled rejection + log spam)
[x] R19 risk register — A10's "drill suite theatre" flag retired for drill 02
[ ] Update ops-runbooks.md — the validated recovery procedure (rotate → force-deploy) belongs in the operational docs