Redis Failure-Mode Coverage

The dual-mode invariant: every Redis call site must handle both "Redis unavailable" and "Redis throws" without dropping the request.

The two failure modes

Code that uses Redis typically sees two different failure modes, and both must be handled or the service degrades to 5xx:

  1. getRedis() === undefined — the Redis client itself isn't configured. Happens in dev environments without REDIS_URL, or during a degraded boot when the client failed to connect at startup.
  2. Awaited Redis operation throws — the client exists, but a specific operation throws: network timeout, connection reset, server error, command syntax error.

A naive if (redis) { await redis.X(); } block handles mode 1 (the if skips when undefined) but NOT mode 2 (the await propagates the throw to the middleware chain, surfacing as 500).
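
The gap can be reproduced in isolation. The sketch below is illustrative only: RedisLike, naiveCount, and throwingRedis are hypothetical names standing in for the real client and middleware code, which this note does not show.

```typescript
// Hypothetical minimal client shape; the real client type is not shown here.
interface RedisLike {
  incr(key: string): Promise<number>;
}

// Naive guard: skips when the client is missing (mode 1), but lets a
// rejected operation propagate to the caller (mode 2).
async function naiveCount(
  redis: RedisLike | undefined,
  key: string,
): Promise<number | "skipped"> {
  if (redis) {
    return await redis.incr(key);
  }
  return "skipped";
}

// Stand-in for a client whose operations fail, e.g. on a network timeout.
const throwingRedis: RedisLike = {
  incr: async () => {
    throw new Error("ETIMEDOUT");
  },
};

async function demo(): Promise<string[]> {
  const outcomes: string[] = [];
  // Mode 1: no client configured. The guard handles it.
  outcomes.push(`mode 1: ${await naiveCount(undefined, "k")}`);
  // Mode 2: the client exists but the operation rejects. The guard does not help.
  try {
    await naiveCount(throwingRedis, "k");
    outcomes.push("mode 2: ok");
  } catch (err) {
    outcomes.push(`mode 2: threw ${err instanceof Error ? err.message : String(err)}`);
  }
  return outcomes;
}
```

In a middleware chain, the rejection caught inside demo() is the one that would otherwise reach the framework's error handler and surface as a 500.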

How the bug class manifested

Pre-R18, four middleware sites had the dual-mode gap:

Site                                    Pre-fix behavior on Redis-op throw
src/api/middleware/auth.ts              500 to caller (fixed in PR #176)
src/api/middleware/rate-limit.ts        500 to caller (the pipeline run was unwrapped)
src/api/middleware/community-tier.ts    500 to caller (every Redis op in the if-block was a raw await)
src/api/middleware/guardrails.ts        500 to caller (the prompt-rewrite rate-limit Redis block had no try/catch)

Memory entry feedback_redis_failure_modes.md had flagged this exact pattern: PR #176 fixed only auth; the other three were vulnerable.

PR #205 closed all three remaining sites in one class-of-bugs PR. The shape of every fix:

if (redis) {
  try {
    await redis.X();
    await redis.Y();
    // use results
  } catch (err) {
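    // err is typed unknown in strict TypeScript, so narrow it before reading .message
    log.warn(`redis op failed, falling back to in-memory: ${err instanceof Error ? err.message : String(err)}`);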
    // fall through to existing in-memory code path
  }
}
// in-memory fallback (unchanged)
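
If the same shape recurs at many call sites, it could be factored into a small helper. The following is a minimal sketch, not the code from PR #205: withRedisFallback and the Logger type are hypothetical names, and the client type is left generic because the real client is not shown here.

```typescript
// Hypothetical logger shape; any logger with a warn(string) method fits.
type Logger = { warn(msg: string): void };

// Runs op against the client when one exists. On a missing client (mode 1)
// or a rejected operation (mode 2), returns the fallback's result instead.
async function withRedisFallback<C extends object, T>(
  redis: C | null | undefined,
  op: (client: C) => Promise<T>,
  fallback: () => T | Promise<T>,
  log: Logger,
): Promise<T> {
  if (redis) {
    try {
      return await op(redis);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      log.warn(`redis op failed, falling back to in-memory: ${reason}`);
      // fall through to the fallback below
    }
  }
  return fallback();
}
```

A helper like this makes the fallback a required argument, so a call site cannot compile without stating what happens when Redis is unavailable. It does not choose the fail-mode contract; that remains the caller's decision.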

Fail-open vs fail-closed semantics

Each site has its own fail-mode contract that the wrap must preserve:

  • auth.ts: on Redis failure, fall through to a DB lookup. Still authenticated correctly, just slower. Fail-open is wrong here — that would skip auth.
  • rate-limit.ts: on Redis failure, fall through to checkMemoryRateLimit(key, limitRpm). Per-task in-memory limiter is less accurate cross-fleet but still rate-limits.
  • community-tier.ts: on Redis failure, fall through to the in-memory counter path that already handled the missing-client case (mode 1).
  • guardrails.ts: on Redis failure, log a warning and leave rateLimited=false. The DB insert still runs, so the prompt-rewrite is still recorded — just without the rate-limit gate.
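
As an illustration of the rate-limit.ts contract, here is a sketch of a limiter that falls back to a per-process fixed window. Only the name checkMemoryRateLimit(key, limitRpm) comes from the text above. Its body, the boolean return value, the optional nowMs parameter (added so the window can be tested), the RedisCounter interface, and isAllowed are all assumptions. The real middleware runs a pipeline, which this sketch simplifies to INCR plus EXPIRE.

```typescript
// Hypothetical minimal client shape for a counter-based limiter.
interface RedisCounter {
  incr(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<number>;
}

const WINDOW_MS = 60_000;
const windows = new Map<string, { windowStart: number; count: number }>();

// Assumed implementation: a fixed one-minute window per key, local to this process.
function checkMemoryRateLimit(key: string, limitRpm: number, nowMs: number = Date.now()): boolean {
  const entry = windows.get(key);
  if (!entry || nowMs - entry.windowStart >= WINDOW_MS) {
    // First request in a new window.
    windows.set(key, { windowStart: nowMs, count: 1 });
    return limitRpm >= 1;
  }
  entry.count += 1;
  return entry.count <= limitRpm;
}

// Returns true when the request is allowed.
async function isAllowed(
  redis: RedisCounter | undefined,
  key: string,
  limitRpm: number,
  nowMs: number = Date.now(),
): Promise<boolean> {
  if (redis) {
    try {
      const count = await redis.incr(key);
      if (count === 1) {
        // First hit in this window: start the one-minute expiry.
        await redis.expire(key, WINDOW_MS / 1000);
      }
      return count <= limitRpm;
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      console.warn(`redis op failed, falling back to in-memory: ${reason}`);
    }
  }
  // Fail-open with degraded accuracy: still limits, but only per process.
  return checkMemoryRateLimit(key, limitRpm, nowMs);
}
```

The fallback still rejects the request that exceeds limitRpm within a window, so a Redis outage loosens the limit across the fleet (each process counts separately) without removing it.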

The general rule:

  • If the Redis call gates authentication or authorization → fail-closed: never skip the check on Redis failure. Fall back to an authoritative source (as auth.ts does with its DB lookup) or deny.
  • If the Redis call gates rate limits or counters → fail-open with degraded accuracy (the in-memory fallback)
  • If the Redis call gates observability or telemetry → fail-silent (log and continue)
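
For the fail-closed case, the key property is that a Redis failure never turns into an allow. A sketch under assumptions: SessionCache, DbLookup, resolveUserId, and the session: key prefix are hypothetical, and the real auth.ts lookup may differ.

```typescript
// Hypothetical shapes for the cache client and the authoritative lookup.
interface SessionCache {
  get(key: string): Promise<string | null>;
}
type DbLookup = (token: string) => Promise<string | null>;

// Returns the user id for a valid token, or null when the request must be denied.
async function resolveUserId(
  cache: SessionCache | undefined,
  lookupInDb: DbLookup,
  token: string,
): Promise<string | null> {
  if (cache) {
    try {
      const cached = await cache.get(`session:${token}`);
      if (cached !== null) {
        return cached;
      }
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      console.warn(`redis op failed, falling back to DB lookup: ${reason}`);
    }
  }
  // Cache miss, missing client, or Redis failure: the DB decides.
  // A null result here means deny; nothing on this path grants access by default.
  return lookupInDb(token);
}
```

Contrast this with a counter or telemetry call, where the catch block may simply continue: here the catch block only changes which source answers the question, never the answer.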

What this prevents

The blast radius of a Redis blip used to be:

  • Upstash 503 → every request to BR returns 5xx
  • Network blip → cascading retries hit DB pool, request queue backs up
  • Connection drop during deploy → fleet-wide failure during the reconnect window

After PR #205, the same Redis blip degrades the gateway to the in-memory fallback path (per-task, less accurate counters) but keeps it accepting traffic.

Reviewer rules going forward

For any new code that uses Redis:

  1. Check both modes. Guard against a missing client (redis is null or undefined) AND wrap every awaited Redis call in a try/catch.
  2. Specify the fail-mode contract explicitly. Comment what happens on each failure path.
  3. Document the fallback. If there's an in-memory fallback, point to it.
  4. Test both modes. Mock getRedis() returning undefined (no client) AND mock the operation throwing. Two regression tests, not one.
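
Rule 4 could look like the pair of tests below. This is a framework-free sketch: countRequest is a hypothetical function under test, getRedis is injected as a parameter rather than mocked at module level, and assertEqual stands in for whatever assertion library the project uses.

```typescript
interface RedisLike {
  incr(key: string): Promise<number>;
}

type CountResult = { count: number; source: "redis" | "memory" };

const memoryCounts = new Map<string, number>();

// Hypothetical function under test: counts in Redis, else in memory.
async function countRequest(
  getRedis: () => RedisLike | undefined,
  key: string,
): Promise<CountResult> {
  const redis = getRedis();
  if (redis) {
    try {
      return { count: await redis.incr(key), source: "redis" };
    } catch {
      // fall through to the in-memory path
    }
  }
  const next = (memoryCounts.get(key) ?? 0) + 1;
  memoryCounts.set(key, next);
  return { count: next, source: "memory" };
}

// Stand-in for a test framework assertion.
function assertEqual<T>(actual: T, expected: T, label: string): void {
  if (actual !== expected) {
    throw new Error(`${label}: expected ${String(expected)}, got ${String(actual)}`);
  }
}

// Regression test for mode 1: no client configured.
async function testRedisMissing(): Promise<void> {
  const result = await countRequest(() => undefined, "mode1");
  assertEqual(result.source, "memory", "mode 1 source");
  assertEqual(result.count, 1, "mode 1 count");
}

// Regression test for mode 2: the client exists but the operation rejects.
async function testRedisOpThrows(): Promise<void> {
  const throwing: RedisLike = {
    incr: async () => {
      throw new Error("ECONNRESET");
    },
  };
  const result = await countRequest(() => throwing, "mode2");
  assertEqual(result.source, "memory", "mode 2 source");
  assertEqual(result.count, 1, "mode 2 count");
}
```

Keeping the two tests separate matters: a site with only the if-guard passes testRedisMissing and fails testRedisOpThrows, which is exactly the gap described above.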

The pattern in src/api/middleware/budget.ts (reserveBudget) is the canonical reference — it predates the bug class and got it right.

Lockstep references

  • PR #205 — implementation across rate-limit, community-tier, guardrails
  • PR #176 — earlier fix for auth (predates this PR)
  • .quality/findings.jsonl IDs b2c3d4e5f6a7, c3d4e5f6a7b8, d4e5f6a7b8c9
  • Memory entry: feedback_redis_failure_modes.md (flagged the gap before scanner found it)
  • Ship log: docs/ship-log/2026-05-07-redis-fail-open-middleware-class.md