2026-05-08-drill-03-provider-key-rotation

2026-05-08 — Drill 03 executed: Provider key rotation under traffic

Summary

Ran chaos drill 03 (deepseek API key rotation during 2 RPS burst) against production. Captured one architectural finding (tenant-key insulation shields the trickle path entirely), one timing measurement (sentinel detects auth_error within 86 seconds of T0), and one operational lesson (cache hit rate masks failures from customer-side 5xx alerting).

Headline numbers

Chaos window: 7m07s (T0=12:25:38Z → T_restore=12:32:45Z)
Trickle: 235 requests, 100% 200 OK throughout chaos. Cache hit rate >99% on identical "OK" prompts; per-tenant deepseek key insulated the test tenant from system-Secret rotation entirely.
Sentinel detection: 86 seconds from T0 to first [router/endpoint-sentinel] Provider deepseek: auth_error (Auth error: 401) log line. The 86s is dominated by ECS rolling-deploy task replacement; first probe after a new task boots takes <1s to surface the failure.
Recovery: ~4 minutes from T_restore to /v1/ops/status showing deepseek: health=healthy again.
Customer impact: zero (cache + tenant-key isolation), with one caveat in findings.

Real findings

1. Tenant-key insulation is a feature, not a bug — but it changes the drill semantics

The trickle ran 235 successful requests because the BR API key under test (br_live_21f6...) belongs to a tenant with its own tenant_provider_keys.deepseek row. Tenant-scoped requests use the tenant's key, not the system fallback Secret. The bogus AWS Secret only affected:

[router/endpoint-sentinel] probes (which use the system key)
Untenanted internal calls
Tenants with no tenant_provider_keys row

To genuinely measure system-key auth recovery against tenanted traffic, the drill needs to either rotate the tenant's row, use a tenant without provider keys, or do the rotation at the provider's own dashboard.

2. /v1/ops/status correctly identified the outage

By 12:30:09 (~5 min post-T0), /v1/ops/status showed deepseek: health=down, total: down=1. Other 6 providers stayed healthy. After restoration, within ~4 min of T_restore, all 7 providers showed healthy. This is durable signal for monitoring.

3. Cache shadows the failure from customer-facing alerting

Internal logs show [router/semantic-cache] Persistent cache hit: similarity=1.0000 on every trickle request. Cost was $0.00. The customer side of the API saw zero degradation. Operational alerting must include sentinel/probe/internal-error rates, not just customer-impact metrics, because the cache layer absorbs failures that would otherwise be visible.

This pairs with feedback_redis_failure_modes.md's broader observation: cache + multi-layer fallback can make failures invisible at the application level even while the underlying providers are degraded.

4. ECS health check does NOT validate provider keys at boot

Both new tasks with the bogus key reached RUNNING/HEALTHY within ~90s. Provider keys are validated by sentinel (post-boot, every 60s) and by actual call path (per-request), but NOT during ECS health check. Right tradeoff (avoiding deploy bottleneck on every provider auth), but means deploys can succeed with broken provider configuration silently.

5. Cache-bypass produces the expected error structure

Forced upstream call at 12:28:01 (UUID-bearing prompt that bypassed cache):

{
  "error": {
    "message": "Model \"deepseek/deepseek-chat\" is temporarily unavailable",
    "type": "model_unavailable",
    "code": "model_unavailable",
    "recovery": {
      "action": "switch_model",
      "code": "model_unavailable",
      "endpoint": "GET /v1/catalog/runnable",
      "message": "Requested model is unavailable. Check runnable catalog for alternatives."
    }
  }
}

Structured 503 with a recovery hint. This is good UX — the client gets a machine-readable directive for fallback, not just a 401.

Action items

[ ] Update drill 03 README/script — clarify tenant-key vs system-key shadowing; recommend a tenanted-key variant for measuring real customer-impact recovery
[ ] Verify the br-providers-auth-error CloudWatch alarm fired during the 12:27:04 → 12:36:30 window (per memory, this alarm exists; verify it triggered)
[ ] Add key-provenance to M2M-INTEL log lines — each completion log should indicate which key path was used (tenant vs system) so operators can distinguish during incidents
[ ] Cache-hit-rate-during-degradation flag — surface elevated cache hit rate during provider degradation as a leading "incident-being-masked" indicator
[ ] Document the structured 503 contract — model_unavailable with recovery.action: "switch_model" is a publicly observable behavior; should be in the API reference

Lockstep checklist

[x] Drill artifact — docs/chaos-drills/results/03-key-rotate-burst-2026-05-08T12-24-30Z.md
[x] Trickle log — docs/chaos-drills/results/03-trickle-2026-05-08T12-24-30Z.txt (235 lines)
[x] Ship log — this file
[x] R19 risk register — A10's "drill suite theatre" flag retired for drill 03 too
[ ] Production fix — none required; all findings are observability/documentation improvements
[ ] Drill 02 Variant B — genuinely blocked on Upstash Console access (no automated equivalent; Upstash management API doesn't expose password reset)