2026-05-08-drill-04-task-stop-mid-stream

2026-05-08 — Drill 04 executed: ECS task stop mid-SSE-stream

Summary

The R20 risk register flagged this scenario, 1/10 (A10 Chaos Monkey), as the top-priority untested failure mode: "kill a task mid-request — what happens?" Specifically, what happens to streaming completions when the serving ECS task is removed.

Drill executed against production. ECS Fargate's cooperative graceful-drain mechanism kept the in-flight SSE stream alive until it completed naturally; the client saw no disruption.

Headline numbers

  • Stream duration: 48.34s
  • Time-to-first-token: 2.50s
  • SSE events received: 47, clean [DONE]
  • Bytes received: 12,755
  • Customer impact: zero (stream completed normally despite mid-stream task stop)
  • Task replacement: 18s to launch, 75s to register with ALB

Method

  1. Started a long streaming completion (stream: true, max_tokens: 2000) targeting model: "auto".
  2. 5 seconds in (past TTFT), issued aws ecs stop-task against one of the two ECS tasks.
  3. Observed: client-side SSE events, ECS service event log, target task lifecycle.
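
The client-side observation in step 3 can be sketched as follows. This is a minimal illustration, not the drill's actual harness: `summarize_sse` and `StreamSummary` are hypothetical names, the input is assumed to be an iterator of raw byte lines from the HTTP response, and counting events without the [DONE] sentinel is an assumption about how the headline numbers were tallied.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamSummary:
    events: int                # data events, excluding the [DONE] sentinel (assumed convention)
    bytes_received: int        # raw bytes consumed up to and including [DONE]
    ttft_s: Optional[float]    # seconds from start to the first data event
    duration_s: float          # seconds from start to end of stream
    clean_done: bool           # True only if the [DONE] sentinel was seen


def summarize_sse(lines: Iterable[bytes],
                  clock: Callable[[], float] = time.monotonic) -> StreamSummary:
    """Consume SSE lines and record the metrics a drill report needs."""
    start = clock()
    events = 0
    total_bytes = 0
    ttft = None
    clean_done = False
    for raw in lines:
        total_bytes += len(raw)
        line = raw.strip()
        if not line.startswith(b"data:"):
            continue  # blank separators, comments, and other SSE fields
        payload = line[len(b"data:"):].strip()
        if payload == b"[DONE]":
            clean_done = True
            break
        events += 1
        if ttft is None:
            ttft = clock() - start
    return StreamSummary(events, total_bytes, ttft, clock() - start, clean_done)
```

A stream that ends without the sentinel returns `clean_done=False`, which is the signal a hard-kill drill would look for.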

Real findings

1. Cooperative drain protects in-flight streams

stop-task triggers SIGTERM and ALB target deregistration but does not kill the container immediately. The draining task continues serving any TCP connection it was already handling until the connection closes naturally, provided that happens within the 30s Fargate grace window. The application's SIGTERM handler completes in-flight work before exiting.
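
The drain behaviour described above can be sketched as follows. The service's real handler is not shown in this log, so this is only an illustration of the pattern: `DrainTracker` and `install_sigterm_handler` are hypothetical names, and the language is an assumption. On SIGTERM the process refuses new streams, lets in-flight ones finish, and exits once idle or once a deadline passes.

```python
import signal
import threading


class DrainTracker:
    """Tracks in-flight streams and refuses new ones once draining starts."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._idle = threading.Event()
        self._idle.set()          # no streams in flight at startup
        self._in_flight = 0
        self._draining = False

    def try_begin(self) -> bool:
        """Register a new stream; returns False if the task is draining."""
        with self._lock:
            if self._draining:
                return False
            self._in_flight += 1
            self._idle.clear()
            return True

    def end(self) -> None:
        """Mark one stream as finished."""
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def start_draining(self, signum=None, frame=None) -> None:
        # Plain flag write with no lock: a signal handler runs on the main
        # thread and must not block on a lock that thread may already hold.
        self._draining = True

    def wait_drained(self, timeout_s: float) -> bool:
        """Block until no streams remain; False if the deadline passed first."""
        return self._idle.wait(timeout_s)


def install_sigterm_handler(tracker: DrainTracker) -> None:
    # Must be called from the main thread (a Python signal-module requirement).
    signal.signal(signal.SIGTERM, tracker.start_draining)
```

In this sketch the main loop would call `wait_drained` with a deadline shorter than the platform's grace window and then exit, so a stream that outlives the window is cut by the platform rather than by the application.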

ECS event timeline confirms:

  • 15:56:41 (T_kill): stop-task issued
  • 15:56:58 (T+17s): "begun draining connections on 1 tasks" — ALB deregister
  • 15:56:59 (T+18s): replacement task launched
  • 15:57:24 (T+43s): client receives [DONE] — stream completed normally on the drained task
  • 15:57:56 (T+75s): replacement task registered healthy with ALB
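
The offsets in this timeline can be rechecked from the raw timestamps. The short sketch below recomputes them and cross-checks the stream end against the headline duration, using the 5-second stop time from the Method section. The event labels are paraphrased and the function name is hypothetical.

```python
from datetime import datetime

# Timestamps copied from the ECS event timeline above.
EVENTS = [
    ("15:56:41", "stop-task issued"),
    ("15:56:58", "ALB begins draining connections"),
    ("15:56:59", "replacement task launched"),
    ("15:57:24", "client receives [DONE]"),
    ("15:57:56", "replacement task registered healthy"),
]


def offsets_from_first(events):
    """Seconds elapsed between the first event and each event."""
    t0 = datetime.strptime(events[0][0], "%H:%M:%S")
    return [
        int((datetime.strptime(ts, "%H:%M:%S") - t0).total_seconds())
        for ts, _ in events
    ]


offsets = offsets_from_first(EVENTS)
print(offsets)  # [0, 17, 18, 43, 75]

# stop-task was issued about 5s into the stream (Method, step 2), so the
# stream should end near 5 + 43 = 48s, in line with the measured 48.34s.
STOP_ISSUED_AT_S = 5
stream_end_estimate_s = STOP_ISSUED_AT_S + offsets[3]
print(stream_end_estimate_s)  # 48
```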

2. The hard-kill case remains untested

This drill exercises the COOPERATIVE termination path (the path used by deploys, scale-in, and routine task replacement). A real hardware/AZ fault would bypass SIGTERM and break TCP connections at the kernel. Testing that case requires AWS FIS (Fault Injection Service, formerly Fault Injection Simulator); flagged as drill 04-followup.

3. The drill suite now covers drills 01-04

Cumulative findings from drills 01-04:

  • 01 (task drain): rolling deploy with streaming load showed 0/180 clean finishes → application-level streaming path needed work (closed in subsequent PRs).
  • 02 (Redis WRONGPASS at boot): found unhandled rejection (PR #221 closed).
  • 03 (provider key rotation): found tenant-key insulation + cache-shadowing of failures.
  • 04 (task stop mid-stream): graceful drain works, customer impact zero.

Action items

  • [ ] Drill 04 followup: AWS FIS hard-kill scenario. Bounded production impact (one task at a time, ECS auto-recovers). Estimated 30 min setup.
  • [ ] Drill 05: RDS connection storm (next on the chaos suite roadmap).
  • [ ] Drill 06: multi-AZ failover (both tasks killed in one AZ).

Lockstep checklist

  • [x] Drill artifact: docs/chaos-drills/results/04-task-sigkill-mid-stream-2026-05-08T15-56-35Z.md
  • [x] Stream log: docs/chaos-drills/results/04-stream-2026-05-08T15-56-35Z.txt
  • [x] Ship log: this file
  • [x] R20 risk register — A10's flagship failure-mode risk (1/10) documented
  • [x] No production fix needed — graceful drain works as designed