2026-05-08-drill-04-task-stop-mid-stream
2026-05-08 — Drill 04 executed: ECS task stop mid-SSE-stream
Summary
The R20 risk register flagged 1/10 (A10 Chaos Monkey) as the top-priority untested failure mode: "kill a task mid-request — what happens?" Specifically: what happens to streaming completions when the serving ECS task is removed.
Drill executed against production. Found that ECS Fargate's cooperative graceful-drain mechanism keeps in-flight SSE streams alive until they complete naturally; clients see no disruption.
Headline numbers
- Stream duration: 48.34s
- Time-to-first-token: 2.50s
- SSE events received: 47, clean `[DONE]`
- Bytes received: 12,755
- Customer impact: zero (stream completed normally despite mid-stream task stop)
- Task replacement: 18s to launch, 75s to register with ALB
Method
- Started a long streaming completion (`stream: true, max_tokens: 2000`) targeting `model: "auto"`.
- 5 seconds in (past TTFT), issued `aws ecs stop-task` against one of the two ECS tasks.
- Observed: client-side SSE events, ECS service event log, target task lifecycle.
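The client-side observation in the last step can be sketched as a small SSE tally. This is a minimal illustration, not the drill's actual harness: `StreamStats` and `consume_sse` are hypothetical names, and counting only `data:` events other than the `[DONE]` sentinel is an assumption about how the headline event count was taken.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamStats:
    events: int = 0                  # data events, excluding the [DONE] sentinel
    bytes_received: int = 0          # raw bytes across all lines read
    ttft_s: Optional[float] = None   # seconds until the first data event
    duration_s: float = 0.0
    clean_done: bool = False         # True only if the [DONE] sentinel arrived


def consume_sse(lines: Iterable[bytes],
                clock: Callable[[], float] = time.monotonic) -> StreamStats:
    """Tally an SSE line stream until [DONE] or until the connection ends."""
    stats = StreamStats()
    start = clock()
    for raw in lines:
        stats.bytes_received += len(raw)
        line = raw.strip()
        if not line.startswith(b"data:"):
            continue  # blank separators, comments, other SSE fields
        payload = line[len(b"data:"):].strip()
        if payload == b"[DONE]":
            stats.clean_done = True
            break
        if stats.ttft_s is None:
            stats.ttft_s = clock() - start
        stats.events += 1
    stats.duration_s = clock() - start
    return stats
```

Any iterable of byte lines works as input, for example an HTTP client's streaming line iterator. A stream cut off by a hard kill would end without the sentinel and leave `clean_done` as `False`, which is the signal that separates a clean finish from a truncated one.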
Real findings
1. Cooperative drain protects in-flight streams
`stop-task` triggers SIGTERM and ALB target deregistration but does not kill the container immediately. The draining task keeps serving every TCP connection it already holds until that connection closes naturally (within the 30s Fargate grace window). The application's SIGTERM handler completes in-flight work before exiting.
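A handler with this behavior could look roughly like the sketch below. It is illustrative only and is not the application's actual handler: `DrainState`, `install_handler`, and the method names are hypothetical. The idea is that SIGTERM only flips a flag that stops new streams from being admitted, while streams already open are left to finish.

```python
import signal
import threading


class DrainState:
    """Tracks in-flight streams and whether shutdown has been requested."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._idle = threading.Event()
        self._idle.set()       # no streams yet, so the task counts as idle
        self.draining = False  # plain flag: safe to set from a signal handler

    def try_begin(self) -> bool:
        """Admit a new stream unless draining has started."""
        with self._lock:
            if self.draining:
                return False
            self._in_flight += 1
            self._idle.clear()
            return True

    def end(self) -> None:
        """Mark one stream as finished."""
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def on_sigterm(self, signum=None, frame=None) -> None:
        """Signal handler: stop admitting work, leave open streams alone."""
        self.draining = True

    def wait_drained(self, timeout_s: float) -> bool:
        """Block until in-flight streams finish; False if the timeout hit."""
        return self._idle.wait(timeout_s)


def install_handler(state: DrainState) -> None:
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, state.on_sigterm)
```

In a design like this, the main loop would call `wait_drained` with a timeout shorter than the grace window before exiting, so the process leaves on its own rather than being force-killed.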
ECS event timeline confirms:
- 15:56:41 (T_kill): stop-task issued
- 15:56:58 (T+17s): "begun draining connections on 1 tasks" — ALB deregister
- 15:56:59 (T+18s): replacement task launched
- 15:57:24 (T+43s): client receives `[DONE]`, stream completed normally on the drained task
- 15:57:56 (T+1m15s): replacement task registered healthy with ALB
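The T+ offsets above can be rechecked from the raw timestamps with a few lines of Python. The timestamps are copied from the timeline; the labels are shortened and `offsets_from_kill` is a hypothetical helper name.

```python
from datetime import datetime

# Timestamps copied from the ECS event timeline in this log.
TIMELINE = [
    ("15:56:41", "stop-task issued"),
    ("15:56:58", "ALB begins draining"),
    ("15:56:59", "replacement task launched"),
    ("15:57:24", "client receives [DONE]"),
    ("15:57:56", "replacement registered healthy"),
]


def offsets_from_kill(timeline):
    """Return (seconds since the first entry, label) for each event."""
    fmt = "%H:%M:%S"
    t0 = datetime.strptime(timeline[0][0], fmt)
    return [
        (int((datetime.strptime(ts, fmt) - t0).total_seconds()), label)
        for ts, label in timeline
    ]


for secs, label in offsets_from_kill(TIMELINE):
    print(f"T+{secs}s  {label}")
```

The offsets come out as 0, 17, 18, 43, and 75 seconds, matching the timeline. The gap between drain start and `[DONE]` is 43 - 17 = 26 seconds.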
2. The hard-kill case remains untested
This drill exercises the COOPERATIVE termination path (the path used by deploys, scale-in, and routine task replacement). A real hardware/AZ fault would bypass SIGTERM and break TCP connections at the kernel. Testing that case requires AWS FIS (Fault Injection Simulator) — flagged as drill 04-followup.
3. The drill suite now covers drills 01 through 04
Cumulative findings from drills 01-04:
- 01 (task drain): rolling deploy with streaming load showed 0/180 clean finishes → application-level streaming path needed work (closed in subsequent PRs).
- 02 (Redis WRONGPASS at boot): found unhandled rejection (PR #221 closed).
- 03 (provider key rotation): found tenant-key insulation + cache-shadowing of failures.
- 04 (task stop mid-stream): graceful drain works, customer impact zero.
Action items
- [ ] Drill 04 followup: AWS FIS hard-kill scenario. Bounded production impact (one task at a time, ECS auto-recovers). Estimated 30 min setup.
- [ ] Drill 05: RDS connection storm (next on the chaos suite roadmap).
- [ ] Drill 06: multi-AZ failover (both tasks killed in one AZ).
Lockstep checklist
- [x] Drill artifact: `docs/chaos-drills/results/04-task-sigkill-mid-stream-2026-05-08T15-56-35Z.md`
- [x] Stream log: `docs/chaos-drills/results/04-stream-2026-05-08T15-56-35Z.txt`
- [x] Ship log: this file
- [x] R20 risk register: A10's flagship failure-mode risk (1/10) documented
- [x] No production fix needed: graceful drain works as designed