2026-05-08-load-test-r20
2026-05-08 — Load test R20 + mTLS strict cutover gate check
Summary
Closes two R20 risk register items:
- No load-test artifact for scale claims (3/10; A1, A4, A8): refreshed load test against production, results committed to docs/load-test/r20-2026-05-08T16-36-29Z.md.
- mTLS still advisory-only on mesh writes (3/10; A2, A6, A7): ran the runbook's pre-cutover gate query. Gate: GREEN.
Load test — R20 (10 RPS × 90s + 5 RPS × 60s)
Headline: production capacity ceiling is ~3-4 RPS sustained — unchanged from R17 baseline despite R18-R20 perf PRs (#209-#215). The per-request efficiency improvements are real but did not move the binding constraint, which remains DB connection pool exhaustion.
| Test | RPS | Duration | Total | Failure | p50 | p95 |
|---|---|---|---|---|---|---|
| 1 | 10 | 90s | 710 | 80.42% | 22,050ms | timeout |
| 2 | 5 | 60s | 263 | 25.10% | 12,742ms | timeout |
Test 2 ran while ECS was still recovering from Test 1, so its 25% failure rate reflects a degraded state. The steady-state failure rate at 5 RPS against 2/2 healthy tasks is expected to be 5-15%, based on R17 and the perf PRs.
Test 1 effect: ALB marked one of the 2 ECS tasks unhealthy mid-test; ECS auto-replaced it. No customer-facing 500 storm: the system fails loudly and recovers correctly.
Cost: <$0.01 in upstream provider charges (most requests timed out before reaching providers).
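As a cross-check on the headline ceiling, the successful throughput implied by the results table can be worked out directly. This is a minimal sketch that assumes "Total" counts completed requests and "Failure" is the failed share of that total; the function name is illustrative and not part of the repo.

```python
# Figures copied from the results table above.
TESTS = [
    {"name": "Test 1", "total": 710, "failure_pct": 80.42, "duration_s": 90},
    {"name": "Test 2", "total": 263, "failure_pct": 25.10, "duration_s": 60},
]


def successful_rps(total, failure_pct, duration_s):
    """Successful requests per second over the whole test window."""
    succeeded = total * (1 - failure_pct / 100)
    return succeeded / duration_s


for t in TESTS:
    rate = successful_rps(t["total"], t["failure_pct"], t["duration_s"])
    print(f'{t["name"]}: {rate:.2f} successful req/s')
# Test 1: 1.54 successful req/s
# Test 2: 3.28 successful req/s
```

Under these assumptions, Test 2 lands inside the ~3-4 RPS range quoted in the headline, while Test 1 shows successful throughput dropping below that range once the service is pushed well past it.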
Action items (open):
- Bump DATABASE_POOL_SIZE from 50 to 200+
- Configure ECS service auto-scale on CPU/memory target tracking
- Consolidate per-request DB hits in the hot path
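The first and third action items attack the same limit from two sides, which a back-of-envelope application of Little's law can illustrate. The sketch below assumes a single shared pool and one connection per in-flight request; neither was measured in this session, and whether DATABASE_POOL_SIZE applies per task or in total is not stated here. The helper name is hypothetical.

```python
def max_hold_seconds(pool_size: int, rps: float) -> float:
    """Longest mean connection hold time a pool can sustain at a given arrival rate.

    Little's law: mean connections in use = arrival rate * mean hold time,
    so the pool saturates once the hold time exceeds pool_size / rps.
    """
    return pool_size / rps


# Pool sizes from the action item; arrival rates from the two tests.
for pool in (50, 200):
    for rps in (5, 10):
        limit = max_hold_seconds(pool, rps)
        print(f"pool={pool:>3} rps={rps:>2} -> saturates above {limit:.1f}s per request")
```

Read this way, a larger pool raises the tolerable hold time, and consolidating DB hits lowers the actual hold time. Either change widens the margin, and real hold times would need to be measured before choosing a pool size.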
Full results: docs/load-test/r20-2026-05-08T16-36-29Z.md. Raw k6 output in docs/load-test/r20-*.txt.
mTLS strict cutover gate — GREEN
The runbook (docs/runbooks/mtls-strict-cutover.md, shipped in PR #218) gates strict-mode cutover on zero mtls-advisory bypass events over the past 24h. Ran the runbook's CW Logs query at 2026-05-08T16:38Z:
| Pattern | Past 24h matches |
|---|---|
| "mtls-advisory" | 0 |
| "requireMtls" | 0 |
| "clientCert" | 0 |
| "X-Amzn-Mtls" | 0 |
| "mtls" (broad) | 10 (all are [api/cold-start] timing instrumentation, not warnings) |
| /v1/agent/delegate | 0 |
| /v1/agent/sub-agents | 0 |
| /v1/agent/trajectories | 0 |
| /v1/agent/task-runs | 0 |
Interpretation: the runbook's pre-cutover invariant is satisfied. No mesh-write traffic at all in the past 24h — flipping gateway.tls.mtls.meshWritesEnforcement from advisory to strict would not affect any current caller.
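The reading of the table above can be written down as an explicit rule. This is a hypothetical sketch of the gate logic, not the runbook's actual query: it treats the gate as GREEN only when every specific pattern has zero matches and every broad "mtls" match has been accounted for as benign. The function and variable names are illustrative.

```python
# Match counts copied from the table above (past 24h).
STRICT_PATTERNS = {
    "mtls-advisory": 0,
    "requireMtls": 0,
    "clientCert": 0,
    "X-Amzn-Mtls": 0,
    "/v1/agent/delegate": 0,
    "/v1/agent/sub-agents": 0,
    "/v1/agent/trajectories": 0,
    "/v1/agent/task-runs": 0,
}
BROAD_MTLS_MATCHES = 10
BROAD_MTLS_BENIGN = 10  # all ten were [api/cold-start] timing lines


def gate_status(strict_counts, broad_total, broad_benign):
    """Return (status, offending patterns, unexplained broad matches)."""
    offenders = [p for p, n in strict_counts.items() if n > 0]
    unexplained = broad_total - broad_benign
    if offenders or unexplained > 0:
        return "RED", offenders, unexplained
    return "GREEN", offenders, unexplained


print(gate_status(STRICT_PATTERNS, BROAD_MTLS_MATCHES, BROAD_MTLS_BENIGN))
# ('GREEN', [], 0)
```

Keeping the broad pattern as a separate, manually triaged count means new log lines containing "mtls" would turn the gate red until someone has looked at them.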
Caveat: zero traffic does NOT mean zero callers — it means no callers exercised these routes today. Future callers that haven't enrolled in CAF would break under strict mode. Per runbook, the future-caller list to verify before cutover:
- Internal Brainstorm CLI (would need CAF enrollment automation)
- Test harnesses (pnpm e2e:local, k6 baselines; these are localhost, not affected)
- Sandbox agents in dev/staging environments
Recommended cutover procedure:
- Update gateway config:

      tls:
        mtls:
          meshWritesEnforcement: strict # was: advisory

- Force ECS redeploy (gateway.* is restart-class):

      aws ecs update-service --cluster brainstorm-router-production \
        --service brainstorm-router-production-fargate \
        --force-new-deployment --region us-east-1

- Verify with the runbook's curl probes (expect 403 without cert).
- Monitor CW Logs for 403 on /v1/agent/* for the first 30 minutes; roll back if any legitimate caller is affected.
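The monitoring step in the procedure above can be sketched as a filter over parsed log records. The record shape (ts, status, path, caller), the sample entries, and the cutover time are all hypothetical; the real CW Logs fields may differ and would need mapping into this form first.

```python
from datetime import datetime, timedelta, timezone

WATCH_WINDOW = timedelta(minutes=30)


def forbidden_agent_hits(records, cutover_at):
    """Return 403 responses on /v1/agent/* within the watch window after cutover."""
    deadline = cutover_at + WATCH_WINDOW
    return [
        r for r in records
        if r["status"] == 403
        and r["path"].startswith("/v1/agent/")
        and cutover_at <= r["ts"] <= deadline
    ]


# Hypothetical cutover time and sample records, for illustration only.
cutover = datetime(2026, 5, 8, 17, 0, tzinfo=timezone.utc)
sample = [
    {"ts": cutover + timedelta(minutes=5), "status": 403, "path": "/v1/agent/delegate", "caller": "probe"},
    {"ts": cutover + timedelta(minutes=12), "status": 200, "path": "/v1/other", "caller": "app"},
    {"ts": cutover + timedelta(minutes=45), "status": 403, "path": "/v1/agent/task-runs", "caller": "cli"},
]

hits = forbidden_agent_hits(sample, cutover)
print(len(hits), [h["path"] for h in hits])
# 1 ['/v1/agent/delegate']
```

Each hit still needs a human decision: a 403 from a cert-less probe is the expected result, whereas a 403 from an enrolled or legitimate caller is the rollback trigger.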
This PR does NOT flip the flag. The gate is green; the cutover is your call. Recommended next step: flip it during a low-traffic window (deliberate change to production security posture).
Files
- docs/load-test/r20-2026-05-08T16-36-29Z.md: full load-test analysis
- docs/load-test/r20-2026-05-08T16-36-29Z.txt: raw k6 output (Test 1: 10 RPS)
- docs/load-test/r20-5rps-2026-05-08T16-39-20Z.txt: raw k6 output (Test 2: 5 RPS)
- docs/ship-log/2026-05-08-load-test-r20.md: this file
R20 risk register impact
| Risk | Status |
|---|---|
| No load-test artifact for scale claims (3/10) | Closed. R20 artifact committed; ceiling is documented and quantified. |
| mTLS still advisory-only on mesh writes (3/10) | Gate-checked. Cutover is ready, pending user go-ahead. |
Production state at end of session: 2/2 tasks RUNNING/HEALTHY, redis=true, db=true, 7/7 providers healthy.