
2026-05-08 — Load test R20 + mTLS strict cutover gate check

Summary

Closes two R20 risk register items:

No load-test artifact for scale claims (3/10 — A1, A4, A8): refreshed load test against production, results committed to docs/load-test/r20-2026-05-08T16-36-29Z.md.
mTLS still advisory-only on mesh writes (3/10 — A2, A6, A7): ran the runbook's pre-cutover gate query. Gate: GREEN.

Load test — R20 (10 RPS × 90s + 5 RPS × 60s)

Headline: production capacity ceiling is ~3-4 RPS sustained — unchanged from R17 baseline despite R18-R20 perf PRs (#209-#215). The per-request efficiency improvements are real but did not move the binding constraint, which remains DB connection pool exhaustion.

Test	Target RPS	Duration	Total requests	Failure rate	p50	p95
1	10	90s	710	80.42%	22,050ms	timeout
2	5	60s	263	25.10%	12,742ms	timeout

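Test 2 ran while ECS was still recovering from Test 1, so its 25% failure rate reflects a degraded state, not steady state. The steady-state failure rate at 5 RPS against 2/2 healthy tasks is expected to be 5-15%, extrapolating from the R17 baseline plus the perf PRs; this run did not measure it.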
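As a quick sanity check on the headline, the successful throughput implied by the table can be recomputed from the total request count, failure rate, and duration. This is a minimal sketch that assumes each test ran for exactly its nominal duration; the full analysis file may derive the ceiling differently.

```python
# Figures copied from the results table above.
TESTS = [
    {"name": "Test 1", "target_rps": 10, "duration_s": 90, "total": 710, "failure_rate": 0.8042},
    {"name": "Test 2", "target_rps": 5, "duration_s": 60, "total": 263, "failure_rate": 0.2510},
]


def successful_rps(total: int, failure_rate: float, duration_s: int) -> float:
    """Successful requests per second over the nominal test window."""
    return total * (1.0 - failure_rate) / duration_s


for t in TESTS:
    ok_rps = successful_rps(t["total"], t["failure_rate"], t["duration_s"])
    print(f'{t["name"]}: {ok_rps:.2f} successful RPS at {t["target_rps"]} RPS offered')
```

With the table's figures this works out to about 1.54 successful RPS for Test 1 and 3.28 for Test 2. The Test 2 figure falls within the ~3-4 RPS range quoted in the headline, while Test 1 shows throughput dropping further under heavier overload.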

Test 1 effect: the ALB marked one of the 2 ECS tasks unhealthy mid-test, and ECS replaced it automatically. There was no customer-facing 500 storm: the system fails loudly and recovers correctly.

Cost: <$0.01 in upstream provider charges (most requests timed out before reaching providers).

Action items (open):

Bump DATABASE_POOL_SIZE from 50 to 200+
Configure ECS service auto-scale on CPU/memory target tracking
Consolidate per-request DB hits in the hot path
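The pool-size item can be motivated with a back-of-envelope estimate from Little's law (mean requests in flight = arrival rate × mean time in system). The sketch below uses p50 latency as a stand-in for the mean, which this log does not report, and assumes every in-flight request holds one DB connection for its whole lifetime. Both are simplifying assumptions, and the log does not say whether the pool of 50 is per task or shared.

```python
def in_flight_estimate(arrival_rps: float, latency_s: float) -> float:
    """Little's law: mean requests in flight = arrival rate * mean time in system."""
    return arrival_rps * latency_s


POOL_SIZE = 50  # current DATABASE_POOL_SIZE per the action item above

# (test name, offered RPS, p50 latency in ms) copied from the results table.
OBSERVED = [("Test 1", 10, 22_050), ("Test 2", 5, 12_742)]

for name, rps, p50_ms in OBSERVED:
    # p50 is used in place of the mean latency (assumption).
    in_flight = in_flight_estimate(rps, p50_ms / 1000)
    over = "exceeds" if in_flight > POOL_SIZE else "fits within"
    print(f"{name}: ~{in_flight:.0f} requests in flight, which {over} a pool of {POOL_SIZE}")
```

Under these assumptions the estimates are roughly 220 and 64 requests in flight, both above 50. The observed latencies are themselves inflated by queueing for connections, so this shows only that demand outran the pool during the tests, not what the right pool size is.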

Full results: docs/load-test/r20-2026-05-08T16-36-29Z.md. Raw k6 output in docs/load-test/r20-*.txt.

mTLS strict cutover gate — GREEN

The runbook (docs/runbooks/mtls-strict-cutover.md, shipped in PR #218) gates strict-mode cutover on zero mtls-advisory bypass events over the past 24h. Ran the runbook's CW Logs query at 2026-05-08T16:38Z:

Pattern	Past 24h matches
`"mtls-advisory"`	0
`"requireMtls"`	0
`"clientCert"`	0
`"X-Amzn-Mtls"`	0
`"mtls"` (broad)	10 (all are `[api/cold-start]` timing instrumentation, not warnings)
`/v1/agent/delegate`	0
`/v1/agent/sub-agents`	0
`/v1/agent/trajectories`	0
`/v1/agent/task-runs`	0

Interpretation: the runbook's pre-cutover invariant is satisfied. No mesh-write traffic at all in the past 24h — flipping gateway.tls.mtls.meshWritesEnforcement from advisory to strict would not affect any current caller.
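The gate decision above can be written down as a small check over the match counts. This is an illustrative sketch, not the runbook's own tooling: the function name and the rule "every listed pattern must have zero matches" are assumptions. The broad `"mtls"` pattern is left out because its 10 hits were triaged by hand as cold-start timing lines.

```python
# Match counts copied from the table above (past 24h).
# The broad "mtls" pattern is excluded: its 10 hits were manually triaged as benign.
MATCHES = {
    '"mtls-advisory"': 0,
    '"requireMtls"': 0,
    '"clientCert"': 0,
    '"X-Amzn-Mtls"': 0,
    "/v1/agent/delegate": 0,
    "/v1/agent/sub-agents": 0,
    "/v1/agent/trajectories": 0,
    "/v1/agent/task-runs": 0,
}


def gate_status(matches: dict[str, int]) -> str:
    """Return GREEN only if every gating pattern had zero matches (assumed rule)."""
    return "GREEN" if all(count == 0 for count in matches.values()) else "RED"


print(gate_status(MATCHES))
```

A single non-zero count on any gating pattern flips the result to RED, which matches the intent of blocking the cutover until any bypass event has been explained.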

Caveat: zero traffic does NOT mean zero callers — it means no callers exercised these routes today. Future callers that haven't enrolled in CAF would break under strict mode. Per runbook, the future-caller list to verify before cutover:

Internal Brainstorm CLI (would need CAF enrollment automation)
Test harnesses (pnpm e2e:local, k6 baselines — these are localhost; not affected)
Sandbox agents in dev/staging environments

Recommended cutover procedure:

Update gateway config:

tls:
  mtls:
    meshWritesEnforcement: strict # was: advisory

Force an ECS redeploy (gateway.* settings are restart-class, so running tasks must be replaced to pick up the change):

aws ecs update-service --cluster brainstorm-router-production \
     --service brainstorm-router-production-fargate \
     --force-new-deployment --region us-east-1

Verify with the runbook's curl probes (expect 403 without cert).
Monitor CW Logs for 403 responses on /v1/agent/* for the first 30 minutes; roll back if any legitimate caller is affected.
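The verification step can be sketched as a certless probe against each mesh-write route, followed by a check that every response was a 403. This is not the runbook's curl probe set. The base URL is a placeholder, and the use of POST with an empty JSON body is an assumption about these routes. If strict mode rejects the connection during the TLS handshake instead of returning 403, `urlopen` raises `URLError`, which this sketch deliberately does not catch.

```python
import urllib.error
import urllib.request

# Routes copied from the gate table above.
MESH_WRITE_PATHS = [
    "/v1/agent/delegate",
    "/v1/agent/sub-agents",
    "/v1/agent/trajectories",
    "/v1/agent/task-runs",
]


def status_without_cert(base_url: str, path: str, timeout_s: float = 10.0) -> int:
    """POST with no client certificate and return the HTTP status code."""
    req = urllib.request.Request(
        base_url + path,
        data=b"{}",  # hypothetical empty JSON body
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as HTTPError; the status code is what we want.
        return err.code


def strict_mode_enforced(statuses: dict[str, int]) -> bool:
    """True only if at least one probe ran and every certless probe got a 403."""
    return bool(statuses) and all(code == 403 for code in statuses.values())


# Example usage (placeholder host, not executed here):
# statuses = {p: status_without_cert("https://gateway.example.invalid", p)
#             for p in MESH_WRITE_PATHS}
# print(strict_mode_enforced(statuses))
```

Any status other than 403 on a certless request, including a 200 or a 401, means the probe did not confirm strict enforcement and the redeploy should be checked before proceeding.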

This PR does NOT flip the flag. The gate is green; the cutover is your call. Recommended next step: flip it during a low-traffic window, since this is a deliberate change to the production security posture.

Files

docs/load-test/r20-2026-05-08T16-36-29Z.md — full load-test analysis
docs/load-test/r20-2026-05-08T16-36-29Z.txt — raw k6 output (Test 1: 10 RPS)
docs/load-test/r20-5rps-2026-05-08T16-39-20Z.txt — raw k6 output (Test 2: 5 RPS)
docs/ship-log/2026-05-08-load-test-r20.md — this file

R20 risk register impact

Risk	Status
No load-test artifact for scale claims (3/10)	Closed. R20 artifact committed; ceiling is documented and quantified.
mTLS still advisory-only on mesh writes (3/10)	Gate-checked. Cutover is ready pending user go-ahead.

Production state at end of session: 2/2 tasks RUNNING/HEALTHY, redis=true, db=true, 7/7 providers healthy.