2026-05-08 — Load test R20 + mTLS strict cutover gate check

Summary

Closes two R20 risk register items:

  • No load-test artifact for scale claims (3/10 — A1, A4, A8): refreshed load test against production, results committed to docs/load-test/r20-2026-05-08T16-36-29Z.md.
  • mTLS still advisory-only on mesh writes (3/10 — A2, A6, A7): ran the runbook's pre-cutover gate query. Gate: GREEN.

Load test — R20 (10 RPS × 90s + 5 RPS × 60s)

Headline: production capacity ceiling is ~3-4 RPS sustained — unchanged from R17 baseline despite R18-R20 perf PRs (#209-#215). The per-request efficiency improvements are real but did not move the binding constraint, which remains DB connection pool exhaustion.

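| Test | RPS | Duration | Total | Failure | p50 | p95 |
|------|-----|----------|-------|---------|-----|-----|
| 1 | 10 | 90s | 710 | 80.42% | 22,050ms | timeout |
| 2 | 5 | 60s | 263 | 25.10% | 12,742ms | timeout |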

Test 2 ran while ECS was still recovering from Test 1, so its 25% failure rate reflects a degraded state. At 5 RPS against 2/2 healthy tasks, the steady-state failure rate is expected to be 5-15%, based on the R17 baseline plus the perf PRs.
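As a rough cross-check on the ~3-4 RPS headline, the figures in the results table can be converted into successful-request throughput. The sketch below assumes each test ran for exactly its nominal duration and that every non-failed request was a success. The in-flight estimate applies Little's Law with p50 latency standing in for mean latency, which is only an approximation. The function and field names are illustrative.

```python
def summarize(total, failure_pct, duration_s, offered_rps, p50_s):
    """Derive rough throughput figures from one row of the results table."""
    failed = round(total * failure_pct / 100)
    succeeded = total - failed
    return {
        "failed": failed,
        "succeeded": succeeded,
        # Successful requests per second over the nominal test duration.
        "goodput_rps": round(succeeded / duration_s, 2),
        # Little's Law (L = lambda * W): requests in flight at the offered
        # rate, using p50 as an approximation of mean latency.
        "in_flight": round(offered_rps * p50_s, 1),
    }


test1 = summarize(total=710, failure_pct=80.42, duration_s=90,
                  offered_rps=10, p50_s=22.050)
test2 = summarize(total=263, failure_pct=25.10, duration_s=60,
                  offered_rps=5, p50_s=12.742)

print(test1)
print(test2)
```

Under these assumptions Test 2 delivered about 3.3 successful requests per second, in line with the stated ceiling, while Test 1 delivered about 1.5 once the service was overloaded. The in-flight estimate for Test 1 (about 220 requests) is well above a pool of 50, but this only implies pool pressure if each in-flight request holds a connection for most of its lifetime, which this log does not establish.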

Test 1 effect: ALB marked one of the 2 ECS tasks unhealthy mid-test, and ECS replaced it automatically. No customer-facing 500 storm: the system fails loudly and recovers correctly.

Cost: <$0.01 in upstream provider charges (most requests timed out before reaching providers).

Action items (open):

  • Bump DATABASE_POOL_SIZE from 50 to 200+
  • Configure ECS service auto-scale on CPU/memory target tracking
  • Consolidate per-request DB hits in the hot path
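For the auto-scale item, one possible shape is a target-tracking policy registered through Application Auto Scaling. The sketch below is not the project's configuration: the cluster and service names are taken from the redeploy command in this log, while the task bounds, the 60% CPU target, and the policy name are placeholder assumptions that would need tuning.

```python
CLUSTER = "brainstorm-router-production"          # from the redeploy command in this log
SERVICE = "brainstorm-router-production-fargate"  # from the redeploy command in this log


def build_autoscaling_requests(min_tasks=2, max_tasks=6, cpu_target=60.0):
    """Return kwargs for register_scalable_target and put_scaling_policy.

    min_tasks, max_tasks and cpu_target are placeholders, not tuned values.
    """
    common = {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{CLUSTER}/{SERVICE}",
        "ScalableDimension": "ecs:service:DesiredCount",
    }
    target = {**common, "MinCapacity": min_tasks, "MaxCapacity": max_tasks}
    policy = {
        **common,
        "PolicyName": "cpu-target-tracking",  # hypothetical policy name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": cpu_target,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
            },
        },
    }
    return target, policy


def apply(region="us-east-1"):
    """Send both requests. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder runs without AWS access

    client = boto3.client("application-autoscaling", region_name=region)
    target, policy = build_autoscaling_requests()
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)
```

A second policy using ECSServiceAverageMemoryUtilization would cover the memory half of the action item. Since each additional task presumably opens its own database connections, scaling out is worth sequencing with the pool-size change rather than treating it as independent.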

Full results: docs/load-test/r20-2026-05-08T16-36-29Z.md. Raw k6 output in docs/load-test/r20-*.txt.

mTLS strict cutover gate — GREEN

The runbook (docs/runbooks/mtls-strict-cutover.md, shipped in PR #218) gates strict-mode cutover on zero mtls-advisory bypass events over the past 24h. Ran the runbook's CW Logs query at 2026-05-08T16:38Z:

| Pattern | Past 24h matches |
|---------|------------------|
| "mtls-advisory" | 0 |
| "requireMtls" | 0 |
| "clientCert" | 0 |
| "X-Amzn-Mtls" | 0 |
| "mtls" (broad) | 10 (all are [api/cold-start] timing instrumentation, not warnings) |
| /v1/agent/delegate | 0 |
| /v1/agent/sub-agents | 0 |
| /v1/agent/trajectories | 0 |
| /v1/agent/task-runs | 0 |

Interpretation: the runbook's pre-cutover invariant is satisfied. No mesh-write traffic at all in the past 24h — flipping gateway.tls.mtls.meshWritesEnforcement from advisory to strict would not affect any current caller.
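The gate decision can be expressed as a small check over the match counts. This is a sketch of one reasonable rule, not the runbook's own logic: the runbook gates only on zero mtls-advisory bypass events, whereas the function below also requires every other specific pattern to be at zero and every broad "mtls" hit to be accounted for. The function and parameter names are hypothetical.

```python
# Counts copied from the gate query results (past 24h, run at 2026-05-08T16:38Z).
MATCHES = {
    "mtls-advisory": 0,
    "requireMtls": 0,
    "clientCert": 0,
    "X-Amzn-Mtls": 0,
    "/v1/agent/delegate": 0,
    "/v1/agent/sub-agents": 0,
    "/v1/agent/trajectories": 0,
    "/v1/agent/task-runs": 0,
}


def gate_status(matches, broad_total, broad_benign):
    """GREEN only if all specific patterns are zero and all broad hits are benign."""
    offenders = sorted(p for p, n in matches.items() if n > 0)
    unexplained = broad_total - broad_benign
    if offenders or unexplained > 0:
        return "RED", offenders, unexplained
    return "GREEN", offenders, unexplained


# The broad "mtls" pattern matched 10 lines, all cold-start timing instrumentation.
result = gate_status(MATCHES, broad_total=10, broad_benign=10)
print(result)
```

Returning the offending patterns alongside the status keeps a RED result actionable, since the reviewer can see which pattern or how many unexplained lines tripped the gate.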

Caveat: zero traffic does NOT mean zero callers — it means no callers exercised these routes today. Future callers that haven't enrolled in CAF would break under strict mode. Per runbook, the future-caller list to verify before cutover:

  • Internal Brainstorm CLI (would need CAF enrollment automation)
  • Test harnesses (pnpm e2e:local, k6 baselines — these are localhost; not affected)
  • Sandbox agents in dev/staging environments

Recommended cutover procedure:

  1. Update gateway config:
       tls:
         mtls:
           meshWritesEnforcement: strict # was: advisory
  2. Force ECS redeploy (gateway.* is restart-class):
       aws ecs update-service --cluster brainstorm-router-production \
         --service brainstorm-router-production-fargate \
         --force-new-deployment --region us-east-1
  3. Verify with the runbook's curl probes (expect 403 without cert).
  4. Monitor CW Logs for 403 on /v1/agent/* for the first 30 minutes; roll back if any legitimate caller is affected.
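The monitoring step in the procedure above could be supported by a small tally of 403 responses on the mesh-write routes. The sketch assumes log records have already been parsed into dictionaries with `path` and `status` fields, which is an assumption about the log shape rather than something this log confirms. The sample records are made up for illustration and are not real log data.

```python
from collections import Counter


def agent_403s(records):
    """Count 403 responses per path under /v1/agent/."""
    hits = Counter()
    for rec in records:
        path = rec.get("path", "")
        if rec.get("status") == 403 and path.startswith("/v1/agent/"):
            hits[path] += 1
    return hits


# Made-up records for illustration only.
sample = [
    {"path": "/v1/agent/delegate", "status": 403},
    {"path": "/v1/agent/delegate", "status": 403},
    {"path": "/v1/agent/task-runs", "status": 200},
    {"path": "/some/other/route", "status": 403},
]

hits = agent_403s(sample)
print(dict(hits))
```

A nonzero count is a prompt for triage rather than an automatic rollback trigger: the verification probes from the previous step produce 403s by design, so someone still has to decide whether a given caller is legitimate.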

This PR does NOT flip the flag. The gate is green; the cutover is your call. Recommended next step: flip it during a low-traffic window (deliberate change to production security posture).

Files

  • docs/load-test/r20-2026-05-08T16-36-29Z.md — full load-test analysis
  • docs/load-test/r20-2026-05-08T16-36-29Z.txt — raw k6 output (Test 1: 10 RPS)
  • docs/load-test/r20-5rps-2026-05-08T16-39-20Z.txt — raw k6 output (Test 2: 5 RPS)
  • docs/ship-log/2026-05-08-load-test-r20.md — this file

R20 risk register impact

| Risk | Status |
|------|--------|
| No load-test artifact for scale claims (3/10) | Closed. R20 artifact committed; ceiling is documented and quantified. |
| mTLS still advisory-only on mesh writes (3/10) | Gate-checked. Cutover ready pending user go. |

Production state at end of session: 2/2 tasks RUNNING/HEALTHY, redis=true, db=true, 7/7 providers healthy.