2026-05-08-capacity-bump-and-mtls-strict
2026-05-08 — Pool size 50→200 + mTLS strict cutover env-var override
Summary
Closes the two deferred items from R20:
- Bump DATABASE_POOL_SIZE 50→200 (capacity ceiling fix from R20 load-test artifact).
- mTLS strict cutover: an env-var override knob lets the operator flip strict mode via the ECS task definition env block instead of writing to the config DB.
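ECS task definition revision 743 carries both settings: DATABASE_POOL_SIZE=200 and BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES=1. The pool-size bump is already live on revision 742; the service rolls to 743 on the next ecs-deploy.yml run.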
Pool size bump
Direct path (no code change required, env is in the task def):
- aws ecs describe-task-definition → revision 741 (DATABASE_POOL_SIZE=50)
- Edit env: DATABASE_POOL_SIZE=200; register revision 742
- aws ecs update-service --task-definition <family>:742 → rolling deploy completed; production now serves on 742 with 2/2 tasks running
Production should now sustain roughly 4× the previous ceiling, assuming throughput scales linearly with pool size (per R17 + R20 load-test ceiling analysis: pool exhaustion was the binding constraint at ~3-4 RPS; 4× pool ≈ 12-16 RPS sustained ceiling, before any application-level changes).
The next R20+1 load test will validate the actual ceiling; targeted to run after this PR's deploy lands.
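The ceiling estimate above is simple proportional arithmetic, which can be sketched as follows. This is a back-of-envelope illustration, not part of the PR: `projectCeiling` and `RpsRange` are hypothetical names, and the linear-scaling assumption only holds while pool exhaustion remains the binding constraint.

```typescript
// Back-of-envelope projection of the RPS ceiling after a pool-size change.
// Assumption: the ceiling scales linearly with pool size while pool
// exhaustion stays the binding constraint.
interface RpsRange {
  low: number;
  high: number;
}

function projectCeiling(observed: RpsRange, oldPool: number, newPool: number): RpsRange {
  if (oldPool <= 0 || newPool <= 0) {
    throw new Error("pool sizes must be positive");
  }
  const factor = newPool / oldPool;
  return { low: observed.low * factor, high: observed.high * factor };
}

// Observed ceiling of ~3-4 RPS at pool size 50, projected to pool size 200.
const projected = projectCeiling({ low: 3, high: 4 }, 50, 200);
console.log(projected); // { low: 12, high: 16 }
```

If the follow-up load test lands well below this range, some other resource has likely become the binding constraint.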
mTLS strict cutover
The R20 risk register flagged "mTLS still advisory-only on mesh writes" (3/10). The runbook (docs/runbooks/mtls-strict-cutover.md, shipped in PR #218) gates strict-mode cutover on zero event="mtls-advisory" log lines over the past 24h.
PR #228's gate check confirmed that the gate is satisfied (zero advisory bypasses and zero mesh-write traffic on the gated routes in the past 24h).
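For illustration, the gate condition can be expressed as a predicate over parsed log events. This is a minimal sketch, not the check from PR #228: the `LogEvent` shape, the function name, and the sample data are all assumptions, since the real log schema is not shown in this log.

```typescript
// Hypothetical shape of a parsed log line; the real schema is not shown here.
interface LogEvent {
  event: string;
  timestampMs: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns true when no "mtls-advisory" event falls inside the trailing window.
function strictCutoverGateSatisfied(
  events: LogEvent[],
  nowMs: number,
  windowMs: number = DAY_MS
): boolean {
  const windowStart = nowMs - windowMs;
  return !events.some(
    (e) => e.event === "mtls-advisory" && e.timestampMs >= windowStart && e.timestampMs <= nowMs
  );
}

const now = Date.UTC(2026, 4, 8, 12, 0, 0);
const sample: LogEvent[] = [
  { event: "mtls-advisory", timestampMs: now - 2 * DAY_MS }, // older than 24h, ignored
  { event: "request", timestampMs: now - 60000 }, // recent, but not an advisory event
];
console.log(strictCutoverGateSatisfied(sample, now)); // true
```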
This PR adds the env-var override knob that the runbook recommends:
// src/gateway/boot/boot-api.ts
export function resolveMeshWritesEnforcement(opts: {
  configValue: "advisory" | "strict" | undefined;
  env: NodeJS.ProcessEnv;
}): "advisory" | "strict" | undefined {
  // The env var wins only when it is exactly "1"; any other value falls through.
  if (opts.env.BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES === "1") {
    return "strict";
  }
  return opts.configValue;
}
Precedence: env=1 → config value → undefined (caller defaults to "advisory"). The env-var path is operator-friendly: flipping strict mode is now a task-def env edit, not a config-DB write. No restart-class config change required (the task def update IS the restart, since it triggers a rolling deploy).
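The precedence rule can be exercised with a small matrix. The sketch below copies the resolver so it runs standalone; the `Env` interface stands in for NodeJS.ProcessEnv to avoid needing Node typings, and `effectiveEnforcement` is a hypothetical caller that applies the "advisory" default described above.

```typescript
type Enforcement = "advisory" | "strict";

// Assumption: a narrow interface stands in for NodeJS.ProcessEnv.
interface Env {
  BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES?: string;
}

function resolveMeshWritesEnforcement(opts: {
  configValue: Enforcement | undefined;
  env: Env;
}): Enforcement | undefined {
  if (opts.env.BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES === "1") {
    return "strict";
  }
  return opts.configValue;
}

// Hypothetical caller: falls back to "advisory" when the resolver returns undefined.
function effectiveEnforcement(configValue: Enforcement | undefined, env: Env): Enforcement {
  return resolveMeshWritesEnforcement({ configValue, env }) ?? "advisory";
}

const envOn: Env = { BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES: "1" };
const envUnset: Env = {};

console.log(effectiveEnforcement("advisory", envOn)); // strict (env overrides config)
console.log(effectiveEnforcement(undefined, envOn)); // strict
console.log(effectiveEnforcement("strict", envUnset)); // strict (config value used)
console.log(effectiveEnforcement("advisory", envUnset)); // advisory
console.log(effectiveEnforcement(undefined, envUnset)); // advisory (caller default)
// Only the exact string "1" enables the override; "true" does not.
console.log(effectiveEnforcement(undefined, { BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES: "true" })); // advisory
```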
Once this PR merges and ecs-deploy.yml runs:
- Workflow's aws ecs describe-task-definition returns the latest revision (743 or higher)
- Workflow registers a new revision with the new image + existing env vars (incl. BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES=1)
- Service updates to the new revision → rolling deploy → new tasks read the env var → meshWritesEnforcement: "strict" active
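The step that matters for this cutover is that the new revision keeps the existing env vars while swapping the image. The sketch below illustrates that idea only; it is not taken from ecs-deploy.yml, and the trimmed-down `ContainerDef` shape, the container name "gateway", and the image tags are all hypothetical.

```typescript
// Hypothetical, trimmed-down container definition (only the fields this sketch touches).
interface EnvVar {
  name: string;
  value: string;
}

interface ContainerDef {
  name: string;
  image: string;
  environment: EnvVar[];
}

// Build container definitions for the next revision: new image, env copied unchanged.
function withNewImage(
  containers: ContainerDef[],
  containerName: string,
  image: string
): ContainerDef[] {
  return containers.map((c) =>
    c.name === containerName ? { ...c, image, environment: [...c.environment] } : c
  );
}

const current: ContainerDef[] = [
  {
    name: "gateway", // hypothetical container name
    image: "example/gateway:old", // hypothetical image tag
    environment: [
      { name: "DATABASE_POOL_SIZE", value: "200" },
      { name: "BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES", value: "1" },
    ],
  },
];

const next = withNewImage(current, "gateway", "example/gateway:new");
console.log(next[0].image); // example/gateway:new
console.log(next[0].environment.map((e) => `${e.name}=${e.value}`).join(","));
// DATABASE_POOL_SIZE=200,BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES=1
```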
Changes
- src/gateway/boot/boot-api.ts: extracted resolveMeshWritesEnforcement() helper with env-var precedence
- src/gateway/boot/boot-api.test.ts: new file, 7 paired regression tests covering the precedence matrix
- ECS task def 743: registered out-of-band (not in this PR; via aws CLI)
Verification
pnpm test:fast # 821 files / 7497 tests / 0 failed (was 7490; +7 from new tests)
pnpm check # types/lint/format clean
Post-deploy verification (after this PR merges):
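# 1. Confirm the running task's task definition carries the env var.
#    describe-tasks only reports run-time overrides, so resolve the task definition first.
TASK_ARN=$(aws ecs list-tasks --cluster brainstorm-router-production \
  --service-name brainstorm-router-production-fargate --region us-east-1 \
  --query 'taskArns[0]' --output text)
TASK_DEF_ARN=$(aws ecs describe-tasks --cluster brainstorm-router-production \
  --tasks "$TASK_ARN" --region us-east-1 \
  --query 'tasks[0].taskDefinitionArn' --output text)
aws ecs describe-task-definition --task-definition "$TASK_DEF_ARN" --region us-east-1 \
  --query 'taskDefinition.containerDefinitions[0].environment'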
# 2. Probe a mesh-write route without mTLS — expect 403
curl -sS -X POST -w "\nstatus=%{http_code}\n" \
-H "Authorization: Bearer $BR_AGENT_KEY" \
-H "Content-Type: application/json" \
-d '{"to_agent": "test", "task": "ping"}' \
https://api.brainstormrouter.com/v1/agent/delegate
# 3. CW Logs: confirm zero "mtls-advisory" events post-deploy
aws logs filter-log-events --log-group-name /ecs/brainstorm-router-production \
--start-time $(($(date +%s) - 1800))000 \
--filter-pattern '"mtls-advisory"' --region us-east-1 --max-items 5
Rollback
If strict mode breaks a legitimate caller, revert via task definition env:
# Set env to anything other than "1" (or remove the var entirely) and redeploy
aws ecs register-task-definition --cli-input-json <td-without-env-var>
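aws ecs update-service --cluster brainstorm-router-production \
  --service brainstorm-router-production-fargate --region us-east-1 \
  --task-definition <new-arn> --force-new-deployment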
The env-var-override path means rollback is just a task-def edit + redeploy — no config-DB rollback or schema change.
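One consequence of the resolver's precedence is worth checking before relying on this rollback: the env var can only force strict, it cannot force advisory. The sketch below restates the same precedence in compact form (the names `resolve` and `effective` are hypothetical) and assumes the caller default of "advisory" described in the cutover section.

```typescript
type Enforcement = "advisory" | "strict";

interface Env {
  BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES?: string;
}

// Same precedence as resolveMeshWritesEnforcement(): env "1" wins, otherwise the config value.
function resolve(configValue: Enforcement | undefined, env: Env): Enforcement | undefined {
  return env.BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES === "1" ? "strict" : configValue;
}

// Hypothetical caller default, per the precedence note in the cutover section.
function effective(configValue: Enforcement | undefined, env: Env): Enforcement {
  return resolve(configValue, env) ?? "advisory";
}

// Rollback via env edit works when the config DB does not itself say "strict".
console.log(effective(undefined, {})); // advisory
console.log(effective(undefined, { BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES: "0" })); // advisory
console.log(effective("advisory", {})); // advisory

// Caveat: if the config value is already "strict", removing the env var changes nothing.
console.log(effective("strict", {})); // strict
```

If the config DB value is "strict", the rollback would also need a config-DB change; when it is unset or "advisory", the task-def edit alone is enough.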
Lockstep checklist
- [x] Source: boot-api.ts env-var override resolver
- [x] Test: boot-api.test.ts 7 paired regression tests
- [x] Ship log: this file
- [x] Task def 743 prepared — env vars set; service rolls to it on next ecs-deploy.yml run
- [x] Pool size 50→200 — already deployed via task def 742 (out-of-band aws CLI)
- [x] R20 risk register — mTLS advisory-only (3/10) closed; load-test ceiling fix (3/10) capacity portion shipped
- [ ] Post-deploy load test — re-run R20 baseline to confirm ceiling moves above ~5 RPS
R20 final scoreboard (post-this-PR)
All 11 R20 risk register items now have a status:
- 9 closed in code/config
- 2 evaluated as already-mitigated
- 0 deferred