2026-05-08-capacity-bump-and-mtls-strict

2026-05-08 — Pool size 50→200 + mTLS strict cutover env-var override

Summary

Closes the two deferred items from R20:

Bump DATABASE_POOL_SIZE 50→200 (capacity ceiling fix from R20 load-test artifact).
mTLS strict cutover — env-var override knob lets the operator flip strict mode via the ECS task definition env block instead of writing to the config DB.

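Both settings land in ECS task definition revision 743, which sets DATABASE_POOL_SIZE=200 and BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES=1. The pool-size change is already live on revision 742; strict mode takes effect once the service rolls onto a revision that carries the env var.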

Pool size bump

Direct path (no code change required, env is in the task def):

aws ecs describe-task-definition → revision 741 (DATABASE_POOL_SIZE=50)
Edit env: DATABASE_POOL_SIZE=200; register revision 742
aws ecs update-service --task-definition <family>:742 → rolling deploy completed; production now serves revision 742 with 2/2 tasks running

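Production capacity should now be roughly 4× the previous ceiling. Per the R17 and R20 load-test ceiling analysis, pool exhaustion was the binding constraint at ~3-4 RPS, so a 4× pool gives an estimated sustained ceiling of ~12-16 RPS before any application-level changes. This assumes the ceiling scales linearly with pool size and that no other bottleneck binds first.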
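The estimate can be reproduced with a few lines of arithmetic. This is only a sketch of the extrapolation: projectCeiling and RpsRange are illustrative names, not code from the repo, and the linear-scaling assumption is unverified until the next load test runs.

```typescript
// Illustrative only: linear extrapolation of the RPS ceiling from pool size.
// Assumes pool exhaustion stays the binding constraint after the bump.
interface RpsRange {
  low: number;
  high: number;
}

function projectCeiling(observed: RpsRange, oldPool: number, newPool: number): RpsRange {
  const factor = newPool / oldPool; // 200 / 50 = 4
  return { low: observed.low * factor, high: observed.high * factor };
}

// Observed ceiling of ~3-4 RPS at pool size 50, projected to pool size 200.
const projected = projectCeiling({ low: 3, high: 4 }, 50, 200);
// projected is { low: 12, high: 16 }
```

If the load test lands well below this range, something other than the pool is the new binding constraint.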

The next load test (R20+1) will validate the actual ceiling; it is scheduled to run after this PR's deploy lands.

mTLS strict cutover

The R20 risk register flagged "mTLS still advisory-only on mesh writes" (3/10). The runbook (docs/runbooks/mtls-strict-cutover.md, shipped in PR #218) gates strict-mode cutover on zero event="mtls-advisory" log lines over the past 24h.

PR #228's gate check confirmed that the gate is satisfied: zero advisory bypasses and zero mesh-write traffic on the gated routes in the past 24h.
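The gate condition itself is simple to state in code. The sketch below is not the check from PR #228; the LogRecord shape and the mtlsGateSatisfied name are assumptions made for illustration, and only the event value "mtls-advisory" and the 24h window come from the runbook description above.

```typescript
// Hypothetical log record shape; the real log schema may differ.
interface LogRecord {
  timestampMs: number;
  event: string;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// The gate passes only if no "mtls-advisory" event falls inside the window.
function mtlsGateSatisfied(
  records: LogRecord[],
  nowMs: number,
  windowMs: number = DAY_MS,
): boolean {
  const cutoff = nowMs - windowMs;
  return !records.some(
    (r) => r.event === "mtls-advisory" && r.timestampMs >= cutoff,
  );
}

const now = 1_000_000_000_000; // arbitrary fixed "now" for the example
const oldBypass = [{ timestampMs: now - 2 * DAY_MS, event: "mtls-advisory" }];
const recentBypass = [{ timestampMs: now - 60_000, event: "mtls-advisory" }];

const passesWithOldBypass = mtlsGateSatisfied(oldBypass, now);       // true
const passesWithRecentBypass = mtlsGateSatisfied(recentBypass, now); // false
```

Note that an empty log also passes, so a window with no mesh-write traffic at all satisfies the gate without exercising it.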

This PR adds the env-var override knob that the runbook recommends:

// src/gateway/boot/boot-api.ts
export function resolveMeshWritesEnforcement(opts: {
  configValue: "advisory" | "strict" | undefined;
  env: NodeJS.ProcessEnv;
}): "advisory" | "strict" | undefined {
  if (opts.env.BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES === "1") {
    return "strict";
  }
  return opts.configValue;
}

Precedence: env=1 → config value → undefined (caller defaults to "advisory"). Only the exact string "1" triggers the override; any other value falls through to the config value. The env-var path is operator-friendly: flipping strict mode is now a task-def env edit, not a config-DB write. No separate restart is needed, because registering the new task-def revision triggers a rolling deploy, which is the restart.
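A small precedence matrix makes the behaviour concrete. The resolver below mirrors the helper above but types env as a plain record so the snippet runs without Node type definitions; effectiveEnforcement is a hypothetical stand-in for the caller-side default, not a function from the repo.

```typescript
type Enforcement = "advisory" | "strict";
const STRICT_ENV = "BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES";

// Same logic as the helper in boot-api.ts, with a simplified env type.
function resolveMeshWritesEnforcement(opts: {
  configValue: Enforcement | undefined;
  env: Record<string, string | undefined>;
}): Enforcement | undefined {
  if (opts.env[STRICT_ENV] === "1") {
    return "strict";
  }
  return opts.configValue;
}

// Hypothetical caller-side default: undefined falls back to "advisory".
function effectiveEnforcement(
  configValue: Enforcement | undefined,
  env: Record<string, string | undefined>,
): Enforcement {
  return resolveMeshWritesEnforcement({ configValue, env }) ?? "advisory";
}

const envOn = { [STRICT_ENV]: "1" };
const envOff = { [STRICT_ENV]: "0" };

const matrix = {
  envOnConfigAdvisory: effectiveEnforcement("advisory", envOn),   // "strict"
  envOnConfigUnset: effectiveEnforcement(undefined, envOn),       // "strict"
  envOffConfigStrict: effectiveEnforcement("strict", envOff),     // "strict"
  envOffConfigAdvisory: effectiveEnforcement("advisory", envOff), // "advisory"
  envUnsetConfigUnset: effectiveEnforcement(undefined, {}),       // "advisory"
  envTrueConfigUnset: effectiveEnforcement(undefined, { [STRICT_ENV]: "true" }), // "advisory"
};
```

Two rows are worth noticing: a value such as "true" does not enable strict mode, and the env var can only upgrade to strict, never downgrade a config value that is already "strict".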

Once this PR merges and ecs-deploy.yml runs:

Workflow's aws ecs describe-task-definition returns latest revision (743 or higher)
Workflow registers a new revision with the new image + existing env vars (incl. BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES=1)
Service updates to new revision → rolling deploy → new tasks read the env var → meshWritesEnforcement: "strict" active
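The property that matters in these steps is that swapping the image leaves the environment block untouched. The sketch below illustrates that property only; it is not the logic in ecs-deploy.yml, and the type names, the container name "gateway", and the image URIs are all hypothetical.

```typescript
// Minimal, hypothetical subset of an ECS task definition.
interface EnvVar {
  name: string;
  value: string;
}
interface ContainerDef {
  name: string;
  image: string;
  environment?: EnvVar[];
}
interface TaskDefInput {
  family: string;
  containerDefinitions: ContainerDef[];
}

// Returns a copy with one container's image replaced; env vars carry over as-is.
function withNewImage(
  taskDef: TaskDefInput,
  containerName: string,
  image: string,
): TaskDefInput {
  return {
    ...taskDef,
    containerDefinitions: taskDef.containerDefinitions.map((c) =>
      c.name === containerName ? { ...c, image } : c,
    ),
  };
}

// Stand-in for the latest revision returned by describe-task-definition.
const latest: TaskDefInput = {
  family: "example-family",
  containerDefinitions: [
    {
      name: "gateway",
      image: "registry.example/app:old",
      environment: [
        { name: "DATABASE_POOL_SIZE", value: "200" },
        { name: "BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES", value: "1" },
      ],
    },
  ],
};

const next = withNewImage(latest, "gateway", "registry.example/app:new");
// next has the new image and the same two env vars; latest is unchanged.
```

If the workflow instead built its task definition from a checked-in template, an env var added out-of-band would be dropped on the next deploy, so this carry-over behaviour is worth confirming in the workflow itself.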

Changes

src/gateway/boot/boot-api.ts — extracted resolveMeshWritesEnforcement() helper with env-var precedence
src/gateway/boot/boot-api.test.ts — new file, 7 paired regression tests covering the precedence matrix
ECS task def 743 — registered out-of-band (not in this PR; via aws CLI)

Verification

pnpm test:fast   # 821 files / 7497 tests / 0 failed (was 7490; +7 from new tests)
pnpm check       # types/lint/format clean

Post-deploy verification (after this PR merges):

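# 1. Confirm the running task's task definition carries the env var
#    (describe-tasks reports only run-time overrides, so read the task definition itself)
TASK_ARN=$(aws ecs list-tasks --cluster brainstorm-router-production \
  --service-name brainstorm-router-production-fargate --region us-east-1 \
  --query 'taskArns[0]' --output text)
TD_ARN=$(aws ecs describe-tasks --cluster brainstorm-router-production \
  --tasks "$TASK_ARN" --region us-east-1 \
  --query 'tasks[0].taskDefinitionArn' --output text)
aws ecs describe-task-definition --task-definition "$TD_ARN" --region us-east-1 \
  --query 'taskDefinition.containerDefinitions[0].environment'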

# 2. Probe a mesh-write route without mTLS — expect 403
curl -sS -X POST -w "\nstatus=%{http_code}\n" \
  -H "Authorization: Bearer $BR_AGENT_KEY" \
  -H "Content-Type: application/json" \
  -d '{"to_agent": "test", "task": "ping"}' \
  https://api.brainstormrouter.com/v1/agent/delegate

# 3. CW Logs: confirm zero "mtls-advisory" events post-deploy
aws logs filter-log-events --log-group-name /ecs/brainstorm-router-production \
  --start-time $(($(date +%s) - 1800))000 \
  --filter-pattern '"mtls-advisory"' --region us-east-1 --max-items 5

Rollback

If strict mode breaks a legitimate caller, revert via task definition env:

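# Set the env var to anything other than "1" (or remove it entirely), register, and redeploy
aws ecs register-task-definition --cli-input-json file://<td-without-env-var>.json --region us-east-1
aws ecs update-service --cluster brainstorm-router-production \
  --service brainstorm-router-production-fargate \
  --task-definition <new-arn> --force-new-deployment --region us-east-1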

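The env-var-override path means rollback is just a task-def edit plus redeploy, with no config-DB rollback or schema change. This holds only while the config value is not itself "strict": the env var can force strict mode but cannot downgrade it.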
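Preparing the rolled-back environment block amounts to dropping one entry. A minimal sketch, assuming the environment is handled as a list of name/value pairs as in the ECS task definition JSON; EnvEntry and withoutEnvVar are illustrative names, not part of the repo or the AWS CLI.

```typescript
// Hypothetical name/value pair, matching the shape of ECS environment entries.
interface EnvEntry {
  name: string;
  value: string;
}

// Returns a new list without the named variable; other entries are kept in order.
function withoutEnvVar(environment: EnvEntry[], name: string): EnvEntry[] {
  return environment.filter((entry) => entry.name !== name);
}

const currentEnv: EnvEntry[] = [
  { name: "DATABASE_POOL_SIZE", value: "200" },
  { name: "BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES", value: "1" },
];

const rolledBack = withoutEnvVar(currentEnv, "BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES");
// rolledBack keeps only DATABASE_POOL_SIZE=200, so the pool-size bump survives the rollback.
```

Removing only the mTLS entry keeps the capacity change and the strict-mode change independently reversible.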

Lockstep checklist

[x] Source — boot-api.ts env-var override resolver
[x] Test — boot-api.test.ts 7 paired regression tests
[x] Ship log — this file
[x] Task def 743 prepared — env vars set; service rolls to it on next ecs-deploy.yml run
[x] Pool size 50→200 — already deployed via task def 742 (out-of-band aws CLI)
[x] R20 risk register — mTLS advisory-only (3/10) closed; load-test ceiling fix (3/10) capacity portion shipped
[ ] Post-deploy load test — re-run R20 baseline to confirm ceiling moves above ~5 RPS

R20 final scoreboard (post-this-PR)

All 11 R20 risk register items now have a status:

9 closed in code/config
2 evaluated as already-mitigated
0 deferred