2026-05-08-capacity-bump-and-mtls-strict

2026-05-08 — Pool size 50→200 + mTLS strict cutover env-var override

Summary

Closes the two deferred items from R20:

  1. Bump DATABASE_POOL_SIZE 50→200 (capacity ceiling fix from R20 load-test artifact).
  2. mTLS strict cutover — env-var override knob lets the operator flip strict mode via the ECS task definition env block instead of writing to the config DB.

Both settings are carried by ECS task definition revision 743, which sets DATABASE_POOL_SIZE=200 and BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES=1.

Pool size bump

Direct path (no code change required, env is in the task def):

  1. aws ecs describe-task-definition → revision 741 (DATABASE_POOL_SIZE=50)
  2. Edit env: DATABASE_POOL_SIZE=200; register revision 742
  3. aws ecs update-service --task-definition <family>:742 → rolling deploy completed; production now serves on 742 with 2/2 tasks running
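
The edit in step 2 can be sketched as a pure transformation on the JSON that describe-task-definition returns. This is a minimal sketch, not the script that was actually run: the type names and both helpers are hypothetical, and it assumes the usual output-only fields must be dropped before the JSON is passed back to register-task-definition.

```typescript
// Hypothetical shapes covering only the fields this sketch touches.
interface EnvVar { name: string; value: string }
interface ContainerDef { name: string; environment?: EnvVar[]; [key: string]: unknown }
interface TaskDef { containerDefinitions: ContainerDef[]; [key: string]: unknown }

// Output-only fields that describe-task-definition returns (assumed list).
const READ_ONLY_FIELDS = [
  "taskDefinitionArn", "revision", "status", "requiresAttributes",
  "compatibilities", "registeredAt", "registeredBy",
];

// Returns a copy with `name` set to `value` in every container's environment.
function withEnvVar(td: TaskDef, name: string, value: string): TaskDef {
  return {
    ...td,
    containerDefinitions: td.containerDefinitions.map((c) => ({
      ...c,
      environment: [
        ...(c.environment ?? []).filter((e) => e.name !== name),
        { name, value },
      ],
    })),
  };
}

// Returns a copy without the output-only fields, suitable as register input.
function toRegisterInput(td: TaskDef): TaskDef {
  const copy: TaskDef = { ...td };
  for (const field of READ_ONLY_FIELDS) delete copy[field];
  return copy;
}
```

Calling toRegisterInput(withEnvVar(described, "DATABASE_POOL_SIZE", "200")) yields the body for the next revision and leaves the described object untouched.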

Production capacity should now be roughly 4× the previous ceiling. Per the R17 and R20 load-test ceiling analysis, pool exhaustion was the binding constraint at ~3-4 RPS, so a 4× pool implies a sustained ceiling of ≈ 12-16 RPS before any application-level changes.
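
The projection above is a linear extrapolation, which can be written out explicitly. The sketch below assumes throughput scales linearly with pool size for as long as the pool remains the binding constraint; the function name is illustrative, and the real ceiling may be lower once another resource saturates.

```typescript
// Linear extrapolation: assumes the connection pool stays the only bottleneck.
function projectCeiling(
  observedRps: [number, number],
  oldPool: number,
  newPool: number,
): [number, number] {
  const factor = newPool / oldPool;
  return [observedRps[0] * factor, observedRps[1] * factor];
}

// R17/R20 observed ~3-4 RPS at a pool size of 50.
const [low, high] = projectCeiling([3, 4], 50, 200);
console.log(`projected ceiling: ${low}-${high} RPS`); // prints "projected ceiling: 12-16 RPS"
```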

The next load test (R20+1) will validate the actual ceiling; it is targeted to run after this PR's deploy lands.

mTLS strict cutover

The R20 risk register flagged "mTLS still advisory-only on mesh writes" (3/10). The runbook (docs/runbooks/mtls-strict-cutover.md, shipped in PR #218) gates strict-mode cutover on zero event="mtls-advisory" log lines over the past 24h.

The gate check in PR #228 confirmed the gate is satisfied: zero advisory bypasses and zero mesh-write traffic on the gated routes in the past 24h.
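
The gate condition amounts to checking for advisory events in a trailing 24-hour window. A minimal sketch, assuming a hypothetical parsed log-entry shape with event and timestampMs fields; the runbook's actual check may be implemented differently (for example as a CloudWatch Logs query).

```typescript
// Hypothetical parsed log entry; the real log schema may differ.
interface LogEntry { event: string; timestampMs: number }

const DAY_MS = 24 * 60 * 60 * 1000;

// The gate is satisfied only when no "mtls-advisory" event falls in the last 24h.
function cutoverGateSatisfied(entries: LogEntry[], nowMs: number): boolean {
  const windowStart = nowMs - DAY_MS;
  return !entries.some(
    (e) =>
      e.event === "mtls-advisory" &&
      e.timestampMs >= windowStart &&
      e.timestampMs <= nowMs,
  );
}
```

An advisory event older than 24 hours does not block the cutover; one inside the window does.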

This PR adds the env-var override knob that the runbook recommends:

// src/gateway/boot/boot-api.ts
export function resolveMeshWritesEnforcement(opts: {
  configValue: "advisory" | "strict" | undefined;
  env: NodeJS.ProcessEnv;
}): "advisory" | "strict" | undefined {
  if (opts.env.BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES === "1") {
    return "strict";
  }
  return opts.configValue;
}

Precedence: env var set to exactly "1" → config value → undefined (caller defaults to "advisory"). Any other env value is ignored, so the env var can force strict mode but cannot downgrade a "strict" config value. The env-var path is operator-friendly: flipping strict mode is now a task-def env edit, not a config-DB write. No restart-class config change is required, because the task def update is itself the restart: it triggers a rolling deploy.
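
The precedence can be spelled out as a small truth table. The sketch below restates the resolver with a plain Record type in place of NodeJS.ProcessEnv so it runs without Node typings, and adds a hypothetical effectiveEnforcement wrapper to model the caller-side default; neither the wrapper nor the table is taken from the PR.

```typescript
type Enforcement = "advisory" | "strict";
type Env = Record<string, string | undefined>;

// Same logic as resolveMeshWritesEnforcement in boot-api.ts.
function resolveMeshWritesEnforcement(opts: {
  configValue: Enforcement | undefined;
  env: Env;
}): Enforcement | undefined {
  if (opts.env.BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES === "1") {
    return "strict";
  }
  return opts.configValue;
}

// Hypothetical caller-side default: undefined falls back to "advisory".
function effectiveEnforcement(configValue: Enforcement | undefined, env: Env): Enforcement {
  return resolveMeshWritesEnforcement({ configValue, env }) ?? "advisory";
}

const KEY = "BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES";
console.log(effectiveEnforcement("advisory", { [KEY]: "1" }));    // strict: env wins
console.log(effectiveEnforcement("strict", {}));                  // strict: config value
console.log(effectiveEnforcement(undefined, {}));                 // advisory: caller default
console.log(effectiveEnforcement("advisory", { [KEY]: "true" })); // advisory: only "1" triggers
console.log(effectiveEnforcement("strict", { [KEY]: "0" }));      // strict: env cannot downgrade
```

The last row matters for rollback: setting the variable to "0" does not downgrade a "strict" config value.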

Once this PR merges and ecs-deploy.yml runs:

  • Workflow's aws ecs describe-task-definition returns latest revision (743 or higher)
  • Workflow registers a new revision with the new image + existing env vars (incl. BRAINSTORMROUTER_MTLS_STRICT_MESH_WRITES=1)
  • Service updates to new revision → rolling deploy → new tasks read the env var → meshWritesEnforcement: "strict" active

Changes

  • src/gateway/boot/boot-api.ts — extracted resolveMeshWritesEnforcement() helper with env-var precedence
  • src/gateway/boot/boot-api.test.ts — new file, 7 paired regression tests covering the precedence matrix
  • ECS task def 743 — registered out-of-band (not in this PR; via aws CLI)

Verification

pnpm test:fast   # 821 files / 7497 tests / 0 failed (was 7490; +7 from new tests)
pnpm check       # types/lint/format clean

Post-deploy verification (after this PR merges):

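# 1. Confirm new tasks run a task definition that carries the env var
#    (task-def env vars are not reported under tasks[].overrides)
TASK_ARN=$(aws ecs list-tasks --cluster brainstorm-router-production \
  --service-name brainstorm-router-production-fargate --region us-east-1 \
  --query 'taskArns[0]' --output text)
TD_ARN=$(aws ecs describe-tasks --cluster brainstorm-router-production \
  --tasks "$TASK_ARN" --region us-east-1 \
  --query 'tasks[0].taskDefinitionArn' --output text)
aws ecs describe-task-definition --task-definition "$TD_ARN" --region us-east-1 \
  --query 'taskDefinition.containerDefinitions[0].environment'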

# 2. Probe a mesh-write route without mTLS — expect 403
curl -sS -X POST -w "\nstatus=%{http_code}\n" \
  -H "Authorization: Bearer $BR_AGENT_KEY" \
  -H "Content-Type: application/json" \
  -d '{"to_agent": "test", "task": "ping"}' \
  https://api.brainstormrouter.com/v1/agent/delegate

# 3. CW Logs: confirm zero "mtls-advisory" events post-deploy
aws logs filter-log-events --log-group-name /ecs/brainstorm-router-production \
  --start-time $(($(date +%s) - 1800))000 \
  --filter-pattern '"mtls-advisory"' --region us-east-1 --max-items 5

Rollback

If strict mode breaks a legitimate caller, revert via task definition env:

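# Set env to anything other than "1" (or remove the var entirely) and redeploy
aws ecs register-task-definition --cli-input-json <td-without-env-var> --region us-east-1
aws ecs update-service --cluster brainstorm-router-production \
  --service brainstorm-router-production-fargate \
  --task-definition <new-arn> --force-new-deployment --region us-east-1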

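The env-var-override path means rollback is just a task-def edit plus a redeploy, with no config-DB rollback or schema change. This holds as long as the config value is not "strict": the env var can only force strict mode, so removing it returns enforcement to whatever the config value (or the caller's "advisory" default) specifies.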
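
Producing the <td-without-env-var> input is a filter over each container's environment list. A minimal sketch with hypothetical type and function names, operating on the containerDefinitions array from describe-task-definition output:

```typescript
// Hypothetical shapes covering only the fields this sketch touches.
interface EnvVar { name: string; value: string }
interface ContainerDef { name: string; environment?: EnvVar[] }

// Returns copies of the containers with `varName` removed from their environment.
function withoutEnvVar(containers: ContainerDef[], varName: string): ContainerDef[] {
  return containers.map((c) => ({
    ...c,
    environment: (c.environment ?? []).filter((e) => e.name !== varName),
  }));
}
```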

Lockstep checklist

  • [x] Source: boot-api.ts env-var override resolver
  • [x] Test: boot-api.test.ts 7 paired regression tests
  • [x] Ship log — this file
  • [x] Task def 743 prepared — env vars set; service rolls to it on next ecs-deploy.yml run
  • [x] Pool size 50→200 — already deployed via task def 742 (out-of-band aws CLI)
  • [x] R20 risk register — mTLS advisory-only (3/10) closed; load-test ceiling fix (3/10) capacity portion shipped
  • [ ] Post-deploy load test — re-run R20 baseline to confirm ceiling moves above ~5 RPS

R20 final scoreboard (post-this-PR)

All 11 R20 risk register items now have a status:

  • 9 closed in code/config
  • 2 evaluated as already-mitigated
  • 0 deferred