BR deploy pipeline Phase A — 18 min → 6 min target
2026-04-21
LOCKSTEP TRACEABILITY MATRIX --- api_endpoints: ["none"] sdk_methods_updated: ["none"] mcp_tools_updated: ["none"] ---
What Changed
Six optimizations to the BR deploy pipeline, no functional code changes. The plan at ~/.claude/plans/bright-wibbling-ladybug.md calls these out as Phase A.
- A1 — Docker BuildKit layer cache (ECR-backed).
docker buildreplaced withdocker/build-push-action@v5usingcache-from/cache-to: type=registryagainst the:buildcachetag in the same ECR repo. No separate storage cost. Saves 90–150s on warm cache. - A2 — ECS
minimumHealthyPercent=50. Applied in-workflow and live viaaws ecs update-service. With 2 tasks + ALB, one healthy task handles BR's sub-10 RPS during the ~60s rolling-replace. Deployment circuit breaker + auto-rollback still active. Saves ~120s per deploy. - A3 — TG deregistration delay 120s → 30s. Applied live via
modify-target-group-attributes. BR has no long-lived connections (no websockets, no uploads) so 30s is ample. Saves ~90s per deploy. - A4 — Trimmed workflow fat. Staging canary gated behind
STAGING_DEPLOY_ENABLED=true(the service doesn't exist, the step always wasted 5s on describe-services); redundantsleep 10before health check removed; alarm-watch reduced 120s → 60s. Saves ~75s. - A5 — Expanded
paths-ignore. Added.claude/,memory/,.vscode/,site/. Site edits are Vercel-deployed separately; should never trigger an ECS rebuild. - A7 — Dockerfile multi-stage. Split into
builder(node:22-bookworm + Bun + all deps +pnpm build+pnpm prune --prod) andruntime(node:22-bookworm-slim + curl + dist + pruned node_modules + packages/). Local test: 2.12 GB → 695 MB (67% reduction). Gateway boots cleanly in the slim image.
Also C5 — site/vercel.json ignoreCommand: replaced "exit 1" (always rebuild) with a git diff command that skips builds when only docs/markdown changed.
Skipped with rationale
A6 — parallelize cosign with task-def register. Investigated: the parallel split saves ~30s of cosign wall time but costs ~60–90s of 3× AWS OIDC re-auth across the split jobs, plus artifact-passing overhead. Net negative. Keeping sequential.
Verification
- Workflow YAML parses,
site/vercel.jsonJSON parses. - Local
docker buildof the new multi-stage Dockerfile succeeds in ~30s (cached layers) — 2.12 GB → 695 MB. - Container boots; gateway reaches DB-connect (expected dummy-URL failure). No missing modules, no broken imports.
- ECS live state:
minimumHealthyPercent=50,deploymentCircuitBreaker.rollback=true. TGderegistration_delay.timeout_seconds=30. - Not yet verified end-to-end: GitHub Actions is blocked on billing, so the new workflow path hasn't run once in CI. First real deploy after billing restore will be the live test.
Expected outcomes (once GHA billing restored)
| Metric | Before | Target after Phase A |
|---|---|---|
| Full ECS deploy (p50) | 18 min (measured, last 5 successful runs) | 6 min |
| ECR push time | ~30s on 2.12 GB image | ~10s on 695 MB image |
| Rolling replace (2 tasks) | 180–300s | 60–120s |
| Post-deploy alarm-watch | 120s | 60s |
| Vercel site rebuilds per doc-only commit | always | skipped via ignoreCommand |
Next
- Phase B: wire BR catalog ingestors so kimi-k2.7 / price bumps / deprecations are data changes, not container rebuilds. Round-trip: ~5s.
- Phase C1: self-hosted GHA runner on macOS to eliminate the billing pain entirely.
- Phase C2: consolidate brainstormhive's 13 crons into 5–6 orchestrators.
Lockstep Checklist
- [x] API Routes: no changes.
- [x] TS SDK: no changes.
- [x] Python SDK: no changes.
- [x] MCP Schemas: no changes.
- [x] Marketing site: no changes.
(CI/CD plumbing only — no runtime surface changes.)