BR deploy pipeline Phase A — 18 min → 6 min target

2026-04-21

ci-cddockerecsvercel

LOCKSTEP TRACEABILITY MATRIX --- api_endpoints: ["none"] sdk_methods_updated: ["none"] mcp_tools_updated: ["none"] ---

What Changed

Six optimizations to the BR deploy pipeline, no functional code changes. The plan at ~/.claude/plans/bright-wibbling-ladybug.md calls these out as Phase A.

A1 — Docker BuildKit layer cache (ECR-backed). docker build replaced with docker/build-push-action@v5 using cache-from/cache-to: type=registry against the :buildcache tag in the same ECR repo. No separate storage cost. Saves 90–150s on warm cache.
A2 — ECS minimumHealthyPercent=50. Applied in-workflow and live via aws ecs update-service. With 2 tasks + ALB, one healthy task handles BR's sub-10 RPS during the ~60s rolling-replace. Deployment circuit breaker + auto-rollback still active. Saves ~120s per deploy.
A3 — TG deregistration delay 120s → 30s. Applied live via modify-target-group-attributes. BR has no long-lived connections (no websockets, no uploads) so 30s is ample. Saves ~90s per deploy.
A4 — Trimmed workflow fat. Staging canary gated behind STAGING_DEPLOY_ENABLED=true (the service doesn't exist, the step always wasted 5s on describe-services); redundant sleep 10 before health check removed; alarm-watch reduced 120s → 60s. Saves ~75s.
A5 — Expanded paths-ignore. Added .claude/, memory/, .vscode/, site/. Site edits are Vercel-deployed separately; should never trigger an ECS rebuild.
A7 — Dockerfile multi-stage. Split into builder (node:22-bookworm + Bun + all deps + pnpm build + pnpm prune --prod) and runtime (node:22-bookworm-slim + curl + dist + pruned node_modules + packages/). Local test: 2.12 GB → 695 MB (67% reduction). Gateway boots cleanly in the slim image.

Also C5 — site/vercel.json ignoreCommand: replaced "exit 1" (always rebuild) with a git diff command that skips builds when only docs/markdown changed.

Skipped with rationale

A6 — parallelize cosign with task-def register. Investigated: the parallel split saves ~30s of cosign wall time but costs ~60–90s of 3× AWS OIDC re-auth across the split jobs, plus artifact-passing overhead. Net negative. Keeping sequential.

Verification

Workflow YAML parses, site/vercel.json JSON parses.
Local docker build of the new multi-stage Dockerfile succeeds in ~30s (cached layers) — 2.12 GB → 695 MB.
Container boots; gateway reaches DB-connect (expected dummy-URL failure). No missing modules, no broken imports.
ECS live state: minimumHealthyPercent=50, deploymentCircuitBreaker.rollback=true. TG deregistration_delay.timeout_seconds=30.
Not yet verified end-to-end: GitHub Actions is blocked on billing, so the new workflow path hasn't run once in CI. First real deploy after billing restore will be the live test.

Expected outcomes (once GHA billing restored)

Metric	Before	Target after Phase A
Full ECS deploy (p50)	18 min (measured, last 5 successful runs)	6 min
ECR push time	~30s on 2.12 GB image	~10s on 695 MB image
Rolling replace (2 tasks)	180–300s	60–120s
Post-deploy alarm-watch	120s	60s
Vercel site rebuilds per doc-only commit	always	skipped via ignoreCommand

Phase B: wire BR catalog ingestors so kimi-k2.7 / price bumps / deprecations are data changes, not container rebuilds. Round-trip: ~5s.
Phase C1: self-hosted GHA runner on macOS to eliminate the billing pain entirely.
Phase C2: consolidate brainstormhive's 13 crons into 5–6 orchestrators.

Lockstep Checklist

[x] API Routes: no changes.
[x] TS SDK: no changes.
[x] Python SDK: no changes.
[x] MCP Schemas: no changes.
[x] Marketing site: no changes.