Cut CloudWatch Logs ingestion ~99% by clamping setTimeout overflow at the runtime layer

2026-05-05

gateway, infra

What We Built

A global setTimeout/setInterval wrapper (src/infra/timer-clamp.ts) that clamps requested delays to Node's maximum timer delay, MAX_TIMEOUT (2^31-1 ms ≈ 24.85 days), and re-arms via a wake-up chain for longer delays. This replaces Node's default handling of out-of-range delays (reset the delay to 1 ms and emit a TimeoutOverflowWarning) with correct long-delay semantics.

The wrapper installs as the first executable statement in src/index.ts, before any module imports complete. It is idempotent via a Symbol.for global guard, captures the call site once per unique stack signature on clamp (max 10 unique stacks per process lifetime), and writes the diagnostic to the file logger only — never to console.* — so the diagnostic doesn't itself enter CloudWatch Logs ingestion.
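A minimal sketch of the two bookkeeping pieces described above: the Symbol.for install guard and the per-signature diagnostic cap. The symbol key, the function names, and the normalization regexes are assumptions made for illustration; they are not taken from src/infra/timer-clamp.ts.

```typescript
// Hypothetical names throughout; only the Symbol.for guard and the 10-stack cap come from the text.
const INSTALLED = Symbol.for("example.timer-clamp.installed");
const MAX_LOGGED_STACKS = 10;

/** Runs `install` at most once per process, even if the module is evaluated twice. */
function installOnce(install: () => void): boolean {
  const g = globalThis as unknown as Record<symbol, unknown>;
  if (g[INSTALLED] === true) return false; // already installed: do nothing
  g[INSTALLED] = true;
  install();
  return true;
}

/** Reduces a stack trace to a signature that ignores directories and line:col. */
function stackSignature(stack: string, skipLines = 2): string {
  return stack
    .split("\n")
    .slice(skipLines) // drop the "Error" header and the wrapper's own frame
    .map((line) =>
      line
        .trim()
        .replace(/:\d+:\d+/g, "") // strip line:col
        .replace(/(?:[A-Za-z]:)?[\\\/](?:[^\\\/\s()]+[\\\/])+/g, ""), // strip directories
    )
    .join(" | ");
}

/** Returns a predicate that is true only the first time a signature is seen, up to the cap. */
function createClampLogGate(cap: number = MAX_LOGGED_STACKS): (sig: string) => boolean {
  const seen = new Set<string>();
  return (sig) => {
    if (seen.has(sig) || seen.size >= cap) return false;
    seen.add(sig);
    return true;
  };
}
```

In this sketch the caller would write the [timer-clamp] line to the file logger only when the gate returns true, so a hot loop hitting the same call site produces one line rather than hundreds per second.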

Layer 2 (the runtime wrapper) is the durable fix; Layer 1 (--disable-warning=TimeoutOverflowWarning in the Dockerfile CMD) is defense-in-depth for any path that bypasses the JS-level wrapper.

Why It Matters

Production CloudWatch Logs ingestion was being dominated by TimeoutOverflowWarning spam — ~99% of all log lines, ~$208/month. Worse: each warning means a setTimeout callback that was supposed to fire in ~14 months is firing immediately, 800 times per second. That's a real correctness bug, not just log noise. Suppressing the warning at the JS level (PR #159 + #160 attempts) hides the bug while leaving the misfiring loop in place. Fixing it at the runtime layer eliminates the entire class — the bug becomes impossible because Node never sees an out-of-range value.

How It Works

The wrapper replaces globalThis.setTimeout. For delays ≤ MAX_TIMEOUT it passes through unchanged (zero overhead for the common case). For delays > MAX_TIMEOUT:

  1. Capture a stack signature (skipping the wrapper frame itself, normalizing absolute paths and line:col so dedup works against minified bundles).
  2. If the signature is new and we haven't logged 10 stacks yet, write one structured [timer-clamp] line to the file logger pointing at the offending caller.
  3. Schedule a "long timeout" handle: a small object that internally chains wake-up timers of at most MAX_TIMEOUT_MS each until the requested delay is exhausted, then fires the callback. Node's own timer code does not do this; it resets an out-of-range delay to 1 ms.
  4. Return that handle. clearTimeout(handle) works as expected (the wrapper exposes __brClampClear); unref() / ref() / hasRef() proxy to the active inner timer.
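The wake-up chain in steps 3 and 4 can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped code: scheduleLong, LongTimeoutHandle, and the injectable arm/disarm parameters are hypothetical names, and the unref() / ref() / hasRef() proxies are left out for brevity.

```typescript
const MAX_TIMEOUT_MS = 2 ** 31 - 1; // 2,147,483,647 ms, about 24.85 days

// Injectable timer functions so the chain can be exercised without waiting (an assumption of this sketch).
type Arm = (fn: () => void, ms: number) => unknown;
type Disarm = (inner: unknown) => void;

interface LongTimeoutHandle {
  clear(): void;
}

function scheduleLong(
  callback: () => void,
  delayMs: number,
  arm: Arm = (fn, ms) => setTimeout(fn, ms),
  disarm: Disarm = (inner) => clearTimeout(inner as Parameters<typeof clearTimeout>[0]),
): LongTimeoutHandle {
  let remaining = delayMs;
  let inner: unknown;
  let cleared = false;

  const step = (): void => {
    if (cleared) return;
    if (remaining <= 0) {
      callback(); // the full requested delay has elapsed
      return;
    }
    const chunk = Math.min(remaining, MAX_TIMEOUT_MS); // never hand Node an out-of-range delay
    remaining -= chunk;
    inner = arm(step, chunk); // re-arm for the next wake-up
  };

  step();
  return {
    clear(): void {
      cleared = true; // stops the chain even if a wake-up is already queued
      disarm(inner);
    },
  };
}
```

Because each inner timer is at most MAX_TIMEOUT_MS, Node never sees an overflowing value, and a delay of 2 × MAX_TIMEOUT_MS + 5 becomes three inner timers before the callback runs.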

For setInterval, an overflowing delay is clamped to MAX_TIMEOUT rather than chained (a 14-month interval is almost certainly a bug, and firing every 24.85 days is a much safer degradation than firing every 1 ms). setInterval doesn't return a wrapper handle, so clearInterval is unchanged.
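A minimal sketch of the setInterval side, assuming a hypothetical wrapSetInterval factory that takes the original function and an optional onClamp hook for the diagnostic. The native handle is returned untouched, which is why clearInterval needs no changes.

```typescript
const MAX_TIMEOUT_MS = 2 ** 31 - 1; // 2,147,483,647 ms, about 24.85 days

type IntervalFn = (fn: () => void, ms: number) => unknown;

/** Wraps a setInterval-like function so overflowing delays are clamped instead of misfiring. */
function wrapSetInterval(
  original: IntervalFn,
  onClamp: (requestedMs: number) => void = () => {},
): IntervalFn {
  return (fn, ms) => {
    if (ms > MAX_TIMEOUT_MS) {
      onClamp(ms); // hook for the diagnostic line (hypothetical)
      return original(fn, MAX_TIMEOUT_MS);
    }
    return original(fn, ms); // common case: pass through unchanged
  };
}
```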

The Numbers

  • Before: 1000-line CW Logs sample = 99.7% TimeoutOverflowWarning. Burst rate ~800 events/sec. ~600 MB/hour ingested = ~416 GB/month = $208/month at $0.50/GB.
  • After (expected, post-deploy): TimeoutOverflowWarning rate = 0. Ingestion drops to whatever real BR application logs are emitting (single-digit MB/hour). Cost falls to ~$2/month for the log group.
  • Correctness: The unknown caller scheduling 35.9-billion-ms timers will now wait MAX_TIMEOUT (24.85 days) per chunk instead of firing immediately. Whatever was supposed to happen "in 14 months" will actually happen in 14 months. Layer 2's stack capture surfaces the caller for a Layer 3 follow-up.
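The arithmetic behind these figures can be checked with a short script. It assumes a 30-day month and 1 GB = 1000 MB, and treats 35.9 billion ms as the approximate delay; the function names are illustrative only.

```typescript
const MAX_TIMEOUT_MS = 2 ** 31 - 1;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

/** Number of inner timers the wake-up chain needs for a given delay. */
function wakeUpsFor(delayMs: number): number {
  return Math.ceil(delayMs / MAX_TIMEOUT_MS);
}

/** Monthly ingestion cost in USD at a flat per-GB price. */
function monthlyCostUsd(gbPerMonth: number, usdPerGb: number): number {
  return gbPerMonth * usdPerGb;
}

/** Sustained MB/hour implied by a monthly volume (30-day month, 1 GB = 1000 MB assumed). */
function impliedMbPerHour(gbPerMonth: number): number {
  return (gbPerMonth * 1000) / (30 * 24);
}

const requestedMs = 35.9e9; // approximate delay from the log sample
console.log("days requested:", (requestedMs / MS_PER_DAY).toFixed(1));
console.log("wake-ups needed:", wakeUpsFor(requestedMs));
console.log("monthly cost (USD):", monthlyCostUsd(416, 0.5));
console.log("implied MB/hour:", impliedMbPerHour(416).toFixed(0));
```

Under those assumptions the delay is about 415.5 days, the chain needs 17 inner timers (16 full chunks plus a remainder), and 416 GB/month corresponds to roughly 578 MB/hour sustained, in line with the ~600 MB/hour burst estimate.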

Competitive Edge

Self-healing infrastructure: BR's runtime now refuses to participate in the well-known Node TimeoutOverflowWarning pathology. Any service in the BR codebase that schedules a long-horizon timer gets correct semantics for free, instead of silent 1ms misfires. The diagnostic surfaces architectural bugs (timers > 24 days are usually a smell) within a single log line per call site.

Lockstep Checklist

> _You MUST check these boxes [x] and verify the corresponding files are updated BEFORE committing this log._

  • [x] API routes — N/A (infrastructure-only change, no new routes)
  • [x] TypeScript SDK (packages/sdk-ts/) — N/A (no API surface change)
  • [x] Python SDK (packages/sdk-py/) — N/A (no API surface change)
  • [x] MCP tools (src/mcp/server.ts, site/.well-known/agents.json) — N/A (no agent-facing change)
  • [x] Tests: src/infra/timer-clamp.test.ts (8 tests, all passing)
  • [x] Docs — this ship-log entry
  • [x] Verification — built locally, ran node -e "import('./dist/index.js')..." to confirm wrapper installs and 100 bad-delay setTimeout calls produce 0 warnings. Re-ran inside production-equivalent Docker image (Node 22.22.2, PID 1) — same result.

Honest Failure Notes

PRs #159 and #160 were earlier attempts at this same problem. Both shipped without proper verification:

  • #159 wired installProcessWarningFilter() into src/entry.ts — production runs dist/index.js directly, bypassing entry.ts. Filter never ran. I should have read the Dockerfile CMD before declaring success.
  • #160 added the install to src/index.ts. Locally, process.emitWarning interception worked, but production ingestion remained almost entirely TimeoutOverflowWarning. Mechanism unknown: possibly Node 22's internal warning emission path bypasses a user-replaced process.emitWarning. Either way, intercepting the warning was the wrong approach because it doesn't fix the underlying correctness bug.

This PR (timer-clamp) skips the warning-interception layer entirely and fixes the value at the source. Memory note feedback_no_hotfixes.md (durable solutions only) is the principle being applied: build the system that makes the class impossible.

A Layer 3 follow-up will use the diagnostic stacks Layer 2 captures to find and fix the actual scheduler that's requesting 35.9-billion-ms delays.