Layer 3 root-cause fix: DEFAULT_HEARTBEAT_EVERY was 600,000 minutes, not 600,000 ms

2026-05-05

gatewayinfra

What We Built

Changed DEFAULT_HEARTBEAT_EVERY from the bare numeric literal 600_000 to the duration string "10m". Adds a regression test that fails if the constant ever produces a duration > Node's MAX_TIMEOUT.

Why It Matters

This is the actual root-cause fix for the TimeoutOverflowWarning incident from earlier today. The Layer 2 clamp wrapper (PR #161, src/infra/timer-clamp.ts) stopped the warning spam and prevented Node's silent 1ms clamp from misfiring callbacks 800 times/sec. But the underlying scheduler was still requesting 36-billion-ms delays — Layer 2 just clamped them to 24.85 days, which is correct-but-not-intended. Layer 3 fixes the producer so the consumer never sees an absurd value.

How It Works

The DEFAULT_HEARTBEAT_EVERY constant is consumed by resolveHeartbeatIntervalMs in src/infra/heartbeat-runner.ts:227-254, which wraps the value in String(...) and passes it through parseDurationMs(value, { defaultUnit: "m" }). The default unit is minutes. So:

DEFAULT_HEARTBEAT_EVERY = 600_000 → String(600_000) = "600000" → no unit suffix → defaults to "m" → 600,000 minutes × 60,000 ms/min = 35,999,999,997 ms ≈ 416 days.
DEFAULT_HEARTBEAT_EVERY = "10m" → "10m" → explicit minutes suffix → 10 × 60,000 = 600,000 ms = 10 minutes.

The producer (the constant) and consumer (the parser) had a units contract mismatch. The producer assumed bare numbers are milliseconds; the consumer assumed bare numbers are minutes. Storing the value as a string with explicit unit suffix removes the ambiguity entirely.

The Numbers

The Layer 2 timer-clamp diagnostic surfaced exactly one call-site stack signature in production after deployment:

wrappedSetTimeout → scheduleNext → updateConfig → startHeartbeatRunner → startGatewayServer

with intervalMs: 36000000000 logged at heartbeat boot. Decoding 36000000000 / 60000 = 600000 minutes pointed directly at the constant.

Time from PR #161 merge → ECS deploy → diagnostic surfacing → root-cause identification: ~30 minutes, with one ECS exec to read the file logger inside a running container. This is exactly the workflow the Layer 2 design intended.

Competitive Edge

Demonstrates the value of the timer-clamp wrapper's diagnostic mode: a real correctness bug (an agent scheduler that was supposed to fire every 10 minutes was instead being scheduled 14 months out and rescuing itself via Node's silent 1ms clamp) was identified and fixed within an hour of the symptomatic cost spike, despite the underlying bug having existed unnoticed for a long time. The clamp wrapper not only stopped the cost bleed — it gave us the exact line to fix.

Lockstep Checklist

> _You MUST check these boxes [x] and verify the corresponding files are updated BEFORE committing this log._

[x] API routes — N/A (config-only change, no new routes)
[x] TypeScript SDK — N/A
[x] Python SDK — N/A
[x] MCP tools — N/A
[x] Tests — src/auto-reply/heartbeat.test.ts (2 tests: regression cap + 10-minute exact value)
[x] Docs — this ship-log entry
[x] Verification — local test suite green; will verify in production by sampling the timer-clamp diagnostic file post-deploy and confirming no intervalMs: 36000000000 log line on heartbeat boot

Why The Old Comment Is Misleading

The file src/auto-reply/heartbeat.ts opens with // Stub: heartbeat not needed in service orchestration. This comment is wrong — the heartbeat IS active in production (we observed it firing in the gateway boot logs). The comment was probably true at some prior simplification but the constant kept getting pulled in. Future cleanup: either remove the comment or actually disable heartbeats by default if that's the intent. Out of scope for this PR.