Notification cooldown — move from emit-time bus gate to per-subscriber decoration

2026-05-09

notifications, infra

LOCKSTEP TRACEABILITY MATRIX --- api_endpoints: ["none — internal infra refactor"] sdk_methods_updated: ["none — no public surface change"] mcp_tools_updated: ["none — no new tools"] ---

What We Built

PlatformEventBus.emit() previously gated every subscriber behind a 15-minute cooldown held as private state on the bus itself. The cooldown check ran before fan-out, so a single suppression decision applied identically to all registered subscribers (Slack, Email, Webhook delivery). That was correct for user-facing notification channels, where nobody wants a second Slack or Email alert for the same flapping provider, but it baked a delivery concern into the bus and blocked the XDR thread from registering security-event subscribers that need real-time fan-out without channel-level dedup.

This PR extracts the cooldown into a per-subscriber decorator (src/infra/notifications/cooldown-decorator.ts) and strips the bus back to a pure Promise.allSettled fan-out. The three existing subscribers (Slack, Email, Webhook) are wrapped at registration time in src/api/server.ts:initNotifications with withCooldown(...), preserving their existing dedup behavior verbatim. The default cooldown key extractor (defaultCooldownKey) replays the exact keying logic that lived on the bus: provider events are keyed by providerId, and intelligence events are keyed by the narrowest subject identifier (tenantId:keyId, tenantId:agentId, serviceId). For tenant.* and model.discovered events it returns undefined, so those always fire.
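To make the keying rules above concrete, here is a minimal sketch of a key extractor in the spirit of defaultCooldownKey. It is not the code from this PR: the SketchEvent shape, the example event names, the choice to prefix keys with the event type, and the fall-through to undefined are all assumptions for illustration.

```typescript
// Hypothetical event shape for illustration; the real PlatformEvent type is not shown here.
interface SketchEvent {
  type: string; // e.g. "provider.degraded" (event names are assumptions)
  providerId?: string;
  tenantId?: string;
  keyId?: string;
  agentId?: string;
  serviceId?: string;
}

// Returns a dedup key, or undefined when the event should always fire.
function sketchCooldownKey(event: SketchEvent): string | undefined {
  // tenant.* and model.discovered are never cooldown-gated.
  if (event.type.startsWith("tenant.") || event.type === "model.discovered") {
    return undefined;
  }
  // Provider events dedupe per provider.
  // Assumption: the event type is part of the key so distinct event types do not collide.
  if (event.providerId !== undefined) {
    return `${event.type}:${event.providerId}`;
  }
  // Other events use the narrowest subject identifier available.
  if (event.tenantId !== undefined && event.keyId !== undefined) {
    return `${event.type}:${event.tenantId}:${event.keyId}`;
  }
  if (event.tenantId !== undefined && event.agentId !== undefined) {
    return `${event.type}:${event.tenantId}:${event.agentId}`;
  }
  if (event.serviceId !== undefined) {
    return `${event.type}:${event.serviceId}`;
  }
  // Assumption: events with no recognizable subject are not suppressed.
  return undefined;
}
```

A subscriber that needs different keying can pass its own getKey instead; the custom getKey override is one of the cases covered by the decorator tests.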

Why It Matters

The bus-level cooldown was a delivery concern wearing infrastructure clothes. Moving it to the right layer has three downstream consequences:

  1. XDR thread is unblocked. When X-2 adds 32 new security event types and X-5 registers SIEM/XDR adapters as subscribers, those adapters must NOT be cooldown-gated, because security events need real-time delivery to be useful. Per-subscriber decoration lets each new subscriber choose its own dedup posture without forking the bus or special-casing event types in getCooldownKey().

  2. Subscriber isolation improves. The previous shared Map meant a subscriber that took 30s to deliver to Slack would still mark the cooldown for Email, even though Email had nothing to do with the Slack delivery. With per-subscriber maps, each channel maintains independent state. Slack failures don't suppress Email delivery for the same key.

  3. The bus is now a primitive. Sub-300 LOC, no private state beyond subscribers: Map, no behavior beyond subscribe() and emit(). Future event-bus consumers (internal agents, audit chain) can rely on simple fan-out semantics without reasoning about hidden suppression.
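For readers who want a picture of what "a primitive" means here, the following is a rough sketch of a bus with only subscribe() and emit() and a Promise.allSettled fan-out. SketchBus, BusHandler, the "*" wildcard handling, and the unsubscribe return value are illustrative assumptions, not the actual PlatformEventBus API.

```typescript
type BusHandler<E> = (event: E) => Promise<void>;

// Minimal fan-out bus: the only state is the subscriber map.
class SketchBus<E extends { type: string }> {
  private readonly subscribers = new Map<string, Set<BusHandler<E>>>();

  // Registers a handler for an exact event type or "*" (all events).
  // Returns an unsubscribe function (an assumption of this sketch).
  subscribe(pattern: string, handler: BusHandler<E>): () => void {
    const set = this.subscribers.get(pattern) ?? new Set<BusHandler<E>>();
    set.add(handler);
    this.subscribers.set(pattern, set);
    return () => {
      set.delete(handler);
    };
  }

  // Delivers to every matching handler; one handler's failure never
  // prevents delivery to its siblings.
  async emit(event: E): Promise<PromiseSettledResult<void>[]> {
    const handlers = [
      ...(this.subscribers.get(event.type) ?? []),
      ...(this.subscribers.get("*") ?? []),
    ];
    // The async wrapper turns synchronous throws into rejections as well.
    return Promise.allSettled(handlers.map(async (h) => h(event)));
  }
}
```

Nothing in this sketch knows about cooldowns, which is the point: suppression is something a subscriber opts into by wrapping itself before calling subscribe().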

How It Works

withCooldown(handler, opts) is a higher-order function. Each call returns a fresh subscriber with its own Map keyed by event-derived strings:

import { withCooldown } from "../infra/notifications/cooldown-decorator.js";
import { createSlackSubscriber } from "../infra/notifications/slack.js";

bus.subscribe(
  "*",
  withCooldown(createSlackSubscriber({ webhookUrl }), {
    windowMs: 15 * 60 * 1000, // override; defaults to DEFAULT_COOLDOWN_MS
    label: "slack",
  }),
);

Each invocation:

  1. Computes the cooldown key via getKey(event) (default: defaultCooldownKey). An undefined key means no cooldown for this event.
  2. Looks up lastSeen.get(key). If the last delivery is within windowMs, logs and suppresses the event.
  3. Otherwise records Date.now() and bounds the map at maxKeys (default 1000) by evicting the oldest insertion.
  4. Awaits the wrapped handler. Errors propagate to the bus, where Promise.allSettled already isolates them from sibling subscribers.
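The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of cooldown-decorator.ts: sketchWithCooldown and its option types are hypothetical names, getKey is required here (the real decorator defaults it to defaultCooldownKey), and the now and log options are injected only to keep the sketch testable.

```typescript
type CooldownHandler<E> = (event: E) => Promise<void>;

interface SketchCooldownOptions<E> {
  windowMs: number;
  getKey: (event: E) => string | undefined;
  maxKeys?: number; // upper bound on tracked keys
  label?: string; // used only in the suppression log line
  now?: () => number; // assumption: injectable clock for tests
  log?: (message: string) => void; // assumption: injectable logger
}

function sketchWithCooldown<E>(
  handler: CooldownHandler<E>,
  opts: SketchCooldownOptions<E>,
): CooldownHandler<E> {
  const { windowMs, getKey, maxKeys = 1000, label = "subscriber" } = opts;
  const now = opts.now ?? (() => Date.now());
  // Fresh state per call, so each wrapped subscriber is isolated.
  const lastSeen = new Map<string, number>();

  return async (event: E): Promise<void> => {
    // A non-positive window disables the cooldown entirely.
    if (windowMs <= 0) return handler(event);

    // Step 1: an undefined key means "always fire".
    const key = getKey(event);
    if (key === undefined) return handler(event);

    // Step 2: suppress if this key fired within the window.
    const timestamp = now();
    const previous = lastSeen.get(key);
    if (previous !== undefined && timestamp - previous < windowMs) {
      opts.log?.(`[${label}] suppressed ${key}`);
      return;
    }

    // Step 3: record the delivery time. Deleting first moves a refreshed
    // key to the end of the Map's insertion order.
    lastSeen.delete(key);
    lastSeen.set(key, timestamp);
    if (lastSeen.size > maxKeys) {
      const oldest = lastSeen.keys().next().value;
      if (oldest !== undefined) lastSeen.delete(oldest);
    }

    // Step 4: errors from the handler propagate to the caller.
    await handler(event);
  };
}
```

One consequence of recording the timestamp before awaiting the handler, as the steps describe, is that a failed delivery still starts the cooldown for that key on that subscriber. Whether the real decorator should instead record only on success is a design choice this sketch does not try to settle.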

The bus itself no longer carries cooldown state, no longer takes a cooldownMs constructor argument, and no longer reads BRAINSTORMROUTER_NOTIFICATION_COOLDOWN_MS. That env var is now read once in initNotifications() and passed as windowMs to each withCooldown call — a single env-var read instead of N constructor reads.
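As a sketch of the "read once, pass down" pattern described above: the helper below parses the env var a single time and returns the value that each withCooldown call would receive as windowMs. The helper name readCooldownMs, its fallback behavior for missing or malformed values, and the 15-minute value assigned to DEFAULT_COOLDOWN_MS are assumptions; the actual parsing in initNotifications() may differ.

```typescript
// Assumed to match the 15-minute default mentioned earlier.
const DEFAULT_COOLDOWN_MS = 15 * 60 * 1000;

// Takes the environment as a parameter so the function stays pure and testable.
function readCooldownMs(env: Record<string, string | undefined>): number {
  const raw = env.BRAINSTORMROUTER_NOTIFICATION_COOLDOWN_MS;
  if (raw === undefined || raw.trim() === "") return DEFAULT_COOLDOWN_MS;
  const parsed = Number(raw);
  // Assumption: malformed values fall back to the default rather than throwing.
  return Number.isFinite(parsed) ? parsed : DEFAULT_COOLDOWN_MS;
}
```

In initNotifications() this would be called once, for example as readCooldownMs(process.env), and the result passed as windowMs to the Slack, Email, and Webhook wrappers. A value of 0 passes through unchanged, which lines up with the windowMs ≤ 0 shortcut listed in the test plan.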

What This Doesn't Fix

  • Fairness across subscribers. If a slow subscriber holds up Promise.allSettled, fast subscribers still wait. That's a bus-level concern (queue depth / per-subscriber concurrency budget) for a separate PR.

  • Cross-subscriber dedup. Two distinct Slack workspace destinations registered as separate subscribers will each maintain independent cooldowns. If you want a single cooldown across both, share state externally (getKey can read from a shared store).

Test Plan

  • New: src/infra/notifications/__tests__/cooldown-decorator.test.ts, with 15 tests covering same-key suppression, different-key passthrough, window expiry (via Date.now() mock), getKey() === undefined bypass, windowMs ≤ 0 shortcut, custom getKey override, bounded map under uniqueness pressure (20 unique keys with maxKeys=5), per-subscriber state isolation, and error propagation to the caller.

  • Updated: src/infra/notifications/__tests__/intelligence-events.test.ts. The existing bus-level cooldown test was rewritten as an integration test that wraps a subscriber with withCooldown, validating that the refactor doesn't regress real-world dedup semantics.

  • Full unit suite: 7594 passed / 0 failed (832 test files).

Lockstep

  • TypeScript SDK: no change (internal infra)
  • Python SDK: no change
  • MCP tools: no change (no agent-facing surface)
  • API surface: no change (no routes affected)
  • site/public/routes.json: unchanged
  • OpenAPI: unchanged

This is a pure infrastructure refactor — by design it doesn't move any public surface. The lockstep matrix is empty by intent, not by oversight.