Provider Health Alerts — Automated Outage Detection Across 247 Endpoints

2026-03-08

endpoint-sentinel, model-auto-discovery, platform-event-bus, notifications

LOCKSTEP TRACEABILITY MATRIX
---
api_endpoints: ["none"]
sdk_methods_updated: ["none"]
mcp_tools_updated: ["none"]
---

What We Built

BrainstormRouter now automatically detects and alerts on provider outages, authentication failures, and model deprecations across all 247 endpoints. The EndpointSentinel — which already probes providers every 30 seconds — now emits structured events through the platform event bus when a provider goes down, recovers, or removes models from its catalog.

When 4 of 7 configured providers were simultaneously failing in production, the operator had no automated way to know. The sentinel was detecting failures and opening circuit breakers, but no notification reached the human. This feature closes that gap: provider status changes flow through the same event bus that handles tenant lifecycle, reaching both Slack (Block Kit alerts with action buttons) and email (admin notification).

Model deprecation detection works alongside auto-discovery. When a model ID disappears from a provider's /models response for 3 consecutive probes (~90 seconds), the system force-opens circuit breakers and unregisters dynamically-discovered endpoints. Config and catalog endpoints are never auto-removed — only discovered endpoints are evictable.
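
The eviction rule can be sketched as a filter over the endpoint registry. This is a minimal illustration under stated assumptions, not the router's actual code: the `EndpointOrigin` type, the `RegisteredEndpoint` shape, and the `evictDeprecated` function are hypothetical names invented for the example, and the circuit-breaker step is omitted.

```typescript
// Hypothetical types. The post only states that config and catalog endpoints
// are never auto-removed, while discovered endpoints are evictable.
type EndpointOrigin = "config" | "catalog" | "discovered";

interface RegisteredEndpoint {
  modelId: string;
  origin: EndpointOrigin;
}

// Returns the registry without deprecated, dynamically discovered endpoints.
// Endpoints with any other origin are kept even if their model is deprecated.
function evictDeprecated(
  registry: RegisteredEndpoint[],
  deprecatedModelIds: ReadonlySet<string>,
): RegisteredEndpoint[] {
  return registry.filter(
    (ep) => !(ep.origin === "discovered" && deprecatedModelIds.has(ep.modelId)),
  );
}

// Placeholder model IDs, not real provider models.
const registry: RegisteredEndpoint[] = [
  { modelId: "model-a", origin: "config" },
  { modelId: "model-a", origin: "discovered" },
  { modelId: "model-b", origin: "discovered" },
];
const remaining = evictDeprecated(registry, new Set(["model-a"]));
```

In this example the config-pinned `model-a` entry survives while the discovered `model-a` entry is dropped, which mirrors the stated policy of keeping pinned models stable.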

Why It Matters

Multi-provider routing is BrainstormRouter's core value prop, but it's only valuable if operators know when providers degrade. Without automated alerting, a silent outage on one provider wastes budget routing requests into failures (even with circuit breakers, the first few requests per window still fail). With health alerts, operators get a Slack notification within 60 seconds of an outage and can take action — or let the circuit breaker handle it automatically with full visibility.

Model deprecation is a slow-burn problem: providers regularly sunset model IDs (Google gemini-2.0-flash → gemini-2.5-flash, OpenAI gpt-4-turbo → gpt-4.1). Without detection, the router keeps routing to dead models. With this feature, stale models are automatically purged from the dynamic catalog while config-pinned models stay stable.

How It Works

Composite health identity: The sentinel now keys health probes by ${provider}|${baseUrl} instead of bare provider ID. This means openai|https://api.openai.com/v1 and openai|https://proxy.example.com/v1 are tracked independently — a proxy outage doesn't mask the main API's health.
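
A small sketch of the composite key idea follows. The `healthKey` helper, the `ProbeStatus` type, and the `health` map are hypothetical names for illustration; only the `${provider}|${baseUrl}` key format comes from the description above.

```typescript
// Hypothetical helper: builds a key in the `${provider}|${baseUrl}` form.
function healthKey(provider: string, baseUrl: string): string {
  return `${provider}|${baseUrl}`;
}

type ProbeStatus = "healthy" | "down" | "auth_error";

// One entry per (provider, base URL) pair instead of one per provider.
const health = new Map<string, ProbeStatus>();
health.set(healthKey("openai", "https://api.openai.com/v1"), "healthy");
health.set(healthKey("openai", "https://proxy.example.com/v1"), "down");

// The same provider ID now resolves to two independent health records.
const directStatus = health.get(healthKey("openai", "https://api.openai.com/v1"));
const proxyStatus = health.get(healthKey("openai", "https://proxy.example.com/v1"));
```

A production version would probably need to normalize base URLs (for example, trailing slashes) so that equivalent URLs share a key. Whether the sentinel does this is not stated here.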

Status change detection with debounce: Two consecutive probe failures are required before emitting a provider.down or provider.auth_error event (prevents flapping on transient 500s). Recovery events emit immediately — you want to know ASAP when a provider returns.

// In endpoint-sentinel.ts — detectStatusChange()
if (current === "healthy" && (prev === "down" || prev === "auth_error")) {
  // Recovery: emit immediately
  this.onStatusChange({ providerId, baseUrl, previous: prev, current, ... });
} else if (failures >= 2 && prev !== current) {
  // Failure: debounced to 2 consecutive
  this.onStatusChange({ providerId, baseUrl, previous: prev, current, ... });
}
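
For readers who want to experiment with the debounce rule in isolation, here is a self-contained sketch. It is not the sentinel's implementation: `StatusDebouncer` and `StatusChange` are hypothetical names, and the sketch assumes that `prev` means the last announced status, that unknown keys start as healthy, and that any non-healthy probe extends the failure streak.

```typescript
type Status = "healthy" | "down" | "auth_error";

interface StatusChange {
  key: string;
  previous: Status;
  current: Status;
}

class StatusDebouncer {
  // Last status announced per composite key, and the current failure streak.
  private announced = new Map<string, Status>();
  private failures = new Map<string, number>();
  private readonly threshold: number;

  constructor(threshold = 2) {
    this.threshold = threshold;
  }

  // Feed one probe result; returns a change event, or null if nothing is emitted.
  observe(key: string, current: Status): StatusChange | null {
    const prev = this.announced.get(key) ?? "healthy"; // assumed initial state
    if (current === "healthy") {
      this.failures.set(key, 0);
      if (prev === "healthy") return null;
      this.announced.set(key, current); // recovery: emit immediately
      return { key, previous: prev, current };
    }
    const streak = (this.failures.get(key) ?? 0) + 1;
    this.failures.set(key, streak);
    if (streak >= this.threshold && prev !== current) {
      this.announced.set(key, current); // failure: emit once the streak is long enough
      return { key, previous: prev, current };
    }
    return null;
  }
}

const debouncer = new StatusDebouncer();
const key = "openai|https://api.openai.com/v1";
const events = [
  debouncer.observe(key, "down"),    // first failure: suppressed
  debouncer.observe(key, "down"),    // second failure: healthy -> down is emitted
  debouncer.observe(key, "down"),    // still down: nothing new to report
  debouncer.observe(key, "healthy"), // recovery: emitted on the first healthy probe
];
```

Tracking the last announced status rather than the last probe result is what keeps a long outage from producing repeated alerts.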

Missing-models callback: After each healthy probe, the sentinel compares discovered model IDs against the previously-known set for that composite key. Models that disappeared are reported to ModelAutoDiscovery.handleMissingModels(), which tracks a per-model miss count and deprecates after 3 misses.
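
The set comparison and miss counting can be sketched as below. `missingModels` and `MissTracker` are hypothetical names, not the actual `ModelAutoDiscovery` API. The sketch assumes that the known set keeps a model until it is deprecated, and that a model reappearing in a probe resets its miss count, which is one reading of "3 consecutive probes".

```typescript
// Returns the IDs present in `known` but absent from `seen`.
function missingModels(known: ReadonlySet<string>, seen: ReadonlySet<string>): string[] {
  return Array.from(known).filter((id) => !seen.has(id));
}

class MissTracker {
  private misses = new Map<string, number>();
  private readonly limit: number;

  constructor(limit = 3) {
    this.limit = limit;
  }

  // Records one healthy probe; returns the models that just reached the limit.
  record(known: ReadonlySet<string>, seen: ReadonlySet<string>): string[] {
    // Assumption: a model that shows up again starts over at zero misses.
    seen.forEach((id) => this.misses.delete(id));

    const deprecated: string[] = [];
    for (const id of missingModels(known, seen)) {
      const count = (this.misses.get(id) ?? 0) + 1;
      this.misses.set(id, count);
      if (count === this.limit) deprecated.push(id); // report once, on the limit-th miss
    }
    return deprecated;
  }
}

// Placeholder model IDs, not real provider models.
const tracker = new MissTracker();
const known = new Set(["model-a", "model-b"]);
const seen = new Set(["model-a"]);
const first = tracker.record(known, seen);  // model-b missed once
const second = tracker.record(known, seen); // model-b missed twice
const third = tracker.record(known, seen);  // model-b reaches the limit
```

Requiring several misses guards against a single truncated or partial /models response evicting a model that still exists.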

Lazy event bus resolution: The sentinel callbacks use getEventBus() at emit time (not at init time) because initModelRouter() runs before initNotifications() in the boot sequence. Failure events require two consecutive failed probes, so the earliest one fires about 60 seconds after boot, well after the event bus is initialized.
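
The difference between resolving the bus at init time and at emit time can be shown with a short sketch. Everything here is a stand-in: the `EventBus` interface, the module-level `bus` variable, and `makeStatusCallback` are hypothetical, and the real `getEventBus()` and `initNotifications()` signatures may differ. Dropping events when no bus is registered is also an assumption.

```typescript
interface EventBus {
  emit(event: { type: string }): void;
}

// Hypothetical module-level registry standing in for the real event bus lookup.
let bus: EventBus | undefined;

function getEventBus(): EventBus | undefined {
  return bus;
}

function initNotifications(instance: EventBus): void {
  bus = instance;
}

// Created while the router initializes, before any bus exists. The lookup
// happens inside the callback, so it sees whatever is registered at emit time.
function makeStatusCallback(): (type: string) => boolean {
  return (type) => {
    const current = getEventBus();
    if (!current) return false; // assumed behavior: drop the event if no bus yet
    current.emit({ type });
    return true;
  };
}

const received: string[] = [];
const onStatusChange = makeStatusCallback();       // router init phase
const early = onStatusChange("provider.down");     // no bus registered yet
initNotifications({ emit: (e) => { received.push(e.type); } }); // later in boot
const late = onStatusChange("provider.down");      // bus is now available
```

Capturing the bus in a variable when the callback is created would have frozen `undefined` into the closure, which is the failure mode that lazy resolution avoids.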

PlatformEvent discriminated union: Extended from 4 tenant event types to 9 types including provider.down, provider.recovered, provider.auth_error, model.discovered, and model.deprecated. Existing tenant event emitters and subscribers are unchanged — the union is backward compatible.
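
A sketch of how such a union might look is given below. The five provider and model `type` strings come from the list above. Everything else is assumed: the payload fields are illustrative, and `TenantEvent` is a single placeholder for the four existing tenant event types, which this post does not name.

```typescript
// Placeholder for the four existing tenant event types (not named in this post).
type TenantEvent = { type: "tenant.placeholder"; tenantId: string };

// Payload fields below are illustrative assumptions.
type ProviderEvent =
  | { type: "provider.down"; providerId: string; baseUrl: string }
  | { type: "provider.recovered"; providerId: string; baseUrl: string }
  | { type: "provider.auth_error"; providerId: string; baseUrl: string };

type ModelEvent =
  | { type: "model.discovered"; providerId: string; modelId: string }
  | { type: "model.deprecated"; providerId: string; modelId: string };

type PlatformEvent = TenantEvent | ProviderEvent | ModelEvent;

// The `type` field narrows the union, so each branch sees only its own fields.
function summarize(event: PlatformEvent): string {
  switch (event.type) {
    case "provider.down":
    case "provider.auth_error":
      return `${event.providerId} at ${event.baseUrl} is failing (${event.type})`;
    case "provider.recovered":
      return `${event.providerId} at ${event.baseUrl} recovered`;
    case "model.discovered":
    case "model.deprecated":
      return `${event.modelId} on ${event.providerId}: ${event.type}`;
    case "tenant.placeholder":
      return `tenant event for ${event.tenantId}`;
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a member is unhandled.
      const unreachable: never = event;
      return unreachable;
    }
  }
}

const summary = summarize({
  type: "provider.recovered",
  providerId: "openai",
  baseUrl: "https://api.openai.com/v1",
});
```

Adding members to a union does not affect subscribers that only match the tenant types they already know, which is consistent with the backward-compatibility claim. Subscribers that use an exhaustive `never` check like the one above would need new cases.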

The Numbers

247 endpoints across 3 providers (Anthropic, OpenAI, Google) monitored every 30s
60-second detection for provider outages (2 × 30s probe interval)
Immediate recovery notification (no debounce on the transition back to healthy)
90-second deprecation detection for removed models (3 × 30s)
0 new API endpoints — pure infrastructure, no SDK changes required
567 lines added across 10 files, 45 tests passing

Competitive Edge

Portkey and OpenRouter provide basic health monitoring, but neither detects model deprecation or emits structured events for downstream automation. BrainstormRouter's sentinel doesn't just open circuit breakers — it feeds the platform event bus, enabling internal agents (like the Growth Agent) to react to provider state changes. The composite key design means proxy failures are isolated from direct API health, which matters for enterprise deployments with custom proxy configurations.

Lockstep Checklist

> _You MUST check these boxes [x] and verify the corresponding files are updated BEFORE committing this log._

[x] API Routes: No API route changes — internal infrastructure only.
[x] TS SDK: No SDK changes required — no new endpoints.
[x] Python SDK: No SDK changes required — no new endpoints.
[x] MCP Schemas: No MCP changes required — no new endpoints.
[x] Master Record: Internal infrastructure — no new user-facing capability.