Sentinel: 60-second deep completion probes + CloudWatch alarms

2026-04-10

router, observability

LOCKSTEP TRACEABILITY MATRIX --- api_endpoints: ["none"] sdk_methods_updated: ["none"] mcp_tools_updated: ["none"] ---

What We Built

The endpoint sentinel now probes providers every 60 seconds instead of every 15 minutes, and each probe sends a real POST /v1/chat/completions (or provider equivalent) with max_tokens=1 on top of the metadata check. Per-provider health is published to CloudWatch (BrainstormRouter/Sentinel namespace) with two new alarms — br-prod-provider-down and br-prod-provider-auth-error — that page oncall when a provider's completions path is unreachable for ≥2 consecutive minutes.

Four new fortress tests (fortress-sentinel.e2e.test.ts) verify in production that (a) the sentinel loop is running, (b) every provider has a recent probe timestamp, (c) deep-probe latencies are consistent with real round-trips, and (d) the cross-check between sentinel state and circuit-breaker state is internally consistent.
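As a rough illustration of check (b), the staleness test can be thought of as a pure function over per-provider probe timestamps. This is a minimal sketch, not the contents of fortress-sentinel.e2e.test.ts: the `findStaleProviders` helper, the timestamp map shape, and the two-minute freshness budget are all assumptions made for the example.

```typescript
// Hypothetical shape: provider name -> epoch milliseconds of its last probe.
type LastProbeMap = Record<string, number>;

// Returns the providers whose last probe is older than maxAgeMs.
// A provider with a missing timestamp is treated as probed at epoch 0, i.e. stale.
function findStaleProviders(
  lastProbeAt: LastProbeMap,
  nowMs: number,
  maxAgeMs: number,
): string[] {
  return Object.keys(lastProbeAt)
    .filter((provider) => nowMs - (lastProbeAt[provider] ?? 0) > maxAgeMs)
    .sort();
}

// Example: with a 60 s probe interval, allow two missed cycles (assumed budget).
const exampleNow = 1000000;
const exampleStale = findStaleProviders(
  { alpha: exampleNow - 30000, beta: exampleNow - 200000 },
  exampleNow,
  120000,
);
console.log(exampleStale);
```

A production assertion would then simply expect the returned list to be empty.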

Why It Matters

This closes the highest-rated remaining in-code risk from the Round 4 stochastic assessment, which 3 of 10 independent assessment agents called out. The historical concern was that sentinel only probed /models (a metadata read), so a provider with a healthy catalog but a broken completions endpoint would silently degrade user traffic before anyone noticed. Deep completion probes were added earlier but ran only every 15 minutes, which left a mean detection window of roughly 7.5 minutes, and no CloudWatch alarm was wired to the signal.

How It Works

Probe loop (every 60s, per provider):

  1. executeProbe() → GET /models (metadata)
  2. If metadata healthy AND deepProbeEnabled, executeDeepProbe() sends a real max_tokens=1 completion
  3. If >50% of a provider's circuit breakers are open from real-traffic failures, override probe result to degraded/down
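
The three steps above can be sketched as a single per-provider function. This is a hedged illustration rather than the sentinel's actual code: the `Health` states, the `ProbeDeps` interface, and `openBreakerFraction` are hypothetical names, and because the post does not say how the override chooses between degraded and down, the sketch only ever downgrades a healthy result to degraded.

```typescript
// Hypothetical types; the real sentinel's shapes are not shown in this post.
type Health = "healthy" | "degraded" | "down" | "auth_error";

interface ProbeDeps {
  executeProbe: () => Promise<Health>;     // metadata check (GET /models)
  executeDeepProbe: () => Promise<Health>; // real max_tokens=1 completion
  openBreakerFraction: () => number;       // 0..1, share of open circuit breakers
  deepProbeEnabled: boolean;
}

async function probeOnce(deps: ProbeDeps): Promise<Health> {
  // Step 1: metadata probe.
  let result: Health = await deps.executeProbe();

  // Step 2: deep probe only when metadata is healthy and the flag is on.
  if (result === "healthy" && deps.deepProbeEnabled) {
    result = await deps.executeDeepProbe();
  }

  // Step 3: real-traffic override when more than half the breakers are open.
  // Assumption: a healthy probe result is downgraded to "degraded".
  if (result === "healthy" && deps.openBreakerFraction() > 0.5) {
    result = "degraded";
  }
  return result;
}
```

Injecting the probe functions keeps the decision logic testable without network access; the 60-second scheduling would sit outside this function.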

Metric publisher (every 60s):

sentinel-metrics.ts reads sentinel.getProviderStatus() and emits four gauges to CloudWatch:

  • ProvidersHealthy
  • ProvidersDegraded
  • ProvidersDown
  • ProvidersAuthError
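
A minimal sketch of how the four gauges could be derived from a status snapshot follows. The return shape of sentinel.getProviderStatus() is not shown in this post, so the `Record<string, Health>` input, the `Health` state names, and the `buildSentinelGauges` helper are assumptions; the datum fields mirror the MetricName/Value/Unit shape that CloudWatch's PutMetricData accepts, and the actual publish call is left out.

```typescript
// Hypothetical status values; only the four gauge names come from the post.
type Health = "healthy" | "degraded" | "down" | "auth_error";

interface GaugeDatum {
  MetricName: string;
  Value: number;
  Unit: "Count";
}

// Counts providers per state and maps each count to one gauge.
function buildSentinelGauges(status: Record<string, Health>): GaugeDatum[] {
  const counts: Record<Health, number> = {
    healthy: 0,
    degraded: 0,
    down: 0,
    auth_error: 0,
  };
  Object.keys(status).forEach((provider) => {
    const health = status[provider];
    if (health !== undefined) counts[health] += 1;
  });
  return [
    { MetricName: "ProvidersHealthy", Value: counts.healthy, Unit: "Count" },
    { MetricName: "ProvidersDegraded", Value: counts.degraded, Unit: "Count" },
    { MetricName: "ProvidersDown", Value: counts.down, Unit: "Count" },
    { MetricName: "ProvidersAuthError", Value: counts.auth_error, Unit: "Count" },
  ];
}
```

Emitting all four gauges every cycle, including zeros, means the alarms always see a datapoint rather than missing data.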

Alarms (2+ consecutive minutes):

  • br-prod-provider-down: ProvidersDown > 0
  • br-prod-provider-auth-error: ProvidersAuthError > 0

Both alarms publish to the existing AlertsTopic SNS topic, which routes to the Slack webhook subscription.
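
For readers who want to see what such an alarm looks like, the sketch below builds a definition using CloudWatch PutMetricAlarm field names. It is illustrative only: the alarm names, namespace, threshold, and two-period evaluation come from this post, while the `Maximum` statistic, the `sentinelAlarm` helper, and the placeholder topic ARN are assumptions, and the real alarms may well be declared in infrastructure templates instead.

```typescript
// Field names follow CloudWatch's PutMetricAlarm parameters.
interface SentinelAlarm {
  AlarmName: string;
  Namespace: string;
  MetricName: string;
  Statistic: "Maximum"; // assumption: any non-zero sample in the minute counts
  Period: number; // seconds per datapoint
  EvaluationPeriods: number; // consecutive datapoints that must breach
  Threshold: number;
  ComparisonOperator: "GreaterThanThreshold";
  AlarmActions: string[];
}

function sentinelAlarm(
  alarmName: string,
  metricName: string,
  alertsTopicArn: string,
): SentinelAlarm {
  return {
    AlarmName: alarmName,
    Namespace: "BrainstormRouter/Sentinel",
    MetricName: metricName,
    Statistic: "Maximum",
    Period: 60,
    EvaluationPeriods: 2, // "2+ consecutive minutes"
    Threshold: 0, // fires when the gauge is > 0
    ComparisonOperator: "GreaterThanThreshold",
    AlarmActions: [alertsTopicArn],
  };
}

// Placeholder ARN for illustration only; not the real AlertsTopic ARN.
const exampleTopicArn = "arn:aws:sns:us-east-1:000000000000:AlertsTopic";
const exampleAlarms = [
  sentinelAlarm("br-prod-provider-down", "ProvidersDown", exampleTopicArn),
  sentinelAlarm("br-prod-provider-auth-error", "ProvidersAuthError", exampleTopicArn),
];
```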

The Numbers

| Dimension | Before | After |
| --- | --- | --- |
| Probe interval | 900 s (15 min) | 60 s |
| Deep probe default | opt-in (false) | opt-in default true |
| Mean detection window | ~7.5 min | ~1 min |
| CloudWatch alarms on sentinel | 0 | 2 |
| Fortress tests on sentinel | 0 | 4 |
| Probe cost (7 providers, 24h) | negligible | ~$0.01/day |
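
The interval-driven figures can be reproduced with a little arithmetic. The sketch below assumes a failure begins at a uniformly random point in the probe cycle, so the expected wait until the next probe is half the interval; both helper names are hypothetical. Under that assumption the 900 s interval yields the 7.5 min quoted above, while the 60 s interval yields 30 s for the probe alone, so the ~1 min figure presumably includes rounding or downstream delay.

```typescript
const SECONDS_PER_DAY = 86400;

// Expected delay until the next probe, assuming failures start uniformly
// at random within a probe cycle.
function meanDetectionSeconds(intervalSeconds: number): number {
  return intervalSeconds / 2;
}

// Total probe cycles per day across all providers.
function probesPerDay(providers: number, intervalSeconds: number): number {
  return providers * (SECONDS_PER_DAY / intervalSeconds);
}

console.log(meanDetectionSeconds(900) / 60); // 7.5 (minutes)
console.log(meanDetectionSeconds(60));       // 30 (seconds)
console.log(probesPerDay(7, 900));           // 672
console.log(probesPerDay(7, 60));            // 10080
```

At roughly 10,080 probe cycles per day, the ~$0.01/day figure works out to about a millionth of a dollar per cycle.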

Competitive Edge

Portkey, Bifrost, and OpenRouter rely on circuit breakers that only learn about outages after user traffic absorbs them. BrainstormRouter actively probes completions paths out-of-band every 60 seconds, so the breaker half-opens with fresh data and the oncall alarm fires before the first customer complaint. Combined with the closed-loop learned routing, this is the only gateway in the category that both detects outages proactively AND re-learns routing preferences from the same signal.

Lockstep Checklist

> _No API routes, SDK methods, or MCP tools changed. Purely infra + observability._

  • [x] API Routes: N/A — no new or changed routes.
  • [x] TS SDK: N/A — no API surface change.
  • [x] Python SDK: N/A — no API surface change.
  • [x] MCP Schemas: N/A — no tool change.
  • [x] Stale memory updated: feedback_sentinel_broken.md now marked FIXED with current file:line references.