Sentinel: 60-second deep completion probes + CloudWatch alarms

2026-04-10

router, observability

LOCKSTEP TRACEABILITY MATRIX --- api_endpoints: ["none"] sdk_methods_updated: ["none"] mcp_tools_updated: ["none"] ---

What We Built

The endpoint sentinel now probes providers every 60 seconds instead of every 15 minutes, and each probe follows the metadata check with a real POST /v1/chat/completions (or provider equivalent) capped at max_tokens=1. Per-provider health is published to CloudWatch (BrainstormRouter/Sentinel namespace) with two new alarms, br-prod-provider-down and br-prod-provider-auth-error, that alert on-call when any provider has been down or failing authentication for at least 2 consecutive minutes.

Four new fortress tests (fortress-sentinel.e2e.test.ts) verify in production that (a) the sentinel loop is running, (b) every provider has a recent probe timestamp, (c) deep-probe latencies are consistent with real round-trips, and (d) the cross-check between sentinel state and circuit-breaker state is internally consistent.

Why It Matters

This closes the highest-rated remaining in-code risk from the Round 4 stochastic assessment, which 3 of 10 independent assessment agents called out. The historical concern was that sentinel only probed /models (a metadata read), so a provider with a healthy catalog but a broken completions endpoint would silently degrade user traffic before anyone noticed. Deep completion probes were added earlier but ran every 15 minutes, which left a detection gap of up to 15 minutes (about 7.5 minutes on average), and there was no CloudWatch alarm wired to the signal.

How It Works

Probe loop (every 60s, per provider):

executeProbe() → GET /models (metadata)
If metadata healthy AND deepProbeEnabled, executeDeepProbe() sends a real max_tokens=1 completion
If >50% of a provider's circuit breakers are open from real-traffic failures, override probe result to degraded/down
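The decision order above can be sketched as a small pure function. This is a minimal illustration, not the sentinel's actual code: the `Health` type, the `ProbeInputs` shape, and `resolveHealth` are hypothetical names, the deep probe is modelled as a synchronous callback for brevity, and the choice to downgrade a healthy result to `degraded` (rather than `down`) on a breaker override is an assumption, since the write-up only says "degraded/down".

```typescript
// Hypothetical types and names. Only executeProbe, executeDeepProbe and
// deepProbeEnabled appear in the write-up; their real signatures may differ.
type Health = "healthy" | "degraded" | "down";

interface ProbeInputs {
  metadataHealthy: boolean;   // outcome of the GET /models metadata check
  deepProbeEnabled: boolean;  // whether the max_tokens=1 completion runs
  deepProbe: () => Health;    // stand-in for executeDeepProbe()
  openBreakers: number;       // breakers opened by real-traffic failures
  totalBreakers: number;
}

function resolveHealth(p: ProbeInputs): Health {
  // Assumption: a failed metadata check is reported as "down".
  let health: Health = "down";
  if (p.metadataHealthy) {
    // The deep probe only runs when the metadata check passed.
    health = p.deepProbeEnabled ? p.deepProbe() : "healthy";
  }
  // Real-traffic override: strictly more than half of the breakers open.
  const openRatio = p.totalBreakers > 0 ? p.openBreakers / p.totalBreakers : 0;
  if (openRatio > 0.5 && health === "healthy") {
    health = "degraded"; // assumption: downgrade one level, never upgrade
  }
  return health;
}
```

Keeping the override last means a clean probe cannot mask failures that real traffic is already seeing.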

Metric publisher (every 60s):

sentinel-metrics.ts reads sentinel.getProviderStatus() and emits four gauges to CloudWatch:

ProvidersHealthy
ProvidersDegraded
ProvidersDown
ProvidersAuthError
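As an illustration of how the four gauges could be derived, the sketch below counts providers by state and returns one datum per gauge. The return shape of sentinel.getProviderStatus() is not given in this write-up, so the `Record<string, ProviderHealth>` input, the state names, and `buildGauges` are assumptions; the field names `MetricName`, `Value`, and `Unit` mirror CloudWatch's metric datum fields, but no AWS call is made here.

```typescript
// Hypothetical state names and input shape; the real status object may differ.
type ProviderHealth = "healthy" | "degraded" | "down" | "auth_error";

interface Gauge {
  MetricName: string;
  Value: number;
  Unit: "Count";
}

function buildGauges(status: Record<string, ProviderHealth>): Gauge[] {
  const counts: Record<ProviderHealth, number> = {
    healthy: 0,
    degraded: 0,
    down: 0,
    auth_error: 0,
  };
  // Tally each provider under its current state.
  for (const provider of Object.keys(status)) {
    counts[status[provider]] += 1;
  }
  // Always emit all four gauges, including zeros, so each metric gets a
  // datapoint on every publish cycle.
  return [
    { MetricName: "ProvidersHealthy", Value: counts.healthy, Unit: "Count" },
    { MetricName: "ProvidersDegraded", Value: counts.degraded, Unit: "Count" },
    { MetricName: "ProvidersDown", Value: counts.down, Unit: "Count" },
    { MetricName: "ProvidersAuthError", Value: counts.auth_error, Unit: "Count" },
  ];
}
```

Publishing explicit zeros, rather than omitting empty states, keeps a "greater than 0" alarm from evaluating against missing data.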

Alarms (2+ consecutive minutes):

br-prod-provider-down — ProvidersDown > 0
br-prod-provider-auth-error — ProvidersAuthError > 0

Both alarms publish to the existing AlertsTopic SNS topic, which routes to the Slack webhook subscription.
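The "2+ consecutive minutes" rule can be modelled as a threshold check over the most recent one-minute datapoints. The sketch below is an approximation under stated assumptions: the `AlarmSpec` shape and `isInAlarm` are hypothetical, and the 60-second period with two evaluation periods is inferred from the description above rather than copied from the deployed alarm definitions.

```typescript
// Hypothetical alarm description; the deployed alarm settings are not shown
// in this write-up and may differ.
interface AlarmSpec {
  name: string;
  metric: string;
  threshold: number;          // alarm when the metric is strictly greater
  periodSeconds: number;      // one datapoint per period
  evaluationPeriods: number;  // consecutive breaching datapoints required
}

const alarms: AlarmSpec[] = [
  { name: "br-prod-provider-down", metric: "ProvidersDown",
    threshold: 0, periodSeconds: 60, evaluationPeriods: 2 },
  { name: "br-prod-provider-auth-error", metric: "ProvidersAuthError",
    threshold: 0, periodSeconds: 60, evaluationPeriods: 2 },
];

// datapoints is ordered oldest to newest, one value per period.
function isInAlarm(spec: AlarmSpec, datapoints: number[]): boolean {
  if (datapoints.length < spec.evaluationPeriods) {
    return false; // not enough history to decide
  }
  const recent = datapoints.slice(-spec.evaluationPeriods);
  return recent.every((value) => value > spec.threshold);
}
```

Requiring two breaching datapoints trades roughly one extra minute of latency for fewer alerts on a single failed probe.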

The Numbers

| Dimension | Before | After |
| --- | --- | --- |
| Probe interval | 900 s (15 min) | 60 s |
| Deep probe default | off (`false`) | on (`true`) |
| Mean detection window | ~7.5 min | ~1 min |
| CloudWatch alarms on sentinel | 0 | 2 |
| Fortress tests on sentinel | 0 | 4 |
| Probe cost (7 providers, 24h) | negligible | ~$0.01/day |
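The interval figures can be checked with simple arithmetic. The sketch below assumes an outage starts at a uniformly random point within a probe interval, so the mean wait for the next probe is half the interval; `probeStats` is a hypothetical helper. It covers only the wait for the next probe, so it reproduces the ~7.5 min "before" figure but gives a lower bound for the "after" case, where the table's ~1 min is the source's own number.

```typescript
// Hypothetical helper for back-of-envelope checks on the table above.
function probeStats(intervalSeconds: number, providers: number) {
  const cyclesPerProviderPerDay = 86400 / intervalSeconds;
  return {
    probeCyclesPerDay: cyclesPerProviderPerDay * providers,
    meanWaitSeconds: intervalSeconds / 2, // uniform outage start assumed
    worstWaitSeconds: intervalSeconds,    // outage begins right after a probe
  };
}

const before = probeStats(900, 7); // 672 cycles/day, mean wait 450 s (7.5 min)
const after = probeStats(60, 7);   // 10080 cycles/day, mean wait 30 s
```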

Competitive Edge

Portkey, Bifrost, and OpenRouter rely on circuit breakers that only learn about outages after user traffic absorbs them. BrainstormRouter actively probes completions paths out-of-band every 60 seconds, so the breaker half-opens with fresh data and the oncall alarm fires before the first customer complaint. Combined with the closed-loop learned routing, this is the only gateway in the category that both detects outages proactively AND re-learns routing preferences from the same signal.

Lockstep Checklist

> _No API routes, SDK methods, or MCP tools changed. Purely infra + observability._

[x] API Routes: N/A — no new or changed routes.
[x] TS SDK: N/A — no API surface change.
[x] Python SDK: N/A — no API surface change.
[x] MCP Schemas: N/A — no tool change.
[x] Stale memory updated: feedback_sentinel_broken.md now marked FIXED with current file:line references.