Sentinel: 60-second deep completion probes + CloudWatch alarms
2026-04-10
LOCKSTEP TRACEABILITY MATRIX --- api_endpoints: ["none"] sdk_methods_updated: ["none"] mcp_tools_updated: ["none"] ---
What We Built
The endpoint sentinel now probes providers every 60 seconds instead of every 15 minutes, and each probe sends a real POST /v1/chat/completions (or provider equivalent) with max_tokens=1 on top of the metadata check. Per-provider health is published to CloudWatch (BrainstormRouter/Sentinel namespace) with two new alarms — br-prod-provider-down and br-prod-provider-auth-error — that page oncall when a provider's completions path is unreachable for ≥2 consecutive minutes.
Four new fortress tests (fortress-sentinel.e2e.test.ts) verify in production that (a) the sentinel loop is running, (b) every provider has a recent probe timestamp, (c) deep-probe latencies are consistent with real round-trips, and (d) the cross-check between sentinel state and circuit-breaker state is internally consistent.
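Check (b) comes down to a freshness comparison. The sketch below is a minimal illustration, not the code in fortress-sentinel.e2e.test.ts: the `ProbeRecord` shape, the `staleProviders` helper, and the two-interval (120 s) tolerance are all assumptions made for the example.

```typescript
// Hypothetical record shape: one entry per provider with its last probe time.
interface ProbeRecord {
  provider: string;
  lastProbeAtMs: number; // epoch milliseconds of the most recent probe
}

// Returns the providers whose last probe is older than maxAgeMs.
// The 120 s default (two probe intervals) is an assumed tolerance.
function staleProviders(
  records: ProbeRecord[],
  nowMs: number,
  maxAgeMs: number = 120000,
): string[] {
  return records
    .filter((r) => nowMs - r.lastProbeAtMs > maxAgeMs)
    .map((r) => r.provider);
}
```

A test in this style would fetch the sentinel status, call the helper with `Date.now()`, and assert that the returned list is empty.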
Why It Matters
This closes the highest-rated remaining in-code risk from the Round 4 stochastic assessment — 3 of 10 independent assessment agents called it out. The historical concern was that sentinel only probed /models (a metadata read), so a provider with a healthy catalog but a broken completions endpoint would silently degrade user traffic before anyone noticed. Deep completion probes were added earlier but ran every 15 minutes, which left a multi-minute detection gap, and there was no CloudWatch alarm wired to the signal.
How It Works
Probe loop (every 60s, per provider):
- `executeProbe()` → `GET /models` (metadata)
- If metadata healthy AND `deepProbeEnabled`, `executeDeepProbe()` sends a real `max_tokens=1` completion
- If >50% of a provider's circuit breakers are open from real-traffic failures, override probe result to `degraded`/`down`
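The three steps above can be sketched as a single decision function. This is an illustration under stated assumptions rather than the sentinel's actual code: `probeProvider`, the `ProbeDeps` shape, and the synchronous callbacks are hypothetical (real probes would be async network calls), and the source does not say when the override picks `degraded` versus `down`, so the sketch assumes `down` only when every breaker is open.

```typescript
type Status = "healthy" | "degraded" | "down";

// Hypothetical dependency bundle; the callbacks stand in for the real probes.
interface ProbeDeps {
  executeProbe: () => boolean; // stands in for GET /models
  executeDeepProbe: () => boolean; // stands in for the max_tokens=1 completion
  deepProbeEnabled: boolean;
  breakers: { open: number; total: number };
}

function probeProvider(deps: ProbeDeps): Status {
  // Step 1: metadata probe.
  let status: Status = deps.executeProbe() ? "healthy" : "down";

  // Step 2: deep probe runs only when metadata is healthy and the flag is on.
  if (status === "healthy" && deps.deepProbeEnabled) {
    if (!deps.executeDeepProbe()) {
      status = "down";
    }
  }

  // Step 3: real-traffic override when more than 50% of breakers are open.
  const { open, total } = deps.breakers;
  if (total > 0 && open / total > 0.5) {
    if (open === total) {
      status = "down"; // assumption: all breakers open means down
    } else if (status === "healthy") {
      status = "degraded"; // assumption: partial breaker failure means degraded
    }
  }
  return status;
}
```

Keeping the override as the last step means a passing probe cannot mask failures that real traffic is already reporting.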
Metric publisher (every 60s):
sentinel-metrics.ts reads sentinel.getProviderStatus() and emits four gauges to CloudWatch:
- `ProvidersHealthy`
- `ProvidersDegraded`
- `ProvidersDown`
- `ProvidersAuthError`
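As a rough illustration of the publisher step, the sketch below folds per-provider statuses into the four gauges. The return shape of `sentinel.getProviderStatus()` is not given here, so `ProviderHealth`, the status string values, and `buildSentinelGauges` are assumptions. The datum fields mirror what CloudWatch `PutMetricData` expects, but no AWS SDK call is made.

```typescript
// Assumed status vocabulary; the real sentinel may use different values.
type ProviderState = "healthy" | "degraded" | "down" | "auth_error";

interface ProviderHealth {
  provider: string;
  state: ProviderState;
}

interface MetricDatum {
  MetricName: string;
  Value: number;
  Unit: "Count";
}

// Namespace taken from the text above.
const SENTINEL_NAMESPACE = "BrainstormRouter/Sentinel";

// Counts providers per state and returns one gauge datum per state.
function buildSentinelGauges(providers: ProviderHealth[]): MetricDatum[] {
  const counts: Record<ProviderState, number> = {
    healthy: 0,
    degraded: 0,
    down: 0,
    auth_error: 0,
  };
  for (const p of providers) {
    counts[p.state] += 1;
  }
  return [
    { MetricName: "ProvidersHealthy", Value: counts.healthy, Unit: "Count" },
    { MetricName: "ProvidersDegraded", Value: counts.degraded, Unit: "Count" },
    { MetricName: "ProvidersDown", Value: counts.down, Unit: "Count" },
    { MetricName: "ProvidersAuthError", Value: counts.auth_error, Unit: "Count" },
  ];
}
```

Emitting all four gauges every cycle, including zeros, keeps the alarm's datapoint stream continuous, so a missing datapoint can be read as a publisher problem rather than as "no providers down".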
Alarms (2+ consecutive minutes):
- `br-prod-provider-down`: `ProvidersDown > 0`
- `br-prod-provider-auth-error`: `ProvidersAuthError > 0`
Both alarms publish to the existing AlertsTopic SNS topic, which routes to the Slack webhook subscription.
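The "2+ consecutive minutes" rule can be modelled as a run-length check over one-minute datapoints. This is a simplified model for intuition only: `alarmWouldFire` is a hypothetical helper, and CloudWatch's real evaluation (missing-data handling, evaluation windows) is more involved.

```typescript
// Returns true once `periods` consecutive datapoints exceed `threshold`.
// Each array element represents one 1-minute gauge sample, oldest first.
function alarmWouldFire(
  datapoints: number[],
  threshold: number = 0,
  periods: number = 2,
): boolean {
  let run = 0;
  for (const value of datapoints) {
    run = value > threshold ? run + 1 : 0; // a healthy sample resets the run
    if (run >= periods) {
      return true;
    }
  }
  return false;
}
```

For example, a `ProvidersDown` series of `[0, 1, 0, 1]` never breaches for two consecutive minutes and stays quiet, while `[0, 1, 1]` would page. Requiring two datapoints trades roughly one extra minute of detection latency for resistance to a single failed probe.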
The Numbers
| Dimension | Before | After |
|---|---|---|
| Probe interval | 900 s (15 min) | 60 s |
| Deep probe default | false (opt-in) | true (enabled by default) |
| Mean detection window | ~7.5 min | ~1 min |
| CloudWatch alarms on sentinel | 0 | 2 |
| Fortress tests on sentinel | 0 | 4 |
| Probe cost (7 providers, 24h) | negligible | ~$0.01/day |
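The interval and cost rows follow from simple arithmetic, sketched below. The helper names are hypothetical, and the model assumes outages start at a uniformly random point within a probe interval, so the expected wait until the next probe is half the interval.

```typescript
const SECONDS_PER_DAY = 86400;

// Expected wait until the next probe if an outage starts at a random time.
function meanWaitSeconds(intervalSeconds: number): number {
  return intervalSeconds / 2;
}

// Total probes per day across all providers at a fixed interval.
function probesPerDay(providers: number, intervalSeconds: number): number {
  return providers * (SECONDS_PER_DAY / intervalSeconds);
}

// Before: 900 s interval -> 450 s (7.5 min) mean wait, 7 * 96 = 672 probes/day.
const beforeMeanWait = meanWaitSeconds(900);
const beforeProbes = probesPerDay(7, 900);

// After: 60 s interval -> 30 s mean wait, 7 * 1440 = 10080 probes/day.
const afterMeanWait = meanWaitSeconds(60);
const afterProbes = probesPerDay(7, 60);

// Implied cost per probe at about $0.01/day: roughly one millionth of a dollar.
const impliedCostPerProbe = 0.01 / afterProbes;
```

Under this model the 900 s interval gives the table's ~7.5 min mean. The 60 s interval gives a 30 s mean and a 60 s worst case, so the ~1 min figure in the table is consistent with the worst case.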
Competitive Edge
Portkey, Bifrost, and OpenRouter rely on circuit breakers that only learn about outages after user traffic absorbs them. BrainstormRouter actively probes completions paths out-of-band every 60 seconds, so the breaker half-opens with fresh data and the oncall alarm fires before the first customer complaint. Combined with the closed-loop learned routing, this is the only gateway in the category that both detects outages proactively AND re-learns routing preferences from the same signal.
Lockstep Checklist
> _No API routes, SDK methods, or MCP tools changed. Purely infra + observability._
- [x] API Routes: N/A — no new or changed routes.
- [x] TS SDK: N/A — no API surface change.
- [x] Python SDK: N/A — no API surface change.
- [x] MCP Schemas: N/A — no tool change.
- [x] Stale memory updated: `feedback_sentinel_broken.md` now marked FIXED with current file:line references.