Sentinel recovery probe now bypasses model-list filter — closes the auto-recovery deadlock that PR #185 surfaced
2026-05-06
LOCKSTEP TRACEABILITY MATRIX
---
api_endpoints: ["none — internal sentinel logic"]
sdk_methods_updated: ["none"]
mcp_tools_updated: ["none"]
---
What We Built
EndpointSentinel.probeProvider no longer applies the "model not in discovery → skip endpoint" filter when the probe is a recovery probe (isRecoveryProbe === true). The filter still applies on every other healthy probe, so genuinely deprecated endpoints stay force-opened.
Why It Matters
PR #185 shipped the cross-task recovered: true pub/sub broadcast and verified that propagation works: during a live manual admin reset, the peer task cleared its local force-open within ~17ms of the publish. But auto-recovery was still not engaging in production. The DeepSeek loop kept firing every minute even after #185 deployed.
Investigation showed provider.recovered events were emitted by the debounce path (detectStatusChange's recovery branch) but zero Circuit reset events ever appeared. The per-endpoint loop body inside the recovery probe was hitting continue at the model-list filter and never reaching reset() at line 458.
DeepSeek's /v1/models returns model names that don't match the static catalog (deepseek-v4-flash returned for what the catalog calls deepseek-chat). The filter — meant to prevent auto-closing breakers for genuinely deprecated endpoints — was silently blocking the recovery clear.
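The mismatch can be reduced to a single set-membership check. The snippet below is a minimal sketch, not the sentinel's code: the two model IDs come from the incident, while the variable names are illustrative assumptions.

```typescript
// What the static catalog registers for the endpoint (from the incident).
const catalogModelId = "deepseek-chat";

// What discovery returned from /v1/models (from the incident).
const discoveredModelIds = new Set<string>(["deepseek-v4-flash"]);

// The same membership test the filter applies to endpoint.modelId.
const filterSkipsEndpoint = !discoveredModelIds.has(catalogModelId);

console.log(filterSkipsEndpoint); // true: the loop body hits `continue` before reset()
```

Because the catalog name never appears in the discovered set, every healthy probe skipped the endpoint, including the one probe that was supposed to clear the breaker.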
The result: the entire recovery state machine (PR #164 single-task + PR #185 cross-task) was correct in isolation, but a 5-line filter several layers up prevented either of them from firing in production. The provider stayed force-open until an operator intervened manually.
How It Works
// before
if (discoveredModelIds) {
  const endpoint = this.registry.get(endpointId);
  if (endpoint && !discoveredModelIds.has(endpoint.modelId)) {
    continue;
  }
}

// after
if (discoveredModelIds && !isRecoveryProbe) {
  const endpoint = this.registry.get(endpointId);
  if (endpoint && !discoveredModelIds.has(endpoint.modelId)) {
    continue;
  }
}
Why this is safe: the recovery probe only fires when previousStatus ∈ {down, auth_error, degraded} AND the current probe says healthy — i.e., the sentinel itself force-opened the breakers because the _whole provider_ failed, not because a specific model 404'd. The breakers we're about to clear are sentinel-force-opened, not traffic-force-opened. If a specific model is genuinely deprecated, the next real-traffic 404/403 on it will force-open it again immediately, so the worst case is one wasted request, not a sticky regression.
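The safety condition above can be written as two small predicates. This is a sketch under assumptions: the status names come from this write-up, but computeIsRecoveryProbe and shouldSkipReset are hypothetical helpers, and the real sentinel may derive isRecoveryProbe differently (for example, through its debounce path).

```typescript
// Status names taken from the write-up; the type itself is an assumption.
type ProviderStatus = "healthy" | "down" | "auth_error" | "degraded";

// Hypothetical helper: a probe counts as a recovery probe only when the
// provider was previously unhealthy and the current probe reports healthy.
function computeIsRecoveryProbe(
  previousStatus: ProviderStatus,
  currentStatus: ProviderStatus
): boolean {
  const wasUnhealthy =
    previousStatus === "down" ||
    previousStatus === "auth_error" ||
    previousStatus === "degraded";
  return wasUnhealthy && currentStatus === "healthy";
}

// Hypothetical helper mirroring the "after" condition: the model-list filter
// applies only when discovery data exists and this is not a recovery probe.
function shouldSkipReset(
  modelId: string,
  discoveredModelIds: Set<string> | undefined,
  isRecoveryProbe: boolean
): boolean {
  if (!discoveredModelIds || isRecoveryProbe) {
    return false;
  }
  return !discoveredModelIds.has(modelId);
}

// down -> healthy is a recovery probe, so the filter is bypassed.
const recovering = computeIsRecoveryProbe("down", "healthy");
console.log(shouldSkipReset("deepseek-chat", new Set(["deepseek-v4-flash"]), recovering)); // false

// healthy -> healthy is an ordinary probe, so the filter still applies.
const steady = computeIsRecoveryProbe("healthy", "healthy");
console.log(shouldSkipReset("deprecated-model", new Set(["gpt-4o"]), steady)); // true
```

Keeping the exception tied to the status transition, rather than to any healthy probe, is what prevents a deprecated endpoint from being auto-closed during steady-state operation.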
Two regression tests pinned:
- Recovery clears across name mismatch: registers deepseek::deepseek-chat, runs two failure probes (drives previousStatus = down), then a healthy probe whose /v1/models returns [{id: "deepseek-v4-flash"}]. Asserts the breaker IS reset.
- Non-recovery still respects the filter: registers openai::deprecated-model, force-opens it, runs one healthy probe whose discovery returns [{id: "gpt-4o"}]. Asserts the breaker stays force-open. Pins the contract so a future refactor can't accidentally widen the recovery exception to all healthy probes.
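The two scenarios can be reproduced against a toy model of the per-endpoint loop. This is an illustrative sketch, not the actual test suite. The Endpoint shape, runHealthyProbe, and the forceOpen flag are hypothetical stand-ins for the registry, probeProvider, and the breaker's reset(), and the modelId values assume the endpoint ID format is provider::modelId.

```typescript
// Toy stand-in for a registered endpoint and its breaker state (hypothetical).
interface Endpoint {
  id: string;
  modelId: string;
  forceOpen: boolean;
}

// Toy stand-in for the healthy-probe loop, using the "after" condition.
function runHealthyProbe(
  endpoints: Endpoint[],
  discoveredModelIds: Set<string> | undefined,
  isRecoveryProbe: boolean
): void {
  for (const endpoint of endpoints) {
    if (discoveredModelIds && !isRecoveryProbe) {
      if (!discoveredModelIds.has(endpoint.modelId)) {
        continue; // model missing from discovery: leave the breaker untouched
      }
    }
    endpoint.forceOpen = false; // stands in for reset()
  }
}

// Scenario 1: recovery probe, discovery name does not match the catalog name.
const deepseek: Endpoint = {
  id: "deepseek::deepseek-chat",
  modelId: "deepseek-chat",
  forceOpen: true,
};
runHealthyProbe([deepseek], new Set(["deepseek-v4-flash"]), true);

// Scenario 2: ordinary healthy probe, model absent from discovery.
const deprecated: Endpoint = {
  id: "openai::deprecated-model",
  modelId: "deprecated-model",
  forceOpen: true,
};
runHealthyProbe([deprecated], new Set(["gpt-4o"]), false);

console.log(deepseek.forceOpen);   // false: breaker reset despite the name mismatch
console.log(deprecated.forceOpen); // true: breaker stays force-open
```

The second scenario is the guard. If the recovery exception were ever widened to all healthy probes, deprecated.forceOpen would flip to false and the check would fail.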
The Numbers
- DeepSeek loop: 11+ hours, ~480 sentinel events/hour (60 probe cycles/hour × 4 logs per probe cycle × 2 ECS tasks)
- Pub/sub propagation latency: ~17ms peer-to-peer (measured via manual admin reset)
- Code change: 1-line condition + 22 lines of comment explaining the recovery exception
- Tests added: 2 (one for the new recovery behavior, one regression guard for the existing non-recovery contract)
Lockstep Checklist
- [x] API Routes: No route surface change.
- [x] TS SDK: No SDK surface change.
- [x] Python SDK: No SDK surface change.
- [x] MCP Schemas: No tool surface change.
- [x] Master Record: No capability change.