Sentinel recovery probe now bypasses model-list filter — closes the auto-recovery deadlock that PR #185 surfaced

2026-05-06

routersentinelobservability

LOCKSTEP TRACEABILITY MATRIX --- api_endpoints: ["none — internal sentinel logic"] sdk_methods_updated: ["none"] mcp_tools_updated: ["none"] ---

What We Built

EndpointSentinel.probeProvider no longer applies the "model not in discovery → skip endpoint" filter when the probe is a recovery probe (isRecoveryProbe === true). The filter still applies on every other healthy probe, so genuinely deprecated endpoints stay force-opened.

Why It Matters

PR #185 shipped the cross-task recovered: true pub/sub broadcast and verified the propagation works (peer task cleared its local force-open within ~17ms of the publish, evidenced live with manual admin reset). But auto-recovery was still not engaging in production — the deepseek loop kept firing every minute even after #185 deployed.

Investigation showed provider.recovered events were emitted by the debounce path (detectStatusChange's recovery branch) but zero Circuit reset events ever appeared. The per-endpoint loop body inside the recovery probe was hitting continue at the model-list filter and never reaching reset() at line 458.

DeepSeek's /v1/models returns model names that don't match the static catalog (deepseek-v4-flash returned for what the catalog calls deepseek-chat). The filter — meant to prevent auto-closing breakers for genuinely deprecated endpoints — was silently blocking the recovery clear.

The result: the entire recovery state machine (PR #164 single-task + PR #185 cross-task) was correct in isolation, but a 5-line filter several layers up prevented either of them from firing in production. Provider stayed force-open until manual operator intervention.

How It Works

// before
if (discoveredModelIds) {
  const endpoint = this.registry.get(endpointId);
  if (endpoint && !discoveredModelIds.has(endpoint.modelId)) {
    continue;
  }
}

// after
if (discoveredModelIds && !isRecoveryProbe) {
  const endpoint = this.registry.get(endpointId);
  if (endpoint && !discoveredModelIds.has(endpoint.modelId)) {
    continue;
  }
}

Why this is safe: the recovery probe only fires when previousStatus ∈ {down, auth_error, degraded} AND the current probe says healthy — i.e., the sentinel itself force-opened the breakers because the _whole provider_ failed, not because a specific model 404'd. The breakers we're about to clear are sentinel-force-opened, not traffic-force-opened. If a specific model is genuinely deprecated, the next real-traffic 404/403 on it will force-open it again immediately, so the worst case is one wasted request, not a sticky regression.

Two regression tests pinned:

Recovery clears across name mismatch: registers deepseek::deepseek-chat, runs two failure probes (drives previousStatus = down), then a healthy probe whose /v1/models returns [{id: "deepseek-v4-flash"}]. Asserts the breaker IS reset.
Non-recovery still respects the filter: registers openai::deprecated-model, force-opens it, runs one healthy probe whose discovery returns [{id: "gpt-4o"}]. Asserts the breaker stays force-open. Pins the contract so a future refactor can't accidentally widen the recovery exception to all healthy probes.

The Numbers

DeepSeek loop: 11+ hours, ~480 sentinel events/hour, 4 logs per probe cycle × 2 ECS tasks
Pub/sub propagation latency: ~17ms peer-to-peer (measured via manual admin reset)
Code change: 1-line condition + 22 lines of comment explaining the recovery exception
Tests added: 2 (one for the new recovery behavior, one regression guard for the existing non-recovery contract)

Lockstep Checklist

[x] API Routes: No route surface change.
[x] TS SDK: No SDK surface change.
[x] Python SDK: No SDK surface change.
[x] MCP Schemas: No tool surface change.
[x] Master Record: No capability change.