Circuit-breaker reset() now broadcasts to peers — kills the cross-task force-open feedback loop

2026-05-06

router · circuit-breaker · observability · sentinel

LOCKSTEP TRACEABILITY MATRIX
---
api_endpoints: ["none (internal pub/sub semantics, route surface unchanged)"]
sdk_methods_updated: ["none"]
mcp_tools_updated: ["none"]
---

What We Built

CircuitBreakerManager.reset() now publishes a recovered: true signal over the existing Redis pub/sub channel, and the subscriber treats that signal as authoritative: peers drop their local entry, including any forcedOpen flag. Without this round trip, force-opened breakers desynced across ECS tasks. One task's sentinel recovery would clear its local entry, but peers still holding a stale previousStatus = "down" kept republishing forceOpen via the override path, and the recovering task re-imported it. Net result: the breaker stayed stuck open.

reset() also clears the boot-recovery hash entry, so a cold-starting task doesn't re-import the stale open state.
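To illustrate why the hash entry has to be removed, here is a minimal sketch that models the boot-recovery hash with an in-memory stand-in. FakeHash, importOnBoot, and the stored JSON shape are hypothetical names invented for this example; only the hdel-on-reset behaviour comes from the change itself, and the real boot path may differ.

```typescript
// In-memory stand-in for the Redis hash that persists breaker state across restarts.
// FakeHash and importOnBoot are hypothetical; they are not part of the real manager.
type Persisted = { state: "open" | "closed"; forcedOpen: boolean };

class FakeHash {
  private fields = new Map<string, string>();
  hset(field: string, value: string): void { this.fields.set(field, value); }
  hdel(field: string): void { this.fields.delete(field); }
  hgetall(): Record<string, string> {
    const out: Record<string, string> = {};
    this.fields.forEach((value, key) => { out[key] = value; });
    return out;
  }
}

// Assumed cold-start behaviour: import every persisted entry that is still open.
function importOnBoot(hash: FakeHash): Map<string, Persisted> {
  const imported = new Map<string, Persisted>();
  const all = hash.hgetall();
  Object.keys(all).forEach((serviceId) => {
    const entry = JSON.parse(all[serviceId]) as Persisted;
    if (entry.state === "open") imported.set(serviceId, entry);
  });
  return imported;
}

const bootHash = new FakeHash();
const chatId = "deepseek::deepseek-chat";
bootHash.hset(chatId, JSON.stringify({ state: "open", forcedOpen: true }));

// If the field is left behind, a task booting after recovery re-imports the open state.
const staleBoot = importOnBoot(bootHash);

// With the field deleted on reset, the next cold start comes up clean.
bootHash.hdel(chatId);
const cleanBoot = importOnBoot(bootHash);

console.log(staleBoot.has(chatId), cleanBoot.has(chatId));
// prints: true false
```

The point of the sketch is only the ordering: as long as the field survives a reset, any task that restarts afterwards reintroduces the open breaker, regardless of what the pub/sub channel said earlier.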

Why It Matters

Caught in production 2026-05-06: a transient DeepSeek completions-probe timeout at 09:42 UTC put both endpoints into force-open. Eleven hours later, the sentinel was still cycling through the same self-reinforcing loop every minute:

probe says healthy but 2/2 circuit breakers are open — overriding to degraded
Circuit force-opened for deepseek::deepseek-chat (permanent failure)
Circuit force-opened for deepseek::deepseek-reasoner (permanent failure)
Provider deepseek: down (2/2 endpoints have open circuit breakers (real traffic failures))

No real DeepSeek traffic was failing: the bandit avoided DeepSeek because the breakers were open, so recordFailure was never called. The sentinel was the entire cause of the breakers staying open. PR #164 had added a recovery path (previousStatus ∈ {down, auth_error, degraded} + healthy probe → reset), but it was correct only within a single task. Across two ECS tasks, the recovery race meant whichever task hadn't yet hit its own recovery probe republished forceOpen faster than the other task could clear it.

The blast radius: one provider's breakers stuck = that provider gets effectively zero traffic for the full duration, regardless of actual health. With Thompson sampling biased toward proven providers, this can compound across providers.

How It Works

// before
reset(serviceId: string): void {
  this.circuits.delete(serviceId);
  log.info(`Circuit reset for ${serviceId}`);
}

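// after
reset(serviceId: string): void {
  this.circuits.delete(serviceId);
  log.info(`Circuit reset for ${serviceId}`);
  if (this.redis) {
    // recovered: true marks this as an authoritative reset that peers must honor.
    const msg = JSON.stringify({
      serviceId, state: "closed", ts: Date.now(),
      forcedOpen: false, recovered: true,
    });
    // Both Redis calls are fire-and-forget: rejections are swallowed so they cannot affect the local reset.
    this.redis.publish(CircuitBreakerManager.REDIS_CHANNEL, msg).catch(() => {});
    // Drop the boot-recovery entry so a cold-starting task doesn't re-import the open state.
    this.redis.hdel(`${CircuitBreakerManager.REDIS_KEY_PREFIX}states`, serviceId).catch(() => {});
  }
}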

The subscriber gains a new branch that fires before the existing open/closed handling:

if (data.recovered === true) {
  if (this.circuits.delete(data.serviceId)) {
    log.info(`Peer circuit recovered: ${data.serviceId}`);
  }
  return;
}

The existing contract for plain state: "closed" messages (without recovered: true) is intentionally unchanged — those continue to refuse to clear a force-opened local entry, so a normal half-open success on one task doesn't silently override another task's force-open from a real outage signal.
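The two contracts can be sketched as a single pure function. This is an illustration under assumptions, not the actual subscriber: PeerMessage, LocalCircuit, Outcome, and applyPeerMessage are hypothetical names, and the handling of plain closed and open messages is a guess at the existing behaviour. Only the recovered branch and the "plain closed never clears force-open" rule come from the description above.

```typescript
// Hypothetical shapes; only serviceId, state, ts, forcedOpen and recovered appear in the post.
interface PeerMessage {
  serviceId: string;
  state: "open" | "closed";
  ts: number;
  forcedOpen: boolean;
  recovered?: boolean;
}
interface LocalCircuit { state: "open" | "closed"; forcedOpen: boolean; }
type Outcome = "cleared-by-recovery" | "kept-force-open" | "closed" | "opened";

function applyPeerMessage(local: Map<string, LocalCircuit>, data: PeerMessage): Outcome {
  // New branch: an explicit recovery is authoritative and is checked first.
  if (data.recovered === true) {
    local.delete(data.serviceId);
    return "cleared-by-recovery";
  }
  if (data.state === "closed") {
    // Unchanged contract: a plain close never clears a force-opened entry.
    const entry = local.get(data.serviceId);
    if (entry !== undefined && entry.forcedOpen) return "kept-force-open";
    local.delete(data.serviceId); // assumption: closed circuits are simply dropped
    return "closed";
  }
  // Assumed handling for peer "open" messages.
  local.set(data.serviceId, { state: "open", forcedOpen: data.forcedOpen });
  return "opened";
}

const reasonerId = "deepseek::deepseek-reasoner";
const localCircuits = new Map<string, LocalCircuit>();
localCircuits.set(reasonerId, { state: "open", forcedOpen: true });

const plainResult = applyPeerMessage(localCircuits, {
  serviceId: reasonerId, state: "closed", ts: Date.now(), forcedOpen: false,
});
const recoveredResult = applyPeerMessage(localCircuits, {
  serviceId: reasonerId, state: "closed", ts: Date.now(), forcedOpen: false, recovered: true,
});

console.log(plainResult, recoveredResult, localCircuits.has(reasonerId));
// prints: kept-force-open cleared-by-recovery false
```

Checking recovered before state matters because a recovery message also carries state: "closed". If the order were reversed, the force-open guard would swallow it.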

The Numbers

DeepSeek loop pre-fix: 11+ hours, ~4 sentinel log lines per probe cycle × 60 cycles per hour (one probe per minute) × 2 tasks = ~480 stuck-loop events per hour
Tests added: 2 cross-task pub/sub tests using a fan-out fake bus (recovered: true propagates clear; plain closed does NOT clear force-open — both contracts pinned)
Code change: ~25 LOC across one file, plus the test infrastructure
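For readers who have not seen the test infrastructure, a fan-out fake bus can be as small as the sketch below. FanOutBus and ToyTask are hypothetical stand-ins written for this post, not the real test doubles or the real CircuitBreakerManager. The sketch assumes synchronous delivery to every subscriber, including the publisher.

```typescript
// Hypothetical test double: every publish is delivered synchronously to every
// subscriber, approximating pub/sub fan-out across tasks.
type BusHandler = (raw: string) => void;

class FanOutBus {
  private handlers: BusHandler[] = [];
  subscribe(handler: BusHandler): void { this.handlers.push(handler); }
  publish(raw: string): void { this.handlers.forEach((h) => h(raw)); }
}

// Toy stand-in for one task's breaker manager; it tracks only force-opened ids.
class ToyTask {
  readonly forcedOpen = new Set<string>();
  private bus: FanOutBus;

  constructor(bus: FanOutBus) {
    this.bus = bus;
    bus.subscribe((raw) => this.onMessage(JSON.parse(raw)));
  }

  forceOpen(serviceId: string): void { this.forcedOpen.add(serviceId); }

  // Plain half-open success: publishes closed without the recovered flag.
  closeNormally(serviceId: string): void {
    this.bus.publish(JSON.stringify({ serviceId, state: "closed", forcedOpen: false }));
  }

  // Reset: clears locally, then broadcasts an authoritative recovery.
  reset(serviceId: string): void {
    this.forcedOpen.delete(serviceId);
    this.bus.publish(JSON.stringify({ serviceId, state: "closed", forcedOpen: false, recovered: true }));
  }

  private onMessage(data: { serviceId: string; recovered?: boolean }): void {
    if (data.recovered === true) this.forcedOpen.delete(data.serviceId);
    // Plain closed messages fall through and leave force-open entries alone.
  }
}

const fakeBus = new FanOutBus();
const taskA = new ToyTask(fakeBus);
const taskB = new ToyTask(fakeBus);
const svc = "deepseek::deepseek-chat";
taskA.forceOpen(svc);
taskB.forceOpen(svc);

taskA.closeNormally(svc);
const afterPlainClose = taskB.forcedOpen.has(svc);

taskA.reset(svc);
const afterReset = taskB.forcedOpen.has(svc);

console.log(afterPlainClose, afterReset);
// prints: true false
```

Synchronous delivery keeps the test deterministic. It does not reproduce the timing of the production race, only the message contract each side is expected to honor.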

Lockstep Checklist

[x] API Routes: No route surface change — internal pub/sub semantics.
[x] TS SDK: No SDK surface change.
[x] Python SDK: No SDK surface change.
[x] MCP Schemas: No tool surface change.
[x] Master Record: No capability change.