Circuit Breakers

Dual-trigger state machine with sliding window error rate and exponential backoff recovery.

Why circuit breakers matter for AI gateways

When a provider endpoint fails (rate limit, timeout, HTTP 500), a naive retry loop makes things worse by adding load to a service that is already failing. Circuit breakers detect sustained failure and temporarily remove the endpoint from rotation, giving it time to recover.

BrainstormRouter's circuit breaker implementation is in src/router/circuit-breaker.ts.

State machine

%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#d97706', 'lineColor': '#9494a8', 'primaryTextColor': '#e8e8ee'}}}%%
stateDiagram-v2
    [*] --> Closed
    Closed --> Open: 5 consecutive failures\nOR 50% error rate (60s window)
    Open --> HalfOpen: Reset timeout elapsed\n(base: 30s)
    HalfOpen --> Closed: Probe succeeds\n(reset timeout to base)
    HalfOpen --> Open: Probe fails\n(double timeout, cap 16×)

    state Closed {
        [*] --> Monitoring
        Monitoring: All requests allowed
        Monitoring: Recording success/failure
    }

    state Open {
        [*] --> Blocking
        Blocking: All requests rejected
        Blocking: Waiting for timeout
    }

    state HalfOpen {
        [*] --> Probing
        Probing: Single request allowed
        Probing: Testing recovery
    }
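The transitions in the diagram can be read as a pure function from (state, event) to the next state. The following is a minimal sketch, not the code in circuit-breaker.ts: the names BreakerState, BreakerEvent, and nextState, and the "half-open" string literal, are assumptions made for illustration.

```typescript
// Hypothetical names; only the transitions themselves come from the diagram.
type BreakerState = "closed" | "open" | "half-open";
type BreakerEvent = "trip" | "timeout-elapsed" | "probe-success" | "probe-failure";

function nextState(state: BreakerState, event: BreakerEvent): BreakerState {
  switch (state) {
    case "closed":
      // "trip" stands for either trigger: consecutive failures or window error rate.
      return event === "trip" ? "open" : "closed";
    case "open":
      // Requests are rejected until the reset timeout elapses.
      return event === "timeout-elapsed" ? "half-open" : "open";
    case "half-open":
      if (event === "probe-success") return "closed";
      if (event === "probe-failure") return "open";
      return "half-open";
  }
}
```

Events that do not apply to the current state leave it unchanged, which is why a "probe-success" event has no effect on a closed circuit.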

Configuration

From circuit-breaker.ts:17-30:

export type CircuitBreakerConfig = {
  failureThreshold: number; // Default: 5
  resetTimeoutMs: number; // Default: 30000 (30s)
  enableWindow?: boolean; // Default: true
  windowMs?: number; // Default: 60000 (60s)
  windowMinRequests?: number; // Default: 10
  windowErrorRateThreshold?: number; // Default: 0.5 (50%)
};
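As an illustration of how the documented defaults might combine with caller overrides, here is a hedged sketch. DEFAULT_CONFIG and resolveConfig are hypothetical names, not part of circuit-breaker.ts; the default values are the ones listed in the comments above.

```typescript
type CircuitBreakerConfig = {
  failureThreshold: number;
  resetTimeoutMs: number;
  enableWindow?: boolean;
  windowMs?: number;
  windowMinRequests?: number;
  windowErrorRateThreshold?: number;
};

// Hypothetical constant holding the documented defaults.
const DEFAULT_CONFIG: Required<CircuitBreakerConfig> = {
  failureThreshold: 5,
  resetTimeoutMs: 30_000,
  enableWindow: true,
  windowMs: 60_000,
  windowMinRequests: 10,
  windowErrorRateThreshold: 0.5,
};

// Hypothetical helper: later spread entries win, so overrides replace defaults.
// Note that an explicitly passed `undefined` would also replace a default.
function resolveConfig(
  overrides: Partial<CircuitBreakerConfig> = {},
): Required<CircuitBreakerConfig> {
  return { ...DEFAULT_CONFIG, ...overrides };
}

// A stricter breaker for a latency-sensitive endpoint (illustrative values).
const strictConfig = resolveConfig({ failureThreshold: 3, windowErrorRateThreshold: 0.25 });
```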

Dual triggers

The circuit opens as soon as either trigger fires; the two conditions do not both need to be met:

Trigger 1: Consecutive failures (circuit-breaker.ts:128-134)

if (circuit.failures >= this.config.failureThreshold) {
  circuit.state = "open";
  circuit.lastStateChange = Date.now();
  log.warn(`Circuit opened for ${serviceId} (${circuit.failures} consecutive failures)`);
  return;
}

Five consecutive failures (the default) open the circuit immediately; a single success resets the counter. This catches hard failures such as an endpoint being down, revoked auth, or a DNS failure.
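To make the reset-on-success behavior concrete, here is a small standalone sketch of a consecutive-failure counter. The class name ConsecutiveFailureTrigger and its methods are illustrative assumptions; the real implementation keeps this count on the circuit record as `circuit.failures`.

```typescript
// Hypothetical helper class, not taken from circuit-breaker.ts.
class ConsecutiveFailureTrigger {
  private failures = 0;
  private readonly threshold: number;

  constructor(threshold: number = 5) {
    this.threshold = threshold;
  }

  // Any success breaks the streak.
  recordSuccess(): void {
    this.failures = 0;
  }

  // Returns true when the streak has reached the threshold.
  recordFailure(): boolean {
    this.failures += 1;
    return this.failures >= this.threshold;
  }
}
```

With the default threshold, four failures followed by one success and four more failures never trip the trigger; only an unbroken run of five does.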

Trigger 2: Sliding window error rate (circuit-breaker.ts:136-149)

if (this.config.enableWindow) {
  const { total, errors } = this.windowStats(circuit);
  if (total >= this.config.windowMinRequests) {
    const errorRate = errors / total;
    if (errorRate >= this.config.windowErrorRateThreshold) {
      circuit.state = "open";
      // ...
    }
  }
}

If at least 50% of requests in the last 60 seconds failed (with at least 10 requests in the window), the circuit opens. This catches soft failures such as intermittent 429s, partial timeouts, and degraded responses.

The sliding window uses a ring buffer of (timestamp, isError) tuples, pruned on every append to stay within the window duration.
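The windowing logic can be sketched as follows. This is an assumption-laden illustration rather than the code behind `windowStats`: it uses a plain array instead of a true ring buffer, and SlidingWindow, Sample, and shouldOpen are hypothetical names. Timestamps are passed in explicitly so the behavior is deterministic.

```typescript
type Sample = { timestamp: number; isError: boolean };

// Hypothetical class: a time-based window backed by a plain array.
class SlidingWindow {
  private samples: Sample[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number = 60_000) {
    this.windowMs = windowMs;
  }

  // Append a sample, then drop everything older than the window.
  record(isError: boolean, now: number = Date.now()): void {
    this.samples.push({ timestamp: now, isError });
    this.prune(now);
  }

  stats(now: number = Date.now()): { total: number; errors: number } {
    this.prune(now);
    const errors = this.samples.filter((s) => s.isError).length;
    return { total: this.samples.length, errors };
  }

  private prune(now: number): void {
    const cutoff = now - this.windowMs;
    this.samples = this.samples.filter((s) => s.timestamp >= cutoff);
  }
}

// Mirrors the two guards in Trigger 2: minimum volume first, then error rate.
function shouldOpen(
  stats: { total: number; errors: number },
  minRequests: number = 10,
  errorRateThreshold: number = 0.5,
): boolean {
  if (stats.total < minRequests) return false;
  return stats.errors / stats.total >= errorRateThreshold;
}
```

The minimum-request guard matters at low traffic: one failure out of two requests is a 50% error rate, but it is too small a sample to justify opening the circuit.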

Recovery: exponential backoff

When a circuit is open, it waits for the reset timeout (default 30s) before transitioning to half-open. In half-open state, a single request is allowed through as a probe.

Probe succeeds (circuit-breaker.ts:86-94): Circuit closes. Reset timeout returns to base value (30s). Normal operation resumes.

Probe fails (circuit-breaker.ts:110-122): Circuit re-opens. Reset timeout doubles, capped at 16× base (480s, or 8 minutes, at the default 30s base).

// Failed probe — exponential backoff
circuit.currentResetTimeoutMs = Math.min(
  circuit.currentResetTimeoutMs * 2,
  this.config.resetTimeoutMs * 16, // Cap at 16× base
);

This prevents a flapping endpoint from consuming probe requests endlessly. The backoff sequence: 30s → 60s → 120s → 240s → 480s (cap).
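The sequence can be checked with a few lines of arithmetic. In this sketch, nextResetTimeout and backoffSequence are hypothetical helper names; the doubling and the 16× cap are the ones shown in the snippet above.

```typescript
// Same formula as the failed-probe branch: double, but never exceed 16x base.
function nextResetTimeout(currentMs: number, baseMs: number, capMultiplier: number = 16): number {
  return Math.min(currentMs * 2, baseMs * capMultiplier);
}

// Timeouts used across a run of failed probes, starting from the base value.
function backoffSequence(baseMs: number, failedProbes: number): number[] {
  const timeouts = [baseMs];
  let current = baseMs;
  for (let i = 0; i < failedProbes; i++) {
    current = nextResetTimeout(current, baseMs);
    timeouts.push(current);
  }
  return timeouts;
}

// Six failed probes at the default base:
// 30000, 60000, 120000, 240000, 480000, 480000, 480000
const defaultBackoff = backoffSequence(30_000, 6);
```

Once the cap is reached after four failed probes, every further failed probe keeps the timeout at 480s, so a persistently down endpoint is probed at most once every 8 minutes.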

Integration with routing

When the circuit breaker reports an endpoint as open, the router removes it from candidate selection. The fallback chain takes over:

  1. Thompson Sampling selects from eligible (non-open) endpoints
  2. If the selected endpoint fails, the fallback resolver tries the next endpoint
  3. If all endpoints are in circuit-open state, the request fails with routing metadata
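The three steps above can be sketched as a single function. This is a simplified illustration under stated assumptions: routeWithFallback and RouteResult are hypothetical names, candidate order stands in for Thompson Sampling (which is not implemented here), and attempts are synchronous callbacks rather than real provider calls.

```typescript
type RouteResult =
  | { ok: true; endpoint: string; tried: string[] }
  | { ok: false; tried: string[]; circuitOpen: string[] };

function routeWithFallback(
  candidates: string[],
  allowRequest: (serviceId: string) => boolean,
  attempt: (serviceId: string) => boolean,
): RouteResult {
  // Step 1: keep only endpoints whose circuit allows traffic.
  const eligible: string[] = [];
  const circuitOpen: string[] = [];
  for (const id of candidates) {
    (allowRequest(id) ? eligible : circuitOpen).push(id);
  }

  // Step 2: try eligible endpoints in order, falling back on failure.
  const tried: string[] = [];
  for (const id of eligible) {
    tried.push(id);
    if (attempt(id)) return { ok: true, endpoint: id, tried };
  }

  // Step 3: nothing succeeded; return metadata describing what happened.
  return { ok: false, tried, circuitOpen };
}
```

When every circuit is open, `tried` is empty and `circuitOpen` lists every candidate, which gives the caller enough metadata to distinguish "all endpoints tripped" from "all endpoints attempted and failed".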

The circuit breaker check (allowRequest()) is a single Map lookup plus a timestamp comparison, adding less than 0.1ms of overhead.
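A check of that shape might look like the sketch below. It is not the actual allowRequest() from circuit-breaker.ts: the Circuit type, the "half-open" literal, and the probeInFlight flag are assumptions, while state, lastStateChange, and currentResetTimeoutMs reuse field names that appear in the snippets above.

```typescript
type Circuit = {
  state: "closed" | "open" | "half-open";
  lastStateChange: number;
  currentResetTimeoutMs: number;
  probeInFlight: boolean; // hypothetical flag limiting half-open to one probe
};

const circuits = new Map<string, Circuit>();

function allowRequest(serviceId: string, now: number = Date.now()): boolean {
  const circuit = circuits.get(serviceId); // the single Map lookup
  if (circuit === undefined || circuit.state === "closed") return true;

  if (circuit.state === "open") {
    // The timestamp comparison: has the reset timeout elapsed?
    if (now - circuit.lastStateChange < circuit.currentResetTimeoutMs) return false;
    circuit.state = "half-open";
    circuit.lastStateChange = now;
    circuit.probeInFlight = true;
    return true; // this request becomes the probe
  }

  // Half-open: reject everything while a probe is outstanding.
  if (circuit.probeInFlight) return false;
  circuit.probeInFlight = true;
  return true;
}
```

In this sketch an endpoint with no recorded circuit is treated as closed, so the first request to a new serviceId is always allowed.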

Per-service granularity

Circuits are keyed by serviceId, which maps to a specific provider+model endpoint. Opening the circuit for anthropic:claude-sonnet-4:us-east-1 does NOT affect anthropic:claude-sonnet-4:eu-west-1 or openai:gpt-4o. Each endpoint is isolated.

Manual recovery

For operational emergencies, circuits can be manually reset:

circuitBreaker.reset("anthropic:claude-sonnet-4:us-east-1");

This immediately closes the circuit and clears all failure history.
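Both per-service isolation and manual reset follow naturally from keying circuit records by serviceId. The sketch below is a hypothetical registry, not the real class: CircuitRegistry, CircuitRecord, and the delete-then-recreate approach to reset are assumptions made for illustration.

```typescript
type CircuitRecord = {
  state: "closed" | "open" | "half-open";
  failures: number;
  lastStateChange: number;
};

// Hypothetical registry: one independent record per serviceId.
class CircuitRegistry {
  private readonly circuits = new Map<string, CircuitRecord>();

  // Lazily creates a closed circuit the first time a serviceId is seen.
  get(serviceId: string): CircuitRecord {
    let circuit = this.circuits.get(serviceId);
    if (circuit === undefined) {
      circuit = { state: "closed", failures: 0, lastStateChange: Date.now() };
      this.circuits.set(serviceId, circuit);
    }
    return circuit;
  }

  // Dropping the record discards all history; the next get() starts closed.
  reset(serviceId: string): void {
    this.circuits.delete(serviceId);
  }
}
```

Because each serviceId has its own record, mutating or resetting one entry cannot change the state of any other endpoint.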