2026-05-10-279-auto-selector-gates-on-sentinel

#279: Auto-router gates on endpoint-sentinel degraded state

Date: 2026-05-10
Status: shipped
Slug: 279-auto-selector-gates-on-sentinel
Branch: feature/279-auto-selector-gates-on-sentinel
Closes: #279

Summary

ModelAutoSelector.select() now consults EndpointSentinel and excludes any endpoint whose provider is currently degraded / down / auth_error before any selection strategy runs. The exclusion is applied to all paths — variant strategy, history-based, complexity-based, cross-tenant warm-start, pattern-store recommendation, final fallback — because the gate runs upstream of all of them.
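The upstream placement can be illustrated with a minimal sketch. The Endpoint and SentinelLike shapes and the applyDegradedProviderGate helper are hypothetical stand-ins, not the actual code in src/router; only getDegradedOrWorseProviderIds and endpoint.provider come from this PR. Because the filter runs once on the candidate list, every strategy that runs afterwards only ever sees endpoints from healthy providers.

```typescript
// Hypothetical shapes for illustration; the real types in src/router may differ.
interface Endpoint {
  id: string;
  provider: string; // bare provider ID, e.g. "x-ai"
}

interface SentinelLike {
  getDegradedOrWorseProviderIds(): Set<string>;
}

// Split candidates into kept/excluded before any selection strategy runs.
// With no sentinel wired, the gate is a no-op and every candidate is kept.
function applyDegradedProviderGate(
  candidates: Endpoint[],
  sentinel: SentinelLike | undefined,
): { kept: Endpoint[]; excluded: Endpoint[] } {
  if (!sentinel) return { kept: candidates, excluded: [] };
  const unhealthy = sentinel.getDegradedOrWorseProviderIds();
  const kept: Endpoint[] = [];
  const excluded: Endpoint[] = [];
  for (const ep of candidates) {
    (unhealthy.has(ep.provider) ? excluded : kept).push(ep);
  }
  return { kept, excluded };
}

// Stub sentinel that reports x-ai as degraded-or-worse.
const sentinel: SentinelLike = {
  getDegradedOrWorseProviderIds: () => new Set<string>(["x-ai"]),
};
const candidates: Endpoint[] = [
  { id: "x-ai/grok-3-mini", provider: "x-ai" },
  { id: "example/model-a", provider: "example" }, // made-up endpoint
];

// Strategies (variant, history, complexity, ...) would receive gated.kept only.
const gated = applyDegradedProviderGate(candidates, sentinel);
console.log(gated.kept.map((e) => e.id));
```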

Why this matters

Gauntlet baseline (PR #277) and second run (PR #278) both showed mission turns landing on x-ai/grok-3-mini while the sentinel was logging Provider x-ai: degraded continuously (~6 alarms/min):

F2 turn 1: model=x-ai/grok-3-mini  finish_reason=length  cost=0
L1 turn 1: model=x-ai/grok-3-mini  finish_reason=length  cost=0

The cost=0 came from the existing zero-completion-insurance waiver, which kicks in when a degraded provider returns empty/truncated content — i.e. the response was _known_ to be low-quality but the auto-selector had no mechanism to avoid the provider in the first place. getDownProviderKeys already existed but only covered down + auth_error; degraded was ignored.

This PR fixes the symptom in routing. It does not explain the -5 correctness deficit observed in the gauntlet: those scores were flat between runs even though grok was hit on F2/L1 in both (see issue #280's N=5 distribution analysis for the actual cause). It does, however, remove a real correctness risk that would otherwise compound on any run where a provider flaps.

Files

Modified

  • src/router/endpoint-sentinel.ts: added getDegradedOrWorseProviderIds(), which returns a Set of bare provider IDs (e.g. "x-ai") by stripping the | suffix from composite keys, because the auto-selector matches against endpoint.provider. The set includes degraded, down, and auth_error; the existing getDownProviderKeys only covered the latter two.
  • src/router/model-auto-selector.ts:
      • AutoSelectorConfig accepts endpointSentinel and degradedProviderGate ("off" | "warn" | "enforce")
      • The constructor accepts the sentinel, and a new wireSentinel(sentinel) setter is added (the sentinel is built after wireIntelligence runs in model-router-init.ts, so a dedicated setter avoids a construction-order shuffle)
      • Gate runs after the forced-tool-choice exclusion and before all selection strategies
      • AutoSelectResult gains optional excludedDueToDegradedProvider: number and degradedProvidersExcluded: string[] for telemetry
      • Default mode: "enforce" when the sentinel is wired, "off" when not. This preserves legacy behavior in tests that don't opt in
  • src/router/model-router-init.ts: calls autoSelector.wireSentinel(sentinel) after the sentinel is constructed
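A minimal sketch of the key-stripping step described for getDegradedOrWorseProviderIds follows. It assumes composite keys have the form "provider|suffix" and that sentinel state is held as a map from composite key to status; both are assumptions for illustration, as are the ProviderStatus type, the degradedOrWorseProviderIds helper name, and the sample keys. The real method lives on EndpointSentinel and may be structured differently.

```typescript
// Assumed status vocabulary; "healthy" is a hypothetical name for the good state.
type ProviderStatus = "healthy" | "degraded" | "down" | "auth_error";

const UNHEALTHY: ReadonlySet<ProviderStatus> = new Set<ProviderStatus>([
  "degraded",
  "down",
  "auth_error",
]);

// Collect bare provider IDs for every composite key in an unhealthy state.
// Assumption: composite keys look like "x-ai|<suffix>"; keys without "|" are
// treated as already-bare provider IDs.
function degradedOrWorseProviderIds(
  states: Map<string, ProviderStatus>,
): Set<string> {
  const ids = new Set<string>();
  states.forEach((status, compositeKey) => {
    if (!UNHEALTHY.has(status)) return;
    const sep = compositeKey.indexOf("|");
    ids.add(sep === -1 ? compositeKey : compositeKey.slice(0, sep));
  });
  return ids;
}

// Made-up sentinel state: two x-ai keys collapse into one provider ID.
const states = new Map<string, ProviderStatus>([
  ["x-ai|slot-a", "degraded"],
  ["x-ai|slot-b", "down"],
  ["acme|slot-a", "auth_error"],
  ["example|slot-a", "healthy"],
]);
const ids = degradedOrWorseProviderIds(states);
console.log(Array.from(ids));
```

Returning a Set keeps the membership check in the selector cheap and deduplicates providers that appear under several composite keys.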

Tests

  • src/router/model-auto-selector.test.ts: new describe("issue #279 — degraded-provider gate") block, 6 tests:
      • excludes degraded providers (default = enforce)
      • warn mode reports telemetry without filtering
      • off mode is a true no-op
      • no-op when sentinel is unwired (legacy behavior)
      • zero telemetry when sentinel reports nothing degraded
      • exclusion stacks with caller-provided excludeEndpointIds
  • src/router/endpoint-sentinel.test.ts: new describe("issue #279 — getDegradedOrWorseProviderIds") block, 3 tests:
      • returns bare provider IDs for degraded/down/auth_error
      • includes auth_error explicitly
      • empty set when all healthy

Verification

  • pnpm tsgo: exit 0
  • pnpm test:fast: 7772/0 (+9 from #279 tests)
  • pnpm exec oxfmt --check / oxlint --type-aware: clean

Configuration

By default — once sentinel is wired in model-router-init.ts — the gate is in enforce mode. Operators can override per-environment via AutoSelectorConfig.degradedProviderGate:

Mode | Behavior
"off" | Legacy behavior: no exclusion, no telemetry.
"warn" | Compute the exclusion set, log it, populate the telemetry, but do not actually filter. Useful for measuring impact before flipping.
"enforce" (default with sentinel) | Exclude. The standard production posture.

Recommended rollout for anyone who wants to be conservative: deploy in warn mode for at least one full sentinel cycle (60s × N intervals' worth of telemetry), verify that the excludedDueToDegradedProvider counters don't spike during healthy operation, then flip to enforce. Skipping straight to enforce is also fine: the gate fails open, in the sense that a missing sentinel means a no-op, not a no-routing-possible state.
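The three modes and the default can be sketched as below. This is an illustration under stated assumptions, not the shipped implementation: resolveGateMode, runGate, Candidate, and GateOutcome are hypothetical names, and the sketch represents "no telemetry" as a zero count and an empty list, whereas the real AutoSelectResult fields are optional. It also assumes an unwired sentinel forces a no-op regardless of the configured mode, which matches the "no-op when sentinel is unwired" test.

```typescript
type GateMode = "off" | "warn" | "enforce";

interface Candidate {
  id: string;
  provider: string;
}

interface GateOutcome {
  candidates: Candidate[]; // what the selection strategies would receive
  excludedDueToDegradedProvider: number;
  degradedProvidersExcluded: string[];
}

// Default is "enforce" once a sentinel is wired; without one the gate is off.
function resolveGateMode(
  configured: GateMode | undefined,
  sentinelWired: boolean,
): GateMode {
  if (!sentinelWired) return "off";
  return configured ?? "enforce";
}

function runGate(
  candidates: Candidate[],
  degraded: Set<string>,
  mode: GateMode,
): GateOutcome {
  if (mode === "off") {
    return {
      candidates,
      excludedDueToDegradedProvider: 0,
      degradedProvidersExcluded: [],
    };
  }
  // Both "warn" and "enforce" compute the exclusion set and the telemetry.
  const flagged = candidates.filter((c) => degraded.has(c.provider));
  const providers = Array.from(new Set(flagged.map((c) => c.provider)));
  return {
    // Only "enforce" actually filters; "warn" passes everything through.
    candidates:
      mode === "enforce"
        ? candidates.filter((c) => !degraded.has(c.provider))
        : candidates,
    excludedDueToDegradedProvider: flagged.length,
    degradedProvidersExcluded: providers,
  };
}

const pool: Candidate[] = [
  { id: "x-ai/grok-3-mini", provider: "x-ai" },
  { id: "example/model-a", provider: "example" }, // made-up endpoint
];
const degradedNow = new Set<string>(["x-ai"]);

const warned = runGate(pool, degradedNow, "warn");
const enforced = runGate(pool, degradedNow, resolveGateMode(undefined, true));
console.log(warned.candidates.length, enforced.candidates.length);
```

In warn mode the counters show what enforce would have removed, which is what makes the conservative rollout measurable before any traffic is rerouted.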

What this PR does NOT do

  • Does not gate explicit-model requests (model=x-ai/grok-3-mini direct). If a user explicitly asks for a degraded provider, BR honors the choice. A follow-up could 503 those requests with a recovery hint, but that is a behavior change, not a routing-policy change, and out of scope here.
  • Does not add a backoff window: the gate reflects sentinel's current state. Sentinel itself controls the recovery cadence (deep probe every 60s; once a provider flips back to healthy, the next selection sees it as available). A dedicated debounce window for "recently degraded but now healthy" providers is a separate refinement.
  • Does not affect cascade fallback logic. The cascade reads sentinel state separately for fallback target selection.
  • Does not block on issue #280 (N=5 distribution baseline). The routing fix lands first because it removes a known systemic risk; the distribution baseline measures the fix's impact and is not a prerequisite for it.

Lockstep

  • TS / Python SDK / MCP — no public API surface change
  • OpenAPI — no new routes
  • Ship log — this file