2026-05-10-279-auto-selector-gates-on-sentinel

#279: Auto-router gates on endpoint-sentinel degraded state

Date: 2026-05-10
Status: shipped
Slug: 279-auto-selector-gates-on-sentinel
Branch: feature/279-auto-selector-gates-on-sentinel
Closes: #279

Summary

ModelAutoSelector.select() now consults EndpointSentinel and excludes any endpoint whose provider is currently degraded / down / auth_error before any selection strategy runs. The exclusion is applied to all paths — variant strategy, history-based, complexity-based, cross-tenant warm-start, pattern-store recommendation, final fallback — because the gate runs upstream of all of them.
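
The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the shipped code: the `Endpoint` shape, `applyDegradedGate`, `selectSketch`, and the strategy list are hypothetical names. Only the idea of matching `endpoint.provider` against a set of degraded provider IDs, upstream of every strategy, is taken from this log.

```typescript
// Hypothetical shapes for illustration only; the real types live in src/router.
interface Endpoint {
  id: string;
  provider: string; // bare provider ID, e.g. "x-ai"
}

type Strategy = (candidates: Endpoint[]) => Endpoint | undefined;

// Drop every endpoint whose provider is in the degraded-or-worse set.
function applyDegradedGate(
  candidates: Endpoint[],
  degradedProviders: ReadonlySet<string>,
): Endpoint[] {
  return candidates.filter((e) => !degradedProviders.has(e.provider));
}

// The gate runs once, before the strategy loop, so every strategy
// (and the fallback at the end) only ever sees the filtered candidates.
function selectSketch(
  candidates: Endpoint[],
  degradedProviders: ReadonlySet<string>,
  strategies: Strategy[],
): Endpoint | undefined {
  const gated = applyDegradedGate(candidates, degradedProviders);
  for (const strategy of strategies) {
    const pick = strategy(gated);
    if (pick) return pick;
  }
  return gated[0]; // stand-in for the final fallback
}
```

Placing the filter ahead of the loop is what makes the exclusion strategy-agnostic: no individual strategy needs to know about sentinel state.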

Why this matters

Gauntlet baseline (PR #277) and second run (PR #278) both showed mission turns landing on x-ai/grok-3-mini while the sentinel was logging Provider x-ai: degraded continuously (~6 alarms/min):

F2 turn 1: model=x-ai/grok-3-mini  finish_reason=length  cost=0
L1 turn 1: model=x-ai/grok-3-mini  finish_reason=length  cost=0

The cost=0 came from the existing zero-completion-insurance waiver, which kicks in when a degraded provider returns empty/truncated content — i.e. the response was _known_ to be low-quality but the auto-selector had no mechanism to avoid the provider in the first place. getDownProviderKeys already existed but only covered down + auth_error; degraded was ignored.

This PR fixes the symptom in routing. It does not explain the -5 correctness deficit observed in the gauntlet (those scores were flat between runs even though grok was hit on F2/L1 in both — see issue #280's N=5 distribution analysis for the actual cause), but it removes a real correctness risk that would otherwise compound on any run where a provider flaps.

Files

Modified

src/router/endpoint-sentinel.ts: added getDegradedOrWorseProviderIds(): Set<string>. Returns bare provider IDs (e.g. "x-ai") by stripping the |-delimited suffix from composite keys, because the auto-selector matches against endpoint.provider. Includes degraded, down, and auth_error; the existing getDownProviderKeys only covered the latter two.
src/router/model-auto-selector.ts:
AutoSelectorConfig accepts endpointSentinel and degradedProviderGate ("off" | "warn" | "enforce")
Constructor + new wireSentinel(sentinel) setter (the sentinel is built after wireIntelligence runs in model-router-init.ts, so a dedicated setter avoids a construction-order shuffle)
Gate runs after the forced-tool-choice exclusion and before all selection strategies
AutoSelectResult gains optional excludedDueToDegradedProvider: number and degradedProvidersExcluded: string[] for telemetry
Default mode: "enforce" when sentinel is wired, "off" when not. This preserves legacy behavior in tests that don't opt in
src/router/model-router-init.ts — calls autoSelector.wireSentinel(sentinel) after sentinel is constructed
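
A minimal sketch of what the new sentinel accessor does, under stated assumptions: the status map, the `ProviderStatus` union, and the free function `degradedOrWorseProviderIds` are hypothetical stand-ins for the sentinel's internal state, and the text after `|` in the example keys is a placeholder, since this log does not say what the suffix contains.

```typescript
// Assumed status values; "healthy" stands in for any non-degraded state.
type ProviderStatus = "healthy" | "degraded" | "down" | "auth_error";

const DEGRADED_OR_WORSE: ReadonlySet<ProviderStatus> = new Set<ProviderStatus>([
  "degraded",
  "down",
  "auth_error",
]);

// Collect bare provider IDs for every composite key in a bad state.
function degradedOrWorseProviderIds(
  statusByKey: ReadonlyMap<string, ProviderStatus>,
): Set<string> {
  const ids = new Set<string>();
  statusByKey.forEach((status, key) => {
    if (!DEGRADED_OR_WORSE.has(status)) return;
    // Composite keys look like "provider|suffix"; keep only the provider part.
    const sep = key.indexOf("|");
    ids.add(sep === -1 ? key : key.slice(0, sep));
  });
  return ids;
}
```

Returning a set of bare IDs means two composite keys for the same provider collapse into one entry, which matches how the auto-selector compares against `endpoint.provider`.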

Tests

src/router/model-auto-selector.test.ts — new describe("issue #279 — degraded-provider gate") block, 6 tests:
excludes degraded providers (default = enforce)
warn mode reports telemetry without filtering
off mode is a true no-op
no-op when sentinel is unwired (legacy behavior)
zero telemetry when sentinel reports nothing degraded
exclusion stacks with caller-provided excludeEndpointIds
src/router/endpoint-sentinel.test.ts — new describe("issue #279 — getDegradedOrWorseProviderIds") block, 3 tests:
returns bare provider IDs for degraded/down/auth_error
includes auth_error explicitly
empty set when all healthy
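
The "exclusion stacks with caller-provided excludeEndpointIds" case boils down to both filters applying at once. A rough sketch, assuming a simplified `Candidate` shape and a hypothetical helper `applyExclusions` (neither is the real test code):

```typescript
// Simplified candidate shape, for illustration only.
interface Candidate {
  id: string;
  provider: string;
}

// A candidate survives only if it is neither explicitly excluded by the
// caller nor served by a degraded-or-worse provider.
function applyExclusions(
  candidates: readonly Candidate[],
  excludeEndpointIds: readonly string[],
  degradedProviders: ReadonlySet<string>,
): Candidate[] {
  const excludedIds = new Set(excludeEndpointIds);
  return candidates.filter(
    (c) => !excludedIds.has(c.id) && !degradedProviders.has(c.provider),
  );
}
```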

Verification

pnpm tsgo — exit 0
pnpm test:fast — 7772/0 (+9 from #279 tests)
pnpm exec oxfmt --check / oxlint --type-aware — clean

Configuration

By default — once sentinel is wired in model-router-init.ts — the gate is in enforce mode. Operators can override per-environment via AutoSelectorConfig.degradedProviderGate:

| Mode | Behavior |
| --- | --- |
| `"off"` | Legacy behavior: no exclusion, no telemetry. |
| `"warn"` | Compute the exclusion set, log it, and populate the telemetry, but do not actually filter. Useful for measuring impact before flipping. |
| `"enforce"` (default with sentinel) | Exclude. The standard production posture. |
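
The three modes can be illustrated with a small sketch. The field names `excludedDueToDegradedProvider` and `degradedProvidersExcluded` come from this log; the `GateEndpoint` and `GateOutcome` shapes, the `runGate` helper, and the choice to report in warn mode the count that would have been excluded are assumptions made for illustration.

```typescript
type GateMode = "off" | "warn" | "enforce";

interface GateEndpoint {
  id: string;
  provider: string;
}

interface GateOutcome {
  candidates: GateEndpoint[];
  excludedDueToDegradedProvider?: number;
  degradedProvidersExcluded?: string[];
}

function runGate(
  candidates: GateEndpoint[],
  degradedProviders: ReadonlySet<string>,
  mode: GateMode,
): GateOutcome {
  // "off": true no-op, no telemetry fields at all.
  if (mode === "off") return { candidates };

  const hits = candidates.filter((e) => degradedProviders.has(e.provider));
  const providers = Array.from(new Set(hits.map((e) => e.provider)));

  return {
    // "warn" keeps the full list; "enforce" removes the hits.
    candidates:
      mode === "enforce"
        ? candidates.filter((e) => !degradedProviders.has(e.provider))
        : candidates,
    excludedDueToDegradedProvider: hits.length,
    degradedProvidersExcluded: providers,
  };
}
```

Because warn and enforce compute the same telemetry, warn-mode counters give a direct preview of what enforce would remove.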

Recommended rollout if anyone wants to be conservative: deploy in warn mode for one full sentinel cycle (60s × N intervals worth of telemetry), verify the excludedDueToDegradedProvider counters don't spike on healthy operation, then flip to enforce. Skipping straight to enforce is also fine: the gate fails open, in the sense that a missing sentinel means a no-op, not a no-routing-possible state.
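
The default-mode rule and the missing-sentinel no-op can be sketched as below. `wireSentinel` and the `degradedProviderGate` values are named in this log; the `AutoSelectorSketch` class, the `SentinelLike` interface, and the `effectiveMode` method are hypothetical, and treating an unwired sentinel as "off" even when a mode is configured is an assumption drawn from the "no-op when sentinel is unwired" test.

```typescript
type DegradedProviderGate = "off" | "warn" | "enforce";

// Minimal view of the sentinel that the gate needs.
interface SentinelLike {
  getDegradedOrWorseProviderIds(): Set<string>;
}

class AutoSelectorSketch {
  private sentinel: SentinelLike | undefined;
  private readonly configured: DegradedProviderGate | undefined;

  constructor(configured?: DegradedProviderGate) {
    this.configured = configured;
  }

  // Dedicated setter, because the sentinel is built after the selector.
  wireSentinel(sentinel: SentinelLike): void {
    this.sentinel = sentinel;
  }

  effectiveMode(): DegradedProviderGate {
    // Without a sentinel there is nothing to consult, so the gate is a no-op.
    if (!this.sentinel) return "off";
    // With a sentinel, an explicit setting wins; otherwise default to enforce.
    return this.configured ?? "enforce";
  }
}
```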

What this PR does NOT do

Does not gate explicit-model requests (model=x-ai/grok-3-mini direct). If a user explicitly asks for a degraded provider, BR honors the choice. A follow-up could 503 those requests with a recovery hint, but that is a behavior change, not a routing-policy change, and out of scope here.

Does not add a backoff window: the gate reflects sentinel's current state. Sentinel itself controls the recovery cadence (deep probe every 60s; once a provider flips back to healthy, the next selection sees it available). A dedicated debounce window for "recently degraded but now healthy" providers is a separate refinement.
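
To make the "no backoff window" point concrete, here is a small sketch in which the gate reads sentinel state fresh on every selection. `FlappingSentinel`, `SentinelView`, and `selectableProviders` are hypothetical names used only for this illustration; the real sentinel updates its state from probes rather than from a manual setter.

```typescript
interface SentinelView {
  getDegradedOrWorseProviderIds(): Set<string>;
}

// Test double whose state can be flipped by hand.
class FlappingSentinel implements SentinelView {
  private readonly unhealthy = new Set<string>();

  setHealthy(provider: string, healthy: boolean): void {
    if (healthy) this.unhealthy.delete(provider);
    else this.unhealthy.add(provider);
  }

  getDegradedOrWorseProviderIds(): Set<string> {
    return new Set(this.unhealthy); // copy, so callers cannot mutate state
  }
}

// No caching and no debounce: each call reflects the sentinel's current state.
function selectableProviders(
  all: readonly string[],
  sentinel: SentinelView,
): string[] {
  const unhealthy = sentinel.getDegradedOrWorseProviderIds();
  return all.filter((p) => !unhealthy.has(p));
}
```

A debounce for recently recovered providers would need extra state (for example, a last-degraded timestamp per provider), which is why it is treated as a separate refinement.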

Does not affect cascade fallback logic. The cascade reads sentinel state separately for fallback target selection.

Does not block on issue #280 (N=5 distribution baseline). The routing fix lands first because it removes a known systemic risk; the distribution baseline measures the fix's impact and is not a prerequisite for it.

Lockstep

TS / Python SDK / MCP — no public API surface change
OpenAPI — no new routes
Ship log — this file