2026-05-10-279-auto-selector-gates-on-sentinel
#279: Auto-router gates on endpoint-sentinel degraded state
Date: 2026-05-10 Status: shipped Slug: 279-auto-selector-gates-on-sentinel Branch: feature/279-auto-selector-gates-on-sentinel Closes: #279
Summary
ModelAutoSelector.select() now consults EndpointSentinel and excludes any endpoint whose provider is currently degraded / down / auth_error before any selection strategy runs. The exclusion is applied to all paths — variant strategy, history-based, complexity-based, cross-tenant warm-start, pattern-store recommendation, final fallback — because the gate runs upstream of all of them.
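The "upstream of all strategies" ordering can be illustrated with a minimal sketch. The `Endpoint` shape, the `Strategy` type, and `selectWithGate` are hypothetical stand-ins, not the real ModelAutoSelector internals. The only point is that the pool is filtered once, so every later strategy sees the gated pool.

```typescript
// Hypothetical shapes for illustration; the real types live under src/router/.
interface Endpoint {
  id: string;
  provider: string; // bare provider ID, e.g. "x-ai"
}

// A strategy either picks an endpoint from the pool or abstains with null.
type Strategy = (candidates: Endpoint[]) => Endpoint | null;

// Filter once, upstream, then let strategies run in order on the gated pool.
function selectWithGate(
  endpoints: Endpoint[],
  unhealthyProviders: Set<string>,
  strategies: Strategy[],
): Endpoint | null {
  const candidates = endpoints.filter((e) => !unhealthyProviders.has(e.provider));
  for (const strategy of strategies) {
    const pick = strategy(candidates);
    if (pick !== null) return pick;
  }
  return null;
}

const endpoints: Endpoint[] = [
  { id: "grok-3-mini", provider: "x-ai" },
  { id: "other-model", provider: "example-provider" }, // made-up endpoint
];

// Two stand-in strategies: one that abstains, and a final fallback.
const abstain: Strategy = () => null;
const firstAvailable: Strategy = (candidates) => candidates[0] ?? null;

const picked = selectWithGate(endpoints, new Set(["x-ai"]), [abstain, firstAvailable]);
console.log(picked?.id); // "other-model"
```

Because the filter sits in front of the strategy loop, adding a new strategy later would not require re-implementing the exclusion.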
Why this matters
Gauntlet baseline (PR #277) and second run (PR #278) both showed mission turns landing on x-ai/grok-3-mini while the sentinel was logging Provider x-ai: degraded continuously (~6 alarms/min):
F2 turn 1: model=x-ai/grok-3-mini finish_reason=length cost=0
L1 turn 1: model=x-ai/grok-3-mini finish_reason=length cost=0
The cost=0 came from the existing zero-completion-insurance waiver, which kicks in when a degraded provider returns empty/truncated content — i.e. the response was _known_ to be low-quality but the auto-selector had no mechanism to avoid the provider in the first place. getDownProviderKeys already existed but only covered down + auth_error; degraded was ignored.
This PR fixes the symptom in routing. It does not explain the -5 correctness deficit observed in the gauntlet (those scores were flat between runs even though grok was hit on F2/L1 in both — see issue #280's N=5 distribution analysis for the actual cause), but it removes a real correctness risk that would otherwise compound on any run where a provider flaps.
Files
Modified
- src/router/endpoint-sentinel.ts: added getDegradedOrWorseProviderIds(): Set. Returns bare provider IDs (e.g. "x-ai") by stripping the | suffix from composite keys, because the auto-selector matches against endpoint.provider. Includes degraded, down, and auth_error; getDownProviderKeys (existing) only covered the latter two.
- src/router/model-auto-selector.ts:
  - AutoSelectorConfig accepts endpointSentinel and degradedProviderGate ("off" | "warn" | "enforce")
  - Constructor + new wireSentinel(sentinel) setter (the sentinel is built after wireIntelligence runs in model-router-init.ts, so a dedicated setter avoids a construction-order shuffle)
  - Gate runs after the forced-tool-choice exclusion and before all selection strategies
  - AutoSelectResult gains optional excludedDueToDegradedProvider: number and degradedProvidersExcluded: string[] for telemetry
  - Default mode: "enforce" when sentinel is wired, "off" when not; preserves legacy behavior on tests that don't opt in
- src/router/model-router-init.ts: calls autoSelector.wireSentinel(sentinel) after sentinel is constructed
Tests
- src/router/model-auto-selector.test.ts: new describe("issue #279 — degraded-provider gate") block, 6 tests:
  - excludes degraded providers (default = enforce)
  - warn mode reports telemetry without filtering
  - off mode is a true no-op
  - no-op when sentinel is unwired (legacy behavior)
  - zero telemetry when sentinel reports nothing degraded
  - exclusion stacks with caller-provided excludeEndpointIds
- src/router/endpoint-sentinel.test.ts: new describe("issue #279 — getDegradedOrWorseProviderIds") block, 3 tests:
  - returns bare provider IDs for degraded/down/auth_error
  - includes auth_error explicitly
  - empty set when all healthy
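The behavior that the three sentinel tests pin down can be sketched roughly as follows. This is an illustration under assumptions, not the shipped method. The real getDegradedOrWorseProviderIds is a no-argument method on EndpointSentinel, whereas `collectDegradedOrWorseProviderIds` here is a hypothetical free function over a status map. The composite-key layout `"<provider>|<rest>"` and the `us`/`eu` suffixes are also made up for the example.

```typescript
type ProviderStatus = "healthy" | "degraded" | "down" | "auth_error";

// Assumption: sentinel state is keyed by composite strings "<provider>|<rest>".
function collectDegradedOrWorseProviderIds(
  states: Map<string, ProviderStatus>,
): Set<string> {
  const ids = new Set<string>();
  states.forEach((status, compositeKey) => {
    if (status === "healthy") return; // only degraded, down, auth_error count
    const sep = compositeKey.indexOf("|");
    // Keep only the bare provider ID so it can be compared with endpoint.provider.
    ids.add(sep === -1 ? compositeKey : compositeKey.slice(0, sep));
  });
  return ids;
}

const states = new Map<string, ProviderStatus>([
  ["x-ai|us", "degraded"],
  ["x-ai|eu", "down"],
  ["acme|us", "auth_error"],
  ["example|us", "healthy"],
]);

const ids = collectDegradedOrWorseProviderIds(states);
console.log(Array.from(ids).sort()); // ["acme", "x-ai"]
```

Returning a Set deduplicates providers that have more than one unhealthy composite key, which keeps the gate's membership check to a single lookup per endpoint.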
Verification
- pnpm tsgo: exit 0
- pnpm test:fast: 7772/0 (+9 from #279 tests)
- pnpm exec oxfmt --check / oxlint --type-aware: clean
Configuration
By default — once sentinel is wired in model-router-init.ts — the gate is in enforce mode. Operators can override per-environment via AutoSelectorConfig.degradedProviderGate:
| Mode | Behavior |
|---|---|
| "off" | Legacy behavior: no exclusion, no telemetry. |
| "warn" | Compute the exclusion set, log it, populate the telemetry, but do not actually filter. Useful for measuring impact before flipping. |
| "enforce" (default with sentinel) | Exclude. The standard production posture. |
Recommended rollout if anyone wants to be conservative: deploy in warn mode for one full sentinel cycle (60s × N intervals' worth of telemetry), verify the excludedDueToDegradedProvider counters don't spike on healthy operation, then flip to enforce. Skipping straight to enforce is also fine: the gate fails open, in the sense that a missing sentinel means a no-op (nothing is filtered), not a no-routing-possible state.
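The three modes and the default resolution can be sketched as below. This is a hedged illustration, not the shipped code. `SentinelLike`, `GateResult`, `resolveMode`, and `applyGate` are hypothetical names. Two choices are assumptions: reporting zeroed telemetry in "off" mode, and listing only providers that actually matched an endpoint.

```typescript
type GateMode = "off" | "warn" | "enforce";

interface Endpoint {
  id: string;
  provider: string;
}

// Hypothetical minimal view of the sentinel used by the gate.
interface SentinelLike {
  getDegradedOrWorseProviderIds(): Set<string>;
}

interface GateResult {
  candidates: Endpoint[];
  excludedDueToDegradedProvider: number;
  degradedProvidersExcluded: string[];
}

// Default: "enforce" when a sentinel is wired, "off" when it is not.
function resolveMode(configured: GateMode | undefined, sentinel?: SentinelLike): GateMode {
  if (!sentinel) return "off"; // a missing sentinel is always a no-op
  return configured ?? "enforce";
}

function applyGate(
  endpoints: Endpoint[],
  sentinel?: SentinelLike,
  configured?: GateMode,
): GateResult {
  const mode = resolveMode(configured, sentinel);
  if (mode === "off" || !sentinel) {
    return { candidates: endpoints, excludedDueToDegradedProvider: 0, degradedProvidersExcluded: [] };
  }
  const unhealthy = sentinel.getDegradedOrWorseProviderIds();
  const hit = endpoints.filter((e) => unhealthy.has(e.provider));
  const kept = endpoints.filter((e) => !unhealthy.has(e.provider));
  return {
    // "warn" computes the telemetry but leaves the pool untouched.
    candidates: mode === "enforce" ? kept : endpoints,
    excludedDueToDegradedProvider: hit.length,
    degradedProvidersExcluded: Array.from(new Set(hit.map((e) => e.provider))).sort(),
  };
}

const pool: Endpoint[] = [
  { id: "grok-3-mini", provider: "x-ai" },
  { id: "other-model", provider: "example-provider" }, // made-up endpoint
];
const sentinel: SentinelLike = {
  getDegradedOrWorseProviderIds: () => new Set(["x-ai"]),
};

console.log(applyGate(pool, sentinel, "warn").candidates.length); // 2
console.log(applyGate(pool, sentinel).candidates.length); // 1 (default is enforce)
console.log(applyGate(pool, undefined, "enforce").candidates.length); // 2 (no sentinel)
```

In a warn-mode rollout, the counters in this sketch would show how many endpoints enforce mode would have removed, without changing routing.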
What this PR does NOT do
- Does not gate explicit-model requests (model=x-ai/grok-3-mini direct). If a user explicitly asks for a degraded provider, BR honors the choice. A follow-up could 503 those requests with a recovery hint, but that is a behavior change, not a routing-policy change, and out of scope here.
- Does not add a backoff window: the gate reflects sentinel's current state. Sentinel itself controls the recovery cadence (deep probe every 60s; once a provider flips back to healthy, the next selection sees it available). A dedicated debounce window for "recently degraded but now healthy" providers is a separate refinement.
- Does not affect cascade fallback logic. The cascade reads sentinel state separately for fallback target selection.
- Does not block on issue #280 (N=5 distribution baseline). The routing fix lands first because it removes a known systemic risk; the distribution baseline measures the fix's impact and is not a prerequisite for it.
Lockstep
- TS / Python SDK / MCP — no public API surface change
- OpenAPI — no new routes
- Ship log — this file