Provider doctor: BR explains its own circuit trips instead of waiting for paging
2026-04-30
LOCKSTEP TRACEABILITY MATRIX
api_endpoints: ["POST /v1/ops/diagnose"]
sdk_methods_updated: ["ts: ops.diagnose()", "py: Ops.diagnose / AsyncOps.diagnose"]
mcp_tools_updated: ["br_diagnose_providers"]
What We Built
A self-diagnosis engine that takes BR's existing failure signals (sentinel lastError, circuit-breaker state, provider catalog) and classifies each non-healthy provider into a structured category — credential_invalid, quota_exhausted, upstream_outage, config_drift, network, rate_limited_self — with a hypothesis, cited evidence, severity, and remediation steps. Exposed as POST /v1/ops/diagnose, Ops.diagnose() in both SDKs, and the br_diagnose_providers MCP tool.
The engine is a pure module (src/router/intelligence/provider-doctor.ts) with 24 unit tests covering every category, priority tie-breaking, severity scaling by circuit-open ratio, and a reproduction of the exact R15 production state (5 of 7 providers down).
Why It Matters
The R15 stochastic assessment had only one finding cited by all 10 reviewers: 5 of 7 providers fully circuit-tripped, with no on-call paging. The reviewers' consensus action was the same as R14's — wire alerting. That action did not land between rounds, and the provider state worsened (4/7 → 5/7).
This ships a different shape of fix. Instead of waking a human at 3am to look at a status page and then read 1Password, BR now diagnoses the failure itself: it pattern-matches the upstream error string against known categories, reports its hypothesis with the exact substrings that matched, and produces a remediation list (rotate the key from 1Password, check the billing dashboard, list deprecated models, etc.). The diagnosis is callable on demand from any agent in the mesh; it can be wrapped by a cron, a CI job, or a downstream self-healing agent that flips remediations from suggestions into actions.
Paging stays valuable for cases the diagnostic engine reports as unknown / low-confidence. For everything else, the gateway answers its own question.
How It Works
diagnoseProvider(input) takes a ProviderDoctorInput (provider id, sentinel health, last error string, total models, circuit-open count) and returns a ProviderDiagnosis:
    {
      providerId: "openai",
      category: "credential_invalid",
      confidence: "high",
      severity: "critical",
      hypothesis: "openai credentials are rejected at the API edge (signals: 401 / unauthorized). Most likely the deployed API key is expired, rotated, or the wrong tenant.",
      evidence: ["lastError: 401 Unauthorized", "pattern: 401 / unauthorized", "circuit-open: 8/8"],
      remediation: {
        autoRemediable: false,
        actions: [
          "Read the provider's API key from 1Password",
          "Compare against the value in the ECS task definition's secret store",
          "Rotate the key at the provider, write the new value to 1Password, then redeploy",
          "If rotation doesn't fix it: confirm tenant / org binding hasn't changed",
        ],
        requiresApproval: true,
      },
    }
Pattern matching uses a single table in provider-doctor.ts:PATTERNS — adding a new failure signature (e.g. a new provider's specific 4xx phrasing) is one diff in that file. Multiple patterns can fire on a single error string; the doctor scores by confidence and breaks ties by category priority (credential > quota > upstream > config > network).
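A minimal sketch of how such a table and tie-break could look. The entry shapes, regexes, and the classify helper are illustrative assumptions, not the contents of provider-doctor.ts; only the scoring rule (confidence first, then category priority) comes from the description above. Categories outside the stated priority order are omitted.

```typescript
type Category =
  | "credential_invalid"
  | "quota_exhausted"
  | "upstream_outage"
  | "config_drift"
  | "network";
type Confidence = "high" | "medium" | "low";

interface Pattern {
  category: Category;
  confidence: Confidence;
  label: string; // human-readable name, cited back in the evidence list
  regex: RegExp;
}

// Hypothetical entries for illustration; the real table holds 27 patterns.
const PATTERNS: Pattern[] = [
  { category: "credential_invalid", confidence: "high", label: "401 / unauthorized", regex: /\b401\b|unauthorized/i },
  { category: "quota_exhausted", confidence: "high", label: "quota / billing", regex: /insufficient_quota|billing/i },
  { category: "upstream_outage", confidence: "medium", label: "5xx", regex: /\b50[0234]\b/ },
  { category: "network", confidence: "medium", label: "socket error", regex: /ECONNRESET|ETIMEDOUT/ },
];

const CONFIDENCE_RANK: Record<Confidence, number> = { high: 3, medium: 2, low: 1 };

// Earlier in the list wins ties: credential > quota > upstream > config > network.
const PRIORITY: Category[] = [
  "credential_invalid",
  "quota_exhausted",
  "upstream_outage",
  "config_drift",
  "network",
];

// Returns the winning pattern, or undefined when nothing matches.
function classify(lastError: string): Pattern | undefined {
  const hits = PATTERNS.filter((p) => p.regex.test(lastError));
  hits.sort(
    (a, b) =>
      CONFIDENCE_RANK[b.confidence] - CONFIDENCE_RANK[a.confidence] ||
      PRIORITY.indexOf(a.category) - PRIORITY.indexOf(b.category),
  );
  return hits[0];
}
```

Under these assumptions, an error string that matches both a credential and a quota pattern at equal confidence resolves to credential_invalid, and an unmatched string falls through to whatever the caller treats as unknown.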
Severity scales by what fraction of the provider's catalog is affected: critical at full circuit-open ratio, high at ≥50%, medium for any open circuit, low for non-down without circuit involvement. Credential failures jump straight to critical regardless of ratio because every model behind that key is dead.
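The severity rule can be sketched as a small pure function. The name severityFor and its parameter list are assumptions for illustration; the thresholds are the ones stated above.

```typescript
type Severity = "critical" | "high" | "medium" | "low";

// Hypothetical helper: maps a diagnosis category plus circuit counts to a severity.
function severityFor(category: string, circuitOpen: number, totalModels: number): Severity {
  // Credential failures are critical regardless of ratio:
  // every model behind that key is dead.
  if (category === "credential_invalid") return "critical";

  // Guard against an empty catalog to avoid dividing by zero.
  const ratio = totalModels > 0 ? circuitOpen / totalModels : 0;

  if (ratio >= 1) return "critical"; // every circuit is open
  if (ratio >= 0.5) return "high"; // at least half the catalog
  if (circuitOpen > 0) return "medium"; // any open circuit
  return "low"; // unhealthy, but no circuit involvement
}
```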
diagnoseFleet(inputs) aggregates per-provider diagnoses, computes fleet-wide severity, and emits a single-line summary suitable for Ship Log entries or paging payloads.
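A rough sketch of the aggregation step, under two assumptions the text does not confirm: that fleet severity is the worst per-provider severity, and that the summary line has the format shown here. The aggregateFleet name and the trimmed-down types are hypothetical stand-ins for diagnoseFleet and its real types.

```typescript
type Severity = "critical" | "high" | "medium" | "low";

// Trimmed-down stand-ins for ProviderDiagnosis and FleetDiagnosis.
interface ProviderDiagnosisLite {
  providerId: string;
  category: string;
  severity: Severity;
}
interface FleetDiagnosisLite {
  severity: Severity;
  summary: string;
  providers: ProviderDiagnosisLite[];
}

const SEVERITY_ORDER: Severity[] = ["low", "medium", "high", "critical"];

function aggregateFleet(
  diagnoses: ProviderDiagnosisLite[],
  totalProviders: number,
): FleetDiagnosisLite {
  // Assumption: the fleet is as severe as its worst provider.
  let worst: Severity = "low";
  for (const d of diagnoses) {
    if (SEVERITY_ORDER.indexOf(d.severity) > SEVERITY_ORDER.indexOf(worst)) {
      worst = d.severity;
    }
  }

  // Count diagnoses per category for the one-line breakdown.
  const counts: Record<string, number> = {};
  for (const d of diagnoses) {
    counts[d.category] = (counts[d.category] || 0) + 1;
  }
  const breakdown = Object.keys(counts)
    .map((c) => `${c} x${counts[c]}`)
    .join(", ");

  const summary =
    diagnoses.length === 0
      ? `0/${totalProviders} providers unhealthy`
      : `${diagnoses.length}/${totalProviders} providers unhealthy (${breakdown}); fleet severity ${worst}`;

  return { severity: worst, summary, providers: diagnoses };
}
```

Fed five credential_invalid diagnoses out of seven providers, this sketch yields a fleet severity of critical, consistent with the R15 reproduction described in this entry.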
The POST /v1/ops/diagnose route assembles inputs from ops-service plus the sentinel's getStatus(providerId)?.lastError (which ops-service narrows out of its public shape). The MCP tool br_diagnose_providers is the agent-facing wrapper; it's the natural follow-up call after br_get_ops_status reveals down providers.
The Numbers
- 8 diagnostic categories, 27 patterns in the table.
- 24 unit tests, all passing. Tests include a reproduction of the exact R15 production state (5 of 7 providers circuit-tripped), which produces 5 credential_invalid diagnoses and a fleet severity of critical, turning the assessment's #1 risk into a structured, actionable artifact in <1 ms.
- Zero new dependencies. The engine is pure TypeScript, no LLM calls, no I/O.
- Route count moves from 435 → 436 (POST /v1/ops/diagnose).
Competitive Edge
No funded competitor (Portkey, OpenRouter, Helicone, LiteLLM, Bifrost, Letta) ships a self-diagnosing gateway. They expose status endpoints, alert on circuit trips, and hand the operator a dashboard. BR explains _why_ the dashboard looks the way it does, in the same primitive that surfaces the problem. This is the foundation under "BR operates itself" — a category step away from "BR is well-monitored."
The remediation list is intentionally conservative in this commit: every entry sets autoRemediable: false and most set requiresApproval: true. Future commits that flip specific remediations to auto-action (credential rotation from 1Password, catalog refresh from a ModelIngestor) bolt onto this engine without rewriting it.
Follow-ups (NOT in this commit)
- Subscribe the doctor to circuitBreaker.onStateChange so a Ship Log entry is auto-filed the moment a provider trips, not only on a polled /v1/ops/diagnose call.
- Wire diagnose() into a CloudWatch alarm: page only on unknown + critical (the cases the doctor can't explain on its own).
- Add an LLM-narrated summary endpoint that uses BR's own router to write the natural-language postmortem from the structured diagnosis.
- Flip credential rotation from requiresApproval: true to a self-healing agent that reads from 1Password, attempts a probe with the new value, and only pages if the probe fails.
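The paging follow-up could reduce to a predicate like the one below. This is one reading of "unknown + critical" (category unknown and severity critical, both required); the shouldPage name and the input shape are hypothetical, and none of this is in the commit.

```typescript
interface DiagnosisForPaging {
  providerId: string;
  category: string; // "unknown" is assumed to mark cases no pattern explained
  severity: "critical" | "high" | "medium" | "low";
}

// Page a human only when the doctor cannot explain a critical failure.
function shouldPage(d: DiagnosisForPaging): boolean {
  return d.category === "unknown" && d.severity === "critical";
}

// Select the subset of a fleet diagnosis that warrants a page.
function providersToPage(diagnoses: DiagnosisForPaging[]): string[] {
  return diagnoses.filter(shouldPage).map((d) => d.providerId);
}
```

Explained failures such as credential_invalid would then surface through the diagnosis itself and never reach the alarm.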
Lockstep Checklist
- [x] API Routes: src/api/routes/ops.ts adds POST /v1/ops/diagnose. routes.json regenerated to 436 routes.
- [x] TS SDK: packages/sdk-ts/src/resources/ops.ts adds Ops.diagnose() + FleetDiagnosis + ProviderDiagnosis types.
- [x] Python SDK: packages/sdk-py/src/brainstormrouter/resources/ops.py adds Ops.diagnose() + AsyncOps.diagnose().
- [x] MCP Schemas: src/mcp/handlers/admin.ts adds br_diagnose_providers. src/mcp/server.ts adds diagnoseProviders dep. src/mcp/tool-manifest.ts and site/public/.well-known/agents.json and site/public/llms.txt list the new tool.
- [x] Master Record: Listed under router/intelligence subsystem.
Cites: R15 stochastic-assessment finding (docs/assessment-synthesis.md) — risk row "5/7 providers fully circuit-tripped … no paging" cited 10/10. This commit converts that risk into a callable diagnostic instead of a wakeup.