Structured Error Recovery: Every error tells the agent what to do next

2026-03-25

error-recovery, router, intelligence

What We Built

All API error responses now include machine-actionable recovery instructions. Every 4xx/5xx error carries a recovery object inside the error envelope, with fields such as action, code, alternative_model, wait_ms, reason, and docs_url. The agent can decide its next action programmatically from the response alone.

Added upstream error body parsing for Anthropic, OpenAI, and Google error formats. Instead of returning raw nested JSON like "anthropic 400: {\"type\":\"error\",...}", the system now extracts the structured error code and maps it to one of 6 recovery action types: retry, switch_model, reduce_input, wait, escalate, or retry_with_params.

Added intelligence integration via a callback pattern. The handleRouterError function now accepts an optional intelligence parameter with a suggestModel callback. When wired to the Thompson sampling bandit, error responses name the best alternative model instead of a generic "check the catalog" message.
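The callback pattern described above can be sketched roughly as follows. Only the names handleRouterError, intelligence, and suggestModel come from this post; the parameter list, the return shape, and the fixed switch_model action are assumptions made to keep the example small, not the actual implementation.

```typescript
// Hypothetical shapes: the real handleRouterError signature may differ.
type RecoveryAction =
  | "retry" | "switch_model" | "reduce_input"
  | "wait" | "escalate" | "retry_with_params";

interface Recovery {
  action: RecoveryAction;
  code: string;
  reason: string;
  alternative_model?: string;
  wait_ms?: number;
  docs_url?: string;
}

interface Intelligence {
  // Returns the best alternative to the failed model, or undefined if none is known.
  suggestModel: (failedModel: string) => string | undefined;
}

function handleRouterError(
  code: string,
  failedModel: string,
  intelligence?: Intelligence, // optional: callers without a bandit simply omit it
): { error: { message: string; code: string; recovery: Recovery } } {
  const recovery: Recovery = {
    action: "switch_model",
    code,
    reason: `Model ${failedModel} is unavailable.`,
  };
  // Only attach alternative_model when the callback produced a suggestion.
  const suggestion = intelligence?.suggestModel(failedModel);
  if (suggestion !== undefined) {
    recovery.alternative_model = suggestion;
  }
  return { error: { message: recovery.reason, code, recovery } };
}

// Usage with a stub in place of the Thompson sampling bandit (model names are made up).
const stubIntelligence: Intelligence = { suggestModel: () => "model-b" };
const withHint = handleRouterError("model_not_found", "model-a", stubIntelligence);
const withoutHint = handleRouterError("model_not_found", "model-a");
```

Passing the suggestion source as a callback keeps the shared error utilities free of a direct dependency on the routing intelligence, and the error path still works when no callback is supplied.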

Why It Matters

Without recovery hints, an AI agent that hits an error has to parse nested upstream JSON, guess what went wrong, and figure out what to do next. Most agents just retry blindly. BrainstormRouter is now the only gateway that actively coaches agents through failures, telling them exactly which model to switch to, how long to wait, or which parameters to change.

How It Works

{
  "error": {
    "message": "Model context limit exceeded",
    "type": "upstream_error",
    "code": "context_limit_exceeded",
    "recovery": {
      "action": "reduce_input",
      "code": "context_limit_exceeded",
      "max_tokens_supported": 128000,
      "reason": "Model maximum is 128000 tokens.",
      "docs_url": "https://docs.brainstormrouter.com/errors/context_limit_exceeded"
    }
  }
}
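On the agent side, a response like the one above could be turned into a next step with a small dispatcher. This is a sketch, not SDK code: planNextStep and NextStep are hypothetical names, the 1000 ms default wait is an assumption, and because this post does not show which field carries parameter changes for retry_with_params, that action is handed back to the caller.

```typescript
// Field names follow the example response; everything else here is hypothetical.
interface RecoveryHint {
  action: "retry" | "switch_model" | "reduce_input" | "wait" | "escalate" | "retry_with_params";
  code: string;
  reason?: string;
  alternative_model?: string;
  wait_ms?: number;
  max_tokens_supported?: number;
  docs_url?: string;
}

type NextStep =
  | { kind: "resend"; delayMs: number; model?: string; maxInputTokens?: number }
  | { kind: "give_up"; reason: string };

function planNextStep(hint: RecoveryHint): NextStep {
  switch (hint.action) {
    case "retry":
      return { kind: "resend", delayMs: 0 };
    case "wait":
      // Assumed default of 1000 ms when the response omits wait_ms.
      return { kind: "resend", delayMs: hint.wait_ms ?? 1000 };
    case "switch_model":
      return hint.alternative_model
        ? { kind: "resend", delayMs: 0, model: hint.alternative_model }
        : { kind: "give_up", reason: "no alternative model suggested" };
    case "reduce_input":
      return { kind: "resend", delayMs: 0, maxInputTokens: hint.max_tokens_supported };
    case "retry_with_params":
    case "escalate":
      // Not automated in this sketch: surface the reason to the caller.
      return { kind: "give_up", reason: hint.reason ?? hint.code };
  }
}

// The recovery object from the example response.
const exampleStep = planNextStep({
  action: "reduce_input",
  code: "context_limit_exceeded",
  max_tokens_supported: 128000,
  reason: "Model maximum is 128000 tokens.",
});
```

For the example response, the agent would trim its input to at most 128000 tokens and resend immediately.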

The parseUpstreamError() function handles provider-specific error formats:

  • Anthropic: error.type (overloaded_error, rate_limit_error, invalid_request_error)
  • OpenAI: error.code (rate_limit_exceeded, context_length_exceeded, model_not_found)
  • Google: error.status (RESOURCE_EXHAUSTED, INVALID_ARGUMENT, NOT_FOUND)
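A minimal sketch of the per-provider lookup listed above, assuming each provider's error body is JSON with a top-level error object. The helper names extractUpstreamCode and classifyUpstream are hypothetical, and the code-to-action table is illustrative only: the real parseUpstreamError() covers more codes and may map them differently.

```typescript
type Provider = "anthropic" | "openai" | "google";
type Action =
  | "retry" | "switch_model" | "reduce_input"
  | "wait" | "escalate" | "retry_with_params";

// Hypothetical helper: pulls the provider-specific error code out of a raw upstream body.
function extractUpstreamCode(provider: Provider, rawBody: string): string | undefined {
  let body: unknown;
  try {
    body = JSON.parse(rawBody);
  } catch {
    return undefined; // not JSON, nothing to classify
  }
  const err = (body as { error?: Record<string, unknown> } | null)?.error;
  if (typeof err !== "object" || err === null) return undefined;
  // Anthropic uses error.type, OpenAI uses error.code, Google uses error.status.
  const field = provider === "anthropic" ? "type" : provider === "openai" ? "code" : "status";
  const value = err[field];
  return typeof value === "string" ? value : undefined;
}

// Illustrative mapping, assumed for this example.
const ACTION_BY_CODE = new Map<string, Action>([
  ["overloaded_error", "retry"],
  ["rate_limit_error", "wait"],
  ["rate_limit_exceeded", "wait"],
  ["RESOURCE_EXHAUSTED", "wait"],
  ["context_length_exceeded", "reduce_input"],
  ["model_not_found", "switch_model"],
  ["NOT_FOUND", "switch_model"],
]);

function classifyUpstream(provider: Provider, rawBody: string): { code: string; action: Action } {
  const code = extractUpstreamCode(provider, rawBody);
  if (code === undefined) {
    return { code: "unknown_upstream_error", action: "escalate" };
  }
  // Unrecognized codes fall back to escalate rather than a blind retry.
  return { code, action: ACTION_BY_CODE.get(code) ?? "escalate" };
}

const anthropicResult = classifyUpstream(
  "anthropic",
  '{"type":"error","error":{"type":"overloaded_error","message":"Overloaded"}}',
);
```

Falling back to escalate for anything unrecognized is a design choice of this sketch: it avoids telling an agent to retry an error the gateway could not classify.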

The Numbers

  • 33 error types with recovery hints (all now include a code field)
  • 6 recovery action types: retry, switch_model, reduce_input, wait, escalate, retry_with_params
  • 3 upstream providers with dedicated parsers (Anthropic, OpenAI, Google)
  • 46 new tests (39 error-recovery + 7 router-errors)
  • Backward compatible: during the transition, the recovery object is returned both at error.recovery and at the top level of the response
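During the transition, a client can read the recovery object from either location. The helper below is a sketch under one assumption not stated in this post: that the top-level copy uses the key recovery. The names readRecovery, ErrorBody, and RecoveryFields are hypothetical.

```typescript
interface RecoveryFields {
  action: string;
  code: string;
  [key: string]: unknown; // other fields (wait_ms, alternative_model, ...) pass through
}

interface ErrorBody {
  error?: { recovery?: RecoveryFields };
  recovery?: RecoveryFields; // assumed key for the top-level copy kept during the transition
}

// Prefer the new location inside the error envelope, fall back to the top-level copy.
function readRecovery(body: ErrorBody): RecoveryFields | undefined {
  return body.error?.recovery ?? body.recovery;
}

const fromEnvelope = readRecovery({
  error: { recovery: { action: "wait", code: "rate_limited", wait_ms: 500 } },
});
const fromTopLevel = readRecovery({
  recovery: { action: "retry", code: "overloaded" },
});
```

Clients written this way keep working once the top-level copy is removed.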

Competitive Edge

No other AI gateway parses upstream provider errors and returns structured recovery instructions. Portkey returns raw upstream errors. OpenRouter wraps them but doesn't classify or suggest alternatives. BrainstormRouter uses its intelligence systems (Thompson sampling, circuit breaker) to suggest the actual best alternative model — not a generic fallback.

Lockstep Checklist

  • [x] API Routes: No route changes — src/api/shared/ utilities only.
  • [x] TS SDK: N/A — response shape enrichment, SDKs pass through JSON.
  • [x] Python SDK: N/A — same reason.
  • [x] MCP Schemas: N/A — no new tools.
  • [x] Master Record: Error recovery system already documented.