The Agentic Proxy

How BrainstormRouter turns a dumb proxy into an autonomous agent.

Two modes of operation

BrainstormRouter serves every request through an OpenAI-compatible /v1/chat/completions endpoint. How the request is processed depends on the mode parameter:

Proxy mode (default)

Standard pass-through routing. Your request is sent to the best available provider based on cost, latency, and health. Circuit breakers handle failures, fallback chains provide redundancy, and Guardian estimates cost before execution.

This is what every AI gateway does.
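
The routing behavior described above can be sketched roughly as follows. This is a minimal illustration, not BrainstormRouter's ModelRouter: the `Provider` fields, the scoring weights, the failure threshold, and the cooldown are all assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Provider:
    """Hypothetical provider record; field names are illustrative only."""
    name: str
    cost_per_1k_tokens: float          # USD
    avg_latency_ms: float
    failures: int = 0                  # consecutive failures observed
    opened_at: Optional[float] = None  # time the circuit breaker tripped


FAILURE_THRESHOLD = 3    # assumed: trip the breaker after 3 straight failures
COOLDOWN_SECONDS = 30.0  # assumed: retry a tripped provider after 30 s


def is_available(p, now):
    """Usable if the breaker is closed or its cooldown has elapsed."""
    return p.opened_at is None or now - p.opened_at >= COOLDOWN_SECONDS


def record_failure(p, now):
    p.failures += 1
    if p.failures >= FAILURE_THRESHOLD:
        p.opened_at = now  # open the breaker


def record_success(p):
    p.failures = 0
    p.opened_at = None  # close the breaker


def fallback_chain(providers, now, cost_weight=1.0, latency_weight=0.001):
    """Return healthy providers ordered best-first by a weighted score."""
    healthy = [p for p in providers if is_available(p, now)]
    return sorted(
        healthy,
        key=lambda p: cost_weight * p.cost_per_1k_tokens
        + latency_weight * p.avg_latency_ms,
    )


def route(providers, send, now=None):
    """Try each provider in the chain until one succeeds."""
    now = time.monotonic() if now is None else now
    last_error = None
    for p in fallback_chain(providers, now):
        try:
            result = send(p)
        except Exception as exc:  # any provider error moves on to the next one
            record_failure(p, now)
            last_error = exc
            continue
        record_success(p)
        return p.name, result
    raise RuntimeError("no provider available") from last_error
```

Here `send` stands in for whatever function actually forwards the request to a provider. The `now` parameter is passed explicitly so the breaker logic can be exercised without waiting on a real clock.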

Agentic mode

Add "mode": "agentic" to the request body and the request routes through The Soul, BrainstormRouter's embedded agent engine. The Soul is a multi-turn agent runner with:

  • Persistent memory: Letta-style core and archival memory, scoped to your tenant
  • Tool use: the agent can call tools, including memory read/write operations
  • Session management: multi-turn conversations with automatic compaction
  • Streaming: full SSE streaming support, same as proxy mode

from openai import OpenAI

# Placeholder base URL and API key (assumptions): use the values for your deployment.
client = OpenAI(base_url="https://YOUR_BRAINSTORMROUTER_HOST/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-5",
    messages=[{"role": "user", "content": "What did we discuss last week?"}],
    extra_body={"mode": "agentic"},  # route this request through The Soul
    stream=True,
)

Architecture

Request → API Gateway → Mode Check
                          ├─ proxy  → ModelRouter → Provider (Claude, GPT, Gemini, etc.)
                          └─ agentic → The Soul → Provider
                                         ├─ RMM Memory (Postgres pgvector)
                                         ├─ Tool execution loop
                                         └─ Session compaction

The Soul runs runEmbeddedPiAgent, a full agent execution loop that processes tool calls, manages conversation history, and persists memory across requests. The response format is identical to standard OpenAI completions, so existing response-handling code works unchanged.
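
To make the idea of a tool execution loop concrete, here is a generic sketch of the pattern. It is not the implementation of runEmbeddedPiAgent: the function names, the message dictionaries, and the in-memory dict standing in for persistent memory are all assumptions for illustration.

```python
import json


def run_agent_loop(call_model, tools, messages, max_turns=8):
    """Call the model repeatedly until it answers without requesting a tool.

    call_model(history) is assumed to return a dict with "content" and an
    optional "tool_calls" list of {"id", "name", "arguments"} entries, where
    "arguments" is a JSON string.
    """
    history = list(messages)
    for _ in range(max_turns):
        reply = call_model(history)
        history.append({"role": "assistant", **reply})
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply["content"], history  # final answer
        for call in tool_calls:
            args = json.loads(call["arguments"])
            result = tools[call["name"]](**args)
            # Feed each tool result back so the next model call can use it.
            history.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            })
    raise RuntimeError("agent did not finish within max_turns")


def make_memory_tools(store):
    """Build hypothetical memory read/write tools over a plain dict."""
    def memory_write(key, value):
        store[key] = value
        return {"ok": True}

    def memory_read(key):
        return {"value": store.get(key)}

    return {"memory_write": memory_write, "memory_read": memory_read}
```

The `max_turns` cap is a design choice for the sketch: it keeps a model that never stops requesting tools from looping forever.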

Guardian Intelligence

Every request (proxy or agentic) passes through the Guardian middleware layer:

  • Cost estimation before execution, reported via response headers
  • Budget enforcement with per-key daily/monthly limits
  • Rate limiting with sliding-window counters
  • PII scanning with configurable warn/block modes
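
As an illustration of the rate-limiting item above, the sketch below shows one common way to build a sliding-window counter: weight the previous fixed window's count by how much of it still overlaps the sliding window, then add the current window's count. Whether Guardian uses this exact variant is not stated here, and the class and parameter names are hypothetical.

```python
class SlidingWindowCounter:
    """Approximate sliding-window rate limiter using two fixed-window counts."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_index = 0   # index of the current fixed window
        self.current_count = 0
        self.previous_count = 0

    def _roll(self, now):
        index = int(now // self.window)
        if index == self.current_index:
            return
        # Only the immediately preceding window still overlaps the sliding one.
        if index == self.current_index + 1:
            self.previous_count = self.current_count
        else:
            self.previous_count = 0
        self.current_count = 0
        self.current_index = index

    def allow(self, now):
        """Return True and count the request if it fits within the limit."""
        self._roll(now)
        elapsed_fraction = (now - self.current_index * self.window) / self.window
        estimated = self.previous_count * (1.0 - elapsed_fraction) + self.current_count
        if estimated >= self.limit:
            return False
        self.current_count += 1
        return True
```

A per-key limiter would keep one such counter per API key, for example in a dict keyed by key ID. Timestamps are passed in explicitly so the behavior is easy to test.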

Guardian typically adds under 2 ms of overhead and attaches the following response headers:

X-BR-Estimated-Cost: $0.003
X-BR-Actual-Cost: $0.0028
X-BR-Efficiency: 0.93
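
A client can read these headers to track spend. The helper below is a minimal sketch that assumes the header values are formatted exactly as in the example above (a leading "$" on the two cost headers and a bare decimal for efficiency). The function name is made up for this example.

```python
def parse_cost_headers(headers):
    """Parse Guardian cost headers into floats; missing headers become None."""
    def money(value):
        return float(value.lstrip("$")) if value is not None else None

    raw_efficiency = headers.get("X-BR-Efficiency")
    return {
        "estimated": money(headers.get("X-BR-Estimated-Cost")),
        "actual": money(headers.get("X-BR-Actual-Cost")),
        "efficiency": float(raw_efficiency) if raw_efficiency is not None else None,
    }


example = {
    "X-BR-Estimated-Cost": "$0.003",
    "X-BR-Actual-Cost": "$0.0028",
    "X-BR-Efficiency": "0.93",
}
costs = parse_cost_headers(example)
print(costs["estimated"], costs["actual"], costs["efficiency"])
```

In the example values, the actual cost divided by the estimated cost rounds to 0.93, which matches the efficiency header. The definition of X-BR-Efficiency is not given here, so treat that relationship as an observation about the sample rather than a guarantee. Note that a plain dict lookup is case-sensitive, while the header mappings exposed by most HTTP clients are not.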