Chat Completions
OpenAI-compatible chat completions with optional agentic mode.
POST /v1/chat/completions
Send a chat completion request. Fully compatible with the OpenAI API spec, with an additional mode parameter for agentic processing.
Request
curl https://api.brainstormrouter.com/v1/chat/completions \
-H "Authorization: Bearer br_live_..." \
-H "Content-Type: application/json" \
-d '{
"model": "anthropic/claude-sonnet-4-5",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is BrainstormRouter?"}
],
"stream": true,
"mode": "agentic"
}'
Body parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
model | string | Yes | Model ID in provider/model format, or a strategy like auto |
messages | array | Yes | OpenAI-format message array |
stream | boolean | No | Enable SSE streaming (default: false) |
mode | string | No | Set to "agentic" to route through The Soul |
temperature | number | No | Sampling temperature (0-2) |
max_tokens | number | No | Maximum tokens to generate |
user | string | No | End-user identifier for tracking |
route | object | No | Routing controls — see Routing controls below |
cache | boolean | No | Set to false to bypass the prompt cache for this request |
x_no_cache | boolean | No | Equivalent to cache: false (legacy spelling) |
Routing controls
Use route to influence how BR picks a model when you pass "model": "auto" or a virtual model, and to constrain which providers are eligible.
{
"model": "auto",
"messages": [...],
"route": {
"strategy": "quality",
"providerAllow": ["anthropic", "openai"],
"providerDeny": ["deepseek"],
"maxCostPerRequest": 0.05,
"models": ["anthropic/claude-sonnet-4-5", "openai/gpt-4.1"],
"allowFallback": true
}
}
| Field | Type | Values |
|---|---|---|
strategy | string | "price" · "latency" · "throughput" · "priority" · "quality" · "cascade" |
providerAllow | string[] | Restrict selection to these provider IDs |
providerDeny | string[] | Exclude these provider IDs |
maxCostPerRequest | number | Reject a pick if its estimated cost exceeds this (USD) |
models | string[] | Restrict selection to this explicit model list |
allowFallback | boolean | If true, cascade to next provider on failure |
Strategy semantics (defined in src/router/model-router-types.ts):
price— cheapest model that matches the requestlatency— lowest p50 latency modelthroughput— highest observed tokens-per-secondpriority— tenant-configured provider preference orderquality— highest learned quality posterior (Thompson sampling)cascade— try providers in order until one succeeds
The default when strategy is omitted is quality. If strategy is set to a value not in the enum above, the request is rejected with a validation error — there is no silent fallback.
Cache controls
BR applies a semantic prompt cache to non-streaming requests by default. Identical prompts within a tenant serve a cached response instead of calling the upstream provider.
Bypass per request:
{ "model": "...", "messages": [...], "cache": false }
or the header alternative:
Cache key. The key is derived from model, messages, tools, response_format, temperature, max_tokens, _and_ route/provider.strategy. Two requests with identical prompts but different routing hints (e.g., one with strategy: "price", another with strategy: "quality") are cached separately.
Streaming requests are never cached.
Proxy mode (default)
When mode is omitted, the request routes through the ModelRouter:
- Provider selected based on model ID, cost, latency, and health
- Circuit breaker checked — unhealthy providers are skipped
- Fallback chain activated if the primary provider fails
- Guardian estimates cost and enforces budget limits
Agentic mode
When mode is "agentic", the request routes through The Soul:
- Tenant's core memory loaded into the system prompt
- Agent execution loop processes the message with tool access
- Memory tools (core + archival) available for read/write
- Response returned in standard OpenAI completion format
Response
Standard OpenAI chat completion response with BR additions inside usage:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1709000000,
"model": "anthropic/claude-sonnet-4-5",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "BrainstormRouter is an autonomous AI API engine..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 42,
"completion_tokens": 128,
"total_tokens": 170,
"cost_usd": 0.00042,
"routing_strategy": "quality"
}
}
Field (inside usage) | Description |
|---|---|
prompt_tokens | OpenAI standard |
completion_tokens | OpenAI standard |
total_tokens | OpenAI standard |
cost_usd | BR extension — actual USD charged for this request |
routing_strategy | BR extension — strategy that selected the model (if set) |
The cost_usd and routing_strategy fields are non-streaming only today. For streaming requests, read X-BR-Actual-Cost from the response headers.
Response headers
Every response includes Guardian intelligence headers:
| Header | Description |
|---|---|
X-BR-Guardian-Status | approved, warned, or blocked |
X-BR-Estimated-Cost | Pre-execution cost estimate (USD) |
X-BR-Actual-Cost | Post-execution actual cost (USD) |
X-BR-Efficiency | Cost efficiency percentage |
X-BR-Guardian-Overhead-Ms | Guardian processing time |
X-Model-Router-Endpoint | Provider endpoint used |
X-Model-Router-Latency-Ms | Provider response latency |
X-RateLimit-Remaining | Remaining requests in the current window |
Streaming
When stream: true, the response is delivered as Server-Sent Events:
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"},"index":0}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":" world"},"index":0}]}
data: [DONE]