Self-soak test, streaming cascade stall detection, open-source prep

2026-04-16

soak, cascade, ops, cron, monitoring, governance

What We Built

Self-soak test: BR now tests itself through its own API every 6 hours. A deterministic 6-endpoint test suite (health, models, completion with Thompson sampling, completion with explicit model, completion with DeepSeek, ops/status) runs via cron, validates every response, and publishes pass/fail metrics to CloudWatch. Failures fire a SoakTestFailureAlarm through SNS to Slack.

Streaming cascade stall detection: The cascade escalation system now handles a previously uncovered failure mode — when a provider connects but stalls (no data flows). A configurable first-chunk timeout (default 15s) aborts the stalled stream and escalates to the next cascade tier before the client receives any data.

/v1/ops/status provider health: The operational status endpoint now returns per-provider health aggregated from the endpoint sentinel and circuit breaker systems. 6 providers, 25 models, all visible.
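As a rough illustration of what "aggregated from the endpoint sentinel and circuit breaker" could mean, the sketch below folds two per-provider signals into a single health value. The state names (`up`/`down`, `closed`/`half-open`/`open`), the `ProviderStatus` shape, and the function names are all assumptions made for this example; the actual response schema of /v1/ops/status is not shown in this post.

```typescript
// Hypothetical state names and response shape. The post only states that
// per-provider health is aggregated from the sentinel and circuit breaker.
type SentinelState = "up" | "down";
type BreakerState = "closed" | "half-open" | "open";
type ProviderHealth = "healthy" | "degraded" | "unavailable";

interface ProviderSignals {
  sentinel: SentinelState;
  breaker: BreakerState;
  models: string[];
}

interface ProviderStatus {
  health: ProviderHealth;
  modelCount: number;
}

// Assumed policy: either system reporting a hard failure wins;
// a half-open breaker is reported as degraded.
function aggregateHealth(sentinel: SentinelState, breaker: BreakerState): ProviderHealth {
  if (sentinel === "down" || breaker === "open") return "unavailable";
  if (breaker === "half-open") return "degraded";
  return "healthy";
}

// Builds a per-provider map that could sit under a `providers` key.
function buildProvidersKey(
  signals: Record<string, ProviderSignals>,
): Record<string, ProviderStatus> {
  const providers: Record<string, ProviderStatus> = {};
  for (const [name, s] of Object.entries(signals)) {
    providers[name] = {
      health: aggregateHealth(s.sentinel, s.breaker),
      modelCount: s.models.length,
    };
  }
  return providers;
}
```

Treating an open breaker or a failed sentinel probe as "unavailable" is one possible policy; a real implementation might weight the two signals differently.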

Open-source preparation: Purged 150+ internal documents (investor materials, competitive intelligence, strategy docs, assessment files), fixed LICENSE copyright, added CODE_OF_CONDUCT.md, updated governance files, cleaned OpenClaw branding from .env examples and issue templates.

Why It Matters

The soak test closes the "who watches the watcher" gap. A gateway that routes AI traffic for customers needs to prove it works continuously, not just when someone runs a manual test. The soak test exercises the full HTTP stack (auth, middleware, routing, provider, response) through real API calls — the same path customer traffic takes.

Streaming cascade stall detection prevents a class of silent failures where a provider TCP-connects but never sends data. Without it, the client hangs until its own timeout fires. With it, BR transparently switches to the next model tier once the first-chunk timeout (15 seconds by default) elapses.

How It Works

Soak test

// src/soak/run-soak.ts — deterministic test suite
const SOAK_TESTS = [
  { name: "health", method: "GET", path: "/health" },
  { name: "models", method: "GET", path: "/v1/models" },
  { name: "completion_auto", method: "POST", path: "/v1/chat/completions",
    body: { model: "auto", messages: [...], max_tokens: 20 } },
  // ... 3 more tests
];
// Runs via cron, publishes to CloudWatch, fires alarm on failure
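The loop that consumes a list like SOAK_TESTS might look roughly like the sketch below. It is not the contents of run-soak.ts: `SoakTest`, `SoakResult`, `HttpClient`, `runSoak`, and `summarize` are hypothetical names, and "validation" here is reduced to checking the HTTP status, whereas the real suite validates each response in ways this post does not detail. The HTTP client is injected so the loop can be exercised without a network.

```typescript
// Hypothetical types and helpers; only the test names and paths come from the post.
type SoakTest = {
  name: string;
  method: "GET" | "POST";
  path: string;
  body?: unknown;
};

type SoakResult = { name: string; passed: boolean; status: number; latencyMs: number };

// Minimal request/response shapes so a stub can stand in for fetch.
type HttpInit = { method: string; headers: Record<string, string>; body?: string | undefined };
type HttpResponse = { status: number; ok: boolean };
type HttpClient = (url: string, init: HttpInit) => Promise<HttpResponse>;

// Runs each test sequentially and records pass/fail plus latency.
async function runSoak(
  baseUrl: string,
  tests: SoakTest[],
  client: HttpClient,
): Promise<SoakResult[]> {
  const results: SoakResult[] = [];
  for (const t of tests) {
    const started = Date.now();
    let status = 0;
    let passed = false;
    try {
      const res = await client(baseUrl + t.path, {
        method: t.method,
        headers: { "content-type": "application/json" },
        body: t.body === undefined ? undefined : JSON.stringify(t.body),
      });
      status = res.status;
      passed = res.ok; // assumption: a 2xx status counts as a pass
    } catch {
      // Network-level errors are recorded as failures with status 0.
    }
    results.push({ name: t.name, passed, status, latencyMs: Date.now() - started });
  }
  return results;
}

// Collapses results into the pass/fail counts a metrics publisher would need.
function summarize(results: SoakResult[]): { passed: number; failed: number } {
  const passed = results.filter((r) => r.passed).length;
  return { passed, failed: results.length - passed };
}
```

On a runtime with a global `fetch` (for example Node 18+), the client can be passed as `(url, init) => fetch(url, init)`. The counts from `summarize` are what a cron job would then publish as metrics.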

Streaming stall detection

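// src/router/model-router-cascade.ts
// timeoutPromise resolves to "timeout" once the first-chunk timeout (default 15s) elapses
const firstResult = await Promise.race([iterator.next(), timeoutPromise]);
if (firstResult === "timeout") {
  // No first chunk arrived in time: abort the stalled stream, try the next cascade tier
  void iterator.return?.();
  continue; // nothing has been sent to the client yet
}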
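To show the fragment above in context, here is a self-contained sketch of a first-chunk timeout wrapped around a list of cascade tiers. Only the `Promise.race` / `"timeout"` / `iterator.return?.()` pattern comes from the snippet; the `Tier` type, the `streamWithStallDetection` name, the loop structure, and the error thrown when every tier stalls are assumptions for illustration.

```typescript
// Hypothetical wrapper around the Promise.race pattern shown in the post.
const TIMEOUT = "timeout" as const;

type Tier = { name: string; open: () => AsyncIterator<string> };

async function* streamWithStallDetection(
  tiers: Tier[],
  firstChunkTimeoutMs = 15_000, // the post's default of 15s
): AsyncGenerator<string> {
  for (const tier of tiers) {
    const iterator = tier.open();

    // Timer that resolves to "timeout" if no first chunk arrives in time.
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeoutPromise = new Promise<typeof TIMEOUT>((resolve) => {
      timer = setTimeout(() => resolve(TIMEOUT), firstChunkTimeoutMs);
    });

    // Whichever settles first wins; always clear the timer afterwards.
    const firstResult = await Promise.race([iterator.next(), timeoutPromise])
      .finally(() => clearTimeout(timer));

    if (firstResult === TIMEOUT) {
      // Stalled before any data reached the caller: abort and escalate.
      void iterator.return?.();
      continue;
    }

    // First chunk arrived: commit to this tier and stream the rest.
    let result = firstResult;
    while (!result.done) {
      yield result.value;
      result = await iterator.next();
    }
    return;
  }
  throw new Error("all cascade tiers stalled before the first chunk");
}
```

Because the timeout guards only the first chunk, escalation stays invisible to the client: once a chunk has been yielded, this sketch stays with that tier. Provider errors (rejected `next()` calls) propagate to the caller here; handling them is outside the scope of the example.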

The Numbers

Soak test: 6/6 pass, 3 providers exercised per run, ~11s total latency
CloudWatch alarms: 13 total (ECS health, 5xx rate, latency, security, sentinel, soak)
Autoscaling: 2-8 ECS tasks configured
Open-source prep: 150+ files removed, 176 files changed, -13,468 lines

Competitive Edge

To our knowledge, no competitor ships a self-testing gateway. Portkey, Helicone, and OpenRouter rely on external monitoring to detect their own failures. BR validates itself through its own API on a cron, over the same traffic path customers use. Combined with streaming cascade stall detection (which Portkey and Bifrost don't have), BR handles provider instability more gracefully than any funded alternative.

Lockstep Checklist

[x] API Routes: /v1/ops/status updated with providers key
[x] TS SDK: packages/sdk-ts — ops.getStatus() added
[x] Python SDK: packages/sdk-py — ops.get_status() added (sync + async)
[x] MCP Schemas: N/A (no new MCP tools)
[ ] Master Record: Deleted during open-source prep — needs recreation