Experiment Framework — A/B Test Models with Real Tasks

2026-03-24

api, sdk-ts, sdk-py

What We Built

The experiment framework A/B tests models on real tasks. An experiment splits traffic between model variants using deterministic hash-based assignment, tracks outcomes, and computes statistical significance with a z-test on success rates.

Three endpoints: create experiments with weighted variants, list all experiments, and get detailed results with interim metrics and significance scores.
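
To make the deterministic hash-based assignment above concrete, here is a minimal sketch of how a weighted split could work. The hash function (FNV-1a), the `assignVariant` helper, and the `unitKey` parameter are all assumptions for illustration; this post does not state which hash or assignment key the framework actually uses.

```typescript
// Hypothetical type; the field names mirror the variants in the create() call in this post.
interface Variant {
  id: string;
  model: string;
  weight: number;
}

// FNV-1a 32-bit hash over UTF-16 code units.
// Assumed choice: the framework's real hash function is not documented here.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Map (experimentId, unitKey) to a variant by walking cumulative weights.
// The same inputs always produce the same variant, with no stored state.
function assignVariant(experimentId: string, unitKey: string, variants: Variant[]): Variant {
  if (variants.length === 0) throw new Error("at least one variant is required");
  const total = variants.reduce((sum, v) => sum + v.weight, 0);
  // Scale the 32-bit hash into [0, total).
  const point = (fnv1a(`${experimentId}:${unitKey}`) / 0x100000000) * total;
  let cumulative = 0;
  for (const v of variants) {
    cumulative += v.weight;
    if (point < cumulative) return v;
  }
  return variants[variants.length - 1]; // guard against floating-point drift
}
```

Including the experiment ID in the hashed string means the same key can land in different variants across different experiments, so one experiment's split does not correlate with another's.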

Why It Matters

"Which model is actually better for TypeScript code generation?" isn't a question you answer with benchmarks; you answer it with production data. The experiment framework turns every real request into a data point, systematically comparing models on the metrics that matter to your workload.

How It Works

// Create an experiment
const exp = await client.experiments.create({
  name: "sonnet-vs-gpt4.1-typescript",
  hypothesis: "Sonnet produces higher-quality TypeScript",
  variants: [
    { id: "A", model: "anthropic/claude-sonnet-4-5", weight: 0.5 },
    { id: "B", model: "openai/gpt-4.1", weight: 0.5 },
  ],
  criteria: { task_type: "code-generation", language: "typescript" },
  sample_size: 50,
});

// Check results
const results = await client.experiments.get(exp.id);
// { statistical_significance: 0.87, interim_results: { A: {...}, B: {...} } }
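
For readers who want to see what a z-test on success rates involves, the following is a minimal sketch of a pooled two-proportion z-test. The `Outcome` shape and the `twoProportionZTest` helper are hypothetical, and reporting significance as 1 minus the two-sided p-value is an assumption: this post does not say how `statistical_significance` is derived from the test.

```typescript
// Hypothetical per-variant tally; the real interim_results shape is not shown in this post.
interface Outcome {
  successes: number;
  trials: number;
}

// Abramowitz and Stegun formula 7.1.26 approximation of erf (absolute error below 1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function.
function normalCdf(z: number): number {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Pooled two-proportion z-test on the success rates of variants A and B.
function twoProportionZTest(a: Outcome, b: Outcome): { z: number; pValue: number; significance: number } {
  if (a.trials <= 0 || b.trials <= 0) throw new Error("each variant needs at least one trial");
  const rateA = a.successes / a.trials;
  const rateB = b.successes / b.trials;
  const pooled = (a.successes + b.successes) / (a.trials + b.trials);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / a.trials + 1 / b.trials));
  // Both variants all-success or all-failure: no measurable difference.
  if (standardError === 0) return { z: 0, pValue: 1, significance: 0 };
  const z = (rateA - rateB) / standardError;
  const pValue = 2 * (1 - normalCdf(Math.abs(z)));
  // Assumption: significance is reported as 1 - p.
  return { z, pValue, significance: 1 - pValue };
}
```

As an illustration with made-up numbers, 20 successes out of 25 for A against 15 out of 25 for B gives a z-score of roughly 1.54, which falls short of the conventional 1.96 threshold for 95% confidence. With only 25 tasks per variant, even a 20-point gap in success rate can remain inconclusive.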

Lockstep Checklist

[x] API Routes: src/api/routes/experiments.ts — POST + GET list + GET by ID
[x] TS SDK: packages/sdk-ts/src/resources/experiments.ts — create() + list() + get()
[x] Python SDK: packages/sdk-py/src/brainstormrouter/resources/experiments.py — sync + async
[x] MCP Schemas: N/A — experiments are human-managed
[x] Tests: src/api/routes/experiments.test.ts — 9 test cases