2026-05-10-gauntlet-skeleton-alpha-5

Enterprise Agent Gauntlet — alpha skeleton (5 missions, validator, gate)

Date: 2026-05-10
Status: shipped
Slug: gauntlet-skeleton-alpha-5
Branch: feature/gauntlet-skeleton-alpha-5
Companion to: PR #251 (master spec, merged earlier today)

Summary

Adds the tests/gauntlet/ skeleton that Perplexity Computer drives against. BrainstormRouter is the system under test; Perplexity is the executor + oracle. The skeleton is a contract, not a runner: schemas that result bundles must conform to, the alpha-5 mission specs in machine-readable form, a TS validator that recomputes scoring math and audit-chain continuity, and a north-star gate (off / warn / enforce modes; gates private_beta = 80%, enterprise_pitch = 95%, production = 98%).
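The gate semantics summarized above can be sketched roughly as follows. Only the preset names, the 80/95/98 thresholds, and the off / warn / enforce modes come from this log; the function names, the result shape, and the choice to report pass: true in off mode are assumptions made for illustration and may differ from gate.ts.

```typescript
// Hypothetical sketch of the north-star gate; not the actual gate.ts.
type GatePreset = "private_beta" | "enterprise_pitch" | "production";
type GateMode = "off" | "warn" | "enforce";

// Thresholds from the ship log, expressed as fractions of 1.
const THRESHOLDS: Record<GatePreset, number> = {
  private_beta: 0.8,
  enterprise_pitch: 0.95,
  production: 0.98,
};

interface GateResult {
  pass: boolean; // did the pass rate meet the preset threshold?
  blocking: boolean; // should CI fail? only true in enforce mode
  message: string;
}

// Anything other than "off" or "enforce" falls back to "warn" (the documented default).
function parseMode(raw: string | undefined): GateMode {
  return raw === "off" || raw === "enforce" ? raw : "warn";
}

function evaluateGate(passRate: number, preset: GatePreset, mode: GateMode): GateResult {
  if (mode === "off") {
    // Assumption: a disabled gate reports pass without comparing anything.
    return { pass: true, blocking: false, message: "gate disabled" };
  }
  const threshold = THRESHOLDS[preset];
  const pass = passRate >= threshold;
  return {
    pass,
    blocking: mode === "enforce" && !pass,
    message: `${preset}: pass rate ${passRate} vs threshold ${threshold}`,
  };
}
```

A caller would typically feed the GAUNTLET_GATE environment value through parseMode and exit non-zero only when blocking is true, which keeps warn mode from failing CI.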

Submission shape (per Justin's call)

Perplexity opens a PR adding one new directory under tests/gauntlet/results// containing the artifacts listed in tests/gauntlet/results/README.md. The validator runs against the directory; merge-fitness is determined by ok: true + zero error-severity findings.
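As a minimal sketch of the merge-fitness rule just stated, assuming a report shape with an ok flag and a list of findings: the type names, the "warn" severity level, and the isMergeFit helper are hypothetical, and only ok: true, error-severity, and info-level findings are taken from this log.

```typescript
// Hypothetical report shape; the real validator output may differ.
type Severity = "info" | "warn" | "error";

interface Finding {
  severity: Severity;
  code: string; // e.g. "ledger_hash_mismatch"
  message: string;
}

interface ValidationReport {
  ok: boolean;
  findings: Finding[];
}

// A bundle is merge-fit only when ok is true AND no finding has error severity.
function isMergeFit(report: ValidationReport): boolean {
  const errorCount = report.findings.filter((f) => f.severity === "error").length;
  return report.ok && errorCount === 0;
}
```

Info-level findings, such as missing optional files, do not affect the result under this rule.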

This is option (a) of the proposal — file-based PR submission with full diff visibility, GitHub-native audit trail, no new runtime attack surface. The schemas were designed so a future POST /v1/gauntlet/results endpoint (option (c)) can ingest the same result-bundle.json payload without a directory tree.

Files (this PR)

Schemas (Zod source-of-truth)

  • tests/gauntlet/schemas/index.ts: MissionSchema, TurnEnvelopeSchema, ScoreSchema, LedgerEntrySchema, RedactionLogSchema, JudgeLogEntrySchema, ManifestSchema, ResultBundleSchema, IssueManifestSchema, plus primitives (RunIdSchema, Sha256HexSchema, DataClassSchema, VerticalSchema, ModelTierSchema, PolicyTagSchema, AgentIdentitySchema).

Alpha-5 mission catalogue

ID  Vertical    Title
H1  healthcare  Lab panel triage with safety bound
F1  finance     10-Q drafting cell with MNPI firewall
F2  finance     KYC / sanctions adverse-media triage
L1  legal       MSA + DPA redline against playbook
S1  msp_secops  Ransomware incident response, hour zero

Each mission encodes: agent identities + scopes, allowed tools (with data-class ceilings + approval-required flags), policies (with enforce_at lifecycle hooks), per-mission/agent/turn budgets, planned turns with preferred model tier, HITL gates, expected artifacts, and the auto-fail conditions that override the rubric.
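To make that anatomy concrete, here is a rough TypeScript shape for a mission spec. It is a sketch only: apart from enforce_at, every field name is hypothetical, the data-class values and their ordering are assumed, and the sample values are illustrative rather than the real H1 spec. The actual MissionSchema is the source of truth.

```typescript
// Hypothetical mission shape; field names and data classes are assumptions.
type DataClass = "public" | "internal" | "confidential" | "restricted";

// Assumed ordering, least to most sensitive, used to compare against a tool's ceiling.
const DATA_CLASS_RANK: Record<DataClass, number> = {
  public: 0,
  internal: 1,
  confidential: 2,
  restricted: 3,
};

interface ToolGrant {
  name: string;
  data_class_ceiling: DataClass; // most sensitive class the tool may handle
  approval_required: boolean; // true when a human must approve the call
}

interface Budget {
  usd: number;
  tokens: number;
}

interface MissionSketch {
  id: string;
  vertical: string;
  title: string;
  agents: { id: string; scopes: string[] }[];
  tools: ToolGrant[];
  policies: { tag: string; enforce_at: string[] }[];
  budgets: { mission: Budget; per_agent: Budget; per_turn: Budget };
  turns: { agent: string; goal: string; preferred_model_tier: string }[];
  hitl_gates: string[];
  expected_artifacts: string[];
  auto_fail: string[]; // conditions that override the rubric score
}

// A tool may touch data only at or below its ceiling.
function toolMayTouch(tool: ToolGrant, dataClass: DataClass): boolean {
  return DATA_CLASS_RANK[dataClass] <= DATA_CLASS_RANK[tool.data_class_ceiling];
}

// Illustrative values only.
const exampleMission: MissionSketch = {
  id: "H1",
  vertical: "healthcare",
  title: "Lab panel triage with safety bound",
  agents: [{ id: "triage-agent", scopes: ["labs:read"] }],
  tools: [
    { name: "lab_lookup", data_class_ceiling: "restricted", approval_required: false },
    { name: "notify_clinician", data_class_ceiling: "internal", approval_required: true },
  ],
  policies: [{ tag: "phi-minimum-necessary", enforce_at: ["pre_tool_call"] }],
  budgets: {
    mission: { usd: 5, tokens: 200000 },
    per_agent: { usd: 2, tokens: 80000 },
    per_turn: { usd: 0.5, tokens: 20000 },
  },
  turns: [{ agent: "triage-agent", goal: "flag critical values", preferred_model_tier: "mid" }],
  hitl_gates: ["clinician_signoff"],
  expected_artifacts: ["triage-note.md"],
  auto_fail: ["unapproved_restricted_tool_call"],
};
```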

Runners

  • tests/gauntlet/runners/validate-result.ts: the validator. Loads manifest + bundle + scores + ledger + redaction-log; validates each against its Zod schema; recomputes the score total per mission and cross-checks it against the submitted total; recomputes load_bearing_identity_pass and cross-checks it; verifies ledger hash continuity (SHA-256 over the canonical, key-sorted serialization of each entry minus its own hash field, chained against prev_hash); cross-checks manifest.counters.load_bearing_identity_pass_rate against the recomputed pass rate; flags raw-redacted/ files without a redaction-log entry.

  • tests/gauntlet/runners/gate.ts: the north-star gate. Three presets (private_beta / enterprise_pitch / production) with thresholds 80/95/98. Modes: off / warn / enforce. Defaults to warn so this skeleton does not block CI; flip via GAUNTLET_GATE=enforce when the live runs are stable.

  • tests/gauntlet/runners/build-example-fixture.ts: generates the example fixture deterministically (canonical SHA-256 chain, all required + most optional files). Run before commit if the schemas change so the Vitest spec stays green.

  • tests/gauntlet/runners/validate-result.test.ts: 8 Vitest specs: the example fixture passes with zero errors; H1 recomputes to 96; ledger continuity intact; LB-identity pass rate = 1; private_beta + production gates both accept; regenerating the fixture is self-consistent; and a tampered ledger entry produces a ledger_hash_mismatch (negative case).
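The ledger continuity check described in the validator entry above can be sketched as follows. This is an illustration under stated assumptions, not the code in validate-result.ts: the canonicalization shown (recursive key sort, compact JSON), the all-zero genesis prev_hash, and the helper names are hypothetical; only the hash and prev_hash field names and the ledger_hash_mismatch code come from this log.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

interface LedgerEntry {
  [key: string]: Json;
  prev_hash: string;
  hash: string;
}

// Assumed genesis value for the first entry's prev_hash.
const GENESIS = "0".repeat(64);

// Serialize with object keys sorted recursively so equal entries always hash equally.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  const pairs = Object.entries(value).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return `{${pairs.map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`).join(",")}}`;
}

// SHA-256 over the entry with its own hash field removed.
function entryHash(entry: LedgerEntry): string {
  const { hash: _omitted, ...rest } = entry;
  return createHash("sha256").update(canonicalize(rest)).digest("hex");
}

// Append a new entry chained to the previous one (used here to build test data).
function appendEntry(chain: LedgerEntry[], body: { [key: string]: Json }): LedgerEntry[] {
  const last = chain[chain.length - 1];
  const unsigned = { ...body, prev_hash: last ? last.hash : GENESIS };
  const hash = createHash("sha256").update(canonicalize(unsigned)).digest("hex");
  return [...chain, { ...unsigned, hash }];
}

// Returns a list of problems; an empty list means the chain is intact.
function verifyChain(entries: LedgerEntry[]): string[] {
  const problems: string[] = [];
  let prev = GENESIS;
  entries.forEach((entry, i) => {
    if (entry.prev_hash !== prev) problems.push(`entry ${i}: prev_hash_mismatch`);
    if (entryHash(entry) !== entry.hash) problems.push(`entry ${i}: ledger_hash_mismatch`);
    prev = entry.hash;
  });
  return problems;
}
```

Editing any field of a stored entry changes its recomputed hash, so a tampered entry surfaces as a ledger_hash_mismatch even when the stored hash values still link up.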

Documentation + structure

  • tests/gauntlet/README.md: overall harness + commands.
  • tests/gauntlet/results/README.md: submission protocol + hard-constraint list + redaction rules.
  • tests/gauntlet/results/.gitignore: empty pattern set, kept for convention so contributors don't reach for it.
  • tests/gauntlet/fixtures//: example run, regenerated deterministically by build-example-fixture.ts.

Vitest config

  • vitest.unit.config.ts: added tests/gauntlet/*/.test.ts to the include list so the validator self-tests run in pnpm test:fast.

What this PR does NOT do

  • Run missions. Perplexity Computer is the executor + oracle.
  • Drive real BR completions, real provider keys, or real costs.
  • Block CI or release on north-star pass-rate. Gate is warn only.
  • Score with frontier judge LLMs. Validator only recomputes dimension sums + audit-chain continuity. Judge LLM scoring is Perplexity's surface and arrives in judge-log.jsonl.

  • Generate adversarial perturbations. Those are mission-side; future PRs add them as separate *_perturbed.json mission specs.

  • Cover the remaining 10 missions (H2/H3, F3, L2/L3, I1/I2, S2, C1/C2). Alpha-5 ships first to validate the contract; expansion follows once Perplexity successfully submits a real run.
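Since the validator's scoring role is limited to recomputing dimension sums, that cross-check might look roughly like this. The score shape, the dimension names, the finding code, and the sample breakdown are all hypothetical; the only figure taken from this log is that H1 recomputes to 96 in the example fixture.

```typescript
// Hypothetical score shape; the real ScoreSchema may differ.
interface SubmittedScore {
  mission_id: string;
  dimensions: Record<string, number>; // points awarded per rubric dimension
  total: number; // total claimed by the submitter
}

interface ScoreFinding {
  severity: "error";
  code: string;
  message: string;
}

// The validator does no judging: it only re-adds the submitted dimension points.
function recomputeTotal(score: SubmittedScore): number {
  return Object.values(score.dimensions).reduce((sum, points) => sum + points, 0);
}

// Returns null when the submitted total matches, otherwise an error finding.
function checkTotal(score: SubmittedScore): ScoreFinding | null {
  const recomputed = recomputeTotal(score);
  if (recomputed === score.total) return null;
  return {
    severity: "error",
    code: "score_total_mismatch", // hypothetical finding code
    message: `${score.mission_id}: submitted ${score.total}, recomputed ${recomputed}`,
  };
}

// Illustrative breakdown only; not the fixture's real dimensions.
const h1Example: SubmittedScore = {
  mission_id: "H1",
  dimensions: { correctness: 40, safety: 30, governance: 26 },
  total: 96,
};
```

Comparing integers exactly is fine for whole-point rubrics; fractional dimension scores would likely call for a small tolerance instead.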

Verification

  • pnpm tsgo: exit 0
  • pnpm test:fast: 7763/0 (+8 from the gauntlet validator suite)
  • pnpm exec oxfmt --check / oxlint --type-aware: clean
  • pnpm exec tsx tests/gauntlet/runners/validate-result.ts
    ok: true, only info-level findings (missing optional files)
  • pnpm exec tsx tests/gauntlet/runners/gate.ts private_beta
    pass: true
  • pnpm exec tsx tests/gauntlet/runners/build-example-fixture.ts
    exits 0, writes a self-consistent fixture

Open follow-ups (filed-as-issues candidates)

  1. Mission catalogue expansion: add the remaining 10 missions (H2/H3, F3, L2/L3, I1/I2, S2, C1/C2). Each needs the same anatomy the alpha-5 already established.

  2. Perturbation mission variants: adversarial inputs (jailbreaks, tool 5xx, jurisdiction flips, budget reductions) as derived mission specs.

  3. Result-bundle ingest endpoint: POST /v1/gauntlet/results per the original (c) option. Schema is already self-contained; the endpoint should reject anything the file-based validator would reject.

  4. CI gate flip: once Perplexity has submitted ~5 successful runs and the score distribution is stable, set GAUNTLET_GATE=enforce in the workflow that runs after a tests/gauntlet/results/ PR lands. Keep private_beta (80%) until the second-tier expansion.

  5. Judge-log richness: currently the schema captures judge provider/model/dimension/score/rationale-hash. Real runs will tell us whether per-judge token usage and disagreement traces should also land in the bundle.

Lockstep

  • TS SDK / Python SDK / MCP: no public API surface change.
  • OpenAPI: no new routes (the future POST /v1/gauntlet/results is out of scope).
  • pnpm build: clean (no new auto-generated artifacts).
  • Ship log: this file.