2026-05-10-gauntlet-skeleton-alpha-5
Enterprise Agent Gauntlet — alpha skeleton (5 missions, validator, gate)
Date: 2026-05-10
Status: shipped
Slug: gauntlet-skeleton-alpha-5
Branch: feature/gauntlet-skeleton-alpha-5
Companion to: PR #251 (master spec, merged earlier today)
Summary
Adds the tests/gauntlet/ skeleton that Perplexity Computer drives against. BrainstormRouter is the system under test; Perplexity is the executor + oracle. The skeleton is a contract, not a runner: schemas that result bundles must conform to, the alpha-5 mission specs in machine-readable form, a TS validator that recomputes scoring math and audit-chain continuity, and a north-star gate (off / warn / enforce modes; gates private_beta = 80%, enterprise_pitch = 95%, production = 98%).
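The gate behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the contents of gate.ts: the names `parseGateMode`, `evaluateGate`, and `GateResult` are hypothetical, the `>=` comparison is an assumption, and so is the choice that `off` skips the comparison and reports a pass.

```typescript
// Sketch only: names below are illustrative, not the real gate.ts exports.
type GateMode = "off" | "warn" | "enforce";
type GatePreset = "private_beta" | "enterprise_pitch" | "production";

// Thresholds from this log, expressed as fractions of 1.
const THRESHOLDS: Record<GatePreset, number> = {
  private_beta: 0.8,
  enterprise_pitch: 0.95,
  production: 0.98,
};

interface GateResult {
  pass: boolean; // did the pass rate meet the preset threshold?
  blocking: boolean; // should CI fail because of this result?
}

// Unknown or missing values fall back to "warn", the documented default.
function parseGateMode(raw: string | undefined): GateMode {
  return raw === "off" || raw === "enforce" ? raw : "warn";
}

// Assumption: "off" skips the comparison entirely and reports a pass.
function evaluateGate(passRate: number, preset: GatePreset, mode: GateMode): GateResult {
  if (mode === "off") return { pass: true, blocking: false };
  const pass = passRate >= THRESHOLDS[preset];
  // Only "enforce" turns a failed comparison into a blocking result.
  return { pass, blocking: mode === "enforce" && !pass };
}
```

In a CI job the raw mode string would presumably come from the GAUNTLET_GATE environment variable; under `warn`, a failing pass rate is reported but never blocks.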
Submission shape (per Justin's call)
Perplexity opens a PR adding one new directory under tests/gauntlet/results/ containing the artifacts listed in tests/gauntlet/results/README.md. The validator runs against the directory; merge-fitness is determined by ok: true + zero error-severity findings.
This is option (a) of the proposal — file-based PR submission with full diff visibility, GitHub-native audit trail, no new runtime attack surface. The schemas were designed so a future POST /v1/gauntlet/results endpoint (option (c)) can ingest the same result-bundle.json payload without a directory tree.
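As a rough illustration of the merge-fitness rule above (ok: true plus zero error-severity findings), a check over a validator report might look like this. The `Finding` and `ValidationReport` shapes are assumptions made for the sketch, as is the `warn` severity level and the `optional_file_missing` code; only `ok`, the error and info severities, and the `ledger_hash_mismatch` code appear in this log.

```typescript
// Hypothetical report shape; the real validator output may differ.
type Severity = "info" | "warn" | "error";

interface Finding {
  severity: Severity;
  code: string; // e.g. "ledger_hash_mismatch"
  message: string;
}

interface ValidationReport {
  ok: boolean;
  findings: Finding[];
}

// A run is merge-fit only when ok is true AND no finding has error severity.
function isMergeFit(report: ValidationReport): boolean {
  return report.ok && report.findings.every((f) => f.severity !== "error");
}

const cleanRun: ValidationReport = {
  ok: true,
  findings: [
    { severity: "info", code: "optional_file_missing", message: "optional file absent" },
  ],
};

const tamperedRun: ValidationReport = {
  ok: false,
  findings: [
    { severity: "error", code: "ledger_hash_mismatch", message: "entry hash differs" },
  ],
};
```

Checking both conditions is deliberately redundant: a report that claims ok: true while still carrying an error-severity finding is treated as unfit.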
Files (this PR)
Schemas (Zod source-of-truth)
- tests/gauntlet/schemas/index.ts: MissionSchema, TurnEnvelopeSchema, ScoreSchema, LedgerEntrySchema, RedactionLogSchema, JudgeLogEntrySchema, ManifestSchema, ResultBundleSchema, IssueManifestSchema, plus primitives (RunIdSchema, Sha256HexSchema, DataClassSchema, VerticalSchema, ModelTierSchema, PolicyTagSchema, AgentIdentitySchema).
Alpha-5 mission catalogue
| ID | Vertical | Title |
|---|---|---|
| H1 | healthcare | Lab panel triage with safety bound |
| F1 | finance | 10-Q drafting cell with MNPI firewall |
| F2 | finance | KYC / sanctions adverse-media triage |
| L1 | legal | MSA + DPA redline against playbook |
| S1 | msp_secops | Ransomware incident response, hour zero |
Each mission encodes: agent identities + scopes, allowed tools (with data-class ceilings + approval-required flags), policies (with enforce_at lifecycle hooks), per-mission/agent/turn budgets, planned turns with preferred model tier, HITL gates, expected artifacts, and the auto-fail conditions that override the rubric.
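To make the mission anatomy concrete, here is a type-level sketch of what one spec might carry. Every field name and every value below is a hypothetical placeholder chosen for illustration; MissionSchema in tests/gauntlet/schemas/index.ts remains the source of truth, and the `Vertical` union only lists the verticals that appear in the alpha-5 table.

```typescript
// All field names are hypothetical; the real MissionSchema is authoritative.
type Vertical = "healthcare" | "finance" | "legal" | "msp_secops";

interface ToolGrant {
  name: string;
  data_class_ceiling: string; // highest data class the tool may touch
  approval_required: boolean; // true when a human must approve the call
}

interface MissionSketch {
  id: string;
  vertical: Vertical;
  title: string;
  agents: { id: string; scopes: string[] }[];
  allowed_tools: ToolGrant[];
  policies: { tag: string; enforce_at: string }[];
  budgets: { per_mission: number; per_agent: number; per_turn: number };
  planned_turns: { agent: string; preferred_model_tier: string }[];
  hitl_gates: string[];
  expected_artifacts: string[];
  auto_fail_conditions: string[]; // any hit overrides the rubric
}

// Placeholder values only; not the contents of the real H1 spec.
const h1: MissionSketch = {
  id: "H1",
  vertical: "healthcare",
  title: "Lab panel triage with safety bound",
  agents: [{ id: "triage-agent", scopes: ["labs:read"] }],
  allowed_tools: [
    { name: "lab_lookup", data_class_ceiling: "restricted", approval_required: false },
    { name: "notify_clinician", data_class_ceiling: "restricted", approval_required: true },
  ],
  policies: [{ tag: "phi_redaction", enforce_at: "pre_tool_call" }],
  budgets: { per_mission: 100, per_agent: 50, per_turn: 10 },
  planned_turns: [{ agent: "triage-agent", preferred_model_tier: "standard" }],
  hitl_gates: ["before_notify_clinician"],
  expected_artifacts: ["triage-summary.json"],
  auto_fail_conditions: ["unredacted_phi_in_output"],
};

// Lists the tools whose calls must pass through an approval step.
function toolsNeedingApproval(mission: MissionSketch): string[] {
  return mission.allowed_tools.filter((t) => t.approval_required).map((t) => t.name);
}
```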
Runners
- tests/gauntlet/runners/validate-result.ts: the validator. Loads manifest + bundle + scores + ledger + redaction-log; validates each against its Zod schema; recomputes the score total per mission and cross-checks it against the submitted total; recomputes load_bearing_identity_pass and cross-checks it; verifies ledger hash continuity (canonical sort-keys SHA256 over each entry minus its own hash field, chained against prev_hash); cross-checks manifest.counters.load_bearing_identity_pass_rate against the recomputed pass rate; flags raw-redacted/ files without a redaction-log entry.
- tests/gauntlet/runners/gate.ts: the north-star gate. Three presets (private_beta / enterprise_pitch / production) with thresholds 80/95/98. Modes: off / warn / enforce. Defaults to warn so this skeleton does not block CI; flip via GAUNTLET_GATE=enforce when the live runs are stable.
- tests/gauntlet/runners/build-example-fixture.ts: generates the example fixture deterministically (canonical SHA256 chain, all required + most optional files). Run before commit if the schemas change so the Vitest spec stays green.
- tests/gauntlet/runners/validate-result.test.ts: 8 Vitest specs. The example fixture passes with zero errors; H1 recomputes to 96; ledger continuity intact; LB-identity pass rate = 1; private_beta + production gates both accept; regenerating the fixture is self-consistent; and a tampered ledger entry produces a ledger_hash_mismatch (negative case).
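The ledger continuity check and its tampered-entry negative case can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the validator's code: the `seq` and `event` fields, the all-zero genesis `prev_hash`, and the helper names are hypothetical, and the canonical form here is simply JSON with object keys sorted at every level. Only `hash`, `prev_hash`, and the "SHA256 over the entry minus its own hash field" rule come from this log.

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry shape: only `hash` and `prev_hash` are named in this log.
interface LedgerEntry {
  seq: number;
  event: string;
  prev_hash: string;
  hash: string;
}

// Serialize JSON-plain values with object keys sorted at every level.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// SHA256 over the entry minus its own `hash` field.
function entryHash(entry: Omit<LedgerEntry, "hash">): string {
  return createHash("sha256").update(canonicalize(entry)).digest("hex");
}

const GENESIS = "0".repeat(64); // assumed sentinel for the first prev_hash

function appendEntry(ledger: LedgerEntry[], seq: number, event: string): void {
  const last = ledger[ledger.length - 1];
  const body = { seq, event, prev_hash: last ? last.hash : GENESIS };
  ledger.push({ ...body, hash: entryHash(body) });
}

// Returns the index of the first entry that breaks the chain, or -1 if intact.
function firstBrokenEntry(ledger: LedgerEntry[]): number {
  let expectedPrev = GENESIS;
  return ledger.findIndex((entry) => {
    const { hash, ...rest } = entry;
    const broken = rest.prev_hash !== expectedPrev || entryHash(rest) !== hash;
    expectedPrev = hash;
    return broken;
  });
}

const ledger: LedgerEntry[] = [];
appendEntry(ledger, 0, "mission_start");
appendEntry(ledger, 1, "tool_call");
appendEntry(ledger, 2, "mission_end");
```

Editing any field of an entry after the fact changes its recomputed digest, so the stored `hash` no longer matches; that mismatch is the kind of condition a ledger_hash_mismatch finding would report.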
Documentation + structure
- tests/gauntlet/README.md: overall harness + commands.
- tests/gauntlet/results/README.md: submission protocol + hard-constraint list + redaction rules.
- tests/gauntlet/results/.gitignore: empty pattern set, kept for convention so contributors don't reach for it.
- tests/gauntlet/fixtures/: example run, regenerated deterministically by build-example-fixture.ts.
Vitest config
- vitest.unit.config.ts: added `tests/gauntlet/**/*.test.ts` to the include list so the validator self-tests run in pnpm test:fast.
What this PR does NOT do
- Run missions. Perplexity Computer is the executor + oracle.
- Drive real BR completions, real provider keys, or real costs.
- Block CI or release on north-star pass-rate. Gate is warn only.
- Score with frontier judge LLMs. Validator only recomputes dimension sums + audit-chain continuity. Judge LLM scoring is Perplexity's surface and arrives in judge-log.jsonl.
- Generate adversarial perturbations. Those are mission-side; future
PRs add them as separate *_perturbed.json mission specs.
- Cover the remaining 10 missions (H2/H3, F3, L2/L3, I1/I2, S2, C1/C2).
Alpha-5 ships first to validate the contract; expansion follows once Perplexity successfully submits a real run.
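Since the validator limits itself to recomputing dimension sums, that arithmetic is simple enough to sketch. The `MissionScore` shape, the dimension names, and the individual dimension values below are hypothetical placeholders (only the H1 total of 96 and the pass rate of 1 are taken from this log), and how load_bearing_identity_pass itself is derived is not specified here, so the sketch only aggregates it into a rate.

```typescript
// Hypothetical score shape; ScoreSchema is the real contract.
interface MissionScore {
  mission_id: string;
  dimensions: Record<string, number>; // per-dimension rubric points
  total: number; // total as submitted in the bundle
  load_bearing_identity_pass: boolean;
}

// Recompute the total as the plain sum of the dimension scores.
function recomputeTotal(score: MissionScore): number {
  return Object.values(score.dimensions).reduce((sum, v) => sum + v, 0);
}

// Mission IDs whose submitted total disagrees with the recomputed sum.
function totalMismatches(scores: MissionScore[]): string[] {
  return scores.filter((s) => recomputeTotal(s) !== s.total).map((s) => s.mission_id);
}

// Fraction of missions whose load-bearing identity check passed.
function identityPassRate(scores: MissionScore[]): number {
  if (scores.length === 0) return 0;
  return scores.filter((s) => s.load_bearing_identity_pass).length / scores.length;
}

// Placeholder dimension values chosen so that H1 sums to 96.
const submitted: MissionScore[] = [
  {
    mission_id: "H1",
    dimensions: { task_quality: 40, policy_compliance: 30, auditability: 26 },
    total: 96,
    load_bearing_identity_pass: true,
  },
  {
    mission_id: "F1",
    dimensions: { task_quality: 35, policy_compliance: 30, auditability: 25 },
    total: 91, // deliberately wrong: the dimensions sum to 90
    load_bearing_identity_pass: true,
  },
];
```

Here the F1 row would be flagged because its submitted total does not match its dimensions, while the judge-assigned dimension values themselves are taken as given.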
Verification
- pnpm tsgo exits 0.
- pnpm test:fast reports 7763/0 (+8 from the gauntlet validator suite).
- pnpm exec oxfmt --check and oxlint --type-aware are clean.
- pnpm exec tsx tests/gauntlet/runners/validate-result.ts reports ok: true, with only info-level findings (missing optional files).
- pnpm exec tsx tests/gauntlet/runners/gate.ts private_beta reports pass: true.
- pnpm exec tsx tests/gauntlet/runners/build-example-fixture.ts exits 0 and writes a self-consistent fixture.
Open follow-ups (filed-as-issues candidates)
- Mission catalogue expansion — add the remaining 10 missions
(H2/H3, F3, L2/L3, I1/I2, S2, C1/C2). Each needs the same anatomy the alpha-5 already established.
- Perturbation mission variants — adversarial inputs (jailbreaks,
tool 5xx, jurisdiction flips, budget reductions) as derived mission specs.
- Result-bundle ingest endpoint: POST /v1/gauntlet/results, per the original option (c). The schema is already self-contained; the endpoint should reject anything the file-based validator would reject.
- CI gate flip — once Perplexity has submitted ~5 successful
runs and the score distribution is stable, set GAUNTLET_GATE=enforce in the workflow that runs after a tests/gauntlet/results/ PR lands. Keep private_beta (80%) until the second-tier expansion.
- Judge-log richness — currently the schema captures judge
provider/model/dimension/score/rationale-hash. Real runs will tell us whether per-judge token usage and disagreement traces should also land in the bundle.
Lockstep
- TS SDK / Python SDK / MCP: no public API surface change.
- OpenAPI: no new routes (the future POST /v1/gauntlet/results is out of scope).
- pnpm build: clean (no new auto-generated artifacts).
- Ship log: this file.