2026-05-10-gauntlet-skeleton-alpha-5

Enterprise Agent Gauntlet — alpha skeleton (5 missions, validator, gate)

Date: 2026-05-10
Status: shipped
Slug: gauntlet-skeleton-alpha-5
Branch: feature/gauntlet-skeleton-alpha-5
Companion to: PR #251 (master spec, merged earlier today)

Summary

Adds the tests/gauntlet/ skeleton that Perplexity Computer runs against. BrainstormRouter is the system under test; Perplexity is the executor + oracle. The skeleton is a contract, not a runner. It contains the schemas that result bundles must conform to, the alpha-5 mission specs in machine-readable form, a TS validator that recomputes scoring math and audit-chain continuity, and a north-star gate (modes: off / warn / enforce; presets: private_beta = 80%, enterprise_pitch = 95%, production = 98%).

Submission shape (per Justin's call)

Perplexity opens a PR adding one new directory under tests/gauntlet/results// containing the artifacts listed in tests/gauntlet/results/README.md. The validator runs against the directory; merge-fitness is determined by ok: true + zero error-severity findings.
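The merge-fitness rule can be sketched as a small predicate. This is a minimal illustration, not the validator's actual code: the `ValidationReport` and `Finding` shapes, the `warn` severity level, and the function name `isMergeFit` are assumptions made for the example. Only `ok`, the error and info severities, and the `ledger_hash_mismatch` code appear in this log.

```typescript
// Hypothetical report shape. The real validator output may carry more fields.
type Severity = "info" | "warn" | "error"; // "warn" is an assumed middle level

interface Finding {
  severity: Severity;
  code: string; // e.g. "ledger_hash_mismatch"
  message: string;
}

interface ValidationReport {
  ok: boolean;
  findings: Finding[];
}

// A bundle is merge-fit only when the validator reports ok: true AND
// no finding carries error severity. Info-level findings (such as
// missing optional files) do not block the merge.
function isMergeFit(report: ValidationReport): boolean {
  const errorCount = report.findings.filter((f) => f.severity === "error").length;
  return report.ok && errorCount === 0;
}
```

Checking both conditions is deliberately redundant: a report that claims `ok: true` while still listing an error-severity finding is treated as unfit.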

This is option (a) of the proposal — file-based PR submission with full diff visibility, GitHub-native audit trail, no new runtime attack surface. The schemas were designed so a future POST /v1/gauntlet/results endpoint (option (c)) can ingest the same result-bundle.json payload without a directory tree.

Files (this PR)

Schemas (Zod source-of-truth)

tests/gauntlet/schemas/index.ts — MissionSchema, TurnEnvelopeSchema, ScoreSchema, LedgerEntrySchema, RedactionLogSchema, JudgeLogEntrySchema, ManifestSchema, ResultBundleSchema, IssueManifestSchema, plus primitives (RunIdSchema, Sha256HexSchema, DataClassSchema, VerticalSchema, ModelTierSchema, PolicyTagSchema, AgentIdentitySchema).

Alpha-5 mission catalogue

ID	Vertical	Title
H1	healthcare	Lab panel triage with safety bound
F1	finance	10-Q drafting cell with MNPI firewall
F2	finance	KYC / sanctions adverse-media triage
L1	legal	MSA + DPA redline against playbook
S1	msp_secops	Ransomware incident response, hour zero

Each mission encodes: agent identities + scopes, allowed tools (with data-class ceilings + approval-required flags), policies (with enforce_at lifecycle hooks), per-mission/agent/turn budgets, planned turns with preferred model tier, HITL gates, expected artifacts, and the auto-fail conditions that override the rubric.

Runners

tests/gauntlet/runners/validate-result.ts — the validator. Loads manifest + bundle + scores + ledger + redaction-log; validates each against its Zod schema; recomputes the score total per mission and cross-checks against the submitted total; recomputes load_bearing_identity_pass and cross-checks; verifies ledger hash continuity (canonical sort-keys SHA256 over each entry minus its own hash field, chained against prev_hash); cross-checks manifest.counters.load_bearing_identity_pass_rate against the recomputed pass rate; flags raw-redacted/ files without a redaction-log entry.
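The ledger continuity check described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code in validate-result.ts: the helper names (`canonicalize`, `entryHash`, `seal`, `verifyChain`), the all-zero genesis value for the first `prev_hash`, and the `prev_hash_mismatch` code are hypothetical. Only the `hash` and `prev_hash` fields, the sorted-keys SHA256 approach, and the `ledger_hash_mismatch` code come from this log.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [k: string]: Json };

// Serialize with object keys sorted recursively, so the same entry always
// yields the same bytes regardless of key insertion order.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  const keys = Object.keys(value).sort();
  return `{${keys.map((k) => `${JSON.stringify(k)}:${canonicalize(value[k])}`).join(",")}}`;
}

interface LedgerEntry {
  prev_hash: string;
  hash: string;
  [field: string]: Json;
}

// SHA-256 over the canonical form of the entry minus its own hash field.
function entryHash(entry: LedgerEntry): string {
  const { hash: _omitted, ...rest } = entry;
  return createHash("sha256").update(canonicalize(rest)).digest("hex");
}

// Assumption: the first entry chains against an all-zero hash.
const GENESIS = "0".repeat(64);

// Build an entry whose hash is consistent with its body and predecessor.
function seal(body: { [k: string]: Json }, prevHash: string): LedgerEntry {
  const unsigned = { ...body, prev_hash: prevHash };
  const hash = createHash("sha256").update(canonicalize(unsigned)).digest("hex");
  return { ...unsigned, hash };
}

// Walk the ledger in order and report every break in the chain.
function verifyChain(entries: LedgerEntry[]): string[] {
  const problems: string[] = [];
  let expectedPrev = GENESIS;
  entries.forEach((entry, i) => {
    if (entry.prev_hash !== expectedPrev) problems.push(`entry ${i}: prev_hash_mismatch`);
    if (entry.hash !== entryHash(entry)) problems.push(`entry ${i}: ledger_hash_mismatch`);
    expectedPrev = entry.hash;
  });
  return problems;
}
```

Because each hash covers `prev_hash`, editing any field of an earlier entry without recomputing every later hash surfaces as a mismatch, which is the behaviour the tampered-ledger negative test relies on.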

tests/gauntlet/runners/gate.ts — the north-star gate. Three presets (private_beta / enterprise_pitch / production) with thresholds 80/95/98. Modes: off / warn / enforce. Defaults to warn so this skeleton does not block CI; flip via GAUNTLET_GATE=enforce when the live runs are stable.
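One way the preset and mode logic could fit together is sketched below. The type names, `parseMode`, `evaluateGate`, and the `GateResult` shape are hypothetical. The sketch also assumes that the threshold is compared against the pass rate with `>=`, that unrecognized mode values fall back to warn, and that off skips the comparison. None of these details are confirmed by this log.

```typescript
type GatePreset = "private_beta" | "enterprise_pitch" | "production";
type GateMode = "off" | "warn" | "enforce";

// Thresholds from the three presets, expressed as fractions.
const THRESHOLDS: Record<GatePreset, number> = {
  private_beta: 0.8,
  enterprise_pitch: 0.95,
  production: 0.98,
};

// Unset values fall back to "warn", matching the stated default.
// Assumption: unrecognized values are treated the same way.
function parseMode(raw: string | undefined): GateMode {
  return raw === "off" || raw === "enforce" ? raw : "warn";
}

interface GateResult {
  pass: boolean; // did the pass rate meet the preset threshold?
  blocking: boolean; // should CI fail? true only in enforce mode
}

function evaluateGate(passRate: number, preset: GatePreset, mode: GateMode): GateResult {
  // Assumption: "off" skips the comparison entirely and reports a pass.
  if (mode === "off") return { pass: true, blocking: false };
  const pass = passRate >= THRESHOLDS[preset];
  return { pass, blocking: !pass && mode === "enforce" };
}
```

A caller would typically read the mode with `parseMode(process.env.GAUNTLET_GATE)` and set a non-zero exit code only when `blocking` is true, so that warn mode can report a miss without failing the job.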

tests/gauntlet/runners/build-example-fixture.ts — generates the example fixture deterministically (canonical SHA256 chain, all required + most optional files). Run before commit if the schemas change so the Vitest spec stays green.

tests/gauntlet/runners/validate-result.test.ts — 8 Vitest specs: the example fixture passes with zero errors; H1 recomputes to 96; ledger continuity intact; LB-identity pass rate = 1; private_beta + production gates both accept; regenerating the fixture is self-consistent; and a tampered ledger entry produces a ledger_hash_mismatch (negative case).
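The recompute-and-cross-check behaviour these specs exercise might look roughly like the sketch below. Everything beyond the field names `load_bearing_identity_pass` and `load_bearing_identity_pass_rate` is an assumption: the `MissionScore` shape, the `dimensions` map, the finding codes, and the reading of the pass rate as the fraction of missions whose load-bearing identity check passed are all hypothetical. The numbers in any usage are illustrative and are not taken from the real fixture.

```typescript
// Hypothetical per-mission score record; the real ScoreSchema may differ.
interface MissionScore {
  mission_id: string;
  dimensions: Record<string, number>; // rubric dimension -> points awarded
  total: number; // total as submitted in the bundle
  load_bearing_identity_pass: boolean;
}

// Recompute the total as the plain sum of dimension points.
function recomputeTotal(score: MissionScore): number {
  return Object.values(score.dimensions).reduce((sum, points) => sum + points, 0);
}

// Assumption: pass rate = share of missions with load_bearing_identity_pass.
function recomputePassRate(scores: MissionScore[]): number {
  if (scores.length === 0) return 0;
  return scores.filter((s) => s.load_bearing_identity_pass).length / scores.length;
}

// Compare submitted values against recomputed ones and list disagreements.
function crossCheck(scores: MissionScore[], submittedPassRate: number): string[] {
  const findings: string[] = [];
  for (const s of scores) {
    const recomputed = recomputeTotal(s);
    if (recomputed !== s.total) {
      findings.push(`${s.mission_id}: score_total_mismatch (submitted ${s.total}, recomputed ${recomputed})`);
    }
  }
  const passRate = recomputePassRate(scores);
  // Small tolerance because the rate is a floating-point fraction.
  if (Math.abs(passRate - submittedPassRate) > 1e-9) {
    findings.push(`pass_rate_mismatch (submitted ${submittedPassRate}, recomputed ${passRate})`);
  }
  return findings;
}
```

The exact-equality check on totals assumes integer dimension points; fractional scoring would need the same tolerance treatment as the pass rate.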

Documentation + structure

tests/gauntlet/README.md — overall harness + commands.
tests/gauntlet/results/README.md — submission protocol + hard-constraint list + redaction rules.
tests/gauntlet/results/.gitignore — empty pattern set, kept for convention so contributors don't reach for it.
tests/gauntlet/fixtures// — example run, regenerated deterministically by build-example-fixture.ts.

Vitest config

vitest.unit.config.ts — added tests/gauntlet/*/.test.ts to the include list so the validator self-tests run in pnpm test:fast.

What this PR does NOT do

Run missions. Perplexity Computer is the executor + oracle.
Drive real BR completions, real provider keys, or real costs.
Block CI or release on north-star pass-rate. The gate ships in warn mode by default.
Score with frontier judge LLMs. Validator only recomputes dimension sums + audit-chain continuity. Judge LLM scoring is Perplexity's surface and arrives in judge-log.jsonl.
Generate adversarial perturbations. Those are mission-side; future PRs add them as separate *_perturbed.json mission specs.
Cover the remaining 10 missions (H2/H3, F3, L2/L3, I1/I2, S2, C1/C2). Alpha-5 ships first to validate the contract; expansion follows once Perplexity successfully submits a real run.

Verification

pnpm tsgo — exit 0
pnpm test:fast — 7763 passed / 0 failed (+8 from the gauntlet validator suite)
pnpm exec oxfmt --check / oxlint --type-aware — clean
pnpm exec tsx tests/gauntlet/runners/validate-result.ts — ok: true, only info-level findings (missing optional files)
pnpm exec tsx tests/gauntlet/runners/gate.ts private_beta — pass: true
pnpm exec tsx tests/gauntlet/runners/build-example-fixture.ts — exits 0, writes a self-consistent fixture

Open follow-ups (filed-as-issues candidates)

Mission catalogue expansion — add the remaining 10 missions (H2/H3, F3, L2/L3, I1/I2, S2, C1/C2). Each needs the same anatomy the alpha-5 already established.
Perturbation mission variants — adversarial inputs (jailbreaks, tool 5xx, jurisdiction flips, budget reductions) as derived mission specs.
Result-bundle ingest endpoint — POST /v1/gauntlet/results per the original (c) option. Schema is already self-contained; the endpoint should reject anything the file-based validator would reject.
CI gate flip — once Perplexity has submitted ~5 successful runs and the score distribution is stable, set GAUNTLET_GATE=enforce in the workflow that runs after a tests/gauntlet/results/ PR lands. Keep private_beta (80%) until the second-tier expansion.
Judge-log richness — currently the schema captures judge provider/model/dimension/score/rationale-hash. Real runs will tell us whether per-judge token usage and disagreement traces should also land in the bundle.

Lockstep

TS SDK / Python SDK / MCP — no public API surface change.
OpenAPI — no new routes (the future POST /v1/gauntlet/results is out of scope).

pnpm build — clean (no new auto-generated artifacts).
Ship log — this file.