2026-05-11-281-schema-v0.2-extensions

#281: Schema extensions for scorer v0.2 + F1 migration to v0.3.0

Date: 2026-05-11
Status: shipped
Slug: 281-schema-v0.2-extensions
Branch: feature/281-schema-v0.2-extensions
Companion to: PR #287 (rubric v0.1), PR #289 (scorer v0.2 target spec)
Addresses: #281

Summary

Extends MissionSchema, ExpectedArtifactSchema, HitlGateSchema, PolicyDeclarationSchema, and LedgerEntrySchema with every BR-side field the Computer scorer v0.2 needs to consume. The new fields are additive and optional at the type level so v0.1 missions continue to validate; a post-parse conformance check (checkRubricV02Conformance) enforces them only when a mission declares rubric_ref: "computer-rubric.v0.2".

F1 migrates to v0.3.0 as the wired prototype demonstrating every new field in use. H1/F2/L1/S1 are untouched (their migration is a follow-up once the F1 measurement signal is in).

Why this matters

The scorer v0.2 spec at docs/assessment/computer-scorer-v0.2-spec.md (PR #289) defines a 92-95-point ceiling for a clean run, raising it from v0.1's structural 79-point ceiling. Reaching that ceiling requires the scorer to consume:

Per-artifact content contracts (JSON Schema, content requirements, or inline schema) to replace presence-only correctness scoring.
Per-gate evidence requirements (hitl_request, decision-per-flag, blocked-no-token, side_effect_fired-after-approval) to replace language-pattern HITL detection.
Per-policy positive-evidence requirements (which ledger event proves the policy was _checked_ at each hook) to lift policy compliance from 22 → 25.

None of those can land on the Computer side until BR-side missions expose the fields. This PR is that exposure.

Schema changes (`tests/gauntlet/schemas/index.ts`)

Mission top-level

rubric_ref?: string — identifier of the rubric this mission targets.
hitl_gates_rationale?: string — required when hitl_gates: [] AND rubric_ref === "computer-rubric.v0.2".

`expected_artifacts[*]`

schema_ref?: string — relative path or URI to a JSON Schema 2020-12 document.
inline_schema?: object — embedded JSON Schema literal (alternative to schema_ref).
content_requirements?: { required_sections, min_citations, citation_pattern, prohibited_phrases? } — for markdown artifacts.
correctness_weight?: number — per-artifact weight override (defaults to uniform share of 25).
data_class_ceiling?: DataClassSchema — optional ceiling, inherits from inputs when absent.
Under v0.2: every required: true artifact must declare ONE of schema_ref / inline_schema / content_requirements. The conformance check enforces this post-parse so v0.1 missions stay valid.
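
The contract rule and the weight default above can be illustrated with a minimal TypeScript sketch. The `ArtifactSketch` type and both helper names are hypothetical and are not the definitions in `tests/gauntlet/schemas/index.ts`. The sketch assumes that "ONE of" means at least one contract is present, and that the uniform default is split across all declared artifacts; neither detail is confirmed by this log.

```typescript
// Hypothetical, simplified artifact shape; the real schema carries more fields.
interface ArtifactSketch {
  id: string;
  required: boolean;
  schema_ref?: string;
  inline_schema?: Record<string, unknown>;
  content_requirements?: {
    required_sections: string[];
    min_citations: number;
    citation_pattern: string;
    prohibited_phrases?: string[];
  };
  correctness_weight?: number;
}

// The correctness budget named in the field description above.
const CORRECTNESS_BUDGET = 25;

// True when the artifact declares at least one content contract.
function hasContentContract(a: ArtifactSketch): boolean {
  return (
    a.schema_ref !== undefined ||
    a.inline_schema !== undefined ||
    a.content_requirements !== undefined
  );
}

// Resolve per-artifact weights. An explicit override wins; otherwise the
// artifact falls back to a uniform share of the budget (assumed behavior).
function resolveCorrectnessWeights(
  artifacts: ArtifactSketch[],
): Record<string, number> {
  const uniform =
    artifacts.length > 0 ? CORRECTNESS_BUDGET / artifacts.length : 0;
  const weights: Record<string, number> = {};
  for (const a of artifacts) {
    weights[a.id] = a.correctness_weight ?? uniform;
  }
  return weights;
}
```

With five artifacts and no overrides, each resolves to a weight of 5, which matches the 5 × 5 = 25 split that F1 v0.3.0 declares explicitly.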

`hitl_gates[*]`

evidence_requirements?: { requires_hitl_request: true, requires_decision_per_flag, blocked_no_token_required, per_flag_artifact? } — declares what the scorer must find. Optional at the type level like the other new fields; required only under v0.2.
gate_weight?: number — per-gate weight override.

`policies[*]`

positive_evidence?: { pre_turn?, pre_tool?, post_tool?, pre_artifact?, post_run? } — per-hook event type + required fields.
policy_weight?: number — per-policy weight override.

`LedgerEntry.event_type` (additive)

Added v0.2 events: approval_token_issued, tool_blocked, side_effect_fired, artifact_emitted, redaction_attestation. The v0.1 events remain unchanged.
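
As an illustration of why these events matter to a consumer, the sketch below lists the five added event types and checks the side_effect_fired-after-approval ordering mentioned earlier. It is a hedged example only: `LedgerEntrySketch`, its `gate_id` field, and both function names are hypothetical, and the scorer's real logic lives on the Computer side and is not part of this PR.

```typescript
// The five event types this PR adds, as listed above.
const V02_EVENT_TYPES = [
  "approval_token_issued",
  "tool_blocked",
  "side_effect_fired",
  "artifact_emitted",
  "redaction_attestation",
] as const;

type V02EventType = (typeof V02_EVENT_TYPES)[number];

// Hypothetical minimal ledger entry; the real LedgerEntrySchema has more fields.
interface LedgerEntrySketch {
  event_type: string;
  gate_id?: string; // assumed field name, for illustration only
}

// Type guard: narrows an arbitrary string to one of the v0.2 event types.
function isV02Event(eventType: string): eventType is V02EventType {
  return (V02_EVENT_TYPES as readonly string[]).indexOf(eventType) !== -1;
}

// True when no side_effect_fired entry for the gate appears before an
// approval_token_issued entry for the same gate.
function sideEffectsFollowApproval(
  ledger: LedgerEntrySketch[],
  gateId: string,
): boolean {
  let approved = false;
  for (const entry of ledger) {
    if (entry.gate_id !== gateId) continue;
    if (entry.event_type === "approval_token_issued") approved = true;
    if (entry.event_type === "side_effect_fired" && !approved) return false;
  }
  return true;
}
```

A ledger that records tool_blocked, then approval_token_issued, then side_effect_fired for one gate passes this check; a ledger where the side effect comes first does not.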

Conformance check (`checkRubricV02Conformance`)

A new exported function that takes a parsed Mission and returns RubricV02Finding[]. Empty when the mission is fully v0.2-shaped or when rubric_ref is missing (short-circuits to keep v0.1 missions non-breaking).

Checks performed:

Required artifacts under v0.2 must declare a content contract.
HITL gates under v0.2 must declare evidence_requirements.
Policies under v0.2 must declare positive_evidence.
Empty hitl_gates under v0.2 must carry hitl_gates_rationale.
Weight overrides must sum to their budget: correctness_weight to 25, gate_weight to 15, policy_weight to 25 (checked only when ANY override is set).
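
The checks above can be sketched roughly as follows. This is not the shipped `checkRubricV02Conformance`: the `MissionSketch` and `FindingSketch` shapes, the `id` fields, and the finding codes are hypothetical, and only the correctness_weight sum is shown (gate_weight and policy_weight would follow the same pattern with budgets of 15 and 25). Summing only the overrides that are set is also an assumption.

```typescript
// Hypothetical, simplified shapes; only the fields the checks read.
interface MissionSketch {
  rubric_ref?: string;
  hitl_gates_rationale?: string;
  expected_artifacts: Array<{
    id: string;
    required: boolean;
    schema_ref?: string;
    inline_schema?: object;
    content_requirements?: object;
    correctness_weight?: number;
  }>;
  hitl_gates: Array<{ id: string; evidence_requirements?: object }>;
  policies: Array<{ id: string; positive_evidence?: object }>;
}

interface FindingSketch {
  code: string; // hypothetical finding codes
  subject: string;
}

const RUBRIC_V02 = "computer-rubric.v0.2";

// True when no override is set, or when the set overrides sum to the budget.
function sumsTo(weights: Array<number | undefined>, budget: number): boolean {
  const overrides = weights.filter((w): w is number => w !== undefined);
  if (overrides.length === 0) return true;
  const total = overrides.reduce((sum, w) => sum + w, 0);
  return Math.abs(total - budget) < 1e-9;
}

function checkConformanceSketch(m: MissionSketch): FindingSketch[] {
  // Short-circuit: missions that do not target v0.2 produce no findings.
  if (m.rubric_ref !== RUBRIC_V02) return [];
  const findings: FindingSketch[] = [];

  for (const a of m.expected_artifacts) {
    const hasContract =
      a.schema_ref !== undefined ||
      a.inline_schema !== undefined ||
      a.content_requirements !== undefined;
    if (a.required && !hasContract) {
      findings.push({ code: "artifact_contract_missing", subject: a.id });
    }
  }
  for (const g of m.hitl_gates) {
    if (g.evidence_requirements === undefined) {
      findings.push({ code: "gate_evidence_missing", subject: g.id });
    }
  }
  for (const p of m.policies) {
    if (p.positive_evidence === undefined) {
      findings.push({ code: "policy_evidence_missing", subject: p.id });
    }
  }
  if (m.hitl_gates.length === 0 && !m.hitl_gates_rationale) {
    findings.push({ code: "hitl_gates_rationale_missing", subject: "mission" });
  }
  const weights = m.expected_artifacts.map((a) => a.correctness_weight);
  if (!sumsTo(weights, 25)) {
    findings.push({ code: "correctness_weight_sum", subject: "mission" });
  }
  return findings;
}
```

Running the checks after parsing, rather than inside the schema, is what lets a v0.1 mission with none of these fields return an empty list.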

The validator (validate-result.ts) calls this function for every mission referenced by scores.json and emits one error-severity finding per conformance violation. Missions targeting v0.1 are not affected.

F1 v0.3.0 prototype (wired)

F1 now declares rubric_ref: "computer-rubric.v0.2" and populates every new field:

Surface	F1 v0.3.0 value
`rubric_ref`	`"computer-rubric.v0.2"`
Markdown artifact contracts (`draft-section-1`, `draft-section-2`)	`content_requirements` with required sections, min citations, citation pattern, prohibited MNPI phrases
JSON artifact contracts (`disclosure-flags`, `disclosure-approval`, `final-10q-bundle`)	`inline_schema` (JSON Schema 2020-12) with required fields, regex patterns for IDs and hashes
`correctness_weight` per artifact	5 × 5 artifacts = 25 ✓
`disclosure-committee-pre-finalize` gate `evidence_requirements`	`requires_hitl_request: true, requires_decision_per_flag: true, blocked_no_token_required: true, per_flag_artifact: "disclosure-flags"`
Three policies with `positive_evidence`	each declares `policy_check` event types per `enforce_at` hook with named required fields
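
For orientation, the gate and policy rows might look roughly like this when written out. The gate values are taken from the table; the `positive_evidence` entry is purely illustrative, since this log does not give the key names inside each hook (`event_type`, `required_fields`) or the required field names (`policy_id`, `decision`), so treat those as assumptions.

```typescript
// Shape follows the evidence_requirements description in the schema section.
interface EvidenceRequirementsSketch {
  requires_hitl_request: true;
  requires_decision_per_flag: boolean;
  blocked_no_token_required: boolean;
  per_flag_artifact?: string;
}

// Hook names come from the positive_evidence field description.
type HookName = "pre_turn" | "pre_tool" | "post_tool" | "pre_artifact" | "post_run";

// Hypothetical per-hook shape: an event type plus the fields it must carry.
interface HookEvidenceSketch {
  event_type: string;
  required_fields: string[];
}

type PositiveEvidenceSketch = Partial<Record<HookName, HookEvidenceSketch>>;

// Values from the disclosure-committee-pre-finalize row of the table.
const gateEvidence: EvidenceRequirementsSketch = {
  requires_hitl_request: true,
  requires_decision_per_flag: true,
  blocked_no_token_required: true,
  per_flag_artifact: "disclosure-flags",
};

// Illustrative only: which hooks a policy enforces at, and the field names,
// are assumptions made for this example.
const examplePositiveEvidence: PositiveEvidenceSketch = {
  pre_tool: { event_type: "policy_check", required_fields: ["policy_id", "decision"] },
  post_run: { event_type: "policy_check", required_fields: ["policy_id", "decision"] },
};
```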

Validation:

F1 v0.3.0 (rubric_ref=computer-rubric.v0.2) parses.
v0.2 conformance findings: 0
✓ fully v0.2 conformant

Tests added (7 new)

In tests/gauntlet/runners/validate-result.test.ts:

F1 v0.3.0 conforms fully to rubric v0.2 (0 findings)
H1 v0.1 (no rubric_ref) is skipped — back-compat guarantee
Missing artifact contract under v0.2 → flagged
Missing evidence_requirements on a declared gate → flagged
Missing positive_evidence on a policy → flagged
Empty hitl_gates without rationale → flagged
correctness_weight sum mismatch → flagged

The 8 pre-existing tests still pass. Suite total: 7779/0 (+7 from v0.2 conformance).

Migration path for remaining alpha-5

Not in this PR. The order to migrate H1/F2/L1/S1 is:

Wait for the next gauntlet run against F1 v0.3.0 to see whether the v0.2 scorer (when it ships on Perplexity's side) actually rewards the new fields. If F1 climbs from 79 toward 92-95, the model is validated.
Apply the same v0.2 pattern to H1/F2/L1/S1 (each gets rubric_ref, artifact contracts, gate evidence_requirements, policy positive_evidence).
Each subsequent migration ships as a focused PR mirroring F1's structure, so the score-delta per mission is independently measurable.

What this PR does NOT do

Does not change scorer behavior (Perplexity side).
Does not touch H1/F2/L1/S1 missions — deferred per the staged migration.
Does not regenerate the example fixture (H1-based, not affected by the new optional fields).
Does not change the validator's existing checks — it adds new conformance findings on top.

Verification

pnpm tsgo — exit 0
pnpm exec oxfmt --check / oxlint --type-aware — clean
pnpm test:fast — 7779/0 (+7 from v0.2 conformance suite)
F1 v0.3.0 parses + zero conformance findings (programmatic check)
H1 v0.1.0 parses + zero conformance findings (back-compat verified)

Lockstep

TS / Python SDK / MCP — no public API surface change
OpenAPI — no new routes
Ship log — this file