2026-05-11-281-schema-v0.2-extensions
#281: Schema extensions for scorer v0.2 + F1 migration to v0.3.0
Date: 2026-05-11
Status: shipped
Slug: 281-schema-v0.2-extensions
Branch: feature/281-schema-v0.2-extensions
Companion to: PR #287 (rubric v0.1), PR #289 (scorer v0.2 target spec)
Addresses: #281
Summary
Extends MissionSchema, ExpectedArtifactSchema, HitlGateSchema, PolicyDeclarationSchema, and LedgerEntrySchema with every BR-side field the Computer scorer v0.2 needs to consume. The new fields are additive and optional at the type level so v0.1 missions continue to validate; a post-parse conformance check (checkRubricV02Conformance) enforces them only when a mission declares rubric_ref: "computer-rubric.v0.2".
F1 migrates to v0.3.0 as the wired prototype demonstrating every new field in use. H1/F2/L1/S1 are untouched (their migration is a follow-up once the F1 measurement signal is in).
Why this matters
The scorer v0.2 spec at docs/assessment/computer-scorer-v0.2-spec.md (PR #289) defines a 92-95-point ceiling for a clean run, raising it from v0.1's structural 79-point ceiling. Reaching that ceiling requires the scorer to consume:
- Per-artifact content contracts (JSON Schema, content requirements, or
inline schema) to replace presence-only correctness scoring.
- Per-gate evidence requirements (hitl_request, decision-per-flag,
blocked-no-token, side_effect_fired-after-approval) to replace language-pattern HITL detection.
- Per-policy positive-evidence requirements (which ledger event proves
the policy was _checked_ at each hook) to lift policy compliance from 22 → 25.
None of those can land on the Computer side until BR-side missions expose the fields. This PR is that exposure.
Schema changes (tests/gauntlet/schemas/index.ts)
Mission top-level
- rubric_ref?: string - identifier of the rubric this mission targets.
- hitl_gates_rationale?: string - required when hitl_gates: [] AND rubric_ref === "computer-rubric.v0.2".
expected_artifacts[*]
- schema_ref?: string - relative path or URI to a JSON Schema 2020-12 document.
- inline_schema?: object - embedded JSON Schema literal (alternative to schema_ref).
- content_requirements?: { required_sections, min_citations, citation_pattern, prohibited_phrases? } - for markdown artifacts.
- correctness_weight?: number - per-artifact weight override (defaults to uniform share of 25).
- data_class_ceiling?: DataClassSchema - optional ceiling, inherits from inputs when absent.
- Under v0.2: every required: true artifact must declare ONE of schema_ref / inline_schema / content_requirements. The conformance check enforces this post-parse so v0.1 missions stay valid.
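As a rough illustration of the one-contract rule, the sketch below mirrors the new artifact fields in plain TypeScript. Only the field names come from this log; the interface names, the `id` field, the element types inside `content_requirements`, and the `declaresContentContract` helper are hypothetical, and "ONE of" is read here as "at least one". The real schemas in tests/gauntlet/schemas/index.ts may be shaped differently.

```typescript
// Hypothetical plain-TypeScript mirror of the new expected_artifacts[*] fields.
interface ContentRequirements {
  required_sections: string[]; // element types are assumptions
  min_citations: number;
  citation_pattern: string;
  prohibited_phrases?: string[];
}

interface ExpectedArtifact {
  id: string; // assumed identifier field, not confirmed by the source
  required: boolean;
  schema_ref?: string;
  inline_schema?: Record<string, unknown>;
  content_requirements?: ContentRequirements;
  correctness_weight?: number;
}

// Assumption: "ONE of" means at least one content contract is present.
function declaresContentContract(artifact: ExpectedArtifact): boolean {
  return (
    artifact.schema_ref !== undefined ||
    artifact.inline_schema !== undefined ||
    artifact.content_requirements !== undefined
  );
}

// A markdown artifact carrying a content contract (values are illustrative).
const draftSection: ExpectedArtifact = {
  id: "draft-section-1",
  required: true,
  content_requirements: {
    required_sections: ["Overview"],
    min_citations: 1,
    citation_pattern: "\\[\\d+\\]",
  },
};

// A presence-only artifact, valid under v0.1 but flagged under v0.2.
const presenceOnly: ExpectedArtifact = { id: "legacy-artifact", required: true };

const draftOk = declaresContentContract(draftSection);
const legacyOk = declaresContentContract(presenceOnly);
```

Because every contract field stays optional at the type level, both objects type-check; only the post-parse helper tells them apart.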
hitl_gates[*]
- evidence_requirements: { requires_hitl_request: true, requires_decision_per_flag, blocked_no_token_required, per_flag_artifact? } - declares what the scorer must find.
- gate_weight?: number - per-gate weight override.
policies[*]
- positive_evidence?: { pre_turn?, pre_tool?, post_tool?, pre_artifact?, post_run? } - per-hook event type + required fields.
- policy_weight?: number - per-policy weight override.
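To make "per-hook event type + required fields" more concrete, here is a minimal sketch of how a consumer might read a positive_evidence declaration against a ledger. The hook names come from the list above; the `HookEvidence` shape (`event_type`, `required_fields`), the `hook` field on ledger entries, the example field names, and the `unprovenHooks` helper are all assumptions made for illustration. This is not the scorer's actual logic, which lives on the Computer side.

```typescript
type Hook = "pre_turn" | "pre_tool" | "post_tool" | "pre_artifact" | "post_run";

// Hypothetical per-hook shape: the log only says "event type + required fields".
interface HookEvidence {
  event_type: string;
  required_fields: string[];
}

type PositiveEvidence = Partial<Record<Hook, HookEvidence>>;

// Simplified ledger entry; the `hook` field is an assumption.
interface LedgerEntryLike {
  event_type: string;
  hook?: Hook;
  [field: string]: unknown;
}

// Returns the declared hooks for which no ledger entry proves the check ran.
function unprovenHooks(evidence: PositiveEvidence, ledger: LedgerEntryLike[]): Hook[] {
  const hooks = Object.keys(evidence) as Hook[];
  return hooks.filter((hook) => {
    const spec = evidence[hook];
    if (spec === undefined) return false;
    const proven = ledger.some(
      (entry) =>
        entry.hook === hook &&
        entry.event_type === spec.event_type &&
        spec.required_fields.every((f) => entry[f] !== undefined),
    );
    return !proven;
  });
}

// Illustrative declaration and ledger (field names are made up).
const examplePolicyEvidence: PositiveEvidence = {
  pre_tool: { event_type: "policy_check", required_fields: ["policy_id", "decision"] },
  post_run: { event_type: "policy_check", required_fields: ["policy_id"] },
};

const exampleLedger: LedgerEntryLike[] = [
  { event_type: "policy_check", hook: "pre_tool", policy_id: "p1", decision: "allow" },
];

const stillUnproven = unprovenHooks(examplePolicyEvidence, exampleLedger);
```

In this example the pre_tool check is backed by a ledger entry while post_run is not, so only post_run remains unproven.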
LedgerEntry.event_type (additive)
Added v0.2 events: approval_token_issued, tool_blocked, side_effect_fired, artifact_emitted, redaction_attestation. The v0.1 events remain unchanged.
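One way to picture the additive change is a string-literal union plus a type guard, sketched below. The five event names are taken from the sentence above; the constant, the type alias, the guard, and the placeholder `some_other_event` are hypothetical, and the actual LedgerEntrySchema may express the enum through a schema library instead.

```typescript
// The five event types this PR adds (names from the text above).
const V02_EVENT_TYPES = [
  "approval_token_issued",
  "tool_blocked",
  "side_effect_fired",
  "artifact_emitted",
  "redaction_attestation",
] as const;

type V02EventType = (typeof V02_EVENT_TYPES)[number];

// Type guard: narrows an arbitrary string to one of the v0.2 additions.
function isV02EventType(value: string): value is V02EventType {
  return (V02_EVENT_TYPES as readonly string[]).indexOf(value) !== -1;
}

// "some_other_event" is a placeholder for any event outside the v0.2 additions.
const observedEvents = ["tool_blocked", "some_other_event", "artifact_emitted"];
const v02Only: V02EventType[] = observedEvents.filter(isV02EventType);
```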
Conformance check (checkRubricV02Conformance)
A new exported function that takes a parsed Mission and returns RubricV02Finding[]. The result is empty when the mission is fully v0.2-shaped or when rubric_ref is missing; in the latter case the check short-circuits so that v0.1 missions keep validating.
Checks performed:
- Required artifacts under v0.2 must declare a content contract.
- HITL gates under v0.2 must declare evidence_requirements.
- Policies under v0.2 must declare positive_evidence.
- Empty hitl_gates under v0.2 must carry hitl_gates_rationale.
- Weight sum overrides: correctness_weight sums to 25; gate_weight sums to 15; policy_weight sums to 25 (only when ANY override is set).
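The checks listed above can be sketched roughly as follows. This is not the shipped checkRubricV02Conformance: the `Finding` shape, the finding codes, and the simplified mission types are hypothetical, and the sketch assumes that once any weight override is set, unset weights count as 0 toward the sum.

```typescript
// Hypothetical shapes: only the field names are taken from this log.
interface Finding { code: string; path: string }
interface Artifact {
  required: boolean;
  schema_ref?: string;
  inline_schema?: object;
  content_requirements?: object;
  correctness_weight?: number;
}
interface Gate { evidence_requirements?: object; gate_weight?: number }
interface Policy { positive_evidence?: object; policy_weight?: number }
interface Mission {
  rubric_ref?: string;
  hitl_gates_rationale?: string;
  expected_artifacts: Artifact[];
  hitl_gates: Gate[];
  policies: Policy[];
}

const V02_RUBRIC = "computer-rubric.v0.2";

// Assumption: the sum is only checked when at least one override is present.
function weightSumFinding(
  weights: Array<number | undefined>,
  target: number,
  path: string,
): Finding[] {
  if (!weights.some((w) => w !== undefined)) return [];
  const sum = weights.reduce<number>((acc, w) => acc + (w ?? 0), 0);
  return Math.abs(sum - target) > 1e-9 ? [{ code: "weight_sum_mismatch", path }] : [];
}

function checkConformanceSketch(m: Mission): Finding[] {
  // Short-circuit: missions that do not target v0.2 produce no findings.
  if (m.rubric_ref !== V02_RUBRIC) return [];
  const out: Finding[] = [];
  m.expected_artifacts.forEach((a, i) => {
    const hasContract =
      a.schema_ref !== undefined ||
      a.inline_schema !== undefined ||
      a.content_requirements !== undefined;
    if (a.required && !hasContract) {
      out.push({ code: "missing_content_contract", path: `expected_artifacts[${i}]` });
    }
  });
  m.hitl_gates.forEach((g, i) => {
    if (g.evidence_requirements === undefined) {
      out.push({ code: "missing_evidence_requirements", path: `hitl_gates[${i}]` });
    }
  });
  m.policies.forEach((p, i) => {
    if (p.positive_evidence === undefined) {
      out.push({ code: "missing_positive_evidence", path: `policies[${i}]` });
    }
  });
  if (m.hitl_gates.length === 0 && !m.hitl_gates_rationale) {
    out.push({ code: "missing_hitl_gates_rationale", path: "hitl_gates_rationale" });
  }
  out.push(
    ...weightSumFinding(m.expected_artifacts.map((a) => a.correctness_weight), 25, "expected_artifacts"),
    ...weightSumFinding(m.hitl_gates.map((g) => g.gate_weight), 15, "hitl_gates"),
    ...weightSumFinding(m.policies.map((p) => p.policy_weight), 25, "policies"),
  );
  return out;
}

// Illustrative mission with three problems: one artifact without a contract,
// correctness weights summing to 20, and empty hitl_gates with no rationale.
const draftMission: Mission = {
  rubric_ref: V02_RUBRIC,
  expected_artifacts: [
    { required: true, inline_schema: { type: "object" }, correctness_weight: 15 },
    { required: true, correctness_weight: 5 },
  ],
  hitl_gates: [],
  policies: [{ positive_evidence: { pre_tool: {} } }],
};

const draftFindings = checkConformanceSketch(draftMission);
// The same mission without rubric_ref short-circuits to no findings.
const legacyFindings = checkConformanceSketch({ ...draftMission, rubric_ref: undefined });
```

Running the check after parsing, rather than encoding it in the types, is what lets the same schema accept both v0.1 and v0.2 missions.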
The validator (validate-result.ts) calls this function for every mission referenced by scores.json and emits one error-severity finding per conformance violation. Missions targeting v0.1 are not affected.
F1 v0.3.0 prototype (wired)
F1 now declares rubric_ref: "computer-rubric.v0.2" and populates every new field:
| Surface | F1 v0.3.0 value |
|---|---|
| rubric_ref | "computer-rubric.v0.2" |
| Markdown artifact contracts (draft-section-1, draft-section-2) | content_requirements with required sections, min citations, citation pattern, prohibited MNPI phrases |
| JSON artifact contracts (disclosure-flags, disclosure-approval, final-10q-bundle) | inline_schema (JSON Schema 2020-12) with required fields, regex patterns for IDs and hashes |
| correctness_weight per artifact | 5 × 5 artifacts = 25 ✓ |
| disclosure-committee-pre-finalize gate evidence_requirements | requires_hitl_request: true, requires_decision_per_flag: true, blocked_no_token_required: true, per_flag_artifact: "disclosure-flags" |
| Three policies with positive_evidence | each declares policy_check event types per enforce_at hook with named required fields |
Validation:
F1 v0.3.0 (rubric_ref=computer-rubric.v0.2) parses.
v0.2 conformance findings: 0
✓ fully v0.2 conformant
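For orientation, the gate and weight rows of the table might look like this when written out as TypeScript literals. The values are the ones listed in the table; the `id` key, the `EvidenceRequirements` interface name, and the boolean types of the two non-literal flags are assumptions, and the actual F1 mission file is not reproduced here.

```typescript
// Field names from the hitl_gates[*] section; boolean types are assumed.
interface EvidenceRequirements {
  requires_hitl_request: true;
  requires_decision_per_flag: boolean;
  blocked_no_token_required: boolean;
  per_flag_artifact?: string;
}

// The "id" key is an assumption about how gates are named in the mission.
const f1Gate: { id: string; evidence_requirements: EvidenceRequirements } = {
  id: "disclosure-committee-pre-finalize",
  evidence_requirements: {
    requires_hitl_request: true,
    requires_decision_per_flag: true,
    blocked_no_token_required: true,
    per_flag_artifact: "disclosure-flags",
  },
};

// The five F1 artifacts named in the table, each with an override of 5.
const f1ArtifactIds = [
  "draft-section-1",
  "draft-section-2",
  "disclosure-flags",
  "disclosure-approval",
  "final-10q-bundle",
];
const f1Weights = f1ArtifactIds.map((id) => ({ id, correctness_weight: 5 }));

// 5 artifacts x 5 points each = 25, the correctness budget.
const f1WeightTotal = f1Weights.reduce((sum, a) => sum + a.correctness_weight, 0);
```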
Tests added (7 new)
In tests/gauntlet/runners/validate-result.test.ts:
- F1 v0.3.0 conforms fully to rubric v0.2 (0 findings)
- H1 v0.1 (no rubric_ref) is skipped: back-compat guarantee
- Missing artifact contract under v0.2 → flagged
- Missing evidence_requirements on a declared gate → flagged
- Missing positive_evidence on a policy → flagged
- Empty hitl_gates without rationale → flagged
- correctness_weight sum mismatch → flagged
The 8 pre-existing tests still pass. Suite total: 7779/0 (+7 from v0.2 conformance).
Migration path for remaining alpha-5
Not in this PR. The order to migrate H1/F2/L1/S1 is:
- Wait for the next gauntlet run against F1 v0.3.0 to see whether the
v0.2 scorer (when it ships on Perplexity's side) actually rewards the new fields. If F1 climbs from 79 toward 92-95, the model is validated.
- Apply the same v0.2 pattern to H1/F2/L1/S1 (each gets rubric_ref, artifact contracts, gate evidence_requirements, policy positive_evidence).
- Each subsequent migration ships as a focused PR mirroring F1's
structure, so the score-delta per mission is independently measurable.
What this PR does NOT do
- Does not change scorer behavior (Perplexity side).
- Does not touch H1/F2/L1/S1 missions — deferred per the staged
migration.
- Does not regenerate the example fixture (H1-based, not affected by
the new optional fields).
- Does not change the validator's existing checks — it adds new
conformance findings on top.
Verification
- pnpm tsgo: exit 0
- pnpm exec oxfmt --check / oxlint --type-aware: clean
- pnpm test:fast: 7779/0 (+7 from v0.2 conformance suite)
- F1 v0.3.0 parses + zero conformance findings (programmatic check)
- H1 v0.1.0 parses + zero conformance findings (back-compat verified)
Lockstep
- TS / Python SDK / MCP — no public API surface change
- OpenAPI — no new routes
- Ship log — this file