Savings ledger avgValidity wired end-to-end (was hardcoded 0 across all paths)
2026-05-08
What We Built
Closed the third-most-flagged R23 risk (3/10 agents): avgValidity: 0 across every path in /v1/intelligence/savings. Two coupled bugs masked each other, so the gap closes only when both are fixed together.
Bug 1: write-path coverage gap. Automatic per-response validity scores were written only to a JSONL file in the container's state directory. In multi-task ECS deployments that disk is ephemeral: every task replacement wipes the buffer, and no aggregation ever read its contents. Only user-feedback writes (via src/api/usage-queries.ts) reached the usage_quality_events table.
Bug 2: the read path skipped the join. The DB-backed summarizeFromUsage SQL had no JOIN to usage_quality_events and hardcoded avgValidity: 0 in the byPath aggregation. The original code comment was honest about it: "avgValidity isn't readily available without joining the quality_scores table; leave 0 and fill in later."
Why It Matters
The headline routing intelligence claim — "we route to cheaper models without sacrificing quality" — has been unverifiable end-to-end through the public surface for the duration of the savings ledger's existence. Operators saw "savings: $X" and "avgValidity: 0.00" on the same row, which reads as "we saved money and quality dropped to zero." With this PR the dashboard surfaces the request-weighted average across all paths that had any quality coverage, plus a clean 0 sentinel meaning "no quality signal yet" for paths the validity scorer hasn't reached.
How It Works
QualityEventWriter now takes an optional getPool: () => PgQuery | null in its constructor. Each record() call resolves the pool lazily, so writes start landing in the DB the moment Postgres is initialized mid-boot, without rewiring the writer's DI. Persistence is best-effort: a DB error logs a warning but never throws back into the inline routing path.
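A minimal sketch of the lazy-pool, fail-open pattern described above. The PgQuery and QualityEvent shapes, the warn hook, the boolean return value, and the INSERT statement are assumptions made for illustration; only the getPool signature, the table name, and the column names tenant_id, request_id, score, and source come from this note.

```typescript
// Hypothetical minimal shapes; the real PgQuery and event types are not shown in this note.
interface PgQuery {
  query(sql: string, params: unknown[]): Promise<unknown>;
}

interface QualityEvent {
  tenantId: string | null;
  requestId: string;
  score: number;
  source: string; // e.g. "validity"
}

class QualityEventWriter {
  private readonly getPool?: () => PgQuery | null;
  private readonly warn: (msg: string) => void;

  constructor(
    getPool?: () => PgQuery | null,
    warn: (msg: string) => void = (msg) => console.warn(msg),
  ) {
    this.getPool = getPool;
    this.warn = warn;
  }

  // Resolves to true only when a row was written. Never rejects,
  // so a DB failure cannot propagate into the routing path.
  async record(event: QualityEvent): Promise<boolean> {
    if (event.tenantId === null) return false; // no-tenant skip

    // Resolved on every call rather than at construction time, so a pool
    // that becomes available mid-boot is picked up automatically.
    const pool = this.getPool ? this.getPool() : null;
    if (pool === null) return false; // null-pool no-op

    try {
      // Assumed statement; the real INSERT may carry more columns.
      await pool.query(
        "INSERT INTO usage_quality_events (tenant_id, request_id, score, source) " +
          "VALUES ($1, $2, $3, $4)",
        [event.tenantId, event.requestId, event.score, event.source],
      );
      return true;
    } catch (err) {
      this.warn(`quality event persist failed: ${String(err)}`);
      return false; // fail open
    }
  }
}
```

Because the pool is looked up on each call, a writer constructed before Postgres is ready needs no re-injection: early calls are no-ops, and later calls persist.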
SavingsTracker.summarizeFromUsage now uses a two-CTE aggregation:
WITH per_request_quality AS (
    SELECT DISTINCT ON (tenant_id, request_id)
        tenant_id, request_id, score, source
    FROM usage_quality_events
    WHERE created_at >= $1 AND created_at < $2
    ORDER BY tenant_id, request_id,
        CASE source
            WHEN 'user_feedback' THEN 1
            WHEN 'arena_vote' THEN 2
            WHEN 'judge' THEN 3
            WHEN 'validity' THEN 4
        END
)
SELECT ..., AVG(prq.score)::float8 AS avg_validity
FROM usage_events ue
LEFT JOIN per_request_quality prq
    ON prq.tenant_id = ue.tenant_id AND prq.request_id = ue.request_id
...
The CTE picks one score per request, with the highest-trust signal winning, before the bucket join, so a request with both a user_feedback score and a validity sample contributes only the user_feedback score to the bucket average. The byPath aggregation is weighted by request count across buckets, so sparse high-validity buckets cannot mask heavy-traffic ones.
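The per-request selection can be mirrored in application code, which is a handy way to reason about what DISTINCT ON returns. This is only an illustrative sketch: QualityRow, SOURCE_RANK, and pickPerRequestQuality are hypothetical names, and the ranking simply copies the CASE expression above.

```typescript
type QualitySource = "user_feedback" | "arena_vote" | "judge" | "validity";

// Hypothetical in-memory shape of a usage_quality_events row.
interface QualityRow {
  tenantId: string;
  requestId: string;
  score: number;
  source: QualitySource;
}

// Lower rank means higher trust, matching the CASE ordering in the CTE.
const SOURCE_RANK: Record<QualitySource, number> = {
  user_feedback: 1,
  arena_vote: 2,
  judge: 3,
  validity: 4,
};

// Keeps exactly one row per (tenantId, requestId): the one with the lowest rank.
function pickPerRequestQuality(rows: QualityRow[]): Map<string, QualityRow> {
  const best = new Map<string, QualityRow>();
  for (const row of rows) {
    const key = JSON.stringify([row.tenantId, row.requestId]);
    const current = best.get(key);
    if (current === undefined || SOURCE_RANK[row.source] < SOURCE_RANK[current.source]) {
      best.set(key, row);
    }
  }
  return best;
}
```

When two rows for the same request share a source, this sketch keeps the first one seen. The SQL shown has no tiebreaker within a source, so Postgres may return either row in that case.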
The Numbers
- 9 new paired regression tests:
  - 5 in quality-events.test.ts: DB persistence, signalType→source mapping, no-tenant skip, fail-open on DB error, null-pool no-op
  - 4 in savings-tracker.test.ts: SQL contains the CTE + LEFT JOIN, request-weighted average, NULL-coverage handling, all-NULL → 0
- 0 new endpoints — response shape unchanged
- 0 SDK changes — field was already in the type, just always 0 in prod
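The roll-up behaviour that the savings-tracker tests pin down (request-weighted average, NULL coverage skipped, all-NULL collapsing to 0) can be sketched as follows. BucketRow and rollUpAvgValidity are hypothetical names, and weighting each bucket by its total request count is an assumption about how the byPath aggregation is implemented.

```typescript
// Hypothetical shape of one aggregated bucket returned by the SQL.
interface BucketRow {
  path: string;
  requestCount: number;
  avgValidity: number | null; // null when the LEFT JOIN matched no quality rows
}

// Collapses buckets into one avgValidity per path, weighted by request count.
// Buckets without quality coverage are skipped; a path with no covered
// buckets reports the 0 sentinel ("no quality signal yet").
function rollUpAvgValidity(buckets: BucketRow[]): Map<string, number> {
  const sums = new Map<string, { weighted: number; weight: number }>();
  for (const b of buckets) {
    const acc = sums.get(b.path) ?? { weighted: 0, weight: 0 };
    if (b.avgValidity !== null) {
      acc.weighted += b.avgValidity * b.requestCount;
      acc.weight += b.requestCount;
    }
    sums.set(b.path, acc);
  }

  const out = new Map<string, number>();
  sums.forEach((acc, path) => {
    out.set(path, acc.weight > 0 ? acc.weighted / acc.weight : 0);
  });
  return out;
}
```

With a 900-request bucket at 0.8 and a 100-request bucket at 0.2, the path averages to about 0.74, not the unweighted 0.5, which is the property that stops a sparse bucket from dominating the figure.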
Lockstep Checklist
- [x] API Routes: GET /v1/intelligence/savings (response shape unchanged)
- [x] TS SDK: unchanged (field already present)
- [x] Python SDK: unchanged (field already present)
- [x] MCP Schemas: unchanged (no agent-facing surface)
- [x] Master Record: unchanged
R23 risk register: third-most-flagged item closed.