Savings ledger avgValidity wired end-to-end (was hardcoded 0 across all paths)

2026-05-08

router, intelligence, savings-ledger, quality-events

What We Built

Closed the third-most-flagged R23 risk (3/10 agents): avgValidity: 0 across every path in /v1/intelligence/savings. Two coupled bugs masked each other, so the gap closes only when both are fixed together.

Bug 1 — write-path coverage gap. Automatic per-response validity scores were written only to a JSONL file in the container's state directory. In multi-task ECS deployments that disk is ephemeral: the buffer is wiped on every task replacement, and no aggregation ever reads its contents. Only user-feedback writes (via src/api/usage-queries.ts) reached the usage_quality_events table.

Bug 2 — read-path skipped the join. The DB-backed summarizeFromUsage SQL had no JOIN to usage_quality_events and hardcoded avgValidity: 0 in the byPath aggregation. The original code comment was honest about it: "avgValidity isn't readily available without joining the quality_scores table; leave 0 and fill in later."

Why It Matters

The headline routing intelligence claim — "we route to cheaper models without sacrificing quality" — has been unverifiable end-to-end through the public surface for the duration of the savings ledger's existence. Operators saw "savings: $X" and "avgValidity: 0.00" on the same row, which reads as "we saved money and quality dropped to zero." With this PR the dashboard surfaces the request-weighted average across all paths that had any quality coverage, plus a clean 0 sentinel meaning "no quality signal yet" for paths the validity scorer hasn't reached.

How It Works

QualityEventWriter now takes an optional getPool: () => PgQuery | null in its constructor. Each record() call resolves the pool lazily, so writes start landing in the DB the moment Postgres is initialized mid-boot, without rewiring the writer's dependency injection. Persistence is best-effort: a DB error logs a warning but never throws back into the inline routing path.
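
The lazy-pool, fail-open behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the PgQuery and QualityEvent shapes, the injected warn callback, the boolean return value, and the INSERT column list (inferred from the columns the read-path query selects) are all assumptions made for the example.

```typescript
// Hypothetical shapes: the real PgQuery and event types are not shown in this note.
interface PgQuery {
  query(sql: string, params: unknown[]): Promise<unknown>;
}

interface QualityEvent {
  tenantId: string | null;
  requestId: string;
  score: number;
  source: string;
}

class QualityEventWriter {
  private readonly getPool: () => PgQuery | null;
  private readonly warn: (msg: string) => void;

  constructor(
    getPool: () => PgQuery | null = () => null,
    warn: (msg: string) => void = (m) => console.warn(m),
  ) {
    this.getPool = getPool;
    this.warn = warn;
  }

  // Resolves to true only when a row was written. Never rejects.
  async record(ev: QualityEvent): Promise<boolean> {
    if (ev.tenantId === null) return false; // no-tenant skip

    // Resolved on every call rather than at construction time, so a pool
    // that appears mid-boot is picked up without rebuilding the writer.
    const pool = this.getPool();
    if (pool === null) return false; // Postgres not initialized yet: no-op

    try {
      await pool.query(
        "INSERT INTO usage_quality_events (tenant_id, request_id, score, source) " +
          "VALUES ($1, $2, $3, $4)",
        [ev.tenantId, ev.requestId, ev.score, ev.source],
      );
      return true;
    } catch (err) {
      // Fail open: log and swallow so the routing path is never interrupted.
      this.warn(`quality event persist failed: ${String(err)}`);
      return false;
    }
  }
}
```

Passing a getter instead of the pool itself is what lets a writer constructed before Postgres is ready begin persisting later: the caller only has to make the getter return the pool once it exists.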

SavingsTracker.summarizeFromUsage now uses a two-CTE aggregation:

WITH per_request_quality AS (
  SELECT DISTINCT ON (tenant_id, request_id)
    tenant_id, request_id, score, source
  FROM usage_quality_events
  WHERE created_at >= $1 AND created_at < $2
  ORDER BY tenant_id, request_id,
    CASE source
      WHEN 'user_feedback' THEN 1
      WHEN 'arena_vote'    THEN 2
      WHEN 'judge'         THEN 3
      WHEN 'validity'      THEN 4
    END
)
SELECT ..., AVG(prq.score)::float8 AS avg_validity
FROM usage_events ue
LEFT JOIN per_request_quality prq
  ON prq.tenant_id = ue.tenant_id AND prq.request_id = ue.request_id
...

The CTE picks one score per request — highest-trust signal wins — before the bucket join, so a request with both user_feedback and a validity sample contributes only the user_feedback score to the bucket average. byPath aggregation is request-count-weighted across buckets so heavy-traffic buckets cannot be obscured by sparse high-validity buckets.
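
Both rules in the paragraph above can be mirrored in application code for illustration. The sketch below is not the production code path (the aggregation itself is done in SQL): the type names, function names, and bucket shape are hypothetical, and it assumes that each bucket is weighted by its request count and that buckets with no quality coverage are left out of the average.

```typescript
type Source = "user_feedback" | "arena_vote" | "judge" | "validity";

// Lower rank = higher trust, matching the CASE expression in the CTE.
const TRUST_RANK: Record<Source, number> = {
  user_feedback: 1,
  arena_vote: 2,
  judge: 3,
  validity: 4,
};

interface QualityRow {
  requestId: string;
  score: number;
  source: Source;
}

// Keep one score per request: the one from the most trusted source.
function pickPerRequest(rows: QualityRow[]): Record<string, number> {
  const best: Record<string, QualityRow> = {};
  for (const row of rows) {
    const cur = best[row.requestId];
    if (cur === undefined || TRUST_RANK[row.source] < TRUST_RANK[cur.source]) {
      best[row.requestId] = row;
    }
  }
  const scores: Record<string, number> = {};
  for (const id of Object.keys(best)) {
    scores[id] = best[id].score;
  }
  return scores;
}

interface Bucket {
  requests: number;
  avgValidity: number | null; // null = no quality coverage in this bucket
}

// Request-weighted mean over covered buckets; 0 is the "no signal" sentinel.
function weightedAvgValidity(buckets: Bucket[]): number {
  let weighted = 0;
  let total = 0;
  for (const b of buckets) {
    if (b.avgValidity === null) continue; // skip NULL-coverage buckets
    weighted += b.avgValidity * b.requests;
    total += b.requests;
  }
  return total === 0 ? 0 : weighted / total;
}
```

For example, a 900-request bucket averaging 0.5 and a 100-request bucket averaging 1.0 combine to 0.55 under this weighting, whereas an unweighted mean of the two bucket averages would report 0.75.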

The Numbers

  • 9 new paired regression tests:
      • 5 in quality-events.test.ts: DB persistence, signalType→source mapping, no-tenant skip, fail-open on DB error, null-pool no-op
      • 4 in savings-tracker.test.ts: SQL contains the CTE + LEFT JOIN, request-weighted average, NULL-coverage handling, all-NULL → 0

  • 0 new endpoints — response shape unchanged
  • 0 SDK changes — field was already in the type, just always 0 in prod

Lockstep Checklist

  • [x] API Routes: GET /v1/intelligence/savings — response shape unchanged
  • [x] TS SDK: unchanged (field already present)
  • [x] Python SDK: unchanged (field already present)
  • [x] MCP Schemas: unchanged (no agent-facing surface)
  • [x] Master Record: unchanged

R23 risk register: third-most-flagged item closed.