Fix: model-performance persist no longer aborts with ON CONFLICT DO UPDATE error

2026-05-08

routerintelligencemodel-performance

What We Built

R23 closure verification surfaced repeated CW Logs entries:

[router/model-performance] Failed to persist model performance buckets:
  ON CONFLICT DO UPDATE command cannot affect row a second time

Root cause: persistToDb builds a multi-row INSERT … ON CONFLICT (tenant_id, model_key, task_type, day) DO UPDATE … statement. Postgres raises this error when two rows in the SAME INSERT statement both target the same conflict row — the planner cannot safely re-update a row twice in one statement. Two in-memory buckets that serialize onto the same (tenant_id, model_key, task_type, day) natural key trigger this every persist cycle, aborting the entire transaction and losing all the tenant's bucket updates for that cycle.

Fix: dedupe by natural key BEFORE the multi-row INSERT, keeping the row with the highest sample count n (same precedence the ON CONFLICT clause uses to decide which row wins on update). Best-effort defense — makes the failure mode impossible regardless of which keying anomaly produced the duplicate.

Why It Matters

Model performance buckets are the persistence layer for the bandit selector's reward estimates and the ranking engine's quality signals. Failed persist cycles meant the in-memory state diverged from what other ECS tasks could see after restart, eroding the routing fabric's cross-task continuity. The fix ensures every persist cycle completes, even when in-memory keying produces collisions.

How It Works

const dedupedByKey = new Map<string, DailyModelBucket>();
for (const row of rows) {
  const taskType = row.taskType ?? "__global__";
  const key = `${row.modelKey} ${taskType} ${row.date}`;
  const existing = dedupedByKey.get(key);
  if (!existing || row.n > existing.n) {
    dedupedByKey.set(key, row);
  }
}
const dedupedRows = [...dedupedByKey.values()];

A warn-level log fires when dedup actually collapses any rows so we get visibility into how often the keying anomaly occurs. The downstream chunked INSERT loop iterates dedupedRows instead of the raw rows.

The Numbers

  • 1 new paired regression test pinning the dedup behavior end-to-end:

injects two buckets that share (tenant_id, model_key, task_type, day) via different outer composite-key paths, persists, and verifies (a) no error thrown, (b) only ONE row landed in the INSERT, (c) the surviving row is the one with the higher n (12, not 5).

  • The existing 50-bucket multi-row batch test still passes (no

regression on the happy path).

  • Production log volume: every persist cycle was logging this error

~1-2× per minute. After deploy, the line should drop to 0.

Lockstep Checklist

  • [x] API Routes: unchanged
  • [x] TS SDK: unchanged
  • [x] Python SDK: unchanged
  • [x] MCP Schemas: unchanged
  • [x] Master Record: unchanged