BR Tier 1 — god-mode audit table + CAF upsert-key cryptographic anchor

2026-04-22

caf, god-mode, security-events

What We Built

Three BR-side workstreams (BR-1, BR-2, BR-3) completing Tier 1 of the cross-project enrollment-idempotence RFC (~/.brainstorm/rfcs/2026-04-enrollment-idempotence.md, ratified v0.7).

BR-1 — god_mode_executions audit table (v49). Every /v1/god-mode/execute invocation writes one synchronous row before the HTTP response returns, capturing trace_id, tenant, caller api_key_id, source_ip, tool, params, risk_level, result, error_code, error, executed_at, duration_ms. Closes the audit gap where the trace_id returned to clients was not persisted anywhere.
BR-2 — CAF AgentServiceRegistry upsert key switched to (tenant_id, spiffe_id) (v50). Adds a machine_id_fingerprint column as a trust-ladder signal and a ux_agent_registry_tenant_spiffe UNIQUE INDEX, the first DB-level uniqueness constraint on agent_registry. The table previously had only non-unique B-tree indexes, and the hostname-based "upsert" was app-level only, which exposed a hostname-takeover race. Emits identity.enrollment_success on every register, and identity.fingerprint_drift when the machine_id_fingerprint already stored for a SPIFFE ID differs from the one presented at registration. Bundles a heartbeat DB-fallback: when the local cache misses (node restart, Redis pub/sub gap), heartbeat() now queries Postgres and hydrates the cache instead of silently returning false.
BR-3 — documents the 30-second API-key revocation propagation lag in ops-runbooks.md, including the manual Redis cache-invalidation recipe for sub-30s incident-response cutoff.

Why It Matters

Tier 1 closes three correctness gaps that don't show up in production metrics because no one was looking at the right surface:

God-mode calls were untraceable post-hoc. The trace_id BR returned in responses was cosmetic — nothing stored it. Forensics on a mutation ("who revoked which key at 14:07 UTC?") required parsing application logs, which are ephemeral. Now every god-mode call has a queryable audit row.
agent_registry allowed hostname takeover. Since the idempotency check was app-level (read-then-update-or-insert) and not DB-enforced, a concurrent registration race with matching hostnames could produce duplicate rows or overwrite an existing agent's cryptographic anchor. Switching to (tenant_id, spiffe_id) upsert anchors identity in the mTLS cert SAN — something the caller can't forge.
API-key revocation had an undocumented 30s lag. Operators rotating keys in normal flow are unaffected, but incident response (leaked key with active abuse) needed a manual Redis-flush recipe that wasn't written down anywhere.

How It Works

BR-1 — god-mode audit:

// src/api/routes/god-mode.ts
const traceId = crypto.randomUUID().slice(0, 16); // generated up-front, not post-hoc
const apiKey = c.get("apiKey");
const sourceIp = c.req.header("cf-connecting-ip") ?? c.req.header("x-forwarded-for") ?? null;
const startedAt = performance.now();
// ... run executor, capture result or error ...
const durationMs = Math.max(0, Math.round(performance.now() - startedAt));

await getDb()
  .insert(godModeExecutions)
  .values({
    traceId,
    tenantId,
    apiKeyId: apiKey?.id ?? null,
    sourceIp,
    tool: toolName,
    params,
    riskLevel: toolDef.risk_level,
    result,
    errorCode,
    error: errorMessage,
    durationMs,
  });

Audit failures are logged but never fail the user call — audit is advisory, not authoritative. api_key_id is text, not uuid, because mTLS produces a synthetic key id like mtls:.
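
The "logged but never fails the call" behaviour can be sketched as a thin wrapper around the insert. This is an illustration under assumptions, not the code in god-mode.ts: AuditRow, AuditSink, and recordAudit are hypothetical names, and the sink stands in for the getDb().insert(...) call above.

```typescript
// Hypothetical shapes for illustration; the real row carries more columns.
type AuditRow = { traceId: string; tool: string; durationMs: number };
type AuditSink = (row: AuditRow) => Promise<void>;

// Await the write so the row lands before the response is sent,
// but catch and log any failure so the caller's result is unaffected.
async function recordAudit(
  sink: AuditSink,
  row: AuditRow,
  log: (message: string) => void = console.error,
): Promise<boolean> {
  try {
    await sink(row);
    return true;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    log(`god-mode audit write failed (trace_id=${row.traceId}): ${reason}`);
    return false;
  }
}
```

The boolean return is only there so a caller or test can observe the outcome; the route handler would typically ignore it and return the executor's result either way.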

BR-2 — CAF upsert:

// src/security/caf/agent-registry.ts — register()
// Lookup by SPIFFE ID, not hostname:
const existing = await tx.select().from(agentRegistry)
  .where(and(eq(agentRegistry.tenantId, opts.tenantId),
             eq(agentRegistry.spiffeId, opts.spiffeId)))
  .limit(1);

// On update: refresh hostname (machine may have been legitimately renamed),
// and compare machine_id_fingerprint for drift detection.
if (existing.length > 0 && opts.machineIdFingerprint &&
    priorFingerprint && priorFingerprint !== opts.machineIdFingerprint) {
  emitIdentityEvent({ eventType: "identity.fingerprint_drift", severity: "warn", ... });
}

Events route through the existing security_events store (same bus as auth.mtls_success / auth.mtls_failure), matching the RFC §5.5 cross-product taxonomy so SIEM queries can join across products.
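
To make the keying and drift rules concrete, here is an in-memory sketch of the same upsert semantics. It is not the Drizzle/Postgres implementation: InMemoryAgentRegistry and its types are hypothetical, the Map key plays the role of the ux_agent_registry_tenant_spiffe unique index, and keeping the old fingerprint when a registration omits one is an assumption.

```typescript
type AgentRow = {
  tenantId: string;
  spiffeId: string;
  hostname: string;
  machineIdFingerprint: string | null;
};
type IdentityEvent = {
  eventType: "identity.enrollment_success" | "identity.fingerprint_drift";
  severity: "info" | "warn";
  tenantId: string;
  spiffeId: string;
};
type RegisterOpts = {
  tenantId: string;
  spiffeId: string;
  hostname: string;
  machineIdFingerprint?: string;
};

class InMemoryAgentRegistry {
  private rows = new Map<string, AgentRow>();
  readonly events: IdentityEvent[] = [];

  // One row per (tenantId, spiffeId); hostname is deliberately not part of the key.
  private key(tenantId: string, spiffeId: string): string {
    return JSON.stringify([tenantId, spiffeId]);
  }

  register(opts: RegisterOpts): AgentRow {
    const k = this.key(opts.tenantId, opts.spiffeId);
    const prior = this.rows.get(k);
    const priorFp = prior?.machineIdFingerprint ?? null;
    const newFp = opts.machineIdFingerprint ?? null;

    // Drift fires only when both fingerprints are known and they differ.
    if (priorFp !== null && newFp !== null && priorFp !== newFp) {
      this.events.push({
        eventType: "identity.fingerprint_drift",
        severity: "warn",
        tenantId: opts.tenantId,
        spiffeId: opts.spiffeId,
      });
    }

    // Upsert: hostname is refreshed, since a machine may be legitimately renamed.
    const row: AgentRow = {
      tenantId: opts.tenantId,
      spiffeId: opts.spiffeId,
      hostname: opts.hostname,
      machineIdFingerprint: newFp ?? priorFp,
    };
    this.rows.set(k, row);
    this.events.push({
      eventType: "identity.enrollment_success",
      severity: "info",
      tenantId: opts.tenantId,
      spiffeId: opts.spiffeId,
    });
    return row;
  }

  get size(): number {
    return this.rows.size;
  }
}
```

In this model, a second agent that reuses an existing hostname under a different SPIFFE ID gets its own row rather than overwriting the first, which is the takeover case the new key is meant to rule out.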

BR-2 — heartbeat DB-fallback:

// Before: cache miss → silent `return false`
// After: cache miss → DB lookup by (tenantId, hostname) → hydrate cache → complete heartbeat
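
A minimal sketch of that fallback path, assuming a per-node Map cache and an injected lookup function in place of the real Postgres query. HeartbeatTracker, AgentRecord, and DbLookup are hypothetical names used only for this example.

```typescript
type AgentRecord = { tenantId: string; hostname: string; lastSeenAt: number };
type DbLookup = (tenantId: string, hostname: string) => Promise<AgentRecord | null>;

class HeartbeatTracker {
  private cache = new Map<string, AgentRecord>();
  private lookup: DbLookup;

  constructor(lookup: DbLookup) {
    this.lookup = lookup;
  }

  async heartbeat(tenantId: string, hostname: string, nowMs: number = Date.now()): Promise<boolean> {
    const key = JSON.stringify([tenantId, hostname]);
    let agent = this.cache.get(key);

    if (agent === undefined) {
      // Cache miss (e.g. after a node restart): fall back to the database.
      const fromDb = await this.lookup(tenantId, hostname);
      if (fromDb === null) return false; // genuinely unknown agent
      agent = { ...fromDb };
      this.cache.set(key, agent); // hydrate so later heartbeats skip the DB
    }

    agent.lastSeenAt = nowMs;
    return true;
  }
}
```

The point of the design is that false now means "no such agent in the database" rather than "this node's cache happens to be cold".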

BR-3 — doc: added an "API Key Revocation — Known Propagation Lag" section under Secret Rotation in ops-runbooks.md, with the manual Redis invalidation recipe.
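
The runbook recipe itself is not reproduced here. As a rough model of why the lag exists and what a manual invalidation buys, the sketch below assumes the validated-key cache behaves like a TTL cache with a 30-second expiry; this log states the lag and the Redis recipe but not the mechanism, so treat that as an assumption. ApiKeyCache is a hypothetical class, a Map stands in for Redis, and the clock is passed in explicitly.

```typescript
// Toy model: a Map stands in for Redis, loadFromDb for the database lookup.
type CacheEntry = { valid: boolean; expiresAt: number };

class ApiKeyCache {
  private entries = new Map<string, CacheEntry>();
  private loadFromDb: (keyId: string) => boolean;
  private ttlMs: number;

  constructor(loadFromDb: (keyId: string) => boolean, ttlMs: number = 30_000) {
    this.loadFromDb = loadFromDb;
    this.ttlMs = ttlMs;
  }

  isValid(keyId: string, nowMs: number): boolean {
    const hit = this.entries.get(keyId);
    if (hit !== undefined && nowMs < hit.expiresAt) return hit.valid; // may be stale
    const valid = this.loadFromDb(keyId);
    this.entries.set(keyId, { valid, expiresAt: nowMs + this.ttlMs });
    return valid;
  }

  // Manual cut-off: drop the cached entry so the next check re-reads the database.
  invalidate(keyId: string): void {
    this.entries.delete(keyId);
  }
}

const revoked = new Set<string>();
const cache = new ApiKeyCache((id) => !revoked.has(id));

cache.isValid("key_1", 0);                      // caches "valid" for 30 s
revoked.add("key_1");                           // revocation lands in the database
const stale = cache.isValid("key_1", 10_000);   // served from cache, not yet revoked
cache.invalidate("key_1");                      // incident-response cut-off
const fresh = cache.isValid("key_1", 10_001);   // re-read from the database
```

Without the invalidate call, the same key would keep validating until its cached entry expired, which is the window the runbook section documents.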

The Numbers

2 new schema versions (v49, v50) — deploy-independent of any other BR work.
1 new DB-level unique constraint on agent_registry (was 0).
1 new identity event family (identity.*) on the existing security_events store — no new bus, no new wiring.
Audit write overhead per god-mode call: one synchronous INSERT, measured in microseconds against a warm connection pool. God-mode call throughput is low enough that this is not a hot path.
/v1/mesh/register traffic today: zero production callers (pre-emptive correctness fix; RFC §6 Phase 2b — BR-4 SDK methods will wire real callers in the next sprint).

Competitive Edge

BR is now the only gateway among comparable products with a cryptographically anchored agent registry: (tenant, SPIFFE ID) is the DB-enforced identity. Portkey, Helicone, Letta, and Vercel AI Gateway all treat agent identity as either a client-asserted hostname/label or a simple api_key. The CAF mesh (mTLS + CA + revocation + registry upsert on crypto identity) is a zero-trust primitive none of them ship natively.

Lockstep Checklist

[x] API Routes: no new routes, no contract changes (/v1/god-mode/execute, /v1/mesh/register, /v1/mesh/heartbeat JSON contracts unchanged).
[x] TS SDK: not applicable — no new routes or fields exposed. BR-4 (next sprint) adds mesh.register()/heartbeat()/deregister() as a separate workstream.
[x] Python SDK: not applicable — same reason.
[x] MCP Schemas: not applicable.
[x] Ship Log: this file.

RFC: ~/.brainstorm/rfcs/2026-04-enrollment-idempotence.md (v0.7, cross-ratified)
Coordinated with: brainstorm CLI peer (6sv3av7w), BrainstormMSP peer (m6uoiyb2), brainstorm-agent peer (a3br7vkd)
Upcoming: BR-4 (TS + Python SDK mesh methods — prerequisite for any external mesh enrollment client), BR-5 (optional MCP tool wrapper, deferred)