P0b — mTLS recovery hints + ECS misconfiguration warning (closes Computer Protocol #245 partial)

2026-05-09

apisecurity

LOCKSTEP TRACEABILITY MATRIX
---
api_endpoints:
  [
    "403 mtls_required (requireMtlsForRoute strict)",
    "401 tenant_mtls_required (tenantMtlsEnforcementMiddleware)",
  ]
sdk_methods_updated: ["none — no contract change; recovery hint shape was already RecoveryHint"]
mcp_tools_updated: ["none"]
---

What We Built

Closes the code-side half of Computer Protocol issue #245. The infra-side half (ensuring gateway.tls.mtls.behindAlb=true is actually set in the production ECS task definition + verifying ALB cert-header forwarding end-to-end) is Justin's task — out of scope for this PR per loop substrate-boundary rules.

Three changes:

  1. Recovery hints registered in error-recovery.ts: added mtls_required and tenant_mtls_required entries pointing at POST /v1/agent/auth/cert (the cert exchange endpoint). Previously these errors received the generic auth_error hint, which points at /v1/register (the API-key signup endpoint), the wrong direction for an mTLS auth failure.
  2. Top-level recovery field on the 403/401 responses: both requireMtlsForRoute (strict mode) and tenantMtlsEnforcementMiddleware now embed the recovery hint at the top level of the body (not nested under error.recovery). errorEnrichmentMiddleware finds it via body.recovery and surfaces the RecoveryHint headers instead of overriding it with the generic auth_error hint pointing at /v1/register. This is the same shape we converged on in P0a.
  3. Startup warning helper warnIfBehindAlbMisconfigured(config): fires once at middleware construction when ECS is detected (ECS_CONTAINER_METADATA_URI or AWS_EXECUTION_ENV env var set) but behindAlb is OFF. Logs "mTLS misconfiguration: ECS environment detected … but mtlsAuth.behindAlb is OFF". This surfaces the most likely root cause of Computer's symptom (issued certs not authenticating) at boot rather than letting every cert-bearing request silently 403.
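
A minimal sketch of the startup check in the third change, to make the trigger condition concrete. This is not the shipped helper: the MtlsAuthConfig shape is hypothetical, and the env and warn parameters are added here only so the sketch runs without a real process environment or logger (the entry's helper takes config alone). The message string is copied from this entry, elision included.

```typescript
// Hypothetical config shape; the real config type is not shown in this entry.
interface MtlsAuthConfig {
  behindAlb: boolean;
}

type EnvMap = Record<string, string | undefined>;

// True when either ECS marker env var named in this entry is set.
function isEcsEnvironment(env: EnvMap): boolean {
  return Boolean(env.ECS_CONTAINER_METADATA_URI || env.AWS_EXECUTION_ENV);
}

// Emits one warning when ECS is detected but behindAlb is off.
// Returns true if a warning was emitted, which keeps the sketch easy to test.
function warnIfBehindAlbMisconfigured(
  config: MtlsAuthConfig,
  env: EnvMap,
  warn: (message: string) => void,
): boolean {
  if (!isEcsEnvironment(env) || config.behindAlb) {
    return false;
  }
  // Message text as quoted in this entry, including its elision.
  warn("mTLS misconfiguration: ECS environment detected … but mtlsAuth.behindAlb is OFF");
  return true;
}

// Usage: call once at middleware construction, not per request.
const bootWarnings: string[] = [];
const didWarn = warnIfBehindAlbMisconfigured(
  { behindAlb: false },
  { AWS_EXECUTION_ENV: "AWS_ECS_FARGATE" },
  (message) => bootWarnings.push(message),
);
```

Returning a boolean and injecting warn are choices made for this sketch so the condition can be checked without capturing log output.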

Why It Matters

Per Computer Protocol Release Gate items 3+4: "mTLS-issued certs work on mTLS-required routes" + "Delegated child agents can be created, constrained, used, terminated, and verified inactive." The recovery-hint fix unblocks autonomous agents following the error chain — they hit mtls_required, read the recovery hint, exchange a CSR, retry. Before this PR they hit mtls_required, read a recovery hint pointing at signup, looped on signup attempts that don't apply.
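
To illustrate why the hint's target matters to an autonomous agent, here is a rough sketch of a client that blindly follows the recovery hint. The agent's real logic is not described in this entry, so planRecovery, parseEndpoint, and the NextStep type are hypothetical. Only error.code and a top-level recovery with an endpoint string are taken from the text above.

```typescript
// Hypothetical response shapes; only `error.code` and a top-level `recovery`
// with an `endpoint` string are taken from this entry.
interface RecoveryHint {
  endpoint: string; // e.g. "POST /v1/agent/auth/cert"
}

interface ErrorResponseBody {
  error: { code: string };
  recovery?: RecoveryHint;
}

type NextStep =
  | { action: "call"; method: string; path: string }
  | { action: "give_up"; reason: string };

// Split "POST /v1/agent/auth/cert" into a method and a path.
function parseEndpoint(endpoint: string): { method: string; path: string } | null {
  const match = /^([A-Z]+)\s+(\/\S*)$/.exec(endpoint.trim());
  return match ? { method: match[1], path: match[2] } : null;
}

// Decide what the agent does next: follow the hint if one is present.
function planRecovery(body: ErrorResponseBody): NextStep {
  if (!body.recovery) {
    return { action: "give_up", reason: `no recovery hint for ${body.error.code}` };
  }
  const target = parseEndpoint(body.recovery.endpoint);
  if (!target) {
    return { action: "give_up", reason: "unparseable recovery endpoint" };
  }
  return { action: "call", method: target.method, path: target.path };
}

// Same error code, different hint: the agent goes wherever the hint points.
const stepBeforeFix = planRecovery({
  error: { code: "mtls_required" },
  recovery: { endpoint: "POST /v1/register" },
});
const stepAfterFix = planRecovery({
  error: { code: "mtls_required" },
  recovery: { endpoint: "POST /v1/agent/auth/cert" },
});
```

With the old hint the planned call is the signup endpoint, which is the loop described above. With the new hint it is the cert exchange endpoint.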

The startup warning is the cheaper-than-an-integration-test signal that the production ECS task is missing the behindAlb config flip. CloudWatch will surface it at every boot until Justin sets the env var.

How It Works

Before P0b:
  curl --cert <pem> --key <key> https://api.brainstormrouter.com/v1/agent/delegate
  → 403 { error: { code: "mtls_required", ... } }
  → errorEnrichmentMiddleware sees no body.recovery
  → injects recovery hint via resolveRecoveryHint({type: "auth_error"})
  → endpoint: "POST /v1/register" (API-key signup, wrong direction)

After P0b:
  curl --cert <pem> --key <key> https://api.brainstormrouter.com/v1/agent/delegate
  → 403 { error: { code: "mtls_required", ... }, recovery: { ... } }
  → top-level body.recovery already populated
  → errorEnrichmentMiddleware surfaces RecoveryHint headers + skips override
  → endpoint: "POST /v1/agent/auth/cert" (cert exchange, correct path)
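
The two traces above can be sketched as a pair of small functions. This is a simplified stand-in, not the real middleware. RECOVERY_HINTS, buildMtlsErrorBody, and enrichErrorBody are hypothetical names, the RecoveryHint type is reduced to the single endpoint field visible in the traces, and the fallback is a plain table lookup rather than the real resolveRecoveryHint call.

```typescript
// Hypothetical shapes: only the error codes, endpoint strings, and the
// top-level `recovery` placement come from this entry.
interface RecoveryHint {
  endpoint: string;
}

interface ErrorBody {
  error: { code: string };
  recovery?: RecoveryHint;
}

// Stand-in for the entries registered in error-recovery.ts.
const RECOVERY_HINTS: Record<string, RecoveryHint> = {
  auth_error: { endpoint: "POST /v1/register" },
  mtls_required: { endpoint: "POST /v1/agent/auth/cert" },
  tenant_mtls_required: { endpoint: "POST /v1/agent/auth/cert" },
};

// What the mTLS middlewares return after P0b: the hint sits at the top level.
function buildMtlsErrorBody(code: "mtls_required" | "tenant_mtls_required"): ErrorBody {
  return { error: { code }, recovery: RECOVERY_HINTS[code] };
}

// Stand-in for errorEnrichmentMiddleware: keep an existing top-level hint,
// otherwise fall back to the generic auth_error hint.
function enrichErrorBody(body: ErrorBody): ErrorBody {
  if (body.recovery) {
    return body;
  }
  return { ...body, recovery: RECOVERY_HINTS["auth_error"] };
}

// Before P0b: no top-level hint, so the generic one is injected.
const enrichedPreP0b = enrichErrorBody({ error: { code: "mtls_required" } });
// After P0b: the hint is already present, so enrichment leaves it alone.
const enrichedPostP0b = enrichErrorBody(buildMtlsErrorBody("mtls_required"));
```

The point of the sketch is the early return: once body.recovery is populated upstream, the generic fallback never runs.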

What This Doesn't Fix

  • Production config flip: setting gateway.tls.mtls.behindAlb=true on the ECS task definition is Justin's task. The code path already existed in mtls-auth.ts:57-72 (it reads x-amzn-mtls-clientcert when behindAlb=true); the bug was that the flag was never enabled in production, so that path never ran. Until the env var lands, this PR's recovery hint just gives operators the right endpoint to call, but their cert still won't authenticate.
  • ALB cert-header forwarding: once behindAlb=true is set, the ALB listener config must forward the client cert. That's a one-time verification Justin should do by running curl --cert ... https:///v1/agent/delegate and confirming the request reaches the app with the header attached.
  • Integration test for the full CSR-exchange-then-use loop: would require either a live ALB or a sophisticated mock. Deferred to Computer's next adversarial run as the integration-level proof.
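
For readers unfamiliar with the behindAlb path, here is a rough sketch of what reading the forwarded cert can look like. It is not the code at mtls-auth.ts:57-72, which this entry does not show. readForwardedClientCert and HeaderMap are hypothetical, header keys are assumed to be lower-cased, and the header value is assumed to be URL-encoded PEM, as the ALB sends in mTLS passthrough mode.

```typescript
type HeaderMap = Record<string, string | undefined>;

// Header name taken from this entry; lookups assume lower-cased header keys.
const ALB_CLIENT_CERT_HEADER = "x-amzn-mtls-clientcert";

// Returns the client cert PEM forwarded by the ALB, or null when behindAlb is
// off, the header is absent, or the value does not decode to a PEM block.
// Assumption: the header value is URL-encoded PEM.
function readForwardedClientCert(headers: HeaderMap, behindAlb: boolean): string | null {
  if (!behindAlb) {
    return null; // with the flag off, the header is never consulted
  }
  const raw = headers[ALB_CLIENT_CERT_HEADER];
  if (!raw) {
    return null;
  }
  let pem: string;
  try {
    pem = decodeURIComponent(raw);
  } catch {
    return null; // malformed percent-encoding
  }
  return pem.includes("-----BEGIN CERTIFICATE-----") ? pem : null;
}

// Usage with a fake certificate body (not a real cert).
const samplePem = "-----BEGIN CERTIFICATE-----\nMIIB+fake/base64=\n-----END CERTIFICATE-----\n";
const forwardedHeaders: HeaderMap = {
  [ALB_CLIENT_CERT_HEADER]: encodeURIComponent(samplePem),
};
const certWithFlag = readForwardedClientCert(forwardedHeaders, true);
const certWithoutFlag = readForwardedClientCert(forwardedHeaders, false);
```

The second call shows the production symptom in miniature: the header is present, but with behindAlb off the cert is never read.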

Lockstep

  • TypeScript SDK: no change (response shape additive — adds top-level recovery field, doesn't break existing readers)
  • Python SDK: no change
  • MCP tools: no change
  • API surface: 403/401 response shape now includes recovery field (additive)
  • OpenAPI: no shape changes (could document the recovery field in a docs-only follow-up)
  • Ship log: this entry

Closes Release Gate

  • Item 3 (mTLS-issued certs work on mTLS-required routes): closed at the code level for recovery-flow correctness. Production-config-level closure depends on Justin's ECS env var change.
  • Item 4 (Delegated child agents lifecycle): partial. Once the config flip lands and Computer reruns the lifecycle tests, this should fully close.