P0b — mTLS recovery hints + ECS misconfiguration warning (closes Computer Protocol #245 partial)
2026-05-09
LOCKSTEP TRACEABILITY MATRIX
---
api_endpoints:
  [
    "403 mtls_required (requireMtlsForRoute strict)",
    "401 tenant_mtls_required (tenantMtlsEnforcementMiddleware)",
  ]
sdk_methods_updated: ["none — no contract change; recovery hint shape was already RecoveryHint"]
mcp_tools_updated: ["none"]
---
What We Built
Closes the code-side half of Computer Protocol issue #245. The infra-side half (ensuring gateway.tls.mtls.behindAlb=true is actually set in the production ECS task definition + verifying ALB cert-header forwarding end-to-end) is Justin's task — out of scope for this PR per loop substrate-boundary rules.
Three changes:
- Recovery hints registered in error-recovery.ts: added mtls_required and tenant_mtls_required entries pointing at POST /v1/agent/auth/cert (the cert exchange endpoint) instead of the previously injected generic auth_error hint that pointed at /v1/register (the API-key signup endpoint, which is the wrong direction for an mTLS auth failure).
- Top-level recovery field on the 403/401 responses: both requireMtlsForRoute (strict mode) and tenantMtlsEnforcementMiddleware now embed the recovery hint at top level (not nested under error.recovery) so errorEnrichmentMiddleware finds it via body.recovery and surfaces the RecoveryHint headers, instead of overriding with a generic auth_error hint pointing at /v1/register. This is the same shape we converged on in P0a.
- Startup warning helper warnIfBehindAlbMisconfigured(config): fires once at middleware construction when ECS is detected (ECS_CONTAINER_METADATA_URI or AWS_EXECUTION_ENV env vars set) but behindAlb is OFF. Logs "mTLS misconfiguration: ECS environment detected … but mtlsAuth.behindAlb is OFF". Surfaces the most likely root cause of Computer's symptom (issued certs not authenticating) at boot rather than letting every cert-bearing request silently 403.
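The startup check in the third change can be sketched as follows. This is a minimal illustration, not the code in the PR: the MtlsConfig shape, the explicit env and warn parameters, and the boolean return value are assumptions added so the example is self-contained, and the middle of the log message is left elided exactly as it is in this entry.

```typescript
// Hypothetical config shape; only the behindAlb flag is taken from this entry.
interface MtlsConfig {
  behindAlb: boolean;
}

type Env = Record<string, string | undefined>;

// ECS counts as detected when either env var named in this entry is set.
function isEcsEnvironment(env: Env): boolean {
  return Boolean(env["ECS_CONTAINER_METADATA_URI"] || env["AWS_EXECUTION_ENV"]);
}

// Emits one warning when running on ECS with behindAlb off.
// Returns true if a warning was emitted. The env and warn parameters are
// assumptions for testability; the real helper is described as taking config only.
function warnIfBehindAlbMisconfigured(
  config: MtlsConfig,
  env: Env,
  warn: (message: string) => void,
): boolean {
  if (!isEcsEnvironment(env) || config.behindAlb) {
    return false;
  }
  // The "…" is elided in the source entry and is kept as-is here.
  warn("mTLS misconfiguration: ECS environment detected … but mtlsAuth.behindAlb is OFF");
  return true;
}
```

Called from the middleware constructor with the process environment and the application logger's warn method, a helper of this shape yields one log line per boot.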
Why It Matters
Per Computer Protocol Release Gate items 3+4: "mTLS-issued certs work on mTLS-required routes" + "Delegated child agents can be created, constrained, used, terminated, and verified inactive." The recovery-hint fix unblocks autonomous agents that follow the error chain: they hit mtls_required, read the recovery hint, exchange a CSR, and retry. Before this PR they hit mtls_required, read a recovery hint pointing at signup, and looped on signup attempts that don't apply to an mTLS failure.
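From the agent's side, following the error chain might look like the sketch below. It assumes the hint's endpoint field is a string of the form "METHOD /path", as in the flow under How It Works. The ErrorBody type, planRecovery, and the action names are hypothetical and not part of any SDK.

```typescript
// Assumed shapes: only error.code and recovery.endpoint appear in this entry.
interface ErrorBody {
  error: { code: string };
  recovery?: { endpoint: string };
}

type RecoveryPlan =
  | { action: "follow_hint_then_retry"; method: string; path: string }
  | { action: "stop"; reason: string };

const MTLS_ERROR_CODES = new Set(["mtls_required", "tenant_mtls_required"]);

// Decides the next step for an mTLS failure based on the top-level hint.
function planRecovery(body: ErrorBody): RecoveryPlan {
  if (!MTLS_ERROR_CODES.has(body.error.code)) {
    return { action: "stop", reason: "not an mTLS error" };
  }
  if (!body.recovery) {
    return { action: "stop", reason: "no recovery hint" };
  }
  // Split e.g. "POST /v1/agent/auth/cert" into method and path.
  const parts = body.recovery.endpoint.trim().split(/\s+/);
  const method = parts[0];
  const path = parts[1];
  if (parts.length !== 2 || !method || !path || !path.startsWith("/")) {
    return { action: "stop", reason: "unparseable endpoint" };
  }
  return { action: "follow_hint_then_retry", method, path };
}
```

An agent written this way is only as good as the hint it receives, which is why a hint pointing at /v1/register sent it into a signup loop.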
The startup warning is the cheaper-than-an-integration-test signal that the production ECS task is missing the behindAlb config flip. CloudWatch will surface it at every boot until Justin sets the env var.
How It Works
Before P0b:
curl --cert <pem> --key <key> https://api.brainstormrouter.com/v1/agent/delegate
→ 403 { error: { code: "mtls_required", ... } }
→ errorEnrichmentMiddleware sees no body.recovery
→ injects recovery hint via resolveRecoveryHint({type: "auth_error"})
→ endpoint: "POST /v1/register" (API-key signup, wrong direction)
After P0b:
curl --cert <pem> --key <key> https://api.brainstormrouter.com/v1/agent/delegate
→ 403 { error: { code: "mtls_required", ... }, recovery: { ... } }
→ top-level body.recovery already populated
→ errorEnrichmentMiddleware surfaces RecoveryHint headers + skips override
→ endpoint: "POST /v1/agent/auth/cert" (cert exchange, correct path)
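The branch that separates the two flows can be sketched as below. This is an illustration under stated assumptions rather than the real errorEnrichmentMiddleware: the RecoveryHint type here carries only the endpoint field shown in the flows, genericHintFor is a stand-in for resolveRecoveryHint, and the source tag exists only to make the chosen branch visible.

```typescript
// Assumed minimal shapes; the real RecoveryHint likely has more fields.
interface RecoveryHint {
  endpoint: string;
}

interface ErrorResponseBody {
  error: { code: string };
  recovery?: RecoveryHint;
}

// Stand-in for resolveRecoveryHint; only the auth_error mapping
// described in this entry is modeled.
function genericHintFor(type: string): RecoveryHint | undefined {
  return type === "auth_error" ? { endpoint: "POST /v1/register" } : undefined;
}

// Prefers a hint the route handler placed at top level; falls back to the
// generic auth_error hint only when body.recovery is absent.
function enrichErrorBody(
  body: ErrorResponseBody,
): { hint: RecoveryHint | undefined; source: "handler" | "generic" } {
  if (body.recovery) {
    return { hint: body.recovery, source: "handler" };
  }
  return { hint: genericHintFor("auth_error"), source: "generic" };
}

// Pre-fix body: no top-level recovery, so the generic hint is injected.
const beforeP0b = enrichErrorBody({ error: { code: "mtls_required" } });

// Post-fix body: the handler's hint is surfaced untouched.
const afterP0b = enrichErrorBody({
  error: { code: "mtls_required" },
  recovery: { endpoint: "POST /v1/agent/auth/cert" },
});
```

A hint nested under error.recovery would take the fallback branch in this sketch, which matches the entry's explanation of why the hint must sit at top level.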
What This Doesn't Fix
- Production config flip: setting gateway.tls.mtls.behindAlb=true on the ECS task definition is Justin's task. The code path was already in mtls-auth.ts:57-72 (reads x-amzn-mtls-clientcert when behindAlb=true); the bug was that the flag never fired in production. Until the env var lands, this PR's recovery hint just gives operators the right endpoint to call, but their cert still won't authenticate.
- ALB cert-header forwarding: once behindAlb=true is set, the ALB listener config must forward the client cert. That's a one-time verification Justin should do via curl --cert ... https://.../v1/agent/delegate, confirming that the request reaches the app with the header attached.
- Integration test for the full CSR-exchange-then-use loop: would require either a live ALB or a sophisticated mock. Deferred to Computer's next adversarial run as the integration-level proof.
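To make the dependency on behindAlb concrete, here is a minimal sketch of reading the forwarded certificate. It is not the code at mtls-auth.ts:57-72. The function name and the plain header map are hypothetical, and the URL-decoding step assumes the ALB sends the PEM URL-encoded in this header, which should be checked against the AWS documentation for the listener's mTLS mode.

```typescript
// Header names are assumed to be lowercased already, as Node's HTTP layer does.
type Headers = Record<string, string | undefined>;

// Returns the client certificate PEM forwarded by the ALB, or null when the
// flag is off, the header is missing, or the value cannot be decoded.
function clientCertFromAlbHeader(headers: Headers, behindAlb: boolean): string | null {
  if (!behindAlb) {
    // With the flag off the header is never consulted, so a cert presented
    // to the ALB is invisible to the app.
    return null;
  }
  const raw = headers["x-amzn-mtls-clientcert"];
  if (!raw) {
    return null;
  }
  try {
    // Assumption: the header value is the URL-encoded PEM.
    return decodeURIComponent(raw);
  } catch {
    // Malformed percent-encoding: treat as no certificate.
    return null;
  }
}
```

In this sketch both failure modes from the first two bullets return null: the flag being off, and the listener not forwarding the header.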
Lockstep
- TypeScript SDK: no change (response shape is additive: adds a top-level recovery field, doesn't break existing readers)
- Python SDK: no change
- MCP tools: no change
- API surface: 403/401 response shape now includes a recovery field (additive)
- OpenAPI: no shape changes (could document the recovery field in a docs-only follow-up)
- Ship log: this entry
Closes Release Gate
- Item 3 (mTLS-issued certs work on mTLS-required routes): closed at the code level for recovery-flow correctness. Production-config-level closure depends on Justin's ECS env var change.
- Item 4 (Delegated child agents lifecycle): partial. Once the config flip lands and Computer reruns the lifecycle tests, this should fully close.