Skip to main content

ADR-0051: Tenant-scoped session selection for RLS-enforced access

Status: Accepted Date: 2026-06-13 Supersedes: none Superseded by: none

Context

Once RLS was actually enforced (ADR-0050), a 9-agent audit (§10 of the internal gaps register) classified all 119 get_session() call-sites: 14 P0 writes that the RLS WITH CHECK would reject, 35 P1 reads that returned the demo tenant's rows for a non-demo request, 16 P2 shared-reference reads that work via demo-seeding, 4 ambiguous, and 42 genuinely safe. get_session() pins the GUC to the demo tenant — correct for the single-tenant demo, wrong for any real second tenant.

The flip also exposed two latent bugs invisible to the prior (mocked-DB) suite:

  1. DiagnosticService.record_stage cast a VARCHAR case_id to uuid; every real INSERT raised invalid input syntax for type uuid, the guard-and-swallow ate it, and pipeline_stages had 0 rows in production.
  2. agent_progress / tool_audit / automation_tier writes hardcoded DEMO_TENANT_ID (or relied on a demo server_default), collapsing every tenant's rows onto the demo tenant.

Decision

Select the session factory by where the tenant comes from, not by habit:

  • User-asserted tenant (officer JWT, activity input.tenant_id): get_tenant_session(tenant) — RLS WITH CHECK gives defense-in-depth on writes; reads are correctly scoped. Thread tenant_id through the call chain when needed.
  • System-derived tenant (the row's tenant is its case's tenant): derive it from the authoritative row under get_admin_session()INSERT … SELECT c.case_id, …, c.tenant_id FROM cases c WHERE c.case_id = :id when the case must exist, or COALESCE((SELECT tenant_id FROM cases WHERE case_id = :case_id_lookup), demo) for case-optional events. Use a separate bind for the subquery — reusing one :case_id across an INSERT column and a subquery predicate makes asyncpg raise AmbiguousParameterError (text vs character varying).
  • System read by a globally-unique key where the caller pre-verifies ownership (workflow_id, capsule_id, case_id), or a genuinely shared/global read (regulatory KB, cross-case cost aggregate): get_admin_session().
  • Genuinely safe (non-RLS tables like tenants/reasoning_templates, or a policy that already admits tenant_id IS NULL globals like cost_pricing): leave get_session() — verified, not assumed.

case_id columns are VARCHAR (values like CASE-AWDC-…), never UUID; cases.id is the UUID PK. Never cast case_id to uuid.

Consequences

  • All P0 writes and P1 reads are remediated across ~30 modules in three batches; the P2 tail is resolved (most verified safe, the rest moved to admin or threaded). The remediation landed on fix/code-review-2026-06-11 (PR #21).
  • Testing rule (name-the-oracle): any SQL path a mock would otherwise stand in for ships a marker-gated real-DB test that seeds a non-demo tenant and asserts persistence + correct tenant attribution + RLS read isolation (test_diagnostic_rls_persistence.py, test_agent_progress_rls_persistence.py). A mocked-DB test plus a guard-and-swallow except is an information black hole.
  • Reports now surface governance_decisions / evoi_decisions (the persistence fix unblocked them), and restriction decisions carry evidence_refs provenance.