ADR-0088: Monitoring Framework Record, calibration-decision records, tenant-facing effectiveness runs and data-quality register

Date: 2026-07-03
Status: Proposed
Deciders: Adrian (Soft4U), Claude Fable 5 (AMLA remediation design session)

Decision context:

Latency: framework-record JSON assembly is reads over config tables + in-process registries (estimated tens of ms); PDF render via WeasyPrint is in the 1–2 s class already accepted for the decision memo (PR #169) and the AI-Act record — both officer-initiated, never on the case hot path. An effectiveness run replays 7 packaged golden snapshots through the pure risk-matrix/gate computation (no retrieval), estimated sub-second; not measured because every surface here is an on-demand officer/admin action, not a per-case cost.
Dependency surface: zero new packages. Reuses WeasyPrint + Jinja2 (report_service._build_env), the golden YAML fixtures (backend/tests/golden/, 7 files), country_capability, and the existing risk_config_audit table (migration 038). One new Alembic migration (next sequential revision at implementation time): risk_config_audit.rationale column + effectiveness_runs table.
Debuggability: every record is derived from named live sources, so a wrong claim traces to one producer (check catalog → MonitoringCheckType + implementation status; cadences → the §2.2 canonical tables; gaps → country_capability_gap). Effectiveness runs persist per-golden-case expected-vs-actual JSONB, so a failed run shows exactly which oracle diverged and by what. A source that cannot be read renders as an honest gap entry, never a fabricated "satisfied".
Reversibility: additive services + endpoints (delete the files, done); the migration adds one nullable column and one new table (hours to revert). The mandatory-rationale change on risk-config draft/activate is request validation — one commit to revert. No workflow or state-machine changes.
Blast radius: additive except three touch points — the risk-config create_draft/activate API gains a required rationale field (admin UI updated in the same wave), case_pack_service.py gains an ongoing-monitoring appendix section, and ai_act_conformity_service._build_known_gaps flips one entry to satisfied-by-reference. Nothing in the Temporal workflow or the decision path changes.
Alternative considered: a static, authored Monitoring Framework PDF — rejected because it drifts from the deployed configuration; the data-driven ai_act_conformity_service mold exists precisely so the record "can never claim a model/prompt the system does not actually run" (its own docstring), and that property is the whole point of a supervisor-challenge artifact.

Context

The AMLA draft ongoing-monitoring guidelines (consultation 2026, final expected Q4 2026; NCA comply-or-explain declarations Q1 2027; AMLR (EU) 2024/1624 applies 2027-07-10) require an obliged entity to demonstrate its monitoring framework to a supervisor: what is monitored, at what cadence, in which forms (pre- / real-time / post-event), with which documented limitations and mitigations, and on what risk-based calibration basis. The 2026-07-03 gap analysis (docs/research/2026-07-03-amla-ongoing-monitoring-gap-analysis.md, §3.1 rows "Calibration/data quality" and "Supervisor pack") verified four gaps against the code:

1. No system-level Monitoring Framework Record exists. All monitoring documentation is per-case (case pack, ADR-0069) or per-event (MonitoringEvent rows, ADR-0066). Nothing answers the supervisor's framework-level question. The nearest mold is ai_act_conformity_service.py (a data-driven, honestly-statused record derived live from running configuration), and that service itself lists post-market monitoring as a KnownGap ("Post-market monitoring (Art. 72)", ai_act_conformity_service.py:394-399) and states Art. 15 continuous accuracy monitoring "is not yet automated" (:349).
2. The case pack contains zero post-approval monitoring content (case_pack_service.py, grep-verified: no reference to monitoring, ScreeningResult post-approval trail, or alert dispositions). A regulator asking "prove this relationship was monitored after onboarding" gets nothing from the ADR-0069 export.
3. Risk-config changes capture no rationale. risk_config_service.py writes only auto-generated changes_summary strings: "Draft v{n} created" (:596), "Draft config data updated" (:658), "Config v{n} activated" (:727). Worse, _create_default_active_config (:320) self-heals a missing config by auto-activating factory defaults with no tenant assessment record, which is exactly the "default settings without prior risk-based assessment" the AMLA guidelines flag as poor practice.
4. Effectiveness testing is vendor-only. The golden-case harness (backend/tests/test_golden_cases.py + 7 YAML oracles in backend/tests/golden/, incl. the OB Holding CRITICAL/90 lineage) runs in Soft4U's dev CI. A tenant has no way to demonstrate that their active configuration still produces the known-correct outcomes, as AMLA's "defined, tested, calibrated" expectation for automated tools requires.

Additionally, per-source data-quality signals already exist scattered across the codebase — country_capability_gap findings (country_capability.py:331), data_quality_warnings (app/workflows/registry_activity.py, app/agents/osint_agent.py), evidence_completeness (confidence_engine.py:132) — but there is no aggregated register with deficiency ownership, which the guidelines expect ("understand what data they lack and how the framework still detects risk").

This ADR is wave W6 of the AMLA remediation architecture (docs/superpowers/specs/2026-07-03-amla-remediation-architecture.md, §2.4/§2.5/§3). W6 runs last because it documents and reads what W0–W5 build. Settled questions §5.3 and §5.4 of that document are binding here and are not re-litigated.

Decision

Build the supervisor-challenge and calibration surfaces as data-driven records in the ai_act_conformity_service mold — derived live from running configuration wherever a live source exists, with honest satisfied/partial/gap statusing where it does not.

1. Monitoring Framework Record — backend/app/services/monitoring_framework_service.py. Assembles a system-level record containing: (a) the check catalog derived from MonitoringCheckType (packages/trustrelay-models/src/trustrelay_models/monitoring.py:20, including the W1/W5 additions document_expiry/profile_deviation) with honest per-check maturity — production / stub / vendor-gated — so the fail-closed company-status stub is reported as a stub, never as coverage; (b) cadences and ceilings from the §2.2 canonical tables (REVIEW_CADENCE_MONTHS_BY_TIER, RESCREEN_CADENCE_DAYS_BY_TIER) plus the AMLR Art. 26(2) ceilings (AMLR_MAX_CADENCE, monitoring_schedule_service.py:26) and the tenant's MonitoringConfig overrides; (c) an authored coverage statement (products/services/channels) plus pre-/real-time/post-event applicability per check; (d) documented limitations + mitigations, sourced from country_capability extended with a monitoring-form dimension (which monitoring forms apply per country, and the mitigation when one does not); (e) calibration-decision status (latest rationale-bearing risk_config_audit rows) and defaults-review status per item 2. Exposed as GET /api/monitoring/framework-record?format=json|pdf (WeasyPrint via report_service._build_env, PR #169 pattern; RBAC-gated officer endpoint). On completion, ai_act_conformity_service._build_known_gaps updates the "Post-market monitoring (Art. 72)" entry to satisfied-by-reference to this record — status changes only because the referenced mechanism now exists.
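The per-check maturity rule in item 1(a) can be illustrated with a minimal sketch. It is not the service's actual code: the Maturity enum, CatalogEntry, build_check_catalog, the "company_status" check name, and the rule that only production checks count as coverage are all assumptions made for illustration. Only document_expiry and the three maturity labels come from this ADR.

```python
from dataclasses import dataclass
from enum import Enum


class Maturity(str, Enum):
    """The three maturity labels named in item 1(a)."""
    PRODUCTION = "production"
    STUB = "stub"
    VENDOR_GATED = "vendor-gated"


@dataclass(frozen=True)
class CatalogEntry:
    check_type: str
    maturity: Maturity
    counts_as_coverage: bool


def build_check_catalog(check_types, implementation_status):
    """Derive one catalog entry per known check type (hypothetical helper).

    A check with no recorded implementation status is reported as a stub,
    so a missing entry can never read as coverage. Assumption: only
    production checks count as coverage.
    """
    catalog = []
    for check_type in check_types:
        maturity = implementation_status.get(check_type, Maturity.STUB)
        catalog.append(
            CatalogEntry(
                check_type=check_type,
                maturity=maturity,
                counts_as_coverage=maturity is Maturity.PRODUCTION,
            )
        )
    return catalog


catalog = build_check_catalog(
    ["document_expiry", "company_status"],
    {"document_expiry": Maturity.PRODUCTION},
)
```

In this sketch the unlisted company_status check falls back to stub, which is the fail-closed direction the record needs: the default is the least reassuring status.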

2. Calibration-decision records. Create the next sequential Alembic revision adding risk_config_audit.rationale (text, nullable — historical rows have none and are not backfilled with fiction). create_draft and activate_version (risk_config_service.py) gain a mandatory rationale parameter — free-text, ≥20 characters on activate — persisted alongside the existing auto-generated changes_summary. The factory-default self-heal path (:320) continues to auto-activate (it exists so first-recalc does not 500), but its audit row is labelled as a factory-default auto-activation and the framework record surfaces "factory defaults in use, no tenant defaults review on file" honestly. POST /api/risk-config/defaults-review records the tenant's own assessment of the EBA-derived factory defaults against their business-wide risk assessment (reviewer, date, rationale, per-area acceptance) as an artifact; per architecture §5.3 this is an honest-surfacing artifact, NOT a hard activation block. Permission: CONFIG_CALIBRATE = the existing CONFIG_WRITE (ADR-0074 registry; no new enum member).
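As a rough illustration of the request validation in item 2, the sketch below enforces a mandatory rationale on both actions and the 20-character minimum on activate only. The names validate_rationale, RationaleError and MIN_ACTIVATE_RATIONALE_CHARS are hypothetical. Trimming whitespace before counting is an assumption this ADR does not settle.

```python
MIN_ACTIVATE_RATIONALE_CHARS = 20  # from item 2: at least 20 characters on activate


class RationaleError(ValueError):
    """Raised when a risk-config change arrives without a usable rationale."""


def validate_rationale(rationale, *, action):
    """Return the trimmed rationale or raise RationaleError (hypothetical helper).

    action is "create_draft" or "activate". Both require a non-empty
    rationale; only activation enforces the minimum length.
    Assumption: surrounding whitespace does not count towards the length.
    """
    text = (rationale or "").strip()
    if not text:
        raise RationaleError(f"{action}: rationale is required")
    if action == "activate" and len(text) < MIN_ACTIVATE_RATIONALE_CHARS:
        raise RationaleError(
            f"activate: rationale must be at least "
            f"{MIN_ACTIVATE_RATIONALE_CHARS} characters, got {len(text)}"
        )
    return text
```

A check like this only proves that text was supplied. As the Negative consequences note, it is a length check and says nothing about the quality of the reasoning.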

3. Tenant-facing effectiveness runs — backend/app/services/effectiveness_run_service.py. Package the golden snapshots (the same 7 YAML oracles, shipped as app data, not reached into tests/) and replay them at the configuration layer against the tenant's active or draft config: recompute the EBA risk matrix and decision gates from the snapshot's frozen investigation inputs, compare expected tier/score/gate outcomes. Persist each run to the new effectiveness_runs table (id, tenant_id set explicitly — RLS WITH CHECK, the PR #177 lesson — config_version, ran_by, ran_at, per-golden-case expected-vs-actual results JSONB, passed bool). Endpoints: POST /api/risk-config/effectiveness-runs + GET /api/risk-config/effectiveness-runs. Every rendered surface (API response, framework record, PDF) carries the verbatim label "config-layer replay, not end-to-end retrieval" — per architecture §5.4, an end-to-end replay is out of scope and the run must never be presented as one. A failed run is a surfaced WARNING in the framework record; it does not block activation (same §5.3 logic — but it is visible to the supervisor and the MLRO, which is the enforcement mechanism the guidelines actually describe).
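The config-layer replay in item 3 can be sketched as follows. The score and tier functions are toy stand-ins for the real risk-matrix and gate computation, and the config shape, snapshot shape and run_effectiveness name are assumptions. Only the verbatim label and the expected-vs-actual-plus-passed result structure come from this ADR.

```python
CONFIG_LAYER_LABEL = "config-layer replay, not end-to-end retrieval"


def score(config, inputs):
    """Toy stand-in for the risk matrix: capped sum of configured flag weights."""
    weights = config["weights"]
    return min(100, sum(weights.get(flag, 0) for flag in inputs["flags"]))


def tier(config, value):
    """Toy stand-in for the decision gate: a single CRITICAL threshold."""
    return "CRITICAL" if value >= config["critical_threshold"] else "STANDARD"


def run_effectiveness(config, snapshots):
    """Replay frozen snapshot inputs against a config; no retrieval happens here."""
    results = []
    for snap in snapshots:
        actual_score = score(config, snap["inputs"])
        actual = {"tier": tier(config, actual_score), "score": actual_score}
        results.append({
            "case_id": snap["case_id"],
            "expected": snap["expected"],
            "actual": actual,
            "matched": actual == snap["expected"],
        })
    return {
        "label": CONFIG_LAYER_LABEL,  # carried on every rendered surface
        "results": results,
        # An empty snapshot set must not read as a pass.
        "passed": bool(results) and all(r["matched"] for r in results),
    }


# Hypothetical snapshots and configs, for illustration only.
SNAPSHOTS = [
    {"case_id": "benign", "inputs": {"flags": []},
     "expected": {"tier": "STANDARD", "score": 0}},
    {"case_id": "criminal-floor", "inputs": {"flags": ["criminal_investigation"]},
     "expected": {"tier": "CRITICAL", "score": 90}},
]
baseline = {"weights": {"criminal_investigation": 90}, "critical_threshold": 90}
weakened = {"weights": {"criminal_investigation": 60}, "critical_threshold": 90}
```

Running the same snapshots against the weakened config yields passed set to False, and the per-case entry shows which oracle diverged and by how much. That per-case detail is what would be persisted as the JSONB results column.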

4. Data-quality register — backend/app/services/data_quality_register_service.py. Aggregates the existing per-source signals — country_capability_gap findings, data_quality_warnings, evidence_completeness breakdowns — into a per-source completeness/attribution register with deficiency-owner assignment from tenant config. Read-only over existing data (no new detection logic); exposed as GET /api/monitoring/data-quality and summarised in the framework record. Where a source emits no quality signal, the register says "no quality signal instrumented for this source" — a gap entry, not an implied clean.
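A minimal sketch of the aggregation rule in item 4, assuming the per-source signals have already been collected into a dict. The function name build_register, the status values and the "unassigned" owner fallback are hypothetical. The gap wording is the one quoted in item 4.

```python
NO_SIGNAL = "no quality signal instrumented for this source"


def build_register(sources, signals, owners):
    """Aggregate existing per-source findings into register entries.

    signals maps a source name to a list of findings. A source missing from
    signals is treated as uninstrumented (a gap), which is different from an
    instrumented source that reported an empty list.
    """
    register = []
    for source in sources:
        findings = signals.get(source)
        if findings is None:
            status, listed = "gap", [NO_SIGNAL]
        elif findings:
            status, listed = "deficient", list(findings)
        else:
            status, listed = "clean", []
        register.append({
            "source": source,
            "status": status,
            "findings": listed,
            # Assumption: owner assignment is a plain lookup in tenant config.
            "owner": owners.get(source, "unassigned"),
        })
    return register


register = build_register(
    sources=["registry", "osint", "sanctions"],
    signals={"registry": ["missing UBO data for country X"], "osint": []},
    owners={"registry": "data-steward"},
)
```

The distinction between a missing key and an empty list is what keeps an uninstrumented source from being rendered as an implied clean.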

5. Case-pack ongoing-monitoring appendix. case_pack_service.py gains an appendix rendering the post-approval trail for the case's relationship: ScreeningResult history (ADR-0063), MonitoringEvent rows (ADR-0066), and W2 alert dispositions with closure rationales. Fail-closed inclusion per the ADR-0069/PR #166 pattern: if the trail cannot be read, the pack says so; it never omits the section silently. SAR-adjacent disposition content in the pack remains inside the ADR-0071 tipping-off boundary — the pack is a regulator artifact and is never customer-visible, but any content shared onward to the customer goes through the customer_contact_gate predicate as today (AMLD Art. 39).
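The fail-closed inclusion rule in item 5 can be illustrated with a short sketch. The loader-callable interface, the build_monitoring_appendix name and the status strings are assumptions and do not reflect the actual case_pack_service.py API.

```python
def build_monitoring_appendix(loaders):
    """Build the appendix from (title, loader) pairs.

    Each loader is a zero-argument callable returning an iterable of rows.
    A loader that raises still produces a section, so the pack states that
    the trail was unreadable instead of silently omitting it.
    """
    sections = []
    for title, load in loaders:
        try:
            rows = list(load())
        except Exception as exc:  # fail closed: report the failure, keep the section
            sections.append({
                "title": title,
                "status": "unreadable",
                "rows": [],
                "note": (
                    f"Trail could not be read ({type(exc).__name__}); "
                    "absence of rows is not evidence of no activity."
                ),
            })
            continue
        sections.append({
            "title": title,
            "status": "ok",
            "rows": rows,
            "note": "" if rows else "Trail read successfully; no rows recorded.",
        })
    return {"title": "Ongoing monitoring appendix", "sections": sections}
```

In this sketch an empty but readable trail and an unreadable trail produce different notes. That keeps "nothing happened" distinguishable from "we could not check".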

Verification follows architecture §4: fail-closed invariants in the no-false-reassurance oracle suite (a stubbed check must appear as a stub in the record; a tenant on unreviewed factory defaults must surface as such; an effectiveness run against a config that flips the OB Holding oracle below CRITICAL must report passed=false), testcontainers throughout, ruff F + tsc zero.

Consequences

Positive

The comply-or-explain artifact an NCA will actually ask a tenant for exists, is exportable, and — because it is derived live — cannot claim a check, cadence, or coverage the deployed system does not have. This converts the repo's existing honesty machinery (ADR-0067/0068) into a supervisor-facing asset.
Every risk-config change now carries a human rationale on the immutable audit spine, and the "default settings without prior risk-based assessment" poor practice is either remediated (defaults-review on file) or visibly outstanding — never silent.
Tenants can demonstrate "defined, tested, calibrated" with a persisted, reproducible run against their own configuration, closing the gap between the vendor's CI and the obliged entity's accountability.
The AI-Act conformity record's longest-standing honest gap (post-market monitoring) closes by reference to a real mechanism rather than by editing copy.

Negative

Mandatory rationale is a length check, not a quality check: it adds friction to every config change and will attract boilerplate ("adjusted per review meeting…"). Genuine calibration discipline still depends on the tenant's governance; we are creating the record, not the culture.
A config-layer effectiveness run can pass while the retrieval layer regresses (a provider silently degrades, an escalator term stops matching upstream text). The verbatim label mitigates over-reading, but a supervisor or buyer may still treat a green run as broader assurance than it is — this is an accepted, documented residual risk of settling §5.4.
Because defaults-review is non-blocking (§5.3), a tenant can run on unreviewed factory defaults indefinitely; the only pressure is surfacing. That is a deliberate trade against bricking new tenants, but it means the poor-practice condition is detectable, not prevented.
The framework record's coverage statement and per-check applicability contain curated authored copy (like _COMPONENT_DESCRIPTORS in the AI-Act record) that must be maintained when checks change — a staleness risk the data-driven parts do not have.

Neutral

The golden YAML snapshots become dual-use (dev CI + shipped app data); they gain a compatibility obligation but no behavioural change.
country_capability grows a monitoring-form dimension; existing signal×country consumers are unaffected.
Historical risk_config_audit rows keep rationale = NULL, rendered as "recorded before rationale capture (2026-07)" — honest, not backfilled.

Amendment (2026-07-03, post-review reconciliation). Decision item 3 above names "the same 7 YAML oracles" as the config-layer replay input. The implementation deliberately deviates from that: the 7 tests/golden/*.yaml files are end-to-end workflow-acceptance specs carrying live-retrieval assertions, not frozen investigation-input dicts, so they are unsuitable as reproducible config-layer replay fixtures. W6 instead ships two purpose-built synthetic snapshots (a benign case and a criminal-investigation-floor case) as app data. These snapshots assert on actual_score floors (criminal ⇒ ≥ 90) rather than on the passed label, because adversarial threshold-ordering can legitimately flip the label. See architecture §5 settled-question 7.

Amendment (2026-07-03, post-implementation review — overclaim correction). The Positive bullet above ("the AI-Act conformity record's longest-standing honest gap ... closes by reference") and the Blast-radius bullet ("flips one entry to satisfied-by-reference") overstate what W6 actually ships. The original KnownGap text for "Post-market monitoring (Art. 72)" specifically named automated accuracy back-testing against realised case outcomes as the missing piece; what ships is a system-level Monitoring Framework Record (genuinely satisfied-by-reference) plus a config-layer effectiveness-run harness that replays 2 synthetic golden snapshots (explicitly self-labelled everywhere else in this wave as CONFIG_LAYER_LABEL — "config-layer replay, not end-to-end retrieval"). Replaying synthetic fixtures is not accuracy back-testing against real, realised outcomes. Flipping the whole gap to satisfied would violate the Calibration Review Checklist's presence-≠-evidence rule. _build_known_gaps therefore now emits two Art. 72 sub-items instead of one: "Post-market monitoring — framework & calibration (Art. 72)" (satisfied, credits the framework record) and "Post-market monitoring — accuracy back-testing (Art. 72)" (partial, names the still-missing real-outcome back-testing capability). See app/services/ai_act_conformity_service.py::_build_known_gaps and backend/tests/test_ai_act_conformity.py::TestPostMarketMonitoringGapReference.

Alternatives Considered

Alternative 1: Static authored Monitoring Framework document (PDF/Markdown maintained by hand)

Author the framework record once as a document and update it manually per release.

Why rejected: it drifts from the deployed configuration — the precise defect class the ai_act_conformity_service mold was built to prevent (its inventory is generated from live model-tier/prompt config so it "can never claim a model/prompt the system does not actually run"). A hand-maintained record would have claimed the company-status check as coverage while the code held a stub (gap analysis §3.1), which is a false supervisor representation, not a formatting problem.

Alternative 2: Hard-block risk-config activation until a defaults review is on file

Make POST /api/risk-config/defaults-review a prerequisite for activate_version and disable the factory-default self-heal.

Why rejected (settled, architecture §5.3): it bricks new tenants — _create_default_active_config (risk_config_service.py:320) exists because a tenant with zero config rows fails risk recalculation and the config UI for every case; a hard block reintroduces that failure on first contact. Honest surfacing in the framework record plus supervisor visibility achieves the guideline's intent (the practice must be assessed and demonstrable) without a denial-of-service on onboarding.

Alternative 3: Full end-to-end effectiveness replay (re-run OSINT retrieval per golden case against live providers)

Replay each golden case through the complete 12-step pipeline including live retrieval, per tenant, per run.

Why rejected (settled, architecture §5.4): cost and vendor load (Tavily/BrightData/registry rate limits multiplied by tenants × runs), and nondeterministic external sources make pass/fail flaky — a red run would more often mean "a website changed" than "your calibration regressed," destroying the signal the artifact exists to provide. The config layer is where tenant calibration decisions actually take effect; isolating it makes the run a valid test of exactly those decisions, honestly labelled as nothing more.
