
ADR-0084: Monitoring-Alert Disposition Lifecycle

Date: 2026-07-03
Status: Accepted (implemented 2026-07-03, plan 2026-07-03-amla-w2-alert-disposition)
Deciders: Adrian (Soft4U), Claude Fable 5 (AMLA remediation design session)

Revision note (implementation, 2026-07-03): implemented as backend/app/services/monitoring_alert_service.py, class MonitoringAlertService — NOT alert_service.py as originally drafted below (§3). app/services/alert_service.py pre-exists as the ADR-0025 cross-case pattern-alert service and is untouched by this ADR. See the architecture-doc amendment (2026-07-03) in the W2 plan, Task 1.

Decision context:

  • Latency: disposition transitions are single-row UPDATE + one audit_events INSERT inside one tenant-scoped session — same cost profile as a suppression-rule write (ADR-0045). Alert-queue listing adds indexed filters (status, assigned_to_user_id, due_at) to a table that is tiny until W1 starts writing it. Not measured because the table currently has zero rows in every environment (no writers exist).
  • Dependency surface: no new packages. Two new enums in trustrelay-models, one Alembic migration widening monitoring_alerts, one new service (alert_service.py), four disposition endpoints, one new Permission member. SAR linkage reuses the existing SARService.raise_sar (backend/app/services/sar_service.py:260).
  • Debuggability: every transition lands as an immutable audit_events row (ADR-0064) carrying actor, from/to status, rationale, and evidence refs; closure reason is a typed enum, so MI counts (time-to-close, backlog aging, closure-reason mix) are a GROUP BY, not log archaeology.
  • Reversibility: additive migration on an orphaned table; branch revert restores today's read-only surface. Nothing in the Temporal workflow changes. The one hard-to-reverse element is the audit trail itself, which is append-only by design.
  • Blast radius: the alerts UI (which today polls a permanently empty table) gains disposition actions; W1's trigger_router_service becomes the writer; W3 (relationship lifecycle) and W6 (case-pack monitoring appendix) consume dispositions. Onboarding decision paths are untouched.
  • Alternative considered: dropping the dead monitoring_alerts table and opening a case per detection — rejected below (a full 12-step case per sanctions-list ping is response-inflation, and AMLA Guideline 2 asks for triage before escalation, not instead of it).
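
The Debuggability point above says MI counts reduce to a GROUP BY once the closure reason is a typed column. A minimal sketch of that idea follows, using an in-memory SQLite table as a stand-in for the widened monitoring_alerts table (the real table lives in PostgreSQL). The helper name closure_reason_mix and the sample rows are illustrative assumptions, not part of the codebase.

```python
import sqlite3


def closure_reason_mix(conn: sqlite3.Connection, tenant_id: str) -> dict[str, int]:
    """Count closed alerts per closure_reason for a single tenant."""
    rows = conn.execute(
        """
        SELECT closure_reason, COUNT(*)
        FROM monitoring_alerts
        WHERE tenant_id = ? AND status = 'closed'
        GROUP BY closure_reason
        """,
        (tenant_id,),
    ).fetchall()
    return {reason: count for reason, count in rows}


# Stand-in table holding only the columns this query needs (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monitoring_alerts (tenant_id TEXT, status TEXT, closure_reason TEXT)"
)
conn.executemany(
    "INSERT INTO monitoring_alerts VALUES (?, ?, ?)",
    [
        ("t1", "closed", "false_positive"),
        ("t1", "closed", "false_positive"),
        ("t1", "closed", "escalated_sar"),
        ("t1", "new", None),           # open alerts are excluded from the mix
        ("t2", "closed", "duplicate"),  # other tenants are excluded as well
    ],
)
mix = closure_reason_mix(conn, "t1")
```

The same pattern would extend to time-to-close and backlog aging once closed_at and due_at are populated; those queries are left out here because their bucket boundaries are not fixed in this ADR.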

Context

AMLA's draft ongoing-monitoring guidelines (level-3 elaboration of AMLR (EU) 2024/1624 Art. 26; consultation closes 2026-09-03, comply-or-explain via NCAs from Q1 2027) require, per Guideline 2, a controlled process to assess, prioritise, escalate, close and evidence monitoring outputs. AMLA's framing at the 2026-07-02 hearing was blunt: what happens after an alert is generated is the effectiveness test — alert volume proves nothing; disposition discipline does.

The 2026-07-03 gap audit (docs/research/2026-07-03-amla-ongoing-monitoring-gap-analysis.md, §3.1 "Alert lifecycle", §3.2.2/§3.2.3) verified the current state with file:line evidence:

  • monitoring_alerts is an orphaned table. The ORM model exists (backend/app/db/models.py:3102-3124: MonitoringAlert with trigger_type, risk-score deltas, status defaulting to 'new' at :3121, and bare acknowledged_at/acknowledged_by columns) and has live readers (monitoring_service.py:286-290 lists by status, :318-319 counts status == "new" for the dashboard badge), but zero writers anywhere in the codebase. The alerts UI polls a permanently empty table.
  • The only disposition primitive in the monitoring domain is a bare boolean. MonitoringService.acknowledge_event (backend/app/services/monitoring_service.py:83-108) sets acknowledged = True with no rationale, no evidence, and — despite the ADR-0064 infrastructure being available — no audit_events row. The endpoint (backend/app/api/monitoring.py:157-178) takes no body at all. There is no priority, no SLA, no assignment, no RBAC permission, and no path from a monitoring finding to a SAR: a CRITICAL sanctions hit on a monitored UBO becomes a list row an officer can silently tick away.
  • The repo already contains a production-verified disposition shape — the ADR-0045 sanctions false-positive suppression workflow (backend/app/api/sanctions_suppression.py:65-239): mandatory rationale ≥10 chars + evidence_refs on create (:65-69, per EU AI Act Art. 13 traceability), 12-month expiry forcing officer re-review (sanctions_suppression_service.py:43 DEFAULT_RULE_LIFETIME, :47 RENEWAL_WINDOW), revoke-with-reason (:72-73, :189-222), and fire-count/housekeeping telemetry (:225-239). The gap audit's Tier 1.3 recommendation is to generalise exactly this shape onto monitoring_alerts.

This ADR is Wave 2 (amla-w2-alert-disposition) of the AMLA remediation architecture (docs/superpowers/specs/2026-07-03-amla-remediation-architecture.md, §3 W2). It depends on W1 (ADR-0083), which turns the trigger taxonomy into detection code and makes trigger_router_service.py the first writer of monitoring_alerts. Without W2, W1's alerts would reproduce the exact defect AMLA names: detections dying in a list. The design principle binding both (architecture §1.2): Detection → Response — every detection routes somewhere typed, and every response leaves an immutable record.

Decision

Generalise the ADR-0045 suppression disposition shape onto monitoring_alerts, giving every monitoring alert a typed lifecycle with mandatory evidenced closure, assignment, SLA/aging, an RBAC-gated permission, a SAR link, and MI counts. Vocabulary, columns, service, and endpoints are exactly those fixed in the architecture document §2 — verbatim, no synonyms.

1. Lifecycle enums (architecture §2.1), added to packages/trustrelay-models/src/trustrelay_models/monitoring.py (beside the existing MonitoringCheckType, :20):

from enum import Enum


class AlertStatus(str, Enum):
    new = "new"  # matches the existing server_default 'new' (db/models.py:3121)
    triaged = "triaged"
    escalated = "escalated"
    closed = "closed"


class AlertClosureReason(str, Enum):
    resolved = "resolved"
    false_positive = "false_positive"
    escalated_sar = "escalated_sar"
    review_opened = "review_opened"
    duplicate = "duplicate"

Legal transitions: new → triaged → escalated → closed, plus new → closed and triaged → closed (not every alert warrants escalation). closed is terminal — a wrongly closed alert is not reopened in place; the underlying condition re-fires a new alert (same append-only philosophy as ADR-0045 revoke: the record of the wrong decision stays). No transition may skip the mandatory closure fields.
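
The legal transitions above can be sketched as a lookup table with a guard function. This is an illustration under stated assumptions, not the service's actual code: LEGAL_TRANSITIONS, IllegalTransition and check_transition are hypothetical names, and because the text does not list new → escalated, the sketch rejects it (an interpretation of the rule as written).

```python
from enum import Enum


class AlertStatus(str, Enum):
    new = "new"
    triaged = "triaged"
    escalated = "escalated"
    closed = "closed"


# Edges exactly as listed in the text; closed has no outgoing edges (terminal).
LEGAL_TRANSITIONS: dict[AlertStatus, frozenset[AlertStatus]] = {
    AlertStatus.new: frozenset({AlertStatus.triaged, AlertStatus.closed}),
    AlertStatus.triaged: frozenset({AlertStatus.escalated, AlertStatus.closed}),
    AlertStatus.escalated: frozenset({AlertStatus.closed}),
    AlertStatus.closed: frozenset(),
}


class IllegalTransition(ValueError):
    """Hypothetical error type for a transition the lifecycle does not allow."""


def check_transition(current: AlertStatus, target: AlertStatus) -> None:
    """Raise IllegalTransition unless current -> target is a listed edge."""
    if target not in LEGAL_TRANSITIONS[current]:
        raise IllegalTransition(
            f"{current.value} -> {target.value} is not a legal transition"
        )
```

Keeping the edges in one table means the "closed is terminal" rule is a data property (an empty set) rather than a special case scattered through the service methods.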

2. Schema (architecture §2.3): the next sequential Alembic revision (numbered at implementation time — never hardcoded in the plan) widens monitoring_alerts with the W2 lifecycle columns: response_required (TriggerResponse, ADR-0083), priority int, assigned_to_user_id UUID, due_at, closed_at, closed_by, closure_reason, closure_rationale text, evidence_refs JSONB, sar_id, review_case_id, source_event_id (back-ref to the originating monitoring_events row). trigger_type already exists (:3116). The migration also enables RLS on monitoring_alerts in line with ADR-0023/0050; every INSERT sets tenant_id explicitly (the PR #177 confirm-website lesson — the existing server_default at :3110-3113 pins the default tenant and violates WITH CHECK for every other tenant, so writers must never rely on it). Any raw-SQL JSONB parameter uses CAST(:param AS jsonb), never ::jsonb (asyncpg).
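
The two write rules in the schema paragraph (always pass tenant_id explicitly, and cast JSONB binds with CAST rather than the :: operator) can be illustrated with a small statement builder. This is a sketch only: build_evidence_update is a hypothetical helper, the statement uses the named-parameter style the paragraph itself shows, and nothing here executes against a database.

```python
import json


def build_evidence_update(
    alert_id: str, tenant_id: str, evidence_refs: list[str]
) -> tuple[str, dict[str, str]]:
    """Return a named-parameter UPDATE and its bind values.

    The JSONB bind is wrapped in CAST(:evidence_refs AS jsonb); a trailing
    '::jsonb' on the bind name is avoided, as the ADR requires.
    """
    sql = (
        "UPDATE monitoring_alerts "
        "SET evidence_refs = CAST(:evidence_refs AS jsonb) "
        "WHERE id = :alert_id AND tenant_id = :tenant_id"
    )
    params = {
        # JSON is serialised by the caller so the driver receives plain text.
        "evidence_refs": json.dumps(evidence_refs),
        "alert_id": alert_id,
        # tenant_id is always supplied explicitly, never left to a default.
        "tenant_id": tenant_id,
    }
    return sql, params


sql, params = build_evidence_update("a-1", "tenant-7", ["doc:123", "screenshot:9"])
```

In the real service the pair would presumably be handed to the session's execute call; that call is omitted because its exact form is not given in this ADR.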

3. Disposition service — new backend/app/services/alert_service.py (architecture §2.4):

  • assign(alert_id, user_id) — assignment via the PR #153 officer-picker population (app users table).
  • triage(alert_id, priority, note): moves the alert from new to triaged and sets priority and due_at.
  • escalate(alert_id, target, rationale): moves the alert to escalated. When target is SAR, it calls SARService.raise_sar (sar_service.py:260, exposed at POST /cases/{case_id}/sar, backend/app/api/sar.py:88-118) to pre-populate a draft SAR carrying an alert_id back-ref, and stores the returned sar_id on the alert. The SAR then follows its own ADR-0071 lifecycle (MLRO four-eyes, tipping-off boundary); this ADR adds the missing link, not a parallel filing path. When target is a review case (W1 routing), review_case_id is stored instead.
  • close(alert_id, closure_reason, closure_rationale, evidence_refs) — rationale mandatory ≥10 chars (the ADR-0045 floor, api/sanctions_suppression.py:68), typed reason mandatory. Closing as false_positive requires at least one evidence ref; closing as escalated_sar / review_opened requires the corresponding sar_id / review_case_id to be set — fail-closed: the service rejects a closure that claims an escalation it cannot point at.
  • SLA/aging: due_at derived from severity at write time (W1) or triage; the alert-queue endpoint computes overdue; MI counts expose backlog aging. Reuses the SLA vocabulary, not the case SLA rows.
  • Every transition writes an immutable audit_events row (ADR-0064) — event types alert_assigned, alert_triaged, alert_escalated, alert_closed — carrying actor id, from/to status, rationale, evidence refs, and linked ids. The disposition columns on the alert row are the queryable state; audit_events is the evidence.
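
The fail-closed closure rules in the close bullet above can be sketched as a single validation function. The Alert dataclass, ClosureRejected and validate_closure are hypothetical stand-ins for illustration; whether the real service strips whitespace before applying the 10-character floor is an assumption made here, not something the ADR states.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

MIN_RATIONALE_CHARS = 10  # the ADR-0045 floor cited in the text
CLOSURE_REASONS = {
    "resolved", "false_positive", "escalated_sar", "review_opened", "duplicate",
}


@dataclass
class Alert:
    """Hypothetical in-memory stand-in for a monitoring_alerts row."""
    status: str = "new"
    sar_id: Optional[str] = None
    review_case_id: Optional[str] = None


class ClosureRejected(ValueError):
    """Hypothetical error raised when a closure request fails validation."""


def validate_closure(
    alert: Alert,
    closure_reason: str,
    closure_rationale: str,
    evidence_refs: Sequence[str],
) -> None:
    """Raise ClosureRejected unless every mandatory closure rule holds."""
    if alert.status == "closed":
        raise ClosureRejected("closed is terminal")
    if closure_reason not in CLOSURE_REASONS:
        raise ClosureRejected(f"unknown closure reason: {closure_reason!r}")
    # Assumption: surrounding whitespace does not count toward the floor.
    if len(closure_rationale.strip()) < MIN_RATIONALE_CHARS:
        raise ClosureRejected("rationale must be at least 10 characters")
    if closure_reason == "false_positive" and not evidence_refs:
        raise ClosureRejected("false_positive needs at least one evidence ref")
    # Fail closed: a closure may not claim an escalation it cannot point at.
    if closure_reason == "escalated_sar" and alert.sar_id is None:
        raise ClosureRejected("escalated_sar requires sar_id on the alert")
    if closure_reason == "review_opened" and alert.review_case_id is None:
        raise ClosureRejected("review_opened requires review_case_id on the alert")
```

For example, validate_closure(Alert(status="triaged"), "escalated_sar", "SAR filed after review", []) raises ClosureRejected because no sar_id is set, while the same call on an alert with sar_id populated passes.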

4. API surface (architecture §2.5): POST /api/monitoring/alerts/{id}/assign · /triage · /escalate · /close; GET /api/monitoring/alerts gains status / assigned / overdue filters; a MI-counts endpoint returns time-to-close, backlog aging buckets, and closure-reason mix per tenant (the supervisor-facing "effectiveness of disposition" numbers; W6 renders them into the Monitoring Framework Record). All disposition endpoints are gated by a new Permission.MONITORING_DISPOSE = "monitoring.dispose" in backend/app/api/deps/permissions.py:42-98, granted at officer level and inherited upward through the ADR-0074 strict-superset hierarchy (_OFFICER set, :70-77); auditor stays read-only. RBAC Phase 2 is active (2026-06-28), so denial is a real 403.

5. UI: the existing alerts surface becomes an alert-queue tab with disposition actions — inline confirmation + Sonner toasts, no modal dialogs (S4U UI standard). Until W1 ships its writer, the W0 honest empty-state ("no alert engine events yet") stands; this ADR never fakes rows.

6. Boundary with the event acknowledge fix: W0 (architecture §3 W0 item b) separately fixes acknowledge_event on monitoring_events (mandatory rationale + audit event). That remains a lightweight informational-tier primitive; the full lifecycle in this ADR applies to monitoring_alerts, the response spine. record_only routing (ADR-0083) still creates an alert row so nothing bypasses the disposition trail.

Consequences

Positive

  • Closes the AMLA Guideline 2 blocker verbatim: monitoring outputs are assessed (triage), prioritised (priority/due_at), escalated (SAR/review-case links), closed (typed reason + mandatory rationale), and evidenced (immutable audit rows + evidence_refs) — the comply-or-explain artifact an NCA will ask a tenant for.
  • The monitoring→SAR gap (§3.1 "SAR never linked from a monitoring finding") closes by reusing the production ADR-0071 lifecycle rather than inventing a second filing path; the alert back-ref makes the chain detection→alert→SAR reconstructable end-to-end.
  • Proven shape, not a novel design: rationale floor, evidence refs, revoke/close-with-reason and telemetry are all lifted from ADR-0045 code that has survived adversarial review.
  • MI counts turn "we monitor" into measurable numbers (time-to-close, aging, reason mix) — AMLA's effectiveness-over-volume posture, and W6's framework record gets real data.

Negative

  • Officer workload becomes visible and mandatory: every alert now demands a typed, evidenced disposition. A tenant with noisy W1 detection will face a real backlog queue where today they see a comfortable empty list — that is the point, but it is friction, and until detection precision is tuned it may push officers toward rote false_positive closures (mitigated, not eliminated, by the evidence-ref requirement and reason-mix MI).
  • closed being terminal means an erroneous closure cannot be amended in place; the correction path (condition re-fires a new alert) depends on W1's detectors actually re-firing, and a one-shot trigger wrongly closed leaves only the audit trail as recourse.
  • Two disposition primitives now coexist (event acknowledge-with-rationale from W0 vs the full alert lifecycle), which requires the UI and officer training to keep the informational vs actionable distinction crisp.

Neutral

  • The table keeps its legacy acknowledged_at/acknowledged_by columns (:3123-3124) unused by the new lifecycle; they are not dropped in W2 (nothing ever wrote them, so dropping them is cosmetic and can ride any later migration).
  • MI counts are per-tenant operational telemetry; no cross-tenant benchmarking is introduced (Pillar 6 was dropped — officers reject data sharing).
  • W3 consumes escalated alerts for suspension recommendations and W6 renders dispositions into the case-pack monitoring appendix; both are additive consumers, no W2 rework expected.

Alternatives Considered

Alternative 1: delete the dead table; open a case per alert

Drop monitoring_alerts (zero writers, so no data loss) and route every detection straight into a ComplianceCaseWorkflow review case. Rejected: it collapses AMLA's assess/prioritise step into escalate — a sanctions-list update touching a monitored entity is not yet a finding, and spawning a 12-step case per ping is response-inflation that would bury officers and dilute real escalations. The architecture (§2.1 TriggerResponse) deliberately reserves case-opening for full_kyc_refresh/targeted_update routing; record_only and triage-first paths need a lighter, still-evidenced object. The table also already has live read surfaces (monitoring_service.py:286-319, dashboard badge, UI) that wiring preserves and deleting wastes.

Alternative 2: reuse follow-up tasks as alert triage

Represent monitoring outputs as generated follow-up tasks on the originating case (the generate_follow_up_tasks machinery exists). Rejected: follow-up tasks are case-iteration artifacts inside the onboarding loop — they carry no status machine beyond completion, no closure reason, no SLA/aging, no SAR linkage, and no tenant-level queue across cases; and post-approval the case workflow is terminal, so there is no iteration to attach them to (gap doc §3.1 "an APPROVED relationship is frozen forever"). Bending them into a disposition lifecycle would rebuild everything in this ADR inside a model shaped for something else.

Alternative 3: boolean-acknowledge++ (mandatory rationale, no lifecycle)

Extend the W0 acknowledge fix to alerts: keep new → acknowledged but require a rationale and emit an audit event. Rejected as the terminal state, kept as the W0 stopgap for events: it satisfies "evidence" but not "assess, prioritise, escalate, close" — no priority, no aging, no assignment, no typed closure reason (so no MI reason-mix), and critically no SAR/review link, leaving the detection→response chain broken exactly where the audit found it (§3.1: "a CRITICAL sanctions hit becomes a list row"). AMLA's poor-practice example is precisely alert handling that ends at acknowledgement.

Alternative 4: do nothing

Viable only until W1 ships — today the table is empty, so there is nothing to dispose. Rejected because W1 (ADR-0083) makes trigger_router_service a writer; shipping detection without disposition would manufacture the exact "detections die in a list" defect at larger scale, and the comply-or-explain clock (final guidelines Q4 2026, NCA declarations Q1 2027) runs regardless.