ADR-0082: Post-validation calibration & integrity hardening (OB Holding adversarial run)
- Status: Accepted
- Date: 2026-06-30
- Deciders: Adrian (Soft4U), Claude
- Related: ADR-0020 (EBA weighted-max risk matrix), ADR-0045 (sanctions FP suppression), ADR-0067 (fail-closed / never report clear), ADR-0073 (round-2 entity-resolution & signal hardening), ADR-0075 (document-content adverse analysis), ADR-0077 / ADR-0078 (multi-provider adverse media / alias-brand-group expansion)
Context
A live OB Holding 1 OÜ investigation (Estonian PSP, psp_merchant_onboarding) was put
through a 24-agent adversarial calibration validation. The case outcome was correct —
EDD / CRITICAL, driven by a confirmed EPPO criminal investigation + asset freeze — but the
validation surfaced engine-level calibration and integrity defects that did not flip
this verdict yet were latent, repeatable, and (in one case) actively risk-suppressing.
The project's standing principle (ADR-0067, mission memory) is that the system may add
scrutiny but must never suppress a risk signal, and that over-calibration is a defect
equal to suppression. The defects below are resolved under that principle.
Implemented across PRs #152 (the six validation findings + retrievability), #156 (recalc guard) and #157 (confidence provenance), with regression tests in #160.
Decision
- Subject criminal-investigation CRITICAL floor (#2). A confirmed criminal-law-enforcement investigation of the subject itself now floors the authoritative EBA score to CRITICAL (≥90) via a deterministic `ENTITY_CRIMINAL_INVESTIGATION` escalator, at parity with `NETWORK_SANCTIONS_CONFIRMED`. Previously it capped at HIGH/85 (a critical adverse-media dimension → weighted-max floor 100×0.85) while lesser connected-entity sanctions hits reached 90, a calibration inversion. Set from the existing `has_entity_criminal` (respects officer rejections). `eba_risk_matrix.py`, `risk_matrix_service.py`.
- Gambling MCC scored at its true vertical (#3). MCC 7995 now resolves to the `gambling` industry category (score 90) via a conservative MCC→category map (`mcc_to_industry_category`), used both in the product/service dimension lookup and to set `business_profile`. Previously any high-tier industry finding was coerced to `construction` (50) and the MCC lookup missed (dataset keyed by category name, not MCC), so a high-risk vertical under-scored; a pure-gambling merchant under-tiered to CDD.
- ET/LT adverse-media process discipline (#1). Native-language criminal escalators now obey the same "process language, not bare crime noun" discipline as English: the escalator set (`all_native_criminal_process_terms`) holds only criminal-process terms (investigation/prosecution/arrest/freeze) plus noun+process compounds; bare crime nouns (rahapesu/kelmus/…) live in a query-only set (`AML_CRIME_NOUNS_NATIVE`) used for search recall but never for the CRITICAL floor. An administrative AML fine reported in EE/LT no longer over-escalates to CRITICAL. `country_capability.py`, `osint_post_processing.py`.
- `presence ≠ evidence` for VERIFIED emitters (#4/#5). NorthData is labelled VERIFIED ("corroborated") only when its record carries a registration identifier (reg-no / EUID / LEI) that matches the subject; a name-only/identifier-less/mismatched hit is a LOW "name-only candidate — unverified" (ADR-0073 R9). Crunchbase is VERIFIED only on a domain match to the subject's website; otherwise a LOW "located, identity not confirmed". `osint_phases.py` (subject reg threaded via `emit_pre_enrichment_source_findings`).
- Post-workflow retrievability (#7). The case-detail API hydrates `resolved_requirements`, `cross_reference_result` and `document_manifest` from their dedicated DB columns (and `additional_data`) on the DB-fallback path. Previously they were read only from live Temporal state and returned null once the workflow archived. `cross_reference_result.discrepancies` is a risk-signal carrier, so dropping it after archival was a never-suppress violation (EU AI Act Art.11/12, AMLR 5-yr retention). `case_crud.py`.
- Recalc never-suppress guard (#156). `recalculate_case_risk` has a findings-aware "rich" path and a findings-blind fallback (country + MCC only). The fallback is taken when `additional_data["investigation_results"]` is empty, which races with the workflow persisting `investigation_results` after its post-OSINT reassessment. Without a guard, a recalc fired in that window reset an investigated case (EPPO CRITICAL/90/EDD) to a country-only baseline (medium/42/CDD), silently suppressing the risk. A findings-blind recompute may now never downgrade a higher findings-based assessment (`_recalc_should_preserve`); a rich-path recalc may change the score freely (evidence-traceable). `risk_config.py`. (Diagnosed via the append-only `risk_assessments` history table, which showed `initial/42 → post_osint/90 → manual_recalculate/42`; that table is the canonical forensic tool for "which write set the displayed risk".)
- Confidence source-provenance recognition (#157). The confidence "Source Diversity" breakdown now splits combined provenance (`"A + B"`) and recognises real sources via exact + substring aliases (VIES, localized registries e.g. Estonian Äriregister, EU sanctions, news/AML adverse-media, the subject's own website = `self_declared`). Internal pipeline stages (mcc_classifier, screening-suppression, financial_analyzer, EVOI, knowledge graph, country-capability) are classified `internal_analysis` at 0 pts: consulted, but not independent sources. The L1 principle is preserved: a truly unrecognised source still scores 1 pt (honest, never inflated to a mid-tier default). `confidence_engine.py` (mirrored to the extracted `trustrelay-engines` package, #158).
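The calibration inversion behind the criminal-investigation floor (#2) is easiest to see as arithmetic. The following is a minimal sketch, not the code in `eba_risk_matrix.py`: the function names, the integer-percent representation of the 0.85 factor, and the way escalators are passed in are all assumptions made for illustration. Only the 100×0.85 = 85 cap, the ≥90 CRITICAL floor, and the two escalator names come from this ADR.

```python
from typing import Iterable, Mapping

# Values taken from this ADR; everything else in this sketch is hypothetical.
WEIGHTED_MAX_PERCENT = 85   # weighted-max factor 0.85, kept as an integer percent
CRITICAL_FLOOR = 90         # CRITICAL band starts at 90
CRITICAL_ESCALATORS = frozenset(
    {"ENTITY_CRIMINAL_INVESTIGATION", "NETWORK_SANCTIONS_CONFIRMED"}
)


def weighted_max_floor(dimension_scores: Mapping[str, int]) -> int:
    """Floor implied by the single worst dimension (100 -> 85)."""
    return max(dimension_scores.values()) * WEIGHTED_MAX_PERCENT // 100


def authoritative_score(
    base_score: int,
    dimension_scores: Mapping[str, int],
    escalators: Iterable[str],
) -> int:
    """Apply the weighted-max floor, then any deterministic CRITICAL escalator."""
    score = max(base_score, weighted_max_floor(dimension_scores))
    if CRITICAL_ESCALATORS & set(escalators):
        # Escalators only ever raise the score; they never lower it.
        score = max(score, CRITICAL_FLOOR)
    return score
```

Under these assumptions, a critical adverse-media dimension alone yields 85 (HIGH), while the same case with `ENTITY_CRIMINAL_INVESTIGATION` set yields 90 (CRITICAL), matching the score that a confirmed connected-entity sanctions hit already produced.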
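The provenance recognition in #157 can be illustrated in a similar way. This sketch assumes hypothetical alias tables and canonical names; the real tables in `confidence_engine.py` are not reproduced in this ADR, and the point values for recognised sources are deliberately left out because the ADR does not state them. Only the `"A + B"` splitting, the exact-then-substring matching, `internal_analysis` at 0 pts, and the 1 pt score for unrecognised sources are taken from the text above.

```python
import re
from typing import List

# Hypothetical tables for illustration only.
INTERNAL_STAGES = {"mcc_classifier", "financial_analyzer", "evoi", "knowledge graph"}
EXACT_ALIASES = {"vies": "vies", "eu sanctions": "eu_sanctions"}
SUBSTRING_ALIASES = {"äriregister": "business_registry"}

# The only point values stated in the ADR.
FIXED_POINTS = {"internal_analysis": 0, "unrecognised": 1}


def split_provenance(raw: str) -> List[str]:
    """Split a combined provenance string such as "A + B" into its parts."""
    return [part.strip() for part in re.split(r"\s+\+\s+", raw) if part.strip()]


def classify_source(name: str) -> str:
    """Map one provenance label to a canonical source or a fallback class."""
    key = name.strip().lower()
    if key in INTERNAL_STAGES:
        return "internal_analysis"      # consulted, not an independent source
    if key in EXACT_ALIASES:
        return EXACT_ALIASES[key]
    for needle, canonical in SUBSTRING_ALIASES.items():
        if needle in key:
            return canonical
    return "unrecognised"               # still scored, but only at 1 pt


def classify_provenance(raw: str) -> List[str]:
    """Classify every part of a possibly combined provenance string."""
    return [classify_source(part) for part in split_provenance(raw)]
```

With these tables, `classify_provenance("VIES + mcc_classifier")` separates the independent source from the internal stage, so the internal stage no longer contributes to Source Diversity.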
Consequences
- No risk-score inflation. Where a dimension was already capped (e.g. OB Holding Source Diversity at 25) the displayed total is unchanged; the corrections fix labels and the low-source / no-findings cases. The criminal floor and gambling fixes raise scores only where the evidence justifies it.
- The confidence breakdown is computed live (`GET /api/cases/{wf}/confidence`, `confidence.py`), so #157 takes effect on refresh without re-persisting.
- Forensic tooling. The `risk_assessments` append-only table (trigger/score/assessed_at per write) is the authoritative way to see which computation produced the displayed risk.
- Regression coverage. `test_validation_findings_fixes.py` (#152, +13), confidence provenance tests (#157, +8), and `_recalc_should_preserve` tests (#160, +4) pin the new behaviour. 42 confidence + 79 EBA tests green.
- Known follow-ups. Bot-review pipeline (Codex/CodeRabbit/Aikido) was quota-limited during the run, so #152–#160 merged on local verification; a review pass is owed once quota returns. `osint_post_processing.py` is not yet in `architecture-index.json`.
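To make the forensic use of `risk_assessments` concrete, here is a small sketch that replays the write history for a case and flags any write that lowered the score. It assumes a hypothetical schema: only the trigger, score, and assessed_at columns are named in this ADR, while `id`, `case_id`, the SQLite backend, and the timestamps are invented for the example. The three rows reproduce the `initial/42 → post_osint/90 → manual_recalculate/42` sequence observed during diagnosis.

```python
import sqlite3
from typing import List, Tuple

HistoryRow = Tuple[str, int, str]  # (trigger, score, assessed_at)


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory table with the sequence described in this ADR."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE risk_assessments (id INTEGER PRIMARY KEY, case_id TEXT, '
        '"trigger" TEXT, score INTEGER, assessed_at TEXT)'
    )
    conn.executemany(
        'INSERT INTO risk_assessments (case_id, "trigger", score, assessed_at) '
        "VALUES (?, ?, ?, ?)",
        [
            ("case-1", "initial", 42, "2026-06-30T10:00:00Z"),       # illustrative times
            ("case-1", "post_osint", 90, "2026-06-30T10:05:00Z"),
            ("case-1", "manual_recalculate", 42, "2026-06-30T10:05:30Z"),
        ],
    )
    return conn


def load_history(conn: sqlite3.Connection, case_id: str) -> List[HistoryRow]:
    """Return every write for the case, oldest first."""
    rows = conn.execute(
        'SELECT "trigger", score, assessed_at FROM risk_assessments '
        "WHERE case_id = ? ORDER BY assessed_at, id",
        (case_id,),
    ).fetchall()
    return [(trigger, score, assessed_at) for trigger, score, assessed_at in rows]


def find_downgrades(history: List[HistoryRow]) -> List[Tuple[str, int, str, int]]:
    """List (previous trigger, previous score, trigger, score) for each score drop."""
    return [
        (prev[0], prev[1], cur[0], cur[1])
        for prev, cur in zip(history, history[1:])
        if cur[1] < prev[1]
    ]
```

The last row of `load_history` identifies which write set the displayed risk, and `find_downgrades` surfaces the `post_osint/90 → manual_recalculate/42` step that motivated the recalc guard.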
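The behaviour that the `_recalc_should_preserve` tests pin can be summarised with a minimal sketch. The actual signature and inputs of `_recalc_should_preserve` in `risk_config.py` are not shown in this ADR, so the `Assessment` type, the `findings_based` flag, and both function names below are hypothetical. The rule itself follows the Decision text: a findings-blind recompute may not downgrade a higher findings-based assessment, while a rich-path recalc may change the score freely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Assessment:
    score: int            # risk score, 0-100
    findings_based: bool  # True when produced by the findings-aware "rich" path


def should_preserve_existing(
    existing: Optional[Assessment], candidate: Assessment
) -> bool:
    """Return True when the stored assessment must be kept over the candidate."""
    if existing is None:
        return False  # nothing to protect yet
    if candidate.findings_based:
        return False  # rich path is evidence-traceable and may move the score
    # Findings-blind candidate: block only a downgrade of findings-based risk.
    return existing.findings_based and candidate.score < existing.score


def apply_recalc(
    existing: Optional[Assessment], candidate: Assessment
) -> Assessment:
    """Pick the assessment that should be displayed after a recalc."""
    if existing is not None and should_preserve_existing(existing, candidate):
        return existing
    return candidate
```

In this sketch, a findings-blind recalc to 42 fired against a stored findings-based 90 leaves the 90 in place, whereas a findings-blind recalc that raises the score is still accepted, which is consistent with the principle that the system may add scrutiny but never suppress a risk signal.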