ADR-0024: Entity Matching with Blocking Keys and Trust-Weighted Survivorship

Date: 2026-03-31 (date decision was originally made) Status: Accepted Deciders: Adrian Birlogeanu (Soft4U BV), Claude Code Documented retroactively: 2026-04-03

Context

Cross-investigation entity deduplication requires both algorithmic efficiency and regulatory compliance. OSINT investigations across multiple cases frequently encounter the same companies, directors, and UBOs under slightly different spellings or transliterations. Naively comparing all entities across all cases is O(n^2), which becomes untenable as the case volume grows.

When two sources disagree on a field value -- for example, KBO reports an address in Brussels while NorthData reports Antwerp -- the system needs a principled way to decide which value to keep. A simple "last write wins" strategy loses provenance information that compliance officers need for audit trails. Government registries should be trusted over web scrapers, and structured APIs over LLM-extracted data, but this hierarchy must be explicit and configurable rather than buried in ad-hoc code.

Additionally, certain fields carry regulatory significance that demands special handling. Sanctions flags and PEP designations must never be overwritten by lower-trust sources, even if those sources have more recent data. A web scraper reporting "no sanctions" should not suppress a confirmed OFAC match.

Decision

We implement a two-stage entity matching and resolution system:

Stage 1: Blocking Keys for Candidate Selection

Entities are grouped into candidate pairs using blocking keys composed of jurisdiction:name_prefix (3-character prefix). This reduces comparison space from O(n^2) to O(n) by only comparing entities within the same block. Within each block, Python's SequenceMatcher performs fuzzy string matching on normalized entity names with three threshold bands:

>0.85: automatic match -- entities are merged without human review
0.70-0.85: review queue -- entities are flagged for compliance officer confirmation
<0.70: distinct -- entities are treated as separate

Stage 2: Trust-Weighted Survivorship

When matched entities have conflicting field values, the system resolves conflicts using per-provider trust scores:

Provider	Trust Score
KBO (government registry)	0.98
GLEIF (global LEI registry)	0.95
Structured API (NorthData, VIES)	0.85
LLM-extracted data	0.75
Web scraping (crawl4ai, BrightData)	0.70

The highest-trust value wins for each field. When the trust delta between competing values is less than 0.02, the conflict is recorded for human review rather than auto-resolved. Protected fields (sanctions status, PEP designation) require authorized sources only -- values from providers below the 0.85 trust threshold are ignored for these fields.

Consequences

Positive

O(n) candidate selection via blocking keys makes deduplication feasible at scale (thousands of entities across hundreds of cases)
Trust scores create a transparent, auditable hierarchy for conflict resolution -- compliance officers can see exactly why a particular value was chosen
Protected field handling ensures sanctions and PEP flags are never suppressed by lower-quality sources, satisfying AML regulatory requirements
The 0.70-0.85 review band captures ambiguous matches for human judgment rather than making incorrect automated decisions

Negative

The 3-character prefix blocking key can miss matches when entity names differ in their first 3 characters (e.g., "The Company" vs "Company") -- requires normalization preprocessing
Fixed trust scores do not account for temporal freshness -- a 2-year-old KBO record (0.98) will be preferred over yesterday's NorthData record (0.85) even if the latter is more current
The review queue for 0.70-0.85 matches adds to compliance officer workload, potentially creating a backlog

Neutral

Trust scores are configuration-driven and can be adjusted per deployment without code changes
The entity matcher integrates with the graph ETL pipeline (see graph_etl.py) to deduplicate nodes before graph insertion
Conflict records feed into the discrepancy resolution system, enabling compliance officers to resolve disagreements through the chatbot interface

Alternatives Considered

Alternative 1: Simple Fuzzy Matching Without Blocking

Why rejected: Comparing every entity against every other entity is O(n^2). With 10,000 entities across 500 cases, this means 100 million comparisons -- computationally infeasible for real-time deduplication and unnecessary when jurisdiction alone eliminates the vast majority of non-matching pairs.

Alternative 2: Consensus Voting (Majority Wins)

Why rejected: Loses the source quality hierarchy. Three web scrapers reporting the same incorrect address would outvote one government registry reporting the correct address. In compliance contexts, data provenance matters more than data prevalence -- an authoritative source should always override multiple low-quality sources.

Alternative 3: Manual Resolution of All Conflicts

Why rejected: The volume of entity overlaps across investigations makes fully manual resolution impractical. A single investigation can surface 50-100 entities, and cross-case deduplication multiplies this. Compliance officers are already overloaded with case review; automated resolution for high-confidence matches (>0.85 similarity, >0.02 trust delta) is necessary to keep the workflow manageable.

Context​

Decision​

Stage 1: Blocking Keys for Candidate Selection​

Stage 2: Trust-Weighted Survivorship​

Consequences​

Positive​

Negative​

Neutral​

Alternatives Considered​

Alternative 1: Simple Fuzzy Matching Without Blocking​

Alternative 2: Consensus Voting (Majority Wins)​

Alternative 3: Manual Resolution of All Conflicts​