
ADR-0077 — Multi-provider adverse-media retrieval with native-language recall

Status: Accepted
Date: 2026-06-28
Deciders: Compliance engineering (with the OB-Holding reachability investigation oracle)
Context source: docs/research/2026-06-28-ob-holding-reachability-investigation.md

Context

The adverse-media agent (app/agents/adverse_media_agent.py) is single-provider, shallow, and English-only:

  • Phase 1 calls only Tavily REST (_tavily_search, line 62), search_depth="basic", max_results=5, snippet-only (~300 chars).
  • On HTTP 429/432 it returns ([], degraded=True) (lines 90–95) and emits an honest screening_error data-gap finding (ADR-0067). The honesty is correct; the recall is zero — there is no fallback path.
  • _build_search_specs (line 101) emits English query terms only.

A 9-agent investigation (2026-06-28) empirically refuted the prior belief — stated in ADR-0073 ("real-world detection … remains coverage/search-limited, vendor-dependent") and ADR-0075 ("Estonian/Lithuanian court and news sites are bot-blocked") — that the OB Holding blockers were structurally unreachable. Two live web-research agents retrieved every blocker from free, public, HTTP-200-confirmed sources: the €31.6m EPPO freeze (maxwin.ee/ERR.ee, in Estonian), the €8.4m fine (casinocitytimes/yogonet), the EPPO/BaltCap thread (eppo.europa.eu official press, ERR.ee — no paywall), and the McLoughlin/SKS365/Galassia links (igamingbusiness/egr.global/occrp.org).

The gap is our retrieval code, not the web. This ADR covers the provider + language + depth dimension. The entity-set dimension (the keystone — we screen the wrong names) is ADR-0078 and is a hard precondition for the value of everything here.

A critical downstream coupling was found by the adversarial reviewer: the deterministic escalators that convert retrieved text into CRITICAL/REJECT (osint_post_processing.py:_CRIMINAL_LE_SIGNALS / _DIRECTOR_ADVERSE_SIGNALS) are English-only. Retrieving Estonian text (arestis, rahapesu, prokuratuur) will not change the verdict unless the escalators also learn those terms — the same English-only recall gap ADR-0075 flagged in its consequences.
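The coupling can be illustrated with a minimal sketch. The signal sets below are hypothetical stand-ins (the real contents of _CRIMINAL_LE_SIGNALS are not reproduced here), matched_signals is an invented helper, and the Estonian sentence is synthetic. It shows only that an English-only term set finds nothing in text that a native-language set flags immediately.

```python
# Hypothetical stand-ins: the real sets live in osint_post_processing.py and
# their contents are not reproduced here.
ENGLISH_SIGNALS = frozenset({"asset freeze", "money laundering", "prosecutor"})
ET_SIGNALS = frozenset({"arestis", "rahapesu", "prokuratuur"})  # terms named in this ADR


def matched_signals(text: str, signals: frozenset[str]) -> set[str]:
    """Return the signal terms that occur in the text (case-insensitive substring match)."""
    lowered = text.casefold()
    return {term for term in signals if term in lowered}


# Synthetic sentence for illustration only, not a quoted headline.
snippet = "Prokuratuur arestis ettevõtte vara rahapesu kahtluse tõttu"

english_only = matched_signals(snippet, ENGLISH_SIGNALS)
with_native = matched_signals(snippet, ENGLISH_SIGNALS | ET_SIGNALS)
```

With the English-only set the match is empty, so a deterministic escalator would have nothing to act on. With the Estonian terms added, all three signals are found.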

Decision

Make adverse-media retrieval multi-provider, multi-language, and depth-aware, behind the existing search seam, preserving ADR-0067 fail-closed honesty and the recall-floor determinism.

  1. Tavily hardening (Wave 0, no new vendor). search_depth basic→advanced; max_results 5→15; topic="news"; include_raw_content=True. Add a bounded Semaphore(2) on the asyncio.gather burst in _search_all_entities (the concurrent burst causes the 432) and exponential-backoff-with-jitter retry (3 attempts) on 429/432 inside _tavily_search before declaring degraded. Reconsider excluding 429/432 in circuit_breaker._is_excluded_error for the tavily breaker so rate-limit storms actually pace.

  2. Native-language query variants (Wave 0). Thread country (the caller osint_agent.py:_safe_adverse_media already has it) into run_adverse_media_agent. Add ISO-country-keyed criminal/AML term sets mirroring country_capability._ADVERSE_VOCAB. ET: kriminaaluurimine, rahapesu, arestimine/vara arest, prokuratuur, kahtlustus, süüdistus; LT: ikiteisminis tyrimas, pinigų plovimas, turto areštas, prokuratūra.

  3. Escalator native-language terms (Wave 0 — load-bearing, same wave as #2). Add ET/LT signal terms to osint_post_processing.py:_CRIMINAL_LE_SIGNALS / _DIRECTOR_ADVERSE_SIGNALS. This deterministic-terms path is mandatory: routing a deterministic hard-blocker through an LLM translation step would let a dropped or mistranslated term (arestis / rahapesu / prokuratuur) silently suppress a CRITICAL — the exact single-point failure the deterministic escalator exists to prevent, and a breach of the "never suppress a risk signal" rule. Optional Phase-2 LLM English-normalisation may be added as an additive supplement, never the sole path. The query-side ET/LT vocabulary (#2) and the escalator-side ET/LT vocabulary share a single documented source of truth (extend country_capability._ADVERSE_VOCAB) with a sync note, so adding a country can never raise recall while leaving the verdict English-only. Without this, native-language retrieval raises recall but not the verdict. (This also closes the English-only consequence noted in ADR-0075.)

  4. BrightData SERP fallback (Wave 3 — triggered only). When _tavily_search returns degraded or the subject's recall-floor query yields 0 hits, re-run the same query plan via BrightData hosted-MCP search_engine (a different network path that bypasses the 432 wall), merge+dedup via _dedup_results. Mark the screen degraded only if both providers fail. BrightData is already owned/paid (config.py:71, MCP proven in 4 scrape_as_markdown call-sites) — but its search_engine product has zero call-sites today, so this is a spike, not "proven capability." Acceptance condition (ADR-0067): the SERP seam MUST return a (results, degraded) tuple that distinguishes a transport/error failure from a genuine 0-hit result (mirroring _search_all_entities' degraded flag) — an unproven integration that silently returns [] on error would masquerade as "0 hits = clean" and produce a false clean after Tavily already came back empty. Fail-closed, never fail-to-empty.

  5. BrightData Web Unlocker deep-fetch (Wave 4). After SERP returns URLs, pull full article/primary-source text for the top N (~3/entity) from a curated allow-list (eppo.europa.eu, err.ee, postimees.ee, maxwin.ee, inforegister.ee) via scrape_as_markdown, feeding the markdown into Phase-2 analysis to replace snippet truncation. Carry the .pdf/.zip MCP-crash guard (social_intelligence.jinja2:11) into the fetcher. Web Unlocker bypasses bot-walls/Cloudflare, not logins — fetch the free outlet (ERR), never a paywalled one.

  6. EPPO/OLAF allow-listed fetch (Wave 4). Add site:eppo.europa.eu + news-RSS specs for subject+aliases+persons, via the already-integrated crawl4ai_service.py, feeding the same buckets — authoritative and free.

  7. Per-provider provenance (Wave 0 — honesty precondition). Stamp every Finding with its true source (tavily / brightdata_serp / brightdata_unlocker / eppo) and the resolved URL as evidence_ref. Fix the hard-coded source='tavily' in ANALYSIS_PROMPT (~line 254). When all providers fail, keep the existing deterministic data-gap finding — nothing is ever rendered "screened clean" without a backing URL.
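The Tavily hardening in item 1 (a bounded semaphore on the gather burst, plus retry with exponential backoff and jitter before declaring degraded) could look roughly like the sketch below. It is not the project's code: RateLimited, search_with_retry, search_all, and the delay values are all hypothetical names and numbers chosen for illustration. The only contract taken from this ADR is the (results, degraded) tuple.

```python
import asyncio
import random
from typing import Awaitable, Callable

Results = list[dict]


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429/432 response from the provider."""


async def search_with_retry(
    call: Callable[[], Awaitable[Results]],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> tuple[Results, bool]:
    """Return (results, degraded); degraded is True only if every attempt was rate-limited."""
    for attempt in range(attempts):
        try:
            return await call(), False
        except RateLimited:
            if attempt < attempts - 1:
                # Exponential backoff with full jitter: sleep in [0, base_delay * 2^attempt].
                await asyncio.sleep(random.uniform(0, base_delay * 2**attempt))
    return [], True


async def search_all(
    queries: list[str],
    search_one: Callable[[str], Awaitable[Results]],
    max_concurrency: int = 2,
    base_delay: float = 0.5,
) -> list[tuple[Results, bool]]:
    """Run every query with at most max_concurrency provider calls in flight."""
    gate = asyncio.Semaphore(max_concurrency)

    async def gated(query: str) -> tuple[Results, bool]:
        # The slot is held across retries, so a rate-limited query also
        # slows down the queries waiting behind it.
        async with gate:
            return await search_with_retry(lambda: search_one(query), base_delay=base_delay)

    return await asyncio.gather(*(gated(q) for q in queries))
```

A genuine zero-hit result comes back as ([], False) and an exhausted retry as ([], True), which keeps the two cases distinguishable for the data-gap finding.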

Budget contention (mandatory before Waves 3–4)

The shared brightdata_concurrency.brightdata_slot() Semaphore(2) is already consumed by social_intelligence_agent + person_validation_agent, with documented 4-case timeout storms. Adding SERP + Unlocker as a 3rd/4th consumer through the same gate risks re-triggering those timeouts. Model aggregate contention and partition or raise the budget before enabling the BrightData waves. A shared brightdata_scrape_service (consolidating the 4 copy-pasted MCP call-sites behind one budget-gated seam) is the recommended refactor and may land as part of this ADR.
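One possible shape for the "partition" option is sketched below, assuming each consumer group gets its own cap instead of competing for a single Semaphore(2). PartitionedBudget, its slot method, and the partition names used in the usage note are hypothetical; this is not the existing brightdata_slot() API.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator


class PartitionedBudget:
    """Per-consumer concurrency caps, so one consumer cannot exhaust another's slots."""

    def __init__(self, limits: dict[str, int]) -> None:
        self._gates = {name: asyncio.Semaphore(limit) for name, limit in limits.items()}

    @asynccontextmanager
    async def slot(self, consumer: str) -> AsyncIterator[None]:
        # An unknown consumer raises KeyError rather than running ungated.
        async with self._gates[consumer]:
            yield
```

A caller would construct, for example, PartitionedBudget({"social": 2, "adverse_media": 1}) inside the running event loop and wrap each BrightData call in async with budget.slot("adverse_media"). The aggregate in-flight count is then the sum of the partition limits, so the contention modelling described above still has to decide what that sum may be.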

Why keep determinism + fail-closed (not adopt atlas's agentic tool-choice)

trustrelay-atlas wires Tavily + BrightData + Exa with agentic provider selection. We adopt the breadth (BrightData first, since we own it) but keep our fixed query plan + recall-floor + ADR-0067 data-gap findings. We explicitly do not port atlas's analysis__extract_findings.txt, which omits tool errors/not-found from findings — that would regress our fail-closed honesty.

Consequences

  • Adverse-media recall rises materially: transient-432 recovery (Tavily hardening) + a true fallback path (BrightData SERP) + native-language coverage + full article bodies. Paired with ADR-0078 alias expansion, the OB Holding blockers become reachable by the pipeline.
  • The escalator language fix (#3) is the difference between "we retrieved it" and "the verdict changed." It must ship in the same wave as native-language retrieval.
  • Cost is contained: BrightData paths are triggered-only (Tavily-degraded or zero-hit), deep-fetch is capped per entity, and the budget is gated.
  • Honesty is preserved end-to-end: every fact carries a real source URL + provider; all-providers-fail still yields the fail-closed gap; we never relabel a group entity's enforcement as the subject's (see ADR-0078).
  • Known residual limits (unchanged, must stay honest): EPPO sealed files, login-paywalled outlets, and causal claims not independently sourced (e.g. "resigned because of EPPO").

Addendum — live-validation refinements (2026-06-29)

End-to-end validation against the live OB Holding case (full pipeline, real providers) surfaced five gaps between "retrieved" and "surfaced as CRITICAL" that the original waves did not anticipate. Each was fixed and pinned by tests; together they are what makes the freeze autonomously reach a CRITICAL verdict.

  1. SERP via DIRECT MCP call, not an LLM-agent wrapper. The first SERP integration wrapped search_engine in a PydanticAI Agent (an LLM choosing the tool). Under concurrent full-pipeline load every SERP call timed out. Replaced with MCPServerStreamableHTTP.direct_call_tool("search_engine", …) — no model in the loop (~3s vs 90s). _normalise_serp_organic maps {organic:[{link,title,description}]} → the pipeline's {title,url,content}; a malformed payload (no organic key) fail-closes to a data gap, never a silent clean empty.

  2. Bounded transient retry on the SERP call. The hosted MCP intermittently returns an error result (surfacing as ModelRetry) or times out under load. For the native-language query — frequently the only provider reaching regional enforcement press — a single blip silently dropped a CRITICAL signal. _brightdata_serp_search now retries transient errors (_SERP_RETRIES, backoff+jitter, fresh session per attempt); CircuitOpenError still fails-closed immediately, and a sustained outage still becomes a data gap.

  3. Honest degradation for high-signal queries (SERP is co-primary). The proactive-merge originally marked a screen degraded only if both providers failed. But Tavily's non-English coverage is empirically noise (it missed the OlyBet freeze entirely while returning 15+ irrelevant English crime articles), so a Tavily "success" masked a SERP failure → the gap read as "screened clean." For high-signal (recall-floor / native-language / EPPO allow-list) queries, SERP's status now governs the degraded flag: if the adequate provider could not run, the screen is a data gap regardless of Tavily (never suppress a risk signal; ADR-0067). The retry in #2 keeps a transient blip from over-firing this.

  4. Relevance-RANKED truncation of the Phase-2 input (the subtle one). The retrieved freeze landed at index ~43 of a 53-result bucket; the analysis formatter passed only the first N to the LLM, in retrieval order (Tavily first, SERP appended last) — so the freeze was silently truncated before the LLM ever saw it, even though every prior layer reported success. Naive crime-word counting did not rescue it: the adverse-themed queries pull English crime articles about unrelated entities that out-score a 3-word Estonian headline. _format_results_for_analysis now relevance-ranks each bucket before the cap, using a banded key: a result that NAMES the screened entity OR carries a NATIVE-language enforcement term (arestis / prokuratuur / rahapesu) enters a high band (ordered within by native density + provider); generic English crime noise stays in the low band. This guarantees the regional enforcement hit tops the bucket regardless of noise volume. ORDERING ONLY — classification remains the LLM's job. Cap raised 8→10.

  5. Phase-2 ANALYSIS_PROMPT recall/precision calibration. With the freeze finally in the input, the analyst LLM still dismissed it as "general gaming news … does not name the exact entity," because the prompt demanded a byte-for-byte name match, which the Estonian genitive "Olybeti Eesti" fails. The prompt now treats a declined / translated / transliterated / suffix-variant of the screened name as the SAME entity (flag), while a DIFFERENT entity merely sharing a generic/sector word stays clean (ADR-0073 R9 preserved): same-entity requires a corroborating signal (jurisdiction, the given brand↔group link, reg-no/LEI/domain). Adds subject-of-enforcement-not-co-occurrence, negation/exoneration, and homonym-person guards. Calibrated with a real-LLM gated fixture set (REAL_LLM=1): recall cases (the OlyBet €31.6m freeze under name variants) and adversarial precision cases (different-entity, sector-regulator, parent-bleed, homonym, exoneration, victim-framing), all green.
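The banded ordering described in item 4 can be sketched as a sort key. This is a sketch under stated assumptions, not the body of _format_results_for_analysis: rank_key, rank_and_cap, the "source" field on each result, and the provider tie-break order are all hypothetical. Only the three native terms and the cap of 10 come from this ADR.

```python
NATIVE_TERMS = ("arestis", "prokuratuur", "rahapesu")  # examples named in this ADR
PROVIDER_ORDER = {"brightdata_serp": 0, "tavily": 1}   # assumed tie-break, not confirmed


def rank_key(result: dict, entity_names: list[str]) -> tuple[int, int, int]:
    """Sort key: band first, then native-term density (descending), then provider."""
    text = f"{result.get('title', '')} {result.get('content', '')}".casefold()
    native_density = sum(text.count(term) for term in NATIVE_TERMS)
    names_entity = any(name.casefold() in text for name in entity_names)
    # High band (0): names the screened entity OR carries a native enforcement term.
    band = 0 if names_entity or native_density else 1
    provider = PROVIDER_ORDER.get(result.get("source", ""), len(PROVIDER_ORDER))
    return band, -native_density, provider


def rank_and_cap(results: list[dict], entity_names: list[str], cap: int = 10) -> list[dict]:
    """Order results by rank_key, then truncate; ordering only, no classification."""
    # sorted() is stable, so retrieval order survives within equal keys.
    return sorted(results, key=lambda r: rank_key(r, entity_names))[:cap]
```

Because the band is the first element of the key, no amount of low-band English noise can push a high-band result past the cap. A substring match on the entity name also tolerates a suffixed form such as "Olybeti", although it would not catch every inflection.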

Result: the OB Holding case now autonomously escalates to 2 CRITICAL findings, subject-direct (BaltCap EPPO seizure) and verified-group-chain (OlyBet Eesti → OB Holding, ADR-0078), while a name-similar director charge is correctly cleared. This matches the manual analyst Reject pack.

Alternatives considered

  • Add a commercial adverse-media feed (ComplyAdvantage / Dow Jones / World-Check) now. Rejected as first move: contract-gated, costly, and unnecessary — the OB Holding facts are free/public; reach them with capability we already own. Kept as an optional Wave 6 backstop.
  • Swap Tavily for Exa. Rejected: new vendor + key; BrightData (owned) covers the SERP+scrape need. Exa stays an optional third provider for paraphrased/non-English recall.

References

ADR-0078 (alias/brand/group expansion — keystone precondition), ADR-0067 (fail-closed outputs), ADR-0019 (OSINT pipeline), ADR-0075 (document adverse analysis — shares the escalators + the English-only gap), ADR-0073 (entity-resolution; this ADR refines its "vendor-limited" consequence), ADR-0029 (model tiers), ADR-0032 (circuit breakers), osint_post_processing.py, brightdata_concurrency.py, research doc docs/research/2026-06-28-ob-holding-reachability-investigation.md.