Lex — Regulatory Knowledge Layer

Lex gives Atlas the actual text of 40 regulations — 27 EU regulations plus 13 national AML laws across 12 jurisdictions. When an officer sees "AMLR Art.28 CDD Coverage — 85%", they can ask the copilot what Article 28 actually requires and receive an answer grounded in verbatim regulation text — with zero hallucinated citations.

The Problem

Architecture

Regulatory Corpus

The corpus contains 40 regulations: 27 EU regulations sourced from EUR-Lex CELLAR, plus 13 national AML laws across 12 jurisdictions (BE, CH, CZ, DE, DK, EE, FI, FR, NL, NO, RO, SK) sourced from national official gazettes and authority websites. The corpus is defined declaratively in corpus_config.py: one RegulationConfig per regulation, each specifying its fetcher and parser.
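
A declarative entry of this kind might look like the sketch below. Only the fetcher and parser fields are named in the text above; every other field name, the parser keys, and the by_fetcher helper are assumptions made for illustration and are not taken from corpus_config.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegulationConfig:
    """One corpus entry. Only `fetcher` and `parser` are named in the docs;
    every other field here is an assumption made for illustration."""
    regulation_id: str
    jurisdiction: str
    source_ref: str               # CELEX number for EU acts (assumed field)
    fetcher: str = "eurlex"       # eurlex is the default for EU regulations
    parser: str = "eurlex_xhtml"  # hypothetical parser key


CORPUS = [
    RegulationConfig("AMLR", "EU", "32024R1624"),
    RegulationConfig("GDPR", "EU", "32016R0679"),
    # National laws override the fetcher; the source_ref is a placeholder.
    RegulationConfig("EE-AML", "EE", "<gazette reference>",
                     fetcher="riigi_teataja", parser="gazette_html"),
]


def by_fetcher(corpus, fetcher):
    """Return the IDs of all regulations that use the given fetcher key."""
    return [cfg.regulation_id for cfg in corpus if cfg.fetcher == fetcher]
```

Keeping the corpus as plain data means adding a regulation is a one-line change, and the list can be inspected or validated without running any fetcher.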

Wave 1 — Core EU Regulations (8)

| Regulation | CELEX | Priority |
| --- | --- | --- |
| AMLR (EU 2024/1624) — AML Regulation | 32024R1624 | Critical |
| AMLD6 (EU 2024/1640) — 6th AML Directive | 32024L1640 | Critical |
| EU AI Act (EU 2024/1689) — AI harmonised rules | 32024R1689 | Critical |
| GDPR (EU 2016/679) — Data protection | 32016R0679 | High |
| DORA (EU 2022/2554) — Digital operational resilience | 32022R2554 | Medium |
| MiCA (EU 2023/1114) — Crypto-assets | 32023R1114 | Medium |
| EU-IPR (EU 2024/886) — Instant payments | 32024R0886 | High |
| PSD2 (EU 2015/2366) — Payment services | 32015L2366 | Medium |

Wave 2 — Financial Services & Digital Infrastructure (6)

| Regulation | CELEX | Priority |
| --- | --- | --- |
| NIS2 (EU 2022/2555) — Cybersecurity | 32022L2555 | High |
| DSA (EU 2022/2065) — Digital Services Act | 32022R2065 | Medium |
| MiFID II (2014/65/EU) — Financial instruments | 32014L0065 | Medium |
| eIDAS 2 (EU 2024/1183) — Digital identity | 32024R1183 | Medium |
| CRD IV (2013/36/EU) — Capital requirements | 32013L0036 | Medium |
| AMLA Reg (EU 2024/1620) — AMLA establishment | 32024R1620 | Critical |

Wave 3 — Sustainability, Travel Rule & Payments (6)

| Regulation | CELEX | Priority |
| --- | --- | --- |
| TFR (EU 2023/1113) — Transfer of Funds / Travel Rule | 32023R1113 | High |
| CSDDD (EU 2024/1760) — Corporate sustainability due diligence | 32024L1760 | Medium |
| CSRD (EU 2022/2464) — Corporate sustainability reporting | 32022L2464 | Medium |
| Whistleblower (EU 2019/1937) — Whistleblower protection | 32019L1937 | Medium |
| EMD2 (2009/110/EC) — Electronic money | 32009L0110 | Medium |
| SEPA (EU 260/2012) — Credit transfers & direct debits | 32012R0260 | Medium |

Wave 4 — Fiscal Representatives & Taxation (4)

| Regulation | CELEX | Priority |
| --- | --- | --- |
| VAT Directive (2006/112/EC) — Common VAT system | 32006L0112 | High |
| DAC (2011/16/EU) — Administrative cooperation in taxation | 32011L0016 | Medium |
| DAC7 (EU 2021/514) — Digital platform reporting | 32021L0514 | High |
| AMLD5 (EU 2018/843) — 5th AML Directive | 32018L0843 | High |

Wave 5 — Customs & Trade Compliance (3)

The Union Customs Code trilogy — critical for entities acting as importer/exporter or holding AEO status.

| Regulation | CELEX | Priority |
| --- | --- | --- |
| UCC (EU 952/2013) — Union Customs Code | 32013R0952 | Critical |
| UCC-DA (EU 2015/2446) — UCC Delegated Act | 32015R2446 | High |
| UCC-IA (EU 2015/2447) — UCC Implementing Act | 32015R2447 | High |

Wave 6+ — National AML Regulations (13)

National transpositions of the EU AML Directives into domestic law. Each uses a jurisdiction-specific fetcher (the fetcher field in each RegulationConfig).

| Regulation | Jurisdiction | Fetcher | Source |
| --- | --- | --- | --- |
| EE-AML — Estonian AML Prevention Act | EE | riigi_teataja | Riigi Teataja |
| FI-AML — Finnish AML Act (444/2017) | FI | pdf | Fin-FSA |
| NL-Wwft — Dutch Wwft | NL | bwb | wetten.overheid.nl |
| DK-AML — Danish Hvidvaskloven | DK | pdf | Finanstilsynet |
| BE-AML — Belgian AML Law (18 Sept 2017) | BE | pdf | NBB |
| CZ-AML — Czech Act 253/2008 | CZ | pdf | FAU |
| FR-CMF — French Code monétaire et financier | FR | legifrance | Légifrance |
| DE-GwG — German Geldwäschegesetz | DE | html | Gesetze-im-Internet |
| CH-GwG — Swiss Geldwäschereigesetz | CH | pdf_fetcher | Fedlex |
| NO-AML — Norwegian Hvitvaskingsloven | NO | html_fetcher | Lovdata |
| SK-AML — Slovak AML Act 297/2008 | SK | html_fetcher | Slov-Lex |
| RO-AML — Romanian Law 129/2019 | RO | pdf | Monitorul Oficial |
| RO-AML-OUG — Romanian OUG transposition | RO | pdf | Monitorul Oficial |

Three national configs (CH-GwG, NO-AML, SK-AML) declare the fetcher keys pdf_fetcher or html_fetcher, but the ingest.py dispatch table registers only pdf and html (along with eurlex, riigi_teataja, finlex, bwb, retsinformation, legifrance). These three would raise Unknown fetcher type until the dispatch keys are aligned.
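
A key mismatch like this can be caught before any ingestion run. The sketch below is a minimal check, assuming the registered dispatch keys and the per-regulation fetcher values are available as plain Python collections. The names REGISTERED_FETCHERS, DECLARED_FETCHERS, and find_unregistered are hypothetical and do not come from ingest.py; the key values themselves are copied from the national table and the dispatch list above.

```python
# Keys the dispatch table is said to register (from the text above).
REGISTERED_FETCHERS = {
    "eurlex", "riigi_teataja", "finlex", "bwb",
    "retsinformation", "pdf", "html", "legifrance",
}

# Fetcher key declared by each national config (from the table above).
DECLARED_FETCHERS = {
    "EE-AML": "riigi_teataja", "FI-AML": "pdf", "NL-Wwft": "bwb",
    "DK-AML": "pdf", "BE-AML": "pdf", "CZ-AML": "pdf",
    "FR-CMF": "legifrance", "DE-GwG": "html", "CH-GwG": "pdf_fetcher",
    "NO-AML": "html_fetcher", "SK-AML": "html_fetcher",
    "RO-AML": "pdf", "RO-AML-OUG": "pdf",
}


def find_unregistered(declared, registered):
    """Map each regulation ID to its fetcher key when that key is not registered."""
    return {
        reg_id: key
        for reg_id, key in declared.items()
        if key not in registered
    }


if __name__ == "__main__":
    for reg_id, key in sorted(find_unregistered(DECLARED_FETCHERS, REGISTERED_FETCHERS).items()):
        print(f"{reg_id}: unknown fetcher type {key!r}")
```

Running a check of this shape in CI would turn the runtime error into a failed build.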

Ingestion Pipeline

Five-stage pipeline (fetcher → parser → chunker → embedder → indexer) with component isolation: each stage has typed I/O contracts and can be replaced independently.

Multi-Source Fetcher Architecture

The ingestion pipeline dispatches across 8 fetcher types (the _get_fetcher factory in ingest.py), selected per regulation via the fetcher field in corpus_config.py:

| Fetcher | Source Type | Used By |
| --- | --- | --- |
| eurlex | EUR-Lex CELLAR REST API (XHTML, no auth) | All 27 EU regulations (default) |
| riigi_teataja | Estonian official gazette HTML | EE-AML |
| finlex | Finnish Finlex gazette | (available; FI-AML config currently uses pdf) |
| bwb | Dutch wetten.overheid.nl (Basis Wetten Bestand) | NL-Wwft |
| retsinformation | Danish Retsinformation gazette | (available; DK-AML config currently uses pdf) |
| pdf | PDF download + pypdf text extraction | FI-AML, DK-AML, BE-AML, CZ-AML, RO-AML, RO-AML-OUG |
| html | Direct HTML scraping from official sites | DE-GwG |
| legifrance | French Légifrance | FR-CMF |

All fetchers produce the same FetchedRegulation output (raw content + SHA-256 hash), so the downstream parser, chunker, embedder, and indexer stages are source-agnostic.
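
The shared output contract could be sketched as follows. The text only states that FetchedRegulation carries raw content and a SHA-256 hash; the field names, the fetched_at timestamp, and the make_fetched helper are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FetchedRegulation:
    """Source-agnostic fetcher output (field names are assumed)."""
    regulation_id: str
    raw_content: bytes
    content_hash: str     # hex-encoded SHA-256 of raw_content
    fetched_at: datetime  # assumed field; useful for audit trails


def make_fetched(regulation_id: str, raw_content: bytes) -> FetchedRegulation:
    """Wrap raw bytes from any fetcher in the common output type."""
    return FetchedRegulation(
        regulation_id=regulation_id,
        raw_content=raw_content,
        content_hash=hashlib.sha256(raw_content).hexdigest(),
        fetched_at=datetime.now(timezone.utc),
    )


def has_changed(previous: FetchedRegulation, current: FetchedRegulation) -> bool:
    """Compare hashes to decide whether downstream stages need to re-run."""
    return previous.content_hash != current.content_hash
```

Because the hash is computed over the raw bytes, a later fetch can be compared with an earlier one without parsing either document.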

Context-Prefixed Chunking

Every chunk includes its structural context as a prefix, so the embedding captures both content and position:

[AMLR | EU | TITLE III > CHAPTER 2 > Section 1 > Art. 28 | Enhanced CDD]
1. Member States shall ensure that obliged entities apply enhanced
customer due diligence measures in the cases referred to in Article 27...
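
The prefix format shown above can be produced with a small helper. This is a sketch only: the function name build_prefix and its parameters are hypothetical, and the separator conventions are inferred from the single example.

```python
def build_prefix(regulation: str, jurisdiction: str,
                 hierarchy: list[str], title: str) -> str:
    """Render the bracketed context line placed before a chunk's text."""
    path = " > ".join(hierarchy)
    return f"[{regulation} | {jurisdiction} | {path} | {title}]"


def prefixed_chunk(prefix: str, body: str) -> str:
    """Join the context line and the chunk body into the text that gets embedded."""
    return f"{prefix}\n{body}"
```

For example, build_prefix("AMLR", "EU", ["TITLE III", "CHAPTER 2", "Section 1", "Art. 28"], "Enhanced CDD") reproduces the first line of the chunk above.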

Chunking rules:

  1. Primary split at article boundaries (never across articles)
  2. Oversized articles (>1500 tokens): split at paragraph boundaries
  3. Oversized paragraphs: split at sub-point boundaries ((a), (b), (c))
  4. Paragraph-level overlap for cross-reference continuity
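
The four rules can be sketched for a single article as follows. Several simplifications are assumptions, not the Lex implementation: whitespace-separated words stand in for real tokens, sub-points are detected with a simple regular expression, and the overlap carries forward one unit (a paragraph, or a sub-point when a paragraph had to be split).

```python
import re

# Zero-width split point in front of "(a) ", "(b) ", ... (assumed marker style).
SUBPOINT = re.compile(r"(?=\([a-z]\)\s)")


def count_tokens(text: str) -> int:
    """Whitespace word count as a stand-in for the real tokenizer."""
    return len(text.split())


def split_units(paragraphs, max_tokens):
    """Rule 3: break any oversized paragraph at its sub-point boundaries."""
    units = []
    for para in paragraphs:
        if count_tokens(para) <= max_tokens:
            units.append(para)
        else:
            units.extend(p.strip() for p in SUBPOINT.split(para) if p.strip())
    return units


def chunk_article(paragraphs, max_tokens=1500):
    """Chunk one article. Callers pass one article at a time (rule 1)."""
    if sum(count_tokens(p) for p in paragraphs) <= max_tokens:
        return ["\n".join(paragraphs)]           # article fits in one chunk
    chunks, current = [], []
    for unit in split_units(paragraphs, max_tokens):  # rules 2 and 3
        if current and count_tokens("\n".join(current + [unit])) > max_tokens:
            chunks.append("\n".join(current))
            overlap = current[-1]                # rule 4: repeat the last unit
            fits = count_tokens(overlap) + count_tokens(unit) <= max_tokens
            current = [overlap] if fits else []  # skip overlap if it cannot fit
        current.append(unit)
    chunks.append("\n".join(current))
    return chunks
```

A sub-point that is itself larger than the limit is emitted as an oversized chunk in this sketch; a production chunker would need a further fallback.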

Zero-Hallucination Citation Verification

The CitationVerifier is deterministic and uses no LLM. Every citation passes through 4 checks:

| Check | What It Validates | Failure Mode |
| --- | --- | --- |
| Article exists | Cited article number exists in corpus | Citation rejected |
| Regulation exists | Cited regulation is in scope | Citation rejected |
| Quote accuracy | Quoted text is verbatim substring (>95% SequenceMatcher) | Quote flagged |
| Hierarchy accuracy | Cited hierarchy path matches corpus | Path corrected |

This satisfies SC-5: zero hallucinated article references on the evaluation set.
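
The four checks can be sketched with the standard library alone. This is an illustration under stated assumptions, not the CitationVerifier itself: the Citation and Verdict types, the nested-dict corpus layout, and the sliding-window interpretation of the 95% SequenceMatcher threshold are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Citation:
    regulation: str
    article: str
    quote: str
    hierarchy: str


@dataclass
class Verdict:
    accepted: bool
    quote_flagged: bool = False
    hierarchy: str = ""   # corpus path, replacing whatever was cited
    reason: str = ""


def quote_similarity(quote: str, text: str) -> float:
    """1.0 for a verbatim substring, else the best ratio over same-length windows."""
    if quote in text:
        return 1.0
    width = len(quote)
    best = 0.0
    for start in range(max(1, len(text) - width + 1)):
        window = text[start:start + width]
        best = max(best, SequenceMatcher(None, quote, window, autojunk=False).ratio())
    return best


def verify(citation: Citation, corpus: dict, threshold: float = 0.95) -> Verdict:
    """corpus maps regulation -> article -> (hierarchy_path, article_text)."""
    articles = corpus.get(citation.regulation)
    if articles is None:
        return Verdict(False, reason="regulation not in scope")
    entry = articles.get(citation.article)
    if entry is None:
        return Verdict(False, reason="article not in corpus")
    hierarchy, text = entry
    flagged = quote_similarity(citation.quote, text) <= threshold
    return Verdict(True, quote_flagged=flagged, hierarchy=hierarchy)
```

Every step is a dictionary lookup or a string comparison, so the same input always yields the same verdict.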

Data Model

Tenancy model: Corpus tables (lex_regulations, lex_articles, lex_chunks, lex_article_references) are shared — EU regulations are universal truth. Integration tables (lex_radar_links, lex_ingestion_log) are tenant-scoped with RLS.

Copilot Integration — Citation Cards

┌────────────────────────────────────────────────────────┐
│ AMLR Article 28(1)                          ✓ verified │
│                                                        │
│ "Member States shall ensure that obliged entities      │
│ apply enhanced customer due diligence measures [...]   │
│ including identifying the source of funds and source   │
│ of wealth of the customer and of the beneficial        │
│ owner"                                                 │
│                                                        │
│ TITLE III > CHAPTER 2 > Section 1 — Enhanced CDD       │
└────────────────────────────────────────────────────────┘

Each card shows: regulation + article number, verification badge, verbatim quoted text, hierarchy path, and a link to open the full article in the side panel.

Compliance Tab — Two-Level Progressive Disclosure

Level 1 — Inline Expansion: Each gap item expands to show verbatim requirement, applicability reasoning, evidence guidance, verification badge.

Level 2 — Side Panel: Full article text with highlighted relevant paragraph, hierarchy breadcrumb, cross-references, source link (EUR-Lex, national gazette, or authority PDF), content hash, fetch timestamp.

VLAIO Alignment

The Lex ingestion pipeline (fetcher → parser → chunker) is the same infrastructure needed for VLAIO WP1's regulatory document analysis engine. Building Lex first de-risks the VLAIO project and delivers immediate product value. The hierarchy-aware parser and cross-reference extraction are prerequisites for WP2's Compliance Procedure Intermediate Representation (CPIR).

VectorStore Protocol

The vector store is pluggable — PgVectorStore today, Qdrant or Weaviate when scale demands it:

    from __future__ import annotations  # defer evaluation of the type names below

    from typing import Protocol

    # EmbeddedChunk, VectorSearchResult and VectorStoreStats are Lex types not shown here.
    class VectorStore(Protocol):
        async def upsert(self, chunks: list[EmbeddedChunk]) -> int: ...
        async def search_semantic(self, embedding, top_k, filters) -> list[VectorSearchResult]: ...
        async def search_keyword(self, query, top_k, filters) -> list[VectorSearchResult]: ...
        async def search_hybrid(self, query, embedding, top_k, filters, weights) -> list[VectorSearchResult]: ...
        async def delete_by_regulation(self, regulation_id) -> int: ...
        async def get_stats(self) -> VectorStoreStats: ...
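
Because the interface is a Protocol, any class with matching methods can stand in for PgVectorStore, which is convenient in tests. The sketch below is a toy in-memory store covering three of the six methods. The EmbeddedChunk and VectorSearchResult shapes and the dict-based filters argument are assumptions for illustration, not the Lex definitions.

```python
import asyncio
import math
from dataclasses import dataclass


@dataclass
class EmbeddedChunk:          # assumed shape
    chunk_id: str
    regulation_id: str
    text: str
    embedding: list[float]


@dataclass
class VectorSearchResult:     # assumed shape
    chunk_id: str
    score: float


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Partial, test-only stand-in for a VectorStore implementation."""

    def __init__(self):
        self._chunks = {}

    async def upsert(self, chunks):
        for chunk in chunks:
            self._chunks[chunk.chunk_id] = chunk   # same ID overwrites
        return len(chunks)

    async def search_semantic(self, embedding, top_k, filters=None):
        wanted = (filters or {}).get("regulation_id")  # assumed filter key
        scored = [
            VectorSearchResult(c.chunk_id, cosine(embedding, c.embedding))
            for c in self._chunks.values()
            if wanted is None or c.regulation_id == wanted
        ]
        scored.sort(key=lambda r: r.score, reverse=True)
        return scored[:top_k]

    async def delete_by_regulation(self, regulation_id):
        doomed = [k for k, c in self._chunks.items() if c.regulation_id == regulation_id]
        for key in doomed:
            del self._chunks[key]
        return len(doomed)
```

The methods are coroutines, so callers await them or, in a script, wrap a call in asyncio.run(...).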