Lex — Regulatory Knowledge Layer
Lex gives Atlas the actual text of 40 regulations — 27 EU regulations plus 13 national AML laws across 12 jurisdictions. When an officer sees "AMLR Art.28 CDD Coverage — 85%", they can ask the copilot what Article 28 actually requires and receive an answer grounded in verbatim regulation text — with zero hallucinated citations.
The Problem
Architecture
Regulatory Corpus
The corpus contains 40 regulations — 27 EU regulations sourced from EUR-Lex CELLAR plus 13 national AML laws across 12 jurisdictions (BE, CH, CZ, DE, DK, EE, FI, FR, NL, NO, RO, SK) sourced from national official gazettes/authority websites. The corpus is defined declaratively in corpus_config.py (one RegulationConfig per regulation, each specifying its fetcher and parser).
Wave 1 — Core EU Regulations (8)
| Regulation | CELEX | Priority |
|---|---|---|
| AMLR (EU 2024/1624) — AML Regulation | 32024R1624 | Critical |
| AMLD6 (EU 2024/1640) — 6th AML Directive | 32024L1640 | Critical |
| EU AI Act (EU 2024/1689) — AI harmonised rules | 32024R1689 | Critical |
| GDPR (EU 2016/679) — Data protection | 32016R0679 | High |
| DORA (EU 2022/2554) — Digital operational resilience | 32022R2554 | Medium |
| MiCA (EU 2023/1114) — Crypto-assets | 32023R1114 | Medium |
| EU-IPR (EU 2024/886) — Instant payments | 32024R0886 | High |
| PSD2 (EU 2015/2366) — Payment services | 32015L2366 | Medium |
Wave 2 — Financial Services & Digital Infrastructure (6)
| Regulation | CELEX | Priority |
|---|---|---|
| NIS2 (EU 2022/2555) — Cybersecurity | 32022L2555 | High |
| DSA (EU 2022/2065) — Digital Services Act | 32022R2065 | Medium |
| MiFID II (2014/65/EU) — Financial instruments | 32014L0065 | Medium |
| eIDAS 2 (EU 2024/1183) — Digital identity | 32024R1183 | Medium |
| CRD IV (2013/36/EU) — Capital requirements | 32013L0036 | Medium |
| AMLA Reg (EU 2024/1620) — AMLA establishment | 32024R1620 | Critical |
Wave 3 — Sustainability, Travel Rule & Payments (6)
| Regulation | CELEX | Priority |
|---|---|---|
| TFR (EU 2023/1113) — Transfer of Funds / Travel Rule | 32023R1113 | High |
| CSDDD (EU 2024/1760) — Corporate sustainability due diligence | 32024L1760 | Medium |
| CSRD (EU 2022/2464) — Corporate sustainability reporting | 32022L2464 | Medium |
| Whistleblower (EU 2019/1937) — Whistleblower protection | 32019L1937 | Medium |
| EMD2 (2009/110/EC) — Electronic money | 32009L0110 | Medium |
| SEPA (EU 260/2012) — Credit transfers & direct debits | 32012R0260 | Medium |
Wave 4 — Fiscal Representatives & Taxation (4)
| Regulation | CELEX | Priority |
|---|---|---|
| VAT Directive (2006/112/EC) — Common VAT system | 32006L0112 | High |
| DAC (2011/16/EU) — Administrative cooperation in taxation | 32011L0016 | Medium |
| DAC7 (EU 2021/514) — Digital platform reporting | 32021L0514 | High |
| AMLD5 (EU 2018/843) — 5th AML Directive | 32018L0843 | High |
Wave 5 — Customs & Trade Compliance (3)
The Union Customs Code trilogy — critical for entities acting as importer/exporter or holding AEO status.
| Regulation | CELEX | Priority |
|---|---|---|
| UCC (EU 952/2013) — Union Customs Code | 32013R0952 | Critical |
| UCC-DA (EU 2015/2446) — UCC Delegated Act | 32015R2446 | High |
| UCC-IA (EU 2015/2447) — UCC Implementing Act | 32015R2447 | High |
Wave 6+ — National AML Regulations (13)
National transpositions of the EU AML Directives into domestic law. Each uses a jurisdiction-specific fetcher (the fetcher field in each RegulationConfig).
| Regulation | Jurisdiction | Fetcher | Source |
|---|---|---|---|
| EE-AML — Estonian AML Prevention Act | EE | riigi_teataja | Riigi Teataja |
| FI-AML — Finnish AML Act (444/2017) | FI | pdf | Fin-FSA |
| NL-Wwft — Dutch Wwft | NL | bwb | wetten.overheid.nl |
| DK-AML — Danish Hvidvaskloven | DK | pdf | Finanstilsynet |
| BE-AML — Belgian AML Law (18 Sept 2017) | BE | pdf | NBB |
| CZ-AML — Czech Act 253/2008 | CZ | pdf | FAU |
| FR-CMF — French Code monétaire et financier | FR | legifrance | Légifrance |
| DE-GwG — German Geldwäschegesetz | DE | html | Gesetze-im-Internet |
| CH-GwG — Swiss Geldwäschereigesetz | CH | pdf_fetcher | Fedlex |
| NO-AML — Norwegian Hvitvaskingsloven | NO | html_fetcher | Lovdata |
| SK-AML — Slovak AML Act 297/2008 | SK | html_fetcher | Slov-Lex |
| RO-AML — Romanian Law 129/2019 | RO | pdf | Monitorul Oficial |
| RO-AML-OUG — Romanian OUG transposition | RO | pdf | Monitorul Oficial |
Two national configs (
CH-GwG,NO-AML,SK-AML) declarepdf_fetcher/html_fetcherfetcher keys, but theingest.pydispatch table only registershtml(along witheurlex,riigi_teataja,finlex,bwb,retsinformation,legifrance). These three would raiseUnknown fetcher typeuntil the dispatch keys are aligned.
Ingestion Pipeline
Five-stage pipeline with component isolation — each stage has typed I/O contracts and can be replaced independently:
Multi-Source Fetcher Architecture
The ingestion pipeline dispatches across 8 fetcher types (the _get_fetcher factory in ingest.py), selected per regulation via the fetcher field in corpus_config.py:
| Fetcher | Source Type | Used By |
|---|---|---|
eurlex | EUR-Lex CELLAR REST API (XHTML, no auth) | All 27 EU regulations (default) |
riigi_teataja | Estonian official gazette HTML | EE-AML |
finlex | Finnish Finlex gazette | (available; FI-AML config currently uses pdf) |
bwb | Dutch wetten.overheid.nl (Basis Wetten Bestand) | NL-Wwft |
retsinformation | Danish Retsinformation gazette | (available; DK-AML config currently uses pdf) |
pdf | PDF download + pypdf text extraction | FI-AML, DK-AML, BE-AML, CZ-AML, RO-AML, RO-AML-OUG |
html | Direct HTML scraping from official sites | DE-GwG |
legifrance | French Légifrance | FR-CMF |
All fetchers produce the same FetchedRegulation output (raw content + SHA-256 hash), so the downstream parser, chunker, embedder, and indexer stages are source-agnostic.
Context-Prefixed Chunking
Every chunk includes its structural context as a prefix, so the embedding captures both content and position:
[AMLR | EU | TITLE III > CHAPTER 2 > Section 1 > Art. 28 | Enhanced CDD]
1. Member States shall ensure that obliged entities apply enhanced
customer due diligence measures in the cases referred to in Article 27...
Chunking rules:
- Primary split at article boundaries (never across articles)
- Oversized articles (>1500 tokens): split at paragraph boundaries
- Oversized paragraphs: split at sub-point boundaries ((a), (b), (c))
- Paragraph-level overlap for cross-reference continuity
Query Service — Fast-Path + Hybrid Search
Zero-Hallucination Citation Verification
The CitationVerifier is deterministic and uses no LLM. Every citation passes through 4 checks:
| Check | What It Validates | Failure Mode |
|---|---|---|
| Article exists | Cited article number exists in corpus | Citation rejected |
| Regulation exists | Cited regulation is in scope | Citation rejected |
| Quote accuracy | Quoted text is verbatim substring (>95% SequenceMatcher) | Quote flagged |
| Hierarchy accuracy | Cited hierarchy path matches corpus | Path corrected |
This satisfies SC-5: zero hallucinated article references on the evaluation set.
Data Model
Tenancy model: Corpus tables (lex_regulations, lex_articles, lex_chunks, lex_article_references) are shared — EU regulations are universal truth. Integration tables (lex_radar_links, lex_ingestion_log) are tenant-scoped with RLS.
Copilot Integration — Citation Cards
┌─────────────────────────────────────────────────────┐
│ AMLR Article 28(1) ✓ verified │
│ │
│ "Member States shall ensure that obliged entities │
│ apply enhanced customer due diligence measures [...] │
│ including identifying the source of funds and source │
│ of wealth of the customer and of the beneficial │
│ owner" │
│ │
│ TITLE III > CHAPTER 2 > Section 1 — Enhanced CDD │
└─────────────────────────────────────────────────────┘
Each card shows: regulation + article number, verification badge, verbatim quoted text, hierarchy path, and a link to open the full article in the side panel.
Compliance Tab — Two-Level Progressive Disclosure
Level 1 — Inline Expansion: Each gap item expands to show verbatim requirement, applicability reasoning, evidence guidance, verification badge.
Level 2 — Side Panel: Full article text with highlighted relevant paragraph, hierarchy breadcrumb, cross-references, source link (EUR-Lex, national gazette, or authority PDF), content hash, fetch timestamp.
VLAIO Alignment
The Lex ingestion pipeline (fetcher → parser → chunker) is the same infrastructure needed for VLAIO WP1's regulatory document analysis engine. Building Lex first de-risks the VLAIO project and delivers immediate product value. The hierarchy-aware parser and cross-reference extraction are prerequisites for WP2's Compliance Procedure Intermediate Representation (CPIR).
VectorStore Protocol
The vector store is pluggable — PgVectorStore today, Qdrant or Weaviate when scale demands it:
class VectorStore(Protocol):
async def upsert(self, chunks: list[EmbeddedChunk]) -> int: ...
async def search_semantic(self, embedding, top_k, filters) -> list[VectorSearchResult]: ...
async def search_keyword(self, query, top_k, filters) -> list[VectorSearchResult]: ...
async def search_hybrid(self, query, embedding, top_k, filters, weights) -> list[VectorSearchResult]: ...
async def delete_by_regulation(self, regulation_id) -> int: ...
async def get_stats(self) -> VectorStoreStats: ...