
PII Classification & Encryption

Field-level PII classification, AES-256-GCM encryption at rest, HMAC search hashes, and GDPR data subject request handling -- making Trust Relay's data model self-describing for compliance auditors and bank security questionnaires.

Business Value

Regulated customers (banks, fintechs) ask "where is PII stored and how is it protected?" in every security questionnaire. Trust Relay answers this with a machine-generated pii_manifest.json -- a GDPR Art. 30 artifact listing every PII field, its classification, encryption status, retention policy, and legal basis. The PII classification system also enables automated GDPR data subject requests (access, erasure, rectification) with proper AML retention handling.

Architecture

Three-layer design: classification metadata on ORM models, transparent column encryption via SQLAlchemy TypeDecorator, and GDPR DSR endpoints backed by a person resolution index.

PII Categories

Six classification levels, three requiring encryption:

| Category | Encryption | Examples |
| --- | --- | --- |
| DIRECT_IDENTIFIER | AES-256-GCM | National ID, passport number, document number |
| FINANCIAL | AES-256-GCM | IBAN, account number |
| CONTACT | AES-256-GCM | Email, phone |
| QUASI_IDENTIFIER | RLS only | Name, DOB, address, company name |
| SENSITIVE | RLS only | PEP status, sanctions hits |
| DOCUMENT | RLS only | Uploaded ID documents (MinIO) |

Categories are declared as info={"pii": PIIField(category=PIICategory.X)} on SQLAlchemy columns. The PIIRegistry scans all ORM models at runtime and builds an in-memory index of classified fields. As of 2026-06-11 the codebase carries 20 PII-classified fields across 5 tables (users, cases, person_verifications, investigation_persons, investigation_accounts).
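
The classification scheme can be sketched without SQLAlchemy. This is a minimal illustration, not the contents of categories.py or registry.py: the enum values, the `requires_encryption` property, the `build_index` helper, and the example column names (`users.email`, `investigation_accounts.iban`) are assumptions. Plain dictionaries stand in for each column's `info` mapping.

```python
from dataclasses import dataclass
from enum import Enum


class PIICategory(Enum):
    DIRECT_IDENTIFIER = "direct_identifier"
    FINANCIAL = "financial"
    CONTACT = "contact"
    QUASI_IDENTIFIER = "quasi_identifier"
    SENSITIVE = "sensitive"
    DOCUMENT = "document"


# The three categories the table lists as AES-256-GCM encrypted.
ENCRYPTED_CATEGORIES = frozenset(
    {PIICategory.DIRECT_IDENTIFIER, PIICategory.FINANCIAL, PIICategory.CONTACT}
)


@dataclass(frozen=True)
class PIIField:
    category: PIICategory

    @property
    def requires_encryption(self) -> bool:
        return self.category in ENCRYPTED_CATEGORIES


def build_index(models: dict) -> dict:
    """Collect classified fields from {table: {column: info}} mappings."""
    index = {}
    for table, columns in models.items():
        for column, info in columns.items():
            field = info.get("pii")
            if isinstance(field, PIIField):
                index[(table, column)] = field
    return index


# Hypothetical columns; each inner dict plays the role of Column.info.
models = {
    "users": {
        "email": {"pii": PIIField(category=PIICategory.CONTACT)},
        "created_at": {},  # unclassified columns are skipped
    },
    "investigation_accounts": {
        "iban": {"pii": PIIField(category=PIICategory.FINANCIAL)},
    },
}
index = build_index(models)
```

Keeping the classification next to the column definition means a new PII column cannot be added without the index (and anything generated from it) picking it up on the next scan.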

:::warning RLS is not active in the shipped configuration
The categories marked "RLS only" above rely on PostgreSQL Row Level Security for tenant isolation. RLS is currently inert because the application connects as the bootstrap PostgreSQL superuser (temporal, the database_url default in app/config.py), and superusers bypass even FORCE ROW LEVEL SECURITY. The non-superuser application role exists in scripts/create_app_role.sql (trustrelay_app) but is not wired into the runtime DATABASE_URL. Until the app connects as trustrelay_app, "RLS only" fields have no enforced at-rest isolation. See Security.
:::

Encryption Architecture

EncryptedText TypeDecorator

Transparent AES-256-GCM encryption at the column level. Application code reads and writes plaintext strings -- encryption happens inside SQLAlchemy's type system.

Wire format: key_id (4B) || iv (12B) || ciphertext || tag (16B)

  • key_id enables key rotation without re-encrypting existing data
  • Random IV per encryption ensures identical values produce different ciphertext
  • GCM mode provides authenticated encryption (integrity + confidentiality)
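
The layout of the wire format can be illustrated with plain byte slicing. This sketch only packs and parses the envelope; it performs no encryption, and the function names and the 4-byte ASCII key id `b"k001"` are assumptions rather than what encryption.py does.

```python
import os

KEY_ID_LEN = 4   # key_id (4B)
IV_LEN = 12      # iv (12B)
TAG_LEN = 16     # GCM tag (16B)


def pack_envelope(key_id: bytes, iv: bytes, ciphertext: bytes, tag: bytes) -> bytes:
    """Concatenate key_id || iv || ciphertext || tag."""
    if len(key_id) != KEY_ID_LEN or len(iv) != IV_LEN or len(tag) != TAG_LEN:
        raise ValueError("unexpected field length")
    return key_id + iv + ciphertext + tag


def unpack_envelope(blob: bytes) -> tuple[bytes, bytes, bytes, bytes]:
    """Split a stored value back into its four parts."""
    if len(blob) < KEY_ID_LEN + IV_LEN + TAG_LEN:
        raise ValueError("envelope too short")
    key_id = blob[:KEY_ID_LEN]
    iv = blob[KEY_ID_LEN:KEY_ID_LEN + IV_LEN]
    ciphertext = blob[KEY_ID_LEN + IV_LEN:-TAG_LEN]
    tag = blob[-TAG_LEN:]
    return key_id, iv, ciphertext, tag


# Placeholder ciphertext and tag; a real value would come from AES-256-GCM.
iv = os.urandom(IV_LEN)
blob = pack_envelope(b"k001", iv, b"\x01\x02\x03", b"\x00" * TAG_LEN)
key_id, parsed_iv, ciphertext, tag = unpack_envelope(blob)
```

Because the key id sits at a fixed offset, a reader can pick the right decryption key before touching the ciphertext, which is what allows old rows to stay readable after a rotation.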

KeyProvider Interface

from abc import ABC

class KeyProvider(ABC):
    def get_current_key(self) -> tuple[str, bytes]: ...  # For new encryptions
    def get_key_by_id(self, key_id: str) -> bytes: ...  # For decrypting existing data

  • EnvKeyProvider -- reads PII_ENCRYPTION_KEY env var (hex-encoded 32-byte AES-256 key, validated at construction). PoC/staging.
  • NullKeyProvider -- returns empty key material for the disabled/bypass path

A KMS-backed provider (AWS KMS / HashiCorp Vault) for production key rotation is anticipated but not yet implemented -- only EnvKeyProvider and NullKeyProvider exist in key_providers.py today. EncryptedText currently instantiates EnvKeyProvider directly rather than selecting a provider through the KeyProvider ABC.
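
As an illustration of the validation described for EnvKeyProvider, the following sketch reads a hex-encoded key and rejects anything that does not decode to 32 bytes. It is not the code in key_providers.py: the fixed key id `"env1"`, the injectable `env` mapping, the error messages, and the use of `@abstractmethod` are assumptions.

```python
import os
from abc import ABC, abstractmethod


class KeyProvider(ABC):
    @abstractmethod
    def get_current_key(self) -> tuple[str, bytes]: ...

    @abstractmethod
    def get_key_by_id(self, key_id: str) -> bytes: ...


class EnvKeyProvider(KeyProvider):
    KEY_ID = "env1"  # hypothetical id; the real value is not documented here

    def __init__(self, env=None):
        # Accept a mapping so the provider can be exercised without touching os.environ.
        env = os.environ if env is None else env
        raw = env.get("PII_ENCRYPTION_KEY", "")
        try:
            key = bytes.fromhex(raw)
        except ValueError as exc:
            raise ValueError("PII_ENCRYPTION_KEY must be hex-encoded") from exc
        if len(key) != 32:
            raise ValueError("PII_ENCRYPTION_KEY must decode to 32 bytes (AES-256)")
        self._key = key

    def get_current_key(self) -> tuple[str, bytes]:
        return self.KEY_ID, self._key

    def get_key_by_id(self, key_id: str) -> bytes:
        if key_id != self.KEY_ID:
            raise KeyError(key_id)
        return self._key


provider = EnvKeyProvider({"PII_ENCRYPTION_KEY": "00" * 32})  # all-zero demo key
current_id, current_key = provider.get_current_key()
```

Validating at construction means a missing or truncated key fails at startup instead of on the first write to an encrypted column.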

HMAC Search Hashes

Encrypted columns can't be used in WHERE clauses. For fields that need equality lookups (email, IBAN), a parallel _hash column stores HMAC-SHA256(normalized_value, pepper):

  • Deterministic: same value always produces the same hash
  • One-way: hash can't be reversed to recover the plaintext
  • Pepper: separate secret from the encryption key (PII_ENCRYPTION_PEPPER)
  • Normalized: lowercase, trimmed before hashing
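
The properties listed above can be sketched with the standard library. The function names are illustrative rather than those in hashing.py, and using the pepper as the HMAC key with the normalized value as the message is an assumption about the argument order.

```python
import hashlib
import hmac


def normalize(value: str) -> str:
    """Trim surrounding whitespace and lowercase before hashing."""
    return value.strip().lower()


def search_hash(value: str, pepper: bytes) -> str:
    """HMAC-SHA256 of the normalized value, keyed with the pepper."""
    message = normalize(value).encode("utf-8")
    return hmac.new(pepper, message, hashlib.sha256).hexdigest()


# Demo pepper only; in the described setup it would come from PII_ENCRYPTION_PEPPER.
pepper = b"example-pepper-not-a-real-secret"
stored = search_hash("  Alice@Example.com ", pepper)
lookup = search_hash("alice@example.com", pepper)
```

An equality lookup then compares the hash of the search term against the `_hash` column, so the encrypted column itself is never scanned or decrypted.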

JSONB PII Encryption

JSONB columns containing PII arrays (identification, phones, emails) can't use EncryptedText because the PII is nested inside JSON structures. The jsonb_crypto service encrypts specific fields within each JSON object:

| JSONB Field | PII Key Encrypted | Non-PII Preserved |
| --- | --- | --- |
| emails | Each email string | -- |
| phones | number key | phone_type, country_prefix |
| identification | document_number key | document_type, issuing_country, dates |
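
A minimal sketch of the per-key approach follows, assuming a hypothetical `encrypt_jsonb_field` helper and an injected `encrypt` callable. The placeholder below is base64, which is not encryption; it only makes the transformation visible. The real jsonb_crypto service may be structured differently.

```python
import base64
from typing import Callable

# Which key holds the PII inside each object, per the mapping described above.
PII_KEYS = {"phones": "number", "identification": "document_number"}


def encrypt_jsonb_field(field_name: str, items: list, encrypt: Callable[[str], str]) -> list:
    """Return a copy of a JSONB array with only its PII values encrypted."""
    if field_name == "emails":
        # emails is a flat array of strings: every element is PII.
        return [encrypt(item) for item in items]
    pii_key = PII_KEYS[field_name]
    result = []
    for item in items:
        copy = dict(item)  # leave the caller's objects untouched
        if copy.get(pii_key) is not None:
            copy[pii_key] = encrypt(copy[pii_key])
        result.append(copy)
    return result


def placeholder_encrypt(value: str) -> str:
    # Stand-in only: base64 is NOT encryption.
    return "enc:" + base64.b64encode(value.encode("utf-8")).decode("ascii")


phones = [{"number": "+3212345678", "phone_type": "mobile", "country_prefix": "+32"}]
protected = encrypt_jsonb_field("phones", phones, placeholder_encrypt)
emails = encrypt_jsonb_field("emails", ["a@example.com"], placeholder_encrypt)
```

Leaving keys such as `phone_type` in plaintext keeps them usable in JSONB queries while the identifying value itself is protected.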

Dev/Test Bypass

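When pii_encryption_enabled=False (the default), EncryptedText stores plain UTF-8 bytes, so development and test environments carry no encryption overhead. In production, set PII_ENCRYPTION_ENABLED=true and provide the key material (PII_ENCRYPTION_KEY and PII_ENCRYPTION_PEPPER).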

PII Manifest

docs/pii_manifest.json is the GDPR Art. 30 artifact -- a machine-generated inventory of all PII fields:

cd backend && python -m app.pii.manifest

Outputs:

  • Summary: total PII fields, encrypted fields, tables with PII
  • Per-table field inventory with category, encryption status, retention, legal basis
  • DSR scope: which tables, JSONB paths, MinIO prefixes, and Neo4j nodes contain person data

This file is committed to the repository and serves as the compliance artifact for security questionnaires, ISO 27001 audits, and GDPR Art. 30 records of processing.
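
To make the shape of such an inventory concrete, the sketch below derives the three summary counts and a per-table listing from a flat list of classified fields. The JSON key names, the `build_manifest` helper, and the sample fields are hypothetical; the actual schema of pii_manifest.json is not reproduced here, and retention and legal-basis attributes are omitted for brevity.

```python
import json
from collections import defaultdict

# Hypothetical sample input; real entries would come from the PII registry.
fields = [
    {"table": "users", "column": "email", "category": "CONTACT", "encrypted": True},
    {"table": "users", "column": "full_name", "category": "QUASI_IDENTIFIER", "encrypted": False},
    {"table": "investigation_accounts", "column": "iban", "category": "FINANCIAL", "encrypted": True},
]


def build_manifest(fields: list) -> dict:
    """Group fields by table and compute the summary counts."""
    tables = defaultdict(list)
    for field in fields:
        entry = {k: v for k, v in field.items() if k != "table"}
        tables[field["table"]].append(entry)
    return {
        "summary": {
            "total_pii_fields": len(fields),
            "encrypted_fields": sum(1 for f in fields if f["encrypted"]),
            "tables_with_pii": len(tables),
        },
        "tables": dict(tables),
    }


manifest = build_manifest(fields)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

Sorting keys on output keeps the committed file stable between runs, so a diff shows only genuine changes to the PII inventory.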

GDPR Data Subject Requests

Three endpoints handle natural person rights under GDPR. All three require the compliance_manager or super_admin role (require_role(...) in app/api/data_subject.py) and run inside a get_tenant_session(user.tenant_id) scope, so a request only resolves people within the caller's tenant:

Subject Access (Art. 15)

POST /api/data-subject/access
Body: { "identifier_type": "name_dob", "first_name": "...", "last_name": "...", "date_of_birth": "..." }

Returns all PII records for the person across all cases, with retention information per case.

Erasure (Art. 17)

POST /api/data-subject/erase
Body: { "identifier_type": "name_dob", ..., "reason": "..." }

Applies AML-aware erasure rules per case:

| Case Status | Action | Legal Basis |
| --- | --- | --- |
| Active | Refuse | GDPR Art. 17(3)(b) -- legal obligation |
| Closed < 5 years | Anonymize -- [REDACTED-{dsr_id}] | AML 5-year retention |
| Closed > 5 years | Full delete | Retention expired |

Anonymization preserves case structure for AML audits while removing personal data.
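
The per-case decision can be sketched as a small pure function. This is an illustration under stated assumptions, not the logic in dsr_service.py: a case is treated as active when it has no closure date, the retention period is treated as expired once five calendar years have fully elapsed (the rules above do not say how a case closed exactly five years ago is handled), and the names `ErasureAction`, `decide_erasure`, and `redaction_marker` are hypothetical.

```python
from datetime import date
from enum import Enum
from typing import Optional

RETENTION_YEARS = 5


class ErasureAction(Enum):
    REFUSE = "refuse"
    ANONYMIZE = "anonymize"
    DELETE = "delete"


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def decide_erasure(closed_on: Optional[date], today: date) -> ErasureAction:
    if closed_on is None:
        return ErasureAction.REFUSE       # active case: legal obligation applies
    if today < add_years(closed_on, RETENTION_YEARS):
        return ErasureAction.ANONYMIZE    # still inside the AML retention window
    return ErasureAction.DELETE           # retention expired


def redaction_marker(dsr_id: str) -> str:
    """Placeholder written over personal data when a case is anonymized."""
    return f"[REDACTED-{dsr_id}]"


today = date(2026, 6, 11)
active = decide_erasure(None, today)
recent = decide_erasure(date(2023, 3, 1), today)
expired = decide_erasure(date(2020, 1, 15), today)
```

Because one person can appear in several cases, a single erasure request may produce a mix of these outcomes, one per case.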

Rectification (Art. 16)

POST /api/data-subject/rectify
Body: { "identifier_type": "name_dob", ..., "corrections": {"first_name": "corrected"} }

Corrects allowed PII fields across all appearances of the person.

Person Resolution Index

The person_data_index table maps HMAC person hashes to data locations across tables. This enables cross-case person lookup without storing PII in the index itself. A person appearing as a director in 3 different company investigations can be found via a single hash lookup.
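
The lookup pattern can be sketched with an in-memory dictionary standing in for the person_data_index table. The normalization, the `|` separator, the class and method names, and the row ids are assumptions; PersonIndexService may derive its hashes differently.

```python
import hashlib
import hmac
from collections import defaultdict


def person_hash(first_name: str, last_name: str, date_of_birth: str, pepper: bytes) -> str:
    """HMAC-SHA256 over the normalized name and date of birth."""
    parts = (first_name, last_name, date_of_birth)
    normalized = "|".join(part.strip().lower() for part in parts)
    return hmac.new(pepper, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


class PersonIndex:
    """Maps person hashes to (table, row_id) locations; stores no PII itself."""

    def __init__(self, pepper: bytes):
        self._pepper = pepper
        self._locations = defaultdict(list)

    def register(self, first_name, last_name, date_of_birth, table, row_id):
        key = person_hash(first_name, last_name, date_of_birth, self._pepper)
        self._locations[key].append((table, row_id))

    def lookup(self, first_name, last_name, date_of_birth):
        key = person_hash(first_name, last_name, date_of_birth, self._pepper)
        return list(self._locations.get(key, []))


index = PersonIndex(pepper=b"example-pepper-not-a-real-secret")
# The same director recorded in three investigations (hypothetical row ids).
for row_id in ("ip-1", "ip-2", "ip-3"):
    index.register("Jane", "Doe", "1980-04-02", "investigation_persons", row_id)
hits = index.lookup(" jane ", "DOE", "1980-04-02")
```

A data subject request then starts from one hash computation and fans out to the listed locations, instead of scanning every table for a matching name.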

Audit Trail

Every DSR operation creates an audit_event with type DSR_ACCESS, DSR_ERASURE, or DSR_RECTIFICATION. These events are exempt from erasure (GDPR Art. 17(3)(e) -- establishment of legal claims).

Retention Policies

| Context | Duration | Legal Basis |
| --- | --- | --- |
| AML/KYB records | 5 years from case closure | 6AMLD Art. 40 |
| Audit events | 5 years | 6AMLD + EU AI Act Art. 12 |
| Person data (no active case) | 1 year after last case | GDPR Art. 5(1)(e) |
| DSR audit events | Exempt from erasure | GDPR Art. 17(3)(e) |

File Structure

backend/app/pii/
  categories.py      # PIICategory enum, PIIField dataclass
  encryption.py      # EncryptedText TypeDecorator, AES-256-GCM
  key_providers.py   # KeyProvider ABC, EnvKeyProvider, NullKeyProvider
  hashing.py         # HMAC-SHA256 search hash helpers
  jsonb_crypto.py    # Encrypt/decrypt PII inside JSONB arrays
  registry.py        # PIIRegistry -- scans models, builds field index
  manifest.py        # CLI: generates docs/pii_manifest.json
  person_index.py    # PersonIndexService
  dsr_service.py     # GDPR access/erase/rectify logic
  backfill.py        # Encrypt existing plaintext data

backend/app/api/
  data_subject.py    # 3 GDPR DSR endpoints

ADR

ADR-0036: PII Classification Architecture (docs/adr/ADR-0036-pii-classification-architecture.md)