Investigation Confidence Scoring (Pillar 1)

Quantified certainty for every compliance investigation — replacing binary pass/fail with a 4-dimension confidence framework.

Business Value

Compliance officers need to understand not just what the AI found, but how certain it is. Confidence Scoring provides a 0-100 score decomposed into four independently measurable dimensions, enabling evidence-based decision-making.

Architecture

Confidence Dimensions

| Dimension | Range | Measures |
| --- | --- | --- |
| Evidence Completeness | 0-25 | Coverage of required document categories |
| Source Diversity | 0-25 | Number and variety of independent sources |
| Consistency | 0-25 | Agreement between sources on key facts |
| Historical Calibration | 0-25 | Accuracy of similar past predictions |
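
As a rough illustration of how four 0-25 dimensions can compose into a 0-100 score, the sketch below sums them directly. Treat it as an assumption: the source does not state the aggregation rule, and `DimensionScores` is a hypothetical container, not the real `ConfidenceScore` model.

```python
from dataclasses import dataclass

DIMENSION_MAX = 25  # each dimension is scored 0-25 per the table above


@dataclass(frozen=True)
class DimensionScores:
    """Hypothetical container; the real ConfidenceScore model may differ."""

    evidence_completeness: float
    source_diversity: float
    consistency: float
    historical_calibration: float

    def __post_init__(self) -> None:
        # Reject any dimension outside its documented 0-25 range.
        for name, value in vars(self).items():
            if not 0 <= value <= DIMENSION_MAX:
                raise ValueError(f"{name} must be in 0-{DIMENSION_MAX}, got {value}")

    def total(self) -> float:
        """Overall score, assuming a plain sum of the four dimensions."""
        return (
            self.evidence_completeness
            + self.source_diversity
            + self.consistency
            + self.historical_calibration
        )
```

For example, `DimensionScores(20, 15, 25, 10).total()` gives 70 under this assumed rule.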

Confidence Levels

| Level | Score Range | Action |
| --- | --- | --- |
| HIGH | 85-100 | Automated approval eligible |
| MEDIUM | 65-84 | Standard review |
| LOW | 40-64 | Enhanced review recommended |
| INSUFFICIENT | 0-39 | Additional investigation required |
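
The score-to-level mapping in the table can be sketched as a simple threshold ladder. This is a minimal stand-in that reuses the names `ConfidenceLevel` and `level_from_score` for readability; it is not the `trustrelay_models.confidence` implementation, and the enum shape and range check are assumptions.

```python
from enum import Enum


class ConfidenceLevel(str, Enum):
    """Stand-in enum; the real model is a shared Pydantic type."""

    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"
    INSUFFICIENT = "INSUFFICIENT"


def level_from_score(score: float) -> ConfidenceLevel:
    """Map a 0-100 score to a level using the 85/65/40 thresholds."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be in 0-100, got {score}")
    if score >= 85:
        return ConfidenceLevel.HIGH
    if score >= 65:
        return ConfidenceLevel.MEDIUM
    if score >= 40:
        return ConfidenceLevel.LOW
    return ConfidenceLevel.INSUFFICIENT
```

Checking from the highest threshold downward keeps each boundary inclusive on its lower edge, so a score of exactly 85 is HIGH and exactly 40 is LOW.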

Workflow Integration

Confidence scoring is invoked through the _compute_and_store_confidence() helper method, which was extracted from the workflow's run method during the codebase hardening sweep (change I6). This helper is shared between the KYC and KYB investigation paths — both call it after their respective investigation activities complete.

    # Shared for both KYC and KYB paths
    await self._compute_and_store_confidence(
        input, investigation_result, retry_policy
    )

The helper:

  1. Checks the confidence-score-v1 version gate (scoring is skipped for workflow histories that predate the gate)
  2. Calls the compute_confidence_score activity with a 30-second timeout
  3. Appends the result to self._state.confidence_scores
  4. Logs a confidence_computed audit event
  5. Swallows all exceptions (confidence scoring is best-effort — a scoring failure never blocks case progression)
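
The five steps above can be sketched with plain `asyncio`. This is an illustration only: it does not use the Temporal SDK, and the function name, parameters, and the list-based state and audit log are hypothetical stand-ins for the workflow's real state and activity call.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger("confidence")


async def compute_and_store_confidence(
    state_scores: list,
    audit_log: list,
    score_activity: Callable[[], Awaitable[Any]],
    version_gate_open: bool,
    timeout_seconds: float = 30.0,
) -> None:
    """Best-effort scoring: a failure is logged, never raised to the caller."""
    # Step 1: skip entirely when the version gate is closed (old histories).
    if not version_gate_open:
        return
    try:
        # Step 2: run the scoring call under a timeout (30 s by default).
        result = await asyncio.wait_for(score_activity(), timeout=timeout_seconds)
        # Step 3: append the result to the stored scores.
        state_scores.append(result)
        # Step 4: record the audit event.
        audit_log.append({"event": "confidence_computed", "score": result})
    except Exception:
        # Step 5: swallow the failure so case progression continues.
        logger.warning("confidence scoring failed; continuing", exc_info=True)
```

Catching `Exception` rather than `BaseException` leaves task cancellation untouched, which is usually what a best-effort step wants.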

Prior to I6, confidence scoring was duplicated inline in both the KYC and KYB branches. Extracting it to _compute_and_store_confidence() eliminates the duplication and ensures both paths always score using identical logic.

Key Components

  • confidence_engine.py — Core scoring engine with dimensional computation. The ConfidenceScore/ConfidenceLevel Pydantic models live in the shared trustrelay_models.confidence package (ADR-0037); level_from_score() applies the 85/65/40 thresholds.
  • calibration_service.py — Feedback loop: officer decisions are recorded via record_data_point() and surfaced via get_calibration_stats(), feeding the Historical Calibration dimension.
  • quality_scorer.py — LLM-as-judge quality scoring used alongside the deterministic confidence engine.
  • ConfidenceScoreCard.tsx — Visual breakdown in case detail view
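
To make the calibration feedback loop concrete, here is a minimal in-memory sketch. Only the method names `record_data_point()` and `get_calibration_stats()` come from the component list above; the class name, the argument shapes, and the per-level agreement statistic are assumptions for illustration.

```python
from collections import defaultdict


class InMemoryCalibration:
    """Hypothetical stand-in for CalibrationService; signatures are assumed."""

    def __init__(self) -> None:
        # Maps a predicted confidence level to officer outcomes (True = agreed).
        self._outcomes = defaultdict(list)

    def record_data_point(self, predicted_level: str, officer_agreed: bool) -> None:
        """Store whether the officer's decision matched the prediction."""
        self._outcomes[predicted_level].append(officer_agreed)

    def get_calibration_stats(self) -> dict:
        """Fraction of past predictions at each level that officers confirmed."""
        return {
            level: sum(outcomes) / len(outcomes)
            for level, outcomes in self._outcomes.items()
        }
```

A scoring engine could then read these fractions when computing the Historical Calibration dimension; how the real engine weights them is not described here.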

API Endpoint

The confidence router is mounted under /api/cases (app/api/confidence.py):

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/cases/{workflow_id}/confidence | Get the latest 4-dimension confidence breakdown for a case |

Calibration is not exposed as a standalone REST surface: officer decisions feed CalibrationService.record_data_point() internally from the decision flow, and get_calibration_stats() supplies the Historical Calibration dimension at scoring time.
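
A client call to the GET endpoint might look like the sketch below, using only the standard library. The base URL, the helper names, and the absence of authentication headers are assumptions; a real deployment will likely require credentials, and the response fields are not specified here, so the parsed JSON is returned as-is.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def confidence_url(base_url: str, workflow_id: str) -> str:
    """Build the endpoint URL, percent-encoding the workflow ID."""
    return f"{base_url.rstrip('/')}/api/cases/{quote(workflow_id, safe='')}/confidence"


def fetch_confidence(base_url: str, workflow_id: str, timeout: float = 10.0) -> dict:
    """GET the latest confidence breakdown and return the parsed JSON body."""
    with urlopen(confidence_url(base_url, workflow_id), timeout=timeout) as response:
        return json.load(response)
```

Encoding the workflow ID with `safe=''` keeps an ID that contains a slash from being read as an extra path segment.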

Configuration

  • The confidence score is computed by the compute_confidence_score Temporal activity and is best-effort: a scoring failure never blocks case progression. There is no dedicated confidence_scoring_enabled feature flag — scoring runs as part of the workflow, gated by the confidence-score-v1 workflow version guard.
  • Alembic migration: 006_calibration_data