LLM Customer Deduplication — Muhammad Obaidullah

Live System Preview

Duplicate Comparison Engine

The AI scores each record pair from 0 to 100. Scores ≥80 trigger an automatic merge. Scores below 80 are queued for analyst review along with the LLM's reasoning.

TELUS · Veritas Deduplication Engine · Live

Record A · CRM System

John M. Williamson

4821 Oak Street NW, Calgary AB T2N 1P4

DOB: 1978-03-14 · Acct: TLS-882341

Score: 96

Record B · Billing System

John Williamson

4821 Oak St NW, Calgary, Alberta T2N1P4

DOB: 1978-03-14 · Acct: BIL-004821

Record A · CRM System

Sarah Chen

220 Burrard St, Vancouver BC V6C 3L6

DOB: 1985-07-22 · Acct: TLS-441098

Score: 71

Record B · Support Tickets

S. Chen

220 Burrard Street, Suite 400, Vancouver

DOB: unknown · Acct: SUP-220BC

Record A · CRM System

Michael Torres

88 King St W, Toronto ON M5H 1S3

DOB: 1990-11-05 · Acct: TLS-774322

Score: 23

Record B · Billing System

Miguel Torres

88 King St E, Toronto ON M5C 1G3

DOB: 1962-04-18 · Acct: BIL-088912

System Architecture

How It Works

1. Data Ingestion — BigQuery

Customer records are pulled from CRM, billing, support, and CDR (call detail record) systems via BigQuery federated queries, then normalized into a unified schema.
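To make the normalization step concrete, here is a minimal Python sketch. It assumes the unified schema stores a lowercased address with common abbreviations applied and a postal code without spaces. The abbreviation tables and the function names are hypothetical and are not taken from the production pipeline.

```python
import re
from typing import Optional

# Hypothetical abbreviation tables; a real pipeline would need fuller lists.
STREET_ABBREV = {"street": "st", "avenue": "ave", "road": "rd", "drive": "dr"}
PROVINCE_ABBREV = {"alberta": "ab", "british columbia": "bc", "ontario": "on"}

# Canadian postal code: letter-digit-letter, optional space, digit-letter-digit.
POSTAL_RE = re.compile(r"\b([A-Za-z]\d[A-Za-z])\s?(\d[A-Za-z]\d)\b")


def normalize_postal(raw_address: str) -> Optional[str]:
    """Extract the postal code as six uppercase characters, or None if absent."""
    match = POSTAL_RE.search(raw_address)
    return (match.group(1) + match.group(2)).upper() if match else None


def normalize_address(raw_address: str) -> str:
    """Lowercase, drop the postal code and punctuation, apply abbreviations."""
    text = POSTAL_RE.sub("", raw_address).lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes whitespace
    for full, short in PROVINCE_ABBREV.items():
        text = re.sub(rf"\b{full}\b", short, text)
    tokens = [STREET_ABBREV.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


def normalize_dob(raw_dob: str) -> Optional[str]:
    """Map blank or 'unknown' dates of birth to None; keep everything else."""
    value = raw_dob.strip()
    return None if value.lower() in {"", "unknown"} else value
```

With these rules, the two John Williamson addresses from the preview above reduce to the same normalized string and the same postal code, even though the raw text differs.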

2. Candidate Generation — Blocking

Fuzzy blocking on address tokens, name phonetics (Soundex/Metaphone), and postal codes cuts the quadratic comparison space (on the order of 100M × 100M record pairs) down to a manageable set of candidate pairs.
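The sketch below illustrates one way blocking of this kind could work. It assumes each record is a dict with "id", "name", and "postal_code" fields (hypothetical names), and it uses only two of the signals mentioned above: the Soundex code of the surname and the first three characters of the postal code. Records that share at least one blocking key become a candidate pair. This is an illustration of the technique, not the production blocking logic.

```python
from collections import defaultdict
from itertools import combinations

# Standard Soundex digit groups.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
SOUNDEX_CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}


def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":  # h and w do not separate repeated codes; vowels do
            prev = digit
    return (code + "000")[:4]


def blocking_keys(rec: dict) -> set:
    """Emit one key per available signal (surname phonetics, postal prefix)."""
    keys = set()
    name_tokens = rec.get("name", "").split()
    if name_tokens:
        keys.add("name:" + soundex(name_tokens[-1]))
    if rec.get("postal_code"):
        keys.add("fsa:" + rec["postal_code"][:3])
    return keys


def candidate_pairs(records: list) -> set:
    """Return id pairs for records that share at least one blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        for key in blocking_keys(rec):
            blocks[key].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(set(ids)), 2))
    return pairs
```

Applied to the six records in the preview, this yields three candidate pairs instead of the fifteen possible ones. The Torres pair is still generated because the surnames match phonetically, and it is left to the scoring step to reject it.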

3. LLM Scoring — Semantic Comparison

Each candidate pair is sent to the LLM with a structured prompt. The model returns a similarity score (0–100) and natural-language reasoning that explains the match decision.
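As a rough illustration of the structured prompt and its response handling, the sketch below builds a prompt from two records and validates a JSON reply. The prompt wording, the JSON response shape, and the names build_prompt, parse_response, and MatchResult are all assumptions made for this example. The call to the model itself is deliberately left out, since the source does not specify the client or endpoint.

```python
import json
from dataclasses import dataclass

# Hypothetical prompt template; the production wording is not given in the source.
PROMPT_TEMPLATE = """You are comparing two customer records from different systems.
Decide how likely it is that they describe the same person.

Record A: {a}
Record B: {b}

Respond with JSON only:
{{"score": <integer 0-100>, "reasoning": "<one or two sentences>"}}"""


@dataclass(frozen=True)
class MatchResult:
    score: int
    reasoning: str


def build_prompt(record_a: dict, record_b: dict) -> str:
    """Serialize both records deterministically and fill the template."""
    return PROMPT_TEMPLATE.format(
        a=json.dumps(record_a, sort_keys=True),
        b=json.dumps(record_b, sort_keys=True),
    )


def parse_response(raw: str) -> MatchResult:
    """Validate the model's reply; raise ValueError on anything malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    score = data.get("score")
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    reasoning = str(data.get("reasoning", "")).strip()
    if not reasoning:
        raise ValueError("reasoning must not be empty")
    return MatchResult(score=score, reasoning=reasoning)
```

Rejecting malformed replies up front matters here because the score drives an automatic merge, so an out-of-range or missing value should fail loudly and never be coerced.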

4. Automated Action — Score Routing

Score ≥80: automatic merge via GCP Workflows and a BigQuery MERGE statement. Score <80: the record pair and the AI reasoning are queued in the analyst dashboard for human review.
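The routing rule can be sketched in a few lines of Python. The threshold of 80 comes from the description above. The Action enum, the route function, and the table and column names in the SQL string are hypothetical placeholders, shown only to suggest what a MERGE for approved pairs might look like.

```python
from enum import Enum

AUTO_MERGE_THRESHOLD = 80  # threshold stated in the text


class Action(Enum):
    AUTO_MERGE = "auto_merge"
    ANALYST_REVIEW = "analyst_review"


def route(score: int) -> Action:
    """Scores at or above the threshold merge automatically; the rest go to review."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= AUTO_MERGE_THRESHOLD:
        return Action.AUTO_MERGE
    return Action.ANALYST_REVIEW


# Illustrative BigQuery MERGE; the table and column names are assumptions.
MERGE_SQL = """
MERGE `my_project.customers.identity` AS t
USING `my_project.customers.approved_merges` AS m
ON t.account_id = m.duplicate_account_id
WHEN MATCHED THEN
  UPDATE SET canonical_id = m.canonical_id
"""
```

For the three preview pairs, a score of 96 routes to auto-merge, while 71 and 23 both route to analyst review.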

5. Feedback Loop — Continuous Improvement

Analyst decisions are fed back as a training signal. Model thresholds and prompt templates are updated quarterly to improve accuracy over time.
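One possible way to use analyst decisions when revisiting the threshold is sketched below. It assumes each reviewed pair is stored as a (score, analyst_confirmed_duplicate) tuple and picks the lowest candidate threshold whose auto-merge precision meets a target. The function names and the candidate range are hypothetical, and the source does not say how the quarterly update is actually computed.

```python
from typing import Iterable, List, Optional, Tuple

LabeledPair = Tuple[int, bool]  # (LLM score, analyst confirmed duplicate?)


def precision_at(threshold: int, labeled: List[LabeledPair]) -> Optional[float]:
    """Share of pairs at or above the threshold that analysts confirmed."""
    decisions = [is_dup for score, is_dup in labeled if score >= threshold]
    if not decisions:
        return None  # nothing would be auto-merged at this threshold
    return sum(decisions) / len(decisions)


def lowest_threshold_meeting(
    target_precision: float,
    labeled: List[LabeledPair],
    candidates: Iterable[int] = range(50, 101, 5),
) -> Optional[int]:
    """Return the smallest candidate threshold that reaches the target precision."""
    for threshold in sorted(candidates):
        precision = precision_at(threshold, labeled)
        if precision is not None and precision >= target_precision:
            return threshold
    return None
```

Choosing the lowest qualifying threshold keeps as many pairs as possible out of the manual queue while still holding auto-merge precision at the chosen target.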

Overview

🔴 The Problem

Millions of customer records duplicated across CRM, billing, and support systems
Manual deduplication took analyst teams weeks per data quality cycle
Rule-based matching failed on address formatting variations and name abbreviations
No unified customer view: the same person held multiple accounts with different data

🟢 The Solution

LLM-powered semantic comparison that understands address variants and name aliases
Automated merge pipeline for high-confidence matches (score ≥80)
AI-generated reasoning for the analyst review queue, so decisions take seconds, not hours
GCP Workflows orchestration with zero manual steps for auto-merge execution

Outcomes & Impact

95%+ Auto-Merge Accuracy

LLM scoring achieved high precision on address-variant and name-abbreviated duplicates that rule-based systems missed entirely.

Analyst Time Reclaimed

Manual review queue reduced by 80%+; analysts only see ambiguous cases, with AI reasoning already provided.

Unified Customer Identity

Single customer view across 5+ enterprise systems, with zero duplicate accounts post-merge for matched records.

Production Pipeline — GCP

Fully automated on GCP Workflows + BigQuery. Runs on schedule with alerting, logging, and rollback capability.

Technology Stack

Python · LLM (GPT / Vertex AI) · BigQuery · GCP Workflows · Fuzzy Matching · Soundex / Metaphone · FastAPI · PostgreSQL · Terraform · Docker
