Artificial Intelligence · NLP · Enterprise Data

LLM Customer Deduplication Engine

Built an AI-powered entity resolution system for TELUS that identifies duplicate customer records using LLM semantic analysis, fuzzy address matching, and automated merge workflows — eliminating data fragmentation across millions of records.

Client: TELUS (Enterprise Telecom)
Role: Senior Backend / AI Engineer
Stack: Python · LLM · BigQuery · GCP
Live System Preview

Duplicate Comparison Engine

The AI scores each record pair 0–100. Scores ≥80 trigger automatic merge. Scores below 80 queue for analyst review with LLM reasoning.

TELUS · Veritas Deduplication Engine · Live
Pair 1 · Score: 96
Record A · CRM System
John M. Williamson
4821 Oak Street NW, Calgary AB T2N 1P4
DOB: 1978-03-14 · Acct: TLS-882341
Record B · Billing System
John Williamson
4821 Oak St NW, Calgary, Alberta T2N1P4
DOB: 1978-03-14 · Acct: BIL-004821

Pair 2 · Score: 71
Record A · CRM System
Sarah Chen
220 Burrard St, Vancouver BC V6C 3L6
DOB: 1985-07-22 · Acct: TLS-441098
Record B · Support Tickets
S. Chen
220 Burrard Street, Suite 400, Vancouver
DOB: unknown · Acct: SUP-220BC

Pair 3 · Score: 23
Record A · CRM System
Michael Torres
88 King St W, Toronto ON M5H 1S3
DOB: 1990-11-05 · Acct: TLS-774322
Record B · Billing System
Miguel Torres
88 King St E, Toronto ON M5C 1G3
DOB: 1962-04-18 · Acct: BIL-088912

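The first pair above differs only in formatting ("Street" vs "St", "AB" vs "Alberta", spacing in the postal code). Here is a minimal sketch of how fuzzy address matching could canonicalize such variants before comparing them. The abbreviation table and function names are illustrative assumptions, not the production rules, and Python's standard `difflib.SequenceMatcher` stands in for whatever matcher the real system uses.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real one would be far larger.
ABBREVIATIONS = {
    "street": "st", "avenue": "ave", "road": "rd", "drive": "dr",
    "suite": "ste", "alberta": "ab", "ontario": "on",
}

def normalize_address(addr: str) -> str:
    """Lowercase, strip punctuation, abbreviate tokens, join postal code halves."""
    text = re.sub(r"[^\w\s]", " ", addr.lower())
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    joined = " ".join(tokens)
    # Canadian postal codes: "t2n 1p4" and "t2n1p4" should compare equal.
    return re.sub(r"\b([a-z]\d[a-z])\s+(\d[a-z]\d)\b", r"\1\2", joined)

def address_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two addresses after normalization."""
    return SequenceMatcher(None, normalize_address(a), normalize_address(b)).ratio()

same = address_similarity(
    "4821 Oak Street NW, Calgary AB T2N 1P4",
    "4821 Oak St NW, Calgary, Alberta T2N1P4",
)
different = address_similarity(
    "88 King St W, Toronto ON M5H 1S3",
    "88 King St E, Toronto ON M5C 1G3",
)
```

Under these assumptions the Williamson addresses normalize to the same string, while the two Torres addresses remain distinct. A character-level ratio alone would still rate the Torres pair as close, which is one reason the pipeline also weighs name and date of birth.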
System Architecture

How It Works

1. Data Ingestion: BigQuery
Customer records are pulled from CRM, billing, support, and CDR systems via BigQuery federated queries, then normalized into a unified schema.
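As a rough illustration of the normalization step, the sketch below maps rows from different source systems onto one record type. The column names in `FIELD_MAPS`, the `CustomerRecord` fields, and the treatment of "unknown" as missing are all assumptions made for the example; the actual unified schema is not described here.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass(frozen=True)
class CustomerRecord:
    source: str
    account_id: str
    full_name: str
    address: str
    dob: Optional[str]  # ISO date string, or None when the source has no value

# Hypothetical column names for each source system.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "crm": {"account_id": "acct", "full_name": "name",
            "address": "addr", "dob": "dob"},
    "billing": {"account_id": "billing_acct", "full_name": "customer_name",
                "address": "bill_to", "dob": "birth_date"},
    "support": {"account_id": "ticket_acct", "full_name": "contact",
                "address": "location", "dob": "dob"},
}

def _clean(value: Any) -> Optional[str]:
    """Collapse whitespace; treat empty strings and 'unknown' as missing."""
    if value is None:
        return None
    text = " ".join(str(value).split())
    if not text or text.lower() == "unknown":
        return None
    return text

def normalize(source: str, row: Dict[str, Any]) -> CustomerRecord:
    """Map one raw row from a source system onto the unified record type."""
    cols = FIELD_MAPS[source]
    return CustomerRecord(
        source=source,
        account_id=_clean(row.get(cols["account_id"])) or "",
        full_name=_clean(row.get(cols["full_name"])) or "",
        address=_clean(row.get(cols["address"])) or "",
        dob=_clean(row.get(cols["dob"])),
    )
```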
2. Candidate Generation: Blocking
Fuzzy blocking on address tokens, name phonetics (Soundex/Metaphone), and postal codes cuts the roughly 100M² possible pairwise comparisons down to a manageable set of candidate pairs.
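A small sketch of multi-key blocking may make this step concrete. It implements standard American Soundex and groups records that share a surname code plus either a postal prefix or a street number. The choice of keys, the record fields, and the sample IDs are assumptions for illustration; the production blocking rules are not specified in this write-up.

```python
from collections import defaultdict
from itertools import combinations

_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(name: str) -> str:
    """Standard American Soundex: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "HW":  # H and W do not separate equal codes
            prev = code
    return (encoded + "000")[:4]

def blocking_keys(rec: dict) -> set:
    """Emit every block this record belongs to (hypothetical key choice)."""
    sdx = soundex(rec["last_name"])
    keys = set()
    postal = rec.get("postal", "").replace(" ", "").upper()
    if len(postal) >= 3:
        keys.add(("postal_prefix", sdx, postal[:3]))
    tokens = rec.get("address", "").split()
    if tokens and tokens[0].isdigit():
        keys.add(("street_no", sdx, tokens[0]))
    return keys

def candidate_pairs(records: list) -> set:
    """Pairs of record IDs that share at least one blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        for key in blocking_keys(rec):
            blocks[key].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

RECORDS = [
    {"id": "A1", "last_name": "Williamson", "address": "4821 Oak Street NW", "postal": "T2N 1P4"},
    {"id": "B1", "last_name": "Williamson", "address": "4821 Oak St NW", "postal": "T2N1P4"},
    {"id": "A2", "last_name": "Chen", "address": "220 Burrard St", "postal": "V6C 3L6"},
    {"id": "B2", "last_name": "Chen", "address": "220 Burrard Street", "postal": ""},
    {"id": "A3", "last_name": "Torres", "address": "88 King St W", "postal": "M5H 1S3"},
    {"id": "B3", "last_name": "Torres", "address": "88 King St E", "postal": "M5C 1G3"},
]
pairs = candidate_pairs(RECORDS)
```

With these six sample records, only the three same-surname pairs become candidates out of fifteen possible pairs. Emitting several keys per record lets a record with a missing postal code still land in a block through its street number.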
3. LLM Scoring: Semantic Comparison
Each candidate pair is sent to the LLM with a structured prompt. The model returns a similarity score (0–100) and natural-language reasoning explaining the match decision.
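The scoring call could be wrapped roughly as follows. The prompt wording, the JSON response shape, and the `complete` callable (a stand-in for whichever LLM client is used) are assumptions for this sketch; no specific vendor API is implied.

```python
import json
from typing import Callable, Tuple

# Hypothetical prompt; the production template is not shown in this write-up.
PROMPT_TEMPLATE = """Compare the two customer records below and decide whether
they describe the same person.
Respond with JSON only: {{"score": <integer 0-100>, "reasoning": "<one or two sentences>"}}

Record A: {a}
Record B: {b}
"""

def build_prompt(a: dict, b: dict) -> str:
    """Render both records as stable JSON inside the structured prompt."""
    return PROMPT_TEMPLATE.format(
        a=json.dumps(a, sort_keys=True),
        b=json.dumps(b, sort_keys=True),
    )

def parse_response(raw: str) -> Tuple[int, str]:
    """Validate the model output instead of trusting it blindly."""
    data = json.loads(raw)
    score = int(data["score"])
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return score, str(data["reasoning"])

def score_pair(a: dict, b: dict, complete: Callable[[str], str]) -> Tuple[int, str]:
    """Send one candidate pair to the model and return (score, reasoning)."""
    return parse_response(complete(build_prompt(a, b)))
```

Passing the model call in as a function keeps the prompt and parsing logic testable with a stub, for example `score_pair(a, b, lambda prompt: '{"score": 96, "reasoning": "same DOB and address"}')`.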
4. Automated Action: Score Routing
Score ≥80: automatic merge via a GCP Workflow and a BigQuery MERGE statement. Score <80: the record pair and the AI reasoning are queued in the analyst dashboard for human review.
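The routing rule itself is simple enough to sketch directly. The threshold of 80 comes from the description above. The route names, the table names, and the columns in the MERGE statement are hypothetical placeholders, shown only to indicate the general shape such a statement might take.

```python
AUTO_MERGE_THRESHOLD = 80  # stated in the source: scores >= 80 merge automatically

def route(score: int) -> str:
    """Return the hypothetical route name for a pair score."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return "auto_merge" if score >= AUTO_MERGE_THRESHOLD else "analyst_review"

# Illustrative BigQuery MERGE; project, dataset, table, and column names
# are placeholders, not the real schema.
MERGE_SQL = """
MERGE `my_project.customers.unified` AS t
USING `my_project.dedup.approved_merges` AS m
ON t.customer_id = m.duplicate_id
WHEN MATCHED THEN
  UPDATE SET merged_into = m.survivor_id
"""

# The three preview pairs: 96 merges, 71 and 23 go to review.
decisions = {score: route(score) for score in (96, 71, 23)}
```

Marking the duplicate row with a pointer to the surviving record, rather than deleting it, is one way to keep merges reversible; whether the production pipeline does this is not stated.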
5. Feedback Loop: Continuous Improvement
Analyst decisions are fed back as a training signal. Model thresholds and prompt templates are updated quarterly to improve accuracy over time.
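One way the quarterly threshold review could use analyst decisions is sketched below: measure auto-merge precision at several candidate thresholds and pick the lowest one that meets a target. The labelled sample, the 95% target, and the function names are assumptions for illustration, and the sketch presumes some auto-merged pairs are also audited so that labels exist above the current threshold.

```python
from typing import Iterable, List, Optional, Tuple

Labelled = List[Tuple[int, bool]]  # (LLM score, analyst confirmed duplicate?)

def precision_at(threshold: int, labelled: Labelled) -> Optional[float]:
    """Share of pairs at or above the threshold that analysts confirmed."""
    outcomes = [is_dup for score, is_dup in labelled if score >= threshold]
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)

def lowest_threshold_meeting(
    labelled: Labelled,
    target: float,
    candidates: Iterable[int] = range(50, 100, 5),
) -> Optional[int]:
    """First candidate threshold, ascending, whose precision meets the target."""
    for threshold in sorted(candidates):
        precision = precision_at(threshold, labelled)
        if precision is not None and precision >= target:
            return threshold
    return None

# Made-up review outcomes, purely to exercise the functions.
SAMPLE: Labelled = [
    (96, True), (91, True), (85, True), (78, True), (74, True),
    (71, False), (65, True), (60, False), (40, False),
]
suggested = lowest_threshold_meeting(SAMPLE, target=0.95)
```

Precision is not guaranteed to rise monotonically with the threshold on a small sample, so a suggested value like this would be a prompt for review, not an automatic change.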
Overview

🔴 The Problem

  • Millions of customer records duplicated across CRM, billing, and support systems
  • Manual deduplication took analyst teams weeks per data quality cycle
  • Rule-based matching failed on address formatting variations and name abbreviations
  • No unified customer view — same person held multiple accounts with different data

🟢 The Solution

  • LLM-powered semantic comparison understanding address variants and name aliases
  • Automated merge pipeline for high-confidence matches (score ≥80)
  • AI-generated reasoning for the analyst review queue, so decisions take seconds, not hours
  • GCP Workflow orchestration — zero manual steps for auto-merge execution
Outcomes & Impact
95%+ Auto-Merge Accuracy
LLM scoring achieved high precision on address-variant and name-abbreviated duplicates that rule-based systems missed entirely.
Analyst Time Reclaimed
Manual review queue reduced by 80%+. Analysts only see ambiguous cases, with AI reasoning already provided.
Unified Customer Identity
Single customer view across 5+ enterprise systems, with zero duplicate accounts post-merge for matched records.
Production Pipeline: GCP
Fully automated on GCP Workflows + BigQuery. Runs on schedule with alerting, logging, and rollback capability.
Technology Stack
Python · LLM (GPT / Vertex AI) · BigQuery · GCP Workflows · Fuzzy Matching · Soundex / Metaphone · FastAPI · PostgreSQL · Terraform · Docker