Backend · Data Engineering · GCP

Customer Intelligence Data Pipeline

Designed and built a distributed data platform processing 100M+ customer records daily — powering identity resolution, real-time search, and enterprise analytics at scale.

Client: TELUS (Enterprise Telecom)
Role: Senior Backend / Data Platform Engineer
Duration: Feb 2025 – Present
Stack: Python · GCP · PySpark · BigQuery
100M+ records processed daily
Reduced query cost via Parquet architecture
30% performance improvement on SQL & backend services
// data flow architecture
Ingestion: Enterprise Sources (CRM · CDC · APIs)
Processing: GCP Dataflow (PySpark · ETL)
Storage: GCS + Parquet (Optimized Storage)
Warehouse: BigQuery (Analytics · Search)
Insights: Looker / BI (Exec Dashboards)
Overview

🔴 The Problem

  • Fragmented customer data across 10+ enterprise systems with no unified identity layer
  • BigQuery queries scanning full tables — slow and expensive at 100M+ record scale
  • Manual workflows causing data latency of 24–48 hours in reporting
  • No automated monitoring — pipeline failures discovered post-mortem

🟢 The Solution

  • Built Veritas — a unified customer identity resolution and search platform
  • Implemented Parquet-based serving layer to bypass full BigQuery table scans
  • Designed distributed PySpark pipelines on GCP Dataflow for parallel processing
  • Automated workflows via GCP Workflows + Cloud Functions with alerting
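
The identity-resolution idea behind Veritas can be illustrated with a small sketch. This is not the platform's code: it assumes, purely for illustration, that two records belong to the same customer when they share a normalized email or phone value, and it links them transitively with a union-find structure. The field names (`id`, `email`, `phone`) and the function `resolve_identities` are hypothetical.

```python
from collections import defaultdict


def resolve_identities(records):
    """Group record ids that share any identifier value, transitively.

    Assumption: a shared email or phone means "same customer". Real
    matching rules are usually richer (fuzzy names, addresses, scoring).
    """
    parent = {}

    def find(x):
        # Return the cluster root of x, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (field, normalized value) -> first record id carrying it
    for rec in records:
        rid = rec["id"]
        find(rid)  # register the record even if it has no identifiers
        for field in ("email", "phone"):
            value = rec.get(field)
            if not value:
                continue
            key = (field, value.strip().lower())
            if key in first_seen:
                union(first_seen[key], rid)
            else:
                first_seen[key] = rid

    clusters = defaultdict(list)
    for rec in records:
        clusters[find(rec["id"])].append(rec["id"])
    return sorted(sorted(ids) for ids in clusters.values())


# Hypothetical records from three source systems.
sample = [
    {"id": "crm-1", "email": "A@x.com", "phone": None},
    {"id": "bill-7", "email": "a@x.com", "phone": "555-0100"},
    {"id": "portal-3", "email": None, "phone": "555-0100"},
    {"id": "crm-2", "email": "b@x.com"},
]
clusters = resolve_identities(sample)
```

Here `crm-1` and `portal-3` share no field directly, yet they end up in one cluster because `bill-7` bridges them. At 100M+ records the same linking step would typically run as a distributed connected-components job rather than in a single process.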
Live Platform Preview
Veritas · Customer Intelligence Platform · TELUS
Records today: 103.4M (↑ 2.1% vs yesterday)
Pipeline status: LIVE (all 6 jobs healthy)
Query latency: 180ms (↓ 62% from baseline)
Cost / query: $0.003 (↓ 41% Parquet savings)
// records processed, last 7 days (millions) [chart: one bar per day, Mon to Sun]
// recent pipeline jobs
identity-resolution-v3: COMPLETE
cdc-ingestion-telco: COMPLETE
parquet-optimizer: RUNNING
bigquery-sync: COMPLETE
looker-refresh: SCHEDULED
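
A status list like this is what an automated health check would consume. The sketch below is an assumption about how such a check might look, not the platform's alerting code: it treats the three states shown above as healthy and flags anything else. The `FAILED` state and the function `jobs_needing_alert` are hypothetical.

```python
# States that appear in the job list above; treated as healthy here.
HEALTHY_STATES = {"COMPLETE", "RUNNING", "SCHEDULED"}


def jobs_needing_alert(job_states):
    """Return the names of jobs whose state is not in HEALTHY_STATES."""
    return sorted(
        name
        for name, state in job_states.items()
        if state.upper() not in HEALTHY_STATES
    )


current = {
    "identity-resolution-v3": "COMPLETE",
    "cdc-ingestion-telco": "COMPLETE",
    "parquet-optimizer": "RUNNING",
    "bigquery-sync": "COMPLETE",
    "looker-refresh": "SCHEDULED",
}
alerts = jobs_needing_alert(current)

# Hypothetical failure, to show what would be flagged.
degraded = dict(current, **{"bigquery-sync": "FAILED"})
degraded_alerts = jobs_needing_alert(degraded)
```

In a setup like the one described, the returned names would be handed to whatever notification channel the alerting uses.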
// data sources connected
CRM Platform: 38.2M rows
Telco CDR Events: 41.7M rows
Customer Portal: 12.4M rows
Support Tickets: 6.1M rows
Billing System: 5.0M rows
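
As a quick consistency check, the per-source row counts can be summed and compared with the 103.4M daily total shown in the preview. The snippet uses only figures from this page; `Decimal` avoids floating-point rounding noise in the sum.

```python
from decimal import Decimal

# Row counts in millions, copied from the source list above.
SOURCE_ROWS_MILLIONS = {
    "CRM Platform": Decimal("38.2"),
    "Telco CDR Events": Decimal("41.7"),
    "Customer Portal": Decimal("12.4"),
    "Support Tickets": Decimal("6.1"),
    "Billing System": Decimal("5.0"),
}

total_millions = sum(SOURCE_ROWS_MILLIONS.values(), Decimal("0"))
matches_dashboard = total_millions == Decimal("103.4")
```

The five sources add up to exactly the 103.4M records reported for the day.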
Outcomes & Impact
📈 30% Processing Performance Gain: Optimized SQL queries and backend services reduced compute time significantly across all pipeline stages.
💰 Significant Cost Reduction: Parquet serving architecture eliminated redundant BigQuery full-table scans, cutting query costs substantially.
🔗 Unified Customer Identity: Veritas platform resolved customer identity across 5+ fragmented data sources with search latency under 200ms.
⚙️ Fully Automated Infrastructure: Zero-touch pipeline orchestration via GCP Workflows + Terraform, with CI/CD via GitHub Actions.
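
One common way a Parquet serving layer avoids full scans is partition pruning: files are laid out by a partition key, and a query reads only the partitions it needs. The sketch below illustrates that idea with plain path filtering. The Hive-style `dt=YYYY-MM-DD` layout, the example paths, and the function `prune_partitions` are assumptions for illustration; the actual storage layout is not described here.

```python
from datetime import date


def prune_partitions(paths, start, end):
    """Keep only files whose dt=YYYY-MM-DD partition lies in [start, end]."""
    selected = []
    for path in paths:
        for segment in path.split("/"):
            if segment.startswith("dt="):
                day = date.fromisoformat(segment[len("dt="):])
                if start <= day <= end:
                    selected.append(path)
                break  # only the first dt= segment is considered
    return selected


# Hypothetical object paths in a date-partitioned layout.
files = [
    "customers/dt=2025-03-01/part-0.parquet",
    "customers/dt=2025-03-02/part-0.parquet",
    "customers/dt=2025-03-03/part-0.parquet",
    "customers/_metadata",
]
to_read = prune_partitions(files, date(2025, 3, 2), date(2025, 3, 3))
```

A query for two days then touches two files instead of the whole dataset, which is where the scan and cost savings would come from under this layout.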
Technology Stack
Python 3.11 · PySpark · GCP Dataflow · BigQuery · GCS · GCP Workflows · Cloud Functions · Parquet · REST APIs · Docker · Kubernetes · Terraform · GitHub Actions · Looker · PostgreSQL