Backend · Data Engineering · GCP

Customer Intelligence
Data Pipeline

Designed and built a distributed data platform processing 100M+ customer records daily — powering identity resolution, real-time search, and enterprise analytics at scale.

Client: TELUS (Enterprise Telecom)

Role: Senior Backend / Data Platform Engineer

Duration: Feb 2025 – Present

Stack: Python · GCP · PySpark · BigQuery

100M+ records processed daily

Query cost reduction via Parquet architecture

30% performance improvement on SQL & backend services

// data flow architecture

Ingestion: 🏢 Enterprise Sources (CRM · CDC · APIs)
Processing: ⚡ GCP Dataflow (PySpark · ETL)
Storage: 🗄️ GCS + Parquet (Optimized Storage)
Warehouse: 🔍 BigQuery (Analytics · Search)
Insights: 📊 Looker / BI (Exec Dashboards)

Overview

🔴 The Problem

Customer data fragmented across 10+ enterprise systems, with no unified identity layer
BigQuery queries scanning full tables, which was slow and expensive at 100M+ record scale
Manual workflows adding 24–48 hours of latency to reporting
No automated monitoring, so pipeline failures were discovered only after the fact

🟢 The Solution

Built Veritas, a unified customer identity resolution and search platform
Implemented a Parquet-based serving layer to bypass full BigQuery table scans
Designed distributed PySpark pipelines on GCP Dataflow for parallel processing
Automated workflows via GCP Workflows + Cloud Functions, with alerting
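
To make the identity resolution idea concrete, here is a minimal sketch of one common approach: linking records that share an identifier with a union-find structure. It is not Veritas's actual matching logic, which this page does not describe. The record fields (`id`, `email`, `phone`), the matching keys, and the sample data are all hypothetical.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure used to merge records that share an identifier."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        # Path halving shortens chains as they are walked.
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def resolve_identities(records, keys=("email", "phone")):
    """Group record IDs that share any non-empty value for the given keys."""
    uf = UnionFind()
    first_seen = {}  # (key, normalized value) -> first record id carrying it
    for rec in records:
        uf.find(rec["id"])  # register records that match nothing
        for key in keys:
            value = (rec.get(key) or "").strip().lower()
            if not value:
                continue
            token = (key, value)
            if token in first_seen:
                uf.union(first_seen[token], rec["id"])
            else:
                first_seen[token] = rec["id"]
    clusters = defaultdict(list)
    for rec in records:
        clusters[uf.find(rec["id"])].append(rec["id"])
    return sorted(sorted(ids) for ids in clusters.values())


# Hypothetical records from four source systems.
records = [
    {"id": "crm-1", "email": "Ana@Example.com", "phone": None},
    {"id": "billing-7", "email": "ana@example.com", "phone": "604-555-0100"},
    {"id": "portal-3", "email": None, "phone": "604-555-0100"},
    {"id": "support-9", "email": "lee@example.com", "phone": None},
]
clusters = resolve_identities(records)
print(clusters)
```

The first three records end up in one cluster because the CRM and billing rows share an email and the billing and portal rows share a phone number. At 100M+ records the same linking step would more plausibly run as a distributed connected-components job than in a single process.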

Live Platform Preview

RECORDS TODAY: 103.4M (↑ 2.1% vs yesterday)

PIPELINE STATUS: LIVE (All 6 jobs healthy)

QUERY LATENCY: 180ms (↓ 62% from baseline)

COST / QUERY: $0.003 (↓ 41% Parquet savings)

// records processed, last 7 days (millions), Mon to Sun

// recent pipeline jobs

identity-resolution-v3: COMPLETE

cdc-ingestion-telco: COMPLETE

parquet-optimizer: RUNNING

bigquery-sync: COMPLETE

looker-refresh: SCHEDULED

// data sources connected

CRM Platform: 38.2M rows

Telco CDR Events: 41.7M rows

Customer Portal: 12.4M rows

Support Tickets: 6.1M rows

Billing System: 5.0M rows

Outcomes & Impact

📈

30% Processing Performance Gain: Optimized SQL queries and backend services cut compute time across all pipeline stages.

💰

Significant Cost Reduction: The Parquet serving architecture eliminated redundant full-table BigQuery scans, reducing query costs.

🔗

Unified Customer Identity: The Veritas platform resolved customer identity across 5+ fragmented data sources, with search latency under 200ms.

⚙️

Fully Automated Infrastructure: Zero-touch pipeline orchestration via GCP Workflows + Terraform, with CI/CD via GitHub Actions.
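
One way a Parquet serving layer can avoid full scans is partition pruning: files are laid out under hive-style `key=value` directories, and a reader opens only the paths matching the query filter. The sketch below illustrates that idea in plain Python. The bucket name, the `region` and `dt` partition keys, and the file paths are assumptions for illustration, not the layout used in this project.

```python
from datetime import date


def parse_partitions(path):
    """Extract hive-style key=value segments from an object path."""
    parts = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep:
            parts[key] = value
    return parts


def prune(paths, region, start, end):
    """Keep only files whose partition values satisfy the filter."""
    selected = []
    for path in paths:
        parts = parse_partitions(path)
        day = date.fromisoformat(parts["dt"])
        if parts["region"] == region and start <= day <= end:
            selected.append(path)
    return selected


# Hypothetical object listing; only the paths are inspected, no file is read.
paths = [
    "gs://example-bucket/customers/region=BC/dt=2025-03-01/part-0.parquet",
    "gs://example-bucket/customers/region=BC/dt=2025-03-02/part-0.parquet",
    "gs://example-bucket/customers/region=AB/dt=2025-03-01/part-0.parquet",
    "gs://example-bucket/customers/region=BC/dt=2025-02-27/part-0.parquet",
]
selected = prune(paths, "BC", date(2025, 3, 1), date(2025, 3, 2))
print(f"reading {len(selected)} of {len(paths)} files")
```

Readers such as PySpark and PyArrow apply this kind of pruning automatically on partitioned datasets, and Parquet's columnar format additionally lets them read only the columns a query needs.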

Technology Stack

Python 3.11 · PySpark · GCP Dataflow · BigQuery · GCS · GCP Workflows · Cloud Functions · Parquet · REST APIs · Docker · Kubernetes · Terraform · GitHub Actions · Looker · PostgreSQL
