Healthcare Analytics · Big Data Engineering · Distinguished Project

COVID-19 Healthcare Cost Intelligence Solution

A policy-grade big data platform analyzing the pandemic's $200 billion impact on the U.S. healthcare system - unifying fragmented federal data sources into a single analytical engine, with findings designed to inform government decisions on healthcare budgets and quality of care.

Correlation One DS4 Data Engineering Program

One of the most selective data engineering programs in the country. A single engineer served as technical lead, driving all architecture decisions and mentoring a team of less experienced developers through implementation.

32M+
Records Processed
4
Data Sources Unified
$200B
Healthcare Impact Analyzed
100%
Pipeline Automation

Project Overview

Data Swan - Correlation One DS4 Data Engineering Program

The goal was clear: build a comprehensive analytics platform to help healthcare organizations understand and respond to COVID-19's unprecedented financial impact on the U.S. healthcare system. With costs projected to reach $200 billion according to McKinsey research, healthcare administrators and policymakers urgently needed unified, data-driven intelligence across physician payments, quality metrics, and expenditures spanning pre-pandemic and pandemic periods.

The Challenge

Massive Data Scale

Over 32 million records across four federal and international sources - CMS Open Payments, MIPS performance data, WHO expenditures, and state hospital records - each with different schemas, formats, and data quality issues.

Data Integration Complexity

Unifying physician payment data, quality scores, and expenditure metrics required sophisticated transformation logic to build a coherent analytical model spanning pre-pandemic and pandemic periods across multiple jurisdictions.

Infrastructure at Scale

Initial Pandas-based processing hit critical memory limitations at 32M+ records. The solution required rethinking the pipeline at every layer - processing framework, file format, and compression - not just scaling up infrastructure.

Policy-Grade Accuracy

Outputs were intended to inform government decision-making on healthcare budgets. That raised the bar beyond a working pipeline - the analysis had to be accurate, defensible, and grounded in genuine domain understanding.

Approach

The quality of this output wasn't accidental. It required a combination of technical depth, healthcare domain fluency, and the judgment to make the right architectural calls under pressure - things most engineers bring one or two of, rarely all three.

Healthcare Domain Fluency

A background spanning biomedical engineering, medical device development, and FDA/CMS regulatory experience meant understanding what the data actually represents - not just its structure. CMS Open Payments, MIPS quality scores, and WHO expenditure benchmarks are complex policy instruments. Processing them correctly required knowing what they mean.

Multi-Layer Performance Breakthrough

When data volume threatened to sink the project, the solution wasn't just throwing more compute at the problem. The pipeline was rebuilt across three layers: migrating from Pandas to PySpark for distributed processing, switching from CSV to Parquet for columnar storage, and applying Snappy compression - reducing the memory footprint by 85% and dramatically cutting pipeline runtime. Most engineers fix one layer. This required seeing all three.
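The storage-layer half of that rework can be illustrated without a Spark cluster. Parquet and Snappy require third-party libraries, so the sketch below is a standard-library analogy only: the same numeric column stored as row-oriented CSV text versus packed columnar binary with zlib compression (zlib standing in for Snappy, packed doubles standing in for a Parquet column chunk). The data and sizes are synthetic.

```python
import csv
import io
import struct
import zlib

# Synthetic payment amounts standing in for one numeric column of the
# 32M-record dataset (10,000 rows here, purely for illustration).
amounts = [round(1234.56 + i * 0.37, 2) for i in range(10_000)]

# Row-oriented text storage: roughly how the original CSV pipeline held the data.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["amount"])
for a in amounts:
    writer.writerow([a])
csv_bytes = buf.getvalue().encode()

# Column-oriented binary storage plus compression: the same idea Parquet
# and Snappy apply, with struct-packed doubles and zlib standing in.
packed = struct.pack(f"{len(amounts)}d", *amounts)
compressed = zlib.compress(packed)

print(len(csv_bytes), len(packed), len(compressed))
```

The columnar binary form is already smaller than the text encoding before compression is applied, which is why fixing the file format and the compression codec together compounds the savings.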

Data Quality Judgment

Six physician review data sources were evaluated - Healthgrades, ZocDoc, WebMD, RateMDs, RealSelf, and Vitals. RateMDs data was rejected outright due to insufficient physician identification and inconsistent scoring. Prioritizing source quality over volume kept the analysis defensible at a policy level.

Technical Leadership

Served as technical lead for a team of less experienced developers, driving architecture, ETL implementation, infrastructure deployment, and documentation - while mentoring the team through the process within a competitive fellowship recognized for technical excellence.

What Was Built

End-to-End Data Pipeline Architecture

Extract

CMS, WHO, MIPS, State Hospital APIs & Files

Transform

PySpark distributed processing, Parquet + Snappy compression

Load

AWS RDS PostgreSQL + S3 data lake

Orchestrate

Apache Airflow DAG automation
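The orchestration layer can be sketched as a daily Airflow DAG chaining the three stages above. This is a hypothetical configuration sketch, not the project's actual DAG: the task ids, script names, schedule, and the S3 path are all illustrative assumptions.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Retry settings are illustrative defaults, not the project's actual values.
default_args = {"retries": 2, "retry_delay": timedelta(minutes=10)}

with DAG(
    dag_id="covid_healthcare_cost_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract = BashOperator(
        task_id="extract_sources",
        bash_command="python extract.py --sources cms,mips,who,state",
    )
    transform = BashOperator(
        task_id="transform_pyspark",
        bash_command="spark-submit transform.py --out s3://lake/curated/",
    )
    load = BashOperator(
        task_id="load_postgres",
        bash_command="python load.py --target rds-postgres",
    )

    # Linear dependency chain: extract must succeed before transform, etc.
    extract >> transform >> load
```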

Data Transformations

  • Schema normalization across heterogeneous sources
  • Temporal alignment for pre/post-pandemic analysis
  • Physician and facility entity resolution
  • Payment aggregation and quality score indexing
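Two of these transformations can be sketched in plain Python. The real pipeline ran them as PySpark column expressions over 32M+ records; the pandemic cutoff date and key format below are illustrative assumptions, though the 10-digit NPI check reflects the actual identifier format.

```python
from datetime import date

# Illustrative cutoff for temporal alignment; the actual analysis
# windows used by the project are an assumption here.
PANDEMIC_START = date(2020, 3, 1)

def pandemic_period(d: date) -> str:
    """Tag a record date as pre-pandemic or pandemic for period-over-period analysis."""
    return "pandemic" if d >= PANDEMIC_START else "pre_pandemic"

def physician_key(npi: str, last: str, first: str) -> str:
    """Build a normalized entity-resolution key (hypothetical format):
    prefer the NPI when present, else fall back to a cleaned name key."""
    npi = (npi or "").strip()
    if npi.isdigit() and len(npi) == 10:  # NPIs are 10-digit identifiers
        return f"npi:{npi}"
    return f"name:{last.strip().lower()}|{first.strip().lower()}"

print(pandemic_period(date(2019, 6, 1)))           # pre_pandemic
print(physician_key("1234567890", "Doe", "Jane"))  # npi:1234567890
```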

Analytical Capabilities

  • Physician payment trend analysis by specialty
  • Quality score correlation with expenditures
  • Geographic healthcare cost comparisons
  • Pandemic impact year-over-year metrics
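A year-over-year metric of the kind listed above reduces to a grouped SQL aggregation. In the sketch below, SQLite stands in for the platform's PostgreSQL warehouse, and the table name, columns, and toy rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (year INT, specialty TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [
        (2019, "cardiology", 100.0),
        (2019, "cardiology", 50.0),
        (2020, "cardiology", 90.0),
        (2020, "oncology", 200.0),
    ],
)

# Total payments per specialty per year: the same shape as a
# pandemic-impact year-over-year query, scaled down to toy data.
rows = conn.execute(
    """
    SELECT specialty, year, SUM(amount) AS total
    FROM payments
    GROUP BY specialty, year
    ORDER BY specialty, year
    """
).fetchall()
print(rows)  # [('cardiology', 2019, 150.0), ('cardiology', 2020, 90.0), ('oncology', 2020, 200.0)]
```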

Data Sources Integrated

CMS Open Payments

Physician payment data from drug and medical device companies

MIPS Clinician Performance

Merit-Based Incentive Payment System quality scores

WHO Global Expenditures

International healthcare spending benchmarks

Wisconsin Hospital Association

Hospital CheckPoint performance and cost metrics

Technologies Used

Apache Spark (PySpark) · Apache Airflow · AWS RDS (PostgreSQL) · AWS S3 · Python · Pandas · Parquet · Snappy Compression

Value Delivered

Before

$200 billion in COVID-19 healthcare costs with no unified analytical view. Data locked in siloed federal systems - each in different formats, each requiring manual interpretation - leaving administrators and policymakers making budget decisions without cross-source intelligence.

After

A single automated platform processing 32M+ records across all four sources, producing analytical intelligence designed to reach government leaders and inform decisions on maximizing the value of healthcare budgets while maintaining quality of care.

Fragmented Data → Unified Intelligence

Four disparate federal and international data sources, each with different schemas and quality issues, unified into a single analytical model enabling cross-source insights that were previously impossible.

Domain Knowledge at Scale

CMS pricing structures, MIPS quality frameworks, and WHO expenditure benchmarks aren't just data - they're complex policy instruments. Understanding them semantically, not just technically, is what made the analysis defensible at a government level.

Policy-Grade Recommendations

The analysis was designed to connect to QALY (quality-adjusted life year) and DALY (disability-adjusted life year) frameworks - the real instruments governments use to allocate healthcare budgets. Findings are designed to reach government leaders making decisions on post-pandemic healthcare spending and quality of care.

Recognition

This project received the Distinguished Project Award from the Correlation One DS4 Data Engineering Program, recognizing the sophisticated architecture, successful scaling strategy, and production-quality delivery. Selected from a highly competitive cohort, the award reflects what becomes possible when deep technical execution meets genuine domain expertise.