COVID-19 Healthcare Cost Intelligence Solution
A policy-grade big data platform analyzing the pandemic's $200 billion impact on the U.S. healthcare system - unifying fragmented federal data sources into a single analytical engine whose findings are designed to reach government leaders making decisions about healthcare budgets and quality of care.
Correlation One DS4 Data Engineering Program
One of the most selective data engineering programs in the country. Served as technical lead, driving all architecture decisions and mentoring a team of less experienced developers through implementation.
Project Overview
Data Swan - Correlation One DS4 Data Engineering Program
The goal was clear: build a comprehensive analytics platform to help healthcare organizations understand and respond to COVID-19's unprecedented financial impact on the U.S. healthcare system. With costs projected to reach $200 billion according to McKinsey research, healthcare administrators and policymakers urgently needed unified, data-driven intelligence across physician payments, quality metrics, and expenditures spanning pre-pandemic and pandemic periods.
The Challenge
Massive Data Scale
Over 32 million records across four federal and international sources - CMS Open Payments, MIPS performance data, WHO expenditures, and state hospital records - each with different schemas, formats, and data quality issues.
Data Integration Complexity
Unifying physician payment data, quality scores, and expenditure metrics required sophisticated transformation logic to build a coherent analytical model spanning pre-pandemic and pandemic periods across multiple jurisdictions.
Infrastructure at Scale
Initial Pandas-based processing hit critical memory limitations at 32M+ records. The solution required rethinking the pipeline at every layer - processing framework, file format, and compression - not just scaling up infrastructure.
Policy-Grade Accuracy
Outputs were intended to inform government decision-making on healthcare budgets. That raised the bar beyond a working pipeline - the analysis had to be accurate, defensible, and grounded in genuine domain understanding.
Approach
The quality of this work wasn't accidental. It required technical depth, healthcare domain fluency, and the judgment to make the right architectural calls under pressure - qualities most engineers bring one or two of, rarely all three.
Healthcare Domain Fluency
A background spanning biomedical engineering, medical device development, and FDA/CMS regulatory experience meant understanding what the data actually represents - not just its structure. CMS Open Payments, MIPS quality scores, and WHO expenditure benchmarks are complex policy instruments. Processing them correctly required knowing what they mean.
Multi-Layer Performance Breakthrough
When data volume threatened to sink the project, the solution wasn't just throwing more compute at the problem. The pipeline was rebuilt across three layers: migrating from Pandas to PySpark for distributed processing, switching from CSV to Parquet for columnar storage, and applying Snappy compression - reducing the memory footprint by 85% and dramatically cutting pipeline runtime. Most engineers fix one layer. This required seeing all three.
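The storage-layer part of that rebuild can be illustrated in miniature. The sketch below is standard-library only: a JSON column layout stands in for Parquet, zlib stands in for Snappy, and the records are invented. In the real pipeline the equivalent step would be PySpark's `df.write.parquet(path, compression="snappy")`.

```python
import csv
import io
import json
import zlib

# Toy records resembling one slice of the payments data (all values invented).
records = [
    {"physician_id": f"P{i:06d}", "specialty": "Cardiology", "payment_usd": 125.0}
    for i in range(5000)
]

# Row-oriented layout: one CSV line per record (how the raw feeds arrive).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["physician_id", "specialty", "payment_usd"])
writer.writeheader()
writer.writerows(records)
row_bytes = buf.getvalue().encode()

# Column-oriented layout: each field stored contiguously (the idea behind Parquet).
columns = {key: [r[key] for r in records] for key in records[0]}
col_bytes = json.dumps(columns).encode()

# Grouping similar values together lets a general-purpose codec
# (zlib here, standing in for Snappy) exploit the repetition.
row_compressed = zlib.compress(row_bytes)
col_compressed = zlib.compress(col_bytes)
print(len(row_bytes), len(row_compressed), len(col_compressed))
```

Columnar storage typically compresses better because each column holds values of one type, often highly repetitive; combined with distributed processing, that is what produced the 85% footprint reduction described above.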
Data Quality Judgment
Six physician review data sources were evaluated - Healthgrades, ZocDoc, WebMD, RateMDs, RealSelf, and Vitals. RateMDs data was rejected outright due to insufficient physician identification and inconsistent scoring. Prioritizing source quality over volume kept the analysis defensible at a policy level.
Technical Leadership
Served as technical lead for a team of less experienced developers, driving architecture, ETL implementation, infrastructure deployment, and documentation - while mentoring the team through the process within a competitive fellowship recognized for technical excellence.
What Was Built
End-to-End Data Pipeline Architecture
- Sources: CMS, WHO, MIPS, State Hospital APIs & Files
- Processing: PySpark distributed processing, Parquet + Snappy compression
- Storage: AWS RDS PostgreSQL + S3 data lake
- Orchestration: Apache Airflow DAG automation
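The shape of the orchestration layer can be sketched in miniature. Task names below are invented stand-ins for the real Airflow operators, and the standard-library `graphlib` only models the dependency ordering a DAG run must respect - the actual deployment runs on Airflow with a schedule.

```python
from graphlib import TopologicalSorter

# Illustrative task graph: each task maps to the set of tasks it depends on.
# (Names are hypothetical stand-ins for the real Airflow operators.)
dag = {
    "extract_cms": set(),
    "extract_who": set(),
    "extract_mips": set(),
    "extract_state": set(),
    "transform": {"extract_cms", "extract_who", "extract_mips", "extract_state"},
    "load_s3": {"transform"},
    "load_rds": {"transform"},
}

# A valid run order: every extract finishes before transform,
# which finishes before either load step.
order = list(TopologicalSorter(dag).static_order())
print(order)
```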
Data Transformations
- Schema normalization across heterogeneous sources
- Temporal alignment for pre/post-pandemic analysis
- Physician and facility entity resolution
- Payment aggregation and quality score indexing
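A minimal sketch of the first three transformations, assuming invented field names and values for the raw CMS and MIPS rows (the real schemas are far wider):

```python
# Hypothetical raw rows from two sources with different schemas
# (field names and values are invented for illustration).
cms_row = {
    "Physician_Profile_ID": "1003",
    "Total_Amount_of_Payment_USDollars": "250.00",
    "Date_of_Payment": "03/15/2021",
}
mips_row = {"npi": "1003", "quality_score": "87.4", "year": "2021"}

def period(year: int) -> str:
    """Temporal alignment: bucket every record as pre-pandemic or pandemic."""
    return "pandemic" if year >= 2020 else "pre_pandemic"

def normalize_cms(row: dict) -> dict:
    """Schema normalization: map a payments row onto the shared analytical schema."""
    *_, year = row["Date_of_Payment"].split("/")
    return {
        "physician_id": row["Physician_Profile_ID"],
        "payment_usd": float(row["Total_Amount_of_Payment_USDollars"]),
        "period": period(int(year)),
    }

def normalize_mips(row: dict) -> dict:
    return {
        "physician_id": row["npi"],
        "quality_score": float(row["quality_score"]),
        "period": period(int(row["year"])),
    }

# Entity resolution (simplified): records that normalize to the same
# physician_id and period describe the same entity and can be merged.
unified = {**normalize_cms(cms_row), **normalize_mips(mips_row)}
print(unified)
```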
Analytical Capabilities
- Physician payment trend analysis by specialty
- Quality score correlation with expenditures
- Geographic healthcare cost comparisons
- Pandemic impact year-over-year metrics
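The year-over-year pandemic metric reduces to simple arithmetic once the records are temporally aligned. The figures below are invented; only the formula reflects the analysis.

```python
# Toy annual expenditure totals (figures invented for illustration).
spend = {2018: 100.0, 2019: 104.0, 2020: 131.0, 2021: 142.0}

def yoy_change_pct(series: dict, year: int) -> float:
    """Percent change versus the prior year."""
    prev = series[year - 1]
    return (series[year] - prev) / prev * 100

impact = {year: round(yoy_change_pct(spend, year), 1) for year in (2019, 2020, 2021)}
print(impact)  # the 2020 jump against the 2019 baseline is the pandemic signal
```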
Data Sources Integrated
CMS Open Payments
Physician payment data from drug and medical device companies
MIPS Clinician Performance
Merit-Based Incentive Payment System quality scores
WHO Global Expenditures
International healthcare spending benchmarks
Wisconsin Hospital Association
Hospital CheckPoint performance and cost metrics
Value Delivered
$200 billion in COVID-19 healthcare costs with no unified analytical view. Data locked in siloed federal systems - each in different formats, each requiring manual interpretation - leaving administrators and policymakers making budget decisions without cross-source intelligence.
A single automated platform processing 32M+ records across all four sources, producing analytical intelligence designed to reach government leaders and inform decisions on maximizing healthcare budgets while maintaining quality care.
Fragmented Data → Unified Intelligence
Four disparate federal and international data sources, each with different schemas and quality issues, unified into a single analytical model enabling cross-source insights that were previously impossible.
Domain Knowledge at Scale
CMS pricing structures, MIPS quality frameworks, and WHO expenditure benchmarks aren't just data - they're complex policy instruments. Understanding them semantically, not just technically, is what made the analysis defensible at a government level.
Policy-Grade Recommendations
The analysis was designed to connect to QALY (quality-adjusted life year) and DALY (disability-adjusted life year) frameworks - the real instruments governments use to allocate healthcare budgets. Findings are designed to reach government leaders making decisions on post-pandemic healthcare spending and quality of care.
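For context, the core QALY arithmetic underlying those frameworks is simple. A toy version, with invented durations and utility weights, looks like:

```python
# Illustrative QALY arithmetic (durations and utility weights are invented).
# QALYs = sum over health states of (years in state * utility weight in [0, 1]),
# where 1.0 is perfect health and 0.0 is equivalent to death.
states = [(2.0, 0.9), (1.0, 0.6)]  # (years, utility) pairs
qalys = sum(years * utility for years, utility in states)
print(round(qalys, 2))
```

Expressing cost findings per QALY (or per DALY averted) is what lets a budget analysis plug directly into the allocation instruments governments already use.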
Recognition
This project received the Distinguished Project Award from the Correlation One DS4 Data Engineering Program, recognizing the sophisticated architecture, successful scaling strategy, and production-quality delivery. Selected from a highly competitive cohort, the award reflects what becomes possible when deep technical execution meets genuine domain expertise.