GoDaddy Data Lineage & Attribution
SDE Intern · NDA-redacted
Problem
Inside GoDaddy's marketing-data org, every campaign sits on top of a long chain of upstream tables pulled from SFMC, MessageGears, CRMs, and a forest of internal APIs. Nobody had a clear picture of which datasets actually drove which campaigns, or how much downstream business value flowed through any given table. Teams were paying to compute and store data with no reliable signal for which tables were load-bearing and which were dead weight.
Approach
I built a cross-system lineage graph by parsing campaign definitions and SQL out of SFMC, MessageGears, and the internal ad-platform APIs, then normalizing them into a single dependency graph in Athena/Redshift. On top of that I shipped a table-value attribution model: it walks the graph backward from active campaigns and assigns weighted credit to each upstream dataset based on the campaign's impact metrics. A "value per GB / compute-hour" score then made it possible to weigh each table's storage and warehouse cost against its actual business contribution. Separately, I built an MCP server powering an internal data-hydration tool so non-technical ops staff could define and schedule validated hydration jobs against the same data assets, closing the loop between "what's worth maintaining" and "who needs to keep it fresh."
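The backward walk and the cost score can be illustrated with a minimal sketch. This is not the redacted production logic: the equal split of credit among direct parents, the blended cost (weighted GB plus weighted compute-hours), and every name below (`attribute_credit`, `value_per_cost`, the table and campaign names, the numbers) are assumptions made up for illustration.

```python
from collections import defaultdict


def attribute_credit(upstream, campaign_impact):
    """Walk the lineage graph backward from each campaign and credit upstream tables.

    upstream: dict mapping a node (campaign or table) to the tables it reads directly.
    campaign_impact: dict mapping a campaign to a single impact number.
    Assumption: each node splits its share equally among its direct parents.
    """
    credit = defaultdict(float)

    def walk(node, share, path):
        # Skip parents already on the current path so a cyclic graph cannot loop forever.
        parents = [p for p in upstream.get(node, []) if p not in path]
        for parent in parents:
            part = share / len(parents)
            credit[parent] += part
            walk(parent, part, path | {parent})

    for campaign, impact in campaign_impact.items():
        walk(campaign, impact, frozenset({campaign}))
    return dict(credit)


def value_per_cost(credit, storage_gb, compute_hours, gb_weight=1.0, hour_weight=1.0):
    """Score each table as credited value divided by a blended cost.

    Assumption: cost = gb_weight * GB stored + hour_weight * compute-hours.
    Tables with no recorded cost get None instead of a division by zero.
    """
    scores = {}
    for table, value in credit.items():
        cost = (gb_weight * storage_gb.get(table, 0.0)
                + hour_weight * compute_hours.get(table, 0.0))
        scores[table] = value / cost if cost > 0 else None
    return scores


# Hypothetical lineage: one campaign reading two tables that share a raw source.
upstream = {
    "campaign:welcome": ["tbl_audience", "tbl_offers"],
    "tbl_audience": ["raw_crm"],
    "tbl_offers": ["raw_crm", "raw_catalog"],
}
credit = attribute_credit(upstream, {"campaign:welcome": 100.0})
scores = value_per_cost(credit, {"raw_crm": 10.0, "raw_catalog": 5.0}, {"raw_crm": 5.0})
```

In this toy graph `raw_crm` collects credit along both paths, so a shared raw source can outrank the intermediate tables built from it. With an equal split, the credit handed out at each depth sums to the campaign's impact; a real weighting scheme would likely differ.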
Outcome
The system produced hundreds of thousands of campaign→table mappings and is now used to guide data-maintenance, deprecation, and cost/retention decisions across the org. Specific figures and internal architecture stay redacted.