Data Provenance: The Secret Life of Data
This module introduces data provenance, the complete record of where your data came from and what happened to it.
You’ll learn:
- What data provenance means and why it’s critical for reproducible science.
- How a spreadsheet error in an influential economics paper went unnoticed for three years while the paper shaped policy debates worldwide.
- The three essential tools for tracking data transformations in your projects.
- Practical consequences for trust, error prevention, and scientific integrity.
The Mystery of the Disappearing Data
Have you ever opened a dataset you worked on a few months ago, only to find that you have no idea where it came from? You can’t remember what the columns mean or what transformations you applied. This experience is surprisingly common in data science.
Think of data provenance as the complete story of your data. It answers the who, what, when, where, and why of your data’s journey from raw form to its current state. Understanding this story is crucial for doing good science and catching errors before they spread.
A Tale of Two Economists and a Spreadsheet Error

Picture this: in 2010, economists Carmen Reinhart and Kenneth Rogoff published a paper called “Growth in a Time of Debt.” Their main argument was simple and powerful: countries with very high government debt, above roughly 90% of GDP, grow noticeably more slowly. The paper became incredibly influential and was widely cited in policy debates around the world, including in arguments for austerity measures.
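Then in 2013, a graduate student named Thomas Herndon tried to reproduce their results. He couldn’t. After persistent effort, he obtained their original spreadsheet and found a simple error: an averaging formula covered too few rows, so five countries were accidentally left out of a key calculation. Working with his professors Michael Ash and Robert Pollin, he also found that some available data had been excluded and that the averages were weighted in an unusual way.
With these issues corrected, average growth for the most indebted countries came out at about 2.2% rather than the -0.1% originally reported, and the sharp drop-off in growth largely disappeared. The spreadsheet slip explained only part of that gap, but it went unnoticed for three years because nobody else could see how the numbers had been produced. Without a clear, documented record of how data was processed, errors slip through unnoticed and spread far, sometimes into decisions that affect entire economies.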
Want the full story? Read “The Reinhart-Rogoff Error.”
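To make the lesson concrete, here is a minimal Python sketch of how a scripted calculation can refuse to run when rows have gone missing. The country names and growth figures are invented for illustration and are not taken from the Reinhart-Rogoff spreadsheet; the function name `mean_growth` and its `expected_count` parameter are likewise hypothetical.

```python
# Hypothetical growth figures (percent), invented for illustration only.
# These are NOT the values from the Reinhart-Rogoff spreadsheet.
growth_by_country = {
    "Country A": 2.0,
    "Country B": 3.0,
    "Country C": 1.0,
    "Country D": 4.0,
    "Country E": 2.5,
    "Country F": -1.0,
    "Country G": 0.5,
}


def mean_growth(rows, expected_count):
    """Average the growth values, refusing to run if rows are missing."""
    if len(rows) != expected_count:
        raise ValueError(
            f"expected {expected_count} countries, got {len(rows)}"
        )
    return sum(value for _, value in rows) / len(rows)


all_rows = list(growth_by_country.items())

# A selection that silently starts too late, much like a spreadsheet
# formula whose cell range skips the first rows.
truncated_rows = all_rows[5:]

full_mean = mean_growth(all_rows, expected_count=len(growth_by_country))
print(f"mean over all {len(all_rows)} countries: {full_mean:.2f}")

try:
    mean_growth(truncated_rows, expected_count=len(growth_by_country))
except ValueError as error:
    print(f"caught the omission: {error}")
```

The point of the check is not the arithmetic. It is that the script states, in a form anyone can read and rerun, how many rows the result is supposed to cover.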
The Data Detective’s Toolkit
Let’s talk about how you actually track provenance. There are three essential approaches.
First, keep a lab notebook. This can be physical or digital, but it should record where your data came from, what you did to it, and why you made each decision. A good lab notebook becomes the narrative companion to your code.
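A digital lab notebook can be as simple as a file you append to. The sketch below is one possible shape, not a standard format: the function names `log_entry` and `read_notebook`, the field names, and the file names in the example entries are all assumptions made for illustration. It writes to a temporary directory so it can be run without leaving files behind.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_entry(notebook, *, author, source, action, reason):
    """Append one timestamped entry to a JSON-lines notebook file."""
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "who": author,
        "where": source,   # where the data came from
        "what": action,    # what was done to it
        "why": reason,     # why that decision was made
    }
    with notebook.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def read_notebook(notebook):
    """Return every entry in the notebook, oldest first."""
    with notebook.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Example usage in a throwaway directory (all values are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    notebook = Path(tmp) / "notebook.jsonl"
    log_entry(
        notebook,
        author="A. Student",
        source="survey_2023.csv",
        action="dropped rows with missing age",
        reason="age is a required input for the model",
    )
    log_entry(
        notebook,
        author="A. Student",
        source="survey_2023.csv",
        action="converted height from inches to cm",
        reason="the rest of the project uses metric units",
    )
    entries = read_notebook(notebook)

for entry in entries:
    print(entry["when"], "|", entry["what"], "|", entry["why"])
```

An append-only file keeps the order of decisions intact, which is most of what a notebook is for.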
Second, embrace scripting. When you process data with code (Python, R, or anything else), your scripts become documentation. They show exactly what transformations happened, in what order. If you keep those scripts under version control, you also have a history of how the processing itself changed over time.
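As a small illustration of a script doubling as documentation, the sketch below names each transformation as its own function and reports how many rows survive. The data, column names, and function names are hypothetical and chosen only to keep the example short.

```python
import csv
import io

# Hypothetical raw data; in a real project this would be read from a file.
RAW_CSV = "name,height_cm\nAda,170\nGrace,\nAlan,180\n"


def load_rows(raw_csv):
    """Step 1: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def drop_incomplete(rows):
    """Step 2: keep only rows where every field is non-empty."""
    return [row for row in rows if all(row.values())]


def add_height_m(rows):
    """Step 3: add a height_m column derived from height_cm."""
    return [{**row, "height_m": int(row["height_cm"]) / 100} for row in rows]


raw_rows = load_rows(RAW_CSV)
complete_rows = drop_incomplete(raw_rows)
final_rows = add_height_m(complete_rows)

print(f"loaded {len(raw_rows)} rows, kept {len(complete_rows)}")
for row in final_rows:
    print(row["name"], row["height_m"])
```

Anyone reading this script can see that one row was dropped and why, which is exactly the information a hand-edited spreadsheet tends to lose.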
Third, for complex projects, consider workflow management tools like Snakemake or Nextflow. These tools let you define and track your entire analysis pipeline, automatically recording which data went into which step and what came out.
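The sketch below is not Snakemake or Nextflow syntax. It is a plain Python illustration of the underlying idea, under the assumption that each step is a function and that data can be fingerprinted with a hash. The names `fingerprint`, `run_pipeline`, `remove_negatives`, and `to_mean` are hypothetical.

```python
import hashlib
import json


def fingerprint(data):
    """Short, stable hash of any JSON-serialisable value."""
    encoded = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def run_pipeline(data, steps):
    """Run each step in order, logging what went in and what came out."""
    log = []
    for step in steps:
        result = step(data)
        log.append(
            {
                "step": step.__name__,
                "input": fingerprint(data),
                "output": fingerprint(result),
            }
        )
        data = result
    return data, log


def remove_negatives(values):
    """Drop readings below zero."""
    return [value for value in values if value >= 0]


def to_mean(values):
    """Collapse the readings to their arithmetic mean."""
    return sum(values) / len(values)


final, log = run_pipeline([4, -1, 6, 2], [remove_negatives, to_mean])

print("result:", final)
for record in log:
    print(record["step"], record["input"], "->", record["output"])
```

Because the output fingerprint of one step matches the input fingerprint of the next, the log forms a chain you can follow from the raw data to the final number. Dedicated workflow tools apply the same idea to files and add features such as rerunning only the steps whose inputs changed.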
Why do all this? When you embrace these provenance practices, you become a more trustworthy data scientist. You’ll trust your own results because you know exactly how they were created. Others will trust them too.
Further Reading
Ready to go deeper? Check out “What is Data Provenance and Why is it Important?” and “Data Provenance: What It Is, Why It Matters, and How to Implement It” for more perspectives.