This module introduces the tidy data philosophy. You’ll learn what makes data “tidy” and what pitfalls to avoid, explore practical pandas tools like melt and pivot to reshape your data, and understand why standardizing data structure makes analysis faster and more reliable.
The 80% Problem
Have you ever spent hours wrestling with a dataset before you could even start analyzing it? It is often said that 80% of data analysis is spent cleaning and preparing data. This isn’t an exaggeration.
Getting your data into the right shape makes everything else easier. The good news is that once you understand the tidy data philosophy, you can apply it consistently across projects. If you want to dive deeper, read Tidy Data by Hadley Wickham (Journal of Statistical Software, 2014).
What is Tidy Data?
At its core, tidy data is a standard way of mapping the meaning of a dataset to its structure. Whether your data is messy or tidy depends entirely on how rows, columns, and tables match up with observations, variables, and types of observational units.
Let’s talk about the three core principles. First, each variable forms its own column. A variable measures the same underlying attribute (like height, temperature, or duration) across different units.
Second, each observation forms a row. An observation captures all measurements on the same unit (like a person, a day, or a race) across different attributes.
Third, each type of observational unit gets its own table. In a study of allergy medication, you’d have separate tables for demographic data, daily medical data, and meteorological data, not one giant table mixing everything together.
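To make these principles concrete, here is a minimal sketch in pandas (the library behind the melt and pivot tools this module covers), using a small hypothetical table that satisfies all three: each column is one variable, each row is one observation, and the table holds a single observational unit (a daily measurement per person).

```python
import pandas as pd

# A tidy table: one observational unit (a person's daily measurement),
# one variable per column, one observation per row.
measurements = pd.DataFrame({
    "person":      ["alice", "alice", "bob", "bob"],
    "day":         [1, 2, 1, 2],
    "temperature": [36.6, 36.8, 37.1, 36.9],
})
print(measurements)
```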
Why does this matter? Tidy datasets are dramatically easier to manipulate, model, and visualize. They make exploration faster and analysis clearer. Most importantly, they standardize data organization, making your code reusable and reliable.
Common Pitfalls
Now let’s flip the perspective and look at the most common mistakes. When you first encounter messy data, it usually falls into one of five patterns.
The first problem is that column headers often contain values instead of variable names. Imagine a table where months (“Jan”, “Feb”, “Mar”) are the column headers, rather than having a single “Month” column with those values.
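pandas calls the fix for this pattern melting. A minimal sketch, using hypothetical city-temperature data, that turns month headers back into values of a single month column:

```python
import pandas as pd

# Messy: month names are column headers, not values.
messy = pd.DataFrame({
    "city": ["Boston", "Denver"],
    "Jan":  [30, 28],
    "Feb":  [32, 35],
    "Mar":  [41, 44],
})

# Tidy: melt the month columns into a single "month" variable,
# with the measurements in a "temp" column.
tidy = messy.melt(id_vars="city", var_name="month", value_name="temp")
print(tidy)
```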
The second problem is multiple variables stored in one column. You might find a column like “height_weight” containing values like “5.5_130” instead of splitting those into separate “height” and “weight” columns.
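A minimal sketch of one way to repair this in pandas, assuming the hypothetical height_weight column described above: split on the delimiter, then convert each piece into its own numeric column.

```python
import pandas as pd

messy = pd.DataFrame({
    "person": ["alice", "bob"],
    "height_weight": ["5.5_130", "6.1_180"],
})

# Split the combined column on "_" into two pieces,
# convert each to a numeric column, and drop the original.
parts = messy["height_weight"].str.split("_", expand=True)
messy["height"] = parts[0].astype(float)
messy["weight"] = parts[1].astype(float)
tidy = messy.drop(columns="height_weight")
print(tidy)
```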
The third problem is variables scattered across both rows and columns. For example, one variable (like the day a measurement was taken) might live in the column headers while another variable (like whether a reading is a minimum or a maximum temperature) lives in the values of a column, so neither axis of the table corresponds cleanly to variables.
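Fixing this usually takes two steps: melt the headers that hold values into rows, then pivot the variable names stored in rows back out into columns. A minimal sketch with a hypothetical weather-style table, where day labels sit in the headers and an element column says whether each row holds minimum or maximum temperatures:

```python
import pandas as pd

# Messy: days live in the column headers (values as headers),
# while "element" stores variable names (variables as rows).
messy = pd.DataFrame({
    "station": ["s1", "s1"],
    "element": ["tmax", "tmin"],
    "d1":      [30.1, 14.8],
    "d2":      [29.7, 16.5],
})

# Step 1: melt the day columns into a single "day" variable.
long_form = messy.melt(id_vars=["station", "element"],
                       var_name="day", value_name="temp")

# Step 2: pivot "element" back out so tmax and tmin become columns.
tidy = long_form.pivot(index=["station", "day"],
                       columns="element", values="temp").reset_index()
print(tidy)
```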
The fourth problem is mixing different types of observational units in one table. For example, a single table containing both patient demographic information and medical test results mashes two fundamentally different kinds of data together.
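A minimal sketch, with hypothetical column names, of separating such a table into one table per observational unit, linked by a shared key:

```python
import pandas as pd

# Messy: demographics and test results in one table, so each
# patient's demographic data repeats on every test row.
mixed = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "name":       ["Ada", "Ada", "Grace"],
    "birth_year": [1990, 1990, 1985],
    "test":       ["glucose", "iron", "glucose"],
    "result":     [5.4, 13.1, 4.9],
})

# One table per observational unit, linked by patient_id.
patients = mixed[["patient_id", "name", "birth_year"]].drop_duplicates()
tests = mixed[["patient_id", "test", "result"]]
print(patients, tests, sep="\n\n")
```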
The fifth and final problem is splitting a single observational unit across multiple tables. Patient information scattered across one table for addresses, another for test results, and another for appointments, with no clean way to link them together, makes every analysis painful.
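The remedy is the mirror image of the previous one: give every table a shared key, then join tables only when an analysis needs them together. A minimal sketch, reusing the hypothetical patients and tests tables keyed by patient_id:

```python
import pandas as pd

patients = pd.DataFrame({
    "patient_id": [1, 2],
    "name":       ["Ada", "Grace"],
})
tests = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "test":       ["glucose", "iron", "glucose"],
    "result":     [5.4, 13.1, 4.9],
})

# Join the tables on the shared key when an analysis needs both.
combined = patients.merge(tests, on="patient_id", how="inner")
print(combined)
```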