Data Science

SydneyF · ‎02-04-2021

Unsurprisingly, it is really hard to do data science without data. Also unsurprisingly, because the data itself is so important, a lot of resources are dedicated to manipulating, regulating, and storing data. With all of that in mind, I think it is important as an analyst or data scientist to have at least an understanding of how and why our data is structured - something called data normalization.

Starting at the beginning, when we think about data, the first visual that comes to mind looks something like an Excel spreadsheet. We think about data in the form of a table, where each row is an individual recording of whatever the dataset is capturing, and the details of each of the records are organized into columns.

This makes sense – tables are a tidy way to keep data – every cell in the table has clear and explicit meaning based on the row and column it sits in. This concept of the meaning of rows and columns is foundational in most data architectures – and applies in large databases just as much as it does to spreadsheets.

Here, we have an (imaginary) dataset of animal observations from a collection of camera traps. Each row is a recorded observation, and the columns in the dataset detail information about the time, location, and animal that was observed. This single data table includes lots of different types of data, relating to different aspects of animal sightings.

The reality is that we can do better than this - particularly in terms of eliminating redundant data, making the relationships between our columns explicit, and grouping our data together in helpful sub-groups. And while this might not matter very much for my ten rows of fake camera-trap data, it does start to matter very much when working with thousands of rows of real-world data.

For every animal observation, we record a location ID, as well as details about the location – like latitude, longitude, and elevation. This is where you might start to notice some redundancy.

In this dataset, our observation stations (imaginary camera traps) don’t move – so this information for each station doesn’t change. So what if we broke out this little group of variables, and made them their own table?

And then removed all of the duplicate rows?

Without losing any information about the individual stations, we have managed to eliminate four rows of data. When we look back at our original data table, if we keep the Station_ID variable, we map our original observation data back to detailed information about our stations in our mini station table. This allows us to remove Latitude, Longitude, and Elevation from our original table.

Splitting up the data like this also makes it explicit that the locations of the stations do not change. The individual observations are still tied to the locations they occurred at in the main table, and if you’d like to have more details about the location an observation occurred at, you can use the new location table as a reference.

We have clearly identified the features that “belong” or depend on station ID by splitting the information out into its own table and creating a relationship between this table and our table of observations.

We can repeat the same process with our animal details as well.

Each observed animal has a unique ID, and details about those animals like species, color, and the date they were first observed are dependent on the animal itself. With this in mind, we can comfortably create another separate table to track the details of the animal.

The rule to follow here is that every column that “names” or identifies a thing (like a static location or a specific animal), should be split into its own group, along with all of the columns that add detail to or describe that thing. The result is each table should end up with a unique identifier (i.e., ID column) and then a collection of columns that directly describe features of that unique ID.

In Data Base Administrator (DBA) or Data Engineering terms, what we have accomplished by splitting up our data and identifying the relationships between the tables is called second normal form (abbreviated as 2NF).

If you’re wondering - “Why is this called 2^nd form? Is there a first normal form? Or a third?” - the answer is a strong “yes”.

2NF sits right in between first and third normal form. In this hierarchy of data normalization, it is mandatory that in order to move up a rung, your data needs to comply with the rules for all of the rungs below it.

First Normal Form

The basic idea behind 1^st normal form is that each cell represents just one piece of information, and this piece of information only belongs to one column. So, it doesn’t allow for shenanigans like this, where you have a list stored in a cell:

Or this, where you have two columns with the same information:

Lists of information in cells are bad idea in part because to computers, a list that says peas, lettuce, tomatoes is not the same thing as tomatoes, lettuce, peas. This idea is called Atomic columns.

Duplicate columns are just pointless and redundant, and open up opportunities for bad data. (What happens if you update color but not hue? Who are you really going to trust here?)

If you’re not accomplishing 1^st normal form, your data is considered to be in unnormalized form (UNF).

Third Normal Form

Moving up a step in database normalization excellence to 3^rd normal form (3NF) you would separate out any columns that can be derived from another column. So if you have something like “birthday” and “age” (or first observed data, and total days since first observed), you would be violating third normal form.

The way to fix this while maintaining both pieces of information is to create another table, where the date would act as the key, and the derived data would be stored in its own column.

Fourth Normal Form?

Hopefully, this is making sense to you, and maybe you’re starting to wonder if there is a fourth or fifth normalized form. The answer is yes, but we aren’t going to get into it because 1NF, 2NF, and 3NF are the OG normalized forms established by Edgar F. Codd in the early 1970s, and this seems as good a cutoff point as any.

“Normalized” in the data engineering world typically refers to third normal form – so if this is making sense to you, you have a solid foundation in your data structure journey.

I thought Normalization had Something to Do with Scaling Data?

Yes. Database normalization is different from what normalization means to statisticians. You can read more about this type of normalization (also called standardization or feature scaling) here and here.

Unfortunately, using the same word to mean different things is common between the groups involved in the wide world of data, and I think it is an artifact of the deeply interdisciplinary nature of our field. Maybe we should work on “normalizing” our language as well. 😊

What About Machine Learning? Shouldn’t I Just have One Table?

Machine learning algorithms typically expect data in a single table as an input. These tables need to have unique observations for each row in the table, so if you’re trying to estimate something like “How much will customer X spend on a unicorn?” you might have a dataset where each of your customers is represented as a single row (observation), and the variables are features related to purchasing a unicorn.

If you have normalized data, this means you probably have a table of customer information, and then a second table with data about transactions that your customers have made. In order to use the transaction data, you will need to perform queries with the customer table that relate the data in the transaction table as single values. If you had this data in a single table, you would need to aggregate your data to a point where you only have one row per customer. This is a part of a critical process to prepare data and engineer key features to setup your algorithm for success.

A machine learning algorithm can only be as good as the data it was given (also frequently expressed as, garbage in, garbage out). The more you can help the algorithm out by providing it with clean data and meaningful features, the happier you will be. Regardless of whether you store your data in a single table instead of multiple tables, you will need to be thoughtful about getting your data to a point where it is ready for analysis.

When it comes to data normalization, using tables and defining relationships between them is your friend. This helps eliminate redundancies and help ensure that there is only one source of truth for all of your data.