
 

Like many things (the moon’s phases, seasons, and the hydrologic cycle), the data science process is cyclical.

 

The cyclical nature of the data science lifecycle is dependent on topic expertise, which is both the start and end of any data science project. When someone has expertise in a topic, they tend to want to know even more about it, which leads to asking questions. Questions lead to investigation and (hopefully) answers, resulting in even more knowledge of the topic. This, in turn, leads to more questions, which kicks off the whole process once again. This is what data science looks like in action.

 

In this article, I’d like to unpack each step of the data science lifecycle and talk about how, like any scientific endeavor, data science is a fluid process, molded by the people doing the analysis.

 

Let’s get started with this beautiful chart I bribed our graphic designers into making for me:


 

[Figure: the data science lifecycle diagram]

Isn't it lovely?



Once you’ve digested this, we can move on to the words.

 

Topic Expertise

 

Humans are inquisitive by nature. Data scientists (and scientists in general) tend to engage with the world with the question “Why?” That is what makes data science an important asset to a business; the nature of the field is to seek explanations for why things are the way they are. This deeper understanding leads to better and more confident data-driven decisions.

 

To ask meaningful questions about anything, you need to have a base understanding of that thing as it is. Without this base understanding, any analysis you perform risks not being helpful or meaningful. A foundation of topic expertise defines the need for a data science project.

 

Topic expertise also allows the scope and outcomes of data science projects to be clearly defined. With background knowledge, you know what is possible as well as what is reasonable. Having a clear vision of what your end goal looks like is what ultimately enables a project to be successful.

 

Data Acquisition

 

Also known as data discovery or data collection, data acquisition is where you start gathering the data you need to answer the question you’ve defined. You might be given a data set to work with right away (enter Kaggle-type projects), or you might have to start by coming up with a wish list of data and hunting it down yourself.

 

Data acquisition might include (but is not limited to) web scraping, database queries, making some phone calls and writing some emails to request data, creating labeled features by hand (or if you’re lucky, paying or tricking someone else into doing it), or setting up infrastructure to capture data (depending on the size of your team, you might get to pass this off to a data engineer).
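To make that concrete, here is a minimal sketch of two common acquisition paths in Python. The API endpoint, database file, table, and column names are all hypothetical placeholders; swap in whatever sources your project actually uses:

```python
import sqlite3

import pandas as pd
import requests

# Hypothetical sources -- replace with your real endpoint and database.
API_URL = "https://example.com/api/daily-sales"
DB_PATH = "warehouse.db"

# 1. Pull records from a web API.
response = requests.get(API_URL, params={"start": "2019-01-01"}, timeout=30)
response.raise_for_status()
api_df = pd.DataFrame(response.json())

# 2. Query an internal database for related history.
with sqlite3.connect(DB_PATH) as conn:
    db_df = pd.read_sql("SELECT * FROM sales WHERE year >= 2019", conn)
```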

 

Data acquisition will vary in difficulty from project to project. You might need to get creative with the features you use as a proxy for what you are actually interested in measuring. That said, everything that follows data acquisition will depend on the quality of the data you are able to collect and how well you process it. Spend the time making sure you get this step, and the next, right!

 

Data Preparation

 

No matter how you acquired it, once you have your data, you need to clean and prepare it for analysis. This includes integrating disparate data sources, handling missing values and outliers, and even starting the all-important, secret-sauce process of feature engineering: converting the variables you collected into variables that will be more valuable to the algorithm you end up using. You might have noticed that the inner arrow on this step points both backward and forward. If so, well spotted! This is because we have entered the iterative parts of the data science lifecycle.
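As a rough illustration, here is what a first pass at preparation might look like with pandas. The file, column names, and thresholds are hypothetical; your own cleansing rules will depend entirely on your data:

```python
import pandas as pd

# Hypothetical raw data with the usual problems: missing values,
# outliers, and a raw column worth engineering into better features.
df = pd.read_csv("raw_sales.csv", parse_dates=["order_date"])

# Handle missing values: fill numeric gaps with the median.
df["order_total"] = df["order_total"].fillna(df["order_total"].median())

# Flag (rather than silently drop) outliers beyond 3 standard deviations.
z = (df["order_total"] - df["order_total"].mean()) / df["order_total"].std()
df["is_outlier"] = z.abs() > 3

# A first pass at feature engineering: derive variables the model
# can use more readily than a raw timestamp.
df["order_month"] = df["order_date"].dt.month
df["is_weekend"] = df["order_date"].dt.dayofweek >= 5
```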

 

During data preparation, or even further down the line in your process, you might find you actually need to go back and gather more data. That’s okay! It might not feel like it, but you’re still making progress towards your end goal. The more you work with your data, the better you will understand it, and that might mean finding gaps or missing information. Taking a step back to handle these shortcomings will only make the outcome of your project more robust.

 

Data preparation is the most time-consuming (and in many ways, most important) step in the data science cycle. Surveys regularly find that analysts and data scientists spend up to 80% of their time on data preparation and cleansing.

 

Data Exploration

 

Now that you’ve found the data and cleaned it up to a point where it is usable, you can start forming hypotheses to test and spend time really getting to know your data. Data exploration (also known as data mining) is all about identifying and understanding patterns in your data set and includes identifying relationships and potentially important features with statistical analysis. The better you know and understand the data you are working with, the better your modeling outcomes will be, so don’t hesitate to sink a lot of your time here.
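Continuing the hypothetical sales example from the preparation sketch above, a first round of exploration might look something like this (the column names are placeholders):

```python
# df is assumed to be the prepared DataFrame from the earlier sketch.
print(df.describe())            # distributions at a glance
print(df.isna().mean())         # remaining missingness by column

# Look for relationships between numeric features and a variable of interest.
print(df.corr(numeric_only=True)["order_total"].sort_values())

# Compare groups you suspect behave differently.
print(df.groupby("is_weekend")["order_total"].agg(["mean", "median", "count"]))
```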

 

This is another iterative step in an iterative process; don't be surprised when you find yourself taking one or two steps back to perform additional cleansing and feature engineering based on what you find during exploration.

 

Predictive Modeling + Evaluation

 

Once you’ve spent time getting to know your data, and have it in a clean format, you can start training predictive models.

 

Early in this stage of the lifecycle, predictive modeling and data exploration can kind of blend together. As you start training models with your data and evaluating the outcomes, you’ll likely notice new things about the features in your data set. You might take another step back to iterate upon your feature engineering and try different combinations of your features.

 

When you are training and evaluating predictive models, be sure to try many different types of models; remember there is no reason to prefer one family of models over another without knowing something about your data (i.e. there is no free lunch).

 

As you build models, you need to assess them. A best practice is to use a separate validation dataset to determine how well a model is performing on unseen observations. This is another iterative process, where you will keep testing and refining models until you end up with one you’re happy with.
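Putting those last two ideas together, here is a minimal scikit-learn sketch that holds out a validation set and compares a few model families side by side. The feature matrix X and labels y are assumed to come from your prepared data, and the candidate models and metric are purely illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hold out a validation set to score models on unseen observations.
# X and y are assumed to come from the prepared data above.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Try several model families rather than committing to one up front.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "k-nearest neighbors": KNeighborsClassifier(),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))
    print(f"{name}: validation accuracy = {score:.3f}")
```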

 

Interpretation + Deployment

 

Once you have a model you feel good about, you can move on to your final destination. What your outcome looks like will depend on how you defined the scope of your project in step 1.

Your outcome might be an interpretation of the data and results, where you use the model and all of the analysis you’ve conducted throughout the lifecycle to answer the question you started with. This is an important process, and it can be difficult to do it correctly.

 

It could be that your model is destined for deployment, where it will be used in real time (or near-real time) to help your stakeholders make data-driven decisions or automate a process (if this is your outcome, don’t forget about ongoing upkeep and maintenance).
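If deployment is your destination, the handoff can be as simple as persisting the trained model so a long-running service can load it once and score incoming records. Here is a minimal sketch using joblib; the file name and helper function are hypothetical:

```python
import joblib

# Persist the chosen model so a deployed service can load it.
joblib.dump(model, "model.joblib")

# In the deployed process: load once, then score incoming records.
deployed = joblib.load("model.joblib")

def predict_one(record):
    """Score a single incoming observation (record is assumed to be a
    feature vector matching the training columns)."""
    return deployed.predict([record])[0]
```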

 

And Again

 

Regardless of whether your final goal was interpretation or deployment, this final step is where the cycle resets itself. The ultimate outcome of any data science project should be that you learned something new about the phenomenon you were investigating. This increases your topic expertise, which means you are now equipped to ask new questions.

 

The Data Science Lifecycle is Molded by the Person Doing It

 

One interesting aspect of the data science lifecycle is that putting it into action is heavily influenced by the person or people doing it. The objectivity of any scientific field is a myth. We all bring our own perspectives, processes, and preferences into any project we take on. The ways I might approach finding an answer to a question or even the questions I think of to ask are likely different from yours. With that in mind, I tried to emphasize the iterative nature of my process, while still leaving it open to your own interpretation. This is also why incorporating feedback and diversity into any project is so important – with more people considering a project, you are more likely to account for different aspects of a phenomenon.

 

Another interesting point about the data science lifecycle (and data science as a field) is that there is not a single, definitive version of it. If my write-up isn’t connecting with you, there are many other definitions and resources for you to choose from.

 

Other Versions

 

One version of the lifecycle and article I really like is from Sudeep Agarwal, posted on his personal blog. It is bold and colorful, and the steps are descriptive and clear.

 

Another interpretation I think is neat is a more linear visualization of the process, published as a conference talk in the context of healthcare analytics. I really like how each step is represented with a relative distribution of the time spent on it.

 

A classic version you might have seen before or find useful is the Cross-Industry Standard Process for Data Mining (CRISP-DM).

 

This article on Medium includes a cool figure-eight version that loops through the building and deploying of a model.

 

And here is Microsoft’s more interconnected/diamond shaped version, another classic standby.

 

The End (or the Beginning?)

 

The exact form your process takes doesn't matter so much, as long as your steps are thoughtful and your analysis is rigorous and robust. What does matter is that you take the time to think about a process, that you are always open to learning something new along the way, and that you aren’t afraid to go backwards if something needs to be changed or worked on further.

 

I hope that sharing this interpretation of the Data Science Lifecycle has helped you learn something that you can take with you on your next iteration through it 😊.

 

Sydney Firmin

A geographer by training and a data geek at heart, Sydney joined the Alteryx team as a Customer Support Engineer in 2017. She strongly believes that data and knowledge are most valuable when they can be clearly communicated and understood. She currently manages a team of data scientists that bring new innovations to the Alteryx Platform.
