Are you new to data science? Are you transitioning from a business analyst to a citizen data scientist? Or are you a seasoned data scientist with a relevant degree from university?
Regardless of when you started out on the journey of machine learning, have you ever felt lost when facing the long list of models, not knowing which one you should choose for your problem? Or maybe you are familiar with Logistic Regression and Linear Regression, but have always wondered what those other algorithms can be used for?
I would like to share with you the Alteryx Predictive Flowcharts that our Data Science practice created here at TrueCue.
The Predictive Flowcharts visualise some common considerations that analysts face when choosing a predictive algorithm. They offer guidance on which model to use based on, for instance, what kind of data you are trying to predict, what volume of data you have, and how important it is for the model to be interpretable.
The flowcharts are designed to accompany TrueCue’s Predictive Analytics Alteryx training for novice data analysts and provide a starting point for learning the Predictive Analytics toolbox in Alteryx. A trained Data Scientist will spot some simplifications and generalisations.
The Data Investigation tool category includes tools for understanding the data to be used in a predictive analytics project, and tools for conducting specialised data sampling tasks for predictive analytics. Understanding what your data looks like is the first step of designing a machine learning solution.
Model selection plays a crucial role in a predictive project. When we get our data, we typically start with some basic descriptive analysis to investigate and understand the data we are dealing with. Then based on the predictive goal, we determine if we have a Classification problem (where we want to classify data into groups or categories, e.g. predicting if a loan applicant will default), or a Regression problem (where we want to predict numbers, e.g. predicting how many software licenses we are going to sell next quarter).
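As a rough illustration of that first decision, the sketch below guesses the problem type from a few example target values. The function name `problem_type` and the sample values are made up for this post, and the rule is deliberately simplistic: categories that happen to be stored as numbers (for example 0/1 flags) would be misread as Regression, so treat it as a starting point rather than a rule.

```python
from numbers import Number

def problem_type(target_values):
    """Guess the problem type from example target values (a rough heuristic).

    Assumption: numeric targets mean Regression, anything else
    (text labels, booleans) means Classification.
    """
    values = list(target_values)
    # bool is a subclass of int in Python, so exclude it explicitly.
    numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    return "regression" if numeric else "classification"

# Hypothetical targets matching the two examples in the text.
loan_outcomes = ["default", "repaid", "repaid"]   # categories
licenses_sold = [120, 135, 150.5]                 # numbers

print(problem_type(loan_outcomes))
print(problem_type(licenses_sold))
```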
After deciding whether we have a Classification or Regression problem, we can move on to model selection. You will see in the flowcharts that some models can be used for both Classification and Regression, while others can only be used for one of the two. Several models may be suitable in a given situation, and it is usually not clear in advance which one will perform best.
This is why we have a Validation process: we split the data, train the selected models, and validate their performance on a “hold-out” dataset (which was hidden from all the models during the training stage) so that we can compare them. Sometimes we split the data into multiple folds so that we can test performance several times and get a more robust estimate. This is called cross-validation.
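For readers who like to see the idea in code, here is a minimal sketch of cross-validation in plain Python. It is not how the Alteryx tools work internally. The two toy models (a mean baseline and a straight-line fit), the made-up data, and all function names are assumptions chosen only to show the mechanics of holding out each fold in turn and comparing average errors. A single hold-out split is the same idea with just one train/validation partition.

```python
import random
from statistics import mean

def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    avg = mean(ys)
    return lambda x: avg

def fit_line(xs, ys):
    """Least-squares straight line: y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def mean_abs_error(model, xs, ys):
    """Average absolute difference between predictions and actual values."""
    return mean(abs(model(x) - y) for x, y in zip(xs, ys))

def k_fold_splits(n_rows, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    for fold in range(k):
        valid = indices[fold::k]          # every k-th shuffled row
        valid_set = set(valid)
        train = [i for i in indices if i not in valid_set]
        yield train, valid

def cross_validate(fit, xs, ys, k=5):
    """Train on k-1 folds, score on the held-out fold, average the errors."""
    scores = []
    for train, valid in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(
            mean_abs_error(model, [xs[i] for i in valid], [ys[i] for i in valid])
        )
    return mean(scores)

# Made-up data: a noisy straight line.
rng = random.Random(42)
xs = [float(x) for x in range(40)]
ys = [1.0 + 2.0 * x + rng.uniform(-3, 3) for x in xs]

candidates = {"mean baseline": fit_mean, "linear": fit_line}
cv_scores = {name: cross_validate(fit, xs, ys) for name, fit in candidates.items()}
winner = min(cv_scores, key=cv_scores.get)  # lowest average error wins
print(cv_scores, "->", winner)
```

Because every model is scored on rows it never saw during training, the comparison reflects how each one is likely to behave on new data rather than how well it memorised the training set.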
Once we have a winning model, we can use it to create predictions on new data. This is called Inference (or scoring).
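A minimal sketch of scoring, again in plain Python rather than Alteryx: a model is fitted once on historical records and then applied to records it has never seen. The sales figures, the quarter numbering, and the choice of a straight-line model are all invented for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares straight line: y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

# Hypothetical history: licenses sold in quarters 1 to 6.
quarters = [1, 2, 3, 4, 5, 6]
licenses = [110, 120, 135, 140, 155, 160]

# Training happens once, on the historical data.
model = fit_line(quarters, licenses)

# Scoring: apply the trained model to new, unseen quarters.
new_quarters = [7, 8]
scores = [round(model(q)) for q in new_quarters]
print(dict(zip(new_quarters, scores)))
```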
If you find the charts useful or would like to share your thoughts or comments, please drop us a line or reply to this blog. We would love to hear from you!
The flowcharts were created by Katelyn Weber (analytics) and Jakub Szepietowski (design).