Are you new to data science? Are you transitioning from a business analyst to a citizen data scientist? Or are you a seasoned data scientist with a relevant degree from university?
Regardless of where you are on your machine learning journey, have you ever felt lost when facing the long list of models, not knowing which one to choose for your problem? Or maybe you are familiar with Logistic Regression and Linear Regression, but have always wondered what those other algorithms can be used for?
I would like to share with you the Alteryx Predictive Flowcharts that our Data Science practice created here at TrueCue.
The Predictive Flowcharts visualise some common considerations that analysts face when choosing a predictive algorithm. They offer guidance on model selection: for instance, what kind of value you are trying to predict, how much data you have, and how important it is for the model to be interpretable.
Click image for an interactive version! The flowcharts are designed to accompany TrueCue’s Predictive Analytics Alteryx training for novice data analysts and provide a starting point for learning the Predictive Analytics toolbox in Alteryx. A trained Data Scientist will spot some simplifications and generalisations.
The Data Investigation tool category includes tools for understanding the data to be used in a predictive analytics project, and tools for conducting specialised data sampling tasks for predictive analytics. Understanding what your data looks like is the first step of designing a machine learning solution.
Model selection plays a crucial role in a predictive project. When we get our data, we typically start with some basic descriptive analysis to investigate and understand the data we are dealing with. Then based on the predictive goal, we determine if we have a Classification problem (where we want to classify data into groups or categories, e.g. predicting if a loan applicant will default), or a Regression problem (where we want to predict numbers, e.g. predicting how many software licenses we are going to sell next quarter).
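In Alteryx this distinction is handled by the drag-and-drop predictive tools, but the same idea can be sketched in code. Below is an illustrative Python example using scikit-learn; the loan and licence figures are made up purely to show that a classifier predicts a category while a regressor predicts a number:

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: each row is [loan_amount, applicant_income].
X = [[5000, 30000], [20000, 25000], [1000, 80000], [15000, 40000]]

# Classification: the target is a category (did the applicant default? 0/1).
y_class = [0, 1, 0, 1]
clf = LogisticRegression(max_iter=1000).fit(X, y_class)
print(clf.predict([[8000, 35000]]))  # outputs a class label (0 or 1)

# Regression: the target is a number (licences sold).
y_reg = [120.0, 80.0, 300.0, 150.0]
reg = LinearRegression().fit(X, y_reg)
print(reg.predict([[8000, 35000]]))  # outputs a continuous value
```

The same features can serve either goal; what changes is the type of target you feed the model, and therefore which family of algorithms applies.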
After deciding whether we have a Classification or Regression problem, we can move on to model selection. You will see in the flowcharts that some models can be used for both Classification and Regression, while others can only be used for one of the two. There may be multiple models that suit a given situation, and it is not always obvious in advance which one will perform better.
This is why we have a Validation process, where we split the data, train the selected models, and validate their performance on a "hold-out" dataset (which was hidden from all the models during the training stage) so that we can compare model performance. Sometimes we split the data into multiple folds so that we can test performance multiple times and increase robustness – this is called cross-validation.
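The hold-out and cross-validation steps above can be sketched in code. This is an illustrative scikit-learn example rather than the Alteryx workflow itself; the dataset is synthetic and the two candidate models are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic classification data standing in for a real project dataset.
X, y = make_classification(n_samples=200, random_state=0)

# Hold-out validation: the test split stays hidden during training,
# then both candidate models are compared on it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))

# Cross-validation: 5 folds, each used once as the hold-out set,
# giving a more robust estimate than a single split.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("CV accuracy:", scores.mean())
```

Whichever model wins this comparison becomes the one carried forward to the next stage.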
Once we have a winning model, we can then use it to create predictions – this is called Inference (or scoring).
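In code, the inference (scoring) step is simply applying the trained winner to new, unlabelled records. A minimal scikit-learn sketch, again on synthetic data rather than the Alteryx scoring tool:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train the winning model on all available labelled data...
X, y = make_classification(n_samples=200, random_state=0)
winner = LogisticRegression(max_iter=1000).fit(X, y)

# ...then score new, unlabelled records (inference).
X_new, _ = make_classification(n_samples=5, random_state=1)
predictions = winner.predict(X_new)          # predicted class per record
probabilities = winner.predict_proba(X_new)  # class probabilities per record
print(predictions)
```

For classification, scoring can return either hard labels (`predict`) or class probabilities (`predict_proba`), depending on whether the downstream decision needs a yes/no answer or a confidence.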
Bingqian believes in the power of Analytics and Data Science in uncovering insights and helping to better inform decision making. As a Senior Consultant and Data Science Lead at TrueCue, she enjoys finding solutions for challenges in data consolidation, modelling, visualisation and Advanced Analytics. She leverages modern technology such as Alteryx, Tableau, DataRobot, and Microsoft Azure Machine Learning, and is one of the 17 Certified Alteryx Experts in the world. Outside of work, she enjoys a wide range of activities, from oil painting and poetry reading to scuba diving, boxing, and krav maga. Find @bingqian_gao on LinkedIn, or reach out via email.