
raymond-peck
Alteryx Alumni (Retired)

Why Alteryx Machine Learning?

 

The field of Machine Learning provides us with many tools that we can use both for data understanding and for prediction. Deep expertise and experience are usually required to do ML projects effectively and correctly, since Data Science and Machine Learning are full of potholes and minefields that can derail a project, even for experts.

 

Alteryx ML automates many kinds of analyses and data checks to help you avoid these problems so that you can be more confident in your results. Whether you're a Data Analyst or domain expert who's new to Predictive Analytics and ML, or a Data Scientist who spends their days doing this kind of work, Alteryx Machine Learning can make your life easier and help ensure that your projects follow best practices. In this blog post I show you some examples of this in action.

 

What can ML Do For Me?

 

Machine Learning (ML) is all about automatically finding patterns in your data. These patterns can help you understand the information hidden in your data so that you can take actions to support your day-to-day activities. Alteryx Machine Learning has tools to automatically find some of these patterns for you, and it makes it easy for you to explore your data. Rather than having to check manually for every possible problem, Alteryx ML directs your attention to where it's most needed.

 

After investigating your raw data, you can train models to gain even more insights. Machine Learning models find patterns automatically and surface them for you to explore using various tools. They also allow you to make predictions on new data. Alteryx Machine Learning automatically takes many actions to help ensure that you can count on these predictions.

 

An Example: HR employee data

 

Imagine that you work in an HR department and would like to understand the factors that drive employee satisfaction and turnover. Employee attrition is expensive and often disheartening to those left behind, so you'd like to put a big focus on retention. You probably have a lot of historical data at your fingertips, perhaps something that looks like this:

 

hr-preview.jpg

 

If you'd like to play along, this dataset is available here: https://www.kaggle.com/giripujar/hr-analytics.

 

The two columns that are of most interest are satisfaction_level and left. We can see that satisfaction_level is numeric, while left is a yes/no flag, a classification of whether the employee left the company or not. Machine Learning can help us understand and predict both kinds of columns. Prediction of a continuous numeric value is called regression, and prediction of a column that can take a limited number of fixed values, or classes, is called classification. In this case, left has only two possible values, so it's a common special case called binary classification.
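If you'd like to see this framing outside of the product, here's a minimal Python sketch using pandas. It assumes you've saved the Kaggle file locally as hr.csv; the column names below are as they appear in that file.

```python
import pandas as pd

# Load the HR dataset (assumed to be saved locally as "hr.csv").
df = pd.read_csv("hr.csv")

# `left` takes only two values, so it's a binary classification target.
print(df["left"].value_counts())

# `satisfaction_level` is continuous, so predicting it is a regression problem.
print(df["satisfaction_level"].describe())
```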

 

We'll start out by taking a look at our data, and then we'll build models to dig in deeper. These models will learn more patterns than we can find just by looking at the raw data, even with sophisticated data exploration tools.

 

Data Checks and Data Quality

 

Right off the bat, Alteryx ML does a number of data checks to draw our attention to specific problems that may lie within our dataset. This particular data is pretty well-formed, so we don't get any data check messages when we load it. We can see here that our data health is quite good for the purposes of Machine Learning:

 

03 - Data Health.jpg

 

Note that the section on the upper right says that our Distribution by Column isn't ideal. It turns out that this is due to natural skew in columns like time_spend_company, as we'll see in the next section. This is ok unless we need a very high degree of accuracy in our predictions, especially for the tree-based models we'll learn about below. If we do, the built-in ML Assistant text in the right sidebar will suggest how we can transform our data to reduce the skew.

 

The ML Assistant

 

The ML Assistant text available in Alteryx ML's right sidebar gives you clear guidance on all the screens and visualizations in the product. Everything that we see in Alteryx ML is explained there, along with guidance about any actions you should take. The tiny book icon at the upper right lets you keep the ML Assistant open all the time or bring it up only when you click the little ⓘ icons you'll find.

 

Exploring Our Data

 

Alteryx ML gives us a preview of our data with some visualizations:

 

01 - Data Profiling a.jpg

 

By looking at the left column we can see that we've lost almost 24% of our employees over the period that the dataset covers. Very few employees have been with the company for a long time. Satisfaction is generally pretty high, and there's a gap between the satisfaction ratings of the bulk of our employees and those of the most dissatisfied. We might intuit that the employees in the lowest satisfaction range are a lost cause, but that the people in the gap we see in the histogram can still be saved.

 

As we click through the workflow we next get a view of the correlations between each pair of columns, sorted so that the highest correlations are in the upper left. The correlations visualizations are important both for data understanding and for predictive modeling. This is explained extensively in the ML Assistant text, but the takeaway for now is that a high correlation tells us that two columns vary together in some way: they are not independent of each other. This is important to keep in mind when using the data and model understanding tools we look at below. The topic of correlations is a deep one that we'll leave to the ML Assistant text, since this dataset does not show any strong correlations.
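If you want to reproduce a quick version of this check by hand, here's a hedged pandas sketch (again assuming the file is saved as hr.csv). It lists the strongest pairwise correlations among the numeric columns, strongest first:

```python
import pandas as pd

df = pd.read_csv("hr.csv")

# Pairwise correlations between the numeric columns.
corr = df.corr(numeric_only=True)

# Flatten to (column, column) pairs, drop the trivial self-correlations,
# and sort by absolute correlation. Note each pair appears twice (A,B and B,A).
pairs = corr.stack()
pairs = pairs[pairs.index.get_level_values(0) != pairs.index.get_level_values(1)]
print(pairs.abs().sort_values(ascending=False).head(10))
```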

 

By clicking on the cell with the highest value we can see a plot of satisfaction_level vs. left. As we would expect, the average job satisfaction level of the people who left the company was significantly lower than those who stayed:

 

04 - Data Insights Correlation Matrix a.jpg

 

Fortunately, the gap between them is quite large, so we can take some comfort in the fact that we ought to be able to identify which employees are likely to leave. If the satisfaction levels were very similar it might indicate that it would be difficult to tell who would leave and who would stay. In that case there might be another column that separates those who leave from those who don't, and it's usually a combination of columns that helps us predict our column of interest. Either way, this is a good sign that we're going to be able to predict turnover with this dataset.

 

Looking further, as we might expect, employees with higher salary levels have higher job satisfaction, but the signal isn't as strong as it was for satisfaction_level:

 

05 - Data Insights Correlation Matrix b - satisfaction vs salary.jpg

 

This relationship might not have a clear cause-and-effect connection (in other words, salary and satisfaction_level may both have the same underlying drivers, such as department or education level), and in fact satisfied employees might perform better, leading to higher salaries, but it's still useful to keep this relationship in mind!

 

Looking at satisfaction by department might help us focus our attention on the departments most at risk:

 

06 - Data Insights Correlation Matrix b - satisfaction vs department.jpg

 

As an HR professional, we might be distressed that the HR department comes last on this list!

 

Clicking to the next Data Insights subtab we can easily identify the people who have been with the company the longest:

 

07 - Data Insights Outliers a - time_spend_company - delete rows.png

If there are data collection errors or other data quality issues, they are usually highlighted in the Outliers visualizations. We can either go back to the data source to fix the problem, or delete the bad data with a couple of clicks.

 

We always need to take care when deleting rows from our dataset. The Outliers panel is primarily there to point out data that might be incorrect, for example due to data collection errors or file corruption, so that we can remove it. In this case, though, we know that these rows are not mistakes in the data; the visualization is simply pointing out the loyal old-timers. We need to decide whether to include these people in the predictive models that we build or to delete them from the dataset.

 

When we build predictive models it's important that the data we use to train them is representative of the cases we'll want to make predictions on in the future. The patterns the models learn need to also apply to our future data. In this example, if we include the old-timers then the models will be applicable to both shorter-term and longer-term employees, but they might not be as accurate for the shorter-term employees as they could be. If we're confident that these people aren't at risk of attrition we might want to remove them from the dataset to gain a bit of accuracy in predicting turnover of the short-timers, but in that case we'll need to be careful not to use our models on old-timers.
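If we did decide to set the old-timers aside, the equivalent operation by hand is a simple filter. Here's a hedged pandas sketch; the 8-year cutoff is purely a hypothetical choice for illustration:

```python
import pandas as pd

df = pd.read_csv("hr.csv")

# Hypothetical cutoff: treat tenure above 8 years as "old-timers".
cutoff_years = 8
old_timers = df[df["time_spend_company"] > cutoff_years]
print(f"Rows flagged as old-timers: {len(old_timers)}")

# Only drop these rows if we're sure the model will never be asked
# to score long-tenured employees.
df_short_tenure = df[df["time_spend_company"] <= cutoff_years]
```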

 

Choosing Our Prediction Target

 

Up to this point we haven't done any actual Machine Learning. Alteryx ML has done some automatic analysis for us so that it can point out interesting things about our data, and we may have chosen to further prep our data for ML by taking actions like dropping outliers or certain columns. Now it's time to let Alteryx ML build some models for us!

 

We've seen that in this dataset there are two obvious choices to consider for the column we'd want to predict or to understand in more detail: left and satisfaction_level. We call this column the target.

 

In this case, satisfaction_level probably is important in predicting left, and it precedes left in time. In other words, satisfaction_level likely is a causal factor in someone leaving the company, so it's better to first try to predict left. If we do find that satisfaction_level is important in predicting left we can dig into its predictors in more detail by building a second model for satisfaction_level.

 

Note that if we do so we should not include left in the potential predictors of satisfaction_level (we should drop that column). Why? Again, it's because satisfaction_level precedes left in time. In the future if we want to make predictions about whether current employees are likely to leave the company we won't yet know their value of the left flag! It's always important to ensure that the bits of information that we give to the models are things we'll know at the time we ask the model to make predictions. Violating this is called leakage, and Alteryx ML has some tools to help us detect and deal with it. The important takeaway for now is that we follow its instructions if it detects leakage, and that we question our models if their performance seems too good to be true. If it is, the Feature Importances information that we'll see in a moment will usually tell us if we have more subtle leakage issues. As always, see the ML Assistant for more info.
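To make the leakage point concrete, here's a hedged sketch of how you'd set up the second model's data by hand: when satisfaction_level becomes the target, left is excluded from the candidate predictors because it is only known after the fact.

```python
import pandas as pd

df = pd.read_csv("hr.csv")

# Target for the second model: satisfaction_level (a regression target).
y = df["satisfaction_level"]

# Drop the target itself AND `left`: an employee's departure happens after
# their satisfaction is measured, so including it would leak future information.
X = df.drop(columns=["satisfaction_level", "left"])
```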

 

Building Models

 

We first select left as our target, and Classification as our machine learning method. Notice that Alteryx ML tells us that our dataset has 3571 rows in which left is True (the minority class) and 11428 in which it is False (the majority class). If there are far more examples of one class than the other we say the dataset is unbalanced. If the imbalance is within reasonable bounds, Alteryx ML will automatically rebalance the dataset for us so that it will properly learn the data patterns that identify the minority class. It will warn us if there aren't enough examples of the minority class to handle automatically, in which case we'll need to go back to our source data to collect more instances of the minority. It's just this sort of often-overlooked detail that helps to ensure that our project follows best practices for reliable results.

 

09 - select target and model type.jpg
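Alteryx ML handles the rebalancing for us, but if you were working by hand in scikit-learn, one common approach is to weight the classes inversely to their frequency. A hedged sketch, using only the numeric columns for brevity:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("hr.csv")
y = df["left"]
X = df.drop(columns=["left"]).select_dtypes("number")  # numeric features only, for brevity

# Roughly 11428 majority-class rows vs 3571 minority-class rows.
print(y.value_counts())

# class_weight="balanced" compensates for the imbalance without resampling.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)
```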

 

Once we click Next the system will begin to automatically create new engineered features and then train a set of models of many different types. Let's dig into what that means.

 

Automated Feature Engineering

 

One of the strengths of Alteryx Machine Learning is its ability to automatically discover variations of your original dataset's columns (features) that can help the models find the hidden patterns in your data. The Advanced button pops up a dialog that lets you control this process and gives you expert options for the model training process.

 

We don't have enough space to go into a lot of detail here, but let's think through a simple example. Imagine you are trying to model some consumer-oriented behavior such as retail sales. Retail sales are usually affected by weekends and holidays. Alteryx ML's automated feature engineering can take a simple DateTime column and automatically generate new features that represent day_of_week, is_weekend, is_holiday, and many others. These can not only improve prediction accuracy, but also can help you tease out the important predictors of your target more precisely. As another example, for a fraud dataset the ratio of transaction_amount to average_transaction_amount for a client is a very strong predictor.
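To make that idea concrete, here's a small hand-rolled sketch of the DateTime expansion in pandas, using a hypothetical retail table with an order_date column (Alteryx ML generates this kind of feature for you automatically):

```python
import pandas as pd

# Hypothetical retail data with a single DateTime column.
sales = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-11-29", "2024-12-25", "2025-01-06"]),
    "amount": [120.0, 89.5, 42.0],
})

# Derive calendar features that often carry signal for consumer behavior.
sales["day_of_week"] = sales["order_date"].dt.dayofweek
sales["is_weekend"] = sales["day_of_week"] >= 5
# A real holiday flag would come from a calendar lookup; hard-coded here.
sales["is_holiday"] = sales["order_date"].isin(pd.to_datetime(["2024-12-25", "2025-01-01"]))

print(sales)
```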

 

Alteryx ML can discover this for you automatically. Doing this kind of Feature Engineering manually is said to take roughly 80% of the time of Machine Learning projects, so whether you're new to ML or a full-time Data Scientist this kind of automation can reveal information hidden in your data and save a huge amount of time.

 

Model Training

 

Each model algorithm has its strengths and weaknesses in terms of accuracy, and in the tradeoff between model interpretability/explainability and accuracy. That's a deep topic that will need to wait for a future blog post, but suffice it to say for now that GBM models like XGBoost, LightGBM, and CatBoost usually have the highest prediction performance, while those that are extensions of linear regression (Elastic Net, Ridge, and Logistic Regression) are more easily interpreted.

 

At this point we could go into the Advanced settings to control automated feature engineering and tweak how many models will be trained and how, but that's not necessary for this demonstration. If you need the highest prediction performance you should explore the options you'll find there.

 

Evaluating Our Model Pipelines

 

As Alteryx ML trains models it adds them to a leaderboard that we can sort by the model performance metric of our choice. I'd like to see how accurate the predictions are, and to account for the False and True cases equally, so I've chosen Balanced Accuracy as my metric. If we choose Accuracy instead, then the False cases would be more heavily weighted in our metric than the True cases, since there are roughly 4 times as many rows in which left is False.

 

10 - Leaderboard.jpg
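For intuition about why the two metrics differ on imbalanced data, here's a tiny hedged sketch using scikit-learn's metric functions and a toy set of labels:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy example: 8 employees who stayed (0), 2 who left (1), and a lazy
# model that predicts "stayed" for everyone.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10

print(accuracy_score(y_true, y_pred))           # 0.8 -- looks decent
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 -- no better than a coin flip
```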

 

It's important that the metric we use to measure prediction performance matches our use case. Fortunately, the ML Assistant text gives you clear guidance about what each metric means:

10 - Model Metrics ML Assistant text.jpg

 

As I hinted at above, you can see that the GBM models are 97% and 96% accurate on our training data, while the linear models are 75% accurate. We also see a baseline model, the simplest possible prediction, which in this case is a random "coin flip". As you might expect, the Balanced Accuracy for a coin flip is 50%.

 

Note that I said that this is the accuracy on the training data. Of course, we don't need to make predictions on our training data; we already know the answers for it! So how do we know how well our models will do on new, unseen data? Once again, Alteryx ML has helped us by automating best practices. First of all, the training data metrics actually come from a technique called cross validation, in which the training data is split into several folds and each model is trained and evaluated multiple times, holding out a different fold each time. This gives us a more dependable view of model performance on the training data. But Alteryx ML goes even further: a portion of your data, called the holdout data, has been set aside to get an even better idea of how the models will work in the future. Let's apply this holdout data in order to see the holdout metrics, by clicking the Evaluate Model button:

 

11 - apply holdout.jpg

 

This summary shows that the Balanced Accuracy on our holdout data is even better than on the training data, which frankly is a bit unusual. Let's take a look at some more detail:

 

12 - Metrics and Confusion Matrix.jpg

 

All of the metrics are quite close to each other on the training data and holdout data, which tells us that our model ought to generalize well to future data, as long as it is similar to our training data. Since this is a binary classification problem we also get a summary of the correctness of our predictions, split out into four quadrants. This grid is called the Confusion Matrix. The terms true negative, true positive, false negative, and false positive are unfortunately familiar to us in these days of COVID-19, but the ML Assistant will remind us of what those mean in case we've forgotten.
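If you're curious what this combination of practices looks like outside the product, here's a hedged scikit-learn sketch (not Alteryx ML's internals): cross-validation on the training portion, a final check on an untouched holdout set, and the same four-quadrant confusion matrix. The model choice and settings here are placeholders.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

df = pd.read_csv("hr.csv")
y = df["left"]
X = df.drop(columns=["left"]).select_dtypes("number")  # numeric columns only, for brevity

# Set aside a holdout set that is never touched during training.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(class_weight="balanced", random_state=0)

# Cross-validation gives a more dependable training-data estimate than a single score.
cv_scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="balanced_accuracy")
print("CV balanced accuracy:", cv_scores.mean())

# Final check on data the model has never seen.
clf.fit(X_train, y_train)
y_pred = clf.predict(X_hold)
print("Holdout balanced accuracy:", balanced_accuracy_score(y_hold, y_pred))

# Confusion matrix: rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_hold, y_pred))
```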

 

Why Are Some Models Doing Better Than Others?

 

An obvious question we might ask is, why do some classification models do so much better than others?

 

The Decision Boundary

 

For classification problems, the job of the model is to learn how to separate our target's classes. The model training algorithms first find patterns that can help separate these classes, and then they encode these patterns into a representation that we can use to make predictions. Usually we're only interested in a yes or no answer, but internally the models generate a number from 0 to 1 that looks an awful lot like a probability. Unless we do a process called calibration we can't quite interpret this number as the true probability, but it's close enough that we can normally think of it that way. Let's call that number the "probability", in quotes.

 

The model then internally finds the "probability" threshold that optimizes the metric we've chosen, for example Balanced Accuracy. We call this threshold the Decision Boundary. When we make predictions, if the model's "probability" is below the Decision Boundary it returns False, and if it is above the Decision Boundary it returns True.
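Mechanically, making a prediction is just a comparison of the "probability" against that threshold. A minimal sketch (the boundary value here is hypothetical):

```python
import numpy as np

# Hypothetical predicted "probabilities" for five rows, plus a decision
# boundary chosen to optimize our metric.
proba = np.array([0.02, 0.35, 0.48, 0.51, 0.97])
decision_boundary = 0.50  # hypothetical threshold

# Predictions are just a comparison against the boundary.
predictions = proba >= decision_boundary
print(predictions)  # [False False False  True  True]
```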

 

Prediction Probability plots are histograms that show the number of rows in our holdout data plotted against their predicted “probability”. The rows that are Actual True are colored rose, and those that are Actual False are colored blue. We would expect that the "probability" should be near 0 for our Actual False rows, and for it to be near 1 for our Actual True rows.
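If you ever want to draw this kind of plot by hand for your own model, here's a hedged matplotlib sketch; the predicted "probabilities" and actual labels below are randomly generated placeholders standing in for real holdout results.

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder inputs: predicted "probabilities" for holdout rows and their
# actual labels (True = the employee actually left).
rng = np.random.default_rng(0)
proba = np.concatenate([rng.beta(2, 8, 300), rng.beta(8, 2, 100)])
actual = np.concatenate([np.zeros(300, bool), np.ones(100, bool)])

bins = np.linspace(0, 1, 30)
plt.hist(proba[~actual], bins=bins, alpha=0.6, label="Actual False")
plt.hist(proba[actual], bins=bins, alpha=0.6, label="Actual True")
plt.axvline(0.5, linestyle="--", label="Decision Boundary (hypothetical)")
plt.xlabel('Predicted "probability"')
plt.ylabel("Number of holdout rows")
plt.legend()
plt.show()
```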

 

In the example below, from a fraud detection dataset, the Decision Boundary happens to be 0.4739. Predictions that are near that threshold are either barely on the wrong side and are predicted incorrectly, or they are correct but “at risk”: similar future data might land on the other side of the boundary and be predicted incorrectly. In other words, the rows near the boundary are difficult for the model to properly classify.

 

Distributions of Income Predictions by Actuals.png

What Does the Probability Plot Tell Us About Our Models?

 

Let's take a look at the "probabilities" that these models generate for the holdout data for our HR employee attrition dataset.

Here's the plot for our LightGBM model, which performed the best at 99% Balanced Accuracy:

 

22 - Decision Boundary for LightGBM.jpg

 

You can easily see that the model separates the two classes extremely well, since all the Actual False rows are at the far left, near 0, and the Actual True values are near 1. The vertical line is our Decision Boundary. Note that there are almost no rows anywhere near it.

Now let's take a look at our Logistic Regression (linear) model, which had a Balanced Accuracy of around 75%:

 

22 - Decision Boundary for Logistic Regression.jpg

 

You can see that the Actual False and Actual True rows are not well separated, and mix together in the center of the plot. The Actual False rows are still pushed toward zero fairly well, even though the histogram remains quite thick as it crosses to the wrong side of the Decision Boundary, but the Actual Trues are spread much more evenly between 0 and 1.

 

Probability Plots, in Summary

 

The various model metrics and the Confusion Matrix give us a rich set of measures about how well our model is making predictions, but the Probability Plot gives us something more: it shows us visually how well our model is separating the two classes. The fewer the rows in the middle of the plot, the better our model is likely to generalize to future data that we throw at it, meaning our predictions will be more stable and reliable.

 

Data Insights from Models

 

How Can Models Help Generate Insights?

 

You might be asking yourself this question right now:

 

If my main interest is in understanding our data, then why did we build Machine Learning models?

The answer lies in the fact that ML models are pattern-finding machines. Let's see what this means in practice.

 

What Are The Most Important Predictors, and How Do They Relate To The Target?

 

First we'll take a look at the most basic example: Which features (columns) are most important in predicting my target?

 

 

14 - Feature Importance XGBoost all.jpg

 

This plot shows us that there are five features that are important. Interestingly, salary range and promotion_last_5years are not among them. As we might have guessed, satisfaction_level is high on the list, so let's dig into that more deeply.
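Outside the product, you could get a comparable ranking with permutation importance: shuffle one feature at a time and see how much the holdout score drops. A hedged scikit-learn sketch, using only the numeric columns and a placeholder gradient-boosting model:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("hr.csv")
y = df["left"]
X = df.drop(columns=["left"]).select_dtypes("number")

X_train, X_hold, y_train, y_hold = train_test_split(X, y, stratify=y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# How much does holdout performance drop when each feature is shuffled?
result = permutation_importance(
    model, X_hold, y_hold, scoring="balanced_accuracy", random_state=0
)
ranking = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(ranking)
```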

 

Partial Dependence Plots

 

Partial Dependence plots show us how our target changes as a specific feature is varied over its range. Note that we need to be careful when interpreting these plots when we have collinear features, because collinear features vary together to at least some degree. This is explained in some detail in the ML Assistant, but as a simple example, a 7-foot human probably does not weigh 85 pounds. Height and weight don't vary together precisely, but they certainly are related. For this particular dataset the correlations between columns are low, so we don't need to worry too much about this issue. We can go back to the two-variable plots from the correlations page to dig into this in detail.
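scikit-learn has a direct analogue if you want to reproduce this kind of plot for your own model; a hedged sketch, again fitting a placeholder gradient-boosting model on the numeric columns (column names are as they appear in the Kaggle file):

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

df = pd.read_csv("hr.csv")
y = df["left"]
X = df.drop(columns=["left"]).select_dtypes("number")

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# How does the predicted chance of leaving change as each feature is
# varied over its range, averaging out the other columns?
PartialDependenceDisplay.from_estimator(
    model, X, features=["satisfaction_level", "time_spend_company", "number_project"]
)
plt.show()
```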

 

15 - Partial Dependence a.jpg

 

 

Interestingly, it's only the very lowest level that's really important. The range between 0.2 and 0.5 has a bump, but it's not as high as we might expect. Let's take a look now at time_spend_company:

 

 

16 - Partial Dependence b.jpg

 

Now here's something fascinating! Employees are most likely to leave in years 5 and 6! We ought to think hard about why this might be so. Have they vested all of their stock and therefore don't feel a strong need to stay? Or maybe their position has them feeling stagnant, unchallenged and bored? Probably it's time for a detailed satisfaction survey, so we can look into the mood of these folks!

What about the number of projects that people are working on?

 

17 - Partial Dependence c.jpg

 

Huh! We might have guessed that people juggling a lot of projects are unhappy. But too few is also an issue! Before drawing any conclusions it'd be good to review our correlations visualizations to dig into the data to see if the people with only 2 projects are also unusual for other reasons.

 

What about performance evaluations?

 

18 - Partial Dependence d.jpg

 

Of course, it's a bit hard to tell cause and effect for the people with low evaluations, since employees might leave if they are poor performers or they might leave if they feel they aren't being treated fairly. If we are able to go back to our source data to bring in a departure_reason column perhaps we'd be able to tease those cases apart.

 

We might expect that people would leave if their last evaluations were low, and in fact we might even hope that they do! But what's a bit more surprising is that staff with high evaluations also have a higher rate of attrition. It would probably serve the company well to focus on retaining these people. We might look deeper into the data to see how long after their evaluation date they leave. Perhaps some focused attention during evaluation season would be effective. Or maybe these people need more challenge, attention, or recognition all year long.

 

Simulations

 

Finally, let's take a look at Simulations, a feature that can help us gain intuition and confidence about how our model is working, and can also help us to explain its predictions for individual rows and understand how to influence them. Here is an example row from our holdout data:

 

20 - Simulations before.jpg

 

We can see that the "probability" of this person leaving is 99.32%. Remember that satisfaction_level is the most important feature, which is why it's shown first. What would happen if we could make this person happier with their job? Let's move it up a couple of points and click Run:

 

21 - Simulations after.jpg

 

There's hope! If we want to keep this employee from leaving it looks possible.
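The same what-if idea can be reproduced by hand with any fitted model: copy a single row, nudge one feature, and score it again. A hedged sketch with placeholder model settings and row index:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

df = pd.read_csv("hr.csv")
y = df["left"]
X = df.drop(columns=["left"]).select_dtypes("number")
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Pick one at-risk row (index 0 is just a placeholder).
row = X.iloc[[0]].copy()
print("Before:", model.predict_proba(row)[0, 1])

# What if we could raise this person's satisfaction a couple of points?
row["satisfaction_level"] = min(1.0, row["satisfaction_level"].iloc[0] + 0.2)
print("After:", model.predict_proba(row)[0, 1])
```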

 

The ML Workflow

 

The ML workflow is exploratory and iterative. As we go along, we gain insights about our data. As we do, we may discover that we want to return to earlier stages in the process to make changes. Alteryx ML makes this easy by tracking our steps in the left sidebar and allowing us to return back at any time simply by clicking on the step.

 

A version of the workflow that doesn't include all of the feedback loops looks like this:

 

ml_workflow_101.png

 

 

To understand why we may want to loop back to earlier steps, let's quickly think through a specific example. If we are building a model that we'll use to make predictions about human clients we will probably need to be careful to give the model only information that is ethical to use. We may even have regulatory constraints on the information that goes into making our predictions. Examples of this might be the clients' race or gender, or proxies for these such as postal code or first name respectively. If we don't think this through up front we'll probably notice it when we look at Feature Importances or begin to use Simulations to explain specific predictions. At that point we'll want to jump back to Data Prep to drop the problematic columns from the data.

 

Similarly, if we need to explain our predictions and we're using Alteryx ML's automated feature engineering we might discover that some of the features that it generates are difficult to explain. If so, we'll want to circle back to change the automated feature engineering settings in the Advanced settings.

 

It might also occur to us that we can enhance our dataset, perhaps by joining in other internal or third-party data. If so, jump back into Designer to do that and see if your models get even better!

 

You've Declared Victory. What Now?

 

Once you are satisfied that the analysis and models are solid what do you do?

 

If you're primarily interested in understanding your data, Alteryx ML makes it simple to export the data visualizations so that you can share them with others. Both PowerPoint and single-image options are available. If I were the imaginary HR professional in the example, I might write up a short report on the conclusions that I drew about the drivers of employee attrition, and use the Feature Importances, two-variable correlation plots, and Partial Dependence plots to illustrate the actions that I would recommend. I'd choose a couple of employees who had quit as examples, and use Simulations to show how we might have kept them. This should make the possibilities very concrete for my colleagues.

 

If predictions are what you're after, you can upload up-to-date data of current employees for scoring or integrate your predictive model into a Designer workflow to run it regularly. I might rerun my satisfaction_level surveys to get fresh results and then get my best model's predictions for my current employees. I could then rank the employees in terms of their likelihood to leave, and then engage the company's managers to try to keep them with the company.

 

Getting Ahead of Attrition: Modeling satisfaction_level

 

Recall that satisfaction_level is the primary predictor of someone leaving the company. We can use this information to focus in on the factors that drive employee satisfaction, to try to get ahead of the problems that might eventually cause them to leave. Unfortunately, this dataset doesn't contain very many features that go into the satisfaction rating.

 

Most likely, though, we have detailed questionnaire data from the employee surveys that are summarized in the satisfaction rating. There are two approaches we can take to incorporate this detailed survey data: we can either join (blend) the survey dataset with this one and build new models for left that make use of these additional features, or we can build a separate set of models that simply predict satisfaction_level from those survey results. The big advantage of taking the second route is that we can easily iterate on our survey as we learn more about the factors that go into the satisfaction rating. If we take the first route and decide we need to enhance our survey, we'll have to wait until quite a few employees have left before we have enough data to train a new set of models. In other words, joining the detailed survey data to this dataset would give us a richer model of attrition, but would make it much more difficult to quickly iterate on improvements we might make to how we run our company.

 

Data Drift

 

If you intend to use your model to make predictions in the future, save your project for later and continue to collect data that includes the actual value of your target variable. Data can drift over time, meaning that the distribution of one or more features changes. As an example, monetary inflation changes the value of a currency, and this change can affect the performance of models that include measures such as costs or revenues. If you come back to your project you can apply your newer data as a holdout set to check that your models are still performing well. If they aren't, you will need to retrain your models on the newer data.
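A hedged sketch of that periodic check, assuming you persisted the model with joblib when the project wrapped up and have since collected a fresh, labeled batch of data (the file names are placeholders):

```python
import joblib
import pandas as pd
from sklearn.metrics import balanced_accuracy_score

# Load the model saved when the project was first completed (placeholder
# path) and a freshly collected, labeled batch of data.
model = joblib.load("attrition_model.joblib")
new_df = pd.read_csv("new_hr.csv")

y_new = new_df["left"]
X_new = new_df.drop(columns=["left"]).select_dtypes("number")

# Treat the new batch as an extra holdout set: if the score has slipped
# noticeably, it's time to retrain on newer data.
print("Balanced accuracy on the new batch:",
      balanced_accuracy_score(y_new, model.predict(X_new)))
```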

 

Next Steps with Alteryx Machine Learning

 

Now that you've seen most of the features in Alteryx Machine Learning it's time to try it out on your own datasets! Remember, the data we have available won't always be predictive of the column we're interested in, but it's surprising how often it is. If your problem can be framed as the prediction of a numerical value (regression) or the prediction of a discrete set of choices (classification) it's quick and easy to give it a try!

 

Learning More About Data Science and Machine Learning

 

Be sure to dig into the ML Assistant sidebar text. There's a lot of information there, but it really is approachable. As you work with your own datasets and refer to the ML Assistant it should all sink in, and help you gain new ways of understanding your data. Of course, there are many other resources online to help you deepen your knowledge.

Best of luck for your success with Alteryx ML!
