In the machine learning process, data preparation and understanding are essential, time-consuming tasks that can have a large impact on the resulting model. This blog post explains a few of the more common metrics used to measure relationships between features, and showcases their differences and their influence on the resulting machine learning models.

# Insights on these metrics

## Pearson’s correlation

Pearson’s correlation coefficient, or Pearson’s r, measures the strength and direction of linear relationships between variables. This metric is limited to numeric data, so while ordinal data is acceptable, categorical data isn’t. Pearson’s correlation ranges from -1 to 1, inclusive, where -1 represents a perfect negative correlation, 1 represents a perfect positive correlation, and 0 represents no linear correlation.

One example use-case of Pearson’s could be the trend between an individual’s height and their shoe size. Generally, taller people have larger shoe sizes, so we would expect the resulting Pearson value to be closer to 1 than to 0. If Pearson’s were exactly 1, then given someone’s height, we could perfectly tell their shoe size, and vice versa.
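As a minimal sketch of how the coefficient is computed, the snippet below implements Pearson’s r with only the Python standard library. The height and shoe-size values are hypothetical numbers invented for illustration, not real measurements.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: the covariance of x and y scaled by both spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: height in cm and EU shoe size for eight people.
heights = [155, 160, 165, 170, 175, 180, 185, 190]
shoe_sizes = [36, 38, 38, 40, 42, 43, 44, 46]

r = pearson_r(heights, shoe_sizes)
print(f"Pearson's r = {r:.3f}")  # close to, but below, 1
```

Because the made-up shoe sizes rise almost, but not exactly, in step with height, the result lands just under 1.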

## Spearman’s correlation

Spearman’s correlation coefficient is very similar to Pearson’s; however, rather than measuring the linear relationship between variables, it measures the monotonic relationship: trends that are either entirely non-increasing or entirely non-decreasing. Like Pearson’s, it is limited to numeric data. For example, trends like `X^3` could be captured by Spearman’s better than Pearson’s, but neither would be able to handle a quartic, or `X^4`, graph, pictured below. As with Pearson’s, the range goes from -1 to 1, and the interpretations are equivalent, except that they describe monotonic rather than linear relationships.

*Quartic graph (x^4)*

An example use-case of Spearman’s could be the amount of time spent working out versus the pounds of fat lost. While we’d expect a general positive trend between these values (the more hours working out, the more pounds lost), they cannot increase linearly with each other forever due to physical limitations (an individual has only a finite number of pounds to lose, but a much larger number of possible hours to work out). In this case, Spearman’s correlation would likely capture the trend between the variables better than Pearson’s.

## Mutual Information

Mutual information (referred to as MI) measures the amount of information shared between two variables: given some variable X, can we garner additional information on what Y could be and narrow down its possible values? This metric isn’t limited to real-valued (i.e., numeric) problems, can easily be used to compare non-numeric data, and is generally used for discrete, rather than continuous, data. Raw MI is non-negative and bounded above by the entropy of either variable; the normalized values used in this post range from 0 to 1, inclusive, where 0 means there is no shared or linked information between the variables and 1 means that knowing one value will give you exactly the other associated value. MI is symmetric, so the mutual information between (X, Y) will be the same as (Y, X).

One example use-case of MI could be an individual’s age versus their preferred clothing brand. Brands associated with hypewear could narrow down the likely age to be younger, whereas more generic or wholesale brands could be associated with older individuals. We don’t require brands to be a numeric value, and we don't care what the shape of the relationship is. This allows MI to be applicable for non-numeric data, like categorical string data, and makes it a fairly robust option for determining if data could have an impact on the resulting ML model.
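To make this concrete, here is a small sketch that computes a normalized MI for two categorical columns using only the standard library. The age groups and brand names are hypothetical, and dividing by the mean of the two entropies is just one common normalization, assumed here for illustration; libraries may normalize differently.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a list of hashable labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_info(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def normalized_mutual_info(xs, ys):
    """MI scaled to [0, 1] by the mean of the two entropies (an assumed choice)."""
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0 or hy == 0:
        return 0.0  # a constant column carries no information to share
    return mutual_info(xs, ys) / ((hx + hy) / 2)

# Hypothetical survey: age group and preferred clothing brand.
age_group = ["teen"] * 3 + ["adult"] * 3 + ["senior"] * 3
brand = ["HypeCo"] * 4 + ["BasicWear"] * 5

nmi = normalized_mutual_info(age_group, brand)
print(f"normalized MI = {nmi:.3f}")
```

No numeric encoding of the brand names is needed: the calculation depends only on how often each combination of labels occurs.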

# Correlation and information for data prep

Pearson’s and Spearman’s correlations provide a lot of information when the relationships between variables in a dataset are roughly linear or monotonic and when the data is continuous; mutual information, on the other hand, is useful for discrete and non-numeric data, as well as when the relationships aren’t necessarily linear. MI is a more general metric that can identify a variety of dependencies between variables, making it more robust than Pearson and Spearman as an “all-purpose” metric; however, each of these metrics has datasets on which it performs better than the others, and it’s important to note that one measure is not a replacement for the others.

We show some example trends in generated data and display the associated MI and correlation measures in the table below. Each dataset contains 10k generated points, with noise added from a normal distribution to emulate real datasets and the trends that may arise in them. MI calculations involve a process called binning, during which we create equally-sized ranges that the data in a feature will be grouped into. Using 20 bins means that we create 20 equally-sized ranges to group the data, allowing us to treat numerical values as categorical. This procedure lets us make direct comparisons between values that would otherwise be difficult to compare, like numeric and datetime values.

*Identity functions with varying levels of noise*

*Cubic functions with varying levels of noise*

*Decay functions with varying levels of noise*

*Step functions with varying levels of noise*

*Value measures associated with the different functions*

As seen in the examples above, mutual information is more susceptible to noise than the two correlation metrics. These examples were chosen to highlight instances where the correlation metrics can perform well (no sinusoidal or other non-monotonic trends are shown here), but they still show that noise has a large impact on the MI detected; therefore, it is beneficial to also use the more specific correlation metrics to better detect trends in the data.

# Correlation and information for modeling

Since these metrics highlight the amount of information that two features share, they can help simplify the set of features fed into the model. A couple of examples depicting the usage of these metrics are as follows:

1. Features “A” and “B” have a very high (>0.95) correlation or mutual information with feature “C”. From a machine learning perspective, this means that these features could contain a lot of repetitive information. If they are all fed into a model, it could cause bias in the model. If one of these features has a high impact on the model, the other two would as well because of the large amount of information the features share. Instead of emphasizing a single feature, the model would distribute the influence of the underlying relationship among the three features, introducing potential bias and discrepancy that would result in a less interpretable model, although it could maintain the same predictive performance. In this scenario, dropping two of the three features could be useful in reducing bias and creating a more resilient model, while also saving on training time and memory usage. When multiple features are strongly correlated, it usually makes no difference which of them are eliminated.

2. Features “A” and “B” could have a high (>0.95) correlation or information with the target value. In some scenarios, this could be useful to know. For instance, a feature about “the number of customer service calls a customer has requested” could be very useful in predicting churn and could result in higher model performance. However, there could be scenarios where target leakage exists. Target leakage occurs when an ML model has access to data that wouldn’t be available at the time of prediction. This is important to identify and remove before training models. In this case, a feature such as “talked with re-signing representative” could have high correlation with the target “churn”, but this data should not exist in the training features.

## Example 1

Example 1 can be showcased using a public dataset from IBM that we’ve edited. We will use Alteryx’s Woodwork and EvalML libraries to illustrate how these metrics can be used to potentially improve model performance.

In this dataset, we have data about the user, including their age, number of dependents, state of residence, whether they referred a friend, and if they’re married, among many other features. The goal is to predict whether the user will churn.

We can run Woodwork’s `dependence()` function on this dataset, which will find the `mutual_info`, `pearson`, and `spearman` values between all possible pairs of features.

We can run EvalML’s `AutoMLSearch` on this dataset (with `Churn` as the target). `AutoMLSearch` automates the modeling process, creating pipelines that contain transformers to preprocess and clean the data before training a model with the estimators.

This generates and trains a set of pipelines to find the best model for the data. We can grab this object’s `best_pipeline` value, which contains the pipeline that performs the best according to EvalML’s default metric (in this case, `log loss binary`). We then score it on the test data, specifically focusing on `AUC` and `F1`, which are fairly reliable measures of how well our model performs.

For a second search, we drop the features that have high mutual information or correlation values (in this case, >0.95). The features we drop are `['Country', 'State', 'Quarter', 'Dependents', 'Total Charges']`.

From this example, we see a small decrease in AUC but a more sizable increase in F1. While this doesn’t necessarily mean that the model became better, it highlights the potential impact that these features could have on the resulting performance. Looking at the test data confusion matrix (given through EvalML’s confusion_matrix) highlights the difference in modeling. In this matrix, the row headers are the actual labels, and the column headers are the predicted labels:

*Including all features*

*Dropping correlated features*

By dropping the correlated features, the model predicts the `false` examples more accurately while maintaining its performance on the `true` samples, which is a clear gain for modeling.

In addition, this second search took 29 seconds compared to the 36 seconds of the first search. Although this time saving isn’t too substantial, the difference will be larger as the size of the data grows, both in number of features and number of samples.

## Example 2

EvalML contains a data check, the TargetLeakageDataCheck, that warns users of potential target leakage in the data based on these metrics and their values when compared with the target column. This check is a helpful tool that allows further analysis and cleaning of the data to ensure that the features are useful for training resilient and better-performing machine learning models.

We can use the `TargetLeakageDataCheck` to find potential target leakage in the original churn data from above. This churn data is taken directly from the link above, rather than being altered like in the previous example, which means some of the columns and data have changed.

We can then look into the dataset and see that `churn_rate` is exactly the same as `Churn`. Had we created a model, we would get a perfect score.

We can use MI and correlations to determine possible features that could represent target leakage, allowing us to catch this early and create a more robust and well-performing model.

# Final notes

In the end, these metrics act as a means of selecting or excluding potential features from the modeling process. Each metric serves a different role for the patterns that it can detect. Using a larger variety of these metrics can be helpful in capturing a greater diversity of these cases, resulting in the capacity to train more trustworthy and unbiased models.

This article was originally published on innovation.alteryx.com