Welcome to Part 3 (out of 4) of the Pre-Predictive series. In this article series, we are introducing you to the very exciting world of data investigation.
Just to remind you where we are in our journey: Part 1 gave an overview of the Field Summary Tool; Part 2 covered the Contingency and Frequency Table Tools, as well as the Distribution Analysis Macro; Part 3 covers the correlation tools, namely the Association Analysis Tool and the Pearson and Spearman Correlation Tools; and finally, Part 4 will give you a tour of all the nifty plotting tools.
All three tools included in this section are used to assess correlation between variables. My general PSAs for these tools are to screen your predictor variables for multicollinearity, and to remember that correlation does not imply causation 🙂
Also, all three of these tools primarily use numeric input data. The fields being analyzed should not contain nulls. You can impute values to replace the nulls by using the Imputation Tool or a custom method. Columns with unique identifiers, such as key fields, should not be included in any statistical analysis. They have zero predictive value and can cause runtime exceptions.
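If you want to see what that preparation looks like outside of a workflow, here is a minimal Python sketch. The field names and values are made up for illustration, and median imputation is just one possible "custom method", not necessarily what the Imputation Tool does by default.

```python
from statistics import median

# Hypothetical records; None stands in for a null value.
rows = [
    {"customer_id": 101, "age": 34, "income": 52000},
    {"customer_id": 102, "age": None, "income": 61000},
    {"customer_id": 103, "age": 45, "income": None},
    {"customer_id": 104, "age": 29, "income": 48000},
]


def impute_median(rows, field):
    """Replace None in `field` with the median of the non-null values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def drop_fields(rows, fields):
    """Remove identifier-style columns before any statistical analysis."""
    return [{k: v for k, v in r.items() if k not in fields} for r in rows]


# Drop the key field first, then fill the nulls in each numeric field.
clean = drop_fields(rows, {"customer_id"})
for field in ("age", "income"):
    clean = impute_median(clean, field)

for row in clean:
    print(row)
```

After this step every remaining field is numeric and null-free, which is the shape of data the correlation tools expect.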
The Association Analysis Tool will help you determine which fields in your data have a bivariate association with one another.
In the tool’s configuration, you have the option to select a target field for more detailed analysis. The target field can be numeric or binary categorical. You can set the Association Analysis target field as your eventual model’s target variable to help you determine which set of variables to use in your predictive model. You can select two or more Fields in the Field window. These fields (every field other than your target field) must be numeric. Being able to perform a correlation analysis on a binary categorical variable is unique to the Association Analysis Tool; both the Pearson and Spearman Correlation Tools only accept numeric inputs.
The Association Analysis Tool allows you to select between Pearson product-moment correlation, Spearman rank-order correlation, and Hoeffding’s D statistic. Take some time to poke through the outputs of the Association Analysis Tool. There is both a report (R) and an interactive output (I). In the (R) output, there is a Full Correlation Matrix and a Matrix of p-values, plus a Focused Analysis table for the target field if you’ve elected to use one. In the (I) output, there is a color-coded correlation matrix and a Scatter Plot. Clicking on any box in the correlation matrix will cause the Scatter Plot to generate to the right of the matrix.
There is an excellent Association Analysis Tool Mastery article on the Community that you can refer to for more information.
The Pearson Correlation Tool can be used to measure the strength of a linear association between two variables, either by calculating a Pearson correlation or by calculating covariance.
Conceptually, Pearson’s correlation can be thought of as drawing a line of best fit through the two variables, and then measuring how far each of the data points is from that line. The analysis will return a value ranging from +1 to -1 (inclusive), where +1 indicates a perfect positive linear relationship (as one value increases the other also increases), -1 indicates a perfect negative linear relationship (as one value increases the other value decreases), and 0 indicates there is no linear association between the two variables.
Covariance is a measure of the joint variability of two random variables. If the variables tend to show similar behavior, covariance is positive; if the variables tend to show opposite behavior, covariance is negative. The magnitude of covariance is difficult to interpret because it is not normalized. The normalized version of covariance is a correlation coefficient, which is what the Pearson Correlation Tool returns by default.
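For reference, the relationship between the two quantities can be written out using the standard textbook definitions. The symbols are my own notation, not taken from the tool's documentation: $\bar{x}$ and $\bar{y}$ are the sample means, and $s_x$ and $s_y$ are the sample standard deviations.

```latex
\operatorname{cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
```

Dividing by $s_x s_y$ removes the units of both variables, which is why $r$ always falls between -1 and +1.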
In the tool’s configuration, you select two or more variables to analyze correlations between. The fields must be numeric. You can also choose between the options Calculate Correlation and Calculate Covariance. Calculate Correlation performs a Pearson correlation. Calculate Covariance calculates the sample covariance.
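To make the difference between the two options concrete, here is a small Python sketch that computes both by hand. The function names and the height and weight values are hypothetical; the point is only to show that covariance changes with the units of measurement while the correlation does not.

```python
from math import sqrt


def sample_covariance(x, y):
    """Sample covariance: summed cross-deviations divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def pearson(x, y):
    """Pearson correlation: covariance normalized by both standard deviations."""
    return sample_covariance(x, y) / sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )


# Hypothetical measurements: height in metres and weight in kilograms.
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55.0, 68.0, 70.0, 80.0, 92.0]
height_cm = [h * 100 for h in height_m]  # same data, different unit

cov_m = sample_covariance(height_m, weight_kg)
cov_cm = sample_covariance(height_cm, weight_kg)
r_m = pearson(height_m, weight_kg)
r_cm = pearson(height_cm, weight_kg)

print(cov_m, cov_cm)  # covariance scales with the unit of measurement
print(r_m, r_cm)      # correlation is unit-free, so the two values agree
```

Switching from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged, which is why the correlation is usually the easier number to interpret.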
Spearman correlation is a nonparametric measure of rank correlation. It can be used for both continuous and discrete ordinal variables. The input fields for this tool must be numeric.
Spearman's correlation assesses a monotonic relationship (a consistently positive or negative relationship, regardless of whether it is linear) between variables. Like Pearson’s correlation coefficient, the value of Spearman’s correlation coefficient ranges between -1 and +1 (inclusive). In the absence of repeated data values, a perfect Spearman correlation (±1) occurs when each variable is a perfect (positive or negative) monotone function of the other. Zero indicates that there is no monotonic association.
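When there are no repeated data values, Spearman's coefficient has a well-known shortcut formula. The notation here is mine: $d_i$ is the difference between the rank of the $i$-th observation in the first variable and its rank in the second, and $n$ is the number of observations.

```latex
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n \,(n^{2} - 1)}
```

If every observation has the same rank in both variables, all $d_i$ are zero and $\rho = 1$. With ties, the general approach is to compute Pearson's correlation on the (average) ranks instead.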
In the tool’s configuration, you can select two fields to analyze the correlation between. You also have the option to select a Group By field, which will determine the correlation for categories in the field you provide.
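Here is a hedged Python sketch of what a grouped Spearman correlation amounts to: rank each variable within a group, then compute Pearson's correlation on the ranks. The helper names, the region field, and the numbers are all hypothetical, and this is not a claim about how the tool is implemented.

```python
from collections import defaultdict
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for i in order[start:end + 1]:
            ranks[i] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation computed from sums of deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical rows: (region, ad_spend, sales). "region" plays the Group By role.
rows = [
    ("North", 1, 1), ("North", 2, 8), ("North", 3, 27), ("North", 4, 64),
    ("South", 1, 50), ("South", 2, 40), ("South", 3, 45), ("South", 4, 10),
]

groups = defaultdict(lambda: ([], []))
for region, spend, sales in rows:
    groups[region][0].append(spend)
    groups[region][1].append(sales)

by_region = {region: spearman(x, y) for region, (x, y) in groups.items()}
north_pearson = pearson(*groups["North"])

print(by_region)
print(north_pearson)
```

In the North group, sales grow as the cube of ad spend, so the relationship is perfectly monotonic but not linear: Spearman's coefficient is 1 while Pearson's falls short of 1.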
There you have it! A basic run-down of the three correlation tools.
Generally speaking, correlation analysis is useful to indicate whether a variable has predictive power for a target variable. Keep an eye out for high correlation coefficients (close to ±1), as these are the predictor variables that may be the most powerful in a predictive model. A low correlation coefficient between a predictor and the target variable is not enough on its own to exclude the predictor from a model, but it should prompt you to take a moment to consider whether including that variable is really important, and whether it really makes sense (How relevant is the price of oranges in predicting the number of ski passes sold in a year, really?). High correlation coefficients between predictor variables can help you screen for multicollinearity, which can lead to all sorts of statistical shenanigans, particularly in regression models.
The Association Analysis Tool can do correlations with a binary categorical variable, whereas the Pearson and Spearman Tools can only handle numeric data. The Pearson Correlation Tool has an option to return covariance, and the Spearman Correlation Tool has an option to group values. If you are not sure which correlation measure to use, you have the option to try each of them. Typically, Pearson's correlation should be used when both variables have a normal distribution; otherwise, Spearman's correlation is appropriate. Additionally, Spearman's correlation is more robust to outliers.
Finally, and I know I've already said this but it bears repeating, correlation coefficients do not express causal relationships, only associations. Correlation does not imply causation.
The correlation tools are a great way to explore how your variables relate to one another and should be included in your pre-predictive data exploration and preparation. Only you know your data, your use case, and how your results will be applied, but the Data Investigation tools are here to help set you up for modeling success.