This article is part of the Tool Mastery Series, a compilation of Knowledge Base contributions to introduce diverse working examples for Designer Tools. Here we’ll delve into uses of the Association Analysis Tool on our way to mastering the Alteryx Designer:
The Association Analysis Tool lets you choose any numerical fields and assesses the level of correlation between them. You can use the Pearson product-moment correlation, the Spearman rank-order correlation, or Hoeffding's D statistic to perform your analysis. You also have the option of doing an in-depth analysis of a target variable in relation to the other numerical fields. After you’ve run the tool, you will have two outputs:
The R output will give you two or three tables, depending on whether you’ve selected “Target a field for more detailed analysis” in the tool’s configuration. If this checkbox is checked, you will get a table that lists the correlation coefficients and their respective p-values for every field compared with the target variable, like so:
If you are unfamiliar with correlation coefficients or p-values, or you simply want to know more about them, I suggest you take a look at this resource.
In the second table, you have a matrix of correlation values of all the fields compared with one another.
And lastly, you get the matrix of p-values for those coefficients:
The I output is basically the same as the R output but with a little more flair. It provides you with a correlation matrix in the form of an interactive heat map. When you select a cell in the heat map, a scatterplot of the two variables is displayed next to it.
In general, the Association Analysis Tool is a great way to understand the relationships in your data (i.e. how your variables correlate) and to decide which variables to choose for predictive models such as regression. The tool offers three different methods of correlation. We often get questions about which one to use and how the three differ, so I’ll go over them briefly.
Pearson product-moment correlation
The Pearson method measures the strength of linear dependence between two variables. This means you will see a stronger correlation between variables that change together at a constant rate: as one increases, the other increases (positive correlation) or decreases (negative correlation) along a straight line.
Strong positive linear correlation
Strong negative correlation
Spearman rank-order correlation
The Spearman method is a nonparametric version of the Pearson method. It looks at the strength of any monotonic relationship. A monotonic relationship is one where, as one variable increases, the other consistently increases or consistently decreases, but not necessarily at a constant rate. This includes relationships that are linear as well as exponential, logarithmic, etc. Another way to think of a monotonic relationship is that the direction of change never reverses: the curve is always increasing or always decreasing, never both.
The two graphs on the left never change direction while the two graphs on the right do change direction and are considered non-monotonic.
At times you may get a good coefficient for a Pearson correlation between two variables but an even better one for a Spearman correlation. If this is the case, then it is possible that the relationship between the two variables is not truly linear. Therefore, we highly suggest that you consult the scatterplot of the data.
Displ and MPG in this case have a strong Pearson coefficient of -0.85. But looking at the scatterplot, the relationship is better described by an exponential curve. Running a Spearman correlation on the same fields gives an even stronger coefficient of -0.91.
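To see why a curved but monotonic relationship scores higher with Spearman than with Pearson, here is a minimal Python sketch that uses only the standard library. The data is made up for illustration (it is not the Displ/MPG sample data), and the helper names `pearson`, `average_ranks`, and `spearman` are hypothetical, not part of the Association Analysis Tool.

```python
from math import exp, sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank-order correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: y decays exponentially as x grows (monotonic, not linear).
x = [1.0 + 0.25 * i for i in range(25)]
y = [40.0 * exp(-0.4 * v) for v in x]

print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

Because the made-up curve never changes direction, the ranks line up perfectly and Spearman reaches its limit, while Pearson is held back by the curvature. Real data with noise will show a smaller gap.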
Hoeffding's D statistic
Hoeffding's D statistic is another nonparametric measure. It is useful for identifying non-monotonic relationships like the ones discussed above, which Pearson and Spearman can miss.
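For readers who want to see what the statistic responds to, below is a rough Python sketch of Hoeffding's D based on the commonly documented rank formula, scaled by 30 so that a perfect monotonic relationship without ties gives 1. The function name `hoeffding_d` and the data are hypothetical, and this is not necessarily the exact computation the tool performs internally.

```python
def hoeffding_d(xs, ys):
    """Hoeffding's D (scaled by 30) for two equal-length sequences.

    Sketch of the standard rank-based formula; needs at least 5 observations.
    """
    n = len(xs)
    if n < 5:
        raise ValueError("Hoeffding's D needs at least 5 observations")
    d1 = d2 = d3 = 0.0
    for i in range(n):
        r = s = q = 1.0  # rank of x_i, rank of y_i, bivariate rank
        for j in range(n):
            if j == i:
                continue
            # Weight is 1 if strictly smaller, 0.5 if tied, 0 otherwise.
            wx = 1.0 if xs[j] < xs[i] else 0.5 if xs[j] == xs[i] else 0.0
            wy = 1.0 if ys[j] < ys[i] else 0.5 if ys[j] == ys[i] else 0.0
            r += wx
            s += wy
            q += wx * wy  # counts points below and to the left of point i
        d1 += (q - 1) * (q - 2)
        d2 += (r - 1) * (r - 2) * (s - 1) * (s - 2)
        d3 += (r - 2) * (s - 2) * (q - 1)
    numerator = (n - 2) * (n - 3) * d1 + d2 - 2 * (n - 2) * d3
    denominator = n * (n - 1) * (n - 2) * (n - 3) * (n - 4)
    return 30.0 * numerator / denominator


# Hypothetical data: a straight line and a U-shaped (non-monotonic) curve.
x_line = [float(i) for i in range(20)]
y_line = [2.0 * v + 1.0 for v in x_line]
x_u = [-1.0 + 2.05 * i / 39 for i in range(40)]
y_u = [v * v for v in x_u]

print(f"D, straight line: {hoeffding_d(x_line, y_line):.3f}")
print(f"D, U-shaped:      {hoeffding_d(x_u, y_u):.3f}")
```

On the U-shaped data, a Pearson or Spearman coefficient would sit close to zero because the falling and rising halves cancel out, while D stays clearly above zero.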
Now that you know what all three look for, you may be asking “Well, which method should I pick?”
The easy answer, especially if you are not truly sure about your data, is all of them. Knowing how each variable correlates will give you a better understanding of which models you want to use and which variables you should or shouldn’t choose for those models. Any time you do predictive modeling, you should use Data Investigation Tools such as this one before constructing the model.
Things to look out for
Because the Association Analysis Tool is an R-based macro, errors that come from this tool are almost always caused by a data issue.
Example: If you feed 4 or fewer records into the Association Analysis Tool (you shouldn’t be doing this anyway, since it's bad practice), you will get this error:
Error: Association Analysis (39): Tool #9: Error in rcorr(the.data, type = cor.type) : must have >4 observations
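If you want to catch this situation before it reaches the tool, a quick pre-check on your data can help. The sketch below is plain Python rather than an Alteryx feature: the function name `fields_with_too_few_values`, the record layout, and the sample values are all assumptions made for illustration. It counts the non-missing values per field and flags any field that falls below five.

```python
def fields_with_too_few_values(records, fields, minimum=5):
    """Return {field: count} for fields with fewer than `minimum` non-missing values."""
    short = {}
    for field in fields:
        count = sum(1 for rec in records if rec.get(field) is not None)
        if count < minimum:
            short[field] = count
    return short


# Hypothetical records: only four rows, one of them with a missing MPG value.
records = [
    {"Displ": 1.6, "MPG": 38.0},
    {"Displ": 2.0, "MPG": 31.0},
    {"Displ": 3.5, "MPG": None},
    {"Displ": 5.0, "MPG": 17.0},
]

print(fields_with_too_few_values(records, ["Displ", "MPG"]))
```

An empty result means every listed field has at least five usable values.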
The tables and scatterplots in this article are from the Association Analysis sample workflow in Alteryx. You can find it under Help -> Sample Workflows -> Predictive Analytics -> Association Analysis.
By now, you should have expert-level proficiency with the Association Analysis Tool! If you can think of a use case we left out, feel free to use the comments section below! Consider yourself a Tool Master already? Let us know at Community@alteryx.com if you’d like your creative tool uses to be featured in the Tool Mastery Series.