Alteryx Designer

Find answers, ask questions, and share expertise about Alteryx Designer.

Correlation Tool- Filtering Out Excessive Multi Collinearity

Highlighted
6 - Meteoroid

Hello All,

 

I am trying to devise a way to systematically remove variables based on an excessive correlation, The Pearson tool allows me to create an n x n matrix of correlations for all variables, however, there doesn't appear to be an obvious way to remove a correlation pairing based on a threshold value (e.g. remove if >0.8).

 

I have also attempted to do the same but using Spearman's coefficient but the association analysis outputs a series of info-graphics displaying the relationships and appears to be geared towards data investigation rather than processing. If there is a way to undertake the above problem using Spearman's.

 

I have attached an example dataset below. 

 

Thanks in advance,

 

Stu

 

 

Highlighted
6 - Meteoroid

Anyone got any ideas on this?

Highlighted
6 - Meteoroid

Bump

Highlighted
9 - Comet

@Stu1 

 

If I understand your question, you want to systematically filter out collinear variables based on a correlation result. Is that right? The challenge is that the Pearson tool only compares on pair of variables, whereas the Association analysis tool, which does multiple pairwise comparisons, only outputs a report, not the data. 

 

Try the Pearson correlation Matrix macro from Laszlo Dobiasz in the Gallery.

https://gallery.alteryx.com/#!app/Pearson-Correlation-Matrix/5e55ed0e8a933712c0404c30 

With it you can get the correlation values either as a matrix or as a long list.  Then you can use a filter to remove whatever variables you want (note that you will have 4 rows for each pair of variables), and join back to your original data. 

 

I've attached a workflow to get you started.

 

Does this solve your question?

 

 

 

 

Highlighted
6 - Meteoroid

Thanks Terry.

 

You correctly understand my intention but there does appear to be an issue with the tool (the solution itself is good): it throws out at error stating 'There were more than 16 records in the source' which I initially interpreted to mean that it was limited to 16 pairwise correlations but looking at the output, it appears to computed them all. Is it safe to ignore this error?

 

Thanks,

 

Stuart

Labels