Advent of Code is back! Unwrap daily challenges to sharpen your Alteryx skills and earn badges along the way! Learn more now.

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Pearson correlation with nulls

Mikis
8 - Asteroid

Hello,

 

I'm trying to run a pearson correlation, and everything goes well except when my data contains nulls.

My data looks like this (actual data attached):

var1var2var3
53 
625
726
9  
513
652

 

When entering this into pearson, the outcome says

FieldNamevar1var2var3
var11  
var2 1 
var3  1

 

the pearson documentation recommends to use the imputation tool and replace the null values, but it's important that the calculation is identical to the excel calculation.

in Excel, a record is ignored when only one of the two values are null (cf screenshot below)

 

Is this a setting that I can set to the pearson tool? Or can someone suggest a method to achieve this?

I imagine a batch macro might do the trick, but that sounds kinda sloppy for what I hope is a button that I'm missing or something :) ?

 

example.PNG

Thanks,

Mikis

2 REPLIES 2
MichalM
Alteryx Alumni (Retired)

Hi @Mikis

 

Pearson correlation simply doesn't work when there are NULL values.  The excel function filters out rows with NULLs and only provides a correlation values for two data sets (variables only).

The only way I can think of replicating this would be using a batch macro - selecting all variable pair combinations one at a time, filtering out rows with NULLs, and running the Pearson correlation. 

The other way would be to use Imputation / Data cleansing (replacing NULLS with 0s) before running the correlation but as you correctly pointed out, the correlation value would be different.

 

Michal

 

Mikis
8 - Asteroid

Hi @MichalM

 

Thanks, I have built what I think you mean and added it in attachment.

The problem is that it's really slow. The previous pearson correlation with 33 variables took less than a second, this thing takes almost 2 minutes (since the batch macro with 33 variables needs to run 33² = 1089 times).

 

This correlation is just the first step of a whole flow. If this already takes 2 minutes then I'll probably have to ask my users to just correlate in excel and start the flow from there. But if someone can use it, the package attached might help.

 

Regards,

Mikis

Labels