Dynamic Correlation -> Is there a better way than using Python?

Hi everyone,

This is both a sharing and a question 🙂

I need to calculate the correlation coefficients between columns dynamically because the number and names of the said-columns might change randomly.

When using the standard "Pearson Correlation" tool, the macro will crash every time the names (and/or number) of columns change.

To by-pass this problem, I am using a Python bloc and use a basic "dataframe.corr()" to get my correlation matrix.

This solves the problem (and it might help some of us facing the same issue!) but I was wondering if there was a better/proper way to do so?

In addition, the standard Alteryx tool is providing - as an output - a full matrix:

FieldName	X4	X5	X6
X4	1	-0.494204753335428	1
X5	-0.494204753335428	1	-0.494204753335428
X6	1	-0.494204753335428	1

Whereas the Python bloc is removing the "FieldName" column (even if the code output in the Jupyter notebook is a full matrix...)

X4	X5	X6
1	-0.494204753335428	1
-0.494204753335428	1	-0.494204753335428
1	-0.494204753335428	1

I will be happy to have your thoughts! (the corresponding simplified workflow is attached)

Pierre-Louis

Edit : Correlation Matrix corrected; thanks @chrisha

Dynamic Correlation.yxmd

Tips and Tricks

Python

Dynamic Processing

Data Investigation

Accepted answers

chrisha

Hi @pierrelouisbescond ,

I faced a similar issue before and used the R tool for this. I feel like the R tool is faster and more convenient for this:

df <- read.Alteryx("#1", mode="data.frame")
corr.matrix <- as.data.frame(cor(df, method="spearman"))
corr.matrix$FieldNames <- colnames(corr.matrix)
write.Alteryx(corr.matrix, 2)

Hope that helps

Christopher

BTW: Your correlation matrix above looks wrong. It's not symmetrical, but might simply be a copy-paste-error. 🙂

All comments

chrisha

Hi @pierrelouisbescond ,

I faced a similar issue before and used the R tool for this. I feel like the R tool is faster and more convenient for this:

df <- read.Alteryx("#1", mode="data.frame")
corr.matrix <- as.data.frame(cor(df, method="spearman"))
corr.matrix$FieldNames <- colnames(corr.matrix)
write.Alteryx(corr.matrix, 2)

Hope that helps

Christopher

BTW: Your correlation matrix above looks wrong. It's not symmetrical, but might simply be a copy-paste-error. 🙂

danilang

Hello @pierrelouisbescond

This was a fun problem. It's rare that we see cases that are almost completely solved by metadata manipulation.

Here's a solution that will work with any field names as long as they're in the same position.

Start with a Dynamic Select to get the data columns based on column position. In the top branch, rename the original columns to X1, X2, X3 in the Dynamic Rename based on the values in the Pearson column Names input. Run the Pearson Correlation which is still configured to use the X1, etc columns.

The bottom branch selects the Name field from the output of the Field info tool to get the original names. This is joined by record position to the Pearson names to give a mapping between original names and the Pearson Names. This list is used in the rename input of the second Dynamic Rename on the top branch to restore the original column names. The same list is joined to the output of this Dynamic Rename to replaces the Pearson field names with the original ones

Here's the input that I used

the Dog, Camel, and Fish columns are the renamed ones from your original data.

After the process, the output is

You should be able to adapt this technique to your data set.

Edit: Excellent R solution from @chrisha. Very clever to attach the fieldname column before the output.

Dan

Dynamic Correlation v2.0.yxmd

pierrelouisbescond

Thanks a lot @chrisha and @danilang!

I will keep the solution from @chrisha for its implementation simplicity...and also because the range of columns can vary from 30 up to 500 so the R solution is definitely answering to my need 🙂

Is there a simple way in R to put the Fieldnames as the first column? (I can do it with select after the R block but it would be nicer)

And, for the record, I have compared R and Python's performance for 3 columns of 300 values and R (2.6 seconds) is indeed faster than Python (3.1 seconds) but the accuracy is lower (maybe some parameters to fine tune): R output is "1" on a two slightly different columns whereas Python provides "0.999996". Not a big deal anyway for the use I have ^^

Pierre-Louis

Quick Links

This months top contributors

atcodedog05 19458

Qiu 15866

binu_acs 15708

MarqueeCrew 13708

apathetichell 13703