Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Dynamic Correlation -> Is there a better way than using Python?

pierrelouisbescond
8 - Asteroid

Hi everyone,

 

This is both a sharing and a question 🙂

 

I need to calculate the correlation coefficients between columns dynamically because the number and names of the said-columns might change randomly.

 

When using the standard "Pearson Correlation" tool, the macro will crash every time the names (and/or number) of columns change.

 

To by-pass this problem, I am using a Python bloc and use a basic "dataframe.corr()" to get my correlation matrix.

 

This solves the problem (and it might help some of us facing the same issue!) but I was wondering if there was a better/proper way to do so?

 

In addition, the standard Alteryx tool is providing - as an output - a full matrix:

 

FieldNameX4X5X6
X41-0.4942047533354281
X5-0.4942047533354281-0.494204753335428
X61-0.4942047533354281

 

Whereas the Python bloc is removing the "FieldName" column (even if the code output in the Jupyter notebook is a full matrix...)

 

X4X5X6
1-0.4942047533354281
-0.4942047533354281-0.494204753335428
1-0.4942047533354281

 

I will be happy to have your thoughts! (the corresponding simplified workflow is attached)

Pierre-Louis

 

Edit : Correlation Matrix corrected; thanks @chrisha 

6 REPLIES 6
chrisha
11 - Bolide

Hi @pierrelouisbescond ,

 

I faced a similar issue before and used the R tool for this. I feel like the R tool is faster and more convenient for this:

 

 

df <- read.Alteryx("#1", mode="data.frame")
corr.matrix <- as.data.frame(cor(df, method="spearman"))
corr.matrix$FieldNames <- colnames(corr.matrix)
write.Alteryx(corr.matrix, 2)

 

 

Hope that helps

Christopher

 

BTW: Your correlation matrix above looks wrong. It's not symmetrical, but might simply be a copy-paste-error. 🙂

danilang
19 - Altair
19 - Altair

Hello @pierrelouisbescond 

 

This was a fun problem.  It's rare that we see cases that are almost completely solved by metadata manipulation.  

 

Here's a solution that will work with any field names as long as they're in the same position.

 

WF.png

 

Start with a Dynamic Select to get the data columns based on column position.  In the top branch, rename the original columns to X1, X2, X3 in the Dynamic Rename based on the values in the Pearson column Names input.  Run the Pearson Correlation which is still configured to use the X1, etc columns. 

 

The bottom branch selects the Name field from the output of the Field info tool to get the original names.  This is joined by record position to the Pearson names to give a mapping between original names and the Pearson Names.  This list is used in the rename input of the second Dynamic Rename on the top branch to restore the original column names.  The same list is joined to the output of this Dynamic Rename to replaces the Pearson field names with the original ones

 

Here's the input that I used

 

Input.png

 

the Dog, Camel, and Fish columns are the renamed ones from your original data.

 

After the process, the output is 

 

Output.png 

 

You should be able to adapt this technique to your data set.

 

Edit: Excellent R solution from @chrisha.  Very clever to attach the fieldname column before the output. 

 

Dan

pierrelouisbescond
8 - Asteroid

Thanks a lot @chrisha and @danilang!

 

I will keep the solution from @chrisha for its implementation simplicity...and also because the range of columns can vary from 30 up to 500 so the R solution is definitely answering to my need 🙂

 

Is there a simple way in R to put the Fieldnames as the first column? (I can do it with select after the R block but it would be nicer)

 

And, for the record, I have compared R and Python's performance for 3 columns of 300 values and R (2.6 seconds) is indeed faster than Python (3.1 seconds) but the accuracy is lower (maybe some parameters to fine tune): R output is "1" on a two slightly different columns whereas Python provides "0.999996". Not a big deal anyway for the use I have ^^

 

Pierre-Louis

chrisha
11 - Bolide

Note that my code above uses Spearman's Rank Correlations, because I quickly copied it from my project. Your Python code would have produced Pearson correlations. So, you might remove the method="Spearman" part or write mehod="Pearson". Might explain your differences in the values. Otherwise, there shouldn't be any differences between the Python and the R implementation.

 

Regarding the column ordering: In Base R it can be quite extensive code if you don't want to type all column names. Personally, I'd use dplyr's select for these kind of problems: https://stackoverflow.com/a/30471808

pierrelouisbescond
8 - Asteroid

Thanks (again) for the guidance @chrisha!

I even found a simpler way in R:

 

df <- read.Alteryx("#1", mode="data.frame")

corr.matrix <- as.data.frame(cor(df, method="spearman"))

corr.matrix <- data.frame(FieldNames = colnames(corr.matrix), corr.matrix)

write.Alteryx(corr.matrix, 1)

 

This is the second time I write in R but it helps a lot with Stackoverflow 😅🙂

chrisha
11 - Bolide

Yes, very elegant solution! Well done!

Labels