Alteryx Designer Desktop Discussions

SOLVED

Feature Selection in K-Centroids Analysis

Anthony_A
7 - Meteor

I have created some clusters based on survey data. I would like to see the feature selection, or variable importance, showing how much each variable contributes to the cluster analysis. There doesn't seem to be an option to view this out of the box.

 

I am able to do this in R (with the same package Alteryx is using) and have been referencing this documentation here: https://cran.r-project.org/web/packages/FeatureImpCluster/readme/README.html

 

Does anyone have any ideas on how to get the variable importance in regard to the clusters? I would like to see which variables have the most impact in determining the clusters. 

 

Thanks!

 

 

 

 

apoorv_trivedi
5 - Atom

Hi Anthony_A,

 

To judge the quality of a cluster model, the main things to look at after the model is created are the inter-cluster and intra-cluster distances. The distance between points within a cluster should be as small as possible, and the distance between the centroids of different clusters should be as large as possible. To choose the number of clusters K, we use an elbow graph.
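As an illustration of the elbow idea mentioned above, here is a minimal sketch in base R. It uses stats::kmeans() on synthetic data rather than the flexclust call that Alteryx makes, so the data, the variable names, and the range of k are all assumptions chosen for the example.

```r
# Synthetic stand-in for survey data: three well-separated groups, two variables.
set.seed(1)
dat <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)
colnames(dat) <- c("x1", "x2")

# Total within-cluster sum of squares for k = 1..8 (lower means tighter clusters).
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(dat, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which wss stops dropping sharply.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
print(data.frame(k = k_values, wss = round(wss, 1)))
```

Inside an Alteryx R tool the plot would need to be wrapped in the tool's graph output; the printed table gives the same information as numbers.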

 

 

Further, if this does not answer your question, you can reproduce the R code you referenced above in the R tool to get the output in Alteryx.

 

 

Hope this helps!!!

Anthony_A
7 - Meteor

Where is the elbow chart you mentioned located in Alteryx? I don't see that information using the K-centroids Diagnostic tool.

 

That information is useful, but let's say, for the sake of example, that I have 4 clusters based on 3 variables: Age, Income, and Education. I would like to see how much effect each variable has on determining the clusters (is Age more important than Income?). It seems something like this should be available no matter what type of data you are clustering.
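One common way to answer this kind of question is permutation importance: shuffle one variable at a time and count how many rows end up in a different cluster. The sketch below shows that idea in base R with stats::kmeans(). It is not the FeatureImpCluster implementation and not what the Alteryx tool runs; the data are made up, and the helper names nearest_centroid and perm_importance are hypothetical.

```r
set.seed(1)
n <- 300
group <- rep(1:3, each = n / 3)

# Hypothetical data: Age and Income separate the groups, Education is pure noise.
dat <- data.frame(
  Age       = rnorm(n, mean = c(25, 45, 65)[group], sd = 3),
  Income    = rnorm(n, mean = c(30, 60, 90)[group], sd = 5),
  Education = rnorm(n, mean = 14, sd = 2)
)
x <- scale(dat)  # put all variables on a common scale
fit <- kmeans(x, centers = 3, nstart = 10)

# Assign each row to its nearest centroid (squared Euclidean distance).
nearest_centroid <- function(m, centers) {
  d <- sapply(seq_len(nrow(centers)), function(j) {
    rowSums((m - matrix(centers[j, ], nrow(m), ncol(m), byrow = TRUE))^2)
  })
  max.col(-d, ties.method = "first")
}

# Importance of a variable = share of rows that change cluster when that
# variable is shuffled, averaged over a few shuffles.
perm_importance <- function(m, centers, n_perm = 10) {
  baseline <- nearest_centroid(m, centers)
  sapply(colnames(m), function(v) {
    mean(replicate(n_perm, {
      shuffled <- m
      shuffled[, v] <- sample(shuffled[, v])
      mean(nearest_centroid(shuffled, centers) != baseline)
    }))
  })
}

importance <- perm_importance(x, fit$centers)
print(sort(importance, decreasing = TRUE))
```

A variable whose shuffling moves many rows to another cluster matters more to the solution; in this made-up data Education should score lowest because it carries no group signal.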

ImadZidan
12 - Quasar

Hello @Anthony_A ,

 

I see what you are trying to do here. Your question is very interesting, and my guess is that it goes beyond the tools and calls for a closer look at the algorithm used behind the scenes.

If I understand you correctly, you are after the exact formula(s) the algorithm uses to decide the importance of a feature.

 

We must start from somewhere. Please have a look at this thread, which addresses some of your questions, though not all.

 

Features Importance for Clustering ? (researchgate.net)

 

Let's discuss. It is interesting and educational.

Anthony_A
7 - Meteor

Thanks @ImadZidan 

 

I was able to find a workaround for this but don't know if this is the right method.

 

  • Complete cluster analysis
  • Drag a Select tool onto the canvas and select the same variables I used in the cluster setup, so the data frame matches (if I used 10 variables in my analysis, I select the same 10). I need the same data, but there doesn't seem to be a way to grab the data (the.data) used in the clustering tool through its output or other means.
  • Drag an R Tool to the canvas and connect to the Select Tool
  • I take the "Call" that Alteryx used in the cluster analysis and paste it between some other code in the R Tool to get the variable importance.
  • Here's an example of my R Code:
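library(flexclust) #import library
library(FeatureImpCluster) #import library
library(data.table) #provides as.data.table(); loaded explicitly in case it is not already attached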

the.data <- read.Alteryx("#1", mode="data.frame") #read same data used in clustering tool

#paste in the call from the Alteryx clustering tool
clust <- stepFlexclust(model.matrix(~-1+ Var1 + Var2 + Var3 + Var4 + Var5 + Var6 + Var7 + Var8 + Var9 + Var10, the.data), k = 4, nrep = 3, FUN = kcca, family = kccaFamily("kmedians"))
#end of Alteryx Call


FeatureImp_res <- FeatureImpCluster(clust,as.data.table(the.data)) #Use FeatureImpCluster to take the cluster model and get variable imp.

FeatureImp_df <- as.data.frame(FeatureImp_res$featureImp) #turns features from a list to dataframe

FeatureImp_df_rn <- tibble::rownames_to_column(FeatureImp_df, "Variable") #Adds the variable name to the importance scores

write.Alteryx(FeatureImp_df_rn, 1) #outputs dataframe in output #1

 

So essentially I'm taking the call that was used in the clustering tool (so everything is 1:1), assigning it to 'clust', and then using the FeatureImpCluster package to get the variable importance. The result is a table of variable importance with respect to the clusters.
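The last two steps of that code turn the importance scores into a table with a Variable column. The same reshaping can be sketched in base R without the tibble dependency, assuming the scores arrive as a named numeric vector (which is what the as.data.frame() and rownames_to_column() steps in the post suggest). The vector below is hypothetical and only stands in for FeatureImp_res$featureImp.

```r
# Hypothetical importance scores standing in for FeatureImp_res$featureImp.
feature_imp <- c(Var1 = 0.22, Var3 = 0.41, Var2 = 0.05)

# Base R alternative to tibble::rownames_to_column(): build the name column directly.
importance_df <- data.frame(
  Variable   = names(feature_imp),
  Importance = unname(feature_imp),
  stringsAsFactors = FALSE
)

# Most important variable first.
importance_df <- importance_df[order(-importance_df$Importance), ]
rownames(importance_df) <- NULL
print(importance_df)
```

The resulting data frame has two plainly named columns, which makes the downstream output easier to read than a column named after the R expression that produced it.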

 

It seems to work properly, but I'm concerned I'm missing something, because my clusters are slightly different when I paste in the R call Alteryx uses. I thought this might have to do with setting the seed or something else.

 

I'm sure I could open up the clustering tool macro and add this in at some point but I think I'm going to save that for another day. 


 

ImadZidan
12 - Quasar

Hello @Anthony_A ,

 

Your analysis is spot on. I had a brief look at the macro and, yes, the slight difference results from not setting the seed in your R code.

 

 I suggest the following.

 

In your raw R code, set the seed to 1, as it is in the macro: set.seed(1). Setting the seed ensures that running the R code on the same dataset produces the same output.

 

...

set.seed(1)

the.data <- .....

...

 

Also, in the R code:

 

clust <- stepFlexclust(model.matrix(~-1+ Var1 + Var2 + Var3 + Var4 + Var5 + Var6 + Var7 + Var8 + Var9 + Var10, the.data), k = 4, nrep = 3, FUN = kcca, family = kccaFamily("kmedians"))

 

Ensure that the k and nrep values correspond to the number of clusters and the number of starting seeds configured in the Cluster Analysis tool.
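To see why the seed matters, here is a minimal sketch in base R. It uses stats::kmeans() on the built-in iris data instead of stepFlexclust(), so it only illustrates the principle; nstart is used here as a rough analogue of nrep, which is an assumption for the example. The helper name run_clustering is hypothetical.

```r
x <- as.matrix(iris[, 1:4])  # built-in data, used only for illustration

# Cluster with a fixed seed so the random starting points are the same each time.
run_clustering <- function(seed) {
  set.seed(seed)
  kmeans(x, centers = 4, nstart = 3)$cluster
}

a <- run_clustering(1)
b <- run_clustering(1)

# Same seed, same data, same settings: the assignments match exactly.
print(identical(a, b))
```

Without the set.seed() call, two runs can start from different random centers and may land on slightly different clusters, which matches the mismatch described earlier in the thread.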

 

See where that takes you.

 

 

 

Anthony_A
7 - Meteor

Thanks! That worked @ImadZidan 

 

By setting the seed everything now aligns. I know there is either a way to make a macro out of this or add this as an output on the K-Centroids Analysis but I am in the middle of a project right now. Once I have some downtime I'll develop a more elegant solution and post it here.

 

Also, it looks like there are some other easy things to do with the R code and this package. For example, with the R code I provided earlier, I can export a plot of the median values for each cluster by setting up the Alteryx graph function and calling barplot() on the cluster object. If you could easily connect to the Call in the model output, it would make some of these functions seamless with macros.

 

AlteryxGraph(5, width=1000, height=1000)#Use Alteryx graph function
barplot(clust) #graph plot of cluster
invisible(dev.off())

 

Thanks for your help everyone! 

Anthony_A
7 - Meteor

I figured out a better way to do this. You'll need to install the FeatureImpCluster library. 

 

Open K-Centroids Macro
Save a copy (or at least I did)
Open the R tool in the macro and add the following to the bottom: 

library(FeatureImpCluster)#load library
FeatureImp_res <- FeatureImpCluster(clus.sol,as.data.table(the.matrix)) #Use FeatureImpCluster to take the cluster model (clus.sol) and data (the.matrix) to get variable imp.
FeatureImp_df <- as.data.frame(FeatureImp_res$featureImp) #turns features from a list to dataframe
FeatureImp_df_rn <- tibble::rownames_to_column(FeatureImp_df, "Variable") #Adds the variable name to the importance scores
write.Alteryx(FeatureImp_df_rn, 3) #outputs dataframe in output #3

Add a Macro output to the 3rd connection on the R node.

Save and you're up and running. The third output on the K-Centroids Analysis will show the variable importance.

Disclaimer: I do not have all contingencies covered here. I have come across some situations where one column prevents the variable importance from running even though the cluster analysis works just fine.
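The post does not say what was wrong with the problem column, so the following is only a guess at where to look. This base R sketch flags columns that are non-numeric, contain missing values, or never vary, which are common reasons for numeric routines to fail. The data frame and the helper name check_columns are hypothetical.

```r
# Hypothetical input with three kinds of potentially troublesome columns.
the.data <- data.frame(
  Var1 = c(1.2, 3.4, 2.2, 5.1),
  Var2 = c(7, 7, 7, 7),          # constant
  Var3 = c("a", "b", "a", "c"),  # non-numeric
  Var4 = c(1, NA, 3, 4)          # contains a missing value
)

# One row per column, with a flag for each kind of potential problem.
check_columns <- function(df) {
  data.frame(
    Variable    = names(df),
    non_numeric = !vapply(df, is.numeric, logical(1)),
    has_missing = vapply(df, anyNA, logical(1)),
    constant    = vapply(df, function(col) {
      length(unique(col[!is.na(col)])) <= 1
    }, logical(1)),
    row.names   = NULL
  )
}

report <- check_columns(the.data)
print(report)
```

Running a check like this on the data before the importance step could help narrow down which column is responsible.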

 

 
