

Which and how many variables to choose for clustering.

gurpe974
7 - Meteor

Hello everyone,

After getting my Alteryx Core Certification and the Udacity Nanodegree in Predictive Analytics for Business last year, I haven't had many chances to work with Alteryx. That's why I decided to use it for my bachelor's thesis about customer segmentation within the Austrian local skier population.

I have collected a sample of 200 people with 28 Likert-scaled variables. After preparing the data for the analysis, I observed correlations between some of the variables, so I decided to apply PCA to some of them and keep the others as individual variables.

My problem is that I do not know which variables, and how many, I should use to obtain the most representative cluster solution. I have been using the Adjusted Rand Index and the Calinski-Harabasz Index to assess the number of clusters and the quality of the solution with different variables. I wanted to know if there is an approach for determining which variables or principal components to use to obtain "the best" clustering model with Alteryx.
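For reference, the Calinski-Harabasz index mentioned above is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom: CH = [B / (k - 1)] / [W / (n - k)], where n is the number of observations and k the number of clusters. The following is a minimal sketch of that textbook definition in plain Python. It is not how the Alteryx tools compute it internally, and the function name and inputs are assumptions made for illustration.

```python
from statistics import fmean


def calinski_harabasz(points, labels):
    """Textbook Calinski-Harabasz index for already-labelled points.

    points: list of equal-length numeric tuples (one per respondent)
    labels: list of cluster labels, same length as points
    """
    n = len(points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= number of clusters < number of points")

    def centroid(rows):
        # Column-wise mean of a list of points.
        return [fmean(col) for col in zip(*rows)]

    def sq_dist(a, b):
        # Squared Euclidean distance.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = 0.0  # B: size-weighted spread of cluster centroids
    within = 0.0   # W: spread of points around their own centroid
    for rows in clusters.values():
        c = centroid(rows)
        between += len(rows) * sq_dist(c, overall)
        within += sum(sq_dist(r, c) for r in rows)
    # Note: raises ZeroDivisionError if every point sits on its centroid.
    return (between / (k - 1)) / (within / (n - k))


toy = [(1, 1), (1, 2), (5, 5), (5, 6)]
print(calinski_harabasz(toy, [0, 0, 1, 1]))  # well-separated grouping
print(calinski_harabasz(toy, [0, 1, 0, 1]))  # poorly separated grouping
```

A higher value indicates tighter, better-separated clusters. Values obtained from different variable sets live on different distance scales, so comparing them across variable subsets is only a rough guide.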

Thank you very much to anyone who can give me some tips or resources to complete my clustering project. I really don't know how to move forward at this point.

4 REPLIES
mst3k
11 - Bolide

Look into the K-Centroids Diagnostics tool in the Predictive Grouping menu

https://help.alteryx.com/20213/designer/k-centroids-diagnostics-tool

It has both the adjusted Rand index and the Calinski–Harabasz index built in.

gurpe974
7 - Meteor

Hello mst3k,

Thanks for your answer! Still, it does not quite answer my question. Sorry, I might not have explained it well enough. I was already using the K-Centroids Diagnostics tool to check different clustering solutions with different variables and methods. My question is whether it is possible to find the solution with the highest index values without trial and error. As I said, I have 28 variables, so there are many combinations, which makes it very difficult and time-consuming to check all the different solutions. I wanted to know if there is a better way, perhaps an iterative approach, to do this with Alteryx.
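One common iterative alternative to checking every combination is greedy forward selection: start with no variables, repeatedly add the single variable that most improves the chosen index, and stop when nothing improves it. The sketch below shows only the selection loop. The `score` callable is hypothetical and would have to be supplied by the caller, for example by running the clustering on the given fields and returning the Calinski-Harabasz index. Greedy search is not guaranteed to find the best subset.

```python
def forward_select(fields, score, max_size):
    """Greedily add the field that most improves score(subset).

    fields:   list of candidate field names
    score:    callable taking a tuple of field names and returning a
              number, higher is better (hypothetical, caller-supplied)
    max_size: stop once this many fields have been chosen
    """
    chosen = []
    best_score = float("-inf")
    remaining = list(fields)
    while remaining and len(chosen) < max_size:
        # Score every one-field extension of the current subset.
        trials = [(score(tuple(chosen + [f])), f) for f in remaining]
        # max() compares scores first; ties fall back to the field name.
        trial_score, trial_field = max(trials)
        if trial_score <= best_score:
            break  # no candidate improves the score, so stop early
        chosen.append(trial_field)
        remaining.remove(trial_field)
        best_score = trial_score
    return chosen, best_score


# Toy stand-in for a real clustering score, for demonstration only.
weights = {"a": 3, "b": 2, "c": -1}
toy_score = lambda subset: sum(weights[f] for f in subset)
print(forward_select(["a", "b", "c"], toy_score, max_size=3))
```

With 28 candidate variables and a cap of 7, this loop needs at most 28 + 27 + ... + 22 = 175 scoring runs.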

Cheers!

mst3k
11 - Bolide

Yikes, hmmm... I could see doing it with a macro, but attempting every combination of 28 fields would mean 2^28 - 1 (roughly 268 million) possible subsets. Are there some you can rule out? Not sure where to go from there. Could you use some of the Data Investigation tools to see which variables correlate with whatever you're trying to measure, and only use the fields with a high correlation to that outcome?
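For a sense of scale: picking any non-empty subset of 28 fields gives 2^28 - 1 candidates, whereas 28! would count the possible orderings of all 28 fields, which is a far larger number. A quick check in Python (the variable names are only illustrative):

```python
import math

n_fields = 28

# Each field is either in or out of a subset; drop the empty subset.
nonempty_subsets = 2 ** n_fields - 1

# Number of ways to order all 28 fields, for comparison.
orderings = math.factorial(n_fields)

print(f"{nonempty_subsets:,}")  # 268,435,455
print(f"{orderings:.3e}")
```

Either way, an exhaustive run over all subsets is impractical, so ruling fields out first matters.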

gurpe974
7 - Meteor

I have computed several principal components from the variables that had the highest correlations, and I could imagine cutting it down to 20 variables, including the principal components. Moreover, according to some resources I have read, I shouldn't use more than 5-7 variables for the clustering solution because of my small sample size. Maybe that makes the process a bit easier. How should I proceed with the macro? I guess it's worth a try.
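Restricting the search to subsets of 5 to 7 fields out of 20 shrinks the problem considerably. The sketch below counts and enumerates those subsets. The field names are hypothetical placeholders. One possible use, stated here as an assumption rather than a tested workflow, is to write each subset as one row of a control table that a batch macro runs once per record.

```python
from itertools import combinations
from math import comb

# Hypothetical placeholder names for the 20 remaining fields.
fields = [f"var_{i:02d}" for i in range(1, 21)]


def candidate_subsets(fields, min_size=5, max_size=7):
    """Yield every subset of fields with min_size..max_size members."""
    for size in range(min_size, max_size + 1):
        yield from combinations(fields, size)


# C(20,5) + C(20,6) + C(20,7) = 15504 + 38760 + 77520
total = sum(comb(len(fields), k) for k in range(5, 8))
print(total)  # 131784

# Peek at the first candidate without materialising the whole list.
print(next(candidate_subsets(fields)))
```

About 132,000 clustering runs is still a lot, but it is a finite batch job rather than an open-ended manual search.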

Thanks!!
