This article is part of the Tool Mastery Series, a compilation of Knowledge Base contributions to introduce diverse working examples for Designer Tools. Here we’ll delve into uses of the K-Centroids Clustering Tool on our way to mastering the Alteryx Designer.
Cluster analysis has a wide variety of use cases, including grouping stores by location using spatial data, customer segmentation, and even insurance fraud detection. Cluster analysis groups individual observations so that each group (cluster) contains data that are more similar to one another than to the data in other groups. Included with the Predictive Tools installation, the K-Centroids Cluster Analysis Tool allows you to perform cluster analysis on a data set with the option of using three different algorithms: K-Means, K-Medians, and Neural Gas.
One popular use case for cluster analysis is Market Segmentation, which is the process of dividing a large customer base or market into smaller groups of consumers, based on shared characteristics. Cluster analysis will group potential customers based on shared traits (e.g., age, gender, interests), which can allow a business to focus its marketing on the groups with the highest potential, or even create more personalized marketing strategies.
You can think of cluster analysis as the process of creating groups based on where points are plotted in an n-dimensional scatter plot. Generally speaking, the goal is to minimize the distance between points within the same cluster, while also maximizing the distance between cluster groups. Because this type of analysis is based on numeric distance, all of the variables that are being used for clustering need to be continuous. Clustering is an unsupervised classification method, which means you do not provide the target groups for analysis.
Because n dimensions can be difficult to imagine, especially when n is greater than 3 (the number of dimensions we are used to dealing with), I have included a plot of a 2-dimensional example that clusters the Iris data set based on Petal Length (x-axis) and Petal Width (y-axis).
As you can see, observations (records) with similar traits (variable values) are grouped together and labeled as belonging to the same cluster.
The configuration of the K-Centroids Cluster Analysis Tool is straightforward. However, before you even configure the tool, you might see the following error after connecting its input to data:
No need to worry, this is a metadata error and will be resolved as soon as the workflow is run.
The Configuration Tab for the tool displays all of the options related to the algorithm itself.
The first Configuration Option is the Solution Name. This is the name you want to give to your clustering solution. You can name your clustering solution anything you want, as long as it starts with a letter, and only contains letters, numbers, and the special characters period (".") or underscore ("_").
The next configuration option is Fields. This is a check-box list of the fields that you would like to be considered in your cluster analysis. You will notice that only numeric field types populate this list. This is because cluster analysis by nature can only be performed on continuous variables. Select the combination of features that you would like your cluster groups to be created from by checking and unchecking the associated boxes. As indicated, you must select two or more variables for cluster analysis.
Once you have your combination of clustering variables selected, you can choose if you’d like to Standardize the fields. Standardization of fields in Cluster Analysis is a frequent practice because of the impact distance between variable values has on the clustering solution. In this tool, your options for standardizing are either to do so with z-score or unit interval standardization. If you would like to read more, please see the Community article: Standardization in Cluster Analysis.
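To make the two standardization options concrete, here is a minimal Python sketch of both rescalings. It is an illustration only, not the tool's own code: the function names and the sample income and age values are made up, and using the sample standard deviation for the z-score is an assumption.

```python
from statistics import mean, stdev

def z_score(values):
    """Rescale a field to mean 0 and (sample) standard deviation 1."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(v - mu) / sigma for v in values]

def unit_interval(values):
    """Rescale a field linearly so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up example fields that live on very different scales.
income = [20000.0, 40000.0, 60000.0]
age = [25.0, 35.0, 45.0]

for name, field in (("income", income), ("age", age)):
    print(name, z_score(field), unit_interval(field))
```

After either rescaling, both fields span the same range, so income no longer dominates the distance calculation simply because its raw values are larger. A constant field would cause a division by zero in this sketch.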
Next, you can select a clustering method. The K-Centroids Cluster Analysis Tool uses the underlying R package flexclust to implement the three clustering algorithm options: K-Means, K-Medians, and Neural Gas. Each of these algorithms approaches the task of dividing data into groups based on distance differently.
K-means partitions the observations in a dataset into any number of clusters (specified in the next step in configuration) by assigning each observation to the cluster with the nearest mean, using Euclidean distance. This method effectively partitions the data space into Voronoi cells. The algorithm is implemented iteratively by first randomly selecting n points (n being your target number of clusters) as starter centroids, grouping all of the points around these centroids, and recalculating each centroid based on the mean values of its group. This process is repeated until the centroids become stable (reaching convergence).
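The iterative loop described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the flexclust implementation: the function name, the seed parameter, and the sample points are all hypothetical, and starter centroids are drawn from the observations themselves.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Cluster tuples of numbers into k groups; return (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random observations as starter centroids
    for _ in range(max_iter):
        # Assignment step: group each point around its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its previous centroid).
        updated = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centroids:  # nothing moved, so the solution is stable
            break
        centroids = updated
    labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
    return centroids, labels

# Made-up observations forming two loose groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

Which cluster ends up with which label depends on the random starting centroids, so only the grouping itself is meaningful.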
K-medians is a variation of K-means that uses the median, instead of the mean, to determine the centroid of each cluster. The median is computed separately in each dimension (for each variable), and distances are measured with the Manhattan formula (think of walking or city-block distance, where you have to follow sidewalk paths). This method is more reliable for discrete variables or even binary data sets.
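The two ingredients that set K-medians apart, Manhattan distance and a per-variable median, can be sketched as follows. The function names and the sample cluster are made up for illustration. The single extreme record is included to show a well-known property of the median (it is less sensitive to outliers than the mean), which is my addition rather than a claim from the tool's documentation.

```python
from statistics import mean, median

def manhattan(p, q):
    """City-block distance: the sum of absolute differences per variable."""
    return sum(abs(a - b) for a, b in zip(p, q))

def median_centroid(cluster):
    """K-medians style centroid: the median of each variable taken separately."""
    return tuple(median(dim) for dim in zip(*cluster))

def mean_centroid(cluster):
    """K-means style centroid, shown for comparison."""
    return tuple(mean(dim) for dim in zip(*cluster))

# Made-up cluster of discrete values with one extreme record.
cluster = [(1, 1), (2, 1), (2, 2), (3, 2), (100, 2)]

print(manhattan((0, 0), (3, 4)))   # 3 + 4 = 7
print(median_centroid(cluster))    # (2, 2)
print(mean_centroid(cluster))      # first coordinate is pulled toward the extreme record
```

The median centroid also lands on values that actually occur in the data, which is one reason it suits discrete or binary variables.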
Like K-Means, Neural Gas uses Euclidean distance. However, the location of the centroid of each cluster is a weighted average of all of the data points, with the points assigned to the cluster receiving the greatest weight. The weight a point contributes decreases with distance rank: it is largest for the nearest centroid, smaller for the second nearest, and so on. The Neural Gas algorithm implemented in this tool can be read about in the paper “Neural-Gas” Network for Vector Quantization and its Application to Time-Series Prediction by Thomas Martinetz et al.
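The rank-based weighting can be illustrated with a single update step for one observation. This is a rough sketch in the spirit of the cited paper, not the code the tool runs: the function name and the step_size and neighborhood parameters are hypothetical, and both are held fixed here, whereas the paper decreases them over the course of training.

```python
import math

def neural_gas_step(centroids, x, step_size=0.5, neighborhood=1.0):
    """Move every centroid toward observation x, weighted by its distance rank."""
    # Rank centroids by Euclidean distance to x: rank 0 is the nearest.
    order = sorted(range(len(centroids)), key=lambda j: math.dist(x, centroids[j]))
    updated = list(centroids)
    for rank, j in enumerate(order):
        # The weight shrinks exponentially as the rank grows.
        weight = step_size * math.exp(-rank / neighborhood)
        updated[j] = tuple(c + weight * (xi - c) for c, xi in zip(centroids[j], x))
    return updated

# Made-up centroids and a single observation.
centroids = [(0.0, 0.0), (10.0, 0.0)]
moved = neural_gas_step(centroids, (2.0, 0.0))
print(moved)
```

In this example the nearest centroid covers half of its distance to the observation, while the second-ranked centroid covers a smaller fraction of its own distance.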
Your algorithm selection will depend on your data and use case. For help getting started with selecting the appropriate algorithm, check out this chapter on Cluster Analysis from Introduction to Data Mining, or these two Stack Exchange threads from the Cross Validated forum, here and here.
The last two options in the configuration tab are the Number of clusters and the Number of starting seeds.
The number of clusters argument sets the number of target clusters that will be created. If you are hoping to create two groups from your data, you would set this value to 2, for three groups, 3, and so on. If you are not sure what number of clusters is appropriate for your data, consider using the K-Centroids Diagnostics Tool.
The number of starting seeds sets the number of repetitions (the nrep argument in the R function): the entire solution-building process is repeated the specified number of times, and only the best solution is kept. This argument is necessary because the clustering algorithms are initialized randomly. The first step is to randomly create points as initial centroids, and the final solution can be affected by where these initial points fall. Keeping the best of several starts makes clustering solutions more consistent. A higher number of starting seeds improves the chance that the best possible solution is found; however, higher values will increase the tool’s processing time.
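The effect of the starting points can be sketched with a small K-means routine that accepts its initial centroids explicitly, plus a wrapper that tries several random starts and keeps the one with the smallest sum of distances. Everything here is illustrative: the function names, the sample points, and the choice of the sum of point-to-centroid distances as the "best" criterion are assumptions, not the tool's internals.

```python
import math
import random

def k_means(points, centroids, max_iter=100):
    """Run k-means from the given starting centroids; return (centroids, total distance)."""
    k = len(centroids)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    total = sum(min(math.dist(p, c) for c in centroids) for p in points)
    return centroids, total

def best_of_starts(points, k, n_starts, seed=0):
    """Repeat the whole fit from n_starts random starts and keep the best run."""
    rng = random.Random(seed)
    runs = [k_means(points, rng.sample(points, k)) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[1])

# Made-up observations: two tight groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]

# Starting from two neighbouring points of the same group, this run settles
# on a split that mixes both groups and never recovers.
_, stuck_total = k_means(points, [(1.0, 1.0), (1.2, 0.8)])
_, best_total = best_of_starts(points, k=2, n_starts=10)
print(stuck_total, best_total)
```

Each extra start costs one more full fit, which is the processing-time trade-off in practice.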
The Plot Options tab of the tool allows you to choose whether to Plot Points, Plot Centroids, neither, or both. You can also constrain the number of dimensions displayed in your plot. By default, the highest number of dimensions to include in the biplots is set to 2. This is because it is difficult to visually display more than two dimensions in a flat plot.
The Graphic Options tab simply lets you configure the graphic components of the plot output with the R anchor. You can specify the size of your plot in inches or centimeters, as well as the Graph Resolution (in dots per inch) and the base font size (in points).
Once your tool’s configuration is set, you can run your workflow and see your outputs! There are two output anchors for the K-Centroids Cluster Analysis Tool, an O anchor and an R anchor. The O anchor, as with most predictive tools, is the Model Object. This can be used as an input to the Append Cluster Tool, which allows you to assign the groups to your actual data set.
In addition to the Model Object itself, this output includes the call (the R code used to generate the model), the formula, the model class (which should always be flexclust, the R package used), and information about each of the clusters, separated by pipes. You can think of rows 7-10 as a table, where row 7 contains the headers for each column and rows 8, 9, and 10 each describe one cluster; the number of rows will depend on the number of clusters created. Convergence describes how many iterations were run before the model began to produce consistent clusters. The Sum of Distances can be thought of as an overall model metric. Cluster Centers (rows 13-16) is another table, which describes the centroid values for each variable for each cluster. This information can all be parsed out and used as data with a combination of data preparation tools.
The R anchor is a report on the clustering solution.
Some of the information included in the O output is also included in the report and formatted so it is more digestible. The Cluster Information (5) and Cluster Centroids (7) are both in legible tables. The Sum of within-cluster distances and convergence, as well as the call, are all in the report. In addition to this information, the Report output also includes a plotted illustration.
Typically, the K-Centroids Cluster Analysis Tool will be used in conjunction with the K-Centroids Diagnostics Tool and the Append Cluster Tool. The K-Centroids Diagnostics Tool provides information to assist in determining how many clusters to specify, and the Append Cluster Tool functions like a Score Tool, attaching the assigned cluster number to each of your data points.
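Conceptually, attaching a cluster number to each data point amounts to finding the nearest centroid for every record. The sketch below shows that idea in Python, assuming Euclidean distance and 1-based cluster numbers. The function name and the sample values are hypothetical, and this is not the Append Cluster Tool's actual code.

```python
import math

def append_clusters(records, centroids):
    """Return each record with the 1-based number of its nearest centroid appended."""
    labeled = []
    for record in records:
        nearest = min(range(len(centroids)), key=lambda j: math.dist(record, centroids[j]))
        labeled.append(record + (nearest + 1,))
    return labeled

# Made-up centroids from a fitted solution, and new records to label.
centroids = [(1.0, 1.0), (5.0, 5.0)]
records = [(0.9, 1.1), (5.3, 4.7), (2.0, 2.0)]
print(append_clusters(records, centroids))
```

If the clustering fields were standardized before fitting, new records would need the same rescaling before distances are compared.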
Used in concert, these three tools can get you through any of your clustering needs!
By now, you should have expert-level proficiency with the K-Centroids Clustering Tool! If you can think of a use case we left out, feel free to use the comments section below! Consider yourself a Tool Master already? Let us know at email@example.com if you’d like your creative tool uses to be featured in the Tool Mastery Series.
Stay tuned with our latest posts every #ToolTuesday by following @alteryx on Twitter! If you want to master all the Designer tools, consider subscribing for email notifications.