Data Science

SusanCS · ‎04-22-2020

It’s never too early to start learning about data science! We’d planned to celebrate Take Your Kids to Work Day with this fun new video, but right now, you may be working alongside your kids every day. Maybe it’s time to share some data expertise with them?

While we’re all at home, we ~~took advantage of free labor~~ decided to use kids’ wisdom and skills to explain the key data science technique of clustering. Posy, daughter of Alteryx creative director @TaraM, is our star. Watch the short video below and then read on to see how “real, grown-up” data scientists would use clustering!

So what’s the data story here? Each step in Posy’s laundry journey corresponds to a part of the clustering process. Let’s watch it unfold (pun intended). Explanations are under each of the spoiler tags.

Posy is faced with a giant pile of unsorted, clean laundry, fresh out of the dryer, ready for folding and sorting into smaller piles of similar items.

With no parental guidance immediately available, Posy has to make sense of this mountain of clothes herself! She looks for similar characteristics among the items: Does it fit on her foot? Over her head? Eventually she figures out which things get put together: shirts, socks, and pants each in their own piles.

Spoiler

Posy’s task, in data terms, is to take the “dataset” of the laundry items and figure out which of them are similar enough to be grouped together. The different items don’t have their names written on them -- the shirts don’t say “shirt,” and no one is telling Posy, “Put the items with two sleeves and a collar in this pile.” This is an unsupervised learning task. The names for the groups of items aren’t provided, and there’s no specific target variable or outcome, just the task of making groups of like items. Posy looks for ways to group these unlabeled items based on her observations about their shape and fit, without prior input from anyone as to how they should be grouped.

A parallel real-world example is customer segmentation; you might be interested in seeing how your customers break out into groups with similar characteristics and buying habits, but you might not necessarily know in advance how to define those groups. In other words, you’re not sure what makes one group different from another. You also certainly don’t want to scrutinize their data and try to manually sort them into groups.

You’ll ask your clustering algorithm of choice (more on that later) to identify the key characteristics that differentiate the customer groups. Maybe two key characteristics defining distinct types of customers turn out to be the frequency of shopping visits and the average purchase amount. Having identified those characteristics, the algorithm can then label each customer as belonging to one of three main groups; say, group 1 (“infrequent shoppers, big spenders”); group 2 (“frequent shoppers, big spenders”); and group 3 (“frequent shoppers, small purchases”). Like Posy figuring out important clothing characteristics in order to make her laundry piles, the clustering algorithm analyzes and groups your customers according to their defining characteristics.

Clustering has many other uses, too. Those individual groups could be worthy of detailed analysis in themselves, so this process could be a way of subsetting your data (in the laundry analogy, we could just take a closer look at all the shirts, once we know they are shirts and should get grouped together). Clustering can be used for a type of semi-supervised learning, which I’ll explain in the last part of this piece. This family of algorithms is used for search engines and image segmentation, too. And finally, clustering can also be used for anomaly/outlier detection, as we’ll soon address with those cute footie pajamas.


But then -- as Posy sorts the laundry, some items that don’t quite fit are left behind...what are leggings, anyway? Really long conjoined socks, or tight pants with feet? And the footie pajamas? They’re all the things -- shirt, pants and socks in one. So those items end up stacked on their own for more attention later.

Spoiler

Generally, when we use clustering techniques, we hope to find clearly defined, well differentiated groups in which the members are similar to each other and also notably different from members of the other groups.

We made Posy’s task a little challenging! Clusters in your data are not always clearly defined. The leggings and the footie pajamas are meant to represent outliers that may not easily fit into one of the clusters; rather, they might fall between two well defined groups. These outliers may demand more attention from the analyst later.
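One simple way to surface items like the leggings and footie pajamas is to measure how far each item sits from the center of its nearest cluster and flag anything beyond a cutoff. The sketch below is only an illustration: the two features (length and width), the cluster centers, the item measurements, and the `max_distance` cutoff are all made-up assumptions, not values from Posy's workflow or from Designer.

```python
import math

# Hypothetical cluster centers for three laundry piles, described by two
# made-up features: (length in cm, width in cm).
centroids = {
    "socks": (20.0, 8.0),
    "shirts": (45.0, 35.0),
    "pants": (60.0, 25.0),
}

# Made-up measurements; the last two stand in for the hard-to-place items.
items = {
    "ankle sock": (18.0, 7.0),
    "t-shirt": (44.0, 36.0),
    "jeans": (62.0, 24.0),
    "leggings": (40.0, 15.0),
    "footie pajamas": (75.0, 40.0),
}


def nearest_centroid(point, centroids):
    """Return (label, distance) for the cluster center closest to `point`."""
    label = min(centroids, key=lambda name: math.dist(point, centroids[name]))
    return label, math.dist(point, centroids[label])


def flag_outliers(items, centroids, max_distance):
    """Return the names of items farther than `max_distance` from every center."""
    return [
        name
        for name, point in items.items()
        if nearest_centroid(point, centroids)[1] > max_distance
    ]


# The cutoff of 10.0 is an arbitrary assumption chosen for this toy data.
outliers = flag_outliers(items, centroids, max_distance=10.0)
print(outliers)
```

In practice the cutoff would need to be tuned to the data, for example by looking at the distribution of distances within each cluster.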

Additionally, it’s important to think about the algorithm you use for your particular dataset. Some algorithms are better than others at dealing with messy data -- like outliers -- and can still produce useful clusters. One algorithm is k-means clustering (available in the K-Centroids Cluster Analysis Tool in Designer), in which k stands for the number of clusters you tell the algorithm to look for. As you can imagine, choosing the right number for k can be a little tricky and definitely affects your results. The K-Centroids Diagnostic Tool in Designer can help you determine a good starting number for k. Additionally, your exploratory data analysis and your domain knowledge may suggest a number for you.

This method randomly selects k points, called centroids, and forms clusters by assigning each data point to the centroid closest to it. Then it calculates the actual centroids, or central points, of those clusters, using the means of the data points assigned to each cluster. The process then repeats with the updated centroids to find better-fitting clusters for all the data points. The algorithm iterates until either it meets some stopping criterion the user has set, or until the clusters don’t change anymore. Usually this process is run multiple times with different randomly chosen centroids as the starting point each time. Another method, k-medians, uses the medians of the data points in each cluster to determine the centroids; you can choose either of these techniques in Designer, as well as one more approach, Neural Gas, which weights the data points in calculating the means for each cluster based on which points are closest to the centroids. Check out this Tool Mastery article for more detail on how all of these work.

In the laundry analogy, the equivalent might be taking some random items from the unsorted laundry mountain to serve as the foundations for a few piles, seeing if the next items you pull from the mountain resemble those foundations, and placing the items where they seem to best fit. You’d repeat that process until you see which items truly are the best representatives around which to sort the clothes.
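The k-means loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation behind Designer's K-Centroids tools: the function names, the `initial_centroids` and `seed` parameters, the number of restarts, and the toy data are all hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, initial_centroids=None, seed=None, max_iterations=100):
    """Return (centroids, assignments) after iterating assign/update steps."""
    if initial_centroids is not None:
        centroids = list(initial_centroids)
    else:
        # Start from k randomly chosen data points.
        centroids = random.Random(seed).sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its closest centroid.
        new_assignments = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # clusters stopped changing
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = mean_point(members)
    return centroids, assignments


def inertia(points, centroids, assignments):
    """Total squared distance from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))


def best_of_restarts(points, k, restarts=10):
    """Run k-means from several random starts and keep the tightest result."""
    runs = [k_means(points, k, seed=s) for s in range(restarts)]
    return min(runs, key=lambda run: inertia(points, *run))


# Made-up 2-D data forming three loose groups.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]
centroids, assignments = best_of_restarts(points, k=3)
print(assignments)
```

The restarts matter because a single unlucky random start can settle on a poor grouping; keeping the run with the smallest total squared distance is one common way to choose among them.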


After contending with the challenging footie pajamas and leggings, Posy successfully sorts the clothes, making the laundry heap into sorted piles with identifiable items in each. She has conquered the laundry mountain!

Spoiler

Posy has clustered the laundry! She now knows which pile best fits each of the items from the laundry mountain and has set up “clusters” that fit each one. Likewise, in your clustering analysis, you’ve now got data points assigned to their best-fitting clusters. You can add their cluster assignments to the dataset with the Append Cluster Tool. Adding that information to each row of your dataset lets you then filter your data, explore each cluster, and try to identify what makes each cluster distinct from the others.
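Outside of Designer, the same idea of appending cluster assignments and then exploring each cluster might look like the sketch below. It is only an illustration of the concept, not how the Append Cluster Tool works internally; the customer rows, the field names, and the cluster labels are made-up assumptions that mirror the three customer groups from the earlier example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical customer rows: visits per month and average purchase in dollars.
customers = [
    {"id": "A", "visits": 1, "avg_purchase": 220.0},
    {"id": "B", "visits": 2, "avg_purchase": 180.0},
    {"id": "C", "visits": 9, "avg_purchase": 210.0},
    {"id": "D", "visits": 11, "avg_purchase": 190.0},
    {"id": "E", "visits": 10, "avg_purchase": 15.0},
    {"id": "F", "visits": 12, "avg_purchase": 25.0},
]

# Assumed output of a clustering step: one cluster label per row, in row order.
cluster_labels = [0, 0, 1, 1, 2, 2]


def append_clusters(rows, labels):
    """Return copies of the rows with a 'cluster' field added."""
    if len(rows) != len(labels):
        raise ValueError("need exactly one label per row")
    return [{**row, "cluster": label} for row, label in zip(rows, labels)]


def summarize(rows, fields):
    """Count the rows and average each numeric field within each cluster."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["cluster"]].append(row)
    return {
        cluster: {
            "count": len(members),
            **{field: mean(r[field] for r in members) for field in fields},
        }
        for cluster, members in sorted(grouped.items())
    }


labeled = append_clusters(customers, cluster_labels)
summary = summarize(labeled, ["visits", "avg_purchase"])

# Filter down to a single cluster for a closer look.
frequent_big_spenders = [row["id"] for row in labeled if row["cluster"] == 1]

for cluster, stats in summary.items():
    print(cluster, stats)
```

Comparing the per-cluster averages is one quick way to see what sets each group apart before digging into individual rows.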

Clustering is most often used for exploratory data analysis and for gaining deeper understanding of datasets in which groupings are unknown and hard for humans to identify. Clustering can reveal structure in your data that was invisible without this powerful tool -- like being able to look at Posy’s laundry mountain and quickly know how it should be sorted and how many things will be in each pile.

Of course, Posy’s stacks of shirts and pants are a little more intelligible than clusters may be. The results of clustering can sometimes be hard to interpret: Why did certain people or items get grouped together, especially if there were a lot of variables that shaped those clusters? Not every cluster may be meaningful or useful. But you are likely to find new insights that can enhance your exploratory data analysis and suggest new, informed questions to investigate.

A limitation of k-means is that the boundaries between the clusters have to be linear, i.e., straight lines. What if your clusters take all kinds of shapes? While k-means is powerful and often a great starting point, other clustering algorithms can offer better results in some situations. You might like to learn more about hierarchical clustering and DBSCAN/HDBSCAN, which can offer better clustering results in the case of unusually shaped clusters and/or a large number of outliers. These two posts explain more about using these methods in Designer.
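To give a feel for how a density-based method handles shapes that straight-line boundaries cannot separate, here is a bare-bones sketch of the basic DBSCAN idea in plain Python. It is a teaching sketch under assumptions, not the implementation used in Designer or in any library, and it does not cover HDBSCAN; the ring-and-blob data, `eps`, and `min_points` values are made up for illustration.

```python
import math

NOISE = -1


def region_query(points, index, eps):
    """Indices of all points within `eps` of points[index], itself included."""
    return [j for j, q in enumerate(points) if math.dist(points[index], q) <= eps]


def dbscan(points, eps, min_points):
    """Label each point with a cluster number, or NOISE if it joins no cluster."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_points:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a dense point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # dense point: keep growing the cluster
    return labels


# Made-up data: a ring of 24 points, a small blob inside it, and one stray point.
ring = [
    (5 * math.cos(2 * math.pi * i / 24), 5 * math.sin(2 * math.pi * i / 24))
    for i in range(24)
]
blob = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (-0.3, 0.0), (0.0, -0.3)]
stray = [(12.0, 12.0)]

labels = dbscan(ring + blob + stray, eps=1.5, min_points=3)
print(labels)
```

With this toy data the ring and the blob come out as separate clusters and the stray point is marked as noise. Both clusters are centered on the same spot, so nearest-centroid assignment could not tell them apart.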


What’s extra cool about what Posy has gone through in this process is that she has learned how to label the unsorted laundry. Therefore, when she is given another pile of unsorted laundry later -- that laundry just keeps on coming! -- she can independently handle it and predict which pile each item should be placed in, without having to ask anyone for help.

Spoiler

This last step in Posy’s laundry-sorting journey represents an optional use of clustering for prediction, if a successful model for clustering existing data can be built. One possible application for your clustering results is to turn them into a classification model for new data. Once the data points in your dataset have each received a cluster label reflecting where they were assigned, you essentially have a training set of labeled data that can be used with a supervised learning method to classify new data -- just like Posy, independently sorting laundry now that she has learned what goes in each pile. This method is supervised since you are providing the algorithm with labeled data that it can learn from in order to apply that knowledge to new, unlabeled data. For example, you could feed your analyzed customer data with its assigned cluster labels, plus unlabeled new data in need of classification, to a classification Decision Tree and generate labels for the new data. This process is demonstrated in the attached laundry-themed workflow.

This semi-supervised approach, sometimes called “cluster and label,” can be useful when getting a large quantity of labeled data would be too difficult or expensive. You may want to choose only the data points that are the best representatives of their clusters, though, for use in training your classification algorithm for the unlabeled data. These are the data points closest to the centroids of each cluster, not those closer to the boundaries among your clusters that are more likely to be mislabeled. To extend the laundry analogy, you want to think about using the jeans that obviously belong in the pile of pants as your representative examples, not the leggings that are somewhere in between pants and socks.
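The "cluster and label" idea, including the step of keeping only the most representative points, can be sketched as follows. This is a hedged illustration rather than the attached workflow: it swaps the Decision Tree mentioned above for a much simpler 1-nearest-neighbor rule, and the features, measurements, cluster labels, and `keep_fraction` value are all made-up assumptions.

```python
import math


def representative_points(points, labels, keep_fraction=0.5):
    """Per cluster, keep the fraction of points closest to the cluster mean."""
    training = []
    for cluster in sorted(set(labels)):
        members = [p for p, label in zip(points, labels) if label == cluster]
        center = tuple(sum(coords) / len(members) for coords in zip(*members))
        members.sort(key=lambda p: math.dist(p, center))
        keep = max(1, math.ceil(len(members) * keep_fraction))
        training.extend((p, cluster) for p in members[:keep])
    return training


def classify(new_point, training):
    """1-nearest-neighbor stand-in: copy the label of the closest training point."""
    _, label = min(training, key=lambda pair: math.dist(new_point, pair[0]))
    return label


# Made-up laundry measurements (length cm, width cm) with cluster labels
# assumed to come from an earlier clustering step: 0 socks, 1 shirts, 2 pants.
points = [
    (18.0, 7.0), (21.0, 8.0), (20.0, 9.0), (30.0, 12.0),
    (44.0, 36.0), (46.0, 34.0), (45.0, 35.0), (50.0, 30.0),
    (62.0, 24.0), (59.0, 26.0), (60.0, 25.0), (52.0, 22.0),
]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

training = representative_points(points, labels, keep_fraction=0.5)

# A new, unlabeled load of laundry.
new_items = [(19.0, 8.0), (47.0, 33.0), (61.0, 25.0)]
predicted = [classify(item, training) for item in new_items]
print(predicted)
```

Dropping the points farthest from each cluster's center, such as the (30.0, 12.0) item here, keeps borderline cases out of the training set, which is the same reasoning as using the jeans rather than the leggings as examples.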


And at last, Posy is done with laundry, plus she’s learned all about clustering. Chores plus education -- a win for the whole family.

Try out Posy’s workflow -- in Alteryx form -- and maybe use it to teach a young person in your house about data science!


Kids Explain Data Science, Episode 1: Clustering the Laundry