This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
I was hoping to pick your brains for an interesting problem I am working on. Unfortunately, I cannot provide data at this point owing to my NDA but will work on sharing a scrubbed version subsequently. In the meanwhile, I hope to define my problem in as much detail as I can:
I want to understand the various products that my client currently offers and identify the high performing from the low performing products based on different metrics that I have identified
For each product we have,
1. Number of Accounts
2. Nubmer of orders
3. Quantity ordered per month
4. Revenue per month
5. Average order duration
I did a basic K-Centroids clustering on this data (about 2000 records). Since its not a normally distributed data I used K-Medians with Standatdization and was able to get 4 clusters which look good.
That being said, this has been done across all products which is not very accurate given that there are different product types and comparing products across product types is like an apples to oranges comparision
To solve this, I have grouped the products by their product types and want to cluster each product type independently. I can do this manually but was hoping to use a macro to help me out.
With this approach, I created a batch macro with the K-Centroid clustering embedded in it. The plan is to run this batch macro once for each product type, which will be my Control-GroupBy Field. I figured I dont need a separate control parameter and a simple batch job would suffice.
The main issue is that the K-Centroid tool requires atleast two input fields. For my macro input I first provided a text input that was a template for my datafields viz. product code and the numeric metrics mentioned above and the group by field. That did not work as I could not see those fields in the K-Centroids tool.
After that, I changed my input type to a file input where I loaded whole data set, still the same issue..
Is there a way to create this batch macro without having to edit the K-Centroids macro itself? I understand that could be another way but I am weary of messing around with that.
Ha... so the issues was that my input columns were being interpreted as text... Put in a select tool to change them to int/float and was able to create the batch macro... am working on running the batch.
Since it sounds like your current data was a relatively small set, I was going to suggest just placing the Input Data tool within the macro and then having a Control Parameter Action on a Filter that filters through each product group in your batch. It's not the most efficient way of doing it, but for a small data set, it shouldn't be an issue.
Just a thought on the overall process...
This process will now assume that 4 clusters are actually the best segmentation for each product group. This may not be the case for each product group and you may have some misrepresentation within each of the "within product group" clusters. It's always difficult without knowing the full details of the business problem, but just thought I'd mentioned that as a consideration for doing a batch clustering process.
Absolutely!. That is something I am mulling over... some of the product types have <50 records so the 4 clusters may not be the best method.
I am thinking that one way around that would be to reduce the number of clustering inputs based on the number of records in the data set. Will use likea detour to do that (never used it before so lets see how that goes)
For instance, if I have less than 50 records, I would only pick the number of orders and the revenue, effectively using the K-Centroids to do a 2X2 (revXorders) assuming that is how it would behave. If greater than 50 I shall keep my current settings.
Or you could run the batch process with another control parameter which would be the number of clusters.
Then you could run a clustering analysis on a product group with either 3 or 4 (or other combinations of) clusters, and then run an analysis of the standard deviation (for example) by cluster of the metrics after the assignment of clusters, for each product group and compare the "accuracy" of the clusters that way. I've not done that process in the past, but never had Alteryx as a tool when I was consulting where I could do that sort of batch analysis. ;-) I'm thinking that is would be a valid approach.