Hey all, I'm trying to run a component similarity analysis between Finished Goods (FG) using Alteryx but unsure how to implement this.
Input - List of Finished Goods with their corresponding components
Expected Output - Matrix with similarity % for each finished good combination:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
My approach for the solution is as follows:
1. Build a single list of unique FGs with a concatenated list of corresponding components in each row:
2. Build a matrix with the input and the concat list and search for each component in the corresponding concatenated list for each FG combination. If found, enter 1 else enter 0:
3. Sum up the values for each combination (eg. FG1-FG1, FG1-FG2, FG1-FG3, and so on):
4. Build a new matrix which represents the count of distinct components for each FG combination (for e.g. FG1 has components 1 and 2, while FG2 has components 1,3,4. So the resulting FG combination FG1-FG2 will have 4 distinct components (1,2,3,4):
5. Divide the matrix obtained in step 3 by the matrix from step 4, thereby giving the similarity matrix
Please let me know if there is a way to implement this in Alteryx, or if there's a better way to get to this output.
Thanks in advance for all your help!
Solved! Go to Solution.
Thank you @FinnCharlton, that was really helpful and does the job!
However, is there a way to optimize the workflow? I have an input dataset that has over 100k rows (~36k unique FGs, and ~24k unique components), so the Cartesian Join (and subsequent joins) would consume a tremendous amount of memory / storage. It would be helpful if you could also recommend an alternative to this.
Thanks again!
Here is an optimised version. As the output is symmetrical (i.e. the combination of FG1/FG2 is the same as FG2/FG1) we only need to calculate a subset of combinations. We also don't need to calculate the combination of an FG with itself, as this will always be 100%. This new workflow only generates the combinations we need.
If you compare the output of this workflow to the one above, you will see that these values are the only important ones - everything else can be inferred.
I think the nature of this analysis means it will always be computationally intensive, but this should help a bit.
Thanks for the quick reply on this! The logic does make absolute sense, but unfortunately the FG and component names are strings and not necessarily sequential either (e.g. "FG-XYZ-0A2", "FG-2ABC",etc.), so numerical comparison in the loop will not work.
Is there still an alternative to this solution, by any chance?
I would solve this by assigning a numerical ID to each FG Name and usinng that for the workflow. If you need some help attach some sample data and I will show you what I mean.