Hi, everyone.
I am currently trying to learn macro and would like to apply on this case. I have a sample dataset below where I would like to know the combinations of RecordID that when its Data are combined, it will contain all the numbers from 1 to 28. (the field "Data" contains random numbers from 1-28)
Example of possible combinations: RecordIDs 7-8-9 or RecordIDs 7-8-10
I am interested in the minimum number of rows.
RecordID | Data |
1 | 5,6,11,12,18,26,27,28 |
2 | 18,27,28 |
3 | 9,10,19,20 |
4 | 1,2,7,9,10,14,15,16,17,19,20 |
5 | 1,2,4,7,8,14,15,16,17 |
6 | 9,10,11,12,19,20,21,22,23,24,25 |
7 | 1,3,4,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,26,28 |
8 | 2,3,4,7,8,13,14,15,16,17,18,24,27 |
9 | 5,6,11,12,21,22,23,25,26 |
10 | 3,5,6,11,12,13,18,22,23,24,25,26,27,28 |
11 | 3,18,26,27,28 |
12 | 5,6,11,12,25 |
Any help is very much appreciated. Thank you so much.
Solved! Go to Solution.
You need to create an algorithm to find the minimum number of rows to have all numbers (1 to 28) or do you need to check within all available all records and return all possible combination matches?
And how large is the final dataset? Just to see what kind of solutions we can think of.
Adding more questions to the below because this is an interesting problem.
Do you want all combinations or the optimal (minimum # of rows) one? Or just one as fast as possible? Likely different versions of the solution to each.
Also, do you want redundant combinations included? In the example, if 7-8-9 is complete, so is 6-7-8-9, but that doesn't feel like an answer you would want in most versions of the problem.
Thank you. I am actually after of the minimum number of rows. Sorry I forgot to mention.
A moderately optimized solution. Finds an arbitrary solution of minimal length.
If you're learning these rather than trying to solve a problem, suggest looking at what happens if you take out the logic (the count records and append part of the loop output) that says stop when you find a solution. You should see the run time jump by about a factor of 30, which is part of why we were interested in the potential data set size.
If your data is big or you like to go fast for fast's sake the next optimizations are pruning redundant combinations during the run. If you take out the select first record, you'll see that this solution finds 10,4,7, 10,7,4 etc. We ought to be able to throw out combinations that have the same set of original rows.
Happy to talk about any of the moving parts if you want.
Hi @JovanieYonson
Fun problem, and a chance to use the optimisation tool!
As always with the tool, getting the data structure correct is the main challenge. If you need more info on the tool then check out this blog by @PhilipMannering or the tool mastery blog
Hope that helps,
Ollie
Kudos to @ScottLewis and @OllieClarke ! Loved both solutions, that is amazing!
@JovanieYonson , if you are facing issues running Ollie's workflow, you might need to download and install R Predictive tools on your Alteryx.
@OllieClarke, I was very interested in your solution and I read more about the optimization tool.
I changed the coefficient to the frequency of numbers each variable have, before it was all ones now it ranges from 3 to 23.
With this change the workflow now runs slightly faster (maybe not significant) and the output are based on variables that has less duplications and still have all 28 unique values. Which might scale better and return a more optimized set in the end.
Identified set 5-6-10.
I am still learning about this statistical method, does that makes sense to you?
@ScottLewis Thank you so much for your solution. It is very helpful. :)
@OllieClarke Thank you so much for introducing this approach. Honestly I am new to this tool. But I'm sure this is great solution. :)