Hi everyone!
I'm struggling figuring out how to sample one item every million rows.
Let's suppose we have a population of 50 million rows.
I have to sample 1 item for the first million rows, 1 item for the second million rows, 1 item for the third, and so on...
Do you have any advice on how to do this?
I guess a Macro should help, but I'm not familiar with these functionalities/tools yet.
Thank you for your support!
Solved! Go to Solution.
Hey @AlbertoRefra, here's one way you could go about it - adding a RecordID and then checking that if we divide this by 1 million then the result is an integer, or the RecordID is 1 for the first million:
If this needs to be random then it would probably need a macro/more intricate process but if it can just be the first of every million then this ought to do.
Quick edit:
You could also use the Sample tool which the '1 of every N rows' set to 1,000,000 but this depends on whether you want it inclusive of the 50 millionth row or not in your case:
Hi @DataNath! Thanks for your answer!
The fact is that the sample needs to be random, so that's the part I'm struggling with.
I managed to number every million rows with the "Tile" tool, so I can identify the first million rows with "1", the second million rows with "2" and so on...
Now, I don't know how to build a macro to sample one item out of each "group of data" (doing it manually would be inefficient since I will need to run it periodically)
Ok thanks for clarifying @AlbertoRefra - here's a couple of additions you could implement to make it random. We can just use the Rand() function to generate a random float value for each row. From then, we can sort on these random values within the N million that they belong to. After that, we can take the first value from every N million which has previously been randomised:
That's definitely an interesting solution, but I wonder if these logics make the sample repeatable running again the same workflow with the same input data.
Since the sampling is for an audit activity, this is something to be taken into account.
I guess that the Rand() function gives everytime a different value to each row, so probably the results would be always changing. Am I right?
In this case, another kind of solution should be found 🤔
Ah, that does add another level to things then @AlbertoRefra. From what you describe, I can't think of a method that is random for only the first run of a workflow - you're right in the assumption that Rand() will generate a different value each run. For such a request, I'd imagine we have to break this into 2 parts. 1) Use my process above to get a random 50/51 RecordIDs from the data and then print this out so they can be re-used for audit, then 2) Use this reference file to conduct an inner join on the original data and pick out these RecordIDs. Placing part 1 in a container means after the first run, you can disable this and the output won't be overwritten, meaning the same records are always picked out. If you need to generate a new list, simply re-enable the container:
Unfortunately I think that the copy/paste of the RecordIDs and the disabling of the container could be called into question by a possible external audit, since there is a manual intervention in the process.
Do you think that a sort of "iterative macro" that:
- processes each group of "1 million rows" (identified by ascending integers as 1, 2, 3...)
- samples 1 item for each of them
- appends the "50" sampled items in a unique output
could solve this?
@AlbertoRefra using a macro (which in this case would likely be a batch), would just give you a different way of doing what I have above. I think the crux here is how we 'sample' - we can easily isolate each million rows at a time, but if you need the one row from each to be random, but only for a single run, I can't think of how you'd do this off the top of my head. The only method(s) I can think of involve some slight manual intervention like I've outlined above. Unless you were to configure the output so that the file couldn't be overwritten, in which case you could just continuously keep writing the same file and then conducting the join from when you re-read it in.
Let me try to explain what I have in my mind (but I don't know how to implement)...
1. In the first workflow I import and have a full population with each "1 million rows" identified by a integer (1st million rows will have "1" as identifier, 2nd million rows will have "2", etc...)
2. A macro should run "N" times, from 1st million rows with identifier "1" to the last million rows with identifier "N".
Each run would sample one by one all the "1 million rows" with the Random % Sample tool (1 item, with a Random seed that guarantees the replicability of the sampling).
3. The output of the macro (going back to the first workflow) should be a table/file where to append the item sampled for each of the N runs
Do you think this is feasible?
@AlbertoRefra, It perfectly achievable. You need a batch macro inside your workflow.
See the workflow attached. Hope it helps.