I have a large data set that looks something like this:
ID | partner's ID |
G | |
H | P |
J | |
K | L |
L | K |
M | |
N | |
P | H |
etc... |
I want to split this dataset randomly to carry out an AB test for a mailing. However, I want to ensure that if someone has a partner, they and their partner end up in the same segment, e.g. that H and P receive the same version, and K and L receive the same version. One simple workaround would be to put all the records that have a partner into the same segment; but that wouldn't be random, and could impact the AB test results.
Any ideas how I could randomly split this data set into two, whilst still ensuring that the partnered records end up in the same sample?
In this case it sounds like you want each couple to be treated as a single entity (one ID instead of two). If you separate out the single IDs and then union them with the couple IDs, you can randomly sample from the combined list.
Sorting by Record ID and then by ID value, keeping only unique Record IDs, and filtering for only those that are "ID" leaves you with the first half of each couple (alphabetically).
I've attached my workflow.
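For anyone who can't open the attachment, the same idea can be sketched outside Alteryx. This is only a rough Python illustration, not the workflow itself: the function name `split_by_unit`, the `records` dictionary and the A/B labels are all made up for the example, and it assumes every partner link is listed in both directions, as in the sample data.

```python
import random

def split_by_unit(partner_of, seed=0):
    """Split IDs into groups A and B, treating each couple as one unit.

    partner_of maps each ID to its partner's ID, or None for singles
    (hypothetical input shape, chosen for this sketch).
    """
    # Build one unit per single and one per couple. A couple is keyed
    # by the alphabetically first of its two IDs.
    units = {}
    for pid, partner in partner_of.items():
        key = min(pid, partner) if partner else pid
        units.setdefault(key, []).append(pid)

    # Shuffle the units (not the individual records) and cut in half.
    keys = sorted(units)
    rng = random.Random(seed)
    rng.shuffle(keys)
    half = len(keys) // 2

    assignment = {}
    for i, key in enumerate(keys):
        group = "A" if i < half else "B"
        for pid in units[key]:
            assignment[pid] = group
    return assignment

# Sample data from the question.
records = {"G": None, "H": "P", "J": None, "K": "L",
           "L": "K", "M": None, "N": None, "P": "H"}
groups = split_by_unit(records, seed=42)
```

Because the split is done on units, the two groups get the same number of units, but the record counts can differ depending on how many couples land on each side.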
Not sure how you're randomizing, but if you're adding an indicator you could use the following:
This basically splits out those records with partners, then re-assigns both to be together in whichever sample the "top" partner is in. It should maintain the randomness while also ensuring that partners are together. You mentioned 2 groups, but if you want to split into more, just change the value in the Formula tool.
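In case it helps to see the logic written out, here is a rough Python equivalent of that approach. It is a sketch under assumptions rather than the workflow itself: the name `assign_with_partner_override` is invented, and the "top" partner is assumed to be the alphabetically first ID of the couple, which may not match how the workflow orders them.

```python
import random

def assign_with_partner_override(partner_of, n_groups=2, seed=0):
    """Give every record a random group, then align couples.

    partner_of maps each ID to its partner's ID, or None for singles
    (hypothetical input shape, chosen for this sketch).
    """
    rng = random.Random(seed)

    # Step 1: every record gets an independent random group number,
    # similar in spirit to a random-integer formula.
    group = {pid: rng.randrange(n_groups) for pid in partner_of}

    # Step 2: each partnered record takes the group of the couple's
    # "top" member (assumed here: the alphabetically first ID).
    for pid, partner in partner_of.items():
        if partner is not None:
            top = min(pid, partner)
            group[pid] = group[top]
    return group

# Sample data from the question.
records = {"G": None, "H": "P", "J": None, "K": "L",
           "L": "K", "M": None, "N": None, "P": "H"}
two_way = assign_with_partner_override(records, n_groups=2, seed=7)
three_way = assign_with_partner_override(records, n_groups=3, seed=7)
```

Changing `n_groups` plays the same role as changing the value in the Formula tool.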
Thank you all so much for the suggestions! I've had a look through each method to figure out what works best for my data.
JoshKushner - unfortunately I couldn't download yours as it said the file was missing, so I couldn't easily produce what you'd done - but it still helped me to get a sense of the key methods for this workflow.
jdunkerley79 - I liked your method as it was the simplest, however when I used it on my dataset I found that one sample had about 5,000 more records in, since the process takes 50% of the data, then adds in any partners of people who were in that sample, effectively boosting the numbers. I could counteract this by reducing the number of records drawn into the sample (e.g. 49%), but it still meant that one set had about twice as many couples in as the other, and I would prefer that both samples are even in this respect, if possible.
danrh - this was pretty much exactly what I was looking for. The only adjustment I made was to add a step at the start that uses the 'Random % Sample' tool to assign everyone to A or B initially, instead of using the Random Integer formula, because it means my final two samples come out at similar sizes, which is what I'd prefer. There is still slight variation from reassigning couples to the same segment, but the difference is fewer than 100 records, so not a big deal.
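For anyone reading later, the adjusted version looks roughly like this in Python. It is only an illustration of the idea, not the Alteryx workflow: `half_split_then_pair` is an invented name, an exact half split stands in for the 'Random % Sample' step, and the "top" partner is assumed to be the alphabetically first ID.

```python
import random
from collections import Counter

def half_split_then_pair(partner_of, seed=0):
    """Assign exactly half the records to A, then align couples.

    partner_of maps each ID to its partner's ID, or None for singles
    (hypothetical input shape, chosen for this sketch).
    """
    rng = random.Random(seed)

    # Step 1: shuffle all records and put the first half in A and the
    # rest in B, so the starting sizes are as equal as possible.
    ids = sorted(partner_of)
    rng.shuffle(ids)
    half = len(ids) // 2
    group = {pid: ("A" if i < half else "B") for i, pid in enumerate(ids)}

    # Step 2: move each partnered record into the group of the couple's
    # "top" member (assumed here: the alphabetically first ID).
    for pid, partner in partner_of.items():
        if partner is not None:
            group[pid] = group[min(pid, partner)]
    return group

# Sample data from the question.
records = {"G": None, "H": "P", "J": None, "K": "L",
           "L": "K", "M": None, "N": None, "P": "H"}
groups = half_split_then_pair(records, seed=1)

# Check how far the final sizes drifted from an even split.
sizes = Counter(groups.values())
```

The sizes start out equal and only drift when step 2 moves someone across, which is why the final difference stays small.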
Perfect - thanks so much for all your help!