Hi! I have a rather unbalanced dataset (the minority class is <1% of records) that I'd like to balance by sampling with replacement from the minority population. Any ideas appreciated.
Kai :-)
Have you considered using the Oversample Field tool? http://help.alteryx.com/10.5/index.htm#cshid=Oversample_Field.htm
Hi Kai,
The natural thing to do would be to sample from a binomial distribution for each record and redefine the parameters of the distribution based on how many samples were made in the previous record. Unfortunately, this relies on a factorial or combination function (for the draws out of the pmf of the binomial distribution), and Alteryx does not have such a function available.
To avoid this problem, I used R and made a macro that would sample with replacement. Please let me know if this works for you.
However, as @MarqueeCrew suggested, the oversample tool is our standard tool for balancing the dataset. If that's your only goal with the replacement sampling, I'd generally recommend using it.
Thanks Dylan,
I did end up using your macro after the standard tool for some strange reason did not want to cooperate. Since I think the one parameter referred to how many records one wanted to draw, I edited the code to allow a max number of records of 1 million rather than 100. I also filtered out the majority class records first as I wasn't sure it was set up to detect the minority class and sample only from that? Either way, I got it to work, so appreciate it!
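For anyone who wants to try the same idea outside Alteryx, here is a rough Python sketch of what is described above: keep every original record and draw extra minority records with replacement until the minority class reaches the share you want. This is not the code inside Dylan's macro (that was written in R and isn't shown in this thread); the function name `oversample_minority` and its parameters are made up for illustration, and it assumes each record is a dict with a label field.

```python
import random


def oversample_minority(records, label_key, minority_value,
                        target_fraction=0.5, seed=None):
    """Return the original records plus minority records drawn with
    replacement, so the minority class makes up about target_fraction.

    Hypothetical helper for illustration; not part of any Alteryx tool.
    """
    if not 0 < target_fraction < 1:
        raise ValueError("target_fraction must be strictly between 0 and 1")

    rng = random.Random(seed)
    minority = [r for r in records if r[label_key] == minority_value]
    majority = [r for r in records if r[label_key] != minority_value]
    if not minority:
        raise ValueError("no records with the minority value were found")

    # Minority count m that satisfies m / (m + len(majority)) == target_fraction.
    needed = round(target_fraction * len(majority) / (1 - target_fraction))
    extra = max(0, needed - len(minority))

    # random.choices samples WITH replacement, so the same record can repeat.
    resampled = rng.choices(minority, k=extra)
    return majority + minority + resampled


if __name__ == "__main__":
    data = [{"TargetVariable": 0}] * 9700 + [{"TargetVariable": 1}] * 300
    balanced = oversample_minority(data, "TargetVariable", 1, seed=42)
    print(len(balanced))
```

Because the function separates the two classes itself, there is no need to filter out the majority class beforehand. If the data will be split into training and validation sets, it is usually safer to split first and oversample only the training part, so duplicated records do not end up in both.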
I didn't find the Oversample tool to be the solution. It didn't behave as I expected. I had 10000 records, which I passed through the tool with these parameters:
Select the field you want to base the oversampling on: TargetVariable
The field value you wish to oversample: 1
The percentage of records that should have the desired value in the field of interest: 50
The "1" class made up about 3% of the data, so I expected to get back around 9700*2 = 19400 records. Instead I got back only a handful (~600) of records, although they were a 50-50 split. It did the opposite of oversampling: it undersampled the "0" class.
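The two record counts in the post above can be checked with a little arithmetic. The sketch below assumes the 10000 records split into roughly 300 "1" records and 9700 "0" records (about 3%), and compares the output size when the minority class is grown against the size when the majority class is cut down. The function name `balanced_sizes` is hypothetical and says nothing about how the Alteryx tool is implemented internally.

```python
def balanced_sizes(n_minority, n_majority, target_fraction=0.5):
    """Total record counts after balancing to target_fraction, two ways.

    Hypothetical helper for illustration only.
    """
    # Oversampling: keep every majority record, grow the minority class.
    over_minority = round(target_fraction * n_majority / (1 - target_fraction))
    # Undersampling: keep every minority record, shrink the majority class.
    under_majority = round(n_minority * (1 - target_fraction) / target_fraction)
    return {
        "oversample_total": n_majority + over_minority,
        "undersample_total": n_minority + under_majority,
    }


sizes = balanced_sizes(n_minority=300, n_majority=9700)
print(sizes)  # {'oversample_total': 19400, 'undersample_total': 600}
```

The ~600 records reported above match the undersampling figure, which fits the observation that the "0" class was reduced rather than the "1" class being duplicated.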