Alteryx Designer Desktop Discussions

SOLVED

Sampling with replacement

KaiLarsen
9 - Comet

Hi! I have a rather unbalanced dataset (the minority class is <1%) that I'd like to balance by sampling with replacement from the minority population. Any ideas appreciated.

 

Kai :-)

MarqueeCrew
20 - Arcturus

Have you considered using the Oversample Field tool? http://help.alteryx.com/10.5/index.htm#cshid=Oversample_Field.htm

 


DylanB
Alteryx Alumni (Retired)

Hi Kai,

 

The natural thing to do would be to sample from a binomial distribution for each record and redefine the parameters of the distribution based on how many samples were made in the previous record. Unfortunately, this relies on a factorial or combination function (for the draws out of the pmf of the binomial distribution), and Alteryx does not have such a function available. 

 

To avoid this problem, I used R and made a macro that would sample with replacement. Please let me know if this works for you. 

 

 

However, as @MarqueeCrew suggested, the oversample tool is our standard tool for balancing the dataset. If that's your only goal with the replacement sampling, I'd generally recommend using it.
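For readers who cannot open the macro, the underlying idea (keep the majority records and draw minority records with replacement) can be sketched in Python. This is not the R macro from this post; the function name `oversample_minority`, its parameters, and the toy data are hypothetical and only illustrate the technique.

```python
import random

def oversample_minority(records, target_field, minority_value, n_draws, seed=None):
    """Return all majority records plus n_draws minority records drawn with replacement.

    Hypothetical helper for illustration; not the macro discussed in this thread.
    """
    rng = random.Random(seed)
    minority = [r for r in records if r[target_field] == minority_value]
    majority = [r for r in records if r[target_field] != minority_value]
    if not minority:
        raise ValueError("no records carry the minority value")
    # random.choices samples WITH replacement, so the same record can repeat.
    drawn = rng.choices(minority, k=n_draws)
    return majority + drawn

# Toy data (made up): 100 records, 3 of which belong to the minority class.
records = [{"id": i, "TargetVariable": 1 if i < 3 else 0} for i in range(100)]

# Draw as many minority records as there are majority records (97) for a 50-50 split.
balanced = oversample_minority(records, "TargetVariable", 1, n_draws=97, seed=0)
```

Because the draw is with replacement, `n_draws` can exceed the number of minority records, which is what makes balancing a <1% class possible.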

KaiLarsen
9 - Comet

Thanks Dylan,

 

I did end up using your macro after the standard tool, for some strange reason, did not want to cooperate. Since I think the one parameter refers to how many records to draw, I edited the code to allow a maximum of 1 million records rather than 100. I also filtered out the majority-class records first, as I wasn't sure whether the macro was set up to detect the minority class and sample only from it. Either way, I got it to work, so I appreciate it!

RobesMaGobes
5 - Atom

I didn't find the Oversample tool to be the solution. It didn't behave as I expected. I had 10000 records, which I passed through the tool with these parameters:

 

Select the field you want to base the oversampling on: TargetVariable

The field value you wish to oversample: 1

The percentage of records that should have the desired value in the field of interest: 50

 

The "1" class made up about 3% of the data, so I expected to get back around 9700*2 = 19400 records. Instead I got back only a handful (~600) of records, although they were split 50-50. It did the opposite of oversampling: it undersampled the "0" class.
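The two record counts above follow from simple arithmetic, sketched below in Python. The function names are hypothetical, and the 300 minority records are an assumption derived from "about 3%" of 10000. The sketch only reproduces the numbers in this post and says nothing about how the tool is implemented.

```python
def size_if_oversampling(n_total, n_minority, target_pct):
    """Output size when every majority record is kept and the minority is grown."""
    n_majority = n_total - n_minority
    # Minority records needed so that they make up target_pct of the output.
    needed_minority = round(n_majority * target_pct / (100 - target_pct))
    return n_majority + needed_minority

def size_if_undersampling(n_total, n_minority, target_pct):
    """Output size when every minority record is kept and the majority is cut."""
    # Majority records allowed so that the minority makes up target_pct.
    kept_majority = round(n_minority * (100 - target_pct) / target_pct)
    return n_minority + kept_majority

# Assumed figures: 10000 records, about 3% (300) in the "1" class, 50% target.
expected = size_if_oversampling(10000, 300, 50)   # 9700 + 9700
observed = size_if_undersampling(10000, 300, 50)  # 300 + 300
```

With these inputs `expected` is 19400 and `observed` is 600, matching the record count I expected and the one the tool actually returned.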

 
