I am using the oversampling tool to tackle my imbalanced dataset and have a few questions.
Shouldn't this element be called undersampling, since it is the majority class members that are being deleted? I tried modelling it the other way around to see whether minority class members would be created artificially to bring the dataset into balance, but I could not observe that. So this element looks like undersampling to me. Any comments on that?
When I use the oversampling element, part of the dataset gets deleted. To get statistically significant and reliable results, I need to repeat the sampling several times (say 10 times) and obtain a different sample each time. I was expecting to find a seed value to control the sampling, but could not find one. How can I better control the oversampling method in Alteryx so that I can run different experiments with different sampling outcomes?
You are oversampling the underrepresented field. "For example, in the case of untargeted direct mail campaigns, it is not uncommon to find that 2% of potential prospects respond favorably to an appeal, while 98% do not. In this case, predictive models have a difficult time distinguishing the signal from the noise since the cost of classifying all potential prospects in the "no" category will nearly always be correct."
Thank you for your reply, though I had already read that documentation.
I am just trying to say that, in order to raise the share of the underrepresented field in the whole dataset, the oversampling element of Alteryx deletes entries of the majority class. To me this is rather undersampling.
Besides this not-so-important question of definition, I am more interested in how I can control and see what is being deleted within the tool, so that I can create different datasets.
Imagine you have 1000 'yes' and 100 'no' entries. To reach a 50%-50% balance, Alteryx's oversampling element would delete 900 'yes' entries. I would like a different set of 900 entries to be deleted in each iteration, so that I obtain a unique dataset with 100 'yes' and 100 'no' after each iteration. How can this be done in Alteryx with Oversampling or some other tool? For me, this is the critical question.
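To make the idea concrete, here is a minimal Python sketch of the seeded sampling described above. It does not show how the Alteryx element works internally or how to configure it; the function name `undersample`, the column name `response`, and the toy data are all assumptions made up for illustration. It only shows one way that a seed per iteration could yield a different balanced 100/100 dataset each time.

```python
import random
from collections import defaultdict


def undersample(rows, label_key, seed):
    """Randomly drop rows from the larger classes until every class
    has as many rows as the smallest class. Hypothetical helper."""
    rng = random.Random(seed)  # a separate generator per seed keeps runs reproducible

    # Group the rows by their class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    # Every class is cut down to the size of the smallest class.
    target = min(len(group) for group in by_class.values())

    sample = []
    for group in by_class.values():
        sample.extend(rng.sample(group, target))  # sampling without replacement
    rng.shuffle(sample)
    return sample


# Toy data matching the example: 1000 'yes' rows and 100 'no' rows.
rows = [{"id": i, "response": "yes"} for i in range(1000)]
rows += [{"id": 1000 + i, "response": "no"} for i in range(100)]

# Ten balanced datasets, one per seed; each keeps a different 100 'yes' rows.
datasets = {seed: undersample(rows, "response", seed) for seed in range(10)}
```

Because each dataset is tied to its seed, any single experiment can be rerun with the same rows, and the kept or deleted ids can be inspected by comparing a sample against the original rows.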