Hello Alteryx Community,
I am new to Alteryx and have been going through the Oversample Field tool using its single-tool sample workflow. The comments say to use it before the predictive tools for effective modeling. However, the workflow starts with a dataset of 226 records and outputs 150 records, so it reduces the total data available even though it balances the classes. Is it good to reduce the available data for effective training? And why is it called an oversampling tool when it is in fact reducing the number of samples? I am a little confused here. Can some Alteryx gurus clarify?
As I understand it, the output is undersampled, balanced data. Is my understanding correct?
Hi @kvssetty,
Here's the explanation included in the tool's help page: Oversample Field Tool | Alteryx Help
My interpretation, and how I like to think of the tool, is that the term "oversample" refers to the tool keeping all records from one class (in your example it oversamples Yes, so 75/75 records are kept) and then keeping an equal number of records from the other class (so only 75 of the 151 No records). If the tool instead oversampled by generating new Yes records, it would have to decide what values the other columns should take in those new records, which introduces uncertainty while balancing Yes/No.
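To make the arithmetic concrete, here is a minimal Python sketch of the behaviour described above: keep every record of the rarest class and randomly sample the same number from the other class. This is only an illustration of the idea, not Alteryx's actual implementation; the function name `undersample_majority` and the `id`/`label` record layout are made up for the example.

```python
import random
from collections import Counter

def undersample_majority(records, label_key, seed=0):
    """Randomly drop rows so every class has as many rows as the rarest class.

    Hypothetical helper for illustration only, not part of Alteryx.
    """
    rng = random.Random(seed)
    # Group rows by their class label.
    by_class = {}
    for row in records:
        by_class.setdefault(row[label_key], []).append(row)
    # The rarest class sets the size that every class is cut down to.
    target = min(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rng.sample(rows, target))
    rng.shuffle(balanced)
    return balanced

# Same shape as the sample workflow in this thread: 75 Yes + 151 No = 226 records.
records = [{"id": i, "label": "Yes"} for i in range(75)]
records += [{"id": 75 + i, "label": "No"} for i in range(151)]

balanced = undersample_majority(records, "label")
counts = Counter(row["label"] for row in balanced)
print(len(records), len(balanced), dict(counts))
```

All 75 Yes records survive and 76 of the No records are dropped, which is why the output has 150 rows rather than 226.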
Hope that helps,
Angelos
Hello,
Understood. It works by randomly downsampling the majority class (keeping 75 of 151). That is quite the opposite of the SMOTE algorithm, in which minority-class records (75) are synthetically generated (upsampling) until they match the majority class (151), using a more sophisticated algorithm (based on k-nearest neighbors), so that the data is balanced with a total of 151 + 151 = 302 samples. That is truly balanced oversampling.
Here is what SMOTE is:
Synthetic Minority Oversampling Technique (SMOTE) is a statistical technique for increasing the number of cases in your dataset in a balanced way. The component works by generating new instances from existing minority cases that you supply as input. This implementation of SMOTE does not change the number of majority cases.
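For comparison, here is a rough Python sketch of the core SMOTE idea: pick a minority record, pick one of its k nearest minority neighbours, and create a new point somewhere on the line between them. It is a simplified illustration using only the standard library, not a reference implementation; the name `smote_like`, the two numeric features, and the random toy data are all assumptions made for the example.

```python
import math
import random

def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Simplified sketch of the SMOTE idea (hypothetical helper, numeric features only).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest *minority* neighbours of the chosen point (Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Random position on the segment between the point and its neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class: 75 records with two made-up numeric features.
data_rng = random.Random(1)
minority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(75)]

# Grow the minority class from 75 to 151 so it matches the majority class.
new_points = smote_like(minority, n_new=151 - 75)
print(len(minority) + len(new_points))  # 151
```

With the 151 majority records left untouched, the balanced set would have 151 + 151 = 302 records. Categorical columns need extra handling that this sketch does not attempt.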
Hi,
I encountered the same thing. The "Oversample Field" tool is really UNDERSAMPLING by definition.
This article confirms it: https://community.alteryx.com/t5/Data-Science/Balancing-Act-Classification-with-Imbalanced-Data/ba-p...
This seems disappointing. It would be nice to have the option to choose whether this tool removes or adds records.