Ambiguity in Oversample field tool

kvssetty
6 - Meteoroid

Hello Alteryx Community,

I am new to Alteryx and was going through the Oversample Field tool using its single-tool sample workflow. The comments say to use it before the predictive tools for effective modeling, but what I observed is that the workflow starts with a dataset of 226 records and outputs 150 records. In effect it reduces the total data available, although it does balance the classes. Is it good to reduce the available data for effective training? And why is it called an oversampling tool when it is in fact reducing the number of samples? I am a little confused here; can some Alteryx gurus clarify?

In my understanding, what it outputs is undersampled, balanced data. Is my understanding correct?

4 REPLIES
AngelosPachis
16 - Nebula

Hi @kvssetty ,

 

Here's the explanation included in the tool's help page (Oversample Field Tool | Alteryx Help):

 

[Screenshot: excerpt from the Oversample Field Tool help page]

 

My interpretation, and how I like to think of the tool, is that the term "oversample" refers to the tool keeping all records from one class (in your example it oversamples Yes, so 75/75 records are kept) and then balancing that class equally against No (so keeping only 75 of 151). If the tool instead had to oversample by generating new Yes records, it would be unclear what values the other columns should take in those new records.
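To make the arithmetic concrete, here is a minimal Python sketch of the behaviour described above. It is not the tool's actual implementation; it assumes a 50/50 target, and the field name `target` and the function `undersample_to_minority` are hypothetical names used only for illustration.

```python
import random
from collections import Counter

def undersample_to_minority(records, label_key, seed=0):
    """Keep every record of the smallest class and an equal-sized
    random sample (without replacement) of every other class."""
    rng = random.Random(seed)
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[label_key], []).append(rec)
    target = min(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        # sample() draws without replacement, so no record is duplicated
        balanced.extend(rng.sample(rows, target))
    return balanced

# Hypothetical stand-in for the sample workflow's data: 75 "Yes", 151 "No"
records = [{"id": i, "target": "Yes"} for i in range(75)]
records += [{"id": i, "target": "No"} for i in range(75, 226)]

balanced = undersample_to_minority(records, "target")
counts = Counter(rec["target"] for rec in balanced)
print(len(records), "->", len(balanced), dict(counts))
```

With these inputs the sketch goes from 226 records to 150 (75 Yes and 75 No), which matches what the sample workflow shows.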

 

Hope that helps,

Angelos

kvssetty
6 - Meteoroid

Hello,

Understood. It works by randomly downsampling the majority class (keeping 75 of 151). That is quite the opposite of the SMOTE algorithm, in which minority-class records (75) are synthetically generated (upsampling) up to 151 to match the majority class (151), using a more sophisticated algorithm (based on k-nearest neighbors), so that the data is balanced with a total of 151 + 151 = 302 samples. That would be true balanced oversampling.

Here is what SMOTE is:

 

Synthetic Minority Oversampling Technique (SMOTE) is a statistical technique for increasing the number of cases in your dataset in a balanced way. The component works by generating new instances from existing minority cases that you supply as input. This implementation of SMOTE does not change the number of majority cases.
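As a rough illustration of that idea, here is a simplified Python sketch of SMOTE-style interpolation. It is not the implementation from any particular library; it assumes purely numeric features, and the function `smote_sketch` and the randomly generated 2-D points are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical numeric features: 75 minority and 151 majority points in 2-D
data_rng = random.Random(42)
minority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(75)]
majority = [(data_rng.gauss(3, 1), data_rng.gauss(3, 1)) for _ in range(151)]

new_points = smote_sketch(minority, n_new=len(majority) - len(minority))
minority_balanced = minority + new_points
print(len(minority_balanced), len(majority))
```

Here 76 synthetic minority points are added to the original 75, the majority class is left untouched, and the balanced set has 151 + 151 = 302 records. Because each new point is interpolated between two existing minority points, the other columns get plausible values rather than arbitrary ones.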

jb23989
5 - Atom

Hi,

I encountered the same. The "Oversample Field" tool is really UNDERSAMPLING, per the standard definitions:

  • Random Oversampling: Randomly duplicate examples in the minority class.
  • Random Undersampling: Randomly delete examples in the majority class.

This article confirms: https://community.alteryx.com/t5/Data-Science/Balancing-Act-Classification-with-Imbalanced-Data/ba-p...
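For contrast with what the tool does, random oversampling as defined in the first bullet could look roughly like the Python sketch below. It is only an illustration; the field name `target` and the function `random_oversample` are hypothetical.

```python
import random
from collections import Counter

def random_oversample(records, label_key, seed=0):
    """Duplicate randomly chosen records of the smaller classes until
    every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[label_key], []).append(rec)
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)  # keep every original record
        # draw the shortfall with replacement, so duplicates are expected
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced

# Hypothetical data with the same class sizes as the sample workflow
records = [{"id": i, "target": "Yes"} for i in range(75)]
records += [{"id": i, "target": "No"} for i in range(75, 226)]

oversampled = random_oversample(records, "target")
counts = Counter(rec["target"] for rec in oversampled)
print(len(records), "->", len(oversampled), dict(counts))
```

This version grows the data from 226 to 302 records (151 per class) without discarding anything, at the cost of repeating some minority records exactly.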

 

JamieHankins
7 - Meteor

This seems disappointing. It would be nice to have the option to choose whether this tool removes or adds records.
