Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Finding Outliers - where to start?

anthony
11 - Bolide

Hi, I am working on a project were I have weights of products shipped. I am trying to identify extreme outliers by product that do not fit into the normal distribution for that products actual weight.

 

This has 2 purposes; investigate why they are outliers to identify issues and to remove for average weight calculation by product for further downstream needs.

 

Can someone point me in the right direction/method I should consider so I can setup a marco to process each product to identify outliers so I can remove them? Are there R functions or exsisting alterxy workflows that can easily accomplish?

 

Thanks,

Anthony

 

 

 

 

 

 

 

 

 

 

10 REPLIES 10
michael_treadwell
ACE Emeritus
ACE Emeritus

Generally, for normally distributed data, a Z-score or median absolute deviation (MAD) is the quickest and easiest way to identify potential outliers.

 

Both of these can be calculated easily inside of Alteryx. Try looking for Z > 3 or MAD > 3.5. Again, these are only starting points. Outlier detection is a complicated issue.

 

For more information:

 

http://www.statisticshowto.com/how-to-calculate-a-z-score/

http://www.statisticshowto.com/median-absolute-deviation/

 

RodL
Alteryx Alumni (Retired)

Documentation on what is an "outlier" indicates that it can be rather nebulous in the definition. I've seen it be considered "one and a half times the size of the Interquartile Range" or "greater than 3 standard deviations" (and a few other more complex calculations).

So you would need to decide what you would consider the outlier for your specific data.

That said, once you determine the min/max range for what is "normal", then it is an easy process to append those fields to your data and filter on whatever lies outside of those values.

 

Philip
12 - Quasar

When I taught statistics, I taught my students to plot and visualize the data before doing anything else. So, I'd recommend creating some plots to see what you'd consider outliers. A good software will also let you select data points to investigate further.

 

You could try using Alteryx's plotting tools in the Data Investigation tab, use R as a stand alone data exploration tool, or use something like Tableau that lets you interact with your data to really dig into it.

anthony
11 - Bolide

I started with standard deviation and it is being skewed by some big outliers so I am trying to figure out the best way to deal with them.

 

Most of the data is a tight grouping and random ones that are way off that i need to flag for review so it might be a combination of sd/average.

 

I will take a look at some of the options to see what works best.

 

Thanks,

Anthony

 

 

NeilR
Alteryx Alumni (Retired)

Here's a quick and dirty outlier detection tool using the z-score methodology to help get you started.

anthony
11 - Bolide

Neil thanks - that caught my over obvious ones - I will test on some tighter groups but this should get me started.

 

Thanks,

Anthony

anthony
11 - Bolide

Ok, I think the outlier macro will work after adding group by feature. I can now quickly look for outliers by product and test by looking at new average for the false outliers. Seems like this would work well with a iterative macro to cycle through a range and then develop some thresholds to identify borderline outliers.(another day)

 

I have attached the group by version for future users.

Number4
8 - Asteroid

Perhaps a silly question, but what is the [#1] supposed to be in the formula tool?

sirelandcbre
6 - Meteoroid

[#1] here is gonna be the user-specified Z score threshold (made possible by connecting the Numeric Up/Down to the Formula tool's Input/Output anchor (magnifying glass/Q). #1 signifies the connection number; you'll see these when connecting interface tools throughout your workflows

Labels