Finding Outliers - where to start?

Hi, I am working on a project were I have weights of products shipped. I am trying to identify extreme outliers by product that do not fit into the normal distribution for that products actual weight.

This has 2 purposes; investigate why they are outliers to identify issues and to remove for average weight calculation by product for further downstream needs.

Can someone point me in the right direction/method I should consider so I can setup a marco to process each product to identify outliers so I can remove them? Are there R functions or exsisting alterxy workflows that can easily accomplish?

Thanks,

Anthony

Tips and Tricks

R Tool

Accepted answers

anthony

Ok, I think the outlier macro will work after adding group by feature. I can now quickly look for outliers by product and test by looking at new average for the false outliers. Seems like this would work well with a iterative macro to cycle through a range and then develop some thresholds to identify borderline outliers.(another day)

I have attached the group by version for future users.

OutlierGroupBY.yxmc

All comments

michael_treadwell

Generally, for normally distributed data, a Z-score or median absolute deviation (MAD) is the quickest and easiest way to identify potential outliers.

Both of these can be calculated easily inside of Alteryx. Try looking for Z > 3 or MAD > 3.5. Again, these are only starting points. Outlier detection is a complicated issue.

For more information:

http://www.statisticshowto.com/how-to-calculate-a-z-score/

http://www.statisticshowto.com/median-absolute-deviation/

RodL

Documentation on what is an "outlier" indicates that it can be rather nebulous in the definition. I've seen it be considered "one and a half times the size of the Interquartile Range" or "greater than 3 standard deviations" (and a few other more complex calculations).

So you would need to decide what you would consider the outlier for your specific data.

That said, once you determine the min/max range for what is "normal", then it is an easy process to append those fields to your data and filter on whatever lies outside of those values.

Philip

When I taught statistics, I taught my students to plot and visualize the data before doing anything else. So, I'd recommend creating some plots to see what you'd consider outliers. A good software will also let you select data points to investigate further.

You could try using Alteryx's plotting tools in the Data Investigation tab, use R as a stand alone data exploration tool, or use something like Tableau that lets you interact with your data to really dig into it.

anthony

I started with standard deviation and it is being skewed by some big outliers so I am trying to figure out the best way to deal with them.

Most of the data is a tight grouping and random ones that are way off that i need to flag for review so it might be a combination of sd/average.

I will take a look at some of the options to see what works best.

Thanks,

Anthony

NeilR

Here's a quick and dirty outlier detection tool using the z-score methodology to help get you started.

Outlier.yxmc

anthony

Neil thanks - that caught my over obvious ones - I will test on some tighter groups but this should get me started.

Thanks,

Anthony

Number4

Perhaps a silly question, but what is the [#1] supposed to be in the formula tool?

sirelandcbre

[#1] here is gonna be the user-specified Z score threshold (made possible by connecting the Numeric Up/Down to the Formula tool's Input/Output anchor (magnifying glass/Q). #1 signifies the connection number; you'll see these when connecting interface tools throughout your workflows

Janet_Tan

refering to the threshold. I just used value 1 which is equivalent to 1 SD for my dataset. You could read up materials on Z score.

Quick Links

This months top contributors

atcodedog05 19598

Qiu 15867

binu_acs 15708

MarqueeCrew 13708

apathetichell 13703