Hi, I am working on a project were I have weights of products shipped. I am trying to identify extreme outliers by product that do not fit into the normal distribution for that products actual weight.
This has 2 purposes; investigate why they are outliers to identify issues and to remove for average weight calculation by product for further downstream needs.
Can someone point me in the right direction/method I should consider so I can setup a marco to process each product to identify outliers so I can remove them? Are there R functions or exsisting alterxy workflows that can easily accomplish?
Thanks,
Anthony
Solved! Go to Solution.
Generally, for normally distributed data, a Z-score or median absolute deviation (MAD) is the quickest and easiest way to identify potential outliers.
Both of these can be calculated easily inside of Alteryx. Try looking for Z > 3 or MAD > 3.5. Again, these are only starting points. Outlier detection is a complicated issue.
For more information:
http://www.statisticshowto.com/how-to-calculate-a-z-score/
http://www.statisticshowto.com/median-absolute-deviation/
Documentation on what is an "outlier" indicates that it can be rather nebulous in the definition. I've seen it be considered "one and a half times the size of the Interquartile Range" or "greater than 3 standard deviations" (and a few other more complex calculations).
So you would need to decide what you would consider the outlier for your specific data.
That said, once you determine the min/max range for what is "normal", then it is an easy process to append those fields to your data and filter on whatever lies outside of those values.
When I taught statistics, I taught my students to plot and visualize the data before doing anything else. So, I'd recommend creating some plots to see what you'd consider outliers. A good software will also let you select data points to investigate further.
You could try using Alteryx's plotting tools in the Data Investigation tab, use R as a stand alone data exploration tool, or use something like Tableau that lets you interact with your data to really dig into it.
I started with standard deviation and it is being skewed by some big outliers so I am trying to figure out the best way to deal with them.
Most of the data is a tight grouping and random ones that are way off that i need to flag for review so it might be a combination of sd/average.
I will take a look at some of the options to see what works best.
Thanks,
Anthony
Neil thanks - that caught my over obvious ones - I will test on some tighter groups but this should get me started.
Thanks,
Anthony
Ok, I think the outlier macro will work after adding group by feature. I can now quickly look for outliers by product and test by looking at new average for the false outliers. Seems like this would work well with a iterative macro to cycle through a range and then develop some thresholds to identify borderline outliers.(another day)
I have attached the group by version for future users.
Perhaps a silly question, but what is the [#1] supposed to be in the formula tool?
[#1] here is gonna be the user-specified Z score threshold (made possible by connecting the Numeric Up/Down to the Formula tool's Input/Output anchor (magnifying glass/Q). #1 signifies the connection number; you'll see these when connecting interface tools throughout your workflows