Hi, I am working on a project where I have the weights of shipped products. I am trying to identify extreme outliers by product, meaning weights that do not fit the normal distribution of that product's actual weight.
This has two purposes: to investigate why they are outliers so that issues can be identified, and to remove them from the average weight calculation by product for downstream needs.
Can someone point me toward a method I should consider so I can set up a macro that processes each product, identifies outliers, and removes them? Are there R functions or existing Alteryx workflows that can accomplish this easily?
Documentation on what counts as an "outlier" shows that the definition can be rather nebulous. I've seen it defined as a value more than one and a half times the interquartile range below the first quartile or above the third quartile, or as a value more than 3 standard deviations from the mean (along with a few other, more complex calculations).
So you would need to decide what you consider an outlier for your specific data.
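To make the two rules of thumb concrete, here is a minimal Python sketch using only the standard library. The function names (`iqr_bounds`, `sigma_bounds`), the multipliers as parameters, and the sample weights are all made up for illustration, and the quartile method (`"inclusive"`) is just one of several conventions, so other tools may give slightly different fences.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def sigma_bounds(values, k=3.0):
    """Return (low, high) limits: mean minus/plus k sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - k * sd, mean + k * sd

# Hypothetical weights for one product; 25.0 is the suspicious shipment.
weights = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 10.0, 25.0]

iqr_low, iqr_high = iqr_bounds(weights)
sd_low, sd_high = sigma_bounds(weights)

print("IQR fences:", iqr_low, iqr_high)
print("3-sigma limits:", sd_low, sd_high)
```

On this small sample the two rules disagree: 25.0 falls outside the IQR fences but inside the 3-standard-deviation limits, because the extreme value inflates the standard deviation it is being measured against. That is one reason the choice of rule depends on your data.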
That said, once you determine the min/max range for what is "normal", then it is an easy process to append those fields to your data and filter on whatever lies outside of those values.
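Outside of Alteryx, the append-and-filter idea could be sketched per product roughly as follows. This assumes the 1.5 × IQR rule; the record layout `(product, weight)`, the helper `iqr_bounds`, and the data are hypothetical, not taken from any attached workflow.

```python
import statistics
from collections import defaultdict

# Hypothetical shipment records: (product, weight).
shipments = [
    ("A", 10.1), ("A", 9.8), ("A", 10.0), ("A", 10.3), ("A", 9.9), ("A", 55.0),
    ("B", 2.0), ("B", 2.1), ("B", 1.9), ("B", 2.2), ("B", 2.0), ("B", 0.1),
]

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Group weights by product (the "group by" step).
by_product = defaultdict(list)
for product, weight in shipments:
    by_product[product].append(weight)

# Compute the min/max "normal" range once per product.
bounds = {product: iqr_bounds(values) for product, values in by_product.items()}

# Append the range to each record and split on whether the weight lies inside it.
outliers = []
kept = defaultdict(list)
for product, weight in shipments:
    low, high = bounds[product]
    if low <= weight <= high:
        kept[product].append(weight)
    else:
        outliers.append((product, weight))

# Average weight per product with the outliers removed.
clean_avg = {product: statistics.mean(values) for product, values in kept.items()}

print("Outliers to investigate:", outliers)
print("Average weight without outliers:", clean_avg)
```

Keeping the flagged rows in a separate list rather than discarding them serves both purposes mentioned in the question: they can be investigated, and the averages are computed without them.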
When I taught statistics, I taught my students to plot and visualize the data before doing anything else. So I'd recommend creating some plots to see what you'd consider outliers. Good software will also let you select data points to investigate further.
You could try Alteryx's plotting tools in the Data Investigation tab, use R as a standalone data exploration tool, or use something like Tableau that lets you interact with your data and really dig into it.
OK, I think the outlier macro will work after adding a group-by feature. I can now quickly look for outliers by product and check for false outliers by looking at the new average. It seems like this would work well with an iterative macro that cycles through a range of values and then develops thresholds to identify borderline outliers (a task for another day).
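One possible reading of the iterative idea, sketched in Python rather than as an Alteryx macro: sweep the IQR multiplier over a range and see which points are flagged at every setting versus only at the stricter ones. The multiplier range, the sample weights, and the labels "borderline" and "extreme" are assumptions for illustration only.

```python
import statistics

# Hypothetical weights for one product.
weights = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 10.0, 10.8, 25.0]

# Assumed range of IQR multipliers to cycle through.
multipliers = [1.5, 2.0, 2.5, 3.0]

q1, _, q3 = statistics.quantiles(weights, n=4, method="inclusive")
iqr = q3 - q1

# Count, per data point, how many multipliers flag it as an outlier.
flag_counts = [0] * len(weights)
for k in multipliers:
    low, high = q1 - k * iqr, q3 + k * iqr
    for i, w in enumerate(weights):
        if w < low or w > high:
            flag_counts[i] += 1

# Flagged at every multiplier: clear outliers. Flagged at some but not all: borderline.
extreme = [w for w, n in zip(weights, flag_counts) if n == len(multipliers)]
borderline = [w for w, n in zip(weights, flag_counts) if 0 < n < len(multipliers)]

print("Extreme:", extreme)
print("Borderline:", borderline)
```

Points in the borderline group are the ones worth a manual look before deciding where the threshold should sit.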
I have attached the group-by version for future users.