This article is part of the Tool Mastery Series, a compilation of Knowledge Base contributions to introduce diverse working examples for Designer Tools. Here we’ll delve into use of the Histogram Tool on our way to mastering the Alteryx Designer:
The humble histogram is something many people are first exposed to in grade school. Histograms are a type of bar graph that display the distribution of continuous numerical data. Histograms are sometimes confused with bar charts, which are plots of categorical variables.
To create a histogram, the data is first split into "bins" or "breaks" (i.e., a series of intervals covering the range of data values). These breaks are adjacent, non-overlapping, consecutive, and most often (but not necessarily) of equal size. After the bins are determined, frequency for each bin (the count of times a value in the data set falls within each bin) is calculated, and then a rectangle is created for each bin, with its height proportional to the frequency.
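The steps just described (split the range into bins, count the values in each bin, draw a bar per bin) can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (equal-width bins, the maximum value placed in the last bin); the function name `histogram_counts` and the sample data are made up for the example and are not part of the Histogram Tool.

```python
def histogram_counts(values, num_bins):
    """Split the range of `values` into equal-width bins and count values per bin."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values identical: a single bin holds everything.
        return [(lo, hi)], [len(values)]
    width = (hi - lo) / num_bins
    edges = [lo + i * width for i in range(num_bins + 1)]
    counts = [0] * num_bins
    for v in values:
        # Each bin includes its left edge; the maximum value is placed
        # in the last bin rather than opening a new one.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return list(zip(edges[:-1], edges[1:])), counts


# Hypothetical sample data, chosen only to illustrate the idea.
data = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 7.0, 9.0]
bins, counts = histogram_counts(data, 4)
for (left, right), count in zip(bins, counts):
    # A text "bar" whose length is proportional to the frequency.
    print(f"{left:.1f} to {right:.1f}: {'#' * count} ({count})")
```

Changing `num_bins` and re-running is a quick way to see how the choice of breaks changes the shape of the bars.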
The histogram’s simplicity is part of what makes it so powerful as a data investigation and visualization tool. Histograms allow us to visualize the distribution of a variable's values. By grouping the data into bins, a histogram gives a smoother picture of the probability density than plotting the frequency of each individual value, which generally makes the overall shape of the distribution easier to see. Histograms are easily understood and can help illuminate patterns in the data that affect how the data should be treated.
The configuration of the Alteryx Histogram Tool is very simple. All you need to do is select the field you would like to create a histogram for and choose the number of breaks (i.e., bins) to create. If you leave the number of breaks set to auto, R will calculate break points with an algorithm based on Sturges' formula. If you are interested in learning more about how R calculates histogram break points, there is a thorough blog post you can read here. Bad break points can create unhelpful or misleading histograms, so it is best practice in data investigation to experiment with different breaks.
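For a sense of scale, Sturges' formula suggests roughly ceil(log2(n) + 1) bins for n observations. The sketch below computes only that suggested count; the helper name `sturges_bin_count` is made up for this example. In base R the count is treated as a suggestion and the break points are then rounded to "pretty" values, so the number of bins actually drawn can differ. How the Histogram Tool applies the formula internally is not described here, so treat this as an approximation.

```python
import math


def sturges_bin_count(n):
    """Suggested number of bins for n observations: ceil(log2(n) + 1)."""
    if n < 1:
        raise ValueError("need at least one observation")
    return math.ceil(math.log2(n) + 1)


# The suggested count grows slowly as the data set gets larger.
for n in (10, 100, 1000, 100000):
    print(n, sturges_bin_count(n))
```

Because the count grows only logarithmically, very large data sets get comparatively few bins, which is one reason to try other break settings yourself.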
If you choose to select the Plot a smoothed density curve… option, one additional setting is displayed, allowing you to set the bandwidth of the smoother. As noted in the configuration window, a smaller number means a narrower smoother and a larger number means a wider smoother.
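To see why the bandwidth matters, here is a minimal sketch of a smoothed density estimate, assuming a Gaussian kernel. The article does not state which smoother the tool uses or how its setting maps to a bandwidth, so the function `gaussian_kde`, the bandwidth values, and the sample data are all illustrative assumptions.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at x with a Gaussian kernel."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Each data point contributes a bell curve centred on itself;
        # the bandwidth controls how wide each bell is.
        return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)

    return density


# Hypothetical data with two clusters, around 1 and around 5.
data = [1.0, 1.2, 1.1, 5.0, 5.3, 4.9]
narrow = gaussian_kde(data, 0.3)  # narrow smoother: follows the data closely
wide = gaussian_kde(data, 3.0)    # wide smoother: blurs the clusters together

# At x = 3, in the gap between the clusters, the narrow smoother reports
# almost no density while the wide smoother fills the gap in.
print(narrow(3.0), wide(3.0))
```

A smoother that is too narrow shows every bump in the sample, while one that is too wide can hide real structure such as a second peak.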
In the second tab of the tool’s configuration window, you can specify your plot size, font size and graph resolution. This is particularly helpful if your histogram is going to end up in a report.
Now that you know how to configure the tool, let's talk a little bit about description and interpretation. As I mentioned, it is a good idea to spend a little bit of time playing with different bin widths and seeing how they affect the overall shape of the histogram. While you are going through this process, keep an eye on what the overall distribution of your data looks like. Words used to describe patterns in a histogram include symmetric, skewed left, and skewed right to describe skewness, and unimodal, bimodal, and multimodal to describe the mode(s).
Symmetric and left or right skewed describe the relative position of the distribution’s peak. Symmetric describes when the peak is in the center of the data, and the distribution is the same to the left or right of the peak. Skewed left or right describes when the peak is off center, toward one of the limits, and a “tail” stretches away from it. Left or right is assigned based on which side the tail is on.
Unimodal, Right-Skewed Histogram
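If you want a number to go with the visual impression, the moment-based sample skewness is a common companion to a histogram: positive values indicate a tail to the right, negative values a tail to the left, and values near zero a roughly symmetric shape. The sketch below is a plain-Python illustration; the function name and the sample data are made up for the example.

```python
def sample_skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        # No spread at all, so there is no tail in either direction.
        return 0.0
    return m3 / m2 ** 1.5


# Hypothetical data: most values are small, with a tail stretching right.
right_tailed = [1, 1, 2, 2, 2, 3, 3, 4, 6, 12]
left_tailed = [-v for v in right_tailed]  # mirror image: tail on the left
symmetric = [1, 2, 3, 4, 5]

print(sample_skewness(right_tailed) > 0)  # tail on the right
print(sample_skewness(left_tailed) < 0)   # tail on the left
print(sample_skewness(symmetric))
```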
Unimodal, bimodal and multimodal describe the number of peaks that occur in the data’s distribution. Unimodal means a single peak, bimodal two, and multimodal more than two.
Histogram with Bimodal Distribution
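Counting peaks can be roughed out from the bin counts themselves. The sketch below counts local maxima in a list of bin frequencies, treating a flat run of equal neighbouring counts as one peak. It is a deliberately naive illustration (the function name and the counts are invented for the example); with noisy data or too many bins it will report spurious peaks, which is another reason to experiment with different breaks.

```python
def count_peaks(counts):
    """Count local maxima in a sequence of bin frequencies."""
    # Collapse runs of equal neighbouring counts so a flat top counts once.
    collapsed = [c for i, c in enumerate(counts) if i == 0 or c != counts[i - 1]]
    # Pad both ends so that a peak in the first or last bin is detected too.
    padded = [-1] + collapsed + [-1]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


# Hypothetical bin counts.
unimodal = [1, 4, 9, 5, 2]
bimodal = [2, 8, 3, 1, 4, 9, 3]

print(count_peaks(unimodal))
print(count_peaks(bimodal))
```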
Interpretation is an art, and only you know what your data represents, how it was collected, and how it needs to be handled. The Histogram Tool, like all the tools included in the Data Investigation Toolbox, is here to help you with that process.
By now, you should have expert-level proficiency with the Histogram Tool! If you can think of a use case we left out, feel free to use the comments section below! Consider yourself a Tool Master already? Let us know at firstname.lastname@example.org if you’d like your creative tool uses to be featured in the Tool Mastery Series.
Stay tuned with our latest posts every #ToolTuesday by following @alteryx on Twitter! If you want to master all the Designer tools, consider subscribing for email notifications.