01-29-2018 09:26 AM - edited 08-03-2021 03:47 PM
Welcome to the final chapter of our voyage through the Pre-Predictive series. This has been a four-part journey introducing you to the exciting world of data investigation.
In this article we are checking out all of the super cool plotting tools available to you in the Data Investigation Toolbox. In Part 1, we looked at the Field Summary Tool; Part 2 covered the Contingency and Frequency Table Tools, as well as the Distribution Analysis Tool; and Part 3 reviewed the Association Analysis Tool, as well as the Pearson and Spearman Correlation Tools.
The Data Investigation Toolbox includes five different plotting tools. These are the Heat Plot, Histogram, Plot of Means, Scatterplot, and Violin Plot Tools. Plotting tools are great for visualizing your data, and each tool has different strengths. To demonstrate the use of these tools, I have created a plot with each tool for the Iris flower data set. There is a workflow attached to this post with the Iris data for you to experiment with.
This post looks long, but that is mostly because of all the awesome images of plot outputs I’ve included. If you don’t have a lot of time, please feel free to skim to the plot that speaks to you and read about that.
The Heat Plot Tool creates a bivariate density function visualization. In plainer English, it creates a plot that uses color to show how frequently combinations of values of two variables occur together in one observation (record).
The value and application of a Heat Plot is similar to that of a Contingency Table. The Heat Plot Tool allows the user to visualize how continuous variables relate to one another. Heat Plots can suggest correlations between variables, including non-linear relationships.
This Heat Plot (default configuration) depicts the bivariate density of Petal Length vs. Sepal Length of the popular Iris Classification data set. In this plot we are able to clearly identify two separate groups based on these predictor variables, and we see there is a relationship between Petal Length and Sepal Length.
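If you want to see the idea behind a heat plot outside of Alteryx, here is a minimal Python sketch. It is not how the Heat Plot Tool is implemented; it simply bins two fields into a grid and reports the share of records in each cell, which is a rough, binned stand-in for a bivariate density. The function names, the 3x3 grid, and the measurements are all made up for illustration (they are not the Iris data).

```python
from collections import Counter


def bin_index(value, lo, hi, n_bins):
    """Return the 0-based bin of value within [lo, hi]."""
    if hi == lo:  # all values identical: everything falls in the first bin
        return 0
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return min(int((value - lo) / width), n_bins - 1)


def relative_frequency_grid(xs, ys, n_bins=3):
    """Share of records per cell; rows are y bins, columns are x bins."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    counts = Counter(
        (bin_index(x, x_lo, x_hi, n_bins), bin_index(y, y_lo, y_hi, n_bins))
        for x, y in zip(xs, ys)
    )
    total = len(xs)
    return [[counts[(i, j)] / total for i in range(n_bins)] for j in range(n_bins)]


# Hypothetical measurements in cm, invented for this example.
petal_length = [1.0, 1.2, 1.4, 1.5, 4.5, 4.8, 5.0, 5.5, 6.0]
sepal_length = [4.5, 4.8, 5.0, 5.1, 6.0, 6.2, 6.4, 6.8, 7.5]

grid = relative_frequency_grid(petal_length, sepal_length)
for row in reversed(grid):  # print the highest y bin first, like a plot
    print("  ".join(f"{cell:.2f}" for cell in row))
```

Cells holding a larger share of records are the ones a heat plot would draw in a "hotter" color. In this toy data the records pile up in two separate corners of the grid, which is the kind of two-group pattern described above.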
The Histogram Tool creates a visualization of the distribution of numerical data. Histograms are very useful for screening variables for distinguishing characteristics or patterns in their distribution.
The input fields for this tool must be numeric. You have the option to specify the number of breaks (how many bins the values of the variable are divided into) or allow the tool to automatically break up your data. It is best practice to experiment with different breaks. You have the option of generating a standard histogram, where the y-axis depicts frequency, or adding a smoothed empirical density curve by selecting the Plot a smoothed density curve… option. This option plots a density curve in addition to a histogram, and density is represented by the y-axis instead of frequency.
This histogram (Plot a smoothed density curve… option selected) depicts the distribution of the Petal Length (cm) variable across all of the measured Iris samples in this dataset. We can see this variable has a bimodal (i.e., double-peaked) distribution. This is useful information to us, because it suggests there are two distinct populations with different Petal Lengths, and Petal Length can potentially be a helpful predictor variable for categorizing the Iris species.
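To make the difference between a frequency axis and a density axis concrete, here is a small Python sketch. It assumes equal-width bins and uses invented values (not the Iris data); the `histogram` function is my own illustration, not the tool's code. It only shows the histogram bars on both scales and does not attempt the smoothed curve.

```python
def histogram(values, n_breaks):
    """Bin values into n_breaks equal-width bins.

    Returns (counts, densities, width). Counts are what a frequency y-axis
    shows; densities are rescaled so the bar areas add up to 1.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_breaks  # assumes hi > lo
    counts = [0] * n_breaks
    for v in values:
        # Clamp the maximum value into the last bin.
        idx = min(int((v - lo) / width), n_breaks - 1)
        counts[idx] += 1
    densities = [c / (len(values) * width) for c in counts]
    return counts, densities, width


# Hypothetical petal lengths in cm, invented for this example.
petal_length = [1.0, 1.2, 1.4, 1.5, 4.5, 4.8, 5.0, 5.5, 6.0]

counts, densities, width = histogram(petal_length, n_breaks=5)
print("frequency:", counts)
print("density:  ", [round(d, 3) for d in densities])
print("total bar area:", sum(d * width for d in densities))
```

Try calling `histogram` with a different `n_breaks` to see why experimenting with breaks matters: the same values can look quite different depending on how finely they are binned.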
The Plot of Means Tool allows you to visually compare the mean of a numeric or binary categorical field across the groups of a categorical variable.
You have the option of including error bars to represent Standard Error, Standard Deviation, or a Confidence Interval with a confidence level of your choice. The Plot of Means Tool will put a dot at the sample population’s mean, add error bars to represent the metric of your choice, and draw a line connecting the means of each of the sampled categories.
This plot depicts the mean sepal length for each Iris Species. The black dot depicts the sample mean for each species, and the error bars depict a 95% confidence interval. A 95% confidence interval is a range of values constructed so that, across repeated samples, about 95% of such intervals would include the population mean. Because the confidence intervals around the mean Sepal Length for each species do not overlap, the plot suggests that the population mean Sepal Lengths may be statistically significantly different from one another. This indicates that Sepal Length may be useful for classifying Iris Species.
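The three error bar options can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's calculation: the confidence interval below uses a normal approximation (for small samples a t-based interval would be wider), and the group names, the values, and the `summarize` and `intervals_overlap` helpers are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarize(values, confidence=0.95):
    """Mean plus the three candidate error-bar metrics for one group."""
    n = len(values)
    m = mean(values)
    sd = stdev(values)   # sample standard deviation
    se = sd / sqrt(n)    # standard error of the mean
    # Normal-approximation critical value (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {"mean": m, "sd": sd, "se": se, "ci": (m - z * se, m + z * se)}


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical sepal lengths in cm for two made-up groups.
groups = {
    "group_a": [5.0, 4.9, 5.1, 5.2, 4.8],
    "group_b": [6.0, 5.8, 6.2, 6.1, 5.9],
}

summaries = {name: summarize(values) for name, values in groups.items()}
for name, s in summaries.items():
    low, high = s["ci"]
    print(f"{name}: mean={s['mean']:.2f} sd={s['sd']:.3f} "
          f"se={s['se']:.3f} ci=({low:.2f}, {high:.2f})")

print("intervals overlap:",
      intervals_overlap(summaries["group_a"]["ci"], summaries["group_b"]["ci"]))
```

Non-overlapping intervals are a quick visual hint rather than a formal test, which is why the text above says the means "may be" significantly different.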
The Scatterplot Tool creates (wait for it…) a scatter plot, which is a plot that displays values for two variables, where the value of one variable determines the x-coordinate of a record and the other determines the y-coordinate. A scatter plot depicts similar information to the output of the Heat Plot Tool. Both plot how two variables relate to one another: the Heat Plot depicts density with colors, and the scatter plot shows an individual point for each record. Like heat plots, scatter plots are useful for suggesting correlations between variables.
By default, along with points, the Scatterplot Tool includes a Least-Squares regression line, a smooth line calculated using a LOESS function, and two dashed lines indicating the spread of the smooth line (the size of the local area used to construct the LOESS estimates). If you don’t like the lines, you can turn them off in the Plot elements tab in the tool’s configuration. The Scatterplot Tool also creates box plots for the variables you are examining on each axis. Scatterplots are useful for identifying the type and strength of a relationship, outliers, time-based trends, and group-related patterns including clusters in the data.
This plot compares Petal Length to Sepal Length. We can see there is a somewhat positive linear relationship between the two variables, but also that there are two distinctive groups of values. The Least-Squares regression line suggests a strongly positive relationship (as Petal Length increases, Sepal Length increases), and the smooth line and spread lines highlight the two distinct clusters. The box plots suggest Sepal Length is more narrowly distributed around 5.75 cm, and Petal Length is more widely distributed, skewing to the shorter measurements.
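If you are curious what a least-squares regression line boils down to, here is a minimal Python sketch of the textbook formula for a straight-line fit. The `least_squares_line` function and the numbers are hypothetical and chosen so the answer is easy to check by hand; they are not the Iris data, and the LOESS smooth line is not covered here.

```python
from statistics import mean


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = slope*x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)  # assumes xs are not all identical
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = least_squares_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

A positive slope is what "as Petal Length increases, Sepal Length increases" looks like numerically. A single straight line cannot show clusters, though, which is why the smooth line and the points themselves are worth looking at too.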
The Violin Plot Tool creates a (you guessed it) violin plot for a single numeric variable. You can optionally have the tool create separate violin plots for each value in a categorical variable.
If you haven’t worked with them before, violin plots are similar to box plots, except they also show the probability density of the data at different values (kind of like the Plot a smoothed density curve option in the Histogram Tool).
In the Alteryx Violin Plots, the mean of a sample is depicted with a white dot, the range with a vertical black line, the interquartile range with an orange rectangle, and the curvy blue shape depicts the probability density of the data at different values. With this plot, we can see that the ranges of sepal lengths for all three species overlap with one another. Iris-virginica has the largest range of Sepal Lengths; however, there are far fewer values at the shorter end. The majority of the sampled Iris-setosas have sepal lengths around 5.0 cm.
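The ingredients of a violin plot can also be computed by hand. The Python sketch below is only an illustration: it uses a simple Gaussian kernel density estimate with a bandwidth I picked arbitrarily, the default quartile method of Python's `statistics.quantiles`, and invented values, so the numbers will not match what the Violin Plot Tool draws. The `violin_summary` and `gaussian_kde` names are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import mean, quantiles


def violin_summary(values):
    """Center, range, and quartiles: the box-plot-like parts of a violin plot."""
    q1, median, q3 = quantiles(values, n=4)
    return {"mean": mean(values), "min": min(values), "max": max(values),
            "q1": q1, "median": median, "q3": q3}


def gaussian_kde(values, bandwidth):
    """Return a function estimating the probability density at a point x."""
    n = len(values)
    norm = n * bandwidth * sqrt(2 * pi)

    def density(x):
        return sum(exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm

    return density


# Hypothetical sepal lengths in cm, invented for this example.
sepal_length = [4.8, 4.9, 5.0, 5.0, 5.1, 5.2, 5.8]

summary = violin_summary(sepal_length)
density = gaussian_kde(sepal_length, bandwidth=0.2)  # bandwidth is an arbitrary choice

print(summary)
for x in (4.5, 5.0, 5.5, 6.0):
    # The width of the violin at x is proportional to this density.
    print(f"density at {x:.1f} cm: {density(x):.3f}")
```

The density is highest where the values cluster, which is where the violin is widest. That is how the plot can show both a wide range and the fact that only a few values sit at one end of it.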
Plots are a fantastic way to do data investigation. They provide visual ways to interpret your data, and Alteryx provides a variety of plotting tools to help you achieve data enlightenment.
And so concludes the Pre-Predictive series, but hopefully this is just the beginning of a long happy relationship with the Data Investigation Tools. I hope this series has introduced you to at least one new tool you are excited to start using, and that you always remember, if you don’t do data investigation before using the Predictive Tools, you’re gonna have a bad time.
Your attachment doesn't have an input file.
Hi @PreetiVatnani ,
Thank you for the heads up! I've updated the workflow to include the initial data.
Hi Sydney, how did you get the different colors in the scatterplot?
Mine looks like this, all in blue, which is not so pretty:
I just found this 4-part series, very good introduction on data investigation. Thank you!