# Alteryx Designer


# Pre-Predictive: Using the Data Investigation Tools - Part 2 of 4


Welcome to Part 2 of the Pre-Predictive series! After a strong start and a long hiatus, we are resuming our tour of the Data Investigation Tools.

Data Investigation is deeply underrated. Every time you go about using a new data set for analysis or modeling, data investigation should be your first step. It is impossible to make meaningful conclusions without an understanding of the data you are working with.

These aqua-colored tools are provided to you with that exact purpose in mind. If the Data Investigation Toolbox isn’t already your best friend, it should be, and I am thrilled that I have the opportunity to introduce the two of you 😊

This part of the Pre-Predictive series covers the Frequency Table, Contingency Table, and Distribution Analysis Tools. Part 1 covered the Field Summary Tool, and Parts 3 and 4 will cover Association Analysis and Correlations, and the Plotting Tools, respectively.

For each field selected in the tool set-up, the Frequency Table Tool creates a summary of the data with frequency counts and percentages for each value in each selected field. It can be used on quantitative or qualitative variables, but will not accept FixedDecimal, Float, Double, Date/Time, Blob or SpatialObj fields. This tool gives you the ability to visualize how the values within each of your fields are distributed. A Frequency Table is a snapshot of your data giving you the chance to identify any patterns in preliminary investigation.

The Frequency Table Tool has three outputs. The D output is a data table that lists each of your field values, labeled by field name, along with each value's frequency, percentage within its field, cumulative frequency, and cumulative percent (the cumulative columns are running totals within a given field name). The R output is a report that includes all the information in the D output, divided into tables by field name. The interactive output (I) lets you filter which fields and metrics you are viewing and create plots.
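To make the D output columns concrete, here is a minimal Python sketch of the same arithmetic for a single field. It is not what the tool runs internally; the function name `frequency_table`, the row keys, and the sample data are all made up for illustration, and values are sorted alphabetically only so the running totals are reproducible.

```python
from collections import Counter


def frequency_table(values):
    """Return one row per distinct value with count, percent, and running totals."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    running = 0
    # Sort the distinct values so the cumulative columns are reproducible.
    for value in sorted(counts):
        running += counts[value]
        rows.append({
            "value": value,
            "frequency": counts[value],
            "percent": 100.0 * counts[value] / total,
            "cumulative_frequency": running,
            "cumulative_percent": 100.0 * running / total,
        })
    return rows


# Hypothetical sample field, not taken from any Alteryx dataset.
colors = ["red", "blue", "red", "green", "blue", "red", "red", "green"]
table = frequency_table(colors)
for row in table:
    print(row)
```

The last row's cumulative frequency always equals the record count, which is a handy sanity check when reading the real D output.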

Look out for fields that don’t have many unique values. The report will warn you that you might have a categorical field set as numeric if it detects a numeric field with fewer than ten unique values. If the tool has correctly identified a categorical field masquerading as a numeric field, consider changing the field data type to string using the Select Tool, so that it is treated correctly as a categorical variable by the predictive tools later in your workflow. In general, look out for fields that have distinctive patterns or distributions. This can help guide how you treat and use your data.
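The "fewer than ten unique values" check is easy to reproduce outside the tool if you want to screen a data set yourself. The sketch below assumes the records are plain Python dictionaries; `low_cardinality_numeric_fields`, its `threshold` parameter, and the sample records are hypothetical names for illustration, not part of Alteryx.

```python
def low_cardinality_numeric_fields(records, threshold=10):
    """Return names of numeric fields with fewer than `threshold` distinct values."""
    flagged = []
    field_names = list(records[0].keys()) if records else []
    for name in field_names:
        # Ignore missing values when judging the field.
        values = [r[name] for r in records if r[name] is not None]
        is_numeric = bool(values) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if is_numeric and len(set(values)) < threshold:
            flagged.append(name)
    return flagged


# Hypothetical records: region_code is numeric but really a category label.
records = [
    {"store_id": i, "region_code": i % 3, "region_name": "ABC"[i % 3]}
    for i in range(30)
]
flagged = low_cardinality_numeric_fields(records)
print(flagged)
```

A flagged field is only a candidate. A genuinely numeric field, such as a count that happens to take few values, can trip the same check, so the decision to convert it to a string is still yours.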

A Contingency Table (also called a two-way table) is a special type of frequency distribution table that displays the frequency distribution of two or more categorical variables as well as how they relate to one another. The Contingency Table Tool provides a basic snapshot of the interrelation between two or more variables, and can help you identify interactions between them.
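For the two-variable case, a contingency table is just a count of every pair of values. The following Python sketch shows the idea under simple assumptions: two equal-length lists of categorical values, with the function name `contingency_table` and the sample data invented for illustration.

```python
from collections import Counter


def contingency_table(rows_field, cols_field):
    """Count how often each (row value, column value) pair occurs."""
    if len(rows_field) != len(cols_field):
        raise ValueError("both fields must have the same number of records")
    pair_counts = Counter(zip(rows_field, cols_field))
    row_levels = sorted(set(rows_field))
    col_levels = sorted(set(cols_field))
    # Counter returns 0 for pairs that never occur, so empty cells are filled in.
    table = [[pair_counts[(r, c)] for c in col_levels] for r in row_levels]
    return row_levels, col_levels, table


# Hypothetical sample fields, one value per record.
weather = ["rain", "rain", "clear", "clear", "clear", "rain", "clear", "clear"]
injury = ["yes", "no", "no", "no", "yes", "yes", "no", "no"]

row_levels, col_levels, table = contingency_table(weather, injury)
total = sum(sum(row) for row in table)
for level, row in zip(row_levels, table):
    cells = ", ".join(
        f"{c}={n} ({100.0 * n / total:.1f}%)" for c, n in zip(col_levels, row)
    )
    print(f"{level}: {cells}")
```

Here the percentages are shares of all records; row-wise or column-wise percentages would divide by the row or column total instead.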

In the set-up, you can choose whether to include a chi-squared statistic. If you include it, you can compare two variables; if you leave it out, you can select up to four variables to compare. The chi-squared test checks whether the two variables being compared are related: a low p-value suggests a statistically significant relationship.
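If you are curious where the chi-squared value and df come from, here is a minimal sketch of Pearson's chi-squared test of independence in plain Python. It makes several assumptions: the table is a list of rows of counts with no empty row or column, no continuity correction is applied (whether the tool applies one is not stated here), and the p-value helper `p_value_df1` only handles the 2x2 case with one degree of freedom. All names and the sample counts are hypothetical.

```python
import math


def chi_squared_test(table):
    """Return (statistic, degrees of freedom) for a table of observed counts.

    Assumes every row and column total is positive, so no expected count is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


def p_value_df1(statistic):
    """Upper-tail p-value for a chi-squared statistic with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical 2x2 table of observed counts.
observed = [[10, 20], [20, 10]]
statistic, df = chi_squared_test(observed)
p_value = p_value_df1(statistic)
print(f"chi-squared = {statistic:.4f}, df = {df}, p-value = {p_value:.4f}")
```

For larger tables the statistic and df are computed the same way, but the p-value needs the general chi-squared distribution, which a statistics library would normally supply.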

Like the Frequency Table Tool, the Contingency Table Tool also creates three outputs. The data (D) output is a data table containing the frequency counts and percentages for each individual field and combination of fields. The report (R) output is a formatted report of the data output, and is where the Chi-squared value, df, and p-value will be reported if that option was selected in your setup. The interactive (I) output allows you to interactively filter which fields and metrics you are viewing, as well as how the data is displayed.

You can modify the interactive view to compare two variables at a time and highlight strong relationships between the variables being compared.

It is worthwhile to check for these relationships. Relationships between the predictor variables and the target variable are a good thing, as they suggest predictive power. Strong relationships between two predictor variables should be investigated, because they can introduce multicollinearity into your model.

The Distribution Analysis Tool allows you to fit your input data to different statistical distributions and compare Goodness-of-Fit to each distribution. This is pretty awesome for modeling because it enables you to determine which distribution best represents your data, and may help guide your predictive model selection. It should only be used on continuous variables.

This tool allows you to compare your data to the Normal, Lognormal, Weibull and Gamma Distributions. It is important to note that the Lognormal, Weibull and Gamma Distributions can only be fit to non-negative data.

The output of this tool is a series of report snippets, which include a histogram, summary statistics for the test results, goodness-of-fit statistics, data quantiles per distribution, and the distribution parameters.
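To give a feel for what comparing goodness of fit means, here is a rough Python sketch that fits a Normal and a Lognormal distribution to the same sample and compares them with a Kolmogorov-Smirnov distance (smaller is a closer fit). This is an illustration only: the statistic is swapped in for simplicity and is not claimed to be the one the tool reports, the helper names are hypothetical, the data is simulated, and this particular lognormal fit needs strictly positive values because it takes logarithms.

```python
import math
import random
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Largest gap between the sample's empirical CDF and a fitted CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Check the gap just after and just before each step of the empirical CDF.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def fit_normal(sample):
    """Fit a Normal distribution from the sample mean and standard deviation."""
    return NormalDist.from_samples(sample).cdf


def fit_lognormal(sample):
    """Fit a Lognormal distribution by fitting a Normal to the logged data."""
    if min(sample) <= 0:
        raise ValueError("this lognormal fit needs strictly positive data")
    log_dist = NormalDist.from_samples([math.log(x) for x in sample])
    return lambda x: log_dist.cdf(math.log(x))


# Simulated, right-skewed sample (an assumption for the demo, not real data).
random.seed(0)
sample = [random.lognormvariate(0.0, 1.0) for _ in range(500)]

ks_normal = ks_statistic(sample, fit_normal(sample))
ks_lognormal = ks_statistic(sample, fit_lognormal(sample))
print(f"Normal fit KS distance:    {ks_normal:.3f}")
print(f"Lognormal fit KS distance: {ks_lognormal:.3f}")
```

Because the sample was drawn from a lognormal distribution, the Lognormal fit should come out with the smaller distance, which is the kind of comparison the report lets you make across all four distributions.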

If your target data (what you are trying to predict) is a continuous variable, I would strongly recommend using this tool to help guide what type of predictive model to use. For example, if your target data fit a Normal Distribution best, a Linear Regression would be an appropriate model, while data with a Gamma Distribution would be better represented by a Gamma Regression, and so on.

Getting to know your data better will only improve the quality of your models. Only you know your data, your use case, and how your results will be applied, but the Data Investigation tools are here to help get you as informed as possible on what you’re working with.

14 - Magnetar

Hey @SydneyF thanks for this! I found the part about the types of fields that the Frequency Table tool will not accept particularly helpful. Is there any way to update the help doc with all of these field types? Currently it only lists FixedDecimal, Date, Time, DateTime, Blob, and SpatialObj (shown below). I was having the hardest time trying to figure out why some of my fields weren't coming through until I found this article that said Doubles also can't be used. I typically go to the help page first, so I think it would make sense to list all of the unacceptable field types there as well. Thanks!

Alteryx

Hi @Kenda,

Thank you for the feedback! I have passed it on to our tech writing team.

Sydney

Alteryx Partner

Hi, @SydneyF  -  I am experimenting with the Contingency Table and feeding it the NYC_Collisions_2015_Queens.yxdb file to examine

I have selected "Number of persons injured" and the R output report confirms that it is numeric. However, the Record 2 sort by number of persons injured goes 0, 1, 11, 12, 2, 22, etc. like a standard sort of a string. If the tool converts the bytes to strings, could the tool at least do a dictionary sort? Thanks.