Pre-Predictive: Using the Data Investigation Tools - Part 2 of 4

SydneyF
Alteryx Alumni (Retired)

Welcome to Part 2 of the Pre-Predictive series! After a strong start but long hiatus, we will be resuming our tour of the Data Investigation Tools.

Data Investigation is deeply underrated. Every time you go about using a new data set for analysis or modeling, data investigation should be your first step. It is impossible to make meaningful conclusions without an understanding of the data you are working with.

These aqua-colored tools are provided to you with that exact purpose in mind. If the Data Investigation Toolbox isn’t already your best friend, it should be, and I am thrilled that I have the opportunity to introduce the two of you.

This section of the Pre-Predictive series covers the Frequency Table, Contingency Table, and Distribution Analysis Tools. Part 1 covered the Field Summary Tool, and Parts 3 and 4 will cover Association Analysis and Correlations, and the Plotting Tools, respectively.

2019-04-22_15-15-46.png

For each field selected in the tool setup, the Frequency Table Tool creates a summary of the data with frequency counts and percentages for each value in that field. It can be used on quantitative or qualitative variables, but will not accept FixedDecimal, Float, Double, Date/Time, Blob or SpatialObj fields. This tool lets you see how the values within each of your fields are distributed. A Frequency Table is a snapshot of your data, giving you the chance to identify patterns during preliminary investigation.

The Frequency Table Tool has three outputs. The D output is a data table that lists each of your field values, labeled by field name, along with each value's frequency, percentage within its field, cumulative frequency and cumulative percent (the cumulative columns are running totals within a given field name). The R output is a report that includes all the information in the D output, divided into tables by field name. The interactive output (I) allows you to interactively filter which fields and metrics you are viewing and create plots.
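To make the columns of the D output concrete, here is a minimal Python sketch that computes the same five quantities for a single field. It illustrates the arithmetic only and is not how the tool is implemented; the function name `frequency_table` and the row keys are hypothetical.

```python
from collections import Counter


def frequency_table(values):
    """Return one row per distinct value with counts and running totals."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    running = 0
    # Sorting assumes the values are mutually comparable
    # (for example all strings or all integers).
    for value in sorted(counts):
        running += counts[value]
        rows.append({
            "value": value,
            "frequency": counts[value],
            "percent": 100 * counts[value] / total,
            "cumulative_frequency": running,
            "cumulative_percent": 100 * running / total,
        })
    return rows


if __name__ == "__main__":
    for row in frequency_table(["red", "blue", "red", "green", "red", "blue"]):
        print(row)
```

The cumulative columns are running totals in the order the rows are listed, so the last row of a field always ends at the full record count and 100 percent.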

Look out for fields that don’t have many unique values. The report will warn you that you might have a categorical field set as numeric if it detects a numeric field with fewer than ten unique values. If the tool has correctly identified a categorical field masquerading as a numeric field, consider changing the field data type to string using the Select Tool, so that it is treated correctly as a categorical variable by the predictive tools later in your workflow. In general, look out for fields that have distinctive patterns or distributions. This can help guide how you treat and use your data.
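The "fewer than ten unique values" check can be sketched outside of Designer as well. The snippet below is a rough illustration under stated assumptions, not the tool's actual logic: fields are represented as a plain dictionary of lists, and the names `numeric_fields_that_look_categorical` and `as_categorical` are hypothetical.

```python
def numeric_fields_that_look_categorical(columns, max_unique=10):
    """Return names of numeric fields with fewer than max_unique distinct values."""
    flagged = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]  # ignore nulls
        is_numeric = bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in present
        )
        if is_numeric and len(set(present)) < max_unique:
            flagged.append(name)
    return flagged


def as_categorical(values):
    """Convert a numeric field to strings, keeping nulls as None."""
    return [None if v is None else str(v) for v in values]


if __name__ == "__main__":
    columns = {
        "store_tier": [1, 2, 3, 1, 2, None],        # coded category
        "revenue": [i * 1.5 for i in range(40)],    # genuinely continuous
        "region": ["N", "S", "N", "S"],             # already a string
    }
    print(numeric_fields_that_look_categorical(columns))
```

A flagged field is only a candidate. Whether it is truly categorical depends on what the values mean, which is why the conversion is a separate, deliberate step.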

2019-04-22_15-15-20.png

A Contingency Table (also called a two-way table) is a special type of frequency distribution table that displays the frequency distribution of two or more categorical variables as well as how they relate to one another. The Contingency Table Tool provides a basic snapshot of the interrelation between two or more variables, and can help you identify interactions between them.

In the setup, you can choose whether to include a chi-squared statistic. If you include it, you can compare two variables; if you leave it out, you can select up to four variables to compare. The chi-squared test assesses whether a statistically significant relationship exists between the two variables being compared. A low p-value suggests a statistically significant relationship.
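For readers who want to see what sits behind the chi-squared value, df, and p-value, here is a minimal sketch of Pearson's chi-squared test of independence using only the Python standard library. It is an illustration, not the tool's implementation: it assumes no continuity correction is applied, and the function names `crosstab`, `chi_square`, and `chi2_sf` are hypothetical.

```python
import math
from collections import Counter


def crosstab(xs, ys):
    """Build a two-way table of counts from two paired categorical fields."""
    rows, cols = sorted(set(xs)), sorted(set(ys))
    counts = Counter(zip(xs, ys))
    return rows, cols, [[counts[(r, c)] for c in cols] for r in rows]


def chi_square(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, (len(row_totals) - 1) * (len(col_totals) - 1)


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable (the p-value).

    Uses the power series for the regularized lower incomplete gamma
    function; adequate for a sketch, less precise for very large x.
    """
    a, z = df / 2, x / 2
    if z <= 0:
        return 1.0
    term = total = 1.0 / a
    for n in range(1, 10000):
        term *= z / (a + n)
        total += term
        if term < total * 1e-16:
            break
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


if __name__ == "__main__":
    stat, df = chi_square([[10, 20], [20, 10]])
    print(stat, df, chi2_sf(stat, df))
```

If SciPy is available, `scipy.stats.chi2_contingency` performs the same test in one call. Note that it applies Yates' continuity correction to 2x2 tables by default, so its result for a 2x2 table will differ from this sketch unless `correction=False` is passed.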

Like the Frequency Table Tool, the Contingency Table Tool also creates three outputs. The data (D) output is a data table containing the frequency counts and percentages for each individual field and combination of fields. The report (R) output is a formatted report of the data output, and is where the chi-squared value, df, and p-value will be reported if that option was selected in your setup. The interactive (I) output allows you to interactively filter which fields and metrics you are viewing, as well as how the data is displayed.

You can modify the interactive view to compare two variables at a time and highlight strong relationships between the variables being compared.

It is worthwhile to check for these relationships. Relationships between the predictor variables and the target variable are a good thing, as they suggest predictive power. Strong relationships between two predictor variables should be investigated, because they can introduce multicollinearity into your model.

2019-04-22_15-16-13.png

The Distribution Analysis Tool allows you to fit your input data to different statistical distributions and compare Goodness-of-Fit to each distribution. This is pretty awesome for modeling because it enables you to determine which distribution best represents your data, and may help guide your predictive model selection. It should only be used on continuous variables.

This tool allows you to compare your data to the Normal, Lognormal, Weibull and Gamma Distributions. It is important to note that the Lognormal, Weibull and Gamma Distributions can only be fit to data with no negative values.

The output of this tool is a series of report snippets, which include a histogram, summary statistics for the test results, goodness-of-fit statistics, data quantiles per distribution and the distribution parameters.
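The idea of comparing goodness of fit across distributions can be sketched in a few lines of Python. This is a simplified illustration under several assumptions: it covers only the Normal and Lognormal cases (Weibull and Gamma need iterative fitting), it estimates parameters from the sample mean and standard deviation, and it uses the Kolmogorov-Smirnov distance as one common goodness-of-fit statistic, which is not necessarily what the tool reports. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def ks_statistic(sorted_data, cdf):
    """Largest gap between the empirical CDF and a fitted CDF."""
    n = len(sorted_data)
    d = 0.0
    for i, x in enumerate(sorted_data):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def compare_normal_lognormal(data):
    """Fit both distributions and return the KS distance for each."""
    xs = sorted(data)
    results = {}
    normal = NormalDist(fmean(xs), stdev(xs))
    results["normal"] = ks_statistic(xs, normal.cdf)
    # The log transform needs strictly positive values, so the
    # lognormal fit is skipped when any value is zero or negative.
    if xs[0] > 0:
        logs = [math.log(x) for x in xs]
        log_normal = NormalDist(fmean(logs), stdev(logs))
        results["lognormal"] = ks_statistic(
            xs, lambda x: log_normal.cdf(math.log(x))
        )
    return results


if __name__ == "__main__":
    random.seed(0)
    skewed = [random.lognormvariate(0, 1) for _ in range(500)]
    print(compare_normal_lognormal(skewed))
```

A smaller distance means the fitted curve tracks the data more closely, so strongly right-skewed data like the sample above should score better under the Lognormal fit than the Normal one.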

If your target data (what you are trying to predict) is a continuous variable, I would strongly recommend using this tool to help guide what type of predictive model to use. For example, if your target data fit a Normal Distribution best, a Linear Regression would be an appropriate model; data with a Gamma Distribution would be better represented with a Gamma Regression, and so on.

Getting to know your data better will only improve the quality of your models. Only you know your data, your use case, and how your results will be applied, but the Data Investigation tools are here to help get you as informed as possible on what you’re working with.

Comments
Kenda
16 - Nebula

Hey @SydneyF thanks for this! I found the part about the types of fields that the Frequency Table tool will not accept particularly helpful. Is there any way to update the help doc with all of these field types? Currently it only lists FixedDecimal, Date, Time, DateTime, Blob, and SpatialObj (shown below). I was having the hardest time trying to figure out why some of my fields weren't coming through until I found this article that said Doubles also can't be used. I typically go to the help page first, so I think it would make sense to list all of the unacceptable field types there as well. Thanks!

 

Capture.PNG

SydneyF
Alteryx Alumni (Retired)

Hi @Kenda,

 

Thank you for the feedback! I have passed it on to our tech writing team. 

 

Sydney

Newt
8 - Asteroid

Hi, @SydneyF  -  I am experimenting with the Contingency Table and feeding it the NYC_Collisions_2015_Queens.yxdb file to examine

I have selected "Number of persons injured" and the R output report confirms that it is numeric.  However, the Record 2 sort by number of persons injured goes 0, 1, 11, 12, 2, 22, etc. like a standard sort of a string.  If the tool converts the bytes to strings, could the tool at least do a dictionary sort?  Thanks.

DawnDuong
13 - Pulsar

I just found this 4-part series, very good introduction on data investigation. Thank you!