Welcome to Part 2 of the Pre-Predictive series! After a strong start and a long hiatus, we are resuming our tour of the Data Investigation Tools.
Data Investigation is deeply underrated. Every time you use a new data set for analysis or modeling, data investigation should be your first step. It is impossible to draw meaningful conclusions without an understanding of the data you are working with.
These aqua-colored tools are provided to you with that exact purpose in mind. If the Data Investigation Toolbox isn’t already your best friend, it should be, and I am thrilled that I have the opportunity to introduce the two of you.
This section of the Pre-Predictive series includes the Frequency Table, Contingency Table, and Distribution Analysis Tools. Part 1 covered the Field Summary Tool, and Sections 3 and 4 will cover Association Analysis and Correlations, and the Plotting Tools, respectively.
For each field selected in the tool set-up, the Frequency Table Tool creates a summary of the data with frequency counts and percentages for each value in each selected field. It can be used on quantitative or qualitative variables, but will not accept FixedDecimal, Float, Double, Date/Time, Blob or SpatialObj fields. This tool gives you the ability to visualize how the values within each of your fields are distributed. A Frequency Table is a snapshot of your data giving you the chance to identify any patterns in preliminary investigation.
The Frequency Table Tool has three outputs. The D output is a data table that includes each of your field values, labeled by field name, along with each value's frequency, percentage within its field, cumulative frequency and cumulative percent (the cumulative columns are running totals computed within each field). The R output is a report that includes all the information in the D output, divided into tables by field name. The interactive output (I) allows you to interactively filter which fields and metrics you are viewing and create plots.
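To make the columns of the D output concrete, here is a minimal sketch in plain Python of how frequency, percent, cumulative frequency and cumulative percent relate to one another for a single field. This is not the tool's own implementation; the function name `frequency_table`, the sample field and the choice to order rows by value are all assumptions made for illustration.

```python
from collections import Counter


def frequency_table(values):
    """Return one row per distinct value with counts and running totals."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    running = 0
    # Assumption: rows are ordered by value so the running totals are reproducible.
    for value in sorted(counts):
        freq = counts[value]
        running += freq
        rows.append({
            "value": value,
            "frequency": freq,
            "percent": 100.0 * freq / total,
            "cum_frequency": running,
            "cum_percent": 100.0 * running / total,
        })
    return rows


# Hypothetical sample field with three distinct values.
colors = ["red", "blue", "red", "green", "blue", "red", "red", "blue"]
for row in frequency_table(colors):
    print(row)
```

The last row's cumulative frequency always equals the number of records in the field, and its cumulative percent is always 100, which is a quick sanity check when reading the real output.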
Look out for fields that don’t have many unique values. The report will warn you that you might have a categorical field set as numeric if it detects a numeric field with fewer than ten unique values. If the tool has correctly identified a categorical field masquerading as a numeric field, consider changing the field data type to string using the Select Tool, so that it is treated correctly as a categorical variable by the predictive tools later in your workflow. In general, look out for fields that have distinctive patterns or distributions. This can help guide how you treat and use your data.
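The "fewer than ten unique values" check is easy to reproduce outside the tool if you want to screen a data set yourself. The sketch below is an illustration only: the function name `low_cardinality_numeric_fields`, the dictionary-of-columns layout and the sample data are hypothetical, and the tool's actual detection logic may differ.

```python
def low_cardinality_numeric_fields(table, threshold=10):
    """Return names of numeric columns with fewer than `threshold` unique values."""
    flagged = []
    for name, values in table.items():
        present = [v for v in values if v is not None]  # ignore missing values
        is_numeric = bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        if is_numeric and len(set(present)) < threshold:
            flagged.append(name)
    return flagged


# Hypothetical table stored as column name -> list of values.
table = {
    "store_id": list(range(1, 41)),        # 40 distinct numeric values
    "rating": [1, 2, 3, 4, 5] * 8,         # numeric, but only 5 distinct values
    "region": ["N", "S", "E", "W"] * 10,   # already a string field
}
print(low_cardinality_numeric_fields(table))  # ['rating']
```

A flagged field is only a candidate. Whether something like `rating` should really be treated as categorical is still a judgment call based on what the values mean.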
A Contingency Table (also called a two-way table) is a special type of frequency distribution table that displays the frequency distribution of two or more categorical variables as well as how they relate to one another. The Contingency Table Tool provides a basic snapshot of the interrelation between two or more variables, and can help you identify interactions between them.
In the set-up, you have the option to include a chi-squared statistic. If you elect to include it, you can compare two variables; if you choose not to include it, you can select up to four variables to compare. The chi-squared test assesses whether a relationship exists between the two variables being compared. A low p-value suggests a statistically significant relationship.
Like the Frequency Table Tool, the Contingency Table Tool also creates three outputs. The data (D) output is a data table containing the frequency counts and percentages for each individual field and combination of fields. The report (R) output is a formatted report of the data output, and is where the chi-squared value, degrees of freedom (df), and p-value will be reported if that option was selected in your set-up. The interactive (I) output allows you to interactively filter which fields and metrics you are viewing, as well as how the data is displayed.
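For readers who want to see what sits behind the chi-squared value, df and p-value, here is a rough standard-library sketch of a Pearson chi-squared test of independence on two categorical fields. It is not the tool's implementation: the function names, the sample data and the decision to apply no continuity correction are assumptions, and the p-value helper relies on the closed-form upper tail of the chi-squared distribution for integer degrees of freedom.

```python
import math
from collections import Counter


def chi2_survival(x, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        terms = sum(y ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-y) * terms
    terms = sum(
        y ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(y)) + math.exp(-y) * terms


def chi_squared_test(pairs):
    """Pearson chi-squared test on a list of (row_value, column_value) pairs."""
    observed = Counter(pairs)
    row_totals = Counter(r for r, _ in pairs)
    col_totals = Counter(c for _, c in pairs)
    n = len(pairs)
    stat = 0.0
    for r in row_totals:
        for c in col_totals:
            # Expected count if the two fields were independent.
            expected = row_totals[r] * col_totals[c] / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, chi2_survival(stat, df)


# Hypothetical data: did a customer respond, and which offer did they receive?
pairs = (
    [("yes", "A")] * 30 + [("no", "A")] * 10
    + [("yes", "B")] * 10 + [("no", "B")] * 30
)
stat, df, p_value = chi_squared_test(pairs)
print(stat, df, p_value)
```

In this made-up example every expected count is 20, so the statistic works out to 20 with 1 degree of freedom, and the p-value is far below 0.05, which would point to a relationship between the two fields.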
You can modify the interactive view to compare two variables at a time and highlight strong relationships between the variables being compared.
It is worthwhile to check for these relationships. Relationships between the predictor variables and the target variable are a good thing, as they suggest predictive power. Strong relationships between two predictor variables should be investigated for the potential of introducing multicollinearity into your model.
The Distribution Analysis Tool allows you to fit your input data to different statistical distributions and compare Goodness-of-Fit to each distribution. This is pretty awesome for modeling because it enables you to determine which distribution best represents your data, and may help guide your predictive model selection. It should only be used on continuous variables.
This tool allows you to compare your data to the Normal, Lognormal, Weibull and Gamma Distributions. It is important to note that the Lognormal, Weibull and Gamma Distributions can only be compared to non-negative data.
The output of this tool is a series of report snippets, which include a histogram, summary statistics for the test results, goodness of fit statistics, data quantiles per distribution and the distribution parameters.
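To give a feel for what comparing goodness of fit means, the sketch below fits a Normal and a Lognormal distribution to a sample and compares them with the Kolmogorov-Smirnov distance, where a smaller distance indicates a closer fit. This is an illustration under stated assumptions, not what the tool computes: the function names and the simulated sample are hypothetical, only two of the four distributions are covered, and the Lognormal fit is skipped unless every value is strictly positive because it takes logarithms.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of the sample and a fitted CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def compare_normal_lognormal(sample):
    """Fit each distribution by maximum likelihood and return its KS distance."""
    normal = NormalDist(fmean(sample), pstdev(sample))
    results = {"normal": ks_statistic(sample, normal.cdf)}
    if min(sample) > 0:  # the log transform needs strictly positive values
        logs = [math.log(x) for x in sample]
        log_normal = NormalDist(fmean(logs), pstdev(logs))
        results["lognormal"] = ks_statistic(
            sample, lambda x: log_normal.cdf(math.log(x))
        )
    return results


# Hypothetical right-skewed sample, simulated for illustration only.
random.seed(0)
skewed = [random.lognormvariate(0.0, 1.0) for _ in range(500)]
scores = compare_normal_lognormal(skewed)
best = min(scores, key=scores.get)
print(scores, best)
```

Because the sample here is simulated from a Lognormal distribution, the Lognormal fit should come out ahead; with real data the gap between candidates is often much smaller, so it is worth looking at the histogram as well as the statistics.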
If your target data (what you are trying to predict) is a continuous variable, I would strongly recommend using this tool to help guide what type of predictive model to use. For example, if your target data fit a Normal Distribution best, a Linear Regression would be an appropriate model, while data with a Gamma Distribution would be better represented with a Gamma Regression, and so on.
Getting to know your data better will only improve the quality of your models. Only you know your data, your use case, and how your results will be applied, but the Data Investigation tools are here to help get you as informed as possible on what you’re working with.