Welcome to Part 2 of the Pre-Predictive series! After a strong start and a long hiatus, we are resuming our tour of the Data Investigation Tools.
Data Investigation is deeply underrated. Whenever you pick up a new data set for analysis or modeling, data investigation should be your first step. It is impossible to draw meaningful conclusions without an understanding of the data you are working with.
These aqua-colored tools are provided to you with that exact purpose in mind. If the Data Investigation Toolbox isn’t already your best friend, it should be, and I am thrilled that I have the opportunity to introduce the two of you 😊
For each field selected in the tool set-up, the Frequency Table Tool creates a summary of the data with frequency counts and percentages for each value in each selected field. It can be used on quantitative or qualitative variables, but will not accept FixedDecimal, Float, Double, Date/Time, Blob or SpatialObj fields. This tool gives you the ability to visualize how the values within each of your fields are distributed. A Frequency Table is a snapshot of your data, giving you the chance to identify patterns early in your investigation.
The Frequency Table Tool has three outputs. The D output is a data table that includes each of your field values, labeled by name, and each value's frequency, percentage within its field, cumulative frequency, and cumulative percentage (cumulative values are running totals within a given field). The R output is a report that includes all the information in the D output, divided into tables by field name. The interactive output (I) allows you to interactively filter which fields and metrics you are viewing and create plots.
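If you want to see how the D-output metrics come together, here is a minimal sketch in Python (standard library only; the field name and values are made up for illustration):

```python
from collections import Counter

def frequency_table(field_name, values):
    """Mimic the Frequency Table Tool's D output for one field:
    frequency, percent, cumulative frequency, cumulative percent."""
    counts = Counter(values)
    total = len(values)
    rows, cum = [], 0
    for value, freq in sorted(counts.items()):
        cum += freq  # running total within this field
        rows.append({
            "field": field_name,
            "value": value,
            "frequency": freq,
            "percent": 100 * freq / total,
            "cum_frequency": cum,
            "cum_percent": 100 * cum / total,
        })
    return rows

# Hypothetical categorical field
for row in frequency_table("Region", ["East", "West", "East", "North", "East"]):
    print(row)
```

The final row's cumulative percentage always lands on 100, which is a handy sanity check on your counts.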
Look out for fields that don’t have many unique values. The report will warn you that you might have a categorical field set as numeric if it detects a numeric field with fewer than ten unique values. If the tool has correctly identified a categorical field masquerading as a numeric field, consider changing the field data type to string using the Select Tool, so that it is treated correctly as a categorical variable by the predictive tools later in your workflow. In general, look out for fields that have distinctive patterns or distributions. This can help guide how you treat and use your data.
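The low-cardinality check behind that warning is easy to approximate yourself. A hedged sketch, assuming your fields live in a plain dictionary (all names and values here are hypothetical):

```python
def flag_possible_categoricals(fields, max_unique=10):
    """Flag numeric fields with few unique values, mirroring the
    report's warning threshold of fewer than ten unique values."""
    flagged = []
    for name, values in fields.items():
        numeric = [v for v in values if isinstance(v, (int, float))]
        if numeric and len(set(values)) < max_unique:
            flagged.append(name)
    return flagged

# Hypothetical fields: StoreID is numeric but behaves like a category
fields = {
    "StoreID": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    "Revenue": [103.2, 98.7, 110.5, 120.1, 95.3, 99.9,
                101.4, 115.8, 87.6, 104.0, 92.2, 108.9],
}
print(flag_possible_categoricals(fields))
```

A field like StoreID would then be a candidate for conversion to string with the Select Tool.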
A Contingency Table (also called a two-way table) is a special type of frequency distribution table that displays the frequency distribution of two or more categorical variables, as well as how they relate to one another. The Contingency Table Tool provides a basic snapshot of the interrelation between two or more variables, and can help you identify interactions between them.
In the set-up, you have the option to include a chi-squared statistic. If you elect to include it, you can compare two variables; if not, you can select up to four variables to compare. The Chi-Square test determines whether a significant relationship exists between the two variables being compared. A low p-value suggests a statistically significant relationship.
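For a 2×2 table (one degree of freedom), the chi-squared statistic can even be computed by hand. A sketch with made-up counts; for df = 1 the p-value has a closed form via the complementary error function:

```python
import math

def chi_square_2x2(table):
    """Chi-squared test of independence for a 2x2 contingency table.
    Returns (chi2, df, p); the p formula is valid only for df = 1."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    p = math.erfc(math.sqrt(chi2 / 2))  # closed form for df = 1 only
    return chi2, df, p

# Hypothetical counts: rows = purchased?, columns = saw promotion?
chi2, df, p = chi_square_2x2([[30, 10], [20, 40]])
print(chi2, df, p)
```

With these invented counts the p-value comes out far below 0.05, so we would call the relationship significant.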
Like the Frequency Table Tool, the Contingency Table Tool creates three outputs. The data (D) output is a data table containing the frequency counts and percentages for each individual field and combination of fields. The report (R) output is a formatted report of the data output, and is where the Chi-squared value, degrees of freedom (df), and p-value will be reported if that option was selected in your set-up. The interactive (I) output allows you to interactively filter which fields and metrics you are viewing, as well as how the data is displayed.
You can modify the interactive view to compare two variables at a time and highlight strong relationships between the variables being compared.
It is worthwhile to check for these relationships. Relationships between the predictor variables and the target variable are a good thing, as it suggests predictive power. Strong relationships between two predictor variables should be investigated for the potential of introducing multicollinearity into your model.
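For numeric predictors, one quick way to screen for multicollinearity candidates is a pairwise Pearson correlation. A standard-library sketch (the predictor names, values, and the 0.9 threshold are all invented for illustration):

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient, standard library only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def flag_collinear(predictors, threshold=0.9):
    """Return pairs of numeric predictors whose |r| exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(predictors, 2)
        if abs(pearson(predictors[a], predictors[b])) > threshold
    ]

# Hypothetical predictors: height_in is just height_cm rescaled
predictors = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [30.0, 25.0, 40.0, 35.0, 28.0],
}
print(flag_collinear(predictors))
```

A flagged pair like this is exactly the kind of redundancy you would want to resolve (for example, by dropping one of the two fields) before model building.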
The Distribution Analysis Tool allows you to fit your input data to different statistical distributions and compare Goodness-of-Fit across them. This is pretty awesome for modeling because it enables you to determine which distribution best represents your data, and may help guide your predictive model selection. It should only be used on continuous variables.
This tool allows you to compare your data to the Normal, Lognormal, Weibull and Gamma Distributions. It is important to note that the Lognormal, Weibull and Gamma Distributions can only be compared to non-negative data.
The output of this tool is a series of report snippets, which include a histogram, summary statistics for the test results, goodness-of-fit statistics, data quantiles per distribution, and the distribution parameters.
If your target data (what you are trying to predict) is a continuous variable, I would strongly recommend using this tool to help guide what type of predictive model to use. For example, if your target data fits a Normal Distribution best, a Linear Regression would be an appropriate model; data with a Gamma Distribution would be better represented by a Gamma Regression; and so on.
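As a rough illustration of how this kind of goodness-of-fit comparison works, the sketch below compares Normal and Lognormal fits by maximum log-likelihood using only the standard library. This mirrors the idea, not the tool's exact statistics, and the right-skewed sample is made up:

```python
import math
import statistics

def normal_loglik(data):
    """Log-likelihood of the data under a Normal fit (MLE mean/stdev)."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # MLE uses the population stdev
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
        for x in data
    )

def lognormal_loglik(data):
    """Log-likelihood under a Lognormal fit: a Normal fit on log(x),
    plus the change-of-variables term -sum(log x). Positive data only."""
    logs = [math.log(x) for x in data]
    return normal_loglik(logs) - sum(logs)

# Hypothetical right-skewed sample; the Lognormal should fit better here
sample = [0.5, 0.8, 1.1, 1.3, 2.0, 2.4, 3.9, 7.5, 12.0, 25.0]
best = "Lognormal" if lognormal_loglik(sample) > normal_loglik(sample) else "Normal"
print(best)
```

The same comparison extends to Weibull or Gamma fits; note that, as with the tool itself, the Lognormal fit requires non-negative (here strictly positive) data.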
Getting to know your data better will only improve the quality of your models. Only you know your data, your use case, and how your results will be applied, but the Data Investigation tools are here to help get you as informed as possible on what you’re working with.