on 03-23-2020 10:00 AM - edited on 08-10-2021 09:52 AM by sachink
How To: Complete Data Preparation And Investigation For Predictive Modeling
Data preparation and investigation are a must for successful Predictive Modeling.
It is essential to ask these questions to ensure a data set is ready. Are there errors, improperly parsed values, duplicates, or incorrect data types? Do missing values exist? Are there outlier values? Do multiple fields measure the same thing (multicollinearity)? Are the predictor columns statistically significant with a p-value of 0.05 or less?
This article covers these necessary basic checks. Keep in mind that more in-depth research of data may be needed with the Data Investigation Tools.
Are there errors, improperly parsed values, duplicates, or incorrect data types?
If so, correct these issues first. Use the Data Cleansing Tool to remove extra white space and unwanted characters such as quotation marks and field delimiters. Be sure that field values are not combined when they should be in separate columns. If there are combined field values, the tools in the Parse Toolbar can help. The Unique Tool identifies duplicate records. Use the Select Tool to verify and change data types.
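For readers who want to see the same checks in script form, here is a minimal sketch in plain Python. It is not how the Designer tools are implemented; the field names `customer` and `units` and the sample rows are hypothetical.

```python
# Hypothetical raw records; field names and values are illustrative only.
raw_rows = [
    {"customer": '  "Acme Corp" ', "units": "12"},
    {"customer": "Acme Corp", "units": "12"},
    {"customer": "Birch LLC ", "units": "7"},
]


def clean_text(value: str) -> str:
    """Remove quotation marks and collapse leading, trailing, and repeated whitespace."""
    return " ".join(value.replace('"', "").split())


cleaned = []
seen = set()
for row in raw_rows:
    record = {
        "customer": clean_text(row["customer"]),
        "units": int(clean_text(row["units"])),  # change the data type from string to integer
    }
    key = (record["customer"], record["units"])
    if key not in seen:  # keep only the first copy of each duplicate record
        seen.add(key)
        cleaned.append(record)

print(cleaned)
```

Cleaning before de-duplicating matters here: the first two rows only become identical once the stray quotation marks and spaces are removed.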
Do missing values exist?
Be sure to replace all missing values with the Imputation Tool. If a high percentage of the values in a field are missing, consider dropping that column from the data used.
Generally recommended uses of imputation for null and blank values are:
Numeric fields - replace with the median
Categorical fields - replace with a user-defined constant
Boolean fields - replace with the mode
Depending on the use case and data set, different options may be needed.
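The three recommendations above can be sketched in plain Python as follows. This is only an illustration of the replacement rules, not the Imputation tool itself; the column names, the sample values, and the constant "Unknown" are assumptions.

```python
from statistics import median, mode


def impute(values, replacement):
    """Return a copy of values with every None replaced by the given value."""
    return [replacement if v is None else v for v in values]


def known(values):
    """Return only the non-missing entries of a column."""
    return [v for v in values if v is not None]


# Hypothetical columns with missing entries represented as None.
ages = [34, None, 29, 41, None, 38]
regions = ["West", None, "East", "West"]
is_member = [True, None, True, False]

ages_filled = impute(ages, median(known(ages)))          # numeric: median
regions_filled = impute(regions, "Unknown")              # categorical: user-defined constant
is_member_filled = impute(is_member, mode(known(is_member)))  # Boolean: mode

print(ages_filled, regions_filled, is_member_filled)
```

The median and mode are computed from the known values only, so the missing entries do not influence their own replacement.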
The Field Summary Tool R output anchor provides a report table with useful statistics for all columns selected, including the percentage of missing values. Connect a Browse Tool to the output, and results appear grouped by data type. Here are numeric and string field examples.
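As a rough scripted equivalent of the percentage-missing statistic, the sketch below counts null and blank entries per column. The table contents are hypothetical, and treating whitespace-only strings as missing is an assumption of this example.

```python
def percent_missing(values):
    """Return the percentage of entries that are None or blank strings."""
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, str) and v.strip() == "")
    )
    return 100.0 * missing / len(values)


# Hypothetical table stored as column name -> list of values.
table = {
    "age": [34, None, 29, 41],
    "region": ["West", "", "East", None],
}

report = {name: percent_missing(column) for name, column in table.items()}
print(report)
```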
Are there outliers?
Outliers are values that do not seem to fit with the rest of the data. For numeric fields, check for extremely small or large values in comparison with the rest of the data. In categorical fields, check the frequency of values. Values that rarely occur in comparison with others may be outliers.
Outliers may skew the results. It is best to determine ahead of time if records with outlier values will remain part of the data set used for assisted or predictive modeling.
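One common way to flag extreme numeric values is the interquartile range (IQR) rule. The article does not prescribe a specific rule, so the sketch below is an assumption: it flags values more than 1.5 IQRs outside the middle half of the data, and the sample amounts are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k IQRs below the first or above the third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical numeric field with one extremely large value.
amounts = [10, 12, 11, 13, 12, 11, 10, 12, 250]
print(iqr_outliers(amounts))
```

Flagging a value does not decide what to do with it. Whether the record stays in the modeling data is still a judgment call made ahead of time.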
The Field Summary Tool has an I output anchor with an interactive view of the distribution of values for a field. Each column is shown as a histogram. Details pop up when hovering over any bar in the graph.
String fields show the number of occurrences for a specific value, and number data types will have value ranges each with the number of occurrences in that range. Be sure to attach a Browse Tool for viewing.
String data type
Numeric data type
For numeric values, a Browse Tool connected directly to the data provides additional detail. Click on a column header in the Results window. The selected column appears in the Browse Tool Profile window with a scatterplot of values as well as a box and whiskers graph showing the field's value distribution in quartiles, with the bar representing the two middle quartiles.
For fields with categorical values, use the Frequency Tool to obtain detailed statistics about each value in the column. Here is an example of the Frequency Tool Report from the R output anchor with an outlier value. The Interactive I output anchor has the same information in bar graphs.
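A frequency report of this kind can be approximated in a few lines of Python. This is a sketch, not the Frequency Tool's actual output format; the state codes are hypothetical, with "C0" standing in for a mistyped value that occurs only once.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count, percent) rows sorted from most to least frequent."""
    counts = Counter(values)
    total = len(values)
    return [
        (value, n, round(100.0 * n / total, 1))
        for value, n in counts.most_common()
    ]


# Hypothetical categorical field with one rarely occurring value.
states = ["CO"] * 6 + ["TX"] * 3 + ["C0"]
table = frequency_table(states)
for row in table:
    print(row)
```

Values at the bottom of the table are the candidates to review as possible outliers or data entry errors.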
Do multiple fields measure the same thing (multicollinearity)?
Use the Association Analysis Tool to check for this issue and remove redundant fields. Be sure to include the target field you are trying to predict as that will provide more detailed results.
The R output anchor has a Full Correlation Matrix report. A correlation score near 1.0 shows that both predictor columns measure the same thing for predicting the target variable. Similarly, a score near -1.0 shows a high negative correlation (inversely proportional fields).
In this example, variable two and variable four have a correlation score of 1.0, the highest correlation score possible. Use only one of the fields and deselect the other.
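To illustrate the same check in code, the sketch below computes Pearson correlations for every pair of columns and lists the pairs that look redundant. The column values and the 0.95 cutoff are assumptions chosen for the example, and the Association Analysis tool may offer other measures as well.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def redundant_pairs(columns, threshold=0.95):
    """Return (name_a, name_b, r) for pairs whose absolute correlation meets the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Hypothetical predictor columns; variable_4 is exactly twice variable_2.
columns = {
    "variable_2": [1, 2, 3, 4, 5],
    "variable_3": [5, 1, 4, 2, 3],
    "variable_4": [2, 4, 6, 8, 10],
}
print(redundant_pairs(columns))
```

Using the absolute value means strongly negative correlations near -1.0 are reported too, since they are just as redundant for modeling.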
The interactive I output anchor shows all fields on both the X-axis and Y-axis. When a box in the grid is selected, the matching X-axis and Y-axis column names appear along with a pop-up containing the correlation score. Both positive and negative correlations are shown in a darker color as the strength of the correlation increases.
Are the predictor columns statistically significant with a p-value of 0.05 or less?
In simple terms, the p-value is the probability of seeing a correlation at least as strong as the one observed between the predictor field and the target field if there were no real relationship and the pattern were just random. A predictor field is considered statistically significant when the p-value is 0.05 or less, as there is a low chance that random variation alone would produce the observed correlation.
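One way to build intuition for a p-value is a permutation test: shuffle the target values many times and count how often a random pairing produces a correlation at least as strong as the real one. This is a swapped-in illustration, not the calculation Designer performs; the data, the number of trials, and the seed are all assumptions.

```python
import random
from math import sqrt


def pearson(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def permutation_p_value(x, y, trials=2000, seed=0):
    """Estimate how often shuffled data matches or beats the observed |correlation|."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


# Hypothetical predictor and two hypothetical targets.
predictor = list(range(1, 13))
related = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2, 21.8, 24.1]
unrelated = [6, 1, 9, 3, 11, 5, 2, 12, 4, 8, 10, 7]

p_related = permutation_p_value(predictor, related)
p_unrelated = permutation_p_value(predictor, unrelated)
print(p_related <= 0.05, p_unrelated <= 0.05)
```

The nearly linear target should come out well under the 0.05 cutoff, while the scrambled one should not.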
In the Association Analysis Tool, select the option for a target field and set it to match the column that will be predicted.
The column list is in order of significance. Check the stars to the right of the p-value column for an easy way to determine if the columns are statistically significant. Fields with a p-value of 0.001 or less receive three stars, 0.01 or less receives two stars, and 0.05 or less has a single star.
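The star thresholds described above translate directly into a small lookup. In this sketch the field names and p-values are hypothetical; only the cutoffs of 0.001, 0.01, and 0.05 come from the article.

```python
def significance_stars(p_value):
    """Map a p-value to the star rating described in the article."""
    if p_value <= 0.001:
        return "***"
    if p_value <= 0.01:
        return "**"
    if p_value <= 0.05:
        return "*"
    return ""  # not statistically significant at the 0.05 level


# Hypothetical p-values for four predictor fields.
p_values = {"income": 0.0004, "age": 0.03, "zip_code": 0.41, "tenure": 0.008}

# List the most significant fields first, as the report does.
ranked = sorted(p_values.items(), key=lambda item: item[1])
report = [(name, p, significance_stars(p)) for name, p in ranked]
for row in report:
    print(row)
```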
If you would like to research a data set further, please see the tools on the Data Investigation Toolbar and the Pre-Predictive four-part series of articles on Community listed in the Additional Resources section.