Association Analysis vs. Importance Weights vs. Contingency Table

I have a binary categorical target variable. The potential predictive variables include both categorical and continuous variables. I am conducting preliminary data investigation to deduce which of the predictive variables would be suitable for use in the modeling.

I have the following questions regarding the preliminary analysis:

Analyzing continuous predictive variables
1. Using the Association Analysis tool is fairly obvious, and multicollinearity can be analyzed by looking at the reports. However, if I would like to immediately produce a table of the best predictors and/or remove the ones that are too strongly correlated with each other, is there any way to do this other than editing the underlying R code? Additionally, just to be sure, for two intercorrelated predictor variables the rule of thumb is to remove the one with the lower p-value, right?
2. When using the Importance Weights tool, however, there are no correlation measures to select, only the entropy-based measures and the Relief method. If the target were also continuous, the tool would offer the option to calculate the Pearson correlation. Should I interpret this as correlation measures not being suitable for analyzing relationships with a binary target variable, or just as a decision by the macro's developer not to overlap with the Association Analysis tool? If the latter, what order of importance should I follow when conducting the analysis?
Analyzing categorical predictors
1. I should apparently use the Contingency Table tool’s chi-squared statistic to analyze each predictive variable's relationship with the target variable. Should I also analyze the multicollinearity between the predictors, and if so, how? Is there no tool similar to Association Analysis that would also compute the correlations among the predictors?
2. Using the importance weights tool, does the Cramer’s V take the possible multicollinearity into account already?
Do you know of any resources that could enlighten me on the order in which I should assess the results of the analyses? For example: “If the gain ratio is x, then it is likely a good predictor. If the Pearson correlation is below y, don’t take it into account, especially if the p-values are below z”.

Do forgive me if I have butchered some terminology, as I am not a statistician by trade.

Macros

Predictive Analysis

Accepted answers

CristonS

Hi @joona_rauhamaki, great questions! And I really appreciate your interest and dedication to data investigation.

Both the Pearson Correlation tool (with correlation and covariance) and the Spearman Correlation tool produce data tables that you can use downstream. If two variables are highly correlated, the usual rule of thumb is to keep the one with the lower p-value against the target (the stronger association) and remove the other, not the reverse. Even so, please be sure to investigate the variables (and combinations of variables) further before making a decision. One variable may represent something that means a lot more in your organization or industry, and only you would know that.
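For anyone who wants to see the pruning logic outside of Alteryx, here is a minimal Python sketch. It is not what the Association Analysis or Pearson Correlation tools run internally; the function names, the toy data, and the 0.9 cutoff are all assumptions made up for illustration. With a 0/1 target, the Pearson coefficient computed here is the point-biserial correlation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def prune_correlated(predictors, target, threshold=0.9):
    """Keep at most one predictor from each highly correlated pair.

    Predictors are visited from strongest to weakest absolute correlation
    with the target, so the weaker member of a correlated pair is dropped.
    The 0.9 threshold is an arbitrary illustrative default.
    """
    strength = {name: abs(pearson(col, target)) for name, col in predictors.items()}
    ranked = sorted(predictors, key=strength.get, reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(predictors[name], predictors[k])) < threshold for k in kept):
            kept.append(name)
    return kept, strength

# Hypothetical toy data: 'income' and 'income_copy' are nearly collinear.
data = {
    "income": [10, 20, 30, 40, 50, 60, 70, 80],
    "income_copy": [11, 19, 31, 41, 49, 61, 69, 81],
    "age": [25, 60, 33, 51, 42, 29, 65, 38],
}
target = [0, 0, 0, 0, 1, 1, 1, 1]

kept, strength = prune_correlated(data, target)
print(kept)
```

On this toy data the sketch keeps `income` and `age` and drops `income_copy`. A purely mechanical rule like this is only a starting point; the domain judgment mentioned above still applies.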

Modeling is a very iterative process because you'll want to test combinations and transforms before settling on variable sets. Try the Nested Test tool to compare variable subsets. For linear, logistic, and other traditional regression models, the Stepwise tool can help determine the "best" predictor variables to include in a model out of a larger set of potential predictor variables.

There is not necessarily an order of importance when it comes to these processes. Try different correlation and covariance methodologies, plus importance weights, and evaluate those results based on your data and your use cases. Try the available data investigation tools to become familiar with the results and with the effect that tweaking the parameters has on the metrics.

The Importance Weights tool only looks at the strength of a possible predictor on the target in isolation, ignoring possible interaction effects and correlation between predictors. Knowing this, you'll assess these results differently than those of the correlation tools. For example, if a discrete predictor is significant, but only slightly stronger when used in combination with one or more other predictors, you'll need to weigh the efficacy of both options before you decide.
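To make the "in isolation" point concrete, here is a small Python sketch of information gain, one of the entropy-based measures. It is a textbook calculation, not the tool's actual code, and the feature names and values are hypothetical. Each predictor is scored against the target alone, so a predictor and a relabeled duplicate of it receive exactly the same score, and nothing in the score reveals that they are redundant.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, target):
    """Reduction in target entropy after splitting on one categorical feature."""
    n = len(target)
    groups = {}
    for f, t in zip(feature, target):
        groups.setdefault(f, []).append(t)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(target) - conditional

# Hypothetical data: 'region_dup' is 'region' with different labels.
target = [0, 0, 0, 0, 1, 1, 1, 1]
region = ["a", "a", "a", "b", "b", "b", "b", "a"]
region_dup = ["x" if r == "a" else "y" for r in region]

gain_region = information_gain(region, target)
gain_dup = information_gain(region_dup, target)
print(gain_region, gain_dup)
```

Both gains come out identical, which is why importance weights need to be read alongside a check of the relationships among the predictors themselves.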

The Help pages have links to descriptions of the statistics used in these tools. For example, you mentioned Cramér's V - this is based on Pearson's chi-squared. It is used to determine the strength of associations after chi-squared has determined significance.
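As a rough illustration of how Cramér's V is derived from chi-squared, here is a short Python sketch using the standard formula V = sqrt(chi² / (n · (min(rows, columns) − 1))). The counts are made up, and this is not the Contingency Table tool's implementation.

```python
from math import sqrt

def chi_squared(table):
    """Pearson chi-squared statistic and total count for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(table):
    """Cramér's V: chi-squared rescaled to the range 0..1."""
    chi2, n = chi_squared(table)
    k = min(len(table), len(table[0]))
    return sqrt(chi2 / (n * (k - 1)))

# Hypothetical counts: rows = predictor categories, columns = target classes 0 and 1.
table = [
    [30, 10],
    [10, 30],
]
v = cramers_v(table)
print(v)  # 0.5 for this table
```

Note that V is computed from one predictor-by-target table at a time, so it says nothing about how that predictor relates to the other predictors.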

The Community has abundant resources to help you learn and figure out the process. Our Live Training site has offerings in Data Investigation and Predictive Analytics. If you feel there’s something you need that isn’t included in our tools, we have training on creating your own R-based macros. If you want a more comprehensive course that walks you step by step through real-world examples, our Udacity Nanodegree includes data investigation specific to each modelling process (regression, time series, etc.).

Part 3 of the stellar series Pre-Predictive: Using the Data Investigation tools covers correlation and covariance. The Tool Mastery Article for Association Analysis includes additional resources for learning how to interpret these results.

I hope this helps get you started on your awesome data investigation path. There is a lot to learn and it can be intimidating. You already have the investigative mindset, and that is the hardest part 😊

Happy Alteryx-ing!

All comments

SophiaF

@joona_rauhamaki - these are some great questions; I hope that my response will 'bump' the post to get more eyes on it from our predictive gurus. If you haven't already, check out the great 4-part knowledge base series on all things related to data investigation - 'Pre-Predictive Using the Data Investigation Tools'.

joona_rauhamaki

Hi @SophiaF,

unfortunately no answers thus far. I re-read the guides you also linked, but they did not really answer my questions.

Perhaps one of the authors, @CristonS or SydneyF, would have time to look into the questions? Or perhaps even @DrDan could find this interesting.

edit: it seems that I do not yet possess the ability to @-tag other people, hopefully someone sees this regardless
