I have a binary categorical target variable. The potential predictive variables include both categorical and continuous variables. I am conducting preliminary data investigation to deduce which of the predictive variables would be suitable for use in the modeling.
I have the following questions regarding the preliminary analysis:
- Analyzing continuous predictive variables
  - Using the association analysis tool is fairly obvious, and the multicollinearity can be analyzed by looking at the reports. However, if I would like to immediately produce a table of the best predictors and/or remove the ones that are too strongly correlated with each other, is there any way to do this other than editing the underlying R code? Additionally, just to be sure, for two intercorrelated predictor variables the rule of thumb is to remove the one with the lower p-value, right?
  - When using the importance weights tool, however, there are no correlation measures to select, only the entropy-based measures and the Relief method. If the target were also continuous, the tool would offer the option to calculate the Pearson correlation. Should I interpret this as the correlation measures not being suitable for analyzing the relationships with a binary target variable, or just as a decision made by the macro's developer not to overlap with the association analysis tool? If the latter, what is the order of importance I should follow when conducting the analysis?
- Analyzing categorical predictors
  - I should apparently use the Contingency Table tool’s chi-squared statistic to analyze each predictive variable's relationship with the target variable. Should I also analyze the multicollinearity between the predictors, and if so, how? Is there no tool similar to the association analysis tool that would also compute the intercorrelations between the predictors?
  - Using the importance weights tool, does Cramér’s V already take the possible multicollinearity into account?
- Do you know of any resources that explain in which order I should assess the results of these analyses? E.g. “If the gain ratio is x, then it is likely a good predictor. If the Pearson correlation is below y, don’t take it into account, especially if the p-values are below z”.
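
To make the question about the continuous predictors more concrete, the kind of filtering I have in mind could be sketched roughly as below. This is plain Python rather than the tools' underlying R code, and the function names (`pearson`, `prune_correlated`), the toy columns `x1` to `x3`, and the 0.8 threshold are all my own assumptions for illustration, not anything taken from the tools. The sketch ranks predictors by their absolute Pearson correlation with the 0/1 target and, from each highly correlated pair, keeps the one that ranks higher. Whether that is the right rule is part of what I am asking.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def prune_correlated(predictors, target, threshold=0.8):
    """Greedy filter: rank by |correlation with target|, drop near-duplicates.

    predictors: dict mapping a column name to a list of numbers
    target:     list of 0/1 values (Pearson against a 0/1 variable is the
                point-biserial correlation)
    threshold:  hypothetical cut-off for "too correlated" predictor pairs
    """
    strength = {name: abs(pearson(values, target))
                for name, values in predictors.items()}
    ranked = sorted(predictors, key=strength.get, reverse=True)

    kept = []
    for name in ranked:
        # Keep a predictor only if it is not too correlated with any
        # stronger predictor that has already been kept.
        if all(abs(pearson(predictors[name], predictors[k])) < threshold
               for k in kept):
            kept.append(name)
    return kept, strength


# Toy data, made up purely for illustration.
target = [0, 0, 0, 0, 1, 1, 1, 1]
predictors = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, 2, 3, 5, 4, 6, 7, 8],   # nearly a copy of x1
    "x3": [5, 1, 4, 2, 3, 6, 2, 5],   # only weakly related to x1
}

kept, strength = prune_correlated(predictors, target)
print("kept:", kept)
for name in sorted(strength, key=strength.get, reverse=True):
    print(name, round(strength[name], 3))
```

The second loop prints the ranked table of predictors I mentioned. The threshold and the ranking measure are the two choices I am unsure about.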
Do forgive me if I have butchered some terminology, as I am not a statistician by trade.