I have a binary categorical target variable. The potential predictive variables include both categorical and continuous variables. I am conducting preliminary data investigation to deduce which of the predictive variables would be suitable for use in the modeling.
I have the following questions regarding the preliminary analysis:
Do forgive me if I have butchered some terminology, as I am not a statistician by trade.
@joona_rauhamaki - these are some great questions; I hope that my response will 'bump' the post to get more eyes on it from our predictive gurus. If you haven't already, check out the great 4-part knowledge base series on all things related to data investigation - 'Pre-Predictive: Using the Data Investigation Tools'.
Hi @SophiaF,
unfortunately no answers thus far. I re-read the guides you also linked, but they did not really answer my questions.
Perhaps one of the authors, @CristonS or SydneyF, would have time to look into the questions? Or perhaps even @DrDan could find them interesting.
edit: it seems that I do not yet possess the ability to @-tag other people, hopefully someone sees this regardless
Hi @joona_rauhamaki great questions! And I really appreciate your interest and dedication to data investigation :)
Both the Pearson Correlation tool (with correlation and covariance) and the Spearman Correlation tool produce data tables that you can use downstream. If two predictors are highly correlated with each other, you could remove the one with the weaker relationship to the target (for example, the one with the higher p-value), but please be sure to investigate the variables (and combinations of variables) further before making a decision. One variable may represent something that means a lot more in your organization or industry, and only you would know that.
Modeling is a very iterative process because you'll want to test combinations and transforms before settling on variable sets. Try the Nested Test tool to compare variable subsets. For linear, logistic, and other traditional regression models, the Stepwise tool can help determine the "best" predictor variables to include in a model out of a larger set of potential predictor variables.
There is not necessarily an order of importance when it comes to these processes. Try different correlation and covariance methodologies, plus importance weights, and evaluate those results based on your data and your use cases. Try the available data investigation tools to become familiar with the results, and with the effect that tweaking the parameters has on the metrics.
The Importance Weights tool only looks at the strength of a possible predictor on the target in isolation, ignoring possible interaction effects and correlation between predictors. Knowing this, you'll assess these results differently than those of the correlation tools. For example, if a discrete predictor is significant, but only slightly stronger when used in combination with one or more other predictors, you'll need to weigh the efficacy of both options before you decide.
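To make the "in isolation" caveat concrete, here is a small Python sketch with made-up data. It does not reproduce anything the Importance Weights tool calculates; the column names `a`, `b`, and `target` and the helper `target_rate_by` are hypothetical. The target is built so that neither predictor says anything on its own, yet the two together determine it completely.

```python
from collections import defaultdict


def target_rate_by(rows, keys):
    """Mean of the binary target within each combination of the given keys."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of target, row count]
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group][0] += row["target"]
        totals[group][1] += 1
    return {group: s / n for group, (s, n) in totals.items()}


# Made-up data where the target is 1 only when exactly one flag is set (XOR).
rows = [
    {"a": a, "b": b, "target": int(a != b)}
    for a in (0, 1)
    for b in (0, 1)
    for _ in range(5)
]

rate_a = target_rate_by(rows, ["a"])        # each level of 'a' alone
rate_b = target_rate_by(rows, ["b"])        # each level of 'b' alone
rate_ab = target_rate_by(rows, ["a", "b"])  # the two predictors together

print(rate_a)
print(rate_b)
print(rate_ab)
```

Looked at one at a time, both predictors show the same target rate at every level, so a purely univariate measure would rank them as uninformative. Only the combined view shows the pattern, which is why it is worth testing combinations before discarding a variable.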
The Help pages have links to descriptions of the statistics used in these tools. For example, you mentioned Cramér's V - this is based on Pearson's chi-squared. It is used to determine the strength of associations after chi-squared has determined significance.
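For reference, Cramér's V is the square root of chi-squared divided by n times (the smaller of the row and column counts, minus one). Below is a minimal Python sketch of that calculation on made-up data. The function name `cramers_v` and the example columns are my own for illustration, it applies no bias or continuity correction, and it assumes each variable has at least two levels; the Alteryx tools may differ in those details.

```python
import math
from collections import Counter


def cramers_v(x, y):
    """Cramér's V for two equal-length sequences of category labels."""
    n = len(x)
    row_totals = Counter(x)
    col_totals = Counter(y)
    observed = Counter(zip(x, y))

    # Pearson's chi-squared: sum over cells of (observed - expected)^2 / expected.
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected

    # Scale by n and by the smaller table dimension minus one (0 to 1 result).
    # Assumes both variables have at least two levels, so k is not zero.
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


# Made-up data: a binary target against a three-level categorical predictor.
target = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
region = ["north", "north", "south", "south", "north", "south", "east", "east"]
v = cramers_v(region, target)
print(round(v, 3))
```

Values near 0 suggest little association and values near 1 a strong one. Because V is always between 0 and 1, it is easier to compare across tables of different sizes than the raw chi-squared statistic.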
The Community has abundant resources to help you learn and figure out the process. Our Live Training site has offerings in Data Investigation and Predictive Analytics. If you feel there’s something you need that isn’t included in our tools, we have training on creating your own R-based macros. If you want a more comprehensive course that walks you step by step through real-world examples, our Udacity Nanodegree includes data investigation specific to each modelling process (regression, time series, etc.).
Part 3 of the stellar series Pre-Predictive: Using the Data Investigation tools covers correlation and covariance. The Tool Mastery Article for Association Analysis includes additional resources for learning how to interpret these results.
I hope this helps get you started on your awesome data investigation path. There is a lot to learn and it can be intimidating. You already have the investigative mindset, and that is the hardest part 😊
Happy Alteryx-ing!