I'm looking for the appropriate Alteryx tool to deal with non-numeric independent variables. I'm having a problem figuring out how my dependent variable changes as certain non-numeric independent variables change. These non-numeric variables are categorical, e.g., male / female. The solution that I found (see attached workflow) works well with variables that only have two or three classifications. The workflow is basically taking each variable's classification and making a dummy variable out of it so that it equals 1 if the record meets the criterion and 0 if it doesn't. For instance, if I want to determine the effect that education level has, and that variable has four classifications - 1) no HS diploma, 2) HS grad, 3) some college, and 4) college grad, I'd wind up with four additional independent variables (all dummies) and each record would have a 1 in only one of those four columns based on their highest education level attained.
The problem I'm running into is when the categories extend beyond just a few classifications. Here's an example - I want to determine if certain supervisors (of which there are hundreds in our organization) are more likely to have employees quit within the first year of employment. If I were to use this same workflow, I would have hundreds of additional dummy variables - one for each supervisor. It's not a numeric variable, so it doesn't work with my Logistic Regression tool. How are variables like this worked into logistic or multiple regression analyses?
Solved! Go to Solution.
As opposed to creating a dummy binary variable for each of the possible values of each question, what if you create a mulit-valued variable for each question, i.e. male/female get 1-2, education gets 1-4, supervisor gets 1-99, etc. You then run your regression on these 3 variables
Edit: I have indeed been shown the error of my ways by @BenMoss. I'll leave the original, incorrect response there so that it may serve as a reminder to others not to stray from the path of light(i.e. don't forget basic statistics.)
Dan
Thanks you guys. Yes I thought that through pretty exhaustively. By numbering the supervisors, I'd wind up with a numeric value that didn't bear any relation to a scale, e.g., the supervisor that I numbered 20 would not twice the "amount" or "magnitude" of the one I numbered 10. It seems my dummy variables will only work when I want to same something either IS or IS NOT. I would only be comfortable using a few dummies for each category. It doesn't just need to be a two class category. For instance, if I wanted to determine whether a dependent variable is driven by the region of the country, I could make four dummy variables - one for each region - Northeast, Southeast, Midwest, and West. So each record would only have the "switch flipped" from 0 to 1 for only one of those four dummies. You can see how messy that would get with hundreds of supervisors. I would wind up with hundreds of dummy variables - one for each supervisor. I don't think that's the way to approach this problem. Is there another tool that may work?
Could you break this down into multiple parts? Start by breaking your data set down into groups by supervisor. You can perform basic statistics here without having to perform a regression. Look at retention rate/supervisor: Are there any outliers? If the retention rate is pretty much the same for all supervisors, then you can focus on other variables. If the retention rates are wildly different then you can dig deeper. Perform your regression analysis on each group and look at the results. From here you can start controlling for the other variables within each supervisor. Of course, this is dependent on having a data set that's large enough to have a significant population by supervisor
You can also use properties of the supervisors, sex, age, department, etc. to reduce the number of dummy variables in play
Hope this helps
Dan
These are all great ideas. Thanks Dan!
I'm not sure I understand where the problem is. You can use categorical variables in the regression tools without having to pre-encode your data:
Check out the attached workflow with dummy data and examples of how to use the regression tools. In the middle output you'll find the following report:
The coefficients table shows that M has a negative effect compared to F. Same for MS/PHD compared to HS, and opposite for AGE.
Note, this is dummy data and none of these variables appear significant.
This is helpful David. I didn't know the tools would automatically take care of the dummy variable creation for you, so that's a relief. The only remaining problem is determining whether a dependent variable is driven by a categorical independent variable with hundreds of classifications. For instance, I want to use the logistic regression tool to determine if someone is likely to quit within the first six months of employment. One of the independent variables that I'm interested in is those employees' supervisors. Looking at all employee data from those that HAVE and HAVE NOT quit in the first six months, I'd like to determine if there are particular supervisors that are more likely to drive employees away. That could be unwieldy because there are hundreds of supervisors, which would yield hundreds -1 dummy variables.
You may have to dig deeper into using R code to extend the functionality of the regression tools to get what you want. For example, here's an Alteryx-provided macro for grabbing the coefficients from one of the regression tools (connect it to the "O" output):
https://gallery.alteryx.com/#!app/Model-Coefficients/56bcd989a248970ce0ba08be
If you open that custom macro you'll see it's just R code. You could modify that code to also return the p-values for each variable so that you can filter out variables (i.e. supervisors) that are "significant" i.e. p-value <= 0.05. Or if you wanted to see the overall effect of all supervisors, you can use the anova() function on the lm object which will roll up all different individual levels of a categorical factor.
... Or if you want to just see the effect of supervisors specifically, you can do a frequency analysis (use the Summarize tool) to see rate of quitting by supervisor, and/or do a simple correlation.