Logistic Regression - Categorical vs. Numeric Independent Variables

Question

This may be more of a statistical question than an Alteryx question. I'm having trouble figuring out how my dependent variable changes as certain non-numeric independent variables change. These non-numeric variables are categorical, e.g., male / female. The solution I found (see attached workflow) works well for variables that have only two or three classifications. The workflow takes each classification of a variable and turns it into a dummy variable that equals 1 if the record meets the criterion and 0 if it doesn't. For instance, suppose I want to determine the effect of education level, and that variable has four classifications: 1) no HS diploma, 2) HS grad, 3) some college, and 4) college grad. I'd wind up with four additional independent variables (all dummies), and each record would have a 1 in exactly one of those four columns, based on the highest education level attained.
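As an illustration of the dummy coding described above, here is a minimal Python sketch (not the attached Alteryx workflow). The function name `dummy_code`, the `edu_` column prefix, and the sample records are made up for this example. The `drop_first` option is an addition not mentioned in the question: a regression that includes an intercept usually leaves one level out as the reference category, because a full set of dummies is perfectly collinear with the intercept.

```python
# Education levels taken from the example in the question.
LEVELS = ["no HS diploma", "HS grad", "some college", "college grad"]


def dummy_code(value, levels, drop_first=False):
    """Return 0/1 indicator columns for a single categorical value.

    With drop_first=True the first level is omitted and acts as the
    reference category (all remaining dummies are 0 for that level).
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    kept = levels[1:] if drop_first else levels
    return {f"edu_{lvl}": int(value == lvl) for lvl in kept}


# Hypothetical records, one education level per person.
records = ["HS grad", "college grad", "no HS diploma"]

# Four dummies per record, exactly one of them equal to 1.
full = [dummy_code(r, LEVELS) for r in records]

# Three dummies per record; "no HS diploma" becomes the all-zero reference.
reduced = [dummy_code(r, LEVELS, drop_first=True) for r in records]

for row in full:
    print(row)
```

With hundreds of supervisors the same function would produce hundreds of columns, which is the scaling problem raised in the next paragraph.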

The problem I'm running into is when the categories extend beyond just a few classifications.  Here's an example - I want to determine if certain supervisors (of which there are hundreds in our organization) are more likely to have employees quit within the first year of employment.  If I were to use this same workflow, I would have hundreds of additional dummy variables - one for each supervisor.  It's not a numeric variable, so it doesn't work with my Logistic Regression tool.  How are variables like this worked into logistic or multiple regression analyses?

Creating Dummy variables for regression analysis.yxmd

JohnJPS · Accepted Answer

I think the analysis you describe would only be useful if you want to go beneath the surface and look for reason codes, e.g. to answer why people under a given supervisor quit. If you just want a probability, simple aggregation gives you that: compare the share of employees who quit overall with the share who quit under a given supervisor. That's very simple, and it is a directly observed rate rather than the model-based estimate a GLM would give you.
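The simple aggregation could look roughly like the sketch below. The records, supervisor labels, and the helper name `quit_rates` are invented for illustration; in practice the rows would come from your own employee data.

```python
from collections import defaultdict

# Made-up records: (supervisor, left_within_first_year)
employees = [
    ("sup_A", True), ("sup_A", True), ("sup_A", False),
    ("sup_B", False), ("sup_B", False), ("sup_B", True), ("sup_B", False),
    ("sup_C", False), ("sup_C", False),
]


def quit_rates(rows):
    """Return the first-year quit rate per supervisor."""
    counts = defaultdict(lambda: [0, 0])  # supervisor -> [quits, headcount]
    for supervisor, left in rows:
        counts[supervisor][0] += int(left)
        counts[supervisor][1] += 1
    return {sup: quits / n for sup, (quits, n) in counts.items()}


# Share of all employees who quit in their first year.
overall_rate = sum(left for _, left in employees) / len(employees)

# Share who quit under each supervisor, for comparison with the overall rate.
rates = quit_rates(employees)

for sup, rate in sorted(rates.items()):
    print(f"{sup}: {rate:.2f} (overall {overall_rate:.2f})")
```

Rates based on only a handful of employees are noisy, so it is worth looking at the headcount next to each rate before drawing conclusions.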

For machine learning / modeling to be useful in answering "why do people quit under certain supervisors," you might not be able to limit it to just one supervisor. You could exclude the supervisor variable and look for other reasons in an analysis that takes other factors into account. Then grab the "ten worst" supervisors from the simple aggregation described above and examine how they overlap with the results of your GLM. That may help you pinpoint, for a particular supervisor, what they could improve.
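Picking the "ten worst" could be sketched as follows. The function name `worst_supervisors`, the sample counts, and the `min_headcount` filter are assumptions added for this example; the filter is there only so that a supervisor with one or two employees does not top the list by chance.

```python
def worst_supervisors(stats, n=10, min_headcount=1):
    """Return up to n supervisors with the highest first-year quit rate.

    stats maps supervisor -> (quits, headcount). Supervisors with fewer
    than min_headcount employees are skipped. Ties are broken by name.
    """
    eligible = {
        sup: quits / headcount
        for sup, (quits, headcount) in stats.items()
        if headcount >= min_headcount
    }
    return sorted(eligible, key=lambda sup: (-eligible[sup], sup))[:n]


# Made-up counts: supervisor -> (quits, headcount)
stats = {
    "sup_A": (2, 3),
    "sup_B": (1, 4),
    "sup_C": (0, 2),
    "sup_D": (1, 1),
}

# sup_D is excluded here because it has only one employee.
worst_two = worst_supervisors(stats, n=2, min_headcount=2)
print(worst_two)
```

The resulting short list can then be compared against whatever factors the GLM (fitted without the supervisor variable) flags as important.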

Hope that helps!

John