
Alteryx Designer Desktop Discussions

SOLVED

Logistic Regression - Categorical vs. Numeric Independent Variables

oracleoftemple
9 - Comet

This may be more of a statistical question than an Alteryx question.  I'm having a problem figuring out how my dependent variable changes as certain non-numeric independent variables change.  These non-numeric variables are categorical, e.g., male / female.  The solution that I found (see attached workflow) works well with variables that only have two or three classifications.  The workflow is basically taking each variable's classification and making a dummy variable out of it so that it equals 1 if the record meets the criterion and 0 if it doesn't.  For instance, if I want to determine the effect that education level has, and that variable has four classifications - 1) no HS diploma, 2) HS grad, 3) some college, and 4) college grad, I'd wind up with four additional independent variables (all dummies) and each record would have a 1 in only one of those four columns based on their highest education level attained.
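For readers who want to see the dummy-variable step outside of a workflow, here is a minimal sketch in plain Python. The level names, the `edu_` column prefix, and the `one_hot` helper are all hypothetical and are not taken from the attached workflow. It also shows a `drop_first` option: when the regression includes an intercept, one level is usually left out as the reference category, since a full set of dummies always sums to 1 and is perfectly collinear with the intercept.

```python
def one_hot(values, levels, drop_first=False):
    """Return one dict of 0/1 indicators per value, one key per kept level.

    With drop_first=True the first level becomes the reference category
    and gets no column of its own (its rows are all zeros).
    """
    kept = levels[1:] if drop_first else levels
    return [{f"edu_{lvl}": int(v == lvl) for lvl in kept} for v in values]


# Hypothetical education levels, in the order listed in the post.
LEVELS = ["no_hs_diploma", "hs_grad", "some_college", "college_grad"]

# Hypothetical records: highest education level attained per employee.
records = ["hs_grad", "college_grad", "no_hs_diploma"]

full = one_hot(records, LEVELS)                      # four dummy columns
reduced = one_hot(records, LEVELS, drop_first=True)  # three dummy columns

for row in reduced:
    print(row)
```

In the reduced encoding, the coefficient on each remaining dummy is read relative to the omitted level (here, no HS diploma).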

 

The problem I'm running into is when the categories extend beyond just a few classifications.  Here's an example - I want to determine if certain supervisors (of which there are hundreds in our organization) are more likely to have employees quit within the first year of employment.  If I were to use this same workflow, I would have hundreds of additional dummy variables - one for each supervisor.  It's not a numeric variable, so it doesn't work with my Logistic Regression tool.  How are variables like this worked into logistic or multiple regression analyses?

JohnJPS
15 - Aurora

I think the analysis described would only be useful if you're digging beneath the surface and looking for reason codes, e.g. trying to answer why people under a given supervisor quit. If you just want a probability, simple aggregation would give you that: compare the share of people who quit overall with the share who quit under a given supervisor. That's very simple, and since it reports the observed rate directly, it's also more accurate than the guesswork a GLM would give you.
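The simple aggregation described above could look something like this. It's only a sketch in plain Python, assuming each record is a hypothetical `(supervisor, quit_within_first_year)` pair. The `quit_rates` helper and the sample data are made up for illustration.

```python
from collections import defaultdict


def quit_rates(records):
    """Return (overall quit rate, {supervisor: quit rate}).

    records: iterable of (supervisor, quit_flag) pairs, where quit_flag
    is True if the employee quit within the first year.
    """
    hired = defaultdict(int)
    quits = defaultdict(int)
    for supervisor, quit_flag in records:
        hired[supervisor] += 1
        quits[supervisor] += int(quit_flag)
    overall = sum(quits.values()) / sum(hired.values())
    by_supervisor = {s: quits[s] / hired[s] for s in hired}
    return overall, by_supervisor


# Hypothetical sample: supervisor id and whether the employee quit.
sample = [
    ("A", True), ("A", False), ("A", True),
    ("B", False), ("B", False),
    ("C", True),
]

overall, by_supervisor = quit_rates(sample)

# Supervisors ordered from highest to lowest quit rate.
worst_first = sorted(by_supervisor, key=by_supervisor.get, reverse=True)

print(f"overall: {overall:.2f}")
for s in worst_first:
    print(s, f"{by_supervisor[s]:.2f}")
```

One caveat: a supervisor with very few employees (like "C" in the sample, with one) can show an extreme rate by chance, so it's worth looking at headcount next to the rate.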

 

For machine learning / modeling to be useful in answering "why do people quit under certain supervisors," you probably can't limit the analysis to just one supervisor. You could exclude the supervisor variable and look for other reasons in an analysis that takes other factors into account. Then grab the "ten worst" supervisors from the simple aggregation described above and examine how they overlap with the results of your GLM. That may help you pinpoint, for a particular supervisor, how they can improve their game.

 

Hope that helps!

John
