Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.


SOLVED

Categorical independent variable in regression? Should I be creating dummy variables?

gakkos2323
6 - Meteoroid

Hello All,

 

I am running a linear regression. One of my independent variables is categorical; the rest are continuous.

 

I had 6 standardized vehicle brands in that categorical variable. I transformed each brand into a numerical value (1-Honda, 2-Toyota, ...). I changed the data type to V_String (for that categorical variable). Then I ran the regression. Based on the output table, my understanding is that the regression took care of the dummy conversion automatically. Am I missing something here?

ArtApa
Alteryx

Hi @gakkos2323 - Do not code the categorical variable as 1, 2, 3, ... as if it were on a Likert scale. Doing so would give you misleading results. If you use a categorical variable in a regression, you need to pay special attention to it and use a special coding scheme. You can read about the problem and possible solutions here: https://stats.idre.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis...
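To illustrate why numeric codes can mislead, here is a minimal Python sketch (not Alteryx code). It assumes a made-up data set with brand as the only predictor; the brands, prices, and variable names are all illustrative. It compares the fitted values from a regression on the numeric code with the fitted values from dummy coding, which in this single-predictor case are simply the per-brand means.

```python
from statistics import mean

# Made-up data: price (in $1000s) for three brands. All numbers are illustrative.
brands = ["Honda", "Honda", "Toyota", "Toyota", "Ford", "Ford"]
prices = [20.0, 22.0, 30.0, 32.0, 10.0, 12.0]

# Approach 1: treat the brand as a number (1-Honda, 2-Toyota, 3-Ford).
codes = {"Honda": 1, "Toyota": 2, "Ford": 3}
x = [codes[b] for b in brands]

# Least-squares line y = intercept + slope * x, from the closed-form solution.
x_bar, y_bar = mean(x), mean(prices)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, prices))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar
numeric_fit = {b: intercept + slope * c for b, c in codes.items()}

# Approach 2: dummy coding. With brand as the only predictor, the
# least-squares fitted value for each brand equals that brand's mean price.
dummy_fit = {
    b: mean(p for bb, p in zip(brands, prices) if bb == b) for b in codes
}

for b in codes:
    print(b, "numeric-code fit:", numeric_fit[b], "dummy fit:", dummy_fit[b])
```

In this toy data the numeric coding forces Toyota's fitted value to sit halfway between Honda's and Ford's, even though Toyota has the highest observed prices. The dummy coding has no such constraint.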

CharlieS
17 - Castor

Are you using the Linear Regression tool or the Assisted Modeling?

 

It should make them for you, but you can always check this in the output. You should see a variable created for each make with coefficients.

 

I used a small sample of vehicle data with the manufacturer name as a string field named "MAKE NAME". In the model report (R output anchor from the Linear Regression tool) you can see that a variable was created for each value, meaning a dummy was made.

 

Ignore the model, this was just a dummy test.

 

If you want to make your own ahead of time, that's also a good option. I like to do this because it gives me the opportunity to analyze them outside of the Linear Regression tool. @MarqueeCrew recently released a macro to make this process super easy. Follow the link below to download the macro if you don't want to create the dummy values yourself:

http://www.chaosreignswithin.com/2020/12/building-crew-generate-dummy-variables.html 

danilang
17 - Castor

Hi @gakkos2323 

 

According to the replies to this post by Alteryx's own @SydneyF, string variables are converted to the corresponding categorical variables using one-hot encoding in the Linear Regression tool. This conversion removes the need for you to perform the encoding yourself. The vehicle brand column will be automatically encoded as a binary column for each distinct value in the original column. Allowing the tool to perform the encoding directly also makes interpretation of the results easier, since the brand names used in the model are reported directly in the results table.

 


 

Note that this applies specifically to the Linear Regression tool.  For other predictive tools, you may need to create the dummy variables yourself.
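For anyone who does need to build the dummy variables by hand, here is a minimal Python sketch of one-hot encoding (not Alteryx code). The function name `one_hot` and its `drop_first` option are made up for this illustration. `drop_first` reflects the common practice of leaving one level out as the reference category when the model has an intercept, since a full set of dummies plus an intercept is perfectly collinear.

```python
def one_hot(values, drop_first=False):
    """Return (levels, rows): one 0/1 column per distinct value in `values`."""
    levels = sorted(set(values))
    if drop_first:
        # The dropped level becomes the reference category.
        levels = levels[1:]
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


brands = ["Honda", "Toyota", "Ford", "Honda"]

# Full one-hot encoding: one binary column per brand.
names, rows = one_hot(brands)
print(names)
for brand, row in zip(brands, rows):
    print(brand, row)

# Reference-level encoding: the first level (alphabetically) is left out.
ref_names, ref_rows = one_hot(brands, drop_first=True)
print(ref_names)
```

With `drop_first=True`, a row of all zeros stands for the omitted brand, and each remaining coefficient is read as a difference from that brand.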

 

Dan    

gakkos2323
6 - Meteoroid

Yes, I am using the Linear Regression tool. In fact, I just reran my workflow "with" (1-Honda, 2-Toyota) and "without" (Honda, Toyota) the conversion, and both give me the same results. I think, as you said, it takes care of the dummy variables.
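A likely reason both runs agree: once the field is a string type, "1" and "2" are just labels, so the dummy columns come out the same as for "Honda" and "Toyota". A minimal Python sketch of that idea, assuming the tool dummy-encodes any string field (the helper `dummy_rows` and the sample values are made up for illustration):

```python
def dummy_rows(labels):
    """Build 0/1 dummy rows, one column per distinct label."""
    # Distinct labels in order of first appearance.
    levels = list(dict.fromkeys(labels))
    return [[int(label == level) for level in levels] for label in labels]


brand_names = ["Honda", "Toyota", "Ford", "Toyota"]
brand_codes_as_text = ["1", "2", "3", "2"]  # same brands, relabeled as strings

# The labels differ, but the pattern of 0s and 1s is identical.
same = dummy_rows(brand_names) == dummy_rows(brand_codes_as_text)
print(same)
```

The results would differ only if the codes were left as a numeric type, because the field would then enter the model as a single continuous variable.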

gakkos2323
6 - Meteoroid

I actually tried the macro, and it keeps giving me the error "Tried to apply string operator to numeric value(Replace)". I made sure that the data type was "string", but it still does not go through.
