SOLVED

In the Logistic Regression Report - Factor Missing

KarlWang
7 - Meteor

It is so strange that one of my String factors is missing in the final report.

 

For example, I use 3 factors:

A (Double)

B(Double)

C(String: C1, C2, C3)   [category factor] 

 

Then I run the Logistic Regression Model and generate the report.

In the report, the 'Coefficients' section should look like this:

 

Coefficients:

                                      Estimate         Std.Error         z value      Pr(>|Z|)

(Intercept)             

A                  

B                             

C1              

C2                    

C3          

 

However, 'C1' is missing. The result looks like this instead:

Coefficients:

                                      Estimate         Std.Error         z value      Pr(>|Z|)

(Intercept)        

A                                 

B                       

C2                             

C3                    

 

Could someone tell me why this happens?  Thank you so much!

 

RamnathV
Alteryx Alumni (Retired)

This is as expected. When you have categorical variables as predictors, R uses one of the levels as the reference. Let me use an example to illustrate my point. Shown below is partial regression output where we are trying to predict tip as a function of other predictors. The day variable has four levels: Thur, Fri, Sat and Sun. You will see that the coefficient for Fri is missing, since R is using Fri as the reference.

 

So what does the coefficient daySat signify then? It implies that, with all other variables held the same, dining on a Saturday is expected to result in a tip about 12 cents lower than dining on a Friday.

 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.803817   0.352702   2.279   0.0236 *  
total_bill   0.094487   0.009601   9.841   <2e-16 ***
sexMale     -0.032441   0.141612  -0.229   0.8190    
smokerYes   -0.086408   0.146587  -0.589   0.5561    
daySat      -0.121458   0.309742  -0.392   0.6953    
daySun      -0.025481   0.321298  -0.079   0.9369    
dayThur     -0.162259   0.393405  -0.412   0.6804    
---

 

michael_treadwell
ACE Emeritus

In my experience, I have seen two possible causes of this.

 

(1) Are you creating samples of your data before modeling? If so, sometimes a class can accidentally get dropped from a sample set and it will not show up in the model.

 

(2) The underlying R function glm() that drives logistic regression in R drops the first class of any factor by default as a reference category. This is actually what the function relevel() is for: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/relevel.html
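To make the reference-level idea concrete, here is a minimal plain-Python sketch of what releveling amounts to conceptually. It is not R's relevel() itself: the helper names `relevel` and `dummy_columns` are made up for this illustration, and it assumes levels are ordered alphabetically by default, as R does when it builds a factor from strings.

```python
def relevel(levels, ref):
    """Return the levels with `ref` moved to the front (the reference slot)."""
    if ref not in levels:
        raise ValueError(f"{ref!r} is not one of the levels {levels}")
    return [ref] + [lvl for lvl in levels if lvl != ref]


def dummy_columns(levels):
    """Levels that get their own 0/1 column; the first level is the reference."""
    return levels[1:]


# Assumed default ordering: alphabetical, so Fri comes first.
levels = sorted(["Thur", "Fri", "Sat", "Sun"])

default_columns = dummy_columns(levels)                     # Fri is the reference
releveled_columns = dummy_columns(relevel(levels, "Thur"))  # Thur is the reference

print(default_columns)
print(releveled_columns)
```

With the default ordering, Fri gets no column, which matches the daySat, daySun and dayThur rows in the output above. After moving Thur to the front, Thur would be the level without a coefficient.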

 

Seems like someone else is having this exact same problem in R here: http://stackoverflow.com/questions/31930261/r-logistic-regression-missing-coefficients

RamnathV
Alteryx Alumni (Retired)

This is not a missing coefficient. For categorical predictors, R uses one of the levels as a reference. So all coefficients for a categorical predictor measure the impact of that level with respect to the reference level.

DylanB
Alteryx Alumni (Retired)

The level 'C1' of your C variable is omitted as a reference category. This (the omission of one level of a variable) will happen for any categorical input. Although it may seem a little unintuitive, it actually makes some sense:

 

The general formula referenced by the coefficients from a logistic regression model is this:

log(p/(1-p)) = B0 + B1*x1 + ... + Bn*xn

where the Bi's are your coefficients and the xi's are the incoming values of the variables.

 

Your formula looks like this:

log(p/(1-p))=Intercept+Estimate(A)*A+Estimate(B)*B+Estimate(C2)*C2+Estimate(C3)*C3

where C2 and C3 are 0/1.

A categorical variable is converted into 'dummy variables'. For each level of the categorical variable except the first, a 0/1 'dummy variable' is created that indicates whether the input is in that group. Thus, instead of having one variable for 'C', we have one for 'C2' and one for 'C3'.

When C2=0, C3=0, we have a C1 record, so it's not omitted - just implied by the other variables. 

You can interpret your C1 results as a reference category/ baseline: 'a change from C1 to C2 gives an increase of Estimate(C2) to the expected log odds (log(p/(1-p)))'

In other words, you can think of your coefficient for C1 as being 0.
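As a rough sketch of how the formula above is evaluated, the snippet below computes the log odds and the probability for one record at each level of C. The coefficient values are invented purely for illustration (they are not from any real model), and the function names `log_odds` and `probability` are hypothetical.

```python
import math

# Hypothetical coefficients, invented for this example only.
coef = {"(Intercept)": -1.0, "A": 0.5, "B": -0.25, "C2": 0.8, "C3": -0.4}


def log_odds(a, b, c):
    """Evaluate Intercept + Est(A)*A + Est(B)*B + Est(C2)*C2 + Est(C3)*C3."""
    c2 = 1 if c == "C2" else 0  # dummy variable for level C2
    c3 = 1 if c == "C3" else 0  # dummy variable for level C3
    return (coef["(Intercept)"] + coef["A"] * a + coef["B"] * b
            + coef["C2"] * c2 + coef["C3"] * c3)


def probability(a, b, c):
    """Invert log(p/(1-p)) to get p."""
    return 1.0 / (1.0 + math.exp(-log_odds(a, b, c)))


for level in ("C1", "C2", "C3"):
    print(level, log_odds(2.0, 4.0, level), probability(2.0, 4.0, level))
```

For a C1 record both dummies are 0, so only the intercept, A and B contribute. The gap between the C2 and C1 log odds is exactly the C2 estimate.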

 

KarlWang
7 - Meteor

Thank you so much, everyone.

And thank you DylanB for the  detailed explanation. It is greatly helpful. I understand now.

 

 

KarlWang
7 - Meteor

So in the final formula, can we just treat C1 as 0 and calculate directly?

Does this mean that C1 has no effect on the final result in this formula?

When C2=0 and C3=0, what is the coefficient of C1?

DylanB
Alteryx Alumni (Retired)

You don't need to insert C1 anywhere in the formula. Essentially look at it like this:

C    C2   C3
1    0    0
2    1    0
3    0    1

 

If C2=C3=0, then C=1 (that is, the record is in level C1). The formula already takes the C1 value into account. You can think of it as being absorbed by the intercept.

The C1 is taken into account by the formula as it is. If you wanted, you could insert a "C1*0" into the formula (i.e. coefficient of C1 being 0), but the multiplication by 0 would make it go away.
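One way to see that the intercept absorbs C1 is to re-express the same coefficients with a different level as the reference and check that every level still gets the same log odds. The sketch below does this in plain Python; the numbers are invented, A and B are left out because they are unaffected by the choice of reference, and `change_reference` and `level_log_odds` are hypothetical helper names.

```python
# Hypothetical estimates with C1 as the reference level (no C1 entry).
ref_c1 = {"(Intercept)": -1.0, "C2": 0.8, "C3": -0.4}


def change_reference(coef, old_ref, new_ref):
    """Re-express the intercept and level effects relative to `new_ref`."""
    shift = coef[new_ref]  # effect of new_ref relative to old_ref
    out = {"(Intercept)": coef["(Intercept)"] + shift}
    levels = [old_ref] + [k for k in coef if k != "(Intercept)"]
    for lvl in levels:
        if lvl == new_ref:
            continue  # the new reference gets no coefficient of its own
        old_effect = 0.0 if lvl == old_ref else coef[lvl]
        out[lvl] = old_effect - shift
    return out


def level_log_odds(coef, level):
    """Intercept plus the level's effect; the reference level contributes 0."""
    return coef["(Intercept)"] + coef.get(level, 0.0)


ref_c3 = change_reference(ref_c1, "C1", "C3")
for lvl in ("C1", "C2", "C3"):
    print(lvl, level_log_odds(ref_c1, lvl), level_log_odds(ref_c3, lvl))
```

Up to floating-point rounding, both parameterizations give the same log odds for each level. Only the labels change: with C3 as the reference, C1 gets an explicit coefficient and C3 becomes the implied zero.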

KarlWang
7 - Meteor

Thank you!! DylanB

jbh1128d1
10 - Fireball

Hi Dylan,

 

Do you know how I can change the reference level? Say I have a variable for credit score and I want 775+ to be the reference level instead of <374.
