In the Logistic Regression Report - Factor Missing
It is so strange that one of my String factors is missing in the final report.
For example, I use 3 factors:
A (Double)
B (Double)
C (String: C1, C2, C3) [categorical factor]
Then I run the Logistic Regression Model and generate the report.
In the report, the 'Coefficients' section should look like this:
Coefficients:
Estimate Std.Error z value Pr(>|Z|)
(Intercept)
A
B
C1
C2
C3
However, 'C1' is missing. The result looks like this:
Coefficients:
Estimate Std.Error z value Pr(>|Z|)
(Intercept)
A
B
C2
C3
Could someone tell me why this happens? Thank you so much!
This is expected. When you have categorical variables as predictors, R uses one of the levels as the reference. Let me illustrate with an example. Shown below is partial regression output from a model that predicts tip as a function of other predictors. The day variable has four levels: Thur, Fri, Sat and Sun. The coefficient for Fri is missing because R is using Fri as the reference (by default, the first level in alphabetical order).
So what does the coefficient daySat signify? It implies that, with all other variables held the same, dining on a Saturday is expected to result in a tip about 12 cents lower than dining on a Friday.
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.803817   0.352702   2.279   0.0236 *
total_bill   0.094487   0.009601   9.841   <2e-16 ***
sexMale     -0.032441   0.141612  -0.229   0.8190
smokerYes   -0.086408   0.146587  -0.589   0.5561
daySat      -0.121458   0.309742  -0.392   0.6953
daySun      -0.025481   0.321298  -0.079   0.9369
dayThur     -0.162259   0.393405  -0.412   0.6804
---
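As a rough check of that reading, the sketch below plugs the estimates from the output above into the fitted equation by hand. It is written in Python purely for illustration. The helper name predicted_tip and the sample bill of 20.0 are made up; only the coefficients come from the output.

```python
# Estimates copied from the regression output above.
COEF = {
    "(Intercept)": 0.803817,
    "total_bill": 0.094487,
    "sexMale": -0.032441,
    "smokerYes": -0.086408,
    "daySat": -0.121458,
    "daySun": -0.025481,
    "dayThur": -0.162259,
}

def predicted_tip(total_bill, sex, smoker, day):
    """Hypothetical helper: evaluate the fitted linear equation by hand."""
    value = COEF["(Intercept)"] + COEF["total_bill"] * total_bill
    if sex == "Male":
        value += COEF["sexMale"]
    if smoker == "Yes":
        value += COEF["smokerYes"]
    # Fri has no entry in COEF: it is the reference level, so it adds nothing.
    value += COEF.get("day" + day, 0.0)
    return value

# Same diner, same bill (an invented value), only the day differs.
fri = predicted_tip(20.0, "Female", "No", "Fri")
sat = predicted_tip(20.0, "Female", "No", "Sat")
print(round(sat - fri, 6))  # the gap equals the daySat estimate
```

Whatever bill, sex and smoker values you pick, the Saturday minus Friday gap stays the same, which is exactly what "relative to the reference level" means.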
In my experience, I have seen two possible causes of this.
(1) Are you creating samples of your data before modeling? If so, sometimes a class can accidentally get dropped from a sample set and it will not show up in the model.
(2) The underlying R function glm() that drives logistic regression in R uses the first level of any factor as the reference category by default, so that level gets no coefficient of its own. To pick a different reference level, use the function relevel(): https://stat.ethz.ch/R-manual/R-devel/library/stats/html/relevel.html
Seems like someone else is having this exact same problem in R here: http://stackoverflow.com/questions/31930261/r-logistic-regression-missing-coefficients
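For cause (1), one quick check is to compare the levels present in the sample against those in the full data. Here is a minimal sketch in Python; the function name missing_levels and both example columns are invented for illustration.

```python
def missing_levels(full_values, sample_values):
    """Return the levels that occur in the full data but not in the sample."""
    return sorted(set(full_values) - set(sample_values))

full_c = ["C1", "C2", "C3", "C2", "C3", "C2"]  # hypothetical full column
sample_c = ["C2", "C3", "C2"]                  # hypothetical sample that lost C1

print(missing_levels(full_c, sample_c))
```

If the result is empty, every level survived sampling, which points to cause (2) instead.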
This is not a missing coefficient. For categorical predictors, R uses one of the levels as a reference. So all coefficients for a categorical predictor measure the impact of that level with respect to the reference level.
The level 'C1' of your C variable is omitted as a reference category. This (the omission of one level of a variable) will happen for any categorical input. Although it may seem a little unintuitive, it actually makes some sense:
The general formula referenced by the coefficients from a logistic regression model is this:
log(p/(1-p))=B0+B1*x1+...+Bn*xn
where the Bi's are your coefficients and the xi's are the incoming values of the variables.
Your formula looks like this:
log(p/(1-p))=Intercept+Estimate(A)*A+Estimate(B)*B+Estimate(C2)*C2+Estimate(C3)*C3
where C2 and C3 are 0/1.
A categorical variable is converted into 'dummy variables'. For each level of the categorical variable except the first, a 0/1 'dummy variable' is created that indicates whether the input is in that group. Thus, instead of having one variable for 'C', we have one for 'C2' and one for 'C3'.
When C2=0, C3=0, we have a C1 record, so it's not omitted - just implied by the other variables.
You can interpret C1 as a reference category/baseline: 'a change from C1 to C2 changes the expected log odds (log(p/(1-p))) by Estimate(C2)'
In other words, you can think of your coefficient for C1 as being 0.
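To make the dummy-variable idea concrete, here is a minimal sketch in Python. The estimates in EST are invented numbers, not output from any real model, and the function names are hypothetical. It only shows how C2 and C3 enter the formula and why C1 needs no term of its own.

```python
import math

LEVELS = ["C1", "C2", "C3"]  # the first level acts as the reference

def dummy_code(level, levels=LEVELS):
    """Return 0/1 indicators for every level except the first (reference)."""
    if level not in levels:
        raise ValueError(f"unknown level: {level}")
    return {name: int(level == name) for name in levels[1:]}

# Hypothetical estimates, invented for this example only.
EST = {"(Intercept)": -1.0, "A": 0.5, "B": -0.2, "C2": 0.8, "C3": -0.4}

def log_odds(a, b, c):
    """Evaluate Intercept + Est(A)*A + Est(B)*B + Est(C2)*C2 + Est(C3)*C3."""
    d = dummy_code(c)
    return (EST["(Intercept)"] + EST["A"] * a + EST["B"] * b
            + EST["C2"] * d["C2"] + EST["C3"] * d["C3"])

def probability(a, b, c):
    """Invert log(p/(1-p)) to recover p."""
    return 1.0 / (1.0 + math.exp(-log_odds(a, b, c)))

print(dummy_code("C1"))  # both indicators are 0 for a C1 record
# Moving from C1 to C2 shifts the log odds by Estimate(C2), up to rounding.
print(log_odds(2.0, 1.0, "C2") - log_odds(2.0, 1.0, "C1"))
```

A C1 record simply falls back to Intercept + Estimate(A)*A + Estimate(B)*B, which is why its coefficient can be read as 0.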
Thank you so much, everyone.
And thank you DylanB for the detailed explanation. It is greatly helpful. I understand now.
So in the final formula, can we take C1 as 0 and calculate directly?
Does that mean C1 has no effect on the final result in this formula?
When C2=0 and C3=0, what is the coefficient of C1?
You don't need to insert C1 anywhere in the formula. Essentially look at it like this:
C | C2 | C3 |
1 | 0 | 0 |
2 | 1 | 0 |
3 | 0 | 1 |
If C2=C3=0, then C=1, so the formula already takes C1 into account as it is. You can think of its effect as being absorbed into the intercept.
If you wanted, you could insert a "C1*0" term into the formula (i.e. a coefficient of 0 for C1), but the multiplication by 0 would make it go away.
Thank you!! DylanB
Hi Dylan,
Do you know how I can change the reference level? Say I have a variable for credit score and I want 775+ to be the reference level instead of <374.
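Earlier in the thread relevel() was mentioned for this; in R it takes the factor and a ref argument naming the new baseline level. What changing the reference does to the dummy coding can be sketched in Python as below. The two middle score bands and the function name are invented for illustration; only "<374" and "775+" come from the question.

```python
def dummy_code(level, levels, ref):
    """Return 0/1 indicators for every level except the chosen reference."""
    if ref not in levels or level not in levels:
        raise ValueError("level and ref must both be in levels")
    return {name: int(level == name) for name in levels if name != ref}

# Hypothetical credit score bands; the two middle bands are made up.
BANDS = ["<374", "374-574", "575-774", "775+"]

# Default-style coding: the first band is the reference.
print(dummy_code("775+", BANDS, ref="<374"))
# Coding with 775+ as the reference: a 775+ record is all zeros.
print(dummy_code("775+", BANDS, ref="775+"))
```

With 775+ as the reference, each remaining coefficient is read as the change in log odds relative to the 775+ group.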