Alteryx Designer Desktop Discussions

KarlWang · ‎12-21-2015

It is so strange that one of my String factors is missing in the final report.

For example, I use 3 factors:

A (Double)

B (Double)

C (String: C1, C2, C3) [categorical factor]

Then I run the Logistic Regression Model and generate the report.

In the report, the 'Coefficients' section should look like this:

Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)
A
B
C1
C2
C3

However, 'C1' is missing. The result looks like this instead:

Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)
A
B
C2
C3

Could someone tell me why this happens? Thank you so much!

RamnathV · ‎12-21-2015

This is as expected. When you have categorical variables as predictors, R uses one of the levels as the reference. Let me use an example to illustrate my point. Shown below is partial regression output where we are trying to predict tip as a function of other predictors. The day variable has four levels: Thur, Fri, Sat and Sun. You will see that the coefficient for Fri is missing, since R is using Fri as the reference.

So what does the coefficient daySat signify then? It implies that, with all other variables held the same, dining on a Saturday is expected to result in a tip about 12 cents lower than dining on a Friday.

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.803817   0.352702   2.279   0.0236 *  
total_bill   0.094487   0.009601   9.841   <2e-16 ***
sexMale     -0.032441   0.141612  -0.229   0.8190    
smokerYes   -0.086408   0.146587  -0.589   0.5561    
daySat      -0.121458   0.309742  -0.392   0.6953    
daySun      -0.025481   0.321298  -0.079   0.9369    
dayThur     -0.162259   0.393405  -0.412   0.6804    
---

michael_treadwell · ‎12-21-2015

In my experience, I have seen two possible causes of this.

(1) Are you creating samples of your data before modeling? If so, sometimes a class can accidentally get dropped from a sample set and it will not show up in the model.

(2) The underlying R function glm() that drives logistic regression in R drops the first class of any factor by default as a reference category. This is actually what the function relevel() is for: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/relevel.html

Seems like someone else is having this exact same problem in R here: http://stackoverflow.com/questions/31930261/r-logistic-regression-missing-coefficients
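To make the "first level is dropped" point concrete, here is a minimal sketch in plain Python (not R, and not what Alteryx runs internally). It assumes the levels are sorted alphabetically, which is R's default for a factor built from strings, and that the first level becomes the reference. The helper names `relevel` and `coefficient_names` are made up for this illustration; `relevel` here only imitates the idea of R's relevel() by moving the chosen level to the front.

```python
def relevel(levels, ref):
    """Return the levels with `ref` moved to the front (the reference slot)."""
    if ref not in levels:
        raise ValueError(f"{ref!r} is not one of the levels: {levels}")
    return [ref] + [lv for lv in levels if lv != ref]


def coefficient_names(variable, levels):
    """Names of the dummy coefficients: every level except the first one."""
    return [f"{variable}{lv}" for lv in levels[1:]]


# Assumption: default level order is alphabetical, so Fri comes first.
day_levels = sorted(["Thur", "Fri", "Sat", "Sun"])

# Default order: Fri is the reference, so no dayFri coefficient appears.
print(coefficient_names("day", day_levels))

# After moving Sun to the front, Sun is the reference instead.
print(coefficient_names("day", relevel(day_levels, "Sun")))
```

With the default order the names match the daySat, daySun and dayThur rows in the output posted earlier in the thread. Choosing a different reference changes which level is left out, not whether one is left out.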

RamnathV · ‎12-21-2015

This is not a missing coefficient. For categorical predictors, R uses one of the levels as a reference. So all coefficients for a categorical predictor measure the impact of that level with respect to the reference level.

DylanB · ‎12-21-2015

The level 'C1' of your C variable is omitted as a reference category. This (the omission of one level of a variable) will happen for any categorical input. Although it may seem a little unintuitive, it actually makes some sense:

The general formula referenced by the coefficients from a logistic regression model is this:

log(p/(1-p)) = B0 + B1*x1 + ... + Bn*xn

where the Bi's are your coefficients and the xi's are the incoming values of the variables.

Your formula looks like this:

log(p/(1-p))=Intercept+Estimate(A)*A+Estimate(B)*B+Estimate(C2)*C2+Estimate(C3)*C3

where C2 and C3 are 0/1.

A categorical variable is converted into 'dummy variables'. For each level of the categorical variable except the first, a 0/1 'dummy variable' is created that indicates whether the input is in that group. Thus, instead of having one variable for 'C', we have one for 'C2' and one for 'C3'.

When C2=0, C3=0, we have a C1 record, so it's not omitted - just implied by the other variables.

You can interpret C1 as the reference category (baseline): 'a change from C1 to C2 changes the expected log odds (log(p/(1-p))) by Estimate(C2)'.

In other words, you can think of your coefficient for C1 as being 0.
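The dummy-variable step described above can be sketched in a few lines of plain Python. This is only an illustration of the encoding, assuming the first level in the list is the reference; `dummy_encode` is a hypothetical helper, not an Alteryx or R function.

```python
def dummy_encode(value, levels):
    """Map one categorical value to 0/1 indicators for every level except the first."""
    if value not in levels:
        raise ValueError(f"{value!r} is not one of the levels: {levels}")
    # The first level gets no column of its own; it is the all-zeros case.
    return {lv: int(value == lv) for lv in levels[1:]}


levels = ["C1", "C2", "C3"]  # C1 is first, so it acts as the reference
for value in levels:
    print(value, dummy_encode(value, levels))
```

A C1 record comes out with both indicators at 0, which is why no separate C1 column (or coefficient) is needed.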

KarlWang · ‎12-22-2015

Thank you so much, everyone.

And thank you DylanB for the detailed explanation. It is greatly helpful. I understand now.

KarlWang · ‎01-06-2016

So in the final formula, can we treat C1 as 0 and calculate directly?

Does that mean C1 has no effect on the final result in this formula?

When C2=0 and C3=0, what is the coefficient of C1?

DylanB · ‎01-06-2016

You don't need to insert C1 anywhere in the formula. Essentially look at it like this:

C	C2	C3
1	0	0
2	1	0
3	0	1

If C2=C3=0, then C=1. The formula takes the C1 value into account. You can think of it as being taken into account by the intercept, though.

C1 is already taken into account by the formula as it is. If you wanted, you could insert a "C1*0" into the formula (i.e. a coefficient of 0 for C1), but the multiplication by 0 would make it go away.
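Here is a small worked sketch of that formula in plain Python. The coefficient values are invented purely for illustration (they are not from any real model), and `log_odds` and `probability` are hypothetical helper names.

```python
import math

# Hypothetical coefficient values, made up for illustration only.
coef = {"(Intercept)": -1.0, "A": 0.5, "B": 0.25, "C2": 0.8, "C3": -0.4}


def log_odds(a, b, c):
    """log(p/(1-p)) for one record; c is one of 'C1', 'C2', 'C3'."""
    c2 = 1 if c == "C2" else 0  # dummy variable for level C2
    c3 = 1 if c == "C3" else 0  # dummy variable for level C3
    return (coef["(Intercept)"] + coef["A"] * a + coef["B"] * b
            + coef["C2"] * c2 + coef["C3"] * c3)


def probability(a, b, c):
    """Convert the log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds(a, b, c)))


# Same A and B for all three records; only the category changes.
base = log_odds(2.0, 4.0, "C1")  # C2 = C3 = 0, so only intercept, A and B contribute
print("C1 log odds:", base)
print("C2 minus C1:", log_odds(2.0, 4.0, "C2") - base)
print("C3 minus C1:", log_odds(2.0, 4.0, "C3") - base)
print("C1 probability:", probability(2.0, 4.0, "C1"))
```

The differences relative to C1 are just the C2 and C3 coefficients, which is the sense in which C1 behaves as if its own coefficient were 0.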

KarlWang · ‎01-07-2016

Thank you!! DylanB

jbh1128d1 · ‎10-23-2017

Hi Dylan,

Do you know how I can change the reference level? Say I have a variable for credit score and I want 775+ to be the reference level instead of <374.

In the Logistic Regression Report - Factor Missing