Logistic regression: identical coefficients and odd results compared to descriptive stats

Flo_G
7 - Meteor

I'm running an analysis on conversions, which are represented by a 0/1 variable. I would like to use two predictor variables, language and country. All variables are strings.

 

If I run descriptive statistics on my database (conversion rate by language and conversion rate by country), I can see pretty stark differences between populations. I tried to run a logistic regression combining the two to get an indication of which differences are significant, but the results are completely off.

 

I have 4 languages and 20 countries in the database.

 

If I run glm(conversion.flag = Language + Country), the coefficients are almost exactly the same for all languages and countries. Almost all coefficients are not statistically significant, with one exception. I'm sure this should not be the case. How can I fix this? I would be very interested in this, as I want to see if there are different behaviors when combining languages and countries (e.g. EN speakers in a DE country are not as engaged as DE speakers in DE countries).

 

Second problem: if I run glm(conversion.flag = Language) (assuming that country might not be a significant predictor), the coefficients are all significant (***), but they go in the completely opposite direction of what I'd infer from the descriptive statistics!

 

Descriptive stats:

Language    Conversion rate
N/A         76%
E           88%
D           96%
F           96%
I           94%

 

Log coefficients:

Language                      Coefficient
N/A (intercept, I assume)     0.85
E                             -0.31
D                             -0.21
F                             -0.25
I                             -0.16

 

Maybe the Log regression tool is actually reading the '1' flag as 0, and the '0' flag as 1? How do I change that?

 

I'm using the logit model, which seems to be appropriate for my use case. Am I doing something wrong in the setup?

mceleavey
17 - Castor

Hi @Flo_G ,

 

This might be the case. Have you tried using One-Hot Encoding on your categorical variables to drive the correct binary values?

I've attached a tool to do this for you (set the Record ID field to be a string with 6 characters and only select the Record ID and the variables you wish to encode to go into the encoder, then join it back to the main data on Record ID).
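
For readers unfamiliar with the term, here is a minimal sketch of what one-hot encoding with a dropped base level produces. It is written in plain Python purely for illustration and is not the attached tool; the function name `one_hot` and the sample labels are made up.

```python
def one_hot(values, drop_first=True):
    """Return (kept_levels, rows) of 0/1 indicators for a list of category labels.

    Levels are sorted alphabetically. With drop_first=True the first level
    becomes the base case and gets no column of its own.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


# Made-up sample of language labels (not the poster's data).
languages = ["D", "E", "F", "I", "E", "D"]

cols, rows = one_hot(languages)
print("columns:", cols)
for lang, row in zip(languages, rows):
    print(lang, row)

# Full encoding without a dropped level, for comparison.
full_cols, full_rows = one_hot(languages, drop_first=False)
```

In the dropped-level encoding, a record in the base level is all zeros. In the full encoding every row sums to 1, which duplicates the information carried by an intercept column.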

 

Hope this helps.

 

M.




bwakefield
5 - Atom

I'm completely new to Alteryx, but I'm a huge fan of logistic regression. Sounds like mceleavey thinks your assessment of the reverse-encoding of your response variable is 1 (true, lol) and has a potential fix in the flow he posted. Is there a way in Alteryx to explicitly set your response variable as a factor with two levels (a binary variable)? This would also be a solution, as 0/1 forces integer levels onto a binary, categorical variable, but 'Yes' and 'No' would also work if you can set the data type of the response variable to be a factor.

Another thought for you (thinking out loud): I might expect the effect of 'Country' to be different from country to country, so you might consider using a mixed-effect model where 'Country' is set as a random effect as opposed to a fixed effect. If Alteryx can deal with GLMs, I bet it can run a logistic model with a mixed-effect approach. Let me know if this sounds inappropriate.

mceleavey
17 - Castor

Hi @bwakefield ,

 

You can determine the positive and negative in the customisation screen of the logistic regression tool:

 

mceleavey_0-1619629589039.png

 

M.




DrDan
Alteryx Alumni (Retired)

Hi @Flo_G ,

 

I'm going to start with a couple of questions, and then a comment about how to answer the question you want to address. My first question: did the coefficients you provided come from the model with both country and language, or just language? The second question: was there an actual category level N/A, or are these missing values (NA in R, Null in Alteryx)?

If the answer to the first question is that the model had both country and language, it would likely explain what you are seeing. Language and country are likely highly related to one another. Assuming D is German, E is English, F is French, and I is Italian, you will find most people in Germany speak German, and so on. As a result, the coefficients on language will not necessarily link to your descriptive statistics due to the effect of country.

If the model only used language as a predictor, then the predicted probability of the model for each language group should correspond to your summary statistics. I looked at this: based on the coefficients you reported, they do not, and they also don't correspond to 100 less the summary statistics, which would be the case if the coding of the 1 and 0 values of the target were reversed.

There is also oddness in what you indicate are the categorical variables in the model (hence my question on what the N/A category represents). When working with categorical variables, one category is omitted (implicitly being captured in the intercept, where it acts as a base case). Under the hood, R one-hot encodes the categorical variable and just omits one of the resulting encoded columns. If this is not done, then a model intercept could not be estimated, since the "design matrix" for the problem would not be full rank and the estimation algorithm would have issues (R is smarter than this, and removes problematic columns when the design matrix is not full rank). What the coefficients for the remaining categories pick up is the difference between that category and the base case.

Now what is odd is that, by default, R omits the first category based on an alphabetic sort of the category labels, and that corresponds to language D. However, this is not the column that is removed, which leads me to believe that there may be other issues with your data. By default, the category being predicted corresponds to the second category in a sort of the target labels, so in your case it is "1".
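
The comparison between the reported coefficients and the summary statistics can be reproduced with a short calculation. This is a sketch in plain Python rather than R, and it assumes that N/A is the base level absorbed into the intercept and that the numbers are the rounded values quoted in the original post.

```python
import math


def inv_logit(x):
    """Convert a log-odds value into a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Values as reported in the original post (rounded by the poster).
intercept = 0.85  # assumed to represent the N/A base level
coefs = {"E": -0.31, "D": -0.21, "F": -0.25, "I": -0.16}
observed = {"N/A": 0.76, "E": 0.88, "D": 0.96, "F": 0.96, "I": 0.94}

# Predicted probability for the base level, then for each other level.
predicted = {"N/A": inv_logit(intercept)}
for lang, beta in coefs.items():
    predicted[lang] = inv_logit(intercept + beta)

for lang, obs in observed.items():
    p = predicted[lang]
    print(f"{lang:>3}: predicted {p:.3f}, reversed {1 - p:.3f}, observed {obs:.2f}")
```

Under these assumptions the predicted probabilities fall between about 0.63 and 0.70, so they match neither the observed rates nor their complements.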

 

You indicate that: "I want to see if there are different behaviors when combining languages and countries (eg EN speakers in a DE country are not as engaged as DE speakers in DE countries)." It turns out the analysis you have done up to this point won't actually answer this question. So far you have run main effects models, and the question you are asking involves interaction effects. To do this, you will need to construct a new categorical variable, which could have up to 80 categories (the unique combinations of languages and countries). This could likely be simplified, such as by creating a variable that indicates whether someone speaks the dominant language of the country in which the individual resides.
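
As a rough illustration of the two options just described, the sketch below builds both a combined language-and-country label and a simpler dominant-language flag. It is plain Python with made-up records; the country codes, the `DOMINANT_LANGUAGE` mapping, and the field names are all hypothetical and would need to be replaced with the real ones.

```python
# Hypothetical mapping from country code to its dominant language code.
DOMINANT_LANGUAGE = {"DE": "D", "FR": "F", "IT": "I", "UK": "E"}

# Made-up records for illustration only.
records = [
    {"language": "D", "country": "DE", "converted": 1},
    {"language": "E", "country": "DE", "converted": 0},
    {"language": "F", "country": "FR", "converted": 1},
    {"language": "E", "country": "IT", "converted": 1},
]


def add_interaction_fields(record):
    """Return a copy of the record with two extra candidate predictors."""
    out = dict(record)
    # Full interaction: one category per language/country combination.
    out["lang_country"] = f'{record["language"]}_{record["country"]}'
    # Simplified version: 1 if the person speaks the country's dominant language.
    dominant = DOMINANT_LANGUAGE.get(record["country"])
    out["speaks_dominant"] = int(dominant == record["language"])
    return out


enriched = [add_interaction_fields(r) for r in records]
for row in enriched:
    print(row)
```

Either new field could then be used as a predictor in place of, or alongside, the separate language and country variables.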

 

Dan

Flo_G
7 - Meteor

Hi @mceleavey ,

 

Is that option available on a certain version of Alteryx? My log tool doesn't have that option. This is what I have:

 

Flo_G_0-1619685161253.png

 

mceleavey
17 - Castor

Hi @Flo_G ,

 

This is in the Customisation menu:

 

mceleavey_0-1619685843963.png

 

M.




Flo_G
7 - Meteor

Hi @mceleavey ,

 

Yes, the screenshot I attached is from the customization menu. There is no option other than the model type I indicated, plus tabs for Cross validation and Plot options.

mceleavey
17 - Castor

Ah, sorry. That's interesting.

Have you connected the data to the logistic regression tool so it has the target variable populated? If so, I'm on the latest version of the tool.

 

M.




Flo_G
7 - Meteor

Hi @DrDan ,

 

Thank you for the answer.

  • The coefficients I provided are from the language-only model
  • Language has an N/A level; I'm aware regressions will take one category as the base level for the intercept, and I assume it's #N/A (which would also be the first one in alphabetical order)
    Flo_G_0-1619685815409.png

     

  • The conversion variable - predicted - is also composed of 1 and 0, and is a string, so not sure why the model is not picking it up correctly, and I don't see a way to set it
  • The issue with identical coefficients is visible only in the language + country model, and the issue is that all languages and all countries have the exact same coefficient - this surely can't be right?
    Flo_G_1-1619686634180.png

     

  • I'm also aware that the log model of language + country would not give me their interaction effect, it's a first exploratory step - and I'm already stumped by the odd numbers with a simple log regression, so I'm taking it step by step 🙂