Linear Regression Output Discussion: Multiple R-square or R-square
Hi All,
Recently I was playing with the Regression tool and was a little confused when interpreting the output, especially R-square and its variations.
E.g., for the same data points (see attached), Alteryx and Excel reported slightly different output:
Alteryx reported both Multiple R-square and Adj R-square, while Excel reported both of those in addition to R-square. However, it seems that
Multiple R-sq (Alteryx) = R-square (Excel)
Questions:
1. Is the equation above correct? I expected the value of Multiple R-sq to be R-square^0.5.
2. Is it possible to change the number of decimal places shown in Alteryx? E.g., Multiple R-sq is currently shown as 0.992; where (probably in the macro behind the tool) can we increase the precision to 0.9920, if that is possible?
Alteryx
Excel
Thanks in advance!
- Labels:
- Predictive Analysis
- R Tool
Multiple R-Squared is simply the standard R-Squared value for a model with more than one "x", or predictor variable. Any R-Squared value computed with multiple predictors is therefore technically a Multiple R-Squared. So the equation in your post is correct: Multiple R-Squared in Alteryx should be the same as the R-Squared value you're getting from Excel.
Adjusted R-Squared is an alternative metric used when you want to compare models that have different numbers of predictors. Because of the way standard R-Squared is defined, adding a predictor variable will *never* decrease its value (and in practice almost always increases it), even when the added variable has no predictive power. Adjusted R-Squared therefore includes a penalty term for additional variables, so that the model only improves if the gain in predictive power is large enough to offset the penalty for adding the variable. Here's a StackExchange post with a better explanation, and I'm happy to clear up any confusion about this.
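To make the relationship between these numbers concrete, here is a minimal R sketch. The data are simulated and the column names (`x1`, `x2`, `noise`, `y`) are made up for illustration; nothing here comes from the attached dataset. It recomputes R-Squared and Adjusted R-Squared by hand, and shows that adding an unrelated predictor does not lower R-Squared. As far as I know, the "Multiple R" line in Excel's regression output is the square root of R Square, which is probably where the ^0.5 in the question comes from.

```r
# Simulated data (an assumption for illustration only)
set.seed(1)
n <- 40
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)
dat$noise <- rnorm(n)  # a predictor unrelated to y

fit_small <- lm(y ~ x1 + x2, data = dat)
fit_big   <- lm(y ~ x1 + x2 + noise, data = dat)

# Values reported by summary(): "Multiple R-squared" and "Adjusted R-squared"
r2_small  <- summary(fit_small)$r.squared
adj_small <- summary(fit_small)$adj.r.squared
r2_big    <- summary(fit_big)$r.squared

# The same quantities computed by hand for the smaller model
p   <- 2                                  # number of predictors
rss <- sum(residuals(fit_small)^2)        # residual sum of squares
tss <- sum((dat$y - mean(dat$y))^2)       # total sum of squares
r2_manual  <- 1 - rss / tss
adj_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

# Square root of R-squared (what Excel labels "Multiple R")
multiple_r <- sqrt(r2_manual)

print(c(r2 = r2_small, adj_r2 = adj_small, multiple_r = multiple_r))
print(c(r2_with_noise = r2_big))
```

Comparing `r2_big` with `r2_small` shows the effect described above: the extra column can only push R-Squared up, while the adjusted value is free to go down.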
As for your second question, it does seem possible, although I've never looked too closely into the Linear Regression macro, so I'm not 100% sure how to go about this.
At first glance, R Tool (170) in the macro appears to be where the interactive report sent to the I output is generated. There is a loop in which a variable called "dashboard" is created from the data and model. I believe that is the place to look if you want to increase the number of decimals.
If you have trouble changing the macro, I'd also recommend trying it in the R tool yourself, as the `lm` function is rather easy to use and to extract details from.
You can write something as simple as:
dat <- read.Alteryx('#1')                    # read the first input of the R tool
model <- lm(y ~ x1 + x2 + x3, data = dat)    # y, x1, x2, x3 stand in for your own column names
summary(model)
to fit a linear model to your data.
You may also want to look into the broom package, which tidies the goodness-of-fit metrics into a format that is easy to read back into Alteryx.
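As a rough sketch of the decimals question, the values stored in the summary object carry full precision, so you can format them however you like. The example below uses simulated data so that it runs outside Alteryx; the data frame name `fit_stats` and the choice of four decimals are my own assumptions. Inside the R tool you would read `dat` with `read.Alteryx` instead, and pass the result back with `write.Alteryx`, shown here only as a comment.

```r
# Simulated stand-in for the Alteryx input (an assumption for illustration)
set.seed(42)
dat <- data.frame(x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))
dat$y <- 1 + 2 * dat$x1 + 0.5 * dat$x2 - dat$x3 + rnorm(30, sd = 0.3)

model <- lm(y ~ x1 + x2 + x3, data = dat)
s <- summary(model)

# Collect the goodness-of-fit values at full precision
fit_stats <- data.frame(
  metric = c("Multiple R-squared", "Adjusted R-squared"),
  value  = c(s$r.squared, s$adj.r.squared)
)

# Fixed-point text with exactly four decimal places
fit_stats$formatted <- formatC(fit_stats$value, format = "f", digits = 4)

print(fit_stats)

# In the Alteryx R tool, something like this would send it to output 1:
# write.Alteryx(fit_stats, 1)
```

Keeping the numeric `value` column next to the formatted text means downstream tools can still do arithmetic on the unrounded number.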
Let me know if this helps, or you have any other questions.
Cheers!
Awesome! Really appreciate your help on this!
Hey,
Hope you had a great weekend.
Same regression output, different statistic. It seems that for a highly significant p-value, Alteryx gives an inequality instead of a precise number. I know it's small enough for decision-making, but is there a reason for this, since it doesn't seem like a big effort to report the exact value? See also the Excel screenshot below for comparison. Thanks.
Alteryx
Excel
This is actually coming through from R. It's pretty much just saying that the p-value is very close to 0. Floating-point numbers can't be reliably tested for exact equality (because of how they are represented), so it is usual to pick a threshold and say that a value is "close enough" to another. In this case, the line is just saying that p is smaller than about 2.2e-16, which is R's machine epsilon for double-precision numbers (`.Machine$double.eps`), and that is close enough to 0 for the report to stop printing digits. It still shows you the threshold, so you can make an informed decision about whether it's precise enough.
I believe that if you use the R tool to run that regression and look at the output, the p-value stored in the regression object will have full double precision (magnitudes from roughly 2.2e-308 up to 1.8e+308), and if you need further granularity you can extract it from there.
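A small sketch of what extracting it could look like, again on simulated data with made-up column names, and assuming the report text comes from R's usual p-value formatting. The coefficient table inside `summary(model)` holds the unrounded p-values, and the overall F-test p-value can be recomputed from the stored F statistic with `pf`.

```r
# Simulated data with a very strong signal (an assumption for illustration)
set.seed(7)
n <- 50
dat <- data.frame(x1 = rnorm(n))
dat$y <- 3 * dat$x1 + rnorm(n, sd = 0.1)

model <- lm(y ~ x1, data = dat)
s <- summary(model)

# Per-coefficient p-values, stored as ordinary doubles
p_coef <- s$coefficients[, "Pr(>|t|)"]

# Overall F-test p-value, recomputed from the stored F statistic
f <- s$fstatistic
p_model <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# Unrounded values
print(format(p_coef, digits = 6))
print(format(p_model, digits = 6))

# What the default p-value formatter shows for the same number
print(format.pval(p_coef[["x1"]]))
```

The last line goes through `format.pval`, which collapses anything below its `eps` argument (machine epsilon by default) into an inequality, while the values printed before it keep their actual magnitude.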
Let me know if this helps,
Cheers!
