I am trying to check the possibility of creating a prediction model from the variables I have in a data set. To do this, I am checking the relationship of various predictor variables with the target variable, using linear regression.
Below is a result of one such run:
From this, I can see that the Pr(>|t|) values and significance codes (*) show that the predictor variables have strong relationships with the target variable. But the R-squared value indicates a not very strong model. Is it normal to see these differences among the various parameters?
You should meet with a statistician to go over your model. I've run into this when there is a strong relationship between the independent variables and the dependent variables, but there is a lot of variability in the data.
Here is a good place to start for understanding this situation:
The coefficients estimate the trends while R-squared represents the scatter around the regression line. The interpretations of the significant variables are the same for both high and low R-squared models. Low R-squared values are problematic when you need precise predictions.
So, what’s to be done if you have significant predictors but a low R-squared value? I can hear some of you saying, "add more variables to the model!"
In some cases, it’s possible that additional predictors can increase the true explanatory power of the model. However, in other cases, the data contain an inherently higher amount of unexplainable variability. For example, many psychology studies have R-squared values less than 50% because people are fairly unpredictable.
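To make that point concrete, here is a small plain-Python sketch with made-up data: the true slope is clearly nonzero (large t-statistic, so a significant coefficient), yet R-squared stays low because the noise dwarfs the trend.

```python
# Hypothetical data: a real trend (slope = 2) buried in heavy noise.
import math
import random

random.seed(42)

n = 200
x = [random.uniform(0, 10) for _ in range(n)]
y = [2.0 * xi + random.gauss(0, 20) for xi in x]  # noise sd >> signal

mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = mean_y - slope * mean_x

ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

# t-statistic for the slope: estimate divided by its standard error
se_slope = math.sqrt(ss_res / (n - 2) / sxx)
t_stat = slope / se_slope

print(f"slope = {slope:.2f}, t = {t_stat:.1f}, R-squared = {r_squared:.2f}")
```

The fit recovers a slope near 2 with a t-statistic well above the usual significance cutoff, while R-squared stays under 0.2: significant coefficients, imprecise predictions.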
I'd be interested to see plots of the independent variables versus the dependent variables.
I went through the link you provided and it made complete sense to me. As you suggested, I will try to dig deeper into the variables I am analyzing for prediction.
The point I want to clarify is: in a situation like this, the way forward is to analyze the predictor variables and check how the model changes as different variables are added or removed. If we still don't find a good model, do we conclude that the problem is not predictable at all, or do we try some other predictive modelling approach on the same variables? Or do we redefine the whole predictive problem itself and then work to create a model for the new problem?
That's more of a statistics question, which a statistician could answer better. Here are some things from my experience:
1. When I've done regression-based predictive modeling, I've used half the data to develop the model, then the other half as a test set to see how well the model predicts outcomes. For model development I use a stepwise methodology. My statistics professor at university suggested performing both forward and backward stepwise model selection and seeing whether we arrive at the same model.
Forward selection: You begin with no candidate variables in the model. Select the variable that has the highest R-Squared. At each step, select the candidate variable that increases R-Squared the most. Stop adding variables when none of the remaining variables are significant. Note that once a variable enters the model, it cannot be deleted.
Backward selection: At each step, the variable that is the least significant is removed. This process continues until no nonsignificant variables remain. The user sets the significance level at which variables can be removed from the model.
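As a rough sketch of those two procedures, the following plain-Python example (made-up variable names and data) runs forward and backward selection. For simplicity it uses an R-squared change threshold as the stopping rule rather than formal significance tests, which is a simplification of the textbook procedure.

```python
import random

def ols_r2(cols, y):
    """Fit y on the given columns (plus intercept) and return R-squared."""
    n = len(y)
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X)beta = X'y, solved by Gauss-Jordan elimination.
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
         + [sum(X[i][j] * y[i] for i in range(n))] for j in range(p)]
    for j in range(p):
        piv = max(range(j, p), key=lambda r: abs(A[r][j]))
        A[j], A[piv] = A[piv], A[j]
        for r in range(p):
            if r != j and A[j][j] != 0:
                f = A[r][j] / A[j][j]
                A[r] = [a - f * b for a, b in zip(A[r], A[j])]
    beta = [A[j][p] / A[j][j] for j in range(p)]
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def forward_select(candidates, y, min_gain=0.01):
    """Add the variable that raises R-squared most; stop when gains are tiny."""
    chosen, best_r2 = [], 0.0
    remaining = dict(candidates)
    while remaining:
        name, r2 = max(((nm, ols_r2([c for _, c in chosen] + [col], y))
                        for nm, col in remaining.items()), key=lambda t: t[1])
        if r2 - best_r2 < min_gain:
            break
        chosen.append((name, remaining.pop(name)))
        best_r2 = r2
    return [nm for nm, _ in chosen], best_r2

def backward_select(candidates, y, max_loss=0.01):
    """Drop the variable whose removal costs the least R-squared."""
    kept = dict(candidates)
    while len(kept) > 1:
        full_r2 = ols_r2(list(kept.values()), y)
        name, r2 = max(((nm, ols_r2([c for k, c in kept.items() if k != nm], y))
                        for nm in kept), key=lambda t: t[1])
        if full_r2 - r2 > max_loss:
            break
        kept.pop(name)
    return sorted(kept)

# Toy data: y depends on x1 and x2 but not on the noise variable x3.
random.seed(1)
n = 100
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]
y = [3 * a + 2 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

cands = {"x1": x1, "x2": x2, "x3": x3}
fwd, r2 = forward_select(cands, y)
bwd = backward_select(cands, y)
print("forward:", fwd, "backward:", bwd)
```

On this toy data both directions should agree, picking x1 and x2 and discarding x3, which is the kind of agreement my professor's forward-vs-backward check is looking for.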
2. With significant predictors but a low R-squared, I would tend to explain it as: there is a relationship, but with high variability. Any predictions made from this model should not be considered precise.
Precision refers to the closeness of two or more measurements to each other. For example, if you weigh a given substance five times and get 3.2 kg each time, then your measurement is very precise. Precision is independent of accuracy: you can be very precise but inaccurate.
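A quick back-of-the-envelope illustration of the precision point, with assumed numbers: the residual scatter, which drives the width of a prediction interval, shrinks only with the square root of 1 − R².

```python
import math

sd_y = 10.0  # standard deviation of the outcome; made-up value

for r2 in (0.1, 0.5, 0.9):
    resid_sd = sd_y * math.sqrt(1 - r2)
    # An approximate 95% prediction interval is roughly +/- 2 residual SDs.
    print(f"R-squared = {r2:.1f}: prediction interval width about +/-{2 * resid_sd:.1f}")
```

Even at R-squared = 0.5 the interval is still about 70% as wide as with no model at all, which is why a low-R-squared model can have significant coefficients yet be of little use for precise prediction.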
3. Having said that, I've seen public health and epidemiology studies present models with an R-squared of 20% and seem very happy with it. This is because there is just so much variability in the human and environmental interactions related to a disease outcome. However, those studies were more about understanding relationships than trying to develop a predictive model.
I have to second everything that Philip has already said. Philip mentioned that R-squared values can vary based on the data set. When I do marketing analysis, a good R-squared is around 15%, but an average one is 5%. Some of the things we look at just have a wide degree of variability.
You may be looking at a situation where a simple linear model is not what is called for. You may need an interacted model, or a non-linear model. Alteryx is great for running simple linear models, but anything more complicated needs to be run in a stats program.
As Philip suggested, you may need to have a conversation with a statistician or an econometrician. There are a variety of more sophisticated options available.
From personal experience, the more complicated models are rarely needed. I've had models like yours that had a low R-squared. I switched them over to R and ran all sorts of complicated models: non-linear, interacted, and simultaneous equations. In the end I found that the simple linear model was the best about 75% of the time.