Hi,
I have an assignment for my class.
I am creating a predictive model for bike usage from the BIXI dataset:
https://www.bixi.com/en/page-27
We have to create a model that predicts bike usage, i.e. the number of bike users per hour, and we have been given the following datasets:
1. bike usage (from BIXI)
2. temperature dataset
I have created the model below for prediction.
I have attached a sample of the dataset used as the model input (a blend of the BIXI and temperature files).
The problem is that the model predicts negative values when it shouldn't (the number of users cannot be negative!).
Any help would be appreciated, thanks!
Hi @fa5fou5 ,
I don't find it strange that the model is predicting some negative values. Based on the predictor variables you have selected, the fitted coefficients give an equation that looks like this:
Number of users = 1.225*X1 + 1.250*X2 - 6.261
where
X1 : Start_Hour
X2 : Avg_Temperature
If your start hour is approximately equal to 1, and the same goes for your average temperature, then the returned value would definitely be negative.
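To make that arithmetic concrete, here is a small Python sketch that plugs values into the equation above. The coefficients are the ones quoted in this post; the function name `predict_users` is only an illustrative choice, not something from the workflow.

```python
# Coefficients copied from the equation quoted in this thread.
COEF_START_HOUR = 1.225
COEF_AVG_TEMPERATURE = 1.250
INTERCEPT = -6.261

def predict_users(start_hour, avg_temperature):
    """Evaluate the fitted line for one hour (hypothetical helper name)."""
    return (COEF_START_HOUR * start_hour
            + COEF_AVG_TEMPERATURE * avg_temperature
            + INTERCEPT)

# Start hour 1 and average temperature 1: the negative intercept dominates.
low = predict_users(1, 1)
print(round(low, 3))  # -3.786
```

Any combination where 1.225*X1 + 1.250*X2 is below 6.261 comes out negative in the same way.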
Because this is a linear regression line, for certain values of the predictor variables the predicted target will eventually become negative. The model is unable to tell whether that makes sense or not; you have to interpret the output yourself.
Also, I am not totally sure about the use of Start hour as a predictor variable for a linear regression line. One of the assumptions of a linear regression model is that X and Y are connected via a linear relationship.
If you use an association analysis tool and compare the number of users with the start hour, the relationship does not look very linear.
So that is probably an indication that linear regression is not the way forward, and it would explain both the negative values and why the model performs relatively poorly.
It would be interesting to know the more scientific answer to that, though. Do you mind posting it here when you find out the solution from your professor?
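As a rough illustration of that point, the Python sketch below fits a straight line to a made-up hourly pattern (few riders at night, most around midday). The numbers are invented for the example and are NOT the BIXI data; the helper names `fit_line` and `r_squared` are also just illustrative.

```python
# Made-up hourly pattern for illustration only (NOT the BIXI data):
# few riders at night, most around midday.
hours = list(range(24))
users = [140 - (h - 11.5) ** 2 for h in hours]

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(x, y, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

slope, intercept = fit_line(hours, users)
r2 = r_squared(hours, users, slope, intercept)
print(f"slope={slope:.3f}, R^2={r2:.3f}")
```

In this invented pattern the hour clearly drives usage, yet the fitted slope and R^2 are both essentially zero, because a single straight line cannot follow a curve that rises and then falls.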
As a final side note @fa5fou5 , I would be tempted to increase the % of data used for creating (training) the model from the current value of around 35% to around 60%, and set the validation share to around 25-30%.
The training batch should always be significantly larger than the validation one for a trustworthy prediction.
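For anyone doing the split outside the workflow tool, a minimal Python sketch of the proportions suggested above could look like this. The function name `split_rows`, the seed, and the 1000 stand-in rows are assumptions made for the example; whatever is left after the two shares is kept as a holdout.

```python
import random

def split_rows(rows, train_share=0.6, validation_share=0.3, seed=42):
    """Shuffle rows and cut them into training, validation and holdout lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_share)
    n_valid = round(len(shuffled) * validation_share)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    holdout = shuffled[n_train + n_valid:]  # whatever is left over
    return train, validation, holdout

rows = list(range(1000))  # stand-in for the hourly records
train, validation, holdout = split_rows(rows)
print(len(train), len(validation), len(holdout))  # 600 300 100
```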
Hope that helps,
Angelos
Thanks for giving me some insight into how to read the coefficient values.
I was confused about how to read them!
I was also unsure how to implement the time variable.
To make it even clearer, we could distinguish between two different models:
1. Weekend demand
2. Weekday demand
So yes, the start hour is not linearly related to the desired response, but it DOES have an effect on the number of users.
Which gives me an idea, but I don't know how to implement it.
The IDEA:
Is there a way to make a regression model with these two variables:
1. Temperature
2. Humidity %
and then adjust the response with a weight factor depending on the hour of the day? (I don't know if we can use a decision tree for that, or if there is a way to do it.)
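One possible way to read this idea, sketched in Python under stated assumptions: take some base model on temperature and humidity, then learn one multiplicative factor per hour as the ratio of observed users to base-predicted users for that hour. The base coefficients and the four sample rows below are invented purely for illustration, and the function names are hypothetical; this is not a tested method for the BIXI data.

```python
from collections import defaultdict

def base_prediction(temperature, humidity):
    """Placeholder base model; these coefficients are invented for the sketch."""
    return 20.0 + 2.0 * temperature - 0.1 * humidity

def hourly_factors(records):
    """records: iterable of (hour, temperature, humidity, observed_users).

    Returns {hour: observed total / base-predicted total}.
    """
    observed = defaultdict(float)
    predicted = defaultdict(float)
    for hour, temperature, humidity, users in records:
        observed[hour] += users
        predicted[hour] += base_prediction(temperature, humidity)
    return {hour: observed[hour] / predicted[hour] for hour in observed}

def adjusted_prediction(hour, temperature, humidity, factors):
    """Scale the base prediction by the factor learned for that hour."""
    return base_prediction(temperature, humidity) * factors.get(hour, 1.0)

# Tiny made-up sample (hour, temperature, humidity %, users), illustration only.
sample = [
    (3, 10.0, 80.0, 8.0),
    (3, 12.0, 70.0, 9.0),
    (17, 20.0, 50.0, 110.0),
    (17, 22.0, 40.0, 120.0),
]
factors = hourly_factors(sample)
print(factors)
print(adjusted_prediction(17, 21.0, 45.0, factors))
```

The same factors could be computed separately for weekdays and weekends. Note that the sketch assumes the base prediction total for each hour is not zero, and an hour that never appears in the data falls back to a factor of 1.0.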