Alteryx Designer Desktop Discussions


Score predictive tool is giving negative values for a positive target variable

fa5fou5
5 - Atom
 

Hi,

I have an assignment for my class.

I am creating a predictive model for bike usage from the Bixi dataset:

https://www.bixi.com/en/page-27

We have to create a model that predicts bike usage, that is, the number of bike users per hour, and we have been given two datasets:

1. bike usage (from bixi)

2. temperature dataset 

I have created the model below for prediction.

I have attached a sample of the dataset used as the model input (a blended dataset from the Bixi and temperature files).

The problem is that the model is predicting negative values when it shouldn't (the number of passengers cannot be negative!).

Any help, thanks!

 

5 REPLIES
AngelosPachis
16 - Nebula

Hi @fa5fou5 ,

 

I don't find it strange that the model is predicting some negative values. Based on the predictor variables you have selected, the equation coefficients are the following:

 

AngelosPachis_0-1609231848216.png

 

That gives an equation that looks like this:

 

Number of users = 1.225*X1 + 1.250*X2 - 6.261

where 

X1 : Start_Hour
X2 : Avg_Temperature

 

If your start hour is approximately equal to 1, and the same goes for your average temperature, then the returned value is negative: 1.225 + 1.250 - 6.261 = -3.786.

 

A linear regression line is unbounded, so for certain values of the predictor variables the predicted target will eventually become negative. The model cannot tell whether that makes sense or not; you have to interpret the result as necessary.
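
To make the arithmetic concrete, here is a minimal Python sketch that plugs values into the equation above. The function name predict_users is only illustrative; the coefficients are the ones quoted from the model report.

```python
def predict_users(start_hour, avg_temperature):
    """Evaluate the fitted line quoted above."""
    return 1.225 * start_hour + 1.250 * avg_temperature - 6.261

# Small inputs: the negative intercept dominates the sum.
low = predict_users(1, 1)

# Larger inputs push the prediction above zero.
high = predict_users(17, 20)

print(round(low, 3), round(high, 3))
```

Any row where 1.225*X1 + 1.250*X2 is below 6.261 will score negative, which is why the Score tool returns negative counts for early hours on cold days.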

AngelosPachis
16 - Nebula

Also, I am not totally sure about using Start hour as a predictor variable in a linear regression. One of the assumptions of a linear regression model is that X and Y are connected by a linear relationship.

 

If you use an Association Analysis tool and compare the number of users with the start hour, the relationship does not look very linear:

 

AngelosPachis_0-1609232939188.png

 

So that's probably an indication that linear regression is not the way forward, and it explains the negative values and why the model performs relatively poorly.

 

It would be interesting to know the more rigorous answer, though. Do you mind posting it here when you get the solution from your professor?
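
As a rough illustration of that point, the sketch below compares a straight-line fit on the hour with simply predicting the mean for each hour (that is, treating the hour as a category). The hourly counts are synthetic, made up to show two commute peaks; they are not the Bixi data, and the helper names are only illustrative.

```python
# Synthetic hourly counts (made up for illustration): two commute peaks.
pattern = [2, 1, 1, 1, 2, 5, 12, 25, 30, 18, 12, 14,
           16, 15, 14, 16, 24, 32, 26, 18, 12, 8, 5, 3]
hours = list(range(24)) * 2
users = pattern + [p + 2 for p in pattern]  # a second, slightly busier "day"

def fit_line(x, y):
    """Ordinary least squares for one predictor (closed form)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sse(predicted, actual):
    """Sum of squared errors."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

# Model A: hour as a numeric predictor (one straight line).
slope, intercept = fit_line(hours, users)
line_pred = [slope * h + intercept for h in hours]

# Model B: hour as a category (predict the mean observed for that hour).
hour_mean = {}
for h in set(hours):
    values = [u for hh, u in zip(hours, users) if hh == h]
    hour_mean[h] = sum(values) / len(values)
cat_pred = [hour_mean[h] for h in hours]

print("SSE straight line :", round(sse(line_pred, users), 1))
print("SSE per-hour means:", round(sse(cat_pred, users), 1))
```

On data shaped like this, the per-hour means fit far better than the single line, and because they are averages of non-negative counts they can never be negative.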

AngelosPachis
16 - Nebula

As a final side note @fa5fou5 , I would be tempted to increase the % of data used for the estimation (creation) of the model from the current 35 or so to around 60, and to set the validation sample to around 25-30%.

 

The estimation sample should be significantly larger than the validation one for a trustworthy prediction.
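
For readers working outside Designer, a split along those lines can be sketched in plain Python. This is only an analogy to the sampling step, not how the Alteryx tool is implemented; split_rows, its parameters, and the seed are assumptions made for the example.

```python
import random

def split_rows(rows, est_frac=0.60, val_frac=0.30, seed=0):
    """Shuffle rows and split them into estimation, validation and holdout lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_est = round(len(shuffled) * est_frac)
    n_val = round(len(shuffled) * val_frac)
    estimation = shuffled[:n_est]
    validation = shuffled[n_est:n_est + n_val]
    holdout = shuffled[n_est + n_val:]     # whatever is left over
    return estimation, validation, holdout

estimation, validation, holdout = split_rows(range(100))
print(len(estimation), len(validation), len(holdout))
```

With 100 rows and the default fractions, this yields 60 estimation rows, 30 validation rows, and 10 left over.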

 

Hope that helps,

 

Angelos

fa5fou5
5 - Atom

Thanks for the insight into how to read the coefficient values.
I was confused about how to read them!

fa5fou5
5 - Atom

I was unsure how to implement the time variable.

To make it even clearer, we can distinguish between two different models:

1. Weekend demand

2. Weekday demand

Screenshot 2020-12-29 085001.png

So yes, Start hour is not linearly related to the response, but it DOES have an effect on the number of users.

That gives me an idea, but I don't know how to implement it.

The idea:

Is there a way to build a regression model with these two variables:

1. Temperature

2. Humidity %

and then adjust the response with a weight factor that depends on the hour of the day? (I don't know whether a decision tree can be used for that, or whether there is a way to do it.)
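
One way the idea could be sketched in Python, under stated assumptions: base_model stands in for a regression on temperature and humidity (its coefficients here are hypothetical, not fitted to the Bixi data), and the hourly weight is taken as the ratio of actual to predicted totals for each hour. The sample rows are made up for illustration.

```python
from collections import defaultdict

def base_model(temperature, humidity):
    # Hypothetical coefficients, for illustration only.
    return 5.0 + 1.2 * temperature - 0.1 * humidity

def hourly_weights(rows):
    """rows: iterable of (hour, temperature, humidity, users).

    Returns, per hour, total actual users divided by total base prediction.
    """
    actual = defaultdict(float)
    predicted = defaultdict(float)
    for hour, temperature, humidity, users in rows:
        actual[hour] += users
        predicted[hour] += base_model(temperature, humidity)
    # Skip hours where the base prediction is not positive (ratio is meaningless).
    return {h: actual[h] / predicted[h] for h in actual if predicted[h] > 0}

def predict(hour, temperature, humidity, weights):
    """Scale the base prediction by the hour's weight; unseen hours use weight 1."""
    scaled = weights.get(hour, 1.0) * base_model(temperature, humidity)
    return max(0.0, scaled)  # a count cannot be negative

# Made-up rows: (hour, temperature, humidity, users)
rows = [(8, 20, 50, 48), (8, 10, 50, 24), (3, 20, 50, 6), (3, 10, 50, 3)]
weights = hourly_weights(rows)
print(weights)
print(predict(8, 15, 50, weights))
```

In this toy data the 8 o'clock weight comes out at 2.0 and the 3 o'clock weight at 0.25, so the same weather gives eight times more predicted users at the morning peak. Separate weight tables for weekdays and weekends would follow the same pattern.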

 

 
