Prediction modeling

Question

Hi community,

I have some store data and would like to build a predictive model that forecasts Revenue, so that I can calculate the percent error between the hold-out (actual) values and the forecast.

Following the requirements guide, this is what I have done so far:

1. Use the Select tool to split the hold-out data from the data set.

2. Run a Pearson correlation on the filtered data set from #1 to find the variables that meet the p-value requirement.

3. Run a linear regression on the variables found in #2.

4. Score the hold-out store data with the model from #3.
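For anyone who wants to sanity-check steps 1, 3, and 4 outside Alteryx, here is a minimal Python sketch, assuming a single predictor and made-up store records. The field names (`sqft`, `revenue`, `holdout`) and all numbers are illustrative assumptions and are not taken from the attached workflow.

```python
# Hypothetical store records; field names and values are illustrative only.
stores = [
    {"sqft": 1000, "revenue": 210.0, "holdout": False},
    {"sqft": 1500, "revenue": 290.0, "holdout": False},
    {"sqft": 2000, "revenue": 405.0, "holdout": False},
    {"sqft": 2500, "revenue": 510.0, "holdout": False},
    {"sqft": 3000, "revenue": 590.0, "holdout": True},
]


def fit_simple_ols(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Step 1: split the hold-out rows from the training rows.
train = [s for s in stores if not s["holdout"]]
holdout = [s for s in stores if s["holdout"]]

# Step 3: fit the regression on the original training rows.
slope, intercept = fit_simple_ols(
    [s["sqft"] for s in train], [s["revenue"] for s in train]
)

# Step 4: score the hold-out rows with the fitted model.
scored = [{**s, "predicted": slope * s["sqft"] + intercept} for s in holdout]
```

If the scored values come out identical to the actuals, the actual Revenue column is probably being passed through instead of the model output.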

Up to this point, my predicted values are identical to the hold-out (actual) values, which is also far from the expected output, so I cannot move on to the percent error calculation. I have attached the workflow below. Can anyone please shed some light?
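For reference, the percent error step itself is small once the forecasts differ from the actuals. A minimal sketch, assuming percent error means |actual - forecast| / |actual| (the requirements guide may define it differently, for example on totals rather than per store). The function name and the numbers are hypothetical.

```python
def percent_error(actual, forecast):
    """Absolute percent error of one forecast relative to the actual value."""
    if actual == 0:
        raise ValueError("percent error is undefined when actual is 0")
    return abs(actual - forecast) / abs(actual) * 100.0


# Illustrative hold-out actuals and model forecasts (made-up numbers).
actuals = [590.0, 420.0, 310.0]
forecasts = [607.5, 399.0, 325.5]

# Percent error for each hold-out store.
per_store = [percent_error(a, f) for a, f in zip(actuals, forecasts)]

# Percent error on the totals, which weights large stores more heavily.
overall = percent_error(sum(actuals), sum(forecasts))
```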

BTW: I am not quite sure about one of the requirements: "Find all variables that have a significant Pearson correlation (p < .1) to Revenue." My understanding of Pearson correlation is that the closer the value is to 1, the more significant it is. Why would we want the variables with p < .1 in data analytics?

store revenue prediction.yxmd

FrederikE · Accepted Answer

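True, in our workflows the Pearson correlation coefficient is used, which has to be high (close to 1 or -1) for a correlation to be meaningful. The p-value in the requirement is a different quantity: the smaller it is, the stronger the evidence that the correlation is not due to chance, which is why the threshold is p < .1.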

You might want to use the Association Analysis tool to determine the corresponding p-values.
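To make the difference between the two numbers concrete: the correlation coefficient r measures strength (close to 1 or -1 is strong), while the p-value estimates how likely an r that large would be by chance (close to 0 is significant). The sketch below is not what the Association Analysis tool does internally. It swaps in a permutation test so that it needs only the Python standard library, runs on made-up data with hypothetical variable names, and then keeps the variables with p < .1.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Made-up data: revenue tracks sqft closely, but not the parking figure.
revenue = [210.0, 290.0, 405.0, 510.0, 590.0, 700.0, 780.0, 905.0]
candidates = {
    "sqft": [1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500],
    "parking": [12, 3, 9, 5, 11, 4, 10, 6],
}

# For each candidate variable, store (r, p) against revenue.
results = {
    name: (pearson_r(xs, revenue), permutation_p_value(xs, revenue))
    for name, xs in candidates.items()
}

# Keep variables whose correlation with revenue is significant at p < .1.
significant = [name for name, (r, p) in results.items() if p < 0.1]
```

In this toy data `sqft` has a high r and a small p, so it passes the filter, while `parking` has a weak r and a large p, so it is dropped. The regression is then fit on the original columns of the selected variables, not on the r or p values themselves.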

I am not sure I understand what you are trying to do, since the input to the Linear Regression tool should always be the original data, not the correlations or p-values.

As you can see from my approach, this also leads to a reasonable value (10% error), although the wrong variables might have been chosen.

aubh · Accepted Answer

Yep, I totally agree with you about the input to the linear regression. I made a mistake at the very beginning by using the output from the Pearson Correlation tool.

I think I've just solved it. Attached my updated workflow.

However, there is one thing I am still not sure about: the Association Analysis tool only seems to provide its output to a Browse tool, in report format. Is there a way to output the Association Analysis result and filter the variables I need, similar to how you output from the Pearson Correlation tool earlier?