Hi all
I'm working with a continuous data set to try to forecast prices for metals (in this example, nickel), and I'm comparing the relative accuracy of several different models, including a Random Forest model. However, when I apply a model trained on three years of data (2015 to 2017) to the validation period (2018 and 2019), alongside a model trained on data prior to 2015, I get the diagnostic plot as attached. In short, the 2015_17 model won't forecast above ~14,500, while the model trained on pre-2015 data won't forecast under ~14,000. This is odd given that the target variable has ranged between ~9,000 in 2015 and ~18,000 in 2014 and 2019, and the variable importance plots show the models have access to all of the predictor variables deemed important, which track the target variable quite closely.
Any assistance is much appreciated.
There's definitely something fishy going on here. Any chance you could share the data+workflow with the Community?
Thanks Charlie -
I've attached the workflow and data. After doing a bit of troubleshooting myself I was able to work around the problem by re-cutting my training set to before 2016 and then using the 4+ years of data from then on as my test period. The data for 2016 is not of the same quality as other years, with some sets containing a number of 0s, although not for the variables that are prominent in the variable importance plots. Anyhow, while this change might have worked for the Random Forest model, it didn't work for the Decision Tree, where the forecasts are not continuous but instead grouped at distinct levels (workflow and diagnostics also attached). Would be great to get your thoughts on what is going on here.
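As an aside, the "grouped at distinct levels" pattern is easy to reproduce with a plain regression tree on synthetic data, since a single tree can only ever predict one value per leaf. A minimal sketch (scikit-learn, hypothetical numbers, not the attached workflow):

```python
# Minimal sketch showing why decision tree forecasts cluster at a handful of
# distinct levels: a regression tree can only predict the mean of one of its leaves.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 5))                  # hypothetical predictors
y = 9000 + 9000 * X[:, 0] + rng.normal(0, 300, 1000)    # target in a nickel-like range

tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
preds = tree.predict(X)

# With max_depth=4 there are at most 2**4 = 16 leaves, so at most 16 distinct
# predicted values -- the same stepped pattern seen in the diagnostics.
print(len(np.unique(preds)))
```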
Thanks,
Piers
I think the first thing we need to talk about is the target variable range. In this case, we're looking to estimate values of [ThreeSixDaysAhead]. When we break out the observations by your date range groups, here's the violin plot of the target variable:
Looking at these distributions, it's no surprise that there's very little overlap in estimated values between the two breakouts. Whatever the type of model, the range of values it is trained on will be reflected in the range of values it estimates.
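To make that concrete, here's a minimal sketch (scikit-learn, synthetic numbers, not your workflow) of how a tree-based model's forecasts stay bounded by the target range it was trained on, even when the predictors clearly point to higher values:

```python
# Minimal sketch: a random forest averages leaf means from its training data,
# so its predictions are bounded by the training target range and it cannot
# extrapolate beyond it.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(9000, 14500, size=(2000, 1))   # stand-in for the 2015-17 period
y_train = x_train[:, 0]                               # target tracks the predictor exactly

x_valid = rng.uniform(14500, 18000, size=(500, 1))    # stand-in for 2018-19 levels
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(x_train, y_train)

# Even though the predictor "knows" the higher values, the forecast tops out
# near the training-period maximum of ~14,500.
print(rf.predict(x_valid).max())
```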
So given the plot of [ThreeSixDaysAhead], why were those date-based model breakouts selected?
Thanks Charlie, your help is much appreciated
My thinking was to split the models by year groupings in the hope of seeing them improve with the passage of time, while retaining the same length of training history (I now see that additive models, i.e. 2012_2014, 2012_2015, 2012_2016, etc., would probably be better; a rough sketch of that expanding-window approach is below). From there I wanted to compare the performance of the different models over a validation period, say 2018 onwards. This was because I didn't want them to have the benefit of forward-looking information: if I simply select the full data history in an RF/DT/Boosted model and connect it to the Score tool to get the forecasts, the models would have been trained on data that would not have been available at the time of a historical prediction, which of course is not realistic.
So in answer to your question, the breakouts are fairly arbitrary, chosen in an attempt to keep them 'pure'. However, I appreciate now that the more information a model has available, the better its pattern recognition, so there is a trade-off between the accuracy of the predictions and the length of the prediction history.
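For reference, something like the expanding-window idea above could look like the following sketch (pandas/scikit-learn; the file name and Date column are assumptions, only [ThreeSixDaysAhead] comes from the actual data):

```python
# Minimal sketch of an expanding-window backtest: each model is trained only
# on years up to a cutoff and scored on the following year, so no
# forward-looking information leaks into training.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("nickel.csv", parse_dates=["Date"])   # hypothetical file/column names
df["Year"] = df["Date"].dt.year
features = [c for c in df.columns if c not in ("Date", "Year", "ThreeSixDaysAhead")]

for cutoff in range(2014, 2019):                        # 2012-2014, 2012-2015, ...
    train = df[df["Year"] <= cutoff]
    test = df[df["Year"] == cutoff + 1]
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(train[features], train["ThreeSixDaysAhead"])
    preds = model.predict(test[features])
    print(f"{cutoff} -> {cutoff + 1}: forecast range {preds.min():.0f} to {preds.max():.0f}")
```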