Alteryx Designer Desktop Discussions

sjm · ‎07-14-2017

Hi All - I’ve created a forest model to predict future units. I took 70% of a dataset to train the model and 30% to test the model.

I was wondering if anyone can offer any advice on the following questions:

When we put the model into production, should we use that same model (trained on 70% of the original data) to score the new records each week? Or should our production model be trained on 100% of the original data?
We were thinking about the possibility of adding a new week of data to the training set on a weekly basis. If we do this, should we use 100% of the new records or should we set 30% aside (to test to see if the new data adds value)?

Please let me know if I can clarify anything.

Thanks,

Steve

JohnJPS · ‎07-14-2017

Hi Steve,

For me, it seems that the 70/30 split is useful for validating hyperparameter tuning (e.g. config panel settings). Once you're happy with those, retraining with 100% of the data generally gives a slightly better model, so that's my preference.

For retraining, I would do the same thing: tune hyperparameters using a 70/30 split, and then retrain on 100%. If I'm comfortable over time that my hyperparameters never need alteration, then I'd just retrain on the new 100%.
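
A minimal sketch of that workflow, in case it helps to see it spelled out. This is not Alteryx code and not the Forest Model tool: to keep it dependency-free, a tiny nearest-neighbour averager stands in for the forest model, and its `k` stands in for the config panel settings. The function names (`knn_predict`, `holdout_mae`, `tune_then_refit`) and the synthetic data are all made up for illustration.

```python
import random
from statistics import mean


def knn_predict(train, x, k):
    """Stand-in model: average the targets of the k training rows nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return mean(y for _, y in nearest)


def holdout_mae(train, test, k):
    """Mean absolute error on the held-out rows for a given hyperparameter k."""
    return mean(abs(knn_predict(train, x, k) - y) for x, y in test)


def tune_then_refit(rows, candidate_ks, train_frac=0.7, seed=0):
    """Pick k on a 70/30 split, then build the final model on 100% of the rows."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    train, test = shuffled[:cut], shuffled[cut:]

    # Step 1: the split is used only to compare hyperparameter settings.
    scores = {k: holdout_mae(train, test, k) for k in candidate_ks}
    best_k = min(scores, key=scores.get)

    # Step 2: the production model uses the chosen setting and all rows.
    def final_model(x):
        return knn_predict(rows, x, best_k)

    return best_k, scores, final_model


# Synthetic (feature, units) rows, purely illustrative.
rng = random.Random(42)
rows = [(x, 3.0 * x + rng.gauss(0, 1.0)) for x in range(100)]

best_k, scores, model = tune_then_refit(rows, candidate_ks=[1, 3, 5, 9])
print("chosen k:", best_k)
print("prediction at x=50.5:", model(50.5))
```

For the weekly refresh, the same function could be called again on the enlarged dataset; once the chosen setting stops changing from week to week, the split step could be skipped and the model simply refit on everything.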

That's just me; I'd be interested in other viewpoints too though.

sjm · ‎07-14-2017

Thanks for sharing John! This is helpful. Open to hearing any other views, but we'll definitely consider this approach.

Predictive Modeling – Training/Testing Question