
Alteryx Designer Desktop Discussions

SOLVED

Training a Model recommended procedure

datascot
7 - Meteor

Hi,
      I'm quite new to this, but I wanted to check my approach to using training and test data when training a model.

Am I right in thinking that each time you try the training data on a model, you should save the model using the Output tool?

Then, once you have run a few variations and stored the different models, you decide which one(s) to run against the test data?

I'm just a bit confused about whether you should only use the test data right at the end of the whole training process, or whether it's OK to try a model on training data and then on test data repeatedly throughout the process?

(But isn't that allowing the models to "learn" from the test data?)

thanks
jim

3 REPLIES
tcroberts
12 - Quasar

You've got the right idea that you don't want to expose your model to the test data too often. What you're thinking of is called "optimization bias": you end up learning the hyperparameters (decision tree depth, regularization strength, etc.) that work best on your particular dataset but may not be optimal in general. Another way to think about this is in terms of "overfitting".

 

To avoid this, you have a few options. If you have a large amount of data, you could split your training data again, giving yourself a training set, a test set, and a validation set. You then set aside the validation set and avoid checking your model against it until you've finished your tuning. This lets you use your test data throughout the process while mitigating some of the risk of optimization bias. There's a rough sketch of this idea just below.
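To make that concrete, here is a minimal sketch in Python/scikit-learn (the file path, column name, and depth values are just placeholders, not anything from your workflow). It follows the naming above: the "test" set is used during tuning, and the "validation" set is only scored once at the very end.

```python
# Sketch of a train / test / validation split for tuning without peeking
# at the final hold-out. Placeholder data path and column names.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("my_data.csv")                      # placeholder path
X, y = df.drop(columns="target"), df["target"]       # placeholder target column

# First carve off the final hold-out ("validation") set and don't touch it.
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Split the remainder into training and test data for tuning.
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# Tune freely against the test set...
best_model, best_score = None, -1.0
for depth in [3, 5, 10]:                              # illustrative hyperparameter values
    model = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    if score > best_score:
        best_model, best_score = model, score

# ...and only score the chosen model on the hold-out once, at the end.
print("final hold-out accuracy:", accuracy_score(y_val, best_model.predict(X_val)))
```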

 

Another alternative, for when you don't have enough data for the above approach, is exactly as you mentioned: create a number of models that you think should perform well a priori, and once you're done, look at their results on the test set. If you take this approach, it's important to pick a reasonably small number of models to check, because looking at a million possible models is effectively the same as peeking at the test data the whole time. A sketch of this follows.
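As a rough illustration of that second approach (again just a sketch, with synthetic placeholder data and an arbitrary choice of candidate models, not a recommendation):

```python
# Sketch of the "small, pre-chosen set of candidates" approach:
# fit a handful of models you expect to do well, then look at the
# test set exactly once to compare them.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, random_state=0)   # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# A small, fixed set of candidates chosen up front.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree":       DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest":       RandomForestClassifier(n_estimators=200, random_state=0),
}

# One pass over the test set: fit each candidate on the training data,
# score it once, and pick the winner.
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```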

 

One other thing: it really depends on your goals for the model. If you're looking more for statistical interpretation and inference about the magnitude of the effects of your predictor variables, then these rules are much more rigid, in order to keep your statistical conclusions valid (look up multiple comparisons / family-wise error rates for more on this). If you're simply looking to make predictions with high accuracy, you can be a little more flexible.

 

Let me know if you want me to expand on anything. Hope this helps,

 

Cheers!

datascot
7 - Meteor

Hi,

      Thanks very much, that's all really helpful!


I'll go with the training/validation/test split as you suggested (I'm just looking to make predictions with high accuracy).

 

much appreciated!

J

tcroberts
12 - Quasar

Glad I could help! Let me know if you have any other modelling questions (:
