Hi all -
First time user of this feature. I have a workflow to train a model and am trying to understand how I will deploy it. The video I watched here through Alteryx said to simply connect the new data to the "Predict Values" node "D" input and go. But if I do that with the same workflow I created the model in, with the original training data still connected, does it retrain the model again on that original data before scoring? That would be weird.
Also, does the "Predict Values" node apply the same processes that were done in the "Random Forest" container (set data types, clean up missing values, etc.)? Trying to understand how I need to prep data before scoring as some of the steps were done before using the assisted model node. If someone could assist or point me towards some documentation or video, that'd be great.
Thanks in advance!
James
Hi @mceleavey, do you have any input on this? I have seen you using this feature 🙂
Hi @jsantosjkc ,
You need to have the data split between the Train dataset and the Test dataset. These sets need to be split just before they go into the assisted modelling process, so after any manual feature engineering you may have done.
I see you are using the Create Samples tool. In the sample output you are using to train the model, you will have those records where you historically know the answer. From the test output, connect up to the D input of the Predict Values tool.
Alternatively, please provide the workflow and some data and I'll build it for you.
M.
ps - Cheers @atcodedog05.
Thanks for your quick response @mceleavey and for the referral @atcodedog05!
So I did run the test data through the predict node and built a lift chart (or at least calculated the metrics) to see how it was performing.
Say I'm happy with the configuration and ready to deploy with new production data (so no dependent variable present). Can you let me know how that would look? With my old tool, I would save the "model" (an output of the model node) to my drive, then build a separate, new workflow with the same data prep steps, call up the saved model and use a "predictor" node to apply the model to the prepared data. Is it a similar process with this workflow in Alteryx? Or do I simply drop an "Input Data" node into the workflow above and run everything? Again, it's a little confusing because in the latter case I would worry that the model would get retrained.
Again, appreciate your guidance on this!
- James
Hi @jsantosjkc ,
No problem at all.
It depends what you want to do. You can use Alteryx Promote to deploy the model to an API endpoint and stream in data for real-time results, or you can do what you described. Using Promote, you would deploy the trained model essentially offline; it would serve as your live model, into which you would stream new records to get your results.
In the live environment you would drop the Create Samples tool. At the moment you have simply split one data stream to form both the Train and Test datasets, whereas in reality the historic records would train the model and the new records coming in would be scored by it. Union the two streams together before any feature engineering, so that both have the same columns and the same data formats, then split the known records from the new ones. The known records form your training set and the new records form your Test set.
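To make the union-then-split idea concrete, here is a minimal Python sketch. It is not an Alteryx workflow, and the field names (`id`, `age`, `churned`) and the `prepare` helper are made up for illustration. It only shows why the prep is applied once to the combined stream before the known and new records are separated.

```python
# Hypothetical records: "churned" is the target; new records do not have it yet.
historic = [
    {"id": 1, "age": "34", "churned": 1},
    {"id": 2, "age": "51", "churned": 0},
]
incoming = [
    {"id": 3, "age": "29"},  # production record, no target column
]


def prepare(record):
    """Shared feature engineering, applied identically to every record."""
    return {
        "id": record["id"],
        "age": int(record["age"]),         # same data type for both streams
        "churned": record.get("churned"),  # a missing target becomes None
    }


# Union first, then prepare, so both streams end up with the same columns and types.
combined = [prepare(r) for r in historic + incoming]

# Split the known records (training set) from the new records (to be scored).
train = [r for r in combined if r["churned"] is not None]
to_score = [r for r in combined if r["churned"] is None]
```

In the workflow, `train` corresponds to what feeds the assisted modelling process and `to_score` to what goes into the D input of the Predict Values tool.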
So, you have two alternatives. You can keep running everything through Alteryx Designer, which means retraining the model on every run; that is fine if you have the processing punch or the server. Or you can deploy the model to an API endpoint and retrain it offline intermittently.
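For the second alternative, the general pattern is "train occasionally, save the result, score new data without retraining". The sketch below shows that pattern in plain Python with a toy nearest-centroid model saved as JSON. It does not use Promote or any Alteryx API; the `train` and `score` functions, the file name and the data are all assumptions for illustration.

```python
import json
import os
import tempfile


def train(rows):
    """Fit a toy nearest-centroid model: the mean feature value per class label."""
    sums, counts = {}, {}
    for x, label in rows:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def score(model, xs):
    """Assign each value to the class with the closest centroid. No training happens here."""
    return [min(model, key=lambda label: abs(model[label] - x)) for x in xs]


# Offline step, run intermittently: train on historic records and save the model.
historic = [(1.0, "no"), (2.0, "no"), (8.0, "yes"), (9.0, "yes")]
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(model_path, "w") as f:
    json.dump(train(historic), f)

# Scoring step, run on every new batch: load the saved model and apply it.
with open(model_path) as f:
    saved_model = json.load(f)
predictions = score(saved_model, [1.5, 7.0])
```

The scoring step never touches the historic data, so the model only changes when the offline step is rerun.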
I hope this helps,
M.
Ok, great, makes sense now. I'll continue to tweak and build out my workflow. Given the amount of data we have, running in Designer should not be an issue.
But to confirm, am I crazy or is this process a bit clunky? Even in Alteryx, if you use the standard predictive models, you can save those and build a dedicated scoring workflow for deployment. Would seem like a nice update to make.
Either way, thanks again for the assist!
The algorithm uses cross-validation to evaluate model performance, so you don't need to split your data into training and validation sets. My understanding is that the Predict Values tool is similar to the Score tool. Model comparison is built into the Assisted Modeling configuration process, so there is no need to validate your model against a separate validation set. The tools in the Predictive palette, by contrast, need the Model Comparison tool and a validation data set to compare model performance.
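For anyone unfamiliar with cross-validation, here is a rough Python sketch of how generic k-fold splitting works. I don't know the exact scheme Assisted Modeling uses internally, so treat this as the textbook idea only; `k_fold_indices` is a hypothetical helper, not an Alteryx function.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, validation_idx) pairs; each row is validated exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size


# Ten rows, five folds: each fold holds out two rows and trains on the other eight.
folds = list(k_fold_indices(n_rows=10, k=5))
```

A model is fitted once per fold on the training indices and evaluated on the held-out indices, and the scores are averaged. Every row gets used for validation once, which is why a separate hold-out split is less critical.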