Hi Everyone,
I am trying to help my son with a science project. I have attached a sample data file. What I want is to have Alteryx look at this data and make a prediction of whether a person will recidivate. The field "Recidivate" is the actual outcome for these data samples. I ran this through the Boosted Model and got outputs about the predictive variables, but I cannot figure out how to get an output of what Alteryx predicted for each input.
Can someone help?
Basically I need something like a confusion matrix and a data output for each item. See the PREDICTED column below.
Thanks
AgeCurrent | ValidLicense | YearsCompletedSchool | CurrentlyinSchool | Priors | Convictions | Misdemeanors | FTA | AgeFirstArrest | PretrialRelease | ProbationRelease | PersonCharge | PropertyChrage | PublicCharge | DrugCharge | TrafficCharge | REFUSED | THREAT | MENTAL | BAD_INFO | VIOL_CHARGE | PRETRIAL_REL | EXTRADITED | SELF_SURREND | PRIM_AGGRESSOR | PFA | GENDER | RACE | Recidivate | PREDICTED |
48 | TRUE | 0 | FALSE | 7 | 0 | 3 | 0 | 24 | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | MALE | WHITE | no | NO |
What you have done in the workflow is train two different models, but there has been no application of those models to data. You would use the model output to score the data. The report output of the models will tell you some information about the training of the model and the importance of variables, but it is "metadata related": it talks about the fields, not the data rows.
On the Score tool, plug the model output (O) into the Score tool's model input (M) and your data into the data input (D). As you have a binary response variable, you will then get two extra columns, Score_No and Score_Yes. You can then use those to create a confusion matrix or similar. The Model Comparison tool or Lift Chart tool is something else you may want to play with.
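If you want to build the confusion matrix part yourself (for example in the Python tool, or on a CSV you write out after the Score tool), here is a minimal sketch of the idea. The file name "scored.csv", the column names, and the 0.5 cutoff are my assumptions for illustration, not anything Alteryx produces automatically beyond the Score_Yes/Score_No columns.

```python
import pandas as pd

# Assumed: scored data written to scored.csv with the actual outcome in
# "Recidivate" and the probability of "yes" from the Score tool in "Score_Yes".
scored = pd.read_csv("scored.csv")

# Turn the probability into a hard prediction; 0.5 is an arbitrary cutoff
# that you may want to tune.
scored["PREDICTED"] = scored["Score_Yes"].apply(lambda p: "yes" if p >= 0.5 else "no")

# Confusion matrix: actual outcome in rows, prediction in columns.
confusion = pd.crosstab(scored["Recidivate"], scored["PREDICTED"],
                        rownames=["Actual"], colnames=["Predicted"])
print(confusion)
```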
I also note that this is a full in-sample model, meaning that the data you are scoring is also the data the model was trained on. Ideally, you would train the model on one set of data and then score another set, or a more complete set. As you only have 100 records, this is fine, but you MAY be able to get better results by using the "Create Samples" tool. I say MAY, as you also may not with this number of records.
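The Create Samples tool is the native way to do the split, but conceptually it is just holding rows out so the model never sees them during training. A rough sketch of the same idea in Python (the file name and the 75/25 split are assumptions):

```python
import pandas as pd

data = pd.read_csv("recidivism.csv")   # assumed file name

# Hold out 25% of the rows for scoring/validation; the model never sees them.
train = data.sample(frac=0.75, random_state=42)
test = data.drop(train.index)

print(len(train), "training rows,", len(test), "holdout rows")
```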
If you would like to look further into this, check out the "New Donor" and "New Donor Score" sample workflows under "Help > Sample Workflows > Predictive ... > Predictive ..."
The actual data source has 120k rows. Could I sample the first 90k rows into the Boosted Model to train and then take the remaining 30k to score?
Very good. That's enough to get some real results. Short answer: yes, 75/25 will work. 80/20 is often used because people like the Pareto principle. As cross-validation is used, I wouldn't worry too much; it's not really about the ratio so much as the quantity in each set.
Also, a quick and dirty test to see whether your sets are valid is to run a few different splits and check that the confusion matrices come out similar. If there is high variability from one run to another, go back to the design.
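To make that "run a few sets" check concrete, here is a rough sketch using scikit-learn's gradient boosting as a stand-in for the Boosted Model (the file name, column names, and choice of GradientBoostingClassifier are my assumptions for illustration; it is not what Alteryx runs internally):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

data = pd.read_csv("recidivism.csv")   # assumed file name
X = pd.get_dummies(data.drop(columns=["Recidivate", "PREDICTED"], errors="ignore"))
y = data["Recidivate"]

# Repeat the split with different seeds; if the confusion matrices jump
# around a lot between runs, the split design needs another look.
for seed in (1, 2, 3):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = GradientBoostingClassifier(random_state=seed).fit(X_train, y_train)
    print(f"seed {seed}:\n{confusion_matrix(y_test, model.predict(X_test))}")
```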
Longer answer:
So, at its core, predictive modelling would have a training set, then a test set, acceptance of the model, and then application of the model to the real data (with an unknown target response). A production model would also have updated test sets that regularly check how the model is performing over time.
The Boosted Model uses cross-validation by default to "test" the number of trees that are included. You could also select other options.
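If you are curious what that cross-validation step is doing conceptually, here is a small sketch of picking a tree count by cross-validated score. It uses scikit-learn and synthetic data purely as an illustration; the tree counts tried and the 5-fold setting are arbitrary assumptions, and the Boosted Model automates this kind of search for you.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Try a few tree counts and keep the one with the best cross-validated score.
scores = {}
for n_trees in (50, 100, 200, 400):
    model = GradientBoostingClassifier(n_estimators=n_trees, random_state=0)
    scores[n_trees] = cross_val_score(model, X, y, cv=5).mean()

best = max(scores, key=scores.get)
print(scores, "-> best number of trees:", best)
```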
The actual split between train and test (validation) is a moving target depending on a fair few things, but degrees of freedom is usually a good first indicator of how much focus you need to give it. If you were to spend the time to test all the variables, you could refine the model and would most likely come up with five variables, or somewhere near that, based on the variable importance plot. (I say "most likely" and "somewhere near" because, with further analysis, you may find some correlations or something similar.) The higher your number of degrees of freedom, the more effort you might want to put into making sure your sets are valid, for instance.
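If you do want to poke at variable importance outside the report, here is a minimal sketch of reading importances off a fitted gradient boosting model (scikit-learn again as an illustrative stand-in; the file and column names are assumptions):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

data = pd.read_csv("recidivism.csv")   # assumed file name
X = pd.get_dummies(data.drop(columns=["Recidivate", "PREDICTED"], errors="ignore"))
y = data["Recidivate"]

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Rank the predictors; the top handful usually carry most of the signal,
# which is what the variable importance plot is summarising.
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).head(5))
```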
In selecting variables, you could spend a lot of time, but I wouldn't.