
Alteryx Designer Desktop Discussions

SOLVED

Machine Learning Tutorial - Need help with Boosted Model

JohnMaty
9 - Comet

Hi Everyone,

I am trying to help my son with a science project.  I have attached a sample data file.  What I want is to have Alteryx look at this data and make a prediction of whether a person will recidivate.  The field "Recidivate" is the actual outcome for these data samples.  I ran this through the Boosted Model and got output about the predictive variables, but I cannot figure out how to get an output of what Alteryx predicted for each input.

 

Can someone help?

 

Basically I need something like a confusion matrix and a data output for each item. See the PREDICTED column.

Thanks

 

Sample record from the attached file:

AgeCurrent: 48
ValidLicense: TRUE
YearsCompletedSchool: 0
CurrentlyinSchool: FALSE
Priors: 7
Convictions: 0
Misdemeanors: 3
FTA: 0
AgeFirstArrest: 24
PretrialRelease: FALSE
ProbationRelease: FALSE
PersonCharge: TRUE
PropertyChrage: FALSE
PublicCharge: FALSE
DrugCharge: FALSE
TrafficCharge: FALSE
REFUSED: FALSE
THREAT: FALSE
MENTAL: FALSE
BAD_INFO: FALSE
VIOL_CHARGE: FALSE
PRETRIAL_REL: FALSE
EXTRADITED: FALSE
SELF_SURREND: FALSE
PRIM_AGGRESSOR: FALSE
PFA: FALSE
GENDER: MALE
RACE: WHITE
Recidivate: no
PREDICTED: NO
4 REPLIES
JohnMaty
9 - Comet

Here is a sample of the workflow.  I am getting acceptable results, but I would like to know WHAT the model predicted for each record in relation to recidivism.   

KGT
12 - Quasar

What you have done in the workflow is train 2 different models; however, those models have not been applied to any data. You would use the model output to score the data. The report output of the models will tell you some info about the training of the model and the importance of variables, but it is "metadata related": it talks about the fields, not the data rows.

 

On the Score tool, plug the Model Output (O) into the Score tool's Model Input (M) and your data into the Data Input (D). As you have a binary response variable, you will then get 2 extra columns, Score_No and Score_Yes. You can then use those to create a confusion matrix or similar. The Model Comparison tool or the Lift Chart tool is something else you may want to play with.
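If it helps to see the same idea outside Designer, here is a rough Python/scikit-learn sketch of what scoring gives you: a probability per class for every record (the Score_No / Score_Yes analogue), a hard prediction, and a confusion matrix. The file name and column names are assumed from the sample data, not taken from your workflow.

```python
# Rough sketch only: a scikit-learn stand-in for the Boosted Model + Score tool,
# assuming a CSV with the sample columns and a yes/no "Recidivate" target.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix

df = pd.read_csv("recidivism_sample.csv")                        # hypothetical file name
X = pd.get_dummies(df.drop(columns=["Recidivate", "PREDICTED"],  # drop target and any
                           errors="ignore"))                     # pre-existing prediction column
y = df["Recidivate"]                                             # "yes" / "no" target

model = GradientBoostingClassifier(random_state=1).fit(X, y)

# "Scoring" adds one probability column per class for every record.
probs = model.predict_proba(X)                 # column order follows model.classes_
df["Score_no"], df["Score_yes"] = probs[:, 0], probs[:, 1]
df["PREDICTED"] = model.predict(X)             # hard yes/no prediction per record

print(confusion_matrix(y, df["PREDICTED"]))    # actual vs. predicted counts
```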

 

I also note that this is a fully in-sample model, meaning that the data you are scoring is also the data the model was trained on. Ideally, you would train the model on one set of data and then score another set, or a more complete set. As you only have 100 records, this is fine, but you MAY be able to get better results from using the "Create Samples" tool. I say MAY, as you also may not at this number of records.
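To make the in-sample vs. out-of-sample point concrete, here is a rough hold-out version of the same sketch, roughly what the "Create Samples" tool gives you in the workflow. It reuses the X and y built above; the 75/25 split is just an example.

```python
# Out-of-sample variant: train on one portion, score the held-out portion.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

model = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)  # fit on the 75%
print(confusion_matrix(y_test, model.predict(X_test)))                    # evaluate on the unseen 25%
```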

 

If you would like to look further into this, check out the "New Donor" and "New Donor Score" sample workflows under "Help > Sample Workflows > Predictive ... > Predictive ..."

AlteryxGui_W4hPvRDX2L.png

JohnMaty
9 - Comet

The actual data source has 120k rows.  Could I sample the first 90k rows into the Boosted Model to train it and then take the remaining 30k into the Score tool?

KGT
12 - Quasar

Very good. That's enough to get some real results. Short answer: yes, 75/25 will work. 80/20 is often used, as people like the Pareto principle. As cross-validation is used, I wouldn't worry too much; it's not really about the ratio but rather the quantity in each set.

 

Also, a quick and dirty test to see if your sets are valid is to run a few different splits and see whether your confusion matrix stays similar. If there is high variability from one run to another, go back to the design.
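Continuing the earlier sketch, that quick-and-dirty check could look roughly like this: repeat the split with a few different random seeds and eyeball whether the confusion matrices stay close.

```python
# Stability check sketch: if these matrices vary a lot between seeds,
# the split (or the model) needs another look.
for seed in (1, 2, 3, 4, 5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)
    m = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)
    print(f"seed {seed}:\n{confusion_matrix(y_te, m.predict(X_te))}")
```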

 

Longer answer:

So, at its core, predictive modelling would have a train set, then a test set, acceptance of the model, and then application of the model to the real data (with an unknown target response). A production model would also have updated test sets that regularly check how the model is performing over time.

 

The Boosted Model uses cross-validation by default to "test" the number of trees that are included. You could also select other options.
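The Boosted Model's internals aren't shown here, but the general idea of using cross-validation to pick the number of trees looks roughly like this in scikit-learn terms (the candidate tree counts are arbitrary examples):

```python
# Sketch of cross-validated selection of the number of trees (n_estimators);
# the Boosted Model does something similar in spirit when choosing its tree count.
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_grid={"n_estimators": [50, 100, 200, 400]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_)   # the tree count the CV folds prefer
```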

 

The actual ratio for train/test (validation) is a moving target depending on quite a few things, but degrees of freedom is usually a good first indicator of how much focus you need to give it. If you were to spend the time to test all the variables, you could refine the model and would most likely come up with 5 variables, or somewhere near that, based on the variable importance plot. (I say "most likely" and "somewhere near" because with further analysis you may find some correlations or something.) The higher your number of degrees of freedom, the more effort you might want to put into making sure your sets are valid, for instance.
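As a rough illustration of leaning on variable importance, the scikit-learn sketch above exposes the same kind of ranking as the Variable Importance Plot; keeping the top handful and refitting would look something like this (the cut-off of 5 is only the example number from the paragraph above, and `model`, `X_train`, etc. come from the hold-out sketch):

```python
# Sketch: rank predictors by importance, keep the top 5, and refit
# to compare against the full model.
import pandas as pd

importances = pd.Series(model.feature_importances_, index=X_train.columns)
top5 = importances.sort_values(ascending=False).head(5)
print(top5)

model_small = GradientBoostingClassifier(random_state=1).fit(X_train[top5.index], y_train)
print(confusion_matrix(y_test, model_small.predict(X_test[top5.index])))
```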

 

In selecting variables, you could spend a lot of time, but I wouldn't.

  • FTA, for instance, is a strong predictor in this model, but it should possibly be feature engineered: if it is what I think it is (Failure to Appear), then an unbounded count variable may not be the best form. It already has a reasonable distribution, so testing a few engineered fields derived from it (1/0, truncated to a max of 4 or 6, etc.) would be pretty simple.
  • The Age variables could also be looked at to determine what makes sense here: is AgeCurrent really the predictor, or is there just more chance of recidivism as people accumulate more years? Would years since first arrest be a better variable? Do both have validity? (The bit that jumps out at me is having two age variables showing up next to each other in the Variable Importance Plot; I'm sure there's some correlation there.) Rough sketches of both ideas follow this list.
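For reference, those feature engineering ideas might look roughly like this on the sample fields (column names assumed from the sample file, with FTA treated as a failure-to-appear count):

```python
# Rough feature engineering sketches on the assumed sample columns.
df["FTA_any"] = (df["FTA"] > 0).astype(int)                            # 1/0: ever failed to appear
df["FTA_capped"] = df["FTA"].clip(upper=4)                             # truncate the count at 4
df["YearsSinceFirstArrest"] = df["AgeCurrent"] - df["AgeFirstArrest"]  # alternative to the raw ages
```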

 

 
