Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Decision Tree, Oversampling and Score Question

Alteryx Certified Partner

Hi all,


I'm using a Decision Tree to classify users who read a newsletter (target=1, around 18k records) and those who don't (target=0, around 240k records). The objective is to identify what characteristics they have, and in the future send the email only to those with a higher probability of reading it.


1) I'm oversampling my target, because the percentage of target=1 records in my database is low. So I set it up to have 50% of records with target=1.

2) Then I'm running my decision tree model, also using the Create Samples tool, and I have a model I'm happy with: 74% accuracy, and a bit better at predicting positives than negatives.

3) The lift chart also looks good.

4) The next step is to use the Score tool to get a list of the users most likely to read the newsletter (higher probability of 1 based on the decision tree).


Here is where I have my main question:


If I connect the Score tool to my current model and the data AFTER the oversampling, I get a list of 5.5k users with a very high probability (more than 0.8) of target=1. Of those 5.5k, 4.5k are correctly classified (they are target=1 in my data), and just 1k are target=0. So among the users most likely to be target=1, I'm classifying 80% correctly.


If I connect the Score tool to my current model and the data BEFORE the oversampling (which, if I'm not wrong, is the correct way to do it), I also get a list of 4.5k users that are in the group with the highest probability of being target=1 and are correctly classified (exactly the same number as in the previous paragraph). But the probability is not 0.8, it's just 0.25! Why is this happening? In this case I specify that the target has an oversampled value (oversampled value = 1) and that the percentage of the oversampled value in the original data prior to oversampling is 7 (18k of 252k in total).


In fact, in this second case, the number of records with that 0.25 probability (the highest I get) of being target=1 but that are incorrectly classified (they are target=0 in my data) is much higher: around 14k. I think this makes sense, because now I have many more records in my data (I'm using the whole data set before oversampling), so a higher number of users is classified incorrectly.
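For what it's worth, the drop from 0.8 to roughly 0.25 is consistent with the standard prior-correction formula for probabilities from a model trained on oversampled data. This is a sketch of that general formula, not necessarily Alteryx's exact implementation (the function name and parameters are my own):

```python
def correct_oversampled_probability(p_over, pi_orig, pi_over=0.5):
    """Map a probability from a model trained on oversampled data
    back to the original class prior (standard prior correction)."""
    # Reweight each class by the ratio of original to oversampled prevalence
    pos = p_over * (pi_orig / pi_over)
    neg = (1 - p_over) * ((1 - pi_orig) / (1 - pi_over))
    return pos / (pos + neg)

# A score of 0.8 on 50/50 oversampled data, with an original prevalence
# of about 7% (18k of 252k), maps back to roughly 0.24:
print(round(correct_oversampled_probability(0.8, 18_000 / 252_000), 2))
```

So the same users sit at the top of the ranking either way; only the probability scale changes once the original 7% prevalence is taken into account, which lands very close to the 0.25 you observed.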


Still, I think this is not too bad. It means that from an initial scenario of around 260k users, of whom just 18k read my newsletter, I can get a list of approx. 20k where 4.5k (approx. 25%) really read the newsletter. And 4.5k is also 25% of my total readers. That means that by contacting just 8% of my total customers, I can reach 25% of my readers.


I'd say that in the end the probability the Score tool gives me reflects the proportion of correct classifications (80% correct using it AFTER the oversampling, and just 25% using it BEFORE the oversampling), but I wanted to know whether you think this is correct or whether something might be wrong with the model. Would you consider this a good approach and a more or less good predictive model?


I hope I've explained the case clearly enough.




Alteryx Community Team

It sounds as if you are modelling correctly.  Only you know your data, your use case, and how your results are being used.

Use of the "oversampled" option in the Score tool is related to how you're using the tool - scoring the efficacy of your model, versus scoring new data.  You will use that option differently if you're connecting from your oversampled data stream to test, rather than from your original data stream to score.

Alteryx Certified Partner


So just to clarify and complete your response...

If I connect the Score tool to the Validation set (from the Create Samples tool), which comes AFTER the Oversample tool, then I should check the box that says "The target field has an oversampled value" and complete the value and percentage fields.

However, if I connect the Score tool to the original set of data without using the Oversample tool (or a new set of data with the same unbalanced data), then I should leave the box unchecked.

Is this correct?