
SOLVED

Confused by different Random Forest error estimates

MrMagnus
6 - Meteoroid

Hello, 

 

This is my first post so please bear with me if I ask a strange / unclear question. 

 

I'm a bit confused by the output of a random forest classification model. I have a model that tries to predict 5 categories of customers.

The browse tool after the RF tool says the OOB estimate of error is 79.5%. If I calculate the error rate from the confusion matrix just below it (in the same browse tool), 62% are wrongly classified.

And if I use the score tool on the test set, 19% are wrongly classified (the training set has less than 1% wrongly classified using the score tool).

 

In my world they should all be fairly close to each other (except maybe the score from the training set).

Am I missing something?

 

The insanely good score from the training set makes me think my model is overfitted. How do I adjust the RF model to reduce that (if that is the problem)?

Thanks

4 REPLIES
SydneyF
Alteryx Alumni (Retired)

Hi @MrMagnus

 

OOB error is the mean prediction error on each training sample xᵢ, using only the trees that did not have xᵢ in their bootstrap sample. It has been shown in many tests to be a reliable estimate of performance, and it is typically considered a conservative measure, i.e., biased towards slightly higher percent error values.
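To make that definition concrete, here is a rough sketch in R with the randomForest package (the same package the Forest Model tool wraps), using the iris data as a stand-in and the package defaults; the seed and tree count are arbitrary:

library(randomForest)

set.seed(42)
# keep.inbag stores, for every tree, how many times each row was drawn into
# that tree's bootstrap sample (0 means the row is out-of-bag for that tree).
rf <- randomForest(Species ~ ., data = iris, ntree = 500, keep.inbag = TRUE)

# Per-tree predictions for every training row.
per_tree <- predict(rf, newdata = iris, predict.all = TRUE)$individual

# For each row, take a majority vote over only the trees where the row was OOB.
oob_pred <- sapply(seq_len(nrow(iris)), function(i) {
  oob_trees <- rf$inbag[i, ] == 0
  names(which.max(table(per_tree[i, oob_trees])))
})

# This hand-rolled OOB error should be close to the "OOB estimate of error rate"
# that print(rf) reports (small differences can come from vote ties).
mean(oob_pred != iris$Species)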

 

The OOB estimate of error and the Confusion Matrix should correspond with one another. However, in my test model I am seeing the same behavior where they don't. I will be investigating this matter further. For the time being, I would trust your calculation from the confusion matrix over the reported OOB estimate of the error rate. 

 

Is your test data set completely independent from your training data set? If not, this could definitely cause an artificially low percent error. Running the model on any of its own training data will produce artificially low error. This is an artifact of the way random forest classification works: it is expected that the error is very close to zero when you use the score tool on the training data of a classification random forest. This is described in a forum post by Andy Liaw, who maintains the R randomForest package (the package Alteryx uses in the Forest Model tool), as follows:

 

For the most part, performance on the training set is meaningless. (That's the case for most algorithms, but especially so for RF.) In the default (and recommended) setting, the trees are grown to the maximum size, which means that quite likely there's only one data point in most terminal nodes, and the prediction at a terminal node is determined by the majority class in the node, or the lone data point. Suppose that is the case all the time; i.e., in all trees all terminal nodes have only one data point. A particular data point would be "in-bag" in about 64% of the trees in the forest, and every one of those trees has the correct prediction for that data point. Even if all the trees where that data point is out-of-bag gave the wrong prediction, by majority vote of all trees you still get the right answer in the end. Thus, basically, the perfect prediction on the training set for RF is "by design".
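You can see this directly in R (a quick sketch with the randomForest defaults on iris as a stand-in, not the Alteryx macro itself):

library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# Scoring the training data itself: each record was in-bag for roughly 64% of
# the trees, and fully grown trees reproduce their in-bag records, so the
# majority vote is almost always right; the near-zero error is "by design".
training_error <- mean(predict(rf, newdata = iris) != iris$Species)

# Omitting newdata returns the out-of-bag predictions instead: each record is
# scored only by the trees that never saw it, which gives the honest estimate.
oob_error <- mean(predict(rf) != iris$Species)

c(training = training_error, out_of_bag = oob_error)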

 

Overfitting is indicated when the model performs poorly on an independent validation data set. Random Forest models are more robust and far more resistant to over-fitting than individual decision trees are. Only you know your data and your model, but based on what you have posted I would not say your model is overfitted. 

 

 

If you'd like more information on how Random Forest models work, here is documentation on Random Forest models published by Leo Breiman and Adele Cutler. 

 

Does this help clear up your confusion? 

 

Thanks!

 

SydneyF - Customer Support Engineer

 

 

 

 

MrMagnus
6 - Meteoroid

Hi @SydneyF

 

Thanks, yes, it cleared up some of my confusion on the confusion matrix :)

 

I have one follow-up question: how is the confusion matrix in the browse tool after an RF model created (based on what data)? I can't understand the R code and I can't reproduce it, either in R using randomForest or with the score tool straight after. As I have a classification forest, I get "likelihoods" out of the score tool, and I've assigned the class with the highest likelihood as the predicted class when I make my confusion matrix.

 

Another observation that looks a bit odd to me: in all cases I have checked, the "OOB estimate of the error rate" has been the same as the unweighted average of the "classification errors" in the confusion matrix. A small example on the often-used iris data set:

[Screenshot: confusion matrix from the Forest Model report for the iris data set]
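In plain R the comparison is easy to set up (a sketch with the randomForest defaults, not the Alteryx macro; note that on a balanced data set like iris the weighted and unweighted averages coincide anyway):

library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# One row per class plus a class.error column with the per-class error rates.
rf$confusion

# Unweighted average of the per-class error rates...
mean(rf$confusion[, "class.error"])

# ...versus the overall OOB error rate that randomForest itself reports.
rf$err.rate[rf$ntree, "OOB"]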

Thanks, 

Magnus

SydneyF
Alteryx Alumni (Retired)

Hi @MrMagnus

 

 

OOB, or out-of-bag, data is the data that is withheld from the construction of each tree. For each tree, a different training set is created by randomly sampling the training data with replacement, which leaves about one third of the training records out of that tree's construction. In total, each individual record in your training data participates in constructing about 64% of the trees and is withheld from the remaining 36%. The confusion matrix shows how each of your training records is classified by the trees that did not use that record in their construction. The confusion matrix in the Forest Model report output is calculated at a specific point determined by the cutoff on the votes (e.g., more than 50% of the trees where the record was withheld from construction voted that the record is setosa, so it is classified as setosa).
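Here is a rough way to rebuild that matrix from the OOB votes in R (a sketch with plain randomForest on iris; the "about one third" figure comes from (1 − 1/n)ⁿ ≈ e⁻¹ ≈ 0.37 for sampling with replacement):

library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# rf$votes holds, for every training record, the fraction of its OOB trees
# that voted for each class (rows sum to 1).
head(rf$votes)

# Taking the winning class per record and cross-tabulating it against the truth
# should match rf$confusion and the matrix in the report (vote ties aside).
oob_class <- colnames(rf$votes)[max.col(rf$votes)]
table(actual = iris$Species, predicted = oob_class)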

 

The output of the score tool is essentially the percentage of trees in the random forest that predicted the record belongs to a given category. Assigning the class with the highest likelihood provided by the score tool makes sense. Another option would be to set thresholds for each category (if the likelihood of x is > 0.60, then the record is x), and if a record does not meet any category threshold it is classified as uncertain. Likelihoods let you gauge how confident the model is when classifying a record.
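Both options can be sketched in R with plain randomForest (iris as a stand-in; the 0.60 cutoff is just a hypothetical threshold):

library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# type = "prob" returns, per record, the fraction of trees voting for each
# class, which is the same kind of "likelihood" the score tool produces.
votes <- predict(rf, newdata = iris, type = "prob")

# Option 1: assign the class with the highest vote share.
assigned <- colnames(votes)[max.col(votes)]

# Option 2: only assign a class when the top vote share clears a threshold,
# otherwise flag the record as uncertain.
assigned_thresh <- ifelse(apply(votes, 1, max) >= 0.60, assigned, "uncertain")

table(assigned_thresh)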

 

Nice catch on the OOB estimate of error rate. A colleague of mine and I have identified the specific R code that calculates this output, and it is an unweighted average of the classification errors. We are currently looking into the matter further.

 

Does this all make sense? Did I answer your questions? Please let me know!

 

Thanks!

 

SydneyF - Customer Support Engineer

 

MrMagnus
6 - Meteoroid

Yes, there is a clearing in the forest of confusion. Thank you
