Confused by different Random Forest error estimates

Question

Hello,

This is my first post so please bear with me if I ask a strange / unclear question.

I'm a bit confused about the output of a random forest classification model. I have a model which tries to predict 5 categories of customers.

The browse tool after the RF tool says the OOB estimate of error is 79.5%. If I calculate the error rate from the confusion matrix just below it (in the browse tool), 62% of records are wrongly classified.

And if I use the score tool on the test set, I get that 19% are wrongly classified. (The training set has less than 1% wrongly classified using score.)

In my world they should all be fairly close to each other (minus maybe the score from the training set).

Am I missing something?

The insanely good score from the training set makes me think my model is overfitted. How do I adjust the RF model to reduce that (if that is the problem)?

Thanks

SydneyF · Accepted Answer

Hi @MrMagnus

OOB, or out-of-bag, data is the data that is withheld from the construction of each tree. For each tree, a different training data set is created by randomly sampling the training data with replacement. On average, a little over one third of the training records (about 37%) are left out of each tree's sample. Put another way, each individual record in your training data participates in constructing about 63% of the trees and is withheld from constructing the remaining 37%. The confusion matrix shows how each of your training records is classified using only the trees from which that record was withheld. The confusion matrix in the Forest Model Report output is calculated at a specific point determined by the cutoff on the votes (e.g., if >50% of the trees where the record was withheld from construction voted that the record is setosa, it will be classified as setosa).
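To see where the "roughly one third withheld" figure comes from, here is a minimal simulation sketch in plain Python. It is not the code the Forest Model tool runs; the function name `oob_fraction` and the record and tree counts are made up for illustration. It draws a bootstrap sample per tree and counts how often a record is left out, then compares the result with the analytic value (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows.

```python
import random


def oob_fraction(n_records: int, n_trees: int, seed: int = 0) -> float:
    """Average fraction of trees for which a record is out of bag.

    Hypothetical helper for illustration only: each "tree" is just a
    bootstrap sample of record indices drawn with replacement.
    """
    rng = random.Random(seed)
    oob_count = 0
    for _ in range(n_trees):
        # Bootstrap sample: n_records draws with replacement.
        in_bag = {rng.randrange(n_records) for _ in range(n_records)}
        # Records never drawn are out of bag for this tree.
        oob_count += n_records - len(in_bag)
    return oob_count / (n_records * n_trees)


n = 500  # assumed number of training records
frac = oob_fraction(n_records=n, n_trees=200)
analytic = (1 - 1 / n) ** n  # chance a given record is missed by all n draws

print(f"simulated OOB fraction: {frac:.3f}")
print(f"analytic OOB fraction:  {analytic:.3f}")
```

Both numbers should land close to 0.37, which is why each record ends up being scored by only a bit more than a third of the trees in the OOB confusion matrix.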

The output of the score tool is essentially the fraction of trees in the Random Forest that predicted the record belongs to each category. Assigning the class with the highest likelihood provided by the score tool makes sense. Another option would be to set a threshold for each category (if the likelihood of x is > .60, then the record is x) and, if a record does not meet any category threshold, classify it as uncertain. Likelihoods allow you to determine the level of confidence the model has in classifying a record.
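The two options described above (take the highest likelihood, or require a per-category threshold and fall back to "uncertain") can be sketched as follows. This is a plain-Python illustration, not the score tool's implementation; the function names, the class labels, and the 0.60 thresholds are assumptions chosen to match the example in the text.

```python
from typing import Dict


def classify_by_max(likelihoods: Dict[str, float]) -> str:
    """Assign the category with the highest likelihood."""
    return max(likelihoods, key=likelihoods.get)


def classify_with_thresholds(
    likelihoods: Dict[str, float],
    thresholds: Dict[str, float],
    fallback: str = "uncertain",
) -> str:
    """Assign a category only if its likelihood exceeds that category's threshold.

    If several categories pass, the one with the highest likelihood wins.
    If none pass, the record is labelled with the fallback value.
    """
    passing = {
        label: p
        for label, p in likelihoods.items()
        if p > thresholds.get(label, 1.0)  # no threshold given: never passes
    }
    if not passing:
        return fallback
    return max(passing, key=passing.get)


# Hypothetical scored records (fractions of trees voting for each class).
thresholds = {"setosa": 0.60, "versicolor": 0.60, "virginica": 0.60}
confident = {"setosa": 0.70, "versicolor": 0.20, "virginica": 0.10}
borderline = {"setosa": 0.45, "versicolor": 0.40, "virginica": 0.15}

print(classify_by_max(borderline))                       # highest likelihood
print(classify_with_thresholds(confident, thresholds))   # passes its threshold
print(classify_with_thresholds(borderline, thresholds))  # no class passes
```

The borderline record shows the difference: the highest-likelihood rule still assigns a class, while the threshold rule flags it as uncertain.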

Nice catch on the OOB estimate of error rate. A colleague of mine and I have identified the specific R code that calculates this output, and it is an unweighted average of the classification errors. We are currently looking into the matter further.
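To illustrate how an unweighted average can sit well above the error rate you compute from the confusion matrix yourself, here is a small sketch. It assumes "classification errors" means the per-class error rates (one per row of the confusion matrix); that reading is an assumption, and this is not the R code referred to above. The confusion matrix is made up, with one large class and two small, poorly predicted ones.

```python
from typing import List


def overall_error(cm: List[List[int]]) -> float:
    """Share of all records that are misclassified (off-diagonal / total)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return 1 - correct / total


def mean_class_error(cm: List[List[int]]) -> float:
    """Unweighted average of per-class error rates (rows = actual class)."""
    errors = [1 - row[i] / sum(row) for i, row in enumerate(cm)]
    return sum(errors) / len(errors)


# Hypothetical confusion matrix: rows are actual classes, columns predicted.
cm = [
    [90, 5, 5],  # 100 records, 10% misclassified
    [6, 2, 2],   # 10 records, 80% misclassified
    [7, 2, 1],   # 10 records, 90% misclassified
]

print(f"overall error:          {overall_error(cm):.3f}")
print(f"unweighted class error: {mean_class_error(cm):.3f}")
```

In this made-up case the overall error is 22.5% while the unweighted class average is 60%, because the two small classes count as much as the large one. Imbalanced categories could produce a gap like the one between 62% and 79.5%, though the sketch does not establish that this is what happens in the tool.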

Does this all make sense? Did I answer your questions? Please let me know!

Thanks!

SydneyF - Customer Support Engineer

MrMagnus · Accepted Answer

Yes, there is a clearing in the forest of confusion. Thank you