The OOB (out-of-bag) error is the mean prediction error on each training sample xᵢ, using only the trees that did not have xᵢ in their bootstrap sample. In many empirical tests the OOB estimate has proven to be a reliable measure of performance. When it is off, it is typically considered conservative, that is, biased towards higher percent error values.
The OOB estimate of error and the confusion matrix should correspond with one another. However, in my test model I am seeing the same mismatch you describe. I will be investigating this matter further. For the time being, I would trust your calculation from the confusion matrix over the reported OOB estimate of the error rate.
Is your test data set completely independent from your training data set? If not, this could definitely cause an artificially low percent error. Scoring the model on any of its training data will give an artificially low error. This is an artifact of the way Random Forest classification works: it is expected that the error is very close to zero when you use the Score tool on the training data for a classification Random Forest. This is described in a forum post by Andy Liaw, who maintains the R randomForest package (the R package Alteryx uses in the Forest Model tool), as follows:
For the most part, performance on training set is meaningless. (That's the case for most algorithms, but especially so for RF.) In the default (and recommended) setting, the trees are grown to the maximum size, which means that quite likely there's only one data point in most terminal nodes, and the prediction at the terminal nodes are determined by the majority class in the node, or the lone data point. Suppose that is the case all the time; i.e., in all trees all terminal nodes have only one data point. A particular data point would be "in-bag" in about 64% of the trees in the forest, and every one of those trees has the correct prediction for that data point. Even if all the trees where that data points are out-of-bag gave the wrong prediction, by majority vote of all trees, you still get the right answer in the end. Thus basically the perfect prediction on train set for RF is "by design".
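The arithmetic behind that argument can be checked with a short sketch. It assumes each tree draws a bootstrap sample of n records with replacement from n training records, so the chance that a given record is in-bag for a tree is 1 - (1 - 1/n)^n, which is roughly 63% and close to the "about 64%" quoted above. It also takes the worst case from the quote, where every out-of-bag tree votes wrong. The function names are made up for illustration and are not part of randomForest.

```python
def in_bag_fraction(n_records: int) -> float:
    """Probability that a given record appears in a bootstrap sample of size n_records."""
    return 1.0 - (1.0 - 1.0 / n_records) ** n_records


def worst_case_correct_vote_share(n_records: int) -> float:
    """Expected share of correct votes for a training record, assuming (as in the
    quoted worst case) that in-bag trees always classify it correctly and
    out-of-bag trees always classify it incorrectly."""
    return in_bag_fraction(n_records)


if __name__ == "__main__":
    for n in (10, 150, 10_000):
        share = worst_case_correct_vote_share(n)
        # A share above 0.5 means the majority vote is still correct.
        print(f"n={n}: correct vote share {share:.3f}, majority correct: {share > 0.5}")
```

Because the in-bag share stays above one half for any realistic n, the majority vote on a training record comes out right even in this worst case, which is why scoring the training data says little about real performance.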
Overfitting is indicated when the model performs poorly on an independent validation data set. Random Forest models are more robust and far more resistant to over-fitting than individual decision trees are. Only you know your data and your model, but based on what you have posted I would not say your model is overfitted.
Thanks, yes, it cleared up some of my confusion on the confusion matrix :)
I have one follow-up question: how is the confusion matrix shown in a Browse tool after an RF model created (that is, based on what data)? I can't understand the R code, and I can't reproduce the matrix either in R using randomForest or with the Score tool straight after the model. As I have a classification forest, I get "likelihoods" out of the Score tool, and I have assigned each record to the class with the highest likelihood when making my confusion matrix.
Another observation that looks a bit odd to me: in all the cases I have checked, the "OOB estimate of the error rate" has been the same as the unweighted average of the "classification errors" in the confusion matrix. A small example on the often-used iris data set:
OOB, or out-of-bag, data is the data that is withheld from the construction of each tree. For each tree, a different training data set is created by randomly sampling the training data with replacement, so a bit more than one third of the training records are excluded from constructing each tree. In total, each individual record in your training data participates in constructing about 63% of the trees and is withheld from constructing the remaining 37% or so. The confusion matrix shows how each of your training records is classified using only the trees from which that record was withheld. The confusion matrix in the Forest Model Report output is calculated at a specific point determined by the cutoff on the votes (e.g., if more than 50% of the trees from which the record was withheld voted that the record is setosa, it is classified as setosa).
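As a rough illustration of that last step, here is a minimal sketch that turns per-record OOB vote tallies into a confusion matrix. The vote counts are invented for the example, and the sketch assumes the class with the most OOB votes wins (a simple plurality rule) rather than reproducing the exact cutoff logic in the Forest Model tool.

```python
from collections import Counter


def oob_predict(votes: Counter) -> str:
    """Return the class with the most out-of-bag votes (ties go to the
    alphabetically first class)."""
    return max(sorted(votes), key=lambda cls: votes[cls])


def confusion_matrix(actual, predicted, classes):
    """Build a nested dict: rows are actual classes, columns are predicted classes."""
    matrix = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


classes = ["setosa", "versicolor", "virginica"]

# Hypothetical data: true labels and the votes cast only by the trees
# that did NOT see each record during construction.
actual = ["setosa", "versicolor", "versicolor", "virginica"]
oob_votes = [
    Counter(setosa=180),
    Counter(versicolor=150, virginica=30),
    Counter(versicolor=70, virginica=110),  # misclassified by its OOB trees
    Counter(versicolor=5, virginica=175),
]

predicted = [oob_predict(v) for v in oob_votes]
matrix = confusion_matrix(actual, predicted, classes)

for cls in classes:
    print(cls, matrix[cls])
```

Each record is scored only by trees that never saw it, which is what makes this matrix different from one built by scoring the training data with the whole forest.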
The output of the Score tool is essentially the percentage of trees in the Random Forest that predicted the record belonged to each category. Assigning the class with the highest likelihood provided by the Score tool makes sense. Another option would be to set a threshold for each category (if the likelihood of x is > 0.60, then the record is x) and classify a record as uncertain if it does not meet any category threshold. Likelihoods allow you to determine the level of confidence the model has in classifying a record.
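The threshold idea could look something like the sketch below. The function name, the 0.60 thresholds, and the "uncertain" label are illustrative assumptions, and the likelihoods are made-up values standing in for Score tool output.

```python
def assign_class(likelihoods: dict, thresholds: dict, fallback: str = "uncertain") -> str:
    """Pick the most likely class among those whose likelihood exceeds its own
    threshold; return the fallback label if no class qualifies."""
    qualifying = {
        cls: p
        for cls, p in likelihoods.items()
        # A class without a configured threshold can never qualify (p > 1.0 is false).
        if p > thresholds.get(cls, 1.0)
    }
    if not qualifying:
        return fallback
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(sorted(qualifying), key=lambda cls: qualifying[cls])


thresholds = {"setosa": 0.60, "versicolor": 0.60, "virginica": 0.60}

confident = {"setosa": 0.10, "versicolor": 0.65, "virginica": 0.25}
borderline = {"setosa": 0.00, "versicolor": 0.55, "virginica": 0.45}

print(assign_class(confident, thresholds))   # versicolor
print(assign_class(borderline, thresholds))  # uncertain
```

Simply taking the highest likelihood would label the second record versicolor. The threshold version flags it for review instead.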
Nice catch on the OOB estimate of error rate. A colleague of mine and I have identified the specific R code that calculates this output, and it is an unweighted average of the classification errors. We are currently looking into the matter further.
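To show why the distinction matters, here is a small sketch comparing the overall error rate with the unweighted average of the per-class errors. The confusion matrix counts are invented, not output from the Forest Model tool. With equal class sizes the two numbers coincide, which may be why the difference is easy to miss on a balanced data set like iris. With unequal class sizes they diverge.

```python
def class_errors(matrix: dict) -> dict:
    """Per-class error: share of each actual class's records that were misclassified."""
    return {a: 1 - row[a] / sum(row.values()) for a, row in matrix.items()}


def overall_error(matrix: dict) -> float:
    """Share of all records that were misclassified."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(row[a] for a, row in matrix.items())
    return 1 - correct / total


def unweighted_mean_class_error(matrix: dict) -> float:
    """Plain average of the per-class errors, ignoring class sizes."""
    errors = class_errors(matrix)
    return sum(errors.values()) / len(errors)


# Hypothetical confusion matrices: rows are actual classes, columns are predictions.
balanced = {
    "setosa":     {"setosa": 50, "versicolor": 0,  "virginica": 0},
    "versicolor": {"setosa": 0,  "versicolor": 47, "virginica": 3},
    "virginica":  {"setosa": 0,  "versicolor": 4,  "virginica": 46},
}
unbalanced = {
    "setosa":     {"setosa": 50, "versicolor": 0,  "virginica": 0},
    "versicolor": {"setosa": 0,  "versicolor": 47, "virginica": 3},
    "virginica":  {"setosa": 0,  "versicolor": 2,  "virginica": 8},
}

for name, m in (("balanced", balanced), ("unbalanced", unbalanced)):
    print(f"{name}: overall {overall_error(m):.4f}, "
          f"unweighted mean of class errors {unweighted_mean_class_error(m):.4f}")
```

In the unbalanced case the small virginica class has a high error that the unweighted average counts as heavily as the large classes, so it reports a noticeably higher figure than the share of records actually misclassified.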
Does this all make sense? Did I answer your questions? Please let me know!