This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
I am seeing the variable importance plot based on Mean Decrease in Gine that the Forest Model report throws out. However, I really don't understand this metric. Will be great if someone can explain it to me very easily (I don´t have advance knowledge in statistics). And how to interpret it. For example, what does it mean for a variable to have a mean decrease in gini of 0.35?
In order to understand Mean Decrease in Gini, it is important first to understand Gini Impurity, which is a metric used in Decision Trees to determine how (using which variable, and at what threshold) to split the data into smaller groups. Gini Impurity measures how often a randomly chosen record from the data set used to train the model will be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset (e.g., if half of the records in a group are "A" and the other half of the records are "B", a record randomly labeled based on the composition of that group has a 50% chance of being labeled incorrectly). Gini Impurity reaches zero when all records in a group fall into a single category (i.e., if there is only one possible label in a group, a record will be given that label 100% of the time). This measure is essentially the probability of a new record being incorrectly classified at a given node in a Decision Tree, based on the training data.
Because Random Forests are an ensemble of individual Decision Trees, Gini Importance can be leveraged to calculate Mean Decrease in Gini, which is a measure of variable importance for estimating a target variable. Mean Decrease in Gini is the average (mean) of a variable’s total decrease in node impurity, weighted by the proportion of samples reaching that node in each individual decision tree in the random forest. This is effectively a measure of how important a variable is for estimating the value of the target variable across all of the trees that make up the forest. A higher Mean Decrease in Gini indicates higher variable importance. Variables are sorted and displayed in the Variable Importance Plot created for the Random Forest by this measure. The most important variables to the model will be highest in the plot and have the largest Mean Decrease in Gini Values, conversely, the least important variable will be lowest in the plot, and have the smallest Mean Decrease in Gini values.
Mean Decrease in Gini is a forest-wide weighted average of the decrease in the Gini Impurity metric between the parent and daughter nodes that a variable is splitting. It can be defined as the total decrease in node impurity (weighted by the proportion of samples reaching a given node) averaged across all of the trees that make up the forest. It does not directly translate to a percent decrease in node impurity.
You might find the following resources helpful for understanding Gini Impurity and Mean Decrease Gini: