
Help!... Mean Decrease in Gini for dummies

5 - Atom

 

Hi!

 

I am looking at the variable importance plot based on Mean Decrease in Gini that the Forest Model report produces. However, I really don't understand this metric. It would be great if someone could explain it to me in simple terms (I don't have advanced knowledge in statistics), and how to interpret it. For example, what does it mean for a variable to have a mean decrease in Gini of 0.35?

 

Many thanks!

Alteryx

Hi @Saprissa2018,

 

To understand Mean Decrease in Gini, it helps to first understand Gini Impurity, the metric Decision Trees use to determine how (using which variable, and at what threshold) to split the data into smaller groups. Gini Impurity measures how often a randomly chosen record from the training data would be incorrectly labeled if it were labeled at random according to the distribution of labels in its group (e.g., if half of the records in a group are "A" and the other half are "B", a record randomly labeled based on the composition of that group has a 50% chance of being labeled incorrectly). Gini Impurity reaches zero when all records in a group fall into a single category (i.e., if there is only one possible label in a group, a record will be given that label 100% of the time). This measure is essentially the probability of a new record being incorrectly classified at a given node in a Decision Tree, based on the training data.
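To make the 50% example concrete, here is a minimal sketch of the Gini Impurity calculation for a single group of records. The function name `gini_impurity` is just an illustrative choice, not something from Alteryx or the underlying R package.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini Impurity of a group: 1 minus the sum of squared label proportions."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention assumed here: an empty group has no impurity
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

# Half "A" and half "B": a randomly assigned label is wrong 50% of the time.
mixed = gini_impurity(["A", "A", "B", "B"])   # 0.5
# Only one label in the group: impurity is zero.
pure = gini_impurity(["A", "A", "A", "A"])    # 0.0
print(mixed, pure)
```

A Decision Tree looks for the split whose resulting groups have the lowest combined impurity.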

 

Because Random Forests are an ensemble of individual Decision Trees, Gini Impurity can be leveraged to calculate Mean Decrease in Gini, which is a measure of variable importance for estimating a target variable. Mean Decrease in Gini is the average (mean), across the individual decision trees in the random forest, of a variable's total decrease in node impurity, where each decrease is weighted by the proportion of samples reaching the node at which the variable is used to split. This is effectively a measure of how important a variable is for estimating the value of the target variable across all of the trees that make up the forest. A higher Mean Decrease in Gini indicates higher variable importance. Variables are sorted and displayed in the Variable Importance Plot created for the Random Forest by this measure. The most important variables to the model will be highest in the plot and have the largest Mean Decrease in Gini values; conversely, the least important variables will be lowest in the plot and have the smallest Mean Decrease in Gini values.
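The calculation described above can be sketched on a toy forest. This is only an illustration under simplifying assumptions: every tree is assumed to be trained on the same number of records (`n_total`), each tree is represented as a hypothetical list of `(variable, parent, left, right)` splits, and the exact scaling used by a given random forest implementation may differ.

```python
from collections import Counter

def gini(labels):
    """Gini Impurity of a group of labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples reaching the node."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)

def mean_decrease_gini(forest, n_total):
    """Sum each variable's weighted decreases within a tree, then average over all trees."""
    totals = {}
    for tree in forest:
        for var, parent, left, right in tree:
            totals[var] = totals.get(var, 0.0) + weighted_decrease(parent, left, right, n_total)
    return {var: total / len(forest) for var, total in totals.items()}

# Toy forest of two trees on four records (made-up data, for illustration only).
forest = [
    [("age", ["A", "A", "B", "B"], ["A", "A"], ["B", "B"])],   # one perfect split on "age"
    [("income", ["A", "A", "B", "B"], ["A", "B"], ["A", "B"]), # a split that separates nothing
     ("age", ["A", "B"], ["A"], ["B"])],                       # then "age" cleans up one child node
]
importance = mean_decrease_gini(forest, n_total=4)
print(importance)  # {'age': 0.375, 'income': 0.0}
```

Here "age" ranks above "income" because its splits remove impurity in both trees, while the "income" split leaves both child groups as mixed as the parent.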

 

 

For an introduction to Random Forests, please read Seeing the Forest for the Trees; an Introduction to Random Forests. For an introduction to Decision Trees, please see Planting Seeds; an Introduction to Decision Trees. These articles briefly explain Gini Importance and Gini Impurity respectively.

 

Does this make sense? Are there any points you would like clarification on? Please let me know!

 

5 - Atom

 

Thanks so much Sydney! Super helpful. So if I have a variable with a Mean Decrease in Gini of 0.35, does this mean that the variable on average decreases node impurity by 35%?

Alteryx

Hi @Saprissa2018,

 

Mean Decrease in Gini is a forest-wide weighted average of the decrease in the Gini Impurity metric between the parent and daughter nodes that a variable is splitting. It can be defined as the total decrease in node impurity (weighted by the proportion of samples reaching a given node) averaged across all of the trees that make up the forest. It does not directly translate to a percent decrease in node impurity. 
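Written out, one common formulation of this definition looks like the following. The symbols are my own notation for illustration: G(t) is the Gini Impurity of node t, n_t is the number of records reaching it, t_L and t_R are its daughter nodes, N is the number of records used to grow the tree, T is the number of trees, and S_k(x) is the set of nodes in tree k that split on variable x. Individual implementations may scale the result differently.

```latex
\Delta G(t) = \frac{n_t}{N}\left[ G(t) - \frac{n_{t_L}}{n_t}\,G(t_L) - \frac{n_{t_R}}{n_t}\,G(t_R) \right],
\qquad
\mathrm{MDG}(x) = \frac{1}{T}\sum_{k=1}^{T}\;\sum_{t \in S_k(x)} \Delta G(t)
```

Because the inner sum runs over every node where the variable is used, the value grows with how often (and how high up in the trees) a variable is chosen for splitting. That is why 0.35 is best read relative to the other variables in the same model rather than as a 35% reduction.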

 

You might find the following resources helpful for understanding Gini Impurity and Mean Decrease Gini:

 

An Example of Calculating Gini Impurity

Gini Impurity

How do you explain ‘mean decrease accuracy’ and ‘mean decrease gini’ in layman’s terms?

How does a tree decide where to split?

A simple & clear explanation of the Gini impurity?

Does Breiman's random forest use information gain or Gini index?

 

And these resources might be helpful to get started in interpreting Mean Decrease Gini:

 

How to interpret Mean Decrease in Accuracy and Mean Decrease GINI in Random Forest models

The difference between mean decrease in accuracy and mean decrease in Gini impurity in Random Forest

Random forest regression produce different importance ranking

In the R randomForest package for random forest feature selection, how is the dataset split for trai...

Selecting good features – Part III: random forests

Understanding variable importance in forest of randomized trees

 

Thanks!
