Alteryx Designer Desktop Discussions

SOLVED

Question about the result of Random Forest

Yuta
6 - Meteoroid

The results of the Random Forest tool report a value labeled "Mean of the squared residuals" once the model has been fit. At first I assumed this value was the MSE (mean squared error), but it does not match the RMSE^2 I calculated with the Model Comparison tool. So I would like to know the meaning and definition of "Mean of the squared residuals".

Yuta_1-1635227460380.png

 

6 REPLIES
VictorCruz
Alteryx

Hello @Yuta , how are you doing?

I ran the Random Forest tool on a dataset I have, and my results don't show that metric you cite.

VictorCruz_0-1635251811966.png

None of these sections includes a metric called "Mean of the squared residuals".


Could you share a screenshot of your results or your flow so I can better understand your case?

Thanks. Regards,

Victor Cruz
Sales Engineer, LATAM
Alteryx
Yuta
6 - Meteoroid

Hello Victor,

 

Thank you for your reply.

 

A screenshot of my workflow is below. I use the Boston house price data as the sample input file.

I use the default values for the model settings.

Yuta_0-1635293848751.png

The table below is the result from the Random Forest tool, where the MSE is displayed.

Yuta_1-1635293975657.png

 

I also calculated the RMSE with the Model Comparison tool, and the result is below.

Yuta_2-1635294144950.png

 

The RMSE (= MSE^0.5) derived from the MSE reported by the Random Forest tool does not match the RMSE calculated by the Model Comparison tool. So I would like to know how the Random Forest tool defines its MSE.

 

I have already confirmed that the RMSE output by the Model Comparison tool is calculated with the following equation.

(The RMSE calculated with this equation matches the value output by the Model Comparison tool.)

Yuta_3-1635295206739.png

 

Best Regards,

Yuta

 

 

VictorCruz
Alteryx

@Yuta, the Random Forest tool is built as a macro, and parts of that macro are written in R.

When viewing the R code called by the macro, it is possible to observe some important points:

1 - the package used by the tool is:
# Determine if the randomForest package is available
loadPackages("randomForest")

At this link, you can see the full documentation for this package.

Some important details from that documentation:
"Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression)."

mse - (regression only) vector of mean square errors: sum of squared residuals divided by n.
rsq - (regression only) “pseudo R-squared”: 1 - mse / Var(y).
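As a minimal sketch of where these values live, assuming the randomForest package is installed: the fitted object stores `mse` and `rsq` as vectors with one entry per tree, and the last entry is what the printed summary reports. The built-in `mtcars` data, the seed, and `ntree = 200` are illustrative choices here, not what the Alteryx tool uses.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only for repeatability

# Stand-in regression example on a built-in data set (an assumption,
# not the data from this thread).
rf <- randomForest(mpg ~ ., data = mtcars, ntree = 200)

# One entry per tree: the out-of-bag error after 1, 2, ..., ntree trees.
length(rf$mse)

# Final entries: the values summarized by print(rf) as
# "Mean of squared residuals" and "% Var explained" (the latter times 100).
final_mse <- tail(rf$mse, 1)
final_rsq <- tail(rf$rsq, 1)

print(rf)
cat("Final OOB MSE:         ", final_mse, "\n")
cat("Final pseudo R-squared:", final_rsq, "\n")
```

Running `print(rf)` next to the two `cat()` lines makes it easy to check which printed number corresponds to which component.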

Now that you know how the MSE is calculated, I believe you have your answer, right?

A nice option here is that you can customize each building block you bring into your workflow.
You can do whatever calculation you want, with whatever formula you want; this is one of the benefits of integrating open source with Alteryx.

If you have any further questions, I am available. Regards,

Victor Cruz
Sales Engineer, LATAM
Alteryx
Yuta
6 - Meteoroid

Hello Victor,

 

Thank you so much for your kind support.

 

My understanding is as follows. Is it correct?

  • The Random Forest tool and the Model Comparison tool use the same formula for the MSE: Σ(actual - prediction)^2 / n
  • The Model Comparison tool calculates the MSE using predictions for all input data. The Random Forest tool, on the other hand, calculates the MSE on OOB data (out-of-bag data: for each tree, the rows not drawn in the sampling process, roughly 37% of the rows when sampling with replacement).
  • As a result, the two calculations give somewhat different values.
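Written out, and assuming the out-of-bag convention described in the randomForest documentation, the two quantities look like this. The notation is introduced here for illustration and does not come from either tool.

```latex
\mathrm{MSE}_{\text{all}} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2,
\qquad
\mathrm{MSE}_{\text{OOB}} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i^{\text{OOB}}\bigr)^2
```

Here \hat{y}_i averages the predictions of every tree for row i, while \hat{y}_i^{\text{OOB}} averages only the trees whose bootstrap sample did not contain row i. Both sums run over all n rows; what differs is which trees contribute to each prediction, so the OOB value behaves like a held-out error and is usually the larger of the two.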

Best regards,

Yuta

VictorCruz
Alteryx

@Yuta 

My understanding is exactly the same as yours.


Because the Random Forest tool calculates the MSE on the OOB data and the Model Comparison tool calculates it on all input data, the two values differ.

 

One way to validate this would be to add a line of code to the Random Forest tool that calculates the MSE on all the data (this would require customizing the tool); we would then arrive at the same result as the Model Comparison tool.
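A rough way to try this check outside Alteryx is sketched below in plain R. It assumes the randomForest and MASS packages are installed, uses the `Boston` data set from MASS as a stand-in for the Boston house price file, and relies on the package defaults rather than the tool's exact settings, so the numbers will not necessarily match the screenshots above.

```r
library(randomForest)  # the package the Random Forest tool loads
library(MASS)          # assumed source of the Boston housing data

set.seed(42)           # arbitrary seed, only for repeatability
rf <- randomForest(medv ~ ., data = Boston)  # package defaults

# OOB MSE after the last tree: the value print(rf) reports as
# "Mean of squared residuals".
oob_mse <- rf$mse[rf$ntree]

# The same quantity recomputed from the OOB predictions stored in the fit;
# it should agree with oob_mse up to rounding.
oob_mse_check <- mean((Boston$medv - rf$predicted)^2)

# MSE when every tree predicts every training row, which is closer to
# scoring the training data with a separate comparison step.
all_mse <- mean((Boston$medv - predict(rf, newdata = Boston))^2)

cat("OOB MSE:       ", oob_mse, "\n")
cat("OOB MSE check: ", oob_mse_check, "\n")
cat("Full-data MSE: ", all_mse, "\n")
cat("Full-data RMSE:", sqrt(all_mse), "\n")
```

The full-data MSE is typically well below the OOB MSE, because most trees have already seen each row they are asked to predict.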

Any questions, please do not hesitate to contact me.
Best regards,

Victor Cruz
Sales Engineer, LATAM
Alteryx
Yuta
6 - Meteoroid

Hello Victor,

 

Thank you for your kind support.

My question was resolved thanks to your support.

 

Best regards,

Yuta
