
SOLVED

Help Understanding Random Forest Output for Missing Target Values

ceejunior301
5 - Atom

Hello!

I am a brand-new Alteryx user working on a predictive modeling workflow. I am using a Random Forest model to predict daily values of my target variable out to 2033. My predictors have no null values and are fully populated through 2033, but my target variable is only populated through July 31, 2025. My workflow currently filters the data: rows where the target is null go to a Score tool, and rows where the target is populated go to the Forest Model tool for training. I have a Browse tool attached to the R output of the Forest Model, and the O output goes to the M input of the Score tool. The Browse tool shows an MSE of 11 billion and an R-squared of 24%. This seems very off, and I am wondering whether I have set up my workflow correctly, although the predictions themselves do not look too bad.

My questions are:

Is using the Score tool in this way appropriate for predicting missing target values?

Is there a better way to determine how the model is actually performing?

Is there a better way to validate or improve this model within Alteryx?

 

Any insights or suggestions would be greatly appreciated!
Thanks
Calvin

4 REPLIES
KGT
13 - Pulsar

The setup is valid in terms of training the model on the historical data and using that model to score the future. However, 8 years of daily predictions would need a LOT of history to get a decent result. BUT the headline is that the data you are working with is Time Series data, and Random Forest is not ideal for this.

 

As this is Time Series data, have a look at some of the Time Series examples rather than a classification model. Also, even with Time Series, I rarely see data that would support daily predictions 8 years into the future, and I would question the validity of the hypotheses. The "rule of thumb" is 80% training, 20% prediction, but that is, of course, not set in stone. Also look at what you are trying to get from the prediction: is a model really going to accurately distinguish the value on Tuesday, July 18, 2028 from the value on Thursday, October 19, 2028? Would you be better off predicting weekly or monthly totals? Without knowing the type and regularity of the data etc., I can't say for sure, but a classification model shouldn't be the first try for this.
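
One way to check how a model actually performs, in the spirit of the 80/20 rule of thumb above, is to hold out the most recent 20% of the known history, train on the earlier 80%, and compare predictions against the held-out actuals. The Python sketch below is a minimal illustration outside Alteryx, not a description of what the Forest Model tool does internally; the function names (`chronological_split`, `rmse`, `r_squared`) and the toy numbers are hypothetical.

```python
import math

def chronological_split(rows, train_frac=0.8):
    """Split time-ordered rows into (train, holdout) without shuffling."""
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

def r_squared(actual, predicted):
    """1 - SS_res / SS_tot; this can be negative on a holdout set."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Toy daily history (hypothetical values), oldest first.
history = [100, 102, 101, 105, 107, 110, 108, 111, 115, 117]
train, holdout = chronological_split(history)

# Stand-in "model": predict the training mean for every holdout day.
# In practice this would be the fitted model scored on the holdout rows.
baseline = sum(train) / len(train)
predictions = [baseline] * len(holdout)

holdout_rmse = rmse(holdout, predictions)
holdout_r2 = r_squared(holdout, predictions)
print(f"holdout RMSE={holdout_rmse:.2f}, R^2={holdout_r2:.2f}")
```

Because RMSE is in the target's own units, it is often easier to judge than MSE: an MSE of 11 billion corresponds to an RMSE of roughly 105,000, which may or may not be large depending on the scale of the target.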

 

This is old (so some of the information around parallel cores etc. is no longer valid since AMP), but it may help as a starter for the different models in the Alteryx palettes.

 

A Random Forest is essentially a collection of aggregated decision trees, so in order to use it for Time Series you would need a sliding window: essentially predicting, then lightly re-training the model to incorporate that prediction (it's just generally not good). This is way more work and still not an ideal result for what you are trying to do.
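
The sliding-window idea can be sketched as follows: turn the series into rows of lagged values so a tree-based model has something to learn from, then forecast one step at a time, feeding each prediction back in as the newest lag. This is a simplified illustration in Python, not Alteryx functionality: `make_lag_rows`, `recursive_forecast`, and the mean-of-lags stand-in predictor are all hypothetical, and the sketch reuses predictions as inputs rather than re-training at each step (re-training would go inside the loop).

```python
def make_lag_rows(series, n_lags):
    """Pair each value with the n_lags values immediately before it."""
    features, targets = [], []
    for i in range(n_lags, len(series)):
        features.append(series[i - n_lags:i])
        targets.append(series[i])
    return features, targets

def recursive_forecast(predict, history, n_lags, steps):
    """Forecast `steps` values, appending each prediction to the window."""
    window = list(history)
    forecasts = []
    for _ in range(steps):
        next_value = predict(window[-n_lags:])
        forecasts.append(next_value)
        window.append(next_value)  # the prediction becomes an input
    return forecasts

def mean_of_lags(lags):
    """Hypothetical stand-in for a trained model's predict function."""
    return sum(lags) / len(lags)

series = [10.0, 12.0, 14.0, 16.0, 18.0]

# Training rows a model would be fitted on.
X, y = make_lag_rows(series, n_lags=2)

# Three steps beyond the end of the known series.
forecasts = recursive_forecast(mean_of_lags, series, n_lags=2, steps=3)
print(forecasts)
```

Because every forecast is built on earlier forecasts, errors can compound as the horizon grows, which is one reason this approach tends to struggle over long horizons.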

KGT
13 - Pulsar

As an extra, you seem to have a little idea of what you are trying to do and can feel your way around. I advise starting by understanding Linear Regression, then Decision Tree, followed by Logistic Regression/Forest/Boosted. You can run them side by side and compare outputs. (See Help > Samples > Predict.... > Donor Model in Alteryx Designer.)

 

The topics I would say to look into for a better feel of the area are:

  • Continuous (Time Series being a subset) vs categorical predictors and response variables.
  • Regression vs Classification models (although Logistic Regression falls under Classification using regression-like techniques)
  • Supervised vs unsupervised: This is especially relevant today, with a lot more unsupervised models being used.
ceejunior301
5 - Atom

Hey! Thank you for the detailed response. This is really helpful!

I had understood that Random Forest could be used for regression when the target variable is continuous, which is why I went that route initially. I’ve just started building a new flow to compare models like Linear Regression, Decision Trees, and others you mentioned to see how they perform.

Regarding the daily predictions: I’m actually aggregating the results into monthly totals after scoring, so I’m not aiming for pinpoint daily accuracy. My thinking was that training on daily data from 2022 onward would give the model more granularity and potentially better performance than training on monthly data alone, which felt like too small a sample.
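
The daily-to-monthly roll-up described above can be sketched as below. This is a minimal Python illustration with made-up scored values, not the actual workflow; inside Alteryx the equivalent step would likely be a Summarize tool grouping by month, which is an assumption rather than something stated in the thread. The name `monthly_totals` is hypothetical.

```python
from collections import defaultdict
from datetime import date

def monthly_totals(daily_predictions):
    """Sum (date, value) pairs into totals keyed by (year, month)."""
    totals = defaultdict(float)
    for day, value in daily_predictions:
        totals[(day.year, day.month)] += value
    return dict(sorted(totals.items()))

# Hypothetical scored output: one predicted value per day.
scored = [
    (date(2025, 8, 1), 120.0),
    (date(2025, 8, 2), 130.0),
    (date(2025, 9, 1), 90.0),
]
totals = monthly_totals(scored)
print(totals)
```

Evaluating the model at this monthly grain as well (comparing held-out monthly totals with predicted monthly totals) would measure the accuracy that actually matters for the use case.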

I do recognize that accuracy will likely degrade the further out we go (I expect it will only be good for about the next 12 months), but that level of uncertainty has been deemed acceptable for our use case.

I’ll definitely look into the time series examples and consider whether a different modeling approach might be more appropriate. Thanks again for the guidance!

ceejunior301
5 - Atom

PS: I am looking into ARIMA and ETS now. Thank you for introducing me to time series models! I am new to data analytics as well, so I will have to research this some more.
