Alteryx Designer Ideas

Share your Designer product ideas - we're listening!
Please implement the Ranger random forest package

Hello,

 

The randomForest package implementation in Alteryx works fine for smaller datasets but becomes very slow for large datasets with many features.

The open-source ranger package (https://arxiv.org/pdf/1508.04409.pdf) could help with this.

 

Along with XGBoost/LightGBM/CatBoost, it would be an extremely welcome addition to the predictive package!

13 Comments

I second your request for XGBoost to be added to the predictive tools.

Alteryx Partner

+1 great idea. I would mention @AshleyK @DrDan if we'd like to raise interest internally...

 

Random Forest (RIP Breiman) is a life saver in predictive work, and the benchmarks below show how fast the new package is compared to the existing package and some alternatives...

[Image: Dan2,.jpg (benchmark chart)]

Alteryx Partner

It might also be more productive to create a single topic for all R/Python packages we'd like to see in Alteryx or ones we'd like to improve.

Alteryx

The ranger package definitely needs to be looked at. Of the R packages we currently use, randomForest is the one I'm least happy with because of its finicky behavior. There have also been a huge number of speed improvements for random forest models since the algorithm was first developed, while the randomForest package is still based on Leo Breiman's and Adele Cutler's original (circa 2001) FORTRAN code. We did look at randomForestSRC a couple of years ago, but at that time we found it less performant than the original randomForest package.

 

Dan

Alteryx

We also looked at XGBoost a couple of years ago, but there were implementation issues with it (it didn't work directly with data frames at that time).

Alteryx Partner

Aside from the null value allergy and the 2GB model size limit (I use a lot of variables), I can't say the Alteryx Random Forest implementation is that bad.

In my experience, the C5 decision tree is a lot more finicky (it's allergic to white spaces in BOTH variable names and data, which needs to be looked at), and its graphical output leaves a lot to be desired.

 

- As for XGBoost, perhaps the Python implementation would be easier to implement?

 

- Deep Forest (https://github.com/kingfengji/gcForest) would be an interesting package to implement as well; it's a tree-based alternative to Deep Learning.

 

- KNN and K-Modes (for categorical clustering) would also be great to have; the more options the merrier.

Alteryx Partner

big + for fixing null value allergy in random forest

  • which can be done with a few lines of code actually
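
To illustrate the "few lines of code" point, here is a minimal sketch of median imputation, which is roughly the idea behind `na.roughfix` in R's randomForest package for numeric columns. It is written in plain Python rather than R just to keep it self-contained; the function name `impute_median` and the sample data are hypothetical, and the fallback to 0.0 for a fully empty column is my own assumption, not anything Alteryx does.

```python
from statistics import median

def impute_median(rows):
    """Return a copy of rows with None replaced by the column median."""
    columns = list(zip(*rows))
    fills = []
    for col in columns:
        observed = [v for v in col if v is not None]
        # Assumption: a column with no observed values falls back to 0.0.
        fills.append(median(observed) if observed else 0.0)
    return [
        [fills[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]

# Hypothetical numeric data with two missing cells.
data = [
    [1.0, 10.0],
    [None, 20.0],
    [3.0, None],
    [5.0, 40.0],
]
clean = impute_median(data)
```

Categorical columns would need a mode-based fill instead, which this sketch does not cover.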

++ for deep forest

  • looking forward to it
  • needs Alteryx to be able to utilize multiple cores in parallel, or GPUs maybe?

 

Alteryx Partner
Unlike deep learning, deep forest uses layers of random forests, so it doesn't require a GPU to reach decent performance.
Alteryx Partner

You're right, @marco_zara, though it's a massively parallelizable algorithm.

When the number of columns (variables) and rows increases, it still takes a lot of time to model things...

 

Recently a model of mine at a fintech took approx. 2 hours... a long wait if you need to do near-realtime learning or active learning...
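
On the "massively parallelizable" point: each tree in a forest is fit on its own bootstrap sample and never looks at the other trees, so the fits can be handed out to separate workers. Below is a toy sketch of that idea using one-split stumps on a single feature. All names (`fit_stump`, `fit_forest`, `predict`) and the data are hypothetical, and this is not how ranger or randomForest are implemented internally.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(args):
    """Fit a one-split 'tree' on a bootstrap sample of (x, label) pairs."""
    data, seed = args
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in data]  # bootstrap resample
    best = None
    for threshold in sorted({x for x, _ in sample}):
        left = [y for x, y in sample if x <= threshold]
        right = [y for x, y in sample if x > threshold]
        if not right:
            continue  # a split must leave rows on both sides
        l_label, r_label = majority(left), majority(right)
        errors = sum(y != l_label for y in left) + sum(y != r_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, l_label, r_label)
    if best is None:  # every sampled x was identical: predict one label everywhere
        label = majority([y for _, y in sample])
        return (float("inf"), label, label)
    return best[1:]

def fit_forest(data, n_trees=25, seed=0):
    """Fit the stumps independently; no task depends on another."""
    tasks = [(data, seed + i) for i in range(n_trees)]
    # Threads keep the sketch portable. Under CPython's GIL, real multi-core
    # speed-up for pure-Python work would need a process pool instead.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(fit_stump, tasks))

def predict(forest, x):
    """Majority vote over all stumps."""
    return majority([l if x <= t else r for t, l, r in forest])

# Hypothetical data: label is 1 when x >= 10.
data = [(x, int(x >= 10)) for x in range(20)]
forest = fit_forest(data)
```

Because the tasks share nothing, the wall-clock time should in principle shrink with the number of workers, which is the main reason multi-threaded implementations can be so much faster on wide datasets.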

Alteryx Partner
2 hours to train or to score? Here I'm doing churn prediction models on a 4-year-old i7 with 16GB of RAM. GPUs for machine learning are something in the fantasy realm, especially as nobody in my company knows CUDA or OpenCL. If it wasn't for Alteryx, there is no way I'd be doing ML, and we'd have to rely on consultants instead, so every new feature is welcome...