The randomForest package implementation in Alteryx works fine for smaller datasets but becomes very slow on large datasets with many features.
The open-source ranger package (https://arxiv.org/pdf/1508.04409.pdf) could help here.
Along with XGBoost/LightGBM/CatBoost, it would be an extremely welcome addition to the predictive package!
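For anyone hitting the same slowdown from the Python side in the meantime, scikit-learn's random forest already parallelizes tree training across cores, which is where ranger gets much of its speedup too. A minimal sketch, assuming scikit-learn is installed (the dataset and parameters are purely illustrative):

```python
# Sketch: a multi-core random forest in Python, analogous to ranger's
# multi-threaded training. Data here is synthetic, for illustration only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

# n_jobs=-1 trains trees on all available cores; this is the main lever
# for large datasets with many features.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(round(clf.score(X, y), 2))
```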
I second your request for XGBoost to be added to the predictive tools.
+1 great idea. I would mention @AshleyK @DrDan if we'd like to raise interest internally...
Random Forest (RIP Breiman) is a lifesaver in predictive work, and the benchmarks below show how much faster the new package is compared to the existing package and some alternatives...
It might also be more productive to create a single topic for all R/Python packages we'd like to see in Alteryx or ones we'd like to improve.
The ranger package definitely needs to be looked at. The randomForest package is the current R package we use that I'm least happy with in terms of its finicky behavior. In addition, there have been a huge number of speed improvements for random forest models since the algorithm was first developed, while the randomForest package is still based on Leo Breiman's and Adele Cutler's original (circa 2001) FORTRAN code. We did look at randomForestSRC a couple of years ago, but at that time we found it was less performant than the original randomForest package.
In terms of XGBoost, we looked at that a couple of years ago as well, but there were implementation issues with it (at that time it didn't work directly with data frames).
Aside from the null-value allergy and the 2GB model size limit (I use a lot of variables), I can't say Alteryx's Random Forest implementation is that bad.
The C5.0 decision tree is a lot finickier in my experience (it's allergic to whitespace both in variable names and in the data, which needs to be looked at), and its graphical output leaves a lot to be desired.
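The whitespace allergy above can be worked around by sanitizing column names before modeling. A small sketch in Python/pandas (the column names are made up; the same idea applies to any tool that chokes on spaces):

```python
# Sketch: strip and replace whitespace in column names before handing
# the data to a whitespace-sensitive modeling tool.
import pandas as pd

df = pd.DataFrame({"annual income": [1, 2], "credit score ": [3, 4]})
df.columns = [c.strip().replace(" ", "_") for c in df.columns]
print(list(df.columns))  # → ['annual_income', 'credit_score']
```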
- As for XGBoost, perhaps the Python implementation would be easier to integrate?
- Deep Forest (https://github.com/kingfengji/gcForest) would be an interesting package to implement as well; it's a tree-based alternative to deep learning.
- KNN and K-Modes (for categorical clustering) would also be great to have; the more options the merrier.
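For anyone who needs KNN before it lands in the predictive tools, scikit-learn already ships it. A minimal sketch (the dataset and neighbor count are illustrative, not a recommendation):

```python
# Sketch: the KNN requested above, via scikit-learn's KNeighborsClassifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)  # 5 nearest neighbors vote
knn.fit(X_train, y_train)
print(round(knn.score(X_test, y_test), 2))
```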
Big + for fixing the null-value allergy in random forest
++ for deep forest
You are right, @marco_zara, though it's a massively parallelizable algorithm.
When the number of columns (variables) and rows increases, it still takes a lot of time to train a model...
Recently a model of mine at a fintech took approx. 2 hours... a long wait if you need near-real-time learning or active learning...
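When full retraining takes hours, one option is to grow the forest incrementally rather than from scratch. A sketch using scikit-learn's `warm_start` (synthetic data; whether this fits an active-learning loop depends on the use case):

```python
# Sketch: warm_start keeps the already-trained trees and only fits the
# newly requested ones, instead of retraining the whole forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=0)
clf.fit(X, y)            # first 50 trees

clf.n_estimators += 50   # request 50 more trees
clf.fit(X, y)            # only the new trees are trained
print(len(clf.estimators_))  # → 100
```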