Data Science


In 2017, Randal S. Olson published a paper comparing 13 state-of-the-art algorithms on 157 datasets. It found that gradient tree boosting models outperform other algorithms in most scenarios. The same year, KDnuggets pointed out that one particular type of boosted tree model is the most widely adopted: more than half of the winning solutions in Kaggle machine learning challenges use XGBoost.

 

 

Capture1.JPG

 

Let's Start

 

In this example, we explore how to use XGBoost through Python. And of course, we won't forget our R buddies! Download the sample workflow, which contains both the R and Python macros, from the Gallery.

 

If you have never heard of it, XGBoost (eXtreme Gradient Boosting) belongs to the boosted tree family and follows the same principles as the gradient boosting machine (GBM) used in the Boosted Model tool in the Alteryx predictive palette. The key differences include:

 

  • Regularisation to reduce overfitting, which tends to give more accurate results; and
  • A sparse matrix data structure, which gives more efficient cache utilisation and faster processing
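
To make the sparse matrix idea concrete, here is a minimal sketch of the compressed sparse row (CSR) layout in plain Python. It only illustrates the general idea of storing non-zero entries; XGBoost's internal data structure is more involved, and the helper name `to_csr` and the sample matrix are made up for this example.

```python
def to_csr(dense_rows):
    """Store only the non-zero values of a row-major matrix (CSR layout)."""
    data = []      # the non-zero values, row by row
    indices = []   # the column index of each stored value
    indptr = [0]   # indptr[i]:indptr[i + 1] is the slice of `data` for row i
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr


dense = [
    [0, 0, 3.0, 0],
    [1.0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 2.0, 0, 4.0],
]

data, indices, indptr = to_csr(dense)
print(data)     # four stored values instead of sixteen cells
print(indices)
print(indptr)
```

The emptier the matrix (for example after one-hot encoding), the more memory and scanning time this kind of layout saves.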

 

Thanks to this beautiful design, XGBoost's parallel processing is blazingly fast compared with other implementations of gradient boosting.

 

Kaggle image

 

 

To use the XGBoost macros, you need to install the required libraries (xgboost, readr, etc.) for both R and Python. Here are the R and Python installation guidelines if you are new to this. If you see the error message below, install the corresponding library.

 

 

Capture3.JPG

 

 

Once the packages are installed, run the workflow and click the Browse tool to see the result. You can also enable Performance Profiling (on the Runtime tab of the Workflow - Configuration window) to check how quickly XGBoost completes compared with a traditional logistic regression.

 

Capture4.JPG

 

 

After running it, you might want to apply the tools to other use cases straight away. But wait! Hold on, let's first learn how to prep your data and interpret your output.

 

Capture5.JPG

Data Input

 

As with our other predictive tools, connect your training data to the I anchor, and your new raw data stream (which does not contain the target variable) to the D anchor.

 

A few things to consider:

  • XGBoost only supports numeric values. You need to transform categorical features with one-hot encoding, mean encoding, etc. Don't know how to do it? The Alteryx Gallery has a tool for you!
  • Use the Imputation tool to fill all the missing, blank and null values.
  • There is no need to split the data into validation and holdout sets; the tools do it for you. Want to change the split ratio? Just open the macro and modify the Create Samples tool.
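
As an illustration of the first two points, here is a minimal sketch in plain Python of what one-hot encoding and mean imputation do to a table. The helper names `one_hot_encode` and `impute_mean` and the sample records are hypothetical; in a workflow you would normally use the Alteryx tools mentioned above instead.

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


def impute_mean(rows, column):
    """Fill missing (None) values in a numeric column with the column mean."""
    present = [row[column] for row in rows if row[column] is not None]
    mean = sum(present) / len(present)
    return [
        {**row, column: mean if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical input: one categorical field and one numeric field with a gap.
records = [
    {"colour": "red", "size": 10.0},
    {"colour": "blue", "size": None},
    {"colour": "red", "size": 20.0},
]

prepared = impute_mean(one_hot_encode(records, "colour"), "size")
for row in prepared:
    print(row)
```

After this step every field is numeric and complete, which is the shape of data the macro expects.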

 

 

Capture6.JPG

 

 

 

Report & Score Output

 

The R anchor is the model report, which presents the key insights:

  • Which variable is the most important factor, in the Feature Importance Table
  • How accurate the model is, in the Accuracy Metric Table
  • How complex the model is, in the Tree Plots

 

The S anchor is the scored result for the test data. We do not use the Score tool here because XGBoost converts the data into a sparse matrix format, which the Score tool would need to be customised to handle. If you want to save the model object and load it at another time, see the additional resources at the bottom.

 

 

 

Capture7.JPG

 

 


Hyper-Parameter Optimisation (HPO)


After setting up the variables, there are some extra parameters for fine-tuning the model. In fact, there are many of them, covering the optimisation objective, the evaluation criteria, the loss function and more. It's this level of flexibility that gets data scientists hooked on it.

I have only exposed some critical parameters in the configuration box. Feel free to jump inside the macro and enrich the settings based on your needs. If you are new to the Alteryx interface tools and macros, please check out our chief scientist Dr. Dan's Guide to Creating Your Own R-Based Macro Series.

R Configuration:

 

Capture8.JPG

R Code:

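# Train the boosted model. The %Question...% placeholders are replaced with
# the values chosen in the macro's configuration box at run time.
# The validation set is listed last in the watchlist because, by default,
# early stopping is evaluated on the last watchlist entry.
md <- xgb.train(data = dxgb_train, nthread = n_proc, 
          objective = "%Question.objective.var%", nrounds = %Question.roundno.var%, 
          max_depth = 20, eta = %Question.lr.var%, 
          min_child_weight = 1, subsample = 0.5,
          watchlist = list(train = dxgb_train, valid = dxgb_valid), 
          eval_metric = "auc",
          early_stopping_rounds = 10, print_every_n = 10)
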
To save you time, here is a summary of the parameters with tuning tips. The ones highlighted in hot-sauce colour are those available in the configuration box.

 

Define the problem & evaluation:

 

Objective / Learning Task

  • Linear regression (reg:linear, renamed reg:squarederror in newer XGBoost releases)
  • Logistic regression for binary classification (binary:logistic)
  • Softmax for multi-class classification (multi:softprob)
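
The objectives differ mainly in how the raw score summed over the trees is turned into a prediction. As a rough sketch (the function names are illustrative, not part of XGBoost): binary:logistic passes the score through a sigmoid and reports the probability of the class labelled 1, while multi:softprob applies a softmax to one score per class.

```python
import math


def sigmoid(score):
    """Map a raw score to a probability, as binary:logistic does."""
    return 1.0 / (1.0 + math.exp(-score))


def softmax(scores):
    """Map one raw score per class to class probabilities, as multi:softprob does."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


binary_probability = sigmoid(0.0)               # a raw score of 0 means "undecided"
class_probabilities = softmax([2.0, 1.0, 0.1])  # three classes, first one favoured

print(binary_probability)
print(class_probabilities)
```

For a 0/1 target, a value close to 1 therefore means the model leans towards the class coded as 1.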

 

Evaluation Metrics:

  • AUC - Area under curve (used in classification)
  • RMSE - Root mean square error (used in regression)
  • merror - Exact matching error (used in multi-class classification)
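
For readers who want to see what these three metrics measure, here is a minimal plain-Python sketch. These are textbook definitions written for clarity, not XGBoost's own implementation, and the tiny input lists are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def merror(y_true, y_pred):
    """Share of records whose predicted class is not the actual class."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def auc(y_true, scores):
    """Chance that a random positive is scored above a random negative."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


regression_error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
multiclass_error = merror([0, 1, 2, 1], [0, 2, 2, 1])
ranking_quality = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

print(regression_error, multiclass_error, ranking_quality)
```

Note the direction: a higher AUC is better, while lower RMSE and merror are better.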

 

Parameters to control overfitting:

 

Learning rate or eta [default=0.3][range: [0,1]]

  • Makes the model more robust by shrinking the weights on each step.
  • A lower eta makes training slower because more rounds are needed, but it helps avoid overfitting; a higher eta trains faster but usually gives a less accurate result.
  • If time allows and model performance is key, decrease eta incrementally while increasing the number of rounds.
  • Typical values lie between 0.01 and 0.3.
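
The interplay between eta and the number of rounds can be seen in a toy example. The sketch below is a bare-bones gradient boosting loop for squared error with one-split trees (stumps), written in plain Python. It is not XGBoost (there is no regularisation or second-order information), and the data and the names `fit_stump` and `boost` are invented for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, eta, n_rounds):
    """Return the training mean squared error after each boosting round."""
    predictions = [0.0] * len(xs)
    history = []
    for _ in range(n_rounds):
        # Each new stump is fitted to what the current model still gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        # eta shrinks the contribution of every new tree.
        predictions = [p + eta * stump(x) for x, p in zip(xs, predictions)]
        mse = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)
        history.append(mse)
    return history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 6.9, 7.2]

fast = boost(xs, ys, eta=0.3, n_rounds=20)
slow = boost(xs, ys, eta=0.05, n_rounds=20)

print("eta=0.30 training MSE after 20 rounds:", fast[-1])
print("eta=0.05 training MSE after 20 rounds:", slow[-1])
```

With the same 20 rounds, the smaller eta is still far from fitting the training data, which is why a lower learning rate is usually paired with more rounds. Training error alone says nothing about overfitting, so in practice the comparison should be made on the validation set.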

 

Max_depth [default=6][range: (0,Inf)]

  • It controls the maximum depth of each tree.
  • The larger the depth, the more complex the model and the higher the chance of overfitting.
  • There is no standard value for max_depth. Larger data sets tend to need deeper trees to learn the rules from the data.

 

Parameters for speed

 

No. of rounds/trees [default=10][range: (1, Inf)]

  • It controls the maximum number of iterations
  • For classification, it is similar to the number of trees to grow

 

Sub-sample [default=1][range: (0,1]]

  • It controls the fraction of samples (observations) supplied to each tree.
  • Typically, its values lie between 0.5 and 0.8.
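
As a rough picture of what sub-sampling means, the sketch below draws a fresh random subset of row indices for each boosting round. It is a simplification in plain Python: XGBoost's internal sampling may differ in detail, and the helper name `subsample_rows` is made up.

```python
import random


def subsample_rows(n_rows, subsample, rng):
    """Pick a random share of the row indices, without replacement."""
    k = max(1, round(n_rows * subsample))
    return sorted(rng.sample(range(n_rows), k))


rng = random.Random(0)  # fixed seed so the example is repeatable
n_rows = 10

# A different half of the rows is used to grow the tree in each round.
rows_per_round = [subsample_rows(n_rows, 0.5, rng) for _ in range(3)]
for round_no, rows in enumerate(rows_per_round, start=1):
    print(f"round {round_no}: rows {rows}")
```

Because every tree sees less data, each round is cheaper, and the added randomness also tends to reduce overfitting.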

 

Early Stopping:

  • If NULL, the early stopping function is not triggered.
  • If set to an integer k, training with a validation set will stop if the performance doesn’t improve for k rounds.
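
The early stopping rule can be sketched in a few lines of plain Python. This is only an illustration of the "no improvement for k rounds" logic on a made-up list of validation AUC values; the function name `early_stopping_round` is hypothetical and the real bookkeeping inside XGBoost is more elaborate.

```python
def early_stopping_round(validation_scores, k, higher_is_better=True):
    """Return (best_round, stopped_round) as 0-based round numbers."""
    best_round = 0
    best_score = validation_scores[0]
    for current_round, score in enumerate(validation_scores):
        if higher_is_better:
            improved = score > best_score
        else:
            improved = score < best_score
        if current_round == 0 or improved:
            best_round, best_score = current_round, score
        elif current_round - best_round >= k:
            # k rounds have passed without beating the best score: stop here.
            return best_round, current_round
    return best_round, len(validation_scores) - 1


# Hypothetical validation AUC after each boosting round.
auc_per_round = [0.70, 0.74, 0.78, 0.79, 0.785, 0.788, 0.789, 0.787, 0.80]

best, stopped = early_stopping_round(auc_per_round, k=3)
print("best round:", best, "stopped at round:", stopped)

patient_best, patient_stopped = early_stopping_round(auc_per_round, k=10)
print("with k=10, best round:", patient_best)
```

With k=3 the training stops before the late improvement in the last round is ever seen, which shows the trade-off: a small k saves time but can stop too early.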

 

 

Capture9.JPG

 

 

 

Hyper-Parameter Optimisation (HPO)

 

Don't panic when you see the long list of parameters. These days, there are many packages out there to help data scientists auto-tune a model, from very simple random grid search to Bayesian optimisation to genetic algorithms. Inside the Python macro, there is a snippet of random search code for you to try. For example, change the number of iterations to 400 or the learning rate range to 0.01-0.2.

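import xgboost
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

xgb_model = xgboost.XGBClassifier()

# scipy's uniform(loc, scale) samples from [loc, loc + scale];
# randint(low, high) samples integers from low to high - 1.
params = {
    "colsample_bytree": uniform(0.7, 0.3),
    "gamma": uniform(0, 0.5),
    "learning_rate": uniform(0.03, 0.3), # default 0.1 
    "max_depth": randint(2, 6), # default 3
    "n_estimators": randint(10, 100), # default 100
    "subsample": uniform(0.6, 0.4)
}

# Try 200 random parameter sets, each scored with 3-fold cross-validation.
search = RandomizedSearchCV(xgb_model, param_distributions=params, random_state=42, n_iter=200, cv=3, verbose=1, n_jobs=2, return_train_score=True)

# X_train and y_train are assumed to be defined earlier in the macro.
search.fit(X_train, y_train)

# report_best_scores is a helper defined elsewhere in the macro (not shown here).
report_best_scores(search.cv_results_, 1)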
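
One detail worth knowing before editing the ranges: scipy's `uniform(loc, scale)` samples from loc to loc + scale, and `randint(low, high)` excludes high. So `uniform(0.03, 0.3)` covers 0.03 to 0.33, and a learning rate range of 0.01 to 0.2 would be written as `uniform(0.01, 0.19)`. The sketch below mimics that sampling with the standard library only, so it runs without scipy; the helper names are made up, and it only draws candidate settings without training any model.

```python
import random


def sample_uniform(rng, loc, scale):
    """Mimic scipy.stats.uniform(loc, scale): a value in [loc, loc + scale]."""
    return loc + scale * rng.random()


def sample_randint(rng, low, high):
    """Mimic scipy.stats.randint(low, high): an integer from low to high - 1."""
    return rng.randrange(low, high)


def sample_params(rng):
    """Draw one candidate parameter set for the random search."""
    return {
        "colsample_bytree": sample_uniform(rng, 0.7, 0.3),
        "gamma": sample_uniform(rng, 0.0, 0.5),
        "learning_rate": sample_uniform(rng, 0.01, 0.19),  # 0.01 to 0.2
        "max_depth": sample_randint(rng, 2, 6),            # 2, 3, 4 or 5
        "n_estimators": sample_randint(rng, 10, 100),
        "subsample": sample_uniform(rng, 0.6, 0.4),
    }


rng = random.Random(42)
candidates = [sample_params(rng) for _ in range(400)]

learning_rates = [c["learning_rate"] for c in candidates]
depths = sorted({c["max_depth"] for c in candidates})
print("learning rate spread:", min(learning_rates), "to", max(learning_rates))
print("depths tried:", depths)
```

Random search then simply trains and scores a model for each candidate and keeps the best one, which is what RandomizedSearchCV automates.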

 

 

 

 

 

 

 

 Then the best parameter set is produced:

 

Capture10.JPG

 

End. And?

 

That's all for now. In this post, we learned some basics of XGBoost and how to integrate it into the Alteryx platform using both R and Python. Once you feel ready, explore more advanced topics such as CPU vs GPU computation, or level-wise vs leaf-wise splits in decision trees. Some materials are gathered below. Enjoy.

 

Additional resources: 

 

What's next?  LightGBM!

 

 

Comments
Alteryx

Great article! Thanks for sharing with us!

5 - Atom
Hi, I am unclear on how to do a forecast with this tool ? If I have two columns of data, the first being time periods, and the second being sales, what would I do to create a sales forecast ? Am I misunderstanding the purpose of this tool ? Thanks Neil
Alteryx Alumni (Retired)

Kudos @TimothyL !!!

5 - Atom

Hi Timothy,

 

Thanks for the great post.  I can get the R version working fine, but when I open the workflow I get a message that the file XGBoost_Python.yxmc is missing.  Also, the tool is then missing from the workflow.  It looks like that file is not included when I extract the original download package XGBoost.yxzp.  Or maybe I am missing a step during the installation?

 

Thanks,

Greg

 

XGBoost Screen Shot 1.JPG
XGBoost Screen Shot 2.JPG

Alteryx

@gregnolder 

 

May I know which version of Designer you are using? Could send you the macro offline if needed. TL

7 - Meteor

I'm having the same issue as outlined by gregnolder, except I don't have 'R' either- 

 

Did I install incorrectly, or is there another place I can find the Macros?  Please advise.

 

Picture1.png
Picture2.png

 

 
Alteryx

@gregnolder @ebarr 

 

Pardon the late response. I have just fixed the link. Please download it one more time and see if it works. Thanks

7 - Meteor

I installed the alteryx xgboost package and used python connector to install scikit-learn and xgboost and hit run.  I received the error below indicating an error with the training.

 

ebarr_0-1586702403930.png

I then opened the macro and hit run to see the error message in the Jupyter notebook:

c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\sklearn\preprocessing\label.py:219: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\sklearn\preprocessing\label.py:252: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

I did some research online and found this article as an example of a solution:

https://stackoverflow.com/questions/34165731/a-column-vector-y-was-passed-when-a-1d-array-was-expect...

 

I'm not exactly sure how to fix the xgboost, if it even needs fixing, did I miss a package to install?  Any suggestions would be greatly appreciated.

 

 

Alteryx

@ebarr 

 

Fixed. I changed all the input data types to double instead of int or byte, due to our latest upgrade of the Python environment. Try downloading the macro again and see if it works. Message me if the problem remains. Thanks

 

 

7 - Meteor

Success!  Thank you very much!

ebarr_0-1586870090124.png

 

8 - Asteroid

I'm experiencing a frustrating problem with this tool. When I download your sample workflow the R package works fine, but when I input my data it gives me the error below. I tried putting in only one predictor variable (double type) and it's still not working. Any ideas why your data works fine but mine doesn't? 

 

XGBoost_R (48) Tool #1: Error in xgb.DMatrix(data = X_train, label = ifelse(d_train$Price == "1", :

5 - Atom

I have a quick question that I do not believe has been addressed in the explanation above. I know it seems rather odd, but please stay with me 🙂

 

On the XGBoost tool in R, what comes off of the S output anchor is a table of the records that are scored, most notably with a "Prediction" variable that corresponds to the probability score made via the XGBoost algorithm. I assumed this was the probability of a "True" response. However, when I compare these predictions with various other models using the exact same training and test data, the probabilities are almost exactly reversed; that is, for example, other models are assigning a probability of .15 to a case when XGB is assigning .85.  I even ran XGBoost via the new Intelligence Suite on the same data and came up with vastly different results. The true positive rate is about 20% so it doesn't make sense that one of if not the best algorithms for this type of problem is so vastly inaccurate and different.

 

My target variable is 0 = "False" and 1 = "True". In this case, with this XGB model, could it be that the prediction probability relates to the probability of obtaining the first value, in this case 0/False?

 

I was so excited to get this configured and working but it simply is not squaring with every other model I have run. Any ideas?

Alteryx Community Team

FYI XGBoost is included in the Classification tool, part of the new Alteryx Intelligence Suite. (Apologies in advance to @bkramer66 and @Hjardine as I'm not answering your question. Perhaps @TimothyL can weigh in on your queries.)

7 - Meteor
I'm actually having the same type of issue and just stopped using the xgboost and went back to boost, as it was more "accurate" (at least based on my understanding of "accurate").
Alteryx

@ebarr @bkramer66 @Hjardine 

 

Thanks for all the support! This post was written before the release of our officially supported toolkit, the Intelligence Suite, which contains the xgboost package. Since then, Designer has also been upgraded together with its R and Python environments.

 

I believe there is huge potential for improvement in this macro, both in general applicability and in parameter reporting. Feel free to shoot me a message so that we can discuss further with your sample data. 😉

 

For now, please give our Alteryx Intelligence Suite a try, which we can support in the long term.