In 2017, Randal S. Olson published a paper benchmarking 13 state-of-the-art algorithms on 157 datasets. It showed that gradient tree boosting models outperform other algorithms in most scenarios. The same year, KDnuggets pointed out that one particular type of boosted tree model had become the most widely adopted: more than half of the winning solutions in machine learning challenges on Kaggle use XGBoost.
In this example, we explore how to use XGBoost through Python. Certainly, we won't forget our R buddies! Download the sample workflow with both the R and Python macros.
If you've never heard of it, XGBoost (eXtreme Gradient Boosting) belongs to the boosted tree family and follows the same principles as the gradient boosting machine (GBM) used in the Boosted Model in the Alteryx predictive palette. The key differences include:
Thanks to this design, XGBoost's parallel processing makes it blazingly fast compared to other implementations of gradient boosting.
To use the XGBoost macro, you need to install the required libraries (xgboost, readr, etc.) for both the R and Python macros to work. If you are new to this, follow the R and Python guidelines linked here: look for the error message below, then install the corresponding library.
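If you prefer to install everything up front, the usual package sources work. A minimal sketch, assuming a standard pip and CRAN setup (the exact library list your macro needs may differ):

```shell
# Python side: install xgboost (and scikit-learn for the tuning snippet) from PyPI
pip install xgboost scikit-learn

# R side: install the packages from CRAN (run from a shell with Rscript on PATH)
Rscript -e 'install.packages(c("xgboost", "readr"))'
```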
Once the packages are installed, run the workflow and click the Browse tool for the result. You can also Enable Performance Profiling (on the Runtime tab of the Workflow - Configuration window) to check how quickly XGBoost completes compared with traditional logistic regression.
After running it, you might want to apply the tools to your own use cases immediately. But hold on: let's first learn how to prep your data and interpret your output.
As with our other predictive tools, connect your training data to the I anchor, and your new raw data stream (which does not contain the target variable) to the D anchor.
A few things to consider:
The R anchor is the model report, which presents to you key insights:
The S anchor is the scored result on the testing data. The reason we are not using the Score tool here is that XGBoost transforms data into a sparse matrix format, which our Score tool would have to be customised to handle. If you want to save the model object and load it at another time, see the additional resources at the bottom.
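To get a feel for that sparse matrix format, here is an illustrative sketch using scipy (this is not the macro's internal code; XGBoost's own DMatrix uses a similar compressed representation):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small feature table with many zeros, as you would get
# from one-hot encoding categorical variables
dense = np.array([[0, 1, 0],
                  [0, 0, 2],
                  [3, 0, 0]])

sparse = csr_matrix(dense)

# Only the 3 non-zero entries are stored, not all 9 cells
print(sparse.nnz)        # -> 3
print(sparse.toarray())  # round-trips back to the dense table
```

This compression is why XGBoost handles wide, mostly-zero data efficiently, and also why a standard scoring tool built for dense tables cannot consume the model directly.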
Hyper-Parameter Optimisation (HPO)
After setting up the variables, there are extra parameters to fine-tune the model. In fact, there are hundreds: defining your optimisation objective, evaluation criteria, loss function, and more. It's this level of flexibility that makes every data scientist addicted to it.
I have only carved out some critical parameters in the configuration box; feel free to jump inside the macro and enrich the settings based on your needs. If you are new to Alteryx interface tools and macros, please check out our chief scientist Dr. Dan's Guide to Creating Your Own R-Based Macro Series.
R Configuration:
R Code:
md <- xgb.train(data = dxgb_train, nthread = n_proc,
                objective = "%Question.objective.var%",    # learning task, e.g. binary:logistic
                nrounds = %Question.roundno.var%,          # number of boosting rounds
                max_depth = 20, eta = %Question.lr.var%,   # tree depth and learning rate
                min_child_weight = 1, subsample = 0.5,
                watchlist = list(valid = dxgb_valid, train = dxgb_train),
                eval_metric = "auc",                       # metric evaluated on the watchlist
                early_stopping_rounds = 10, print_every_n = 10)
To save you time, here is a summary table with parameter info and tuning tips. Highlighted in hot sauce are the ones in the configuration box.
Define the problem & evaluation:
Objective / Learning Task
Evaluation Metrics:
Parameters to control overfitting:
Learning rate or eta [default=0.3][range: (0,1)]
Max_depth [default=6][range: (0,Inf)]
Parameters for speed
Number of rounds/trees: [default=10][range: (1, Inf)]
Subsample [default=1][range: (0,1)]
Early Stopping:
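To see why a smaller eta usually needs more rounds, here is a toy numeric analogy (plain arithmetic standing in for trees, not real XGBoost code): each boosting round fits the current residual, and the learning rate shrinks how much of that fit is added back to the prediction.

```python
# Toy sketch of gradient boosting on a single number:
# each "tree" fits the residual, scaled by the learning rate eta.
target = 10.0
prediction = 0.0
eta = 0.3          # a smaller eta would need more rounds to converge

for round_no in range(20):
    residual = target - prediction   # gradient of squared-error loss
    prediction += eta * residual     # shrunken step toward the target

print(round(prediction, 3))  # -> 9.992, close to the target of 10.0
```

With eta = 0.3 the prediction closes in on the target within 20 rounds; halving eta would roughly double the rounds needed, which is the usual trade-off between learning rate and number of trees.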
Don't panic when you see the long list of parameters. These days, there are many packages out there to help data scientists auto-tune the model, from simple random grid search to Bayesian optimisation to genetic algorithms. Inside the Python macro, there is a snippet of random-search code for you to try: for example, modify the number of iterations to 400 or the learning-rate range to 0.01 to 0.2.
import xgboost
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

xgb_model = xgboost.XGBClassifier()
params = {
    "colsample_bytree": uniform(0.7, 0.3),  # fraction of columns per tree
    "gamma": uniform(0, 0.5),               # min loss reduction to split
    "learning_rate": uniform(0.03, 0.3),    # default 0.1
    "max_depth": randint(2, 6),             # default 3
    "n_estimators": randint(10, 100),       # default 100
    "subsample": uniform(0.6, 0.4)          # fraction of rows per tree
}
search = RandomizedSearchCV(xgb_model, param_distributions=params,
                            random_state=42, n_iter=200, cv=3, verbose=1,
                            n_jobs=2, return_train_score=True)
search.fit(X_train, y_train)
report_best_scores(search.cv_results_, 1)
Then the best parameter set is produced:
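If you want to read the winning parameters off the fitted search object directly (instead of the macro's `report_best_scores` helper), the pattern looks like this. A minimal self-contained sketch, using scikit-learn's own tree model and synthetic data so it runs without xgboost installed; swap in `XGBClassifier` and your own data in practice:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for your training data
X, y = make_classification(n_samples=200, random_state=42)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions={"max_depth": randint(2, 6)},
    n_iter=4, cv=3, random_state=42)
search.fit(X, y)

print(search.best_params_)            # the winning parameter set, e.g. {'max_depth': ...}
print(round(search.best_score_, 3))   # its mean cross-validation score
```

`best_params_` and `best_score_` are standard attributes on any fitted `RandomizedSearchCV`, so the same two lines work unchanged with the XGBoost search above.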
That's all for now. In this post, we learned some basics of XGBoost and how to integrate it into the Alteryx platform using both R and Python. Once you feel ready, explore more advanced topics such as CPU vs. GPU computation, or level-wise vs. leaf-wise splits in decision trees. Some materials are gathered below. Enjoy.
Additional resources:
What's next? LightGBM!