Alteryx Designer Knowledge Base

Definitive answers from Designer experts.
This article has an attached macro from Dr. Dan for grouping linear regressions by unique values in a column.
View full article
This article includes a macro for obtaining parsed statistics from the Linear Regression tool.
View full article
An ETS model, otherwise known as exponential smoothing, estimates single-variable forecasts using weighted averages of past observations, with the most weight given to recent observations and the weights decreasing steadily for older ones. Depending on the method used, there is a smoothing equation for one or more of the following: level, trend, and seasonality.
View full article
ARIMA stands for Autoregressive Integrated Moving Average. An ARIMA model produces time series forecasts using autoregressive integrated moving averages, based on either a single-variable model or a covariate model. Generally, either the AR or MA terms are used; models with both terms are less common.
View full article
Errors and warnings occur when predictive model names contain spaces
View full article
If you are installing two different versions of Alteryx Designer and their respective Predictive Tools, there is an order of operations:
1. Install the Administrative version of Alteryx Designer.
2. Install the Administrative version of the Alteryx R/Predictive Tools.
3. Install the Non-Administrative version of Alteryx Designer.
4. Install the Non-Administrative version of the Alteryx R/Predictive Tools.
View full article
Predictive Tools or R Tool missing in Designer

Cause 1
After the Designer installation, the separate Predictive Tools installation is not yet complete.

Solution
The tool palette with the Predictive Tools and the R Tool in the Developer toolbar are not part of the Designer installation. These tools require a separate installation matching the non-Admin or Admin version of Designer. To verify the version of Designer installed, first go to the Help > About menu. Then go to downloads.alteryx.com and click on the Designer tab to find the matching version of the Predictive Tools in the list.

Cause 2
The incorrect version of the Predictive Tools download was installed.

Solution
Be sure to install the version of the Predictive Tools that matches Designer, either non-Admin or Admin.

Cause 3
There are additional initialization files beyond those generated by the Predictive Tools installation.

Solution
Designer checks three locations for the initialization (.ini) files that contain settings needed by R-based tools that run in Alteryx. These .ini files include the folder locations of plug-ins and tool macros, as well as the correct version of the R libraries and modules. Note that all tools in the Predictive Tools download need the R folder information from these initialization settings to run.

When there are custom tools from the Alteryx Public Gallery, or other applications besides those from Alteryx referencing R libraries and modules, there can be initialization settings that misdirect Designer to the wrong folders. For example, an additional .ini file may point to the location used by a non-Admin version of Designer even after the version of Designer changes to Admin. The .ini files installed by the Predictive Tools download are listed below; note there are separate lists for the non-Admin and Admin versions of Designer.
Admin version:
C:\Program Files\Alteryx\Settings\RPluginSettings.ini
C:\Program Files\Alteryx\Settings\AdditionalPlugins\RPlugin.ini
C:\Program Files\Alteryx\Settings\AddOnData\Macros\R-(version_number)_Macros.ini

Non-Admin version:
C:\Users\(user_name)\AppData\Local\Alteryx\Settings\RPluginSettings.ini
C:\Users\(user_name)\AppData\Local\Alteryx\Settings\AdditionalPlugins\RPlugin.ini
C:\Users\(user_name)\AppData\Local\Alteryx\Settings\AddOnData\Macros\R-(version_number)_Macros.ini

Check the folders containing the .ini files listed above to see if additional .ini files exist. If there are more .ini files, open each file and look for folder paths with an R folder such as R-(version_number) or R-library. If such a path exists in the file, close it and temporarily move the file to your Desktop. Then close and re-open Designer and see if the missing tools appear. If the moved file is still needed (to run a custom tool, for example), edit it with the correct settings matching those found in the default files from the Predictive Tools download, and then return the file to its original folder. For example, this is the correct location for the plug-in files used by an Admin version of Designer.

Here is a fictitious example of a file that still has an R folder path for the non-Admin version of Designer, even though the workstation has the Admin version. The file should be corrected, or removed if obsolete.

Additional Resources
Troubleshooting Predictive Tools Installation
Installing two different versions of Alteryx on the same machine - with predictive tools
How to install the Microsoft R Client Predictive Tools
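If you want to automate the check described above, a small script can flag every .ini file under a Settings folder that mentions an R folder path. This is a hypothetical helper, not an Alteryx utility; the path in the example follows the Admin-version locations listed above:

```python
# Hypothetical helper: scan a Designer Settings folder for .ini files that
# reference an R folder (e.g. R-3.5.3 or R-library), so unexpected extras
# can be reviewed, corrected, or moved aside.
import re
from pathlib import Path

def find_ini_with_r_paths(settings_dir):
    """Return .ini files under settings_dir that mention an R-* folder path."""
    pattern = re.compile(r"\bR-(?:\d+\.\d+|library)", re.IGNORECASE)
    hits = []
    for ini in Path(settings_dir).rglob("*.ini"):
        try:
            text = ini.read_text(errors="ignore")
        except OSError:
            continue  # unreadable file; skip rather than fail
        if pattern.search(text):
            hits.append(ini)
    return hits

# Example (Admin install):
# for f in find_ini_with_r_paths(r"C:\Program Files\Alteryx\Settings"):
#     print(f)  # review each hit against the default files
```

Only the files beyond the three defaults listed above need attention; the defaults themselves are supposed to contain R paths.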
View full article
Fix "Error: package or namespace load failed for 'AlteryxRDataX'"

Following a new installation of the Predictive Tools, when trying to run a workflow involving the R Tool or any macros that use the R Tool (most of the predictive tools, for example), you may face the following error message:

Error: R (1): Error: package or namespace load failed for 'AlteryxRDataX':
Info: R (1): .onLoad failed in loadNamespace() for 'AlteryxRDataX', details:
Info: R (1): call: NULL
Info: R (1): error: package or namespace load failed for 'showtext':
Info: R (1): .onLoad failed in loadNamespace() for 'showtext', details:
Info: R (1): call: utils::unzip(font_file, exdir = out_dir, overwrite = FALSE)
Info: R (1): error: 'exdir' does not exist
Error: R (1): Error: Unable to load the AlteryxRDataX package - Use the R installer provided by Alteryx
Error: R (1): Execution halted

Environment
Product - Alteryx Designer: All versions
Product - Predictive Tools: All versions

Cause
This error occurs if one of the R packages used by Alteryx for the R Tool (https://cran.r-project.org/web/packages/showtext/index.html) cannot freely access the folder defined by the environment variable TMPDIR, TMP, or TEMP. The check stops as soon as one of them is found. They are checked in that particular order, with user variables taking precedence over system variables if both are defined.

Example: in the following example, the environment variables will be read in this order (the first available folder will be used):
system TMP
user TEMP
system TEMP

Because C:\Windows\Temp exists but is not accessible for the current user (see below), the error will occur.

Solution
1. Close Designer.
2. Open Control Panel.
3. Open System.
4. Open Advanced system settings.
5. Click on Environment Variables.
6. Create a user environment variable called TMP (or TMPDIR if TMP already exists) with a value set to a fully accessible folder (example: %LOCALAPPDATA%\Temp). Click OK to close the window.
7. Start Designer and run your workflow or test again.
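The lookup behind this error can be sketched as follows. This is a simplified illustration only: the real check happens inside R's showtext package, and it distinguishes user from system variables, which a flat environment lookup cannot:

```python
# Simplified sketch of temp-folder resolution from environment variables.
# The first variable pointing at a usable folder wins; if that folder
# exists but is not writable, extraction fails (the 'exdir' error above).
import os

def resolve_temp_dir(env=os.environ):
    for name in ("TMPDIR", "TMP", "TEMP"):
        path = env.get(name)
        if path and os.path.isdir(path) and os.access(path, os.W_OK):
            return path
    return None  # no usable temp folder found

print(resolve_temp_dir({"TMP": os.getcwd()}))
```

In the fix above, defining a user-level TMP pointing at a writable folder works precisely because user variables are consulted before system ones.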
View full article
How to install the Microsoft R Client Predictive Tools

Most people using Designer will only install the Alteryx Predictive Tools. These tools do not need additional software other than Designer. The Microsoft R Client Predictive Tools, however, require a separate Microsoft R Client installation.

Prerequisites
An account with Windows Administrator privileges on the workstation for the installation
Admin version of Designer

See the link below for the Alteryx Designer Predictive Tools Compatibility Policy.
Predictive Tools Compatibility Policy

Please complete the installation using the sequence listed below. When a different order of installation occurs, there may be errors such as "Microsoft R Client was not found on this computer and is required for this install."

Admin privileges are required to install the Microsoft R Client Predictive Tools, as some Windows components and drivers are part of the installation. These tools must be used with the Admin version of Designer.

Procedure
1. Go to downloads.alteryx.com. In the Product Downloads List, select the version of Designer or Server that will be used with the Predictive Tools.
2. Download and install Alteryx Designer or Server first.
3. In the Designer Product Download List at downloads.alteryx.com, choose the Alteryx Predictive Tools that match the Designer version, and install these tools.
4. Next, download and install the Microsoft R Client. Download files and a compatibility chart for the correct version to use are available at Microsoft Machine Learning Server R Client.
5. The last software to install is in the Other Predictive Tools section of the Designer downloads page: the Microsoft R Client Predictive Tools (MRCInstaller).

Additional Resources
Alteryx and Microsoft R Integration
What is Microsoft R Client
View full article
With the Python Tool, Alteryx can manipulate your data using everyone’s favorite programming language - Python! Included with the tool are a few pre-built libraries that extend well past even the native Python download. This allows you to extend your data manipulation even further than one could ever imagine. The libraries installed are listed here - and below I’ll go into a bit more detail on what these libraries do and why they are so useful.

Each library is well documented, and there’s usually an introduction or examples on their sites to get you started on how a basic function in their library works.

ayx – Alteryx API – simply enough, we’re using Alteryx, sooo yea, kind of a requirement for the translation between Alteryx and Python.

jupyter – Jupyter metapackage – If you’ve used a Jupyter notebook in the past, you’ll notice the interface for the Python Tool is similar. This interface allows you to run sections of code outside of actually running the workflow, which makes understanding and testing your data that much easier. http://jupyter.org/index.html

matplotlib – Python plotting package – Any charting, plotting, or graphical needs you have will be covered by this package. It provides a great deal of flexibility for whatever you want to visualize. https://matplotlib.org/

numpy – NumPy, array processing for numbers, strings, records, and objects – Native Python processes data in what some would call a cumbersome way. For instance, if you wanted to make a matrix (e.g., a 4x4 table), you would need to create a list within a list, which can slow processing a bit. NumPy, however, has its own “array” type that fits the data in this matrix pattern and allows for faster processing. Additionally, it has a bunch of methods for handling numbers, strings, and objects that make processing a whole lot easier and a whole lot faster. http://www.numpy.org/

pandas – Powerful data structures for data analysis, time series, and statistics – This is your staple for handling data within Alteryx. Those who have used Python, but never pandas, will enter a whole new beautiful world of data handling and structure. Data manipulation within Python becomes faster, cleaner, and easier to code. The best part is that the Python Tool will read in your Alteryx data as a pandas data frame! Understanding this library should be one of the first steps in tackling the Python Tool. https://pandas.pydata.org/

geopandas – Extends the data types used by pandas to allow spatial operations on geometric types – Are you interested in geospatial analysis using Python? Try this package. It makes working with geospatial data in Python much easier and faster. http://geopandas.org/

requests – Python HTTP for Humans – For all the connector/Download Tool fans out there: if any of you are familiar with making HTTP requests (API calls and the like), then you should introduce yourselves to this package and explore how Python performs these requests. http://docs.python-requests.org/en/master/

scikit-learn – A set of Python modules for machine learning and data mining – Welcome to the world of machine learning in Python! This library is your go-to for statistical and predictive modeling and evaluation. Any crazy and wild methods you’ve learned for machine learning will most likely be found here and can really push the boundaries of data science. http://scikit-learn.org/stable/

scipy – Scientific Library for Python – All your scientific and technical computing can be found here. This library builds off the packages already installed, like numpy, pandas, and matplotlib. Mathematical models and formulae are usually handled by this library, which can help provide a higher-level analysis of your data. https://www.scipy.org/

six – Python 2 and 3 compatibility utilities – For those who are unfamiliar, Python comes in two major versions, 2.x and 3.x (with 3.x being the most recent). Even though Python 3 is supposed to be the latest and greatest, there are still many users out there who prefer Python 2, and integration between the two is a bit tricky due to syntax differences. The six module provides functions that work in both versions so everyone can remain calm and happy! Its documentation usually notes which version a function most closely aligns to, so a user can get a better idea of its functionality. https://pypi.org/project/six/

SQLAlchemy – Database Abstraction Library – SQL in Python! Covers all your database needs, from connecting to a database to extracting data, allowing it to interact with your Python code and thus Alteryx itself. https://www.sqlalchemy.org/

statsmodels – Statistical computations and models for Python – This library builds off scikit-learn but focuses more on statistical tests and data exploration. Additionally, it utilizes R-style formulae with pandas data frames to fit models! https://www.statsmodels.org/stable/index.html

These are the libraries installed with the Python Tool, which can handle almost any data function imaginable. Of course, if you’re looking to do something these libraries don’t provide, there are myriad other Python libraries that I’m sure will help you with your use case. Most of them are also well documented, so search away and let your mind float away in the beautiful cosmos created by Python.
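As a taste of the pandas workflow described above, here is the kind of snippet you might run inside the Python Tool. The data frame is built by hand here so the example is self-contained; in the tool itself, your Alteryx data arrives via the ayx package as a pandas data frame:

```python
# A tiny pandas warm-up of the kind you might run in the Python Tool.
import pandas as pd

df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "sales": [100, 150, 90, 200],
})

# Group, aggregate, and add a derived column -- bread-and-butter pandas
summary = df.groupby("store", as_index=False)["sales"].sum()
summary["share"] = summary["sales"] / summary["sales"].sum()
print(summary)
```

The same few lines replace what would otherwise be a Summarize tool plus a Formula tool, which is exactly the kind of flexibility the libraries above buy you.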
View full article
K-Centroids Cluster Analysis: No valid fields were selected

When using the K-Centroids Cluster Analysis tool, you may see the following error message:

Tool #27: Tool #6: No valid fields were selected.
Tool #41: Tool #6: No valid fields were selected.

Environment
Product - Alteryx Designer
Predictive Tools

Cause
The most common cause for this error message is that one of the fields selected has no statistical variance (all values are the same). If this is the case, you will often see another error message stating: "One or more of the provided data columns has a single data value."

Solution
1. Attach a Field Summary tool (Data Investigation) to the data you are sending into the K-Centroids Cluster Analysis tool.
2. Investigate the results to find any fields with no variance.

Additional Resources
Pre-Predictive: Using the Data Investigation Tools
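The same zero-variance check the Field Summary tool reports can also be approximated in a few lines of pandas (illustrative data; column names are made up):

```python
# Spot fields with no variance -- the usual cause of the error above.
import pandas as pd

df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [5.0, 5.0, 5.0, 5.0],   # constant: would trigger the error
    "x3": [0.1, 0.4, 0.2, 0.9],
})

constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
print(constant_cols)  # candidates to drop before K-Centroids clustering
```

Dropping the flagged columns (or deselecting them in the tool) resolves the error.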
View full article
AlteryxRPluginEngine.dll could not be loaded: The specified procedure could not be found

When running any of the R-based predictive tools in Designer, you may come across the below error message:

AlteryxRPluginEngine.dll could not be loaded: The specified procedure could not be found. (127)

Environment
Product - Alteryx Predictive Tools
Any R-based predictive tool (Linear Regression, Decision Tree, Association Analysis, etc.)

Cause
Typically this error is caused by a mismatch in versions between Designer and the Predictive Tools. Both Designer and the Predictive Tools must be on the same version number (e.g., 2018.4) and type (Admin/non-Admin).

Solution
1. Search Windows for "Add or Remove Programs" or go there through the Control Panel.
2. Uninstall both Alteryx Designer and Alteryx Predictive Tools. Note: it is fine to uninstall only the Admin versions, or only the non-Admin versions, if desired. Non-Admin versions will say (User) in parentheses.
3. Download the matching versions from the Downloads page.
Admin example: AlteryxInstallx64_2019.3.5.17947.exe AND RInstaller_2019.3.5.17947.exe
Non-Admin example: AlteryxNonAdminInstallx64_2019.3.5.17947.exe AND RNonAdminInstall_2019.3.5.17947.exe

Additional Resources
Alteryx Downloads
View full article
Unable to load the Rcpp package

When using the Predictive Tools, you may get an error stating that the Rcpp package cannot be loaded, or that the namespace load failed.

Unable to load the Rcpp package - Use the R installer provided by Alteryx

Environment
Product - Alteryx Designer
Predictive Tools

Cause
User-installed packages may have overwritten Alteryx's default Rcpp package, or a version mismatch between Designer and the Predictive Tools may have caused the Rcpp version to become out of sync with Designer. Other causes are possible as well.

Solution
1. Make a backup and delete the existing Rcpp package from the folder corresponding to your Designer version (Admin or non-Admin):
Non-Admin: %LOCALAPPDATA%\Alteryx\R-{VERSION}\library\Rcpp
Admin: C:\Program Files\Alteryx\R-{VERSION}\library\Rcpp
2. Run the R terminal as Administrator:
Non-Admin: %LOCALAPPDATA%\Alteryx\R-{VERSION}\bin\x64\Rterm.exe
Admin: C:\Program Files\Alteryx\R-{VERSION}\bin\x64\Rterm.exe
3. Re-install the Rcpp package with the following command: install.packages("Rcpp", type="binary")
4. Restart Designer and test the sample predictive workflows found under Help -> Sample Workflows.
5. If you continue to experience issues, open the R terminal again and run the command below. Provide any error message to Support for further troubleshooting: library(Rcpp)
View full article
This article describes and explains the outputs of the Decision Tree Tool.
View full article
The TS Factory tools can build time series models and forecasts for multiple sets of historical data groups without the need to separate them first. This capability comes from their relationship to the batch macro.
View full article
Alteryx Designer comes with tools (based on both R and Python) to create and use predictive models without needing to write any code. But what if you've got custom models written in R or Python outside of Designer that you want to use in Designer, or vice versa?
View full article
A broad overview and introduction to what Decision Trees are, and how they work.
View full article
This article is part of the Tool Mastery Series, a compilation of Knowledge Base contributions to introduce diverse working examples for Designer Tools. Here we’ll delve into uses of the Boosted Model Tool on our way to mastering the Alteryx Designer:    The Boosted Model tool in Alteryx is a powerful predictive analytics tool. It is one of the more complex tools in the tool chest, but in difficult use cases, it proves to be very valuable. This article will explain what boosting is and provide step-by-step instructions on how to incorporate this technique in a workflow, illustrated with a credit card fraud detection use case. In the Predictive Set of Tools in Alteryx, the very first one on the left is the Boosted Model. The icon for the tool has an upward arrow and what looks like an upright spring.   What is the Boosted Model?   The Boosted Model relies on the machine learning technique known as boosting, in which small decision trees (“stumps”) are serially chain-linked together. Each successive tree in the chain is optimized to better predict the errored records from the previous link. How to better-predict those previous errors is what distinguishes each boosting algorithm. One of the strongest and most proven algorithms is the Gradient Boosting Machine (aka “GBM”). Both R and Python have GBM packages; the Alteryx Boosted Model Tool uses the R implementation.   The Gradient Boosting Machine was invented by Jerome Friedman two decades ago while he was a Statistics Professor at Stanford University. You can download his paper here. A Gradient Boosting Machine is one that uses gradient descent as the method of zeroing in on the errors from the previous tree in the chain.   Boosting – especially gradient boosting – is very popular because of how well it works. New alternative versions of GBM are available in R and Python. The most popular are XGBoost developed in 2014, CatBoost developed in 2017, and Light GBM developed in 2017.    
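The "chain of stumps" idea can be sketched with scikit-learn's gradient boosting implementation. The Alteryx tool wraps R's GBM, so this Python sketch (on synthetic data) only mirrors the algorithm family, not the tool itself:

```python
# Gradient boosting with depth-1 trees ("stumps"), as described above.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=300,   # number of trees chained together
    max_depth=1,        # depth-1 trees = "stumps"
    learning_rate=0.1,  # GBM's "shrinkage factor"
    random_state=0,
).fit(X_tr, y_tr)

print(round(gbm.score(X_te, y_te), 3))  # held-out accuracy
```

Each successive stump is fit to the gradient of the loss on the previous ensemble's errors, which is exactly the serial error-correction the paragraph above describes.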
Since the Gradient Boosting Machine uses gradient descent, it needs to have a learning rate configured for it, like other gradient descent algorithms. For GBM, the learning rate is called the “shrinkage factor”.

Gradient boosting (as well as other boosting algorithms) is resistant to overfitting, which is what happens when an algorithm learns too much about the specifics of the data records it initially learns on, thereby failing to generalize to new, previously unseen data at prediction time. This is a common problem with many algorithms, but not so much with gradient boosting.

To understand what overfitting is, look at the two sets of graphs below. The blue dots represent the training data points, and the red lines represent the model function inferred from the training data that will be used to predict on new data. The top two graphs show two possible models based on the same training data, while the bottom graphs show how the same models perform on test data. On each set, the left model is what we refer to as a “Generalized” model. The red line (aka ‘model’) serves as a trend line and predicts the trend by being placed in the middle of the training points on the scatterplot. It is a simple straight line defined by a first-order linear equation. In the Generalized Model, the model only accurately predicts 2 of the 16 training points (shown by the line going through those 2 points); however, its placement minimizes the error (the distance between the other points and the line). In contrast, in the Overfit Model on the right, the model employed is more complicated and defined by higher-order equations. In the case of the Overfit Model, the model BETTER PREDICTS the training data, accurately predicting 7 of the 16 training data points.
We might be tempted to think that the Overfit Model is better because it achieves higher Accuracy when predicting the training data; however, when the models try to predict on the previously unseen Test Data, the Accuracy of the Overfit Model underperforms that of the Generalized Model and it does not reflect the true trend of the data.       Gradient boosting is flexible in the types of target variables (dependent variables) that it can handle. It can deal with continuous numeric variables such as temperature, as well as binary variables such as 0 and 1. It can also work with categorical variables such as Yes and No, or multinomial categorical variables such as “Dog”, “Cat”, and “Mouse”.   The Gradient Boosting Machine (like the other boosting algorithms) is said to be good for imbalanced training datasets. So, for example, if a training dataset has 50% of the records with a target variable of “Y” and the other 50% has a target variable of “N”, then it is considered a perfectly balanced dataset. Balanced datasets are excellent for training, but there are many cases in which the training dataset by its very nature is highly imbalanced. For example, if we are trying to detect credit card fraud, we would have a historical dataset in which each credit card transaction is a record. The number of fraudulent transactions within those records would be a very small percentage of the total. In this case, the dataset available for training a model is highly imbalanced, which presents a difficulty for many algorithms such as Logistic Regression. It is less of a problem for the GBM, however, because of the way the algorithm focuses on error records in each tree generation, as described above.    Credit Card Fraud Data   In order to help illustrate how to use the Alteryx Boosted Model, we will use a credit card fraud detection dataset for model training. Kaggle has a sizeable dataset available, which you can download here .   
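The generalized-vs-overfit contrast described above can be reproduced numerically: fit a straight line and a high-order polynomial to the same noisy linear data, then compare errors on training and held-out points (synthetic data for illustration):

```python
# Degree-1 ("generalized") vs degree-12 ("overfit") fits on noisy linear data.
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 16)
y_train = 2 * x_train + rng.normal(0, 0.2, 16)   # true trend is linear
x_test = np.linspace(0.02, 0.98, 16)
y_test = 2 * x_test + rng.normal(0, 0.2, 16)

def mse(deg):
    coeffs = np.polyfit(x_train, y_train, deg)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

simple_train, simple_test = mse(1)     # straight line
complex_train, complex_test = mse(12)  # higher-order curve

print(simple_train, simple_test)
print(complex_train, complex_test)
```

The high-order fit always achieves the lower training error; the instructive part is comparing the two test errors, which is where the overfit model typically gives back its advantage.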
This particular dataset has nearly 285,000 rows, each representing a single credit card transaction. In order to hide the personal data associated with each credit card transaction, the variables are all obfuscated by their names and normalized. It is very likely these variables are the output of a PCA (Principal Components Analysis) preprocess. PCA processes are often employed to reduce many variables down to only a few independent and regularized (normalized) variables. One of the benefits of PCA is that the smaller set of output variables is often obfuscated and difficult to correlate back to a known input variable (without the output reports of the PCA process). Another benefit of PCA preprocessing is that the output variables have very low collinearity, which makes them better suited for modeling.

The Kaggle credit card fraud dataset has 30 columns. The main variables are named “V1” through “V28”. Following these is the “Amount” column, which is the purchase amount, and then finally the Target Variable, labeled “ClassYN”. If Class=Y, the record represents a fraudulent transaction; if Class=N, the record is a normal transaction. Of the 284,807 total records, only 492 (0.17%) are fraud records. This means that 99.83% of the records are not fraud; hence, this is a very imbalanced dataset.

In preparation for modeling, we separated the 285K records into three different datasets: one for training, one for testing, and one for final evaluation. This separation into Training, Testing, and Evaluation sets can be done through random assignment, but – especially with imbalanced datasets – it is better to split the data in a semi-random way. This ensures each sub-dataset has a representative sampling of the important types of records from the original dataset [1].
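One common way to do such a semi-random split is stratified sampling, which keeps the rare fraud class at the same proportion in every subset. The article does not specify its exact method, so this is a sketch; column names follow the Kaggle dataset, but the data below is synthetic:

```python
# Stratified three-way split so each subset keeps the ~0.17% fraud rate.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 30000
df = pd.DataFrame({"V1": rng.normal(size=n)})
df["ClassYN"] = np.where(rng.random(n) < 0.0017, "Y", "N")  # rare fraud class

# First carve out the training third, then split the rest in half,
# stratifying on the target each time.
train, rest = train_test_split(df, test_size=2/3, stratify=df["ClassYN"], random_state=0)
test1, test2 = train_test_split(rest, test_size=0.5, stratify=rest["ClassYN"], random_state=0)

for part in (train, test1, test2):
    print((part["ClassYN"] == "Y").mean())
```

A fully random split could, by chance, leave one subset with almost no fraud records; stratification rules that out.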
Once the full dataset is split evenly into three datasets, a T-Test (Test of Means) is performed on the target variable and the most important of the other variables (“V1” – “V28”) to ensure that each dataset is statistically similar to the original.

The Alteryx Workflow

Below is a screenshot of the credit card fraud detection model and workflow created for this blog. You can download a copy of it at the Alteryx Public Gallery here.

The workflow is broken into three columns: the Input Data column on the far left, the colored processing sections in the middle, and the Output Data column on the far right. The data flows from left to right, Inputs to Outputs. The processing happens in three sections: the pink section is where the Boosted Model gets trained; in the green section, the model is tested using Test Dataset 1; and the yellow section is where the final model evaluation is performed using Test Dataset 2.

Before the Training Data is used in step one of processing, we decided to transform the Amount field. Transforming variables can help the modeling process. Because the Amount field spanned from $0 to $25,000, and this scale is significantly different from the span of the variables V1 through V28, it would be best to either normalize the field or transform it to make it better correlate with the Target Variable. In this case, we decided to do a logarithmic transformation. Specifically, the new variable “3LogAmount” is 3*Log(Amount). This transformation brings the scale to the same degree as the other variables, and the new variable better correlates with the Target Variable (using Pearson Correlation). Because we transformed Amount for Training, we also must transform it for Testing and Evaluation.

Configuring the Alteryx Boosted Model

Required Parameters Menu

In order to configure the Alteryx Boosted Model tool, click on it in the workflow.
This is what you will see in the Configuration Pane:     On the “Required parameters” tab, you can choose a Model Name. You can choose any name you want, but a good practice is to include information about how the model was configured.   Next, choose the target field. This is the field we are trying to predict. In our dataset the target field is “ClassYN”.   The next section shows a list of checkboxes for indicating the predictor fields. We check every field except Amount and ClassYN. Since we transformed the Amount field into 3LogAmount, we should check 3LogAmount, but not Amount. Also, it is very important that we do not check the ClassYN field, because it is the dependent variable and not an independent variable.   The next option is to choose whether you want to use sampling weights. If you want to do this, include a field in the data with the numeric weight for each record. For example, if there is a set of records you want to be 10 times more likely to be used than the rest of the records, then those records get weighted as 10 and the rest of the records get weighted as 1. When configuring the tool, choose the field in the dataset that has the weights. In our case, we do not need to use weights.   The next option is whether or not to include marginal effect plots. These are plots that isolate a single variable to inspect which values of that variable are most predictive of the Target Variable. These are helpful in determining how well a variable will serve as a predictor.   Model Customization Menu   Next, to further customize the model, choose the Model customization tab.     If you click on the checkbox: “Specify target type and the loss function distribution”, you can choose what type of variable the Target Value is, and then its loss function. A loss function essentially tells the algorithm how to penalize errored predictions.   
Continuous target:   For example, if the target variable is continuous, there are three different loss functions available to use. A Gaussian (Squared Error Loss) function is very good to use when the predicted errors are large in value. With this loss function, since the error amount is squared, this penalizes the errors much more when they are large. A Laplace (Absolute Value Loss) function penalizes the errors linearly to their error amount. This is a faster loss function, but it does not work as well as Gaussian in cases in which the errored amounts are large.   The t-distribution loss function (AKA Student’s T loss function) should penalize errors even less than the Laplace loss function. If you choose the t-distribution, you will need to enter the number of predictive variables in the “Degrees of Freedom” field.   Count (integer) Target:   If your target variable is a count of some kind (e.g. count of sneezes a person has during allergy season), then the Boosted Model tool will automatically choose the Poisson loss function.   Binary (two outcomes) Categorical:   If you choose a target variable that is Binary Categorical, then you will have a choice of two loss functions: Bernoulli (logistic regression) or AdaBoost (exponential loss). Bernoulli has a loss function that is less steep than AdaBoost for large errors and thus penalizes such errors less. For small errors, however, Bernoulli penalizes more than AdaBoost. If your errors are large, use AdaBoost; if your errors are small, use Bernoulli.   Multinomial (three or more outcomes) categorical:   If you choose Multinomial categorical (three or more target values), then it will use a loss function geared specifically for a target variable with three or more values.   Since, in our credit card fraud use case, we have a binary categorical target (“ClassYN”), we experimented between the Bernoulli Loss Function and the AdaBoost Loss Function, because we did not know whether the errors would be large or small.   
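The relative steepness of these loss choices is easy to see side by side. This is a conceptual sketch of how an error of size r is penalized, not the gbm package's exact internal formulas:

```python
# How different loss shapes penalize an error of size r.
import numpy as np

residuals = np.array([0.1, 0.5, 1.0, 2.0, 4.0])

squared = residuals ** 2          # Gaussian: large errors punished hardest
absolute = np.abs(residuals)      # Laplace: linear penalty
exponential = np.exp(residuals)   # AdaBoost-style: steepest for big errors

for r, s, a, e in zip(residuals, squared, absolute, exponential):
    print(f"r={r:<4} squared={s:<6.2f} absolute={a:<5.2f} exp={e:.2f}")
```

Note how the squared penalty sits below the absolute one for small errors but overtakes it past r=1, and the exponential penalty grows fastest of all, matching the large-vs-small error guidance above.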
The maximum number of trees in the model:   The next customization parameter that we have is the maximum number of trees. With this parameter, we want to use the minimum number of trees that will give us the best balance between processing time and error amount when tested via the Testing and Evaluation datasets. Typically, there is a value for the number of trees after which the quality of the model (measured by AUC, or Precision/Recall, or Accuracy) plateaus. In the example graph below, which was generated on the R anchor of the Boosted Model Tool, the X axis (titled “Iteration”) is the number of trees created. The Y axis titled “AdaBoost exponential bound” refers to the amount of prediction error with the number of trees. In this example, we were using an AdaBoost loss function with cross validation. There are two key points identified by the arrows in the graph.   The blue arrow identifies the “knee” of the graph. This is an inflection point where the error no longer significantly reduces as the number of trees increases. In this example, it is at about 2000 trees.   The yellow arrow identifies the “gone too far” point where the error actually begins to increase as the number of trees increases. In this example, it is at about 6000 trees.     The maximum number of trees should be between these two points on the graph. The closer to the yellow arrow point, the longer it takes to create the model, but the error will be the minimum. The closer to the blue arrow point, the less time it will take to create the model, but the error will be slightly higher. We chose right in the middle at 4000 trees.   Understand that the only way to view this graph is to run the model. So, in order to choose the optimal value for maximum number of trees, you may have to experiment by running the model with different loss functions and different values for the other parameters.   
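The "gone too far" point described above can be found programmatically once you have the per-iteration error series. This Python sketch uses made-up error values (one per tree) standing in for the curve the Boosted Model Tool plots on its R anchor:

```python
# Synthetic per-iteration error values: the error falls steeply, flattens
# (the "knee"), and then begins to rise again (overfitting).
errors = [0.90, 0.55, 0.38, 0.29, 0.25, 0.23, 0.22, 0.22, 0.23, 0.25]

# The "gone too far" point: the tree count where the error bottoms out
# before it starts rising again.
best = errors.index(min(errors)) + 1  # 1-based tree count
print(best)  # → 7
```

In practice the series has thousands of entries rather than ten, but the idea is the same: pick a maximum number of trees somewhere between the knee and this minimum-error point.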
Method to determine the final number of trees in the model:   The next parameter we need to set with respect to the number of trees is the Method to Determine the Final Number of Trees in the Model. The three options are cross validation, test (validation sample), and out-of-bag. For the credit card fraud detection use case, we experimented with each method.   Cross validation:   Of the three options, the one that is consistently the best is cross validation. A recommendation for the number of cross validation folds is somewhere between 5 and 10. If the data is very noisy, you may want to increase the number of CV folds even more, to 20. Cross validation has the benefit of preventing overfitting. Unfortunately, it also takes a lot of processing power to use cross validation. This is why you can choose the number of CPU cores to dedicate to cross validation. Using more cores than the number of folds does not help, but keeping it to one core will take the longest time to process. For instance, if you have 5 CV folds, and 8 CPU cores on your machine, then you can probably up the number of cores to 4 and not significantly slow down any processing happening concurrently outside of Alteryx.   Test (validation) sample:   Using Test (validation) sample will simply tell the tool to separate out a percent of the training records (that you set) to be used for internal testing. This method is much faster than cross validation, but it is not better than cross validation.   Out-of-bag:   This method is based on the algorithm’s “out-of-bag” estimator, which typically does not reach the optimal number of trees. It is faster than cross validation and test sample, but it usually does not yield the most accurate model.   The fraction of observations used in the out-of-bag sample:   The next customization parameter is “The fraction of observations used in the out-of-bag sample” (AKA “bag fraction”). 
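To picture what the cross validation option is doing internally, here is a minimal Python sketch (assuming 5 folds; the function name and row counts are illustrative): the training rows are shuffled into k folds, and each fold takes a turn as the held-out validation set, so every row is held out exactly once.

```python
import random

def make_folds(n_rows, k, seed=1):
    # Shuffle row indices, then deal them round-robin into k folds.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

folds = make_folds(n_rows=20, k=5)
for held_out in folds:
    train = [i for fold in folds if fold is not held_out for i in fold]
    # ... fit a model on `train`, measure its error on `held_out` ...

print([len(f) for f in folds])  # five folds of 4 rows each
```

Each of the k fits is independent, which is why the tool can spread them across CPU cores: with 5 folds and 5 or more cores, all five fits can run at once.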
This is the percentage of training records pulled and used in each iteration of tree creation. This pulling of records out of the bag of training records is random. If the random seed value is set to zero, and the bag fraction is less than one, then every time the model is retrained with the same data and the same parameters, the resulting model will be slightly different. In other words, a seed of zero means a slightly different model is built each time the workflow is run, even with the same input data. If we instead set the seed to a fixed number (e.g. 1), and keep it at that number, then every time the workflow is run with the same input data, the model that is built will always be the same.   For the credit card fraud detection use case, we experimented with different fractions of observations to determine the optimal value.   Shrinkage:   The shrinkage value is also called the learning rate. As a general rule, the smaller the learning rate, the more accurate the model will be, although there are occasional exceptions. Also, as a general rule, the smaller the learning rate, the longer it takes to reach an optimal model.   For the credit card fraud detection use case, we experimented with different shrinkage values to find the optimal value.   Interaction depth:   The interaction depth relates to how big each tree should be. There is some debate about how many tree nodes and tree leaves are formed for each value of interaction depth. What is not debated is that the greater the value, the larger each tree will be. The default value is 1, which is also the value that does the most to prevent overfitting.   For the credit card fraud detection use case, we experimented with values 1 through 5 for the interaction depth to determine the best value.   
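The seed-and-bag-fraction behavior can be demonstrated with Python's random module standing in for the tool's sampler (the function and variable names here are illustrative): a fixed seed reproduces the same bag every run, while different seeds produce different bags.

```python
import random

rows = list(range(100))
bag_fraction = 0.5  # half of the training rows go into each bag

def bag_sample(seed):
    # Draw the bag with a seeded generator so the draw is repeatable.
    rng = random.Random(seed)
    return rng.sample(rows, int(len(rows) * bag_fraction))

print(bag_sample(1) == bag_sample(1))  # same seed → identical bag
print(bag_sample(1) == bag_sample(2))  # different seeds → different bags
```

This is why pinning the seed matters for repeatability: with the same seed, data, and parameters, retraining rebuilds the exact same model.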
Minimum required number of objects in each tree node:   The next parameter is the minimum required number of objects in each tree node. This value is the minimum number of training records that must be in each terminal node of the tree for the model to use that tree. When creating a tree, it is necessary to split the tree into right and left nodes. But before the split is made official, each new node must receive at least this minimum number of records. A higher value reduces the number of tree splits that can happen and possibly reduces the number of trees. The higher the number, the more it will generalize the model and prevent overfitting. The default value is 10. The best way to find the optimal value is to experiment.   For the credit card fraud detection use case, we experimented with values 1 through 100 (the maximum possible value).   Random seed value:   Finally, the last customization parameter is the random seed value. If it is set to 0, the output models for each training run will be slightly different (even with the exact same input data). If you instead set the seed to any number greater than zero, all runs with that same seed value (and the same input data and customization parameters) will yield the same output model. This is important for repeatability and testing.   Credit Card Fraud Detection Use Case   With the credit card fraud data, we ran several experiments to determine the best set of customization hyperparameters. The first set of experiments examined the type of loss function together with the method to determine the final number of trees. The two choices for loss functions are Bernoulli and AdaBoost, and the three choices for methods are cross validation (5 folds), Test Sample (50%), and Out-Of-Bag. Below are the results.       Model Measurements   Confusion Matrix:   The first column is the model description. 
The next four columns are the “confusion matrix” values: TP (True Positive) means we predicted the record as fraud and it truly was fraud. FP (False Positive) means we predicted the record as fraud, but it was not actually fraud. FN (False Negative) means we predicted the record as not fraud, but in reality it was fraud. TN (True Negative) means we predicted the record as not fraud and it truly was not fraud. Our main focus with the confusion matrix numbers lies with the False Positive and False Negative values: we want these to be as low as possible because they have actual costs to the company.   Please note that the confusion matrix is not an output of the Boosted Model Tool. In order to create it, we built a tool container titled “Post Processing Evaluation of Scored Data” in which we put the logic to create the confusion matrix along with the Optimal Cutoff, TotalLoss, Precision, Recall, and Accuracy fields. To find the tool container, look in the green and yellow processing sections on the right side. If you open the tool container, you can easily follow the logic.   Here is a good example to help you understand the four confusion matrix metrics:     Optimal Cutoff and TotalLoss:   After the model is created, it is fed directly into a Score tool along with the Test records. The output of the Score tool is a Score_Y field containing the probability that the record is fraudulent. With the default setting, if the probability is 0.50 or above, we would classify the transaction as fraud. However, by taking into account the actual costs of False Positives and False Negatives, we can calculate an optimal cutoff value that minimizes costs.   For example, if we inaccurately predict a transaction as fraudulent, the company could take an action that makes a customer upset, such as blocking the transaction and maybe even putting a freeze on the card. This could cause the customer undue stress, and some customers will take their business elsewhere. 
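The four confusion-matrix cells can be tallied directly from actual labels and scored predictions. This is a minimal Python sketch (the label values and field layout are illustrative, not the workflow's actual fields):

```python
# Tally TP, FP, FN, TN from parallel lists of actual and predicted labels.
def confusion_matrix(actual, predicted, positive="fraud"):
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive: tp += 1   # predicted fraud, was fraud
            else:             fp += 1   # predicted fraud, was not
        else:
            if a == positive: fn += 1   # predicted not fraud, was fraud
            else:             tn += 1   # predicted not fraud, was not
    return tp, fp, fn, tn

actual    = ["fraud", "ok", "ok", "fraud", "ok", "ok"]
predicted = ["fraud", "fraud", "ok", "ok", "ok", "ok"]
print(confusion_matrix(actual, predicted))  # → (1, 1, 1, 3)
```

In the workflow this logic lives inside the “Post Processing Evaluation of Scored Data” tool container rather than in a script, but the bookkeeping is the same.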
This is the potential loss due to a False Positive. If a card has truly fraudulent transactions and yet we predict that the transactions are not fraud, then we will likely lose more money. This loss results from a False Negative. In our use case, we assigned the loss for a False Positive as $100 and the loss for a False Negative as $400. Using these values, we calculate the Optimal Cutoff score and the TotalLoss in dollars. We want the Optimal Cutoff score to be as high as possible, and the TotalLoss to be as small as possible, to make sure our model is robust.   The logic for this calculation is in the tool container. There is a Text Input tool where you can set the costs for each of the categories of classifications (TP, FP, FN, and TN).   Precision, Recall and Accuracy:   These are common measurements that can be applied to models based on the results of the test data. For all of these measurements, the higher the number the better. A perfect score for each would be 1.00, although in practice that is almost never achieved.     Precision is defined as the proportion of correct positive classifications (predictions) out of all the cases that are predicted as positive. When False Positives rise, Precision declines. If our focus is on making sure that the transactions we flag as fraud really are fraud, with little concern about False Negatives, then Precision is a good measurement to use.     Recall is defined as the proportion of correct positive classifications out of all the cases that are actually positive, which is TP plus FN. When False Negatives rise, Recall declines. If our focus is on maximizing detection of all fraud, with False Positives as a second priority, then Recall is the measurement to use.     Accuracy is the proportion of correct classifications (true positives and true negatives) out of the overall number of cases. 
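These measurements drop straight out of the four confusion-matrix counts. Here is a Python sketch using the article's illustrative costs ($100 per False Positive, $400 per False Negative) and made-up counts:

```python
# Precision, Recall, Accuracy, and dollar TotalLoss from the four
# confusion-matrix counts, using the illustrative FP/FN costs above.
def metrics(tp, fp, fn, tn, fp_cost=100, fn_cost=400):
    precision  = tp / (tp + fp)              # of flagged fraud, how much was real
    recall     = tp / (tp + fn)              # of real fraud, how much we caught
    accuracy   = (tp + tn) / (tp + fp + fn + tn)
    total_loss = fp_cost * fp + fn_cost * fn  # dollars lost to mistakes
    return precision, recall, accuracy, total_loss

p, r, a, loss = metrics(tp=40, fp=10, fn=5, tn=945)
print(round(p, 2), round(r, 2), round(a, 3), loss)  # → 0.8 0.89 0.985 3000
```

Sweeping the classification cutoff and picking the one that minimizes this total loss is exactly the Optimal Cutoff calculation described above.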
When we want to simultaneously measure how well our model correctly classifies “fraud” and “not fraud” transactions, then Accuracy is a good measurement to use. However, when we have a highly imbalanced training dataset (as we do in this case), Accuracy is not a good measurement. For example, imagine the model simply predicts every record it sees as not fraud. If that were the case, then with 492 fraud transactions and 47,211 normal transactions, our accuracy would still calculate to approximately 99%, and yet our model would not be able to detect any fraud. Our model would be useless, but still very accurate, simply by chance. For this reason, with a highly imbalanced data set, we normally do not use Accuracy as a primary measure.   AUC (Area Under the Curve) and Gini Coefficient:   AUC: The Area Under the Curve is a measurement of the total area under the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (AKA Recall) against the False Positive Rate (FPR, which equals 1 − Specificity). It is a good measurement because it represents the degree of separability between the ‘fraud’ and ‘not fraud’ classes and tells us how well the model can distinguish between these classes. A perfect AUC measurement would be equal to 1.00. In Alteryx, the tool that we use to generate the ROC curve and measure the AUC (area under the ROC curve) is the Lift Chart tool, which is found in the Predictive suite of tools. The Lift Chart tool can measure the AUC for one or many models, allowing us to conveniently compare the graphs and final measures for each model.   The output graph below is an actual output from our experiments in determining the best Model Customization set of parameters. The ROC graph shows six ROC curves (one for each of six different models). In the middle of the graph is the diagonal line. Any model that is working “better than random” will have a ROC curve above and to the left of the diagonal line. 
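The accuracy paradox in the paragraph above is easy to verify with the article's own counts. A sketch of the useless predict-everything-as-not-fraud model:

```python
# A model that predicts "not fraud" for every record in the dataset.
fraud, normal = 492, 47_211

tp, fp = 0, 0            # it never predicts fraud...
fn, tn = fraud, normal   # ...so it misses all fraud and "gets" all normals

accuracy = (tp + tn) / (fraud + normal)
recall = tp / (tp + fn)
print(round(accuracy, 4), recall)  # → 0.9897 0.0
```

Nearly 99% accuracy with zero recall: the model catches no fraud at all, which is exactly why accuracy alone is misleading on imbalanced data.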
In this graph, all six models have curves indicating that they are all “better than random”. Typically, the model curve that is farthest above and to the left of the diagonal line will have the highest AUC value and is considered the best model. Put another way, the curve closest to the upper left corner of the graph typically belongs to the best model. In the graph below, the curve closest to the upper left corner is the Adaboost_Cross. Next is the Adaboost_Out_of_the_Bag model, and third best is the Adaboost_Test_Sample. Note that all the Adaboost model curves are closer to the upper left corner than all three of the Bernoulli model curves (the Bernoulli_Cross is hidden behind the Bernoulli_Out_of_the_Bag). In the table below the graph, we see that the Adaboost_Cross model has the highest value for AUC, confirming that it is the best of the six models. The worst would be the Bernoulli_Cross or Bernoulli_Out_of_the_Bag.     Gini Coefficient: The Gini Coefficient is found in the table below the ROC Curve, on the far right. It is derived from the AUC value (Gini = 2 × AUC − 1) and, like the AUC, it indicates separability: 0 indicates no ability to separate the classes, and 1 indicates perfect separability.   Which measurement shall we use?   For use cases where the data is highly imbalanced and the target variable is binary, the best measurement to use is the AUC (Area Under the Curve). In our experiments, we looked at all the measurements and gave decision preference to AUC, Optimal Cutoff, and TotalLoss.   What experiments were run?   We ran various experiments to determine the optimal customization parameters for the Boosted Model. In all cases, we used the Test Data to determine these parameters. 
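The AUC and Gini values in the Lift Chart table can be reproduced from the ROC points themselves. This Python sketch (with made-up ROC points) computes the area under the curve by the trapezoid rule and derives the Gini Coefficient from it:

```python
# AUC via the trapezoid rule over (FPR, TPR) points of a ROC curve.
def auc_trapezoid(points):
    # points: (fpr, tpr) pairs sorted by fpr, running from (0,0) to (1,1)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

roc = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.85), (0.6, 0.95), (1.0, 1.0)]
auc = auc_trapezoid(roc)
gini = 2 * auc - 1  # a random-guess model (AUC 0.5) scores a Gini of 0
print(round(auc, 3), round(gini, 3))  # → 0.835 0.67
```

A curve hugging the upper left corner drives the AUC (and hence the Gini) toward 1, which matches the visual reading of the graph described above.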
Here are the winners for our credit card fraud detection use case:
- Loss function: AdaBoost
- The maximum number of trees in the model: 4000
- Method to determine the final number of trees in the model: Cross validation (5 fold)
- The fraction of observations used in the out-of-bag sample: 30%
- Shrinkage: 0.0035
- Interaction depth: 1
- Minimum required number of objects in each tree node: 100

Final experiment results:   We ran a final experiment using the Evaluation Data against the top two contending models and the default model (the default model is the model generated when we use all the default parameters). Here are the default parameters:

Default: Model_Default_Bernoulli_Cross_4000_20_1_10_50
- Loss function: Bernoulli
- The maximum number of trees in the model: 4000
- Method to determine the final number of trees in the model: Cross validation (5 fold)
- The fraction of observations used in the out-of-bag sample: 50%
- Shrinkage: 0.0020
- Interaction depth: 1
- Minimum required number of objects in each tree node: 10

Our two contending models were:

Model_Adaboost_Cross_4000_35_1_100_30
- Loss function: AdaBoost
- The maximum number of trees in the model: 4000
- Method to determine the final number of trees in the model: Cross validation (5 fold)
- The fraction of observations used in the out-of-bag sample: 30%
- Shrinkage: 0.0035
- Interaction depth: 1
- Minimum required number of objects in each tree node: 100

Model_Adaboost_Test_4000_35_1_100_30
- Loss function: AdaBoost
- The maximum number of trees in the model: 4000
- Method to determine the final number of trees in the model: Test Sample (50%)
- The fraction of observations used in the out-of-bag sample: 30%
- Shrinkage: 0.0035
- Interaction depth: 1
- Minimum required number of objects in each tree node: 100

Our results from the three contending models are below:     Looking at the results, we see that the first model (Default) is not bad. It has a high True Positive value. It also has the lowest TotalLoss value. 
However, it also has the lowest AUC value, which is not good. The second and third models had better AUC values, yet their TotalLoss values were greater in magnitude. The second model has the highest Optimal Cutoff value, which indicates a stronger model.   One last thing we can look at when comparing the models is the ROC curves. Below is a combined chart with the ROC curves of the three contending models.     From the chart, we see that the ROC curve farthest above and to the left of the diagonal belongs to the Model_Adaboost_Cross_4000_35_1_100_30 model. These curves confirm the AUC values in the table. This is the winning model because it has the highest AUC value and the highest Optimal Cutoff value.   Conclusion:   The Boosted Model Tool in Alteryx is a very powerful predictive analytics tool. It is one of the more complex tools in the tool chest, but in difficult use cases, it proves to be very valuable. This credit card fraud detection use case is especially difficult because the dataset is highly imbalanced.   The Boosted Model Tool uses the GBM package in R. This algorithm is resistant to over-fitting, and it works well with a variety of target variable types.   If you decide to use the Boosted Model Tool, you will find that the default values provide decent results. If you want to improve upon those results, experiment with the customization parameters. This will add time to the design and modeling process, but it can produce significantly better results.     
References:
A Gentle Introduction to Gradient Boosting
Boosting with AdaBoost and Gradient Boosting
Boosting and Logistic Regression
A Short Introduction to Boosting
ADABOOST
Quick Introduction to Boosting Algorithms in Machine Learning
Beyond Accuracy: Precision and Recall
Understanding the AUC – ROC Curve
Useful properties of ROC curves, AUC scoring, and Gini Coefficients

[1] How to do this semi-random separation is beyond the scope of this article, but if you would like to know how best to split an imbalanced dataset, please contact the author through the Alteryx Community.

By now, you should have expert-level proficiency with the Boosted Model Tool! If you can think of a use case we left out, feel free to use the comments section below! Consider yourself a Tool Master already? Let us know at community@alteryx.com if you’d like your creative tool uses to be featured in the Tool Mastery Series.   Stay tuned with our latest posts every #ToolTuesday by following @alteryx on Twitter! If you want to master all the Designer tools, consider subscribing for email notifications.
View full article
In Alteryx, there are 5 customizable options within the Cross-validation screen.
View full article
We were recently approached by a concerned client with "Help! I have a model object in a .yxdb but my computer crashed and I need to document the predictor variables!" This naturally led to a discussion on how we can pull these variables back for the client, and what kind of scenarios would lead to this. The first scenario is the most obvious (the case of the client). The model object was created using Alteryx and was stored in a .yxdb. During another process, my computer crashed and I lost all of my data! Luckily, I still had the model object in a shared location, but I need to document the variables and the model object looks like this: Unfortunately, this does not give us any information about the data or, more importantly, the predictor variables. Luckily, a simple script can break down this model object and fill you in on all of the details.   Within Alteryx, attach an R Tool to your data stream (I am using the Forest Model Object that is created from an Alteryx Sample):   Next, copy and paste the following script into your R Tool code builder:

# Read the incoming data from Input #1
model.data <- read.Alteryx("#1")
# Unserialize (break down) the model stored in the Object field
the.obj <- unserializeObject(as.character(model.data$Object[1]))
# Print the original model call, which lists the predictor variables
print(the.obj$call)

This script takes the data coming in from Input #1 and labels it "model.data". Next, it unserializes (breaks down) the data in the field Object (specified by "model.data$Object[1]"). Finally, it prints the results in the Alteryx Output window. The final results for this particular object are then printed out, as shown. As you can see, the output clearly states that my predictor variables are Chk_Bal, Duration, Credit_Hist, Purpose, etc. The end result is quick, clean, and can really help get you out of a jam if you lose your data.
View full article
A number of the Predictive tools have Interactive Visualizations.  A Predictive tool will have this capability if you see an output anchor with the letter ‘I’ (for "Interactive").  
View full article