community
cancel
Showing results for 
Search instead for 
Did you mean: 

Alteryx Designer Knowledge Base

Definitive answers from Designer experts.
Community v19.6

Looks aren't everything... But the latest Community refresh looks darn good!

Learn More
Alteryx Designer comes with tools (based on both R and Python) to create and use predictive models without needing to write any code. But what if you've got custom models written in R or Python outside of Designer that you want to use in Designer, or vice versa?
View full article
A broad overview and introduction to what Decision Trees are, and how they work.
View full article
This article is part of the Tool Mastery Series, a compilation of Knowledge Base contributions to introduce diverse working examples for Designer Tools. Here we’ll delve into uses of the Boosted Model Tool on our way to mastering the Alteryx Designer:    The Boosted Model tool in Alteryx is a powerful predictive analytics tool. It is one of the more complex tools in the tool chest, but in difficult use cases, it proves to be very valuable. This article will explain what boosting is and provide step-by-step instructions on how to incorporate this technique in a workflow, illustrated with a credit card fraud detection use case. In the Predictive Set of Tools in Alteryx, the very first one on the left is the Boosted Model. The icon for the tool has an upward arrow and what looks like an upright spring.   What is the Boosted Model?   The Boosted Model relies on the machine learning technique known as boosting, in which small decision trees (“stumps”) are serially chain-linked together. Each successive tree in the chain is optimized to better predict the errored records from the previous link. How to better-predict those previous errors is what distinguishes each boosting algorithm. One of the strongest and most proven algorithms is the Gradient Boosting Machine (aka “GBM”). Both R and Python have GBM packages; the Alteryx Boosted Model Tool uses the R implementation.   The Gradient Boosting Machine was invented by Jerome Friedman two decades ago while he was a Statistics Professor at Stanford University. You can download his paper here. A Gradient Boosting Machine is one that uses gradient descent as the method of zeroing in on the errors from the previous tree in the chain.   Boosting – especially gradient boosting – is very popular because of how well it works. New alternative versions of GBM are available in R and Python. The most popular are XGBoost developed in 2014, CatBoost developed in 2017, and Light GBM developed in 2017.    Since the Gradient Boosting Machine uses gradient descent, it needs to have a learning rate configured for it, like other gradient descent algorithms. For GBM, the learning rate is called the “shrinkage factor”.   Gradient boosting (as well as other boosting algorithms) is resistant to overfitting which is basically when an algorithm learns too much about the specifics of the data records it initially learns on, thereby failing to generalize to new, previously unseen, data at prediction time. This is a common problem with many algorithms, but not so much with gradient boosting.   To understand what overfitting is, look at the two sets of graphs below. The blue dots represent the training data points, and the red lines represent the model function inferred from the training data that will be used to predict on new data. The top two graphs show two possible models based on the same training data, while the bottom graphs show how the same models perform on test data. On each set the left model is what we refer to as a “Generalized” Model. The red line (aka ‘model’) serves as a trend line and predicts the trend by trying to be placed in the middle of the training points on the scatterplot. It is a simple straight line defined by a first order linear equation. In the Generalized Model, the model only accurately predicts 2 of the 16 training points (shown by going through the 2 points). However, its placement minimizes the error (distance between the other points and the line). In contrast, in the Overfit Model on the right, the model employed is more complicated and defined by higher order equations. In the case of the Overfit Model, the model BETTER PREDICTS the training data, accurately predicting 7 of the 16 training data points. We might be tempted to think that the Overfit Model is better because it achieves higher Accuracy when predicting the training data; however, when the models try to predict on the previously unseen Test Data, the Accuracy of the Overfit Model underperforms that of the Generalized Model and it does not reflect the true trend of the data.       Gradient boosting is flexible in the types of target variables (dependent variables) that it can handle. It can deal with continuous numeric variables such as temperature, as well as binary variables such as 0 and 1. It can also work with categorical variables such as Yes and No, or multinomial categorical variables such as “Dog”, “Cat”, and “Mouse”.   The Gradient Boosting Machine (like the other boosting algorithms) is said to be good for imbalanced training datasets. So, for example, if a training dataset has 50% of the records with a target variable of “Y” and the other 50% has a target variable of “N”, then it is considered a perfectly balanced dataset. Balanced datasets are excellent for training, but there are many cases in which the training dataset by its very nature is highly imbalanced. For example, if we are trying to detect credit card fraud, we would have a historical dataset in which each credit card transaction is a record. The number of fraudulent transactions within those records would be a very small percentage of the total. In this case, the dataset available for training a model is highly imbalanced, which presents a difficulty for many algorithms such as Logistic Regression. It is less of a problem for the GBM, however, because of the way the algorithm focuses on error records in each tree generation, as described above.    Credit Card Fraud Data   In order to help illustrate how to use the Alteryx Boosted Model, we will use a credit card fraud detection dataset for model training. Kaggle has a sizeable dataset available, which you can download here .   This particular dataset has nearly 285,000 rows, each representing a single credit card transaction. In order to hide the personal data associated with each credit card transaction, the variables are all obfuscated by their names and normalized. It is very likely these variables are the output of a PCA (Principal Components Analysis) preprocess. PCA processes are often employed to reduce many variables down to only a few independent and regularized (normalized) variables. One of the benefits of PCA is that the smaller set of variables that are output are often obfuscated and difficult to correlate back to a known input variable (without the output reports of the PCA process). Another benefit of PCA preprocessing is that the output variables have very low collinearity, which makes them better suited for modeling.   The Kaggle credit card fraud dataset has 30 columns. The main variables are named “V1” through “V28”. Following this is the “Amount” column, which is the purchase amount, and then finally the Target Variable labeled “ClassYN”. If Class=Y, then the record represents a fraudulent transaction; if Class=N, then the record is a normal transaction. Of the 284,807 total records, only 492 (0.17%) are fraud records. This means that 99.83% of the records are not fraud hence this is a very imbalanced dataset.   In preparation for modeling, we separated the 285K records into three different datasets: one for training, one for testing, and one for final evaluation. This separation into Training, Testing, and Evaluation sets can be done through random assignment, but – especially with imbalanced datasets – it is better to split the data in a semi-random way. This ensures each sub-dataset has a representative sampling of the important types of records from the original dataset [1] . Once the full dataset is split evenly into three datasets, a T-Test (Test of Means) is performed on the target variable and the most important of the other variables (“V1” – “V28”) to ensure that each dataset is statistically similar to the original.   The Alteryx Workflow   Below is a screenshot of the credit card fraud detection model and workflow created for this blog. You can download a copy of it at the Alteryx Public Gallery here .     The workflow is broken into three columns. The Input Data column on the far left, the colored processing sections in the middle, and the Output Data column on the far right. The data flows from left to right, Inputs to Outputs. The processing happens in three sections: The pink section is where the Boosted Model gets trained. In the green section, the model is tested using Test Dataset 1. The yellow section is where the final model evaluation is performed using Test Dataset 2.   Before the Training Data is used in step one of processing, we decided to transform the Amount field. Transforming variables can help the modeling process. Because the Amount field spanned from $0 to $25,000 and this scale is significantly different than the span of the variables V1 through V28, it would be best to either normalize the field or transform it to make it better correlate with the Target Variable. In this case, we decided to do a Logarithmic transformation. Specifically, the new variable “3LogAmount” is 3*Log(Amount). This transformation brings the scale to the same degree as the other variables; and, the new variable better correlates with the Target Variable (using Pearson Correlation). Because we transformed Amount for Training, we also must transform it for Testing and Evaluation.   Configuring the Alteryx Boosted Model   Required Parameters Menu   In order to configure the Alteryx Boosted Model tool, click on it in the workflow. This is what you will see in the Configuration Pane:     On the “Required parameters” tab, you can choose a Model Name. You can choose any name you want, but a good practice is to include information about how the model was configured.   Next, choose the target field. This is the field we are trying to predict. In our dataset the target field is “ClassYN”.   The next section shows a list of checkboxes for indicating the predictor fields. We check every field except Amount and ClassYN. Since we transformed the Amount field into 3LogAmount, we should check 3LogAmount, but not Amount. Also, it is very important that we do not check the ClassYN field, because it is the dependent variable and not an independent variable.   The next option is to choose whether you want to use sampling weights. If you want to do this, include a field in the data with the numeric weight for each record. For example, if there is a set of records you want to be 10 times more likely to be used than the rest of the records, then those records get weighted as 10 and the rest of the records get weighted as 1. When configuring the tool, choose the field in the dataset that has the weights. In our case, we do not need to use weights.   The next option is whether or not to include marginal effect plots. These are plots that isolate a single variable to inspect which values of that variable are most predictive of the Target Variable. These are helpful in determining how well a variable will serve as a predictor.   Model Customization Menu   Next, to further customize the model, choose the Model customization tab.     If you click on the checkbox: “Specify target type and the loss function distribution”, you can choose what type of variable the Target Value is, and then its loss function. A loss function essentially tells the algorithm how to penalize errored predictions.   Continuous target:   For example, if the target variable is continuous, there are three different loss functions available to use. A Gaussian (Squared Error Loss) function is very good to use when the predicted errors are large in value. With this loss function, since the error amount is squared, this penalizes the errors much more when they are large. A Laplace (Absolute Value Loss) function penalizes the errors linearly to their error amount. This is a faster loss function, but it does not work as well as Gaussian in cases in which the errored amounts are large.   The t-distribution loss function (AKA Student’s T loss function) should penalize errors even less than the Laplace loss function. If you choose the t-distribution, you will need to enter the number of predictive variables in the “Degrees of Freedom” field.   Count (integer) Target:   If your target variable is a count of some kind (e.g. count of sneezes a person has during allergy season), then the Boosted Model tool will automatically choose the Poisson loss function.   Binary (two outcomes) Categorical:   If you choose a target variable that is Binary Categorical, then you will have a choice of two loss functions: Bernoulli (logistic regression) or AdaBoost (exponential loss). Bernoulli has a loss function that is less steep than AdaBoost for large errors and thus penalizes such errors less. For small errors, however, Bernoulli penalizes more than AdaBoost. If your errors are large, use AdaBoost; if your errors are small, use Bernoulli.   Multinomial (three or more outcomes) categorical:   If you choose Multinomial categorical (three or more target values), then it will use a loss function geared specifically for a target variable with three or more values.   Since, in our credit card fraud use case, we have a binary categorical target (“ClassYN”), we experimented between the Bernoulli Loss Function and the AdaBoost Loss Function, because we did not know whether the errors would be large or small.   The maximum number of trees in the model:   The next customization parameter that we have is the maximum number of trees. With this parameter, we want to use the minimum number of trees that will give us the best balance between processing time and error amount when tested via the Testing and Evaluation datasets. Typically, there is a value for the number of trees after which the quality of the model (measured by AUC, or Precision/Recall, or Accuracy) plateaus. In the example graph below, which was generated on the R anchor of the Boosted Model Tool, the X axis (titled “Iteration”) is the number of trees created. The Y axis titled “AdaBoost exponential bound” refers to the amount of prediction error with the number of trees. In this example, we were using an AdaBoost loss function with cross validation. There are two key points identified by the arrows in the graph.   The blue arrow identifies the “knee” of the graph. This is an inflection point where the error no longer significantly reduces as the number of trees increases. In this example, it is at about 2000 trees.   The yellow arrow identifies the “gone too far” point where the error actually begins to increase as the number of trees increases. In this example, it is at about 6000 trees.     The maximum number of trees should be between these two points on the graph. The closer to the yellow arrow point, the longer it takes to create the model, but the error will be the minimum. The closer to the blue arrow point, the less time it will take to create the model, but the error will be slightly higher. We chose right in the middle at 4000 trees.   Understand that the only way to view this graph is to run the model. So, in order to choose the optimal value for maximum number of trees, you may have to experiment by running the model with different loss functions and different values for the other parameters.   Method to determine the final number of trees in the model:   The next parameter we need to set with respect to the number of trees is the Method to Determine the Final Number of Trees in the Model. The three options are cross validation, test (validation sample), and out-of-bag. For the credit card fraud detection use case, we experimented with each method.   Cross validation:   Of the three options, the one that is consistently the best is cross validation. A recommendation for the number of cross validation folds is somewhere between 5 and 10. If the data is very noisy, you may want to increase the number of CV folds even more, to 20. Cross validation has the benefit of preventing overfitting. Unfortunately, it also takes a lot of processing power to use cross validation. This is why you can choose the number of CPU cores to dedicate to cross validation. Using more cores than the number of folds does not help, but keeping it to one core will take the longest time to process. For instance, if you have 5 CV folds, and 8 CPU cores on your machine, then you can probably up the number of cores to 4 and not significantly slow down any processing happening concurrently outside of Alteryx.   Test (validation) sample:   Using Test (validation) sample will simply tell the tool to separate out a percent of the training records (that you set) to be used for internal testing. This method is much faster than cross validation, but it is not better than cross validation.   Out-of-bag:   This method is based on the algorithm’s “out-of-bag” estimator, which typically does not reach the optimal number of trees. It is faster than cross validation and test sample, but it usually does not yield the most accurate model.   The fraction of observations used in the out-of-bag sample:   The next customization parameter is “The fraction of observations used in the out-of-bag sample” (AKA “bag fraction”). This is the percentage of training records pulled and used in each iteration of tree creation. This pulling of the records out of the bag of training records is normally random, except in the case where the random seed value is set to zero. If the random seed value is set to zero, and the bag fraction is less than one, then every time the model is retrained with the same data and the same parameters, the resulting model will be slightly different. If the random seed value is zero then it can cause that a slightly different model is built each time the workflow is run and even with the same input data. If we set it to any number (e.g. 1), and keep it on that number, then every time the workflow is run with the same input data, the model that is built will always be the same.   For the credit card fraud detection use case, we experimented with different fractions of observations to determine the optimal value.   Shrinkage:   The shrinkage value is also called the learning rate. As a general rule, the smaller the learning rate, the more accurate the model will be. But please note there are exceptions to this general rule which sometimes are encountered. Also, as a general rule, the smaller the learning rate, the longer it takes to process in an effort to reach an optimal model.   For the credit card fraud detection use case, we experimented with different shrinkage values to find the optimal value.   Interaction depth:   The interaction depth relates to how big each tree should be. There is a lot of debate on how many tree nodes and tree leaves will be formed for each value of interaction depth. What is not debated is that the greater the number, the larger each tree will be. The default value is 1 and the value that will prevent overfitting the most is a value of 1.   For the credit card fraud detection use case, we experimented with values 1 through 5 on the interaction depth to determine the best value.   Minimum required number of objects in each tree node:   The next parameter is the minimum required number of objects in each tree node. This value is the minimum number of training records that must be in each terminal node of the tree for the model to use that tree. When creating a tree, it is necessary to split the tree into right and left nodes. But before making the split official, the dataset must provide the model this minimum number of records to put in each new node. If the value is higher, this will reduce the number of tree splits that can happen and possibly reduce the number of trees. The higher the number, the more it will generalize the model and prevent overfitting. The default value is 10. The best way to know the optimal value is to experiment.   For the credit card fraud detection use case, we experimented with values 1 through 100 (the maximum possible value).   Random seed value:   Finally, the last customization parameter is the random seed value. If it is set to 0, the output models for each training run will be slightly different (even with the exact same input data). If you instead set the seed to any number greater than zero, all runs with that same seed value (and the same input data, and the same customization parameters) will yield the same output model. This is important for repeatability and testing.   Credit Card Fraud Detection Use Case   With the credit card fraud data, we ran several experiments to determine the best set of customization hyper-parameters. The first set of experiments examined the type of loss function together with the method to determine the final number of trees. The two choices for loss functions are Bernoulli and AdaBoost, and the three choices for methods are cross validation (5 folds), Test Sample (50%), and Out-Of-Bag. Below are the results.       Model Measurements   Confusion Matrix:   The first column is the model description. The next four columns are the “confusion matrix” values: TP (True Positive) means we predicted the record as fraud and it truly was fraud. FP (False Positive) means we predicted the record as fraud, but it was not actually fraud. FN (False Negative) means we predicted the record as not fraud but in reality, it was fraud. TN (True Negative) means we predicted the record as not fraud and it truly was not fraud. Our main focus with the confusion matrix numbers lies with the False Positive and False Negative values: We want these to be as low as possible because they have actual costs to the company.   Please note that the confusion matrix is not an output of the Boosted Model Tool. In order to create it, we created a tool container titled “Post Processing Evaluation of Scored Data” in which we put the logic to create the confusion matrix along with the Optimal Cutoff, TotalLoss, Precision, Recall, and Accuracy fields. To find the tool container, look in the green and yellow processing sections on the right side. If you open the tool container, you can easily follow the logic.   Here is a good example to help you understand the four confusion matrix metrics:     Optimal Cutoff and TotalLoss:   After the model is created, it is fed directly into a score tool along with the Test records. The output of the score tool is a Score_Y field containing a probability that the record is fraudulent. In a default setting, if the probability is 0.50 and above, we would classify the transaction as fraud. However, by taking into account the actual costs for False Positives and False Negatives, we can calculate an optimal cutoff value that minimizes costs.   For example, if we inaccurately predict a transaction as fraudulent, the company could take an action that makes a customer upset, such as blocking the transaction and maybe even putting a freeze on the card. This could cause the customer undo stress and some will take their business elsewhere. This is the potential loss due to a False Positive. If a card has true fraudulent transactions and yet we predict that the transactions are not fraud, then we will likely lose more money. This loss results from a False Negative. In our use case, we assigned the loss for a False Positive as $100 and the loss for a False Negative as $400. Using these values, we calculate the Optimal Cutoff score and the TotalLoss in dollars. We want the Optimal Cutoff score to be as high as possible, and the TotalLoss to be as small as possible to make sure our model is robust.   The logic for this calculation is in the tool container. There is a Text Input tool where you can set the costs for each of the categories of classifications (TP, FP, FN, and TN).   Precision, Recall and Accuracy:   These are common measurements that can be applied to models based on the results of the test data. For all of these measurements, the higher the number the better. A perfect score for each would be 1.00, although in practice that is almost never achieved.     Precision is defined as the proportion of correct positive classifications (predictions) from the cases that are predicted as positive. When False Positives rise, then the Precision declines. If our focus were to maximize detecting fraud with a concern about False Positives and with little concern about False Negatives, then Precision is a good measurement to use.     Recall is defined as the proportion of correct positive classifications from all the cases that are actually positive, which is the TP plus FN. When False Negatives rise, the Recall value declines. If our focus is on maximizing detection of all fraud with False Positives as a second priority, then Recall is the measurement to use.     Accuracy is the proportion of correct classifications (true positives and true negatives) from the overall number of cases. When we want to simultaneously measure how well our model correctly classifies “fraud” and “not fraud” transactions, then Accuracy is a good measurement to use. However, when we have a highly imbalanced training dataset (as we do in this case), Accuracy is not a good measurement. For example, imagine the model simply predicts all records it sees as not fraud. If that were the case, then with 492 fraud transactions and 47,211 normal transactions, our accuracy will still calculate to be 98.9% and yet our model would not be able to detect any fraud. Our model would be useless, but still very accurate, simply by chance. For this reason, with a highly imbalanced data set, we normally do not use Accuracy as a primary measure.   AUC (Area Under the Curve) and Gini Coefficient:   AUC: The Area Under the Curve is a total area measurement of the area under the Receiver Operating Characteristic (ROC) curve when plotting the True Positive Rate (AKA Recall) against the False Positive Rate (AKA FPR or “1-Recall”). It’s a good measurement because it represents the degree or measure of separability between the ‘fraud’ and ‘not fraud’ classes and tells us how much the model is capable of distinguishing between these classes. A perfect measurement of AUC would be equal to 1.00. In Alteryx, the tool that we use to generate the ROC curve and measure the AUC (area under the ROC curve) is the Lift Chart tool which is found in the Predictive suite of tools. The lift tool can measure the AUC for one or many models, allowing us to conveniently compare the graphs and final measures for each model.   The output graph below is an actual output from our experiments in determining the best Model Customization set of parameters. The ROC graph shows six ROC curves (one for each of six different models). In the middle of the graph is the diagonal line. Any ROC curve from a model that is working “better than random” will have a ROC curve that is above and to the left of the diagonal line. In this actual graph, all six models have curves that indicate they are all “better than random”. Typically, the model curve that is the farthest to the left and highest up of the diagonal line will have the highest AUC value and is considered the best model. Another way of identifying this is the model curve closest to the upper left corner of the graph, typically is the best model. In the graph below, the closest curve to the upper left corner is the Adaboost_Cross. Next would be the Adaboost_Out_of_the_Bag model. And the third best is the Adaboost_Test_Sample. Note that all the Adaboost models are closer to the upper left corner than all three of the Bernoulli model curves (the Bernoulli_Cross is hidden behind the Bernoully_Out_of_the_Bag). In the table below the graph, we see that the Adaboost_Cross model has the highest value for AUC confirming that it is best of the 6 models. The worst would be the Bernoulli_Cross or Bernoulli_Out_of_the_Bag.     Gini Coefficient: The Gini Coefficient is found in the table below the ROC Curve on the far right. It is related to the AUC calculation and like the AUC value, it is scaled from 0 to 1, where 0 indicates no ability to separate the classes, and 1 indicates perfect separability.   Which measurement shall we use?   For uses cases where the data is highly imbalanced and the target variable is binary, the best measurement to use is the AUC (Area Under the Curve). In our experiments, we looked at all the measurements and gave decision preference to AUC, Optimal Cutoff, and TotalLoss.   What experiments were run?   We ran through various experiments determining the optimal customization parameters for the Boosted Model. In all cases, we used the Test Data to determine these parameters. Here are the winners for our credit card fraud detection use case: Loss function: AdaBoost The maximum number of trees in the model: 4000 Method to determine the final number of trees in the model: Cross validation (5 fold) The fraction of observations used in the out-of-bag sample: 30% Shrinkage: 0035 Interaction depth: 1 Minimum required number of objects in each tree node: 100 Final experiment results:   We ran a final experiment using the Evaluation Data against the top two contending models and the default Model (the default model is the model generated when we use all the default parameters). Here are the default parameters:   Default: Model_Default_Bernoulli_Cross_4000_20_1_10_50 Loss function: Bernoulli The maximum number of trees in the model: 4000 Method to determine the final number of trees in the model: Cross validation (5 fold) The fraction of observations used in the out-of-bag sample: 50% Shrinkage: 0020 Interaction depth: 1 Minimum required number of objects in each tree node: 10 Our two contending models were: Model_Adaboost_Cross_4000_35_1_100_30 Loss function: AdaBoost The maximum number of trees in the model: 4000 Method to determine the final number of trees in the model: Cross validation (5 fold) The fraction of observations used in the out-of-bag sample: 30% Shrinkage: 0035 Interaction depth: 1 Minimum required number of objects in each tree node: 100 Model_Adaboost_Test_4000_35_1_100_30 Loss function: AdaBoost The maximum number of trees in the model: 4000 Method to determine the final number of trees in the model: Test Sample (50%) The fraction of observations used in the out-of-bag sample: 30% Shrinkage: 0035 Interaction depth: 1 Minimum required number of objects in each tree node: 100 Our results from the three contending models are below:     Looking at the results, we see that the first model (Default) is not bad. It has a high True Positive value. It also has the lowest TotalLoss value. However, it also has the lowest AUC value, which is not good. The second and third models had better AUC values, yet their TotalLoss values were greater (in magnitude). The second model has the highest Optimal Cutoff value, which indicates a stronger model.   One last thing we can look at when comparing the models is the ROC curves. Below is a combined chart with the ROC curves of the three contending models.     From the chart, we see that the ROC curve farthest to the left of the diagonal and higher towards the top than the others is the Model_Adaboost_Cross_4000_35_1_100_30 model. These curves confirm the AUC values in the table. This is the winning model because it has the highest AUC value and the highest Optimal Cutoff value.   Conclusion:   The Boosted Model Tool in Alteryx is a very powerful predictive analytics tool. It is one of the more complex tools in the tool chest, but in difficult use cases, it proves to be very valuable. This credit card fraud detection use case is especially difficult because it is highly imbalanced.   The Boosted Model Tool uses the GBM package in R. This algorithm is resistant to over-fitting. It also works well with a variety of target variable types.   If you decide to use the Boosted Model Tool, you will find that the default values provide decent results. If you want to improve upon those results, experiment with the customization parameters. This will add time to the design and modeling process, but it can produce significantly better results.     References:   A Gentle Introduction to Gradient Boosting Boosting with AdaBoost and Gradient Boosting Boosting and Logistic Regression A Short Introduction to Boosting ADABOOST Quick Introduction to Boosting Algorithms in Machine Learning Beyond Accuracy: Precision and Recall Understanding the AUC – ROC Curve Useful properties of ROC curves, AUC scoring, and Gini Coefficients   [1] How to do this semi-random separation is beyond the scope of this article, but if you would like to know how best to split an imbalanced dataset, please contact the author through the Alteryx Community. By now, you should have expert-level proficiency with the Boosted Model Tool! If you can think of a use case we left out, feel free to use the comments section below! Consider yourself a Tool Master already? Let us know at community@alteryx.com if you’d like your creative tool uses to be featured in the Tool Mastery Series.   Stay tuned with our latest posts every #ToolTuesday by following @alteryx on Twitter! If you want to master all the Designer tools, consider subscribing for email notifications.
View full article
In Alteryx, there are 5 customizable options within the Cross-validation screen.
View full article
We were recently approached by a concerned client with "Help! I have a model object in a .yxdb but my computer crashed and I need to document the predictor variables!" This naturally led to a discussion on how we can pull these variables back for the client, and what kind of scenarios would lead to this. The first scenario is the most obvious (the case of the client). The model object was created using Alteryx and was stored in a .yxdb.  During another process, my computer crashed and I lost all of my data! Luckily, I still had the model object in a shared location, but I need to document the variables and the model object looks like this: Unfortunately, this does not give us any information about the data or more importantly, the predictor variables. Luckily, a simple script can break down this model object and fill you in on all of the details.   Within Alteryx, attach an R Tool to your data stream (I am using the Forest Model Object that is created from an Alteryx Sample):   Next, copy and paste the following script into your R Tool code builder: model.data <- read.Alteryx("#1") the.obj <- unserializeObject(as.character(model.data$Object[1])) print(the.obj$call) This script states to take the data coming in from Input #1 and label it "model.data".  Next, unserialize (break down) the data in the field Object (specified by "model.data%Object[1]").  Finally, print the results in the Alteryx Output window.  The final results for this particular object are then printed out, as shown. As you can see, the output clearly states that my predictor variables are Chk_Bal, Duration, Credit_Hist, Purpose, etc.  The end result is quick, clean, and can really help get you out of a jam if you lose your data.
View full article
A number of the Predictive tools have Interactive Visualizations.  A Predictive tool will have this capability if you see an output anchor with the letter ‘I’ (for "Interactive").  
View full article
Occasionally, users might have issues when it comes to installing the predictive tools on their machines.   The first common error (show below) can occur when first trying to install predictive tools after finishing a 64-bit Alteryx Designer installation.     This error occurs when the predictive installer can’t locate the “Alteryx” folder in the “Program Files” folder since the tools are always installed within this folder.    Now you might be thinking “Well I just installed 64-bit Alteryx and I set it to save in Program Files. How is it not in my Program Files folder?”     So what ends up happening here is that if you are running the Designer Installer as a non-admin Windows will sometimes redirect it and save Alteryx to the “Program Files (x86)” folder. Program Files (x86) is meant for 32-bit applications but even if the Designer is 64-bit it will sometimes still write to this folder. You can confirm this by just watching the installation process and checking where Alteryx is actually being saved.   The second thing that might go wrong is if you successfully installed predictive but you can’t see the predictive tools in the “Configure Tool Palette” window.   This issue goes hand in hand with the previous one. You might have an Alteryx folder in Program Files but the actual Designer is somewhere else in a different directory. Just make sure that both the predictive tools and the Designer are being installed in the same directory.   To fix these issues or any other issues that may pop up when trying to obtain the predictive tools:    First, uninstall the Designer and the predictive tools.   Second, run your desired installer as an administrator   This makes sure that Windows doesn’t redirect Alteryx and saves in the proper directory.     As you install, you should see it writing to the correct folder.   Now you can go ahead and install the predictive tools as you normally would, making sure that both are installed in the same place or by default in Program Files-> Alteryx
View full article
Predictive Grouping is an approach that allows users to assess and create the appropriate number of clusters (groups) for their data to be assigned based on their similarity to each other in the same cluster and dissimilar to other data assigned to other clusters. K-Centroids represent a class of algorithms for doing what is known as partitioning cluster analysis. These methods work by taking the records in a database and dividing (partitioning) them into the best K groups based on some criteria. The purpose of creating clusters is to assist you in the business decision-making process as it relates to the clustered data.
View full article
This article describes and explains the outputs of the Decision Tree Tool.
View full article
Question What are some good resources to help figure out how to estimate the model?  Know if results are good or bad?  Understand regression reports? Answer Alteryx focuses on how the tools function in the Designer.  The help pages for each tool explain how to use the tools, with reference links to assist in understanding the methodology.  In addition, please find links to some of the outside and Alteryx resources below:   Outside Resources * Online courses (mix of recorded presentations and weekly practice exercises). Many of them are free (with the option of paying a fee for a certificate) https://www.coursera.org/courses?query=statistics   * Book (Free online with video) https://www.openintro.org/stat/textbook.php   * Websites (includes extensive links) www.analyticsvidhya.com www.datasciencecentral.com   Alteryx Resources * Alteryx has a series of four recorded webinars which cover the predictive tools. http://www.alteryx.com/virtual-training * Community!
View full article
  With the release of 10.6 came awesome new features, and an upgrade in the underlying R version (from 3.1.3 code named "Smooth Sidewalk," to 3.2.3 code named "Wooden Christmas-Tree").  Using the incompatible R version will cause errors in your R macros.     Simply make sure that your Predictive Tools download is the version compatible with your Alteryx Designer version:     Users on 10.5 should continue to use the R3.1.3 version.   When using Alteryx and Microsoft Revolution R Enterprise, a separate predictive tools install is required (in green). For details, see the Alteryx and Revolution Analytics Integration Guide.   And remember to use the non-Admin Predictive version with non-Admin versions of Alteryx Designer.   To install Predictive Tools for Alteryx 10.0, go to Previous Releases. For Alteryx 9.5, within Designer, go to Help > Install Predictive Tools.    Happy Alteryx-ing!
View full article
An explanation of stochastic processes, pseudorandom number generators, and their existence in Alteryx.
View full article
We love helping users be successful with Alteryx, and this means providing a ton of great resources for getting started, learning more, and keeping you up to date with all the amazing stuff we're doing here at Alteryx… and the most compelling is Predictive!   Check out the Predictive District on the Gallery. There are great macros, apps, and sample workflows to demonstrate some nifty new tools.  This post by DrDan on the Analytics Blog gives an overview of what's currently available – stay tuned for additions!   One of my favorites is the Predictive Analytics Starter Kit Volume 1.  It enables you to learn the fundamentals of key predictive models with an interactive guided experience. Examples include Linear Regression, Logistic Regression, and AB Testing, and demonstrates the steps necessary to develop the dataset needed for analysis, and then how to actually build these predictive models yourself.   With v10.6, we introduced the Prescriptive Tool Category, comprising the Optimization and Simulation tools, to assist with determining the best course of action or outcome for a particular situation or set of scenarios. The Engine Works Blog has an introduction to this toolset, plus an extensive use case demonstration.   If you need more Optimization and Simulation action, there are several sample workflows, including Fantasy Sports Lineups (hey, sports fans – blog post here!), a mixing problem, workforce scheduling, and more!   Speaking of use cases, the software itself contains a plethora of predictive sample workflows - and the installed Starter Kits show up here, too!  Help > Sample Workflows > Predictive Analytics.   Of course, don't forget the Predictive Analytics help pages, for overviews and configuration tips.   Visit our Product Training page for On-Demand and Virtual webinars on everything Predictive – regression modelling, cluster analysis, time series…  As always, please begin with Data Prep and Investigation! Can I mention the Field Summary Tool enough times?   Want to show off the interactive visualizations from the models you've built?   This Knowledge Base post shows you how.   Another Engine Works post outlines how to build your own Custom Interactive Visualizations (Part 1 and counting…)   For the most in-depth, resource-rich training on leveraging predictive analytics to answer your business questions, consider the Udacity Predictive Analytics for Business NanoDegree.  It consists of seven courses focused on selecting the right methodology, data preparation, and data visualization as well as four courses that will equip you to use predictive analytics to answer your business problems.   But really, it all starts with the Community. Cruise the Knowledge Base posts, search for Predictive or other favorite keywords, follow the blogs… and for the love of Ned, just play with the software!  It's how we learn 🙂   Happy Alteryx-ing!
View full article
How to save your predictive model.
View full article
If you are building a predictive model, inevitably you will want to analyze the effect that your independent variables have on your dependent variable. This article is meant to shed some light on the Alteryx-specific options for this type of analysis!
View full article
A common task that analysts can run into (and a good practice when analyzing data) is to determine if the means of 2 sampled groups are significantly different. When this inquest arises, the  Test of Means  tool is right for you! To demonstrate how to configure this tool and how to interpret the results, a workflow has been attached. The attached workflow (v. 11.7 ) compares the amount of money that customers spent across different regions in the US. The  Dollars_Spent  field identifies the amount of money an individual spent and the  Region  field identifies the region that the individual resides in (NORTH, SOUTH, EAST, WEST).
View full article
Regression analysis is widely used for prediction and forecasting. Alteryx customers use these statistical tools to understand risk, fraud, customer retention and pricing, among many other business needs.
View full article
The TS Filler Tool t akes a data stream of time series data and fills in any gaps in the series.
View full article
We had a fun question in the solutions center at Inspire this year.  A customer showed us the sample for network analysis and asked us if it is possible to create one that has a picture for each icon.       We've provided a sample workflow (attached) that you can use as a reference to add as many pictures as needed.     Please note that the field names are case sensitive.   Happy Visualizing!
View full article
My code runs in R, but not in the R Tool?
View full article