Alteryx Designer Knowledge Base

Definitive answers from Designer experts.
This article is part of the Tool Mastery Series, a compilation of Knowledge Base contributions to introduce diverse working examples for Designer Tools. Here we'll delve into uses of the Boosted Model Tool on our way to mastering the Alteryx Designer.

The Boosted Model tool in Alteryx is a powerful predictive analytics tool. It is one of the more complex tools in the tool chest, but in difficult use cases it proves very valuable. This article explains what boosting is and provides step-by-step instructions for incorporating the technique into a workflow, illustrated with a credit card fraud detection use case. In the Predictive set of tools in Alteryx, the very first one on the left is the Boosted Model. The icon for the tool has an upward arrow and what looks like an upright spring.

What is the Boosted Model?

The Boosted Model relies on the machine learning technique known as boosting, in which small decision trees ("stumps") are serially chain-linked together. Each successive tree in the chain is optimized to better predict the errored records from the previous link. How a given algorithm improves on those previous errors is what distinguishes one boosting algorithm from another. One of the strongest and most proven algorithms is the Gradient Boosting Machine (aka "GBM"). Both R and Python have GBM packages; the Alteryx Boosted Model Tool uses the R implementation.

The Gradient Boosting Machine was invented by Jerome Friedman two decades ago while he was a Statistics Professor at Stanford University. You can download his paper here. A Gradient Boosting Machine uses gradient descent as the method of zeroing in on the errors from the previous tree in the chain.

Boosting, and especially gradient boosting, is very popular because of how well it works. Newer variants of GBM are available in R and Python; the most popular are XGBoost, developed in 2014, and CatBoost and LightGBM, both developed in 2017.
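To make the chain-of-stumps idea concrete, here is a from-scratch sketch of boosted 1-D regression in Python. This is illustrative only: the actual tool uses R's gbm package, and all names and data below are invented for the example. Each stump is fit to the residuals (the errors) left by the chain so far, and its contribution is scaled down by a learning rate.

```python
# Minimal boosting sketch (illustrative; NOT the gbm implementation).

def fit_stump(xs, residuals):
    """Find the depth-1 tree (single split) minimizing squared error."""
    best = None
    for threshold in xs:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv

def boost(xs, ys, n_trees=50, shrinkage=0.1):
    """Each successive stump targets the residuals of the previous chain."""
    base = sum(ys) / len(ys)
    stumps = []
    def predict(x):
        return base + sum(shrinkage * s(x) for s in stumps)
    for _ in range(n_trees):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, 1, 9, 9, 9]   # a step function: easy work for stumps
model = boost(xs, ys)
```

After 50 shrunken stumps the residual has decayed geometrically, so the chain's predictions land close to the true step values.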
Since the Gradient Boosting Machine uses gradient descent, it needs a learning rate configured for it, like other gradient descent algorithms. For GBM, the learning rate is called the "shrinkage factor."

Gradient boosting (like other boosting algorithms) is resistant to overfitting. Overfitting occurs when an algorithm learns too much about the specifics of the data records it trains on, and consequently fails to generalize to new, previously unseen data at prediction time. This is a common problem with many algorithms, but less so with gradient boosting.

To understand what overfitting is, look at the two sets of graphs below. The blue dots represent the training data points, and the red lines represent the model function inferred from the training data that will be used to predict on new data. The top two graphs show two possible models based on the same training data, while the bottom graphs show how the same models perform on test data. In each set, the left model is what we refer to as a "Generalized" model. The red line (the model) serves as a trend line, placed to run through the middle of the training points on the scatterplot. It is a simple straight line defined by a first-order linear equation. The Generalized model exactly predicts only 2 of the 16 training points (shown by the line passing through those 2 points), but its placement minimizes the overall error (the distance between the other points and the line). In contrast, the Overfit model on the right is more complicated, defined by higher-order equations. The Overfit model better predicts the training data, exactly predicting 7 of the 16 training points.
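The generalized-versus-overfit contrast can be reproduced with a tiny numeric experiment: a straight-line fit versus a model that simply memorizes the training points (a nearest-neighbor lookup, standing in for the wiggly higher-order curve). All data here is made up for illustration.

```python
# Training data: a straight trend y = 2x plus fixed alternating "noise".
train_x = list(range(10))
train_y = [2 * x + (0.5 if x % 2 == 0 else -0.5) for x in train_x]

# Generalized model: least-squares straight line.
n = len(train_x)
mx, my = sum(train_x) / n, sum(train_y) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
         / sum((x - mx) ** 2 for x in train_x))
intercept = my - slope * mx
linear = lambda x: slope * x + intercept

# "Overfit" model: memorize the training points, predict the nearest one.
def memorizer(x):
    nearest = min(train_x, key=lambda t: abs(t - x))
    return train_y[nearest]          # train_x values double as list indices here

def mean_abs_err(model, xs, ys):
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)

# On the training data the memorizer looks perfect...
train_err_linear = mean_abs_err(linear, train_x, train_y)
train_err_memo = mean_abs_err(memorizer, train_x, train_y)

# ...but on unseen, noise-free points from the true trend it does worse.
test_x = [x + 0.5 for x in range(9)]
test_y = [2 * x for x in test_x]
test_err_linear = mean_abs_err(linear, test_x, test_y)
test_err_memo = mean_abs_err(memorizer, test_x, test_y)
```

The memorizer achieves zero training error, yet the simple line predicts the unseen points better, which is exactly the pattern the graphs above describe.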
We might be tempted to think that the Overfit model is better because it achieves higher accuracy when predicting the training data. However, when the models predict on the previously unseen test data, the accuracy of the Overfit model underperforms that of the Generalized model, because it does not reflect the true trend of the data.

Gradient boosting is flexible in the types of target variables (dependent variables) it can handle. It can deal with continuous numeric variables such as temperature, binary variables such as 0 and 1, categorical variables such as Yes and No, and multinomial categorical variables such as "Dog", "Cat", and "Mouse".

The Gradient Boosting Machine (like other boosting algorithms) is also said to be good for imbalanced training datasets. For example, if a training dataset has 50% of its records with a target variable of "Y" and the other 50% with "N", it is considered a perfectly balanced dataset. Balanced datasets are excellent for training, but there are many cases in which the training dataset is by its very nature highly imbalanced. If we are trying to detect credit card fraud, for instance, we would have a historical dataset in which each credit card transaction is a record, and the fraudulent transactions would be a very small percentage of the total. Such imbalance presents a difficulty for many algorithms, such as logistic regression. It is less of a problem for GBM, however, because of the way the algorithm focuses on error records in each tree generation, as described above.

Credit Card Fraud Data

To help illustrate how to use the Alteryx Boosted Model, we will use a credit card fraud detection dataset for model training. Kaggle has a sizeable dataset available, which you can download here.
This particular dataset has nearly 285,000 rows, each representing a single credit card transaction. To hide the personal data associated with each transaction, the variables are all obfuscated by their names and normalized. It is very likely these variables are the output of a PCA (Principal Component Analysis) preprocess. PCA is often employed to reduce many variables down to a few independent, regularized (normalized) variables. One benefit of PCA is that the smaller set of output variables is obfuscated and difficult to correlate back to the known input variables (without the output reports of the PCA process). Another benefit is that the output variables have very low collinearity, which makes them better suited for modeling.

The Kaggle credit card fraud dataset has 30 columns. The main variables are named "V1" through "V28". Following these is the "Amount" column, which is the purchase amount, and finally the target variable, labeled "ClassYN". If Class=Y, the record represents a fraudulent transaction; if Class=N, the record is a normal transaction. Of the 284,807 total records, only 492 (0.17%) are fraud records; 99.83% of the records are not fraud, so this is a very imbalanced dataset.

In preparation for modeling, we separated the 285K records into three datasets: one for training, one for testing, and one for final evaluation. This separation into Training, Testing, and Evaluation sets can be done through random assignment, but, especially with imbalanced datasets, it is better to split the data in a semi-random way. This ensures each sub-dataset has a representative sampling of the important types of records from the original dataset [1].
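The exact semi-random method is left to the author (see footnote [1]). One common approach to this general problem, not necessarily the one used in the article, is stratified sampling: deal each class out separately so every split preserves the class ratio. A minimal sketch with invented toy data:

```python
from collections import defaultdict

def stratified_split(records, label_of, n_splits=3):
    """Deal each class round-robin so every split keeps the class ratio.
    (In practice you would shuffle within each class first.)"""
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)
    splits = [[] for _ in range(n_splits)]
    for recs in by_class.values():
        for i, rec in enumerate(recs):
            splits[i % n_splits].append(rec)
    return splits

# toy imbalanced data: 6 fraud ("Y") records among 300 total
data = [("Y", i) for i in range(6)] + [("N", i) for i in range(294)]
train, test1, test2 = stratified_split(data, label_of=lambda r: r[0])
```

Each of the three splits ends up with the same 2% fraud rate as the whole, which a purely random split does not guarantee when the minority class is tiny.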
Once the full dataset is split evenly into three datasets, a t-test (test of means) is performed on the target variable and the most important of the other variables ("V1" – "V28") to ensure that each dataset is statistically similar to the original.

The Alteryx Workflow

Below is a screenshot of the credit card fraud detection model and workflow created for this blog. You can download a copy of it at the Alteryx Public Gallery here.

The workflow is broken into three columns: the Input Data column on the far left, the colored processing sections in the middle, and the Output Data column on the far right. The data flows from left to right, Inputs to Outputs. The processing happens in three sections: the pink section is where the Boosted Model gets trained; in the green section, the model is tested using Test Dataset 1; and the yellow section is where the final model evaluation is performed using Test Dataset 2.

Before the Training Data is used in step one of processing, we decided to transform the Amount field. Transforming variables can help the modeling process. Because the Amount field spans from $0 to $25,000, a scale significantly different from that of the variables V1 through V28, it is best to either normalize the field or transform it to better correlate with the target variable. In this case, we decided on a logarithmic transformation. Specifically, the new variable "3LogAmount" is 3*Log(Amount). This transformation brings the scale to the same degree as the other variables, and the new variable better correlates with the target variable (using Pearson correlation). Because we transformed Amount for Training, we must also transform it for Testing and Evaluation.

Configuring the Alteryx Boosted Model

Required Parameters Menu

To configure the Alteryx Boosted Model tool, click on it in the workflow.
This is what you will see in the Configuration Pane.

On the "Required parameters" tab, you can choose a Model Name. You can choose any name you want, but a good practice is to include information about how the model was configured.

Next, choose the target field. This is the field we are trying to predict; in our dataset it is "ClassYN".

The next section shows a list of checkboxes for indicating the predictor fields. We check every field except Amount and ClassYN. Since we transformed the Amount field into 3LogAmount, we check 3LogAmount but not Amount. It is also very important that we do not check the ClassYN field, because it is the dependent variable, not an independent variable.

The next option is whether to use sampling weights. To do so, include a field in the data with the numeric weight for each record. For example, if there is a set of records you want to be 10 times more likely to be used than the rest, those records get a weight of 10 and the rest get a weight of 1. When configuring the tool, choose the field in the dataset that holds the weights. In our case, we do not need weights.

The next option is whether to include marginal effect plots. These plots isolate a single variable to inspect which values of that variable are most predictive of the target variable. They are helpful in determining how well a variable will serve as a predictor.

Model Customization Menu

Next, to further customize the model, choose the Model customization tab.

If you click the checkbox "Specify target type and the loss function distribution", you can choose what type of variable the target is, and then its loss function. A loss function essentially tells the algorithm how to penalize errored predictions.
Continuous target:

If the target variable is continuous, there are three loss functions available. A Gaussian (squared error loss) function is very good to use when the prediction errors are large in value; because the error amount is squared, large errors are penalized much more heavily. A Laplace (absolute value loss) function penalizes errors linearly in their amount. It is a faster loss function, but it does not work as well as Gaussian when the error amounts are large. The t-distribution loss function (AKA Student's t loss function) penalizes errors even less than the Laplace loss function. If you choose the t-distribution, you will need to enter the number of predictive variables in the "Degrees of Freedom" field.

Count (integer) target:

If your target variable is a count of some kind (e.g. the number of sneezes a person has during allergy season), the Boosted Model tool will automatically choose the Poisson loss function.

Binary (two outcomes) categorical:

If your target variable is binary categorical, you have a choice of two loss functions: Bernoulli (logistic regression) or AdaBoost (exponential loss). The Bernoulli loss function is less steep than AdaBoost for large errors and thus penalizes such errors less; for small errors, however, Bernoulli penalizes more than AdaBoost. If your errors are large, use AdaBoost; if your errors are small, use Bernoulli.

Multinomial (three or more outcomes) categorical:

If you choose a multinomial categorical target (three or more target values), the tool uses a loss function geared specifically for a target variable with three or more values.

Since our credit card fraud use case has a binary categorical target ("ClassYN"), we experimented with both the Bernoulli and the AdaBoost loss functions, because we did not know in advance whether the errors would be large or small.
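The relative steepness of the two binary loss functions can be checked numerically. The sketch below uses the standard margin formulation (margin = y·f with y in {−1, +1}), which is a simplification rather than gbm's exact parameterization:

```python
import math

def bernoulli_loss(margin):   # logistic (binomial deviance) shape
    return math.log(1 + math.exp(-margin))

def adaboost_loss(margin):    # exponential loss
    return math.exp(-margin)

# For a badly wrong prediction (large negative margin) the exponential
# loss dwarfs the logistic loss; for a mildly wrong one they stay close.
ratio_large_error = adaboost_loss(-4) / bernoulli_loss(-4)
ratio_small_error = adaboost_loss(-0.5) / bernoulli_loss(-0.5)
```

The exponential loss grows much faster than the logistic loss as predictions get badly wrong, which is the "steeper for large errors" behavior described above.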
The maximum number of trees in the model:

The next customization parameter is the maximum number of trees. We want the minimum number of trees that gives the best balance between processing time and error, as measured on the Testing and Evaluation datasets. Typically, there is a number of trees beyond which the quality of the model (measured by AUC, Precision/Recall, or Accuracy) plateaus. In the example graph below, generated on the R anchor of the Boosted Model Tool, the X axis (titled "Iteration") is the number of trees created, and the Y axis (titled "AdaBoost exponential bound") is the amount of prediction error at that number of trees. In this example, we used an AdaBoost loss function with cross validation. There are two key points identified by the arrows in the graph.

The blue arrow identifies the "knee" of the graph: an inflection point where the error no longer reduces significantly as the number of trees increases. In this example, it is at about 2000 trees.

The yellow arrow identifies the "gone too far" point, where the error actually begins to increase as the number of trees increases. In this example, it is at about 6000 trees.

The maximum number of trees should fall between these two points. The closer to the yellow-arrow point, the longer it takes to create the model, but the lower the error; the closer to the blue-arrow point, the less time it takes, but the error will be slightly higher. We chose right in the middle, at 4000 trees.

Understand that the only way to view this graph is to run the model. So, to choose the optimal value for the maximum number of trees, you may have to experiment by running the model with different loss functions and different values of the other parameters.
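Reading the two arrow points off the error curve can also be automated. This sketch uses a toy error series and a hypothetical 1% improvement threshold (both invented for illustration) to locate the "knee" and the "gone too far" minimum:

```python
def knee_and_min(errors, tol=0.01):
    """Given per-iteration error, find the 'knee' (first iteration where the
    relative improvement drops below tol) and the error minimum, after
    which more trees only make things worse."""
    knee = next(i for i in range(1, len(errors))
                if (errors[i - 1] - errors[i]) / errors[i - 1] < tol)
    best = min(range(len(errors)), key=lambda i: errors[i])
    return knee, best

# toy error curve: fast drop, plateau, then a slow rise (overfitting)
errors = [1.0, 0.5, 0.3, 0.2, 0.15, 0.14, 0.139, 0.138, 0.14, 0.15]
knee, best = knee_and_min(errors)
```

Per the guidance above, the maximum-trees setting would then be chosen somewhere between `knee` and `best`.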
Method to determine the final number of trees in the model:

The next parameter we need to set with respect to the number of trees is the method to determine the final number of trees in the model. The three options are cross validation, test (validation) sample, and out-of-bag. For the credit card fraud detection use case, we experimented with each method.

Cross validation:

Of the three options, the one that is consistently the best is cross validation. A recommended number of cross validation folds is between 5 and 10; if the data is very noisy, you may want to increase the number of CV folds further, to 20. Cross validation has the benefit of preventing overfitting. Unfortunately, it also takes a lot of processing power, which is why you can choose the number of CPU cores to dedicate to it. Using more cores than the number of folds does not help, while keeping it at one core will take the longest to process. For instance, if you have 5 CV folds and 8 CPU cores on your machine, you can probably raise the number of cores to 4 without significantly slowing down any processing happening concurrently outside of Alteryx.

Test (validation) sample:

Using a test (validation) sample simply tells the tool to separate out a percentage of the training records (which you set) to be used for internal testing. This method is much faster than cross validation, but not as good.

Out-of-bag:

This method is based on the algorithm's "out-of-bag" estimator, which typically does not reach the optimal number of trees. It is faster than cross validation and test sample, but it usually does not yield the most accurate model.

The fraction of observations used in the out-of-bag sample:

The next customization parameter is "The fraction of observations used in the out-of-bag sample" (AKA the "bag fraction").
This is the percentage of training records pulled and used in each iteration of tree creation. This pulling of records out of the bag of training records is random whenever the bag fraction is less than one. Consequently, if the random seed value is set to zero, every time the model is retrained with the same data and the same parameters, the resulting model will be slightly different. If we instead set the seed to any fixed number (e.g. 1) and keep it there, then every run of the workflow with the same input data will build the same model.

For the credit card fraud detection use case, we experimented with different fractions of observations to determine the optimal value.

Shrinkage:

The shrinkage value is also called the learning rate. As a general rule, the smaller the learning rate, the more accurate the model will be, although exceptions to this rule are sometimes encountered. Also as a general rule, the smaller the learning rate, the longer it takes to reach an optimal model.

For the credit card fraud detection use case, we experimented with different shrinkage values to find the optimal value.

Interaction depth:

The interaction depth relates to how big each tree should be. There is a lot of debate on how many tree nodes and tree leaves are formed for each value of interaction depth. What is not debated is that the greater the number, the larger each tree will be. The default value is 1, which is also the value that most strongly prevents overfitting.

For the credit card fraud detection use case, we experimented with values 1 through 5 on the interaction depth to determine the best value.
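The interaction of the bag fraction and the random seed can be sketched with Python's random module standing in for gbm's internal sampler (the function and numbers below are invented for illustration):

```python
import random

def draw_bag(n_records, bag_fraction, seed):
    """Draw the subsample used for one tree-building iteration."""
    rng = random.Random(seed)          # a fixed seed makes the draw repeatable
    k = int(n_records * bag_fraction)
    return rng.sample(range(n_records), k)   # sample without replacement

first_run = draw_bag(1000, 0.3, seed=1)
second_run = draw_bag(1000, 0.3, seed=1)    # same seed: identical bag
```

With the same seed the two runs draw identical bags, which is why fixing the seed makes retraining reproducible even though only 30% of the records feed each tree.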
Minimum required number of objects in each tree node:

The next parameter is the minimum required number of objects in each tree node: the minimum number of training records that must land in each terminal node of a tree for the model to use that tree. When creating a tree, the algorithm splits a node into right and left children, but before making the split official, the data must supply at least this minimum number of records for each new node. A higher value reduces the number of splits that can happen, and possibly the number of trees. The higher the number, the more it generalizes the model and prevents overfitting. The default value is 10; the best way to find the optimal value is to experiment.

For the credit card fraud detection use case, we experimented with values 1 through 100 (the maximum possible value).

Random seed value:

Finally, the last customization parameter is the random seed value. If it is set to 0, the output model of each training run will be slightly different (even with the exact same input data). If you instead set the seed to any number greater than zero, all runs with that same seed value (and the same input data and customization parameters) will yield the same output model. This is important for repeatability and testing.

Credit Card Fraud Detection Use Case

With the credit card fraud data, we ran several experiments to determine the best set of customization hyper-parameters. The first set of experiments examined the type of loss function together with the method to determine the final number of trees. The two choices of loss function are Bernoulli and AdaBoost, and the three choices of method are cross validation (5 folds), Test Sample (50%), and Out-Of-Bag. The results are below.

Model Measurements

Confusion Matrix:

The first column is the model description.
The next four columns are the "confusion matrix" values:

TP (True Positive): we predicted the record as fraud, and it truly was fraud.
FP (False Positive): we predicted the record as fraud, but it was not actually fraud.
FN (False Negative): we predicted the record as not fraud, but in reality it was fraud.
TN (True Negative): we predicted the record as not fraud, and it truly was not fraud.

Our main focus with the confusion matrix numbers lies with the False Positive and False Negative values: we want these to be as low as possible, because they carry actual costs to the company.

Please note that the confusion matrix is not an output of the Boosted Model Tool. To create it, we built a tool container titled "Post Processing Evaluation of Scored Data" containing the logic for the confusion matrix along with the Optimal Cutoff, TotalLoss, Precision, Recall, and Accuracy fields. You will find the tool container in the green and yellow processing sections on the right side; if you open it, you can easily follow the logic.

Here is a good example to help you understand the four confusion matrix metrics:

Optimal Cutoff and TotalLoss:

After the model is created, it is fed directly into a Score tool along with the Test records. The output of the Score tool is a Score_Y field containing the probability that the record is fraudulent. By default, if the probability is 0.50 or above, we would classify the transaction as fraud. However, by taking into account the actual costs of False Positives and False Negatives, we can calculate an optimal cutoff value that minimizes costs.

For example, if we inaccurately predict a transaction as fraudulent, the company could take an action that upsets a customer, such as blocking the transaction and perhaps even freezing the card. This could cause the customer undue stress, and some will take their business elsewhere.
This is the potential loss due to a False Positive. If a card has truly fraudulent transactions and yet we predict that the transactions are not fraud, we will likely lose even more money; this loss results from a False Negative. In our use case, we assigned the loss for a False Positive as $100 and the loss for a False Negative as $400. Using these values, we calculate the Optimal Cutoff score and the TotalLoss in dollars. We want the Optimal Cutoff score to be as high as possible and the TotalLoss to be as small as possible, to make sure our model is robust.

The logic for this calculation is in the tool container. There is a Text Input tool where you can set the costs for each category of classification (TP, FP, FN, and TN).

Precision, Recall and Accuracy:

These are common measurements that can be applied to models based on the results of the test data. For all of them, the higher the number, the better; a perfect score for each would be 1.00, although in practice that is almost never achieved.

Precision is the proportion of correct positive classifications (predictions) out of all cases predicted as positive. When False Positives rise, Precision declines. If our focus were maximizing fraud detection with a concern about False Positives and little concern about False Negatives, Precision would be a good measurement to use.

Recall is the proportion of correct positive classifications out of all cases that are actually positive, which is TP plus FN. When False Negatives rise, Recall declines. If our focus is maximizing detection of all fraud, with False Positives a second priority, Recall is the measurement to use.

Accuracy is the proportion of correct classifications (true positives and true negatives) out of the overall number of cases.
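The post-processing container's logic can be approximated in a few lines. The $100/$400 costs are the article's; the scores and labels below are toy values invented for the sketch:

```python
def confusion(scores, labels, cutoff):
    """Counts of TP, FP, FN, TN at a given probability cutoff."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y)
    tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and not y)
    return tp, fp, fn, tn

def total_loss(scores, labels, cutoff, fp_cost=100, fn_cost=400):
    """Dollar cost of the mistakes made at this cutoff."""
    _, fp, fn, _ = confusion(scores, labels, cutoff)
    return fp * fp_cost + fn * fn_cost

def metrics(scores, labels, cutoff):
    tp, fp, fn, tn = confusion(scores, labels, cutoff)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# toy Score_Y outputs and true fraud flags
scores = [0.9, 0.8, 0.45, 0.4, 0.2, 0.1]
labels = [True, True, True, False, False, False]

# sweep every observed score as a candidate cutoff; keep the cheapest
best_loss, best_cutoff = min((total_loss(scores, labels, c), c) for c in scores)
```

On this toy data the cheapest cutoff is well below the default 0.50, which is the point of computing an optimal cutoff rather than assuming one.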
When we want to simultaneously measure how well our model correctly classifies "fraud" and "not fraud" transactions, Accuracy is a good measurement to use. However, when we have a highly imbalanced training dataset (as we do here), Accuracy is not a good measurement. For example, imagine the model simply predicts every record it sees as not fraud. With 492 fraud transactions and 47,211 normal transactions, accuracy would still calculate to 98.9%, yet the model would not detect any fraud. The model would be useless, but still very accurate, simply by chance. For this reason, with a highly imbalanced dataset, we normally do not use Accuracy as a primary measure.

AUC (Area Under the Curve) and Gini Coefficient:

AUC: The Area Under the Curve is the total area under the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (AKA Recall) against the False Positive Rate (FPR, which equals 1 − Specificity). It is a good measurement because it represents the degree of separability between the 'fraud' and 'not fraud' classes, telling us how capable the model is of distinguishing between them. A perfect AUC would equal 1.00. In Alteryx, the tool we use to generate the ROC curve and measure the AUC (the area under the ROC curve) is the Lift Chart tool, found in the Predictive suite of tools. The Lift Chart tool can measure the AUC for one or many models, allowing us to conveniently compare the graphs and final measures for each model.

The output graph below is an actual output from our experiments to determine the best set of model customization parameters. The ROC graph shows six ROC curves (one for each of six different models). Through the middle of the graph runs the diagonal line. Any model that is working "better than random" will have a ROC curve above and to the left of the diagonal line.
In this graph, all six models have curves indicating they are all "better than random". Typically, the curve farthest above and to the left of the diagonal line has the highest AUC value and belongs to the best model; equivalently, the curve closest to the upper-left corner of the graph typically marks the best model. In the graph below, the curve closest to the upper-left corner is Adaboost_Cross, followed by Adaboost_Out_of_the_Bag, and third, Adaboost_Test_Sample. Note that all three Adaboost curves are closer to the upper-left corner than all three Bernoulli curves (Bernoulli_Cross is hidden behind Bernoulli_Out_of_the_Bag). In the table below the graph, we see that the Adaboost_Cross model has the highest AUC, confirming that it is the best of the six models. The worst would be Bernoulli_Cross or Bernoulli_Out_of_the_Bag.

Gini Coefficient: The Gini Coefficient is found in the table below the ROC curve, on the far right. It is related to the AUC calculation and, like the AUC value, is scaled from 0 to 1, where 0 indicates no ability to separate the classes and 1 indicates perfect separability.

Which measurement shall we use?

For use cases where the data is highly imbalanced and the target variable is binary, the best measurement to use is the AUC (Area Under the Curve). In our experiments, we looked at all the measurements and gave decision preference to AUC, Optimal Cutoff, and TotalLoss.

What experiments were run?

We ran various experiments to determine the optimal customization parameters for the Boosted Model. In all cases, we used the Test Data to determine these parameters.
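AUC can also be computed without plotting at all, as the probability that a randomly chosen fraud record outscores a randomly chosen non-fraud record (the Mann-Whitney formulation); the ROC-based Gini coefficient is then commonly taken as 2·AUC − 1, which matches the 0-to-1 scaling described above. A sketch with invented toy scores:

```python
def auc(scores, labels):
    """P(random positive outscores random negative); ties count half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(scores, labels):
    return 2 * auc(scores, labels) - 1

# a model that separates the two classes perfectly
perfect = auc([0.9, 0.8, 0.2, 0.1], [True, True, False, False])

# a useless model that scores every record identically (i.e. "nothing is
# fraud"): high accuracy on imbalanced data, but AUC stays at 0.5
useless = auc([0.1] * 5, [True, False, False, False, False])
```

This is why AUC is preferred over Accuracy here: the do-nothing model's imbalance-inflated accuracy cannot fool a ranking-based measure.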
Here are the winners for our credit card fraud detection use case:

Loss function: AdaBoost
The maximum number of trees in the model: 4000
Method to determine the final number of trees in the model: Cross validation (5 fold)
The fraction of observations used in the out-of-bag sample: 30%
Shrinkage: 0.0035
Interaction depth: 1
Minimum required number of objects in each tree node: 100

Final experiment results:

We ran a final experiment using the Evaluation Data against the top two contending models and the default model (the model generated when we use all the default parameters). Here are the default parameters:

Default: Model_Default_Bernoulli_Cross_4000_20_1_10_50
Loss function: Bernoulli
The maximum number of trees in the model: 4000
Method to determine the final number of trees in the model: Cross validation (5 fold)
The fraction of observations used in the out-of-bag sample: 50%
Shrinkage: 0.0020
Interaction depth: 1
Minimum required number of objects in each tree node: 10

Our two contending models were:

Model_Adaboost_Cross_4000_35_1_100_30
Loss function: AdaBoost
The maximum number of trees in the model: 4000
Method to determine the final number of trees in the model: Cross validation (5 fold)
The fraction of observations used in the out-of-bag sample: 30%
Shrinkage: 0.0035
Interaction depth: 1
Minimum required number of objects in each tree node: 100

Model_Adaboost_Test_4000_35_1_100_30
Loss function: AdaBoost
The maximum number of trees in the model: 4000
Method to determine the final number of trees in the model: Test Sample (50%)
The fraction of observations used in the out-of-bag sample: 30%
Shrinkage: 0.0035
Interaction depth: 1
Minimum required number of objects in each tree node: 100

Our results from the three contending models are below:

Looking at the results, we see that the first model (Default) is not bad. It has a high True Positive value and the lowest TotalLoss value.
However, it also has the lowest AUC value, which is not good. The second and third models had better AUC values, although their TotalLoss values were greater in magnitude. The second model has the highest Optimal Cutoff value, which indicates a stronger model.

One last thing we can look at when comparing the models is the ROC curves. Below is a combined chart with the ROC curves of the three contending models.

From the chart, we see that the ROC curve farthest above and to the left of the diagonal belongs to the Model_Adaboost_Cross_4000_35_1_100_30 model. These curves confirm the AUC values in the table. This is the winning model because it has the highest AUC value and the highest Optimal Cutoff value.

Conclusion:

The Boosted Model Tool in Alteryx is a very powerful predictive analytics tool. It is one of the more complex tools in the tool chest, but in difficult use cases it proves to be very valuable. This credit card fraud detection use case is especially difficult because the data is highly imbalanced.

The Boosted Model Tool uses the gbm package in R. This algorithm is resistant to over-fitting, and it works well with a variety of target variable types.

If you decide to use the Boosted Model Tool, you will find that the default values provide decent results. If you want to improve upon those results, experiment with the customization parameters. This will add time to the design and modeling process, but it can produce significantly better results.
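If you want to experiment along these lines outside of Designer, the winning configuration maps fairly cleanly onto the knobs most gradient-boosting libraries expose. The sketch below restates it using scikit-learn's parameter names; the correspondence (especially treating a 30% out-of-bag fraction as subsample=0.7) is our own assumption, not an Alteryx-documented mapping.

```python
# Winning Boosted Model settings, restated with scikit-learn-style names.
winning_params = {
    "loss": "exponential",    # sklearn's name for an AdaBoost-style loss
    "n_estimators": 4000,     # maximum number of trees
    "learning_rate": 0.0035,  # shrinkage
    "max_depth": 1,           # interaction depth
    "min_samples_leaf": 100,  # minimum objects per tree node
    "subsample": 0.7,         # 30% held out-of-bag -> 70% used per tree
}
# e.g. GradientBoostingClassifier(**winning_params), with the final tree
# count chosen by 5-fold cross validation rather than fixed at 4000.
print(winning_params["learning_rate"])
```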
References:

A Gentle Introduction to Gradient Boosting
Boosting with AdaBoost and Gradient Boosting
Boosting and Logistic Regression
A Short Introduction to Boosting
ADABOOST
Quick Introduction to Boosting Algorithms in Machine Learning
Beyond Accuracy: Precision and Recall
Understanding the AUC – ROC Curve
Useful properties of ROC curves, AUC scoring, and Gini Coefficients

[1] How to do this semi-random separation is beyond the scope of this article, but if you would like to know how best to split an imbalanced dataset, please contact the author through the Alteryx Community.

By now, you should have expert-level proficiency with the Boosted Model Tool! If you can think of a use case we left out, feel free to use the comments section below! Consider yourself a Tool Master already? Let us know at community@alteryx.com if you'd like your creative tool uses to be featured in the Tool Mastery Series.

Stay tuned with our latest posts every #ToolTuesday by following @alteryx on Twitter! If you want to master all the Designer tools, consider subscribing for email notifications.
View full article
The Select Tool within the Alteryx Designer is the equivalent of your high school sweetheart: always there when you needed them, and helping you find out more about yourself. The Select Tool does exactly this by showing you the data type and structure of your data, and it also gives you the flexibility to change aspects of your dataset.
View full article
The Dynamic Rename Tool is part of the Developer category of tools. It allows the user to quickly rename any or all fields within an input stream using several different methods. The user has the option to rename only certain fields, all fields, or even dynamic/unknown fields at runtime (e.g., after a Cross Tab Tool). The options for renaming fields are:
View full article
Now, everyone loves to talk about their workflow masterpieces; however, the notion of documenting them doesn't always hold the same appeal. The Comment Tool is happy to help, and can make documenting and structuring your workflow easy and straightforward. Whether you want to include images or text descriptions, or categorize parts of your workflow, the Comment Tool can act as a great refresher when revisiting workflows and make it far more intuitive for colleagues to follow along. This article provides a few examples of how.
View full article
One of the most underrated tool groups, the Documentation category is full of gems that can help you improve the aesthetics, organization, and shareability of your workflows. Chief among them is the Tool Container Tool: in addition to improving aesthetics, this tool can disable all the tools within it, lending itself to a handful of functional applications as well:
View full article
Inside the Laboratory tool set you'll find the Basic Data Profile Tool. This tool is similar to the Field Summary Tool in that it provides information about each field within your data, such as length, type, source, shortest and longest values, and more. It differs from the Field Summary Tool, however, when you get to the missing-data details. The Field Summary Tool gives you a single value for Percent Missing but makes no distinction between whether that percentage represents Null or Empty values. The Basic Data Profile Tool gives you a count of records that have Null values and a count of records that are blank.
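The Null-versus-Empty distinction is easy to see in miniature. Here is a hedged Python sketch of the idea (with None standing in for Null and "" for Empty; this is an illustration, not the tool's implementation):

```python
def missing_counts(values):
    """Count Nulls (None) and Empties ("") separately, the way the
    Basic Data Profile Tool reports them, instead of folding both
    into a single Percent Missing figure."""
    nulls = sum(v is None for v in values)
    empties = sum(v == "" for v in values)
    pct_missing = 100 * (nulls + empties) / len(values)
    return nulls, empties, pct_missing

city = ["Denver", None, "", "Boulder", None]
print(missing_counts(city))  # -> (2, 1, 60.0)
```

A single Percent Missing of 60% would hide the fact that two records are Null while only one is blank.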
View full article
This article is part of the Tool Mastery Series, a compilation of Knowledge Base contributions to introduce diverse working examples for Designer Tools. Here we’ll delve into uses of the Pearson Correlation Tool on our way to mastering the Alteryx Designer.
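Under the hood, the Pearson correlation coefficient r is simply the covariance of the two fields divided by the product of their standard deviations. A short Python sketch of the formula (for illustration only; the tool computes this for you):

```python
import math

def pearson_r(xs, ys):
    """r = cov(x, y) / (std(x) * std(y)), computed from the raw sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfectly linear relationship yields r close to 1.0
# (exact up to floating-point rounding).
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```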
View full article
First added to Alteryx Designer in version 10.6, the Optimization Tool is a member of the Prescriptive Tools (included with the Predictive Tools installation) and allows you to solve optimization problems. Mathematical Optimization is the selection of the best possible option(s), given a set of alternatives and a selection criterion. In this Tool Mastery, we will review the inputs, configuration, and outputs of the Optimization Tool.
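That definition, a set of alternatives plus a selection criterion, fits in a few lines of code. Here is a deliberately tiny Python sketch of the idea (our toy example; the Optimization Tool itself solves full linear and mixed-integer programs):

```python
# Alternatives: candidate shipping plans; constraint: at most 100 units;
# selection criterion: minimize cost per unit shipped.
plans = [
    {"name": "A", "units": 90,  "cost": 450},
    {"name": "B", "units": 100, "cost": 480},
    {"name": "C", "units": 120, "cost": 500},  # infeasible: over capacity
]

feasible = [p for p in plans if p["units"] <= 100]          # the alternatives
best = min(feasible, key=lambda p: p["cost"] / p["units"])  # the criterion
print(best["name"])  # -> B
```

Plan B wins: it is feasible and ships at 4.80 per unit versus plan A's 5.00.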
View full article
The Directory Tool gives you a data-stream input that contains information about the files and folders (file name, file date, last modified, etc.) for the location of your choice, which you can then use for more complex interactions with the file system. Basically, the Directory Tool could also finally help me track down my keys: not just where I put them in the house, but also how long they've been there, and when they were last moved.
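Outside of Designer, the standard library of most languages can produce a similar data stream. The Python sketch below loosely mirrors the Directory Tool's output; the field names here are our own shorthand, not the tool's exact schema:

```python
import os
import datetime

def directory_listing(path):
    """Return one record per file in `path`: name, size, and last-modified
    timestamp, loosely mirroring the Directory Tool's output stream."""
    records = []
    for entry in os.scandir(path):
        if entry.is_file():
            info = entry.stat()
            records.append({
                "FileName": entry.name,
                "Size": info.st_size,
                "LastModified": datetime.datetime.fromtimestamp(info.st_mtime),
            })
    return records

for rec in directory_listing("."):
    print(rec["FileName"], rec["Size"], rec["LastModified"])
```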
View full article
For most tools that already have “dynamic” in the name, it would be redundant to call them one of the most dynamic tools in the Designer. That’s not the case for Dynamic Input. With basic configuration, the Dynamic Input Tool  allows you to specify a template (this can be a file or database table) and input any number of tables that match that template format (shape/schema) by reading in a list of other sources or modifying SQL queries. This is especially useful for periodic data sets, but the use of the tool goes far beyond its basic configuration. To aid in your data blending, we’ve gone ahead and cataloged a handful of uses that make the Dynamic Input Tool so versatile:
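The core pattern, one template schema applied to many sources, can be sketched outside Designer as well. The hedged Python illustration below (using the csv module; this is not how Dynamic Input is implemented) stacks every source whose header matches the template and rejects any that do not:

```python
import csv
import io

TEMPLATE_FIELDS = ["Region", "Sales"]  # the "template" schema

def read_matching(sources):
    """Stack rows from every source whose header matches the template,
    the way Dynamic Input unions inputs that share one schema."""
    rows = []
    for src in sources:
        reader = csv.DictReader(src)
        if reader.fieldnames != TEMPLATE_FIELDS:
            raise ValueError("source does not match the template schema")
        rows.extend(reader)
    return rows

# Two periodic extracts with the same shape, stood in for by strings.
jan = io.StringIO("Region,Sales\nWest,100\nEast,80\n")
feb = io.StringIO("Region,Sales\nWest,120\nEast,90\n")
print(read_matching([jan, feb]))
```

In practice the list of sources would itself come from a data stream, such as the Directory Tool's file listing, which is what makes the pattern dynamic.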
View full article
Once you have started a workflow within Alteryx, it's hard to think about leaving! However, if you feel that time has come, Alteryx makes the transition easier than Michael Phelps winning gold in the Olympics. The Output Data Tool is used to write the results of your workflow to any supported database or file format; Alteryx also offers the opportunity to output directly to Tableau Server and Power BI. Using the Output Data Tool, you can:
View full article
Introduced in Alteryx Designer 2018.3, the Insight tool can be used to build multiple interactive charts and combine them into an interactive dashboard, allowing you to clearly communicate your analysis and data insights. This article will review many of the features of the Insight tool and how to use them. With this article, I hope you feel empowered to take on your Visualytics adventures head-first.
View full article
The Date Time Now Tool is part of the Input tool category, and it is actually a macro encapsulating other Alteryx tools. To use it, only one selection needs to be made: an output format. That's it; then you can go about your business. You also have the option to output the time along with the date.
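Conceptually, the tool does little more than format the current timestamp, much like this quick Python analogue (the format strings here are illustrative, not the tool's actual format list):

```python
import datetime

now = datetime.datetime.now()
print(now.strftime("%Y-%m-%d"))           # date only, e.g. 2019-06-18
print(now.strftime("%Y-%m-%d %H:%M:%S"))  # optionally include the time
```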
View full article