Alteryx Knowledge Base

Definitive answers from Designer experts.
Platform Product: Spatial Issues – Working with Alteryx Customer Support Engineers (CSEs) (for use by CSEs and Alteryx Customers)

To expedite the resolution of your case, please include the information below.

Spatial – Requested Information:
*** Suggestion: copy/paste the questions below and email the supporting documentation to support@alteryx.com

1. Detailed description of the issue
2. Screenshot of Alteryx version
3. Screenshot of the error or exact text of the error
4. Spatial file type
5. A copy of your workflow (*.yxmd or *.yxzp)
6. Have there been any changes made recently (update to Alteryx, Server, etc.)?

Spatial – Requested Information (Detailed Instructions):

1. Detailed description of the issue – What issues are you having? Has it worked in the past? When did the issue start? Are all users affected or just some? What are the steps to reproduce the issue? What have you tried to resolve it? Have you searched the Alteryx Community?

2. Screenshot of Alteryx version – Our CSEs need to know the precise version of Alteryx so we can replicate any issues. In Alteryx, click Help > About and provide a screenshot. The screenshot will show whether it is Server or Designer, and whether it is “Running Elevated” (Admin vs. Non-Admin).

3. Screenshot of the error or exact text of the error – Press CTRL+Print Screen to capture the error and paste it into your e-mail. Also include where the error was encountered – Gallery, Designer, Scheduler? Note: You may wish to Google the error text to research the issue. The Knowledge Base is also a great place to search the error text.

4. Spatial file type – What type of file are you working with? How many records?

5. A copy of your workflow (*.yxmd or *.yxzp) and sample data if possible – Either create a .yxzp that includes macros and data by clicking Options > Export Workflow, or include the workflow *.yxmd and sample data if possible.

6. Have there been any changes made recently (update to Alteryx Designer, Server, etc.)? What do you think may have caused the issue? What have you done to try to resolve it?

Suggested links: Spatial Functions
View full article
Question
If I have a list of dates, how can I find the date of the next Monday for each date?

Answer
Using a simple workflow, you can calculate the next Monday from any date with a single Formula tool, configured as follows:

1. Determine the day of the week for each date:
Day: DateTimeFormat([Sample Dates],"%a")

2. Calculate the number of days to get to the next Monday, based on each day:
AddDays: Switch([Day],Null(),'Mon',7,'Tue',6,'Wed',5,'Thu',4,'Fri',3,'Sat',2,'Sun',1)

3. Add that number of days to each date:
Monday: DateTimeAdd([Sample Dates],[AddDays],"days")

4. Verify that the new date is a Monday:
VerifyNewDay: DateTimeFormat([Monday],"%a")

You can actually do all of this within one formula, save for verifying the day, if you want to get fancy:

Monday: DateTimeAdd([Sample Dates], Switch(DateTimeFormat([Sample Dates],"%a"),Null(),'Mon',7,'Tue',6,'Wed',5,'Thu',4,'Fri',3,'Sat',2,'Sun',1),"days")

Things to consider: This workflow assumes that your dates are already in the Alteryx native date format of "YYYY-MM-DD". If they aren't, please visit another Knowledge Base article, Date Conversions, for tips on how to get your dates there!

This was done in 10.1. Sample workflow attached.

Thanks for tuning in!
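Outside of Designer, the same day-of-week arithmetic can be sanity-checked in a few lines of Python (a minimal sketch; the function name is ours, not part of the workflow):

import datetime as dt

def next_monday(d: dt.date) -> dt.date:
    # weekday(): Mon=0 ... Sun=6, so days ahead = Mon->7, Tue->6, ..., Sun->1,
    # matching the Switch() mapping in the Formula tool above.
    return d + dt.timedelta(days=7 - d.weekday())

print(next_monday(dt.date(2016, 1, 1)))  # a Friday -> 2016-01-04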
View full article
Let's start with the basics of how to create a report map in Alteryx. First, ensure that the layers you want to show in your map have a spatial object field. You can check this by placing a Select tool and confirming that there is a column of type 'SpatialObj.'
View full article
Platform Product: API/Run Command – Working with Alteryx Customer Support Engineers (CSEs) (for use by CSEs and Alteryx Customers)

To expedite the resolution of your case, please include the information below.

API/Run Command – Requested Information
*** Suggestion: copy/paste the questions below and email the supporting documentation to support@alteryx.com

1. Detailed description of the issue
2. Screenshot of Alteryx version
3. Are you licensed for API/SDK?
4. A copy of your workflow (*.yxmd or *.yxzp)
5. Which API is being used? Alteryx Gallery, Alteryx Engine, third party?
6. For third-party APIs, a copy of the tool you are developing
7. What API endpoints are you working with? What payload are you sending to each endpoint? What is the response/result it returns?

API/Run Command – Requested Information (Detailed Instructions):

1. Detailed description of the issue – What issues are you having? Has it worked in the past? When did the issue start? Are all users affected or just some? What are the steps to reproduce the issue? What have you tried to resolve it? Have you searched the Alteryx Community?

2. Screenshot of Alteryx version – Our CSEs need to know the precise version of Alteryx so we can replicate any issues. In Alteryx, click Help > About and provide a screenshot. The screenshot will show whether it is Server or Designer, and whether it is “Running Elevated” (Admin vs. Non-Admin).

3. Are you licensed for API/SDK? – Click Options > Manage Licenses and send a screenshot.

4. A copy of your workflow (*.yxmd or *.yxzp) and sample data if possible – Either create a .yxzp that includes macros and data by clicking Options > Export Workflow, or include the workflow *.yxmd and sample data if possible.

5. Which API is being used? Alteryx Gallery, Alteryx Engine, third party?

6. For third-party APIs, provide a copy of the tool you are developing.

7. What API endpoints are you working with? What payload are you sending to each endpoint? What is the response/result it returns?

Suggested links: Tool Mastery | Run Command; The Run Command: Write Source & Read Results
View full article
Platform Product: Reporting/Visualytics Issues – Working with Alteryx Customer Support Engineers (CSEs) (for use by CSEs and Alteryx Customers)

Reporting/Visualytics – Requested Information
*** Suggestion: copy/paste the questions below and email the supporting documentation to support@alteryx.com

1. Detailed description of the issue
2. Alteryx version
3. Version of browser, if applicable
4. Screenshot of the error or exact text of the error
5. A copy of your workflow (*.yxmd or *.yxzp)

Reporting/Visualytics – Requested Information (Detailed Instructions):

1. Detailed description of the issue – What issues are you having? Has it worked in the past? When did the issue start? Are all users affected or just some? What are the steps to reproduce the issue? What have you tried to resolve it? Have you searched the Alteryx Community?

2. Alteryx version – Our CSEs need to know the precise version of Alteryx so we can replicate any issues. In Alteryx, click Help > About and provide a screenshot. The screenshot will show whether it is Server or Designer, and whether it is “Running Elevated” (Admin vs. Non-Admin).

3. Version of browser (if applicable) – What browser are you using? What version? Chrome (Help > About)? Internet Explorer (Settings gear)?

4. Screenshot of the error or exact text of the error – Press CTRL+Print Screen to capture the error and paste it into your e-mail. Also include where the error was encountered – e.g., Gallery, Designer, Scheduler. Note: You may wish to Google the error text to research the issue. The Knowledge Base is also a great place to search the error text.

5. A copy of your workflow (*.yxmd or *.yxzp) and sample data if possible – Either create a .yxzp that includes macros and data by clicking Options > Export Workflow, or include the workflow *.yxmd and sample data if possible.

Suggested links: Tool Mastery | Insight; Tool Mastery | Report Header; Tool Mastery | Report Footer; Tool Mastery | Report Text; Tool Mastery | Report Map; Add Page Numbers – Reporting
View full article
Platform Product: Database Connection Issues – Working with Alteryx Customer Support Engineers (CSEs) (for use by CSEs and Alteryx Customers)

To expedite the resolution of your case, please include the information below.

Database Connections – Requested Information
*** Suggestion: copy/paste the questions below and email the supporting documentation to support@alteryx.com

1. Detailed description of the issue
2. Is this a production issue? What is the urgency?
3. Have there been any changes made recently (update to Alteryx, Server, etc.)?
4. Alteryx version – Server, Connect, or Designer?
5. Screenshot of the error and the configuration of the Input tool
6. What is your database type and version? Is your database/version supported? http://www.alteryx.com/technical-specifications
7. What is your driver version?
8. Do you have 32-bit or 64-bit?
9. Are you able to connect to your databases with something else, like TOAD, SQL Developer, or Tableau? If you are using ODBC, does the Test button in the ODBC DSN work?
10. A copy of your workflow (*.yxmd or *.yxzp)
11. Have any changes been made recently, such as an update to Alteryx, Server, etc.?
12. Have you successfully connected to the database before? Are your colleagues able to connect?
13. Have you searched for this issue on the Community?
14. What have you tried?
15. ODBC trace log file

Database Connections – Requested Information (Detailed Instructions):
*** Not all of the below is required, but please send what you can provide.

1. Detailed description of the issue – What issues are you having? Has it worked in the past? When did the issue start? Are all users affected or just some? What are the steps to reproduce the issue? What have you tried to resolve it? Have you searched the Alteryx Community?

2. Is this a production issue? What is the urgency? – How many people is this affecting? Are you able to complete most of your work? Do you have a time frame in which you need the issue resolved?

3. Have there been any changes made recently (update to Alteryx, Server, etc.)? – What do you think may have caused the issue? What have you done to try to resolve it?

4. Alteryx version – Our Customer Support Engineers need to know the precise version so we can replicate any issues. In Alteryx, click Help > About and provide a screenshot, or the exact version number.

5. Exact text of the error or screenshot of the error – Press CTRL+Print Screen to capture the error and paste it into your e-mail. Where was the error encountered? Gallery? Note: You may wish to Google the error text to research the issue. The Knowledge Base is also a great place to search the error text.

6. What is your database type and version? Is your database/version supported? See http://www.alteryx.com/technical-specifications

7. What is your driver version? – Click Start > ODBC Data Sources (64-bit) > Drivers tab. Highlight the driver you are using and send a screenshot. Do you know if it is 32-bit? Have you tried a different driver? OLEDB instead of ODBC will be faster.

8. Do you have 32-bit or 64-bit?

9. Are you able to connect to the databases with something else, like TOAD, SQL Developer, or Tableau? If you are using ODBC, does the Test button in the ODBC DSN work? Click Start > ODBC Data Source Administrator > System DSN, highlight the driver, then click Configure and Test.

10. A copy of your workflow (*.yxmd or *.yxzp) and sample data if possible – Either create a .yxzp that includes macros and data by clicking Options > Export Workflow, or include the workflow *.yxmd and sample data if possible.

11. Have any changes been made recently, such as an update to Alteryx Designer, Server, etc.? Was the database migrated to the cloud? What is the physical location of the database?

12. Have you successfully connected to the database before? Are your colleagues able to connect now?

13. Have you searched for this issue on the Community? https://community.alteryx.com

14. What have you tried? Do fewer records run? Does a simple query run with just one column from the table?

15. ODBC trace log file – In the ODBC Data Source Administrator, use the Tracing tab to set a log file path and start tracing, then reproduce the issue and include the resulting log.

Suggested links:
Troubleshooting Database Connections
Creating an In-Database Connection
FAQ: How Do the In-Database Tools Work?
How To: Connect to an Oracle Database in Alteryx
View full article
Alteryx defaults to the US/English standard when it comes to number formats. However, for reporting purposes, it is important to remember that not all countries format their numbers the same way. This article shows a quick and easy way to use raw PCXML to convert numbers into the Continental European standard before outputting a final report.

Throughout the workflow-building process, numbers will be represented in the US/English standard of 1,000.00. However, when building an automated report, it is important to remember who the audience will be. For users in countries that use the Continental European standard, it may be best to have Alteryx change the numerical formatting before outputting the final report. The following example is specific to the Spanish (Spain) numbering convention.

Process

1. Pass the data through a Table tool to create a table report snippet.
2. Insert a Report Text tool and format as seen below. The LocaleID is what specifically drives the formatting change. For more information on other locale IDs, check out this article.
3. Complete your layout and use a Render tool to complete your automated report.

Please see the attached workflow for an example in practice.
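If you just want to see what the conversion does to a value, the swap of thousands and decimal separators can be sketched in a few lines of Python (purely illustrative; in the workflow itself the LocaleID in the PCXML does this work):

def to_continental(us_formatted: str) -> str:
    # Swap ',' and '.' via a placeholder so the two replacements don't collide.
    return (us_formatted.replace(",", "\x00")
                        .replace(".", ",")
                        .replace("\x00", "."))

print(to_continental("1,000.00"))  # -> 1.000,00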
View full article
Sometimes, especially when interacting with tools/files outside Alteryx, you need to explicitly control the order of tool execution. The classic example is a process whereby you 1) INSERT into a SQL table where an auto-incrementing ID is applied, 2) read the records back in order to get the auto-incremented ID, and 3) insert the auto-incremented ID into a different table.

In Alteryx workflow streams that terminate (as in an Output or Render), you can't reliably use Block Until Done to control the stream. One way to reliably control the order of events is with batch macros.

Batch macros guarantee that 1) every iteration is complete before the next iteration begins, and 2) all iterations are complete before the macro reports itself complete. So if you place a process inside a batch macro and create an output stream from that macro, all processing will be done, and the macro can still sit in the middle of a workflow stream. You don't even need a Block Until Done as part of the process.

Please see the attached examples.
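For contrast, here is the classic three-step example written as a plain Python/SQLite script (a minimal sketch; the table and column names are invented for illustration). In a linear script the ordering is implicit, which is exactly the guarantee the batch macro restores inside an Alteryx stream:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY AUTOINCREMENT, item TEXT)")
conn.execute("CREATE TABLE order_audit (order_id INTEGER, note TEXT)")

# 1) INSERT; the database assigns the auto-incrementing ID.
cur = conn.execute("INSERT INTO orders (item) VALUES (?)", ("widget",))

# 2) Read the generated ID back -- this must happen after the insert completes.
new_id = cur.lastrowid

# 3) Insert the ID into a different table.
conn.execute("INSERT INTO order_audit (order_id, note) VALUES (?, ?)", (new_id, "created"))
conn.commit()
print(conn.execute("SELECT * FROM order_audit").fetchall())  # [(1, 'created')]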
View full article
This article is part of the Tool Mastery Series, a compilation of Knowledge Base contributions to introduce diverse working examples for Designer Tools. Here we’ll delve into uses of the Boosted Model Tool on our way to mastering the Alteryx Designer:

The Boosted Model tool in Alteryx is a powerful predictive analytics tool. It is one of the more complex tools in the tool chest, but in difficult use cases it proves to be very valuable. This article will explain what boosting is and provide step-by-step instructions on how to incorporate this technique in a workflow, illustrated with a credit card fraud detection use case. In the predictive set of tools in Alteryx, the very first one on the left is the Boosted Model. The icon for the tool has an upward arrow and what looks like an upright spring.

What is the Boosted Model?

The Boosted Model relies on the machine learning technique known as boosting, in which small decision trees (“stumps”) are serially chain-linked together. Each successive tree in the chain is optimized to better predict the errored records from the previous link. How to better predict those previous errors is what distinguishes each boosting algorithm. One of the strongest and most proven algorithms is the Gradient Boosting Machine (aka “GBM”). Both R and Python have GBM packages; the Alteryx Boosted Model tool uses the R implementation.

The Gradient Boosting Machine was invented by Jerome Friedman two decades ago while he was a Statistics Professor at Stanford University. You can download his paper here. A Gradient Boosting Machine is one that uses gradient descent as the method of zeroing in on the errors from the previous tree in the chain.

Boosting – especially gradient boosting – is very popular because of how well it works. Newer alternative versions of GBM are available in R and Python. The most popular are XGBoost, developed in 2014, CatBoost, developed in 2017, and LightGBM, developed in 2017.

Since the Gradient Boosting Machine uses gradient descent, it needs a learning rate configured for it, like other gradient descent algorithms. For GBM, the learning rate is called the “shrinkage factor”.

Gradient boosting (as well as other boosting algorithms) is resistant to overfitting, which occurs when an algorithm learns too much about the specifics of the data records it initially trains on, thereby failing to generalize to new, previously unseen data at prediction time. This is a common problem with many algorithms, but not so much with gradient boosting.

To understand what overfitting is, look at the two sets of graphs below. The blue dots represent the training data points, and the red lines represent the model function inferred from the training data that will be used to predict on new data. The top two graphs show two possible models based on the same training data, while the bottom graphs show how the same models perform on test data. On each set, the left model is what we refer to as a “Generalized” model. The red line (aka ‘model’) serves as a trend line and predicts the trend by trying to be placed in the middle of the training points on the scatterplot. It is a simple straight line defined by a first-order linear equation. In the Generalized model, the model only accurately predicts 2 of the 16 training points (shown by going through the 2 points). However, its placement minimizes the error (the distance between the other points and the line).
In contrast, in the Overfit model on the right, the model employed is more complicated and defined by higher-order equations. The Overfit model BETTER PREDICTS the training data, accurately predicting 7 of the 16 training data points. We might be tempted to think that the Overfit model is better because it achieves higher accuracy when predicting the training data; however, when the models try to predict on the previously unseen test data, the accuracy of the Overfit model underperforms that of the Generalized model, and it does not reflect the true trend of the data.

Gradient boosting is flexible in the types of target variables (dependent variables) it can handle. It can deal with continuous numeric variables such as temperature, as well as binary variables such as 0 and 1. It can also work with categorical variables such as Yes and No, or multinomial categorical variables such as “Dog”, “Cat”, and “Mouse”.

The Gradient Boosting Machine (like the other boosting algorithms) is said to be good for imbalanced training datasets. For example, if a training dataset has 50% of the records with a target variable of “Y” and the other 50% with a target variable of “N”, it is considered a perfectly balanced dataset. Balanced datasets are excellent for training, but there are many cases in which the training dataset by its very nature is highly imbalanced. For example, if we are trying to detect credit card fraud, we would have a historical dataset in which each credit card transaction is a record. The number of fraudulent transactions within those records would be a very small percentage of the total. In this case, the dataset available for training a model is highly imbalanced, which presents a difficulty for many algorithms such as Logistic Regression. It is less of a problem for the GBM, however, because of the way the algorithm focuses on error records in each tree generation, as described above.

Credit Card Fraud Data

In order to help illustrate how to use the Alteryx Boosted Model, we will use a credit card fraud detection dataset for model training. Kaggle has a sizeable dataset available, which you can download here.

This particular dataset has nearly 285,000 rows, each representing a single credit card transaction. In order to hide the personal data associated with each credit card transaction, the variables are all obfuscated by their names and normalized. It is very likely these variables are the output of a PCA (Principal Component Analysis) preprocess. PCA is often employed to reduce many variables down to only a few independent and regularized (normalized) variables. One benefit of PCA is that the smaller set of output variables is often obfuscated and difficult to correlate back to a known input variable (without the output reports of the PCA process). Another benefit of PCA preprocessing is that the output variables have very low collinearity, which makes them better suited for modeling.

The Kaggle credit card fraud dataset has 30 columns. The main variables are named “V1” through “V28”. Following this is the “Amount” column, which is the purchase amount, and then finally the target variable, labeled “ClassYN”. If Class=Y, the record represents a fraudulent transaction; if Class=N, the record is a normal transaction. Of the 284,807 total records, only 492 (0.17%) are fraud records. This means that 99.83% of the records are not fraud; hence, this is a very imbalanced dataset.
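Before moving on to the data preparation, here is a toy Python sketch of the chain-of-stumps idea described above: each stump is fit to the residuals (errors) of the chain so far, scaled by a shrinkage factor. It is deliberately simplified (squared-error residuals, synthetic data) and is not the R gbm implementation the tool actually uses:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

shrinkage = 0.1          # the GBM "shrinkage factor" (learning rate)
pred = np.zeros_like(y)  # the chain starts from a zero prediction
for _ in range(100):
    residual = y - pred                      # errors left by the chain so far
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    pred += shrinkage * stump.predict(X)     # each new stump corrects the last

print(f"training MSE after 100 stumps: {np.mean((y - pred) ** 2):.4f}")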
In preparation for modeling, we separated the 285K records into three different datasets: one for training, one for testing, and one for final evaluation. This separation into training, testing, and evaluation sets can be done through random assignment, but – especially with imbalanced datasets – it is better to split the data in a semi-random way. This ensures each sub-dataset has a representative sampling of the important types of records from the original dataset [1]. Once the full dataset is split evenly into three datasets, a T-Test (test of means) is performed on the target variable and the most important of the other variables (“V1” – “V28”) to ensure that each dataset is statistically similar to the original.

The Alteryx Workflow

Below is a screenshot of the credit card fraud detection model and workflow created for this blog. You can download a copy of it at the Alteryx Public Gallery here.

The workflow is broken into three columns: the Input Data column on the far left, the colored processing sections in the middle, and the Output Data column on the far right. The data flows from left to right, inputs to outputs. The processing happens in three sections: the pink section is where the Boosted Model gets trained; in the green section, the model is tested using Test Dataset 1; and the yellow section is where the final model evaluation is performed using Test Dataset 2.

Before the training data is used in step one of processing, we decided to transform the Amount field. Transforming variables can help the modeling process. Because the Amount field spanned from $0 to $25,000, and this scale is significantly different from the span of the variables V1 through V28, it would be best to either normalize the field or transform it to make it better correlate with the target variable. In this case, we decided to do a logarithmic transformation. Specifically, the new variable “3LogAmount” is 3*Log(Amount). This transformation brings the scale to the same degree as the other variables, and the new variable better correlates with the target variable (using Pearson correlation). Because we transformed Amount for training, we must also transform it for testing and evaluation.

Configuring the Alteryx Boosted Model

Required Parameters Menu

In order to configure the Alteryx Boosted Model tool, click on it in the workflow. This is what you will see in the Configuration pane:

On the “Required parameters” tab, you can choose a model name. You can choose any name you want, but a good practice is to include information about how the model was configured.

Next, choose the target field. This is the field we are trying to predict. In our dataset the target field is “ClassYN”.

The next section shows a list of checkboxes for indicating the predictor fields. We check every field except Amount and ClassYN. Since we transformed the Amount field into 3LogAmount, we should check 3LogAmount, but not Amount. Also, it is very important that we do not check the ClassYN field, because it is the dependent variable, not an independent variable.

The next option is to choose whether you want to use sampling weights. If you do, include a field in the data with the numeric weight for each record. For example, if there is a set of records you want to be 10 times more likely to be used than the rest of the records, those records get weighted as 10 and the rest get weighted as 1. When configuring the tool, choose the field in the dataset that has the weights.
In our case, we do not need to use weights.

The next option is whether or not to include marginal effect plots. These are plots that isolate a single variable to inspect which values of that variable are most predictive of the target variable. They are helpful in determining how well a variable will serve as a predictor.

Model Customization Menu

Next, to further customize the model, choose the Model customization tab.

If you click on the checkbox “Specify target type and the loss function distribution”, you can choose what type of variable the target value is, and then its loss function. A loss function essentially tells the algorithm how to penalize errored predictions.

Continuous target:

If the target variable is continuous, there are three different loss functions available. A Gaussian (squared error loss) function is very good to use when the predicted errors are large in value; since the error amount is squared, large errors are penalized much more. A Laplace (absolute value loss) function penalizes errors linearly to their error amount. This is a faster loss function, but it does not work as well as Gaussian in cases in which the errored amounts are large. The t-distribution loss function (aka Student's t loss function) penalizes errors even less than the Laplace loss function. If you choose the t-distribution, you will need to enter the number of predictive variables in the “Degrees of Freedom” field.

Count (integer) target:

If your target variable is a count of some kind (e.g., the count of sneezes a person has during allergy season), the Boosted Model tool will automatically choose the Poisson loss function.

Binary (two outcomes) categorical:

If you choose a target variable that is binary categorical, you will have a choice of two loss functions: Bernoulli (logistic regression) or AdaBoost (exponential loss). Bernoulli has a loss function that is less steep than AdaBoost for large errors and thus penalizes such errors less; for small errors, however, Bernoulli penalizes more than AdaBoost. If your errors are large, use AdaBoost; if your errors are small, use Bernoulli.

Multinomial (three or more outcomes) categorical:

If you choose multinomial categorical (three or more target values), the tool will use a loss function geared specifically for a target variable with three or more values.

Since, in our credit card fraud use case, we have a binary categorical target (“ClassYN”), we experimented with both the Bernoulli loss function and the AdaBoost loss function, because we did not know whether the errors would be large or small.

The maximum number of trees in the model:

The next customization parameter is the maximum number of trees. With this parameter, we want to use the minimum number of trees that gives us the best balance between processing time and error amount when tested via the testing and evaluation datasets. Typically, there is a value for the number of trees after which the quality of the model (measured by AUC, Precision/Recall, or Accuracy) plateaus. In the example graph below, which was generated on the R anchor of the Boosted Model tool, the X axis (titled “Iteration”) is the number of trees created. The Y axis, titled “AdaBoost exponential bound”, refers to the amount of prediction error for that number of trees. In this example, we were using an AdaBoost loss function with cross validation.
There are two key points identified by the arrows in the graph.

The blue arrow identifies the “knee” of the graph. This is an inflection point where the error no longer significantly reduces as the number of trees increases. In this example, it is at about 2000 trees.

The yellow arrow identifies the “gone too far” point, where the error actually begins to increase as the number of trees increases. In this example, it is at about 6000 trees.

The maximum number of trees should be between these two points on the graph. The closer to the yellow arrow point, the longer it takes to create the model, but the error will be at its minimum. The closer to the blue arrow point, the less time it will take to create the model, but the error will be slightly higher. We chose right in the middle, at 4000 trees.

Understand that the only way to view this graph is to run the model. So, in order to choose the optimal value for the maximum number of trees, you may have to experiment by running the model with different loss functions and different values for the other parameters.

Method to determine the final number of trees in the model:

The next parameter we need to set with respect to the number of trees is the method to determine the final number of trees in the model. The three options are cross validation, test (validation) sample, and out-of-bag. For the credit card fraud detection use case, we experimented with each method.

Cross validation:

Of the three options, the one that is consistently the best is cross validation. A recommendation for the number of cross validation folds is somewhere between 5 and 10. If the data is very noisy, you may want to increase the number of CV folds even further, to 20. Cross validation has the benefit of preventing overfitting. Unfortunately, it also takes a lot of processing power. This is why you can choose the number of CPU cores to dedicate to cross validation. Using more cores than the number of folds does not help, but keeping it to one core will take the longest time to process. For instance, if you have 5 CV folds and 8 CPU cores on your machine, you can probably raise the number of cores to 4 without significantly slowing down any processing happening concurrently outside of Alteryx.

Test (validation) sample:

Using a test (validation) sample simply tells the tool to separate out a percentage of the training records (that you set) to be used for internal testing. This method is much faster than cross validation, but it is not better.

Out-of-bag:

This method is based on the algorithm’s “out-of-bag” estimator, which typically does not reach the optimal number of trees. It is faster than cross validation and test sample, but it usually does not yield the most accurate model.
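If you want to reproduce this kind of iteration-versus-error curve outside Designer, scikit-learn exposes staged predictions that serve the same purpose. This is a rough analogue on synthetic imbalanced data, not the R gbm cross-validation plot itself:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data: ~1% positive class.
X, y = make_classification(n_samples=20000, n_features=28, weights=[0.99], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05, max_depth=1)
gbm.fit(X_tr, y_tr)

# Validation loss after each successive tree: the "knee" is where the curve
# flattens, and the minimum marks the "gone too far" boundary.
losses = [log_loss(y_va, p) for p in gbm.staged_predict_proba(X_va)]
print(f"loss bottoms out at ~{int(np.argmin(losses)) + 1} trees")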
The fraction of observations used in the out-of-bag sample:

The next customization parameter is “The fraction of observations used in the out-of-bag sample” (aka the “bag fraction”). This is the percentage of training records pulled and used in each iteration of tree creation. This pulling of records out of the bag of training records is normally random, except when the random seed value is set to zero. If the random seed value is zero and the bag fraction is less than one, a slightly different model is built each time the workflow is run, even with the same input data and the same parameters. If we instead set the seed to any number (e.g., 1) and keep it on that number, then every time the workflow is run with the same input data, the model that is built will always be the same.

For the credit card fraud detection use case, we experimented with different fractions of observations to determine the optimal value.

Shrinkage:

The shrinkage value is also called the learning rate. As a general rule, the smaller the learning rate, the more accurate the model will be, although exceptions to this rule are sometimes encountered. Also as a general rule, the smaller the learning rate, the longer it takes to reach an optimal model.

For the credit card fraud detection use case, we experimented with different shrinkage values to find the optimal value.

Interaction depth:

The interaction depth relates to how big each tree should be. There is a lot of debate about how many tree nodes and tree leaves will be formed for each value of interaction depth. What is not debated is that the greater the number, the larger each tree will be. The default value is 1, which is also the value that most strongly prevents overfitting.

For the credit card fraud detection use case, we experimented with values 1 through 5 for the interaction depth to determine the best value.

Minimum required number of objects in each tree node:

The next parameter is the minimum required number of objects in each tree node. This value is the minimum number of training records that must be in each terminal node of the tree for the model to use that tree. When creating a tree, it is necessary to split the tree into right and left nodes; but before making the split official, the dataset must provide the model this minimum number of records to put in each new node. A higher value reduces the number of tree splits that can happen, and possibly the number of trees. The higher the number, the more it will generalize the model and prevent overfitting. The default value is 10. The best way to find the optimal value is to experiment.

For the credit card fraud detection use case, we experimented with values 1 through 100 (the maximum possible value).

Random seed value:

Finally, the last customization parameter is the random seed value. If it is set to 0, the output models for each training run will be slightly different (even with the exact same input data). If you instead set the seed to any number greater than zero, all runs with that same seed value (and the same input data and the same customization parameters) will yield the same output model. This is important for repeatability and testing.

Credit Card Fraud Detection Use Case

With the credit card fraud data, we ran several experiments to determine the best set of customization hyper-parameters. The first set of experiments examined the type of loss function together with the method to determine the final number of trees. The two choices for loss functions are Bernoulli and AdaBoost, and the three choices for methods are cross validation (5 folds), test sample (50%), and out-of-bag. Below are the results.

Model Measurements

Confusion Matrix:

The first column is the model description. The next four columns are the “confusion matrix” values: TP (True Positive) means we predicted the record as fraud and it truly was fraud.
FP (False Positive) means we predicted the record as fraud, but it was not actually fraud. FN (False Negative) means we predicted the record as not fraud, but in reality it was fraud. TN (True Negative) means we predicted the record as not fraud and it truly was not fraud. Our main focus with the confusion matrix numbers lies with the False Positive and False Negative values: we want these to be as low as possible because they have actual costs to the company.

Please note that the confusion matrix is not an output of the Boosted Model tool. In order to create it, we created a tool container titled “Post Processing Evaluation of Scored Data”, in which we put the logic to create the confusion matrix along with the Optimal Cutoff, TotalLoss, Precision, Recall, and Accuracy fields. To find the tool container, look in the green and yellow processing sections on the right side. If you open the tool container, you can easily follow the logic.

Here is a good example to help you understand the four confusion matrix metrics:

Optimal Cutoff and TotalLoss:

After the model is created, it is fed directly into a Score tool along with the test records. The output of the Score tool is a Score_Y field containing the probability that the record is fraudulent. In a default setting, if the probability is 0.50 or above, we would classify the transaction as fraud. However, by taking into account the actual costs of False Positives and False Negatives, we can calculate an optimal cutoff value that minimizes costs.

For example, if we inaccurately predict a transaction as fraudulent, the company could take an action that upsets a customer, such as blocking the transaction and maybe even putting a freeze on the card. This could cause the customer undue stress, and some will take their business elsewhere. This is the potential loss due to a False Positive. If a card has truly fraudulent transactions and yet we predict that the transactions are not fraud, we will likely lose more money. This loss results from a False Negative. In our use case, we assigned the loss for a False Positive as $100 and the loss for a False Negative as $400. Using these values, we calculate the Optimal Cutoff score and the TotalLoss in dollars. We want the Optimal Cutoff score to be as high as possible, and the TotalLoss to be as small as possible, to make sure our model is robust.

The logic for this calculation is in the tool container. There is a Text Input tool where you can set the costs for each of the categories of classifications (TP, FP, FN, and TN).
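The cutoff search itself is simple enough to sketch in Python (toy scores invented for illustration; the $100/$400 costs are the ones assigned above, and in the workflow this logic lives in the tool container):

import numpy as np

def total_loss(y_true, score_y, cutoff, fp_cost=100, fn_cost=400):
    pred = score_y >= cutoff              # classify as fraud at or above the cutoff
    fp = np.sum(pred & (y_true == 0))     # predicted fraud, actually normal
    fn = np.sum(~pred & (y_true == 1))    # missed fraud
    return fp * fp_cost + fn * fn_cost

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
score_y = np.array([0.1, 0.4, 0.9, 0.2, 0.35, 0.05, 0.6, 0.8])  # the Score_Y field
cutoffs = np.linspace(0.01, 0.99, 99)
losses = [total_loss(y_true, score_y, c) for c in cutoffs]
print(f"optimal cutoff ≈ {cutoffs[np.argmin(losses)]:.2f}, TotalLoss = {min(losses)}")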
Precision, Recall and Accuracy:

These are common measurements that can be applied to models based on the results of the test data. For all of these measurements, the higher the number the better. A perfect score for each would be 1.00, although in practice that is almost never achieved.

Precision is defined as the proportion of correct positive classifications (predictions) out of all cases predicted as positive. When False Positives rise, Precision declines. If our focus were to maximize detecting fraud with a concern about False Positives and little concern about False Negatives, Precision is a good measurement to use.

Recall is defined as the proportion of correct positive classifications out of all cases that are actually positive, which is TP plus FN. When False Negatives rise, Recall declines. If our focus is on maximizing detection of all fraud, with False Positives as a second priority, then Recall is the measurement to use.

Accuracy is the proportion of correct classifications (True Positives and True Negatives) out of the overall number of cases. When we want to simultaneously measure how well our model correctly classifies “fraud” and “not fraud” transactions, Accuracy is a good measurement to use. However, when we have a highly imbalanced training dataset (as we do in this case), Accuracy is not a good measurement. For example, imagine the model simply predicts every record it sees as not fraud. With 492 fraud transactions and 47,211 normal transactions, our accuracy would still calculate to be 98.9%, and yet our model would not be able to detect any fraud. The model would be useless, but still very accurate, simply by chance. For this reason, with a highly imbalanced dataset, we normally do not use Accuracy as a primary measure.

AUC (Area Under the Curve) and Gini Coefficient:

AUC: The Area Under the Curve is the total area under the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (aka Recall) against the False Positive Rate (FPR, i.e., 1 − Specificity). It is a good measurement because it represents the degree of separability between the ‘fraud’ and ‘not fraud’ classes and tells us how capable the model is of distinguishing between them. A perfect AUC would be 1.00. In Alteryx, the tool we use to generate the ROC curve and measure the AUC (the area under the ROC curve) is the Lift Chart tool, found in the Predictive suite of tools. The Lift Chart tool can measure the AUC for one or many models, allowing us to conveniently compare the graphs and final measures for each model.

The output graph below is an actual output from our experiments to determine the best set of model customization parameters. The ROC graph shows six ROC curves (one for each of six different models). In the middle of the graph is the diagonal line. Any model that is working “better than random” will have a ROC curve above and to the left of the diagonal line. In this graph, all six models have curves indicating they are all “better than random”. Typically, the curve farthest to the upper left of the diagonal line will have the highest AUC value and belongs to the best model; another way of saying this is that the curve closest to the upper left corner of the graph is typically the best model. In the graph below, the closest curve to the upper left corner is Adaboost_Cross. Next is the Adaboost_Out_of_the_Bag model, and the third best is Adaboost_Test_Sample. Note that all the AdaBoost models are closer to the upper left corner than all three of the Bernoulli model curves (Bernoulli_Cross is hidden behind Bernoulli_Out_of_the_Bag). In the table below the graph, we see that the Adaboost_Cross model has the highest AUC value, confirming that it is the best of the six models. The worst would be Bernoulli_Cross or Bernoulli_Out_of_the_Bag.

Gini Coefficient: The Gini Coefficient is found in the table below the ROC curve, on the far right. It is related to the AUC calculation and, like the AUC value, it is scaled from 0 to 1, where 0 indicates no ability to separate the classes and 1 indicates perfect separability.
Which measurement shall we use?

For use cases where the data is highly imbalanced and the target variable is binary, the best measurement to use is the AUC (Area Under the Curve). In our experiments, we looked at all the measurements and gave decision preference to AUC, Optimal Cutoff, and TotalLoss.

What experiments were run?

We ran various experiments to determine the optimal customization parameters for the Boosted Model. In all cases, we used the test data to determine these parameters. Here are the winners for our credit card fraud detection use case:

Loss function: AdaBoost
The maximum number of trees in the model: 4000
Method to determine the final number of trees in the model: Cross validation (5 fold)
The fraction of observations used in the out-of-bag sample: 30%
Shrinkage: 0.0035
Interaction depth: 1
Minimum required number of objects in each tree node: 100

Final experiment results:

We ran a final experiment using the evaluation data against the top two contending models and the default model (the default model is the one generated when we use all the default parameters). Here are the default parameters:

Default: Model_Default_Bernoulli_Cross_4000_20_1_10_50
Loss function: Bernoulli
The maximum number of trees in the model: 4000
Method to determine the final number of trees in the model: Cross validation (5 fold)
The fraction of observations used in the out-of-bag sample: 50%
Shrinkage: 0.0020
Interaction depth: 1
Minimum required number of objects in each tree node: 10

Our two contending models were:

Model_Adaboost_Cross_4000_35_1_100_30
Loss function: AdaBoost
The maximum number of trees in the model: 4000
Method to determine the final number of trees in the model: Cross validation (5 fold)
The fraction of observations used in the out-of-bag sample: 30%
Shrinkage: 0.0035
Interaction depth: 1
Minimum required number of objects in each tree node: 100

Model_Adaboost_Test_4000_35_1_100_30
Loss function: AdaBoost
The maximum number of trees in the model: 4000
Method to determine the final number of trees in the model: Test sample (50%)
The fraction of observations used in the out-of-bag sample: 30%
Shrinkage: 0.0035
Interaction depth: 1
Minimum required number of objects in each tree node: 100

Our results from the three contending models are below:

Looking at the results, we see that the first model (Default) is not bad. It has a high True Positive value, and it has the lowest TotalLoss value. However, it also has the lowest AUC value, which is not good. The second and third models had better AUC values, yet their TotalLoss values were greater (in magnitude). The second model has the highest Optimal Cutoff value, which indicates a stronger model.

One last thing we can look at when comparing the models is the ROC curves. Below is a combined chart with the ROC curves of the three contending models.

From the chart, we see that the ROC curve farthest to the left of the diagonal, and higher towards the top than the others, is the Model_Adaboost_Cross_4000_35_1_100_30 model. These curves confirm the AUC values in the table. This is the winning model, because it has the highest AUC value and the highest Optimal Cutoff value.
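For readers who want to experiment with a comparable configuration outside Designer, the winning parameters map roughly onto scikit-learn's GradientBoostingClassifier as follows. This is an approximate translation, not a drop-in replacement for the R gbm model the tool builds:

from sklearn.ensemble import GradientBoostingClassifier

# Rough analogue of Model_Adaboost_Cross_4000_35_1_100_30.
model = GradientBoostingClassifier(
    loss="exponential",    # AdaBoost-style exponential loss
    n_estimators=4000,     # maximum number of trees
    learning_rate=0.0035,  # shrinkage
    max_depth=1,           # interaction depth of 1 -> stumps
    min_samples_leaf=100,  # minimum objects in each terminal node
    subsample=0.30,        # bag fraction of observations
    random_state=1,        # fixed seed for repeatability
)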
Conclusion:

The Boosted Model tool in Alteryx is a very powerful predictive analytics tool. It is one of the more complex tools in the tool chest, but in difficult use cases it proves to be very valuable. This credit card fraud detection use case is especially difficult because it is highly imbalanced.

The Boosted Model tool uses the GBM package in R. The algorithm is resistant to overfitting, and it works well with a variety of target variable types.

If you decide to use the Boosted Model tool, you will find that the default values provide decent results. If you want to improve upon those results, experiment with the customization parameters. This will add time to the design and modeling process, but it can produce significantly better results.

References:

A Gentle Introduction to Gradient Boosting
Boosting with AdaBoost and Gradient Boosting
Boosting and Logistic Regression
A Short Introduction to Boosting
ADABOOST
Quick Introduction to Boosting Algorithms in Machine Learning
Beyond Accuracy: Precision and Recall
Understanding the AUC – ROC Curve
Useful properties of ROC curves, AUC scoring, and Gini Coefficients

[1] How to do this semi-random separation is beyond the scope of this article, but if you would like to know how best to split an imbalanced dataset, please contact the author through the Alteryx Community.

By now, you should have expert-level proficiency with the Boosted Model Tool! If you can think of a use case we left out, feel free to use the comments section below! Consider yourself a Tool Master already? Let us know at community@alteryx.com if you’d like your creative tool uses to be featured in the Tool Mastery Series.

Stay tuned with our latest posts every #ToolTuesday by following @alteryx on Twitter! If you want to master all the Designer tools, consider subscribing for email notifications.
View full article
The attached Alteryx Workflow takes the color schemes from www.ColorBrewer2.org and adds them to a new XML file to be utilized as the ReportSettings.xml file installed with Alteryx. You can utilize the new color palettes under the Report Map tools. Note that these palettes will not be added to the Interactive Chart or Insight tools. 
View full article
How to get rid of the annoying "Exception" error during YXI installation caused by long path names.
View full article
Looking for more premium connector and tool content to better equip your Alteryx platform for success? Check out what our partners have been up to!
View full article
There are a couple of different whitespace situations you might get yourself into, but this article has you covered in all of them!
View full article
Having trouble reading or writing Microsoft Access files? No worries - Client Services is here to help!
View full article
Let's talk about how to replicate the WHERE EXISTS functionality of SQL within Alteryx.

Example 1: UNION – Add rows from table #1 to table #2 if the key value of table #1 does not exist in table #2. If desired, the combined data set could then be joined with a third data set, but only where the key value does not exist in table #3 (using a single key column).

Example 2: SELECT/UPDATE records from table #1 based on the contents of table #2. The statement below generates the names of customers who had orders during 2016.

select c1.customer_number
      ,c1.customer_name
from   customers c1
where  1 = 1
and    exists (select *
               from   customer_orders c2
               where  1 = 1
               and    c1.customer_number = c2.customer_number
               and    c2.order_year = 2016)
;

The SQL EXISTS condition is used in combination with a subquery and is considered met if the subquery returns at least one row. It can be used in a SELECT, INSERT, UPDATE, or DELETE statement.

The syntax for the EXISTS condition in SQL is: WHERE EXISTS (subquery)

The subquery is a SELECT statement. If the subquery returns at least one record in its result set, the EXISTS clause evaluates to true and the EXISTS condition is met. If the subquery returns no records, the EXISTS clause evaluates to false and the EXISTS condition is not met.

To replicate this in Alteryx:

Example 1: Take the unmatched records (from the source you are appending data from) out of a Join between the two data sets, and Union them back to the data set you are appending to.

Example 2: Create your "look up list" out of a Filter on whatever you set as the condition – in this example, a year that is in the data. Join the filtered data back with the other data source, and deselect the fields that come from the filtered source.

Results:

Please see the attached workflow: Where Exists Question.yxmd.
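The same semi-join can be sketched in Python with pandas, mirroring the Filter-then-Join pattern described above (table and column names follow the SQL example; the sample data is invented for illustration):

import pandas as pd

customers = pd.DataFrame({"customer_number": [1, 2, 3],
                          "customer_name": ["Ann", "Bo", "Cy"]})
customer_orders = pd.DataFrame({"customer_number": [1, 1, 3, 3],
                                "order_year": [2015, 2016, 2016, 2014]})

# Build the "look up list" (customers with a 2016 order), then keep only
# matching customers -- the WHERE EXISTS semantics.
lookup = customer_orders.loc[customer_orders["order_year"] == 2016, "customer_number"]
result = customers[customers["customer_number"].isin(lookup)]
print(result)  # Ann and Cy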
View full article
Have you ever wanted to do a Cross Tab, but needed the results in a particular order?  You've probably discovered that Cross Tab will dutifully change your data to column headings and keep the rest of the data correlated, but that the output columns will be arranged in alphabetical order.  It's as if there were a Select tool with the "Sort on Field Name" Option thrown into your workflow without you putting it there.    
View full article
Output data as a Text File

Within Alteryx there isn't an output option directly to a text file. To achieve this, you will need to use a flat ASCII file.

Step 1: Bring in an Output Data tool and choose the ‘Flat ASCII file (*.flat)’ option.

Step 2: You will then see the below screenshot in the Output tool configuration window.

Step 3: You will now need to change the file extension from .flat to .txt; this will change the flat file to a text format.

Step 4: You can now click on the hyperlink in the Results window and open your text file in a supported application.

Example attached.
View full article
In Alteryx, there are 5 customizable options within the Cross-validation screen.
View full article
In SQL, you can join on a range, using code similar to the snippet below.

SELECT g.[Column], f.FruitName, f.StartDateTime, f.EndDateTime
FROM dbo.Greek_Alphabet g
INNER JOIN dbo.Fruit_List f
  ON g.[DateTime] BETWEEN f.StartDateTime AND f.EndDateTime

Alteryx does not have a "join on range" command, but you can get the same result by appending all records and using a Filter tool.

For advanced joins – when a value from one file is between (< or >) a value from another file – visit and download the Advanced Join Macro.
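The append-then-filter approach translates directly to pandas as a cross join plus a BETWEEN filter (a minimal sketch with invented sample data; note that a cross join multiplies row counts, so watch cardinality on large inputs):

import pandas as pd

fruit_list = pd.DataFrame({
    "FruitName": ["apple", "pear"],
    "StartDateTime": pd.to_datetime(["2016-01-01", "2016-02-01"]),
    "EndDateTime": pd.to_datetime(["2016-01-31", "2016-02-29"]),
})
events = pd.DataFrame({"EventID": [1, 2],
                       "DateTime": pd.to_datetime(["2016-01-15", "2016-02-10"])})

# Append every fruit range to every event, then keep rows where the event
# falls inside the range (between() is inclusive on both ends).
crossed = events.merge(fruit_list, how="cross")
in_range = crossed[crossed["DateTime"].between(crossed["StartDateTime"],
                                               crossed["EndDateTime"])]
print(in_range[["EventID", "FruitName"]])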
View full article
Calgary  is a list count data retrieval engine designed to perform analyses on large scale databases containing millions of records. One use case for a Calgary database in an  App or Macro  is to offer users certain predetermined criteria (e.g. geography) to select and then return the corresponding data to them. A back-end process can combine multiple data sources, format and pre-filter as needed, and load them into a Calgary database that can be queried quickly by the  app  or  macro . This way, the bulk of the processing happens outside of the app or macro allowing data to be returned more quickly. 
View full article