The predictive tools can be intimidating to many Alteryx users. Even members of our own internal teams sometimes shake their heads and walk away rather than trying to scale that steep learning curve. I started reading through Designer Help, trying to understand. And while the pages are accurate and well laid out, they are far from a learning guide. That made them the perfect foundation for building a better starting place.
I started with Boosted Model because, well, it’s first on the Tool Palette.
And they say writers are creative
I connected up a .yxdb from the available sample data and started clicking around the configuration window. Looking at Required parameters, this seemed really easy.
Required parameters: keeping it simple
And then confusion!
Or not so simple
Sampling weights? Marginal effect plots? What are those and why are they optional? And, most importantly, what should I click and why?
Weighting is a technique used to adjust imbalanced data where over-represented groups get a smaller weight, and under-represented groups get a larger weight. It’s basically trying to make the data more accurately represent the sampled population without having to go get more samples.
What you click
If you have a dataset that didn’t get enough samples of people age 45-55 for a second refinance loan, you’d want to use this, because that group of people needs to be considered accurately if you’re going to figure out who to refinance.
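To make the idea concrete, here’s a toy Python sketch of inverse-frequency weighting. This is just the concept, not the tool’s implementation; the `inverse_frequency_weights` helper and the age-group data are made up for illustration:

```python
from collections import Counter

def inverse_frequency_weights(groups):
    """Weight each record inversely to its group's frequency (hypothetical helper)."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    # Balanced weighting: n / (k * group count), so rare groups count for more.
    return [n / (k * counts[g]) for g in groups]

# An imbalanced sample: the 45-55 age group is under-represented.
ages = ["25-35"] * 8 + ["45-55"] * 2
weights = inverse_frequency_weights(ages)
print(weights[0], weights[-1])  # over-represented: 0.625, under-represented: 2.5
```

The total weight still adds up to the number of records; the weights just shift influence toward the group that didn’t get enough samples.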
A marginal effect, it turns out, is the amount of impact a field has on a calculation, written as a percentage. So if you’re trying to determine what economic class someone is in, their salary has a huge impact. The number of beanie babies their second cousin has, not so much.
This option doesn’t just let you see how big a deal a variable is in the overall calculation. It also lets you determine what to ignore. Set a minimum percentage, and any variable that doesn’t have that big an impact isn’t charted.
What you click
If you’re trying to determine how variables impact your loan estimation, you would plot the marginal effect and set a low percentage to see how important each field is.
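A minimal sketch of that filtering step, with made-up importance percentages for a hypothetical loan model (the tool computes these numbers itself; `variables_to_plot` is invented for illustration):

```python
# Hypothetical variable-importance percentages for a loan model.
importance = {
    "salary": 62.0,
    "payment_history": 30.5,
    "zip_code": 6.0,
    "cousins_beanie_baby_count": 1.5,
}

def variables_to_plot(importance, minimum_pct):
    """Keep only variables whose marginal effect meets the minimum percentage."""
    keep = (name for name, pct in importance.items() if pct >= minimum_pct)
    return sorted(keep, key=lambda name: -importance[name])

print(variables_to_plot(importance, 5.0))  # the beanie babies don't make the chart
```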
Then onto the next tab!
Model customization was…well, more intimidating. I was definitely second-guessing my decision.
Those read about the same at first glance
First was this target type thing, which is the pattern the data in the target field follows.
Continuous data means that a value can fall anywhere within the range of values. If data ranges from 11 to 2 billion, any given point could be 46.3.
Count data are integers representing how often a value shows up in the data set. If the data set is the sentence "here we go again", count data will return 3 for the value "e".
Categorical data means that each value will be one of a set number of options (categories), e.g.: colors. If it’s binary, it is one of two colors. If it’s multinomial, it is one of several.
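The count-data case is easy to check in a few lines of Python, using the sentence from the example above:

```python
from collections import Counter

text = "here we go again"
letter_counts = Counter(text)  # tally how often each character shows up
print(letter_counts["e"])  # 3
```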
On top of that, target types pair with loss function distributions. In classification problems, values can be incorrectly classified (it happens); loss functions account for this by assigning a penalty value. These distributions aim to reduce the penalty by minimizing the function. In terms of learning each function, I recommend your favorite search engine.
What you click
Loss function distributions can be tricky to choose between, and sometimes you don’t even need to. But when there are options, it’s recommended to try all of them and see which best suits your model, your data, and your use case.
Decision trees within Boosted Model
The maximum number of trees in the model caused me to finally read the inner workings of the Boosted Model macro. Trees refers to Decision trees, a type of model that runs “if-then” split rules. Basically, the tree checks for specific criteria (if), and when there’s a match, follows to the next rule (then). The Boosted Model stacks a bunch of these trees to determine the logic the model will use while calculating. The more trees there are, the more specific the logic can be, with a tradeoff of a longer runtime.
The Method to determine the final number of trees in the model is important because it makes sure that the model is actually predicting what it should be without adjusting for noise and randomness. This method’s job is to Keep Your Model Simple and not let it overthink.
One method is Cross validation, a process that happens within your training dataset. Basically, your training data is broken up into folds, or equal chunks of data. These folds are then compared against one another to make sure that your model isn’t treating any data as special or making unnecessary accommodations. Set the Number of cross validation folds based on how you want the data to be split. Set the Number of machine cores to use based on what your machine has available to run. Be sure to consider any additional processing that will be taking place at the same time, or you could slow down all the work you’re doing.
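Here’s a stdlib-only sketch of what splitting into folds looks like. The tool handles all of this internally; `cross_validation_splits` is a made-up name and the records are just integers standing in for rows:

```python
def cross_validation_splits(records, k):
    """Yield (training, test) pairs; each fold takes one turn as the test set."""
    folds = [records[i::k] for i in range(k)]  # k roughly equal chunks
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

records = list(range(10))
for train, test in cross_validation_splits(records, 5):
    print(len(train), len(test))  # every split trains on 8 records, tests on 2
```

With 5 folds, every record gets used for testing exactly once, which is why the method works even when records are scarce.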
Another option is Test (validation) sample. A subset of your training data is pulled out and labeled as test data, and the model built from the remaining training data is checked against it. This method is useful with big datasets because you set The percentage in the estimation (training) sample. This gives you more control over the size of the samples being compared.
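In sketch form, setting that percentage is just a shuffled split. Again, this is the concept rather than the tool’s implementation, and `estimation_sample_split` is a hypothetical helper:

```python
import random

def estimation_sample_split(records, estimation_pct, seed=1):
    """Hold out (100 - estimation_pct)% of records as the test sample."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * estimation_pct // 100
    return shuffled[:cut], shuffled[cut:]  # (training, test)

train, test = estimation_sample_split(list(range(1000)), estimation_pct=80)
print(len(train), len(test))  # 800 200
```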
Out-of-bag is a method that uses records that were withheld from the training data as test data.
What you click
Definitely try each method to see what works best, but know that cross validation is especially useful with datasets that have limited records because it allows you to more accurately see how the logic will generalize to new data. Test validation is useful with big sample sets. Out-of-bag validation is simply another method to check.
REALLY: It is really important to try all options because there might be connections you don’t see that work best with a method you aren’t considering.
Those other options
Use The fraction of observations used in the out-of-bag sample to set the size of the test data. This is the test that checks model effectiveness, so it uses data that wasn’t used in creating the model. Using between 25% and 50% is common practice so you have a clear idea of how the model treats new data before using it for predictions.
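A sketch of the idea: draw a fraction of the records for training, and everything left out becomes the out-of-bag test set. The `training_and_oob` helper is invented for illustration:

```python
import random

def training_and_oob(records, oob_fraction, seed=7):
    """Sample training records; whatever is withheld becomes the out-of-bag test set."""
    rng = random.Random(seed)
    n_train = int(len(records) * (1 - oob_fraction))
    train = rng.sample(records, n_train)   # records used to fit the trees
    held_out = set(records) - set(train)   # records the model never saw
    return train, sorted(held_out)

train, oob = training_and_oob(list(range(100)), oob_fraction=0.3)
print(len(train), len(oob))  # 70 30
```

Because the out-of-bag records never touched the model, checking predictions against them gives an honest read on how the model treats new data.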
Shrinkage weights the trees in the model. If trees are weighted as having a low individual impact in calculating the target, more trees will be needed, so a small shrinkage value can cause your model to need more trees.
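A toy illustration of that tradeoff: if each boosting round only applies a `shrinkage`-sized slice of the correction, a smaller shrinkage value needs more trees to reach the same accuracy. The numbers here are made up, not from the tool:

```python
def trees_needed(shrinkage, error=10.0, tolerance=0.1):
    """Count boosting rounds until the remaining error drops below tolerance.

    Toy setup: each "tree" corrects the current error perfectly, but only
    shrinkage * that correction is applied, so the error shrinks by a
    factor of (1 - shrinkage) per tree.
    """
    trees = 0
    while error >= tolerance:
        error *= (1 - shrinkage)
        trees += 1
    return trees

print(trees_needed(0.5), trees_needed(0.1))  # 7 vs 44: smaller shrinkage, more trees
```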
Interaction depth indicates relationships between predictor variables in relation to the target variable. If a predictor has a meaningful relationship to the target variable all by itself, it has a linear relationship, such as a history of past payments in relation to the likelihood of a successful future payment. Keep in mind, a predictor variable may depend on another variable before its relationship to the target variable is meaningful.
For example, if studying the effect of fertilizer on plant growth, the size of the pot may not have a meaningful connection unless you also consider the pot size in relation to the plant type. That pot size would require a higher interaction depth to prove meaningful in calculating the target variable.
Minimum required number of objects in each tree node sets the size of the decisions. A bigger decision tree will have less specific logic, but the overall model will have fewer trees and run faster. Like most modeling options, the tradeoff comes down to what works for your model and use case.
A Random seed value controls how data is “randomly” selected as a sample. Since computers can’t actually make random selections, they use an algorithm that simulates randomization. Since it’s the same algorithm each time, it will make the same “random” selections each time. Changing the random seed value changes the algorithm’s starting point, giving you different selections while still allowing for consistent, repeatable checks.
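You can see the behavior with Python’s `random` module, standing in for whatever generator the tool actually uses; `draw_sample` is a made-up helper:

```python
import random

def draw_sample(seed, population=100, size=5):
    """Draw a 'random' sample that is reproducible for a given seed."""
    rng = random.Random(seed)
    return rng.sample(range(population), size)

# The same seed selects the same records every run...
print(draw_sample(1) == draw_sample(1))  # True
# ...while a different seed changes which records get picked.
print(draw_sample(1), draw_sample(2))
```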
Graphics. Wait, I know what those are!
It's so simple! The promised land!
The Boosted Model tool has a Report output anchor, and the settings under Graphics Options control the generated report. The Report output produces several reports, including the marginal effect plot if you chose to include it. These reports can be used to analyze the effectiveness of the model and get a visual on data relationships.
The output anchor produces the actual model. This is what you connect to a Score tool with some new data to get new predictions.
And that’s it!
What this all boils down to:
Know your data
Know your model type and its limitations
Try every option and don’t be afraid to get it wrong at first
I’m pretty sure that’s true for all of predictive, actually. And, since this didn’t dissuade me from learning predictive capabilities, I’m planning on tackling the messiest Help we have in the suite: Decision Tree. Yup, that model that makes up this model? We have a tool for that.