The predictive tools can be intimidating to many Alteryx users. Even members of our own internal teams sometimes shake their heads and walk away rather than try to scale that steep learning curve. I started reading through Designer Help, trying to understand. And while the pages are accurate and well laid out, they are far from a learning guide. That made them the perfect place to start building a better starting place.
I started with Boosted Model because, well, it’s first on the Tool Palette.
I connected up a .yxdb from the available sample data and started clicking around the configuration window. Looking at Required parameters, this seemed really easy.
Sampling weights? Marginal effect plots? What are those and why are they optional? And, most importantly, what should I click and why?
Weighting is a technique used to adjust imbalanced data where over-represented groups get a smaller weight, and under-represented groups get a larger weight. It’s basically trying to make the data more accurately represent the sampled population without having to go get more samples.
If you have a dataset that didn’t get enough samples of people aged 45-55 applying for a second refinance loan, you’d want to use this, because that group needs to be represented accurately if you’re going to figure out who to refinance.
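If it helps to see the idea outside the configuration window, here’s a minimal sketch in Python using scikit-learn’s GradientBoostingClassifier. It’s an analogy for the concept, not the R macro the Boosted Model tool actually runs, and the data and weights are made up:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy data with a 9:1 imbalance, standing in for an under-sampled group.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Give the under-represented group a proportionally larger weight so the
# model can't just learn to ignore it.
sample_weights = np.where(y == 1, 9.0, 1.0)

model = GradientBoostingClassifier(random_state=0)
model.fit(X, y, sample_weight=sample_weights)
```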
Marginal effect, it turns out, is the amount of impact a field has on a calculation, written as a percentage. So if you’re trying to determine what economic class someone is in, their salary has a huge impact. The number of beanie babies their second cousin has, not so much.
This option doesn’t just let you see how big a deal a variable is in the overall calculation. It also lets you determine what to ignore. Set a minimum percentage, and any variable that doesn’t have that big an impact isn’t charted.
If you’re trying to determine how variables impact your loan estimation, you would plot the marginal effect and set a low percentage to see how important each field is.
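Here’s the same idea sketched with scikit-learn’s variable importance scores. It isn’t the exact statistic Alteryx plots, but it shows the “set a minimum percentage and ignore the rest” behavior:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Each field's impact on the calculation, expressed as a percentage.
importance_pct = 100 * model.feature_importances_

# Only report fields above a minimum percentage, like the plot threshold.
threshold = 5.0
for field, pct in enumerate(importance_pct):
    if pct >= threshold:
        print(f"field {field}: {pct:.1f}%")
```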
Model customization was…well, more intimidating. I was definitely second-guessing my decision.
First was this target type thing, which is the pattern the data in the target field follows.
On top of that, target types pair with loss function distributions. In classification problems, values can be incorrectly classified (it happens); loss functions account for this by assigning a penalty value. Training the model means minimizing that function, which keeps the total penalty as small as possible. In terms of learning what each function does, I recommend your favorite search engine.
Loss function distributions can be tricky to choose between, and sometimes you don’t even need to. But when there are options, it’s recommended to try all of them and see which best suits your model, your data, and your use case.
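As a rough sketch of what “try all of them” looks like in code (a scikit-learn analogy again, and assuming a reasonably recent version for these loss names), you can fit the same continuous target with several loss functions and compare scores:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=0)

# A continuous target can be fit with more than one loss function;
# compare them and keep whichever suits your data and use case.
for loss in ("squared_error", "absolute_error", "huber"):
    model = GradientBoostingRegressor(loss=loss, random_state=0)
    print(loss, round(cross_val_score(model, X, y, cv=3).mean(), 3))
```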
The maximum number of trees in the model caused me to finally read the inner workings of the Boosted Model macro. Trees refers to Decision trees, a type of model that runs “if-then” split rules. Basically, the tree checks for specific criteria (if), and when there’s a match, follows to the next rule (then). The Boosted Model stacks a bunch of these trees to determine the logic the model will use while calculating. The more trees there are, the more specific the logic can be, with a tradeoff of a longer runtime.
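A single decision tree really is just stacked if-then logic. Here’s a toy version written out by hand, with made-up field names, just to show the shape of it:

```python
# One tiny decision "tree" spelled out as the if-then rules it amounts to.
def tiny_tree(income, missed_payments):
    if income > 50_000:              # if: check a criterion...
        if missed_payments < 2:      # ...then: follow to the next rule
            return "likely to repay"
        return "review manually"
    return "unlikely to repay"

# A boosted model stacks many trees like this, each one nudging the
# overall prediction a little further.
print(tiny_tree(income=62_000, missed_payments=0))
```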
The Method to determine the final number of trees in the model is important because it makes sure that the model is actually predicting what it should be without adjusting for noise and randomness. This method’s job is to Keep Your Model Simple and not let it overthink.
Definitely try each method to see what works best, but know that cross validation is especially useful with datasets that have limited records, because it lets you more accurately see how the logic will generalize to new data. Test validation is useful with big sample sets. Out-of-bag validation is simply another method to check.
REALLY: It is really important to try all options because there might be connections you don’t see that work best with a method you aren’t considering.
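Here’s what the test-validation version of that looks like in a scikit-learn sketch: fit with a generous maximum, then use held-out data to see where extra trees stop helping (cross validation and out-of-bag validation do the same job in different ways):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Fit with a generous maximum number of trees...
model = GradientBoostingClassifier(n_estimators=500, random_state=0)
model.fit(X_train, y_train)

# ...then score the held-out data after each tree to find the point
# where adding more trees no longer improves generalization.
test_scores = [accuracy_score(y_test, pred)
               for pred in model.staged_predict(X_test)]
print("trees that generalize best:", int(np.argmax(test_scores)) + 1)
```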
Use The fraction of observations used in the out-of-bag sample to set the size of the test data. This is the test that checks model effectiveness, so it uses data that wasn’t used in creating the model. Using between 25% and 50% is common practice so you have a clear idea of how the model treats new data before using it for predictions.
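In a scikit-learn sketch the knob is framed the other way around: subsample is the fraction used to build each tree, and whatever is left over is the out-of-bag portion used for checking:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# subsample=0.7 builds each tree on 70% of the rows, leaving roughly 30%
# out-of-bag to check how the model handles data it wasn't fit on.
model = GradientBoostingClassifier(subsample=0.7, random_state=0).fit(X, y)

# Per-tree improvement measured on the out-of-bag rows; once this
# flattens out, extra trees aren't buying you much.
print(model.oob_improvement_[:5])
```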
Shrinkage weights the trees in the model. A small shrinkage value means each tree has a low individual impact on calculating the target, so the model will need more trees to compensate.
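The tradeoff shows up directly in a scikit-learn sketch, where shrinkage is called learning_rate; the numbers below are illustrative, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# Smaller shrinkage means each tree contributes less, so the model
# typically needs more trees to reach a comparable fit.
for learning_rate, n_trees in [(0.1, 100), (0.01, 1000)]:
    model = GradientBoostingClassifier(learning_rate=learning_rate,
                                       n_estimators=n_trees, random_state=0)
    score = cross_val_score(model, X, y, cv=3).mean()
    print(learning_rate, n_trees, round(score, 3))
```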
Interaction depth describes how predictor variables relate to the target variable. Some predictors have a meaningful relationship to the target all by themselves, such as a history of past payments in relation to the likelihood of a successful future payment, and a low interaction depth is enough to capture that. Keep in mind, though, that a predictor variable may depend on another variable before its relationship to the target variable becomes meaningful.
For example, if studying the effect of fertilizer on plant growth, the size of the pot may not have a meaningful connection unless you also consider the pot size in relation to the plant type. That pot size would require a higher interaction depth to prove meaningful in calculating the target variable.
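In the scikit-learn analogy the closest knob is the depth of each tree: a depth of 1 only captures one-variable-at-a-time relationships, while a depth of 2 can express “pot size only matters for this plant type”:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Depth-1 trees (stumps) split on one variable at a time, so only
# "on its own" relationships to the target are captured.
additive_model = GradientBoostingClassifier(max_depth=1)

# Depth-2 trees can split on plant type and then pot size, letting a
# pot-size-given-plant-type pattern show up in the model.
interaction_model = GradientBoostingClassifier(max_depth=2)
```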
Minimum required number of objects in each tree node sets how many records a node has to hold before it can split again. A larger minimum means less specific logic in each tree, but the trees stay smaller and the model runs faster. Like most modeling options, the tradeoff comes down to what works for your model and use case.
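Sketched with scikit-learn’s minimum-records-per-leaf setting (an analogy for the same idea), the tradeoff looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Requiring at least 50 records in each leaf keeps splits broad: less
# specific logic, but smaller trees and a faster model.
coarse = GradientBoostingClassifier(min_samples_leaf=50).fit(X, y)

# Allowing a single record per leaf permits very specific (and slower,
# potentially overfit) splits.
fine = GradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
```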
A Random seed value is responsible for how data is “randomly” selected as a sample. Since computers can’t actually make random selections, the tool uses an algorithm that simulates randomization. Given the same seed, that algorithm makes the same “random” selections every time, which is what keeps your results repeatable; change the seed and you get a different set of selections.
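You can see the effect with any pseudo-random generator; here’s a quick NumPy sketch:

```python
import numpy as np

# The same seed always yields the same "random" sample, so checks are
# repeatable run after run.
print(np.random.default_rng(1).choice(10, size=3))  # some fixed selection
print(np.random.default_rng(1).choice(10, size=3))  # identical to the line above
print(np.random.default_rng(2).choice(10, size=3))  # new seed, new selections
```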
The Boosted Model tool has a Report output anchor, and the settings under Graphics Options control the generated report. The Report output produces several reports, including the marginal effect plot if you chose to include it. These reports can be used to analyze the effectiveness of the model and get a visual on data relationships.
The other output anchor produces the actual model object. This is what you connect to a Score tool, along with some new data, to get predictions.
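In the scikit-learn analogy, that last step is just calling the fitted model on records it has never seen:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Train on historical data...
X_history, y_history = make_classification(n_samples=1000, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_history, y_history)

# ...then "score" brand-new records, which is what wiring the model
# object into a Score tool alongside new data does in a workflow.
X_new, _ = make_classification(n_samples=5, random_state=1)
print(model.predict_proba(X_new))
```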
What this all boils down to: try the options, compare the results, and let your data and use case decide.
I’m pretty sure that’s true for all of predictive actually. And, since this didn’t dissuade me from learning predictive capabilities, I’m planning on tackling the messiest Help we have in the suite: Decision Tree. Yup, that model that makes up this model? We have a tool for that.
Tanya Stere is the Product Manager supporting Alteryx Server, looking to make sure ideas make it from the white board to your desktop. She primarily works to make Server match your needs.