
Data Science

Machine learning & data science for beginners and experts alike.
TanyaS

After the great challenge that was the Boosted Model, I took a respite, a long respite, before coming back. But after many moons, I have taken on another mighty challenge: a tree.

 

How can it be scary? It doesn't even have roots!

So what does the Decision Tree tool do?

The Decision Tree tool creates a set of if-then split rules to optimize model creation criteria based on Decision Tree Learning methods. Rule formation is based on the target field type:

  • If the target field is a member of a category set, a classification tree is constructed.
  • If the target field is a continuous variable, a regression tree is constructed.

Use the Decision Tree tool when you want to predict a target field using one or more predictor fields, whether that's a classification problem or a continuous-target regression problem.

If you hear the sound of crickets, know that it didn't mean anything to me when I first read it, either. So, put MORE simply:

 

You have a data set that contains all kinds of pets and descriptors about them. You want to use this to train a model that will be used in predicting what kind of animal is described based on the provided descriptors.

 

The Decision Tree tool looks at the descriptors and splits the data. Then it does it again and again and again.


T for True, F for False. You got that, right?
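Under the hood, the rpart and C5.0 options in this tool are the R packages of the same names, so you can see the same split-again-and-again behavior in a few lines of R. A minimal sketch, using R's built-in iris data as a stand-in for the pet table:

    library(rpart)

    # iris stands in for the pet data: four descriptors, one categorical target
    fit <- rpart(Species ~ ., data = iris, method = "class")
    print(fit)  # each line is one if-then split, nested again and again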

To get this, you only need the Setup page. It's the standard stuff: What to call the model, what you're trying to predict, and what variables you want to use to make predictions. Easy peasy.

 

But, you can click the Customize button and find a whole new world!

 

I had to

Model tab, for data evaluation and model building!

The Decision Tree tool allows for all kinds of customization, right down to letting you pick what algorithm to use.

 

rpart: Why pick rpart? It's the standard, it's what you use for a regression model, and it provides a report called a pruning plot. So if you need that plot, need a regression model, or just really aren't sure, rpart is a good choice.

 

C5.0: Why pick C5.0? It sorts data into mutually exclusive classes and creates a rule set. So it functions a little differently, and may be better suited to your data depending on how your categories are set up.

 

However, there's always the option of trying both! Just like with the Boosted Model, trying multiple setups is encouraged, because there might be relationships you don't know about that Designer can help you find. 😄
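A minimal sketch of trying both in R, fitting each algorithm to the same stand-in data (this compares in-sample accuracy purely for illustration; in practice you'd compare on held-out data):

    library(rpart)
    library(C50)

    r_fit <- rpart(Species ~ ., data = iris, method = "class")
    c_fit <- C5.0(Species ~ ., data = iris)

    # in-sample accuracy, purely for illustration
    mean(predict(r_fit, iris, type = "class") == iris$Species)
    mean(predict(c_fit, iris) == iris$Species)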

 

Let's play with rpart first

You click on rpart, and you get three dropdowns.

 

The Model Type and Sampling Weights dropdown: here you find a neat selector for your model type. It's automatically set to Auto, which assesses the target variable to select the right model type, which is pretty neat! (However, if you know you are creating a classification model, you can select that and gain some additional controls in the Splitting Criteria and Surrogates dropdown.) You can also choose to use sampling weights. You select a field from your dataset, and Designer uses that field to determine whether a record is more or less valuable. That record is then weighted accordingly in the later calculations.
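In rpart terms, those two controls are roughly the method and weights arguments. A sketch, with a made-up weight column:

    library(rpart)

    # method = "class" forces a classification tree; method = "anova" forces regression.
    # Leaving method out is the "Auto" behavior: rpart infers it from the target type.
    w <- ifelse(iris$Species == "setosa", 2, 1)  # hypothetical sampling weights

    fit <- rpart(Species ~ ., data = iris, method = "class", weights = w)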

 

The Splitting Criteria and Surrogates dropdown: What model type are you making? That's gonna impact how this looks. Did you select classification? You see this:

 

splitting criteria, yes yes...of course!

The splitting criteria option does not show up if you're using regression or if you selected Auto. If you picked Auto, Decision Tree doesn't know whether you're doing regression or classification until it starts building the model. If you selected regression, Decision Tree uses Least Squares, so you don't need to make a choice.

 

Just so you know, the Gini impurity (not the Gini coefficient) is being used. And there isn't a big difference in results whether you use the Gini impurity or the information index. The information index is based on information theory, so if that sounds thrilling, go for it! Or Gini! Or both! (Always both!)
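In rpart itself, this choice is the split element of the parms argument. A sketch of fitting it both ways (always both!):

    library(rpart)

    gini <- rpart(Species ~ ., data = iris, method = "class",
                  parms = list(split = "gini"))
    info <- rpart(Species ~ ., data = iris, method = "class",
                  parms = list(split = "information"))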

 

However your splitting is done, you have the options to control how surrogates are used.

 

What is a surrogate?

 

Glad you asked! I had no idea! But they're really cool once you figure them out. Surrogates are variables that are kind of related to the variable being used to determine a split. So, in the pet data, the data for whether a pet is fuzzy might be missing, but the data for whether it hunts mice might evaluate kinda similarly. Then, for all the records missing is-fuzzy data, the split is determined using hunts-mice data. You also get to decide how the surrogate data is evaluated, because there will likely be records that are missing the surrogate data too.

 

Omit: doesn't have the surrogate data, doesn't get used in the evaluation. Keeps it simple!

 

Split: records without the surrogate data are split evenly between the branches of the split. The current branch is balanced, but may cause additional complexity since the data is split.

 

Majority direction: records without the surrogate data are funneled down the majority path. This means they can all be evaluated against each other on a later branch, but that the current split isn't as balanced.

 

Select best surrogate: the surrogate variable is chosen by how closely it matches the primary split, and that can be measured either by the total number or by the percentage of records correctly classified. Measuring by total number discourages the model from using variables with a large number of missing values more than measuring by percentage does.
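For the curious, rpart exposes its surrogate behavior through rpart.control. How the tool's four choices map onto these arguments is my reading of the rpart documentation, not something the tool spells out:

    library(rpart)

    ctrl <- rpart.control(
      maxsurrogate   = 5,  # keep up to five backup variables per split
      usesurrogate   = 2,  # if every surrogate is also missing, follow the majority direction
      surrogatestyle = 0   # 0 = rank surrogates by total correct records, 1 = by percent correct
    )

    fit <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)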

 

The HyperParameters dropdown: Prior distribution controls. A prior distribution is what you assume about the data before everything can be taken into account: what you know about the model plus an unknown element, which may be a model parameter, a variable, or something else that impacts the data distribution. These controls let you set parameters for how the model will be built, specifically around model complexity and split requirements. (There's a sketch of the matching rpart arguments after this list of options.)

 

The minimum number of records needed to allow for a split: Set how much data is required for each split. Higher number, fewer splits. Lower number, more splits.

 

The allowed minimum number of records in a terminal node: Set how many records are needed for something to register as a final option. Lower number, more final options. So with low enough values, the node "bird" might be just a split point leading down to specific species cross-referenced by name. (How many sparrows do you know named Dave? Right? It's a really low number.)

 

Number of folds to use in the cross-validation to prune the tree: MOUTHFUL AND A HALF. Cross-validation takes a dataset, breaks it up, and compares it to the other parts of itself. Folds are the breaks. The Decision Tree gets "pruned" using cross-validation to make sure that the data isn't locked into such tiny buckets that it has become inaccurate due to overfitting. This would help correct for the sparrows-named-Dave issue.

 

Maximum allowed depth of any node in the final tree: In the pet tree, the data is three nodes deep. You can set a limit to stop the tree from growing out too far. This reduces processing time and can help prevent overfitting.

 

Set complexity parameter: How big should your tree be? That really depends on your dataset and what information you're looking to get out of it. If you're not sure, don't set the complexity parameter. It will use cross-validation to come up with a number for you.
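Here's that promised sketch: all of the options above correspond to arguments of rpart.control (the values shown are rpart's usual defaults, except the depth, which I've tightened for illustration):

    library(rpart)

    ctrl <- rpart.control(
      minsplit  = 20,    # minimum records a node needs before a split is attempted
      minbucket = 7,     # minimum records allowed in a terminal node
      xval      = 10,    # folds of cross-validation used to prune the tree
      maxdepth  = 5,     # no node may sit more than five splits below the root
      cp        = 0.01   # complexity parameter: a split must improve the fit by this much
    )

    fit <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)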

 

With all that, you've got the rpart of the model done.

 

But I want to use C5.0! 

....Okay. We can do that. I can help. Just let me pull out my notes.

 

Honestly, my notes don't look like that. At least not since I finished my English degree.

Okay, got C5.0 selected? Let's get into the three dropdowns for that.

 

The Structural Options dropdown: Remember how C5.0 can make your tree into rules? This is how. The Decision Tree gets broken down into a bunch of unordered if-then rules that are compared collectively instead of linearly, like a tree. It's really cool, very Decisiony, not so very Tree-y.
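In the C50 package this is a single flag, rules = TRUE. A sketch:

    library(C50)

    rules_fit <- C5.0(Species ~ ., data = iris, rules = TRUE)  # rule set, not a tree
    summary(rules_fit)  # prints the unordered if-then rules and their coverage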

 

The Detailed Options dropdown: This dropdown provides some controls that allow you to simplify the model. Most of the options here are useful for reducing overfitting, stopping a tree from making too many splits, and working out which predictors aren't helpful.

 

There's one really cool option: Evaluate advanced splits in the data. If a descriptor falls within a range close to the threshold used to split the data, this option uses later splits to evaluate data against the current split. So if you've got data on the weight of pets, and a record describes a medium-sized fuzzy animal that's not obviously large or small, this would check both of the downstream options to find the best fit. That way you could tell whether it was a Doberman or someone's pet puma.
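In C50 terms, my best guess is that these options live in C5.0Control: fuzzyThreshold is the one that softens splits near the cutoff, and winnow is the one that screens out unhelpful predictors. A sketch under that assumption:

    library(C50)

    ctrl <- C5.0Control(
      winnow         = TRUE,  # drop predictors that don't appear to help
      fuzzyThreshold = TRUE   # soften splits for records that fall near the threshold
    )

    fit <- C5.0(Species ~ ., data = iris, control = ctrl)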

 

The Numerical Hyperparameters dropdown: Similar to the HyperParameters of rpart, the numerical hyperparameters work on the prior distribution of the model, just using numerical values for it.
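A sketch of two of those numerical knobs as they appear in C5.0Control (the exact set the tool exposes may differ):

    library(C50)

    ctrl <- C5.0Control(
      CF       = 0.25,  # confidence factor: lower values prune more aggressively
      minCases = 10     # minimum records that must land in at least two branches of a split
    )

    fit <- C5.0(Species ~ ., data = iris, control = ctrl)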

 

CROSS-VALIDATION! WE ARE BEYOND THE MODEL!

Well. Not beyond the model. Just into validating the model rather than setting how it's built. Select the option to use cross-validation here. It takes precedence over the cross-validation you might have set with rpart, but it means you get cross-validation regardless of algorithm.

 

Like I said, cross-validation takes a dataset, breaks it up, and compares it to the other parts of itself. Folds are the breaks. Trials are how many times it compares.

 

External cross-validation samples from an external data set, which means you need to determine what you're going to pull in. If you choose to seed the data pulled in, it still pulls random values, but it pulls the same random values between runs. This is really useful when you're building your model and checking it for effectiveness. Otherwise, you get really-actually random, and that's hard to compare against other really-actually random data.
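To make the seeding idea concrete, here's a hand-rolled cross-validation sketch in R (not the tool's own implementation): the fixed seed means the "random" folds come out identical on every run, so model tweaks stay comparable run to run.

    library(C50)

    set.seed(42)  # fixed seed: the same "random" folds every run
    k <- 5
    folds <- sample(rep(1:k, length.out = nrow(iris)))

    acc <- sapply(1:k, function(i) {
      fit <- C5.0(Species ~ ., data = iris[folds != i, ])
      mean(predict(fit, iris[folds == i, ]) == iris$Species[folds == i])
    })
    mean(acc)  # average accuracy across the five held-out folds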

 

Plots, graphs, visuals, Oh my!

The plots that are displayed ALSO vary by which algorithm you select. Remember how rpart makes pruning plots? That's important, because C5.0 doesn't. rpart gives you the option to create and customize either a tree plot, which displays the tree, root to leaves, and all the decisions in between, or a pruning plot, which displays the tree simplified, pruned down to the important bits. Other than that, it's pretty standard graphs-in-Designer options.
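If you're curious what those two plots look like straight from rpart, a sketch: draw the full tree, look at the pruning plot, then prune back to the complexity value with the lowest cross-validated error.

    library(rpart)

    fit <- rpart(Species ~ ., data = iris, method = "class")

    plot(fit); text(fit, use.n = TRUE)  # tree plot: root to leaves, every decision
    plotcp(fit)                         # pruning plot: cross-validated error vs. cp

    # prune back to the cp with the lowest cross-validated error
    best   <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
    pruned <- prune(fit, cp = best)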

 

And that's about it.

Other than the customization based on algorithm, this is a pretty simple tool. As you can tell, that makes up the majority of "what on earth did I learn writing the Help" above.

 

If you want to learn more, there's a Tool Mastery article that gets into more of the nitty gritty of configuration, and another on output interpretation that helps you figure out if your results are the results you want.

Tanya Stere
Product Manager - Server

Tanya Stere is the Product Manager supporting Alteryx Server, looking to make sure ideas make it from the white board to your desktop. She primarily works to make Server match your needs.
