

In this blog post, I share an example module that demonstrates how to use a new Decision Tree macro, developed by Dr. Dan Putler.

 

The module begins with a Text Input tool that contains pre-processed data. The pre-processing includes over-sampling the original data so that the Default "Yes" and "No" field values are about equally likely. Also, we have created both Estimation and Validation sample sets, as indicated in the last column. The data itself is taken from the UC Irvine machine learning data archive, http://archive.ics.uci.edu/ml/. It consists of German credit records that contain the following fields: Chk_Bal, Duration, Credit_Hist, Purpose, Amount, Savings_Bonds, Employ_Length, Debt_Income, Gender_Marital, Debtor_Guarantor, Length_Res, Property, Age, Otr_Install, Tenure, Num_Loans, Job_Type, Dependents, Telephone, Foreigh_Worker, and Default. Notice that many of these fields contain values that fall into easy-to-identify categories. To learn more about the dataset and its attributes, visit https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data).
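For readers curious what this kind of pre-processing looks like outside of Alteryx, here is a minimal R sketch of the same idea. The file name, the 70/30 split, and sampling with replacement are assumptions for illustration, not the steps used to build the packaged data:

    # Hypothetical sketch of the pre-processing described above -- not the
    # code used to build the packaged data.
    credit <- read.csv("german_credit.csv")  # assumed file name
    yes <- credit[credit$Default == "Yes", ]
    no  <- credit[credit$Default == "No", ]
    # Over-sample the "Yes" records so the two outcomes are roughly balanced
    set.seed(1)
    yes <- yes[sample(nrow(yes), nrow(no), replace = TRUE), ]
    balanced <- rbind(yes, no)
    # Flag each record as Estimation or Validation (70/30 split assumed)
    balanced$Sample <- ifelse(runif(nrow(balanced)) < 0.7,
                              "Estimation", "Validation")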

 

We want to be able to predict whether a given individual is likely to default based on our observations so far. You might expect that certain fields are better predictor variables than others. Is a single man more likely to default than a married woman? Is a younger or older individual more likely to default? Is an individual's credit history or debt-to-income ratio a good predictor variable? In this case, the Default field is our target variable, because it represents the behavior that we want to predict. The algorithm uses the predictor variables to progressively divide the cases in the data into smaller groups. The final groups, called leaves, should each contain mostly defaulters or mostly non-defaulters.
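In R's modeling notation, which the macro uses under the hood, this target/predictor relationship is expressed as a formula: the target variable goes on the left of the tilde and the predictors on the right. A shortened sketch, using only a few of the fields listed above:

    # Target on the left, predictors on the right
    # (a "." on the right would mean "all other fields")
    Default ~ Chk_Bal + Duration + Credit_Hist + Amount + Age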

 

The example module also includes the new Decision Tree macro. Click on the macro to open its Configuration tab. Here, we can give the model a name, select the target variable, and choose the predictor variables. Next, we include a Subset expression to select our "Estimation" sample, and we stick with the remaining defaults: "Auto" for the complexity parameter (explained below), "Proportions" for the Leaf summary, and "Uniform branch distances" left checked. If we open the macro itself, Decision_Tree.yxmc, we can see how outputs from the R Tool are assembled into a report.
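Under those configuration choices, the R Tool inside the macro ends up making an rpart call along these lines. This is a simplified sketch rather than the macro's actual generated code, and the "balanced" data frame is carried over from the earlier sketch:

    library(rpart)
    # Simplified sketch of the call the configuration describes
    fit <- rpart(Default ~ Chk_Bal + Duration + Credit_Hist + Purpose + Amount,
                 data   = balanced,
                 subset = Sample == "Estimation",  # the Subset expression
                 method = "class")                 # classification tree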

 

The "rpart" package in the R Tool provides a "recursive paritioning" technique to produce our Decision Tree model. It determines which of the predictor variable fields does the best job splitting the data into two groups. Then, it repeats the process for each sub-group until an end condition is reached. The complexity parameter measures the cost of adding predictor variables to the model and can be used to specify an end condition. Setting the complexity parameter to "Auto" or omitting a value results in the "best" complexity parameter being selected based on cross-validation.

 

The macro uses the "rpart.plot" package for plotting our Decision Tree model. Note: The "rpart.plot" package is not installed with R; therefore, the macro attempts to download it from the Comprehensive R Archive Network (CRAN), http://cran.r-project.org/web/packages/rpart.plot/index.html. This will fail if you do not have access to the internet or permission to write to your R installation directory.
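A guarded install along these lines is the usual pattern for this; the exact CRAN mirror and any error handling in the macro are assumptions:

    # Load rpart.plot, fetching it from CRAN first if it is not yet installed
    if (!require("rpart.plot", quietly = TRUE)) {
      install.packages("rpart.plot", repos = "http://cran.r-project.org")
      library(rpart.plot)
    }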

 

When you run the example module, it produces a "Summary Report for Decision Tree Model", which can be viewed by clicking its link in the Output window. The report shows the actual rpart call, a Model Summary, the Pruning Table, the Leaf Summary, and a couple of plots. It turns out that the checking account balance field (Chk_Bal) is our strongest predictor variable, with the Duration and Amount fields also coming into play. The Tree Plot shows that 76% of individuals with a large checking account balance (or no checking account) do not default. Individuals with a small (or negative) checking account balance and a long loan duration tend to default, as do those with a large amount financed.
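The individual pieces of that report correspond to standard rpart and rpart.plot calls; the report assembly itself is Alteryx-specific, but rough equivalents look like this (continuing the objects from the earlier sketches):

    print(full$call)   # the actual rpart call, as shown at the top of the report
    summary(pruned)    # model summary, including splits and variable importance
    printcp(pruned)    # the pruning table
    # Tree plot: node boxes show per-class proportions and % of observations
    rpart.plot::prp(pruned, type = 2, extra = 104)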

 

Compared to other data mining methods, Decision Tree models are simple to understand and to interpret. They are fairly robust and typically perform well even with large data sets. They are a good tool for initial exploration of a given data set.

 

You can expect to see more predictive macros with the next release of Alteryx.

 

(7/10/12)  NOTE:  If you are using Alteryx Version 7.1, refer to the "Decision Tree" sample module instead of using the contents of the attached ZIP file.  There are subtle differences due to engine improvements.

 
