Data Science

BridgetT · ‎02-29-2016

What is R?

In addition to being a letter of the alphabet, R is also a powerful statistical programming language. People around the world, from students to experienced data scientists and cutting-edge researchers in many fields, use it. It’s not too difficult to do many basic tasks in R, but those tasks won’t be the focus of this post. The tools/macros we’ll discuss in this series of posts all use the Alteryx R tool in some capacity. The Alteryx R tool allows you to use the R language within an Alteryx workflow, app, or macro. We won’t be learning about the direct use of the R tool today, but if you want to learn more about using it, please comment on this post and ask me to write another post about it! This is my first EngineWorks post, so I’m still figuring out what topics Alteryx users want to hear more about.

How do I access these tools?

First, you’ll need to make sure that R and these tools are installed. To do so, open up Alteryx Designer, click “Options” on the toolbar and then choose “Download Predictive Tools.” These tools may not show up on your tool palette, but recall that you can find any tools using the “Search All Tools” option in the upper left corner next to the “Favorites” tab.

How do I know which tool(s) to use for my prediction problem?

Step one: What’s the Problem?

First, you need to decide whether you have a classification or a regression problem. You can count the number of potential outcomes in a classification problem. For example, trying to predict whether a customer will buy a certain item is a classification problem because it only involves two outcomes (either the customer buys the item or not). Predicting whether a person will drive, bike, walk, or take public transit to work is also a classification problem, because it involves four possible outcomes (the four different potential modes of transportation). The set of possible outcomes of a regression problem, on the other hand, is continuous. Intuitively, a continuous variable can take any value within a certain range. For example, predicting people’s heights is a regression problem, because adult heights could range anywhere from 21.5 inches to 107.1 inches. For the rest of this post, we’ll assume you’re solving a classification problem, but a future post about regression tools may be forthcoming!

Step two: What’s in the data?

Next, you need to consider your data. For example, if you have missing values in your data, you need to take extra care in selecting your predictive tool(s). Some of the tools, such as the Forest Model, Neural Network, Support Vector Machines (SVM), and Logistic Regression, do not allow you to use data with missing values. If you have data with missing values and wish to use one of these tools, you have two options. The first option is to use an Imputation tool to impute your missing values before using one of these tools. In this example, an entry is missing from the Age field. So we impute it before using the aforementioned tools on our data. If you’ve never used the Imputation Tool before, you might want to play around with it for a bit. (Note: It’s in the Preparation category if you’re having trouble finding it.) But the basic idea behind this tool is pretty simple. If you have data with null records (or another type of record you don’t want), the Imputation Tool allows you to replace these records with the field average, median, mode, or another value you can choose to specify. In this example, we impute with the average.
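To make the idea concrete, here is a minimal Python sketch of what imputing with the average does. It is not the Imputation Tool's actual implementation; the impute_with_mean helper and the sample records are made up for illustration, with the known ages chosen so that their average is 45.

```python
from statistics import mean

def impute_with_mean(records, field):
    """Return copies of the records with None in `field` replaced by the field's mean."""
    known = [r[field] for r in records if r[field] is not None]
    fill = mean(known)  # average of the values that are present
    return [
        dict(r, **{field: fill}) if r[field] is None else dict(r)
        for r in records
    ]

# Hypothetical survey data; one Age entry is missing (None).
survey = [
    {"Age": 30, "Response": "Yes"},
    {"Age": None, "Response": "No"},
    {"Age": 60, "Response": "No"},
]

imputed = impute_with_mean(survey, "Age")
print(imputed[1]["Age"])  # the mean of the two known ages, 30 and 60
```

The original records are left untouched, so you can compare the data before and after imputation.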

The second option for dealing with missing data is to use a Filter tool to remove the missing values.

Now, the bad news at this point is that both of these approaches for dealing with missing data have their flaws. The main limitation of imputing null values is that it can cause our model to be based on incorrect information. In the example above, the missing age value was imputed as 45. But what if this person was actually 18? Then we could have gotten a very different model from the “true” model resulting from the complete data set. The primary flaw of filtering out null data points, on the other hand, is that it can bias our data. Suppose in our example that we had a few more missing age values. Let’s say that all of these values were missing because older respondents did not feel comfortable disclosing their ages in this survey. Then basing our model only on complete records (which is what happens when we filter out nulls) will bias it towards younger respondents.
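The bias described above is easy to reproduce with a few numbers. The following Python sketch uses invented ages in which only the older respondents have missing values; it compares the true average age with what we would see after filtering and after imputing with the average.

```python
from statistics import mean

# Hypothetical true ages and what the survey actually recorded.
# The three oldest respondents declined to give their age.
true_ages = [22, 25, 31, 38, 64, 70, 75]
recorded = [22, 25, 31, 38, None, None, None]

# Filter approach: keep only the complete records.
complete = [age for age in recorded if age is not None]

# Imputation approach: replace each missing age with the observed average.
fill = mean(complete)
imputed = [fill if age is None else age for age in recorded]

print("true mean:    ", mean(true_ages))
print("filtered mean:", mean(complete))
print("imputed mean: ", mean(imputed))
```

With these made-up numbers, both the filtered and the imputed data have an average age of 29, while the true average is a little over 46. Neither approach recovers the older respondents, which is exactly the limitation discussed above.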

Luckily, we don’t always have to choose one of these imperfect options if we have missing data. There are other classification tools that handle missing data all on their own. The two included in the Predictive toolset are the Decision Tree and the Naïve Bayes Classifier. However, you need to include at least two predictor fields in order to use the Naïve Bayes Classifier.

In addition to missing data, you should consider the presence of outliers in your data. Some models, such as Support Vector Machines, Decision Trees, and the Forest Model, aren’t affected very strongly by outliers and are thus a good choice to use in data sets with them.

Finally, you should consider the number of features in your data. As mentioned above, the Naïve Bayes Classifier requires that you include at least two predictor fields. However, other methods are better suited to data sets with a smaller number of predictors. For example, Support Vector Machines, Logistic Regression, and Neural Networks can overfit your data if you have too many features. (There are ways to choose the parameters for these models so that overfitting is less likely, but those methods are outside the scope of this post.) Overfitting occurs when your model conforms too well to the data you used to train it and doesn’t generalize well to new data. It is especially likely to happen with models that use a large number of features. A technique frequently used to detect overfitting is called cross-validation. Unfortunately, Alteryx Designer does not have a built-in cross-validation tool at the time of this writing (Version 10.1). However, one is currently in the pipeline, so stay tuned for an announcement about its release!
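For readers who haven't seen cross-validation before, here is a rough Python sketch of the k-fold version: the records are dealt into k groups, and each group takes a turn as the held-out test set while the model is trained on the rest. This is only an illustration of the idea, not an Alteryx tool. The function names, the sample data, and the placeholder "majority class" model are all assumptions; in practice you would plug in the model you actually want to evaluate.

```python
import random
from collections import Counter

def k_fold_indices(n_records, k, seed=0):
    """Shuffle the record indices and deal them into k folds of near-equal size."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(records, labels, fit, predict, k=5, seed=0):
    """Train on k-1 folds, test on the remaining fold, and return each fold's accuracy."""
    scores = []
    for held_out in k_fold_indices(len(records), k, seed):
        held = set(held_out)
        train = [i for i in range(len(records)) if i not in held]
        model = fit([records[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, records[i]) == labels[i] for i in held_out)
        scores.append(hits / len(held_out))
    return scores

# Placeholder model: always predicts the most common label seen in training.
def fit_majority(train_records, train_labels):
    return Counter(train_labels).most_common(1)[0][0]

def predict_majority(model, record):
    return model

# Hypothetical data: customer ages and whether each customer bought the item.
ages = [23, 35, 41, 52, 29, 64, 38, 47, 55, 31]
bought = ["No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "No"]

scores = cross_validate(ages, bought, fit_majority, predict_majority, k=5)
print(scores, sum(scores) / len(scores))
```

A model that scores much better on its own training data than on the held-out folds is probably overfitting.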

If you have a data set with many features, you should consider using a tree-based tool such as the Decision Tree or a Forest Model. With the Decision Tree, you can limit the number of features used by changing any of the following settings under the Model Customization tab as directed:

Increasing the minimum number of records needed to allow for a split
Increasing the allowed minimum number of records in a terminal node
Decreasing the maximum allowed depth of any node in the final tree

Try playing around with various combinations of these three options and seeing what kind of model you obtain!
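If it helps to see how those three settings interact, the Python sketch below checks whether a proposed split would be allowed. It is a simplified illustration and not the Decision Tree tool's code. The TreeLimits class, the split_allowed function, and the default values are my own choices for the example (they resemble the minsplit, minbucket, and maxdepth arguments of rpart.control in R's rpart package, but treat that mapping as an assumption).

```python
from dataclasses import dataclass

@dataclass
class TreeLimits:
    min_split: int = 20   # minimum records in a node before a split is attempted
    min_bucket: int = 7   # minimum records allowed in a terminal node
    max_depth: int = 20   # maximum depth of any node in the final tree

def split_allowed(n_node, n_left, n_right, depth, limits):
    """Return True if splitting a node at `depth` into two children respects all limits."""
    if n_node < limits.min_split:
        return False  # too few records to consider splitting this node
    if min(n_left, n_right) < limits.min_bucket:
        return False  # one of the children would be too small
    if depth + 1 > limits.max_depth:
        return False  # the children would sit deeper than allowed
    return True

loose = TreeLimits(min_split=2, min_bucket=1, max_depth=30)
strict = TreeLimits(min_split=40, min_bucket=15, max_depth=3)

print(split_allowed(30, 12, 18, depth=2, limits=loose))   # True
print(split_allowed(30, 12, 18, depth=2, limits=strict))  # False
```

Stricter limits reject more splits, which leaves a smaller tree that uses fewer features.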

Forest Models are even better for limiting overfitting, because their design is customized for this very purpose. (A mathematical explanation of their design is beyond the scope of this post, but I’d be happy to provide one in the comments section for any interested readers.) The default settings for the Forest Model will likely result in a model that is not overfitted. However, if you’re still concerned about overfitting, you can also customize the model parameters in the following ways under the Model Customization tab. These steps will likely limit the number of features used by your model:

Check the box for “Directly limit the overall size of each node tree.” Then decrease the total allowable nodes in a tree.
Increase the minimum number of records allowed in a tree node.

How do I understand the output of the tool(s) I chose?

All of the classification tools have at least two outputs: an O output for the model object, and an R output for a report about the model. If this is your first time using the Predictive tools, you can probably ignore the O output for now. However, you should attach a Browse tool to the R output so you can actually see the report about your model. Here’s a partial screenshot from the R output on the Support Vector Machines tool:

Under Model Summary, the call is a copy of the R code used within the Support Vector Machines tool in order to generate the model. The target is the variable being predicted, and the predictors are the variables used to predict the target. The values for cost and gamma are parameters for the model. They can be user-specified, but I’d suggest letting the tool select them for you if this is your first time using the Predictive tools. (If you want a more detailed explanation of these parameters, please ask for one in the comments!) The confusion matrix displays how many members of each class the model assigned to the various possible classes. In this example, the model assigned 9 members of the “No” class correctly (the top left cell) and 6 incorrectly (the bottom left cell). It also classified 3 members of the “Yes” class incorrectly (the top right cell) and 12 correctly (the bottom right cell).
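The four counts from this confusion matrix can be turned into a couple of summary rates. The short Python sketch below does the arithmetic; the dictionary layout and the recall helper are just one convenient way to write it down, not something the tool produces.

```python
# Counts from the confusion matrix described above.
# Keys are (actual class, predicted class).
confusion = {
    ("No", "No"): 9,     # "No" records classified correctly
    ("No", "Yes"): 6,    # "No" records classified incorrectly
    ("Yes", "No"): 3,    # "Yes" records classified incorrectly
    ("Yes", "Yes"): 12,  # "Yes" records classified correctly
}

total = sum(confusion.values())
correct = sum(n for (actual, predicted), n in confusion.items() if actual == predicted)
accuracy = correct / total

def recall(cls):
    """Share of the records that truly belong to `cls` that the model labeled as `cls`."""
    actual_total = sum(n for (actual, _), n in confusion.items() if actual == cls)
    return confusion[(cls, cls)] / actual_total

print(accuracy)                     # 0.7
print(recall("No"), recall("Yes"))  # 0.6 0.8
```

So the model classified 21 of the 30 records correctly, and it did better on the “Yes” class than on the “No” class.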

Additionally, the report for Support Vector Machines contains an SVM classification plot. This plot is colored based on the region associated with each class to which the Support Vector Machines model would assign data.

In this example, the pink/purple region corresponds to “Yes” responses, and the teal region corresponds to “No” responses. So for example, a 30-year-old whose family made $200,000 would be predicted as a “Yes” response because such a person would fall into the pink/purple region.

The other classification models all generally have similar “Model Summary” sections as well as a Confusion Matrix, in addition to output that is unique to the model type. If you have questions about the output from other models, feel free to ask them in the comments section!

Well, that’s finally the end of what I have to say (for now, at least). Feel free to take a look at the attached workflow if you want a more hands-on example of using these tools. Try changing some of the input data and even model parameters if you’re feeling adventurous! Thanks for reading, and please leave your feedback in the comments section! My next post will likely be about evaluating your model’s accuracy, but I’m open to other suggestions if you’d like to learn about something else. Finally, don’t forget to visit the new Predictive District on the Gallery if you want to check out even more predictive tools!

Atabarezz · ‎02-29-2016

Good job Bridget,

It may be very nice to have some short Kaggle tutorials using Alteryx, don't you think...

Best

BridgetT · ‎02-29-2016

Thanks, Atabarezz! Yes, that would be a good idea. I don't have a ton of experience with Kaggle, but it would probably be a good way to expose Alteryx to more data scientists.

mcwallendjack · ‎03-15-2016

Love the post Bridget!

I would be very much interested in a blog post about using the R-tool.

Also, I would love a mathematical explanation of the Forest Tool design as you mention in your post.

Thanks!

Atabarezz · ‎03-18-2016

a similar post with time series and

using xgboost for forecasting would be awesome...

BridgetT · ‎03-21-2016

mcwallendjack: There is already a blog post that talks about some aspects of the R tool here, but it's more about troubleshooting than about explaining its configuration. Do you have any experience using R without Alteryx, or do you want to see a post that talks about both the basics of using R in general and how to configure the R tool within Alteryx? And as for the explanation about Random Forests: a Random Forest model is essentially an average of many smaller decision trees. A single decision tree looks something like this:

However, as I mentioned in the original post, using only a single decision tree on a data set with many features runs the risk of overfitting the data. For example, if your data set has 10000 features but only 100 records (this scenario happens a lot with biological data problems), a decision tree might find that using 90 of the features allows for a perfect classification on the training data. However, if you try to apply that model to new data, the selection process used to train the model might not generalize well. Again, tuning the parameters correctly can prevent this problem, but a Random Forest takes a different approach: it randomly splits your data into multiple smaller subsets and trains a decision tree model on each subset. So in the example with 100 records, it might take 10 different subsets with 80 records each and train a decision tree on each one. Then (assuming that you're solving a classification problem), the Random Forest algorithm will re-apply each of the 10 decision tree models to the entire data set. Finally, the algorithm assigns each record to the mode of the 10 models' predictions. So if our example only has classes "Yes" and "No," and a record gets 8 "Yes" votes and 2 "No" ones from the decision trees, the Random Forest algorithm will assign it to the "Yes" class. The algorithm works similarly for regression problems; it just assigns records to the mean of the predictions rather than the mode, since the mean is generally a more meaningful summary with continuous data. This procedure helps prevent overfitting because we're essentially averaging models trained on different data sets rather than using a single model trained on just one data set.
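Here is a small Python sketch of the subset-and-vote idea described above. It follows the simplified description in this comment rather than the randomForest package itself (which, among other differences, draws its samples with replacement and also randomizes the features considered at each split). The function names and the sample predictions are made up for illustration, and the individual trees' predictions are simply given rather than trained.

```python
import random
from collections import Counter
from statistics import mean

def draw_subsets(n_records, n_trees, subset_size, seed=0):
    """Pick a random subset of record indices for each tree to train on."""
    rng = random.Random(seed)
    return [rng.sample(range(n_records), subset_size) for _ in range(n_trees)]

def forest_classify(tree_votes):
    """Classification: return the mode of the trees' predicted classes."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_predictions):
    """Regression: return the mean of the trees' predicted values."""
    return mean(tree_predictions)

# 10 subsets of 80 records each, as in the 100-record example.
subsets = draw_subsets(n_records=100, n_trees=10, subset_size=80)

# One record's votes from the 10 trees: 8 "Yes" and 2 "No".
votes = ["Yes"] * 8 + ["No"] * 2
print(forest_classify(votes))                    # Yes

# Hypothetical regression predictions from four trees.
print(forest_regress([61.0, 64.5, 66.0, 63.5]))  # 63.75
```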

Atabarezz: I could definitely write another post about time series forecasting as well, especially since I recently finished two macros extending the functionality of the existing TS macros. (By "finished," I mean they're actually in Beta right now, but you should look for them around the 10.5 Predictive Release!) I hadn't heard of xgboost before your comment, but it looks pretty cool, so maybe I can learn more about it and write something about it in the future!

Ok, so right now I have the following ideas for my next posts:

1. Validating your predictive model

2. Creating a regression model

3. Using the R tool (this may or may not include the basics of R itself, depending on what mcwallendjack and others want)

4. Forecasting Time Series models, possibly using xgboost

Would other readers like to see anything else? And right now that's my intended order of publication, but if anyone would prefer a different order, please let me know!

mcwallendjack · ‎03-22-2016

Hi Bridget,

Thanks for the post! Super Informative!

I personally don't have any experience using R without Alteryx. I'd appreciate (that is...if you're willing to create the content. Don't want to overburden you) any basic material on both using R without Alteryx as well as configuring the R tool in Alteryx.

I also like your priority list of next posts. They are all helpful to me.

Would you be appending these posts on top of this one or authoring a whole new post on the Engine Works blog?

Thanks!

JohnJPS · ‎03-22-2016

Here's a good article on xgboost: http://dmlc.ml/rstats/2016/03/10/xgboost.html. Note the comment that it has been used in roughly half the winning Kaggle solutions since it appeared. Deep Learning using GPUs is very common too... but deep learning is only available to those willing to get something working with a GPU, (or AWS or some such). The great thing about XGB is you can get it up and running on an ordinary laptop, and actually come up with very good results.

Side note: it was introduced during the Higgs Boson Machine Learning challenge; only available in Python at that time. Despite being a complete newbie to data science, R and Python, I got a model built using R's GBM, and XGB with Python... I did all manual parameter tuning and no feature engineering whatsoever, but a simple manual ensemble of the two got me in the top 25 overall out of 1,785 entrants. And XGB contributed much more to the submission than GBM. From that point forward everyone's been having a go in every competition with XGB, and I've never sniffed the top 10% in Kaggle since. :-) It really is insanely powerful.

BridgetT · ‎03-22-2016

@mcwallendjack: I could talk a bit about using R without Alteryx, but I think the EngineWorks blogs are supposed to focus primarily on Alteryx. Here is a good collection of resources to get started with R if you're interested, though. And here is a good introduction to Machine Learning that teaches you about R at the same time, assuming you have some math background (ideally both Calculus and Linear Algebra, though they might not be necessary to understand the fundamentals). Some of the authors have an even more mathy book covering the same topics in more depth here, but the first one is probably a better bet if you don't have a degree in a math-intensive field. And I would probably be writing a new post for each topic, since that seems to be the typical structure of EngineWorks posts. Thanks for all your feedback, and I'm glad this post was helpful to you!

Edit: Actually, I was wrong about the strict need to focus on Alteryx. I just talked to @TaraM, and she informed me that a post focusing on getting started with R in general with a tie-in to the Alteryx R tool would be fine. Does anyone want that to be my next post? We on the Content Engineering/Advanced Analytics team are on a rotation, so it'll be a few weeks before my next one is scheduled. Of course, maybe I could write it a bit earlier if someone else wants their next post bumped back.

@JohnJPS: Thanks for all that information! To my knowledge, Alteryx does not currently contain a built-in tool using XGB; the Boosted Model tool uses the gbm package on R instead. @DrDan can correct me if I'm wrong on this count, though. However, perhaps we'll create one in the future now that I know how powerful it is! And of course, Alteryx users can always use the R tool to code a script using XGB manually.

DrDan · ‎03-22-2016

@JohnJPS: We have been looking at xgboost for some time, although our primary interest has been using this library to replace the randomForest package in the Forest model tool. We are planning on making extensive renovations to our existing predictive modeling tools for future releases, and making use of the xgboost library is definitely part of those plans.

Dan

mcwallendjack · ‎03-22-2016

Thanks Bridget I will certainly take a look into these additional, outside Alteryx, R-based materials!

JohnJPS · ‎03-22-2016

@BridgetT - yes, thanks. I've seen the "Elements of..." one before, but not the ISLR one, which looks really great!

BridgetT · ‎03-22-2016

@JohnJPS: As a math person, I love Elements of Statistical Learning, but I could see that it's probably a bit intimidating to someone who doesn't share my math background. Also, ISLR has R tutorials, which ESL does not!

Atabarezz · ‎03-22-2016

I second the R tutorial but using it with Alteryx (otherwise it's irrelevant to this community),

1) First option may be basically replicating some of the Alteryx tools in R and comparing these to each other in terms of ease of coding and performance (which will show how much of a relief Alteryx is for a regular analyst or business unit worker),

inputs, ODBC connections
filters,
joins,
aggregations

2) The second option may be developing some basic coding which enhances Alteryx capabilities further using R

loading new R packages that are not included in Alteryx, symbolic regressions, route optimisations and the like...
Some basic applications using those packages, like more advanced data quality use cases or trending analytics algorithms like structural equation models (SAS JMP and SPSS AMOS handle these models) or linear/non-linear optimisations etc.

Best

BridgetT · ‎03-23-2016

@Atabarezz: I'm planning on tying in the Alteryx R tool as well. (However, there are already a few EngineWorks posts, such as the one @ChrisF recently posted, that don't directly mention Alteryx.) I have yet to work with ODBC connections in R, but I could definitely talk about the other aspects of your first suggestion! And the second option sounds very interesting, though it would probably involve a lot of (fun!) work on my part.

Atabarezz · ‎03-23-2016

It's a lot of work but for the love of alteryx I'd like to provide support... just shout out...

Best

Edit:

During classes I gave, the list of topics I provided followed this agenda, and

maybe you can use it as a list of mini-blogposts, consider us splitting the vast literature into readable parts;

1) I start with https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining short post but very essential at the start

2) Business understanding and developing hypotheses... What problem are we solving, what may be some alternative approaches? Any literature on that?

3) Data cleansing and enrichment by joining external data sets, spatial data, weather data etc. with formulations,

4) Data Discovery, charts, scattergraphs, information value and correlations

5) Symbolic regression, some data reduction, PCA, most basic scoring, forecasting, scaling, indexing,

6) Target definition and preparing data for predictions, normalization or binning using decision trees/WoE, sampling

7) Selecting prediction or estimation models based on the business problem, multi cross validation

8) Evaluating types of models with different measures, Gini, RMSE, R^2

9) Fine tuning models, model parameters (hyperparameter) optimisation

10) Ensembling

11) Scoring data sets, implementing scoring into decisioning software in banks, insurance firms, telecoms

What do you think about the list?

BridgetT · ‎03-24-2016

@Atabarezz: That sounds like an excellent list to start people on their way to predictive analytics! But I'm hesitant to put all that work into developing such a series of posts, because there are several projects that people on the Content Engineering/Advanced Analytics team/Solutions Engineering team (including me) are currently working on that cover many of those areas. I'm not really sure how much I can say about these projects right now, but once they're released I will certainly mention them in future posts!

JohnJPS · ‎02-01-2017

Been a while... I've gotten fairly handy with R, and I can recreate what's in ISLR. I'd like to tackle ESL, but would like to take a MOOC or watch some lectures as I go. Is anyone aware of any such collection of lectures online, perhaps in a MOOC? I've already had a pretty good look around, but have come up empty. I know they have a course utilizing ISLR, but I want specifically for it to be more math oriented. Thanks!

PS: of course, only after I type this does it occur to me to search specifically within YouTube... there are two lecture series, neither of which appears to be based on ESL, but which look interesting:

Statistical Machine Learning (and)

Statistical Learning Theory and Applications

I'd still be interested in hearing about anything that closely follows ESL. Thanks!

Demystifying the R-based Tools, Part 1 (Setup/installation and Classification Tools)