Data Science

DrDan · 04-24-2014

The 9.0 release of Alteryx Predictive Analytics adds several new tools to our predictive analytics workbench, but much of the development effort for this release has been devoted to improvements under the hood. Chief among these is the integration with Revolution Analytics, which provides the ability to scale predictive analytics to large data volumes through Revolution's RevoScaleR technology. Several of the things we did to make the integration with Revolution Analytics possible also have fringe benefits in other areas. In particular, they improve scoring with open source R tools, and they prompted us to develop an infrastructure that we will be able to leverage in future releases to enable in-database and in-Hadoop analytics. Finally, while not directly part of the predictive analytics tools, we have made it easier for users to work with legacy SAS sas7bdat and IBM/SPSS sav files in Alteryx.

Using Revolution Analytics to Scale Predictive Analytics

The integration between Alteryx and Revolution Analytics' RevoScaleR technology is implemented through the creation of a Revolution Analytics XDF format file, which triggers a number of Alteryx predictive modeling tools (Linear Regression, Logistic Regression, Count Regression, Gamma Regression, Stepwise, Decision Tree, Forest Model, Lift Chart, and Score) to make use of the scalable RevoScaleR algorithms. There are two new tools in 9.0 for reading and writing XDF files in Alteryx: XDF Output and XDF Input. The XDF Output tool takes an Alteryx data stream and writes it to an XDF file, either in Alteryx's temporary directory or in a user-specified permanent location on disk. In addition to writing the XDF file, the tool produces an "XDF metadata stream". The XDF metadata stream provides downstream predictive tools with the metadata describing the data, along with information that enables a predictive tool to determine the location of the relevant XDF file. Given this information, the modeling tools use either the RevoScaleR or the open source R modeling algorithm, as appropriate.

Working with Legacy Data from SAS or IBM/SPSS

Two new formats have been added to both the Alteryx Input and Output tools: sas7bdat (for SAS data files) and sav (for IBM/SPSS data files). This allows users to blend data currently locked in these legacy formats with data from relational databases, spatial data, data from social media sites in JSON format, and the huge set of other data formats that Alteryx supports. It also allows a user to take advantage of Alteryx's personal ETL capabilities to create data sets that can be used directly in these legacy statistical systems. That said, we expect this to become an increasingly less common practice over time as Alteryx's predictive capabilities continue to expand and mature.

Greatly Enhanced Scoring Capabilities

Major changes have been made to the scoring tool in two different areas. First, a lot of effort has gone into making the tool robust when scoring new data. Unfortunately, the methods that generate predicted values for most R models take an "all or nothing" approach. They throw an error when the new data contain levels of a categorical variable that were not in the data used to estimate the model and, for many models, when not all levels of each categorical variable are present in the new data being scored (which eliminates the ability to score a single record in an analytic app). In addition, scoring for some model types fails when there are missing values in any of the predictor variables in the new data. A number of our users found this behavior to be frustrating, to say the least. Consequently, we have moved from R's default "all or nothing" approach to a "best effort" approach to scoring: every record of new data that can be scored is scored, while missing values (Nulls) are returned for those records that cannot be scored. Overall, we feel that most of our users will be pleased with this change in behavior. In addition, it greatly enhances the ability to implement scoring analytic apps based on obtaining a score for one or a few customers at a time (for example, in an application that helps a branch bank loan officer rapidly approve or deny a personal loan).
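The "best effort" behavior described above can be illustrated with a minimal Python sketch. It is not the Score tool's actual implementation (which runs in R); the function `best_effort_score`, the toy model, and the field names `region` and `tenure` are all hypothetical, chosen only to show the idea of returning a Null (here `None`) for records that cannot be scored instead of failing the whole run.

```python
from typing import Callable, Dict, List, Optional, Set

Record = Dict[str, object]


def best_effort_score(
    records: List[Record],
    predict: Callable[[Record], float],
    known_levels: Dict[str, Set[str]],
    required_fields: List[str],
) -> List[Optional[float]]:
    """Score each record independently; return None where scoring is impossible."""
    scores: List[Optional[float]] = []
    for record in records:
        # A missing predictor value makes the record unscoreable.
        has_missing = any(record.get(field) is None for field in required_fields)
        # So does a categorical level the model never saw during estimation.
        has_unseen_level = any(
            record.get(field) is not None and record[field] not in levels
            for field, levels in known_levels.items()
        )
        scores.append(None if has_missing or has_unseen_level else predict(record))
    return scores


# Hypothetical toy model: a base rate plus a per-region adjustment.
REGION_EFFECT = {"east": 0.10, "west": -0.05}


def toy_predict(record: Record) -> float:
    return 0.5 + REGION_EFFECT[record["region"]] + 0.01 * record["tenure"]


new_data = [
    {"region": "east", "tenure": 10},    # can be scored
    {"region": "north", "tenure": 3},    # level not seen when the model was estimated
    {"region": "west", "tenure": None},  # missing predictor value
]
scores = best_effort_score(
    new_data, toy_predict, {"region": set(REGION_EFFECT)}, ["region", "tenure"]
)
```

Only the first record receives a score; the other two come back as `None`, and the run as a whole does not fail. The same check works for a single record, which is the case that matters for an analytic app scoring one customer at a time.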

The second major enhancement to Alteryx's scoring capabilities relates to the volume of data that can be scored in a single run. As part of the work of integrating Revolution Analytics capabilities, Alteryx gained the ability to read data into and write data out of R a chunk at a time. This turned out to provide the very nice fringe benefit of allowing us to scale the scoring capabilities of open source R models as well, since not all of the data needs to be fed into R (and thus into main memory) at once. Instead, a chunk of records (that can be held in memory) is read into R, scored, and then written back to Alteryx, and the process is repeated for subsequent chunks of the data until all of it is scored. Currently, this runs as a single-threaded process, which means it can be made faster through parallelization (to which it lends itself). While this setup isn't as fast as it could be, it does allow an essentially unlimited number of records to be scored. Moreover, my colleague Ben Gomez and I will be working on using the Alteryx Server's scheduling capability to develop a parallel scoring template that users can easily customize to meet their own high-volume scoring needs.
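The chunk-at-a-time loop described above can be sketched in Python as follows. This is only an illustration of the pattern, not the Alteryx/R mechanism itself; `chunked`, `score_in_chunks`, and `toy_score_chunk` are hypothetical names, and the toy model simply doubles a field `x`.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, float]

chunk_sizes_seen: List[int] = []  # recorded only to show how the data was split


def chunked(rows: Iterable[Row], chunk_size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most chunk_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def score_in_chunks(
    rows: Iterable[Row],
    score_chunk: Callable[[List[Row]], List[float]],
    chunk_size: int,
) -> Iterator[float]:
    """Score one chunk at a time so only chunk_size rows are in memory at once."""
    for chunk in chunked(rows, chunk_size):
        yield from score_chunk(chunk)


def toy_score_chunk(chunk: List[Row]) -> List[float]:
    # Hypothetical stand-in for a fitted model's predict method.
    chunk_sizes_seen.append(len(chunk))
    return [2.0 * row["x"] for row in chunk]


# The input is a generator, so the full data set is never materialized.
rows = ({"x": float(i)} for i in range(10))
scored = list(score_in_chunks(rows, toy_score_chunk, chunk_size=4))
```

With ten rows and a chunk size of four, the model is called three times (on four, four, and two rows). Because each chunk is scored independently of the others, the chunks could also be handed to separate workers, which is why the process lends itself to parallelization.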

New Tools

In addition to the XDF Input and XDF Output tools, 9.0 sees three additional tools being added to Alteryx's Predictive Analytics capabilities. Two of these tools are new modeling methods (the Spline Model and the Gamma Regression tools), and the third is a new plotting tool (the Heat Plot). These tools grew out of specific user requests, and we felt they represented additions to Alteryx that would be of interest to a number of our users.

The Spline Model tool provides the multivariate adaptive regression splines (or MARS) algorithm. This method is a modern statistical learning model that: (1) self-determines which subset of fields best predicts a target field of interest; (2) is able to capture highly nonlinear relationships and interactions between fields; and (3) can automatically address a broad range of regression and classification problems in a way that can be transparent to the user (the user can do as little as specify a target field and a set of predictor fields, but the tool can be extensively fine-tuned by advanced users). Its basic approach is similar to the recursive partitioning algorithm (used in the Decision Tree tool) in that it finds the variables that matter most in predicting the target, as well as appropriate split points (known as "knots") in those predictor variables. However, unlike a decision tree, which uses discrete jumps, it fits a line between adjacent knots (called a term). This results in the construction of a piecewise linear function for each variable that can closely approximate the relationship between the target and a predictor variable.
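To make the piecewise linear idea concrete, here is a minimal Python sketch of how a fitted MARS-style model for a single predictor can be evaluated using hinge functions. It does not show how the algorithm chooses variables or knots, and the intercept, coefficients, and knot value below are invented purely for illustration.

```python
from typing import List, Tuple

# A term is (coefficient, knot, direction); direction is +1 or -1.
Term = Tuple[float, float, int]


def hinge(x: float, knot: float, direction: int = 1) -> float:
    """Hinge basis function: max(0, x - knot) for +1, max(0, knot - x) for -1."""
    return max(0.0, direction * (x - knot))


def piecewise_linear_predict(x: float, intercept: float, terms: List[Term]) -> float:
    """Evaluate the intercept plus a weighted sum of hinge functions."""
    return intercept + sum(
        coef * hinge(x, knot, direction) for coef, knot, direction in terms
    )


# Hypothetical fitted model with a single knot at x = 3:
# slope +1 to the left of the knot and slope +2 to the right of it.
example_intercept = 1.0
example_terms: List[Term] = [(2.0, 3.0, +1), (-1.0, 3.0, -1)]

predictions = [
    piecewise_linear_predict(x, example_intercept, example_terms)
    for x in (1.0, 3.0, 5.0)
]
```

The prediction is continuous at the knot, and only the slope changes there. This is the contrast with a decision tree, whose prediction would jump from one constant to another at the split point.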

In a number of applications, the values of the target variable are always strictly positive (i.e., are never zero or negative) and tend to cluster toward the lower range of the observed values, but in a small minority of cases take on large values. Target variables of this type represent a data generation process that is not consistent with the Normality assumptions underlying the traditional linear regression model. Moreover, because the values are always positive and do not all have to be integers, they are not consistent with a process based on a Poisson or Negative Binomial distribution. They are consistent with a process based on a Gamma distribution, and can be estimated using methods similar to linear regression, via the generalized linear model framework. The Gamma Regression tool implements this model.
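As a rough illustration of how a Gamma model fits into the generalized linear model framework, the Python sketch below estimates a one-predictor Gamma regression by iteratively reweighted least squares. It assumes a log link, which is a choice made here for simplicity (the text above does not say which link function the tool uses), and it estimates only the two coefficients, not the Gamma shape parameter. The function name and the data are hypothetical.

```python
import math
from typing import List, Tuple


def fit_gamma_log_link(
    x: List[float], y: List[float], iterations: int = 25
) -> Tuple[float, float]:
    """Fit log(E[y]) = b0 + b1 * x by iteratively reweighted least squares.

    With a Gamma response and a log link the working weights are all equal,
    so each iteration is an ordinary least squares fit of the working
    response on x.
    """
    n = len(x)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    # Start from an intercept-only fit: the log of the mean response.
    b0, b1 = math.log(sum(y) / n), 0.0
    for _ in range(iterations):
        eta = [b0 + b1 * xi for xi in x]
        # Working response: linear predictor plus the scaled residual.
        z = [e + (yi - math.exp(e)) / math.exp(e) for e, yi in zip(eta, y)]
        z_mean = sum(z) / n
        b1 = sum((xi - x_mean) * (zi - z_mean) for xi, zi in zip(x, z)) / sxx
        b0 = z_mean - b1 * x_mean
    return b0, b1


# Hypothetical noise-free data generated from log(mean) = 0.5 + 0.3 * x,
# used only to check that the fit recovers the coefficients.
x_values = [0.0, 1.0, 2.0, 3.0, 4.0]
y_values = [math.exp(0.5 + 0.3 * xi) for xi in x_values]
b0_hat, b1_hat = fit_gamma_log_link(x_values, y_values)
```

Each pass of the loop is an ordinary least squares fit, which is the sense in which the model can be estimated using methods similar to linear regression.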

The Heat Plot tool uses a heat plot color map to show the joint distribution of two variables that are either continuous numeric variables or ordered categories (categorical variables that have a natural order, such as income groups or educational attainment levels). For example, this tool can provide an indication of the joint distribution of customer satisfaction and the length of time a customer has been with the company, highlighting potential problem and success hot-spots with respect to customer tenure.
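The quantity a heat plot displays is the count (or share) of records falling in each cell of a two-way grid. A minimal Python sketch of that tabulation is shown below; the customer records, the tenure bands, and the satisfaction labels are invented for illustration, and the color mapping itself is left out.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, List, Tuple


def joint_counts(
    pairs: Iterable[Tuple[object, object]],
    x_bin: Callable[[object], Hashable],
    y_bin: Callable[[object], Hashable],
) -> Counter:
    """Count how many records fall into each (x bin, y bin) cell."""
    return Counter((x_bin(x), y_bin(y)) for x, y in pairs)


def tenure_band(years: float) -> str:
    # Hypothetical binning of a continuous variable into two ordered bands.
    return "0-4" if years < 5 else "5+"


# Hypothetical records: (years as a customer, satisfaction level).
customers = [
    (1, "low"), (2, "low"), (1, "medium"),
    (7, "high"), (8, "high"), (9, "medium"),
]
counts = joint_counts(customers, tenure_band, lambda level: level)

# Lay the cells out as a grid: one row per satisfaction level (ordered),
# one column per tenure band. A plotting layer would map counts to colors.
satisfaction_order = ["low", "medium", "high"]
tenure_order = ["0-4", "5+"]
grid: List[List[int]] = [
    [counts[(band, level)] for band in tenure_order]
    for level in satisfaction_order
]
```

In this toy data, low satisfaction is concentrated among short-tenure customers and high satisfaction among long-tenure ones, which is the kind of hot-spot pattern the plot is meant to surface.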

Improvements in the A/B Testing Tool Suite

The final important area of improvement is in the behavior and reporting capabilities of the A/B Testing suite of tools, particularly the AB Analysis tool. These changes greatly expand the types of A/B tests the tools can address. We worked closely with several of our customers to figure out how to better meet their needs, and the changes in these tools reflect their input.

The Bottom Line

We feel that the changes in the predictive analytics tools in the 9.0 release reflect major improvements in the day-to-day usability of these tools and the scope of problems they can address. Moreover, many of our tools have undergone major changes that reflect a solid maturation in their capabilities. There is a lot more to do, but we think we are on the right path.