DrDan
Alteryx Alumni (Retired)

The 9.0 release of Alteryx Analytics adds several new tools to our predictive analytics workbench, but much of the development effort for this release went into improvements under the hood. Chief among these is the integration with Revolution Analytics, which makes it possible to scale predictive analytics to large data volumes using their Revo ScaleR technology. Several of the things we did to make this integration possible also pay off in other areas: they improve scoring with open source R tools, and they give us a scalable infrastructure that we will be able to leverage in future releases.

 

The integration between Alteryx and Revolution Analytics' Revo ScaleR technology is implemented through the creation of a Revolution Analytics XDF format file. The presence of an XDF file triggers a number of Alteryx predictive modeling tools (Linear Regression, Logistic Regression, Count Regression, Gamma Regression, Stepwise, Decision Tree, Forest Model, Lift Chart, and Score) to make use of the scalable Revo ScaleR algorithms. The XDF Output tool takes an Alteryx data stream and writes it to an XDF file, either in Alteryx's temporary directory or in a user-specified permanent location on disk. In addition to writing the XDF file, the tool produces an "XDF metadata stream". This stream gives downstream predictive tools the metadata describing the data, along with the information a tool needs to locate the relevant XDF file. Given this information, each modeling tool uses either the Revo ScaleR algorithm or the open source R algorithm, as appropriate.
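The dispatch idea in the last two sentences can be pictured with a small sketch. It is written in Python purely for illustration (the actual tools are built on R), and every name in it, including `InputMetadata`, `xdf_path`, and `choose_backend`, is a hypothetical stand-in rather than anything from the Alteryx or Revolution Analytics APIs. The only assumption carried over from the text is that the metadata a tool receives tells it whether an XDF file exists and where it lives.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InputMetadata:
    """Hypothetical description of the input a predictive tool receives."""
    field_names: List[str]
    # Assumed to be set only when the data arrived by way of an XDF file.
    xdf_path: Optional[str] = None


def choose_backend(meta: InputMetadata) -> str:
    """Pick the scalable backend when an XDF file is referenced, else in-memory R."""
    return "scaler" if meta.xdf_path else "open_source_r"


in_memory = InputMetadata(field_names=["age", "income"])
on_disk = InputMetadata(field_names=["age", "income"], xdf_path="customers.xdf")
```

The point of the design is that the user does not choose an algorithm explicitly: placing an XDF Output tool upstream is what switches the downstream tools to the scalable path.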

 

Major changes have been made to the scoring tool in two different areas. First, a lot of effort has gone into making the tool robust when scoring new data. Unfortunately, the methods that generate predicted values for most R models take an "all or nothing" approach. They throw an error when the new data contain levels of a categorical variable that were not in the data used to estimate the model. For many models, they also throw an error when not all levels of each categorical variable are present in the new data being scored, which eliminates the ability to score a single record in an analytic app. In addition, scoring for some model types fails when any of the predictor variables in the new data has missing values. Understandably, a number of users found this behavior frustrating, so we have changed from R's default "all or nothing" approach to a "best effort" approach to scoring. Every record of new data that can be scored is scored, while missing values (Nulls) are returned for the records that cannot be scored. Overall, we feel that most of our users will be pleased with this change in behavior. It also greatly improves the ability to build scoring analytic apps that obtain a score for one or a few customers at a time (for example, an application that helps a branch bank loan officer rapidly approve or deny a personal loan).
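To make the difference between the two approaches concrete, here is a minimal sketch of "best effort" scoring. It is Python rather than R, and it is not the code inside the Alteryx tool: the toy linear model, the `make_scorer` and `score_best_effort` helpers, and the field names are all assumptions made up for the example. The strict scorer raises an error on an unseen categorical level or a missing predictor, as described above, and the wrapper turns those failures into Nulls (`None`) one record at a time.

```python
def make_scorer(intercept, numeric_coefs, level_effects):
    """Build a strict scorer for a toy linear model (hypothetical example)."""
    def score(record):
        total = intercept
        for name, coef in numeric_coefs.items():
            value = record.get(name)
            if value is None:
                # "All or nothing" behavior: a missing predictor is an error.
                raise ValueError(f"missing value for {name}")
            total += coef * value
        for name, effects in level_effects.items():
            level = record.get(name)
            if level not in effects:
                # A level that was not in the estimation data is an error.
                raise ValueError(f"unseen level {level!r} for {name}")
            total += effects[level]
        return total
    return score


def score_best_effort(score, records):
    """Score every record that can be scored; return None (Null) for the rest."""
    results = []
    for record in records:
        try:
            results.append(score(record))
        except ValueError:
            results.append(None)
    return results


score = make_scorer(
    intercept=1.0,
    numeric_coefs={"income": 0.5},
    level_effects={"region": {"east": 0.0, "west": 2.0}},
)
new_data = [
    {"income": 4.0, "region": "west"},   # can be scored
    {"income": 4.0, "region": "north"},  # level not seen when the model was estimated
    {"income": None, "region": "east"},  # missing predictor value
]
scores = score_best_effort(score, new_data)  # [5.0, None, None]
```

Because each record is handled independently, a single-record request (the loan officer case) succeeds or returns a Null on its own merits instead of failing the whole run.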

 

The second major enhancement to Alteryx's scoring capabilities relates to the volume of data that can be scored in a single run. As part of the work of integrating Revolution Analytics capabilities, Alteryx gained the ability to move data into and out of R a chunk at a time. This lets us scale the scoring of open source R models as well, since not all of the data needs to be fed into R (and thus into main memory) at once. Instead, a chunk of records small enough to be held in memory is read into R, scored, and written back to Alteryx, and the process is repeated for subsequent chunks until all of the data is scored. Because only one chunk is in memory at a time, the number of records that can be scored is no longer limited by available memory. Currently, this runs as a single-threaded process, which means it can be made faster through parallelization (to which it lends itself). To that end, my colleague Ben Gomez and I will be working on using the Alteryx Server's scheduling capability to develop a parallel scoring template that users can easily customize to meet their own high-volume scoring needs.
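The read, score, write loop described above can be sketched in a few lines. Again this is a Python illustration under stated assumptions, not the mechanism Alteryx uses to talk to R: `read_chunks`, `score_in_chunks`, and `toy_score_chunk` are hypothetical names, and the "model" is a trivial arithmetic stand-in. What the sketch does show is the key property, namely that only one chunk of records is materialized at a time while the scores still come out in the original order.

```python
from itertools import islice


def read_chunks(records, chunk_size):
    """Yield successive lists of at most chunk_size records from any iterable."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def score_in_chunks(records, score_chunk, chunk_size):
    """Score a stream chunk by chunk, yielding scores in the original order."""
    for chunk in read_chunks(records, chunk_size):
        # Only this chunk is held in memory while it is being scored.
        yield from score_chunk(chunk)


def toy_score_chunk(chunk):
    """Hypothetical stand-in for a model's predict step on one chunk."""
    return [2 * x + 1 for x in chunk]


stream = (x for x in range(7))  # a generator, so records are produced lazily
scores = list(score_in_chunks(stream, toy_score_chunk, chunk_size=3))
chunk_sizes = [len(chunk) for chunk in read_chunks(range(7), 3)]
```

Since each chunk is scored independently of the others, the loop body is also the natural unit of work to hand to parallel workers, which is why the process lends itself to parallelization.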

 

Stay tuned for Part 2 of this blog to learn more about our new predictive analytics offerings. You can also learn more at our May 14th Alteryx Analytics 9.0 Webinar.

Dan Putler
Chief Scientist

Dr. Dan Putler is the Chief Scientist at Alteryx, where he is responsible for developing and implementing the product road map for predictive analytics. He has over 30 years of experience in developing predictive analytics models for companies and organizations that cover a large number of industry verticals, ranging from the performing arts to B2B financial services. He is co-author of the book, “Customer and Business Analytics: Applied Data Mining for Business Decision Making Using R”, which is published by Chapman and Hall/CRC Press. Prior to joining Alteryx, Dan was a professor of marketing and marketing research at the University of British Columbia's Sauder School of Business and Purdue University’s Krannert School of Management.
