
Data Science

Machine learning & data science for beginners and experts alike.
DrDan
Alteryx Alumni (Retired)

Last Friday was a very busy day for several of us at Alteryx in the wake of the announcement that Microsoft had agreed to acquire Revolution Analytics. In this post I won't go into the Alteryx angle of this story, other than to say we think this is a net positive. Instead, I want to offer a few words of appreciation for what Revolution Analytics has done for R-based technology, and for its non-technology contributions to the R community, since the company's founding (as REvolution Computing) in 2007.

 

Contributions to R Based Technology

Revolution Analytics has long been at the forefront of efforts to scale R for applications involving large amounts of data. They have approached this problem using both coarse-grained parallel and streaming computing approaches. At the moment, considerably more effort is going into coarse-grained parallel computing (with Hadoop being the most widely publicized of these efforts), but streaming approaches can be very effective in scaling predictive analytics on more limited hardware. The most impressive methods of this kind that I have seen are the streaming linear model and generalized linear model methods in Revolution Analytics' proprietary RevoScaleR package (which also makes use of Intel's multi-threaded linear algebra libraries) that ships with their Revolution R Enterprise product. We have found that at moderate data volumes they are faster than the comparable open source R functions, and they easily scale to millions of records on a common business laptop configuration (e.g., 8 GB of memory and a modern multicore CPU). That same configuration tops out at roughly 100,000 to 200,000 records with fewer than 10 predictors when the same type of model is estimated with open source R's lm or glm functions. What Lee Edlefsen and the engineering team at Revolution Analytics have done in this area represents the state of the art (it clearly outshines comparable methods from SAS and IBM/SPSS), and will likely serve as an important point of comparison for others developing streaming algorithms for a long time to come.
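
RevoScaleR's internals are proprietary, but the basic streaming idea can be sketched with the open source biglm package, which updates a linear model's sufficient statistics one chunk of rows at a time so that memory use stays bounded. The sketch below is only an illustration of that idea, not Revolution Analytics' actual algorithm; the file name, chunk size, and column names (y, x1, x2, x3) are hypothetical.

```r
# Chunked (streaming-style) linear model fitting with the open source biglm package.
# Illustrative only: this is not RevoScaleR's implementation, and the data file,
# chunk size, and column names are hypothetical.
library(biglm)

chunk_size <- 100000
con <- file("big_data.csv", open = "r")                 # hypothetical data file
header <- strsplit(readLines(con, n = 1), ",")[[1]]     # read the header row once

fit <- NULL
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break                          # no more rows to stream
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  if (is.null(fit)) {
    fit <- biglm(y ~ x1 + x2 + x3, data = chunk)         # first chunk initializes the model
  } else {
    fit <- update(fit, chunk)                            # later chunks update sufficient statistics
  }
}
close(con)

summary(fit)  # coefficients match lm() on the full data, but memory use stays bounded
```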

 

While they have kept their streaming methods proprietary, they have given back to the R community much of the technology they have developed in the area of coarse-grained parallel computing in R. Chief among these are the foreach and iterators packages. In academia, one thing professors are judged on in tenure and promotion decisions is how many other published articles cite their articles (a number of broad, discipline-oriented citation indexes, such as the Social Science Citation Index, provide this information, and a lot of attention is now being paid to Google Scholar citations, which are often more interdisciplinary in nature). The R Project originated in academia, and still has a very academic feel to it. As a result, the package archive for the project (CRAN, or the Comprehensive R Archive Network) provides something very similar to a citation index. Specifically, for every CRAN package there is an indication of how other CRAN packages make use of it. There are three levels of this: a "reverse depends" (the package is absolutely necessary in order to install the other package); a "reverse imports" (the package is a critical component of the other package, but the other package can be installed without it); and a "reverse suggests" (the package provides additional, less central, functionality to the other package). The Revolution Analytics foreach package has (as I write this) a reverse status on the part of 111 other R packages, while the iterators package has a reverse status on the part of 34 other CRAN packages. Only three of these represent "vanity reverses" (i.e., a package that makes use of another package written by the same author), and the vast majority are of the more important "depends" or "imports" variety. In both cases this is an extraordinarily high number of reverse status packages (the count for the foreach package is extremely high). Put another way, if there were a University of R, Revolution Analytics would hold the rank of Full Professor.
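
For readers who want to check these "citation counts" themselves, base R's tools package can tally reverse dependencies directly from the CRAN package database. A minimal sketch is below; the counts reflect CRAN at the time the code is run, so they will differ from the numbers quoted in this post.

```r
# Tally reverse dependencies of foreach and iterators on CRAN using base R's
# tools package. Results reflect CRAN today, not the snapshot described above.
options(repos = c(CRAN = "https://cran.r-project.org"))
db <- available.packages()

rev_deps <- tools::package_dependencies(
  c("foreach", "iterators"),
  db = db,
  which = c("Depends", "Imports", "Suggests"),
  reverse = TRUE
)

sapply(rev_deps, length)  # number of CRAN packages with some reverse relationship
```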

More recently, Revolution Analytics has made an effort to address issues that come up with R packages themselves. The first of these efforts is the miniCRAN package, which allows an organization with strict firewall rules to create an internal, selective archive of R packages that members of that organization can access. The second is the checkpoint package, which is closely linked with Revolution Analytics' "Managed R Archive Network", or MRAN. The purpose of the checkpoint package and MRAN combination is to address a common problem in reproducing R-based research results: changes in contributed R packages. R consists of three components: a small set of "base" packages that provide basic functionality; a still small, but somewhat larger, set of "recommended" packages that provide additional core R functionality; and a huge set (nearly 5000 as of this writing) of contributed packages. R's base and recommended packages are shipped with R's installer from CRAN and are very stable. The same cannot be said of all of R's contributed packages. We at Alteryx have never had issues migrating to the base or recommended packages of a new version of R, but we have experienced a few hiccups in migrating to new versions of some of the contributed packages that we use and bundle with our Predictive Plug-in (yes, regression testing is a useful thing). It turns out we are not alone, and in some cases (particularly in clinical trial settings for new drugs or medical devices) these package changes can make research results difficult to reproduce. The problems can be due to changes in the API of a package (which can cause R analysis scripts to break) or changes in the underlying methods used by a package (which can change the nature of the results in marginally significant cases). The goal of the checkpoint package / MRAN combination is to allow researchers to "freeze" on a particular vintage of R packages in order to make sure past research results can be replicated in a setting that takes changes in underlying R packages out of the picture. I view this as a very selfless move on Revolution Analytics' part, since it is a technology that is likely to be extremely useful to portions of the R community and takes real resources for Revolution Analytics to implement, yet seems difficult for them to monetize.
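
To give a feel for how lightweight this freezing is in practice, here is a minimal sketch using the checkpoint package as it worked at the time of this post. The snapshot date and the packages loaded afterwards are hypothetical; in a real project you would use the date on which the original analysis was run.

```r
# Freeze a script to a fixed vintage of CRAN packages using the checkpoint
# package and an MRAN daily snapshot. The date below is hypothetical.
library(checkpoint)

# library()/require() calls in the project are scanned, and the versions of those
# packages archived on MRAN for the given date are installed into a
# project-specific library and used for the rest of the session.
checkpoint("2015-01-30")

library(foreach)    # resolved against the 2015-01-30 snapshot
library(iterators)
```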

 

Non-Technology Contributions to the R Community

Revolution Analytics has consistently given back to the R community in non-technological ways as well, and in three ways in particular. First, it has been a primary sponsor of the annual international R user conference (useR!) since 2008, longer than any other software vendor; only the book publishers CRC Press and Springer have sponsored the conference for more years than Revolution Analytics. Given that Revolution Analytics was only founded in 2007, the length of time they have been a primary sponsor of useR! is remarkable.

 

The second way Revolution Analytics has given back to the R community in a non-technical way is by helping to sponsor local R user groups through their R User Group Sponsorship Program. I am a member of the Bay Area R Users Group, which Revolution Analytics sponsors and for which Joe Rickert of Revolution Analytics acts as the primary organizer. Revolution Analytics provided financial support to 51 local R user groups in 2014 (all local R user groups are eligible for sponsorship, but not all apply). In addition, it supports all 150 local R user groups via the Local R User Group Directory, the R Community Calendar, and the @inside_r Twitter channel.

 

The third way they contribute back to the community is the Revolutions blog, which is one of the longest-running blogs covering topics relevant to the R community. Most company blogs are written for specific, very narrow marketing or product education purposes. This is not the case with the Revolutions blog, which strives to cover all topics relevant to the R community, even new R-based technologies that represent, at least to my mind, a potential competitive threat to them.

 

Going Forward

 

What exactly the longer-term future holds for Revolution Analytics as they become part of the Microsoft family is unknown at this point. However, I believe that the assessment of David Smith (Revolution Analytics' Chief Community Officer) that

For our users and customers, nothing much will change with the acquisition. We’ll continue to support and develop the Revolution R family of products — including non-Windows platforms like Mac and Linux. The free Revolution R Open project will continue to enhance open source R. We’ll continue to offer expert technical support for R with Revolution R Plus subscriptions from the same team of R experts. We’ll continue to advance the big data and enterprise integration capabilities of Revolution R Enterprise. And we’ll continue to offer expert technical training and consulting services.

 

is correct. Moreover, the financial backing of Microsoft will likely provide a strong tailwind, helping several of the initiatives Revolution Analytics has started to move forward more rapidly.

 

As part of Alteryx's partnership with them, I've had the opportunity and pleasure to interact with many people at Revolution Analytics, and I wish them the best of luck in the next part of their journey.

Dan Putler
Chief Scientist

Dr. Dan Putler is the Chief Scientist at Alteryx, where he is responsible for developing and implementing the product road map for predictive analytics. He has over 30 years of experience in developing predictive analytics models for companies and organizations that cover a large number of industry verticals, ranging from the performing arts to B2B financial services. He is co-author of the book, “Customer and Business Analytics: Applied Data Mining for Business Decision Making Using R”, which is published by Chapman and Hall/CRC Press. Prior to joining Alteryx, Dan was a professor of marketing and marketing research at the University of British Columbia's Sauder School of Business and Purdue University’s Krannert School of Management.
