
GregL
Alteryx Alumni (Retired)


Like a lot of folks, I have a love/hate relationship with R. One topic that I've seen people struggle with is parallel computing, or more directly "How do I process data in R when I run out of RAM". But fear not! R actually has some easy to use parallelization packages!

 

Don't let this happen to you!

Here's a quick post on doing parallel computing in R.

 

Picking a library

 

My take on parallel computing has always been: simpler is better. You can get in deep pretty quickly if you start poking your nose around some of the more complicated parts of multi-machine, out-of-core operations.

 

For scaling up R, I've come up with the following triage:

 

1) Try it on my laptop

2) Spin up the biggest, baddest server I can find on EC2 and run plain, vanilla R

3) Use the same big, bad server and convert my code to run in parallel using plyr

 

So for this post I'll assume you've already tried #1 and #2 and I'll skip right ahead to #3.

 

First things first: you need to pick a parallel computing package. As I previously mentioned, I'm a single-tenant simpleton. I typically reach for the time-tested combo of doMC and plyr. Call me primitive or old fashioned if you like, but 60% of the time, it works every time.
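If they're not already on your machine, both packages live on CRAN. Here's a minimal install sketch; one caveat worth noting is that doMC relies on forking under the hood, so it's really a Unix/Mac affair (Windows users will want a different backend, such as doParallel):

# grab plyr and doMC from CRAN
# note: doMC forks the R process, so it won't buy you anything on Windows
install.packages(c("plyr", "doMC"))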

 


  

Configuring your environment

 

OK, so you've picked your packages; now we need to load 'em up. After importing them, we need to tell doMC how many cores we can use on our machine. If you've selected the gnarly R3 High-Memory Eight Extra Large, then you're about to go toe-to-toe with 32 cores of raw cloud computing power. To do this, just run the following:

 

library(plyr)
library(doMC)

doMC::registerDoMC(cores=32) # or however many cores you have access to
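Not sure how many cores you're actually working with? Here's a quick sanity-check sketch; detectCores() comes from the parallel package that ships with R, and getDoParWorkers() comes from foreach, which doMC loads for you:

library(parallel)

detectCores()                            # how many cores the machine reports
doMC::registerDoMC(cores=detectCores())  # register all of them
foreach::getDoParWorkers()               # confirm how many workers are registered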

 

Just let that horsepower course through your veins

 

If you followed the link to ec2instances.info, you'll notice that 32 cores is equal to "104 EC2 compute units". If you don't know what that means, don't be ashamed. Nobody does. If you really want a formal definition, you can check out the EC2 FAQ for a very unsatisfying one.

 

Processing your data

 

We're going to be using plyr to perform our data processing. plyr plays nicely with doMC and will automatically begin using multiple cores. What this means for you is that once the backend is registered, you can pretty much forget about doMC!

 

As a simple example, we'll process the iris dataset using the ever-useful ddply function. We'll shake things up a bit and have the function pause for 2 seconds. Since iris has 3 species and each group sleeps for 2 seconds, the command will take ~6 seconds when run serially.

 

system.time(ddply(iris, .(Species), function(x) {
 Sys.sleep(2)
 nrow(x)
}))
# user system elapsed
# 0.005 0.001 6.016

 

Nothing too crazy. But watch what happens when you add the .parallel = TRUE argument to the ddply call.

 

system.time(ddply(iris, .(Species), function(x) {
 Sys.sleep(2)
 nrow(x)
}, .parallel = TRUE))
# user system elapsed
# 0.018 0.015 2.031

 

Woot! That only took 2 seconds! What does that mean for us? It means we just processed each species on its own, at the same time. Pretty slick, right?

 

And lucky for us, pretty much every function in plyr has a .parallel option. What does that mean for you? Well, since I use plyr for tons of stuff already, it means I have to do zero extra work to parallelize my code!
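For example, here's the same toy sleep trick run through llply instead of ddply; a quick sketch reusing the iris data from above:

# split iris into one data frame per species, then let .parallel = TRUE
# spread the pieces across the registered cores
results <- llply(split(iris, iris$Species), function(x) {
 Sys.sleep(2)
 nrow(x)
}, .parallel = TRUE)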

 

Kicking it up a notch

 

To give a demo on a larger dataset, I'm going to calculate some summary stats on the Yahoo! Search Marketing advertiser bidding dataset. It's basically the biggest, nastiest dataset that I could find on short notice.

 

We'll read it in, partition it by phrase_id and account_id, and see how the different facets compare.

 

headers <- c("timestamp", "phrase_id", "account_id", "price", "auto")
df <- read.table("./ydata-ysm-advertiser-bids-v1_0.txt")
colnames(df) <- headers

system.time(ddply(df, .(phrase_id, account_id), function(x) {
 data.frame(mean=mean(x$price, na.rm=TRUE),
            median=median(x$price, na.rm=TRUE),
            n=nrow(x))
}))
# user system elapsed 
# 96.645 3.343 99.964

 

Now if we parallelize the same code:

 

system.time(ddply(df, .(phrase_id, account_id), function(x) {
 data.frame(mean=mean(x$price, na.rm=TRUE),
            median=median(x$price, na.rm=TRUE),
            n=nrow(x))
}, .parallel = TRUE))
# user system elapsed 
# 94.228 16.267 48.553

 

Bam! Just 48.5 seconds! There you have it: we just crunched through a dataset that's just shy of 1 GB on a single machine running R (and nobody got hurt).
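Outside of system.time, you'd of course just hang on to the result and poke around in it; a quick sketch (summary_stats is just an illustrative name):

summary_stats <- ddply(df, .(phrase_id, account_id), function(x) {
 data.frame(mean=mean(x$price, na.rm=TRUE),
            median=median(x$price, na.rm=TRUE),
            n=nrow(x))
}, .parallel = TRUE)

head(summary_stats)  # one row of summary stats per phrase_id/account_id pair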

 


 

Scaling with ScienceBox

 

Want this to be even easier? Try out the ScienceBox. It gives you full control over your very own industrial grade analytics server. Everything runs behind your own firewall and you retain complete ownership over your data.

 


 

We've just released a bunch of new features (GitHub integration, an updated UI, an enhanced command-line interface), so give it a spin or email me at greg@yhathq.com if you have questions about it.

 

Final Thoughts

 

Parallel computing doesn't have to be hard or complicated. Keep it simple and you can leverage the great features R has to offer with half the headaches!

 

I'm also fully aware that there are other great R packages out there for doing parallel processing. Take a look at these if you're interested in learning more: