Like a lot of folks, I have a love/hate relationship with R. One topic I've seen people struggle with is parallel computing, or more directly, "How do I process data in R when I run out of RAM?" But fear not! R actually has some easy-to-use parallelization packages!
Don't let this happen to you!
Here's a quick post on doing parallel computing in R.
Picking a library
My take on parallel computing has always been simpler is better. You can get in deep pretty quickly if you start poking your nose around some of the more complicated parts of multi-machine, out of core operations.
For scaling up R, I've come up with the following triage:
1) Try it on my laptop
2) Spin up the biggest, baddest server I can find on EC2 and run plain, vanilla R
3) Use the same big, bad server and convert my code to run in parallel using plyr
So for this post I'll assume you've already tried #1 and #2 and I'll skip right ahead to #3.
First things first: you need to pick a parallel computing package. As I previously mentioned, I'm a single-tenant simpleton. I typically reach for the time-tested combo of doMC and plyr. Call me primitive or old fashioned if you like, but 60% of the time it works every time.
Configuring your environment
Ok, so you've picked your packages; now we need to load 'em up. After importing them, we actually need to tell doMC how many cores we can use on our machine. If you've selected the gnarly R3 High-Memory Eight Extra Large, then you're about to be toe to toe with 32 cores of raw, cloud computing power. To do this, just run the following:
library(plyr)
library(doMC)
doMC::registerDoMC(cores=32) # or however many cores you have access to
Just let that horsepower course through your veins
If you followed the link to ec2instances.info, you'll notice that 32 cores is equal to "104 EC2 compute units". If you don't know what that means, don't be ashamed. Nobody does. If you really want a formal definition, you can check out the EC2 FAQ for a very unsatisfying answer.
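By the way, you don't have to hard-code the core count. Here's a small sketch (not from the original post) that asks the machine how many cores it has using parallel::detectCores() and then confirms what the backend registered with foreach::getDoParWorkers():
library(parallel)
library(doMC)
# Ask the OS how many cores we have, then register all of them
n_cores <- detectCores()
registerDoMC(cores = n_cores)
# Sanity check: how many workers will plyr/foreach actually use?
foreach::getDoParWorkers()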
Processing your data
We're going to be using plyr to perform our data processing. plyr plays nicely with doMC and will automatically begin using multiple cores. What this means for you is that once you've registered your cores, you can pretty much forget about doMC!
As a simple example, we'll process the iris dataset using the ever-useful ddply function. We'll shake things up a bit and have the function pause for 2 seconds. Since iris has 3 species and ddply processes each group one after the other, the command will take ~6 seconds.
system.time(ddply(iris, .(Species), function(x) {
Sys.sleep(2)
nrow(x)
}))
# user system elapsed
# 0.005 0.001 6.016
Nothing too crazy. But watch what happens when you add the .parallel = TRUE argument to the ddply call.
system.time(ddply(iris, .(Species), function(x) {
Sys.sleep(2)
nrow(x)
}, .parallel = TRUE))
# user system elapsed
# 0.018 0.015 2.031
Woot! That only took 2 seconds! What does that mean for us? It means we just processed each species on its own, at the same time. Pretty slick, right?
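If you're curious what that flag is actually doing, here's a rough hand-rolled equivalent using foreach's %dopar% operator. It's only an illustrative sketch (plyr's internals are more involved), but the idea is the same: each species gets shipped off to one of the workers you registered with doMC.
library(foreach)
# Roughly what .parallel = TRUE does: farm each group out to a worker
system.time(
  counts <- foreach(s = levels(iris$Species), .combine = c) %dopar% {
    Sys.sleep(2)
    nrow(iris[iris$Species == s, ])
  }
)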
And lucky for us, pretty much every apply-style function in plyr (ddply, llply, and friends) has a .parallel option. What does that mean for you? Well, since I use plyr for tons of stuff already, it means I have to do zero extra work in order to parallelize my code!
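For instance, the same flag works with llply. A quick sketch (assuming you've already registered your cores with doMC as above; slow_square is just a made-up function for illustration):
# A deliberately slow function so the speedup is visible
slow_square <- function(x) {
  Sys.sleep(2)
  x^2
}
# Four elements across four (or more) workers: ~2 seconds instead of ~8
system.time(llply(1:4, slow_square, .parallel = TRUE))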
Kicking it up a notch
To give a demo on a larger dataset, I'm going to calculate some summary stats on the Yahoo! Search Marketing advertiser bidding dataset. It's basically the biggest, nastiest dataset that I could find on short notice.
We'll read it in, partition it by phrase_id and account_id, and see how the different facets compare.
headers <- c("timestamp", "phrase_id", "account_id", "price", "auto")
# The Yahoo! file has no header row, so read it in and name the columns ourselves
df <- read.table("./ydata-ysm-advertiser-bids-v1_0.txt")
colnames(df) <- headers
system.time(ddply(df, .(phrase_id, account_id), function(x) {
  # Return the per-group summary stats (ddply keeps the last expression)
  data.frame(mean = mean(x$price, na.rm = TRUE), median = median(x$price, na.rm = TRUE))
}))
# user system elapsed
# 96.645 3.343 99.964
Now if we parallelize the same code:
system.time(ddply(df, .(phrase_id, account_id), function(x) {
  # Same summary stats, but now each (phrase_id, account_id) group runs on a worker
  data.frame(mean = mean(x$price, na.rm = TRUE), median = median(x$price, na.rm = TRUE))
}, .parallel = TRUE))
# user system elapsed
# 94.228 16.267 48.553
Bam! Just 48.5 seconds! There you have it: we just crunched through a dataset that's just shy of 1GB on a single machine running R (and nobody got hurt).
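Of course, system.time is only wrapped around these calls for benchmarking. In real use you'd assign the result and poke at it; a minimal sketch:
# Same parallel call, but keep the per-group summaries around
results <- ddply(df, .(phrase_id, account_id), function(x) {
  data.frame(mean = mean(x$price, na.rm = TRUE), median = median(x$price, na.rm = TRUE))
}, .parallel = TRUE)
# One row per (phrase_id, account_id) pair
head(results)
dim(results)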
Scaling with ScienceBox
Want this to be even easier? Try out the ScienceBox. It gives you full control over your very own industrial grade analytics server. Everything runs behind your own firewall and you retain complete ownership over your data.
We've just released a bunch of new features (GitHub integration, an updated UI, an enhanced command-line interface), so give it a spin or email me at greg@yhathq.com if you have questions about it.
Final Thoughts
Parallel computing doesn't have to be hard or complicated. Keep it simple and you can leverage the great features that R has to offer with half of the headaches!
I'm also fully aware that there are other great R packages out there for doing parallel processing. Take a look at packages like parallel, foreach, and snow if you're interested in learning more.
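For example, base R ships with the parallel package, whose mclapply is a forked, multi-core drop-in for lapply. This snippet is mine, not from the original post (and, like doMC, it relies on forking, so mc.cores > 1 isn't supported on Windows):
library(parallel)
# Each element of the list runs in its own forked process
slow_fn <- function(x) {
  Sys.sleep(1)
  sqrt(x)
}
system.time(mclapply(1:8, slow_fn, mc.cores = 4))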