Like a lot of folks, I have a love/hate relationship with R. One topic I've seen people struggle with is parallel computing, or more directly, "How do I process data in R when I run out of RAM?" But fear not! R actually has some easy-to-use parallelization packages!
Don't let this happen to you!
Here's a quick post on doing parallel computing in R.
Picking a library
My take on parallel computing has always been simpler is better. You can get in deep pretty quickly if you start poking your nose around some of the more complicated parts of multi-machine, out of core operations.
For scaling up R, I've come up with the following triage:
1) Try it on my laptop
2) Spin up the biggest, baddest server I can find on EC2 and run plain, vanilla R
3) Use the same big, bad server and convert my code to run in parallel using plyr
So for this post I'll assume you've already tried #1 and #2 and I'll skip right ahead to #3.
First things first: you need to pick a parallel computing package. As I previously mentioned, I'm a single-tenant simpleton. I typically reach for the time-tested combo of doMC and plyr. Call me primitive or old fashioned if you like, but 60% of the time it works every time.
Configuring your environment
Ok, so you've picked your packages; now we need to load 'em up. After importing them, we actually need to tell doMC how many cores we can use on our machine. If you've selected the gnarly R3 High-Memory Eight Extra Large, then you're about to be toe to toe with 32 cores of raw, cloud computing power. To do this, just run the following:
library(plyr)
library(doMC)
doMC::registerDoMC(cores=32) # or however many cores you have access to
Just let that horsepower course through your veins
If you followed the link to ec2instances.info, you'll notice that 32 cores is equal to "104 EC2 compute units". If you don't know what that means, don't be ashamed. Nobody does. If you really want a formal definition, you can check out the EC2 FAQ for a very unsatisfying answer.
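By the way, you don't have to hard-code the core count. Here's a small sketch (not from the original post) that asks the machine how many cores it has using parallel::detectCores() and then confirms what the backend registered with foreach::getDoParWorkers():
library(parallel)
library(doMC)
# Ask the OS how many cores we have, then register all of them
n_cores <- detectCores()
registerDoMC(cores = n_cores)
# Sanity check: how many workers will plyr/foreach actually use?
foreach::getDoParWorkers()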
Processing your data
We're going to be using plyr to perform our data processing. plyr plays nicely with doMC and will automatically begin using multiple cores. What this means for you is that once you've registered your cores, you can pretty much forget about doMC!
As a simple example, we'll process the iris dataset using the ever-useful ddply function. We'll shake things up a bit and have the function pause for 2 seconds. Since iris has 3 species and ddply processes each group one after the other, the command will take ~6 seconds.
system.time(ddply(iris, .(Species), function(x) {
Sys.sleep(2)
nrow(x)
}))
# user system elapsed
# 0.005 0.001 6.016
Nothing too crazy. But watch what happens when you add the .parallel = TRUE argument to the ddply call.
system.time(ddply(iris, .(Species), function(x) {
Sys.sleep(2)
nrow(x)
}, .parallel = TRUE))
# user system elapsed
# 0.018 0.015 2.031
Woot! That only took 2 seconds! What does that mean for us? It means we just processed each species on its own, at the same time. Pretty slick, right?
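If you're curious what that flag is actually doing, here's a rough hand-rolled equivalent using foreach's %dopar% operator. It's only an illustrative sketch (plyr's internals are more involved), but the idea is the same: each species gets shipped off to one of the workers you registered with doMC.
library(foreach)
# Roughly what .parallel = TRUE does: farm each group out to a worker
system.time(
  counts <- foreach(s = levels(iris$Species), .combine = c) %dopar% {
    Sys.sleep(2)
    nrow(iris[iris$Species == s, ])
  }
)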
And lucky for us, pretty much every apply-style function in plyr (ddply, llply, and friends) has a .parallel option. What does that mean for you? Well, since I use plyr for tons of stuff already, it means I have to do zero extra work in order to parallelize my code!
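For instance, the same flag works with llply. A quick sketch (assuming you've already registered your cores with doMC as above; slow_square is just a made-up function for illustration):
# A deliberately slow function so the speedup is visible
slow_square <- function(x) {
  Sys.sleep(2)
  x^2
}
# Four elements across four (or more) workers: ~2 seconds instead of ~8
system.time(llply(1:4, slow_square, .parallel = TRUE))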
Kicking it up a notch
To give a demo on a larger dataset, I'm going to calculate some summary stats on the Yahoo! Search Marketing advertiser bidding dataset. It's basically the biggest, nastiest dataset that I could find on short notice.
We'll read it in, partition it by phrase_id and account_id, and see how the different facets compare.
headers <- c("timestamp", "phrase_id", "account_id", "price", "auto")
# The Yahoo! file has no header row, so read it in and name the columns ourselves
df <- read.table("./ydata-ysm-advertiser-bids-v1_0.txt")
colnames(df) <- headers
system.time(ddply(df, .(phrase_id, account_id), function(x) {
  # Return the per-group summary stats (ddply keeps the last expression)
  data.frame(mean = mean(x$price, na.rm = TRUE), median = median(x$price, na.rm = TRUE))
}))
# user system elapsed
# 96.645 3.343 99.964
Now if we parallelize the same code:
system.time(ddply(df, .(phrase_id, account_id), function(x) {
  # Same summary stats, but now each (phrase_id, account_id) group runs on a worker
  data.frame(mean = mean(x$price, na.rm = TRUE), median = median(x$price, na.rm = TRUE))
}, .parallel = TRUE))
# user system elapsed
# 94.228 16.267 48.553
Bam! Just 48.5 seconds! There you have it: we just crunched through a dataset that's just shy of 1GB on a single machine running R (and nobody got hurt).
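Of course, system.time is only wrapped around these calls for benchmarking. In real use you'd assign the result and poke at it; a minimal sketch:
# Same parallel call, but keep the per-group summaries around
results <- ddply(df, .(phrase_id, account_id), function(x) {
  data.frame(mean = mean(x$price, na.rm = TRUE), median = median(x$price, na.rm = TRUE))
}, .parallel = TRUE)
# One row per (phrase_id, account_id) pair
head(results)
dim(results)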
Scaling with ScienceBox
Want this to be even easier? Try out the ScienceBox. It gives you full control over your very own industrial grade analytics server. Everything runs behind your own firewall and you retain complete ownership over your data.
We've just released a bunch of new features (GitHub integration, an updated UI, an enhanced command-line interface), so give it a spin or email me at greg@yhathq.com if you have questions about it.
Final Thoughts
Parallel computing doesn't have to be hard or complicated. Keep it simple and you can leverage the great features that R has to offer with half of the headaches!
I'm also fully aware that there are other great R packages out there for doing parallel processing. Take a look at packages like parallel, foreach, and snow if you're interested in learning more.
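For example, base R ships with the parallel package, whose mclapply is a forked, multi-core drop-in for lapply. This snippet is mine, not from the original post (and, like doMC, it relies on forking, so mc.cores > 1 isn't supported on Windows):
library(parallel)
# Each element of the list runs in its own forked process
slow_fn <- function(x) {
  Sys.sleep(1)
  sqrt(x)
}
system.time(mclapply(1:8, slow_fn, mc.cores = 4))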