Like a lot of folks, I have a love/hate relationship with R. One topic I've seen people struggle with is parallel computing, or more directly, "How do I process data in R when I run out of RAM?" But fear not! R actually has some easy-to-use parallelization packages!
Don't let this happen to you!
Here's a quick post on doing parallel computing in R.
Picking a library
My take on parallel computing has always been that simpler is better. You can get in deep pretty quickly if you start poking your nose around the more complicated parts of multi-machine, out-of-core operations.
For scaling up R, I've come up with the following triage:
3) Use that same big, bad server and convert my code to run in parallel using plyr
So for this post I'll assume you've already tried #1 and #2 and I'll skip right ahead to #3.
First things first. You need to pick a parallel computing package. As I previously mentioned, I'm a single-tenant simpleton. I typically reach for the time-tested combo of doMC and plyr. Call me primitive or old fashioned if you like, but 60% of the time it works every time.
Configuring your environment
Ok, so you've picked your packages; now we need to load 'em up. After importing them, we need to tell doMC how many cores we can use on our machine. If you've selected the gnarly R3 High-Memory Eight Extra Large (r3.8xlarge), then you're about to go toe-to-toe with 32 cores of raw, cloud computing power. To do this, just run the following:
doMC::registerDoMC(cores=32)  # or however many cores you have access to
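If you'd rather not hardcode the core count, base R's parallel package can detect it for you. A minimal sketch, assuming doMC is installed:

```r
library(doMC)

# detectCores() ships with base R's parallel package;
# register however many cores the machine reports
registerDoMC(cores = parallel::detectCores())
```

This way the same script scales sensibly from your laptop to a 32-core server without edits.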
Just let that horsepower course through your veins
If you followed the link to ec2instances.info, you'll notice that 32 cores equals "104 EC2 compute units". If you don't know what that means, don't be ashamed. Nobody does. If you really want a formal definition, you can check out the EC2 FAQ for a very unsatisfying one.
Processing your data
We're going to be using plyr to perform our data processing. plyr plays nicely with doMC and will automatically begin using multiple cores. What this means for you is that, once you've registered your cores, you can forget about doMC!
As a simple example, we'll process the iris dataset using the ever-useful ddply function. We'll shake things up a bit and have the function pause for 2 seconds. Since iris has 3 species, the command will take ~6 seconds when run serially.
system.time(ddply(iris, .(Species), function(x) {
  Sys.sleep(2)
  nrow(x)
}))
#    user  system elapsed
#   0.005   0.001   6.016
Nothing too crazy. But watch what happens when you add the .parallel = TRUE argument to the ddply call.
system.time(ddply(iris, .(Species), function(x) {
  Sys.sleep(2)
  nrow(x)
}, .parallel = TRUE))
#    user  system elapsed
#   0.018   0.015   2.031
Woot! That only took 2 seconds! What does that mean for us? It means we just processed each species on its own, at the same time. Pretty slick, right?
And lucky for us, pretty much every function in plyr has a .parallel option. What does that mean for you? Well since I use plyr for tons of stuff already, it means I have to do 0 extra work in order to parallelize my code!
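As a quick sketch of that, here's the same flag on llply, the list-processing version, with hypothetical toy data (the list names and core count are just for illustration):

```r
library(plyr)
library(doMC)
registerDoMC(cores = 2)  # toy setup; use your real core count

# With .parallel = TRUE, each list element is handed to its own worker
totals <- llply(list(a = 1:10, b = 11:20, c = 21:30),
                function(x) sum(x),
                .parallel = TRUE)
```

Same code, same results — your serial plyr calls are just a .parallel = TRUE away from using all your cores.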
# (The ~1GB dataset used here isn't shown; `df` and `group` are placeholders.)
system.time(ddply(df, .(group), function(x) {
  data.frame(mean=mean(x$price, na.rm=TRUE), median=median(x$price, na.rm=TRUE))
}, .parallel = TRUE))
#    user  system elapsed
#  94.228  16.267  48.553
Bam! Just 48.5 seconds! There you have it: we just crunched through a dataset just shy of 1GB on a single machine running R (and nobody got hurt).
Scaling with ScienceBox
Want this to be even easier? Try out the ScienceBox. It gives you full control over your very own industrial grade analytics server. Everything runs behind your own firewall and you retain complete ownership over your data.
We've just released a bunch of new features (GitHub integration, an updated UI, an enhanced command-line interface), so give it a spin or email me at email@example.com if you have questions about it.
Parallel computing doesn't have to be hard or complicated. Keep it simple and you can leverage the great features R has to offer with half the headaches!
I'm also fully aware that there are other great R packages out there for doing parallel processing. If you're interested in learning more, take a look at packages like parallel, snow, and foreach.