Parallelizing R on Supercomputers
Executive summary: I've posted a tutorial on how to parallelize R codes on my website. This post is a more personal reflection on how I got there.
"Parallel Options for R" was the title of the first talk I ever presented on behalf of my employer, and despite the fact that I didn't (and still don't) know anything about the R language, statistics, or how to parallelize any of it, the shoe seemed to fit at the time. The talk went over well, and I've been asked to give the talk in my capacity as the resident "parallel R guy" plenty of times since.
Every once in a while I get asked how I came to become so involved in some of the weird topics about which I write and speak--after all, I really have no formal training in things like SR-IOV, Hadoop, and next-generation gene sequencing. As much as I'd like to claim I just have some infinite sage-like knowledge, the reality is that I have to learn about these various technologies as a result of my day job--answering helpdesk tickets. In the case of parallel R, I simply got a ticket in January 2013 that read,
"I just ran an intensive R script through [the supercomputer]. Its not much faster than my own machine. Could you point me to a tutorial for how I can make the process run in different processors in parallel?"
I couldn't very well say "lol no idea" (which was the truth), but the fact is that there are only about three whole people in my group** who are tasked with solving every problem that comes in from the thousand unique users who run jobs on our system every year. If I didn't know the answer, there was a good chance that nobody else knew either. That doesn't change the fact that someone needs to answer the user's question though, and that fact is what got me into the parallel R business.
In my quest for an answer to this user's helpdesk request, I further discovered that there were no good tutorials online that explain the process of parallelizing R codes. Thus, I wound up buying a book to learn what I needed to know to answer the user's question. I learned the rough basics of how someone might go about parallelizing their R codes, gave the user a few starting pointers and some of the libraries he might want to check out on CRAN, and tried to provide some boilerplate code that might help him parallelize his particular script. We then went our separate ways.
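To give a flavor of the kind of boilerplate I mean (a hypothetical sketch, not the actual code I sent the user), the most common first step is taking a loop over independent inputs, expressing it as `lapply`, and then swapping in `mclapply` from R's built-in `parallel` package:

```r
library(parallel)

# A stand-in for some expensive, independent computation on each input.
# In a real script this would be the user's per-element workload.
slow_square <- function(x) {
  x * x
}

inputs <- 1:8

# Serial version: one element at a time.
serial_result <- lapply(inputs, slow_square)

# Parallel version: fork worker processes, one chunk of inputs per core.
# Note that mclapply relies on fork(), so on Windows it only runs
# serially (mc.cores must be 1 there).
parallel_result <- mclapply(inputs, slow_square,
                            mc.cores = min(4L, detectCores()))

# The results come back in the same order as the inputs,
# so the two versions are interchangeable.
stopifnot(identical(serial_result, parallel_result))
```

The appeal of this pattern is that it requires almost no restructuring: if a loop's iterations don't depend on each other, the same code runs serially or in parallel depending on one function name.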
All this reflection aside, though, I never lost sight of the reality that I never did answer the user's question: what is a good tutorial on how to parallelize R codes?
This question has actually come up a number of times from a number of users over the last year. Rather than take the easy route and tell everyone to attend my next talk on the subject, I decided to turn my presentation on parallelizing R into a series of tutorials which I've put on my website.
It's not comprehensive by any means; notably, I did not cover either the pbdR library out of UTK/Oak Ridge (an omission with no particularly good justification) or SPRINT from Edinburgh (it's a bit specialized in functionality). I also haven't had the opportunity to convert my presentation on using R with Hadoop and Spark into the final component of this tutorial. Those topics will come as time permits. Regardless, I hope someone finds the write-up useful.
** I say "whole people" to reflect that our funding provides somewhere in the neighborhood of three full-time equivalent employees providing front-line user support. That funding winds up getting distributed across more physical staff.