Quick MPI Cluster Setup on Amazon EC2


I was recently tasked with doing some MPI benchmarks on Amazon EC2 to get a general idea of how well EC2's Cluster Compute capabilities perform.  I had never used EC2 before, so the notion of setting up a working set of EC2 instances that have the necessary configuration to run MPI applications was initially quite daunting.  I had no idea what the difference between an AMI and an instance was, what AMI and instance to use, and if I should mess with any automatic provisioning/configuration tools (like Starcluster) to quickly spin up a cluster.

Most guides online are kind of unhelpful in that they try to illustrate some proof of concept in how easy it is to get a fully configured cluster-in-the-cloud setup using some sort of provisioning toolchain.  They gloss over the basics of exactly how to start these instances up and what to expect as far as their connectivity.  Fortunately it only took me a morning to get MPI up and running, and for the benefit of anyone else who just wants to get MPI applications running on EC2 with as little fuss as possible, here are my notes.

It is worth pointing out that I use the term "instance" and "node" (or "compute node") interchangeably in the following discussion.

Step 1: Establish an EC2 Account

Go to http://aws.amazon.com/ and click "My Account / Console" at the top right, then navigate to the "AWS Management Console."  You'll be asked to log in, and if you already have an Amazon.com account, you can use that one.  You are then prompted to sign up for AWS, which is a process that requires a credit card number on file to charge for instances, and a robocall to verify your identity.  Once this is done, find your way to the AWS Management Console and click the "EC2" option:

This should take you to the EC2 Dashboard from where you can launch VMs.

Step 2: Create your Instances

Click the big blue "Launch Instances" button to get into the wizard.

I used the Classic Wizard.

Select a VM Image

The first thing the wizard asks you to do is select an AMI (Amazon Machine Image).  Scroll down and choose the Cluster Compute Amazon Linux ABI.

This is sort of like a VM image that establishes your OS distribution, base software stack, and initial login user.  What threw me off is that there is a regular "Amazon Linux" and the "Cluster Compute Amazon Linux," yet they both have the same description.  As far as I understand, the Cluster Compute image uses a virtualization layer (called "HVM") that is a lot thinner than the one used by non-compute images. You wind up getting an instance that runs on an entire physical compute node, and only the I/O adapters are virtualized.  With that being said, this "Cluster Compute Amazon Linux" AMI says nothing about the hardware you want your instances to use.

Select an Instance Type

This next step is where hardware comes in.

The cc2.8xlarge instance has a configuration reasonably close to what SDSC Gordon has: 2x Intel Xeon E5-2670 processors and ~60 GB of RAM per node, so that's what I chose.  I also didn't want to run up a huge bill while I was getting this process stamped out, so I only wanted to launch two nodes to start--the minimum necessary to test MPI communications.

Also, Amazon lets you launch these instances like normal at a rate of $2.40 per instance per hour, or you can potentially get a significantly discounted rate by using spot instances.  Despite the possibility that my spot instances could be shut off at any time, the 90% discount was worth the risk so I went with the spot instances and bid the absolute lowest rate.  This turned out to be sufficient to have these virtual clusters running for a few hours, and I've never actually seen the spot instance bid rise above the value we paid.  I guess these things are not in high demand.  For what it's worth, the entire exercise of figuring all of this EC2 stuff out on a 2-node cluster, then spinning up a 4-node cluster and running a specific application benchmark only cost us $3.04.

Define a Placement Group

Either way, the next step is to define the instances' Placement Group, which is a critical component.

Instances which are on the same placement group share the same network subnet and get the full bisection bandwidth available to these Cluster Compute 10 GigE connections.  Give this cluster's placement group a name, and if you want to add nodes to this cluster on the fly later, you have the possibility of starting more instances and requesting that they be added to this same subnet.

The User Data field is unimportant for our purposes, and you may want to change the Shutdown Behavior to "terminate."  This will ensure that you aren't being billed for the VMs while they are "stopped" but not "terminated."

Configure Storage

The next step is defining your instances' storage volumes.

Even though the Cluster Compute AMI is backed by EBS (block storage), I don't think it's persistent since persistent EBS typically costs additional money.  I accept the defaults here, but you can change the "Delete on Termination" value to false.  It appears that AWS charges you all of the regular EBS rates for the default (required?) EBS-backed root volumes used by the Cluster Compute instances, so it might be possible to make the disks on these instances non-volatile.  I haven't played with this very much though.

If you have the money to burn, you can create a persistent EBS store and NFS-mount it across all compute nodes.  I quickly cover how to do this at the bottom of this guide.

Define VM Tags

The next screen lets you define tags to manage your instances.

This is not necessary for this simple test, so just continue past this screen.

Create a Key Pair

The next step, creating your key pair, is critical to being able to get into your instances once they boot up.

Give your key an arbitrary name, then download it as a .pem file.  This is the privatekey that will allow you to ssh into the instance once it's up; if you don't have this, you will simply be locked out of your own instance.

Set Proper Security Group

The next step of customizing your cluster's security group is also essential.

The default behavior of the wizard is to have you create a new security group called "quick-start-1" that will open port 22 (SSH) to the world but block all traffic everywhere else.  You need to let all instances (nodes) within your cluster communicate freely for MPI to work, so you must add a new rule to this security group that opens all ports (1-65535) to all other instances within the same security group (quick-start-1).  Once you enter these two parameters, you must then click the Add Rule button or this rule will not be applied and MPI will not work!

Launch Your Instances

This is the end of the instance wizard.

Click Launch, and your instances will begin booting up.  Go back to the EC2 Dashboard and wait for your VMs to reach the "Running" state.

Step 3. Configuring the Virtual Cluster

Once the instances are booted up, you can get their public IP addresses from the EC2 Dashboard by selecting their entry under the "Instances" page.  The default user on these Cluster Compute Amazon Linux images is "ec2-user", so ssh into one of the instances using the .pem file downloaded earlier:
$ ssh -i ~/Downloads/glock.pem ec2-user@ec2-54-245-13-102.us-west-2.compute.amazonaws.com
Once logged into the instance, getting set up is a pretty quick process.  The general procedure involves making changes on a single, master node (whether that be install packages, compile software, upload data, etc) and then push those changes out to all the compute nodes by hand.  This is a very rough approach, but when it comes down to it, you don't need very much to get MPI running.  The goal here is to do as little as it takes to run an MPI benchmark across Amazon EC2 Cluster Compute nodes.

Preliminary Cluster Setup

The Cluster Compute instances are a "cluster" inasmuch as they are all booted to the same OS and all share the same 10 GigE subnet with full bisection bandwidth; on the software front, there is close to nothing that makes the instances a cluster when they are booted up.  The very first thing I did was set up a few things as a matter of convenience.

1. Establish a useful /etc/hosts

Add all nodes' internal IP addresses (10.x.x.x) to /etc/hosts under aliases like node2, node3, node4, etc.  This test cluster has only two nodes, so only two entries need to be added.

2. Create alias to push files across the cluster

Create a simple script to push files out from this master node (node1) to the worker nodes (node2 and any others you may want to have booted).  Mine was a file called pushout that looked like this:
for node in node2 node3 node4
  scp $1 $node:$1
If you want to get fancy, you make a similar script to execute a command across all nodes as well.
for node in node2 node3 node4
  ssh $node "$@"

3. Enable password-less ssh between instances

Transfer the .pem onto the nodes so they can use password-less ssh to communicate.  I copied mine onto all my instances as /home/ec2-user/.ssh/id_rsa so that ssh/scp used it by default.

Installing and Running MPI

The Amazon Linux AMI is a Red Hat derivative and uses the yum package manager.  Amazon provides both OpenMPI and mpich RPMs in their repository, so installing MPI is a simple matter of

sudo yum install openmpi-devel
sudo yum install mpich-devel
This will pull in the GCC compiler, the MPI runtimes, and the mpicc wrapper.  You have to do this on all compute nodes.  In addition, you have to then add the following lines to .bashrc:

export PATH=/usr/lib64/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib

Be sure to then push this new .bashrc file out to all of your compute nodes or your mpi jobs will throw "bash: orted: command not found" errors as soon as you issue mpirun.

With this, you can now build applications with mpicc and run them with mpirun.  Given that our compute cluster here has two nodes (called node1 and node2), each with 16 cores (actually 16 cores + 16 hyperthreads), running a 32-way MPI job is just a matter of creating a hostfile compatible with OpenMPI's mpirun, e.g.,

$ cat ~/nodefile
node1 slots=16
node2 slots=16

and, assuming our mpi binary is ~/simulation.x and has been pushed out to all compute nodes, doing

mpirun -np 32 -hostfile ~/nodefile ~/simulation.x

The hostfile format is a little different for mpich's mpirun, but using mpich instead of OpenMPI is much the same.

Getting Fancy: NFS Filesystem

Creating a persistent EBS block storage volume is pretty straightforward.  The rates seem pretty cheap ($0.11 per GB per month, and $0.11 per million IOPS, however much that is), so I made one in the same Availability Zone as my cluster (this is essential!), attached it to my master node (node1) from the EC2 Management Console, and checked dmesg on node1 to see what the device name was (/dev/xdvf).  A quick few commands made it available:

# mkfs.ext4 /dev/xdvf
# mkdir /ebs
# mount /dev/xdvf /ebs
$ mount|grep /ebs
/dev/xvdf on /ebs type ext4 (rw)

Sharing this filesystem across the nodes is pretty simple, and you don't really need a separate EBS volume to export a filesystem over NFS.  You can just mkdir /ebs and use the volatile root EBS volume that comes with your instances if you don't want to mess with adding EBS volumes.

Either way, let's assume we're sharing /ebs.  First install nfsd on the master:

# yum install nfs-utils
Dependencies Resolved
 Package            Arch        Version                    Repository      Size
 nfs-utils          x86_64      1:1.2.3-36.13.amzn1        amzn-main      415 k
Installing for dependencies:
 libevent           x86_64      2.0.18-1.10.amzn1          amzn-main      278 k
 libgssglue         x86_64      0.1-11.7.amzn1             amzn-main       23 k
 libtirpc           x86_64      0.2.1-5.7.amzn1            amzn-main       87 k
 nfs-utils-lib      x86_64      1.1.5-6.9.amzn1            amzn-main       76 k
 rpcbind            x86_64      0.2.0-11.5.amzn1           amzn-main       56 k
Transaction Summary
Install       6 Package(s)
Total download size: 936 k
Installed size: 2.0 M
Is this ok [y/N]: y

Then configure the export by adding this line to /etc/exports:

/ebs (rw,insecure)

Start rpcbind and nfsd on your master node:

# /etc/init.d/rpcbind start
# /etc/init.d/nfs start

And NFS should be ready to go.  Log into each other node and install NFS, create the mountpoint, and mount it:

# yum install nfs-utils
# mkdir /ebs
# mount node1:/ebs /ebs

This then saves the hassle of having to sync up runtime files across nodes.  If you used a persistent EBS block store underneath this NFS share, you can also terminate/restart your instances without having to spend the money and time on EC2 bandwidth into and out of the cloud.