What's Killing Cloud Interconnect Performance
A small blog post benchmarking MPI on Google's new Compute Engine crossed my path this weekend, and since I've been messing around with virtualized interconnect performance lately, it caught my attention. While the post doesn't really say anything earth-shattering, it is another contribution to the overwhelming evidence that commercial cloud services are simply not any good for tightly coupled parallel applications and, as a consequence, traditional high-performance computing.
It does highlight a note that appears promising at first glance: apparently Microsoft plans to incorporate Infiniband as an interconnect in its Azure offering (although the provided link is dead). Unfortunately, posts like these seem to carry the assumption that replacing Ethernet with a "high-performance" interconnect is all we need to achieve cloud-HPC nirvana. The reality is that the lack of Infiniband in EC2 and now Compute Engine isn't what's holding back HPC in the cloud, a point that was studied and reported extensively back in 2010 when DOE was pursuing the Magellan project.
The sad fact is that you can replace EC2's 10gig with a bunch of bonded 10gig interfaces, high-bandwidth Infiniband, 100gig, or whatever else you want, and the fundamental problem remains. Replacing a dirt road with a dirt highway isn't going to get you from point A to point B any faster; it just lets you shovel more (or bigger) cars onto the same slow path.
To illustrate this point, I ran the tried-and-true OSU Microbenchmarks on a single rail of 4X QDR Infiniband (IB) attached to a single switch twice: once using native RDMA over Infiniband, and once using TCP/IP over Infiniband (IPoIB). Since both cases use the same underlying Infiniband physical layer, any difference in performance between the two is caused solely by the TCP/IP stack. Of course, TCP/IP is what all major cloud providers are using.
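For concreteness, the latency test is essentially a two-rank ping-pong: rank 0 sends a small message, rank 1 echoes it back, and the one-way latency is half the averaged round-trip time. The sketch below is my own simplified analogue of what osu_latency measures (the message size and iteration count are illustrative, not the OSU defaults); the only thing that changes between the two runs is the transport underneath MPI.

```c
/* Minimal two-rank MPI ping-pong latency sketch (simplified analogue of
 * osu_latency).  Compile with mpicc and run with exactly two ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 10000;      /* illustrative iteration count */
    const int msg_size = 8;       /* small message (bytes) to expose latency */
    char *buf = calloc(1, msg_size);
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {          /* send, then wait for the echo */
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {                  /* echo whatever arrives */
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)                /* one-way latency = half the round trip */
        printf("avg one-way latency: %.2f us\n",
               (t1 - t0) * 1e6 / (2.0 * iters));

    free(buf);
    MPI_Finalize();
    return 0;
}
```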
[Figure: Measured latency for MPI]

As shown above, using TCP causes a huge (~10x) increase in latency across the board. While TCP over IB still beats gigabit Ethernet, Infiniband's trademark low latency really goes out the window as soon as you start message passing over TCP, as you would in the cloud.
What about bandwidth? A 4X QDR link signals at 40 Gbit/s, which after 8b/10b encoding leaves 32 Gbit/s, or about 4 GB/s, of usable bandwidth. Here is what fraction of that theoretical maximum we can actually reach with MPI:

[Figure: Measured unidirectional bandwidth for MPI]
The results are, perhaps unsurprisingly, awful. RDMA over Infiniband gives us about 94% of peak bandwidth for sufficiently large messages, and TCP over gigabit Ethernet does surprisingly well, reaching 93% of its own peak. TCP over Infiniband, however, shows a dramatic (roughly 75%) drop in usable bandwidth, saturating at just under 800 MB/s, or 20% of the theoretical maximum.
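These unidirectional numbers come from a streaming measurement rather than a ping-pong: the sender keeps a window of non-blocking sends in flight, and the achieved bandwidth is simply bytes moved divided by elapsed time. The sketch below is my own simplified analogue of that pattern (the window size, message size, and iteration count are illustrative, not the OSU defaults). The bidirectional test discussed next is the same idea with both ranks sending and receiving at once, which is exactly where the TCP stack starts to choke.

```c
/* Minimal two-rank MPI streaming-bandwidth sketch (simplified analogue of
 * osu_bw).  Rank 0 keeps WINDOW non-blocking sends in flight; rank 1 posts
 * matching receives and acknowledges each window.  Compile with mpicc and
 * run with exactly two ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW   64               /* illustrative number of in-flight messages */
#define MSG_SIZE (1 << 20)        /* 1 MiB messages, large enough to saturate */
#define ITERS    100

int main(int argc, char **argv)
{
    int rank, size;
    char ack = 0;
    char *buf = malloc((size_t)WINDOW * MSG_SIZE);
    MPI_Request reqs[WINDOW];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        for (int w = 0; w < WINDOW; w++) {
            char *chunk = buf + (size_t)w * MSG_SIZE;   /* one slot per request */
            if (rank == 0)
                MPI_Isend(chunk, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                          &reqs[w]);
            else
                MPI_Irecv(chunk, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                          &reqs[w]);
        }
        MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
        /* a tiny ack per window keeps the sender from running ahead */
        if (rank == 0)
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        else
            MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double gbytes = (double)MSG_SIZE * WINDOW * ITERS / 1e9;
        printf("unidirectional bandwidth: %.2f GB/s\n", gbytes / (t1 - t0));
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```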
The story only gets worse when we start looking at bidirectional bandwidth:

[Figure: Measured bidirectional bandwidth for MPI]
Both TCP-based transports simply bog down under the congestion of passing large messages in both directions at once, and both fall back to roughly their unidirectional peak bandwidths. Thus, as long as the virtualized compute instances provided by cloud services only offer TCP/IP for MPI communications, message-passing performance will be awful. Bear in mind that the numbers presented above were measured on bare metal, not through a VM, so performance in the cloud will only be worse.
The corollary here is not to be so quick to jump to a cloud provider just because it offers Infiniband; unless the Infiniband interfaces are properly virtualized using something like SR-IOV, you're wasting your money. And as far as I know, there are no such deployments at scale.