Blue Gene/P First Impressions - Hardware
The institution at which I work recently acquired a Blue Gene/P (BG/P) system as the first of three phases of building an advanced supercomputing center here, and as a computationalist in materials science, I've been given the opportunity to attend a three-day crash course on using it.
Getting access to a BG/P in itself is pretty exciting--having an account (albeit a temporary evaluation one) is like being a kid in a candy store. I've become quite the computer nerd over the last decade, but although I am a professional computationalist, my work is done entirely on commodity hardware. I've never had access to anything as exotic as BG/P, and I've certainly never had the opportunity to be taught how such a machine is designed by the people who developed it. The guys that IBM sent to conduct this course really know their stuff and speak a language I don't often get to hear, so I've been trying to soak up as much as I can in the three days they're here.
So here are my first impressions.
Hardware
Blue Gene/P, contrary to my initial expectations, turns out to be a heterogeneous system. Our head node is an IBM Power 550 Express with two dual-core 4.2 GHz POWER6 processors with simultaneous multithreading and 16 GB of DDR2, running SUSE Linux Enterprise Server 10 SP1, although it sounds like it isn't uncommon for the head nodes to be x86-based Linux systems. The BG/P compute nodes are sufficiently different from anything capable of acting as a head node that all scientific code needs to be cross-compiled anyway, and cross-compiling for BG/P on Linux/x86 is no different than on Linux/POWER6.

Like a conventional cluster, the bulk of the work is done by individual compute "nodes." Each node, like in a conventional cluster, has a set of CPU cores and its own memory, and each node runs its own system image. But beyond this basic description, the "nodes" are really unlike the nodes of a conventional Beowulf cluster.
Physically, each node is little more than a six-inch-long card that plugs into a "node board." Each node board kind of resembles what I would consider a "node" in a conventional cluster--a pizza box that slides into the rack. Here is what a node board looks like (from Wikipedia):
Each copper heat sink is a single "node," which means a single node board has 16+16=32 nodes on it. On BG/P, each node has four cores, meaning each node board has 128 cores. Each BG/P rack takes 32 node boards, providing a pretty beefy 4096 cores per rack. Although I'd imagine 72U racks loaded with those 16-core Opteron Interlagos processors can approach this density, such a machine would probably run into major memory contention issues due to the extensive sharing of memory bandwidth and caches.
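Just to sanity-check the arithmetic above, here is the node-and-core count expressed as a few lines of Python (all numbers are the ones quoted in this post):

```python
# Core-count arithmetic for a BG/P rack, using the figures quoted above.
NODES_PER_BOARD = 16 + 16   # two rows of 16 compute nodes per node board
CORES_PER_NODE = 4          # quad-core PowerPC 450 per node
BOARDS_PER_RACK = 32

cores_per_board = NODES_PER_BOARD * CORES_PER_NODE   # 128
cores_per_rack = cores_per_board * BOARDS_PER_RACK   # 4096

print(cores_per_board, cores_per_rack)   # prints: 128 4096
```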
Each BG/P node is based around a PowerPC 450 SoC. Specifically, the nodes all have:
- one 32-bit PowerPC 450 CPU with four cores
- 32KB instruction cache + 32KB data cache per core (cache coherent via snoop filtering)
- 16-line (128 bytes per line) L2 data prefetch unit per core
- 4 x 2 MB of shared L3 cache
- 2 GB of DDR2 featuring uniform memory access (13.6 GB/s)
- and that's about it. No scratch space, nowhere for a page file to go, nothing.
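One consequence of those specs worth spelling out: with no swap, a job using all four cores has to live within its share of the 2 GB. A quick back-of-the-envelope sketch in Python, derived purely from the numbers listed above:

```python
# Per-core resource budget implied by the node specs above.
mem_per_node_gb = 2.0    # 2 GB of DDR2 per node, no swap to fall back on
cores = 4                # four PowerPC 450 cores per node
mem_bw_gbs = 13.6        # quoted uniform-access memory bandwidth

mem_per_core_mb = mem_per_node_gb * 1024 / cores   # 512 MB per core
bw_per_core_gbs = mem_bw_gbs / cores               # 3.4 GB/s per core

print(mem_per_core_mb, bw_per_core_gbs)   # prints: 512.0 3.4
```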
In the above photo of the node board, you can see that there are single-heatsink node cards all the way at the bottom. These are I/O nodes, and they actually handle many of the system calls generated by code, since the compute nodes are largely incapable of servicing those calls themselves. BG/P systems have a variable I/O-to-compute node ratio, with lower ratios being better suited for I/O-intensive applications but carrying significantly higher price tags. The pictured node board is configured with the top-of-the-line 1:16 ratio; the BG/P that we have is 1:64, one step up from the cheapest 1:128 configuration. Since a single I/O node handles the system calls for 16 to 128 compute nodes, there can be multiple jobs running on a single I/O node and, as expected, this can create a bottleneck if the system is running a lot of tiny jobs.
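To get a feel for the fan-in, here is a rough sketch of what each ratio means for a rack of 1024 compute nodes (the rack size follows from the earlier 32 boards x 32 nodes figure; the per-ratio numbers are just arithmetic, not vendor specs):

```python
# Rough I/O fan-in per ratio for one rack of BG/P compute nodes.
compute_nodes_per_rack = 32 * 32        # 32 node boards x 32 nodes each

for ratio in (16, 64, 128):             # compute nodes per I/O node
    io_nodes = compute_nodes_per_rack // ratio
    cores_per_io_node = ratio * 4       # four cores per compute node
    print(f"1:{ratio}: {io_nodes} I/O nodes per rack, "
          f"{cores_per_io_node} cores funneled through each")
```

At our 1:64 ratio, every I/O node is the sole syscall path for 256 cores, which is why lots of tiny, chatty jobs can pile up behind one I/O node.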
The interconnect is the final major unique component of BG/P--it uses a proprietary, low-latency, highly interconnected 3D torus topology rather than a switched network. This topology gives non-uniform "access" latencies ranging from 0.5 microseconds (one hop) to 5 microseconds (farthest node) in theory, or 2 to 10 microseconds with MPI. Each node therefore has six bidirectional links on this 3D torus (one per nearest neighbor), each with 3.4 Gb/s of bandwidth, and all of these links do DMA. The I/O nodes serve as the gateway to the outside world; each I/O node has a 10 Gb/s Ethernet link for this purpose.
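The non-uniform latency comes straight from hop count on the torus. Here is a minimal Python sketch of hop distance with wraparound; the `torus_hops` function and the 8x8x8 partition shape are mine, chosen purely for illustration:

```python
# Minimal sketch: hop count between two nodes on a 3D torus with wraparound.
def torus_hops(a, b, dims):
    """Manhattan distance where each axis can wrap around the torus."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

dims = (8, 8, 8)                                # hypothetical 512-node partition
print(torus_hops((0, 0, 0), (1, 0, 0), dims))   # nearest neighbor: 1 hop
print(torus_hops((0, 0, 0), (4, 4, 4), dims))   # farthest node: 12 hops
print(torus_hops((0, 0, 0), (7, 0, 0), dims))   # wraparound: 1 hop, not 7
```

The wraparound is the whole point of a torus over a plain mesh: the "edge" nodes are one hop apart, which caps the worst-case distance.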
What dazzled me is that, in addition to this 3D torus for message passing, there are two separate, dedicated networks: a tree network exclusively for reductions and collective calls with 850 MB/s of bandwidth, and a single dedicated network just for barriers. What's more, all of this networking is rolled up into the BG/P MPI implementation (a derivative of Argonne's MPICH2) so that proper utilization of the tree over the torus is generally transparent to the developer.
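The win from a dedicated tree network is easiest to see algorithmically: a reduction over N contributors can finish in O(log N) combining steps rather than N - 1 sequential ones. A toy Python sketch of that idea (this models the math, not the actual BG/P hardware):

```python
# Toy illustration of why a tree helps reductions: combining N values
# pairwise takes ceil(log2 N) rounds instead of N - 1 sequential adds.
import math

def tree_reduce_steps(n):
    steps = 0
    while n > 1:
        n = math.ceil(n / 2)   # each round halves the number of partial sums
        steps += 1
    return steps

for n in (32, 1024, 4096):
    print(n, tree_reduce_steps(n))   # 5, 10, and 12 rounds respectively
```

So an allreduce across a full 4096-core rack is a dozen combining rounds on dedicated wires, instead of thousands of messages contending with point-to-point traffic on the torus.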