A week in the life of an SC attendee
Last week was the annual Supercomputing conference, held this year in Dallas, and it was as busy as it always is. Every year I take plenty of photos and post plenty of tweets throughout the week, but this year I thought it might be fun to share some of those photos (and the related things I learned) now that the dust has settled. Since some people might also be interested in how one approaches the conference from a technical and philosophical perspective, I figured I'd write a more general piece documenting my entire SC experience this year.
This post wound up being a massive, meandering, chronological documentary of a week in my life that includes both technical and non-technical commentary. For anyone who is only interested in the technical insights I gained during SC, check out the items prefixed with (tech) in this table of contents:

- Before the Conference
- Saturday
- Sunday
- Monday
- Tuesday
- (tech) Technical Program, Data and Storage Paper Track Highlights
- Interlude of Meetings
- (tech) Cray and Fujitsu's Exascale System Hardware on the Expo Floor
- (tech) Analyzing Parallel I/O BOF Highlights
- The Cray Celebration
- Wednesday
- SC Student Career Fair and a Booth Talk
- (tech) Flash, Disk, and Tape Technologies on the Expo Floor
- (tech) Recap of the IO-500/VI4IO BOF
- Thursday
- (tech) WekaIO and Micron at the Exhibitor Forum
- NSF Future Directions BOF
- My SC Paper
- SC Technical Program Reception at the Perot Museum
- Friday
- After the Conference
Everything that's not labeled (tech) is part diary and part career development perspective. Hopefully someone will find something in here that's of some value.
Finally, disclosures:

- I omitted some names in the interests of respecting the privacy of the folks who took the time to talk to me one-on-one. If you're part of this story and don't mind having your name out there, I'd be happy to include it.
- Everything I paraphrase here is public information or conjecture on my part. Nothing in this post is either confidential or sensitive. That said, check your references before citing anything here. I don't know what I'm talking about.
- Everything here is my personal opinion and does not necessarily reflect the viewpoint of my employer or its funding agency. I attended the conference as a part of the regular course of business in which I am employed. However, I took all photos for personal purposes, and the entirety of this post was written on my own personal time.
Before the conference

Everyone's SC experience is different because it draws such a diverse range of professionals. There are plenty of activities for everyone ranging from students and early-career staff to senior management and leadership, and people on different career tracks (e.g., facilities staff, computer science researchers, program managers, product sales) are likely to be drawn to very different parts of the conference agenda. My priorities during the week of SC are definitely shaped by where I am in my career, so when filling out my calendar a few weeks ahead of the conference, I considered the following:

- My job is half research and half facilities staff. 50% of my time is funded by grant money to do applied research in characterizing parallel I/O systems. The other half of my time is spent staying current on emerging technologies in computing and storage. These two responsibilities mean that my SC is usually a mix of attending technical program sessions (to see what my peers in research are doing and see what research ideas might turn up in future technologies) and engaging with vendors.
- I work in advanced technologies. This means I am generally not in the trenches directly feeling the pains of operating HPCs today; instead, my job is to identify technologies that will cause fewer problems tomorrow. This also means that I don't have purchasing authority, and I am less likely to be involved with anything that's going to hit the floor in the next year. As such, I generally don't do vendor sales meetings or briefings at SC because they tend to focus on nearer-term products and sales.
- I did not get to where I am by myself. I first heard about SC in 2010 when I was a graduate student, and it sounded almost infinitely more exciting than the materials science conferences I was attending. I had no experience in HPC at the time, but it made me realize what I really wanted to pursue as a career. I relied heavily on the good will of the online HPC community to learn enough to get my first HPC job at SDSC, and after that, the faith of a great many more people to get me to where I am now. SC is often the only time I get to see the people who helped me out in my early career, and I always make time to connect with them.
The net result of these goals was a pretty full schedule this year:

My SC'18 schedule. Note that the time zone is PST, or two hours behind Dallas time.
I mark everything that I must attend (usually because I'm a speaker) in red so that I know my immovable obligations. Blue items are things I will attend unless an emergency comes up, and grey items are events I should attend because they sound interesting.
White space is very important to me too; between 10am and 6pm, white space is when I can walk the expo floor. A lot of people write off the expo as a waste of time, but I actually feel that it's one of the most valuable parts of SC. Since my job is to understand emerging technologies (and the market trends that drive them), accosting a pre-sales engineer or product manager at a strategically important technology provider can yield an invaluable peek into the markets they're serving. White space in the evenings is equally important for engagements of opportunity or for working on slides that have to be presented the next day.
Saturday, November 10

I always fly to SC on the Saturday before the conference starts. I have historically opted to do workshops on both Sunday and Monday, as I really enjoy attending both PMBS and PDSW-DISCS. I bring a suitcase that has extra room for conference swag, and doing so this year was critically important because I opted to bring along a pair of cowboy boots that I knew I would not want to wear on the flight home.

My brown kicks. Also Harriet the cat.

On just about every work flight I'm on, I've got PowerPoint slides to review; this trip was no different, and I spent the 3.5-hour flight time reviewing the slides I had to present the next day. Once in Dallas and at my hotel, I carried out my usual work-travel night-of-arrival ritual: order the specialty pizza from a local pizza joint, text home saying I arrived safely, and iron my clothes while watching Forensic Files.
Sunday, November 11

This year I had the honor of presenting one part of the famed Parallel I/O in Practice tutorial at SC along with Rob Ross, Brent Welch, and Rob Latham. This tutorial has been running for over fifteen years now, and at some point over those years, it picked up the curious ritual of being kicked off with some juggling:

Brent leading up to the tutorial start time with some juggling. He brought the pins with him.

The tutorial itself is really comprehensive and includes everything from device-level performance behavior to parallel file system architecture and I/O middleware. Even though I can proudly say that I knew 95% of the material being presented throughout the day (as I probably should have, since I was a presenter!), I found this particular slide that Rob Latham presented particularly insightful:

The ease and portability of using I/O middleware come without sacrificing performance! Sorry for the odd angle; this is the screen as we presenters saw it.
It makes the case that there is no significant performance penalty for using higher-level I/O libraries (like PnetCDF or parallel HDF5) despite how much easier they are to use than raw MPI-IO. One of the biggest take-home messages of the entire tutorial is to use I/O middleware wherever possible; doing so means that understanding parallel file system architecture isn't a prerequisite for getting good I/O performance.
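To make that concrete, here is a minimal sketch of my own (not a slide from the tutorial; the file name, dataset name, and sizes are all made up) showing what a collective write looks like through parallel HDF5. Everything about MPI-IO and the underlying file system stays hidden behind two property lists:

```c
/* Minimal sketch (my own, not from the tutorial): each MPI rank writes its
 * contiguous slice of one shared 1D dataset through parallel HDF5.
 * Compile with the HDF5 compiler wrapper, e.g. `h5pcc -o demo demo.c`. */
#include <mpi.h>
#include <hdf5.h>

#define N_PER_RANK 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* open the file collectively using the MPI-IO driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("demo.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* one shared dataset; each rank owns one contiguous slice of it */
    hsize_t dims = (hsize_t)nprocs * N_PER_RANK;
    hid_t filespace = H5Screate_simple(1, &dims, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t start = (hsize_t)rank * N_PER_RANK, count = N_PER_RANK;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &count, NULL);
    hid_t memspace = H5Screate_simple(1, &count, NULL);

    double buf[N_PER_RANK];
    for (int i = 0; i < N_PER_RANK; i++)
        buf[i] = rank + i * 1e-6;

    /* ask for collective I/O so the middleware can aggregate the writes */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```

Run it with something like mpirun -n 4 ./demo and every rank lands its slice in the same demo.h5 file without the application ever issuing an MPI_File_* call directly; that is the whole point of the middleware argument.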
Monday, November 12

Monday was the official first day of SC. Workshops and tutorials went on throughout the day, and the opening keynote and exhibition hall opening gala started in the evening.

PDSW-DISCS 2018

The 3rd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS) was on Monday, and I had the honor of being asked to serve as its Publicity Chair this year.

The PDSW-DISCS full-day workshop agenda

It's a really great workshop for people working in I/O, storage, and data and always draws a large crowd:
For researchers, it's a great venue for short papers that IEEE or ACM publishes, and it also has a really nice Work-in-Progress track where a page-long abstract gives you a seven minute spot to pitch your work. For attendees, it's always chock full of good talks that range from pure research to applied development.
This year's keynote speaker was Rangan Sukumar, Cray's analytics guru. His talk was interesting in that it approached the oft-mentioned convergence between HPC and AI (which has become an over-used trope by itself) from the perspective of a system architect (which is where the rubber meets the road):
As many great keynote speakers do, Rangan used hyperbole at times to contrast HPC and "Big Data" workloads, and this stimulated some discussion online. Although the slides alone tell only part of the story, you can download them from the PDSW-DISCS'18 website.
Later in the morning, Margaret Lawson (University of Illinois, Sandia Labs) presented a follow-on to the EMPRESS metadata system she presented last year:
Last year, EMPRESS seemed a little too researchy for me (as a facilities person) to sink my teeth into. This year though, the picture seems a lot more complete and I quite like the architectural framework. Although EMPRESS may not ever be a household name, the concept of separating data streams and metadata streams underneath some sort of I/O middleware is really solid. I think that storing data and metadata in different, architecturally distinct storage systems that map to the unique access patterns of data and metadata is ultimately the right way to approach large-scale data and metadata management in HPC, and I expect to see this design pattern proliferate as scientific data analysis becomes a bigger part of large-scale HPC workloads.
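The pattern is easy to picture in code. Below is a purely hypothetical illustration (not EMPRESS's actual API; every name here is mine) of middleware that routes bulk data to a bandwidth-optimized store and searchable metadata to a query-optimized one behind a single call:

```c
/* Hypothetical illustration of the design pattern, not EMPRESS's API:
 * one middleware call splits bulk data and rich metadata onto two
 * architecturally different backends.  The two put_* stubs stand in for
 * a bandwidth-optimized store and a query-optimized store. */
#include <stddef.h>
#include <stdio.h>

struct dataset_md {
    const char *name;        /* user-visible dataset name                  */
    const char *provenance;  /* e.g., simulation and timestep that made it */
    double      bounds[6];   /* queryable spatial extents                  */
};

/* bulk payload -> parallel file system or object store (bandwidth)        */
static int put_data(const char *id, const void *buf, size_t len)
{
    (void)buf;
    printf("data store:     %s (%zu bytes)\n", id, len);
    return 0;
}

/* searchable metadata -> key-value store or database (IOPS and queries)   */
static int put_metadata(const char *id, const struct dataset_md *md)
{
    printf("metadata store: %s (%s)\n", id, md->name);
    return 0;
}

int write_dataset(const char *id, const void *buf, size_t len,
                  const struct dataset_md *md)
{
    /* the application sees one call; the middleware splits the streams    */
    if (put_data(id, buf, len) != 0)
        return -1;
    return put_metadata(id, md);
}

int main(void)
{
    double field[1024] = {0};
    struct dataset_md md = { "temperature", "timestep 42", {0, 1, 0, 1, 0, 1} };
    return write_dataset("run001/temperature", field, sizeof(field), &md);
}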
In the afternoon, researchers from OSU offered a rare peek into Alibaba through a high-level analysis of SSD failure data provided by the Chinese hyperscaler:
The most alarming finding to me was that 20% of SSD failures were caused by humans yanking the wrong SSD. This immediately made me wonder who Alibaba is hiring to do routine operational support at their data centers; if people are causing a significant fraction of storage faults, either they aren't hiring with the same standards as their US counterparts, or their data centers are a mess. The speaker's proposed remedy was to use a different SSD form factor for each logical use case for SSDs so that operators could visually identify an SSD reserved for metadata versus one reserved for data. I personally think a label maker, a barcode scanner, and a decent salary is an easier, standards-based solution.
Other highlights included:

- Characterizing Deep-Learning I/O Workloads in TensorFlow, presented by Stefano Markidis of KTH. The first time I've seen an I/O-centric evaluation of how deep learning workflows will affect storage requirements of future systems. I learned a lot.
- Toward Understanding I/O Behavior in HPC Workflows, presented by Jakob Lüttgau of DKRZ/ANL. Rather than analyze the I/O pattern of a single MPI job, this paper began examining the I/O patterns of related jobs that all work towards a single scientific objective. Again, one of the first research papers I've seen that takes a critical look at end-to-end workflows from an I/O perspective.
- Methodology for the Rapid Development of Scalable HPC Data Services, presented by Matthieu Dorier of ANL. I think this paper is intended to be the canonical reference for the Mochi project, which I was glad to finally see. The idea of enabling quickly composable, purpose-built I/O services that are optimized for next-generation media and interconnects is a brilliant one, and I am a huge believer that this approach will be what demonstrates the earliest scientific successes that rely on storage-class memory at scale.
There were a number of really promising ideas presented at the WIP sessions as well, and recapping the entirety of the workshop is a blog post in and of itself. Fortunately, all the papers and slides are openly available on the PDSW-DISCS website.
SC Opening Keynote and Gala

I've actually stopped going to the SC keynotes over the last year since they're increasingly focused on the societal impacts enabled by HPC rather than HPC itself. While I'm definitely not knocking that theme--it's a great way to inspire early-career individuals, big-picture program management types, and disenchanted technical folks in the trenches--it's just not why I attend SC. Instead, I make use of my exhibitor badge and head into the expo floor before it opens to the public; this is the only time during the conference where I seem to be able to reliably find the people I want to meet at their booths.

This year I visited a few small businesses with whom I've fostered good will over the last few years to say hello, then dropped in on the SDSC booth to catch up with the latest news from my former coworkers. They also happen to have free beer on the opening night.
Once the expo floor opens to the public following the opening keynote, booth activity goes from zero to eleven really quickly. Every booth has a big splash during the gala which makes it hard to choose just one, but my decision this year was made easier by Cray choosing to unveil its new exascale HPC platform, Shasta, and celebrate its first sale of a Shasta system to NERSC.

Cray CEO Pete Ungaro at the Shasta unveiling ceremony
This new system, named Perlmutter, will be delivered in 2020 and has a bunch of really slick new technologies incorporated into it.
After Cray CEO Pete Ungaro unveiled the prototype Shasta blades, there was a celebratory toast and both NERSC and Cray staff donned their "ASK ME ABOUT SAUL" pins:

NERSC and Cray staff got these VIP pins to promote NERSC's next system, named after astrophysicist, Nobel laureate, and Berkeley Lab scientist Saul Perlmutter.
I stuck around to shake hands with my colleagues at Cray (including the CEO himself! Haven't washed my hand since) and catch up with some of my counterparts in storage R&D there.

The Beowulf Bash

The gala shut down at 9 PM, at which time I headed over to the Beowulf Bash to try to find some colleagues who said they would be there. I generally don't prioritize parties at SC for a couple reasons:

- Shouting over music all night is a great way to burn out one's voice. This is not good when I have to present something the next day.
- The crowds and lines often undercut my enjoyment of catching up with old colleagues (and meeting new ones).
- I almost always have slides that need to be finished by the end of the night.

I make an exception for the Bash because I personally value many of the people behind organizing and sponsoring it, and it captures the scrappier side of the HPC community which helped me get my foot in the door of the industry. This year I specifically went to catch up with my colleagues at The Next Platform; Nicole and Tim are uncommonly insightful and talented writers and editors, and they always have wacky anecdotes to share about some of the more public figures in our industry.

More generally and self-servingly though, maintaining a good relationship with members of the HPC trade press at large has tremendous value over time regardless of your affiliation or job title. Behind every interesting HPC news article is an editor with incomparable access to a broad network of people in the industry. Despite this, they are still subject to the same haters as anyone else who puts something out in the spotlight, so I have to imagine that putting in a kind word in person is always worth it.

At around midnight, only the die-hards were still around.

Late night Beowulf Bash at Eddie Deen's Ranch.

Regrettably, I barely had any time to catch up with my colleagues from the FreeNode HPC community at the Bash (or at all). Maybe at ISC.

After getting back to the hotel, I realized I hadn't eaten anything since lunch. I also learned that absolutely nothing that delivers food in the downtown Dallas area is open after midnight. After waiting an hour for a food delivery that wound up going to a restaurant that wasn't even open, I had to settle for a hearty dinner of Hot Pockets from the hotel lobby.

I hadn't eaten a Hot Pocket since graduate school. Still tastes the same.

Fortunately my Tuesday was relatively light on hard obligations.

Tuesday, November 13

Tuesday was the first day in which the SC technical program and expo were both in full swing. I split the day between paper talks, meetings, and the expo floor.

Technical Program, Part 1 - Data and Storage

My Tuesday morning began at 10:30 AM with the Data and Storage paper presentation session in the technical program. Of note, the first two papers presented were about cloud-centric storage paradigms, and only the third one was clearly focused on scientific HPC workloads.

- SP-Cache: Load-Balanced, Redundancy-Free Cluster Caching with Selective Partition by Yu et al was a paper squarely aimed at reducing tail latency of reads. Very important if you want to load an old Gmail message without waiting more than a few seconds for it to load. Less useful for most scientific HPC workloads.
- BESPOKV: Application Tailored Scale-Out Key-Value Stores by Anwar et al was a paper presenting a framework that is uncannily similar to the Mochi paper presented at PDSW on the day before. The premise was to allow people to compose their own Cassandra-like KV store with specific consistency and durability balance without having to reinvent the basic building blocks.
- Scaling Embedded In Situ Indexing with DeltaFS by Zheng et al was the talk I really wanted to hear but I had to miss on account of a conflicting meeting. The DeltaFS work being done by CMU and LANL is a really innovative way to deal with the scalability challenges of parallel file system metadata, and I think it's going to ultimately be where many of the nascent software-defined storage technologies aimed at HPC will converge.

Unfortunately I had to cut out of the session early to meet with a vendor partner at a nearby hotel.

Interlude of Meetings

The first of my two vendor meetings at this year's SC was less a sales call and more about continuing a long-running discussion about technology futures in the five-to-ten year timeframe. No sane vendor will commit to any roadmap that far out, especially given the uncertainty surrounding post-Moore's Law technologies, but they are receptive to input from customers who are formulating their own strategic directions for the same time period. Maintaining these sorts of ongoing conversations is a major part of what falls under my job title in "advanced technologies."
Unfortunately that vendor meeting overlapped with the Lustre BOF, but other staff from my institution were able to attend and ensure that our interests were represented. I was also able to attend the Lustre Lunch that followed the BOF which was very fruitful; in addition to simply being present to remind the Lustre community that I (and the institution I represent) am a part of it, I happened to connect in-person with someone I've known for a few years via Twitter and make a valuable connection. Unfortunately I had to leave the Lustre Lunch early to make another meeting, unrelated to SC, that allowed a geographically distributed committee to meet face-to-face.
After that committee meeting, I seized the free hour I had to visit the show room floor.
Expo Floor, Part 1

The first photo-worthy tech I saw was the Shasta blade at the Cray booth. Because the booth was mobbed with people during the previous night's gala, this was actually my first time seeing Shasta hardware up close. Here's the compute blade:

Part of a Cray Shasta compute blade up-close

Unlike the Cray XC blade of today's systems, which uses a combination of forced-air convection and heat exchangers to enable liquid cooling, these Shasta blades have direct liquid cooling, which is rapidly becoming a de facto minimum requirement for an exascale-capable rack and node design. I had some questions, so I struck up a conversation with a Cray employee at the booth and learned some neat things about the Shasta packaging.
For the sake of clarity, here is a hand-drawn, annotated version of the same photo:

Part of a Cray Shasta compute blade up-close with my annotations
What stood out to me immediately was the interesting way in which the DIMMs were direct-liquid cooled. Unlike IBM's attempt at this with the POWER 775 system (the PERCS system of Blue Waters infamy) where cold plates were attached to every DIMM, Cray has opted to use what looks like a heat-conductive foam that wraps copper cooling lines. To service the DIMMs, the entire copper cooling complex that runs between the two rows of two DIMMs unfastens and lifts up. There's enough slack in the liquid cooling lines (highlighted in purple) so that DIMMs (and presumably every other field-replaceable part in the blade) can be serviced without draining the coolant from the blade.
The NIC is also pretty interesting; it is a commercial high-end data center Ethernet NIC that's manufactured in a custom form factor to fit this blade. It looks like a second CPU is housed underneath the NIC, so the NIC and one of the CPUs may share a common cooling block. The NIC is also positioned perpendicular to the long edge of the blade, meaning that there are probably some pretty long cable runs going from the front-most NIC all the way to the rear of the blade. Finally, because the NIC is on a discrete mezzanine card, the networking technology is no longer soldered to the compute as it is with Aries on today's XC.
The network switch (which I did not photograph, but others did) is another blade that slots into the rear of the Shasta cabinet and mates perpendicularly with a row of compute blades such that a single switch blade can service a fully populated compute chassis. The engineer with whom I spoke said that these Shasta cabinets have no actual midplane; the compute blades connect directly to the switch blades through a bunch of holes cut out of the sheet metal that separates the front of the cabinet from the rear. Without a midplane there is presumably one less single point of failure; at the same time though, it wasn't clear to me how out-of-band management works without a centralized controller somewhere in the chassis.
At this point I should point out that all of the above information is what I learned by talking to a Cray booth employee at SC without any special privilege; although I'm sure that more details are available under non-disclosure, I frankly don't remember any of it because I don't work on the compute side of the system.
My next big stop on the show room floor was at the Fujitsu booth, where they had their post-K prototype hardware on display. Of particular note was their A64FX engineering sample:
If you look very carefully, you can see the four stacks of high-bandwidth memory (HBM2) on the package alongside the ARM processor die, which is fantastically historic in that it's the first general-purpose CPU (of which I am aware) with integrated HBM2. What's not present is any indication of how the on-chip Tofu NIC is broken out; I guess I was expecting something like Intel's -F series KNLs with on-package OmniPath.
A sample node of the post-K system was also on display:
Seeing as how both this post-K system and Cray Shasta are exascale-capable system architectures, it's interesting to compare and contrast them. Both have direct liquid cooling, but the post-K compute blade does not appear to have any field-replaceable units. Instead, the entire board seems to be a single FRU, so CPUs must be serviced in pairs. I think the A64FX lacks any cache coherence bus, meaning that two CPUs correspond to two nodes per FRU.
That all said, the post-K design does not appear to have any DDR DRAM, and the NIC is integrated directly into the CPU. With those two components out of the picture, the rate of a single component failure is probably a lot lower in post-K than it would be in Shasta. Hopefully the post-K HBM has ECC though!
While chatting about the post-K node architecture at the Fujitsu booth, I also met an engineer who just happened to be developing LLIO, the post-K system's burst buffer service:
It sounds a lot like DataWarp in terms of features, and given that Fujitsu is also developing a new Lustre-based file system (FEFS 2.0?) for post-K, we might see a tighter integration between the LLIO burst buffer layer and the FEFS back-end disk storage. This technology wasn't on my radar before SC, but it is definitely worth keeping an eye on as 2021 approaches.
As I was racing between a few other booths, I also happened upon my boss (and NERSC-9 chief architect) presenting the Perlmutter system architecture at the NVIDIA booth:
The talk drew a crowd--I'm glad to see people as jazzed about the new system as I am.
Analyzing Parallel I/O BOF

Unfortunately I did not take a picture of Andreas Dilger's second slide (available on the Analyzing Parallel I/O BOF's website), a "what is needed?" slide that largely revolves around better integration between storage system software (like Lustre) and user applications. I/O middleware seems to be at the center of most of the bullets that called for increased development, which bodes well for scientific application developers who attended the Parallel I/O in Practice tutorial on Sunday--recall that this was my key takeaway. It's good to know that the lead of Lustre development agrees with this vision of the future, and I hope Whamcloud moves Lustre in this direction so users and middleware developers can meet the storage system software somewhere in the middle.
The BOF took a darker turn after this, starting with a presentation from Si Liu of TACC about the Optimal Overloaded IO Protection System, or OOOPS. It's a library that wraps the standard POSIX I/O calls:
But in addition to passively monitoring how an application performs I/O, it purposely injects latency to throttle the rate at which I/O operations get issued by an application. That is, it purposely slows down I/O from clients to reduce server-side load and, by extension, the effects of a single bad actor on the I/O performance of all the other users.
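The OOOPS internals weren't shown in detail, but the LD_PRELOAD interposition trick it builds on is easy to sketch. The following is a hypothetical toy shim of my own (the names and thresholds are made up; this is not OOOPS code) that wraps write(2) and stalls any process that exceeds a per-second byte budget:

```c
/* Hypothetical sketch of the LD_PRELOAD interposition technique (not the
 * actual OOOPS code): wrap write(2), track the I/O rate, and inject a small
 * sleep once a process exceeds a byte-per-second budget.  Not thread-safe;
 * it is only meant to show the shape of the trick.
 * Build: gcc -shared -fPIC -o libthrottle.so throttle.c -ldl
 * Use:   LD_PRELOAD=./libthrottle.so ./my_app                            */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>

#define BYTES_PER_SEC_BUDGET (256UL << 20)   /* made-up 256 MB/s cap */

static ssize_t (*real_write)(int, const void *, size_t) = NULL;
static unsigned long bytes_this_second = 0;
static time_t window_start = 0;

ssize_t write(int fd, const void *buf, size_t count)
{
    if (!real_write)
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");

    time_t now = time(NULL);
    if (now != window_start) {            /* new one-second accounting window */
        window_start = now;
        bytes_this_second = 0;
    }
    bytes_this_second += count;
    if (bytes_this_second > BYTES_PER_SEC_BUDGET) {
        /* over budget: stall the caller instead of hammering the servers */
        struct timespec delay = { 0, 50 * 1000 * 1000 };   /* 50 ms */
        nanosleep(&delay, NULL);
    }
    return real_write(fd, buf, count);
}
```

The same preloading trick is how a lot of I/O profilers (Darshan included) hook dynamically linked applications without recompiling them, which is also why, as I note below, it is trivial for users to switch it off.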
Ideologically, I have a lot of problems with an HPC facility inserting itself into users' workflows and reducing the efficiency with which they can accomplish their science relative to the peak capability of the HPC resource. If a storage system allows a single user to accidentally deny service to other users in pursuit of peak performance, that is a problem with the storage system, and it should be addressed at the system level. And as Andreas pointed out in the BOF, tools exist to allow storage systems to accomplish fair sharing, which is distinctly different from explicitly penalizing users. Granted, TACC is also the facility where one of its staff went on record as saying that the R language should not be used by anyone since it is a waste of energy. Perhaps they have an institutionally different relationship with their user community.
Fortunately, anything that relies on LD_PRELOAD can be circumvented by users, so OOOPS is unlikely to be used to enforce any kind of resource usage policy as it was pitched during the BOF. I do see a lot of value in using it to fence data analysis workflows that may hit a pathological condition as a result of their inputs, and being able to trigger changes in application behavior by tracking I/O rates is a technique that could be useful in auto-tuning I/O middleware.
Rosemary Francis, CEO of Ellexus, also spoke at the BOF and made the case for making I/O performance analysis a little more accessible to end users. I was quite delighted by the visualizations she presented (presumably from her company's Breeze product), which used both color and human-readable "bad" I/O patterns to create a pie chart that quickly shows how much time an application spent doing I/O in various good, bad, and neutral ways. Darshan, the tried-and-true open-source I/O profiling library, operates at a slightly lower level and assumes a slightly higher level of user sophistication by comparison.
The discussion half of the BOF was packed with engagement from the audience--so much so that I didn't find any moments of silence to seize the opportunity to stump for my own view of the world. The combination of OOOPS and Rosemary's I/O war stories did steer the discussion towards ways to punish bad users though. I can appreciate HPC operators' frustration in novice users causing system-wide problems, but I don't think shaming users who do bad I/O is a great solution. Rather, something between OOOPS' automatic identification of bad I/O at runtime and Ellexus' user-centric reporting and feedback, combined with storage systems capable of enforcing QOS, is where we need to go.
The Cray Celebration

The event was quite nice in that it was not held at a loud bar (which made conversation much easier), it had plenty of food (no need for 2 AM Hot Pockets), and the format was conducive to moving around and meeting a lot of different people. The event was awash with representatives from all the major Cray customers including the DOE labs, the big oil & gas companies, and the regional leadership computing centers in EMEA including CSCS and KAUST, as well as alumni of all those employers and Cray itself. I've only worked at a Cray customer site for three years now, but I couldn't walk ten feet without running into someone I knew; in that sense, it felt a little like an event at the annual Cray User Group meeting but with a broader range of attendees.
I don't know what this event would've been like if I was a student or otherwise didn't already know many of the regular faces within the Cray user community and instead had to start conversations cold. That said, I was busy the entire evening getting to know the people behind all the conference calls I'm on; I find that getting to know my industry counterparts as people rather than just vendor reps really pays dividends when surprises happen and conflicts need to be resolved. Events like this at SC are invaluable for building and maintaining these sorts of relationships.
Wednesday, November 14

The expo floor was awkwardly laid out this year, so I really needed to walk it early to make sure I didn't spin my tires trying to find certain booths once the crowd showed up. Incidentally, I did witness a sales person violate the unwritten rule of keeping everything friendly until the expo floor opened to the public--a sales rep selling "the world's fastest storage system" tried to stir up cold sales leads at my employer's booth at 8 AM while we were all still drinking our coffee and catching up on e-mail. If you do this, shame on you! Respect the exhibitor access and don't put your game face on until the public is allowed in.
SC Student Career Fair and a Booth Talk

My second obligation was volunteering at my employer's booth at the SC Career Fair. I generally enjoy booth duty and talking to students, and this year I was doubly motivated by my desire to fill some career and student job openings related to my responsibilities. A diverse cross section of students dropped by our booth looking for both summer internships and full-time jobs; many seemed very well rehearsed in their cold pitch, while some others were a little more casual or cautious. Although I'm not particularly qualified to give career advice, I will say that knowing how to sell yourself cold can be a valuable skill in your early career. If you are seeking employment, be prepared to respond to a request to "tell me about yourself" in a way that makes you stand out.
After the Career Fair, I wound up hunkering down at the SDSC booth to have lunch with my former coworkers and review the slides I volunteered to present at the adjacent DDN booth.
At 2 PM I took the stage (booth?), and one of my colleagues was not only kind enough to sit in on this booth talk but also to share this photo he took right before I started:
I continue to be humbled that anyone would go out of their way to come hear what I have to say, especially when my talk is as unvetted as booth talks tend to be. Talking at booths rarely goes well for me; the audio is always a wildcard, the audience is often unwitting, and auditory and visual distractions are everywhere. The DDN talk was my sole booth talk of this year, and it went about as well as I would have expected. On the upside, quite a few attendees seemed genuinely interested to hear what I had to say about the variety of ways one can deploy flash in an HPC system. Unfortunately, I ran a few minutes long and got derailed by external distractions several times during the presentation. Flubbing presentations happens, and none of the audience members seemed to mind.
Shortly after the booth talk, I had to find a quiet spot to jump on a telecon. This was no easy task; since cell phones killed the public phone booth, there are very few places to take a call on the expo floor.
Flash, Disk, and Tape Technologies on the Expo Floor

I'm not really sure why so many vendors didn't show up this year, but it made getting a holistic view of the storage and networking technology markets impossible. That said, I still saw a few noteworthy things.
One of the big open questions in high-performance storage revolves around the battle between the NF1 (formerly NGSFF, promoted by Samsung) and EDSFF (promoted by Intel) form factors for NVMe. It's clear that these long-and-skinny NVMe designs are going to have to replace the thermally inefficient 2.5" U.2 and unserviceable HHHL PCIe form factors, but the dust is far from being settled. On the one hand, Samsung leads flash storage sales worldwide, but their NF1 form factor caps the power consumption (and therefore performance) of its devices to levels that are squarely aimed at cheaper data center flash. On the other, the EDSFF form factor being pushed by Intel has a short version (competing directly with NF1) and a longer version that allows higher power.
The Supermicro booth had actual EDSFF drives on display, and this was the first time I could actually see one up-close:
What I didn't realize is that the higher thermal specification enabled by the long-version EDSFF drives requires that the entire SSD circuit board be enclosed in the aluminum casing shown to enable better heat dissipation. This has the nasty side effect of reducing density; while a standard 19" 1U chassis can fit up to 36 NF1 SSDs, the aluminum casing on long EDSFFs reduces the equivalent density to 32 SSDs. Although long EDSFF drives can compensate for this by packing more NAND dies on the physically longer EDSFF board, supporting these longer SSDs requires more engineering on the chassis design to fit the same amount of compute into a smaller area.
Similarly but differently, the Lenovo booth was showcasing their D3284 JBOD which packs 84x 3.5" HDDs into a double-decker 5U chassis. I had naively assumed that all of these super-dense 84-drive enclosures were top-loading such that each drive mates to a backplane that is mounted to the floor of the chassis, but it turns out that's not the case:
Instead, each 3.5" drive goes into its 2.5U shelf on its side, and each drive attaches to a carrier that has to be slid slightly toward the front of the JBOD to release the drive, and then slide towards the back of the JBOD to secure it. This seems a little harder to service than a simple top-load JBOD, but I assume there are thermal efficiencies to be gained by this layout.
The Western Digital booth had a pretty broad portfolio of data center products on display. Their newest gadget seems to be a planar NAND-based U.2 device that can present itself as DRAM through a custom hypervisor. This sounds like a direct competitor to Intel's Memory Drive offering, which uses ScaleMP's hypervisor to expose flash as DRAM to a guest VM. The combination of exposing flash as very slow memory and relying on software virtualization to do it makes this a technology not really meant for HPC, and the engineer with whom I spoke confirmed as much. Virtualized big-and-slow memory is much more appealing for in-memory databases such as SAP HANA.
Perhaps more interesting was the lack of any mention of Western Digital's investment in storage-class memory and microwave-assisted magnetic recording (MAMR) disk drives. When I prodded about the state of MAMR, I was assured that the technology will work because there is no future for hard drives without some form of energy-assisted magnetic recording. However, product announcements are still 18-24 months away, and these drives will enter the market at the rather underwhelming capacity of ~20 TB. Conveniently, this matches Seagate's recent cry of wolf that it will launch HAMR drives in 2020 at a 20 TB capacity point. Western Digital also made no mention of multi-actuator drives, and asking about them only got me a sly grin; this suggests that Western Digital is either playing slow and steady so as not to over-promise, or Seagate has a slight technological lead.
My last substantive stop of the afternoon was at the IBM booth, where they had one of their new TS4500 tape libraries operating in demo mode. The window was too reflective to take a video of the robotics, but I will say that there was a perceptible difference between the robotics in IBM's enterprise tape library and the robotics in another vendor's LTO tape library. The IBM enterprise robotics are downright savage in how forcefully they slam tapes around, and I now fully believe IBM's claims that their enterprise cartridges are constructed to be more physically durable than standard LTO. I'm sure there's some latency benefit to being able to ram tapes into drives and library slots at full speed, but it's unnerving to watch.
IBM also had this cheeky infographic on display that was worth a photo:
If I built a tape drive that was still operating after forty years in outer space, I'd want to brag about it too. But there are a couple of factual issues with this marketing material that probably made every physical scientist who saw it roll their eyes.
Over at the compute side of the IBM booth, I learned that the Summit and Sierra systems sitting at the #1 and #2 positions on Top500 are built using node architectures that IBM is selling commercially. There are 2 CPU + 6 GPU nodes (which is what Summit at OLCF has) which require liquid cooling, and 2 CPU + 4 GPU nodes (which is what Sierra at LLNL has) which can be air- or liquid-cooled. I asked an IBM technologist which configuration is more commercially popular, and the Sierra configuration is currently leading sales due to the relative lack of infrastructure to support direct liquid cooling in commercial data centers.
This has interesting implications for the exascale technologies I looked at on Tuesday; given that the exascale-capable system designs presented by both Fujitsu and Cray rely on direct liquid cooling, the gap between achieving exascale-level performance and delivering a commercially viable product is pretty wide from a facilities perspective. Fortunately, the Fujitsu A64FX chip usually runs below 200 W and can feasibly be air-cooled with lower-density packaging, and Cray's Shasta will support standard air-cooled 19" racks via lower-density nodes.
Recap of the IO-500/VI4IO BOF

This year was exciting in that the previous top system, a DDN IME installation at JCAHPC, was unseated by the monstrous storage system attached to the Summit system at Oak Ridge, which sustained an astounding 2 TiB/sec and 3 million opens/sec. In fact, the previous #1 system dropped to #4, and each of the new top three systems was of a different architecture (Spectrum Scale at Oak Ridge, IME at KISTI, and Lustre at Cambridge).
Perhaps the most interesting of these new submissions was the #3 system, the Data Accelerator at Cambridge, which is a home-grown whitebox system that was designed to be functionally equivalent to DataWarp's scratch mode:
The hardware is just Dell boxes with six NVMe drives and one OPA NIC per socket, and the magic is actually handled by a cleanroom reimplementation of the interface that Slurm uses to instantiate DataWarp partitions on Cray XC systems. Rather than use a sophisticated orchestration system as DataWarp does, though, the Data Accelerator translates Slurm #DW pragmas into Ansible plays that spin up and tear down ephemeral Lustre file systems.
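The Cambridge team's actual plumbing wasn't shown, but the general shape of that translation is easy to imagine. Here's a deliberately simplified sketch of my own (the directive syntax, playbook name, and variable name are all guesses, not the Data Accelerator's real interface) of what turning #DW directives into Ansible invocations might look like:

```c
/* Hypothetical sketch of the general idea (the real Cambridge plumbing
 * lives in Slurm burst-buffer hooks and Ansible, not in this program):
 * scan a batch script for #DW directives and hand each one to an Ansible
 * play that would provision an ephemeral Lustre file system. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s jobscript.sh\n", argv[0]);
        return 1;
    }
    FILE *fp = fopen(argv[1], "r");
    if (!fp) { perror("fopen"); return 1; }

    char line[1024], cmd[2048];
    while (fgets(line, sizeof(line), fp)) {
        if (strncmp(line, "#DW ", 4) != 0)
            continue;                        /* not a burst-buffer directive */
        line[strcspn(line, "\n")] = '\0';
        /* e.g. "#DW jobdw capacity=10TiB access_mode=striped" becomes an
         * extra-var handed to a (hypothetical) create_lustre.yml play */
        snprintf(cmd, sizeof(cmd),
                 "ansible-playbook create_lustre.yml -e \"dw_request='%s'\"",
                 line + 4);
        printf("would run: %s\n", cmd);      /* sketch: print instead of exec */
    }
    fclose(fp);
    return 0;
}
```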
The fact that the #3 fastest storage system in the world is a whitebox NVMe system is really remarkable, and my hat is off to the team at Cambridge that did this work. As all-flash parallel file systems move out of the realm of high-end boutique solutions and become affordably mainstream, relatively scrappy but innovative engineering efforts like the Cambridge system are surely going to cause a rapid proliferation of flash adoption in HPC centers.
DDN also presented their software-defined IO-500 submission, this time run in Google Cloud and landing in the #8 position:
Since DDN's embedded SFA product line already runs virtual machines on their controller hardware, it doesn't seem like a big stretch to run the same SFA VMs in the cloud. While this sounds a little counterproductive to DDN's biggest differentiator in providing a fully integrated hardware platform, this idea of running SFA in Google Cloud arose from the growing need for parallel file systems in the cloud. I can only assume that this need is being largely driven by AI workloads which require a combination of high I/O bandwidth, high IOPS, and POSIX file interfaces.
Thursday, November 15

WekaIO and Micron at the Exhibitor Forum

In terms of building a modern parallel file system from the ground up for all-flash, WekaIO checks off almost all of the right boxes. It runs almost entirely in user space to keep latency down, it runs in its own reserved pool of CPU cores on each client, and it capitalizes on the approximate parity between NVMe latency and modern high-speed network latency. They make use of a lot of the smart ideas implemented in the enterprise and hyperscale storage space too, and they are one of the few really future-looking storage companies out there that are thinking about the new possibilities of the all-flash world while still courting the HPC market.
There is a fair amount of magic involved that was not broken down in the talk, although I've found that the WekaIO folks are happy to explain some of the more complex details if asked specific questions about how their file system works. I'm not sure what is and isn't public though, so I'll save an architectural deep-dive of their technology for a later date.
Andreas Schlapka of Micron Technology was the next speaker, and his talk was quite a bit more high-level. Aside from the grand statements about how AI will transform technology though, he did have a couple of nice slides that filled some knowledge gaps in my mind. For example:
Training is what the vast majority of casual AI+HPC pundits are really talking about when extolling the huge compute requirements of deep learning. Part of that is because GPUs are almost the ideal hardware solution to tackle the mathematics of training (dense matrix-matrix multiplication) and post impressive numbers; the other part is that inference can't happen without a well-trained model, and models are continually being refined and re-trained. What I hadn't fully appreciated is that inference is much more of an interesting computational problem in that it more closely resembles the non-uniform and latency-bound workloads of scientific computing.
This has interesting implications for memory technology; while HBM2 definitely delivers more bandwidth than DDR, it does this by increasing the channel width to 128 bits and hard-wiring 8 channels into each stack. The extra bandwidth helps feed GPUs for training, but it's not doing much for the inference side of AI, which, presumably, will become a much more significant fraction of the cycles required overall. In my mind, increasing the size of SRAM-based caches, scratchpads, and register files is the more obvious way to reduce latency for inference, but we haven't really seen a lot of fundamentally new ideas on how to effectively do that yet.
The speaker went on to show the following apples-to-apples system-level reference:
It's not terribly insightful, but it lets you back out the bus width of each memory technology (bandwidth / data rate / device #) and figure out where its bandwidth is coming from:
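Out of curiosity I ran that arithmetic myself afterwards; the numbers below are my own back-of-the-envelope spec-sheet figures rather than anything from Micron's slide, but they show how the per-pin data rate and bus width multiply out:

```c
/* Back-of-the-envelope check of the bandwidth = rate x width x devices
 * arithmetic.  Example numbers are standard spec-sheet values, not
 * Micron's: one HBM2 stack with 8 x 128-bit channels at 2 Gbit/s/pin,
 * and one DDR4-3200 channel that is 64 bits wide. */
#include <stdio.h>

static double gbytes_per_sec(double gbits_per_pin, int bus_width_bits, int devices)
{
    return gbits_per_pin * bus_width_bits * devices / 8.0;   /* bits -> bytes */
}

int main(void)
{
    /* 2 Gbit/s/pin * 1024 bits / 8 = 256 GB/s per stack */
    printf("HBM2 stack: %6.1f GB/s\n", gbytes_per_sec(2.0, 8 * 128, 1));
    /* 3.2 Gbit/s/pin * 64 bits / 8 = 25.6 GB/s per channel */
    printf("DDR4-3200:  %6.1f GB/s\n", gbytes_per_sec(3.2, 64, 1));
    return 0;
}
```

Working the same formula backwards from a quoted bandwidth and data rate is exactly how you can back out the bus width from a slide like the one Micron showed.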
NSF Future Directions BOF

That all said, actually listening to the NSF HPC strategic vision makes me rather grumpy since the directions of such an important federal office sometimes appear so scattershot. And judging by the audience questions at the end of the BOF, I am not the only one--Very Important People(tm) in two different national-level HPC consortia asked very pointed questions of Manish Parashar, the NSF OAC director, that highlighted the dichotomy between OAC's strategic vision and where it was actually putting money. I really believe in the critical importance of NSF investment in maintaining national cyberinfrastructure which is probably why I keep showing up to these BOFs and do my best to support my colleagues at SDSC and the other XSEDE SPs.
After sitting through this Future Directions BOF, I could write another updated rant about how I feel about the NSF's direction in HPC and get myself in trouble. Instead, I'll share just a few slides I photographed from afar along with some objective statements and leave it at that.
The future directions summary slide:
The Frontera slide:
The Midscale Research Infrastructure slide:
My SC Paper

I was admittedly bummed out when I found out that I was going to be the conference closer since a significant number of SC attendees tend to fly out on Thursday night and, presumably, would not stick around for my presentation. As a result, I didn't take preparation for it as seriously in the weeks leading up to SC as I normally would have. I knew the presentation was a 30-35 minute talk that had to be fit into a 25-minute slot, but I planned to figure out how to manage that on the night before the talk and mostly wing it.
What I realized after arriving at SC was that a bunch of people--most of whom weren't the expected audience of storage researchers--were looking forward to hearing the talk. This left me scrambling to seriously step up the effort I was going to put into making sure the presentation was well composed despite needing to drop ten minutes of material and fit it into the 25 minutes I was given. I documented my general approach to crafting presentations in my patented Glenn K. Lockwood Five Keys to Being a Successful Researcher (FKBSR) method, but I'll mention some of my considerations for the benefit of anyone who is interested in how others approach public speaking.
As a result of all these factors, I opted to both cut a lot of details to get the talk down to ~25-30 minutes when presented at a casual pace, then prepare to present in turbo mode just in case the previous speakers went long (I was last of three speakers), there were A/V issues (they were prolific at this SC, especially for Mac users), or there were any audience interruptions.
I also opted to present from my iPad rather than a full laptop since it did a fine job earlier at both PDSW-DISCS and the IO-500/VI4IO BOF. In sticking with this decision though, I learned two valuable things during the actual presentation:
This post wound up being a massive, meandering, chronological documentary of a week in my life that includes both technical and non-technical commentary. For anyone who is only interested in the technical insights I gained during SC, check out the items prefixed with (tech) in this table of contents:
- Before the Conference
- Saturday
- Sunday
- Monday
- Tuesday
- (tech) Technical Program, Data and Storage Paper Track Highlights
- Interlude of Meetings
- (tech) Cray and Fujitsu's Exascale System Hardware on the Expo Floor
- (tech) Analyzing Parallel I/O BOF Highlights
- The Cray Celebration
- Wednesday
- SC Student Career Fair and a Booth Talk
- (tech) Flash, Disk, and Tape Technologies on the Expo Floor
- (tech) Recap of the IO-500/VI4IO BOF
- Thursday
- (tech) WekaIO and Micron at the Exhibitor Forum
- NSF Future Directions BOF
- My SC Paper
- SC Technical Program Reception at the Perot Museum
- Friday
- After the Conference
Everything that's not labeled (tech) is part diary and part career development perspective. Hopefully someone will find something in here that's of some value.
Finally, disclosures:
- I omitted some names in the interests of respecting the privacy of the folks who took the time to talk to me one-on-one. If you're part of this story and don't mind having your name out there, I'd be happy to include it.
- Everything I paraphrase here is public information or conjecture on my part. Nothing in this post is either confidential or sensitive. That said, check your references before citing anything here. I don't know what I'm talking about.
- Everything here is my personal opinion and does not necessarily reflect the viewpoint of my employer or its funding agency. I attended the conference as a part the regular course of business in which I am employed. However I took all photos for personal purposes, and the entirety of this post was written on my own personal time.
Before the conference
Everyone's SC experience is different because it draws such a diverse range of professionals. There are plenty of activities for everyone ranging from students and early-career staff to senior management and leadership, and people on different career tracks (e.g., facilities staff, computer science researchers, program managers, product sales) are likely to be drawn to very different parts of the conference agenda. My priorities during the week of SC are definitely shaped by where I am in my career, so when filling out my calendar a few weeks ahead of the conference, I considered the following:My job is half research and half facilities staff. 50% of my time is funded by grant money to do applied research in characterizing parallel I/O systems. The other half of my time is spent staying current on emerging technologies in computing and storage. These two responsibilities mean that my SC is usually a mix of attending technical program sessions (to see what my peers in research are doing and see what research ideas might turn up in future technologies) and engaging with vendors.
I work in advanced technologies. This means I am generally not in the trenches directly feeling the pains of operating HPCs today; instead, my job is to identify technologies that will cause less problems tomorrow. This also means that I don't have purchasing authority, and I am less likely to be involved with anything that's going to hit the floor in the next year. As such, I generally don't do vendor sales meetings or briefings at SC because they are generally focused on nearer-term products and sales.
I did not get to where I am by myself. I first heard about SC in 2010 when I was a graduate student, and it sounded almost infinitely more exciting than the materials science conferences I was attending. I had no experience in HPC at the time, but it made me realize what I really wanted to pursue as a career. I relied heavily on the good will of the online HPC community to learn enough to get my first HPC job at SDSC, and after that, the faith of a great many more to get me to where I am now. SC is often the only time I get to see people who have helped me out in my early career, and I always make time connect with them.
The net result of these goals was a pretty full schedule this year:
My SC'18 schedule. Note that the time zone is PST, or two hours behind Dallas time. |
I mark everything that I must attend (usually because I'm a speaker) in red to know the immovable obligations. Blue items are things I will attend unless an emergency comes up, and grey things are events I should attend because they sound interesting.
White space is very important to me too; between 10am and 6pm, white spaces are when I can walk the expo floor. A lot of people write off the expo as a waste of time, but I actually feel that it's one of the most valuable parts of SC. Since my job is to understand emerging technology (and the market trends that drive them), accosting a pre-sales engineer or product manager in a strategically important technology provider can yield an invaluable peek into the markets they're serving. White space in the evenings are equally important for engagements of opportunity or working on slides that have to be presented the next day.
Saturday, November 10
I always fly to SC on the Saturday before the conference starts. I have historically opted to do workshops on both Sunday and Monday, as I really enjoy attending both PMBS and PDSW-DISCS. I bring a suitcase with has extra room for conference swag, and doing so this year was critically important because I opted to bring along a pair of cowboy boots that I knew I would not want to wear on the flight home.My brown kicks. Also Harriet the cat. |
On just about every work flight I'm on, I've got PowerPoint slides to review; this trip was no different, and I spent the 3.5-hour flight time reviewing the slides I had to present the next day. Once in Dallas and at my hotel, I carried out my usual work-travel night-of-arrival ritual: order the specialty pizza from a local pizza joint, text home saying I arrived safely, and iron my clothes while watching Forensic Files.
Sunday, November 11
This year I had the honor of presenting one part of the famed Parallel I/O in Practice tutorial at SC along with Rob Ross, Brent Welch, and Rob Latham. This tutorial has been running for over fifteen years now, and at some point over those years, it picked up the curious ritual of being kicked off with some juggling:Brent leading up to the tutorial start time with some juggling. He brought the pins with him. |
The tutorial itself is really comprehensive and includes everything from device-level performance behavior to parallel file systems architecture and I/O middleware. Even though I can proudly say that I knew 95% of the material being presented throughout the day (as I probably should since I was a presenter!), I found this particular slide that Rob Latham presented particularly insightful:
The ease and portability of using I/O middleware comes without sacrificing performance! Sorry for the odd angle; this is the screen as us presenters were able to view it. |
It makes the case that there is no significant performance penalty for using higher-level I/O libraries (like PnetCDF or parallel HDF5) despite how much easier they are to use than raw MPI-IO. One of the biggest take-home messages of the entire tutorial is to use I/O middleware wherever possible; doing so means that understanding parallel file system architecture isn't prerequisite to getting good I/O performance.
Monday, November 12
Monday was the official first day of SC. Workshops and tutorials went on throughout the day, and the opening keynote and exhibition hall opening gala started in the evening.
PDSW-DISCS 2018
The 3rd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS) was on Monday, and I had the honor of being asked to serve as its Publicity Chair this year.The PDSW-DISCS full-day workshop agenda |
It's a really great workshop for people working in I/O, storage, and data and always draws a large crowd:
For researchers, it's a great venue for short papers that IEEE or ACM publishes, and it also has a really nice Work-in-Progress track where a page-long abstract gives you a seven minute spot to pitch your work. For attendees, it's always chock full of good talks that range from pure research to applied development.
This year's keynote speaker was Rangan Sukumar, Cray's analytics guru. His talk was interesting in that it approached the oft-mentioned convergence between HPC and AI (which has become an over-used trope by itself) from the perspective of a system architect (which is where the rubber meets the road):
As many great keynote speakers are, Rangan used hyperbole at times to contrast HPC and "Big Data" workloads, and this stimulated some discussion online. Although the slides alone tell only part of the story, you can download them from the PDSW-DISCS'18 website.
Later in the morning, Margaret Lawson (University of Illinois, Sandia Labs) presented a follow-on to the EMPRESS metadata system she presented last year:
Last year, EMPRESS seemed a little too researchy for me (as a facilities person) to sink my teeth into. This year though, the picture seems a lot more complete and I quite like the architectural framework. Although EMPRESS may not ever be a household name, the concept of separating data streams and metadata streams underneath some sort of I/O middleware is really solid. I think that storing data and metadata in different, architecturally distinct storage systems that map to the unique access patterns of data and metadata is ultimately the right way to approach large-scale data and metadata management in HPC, and I expect to see this design pattern proliferate as scientific data analysis becomes a bigger part of large-scale HPC workloads.
In the afternoon, researchers from OSU offered a rare peak into Alibaba through a high-level analysis of SSD failure data provided by the Chinese hyperscaler:
The most alarming finding to me was that 20% of SSD failures were caused by humans yanking the wrong SSD. This immediately made me wonder who Alibaba is hiring to do routine operational support at their data centers; if people are causing a significant fraction of storage faults, either they aren't hiring with the same standards as their US counterparts, or their data centers are a mess. The speaker's proposed remedy was to use a different SSD form factor for each logical use case for SSDs so that operators could visually identify an SSD reserved for metadata versus one reserved for data. I personally think a label maker, a barcode scanner, and a decent salary is an easier, standards-based solution.
Other highlights included
- Characterizing Deep-Learning I/O Workloads in TensorFlow, presented by Stefano Markidis of KTH. The first time I've seen an I/O-centric evaluation of how deep learning workflows will affect storage requirements of future systems. I learned a lot.
- Toward Understanding I/O Behavior in HPC Workflows, presented by Jakob Lüttgau of DKRZ/ANL. Rather than analyze the I/O pattern of a single MPI job, this paper began examining the I/O patterns of related jobs that all work towards a single scientific objective. Again, one of the first research papers I've seen that takes a critical look at end-to-end workflows from an I/O perspective.
- Methodology for the Rapid Development of Scalable HPC Data Services, presented by Matthieu Dorier of ANL. I think this paper is intended to be the canonical reference for the Mochi project, which I was glad to finally see. The idea of enabling quickly composable, purpose-built I/O services that are optimized for next-generation media and interconnects is a brilliant one, and I am a huge believer that this approach will be what demonstrates the earliest scientific successes that rely on storage-class memory at scale.
There were a number of really promising ideas presented at the WIP sessions as well, and recapping the entirety of the workshop is a blog post in and of itself. Fortunately, all the papers and slides are openly available on the PDSW-DISCS website.
SC Opening Keynote and Gala
I've actually stopped going to the SC keynotes over the last year since they're increasingly focused on the societal impacts enabled by HPC rather than HPC itself. While I'm definitely not knocking that theme--it's a great way to inspire early-career individuals, big-picture program management types, and disenchanted technical folks in the trenches--it's just not why I attend SC. Instead, I make use of my exhibitor badge and head into the expo floor before it opens to the public; this is the only time during the conference where I seem to be able to reliably find the people I want to meet at their booths.This year I visited a few small businesses with whom I've fostered good will over the last few years to say hello, then dropped in on the SDSC booth to catch up with the latest news from my former coworkers. They also happen to have free beer on the opening night.
Once the expo floor opens to the public following the opening keynote, booth activity goes from zero to eleven really quickly. Every booth has a big splash during the gala which makes it hard to choose just one, but my decision this year was made easier by Cray choosing to unveil its new exascale HPC platform, Shasta, and celebrate its first sale of a Shasta system to NERSC.
Cray CEO Pete Ungaro at the Shasta unveiling ceremony |
This new system, named Perlmutter, will be delivered in 2020 and has a bunch of really slick new technologies incorporated into it.
After Cray CEO Pete Ungaro unveiled the prototype Shasta blades, there was a celebratory toast and both NERSC and Cray staff donned their "ASK ME ABOUT SAUL" pins:
NERSC and Cray staff got these VIP pins to promote NERSC's next system, named after astrophysicist, Nobel laureate, and Berkeley Lab scientist Saul Perlmutter. |
I stuck around to shake hands with my colleagues at Cray (including the CEO himself! Haven't washed my hand since) and catch up with some of my counterparts in storage R&D there.
The Beowulf Bash
The gala shut down at 9 PM, at which time I headed over to the Beowulf Bash to try to find other some colleagues who said they would be there. I generally don't prioritize parties at SC for a couple reasons:- Shouting over music all night is a great way to burn out one's voice. This is not good when I have to present something the next day.
- The crowds and lines often undercut my enjoyment of catching up with old colleagues (and meeting new ones).
- I almost always have slides that need to be finished by the end of the night.
I make an exception for the Bash because I personally value many of the people behind organizing and sponsoring it, and it captures the scrappier side of the HPC community which helped me get my foot in the door of the industry. This year I specifically went to catch up with my colleagues at The Next Platform; Nicole and Tim are uncommonly insightful and talented writers and editors, and they always have wacky anecdotes to share about some of the more public figures in our industry.
More generally and self-servingly though, maintaining a good relationship with members of the HPC trade press at large has tremendous value over time regardless of your affiliation or job title. Behind every interesting HPC news article is an editor with incomparable access to a broad network of people in the industry. Despite this though, they still are subject to the same haters as anyone else who puts something out in the spotlight, so I have to imagine that putting in a kind word in-person will is always worth it.
At around midnight, only the die-hards were still around.
Late night Beowulf Bash at Eddie Deen's Ranch. |
Regrettably, I barely had any time to catch up with my colleagues from the FreeNode HPC community at the Bash (or at all). Maybe at ISC.
After getting back to the hotel, I realized I hadn't eaten anything since lunch. I also learned that absolutely nothing that delivers food in the downtown Dallas area is open after midnight. After waiting an hour for a food delivery that wound up going to a restaurant that wasn't even open, I had to settle for a hearty dinner of Hot Pockets from the hotel lobby.
I hadn't eaten a Hot Pocket since graduate school. Still taste the same. |
Fortunately my Tuesday was relatively light on hard obligations.
Tuesday, November 13
Tuesday was the first day in which the SC technical program and expo were both in full swing. I split the day between paper talks, meetings, and the expo floor.
Technical Program, Part 1 - Data and Storage
My Tuesday morning began at 10:30 AM with the Data and Storage paper presentation session in the technical program. Of note, the first two papers presented were about cloud-centric storage paradigms, and only the third one was clearly focused on scientific HPC workloads.- SP-Cache: Load-Balanced, Redundancy-Free Cluster Caching with Selective Partition by Yu et al was a paper squarely aimed at reducing tail latency of reads. Very important if you want to load an old GMail message without waiting more than a few seconds for it to load. Less useful for most scientific HPC workloads.
- BESPOKV: Application Tailored Scale-Out Key-Value Stores by Anwar et al was a paper presenting a framework that is uncannily similar to the Mochi paper presented at PDSW on the day before. The premise was to allow people to compose their own Cassandra-like KV store with specific consistency and durability balance without having to reinvent the basic building blocks.
- Scaling Embedded In Situ Indexing with DeltaFS by Zheng et al was the talk I really wanted to hear but I had to miss on account of a conflicting meeting. The DeltaFS work being done by CMU and LANL is a really innovative way to deal with the scalability challenges of parallel file system metadata, and I think it's going to ultimately be where many of the nascent software-defined storage technologies aimed at HPC will converge.
Unfortunately I had to cut out of the session early to meet with a vendor partner at a nearby hotel.
Interlude of Meetings
The first of my two vendor meetings at this year's SC was less a sales call and more about continuing a long-running discussion about technology futures in the five-to-ten year timeframe. No sane vendor will commit to any roadmap that far out, especially given the uncertainty surrounding post-Moore's Law technologies, but they are receptive to input from customers who are formulating their own strategic directions for the same time period. Maintaining these sorts of ongoing conversations is a major part of what falls under my job title in "advanced technologies."Unfortunately that vendor meeting overlapped with the Lustre BOF, but other staff from my institution were able to attend and ensure that our interests were represented. I was also able to attend the Lustre Lunch that followed the BOF which was very fruitful; in addition to simply being present to remind the Lustre community that I (and the institution I represent) am a part of it, I happened to connect in-person with someone I've known for a few years via Twitter and make a valuable connection. Unfortunately I had to leave the Lustre Lunch early to make another meeting, unrelated to SC, that allowed a geographically distributed committee to meet face-to-face.
After that committee meeting, I seized the free hour I had to visit the show room floor.
Expo Floor, Part 1
The first photo-worthy tech I saw was the Shasta blade at the Cray booth. Because the booth was mobbed with people during the previous night's gala, this was actually my first time seeing Shasta hardware up close. Here's the compute blade:
Part of a Cray Shasta compute blade up-close
Unlike the Cray XC blade of today's systems which uses a combination of forced-air convection and heat exchangers to enable liquid cooling, these Shasta blades have direct liquid cooling which is rapidly becoming a de facto minimum requirement for an exascale-capable rack and node design. I had some questions, so I struck up a conversation with a Cray employee at the booth and learned some neat things about the Shasta packaging.
For the sake of clarity, here is a hand-drawn, annotated version of the same photo:
Part of a Cray Shasta compute blade up-close with my annotations
What stood out to me immediately was the interesting way in which the DIMMs were direct-liquid cooled. Unlike IBM's attempt at this with the POWER 775 system (the PERCS system of Blue Waters infamy) where cold plates were attached to every DIMM, Cray has opted to use what looks like a heat-conductive foam that wraps copper cooling lines. To service the DIMMs, the entire copper cooling complex that runs between the two rows of two DIMMs unfastens and lifts up. There's enough slack in the liquid cooling lines (highlighted in purple) so that DIMMs (and presumably every other field-replaceable part in the blade) can be serviced without draining the coolant from the blade.
The NIC is also pretty interesting; it is a commercial high-end data center Ethernet NIC that's manufactured in a custom form factor to fit this blade. It looks like a second CPU is housed underneath the NIC, so the NIC and one of the CPUs may share a common cooling block. The NIC is also positioned perpendicular to the long edge of the blade, meaning that there are probably some pretty long cable runs going from the front-most NIC all the way to the rear of the blade. Finally, because the NIC is on a discrete mezzanine card, the networking technology is no longer soldered to the compute as it is with Aries on today's XC.
The network switch (which I did not photograph, but others did) is another blade that slots into the rear of the Shasta cabinet and mates perpendicularly with a row of compute blades such that a single switch blade can service a fully populated compute chassis. The engineer with whom I spoke said that these Shasta cabinets have no actual midplane; the compute blades connect directly to the switch blades through a bunch of holes cut out of the sheet metal that separates the front of the cabinet from the rear. Without a midplane there is presumably one less single point of failure; at the same time though, it wasn't clear to me how out-of-band management works without a centralized controller somewhere in the chassis.
At this point I should point out that all of the above information is what I learned by talking to a Cray booth employee at SC without any special privilege; although I'm sure that more details are available under non-disclosure, I frankly don't remember any of it because I don't work on the compute side of the system.
My next big stop on the show room floor was at the Fujitsu booth, where they had their post-K prototype hardware on display. Of particular note was their A64FX engineering sample:
If you look very carefully, you can see the four stacks of high-bandwidth memory (HBM) on the package alongside the ARM processor die, which is fantastically historic in that it's the first general-purpose CPU (of which I am aware) with integrated HBM2. What's not present is any indication of how the on-chip Tofu NIC is broken out; I guess I was expecting something like Intel's -F series KNLs with on-package OmniPath.
A sample node of the post-K system was also on display:
Seeing as how both this post-K system and Cray Shasta are exascale-capable system architectures, it's interesting to compare and contrast them. Both have direct liquid cooling, but the post-K compute blade does not appear to have any field-replaceable units. Instead, the entire board seems to be a single FRU, so CPUs must be serviced in pairs. I think the A64FX lacks any cache coherence bus, meaning that two CPUs correspond to two nodes per FRU.
That all said, the post-K design does not appear to have any DDR DRAM, and the NIC is integrated directly into the CPU. With those two components out of the picture, the rate of single-component failures is probably a lot lower in post-K than it would be in Shasta. Hopefully the post-K HBM has ECC though!
While chatting about the post-K node architecture at the Fujitsu booth, I also met an engineer who just happened to be developing LLIO, the post-K system's burst buffer service:
LLIO burst buffer slide shown at the Fujitsu booth
It sounds a lot like DataWarp in terms of features, and given that Fujitsu is also developing a new Lustre-based file system (FEFS 2.0?) for post-K, we might see a tighter integration between the LLIO burst buffer layer and the FEFS back-end disk storage. This is a technology that wasn't on my radar before SC but is definitely worth keeping an eye on as 2021 approaches.
As I was racing between a few other booths, I also happened upon my boss (and NERSC-9 chief architect) presenting the Perlmutter system architecture at the NVIDIA booth:
NERSC's Nick Wright, chief architect of the Perlmutter system, describing its architecture at the NVIDIA booth
The talk drew a crowd--I'm glad to see people as jazzed about the new system as I am.
Analyzing Parallel I/O BOF
The Analyzing Parallel I/O BOF is a must-attend event for anyone in the parallel I/O business, and this year's BOF was especially good. Andreas Dilger (of Lustre fame; now CTO of Whamcloud) gave a brief but insightful retrospective on understanding I/O performance:
Unfortunately I did not take a picture of Andreas' second slide (available on the Analyzing Parallel I/O BOF's website), a "what is needed?" slide that largely revolves around better integration between storage system software (like Lustre) and user applications. I/O middleware seems to be at the center of most of the bullets that called for increased development, which bodes well for scientific application developers who attended the Parallel I/O in Practice tutorial on Sunday--recall that this was my key takeaway. It's good to know that the lead of Lustre development agrees with this vision of the future, and I hope Whamcloud moves Lustre in this direction so users and middleware developers can meet the storage system software somewhere in the middle.
The BOF took a darker turn after this, starting with a presentation from Si Liu of TACC about the Optimal Overloaded IO Protection System, or OOOPS. It's a library that wraps the standard POSIX I/O calls:
OOOPS operates by hijacking standard I/O calls and lagging them.
But in addition to passively monitoring how an application performs I/O, it purposely injects latency to throttle the rate at which I/O operations get issued by an application. That is, it purposely slows down I/O from clients to reduce server-side load and, by extension, the effects of a single bad actor on the I/O performance of all the other users.
Ideologically, I have a lot of problems with an HPC facility inserting itself into the user's workflow and reducing the efficiency with which they can accomplish their science relative to the peak capability of the HPC resource. If a storage system allows a single user to accidentally deny service to other users in pursuit of peak performance, that is a problem with the storage system, and it should be addressed at the system level. And as Andreas pointed out in the BOF, tools exist to allow storage systems to accomplish fair sharing, which is distinctly different from explicitly penalizing users. Granted, TACC is also the facility where one of its staff went on record as saying that the R language should not be used by anyone since it is a waste of energy. Perhaps they have an institutionally different relationship with their user community.
Fortunately, anything that relies on LD_PRELOAD can be circumvented by users, so OOOPS is unlikely to be used to enforce any kind of resource usage policy as it was pitched during the BOF. I do see a lot of value in using it to fence data analysis workflows that may hit a pathological condition as a result of their inputs, and being able to trigger changes in application behavior by tracking I/O rates is a technique that could be useful in auto-tuning I/O middleware.
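For what it's worth, the underlying interposition trick is simple. Here is a minimal sketch of how an LD_PRELOAD library can wrap write(2) and inject latency to throttle a client's I/O rate; this is my own illustration, not OOOPS's actual code, and the THROTTLE_USEC knob and fixed delay are invented for the example:

```c
/* throttle.c -- minimal sketch of LD_PRELOAD-based I/O throttling.
 * NOT the actual OOOPS implementation; it only illustrates the general
 * interposition technique described above.
 *
 * Build: gcc -shared -fPIC -o throttle.so throttle.c -ldl
 * Use:   LD_PRELOAD=./throttle.so THROTTLE_USEC=500 ./my_app
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>
#include <unistd.h>

static ssize_t (*real_write)(int, const void *, size_t) = NULL;

ssize_t write(int fd, const void *buf, size_t count)
{
    /* Look up the "real" write the first time we are called. */
    if (!real_write)
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");

    /* Inject a fixed delay before each write to cap the client's request
     * rate; a real tool would track I/O rates and adapt the delay. */
    const char *delay = getenv("THROTTLE_USEC");
    if (delay)
        usleep((useconds_t)atol(delay));

    return real_write(fd, buf, count);
}
```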
Rosemary Francis, CEO of Ellexus, also spoke at the BOF, arguing for the need to make I/O performance analysis more accessible to end users. I was quite delighted by the visualizations she presented (presumably from her company's Breeze product), which used both color and human-readable "bad" I/O patterns to create a pie graph that quickly shows how much time an application spent doing I/O in various good, bad, and neutral ways. Darshan, the tried-and-true open source I/O profiling library, operates at a slightly lower level and assumes a slightly higher level of user sophistication by comparison.
The discussion half of the BOF was packed with engagement from the audience--so much so that I didn't find any moments of silence to seize the opportunity to stump for my own view of the world. The combination of OOOPS and Rosemary's I/O war stories did steer the discussion towards ways to punish bad users though. I can appreciate HPC operators' frustration in novice users causing system-wide problems, but I don't think shaming users who do bad I/O is a great solution. Rather, something between OOOPS' automatic identification of bad I/O at runtime and Ellexus' user-centric reporting and feedback, combined with storage systems capable of enforcing QOS, is where we need to go.
The Cray Celebration
I wrote earlier that I normally don't do the SC vendor party circuit, but the Cray party this year was another exception for two reasons: (1) we had just announced Perlmutter along with Cray's Shasta unveiling which is worth celebrating, and (2) there were specific Cray staff with whom I wanted to confer sometime during the week. So after the Parallel I/O BOF, I headed over to the event venue:
The event was quite nice in that it was not held at a loud bar (which made conversation much easier), it had plenty of food (no need for 2 AM Hot Pockets), and the format was conducive to moving around and meeting a lot of different people. The event was awash with representatives from all the major Cray customers including the DOE labs, the big oil & gas companies, and the regional leadership computing centers in EMEA including CSCS and KAUST, as well as alumni of all those employers and Cray itself. I've only worked at a Cray customer site for three years now, but I couldn't walk ten feet without running into someone I knew; in that sense, it felt a little like an event at the annual Cray User Group meeting but with a broader range of attendees.
I don't know what this event would've been like if I was a student or otherwise didn't already know many of the regular faces within the Cray user community and instead had to start conversations cold. That said, I was busy the entire evening getting to know the people behind all the conference calls I'm on; I find that getting to know my industry counterparts as people rather than just vendor reps really pays dividends when surprises happen and conflicts need to be resolved. Events like this at SC are invaluable for building and maintaining these sorts of relationships.
Wednesday, November 14
My Wednesday began bright and early with a quick run-around of the expo floor to figure out who I needed to visit before the end of the week.
The expo floor was awkwardly laid out this year, so I really needed to do this to make sure I didn't spin my tires trying to find certain booths once the crowd showed up. Incidentally, I did witness a sales person violate the unwritten rule of keeping everything friendly until the expo floor opened to the public--a sales rep selling "the world's fastest storage system" tried to stir up cold sales leads at my employer's booth at 8 AM while we were all still drinking our coffee and catching up on e-mail. If you do this, shame on you! Respect the exhibitor access and don't put your game face on until the public is allowed in.
SC Student Career Fair and Booth Talk
My first meeting was a chat over coffee with VAST Data, a storage technology company that has some really innovative and exciting ideas in the pipeline, to keep up to date with the latest news as they approach public launch.
My second obligation was volunteering at my employer's booth at the SC Career Fair. I generally enjoy booth duty and talking to students, and this year I was doubly motivated by my desire to fill some career and student job openings related to my responsibilities. A diverse cross section of students dropped by our booth looking for both summer internships and full-time jobs; many seemed very well rehearsed in their cold pitch, while some others were a little more casual or cautious. Although I'm not particularly qualified to give career advice, I will say that knowing how to sell yourself cold can be a valuable skill in your early career. If you are seeking employment, be prepared to respond to a request to "tell me about yourself" in a way that makes you stand out.
After the Career Fair, I wound up hunkering down at the SDSC booth to have lunch with my former coworkers and review the slides I volunteered to present at the adjacent DDN booth.
At 2 PM I took the stage (booth?) and one of my colleagues was not only kind enough to sit in on this booth talk, but also share this photo he took right before I started:
Beginning of my talk at the DDN booth. Photo credit goes to Suhaib Khan via Twitter.
I continue to be humbled that anyone would go out of their way to come hear what I have to say, especially when my talk is as unvetted as booth talks tend to be. Talking at booths rarely goes well for me; the audio is always a wildcard, the audience is often unwitting, and auditory and visual distractions are literally everywhere. The DDN booth was my sole booth talk of this year, and it went about as well as I would have expected. On the up side, quite a few attendees seemed genuinely interested to hear what I had to say about the variety of ways one can deploy flash in an HPC system. Unfortunately, I ran a few minutes long and got derailed by external distractions several times during the presentation. Flubbing presentations happens, and none of the audience members seemed to mind.
Shortly after the booth talk, I had to find a quiet spot to jump on a telecon. This was no easy task; since cell phones killed the public phone booth, there are very few places to take a call on the expo floor.
Expo Floor, Part 2
The afternoon afforded me two more hours to race around the expo floor. Despite my planning earlier in the morning, I wound up spinning my tires looking for a few key vendors who simply didn't show up to SC this year, including:
- Samsung and SK Hynix, two of the top three DRAM vendors and the sole manufacturers of HBM2
- Seagate, one of two hard disk drive manufacturers
- Broadcom/Avago, the company manufacturing most of the serdes used in the upcoming 200G and 400G network devices
- Juniper, one of the major players in the 400 GbE space
- AdvancedHPC, one of the few US integrators selling BeeGFS
I'm not really sure why so many vendors didn't show up this year, but it made getting a holistic view of the storage and networking technologies markets impossible. That said, I still saw a few noteworthy things.
One of the big open questions in high-performance storage revolves around the battle between the NF1 (formerly NGSFF, promoted by Samsung) and EDSFF (promoted by Intel) form factors for NVMe. It's clear that these long-and-skinny NVMe designs are going to have to replace the thermally inefficient 2.5" U.2 and unserviceable HHHL PCIe form factors, but the dust is far from being settled. On the one hand, Samsung leads flash storage sales worldwide, but their NF1 form factor caps the power consumption (and therefore performance) of its devices to levels that are squarely aimed at cheaper data center flash. On the other, the EDSFF form factor being pushed by Intel has a short version (competing directly with NF1) and a longer version that allows higher power.
The Supermicro booth had actual EDSFF drives on display, and this was the first time I could actually see one up-close:
A long-type EDSFF NVMe drive at the Supermicro booth. The aluminum casing is actually required to meet the thermals.
What I didn't realize is that the higher thermal specification enabled by the long-version EDSFF drives requires that the entire SSD circuit board be enclosed in the aluminum casing shown to enable better heat dissipation. This has the nasty side effect of reducing density; while a standard 19" 1U chassis can fit up to 36 NF1 SSDs, the aluminum casing on long EDSFFs reduces the equivalent density to 32 SSDs. Although long EDSFF drives can compensate for this by packing more NAND dies on the physically longer EDSFF board, supporting these longer SSDs requires more engineering on the chassis design to fit the same amount of compute into a smaller area.
Similarly but differently, the Lenovo booth was showcasing their D3284 JBOD which packs 84x 3.5" HDDs into a double-decker 5U chassis. I had naively assumed that all of these super-dense 84-drive enclosures were top-loading such that each drive mates to a backplane that is mounted to the floor of the chassis, but it turns out that's not the case:
Lenovo's 5U84 JBOD
Instead, each 3.5" drive sits on its side in a 2.5U shelf, attached to a carrier that has to be slid slightly toward the front of the JBOD to release the drive and then slid toward the back of the JBOD to secure it. This seems a little harder to service than a simple top-load JBOD, but I assume there are thermal efficiencies to be gained by this layout.
The Western Digital booth had a pretty broad portfolio of data center products on display. Their newest gadget seems to be a planar NAND-based U.2 device that can present itself as DRAM through a custom hypervisor. This sounds like a direct competitor to Intel's Memory Drive offering, which uses ScaleMP's hypervisor to expose flash as DRAM to a guest VM. The combination of exposing flash as very slow memory and relying on software virtualization to do it makes this a technology not really meant for HPC, and the engineer with whom I spoke confirmed as much. Virtualized big-and-slow memory is much more appealing to in-memory databases such as SAP HANA.
Perhaps more interesting was the lack of any mention of Western Digital's investment in storage-class memory and microwave-assisted magnetic recording (MAMR) disk drives. When I prodded about the state of MAMR, I was assured that the technology will work because there is no future for hard drives without some form of energy-assisted magnetic recording. However, product announcements are still 18-24 months away, and these drives will enter the market at the rather underwhelming capacity of ~20 TB. Conveniently, this matches Seagate's recent cry of wolf that they will launch HAMR drives in 2020 at a 20 TB capacity point. Western Digital also made no mention of multi-actuator drives, and asking about it only got me a sly grin; this suggests that Western Digital is either playing slow and steady so as not to over-promise, or Seagate has a slight technological lead.
My last substantive stop of the afternoon was at the IBM booth, where they had one of their new TS4500 tape libraries operating in demo mode. The window was too reflective to take a video of the robotics, but I will say that there was a perceptible difference between the robotics in IBM's enterprise tape library and the robotics in another vendor's LTO tape library. The IBM enterprise robotics are downright savage in how forcefully they slam tapes around, and I now fully believe IBM's claims that their enterprise cartridges are constructed to be more physically durable than standard LTO. I'm sure there's some latency benefit to being able to ram tapes into drives and library slots at full speed, but it's unnerving to watch.
IBM also had this cheeky infographic on display that was worth a photo:
If I built a tape drive that was still operating after forty years in outer space, I'd want to brag about it too. But there are a couple of factual issues with this marketing material that probably made every physical scientist who saw it roll their eyes.
Over at the compute side of the IBM booth, I learned that the Summit and Sierra systems sitting at the #1 and #2 positions on Top500 are built using node architectures that IBM is selling commercially. There are 2 CPU + 6 GPU nodes (what Summit at OLCF has), which require liquid cooling, and 2 CPU + 4 GPU nodes (what Sierra at LLNL has), which can be air- or liquid-cooled. I asked an IBM technologist which configuration is more commercially popular, and the Sierra configuration is currently leading sales due to the relative lack of infrastructure to support direct liquid cooling in commercial data centers.
This has interesting implications for the exascale technologies I looked at on Tuesday; given that the exascale-capable system designs presented by both Fujitsu and Cray rely on direct liquid cooling, the gap between achieving exascale-level performance and delivering a commercially viable product is pretty wide from a facilities perspective. Fortunately, the Fujitsu A64FX chip usually runs below 200 W and can feasibly be air-cooled with lower-density packaging, and Cray's Shasta will support standard air-cooled 19" racks via lower-density nodes.
The IO-500/VI4IO BOF
The second must-attend BOF for people working in I/O is the IO-500 and Virtual Institute for I/O BOF. It's a very pragmatic BOF where people discuss system architecture, benchmarking, and various related community efforts; since 2017 it has also included the semiannual unveiling of the IO-500 list.
This year was exciting in that the top system, a DDN IME installation at JCAHPC, was unseated by the monstrous storage system attached to the Summit system at Oak Ridge, which sustained an astounding 2 TiB/sec and 3 million opens/sec. In fact, the previous #1 system dropped to #4, and each of the new top three systems was of a different architecture (Spectrum Scale at Oak Ridge, IME at KISTI, and Lustre at Cambridge).
Perhaps the most interesting of these new submissions was the #3 system, the Data Accelerator at Cambridge, which is a home-grown whitebox system that was designed to be functionally equivalent to DataWarp's scratch mode:
Alasdair King presenting the Data Accelerator design at the IO-500 BOF
The hardware is just Dell boxes with six NVMe drives and one OPA NIC per socket, and the magic is actually handled by a cleanroom reimplementation of the interface that Slurm uses to instantiate DataWarp partitions on Cray XC systems. Rather than use a sophisticated orchestration system as DataWarp does though, the Data Accelerator translates Slurm #DW pragmas into Ansible plays that spin up and tear down ephemeral Lustre file systems.
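To make that translation step concrete, here is a toy sketch of the general idea; the playbook name, variables, and capacity-to-server mapping below are all invented for illustration and are not Cambridge's actual implementation:

```c
/* dw2ansible.c -- toy sketch of turning Slurm #DW directives into an
 * Ansible invocation that provisions an ephemeral Lustre file system.
 * Everything here (create_lustre.yml, variable names, the 4 TiB-per-server
 * mapping) is invented; it only illustrates the mechanism described above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <slurm_job_script>\n", argv[0]);
        return 1;
    }

    FILE *fp = fopen(argv[1], "r");
    if (!fp) {
        perror("fopen");
        return 1;
    }

    long capacity_gib = 0;
    char line[1024];
    while (fgets(line, sizeof line, fp)) {
        if (strncmp(line, "#DW ", 4) != 0)   /* only burst buffer directives */
            continue;
        char *cap = strstr(line, "capacity=");
        if (cap)   /* naive: reads the leading number, ignores unit suffixes */
            capacity_gib = atol(cap + strlen("capacity="));
    }
    fclose(fp);

    if (capacity_gib <= 0)
        return 0;   /* no burst buffer requested; nothing to provision */

    /* Pick a server count for the requested capacity, then hand off to an
     * Ansible play that stands up (and later tears down) the ephemeral fs. */
    long nservers = (capacity_gib + 4095) / 4096;   /* pretend 4 TiB/server */
    printf("ansible-playbook create_lustre.yml -e nservers=%ld -e capacity_gib=%ld\n",
           nservers, capacity_gib);
    return 0;
}
```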
The fact that the #3 fastest storage system in the world is a whitebox NVMe system is really remarkable, and my hat is off to the team at Cambridge that did this work. As all-flash parallel file systems go from being high-end boutique solutions to being affordably mainstream, relatively scrappy but innovative engineering like the Cambridge system is surely going to cause a rapid proliferation of flash adoption in HPC centers.
DDN also presented their software-defined IO-500 submission, this time run in Google Cloud and landing in the #8 position:
Since DDN's embedded SFA product line already runs virtual machines on their controller hardware, it doesn't seem like a big stretch to run the same SFA VMs in the cloud. While this sounds a little at odds with DDN's biggest differentiator--providing a fully integrated hardware platform--this idea of running SFA in Google Cloud arose from the growing need for parallel file systems in the cloud. I can only assume that this need is being largely driven by AI workloads, which require a combination of high I/O bandwidth, high IOPS, and POSIX file interfaces.
Thursday, November 15
The conference was showing signs of winding down by Thursday, as many attendees brought their luggage with them to the convention center so they could head back home that night. The expo floor also closes in the mid-afternoon on Thursday.
Technical Program, Part 2 - Exhibitor Forum
My Thursday began at 10:30 AM with the HPC Storage and Memory Architectures session of the Exhibitor Forum. Liran Zvibel, former CTO and now CEO of WekaIO, was the first presenter and gave a surprisingly technical description of the WekaIO Matrix parallel file system architecture:
WekaIO's Matrix file system architecture block diagram. A surprising amount of detail can be gleaned by examining this carefully.
In terms of building a modern parallel file system from the ground up for all-flash, WekaIO checks off almost all of the right boxes. It runs almost entirely in user space to keep latency down, it runs in its own reserved pool of CPU cores on each client, and it capitalizes on the approximate parity between NVMe latency and modern high-speed network latency. They make use of a lot of the smart ideas implemented in the enterprise and hyperscale storage space too, and they are one of the few genuinely forward-looking storage companies thinking about the new possibilities of the all-flash world while still courting the HPC market.
There is a fair amount of magic involved that was not broken down in the talk, although I've found that the WekaIO folks are happy to explain some of the more complex details if asked specific questions about how their file system works. I'm not sure what is and isn't public though, so I'll save an architectural deep-dive of their technology for a later date.
Andreas Schlapka of Micron Technology was the next speaker, and his talk was quite a bit more high-level. Aside from the grand statements about how AI will transform technology though, he did have a couple of nice slides that filled some knowledge gaps in my mind. For example:
Broad strokes highlighting the different computational (and architectural) demands of training and inference workloads
Training is what the vast majority of casual AI+HPC pundits are really talking about when extolling the huge compute requirements of deep learning. Part of that is because GPUs are almost the ideal hardware solution to tackle the mathematics of training (dense matrix-matrix multiplication) and post impressive numbers; the other part is that inference can't happen without a well-trained model, and models are continually being refined and re-trained. What I hadn't fully appreciated is that inference is much more of an interesting computational problem in that it more closely resembles the non-uniform and latency-bound workloads of scientific computing.
This has interesting implications for memory technology; while HBM2 definitely delivers more bandwidth than DDR, it does this by increasing the channel width to 128 bits and hard-wiring 8 channels into each stack. The extra bandwidth helps feed GPUs for training, but it's not doing much for the inference side of AI which, presumably, will become a much more significant fraction of the cycles required overall. In my mind, increasing the size of SRAM-based caches, scratchpads, and register files is the more obvious way to reduce latency for inference, but we haven't really seen a lot of fundamentally new ideas on how to effectively do that yet.
The speaker went on to show the following apples-to-apples system-level reference:
System-level speeds and feeds of the memory products available now or in the near future as presented by Micron
It's not terribly insightful, but it lets you back out the bus width of each memory technology (bandwidth / data rate / device #) and figure out where its bandwidth is coming from:
- DDR4 and DDR5 use 64-bit channels and rely on increasing channel-level parallelism to improve bandwidth. This is now putting them in a place where you wind up having to buy way more capacity than you may want just to get sufficient bandwidth. This is analogous to where HDDs are in the HPC storage hierarchy today; it's rapidly becoming uneconomical to rely on DDR for bandwidth.
- GDDR uses narrower channels (32 bits) but more of them to get better bandwidth. They also rely on phenomenally high data rates per pin; I don't really understand how this is possible since they rely on inefficient single-ended signaling.
- HBM uses both wide (128-bit) and plentiful channels to get its performance; the table is a bit misleading in this regard since each "device" (HBM stack) contains eight channels.
This is fine for feeding highly parallel arithmetic units like vector ALUs, but it offers no benefit to latency-bound workloads that, for example, chase pointers to traverse a graph. (It turns out HBM is just fine for pointer chasing--thanks to one of HPC's memory-wizards-at-large for pointing this out to me!)
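As a quick sanity check of that back-of-the-envelope method, here is a tiny example using nominal per-device numbers that I'm reasonably confident of (not the exact figures from Micron's slide):

```c
/* buswidth.c -- back out the interface width of a memory device from its
 * bandwidth and per-pin data rate, as described above.  The numbers below
 * are nominal/representative, not the exact figures from Micron's slide.
 *
 *   width_bits = (bandwidth in GB/s * 8) / (data rate in Gb/s per pin)
 */
#include <stdio.h>

static void back_out(const char *name, double gbytes_per_s, double gbits_per_pin)
{
    double width_bits = gbytes_per_s * 8.0 / gbits_per_pin;
    printf("%-10s %6.1f GB/s at %4.1f Gb/s/pin -> ~%.0f-bit interface\n",
           name, gbytes_per_s, gbits_per_pin, width_bits);
}

int main(void)
{
    back_out("DDR4-3200", 25.6, 3.2);   /* one 64-bit channel              */
    back_out("GDDR6",     56.0, 14.0);  /* one 32-bit channel              */
    back_out("HBM2",     256.0,  2.0);  /* one stack = 8 x 128-bit channels */
    return 0;
}
```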
The NSF Future Directions BOF
I opted to see what was new with the National Science Foundation's Office of Advanced Cyberinfrastructure (OAC) at their noon BOF. Despite having left the NSF world when I left San Diego, I still care deeply about NSF computing because they pay for many of the most accessible HPC resources in the US. I certainly got my start in HPC on the NSF's dime at SDSC, and I got to see firsthand the huge breadth of impact that SDSC's XSEDE resources had in enabling smaller research groups at smaller institutions to perform world-class research. As such, it's also no surprise that the NSF leads the pack in developing and deploying many of the peripheral technologies that can make HPC accessible such as federated identity, science gateways, and wide-area file systems.
That all said, actually listening to the NSF HPC strategic vision makes me rather grumpy since the directions of such an important federal office sometimes appear so scattershot. And judging by the audience questions at the end of the BOF, I am not the only one--Very Important People(tm) in two different national-level HPC consortia asked very pointed questions of Manish Parashar, the NSF OAC director, that highlighted the dichotomy between OAC's strategic vision and where it was actually putting money. I really believe in the critical importance of NSF investment in maintaining national cyberinfrastructure, which is probably why I keep showing up to these BOFs and do my best to support my colleagues at SDSC and the other XSEDE SPs.
After sitting through this Future Directions BOF, I could write another updated rant about how I feel about the NSF's direction in HPC and get myself in trouble. Instead, I'll just share a few slides I photographed from afar along with some objective statements and leave it at that.
The future directions summary slide:
NSF OAC's future directions
- Performance, capability computing, and global leadership are not mentioned in the above slides. Terms like "agility," "responsiveness," and "accessibility" are often used to describe the cloud.
- "reduce barriers to CI adoption" indicates that NSF wants to serve more users. NSF is not increasing investment in capital acquisition (i.e., more or larger HPC systems beyond the status quo of technology refreshes).
- "Prioritize investments to maximize impact" does not define what impacts are to be maximized.
The Frontera slide:
NSF's next leadership-class HPC, Frontera, to be deployed by TACC
- The award amount was $60M. The previous Track-1 solicitation that funded Blue Waters was $200M. Stampede was $30M, and Stampede 2 was another $30M.
- "leadership-class ... for all [science and engineering] applications" either suggests that all science and engineering applications are leadership-capable, or this leadership-class system is not primarily designed to support a leadership computing workload.
- It is unclear what the significance of the "CPU" qualifier in "largest CPU system" is in the larger context of leadership computing.
- There is mention of "leadership-class" computing. There is no mention of exascale computing. There is nothing that acknowledges leveraging the multi-billion-dollar investment the US has made into the Exascale Computing Project. An audience member politely asked about this omission.
The Midscale Research Infrastructure slide:
Upcoming solicitations for research cyberinfrastructure
- NSF OAC expects to issue one $6M-$20M solicitation and another $20M-$70M solicitation "soon" to fund HPC systems and the associated infrastructure.
- $6M-$20M is on the same order of magnitude as the Track-2 solicitations that funded SDSC's Gordon ($10M) and Comet ($12M).
- $20M-$70M is on the same order of magnitude as the Track-2 solicitations that funded TACC's Stampede 1 and 2 ($30M). NSF's next leadership-class investment (Frontera) is $60M.
My SC Paper
The next major item on my agenda was presenting my paper, A Year in the Life of a Parallel File System, as the final talk in the final session of the paper track.
My name in lights--or something like that.
I was admittedly bummed out when I found out that I was going to be the conference closer, since a significant number of SC attendees tend to fly out on Thursday night and, presumably, would not stick around for my presentation. As a result, I didn't take preparation for it as seriously in the weeks leading up to SC as I normally would have. I knew the presentation was a 30-35 minute talk that had to be fit into a 25-minute slot, but I figured I would sort out how to manage that the night before the talk and mostly wing it.
What I realized after arriving at SC was that a bunch of people--most of whom weren't the expected audience of storage researchers--were looking forward to hearing the talk. This left me scrambling to seriously step up the effort I was going to put into making sure the presentation was well composed despite needing to drop ten minutes of material and fit it into the 25 minutes I was given. I documented my general approach to crafting presentations in my patented Glenn K. Lockwood Five Keys to Being a Successful Researcher (FKBSR) method, but I'll mention some of my considerations for the benefit of anyone who is interested in how others approach public speaking.
- I absolutely could not overshoot the timing because some attendees had to leave at 5 PM to catch 7 PM flights. This meant that it would be better for me to undershoot the time and either draw out the conclusions and acknowledgments slides to finish on time or finish early and leave extra time for questions.
- The people I met at SC who indicated interest in my talk were storage systems people, not statisticians. This meant I could probably tone down the statistical rigor in the presentation without offending people's scientific sensibilities.
- Similarly, because attendees were already familiar with typical HPC I/O systems and the relevant technologies, I could gloss over the experimental setup and description of the different compute and storage systems.
- Given the above considerations, a reasonable approach would be to punt as many non-essential details into the Q&A after the talk and let people try to poke holes in my methods only if they really cared.
I also know two things about myself and the way I present:
- I can present either at a casual pace where I average ~70 seconds per slide or in turbo mode where I average ~50 seconds per slide. Orating at turbo speed requires a lot more preparation because it requires speaking through slide transitions rather than pausing to reorient after each slide transition.
- I get distracted easily, so I would rather have people begin to leave after my monologue ended and Q&A began than have the commotion of people getting up derail the tail end of my presentation.
As a result of all these factors, I opted to cut a lot of details to get the talk down to ~25-30 minutes when presented at a casual pace, and to prepare to present in turbo mode just in case the previous speakers went long (I was the last of three speakers), there were A/V issues (they were prolific at this SC, especially for Mac users), or there were any audience interruptions.
I also opted to present from my iPad rather than a full laptop since it did a fine job earlier at both PDSW-DISCS and the IO-500/VI4IO BOF. In sticking with this decision though, I learned two valuable things during the actual presentation:
- The iOS "do not disturb" mode does not suppress Twitter notifications. A couple of people were kind enough to tweet about my presentation as I was giving it, but this meant that my presenter view was blowing up with Twitter noise as I was trying to present! Fortunately I only needed to look down at my iPad when transitioning between slides so it didn't derail me.
- There's no usefully sized timer or clock in PowerPoint for iOS's presenter view, and as a result, I had no idea how I was doing on time as I entered the final third of my slides. This became a distraction because I was fully expecting a five-minute warning from the session moderator at some point and got worried that I wasn't going to get one. As such, I didn't want to slow down the tail of the presentation without knowing how close I was getting to the target. It turned out that I didn't get a five-minute warning because I was already concluding at that point.
Fortunately the audience was sufficiently engaged to pad out the Q&A period with many of the questions that would've been answered by the slides I had dropped. Afterwards I got feedback that indicated the presentation was noticeably short to the audience (not great) but that the narrative remained understandable to most attendees throughout the entire presentation (good).
As far as the technical content of the presentation though, I won't recap that here--until I write up the high-level presentation as another blog post, you may have to read the paper (or invite me to present it at your institution!).
SC Technical Program Reception
I've never attended the reception that wraps up the last full day of SC for a variety of reasons, and I was going to skip it again this year to fit some me-time into the otherwise frantic week. However the venue (the Perot Museum) and its close proximity to my hotel lured me out.
The entryway to the Perot Museum
I am not a "never eat alone" kind of person because I find that my ability to be at the top of my game diminishes without at least some intermittent time to sit back and digest. As such, I approached the reception with very selfish intent: I wanted to see the museum, learn about something that had nothing to do with supercomputing, have a drink and a meal, and then go back to my hotel. So I did just that.
The dinosaurs seemed like a major feature of the museum:
Rapetosaurus skeleton on display at the Perot Museum
The paleontological diversity of the dinosaur room reminded me of the dinosaur museum near my wife's hometown in the Canadian prairies, but the exhibit seemed to be largely reproduction fossils that blended science with entertainment.
More impressive to me was the extensive mineral collection:
I'm a sucker for quartz. I did my PhD research on silicates.
Not only were the minerals on display of remarkable quality, but many of them were found in Texas. In fact, the museum overall had a remarkably Texas-focused set of exhibits, which really impressed me. The exhibit that most caught my attention was a mini-documentary on the geologic history of Texas that explained how plate tectonics and hundreds of millions of years resulted in the world-famous oil and gas reserves throughout the state.
Having learned something and enjoyed some delightful food at the museum, I then called it quits and cashed out.
Friday, November 16
The last day of SC is always a bit odd because the expo has already wrapped up, most of the vendors and casual attendees have gone home, and the conference is much more quiet and focused. My day started with a surreal shuttle ride to the conference center in what appeared to be a 90's-era party bus:
Conference shuttle, complete with taped-together audio system, faux leather sofa, and a door that had to be poked with a broom stick to open.
Only six concurrent half-day workshops and a panel were on the agenda:
The entire Friday agenda fit on a single screen
I stuck my head into the P3HPC workshop's first panel discussion to catch the age-old but ever-lively argument over someone's proposed definition of performance portability and productivity either being too broad or too narrow. I/O performance portability generally does not have a place in these sorts of conversations (which I don't fault--algorithmic complexity in I/O is usually hidden from user applications) so I attended only as an interested observer and wasn't as fastidious about taking notes as I was earlier in the week.
At 10:30 AM I headed over to the Convergence between HPC and Big Data: The Day After Tomorrow panel discussion which had a star-studded speaker lineup. NERSC's Katie Antypas gave a great overview of the NERSC-9/Perlmutter architecture which fit the panel topic uncannily well since it is a system design from the ground up to meet the needs of both traditional HPC and large-scale data analysis.
The NERSC-9 Project Director describing how the Perlmutter system embodies the convergence of HPC and Big Data in front of a remarkably big crowd in the final session of SC.
Unfortunately I had to duck out shortly after she spoke to get to my last meeting of the week with an old colleague for whom I always make time at SC. Incidentally, some of the most valuable time you can spend at SC is talking to industry consultants. Not unlike getting to know members of the trade press, good consultants have exposure to a tremendous breadth of problem and solution spaces. They can give you all manner of interesting insights into different vendors, industry verticals, and market trends in an otherwise brief conversation.
After my final meeting was cut short by my colleague's need to run to the airport, I had a quick bite with another Friday holdout then made my own way to the airport to catch up on a week's worth of e-mails. The flight back to Oakland was one of the rare occasions where I was just too worn out to try to catch up on some delinquent report writing and just watched three hours of Dark Tourist on Netflix.
After the Conference
It was technically Saturday by the time I finally got home, but the family was happy to see me (and the swag I had in tow):
George fully appreciating the giant pile of conference swag with which I came home
This was definitely the busiest SC of my career, but in many ways it was also the most productive. I owe sincere thanks to everyone in the HPC community who made it such a worthwhile conference to attend--vendors, presenters, old colleagues, and even the new colleagues who occasionally just wanted to introduce themselves and express that they enjoy reading the nonsense I post on Twitter. I always leave SC more amazed and humbled by all the bright minds with whom I connect, and I hope that I am doing my part to pay that experience forward for others now and in the SC conferences to come.