ISC’24 recap

I had the great pleasure of attending the ISC High Performance conference this month, marking the fifth time I've attended what has become one of my top must-attend industry conferences of the year. This year was particularly meaningful to me because it was the first time that:

  1. I attended ISC as a Microsoft employee. This is also the first time I've attended any HPC conference since I changed my focus from storage to AI infrastructure.
  2. I attended ISC in-person since before the pandemic. It's also the first time I've visited Hamburg, which turned out to be an absolute delight.

Although registrations have been lower since the pandemic, this year's final registration count was over 3,400 attendees, and there was no shortage of old and new colleagues to bump into walking between the sessions at the beautiful Congress Center Hamburg.

Exterior of CCH with ISC banners flying

This year's theme was "Reinvent HPC," and that idea—that HPC needs to reinvent itself—was pervasive throughout the program. The whole industry had been pulling towards exascale for the better part of a decade, and now that there are two exaflop systems on Top500 and the dust is settling, it feels like everyone is struggling to figure out what’s next. Is it quantum? AI?

It was difficult for me to draw a line through all the topics worth reviewing at this year's ISC, as it was a very dense four days packed with a variety of topics, discussions, vendors, and events. I only experienced a fraction of everything there was to be seen since so many interesting sessions overlapped, but I thought it might be worthwhile to share my perspective of the conference and encourage others to do the same.

Reinventing HPC (and blast those hyperscalers!)

The need to reinvent HPC was the prevailing theme of the conference from the very first session; with the listing of Aurora as the second system on Top500 to break the 1 exaflops barrier, the community is in search of a new milestone to drive research (and funding!). At the same time, commercial AI has risen rapidly as a largely independent, parallel effort, with a speed and scale that begs the question: how important was the decade-long drive to break the exaflops barrier if the AI industry could catch up so quickly without the help of the institutions that have historically posted the top HPL scores? If the commercial AI industry overtakes scientific computing as the world leader in deploying at scale, how can “HPC” be reinvented so it can continue to claim leadership in another dimension?

Kathy Yelick's opening keynote

ISC’s opening keynote was given by Kathy Yelick, who provided commentary on two recent government-commissioned reports on the future of HPC:

  1. Charting a Path in a Shifting Technical and Geopolitical Landscape: Post-Exascale Computing for the National Nuclear Security Administration, commissioned by the National Academies
  2. Can the United States Maintain Its Leadership in High-Performance Computing?, commissioned by the US Department of Energy’s Advanced Scientific Computing Research program

Living up to her reputation, Dr. Yelick’s talk was fast and insightful, describing the insatiable demand for computing driven by scientific research, the struggle to expose ever-increasing amounts of parallelism to make use of newer processors, and some promising directions to address that disconnect. However, her talk took a direction I didn’t like when she began describing the disruptors that necessitate reinventing HPC:

Three disruptors to HPC: AI, quantum, and cloud

The above slide implied that AI, quantum, or cloud may pose an existential threat to the HPC community gathered at ISC this year; this immediately raised my hackles, as it cast the relationship between “HPC” and “AI”/“cloud” as having some sort of adversarial tension. As the talk went on, I realized that “HPC” didn’t really mean “high-performance computing” to her. Rather, it was used to refer to something much more narrowly scoped—high-performance computing to solve scientific problems. Slide after slide, the presentation kept doubling down on this idea that “HPC” as the audience knows it is being threatened. For example, Yelick talked through this slide:

Market cap of traditional HPC vendors vs. hyperscalers

The picture she painted is that “HPC” (denoted by companies with blue bars) no longer has influence over technology providers because the “hyperscalers” (green bars) have such an outsized amount of investment. She then used this to call on the audience to think about ways “we” could influence “them” to produce technologies that are useful for both scientific computing and low-precision AI workloads.

Her talk culminated in this slide:

Post-exascale strategy is "us" versus "them."

Which was accompanied by this conclusion:

"So what’s a post-exascale strategic for the scientific community? It's the beat 'em or join 'em strategy. The beat 'em strategy says we’re going to design our own processors. [...] The join 'em strategy says let's leverage the AI hardware that's out there. [...] The sort of sneaky way of doing this is getting embedded in the AI community and trying to convince them that in order to make AI better for commercial AI applications, you really want to have certain features. Like don't throw away your 64-bit arithmetic and things like that."

I found myself getting increasingly unsettled through the keynote, because this "us versus them" mentality put me, a long-standing member of this HPC community, in the camp of "them." It was as if I was suddenly an outsider in a conference that I've been attending for years just because I no longer work for an organization that has been doing HPC since the early days of computing. Even though the clusters I support use the same NVIDIA and AMD GPUs, the same InfiniBand fabrics, and the same Lustre file systems that "HPC" uses, I am no longer in "HPC" because I am "hyperscale" or "cloud" or "AI."

The underlying message is one I get; GPUs are trending in a direction that favors massive gains in lower-precision computation over FP64 performance. And the cost of HBM is driving the overall value (in FP64 FLOPS per dollar) of accelerators backwards for the first time in the history of scientific computing. But the thesis that the scientific computing community needs to be sneaky to influence the hyperscale or AI players seemed way off the mark to me. What seemed absent was the recognition that many of the "hyperscalers" are her former coworkers and remain her colleagues, and "they" sit in the same audiences at the same conferences and share the same stages as the "HPC" community. All that is true because "HPC" is not somehow different than "cloud" or "AI" or "hyperscale." If there really is a desire to influence the hyperscale and AI industry, the first step should be to internalize that there is no "us" and "them."

Closing keynotes on the future

Just as the conference opened with a talk steeped in this "us versus them" mentality, it closed the same way: a keynote session titled "Reinventing HPC with Specialized Architectures and New Applications Workflows," which had two speakers followed by a Q&A.

Chiplets for modular HPC

John Shalf gave one half of the closing keynote, where he made his usual rallying cry for investment in chiplets and specialized processors for HPC:

John Shalf's birds slide calling for processor specialization

He gives a variant of this talk at every ISC, but this year he lasered in on the notion that the "HPC" community needs to do what the "hyperscalers" do and use chiplets to develop custom ASICs. It was an energetic and impassioned talk, but the notion that hyperscalers are already executing on his vision for the future sounded a little funny to me; I now work for one of those hyperscalers, and his message didn't resonate.

John Shalf's concluding thoughts were off the mark

If you really follow the money, as Shalf suggested, a huge amount of it is flowing into GPUs, not specialized processors. It wasn't clear to me what specialization he was thinking of when he referred to custom silicon being developed by the likes of Meta, Google, AWS, and Microsoft; it's true that these companies are developing their own silicon, but those efforts are largely addressing cost, risk, and supply, not improving performance beyond more general-purpose silicon like GPUs. And it turns out that a significant fraction of the (non-US) HPC community is already developing custom silicon for the same reasons as the hyperscalers; Japan, China, and Europe are all developing their own indigenous processors or accelerators for scientific computing at leadership scales. In that sense, Shalf was preaching to the choir given that, on the international stage, his government is the odd one out of the custom silicon game.

He also suggested a dichotomy where the HPC community would have to either (1) turn every scientific problem into an AI problem or (2) join the journey towards making domain-specific accelerators, ignoring the significant, unexplored runway offered by using mixed-precision arithmetic in scientific applications. He called for partnering with hyperscalers, but his examples of implementing a RISC-V-based stencil accelerator and a SambaNova-based DFT processor didn't draw a clear line to the core missions of the large hyperscalers he extolled. He briefly said that partnering would benefit hyperscalers by addressing some capital cost challenges, but seeing as the annual capital expenditures of the hyperscalers outstrip those of the US national HPC effort by orders of magnitude, I couldn't understand what the hyperscalers would stand to gain by partnering in this way.

Integrating HPC, AI, and workflows

Rosa Badia gave the second half of the closing keynote where she proposed ideas around complex scientific workflows and the novel requirements to support them. This talk felt a lot more familiar, as the focus was squarely on solving scientific computing challenges by connecting traditional HPC resources together in nontraditional ways using software whose focus goes beyond cranking out floating point arithmetic.

As she spoke, I couldn't help but see parallels between the challenges she presented and the sort of technologies we live and breathe every day in cloud services.  For example, she showed this slide:

Rosa Badia's HPC workflows as a service slide

Dr. Badia obviously wanted to make a cloud tie-in by calling this "HPC Workflows as a Service," but what I'm not sure she realized is that this model almost exactly describes platform-as-a-service frameworks that already exist in commercial clouds. For example,

  • What she calls a "Data Catalog" is a public or private object storage account (a blob container, an S3 bucket) or a PaaS abstraction built atop them
  • What she calls a "Software Catalog" is a container registry (Azure Container Registry, Amazon Elastic Container Registry) or an abstraction built atop them
  • A "Workflow Description" is something like an AzureML pipeline or SageMaker pipeline
  • A "Workflow Registry" is just a Github repository containing pipelines
  • The "Portal" is the web UI provided by AzureML or SageMaker

I don't think there's anything truly new here; the challenges she described lie in wedging these workflows into HPC infrastructure, which lacks platform features like robust identity and access management (i.e., something better than LDAP that supports more modern authentication and authorization flows and finer-grained access controls) and data management (i.e., something better than a parallel file system that depends on POSIX users, groups, and permissions and implicit trust of clients).

She went on to describe a workflow data management system that reinvented a bunch of infrastructure that is already baked into commercial cloud object stores like Azure Blob and AWS S3:

Rosa Badia's slide on data management for workflows

As she was describing the requirements for such a workflow data management layer, it struck me that what the scientific data community calls "FAIR principles" are the same basic requirements for operating in commercial environments where data may be subject to strict privacy and compliance regulations. The notion of findable data may be aspirational for scientific datasets, but when a company is having to find datasets because it's being sued or subpoenaed, findability is a bare-minimum requirement for any data management system. Similarly, tracking the provenance of data may be a nice-to-have for scientific data, but it is a hard requirement when establishing a secure software supply chain. Cloud storage systems solved many of these challenges a long time ago, and I can't help but wonder if this idea that workflows in HPC pose a new set of challenges is another manifestation of "us" not realizing "they" might have done something useful and applicable for science.
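
To make that concrete, here's a minimal sketch (assuming an S3-compatible object store and the standard boto3 client; the bucket, keys, and metadata fields are hypothetical) of how findability and provenance metadata can travel with the data itself rather than living in a bolted-on catalog:

```python
import boto3

# Minimal sketch of carrying findability/provenance metadata with the data
# itself in an S3-compatible object store. The bucket, key, and metadata
# fields below are hypothetical; only the boto3 calls are standard.
s3 = boto3.client("s3")

s3.put_object(
    Bucket="climate-runs",                      # hypothetical bucket
    Key="cesm/run-042/output.nc",
    Body=b"...simulation output bytes...",      # stand-in for the real dataset
    Metadata={                                  # user-defined metadata stored with the object
        "experiment": "cesm-run-042",
        "generated-by": "cesm-2.1.3",
        "input-dataset": "s3://climate-runs/cesm/run-042/inputs/",
    },
)

# Anyone with access (or an auditor) can later recover the provenance
# without consulting a separate catalog or database.
head = s3.head_object(Bucket="climate-runs", Key="cesm/run-042/output.nc")
print(head["Metadata"])
```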

Badia's final slide had a particularly poignant statement which read, "Systems can only be justified if we have applications that need them." I think she was trying to call for more investment in application development to exploit new systems, but I think the inverse is also true. If modern scientific applications truly require more complex orchestration of compute and data, maybe the scientific computing community should stop building computing platforms that make it really difficult to integrate different systems.

Again, "HPC" is not the opposite of "cloud;" it's not an either/or decision. There are technologies and tools that were designed from the beginning to simplify the secure connection of services and resources; they just weren't invented by the HPC community.

Top500 and Aurora

One of the cornerstones of ISC is the semiannual release of the Top500 list, and unlike at SC, the Top500 announcements and awards do not overlap with any other sessions, so it tends to have a higher profile and draw all attendees. This go-around, there were no dramatic changes in the Top 10; the new Alps system at CSCS was the only new entry, and the order of the top five systems remained the same. Notably, though, Aurora posted a significantly higher score than at SC'23 and broke through the exaflops barrier using 87% of the system, cementing its place as the second exascale system listed. But let's start at the top.

#1 - Frontier

Frontier at Oak Ridge remained #1, but it squeezed twelve more petaflops out of the same node count and is now just over 1.2 EF. Nothing groundbreaking, but it's clear evidence that ORNL is continuing to tune the performance of Frontier at full system scale.

#2 - Aurora

Aurora, on the other hand, finally eked over the exaflops line with 1.012 EF using 87% of the system's total 63,744 GPUs. Rick Stevens gave a short talk about the achievement which is summed up on this slide:

Rick Stevens' slide summarizing their latest Top500 runs

I was a little surprised by how honest Stevens was in this talk; the typical game is that you stand up on stage, talk about the great partnership you had with your vendors to realize the achievement, extol the virtues of the technologies on which your system was built, and talk about how this HPL score is just the start of a lot of great science.

Stevens didn't do that though.

He started out by telling the conference that Intel had bad product names, then explained that their low Graph500 and HPCG scores were the result of their exclusive focus on breaking the exaflops barrier with HPL, implying they didn't have time or ability to run Graph500 or HPCG at the same 87%-89% scale as their HPL and HPL-MxP runs. Based on this, it sounds like Aurora is still a ways away from being stable at scale, and we're unlikely to see any Gordon Bell-nominated papers at SC'24 this November.

After this session, folks seemed to relish dunking on Aurora; its window to be #1 has likely closed, and it has some power efficiency issues. But I don't think anyone involved with the Aurora project needs to be told that; if what Stevens implied is true, the folks at ALCF, Intel, and HPE have been struggling for a long time now, and topping out over 10^18 FLOPS was a hard-sought, major milestone to be celebrated. The Aurora project has been thrown more curveballs than I would have ever guessed a single HPC project could be thrown, so all parties deserve credit for sticking with it all this way rather than just walking away. With any luck, Aurora will stabilize in the next six months, and we'll see full-scale runs of Top500, Graph500, HPCG, and science apps by November.

#3 - Eagle

The third-highest system on the list was Eagle, whose HPL score has not been updated since the system was first listed at SC'23 last year. Through a few twists of fate, I wound up being the person who accepted the award on-stage, and I now have a Top500 award for the #3 system sitting in my home office. Here's a photo of me goofing around with it:

It's not entirely inappropriate that I was the one to accept it since my teammates are the ones carrying pagers for the on-call rotation of that system, and we were also the hands on keyboard when that HPL run was conducted. Still, it was a bit surreal to walk on-stage to pick up such a noteworthy award immediately following two actually important people (both of whom have "director" in their titles) accepting the same award. By comparison, most of my career highlights to date have been just trolling HPC people on Twitter (as the esteemed Horst Simon actually said out loud as I was leaving the stage!).

It was weird.

That said, I take this to mean that it is now my duty to be the friendly face from Microsoft who can speak intelligently about the #3 system on Top500. To that end, I'll answer some questions that I was asked at ISC about the system and Azure HPC clusters in general below. None of this is new or secret information!

  • Why didn't you run HPL again and post a higher score to beat Aurora? Because the day after that HPL run completed, that system was put into production. Once systems are in production, people are paying to use them, and taking a time-out to re-run HPL costs a ton of money in either real dollars (if a customer runs it) or lost revenue (if the HPL run is blocking customer workloads). This is quite different from public-sector HPC systems which never have to pay for themselves.
  • Can I get access to Eagle for a Gordon Bell run or to test software? That's not really how it works. Whereas a traditional supercomputer might allow users to ssh in and submit jobs to a Slurm queue, cloud-based supercomputers allow users to deploy virtual machines through a REST API. Those virtual machines can allow ssh, run Slurm, and support MPI jobs like HPL, but that OS environment is managed by Azure users, not Azure itself. You can get a taste for what's required to run a basic MPI job by reading some instructions I wrote on provisioning an MPI cluster on Azure; a sketch of what a trivial MPI job looks like appears after this list.
  • Is it just a bunch of GPU nodes scattered around a bunch of data centers? No, all the nodes on any given Azure HPC cluster (like Eagle) share an InfiniBand fabric. There are countless InfiniBand clusters in Azure, but each one is a real supercomputer by any definition of a supercomputer, and they are designed to run tightly coupled jobs across all their GPUs.
  • What parallel file system does it use? Don't think about it that way. You can provision a Lustre file system and mount that to any or all cluster nodes if you want to, or you can access data directly from object storage.
  • Are there any photos of it? You can see a photo of one of the Microsoft-designed nodes that comprise the system on my SC'23 recap blog post. Beyond that, there's not much to look at because Azure HPC clusters are not meant to be photogenic like, say, Cray supercomputers. There are no rack graphics (or even rack doors!). It's just tons and tons of air-cooled racks with InfiniBand optics coming out of each one. Maybe the only unique thing is that the racks are painted white instead of the typical black. Not sure why.
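
As promised above, here's the kind of trivial MPI job you'd run once you've provisioned the VMs yourself. It's a generic mpi4py hello-world, nothing Azure-specific; the hostfile and launch command are whatever your provisioned image provides.

```python
# hello_mpi.py -- a trivial MPI job of the sort you'd launch across VMs you
# provisioned yourself, e.g. `mpirun -np 16 -hostfile hosts python hello_mpi.py`.
# Nothing here is Azure-specific; it just exercises the fabric end to end.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
node = MPI.Get_processor_name()

# Every rank reports in; rank 0 gathers and prints, proving all nodes can talk.
reports = comm.gather(f"rank {rank} of {size} on {node}", root=0)
if rank == 0:
    print("\n".join(reports))
```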

Getting back to that false separation between "HPC" and "cloud," Eagle is strong evidence that they aren't different. What the "hyperscalers" do is not that different from what traditional HPC centers do. Perhaps the biggest difference is that cloud supercomputers get all the benefits of cloud infrastructure: software-defined resources like virtual machines and virtual networking, integration with identity and access management that transcends simple Linux UIDs/GIDs, and the flexibility to integrate with whatever storage systems or ancillary services you want from any compute node.

Other notable tidbits

It is tradition for Erich Strohmaier to talk through some highlights and trends of the latest Top500 list every time a new one is announced, and in the past, I've been critical of how he's presented conclusions from the list with the implicit assumption that computers that never post to Top500 simply don't exist. This year felt different, because Dr. Strohmaier made the explicit statement that China has completely stopped submitting to Top500. Their exascale systems aren't listed, and no new Chinese systems have appeared at the bottom of the list in the past three years either. They simply don't play the game anymore, making it undeniable that Top500 is no longer an authoritative list.

Just as the whole conference's theme was reinventing HPC, I felt a sense that even the most stalwart proponents of Top500 are now recognizing the need to reinvent the Top500 list. Kathy Yelick said as much during her keynote ("Shall we replace Top500? What are the metrics in post-exascale computing that are important?"), and Erich implored the audience to help expand the HPL-MxP (formerly HPL-AI; an HPL-like benchmark that can use the mixed-precision capabilities of tensor cores) list. Nobody seems to know how to quantify what makes a leadership supercomputer nowadays, but accepting that HPL scores (or appearing on the Top500 list!) won't cut it is a good first step.
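
For those who haven't looked at HPL-MxP closely, the core trick is iterative refinement: do the expensive factorization in low precision, then recover FP64-quality answers by refining the residual in full precision. Here's a minimal NumPy sketch of that idea (an illustration only, not the benchmark itself):

```python
import numpy as np

# Sketch of mixed-precision iterative refinement, the idea behind HPL-MxP:
# solve in low precision (cheap), then refine the residual in FP64 (accurate).
rng = np.random.default_rng(0)
n = 1024
A = rng.standard_normal((n, n)) + n * np.eye(n)      # well-conditioned FP64 system
b = rng.standard_normal(n)

A32 = A.astype(np.float32)                           # low-precision "factorization"
x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)

for i in range(5):
    r = b - A @ x                                    # residual computed in FP64
    dx = np.linalg.solve(A32, r.astype(np.float32))  # correction in low precision
    x += dx.astype(np.float64)
    print(f"iteration {i}: relative residual = {np.linalg.norm(r) / np.linalg.norm(b):.2e}")
```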

That all said, Top500 is still a valuable way to track technology trends in the industry. For example, this was the edition of the list where NVIDIA's new Grace Hopper nodes started appearing in force. The only new entrant in the Top 10 was the 270 PF GH200 component of CSCS's Alps system, and HPE had these EX254n GH200 blades on display on the show floor.

Photo of the Cray EX254n GH200 blade

To HPE/Cray's credit, they seem to have gotten the system up and running with Slingshot without the delays that plagued early Cray EX systems like Frontier and Aurora. Hopefully this is a sign that the Cray EX platform and Slingshot-11 have graduated from being risky and not-quite-production-ready.

The other notable entrants on this year's Top500 are a trio of early MI300A APU-based Cray systems being built around the El Capitan program at Lawrence Livermore National Laboratory. This is a positive sign that MI300A is up and running at modest scale, and HPE also had one of these EX255a blades on display at their booth:

Photo of a Cray EX255a blade

The strong showing of MI300A suggests that we may see El Capitan take the top spot in the next edition of the Top500 list coming in November.

Everyone is an AI expert!

Since I now work on a team responsible for AI infrastructure, I tried attending as many of the AI-focused talks and panels as I could this year. Unsurprisingly, these sessions largely carried the same undertones of "reinventing HPC," and speakers opined on how AI would affect scientific computing and offered examples of what their institutions were doing to extend their leadership in the HPC space into the AI space. There was a fair amount of grasping going on (as there always is when AI is discussed at non-AI conferences), but this year I was struck by how confused so many speakers and attendees were about concepts related to applying AI.

To be clear: I am no expert in AI. However, my day job requires that I be steeped in some of the largest AI training workloads on the largest AI supercomputers on the planet, and I have to have a cursory understanding of the latest model architectures and techniques to anticipate how future system designs will have to evolve. It's from this perspective that I made the following observation: there are a lot of HPC people speaking very confidently about AI based on an outdated understanding of the state of the art. The AI industry generally moves much faster than the government-funded research community, and I couldn't help but wonder if some community leaders assumed that the AI industry today is the same as it was the last time they wrote their AI grant proposal.

Of course, there were also some really insightful perspectives on AI for science shared as well. Let's talk through some examples of both.

The Exascale AI Synergies LLM Workflows BOF

This realization that the ISC community is not keeping up with the AI community first slapped me in the face when I ducked into a BOF session titled, "Tales of Exascales – AI and HPC Supercomputing Platforms Synergies for Large Language Models (LLMs) and Scientific Workflows." I sometimes wonder if the organizers who propose titles like that are intentionally creating word salad, but in this case, it was an apt session name; the discourse around HPC and AI was all over the map throughout the hour.

The session started on a strong, positive note with Simon McIntosh-Smith describing Bristol's new Isambard-AI system, a GH200-based Cray supercomputer funded under the broad charge of "AI research." While I'm usually skeptical of such nebulously defined "AI research" machines, Dr. McIntosh-Smith's description of the project quickly checked a bunch of boxes on how a real AI research platform should be developed. In particular,

Isambard-AI was developed and deployed at the pace of AI rather than HPC for scientific computing. Whereas government-funded, large-scale HPC systems typically take years to procure, Simon said that the first discussions started in August 2023, and in the nine months that followed, they had built the site, the team, and the system itself to the degree that a piece of the final system is already on Top500. By comparison, LLNL's El Capitan supercomputer also debuted on Top500 this month, but its contract was signed five years ago, and its procurement began at least two years before that. The AI industry would not exist if the systems it trains on took seven years to procure.

Isambard-AI deliberately avoided exotic AI accelerators to remain future-proof. Simon rightly pointed out that the AI industry moves too quickly to anticipate whether a bespoke AI accelerator would even be relevant to whatever the hottest model architecture will be in a year. GPUs were chosen because they are the most flexible way to accelerate the widest range of AI workloads, whether dense models or sparse models, training or inferencing, at whatever level of quantization makes sense. The reality is that cutting-edge research is done on GPUs, so aligning an AI supercomputer on the same technology ensures that the algorithms developed by industry are immediately usable for scientific research.

A reasonable definition of "AI for science" was established from the outset. Rather than blurting out "we need to research AI!" and asking for a sack of money to buy GPUs, Simon outlined a vision of training AI models using data generated by physical simulation on a more conventional HPC system. Training models on models to create surrogate models is not particularly new, but it does establish a few reasonable architectural decisions such as having a robust data management and sharing platform, close coupling to the HPC system performing simulation, and aligning software stacks and programming environments as closely as possible.
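
As a concrete illustration of the "models trained on models" idea, here's a toy sketch in PyTorch where an analytic function stands in for the expensive simulation and a small network learns to act as its surrogate. This is my own illustration, not anything specific to Isambard-AI:

```python
import torch
from torch import nn

# Toy sketch of a neural surrogate: use an (expensive) simulation to generate
# training pairs, then fit a cheap model that can stand in for it later.
def expensive_simulation(x):
    # Stand-in for a real physics code; returns one output per input sample.
    return torch.sin(3 * x[:, :1]) * torch.exp(-x[:, 1:2] ** 2)

X = torch.rand(4096, 2) * 4 - 2            # sampled simulation inputs
y = expensive_simulation(X)                # outputs from the "simulation"

surrogate = nn.Sequential(
    nn.Linear(2, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),
)
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

for step in range(2000):                   # ordinary supervised fit
    loss = nn.functional.mse_loss(surrogate(X), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The trained surrogate now approximates the simulation at a fraction of the cost.
```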

Simon's contribution to the discussion stood out to me as the most impressive; the discourse that followed seemed to fall into a trap of familiarity. Rather than focusing on the new and exciting prospects of AI, some panelists and audience members wanted to focus on the aspects of AI they understood. For example, an uncomfortable amount of time was spent on a back-and-forth about how HPC centers can support Kubernetes and random I/O (which is what defines AI vs. HPC?) instead of Slurm and Lustre. If your biggest challenge in delivering infrastructure to support AI workloads is figuring out how to deploy both Kubernetes and Slurm, you haven’t even reached the starting line. This is a trivial issue in cloud environments, where entire AI clusters can be built up and torn down in minutes. Again, this is evidence that the scientific computing community isn’t ready to keep pace with the AI industry.

I jotted down a few of the questions and comments that I heard during this BOF that seem to reflect the level of familiarity the average ISC attendee has with AI:

  • "Would be nice if there were more models for science." I wasn't sure sure what this means. All the leading LLMs are pretty good at "science," and domain-specific models aren't readily transferable between different science domains or problems.
  • Scientific problems "have to validate outputs for correctness, unlike LLMs." I think the speaker was making a sidelong reference to hallucinations, but like with any model (large language or physics-based), validating outputs for correctness is certainly necessary and readily possible.
  • "The demands of inference of LLMs are completely different from those for training. How do you buy inference infrastructure?" I wonder where this notion came from. If your infrastructure can train a model, it can definitely inference that model. Cost-optimizing infrastructure for inferencing is a separate matter (you can cut corners for inferencing that you wouldn't want to cut for training), as is building the service infrastructure around inferencing to deliver inferencing as a service. But I don't think that's what this question was about.
  • "Working safely with sensitive data / isolating workloads on big shared clusters." This is a problem that arises only when you try to wedge AI workloads into infrastructure designed for traditional physics-based simulation. If you have sensitive data, don't use big shared clusters. Provision separate clusters for each security domain on a shared, zero-trust infrastructure.
  • "How different are the files and filesystem access while training for LLMs, image generation models, reinforcement learning?" This question reflects a general misunderstanding of data and storage in HPC overall; how data is organized into files and how that data is accessed by a workload is an arbitrary decision made by the application developer. You can organize piles of text into one giant file or a million little files.

There were a few questions that came up that touched on deeper issues on which the HPC community should reflect:

  • "What are the first steps for scientific groups wanting to get ready for using AI in the future?" This is probably the purest question raised in the entire session, and I think this is something the scientific computing community as a whole needs to figure out. What does "using AI" really mean for scientific groups? Is it training models? Fine-tuning models? Inferencing using pre-trained models on HPC infrastructure? Is it integrating simulation applications with separately managed inferencing services? Who manages those inferencing services? Does inferencing even require HPC resources, or can suitable models run on a few CPU cores? I think the first step to answering this question is ensuring that the scientific computing community reaches a common baseline level of understanding of "using AI" means. And a lot of that probably means ignoring what some self-professed AI experts in the HPC community claim is the future.
  • "Care to predict what that ChatGPT moment will be for AI for Science? Had it already happened?" This question was addressed directly by panelist Séverine Habert who rightly pointed out that the ChatGPT moment occurred when a complex and esoteric topic was suddenly put in the hands of hundreds of millions of laypeople across the world. It was the moment that the common person walking on the street could suddenly interact with the most cutting-edge technology that had been previously understandable only to the headiest of researchers in industry and academia. That will likely never happen in AI for science because science, by definition, requires a higher baseline of education and understanding than the average layperson has.
  • "How to effectively train the existing workforce when we are already struggling to retain talent in research/academia?" This question strikes at the same theme that Kathy Yelick's opening keynote confronted: what is the role of the scientific computing community now that it turns out that you don't need decades of institutional experience to deploy and use HPC resources at leadership scale? As offensive as it may sound, perhaps the public-sector HPC community should accept that their role is not training future researchers and academics, but training future practitioners of AI in industry. This is how the wider tech industry generally works; neither startups nor tech giants make hires assuming those people will still be around in ten years. Why does the public-sector HPC industry think otherwise?

Finally, I was also struck by how fiercely the discourse clung to the idea that large language models are the answer to all AI problems in science. I get that this panel was focused on exascale, and LLM training is one of the rare cases where AI requires exascale computing capabilities. But there was no acknowledgment that trillion-parameter models are not actually a good idea for most scientific applications.

AI Systems for Science and Zettascale

This singular focus on creating massive LLMs for science was front-and-center in a talk given by Rick Stevens titled "The Decade Ahead: Building Frontier AI Systems for Science and the Path to Zettascale." The overall thesis that I heard was something like...

  1. Science needs its own trillion-parameter foundation models
  2. Training trillion-parameter foundation models requires a lot of GPUs
  3. We need $25 billion from the U.S. government

However, Stevens never answered a very basic question: what does a foundation model for science do that any other foundation model cannot do?

He showed slides like this, which really don't sound like foundation models for science as much as generic AI assistants:

Foundation models for science, per Rick Stevens

Is the scientific computing HPC community really the most qualified bunch to reinvent what existing foundation models like GPT-4 or Claude 3 have already done? Even if you argue that these proprietary models aren't as good at "science" as they could be, who would have a better chance of addressing this with a billion dollars of federal funding: the companies who developed GPT or Claude, or a collection of government scientists starting from scratch?

I think the answer to this question was in other parts of Stevens' talk. For example, he started with this slide:

Rick Stevens' requirements gathering slide

While robust requirements are good when there's no urgency, this slide is also a tacit admission that the government takes years to develop a perspective on AI. Do you think the creators of Llama-3 or Mistral Large gathered wide community input from over 1,300 researchers before deciding to build a supercomputer and train a model? Even if science needs its own foundation models, this slide is strong evidence that, by the time the scientific HPC community agrees on a path forward, that path will be years out of date relative to what the commercial AI industry is doing.

A great example of this already happening is the basic premise that creating a foundation model with a trillion parameters is the best way to apply AI to solve science problems. This certainly was the leading thought two years ago, when transformer scaling laws were published that suggested that the best way to get better-performing LLMs was to simply add more parameters to your transformer and train on more data. But there's a reason all the leading models have stopped advertising how many parameters they use.

Dealing with massive transformers is really expensive. They're not only really expensive to train, but they're really expensive to use for inferencing too. This has led to a bunch of innovation in model architectures and training approaches that produce dramatically higher-quality outputs from a fixed parameter count. Since 2022, dense trillion-parameter transformer architectures have come to look like a blunt instrument for developing foundation models, so it took me by surprise to hear Stevens put so much stock in the notion that a trillion-parameter model is essential for science.

To repeat myself, I am no expert in AI. I've never been called in front of Congress to talk about AI or been invited to give talks on the topic at ISC. There might be something basic that I am missing here. But when I look at the science drivers for AI:

Slide showing the cross-cutting themes in AI for science

I know that you do not need to train your own trillion-parameter model to do most of this stuff. Even the use cases that do require generative AI, like code generation and math theory, don't actually require trillions of parameters. Small language models, such as the one described in Textbooks Are All You Need (published in 2023, after the reports Stevens cited in his talk), can produce amazing results when you train them using high-quality data instead of garbage from Reddit. And when you create or fine-tune a small language model for a specific science domain, not only do you save yourself from having to buy a billion-dollar supercomputer for training, but you get a model that is much more accessible to scientists around the world because they won't need a million dollars' worth of GPUs to inference with it.

So, if there's one question that was never answered across any of the AI-themed sessions at ISC this year, it is this: Why does science need to train its own large language models? My intuition is that either fine-tuning existing large language models or training small language models for domain-specific applications would be a better investment in actually advancing science. However, if we cynically assume the real goal of LLMs-for-science is to justify buying massive GPU systems, suddenly a lot of the talks given at ISC on this topic make a lot more sense.

Real applications of generative AI for science

As frustrated as I got sitting through sessions on AI where it sometimes felt like the blind leading the blind, there was one really good session on actual applications of generative AI for science.

Mohamed Wahib of RIKEN gave an insightful presentation on the unique challenges of using generative AI in science. His summary slide touched on a lot of the key challenges:

Mohamed Wahib's slide on challenges of generative AI in science

And his actual talk focused largely on the model and data aspects of generative AI. What struck me is that the challenges he described reflected the experience of someone who has actually tried to do what many other AI experts at the conference were claiming would be the future. For example,

  • He recognized the importance of training scientific models with high-quality datasets, not just garbage scraped off of social media. This means not only scraping or generating high quality data for training, but curating and attributing that data and applying reinforcement learning with human feedback as the model is being trained. This is uniquely challenging when creating models for scientific applications, as managing the quality of scientific data requires deep domain expertise. This contrasts with a generic chat bot whose inputs and outputs can often be assessed by anyone with a basic education.
  • He also talked about the tendency of scientific data to be highly multimodal and multidimensional. Whereas multimodal chatbots may combine text and vision, scientific data often contains observations of the same phenomenon from many different sensors (for example, pressure, temperature, density, strain fields, ...), and the output of a generative model for science may require multiple modalities as well.  These capabilities are not well developed in LLMs designed for human language.
  • Dr. Wahib also pointed out that scientific datasets tend to be huge compared to text and images, which may require developing ways for models to have context windows that can fit the tokens of multi-petabyte datasets in order to identify long-range correlations. Relatedly, he also pointed out that tokenization of scientific data poses a new set of challenges unique to this community, since industry has been focused on tokenizing low-dimensional data such as text, audio, and images (a toy sketch of what tokenizing a gridded dataset might look like follows this list).
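
As referenced in the last point, here's a toy sketch of what ViT-style patch tokenization of a gridded, multi-channel scientific field might look like. The hard, open problems Dr. Wahib alluded to start where this tidy picture breaks down: more dimensions, unstructured meshes, and petabyte scale.

```python
import numpy as np

# Toy sketch of "tokenizing" a gridded scientific field: chop a multi-channel
# 2D snapshot into patches and flatten each patch into a token vector,
# the same way a vision transformer tokenizes an image.
H, W, C = 256, 256, 8          # e.g., 8 sensor channels on a 256x256 grid
P = 16                         # patch size
field = np.random.rand(H, W, C).astype(np.float32)

patches = field.reshape(H // P, P, W // P, P, C)   # split into P x P tiles
patches = patches.transpose(0, 2, 1, 3, 4)         # (rows, cols, P, P, C)
tokens = patches.reshape(-1, P * P * C)            # one flat vector per patch

print(tokens.shape)            # (256, 2048): 256 tokens of dimension 2048
```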

The good news is that industry's quest towards both commercializing generative AI and achieving AGI will touch on some of these challenges soon. For example, training domain-specific models using high-quality datasets is an essential component of the small language models I described in the previous section, and these small language models are what will enable privacy-preserving and cost-effective generative AI on laptops and phones. Effectively infinite context windows are also a major hurdle on the path to AGI, as industry is hard at work developing AI agents that can remember every conversation you've ever had with them. Finding more scalable approaches to attention that do not sacrifice accuracy is a part of this.

François Lanusse, currently at the Flatiron Institute, also gave a nice presentation that clearly explained how generative AI can be used to solve inverse problems—that is, figuring out the causes or conditions that resulted in a collection of measurements. One specific example he used applied generative AI to figure out what an image distorted by gravitational lensing might look like in the absence of those distortions. As I understood it, he used simulation to train a diffusion model on the relationship between images affected by gravitational lensing and the masses that cause the lensing. He then used that model instead of an oversimplified Gaussian model as part of a larger method to solve the inverse problem of un-distorting the image.

The details of exactly what he did were a little over my head, but the key insight for me is that combining generative AI and science in practice is not as straightforward as asking ChatGPT what the undistorted version of a telescope image is. Rather, almost all of the standard, science-informed approach to solving the inverse problem remained the same; the role of generative AI was simply to replace an oversimplified part of the iterative process (the Annealed Hamiltonian Monte Carlo method) to help it converge on better answers. It really is a combination of simulation and AI, rather than an outright substitution or surrogate model.
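
Here's my schematic sketch of that pattern, not Lanusse's actual method: the observation is modeled as y = A x + noise, and a Langevin-style sampler combines the physics-based likelihood gradient with a prior "score" that, in the real work, would come from the trained diffusion model. A trivial Gaussian score stands in here so the sketch runs on its own.

```python
import numpy as np

# Schematic sketch of plugging a learned generative prior into an
# inverse-problem solver (an illustration, not the method from the talk).
rng = np.random.default_rng(0)
n, m, sigma = 64, 128, 0.1
A = rng.standard_normal((m, n)) / np.sqrt(n)       # forward operator (e.g., lensing + instrument)
x_true = rng.standard_normal(n)                    # the undistorted "truth"
y = A @ x_true + sigma * rng.standard_normal(m)    # noisy observation

def learned_score(x):
    # Placeholder for the trained generative model's score function;
    # here it is just the score of a standard normal prior.
    return -x

x, step = np.zeros(n), 1e-3
for _ in range(5000):
    grad_likelihood = A.T @ (y - A @ x) / sigma**2        # pulls x toward matching the data
    noise = np.sqrt(2 * step) * rng.standard_normal(n)    # Langevin noise: sample, don't just optimize
    x = x + step * (grad_likelihood + learned_score(x)) + noise

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```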

Dr. Lanusse also showed this slide which demonstrated how this approach can be generalized to other scientific domains:

François Lanusse's view of how generative AI can improve scientific discovery

The general approach of pretraining, fine-tuning ("adapt"), and combining foundation models with other physics-based models seems reasonable, although I admit I have a difficult time wrapping my head around exactly how broadly scoped he envisions any given pretrained foundation model to be. I can see such a model trained on extensive sky survey data being useful for a number of astrophysical and cosmological tasks, but it's less clear to me how such a model might be useful in unrelated domains like, say, genomics.

You might also ask why I think this vision of foundation models for science is reasonable while Rick Stevens' vision didn't ring true; the difference is in scale! The foundation models cited on Lanusse's slide are vision transformers, which have many orders of magnitude fewer parameters than the trillion-parameter models that others talk about. Whereas a trillion-parameter model might need to be distributed over dozens of H100 GPUs just to produce one inference result, the largest of these vision transformers can probably be squeezed onto a single high-end desktop GPU. Again, you don't need billion-dollar supercomputers to train these models for science.

Frank Noé from Microsoft Research then talked about how generative AI can be applied to solve problems in simulating biological systems. Like the speaker before him, Dr. Noé followed a pattern where a larger, physics-based framework had one statistical technique replaced by a method based on generative AI, and a physics-based model is then used to quantify the likelihood that the result is reasonable. He contrasted this with conventional approaches (to, say, protein folding) where you just simulate for really long times in the hopes that your simulation randomly wanders into a situation where you capture a rare event.

His talk wasn't about generative AI as much as the previous speakers' talks were, but he offered a litany of ways in which AI models can be useful to molecular modeling:

  • Markov state models provide a statistical framework that lets you replace one long simulation (that hopefully captures every possible scenario) with a bunch of short, chopped-up simulations that hopefully capture every possible scenario in parallel; a toy numerical sketch of this idea follows this list. He cited an example that took 20,000 GPU-days on V100 GPUs that would've otherwise taken a million GPU-years if done in one long simulation.
  • Coarse-grained models use machine learning to develop surrogate models to simulate the physics of relatively uninteresting parts of molecular systems. The example he used was simulating the water molecules surrounding a biomolecule; water can be very difficult to accurately model, and the example he cited led to a surrogate model that was 100x faster than directly simulating water molecules.
  • Boltzmann generators can generate 3D molecular structures based on a known probability distribution defined by the energy states of the system. This is another fast way to find rare but stable molecular configurations without having to throw darts at a dartboard.
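
As promised above, here's a toy sketch of the Markov state model idea, where many short, independent trajectories over a handful of discrete states are enough to recover long-timescale behavior. In a real MSM the trajectories would come from short MD runs and the states from clustering molecular configurations; the "dynamics" here are synthetic.

```python
import numpy as np

# Toy Markov state model: estimate a transition matrix from many short
# trajectories, then recover the stationary (long-time) state populations.
rng = np.random.default_rng(1)
n_states = 4
true_T = rng.dirichlet(np.ones(n_states) * 2, size=n_states)   # hidden "true" kinetics

counts = np.zeros((n_states, n_states))
for _ in range(500):                         # 500 short trajectories instead of one long one
    s = rng.integers(n_states)
    for _ in range(50):                      # 50 steps each
        s_next = rng.choice(n_states, p=true_T[s])
        counts[s, s_next] += 1
        s = s_next

T_hat = counts / counts.sum(axis=1, keepdims=True)   # estimated transition matrix

# The stationary distribution is the left eigenvector of T_hat with eigenvalue 1.
evals, evecs = np.linalg.eig(T_hat.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()
print("estimated long-time state populations:", np.round(pi, 3))
```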

What struck me is that, in all these cases, the AI model is never generating results that are blindly trusted. Instead, they generate molecular configurations which are then fed into physics-based models which can quantify how likely they are to be valid.

Both Lanusse's and Noé's examples of combining AI and simulation painted a picture to me where generative AI can be really useful in solving problems where a researcher would otherwise have to make educated guesses about what physical phenomenon is really happening based on incomplete information. So long as there is a way to apply a physics-based model to check the accuracy of each guess, generative AI can be trained to predict the relationships between incomplete information and what's really going on and get to probable answers much faster than relying on physics alone.

More broadly, I couldn't help but think about the Sora video showing pirate ships battling in a cup of coffee as I left this session. Like that video, these talks demonstrated that it's possible to train generative AI models to reproduce physical phenomena (like the fluid dynamics of coffee) without explicitly embedding any laws of physics (like the Navier-Stokes equations) into the model itself and still get really compelling results. The part of this that was lacking from the Sora video—but was present in these talks—was closing the loop between generated results and the laws of physics by feeding those generated results back into the laws of physics to figure out if they are probable.

High Performance Software Foundation

ISC'24 wasn't all about AI though! I wound up attending the launch of the High Performance Software Foundation (HPSF), a new Linux Foundation effort spearheaded by Todd Gamblin and Christian Trott (from Livermore and Sandia, respectively) that aims to promote the sustainability of the software packages relied upon within the high-performance computing community.

I haven't paid close attention to HPC software in a long time since most of my work was in platform architecture and storage systems, so a lot of the background context remains a little murky to me. That said, it seems like HPSF was formed to be like the Cloud Native Computing Foundation for the HPC community in that:

  • it will serve as a neutral home for software projects that aren't tied to any single university or government institution
  • it provides mechanisms to ensure that critical HPC software can continue to be maintained if its original author gets hit by a bus
  • it will help with the marketing and promotion of HPC software

Its governance seems pretty reasonable, with different levels of membership being accompanied by different levels of rights and obligations:

HPSF governance structure 

There is a Governing Board comprised of paying members (predominantly those who pay the most), while the Technical Advisory Council carries out the more technical tasks of forming working groups and onboarding projects.

There are three levels of membership, and the highest (premier) has a $175,000 per year buy-in and comes with a seat on the Governing Board. Right now, the founding seats are held by AWS, HPE, LLNL, and Sandia.

Below that is a general membership tier whose cost is on a sliding scale based on the organization size, and AMD, Intel, NVIDIA, Kitware, ORNL, LANL, and Argonne have all committed at this level.  The associate tier is below that, and it is free to nonprofits but comes with no voting rights.

It seemed like the exact functions that HPSF will have beyond this governing structure are not fully baked yet, though there were six "prospective" working groups that provide a general scope of what the HPSF will be doing:

HPSF's prospective working groups

My read of the description of these working groups is that

  • CI/testing will supply resources (GPUs) on which HPSF projects' code can be automatically tested.
  • Software stacks will maintain E4S.
  • User engagement sounds like it will figure out what users of HPSF projects' software are looking for. It sounds like this will provide some product management-like support for projects.
  • Facility engagement is probably like user engagement, but for the sites deploying code on behalf of their users. Again, this sounds like product management functions.
  • Security sounded like stewarding SBOM-like stuff for member projects' software.
  • Benchmarking would make a framework for benchmarking HPC applications.

That all said, it still wasn't clear what exactly HPSF would do; what would all those membership dues go towards supporting? Based on some Q&A during this BOF and follow-up afterwards, I pieced together the following:

  • HPSF will not be funding developers, much in the same way that OpenSFS doesn't fund Lustre development. That said, Todd Gamblin later said that not funding software development was a financial constraint more than a policy one, with the implication that if more members join, there may be opportunity for HPSF to fund projects.
  • HPSF likely will be hosting events and conferences (perhaps like how the CNCF hosts KubeCon), providing scholarships, developing and providing training related to member projects, and "increasing collaboration" (whatever that may mean!).

HPSF also has some influence and ownership over its member projects:

  • HPSF will co-own its projects' GitHub repos to ensure continuity in case the other repo owner abandons it.
  • HPSF will own the domain for the project for the same reasons as above.
  • Member projects still manage their own software development, roadmaps, releases, and the like. The HPSF won't dictate the technical direction of projects.
  • HPSF will own the trademark and logos of its member projects so it can prevent corporations from profiting off of repackaging products without respecting trademark.

This establishes an interesting new direction for the sorts of software projects that are likely to become member projects. Historically, such projects developed by the member organizations (i.e., DOE labs) have been wholly controlled by the labs that funded the work, and those software projects lived and died at the whims of the government funding. The HPSF offers a new vehicle for software projects to live on beyond the end of the grants that created them, but at the same time, it requires that the DOE surrender control of the work that it sponsored.

I left the session still wondering a few pretty major things, likely borne out of my own ignorance of how similar organizations (like CNCF or the Apache Foundation) work:

  1. How does a software project actually become a member project? The HPSF folks said that the Technical Advisory Committee onboards new projects, but what is the bar if I have an open-source project used by the community that I no longer want to maintain myself? I assume it's not a pay-to-play arrangement since that defeats the purpose of sustaining software after its seed funding runs out.
  2. What do stakeholders actually get out of joining HPSF? I see obvious value for organizations (like the DOE labs) who develop open-source software but may not want to be exclusively responsible for sustaining it forever. But would an HPC facility get any obvious benefit from joining and paying dues if it is simply a consumer of member projects' software? What does a cloud vendor like AWS get by being a premier member? Is HPSF just a way to get someone else to cover the overheads of maintaining open-source software that comes out of, say, R&D organizations rather than product organizations?

Hopefully the answers to these questions become clearer as the foundation gets off the ground and we get to see what member organizations contribute under the HPSF banner.

Ultimately though, I see this as a really positive direction for the HPC software community, one that might help resolve the uncertain ownership of some key pieces of HPC software. For example, I wound up as a maintainer of the IOR and mdtest benchmarks because I was the last one to touch them when their previous maintainer lost interest/funding. I don't even work in I/O performance anymore, but the community still uses these benchmarks in virtually every procurement of parallel file systems, either directly or through IO500. It would be wonderful if such an important tool didn't rest solely on my shoulders and instead had a more concrete governance structure.

Quantum computing

Besides AI and cloud, quantum computing was cited in Kathy Yelick's opening keynote as the third disruptor to HPC for scientific computing. At the time, I thought citing quantum was just an obligation of any opening keynote speaker, but quantum computing was particularly high-profile at ISC this year. I was surprised to see over a dozen quantum computing companies on the vendor exhibition floor, many of whom were Europe-based startups.

In addition, this year's Hans Meuer award (for best research paper) was given to a paper on quantum computing by Camps et al. This is particularly notable since this is the first time that the Meuer award has ever been given to a paper on a topic that isn't some hardcore traditional HPC like MPI or OpenMP advancements; by comparison, this award has never been given to any papers on AI topics. Granted, the winning paper was specifically about how to use conventional HPC to solve quantum problems, but this recognition of research in quantum computing makes a powerful statement: quantum computing research is high-performance computing research.

Reinvent HPC to include urgent computing?

I was invited to give a lightning talk at the Workshop on Interactive and Urgent High-Performance Computing on Thursday, and urgent/interactive HPC is not something I'd really paid attention to in the past. So as not to sound like an ignorant fool going into that workshop, I opted to sit in on a focus session titled "Urgent Computing" on Tuesday. I had two goals:

  1. Make sure I understood the HPC problems that fall under urgent and interactive computing so I could hold an intelligent conversation on this topic at the Thursday workshop, and
  2. See if there are any opportunities for cloud HPC to provide unique value to the challenges faced by folks working in urgent HPC

I'll describe what I came away with through these lenses.

The Urgent Computing focus session

What I learned from the focus session is that urgent computing is not a very well-defined set of application areas and challenges. Rather, it's another manifestation of reinventing HPC to include any kind of computation for scientific purposes.

Much to my surprise, this "Urgent Computing" focus session was actually a session on IoT and edge computing for science. Several speakers described getting data from edge sensors on drones or telephone poles into some centralized location for lightweight analysis, and the "urgent" part of the problem came from hypothetical use cases of analyzing that sensor data to respond to natural disasters. There wasn't much mention of anything requiring HPC-like computing resources; at best, a few talks made vague references to using AI models for data analysis, but it felt like grasping at straws:

The above conclusion slide was presented by one of the speakers, and to be honest, I don't understand what any of it means. Granted, I know very little about urgent computing, IoT, or edge computing so there may be some domain jargon here that's throwing me off. But based on this, as someone working in the area of HPC and AI in the cloud, I don't think I have a role to play here. I'm sure cloud computing can help, but the challenges would be in general-purpose cloud rather than HPC.

The Interactive and Urgent HPC workshop

Fortunately for me, the Thursday workshop on Interactive and Urgent HPC was much less about edge/IoT and more about developing software infrastructure and workflows that allow scientific data analysis of large datasets to happen before the results become obsolete. It was a fascinating workshop for learning about specific science drivers that require fast access to HPC resources, and how different HPC providers are enabling that through non-traditional services and policies. Below are a few highlights.

Sam Welborn (NERSC) presented his team's efforts to convert a streaming data workflow from its current file-based approach into one that streams directly into compute node memory. The specific use case was the initial processing of image data coming off a scanning transmission electron microscope at 480 Gbps, totaling 750 GB per shot. As he described it, the current technique streams those data to files at the microscope, copies those files to the parallel file system of a remote supercomputer, and then reads, processes, and rewrites the data within the HPC environment to prepare it for downstream analysis. For what it's worth, this is how I've always seen "streaming" HPC workflows actually work: they're file transfers under the hood, and the performance of the file systems at both the source and destination sits in the critical path.

The problem with this approach is that parallel file systems on HPC systems tend to be super flaky, and there's no real reason to bounce data through a storage system if you're just going to pick it up and process it. So, Dr. Welborn showed a true streaming workflow that skipped this file step and used ZeroMQ push sockets at the microscope and pull sockets on the HPC compute nodes to do a direct memory-to-memory transfer:

Streaming memory workflow slide from Sam Welborn

Seeing software like ZeroMQ used to enable communication in an HPC environment, instead of forcing this workflow into the MPI paradigm, is an encouraging sign in my eyes. ZeroMQ, despite not using purpose-built HPC technology like RDMA, is the right tool for this sort of job since it offers much better resilience than messaging libraries designed for tightly coupled HPC jobs. Workflows like this, which combine beefy GPU nodes with software developed in the commercial tech space, suggest that the world of HPC is willing to abandon not-invented-here ideology.
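
For readers who haven't used it, here's a rough sketch of what that push/pull pattern looks like with pyzmq. The endpoint, frame size, and "processing" step are placeholders of my own, not details from the actual NERSC implementation:

```python
# Rough sketch of a ZeroMQ push/pull streaming pipeline (assumes pyzmq).
# All names and sizes here are hypothetical, not from Sam Welborn's talk.
import zmq

ENDPOINT = "tcp://*:5555"  # hypothetical bind address at the instrument

def instrument_side(num_frames: int = 10) -> None:
    """Runs next to the detector: pushes frames into the socket as they arrive."""
    sock = zmq.Context().socket(zmq.PUSH)
    sock.bind(ENDPOINT)
    for frame_id in range(num_frames):
        frame = bytes(4 * 1024 * 1024)  # stand-in for a raw detector frame
        sock.send_multipart([frame_id.to_bytes(4, "big"), frame])

def hpc_side(endpoint: str = "tcp://instrument-host:5555") -> None:
    """Runs on a compute node: pulls frames straight into memory, no files."""
    sock = zmq.Context().socket(zmq.PULL)
    sock.connect(endpoint)
    while True:
        frame_id, frame = sock.recv_multipart()
        # Downstream reduction/analysis would happen here, entirely in memory.
        print(f"frame {int.from_bytes(frame_id, 'big')}: {len(frame)} bytes")
```

A nice property of PUSH/PULL is that it fans out naturally: several compute nodes can connect PULL sockets to the same PUSH endpoint and frames get distributed among them, which is part of why a pattern like this maps so well onto parallel data reduction.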

It wasn't clear to me that there's a great opportunity for cloud HPC to be uniquely useful in use cases like this; while you certainly can provision beefy CPU and GPU nodes with InfiniBand in Azure, cloud services can't obviously simplify this ZeroMQ-based workflow beyond supplying general-purpose VMs on which the orchestration services can run. Had this team stuck with a file-based streaming mechanism, the performance SLAs on cloud storage (like object storage or ephemeral Lustre) would have provided a more reliable way to ensure the data transfer happened in near-real-time. But the better solution to unpredictable file system performance is to do exactly what was done here: skip the file system entirely.

Just to keep the speaker honest, I asked why this computation couldn't simply be done at the same place as the microscope generating the data. After all, if the microscope always generates 750 GB per shot, you should be able to buy a couple of GPU servers sized to process exactly that workload in the time between images. There were actually two answers: one from Sam and one from an audience member:

  1. Sam said that this workflow can be processed locally, but the goal of this work was to prepare for a future microscope (or another instrument) that couldn't be. He also insightfully pointed out that there's tremendous value in getting the data into the HPC environment because of all the services that can be used to work on that data later. I envisioned things like further processing the data in a Jupyter notebook, serving it up through a web UI, and similar tasks that can't be done if the data is stuck inside a microscope room.
  2. An audience member also pointed out that sticking GPU nodes in the same room as electron microscopes can result in enough noise and vibration to disrupt the actual scope. This was a great point! In the days before I started working in HPC, I was training to become an electron microscopist, and I worked in a lab where we had water-cooled walls to avoid the problems that would be caused by air conditioning breezes. There's no way a loud server would've worked in there.

Toshio Endo (Tokyo Tech) gave an interesting talk on how they enable urgent/interactive compute jobs on their batch-scheduled TSUBAME4.0 supercomputer by doing, frankly, unnatural things. Rather than holding aside some nodes for interactive use, as is common practice, his team found that many user jobs do not fully use the resources on the compute nodes they reserve:

Toshio Endo's slide on TSUBAME3.0 GPU utilization

I had to do a double-take when I saw this: even though 65%-80% of the nodes on the supercomputer were allocated to user jobs, less than 7% of the GPUs were actually being utilized.

Dr. Endo's hypothesis was that if nodes were suitably subdivided and jobs were allowed to oversubscribe the CPUs, GPUs, and memory of a compute node without hurting performance too much, they could deliver real-time access to HPC resources without carving out a separate pool of nodes just for interactive use. He defined success as each job running at no worse than 1/k of its dedicated-node speed when k jobs share a node; for example, if four jobs are running on the same node, each taking up to four times as long to complete is acceptable, but any longer is not. He then showed that the best way to accomplish this is Slurm's gang scheduling, where jobs take turns having exclusive access to all the CPUs and GPUs on a node; the alternative (just letting the OS context-switch) was no good.
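
To make that criterion concrete, here's a trivial sketch of the acceptance test as I understood it; the runtimes are made-up numbers, not data from the talk:

```python
# Slowdown criterion: with k jobs sharing a node, each job may take at most
# k times its dedicated-node runtime. All numbers below are illustrative.

def sharing_is_acceptable(dedicated_runtime: float, shared_runtime: float, k: int) -> bool:
    """Return True if runtime under k-way sharing stays within k times the
    runtime the job would see on a dedicated node."""
    return shared_runtime <= k * dedicated_runtime

# Four jobs gang-scheduled on one node: 100 s alone, 380 s shared -> acceptable.
print(sharing_is_acceptable(dedicated_runtime=100.0, shared_runtime=380.0, k=4))  # True
# 100 s alone, 450 s shared -> exceeds the 4x budget -> not acceptable.
print(sharing_is_acceptable(dedicated_runtime=100.0, shared_runtime=450.0, k=4))  # False
```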

While a fascinating study in how to provide zero wait time to jobs in exchange for reduced performance, this whole mechanism of using gang scheduling to exploit low resource utilization seems like jamming a square peg into a round hole. If a workload doesn't (or can't) use all the GPUs on a node, then that's not the right node for the job; I feel like a more appealing solution would simply be to offer a heterogeneous mix of nodes based on the demands of the workload mix. This is hard to do if you're buying monolithic supercomputers since you're stuck with whatever node mix you've got for five years, but there is another way to buy supercomputers!

I won't pretend that dynamically provisioning different flavors of CPU- and GPU-based nodes interconnected with InfiniBand in the cloud doesn't come with a cost; the convenience of being able to slosh a cluster's makeup between CPU-heavy and GPU-heavy nodes will be more expensive than committing to the same mix of node flavors for multiple years. But if you're paying for GPUs that are only being used 7% of the time, surely it's cheaper to pay a higher rate for GPUs when you need them if it also means not paying for them the 93% of the time they'd otherwise sit idle.
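
The back-of-the-envelope math bears this out. With some purely illustrative prices (these are not Azure, AWS, or vendor numbers), an on-demand GPU can cost many times more per hour than an owned one and still come out ahead at 7% utilization:

```python
# Break-even sketch with made-up prices; not real cloud or vendor pricing.
owned_rate = 2.00    # $/GPU-hour, amortized capital + operations, paid whether busy or idle
cloud_rate = 10.00   # $/GPU-hour, on-demand, paid only while busy
utilization = 0.07   # GPUs busy 7% of the time, as in the TSUBAME3.0 slide

owned_cost_per_hour = owned_rate                 # you pay this every hour regardless
cloud_cost_per_hour = utilization * cloud_rate   # you only pay while the GPU is in use

print(f"owned: ${owned_cost_per_hour:.2f}/hr, cloud: ${cloud_cost_per_hour:.2f}/hr")
# Break-even on-demand rate = owned_rate / utilization ~= $28.57/GPU-hour, so with
# these numbers the cloud GPU could cost ~14x more per hour before owning wins.
```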

Bjoern Enders (NERSC) gave the first lightning talk, presenting NERSC's explorations into enabling real-time and urgent computation. They're currently pursuing three parallel directions to provide this capability:

  1. Reservations, a process by which a user can request a specific number of nodes for a specific period of time, and Slurm ensures that many nodes are available for the exclusive use of that user by the time the reservation starts. He said that implementing this at NERSC is costly and rigid because it requires a human administrator to perform manual steps to register the reservation with Slurm.
  2. Realtime queues, where a few nodes are held from the regular batch queue and only special real-time users can submit jobs to them. Dr. Enders said that NERSC is extremely selective about who can access this queue for obvious reasons: if too many people use it, it will back up just like the regular batch queues do.
  3. JupyterHub, which uses job preemption and backfill under the hood. If a user requests a Jupyter job, Slurm will preempt a job that was submitted to a preemptible queue to satisfy the Jupyter request. However, if there are no preemptible jobs running, the Jupyter job will fail to launch after waiting for ten minutes.

To provide compute resources backing these scheduling capabilities, they also deployed a new set of compute nodes that can be dynamically attached to NERSC's different supercomputers so that urgent workloads can be supported even during downtimes. Called "Perlmutter on Demand" (POD), it sounded like a separate set of Cray EX racks that can be assigned to the Perlmutter supercomputer or, if Perlmutter is down for maintenance, to their smaller Alvarez or Muller systems, which share the same Cray EX architecture. What wasn't clear to me is how the Slingshot fabrics of these nodes interact; perhaps POD has its own fabric, and only the control plane that owns those racks is what changes.

He showed a slide of explorations they're doing with this POD infrastructure, but as with Dr. Endo's talk, this seemed a bit like a square peg in a round hole:

Bjoern Enders' slide on experiments with POD

All of this sounds aligned with the strengths of what HPC in a cloud environment can deliver, and some of the big challenges (like figuring out the ideal number of nodes to hold aside for interactive jobs) are problems specific to Slurm and its scheduling model. There's a lot more flexibility to rapidly provision HPC resources in cloud environments because, unlike Slurm scheduling jobs on a single cluster, cloud resource managers can schedule across any number of clusters independently. For example, if an urgent workload needing only four GPU nodes suddenly appears, it doesn't necessarily have to land on the same InfiniBand fabric that a large hero job is running on. Since the urgent job and the hero job don't need to talk to each other, a cloud resource manager can go find a GPU cluster with a little more flex in it and provision those resources quickly.

Automating the reservation process is also a bit of a game of catch-up, though my guess is that this is mostly a matter of someone having a free weekend to write the REST service that manages incoming reservation requests. Although there's not a direct analog for reservations like this in Azure, AWS has a feature called Capacity Blocks that does exactly this: if you know you'll want a certain number of GPU nodes at some point in the future, Capacity Blocks let you reserve them ahead of time through an API.
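
To give a sense of how little is needed to get started, here's a minimal sketch of what such a reservation service could look like, assuming Flask is available and the service runs somewhere that's allowed to invoke scontrol; the endpoint and request fields are hypothetical, not anything NERSC described:

```python
# Minimal (and deliberately naive) reservation service sketch: it accepts a JSON
# request and turns it into an `scontrol create reservation` call. Hypothetical
# endpoint and fields; a real service would need authentication and validation.
import subprocess
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/reservations", methods=["POST"])
def create_reservation():
    req = request.get_json()
    cmd = [
        "scontrol", "create", "reservation",
        f"ReservationName={req['name']}",
        f"StartTime={req['start_time']}",        # e.g. 2024-07-01T09:00:00
        f"Duration={req['duration_minutes']}",   # minutes
        f"Users={req['user']}",
        f"NodeCnt={req['node_count']}",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return jsonify({"error": result.stderr.strip()}), 400
    return jsonify({"status": "created", "detail": result.stdout.strip()}), 201

if __name__ == "__main__":
    app.run(port=8080)
```

A real deployment would obviously need authentication, quota policies, and an allow-list of who can reserve what, which is exactly the policy layer that cloud reservation APIs like Capacity Blocks already provide.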

Finally, I represented Microsoft and gave a lightning talk that riffed on a lot of what I've been writing about in this blog post: HPC seems to be reinventing a lot of things that the cloud has already figured out how to do. The illustrious Nick Brown was kind enough to snap a photo of one of my slides and post it on Twitter:

A slide I presented on the parallels between inferencing as a service and urgent HPC workflows

My thesis was that the way urgent HPC workflows are triggered, scheduled, run, and reported on follows the same pattern under which inferencing services (like Copilot and ChatGPT) are implemented, right down to executing multi-node jobs on InfiniBand clusters. The difference is that these cloud workflows are built on a foundation of really nice cloud services, providing security, scalability, monitoring, and hands-free management, that were originally developed for commercial (not HPC!) customers. My argument was that, even if you don't want to pay cloud providers to run urgent HPC workflows as a managed service, you can use these services (and the software infrastructure on which they're built) as a blueprint for building the same capabilities in your own HPC environment.

Concluding thoughts

The ISC'24 conference was fantastic, and I am glad it has not lost the unique elements that made me want to attend in the years prior to the pandemic. It's still that smaller, intimate, and focused HPC conference that brings the community together. Although a lot of my synopsis above may sound critical of the content presented over the four days I attended, the fact that I've had so much to write down in this blog post is a testament to the value I really get out of attending: it makes me sit down and think critically about the way the HPC community is evolving, what the leading minds in the field are thinking, and where I might be able to contribute the most in the coming year.

I never much paid attention to the annual taglines of conferences like ISC, but this year's "Reinvent HPC" really resonated. The HPC community is at a crossroads. Exascale computing for science is now in the rear-view mirror, and large-scale AI is all the rage across the computing industry at large. But for the first time ever, this new direction in at-scale computing is happening without the inclusion of the people and organizations who've historically driven innovation in HPC. Whereas institutions like Oak Ridge, RIKEN, Cray, and Fujitsu defined the future of computing for decades, hundred-person startups like OpenAI and Anthropic are now paving the way in partnership with companies like Microsoft and Amazon.

HPC needs to be reinvented, if for no other reason than to decide whether the HPC community wants to be inclusive of new frontiers in computing that they do not lead. Does the HPC community want AI to be considered a part of HPC?

Judging from many speakers and panelists, the answer may be "no." To many, it sounded like AI is just another industry that's sucking all the air (and GPUs) out of the room; it's a distraction that is pulling funding and public interest away from solving real problems. It's not something worth understanding, it's not something that uses the familiar tools and libraries, and it's not the product of decades of steady, government-funded improvements. AI is "them" and HPC is "us."

Personally, I'd like the answer to be "yes" though. Now that I'm on the other side of the table, supporting AI for a cloud provider, I can say that the technical challenges I face at Microsoft are the same technical challenges I faced in the DOE. The desire to deeply understand systems, optimize applications, and put world-class computing infrastructure in the hands of people who do amazing things is the same. And as the days go by, many of the faces I see are the same; instead of wearing DOE or Cray badges, my lifelong colleagues are now wearing NVIDIA or Microsoft badges.

All this applies equally to whether cloud is HPC or not. The HPC community needs to reinvent itself to be inclusive of everyone working towards solving the same problems of computing at scale. Stop talking about people who work on commercial AI in cloud-based supercomputers as if they aren't in the room. They are in the room. Often near the front row, snapping photos, and angrily posting commentary on Twitter about how you're getting it all wrong.

Picture of me near the front row, angrily posting on Twitter

HPC has historically been used to solve scientific problems, whether to expand our understanding of the universe, to find the next best place to drill an oil well, or to model the safety of aging nuclear weapons. The fact that HPC is now being used to solve squishier problems related to natural language or image generation does not change the essence of HPC. And whether that HPC is delivered through physical nodes and networks or virtualized ones is irrelevant, as long as those resources are still delivering high performance. AI is just as much HPC as scientific computing is, and cloud is just as much HPC as OLCF, R-CCS, or CSCS is.

So perhaps HPC doesn't need to be reinvented as much as the mindset of its community does.

That all said, I am genuinely impressed by how quickly ISC has been reinventing itself in recent years. It wasn't too long ago that all its keynote speakers were greybeards from a predictable pool of public HPC centers, all saying the same things year after year. It's wonderful to see a greater diversity of perspectives on the main stage and the torch being passed to the next generation of leading figures in the field. And it was not lost on me that, for the first time in the history of this conference, Thomas Sterling did not deliver the closing keynote. As much as I enjoyed poking fun at his meandering and often off-the-mark conjectures every year, it was delightful to be exposed to something new this time around.

I'm hopeful that ISC will continue to get better year over year, and that ISC'25 will feel more inclusive of me despite the fact that I am now one of those hyperscale cloud AI people. So long as it still feels like my community, I will keep showing up in Germany every summer.