SC'25 recap

The annual SC conference was held last week, drawing over 16,000 registrants and 560 exhibitors to St. Louis, Missouri to talk about high-performance computing, artificial intelligence, infrastructure, and science. It was my tenth time attending in person (twelfth overall), and as is always the case, it was a great week to reconnect with colleagues, hear what people are worrying about, and get a finger on the pulse of a now-rapidly-changing HPC industry.

Outside the SC'25 convention center on the only clear day of the week.

Although every SC I've attended has felt a little different from the one before it, this one felt quite different. Part of that results from my own personal circumstances: this is the first year I attended as an employee of VAST Data, so the people with whom I met and the technical problems to which I paid attention were certainly biased towards those most relevant to my work. But the backdrop of the whole conference has also shifted. It's been three SC conferences since ChatGPT came out, and it's now undeniable that AI isn't simply on the horizon; it's shaping the field of HPC and scientific computing. What used to be an argument of "us vs. them" is now more like "them (and us?)"

As has become tradition, I'm sharing some of my thoughts from the week with the world in the hopes that someone finds this interesting and insightful. I've roughly organized them into two areas: big themes and the exhibition hall.

Big themes

HPC has always been at the center of a tension between keeping things the same (a supercomputer is most stable the day it is turned off) and pushing the technological envelope (which is the fastest way to unlock new discovery). The desire to push the envelope has always been a "pull" towards the future; researchers first lead with kooky ideas (like DAOS and Kokkos), and as those ideas turn from research into production, they make new technologies (like all-flash and AMD GPUs) accessible to scientists.

What hasn't historically happened, though, is a strong "push" towards the future. Scientific HPC centers push themselves to justify building the next big supercomputer, but it's been a given that there will always be another big machine, so this push has been internal and gentle. Combined with the not-so-urgent pull of HPC researchers, every center has gotten a new machine every five years or so.

This is the year where it became clear to me that AI is now exerting a strong push on the HPC industry--a shove, even--forcing HPC centers around the world to align themselves on an AI mission if they want to survive. All the big-money HPC systems being announced this year are clearly being positioned as AI-first and AI-motivated, and these announcements go well beyond simply peppering "AI" throughout the press release and otherwise acting as if it were business as usual. This is the first SC where I saw scientists, architects, and decision-makers being forced to confront real tradeoffs that favor either HPC or AI, and they are beginning to choose AI.

This push-and-pull on HPC towards the future manifested in three big themes.

Theme 1: The big number is losing its shine

HPC has long organized itself around treating the big machine and the big number as its top priority, and this is why the two largest HPC conferences of the year honor the semiannual release of the Top500 list on their main stage. However, this year felt like the first time that no single number (one that somehow reflects "performance") dominated the conversation. Instead, the discourse was more diffuse and discussed "performance and x" or "the supercomputer and x."

Top500

The place where this was most evident to me was at the Top500 BOF, where the latest list was unveiled.

The biggest announcement was that Europe now has its first benchmark-confirmed exascale system in JUPITER, which ran a full-system HPL at 1,000,184 TFLOPS for two hours and seven minutes. However, JUPITER didn't get any stage time at the BOF since, like Aurora, it actually debuted on a previous list with a sub-exascale run. This run pushed it over the exascale finish line, but if the Top500 list metadata is to be believed, the run used 100% of JUPITER's 5,884 nodes to break the barrier--a feat that is unlikely to be reproduced on any production application, since it is rare to have zero failed nodes in any large-scale production environment.
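
For scale, here is the back-of-envelope arithmetic implied by those reported figures--just division on the numbers quoted above, nothing more:

```python
# Quick arithmetic on the JUPITER run using only the figures reported above
# (Rmax, node count, and runtime from the Top500 list metadata).
rmax_flops = 1_000_184e12          # 1,000,184 TFLOPS
nodes = 5_884
runtime_s = (2 * 60 + 7) * 60      # two hours and seven minutes

print(f"Rmax per node: {rmax_flops / nodes / 1e12:.0f} TFLOPS")
print(f"Total FP64 operations in the run: {rmax_flops * runtime_s:.2e}")
```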

So, while there was little fanfare for Europe in breaking the exaflops barrier with its new big machine and big number, there were some big announcements--one overt, and others more muted.

The biggest news was that the Top500 list is changing hands. Whereas it has historically been controlled by three people--Jack Dongarra, Horst Simon, and Erich Strohmaier--it will be transitioning to be community-controlled under the stewardship of ACM SIGHPC. Dongarra, Simon, and Strohmaier will still be on the steering committee under the ACM stewardship, but this new governance structure opens the doors for new ideas to breathe new life into the way systems are ranked and, more broadly, how "performance" is meant to be interpreted from Rmax.

The list (and related lists) are bound by rules that, in the present day of reduced-precision accelerators, make little sense. For example, using the Ozaki scheme within the LU decomposition is not allowed by Top500 despite the fact that it can produce the same answer with the same numerical accuracy much faster than hardware FP64. And while the HPL-MxP benchmark does allow solving the same problem using more creative methods, Strohmaier highlighted a problem there too: it never dictated how to deal with multiple levels of mixed precision until AIST broke the rankings. AIST ran HPL-MxP at both 16-bit and 8-bit precisions, resulting in their ABCI 3.0 system simultaneously ranking at #6 and #10.

These sorts of issues make it easy to question the value of leaderboards like Top500 or HPL-MxP, as their definition of "performance" becomes increasingly divorced from how large supercomputers are really used. The past few years have shown that the three men maintaining the list haven't had the time or energy to get ahead of these ambiguities, so transitioning it to ACM will hopefully be a positive move that gives the list a chance to be revitalized.

To their credit, the incipient stagnation of the Top500 list was called out by Strohmaier during his analysis of the list, acknowledging that "growth has tremendously slowed down compared to what it used to be" and "we don't have proof of what is actually the reason for that:"

All the key highlights of this SC's Top500 list.

China has stopped submitting, the AI and hyperscale providers really never started submitting, and retired systems are being thrown off the list long before they fall off the bottom. To me, this was a tacit acknowledgment that the list does not have a bright future out to 2030 unless it is modernized to be relevant to the way in which today's largest systems are actually being used--which is not DGEMM.

The final surprising acknowledgment during Strohmaier's talk was that the list is trailing the state of the art in hardware by quite a bit. He pointed out that Blackwell systems are only now starting to appear even though they've been shipping in volume for the better part of a year. While he hypothesized that there is "uneasiness" about Blackwell in an HPC context, the reality is that there are no Blackwells for HPC until the Blackwell orders for hyperscale AI have been fulfilled. HPC is second in line, and even then, the only Blackwells I could find on this year's Top500 list were NVL8 configurations--not the NVL72 configurations that have been filling up hyperscale datacenters like Fairwater.

Strohmaier pointed out that Blackwell, by virtue of its HBM3e (vs. Hopper's HBM3), is showing up higher on the HPCG list (which is a memory bandwidth test) than on Top500 (which is an FP64 FLOPS test). He phrased this as evidence that "not everything is bad for the HPC community," but I would have phrased my conclusion a little differently:

  1. Blackwell is actually great for HPC, because most real workloads are memory-bandwidth bound, not FLOPS bound. The fact that B200 offers similar FP64 FLOPS at higher memory bandwidth means that real applications will get higher effective use of those FP64 FLOPS (see the back-of-envelope roofline sketch after this list).
  2. Despite the above, Blackwell doesn't perform well on Top500 because HPL doesn't reflect the reality that memory bandwidth is important. It follows that HPL doesn't reflect the reality of real HPC applications. A Blackwell system can be significantly better for real HPC applications than a comparably sized Hopper system even though it may rank lower than Hopper on Top500.
  3. Blackwell isn't showing up in volume now because the HPC community is second in line. The HPC community isn't uneasy as much as it is completely locked out. The first NVIDIA-based exascale system debuted in November 2025 despite its GPU being three years old, suggesting that if big Blackwell systems ever appear on Top500, it'll happen in 2026-2027.
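
To put rough numbers behind the first point, here's a toy roofline estimate. The peak FLOPS and bandwidth figures below are illustrative placeholders I picked for a Hopper-class and a Blackwell-class part, not official specs, and the kernel is a generic low-arithmetic-intensity workload:

```python
# Toy roofline model: a kernel's runtime is bounded by whichever is slower,
# doing the arithmetic or moving the data. Device numbers are assumptions.

def runtime_s(flops, bytes_moved, peak_flops, peak_bw):
    """Lower bound on kernel runtime under a simple roofline model."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

flops = 1e12         # 1 TFLOP of FP64 work
bytes_moved = 1e12   # 1 TB of memory traffic (~1 FLOP/byte, stencil-like)

# Hypothetical accelerator profiles (FP64 FLOP/s, HBM bytes/s) -- placeholders.
hopper_class    = (34e12, 3.3e12)
blackwell_class = (37e12, 8.0e12)

for name, (pf, bw) in [("hopper-class", hopper_class),
                       ("blackwell-class", blackwell_class)]:
    t = runtime_s(flops, bytes_moved, pf, bw)
    print(f"{name}: {1e3 * t:.0f} ms (memory-bound: {bytes_moved / bw > flops / pf})")
```

Similar FP64 peaks, but the higher-bandwidth part finishes the bandwidth-bound kernel in well under half the time--which is exactly the behavior HPL doesn't reward.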

All of this is a roundabout way of saying that the big number--in this case, the HPL score--no longer drives meaningful conversation about how useful a system is for science.

The Gordon Bell Prize

Another major indicator of the changing tide away from the big number was the work that won this year's Gordon Bell Prize. The winning paper, titled "Real-time Bayesian inference at extreme scale: A digital twin for tsunami early warning applied to the Cascadia subduction zone," wasn't the typical case of running a huge simulation for a few hours and reporting some result. Rather, it described a four-step workflow that culminates in the desired insight popping out of a computation that runs across only 128 nodes and completes in less than 0.2 seconds. Furthermore, the hero run part could be decomposed into trivially parallel components, allowing the bulk of the computation to be geographically distributed across HPC centers or GPUs spread across on-prem and cloud providers.

My understanding of the work is that there was a massive "offline" computation to precompute a few key matrices (Phase 1) followed by two shorter offline steps that turn those matrices into the core of the digital twin. The last step, which was "online" and designed to be computed in real-time, could then take this core and solve the input problem with extremely low latency. This workflow front-loads a hero run in such a way that, if an earthquake were to occur, the risk of tsunami could be calculated in less than a second using only modest compute resources and the precomputed core.
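
To make the shape of that workflow concrete, here's a minimal sketch of the offline/online split as I understood it. This is my own toy illustration using a linear least-squares stand-in, not the paper's actual mathematics:

```python
# Minimal sketch of the offline/online pattern described above: do the
# expensive work once, then answer real-time queries with a cheap operator.
import numpy as np

rng = np.random.default_rng(0)

# --- Offline phase (the "hero run"): expensive, done once ------------------
m, n = 4000, 200                      # many observations, few parameters
A = rng.standard_normal((m, n))       # stand-in for the forward model
A_pinv = np.linalg.pinv(A)            # the costly precomputation

# --- Online phase: must run in (near) real time -----------------------------
def infer(observation):
    """Cheap inference: one small matrix-vector product per query."""
    return A_pinv @ observation

y = A @ rng.standard_normal(n)        # a synthetic "sensor" reading
x_hat = infer(y)                      # milliseconds, not hours
print(x_hat.shape)
```

The point is structural: the expensive factorization happens long before it is needed, so the latency-critical step reduces to a small, cheap operation.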

The authors eschewed methods that generated tons of FLOPS in favor of methods that were less FLOPS-efficient but got to the answer faster. In the authors' own words:

As shown in Fig. 7, higher FLOP/s does not necessarily lead to faster time-to-solution. On MI300A nodes of El Capitan, the best-performing implementation, Fused PA, achieves a lower percentage (5.2%) of theoretical peak FLOP/s than Fused MF (5.5%) but is faster.

Interestingly, the hero computation here was embarrassingly parallel(ish) as well; in their demonstration run, the hero run (Phase 1) was broken into 621 independent calculations, each requiring 128 nodes (512 A100 GPUs) for about an hour. Because they are independent, these tasks could be parallelized across multiple HPC centers as well, and my understanding is that the data volumes involved are modest; Phase 1 would require a single shared copy of the input mesh (a hundred GiB?) per HPC center, and each of the 621 tasks would output around 8 GiB which would have to be copied back.
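
Doing my own arithmetic on those figures, the Phase 1 campaign works out to roughly:

```python
# Back-of-envelope totals for Phase 1 using only the figures quoted above.
tasks = 621
nodes_per_task = 128
gpus_per_task = 512
hours_per_task = 1.0
output_gib_per_task = 8

print(tasks * nodes_per_task * hours_per_task)   # ~79,488 node-hours of compute
print(tasks * gpus_per_task * hours_per_task)    # ~318,000 GPU-hours
print(tasks * output_gib_per_task / 1024)        # ~4.9 TiB of outputs to copy back
```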

While I don't understand the mathematics behind this work, the paper took what would've been a huge exascale-class mathematical problem ("10 years on a sustained 1 EFLOP/s machine") and reformulated it into a workflow that solves the problem faster and more usefully. Instead of brute-forcing the problem with a big supercomputer, they split it into separate offline and online parts, and this naturally allowed the most computationally expensive part to be geographically distributable.

This work surrendered the need for a single big machine, and it didn't produce a big-number result. But it did win the Gordon Bell Prize, again signaling that the HPC community is beginning to look beyond performance alone and think about awarding innovation according to outcomes, not just FLOPS.

The talk for this paper can be viewed here in the SC25 Digital Experience.

Fixing problems caused by the big number

Most of my perception that the HPC community is beginning to de-emphasize the singular big machine or big number arose from organic interactions I had with colleagues and customers, though. It's hard to summarize how these conversations went, but the Lustre Community BoF is a good example of what I saw elsewhere.

Lustre has long been the gold standard in high-performance parallel I/O in the HPC community because it was designed from day one to deliver high bandwidth above all else. As a result, Lustre already has the big number solved in many ways, and events like the Lustre BOF are a great case study in what it looks like for a performance-first technology to be pushed into adapting to deliver more than just a big number.

First, the ever-innovative Stéphane Thiell from Stanford discussed the process and tooling he developed to enable online capacity expansion of a Lustre file system. The basis for it was a distributed, fault-tolerant tool that uses redis, lfs find, and lfs migrate to manage the state of file migrations across Lustre servers as the file system is rebalanced. While a part of me thought this was a great tool that would be super helpful for many others, another part of me was kind of horrified.
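
For the curious, here is a minimal sketch of the pattern as I understood it from the talk--my own illustration of the moving parts (Redis for shared state, lfs find to enumerate candidates, lfs migrate to restripe), not Stéphane's actual tool. The OST list and stripe options are placeholders:

```python
# Sketch only: coordinate Lustre file migrations across workers using Redis.
import subprocess

import redis

r = redis.Redis()        # shared state store / work queue for all workers
OLD_OSTS = "0,1,2,3"     # placeholder: OSTs whose files should be rebalanced

def enqueue_candidates(fs_root="/lustre"):
    """Use `lfs find` to list files striped on the old OSTs and queue them."""
    out = subprocess.run(
        ["lfs", "find", fs_root, "--type", "f", "--ost", OLD_OSTS],
        capture_output=True, text=True, check=True)
    for path in out.stdout.splitlines():
        r.sadd("pending", path)

def migrate_one():
    """Pop one queued file and rewrite it so Lustre re-lays it out across OSTs."""
    path = r.spop("pending")
    if path is None:
        return False
    subprocess.run(["lfs", "migrate", "-c", "-1", path.decode()], check=True)
    r.sadd("done", path)
    return True
```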

Maybe I've been spoiled by working in hyperscale and AI these past three years, but online capacity expansion and rebalancing is a built-in capability of all distributed storage systems these days. All the major cloud object stores do this, as do all modern parallel file systems including Quobyte, VAST, and WEKA. Of course, none of these modern systems are as efficient (on a per-CPU core or per-SSD basis) as Lustre at delivering peak performance. But Stéphane's talk made me realize the price that's paid for this great performance.

Andreas Dilger and others went on to talk about Lustre futures, and as they were speaking, I noticed that nobody was talking about performance improvements to Lustre. Rather, feature development was focused on catching up in every other dimension--data governance, reliability, manageability, and others. For example, Andreas talked a bit about the upcoming "multi-tenancy" features coming to Lustre:

It's a lot of work to retrofit multitenancy into a performance-first file system.

I put "multi-tenancy" in quotes because these changes really represent trying to back into a security posture that is fundamentally different from the one that Lustre was designed around. In the pursuit of performance, Lustre (as with most other HPC technologies) was designed assuming that security was someone else's problem. By the time someone could log into a system that could mount a Lustre file system, they had already been authenticated, and it was up to the OS on each compute node to authorize any interactions with Lustre itself. This is the "implicit trust" model.

The problem, of course, is that the rest of the world has adopted a "zero trust" model which makes many things (except performance!) generally easier. Compliance is easier when the system assumes that everything is encrypted as a default and key management can be delegated to a third party. Because Lustre didn't do this from the outset, it is going through this process of retrofitting encryption in various places and using a mixture of nodemaps, UID/GID maps, and shared secrets to patch over all the places where trust was fundamentally implicit.

Later on in the BOF, panelists acknowledged (some half-heartedly) that manageability of Lustre was a barrier. One panelist admitted that it took five years of work to almost get to the point where a Lustre update can be done without crashing applications. Another panelist said that multitenancy in Lustre is easy if you follow a million steps, and that his company was developing script-based ways to simplify this. While the idea of using scripts to simplify operations is not bad, from a secure supply chain standpoint, relying on third-party bash scripts to enable features required for legal compliance is horrifying.

I don't mean to pick on Lustre alone here; other HPC technologies such as InfiniBand, Slurm, and DAOS are facing the same reality: architectures that prioritized performance and scalability over everything else are now going through similar contortions to retrofit modern requirements around security, manageability, and data governance. For those HPC centers who do not have to worry about compliance (which is most of open-science computing), these technologies will continue to be just fine.

However, the successes of these modern file systems across leading HPC centers and the proliferation of alternative technologies such as Kubernetes-based HPC and MRC over Ethernet tell me that HPC is coming around to the idea that marginal increases in performance are no longer worth missing out on factors that weigh heavily on day-to-day operations like manageability, reliability, and flexibility.

Theme 2: HPC policy is becoming AI policy

Some of the biggest news at SC was not actually showcased at the conference despite being what many people wanted to talk about in side conversations: HPC policy is rapidly becoming AI policy, resulting in a slew of huge (but poorly defined) "public-private partnerships."

As a bit of background, the Oak Ridge Leadership Computing Facility announced its next system, Discovery, in late October--this was the result of a "typical" supercomputer procurement process that first came into the public eye in 2023. However, the Discovery announcement also included mention of a smaller system, Lux, which will "leverage the Oracle Cloud Infrastructure (OCI)" (whatever that means) to provide earlier access to AMD MI355X GPUs ahead of Discovery's full-scale deployment.

Then, two days later, Argonne National Laboratory announced a similar arrangement with Oracle Cloud and NVIDIA to deliver a small (Lux-sized) GPU supercomputer named Equinox, followed by a much larger 100,000-GPU supercomputer named Solstice. Neither Equinox nor Solstice is attached to a "typical" supercomputer procurement; the follow-on to Aurora, to be named Helios, is still in planning and will be deployed in 2028. This strongly suggests that, whatever "public-private partnership" means to the DOE, it is not the same as the typical leadership computing systems; it is its own AI-centric program.

At SC itself, Evangelos Floros (EuroHPC's head of infrastructure) also mentioned the "need for public-private partnerships" to realize EuroHPC's goal of building "AI Gigafactories" with "100,000 advanced AI processors" across Europe.

"Need for public-private partnerships" to fund AI factories is recognized by EuroHPC too.

Again, what exactly this "public-private partnership" model entails in Europe was never really defined.

What was clear is that both American and European efforts are declaring the need to build massive (100K+ GPU) supercomputers for AI, that the traditional HPC centers will be the public stewards of them, and that "public-private partnerships" are the only way to realize them since governments alone cannot foot the bill.

The Top500 BOF also included a short, awkward talk by Rick Stevens titled "The DOE AI Initiatives" that amounted to Stevens saying he had nothing to say. What really happened, I suspect, is that DOE's new "Genesis Mission," which was announced the week after the SC conference, was a week late and therefore couldn't be discussed as originally planned. If Stevens had been able to describe the Genesis Mission, though, I'm sure he would've also described "public-private partnership" as a key aspect, since the same language is used in the Executive Order that established Genesis. And I'm sure his description would've been no clearer about what this really means than what EuroHPC or the OCI/DOE descriptions have stated.

Most revealing was my observation that, even outside of the proper conference program, nobody really knew what any of this meant. I talked to plenty of my colleagues from both government HPC and hyperscale cloud organizations, and the only consistent message was that there aren't many concrete facts backing up the press releases right now. It appears that these partnerships were brokered far outside the usual channels through which large supercomputer procurements are normally done, and the people in charge of actually delivering on the promises of the press releases are still figuring out what is possible.

Connecting the dots between Lux/Equinox/Solstice, Genesis, and a recent RFI and RFP from DOE to allow hyperscalers to build AI factories on federal land, it appears that what is happening is...

  • The DOE has a bunch of land adjacent to the National Labs that is undeveloped but has the infrastructure to support massive AI factories. Specifically named are a 110-acre parcel at Argonne that can accommodate up to a 1 GW "AI data park," and a 100-acre parcel at Oak Ridge with up to 800 MW. These details were disclosed in an RFI they issued earlier in the spring.
  • The Solstice press release specifically said that DOE envisions "shared investments and shared computing power between government and industry." Given the RFI/RFP were about land leases, these public-private partnerships may involve splitting the costs of space/power/cooling (the land and infrastructure being leased) and the capital/operations (the supercomputer cloud services being built) between the Labs and Oracle.

A potential model for operations is that cloud providers are allowed to build and operate commercial AI cloud services adjacent to the DOE HPC facilities in exchange for the DOE Genesis Mission being entitled to some of those AI cloud capabilities. Exactly how much supercomputing capacity hyperscalers like OCI would give to DOE, and exactly how much it would cost the DOE Labs to serve as landlords, is probably still undefined. But seeing as how power is the single biggest limiter in AI these days, I expect this model will only spread costs around, not actually lower them.

If this is indeed how Genesis plays out, this would establish a bizarre new way for the government to acquire HPC (or AI) capabilities that completely sidesteps the standard procurement model. Instead of plunking down a hundred million dollars a year to finance a new leadership supercomputer, we might be moving into a world where the Labs plunk down a hundred million dollars a year to cover the costs of power, space, and cooling for a cloud provider. And instead of owning a leadership supercomputer, these national HPC facilities wind up consuming HPC (well, AI) resources from cloud providers--hopefully at a cost that reflects the fact that the cloud providers are also profiting from cycles being sold off of these machines to commercial AI customers.

But again, this is all speculation based on the consistencies I heard throughout the conference and the experience I had trying to build these sorts of partnerships with the HPC community while I worked at Microsoft. I may be right, or I may be wildly wrong. There are probably only a handful of people in the world with a clear idea of what these partnerships are meant to look like right now, and they are all way above the heads of the people at the HPC centers who will be tasked with executing on the vision.

Selfishly, I am also left with a bit of heartburn over all of this news. I put a lot of personal time and energy into giving the HPC community the information it needed to feel comfortable about partnering with hyperscale AI infrastructure providers while I was at Microsoft, and it often felt like a Sisyphean task. Within months of my giving up and moving on from my career at a cloud provider, seeing a complete reversal of policy from the leadership HPC folks--and seeing the "other guy" in pole position--is a bit of a slap in the face.

I also couldn't help but notice that the cloud provider in all the headlines in the US didn't seem to demonstrate a very strong and unified presence at SC this year. Comically, they didn't even use their own brand's colors for their booth on the exhibit floor. And the color scheme they did use left no room for Oak Ridge's Lux system, which will be AMD-based, to be showcased.

Oracle's booth at SC25. Their brand color is red, not green. Or so I thought.

Though I may have read too much into this, it feels like these public-private partnerships are not necessarily composed of equal partners with equal levels of commitment.

More broadly, I left the conference concerned that the discourse happening around these cloud-HPC/AI integrations--at least in the US--appears to have regressed compared to where it was when I worked at Microsoft. Many of the things we had to figure out years ago (cybersecurity models, impacts on jobs at the HPC centers) seem to have reset to zero. And sidestepping the procurement processes for leadership computing to enable these public-private partnerships will either require significant new funding (of which Genesis provides none; the executive order as written appears to recolor existing money) or robbing Peter (the budget funding the next generation of leadership HPC systems) to pay Paul (the cloud providers serving up compute resources for AI). As a result, I can envision a future where all of the money that used to fund leadership computing for science becomes money to fund commercial AI factories, resulting in a slow evaporation of the LCFs as their HPC capabilities shrink in size and relevance.

Though there's lots more to be said on this topic, it's all based on conjecture. So, maybe the best thing to do is quietly wait and see.

Theme 3: AI discourse is growing up

This was the first SC where it felt like the discourse around AI's role in the future of scientific computing actually carried some substance. Whereas previous years saw talk that mostly revolved around basic ideas like "do LLMs hallucinate too much?" or "can ChatGPT write MPI code?," I sat in on a number of interesting talks and conversations that skipped the question of "is AI useful?" and went straight to "this is how AI is proving useful to us."

Maybe it's related to the previous theme: HPC money is becoming AI money, so AI research is becoming required to stay afloat. Or maybe it's because 2025 has been the year of agentic AI, and agents allow LLMs to be integrated much more surgically into complex workflows. Or maybe confirmation bias led me to sit in sessions and talk with people who are at the frontier of applying AI to scientific discovery. Whatever the case may be, I was glad to hear so much discussion from researchers around the importance of all the connective tissue required to operationalize AI in scientific computing.

Agentic workflows

A great example of this was the 1st International Symposium on Artificial Intelligence and Extreme-Scale Workflows, which happened on Friday. One of the invited speakers, Dr. Katrin Heitmann, connected a lot of dots in my head with a talk she gave on how massive-scale, physics-based simulation workflows can benefit from agentic AI.

Heitmann's vision on how agentic approaches can augment (but not replace) humans in complex scientific workflows.

The crux of the challenge faced by most massive-scale simulations (like HACC, the cosmology code for which she is famous) is that they generate massive amounts of data. The most recent HACC run generated hundreds of terabytes of compressed data per checkpoint and over a hundred petabytes of data in the end; this cosmological simulation serves as a reference dataset from which downstream cosmological research can draw when exploring targeted questions. The challenge, of course, is finding relevant pieces of the simulated universe amidst a hundred petabytes of raw data.

Dr. Heitmann's premise is that agents and tools have very specific scopes and capabilities, and researchers have control over which of these tools they wish to use. They can then hand these tools off to an agentic workflow and let it autonomously sift through all of the data, looking for specific features within the simulated universe that are relevant. A specific example she gave was the process of examining 500 million galaxy clusters; with an agentic, AI-driven approach, a postdoc was able to interactively sift through these objects without examining each one individually. For truly interesting objects, a separate agent could go search the literature and provide an explanation as to why each may be interesting, absolving the postdoc from having to make round trips between the dataset and external literature.
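
As a mental model (my own toy illustration, not Dr. Heitmann's actual system), the pattern looks something like handing an agent a whitelist of narrowly scoped tools and letting it decide when to invoke them:

```python
# Toy sketch of "scoped tools + agent loop"; all names here are invented.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str   # what the agent is told this tool can do
    fn: Callable       # the actual, narrowly scoped capability

def find_massive_clusters(catalog, mass_cut=1e15):
    """Select clusters above a mass threshold from a (hypothetical) catalog."""
    return [c for c in catalog if c["mass_msun"] > mass_cut]

def summarize_literature(object_id):
    """Placeholder for a second agent that searches papers about this object."""
    return f"(literature summary for {object_id} would go here)"

# The researcher decides which capabilities the agent is allowed to use...
TOOLS = {
    "find_massive_clusters": Tool("find_massive_clusters",
                                  "select clusters above a mass cut",
                                  find_massive_clusters),
    "summarize_literature": Tool("summarize_literature",
                                 "explain why an object may be interesting",
                                 summarize_literature),
}

# ...and the agent (LLM-driven in practice, hard-coded here) chooses which
# tool to apply at each step; the human only reviews the short final list.
def run_agent(catalog):
    interesting = TOOLS["find_massive_clusters"].fn(catalog)
    return {c["id"]: TOOLS["summarize_literature"].fn(c["id"])
            for c in interesting[:10]}
```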

That all said, it was clear from this talk (and others) that integrating agentic AI into scientific inquiry is still in its early days. But what I appreciated about this talk (and the entire workshop) is that it sidestepped pedestrian questions about trustworthiness by acknowledging that the goal isn't full autonomy, but rather, enabling researchers to do things faster. There is still a human at the start and the end of the workflow just as there always has been, but agents can reduce the number of times a human must be in the loop.

Data and agent-centric service infrastructure

Even when AI wasn't the main topic of discussion, it was clear to me at this SC that AI is influencing the way researchers are thinking about the infrastructure surrounding supercomputers. A great example of this was the keynote at the PDSW workshop, given by the ever-insightful Dr. Rob Ross, where he offered a retrospective on the work his team has done over the last two decades, what he felt they got right, what they missed, and what's ahead.

Towards the end of his presentation, he made the case that "science is increasingly multi-modal." But rather than talk about multimodality in the AI sense, he was emphasizing that there's more to scientific computing than performance:

Domain science, provenance, search, and resilience are equal partners to performance in scientific computing.

Taken at face value, this slide positions performance on equal footing with domain science, provenance, findability, and resilience, and his argument was that we've moved beyond the world where the only storage problem that HPC faces is checkpointing. Just as Dr. Heitmann would say on Friday, Dr. Ross' argument was that the increasing volume of scientific data coming out of both exascale simulation and scientific instruments is driving the field towards more automation. And with automation comes a greater need to understand data provenance--after all, if automation produces a surprising result, a human ultimately has to go back and understand exactly how the automation generated that result.

He also pointed out that in this coming world of automation-by-necessity, infrastructure itself might have to be rethought. After all, traditional technologies like parallel file systems were designed to make the lives of human researchers easier; when the primary consumer of data becomes AI agents, not humans, there may be better ways to organize and expose data than through files and directories. A human might repeatedly cd and ls to find a specific dataset on a file system, whereas an agent might query a flat index to find the same data in a single step.
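
A trivial illustration of the difference (the dataset names and fields below are made up):

```python
# A human walks the directory tree; an agent can hit a flat metadata index
# with one declarative query. Entries here are invented for illustration.
index = [
    {"path": "/sim/run042/snap_100/halos.h5", "kind": "halo_catalog", "redshift": 0.5},
    {"path": "/sim/run042/snap_200/halos.h5", "kind": "halo_catalog", "redshift": 0.0},
    {"path": "/sim/run042/snap_200/particles.h5", "kind": "particles", "redshift": 0.0},
]

# One query, no cd/ls round trips:
hits = [d["path"] for d in index
        if d["kind"] == "halo_catalog" and d["redshift"] == 0.0]
print(hits)   # ['/sim/run042/snap_200/halos.h5']
```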

At the end of the same PDSW workshop, I was fortunate enough to contribute to a panel where many of these same themes--how will data systems change as AI plays a greater role in scientific discovery--were discussed. Although we touched on a lot of topics, what stuck with me was a general acknowledgment that, while HPC has always talked about data management and provenance as being important, they were always treated as a "nice to have" rather than a "must have." However, as was echoed across many presentations (including the two I described above), governance and provenance are now becoming non-negotiable as larger datasets drive us towards AI-driven automation.

Regardless of what you think about AI's ability to accelerate scientific discovery, I left SC with the feeling that AI is forcing the HPC community to grow up with regards to how seriously it takes data management:

  • The size and velocity of datasets generated by simulation or experiment are growing beyond any single person's ability to analyze them by hand. The complexity of these data is also making it harder to develop heuristics-based or analytical approaches to combing through all of it.
  • The best path forward to understanding these data is through AI (via purpose-built models for analysis) or AI-driven data exploration (via autonomous, agentic workflows).
  • Automation or autonomous workflows will always act under authority delegated to them by human researchers, meaning there is a growing need to be able to inspect how these workflows arrived at the conclusions they generate.
  • Understanding how an answer was achieved requires significantly better data management features such as governance, provenance, and auditability. A result is ultimately only useful if a human can trust it, and that trust comes from understanding which data informed that conclusion, how that data was created, and how it was modified over time.

Put differently, checkpointing was the main concern of I/O research because I/O performance was the first scalability issue that scientific computing ran into as supercomputers and scientific instruments got bigger. However, we're now at a point where issues ancillary to performance have reached the limits of scalability. Dr. Ross's multi-modal slide indicates that provenance, indices/search, and resilience are some examples of these new barriers, but there are plenty more as well.

In a sense, this theme is the opposite side of the same coin as the first theme I discussed--that the big number is losing its shine. The hardest questions going forward aren't the obvious ones about scaling performance; they are about scaling everything else to keep up. AI seems to be the technology that has cleared a path to these data management hurdles, but the benefits of adopting strong data management practices and systems will extend far beyond the reach of just enabling AI-based automation.

The exhibit hall

The exhibit hall has long been one of my favorite parts of attending SC because it's a great way to get a feeling for what technologies and vendors are hot, where the innovation is trending, and what sorts of commercial problems are worth solving. Every year I feel like I have less and less time to walk the exhibit hall though, and the layout and composition of this year's exhibition meant I only saw a small fraction of what I wanted to see in the few days it was up.

The most common comment I heard about the exhibit this year is captured in Doug Eadline's article, SC25 Observations: More Pumps than Processors (which is well worth the read!). The same commentary was repeated throughout the OCP conference in October as well, suggesting that there is a lot of money to be made (or at least the prospect of money) in helping datacenters get outfitted for the liquid cooling demanded by the next generation of large-scale GPU infrastructure. However, I found the overwhelming amount of space devoted to liquid cooling companies acutely problematic at SC25 this year for two reasons:

  1. Most SC attendees have nothing to do with liquid cooling. A colleague of mine who operates supercomputers for the energy sector asked one of these big liquid cooling vendors what he could do to actually engage with them. After all, he doesn't buy liquid cooling infrastructure; he buys whole supercomputers that come with heat exchangers and CDUs integrated into the solution. The vendor had no good answer, because the reality is that the typical supercomputer user or buyer has no say over what piping, coolant, or exchangers are used inside the machine itself. The whole point of buying an integrated supercomputer is to not have to deal with that level of detail.
  2. These liquid cooling vendors soaked up a ton of floor space. A few of these physical infrastructure providers had massive (50x50) booths sprinkled across the exhibit hall. Combined with the fact that the average SC attendee has nothing to do with liquid cooling, this meant that the booths more likely to be relevant to a typical attendee were much further apart than they had to be.

The end result was that the exhibit hall was absolutely gargantuan and yet information-sparse. In fact, this year saw a secondary exhibit hall in the old football stadium serve as overflow space, because the entire primary exhibit hall was full. What's worse is that this overflow space was (as best as I could tell) completely disconnected from the main hall, and the only time I ever saw it was from the dining area used to serve lunch for the tutorials.

The exhibit hall's overflow space being set up in the former football stadium.

I would've been furious if I had been stuck with a booth in this overflow space, because I can't imagine the foot traffic in there was very high. I personally couldn't even find the entrance to this second exhibition area in the few hours I had to look for it.

I can't help but think the SC organizers leaned far too much into booking up as much space (and therefore exhibitor dollars) as possible without thinking about the dilutive effects of having such a massive vendor count. Some vendors definitely benefitted from having a good location near one of the hall entrances, but I also heard a nontrivial amount of grumbling around how little traffic there was at some of the big booths. It wouldn't surprise me if there was a contraction of the HPC mainstays at SC26.

By the numbers

Rather than rely solely on anecdotes though, it's also fun to take a quantitative look at the changes in exhibitors relative to last year. Since I spent the time figuring out how to generate tree maps for my SC24 recap last year, I figured I should re-run the same analysis to compare SC25 to SC24.

Of the biggest booths belonging to first-time exhibitors this year, it should be no surprise that the two biggest new entrants were Danfoss (liquid cooling infrastructure) and Mitsubishi Heavy Industries (gas turbines and other large-scale infrastructure):

New exhibitors with the largest booths.

Of the other top new exhibitors, some (Solidigm, Sandisk, C-DAC, MinIO, and University of Missouri Quantum Innovation Center) were quite relevant to the typical SC attendee. Arm was also back after having skipped SC24. But there were scores of new exhibitors whose services and products seem much more relevant to very niche aspects of physical datacenter infrastructure.

Of the exhibitors who didn't show up to SC25 but had big booths at SC24, there was a diverse mix of markets:

Vendors who didn't show up to SC'25 but had big booths at SC'24.

Sadly, higher ed and government popped up on this list (see Doug Eadline's take on this for more). A bunch of datacenter infrastructure providers also vanished, including Valvoline and Boundary Electric; this suggests that some of the top new vendors of this year (Danfoss, Mitsubishi) may similarly vanish entirely next year after realizing that SC isn't really their crowd. But I was also surprised to see some big names in AI vanish; Iris Energy (IREN) is a GPU cloud provider that just inked a multi-billion dollar deal with Microsoft; Ingrasys manufactures much of the world's GB200 NVL72 infrastructure; Groq and SambaNova also inexplicably vanished.

Perhaps more interesting are the top growers; these vendors exhibited both last year and this year, but went significantly larger on their booth sizes:

Biggest increases in booth size at SC'25 vs. SC'24.

Legrand, which provides datacenter infrastructure bits, likely grew as a result of acquiring USystems and merging USystems' booth with its own this year. The other big booth expansions are mostly household names though; Gates, EBARA, and GRC are cooling vendors that the typical SC attendee can't do much with, but the others are organizations with whom a researcher or HPC datacenter operator might actually talk.

Finally, the top contractions in booth space are a mix of service providers, HPC facilities or research centers, and component suppliers:

Biggest decreases in booth size at SC'25 vs. SC'24.

Of the biggest vendors who downsized, Carahsoft is a component reseller and service provider, Stulz is a liquid cooling company, HLRS is a German supercomputer center, and Viridien is an HPC services company that primarily serves the energy sector. It is surprising to see AWS shrink while Microsoft grew, and it is doubly surprising to see Oracle shrink when it's at the center of the biggest HPC deployment news of the season. Given that these booth sizes are chosen a year in advance, this may speak to how unexpected the turn of events was that resulted in Oracle carrying the cloud services end of DOE's big public-private partnerships.

Interesting new technology

For reasons I'll discuss later, I didn't have much time to walk the exhibit hall floor. Combined with the fact that everything was so spread out and diffuse, I just didn't get a great sense of what interesting new technology was being introduced this year beyond what tended to stick out. And amidst all the giant CDUs and liquid cooling infrastructure, it was hard for anything to stick out except really big compute cabinets.

Dell IR7000

Dell's booth had a fully loaded IR7000 rack on display (as they did at GTC earlier in the year) with 36 GB200 NVL4 sleds. At 50OU high (almost eight feet tall), this thing is physically huge:

Dell's 50OU IR7000 rack, fully loaded. This is what TACC Horizon will be built from.

Unlike the version they had on display at GTC though, this one had both the front door and a full rear-door heat exchanger installed:

HUGE rear-door heat exchanger on the back of the Dell IR7000 rack.

What's notable about this platform is that we now know that it is the basis for both TACC's upcoming Horizon system (which will have 28 of these fully loaded racks) and NERSC's upcoming Doudna system (which will have Vera Rubin rather than Blackwell). This rack was nominally designed for hyperscale AI and is the basis for Dell's GB200 NVL72 (XE9712) deployments at places like CoreWeave and xAI, which means that it'll be thoroughly tested at scale long before TACC or NERSC have it up and running. This is the opposite of what has historically happened: before AI, it was usually government HPC that had to debug new rack-scale architectures before industry would touch it.

HPE Cray GX5000

However, government HPC will still have a chance to debug a new supercomputing platform in the recently announced Cray GX (formally called "the HPE Cray Supercomputing GX platform"), which is the successor to the current Cray EX platform. This is the platform that the Discovery supercomputer at OLCF will use, and HPE had a CPU-only blade (Cray GX250) and a rack mockup on display at SC:

HPE's new GX blade form factor. This one appears to be the GX250, the 8-socket CPU-only blade.

It's hard to tell the size of this blade from the photo, but if you look at the relative size of the CPU socket and the DIMM slots, you can get a sense of how physically massive it is--it's like a coffee table. It also isn't perfectly rectangular; Cray decided to put this unusual protrusion on the front of the blades which is where the four NICs and eight E1.S SSDs are housed:

A look at the side of the Cray GX blade's "nose" showing the side-mounted NIC ports.

This nose(?) adds more surface area to the front of the rack, and it makes more sense when you see a rack full of these nodes. HPE had a full GX5000 rack with mocked-up cardboard nodes in their booth as well:

Fully loaded GX5000 rack. The nodes were cardboard, but pretty nice cardboard.

By having the NIC ports (which are Slingshot 400) face the sides of the rack rather than stick out the front, the bend radius of all that copper doesn't have to be quite as dramatic to route it along the sides of these node noses. And unlike previous Cray designs, there's also no midplane or backplane that connect the nodes in a rack to the rack-local switches; everything connects through discrete copper or optical cables.

At the center of the rack is a liquid-cooled switch chassis, and each rack can support either 8-, 16-, or 32-switch configurations. Each switch is a 64-port Slingshot 400 switch, and I think the premise is that a single GX5000 rack is always exactly one dragonfly group. If you want a smaller group, you use a switch chassis with fewer switches.

Interestingly, this GX will also support non-Slingshot Ethernet and XDR InfiniBand switches. Given that both XDR InfiniBand and 800G Ethernet are shipping today and have twice the bandwidth that Slingshot 400 will have when it starts shipping in a year, perhaps the Slingshot 400 option is just a stopgap until HPE's investments in Ultra Ethernet result in a product. The lack of a network backplane in the rack also makes it easier for the rack to accommodate the non-dragonfly topologies that would be required for InfiniBand or Ethernet.

The rear of the rack is remarkably unremarkable in that it simply contains a rear bus bar and the liquid cooling manifolds and mates. In this sense, the rack looks very OCP-like; the boring stuff is in the back, everything exciting is serviced from the front, and the rack itself is passive plumbing. Like any OCP ORv3 rack, power shelves slot in just as server blades do, and they use the same liquid cooling infrastructure as the rest of the rack. They power the bus bar, and the blades and switches draw from the same bus bar.

Compared to an ORv3 rack though, these GX racks are wider and shorter. The width probably offers more flexibility for future NVIDIA or AMD GPU boards, but I was surprised that Cray didn't go ultra tall like Dell's 50OU IR7000. I was also surprised to hear that Cray is launching GX with a 400 kW cabinet design; power already appears to be a limiting factor in the nodes launching with GX. A single 400 kW GX rack can support:

  • 40 CPU-only blades (81,920 cores of Venice)
  • 28 AMD Venice+MI430X blades (112 GPUs)
  • 24 NVIDIA Vera+Rubin blades (192 GPUs)

For reference, the demo GX5000 rack pictured above had only 29 blades and 16 switches. I assume that fitting 40 blades into the rack requires using the smallest dragonfly group possible.
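
Doing the arithmetic on those configurations (my own back-of-envelope, using only the numbers quoted above):

```python
# Back-of-envelope numbers derived from the rack configurations listed above.
rack_kw = 400

# CPU-only configuration: 40 blades, 81,920 Venice cores
print(81_920 / 40)        # 2,048 cores per 8-socket blade
print(81_920 / 40 / 8)    # 256 cores per Venice socket
print(rack_kw / 40)       # 10 kW per CPU blade (upper bound)

# Vera+Rubin configuration: 24 blades, 192 GPUs
print(192 / 24)           # 8 Rubin GPUs per blade
print(rack_kw / 24)       # ~16.7 kW per GPU blade (upper bound)
```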

On the cooling front, the GX5000 rack will launch with support for the same 1.6 MW CDUs as the current Cray EX platform. I heard talk of a neat sidecar CDU option as well, but the person with whom I spoke at the HPE booth said that would come a little later.

Overall, I was surprised by how un-exotic the new Cray GX platform is compared to what the AI world has been doing with ORv3 racks. The fact that Cray and Dell's designs are more similar than different suggests that the HPC/AI world is converging on a place where the future is uncertain, and flexibility is more important than highly engineered racks that optimize for very specific nodes and networks. It also suggests that the real value of buying Cray is higher up the stack; liquid cooling, power delivery, and rack integration are becoming commoditized thanks to AI.

I was also surprised that Cray's next-generation design is not obviously superior to what the hyperscale community is designing. Whereas the GX rack caps out at 400 kW, Dell's will allegedly scale up to 480 kW. That said, today's IR7000 racks shipping for Horizon are only 215 kW (for GPU racks) and 100 kW (for CPU-only racks) according to a talk given by Dan Stanzione:

The physical configuration of TACC's upcoming Horizon supercomputer.

So until the final specifications for the Rubin GPU are released, I suspect we won't know whether Cray still leads the pack in terms of compute density, or if Dell made the better bet by aligning its supercomputing platform with a standard OCP rack design.