HPC in an AI world: swimming upstream with more conviction

Dan Reed recently published an essay, HPC In An AI World, that summarizes a longer-form statement piece he co-authored with Jack Dongarra and Dennis Gannon called Ride the Wave, Build the Future: Scientific Computing in an AI World. It's worth a read since, as with much of Dr. Reed's writing, it takes a necessary, hard look at where the HPC community needs to go as the world underneath it shifts as a result of the massive market forces driving AI.

This is a topic about which I've written at length in the past on my blog, and as I read Dr. Reed's latest post (and the Ride the Wave paper that motivated it), I found myself agreeing with many of his positions but disagreeing with some others.

My own background is in the world at the center of Dr. Reed's writing: traditional HPC for scientific computing at the national scale. However, my outlook has also been colored by the years I spent at Microsoft supporting massive-scale supercomputing infrastructure for training frontier models and the days I now spend at VAST, steeped in the wider enterprise AI market. This undoubtedly results in an unusual lens through which I now view Dr. Reed's position, and I couldn't help but mark up his essay with my own notes as I read through it.

In the event that my perspective--that of an HPC-turned-AI infrastructure practitioner--is of interest to anyone who found Dr. Reed's latest essay as engaging as I did, I've shared those notes below.

New Maxim Two: Energy and data movement, not floating point operations, are the scarce resources.

This has been true since long before exascale in the HPC world. This is not a new maxim. Ironically, it is in the AI world that this maxim is relatively new; as inference overtakes training as the predominant consumer of GPU cycles, we are seeing widespread shortages of DRAM because of the extreme demand for HBM and the memory bandwidth it provides.

New Maxim Three: Benchmarks are mirrors, not levers. Benchmarks rarely drive technical change. Instead, they are snapshots of past and current reality, highlighting progress (or the lack thereof), but they have little power to influence strategic directions.

Benchmarks drive technical change amongst technology providers who act without conviction. The tech industry is full of companies that blindly chase consumer demand, and these companies design entire product lines to achieve high benchmark results in the mistaken belief that those benchmarks are a reasonable proxy for actual productivity. Even worse, many buyers (especially in lower-sophistication markets like enterprise) believe that benchmarks, by virtue of being designed by community organizations that have ostensibly thought deeply about performance, are a good proxy for productivity, and they make purchasing decisions around these same benchmarks.

The net result is that a bad set of benchmarks can create and sustain an entire economy of buyers and sellers who think they are buying and selling something useful, when in fact they are wasting resources (time, energy, and COGS) because none of them actually understand what really drives productivity within their organizations.

Fortunately, the HPC community is generally savvier than enterprises, and most national computing centers now recognize that HPL is simply not a meaningful yardstick. While it used to be good for convincing politicians and other non-technical funders that good work was being done, the discourse around AI has squarely put Rmax in the ground as a meaningful metric. Politicians now understand "hundreds of thousands of GPUs" or "gigawatts," neither of which require a benchmark like HPL to prove.

Also, as an aside, I find it ironic that a paper with Jack Dongarra listed as an author is now saying HPL is a snapshot of the past. I've heard that he is the reason that HPL results achieved using emulated FP64 are not allowed on Top500. Despite achieving the required residuals through more innovative means than simply brute-forcing a problem through FP64 ALUs, submissions using techniques like the Ozaki scheme were deemed incompatible with the purpose of Top500. Which is to say, I think he's the reason why HPL and Top500 have been reduced to a benchmark that reflects outputs (hardware FP64 throughput) rather than outcomes (solving a system of equations using LU decomposition).

New Maxim Four: Winning systems are co-designed end-to-end—workflow first, parts list second.

In HPC, we must pivot to funding sustained co-design ecosystems that bet on specific, high-impact scientific workflows.

I don't agree with this. Funding sustained co-design is just swimming upstream with more conviction.

The real way forward is to find ways to align scientific discovery with the way the technology landscape is moving. This means truly riding the wave and accepting that scientific discovery may have to turn to completely different techniques that achieve their desired precision and validation through means that may render obsolete the skills and expertise some people have spent their careers developing.

Consider the scaffolding of end-to-end workflow automation; a rich ecosystem of technologies exists in the enterprise and hyperscale worlds that has been used to build extreme-scale, globally distributed, resilient, observable, and high-performance workflows that combine ultra-scalable analytics engines with exascale data warehouses. However, realizing these capabilities in practice requires fundamentally rethinking the software infrastructure on which everything is built. The rigidities of Slurm and the inherent insecurities of relying on ACL- and kernel-based authentication and authorization need to be abandoned, or at least understood to be critically limiting factors that the HPC community chains itself to.
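
To make "workflow scaffolding" slightly more concrete, here is a minimal sketch of my own (not from the essay or the paper) of what a retry-aware, dependency-driven simulation campaign looks like when expressed in a general-purpose orchestrator. I'm using Apache Airflow's TaskFlow API (Airflow 2.x) purely because it's a familiar example from the enterprise side; every name below--the DAG, the tasks, the result bucket, the case IDs--is hypothetical.

```python
# A toy simulation campaign expressed as an orchestrated workflow rather than
# a chain of Slurm job scripts. Each task is independently retryable, and the
# fan-out/fan-in structure is declared rather than hand-managed.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def simulation_campaign():

    @task(retries=3)
    def generate_inputs() -> list[str]:
        # Enumerate parameter sets; each becomes an independent task run.
        return [f"case-{i}" for i in range(8)]

    @task(retries=3, retry_exponential_backoff=True)
    def run_case(case_id: str) -> str:
        # Launch and monitor the actual solver however you like
        # (a container, a cloud batch API, even an ssh to a cluster).
        return f"s3://results/{case_id}"

    @task
    def aggregate(result_uris: list[str]) -> None:
        # Reduce step: push results into an analytics engine or data warehouse.
        print(f"aggregating {len(result_uris)} results")

    # Dynamic task mapping: one run_case instance per generated case, each
    # retried on its own if it fails, without touching the rest of the campaign.
    aggregate(run_case.expand(case_id=generate_inputs()))

simulation_campaign()
```

The point isn't that Airflow is the answer for HPC; it's that failure handling, fan-out, observability, and identity are properties of the platform here, not things every application team rebuilds on top of a batch scheduler.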

To make this very specific, consider a bulk-synchronous MPI job running across a hundred thousand GPUs; if one node fails, the whole job fails. The "swimming upstream with more conviction" way of solving this problem is to pay a storage company to build a faster file system, pay some researchers to develop a domain-specific checkpoint library that glues the MPI application to platform-specific APIs, and pay SchedMD to automate fast restart based on these two enhancements. Fund all three projects under the same program, and it is arguably a "co-designed end-to-end workflow."

Riding the wave would be something different though: instead of requiring a job requeue and full restart from checkpoint upon job failure, treat the entire job as an end-to-end workflow. If a node fails, the job doesn't stop; it just transitions into a recovery state, where the orchestrator gives it a new node on which the job runtime can rebuild the state of the dead node using distributed parity or domain-specific knowledge. A fast file system is completely unnecessary for failure recovery. But the application developers would have to abandon the model of an application being a single process invocation in favor of the application being a system whose state evolves with the underlying hardware.
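
To make that concrete (with the obvious caveat that this is a toy sketch of my own, not anyone's actual orchestrator--every name is hypothetical and the hard parts are hand-waved), the shape of the idea looks something like this:

```python
# A job treated as a long-lived system: on node failure it enters a RECOVERING
# state, the orchestrator supplies a replacement node, and the job runtime
# rebuilds the lost shard from redundancy instead of requeueing the whole job
# and restoring from a checkpoint on a parallel file system.
from dataclasses import dataclass, field
from enum import Enum, auto

class JobState(Enum):
    RUNNING = auto()
    RECOVERING = auto()

@dataclass
class Job:
    nodes: set[str]
    state: JobState = JobState.RUNNING
    lost: set[str] = field(default_factory=set)

def rebuild_shard(owner: str, lost: str, peers: set[str]) -> None:
    # Placeholder: erasure-coded or domain-specific reconstruction goes here.
    print(f"{owner} rebuilding shard previously held by {lost} from {len(peers) - 1} peers")

def reconcile(job: Job, healthy_nodes: set[str], spare_pool: list[str]) -> None:
    """One pass of an orchestrator reconciliation loop."""
    failed = job.nodes - healthy_nodes
    if failed:
        job.lost |= failed
        job.nodes -= failed
        job.state = JobState.RECOVERING

    while job.lost and spare_pool:
        replacement = spare_pool.pop()
        job.nodes.add(replacement)
        # The runtime, not the scheduler, owns state reconstruction.
        rebuild_shard(owner=replacement, lost=job.lost.pop(), peers=job.nodes)

    if not job.lost:
        job.state = JobState.RUNNING

if __name__ == "__main__":
    job = Job(nodes={"node00", "node01", "node02", "node03"})
    reconcile(job, healthy_nodes={"node00", "node01", "node03"}, spare_pool=["spare00"])
    print(job.state)  # back to RUNNING, with no requeue and no checkpoint restore
```

The division of labor is the point: the orchestrator's job ends at detecting failure and supplying a healthy replacement, and reconstructing the lost state is the application runtime's responsibility--a contract the MPI-plus-checkpoint-restart model never offers.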

Slurm can't do any of this, because Slurm is tied to the MPI model of parallel execution, which assumes nothing ever fails. Which is to say, I think co-design should be deferred until the HPC community first recognizes that, so long as they continue to approach end-to-end co-design as an HPC problem to be solved by HPC people using HPC approaches, they will continue to swim upstream regardless of how much co-design they do.

New Maxim Five: Research requires prototyping at scale (and risking failure), otherwise it is procurement. A variant of our 2023 maxim, prototyping – testing new and novel ideas – means accepting the risk of failure, otherwise it is simply incremental development. Implicit in the notion of prototyping is the need to test multiple ideas, then harvest the ones with promise. Remember, a prototype that cannot fail has another name – it’s called a product.

The idea is right, but the title is wrong. Prototyping at scale is the wrong way to think about developing leadership supercomputing capability. The largest commercial AI infrastructure providers do not prototype at scale. Instead, they frame their thinking differently: anything done at scale is production, and if it doesn't work, make it work.

In practice, this means forgoing years-long acceptance test processes and beating up suppliers over hundred-page-long statements of work. Instead, these providers accept the reality that they share the responsibility of integration with their suppliers, and if things go sideways, they are working with partners who will not walk away when times get tough.

National-scale supercomputing has always been this way in practice, but the HPC community likes to pretend that it isn't. Consider Aurora: if that system wasn't a prototype-at-scale, I don't know what is. That system's deployment and operations were, and remain, fraught, and it is built on processors and nodes that were cancelled as products before the system even entered production. Yet the theatrics of acceptance testing went on, Intel got paid something, and we all pretend that Aurora is just like Frontier or Perlmutter.

The AI industry doesn't prototype at scale; it just takes risks because the next breakthrough can't wait for every "i" to be dotted and "t" to be crossed. If a hyperscale AI system is a failure, that's fine. The demand for FLOPS is sufficiently high that the system will be utilized by someone for something, even if that use generates low-value results rather than the next frontier model it was meant to build. The same is true for systems like Aurora; it's not like these systems sit idle, even if they don't live up to their original vision.

And rest assured, AI systems prove to be bad ideas just like HPC systems do. The difference is scale: there are multi-billion-dollar AI supercomputers in existence that were obsolete before they even came online, because the problem they were designed to solve became irrelevant in the years it took to build them. But what was really lost? A bit of money and a little time. The GPUs are still used for day-to-day R&D or inferencing, and the time lost was made up for in lessons learned for the systems that followed.

All the big AI systems are prototypes, because AI workloads themselves are continually evolving prototypes. As a result, the line between prototype and production becomes blurry, if not meaningless.

All too often, in scientific computing, our gold is buried in disparate, multi-disciplinary datasets. This needs to change; we must build sustainable, multidisciplinary data fusion.

This is so easy to say, but it always feels empty when it is said. What’s stopping this data fusion? I don’t think it’s willpower or resources. It’s just really difficult to figure out what good any of it would be within a standard theory-based modeling framework. Making productive use of fused multimodal data (meshes, particles, and discrete observations, for example) requires multimodal, multiphysics models. And such models are really expensive relative to the insights they deliver.

To me, this means the challenge isn't in getting the world's scientific data to hold hands and sing kumbaya; it's accepting that there's limited value in actually doing this data fusion unless you're also willing to take on more approximations within the models that consume the fused data, so that the return--science per dollar--comes out ahead of today's physics-based, single-modality scientific models.

The AI community accepts that wholly empirical models are much less interpretable, but that they can much more readily turn multimodal data into results in a meaningfully faster, more resource-efficient way. Consider, for example, the Aurora weather model (no relation to the supercomputer mentioned earlier), which took all sorts of disparate climate datasets and turned them into an incredibly efficient forecasting tool. In a minute on a single GPU, it produces forecasts of comparable quality to what would take hours across multiple GPUs using a physics-based model. And it achieves this efficiency by having trained on a diverse, fused collection of gridded 3D atmosphere data and tabular data.

The only problem, of course, is that the model is much less interpretable than a physics-based model. If the Aurora model's forecast is off, forecasters mostly have to shrug and move on with life. But for the purposes of solving the scientific problem at hand (predicting the weather a few days out), that may be good enough.

Governments must now treat advanced computing as a strategic utility, requiring a scale of coordination and investment that rivals the Manhattan Project or the Apollo program.

The Manhattan Project and the Apollo program had distinct goals with a defined "lump of work" required to achieve them. Advanced computing is not comparable. Computing is a commodity, and it's a far fairer comparison to liken it to oil or gas reserves. And even then, exactly what good are these computing reserves or capabilities, really? Is it one big supercomputer, or many small ones? What is the range of problems that such a strategic utility would be called upon to solve?

In the AI game, advanced computing is certainly a pillar of competitiveness, but it is not necessarily the most limiting one. DeepSeek showed us that ingenuity and massive computing are two orthogonal axes along which new capabilities can be developed: although you can spend a ton of money on GPUs to train a new frontier model, you can also be a lot more clever about how you use far fewer GPUs to do the same thing. And the ratio of people to capital that resulted in DeepSeek-R1 arguably showed that investing in innovation, not just datacenter buildout, has a much higher return on investment.

In the context of the above statement, I think governments would do far better to treat their innovators as a strategic asset and worry less about issuing press releases that lead with how many thousands of GPUs they will deploy. For every thousand GPUs to be deployed on government land in the US this year, how many government researchers, architects, and visionaries have headed out the door, never to come back?