AI doesn't need giant supercomputers after all
I attended the 2026 Salishan Conference on High Speed Computing last month, and it was a week well spent in coastal Oregon hearing what many of the world's experts in scientific computing are worried about. Although I took copious notes and many of the presentations are online now, I'm not really sure how closed-door the week is meant to be. As such, I don't feel quite comfortable sharing a distillation of my observations (beyond what I posted to Bluesky that week) without someone telling me I can.
However, I do feel comfortable sharing a presentation that I gave. I wasn't asked to give this talk; rather, I was allowed to present it as part of Salishan's fun "Random Access" session, a round of lightning talks that are both anonymously submitted and voted on in the first two days of the conference. I submitted a talk titled "AI doesn't need giant supercomputers after all" based on a half-written blog post I've been struggling to finish, and much to my surprise, enough people voted for it that I was given ten minutes to present it.
Because these "Random Access" presenters are only notified a few hours before they have to take the stage, the talks can be kind of nutty. Rather than waste the 90 minutes I had to make a pretty presentation, I made one that looked like garbage but contained the essence of the blog post. And now I'm sharing it here, roughly as I presented it.
AI doesn't need giant supercomputers after all!
The first time I attended Salishan last year, I came as a representative of Microsoft and gave a lightning talk about the massive scale at which AI supercomputers were being built. A year has passed, I have a new employer, and the world of AI has fundamentally changed. So I thought the Salishan attendees might value an updated perspective that combines a little of what I used to do at Microsoft (system architecture for massive-scale AI training clusters) with what I do now at VAST (combining AI research trends with usage data at scale) to paint a picture of where AI training infrastructure is headed. This picture is based partly on fact, partly on extrapolation, and partly on editorializing, and you can decide which is which.
A brief history of massive AI supercomputers
To understand where AI training infrastructure is going, let's first consider where it's come from:
The history of frontier model training at massive scales begins with the release of GPT-4 in March 2023, which we will call t=0.
At t=0 in March 2023, GPT-4 was released and the world rejoiced. It was perhaps the first time when scaling laws, which predict how good a model you can train if given enough GPUs and data, were shown to hold up. OpenAI and Microsoft hypothesized that with a massive supercomputer and a massive model, a massive leap in quality could be realized. Microsoft dropped the billions to build the machine, and OpenAI delivered the model. This meant that the race was on to build an even bigger supercomputer to train an even better model.
Two months later (t=2, in May 2023), still high on the success of GPT-4, OpenAI claimed that superintelligence is "conceivable ... within the next ten years."
Fourteen months later (t=14, in May 2024), Microsoft announced that a massive new supercomputer had been built, and it was specifically designed to train the model that would succeed GPT-4.
Just four months later (t=18, in September 2024), a guy named Sam claimed that superintelligence would arrive in "a few thousand days." He also posted on Reddit that superintelligence would be achievable "on current hardware," which at the time meant Hopper GPUs.
Two months after that (t=20, in November 2024), Sam went on to say that "we basically know what to go do" to achieve superintelligence.
Three months later (t=23, in February 2025), OpenAI released a massive, new model. Sam said it embodied "a different kind of intelligence" that was "magic." OpenAI openly discussed how difficult it was to train a model of its scale.
Two months after that (t=25, in April 2025), OpenAI deprecated this magic model.
Three months after that (t=28, in July 2025), OpenAI shut down this magic model. The public could no longer use it.
A month after that (t=29, in August 2025), Sam said that AGI is "not a super-useful term."
So what happened?
What happened in the year between "we basically know what to go do" and "not a super-useful term?"
Briefly,
- Microsoft and OpenAI spent a lot of money to build a massive new supercomputer to train the model that was to succeed GPT-4.
- OpenAI trained a massive model with the full capability of that massive supercomputer.
- The model that resulted from this massive infrastructure investment and training effort wasn't very good.
- A very expensive lesson was learned: the juice is not worth the squeeze when it comes to building ultra-massive supercomputers to train ultra-massive models.
(Not coincidentally, I spent this year supporting the training of a massive model on a massive AI training cluster at Microsoft.)
Massive models are a massive pain
With hindsight being 20/20, it's not hard to understand why the juice was not worth the squeeze. The reality is that the squeeze required to create a massive model using a massive supercomputer is phenomenally hard on both sides of the equation.
Training massive models is straight up painful from an operational standpoint. Doing anything at massive scale is hard enough, but consider the unique challenges of training a massive frontier model:
- At this point in history, it was believed that better models were only possible with better GPUs. This meant that these training runs were happening on GPUs that, for all intents and purposes, were the first ones off the production line and had never really been tested in production before.
- In addition to the GPUs being brand new, everything else was too--CPUs, NICs, PCIe, protocols, and packaging were all new. Weird, rare errors and failures that don't get screened out in testing all appear when training begins at scale.
- Many of the errors and failure modes are brand new as a result. There's no amount of automation that can work around cryptic error messages that have never been encountered before, much less documented.
The result of these factors is that a team of humans must babysit training infrastructure--both software and hardware--around the clock throughout training. And these humans have to be specialists, as everything is being debugged in production. Combined with the fact that someone is paying for the GPUs whether training is up or down, training these models demands large teams of highly stressed, highly skilled engineers working nonstop.
Inferencing massive models is not very appealing either, although this is more an economic problem than an infrastructure problem. Massive models require a massive amount of HBM (and therefore GPUs) to service a single prompt, and this directly translates into much higher cost per token. Taking OpenAI's massive "magic" model as an example:
- At launch, it cost 15x more per token than OpenAI's contemporaneous mainstream model, GPT-4o. If we make the reasonable assumption that the margins on this magic model were no higher than those on the mainstream model of the time, this pricing suggests the magic model required 15x more GPUs.
- It is reasonable to assume that the mainstream model was designed to balance model capability with economics. The sweet spot at the time was fitting a single model replica on a single H100 HGX node; this maximizes available HBM without incurring the cost and complexity of having to scale out over InfiniBand to service a single prompt.
From these two assumptions (plus the assumption that both the magic model and the mainstream model were optimized for the "current hardware" of the time), it follows that a single inference on the magic model would've required around 120 GPUs, or 15 HGX nodes. That is an expensive squeeze for juice that had to be qualified as "a different kind of intelligence."
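To make that arithmetic explicit, here is a back-of-the-envelope sketch using only the assumptions above (the 15x price ratio, one 8-GPU HGX node per mainstream replica, and equal margins). The numbers are illustrative, not anything OpenAI has disclosed.

```python
# Back-of-the-envelope estimate of GPUs per inference replica for the "magic"
# model, using only the assumptions stated above; illustrative numbers only.

GPUS_PER_HGX_NODE = 8    # one H100 HGX node holds 8 GPUs
PRICE_RATIO = 15         # magic model's per-token price vs. GPT-4o at launch

# Assumption: the mainstream model fits one replica per HGX node.
mainstream_gpus = GPUS_PER_HGX_NODE

# Assumption: equal margins, so per-token price scales with GPUs per replica.
magic_gpus = PRICE_RATIO * mainstream_gpus
magic_nodes = magic_gpus // GPUS_PER_HGX_NODE

print(f"~{magic_gpus} GPUs ({magic_nodes} HGX nodes) to serve one magic-model replica")
# -> ~120 GPUs (15 HGX nodes) to serve one magic-model replica
```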
Innovation happened on other fronts
While the world's largest AI training cluster was being used to train this massive follow-on to GPT-4, though, a different direction was being explored as a new potential frontier in models: reasoning.
OpenAI never said much about the infrastructure used to train their first reasoning model, but there are a few data points that form a shape around it:
- The release date of the first reasoning model was sufficiently close to that of the massive magic model that they were probably trained on different supercomputers.
- Microsoft didn't mention building any second, massive supercomputers beyond the one intended to train the follow-on to GPT-4.
- OpenAI never released a video describing how hard it was to train their first reasoning model.
These facts all suggest that this first reasoning model was trained on a smaller, older supercomputer, perhaps in less time, and with less of a herculean effort required from infrastructure operators.
And yet, the world rejoiced when this reasoning model was released, much in the same way that it rejoiced when GPT-4 was released. That stands in stark contrast to how the world reacted (or didn't react) to the massive model released a few months later, which required significantly more capital and time to train.
The significance of this point in late 2024/early 2025 cannot be overstated; it was an inflection point in the thinking behind what it took to create newer, better frontier models. The juxtaposition of a cheap-to-train/cheap-to-inference reasoning model with a painful-to-train/costly-to-inference massive model made it clear that throwing more parameters, GPUs, and dollars at AI was not going to yield limitless improvements that would realize AGI. Being smarter, not bigger, was going to be the path forward.
I would also say that this is the point at which the scaling laws, which predicted the success of GPT-4, failed. But much like Moore's Law, the AI industry simply changed how scaling laws were defined to ensure that they lived on. Now instead of simply using parameters, FLOPS, and tokens to predict model performance, there are scaling laws for reinforcement learning, reasoning, and every other dimension of model design that represents designing smarter, not just bigger, models.
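For reference, the "classic" form of scaling law being alluded to here is something like the Chinchilla-style parametric loss curve (Hoffmann et al., 2022), which predicts model quality purely from parameter count and token count; I'm citing it as one concrete example, not as the exact formulation any particular lab used.

```latex
% Chinchilla-style scaling law (Hoffmann et al., 2022): predicted loss L as a
% function of parameter count N and training tokens D, with fitted constants
% E, A, B, \alpha, and \beta.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```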
So why are massive AI supercomputers still being built?
In the fall of 2025, around the time AGI became "not a super-useful term," both Microsoft and Amazon unveiled their next-generation, ultra-massive supercomputers:
- Microsoft Fairwater is a 450 MW supercomputer design that connects "hundreds of thousands" of GPUs on a single coherent fabric for synchronous training. Microsoft is building multiple such systems with the stated intent of "large-scale distributed training across multiple, geographically diverse Azure regions."
- AWS Rainier is a supercomputer with "nearly half a million Trainium2 chips" built to train "future versions" of Anthropic's frontier models.
Why would these hyperscalers spend billions on such ultra-massive training clusters if AI models don't actually need them? Sunk cost, mostly.
It takes a few years to build these ultra-massive supercomputers; for example, the Fairwater site in Wisconsin was first purchased in mid-2023, publicly announced in 2024, and will be complete in 2026. For context, this ultra-massive supercomputer was conceived right around the time GPT-4 was released and scaling laws had their first major validation. In the years it took to actually build it, AI research found better ways to advance models beyond pure scale. In a sense, these systems now coming online were designed to solve a problem that no longer exists.
This is not to say that these systems are worthless for training models though! They just aren't essential anymore, and it's unclear whether there will be justification to build another generation of ultra-ultra-massive supercomputers that follow.
So what are these giant supercomputers useful for today?
When these massive supercomputers were designed, it was thought that they would be used to train massive models using massive amounts of tokens for massive amounts of time. That has proven to yield limited benefit. However, there are ways in which building these singular, massive training systems offers an advantage over multiple, smaller clusters.
A very nice property of training AI models is that they can strong-scale exceptionally well. This is the basis for data parallelism; you create replicas of a model, let each replica learn from non-overlapping pieces of the total input corpus in parallel, and synchronize all replicas after every step. The time it takes to train a model is governed by how much training data you use, so to speed up training, you can simply distribute the input data over more data-parallel replicas.
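As a concrete (and heavily simplified) sketch of what one synchronous data-parallel step looks like, here's a toy version with NumPy standing in for the GPU replicas and the all-reduce; real frameworks do the same gradient averaging with collectives over the fabric.

```python
# Toy sketch of synchronous data-parallel training: each replica computes
# gradients on its own data shard, then all replicas average gradients
# (the "all-reduce") so they apply identical updates. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_replicas = 4                           # data-parallel model replicas
w = np.zeros(8)                          # shared model weights (toy linear model)
lr = 0.1

# Toy regression data, sharded so each replica sees non-overlapping samples.
X = rng.normal(size=(1024, 8))
y = X @ np.arange(8) + rng.normal(scale=0.1, size=1024)
shards = np.array_split(np.arange(1024), n_replicas)

for step in range(100):
    # Each replica computes gradients on its shard (in parallel on real GPUs).
    grads = [2 * X[s].T @ (X[s] @ w - y[s]) / len(s) for s in shards]
    # The "all-reduce": average gradients so every replica takes the same step.
    w -= lr * np.mean(grads, axis=0)
```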
Practically speaking, this means that if it takes you a month to train a model on 10,000 GPUs, you can train that model in three days if you have 100,000 GPUs. And based on recent data profiling a couple dozen production training runs that I happened to have, this appears to be what model trainers are doing these days:
Rather than spending weeks or months training impractically big models on moderately sized GPU clusters, it looks like people would rather spend hours or days training moderately sized models on massive GPU clusters.
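The arithmetic behind that tradeoff is simple if you assume (idealistically) perfect strong scaling, which real runs only approximate:

```python
# Idealized strong-scaling arithmetic: the same training job spread over more
# data-parallel replicas. Assumes perfect scaling efficiency.
gpu_days = 30 * 10_000                   # "a month to train a model on 10,000 GPUs"
for n_gpus in (10_000, 50_000, 100_000):
    print(f"{n_gpus:>7,} GPUs -> ~{gpu_days / n_gpus:.0f} days of training")
# 10,000 GPUs -> ~30 days ... 100,000 GPUs -> ~3 days
```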
At its surface, this may seem trivial; 10x faster training time still requires using 10x more GPUs, so the cost is a wash. But there are practical benefits to using more GPUs to spend less time training:
- Operationally, staffing a training run 24x7 for weeks or months is just brutal. I don't know many people who have done this once and are willing to do it a second time. I think there was a time when the promise of AGI being right around the corner helped people muscle through the difficulties, but as these massive models get trained and yield disappointing results, it is probably getting harder to sustain these marathons at scale.
- Training massive models is very unpredictable. Even if a 1T-parameter training run appears stable in its early phases, aberrations in the data, model architecture, or infrastructure can cause learning to go in a direction that renders the model irrecoverably useless. This is very hard to predict, and it's very painful to throw away weeks of training because a model has diverged beyond repair. It's far preferable to find out your massive model will not work after a few hours rather than a few weeks, because you can regroup and rearchitect faster.
The other side of this coin is that building a massive GPU cluster is fundamentally more difficult and costly than building many smaller GPU clusters.
AI is being forced to smarten up
The cost/benefit of building massive supercomputers is rapidly changing. The math used to be a simple matter of "more GPUs means better model," but the system design space now has to balance cost, complexity, flexibility, and time to solution.
What is clear, though, is that AI doesn't need bigger supercomputers to build better models anymore. Bigger models are hard to train and costly to inference, and as the AI industry matures and becomes a little more cost-conscious around building models that are both capable and profitable, smaller but smarter model architectures are emerging as better alternatives to simply scaling up.
What's equally noteworthy is that, as the state of the art in AI turns to being smarter instead of going bigger, a lot of ancillary infrastructure challenges are also cooling down. Networks don't need to scale quite so much in the immediate future, reducing the urgency for technologies like co-packaged optics and next-gen transport protocols to be widely available. Similarly, the insane I/O requirements being driven by optimizations like KV cache offload look less critical as smarter model architectures implement more scalable, linear-like attention mechanisms.
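As a rough illustration of that last point (a toy sketch, not any particular production architecture), compare the state a linear-attention-style layer carries against a softmax-attention KV cache; the feature map and sizes below are arbitrary illustrative choices.

```python
# Toy sketch (NumPy) of why linear-attention-style layers relax KV-cache pressure:
# softmax attention must retain every past key/value (state grows with sequence
# length), while a kernelized/linear variant folds them into a fixed-size state.
import numpy as np

d = 64                                     # head dimension
phi = lambda x: np.maximum(x, 0.0) + 1e-6  # simple positive feature map (assumption)

S = np.zeros((d, d))                       # running sum of phi(k) v^T  -> O(d^2) state
z = np.zeros(d)                            # running sum of phi(k)      -> O(d) state

rng = np.random.default_rng(0)
n_tokens = 10_000
for t in range(n_tokens):                  # stream tokens; state size never grows
    k, v, q = rng.normal(size=(3, d))
    S += np.outer(phi(k), v)
    z += phi(k)
    out = (phi(q) @ S) / (phi(q) @ z)      # this step's attention output

# Softmax attention at this point would be holding a KV cache of 2 * n_tokens * d
# values per head, which is exactly the state that KV cache offload tries to move.
print(f"linear-attention state: {S.size + z.size} values; "
      f"softmax KV cache: {2 * n_tokens * d} values")
```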
I can't help but think this will make the traditional HPC community feel a bit smug, since they have long known that simply doing the same dumb things at bigger scales has its limits. Money is good at getting down the runway faster, but there aren't any shortcuts to getting off the ground.