GTC 2026 recap

I recently attended GTC26, NVIDIA's flagship annual conference, and it was a blazing hot week of equal parts glitz and technology. And when I say blazing hot, I mean it literally--San Jose was in the midst of a record-breaking heat wave, with temperatures consistently above 30°C every day. Despite the sizzle though, the steak came out like it got microwaved: hot in a few places, but cold in others.

I left the conference feeling like NVIDIA might've spent much of the last year catching up on the promises that it made at GTC25. There were more updates than big announcements, and many of the new launches had me scratching my head more than anything else.

This isn't to say that I didn't enjoy the week; I learned a lot, even if my biggest takeaways weren't the ones that Jensen was spoon-feeding attendees during his keynote. NVIDIA remains years ahead of its competition in terms of mass-market accelerators and software ecosystem, but Jensen also inadvertently drew attention to ways in which NVIDIA has been fumbling. A bunch of slick new hardware was announced, but the quantitative case for it rested on convoluted plots that wallpapered over the growing complexity and fragility of these technologies--complexity that undercuts the value they can deliver.

As is becoming custom, I've written up the notes I jotted down throughout the week in this blog post. And as is custom, a few disclaimers: I am not an expert in AI, I have no inside knowledge of anything I write, and the views expressed are mine alone (see my full author disclaimer). I also only had an exhibitor pass this year, which meant I watched technical session recordings rather than attending in person. Fortunately, there were lots of nice spaces to meet and talk to people to hear what was interesting, and the technical program was so physically and topically spread out that I'm not sure I missed a lot by just watching recordings.

What follows are the themes that stood out to me, organized in no particular order.

Tokens = money. Don't worry about the details.

For a couple of years now, NVIDIA has been pretty heavy-handed in drilling in the idea that tokens are fungible with money and that the details are less important, but the GTC keynote opened with a sizzle reel that slapped every attendee in the face with the message. Within a minute of that opening video, tokens were mentioned seven times:

Tokens have opened a new frontier, turning data into knowledge and drawing on all we have learned. Tokens are harnessing a new wave of clean energy and unlocking the secrets of the stars. In virtual worlds, they help robots learn. And in the physical world, perfect, forging new paths and clearing the way for a bountiful harvest. In the moments that matter, tokens are already there. And in the miles between, they never stop. They work where human hands cannot. So we may all breathe easier. And the smallest hearts be stronger. Tokens are helping us break new ground.

You could replace every instance of "token" with "AI" in the keynote's opening video and nothing would've been lost. Yet NVIDIA chose to focus specifically on tokens as being important.

Just for fun, I plotted the number of times Jensen mentioned "tokens" during his keynote and annotated it with the context:

What jumps out is that tokens are the hot word when trying to convince people how much money there is to be made by buying new hardware, but they are never mentioned when talking about AI advancements themselves (whether it be enterprise use cases or developing open models).

Why is this?

Because tokens are the linchpin concept that connects GPUs to dollars. LLMs run on GPUs and generate tokens, and all of the frontier model builders monetize LLMs in units of dollars per million tokens. By talking about tokens instead of, say, exaflops, NVIDIA can draw a direct (but tortuous) line between GPUs and revenue, resulting in plots like this one that Jensen showed during the keynote:

If you flash a plot like this up and say it fast enough, equating tokens with revenue means that GPUs are literally money-printing devices. Jensen cleverly casts the quasi-quantitative justification in terms of "tokens per gigawatt" (and therefore "revenue per gigawatt") which sweeps a bunch of the ugly parts of this reality under the carpet:

  • Revenue per gigawatt ignores up-front investment and capital expenditures. Microsoft spent $3.3 billion (in 2024 dollars) on a 450 MW Blackwell datacenter. That suggests the capital outlay for a gigawatt Blackwell datacenter is around $7.3 billion. With memory prices having increased since then, let's say that's $10 billion in GTC 2026 dollars--immediately throwing a third of those GPU revenues out the window. Rubin GPUs are going to be significantly more expensive per unit than Blackwells; given how NVIDIA tends to price their GPUs around value, the above plot suggests that Rubins might cost close to 5x more than Blackwells on a per-watt basis.
  • Revenue per gigawatt also ignores ongoing operational expenditures. This is admittedly small; Microsoft is estimated to pay around $100/MWh, so a gigawatt of GPUs would cost around $850M per year.
  • GPUs depreciate at a staggering pace, with one analysis finding that the value of a GPU's cycles depreciates 70-80% in two years. So that $30B or $150B in annual revenue won't hold for long, and how much of it you capture depends on how quickly you can get GPUs up and into revenue service after they begin shipping. Again with Microsoft as a reference, they first got GB200 NVL72 running in October 2024, but OpenAI didn't release its first model trained on GB200 NVL72 until over a year later--in either December 2025 (GPT-5.2 was partly trained on GB200) or February 2026 (GPT-5.3-Codex was fully trained on GB200). This suggests that it takes a year to begin generating revenue on the latest and greatest GPUs.
  • Jensen's math assumes 100% GPU utilization for the year. Goooooooooooooooooooooooooooooooood luck with that. Especially given the previous bullet, where you can only maximize revenue if you deploy and operationalize as soon as these latest GPUs hit the market.
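To make the gap concrete, here's the napkin math from the bullets above expressed as a quick sketch. Every constant is an assumption I pulled from this post, not anything NVIDIA published:

```python
# Napkin math for "revenue per gigawatt," using this post's assumptions.

CAPEX_PER_GW = 10e9            # ~$10B buildout in GTC 2026 dollars (assumed)
POWER_PRICE = 100              # $/MWh, Microsoft's estimated rate
MW_PER_GW, HOURS_PER_YEAR = 1000, 8760
HEADLINE_REVENUE = 150e9       # Jensen's Rubin revenue-per-gigawatt-year claim

def year_one(utilization=0.6, months_to_deploy=6):
    """First-year picture for 1 GW under less-than-perfect conditions."""
    serving_fraction = (12 - months_to_deploy) / 12 * utilization
    revenue = HEADLINE_REVENUE * serving_fraction
    opex = MW_PER_GW * HOURS_PER_YEAR * POWER_PRICE    # ~$876M/year of power
    return revenue - opex - CAPEX_PER_GW

# Jensen's framing: 100% utilization, revenue from day one, no capex.
print(f"keynote math:  ${HEADLINE_REVENUE / 1e9:.0f}B")
# Subtract capex and power, and assume a slow ramp at modest utilization.
print(f"hedged math:   ${year_one() / 1e9:.1f}B")      # ~$34B in year one
```

And this still ignores depreciation, which claws back 70-80% of whatever value remains within two years.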

Anyone who's peripherally involved with deploying and monetizing the latest and greatest GPUs already knows all this though. So what's the point of making these claims?

As I have come to realize, the AI industry really has two completely different customer bases to which it must appeal:

  1. Engineers and technologists who are making AI better and more useful
  2. Investors who are making money and neither know nor care about AI as a technology

Jensen, NVIDIA, and every other company at the forefront of AI have to play this game where it looks like they're developing technology exclusively for the engineers and technologists, but in reality need to appeal equally to the investors who do not care about the technology. So industry keynotes have become full of fancy plots and quasi-technical explanations that seem like they are designed by and for engineers but are really exclusively for investors. And you can see this in my earlier tokens/minute plot, where talk of tokens was concentrated around talk of money, and talk of AI use cases had virtually no mention of them.

Since I've come to this realization, I've stopped getting upset by what I used to interpret as sloppy math, theater, and borderline deceit. It's just the game we must all play to keep the paychecks coming. Will anyone really make $150B a year because they bought Vera Rubin? Absolutely not, and I think everyone (including investors) knows this. But people can get better value from buying Rubin instead of Blackwell if given the choice, and NVIDIA's framing exposes the right variables: power over rack space, tokens/sec over flops, and the latency-throughput continuum. It just also ignores the half of the equation I bulleted above.

Vertically integrated, horizontally open

Another mantra that Jensen kept repeating in his keynote was this idea that NVIDIA is "vertically integrated and horizontally open;" NVIDIA provides a full-stack solution for AI from software down to hardware architecture, but leaves it up to a diversity of competitors to turn the architecture into an implementation that customers actually buy and deploy. Jensen repeated it enough times in the keynote to make it clear to the audience--or at least, the non-technical ones--that NVIDIA is definitely not anything close to resembling a monopoly.

Horizontally open, as long as hardware is identical

This "horizontally open" message was mirrored this year (as it was last year) by what I call "NVIDIA's not-a-monopoly wall:"

It's meant to showcase how many companies are able to compete within the hardware ecosystem that NVIDIA has created, and enabling companies to compete is definitely a good thing. But if you look at the actual examples, the message doesn't really land at a technical level. For example, here are the liquid cooling assemblies that fit over Vera Rubin node boards. Can you spot the difference?

As best I can tell, the only thing this display shows is that manufacturers have the freedom to use black powder coat or not. Everything else, even down to the placement of screws, appears identical and undifferentiated. What I took away from this is that NVIDIA has effectively achieved horizontal integration even if they don't manufacture these subassemblies.

There's little reason to deviate from the reference architecture; while being innovative might shave a few dollars off the bill of materials, innovation comes with the significant risk of being late to market relative to competitors who strictly stamped out whatever NVIDIA told them to. And in an industry where these GPU servers lose significant value within their first two years, not having a product ready at launch would be catastrophic compared to the marginal differentiation that would result from innovating.

The takeaway for me was that "horizontally open" really just means multiple companies are building identical subassemblies around NVIDIA's GPUs. The obvious advantage to NVIDIA is that they can keep low-margin widgets off their books while still dictating everything required to ensure a good experience for end-users, while the component manufacturers can fight over making a few pennies on the dollar here and there. As a concrete illustration, consider that NVIDIA's gross margin for FY24 was 72.7%, while Foxconn--which is the leading manufacturer of NVL72 subassemblies--had a gross margin of 6.25% in the same FY.

There is a strong disincentive for anyone in the horizontally open ecosystem to be clever, so as best I can tell, nobody tries. That's why Jensen has to remind investors and regulators that, even though every NVL72 rack looks identical, there is a diversity of suppliers behind them.

Vertically integrated

The "vertically integrated" part of the mantra is well motivated. In Jensen's own words:

We are a vertically integrated computing company. There is no other way. We have to understand the applications. We have to understand the domain. We have to understand fundamentally the algorithms and we have to figure out how to deploy the algorithm in whatever scenario it wants to be deployed whether it's a data center, cloud, on-prem, at the edge, or in a robotic system. [...] We offer you the software, we offer you libraries, we integrate with your technology so that we can bring accelerated computing to everybody in the world.

This is undeniably true, and NVIDIA has done an excellent job at developing domain-specific libraries and algorithms that make it easy for developers from every domain to realize speedups from GPUs without being CUDA experts. Not only does NVIDIA ship libraries that make familiar primitives go really fast, but they develop reformulations of existing numerical methods to exploit these primitives and solve domain-specific problems. NVIDIA makes it easier to not think about the fact that you're using GPUs when computing.

The downside to this vertical integration, of course, is that it's easier not to think about what you're doing. In industry, it feels like there's a growing majority of people (and companies) who live downstream of NVIDIA and depend on them to do this hard work of understanding workloads and building solutions. And when NVIDIA announces solutions to complex problems, it's often accompanied with little more than Jensen getting on stage and saying "just trust me, bro" followed by a bunch of partner companies jumping up and agreeing that NVIDIA is brilliant.

This would be fine if NVIDIA were infallible, but as I realized at GTC this year, NVIDIA has been fumbling a bit: they've quietly walked back previous proclamations, and some newer ones just don't make a lot of sense. Let's talk about that next.

NVIDIA has begun backpedaling in public

I wrote earlier that GTC (and every other industry-driven AI conference) contains equal measures of news intended for engineers and investors. This was only my second GTC, but it is the first time I felt like NVIDIA might've gotten out over its skis to impress investors in the past year, and is now spending this year reckoning with those promises.

Specifically, there weren't a lot of new announcements; the few people with whom I talked about the keynote agreed that nothing really new and exciting came out. There was new hardware (more on that later), but a lot of the announcements were restating or refining announcements that happened at GTC last year. For example, NVIDIA...

  • re-introduced the Rubin NVL72 nomenclature this year after making a point to rebrand Rubin NVL72 as Rubin NVL144 at last year's GTC. The hand-wavey economics slides were also updated from comparing Hopper/Blackwell to comparing Blackwell/Rubin. But there was nothing new here beyond assuring investors that Rubin would make them even more money than Blackwell did.
  • re-introduced the Rubin Ultra's Kyber rack, which differs from the prototype showcased at GTC25 in that it appears half as dense as originally announced. With the change of the NVL nomenclature though, it's hard to figure out exactly how much Kyber has downsized in capability.
  • talked a little about the STX storage platform, which is the reference architecture for CMX, which NVIDIA introduced as ICMS at CES in January. In three months it has gone by three names and remains a universal source of confusion among both the storage vendors and analysts with whom I spoke.
  • said less about the Feynman GPU at GTC26 than it did at GTC25. I took this to mean that whatever work has been done over the last year may still be changed in the coming year.

What was more interesting, though, is what was erased from existence this year. In addition to the Rubin NVL144 nomenclature quietly reverting to NVL72 without any commentary, there was simply no mention of the Rubin CPX GPU.

Clearly NVIDIA made a mistake in declaring that a Rubin GPU with 128 GB of GDDR7 was the best way to accelerate disaggregated inferencing. What's more, not only did NVIDIA make this mistake, but all of the Rubin CPX launch partners (Cursor, Runway, and Magic) made the same mistake. Or, NVIDIA asked them to jump, and they asked how high.

Cancelling a chip is not a big deal; anyone who's worked in large-scale HPC has lived on this roller coaster ride. However, these cancellations have historically happened before the chips were ever publicly announced, leaving the public none the wiser. The only other company that has cancelled chips after announcing them publicly (and realizing the benefits of investor excitement) is Intel--remember Rialto Bridge and Falcon Shores?

Of course, I'm not saying NVIDIA is going down the same path as Intel. But in their rush to maintain the theatrics of constant innovation to keep investors happy, this is the first time NVIDIA seems to have rushed an announcement. The timeline, as best I can tell, was shockingly fast:

Reading between the lines, Rubin CPX was dead within three months of being announced. And the thing that replaced it, the Groq 3 LPU, is a messier fit that solves a different problem. More on that below.

But NVIDIA isn't as important to AI as it may seem

There was another telling statement in Jensen's keynote--telling for what it left unspoken--that made me realize that NVIDIA isn't as indomitable as I previously thought.

He was trying to make a splash (with investors, not technologists) by saying NVIDIA sees a trillion dollars in high-confidence demand through 2027. He accompanied that claim with the following slide to explain where that trillion bucks would come from:

He explained:

Simultaneously, we were very pleased last year that Anthropic has come to NVIDIA.

If he hadn't called out Anthropic as a major new source of revenue for NVIDIA, I wouldn't have realized that Anthropic, which has undoubtedly developed many of the world's leading models, had been tremendously successful without NVIDIA until now. Even then, Anthropic didn't switch to NVIDIA to unlock new AI capabilities; it is turning to NVIDIA simply to supply capacity because it cannot build fast enough with Google and Amazon to meet its demand.

In fact, at the time of GTC 2026, none of the top three frontier models (Claude Opus 4.6 Thinking, Claude Opus 4.6, and Gemini 3.1 Pro Preview) were exclusively dependent on NVIDIA technology. And of the top ten models, a minority (four out of ten) relied on NVIDIA exclusively. This is to say, Jensen pointed out that leading AI companies don't need NVIDIA to develop world-class AI models, and NVIDIA isn't doing anything to change that.

Between the unrelenting talk of tokens and optimizations for inferencing, Jensen also outright said that "every part of AI" is "way past training now." And of course NVIDIA would say this, seeing as how they don't lead in training, and they have no real way to pull ahead and enable model builders to develop better models than what they can do on competing accelerators. So, they are pivoting their messaging to the place where they are winning (at least on financial terms), which is inferencing.

This made me rethink NVIDIA's overall position in AI innovation (which is different than the AI market). Advancements in AI come from developing better models, and right now, the frontier labs that are the most research-forward have the least dependence on NVIDIA. NVIDIA is selling tokens by the megawatt cheaper than anyone, but it's not leading the way towards making those individual tokens any more valuable.

If you extrapolate this situation out a few years, we may arrive at a place where NVIDIA isn't playing a significant role in developing breakthroughs in AI anymore. Instead, NVIDIA will become the cheapest way to deploy models developed elsewhere, and its optimization points will have to change. Throughput per unit of energy (tokens per second per watt) will remain the critical optimization parameter, but anything that impacts that parameter will need to be tightly controlled.

Specifically, NVIDIA will need to turn its attention to reliability to ensure that its GPUs' aggregate tokens per year (not tokens per second!) aren't being diluted by frequent failures and long repair times. In training, failures are frequent and tolerated with the understanding that paving new ground (like developing a bigger model) will involve discovering new infrastructure problems. But with inferencing, failures and other inefficiencies directly impact profitability, and tolerance of them is much lower.

New hardware

One of the most exciting parts of any big tech conference is the hardware announcements and displays, and NVIDIA had a lot of metal to showcase this year. And in a unique twist, this GTC was the first time NVIDIA had an opinion on storage hardware.

New racks for Vera and Rubin

NVIDIA made it clear this year that they will continue to support both their standard-ish 50V Oberon rack (which current NVL72 systems use) as well as the high-power 800V Kyber rack going forward. This is a good thing, as some hyperscalers have had to go to extreme lengths to support the requirements of even the basic GB200 NVL72, and requiring datacenters to support multi-hundred-kilowatt liquid-cooled racks would've severely limited adoption of Rubin Ultra.

In addition, NVIDIA announced they'll also support having rack-scale Ethernet instead of NVLink on the backplane of these Oberon racks in a platform they're calling "NVIDIA MGX ETL." And they had one on display in the exhibit hall:

It has 32 trays and the official blog claims that these trays can be a mix of HGX (NVL8?), RTX (PCIe workstation-class), XPUs (whatever that means), and "etc." (also whatever that means). All of these trays connect to NVIDIA Spectrum-X Ethernet switches through back-loaded cable cartridges, and the dummy trays they had in the demo rack suggest that scale-out can be accomplished through front-facing network ports.

This appears to compete with the flexibility offered by the Cray GX platform, though it is unclear whether third parties can/will build their own trays for MGX ETL. In addition, I'm not convinced that using NVIDIA's wonky cable cartridges to carry Ethernet is a great idea. GB200 NVL72 was super touchy; hastily loading a server tray can bend a pin, which can be catastrophic since it requires the whole rack to be taken out of service for repair. Unless NVIDIA has made the MGX ETL cable cartridge significantly more robust, I don't think the benefits of having a structured cable assembly outweigh the complexity and operational burdens that come with it.

NVIDIA also displayed two new Oberon rack configurations specifically for Vera Rubin and Vera: the NVL8 (left) and Vera-only rack:

The Vera Rubin NVL8 (specifically branded as "Vera Rubin HGX NVL8") on the left is interesting; NVIDIA has chosen to split the NVL8 baseboard into its own self-contained tray that contains 8x Rubin GPUs, 8x ConnectX-9 NICs, and NVLink 6 Switches. This GPU-only tray then connects to one of several flavors of CPU sleds (Vera or x86) through (I presume) some sort of external PCIe or NVLink-C2C cabling. I didn't get a photo of the back of these NVL8 racks or trays so I'm not super clear on what's going on, but NVIDIA did say that this rack can be configured in the MGX ETL setup so that Ethernet is plumbed in the back with cable cartridges.

Here are what the Vera and Rubin HGX NVL8 trays look like:

If you look at the NVL8 rack on the left of the above picture, you can see that there are four groups of eight trays, and they alternate between Vera CPU trays (with 8x SSD slots in the middle) and Rubin GPU trays (with 8x ConnectX-9 OSFP receptacles). This means each group has four NVL8 nodes, and this rack has 128 GPU sockets.

The rack on the right contains 32x Vera-only CPU trays and two middle-of-rack switches in the MGX ETL rack configuration with cable cartridges in the back. This is what each of those Vera trays looks like:

Each contains 8x Vera CPUs divided up into 4x compute nodes. Each such compute node is connected to 2x front-loading NVMe drives and a single OSFP(?) NIC port. This sled design appears to compete with Cray's GX250 blade, which hosts 8x Venice CPUs. Not to be outdone, Cray also announced a doubly dense GX server blade for Vera, the GX240, which supports up to 16x Vera CPUs.

The only other new Vera Rubin rack was the "NVIDIA DGX Vera Rubin NVL72" rack, which is just the same boring old NVL72 configuration (18 server trays, 9 switch trays) with newer parts.

Of these new rack configurations, the most interesting is the HGX NVL8 configuration. It represents an acknowledgment that x86 with NVIDIA GPUs remains a strong combination, and NVIDIA may as well integrate that configuration in its high-density/high-power rack configuration. Allowing one rack to support either 16x NVL8 domains or 1x NVL72 domain allows the market to decide how much value there really is in a giant NVLink domain versus many smaller NVLink domains in an otherwise comparable physical footprint.

When given the choice between Vera Rubin with a small but robust NVLink domain or a big and fragile NVLink domain, will hyperscalers really buy into the economic benefits that Jensen touts around bigger and bigger NVLink domains? Will the reliability, availability, and outlay cost downsides of NVL72 drive people towards NVL8 now that you can have an NVIDIA CPU and NVIDIA GPU configuration for both? It'll be interesting to see.

New Kyber refinements

Last year, NVIDIA debuted Kyber, its 576-GPU, 600 kW rack for the Rubin Ultra platform, and had a prototype on display at GTC25. The design changed a bit since then:

| Component | GTC25 Kyber | GTC26 Kyber |
|---|---|---|
| Chassis/rack | 4 | 2 |
| Compute sleds/chassis | 18 | 18 |
| Sleds/rack | 72 | 36 |
| SSDs/compute sled | 4 | 4 |
| NICs/compute sled | 4 | 2 (or 4?) |
| GPU sockets/compute sled | 4 | 4 |
| CPU sockets/compute sled | 2 | 2 |
| NVL midplane pins | Male | Female |
| Compute-facing NVL midplane connectors | 18x4 | 18x4 |
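Multiplying out the table confirms the halving. This quick check counts GPU sockets; the old "576-GPU" branding appears to have counted dies, two per socket:

```python
# Derived from the table above: GPU sockets per Kyber rack, by GTC year.
gtc25 = dict(chassis=4, sleds_per_chassis=18, gpu_sockets_per_sled=4)
gtc26 = dict(chassis=2, sleds_per_chassis=18, gpu_sockets_per_sled=4)

def sockets(rack):
    return rack["chassis"] * rack["sleds_per_chassis"] * rack["gpu_sockets_per_sled"]

print(sockets(gtc25), sockets(gtc26))   # 288 vs 144 GPU sockets per rack
```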

The rack appears to have become half as dense as it once was with only two chassis per rack:

NVIDIA did not restate this new Kyber rack's power requirements, but the unit they had on display appears to have 8x power shelves (or steppers, or power stabilizers, or something) integrated into it now. I don't know enough about how 800V DC racks are going to work to guess what's going on, but it wouldn't surprise me if last year's 600 kW Rubin Ultra Kyber racks are now closer to 300 kW. I don't think NVIDIA gave hyperscalers enough lead time to build datacenters that efficiently support the original Kyber's power density.

The compute trays have undergone a major redesign, and they too are much less dense now. A lidded version was on display, and Jensen showed an unlidded version during his keynote:

They contain the same 2 Vera + 4 Rubin Ultra ratio, but you can see that a ton of space is taken up by Vera's SOCAMM memory modules and VRMs. In addition, there appear to be 4x NIC ASICs which are connected to weirdly small front-facing ports along with one big NIC ASIC which serves two bigger OSFP-looking ports. Not sure what's going on there.

The bigger change is in how the NVLink backplane is physically assembled. The cable cartridge of NVL72 is gone, replaced by integrated, rear-facing, monstrous NVLink Switch trays that act as both switches and cable backplanes:

I didn't get a look at what the back of the Kyber midplane looks like, but it is clear that this huge blade connects two midplanes (and their 18x compute trays) together within a rack through 6x NVLink 7 switches. My further guess is that 36 compute trays on the front will be matched by 6x of these NVLink Switch trays to enable all-to-all connectivity across the NVL144 domain.

My first reaction to seeing this new Kyber NVLink Switch tray was horror; given how many reliability issues arose from GB200 NVL72 switches and cabling, having what appears to be one giant FRU seems like it would increase the blast radius of a single failure. However,

  • NVIDIA claimed that Rubin Ultra will be able to operate even when NVLink switches are down, meaning that not all failures are catastrophic.
  • The flaky cable cartridge of NVL72 is being replaced by a passive midplane with female connectors. This means that if a pin does get bent, it's on a hot-swappable NVLink Switch component which doesn't necessarily require disassembling the whole rack like it does today.

Still, this is a huge FRU. A single switch failure will probably knock out 17% of the whole rack's NVLink bandwidth, and I'll bet keeping spares in supply to minimize downtime will be a very expensive proposition since you've got to buy switches in units of six.

One genuinely new and exciting development is that NVIDIA will be enabling NVLink scale-up over optical connections starting with Rubin Ultra. There wasn't much to show, but Jensen stumbled through this slide:

He kept tripping over scale up versus scale out, Oberon and Opteron(?), and not being clear about whether he was talking about Rubin, Rubin Ultra, or Feynman so it wasn't completely clear to me what he really meant. But as best I can tell, he said they'd be doing the following:

| GPU | Rack | Domain Size | Scale-up Technology |
|---|---|---|---|
| Blackwell | Oberon | NVL72 | Copper |
| Blackwell | Oberon | NVL288 | Cu + Optical (prototype only) |
| Blackwell Ultra | Oberon | NVL72 | Copper |
| Rubin | Oberon | NVL72 | Copper |
| Rubin Ultra | Oberon | NVL576 | Cu + Optical |
| Rubin Ultra | Kyber | NVL144 | Copper |
| Feynman | Oberon | NVL72 | Copper |
| Feynman | Kyber | NVL144 | Copper |
| Feynman | Kyber | NVL1152 | Cu + Optical |

Optical scale-up will all use co-packaged optics and intersect with Rubin Ultra in the Oberon rack. According to a blog post from NVIDIA, this optical scale-up will use a two-level NVLink fabric and is exclusive to Oberon--not Kyber. In my mind, this paints a picture:

  • Rubin Ultra in Oberon will still have 9x NVLink Switch trays (two switch ASICs each), but the NVLink 7 switches will have 144 ports instead of the 72 ports used in Rubin+Oberon racks.
  • 72 ports will still go down to rack-local Rubin Ultra GPUs using rear-facing copper cable cartridges, just like Rubin+Oberon.
  • 72 ports will now go out the front to separate racks of spine switches.

I assume this will be the case based on how NVIDIA appears to have done optical scale-up on their Blackwell+Oberon prototype, Polyphe, mentioned in the blog post:

This seems to show 8x GB200 racks, each with:

  • 36x B200 GPUs (648 NVLink ports per rack)
  • 18x 72-port NVLink 5 switches (1,296 ports per rack)
  • Each NVLink switch tray appears to have 18 links going out of the rack (162 links/rack of optical scale-up)

And at the right end of the row appear to be:

  • One rack with 32x 1U switch trays (64 switch ASICs)
  • One rack with 4x 1U switch trays (8 switch ASICs)
  • ...and 4x 1U north-south switches with cyan fibers (these don't matter)

This maps neatly into a two-level non-blocking tree with 144 leaf switch ASICs (10,368 NVLink 5 leaf ports, half of which go to GPUs) and 72 spine switch ASICs (5,184 NVLink 5 spine ports, all of which go to leaves). Given that there appear to be 18 fibers coming out of each leaf switch tray, each fiber appears to carry 4x400G lanes of NVLink 5.
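Here's that arithmetic as a quick sketch; every count below is my reading of the Polyphe photo rather than anything NVIDIA has specified:

```python
# Port accounting for the (assumed) Polyphe two-level NVLink 5 fabric.

RACKS = 8
GPUS_PER_RACK = 36
PORTS_PER_GPU = 18              # NVLink 5 ports per B200
LEAF_TRAYS_PER_RACK = 9         # each tray holds two 72-port switch ASICs
RADIX = 72                      # ports per NVLink 5 switch ASIC
FIBERS_PER_TRAY = 18            # out-of-rack links visible per switch tray

gpu_ports   = RACKS * GPUS_PER_RACK * PORTS_PER_GPU       # 5,184
leaf_asics  = RACKS * LEAF_TRAYS_PER_RACK * 2             # 144
leaf_ports  = leaf_asics * RADIX                          # 10,368
uplinks     = leaf_ports - gpu_ports                      # 5,184 (non-blocking)
spine_asics = uplinks // RADIX                            # 72 = 64 + 8

# 162 fibers leave each rack, carrying 648 uplink ports: 4 ports per fiber.
ports_per_fiber = (uplinks // RACKS) // (LEAF_TRAYS_PER_RACK * FIBERS_PER_TRAY)

print(leaf_asics, spine_asics, ports_per_fiber)           # 144 72 4
```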

Also of note: the blog post claims this GB200 system is NVL576, but I think that is an error resulting from the old Rubin Ultra nomenclature where a 72-GPU rack was being called NVL144. There is no way to fit 576 B200 GPUs into 8 racks using the 2U servers shown in the photo of Polyphe.

NVIDIA announced the intention to enable optical scale-up of Kyber racks in the Feynman generation, and I interpret this to mean that Feynman will use NVLink 8 with a higher switch radix. Given that the Kyber rack for Rubin Ultra will support 144x GPUs and 36x NVLink 7 switch ASICs, it's reasonable to assume that NVLink 8 will double the radix of NVLink 7 to enable optical scale-up.

Just as Rubin Ultra's doubling of switch radix over Rubin allowed it to scale from NVL72 to 8x72, I assume that Feynman doubling the effective radix of Rubin Ultra in the Kyber form factor will allow it to go from NVL144 to 8x144. The renderings shown at GTC confirm that the Feynman/Kyber NVL1152 will fit into eight racks, but they don't show the end-of-row spine switch rack that I suspect will be required.

Now, it's a separate question of who actually wants an NVLink domain that has 1,152 endpoints. Unless NVIDIA's intent is to eventually merge scale-up NVLink and scale-out InfiniBand, these two-level NVLink fabrics are adding a lot of complexity to both the physical infrastructure and the software infrastructure required to make NVLink behave well under a routed topology.

If the first-generation NVL72 is any indicator, hyperscalers will have to make hard decisions as to whether NVL576 or NVL1152 actually provides the economic benefits to justify the added cost and loss of availability that comes with complexity. I think NVL72 was largely sold on the idea that, if NVIDIA says it's a good idea, it must be a good idea. But as I mentioned earlier, NVIDIA has started to make mistakes out in the open. Just because NVIDIA says NVL1152 is a good idea doesn't make it one.

Megawatt rack infrastructure

Although the power requirements of Kyber may have gone down, there was still a showcase of high-voltage DC electrical infrastructure to support NVIDIA's new 800V standard. I particularly liked this 800V DC power sidecar that APC had on display:

These sorts of power racks are going to have to live alongside every high-power GPU rack like Kyber, and seeing each component labeled was helpful in understanding what goes into one of these sidecars.

  • 4x 2U "DC Output PDUs" - These look like they provide circuit protection for the 800V outputs that go over to the GPU rack. Maybe the throw switches are breakers?
  • 4U power management controller with manual disconnects? Maybe one per zone?
  • 7x 3U "110KW 3RU 800Vdc Power Shelves" - The things that do the actual work. Assuming 6+1, this is enough power for the original 600 kW Kyber rack.
  • 5x "Li-ion BBU" - Battery backup units. These are less about allowing GPUs to ride through power outages and more about protecting upstream electrical gear from the high-frequency oscillations caused by 144 Blackwell Ultra GPUs synchronously switching on and off during training.

Obviously I'm no expert in datacenter electrical infrastructure, but this rack is a powerful visual of how far into the extremes AI is driving the computing industry; this whole rack does the equivalent of what the hot-swap power supplies that slide into the back of individual servers do at smaller scales.

Groq LPUs

Maybe the biggest new announcement of GTC was the unveiling of NVIDIA's new Groq-based products and integrations. Everyone and their mother has already written about this, so if you want the facts, I recommend coverage by ServeTheHome, More Than Moore, The Next Platform, Tom's Hardware, or anywhere else.

Just so I keep my story straight, this appears to be the product timing and relevant features:

  • Groq LP30 + VR - FP8-only
  • Groq LP35 + VR Ultra - introduces NVFP4
  • Groq LP40 + Rosa + Feynman - introduces NVLink

What's significant about this announcement is that NVIDIA has changed its story around what it feels is worthwhile disaggregation. Whereas the now-cancelled Rubin CPX processor was positioned as a prefill accelerator that would be integrated with regular Rubin GPUs with HBM for decode in a single chassis, these Groq LPUs are being positioned for two niche use cases.

Use case 1: Expert parallelism during decode

These Groq accelerators are funny in a couple of ways:

  1. They have no HBM. This means they can't cache many keys and values, which makes them pretty useless for computing attention when inferencing.
  2. They have only a little bit of SRAM, but that SRAM is really fast. This means you can't store a lot of model weights on them, which means you either have to use fine-grained tensor parallelism (which requires the huge bandwidths of NVLink) or find a way to shard a model into tiny, self-contained tensors.

Factor #1 means these processors are really only good for computing feed-forward networks (FFNs), and factor #2 means you have to find a way to shard a model's FFNs into something that fits into a few hundred megabytes of FP8. But guess what? Mixture-of-experts (MOE) models do exactly that.

Employing some napkin math and remembering that I don't know what I'm talking about, let's consider how NVIDIA's Groq LP30 might accelerate a super-sparse MOE model like DeepSeek-V3:

  • DeepSeek-V3 has 256 routed experts. The model dimension is 7,168 and the hidden dimension for each routed expert is 2,048. Each expert has three matrices--the up, down, and gate weights--each with 2048x7168 elements. Since LP30 only supports FP8, each such weight is one byte, and each expert takes up 3x2048x7168 bytes = 44 MB per layer.
  • There are 58 mixture-of-expert layers, which works out to 2.5 GB of weights per expert--more than will fit in an LP30's 500 MB of SRAM. However, each LP30 also gets 16 GB of host DRAM.
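The same napkin math in code form; the model dimensions are DeepSeek-V3's published config, and the LP30 capacities are NVIDIA's claims:

```python
# Sizing DeepSeek-V3's routed experts against an LP30's memory.

D_MODEL, D_HIDDEN = 7168, 2048     # DeepSeek-V3 model and expert hidden dims
MATRICES = 3                       # up, down, and gate projections
MOE_LAYERS = 58
BYTES_PER_WEIGHT = 1               # FP8: the only format LP30 supports
LP30_SRAM, LP30_DRAM = 500e6, 16e9 # claimed SRAM and host DRAM per LP30

per_layer = MATRICES * D_HIDDEN * D_MODEL * BYTES_PER_WEIGHT   # ~44 MB
per_expert = per_layer * MOE_LAYERS                            # ~2.55 GB

print(f"{per_layer / 1e6:.0f} MB/layer, {per_expert / 1e9:.2f} GB/expert")
print(f"layers resident in SRAM at once: {int(LP30_SRAM // per_layer)}")  # 11
print(f"whole expert fits in host DRAM: {per_expert < LP30_DRAM}")        # True
```

So one LP30 per routed expert can hold an entire expert in DRAM but only around 11 of its 58 layers in SRAM at once--exactly the shape of problem that pipelined prefetching solves.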

I think it's reasonable to envision mapping a sparse MOE model like DeepSeek-V3 into a 256-accelerator NVIDIA LPX rack using a combination of static expert parallelism placement (one LP30 per routed expert) and pipeline parallelism to prefetch sequential layers of each expert's FFNs into the LP30 SRAM while the previous layers are computing.

The challenge is that these LP30s are still only accelerating the FFN part of each transformer layer; you still need Rubin GPUs in order to compute the attention for both prefill and decode stages. So, actually disaggregating inferencing with these Groq LP30 accelerators might use Rubin and LP30 as follows:

  • Prefill is still compute-bound and must occur on Rubin GPUs.
  • The attention part of decode is always memory bandwidth bound, and becomes increasingly constrained by HBM capacity as context length and batch size grow. Large contexts are better served on Rubin GPUs because their large HBM capacity can quickly stream keys and values into GPU SRAM during each decode step.
  • The FFN (expert) part of decode is also memory bandwidth bound because each token requires all of the active experts' weights in SRAM. However, experts don't need any keys or values, so most of the LP30 SRAM can be dedicated to hosting expert weights. In addition, the order in which weights need to be available in SRAM is deterministic, so I think the latency of shuffling expert weights from DRAM to SRAM can be prescheduled.
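Because the layer order is deterministic, that prescheduling can be as simple as a double-buffered loop. A minimal sketch of what I'm imagining--the load and compute functions are hypothetical stand-ins, not any real NVIDIA API:

```python
# Hypothetical double-buffered streaming of one expert's FFN weights from
# LP30 host DRAM into SRAM, overlapping each transfer with compute.

def run_expert(layers, load_layer_async, compute_ffn, activations):
    pending = load_layer_async(layers[0])        # prefetch the first layer
    for i in range(len(layers)):
        weights = pending.wait()                 # now resident in SRAM
        if i + 1 < len(layers):
            # The next layer's weights are known in advance, so issue the
            # DRAM->SRAM transfer before this layer's matmuls begin.
            pending = load_layer_async(layers[i + 1])
        activations = compute_ffn(weights, activations)
    return activations
```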

The biggest issue is that LP30 will run out of SRAM when inferencing using large batch sizes; doing so drives up the arithmetic intensity into a regime where the FFNs become compute-bound and the ultra-high-bandwidth SRAM is no longer the bottleneck.

NVIDIA has been promoting this goofy plot to try to explain how these properties of Groq somehow allow you to generate "ultra" valuable tokens that will make you gobs of money:


Listen to Jensen's keynote if you want to try to follow the mental gymnastics behind this plot. However, it boils down to:

  1. Rubin remains best for all prefill, the attention part of decode, and the FFN/expert part of decode for large batch sizes (e.g., when using continuous batching).
  2. Use LP30 for only the FFN/expert part of decode for small batch sizes. If you want to return the next token really fast, this means you aren't waiting to build up a big batch before passing it through the next decode step; you are giving each user their own decode step with a batch size of 1. In this case, the FFN part of decode dominates, the FFN part is memory bandwidth bound, each expert is relatively small and fits neatly into 500 MB, and LP30's SRAM offers 10x the memory bandwidth of Rubin.
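A toy roofline makes the batch-size split concrete. With FP8 weights (one byte each), an FFN performs roughly two flops per weight byte per batched token, so arithmetic intensity scales with batch size. The hardware numbers here are placeholders I made up, not published LP30 specs:

```python
# Toy roofline: when does an FFN stop being SRAM-bandwidth bound on an LPU?

SRAM_BW  = 80e12    # bytes/s -- placeholder for "10x Rubin" SRAM bandwidth
PEAK_FP8 = 3e15     # flop/s  -- placeholder FP8 compute peak

RIDGE = PEAK_FP8 / SRAM_BW      # flops/byte where bandwidth and compute meet

def ffn_regime(batch_size):
    intensity = 2 * batch_size  # ~2 flops per FP8 weight byte per token
    return "bandwidth-bound" if intensity < RIDGE else "compute-bound"

for b in (1, 4, 16, 64, 256):
    print(f"batch {b:3d}: {ffn_regime(b)}")
# batch 1-16: bandwidth-bound (LP30's fast SRAM wins)
# batch 64+:  compute-bound   (Rubin's flops win)
```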

Of course, there are gotchas that Jensen didn't talk about during his keynote. Foremost, using LP30 now splits disaggregated inferencing into three steps:

  1. Prefill on a Rubin - compute bound
  2. Decode (attention) on Rubin - memory bandwidth/capacity bound
  3. Decode (FFN/experts) in LP30 - memory bandwidth bound

Orchestrating data movement across these three steps is now getting really complicated:

  • Keys, values, and activations must pass between different Rubin GPUs between prefill and decode attention. This occurs over NVLink and is the same as before.
  • Activations must pass between Rubin GPUs and spray out over a subset of LPUs in a different rack over scale-out Ethernet or InfiniBand. There is probably a one-to-many broadcast since a Rubin will be handling more activations than an LP30.

And because LP30 doesn't support NVFP4, you have to quantize your model to use NVFP4 for attention but FP8 for activations if you intend to inference on this Rubin+LP30 architecture.

Furthermore, Groq is fundamentally a dataflow accelerator while Rubin is a matrix accelerator. Not only is CUDA unsupported on LP30, but the two architectures are semantically completely different. You will have to compile attention for Rubin and FFNs for both Rubin (to handle large batches) and LP30 (to handle small batches).

Of course, a lot of these differences can be hidden behind a PyTorch-like abstraction, and NVIDIA has promised to handle the orchestration and data movement of disaggregated decode behind its Dynamo framework. However, magical software tends to take a while to mature; as far as I know, Dynamo is still rough around the edges and not in widespread production use despite being a year old now. So it may be a while for LP30 to really start printing money as Jensen has promised.

On a better note, the lack of NVFP4 will be addressed in the LP35, and GPUs will be able to hand off the outputs of attention to Groq accelerators via NVLink starting with LP40. I presume this will both simplify the process of quantizing models to be Groq-capable and avoid the complexity of disaggregating over two different fabrics.

One last thing that bugged me about this: Jensen's convoluted economic model again leans on the trick of quoting tokens/sec/watt to completely erase the up-front cost of buying yet another accelerator type. In addition, disaggregating inferencing across LP30s using expert parallelism will necessarily leave a significant fraction of each LPX rack idle at any given time; this is the nature of MOEs.

I guess it's fair to assume that neither of these factors is very big though; there is no costly HBM in Groq accelerators, so their per-unit cost should be much lower than a Rubin GPU's. Similarly, there's not a lot within these LPX racks to soak up a ton of power, so the fact that the average utilization of an LPX rack will be low doesn't hurt the throughput-per-watt of the combined GPU+LPU much.

Use case 2: Speculative decoding

I don't think he said it during his keynote, but the NVIDIA LPU blog mentions speculative decoding as another advanced inference-time optimization through which these LPUs can accelerate token throughput.

The technique runs two copies of an LLM during inferencing: the full-sized model, and a miniature version called the "draft model." The principle is that the draft model is so small and fast that it can generate multiple tokens using normal token-by-token decode faster than it would take the big model to generate just one token. These "draft tokens" are then fed into the big model all at once like a miniature prefill step, and the big model accepts or rejects some or all of the draft tokens in the same(ish) time it would've taken to generate a single token with the full model. In doing this, you can generate/accept multiple tokens using a single forward pass of the big model, effectively multiplying the GPUs' tokens/sec throughput.
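In code, the loop looks something like the sketch below; `draft_model` and `big_model` are hypothetical stand-ins for whatever runs on the LPUs and GPUs, respectively:

```python
# A bare-bones speculative decoding loop (hypothetical model interfaces).

def speculative_decode(tokens, draft_model, big_model, k=8, max_tokens=256):
    while len(tokens) < max_tokens:
        # 1. The small, fast draft model proposes k tokens one at a time.
        draft = []
        for _ in range(k):
            draft.append(draft_model.next_token(tokens + draft))
        # 2. The big model scores all k drafts in ONE forward pass (a
        #    miniature prefill) and accepts some prefix of them.
        accepted = big_model.verify_prefix(tokens, draft)
        tokens += accepted
        # 3. If a draft token was rejected, the big model supplies its own
        #    corrected token so the loop always makes progress.
        if len(accepted) < len(draft):
            tokens.append(big_model.next_token(tokens))
    return tokens
```

Every accepted draft token is one fewer full-model decode step, which is where the throughput multiplier comes from.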

LPUs fit neatly into this paradigm because

  1. You can fit the draft model inside the SRAMs of a bunch of LPUs. You'd still use standard model parallelism techniques to achieve this, and you maintain a KV cache for the draft model.
  2. You still run the big model inside GPUs as normal as well, and you're free to use all the normal inferencing optimizations like disaggregation. There's a KV cache for the big model too.
  3. LPUs generate a long sequence of draft tokens really fast as the decode phase begins.
  4. The sequence of draft tokens is then sent over to the Rubin NVL72 cluster and processed in a single forward pass. The big model accepts some or all of those draft tokens, and they appear to the end-user as a single decode step generating a clump of words all at once.
  5. The last accepted token is sent over to the draft model on the LPUs (via the scale-out network) so it knows where to begin generating more tokens, and this process loops.

Assuming a steady inference load, the LPUs in this case have a higher average utilization because they're all computing both the attention and expert parts of the draft model.

However, there remains a fundamental problem: an LP30 rack is waaaay less capable than a Rubin NVL72 rack, so you would probably need multiple LPX racks to support the inferencing throughput of a single Rubin NVL72 rack. In addition, the ratio of LPU to GPU depends on exactly what big model and draft model you use, and how well those decompose over 500 MB blocks of SRAM versus hundreds of gigabytes of HBM. And as models and workloads evolve, so will the optimal ratio of LPU to GPU. As a result, I expect that even speculative decoding will eventually wind up with LPU racks that are either oversubscribed (requiring some inference requests to be serviced entirely by GPUs) or undersubscribed (if draft models don't neatly divide into 256 LPUs/rack).

STX/ICMS/CMX

This GTC was the first time NVIDIA took an opinionated stance on storage since they now have a BlueField that is sufficiently capable of serving as a self-hosted storage server. Jensen quickly mentioned this new STX platform during his keynote, and there were two implementations in the exhibit hall:

As with any dutiful NVIDIA reference architecture implementation that is first to market, the two units on display were visually identical, even down to the placement of LEDs and USB-C ports. Their guts were a little differentiated; the Supermicro version had a row of fans inside whereas the Quanta version did not:

These STX nodes were actually described at CES back in January, when Jensen announced the ICMS cluster-local key-value cache (which has since been rebranded as CMX...I think?), and they contain a pair of self-hosted, low-bin BlueField-4 DPUs (64 cores, 128 GB LPDDR5X), 24 NVMe drives, and not much else.

The intended use case is to attach these SSD boxes to the non-blocking parts of a SuperPOD's front-end network and configure them to act as dumb JBOFs. On the individual GPU servers, there are complementary BlueField DPUs which actually implement the storage logic that turns a bunch of network-attached NVMe drives into some kind of distributed KV cache using Dynamo's KVBM.

There's not a lot of room for third parties to innovate on this hardware platform since the design is so prescriptive; I have a page in my digital garden about ICMS/CMX in which I collected what each vendor claims they will do with these STX boxes (very little, as it turns out) and hypothesized ways that a few specific distributed storage systems could be implemented on STX as well. I won't rehash them here.

However, I will note that I had private discussions with multiple people throughout the week about what they thought about these STX boxes, and the results were unanimous: nobody knows what they're for.

I think the reality doesn't make for a great story, which is why NVIDIA hasn't been clearly telling it: this is just distributed block storage, and using it as a distributed KV cache is very simple to implement. There aren't a lot of third parties chomping at the bit to talk about their STX solutions because they have limited value. KV caches are ephemeral caches by definition, so none of the typical enterprise value-add applies; it just needs to go really fast (so don't worry about compression) and have a consistent index. If it loses data, who cares? You just recompute the prefix and move on with life. Fancy data protection schemes are worthless.
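To underline how little product surface there is here, the entire value proposition fits in a sketch like this, where a plain dict stands in for the fabric-attached NVMe and `recompute_kv` is whatever the inference engine already does on a cache miss:

```python
# A minimal ephemeral prefix KV cache: misses are cheap, durability is moot.

import hashlib

class PrefixKVCache:
    def __init__(self, recompute_kv):
        self.store = {}                   # stand-in for STX block storage
        self.recompute_kv = recompute_kv  # fallback: just redo the prefill

    def _key(self, token_ids):
        return hashlib.sha256(repr(tuple(token_ids)).encode()).hexdigest()

    def get(self, token_ids):
        kv = self.store.get(self._key(token_ids))
        if kv is None:                    # lost, evicted, corrupted? who cares
            kv = self.recompute_kv(token_ids)
            self.store[self._key(token_ids)] = kv
        return kv
```

No replication, no erasure coding, no snapshots--the usual "enterprise features" genuinely have nothing to do here.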

The people who need distributed KV caches today--the top-end inferencing providers--have already solved this problem using in-house solutions cooked up with Claude Code. And the people who would pay for a third-party KV cache solution on top of STX haven't hit the scale at which they can justify the costs of putting a dozen petabytes of dumb fabric-attached NVMe alongside their GPU clusters.

So for the time being, STX is purely a hardware solution that solves a niche problem for the rich and famous. The unwashed masses may need it in a few years once they've perfected disaggregated inferencing and start craving the next level of optimization, but for now, I'm bearish on seeing STX deployed at scale in the wild. The few who will use it are not the type to talk about how they optimize their inferencing infrastructure, since that special sauce ties directly to their overall profitability.

Potpourri

In addition to all the technical stuff being showcased in earnest, I made a few observations during the week that I wrote down just for me.

Robots are stupid

One of the weird aspects of GTC is that it aims to be a public spectacle in addition to a tech conference. The most obvious manifestation of that this year was the number of stupid robots in attendance.

Last year, Jensen said that this would be the year of physical AI, and I guess that had to become true come hell or high water. As a result, there was a plethora of robots on display across the exhibit hall and in the outdoor spaces between buildings. And as one would expect, they attracted a lot of attention and photos from attendees as they moved about. But the shine wore off very quickly after the first day, and I found them kind of annoying once it became clear that they didn't actually do anything useful.

For example, one exhibitor had this humanoid robot on display in an outdoor area of theirs. Before I could go in and see it, the greeter told me I should definitely "talk to her. I mean it." So I did.

As best I could tell, it was just a voice-to-voice LLM (like you have on your phone...) wrapped in some unnecessarily humanoid robot shape. What's more, when another visitor asked if it (she?) could walk, the answer was no--the legs didn't actually function. The humanoid shape was style, not substance. But its eyes moved and it could make facial expressions! You could see how excited it was to recite the same canned phrases about being at GTC 2026.

Fortunately, this stupid robot didn't move, so its lack of intellectual merit was its biggest annoyance. Not so for an army of ankle-height delivery bots that NVIDIA just had roaming around the most congested crosswalks and vehicle barriers. They didn't talk or actually deliver anything, but they did cause people to stop in the middle of the street to take photos.

Moving up the value chain was a bipedal robot that ran after people in the GTC park area and cracked jokes about their appearance. Here's a picture of it learning how to become a Terminator from one of San Jose's finest.

This bipedal robot was actually pretty impressive, because it was literally running after people (which is a little terrifying) and could clearly use a multimodal model to generate voice based on what it was seeing. I didn't get a chance to talk to its operator (who was never more than a couple dozen feet away from it) to figure out exactly how it worked, but it accomplished the goal of being a spectacle.

I am told that the exhibit hall was full of half-cocked robot demos as well; I saw plenty of recordings of robots that served coffee really slowly, pipetted liquids very slowly, and did generic arm movements very slowly. So overall, it seemed like the state of physical AI remains very much in the realm of "that's neat, but who cares?"

This is not to say that I think robotics and physical AI are a dead end though! Just because they're all stupid at GTC26 doesn't mean they'll still be stupid at GTC27 or GTC28. In fact, I saw a version of the bipedal joke-cracking Terminator when I visited NVIDIA headquarters last fall, and all it seemed to do then was walk very slowly and fall over. The fact that it was now running and joking told me how quickly the state of robotics is advancing.

Finally, there was one robot that wasn't stupid at all:

Waymo now covers the area surrounding GTC, so I could have a robot drive me to my hotel when I was ready to go back.

GPUs in space are stupid

For some reason, Jensen wedged a reference to deploying GPUs in space towards the end of his keynote. As I said earlier, there are two distinct audiences for AI conferences (engineers and investors), and his lip service to space-based AI was definitely aimed at the investors. Since he seemed to want to whiz past the topic as fast as he could, I'll memorialize what he announced:

  • NVIDIA developed Thor, a rad-hardened chip that is already embedded in satellites. This isn't going to be training any AI models though.
  • NVIDIA also announced the NVIDIA Space-1 Vera Rubin Module, a Vera Rubin module that...runs in space. Of course. According to the blog post about it, it will be available "at a later date." You can tell how much they've committed to this.
  • Jensen has no idea how to cool GPUs in space, but he's "got lots of great engineers working on it."

That last line sounded familiar to me.

I'll put it in writing--GPUs in space is a stupid idea. And it appears that Jensen is aware. He's just not allowed to say it.

Booth duty

GTC this year was also the first time I had booth duty--real booth duty--since I worked at SDSC back in 2013. I'd worked Microsoft booths in the past, but I never really engaged people since I really had no idea what any of the Azure SaaS offerings were. Best I could do was talk about what little hardware was on display and hope that someone else knew whether some feature would ever be supported on AI Foundry...whatever that is.

This time was a refreshing return to a place where I'm more comfortable and talking about technology that I actually understand. And because my booth got a ton of traffic, I got to experience a broad cross section of the people attending GTC: students, product managers, venture capitalists, executives, engineers, and entrepreneurs all walked up and were willing to be accosted for a few minutes.

It's hard to capture everything that I took away from my turns at the booth, but a few things come to mind:

  • People are worrying about structured data. Coming from a world of HPC, I was amazed how many AI people are more interested in comparisons to Databricks. Not a single person asked about Lustre or any other parallel file system.
  • People just walk up to a booth cold and say "so tell me what you do." Having the elevator pitch ready is really important, and booth duty is a great place to try out new material and do a little A/B testing. This doesn't really happen when you work for the government or a tech giant.
  • There were a lot of students from a diversity of domains in attendance. And I even met a person or two who understood very little about technology at all. But they were all attending GTC just to take it in and try to figure out where to start. These people have a much greater tolerance for a firehose than I do.
  • People often hear "infrastructure for AI" and immediately assume you're a GPU cloud provider operating datacenters for others. I was surprised by how many people made this leap, and it goes to show how prolific GPUaaS providers have become. Then again, some people hear "VAST" and think "GPUs in space."
  • The center of mass of the AI industry is much higher up the stack than it is in the HPC industry. Few people I ran into actually cared at all about servers and networks. They wanted to talk about columns and vectors.

Maybe my most important takeaway is how diverse the AI community is. Coming from HPC, I assumed that AI is just a workload within HPC and everything about AI could be viewed through that lens. But I quickly learned that when you're at GTC, you've got to open every conversation with, "So what do you do?" Otherwise, you might start yammering about InfiniBand to someone whose concerns lie in model safety.

Concluding thoughts

This year's GTC left me with the impression that NVIDIA is in a subtle transition. It's still years ahead of any competitor in mass-market AI infrastructure, but the gap between the ambitions of Jensen's keynotes and what they're delivering afterwards is widening. The quietly erased Rubin CPX, a slimmer Kyber, and the STX design that has left many people puzzled are all little things that don't add up to a big problem, but they paint a picture of a company that's been running so fast that it's starting to trip over its own feet.

Everything presented last week suggested to me that the AI industry will keep buying whatever NVIDIA is selling because NVIDIA is still way in the lead and executing impressively. But the most advanced models in the world are already being developed without NVIDIA, and the story of why the latest GPUs (and now LPUs) are truly the greatest is requiring increasingly convoluted keynote plots and mental gymnastics. So the steak is still sizzling in the parts where it matters most--the parts that investors keep eating up. But that doesn't mean there aren't other parts that are getting harder to enjoy.