Lessons learned from three years in cloud supercomputing

I recently decided to leave Microsoft after having spent just over three years there, first as a storage product manager, then as a compute engineer. Although I touched many parts of Azure's infrastructure during that time, everything I did was at the intersection of large-scale supercomputing and hyperscale cloud. There was no shortage of interesting systems to figure out and problems to solve, but as I began to wrap my arms around the totality of hyperscale AI training in the cloud, I also began to see the grand challenges that lay ahead.

Outside Microsoft's Silicon Valley Campus minutes after I was escorted off the premises.

Although many of those challenges would probably be fun and exciting to tackle, the more I learned, the more I found myself asking the same two questions: what did I want to do with the rest of my career, and was the path I was following going in the right direction? I spent a lot of time thinking about this, and my decision to leave Microsoft ultimately reflects the answer at which I arrived. But rather than indulge myself by recounting my introspection, I thought I would share some of the things that I learned while at Microsoft in the hopes that others find value in my experience.

To that end, I've split this post into two sections:

  1. Things I've observed about HPC and technology trends from the perspective of a cloud/hyperscale/AI practitioner and provider, and
  2. Things I've realized about jobs and careers from the perspective of someone who has worked in academia, a successful startup, government, and now Big Tech, and who is about halfway through his career

I consider this to be the concluding chapter of a three-part series that began with Life and leaving NERSC and continued with How has life after leaving the Labs been going.

Also, please note that I authored this the day after my employment at Microsoft ended, and I was not beholden to any company or organization at the time of writing. The views expressed below are mine alone.

HPC

Everything I did at Microsoft touched supercomputers in one way or another, and my day job was exclusively supporting Microsoft's largest AI training supercomputers. Despite that, I did a lot of moonlighting in support of Azure's Federal business, and this is how I justified giving talks at events like NERSC@50, SC, and Salishan in my last year. It's also what let me straddle both worlds: I had rare, first-hand knowledge of how the de facto largest supercomputers in the world were built and used, and I had a front-row seat for how leaders in the traditional supercomputing world perceived (and sometimes misunderstood) what we were doing in the cloud.

Before I get into specific observations though, I should clarify some nomenclature that I will use throughout:

  • Supercomputers are the piles of compute nodes with a high-speed interconnect that are designed to solve one big problem in parallel. This is a generic term to describe the instrument, not its workload.
  • HPC, traditional HPC, modsim, and scientific computing all refer to the ecosystem built around using something like MPI to solve a problem rooted in some type of science. Every big supercomputer run by DOE, procured through EuroHPC, or sited at the world-famous, government-funded supercomputer centers falls into this category.
  • Cloud, hyperscale, and AI training all refer to the ecosystem built to train large language models. The supercomputers are run by hyperscale companies like Microsoft, Amazon, or Meta whose backgrounds have not historically been in the world of supercomputing.

I realize that these are not very precise, but they're the easiest way to contrast what I learned inside Microsoft (a hyperscale cloud) with the world I came from prior (traditional HPC).

HPC wants to be like the cloud, not in it

When I left NERSC in May 2022, I speculated that the future of large-scale supercomputer centers would follow one of two paths:

  1. They develop and squish cloud technologies into their supercomputers to make them more cloud-like, or
  2. They abandon the idea of buying individual systems and instead enter into long-term relationships where flagship HPC systems are colocated inside cloud datacenters sited in places with low-cost, low-carbon power.

I was hoping that the desire to continue building systems after passing the exascale milestone would make the next click-stop follow path #2, but early indications (across the global HPC landscape) are that the community has chosen path #1.

HPC centers around the world are embracing the idea of cloudifying on-prem supercomputers by adding virtualization, containerization, and integration with other services to enable complex workflows. And as a part of that, they're reinventing many of the technology integrations that have always been first-class citizens in cloud: CSCS added capabilities to create "versatile software-defined clusters" on their latest Cray system, Alps. NERSC's next system, Doudna, is envisioned to allow its users to "move from programming the supercomputer to programming the datacenter." But none of these systems are actually using commercial cloud services in non-trivial ways.

In the year or two that followed ChatGPT, the notion of large-scale supercomputers in the cloud was a green field, and cloud providers were open to chasing all sorts of silly ideas. This made it the ideal time for the leadership HPC community to get a seat at the hyperscale table. Although their budgets couldn't compete with AI, HPC centers could've drafted off of the AI buildout's investments and offered the societal impacts of using GPUs for science as a nice complement to the societal impacts of using GPUs for AI training.

Much to my dismay, though, that window of opportunity was spent decrying the investment in hyperscale and AI rather than trying to exploit it; that window was the year of "us versus them." And unfortunately, that window has essentially closed as accountants and CFOs have now sharpened their pencils and are searching for returns on the investments made in GPU infrastructure. The intrinsic value of supercomputing infrastructure in the cloud has been reduced to the point where Microsoft's CEO outright said they were turning away customers who just wanted to pay for GPU clusters, because higher-quality revenue could be made from inferencing services that use those same GPUs.

So even if the HPC community woke up tomorrow and realized the long-term benefits of partnering with commercial clouds (instead of trying to copy them), I don't think cloud providers would respond with the same enthusiasm to meet in the middle now as they would have a year or two ago. I don't think this was a deliberate decision on behalf of the cloud providers, and they may not even fully realize this change. But the future of hyperscale supercomputing is rapidly crystallizing, and because HPC wasn't present in the solution, there's no room for it in the final structure.

Cloud is expensive, but not for the reasons most think

It's been easy to write off the cloud as too expensive for HPC, and most people do silly math based on public list prices for VMs to justify their position. The narrative usually goes something like, "if a single GPU VM costs $40/hr, then running 10,000 of them for five years will cost 17X more than our on-prem supercomputer!" That's not how it works, and nobody pays that price. That $40/hr is the maximum possible price, and it includes the cost to the cloud provider of keeping nodes idle in the event that someone shows up and suddenly wants to use one on-demand.
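
To make the "silly math" concrete, here is that back-of-envelope calculation sketched in Python. The $40/hr rate, node count, and five-year term come from the hypothetical quote above; the roughly $1B on-prem price is my own stand-in, chosen only to show where a "17X" headline comes from:

    # Naive list-price math -- the calculation the paragraph above argues against.
    # The $40/hr rate, node count, and five-year term come from the quote above;
    # the ~$1B on-prem price is my own assumption, not a real procurement figure.
    HOURLY_LIST_PRICE = 40          # $/hr per GPU VM, on-demand list price
    NODE_COUNT = 10_000             # GPU VMs
    YEARS = 5
    HOURS = 24 * 365 * YEARS

    naive_cloud_cost = HOURLY_LIST_PRICE * NODE_COUNT * HOURS
    on_prem_cost = 1.0e9            # assumed cost of a comparable on-prem system

    print(f"naive cloud cost: ${naive_cloud_cost / 1e9:.1f}B")           # ~$17.5B
    print(f"vs. on-prem:      {naive_cloud_cost / on_prem_cost:.1f}x")   # ~17.5x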

But even if you cut out all the profit for the cloud provider and just look at the cost of the physical infrastructure, building a supercomputer in the cloud is just more expensive than putting a bunch of whitebox nodes into a traditional HPC datacenter. There are a few reasons for this; here are some of them, in no particular order:

High availability: Every cloud datacenter has redundant power, and most of them have very redundant power. This is provisioned independently of whatever goes inside of that datacenter, so when you deploy a 10 MW supercomputer inside a 10 MW cloud datacenter, that comes with at least 10 MW of backup diesel generators, UPSes, and the associated electrical infrastructure. HPC workloads don't really need this, but it's hard to deploy HPC in the cloud without a ton of generators and UPSes coming along for the ride. This is changing with AI-specific cloud datacenters now being built, but these AI datacenters still have way more redundant power than a typical on-prem HPC datacenter. Building a cloud datacenter with the minimal redundancy that a traditional HPC datacenter has would mean that facility couldn't ever be used for anything but HPC, and that would undercut the overall flexibility upon which cloud economics are built.

Cloud-side infrastructure: Every compute node has to be attached to the frontend cloud network in addition to a backend high-speed network like InfiniBand, unlike a traditional supercomputer where nodes are only attached to one high-speed network. While the cost of the smart NIC in each node is just a couple hundred dollars, every cloud supercomputer has to have a complete frontend network built out to support every single compute node--that's a ton of switches, routers, and fiber that must be properly provisioned all the way up to the cloud region in which those nodes are deployed. This frontend network is what enables all the cool cloud features on every node (full SDN, integration with other cloud services, etc), but these features aren't generally worth their cost when running meat-and-potatoes HPC workloads like MPI jobs by themselves. Their value only really shines through when executing complex workflows that, for example, couple an MPI job with stateful services and globally accessible data sharing with fine-grained access controls, all fully automated through programmable APIs and full RBAC.

AI-optimized system architecture: AI-optimized GPU supercomputers contain a bunch of components that your typical Cray or Eviden simply wouldn't have. I wrote about the differences between AI and HPC supercomputers elsewhere, but in brief, AI workloads specifically benefit from having tens of terabytes of local SSDs and all-optical (no copper) RDMA fabrics. These add to the COGS (cost of goods sold) of an AI-optimized supercomputer, meaning that a supercomputer with a thousand GPUs designed for AI is going to be more expensive than one designed for scientific computing no matter where it's deployed. And cloud providers are all optimizing their supercomputers for AI.

There's a bunch of other cloud "stuff" required as well; every cloud region has a first footprint, which is a LOT of general-purpose servers and storage required to support the basic cloud control plane. Before any user-facing cloud resources (including supercomputers) can be deployed, there have to be tens or hundreds of racks of this cloud "stuff" up and running. And although the cost of that first footprint is amortized over many customers in larger or older cloud regions, large single-use infrastructures (like supercomputers) carry a proportionally larger fraction of the cost of deploying it.

So when you look at the cost of running a single compute node in a cloud supercomputer, there are a bunch of extra ingredients baked in that you wouldn't get by just signing a check over to an OEM:

  • a high availability SLA, afforded in part by all those generators and UPSes
  • slick cloud service integrations, privacy features, virtual networking, afforded by that frontend cloud network
  • better performance for AI training or inferencing workloads, afforded by extra SSDs and all-optical interconnects
  • a bunch of other typical TCO stuff--the power consumed by the node, the opportunity cost of free floor tiles in your datacenter, and the engineers and technicians that keep it all running

Ultimately, someone needs to pay for all of these extra ingredients. Cloud providers could just eat the costs themselves and sell the supercomputing service at a price comparable to what a customer would pay for an on-prem supercomputer--and sometimes they do. But this dilutes the profitability of the deal, and it increases the risks of the cloud provider losing money if unexpected issues arise during execution. Losing money is objectively bad business, so it's usually cloud customers who are paying for all these extra capabilities, whether they use them or not.

So if all you want to do is run big MPI jobs, and you have no use for the extra availability, cloud integrations, privacy and security, and programmable infrastructure, sure--the per-node price is going to be higher in the cloud than on-prem. You're paying for a bunch of features that you don't need.

...Although sometimes it is

Sometimes, though, buying a supercomputer in the cloud is straight-up more expensive because of the value it provides. For example, I remember a case where a large AI company needed to train a big LLM on many thousands of GPUs, so they signed an agreement which gave them exclusive access to a cloud supercomputer that strongly resembled a specific GPU system in the DOE complex. Because I used to work in the DOE, I knew how much DOE paid to buy their GPU cluster, and I also knew that three years of maintenance was included in that cost.

What amazed me was that this AI company was willing to pay (roughly) the same price that DOE paid for their on-prem supercomputer but, in exchange, get exclusive access to a comparably capable cloud supercomputer (same GPU model, similar GPU count, similar interconnect) for one year only. Put differently, being able to use a big, cutting-edge GPU cluster was worth up to 3x more to this AI company than it was to the DOE.

While it may sound like I'm spilling secrets here, the reality is that anyone working for a cloud provider wouldn't be able to tell which AI deal I was describing here--they all look like this, and they're all willing to spend significantly more than the HPC community for the same compute capability. This gives you a sense of the real value that AI companies place on all the benefits that cloud-based supercomputers can provide.

This isn't all bad for HPC, though. Every fat deal with an AI company means that there can be another deal with an HPC center that has slim margins. For example, let's say an AI company is willing to pay a billion dollars for a supercomputer whose TCO is only $330M--that means the cloud provider gets 67% margin. If the cloud provider's overall margin target is 50%, that means it can sell an identical supercomputer to an HPC customer at zero profit (for $330M) and still walk away happy. Thus, it is possible for the price of a supercomputer for HPC to be subsidized by all the money that the AI industry is throwing into supercomputing. Whether or not a cloud provider ever cuts deals like this is a business decision though--and as I said earlier, I don't think they're as open to silly ideas now as they used to be.
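
Putting numbers on that cross-subsidy (reusing the same hypothetical figures), the blended margin works out as follows; this is just illustrative arithmetic, not a description of any real deal:

    # Blended-margin arithmetic for the hypothetical deals described above.
    ai_revenue, ai_tco = 1_000e6, 330e6     # AI deal: $1B revenue, $330M TCO
    hpc_revenue, hpc_tco = 330e6, 330e6     # HPC deal sold at cost (zero profit)

    ai_margin = (ai_revenue - ai_tco) / ai_revenue
    blended_margin = ((ai_revenue + hpc_revenue) - (ai_tco + hpc_tco)) / (ai_revenue + hpc_revenue)

    print(f"AI deal margin: {ai_margin:.0%}")        # 67%
    print(f"blended margin: {blended_margin:.0%}")   # ~50%, still at the target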

The real hurdle that I was never able to overcome, though, stems from the fact that there is finite HPC and AI expertise in the world. Splitting that expertise between HPC and AI is ultimately a zero-sum game, and every hour spent working with an HPC customer is usually an hour that isn't being spent working with a much more profitable AI customer. I constantly ran into this problem working in hyperscale AI; my full-time job was to deal with AI customers, but I enjoyed interacting with HPC customers too. As a result, I had to do a lot of my HPC-specific work (preparing conference presentations, for example) on nights, weekends, and vacations. It was just hard to tell people that I couldn't help improve job uptime on a massive training run because I was preparing a talk for a workshop that, frankly, might be openly hostile to my message.

Influencing the cloud is hard

Because the difference in investment between HPC and AI is so big, many of the carrots that the HPC community has traditionally dangled in front of HPC vendors aren't very enticing to the hyperscale AI community. For example, both US and European HPC programs have relied heavily on non-recurring engineering (NRE) contracts with industry partners to incentivize the creation of products that are well-suited for scientific computing; PathForward and Horizon 2020 both come to mind as well-funded, successful efforts on this front.

However, HPC is the only customer community that really tries to do this, and it echoes a time when the HPC community was at the forefront of scale and innovation. Nowadays, the prospect of accepting a $1M/year NRE contract to implement XYZ is completely unappetizing to a hyperscaler; it would probably cost more than $1M/year just to figure out how a company with $250 billion in annual revenue can handle such an unusual type of contract and payment. Add to this the weird intellectual property rules (like disentangling a 40% cost sharing advance waiver for a tiny project within a multi-billion-dollar business), and it can become a corporate quagmire to go anywhere near NRE projects. Companies with well-insulated HPC silos can probably manage this better, but part of hyperscale economics is that everything overlaps with everything else as much as possible across supercomputing, general-purpose computing, hardware, and software.

As a result of this, I really struggled to understand how a $20M/year service contract plus a $1M/year NRE contract is materially different from a $21M/year service contract in the cloud world. For most (non-HPC) cloud customers, the RFP comes in saying "we need XYZ" and some product manager notes customer demand for XYZ. If the demand is large enough, the feature winds up on the roadmap, and the cloud provider develops it as a part of regular business. If there is no other demand, then an NRE contract isn't really going to change that; maintaining feature XYZ long-term will cost far more than a couple million dollars, so implementing it would be a bad decision. This isn't unique to cloud, for what it's worth; while there are some successful HPC NRE stories, there are far more NRE-originated products that had no product-market fit and were simply abandoned after the associated supercomputer was retired.

As best as I can tell, NRE has become a way for big HPC customers to maintain the illusion that they are influencing hyperscalers. A hyperscaler could propose some NRE, and an HPC buyer could fund it, and there could be weekly meetings where the two get together and pretend like they're collaborating and codesigning. The hyperscaler could write milestone reports, and they could attend quarterly business reviews with the customer. But this feels like an act. You simply can't move a $250B/year company that isn't solely organized around supercomputing with the lure of a couple million a year.

This is not to say that NRE and codesign have no purpose in HPC! I'm sure component vendors (GPUs, networking, and the like) can make minor tweaks that offer big upside for the HPC community. But I learned that, as in several other dimensions, the HPC community is being pushed towards buying whatever is already on the truck, and NRE isn't going to have the impact that it once did.

Career

In addition to learning about how the hyperscale supercomputer world works in practice, my time at Microsoft exposed me to a segment of the supercomputing community that I didn't know existed: junior software engineers who were unwittingly thrown into the deep end of HPC straight out of college and were desperate to find their footing in both the technology and their careers overall. Maybe the most impactful work I did in the past three years was not technical at all, but instead came through some internal talks I gave on my professional journey in HPC and the one-on-one conversations that followed.

Since I've gotten such positive feedback when I talk and write about this aspect of HPC, I'll also share some things I've learned about choosing the right employer and job during my time at Microsoft.

People matter

I learned that the right team matters more than the right job. It is profoundly important to me that I get to work with people who share my level of passion and curiosity, even if we are working on different problems.

In retrospect, I realize that I have been very lucky that my career has progressed through organizations that were packed to the gills with people with whom I shared values. They wanted to go to conferences to share their work, they wanted to hear how others were solving similar challenges, and they weren't afraid to present (and challenge) new ideas. As I learned over the last three years though, I think these traits are acutely concentrated in the HPC world since HPC itself originated from academia and a culture of independence and self-direction. They certainly aren't universal to all workplaces.

To be clear, I am not saying that my coworkers at Microsoft weren't passionate or curious. But I did learn that, at big tech companies, you can have a perfectly successful career by keeping your head down and cranking away at the tasks given to you. If the work changes one day, it's actually a virtue to be able to walk away from the old project and turn your complete attention to a new one. Did the company just cancel the product you've been working on? No problem. If you were good at writing code for Windows Update, you'll probably be just fine at coordinating planned maintenance for supercomputers. A colleague of mine called these people "survivors," because they will do the best they can with whatever they're given.

While this agility is great if you love programming, it can also engender numbness and dispassion for any specific application area. If a "survivor" can just as easily program for HoloLens as they can for GPU telemetry, they also likely don't really care about either HoloLens or GPUs. This isn't a bad thing, and I am certainly not passing judgment on people who don't care about GPUs. But it does mean that it's harder for someone who really cares about GPUs to connect with a teammate who really doesn't. And this has many knock-on effects in day-to-day work; it's only natural for people who share common values to help each other out, while relative strangers are less likely to go that extra mile. Finding that common ground to promote "some person on team X" to "my trusted colleague on team X" is that much harder.

This difficulty in finding my community amidst all the survivors is what led me to look outside of my company to find my people. I went to events like the Smoky Mountains Conference and NERSC@50 and took the stage to literally beg the HPC community to give me a reason to work with them. By the letter of my job description, I was never supposed to be on stage; I was supposed to be spending all my time behind my desk, thinking about the reliability of our biggest supercomputers. But I liked working with the people in the HPC community, and I liked working with our HPC sales organization, because we all shared common values; we were passionate about HPC and the mission of advancing scientific computing. So, I wound up spending a lot of time working on simple things with HPC folks and not enough time doing my day job.

Company culture matters, too

In an organization where individuals don't often share a lot of common ground, I learned that it's incumbent upon everyone to make a deliberate effort to maintain a culture of working together and helping each other out. A positive workplace culture won't happen by itself across a massive organization. To this end, Satya has a bunch of corporate culture mantras that are often repeated to keep reminding people of the way employees should treat each other.

For example, he has a mantra of "be a learn-it-all, not a know-it-all." But I found that many people struggled to really understand how to do this in practice; when confronted with a tough problem ("your database keeps timing out when we point a thousand nodes at it at once"), it's often too easy to just be a know-it-all ("nobody else does that, so you are doing it wrong") rather than a learn-it-all ("why are you doing it that way?"). And the older a company is, the harder it is for decades-long veterans to maintain openness to new challenges in the silo they've built around themselves.

I've worked with HPC users for long enough to know that this attitude is pervasive anywhere you put a bunch of smart people with different perspectives into a room. However, it wasn't until I came to Microsoft that I learned that there's something to be gained by explicitly and repeatedly reminding people that they should strive to understand at least as much as they try to explain. Should I ever find myself in a leadership position, this is definitely a mantra I will carry with me and repeat to others, and I will credit my time at Microsoft with appreciating how to really live this mentality, not just parrot it.

Being good at things isn't always a job

People tell me that I'm pretty good at a bunch of stuff: figuring out how technologies work, explaining complex concepts in understandable ways, and taking a critical look at data and figuring out what's missing. And I enjoy doing these things; this is why I post to my blog, maintain my digital garden, and love getting on stage and giving presentations. But people also say that, because I'm good at these things, there'd be no shortage of opportunities for me in the HPC industry should I ever go looking.

However, I've learned that a job has to be an amalgamation of responsibilities that create value, and connecting "things I'm good at" with "things that need to be done" is not always straightforward. For example, if I am good at learning things and share what I learned with others, what kind of jobs actually turn that into a responsibility?

  • Developers don't really do this at all. Their job is really to keep those git commits coming. Sometimes this requires learning new things, but writing blog posts or giving talks is not in the job description, so they don't count for much on performance reviews.
  • Product managers do a little of this. I had to learn a few things and then repeat them a lot when I was a PM. Over and over. To customers, to executives, to partner teams. It was 5% learning and 95% sharing.
  • Salespeople also do a little of this. They have to stay current on customer needs and product features, then repeat them a lot.
  • System architects do a fair amount of this. I had to learn about what technologies are on the horizon, figure out how to piece them into an idea that could be implemented, then explain why it'd all be a good idea to others.
  • Educators do a lot of this. The technology industry is always moving, so learning is required to stay up to date. They also get to be selective about the ideas worth sharing and downplay the rest.

Each one of these roles has its own downsides too; for example, product managers and salespeople often have to nag people a lot, which I don't think anyone likes. And many of these roles require sharing knowledge with people who really don't want to hear it. After all, what customer is eager to talk to every salesperson who comes in the door?

Trying to find the ideal job is not just a matter of being good at many things; it's a matter of finding specific jobs that contain a maximal number of things you're good at and a minimal number of things you don't want to do. It's an NP-hard problem, and I've come to realize that the only way to solve it is through trial and error. I'm sure some people get lucky and figure out the optimal path on their first try, but for the rest of us, the only way to approach the optimal path is to continuously reflect and not linger on a known-suboptimal path any longer than necessary.

I've given up on trying to find the perfect job, because I've learned that it probably doesn't exist. I'm good at some things, I'm bad at some things; I enjoy some responsibilities, and I dislike some responsibilities. As with every other job I've had, I learned a lot about all four of these categories during my time at Microsoft, and my choice of next step has been informed by that. I don't expect it to be perfect, but I have high hopes that it will be a step in the right direction.

You don't have to be your employer

When I left the government for a corporate job, one of my biggest worries was losing credibility with peers whose opinions I respected. It's easy to dismiss the viewpoint of someone at a large vendor with a rationalization like, "of course they'd say that; it's their job," but I learned that the HPC community isn't so reductive. People are smart, and most were willing to engage with the quality of my ideas before checking the affiliation on my conference badge.

The trick, of course, was finding ways to share ideas in a way that didn't upset my corporate overlords but had substantive value to my audience. I think I figured this out, and in short, I found that leading with honesty and precision works best. The HPC community was built on sharing experiences and learnings about what does and doesn't work, so embracing that--rather than name-dropping products and making hyperbolic claims--seemed to keep me getting invited back to the HPC conferences and workshops that I wanted to attend.

I wasn't completely intentional in building whatever credibility I've gained over the last three years, but I was intentional in avoiding work that would clearly compromise it. I never want to be accused of misrepresenting the limits of my understanding, so I will never present a slide containing statements or plots that I can't substantiate. I also never want to be accused of misrepresenting the truth, so I am as forthright as possible in disclosing when I do (or don't) have an incentive to say something.

Because I stayed true to myself, I think I was the same person at Microsoft as I was at NERSC or SDSC. That continuity helped my peers quickly recalibrate after I became a vendor, and I think this helped me do more than if I had gone all-in on the role of a cloud spokesperson. Of course, there were times when I had to take on an employer-specific persona, but that's just business, and I've found that peers recognize that this is just a part of the game that we all must play.

The result of all this wasn't clear to me until after I started telling people I was leaving Microsoft. There were a bunch of HPC-specific projects I undertook on the side (e.g., reviewing and advising on research, serving on panels), and I started notifying people that I would have to find other Microsoft engineers to take over these obligations since I was leaving. Much to my surprise though, everyone responded the same way: the request for my help had been made to me specifically, not to my employer. Short of any conflicts of interest, they didn't care who employed me and valued my contributions regardless of who was signing my paychecks.

So, after three years working for an HPC vendor, I have learned that most people won't define you by your employer as long as you don't define yourself by your employer. It is possible to work for a company that sells HPC and still maintain your own identity as a person, but it requires thoughtful effort and a supportive (or indifferent!) employer. If you act like a company shill, you will be regarded as one, but not many jobs in industry actually require that to fulfill your responsibilities.

Happiness sometimes costs money

I think most people would agree that, while money can't buy happiness, it certainly helps. What I didn't realize until recently, though, is a reciprocal truth: sometimes happiness costs money.

A year ago, I wrote about how the pay in industry compares to working at the national labs, and I described how my golden handcuffs were structured. An optimist might say that these vesting schedules are a way to keep a happy employee from being lured away, but I think it's equally common that these are truly handcuffs. They are a constant reminder that, even in the darkest of days, there is a six-figure reason to grit one's teeth and persevere.

I've come to realize that flexibility and compensation tend to pull in opposite directions:

  • Smaller organizations offer more flexibility to mold a job around your preferences, because there is more work scope spread across fewer people.
  • Larger organizations can afford to offer larger total compensation, but flexibility is limited to the scope of any single team.

I kind of thought about it like this:

[Figure: compensation as a function of flexibility]

When I realized that I should explore other paths, I had to determine where in this continuum I wanted to wind up: do I care more about a fat paycheck, or do I care more about enjoying my day-to-day responsibilities? And once offers started coming in, exactly how much of a pay cut was I willing to take in exchange for the flexibility that I would receive?

By the time I handed in my resignation at Microsoft, I knew exactly how much this happiness was worth to me. Put another way, I found out how much opportunity cost I was willing to pay for the ability (hopefully!) to reconnect with my day-to-day work. The calculus was an interesting exercise involving a bunch of Monte Carlo simulation, which I won't detail here, but as it turns out, I was willing to pay a lot of money for the chance to do something that aligned more completely with what I wanted to do with the rest of my career. In the end, I gave up hundreds of thousands of dollars in unvested stock, and I am taking a six-figure pay cut in annual base pay when I start my next job. For me, though, this was a fair price to pay.
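
I won't reconstruct my own model here, but for anyone curious what this kind of exercise can look like, below is a minimal sketch using entirely made-up numbers: it samples noisy stock returns to estimate the value of the unvested grants being walked away from, then adds the cumulative base-pay cut.

    import random

    # A deliberately simplified sketch of a Monte Carlo opportunity-cost estimate.
    # Every figure here is hypothetical; my actual model and compensation differed.
    TRIALS, YEARS = 100_000, 4
    UNVESTED_STOCK = 400_000    # grant value forfeited by leaving, vesting evenly
    BASE_PAY_CUT = 100_000      # annual base-salary reduction at the new job

    def opportunity_cost():
        vest_per_year = UNVESTED_STOCK / YEARS
        growth, forfeited = 1.0, 0.0
        for _ in range(YEARS):
            growth *= 1 + random.gauss(0.10, 0.25)  # noisy annual stock return
            forfeited += vest_per_year * growth     # value of that year's vest
        return forfeited + BASE_PAY_CUT * YEARS

    costs = sorted(opportunity_cost() for _ in range(TRIALS))
    print(f"median opportunity cost of leaving: ${costs[TRIALS // 2]:,.0f}")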

Final thoughts

After three years in the world of hyperscale supercomputing, I have come away with two major learnings that now shape how I think about the future.

On the technical front, I think the HPC community has chosen to keep going its own way and reinvent the cloud rather than work meaningfully with hyperscale cloud providers. There was a brief window of opportunity where the mountain may have actually come to Muhammed, and the trajectory of scientific computing could have fundamentally changed to align with the growth trajectory of hyperscale AI. However, I don't think the HPC community was ready to take a big swing during those early days post-ChatGPT or do an earnest assessment of what that future could've looked like. I also worry that the window has closed, and the HPC community never even realized what was on the table.

On the career front, I've realized that success is multidimensional. Money is one axis, but so are impact, people, and purpose. The relative importance of each is not always obvious either; it only became clearer to me as I tried different jobs across the space. I've found that the ability to work with like-minded people and the opportunity to learn and share are the most important dimensions to me, but I also recognize that I am privileged in the others. Finding stacks of money can be easy for those who work in AI, but there are no shortcuts to building (and retaining!) teams of great people. Anyone who can do the latter well should not be undervalued.

There's a lot more that I didn't have time to organize and write, but I have every intention of continuing to be myself, regardless of where I work, in the future. I will keep writing, posting, and talking about what I'm learning in supercomputing whenever I can. And along those lines, I hope that writing all this out helps others figure out what's important to them and where they want to go.