LLM training without a parallel file system

The illustrious Jeff Denworth recently posted a hot take across social media, claiming that training large language models (LLMs) doesn't require massive, expensive parallel file systems:


As someone who's been working on one of the largest supercomputers on the planet--one that has no parallel file system at all--I was surprised by how many incredulous or curious responses followed. I guess supercomputers and parallel file systems go together like peas and carrots in so many people's minds, because the idea of running a massive parallel compute job without a massive parallel file system struck many as too unintuitive to believe.

I've given talks about how LLM training uses storage in the past, but I realized I've never written it down. So, for the benefit of humankind, let's talk about how these supercomputers without parallel file systems work.

The workload

Though the actual model training on giant GPU supercomputers gets all the attention, the full process of training an LLM is a little more involved. A colleague of mine at Microsoft gave a great overview of this storage-centric, end-to-end picture at SNIA SDC24; broadly, training an LLM involves the following steps:

  1. Data ingestion: This is where crawlers scrape the Internet and pull down raw HTML, images, videos, and other media. These raw data are indexed and shoved into a data warehouse. At scale, this can be hundreds or thousands of petabytes of data for frontier models.
  2. Data preparation: This is where the raw data is converted into tokenized data. It amounts to a huge data analytics problem that uses well-documented text and image processing pipelines that filter, deduplicate, and otherwise clean the raw garbage on the Internet using frameworks like Apache Spark. The hundreds of petabytes of input get reduced down by 10x-1000x.
  3. Model training: This is where the tokenized data is shoveled through the LLM on giant GPU clusters in little batches. As the data is processed, the model weights are updated, and those weights are checkpointed to storage. If a compute node crashes and the job fails, that checkpoint is used to restart, just like a traditional scientific HPC application. There might be fine-tuning and the like happening as part of this too, but I won't talk about that.
  4. Model deployment and inferencing: This is where the final model is copied across giant fields of inferencing servers, and a web service sits in front of it all to transform REST API requests into actual inferencing queries that run on the GPUs. This isn't training, but we'll talk about it anyway.

To understand why a parallel file system offers no particular benefit to any of these steps, let's take a closer look at what's going on in each one.

Data ingestion

Data ingestion is a widely distributed process that involves minimal computation; you just need a lot of Internet-facing network connectivity and CPU cores to drive independent processes connecting to other people's public HTTP servers. I don't know a lot about what this process looks like, because it never relies on anything resembling a supercomputer.

To the best of my knowledge, data ingestion just pulls HTML, images, or video streams from the Internet and packs them into data containers. As it is packing webpages into these files, it is building a separate index that stores metadata about the webpage (URL, encoding, date of access) and its location (the file in which the webpage's contents are stored and the byte offset within that file). Thousands of VMs might be performing these tasks completely independently, and because they do not need to synchronize with each other at any step, it can be better to distribute these scrapers around the world rather than centralize all of them in a single datacenter.

While one could store each scraped HTML page in a file that's organized in a parallel file system, accessing those files would be very slow--a full crawl of all the data would require scanning hundreds of billions of little files. So instead of implementing data containers using files and the index using a file system directory tree, it's better to implement data containers on top of object stores and use a distributed key-value store for the index. Because scraped data is write-once (and therefore doesn't need features like file locking or read-modify-write), it is a natural fit for object stores' design around object immutability.
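To make the ingestion pattern concrete, here is a minimal sketch of packing scraped pages into a large container and tracking each page's location in an index. A local file and a Python dict stand in for the object store and the distributed key-value store described above; the container name and index fields are made up for illustration.

    import hashlib
    import json
    import time

    # A local file stands in for a large data-container object, and a dict stands
    # in for the distributed key-value index; a real pipeline would use an object
    # store client and a key-value store here instead.
    CONTAINER = "container-00001.bin"
    index = {}

    def pack_page(url: str, html: bytes) -> None:
        """Append a scraped page to the container and record where it landed."""
        with open(CONTAINER, "ab") as f:
            f.write(html)
            offset = f.tell() - len(html)      # byte offset where this page begins
        index[hashlib.sha256(url.encode()).hexdigest()] = {
            "url": url,
            "container": CONTAINER,            # which container holds the page
            "offset": offset,                  # where its bytes start
            "length": len(html),               # how many bytes it spans
            "fetched_at": time.time(),
        }

    def read_page(url: str) -> bytes:
        """Random access to one page: look up the index, then range-read the container."""
        entry = index[hashlib.sha256(url.encode()).hexdigest()]
        with open(entry["container"], "rb") as f:
            f.seek(entry["offset"])
            return f.read(entry["length"])

    pack_page("https://example.com", b"<html>hello</html>")
    print(json.dumps(index, indent=2))
    print(read_page("https://example.com"))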

Data preparation

Once raw data is indexed and saved in object stores, the first phase of computation comes into play. I've documented this data processing pipeline on my LLM training datasets page, but a lot of it amounts to running Apache Spark-like pipelines that chew through all the raw data in a trivially parallel way.

These data processing pipelines are very well defined from the days when Hadoop was all the rage, and their data access patterns map well to the strengths of object stores. Each processing task might read a couple hundred megabytes of data from an object all at once, process it in-memory, then dump it back out to objects all at once. File systems offer no benefit here, because each task reads once and writes once rather than skipping around inside individual objects.
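A minimal sketch of that read-once/write-once access pattern follows, with in-memory dicts standing in for the raw and cleaned object stores and multiprocessing standing in for a Spark-style framework; the shard names and the toy quality filter are made up.

    from multiprocessing import Pool

    # In-memory dicts stand in for the raw and cleaned object stores; a real
    # pipeline would read and write a few hundred megabytes per object through
    # an object-store client inside a Spark-style framework.
    raw_objects = {
        "raw/shard-000": b"<html>Buy now!!!</html>\n<html>A useful article.</html>",
        "raw/shard-001": b"<html>A useful article.</html>\n<html>Another page.</html>",
    }
    clean_objects = {}

    def process_shard(key: str):
        """Read one whole object, clean it in memory, and emit one whole output object."""
        pages = raw_objects[key].split(b"\n")                    # read once
        kept = [p for p in pages if b"Buy now" not in p]         # toy quality filter
        return key.replace("raw/", "clean/"), b"\n".join(kept)   # write once

    if __name__ == "__main__":
        with Pool(processes=2) as pool:          # shards are processed independently
            for out_key, payload in pool.map(process_shard, sorted(raw_objects)):
                clean_objects[out_key] = payload
        print({k: len(v) for k, v in clean_objects.items()})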

There is a significant compute workload here, and there are points in the data processing pipeline where global synchronization of all tasks is required. Specifically, the process of deduplicating input data--which is a critical step to getting a high-quality model these days--requires comparing every piece of data to every other piece of data. As a result, this data preparation phase is often done in a centralized location that is adjacent to the object store containing all the raw data scraped from the previous step. The clusters used for data processing can resemble traditional CPU-based supercomputers (think a system like TACC's Frontera), and in some cases, they might even have full RDMA fabrics to accelerate the all-to-all deduplication step.
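To see why deduplication forces global synchronization, consider this stripped-down, exact-match sketch: every worker hashes its own documents independently, but duplicates can only be dropped after documents are regrouped by hash across all workers, which is the all-to-all shuffle. Production pipelines typically use fuzzier techniques such as MinHash, but the communication pattern is similar.

    import hashlib
    from collections import defaultdict

    # Each "partition" is the slice of documents held by one worker. Duplicates can
    # only be dropped after documents are regrouped by content hash across *all*
    # partitions -- an all-to-all shuffle, and the global synchronization point.
    partitions = [
        ["the cat sat on the mat", "a document that appears only once"],
        ["the cat sat on the mat", "another unique document"],
    ]

    # Map phase (local to each worker): hash every document.
    hashed = [(hashlib.sha256(doc.encode()).hexdigest(), doc)
              for part in partitions for doc in part]

    # Shuffle phase (global): group by hash so identical documents meet in one place.
    by_hash = defaultdict(list)
    for digest, doc in hashed:
        by_hash[digest].append(doc)

    # Reduce phase: keep one representative of each group.
    deduplicated = [docs[0] for docs in by_hash.values()]
    print(deduplicated)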

Critically, this data processing step is not done on the GPU nodes that actually train the model. Data processing is usually limited by I/O bandwidth to storage, and you never want your GPUs stalling out because they're waiting for data. Parallel file system vendors might tell you that the only way to avoid this GPU starvation issue is to plug every GPU node into a super-fast parallel file system, but the reality is that people just do this I/O-heavy step on completely separate supercomputers before training on GPUs ever begins.

CPU nodes are significantly cheaper than GPUs, so buying cheap object storage and a cheap CPU cluster is more cost-effective than buying an expensive file system and wasting your GPU nodes on trivially parallel text processing tasks. To illustrate this, consider some normalized list prices from Azure:

  • $1.00 gets you a 96-core general-purpose VM with 384 GB of RAM
  • $1.65 gets you a 176-core HPC-optimized VM with NDR InfiniBand and 768 GB of RAM
  • $22.55 gets you a 96-core, 8x H100 GPU VM with NDR InfiniBand

Given that GPUs don't give you a 13x-22x speedup for data processing despite costing 13x-22x as much, it makes no sense to perform this data processing on GPU nodes inline with training.
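For reference, the 13x-22x figure is just the ratio of those normalized list prices:

    # Normalized Azure list prices from the bullets above.
    cpu_general = 1.00    # 96-core general-purpose VM
    cpu_hpc     = 1.65    # 176-core HPC-optimized VM
    gpu_node    = 22.55   # 96-core, 8x H100 GPU VM

    print(f"{gpu_node / cpu_hpc:.1f}x to {gpu_node / cpu_general:.1f}x the price")  # 13.7x to 22.6x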

One could argue that the GPUs are sitting idle while the data processing cluster is working anyway, but rest assured that AI model shops have no shortage of work to keep their GPUs busy. Data processing for the next model on a CPU cluster often happens at the same time the current model is being trained on the GPU cluster. In cases where there isn't enough work to keep both CPU and GPU clusters busy around the clock, also remember that most of this stuff happens in the cloud, and cloud providers can sell those idle CPU or GPU cycles to another customer in between training campaigns.

Model training

Huge, distributed training jobs are where most people would think a fast parallel file system is required for both reading input data and writing out checkpoints. After all, the need for fast checkpointing and restart was the primary driver behind the creation of parallel file systems.

While parallel file systems certainly can be used for training, they are not the most cost-effective or scalable way to train across tens of thousands of GPUs. To illustrate why, let's consider the processes of reading inputs and writing checkpoints separately.

Reading training data

Training a model on GPUs, whether it be on one or a thousand nodes, follows a simple cycle (this is a "step" in LLM training parlance) that's repeated over and over:

  1. A batch of tokenized data is loaded into GPU memory
  2. That data is then processed through the neural network and the model weights are adjusted
  3. All GPUs synchronize their updated weights

It's tempting to imagine the I/O load generated by step #1 as being the same as it would be for a traditional HPC job: data is read from a parallel file system into compute memory at the start of every single step:

Animation showing a naive approach to loading data directly from shared storage into the GPU nodes of an InfiniBand cluster.

In years past, storage vendors would've insisted that this repeated, random re-reading of input data at every step requires a super-fast parallel file system to keep up. However, two factors make that untrue:

  1. The input data isn't millions of little text or image files. As described in the data ingest and data processing steps, these small files are packaged into large objects before the GPUs ever see them.
  2. Tokenized data is very dense compared to raw input, so the number of bytes being read over the course of hundreds or thousands of steps is actually quite small.

To quantify #2, consider the Llama-3 405b model, which was trained on a significant fraction of the public Internet--15.6 trillion tokens. That sounds like a lot of information until you realize that the size of a typical token is between 3 and 5 bytes depending on the tokenizer and encoding. This means that the entire 405-billion parameter Llama-3 model, which was trained using 16,000 GPUs, only had to load 60 TB of tokens from storage. That divides out to 3.75 GB of tokens processed by each GPU over the entire course of a 54-day run.
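The arithmetic behind those numbers is worth spelling out (4 bytes per token is the assumed midpoint of the 3-5 byte range):

    # Arithmetic from the paragraph above; 4 bytes/token is the assumed midpoint
    # of the 3-5 byte range, and 54 days is the stated training duration.
    tokens          = 15.6e12     # Llama-3 405b training tokens
    bytes_per_token = 4
    gpus            = 16_000
    days            = 54

    total_tb    = tokens * bytes_per_token / 1e12    # ~62 TB, i.e. roughly 60 TB
    per_gpu_gb  = 60e12 / gpus / 1e9                 # 3.75 GB per GPU for the whole run
    per_gpu_bps = 60e12 / gpus / (days * 86_400)     # ~800 bytes/s per GPU, sustained

    print(f"{total_tb:.0f} TB total, {per_gpu_gb:.2f} GB per GPU, {per_gpu_bps:.0f} B/s per GPU")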

When you consider how few bytes are required to train an LLM, it should become clear that the biggest I/O challenge in the performance-critical training loop isn't raw bandwidth; it's performance variability. As such, the best way to ensure that GPUs do not stall out due to read requests is to eliminate as much I/O performance variability as possible. To do this, you have to minimize the sources of contention that might arise between the storage devices and the network that connects them to the GPUs. While you can do this using sophisticated quality-of-service in both the storage servers and interconnect, there is an easier way.

The secret ingredient is local NVMe.

Just stick some local SSDs in every GPU node.

This ensures that no contention will occur when loading data from storage into the GPU, because the only network between them is the PCIe on the node. In addition, using node-local NVMe allows storage capacity and storage performance to scale linearly with GPU performance. By comparison, a remote storage system (whether it be parallel file or object) won't get any bigger or faster as you add more GPUs to the training job, resulting in each GPU losing efficiency due to I/O as more GPUs are added to the training job.

In practice, model training uses local SSDs like this:

Animation showing how data can be read once from shared storage into GPU nodes, then distributed within the GPU nodes' InfiniBand cluster.

At the start of a training job, data is read from remote storage into the local SSDs in a distributed fashion once. Because the tokenized data is so small, many replicas of the entire dataset can be stored across the job's GPU nodes as well; for example, if you were to train Llama-3 405b on NVIDIA DGX H100 nodes, you could fit the entire training dataset (all 60 TB of it) on just three nodes since each node comes with 30 TB of local SSD. Given that the model was trained on 16,000 GPUs (2,000 nodes), that translates to storing hundreds of replicas of the entire training set. This has a few major benefits:

  1. GPUs never have to wait for shared storage to return data before they can compute. Everything they need is on the local SSDs.
  2. When a GPU node fails, its input data can be recovered from a surviving GPU node over the backend InfiniBand. After training starts, input data never has to be read from shared storage again.
  3. It's common to scale up training over time by adding more GPUs (more data-parallel domains) to the job as it stabilizes. When this happens, I/O performance scales linearly because these new GPUs never have to fight over shared storage.

A reasonable critique of this approach is that data management becomes more complicated; either the training framework has to keep track of which SSDs and nodes have copies of which input data, or a distributed, client-side shared namespace like WEKA Converged Mode or CoreWeave LOTA has to sit between your application and your data. In practice though, frontier models are trained for exactly one epoch; that is, every input token is processed exactly one time to achieve optimal model quality. Because no two GPUs will ever need to read the same input token, there's never a need to copy input tokens between nodes inside the training loop. 

I also acknowledge that the above description is greatly simplified; the entire node-local SSD capacity cannot be filled with input data, as space is also needed for checkpoints and other temporary data. However, the fact remains that super high-bandwidth or super high-capacity parallel file systems are not necessary for loading input tokens during training. AI training clusters are built with a ton of local SSDs to do the heavy lifting, and the input data for LLMs is small enough to fit in just a handful of GPU nodes.
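One way to picture the staging step: each node figures out which slice of the tokenized dataset it owns (plus a replica of a neighbor's slice) and copies only those objects from shared storage to its local SSD before training begins. The sketch below is illustrative only; the shard counts, paths, and modulo placement are assumptions, and a real cluster would use whatever copy tooling or framework-managed data loader it prefers.

    import os

    # Hypothetical layout: the tokenized dataset lives in shared storage as equally
    # sized shard objects, and each node stages the shards it owns (plus a replica
    # of a neighbor's shards) onto its local NVMe exactly once, before training.
    NUM_SHARDS = 8_000
    NUM_NODES  = 2_000
    LOCAL_DIR  = "local-nvme/tokens"     # stand-in for the node-local SSD mount

    def shards_for_node(rank: int, replicas: int = 2) -> list:
        """Simple modulo placement: this node's own shards plus a neighbor's copy."""
        mine = []
        for r in range(replicas):
            owner = (rank - r) % NUM_NODES
            mine += [s for s in range(NUM_SHARDS) if s % NUM_NODES == owner]
        return mine

    def stage(rank: int) -> None:
        """Copy this node's shards from shared storage to local SSD (done once)."""
        os.makedirs(LOCAL_DIR, exist_ok=True)
        for shard in shards_for_node(rank):
            remote = f"tokens/shard-{shard:05d}.bin"      # object in shared storage
            local  = f"{LOCAL_DIR}/shard-{shard:05d}.bin"
            # object_store_get(remote, local)  <- a real GET would go here
            print(f"rank {rank}: {remote} -> {local}")

    stage(rank=0)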

Writing model checkpoints

Though the read workload of LLM training is modest at best, the write workload can be quite intense at scale because the probability of failure increases superlinearly with the size of the training job. However, unlike with scientific HPC jobs, the checkpoint size does not scale as a function of the job size; the checkpoint for a 405 billion-parameter model trained on 2,000 nodes is the same size as the checkpoint for that model trained on three nodes. This is a result of the fact that every training step is followed by a global synchronization which makes each data-parallel copy of the model identical. Only one copy of those model weights, which amounts to under a hundred terabytes for state-of-the-art LLMs, needs to be saved:

Animation showing a naive approach to model checkpointing where a single model replica dumps its model weights directly to shared storage. 

Kartik and Colleen Tartow at VAST wrote a quantitative breakdown of the true I/O requirements of checkpointing, and they illustrate how even a trillion-parameter model can achieve 99.7% forward progress (only 0.3% time spent checkpointing) when training across 3,072 GPUs with a modest 273 GB/s file system. A parallel file system is not required to get that level of performance; for example, HDD-based Azure Blob achieved over 1 TB/s when benchmarked with IOR for writes at scale.

As with reading input tokens though, the real goal for checkpointing at scale is to remove any dependence on shared storage from the training loop entirely. And again, the best way to do this is to simply checkpoint to node-local storage. However, special care must be taken to ensure that the checkpoints don't get lost when a node crashes.

In practice, LLM training is now done with asynchronous, multilevel checkpointing. This technique provides the scalability of checkpointing to node-local storage and the durability of shared storage:

An animation showing an approach to hierarchical checkpointing where data is first copied from GPU memory into CPU memory, then from CPU memory to the local SSD of a partner node. After that, data is collectively copied from local SSD to shared storage.

The key to this checkpointing process is hierarchical data synchronization (a minimal sketch follows the list below):

  1. Model weights are first copied from GPU memory into the node's CPU memory after every training step. This checkpoint is governed by the CPU-GPU bandwidth (either PCIe or NVLink/Infinity Fabric), and a 500 GB checkpoint can complete in a second. The benefit of checkpointing to DRAM is that the GPU can unblock and begin computing the next step very quickly. However, this checkpoint in DRAM is not protected and will be lost if the node crashes.
  2. To protect against node crashes, the checkpoint is then asynchronously copied from CPU DRAM to a neighbor node's local SSD using RDMA. Now if a node crashes, it can restore from a checkpoint that is stored on its neighboring node's SSD via InfiniBand. Reading and writing a 500 GB checkpoint to neighboring SSDs might take ten seconds, so this asynchronous replication might be done for every tenth DRAM checkpoint.
  3. To store many checkpoints long-term, checkpoints are also asynchronously copied from node-local SSD to shared storage. This might take a minute or two per 500 GB checkpoint, so this last-level checkpoint copy might be done once every ten minutes.
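A schematic version of that three-tier scheme is sketched below. The cadences and the copy targets are placeholders (a dict for the model, local files standing in for the partner node's SSD and for shared storage); a real framework would replicate over RDMA and PUT to an object store instead.

    import copy
    import threading

    # Toy stand-ins: a dict is the model state, one local file is the partner
    # node's SSD, and another local file is shared object storage. The cadences
    # below are illustrative, not prescriptive.
    model_state = {"step": 0, "weights": [0.0] * 4}

    DRAM_EVERY    = 1      # tier 1: GPU -> host DRAM after every step
    PARTNER_EVERY = 10     # tier 2: DRAM -> partner node's SSD, asynchronously
    SHARED_EVERY  = 100    # tier 3: SSD -> shared storage, asynchronously

    def copy_to_partner_ssd(snapshot):
        with open("partner-ssd-checkpoint.txt", "w") as f:     # stands in for an RDMA write
            f.write(repr(snapshot))

    def copy_to_shared_storage(snapshot):
        with open("shared-storage-checkpoint.txt", "w") as f:  # stands in for an object PUT
            f.write(repr(snapshot))

    dram_checkpoint = None
    for step in range(1, 201):
        model_state["step"] = step                              # "training" happens here
        model_state["weights"] = [w + 0.1 for w in model_state["weights"]]

        if step % DRAM_EVERY == 0:                              # fast; the GPU unblocks immediately
            dram_checkpoint = copy.deepcopy(model_state)

        if step % PARTNER_EVERY == 0:                           # survives a node crash
            threading.Thread(target=copy_to_partner_ssd, args=(dram_checkpoint,)).start()

        if step % SHARED_EVERY == 0:                            # survives a cluster-level failure
            threading.Thread(target=copy_to_shared_storage, args=(dram_checkpoint,)).start()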

This hierarchical checkpointing scheme allows the GPUs to spend only a second checkpointing while being able to recover from job, node, and even cluster-level failures by tailoring the checkpoint tiering frequencies to the performance of each storage tier being used. The cost of recovering from a catastrophic failure might be re-computing up to ten minutes' worth of training, but given the rarity of such events, this scheme balances the speed (and risk) of checkpointing to DRAM against the low cost (and modest performance) of an HDD-based, durable object store.

To this latter point, the requirements of the shared storage system at the bottom of this checkpointing hierarchy are very modest:

  • The checkpoint only needs to complete in the time between successive last-level checkpoint copies. If the 500 GB checkpoint is drained to shared storage only once every ten minutes, our shared storage only needs to deliver about 1 GB/s of total bandwidth (see the arithmetic after this list).
  • The write pattern from node-local NVMe to shared storage is arbitrary, because it is a simple copy operation of a fully formed checkpoint file. Unlike direct-to-storage checkpoints, there are no weirdly shaped tensors being serialized into a file on the fly; rather, opaque bits are streaming from a local checkpoint file into a remote object using whatever transfer size and parallelism gives the highest write bandwidth.
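The bandwidth figure in the first bullet is just the checkpoint size divided by the drain interval:

    # Back-of-the-envelope for the bullet above: checkpoint size over drain interval.
    checkpoint_gb = 500
    drain_seconds = 10 * 60
    print(f"{checkpoint_gb / drain_seconds:.2f} GB/s of shared-storage bandwidth")   # ~0.83 GB/s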

This combination of modest write bandwidth and simple, sequential, large-block writes is ideally suited for object stores. This isn't to say a parallel file system cannot work here, but this checkpointing scheme does not benefit from directory structure, fine-grained consistency semantics, or any of the other complexities that drive up the cost of parallel file systems.

The catch, of course, is that checkpointing using these schemes can be complicated to implement. Fortunately, a growing number of training frameworks support both writing and restoring checkpoints using asynchronous and hierarchical approaches. Model developers never have to worry about interacting with specific files or objects; instead, the framework manages data locality during checkpoint and restart underneath a high-level API.

Model deployment and inferencing

Once a model is trained, putting it into production as an inferencing service is the final step of its lifecycle. From a storage and I/O standpoint, this is a lot more complicated than training because it marries an enterprise service delivery model (failover, load balancing, authentication, and scaling) with copies of a trained model running across HPC infrastructure. When you hear vendors talking about key-value stores, vector databases, and RAG, that is all happening at this stage.

Setting aside everything but the storage attached to the GPU cluster though, the I/O requirements of inferencing are relatively straightforward:

  1. When provisioning a GPU node for inferencing, model weights must be loaded from shared storage as fast as possible.
  2. When using an LLM to search documents, a vector database is required to perform the similarity search that augments the LLM query with the relevant documents. This is the basis for RAG.
  3. Key-value caches are often used to reduce the latency for different parts of the inferencing pipeline by storing context including the conversation or frequently accessed contextual documents.
  4. As the inferencing demand evolves, different models and weights may be swapped in and out of individual GPU servers.

A parallel file system is not particularly useful for any of these; the only place in which their high bandwidth would be a benefit is in loading and re-loading model weights (#1 and #4). But as with hierarchical checkpointing, those I/O operations are whole-object, read-only copies that are a natural fit for object APIs. Complex directory structures and strong consistency simply aren't necessary here.
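For the vector-database piece (#2 above), the core operation is a similarity search over embeddings held in memory or a key-value-style store, not in a file system. A minimal cosine-similarity version, with made-up embeddings and document labels, looks like this:

    import numpy as np

    # Toy document embeddings; a real deployment stores millions of these in a
    # vector database keyed by document ID, not in a file system hierarchy.
    doc_embeddings = np.array([
        [0.90, 0.10, 0.00],   # doc 0: "checkpointing strategies"
        [0.10, 0.90, 0.10],   # doc 1: "cafeteria menu"
        [0.80, 0.20, 0.10],   # doc 2: "parallel file systems"
    ])
    query = np.array([0.85, 0.15, 0.05])    # embedding of the user's question

    # Cosine similarity between the query and every document, then take the top-k.
    sims = doc_embeddings @ query / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query))
    top_k = np.argsort(sims)[::-1][:2]
    print(top_k, sims[top_k])     # the retrieved documents get appended to the LLM prompt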

Objects are good enough, maybe better

None of the steps in this model training lifecycle uniquely benefit from the capabilities that parallel file systems offer:

  • Data ingestion involves hundreds of petabytes of small documents, but they are immediately packaged and indexed into large data containers. Their metadata is stored in a separate key-value store, so the directory hierarchy of a file system isn't used, and once data has been packaged and indexed, it's never modified in-place. The bandwidth requirements are modest as well since web crawling is the rate-limiting step.
  • Data processing is an I/O-intensive data analytics workload. Read bandwidth is critical here, but data is accessed in large transactions and most of the computation is embarrassingly parallel. This workload runs on standalone analytics clusters, so even though the read bandwidth here is rate-limiting, slower storage is not going to impact GPU utilization on training clusters in any way. This step also reduces data by 100x or more, so the write requirements are also modest.
  • Training requires both loading input tokens and checkpointing model weights. However, both of these workloads lean on node-local NVMe in every node to eliminate slowdowns due to noisy neighbors. Input data is staged to node-local storage only once at the beginning of a training campaign, and checkpoints are asynchronously bled out to shared storage without impacting GPU utilization.
  • Inferencing involves infrequent, read-only, bulk loading of model weights into GPU nodes. While key-value caches and vector databases are also used in inferencing, parallel file systems offer no particular benefit for them.

The I/O patterns of each of these steps map nicely to object storage since they are predominantly write-once and whole-file transactions. Parallel file systems certainly can be used, and workloads will benefit from the high bandwidth they offer. However, they come with the cost of features that aren't necessary--either literal costs (in the case of appliances or proprietary software) or figurative costs (allocating people to manage the complexities of debugging a parallel file system).

The importance of this latter point is hard to appreciate if you've never used a supercomputer without a parallel file system. However, I recently sat in on the validation of a brand-new H200 training cluster where various InfiniBand congestion and routing issues were being worked out. It wasn't until someone said "eviction" in some nontechnical context that I realized that the sporadic file system evictions that normally accompany fabric instability were simply a non-issue. There was no cleanup of mount points after major fabric events because there was no persistent, fragile client-server state being maintained. I/Os between GPU nodes or between nodes and storage might have failed during a rough patch, but they recovered and resumed on their own as soon as the fabric came back. Similarly, identity didn't matter, and all tests could be run as root because there was no implicit trust between the client kernel and remote storage. Removing the dependence between compute nodes, LDAP, and healthy file system mounts completely eliminates many of the challenges of standing up new clusters quickly.

An ideal AI training cluster architecture

The workloads I described above form a rough outline for an AI training infrastructure which has:

  1. A bunch of GPU nodes with a strong RDMA backend like InfiniBand. Each node should have at least enough node-local SSD to store a substantial amount of the input tokens to be used for training, enough space for hierarchical checkpointing, and enough I/O bandwidth to these SSDs to support draining checkpoints from partner nodes' DRAM in just a few seconds (a rough sizing sketch follows this list). A separate frontend network that connects to storage is also a good idea; it ensures that asynchronous checkpoint draining won't interfere with weight synchronization in the training loop.
  2. A separate CPU cluster for data processing pipelines. A strong backend network will benefit the deduplication step (which is critical to producing high-quality training datasets), but more emphasis should be placed on optimizing large-transaction reads from storage. Given that CPU nodes are so much cheaper than GPU nodes, separating the data processing nodes from training nodes allows you to cut more corners when optimizing this CPU cluster. Keeping data processing out-of-band of actual model training means your most data-intensive step (data processing) is decoupled from your most expensive step (training).
  3. A scalable object store that supports basic write-once semantics with modest I/O bandwidth at scale. This matches the needs of the workloads with the price-performance of the storage system and simplifies the recovery process if the interconnect between compute and storage gets congested. It can also serve the data needs of all stages of the training pipeline: hundreds of petabytes of raw training data, hundreds of terabytes of input tokens, and tens of terabytes of model weights all have similar performance needs and can be stored on the same infrastructure with the appropriate QOS settings.
  4. A pool of general-purpose compute infrastructure for hosting the raw training data indices. This can also be used to support vector databases, raw context documents for RAG, and any other ancillary services required for production inferencing.
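As a rough sizing example for item #1, carrying over the 500 GB checkpoint and the roughly ten-second partner-drain window from the checkpointing discussion above (the per-drive write bandwidth is an assumption):

    # Rough node-local SSD bandwidth sizing; the per-drive figure is an assumption.
    checkpoint_gb       = 500    # DRAM checkpoint arriving from a partner node
    drain_seconds       = 10     # the ~ten-second window from the checkpointing section
    per_drive_write_gbs = 6      # assumed sustained write bandwidth per NVMe drive (GB/s)

    needed_gbs = checkpoint_gb / drain_seconds
    print(f"{needed_gbs:.0f} GB/s of local write bandwidth, "
          f"~{needed_gbs / per_drive_write_gbs:.0f} NVMe drives per node")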

By eschewing a high-performance parallel file system and localizing I/O performance inside the GPU cluster with node-local NVMe, a vanilla network between the GPU cluster and the other subsystems is sufficient. Although less high-performance, these non-critical bits (ideally) have lower complexity and lower maintenance and support burdens as well, allowing (again, ideally) more resources to be sloshed towards supporting the high-value GPU infrastructure.

Incidentally, this architecture happens to be how most of the largest AI training clusters on which I work are designed.

But parallel file systems aren't all bad

Of course, having no parallel file system presents some usability challenges if users are expecting to be able to SSH into a login node and have a complete user environment ready. The user experience for the above infrastructure works best for those who are comfortable developing software in containers and launching pods rather than developing software in vim and submitting Slurm jobs. I do not advocate for throwing out parallel file systems if they're already ingrained in users' workflows!

In addition, the latest crop of modern, distributed file systems all now support multi-protocol data access. For example, WEKA, VAST, and Qumulo all support S3 (object) interfaces as first-class citizens. Users who want the traditional HPC experience can play with their data using a file mount as they always have, while those who are coming in from the cloud-native side have equal access to those same data as objects. Supporting multiprotocol access to data in AI environments doesn't reduce the need to overbuild infrastructure or support stateful file mounts across all compute nodes, but it does provide an onramp for users to get comfortable moving away from the traditional HPC user experience.

Finally, a few of the leading-edge parallel-file-system-turned-AI-storage platforms are also shipping features that make them valuable for the deployment and inferencing part of the lifecycle. For example, WEKA has their WARRP reference architecture for RAG, and VAST has its InsightEngine--both use the unique architectures underneath their file interfaces to accelerate vector queries far beyond what you would get from running a vector database on, say, Lustre. These so-called "AI data platforms," despite starting as parallel file systems, are spreading their relevance out to the entire LLM lifecycle, filling needs for file, object, and structured data with a single storage system.

This is all to say that parallel file systems aren't bad, and they aren't going anywhere. But they aren't required to train frontier models either, and as I've tried to describe above, some of the largest supercomputers on the planet are designed not to require them.