LLM training without a parallel file system
The illustrious Jeff Denworth recently posted a hot take across social media, claiming that training large language models (LLMs) doesn't require massive, expensive parallel file systems.

As someone who's been working on one of the largest supercomputers on the planet (one that has no parallel file system at all), I was surprised by how many incredulous or curious responses followed. I guess supercomputers and parallel file systems go together like peas and carrots in so many people's minds that the idea of running a massive parallel compute job without a massive parallel file system seems too unintuitive to be believable.

I've given talks in the past about how LLM training uses storage, but I realized I've never written it down. So, for the benefit of humankind, let's talk about how these supercomputers without parallel file systems work.