Posts

Featured

LLM training without a parallel file system

The illustrious Jeff Denworth recently posted a hot take across social media, claiming that training large language models (LLMs) doesn't require massive, expensive parallel file systems.

As someone who's been working on one of the largest supercomputers on the planet, one that has no parallel file system at all, I was surprised by how many incredulous or curious responses followed. I guess supercomputers and parallel file systems are like peas and carrots in so many people's minds; the idea of running a massive parallel compute job without a massive parallel file system is so unintuitive that it seems unbelievable.

I've given talks about how LLM training uses storage in the past, but I realized I've never written it down. So, for the benefit of humankind, let's talk about how these supercomputers without parallel file systems work.

Latest Posts

SC'24 recap

FASST will be DOE's opportunity to adapt, align, or...