Storage Utilization in the Long Tail of Science


Since changing careers and moving up to the San Francisco Bay Area in July, I haven't had nearly as much time to post interesting things here on my blog—I guess that's the startup life. That isn't to say that my life in DNA sequencing hasn't been without interesting observations to explore though; the world of high-throughput sequencing is becoming increasingly dependent on high-performance computing, and many of the problems being solved in genomics and bioinformatics are stressing aspects of system architecture and cyberinfrastructure that haven't gotten a tremendous amount of exercise from the more traditional scientific domains in computational research.

Take, for example, the biggest and baddest DNA sequencer on the market: over the course of a three-day run, it outputs around 670 GB of raw (but compressed) sequence data, and this data is spread out over 1,400,000 files. This would translate to an average file size of around 500 KB, but the reality is that the file sizes are a lot less uniform:

Figure 1. File size distribution of a single flow cell output (~770 gigabases) on Illumina's highest-end sequencing platform

After some basic processing (which involves opening and closing hundreds of these files repeatedly and concurrently), these data files are converted into very large files (tens or hundreds of gigabytes each) which then get reduced down to data that is more digestible over the course of hundreds of CPU hours. As one might imagine, this entire process is very good at taxing many aspects of file systems, and on the computational side, most of this IO-intensive processing is not distributed and performance benefits most from single-stream, single-client throughput.

As a result of these data access and processing patterns, the storage landscape in the world of DNA sequencing and bioinformatics is quite different from conventional supercomputing. Some large sequencing centers do use the file systems we know and love (and hate) like GPFS at JGI and Lustre at Sanger, but it appears that most small- and mid-scale sequencing operations are relying heavily on network-attached storage (NAS) for both receiving raw sequencer data and being a storage substrate for all of the downstream data processing.

I say all of this because these data patterns—accessing large quantities of small files and large files with a high degree of random IO—is a common trait in many scientific applications used in the "long tail of science." The fact is, the sorts of IO for which parallel file systems like Lustre and GPFS are designed are tedious (if not difficult) to program, and for the majority of codes that don't require thousands of cores to make new discoveries, simply reading and writing data files in a naïve way is "good enough."

The Long Tail

This long tail of science is also using up a huge amount of the supercomputing resources made available to the national open science community; to illustrate, 98% of all jobs submitted to the XSEDE supercomputers in 2013 used 1024 or fewer CPU cores, and these modest-scale jobs represented over 50% of all the CPU time burned up on these machines.

Figure 2. Cumulative job size distribution (weighted by job count and SUs consumed) for all jobs submitted to XSEDE compute resources in 2013

The NSF has responded to this shift in user demand by awarding Comet, a 2 PF supercomputer designed to run these modest-scale jobs. The Comet architecture limits its full-bisection bandwidth interconnectivity to groups of 72 nodes, and these 72-node islands will actually have enough cores to satisfy 99% of all the jobs submitted to XSEDE clusters in 2013 (see above). By limiting the full-bisection connectivity to smaller islands and using less rich connectivity between islands, the cost savings in not having to buy so many mid-tier and core switches are then turned into additional CPU capacity.

What the Comet architecture doesn't address, however, is the question of data patterns and IO stress being generated by this same long tail of science—the so-called 99%. If DNA sequencing is any indicator of the 99%, parallel file systems are actually a poor choice for high-capacity, mid-scale jobs because their performance degrades significantly when facing many small files. Now, the real question is, are the 99% of HPC jobs really generating and manipulating lots of small files in favor of the large striped files that Lustre and GPFS are designed to handle? That is, might the majority of jobs on today's HPC clusters actually be better served by file systems that are less scalable but handle small files and random IO more gracefully?

Some colleagues and I set out to answer this question last spring, and a part of this quest involved looking at every single file on two of SDSC's Data Oasis file systems. This represented about 1.7 PB of real user data spread across two Lustre 2.4 file systems—one designed for temporary scratch data and the other for projects storage—and we wanted to know if users' data really consisted of the large files that Lustre loves or if, like job size, the 99% are really working with small files.  Since SDSC's two national resources, Gordon and Trestles, restrict the maximum core count for user jobs to modest-scale submissions, these file systems should contain files representative of long-tail users.

Scratch File Systems

At the roughest cut, files can be categorized based on whether their size is on the order of bytes and kilobytes (size < 1024*1024 bytes), megabytes (< 1024 KB), gigabytes (<1024 MB), and terabytes (< 1024 GB). Although pie charts are generally a terrible way to show relative compositions, this is how the files on the 1.2 PB scratch file system broke down:

Figure 3. Fraction of file count consumed by files of a given size on Data Oasis's scratch file system for Gordon

The above figure shows the number of files on the file system classified by their size, and there are clearly a preponderance of small files less than a gigabyte in size. This is not terribly surprising as the data is biased towards smaller files; that is, you can fit a thousand one-megabyte files in the same space that a single one-gigabyte file would take up. Another way to show this data is by how much file system capacity is taken up by files of each size:

Figure 4. File system capacity consumed by files of a given size on Data Oasis's scratch file system for Gordon

This makes it very apparent that the vast majority of the used space on this scratch file system—a total of 1.23 PB of data—are taken up by files on the order of gigabytes and megabytes. There were only seventeen files that were a terabyte or larger in size.

Incidentally, I don't find it too surprising that there are so few terabyte-sized files; even in the realm of Hadoop, median job dataset sizes are on the order of a dozen gigabytes (e.g., Facebook has reported that 90% of its jobs read in under 100 GB of data). Examining file sizes with much finer granularity reveals that the research data on this file system isn't even of Facebook scale though:

Figure 5. Number of files of a given size on Data Oasis's scratch file system for Gordon.  This data forms the basis for Figure 3 above

While there are a large number of files on the order of a few gigabytes, it seems that files on the order of tens of gigabytes or larger are far more scarce. Turning this into relative terms,

Figure 6. Cumulative distribution of files of a given size on Data Oasis's scratch file system for Gordon

we can make more meaningful statements. In particular,

  • 90% of the files on this Lustre file system are 1 megabyte or smaller
  • 99% of files are 32 MB or less
  • 99.9% of files are 512 MB or less
  • and 99.99% of files are 4 GB or less

The first statement is quite powerful when you consider the fact that the default stripe size in Lustre is 1 MB. The fact that 90% of files on the file system are smaller than this means that 90% of users' files really gain no advantages by living on Lustre. Furthermore, since this is a scratch file system that is meant to hold temporary files, it would appear that either user applications are generating a large amount of small files, or users are copying in large quantities of small files and improperly using it for cold storage. Given the quota policies for Data Oasis, I suspect there is a bit of truth to both.

Circling back a bit though, I said earlier that comparing just the quantity of files can be a bit misleading since a thousand 1 KB files will take up the same space as a single 1 MB file. We can also look at how much total space is taken up by files of various sizes.

Figure 7. File system capacity consumed by files of a given size on Data Oasis's scratch file system for Gordon.  This is just a more finely diced version of the data presented in Figure 4 above.

The above chart is a bit data-dense so it takes some staring at to understand what's going on. First looking at the purple line, we can pull out some pretty interesting facts:

  • Half of the file system's used capacity (50%) is consumed by files that are 1 GB or less in size
  • Over 20% of the file system's used capacity is taken up by files smaller than 64 MB
  • About 10% of the capacity is used by files that are 64 GB or larger

The blue boxes represent the derivative of that purple line—that is, how much space is taken up by files of only one specific size. The biggest chunk of the file system (141 TB) is taken up by 4 GB files, but it appears that there is a substantial range of file sizes that take up very similarly sized pieces of the pie. 512 MB files take up a total of 139 TB; 1 GB, 2 GB, and 8 GB files all take up over 100 TB of total space each as well. In fact, files ranging from 512 MB to 8 GB comprise 50% of the total file system capacity.

Why the sweet spot for space-consuming files is between 512 MB and 8 GB is unclear, but I suspect it's more caused by the human element in research. In my own research, I worked with files in this range simply because it was enough data to be statistically meaningful while still small enough to quickly re-analyze or transfer to a colleague. For file sizes above this range, the mass of the data made it difficult to manipulate using the "long-tail" cyberinfrastructure available to me. But, perhaps as more national-scale systems comes online to meet the needs of these sorts of workloads, this sweet spot will creep out to larger file sizes.

Projects Storage

The above discussion admittedly comes with a lot of caveats.  In particular, the scratch file system we examined was governed by no hard quotas which did lead some people to leave data resident for longer than they probably should have.  However, the other file system we analyzed was SDSC's Data Oasis projects storage which was architected for capacity over performance and featured substantially more disks per OSS.  This projects storage also came with 500 GB quotas by default, forcing users to be a little more mindful of what was worth keeping.

Stepping back to the coarse-grained kilobyte/megabyte/gigabyte/terabyte pie charts, here is how projects storage utilization compared to scratch storage:

Figure 8. Fraction of file count consumed by files of a given size on Data Oasis's projects file system (shared between Gordon and Trestles users)

On the basis of file counts, it's a bit surprising that users seem to store more smaller (kilobyte-sized) files in their projects space than their scratch space.  This may imply that the beginning and end data bookending simulations aren't as large as the intermediate data generated during the calculation.  Alternately, it may be a reflection of user naïveté; I've found that newer users were often afraid to use the scratch space because of the perception that their data may vanish from there without advanced notice.  Either way, gigabyte-sized files comprised a few hundredths of a percent of files, and terabyte-sized files were more scarce still on both file systems.  The trend was uniformly towards smaller sizes on projects space.

As far as space consumed by these files, the differences remain subtle.

Figure 9. Fraction of file system capacity consumed by files of a given size on Data Oasis's projects file system

There appears to be a trend towards users keeping larger files in their projects space, and the biggest change is the decrease in megabyte-sized files in favor of gigabyte-sized files.  However, this trend is very small and persists across a finer-grained examination of file size distributions:

Figure 10. File system capacity consumed by files of a given size on Data Oasis's projects file system

Half of the above plot is the same data shown above, making this plot twice as busy and confusing.  However there's a lot of interesting data captured in it, so it's worth the confusing presentation.  In particular, the overall distribution of mass with respect to the various file sizes is remarkably consistent between scratch and projects storage.  We see the same general peak of file size preference in the 1 GB to 10 GB range, but there is a subtle bimodal divide in projects storage that reveals preference for 128MB-512MB and 4GB-8GB files which manifests in the integrals (red and purple lines) that show a visibly greater slope in these regions.

The observant reader will also notice that the absolute values of the bars are smaller for projects storage and scratch storage; this is a result of the fact that the projects file system is subject to quotas and, as a result, is not nearly as full of user data.  To complicate things further, the projects storage represents user data from two different machines (each with unique job size policies, to boot), whereas the scratch storage is only accessible from one of those machines.  Despite these differences though, user data follows very similar distributions between both file systems.


It is probably unclear what to take away from these data, and that is with good reason.  There are fundamentally two aspects to quantifying storage utilizations--raw capacity and file count--because they represent two logically separate things.  There is some degree of interchangeability (e.g., storing a whole genome in one file vs. storing each chromosome its own file), and this is likely contributing to the broad peak in file size between 512 MB and 8 GB.  With that being said, it appears that the typical long-tail user stores a substantial amount of decidedly "small" files on Lustre, and this is exemplified by the fact that 90% of the files resident on the file systems analyzed here are 1 MB or less in size.

This alone suggests that large parallel file systems may not actually be the most appropriate choice for HPC systems that are designed to support a large group of long-tail users.  While file systems like Lustre and GPFS certainly provide a unique capability in that some types of medium-sized jobs absolutely require the IO capabilities of parallel file systems, there are a larger number of long-tail applications that do single-thread IO, and some of these perform IO in such an abusive way (looking at you, quantum chemistry) that they cannot run on file systems like Lustre or GPFS because of the number of small files and random IO they use.

So if Lustre and GPFS aren't the unequivocal best choice for storage in long-tail HPC, what are the other options?

Burst Buffers

I would be remiss if I neglected to mention burst buffers here since they are designed, in part, to address the limitations of parallel file systems.  However, their actual usability remains unproven.  Anecdotally, long-tail users are generally not quick to alter the way they design their jobs to use cutting-edge technology, and my personal experiences with Gordon (and its 300 TB of flash) were that getting IO-nasty user applications to effectively utilize the flash was often a very manual process that introduced new complexities, pitfalls, and failure modes.  Gordon was a very experimental platform though, and Cray's new DataWarp burst buffer seems to be the first large-scale productization of this idea.  It will be interesting to see how well it works for real users when the technology starts hitting the floor for open science in mid-2016, if not sooner.

High-Performance NAS

An emerging trend in HPC storage is the use of high-performance NAS as a complementary file system technology in HPC platforms.  Traditionally, NAS has been a very poor choice for HPC applications because of the limited scalability of the typical NAS architecture--data resides on traditional local file system with network service being provided by an additional software layer like NFS, and the ratio of storage capacity to network bandwidth out of the NAS is very high.

The emergence of cheap RAM and enterprise SSDs has allowed some sophisticated file systems like ZFS and NetApp's WAFL to demonstrate very high performance, especially in delivering very high random read performance, by using both RAM and flash as a buffer between the network and spinning rust.  This allows certain smaller-scale jobs to enjoy substantially better performance when running on flash-backed NAS than a parallel file system.  Consider the following IOP/metadata benchmark run on a parallel file system and a NAS head with SSDs for caching:

Figure 11. File stat rate on flash-backed NAS vs. a parallel file system as measured by the mdtest benchmark

A four-node job that relies on statting many small files (for example, an application that traverses a large directory structure such as the output of one of the Illumina sequencers I mentioned above) can achieve a much higher IO rate on a high-performance NAS than on a parallel file system.  Granted, there are a lot of qualifications to be made with this statement and benchmarking high-performance NAS is worth a post of its own, but the above data illustrate a case where NAS may be preferable over something like Lustre.

Greater Context

Parallel file systems like Lustre and GPFS will always play an essential role in HPC, and I don't want to make it sound like they can be universally replaced by high-performance NAS.  They are fundamentally architected to scale out so that increasing file system bandwidth does not require adding new partitions or using software to emulate a single namespace.  In fact, the single namespace of parallel file systems makes the management of the storage system, its users, and the underlying resources very flexible and straightforward.  No volume partitioning needs to be imposed, so scientific applications' and projects' data consumption do not have to align with physical hardware boundaries.

However, there are cases where a single namespace is not necessary at all; for example, user home directories are naturally partitioned with fine granularity and can be mounted in a uniform location while physically residing on different NAS heads with a simple autofs map.  In this example, leaving user home directories on a pool of NAS filers offers two big benefits:

  1. Full independence of the underlying storage mitigates the impact of one bad user.  A large job dropping multiple files per MPI process will crush both Lustre and NFS, but in the case of Lustre, the MDS may become unresponsive and block IO across all users' home directories.
  2. Flash caches on NAS can provide higher performance on IOP-intensive workloads at long-tail job sizes.  In many ways, high-performance NAS systems have the built-in burst buffers that parallel file systems are only now beginning to incorporate.
Of course, these two wins come at a cost:
  1. Fully decentralized storage is more difficult to manage.  For example, balancing capacity across all NAS systems is tricky when users have very different data generation rates that they do not disclose ahead of time.
  2. Flash caches can only get you so far, and NFS will fall over when enough IO is thrown at it.  I mentioned that 98% of all jobs use 1024 cores or fewer (see Figure 1), but 1024 cores all performing heavy IO on a typical capacity-rich, bandwidth-poor NAS head will cause it to grind to a halt.
Flash-backed high-performance NAS is not an end-all storage solution for long-tail computational science, but it also isn't something to be overlooked outright.  As with any technology in the HPC arena, its utility may or may not match up well with users' workloads, but when it does, it can deliver less pain and better performance than parallel file systems.


As I mentioned above, the data I presented here was largely generated as a result of an internal project in which I participated while at SDSC.  I couldn't have cobbled this all together without the help of SDSC's HPC Systems group, and I'm really indebted to +Rick+Haisong, and +Trevor for doing a lot of the heavy lifting in terms of generating the original data, getting systems configured to test, and figuring out what it all meant when the dust settled (even after I had left!).  SDSC's really a world-class group of individuals.