Perspectives on the Current State of Data-Intensive Scientific Computing

I recently had the benefit of being invited to attend two workshops in Oakland, CA, hosted by the U.S. Department of Energy (DOE), that shared the common theme of emerging trends in data-intensive computing: the Joint User Forum on Data-Intensive Computing and the High Performance Computing Operational Review.  My current employment requires that I stay abreast of all topics in data-intensive scientific computing (I wish there were an acronym to abbreviate this...DISC perhaps?) so I didn't go in with the expectation of being exposed to a world of new information.  As it turned out though, I did gain a very insightful perspective on how data-intensive scientific computing (DISC), and I daresay Big Data, is seen by the people who operate some of the world's largest supercomputers.

The DOE perspective is surprisingly realistic, application-oriented, and tightly integrated with high-performance computing.  There was the obligatory discussion of Hadoop and how it may be wedged into machines at LLNL with Magpie, ORNL with Spot Hadoop, and SDSC with myHadoop, of course, and there was also some discussion of real production use of Hadoop on bona fide Hadoop clusters at some of the DOE labs.  However, Hadoop played only a minor role in the grand scheme of the two meetings for all of the reasons I've outlined previously.

Rather, these two meetings had three major themes that crept into all aspects of the discussion:
  1. Scientific workflows
  2. Burst buffers
  3. Data curation
I found this to be a very interesting trend, as #1 and #2 (workflows and burst buffers) aren't topics I'd heard come up at any other DISC workshops, forums, or meetings I've attended.  The connection between DISC and workflows wasn't immediately evident to me, and burst buffers are a unique aspect of cyberinfrastructure that was only thrust into the spotlight with the NERSC-8/LANL Trinity RFP last fall.  However, all three of these topics will become central to both data-intensive scientific computing and exascale supercomputing, since exascale machines will produce enormous volumes of data.

Scientific workflows

Workflows are one of those aspects of scientific computing that have been easy to dismiss as the toys of computer scientists because traditional problems in high-performance computing have typically been quite monolithic in how they are run.  SDSC's own Kepler and USC's Pegasus systems are perhaps the best-known and most highly engineered workflow management systems, and I have to confess that when I first heard of them a few years ago, I thought they seemed like a very complicated way to do very simple tasks.

As it turns out though, both data-intensive scientific computing and exascale computing (by virtue of the output size of exaflop calculations) tend to follow patterns that look an awful lot like map/reduce at a very abstract level.  This is because most data-intensive problems are not processing giant monoliths of tightly coupled and inter-related data; rather, they are working on large collections of generally independent data.  Consider the recent talk I gave about a large-scale genomic study on which I consulted; the general data processing flow was
  1. Receive 2,190 input files, 20 GB each, from a data-generating instrument
  2. Do some processing on each input file
  3. Combine groups of five input files into 438 files, each 100 GB in size
  4. Do more processing 
  5. Combine 438 files into 25 overlapping groups to get 100 files, each 2.5 GB in size
  6. Do more processing
  7. Combine 100 files into a single 250 GB file
  8. Perform statistical analysis on this 250 GB file for scientific insight
The natural data-parallelism inherent in the output of the data-generating instrument means that any collective insight to be gleaned from this data requires some sort of mapping and reduction, and the process of managing this large volume of distributed data is where scientific workflows become a necessary part of data-intensive scientific computing.  Managing terabytes or petabytes of data distributed across thousands or millions of logical records (whether they be files on a file system, rows in a database, or whatever else) very rapidly becomes a task that nobody will want to do by hand.  Hadoop/HDFS delivers an automated framework for managing these sorts of workflows if you don't mind rewriting all of your processing steps against the Hadoop API and building out HDFS infrastructure, but if you do mind, alternate workflow management systems begin to look very appealing.
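
To make the map/reduce shape of the first few steps concrete, here is a minimal Python sketch.  It only mirrors the bookkeeping (2,190 inputs, groups of five, 438 outputs); the file names and the process() and combine() functions are hypothetical placeholders, not the tools used in the actual study.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of `size` items; the last one may be shorter."""
    it = iter(items)
    while True:
        group = list(islice(it, size))
        if not group:
            return
        yield group


def process(name):
    # Placeholder for the per-file "map" step (step 2 above).
    return name + ".processed"


def combine(names, out_name):
    # Placeholder for the "reduce" step (step 3 above); a real version
    # would merge the contents of `names` into `out_name`.
    return out_name


# Hypothetical file names standing in for the 2,190 instrument outputs.
inputs = [f"sample_{i:04d}.dat" for i in range(2190)]

# Map: every input file is processed independently of the others.
mapped = [process(name) for name in inputs]

# Reduce: every group of five processed files becomes one combined file.
merged = [
    combine(group, f"merged_{i:03d}.dat")
    for i, group in enumerate(chunked(mapped, 5))
]

print(len(inputs), "inputs ->", len(merged), "combined files")
```

Each map call is independent, which is exactly why the bookkeeping (which files exist, which have been processed, which groups are complete) becomes the hard part at scale rather than the computation itself.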

The core debate was not whether or not workflow management systems were a necessary component in DISC; rather, I observed two salient, open questions:
  1. The workflow managers in use at DOE (notably FireWorks and qdo) exist primarily to work around a deficiency in current HPC schedulers (e.g., Moab and SLURM): those schedulers cannot handle hundreds of thousands of tiny jobs concurrently.  Should these workflow managers therefore be integrated into the scheduler to address the shortcoming at its source?
  2. How do we stop every user from creating his or her own workflow manager scripts and adopt an existing solution instead?  Should one workflow manager rule them all, or should a Darwinian approach be taken towards the current diverse landscape of existing software?
Question #1 is a highly technical question that has several dimensions; ultimately though, it's not clear to me that there is enough incentive for resource manager and scheduler developers to really dig into this problem.  They haven't done this yet, and I can only assume that this is a result of the perceived domain-specificity and complexity of each workflow.  In reality, a large number of workflows can be accommodated by two simple features: support for directed acyclic graphs (DAGs) of tasks and support for lightweight, fault-tolerant task scheduling within a pool of reserved resources.  Whether or not anyone will rise to the challenge of incorporating this in a usable way is an open question, but there certainly is a need for this in the emerging realm of DISC.
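
As a rough illustration of how little machinery those two features require, here is a minimal Python sketch that runs a DAG of tasks inside a fixed pool of workers and retries failed tasks.  This is not how FireWorks, qdo, or any scheduler implements it; run_dag() and its parameters are names I made up, the thread pool stands in for a reserved set of compute resources, and it assumes every dependency is itself a key in the task table.  It needs Python 3.9 or newer for graphlib.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(tasks, deps, workers=4, max_retries=2):
    """Run callables in dependency order inside a fixed worker pool.

    tasks: dict mapping task name -> zero-argument callable
    deps:  dict mapping task name -> set of names that must finish first
    A failed task is resubmitted up to `max_retries` times.
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in tasks})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    attempts = {name: 0 for name in tasks}
    results = {}
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while sorter.is_active():
            # Launch every task whose prerequisites have all completed.
            for name in sorter.get_ready():
                running[pool.submit(tasks[name])] = name
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                try:
                    results[name] = future.result()
                    sorter.done(name)  # unblocks dependent tasks
                except Exception:
                    attempts[name] += 1
                    if attempts[name] > max_retries:
                        raise
                    # Fault tolerance: put the failed task back in the pool.
                    running[pool.submit(tasks[name])] = name
    return results


if __name__ == "__main__":
    demo_tasks = {
        "fetch": lambda: "raw",
        "clean": lambda: "cleaned",
        "stats": lambda: "summary",
    }
    demo_deps = {"clean": {"fetch"}, "stats": {"clean"}}
    print(run_dag(demo_tasks, demo_deps))
```

The hard parts that a production scheduler would have to add (persistence across node failures, accounting, fair share) are exactly what this sketch leaves out.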

Question #2 is more interesting to me since this problem of multiple people cooking up different but equivalent solutions to the same problems is pervasive throughout computational and computer science. This is in large part due to the fatal assumption held by many computer scientists that good software can be simply "thrown over the fence" to scientists and it will be adopted.  This has never worked; rather, the majority of widely adopted software technologies in HPC have been a result of the standardization of a landscape of similar but non-standard tools.  This is something I touched on in a previous post when outlining the history of MPI and OpenMP's successes.

I don't think the developers of this menagerie of workflow managers are ready to settle on a standard, as the field is not mature enough to have a holistic understanding of all of the issues that workflows need to solve.  Despite the numerous presentations and discussions of various workflow solutions being used across DOE's user facilities, my presentation was the only one that considered optimizing workflow execution for the underlying hardware.  Given that the target audience of these talks was users of high-performance computing, the lack of consideration given to the performance aspects of workflow optimization is a testament to this immaturity.

Burst buffers

For those who haven't been following the details of one of DOE's more recent procurement rounds, the NERSC-8 and Trinity request for proposals (RFP) explicitly required that all vendor proposals include a burst buffer to address the capability of multi-petaflop simulations to dump tremendous amounts of data in very short order.  The target use case is for petascale checkpoint-restart, where the memory of thousands of nodes (hundreds of terabytes of data) needs to be flushed to disk in an amount of time that doesn't dominate the overall execution time of the calculation.
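
A quick back-of-the-envelope calculation shows why bandwidth is the crux of the checkpoint problem.  The Python sketch below is only illustrative: the memory size, bandwidths, and checkpoint interval are numbers I picked for the example, not figures from the NERSC-8/Trinity RFP, and checkpoint_overhead() is a made-up helper.

```python
def checkpoint_overhead(memory_tb, bandwidth_gb_s, interval_hours):
    """Estimate the cost of periodic checkpointing.

    memory_tb:       aggregate memory to flush, in terabytes (1 TB = 1000 GB)
    bandwidth_gb_s:  sustained write bandwidth, in gigabytes per second
    interval_hours:  compute time between consecutive checkpoints
    Returns (seconds per checkpoint, fraction of wall time spent flushing).
    """
    flush_seconds = memory_tb * 1000 / bandwidth_gb_s
    compute_seconds = interval_hours * 3600
    fraction = flush_seconds / (compute_seconds + flush_seconds)
    return flush_seconds, fraction


if __name__ == "__main__":
    # Hypothetical machine: 500 TB of memory, one checkpoint per hour.
    for label, bandwidth in [("disk-like", 100), ("flash-like", 1000)]:
        seconds, fraction = checkpoint_overhead(500, bandwidth, 1)
        print(f"{label}: {seconds:.0f} s per checkpoint, "
              f"{fraction:.0%} of wall time")
```

Under these assumed numbers, a tenfold increase in write bandwidth is the difference between checkpointing dominating the run and checkpointing being a tolerable tax, which is the gap a burst buffer is meant to close.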

The concept of what a burst buffer is remains poorly defined.  I got the sense that there are two outstanding definitions:
  • The NERSC burst buffer is something more tightly integrated on the compute side of the system and may be a resource that can be allocated on a per-job basis
  • The Argonne burst buffer is something more tightly integrated on the storage side of the system and acts in a fashion that is largely transparent to the user.  This sounded a lot like the burst buffer support being explored for Lustre.
In addition, Los Alamos National Laboratory (LANL) is exploring burst buffers for the Trinity procurement, and it wasn't clear to me whether they had chosen a definition or were exploring all angles.  One commonality is that DOE is going full-steam ahead on providing this burst buffer capability in some form or another, and solid-state storage is going to be a central enabling component.

Personally, I find the NERSC burst buffer concept a lot more interesting since it provides a more general-purpose flash-based resource that can be used in novel ways.  For example, emerging software-defined storage platforms like EMC's ViPR can potentially provide very fine-grained access to flash as needed to make better overall use of the underlying SSDs in HPC environments serving a broad user base (e.g., NERSC and the NSF centers).  Complementing these software technologies are emerging hardware technologies like DSSD's D5 product, which will be exposing flash to compute systems in innovative ways at the hardware, interconnect, and software levels.

Of course, the fact that my favorite supercomputer provides dynamically allocatable SSDs in a fashion not far removed from these NERSC burst buffers probably biases me, but we've demonstrated unique DISC successes enabled by our ability to pile tons of flash onto single compute nodes.  This isn't to say that the Argonne burst buffer is without merit; given that the Argonne Leadership Computing Facility (ALCF) caters to capability jobs rather than capacity jobs, their user base is better served by providing a uniform, transparent burst I/O capability across all nodes.  The NERSC burst buffer, by comparison, is a lot less transparent and will probably be much more susceptible to user disuse or misuse.  I suspect that when the dust settles, both takes on the burst buffer concept will make their way into production use.

A lot of the talk and technologies surrounding burst buffers are shrouded in NNSA secrecy or vendor non-disclosures, so I'm not sure what more there is to be said.  However, the good folks at HPCwire ran an insightful article on burst buffers after the NERSC-8 announcement for those who are interested in more detail.

Data curation

The final theme that bubbled just beneath the surface of the DOE workshops was the idea that we are coming upon an era where scientists can no longer save all their data from all their calculations in perpetuity.  Rather, someone will have to become the curator of the scientific data being generated by computations and figure out what is and is not worth keeping, and how or where that data should be stored and managed.  This concept of selectively retaining user data manifested in a variety of discussions ranging from in-place data sharing and publication with Globus Plus and science DMZs to transparently managing online data volumes with hierarchical storage management (HSM).  However, the common idea was that scientists are going to have to start coming to grips with data management themselves, as facilities will soon be unable to cope with the entirety of their users' data.

This was a particularly interesting problem to me because it very closely echoed the sentiments that came about from Datanami's recent LeverageBIGDATA event which had a much more industry-minded audience.  The general consensus is that several fields are far ahead of the pack in terms of addressing this issue; the high-energy physics community has been filtering data at its genesis (e.g., ignoring the data from uninteresting collision events) for years now, and enterprises seem comfortable with retaining marketing data for only as long as it is useful.  By comparison, NERSC's tape archive has not discarded user data since its inception several decades ago; each new tape system simply repacks the previous generation's tape to roll all old data forward.

All of the proposed solutions for this problem revolve around metadata.  The reality is that not all user data has equal importance, and there is a need to provide a mechanism for users (or their applications) to describe this fact.  For example, the principal use case for the aforementioned burst buffers is to store massive checkpoint-restart files; while these checkpoints are important to retain while a calculation is running, they have limited value after the calculation has completed.  Rather than rely on a user to manually recognize that these checkpoints can be deleted, the hope is that metadata attributes can be attached to these checkpoint files so that automated curation systems understand they are not critical data that must be retained forever.

The exact way this metadata would be used to manage space on a file system remains poorly defined.  A few examples of exactly how metadata can be used to manage data volume in data-intensive scientific computing environments include
  • tagging certain files or directories as permanent or ephemeral, signaling that the file system can purge certain files whenever a cleanup is initiated;
  • tagging certain files with a set expiration date, either as an option or by default.  When a file ages beyond a certain point, it would be deleted;
  • attributing a sliding scale of "importance" to each file, so that files of low importance can be transparently migrated to tape via HSM.
Some of these concepts are already implemented, but the ability for users and applications to attach extensible metadata to files in a file system-agnostic way does not yet exist.  I think this is a significant gap in technology that will need to be filled in very short order as pre-exascale machines begin to demonstrate the ability to generate tremendous I/O loads.  Frankly, I'm surprised this issue hasn't been solved in a broadly deployable way yet.
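
To show how the three kinds of metadata listed above could drive an automated policy, here is a minimal Python sketch.  Everything in it is hypothetical: the FileRecord fields, the 0 to 10 importance scale, the migration threshold, and the curate() function are inventions for illustration, and the metadata lives in plain Python objects precisely because no file system-agnostic way of attaching it to files exists yet.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FileRecord:
    """Hypothetical per-file curation metadata."""
    path: str
    ephemeral: bool = False             # safe to purge at any cleanup
    expires: Optional[datetime] = None  # purge once this time has passed
    importance: int = 5                 # 0 (lowest) to 10 (highest)


def curate(records, now, migrate_below=3):
    """Return one action per path: 'purge', 'migrate', or 'keep'."""
    actions = {}
    for record in records:
        expired = record.expires is not None and record.expires <= now
        if record.ephemeral or expired:
            actions[record.path] = "purge"
        elif record.importance < migrate_below:
            # Low-importance data moves to tape (e.g., via HSM).
            actions[record.path] = "migrate"
        else:
            actions[record.path] = "keep"
    return actions


if __name__ == "__main__":
    now = datetime(2014, 7, 1)
    records = [
        FileRecord("run42/checkpoint_0100.dat", ephemeral=True),
        FileRecord("run42/scratch.tmp", expires=datetime(2014, 6, 1)),
        FileRecord("run42/raw_trajectory.dat", importance=1),
        FileRecord("run42/final_results.h5", importance=9),
    ]
    for path, action in curate(records, now).items():
        print(f"{action:8s} {path}")
```

The policy itself is trivial; the missing piece is a standard place to store and query these attributes that survives moves between file systems, archives, and sites.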

The good news here is that the problem of curating digital data is not new; it is simply new to high-performance computing.  In the spirit of doing things the right way, DOE invited the director of LANL's Research Library to attend the workshops, and she provided valuable insights into how methods of digital data curation may be applied to these emerging challenges in data-intensive scientific computing.

Final Thoughts

The products of the working groups' conventions at the HPC Operational Review are being assembled into a report to be delivered to DOE's Office of Science, and it should be available online at the HPCOR 2014 website as well as the usual DOE document repository in a few months.  Hopefully it will reflect what I feel was the essence of the workshop, but at any rate, it should contain a nice perspective on how we can expect the HPC community to address the new demands emerging from the data-intensive scientific computing (DISC) community.

In the context of high-performance computing, 
  • Workflow management systems will continue to gain importance as data sets become larger, more parallel, and more unwieldy.
  • Burst buffers, in one form or another, will become the hardware solution to the fact that all exascale simulations will become data-intensive problems.
  • Data curation frameworks are the final piece of the puzzle and will provide the manageability of data at rest.
None of these three legs is fully developed, and this is simply an indication of data-intensive scientific computing's immaturity relative to more traditional high-performance computing:
  • Workflows need to converge on some sort of standardized API or feature set in order to provide the incentive to users to abandon their one-off solutions.
  • Burst buffer technology has diverged into two solutions centered at either the compute or storage side of a DISC platform; both serve different workloads, and the underlying hardware and software configurations remain unfinished.
  • Effective data curation requires a metadata management system that will allow both users and their applications to identify the importance of data to automate sensible data retention policy enforcement and HSM.
Of course, I could be way off in terms of what I took away from these meetings seeing as how I don't really know what I'm talking about.  Either way, it was a real treat to be invited out to hang out with the DOE folks for a week; I got to meet some of my personal supercomputing heroes, share war stories, and make some new pals.

I also got to spend eight days getting to know the Bay Area.  So as not to leave this post entirely without a picture,

I also learned that I have a weird fascination with streetcars.  I'm glad I was introduced to supercomputers first.