Wednesday, April 29, 2009

Next Generation Sequencing Data Management & Analysis

Melissa Kramer of CSHL and Matthew Trunnell of The Broad

The first full day of Bio-IT World Expo 2009 included an interesting panel discussion with mid- to high-end users of next generation sequencers. Moderated by Gerald Sample from BlueArc, it included Melissa Kramer from Cold Spring Harbor Laboratory, Matthew Trunnell from The Broad Institute, and Bruce Martin from Complete Genomics.

CSHL is the smallest operation of the three, with nine Illuminas. The staff of 11 who run the sequencing process use BlueArc systems for storage and a 2,000-core cluster of IBM blades for compute.

The Broad is one of the largest sequencing operations in the world, with 47 Illuminas, 8 SOLiDs, 10 454s, and close to 4 PB of storage (mostly Isilon and Sun).

Complete Genomics is working on "next next-gen" sequencing, which requires very high throughput. Their sequencing technology is built in-house and is image-based. They run Isilon storage to handle the active sequencing, plus a cheaper tier for parking data before handing it over to the customer. Thanks to the hand-off, they don't have the long-term storage concerns of most labs. They do, however, continue to be challenged by the storage I/O needs of the instruments - Bruce characterized the problem as "severe", and said they are resorting to custom engineering and dark fiber to address it.

Matthew said that in designing storage for NGS, it's important to distinguish between working storage for the pipeline and the storage needed downstream for later analysis and retention.

The group discussed the changing nature of what is considered "raw" data: increasingly, the image or initial data is no longer kept, and the output of the initial analysis in the sequencing process is now considered raw. Melissa noted that CSHL started out keeping the raw data, but has dropped that practice due to storage constraints.

The Broad has been struggling with how to manage the huge amounts of data - they have 1000+ filesystems and billions of files. Migration policies that move data between tiers of storage based on filesystem information alone are neither straightforward nor sufficient; there is a critical need for better metadata describing the data. Bruce commented that at Complete Genomics they don't try to figure out what to archive, they figure out what to delete. Melissa noted that CSHL is moving some of its data to tape for long-term storage. Bruce cautioned that if you use tape, make sure to have multiple copies, as "tape dies over time".

Matthew said that the compute and working storage will need to move closer to the instruments as the instruments' capacity grows. He noted that moving data is a problem because networks are not getting faster as quickly as the data is growing - sneakernet is still frequently in use.

Bruce agreed and said that it's much easier to build out compute than to build out storage. He agreed with Chris Dagdigian's keynote that analysis data will eventually go one-way into the cloud.

UPDATE (5/14/09): BioInform has a nice article up covering NGS data storage issues as discussed at Bio-IT World.
