Thursday, April 22, 2010

Matthew Trunnell on Data Management Challenges at The Broad

Matthew Trunnell, Acting Director of Advanced IT at the Broad Institute, gave a talk in Track 1 (Infrastructure - Hardware) at Bio-IT World on Adjusting to the New Scale of Research Data Management. The Broad has been struggling for four years with the petabytes of data produced by its massive sequencing facilities (he noted he ordered 1.1 PB of new storage last week), and is now encountering the problems of managing massive data collections that have challenged the physics and space engineering communities (among others) for the last 10-20 years.

He noted that as data grow very large, a poorly managed collection means you spend substantial amounts of time looking for data instead of actually working with it. The Broad's legacy data is growing faster than the cost of storage technology is dropping, which is driving total costs up to the point where they can no longer afford to back up all their data. Furthermore, using fsview, a tool they developed in-house, they analyzed current storage utilization and found as many as 18 redundant copies of some files.
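(fsview is the Broad's own tool, and the talk didn't describe how it works. As a rough illustration of the idea, here's a minimal Python sketch of finding redundant copies by content hash; the root path is a placeholder.)

    import hashlib
    import os
    from collections import defaultdict

    def find_duplicates(root):
        """Group files under `root` by the SHA-256 of their contents.

        Returns {digest: [paths]} for every digest shared by more than
        one file; each such list is a set of redundant copies.
        """
        by_digest = defaultdict(list)
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                digest = hashlib.sha256()
                try:
                    with open(path, "rb") as f:
                        for chunk in iter(lambda: f.read(1 << 20), b""):
                            digest.update(chunk)
                except OSError:
                    continue  # unreadable file; skip it
                by_digest[digest.hexdigest()].append(path)
        return {d: paths for d, paths in by_digest.items() if len(paths) > 1}

    for digest, paths in find_duplicates("/data/project").items():
        print(len(paths), "copies:", ", ".join(paths))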

The primary issue is that the simple tools included with filesystems provide very limited metadata for managing data: typically file owner, group, size, and creation/modification/access times. Information such as project, laboratory, security classification, availability requirements, and lifespan is not captured. That information is critical for efficient and cost-effective storage of the data, because data must be identifiable long after it's created, when the original creator may no longer be available or may not remember the details.
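For concreteness, here is everything a standard POSIX filesystem will tell you about a file (a minimal Python sketch using only the standard library; the filename is a placeholder):

    import grp
    import os
    import pwd
    import time

    # Everything a POSIX filesystem records about a file: ownership,
    # permissions, size, and timestamps. Nothing about project, lab,
    # security classification, or retention.
    st = os.stat("reads.fastq")  # placeholder filename
    print("owner:   ", pwd.getpwuid(st.st_uid).pw_name)
    print("group:   ", grp.getgrgid(st.st_gid).gr_name)
    print("size:    ", st.st_size, "bytes")
    print("mode:    ", oct(st.st_mode & 0o777))
    print("changed: ", time.ctime(st.st_ctime))
    print("modified:", time.ctime(st.st_mtime))
    print("accessed:", time.ctime(st.st_atime))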

He quoted investment guru Peter Lynch's adage, "Know what you own and why you own it," as the guiding principle of data management. As a step towards tackling the problem, the Broad's management directed that all files be associated with funded projects. Matthew noted that there is an established field of solutions specifically designed to address this challenge: digital asset management (DAM). They are working with the iRODS software, which is derived from the Storage Resource Broker (SRB) developed at the San Diego Supercomputer Center (SDSC). SRB and its variants have made fairly substantial penetration into large-scale data management (I've previously talked to the folks at General Atomics, who provide a commercially supported version called Nirvana SRB).
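The core idea in iRODS is a metadata catalog of attribute-value-unit (AVU) triples kept alongside the data. The following toy Python sketch models that idea; the paths and tags are invented for illustration, and this is not the iRODS API:

    from collections import defaultdict

    # iRODS keeps metadata as attribute-value-unit (AVU) triples in a
    # catalog decoupled from the filesystem. A toy model of that idea:
    catalog = defaultdict(list)

    def tag(logical_path, attribute, value, unit=""):
        """Record an AVU triple against a logical path."""
        catalog[logical_path].append((attribute, value, unit))

    def query(attribute, value):
        """Find every path tagged with the given attribute/value pair."""
        return [path for path, avus in catalog.items()
                if any(a == attribute and v == value for a, v, _ in avus)]

    tag("/broadZone/seq/run42/reads.bam", "project", "1KG-pilot")
    tag("/broadZone/seq/run42/reads.bam", "lab", "sequencing-ops")
    tag("/broadZone/seq/run42/reads.bam", "retention", "24", "months")

    print(query("project", "1KG-pilot"))
    # ['/broadZone/seq/run42/reads.bam']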

But he said the technology is not really the challenge; the biggest hurdle is the cultural change required of the scientists, who will need to tag their data as they are created. Some of the tagging can be automated, but other metadata will have to come from people. His approach will be to provide the service, with the incentive that data won't get the usual services (backup, security, etc.) until they have been registered in the DAM.
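Registration effectively becomes the gate in front of the storage services. Here's a sketch of what such a policy check might look like, reusing the toy catalog from the previous sketch (the required attribute set is my guess, not Broad's actual policy):

    def eligible_for_backup(logical_path, catalog):
        """Grant backup only to files whose required tags are present.

        `catalog` maps logical paths to AVU triples, as in the sketch
        above; the required attribute set is a guess for illustration.
        """
        required = {"project", "lab", "retention"}
        present = {attr for attr, _value, _unit in catalog.get(logical_path, [])}
        missing = required - present
        if missing:
            print("not backing up", logical_path, "- missing:", sorted(missing))
            return False
        return True

    # Using the catalog from the sketch above:
    eligible_for_backup("/broadZone/seq/run42/reads.bam", catalog)  # True
    eligible_for_backup("/broadZone/scratch/tmp.dat", catalog)      # False (untagged)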

I'm very interested in these efforts, as they're a natural follow-on to the whitepaper on Research IT Infrastructure we are working on, which clearly lays out the challenges we face in the next five years and the need for better data management.
