Thursday, April 22, 2010
Matthew Trunnell on Data Management Challenges at The Broad
He noted that as data grows very large, if it's not well managed you start to spend substantial amounts of time looking for data instead of actually working with it. Their legacy data is growing faster than the cost of technology is dropping, which is driving total costs up to the point where they can no longer afford to back up all of it. Furthermore, they've developed their own tool (fsview) to analyze current utilization of data storage, and found as many as 18 redundant copies of some files.
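To give a feel for how redundant copies can be detected, here is a minimal Python sketch. It is not how fsview works (the talk didn't cover its internals); it assumes the simplest possible approach, which is to group files by size and then by a SHA-256 digest of their contents. The function names `file_digest` and `find_redundant_copies` are my own, made up for illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_redundant_copies(root):
    """Return {digest: [paths]} for every file content found more than once.

    Hypothetical helper, not fsview: files are grouped by size first so that
    only files that could possibly be identical are actually hashed.
    """
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so a link to a file is not counted as a copy.
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        for path in paths:
            by_digest[file_digest(path)].append(path)

    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}
```

Calling `find_redundant_copies("/some/project/dir")` returns one entry per duplicated file content, so the length of each list is the number of copies. At the scale described in the talk you would want something far more careful about I/O, but the idea is the same.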
The primary issue is that the simple tools included with filesystems provide very limited metadata for managing data (typically file owner, group, size, and creation/modification/access times). Information such as project, laboratory, security classification, availability requirements, and lifespan is not available. This information is critical for managing efficient and cost-effective storage of the data, because the data needs to be identified long after it's created, when the original creator may not be available or may not remember the details.
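The gap between what the filesystem knows and what a data manager needs can be made concrete with a short Python sketch. The first function returns only what `os.stat` reports; the `DataRecord` class is a hypothetical record I made up to hold the extra fields listed above. Its field names and the `is_expired` helper are assumptions for illustration, not anything presented in the talk.

```python
import os
from dataclasses import dataclass
from datetime import date
from typing import Optional


def filesystem_metadata(path):
    """Return the metadata a typical filesystem provides for a file."""
    st = os.stat(path)
    return {
        "owner_uid": st.st_uid,
        "group_gid": st.st_gid,
        "size_bytes": st.st_size,
        "modified": st.st_mtime,
        "accessed": st.st_atime,
        # On Unix, st_ctime is the last metadata change, not creation time.
        "changed": st.st_ctime,
    }


@dataclass
class DataRecord:
    """Hypothetical record of the metadata the filesystem does not keep."""
    path: str
    project: str
    laboratory: str
    security_classification: str
    availability: str
    retain_until: Optional[date] = None  # None means no lifespan was set

    def is_expired(self, today):
        """True once the declared lifespan of the data has passed."""
        return self.retain_until is not None and today > self.retain_until
```

Nothing in `filesystem_metadata` says which project or lab a file belongs to or how long it must be kept. Those answers have to come from a record like `DataRecord`, which someone (or some automated process) has to fill in when the data is created.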
iRODS is software derived from the Storage Resource Broker (SRB) from the San Diego Supercomputer Center at UCSD. SRB and its variants have made fairly substantial penetration into large-scale data management (I've previously talked to the folks at General Atomics, who provide a commercially supported version of SRB called Nirvana SRB).
But he said the technology is not really the challenge. The biggest challenge is the cultural change required of the scientists, who will need to tag their data as it is created. Some of the tagging can be automated, but other metadata will need to be provided by people. He said his approach will be to provide the service, and data won't get the usual services (backup, security, etc.) until it has been registered in the DAM.
I'm very interested in these efforts, as they're a natural follow-on to the whitepaper we are writing on Research IT Infrastructure, which clearly demonstrates the challenges we face in the next five years and the need for better data management.