Wednesday, November 19, 2008
SC08 Data Lifecycle Management BoF
The Data Lifecycle Management: ILM in an HPC World BoF at SC08 Tuesday afternoon was an interesting discussion of the issues surrounding the storage, archival, and management of large data sets. Sponsored by Avetec's Data Intensive Computing Environment (DICE), it was led by Tracey Wilson of Computer Sciences Corporation and DICE, and Ralph McEldowney of the Air Force Research Laboratory Major Shared Resource Center.
Many in the audience, including folks from NASA and other large sites, had years of experience with this - they were past the initial challenge of simply getting the capacity in place, and were now struggling to figure out what data needs to be kept and for how long, who owns it, how to find it again, and how to keep growing the service at something resembling an affordable cost. At the heart of the problem is the need for users to tag their stored data with metadata that describes such things as the owner, contents, projected size, expected lifespan, etc. Major HPC sites are finding that storage needs are eating into funding originally aimed at HPC capacity, thus potentially limiting the compute resources available for research.
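To make the tagging idea concrete, here is a minimal sketch of what a per-dataset metadata record might look like. It was not shown at the BoF; the class name, field names, and units are all hypothetical, chosen only to mirror the attributes mentioned above (owner, contents, projected size, expected lifespan).

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical metadata record attached to a stored data set."""
    owner: str                   # person or group responsible for the data
    contents: str                # short free-text description
    projected_size_gb: float     # how large the owner expects it to grow
    created: date                # when the data set was first stored
    expected_lifespan_days: int  # how long the owner expects to need it

    def expires_on(self) -> date:
        """Date after which the data is a candidate for review."""
        return self.created + timedelta(days=self.expected_lifespan_days)

    def is_expired(self, today: date) -> bool:
        """True once the expected lifespan has passed."""
        return today > self.expires_on()


record = DatasetMetadata(
    owner="example-group",
    contents="simulation output, run 12",
    projected_size_gb=500.0,
    created=date(2008, 11, 18),
    expected_lifespan_days=365,
)
print(record.expires_on())
```

Even a record this small would let a site answer the "who owns it" and "how long do we keep it" questions without having to chase down the user.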
The session leaders listed some of the more commonly used solutions in this general space today, although each offers a different level of functionality when it comes to handling the metadata associated with DLM.
The group seemed to agree that user training was a major goal - users need to understand the critical importance of associating metadata with their stored data. Some expressed frustration in getting users to go along with this, however - there was discussion of how to provide incentives to drive the right behavior, as well as the suggestion of deleting, after a set period of time, any data that does not have associated metadata.
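The "delete untagged data after a while" suggestion can be sketched as a simple filter. This is only an illustration under assumed inputs: the function name, the 90-day grace period, and the idea of a path-to-date catalog are mine, not anything a site described.

```python
from datetime import date, timedelta


def purge_candidates(stored, tagged, today, grace_days=90):
    """Return paths that lack metadata and are older than the grace period.

    stored: dict mapping path -> date the data was stored (hypothetical catalog)
    tagged: set of paths that have associated metadata
    grace_days: assumed grace period; no specific figure was given
    """
    cutoff = today - timedelta(days=grace_days)
    return sorted(
        path
        for path, stored_on in stored.items()
        if path not in tagged and stored_on < cutoff
    )


stored = {
    "/archive/a": date(2008, 1, 1),   # old, untagged
    "/archive/b": date(2008, 1, 1),   # old, but tagged
    "/archive/c": date(2008, 11, 1),  # untagged, still within grace period
}
candidates = purge_candidates(stored, {"/archive/b"}, date(2008, 11, 19))
print(candidates)
```

In practice a site would presumably warn the owner before anything is actually removed; the sketch only identifies candidates.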
One site provides disincentives through increased fees for inefficient use of storage - space that is requested but not utilized is charged at a higher rate, as is data that is retained beyond a certain timeframe.
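A fee structure like that could be sketched as follows. The rates and multipliers here are made-up placeholders, since no actual figures were given, and the function name and parameters are hypothetical.

```python
def monthly_charge(requested_gb, used_gb, overdue_gb,
                   base_rate=1.0, unused_multiplier=2.0, overdue_multiplier=1.5):
    """Charge for one month of storage under an assumed penalty scheme.

    requested_gb: space the user asked for
    used_gb:      space actually holding data
    overdue_gb:   portion of used_gb retained beyond the allowed timeframe
    All rates and multipliers are illustrative assumptions.
    """
    if not 0 <= overdue_gb <= used_gb:
        raise ValueError("overdue_gb must be between 0 and used_gb")
    unused_gb = max(requested_gb - used_gb, 0)  # requested but never utilized
    in_term_gb = used_gb - overdue_gb           # used and within retention
    return (
        in_term_gb * base_rate
        + unused_gb * base_rate * unused_multiplier
        + overdue_gb * base_rate * overdue_multiplier
    )


# 100 GB requested, 60 GB used, 10 GB of that kept past its retention date
print(monthly_charge(100, 60, 10))
```

The point of the higher multipliers is simply that both over-requesting and over-retaining cost more than storing the same amount of data efficiently.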
There was general consensus that establishing and enforcing policy around data management was a must - the example of eDiscovery was brought up as justification for properly classifying data so that it can be located easily during the discovery phase of a civil or criminal legal case. Failure to be prepared for eDiscovery can cost as much as $15M for a single case. One person noted that policy is relatively useless if developed by IT alone - it should be driven by the business needs of the organization, i.e., the records management and legal needs, working in partnership with IT. Another person commented that there is a high likelihood of a failure of the commons in the effort to develop policy if it is not driven from the top down. Most researchers will be certain that there is no risk to themselves re: eDiscovery, and so have no personal motivation to participate.