Thursday, April 22, 2010

Matthew Trunnell on Data Management Challenges at The Broad

Matthew Trunnell, Acting Director, Advanced IT, Broad Institute, gave a talk in Track 1 Infrastructure - Hardware at Bio-IT World on Adjusting to the New Scale of Research Data Management. The Broad has been struggling with the PBs of data associated with their massive sequencing facilities for four years (he noted he ordered 1.1 PB of new storage last week), and is now encountering the issues associated with managing massive data collections that have challenged the physics and space engineering communities (among others) for the last 10-20 years.

He noted that as data grows very large, if it's not well managed you start to spend substantial amounts of time looking for data instead of actually working with it. Their legacy data is growing faster than the costs of technology are dropping, which is driving total costs up to the point where they can no longer afford to back up all data. Furthermore, they've developed their own tool (fsview), have analyzed current utilization of data storage, and found as many as 18 redundant copies of files.
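To make the redundant-copies problem concrete, here is a minimal sketch of how duplicate files can be detected by content. This is not fsview (I haven't seen how it works); the function names `file_digest` and `find_duplicates` are my own, and the approach (group by size, then confirm with a SHA-256 hash) is just one common way to do it.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Group regular files under root by content.

    Returns a dict mapping digest -> sorted list of paths, keeping only
    groups with two or more copies. (Hypothetical helper, not fsview.)
    """
    # Pass 1: bucket by size, which is cheap and rules out most files.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only files that share a size with at least one other file.
    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            by_digest[file_digest(path)].append(path)

    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}
```

Calling `find_duplicates("/some/data/root")` returns one entry per set of identical files. At petabyte scale you would need something far more careful (sampling, parallel scans, a persistent index), so treat this only as an illustration of the idea.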

The primary issue is that the simple tools included with filesystems provide very limited metadata for managing data (typically file owner, group, size, and creation/modification/access times). Information such as project, laboratory, security classification, availability requirements, and lifespan is not available. This information is critical for managing efficient and cost-effective storage of the data, because the data needs to be identified long after it's created, when the original creator may not be available or may not remember the details.
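As a rough illustration of that gap, the sketch below contrasts what a filesystem can report with the management fields listed above. The class name `ManagementMetadata`, its field names, and the helper functions are my own assumptions for illustration; they are not the Broad's schema or anything from the talk.

```python
import os
from dataclasses import dataclass, fields
from typing import Optional


def filesystem_metadata(path):
    """What the filesystem already knows about a file."""
    st = os.stat(path)
    return {
        "owner_uid": st.st_uid,
        "group_gid": st.st_gid,
        "size_bytes": st.st_size,
        "modified": st.st_mtime,
        "accessed": st.st_atime,
    }


@dataclass
class ManagementMetadata:
    """Fields the filesystem cannot supply (hypothetical schema)."""
    project: Optional[str] = None
    laboratory: Optional[str] = None
    security_classification: Optional[str] = None
    availability: Optional[str] = None
    lifespan_days: Optional[int] = None


def missing_fields(meta):
    """Return the names of management fields nobody has filled in yet."""
    return [f.name for f in fields(meta) if getattr(meta, f.name) is None]
```

For example, `missing_fields(ManagementMetadata(project="P1"))` lists the four fields still needed. Everything in `filesystem_metadata` comes for free; everything in `ManagementMetadata` has to be supplied by a person or a pipeline.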

He quoted investment guru Peter Lynch's adage, "Know what you own and why you own it" as the guiding principle of data management. As a step towards tackling the problem, the management of the Broad directed that all files be associated with funded projects. Matthew noted that there is an established field of solutions specifically designed to address this challenge: digital asset management (DAM). They are working with the iRODS software that is derived from the Storage Resource Broker (SRB) from the San Diego Supercomputer Center at UCSD. SRB and variants have made fairly substantial penetration into large-scale data management (I've previously talked to the folks at General Atomics, who provide a commercially supported version of SRB called Nirvana SRB).

But he said the technology is not really the challenge; the biggest change is the cultural change required of the scientists, who will need to tag their data as they are created. Some of the tagging can be automated, but there will also need to be other metadata provided by people. He said his approach will be to provide the service, and data won't get the usual services (backup, security, etc.) until they have been registered in the DAM.

I'm very interested in these efforts as it's a natural follow-on to the whitepaper we are working on focusing on Research IT Infrastructure, which clearly demonstrates the challenges we face in the next five years, and the need for better data management.

Chris Dag from BioTeam - Trends from the Trenches

Chris Dagdigian from BioTeam chaired the Track 1 Infrastructure - Hardware talks, and got some additional time because Phil Butcher from Sanger was unfortunately unable to attend due to the volcano. Chris ran through the latest Trends From the Trenches presentation, and had some interesting updates.
  • He had expected blades to win the HPC hardware battle, but has not seen that come to pass; it's still a split field
  • Intel is currently the chip of choice, but AMD might be back in the game
  • BioTeam has done more Sun Grid Engine consulting in the first quarter of 2010 than all of 2009; he's not concerned about SGE's future following the Oracle acquisition of Sun
  • He got a laugh from the crowd with "private clouds - still stupid in 2010". He notes that this is just marketing speak and doesn't really mean anything.
  • Public clouds, on the other hand, are very real, and close to being mainstream... he's a strong supporter of their use in the right situations.
  • DIY cluster/parallel filesystems have a higher risk of implementation failure due to lack of pre-sales planning and design, especially in smaller shops. He also recommended commercial solutions with formal support programs.
  • Clusters are increasingly utilizing fat nodes (32 core, 128 GB+ memory)
  • Petascale storage is no longer risky, and single namespace solutions are recommended
  • He expressed concern about the downstream analysis of data (such as sequencing) eating up storage capacity - while the HTPS pipeline is relatively easy to model, secondary analysis is much more difficult.
He had an interesting observation regarding communication between IT and scientists. He gave the example that scientists will often ask for 100% uptime and full data protection, but don't realize that the difference between five 9s and four 9s of uptime is several million dollars. He emphasized the need for better communication between IT and research, and that IT needs accurate accounting of IT costs so that it can explain the costs of services and facilities.
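For a sense of what the extra 9 buys, here is a small back-of-the-envelope calculation of the downtime each availability level allows per year. The function name is my own, and it assumes a 365-day year; the dollar figures are Chris's and are not something this arithmetic can reproduce.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines):
    """Minutes of downtime per year permitted at N nines of availability.

    Four 9s means 99.99% uptime, i.e. an unavailable fraction of 10**-4.
    """
    unavailable_fraction = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailable_fraction


for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes/year")
```

Four 9s allows about 52.6 minutes of downtime a year; five 9s allows about 5.3 minutes. In other words, the several million dollars buys roughly 47 minutes a year.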

He argues that the DNA data deluge will get better, mostly because the sequencing vendors are becoming more efficient in delivering data from the instruments. I would agree that the per run sizes will stabilize, but as the costs continue to plummet for sequencing, it will drive much greater demand, thus continuing the pressure on storage and compute infrastructures.

He also talked about an issue that Jackson has had recent experience with, and that is the challenge of high-speed networking. He noted that moving large data around requires more than just big pipes and bandwidth. His experience is the number of hops between locations can have a huge impact on performance, as well as the tools and protocols utilized.
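One standard reason that bandwidth alone isn't enough (my illustration, not something Chris presented) is that a single TCP stream can send at most one window of data per round trip, so its throughput is bounded by window size divided by round-trip time. More hops and more distance generally mean a longer round trip. The function names and the example numbers below are assumptions chosen for illustration.

```python
def max_tcp_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on single-stream TCP throughput in bits per second.

    At most one window of data can be in flight per round trip.
    """
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(link_bps, rtt_seconds):
    """Window size (the bandwidth-delay product) needed to fill a link."""
    return link_bps * rtt_seconds / 8


# Assumed example: a 64 KiB window over a path with an 80 ms round trip.
bound = max_tcp_throughput_bps(64 * 1024, 0.080)
print(f"Throughput bound: {bound / 1e6:.2f} Mbit/s")

# Window needed to fill a 1 Gbit/s link over the same path.
needed = window_needed_bytes(1e9, 0.080)
print(f"Window needed: {needed / 1e6:.1f} MB")
```

With those assumed numbers, the stream tops out around 6.55 Mbit/s no matter how big the pipe is, and filling a 1 Gbit/s link would need a window of about 10 MB. This is part of why transfer tools that tune windows or use parallel streams can make such a difference.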

John Halamka on EHR

"From the doctor's brain to the patient's vein." - John Halamka, CIO of Harvard Medical School, on the impacts of EHR.

Wednesday at Bio-IT World started out with a keynote by John Halamka, M.D., M.S., CIO of Harvard Medical School, and fourth person to have his genome publicly sequenced. John walked the group through the implications of hundreds of pages of $30B Healthcare IT legislation. Regarding privacy concerns he said, "there will not be a massive database in the basement of the White House run by Sarah Palin." He said the goal is to go from 20% to 100% use of EHR in five years, and characterized fully implemented Electronic Health Records as improving the accuracy and efficiency of medical records - "from the doctor's brain to the patient's vein."

John also related an interesting story about their one and only data breach. It started with an employee looking at a particular clinical trial involving 4000+ subjects. They found the data very compelling, and made a copy on their laptop (encrypted), which was then forgotten. A year later the employee left Beth Israel Deaconess and went to UCSF, and in the process copied the contents of their laptop to a new unencrypted laptop (CA has less stringent encryption requirements than MA). The laptop was stolen and pawned, and when the pawnshop owner couldn't get the system to boot, he called Dell Tech Support. Dell, upon discovering the contents of the laptop, contacted Beth Israel, and the laptop was returned in 24 hours. He said that he spends $1M annually on information security for BID, and that they are attacked every seven seconds over the Internet, with half of the attacks coming from eastern Europe and the other half from eastern Cambridge.

Some other points of interest:
  • Lab tests will start using controlled vocabulary to ensure consistency across providers.
  • Patients will be able to get a full copy of their EHR.
  • The Social Security Administration spends $500M annually managing paper records, which are subsequently digitized.
He also commented on the growing collection of wifi-enabled devices capable of measuring and reporting body telemetry. He is using a home scale which automatically transmits his weight, body mass, and other data to Google Health and Microsoft HealthVault.

Wednesday, April 21, 2010

Bio-IT World Keynote: How to Start a Drug Company

"I will be shocked if there aren't drugs in the market in the next 10-15 years that target aging genes [and pathways]." - Christoph Westphal, CEO of Sirtris Pharmaceuticals, responds to a question from Kevin Davies, Editor-in-Chief of Bio-IT World.

The kick-off keynote at Bio-IT World 2010 was given by Christoph Westphal, a doctor and scientist who has started a number of small drug companies. While most of them lost (or are losing) money, one was a widely-acclaimed success, at least for a time.

The conference was opened by Cindy Crowninshield, the conference director. One major challenge the conference has faced has been the need to find 20 replacement speakers due to travel disruptions caused by the Icelandic volcano.

The keynote was introduced by Ronald Ranauro, CEO of GenomeQuest, a company trying to carve out a niche in the next-gen sequencing data market with a SaaS offering they describe as "SDM" (sequence data management). It's not clear to me yet how they differ/integrate with LIMS solutions, but we'll find out more on Thursday, as I'll be meeting with them and another colleague from Jackson. Ron pointed to the exponential growth in genome data as a tremendous opportunity.


GenomeQuest is looking to aggregate 1 million public genomes.

Back to Christoph, who told the story of Sirtris, which was eventually acquired by GlaxoSmithKline for $720M. They are working with resveratrol, the anti-aging compound found in red wine. The talk focused on the process and components of a drug startup that has a chance of making it.

This wasn't a topic of particular interest to me, but there were some interesting thoughts. By coincidence I had just had a long conversation with bandmate Jim Coffman, who is researching aging with sea urchin larvae at the MDI Biological Lab, on a ride back from band practice. It was great exercise for what I've learned in Genetics I and II over the last year. Jim's work focuses more on the TOR pathway (which is linked to rapamycin, something being studied by Dave Harrison's lab at Jackson), but it seems similar to the SIRT1 pathway, which is regulated by resveratrol. The big picture here is that caloric intake affects these pathways, in that a calorie-restricted diet has been repeatedly shown to extend lifespan in multiple organisms.

But that wasn't the interesting part of the talk, really. More interesting were the insights into the incredible pace of progress in this field of research. Fifteen years ago, Christoph noted, it was considered crazy that there were genes involved in aging; now it's a major area of research.

Some other conclusions he's reached that apply fairly broadly:
  • it's more about the people than the technology
  • good teams overcome failures
  • a powerful vision/idea will attract supporters
He also noted that there is debate about whether to share data and information, or withhold it for competitive advantage. He is very strongly of the opinion that it is more important to share data and show you're a thought leader than worry about proprietary issues.

Tuesday, April 20, 2010

Prepping for Bio-IT World 2010

It's spring and I'm at the World Trade Center in Boston, which means it's time for Bio-IT World. Workshops started today and the main conference kicks off with a keynote and reception later this afternoon.

I'm really looking forward to a number of talks, and am struggling with the usual conflicts. The conference has seven different focus areas or tracks and I'm interested in the first four - IT Infrastructure Hardware, IT Infrastructure Software, Bioinformatics & Next Gen Data, and Systems and Predictive Biology.

Track 1 (Hardware) will spend all day Wednesday on Scaling up for the Data Deluge, and then on Thursday look at Sequencing, Genetics Data Management & Grid Computing in the morning and Data Storage & Usage for Computational Tasks in the afternoon. Talks in this track that I'm particularly interested in are:
  • Wed, 11:00 - Sanger Centre's Perspective on Data Storage Challenges, Phil Butcher, Head of Systems at Sanger
  • Wed, 11:30 - Adjusting to the New Scale of Research Data Management, Matthew Trunnell, Acting Director, Advanced IT, Broad Institute
  • Wed, 2:15 - Improving Storage Efficiency for Unstructured Research Data, Richard Shaginaw, Project Manager, Scientific Computing Services, Bristol-Myers Squibb
  • Wed, 3:45 - ResearchStation: A Bioinformatics Platform for Research Collaboration in Translational Medicine, Lynn H. Vogel, Ph.D., VP and CIO, Associate Professor, Bioinformatics and Computational Biology, The University of Texas M.D. Anderson Cancer Center
  • Thurs 11:00 - Pallas, a Computational Analysis Network, Charles Hurmiz, Director, Research Informatics, Information Sciences, St. Jude Children's Research Hospital
  • Thurs, 11:30 Making Systems and Services Easy: Secure File Sharing and Computational Portals, Shawn Houston, Technical Lead, Life Sciences Informatics, University of Alaska Fairbanks
  • Plus I'm presenting on IT Infrastructure Strategy in Support of Next-Gen Biological Research at 1:45 on Wed.
Track 2 (Software) spends Tuesday on Collaboration & Open Source Tools, Genomics Data & Wikis, and Semantic Web & Linked Data Technologies. Wednesday is focused on Information Exchange, Integration & Security. Items that catch my eye include:
  • Wed, 11:00 - Test to Best - Evidence for Collaboration and Science Driven IT as Criteria for Personalized Medicine, Michael Berens, Ph.D., Director of the Cancer and Cell Biology Division, Brain Tumor Research Lab, Translational Genomics Research Institute (ack! already a conflict with Track 1!)
  • Wed, 2:45 - Feature Presentation: The BIG Idea: Strategies to Achieve a Rapid-Learning Health System, Ken Buetow, Ph.D., Associate Director, Bioinformatics and Information Technology, National Cancer Institute (this talk spans Tracks 2, 3, 4, 6, and 7).
  • Thurs, 2:30 - Sharing Data While Keeping Control, Werner Ceusters, Professor, Director, Ontology Research Group, NYS Center of Excellence in Bioinformatics & Life Sciences
Track 3 (Bioinformatics and Next-Gen Data) focuses on Driving Biomarker Discovery and Translational Research on Wed morning, then Data Management & Integration Strategies on Wed afternoon, followed up with Application of Data on Thursday.
  • Wed, 11:00 - Leverage Emerging Technologies to Manage Genomic and Clinical Data, Stephen Friend, M.D., Ph.D., President, Sage Bionetworks. This first slot Wed is brutal. Sage is the non-profit off-shoot of Merck that looks very interesting.
  • Wed, 12:00 - Pipelining Your NGS Data, Nancy Miller Latimer, M.S. Senior Product Manager, Biological Sciences and Analytics, Accelrys. 
  • Wed 1:45 - CASTOR QC - A Database Approach for Handling Large Genomic Data Sets, Marc Bouffard, M.Sc., Senior Bioinformatician, Montreal Heart Institute and Genome Quebec Pharmacogenomics Center
  • Thurs, 11:00 - Unbiased Prioritization of Mutations in Cancer Genomes, David Dooling, Ph.D., Director, Analysis Developers, Laboratory Information Management Systems (LIMS), and the Information Systems Groups, The Genome Center at Washington University in St. Louis School of Medicine
  • Thurs, 3:00 - Toward Meaningful Whole-Genome Interpretation with Open Access Tools from the Genome Commons, Reece Hart, Ph.D., Chief Scientist, Genome Commons, UC Berkeley QB3 and Center for Computational Biology
Track 4 covers Data Modeling: Enabling Systems Medicine, Data Generation: Good Models Start with Good Data, and Data Integration: Modeling Disparate "Omic" Sources on Wed. On Thursday the focus continues with Data Integration in the morning, followed by Data Validation: From Benchtop to Clinical Outcomes. There are only a handful of talks I'm interested in on this track, and I've run out of time as the first keynote is about to start. And I haven't even covered the main morning keynotes. More later...