Wednesday, April 29, 2009

Clifford Reid on Personalized Genomics


Clifford Reid, CEO of Complete Genomics

Kevin Davies opened up the second full day at Bio-IT World 2009 by thanking the audience for solid participation in tough economic times - attendance at the conference is up compared to last year, which will help the group in their efforts to kick off Bio-IT World Europe in October.
Complete Genomics is "the Netflix of Next-Gen."
- Kevin Davies
Kevin then introduced Clifford Reid for the day's keynote, titled Personalized Genomics - The Impact of Large-Scale Human Sequencing Projects. Cliff began his talk with a review of the cost of sequencing, which had been dropping by 2x per year until 2007, when it began dropping by 10x per year (a rough projection of what those rates imply follows the list below). This drop is the result of the confluence of three technologies:
  • Bio - Oligos (synthetic DNA), enzymes, fluorescent molecules, DNA amplification
  • Nano - Photolithography, nano-robotics, CCD optical systems, ZMWs, nanopores
  • Info - Moore's Law (HPC), digital image processing, and an informatic insight: short paired-end reads (removing the need for long reads)
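
To make those rates concrete, here's a minimal sketch of what they imply over a few years; the $10M starting cost in 2007 is an illustrative assumption, not a figure from the talk - only the 2x and 10x annual rates come from the presentation:

```python
# Back-of-the-envelope projection of sequencing cost decline.
# The $10M starting cost is an assumption for illustration;
# the 2x and 10x annual rates are from the talk.

def project(cost_2007, factor, years):
    """Return (year, cost) pairs declining by `factor` per year."""
    return [(2007 + y, cost_2007 / factor**y) for y in range(years + 1)]

if __name__ == "__main__":
    start = 10_000_000  # assumed 2007 cost per genome, in dollars
    for factor in (2, 10):
        print(f"--- {factor}x per year ---")
        for year, cost in project(start, factor, 4):
            print(f"{year}: ${cost:,.0f}")
```
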
Complete Genomics' approach includes manipulating the DNA to make it easily readable - they start with 500-base fragments, circularize them, then insert adaptors. These are combined into DNA nano-balls in a test tube, which are then dumped out onto a slide where they self-assemble into a square grid. The grid can be aligned with the imaging CCD's pixels, so no pixels are wasted. The slide can hold an entire human genome, and in the near future will be read in half a day.

Sequencing a slide currently costs $5000, with the reagent costs at $1000. The chemical component of sequencing used to be the major cost; now that cost is down to hundreds of dollars per slide, and the costs of imaging and compute are the major factors. As this progresses, the data sets will get huge - 60 TB of images and 1 TB of processed data, requiring thousands of CPUs to process.
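
A quick bit of arithmetic puts those data volumes in perspective; the 60 TB and 1 TB figures are from the talk, while the core count and per-core processing rate below are assumptions for illustration only:

```python
# Rough arithmetic on the per-genome data volumes quoted in the talk:
# ~60 TB of raw images reduced to ~1 TB of processed data.

raw_images_tb = 60      # raw image data per genome (from the talk)
processed_tb = 1        # processed data per genome (from the talk)
print(f"Image-to-result reduction: {raw_images_tb / processed_tb:.0f}:1")

# How long would a cluster take to chew through the images?
# Both numbers below are assumptions for illustration.
cores = 2000                 # "thousands of CPUs"
mb_per_core_per_sec = 2      # assumed image-processing rate per core
total_mb = raw_images_tb * 1_000_000
hours = total_mb / (cores * mb_per_core_per_sec) / 3600
print(f"~{hours:.0f} hours to process one genome's images on {cores} cores")
```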

Complete Genomics is working in a new vendor model, as an information service provider rather than an instrument vendor. They are focusing on scaling through continuous manufacturing technology; they are half a sequencing company and half a computing company; and they rely on two key distribution technologies: FedEx for atoms, and the Internet for bits. This is driving a new user model, the instrument-less genome research center, analogous to fab-less semiconductor firms.

CG's five year mission is to build ~10 genome centers around the world and sequence 1 million human genomes.

The impact of this revolution will mean that academia can tackle large human populations and orphan diseases that aren't otherwise commercially interesting. BioPharma will see a reduction in costs and will attack cancer head-on. Agriculture and energy fields will see an explosion of new, economically viable genomic studies on plants and microbes. Personalized genomics, already in the early stages of use, will continue to grow rapidly.

The field of genetics research is moving from instrument-centric efforts to a data-centric focus over the next five years, and eventually to a post-discovery, action-centric world.

Cliff then noted that major leaps in science come with advancements in measuring tools that allow new hypotheses to be tested. He took the room back some 400+ years to the invention of light microscopy, which overhauled our view of the world, but noted that it didn't do much for medical progress until the improvements in the 1870s that enabled a view into the cell. The cause of tuberculosis was subsequently identified, and there was more medical progress in 3 years than there had been in the previous 300 years. High-throughput, low-cost sequencing will enable large-scale complete human genome studies - the next five years will see the investigation of the genome with a high-resolution gene microscope, and it will do for cancer what the light microscope did for tuberculosis.

Next Generation Sequencing Data Management & Analysis


Melissa Kramer of CSHL and Matthew Trunnell of The Broad

The first full day of Bio-IT World Expo 2009 included an interesting panel discussion with mid-to-high end users of next generation sequencers. Moderated by Gerald Sample from BlueArc, it included Melissa Kramer from Cold Spring Harbor Laboratory, Matthew Trunnell from The Broad Institute, and Bruce Martin from Complete Genomics.

CSHL is the smallest operation of the three, with nine Illuminas. A staff of 11 runs their sequencing process, using BlueArc systems for storage and a 2,000-core cluster of IBM blades for compute.

The Broad is one of the largest sequencing operations in the world, with 47 Illuminas, 8 SOLiDs, 10 454s, and close to 4 PB of storage (mostly Isilon and Sun).

Complete Genomics is working on next next-gen sequencing, which requires very high throughput. Their sequencing technology is built in-house and is image-based. They run Isilon storage to handle active sequencing, plus a cheaper tier for parking data before handing it over to the customer. Thanks to the hand-off, they don't have the long-term storage concerns of most labs. They are, however, still challenged by the storage I/O needs of the instruments - he characterized the problem as "severe", and said they are resorting to custom engineering and dark fiber to address it.

Matthew said that in designing storage for NGS, it's important to distinguish between working storage for the pipeline versus what is needed downstream for later analysis and retention.

The group discussed the changing nature of what is considered "raw" data: increasingly the images or initial output are no longer kept, and the first-pass analysis of the sequencing process is now considered raw. Melissa noted that CSHL started out keeping the raw data, but has since dropped that practice due to storage constraints.

The Broad has been struggling with how to manage the huge amounts of data - they have 1000+ filesystems and billions of files. Migration policies for moving data between different tiers of storage based on filesystem information are neither straightforward nor sufficient; there is a critical need for better metadata describing the data. Bruce commented that they don't try to figure out what to archive, they figure out what to delete. Melissa noted that they are moving some of their data to tape for long-term storage; Bruce cautioned that if you use tape, make sure to have multiple copies, as "tape dies over time".
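
A first pass at that kind of triage can be driven by nothing more than filesystem attributes. Here's a minimal sketch (the root path and age thresholds are hypothetical) that walks a tree and flags candidates for migration or deletion - with the caveat, as the panel noted, that real decisions need richer metadata than the filesystem alone can provide:

```python
#!/usr/bin/env python
"""Minimal sketch of a metadata-driven triage pass over a storage tier.

The root path and thresholds are hypothetical; a real policy would also
consult project/experiment metadata, which plain filesystem attributes
cannot supply - the panel's point exactly.
"""
import os
import time

ROOT = "/data/sequencing"       # hypothetical mount point
MIGRATE_AFTER_DAYS = 90         # move to a cheaper tier
DELETE_AFTER_DAYS = 365         # candidate for deletion
now = time.time()

def age_days(path):
    """Age of a file in days, based on last modification time."""
    return (now - os.path.getmtime(path)) / 86400

for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            days = age_days(path)
            size_gb = os.path.getsize(path) / 1e9
        except OSError:
            continue  # file vanished or is unreadable; skip it
        if days > DELETE_AFTER_DAYS:
            print(f"DELETE?  {size_gb:8.1f} GB  {path}")
        elif days > MIGRATE_AFTER_DAYS:
            print(f"MIGRATE  {size_gb:8.1f} GB  {path}")
```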

Matthew said that compute and working storage will need to move closer to the instruments as they provide greater capacity. He noted that moving data is a problem because networks are not getting faster as quickly as the data is growing - sneakernet is still frequently in use.
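
A little arithmetic shows why sneakernet hangs on; in this sketch the dataset size, link speeds, and utilization factor are all illustrative assumptions, not numbers from the panel:

```python
# Why sneakernet persists: time to move a dataset over the wire
# versus shipping drives. All figures below are illustrative assumptions.

dataset_tb = 10
dataset_bits = dataset_tb * 8e12

links = {"100 Mb/s": 100e6, "1 Gb/s": 1e9, "10 Gb/s": 10e9}
for name, bps in links.items():
    hours = dataset_bits / (bps * 0.7) / 3600  # assume ~70% effective utilization
    print(f"{name:>9}: {hours:6.1f} hours")

print("Overnight courier: ~24 hours, regardless of size")
```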

Bruce agreed and said that it's much easier to build out compute than to build out storage. He agreed with Chris Dagdigian's keynote that analysis data will eventually go one-way into the cloud.

UPDATE (5/14/09): BioInform has a nice article up covering NGS data storage issues as discussed at Bio-IT World.

Tuesday, April 28, 2009

Eric Schadt on the Systems Biology Revolution


"There are no such things as pathways, there are only networks."
- Eric Schadt

Tuesday morning started out with Eric Schadt, Ph.D., Executive Scientific Director, Genetics, Rosetta Inpharmatics/Merck Research Labs; Vice President and Chief Scientific Officer, Sage, with a talk titled Integrative Genomics.

Schadt founded and led the genetics department in molecular profiling at Merck's Rosetta subsidiary, and is considered a pioneer in integrative genomics. While still at Merck, he is transitioning to Sage, "an open access, integrative bionetwork evolved by contributor scientists working to eliminate human disease." Sage will take the massive amounts of data Schadt has created at Merck and put it into the public domain. Schadt's vision is for Sage to be the Google of biology.

Eric provided a brief review of sequencing research, and noted that the avalanche of genetic data doesn't really explain underlying mechanisms - the data alone is not delivering the breakthroughs some thought it would. We can see the patterns, but we don't understand what is driving them. As an example of how knowing the pattern is not enough, he put up a cartoon of a fat guy sitting in a lounger walking his fat dog on a treadmill, because studies show a correlation between the weight of pets and their owners.

He then went through a relatively technical discussion of his research, noting how they discovered there was a huge network of gene interaction not only within a tissue but between tissues, and that these interactions had a demonstrable impact on health.



He had some nice visualizations of this activity, including an interesting diagram with various tissues and their associated genes arranged in a circle, with lines representing interactions within tissues (on the outside of the circle) and those between tissues (inside the circle). It clearly shows equal if not greater interaction between tissues. These and related visualizations revealed that disease in one tissue may be driven by changes in another tissue. Later, when asked about the possible mechanisms for this communication, he said it's not understood; the endocrine system can't explain it. He has a hypothesis that new signaling mechanisms are yet to be discovered, possibly cells that can communicate network state.
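
For a feel of how such a diagram is put together, here's a minimal sketch using networkx and matplotlib; the tissue names and interaction edges are invented placeholders, not Schadt's data, and for simplicity it only draws the cross-tissue chords through the interior of the circle:

```python
# Illustrative reconstruction of a circular tissue-interaction diagram.
# The tissues and edges below are invented placeholders, not Schadt's data.
import matplotlib.pyplot as plt
import networkx as nx

tissues = ["adipose", "liver", "muscle", "brain", "blood", "islet"]
# (tissue_a, tissue_b) pairs representing cross-tissue interactions
edges = [("adipose", "liver"), ("liver", "muscle"), ("adipose", "islet"),
         ("brain", "blood"), ("blood", "liver"), ("muscle", "islet")]

G = nx.Graph()
G.add_nodes_from(tissues)
G.add_edges_from(edges)

pos = nx.circular_layout(G)  # arrange the tissues around a circle
nx.draw_networkx_nodes(G, pos, node_size=1200, node_color="lightblue")
nx.draw_networkx_labels(G, pos, font_size=8)
nx.draw_networkx_edges(G, pos)  # chords through the interior = cross-tissue links
plt.axis("off")
plt.show()
```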

Monday, April 27, 2009

Gluster on commodity hardware


I had a long, winding, and typically interesting conversation with Jacob Farmer from Cambridge Computer yesterday afternoon. One of the favorite topics for bio-IT infrastructure folks is how to make good use of Sun's Thumper (aka Thor) systems, which are extremely attractive storage systems from a capital cost perspective, but aren't a complete package. Many organizations have started with one or two and then grown out to many only to discover they are expensive to operate and maintain.

Jacob said he has been looking at using Gluster, a free, open source package which "is a cluster file-system capable of scaling to several peta-bytes. It aggregates various storage bricks over Infiniband RDMA or TCP/IP interconnect into one large parallel network file system. GlusterFS is based on a stackable user space design without compromising performance." It sounds like some organizations are having success pairing Gluster with Thumpers/Thors, though one person I spoke with said they had some issues with the administrative tools and interface.

It looks like an interesting platform that bears watching.

Research Computing and Infrastructure Technology



The opening keynote at Bio-IT World Expo 2009 was given by Chris Dagdigian from BioTeam, titled Research Computing and Infrastructure Technology. Chris was introduced by Rudy Potenzone, Ph.D., WW Industry Technology Strategist for Pharmaceuticals, Microsoft Corporation, who noted in his remarks that "there will be more data generated in the next five years than in the history of mankind."

Chris is known for his 90-slides-in-30-minutes "Trends from the Trenches" presentations; however, this was a somewhat less manic version of that talk. The first topic was virtualization, which he says is the lowest-hanging fruit in the infrastructure. He talked about a west-coast campus that ran out of power in their data center, and in response built a "virtual colocation service" that recovered facilities, lowered costs, and provided more flexible services to users.

He had an interesting observation that the data deluge is not looking quite as scary as it was even last fall, not because the technology is coming to the rescue, but rather because people are realizing that there must be "data triage", or data management - we just can't keep it all. Furthermore, in the world of next-gen sequencing, he argues that infrastructure is not the gating element; the chemistry, reagent costs, and human factors are the bottlenecks to throughput.

In the last six months, BioTeam has put up their first 1 PB filesystem - he had a screen grab from a df command showing "1.1P". He really likes the "P". He noted more and more customers are not backing huge systems up - one customer has 50 TB of Isilon storage for research use, and they don't back it up.

He talked a bit about cloud computing and storage, and several times noted how much he likes James Hamilton's blog. James is with Amazon, which Chris says is the cloud - all the other providers are several years behind. Chris noted that James has said that cloud storage providers can offer 4x geographically distributed storage for $0.80/GB/year, which he says is less than any organization can provide storage in a single location, much less distributed. He said those kinds of economics are going to drive all data, even huge data, into the cloud. The problem that needs to be solved at the moment is that there is no good way to get large data (i.e., 1 TB/day) into the cloud, but he said Amazon is working on it and this will be overcome as well.
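
Plugging the quoted numbers into a quick calculation makes both the storage economics and the ingest bottleneck concrete; the $0.80/GB/year and 1 TB/day figures are from the talk, while the 1 PB archive and the 100 Mb/s WAN link are illustrative assumptions:

```python
# Cloud storage economics and the ingest bottleneck. The $0.80/GB/year
# and 1 TB/day figures come from the talk; the 1 PB archive and the
# 100 Mb/s WAN link are illustrative assumptions.

cost_per_gb_year = 0.80          # 4x geo-replicated storage, per the talk
dataset_gb = 1_000_000           # assume a 1 PB archive
print(f"Annual cost to keep 1 PB in the cloud: ${cost_per_gb_year * dataset_gb:,.0f}")

daily_bits = 1 * 8e12            # 1 TB of new data per day, per the talk
link_bps = 100e6 * 0.7           # assume a 100 Mb/s WAN link at ~70% utilization
hours = daily_bits / link_bps / 3600
print(f"Hours to push one day's data over that link: {hours:.1f}")
# ~32 hours > 24 hours: the pipe can't keep up, hence the ingest problem.
```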

Looking out on the near horizon, Chris noted the recent release by Google of videos of their 2004 data center technology, and asked: if that's what they were doing five years ago, imagine what they and Amazon are doing now? The economics and competition related to these huge facilities are driving incredible, but secret, innovation. Slowly these innovations are starting to leak out, which is a good thing for the rest of the field. One example is the rising operating temperature of systems, and the huge energy savings associated with every extra degree hotter the facilities can run. Pushed by big customers, Dell is now offering systems warrantied for operation at 94°F, and Rackable offers systems that are supported at 104°F.

Lastly, he said he thinks federated storage is on the horizon, and referenced the recently formed partnership between BioTeam, Cambridge Computer, and General Atomics to deliver GA's Nirvana storage platform.

Thursday, April 23, 2009

Headed for Bio-IT World '09

I'm excited to find myself with a last-minute change of plans that has me headed to Bio-IT World Conference & Expo '09 next week. It's been a number of years since I've attended, and from the look of the agenda, they've made some major progress. I'm going to be focusing on Track 1 - IT Infrastructure & Operations, which has some great sessions lined up. I'm looking forward to talking to Matt Trunnell from the Broad Institute again; those folks are involved in supporting next-gen sequencing at a truly mind-boggling scale - they have something like 30 sequencers and just announced they were acquiring another 22! This should put them close to needing to deal with something like 10 TB/day of storage. It will also be interesting to hear from the BioTeam folks, who seem to have built a pretty good reputation in the Northeast for working in the life sciences space. Wednesday's keynote from Clifford Reid, CEO of Complete Genomics, and the following panel session on personalized medicine should also be interesting, as the lab where I work has been focusing on personalized medicine for some time. It will also give me a chance to catch up with Jacob Farmer from Cambridge Computer in person, one of our trusted storage advisors.

Thursday, April 2, 2009

We Sleep to Prune the Brain

In the process of caring for plants for many years (I collect orchids and bromeliads), I've found a number of parallels between their lives and mine. For example, their overall health can change rapidly for the worse, but only long-term, consistent, persistent attention and cultivation will result in an exemplary specimen. I've found the same is true of my own health - it takes months of consistent attention to diet, exercise, sleep, and stress to feel truly excellent.

So it was with some interest that I ran across this research that suggests sleep is designed to prune the brain of unneeded synapses:
Sleep's core function, Cirelli and Tononi say, is to prune the strength or number of synapses formed during waking hours, keeping just the strongest neuronal connections intact. Synapse strength increases throughout the day, with stronger synapses creating better contact between neurons. Stronger synapses also take up more space and consume more energy, and if left unchecked, this process—which Cirelli and Tononi believe occurs in many brain regions—would become unsustainable.2,3 Downscaling at night would reduce the energy and space requirement of the brain, eliminate the weakest synapses, and help keep the strongest neuronal connections intact. This assumption is based on the principle in neuroscience that if one neuron doesn't fire to another very often, the connection between the two neurons weakens. By eliminating some of the unimportant connections, the body, in theory, eliminates background connections and effectively sharpens the important connections.
Cool stuff!