Wednesday, October 7, 2009
Summer Student Program." Many think of the Lab as just the "mouse house", i.e., a supplier of mice, or just a research lab, but our educational effort is just as important a part of our mission to improve human health.
Friday, June 5, 2009
Bruce Segee, Associate Professor of Electrical and Computer Engineering at the University of Maine in Orono, gave a presentation today at the Jackson Laboratory about a proposal to build an energy efficient regional data center in a former paper mill.
Bruce framed the CIDER (Cyberinfrastructure Investment for Development, Economic Growth, and Research) submission to the Maine Technology Asset Fund (MTAF) as a proposal to "change the world" by rejuvenating former paper mill space as a green energy platform. It involves partnerships with just about every technology-related organization in the state, and targets the facility near Old Town, Maine.
The facility has 40 MW of on-site green power in the form of three generators (hydro, biomass, and recovery boiler), 135,000 square feet of space, and tremendous cooling capacity from the river. The on-site natural gas turbine can take advantage of the nearby Juniper Ridge landfill, which produces enough gas to generate 5 MW. The cost to produce electricity at the facility is roughly $0.06/kWh, while the retail cost of power on the grid is $0.22/kWh, so there is clearly opportunity for a sustainable business model.
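To get a feel for the spread between those two rates, here's a quick back-of-the-envelope calculation. Only the $/kWh figures come from the talk; the assumed data center load is my own hypothetical number.

```python
# Rough illustration of the power-cost spread described above.
# The $/kWh rates are from the presentation; the 5 MW load is an
# assumption for illustration only.
ONSITE_COST = 0.06   # $/kWh to generate power at the mill
GRID_RETAIL = 0.22   # $/kWh retail price on the grid

load_mw = 5                          # hypothetical data center draw, MW
hours_per_year = 24 * 365
kwh_per_year = load_mw * 1000 * hours_per_year

savings = (GRID_RETAIL - ONSITE_COST) * kwh_per_year
print(f"Annual power cost advantage: ${savings:,.0f}")
```

At that assumed load, the on-site generation advantage works out to roughly $7 million a year, which is the kind of margin that makes the business model plausible.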
One ironic issue is that Maine river water is generally too cold for efficient use in the paper industry, as it requires more energy to boil it for the paper making process. This liability becomes an asset for data center cooling.
The cluster they aim to build would have 1024 cores and would, they hope, make the Top500 list. The computational capacity would be of interest to multiple industries in the state, including aquaculture and marine, composite materials, precision manufacturing, and forestry and agriculture.
The location is desirable as Maine does not have a good dispersion of carrier-neutral data centers - most are concentrated in the Portland area. The feasibility analysis finds the idea of selling data center space consistently profitable, and does not propose to sell compute cycles.
Bruce notes the facility's connection with the University will help develop the state's IT staffing resources and expertise - this is critical to the state's technology industry which otherwise has to compete with Massachusetts and other New England states for talent.
Bruce reviewed progress on the proposal, as well as 8 other proposals that all leverage this facility. An announcement of the MTAF funding status of CIDER is scheduled for Monday.
Bruce then reviewed the regional network facilities and issues, and noted that there is great opportunity for Maine to be the crossroads between the U.S., Canada, Europe and the rest of the world. There are currently two pending proposals with the NSF and NIH, with a longer term goal of a "three-ring binder" built with stimulus funding. The network would consist of three network loops within the state running east, north, and west with the Orono facility at the center.
All in all, some very interesting ideas which can hopefully lead towards a vision of Maine becoming another technology hosting region for the country.
Wednesday, April 29, 2009
Clifford Reid, CEO of Complete Genomics
Kevin Davies opened up the second full day at Bio-IT World 2009 with thanks to the audience for solid participation in tough economic times - attendance at the conference is up compared to last year, which will help the group in its efforts to kick off Bio-IT World Europe in October.
Complete Genomics is "the Netflix of Next-Gen."
- Kevin Davies
Kevin then introduced Clifford Reid for the day's keynote, titled Personalized Genomics - The Impact of Large-Scale Human Sequencing Projects. Reid began his talk with a review of the costs of sequencing, which started out dropping by 2x per year until 2007, when it began dropping by 10x per year. This drop is the result of the confluence of three technologies:
- Bio - Oligos (synthetic DNA), enzymes, fluorescent molecules, DNA amplification
- Nano - Photolithography, nano-robotics, CCD optical systems, ZMWs, nanopores
- Info - Moore's Law (HPC), digital image processing, informatic insight: short paired-end reads (removing the need for long reads)
Sequencing a slide currently costs $5000, with the reagent costs at $1000. The chemical component of sequencing used to be the major cost; now that cost is down to hundreds of dollars per slide, and the costs of imaging and compute are the major factors. As this progresses, the data sets will get huge - 60 TB of images and 1 TB of processed data, requiring thousands of CPUs to process.
Complete Genomics is working in a new vendor model: as an information service provider, not an instrument vendor. They are focusing on scaling through continuous manufacturing technology; they are half a sequencing company and half a computing company; and they rely on two key distribution technologies: FedEx for atoms, and the Internet for bits. This is driving a new user model, the instrument-less genome research center, analogous to the fab-less semiconductor firm.
CG's five year mission is to build ~10 genome centers around the world and sequence 1 million human genomes.
The impact of this revolution will mean that academia can tackle large human populations and orphan diseases that aren't otherwise commercially interesting. BioPharma will see a reduction in costs and will attack cancer head-on. Agriculture and energy fields will see an explosion of new, economically viable genomic studies on plants and microbes. Personalized genomics, already in the early stages of use, will continue to grow rapidly.
The field of genetics research is moving from instrument-centric efforts, to a data-centric focus for the next five years, and eventually to a post-discovery action-centric world.
Cliff then noted that major leaps in science come with advancements in measuring tools that allow new hypotheses to be tested. He took the room back some 400+ years to the invention of light microscopy, which overhauled our view of the world, but noted that it didn't do much for medical progress until the improvements of the 1870s that enabled a view into the cell. The cause of tuberculosis was subsequently identified, and there was more medical progress in 3 years than there had been in the previous 300. High-throughput, low-cost sequencing will enable large-scale complete human genome studies - the next five years will see the investigation of the genome with a high-resolution gene microscope, and it will do for cancer what the light microscope did for tuberculosis.
Melissa Kramer of CSHL and Matthew Trunnell of The Broad
The first full day of Bio-IT World Expo 2009 included an interesting panel discussion with mid-to-high-end users of next generation sequencers. Moderated by Gerald Sample from BlueArc, it included Melissa Kramer from Cold Spring Harbor Laboratory, Matthew Trunnell from The Broad Institute, and Bruce Martin from Complete Genomics.
CSHL is the smallest operation of the three, with nine Illuminas. The staff of 11 running their sequencing process uses BlueArc systems for storage and a 2000 core cluster of IBM blades for compute.
The Broad is one of the largest sequencing operations in the world, with 47 Illuminas, 8 SOLiDs, 10 454s, and close to 4 PB of storage (mostly Isilon and Sun).
Complete Genomics is working on next next-gen sequencing, which requires very high throughput. Their sequencing technology is built in-house and is image-based; they are running Isilon storage to handle the active sequencing, plus a cheaper tier for parking data before handing it over to the customer. Thanks to the hand-off, they don't have the long-term storage concerns of most labs. They are, however, continuing to be challenged by the storage I/O needs of the instruments - he characterized the problem as "severe", and said they are resorting to custom engineering and dark fiber to address it.
Matthew said that in designing storage for NGS, it's important to distinguish between working storage for the pipeline versus what is needed downstream for later analysis and retention.
The group discussed the changing nature of what is considered "raw" data, that increasingly the image or initial data is no longer kept, and the initial analysis of the sequencing process is now considered raw. Melissa noted that CSHL started out keeping the raw data, but have dropped that practice due to the storage constraints.
The Broad has been struggling with how to manage the huge amounts of data - they have 1000+ filesystems and billions of files. Migration policies for moving data between different tiers of storage based on filesystem information alone are neither straightforward nor sufficient; there is a critical need for better metadata describing the data. Bruce commented that they don't try to figure out what to archive; they figure out what to delete. Melissa noted that they are moving some of their data to tape for long-term storage; Bruce cautioned that if you use tape, make sure to have multiple copies, as "tape dies over time".
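A minimal sketch of what that metadata-driven approach might look like: filesystem age alone can't distinguish a finished project from an active one, so the policy consults project metadata as well. The field names, categories, and thresholds here are entirely hypothetical, not anything the panelists described.

```python
# Hypothetical metadata-driven placement policy: combine per-file
# filesystem info with project-level metadata to pick a tier (or
# decide to delete, per Bruce's "figure out what to delete" approach).
from datetime import date, timedelta

def placement(file_meta, project_meta, today=None):
    """Return 'fast', 'archive', or 'delete' for one file."""
    today = today or date.today()
    age = today - file_meta["last_access"]
    # Raw images can be dropped once base calling is complete,
    # mirroring the panel's point that images are no longer "raw" data.
    if file_meta["kind"] == "raw_image" and project_meta["base_calls_done"]:
        return "delete"
    # Recently touched files on active projects stay on fast storage.
    if project_meta["active"] and age < timedelta(days=90):
        return "fast"
    return "archive"

meta = {"last_access": date(2009, 1, 5), "kind": "aligned_reads"}
proj = {"active": False, "base_calls_done": True}
print(placement(meta, proj, today=date(2009, 4, 27)))  # -> archive
```

The point of the sketch is the shape of the decision, not the specific rules: without the project-level fields, the first two branches are impossible to express, which is exactly the gap the panel identified.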
Matthew said that compute and working storage will need to move closer to the instruments as they provide greater capacity. He noted that moving data is a problem because networks are not getting faster as quickly as the data is growing - sneakernet is still frequently in use.
Bruce agreed and said that it's much easier to build out compute than to build out storage. He agreed with Chris Dagdigian's keynote that analysis data will eventually go one-way into the cloud.
UPDATE (5/14/09): BioInform has a nice article up covering NGS data storage issues as discussed at Bio-IT World.
Tuesday, April 28, 2009
"There are no such things as pathways, there are only networks."
- Eric Schadt
Tuesday morning started out with Eric Schadt, Ph.D., Executive Scientific Director, Genetics, Rosetta Inpharmatics/Merck Research Labs; Vice President and Chief Scientific Officer, Sage, with a talk titled Integrative Genomics.
Schadt founded and led the genetics department in molecular profiling at Merck's Rosetta subsidiary, and is considered a pioneer in integrative genomics. While still at Merck, he is transitioning to Sage, "an open access, integrative bionetwork evolved by contributor scientists working to eliminate human disease." Sage will take the massive amounts of data Schadt has created at Merck and put it into the public domain. Schadt's vision is for Sage to be the Google of biology.
Eric provided a brief review of sequencing research, and noted that the avalanche of genetic data doesn't really explain underlying mechanisms - the data alone is not delivering the breakthroughs some thought it would. We can see the patterns, but don't understand what is driving them. As an example of how knowing the pattern is not enough, he put up a cartoon of a fat guy sitting in a lounger walking his fat dog on a treadmill, because studies show that there is a correlation between the weight of pets and their owners.
He then went through a relatively technical discussion of his research, noting how they discovered there was a huge network of gene interaction not only within a tissue but between tissues, and that these interactions had a demonstrable impact on health.
He had some nice visualizations of this activity, including an interesting diagram with various tissues and their associated genes arranged in a circle, and lines representing interactions within tissues (on the outside of the circle) and those between tissues (inside the circle). It clearly shows equal if not greater interaction between tissues. These and related visualizations revealed that disease in one tissue may be driven by changes in another tissue. Later, when asked about the possible mechanisms for this communication, he said it's not understood; the endocrine system can't explain it. He has a hypothesis that new signaling mechanisms are yet to be discovered, possibly cells that can communicate network state.
Monday, April 27, 2009
I had a long, winding, and typically interesting conversation with Jacob Farmer from Cambridge Computer yesterday afternoon. One of the favorite topics for bio-IT infrastructure folks is how to make good use of Sun's Thumper (aka Thor) systems, which are extremely attractive storage systems from a capital cost perspective, but aren't a complete package. Many organizations have started with one or two and then grown out to many only to discover they are expensive to operate and maintain.
Jacob said he has been looking at using Gluster, a free, open source package which "is a cluster file-system capable of scaling to several peta-bytes. It aggregates various storage bricks over Infiniband RDMA or TCP/IP interconnect into one large parallel network file system. GlusterFS is based on a stackable user space design without compromising performance." It sounds like some organizations are having success pairing Gluster with Thumpers/Thors, though one person I spoke with said they had some issues with the administrative tools and interface.
It looks like an interesting platform that bears watching.
The opening keynote at Bio-IT World Expo 2009 was given by Chris Dagdigian from BioTeam, titled Research Computing and Infrastructure Technology. Chris was introduced by Rudy Potenzone, Ph.D., WW Industry Technology Strategist for Pharmaceuticals, Microsoft Corporation, who noted in his remarks that "there will be more data generated in the next five years than in the history of mankind."
Chris is known for his 90-slides-in-30-minutes "Trends from the Trenches" presentations, however this was a somewhat less manic version of that talk. First topic was virtualization, which he says is the lowest hanging fruit in the infrastructure. He talked about a west-coast campus that ran out of power in the their data center, and in response built a "virtual colocation service" that recovered facilities, lowered costs, and provided more flexible services to users.
He had an interesting observation that the data deluge is not looking quite as scary as it was even last fall, not because the technology is coming to the rescue, but because people are realizing that there must be "data triage", or data management - we just can't keep it all. Furthermore, in the world of next-gen sequencing, he argues that infrastructure is not the gating element; the chemistry, reagent costs, and human factors are the bottlenecks to throughput.
In the last six months, BioTeam has put up their first 1 PB filesystem - he had a screen grab from a df command showing "1.1P". He really likes the "P". He noted more and more customers are not backing huge systems up - one customer has 50 TB of Isilon storage for research use, and they don't back it up.
He talked a bit about cloud computing and storage, and several times noted how much he likes James Hamilton's blog. James is with Amazon, which Chris says is the cloud - all the other providers are several years behind. Chris noted that James has said that cloud storage providers can deliver 4x geographically distributed storage for $0.80/GB/year, which he says is less than any organization can provide for data in a single location, much less distributed. He said those kinds of economics are going to drive all data, even huge data, into the cloud. The problem that needs to be solved at the moment is that there is no good way to get large data (i.e., 1 TB/day) into the cloud, but he said Amazon is working on this and it will be overcome as well.
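A quick calculation shows why 1 TB/day over the wire is such a problem. The link speeds below are my own illustrative choices, and real-world throughput would be lower still after protocol overhead.

```python
# Back-of-the-envelope check on the "1 TB/day into the cloud" problem.
# Link speeds are illustrative assumptions; sustained throughput on a
# real WAN link would be lower.
TB = 10**12  # bytes (decimal terabyte)

def hours_to_move(nbytes, link_mbps):
    """Hours to push nbytes over a link of link_mbps megabits/second."""
    bits = nbytes * 8
    seconds = bits / (link_mbps * 10**6)
    return seconds / 3600

for mbps in (100, 1000, 10000):
    print(f"{mbps:>5} Mb/s: {hours_to_move(TB, mbps):5.1f} hours per TB")
```

Even at a full, sustained gigabit you need a couple of hours per terabyte; at 100 Mb/s a single day's output takes nearly a full day to transmit, leaving no headroom at all - hence the interest in shipping disks instead.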
Looking out on the near horizon, Chris noted the recent release by Google of videos of their 2004 data center technology, and asked: if that's what they were doing five years ago, imagine what they and Amazon are doing now? The economics and competition surrounding these huge facilities are driving incredible but secret innovation. Slowly these innovations are starting to leak out, which is a good thing for the rest of the field. One example is the rising operating temperature of systems, and the huge energy savings associated with every extra degree hotter the facilities can run. Pushed by big customers, Dell is now offering systems warrantied for operation at 94°F, and Rackable offers systems supported to 104°F.
Lastly, he said he thinks federated storage is on the horizon, and referenced the recently formed partnership between BioTeam, Cambridge Computer, and General Atomics to deliver GA's Nirvana storage platform.
Thursday, April 23, 2009
I'm excited to find myself with a last-minute change of plans that has me headed to Bio-IT World Conference & Expo '09 next week. It's been a number of years since I've attended, and from the look of the agenda, they've made some major progress. I'm going to be focusing on Track 1 - IT Infrastructure & Operations, which has some great sessions lined up. I'm looking forward to talking to Matt Trunnell from the Broad Institute again; those folks are involved in supporting next-gen sequencing at a truly mind-boggling scale - they have something like 30 sequencers and just announced they were acquiring another 22! That should put them close to needing to deal with something like 10 TB/day of storage. It will also be interesting to hear from the BioTeam folks, who seem to have built a pretty good reputation in the Northeast for working in the life sciences space. Wednesday's keynote from Clifford Reid, CEO of Complete Genomics, and the following panel session on personalized medicine should also be interesting, as the lab where I work has been focusing on personalized medicine for some time. It will also give me a chance to catch up in person with Jacob Farmer from Cambridge Computer, one of our trusted storage advisors.
Thursday, April 2, 2009
In the process of caring for plants for many years (I collect orchids and bromeliads), I've found a number of parallels between their lives and mine. For example, their overall health can change rapidly for the worse, but only long-term, consistent, persistent attention and cultivation will result in an exemplary specimen. I've found the same is true of my own health - it takes months of consistent attention to diet, exercise, sleep, and stress to feel truly excellent.
So it was with some interest that I ran across this research that suggests sleep is designed to prune the brain of unneeded synapses:
Sleep's core function, Cirelli and Tononi say, is to prune the strength or number of synapses formed during waking hours, keeping just the strongest neuronal connections intact. Synapse strength increases throughout the day, with stronger synapses creating better contact between neurons. Stronger synapses also take up more space and consume more energy, and if left unchecked, this process—which Cirelli and Tononi believe occurs in many brain regions—would become unsustainable. Downscaling at night would reduce the energy and space requirement of the brain, eliminate the weakest synapses, and help keep the strongest neuronal connections intact. This assumption is based on the principle in neuroscience that if one neuron doesn't fire to another very often, the connection between the two neurons weakens. By eliminating some of the unimportant connections, the body, in theory, eliminates background connections and effectively sharpens the important connections.
Cool stuff!
Saturday, March 14, 2009
We've been working on building a consistent, simple process for evaluating and prioritizing requests for IT projects. Historically there always seem to be an order of magnitude more projects than there are funds or staff to tackle them. But how do you sort through this onslaught of needs to identify the real priorities?
We've used evaluation matrices for some time to facilitate structured conversations and make decisions about hiring staff and selecting IT products. The matrices ensure that the information needed to make a decision is complete, and provide the opportunity to capture the relative value of criteria through weights. In poking around the web, I found several processes like this one that take a similar approach to reviewing projects. Subsequently we've developed a relatively simple process that looks at both the reward and risk associated with a request, and tries to make the scoring as quantitative as possible.
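The mechanics of such a matrix are simple enough to sketch in a few lines. The criteria, weights, and scores below are invented for illustration; our actual matrix differs in the details.

```python
# Minimal sketch of a weighted reward/risk scoring matrix.
# All criteria, weights, and scores here are hypothetical examples.
def weighted_score(scores, weights):
    """Combine per-criterion scores (1-5 scale) using relative weights."""
    total_weight = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total_weight

reward_weights = {"mission_impact": 3, "cost_savings": 2, "user_demand": 1}
risk_weights = {"technical_risk": 2, "schedule_risk": 1, "vendor_risk": 1}

# One project request, scored 1-5 on each criterion.
request = {
    "mission_impact": 4, "cost_savings": 2, "user_demand": 5,
    "technical_risk": 2, "schedule_risk": 3, "vendor_risk": 1,
}

reward = weighted_score(request, reward_weights)
risk = weighted_score(request, risk_weights)
print(f"reward={reward:.2f} risk={risk:.2f}")
```

Plotting each request's reward against its risk makes the portfolio conversation concrete: high-reward, low-risk requests are easy wins, and everything else invites the discussion about trade-offs that is the real point of the exercise.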
The fascinating part of this is that running a collection of requests through such a tool can be quite enlightening. For example, one of the first things you discover is that often requests are too vague to score accurately, and that they need to be broken down into smaller efforts that start with a feasibility/analysis stage. This helps to explain why prioritizing always seems so hard - it's because in many cases there just isn't good information to make a decision!
You may also discover that small projects with quick turnarounds, small resource needs, and limited risk score well. To a certain extent this makes sense, and lines up well with the principles of good project management.
The most important thing in my mind is that it provides a way to have a thoughtful discussion about priorities - it's critical not to let the tool take over, it needs to facilitate the real issue, which is clear communication.
I spent some time recently mucking with the iPhone SDK. I built this calculator from a tutorial I found after spending a fair amount of time wading through the materials on Apple's Developer site. First, I'm not a coder by any stretch; I have no experience with C, C++, or Objective-C - what little I know is from BASIC and Pascal 20 years ago.
The iPhone is an amazing piece of hardware, and I have even more respect for some of the apps I have running on my phone now that I've been through this process. I typed the code in to get familiar with the editing interface and managed to find and fix five bugs the compiler caught (mostly typos); however, I was almost stumped by a seg fault once the app built and ran. After sleeping on it, I realized I had probably hooked up the code incorrectly in Interface Builder (the instructions weren't very clear on that part). So I looked at another app that worked, and discovered my mistake.
At this point I can probably create a very simple, relatively static app that provides information (like the first aid apps, for example). But anything beyond that which involves more substantial logic and/or use of some of the cool hardware like the accelerometer or GPS would require a lot of training in Objective-C. My understanding from my reading so far is that managing memory on the iPhone is much more manual than in other environments, and can be a major issue if you're not careful about cleaning up. That probably explains why my phone runs better if I reboot it every once in a while.
Apple's desire to control the quality of apps in the App Store is understandable - one of the most attractive things about the iPhone is its immediate, consistent usability. Yet it's proving difficult to scale. There are something like 25,000 apps in the store now, and the process for getting them approved has slowed down dramatically. Plus there is talk of a Premium Store for apps costing $20 and up. How this will play out as the G1, with its open approach, or the RIM and Microsoft stores get going will be interesting. Support for Flash has long been a sore spot on the iPhone, as it would mean Apple losing control over apps. Yet it would open up the development environment to a much wider market. One thought I've had that might resolve this conflict for Apple is to provide something like HyperCard for the iPhone. I always thought HyperCard was an interesting, multipurpose tool, and it would be great to have that kind of flexible swiss army knife paired with the iPhone hardware.
Friday, March 13, 2009
I got a basic proof of concept multi-touch table built this morning from stuff I scavenged at work (box from Shipping, tracing paper from Engineering, iSight from Desktop Support). It's pretty rough, but does work well enough to demo the principles.
I'm using NUI's tbeta software, they also have a video with the construction directions for the hardware, which took maybe 30 minutes to setup.
It's pretty twitchy, and is very much dependent on the lighting environment. Moderately bright diffuse light seems better than direct bright light.