Posted: October 1, 2012
CASE STUDY: The New Genomic Bottleneck and How the Cloud Might Widen It
Emma J. Spaulding
For years, the constraining factor in genomics research was data generation. The slow step in the process, or “bottleneck,” was that generating genomic data required a great deal of time, specialized expertise and money. Scientists generated and read genomic sequences letter by letter. Because data were generated slowly, researchers had plenty of time to analyze them. In recent years, however, new technologies for genome sequencing have been developed. According to Ilya Shmulevich, Ph.D., a professor and researcher in computational biology at the Institute for Systems Biology, “Technology is getting better, faster, cheaper and more accurate… Now, we can generate much more data. It is basically a flood of data, and we have to analyze it.”
Flooded with TCGA Data
Analyzing the deluge of data is a challenge that Dr. Shmulevich is quite familiar with. As the Principal Investigator for one of the Genome Data Analysis Centers for The Cancer Genome Atlas (TCGA), he and his research group analyze and integrate TCGA data across thousands of samples and develop novel informatics tools to facilitate use of TCGA data. TCGA has taken advantage of the new sequencing technologies. As part of its goal to comprehensively characterize the genomes of over 25 types of cancer, the program shifted in 2009 from the slower Sanger sequencing to second generation technologies. These new systems are capable of sequencing a whole genome in a matter of days. Dr. Shmulevich says, “TCGA is probably the prototypical example for state-of-the-art acquisition and analysis of massive large scale heterogeneous data. TCGA has genome, transcriptome, proteomic, epigenetic and clinical—there are so many different data types all collected in one study!” This is one reason TCGA data are so powerful in analysis. However, Dr. Shmulevich notes that it is quite computationally demanding to tackle the heterogeneity of the data and the large scale analysis. He summarizes the difficulty and opportunity saying, “[TCGA’s] potential is great, but it also is a huge challenge.”
To contend with this challenge and integrate these many data types, Dr. Shmulevich and his group use an algorithm called RF-ACE, which makes connections among the pieces of genomic and clinical data in a large dataset based on patterns in a training dataset. He describes the importance: “We are beginning to understand these relationships [between the many different genomic findings], but we now need data-driven analysis to see how these data are related to each other.”
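The article does not describe RF-ACE’s internals, and the sketch below is not RF-ACE (which builds random forest ensembles); it is a deliberately simplified stand-in showing the general shape of a data-driven association search: score every candidate feature against a target variable, then rank the results. All names and data here are hypothetical.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def rank_associations(features, target):
    """Rank features (name -> measurements) by absolute correlation
    with a target variable, strongest association first."""
    scores = {name: abs(pearson(values, target))
              for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: two gene-expression profiles vs. one clinical variable.
features = {"gene_a": [1.0, 2.0, 3.0, 4.0],
            "gene_b": [4.0, 1.0, 3.0, 2.0]}
target = [2.0, 4.0, 6.0, 8.0]
print(rank_associations(features, target))  # ['gene_a', 'gene_b']
```

A real tool like RF-ACE captures nonlinear and multivariate relationships that a univariate correlation misses, but the workflow pattern is the same: score, sort, then investigate the top hits.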
The challenge for Dr. Shmulevich and his group’s research is that these analyses can become very complicated, very quickly. “We are measuring anywhere from tens of thousands to hundreds of thousands to potentially millions of variable data in each sample.” With the goal of understanding the relationships among variables, the sheer volume of possible associations is overwhelming, even for a computer. Dr. Shmulevich explains how quickly this analysis becomes computationally challenging. “These are very large feature spaces, meaning there are lots of measurements and variables. When you start looking for associations, the number of them explodes exponentially.”
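That growth can be made concrete with simple arithmetic (an illustration, not a figure from the article): even restricting attention to pairwise associations, the number of feature pairs grows quadratically, as n(n - 1)/2.

```python
def pairwise_associations(n: int) -> int:
    """Number of unordered pairs among n features: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# 10,000 features already yield about 50 million candidate pairs;
# 1,000,000 features yield about 500 billion.
for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} features -> {pairwise_associations(n):>15,} candidate pairs")
```

Higher-order interactions (triples of features and beyond) grow faster still, which is why the search space outpaces hardware improvements alone.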
To run these analyses, Dr. Shmulevich and his group need computing power. Computing power is often measured by the number of “cores” available. Most standard commercial laptops use a “dual-core processor,” meaning the computer has two cores, sometimes called central processing units or CPUs. A core executes the user’s commands and runs the user’s programs. In general, more cores mean programs run faster. Dr. Shmulevich and his group need to run RF-ACE to find the relationships among all the pieces of genomic data. However, the genomic datasets are so large that a commercial dual-core computer would take years to finish running the program. This is a problem for many computationally intensive research groups. To speed up analysis, they are using more and more cores. The Shmulevich group has 1000 cores available to them. Even so, that is sometimes not enough.
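More cores help because much of this kind of analysis is “embarrassingly parallel”: each sample or feature can be processed independently, so the work can be spread across worker processes, one per core. A minimal Python sketch of the idea (the workload here is a hypothetical stand-in, not the group’s actual code):

```python
from multiprocessing import Pool

def analyze(sample: int) -> int:
    """Stand-in for the independent per-sample work a real analysis performs."""
    return sum(range(sample % 100))

def run_on_cores(samples, cores: int):
    """Spread independent analyses across `cores` worker processes."""
    with Pool(processes=cores) as pool:
        return pool.map(analyze, samples)
```

For ideally parallel work, doubling the worker count roughly halves the wall-clock time, which is why a 1000-core cluster matters, and a cloud with far more cores matters even more.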
Dr. Shmulevich characterizes the problem: “We have approximately 1000 cores. They allow us to run this algorithm, but still it is very slow. For example, just one dataset on one cancer type can take hours.” The Shmulevich group’s research is an iterative series of questions, where the answer to each query informs the next one. Slow-running algorithms mean the team spends less time actively examining the data and more time waiting for computational analyses to finish.
Renting Computing Power from the Cloud
Now, however, a new solution could open the computational bottleneck. What if Dr. Shmulevich could “rent” the computational power he needed instead of buying it? Several companies have seen this gap and entered the Infrastructure as a Service (IaaS) market, in which providers supply computing power by maintaining cores offsite in a data center, or what is sometimes called “the cloud.” Google offers its own IaaS, Compute Engine, which can bring hundreds of thousands of cores in the cloud to bear on a computation, offering many times the computational power of Dr. Shmulevich’s own system. A single computation that used to take 15 hours could be finished in one hour using 10,000 of Google’s cores. This sped-up analysis would enhance the productivity of Dr. Shmulevich’s group.
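How far renting more cores can go depends on how much of a computation actually parallelizes. Amdahl’s law is the standard back-of-the-envelope model (a general illustration, not an analysis of RF-ACE; the fractions below are made up):

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Idealized speedup on `cores` cores when `parallel_fraction`
    of the work parallelizes perfectly (Amdahl's law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Even a small serial portion caps what 10,000 cores can deliver:
# fully parallel -> 10,000x; 99.9% parallel -> ~909x; 99% -> ~99x.
for p in (1.0, 0.999, 0.99):
    print(f"parallel fraction {p}: {amdahl_speedup(p, 10_000):,.0f}x speedup")
```

A highly parallel 15-hour workload could therefore plausibly finish in around an hour on 10,000 cores, in line with the example quoted above, while a workload with a substantial serial portion would gain far less.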
Google and Dr. Shmulevich partnered in early 2012 in a proof of concept project to run RF-ACE on TCGA’s large, heterogeneous, integrated data on Compute Engine. In principle, this would demonstrate that a research group could run its algorithm on its data through cloud computing. The feasibility of this collaboration would be exhibited at the 2012 Google I/O Conference, Google’s annual developer-focused meeting. Because Dr. Shmulevich has a background as an electrical and computer engineer, the partnership seemed like a natural fit. He says, “We are definitely on the same page as far as the philosophy, and the way they build software and the way they approach data; the Google way of doing things is a very good match [for us]. We were very excited when we had the opportunity to work with them directly and our collaboration was very intense.”
In the keynote speech of the June 2012 Google I/O, Compute Engine was unveiled and the demonstration of how the RF-ACE algorithm could use it began. Behind the presenter, Urs Hölzle, Google’s first vice president of engineering, the screen displayed a counter. Dr. Shmulevich recounts, “[Google] showed the power of this kind of massive analysis. There was a counter and the counter went up to 600,000! [Hölzle] announced that's how many cores were just brought up to do the computation!” Dr. Shmulevich exclaims, “That's really unprecedented, really amazing to have that many computers in such a short amount of time brought up to run a particular algorithm.” The skyrocketing counter showed that cloud computing for complex data sets was feasible. He declares, “It's a proof of concept that this type of massive analysis for research is possible and doable.”
Looking Ahead in Analysis of Large Scale Genomic Data
The collaboration between Dr. Shmulevich and Google for the demonstration shows where the future of data analysis may be headed. As data continue to grow in complexity and volume, Dr. Shmulevich asserts, “We, the scientific community, certainly have a lot to gain from the capabilities that Google could bring to our data analysis problems. I think it's going to be a very fruitful relationship and I'm very excited about it.”
Speculating about future implications for TCGA data, Dr. Shmulevich is optimistic. “We wanted to do this high dimensional analysis of all these data to hopefully one day be able to make clinically relevant predictions to help physicians in the clinical decision making process." He continues, saying that while TCGA wasn’t designed for this application, TCGA data offer a starting point for these complex levels of analyses.
TCGA is an example of what is to come. Dr. Shmulevich sees the increasing volume, complexity and heterogeneity in TCGA data, and he sees no sign of it stopping. “I think it’s only going to continue. It doesn't show any sign, nor should it, of slowing down or plateauing... We have to be prepared for the increase in scale and complexity, and that means we have to develop the right infrastructure, the right tools, the right methods to deal with this increase.” This prediction only seems more likely when considering that one million cancer genomes are expected to be available in the next 10 years.
This possible future makes clear that the cancer research community still has much to do. About the increasing density of the data, Dr. Shmulevich says, “That's a major challenge. Not only do we need bigger, better, faster computers, but we need more clever statistical and machine learning approaches. That's another bottleneck - actually, not a bottleneck, but something for the community to think about.”