Posted: March 21, 2011
CASE STUDY: New Tool Combines Many Types of Genomic Data
At the heart of The Cancer Genome Atlas (TCGA) is a dilemma that dogs the modern era of computational biology: the sheer volume of data can exceed scientists’ ability to analyze it in a timely manner. “It’s the bottleneck problem,” explains Rachel Karchin, Ph.D., who runs a lab at the Institute for Computational Medicine at Johns Hopkins University.
Recent advances in biological measurement and DNA sequencing platforms—known as high-throughput genomics—now permit scientists to generate large amounts of information that describe changes in tumors. By 2015, TCGA is on track to collect and analyze tumor specimens from some 10,000 cancer patients. However, while the number of genes and molecular measurements in each sample is finite, it is also enormous.
“The challenge is how to extract meaning from this information to affect human health,” says Karchin. “As computational biologists, our job is to develop new tools to model the data so that researchers can begin to harness its potential.”
Building Haystacks, Finding Needles
Dr. Karchin explains the challenge of working with the multi-dimensional dataset that is being generated to develop the ovarian cancer “page” in the TCGA atlas. “First, you have tumor and normal samples, nearly 500 of each in the ovarian project. For each of those you have between 10,000 and 20,000 genes that could be altered.”
But the TCGA sequencing and characterization process is not looking simply for gene mutations, she says. Various genomic measurement tools produce five other sets of information about each tumor. For example, one tool measures copy number alterations, that is, how many copies of a gene a tumor carries compared to normal tissue. And more tools are on the way, which will produce even more data to comb through.
For researchers working with the TCGA ovarian dataset, the influx of data will quickly swell to a tidal wave of unfiltered information. Dr. Karchin is helping the TCGA Ovarian Analysis Working Group by developing robust analytical tools that they and researchers the world over can use to open up the potential bottleneck. The hope is that such tools will improve the researchers’ ability to locate the needle-in-the-haystack gene changes that matter the most. “There are a lot of changes to be found in the tumors of cancer patients that we don’t see in normal tissues,” explains Dr. Karchin.
Dr. Karchin notes that a novel project the size and scope of TCGA “will certainly detect genome changes not seen before, but many of those changes are biologically neutral, innocent bystanders you might say. We are after the changes that actually drive cancer.”
Helping Bring TCGA Puzzle Pieces Together
Each TCGA Genome Characterization Center (GCC) and Genome Sequencing Center (GSC) working on ovarian cancer uses specific technology platforms to generate different types of data related to changes in the cancer genome. For example, the TCGA Center at the University of North Carolina analyzes tumor and normal tissue for gene expression while the Broad Institute and Brigham and Women’s Hospital are looking for copy number changes. These types of changes can influence how proteins are produced from the DNA code and are just two of a number of molecular changes that TCGA studies.
Dr. Karchin’s lab is collaborating with a team of researchers to develop a new software tool, known as TOPPER, designed to make it easier to combine and organize the masses of data that come out of each tumor analysis platform. Researchers analyzing TCGA data from the ovarian samples will be able to select any combination of data types and let the computer analyze the resulting patterns. TOPPER allows researchers to assemble a composite or merged view of ovarian cancer.
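To make the idea of a merged view concrete, here is a minimal sketch in Python of combining per-sample measurements from several platforms. It is not TOPPER's actual code or interface, which the article does not describe; the function name `merge_platforms`, the sample IDs, the gene names, and every value are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical data: platform -> sample ID -> gene -> measurement.
# None of these names or numbers come from TCGA.
PLATFORMS: Dict[str, Dict[str, Dict[str, float]]] = {
    "expression": {
        "S1": {"GENE_A": 2.1, "GENE_B": 0.4},
        "S2": {"GENE_A": 0.3, "GENE_B": 1.8},
    },
    "copy_number": {
        "S1": {"GENE_A": 4, "GENE_B": 2},
        "S2": {"GENE_A": 2, "GENE_B": 1},
        "S3": {"GENE_A": 2, "GENE_B": 2},  # no expression data for S3
    },
}


def merge_platforms(
    platforms: Dict[str, Dict[str, Dict[str, float]]],
    selected: List[str],
) -> Dict[str, Dict[str, Dict[str, float]]]:
    """Return sample -> gene -> {platform: value} for the selected platforms.

    Only samples measured on every selected platform are kept, which is
    one possible (assumed) way to define a composite view.
    """
    if not selected:
        raise ValueError("select at least one data type")
    chosen = {name: platforms[name] for name in selected}
    shared = set.intersection(*(set(data) for data in chosen.values()))
    merged: Dict[str, Dict[str, Dict[str, float]]] = {}
    for sample in sorted(shared):
        merged[sample] = {}
        for name, data in chosen.items():
            for gene, value in data[sample].items():
                merged[sample].setdefault(gene, {})[name] = value
    return merged


merged = merge_platforms(PLATFORMS, ["expression", "copy_number"])
```

In this sketch, sample S3 drops out of the merged view because it lacks expression data; a real tool might instead keep such samples and mark the missing values.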
Unlocking a Gate to the Future
If TOPPER works as the researchers hope, users coming to the TCGA Data Portal will be able to select a “phenotype” to study. Phenotypes are the outward expression of the genes an individual carries, the “big-picture” variables that researchers use to separate patients into two or more groups in order to see which genes might be causing cancer to develop differently in one group versus the other.
Researchers use a wide range of phenotypes to frame their studies, including a patient’s clinical diagnosis, treatment history, the biological and chemical status of the tumor, length of survival, and a host of observable traits such as age, race, and blood type. Using TOPPER, researchers accessing the TCGA data can interrogate the molecular differences between two contrasting phenotypic groups.
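As a rough illustration of contrasting two phenotypic groups, the Python sketch below splits samples by a phenotype label and ranks genes by the difference in their average measurement between the groups. This is an assumed, simplified stand-in for whatever analysis TOPPER performs, which the article does not specify; the function `group_difference`, the labels, the gene names, and the values are all hypothetical.

```python
from statistics import mean
from typing import Dict

# Hypothetical expression values: sample ID -> gene -> measurement.
EXPRESSION: Dict[str, Dict[str, float]] = {
    "S1": {"GENE_A": 2.0, "GENE_B": 1.0},
    "S2": {"GENE_A": 4.0, "GENE_B": 1.0},
    "S3": {"GENE_A": 1.0, "GENE_B": 1.0},
    "S4": {"GENE_A": 1.0, "GENE_B": 3.0},
}

# Hypothetical phenotype labels separating the patients into two groups.
PHENOTYPE: Dict[str, str] = {
    "S1": "short_survival",
    "S2": "short_survival",
    "S3": "long_survival",
    "S4": "long_survival",
}


def group_difference(
    values: Dict[str, Dict[str, float]],
    phenotype: Dict[str, str],
    group_a: str,
    group_b: str,
) -> Dict[str, float]:
    """Return gene -> (mean in group_a) - (mean in group_b)."""
    in_a = [s for s in values if phenotype.get(s) == group_a]
    in_b = [s for s in values if phenotype.get(s) == group_b]
    if not in_a or not in_b:
        raise ValueError("both groups need at least one sample")
    # Compare only genes measured in every sample of both groups.
    genes = set.intersection(*(set(values[s]) for s in in_a + in_b))
    return {
        gene: mean(values[s][gene] for s in in_a)
        - mean(values[s][gene] for s in in_b)
        for gene in genes
    }


diffs = group_difference(EXPRESSION, PHENOTYPE, "short_survival", "long_survival")
# Genes with the largest absolute difference come first.
ranked = sorted(diffs, key=lambda gene: abs(diffs[gene]), reverse=True)
```

A difference in means is only a starting point. Separating driver changes from the "innocent bystanders" Dr. Karchin describes would require proper statistical testing and far more samples than this toy example uses.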
TOPPER and other tools being developed for TCGA will have immediate relevance for the bottleneck problem and the search for driver mutations, said Dr. Karchin, “but we are designing this kind of software to be more general.” Despite its unique features and unprecedented scope, she said, TCGA represents a more generic kind of scientific puzzle: integrating many types of genomic data.
TOPPER and other tools will be accessible to the public from NCI data portals, and “we hope people will find uses in other biological contexts and with other diseases.” Dr. Karchin and others are developing these tools to enable a future they can only guess at, “but the long-range potential is fantastic,” she believes.