Posted: February 18, 2014

CASE STUDY: Using Forensics to Untangle Batch Effects in TCGA Data

Jean Hazel Mendoza

Image: Dr. Rehan Akbani, Ph.D.

At the turn of the 21st century, several experiments using high-throughput genomic technologies had fallen prey to technical errors known as batch effects. A study published in 2002 described a diagnostic test that used patterns of protein expression in human blood, measured by a mass spectrometer, to detect ovarian cancer.1 Hailed as a breakthrough at the time, this finding launched a flurry of commercial testing, which was later suspended by the Food and Drug Administration.2 The study had a deep flaw: the observed differences between the normal and ovarian cancer samples had resulted not from patterns in protein expression, but from the fact that the samples were processed in batches on different days.3

Batch effects are systematic technical variations in the data that result from processing samples in groups or batches. "They’re not due to actual biology that we are interested in. They are side effects," says Rehan Akbani, Ph.D., an assistant professor of bioinformatics and computational biology at the University of Texas MD Anderson Cancer Center. Genomics studies are particularly susceptible to batch effects because they often involve large numbers of samples that must be processed in multiple batches. "Since batch processing is very common in [genomics] research, so are batch effects," says Dr. Akbani.

Some researchers recognized this problem as early as 1999. They observed that a batch of samples run one week on a DNA microarray, a device used to measure gene expression among other things, could yield drastically different results from another batch run the following week.4 This is because high-throughput technologies are so sensitive that their measurements can be affected by subtle changes. Everything from timing, location, and personnel to reagent stock, storage conditions, and even the room temperature and ozone level of a laboratory can influence the measurements of a machine.5 These sources of batch effects are sometimes known as "nuisance variables." Ironically, as genome analysis technologies have become more sophisticated, the opportunities for introducing technical errors have only become greater.6

The Perils of Batch Effects

Dr. Akbani completed a Ph.D. in computer science at the University of Texas at San Antonio, and then joined MD Anderson as a postdoctoral fellow and later became an assistant professor. One of his areas of expertise, inspecting the quality of experimental data, is often underappreciated, but it is an integral part of the scientific process. Batch effect analysis "is an aspect of many projects that may not be as exciting as discovering new cancer-related genes or new processes," says Dr. Akbani. "Nonetheless this is an endeavor that needs to be pursued. It is an essential task."

Indeed, batch effects can pose a serious threat to research. They can mask underlying biology. Worse, they can be confounded with biological relationships. By compromising the data, batch effects can lead to spurious correlations or predictions. As the ovarian cancer diagnostic test that almost made it to market shows, studies based on unsound data not only lack reproducibility and waste time and resources, but could also mislead doctors and put patients at risk.

In recent years, screening for batch effects has become common practice for researchers, journal editors, and reviewers alike. Still, no matter how scrupulous the experimental design—ensuring identical experimental conditions, stringent protocols, randomization—batch effects are sometimes unavoidable. This is especially true for large-scale projects like The Cancer Genome Atlas (TCGA), which involves a vast network of scientists, laboratories, and multiple platforms over an extended period of time. Processing hundreds to thousands of samples all at once is simply impractical.

A Tool to Detect, Diagnose, and Correct Batch Effects

Dr. Akbani’s involvement in TCGA began in 2009 when he joined the laboratory of John Weinstein, M.D., Ph.D. Dr. Weinstein is professor and chair of the department of bioinformatics and computational biology as well as director of the TCGA Genome Data Analysis Center (GDAC) at MD Anderson. Dr. Weinstein subsequently appointed Dr. Akbani as co-director of the GDAC, and in 2011 TCGA appointed them both co-chairs of its Batch Effects Working Group. Along with Dr. Weinstein and Nianxiang Zhang, Ph.D., a former senior statistical analyst at MD Anderson, Dr. Akbani led the development of a software package called MBatch to detect, diagnose, and correct batch effects.

"What MBatch does is it provides a ready-made suite of [batch effects] assessment and correction algorithms," says Dr. Akbani. Every algorithm has strengths and weaknesses, so MBatch offers two assessment algorithms to provide a complementary picture of batch effects: hierarchical clustering and principal component analysis (PCA). Those are "extremely common tools, so we decided to start off with those in MBatch," says Dr. Akbani. More tools are currently under development.

Hierarchical clustering groups samples by similarity and displays the groupings as a tree diagram, or dendrogram. If no batch effect is present, samples that share a biological characteristic should group close together regardless of the batch in which they were processed. Samples that instead group together by batch are a potential indication of batch effects.
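The clustering check described above can be sketched in a few lines of Python. This is only an illustration on synthetic data, not MBatch code: the average-linkage rule, the helper names `average_linkage` and `batch_purity`, and the two-gene sample values are all assumptions made for the example.

```python
from itertools import combinations
from math import dist


def average_linkage(samples, n_clusters):
    """Agglomerative clustering with average linkage.

    Returns a list of clusters, each a sorted list of sample indices.
    """
    clusters = [[i] for i in range(len(samples))]

    def linkage(a, b):
        # Mean pairwise Euclidean distance between the members of two clusters.
        total = sum(dist(samples[i], samples[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


def batch_purity(clusters, batches):
    """Fraction of samples that sit in a cluster dominated by their own batch."""
    hits = 0
    for cluster in clusters:
        labels = [batches[i] for i in cluster]
        hits += max(labels.count(b) for b in set(labels))
    return hits / len(batches)


# Synthetic two-gene profiles (hypothetical values). Samples 0-2 were
# "processed" in batch A and samples 3-5 in batch B.
batches = ["A", "A", "A", "B", "B", "B"]

# Batch B carries a large technical offset on both genes.
shifted = [(0.0, 0.1), (1.0, 1.1), (0.2, 0.0),
           (5.1, 5.0), (6.0, 6.2), (5.0, 5.2)]

# No offset: samples differ only by a (simulated) biological subtype.
clean = [(0.0, 0.1), (1.0, 1.1), (0.2, 0.0),
         (0.1, 0.0), (1.0, 1.2), (0.0, 0.2)]

shifted_clusters = average_linkage(shifted, 2)
clean_clusters = average_linkage(clean, 2)
print(batch_purity(shifted_clusters, batches))
print(batch_purity(clean_clusters, batches))
```

With the offset in place, the two clusters coincide exactly with the two batches and the purity is 1.0. Without it, the clusters follow the simulated subtype and the purity falls to 0.5, the lowest value possible for two equally sized batches. In practice a dendrogram is inspected visually rather than reduced to a single number, so the purity score here is only a convenience for the example.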

Principal component analysis, on the other hand, looks at samples in terms of variation. PCA identifies the directions of greatest variation within the data. When the samples are plotted along those directions, a batch that separates from the others according to a technical variable, such as processing site or time, reveals the presence of a batch effect.
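As a rough illustration of that idea, the sketch below finds the first principal component of a small synthetic dataset by power iteration on the covariance matrix and then checks whether the two batches separate along it. Power iteration is just one simple way to obtain the leading component and is not claimed to be what MBatch uses; the function name and the data values are hypothetical.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return (unit direction, per-sample scores) for the first principal component.

    `rows` is a list of samples, each a sequence of feature values.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalize. The fixed starting vector works unless it happens to be
    # orthogonal to the leading component.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Synthetic data (hypothetical values): three features per sample.
# The first three samples are batch A, the last three are batch B, and
# batch B has a technical offset on the first two features.
rows = [(1.0, 2.0, 0.5), (1.2, 1.9, 0.7), (0.9, 2.1, 0.4),
        (5.1, 6.0, 0.6), (4.9, 6.2, 0.5), (5.0, 5.9, 0.6)]

direction, scores = first_principal_component(rows)
batch_a, batch_b = scores[:3], scores[3:]
# If the batches do not overlap along the first component, the largest
# source of variation in the data lines up with the batch label.
separated = max(batch_a) < min(batch_b) or max(batch_b) < min(batch_a)
print(direction, separated)
```

The sign of a principal component is arbitrary, which is why the separation test is written to work in either orientation.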

In addition to this computation, the MBatch PCA tool, called PCA-Plus, presents two novel features. Samples in the same batch are connected through a batch centroid, a data point that represents the average of the samples within that batch, making it easier to spot any potential outlier batch. PCA-Plus also provides a new metric, the Dispersion Separability Criterion (DSC), that quantifies the overall batch effect in a dataset. A high DSC value indicates the presence of a strong batch effect, while a low value suggests the opposite.
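The article does not give the formula for the DSC, so the sketch below rests on an assumption: that the metric is the ratio of between-batch dispersion (how far the batch centroids lie from the overall centroid) to within-batch dispersion (how far samples lie from their own batch centroid). The exact weighting and normalization in MBatch may differ, and the function names and data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def dispersion_separability(points, batches):
    """Assumed DSC-style ratio: between-batch over within-batch dispersion.

    Returns (ratio, batch centroids). Raises ZeroDivisionError if every
    sample coincides with its batch centroid.
    """
    groups = defaultdict(list)
    for p, b in zip(points, batches):
        groups[b].append(p)
    n = len(points)
    overall = centroid(points)
    centroids = {b: centroid(g) for b, g in groups.items()}
    # Size-weighted squared distance of each batch centroid from the overall centroid.
    between = sum(len(g) / n * sq_dist(centroids[b], overall)
                  for b, g in groups.items())
    # Mean squared distance of each sample from its own batch centroid.
    within = sum(sq_dist(p, centroids[b])
                 for b, g in groups.items() for p in g) / n
    return sqrt(between / within), centroids


# Synthetic two-dimensional points (hypothetical values), for example
# the first two principal component scores of six samples.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]

# Batch labels that match the two visible groups: strong batch effect.
strong_dsc, strong_centroids = dispersion_separability(
    points, ["A", "A", "A", "B", "B", "B"])

# Batch labels interleaved across the two groups: little batch effect.
weak_dsc, _ = dispersion_separability(
    points, ["A", "B", "A", "B", "A", "B"])

print(strong_dsc, weak_dsc)
```

Under this assumed definition, the same six points yield a ratio well above 1 when the batch labels coincide with the two groups and well below 1 when the labels are interleaved, which matches the reading of high and low DSC values given above.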

MBatch is publicly available, so researchers can download the software and analyze their own data. In addition, Dr. Akbani and his colleagues have created a TCGA MBatch Web Portal that presents batch effect analyses specific to TCGA data.

Once detected, batch effects can be corrected using any of six different algorithms available in MBatch: Empirical Bayes (also known as ComBat), ANOVA, and median polish, with two variations of each. But as Dr. Akbani warns, correction algorithms should be applied with caution. When used blindly to correct data, they risk erasing both technical and biological effects.
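The risk of erasing biology can be shown with a deliberately simple correction: subtracting each batch's mean from its samples, one gene at a time. This is a stand-in chosen for brevity, not one of the six MBatch algorithms as implemented, and the function names and expression values are hypothetical. In the synthetic data, tumor samples express the gene 2 units higher than normal samples, and the second batch adds a technical offset of 3 units.

```python
from collections import defaultdict
from statistics import mean


def center_by_batch(values, batches):
    """Subtract each batch's mean from the samples in that batch (one gene)."""
    groups = defaultdict(list)
    for v, b in zip(values, batches):
        groups[b].append(v)
    batch_mean = {b: mean(g) for b, g in groups.items()}
    return [v - batch_mean[b] for v, b in zip(values, batches)]


def group_gap(values, labels, first, second):
    """Mean of the `second` biological group minus mean of the `first`."""
    def group_mean(name):
        return mean(v for v, l in zip(values, labels) if l == name)
    return group_mean(second) - group_mean(first)


# Balanced design: every batch contains both normal and tumor samples.
balanced_batches = ["b1", "b1", "b1", "b1", "b2", "b2", "b2", "b2"]
balanced_labels = ["normal", "normal", "tumor", "tumor"] * 2
balanced_values = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 6.0, 6.0]

# Confounded design: all normals in batch 1, all tumors in batch 2.
confounded_batches = ["b1"] * 4 + ["b2"] * 4
confounded_labels = ["normal"] * 4 + ["tumor"] * 4
confounded_values = [1.0, 1.0, 1.0, 1.0, 6.0, 6.0, 6.0, 6.0]

balanced_gap = group_gap(center_by_batch(balanced_values, balanced_batches),
                         balanced_labels, "normal", "tumor")
confounded_gap = group_gap(center_by_batch(confounded_values, confounded_batches),
                           confounded_labels, "normal", "tumor")
print(balanced_gap, confounded_gap)
```

In the balanced design the correction removes the technical offset and the tumor-versus-normal gap of 2.0 survives. In the confounded design, batch and biology are indistinguishable, so centering removes both and the gap collapses to 0.0.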

Instead, Dr. Akbani recommends tracking down the source of the batch effect, whenever possible. "Try to go back to the source," he says, "which could have been either in the wet lab or it could have been after data processing … anywhere in that stream." He adds, "Identify and then eliminate those sources of batch effects." This is no easy task. The ways in which batch effects manifest in data are often complicated and difficult to untangle.

On the Trail of Batch Effects in TCGA Data

"Studying batch effects is more like a forensics analysis," says Dr. Akbani. Sorting out technical variation from genuine biological variation sometimes demands detective work. "You can imagine there are a hundred different variables, and sifting through all of them and figuring out which one variable contributed to that batch effect can be difficult," he says. According to Dr. Akbani, significant batch effects in TCGA data are rare, but a few cases have required dogged investigation.

One such case involved kidney cancer data. Rounds of analysis revealed several layers of batch effects, in which the samples kept dividing into dichotomies. "Something interesting, or fishy rather, [was] going on there," says Dr. Akbani. The first dichotomy was based on patient sex, the second caused by unstable components on a machine, and the third due to misclassification. Closer inspection revealed that a handful of samples were not clear cell kidney carcinoma as was previously thought, but a rare type of kidney cancer called chromophobe renal cell carcinoma. Removing the misclassified rare tumor samples finally resolved the problem. "So this is how we went through three dichotomies," says Dr. Akbani. "Each time we had to go back and figure out … which variable, if any, was associated with that dichotomy."

The potential for batch effects is even greater when combining datasets for multiple tumor types, as in the TCGA Pan-Cancer project, than when analyzing a single cancer dataset. "We’d never run MBatch on such a large project comprising of multiple tumors, so this was an interesting challenge," says Dr. Akbani.

One of the main problems was that the RNA data for the 12 Pan-Cancer tumor types were processed on one of two platforms, but not both. "It was difficult to tease out which effects, if any, were due to tumor differences and which effects were due to platform differences," says Dr. Akbani. Fortunately, one data-generating center had run 19 colon cancer samples on both machines. Based on those samples, Dr. Akbani and his colleagues found that differences in the data between the two platforms were minimal, "so not much correction was required," he says.
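The logic of using samples run on both platforms can be sketched as a paired comparison: because each pair comes from the same sample, any systematic difference between the two measurements reflects the platform rather than the tumor. The sketch below uses made-up values for a single gene and a hypothetical helper name. It does not reproduce the analysis actually performed on the 19 colon cancer samples.

```python
from statistics import mean, stdev


def paired_platform_shift(platform_a, platform_b):
    """Mean and standard deviation of per-sample differences (B minus A)."""
    diffs = [b - a for a, b in zip(platform_a, platform_b)]
    return mean(diffs), stdev(diffs)


# Synthetic expression values (hypothetical) for one gene, measured on
# the same four samples by two platforms.
platform_a = [5.0, 7.2, 6.1, 8.4]
platform_b = [5.1, 7.1, 6.3, 8.3]

shift, spread = paired_platform_shift(platform_a, platform_b)
# Compare the platform shift with the sample-to-sample range on one platform.
biological_range = max(platform_a) - min(platform_a)
print(shift, spread, biological_range)
```

A platform shift that is small relative to the variation between samples, as in this toy example, is the kind of result that would suggest little correction is needed.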

Dr. Akbani continues to conduct batch effect analyses for TCGA tumor projects as they are completed. Almost every marker paper has been published with a supplement on batch effects. The only exceptions are the 2008 paper on glioblastoma multiforme, the first cancer type TCGA studied, and some of the rare tumor projects. Because sample sizes for rare tumors are small, the samples can be processed in a single batch, minimizing the potential for batch effects.

"Quality control is an important aspect of any major project, especially for projects such as TCGA," says Dr. Akbani. "We really need to look at the quality of the data, because we understand that researchers all over the world are going to use them."

Selected References

1 Petricoin, E.F., Ardekani, A.M., Hitt, B.A., Levine, P.J., Fusaro, V.A., Steinberg, S.M., Mills, G.B., Simone, C., Fishman, D.A., Kohn, E.C., et al. (2002) Use of proteomic patterns in serum to identify ovarian cancer. Lancet. 359(9306):572-577.

2 Wagner, L. (2004) A test before its time? FDA stalls distribution process of proteomic test. J Natl Cancer Inst. 96(7):500-501.

3 Baggerly, K.A., Edmonson, S.R., Morris, J.S., and Coombes, K.R. (2004) High-resolution serum proteomic patterns for ovarian cancer detection. Endocr Relat Cancer. 11:583-584.

4 Lander, E.S. (1999) Array of hope. Nat Genet. 21:3-4.

5 Luo, J., Schumacher, M., Scherer, A., Sanoudou, D., Megherbi, D., Davison, T., Shi, T., Tong, W., Shi, L., Hong, H., et al. (2010) A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. Pharmacogenomics J. 10(4):278-291.

6 Leek, J.T., Scharpf, R.B., Bravo, H.C., Simcha, D., Langmead, B., Johnson, W.E., Geman, D., Baggerly, K., and Irizarry, R.A. (2010) Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 11(10):733-739.