The Genotype-Tissue Expression (GTEx) project is an NIH initiative to catalog human tissue-specific expression patterns in order to better understand gene regulation (see initial press release). The project is an RNA-Seq tour-de-force: RNA extracted from multiple tissues from more than 900 individuals is been quantified with more than 1,800 RNA-Seq experiments. An initial paper describing the experiments was published in Nature Genetics earlier this year and the full dataset is currently being analyzed by a large consortium of scientists.
I have been thinking recently about how to analyze genotype-tissue expression data, and have been looking forward to testing some ideas. But I have not yet become involved directly with the data, and in fact have not even submitted a request to analyze it. Given the number of samples, I’d been hoping that some basic mapping/quantification had already been done so that I could build on the work of the consortium. But, alas, this past week I got some bad news.
In a recent twitter conversation, I discovered that the program that is being used by several key GTEx consortium members to quantify the data is Flux Capacitor developed by Michael Sammeth while he was in Roderic Guigó‘s group at the CRG in Barcelona.
What is Flux Capacitor?
Strangely, the method has never been published, despite the fact that it has been used in ten publications over the course of four years, including high profile papers from consortia such as ENCODE, GENCODE, GEUVADIS and GTEx. There is no manuscript on the author’s website or in a preprint archive. There is a website for the program but it is incomplete and unfinished, and contains no coherent explanation of what the program does. Papers using the method point to the article S. B. Montgomery, … , E. T. Dermitzakis, Transcriptome genetics using second generation sequencing in a Caucasian population, Nature 464 (2010) and/or the website http://sammeth.net/confluence/display/FLUX/Home for a description of the method. Here is what these citations amount to:
The Montgomery et al. paper contains one figure providing the “FluxCapacitor outline”. It is completely useless in actually providing insight into what Flux Capacitor does:
Modification of the top half of Supplementary Figure 23 from Montgomery et al (2010) titled “Flux Capacitor Outline” (although it actually shows a splice graph if one corrects the errors as I have done in red).
The methods description in the Online Methods of Montgomery et al. can only be (politely) described as word salad. Consider for example the sentence:
In our approach we estimate the biases characteristic of each experiment by collecting read distribution profiles in non-overlapping transcripts, binned by several transcript lengths and expression levels. From these profiles, we estimate for each edge and transcript a flux correction factor that following the language of hydro-dynamic flow networks, we denote as the capacity of the edge, as the area under the transcript profile between the edge boundaries (Supplementary Fig. 23).
The indices i and j for are never defined, but more importantly its completely unclear what the the correction factor actually is, how it is estimated, and how it is used (this should be compared to the current sophistication of other methods). On the program website there is no coherent information either. Here is an example:
The resulting graph with edges labelled by the number of reads can be interpreted as a flow network where each transcript representing a transportation path from its start to its end and consequently each edge a possibly shared segment of transportation along which a certain number of reads per nucleotide — i.e., a flux — is observed.
I downloaded the code and it is undocumented- even to the extent that it is not clear what the input needs to be or what the output means. There is no example provided with the software to test the program.
I therefore became curious why GTEx chose Flux Capacitor instead of many other freely available tools for RNA-Seq (e.g. ALEXA-Seq, CLIIQ, Cufflinks, eXpress, iReckon IsoEM, IsoformEx, MISO, NEUMA, RSEM, rSEQ, rQuant, SLIDE, TIGAR, …). Although many of these programs are not suitable for production-scale analysis, Cufflinks and RSEM certainly are, and eXpress was specifically designed for efficient quantification (linear in the number of mapped reads and constant memory). I looked around and no benchmark of Flux Capacitor has ever been performed–there is literally not even a mention of it in any paper other than in manuscripts by Sammeth, Guigó or Dermitzakis. So I thought that after four years of repeated use of the program in high profile projects, I would take a look for myself:
After fumbling about with the barely usable Flux Capacitor software, I finally managed to run it on simulated data generated for my paper: Adam Roberts and Lior Pachter, Streaming fragment assignment for real time analysis of sequencing experiments, Nature Methods 10 (2013), 71–73. One example of the state of the software is the example page (the required sorted file is posted there but its download requires the realization that is is linked to from the non-obviously placed paperclip). Fortunately, I was using my own reads and the UCSC annotation. The Roberts-Pachter simulation is explained in the Online Methods of our paper (section “Simulation RNA-Seq study”). It consists of 75bp paired-end reads simulated according to parameters mimicking real data from an ENCODE embryonic stem cell line. I tested Flux Capacitor with both 10 million and 100 million simulated reads; the results are shown in the figure below:
Flux Capacitor accuracy on simulations with 10 million and 100 million reads. The top panels show scatterplots of estimated transcript abundance vs. true transcript abundance. The lower panels show the same data with both axes logged.
For comparison, the next figure shows the results of RSEM, Cufflinks and eXpress on a range of simulations (up to a billion reads) from the Roberts-Pachter paper (Figure 2a):
Modification of Figure 2a from A. Roberts and L. Pachter, Nature Methods (2013) showing the performance of Flux Capacitor in context.
Flux Capacitor has very poor performance. With 100 million reads, its performance is equivalent to other software programs at 10 million reads, and similarly, with 10 million reads, it has the performance of other programs at 1 million reads. I think its fair to say that
Using Flux Capacitor is equivalent to throwing out 90% of the data!
The simulation is a best case scenario. It adheres to the standard model for RNA-Seq in which fragments are generated uniformly at random with lengths chosen from a distribution, and with errors. As explained above, all these parameters were set according to an actual ENCODE dataset, so that the difficulty of the problem corresponds to realistic RNA-Seq data. I can’t explain the poor performance of Flux Capacitor because I don’t understand the method. However my best guess is that it is somehow solving min-flow using linear programming along the lines of the properly fomulated ideas in E. Bernard, L. Jacob, J. Mairal and J.-P. Vert, Efficient RNA isoform identification and quantification from RNA-seq data with network flows, Technical Report HAL-00803134, March 2013. If this is the case, the poor performance might be a result of some difficulties resulting from the minimization of isoforms and reflected in the (incorrectly estimated) stripes on the left and bottom of the log-log plots. That is not to say the conclusions of the papers where Flux Capacitor is used are wrong. As can be seen from our benchmark, although performance is degraded with Flux Capacitor, the quantifications are not all wrong. For example, abundant transcripts are less likely to be affected by Flux Capacitor’s obviously poor quantification. Still, the use of Flux Capacitor greatly reduces resolution of low-expressed genes and, as mentioned previously, is effectively equivalent to throwing out 90% of the data.
As far as GTEx is concerned, I’ve been told that a significant amount of the analysis is based on raw counts obtained from reads uniquely mapping to the genome (this approach appears to have also been used in many of the other papers where Flux Capacitor was used). Adam Roberts and I examined the performance of raw counts in the eXpress paper (Figure S8, reproduced below):
Figure S8 from A. Roberts and L. Pachter, Nature Methods (2013) showing the limits of quantification when ignoring ambiguous reads. NEUMA (Normalization by Expected Uniquely Mappable Areas) calculates an effective length for each transcript in order to normalize counts based on uniquely mappable areas of transcripts. We modified NEUMA to allow for errors, thereby increasing the accuracy of the method considerably, but its accuracy remains inferior to eXpress, which does consider ambiguous reads. Furthermore, NEUMA is unable to produce abundance estimates for targets without sufficient amounts of unique sequence. The EM algorithm is superior because it can take advantage of different combinations of shared sequence among multiple targets to produce estimates. The accuracy was calculated using only the subset of transcripts (77% of total) that NEUMA quantifies.
Quantification with raw counts is even worse than Flux Capacitor. It is not even possible to quantify 23% of transcripts (due to insufficient uniquely mapping reads). This is why in the figure above the eXpress results are better than on the entire transcriptome (third figure of this post). The solid line shows that on the (raw count) quantifiable part of the transcriptome, quantification by raw counting is again equivalent to throwing out about 90% of the data. The dashed line is our own improvement of NEUMA (which required modifying the source code) to allow for errors in the reads. This leads to an improvement in performance, but results still don’t match eXpress (and RSEM and Cufflinks), and are worse than even Flux Capacitor if the unquantifiable transcripts are taken into account. In the recent Cufflinks 2 paper, we show that raw counts also cannot be used for differential analysis (as “wrong does not cancel out wrong”– see my previous post on this).
One criticism of my simulation study could be that I am not impartial. After all, Cufflinks and eXpress were developed in my group, and the primary developer of RSEM, Bo Li, is now my postdoc. I agree with this criticism! This study should have been undertaken a long time ago and subjected to peer review by the author(s?) of Flux Capacitor and not by me. The fact that I have had to do it is a failure on their part, not mine. Moreover, it is outrageous that multiple journals and consortia have published work based on a method that is essentially a black box. This degrades the quality of the science and undermines scientists who do work hard to diligently validate, benchmark and publish their methods. Open source (the Flux Capacitor source code is, in fact, available for download) is not open science. Methods matter.