You are currently browsing the tag archive for the ‘ENCODE’ tag.

I was recently reading the latest ENCODE paper published in PNAS when a sentence in the caption of Figure 2 caught my attention:

“Depending on the total amount of RNA in a cell, one transcript copy per cell corresponds to between 0.5 and 5 FPKM in PolyA+ whole-cell samples according to current estimates (with the upper end of that range corresponding to small cells with little RNA and vice versa).”

Although very few people actually care about ENCODE, many people do care about the interpretation of RNA-Seq FPKM measurements and to them this is likely to be a sentence of interest. In fact, there have been a number of attempts to provide intuitive meaning for RPKM (and FPKM) in terms of copy numbers of transcripts per cell. Even though the ENCODE PNAS paper provides no citation for the statement (or methods section explaining the derivation), I believe its source is the RNA-Seq paper by Mortazavi et al. In that paper, the authors write that

“…absolute transcript levels per cell can also be calculated. For example, on the basis of literature values for the mRNA content of a liver cell [Galau et al. 1977] and the RNA standards, we estimated that 3 RPKM corresponds to about one transcript per liver cell. For C2C12 tissue culture cells, for which we know the starting cell number and RNA preparation yields needed to make the calculation, a transcript of 1 RPKM corresponds to approximately one transcript per cell. “

This statement has been picked up on in a number of publications (e.g., Hebenstreit et al., 2011, van Bakel et al., 2011). However the inference of transcript copies per cell directly from RPKM or FPKM estimates is not possible and conversion factors such as 1 RPKM = 1 transcript per cell are incoherent. At the same time, the estimates of Mortazavi et al. and the range provided in the ENCODE PNAS paper are informative. The “incoherence” stems from a subtle issue in the normalization of RPKM/FPKM that I have discussed in a talk I gave at CSHL, and is the reason why TPM is a better unit for RNA abundance. Still, the estimates turn out to be “informative”, in the sense that the effect of (lack of) normalization appears to be smaller than variability in the amount of RNA per cell. I explain these issues below:

Why is the sentence incoherent?

RNA-Seq can be used to estimate transcript abundances in an RNA sample. Formally, a sample consists of n distinct types of transcripts, and each occurs with different multiplicity (copy number), so that transcript appears $m_i$ times in the sample. By “abundance” we mean the relative amounts $\rho_1,\ldots,\rho_n$ where $\rho_i = \frac{m_i}{\sum_{i=1}^n m_i}$. Note that  $0 \leq \rho_i \leq 1$ and $\sum_{i=1}^n \rho_i = 1$. Suppose that $m_j=1$ for some j. The corresponding $\rho_j$ is therefore $\rho_j = \frac{1}{M}$ where $M = \sum_{i=1}^n m_i$. The question is what does this $\rho$ value correspond to in RPKM (or FPKM).

RPKM stands for “reads per kilobase  of transcript per million reads mapped” and FPKM is the same except with “fragment” replacing read (initially reads were not paired-end, but with the advent of paired-end sequencing it makes more sense to speak of fragments, and hence FPKM). As a unit of measurement for an estimate, what FPKM really refers to is the expected number of fragments per kilboase of transcript per million reads. Formally, if we let $l_i$ be the length of transcript and define $\alpha_i = \frac{\rho_i l_i}{\sum_{j=1}^n \rho_j l_j}$ then abundance in FPKM for transcript is abundance measured as $FPKM_i = \frac{\alpha_i \cdot 10^{6}}{l_i/(10^3)}$. In terms of $\rho$, we obtain that

$FPKM_i = \frac{\rho_i \cdot 10^9}{\sum_{j=1}^n \rho_j l_j}$.

The term in the denominator can be considered a kind of normalization factor, that while identical for each transcript, depends on the abundances of each transcript (unless all lengths are equal). It is in essence an average of lengths of transcripts weighted by abundance. Moreover, the length of each transcript should be taken to be taken to be its “effective” length, i.e. the length with respect to fragment lengths, or equivalently, the number of positions where fragments can start.

The implication for finding a relationship between FPKM and relative abundance constituting one transcript copy per cell is that one cannot. Mathematically, the latter is equivalent to setting $\rho_i = \frac{1}{M}$ in the formula above and then trying to determine $FPKM_i$. Unfortunately, all the remaining $\rho$ are still in the formula, and must be known in order to calculate the corresponding FPKM value.

The argument above makes clear that it does not make sense to estimate transcript copy counts per cell in terms of RPKM or FPKM. Measurements in RPKM or FPKM units depend on the abundances of transcripts in the specific sample being considered, and therefore the connection to copy counts is incoherent. The obvious and correct solution is to work directly with the $\rho$. This is the rationale of TPM (transcripts per million) used by Bo Li and Colin Dewey in the RSEM paper (the argument for TPM is also made in Wagner et al. 2012).

Why is the sentence informative?

Even though incoherent, it turns out there is some truth to the ranges and estimates of copy count per cell in terms of RPKM and FPKM that have been circulated. To understand why requires noting that there are in fact two factors that come into play in estimating the FPKM corresponding to abundance of one transcript copy per cell. First, is M as defined above to be the total number of transcripts in a cell. This depends on the amount of RNA in a cell. Second are the relative abundances of all transcripts and their contribution to the denominator in the $FPKM_i$ formula.

The best paper to date on the connection between transcript copy numbers and RNA-Seq measurements is the careful work of Marinov et al. in “From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing” published in Genome Research earlier this year. First of all, the paper describes careful estimates of RNA quantities in different cells, and concludes that (at least for the cells studied in the paper) amounts vary by approximately one order of magnitude. Incidentally, the estimates in Marinov et al. confirm and are consistent with rough estimates of Galau et al. from 1977, of 300,000 transcripts per cell. Marinov et al. also use spike-in measurements are used to conclude that in “GM12878 single cells, one transcript copy corresponds to ∼10 FPKM on average.”. The main value of the paper lies in its confirmation that RNA quantities can vary by an order of magnitude, and I am guessing this factor of 10 is the basis for the range provided in the ENCODE PNAS paper (0.5 to 5 FPKM).

In order to determine the relative importance of the denominator in $FPKM_i$ I looked at a few RNA-Seq datasets we are currently examining. In the GEUVADIS data, the weighted average can vary by as much as 20% between samples. In a rat RNA-Seq dataset we are analyzing, the difference is a factor of two (and interestingly very dependent on the exact annotation used for quantification). The point here is that even the denominator in $FPKM_i$ does vary, but less, it seems, than the variability in RNA quantity. In other words, the estimate of 0.5 to 5 FPKM corresponding to one transcript per cell is incoherent albeit probably not too far off.

One consequence of all of the above discussion is that while differential analysis of experiments can be performed based on FPKM units (as done for example in Cufflinks, where the normalization factors are appropriately accounted for), it does not make sense to compare raw FPKM values across experiments. This is precisely what is done in Figure 2 of the ENCODE PNAS paper. What the analysis above shows, is that actual abundances may be off by amounts much larger than the differences shown in the figure. In other words, while the caption turns out to contain an interesting comment the overall figure doesn’t really make sense. Specifically, I’m not sure the relative RPKM values shown in the figure deliver the correct relative amounts, an issue that ENCODE can and should check. Which brings me to the last part of this post…

What is ENCODE doing?

Having realized the possible issue with RPKM comparisons in Figure 2, I took a look at Figure 3 to try to understand whether there were potential implications for it as well. That exercise took me to a whole other level of ENCODEness. To begin with, I was trying to make sense of the x-axis, which is labeled “biochemical signal strength (log10)” when I realized that the different curves on the plot all come from different, completely unrelated x-axes. If this sounds confusing, it is. The green curves are showing graphs of functions whose domain is in log 10 RPKM units. However the histone modification curves are in log (-10 log p), where p is a p-value that has been computed. I’ve never seen anyone plot log(log(p-values)); what does it mean?! Nor do I understand how such graphs can be placed on a common x-axis (?!). What is “biochemical signal strength” (?) Why in the bottom panel is the grey H3K9me3 showing %nucleotides conserved decreasing as “biochemical strength” is increasing (?!) Why is the green RNA curves showing conservation below genome average for low expressed transcripts (?!) and why in the top panel is the red H3K4me3 an “M” shape (?!) What does any of this mean (?!) What I’m supposed to understand from it, or frankly, what is going on at all ??? I know many of the authors of this ENCODE PNAS paper and I simply cannot believe they saw and approved this figure. It is truly beyond belief… see below:

All of these figures are of course to support the main point of the paper. Which is that even though 80% of the genome is functional it is also true that this is not what was meant to be said , and that what is true is that “survey of biochemical activity led to a significant increase in genome coverage and thus accentuated the discrepancy between biochemical and evolutionary estimates… where function is ascertained independently of cellular state but is dependent on environment and evolutionary niche therefore resulting in estimates that  differ widely in their false-positive and false-negative rates and the resolution with which elements can be defined… [unlike] genetic approaches that rely on sequence alterations to establish the biological relevance of a DNA segment and are often considered a gold standard for defining function.”

The ENCODE PNAS paper was first published behind a paywall. However after some public criticism, the authors relented and paid for it to be open access. This was a mistake. Had it remained behind a paywall not only would the consortium have saved money, I and others might have been spared the experience of reading the paper. I hope the consortium will afford me the courtesy of paywall next time.

The Genotype-Tissue Expression (GTEx) project is an NIH initiative to catalog human tissue-specific expression patterns in order to better understand gene regulation (see initial press release). The project is an RNA-Seq tour-de-force: RNA extracted from multiple tissues from more than 900 individuals is been quantified with more than 1,800 RNA-Seq experiments. An initial paper describing the experiments was published in Nature Genetics earlier this year and the full dataset is currently being analyzed by a large consortium of scientists.

I have been thinking recently about how to analyze genotype-tissue expression data, and have been looking forward to testing some ideas. But I have not yet become involved directly with the data, and in fact have not even submitted a request to analyze it. Given the number of samples, I’d been hoping that some basic mapping/quantification had already been done so that I could build on the work of the consortium. But, alas, this past week I got some bad news.

In a recent twitter conversation, I discovered that the program that is being used by several key GTEx consortium members to quantify the data is Flux Capacitor developed by Michael Sammeth while he was in Roderic Guigós group at the CRG in Barcelona.

What is Flux Capacitor?

Strangely, the method has never been published, despite the fact that it has been used in ten publications over the course of four years, including high profile papers from consortia such as ENCODE, GENCODE, GEUVADIS and GTEx. There is no manuscript on the author’s website or in a preprint archive. There is a website for the program but it is incomplete and unfinished, and contains no coherent explanation of what the program does. Papers using the method point to the article S. B. Montgomery, … , E. T. DermitzakisTranscriptome genetics using second generation sequencing in a Caucasian population, Nature 464 (2010) and/or the website http://sammeth.net/confluence/display/FLUX/Home for a description of the method. Here is what these citations amount to:

The Montgomery et al. paper contains one figure providing the “FluxCapacitor outline”. It is completely useless in actually providing insight into what Flux Capacitor does:

Modification of the top half of Supplementary Figure 23 from Montgomery et al (2010) titled “Flux Capacitor Outline” (although it actually shows a splice graph if one corrects the errors as I have done in red).

The methods description in the Online Methods of Montgomery et al. can only be (politely) described as word salad. Consider for example the sentence:

In our approach we estimate the biases characteristic of each experiment by collecting read distribution profiles in non-overlapping transcripts, binned by several transcript lengths and expression levels. From these profiles, we estimate for each edge and transcript a flux correction factor $b^j_i$ that following the language of hydro-dynamic flow networks, we denote as the capacity of the edge, as the area under the transcript profile between the edge boundaries (Supplementary Fig. 23).

The indices and j for $b^j_i$ are never defined, but more importantly its completely unclear what the the correction factor actually is, how it is estimated, and how it is used (this should be compared to the current sophistication of other methods). On the program website there is no coherent information either. Here is an example:

The resulting graph with edges labelled by the number of reads can be interpreted as a flow network where each transcript representing a transportation path from its start to its end and consequently each edge a possibly shared segment of transportation along which a certain number of reads per nucleotide — i.e., a flux — is observed.

I downloaded the code and it is undocumented- even to the extent that it is not clear what the input needs to be or what the output means. There is no example provided with the software to test the program.

I therefore became curious why GTEx chose Flux Capacitor instead of many other freely available tools for RNA-Seq (e.g. ALEXA-SeqCLIIQCufflinks, eXpress, iReckon IsoEM, IsoformExMISO, NEUMARSEM, rSEQrQuantSLIDE, TIGAR, …). Although many of these programs are not suitable for production-scale analysis, Cufflinks and RSEM certainly are, and eXpress was specifically designed for efficient quantification (linear in the number of mapped reads and constant memory). I looked around and no benchmark of Flux Capacitor has ever been performed–there is literally not even a mention of it in any paper other than in manuscripts by Sammeth, Guigó or Dermitzakis. So I thought that after four years of repeated use of the program in high profile projects, I would take a look for myself:

After fumbling about with the barely usable Flux Capacitor software, I finally managed to run it on simulated data generated for my paper: Adam Roberts and Lior Pachter, Streaming fragment assignment for real time analysis of sequencing experiments, Nature Methods 10 (2013), 71–73. One example of the state of the software is the example page (the required sorted file is posted there but its download requires the realization that is is linked to from the non-obviously placed paperclip). Fortunately, I was using my own reads and the UCSC annotation. The Roberts-Pachter simulation is explained in the Online Methods of our paper (section “Simulation RNA-Seq study”). It consists of 75bp paired-end reads simulated according to parameters mimicking real data from an ENCODE embryonic stem cell line. I tested Flux Capacitor with both 10 million and 100 million simulated reads; the results are shown in the figure below:

Flux Capacitor accuracy on simulations with 10 million and 100 million reads. The top panels show scatterplots of estimated transcript abundance vs. true transcript abundance. The lower panels show the same data with both axes logged.

For comparison, the next figure shows the results of RSEM, Cufflinks and eXpress on a range of simulations (up to a billion reads) from the Roberts-Pachter paper (Figure 2a):

Modification of Figure 2a from A. Roberts and L. Pachter, Nature Methods (2013) showing the performance of Flux Capacitor in context.

Flux Capacitor has very poor performance. With 100 million reads, its performance is equivalent to other software programs at 10 million reads, and similarly, with 10 million reads, it has the performance of other programs at 1 million reads. I think its fair to say that

Using Flux Capacitor is equivalent to throwing out 90% of the data!

The simulation is a best case scenario. It adheres to the standard model for RNA-Seq in which fragments are generated uniformly at random with lengths chosen from a distribution, and with errors. As explained above, all these parameters were set according to an actual ENCODE dataset, so that the difficulty of the problem corresponds to realistic RNA-Seq data. I can’t explain the poor performance of Flux Capacitor because I don’t understand the method. However my best guess is that it is somehow solving min-flow using linear programming along the lines of the properly fomulated ideas in E. Bernard, L. Jacob, J. Mairal and J.-P. VertEfficient RNA isoform identification and quantification from RNA-seq data with network flows, Technical Report HAL-00803134, March 2013. If this is the case, the poor performance might be a result of some difficulties resulting from the minimization of isoforms and reflected in the (incorrectly estimated) stripes on the left and bottom of the log-log plots. That is not to say the conclusions of the papers where Flux Capacitor is used are wrong. As can be seen from our benchmark, although performance is degraded with Flux Capacitor, the quantifications are not all wrong. For example, abundant transcripts are less likely to be affected by Flux Capacitor’s obviously poor quantification. Still, the use of Flux Capacitor greatly reduces resolution of low-expressed genes and, as mentioned previously, is effectively equivalent to throwing out 90% of the data.

As far as GTEx is concerned, I’ve been told that a significant amount of the analysis is based on raw counts obtained from reads uniquely mapping to the genome (this approach appears to have also been used in many of the other papers where Flux Capacitor was used). Adam Roberts and I examined the performance of raw counts in the eXpress paper (Figure S8, reproduced below):

Figure S8 from A. Roberts and L. Pachter, Nature Methods (2013) showing the limits of quantification when ignoring ambiguous reads. NEUMA (Normalization by Expected Uniquely Mappable Areas) calculates an effective length for each transcript in order to normalize counts based on uniquely mappable areas of transcripts. We modified NEUMA to allow for errors, thereby increasing the accuracy of the method considerably, but its accuracy remains inferior to eXpress, which does consider ambiguous reads. Furthermore, NEUMA is unable to produce abundance estimates for targets without sufficient amounts of unique sequence. The EM algorithm is superior because it can take advantage of different combinations of shared sequence among multiple targets to produce estimates. The accuracy was calculated using only the subset of transcripts (77% of total) that NEUMA quantifies.

Quantification with raw counts is even worse than Flux Capacitor. It is not even possible to quantify 23% of transcripts  (due to insufficient uniquely mapping reads). This is why in the figure above the eXpress results are better than on the entire transcriptome (third figure of this post). The solid line shows that on the (raw count) quantifiable part of the transcriptome, quantification by raw counting is again equivalent to throwing out about 90% of the data. The dashed line is our own improvement of NEUMA (which required modifying the source code) to allow for errors in the reads. This leads to an improvement in performance, but results still don’t match eXpress (and RSEM and Cufflinks), and are worse than even Flux Capacitor if the unquantifiable transcripts are taken into account. In the recent Cufflinks 2 paper, we show that raw counts also cannot be used for differential analysis (as “wrong does not cancel out wrong”–  see my previous post on this).

One criticism of my simulation study could be that I am not impartial. After all, Cufflinks and eXpress were developed in my group, and the primary developer of RSEM, Bo Li, is now my postdoc. I agree with this criticism! This study should have been undertaken a long time ago and subjected to peer review by the author(s?) of Flux Capacitor and not by me. The fact that I have had to do it is a failure on their part, not mine. Moreover, it is outrageous that multiple journals and consortia have published work based on a method that is essentially a black box. This degrades the quality of the science and undermines scientists who do work hard to diligently validate, benchmark and publish their methods. Open source (the Flux Capacitor source code is, in fact, available for download) is not open science. Methods matter.