You are currently browsing the monthly archive for April 2014.

I was recently reading the latest ENCODE paper published in PNAS when a sentence in the caption of Figure 2 caught my attention:

“Depending on the total amount of RNA in a cell, one transcript copy per cell corresponds to between 0.5 and 5 FPKM in PolyA+ whole-cell samples according to current estimates (with the upper end of that range corresponding to small cells with little RNA and vice versa).”

Although very few people actually care about ENCODE, many people do care about the interpretation of RNA-Seq FPKM measurements and to them this is likely to be a sentence of interest. In fact, there have been a number of attempts to provide intuitive meaning for RPKM (and FPKM) in terms of copy numbers of transcripts per cell. Even though the ENCODE PNAS paper provides no citation for the statement (or methods section explaining the derivation), I believe its source is the RNA-Seq paper by Mortazavi et al. In that paper, the authors write that

“…absolute transcript levels per cell can also be calculated. For example, on the basis of literature values for the mRNA content of a liver cell [Galau et al. 1977] and the RNA standards, we estimated that 3 RPKM corresponds to about one transcript per liver cell. For C2C12 tissue culture cells, for which we know the starting cell number and RNA preparation yields needed to make the calculation, a transcript of 1 RPKM corresponds to approximately one transcript per cell. “

This statement has been picked up on in a number of publications (e.g., Hebenstreit et al., 2011, van Bakel et al., 2011). However the inference of transcript copies per cell directly from RPKM or FPKM estimates is not possible and conversion factors such as 1 RPKM = 1 transcript per cell are incoherent. At the same time, the estimates of Mortazavi et al. and the range provided in the ENCODE PNAS paper are informative. The “incoherence” stems from a subtle issue in the normalization of RPKM/FPKM that I have discussed in a talk I gave at CSHL, and is the reason why TPM is a better unit for RNA abundance. Still, the estimates turn out to be “informative”, in the sense that the effect of (lack of) normalization appears to be smaller than variability in the amount of RNA per cell. I explain these issues below:

Why is the sentence incoherent?

RNA-Seq can be used to estimate transcript abundances in an RNA sample. Formally, a sample consists of n distinct types of transcripts, and each occurs with different multiplicity (copy number), so that transcript appears $m_i$ times in the sample. By “abundance” we mean the relative amounts $\rho_1,\ldots,\rho_n$ where $\rho_i = \frac{m_i}{\sum_{i=1}^n m_i}$. Note that  $0 \leq \rho_i \leq 1$ and $\sum_{i=1}^n \rho_i = 1$. Suppose that $m_j=1$ for some j. The corresponding $\rho_j$ is therefore $\rho_j = \frac{1}{M}$ where $M = \sum_{i=1}^n m_i$. The question is what does this $\rho$ value correspond to in RPKM (or FPKM).

RPKM stands for “reads per kilobase  of transcript per million reads mapped” and FPKM is the same except with “fragment” replacing read (initially reads were not paired-end, but with the advent of paired-end sequencing it makes more sense to speak of fragments, and hence FPKM). As a unit of measurement for an estimate, what FPKM really refers to is the expected number of fragments per kilboase of transcript per million reads. Formally, if we let $l_i$ be the length of transcript and define $\alpha_i = \frac{\rho_i l_i}{\sum_{j=1}^n \rho_j l_j}$ then abundance in FPKM for transcript is abundance measured as $FPKM_i = \frac{\alpha_i \cdot 10^{6}}{l_i/(10^3)}$. In terms of $\rho$, we obtain that

$FPKM_i = \frac{\rho_i \cdot 10^9}{\sum_{j=1}^n \rho_j l_j}$.

The term in the denominator can be considered a kind of normalization factor, that while identical for each transcript, depends on the abundances of each transcript (unless all lengths are equal). It is in essence an average of lengths of transcripts weighted by abundance. Moreover, the length of each transcript should be taken to be taken to be its “effective” length, i.e. the length with respect to fragment lengths, or equivalently, the number of positions where fragments can start.

The implication for finding a relationship between FPKM and relative abundance constituting one transcript copy per cell is that one cannot. Mathematically, the latter is equivalent to setting $\rho_i = \frac{1}{M}$ in the formula above and then trying to determine $FPKM_i$. Unfortunately, all the remaining $\rho$ are still in the formula, and must be known in order to calculate the corresponding FPKM value.

The argument above makes clear that it does not make sense to estimate transcript copy counts per cell in terms of RPKM or FPKM. Measurements in RPKM or FPKM units depend on the abundances of transcripts in the specific sample being considered, and therefore the connection to copy counts is incoherent. The obvious and correct solution is to work directly with the $\rho$. This is the rationale of TPM (transcripts per million) used by Bo Li and Colin Dewey in the RSEM paper (the argument for TPM is also made in Wagner et al. 2012).

Why is the sentence informative?

Even though incoherent, it turns out there is some truth to the ranges and estimates of copy count per cell in terms of RPKM and FPKM that have been circulated. To understand why requires noting that there are in fact two factors that come into play in estimating the FPKM corresponding to abundance of one transcript copy per cell. First, is M as defined above to be the total number of transcripts in a cell. This depends on the amount of RNA in a cell. Second are the relative abundances of all transcripts and their contribution to the denominator in the $FPKM_i$ formula.

The best paper to date on the connection between transcript copy numbers and RNA-Seq measurements is the careful work of Marinov et al. in “From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing” published in Genome Research earlier this year. First of all, the paper describes careful estimates of RNA quantities in different cells, and concludes that (at least for the cells studied in the paper) amounts vary by approximately one order of magnitude. Incidentally, the estimates in Marinov et al. confirm and are consistent with rough estimates of Galau et al. from 1977, of 300,000 transcripts per cell. Marinov et al. also use spike-in measurements are used to conclude that in “GM12878 single cells, one transcript copy corresponds to ∼10 FPKM on average.”. The main value of the paper lies in its confirmation that RNA quantities can vary by an order of magnitude, and I am guessing this factor of 10 is the basis for the range provided in the ENCODE PNAS paper (0.5 to 5 FPKM).

In order to determine the relative importance of the denominator in $FPKM_i$ I looked at a few RNA-Seq datasets we are currently examining. In the GEUVADIS data, the weighted average can vary by as much as 20% between samples. In a rat RNA-Seq dataset we are analyzing, the difference is a factor of two (and interestingly very dependent on the exact annotation used for quantification). The point here is that even the denominator in $FPKM_i$ does vary, but less, it seems, than the variability in RNA quantity. In other words, the estimate of 0.5 to 5 FPKM corresponding to one transcript per cell is incoherent albeit probably not too far off.

One consequence of all of the above discussion is that while differential analysis of experiments can be performed based on FPKM units (as done for example in Cufflinks, where the normalization factors are appropriately accounted for), it does not make sense to compare raw FPKM values across experiments. This is precisely what is done in Figure 2 of the ENCODE PNAS paper. What the analysis above shows, is that actual abundances may be off by amounts much larger than the differences shown in the figure. In other words, while the caption turns out to contain an interesting comment the overall figure doesn’t really make sense. Specifically, I’m not sure the relative RPKM values shown in the figure deliver the correct relative amounts, an issue that ENCODE can and should check. Which brings me to the last part of this post…

What is ENCODE doing?

Having realized the possible issue with RPKM comparisons in Figure 2, I took a look at Figure 3 to try to understand whether there were potential implications for it as well. That exercise took me to a whole other level of ENCODEness. To begin with, I was trying to make sense of the x-axis, which is labeled “biochemical signal strength (log10)” when I realized that the different curves on the plot all come from different, completely unrelated x-axes. If this sounds confusing, it is. The green curves are showing graphs of functions whose domain is in log 10 RPKM units. However the histone modification curves are in log (-10 log p), where p is a p-value that has been computed. I’ve never seen anyone plot log(log(p-values)); what does it mean?! Nor do I understand how such graphs can be placed on a common x-axis (?!). What is “biochemical signal strength” (?) Why in the bottom panel is the grey H3K9me3 showing %nucleotides conserved decreasing as “biochemical strength” is increasing (?!) Why is the green RNA curves showing conservation below genome average for low expressed transcripts (?!) and why in the top panel is the red H3K4me3 an “M” shape (?!) What does any of this mean (?!) What I’m supposed to understand from it, or frankly, what is going on at all ??? I know many of the authors of this ENCODE PNAS paper and I simply cannot believe they saw and approved this figure. It is truly beyond belief… see below:

All of these figures are of course to support the main point of the paper. Which is that even though 80% of the genome is functional it is also true that this is not what was meant to be said , and that what is true is that “survey of biochemical activity led to a significant increase in genome coverage and thus accentuated the discrepancy between biochemical and evolutionary estimates… where function is ascertained independently of cellular state but is dependent on environment and evolutionary niche therefore resulting in estimates that  differ widely in their false-positive and false-negative rates and the resolution with which elements can be defined… [unlike] genetic approaches that rely on sequence alterations to establish the biological relevance of a DNA segment and are often considered a gold standard for defining function.”

The ENCODE PNAS paper was first published behind a paywall. However after some public criticism, the authors relented and paid for it to be open access. This was a mistake. Had it remained behind a paywall not only would the consortium have saved money, I and others might have been spared the experience of reading the paper. I hope the consortium will afford me the courtesy of paywall next time.

In reading the news yesterday I came across multiple reports claiming that even casually smoking marijuana can change your brain. I usually don’t pay much attention to such articles; I’ve never smoked a joint in my life. In fact, I’ve never even smoked a cigarette. So even though as a scientist I’ve been interested in cannabis from the molecular biology point of view, and as a citizen from a legal point of view, the issues have not been personal. However reading a USA Today article about the paper, I noticed that the principal investigator Hans Breiter was claiming to be a psychiatrist and mathematician. That is an unusual combination so I decided to take a closer look. I immediately found out the claim was a lie. In fact, the totality of math credentials of Hans Breiter consist of some logic/philosophy courses during a year abroad at St. Andrews while he was a pre-med student at Northwestern. Even being an undergraduate major in mathematics does not make one a mathematician, just as being an undergraduate major in biology does not makes one a doctor. Thus, with his outlandish claim, Hans Breiter had succeeded in personally offending me! So, I decided to take a look at his paper underlying the multiple news reports:

This is quite possibly the worst paper I’ve read all year (as some of my previous blog posts show I am saying something with this statement). Here is a breakdown of some of the issues with the paper:

### 1. Study design

First of all, the study has a very small sample size, with only 20 “cases” (marijuana users), a fact that is important to keep in mind in what follows. The title uses the term “recreational users” to describe them, and in the press release accompanying the article Breiter says that “Some of these people only used marijuana to get high once or twice a week. People think a little recreational use shouldn’t cause a problem, if someone is doing OK with work or school. Our data directly says this is not the case.” In fact, the majority of users in the study were smoking more than 10 joints per week. There is even a person in the study smoking more than 30 joints per week (as disclosed above, I’m not an expert on this stuff but if 30 joints per week is “recreation” then it seems to me that person is having a lot of fun). More importantly, Breiter’s statement in the press release is a lie. There is no evidence in the paper whatsoever, not even a tiny shred, that the users who were getting high once or twice a week were having any problems. There are also other issues with the study design. For example, the paper claims the users are not “abusing” other drugs, but it is quite possible that they are getting high on cocaine, heroin, or ??? as well, an issue that could quite possibly affect the study. The experiment consisted of an MRI scan of each user/control, but only a single scan was done. Given the variability in MRI scans this also seems problematic.

### 2. Multiple testing

The study looked at three aspects of brain morphometry in the study participants: gray matter density, volume and shape. Each of these morphometric analyses constituted multiple tests. In the case of gray matter density, estimates were based on small clusters of voxels, resulting in 123 tests (association of each voxel cluster with marijuana use). Volumes were estimated for four regions: left and right nucleus accumbens and amygdala. Shape was also tested in the same four regions. What the authors should have done is to correct the p-values computed for each of these tests by accounting for the total number of tests performed. Instead, (Bonferroni) corrections were performed separately for each type of analysis. For example, in the volume analysis p-values were required to be less than 0.0125 = 0.05/4. In other words, the extent of testing was not properly accounted for. Even so, many of the results were not significant. For example, the volume analysis showed no significant association for any of the four tested regions. The best case was the left nucleus accumbens (Figure 1C) with a corrected p-value of 0.015 which is over the authors’ own stated required threshold of 0.0125 (see caption). They use the language “The association with drug use, after correcting for 4 comparisons, was determined to be a trend toward significance” to describe this non-effect. It is worth noting that the removal of the outlier at a volume of over $800 mm^3$ would almost certainly flatten the line altogether and remove even the slight effect. It would have been nice to test this hypothesis but the authors did not release any of their data.

Figure 1c.

In the Fox News article about the paper, Breiter is quoted saying ““For the NAC [nucleus accumbens], all three measures were abnormal, and they were abnormal in a dose-dependent way, meaning the changes were greater with the amount of marijuana used,” Breiter said.  “The amygdala had abnormalities for shape and density, and only volume correlated with use.  But if you looked at all three types of measures, it showed the relationships between them were quite abnormal in the marijuana users, compared to the normal controls.” The result above shows this to be a lie. Volume did not significantly correlate with use.

This is all very bad, but things get uglier the more one looks at the paper. In the tables reporting the p-values, the authors do something I have never seen before in a published paper. They report the uncorrected p-values, indicating those that are significant (prior to correction) in boldface, and then put an asterisk next to those that are significant after their (incomplete) correction. I realize my own use of boldface is controversial… but what they are doing is truly insane. The fact that they put an asterisk next to the values significant after correction indicates they are aware that multiple testing is required. So why bother boldfacing p-values that they know are not significant? The overall effect is an impression that more tests are significant than is actually the case. See for yourself in their Table 4:

Table 4.

The fact that there are multiple columns is also problematic. Separate tests were performed for smoking occasions per day, joints per occasion, joints per week and smoking days per week. These measures are highly correlated, but even so multiply testing them requires multiple test correction. The authors simply didn’t perform it. They say “We did not correct for the number of drug use measures because these measures tend not be independent of each other”. In other words, they multiplied the number of tests by four, and chose to not worry about that. Unbelievable.

Then there is Table 5, where the authors did not report the p-values at all, only whether they were significant or not… without correction:

Table 5.

### 3. Correlation vs. causation

This issue is one of the oldest in the book. There is even a wikipedia entry about itCorrelation does not imply causation. Yet despite the fact the every result in the paper is directed at testing for association, in the last sentence of the abstract they say “These data suggest that marijuana exposure, even in young recreational users, is associated with exposure-dependent alterations of the neural matrix of core reward structures and is consistent with animal studies of changes in dendritic arborization.” At a minimum, such a result would require doing a longitudinal study. Breiter takes this language to an extreme in the press release accompanying the article. I repeat the statement he made that I quoted above where I boldface the causal claim: “”Some of these people only used marijuana to get high once or twice a week. People think a little recreational use shouldn’t cause a problem, if someone is doing OK with work or school. Our data directly says this is not the case.” I believe that scientists should be sanctioned for making public statements that directly contradict the content of their papers, as appears to be the case here. There is precedent for this.