Don’t believe the anti-hype. They are saying that RNA-Seq promises the discovery of new expression events, but it doesn’t deliver:
Is this true? There have been a few papers comparing microarrays to RNA-Seq technology (including one of my own) that I’ll discuss below, but first a breakdown of the Affymetrix “evidence”. The first is this figure (the poor quality of the images is exactly what Affymetrix provides, and not due to a reduction in quality on this site; they are slightly enlarged when clicked on):
The content of this figure is an illustration of the gene LMNB1 (Lamin protein of type B), used to argue that microarrays can provide transcript-level resolution whereas RNA-Seq can’t!! Really? Affymetrix is saying that RNA-Seq users would likely use the RefSeq annotation, which only has three isoforms. But this is a ridiculous claim. It is well known that RefSeq is a conservative annotation, and certainly RNA-Seq users have the same access to the multiple databases Affymetrix used to build their annotation (presumably, e.g. Ensembl). It therefore seems that what Affymetrix is saying with this figure is that RNA-Seq users are dumb.
The next figure shows the variability in abundance estimates as a function of expression level for RNA-Seq and the HTA 2.0, with the intended message being that microarrays are less noisy:
But there is a subtle trick going on here, and it’s in the units. The x-axis shows RPM, which is an abbreviation for Reads Per Million. This is not a commonly used unit, and there is a reason. First, it’s helpful to review what is used. In his landmark paper on RNA-Seq, Ali Mortazavi introduced the unit RPKM (note the extra K), which stands for reads per kilobase of transcript per million mapped reads. Why the extra kilobase term? In my review on RNA-Seq quantification I explain that RPKM is proportional to a maximum likelihood estimate of transcript abundance (obtained from a simple RNA-Seq model). The complete derivation is on page 6, ending in Equation 13; I include a summary here:
The maximum likelihood (ML) abundances are given by

$$\hat{\rho}_t \;=\; \frac{\hat{\alpha}_t/\tilde{l}_t}{\sum_{r} \hat{\alpha}_r/\tilde{l}_r} \;=\; \frac{X_t}{N\,\tilde{l}_t}\cdot\left(\sum_{r}\frac{X_r}{N\,\tilde{l}_r}\right)^{-1} \;\propto\; 10^9\cdot\frac{X_t}{\tilde{l}_t\,N}.$$

In these equations $\tilde{l}_t$ is the length of transcript $t$ (if reads are long it is necessary to modify the length due to edge effects, hence the tilde in the paper), the $\hat{\alpha}_t = X_t/N$ are the maximum likelihood estimates for the probabilities of selecting reads from transcripts (unnormalized by their length), and finally $X_t$ is the number of reads mapping to transcript $t$ while $N$ is the total number of mapped reads. The point is that RPKM (the rightmost formula for abundance) is really a unit for describing the maximum likelihood relative abundances ($\hat{\rho}_t$) scaled by some factors.
RPKM as a unit has two problems. The first is that in current RNA-Seq experiments reads are paired, so that the actual units being counted (the $X_t$) are fragments. For this reason we modified RPKM to FPKM in the Cufflinks paper (the “F” replaced “R” for fragment instead of read). A more serious problem, noted by Bo Li and Colin Dewey in their paper on RSEM, is that while FPKM is proportional to ML estimates of abundance, the proportionality constants may vary between experiments. For this reason they proposed TPM (transcripts per million), which is also proportional to the ML abundance estimates but with a proportionality constant (a million) that is the same between experiments. All of these units are used simply to avoid writing down the $\hat{\rho}_t$, which are in many cases tiny numbers since they must all sum to 1.
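To make the relationships between these units concrete, here is a short sketch in Python (the transcript names, counts, and lengths are all made up for illustration):

```python
# Toy example relating read counts to RPKM, TPM and the ML abundances rho_t.
# All numbers are hypothetical; "t1".."t3" are made-up transcripts.
counts = {"t1": 3000, "t2": 500, "t3": 1500}   # X_t: reads mapped to transcript t
lengths = {"t1": 2000, "t2": 500, "t3": 1500}  # transcript lengths in bases
N = sum(counts.values())                       # total mapped reads

# RPKM_t = 10^9 * X_t / (l_t * N): length- and depth-normalized read density.
rpkm = {t: 1e9 * counts[t] / (lengths[t] * N) for t in counts}

# The ML relative abundances are the length-normalized rates, renormalized to sum to 1.
rate = {t: counts[t] / lengths[t] for t in counts}
rho = {t: rate[t] / sum(rate.values()) for t in counts}

# TPM is simply rho_t scaled by a million, so the scale is the same in every experiment.
tpm = {t: 1e6 * rho[t] for t in counts}
```

Within a single sample RPKM and TPM differ only by a sample-wide constant, which is exactly why comparisons across experiments are safer in TPM.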
Returning to the Affymetrix figure, we see the strange RPM units. In essence, this is the rightmost term in the equation above, with the length terms removed from the denominators. Therefore RPM is proportional to the $\hat{\alpha}_t$. If a transcript is short, even if it is equally abundant to a longer transcript, it will produce fewer RNA-Seq reads and therefore its $\hat{\alpha}_t$ will be (possibly considerably) smaller. The effect of displaying RPM for RNA-Seq vs. expression level for the HTA 2.0 arrays is therefore to mix apples and oranges. Since what is being displayed is a coefficient of variation, there is a bias caused by the relationship between length and expression (e.g. many highly expressed housekeeping genes are short).
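The length bias is easy to demonstrate numerically. In this sketch (all numbers hypothetical) two transcripts are equally abundant but differ fourfold in length; the short one yields a quarter of the reads, so its RPM is a quarter as large, while the length-normalized TPM values coincide:

```python
# Two equally abundant transcripts of different length (hypothetical values).
lengths = {"short": 500, "long": 2000}   # lengths in bases
rho = {"short": 0.5, "long": 0.5}        # equal true relative abundances

# Expected read counts are proportional to abundance times length.
N = 1_000_000
weight = {t: rho[t] * lengths[t] for t in rho}
reads = {t: N * weight[t] / sum(weight.values()) for t in rho}

# RPM ignores length; TPM divides it back out.
rpm = {t: 1e6 * reads[t] / N for t in rho}
rate = {t: reads[t] / lengths[t] for t in rho}
tpm = {t: 1e6 * rate[t] / sum(rate.values()) for t in rho}
```

Plotting noise against RPM therefore places equally abundant genes at very different x-coordinates depending on their length.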
To be fair to Affymetrix, the conversion between the $\hat{\rho}_t$ and the $\hat{\alpha}_t$ can be confusing (it’s explained in Lemma 14 in the Supplement of the Cufflinks paper). So maybe the discordant x-axes were unintentional…but then there is the third figure:
Here it’s a bit hard to tell what is going on because not all the information needed to decipher the figure is provided. For example, it’s not clear how the “expression of exons” was computed or measured for the RNA-Seq experiment. I suspect that, as with the previous figure, read numbers were not normalized by the lengths of exons, and moreover spliced reads (and other possibly informative reads from transcripts) were ignored. In other words, I don’t really believe the result.
Having said this, it is true that expression arrays can have an advantage in measuring exon expression, because an array measurement is absolute (as opposed to the relative quantification that is all that is possible with RNA-Seq). Array signal is based on hybridization, and it is probably a reasonable assumption that some minimum amount of RNA triggers a signal, and that this amount is independent of the remainder of the RNA in an experiment. So arrays can (and in many cases probably do) have advantages over RNA-Seq.
There are a few papers that have looked into this, for example “A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae” by Nookaew et al., Nucleic Acids Research 40 (2012), who find high reproducibility in RNA-Seq and consistency between arrays and RNA-Seq. Xu et al., in “Human transcriptome array for high-throughput clinical studies”, PNAS 108 (2011), 3707–3712, are more critical, agreeing with Affymetrix that arrays are more sensitive at the exon level. For disease studies, they recommend using RNA-Seq to identify transcripts relevant to the disease, and then screening for those transcripts in patients using arrays.
For the Cuffdiff2 paper describing our new statistical procedures for differential analysis of transcripts and genes, the Rinn lab performed deep RNA-Seq and array expression measurement on the same samples from a HOXA1 knockdown (the experiments included multiple replicates of both the RNA-Seq and the arrays). To my knowledge, it is the deepest and most comprehensive data currently available for comparing arrays and RNA-Seq. Admittedly, the arrays used were not Affymetrix but Agilent SurePrint G3, and the RNA-Seq coverage was deep; however, we had two main findings very different from the Affymetrix claims. First, we found overall strong correlation between array expression values and RNA-Seq abundance estimates. The correlation remained strong for large regimes of expression even with very few reads (tested by sequencing fewer reads from a MiSeq). Second, we found that arrays were missing differentially expressed transcripts, especially at low abundance levels. In other words, we found RNA-Seq to have higher resolution. The following figure from our paper made the case (note the overall Spearman correlation was 0.86):
There are definitely continued applications for arrays, both in high-throughput screening applications (as suggested in the Xu et al. paper) and in the development of novel assays. For example Mercer et al., “Targeted RNA sequencing reveals the deep complexity of the human transcriptome”, Nature Biotechnology 30 (2012), 99–104, show how to couple capture (with arrays) with RNA-Seq to provide ultra-deep sequencing in subsets of the transcriptome. So it’s not yet the time to write off arrays. But RNA-Seq has many applications of its own: for example the ability to better detect allele-specific expression, the opportunity to identify RNA-DNA differences (and thereby study RNA editing), and the ability to study expression in non-model organisms where genome sequences are incomplete and annotations poor. For all these reasons I’m betting on RNA-Seq.
10 comments
September 6, 2013 at 9:29 pm
Jari Niemi
Also RNA-seq is really good for finding fusion genes!
September 7, 2013 at 8:51 am
MB
Do you see that the CV is lower for RPKM or FPKM than for TPM empirically?
I would think off the bat that the conversion from FPKM to TPM would mostly be just dividing TPM by length. Since the length is a constant for a given gene or transcript then I would expect the CV to be the same for either metric. I am having trouble seeing where a reduction in variability would come from in using different values, except perhaps in more accurate normalization which shouldn’t be much of an issue for technical replicates.
Thanks!
September 7, 2013 at 9:23 am
Lior Pachter
You are correct about the relationship between FPKM and TPM. However the Affymetrix figure is showing RPM, which is not the same as TPM (in the notation of the post, RPM is proportional to $\hat{\alpha}_t$ whereas TPM is proportional to $\hat{\rho}_t$). The fact that the latter is normalized by length means that a gene has different *x coordinates* when comparing the array results to RNA-Seq results. This would be ok if the coefficient of variation were constant at all $\hat{\alpha}_t$ (and therefore at all $\hat{\rho}_t$), but I don’t think that’s the case. Note that the red curve in the plot (this is the array) is not constant but increasing at low expression. Therefore what is going on is that at a fixed array expression value, the RNA-Seq points being plotted are on average for genes with much lower actual $\hat{\rho}_t$ value. In other words, the blue curve is like a zoomed-in version of the very leftmost part of the red curve, most likely a part of the red curve that we are not even seeing.
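The scale-invariance of the coefficient of variation that MB describes can be checked in a couple of lines (the replicate values and the FPKM-to-TPM constant below are made up):

```python
# CV is invariant under multiplying every replicate by the same constant,
# which is all that the within-sample FPKM -> TPM conversion does.
def cv(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5 / m

fpkm_reps = [10.0, 12.0, 9.5]   # one transcript across technical replicates
c = 1.37                        # hypothetical sample-wide FPKM -> TPM constant
tpm_reps = [c * x for x in fpkm_reps]
```

So switching between FPKM and TPM would not change the curves; using RPM, which drops the length normalization, does.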
September 9, 2013 at 7:30 am
Biologist
So, to summarize, not believing the anti-hype boils down to 1) apples-to-oranges comparisons of two fundamentally different platforms, 2) small figures on a flyer, and 3) Agilent data?
However, what is not addressed is the 1) significantly higher cost of RNA seq analysis 2) significantly delayed time to pub while data sits in bioinformaticist queue and 3) clinical suitability.
I think it’s fair to say that bioinformaticists will push for RNA-seq for job security reasons. Affy’s TAC software puts conclusions in the hands of biologists, which is a welcome change to this biologist.
September 9, 2013 at 9:10 am
Lior Pachter
I don’t think that’s accurate. Specifically, point (1) is not that it’s apples vs. oranges because the platforms are different. No, it’s actually that the plot is showing two curves with different x-axes superimposed on each other. I would have no problem looking at arrays vs. RNA-Seq on the same plot if the RNA-Seq units were in FPKM or TPM. (2) The small figures are not on a flyer; they are on a website. And they make it very difficult to find out the kind of hanky panky going on (as, for example, in point (1)). (3) It is true that we published our paper with Agilent data. Others have reached similar conclusions with (earlier platform) Affymetrix data (e.g. Marioni et al.). I think a comparison with Affymetrix’s latest HTA would be great. Unfortunately, despite claims to the contrary, what is on the Affymetrix website is not it. You ended at three points, but I also pointed out that (4) the notion that RNA-Seq is not isoform specific because RNA-Seq users will use inferior annotations to quantify with is ridiculous.
Regarding what you claim I did not address, I’ll respond as follows: (1) The cost of RNA-Seq is (approximately) a constant factor times the depth sequenced (I’m ignoring the cost of library prep). In the Cuffdiff2 paper we specifically conducted a MiSeq experiment to show that, in fact, RNA-Seq can produce decent results even at low coverage. We acknowledged the higher variability between replicates when using MiSeq (see our Figure 4); however, we also noted that although MiSeq discovery was less than HiSeq 2000, MiSeq differential analysis did not produce extra false positives. So while I agree that arrays are currently competitive in price with RNA-Seq, I don’t think that will be true in the (near) future, and it may already not be true (I know of at least one lab that has switched completely to RNA-Seq to reduce costs). (2) It may be that in some cases RNA-Seq is sitting a long time in the bioinformatics queue, but I’m sure that’s true for array processing as well. From a technical standpoint, there is nothing about RNA-Seq processing that is significantly more time consuming than array analysis (except, perhaps, read mapping, although the recent Sailfish paper that I reviewed in a previous post even negates that necessity). Even if reads are mapped, for an experiment with a few tens of millions of reads the time required on a decent machine is hours, which is negligible compared to the time required to conduct a biological experiment. Suggesting that RNA-Seq users can’t handle their bioinformatics processing is the same as suggesting they haven’t heard of annotation databases other than RefSeq: simply ridiculous. (3) I think clinical suitability is a point in favor of arrays, not against, which is why I linked to a paper stating that.
November 3, 2013 at 1:48 am
Jari Niemi
>1) significantly higher cost of RNA seq analysis
This is not true! Affy per sample costs more or less the same as RNA-seq per sample (~40 million reads, 100 bp, paired-end)!!!
> 2) significantly delayed time to pub while data sits in bioinformaticist queue
This might be true in some cases, but using it as the main reason not to adopt a new technology like NGS (which is quite new and, like any new technology, imperfect) would basically mean “never use anything new because it is not perfect yet, and always stay in our comfort zone”!
> and 3) clinical suitability.
This can always be said, and will be said, about a new technology, but it is not a reason not to start using new technologies and push them into the clinic! Otherwise we will always get stuck with old technologies and stay within our comfort zone because “this has been working well for us for the last 10 years and it is used in clinics”!
>I think its fair to say that bioinformaticists will push for RNA-seq for job security reasons.
This is one way to see it! Indeed, one needs a bioinformatician in most cases to analyze RNA-seq data, and that might also mean a power shift from the biologist to the bioinformatician. This is not bad! In the end it is team work and everybody is welcome!
> Affy’s TAC software puts conclusions in the hands of biologists, which is a welcome change to this biologist
That is true, but unfortunately Affy does not give everything a biologist might want in all cases! In some cases a biologist might use an RNA-seq experiment (for example, looking for novel fusion genes) to get more info/results out of his/her experiments when that is not possible with Affy! In the end it is team work where everybody (MDs, biologists, statisticians, bioinformaticians) is welcome!
September 11, 2014 at 4:16 pm
Michael evans
Well, the SEQC findings kinda favour arrays over seq for low expressors, and unless you’re a bioinformatics geek and/or have deep pockets to pay for deep sequencing and analysis, Affy arrays are far better, especially for biomarker screening and translational workflows. And most basic researchers do not know how to properly design a seq experiment…
October 3, 2014 at 3:03 pm
Biologist
Jari:
Re costs: Assays may be equivalent, but the total cost of RNA-Seq includes informatics, and ILMN and Life Tech always gloss over this and neglect to add in the dollars. Not to mention time. I’ve spoken to researchers who complain about being in a queue, measured in months, waiting for their analysis. Time is money, but how do you quantify that loss?
Re informaticist bias: I see your point for economy reasons. However, when it comes to publishing what is essentially opinion, you have to remember the source and what their economic driver (pay source) is.
Re fusion genes: While it’s true that with a fixed-content source (arrays) you’re not going to see something novel, such as a fusion gene, for what percentage of end users is that useful? Many users are looking for gene-level signatures (reproducible, mind you), alt splice events, etc., where all you care about is detecting known events (with a well-characterized genome). RNA-Seq is expensive, noisy, and overkill for those situations. I’m not saying the array is the be-all and end-all. But neither is RNA-Seq. Intelligently _and_ economically choose the appropriate situation, ignoring the hype.
November 28, 2014 at 9:25 am
Peter A.C. 't Hoen
Two remarks:
1. It is true that RNA-seq generates noisy expression estimates for low-abundance genes. However, the majority of these genes cannot even be detected on microarrays but fall within the fluorescent background. Like the results shown above, we showed already in 2008 a much higher sensitivity of sequencing-based analysis compared to microarrays in the detection of differential expression, particularly of low-expressed genes: http://nar.oxfordjournals.org/content/36/21/e141.long. Also, in a number of eQTL analyses I am involved in, there are many eQTLs that can be detected by sequencing but not by microarrays with similar numbers of samples, and those are typically eQTLs for low-expressed genes.
2. Probe design on microarrays is based on reference transcript annotations like those from RefSeq. I believe it is very dangerous to rely on annotated transcripts for transcript quantification. Based on yet-unpublished results from PacBio full-length RNA sequencing, we estimate that >75% of transcripts, even in well-characterized human cell lines, are not in RefSeq.
January 18, 2017 at 5:01 am
Arjun K
I don’t know the situation in other parts of the world, but in India people are still struggling with RNA-Seq data analysis. The assumption that RNA-Seq is cheaper than microarrays is definitely false. RNA-Seq needs high depth to get good reproducibility and low noise for low-level transcripts. People just think that RNA-Seq can be done at 400 USD per sample, but the reality is: at what coverage? Don’t blindly believe that your RNA-Seq data produces whatever coverage you expect (e.g. 30X) everywhere; it is an inherent issue with NGS that coverage is not uniform. So during the RNA-Seq data analysis, the bioinformatics guy filters out all the reads that have low coverage, thereby missing some low-abundance transcripts in the data.
Secondly, library-preparation PCR steps introduce random errors while amplifying the low-abundance transcript copies. The random errors introduced in PCR are carried over by SBS chemistry, turning them into systematic errors in all the reads. This introduces many errors into the data and ultimately generates Q20 or Q10 data which is of no use.