You are currently browsing the tag archive for the ‘DESeq’ tag.

The development of microarray technology two decades ago heralded genome-wide comparative studies of gene expression in human, but it was the widespread adoption of RNA-Seq that has led to differential expression analysis becoming a staple of molecular biology studies. RNA-Seq provides measurements of transcript abundance, making possible not only gene-level analyses, but also differential analysis of isoforms of genes. As such, its use has necessitated refinements of the term “differential expression”, and new terms such as “differential transcript expression” have emerged alongside “differential gene expression”. A difficulty with these concepts is that they are used to describe biology, statistical hypotheses, and sometimes to describe types of methods. The aims of this post are to provide a unifying framework for thinking about the various concepts, to clarify their meaning, and to describe connections between them.

To illustrate the different concepts associated to differential expression, I’ll use the following example, consisting of a comparison of a single two-isoform gene in two conditions (the figure is Supplementary Figure 1 in Ntranos, Yi et al. Identification of transcriptional signatures for cell types from single-cell RNA-Seq, 2018):

The isoforms are labeled primary and secondary, and the two conditions are called “A” and “B”. The black dots labeled conditions A and B have x-coordinates $x_A$ and $x_B$ corresponding to the abundances of the primary isoform in the respective conditions, and y-coordinates $y_A$ and $y_B$ corresponding to the abundance of the secondary isoforms. In data from an experiment the black dots will represent the mean level of expression of the constituent isoforms as derived from replicates, and there will be uncertainty as to their exact location. In this example I’ll assume they represent the true abundances.

### Biology

Below is a list of terms used to characterize changes in expression:

Differential transcript expression (DTE) is change in one of the isoforms. In the figure, this is represented (conceptually) by the two red lines along the x- and y-axes respectively. Algebraically, one might compute the change in the primary isoform by $x_B-x_A$ and the change in the secondary isoform by $y_B-y_A$. However the term DTE is used to denote not only the extent of change, but also the event that a single isoform of a gene changes between conditions, i.e. when the two points lie on a horizontal or vertical line. DTE can be understood to occur as a result of transcriptional regulation if an isoform has a unique transcription start site, or post-transcriptional regulation if it is determined by a unique splicing event.

Differential gene expression (DGE) is the change in the overall output of the gene. Change in the overall output of a gene is change in the direction of  the line $y=x$, and the extent of change can be understood geometrically to be the distance between the projections of the two points onto the line $y=x$ (blue line labeled DGE). The distance will depend on the metric used. For example, the change in expression could be defined to be the total expression in condition B ($x_B+y_B$) minus the change in expression in condition A ($x_A+y_A$), which is $|x_B-x_A+y_B-y_A|$.  This is just the length of the blue line labeled “DGE” given by the $L_1$ norm. Alternatively, one could consider “DGE” to be the length of the blue line in the $L_2$ norm. As with DTE, DGE can also refer to a specific type of change in gene expression between conditions, one in which every isoform changes (relatively) by the same amount so that the line joining the two points has a slope of 1 (i.e. is angled at 45°). DGE can be understood to be the result of transcriptional regulation, driving overall gene expression up or down.

Differential transcript usage (DTU) is the change in relative expression between the primary and secondary isoforms. This can be interpreted geometrically as the angle between the two points, or alternatively as the length (as given by some norm) of the green line labeled DTU. As with DTE and DGE, DTU is also a term used to describe a certain kind of difference in expression between two conditions, one in which the line joining the two points has a slope of -1. DTU events are most likely controlled by post-transcriptional regulation.

Gene differential expression (GDE) is represented by the red line. It is the amount of change in expression along in the direction of line joining the two points. GDE is a notion that, for reasons explained below, is not typically tested for, and there are few methods that consider it. However GDE is biologically meaningful, in that it generalizes the notions of DGE, DTU and DTE, allowing for change in any direction. A gene that exhibits some change in expression between conditions is GDE regardless of the direction of change. GDE can represent complex changes in expression driven by a combination of transcriptional and post-transcriptional regulation. Note that DGE, DTU and DTE are all special cases of GDE.

If the $L_2$ norm is used to measure length and $DTE_1,DTE_2$ denote DTE in the primary and secondary isoforms respectively, then it is clear that DGE, DTU, DTE and GDE satisfy the relationship

$GDE^2 = DGE^2 + DTU^2 = DTE_1^2 + DTE_2^2.$

### Statistics

The terms DTE, DGE, DTU and GDE have an intuitive biological meaning, but they are also used in genomics as descriptors of certain null hypotheses for statistical testing of differential expression.

The differential transcript expression (DTE) null hypothesis for an isoform is that it did not change between conditions, i.e. $x_A=x_B$ for the primary isoform, or $y_A=y_B$ for the secondary isoform. In other words, in this example there are two DTE null hypotheses one could consider.

The differential gene expresión (DGE) null hypothesis is that there is no change in overall expression of the gene, i.e. $x_A+y_A = x_B+y_B$.

The differential transcript usage (DTU) null hypothesis is that there is no change in the difference in expression of isoforms, i.e. $x_A-y_A = x_B - y_B$.

The gene differential expression (GDE) null hypothesis is that there is no change in expression in any direction, i.e. for all constants $a,b$, $ax_A+by_A = ax_B+by_B$.

The union differential transcript expression (UDTE) null hypothesis is that there is no change in expression of any isoform. That is, that $x_A = y_A$ and $x_B = y_B$ (this null hypothesis is sometimes called DTE+G). The terminology is motivated by $\neg \cup_i DTE_i = \cap_i DTE_i$.

Not that $UDTE \Leftrightarrow GDE$, because if we assume GDE, and set $a=1,b=0$ we obtain DTE for the primary isoform and setting $a=0,b=1$ we obtain DTE for the secondary isoform. To be clear, by GDE or DTE in this case we mean the GDE (respectively DTE) null hypothesis. Furthermore, we have that

$UDTE,GDE \Rightarrow DTE,DGE,DTU$.

This is clear because if $x_A=y_A$ and $x_B=y_B$ then both DTE null hypotheses are satisfied by definition, and both DGE and DTU are trivially satisfied. However no other implications hold, i.e. $DTE \not \Rightarrow DGE,DTU$, similarly $DGE \not \Rightarrow DTE,DTU$, and $DTU \not \Rightarrow DGE, DTE$.

### Methods

The terms DGE, DTE, DTU and GDE also used to describe methods for differential analysis.

A differential gene expression method is one whose goal is to identify changes in overall gene expression. Because DGE depends on the projection of the points (representing gene abundances) to the line y=x, DGE methods typically take as input gene counts or abundances computed by summing transcript abundances $x_A+y_A$ and $x_B+y_B$. Examples of early DGE methods for RNA-Seq were DESeq (now DESeq2) and edgeR. One problem with DGE methods is that it is problematic to estimate gene abundance by adding up counts of the constituent isoforms. This issue was discussed extensively in Trapnell et al. 2013. On the other hand, if the biology of a gene is DGE, i.e. changes in expression are the same (relatively) in all isoforms, then DGE methods will be optimal, and the issue of summed counts not representing gene abundances accurately is moot.

differential transcript expression method is one whose goal is to identify individual transcripts that have undergone DTE. Early methods for DTE were Cufflinks (now Cuffdiff2) and MISO, and more recently sleuth, which improves DTE accuracy by modeling uncertainty in transcript quantifications. A key issue with DTE is that there are many more transcripts than genes, so that rejecting DTE null hypotheses is harder than rejecting DGE null hypotheses. On the other hand, DTE provides differential analysis at the highest resolution possible, pinpointing specific isoforms that change and opening a window to study post-transcriptional regulation. A number of recent examples highlight the importance of DTE in biomedicine (see, e.g., Vitting-Seerup and Sandelin 2017). Unfortunately DTE results do not always translate to testable hypotheses, as it is difficult to knock out individual isoforms of genes.

differential transcript usage method is one whose goal is to identify genes whose overall expression is constant, but where isoform switching leads to changes in relative isoform abundances. Cufflinks implemented a DTU test using Jensen-Shannon divergence, and more recently RATs is a method specialized for DTU.

As discussed in the previous section, none of null hypotheses DGE, DTE and DTU imply any other, so users have to choose, prior to performing an analysis, which type of test they will perform. There are differing opinions on the “right” approach to choosing between DGE, DTU and DTE. Sonseson et al. 2016 suggest that while DTE and DTU may be appropriate in certain niche applications, generally it’s better to choose DGE, and they therefore advise not to bother with transcript-level analysis. In Trapnell et al. 2010, an argument was made for focusing on DTE and DTU, with the conclusion to the paper speculating that “differential RNA level isoform regulation…suggests functional specialization of the isoforms in many genes.” Van den Berge et al. 2017 advocate for a middle ground: performing a gene-level analysis but saving some “FDR budget” for identifying DTE in genes for which the UDTE null hypothesis has been rejected.

There are two alternatives that have been proposed to get around the difficulty of having to choose, prior to analysis, whether to perform DGE, DTU or DTE:

differential transcript expression aggregation (DTE->G) method is a method that first performs DTE on all isoforms of every gene, and then aggregates the resulting p-values (by gene) to obtain gene-level p-values. The “aggregation” relies on the observation that under the null hypothesis, p-values are uniformly distributed. There are a number of different tests (e.g. Fisher’s method) for testing whether (independent) p-values are uniformly distributed. Applying such tests to isoform p-values per gene provides gene-level p-values and the ability to reject UDTE. A DTE->G method was tested in Soneson et al. 2016 (based on Šidák aggregation) and the stageR method (Van den Berge et al. 2017) uses the same method as a first step. Unfortunately, naïve DTE->G methods perform poorly when genes change by DGE, as shown in Yi et al. 2017. The same paper shows that Lancaster aggregation is a DTE->G method that achieves the best of both the DGE and DTU worlds. One major drawback of DTE->G methods is that they are non-constructive, i.e. the rejection of UDTE by a DTE->G method provides no information about which transcripts were differential and how. The stageR method averts this problem but requires sacrificing some power to reject UDTE in favor of the interpretability provided by subsequent DTE.

gene differential expression method is a method for gene-level analysis that tests for differences in the direction of change identified between conditions. For a GDE method to be successful, it must be able to identify the direction of change, and that is not possible with bulk RNA-Seq data. This is because of the one in ten rule that states that approximately one predictive variable can be estimated from ten events. In bulk RNA-Seq, the number of replicates in standard experiments is three, and the number of isoforms in multi-isoform genes is at least two, and sometimes much more than that.

In Ntranos, Yi et al. 2018, it is shown that single-cell RNA-Seq provides enough “replicates” in the form of cells, that logistic regression can be used to predict condition based on expression, effectively identifying the direction of change. As such, it provides an alternative to DTE->G for rejecting UDTE. The Ntranos and Yi GDE methods is extremely powerful: by identifying the direction of change it is a DGE methods when the change is DGE, it is a DTU method when the change is DTU, and it is a DTE method when the change is DTE. Interpretability is provided in the prediction step: it is the estimated direction of change.

### Remarks

The discussion in this post is based on an example consisting of a gene with two isoforms, however the concepts discussed are easy to generalize to multi-isoform genes with more than two transcripts. I have not discussed differential exon usage (DEU), which is the focus of the DEXSeq method because of the complexities arising in genes which don’t have well-defined shared exons. Nevertheless, the DEXSeq approach to rejecting UDTE is similar to DTE->G, with DTE replaced by DEU. There are many programs for DTE, DTU and (especially) DGE that I haven’t mentioned; the ones cited are intended merely to serve as illustrative examples. This is not a comprehensive review of RNA-Seq differential expression methods.

### Acknowledgments

The blog post was motivated by questions of Charlotte Soneson and Mark Robinson arising from an initial draft of the Ntranos, Yi et al. 2018 paper. The exposition was developed with Vasilis Ntranos and Lynn Yi. Valentine Svensson provided valuable comments and feedback.

RNA-Seq is the new kid on the block, but there is still something to be learned from the stodgy microarray. One of the lessons is hidden in a tech report by Daniela Witten and Robert Tibshirani from 2007: “A comparison of fold-change  and the t-statistic for microarray data analysis“.

The tech report makes three main points. The first is that it is preferable to use a modified t-statistic rather than the ordinary t-statistic. This means that rather than comparing (normalized) means using

$T_i = \frac{\bar{x_i} - \bar{y_i}}{s_i}$

where $s_i$ is the standard deviation of the replicates $x_i$ (respectively $y_i$) of gene i in two different conditions, it is better to use

$T'_i = \frac{\bar{x_i} - \bar{y_i}}{s_i+s_0}$

where $s_0$ minimizes the coefficient of variation of $T'_i$.

The second point made is that the intuition that reproducibility implies accuracy is not correct (fold change had been proposed for use instead of a t-statistic because the results were more reproducible).

The third point, in my opinion the most important one, I quote directly from the report:

“A researcher should choose the measure of differential expression based on the biological system of interest. If large absolute changes in expression are relevant to the system, then fold-change should be used; on the other hand, if changes in expression relative to the underlying noise are important, then a modified t-statistic is preferable.”

How does this pertain to RNA-Seq? Microarray experiments and RNA-Seq both measure expression but the translation of methods for the analysis of one platform to the other can be non-trivial. One reason is that in RNA-Seq experiments accurately measuring “fold-change” is difficult. Read counts accumulated across a gene cannot be used directly to estimate fold change because the transcripts making up the gene may have different lengths. For this reason, methods such as Cufflinks, RSEM or eXpress (and most recently Sailfish recently reviewed on this blog) use the EM algorithm to “deconvolute” ambiguously mapped reads. The following thought experiment (Figure 1 in our paper describing Cufflinks/Cuffdiff 2) illustrates the issue:

Changes in fragment counts for a gene do not necessarily equal a change in expression. The “exon-union” method counts reads falling on any of a gene’s exons, whereas the “exon-intersection” method counts only reads
on constitutive exons. Both of the exon-union and exon-intersection counting schemes may incorrectly estimate a change in expression in genes with multiple isoforms as shown in the table. It is important to note that the problem of fragment assignment described here in the context of RNA-Seq is crucial for accurate estimation of parameters in many other *Seq assays.

“Count-based” methods for differential expression, such as DESeq, work directly with accumulated gene counts and are based on the premise that even if estimated fold-change is wrong, statistical significance can be assessed based on differences between replicates.  In our recent paper describing Cuffdiff 2 (with a new method for differential abundance analysis) we examine DESeq (as a proxy for count-based methods) carefully and show using both simulation and real data that fold-change is not estimated accurately. In fact, even when DESeq and Cufflinks both deem a gene to be differentially expressed, and even when the effect is in the same direction (e.g. up-regulation), DESeq can (and many times does) estimate fold-change incorrectly. This problem is not specific to DESeq. All “count based” methods that employ naive heuristics for computing fold change will produce inaccurate estimates:

Comparison of fold-change estimated by Cufflinks (tail of arrows) vs. “intersection-count” (head of arrows) reproduced from Figure 5 of the supplementary material of the Cuffdiff 2 paper. “Intersection-count” consists of the accumulated read counts in the regions shared among transcripts in a gene. The x-axis shows array fold change vs. the estimated fold-change on the y-axis.  For more details on the experiment see the Cuffdiff 2 paper.

In other words,

it is essential to perform fragment assignment in a biological context where absolute expression differences are relevant to the system.

What might that biological context be? This is a subjective question but in my experience users of microarrays or RNA-Seq (including myself) always examine fold-change in addition to p-values obtained from (modified) t-statistics or other model based statistics because the raw fold-change is more directly connected to the data from the experiment.

In many settings though, statistical significance remains the gold standard for discovery. In the recent epic “On the immortality of television sets: ‘function’ in the human genome according to the evolution-free gospel of ENCODE“, Dan Graur criticizes the ENCODE project for reaching an “absurd conclusion” through various means, among them the emphasis of “statistical significance rather than magnitude of effect”. Or, to paraphrase Samuel Johnson,

statistical significance is the last refuge from a poor analysis of data.