You are currently browsing the tag archive for the ‘single-cell’ tag.

The development of microarray technology two decades ago heralded genome-wide comparative studies of gene expression in human, but it was the widespread adoption of RNA-Seq that has led to differential expression analysis becoming a staple of molecular biology studies. RNA-Seq provides measurements of transcript abundance, making possible not only gene-level analyses, but also differential analysis of isoforms of genes. As such, its use has necessitated refinements of the term “differential expression”, and new terms such as “differential transcript expression” have emerged alongside “differential gene expression”. A difficulty with these concepts is that they are used to describe biology, statistical hypotheses, and sometimes to describe types of methods. The aims of this post are to provide a unifying framework for thinking about the various concepts, to clarify their meaning, and to describe connections between them.

To illustrate the different concepts associated to differential expression, I’ll use the following example, consisting of a comparison of a single two-isoform gene in two conditions (the figure is Supplementary Figure 1 in Ntranos, Yi et al. Identification of transcriptional signatures for cell types from single-cell RNA-Seq, 2018):

The isoforms are labeled primary and secondary, and the two conditions are called “A” and “B”. The black dots labeled conditions A and B have x-coordinates $x_A$ and $x_B$ corresponding to the abundances of the primary isoform in the respective conditions, and y-coordinates $y_A$ and $y_B$ corresponding to the abundance of the secondary isoforms. In data from an experiment the black dots will represent the mean level of expression of the constituent isoforms as derived from replicates, and there will be uncertainty as to their exact location. In this example I’ll assume they represent the true abundances.

### Biology

Below is a list of terms used to characterize changes in expression:

Differential transcript expression (DTE) is change in one of the isoforms. In the figure, this is represented (conceptually) by the two red lines along the x- and y-axes respectively. Algebraically, one might compute the change in the primary isoform by $x_B-x_A$ and the change in the secondary isoform by $y_B-y_A$. However the term DTE is used to denote not only the extent of change, but also the event that a single isoform of a gene changes between conditions, i.e. when the two points lie on a horizontal or vertical line. DTE can be understood to occur as a result of transcriptional regulation if an isoform has a unique transcription start site, or post-transcriptional regulation if it is determined by a unique splicing event.

Differential gene expression (DGE) is the change in the overall output of the gene. Change in the overall output of a gene is change in the direction of  the line $y=x$, and the extent of change can be understood geometrically to be the distance between the projections of the two points onto the line $y=x$ (blue line labeled DGE). The distance will depend on the metric used. For example, the change in expression could be defined to be the total expression in condition B ($x_B+y_B$) minus the change in expression in condition A ($x_A+y_A$), which is $|x_B-x_A+y_B-y_A|$.  This is just the length of the blue line labeled “DGE” given by the $L_1$ norm. Alternatively, one could consider “DGE” to be the length of the blue line in the $L_2$ norm. As with DTE, DGE can also refer to a specific type of change in gene expression between conditions, one in which every isoform changes (relatively) by the same amount so that the line joining the two points has a slope of 1 (i.e. is angled at 45°). DGE can be understood to be the result of transcriptional regulation, driving overall gene expression up or down.

Differential transcript usage (DTU) is the change in relative expression between the primary and secondary isoforms. This can be interpreted geometrically as the angle between the two points, or alternatively as the length (as given by some norm) of the green line labeled DTU. As with DTE and DGE, DTU is also a term used to describe a certain kind of difference in expression between two conditions, one in which the line joining the two points has a slope of -1. DTU events are most likely controlled by post-transcriptional regulation.

Gene differential expression (GDE) is represented by the red line. It is the amount of change in expression along in the direction of line joining the two points. GDE is a notion that, for reasons explained below, is not typically tested for, and there are few methods that consider it. However GDE is biologically meaningful, in that it generalizes the notions of DGE, DTU and DTE, allowing for change in any direction. A gene that exhibits some change in expression between conditions is GDE regardless of the direction of change. GDE can represent complex changes in expression driven by a combination of transcriptional and post-transcriptional regulation. Note that DGE, DTU and DTE are all special cases of GDE.

If the $L_2$ norm is used to measure length and $DTE_1,DTE_2$ denote DTE in the primary and secondary isoforms respectively, then it is clear that DGE, DTU, DTE and GDE satisfy the relationship

$GDE^2 = DGE^2 + DTU^2 = DTE_1^2 + DTE_2^2.$

### Statistics

The terms DTE, DGE, DTU and GDE have an intuitive biological meaning, but they are also used in genomics as descriptors of certain null hypotheses for statistical testing of differential expression.

The differential transcript expression (DTE) null hypothesis for an isoform is that it did not change between conditions, i.e. $x_A=x_B$ for the primary isoform, or $y_A=y_B$ for the secondary isoform. In other words, in this example there are two DTE null hypotheses one could consider.

The differential gene expresión (DGE) null hypothesis is that there is no change in overall expression of the gene, i.e. $x_A+y_A = x_B+y_B$.

The differential transcript usage (DTU) null hypothesis is that there is no change in the difference in expression of isoforms, i.e. $x_A-y_A = x_B - y_B$.

The gene differential expression (GDE) null hypothesis is that there is no change in expression in any direction, i.e. for all constants $a,b$, $ax_A+by_A = ax_B+by_B$.

The union differential transcript expression (UDTE) null hypothesis is that there is no change in expression of any isoform. That is, that $x_A = y_A$ and $x_B = y_B$ (this null hypothesis is sometimes called DTE+G). The terminology is motivated by $\neg \cup_i DTE_i = \cap_i DTE_i$.

Not that $UDTE \Leftrightarrow GDE$, because if we assume GDE, and set $a=1,b=0$ we obtain DTE for the primary isoform and setting $a=0,b=1$ we obtain DTE for the secondary isoform. To be clear, by GDE or DTE in this case we mean the GDE (respectively DTE) null hypothesis. Furthermore, we have that

$UDTE,GDE \Rightarrow DTE,DGE,DTU$.

This is clear because if $x_A=y_A$ and $x_B=y_B$ then both DTE null hypotheses are satisfied by definition, and both DGE and DTU are trivially satisfied. However no other implications hold, i.e. $DTE \not \Rightarrow DGE,DTU$, similarly $DGE \not \Rightarrow DTE,DTU$, and $DTU \not \Rightarrow DGE, DTE$.

### Methods

The terms DGE, DTE, DTU and GDE also used to describe methods for differential analysis.

A differential gene expression method is one whose goal is to identify changes in overall gene expression. Because DGE depends on the projection of the points (representing gene abundances) to the line y=x, DGE methods typically take as input gene counts or abundances computed by summing transcript abundances $x_A+y_A$ and $x_B+y_B$. Examples of early DGE methods for RNA-Seq were DESeq (now DESeq2) and edgeR. One problem with DGE methods is that it is problematic to estimate gene abundance by adding up counts of the constituent isoforms. This issue was discussed extensively in Trapnell et al. 2013. On the other hand, if the biology of a gene is DGE, i.e. changes in expression are the same (relatively) in all isoforms, then DGE methods will be optimal, and the issue of summed counts not representing gene abundances accurately is moot.

differential transcript expression method is one whose goal is to identify individual transcripts that have undergone DTE. Early methods for DTE were Cufflinks (now Cuffdiff2) and MISO, and more recently sleuth, which improves DTE accuracy by modeling uncertainty in transcript quantifications. A key issue with DTE is that there are many more transcripts than genes, so that rejecting DTE null hypotheses is harder than rejecting DGE null hypotheses. On the other hand, DTE provides differential analysis at the highest resolution possible, pinpointing specific isoforms that change and opening a window to study post-transcriptional regulation. A number of recent examples highlight the importance of DTE in biomedicine (see, e.g., Vitting-Seerup and Sandelin 2017). Unfortunately DTE results do not always translate to testable hypotheses, as it is difficult to knock out individual isoforms of genes.

differential transcript usage method is one whose goal is to identify genes whose overall expression is constant, but where isoform switching leads to changes in relative isoform abundances. Cufflinks implemented a DTU test using Jensen-Shannon divergence, and more recently RATs is a method specialized for DTU.

As discussed in the previous section, none of null hypotheses DGE, DTE and DTU imply any other, so users have to choose, prior to performing an analysis, which type of test they will perform. There are differing opinions on the “right” approach to choosing between DGE, DTU and DTE. Sonseson et al. 2016 suggest that while DTE and DTU may be appropriate in certain niche applications, generally it’s better to choose DGE, and they therefore advise not to bother with transcript-level analysis. In Trapnell et al. 2010, an argument was made for focusing on DTE and DTU, with the conclusion to the paper speculating that “differential RNA level isoform regulation…suggests functional specialization of the isoforms in many genes.” Van den Berge et al. 2017 advocate for a middle ground: performing a gene-level analysis but saving some “FDR budget” for identifying DTE in genes for which the UDTE null hypothesis has been rejected.

There are two alternatives that have been proposed to get around the difficulty of having to choose, prior to analysis, whether to perform DGE, DTU or DTE:

differential transcript expression aggregation (DTE->G) method is a method that first performs DTE on all isoforms of every gene, and then aggregates the resulting p-values (by gene) to obtain gene-level p-values. The “aggregation” relies on the observation that under the null hypothesis, p-values are uniformly distributed. There are a number of different tests (e.g. Fisher’s method) for testing whether (independent) p-values are uniformly distributed. Applying such tests to isoform p-values per gene provides gene-level p-values and the ability to reject UDTE. A DTE->G method was tested in Soneson et al. 2016 (based on Šidák aggregation) and the stageR method (Van den Berge et al. 2017) uses the same method as a first step. Unfortunately, naïve DTE->G methods perform poorly when genes change by DGE, as shown in Yi et al. 2017. The same paper shows that Lancaster aggregation is a DTE->G method that achieves the best of both the DGE and DTU worlds. One major drawback of DTE->G methods is that they are non-constructive, i.e. the rejection of UDTE by a DTE->G method provides no information about which transcripts were differential and how. The stageR method averts this problem but requires sacrificing some power to reject UDTE in favor of the interpretability provided by subsequent DTE.

gene differential expression method is a method for gene-level analysis that tests for differences in the direction of change identified between conditions. For a GDE method to be successful, it must be able to identify the direction of change, and that is not possible with bulk RNA-Seq data. This is because of the one in ten rule that states that approximately one predictive variable can be estimated from ten events. In bulk RNA-Seq, the number of replicates in standard experiments is three, and the number of isoforms in multi-isoform genes is at least two, and sometimes much more than that.

In Ntranos, Yi et al. 2018, it is shown that single-cell RNA-Seq provides enough “replicates” in the form of cells, that logistic regression can be used to predict condition based on expression, effectively identifying the direction of change. As such, it provides an alternative to DTE->G for rejecting UDTE. The Ntranos and Yi GDE methods is extremely powerful: by identifying the direction of change it is a DGE methods when the change is DGE, it is a DTU method when the change is DTU, and it is a DTE method when the change is DTE. Interpretability is provided in the prediction step: it is the estimated direction of change.

### Remarks

The discussion in this post is based on an example consisting of a gene with two isoforms, however the concepts discussed are easy to generalize to multi-isoform genes with more than two transcripts. I have not discussed differential exon usage (DEU), which is the focus of the DEXSeq method because of the complexities arising in genes which don’t have well-defined shared exons. Nevertheless, the DEXSeq approach to rejecting UDTE is similar to DTE->G, with DTE replaced by DEU. There are many programs for DTE, DTU and (especially) DGE that I haven’t mentioned; the ones cited are intended merely to serve as illustrative examples. This is not a comprehensive review of RNA-Seq differential expression methods.

### Acknowledgments

The blog post was motivated by questions of Charlotte Soneson and Mark Robinson arising from an initial draft of the Ntranos, Yi et al. 2018 paper. The exposition was developed with Vasilis Ntranos and Lynn Yi. Valentine Svensson provided valuable comments and feedback.

One of my favorite systems biology papers is the classic “Stochastic Gene Expression in a Single Cell” by Michael Elowitz, Arnold J. Levine, Eric D. Siggia and Peter S. Swain (in this post I refer to it as the ELSS paper).

What I particularly like about the paper is that it resulted from computational biology flipped. Unlike most genomics projects, where statistics and computation are servants of the data, in ELSS a statistical idea was implemented with biology, and the result was an elegant experiment that enabled a fundamental measurement to be made.

The fact the ELSS implemented a statistical idea with biology makes the statistics a natural starting point for understanding the paper. The statistical idea is what is known as the law of total variance. Given a random (response) variable $C$ with a random covariate  $Z$, the law of total variance states that the variance of $C$ can be decomposed as:

$Var[C] = E_Z[Var[C|Z]] + Var_Z[E[C|Z]]$.

There is a biological interpretation of this law that also serves to explain it: If the random variable $C$ denotes the expression of a gene in a single cell ($C$ being a random variable means that the expression is stochastic), and $Z$ denotes the (random) state of a cell, then the law of total variance describes the “total noise” $Var[C]$ in terms of what can be considered “intrinsic” (also called “unexplained”) and “extrinsic” (also called “explained”) noise or variance.

To understand intrinsic noise, first one understands the expression $Var[C|Z]$ to be the conditional variance, which is also a random variable; its possible values are the variance of the gene expression in different cell states. If $Var[C]$ does not depend on $Z$ then the expression of the gene is said to be homoscedastic, i.e., it does not depend on cell state (if it does then it is said to be heteroscedastic). Because $Var[C|Z]$ is a random variable, the expression $E_Z[Var[C|Z]$ makes sense, it is simply the average variance (of the gene expression in single cells) across cell states (weighted by their probability of occurrence), thus the term “intrinsic noise” to describe it.

The expression $E[C|Z]$ is a random variable whose possible values are the average  of the gene expression in cells. Thus, $Var_Z[E[C|Z]]$ is the variance of the averages; intuitively it can be understood to describe the noise arising from different cell state, thus the term “extrinsic noise” to describe it (see here for a useful interactive visualization for exploring the law of total variance).

The idea of ELSS was to design an experiment to measure the extent of intrinsic and extrinsic noise in gene expression by inserting two identically regulated reporter genes (cyan fluorescent protein and yellow fluorescent protein) into E. coli and measuring their expression in different cells. What this provides are measurements from the following model:

Random cell states are represented by random variables $Z_1,\ldots,Z_n$ which are independent and identically distributed, one for each of n different cells, while random variables $C_1,\ldots,C_n$  and $Y_1,\ldots,Y_n$ correspond to the gene expression of the cyan , respectively yellow, reporters in those cells. The ELSS experiment produces a single sample from each variable $C_i$ and $Y_i$, i.e. a pair of measurements for each cell. A hierarchical model for the experiment, in which the marginal (unconditional) distributions $C_i$ and $Y_i$ are identical, allows for estimating the intrinsic and extrinsic noise from the reporter expression measurements.

The model above, on which ELSS is based, was not described in the original paper (more on that later). Instead,  in ELSS the following estimates for intrinsic, extrinsic and total noise were simply written down:

$\eta^2_{int} = \frac{\frac{1}{n} \left( \sum_{i=1}^n \frac{1}{2} (c_i-y_i)^2\right) }{\overline{c} \cdot \overline{y}},$ (intrinsic noise)

$\eta^2_{ext} = \frac{\frac{1}{n}\sum_{i=1}^n c_i \cdot y_i - \overline{c} \cdot \overline{y}}{\overline{c} \cdot \overline{y}},$ (extrinsic noise)

$\eta^2_{tot} = \frac{\frac{1}{n}\sum_{i=1}^n \frac{1}{2} (c_i^2+y_i^2)-\overline{c}\cdot\overline{y}}{\overline{c} \cdot \overline{y}}.$ (total noise)

Here  $c_1,\ldots,c_n$ and $y_1,\ldots,y_n$ are the measurements of cyan respectively yellow reporter expression in each cell, $\overline{c} = \frac{1}{n}\sum_{i=1}^n c_i$ and $\overline{y} = \frac{1}{n}\sum_{i=1}^n y_i$.

Last year, Audrey Fu, at the time a visiting scholar in Jonathan Pritchard’s lab and now assistant professor in statistical science at the University of Idaho,  studied the ELSS paper as part of a journal club. She noticed some inconsistencies with the proposed estimates in the paper, e.g. it seemed to her that some were biased, whereas others were not, and she proceeded to investigate in more detail the statistical basis for the estimates. There had been a few papers trying to provide statistical background, motivation and interpretation for the formulas in ELSS (e.g. A. Hilfinger and J. Paulsson, Separating intrinsic from extrinsic fluctuations in dynamic biological systems, 2011 ), but there had not been an analysis of bias, or for that matter other statistical properties of the estimates. Audrey started to embark on a post-publication peer review of the paper, and having seen such reviews on my blog contacted me to ask whether I would be interested to work with her. The project has been a fun hobby of ours for the past couple of months, eventually leading to a manuscript that we just posted on the arXiv:

Audrey Fu and Lior Pachter, Estimating intrinsic and extrinsic noise from single-cell gene expression measurements, arXiv 2016.

Our work provides what I think of as a “statistical companion” to the ELSS paper. First, we describe a formal hierarchical model which provides a useful starting point for thinking about estimators for intrinsic and extrinsic noise. With the model we re-examine the ELSS formulas, derive alternative estimators that either minimize bias or mean squared error, and revisit the intuition that underlies the extraction of intrinsic and extrinsic noise from data. Details are in the paper, but I briefly review some of the highlights here:

Figure 3a of the ELSS paper shows a scatterplot of data from two experiments, and provides a geometric interpretation of intrinsic and extrinsic noise that can guide intuition about them. We have redrawn their figure (albeit with a handful of points rather than with real data) in order to clarify the connections to the statistics:

The Elowitz et al. caption correctly stated that “Each point represents the mean fluorescence intensities from one cell. Spread of points perpendicular to the diagonal line on which CFP and YFP intensities are equal corresponds to intrinsic noise, whereas spread parallel to this line is increased by extrinsic noise”. While both statements are true, the one about intrinsic noise is precise whereas the one about extrinsic noise can be refined. In fact, the ELSS extrinsic noise estimate is the sample covariance (albeit biased due to a prefix of n in the denominator rather than n-1), an observation made by Hilfinger and Paulsson. The sample covariance has a  (well-known) geometric interpretation: Specifically, we explain that it is the average (signed) area of triangles formed by pairs of data points (one the blue one in the figure): green triangles in Q1 and Q3 (some not shown) represent a positive contribution to the covariance and magenta triangles in Q2 and Q4 represent a negative contribution. Since most data points lie in the 1st (Q1) and 3rd (Q3) quadrants relative to the blue point, most of the contribution involving the blue point is positive. Similarly, since most pairs of data points can be connected by a positively signed line, their positive contribution will result in a positive covariance. We also explain why naïve intuition of extrinsic noise as the variance of points along the line $c=y$ is problematic.

The estimators we derive are summarized in the table below (Table 1 from our paper):

There is a bit of algebra that is required to derive formulas in the table (see the appendices of our paper). The take home messages are that:

1. There is a subtle assumption underlying the ELSS intrinsic noise estimator that makes sense for the experiments in the ELSS paper, but not for every type of experiment in which the ELSS formulas are currently used. This has to do with the mean expression level of the two reporters, and we provide a rationale and criteria when to apply quantile normalization to normalize expression to the data.
2. The ELSS intrinsic noise estimator is unbiased, whereas the ELSS extrinsic noise estimator is (slightly) biased. This asymmetry can be easily rectified with adjustments we derive.
3. The standard unbiased estimator for variance (obtained via the Bessel correction) is frequently, and correctly, criticized for trading off mean squared error for bias. In practice, it can be more important to minimize mean squared error (MSE). For this reason we derive MSE minimizing estimators. While the MSE minimizing estimates converge quickly to the unbiased estimates (as a function of the number of cells), we believe there may be applications of the law of total variance to problems in systems biology where sample sizes are smaller, in which case our formulas may become useful.

The ELSS paper has been cited more than 3,000 times and with the recent emergence of large scale single-cell RNA-Seq the ideas in the paper are more relevant than ever. Hopefully our statistical companion to the paper will be useful to those seeking to obtain a better understanding of the methods, or those who plan to extend and apply them.