You are currently browsing the tag archive for the ‘law of total variance’ tag.

One of my favorite systems biology papers is the classic “Stochastic Gene Expression in a Single Cell” by Michael Elowitz, Arnold J. Levine, Eric D. Siggia and Peter S. Swain (in this post I refer to it as the ELSS paper).

What I particularly like about the paper is that it resulted from computational biology flipped. Unlike most genomics projects, where statistics and computation are servants of the data, in ELSS a statistical idea was implemented with biology, and the result was an elegant experiment that enabled a fundamental measurement to be made.

The fact the ELSS implemented a statistical idea with biology makes the statistics a natural starting point for understanding the paper. The statistical idea is what is known as the law of total variance. Given a random (response) variable $C$ with a random covariate  $Z$, the law of total variance states that the variance of $C$ can be decomposed as:

$Var[C] = E_Z[Var[C|Z]] + Var_Z[E[C|Z]]$.

There is a biological interpretation of this law that also serves to explain it: If the random variable $C$ denotes the expression of a gene in a single cell ($C$ being a random variable means that the expression is stochastic), and $Z$ denotes the (random) state of a cell, then the law of total variance describes the “total noise” $Var[C]$ in terms of what can be considered “intrinsic” (also called “unexplained”) and “extrinsic” (also called “explained”) noise or variance.

To understand intrinsic noise, first one understands the expression $Var[C|Z]$ to be the conditional variance, which is also a random variable; its possible values are the variance of the gene expression in different cell states. If $Var[C]$ does not depend on $Z$ then the expression of the gene is said to be homoscedastic, i.e., it does not depend on cell state (if it does then it is said to be heteroscedastic). Because $Var[C|Z]$ is a random variable, the expression $E_Z[Var[C|Z]$ makes sense, it is simply the average variance (of the gene expression in single cells) across cell states (weighted by their probability of occurrence), thus the term “intrinsic noise” to describe it.

The expression $E[C|Z]$ is a random variable whose possible values are the average  of the gene expression in cells. Thus, $Var_Z[E[C|Z]]$ is the variance of the averages; intuitively it can be understood to describe the noise arising from different cell state, thus the term “extrinsic noise” to describe it (see here for a useful interactive visualization for exploring the law of total variance).

The idea of ELSS was to design an experiment to measure the extent of intrinsic and extrinsic noise in gene expression by inserting two identically regulated reporter genes (cyan fluorescent protein and yellow fluorescent protein) into E. coli and measuring their expression in different cells. What this provides are measurements from the following model:

Random cell states are represented by random variables $Z_1,\ldots,Z_n$ which are independent and identically distributed, one for each of n different cells, while random variables $C_1,\ldots,C_n$  and $Y_1,\ldots,Y_n$ correspond to the gene expression of the cyan , respectively yellow, reporters in those cells. The ELSS experiment produces a single sample from each variable $C_i$ and $Y_i$, i.e. a pair of measurements for each cell. A hierarchical model for the experiment, in which the marginal (unconditional) distributions $C_i$ and $Y_i$ are identical, allows for estimating the intrinsic and extrinsic noise from the reporter expression measurements.

The model above, on which ELSS is based, was not described in the original paper (more on that later). Instead,  in ELSS the following estimates for intrinsic, extrinsic and total noise were simply written down:

$\eta^2_{int} = \frac{\frac{1}{n} \left( \sum_{i=1}^n \frac{1}{2} (c_i-y_i)^2\right) }{\overline{c} \cdot \overline{y}},$ (intrinsic noise)

$\eta^2_{ext} = \frac{\frac{1}{n}\sum_{i=1}^n c_i \cdot y_i - \overline{c} \cdot \overline{y}}{\overline{c} \cdot \overline{y}},$ (extrinsic noise)

$\eta^2_{tot} = \frac{\frac{1}{n}\sum_{i=1}^n \frac{1}{2} (c_i^2+y_i^2)-\overline{c}\cdot\overline{y}}{\overline{c} \cdot \overline{y}}.$ (total noise)

Here  $c_1,\ldots,c_n$ and $y_1,\ldots,y_n$ are the measurements of cyan respectively yellow reporter expression in each cell, $\overline{c} = \frac{1}{n}\sum_{i=1}^n c_i$ and $\overline{y} = \frac{1}{n}\sum_{i=1}^n y_i$.

Last year, Audrey Fu, at the time a visiting scholar in Jonathan Pritchard’s lab and now assistant professor in statistical science at the University of Idaho,  studied the ELSS paper as part of a journal club. She noticed some inconsistencies with the proposed estimates in the paper, e.g. it seemed to her that some were biased, whereas others were not, and she proceeded to investigate in more detail the statistical basis for the estimates. There had been a few papers trying to provide statistical background, motivation and interpretation for the formulas in ELSS (e.g. A. Hilfinger and J. Paulsson, Separating intrinsic from extrinsic fluctuations in dynamic biological systems, 2011 ), but there had not been an analysis of bias, or for that matter other statistical properties of the estimates. Audrey started to embark on a post-publication peer review of the paper, and having seen such reviews on my blog contacted me to ask whether I would be interested to work with her. The project has been a fun hobby of ours for the past couple of months, eventually leading to a manuscript that we just posted on the arXiv:

Audrey Fu and Lior Pachter, Estimating intrinsic and extrinsic noise from single-cell gene expression measurements, arXiv 2016.

Our work provides what I think of as a “statistical companion” to the ELSS paper. First, we describe a formal hierarchical model which provides a useful starting point for thinking about estimators for intrinsic and extrinsic noise. With the model we re-examine the ELSS formulas, derive alternative estimators that either minimize bias or mean squared error, and revisit the intuition that underlies the extraction of intrinsic and extrinsic noise from data. Details are in the paper, but I briefly review some of the highlights here:

Figure 3a of the ELSS paper shows a scatterplot of data from two experiments, and provides a geometric interpretation of intrinsic and extrinsic noise that can guide intuition about them. We have redrawn their figure (albeit with a handful of points rather than with real data) in order to clarify the connections to the statistics:

The Elowitz et al. caption correctly stated that “Each point represents the mean fluorescence intensities from one cell. Spread of points perpendicular to the diagonal line on which CFP and YFP intensities are equal corresponds to intrinsic noise, whereas spread parallel to this line is increased by extrinsic noise”. While both statements are true, the one about intrinsic noise is precise whereas the one about extrinsic noise can be refined. In fact, the ELSS extrinsic noise estimate is the sample covariance (albeit biased due to a prefix of n in the denominator rather than n-1), an observation made by Hilfinger and Paulsson. The sample covariance has a  (well-known) geometric interpretation: Specifically, we explain that it is the average (signed) area of triangles formed by pairs of data points (one the blue one in the figure): green triangles in Q1 and Q3 (some not shown) represent a positive contribution to the covariance and magenta triangles in Q2 and Q4 represent a negative contribution. Since most data points lie in the 1st (Q1) and 3rd (Q3) quadrants relative to the blue point, most of the contribution involving the blue point is positive. Similarly, since most pairs of data points can be connected by a positively signed line, their positive contribution will result in a positive covariance. We also explain why naïve intuition of extrinsic noise as the variance of points along the line $c=y$ is problematic.

The estimators we derive are summarized in the table below (Table 1 from our paper):

There is a bit of algebra that is required to derive formulas in the table (see the appendices of our paper). The take home messages are that:

1. There is a subtle assumption underlying the ELSS intrinsic noise estimator that makes sense for the experiments in the ELSS paper, but not for every type of experiment in which the ELSS formulas are currently used. This has to do with the mean expression level of the two reporters, and we provide a rationale and criteria when to apply quantile normalization to normalize expression to the data.
2. The ELSS intrinsic noise estimator is unbiased, whereas the ELSS extrinsic noise estimator is (slightly) biased. This asymmetry can be easily rectified with adjustments we derive.
3. The standard unbiased estimator for variance (obtained via the Bessel correction) is frequently, and correctly, criticized for trading off mean squared error for bias. In practice, it can be more important to minimize mean squared error (MSE). For this reason we derive MSE minimizing estimators. While the MSE minimizing estimates converge quickly to the unbiased estimates (as a function of the number of cells), we believe there may be applications of the law of total variance to problems in systems biology where sample sizes are smaller, in which case our formulas may become useful.

The ELSS paper has been cited more than 3,000 times and with the recent emergence of large scale single-cell RNA-Seq the ideas in the paper are more relevant than ever. Hopefully our statistical companion to the paper will be useful to those seeking to obtain a better understanding of the methods, or those who plan to extend and apply them.