You are currently browsing the monthly archive for January 2016.

In 2014 biologist Mike Snyder published 42 papers. Some of his contributions to those papers were collaborative, but many describe experiments performed in his lab. To make sense of the number 42, one might naïvely assume that Snyder dedicated 8 days and 17 hours (on average) to each paper. That time would include the hours (days?) needed to conceive of ideas, setup and perform experiments, analyze data and write text. It’s not much time, yet even so it is probably an upper bound. This is because Snyder performed the work that resulted in the 42 papers in his spare time. As the chair of the genetics department at Stanford, an administrative position carrying many responsibilities, he was burdened with numerous inter- and intra- and extra- departmental meetings, and as an internationally renowned figure in genomics he was surely called upon to assist in numerous reviews of grants and papers, to deliver invited talks at conferences, and to serve on numerous national and international panels and committees. Moreover, as a “principal investigator” (PI), Snyder was certainly tasked with writing reports about the funding he receives, and to spend time applying for future funding. So how does someone write 42 papers in a year?

Snyder manages 36 postdocs,  13 research assistants,  11 research scientists, 9 visiting scientists, and 8 graduate students. One might imagine that each of the graduate students supervises four postdocs, so that the students/postdocs operate under a structure that enables independent work to proceed in the lab with only occasional supervision and input from the PI. This so-called “PI model” for biology arguably makes sense. The focus of experienced investigators on supervision allows them to provide crucial input gained from years of experience on projects that require not only creative insight, but also large amounts of tedious rote work. Modern biology is also fraught with interdisciplinary challenges, and large groups benefit from the diversity of their members.  For these reasons biology papers nowadays tend to have large numbers of authors.

It is those authors, the unsung heroes of science, that are the “cast” in Eric Lander’s recent perspective on “The Heroes of CRISPR” where he writes that

“the narrative underscores that scientific breakthroughs are rarely eureka moments. They are typically ensemble acts, played out over a decade or more, in which the cast becomes part of something greater than what any one of them could do alone.”

All of the research papers referenced in the Lander perspective have multiple authors, on average about 7, and going up to 20. When discussing papers such as this, it is therefore customary to refer to “the PI and colleagues…” in lieu of crediting all individual authors. Indeed, this phrase appears throughout Lander’s perspective, where he writes “Moineau and colleagues..”, “Siksnys and colleagues..”, etc. It is understood that this means that the PIs guided projects and provided key insights, yet left most of the work to the first author(s) who were in turn aided by others.

There is a more cynical view of the PI model, namely that by running large labs PIs are able to benefit from a scientific roulette of their own making. PIs can claim credit for successes while blaming underlings for failures. Even one success can fund numerous failures, and so the wheel spins faster and faster…the PI always wins. Of course the mad rush to win leads to an overall deterioration of quality. For example, it appears that even the Lander perspective was written in haste (Lander may not be as prolific as Snyder but according to Thomson Reuters he did publish 21 “hot” papers in 2015, placing him fourth on the list of the world’s most influential scientific minds, only a few paces behind winner Stacey Gabriel). The result is what appears to be a fairly significant typo in Figure 2 of his perspective, which shows the geography of the CRISPR story. The circle labeled “9” should be colored in blue like circle “10”, because that color represents “the final step of biological engineering to enable genome editing”. The Jinek et al. 2012 paper of circle “9” clearly states that “We further show that the Cas9 endonuclease can be programmed with guide RNA engineered as a single transcript to target and cleave any dsDNA sequence of interest”. Moreover the Jinek et al. 2013 paper described editing in mammalian cells but was submitted in 2012 (the dates in Lander’s figure are by submission), so circle “9” should also be labeled with a “10” for “Genome editing in mammalian cells”.  In other words, the figure should have looked like this:

Lander acknowledges the reality of the PI model in the opening of his perspective, where he writes that “the Perspective describes an inspiring ensemble of a dozen or so scientists who—with their collaborators and other contributors whose stories are not elaborated here—discovered the CRISPR system, unraveled its molecular mechanisms, and repurposed it as a powerful tool for biological research and biomedicine. Together, they are the Heroes of CRISPR.”

But in choosing to highlight a “dozen or so” scientists, almost all of whom are established PIs at this point, Lander unfairly trivializes the contributions of dozens of graduate students and postdocs who may have made many of the discoveries he discusses, may have developed the key insights, and almost certainly did most of the work. For example, one of the papers mentioned in the Lander perspective is

F. Zhang*, L. Cong*, S. Lodato, S. Kosuri, G.M. Church and P. Arlotta, Efficient construction of sequence-specific TAL effectors for modulating mammalian transcription, 2011

I’ve placed a * next to the names of F. Zhang and L. Cong as is customary to do when “authors contributed equally to this work” (a quote from their paper). Yet in his perspective Lander writes that “when TALEs were deciphered, Zhang, with his collaborators Paola Arlotta and George Church…successfully repurposed them for mammals”. The omission of Cong is surprising not only because his contribution to the TALE work was equal to that of Zhang, but because he is the same Cong that is first author of Cong et al. 2013 that Lander lauds for its number of citations (Le Cong is a former graduate student of Zhang, and is now a postdoc at the Broad Institute). Another person who is absent from Lander’s perspective, yet was first author on two of the key papers mentioned is Martin Jinek. He and Charpentier’s postdoc Krzysztof Chylinski were deemed fundamental to the CRISPR story elsewhere.

Lander concludes his perspective by writing that “the [CRISPR] narrative…is a wonderful lesson for a young person contemplating a life in science”. That may be true of the CRISPR narrative, but the lesson of his narrative is that young persons should not count on leaders in their field to recognize their work.

COI statement: Eric Lander is my former (informal) Ph.D. advisor. Jennifer Doudna is a current colleague and collaborator.

One of my favorite systems biology papers is the classic “Stochastic Gene Expression in a Single Cell” by Michael Elowitz, Arnold J. Levine, Eric D. Siggia and Peter S. Swain (in this post I refer to it as the ELSS paper).

What I particularly like about the paper is that it resulted from computational biology flipped. Unlike most genomics projects, where statistics and computation are servants of the data, in ELSS a statistical idea was implemented with biology, and the result was an elegant experiment that enabled a fundamental measurement to be made.

The fact the ELSS implemented a statistical idea with biology makes the statistics a natural starting point for understanding the paper. The statistical idea is what is known as the law of total variance. Given a random (response) variable $C$ with a random covariate  $Z$, the law of total variance states that the variance of $C$ can be decomposed as:

$Var[C] = E_Z[Var[C|Z]] + Var_Z[E[C|Z]]$.

There is a biological interpretation of this law that also serves to explain it: If the random variable $C$ denotes the expression of a gene in a single cell ($C$ being a random variable means that the expression is stochastic), and $Z$ denotes the (random) state of a cell, then the law of total variance describes the “total noise” $Var[C]$ in terms of what can be considered “intrinsic” (also called “unexplained”) and “extrinsic” (also called “explained”) noise or variance.

To understand intrinsic noise, first one understands the expression $Var[C|Z]$ to be the conditional variance, which is also a random variable; its possible values are the variance of the gene expression in different cell states. If $Var[C]$ does not depend on $Z$ then the expression of the gene is said to be homoscedastic, i.e., it does not depend on cell state (if it does then it is said to be heteroscedastic). Because $Var[C|Z]$ is a random variable, the expression $E_Z[Var[C|Z]$ makes sense, it is simply the average variance (of the gene expression in single cells) across cell states (weighted by their probability of occurrence), thus the term “intrinsic noise” to describe it.

The expression $E[C|Z]$ is a random variable whose possible values are the average  of the gene expression in cells. Thus, $Var_Z[E[C|Z]]$ is the variance of the averages; intuitively it can be understood to describe the noise arising from different cell state, thus the term “extrinsic noise” to describe it (see here for a useful interactive visualization for exploring the law of total variance).

The idea of ELSS was to design an experiment to measure the extent of intrinsic and extrinsic noise in gene expression by inserting two identically regulated reporter genes (cyan fluorescent protein and yellow fluorescent protein) into E. coli and measuring their expression in different cells. What this provides are measurements from the following model:

Random cell states are represented by random variables $Z_1,\ldots,Z_n$ which are independent and identically distributed, one for each of n different cells, while random variables $C_1,\ldots,C_n$  and $Y_1,\ldots,Y_n$ correspond to the gene expression of the cyan , respectively yellow, reporters in those cells. The ELSS experiment produces a single sample from each variable $C_i$ and $Y_i$, i.e. a pair of measurements for each cell. A hierarchical model for the experiment, in which the marginal (unconditional) distributions $C_i$ and $Y_i$ are identical, allows for estimating the intrinsic and extrinsic noise from the reporter expression measurements.

The model above, on which ELSS is based, was not described in the original paper (more on that later). Instead,  in ELSS the following estimates for intrinsic, extrinsic and total noise were simply written down:

$\eta^2_{int} = \frac{\frac{1}{n} \left( \sum_{i=1}^n \frac{1}{2} (c_i-y_i)^2\right) }{\overline{c} \cdot \overline{y}},$ (intrinsic noise)

$\eta^2_{ext} = \frac{\frac{1}{n}\sum_{i=1}^n c_i \cdot y_i - \overline{c} \cdot \overline{y}}{\overline{c} \cdot \overline{y}},$ (extrinsic noise)

$\eta^2_{tot} = \frac{\frac{1}{n}\sum_{i=1}^n \frac{1}{2} (c_i^2+y_i^2)-\overline{c}\cdot\overline{y}}{\overline{c} \cdot \overline{y}}.$ (total noise)

Here  $c_1,\ldots,c_n$ and $y_1,\ldots,y_n$ are the measurements of cyan respectively yellow reporter expression in each cell, $\overline{c} = \frac{1}{n}\sum_{i=1}^n c_i$ and $\overline{y} = \frac{1}{n}\sum_{i=1}^n y_i$.

Last year, Audrey Fu, at the time a visiting scholar in Jonathan Pritchard’s lab and now assistant professor in statistical science at the University of Idaho,  studied the ELSS paper as part of a journal club. She noticed some inconsistencies with the proposed estimates in the paper, e.g. it seemed to her that some were biased, whereas others were not, and she proceeded to investigate in more detail the statistical basis for the estimates. There had been a few papers trying to provide statistical background, motivation and interpretation for the formulas in ELSS (e.g. A. Hilfinger and J. Paulsson, Separating intrinsic from extrinsic fluctuations in dynamic biological systems, 2011 ), but there had not been an analysis of bias, or for that matter other statistical properties of the estimates. Audrey started to embark on a post-publication peer review of the paper, and having seen such reviews on my blog contacted me to ask whether I would be interested to work with her. The project has been a fun hobby of ours for the past couple of months, eventually leading to a manuscript that we just posted on the arXiv:

Audrey Fu and Lior Pachter, Estimating intrinsic and extrinsic noise from single-cell gene expression measurements, arXiv 2016.

Our work provides what I think of as a “statistical companion” to the ELSS paper. First, we describe a formal hierarchical model which provides a useful starting point for thinking about estimators for intrinsic and extrinsic noise. With the model we re-examine the ELSS formulas, derive alternative estimators that either minimize bias or mean squared error, and revisit the intuition that underlies the extraction of intrinsic and extrinsic noise from data. Details are in the paper, but I briefly review some of the highlights here:

Figure 3a of the ELSS paper shows a scatterplot of data from two experiments, and provides a geometric interpretation of intrinsic and extrinsic noise that can guide intuition about them. We have redrawn their figure (albeit with a handful of points rather than with real data) in order to clarify the connections to the statistics:

The Elowitz et al. caption correctly stated that “Each point represents the mean fluorescence intensities from one cell. Spread of points perpendicular to the diagonal line on which CFP and YFP intensities are equal corresponds to intrinsic noise, whereas spread parallel to this line is increased by extrinsic noise”. While both statements are true, the one about intrinsic noise is precise whereas the one about extrinsic noise can be refined. In fact, the ELSS extrinsic noise estimate is the sample covariance (albeit biased due to a prefix of n in the denominator rather than n-1), an observation made by Hilfinger and Paulsson. The sample covariance has a  (well-known) geometric interpretation: Specifically, we explain that it is the average (signed) area of triangles formed by pairs of data points (one the blue one in the figure): green triangles in Q1 and Q3 (some not shown) represent a positive contribution to the covariance and magenta triangles in Q2 and Q4 represent a negative contribution. Since most data points lie in the 1st (Q1) and 3rd (Q3) quadrants relative to the blue point, most of the contribution involving the blue point is positive. Similarly, since most pairs of data points can be connected by a positively signed line, their positive contribution will result in a positive covariance. We also explain why naïve intuition of extrinsic noise as the variance of points along the line $c=y$ is problematic.

The estimators we derive are summarized in the table below (Table 1 from our paper):

There is a bit of algebra that is required to derive formulas in the table (see the appendices of our paper). The take home messages are that:

1. There is a subtle assumption underlying the ELSS intrinsic noise estimator that makes sense for the experiments in the ELSS paper, but not for every type of experiment in which the ELSS formulas are currently used. This has to do with the mean expression level of the two reporters, and we provide a rationale and criteria when to apply quantile normalization to normalize expression to the data.
2. The ELSS intrinsic noise estimator is unbiased, whereas the ELSS extrinsic noise estimator is (slightly) biased. This asymmetry can be easily rectified with adjustments we derive.
3. The standard unbiased estimator for variance (obtained via the Bessel correction) is frequently, and correctly, criticized for trading off mean squared error for bias. In practice, it can be more important to minimize mean squared error (MSE). For this reason we derive MSE minimizing estimators. While the MSE minimizing estimates converge quickly to the unbiased estimates (as a function of the number of cells), we believe there may be applications of the law of total variance to problems in systems biology where sample sizes are smaller, in which case our formulas may become useful.

The ELSS paper has been cited more than 3,000 times and with the recent emergence of large scale single-cell RNA-Seq the ideas in the paper are more relevant than ever. Hopefully our statistical companion to the paper will be useful to those seeking to obtain a better understanding of the methods, or those who plan to extend and apply them.