You are currently browsing the category archive for the ‘statistics’ category.

The widespread establishment of statistics departments in the United States during the mid-20th century can be traced to a presentation by Harold Hotelling in the Berkeley Symposium on Mathematical Statistics and Probability in 1945. The symposium, organized by Berkeley statistician Jerzy Neyman, was the first of six such symposia that took place every five years, and became the most influential meetings in statistics of their time. Hotelling’s lecture on “The place of statistics in the university” inspired the creation of several statistics departments, and at UC Berkeley, Neyman’s establishment of the statistics department in the 1950s was a landmark moment for statistics in the 20th century.

Neyman was hired in the mathematics department at UC Berkeley by a visionary chair, Griffith Evans, who transformed the UC Berkeley math department into a world-class institution after his hiring in 1934. Evans’ vision for the Berkeley math department included statistics, and Eric Lehmann‘s history of the UC Berkeley statistics department details how Evans’ commitment to diverse areas in the department led him to hire Neyman without even meeting him. However, Evans’ progressive vision for mathematics was not shared by all of his colleagues, and the conservative, parochial attitudes of the math department contributed to Neyman’s breakaway and eventual founding of the statistics department. This dynamic was later repeated at universities across the United States, resulting in a large gulf between mathematicians and statistics (ironically history may be repeating itself with some now suggesting that the emergence of “data science” is a result of conservatism among statisticians leading them to cling to theory rather than to care about data).

The divide between mathematicians and statistics is unfortunate for a number of reasons, one of them being that statistical literacy is important even for the purest of the pure mathematicians. A recent debate on the appropriateness of diversity statements for job applicants in mathematics highlights the need: analysis of data, specifically data on who is in the maths community, and their opinions on the issue, turns out to be central to understanding the matter at hand. Case in point is a recent preprint by two mathematicians:

Joshua Paik and Igor Rivin, Data Analysis of the Responses to Professor Abigail Thompson’s Statement on Mandatory Diversity Statements, arXiv, 2020.

This statistics preprint examines attempts to identify the defining attributes of mathematicians who signed recent letters related to diversity statement requirements in mathematics job searches. I was recently asked to provide feedback on the manuscript, ergo this blog post.

### Reproducibility

In order to assess the results of any preprint or paper, it is essential, as a first step, to be able to reproduce the analysis and results. In the case of a preprint such as this one, this means having access to the code and data used to produce the figures and to perform the calculations. I applaud the authors for being fully transparent and making available all of their code and data in a Github repository in a form that made it easy to reproduce all of their results; indeed I was able to do so without any problems. 👏

### The dataset

The preprint analyzes data on signatories of three letters submitted in response to an opinion piece on diversity statement requirements for job applicants published by Abigail Thompson, chair of the mathematics department at UC Davis. Thompson’s letter compared diversity statement requirements of job applicants to loyalty oaths required during McCarthyism. The response letters range from strong affirmation of Thompson’s opinions, to strong refutation of them. Signatories of “Letter A”, titled “The math community values a commitment to diversity“, “strongly disagreed with the sentiments and arguments of Dr. Thompson’s editorial” and are critical of the AMS for publishing her editorial.” Signatories of “Letter B”, titled “Letter to the editor“, worry about “direct attempt[s] to destroy Thompson’s career and attempt[s] to intimidate the AMS”. Signatories of “Letter C”,  titled “Letter to the Notices of the AMS“, write that they “applaud Abigail Thompson for her courageous leadership [in publishing her editorial]” and “agree wholeheartedly with her sentiments.”

The dataset analyzed by Paik and Rivin combines information scraped from Google Scholar and MathSciNet with data associated to the signatories that was collated by Chad Topaz. The dataset is available in .csv format here.

### The Paik and Rivin result

The main result of Paik and Rivin is summarized in the first paragraph of their Conclusion and Discussion section:

“We see the following patterns amongst the “established” mathematicians who signed the three letters: the citations numbers distribution of the signers of Letter A is similar to that of a mid-level mathematics department (such as, say, Temple University), the citations metrics of Letter B are closer to that of a top 20 department such as Rutgers University, while the citations metrics of the signers of Letter C are another tier higher, and are more akin to the distribution of metrics for a truly top department.”

A figure from their preprint summarizing the data supposedly supporting their result, is reproduced below (with the dotted blue line shifted slightly to the right after the bug fix):

Paik and Rivin go a step further, using citation counts and h-indices as proxies for “merit in the judgement of the community.” That is to say, Paik and Rivin claim that mathematicians who signed letter A, i.e. those who strongly disagreed with Thompson’s equivalence between diversity statements and McCarthy’s loyalty oaths, have less “merit in the judgement of the community” than mathematicians who signed letter C, i.e. those who agreed wholeheartedly with her sentiments.

The differential is indeed very large. Paik and Rivin find that the mean number of citations for signers of Letter A is 2397.75, the mean number of citations for signers of Letter B is 4434.89, and the mean number of citations for signers of Letter C is 6226.816. To control for an association between seniority and number of citations, the computed averages are based only on citation counts of full professors. [Note: a bug in the Paik-Rivin code results in an error in their reporting for the mean for group B. They report 4136.432 whereas the number is actually 4434.89.]

This data seems to support Paik and Rivin’s thesis that mathematicians who support the use of diversty statements in hiring and who strongly disagree with Thompson’s analogy of such statements to McCarthy’s loyalty oaths, are second rate mathematicians, whereas those who agree wholeheartedly with Thompson are on par with professors at “truly top departments”.

But do the data really support this conclusion?

### A fool’s errand

Before delving into the details of the data Paik and Rivin analyzed, it is worthwhile to pause and consider the validity of using citations counts and h-indices as proxies for “merit in the judgement of the community”. The authors themselves note that “citations and h-indices do not impose a total order on the quality of a mathematician” and emphasize that “it is quite obvious that, unlike in competitive swimming, imposing such an order is a fool’s errand.” Yet they proceed to discount their own advice, and wholeheartedly embark on the fool’s errand they warn against. 🤔

I examined the mathematicians in their dataset and first, as a sanity check, confirmed that I am one of them (I signed one of the letters). I then looked at the associated citation counts and noticed that out of 1435 mathematicians who signed the letters, I had the second highest number of citations according to Google Scholar (67,694), second only to Terence Tao (71,530). We are in the 99.9th percentile. 👏 Moreover, I have 27 times more citations than Igor Rivin. According to Paik and Rivin this implies that I have 27 times more merit in the judgement of our peers. I should say at least 27 times, because one might imagine that the judgement of the community is non-linear in the number of citations. Even if one discounts such quantitative comparisons (Paik and Rivin do note that Stephen Smale has fewer citations than Terence Tao, and that it would be difficult on that basis alone to conclude that Tao is the better mathematician), the preprint makes use of citation counts to assess “merit in the judgement of the community”, and thus according to Paik and Rivin my opinions have substantial merit. In fact, according to them, my opinion on diversity statements must be an extremely meritorious one. I imagine they would posit that my opinion on the debate that is raging in the math community regarding diversity statement requirements from job applicants is the correct, and definitive one. Now I can already foresee protestations that, for example, my article on “Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation” which has 9,438 citations is not math per se, and that it shouldn’t count. I’ll note that my biology colleagues, after reading the supplement, think it’s a math paper, but in any case, if we are going to head down that route shouldn’t Paik and Rivin read the paper to be sure? And shouldn’t they read every paper of mine, and every paper of every signatory to determine it is valid for their analysis? And shouldn’t they adjust the citation counts of every signatory? Well they didn’t do any of that, and besides, they included me in their analysis so… I proceed…

The citation numbers above are based on Google Scholar citations. Paik and Rivin also analyze MathSciNet citations and state that they prefer them because “only published mathematics are in MathSciNet, and is hence a higher quality data source when comparing mathematicians.” I checked the relationship between Scholar and MathSciNet citations and found that, not surprisingly, they have a correlation of 0.92:

I’d say they are therefore interchangeable in terms of the authors’ use of them as a proxy for “merit”.

But citations are not proxies for merit. The entire premise of the preprint is ridiculous. Furthermore, even if it was true that citations were a meaningful attribute of the signatories to analyze, there are many other serious problems with the preprint.

### The elephant not in the room

Paik and Rivin begin their preprint with a cursory examination of the data and immediately identify a potential problem… missing data. How much data is missing? 64.11% of individuals do not have associated Google Scholar citation data, and 78.82% don’t have MathSciNet citation data. Paik and Rivin brush this issue aside remarking that “while this is not optimal, a quick sample size calculation shows that one needs 303 samples or 21% of the data to produce statistics at a 95% confidence level and a 5% confidence interval.” They are apparently unaware of the need for uniform population sampling, and don’t appear to even think for a second of the possible ascertainment biases in their data. I thought for a second.

For instance, I wondered whether there might be a discrepancy between the number of citations of women with Google Scholar pages vs. women without such pages. This is because I’ve noticed anecdotally that several senior women mathematicians I know don’t have Google Scholar pages, and since senior scientists presumably have more citations this could create a problematic ascertainment bias. I checked and there is, as expected, some correlation between age post-Ph.D. and citation count (cor = 0.36):

To test whether there is an association between presence of a Google Scholar page and citation number I examined the average number of MathSciNet citations of women with and without Google Scholar pages. Indeed, the average number of citations of women without Google Scholar pages is much lower than those with a Google Scholar page (898 vs. 621). For men the difference is much smaller (1816 vs. 1801). By the way, the difference in citation number between men and women is itself large, and can be explained by a number of differences starting with the fact that the women represented in the database have much lower age post-Ph.D. than the men (17.6 vs. 26.3), and therefore fewer citations (see correlation between age and citations above).

The analysis above suggests that perhaps one should use MathSciNet citation counts instead of Google Scholar. However the extent of missing data for that attribute is highly problematic (78.82% missing values). For one thing, my own MathSciNet citation counts are missing, so there were probably bugs in the scraping. The numbers are also tiny. There are only 46 women with MathSciNet data among all three letter signatories out of 452 women signatories. I believe the data is unreliable. In fact, even my ascertainment bias analysis above is problematic due to the small number of individuals involved. It would be completely appropriate at this point to accept that the data is not of sufficient quality for even rudimentary analysis. Yet the authors continued.

### A big word

Confounder is a big word for a variable that influences both the dependent and independent variable in an analysis, thus causing a spurious association. The word does not appear in Paik and Rivin’s manuscript, which is unfortunate because it is in fact a confounder that explains their main “result”.  This confounder is age. I’ve already shown the strong relationship between age post-Ph.D. and citation count in a figure above. Paik and Rivin examine the age distribution of the different letter signatories and find distinct differences. The figure below is reproduced from their preprint:

The differences are stark: the mean time since PhD completion of signers of Letter A is 14.64 years, the mean time since PhD completion of signers of Letter B is 27.76 years and the the mean time since PhD completion of signers of Letter C is 35.48 years. Presumably to control for this association, Paik and Rivin restricted the citation count computations to full professors. As it turns out, this restriction alone does not control for age.

The figure below shows the number of citations of letter C signatories who are full professors as a function of their age:

The red line at 36 years post-Ph.D. divides two distinct regimes. The large jump at that time (corresponding to chronological age ~60) is not surprising: senior professors in mathematics are more famous and have more influence than their junior peers, and their work has had more time to be understood and appreciated. In mathematics results can take many years before they are understood and integrated into mainstream mathematics. These are just hypotheses, but the exact reason for this effect is not important for the Paik-Rivin analysis. What matters is that there are almost no full professors among Letter A signers who are more than 36 years post-Ph.D. In fact, the number of such individuals (filtered for those who have published at least 1 article), is 2. Two individuals. That’s it.

Restricting the analysis to full professors less than 36 years post-Ph.D. tells a completely different story to the one Paik and Rivin peddle. The average number of citations of full professors who signed letter A (2922.72) is higher than the average number of citations of full professors who signed letter C (2348.85). Signers of letter B have 3148.83 citations on average. The figure for this analysis is shown below:

The main conclusion of Paik and Rivin, that signers of letters A have less merit than signers of letter B, who in turn have less merit than signers of letter C can be seen to be complete rubbish. What the data reveal is simply that the signers of letter A are younger than the signers of the other two letters.

Note: I have performed my analysis in a Google Colab notebook accessible via the link. It allows for full reproducibility of the figures and numbers in this post, and facilitates easy exploration of the data. Of course there’s nothing to explore. Use of citations as a proxy for merit is a fool’s errand.

### Miscellania

There are numerous other technical problems with the preprint. The authors claim to have performed “a control” (they didn’t). Several p-values are computed and reported without any multiple testing correction. Parametric approximations for the citation data are examined, but then ignored. Moreover, appropriate zero-inflated count distributions for such data are never considered (see e.g. Yong-Gil et al. 2007).  The results presented are all univariate (e.g. histograms of one data type)- there is not a single scatterplot in the preprint! This suggests that the authors are unaware of the field of multivariate statistics. Considering all of this, I encourage the authors to enroll in an introductory statistics class.

### The Russians

In a strange final paragraph of the Conclusion and Discussion section of their preprint, Paik and Rivin speculate on why mathematicians from communist countries are not represented among the signers of letter A. They present hypotheses without any data to back up their claims.

The insistence that some mathematicians, e.g. Mikhail Gromov who signed letters B and C and is a full member at IHES and professor at NYU, are not part of the “power elite” of mathematics is just ridiculous. Furthermore, characterizing someone like Gromov, who arrived in the US from Russia to an arranged job at SUNY Stonybrook (thanks to Tony Phillips) as being a member of a group who “arrived at the US with nothing but the shirts on their backs” is bizarre.

### Diversity matters

I find the current debate in the mathematics community surrounding Prof. Thompson’s letter very frustrating. The comparison of diversity statements to McCarthy’s loyalty oaths is ridiculous. Instead of debating such nonsense, mathematicians need to think long and hard about how to change the culture in their departments, a culture that has led to appallingly few under-represented minorities and women in the field. Under-represented minorities and women routinely face discrimination and worse. This is completely unacceptable.

The preprint by Paik and Rivin is a cynical attempt to use the Thompson kerfuffle to advertise the tired trope of the second-rate mathematician being the one to advocate for greater diversity in mathematics. It’s a sad refrain that has a long history in mathematics. But perhaps it’s not surprising. The colleagues of Jerzy Neyman in his mathematics department could not even stomach a statistician, let alone a woman, let alone a person from an under-represented minority group. However I’m optimistic reading the list of signatories of letter A. Many of my mathematical heroes are among them. The future is theirs, and they are right.

One of my favorite systems biology papers is the classic “Stochastic Gene Expression in a Single Cell” by Michael Elowitz, Arnold J. Levine, Eric D. Siggia and Peter S. Swain (in this post I refer to it as the ELSS paper).

What I particularly like about the paper is that it resulted from computational biology flipped. Unlike most genomics projects, where statistics and computation are servants of the data, in ELSS a statistical idea was implemented with biology, and the result was an elegant experiment that enabled a fundamental measurement to be made.

The fact the ELSS implemented a statistical idea with biology makes the statistics a natural starting point for understanding the paper. The statistical idea is what is known as the law of total variance. Given a random (response) variable $C$ with a random covariate  $Z$, the law of total variance states that the variance of $C$ can be decomposed as:

$Var[C] = E_Z[Var[C|Z]] + Var_Z[E[C|Z]]$.

There is a biological interpretation of this law that also serves to explain it: If the random variable $C$ denotes the expression of a gene in a single cell ($C$ being a random variable means that the expression is stochastic), and $Z$ denotes the (random) state of a cell, then the law of total variance describes the “total noise” $Var[C]$ in terms of what can be considered “intrinsic” (also called “unexplained”) and “extrinsic” (also called “explained”) noise or variance.

To understand intrinsic noise, first one understands the expression $Var[C|Z]$ to be the conditional variance, which is also a random variable; its possible values are the variance of the gene expression in different cell states. If $Var[C]$ does not depend on $Z$ then the expression of the gene is said to be homoscedastic, i.e., it does not depend on cell state (if it does then it is said to be heteroscedastic). Because $Var[C|Z]$ is a random variable, the expression $E_Z[Var[C|Z]$ makes sense, it is simply the average variance (of the gene expression in single cells) across cell states (weighted by their probability of occurrence), thus the term “intrinsic noise” to describe it.

The expression $E[C|Z]$ is a random variable whose possible values are the average  of the gene expression in cells. Thus, $Var_Z[E[C|Z]]$ is the variance of the averages; intuitively it can be understood to describe the noise arising from different cell state, thus the term “extrinsic noise” to describe it (see here for a useful interactive visualization for exploring the law of total variance).

The idea of ELSS was to design an experiment to measure the extent of intrinsic and extrinsic noise in gene expression by inserting two identically regulated reporter genes (cyan fluorescent protein and yellow fluorescent protein) into E. coli and measuring their expression in different cells. What this provides are measurements from the following model:

Random cell states are represented by random variables $Z_1,\ldots,Z_n$ which are independent and identically distributed, one for each of n different cells, while random variables $C_1,\ldots,C_n$  and $Y_1,\ldots,Y_n$ correspond to the gene expression of the cyan , respectively yellow, reporters in those cells. The ELSS experiment produces a single sample from each variable $C_i$ and $Y_i$, i.e. a pair of measurements for each cell. A hierarchical model for the experiment, in which the marginal (unconditional) distributions $C_i$ and $Y_i$ are identical, allows for estimating the intrinsic and extrinsic noise from the reporter expression measurements.

The model above, on which ELSS is based, was not described in the original paper (more on that later). Instead,  in ELSS the following estimates for intrinsic, extrinsic and total noise were simply written down:

$\eta^2_{int} = \frac{\frac{1}{n} \left( \sum_{i=1}^n \frac{1}{2} (c_i-y_i)^2\right) }{\overline{c} \cdot \overline{y}},$ (intrinsic noise)

$\eta^2_{ext} = \frac{\frac{1}{n}\sum_{i=1}^n c_i \cdot y_i - \overline{c} \cdot \overline{y}}{\overline{c} \cdot \overline{y}},$ (extrinsic noise)

$\eta^2_{tot} = \frac{\frac{1}{n}\sum_{i=1}^n \frac{1}{2} (c_i^2+y_i^2)-\overline{c}\cdot\overline{y}}{\overline{c} \cdot \overline{y}}.$ (total noise)

Here  $c_1,\ldots,c_n$ and $y_1,\ldots,y_n$ are the measurements of cyan respectively yellow reporter expression in each cell, $\overline{c} = \frac{1}{n}\sum_{i=1}^n c_i$ and $\overline{y} = \frac{1}{n}\sum_{i=1}^n y_i$.

Last year, Audrey Fu, at the time a visiting scholar in Jonathan Pritchard’s lab and now assistant professor in statistical science at the University of Idaho,  studied the ELSS paper as part of a journal club. She noticed some inconsistencies with the proposed estimates in the paper, e.g. it seemed to her that some were biased, whereas others were not, and she proceeded to investigate in more detail the statistical basis for the estimates. There had been a few papers trying to provide statistical background, motivation and interpretation for the formulas in ELSS (e.g. A. Hilfinger and J. Paulsson, Separating intrinsic from extrinsic fluctuations in dynamic biological systems, 2011 ), but there had not been an analysis of bias, or for that matter other statistical properties of the estimates. Audrey started to embark on a post-publication peer review of the paper, and having seen such reviews on my blog contacted me to ask whether I would be interested to work with her. The project has been a fun hobby of ours for the past couple of months, eventually leading to a manuscript that we just posted on the arXiv:

Audrey Fu and Lior Pachter, Estimating intrinsic and extrinsic noise from single-cell gene expression measurements, arXiv 2016.

Our work provides what I think of as a “statistical companion” to the ELSS paper. First, we describe a formal hierarchical model which provides a useful starting point for thinking about estimators for intrinsic and extrinsic noise. With the model we re-examine the ELSS formulas, derive alternative estimators that either minimize bias or mean squared error, and revisit the intuition that underlies the extraction of intrinsic and extrinsic noise from data. Details are in the paper, but I briefly review some of the highlights here:

Figure 3a of the ELSS paper shows a scatterplot of data from two experiments, and provides a geometric interpretation of intrinsic and extrinsic noise that can guide intuition about them. We have redrawn their figure (albeit with a handful of points rather than with real data) in order to clarify the connections to the statistics:

The Elowitz et al. caption correctly stated that “Each point represents the mean fluorescence intensities from one cell. Spread of points perpendicular to the diagonal line on which CFP and YFP intensities are equal corresponds to intrinsic noise, whereas spread parallel to this line is increased by extrinsic noise”. While both statements are true, the one about intrinsic noise is precise whereas the one about extrinsic noise can be refined. In fact, the ELSS extrinsic noise estimate is the sample covariance (albeit biased due to a prefix of n in the denominator rather than n-1), an observation made by Hilfinger and Paulsson. The sample covariance has a  (well-known) geometric interpretation: Specifically, we explain that it is the average (signed) area of triangles formed by pairs of data points (one the blue one in the figure): green triangles in Q1 and Q3 (some not shown) represent a positive contribution to the covariance and magenta triangles in Q2 and Q4 represent a negative contribution. Since most data points lie in the 1st (Q1) and 3rd (Q3) quadrants relative to the blue point, most of the contribution involving the blue point is positive. Similarly, since most pairs of data points can be connected by a positively signed line, their positive contribution will result in a positive covariance. We also explain why naïve intuition of extrinsic noise as the variance of points along the line $c=y$ is problematic.

The estimators we derive are summarized in the table below (Table 1 from our paper):

There is a bit of algebra that is required to derive formulas in the table (see the appendices of our paper). The take home messages are that:

1. There is a subtle assumption underlying the ELSS intrinsic noise estimator that makes sense for the experiments in the ELSS paper, but not for every type of experiment in which the ELSS formulas are currently used. This has to do with the mean expression level of the two reporters, and we provide a rationale and criteria when to apply quantile normalization to normalize expression to the data.
2. The ELSS intrinsic noise estimator is unbiased, whereas the ELSS extrinsic noise estimator is (slightly) biased. This asymmetry can be easily rectified with adjustments we derive.
3. The standard unbiased estimator for variance (obtained via the Bessel correction) is frequently, and correctly, criticized for trading off mean squared error for bias. In practice, it can be more important to minimize mean squared error (MSE). For this reason we derive MSE minimizing estimators. While the MSE minimizing estimates converge quickly to the unbiased estimates (as a function of the number of cells), we believe there may be applications of the law of total variance to problems in systems biology where sample sizes are smaller, in which case our formulas may become useful.

The ELSS paper has been cited more than 3,000 times and with the recent emergence of large scale single-cell RNA-Seq the ideas in the paper are more relevant than ever. Hopefully our statistical companion to the paper will be useful to those seeking to obtain a better understanding of the methods, or those who plan to extend and apply them.

Nature Publishing Group claims on its website that it is committed to publishing “original research” that is “of the highest quality and impact”. But when exactly is research “original”?  This is a question with a complicated answer. A recent blog post by senior editor Dorothy Clyde at Nature Protocols provides insight into the difficulties Nature faces in detecting plagiarism, and identifies the issue of self plagiarism as particularly problematic. The journal tries to avoid publishing the work of authors who have previously published the same work or a minor variant thereof. I imagine this is partly in the interests of fairness, a service to the scientific community to ensure that researchers don’t have to sift through numerous variants of a single research project in the literature, and a personal interest of the journal in its aim to publish only the highest level of scholarship.

On the other hand, there is also a rationale for individual researchers to revisit their own previously published work. Sometimes results can be recast in a way that makes them accessible to different communities, and rethinking of ideas frequently leads to a better understanding, and therefore a better exposition. The mathematician Gian-Carlo Rota made the case for enlightened self-plagiarism in one of his ten lessons he wished he had been taught when he was younger:

3. Publish the same result several times

After getting my degree, I worked for a few years in functional analysis. I bought a copy of Frederick Riesz’ Collected Papers as soon as the big thick heavy oversize volume was published. However, as I began to leaf through, I could not help but notice that the pages were extra thick, almost like cardboard. Strangely, each of Riesz’ publications had been reset in exceptionally large type. I was fond of Riesz’ papers, which were invariably beautifully written and gave the reader a feeling of definitiveness.

As I looked through his Collected Papers however, another picture emerged. The editors had gone out of their way to publish every little scrap Riesz had ever published. It was clear that Riesz’ publications were few. What is more surprising is that the papers had been published several times. Riesz would publish the first rough version of an idea in some obscure Hungarian journal. A few years later, he would send a series of notes to the French Academy’s Comptes Rendus in which the same material was further elaborated. A few more years would pass, and he would publish the definitive paper, either in French or in English. Adam Koranyi, who took courses with Frederick Riesz, told me that Riesz would lecture on the same subject year after year, while meditating on the definitive version to be written. No wonder the final version was perfect.

Riesz’ example is worth following. The mathematical community is split into small groups, each one with its own customs, notation and terminology. It may soon be indispensable to present the same result in several versions, each one accessible to a specific group; the price one might have to pay otherwise is to have our work rediscovered by someone who uses a different language and notation, and who will rightly claim it as his own.

The question is: where does one draw the line?

I was recently forced to confront this question when reading an interesting paper about a statistical approach to utilizing controls in large-scale genomics experiments:

J.A. Gagnon-Bartsch and T.P. Speed, Using control genes to corrected for unwanted variation in microarray dataBiostatistics, 2012.

A cornerstone in the logic and methodology of biology is the notion of a “control”. For example, when testing the result of a drug on patients, a subset of individuals will be given a placebo. This is done to literally control for effects that might be measured in patients taking the drug, but that are not inherent to the drug itself. By examining patients on the placebo, it is possible to essentially cancel out uninteresting effects that are not specific to the drug. In modern genomics experiments that involve thousands, or even hundreds of thousands of measurements, there is a biological question of how to design suitable controls, and a statistical question of how to exploit large numbers of controls to “normalize” (i.e. remove unwanted variation) from the high-dimensional measurements.

Formally, one framework for thinking about this is a linear model for gene expression. Using the notation of Gagnon-Bartsch & Speed, we have an expression matrix $Y$ of size $m \times n$ (samples and genes) modeled as

$Y_{m \times n} = X_{m \times p}\beta_{p \times n} + Z_{m \times q}\gamma_{q \times n} + W_{m \times k} \alpha_{k \times n} + \epsilon_{m \times n}$.

Here is a matrix describing various conditions (also called factors) and associated to it is the parameter matrix $\beta$ that records the contribution, or influence, of each factor on each gene. $\beta$ is the primary parameter of interest to be estimated from the data Y. The $\epsilon$ are random noise, and finally  and are observed and unobserved covariates respectively. For example Z might encode factors for covariates such as gender, whereas W would encode factors that are hidden, or unobserved. A crucial point is that the number of hidden factors in W, namely k, is not known. The matrices $\gamma$ and $\alpha$ record the contributions of the Z and W factors on gene expression, and must also be estimated. It should be noted that X may be the logarithm of expression levels from a microarray experiment, or the analogous quantity from an RNA-Seq experiment (e.g. log of abundance in FPKM units).

Linear models have been applied to gene expression analysis for a very long time; I can think of papers going back 15 years. But They became central to all analysis about a decade ago, specifically popularized with the Limma package for microarray data analysis. In an important paper in 2007, Leek and Storey focused explicitly on the identification of hidden factors and estimation of their influence, using a method called SVA (Surrogate Variable Analysis). Mathematically, they described a procedure for estimating k and W and the parameters $\alpha$. I will not delve into the details of SVA in this post, except to say that the overall idea is to first perform linear regression (assuming no hidden factors) to identify the parameters $\beta$ and to then perform singular value decomposition (SVD) on the residuals to identify hidden factors (details omitted here). The resulting identified hidden factors (and associated influence parameters) are then used in a more general model for gene expression in subsequent analysis.

Gagnon-Bartsch and Speed refine this idea by suggesting that it is better to infer W from controls. For example, house-keeping genes that are unlikely to correlate with the conditions being tested, can be used to first estimate W, and then subsequently all the parameters of the model can be estimated by linear regression. They term this two-step process RUV-2 (acronym for Remote Unwanted Variation) where the “2” designates that the procedure is a two-step procedure. As with SVA, the key to inferring W from the controls is to perform singular value decomposition (or more generally factor analysis). This is actually clear from the probabilistic interpretation of PCA and the observation that what it means to be a in the set of “control genes” C  in a setting where there are no observed factors Z, is that

$Y_C = W \alpha_C + \epsilon_C$.

That is, for such control genes the corresponding $\beta$ parameters are zero. This is a simple but powerful observation, because the explicit designation of control genes in the procedure makes it clear how to estimate W, and therefore the procedure becomes conceptually compelling and practically simple to implement. Thus, even though the model being used is the same as that of Leek & Storey, there is a novel idea in the paper that makes the procedure “cleaner”. Indeed, Gagnon-Bartsch & Speed provide experimental results in their paper showing that RUV-2 outperforms SVA. Even more convincing, is the use of RUV-2 by others. For example, in a paper on “The functional consequences of variation in transcription factor binding” by Cusanovitch et al., PLoS Genetics 2014, RUV-2 is shown to work well, and the authors explain how it helps them to take advantage of the controls in experimental design they created.

There is a tech report and also a preprint that follow up on the Gagnon-Bartsch & Speed paper; the tech report extends RUV-2 to a four step method RUV-4 (it also provides a very clear exposition of the statistics), and separately the preprint describes an extension to RUV-2 for the case where the factor of interest is also unknown. Both of these papers build on the original paper in significant ways and are important work, that to return to the original question in the post, certainly are on the right side of “the line”

The wrong side of the line?

The development of RUV-2 and SVA occurred in the context of microarrays, and it is natural to ask whether the details are really different for RNA-Seq (spoiler: they aren’t).  In a book chapter published earlier this year:

D. Risso, J. Ngai, T.P. Speed, S. Dudoit, The role of spike-in standards in the normalization of RNA-Seq, in Statistical Analysis of Next Generation Sequencing Data (2014), 169-190.

the authors replace “log expression levels” from microarrays with “log counts” from RNA-Seq and the linear regression performed with Limma for RUV-2 with a Poisson regression (this involves one different R command). They call the new method RUV, which is the same as the previously published RUV, a naming convention that makes sense since the paper has no new method. In fact, the mathematical formulas describing the method are identical (and even in almost identical notation!) with the exception that the book chapter ignores altogether, and replaces $\epsilon$ with O.

To be fair, there is one added highlight in the book chapter, namely the observation that spike-ins can be used in lieu of housekeeping (or other control) genes. The method is unchanged, of course. It is just that the spike-ins are used to estimate W. Although spike-ins were not mentioned in the original Gagnon-Bartsch paper, there is no reason not to use them with arrays as well; they are standard with Affymetrix arrays.

My one critique of the chapter is that it doesn’t make sense to me that counts are used in the procedure. I think it would be better to use abundance estimates, and in fact I believe that Jeff Leek has already investigated the possibility in a preprint that appears to be an update to his original SVA work. That issue aside, the book chapter does provide concrete evidence using a Zebrafish experiment that RUV-2 is relevant and works for RNA-Seq data.

The story should end here (and this blog post would not have been written if it had) but two weeks ago, among five RNA-Seq papers published in Nature Biotechnology (I have yet to read the others), I found the following publication:

D. Risso, J. Ngai, T.P. Speed, S. Dudoit, Normalization of RNA-Seq data using factor analysis of control genes or samples, Nature Biotechnology 32 (2014), 896-902.

This paper has the same authors as the book chapter (with the exception that Sandrine Dudoit is now a co-corresponding author with Davide Risso, who was the sole corresponding author on the first publication), and, it turns out, it is basically the same paper… in fact in many parts it is the identical paper. It looks like the Nature Biotechnology paper is an edited and polished version of the book chapter, with a handful of additional figures (based on the same data) and better graphics. I thought that Nature journals publish original and reproducible research papers. I guess I didn’t realize that for some people “reproducible” means “reproduce your own previous research and republish it”.

At this point, before drawing attention to some comparisons between the papers, I’d like to point out that the book chapter was refereed. This is clear from the fact that it is described as such in both corresponding authors’ CVs.

How similar are the two papers?

Final paragraph of paper in the book:

Internal and external controls are essential for the analysis of high-throughput data and spike-in sequences have the potential to help researchers better adjust for unwanted technical effects. With the advent of single-cell sequencing [35], the role of spike-in standards should become even more important, both to account for technical variability [6] and to allow the move from relative to absolute RNA expression quantification. It is therefore essential to ensure that spike-in standards behave as expected and to develop a set of controls that are stable enough across replicate libraries and robust to both differences in library composition and library preparation protocols.

Final paragraph of paper in Nature Biotechnology:

Internal and external controls are essential for the analysis of high-throughput data and spike-in sequences have the potential to help researchers better adjust for unwanted technical factors. With the advent of single-cell sequencing27, the role of spike-in standards should become even more important, both to account for technical variability28 and to allow the move from relative to absolute RNA expression quantification. It is therefore essential to ensure that spike- in standards behave as expected and to develop a set of controls that are stable enough across replicate libraries and robust to both differences in library composition and library preparation protocols.

Abstract of paper in the book:

Normalization of RNA-seq data is essential to ensure accurate inference of expression levels, by adjusting for sequencing depth and other more complex nuisance effects, both within and between samples. Recently, the External RNA Control Consortium (ERCC) developed a set of 92 synthetic spike-in standards that are commercially available and relatively easy to add to a typical library preparation. In this chapter, we compare the performance of several state-of-the-art normalization methods, including adaptations that directly use spike-in sequences as controls. We show that although the ERCC spike-ins could in principle be valuable for assessing accuracy in RNA-seq experiments, their read counts are not stable enough to be used for normalization purposes. We propose a novel approach to normalization that can successfully make use of control sequences to remove unwanted effects and lead to accurate estimation of expression fold-changes and tests of differential expression.

Abstract of paper in Nature Biotechnology:

Normalization of RNA-sequencing (RNA-seq) data has proven essential to ensure accurate inference of expression levels. Here, we show that usual normalization approaches mostly account for sequencing depth and fail to correct for library preparation and other more complex unwanted technical effects. We evaluate the performance of the External RNA Control Consortium (ERCC) spike-in controls and investigate the possibility of using them directly for normalization. We show that the spike-ins are not reliable enough to be used in standard global-scaling or regression-based normalization procedures. We propose a normalization strategy, called remove unwanted variation (RUV), that adjusts for nuisance technical effects by performing factor analysis on suitable sets of control genes (e.g., ERCC spike-ins) or samples (e.g., replicate libraries). Our approach leads to more accurate estimates of expression fold-changes and tests of differential expression compared to state-of-the-art normalization methods. In particular, RUV promises to be valuable for large collaborative projects involving multiple laboratories, technicians, and/or sequencing platforms.

Abstract of Gagnon-Bartsch & Speed paper that already took credit for a “new” method called RUV:

Microarray expression studies suffer from the problem of batch effects and other unwanted variation. Many methods have been proposed to adjust microarray data to mitigate the problems of unwanted variation. Several of these methods rely on factor analysis to infer the unwanted variation from the data. A central problem with this approach is the difficulty in discerning the unwanted variation from the biological variation that is of interest to the researcher. We present a new method, intended for use in differential expression studies, that attempts to overcome this problem by restricting the factor analysis to negative control genes. Negative control genes are genes known a priori not to be differentially expressed with respect to the biological factor of interest. Variation in the expression levels of these genes can therefore be assumed to be unwanted variation. We name this method “Remove Unwanted Variation, 2-step” (RUV-2). We discuss various techniques for assessing the performance of an adjustment method and compare the performance of RUV-2 with that of other commonly used adjustment methods such as Combat and Surrogate Variable Analysis (SVA). We present several example studies, each concerning genes differentially expressed with respect to gender in the brain and find that RUV-2 performs as well or better than other methods. Finally, we discuss the possibility of adapting RUV-2 for use in studies not concerned with differential expression and conclude that there may be promise but substantial challenges remain.

Many figures are also the same (except one that appears to have been fixed in the Nature Biotechnology paper– I leave the discovery of the figure as an exercise to the reader). Here is Figure 9.2 in the book:

The two panels appears as (b) and (c) in Figure 4 in the Nature Biotechnology paper (albeit transformed via a 90 degree rotation and reflection from the dihedral group):

Basically the whole of the book chapter and the Nature Biotechnology paper are essentially the same, down to the math notation, which even two papers removed is just a rehashing of the RUV method of Gagnon-Bartsch & Speed. A complete diff of the papers is beyond the scope of this blog post and technically not trivial to perform, but examination by eye reveals one to be a draft of the other.

Although it is acceptable in the academic community to draw on material from published research articles for expository book chapters (with permission), and conversely to publish preprints, including conference proceedings, in journals, this case is different. (a) the book chapter was refereed, exactly like a journal publication (b) the material in the chapter is not expository; it is research, (c) it was published before the Nature Biotechnology article, and presumably prepared long before,  (d) the book chapter cites the Nature Biotechnology article but not vice versa and (e) the book chapter is not a particularly innovative piece of work to begin with. The method it describes and claims to be “novel”, namely RUV, was already published by Gagnon-Bartsch & Speed.

Below is a musical rendition of what has happened here:

Hui Jiang and Julia Salzman have posted a new paper on the arXiv proposing a novel approach to correcting for non-uniform coverage of transcripts in RNA-Seq: “A penalized likelihood approach for robust estimation of isoform expression” (October 1, 2013).

Their paper addresses the issue of non-uniformity of read coverage across transcripts in RNA-Seq, an issue that is frustrating for the challenges it presents in analysis. The non-uniformity of read coverage in RNA-Seq was first noticed in A. Mortazavi et al., Mapping and quantifying mammalian transcriptomes, Nature Methods 5 (2008), 621–628. Figure 1 in the paper (see below) shows an example of non-uniform coverage, and the paper discusses ideas for library preparation that can reduce bias and improve uniformity.

Figure 1b from Mortazavi et al. (2008) showing (non-uniform) coverage of Myf6.

Supplementary Figure 1a from Mortazavi et al. (2008) describing uniformity of coverage achievable with different library preparations. “Deviation from uniformity” was assessed using the Kolmogorov-Smirnov test.

The experimental approach of modifying library preparation to reduce non-uniformity has been complemented by statistical approaches to the problem. Specifically, various models have been proposed for “correcting” for experimental artefacts that induce non-uniform coverage. To understand Jiang and Salzman’s latest paper, it is helpful to review previous approaches that have been proposed. Read the rest of this entry »

Don’t believe the anti-hype. They are saying that RNA-Seq promises the discovery of new expression events, but it doesn’t deliver:

Is this true? There have been a few papers comparing microarrays to RNA-Seq technology (including one of my own) that I’ll discuss below, but first a break-down of the Affymetrix “evidence”. The first is this figure (the poor quality of the images is exactly what Affymetrix provides, and not due to a reduction in quality on this site; they are slightly enlarged when clicked on):

The content of this figure is an illustration of the gene LMNB1 (Lamin protein of type B), used to argue that microarrays can provide transcript level resolution whereas RNA-Seq can’t!! Really? Affymetrix is saying that RNA-Seq users would likely use the RefSeq annotation which only has three isoforms. But this is a ridiculous claim. It is well known that RefSeq is a conservative annotation and certainly RNA-Seq users have the same access to the multiple databases Affymetrix used to build their annotation (presumably, e.g. Ensembl). It therefore seems that what Affymetrix is saying with this figure is that RNA-Seq users are dumb.

The next figure is showing the variability in abundance estimates as a function of expression level for RNA-SEq and the HTA 2.0, with the intended message being that microarrays are less noisy:

But there is a subtle trick going on here. And its in the units. The x-axis is showing RPM, which is an abbreviation for Reads Per Million. This is not a commonly used unit, and there is a reason. First, its helpful to review what is used. In his landmark paper on RNA-Seq, Ali Mortazavi introduced the units RPKM (note the extra K) that stands for reads per kilobase of transcript per million mapped. Why the extra kilobase term? In my review on RNA-Seq quantification I explain that RPKM is proportional to a maximum likelihood estimate of transcript abundance (obtained from a simple RNA-Seq model). The complete derivation is on page 6 ending in Equation 13; I include a summary here:

The maximum likelihood (ML) abundances $\hat{\rho}_t$ are  given by

$\hat{\rho}_t = \frac{\frac{\hat{\alpha}_t}{l_t}}{\sum_{r \in T} \frac{\hat{\alpha}_r}{l_r}} \propto \frac{X_t}{\left( \frac{l_t}{10^3}\right) \left( \frac{N}{10^6}\right) }$

In these equations $l_t$ is the length of transcript (if reads are long it is necessary to modify the length due to edge effects, hence the tilde in the paper), the $\hat{\alpha}_t$ are the maximum likelihood estimates for the probabilities of selecting reads from transcripts (unnormalized by their length) and finally $X_t$ is the number of reads mapping to transcript t while N is the total number of mapped reads. The point is that RPKM (the rightmost formula for abundance) is really a unit for describing the maximum likelihood relative abundances ($\hat{\rho}$) scaled by some factors.

RPKM as a unit has two problems. The first is that in current RNA-Seq experiments reads are paired so that the actual units being counted (in $X_t$) are fragments. For this reason we modified RPKM to FPKM in the Cufflinks paper (the “F” replaced “R” for fragment instead of read). A more serious problem, noted by Bo Li and Colin Dewey in their paper on RSEM, is that while FPKM is proportional to ML estimates of abundance, the proportionality constants may vary between experiments. For this reason they proposed TPM (transcripts per million) which is also proportional to the ML abundance estimates but with a proportionality constant (a million) that is the same between experiments. All of these units are used simply to avoid writing down the $\hat{\rho}_t$ which are in many cases tiny numbers since they must all sum to 1.

Returning to the Affymetrix figure, we see the strange RPM units. In essence, this is the rightmost term in the equation above, with the $l_t$ length terms removed from the denominators. Therefore RPM is proportional to the $\hat{\alpha}_t$. If a transcript is short, even if it is equally abundant to a longer transcript ,it will produce less RNA-Seq reads and therefore its $\hat{\alpha}_t$ will be (possibly considerably) smaller. The effect of displaying RPM for RNA-Seq vs. expression level for the HTA 2.0 arrays is therefore to mix apples and oranges. Since what is being displayed is a coefficient of variation, there is a bias caused by the relationship between length and expression (e.g. many highly expressed housekeeping genes are short).

To be fair to Affymetrix the conversion between the $\hat{\alpha}$ and the $\hat{\rho}$ can be confusing (its explained in Lemma 14 in the Supplement of the Cufflinks paper). So maybe the discordant x-axes were unintentional…but then there is the third figure:

Here its a bit hard to tell what is going on because not all the information needed to decipher the figure is provided. For example, its not clear how the “expression of exons” was computed or measured for the RNA-Seq experiment. I suspect that as with the previous figure, read numbers were not normalized by length of exons, and moreover spliced reads (and other possibly informative reads from transcripts) were ignored. In other words, I don’t really believe the result.

Having said this, it is true that expression arrays can have an advantage in measuring exon expression, because an array measurement is absolute (as opposed to the relative quantification that is all that is possible with RNA-Seq). Array signal is based on hybridization, and it is probably a reasonable assumption that some minimum amount of RNA triggers a signal, and that this amount is independent of the remainder of the RNA in an experiment. So arrays can (and in many cases probably do) have advantages over RNA-Seq.

There are a few papers that have looked into this, for example the paper “A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae ” by Nookaew et al., Nucleic Acids Research 40 (2012) who find high reproducibility in RNA-Seq and consistency between arrays and RNA-Seq.  Xu et al., in “Human transcriptome array for high-throughput clinical studies“, PNAS 108 (2011), 3707–3712 are more critical, agreeing with Affymetrix that arrays are more sensitive at the exon level. For disease studies, they recommend using RNA-Seq to identify transcripts relevant to the disease, and then screening for those transcripts on patients using arrays.

For the Cuffdiff2 paper describing our new statistical procedures for differential analysis of transcripts and genes, the Rinn lab performed deep RNA-Seq and array expression measurement on the same samples from a HOXA1 knowdown (the experiments included multiple replicates of both the RNA-Seq and the arrays). To my knowledge, it is the deepest and most comprehensive data currently available for comparing arrays and RNA-Seq. Admittedly, the arrays used were not Affymetrix but Agilent SurePrint G3, and the RNA-Seq coverage was deep, however we had two main findings very different from the Affymetrix claims. First, we found overall strong correlation between array expression values and RNA-Seq abundance estimates. The correlation remained strong for large regimes of expression even with very few reads (tested by sequencing fewer reads from a MiSeq). Second, we found that arrays were missing differentially expressed transcripts, especially at low abundance levels. In other words, we found RNA-Seq to have higher resolution. The following figure from our paper made the case (note the overall Spearman Correlation was 0.86):

There are definitely continued applications for arrays. Both in high-throughput screening applications (as suggested in the Xu et al. paper), and also in the development of novel assays. For example Mercer et al. “Targeted rNA sequencing reveals the deep complexity of the human transcriptome“, Nature Biotechnology 30 (2012) 99–104  show how to couple capture (with arrays) with RNA-Seq to provide ultra deep sequencing in subsets of the transcriptome. So its not yet the time to write off arrays. But RNA-Seq has many applications of its own. For example the ability to better detect allele-specific expression, the opportunity to identify RNA-DNA differences (and thereby study RNA editing), and the ability to study expression in non-model organisms where genomes sequences are incomplete and annotations poor. For all these reasons I’m betting on RNA-Seq.

There is a new arXiv paper out with the title Sailfish: Alignment-free Isoform Quantification from RNA-Seq Reads using Lightweight Algorithms by Rob Patro, Stephen M. Mount and Carl Kingsford. It describes a new approach to RNA-Seq quantification that is based on directly estimating abundances from k-mers rather than read alignments. This is an interesting approach, because it avoids the time-intensive read alignment step that is rapidly becoming a bottleneck in RNA-Seq analysis. The idea of avoiding read alignments to reference genomes/transcriptome in *Seq experiments is being explored in other contexts as well, such as for mutant mapping (by the Korbinian Schneeberger group) and genotyping (by the Gil McVean group). I am particularly interested in these ideas as we have been exploring such methods for association mapping.

Patro, Mount and Kingsford work with a  simplified model for RNA-Seq to first obtain approximate transcript abundance estimates. In the notation of my survey paper on RNA-Seq models (see equation 14, except with k replaced by to avoid confusion), they are maximizing the likelihood

$L(\rho) = \prod_{i=1}^N \left( \sum_{j=1}^K y_{i,j} \frac{\alpha_j}{l_j} \right)$

where the product is over k-mers instead of reads, so that $N=4^k$ (where is the k-mer size) rather than the total number of reads. The EM updates are therefore the same as those of other RNA-Seq quantification algorithms (see Figure 4 in my survey). They also implement an acceleration of the EM called SQUAREM (by Varadhan and Roland) in order to improve convergence.

The results of the paper are impressive. They compare speed and accuracy with RSEM, Cufflinks and eXpress and obtain comparable accuracy while avoiding the time intensive alignment of reads to transcripts (or the genome in the case of Cufflinks). An interesting point made is that bias can be corrected after fragment assignment (or in the case of Sailfish after k-mer assignment) without much loss in accuracy. We used a similar approximation in eXpress, namely stopping estimation of bias parameters after 5 million reads have been processed, but it seems that postponing the entire correction until fragment assignment is complete is acceptable.

Sailfish also appears to have been well engineered. The code (in C++) is well documented and available in both source and executable (for Linux and Mac OS X). I haven’t had a chance to test it yet but hope to do so soon. My only concern with the manuscript is that the simulation results for eXpress appear to be unreasonable. The experiments conducted on “real data” (for which there appear to be qPCR) suggest that the accuracy of Sailfish is similar to that of eXpress, RSEM and Cufflinks (with RSEM underperforming slightly presumably to the lack of bias correction), but the simulations, performed with the Flux Simulator, are inconsistent. I suspect there is a (trivial) problem with the simulated data and presumably the authors will address this before journal publication. Update: The authors responded to my blog post within a day and we quickly realized the problem was likely to have been that Flux Simulator did not output reads in random order. Random ordering of reads is important for eXpress to function correctly, and when we wrote our paper we verified that Illumina sequencers do indeed output reads in random order. Rob Patro shuffled the Flux Simulator reads and verified that the performance of eXpress on simulated data is consistent with the results on real data (see attached figure). We’re grateful for his quick work in sorting out the issue and thank the authors of Sailfish for posting their paper on the arXiv (as others are starting to do), thereby enabling this exchange to occur prior to publication.

### Blog Stats

• 2,853,568 views