You are currently browsing the category archive for the ‘quantitative biology’ category.

Three years ago, when my coauthors (Páll Melsted, Nicolas Bray, Harold Pimentel) and I published the “kallisto paper” on the arXiv (later Bray et al. “Near-optimal probabilistic RNA-seq quantification“, 2016), we claimed that kallisto removed a major computational bottleneck from RNA-seq analysis by virtue of being two orders of magnitude faster than other state-of-the-art quantification methods of the time, without compromising accuracy. With kallisto, computations that previously took days, could be performed as accurately in minutes. Even though the speedup was significant, its relevance was immediately questioned. Critics noted that experiments, library preparations and sequencing take at least months, if not years, and questioned the relevance of a speedup that would save only days.

One rebuttal we made to this legitimate point was that kallisto would be useful not only for rapid analysis of individual datasets, but that it would enable analyses at previously unimaginable scales. To make our point concrete, in a follow-up paper (Pimentel et al., “The Lair: a resource for exploratory analysis of published RNA-seq data”, 2016) we described a semi-automated framework for analysis of archived RNA-seq data that was possible thanks to the speed and accuracy of kallisto, and we articulated a vision for “holistic analysis of [short read archive] SRA data” that would facilitate “comparison of results across studies [by] use of the same tools to process diverse datasets.” A major challenge in realizing this vision was that although kallisto was fast enough to allow for low cost processing of all the RNA-seq in the short read archive (e.g. shortly after we published kallisto, Vivian et al., 2017 showed that kallisto reduced the cost of processing per sample from $1.30 to $0.19, and Tatlow and Piccolo, 2016 achieved $0.09 per sample with it), an analysis of experiments consists of much more than just quantification. In Pimentel et al. 2016 we struggled with how to wrangle metadata of experiments (subsequently an entire paper was written by Bernstein et al. 2017 just on this problem), how to enable users to dynamically test distinct hypotheses for studies, and how to link results to existing databases and resources. As a result, Pimentel et al. 2016 was more of a proof-of-principle than a complete resource; ultimately we were able to set up analysis of only a few dozen datasets.

Now, the group of Avi Ma’ayan at the Icahn School of Medicine at Mount Sinai has surmounted the many challenges that must be overcome to enable automated analysis of RNA-seq projects on the short read archive, and has published a tool called BioJupies (Torre et al. 2018). To assess BioJupies I began by conducting a positive control in the form of analysis of data from the “Cuffdiff2” paper, Trapnell et al. 2013. The data is archived as GSE37704. This is the dataset I used to initially test the methods of Pimentel et al. 2016 and is also the dataset underlying the Getting Started Walkthrough for sleuth. I thought, given my familiarity with it, that it would be a good test case for BioJupies.

Briefly, in Trapnell et al. 2013, Trapnell and Hendrickson performed a differential analysis of lung fibroblasts in response to an siRNA knockdown of HOXA1 which is a developmental transcription factor. Analyzing the dataset with BioJupies is as simple as typing the Gene Expression Omnibus (GEO) accession on the BioJupies searchbox. I clicked “analyze”, clicked on “+” a few times to add all the possible plots that can be generated, and this opened a window asking for a description of the samples:


I selected “Perturbation” for the HOXA1 knockdown samples and “Control” for the samples that were treated with scrambled siRNA that did not target a specific gene. Finally, I  clicked on “generate notebook”…


BioJupies displayed a notebook (Trapnell et al. 2013 | BioJupies) with a complete analysis of the data. Much of the Trapnell et al. 2013 analysis was immediately evident in the notebook. For example, the following figure is Figure 5a in Trapnell et al. 2013. It is a gene set enrichment analysis (GSEA) of the knockdown:


BioJupies presents the same analysis:


It’s easy to match them up. Of course BioJupies presents a lot of other information and analysis, ranging from a useful PCA plot to an interesting L1000 connectivity map analysis (expression signatures from a large database of over 20,000 perturbations applied to various cell lines that match the signatures in the dataset).


One of the powerful applications of BioJupies is the presentation of ARCHS⁴ co-expression data. ARCHS⁴ is the kallisto computed database of expression for the whole and is the primary database that enables BioJupies. One of its features is a list of co-expressed genes (as ascertained via correlation across the whole short read archive). These are displayed in BioJupies making it possible to place the results of an experiment in the context of “global” transcriptome associations.

While the Trapnell et al. 2013 reanalysis was fun, the real power of BioJupies is clear when analyzing a dataset that has not yet been published. I examined the GEO database and found a series GSE60538 that appears to be a partial dataset from what looks like a paper in the works. The data is from an experiment designed to investigate the role of Sox5 and Sox6 in the mouse heart via two single knockout experiments, and a double knockout. The entry originates in 2014 (consistent with the single-end 50bp reads it contains), but was updated recently. There are a total of 8 samples: 4 controls and 4 from the double knockout (the single knockouts are not available yet). I could not find an associated paper, nor was one linked to on GEO, but the abstract of the paper has already been uploaded to the site. Just as I did with the Trapnell et al. 2013 dataset, I entered the accession in the BioJupies website and… four minutes later:


The abstract of the GSE60538 entry states that “We performed RNA deep sequencing in ventricles from DKO and control mice to identify potential Sox5/6 target genes and found altered expression of genes encoding regulators of calcium handling and cation transporters” and indeed, BioJupies verifies this result (see Beetz et al. GSE60538| BioJupies):


Of course, there is a lot more analysis than just this. The BioJupies page includes, in addition to basic QC and datasets statistics, the PCA analysis, a “clustergrammer” showing which genes drive similarity between samples, differentially expressed genes (with associated MA and volcano plots), gene ontology enrichment analysis, pathway enrichment analysis, transcription factor enrichment analysis, kinase enrichment analysis, microRNA enrichment analysis, and L1000 analysis. In a sense, one could say that with BioJupies, users can literally produce the analysis for a paper in four minutes via a website.

The Ma’ayan lab has been working towards BioJupies for some time. The service is essentially a combination of a number of tools, workflows and resources published previously by the lab, including:

With BioJupies, these tools become more than the sum of their parts. Yet while BioJupies is impressive, it is not complete. There is no isoform analysis, which is unfortunate; for example one of the key points of Trapnell et al. 2013 was how informative transcript-level analysis of RNA-seq data can be. However I see no reason why a future release of BioJupies can’t include detailed isoform analyses. Isoform quantifications are provided by kallisto and are already downloadable via ARCHS⁴. It would also be great if BioJupies were extended to organisms other than human and mouse, although some of the databases that are currently relied on are less complete in other model organisms. Still, it should even be possible to create a BioJupies for non-models. I expect the authors have thought of all of these ideas. I do have some other issues with BioJupies: e.g. the notebook should cite all the underlying programs and databases used to generate the results, and while it’s neat that there is an automatically generated methods section, it is far from complete and should include the actual calls made to the programs used so as to facilitate complete reproducibility. Then, there is my pet peeve: “library size” is not the number of reads in a sample. The number of reads sequenced is “sequencing depth”.  All of these issues can be easily fixed.

In summary, BioJupies represents an impressive breakthrough in RNA-seq analysis. It leverages a comprehensive analysis of all (human and mouse) publicly available RNA-seq data to enable rapid and detailed analyses that transcend what has been previously possible. Discoveries await.

One of my favorite systems biology papers is the classic “Stochastic Gene Expression in a Single Cell” by Michael Elowitz, Arnold J. Levine, Eric D. Siggia and Peter S. Swain (in this post I refer to it as the ELSS paper).

What I particularly like about the paper is that it resulted from computational biology flipped. Unlike most genomics projects, where statistics and computation are servants of the data, in ELSS a statistical idea was implemented with biology, and the result was an elegant experiment that enabled a fundamental measurement to be made.

The fact the ELSS implemented a statistical idea with biology makes the statistics a natural starting point for understanding the paper. The statistical idea is what is known as the law of total variance. Given a random (response) variable C with a random covariate  Z, the law of total variance states that the variance of C can be decomposed as:

Var[C] = E_Z[Var[C|Z]] + Var_Z[E[C|Z]].

There is a biological interpretation of this law that also serves to explain it: If the random variable C denotes the expression of a gene in a single cell (C being a random variable means that the expression is stochastic), and Z denotes the (random) state of a cell, then the law of total variance describes the “total noise” Var[C] in terms of what can be considered “intrinsic” (also called “unexplained”) and “extrinsic” (also called “explained”) noise or variance.

To understand intrinsic noise, first one understands the expression Var[C|Z] to be the conditional variance, which is also a random variable; its possible values are the variance of the gene expression in different cell states. If Var[C] does not depend on Z then the expression of the gene is said to be homoscedastic, i.e., it does not depend on cell state (if it does then it is said to be heteroscedastic). Because Var[C|Z] is a random variable, the expression E_Z[Var[C|Z] makes sense, it is simply the average variance (of the gene expression in single cells) across cell states (weighted by their probability of occurrence), thus the term “intrinsic noise” to describe it.

The expression E[C|Z] is a random variable whose possible values are the average  of the gene expression in cells. Thus, Var_Z[E[C|Z]] is the variance of the averages; intuitively it can be understood to describe the noise arising from different cell state, thus the term “extrinsic noise” to describe it (see here for a useful interactive visualization for exploring the law of total variance).

The idea of ELSS was to design an experiment to measure the extent of intrinsic and extrinsic noise in gene expression by inserting two identically regulated reporter genes (cyan fluorescent protein and yellow fluorescent protein) into E. coli and measuring their expression in different cells. What this provides are measurements from the following model:

Random cell states are represented by random variables Z_1,\ldots,Z_n which are independent and identically distributed, one for each of n different cells, while random variables C_1,\ldots,C_n  and Y_1,\ldots,Y_n correspond to the gene expression of the cyan , respectively yellow, reporters in those cells. The ELSS experiment produces a single sample from each variable C_i and Y_i, i.e. a pair of measurements for each cell. A hierarchical model for the experiment, in which the marginal (unconditional) distributions C_i and Y_i are identical, allows for estimating the intrinsic and extrinsic noise from the reporter expression measurements.

The model above, on which ELSS is based, was not described in the original paper (more on that later). Instead,  in ELSS the following estimates for intrinsic, extrinsic and total noise were simply written down:

\eta^2_{int} = \frac{\frac{1}{n} \left( \sum_{i=1}^n \frac{1}{2} (c_i-y_i)^2\right) }{\overline{c} \cdot \overline{y}}, (intrinsic noise)

\eta^2_{ext} = \frac{\frac{1}{n}\sum_{i=1}^n c_i \cdot y_i - \overline{c} \cdot \overline{y}}{\overline{c} \cdot \overline{y}}, (extrinsic noise)

\eta^2_{tot} = \frac{\frac{1}{n}\sum_{i=1}^n \frac{1}{2} (c_i^2+y_i^2)-\overline{c}\cdot\overline{y}}{\overline{c} \cdot \overline{y}}. (total noise)

Here  c_1,\ldots,c_n and y_1,\ldots,y_n are the measurements of cyan respectively yellow reporter expression in each cell, \overline{c} = \frac{1}{n}\sum_{i=1}^n c_i and \overline{y} = \frac{1}{n}\sum_{i=1}^n y_i.

Last year, Audrey Fu, at the time a visiting scholar in Jonathan Pritchard’s lab and now assistant professor in statistical science at the University of Idaho,  studied the ELSS paper as part of a journal club. She noticed some inconsistencies with the proposed estimates in the paper, e.g. it seemed to her that some were biased, whereas others were not, and she proceeded to investigate in more detail the statistical basis for the estimates. There had been a few papers trying to provide statistical background, motivation and interpretation for the formulas in ELSS (e.g. A. Hilfinger and J. Paulsson, Separating intrinsic from extrinsic fluctuations in dynamic biological systems, 2011 ), but there had not been an analysis of bias, or for that matter other statistical properties of the estimates. Audrey started to embark on a post-publication peer review of the paper, and having seen such reviews on my blog contacted me to ask whether I would be interested to work with her. The project has been a fun hobby of ours for the past couple of months, eventually leading to a manuscript that we just posted on the arXiv:

Audrey Fu and Lior Pachter, Estimating intrinsic and extrinsic noise from single-cell gene expression measurements, arXiv 2016.

Our work provides what I think of as a “statistical companion” to the ELSS paper. First, we describe a formal hierarchical model which provides a useful starting point for thinking about estimators for intrinsic and extrinsic noise. With the model we re-examine the ELSS formulas, derive alternative estimators that either minimize bias or mean squared error, and revisit the intuition that underlies the extraction of intrinsic and extrinsic noise from data. Details are in the paper, but I briefly review some of the highlights here:

Figure 3a of the ELSS paper shows a scatterplot of data from two experiments, and provides a geometric interpretation of intrinsic and extrinsic noise that can guide intuition about them. We have redrawn their figure (albeit with a handful of points rather than with real data) in order to clarify the connections to the statistics:


The Elowitz et al. caption correctly stated that “Each point represents the mean fluorescence intensities from one cell. Spread of points perpendicular to the diagonal line on which CFP and YFP intensities are equal corresponds to intrinsic noise, whereas spread parallel to this line is increased by extrinsic noise”. While both statements are true, the one about intrinsic noise is precise whereas the one about extrinsic noise can be refined. In fact, the ELSS extrinsic noise estimate is the sample covariance (albeit biased due to a prefix of n in the denominator rather than n-1), an observation made by Hilfinger and Paulsson. The sample covariance has a  (well-known) geometric interpretation: Specifically, we explain that it is the average (signed) area of triangles formed by pairs of data points (one the blue one in the figure): green triangles in Q1 and Q3 (some not shown) represent a positive contribution to the covariance and magenta triangles in Q2 and Q4 represent a negative contribution. Since most data points lie in the 1st (Q1) and 3rd (Q3) quadrants relative to the blue point, most of the contribution involving the blue point is positive. Similarly, since most pairs of data points can be connected by a positively signed line, their positive contribution will result in a positive covariance. We also explain why naïve intuition of extrinsic noise as the variance of points along the line c=y is problematic.

The estimators we derive are summarized in the table below (Table 1 from our paper):


There is a bit of algebra that is required to derive formulas in the table (see the appendices of our paper). The take home messages are that:

  1. There is a subtle assumption underlying the ELSS intrinsic noise estimator that makes sense for the experiments in the ELSS paper, but not for every type of experiment in which the ELSS formulas are currently used. This has to do with the mean expression level of the two reporters, and we provide a rationale and criteria when to apply quantile normalization to normalize expression to the data.
  2. The ELSS intrinsic noise estimator is unbiased, whereas the ELSS extrinsic noise estimator is (slightly) biased. This asymmetry can be easily rectified with adjustments we derive.
  3. The standard unbiased estimator for variance (obtained via the Bessel correction) is frequently, and correctly, criticized for trading off mean squared error for bias. In practice, it can be more important to minimize mean squared error (MSE). For this reason we derive MSE minimizing estimators. While the MSE minimizing estimates converge quickly to the unbiased estimates (as a function of the number of cells), we believe there may be applications of the law of total variance to problems in systems biology where sample sizes are smaller, in which case our formulas may become useful.

The ELSS paper has been cited more than 3,000 times and with the recent emergence of large scale single-cell RNA-Seq the ideas in the paper are more relevant than ever. Hopefully our statistical companion to the paper will be useful to those seeking to obtain a better understanding of the methods, or those who plan to extend and apply them.

Today I posted the preprint N. Bray, H. Pimentel, P. Melsted and L. Pachter, Near-optimal RNA-Seq quantification with kallisto to the arXiv. It describes the RNA-Seq quantification program kallisto. [Update April 5, 2016: a revised version of the preprint has been published: Nicolas L. Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-optimal probabilistic RNA-Seq quantification, Nature Biotechnology (2016), doi:10.1038/nbt.3519 published online April 4, 2016.]

The project began in August 2013 when I wrote my second blog post, about another arXiv preprint describing a program for RNA-Seq quantification called Sailfish (now a published paper). At the time, a few students and postdocs in my group read the paper and then discussed it in our weekly journal club. It advocated a philosophy of “lightweight algorithms, which make frugal use of data, respect constant factors and effectively use concurrent hardware by working with small units of data where possible”. Indeed, two themes emerged in the journal club discussion:

1. Sailfish was much faster than other methods by virtue of being simpler.

2. The simplicity was to replace approximate alignment of reads with exact alignment of k-mers. When reads are shredded into their constituent k-mer “mini-reads”, the difficult read -> reference alignment problem in the presence of errors becomes an exact matching problem efficiently solvable with a hash table.

We felt that the shredding of reads must lead to reduced accuracy, and we quickly checked and found that to be the case. In fact, in our simulations, we saw that Sailfish significantly underperformed methods such as RSEM. However the fact that simpler was so much faster led us to wonder whether the prevailing wisdom of seeking to improve RNA-Seq analysis by looking at increasingly complex models was ill-founded. Perhaps simpler could be not only fast, but also accurate, or at least close enough to best-in-class for practical purposes.

After thinking about the problem carefully, my (now former) student Nicolas Bray realized that the key is to abandon the idea that alignments are necessary for RNA-Seq quantification. Even Sailfish makes use of alignments (of k-mers rather than reads, but alignments nonetheless). In fact, thinking about all the tools available, Nick realized that every RNA-Seq analysis program was being developed in the context of a “pipeline” of first aligning reads or parts of them to a reference genome or transcriptome. Nick had the insight to ask: what can be gained if we let go of that paradigm?

By April 2014 we had formalized the notion of “pseudoalignment” and Nick had written, in Python, a prototype of a pseudoaligner. He called the program kallisto. The basic idea was to determine, for each read, not where in each transcript it aligns, but rather which transcripts it is compatible with. That is asking for a lot less, and as it turns out, pseudoalignment can be much faster than alignment. At the same time, the information in pseudoalignments is enough to quantify abundances using a simple model for RNA-Seq, a point made in the isoEM paper, and an idea that Sailfish made use of as well.

Just how fast is pseudoalignment? In January of this year Páll Melsted from the University of Iceland came to visit my group for a semester sabbatical. Páll had experience in exactly the kinds of computer science we needed to optimize kallisto; he has written about efficient k-mer counting using the bloom filter and de Bruijn graph construction. He translated the Python kallisto to C++, incorporating numerous clever optimizations and a few new ideas along the way. His work was done in collaboration with my student Harold Pimentel, Nick (now a postdoc with Jacob Corn and Jennifer Doudna at the Innovative Genomics Initiative) and myself.

The screenshot below shows kallisto being used on my 2012 iMac desktop to build an index of the human transcriptome (5 min 8 sec), and then quantify 78.6 million GEUVADIS human RNA-Seq reads (14 min). When we first saw these results we thought they were simply too good to be true. Let me repeat: The quantification of 78.6 million reads takes 14 minutes on a standard desktop using a single CPU core. In some tests we’ve gotten even faster running times, up to 15 million reads quantified per minute.


The results in our paper indicate that kallisto is not just fast, but also very accurate. This is not surprising: underlying RNA-Seq analysis are the alignments, and although kallisto is pseudoaligning instead, it is almost always only the compatibility information that is used in actual applications. As we show in our paper, from the point of view of compatibility, the pseudoalignments and alignments are almost the same.

Although accuracy is a primary concern with analysis, we realized in the course of working on kallisto that speed is also paramount, and not just as a  matter of convenience. The speed of kallisto has three major implications:

1. It allows for efficient bootstrapping. All that is required for the bootstrap are reruns of the EM algorithm, and those are particularly fast within kallisto. The result is that we can accurately estimate the uncertainty in abundance estimates. One of my favorite figures from our paper, made by Harold, is this one:


It is based on an analysis of 40 samples of 30 million reads subsampled from 275 million rat RNA-Seq reads. Each dot corresponds to a transcript and is colored by its abundance. The x-axis shows the variance estimated from kallisto bootstraps on a single subsample while the y-axis shows the variance computed from the different subsamples of the data. We see that the bootstrap recapitulates the empirical variance. This result is non-trivial: the standard dogma, that the technical variance in RNA-Seq is “Poisson” (i.e. proportional to the mean) is false, as shown in Supplementary Figure 3 of our paper (the correlation becomes 0.64). Thus, the bootstrap will be invaluable when incorporated in downstream application and we are already working on some ideas.

2. It is not just the kallisto quantification that is fast; the index building, and even compilation of the program are also easy and quick. The implication for biologists is that RNA-Seq analysis now becomes interactive. Instead of “freezing” an analysis that might take weeks or even months, data can be explored dynamically, e.g. easily quantified against different transcriptomes, or re-quantified as transcriptomes are updated. The ability to analyze data locally instead of requiring cloud computation means that analysis is portable, and also easily secure.

3. We have found the fast turnaround of analysis helpful in improving the program itself. With kallisto we can quickly check the effect of changes in the algorithms. This allows for much faster debugging of problems, and also better optimization. It also allows us to release improvements knowing that users will be able to test them without resorting to a major computation that might take months. For this reason we’re not afraid to say that some improvements to kallisto will be coming soon.

As someone who has worked on RNA-Seq since the time of 32bp reads, I have to say that kallisto has personally been extremely liberating. It offers freedom from the bioinformatics core facility, freedom from the cloud, freedom from the multi-core server, and in my case freedom from my graduate students– for the first time in years I’m analyzing tons of data on my own; because of the simplicity and speed I find I have the time for it. Enjoy!


“An entertaining freshness… Tic Tac!” This is Ferrero‘s tag line for its most successful product, the ubiquitous Tic Tac. And the line has stuck. As WikiHow points out in how to make your breath freshfirst buy some mints, then brush your teeth.

One of the amazing things about Tic Tacs is that they are sugar free. Well… almost not. As the label explains, a single serving (one single Tic Tac) contains 0g of sugar (to be precise, less than 0.5g, as explained in a footnote). In what could initially be assumed to be a mere coincidence, the weight of a single serving is 0.49g. It did not escape my attention that 0.50-0.49=0.01. Why?


To understand it helps to look at the labeling rules of the FDA. I’ve reproduced the relevant section (Title 21) below, and boldfaced the relevant parts:


(c) Sugar content claims –(1) Use of terms such as “sugar free,” “free of sugar,” “no sugar,” “zero sugar,” “without sugar,” “sugarless,” “trivial source of sugar,” “negligible source of sugar,” or “dietarily insignificant source of sugar.” Consumers may reasonably be expected to regard terms that represent that the food contains no sugars or sweeteners e.g., “sugar free,” or “no sugar,” as indicating a product which is low in calories or significantly reduced in calories. Consequently, except as provided in paragraph (c)(2) of this section, a food may not be labeled with such terms unless:

(i) The food contains less than 0.5 g of sugars, as defined in 101.9(c)(6)(ii), per reference amount customarily consumed and per labeled serving or, in the case of a meal product or main dish product, less than 0.5 g of sugars per labeled serving; and

(ii) The food contains no ingredient that is a sugar or that is generally understood by consumers to contain sugars unless the listing of the ingredient in the ingredient statement is followed by an asterisk that refers to the statement below the list of ingredients, which states “adds a trivial amount of sugar,” “adds a negligible amount of sugar,” or “adds a dietarily insignificant amount of sugar;” and

(iii)(A) It is labeled “low calorie” or “reduced calorie” or bears a relative claim of special dietary usefulness labeled in compliance with paragraphs (b)(2), (b)(3), (b)(4), or (b)(5) of this section, or, if a dietary supplement, it meets the definition in paragraph (b)(2) of this section for “low calorie” but is prohibited by 101.13(b)(5) and 101.60(a)(4) from bearing the claim; or

(B) Such term is immediately accompanied, each time it is used, by either the statement “not a reduced calorie food,” “not a low calorie food,” or “not for weight control.”

It turns out that Tic Tacs are in fact almost pure sugar. Its easy to figure this out by looking at the number of calories per serving (1.9) and multiplying  the number of calories per gram of sugar (3.8) by 0.49 => 1.862 calories. 98% sugar! Ferrero basically admits this in their FAQ. Acting completely within the bounds of the law, they have simply exploited an arbitrary threshold of the FDA. Arbitrary thresholds are always problematic; not only can they have unintended consequences, but they can be manipulated to engineer desired outcomes. In computational biology they have become ubiquitous, frequently being described as “filters” or “pre-processing steps”.  Regardless of how they are justified, thresholds are thresholds are thresholds. They can sometimes be beneficial, but they are dangerous when wielded indiscriminately.

There is one type of thresholding/filtering in used in RNA-Seq that my postdoc Bo Li and I have been thinking about a bit this year. It consists of removing duplicate reads, i.e. reads that map to the same position in a transcriptome. The motivation behind such filtering is to reduce or eliminate amplification bias, and it is based on the intuition that it is unlikely that lightning strikes the same spot multiple times. That is, it is improbable that many reads would map to the exact same location assuming a model for sequencing that posits selecting fragments from transcripts uniformly. The idea is also called de-duplication or digital normalization.

Digital normalization is obviously problematic for high abundance transcripts. Consider, for example, a transcripts that is so abundant that it is extremely likely that at least one read will start at every site (ignoring the ends, which for the purposes of the thought experiment are not relevant). This would also be the case if the transcript was twice as abundant, and so digital normalization would prevent the possibility for estimating the difference. This issue was noted in a paper published earlier this year by Zhou et al.  The authors investigate in some detail the implications of this problem, and quantify the bias it introduces in a number of data sets. But a key question not answered in the paper is what does digital normalization actually do?

To answer the question, it is helpful to consider how one might estimate the abundance of a transcript after digital normalization. One naive approach is to just count the number of reads after de-duplication, followed by normalization for the length of the transcript and the number of reads sequenced. Specifically if there are sites where a read might start, and of the sites had at least one read, then the naive approach would be to use the estimate \frac{k}{n} suitably normalized for the total number of reads in the experiment. This is exactly what is done in standard de-duplication pipelines, or in digital normalization as described in the preprint by Brown et al. However assuming a simple model for sequencing, namely that every read is selected by first choosing a transcript according to a multinomial distribution and then choosing a location on it uniformly at random from among the sites, a different formula emerges.

Let be a random variable that denotes the number of sites on a transcript of length n that are covered in a random sequencing experiment, where the number of reads starting at each site of the transcript is Poisson distributed with parameter c (i.e., the average coverage of the transcript is c). Note that

Pr(X \geq 1) = 1-Pr(X=0) = 1-e^{-c}.

The maximum likelihood estimate for can also be obtained by the method of moments, which is to set

\frac{k}{n} = 1-e^{-c}

from which it is easy to see that

c = -log(1-\frac{k}{n}).

This is the same as the (derivation of the) Jukes-Cantor correction in phylogenetics where the method of moments equation is replaced by \frac{4}{3}\frac{k}{n} = 1-e^{-\frac{4}{3}c} yielding D_{JC} = -\frac{3}{4}log(1-\frac{4}{3}\frac{k}{n}), but I’ll leave an extended discussion of the Jukes-Cantor model and correction for a future post.

The point here, as noticed by Bo Li, is that since log(1-x) \approx -x by Taylor approximation, it follows that the average coverage can be estimated by c \approx \frac{k}{n}. This is exactly the naive estimate of de-duplication or digital normalization, and the fact that \frac{k}{n} \rightarrow 1 as k \rightarrow n means that -log(1-\frac{k}{n}) blows up, at high coverage hence the results of Zhou et al.

Digital normalization as proposed by Brown et al. involves possibly thresholding at more than one read per site (for example choosing a threshold C and removing all but at most C reads at every site). But even this modified heuristic fails to adequately relate to a probabilistic model of sequencing. One interesting and easy exercise is to consider the second or higher order Taylor approximations. But a more interesting approach to dealing with amplification bias is to avoid thresholding per se,  and to instead identify outliers among duplicate reads and to adjust them according to an estimated distribution of coverage. This is the approach of Hashimoto et al. in a the paper “Universal count correction for high-throughput sequencing” published in March in PLoS One. There are undoubtedly other approaches as well, and in my opinion the issue will received renewed attention in the coming year as the removal of amplification biases in single-cell transcriptome experiments becomes a priority.

As mentioned above, digital normalization/de-duplication is just one of many thresholds applied in a typical RNA-Seq “pipeline”. To get a sense of the extent of thresholding, one need only scan the (supplementary?) methods section of any genomics paper. For example, the GEUVADIS RNA-Seq consortium describe their analysis pipeline as follows:

“We employed the JIP pipeline (T.G. & M.S., data not shown) to map mRNA-seq reads and to quantify mRNA transcripts. For alignment to the human reference genome sequence (GRCh37, autosomes + X + Y + M), we used the GEM mapping suite24 (v1.349 which corresponds to publicly available pre-release 2) to first map (max. mismatches = 4%, max. edit distance = 20%, min. decoded strata = 2 and strata after best = 1) and subsequently to split-map (max.mismatches = 4%, Gencode v12 and de novo junctions) all reads that did not map entirely. Both mapping steps are repeated for reads trimmed 20 nucleotides from their 3′-end, and then for reads trimmed 5 nucleotides from their 5′-end in addition to earlier 3′-trimming—each time considering exclusively reads that have not been mapped in earlier iterations. Finally, all read mappings were assessed with respect to the mate pair information: valid mapping pairs are formed up to a maximum insert size of 100,000 bp, extension trigger = 0.999 and minimum decoded strata = 1. The mapping pipeline and settings are described below and can also be found in, where the code as well as an example pipeline are hosted.”

This is not a bad pipeline- the paper shows it was carefully evaluated– and it may have been a practical approach to dealing with the large amount of RNA-Seq data in the project. But even the first and seemingly innocuous thresholding to trim low quality bases from the ends of reads is controversial and potentially problematic. In a careful analysis published earlier this year, Matthew MacManes looked carefully at the effect of trimming in RNA-Seq, and concluded that aggressive trimming of bases below Q20, a standard that is frequently employed in pipelines, is problematic. I think his Figure 3, which I’ve reproduced below, is very convincing:


It certainly appears that some mild trimming can be beneficial, but a threshold that is optimal (and more importantly not detrimental) depends on the specifics of the dataset and is difficult or impossible to determine a priori. MacManes’ view (for more see his blog post on the topic) is consistent with another paper by Del Fabbro et al. that while seemingly positive about trimming in the abstract, actually concludes that “In the specific case of RNA-Seq, the tradeoff between sensitivity (number of aligned reads) and specificity (number of correctly aligned reads) seems to be always detrimental when trimming the datasets (Figure S2); in such a case, the modern aligners, like Tophat, seem to be able to overcome low quality issues, therefore making trimming unnecessary.”

Alas, Tic Tac thresholds are everywhere. My advice is: brush your teeth first.

This is the third and final post in a series (part1, part2) of posts on two back-to-back papers published in Nature Biotechnology in August 2013:

  1. Baruch Barzel & Albert-László Barabási, Network link prediction by global silencing of indirect correlationsNature Biotechnology 31(8), 2013, p 720–725. doi:10.1038/nbt.2601
  2. Soheil Feizi, Daniel Marbach, Muriel Médard & Manolis Kellis, Network deconvolution as a general method to distinguish direct dependencies in networksNature Biotechnology 31(8), 2013, p 726–733. doi:10.1038/nbt.2635

An inconvenient request

One of the great things about conferences is that there is time to chat in person with distant friends and collaborators. Last July, at the ISMB conference in Berlin, I found myself doing just that during one of the coffee breaks. Suddenly, Manolis Kellis approached me and asked to talk in private. The reason for his interruption: he came to request that I remove an arXiv post of mine, namely “Comment on ‘Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions“, a response to a paper by Ward and Kellis. Why? He pointed out that my arXiv post was ranking highly on Google. This was inconvenient for him, he explained, while insisting that it was wrong of me to post a criticism of his work on a forum where he could not directly respond. He suggested that it would be best to work out any issues I might have with his paper offline. Unfortunately,  there was the inconvenient truth that arXiv postings cannot be removed. Unlike some journals, where, say, a supplement can be revised while having the original removed (see the figure switching of Feizi et al.), arXiv preprints are permanent.

My initial confusion quickly turned to anger. After all, my arXiv comment had been rejected from Science where I had submitted it as a technical comment on the Ward-Kellis paper. I had then put it on the arXiv as a last resort measure to at least have some record of my concerns publicly accessible. How is this wrong? Can one not critique the work of Manolis Kellis?

Network nonsense begins

My first review of a Manolis Kellis paper was in the fall of 2006, in my capacity as a program committee member of the  Research in Computational Molecular Biology (RECOMB) conference held in Oakland, CA in 2007. Because Oakland is right next to Berkeley, a number of Berkeley professors were involved in organizing and running the conference. Terry Speed was chair of the program committee. I was out of the country that year on sabbatical at the University of Oxford, so I could not participate, or even attend, the conference, but I volunteered to serve on the program committee. For those not familiar with the RECOMB review process, it is modeled after the standard Computer Science conferences. The program committee chair forms the program committee, who are then assigned a handful of papers to review and score. Reviews are entered on a website, and after a brief period of online discussion about borderline papers, scores are revised and accept/reject decisions are made. Authors can revise their manuscripts, but the reviewers never see the papers again before publication in the proceedings.

One of the papers I was assigned to review was by a student named Joshua Grochow and his advisor Manolis Kellis. The paper was titled “Network Motif Discovery Using Subgraph Enumeration and Symmetry-Breaking“. Although networks were not my research focus at the time, and “symmetry-breaking” evoked in me nightmares from the physics underworld, I agreed to the review. The paper seemed to contain some interesting algorithms, appeared to have a combinatorial flavor, and potentially important applications- a good mix for RECOMB. 

The problem addressed by Grochow & Kellis was that of identifying “network motifs” in biological networks. “Motifs” can be defined in a variety of ways, and the Grochow-Kellis objective was simple. In graph theoretic terms, given a graph G, the goal was to find subgraphs occurring with high multiplicity to an extent unlikely in a random graph. There are many models for random graphs, and the one that the results in Grochow-Kellis are based on is the Erdös-Renyi model (each edge chosen independently with some fixed probability). The reason this definition might be of biological interest, is that recurrent motifs interspersed in a graph are likely to represent evolutionarily conserved interaction modules.

The paper begins with a description of the method. I won’t go into the details here, except to say that everything seemed well until I read the caption of Figure 3. There the number 27,720 caught my eye. Grochow_Kellis_Figure3

During my first few years of graduate school I took many courses on enumerative and algebraic combinatorics. There are some numbers that combinatorialists just “know”. For example, seeing 42 emerge as the answer to a counting problem does not bring to mind Douglas Adams, but rather the vast literature on Catalan numbers and their connections to dozens of well-known counting problems. Similarly, a number such as 126 brings to mind binomial coefficients (126={9 \choose 4}), and with them the idea of counting the number of subsets of fixed size from a larger set. When I saw the number 27,720 I had a hunch that somehow some canonical combinatorial set had been enumerated. The idea may have entered my mind because of the picture of the “motif” in which I immediately recognized a clique (all vertices mutually connected) and a stable set (no pair of vertices connected). In any case, I realized that

27,720 = 220 \cdot 126 = {12 \choose 3} \cdot {9 \choose 4}.

The significance of this is that the “motif” on the left-hand side of Figure 3 had appeared many times  because of a type of double- or rather thousandfold- counting. Instead of representing statistically significant recurring independent motifs, this “motif” arises because of a combinatorial artifact. In the specific example of Figure 3, the motif occurred once for any choice of 4 nodes from the clique of size 9, and any choice of 3 nodes from the stable set of size 12.

The point is that in a graph, any subgraph attached to a large clique (or stable set) will occur many times. This simple observation is a result of the fact that there are many subgraphs of a clique (or stable set) that are identical. I realized that this meant that the Grochow-Kellis method was essentially a heuristic for finding cliques and stable sets in graphs. The particular “network motifs” they were pulling out were just subgraphs that happened to be connected to some large cliques and stable sets. There are two problems with this: first, a clique or a stable set can hardly be considered an interesting “network motif”. Moreover, the fact that they appear in biological networks much more than in Erdös-Renyi random graphs is not surprising. Second, there is a large literature on finding cliques in graphs, none of which Grochow-Kellis cited or seemed to be familiar with. 

The question of the performance of the Grochow-Kellis algorithm is answered in their Figure 3 as well. There is a slightly larger motif consisting of nodes from the stable set of size 12, instead of 3. That motif occurs in all {12 \choose 6} subsets of the stable set instead of {12 \choose 3} subsets which means that there is a motif that occurs 116,424 times! Grochow and Kellis’s algorithm did not even achieve its stated goal. It really ought to have outputted the left hand side figure with six nodes in the stable set on the left, and not three. In other words, this was a paper providing uninteresting solutions from a biological point of view, and doing so poorly to boot.

I wrote up a detailed report on the paper, and posted it on the RECOMB review website together with poor scores reflecting my opinion that the paper had to be rejected. How could RECOMB, ostensibly the premier computer science conference on computational and algorithmic biology, publish a paper with neither a computational nor biological result? Not to mention an algorithm that demonstratably did not find the most frequently occurring motif.

As you might already guess, my rejection was subsequently overruled. I don’t know who made the final decision to accept the Grochow & Kellis paper to the RECOMB conference, although presumably it was the program committee chair. The decision jarred with my sense of scientific integrity. I had put considerable effort into reviewing the paper and understanding it, and I felt that I had provided a compelling objective argument for why the paper was fundamentally flawed- the fact that the results were trivial (and incorrect!) was not a subjective statement. At this point I need to point out that the RECOMB conference is quite difficult to get into. The acceptance rate for papers in 2007, consistent with other years, was 21.8%. I knew this meant that even a single very negative review, especially one with a compelling argument against the paper, almost certainly would lead to rejection of the paper. So I couldn’t understand then, nor do I still understand now, on what basis the decision was made to accept the paper. This bothered me greatly, and after much deliberation I started boycotting the conference. Despite publishing five RECOMB papers from 2000 to 2006 and regularly attending the meeting during that time, the continued poor decisions and haphazard standards for papers selected have led me to not return in almost 8 years.

Grochow and Kellis obviously received my review and considered how to “deal with it”. They added a section titled “The role of combinatorial effects”, in which they explained the origins of the number 27,720 that they gleaned from my report, but then spun the bad news they had received as “resulting from combinatorial connectivity patterns prevalent in larger network structures.” They then added that “…this combinatorial clustering effect brings into question the current definition of network motif” and proposed that “additional statistics…might well be suited to identify larger meaningful networks.” This is a lot like someone claiming to discover a bacteria whose DNA is arsenic-based and upon being told by others that the “discovery” is incorrect – in fact, that very bacteria seeks out phosphorous – responding that this is “really helpful” and that it “raises lots of new interesting open questions” about how arsenate gets into cells. Chutzpah. When you discover your work is flawed, the correct response is to retract it.

I don’t think people read papers very carefully. Joshua Grochow went on to win the MIT Charles and Jennifer Johnson Outstanding M. Eng. Thesis Award for his RECOMB work on network motif discovery. [Added February 18: Grochow and Kellis have posted a reply here].

The nature of man

I have to admit that after the Grochow-Kellis paper I was a bit skeptical of Kellis’ work. Not because of the paper itself (everyone makes mistakes), but because of the way he responded to my review. So a year and a half ago, when Manolis Kellis published a paper in an area I care about and am involved in, I may have had a negative prior. The paper was Luke Ward and Manolis Kellis “Evidence for Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions”, Science 337 (2012) . Having been involved with the ENCODE pilot, where I contributed to the multiple alignment sub-project, I was curious what comparative genomics insights the full-scale $130 million dollar project revealed. The press releases accompanying the Ward-Kellis paper (e.g. The Nature of Man, The Economist) were suggesting that Ward and Kellis had figured out what makes a human a human; my curiosity was understandably piqued.

Ward and Kellis combined population genomic data from the 1000 Genomes Project with biochemical data from the ENCODE project to look for signatures of human constraint in regulatory elements. Their analysis was based on measuring three different proxies for constraint: SNP density, heterozygosity and derived allele frequency. To identify specific classes of regulatory regions under constraint, aggregated regions associated with specific gene ontology (GO) categories were tested for significance. Reading the paper I was amazed to discover they found precisely two categories: retinal cone cell development and nerve growth factor receptor signaling. It was only upon reading the supplement that I discovered that their tests had produced 53 other GO categories as well (Table S5).


Despite the fact that the listed categories were required to pass a false discovery rate (FDR) threshold for both the heterozygosity and derived allele frequency (DAF) measures, it was statistically invalid for them to highlight any specific GO category. FDR control merely guarantees a low false discovery rate among the entries in the entire list. Moreover, there was no obvious explanation for why categories such as chromatin binding (which had a smaller DAF than nerve growth) or protein binding (with the smallest p-value) appeared to be under purifying selection. As with the Feizi et al. paper, the supplement produced a story much less clean than the one presented in the main body of the paper. In fact, retinal cone cell development and nerve growth factor were 33 and 34 out of the 55 listed GO categories when sorted by the DAF p-value (42 and 54 when sorted by heterozygosity p-value). In other words, the story being sold in the paper was based on blatant statistically invalid cherry picking.

The other result of the paper was an estimate that in addition to the 5% of the human genome conserved across mammalian genomes, at least another 4% has been subject to lineage-specific constraint. This result was based on adding up the estimates of constrained nucleotides from their Table S6 (using the derived allele frequency measure). These were calculated using a statistic that was computed as follows: for each one of ten bins determined according to estimated background selection strength, and for every feature F, the average DAF value DF was rescaled to

PUC_F = \frac{(D_F - D_{CNDC})}{(D_{NCNE}-D_{CNDC})},

where DCNDC and DNCNE were the bin-specific average DAFs of conserved non-degenerate coding regions and non-conserved non-ENCODE regions respectively. One problem with the statistic is that the non-conserved regions contain nucleotides not conserved in all mammals, which is not the same as nucleotides not conserved in any mammals. The latter would have been needed in order to identify human specific constraint. Second, the statistic PUCF  was used as a proxy for the proportion under constraint even though, as defined, it could be less than zero or greater than one. Indeed, in Table S6 there were four values among the confidence intervals for the estimated proportions using DAF that included values less than 0% or above 100%:

zoom1                             zoom2

Ward and Kellis were therefore proposing that some features might have a negative number of nucleotides under constraint. Moreover, while it is possible that after further rescaling PUCmight have correlated with the true proportion of nucleotides under constraint, there was no argument provided in the paper. Thus, while Ward and Kellis claimed to have estimated the proportion of nucleotides under constraint, they had only computed a statistic named “proportion under constraint”.

Nicolas Bray and I wrote up these points in a short technical comment and submitted it to the journal Science early in November 2012. The comment was summarily rejected with a curt reply by senior editor Laura Zahn stating that “relative to other Technical  Comments we have recently received we feel that the scope and focus of your  comment make it more suitable for the Online Comments facility at Science, rather than as a candidate for publication as a Technical Comment.” It is worth noting that Science did decide to publish another comment: Phil Green and Brent Ewing’s, “Comment on’Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions‘”, Science 10 (2013). Green and Ewing’s comment is biological in nature. Their concern is that “… the polymorphism trends are primarily attributable to mutational variation and technical artifacts rather than selection.” Its fine that Science decided to host a debate on a biology question on its pages, but how can one debate the interpretation of results from a method, when the method is fundamentally flawed to begin with? After all, our problem with PUC was much deeper than a “technical flaw”.

We decided at the end to place the comment in the arXiv. After doing so, it became apparent that it had little impact. Indeed, I have never received any feedback about it from anyone. Apparently even this was too much for Manolis Kellis.

Methods matter

By the time I noticed the Feizi et al. paper in the journal Nature Biotechnology early in August 2013, my experiences reading Kellis’ papers had subtly altered the dynamic between myself and the printed word. Usually, when I read a paper and I don’t understand something, I assume the fault lies with me. I think most people are like this. But now, when the Feizi et al. paper started to not make sense, I didn’t presume the problem was with me. I tried hard to give the paper a fair reading, but after a few paragraphs the spell of the authors was already broken. And so it is that Nicolas Bray and I came to figure out what was really going on in Feizi et al., a project that eventually led us to also look at Barzel-Barabási.

Speaking frankly, it was difficult work to write the blog posts about these articles. In addition to the time it took, it was exhausting and exasperating to discover the flaws, fallacies and frauds. Both Nick and I prefer to do research. But we felt a responsibility to spell out in detail what had happened here. Manolis Kellis is not just any scientist. He has, and continues to play leading roles in major consortium projects such as mod-ENCODE and ENCODE, and he has served on numerous advisory committees for the NHGRI. He is a member of the GCAT (Genomics, Computational Biology and Technology) study section until 2018. That any person would swap out a key figure in a published paper without publishing a correction, and without informing the editor is astonishing. That a person with great responsibility towards scientists is an abuser of science is unacceptable.

Manolis Kellis’ behavior is part of a systemic problem in computational biology. The cross-fertilization of ideas between mathematics, statistics, computer science and biology is both an opportunity and a danger. It is not hard to peddle incoherent math to biologists, many of whom are literally math phobic. For example, a number of responses I’ve received to the Feizi et al. blog post have started with comments such as

“I don’t have the expertise to judge the math, …”

Similarly, it isn’t hard to fool mathematicians into believing biological fables. Many mathematicians throughout the country were recently convinced by Jonathan Rothberg to donate samples of their DNA so that they might find out “what makes them a genius”. Such mathematicians, and their colleagues in computer science and statistics, take at face value statements such as “we have figured out what makes a human human”. In the midst of such confusion, it is easy for an enterprising “computational person” to take advantage of the situation, and Kellis has.

I believe the solution for this problem is for computational biologists to start taking themselves more seriously. Whether serving as reviewers for journals, as panel members for funding agencies, on hiring/tenure committees, or writing articles, all of us have to tone down the hype and pay closer attention to the science. There are many examples of what this means: a review of a math/stats containing paper cannot be a single paragraph long and based on a hunch, and similarly computational biologists shouldn’t claim, as have many of the authors of papers I’ve reviewed in these posts, pathways to cure disease and explanations for what makes humans human. Don’t fool the biologists. Don’t fool the computer scientists, statisticians, and mathematicians.

The possibilities for computational methods in biology are unlimited. The future is exciting, and there are possibilities for significant advances in areas ranging from molecular and evolutionary biology to medicine. But money, citations and fame cannot rule the day. The details of the #methodsmatter.

Today I came across an arXiv posting from yesterday:

Theoretical bounds on mate-pair information for accurate genome assembly by Henry Lin and Yufeng Shen (, October 7 2013).

It purports to provide a sufficient condition for genome assembly, when in fact it is a rediscovery of Ukkonen’s condition for assembly (and a poor rediscovery at that). I’ve seen this happen before so I thought I’d make a short post about this.

Ukkonen’s paper is

E. Ukkonen, Approximate string matching with q-grams and maximal matches, Theoretical Computer Science 92 (1992), no. 1, 191–211.

The connection to genome assembly is explained well in a recent paper by Guy Bresler, Ma’ayan Bresler and David Tse: Optimal assembly for high-throughput shotgun sequencing (, 18 February 2013).

The following figure from the Bresler et al. paper illustrates Ukkonen’s condition:


The condition is that the read length L must be at least one greater than the maximum length of any interleaved repeat or triple repeat. In the figure above (Figure 4 in the Bresler et al. paper), it is impossible to distinguish the two different assemblies (that merely interchange the sequence between the interleaved repeats) with reads of length L. There is a similar example of switching for triple repeats.

The lower bound of Lin and Shen essentially recapitulates the above condition. Moreover, the upper bound is trivial: it merely describes a sufficient library design that if sequenced completely would allow for assembly. Bresler et al. go much farther than this, and describe an approach to optimal assembly under a random model of (shotgun) sequencing that in particular, requires shotgun generalizations of Ukkonen’s combinatorial condition. In a future post I will describe in much more detail the Bresler et al. results. For now, it suffices to say, Ukkonen should be essential (starting) reading for anyone working on assembly.

Update (October 8): the authors have graciously contacted me and acknowledged their oversight in not referring to Ukkonen’s work. They will update their manuscript and repost to the arXiv. They also pointed out that their paper discusses specifically paired-end reads, and that this requires a slightly different analysis than Ukkonen’s initial work (a good point that I should have mentioned in the original post). I applaud them for (a) posting the initial result on the arXiv thereby allowing the community to look at their work prior to publication and (b) for responding immediately to improve it given the feedback.

I recently came across a wonderful website of Karl Broman via his homepage.  He hosts a list of top ten worst graphs in which he describes examples of badly constructed/displayed graphs from statistical and computational biology papers. He calls  out mistakes of others, not in order to humiliate the authors, but rather to educate via example.  Its a list every computational biologist should peruse and learn from. The page also has an excellent list of recommended references, including Tufte’s “The visual display of quantitative information” which inspired the Minard plots for transcript abundance shown in Figure 2b and Appendix B of the supplement of my Cufflinks paper.

Interestingly, in addition to the list of graphs, Karl includes a list of tables but with only a single entry (the problem with the table is discussed here). Why should a list of worst graphs have 10 items but a list of tables only one? This is a blatant example of protectionism of tables at the expense of the oft abused graphs. With this post I’d like to remedy this unfair situation and begin the process of assembling a list of top ten worst tables: the tabular hall of shame.

I begin by offering Supplementary Table S6 from the paper “Evidence for Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions” by Luke Ward and Manolis Kellis, Science 337 (2012) 1675–1678:


The table is in support of a main result of the paper, namely that in addition to the 5% of the human genome conserved across mammalian genomes, at least another 4% is subject to lineage-specific constraint. This result is based on adding up the estimates of constrained nucleotides from Table S6 (using column 10, the derived allele frequency measure). These estimates were calculated using a statistic that is computed as follows: for each one of ten bins determined according to estimated background selection strength, and for every ENCODE feature F, the average derived allele frequency value DF was rescaled to


where DCNDC and DNCNE are the bin-specific average derived allele frequencies of conserved non-degenerate coding regions and non-conserved non-ENCODE regions respectively.

One problem with the statistic is that the non-conserved regions contain nucleotides not conserved in all mammals, which is not the same as nucleotides not conserved in any mammals. The latter would be needed in order to identify human specific constraint. Second, the statistic PUCF is used as a proxy for the proportion under constraint even though, as defined, it could be less than zero or greater than one. Indeed, in Table S6 there are four values among the confidence intervals for the estimated proportions using derived allele frequency that include values less than 0% or above 100%, for example:

zoom1                             zoom2

Ward and Kellis are therefore proposing that some features might have a negative number of nucleotides under constraint. Moreover, while it is possible that PUCF might correlate with the true proportion of nucleotides under constraint, there is no argument provided in the paper. Thus, while Ward and Kellis claim to have estimated the proportion of nucleotides under constraint, they have only computed a statistic named “proportion under constraint”.

The bottom line is that a table containing percentages should not have negative entries, and if it does reader beware!

On his worst graphs website Karl provides one example of a what he calls “a really bad table” and here I have offered a second (amazingly Table S5 from the Ward-Kellis paper is also a candidate for the list– Nicolas Bray and I review its sophistry in an arXiv note–  but I think every paper should be represented only once on the list). I ask you, the reader, to help me round out the list by submitting more examples in the comments or by email directly to me. Tables from my papers are fair game (notably example #10 of a bad graph on Karl’s list is from one of his own papers). Please help!

I recently completed a term as director of our Center for Computational Biology and one of the things I am proud of having accomplished is helping Brian McClendon, program administrator for the Center, launch a new Ph.D. program in computational biology at UC Berkeley.

The Ph.D. program includes a requirement for taking a new course, “Classics in Computational Biology” (CMPBIO 201), that introduces students to research projects and approaches in computational biology via a survey of classic papers. I taught the class last week and the  classic paper I chose was  “Comparison of biosequences” by Temple Smith and Michael Waterman, Advances in Applied Mathematics 2 (1981), p 482–489.  I gave a preparatory lecture this past week in which I discussed the Needleman-Wunsch algorithm, followed by leading a class discussion about the Smith-Waterman paper and related topics. This post is an approximate transcription of my lecture on the Needleman-Wunsch algorithm:

The Needleman-Wunsch algorithm was published in “A general method applicable to the search for similarities in the amino acid sequence of two proteins“, Needleman and Wunsch, Journal of Molecular Biology 48 (1970), p 443–453. As with neighbor-joining or UniFrac that I just discussed in my previous blog post, the Needleman-Wunsch algorithm was published without an explanation of its meaning, or even a precise description of the dynamic programming procedure it is based on. In their paper 11 years later, Smith and Waterman write

“In 1970 Needleman and Wunsch [1] introduced their homology (similarity) algorithm. From a mathematical viewpoint, their work lacks rigor and clarity. But their algorithm has become widely used by the biological community for sequence comparisons.”

They go on to state that “The statement in Needleman-Wunsch [1] is less general, unclearly stated, and does not have a proof” (in comparison to their own main Theorem 1). The Theorem Smith and Waterman are referring to explicitly describes  the dynamic programming recursion only implicit in Needleman-Wunsch, and then explains why it produces the “optimal” alignment. In hindsight, the Needleman-Wunsch algorithm can be described formally as follows:

Given two sequences and b of lengths n_1 and n_2 respectively, an alignment A(a,b) is a sequence of pairs:

[(a_{i_1},b_{j_1}),(a_{i_2},b_{j_2}),\ldots : 1 \leq i_1 < i_2 < \cdots \leq n_1, 1 \leq j_1 < j_2 \cdots \leq n_2 ].

Given an alignment A(a,b), we let

M = | \{ k: a_{i_k} = b_{j_k} \} |, X = | \{ k: a_{i_k} \neq b_{j_k} \} |


S = | \{ r:a_r \neq i_k \, \mbox{for any k} \} | + | \{ r:b_r \neq j_k \, \mbox{for any k} \} |.

In other words, M counts the number of matching characters in the alignment, X counts the number of mismatches and S the number of “spaces”, i.e. characters not aligned to another character in the other sequence. Since every character must be either matched, mismatched, or a not aligned, we have that

2 M + 2 X + S = n_1+n_2.

Given scores (parameters) consisting of real numbers m,x,s, the Needleman-Wunsch algorithm finds an alignment that maximizes the function m \cdot M + x \cdot X + s \cdot S. Note that by the linear relationship between M,X and described above, there are really only two free parameters, and without loss of generality we can assume they are and x. Furthermore, only and are relevant for computing the score of the alignment; when the scores are provided statistical meaning, one can say that and X are sufficient statistics. We call them the summary of the alignment.

The specific procedure of the Needleman-Wunsch algorithm is to compute a matrix S recursively:

S(i,j) = \mbox{max} \{ S_{i-1,j} + s,S_{i,j-1} + s, S_{i-1,j-1} + xI(a_{i} = b_{j} )\}

where I(a_i = b_j) is the indicator function that is 1 if the characters are the same, and zero otherwise, and an initialization step consists of setting S(i,0)=s \cdot i and S(0,j) = s \cdot j for all i,j. There are numerous generalizations and improvements to the basic algorithm, both in extending the scoring function to include more parameters (although most generalizations retain the linearity assumption), and algorithms for trading off time for space improvements (e.g. divide-and-conquer can be used to improve the space requirements from O(n_1 n_2) to O(n_1 + n_2)).

An important advance in understanding the Needleman-Wunsch algorithm was the realization that the “scores” m,x and s can be interpreted as logarithms of probabilities if the sign on the parameters is the same, or as log-odds ratios if the signs are mixed. This is discussed in detail in the book “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids Sequences” by Richard Durbin, Sean Eddy, Anders Krogh and Graeme Mitchison.

One insight from the statistical point of view is that if  max and plus in the recursion are replaced with plus and times, and the logarithms of probabilities are replaced by the probabilities themselves, then the Needleman-Wunsch recursion computes the total probability of the pairs of sequences marginalized over alignments (i.e., it is the forward algorithm for a suitably defined hidden Markov model). Together with the backward algorithm, which consists of running the forward algorithm on the reversed sequences, this provides an efficient way to compute for every pair of nucleotides the posterior probability that they are aligned. In other words, there is a meaningful interpretation and useful application for the Needleman-Wunsch algorithm with the semi-ring (\mathbb{R}_{-}, \oplus := \mbox{max}, \otimes := +) replaced with (\mathbb{R}, \oplus := +, \otimes := \times).

In work in collaboration with Bernd Sturmfels, we realized that it also makes sense to run the Needleman-Wunsch algorithm with the  polytope algebra (\mathcal{P}, \oplus := \mbox{convex hull of union}, \otimes := \mbox{Minkowski sum}). Here \mathcal{P} is  the set of convex polytopes in \mathbb{R}^d for some d, polytopes are “added” using the geometric operation of the convex hull of the union of polytopes, and they are “multiplied” by taking Minkowski sum. The result allows one to find not only a single optimal alignment, but all optimal alignments.  The details are discussed in a pair of papers:

L. Pachter and B. Sturmfels, Parametric inference for biological sequence analysis, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004), p 16138–16143.

C. Dewey, P. Huggins, K, Woods, B. Sturmfels and L. Pachter, Parametric alignment of Drosophila genomes, PLoS Computational Biology, Volume 2, Number 6 (2006) p e73.

The first explains the mathematical foundations for using the result and is where we coined the term polytope propagation algorithm for the Needleman-Wunsch algorithm with the polytope algebra. The second paper illustrates an application to a whole-genome alignment of two Drosophilids and provides an alternative to polytope propagation (the beneath-beyond algorithm) that is much faster. The type of analysis performed in the paper is also called parametric alignment, a problem with a long history whose popularity was launched by Dan Gusfield who provided the first software program for parametric alignment called XPARAL.

The polytope propagation algorithm (Needleman-Wunsch with the polytope algebra) is illustrated in the figure below:


The convex polygon in the bottom right panel contains in its interior points corresponding to all possible summaries for the alignments of the two sequences. The optimal summaries are vertices of the polygon. For each vertex, the parameters for which the Needleman-Wunsch algorithm will produce a summary corresponding to that vertex form a cone. The cones together form a fan which is shown inside the polygon.

The running time of polytope propagation depends on the number of vertices in the final polytope. In our papers we provide bounds showing that polytope propagation is extremely efficient. In particular, for parameters, the number of vertices is at most O(n^{\frac{d(d-1)}{(d+1)}})Cynthia Vinzant, formerly in our math department at UC Berkeley, has written a nice paper on lower bounds.

The fact that the Needleman-Wunsch algorithm can be run with three semi-rings means that it is particularly well suited to C++ thanks to templates that allow for the variables in the recursion to be abstracted. This idea (and code) is thanks to my former student Colin Dewey:

template<typename SemiRing>

alignGlobalLastRow(const string& seq1,
                   const string& seq2,
                   const typename SemiRing::Element match,
                   const typename SemiRing::Element mismatch,
                   const typename SemiRing::Element gap,
                   vector& row);
const Element one = SemiRing::multiplicativeIdentity;
// Initialize row
row.resize(seq2.size() + 1);
row[0] = one;
// Calculate first row
for (size_t j = 1; j <= seq2.size(); ++j) 
    row[j] = gap * row[j - 1];
// Calculate remaining rows
Element up, diag;
for (size_t i = 1; i <= seq1.size(); ++i) {
    diag = row[0];
    row[0] *= gap;
    for (size_t j = 1; j <= seq2.size(); ++j) {
        up = row[j];
        if (seq1[i - 1] == seq2[j - 1]) {
            row[j] = match * diag + gap * (up + row[j - 1]);
        } else {
            row[j] = mismatch * diag + gap * (up + row[j - 1]);
        diag = up;

To recap, the three semi-rings with which it makes sense to run this algorithm are:

  1. (\mathbb{R}, \oplus := +, \otimes := \times)
  2. (\mathbb{R}_{-}, \oplus := \mbox{max}, \otimes := +)
  3. (\mathcal{P}, \oplus := \mbox{convex hull of union}, \otimes := \mbox{Minkowski sum})

The interpretation of the output of the algorithm with the three semi-rings is:

  1. S_{ij} is the score (probability) of a maximum alignment between the ith prefix of one sequence and the jth prefix of the other.
  2. S_{ij} is the total score (probability) of all alignments between the ith prefix of one sequence and the jth prefix of the other.
  3. S_{ij} is a polytope whose faces correspond to summaries that are optimal for some set of parameters.

The semi-rings in (2) and (3) lead to particularly interesting and useful applications of the Needleman-Wunsch algorithm. An example of the use of (2), is in posterior decoding based alignment where nucleotides are aligned as to maximize the sum of the posterior probabilities computed by (2). In the paper “Alignment Metric Accuracy” (with Ariel Schwartz and Gene Myers) we  provide an interpretation in terms of an interesting metric on sequence alignments and in

A.S. Schwartz and L. Pachter, Multiple alignment by sequence annealing, Bioinformatics 23 (2007), e24–e29

we show how to adapt posterior decoding into a (greedy) multiple alignment algorithm. The video below shows how it works, with each step consisting of aligning a single pair of nucleotides from some pair of sequences using posterior probabilities computed using the Needleman-Wunsch algorithm with semi-ring (2):

We later extended this idea in work together with my former student Robert Bradley in what became FSA published in the paper “Fast Statistical Alignment“, R. Bradley et al., PLoS Computational Biology 5 (2009), e1000392.

An interesting example of the application of Needleman-Wunsch with semi-ring (3), i.e. polytope propagation, is provided in the recent paper “The RNA Newton polytope and learnability of energy parameters” by Elmirasadat Forouzmand and Hamidreza Chitsaz, Bioinformatics 29 (2013), i300–i307. Using our polytope propagation algorithm, they investigate whether simple models for RNA folding can produce, for some set of parameters, the folds of sequences with solved structures (the answer is sometimes, but not always).

It is remarkable that forty three years after the publication of the Needleman-Wunsch paper, and thirty two years after the publication of the Smith-Waterman paper, the algorithms remain pertinent, widely used, and continue to reveal new secrets. True classics.

Where are the authors now?

Saul B. Needleman writes about numismatics.

Christian Wunsch is a clinical pathologist in the University of Miami health system.

Temple Smith is Professor Emeritus of Biomedical Engineering at Boston University.

Michael Waterman is Professor of Biological Sciences, Mathematics and Computer Science at the University of Southern California.

“Basically, I’m not interested in doing research and I never have been. I’m interested in understanding, which is quite a different thing. And often to understand something you have to work it out yourself because no one else has done it.” – David Blackwell

The importance of understanding is unfortunately frequently downplayed in computational biology, where methods are the slums of Results, and are at best afforded a few sentences in supplementary materials of research articles.

Nevertheless, there are a handful of examples to the contrary, an important one being the paper “Neighbor-joining revealed” by Olivier Gascuel and Mike Steel, Molecular Biology and Evolution 23 (2006), p 1997–2000, which I have valued for many years and inspired me to write this post about a similar type of paper. In the neighbor-joining case, Gascuel and Steel review a series of results showing that neighbor-joining, first published in the form of an ad-hoc recipe with only metaphorical motivation, is actually a greedy algorithm for optimizing a least squares loss function (“Theoretical foundations of the balanced minimum evolution method of phylogenetic inference and its relationship to weighted least-squares tree fitting“, R Desper and O Gascuel, Molecular Biology and Evolution 21 (2004), p 587–598). The particular form of the minimization criterion emerges from a formula of Pauplin  (“Direct calculation of a tree length using a distance matrix“, Y Paulin, Journal of Molecular Evolution 10 (1993), p 1073–1095) that in turn has deeper roots. This is a story me and my students have played a small role in, providing some insight into the robustness of the neighbor-joining algorithm in “Why neighbor joining works” and an explanation for the surprisingly simple Pauplin’s formula in “Combinatorics of least squares trees“. These series of insights into neighbor-joining turn out to have practical significance. Results in the “Combinatorics of least squares trees” paper were recently extended by Fabio Pardi and Olivier Gascuel in “Combinatorics of distance-based tree inference“, where they synthesize all of the previous results on neighbor-joining into a blueprint for improving distance based methods. The details of this story will be the subject of a future blog post; here I discuss another beautiful example of work that involves understanding as opposed to research:

I have a long-standing interest in the computational problems that arise in metagenomics, starting with work I did with my former student Kevin Chen on “Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities“.  Recently, while drawing parallels between RNA-Seq and metagenomics (with Lorian Schaeffer and Kevin McLoughlin), I was thinking again about problems encountered in the paper “Disordered microbial communities in asthmatic airways” in which I played a minor role assisting William Cookson. Via a long chain of citation chasing, I eventually stumbled upon the elegant article “The phylogenetic Kantorovich–Rubinstein metric for environmental sequence samples” by Steven Evans and Erick Matsen, Journal of the Royal Statistical Society 74 (2012), p 569–592, in which the authors explain the meaning of the (weighted) UniFrac distance for metagenomic samples. Read the rest of this entry »

Don’t believe the anti-hype. They are saying that RNA-Seq promises the discovery of new expression events, but it doesn’t deliver:


Is this true? There have been a few papers comparing microarrays to RNA-Seq technology (including one of my own) that I’ll discuss below, but first a break-down of the Affymetrix “evidence”. The first is this figure (the poor quality of the images is exactly what Affymetrix provides, and not due to a reduction in quality on this site; they are slightly enlarged when clicked on):


The content of this figure is an illustration of the gene LMNB1 (Lamin protein of type B), used to argue that microarrays can provide transcript level resolution whereas RNA-Seq can’t!! Really? Affymetrix is saying that RNA-Seq users would likely use the RefSeq annotation which only has three isoforms. But this is a ridiculous claim. It is well known that RefSeq is a conservative annotation and certainly RNA-Seq users have the same access to the multiple databases Affymetrix used to build their annotation (presumably, e.g. Ensembl). It therefore seems that what Affymetrix is saying with this figure is that RNA-Seq users are dumb.

The next figure is showing the variability in abundance estimates as a function of expression level for RNA-SEq and the HTA 2.0, with the intended message being that microarrays are less noisy:


But there is a subtle trick going on here. And its in the units. The x-axis is showing RPM, which is an abbreviation for Reads Per Million. This is not a commonly used unit, and there is a reason. First, its helpful to review what is used. In his landmark paper on RNA-Seq, Ali Mortazavi introduced the units RPKM (note the extra K) that stands for reads per kilobase of transcript per million mapped. Why the extra kilobase term? In my review on RNA-Seq quantification I explain that RPKM is proportional to a maximum likelihood estimate of transcript abundance (obtained from a simple RNA-Seq model). The complete derivation is on page 6 ending in Equation 13; I include a summary here:

The maximum likelihood (ML) abundances \hat{\rho}_t are  given by

\hat{\rho}_t = \frac{\frac{\hat{\alpha}_t}{l_t}}{\sum_{r \in T} \frac{\hat{\alpha}_r}{l_r}} \propto \frac{X_t}{\left( \frac{l_t}{10^3}\right) \left( \frac{N}{10^6}\right) }

In these equations l_t is the length of transcript (if reads are long it is necessary to modify the length due to edge effects, hence the tilde in the paper), the \hat{\alpha}_t are the maximum likelihood estimates for the probabilities of selecting reads from transcripts (unnormalized by their length) and finally X_t is the number of reads mapping to transcript t while N is the total number of mapped reads. The point is that RPKM (the rightmost formula for abundance) is really a unit for describing the maximum likelihood relative abundances (\hat{\rho}) scaled by some factors.

RPKM as a unit has two problems. The first is that in current RNA-Seq experiments reads are paired so that the actual units being counted (in X_t) are fragments. For this reason we modified RPKM to FPKM in the Cufflinks paper (the “F” replaced “R” for fragment instead of read). A more serious problem, noted by Bo Li and Colin Dewey in their paper on RSEM, is that while FPKM is proportional to ML estimates of abundance, the proportionality constants may vary between experiments. For this reason they proposed TPM (transcripts per million) which is also proportional to the ML abundance estimates but with a proportionality constant (a million) that is the same between experiments. All of these units are used simply to avoid writing down the \hat{\rho}_t which are in many cases tiny numbers since they must all sum to 1.

Returning to the Affymetrix figure, we see the strange RPM units. In essence, this is the rightmost term in the equation above, with the l_t length terms removed from the denominators. Therefore RPM is proportional to the \hat{\alpha}_t. If a transcript is short, even if it is equally abundant to a longer transcript ,it will produce less RNA-Seq reads and therefore its \hat{\alpha}_t will be (possibly considerably) smaller. The effect of displaying RPM for RNA-Seq vs. expression level for the HTA 2.0 arrays is therefore to mix apples and oranges. Since what is being displayed is a coefficient of variation, there is a bias caused by the relationship between length and expression (e.g. many highly expressed housekeeping genes are short).

To be fair to Affymetrix the conversion between the \hat{\alpha} and the \hat{\rho} can be confusing (its explained in Lemma 14 in the Supplement of the Cufflinks paper). So maybe the discordant x-axes were unintentional…but then there is the third figure:


Here its a bit hard to tell what is going on because not all the information needed to decipher the figure is provided. For example, its not clear how the “expression of exons” was computed or measured for the RNA-Seq experiment. I suspect that as with the previous figure, read numbers were not normalized by length of exons, and moreover spliced reads (and other possibly informative reads from transcripts) were ignored. In other words, I don’t really believe the result.

Having said this, it is true that expression arrays can have an advantage in measuring exon expression, because an array measurement is absolute (as opposed to the relative quantification that is all that is possible with RNA-Seq). Array signal is based on hybridization, and it is probably a reasonable assumption that some minimum amount of RNA triggers a signal, and that this amount is independent of the remainder of the RNA in an experiment. So arrays can (and in many cases probably do) have advantages over RNA-Seq.

There are a few papers that have looked into this, for example the paper “A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae ” by Nookaew et al., Nucleic Acids Research 40 (2012) who find high reproducibility in RNA-Seq and consistency between arrays and RNA-Seq.  Xu et al., in “Human transcriptome array for high-throughput clinical studies“, PNAS 108 (2011), 3707–3712 are more critical, agreeing with Affymetrix that arrays are more sensitive at the exon level. For disease studies, they recommend using RNA-Seq to identify transcripts relevant to the disease, and then screening for those transcripts on patients using arrays.

For the Cuffdiff2 paper describing our new statistical procedures for differential analysis of transcripts and genes, the Rinn lab performed deep RNA-Seq and array expression measurement on the same samples from a HOXA1 knowdown (the experiments included multiple replicates of both the RNA-Seq and the arrays). To my knowledge, it is the deepest and most comprehensive data currently available for comparing arrays and RNA-Seq. Admittedly, the arrays used were not Affymetrix but Agilent SurePrint G3, and the RNA-Seq coverage was deep, however we had two main findings very different from the Affymetrix claims. First, we found overall strong correlation between array expression values and RNA-Seq abundance estimates. The correlation remained strong for large regimes of expression even with very few reads (tested by sequencing fewer reads from a MiSeq). Second, we found that arrays were missing differentially expressed transcripts, especially at low abundance levels. In other words, we found RNA-Seq to have higher resolution. The following figure from our paper made the case (note the overall Spearman Correlation was 0.86):


There are definitely continued applications for arrays. Both in high-throughput screening applications (as suggested in the Xu et al. paper), and also in the development of novel assays. For example Mercer et al. “Targeted rNA sequencing reveals the deep complexity of the human transcriptome“, Nature Biotechnology 30 (2012) 99–104  show how to couple capture (with arrays) with RNA-Seq to provide ultra deep sequencing in subsets of the transcriptome. So its not yet the time to write off arrays. But RNA-Seq has many applications of its own. For example the ability to better detect allele-specific expression, the opportunity to identify RNA-DNA differences (and thereby study RNA editing), and the ability to study expression in non-model organisms where genomes sequences are incomplete and annotations poor. For all these reasons I’m betting on RNA-Seq.

Blog Stats

%d bloggers like this: