You are currently browsing the category archive for the ‘expository’ category.

I recently published a paper on the bioRxiv together with Vasilis Ntranos, Lynn Yi and Páll Melsted on Identification of transcriptional signatures for cell types from single-cell RNA-Seq. The contributions of the paper can be summed up as:

1. The simple technique of logistic regression, by taking advantage of the large number of cells assayed in single-cell RNA-Seq experiments, is much more effective than current approaches at identifying marker genes for clusters of cells.
2. The simplest single-cell RNA-Seq data, namely 3′ single-end reads produced by technologies such as Drop-Seq or 10X, can distinguish isoforms of genes.
3. The simple idea of GDE provides a unified perspective on DGE, DTU and DTE.

These simple, simple and simple ideas are so obvious that of course anyone could have discovered them, and one might be tempted to go so far as to say that even if people didn’t explicitly write them down, they were basically already known. After all, logistic regression was published by David Cox in 1958, and who didn’t know that there are many 3′ unannotated UTRs in the human genome? As for DGE, DTU and DTE (and DTE->G and DTE+G) I mean who doesn’t get these basic concepts? Indeed, after reading our paper someone remarked that one of the key results “was already known“, presumably because the successful application of logistic regression as a gene differential expression method for single-cell RNA-Seq follows from the fact that Šidák aggregation fails for differential gene expression in bulk RNA-Seq.

The “was already known” comment reminded me of a recent blog post about the dirty secret of mathematics. In the post, the author begins with the following math problem: Without taking your pencil off the paper/screen, can you draw four straight lines that go through the middle of all of the dots?

The problem may not yield immediately (try it!) but the solution is obvious once presented. This is a case of the solution requiring a bit of out-of-the-box thinking, leading to a perspective on the problem that is obvious in retrospect. In the Ntranos, Yi et al. paper, the change in perspective was the realization that “Instead of the traditional approach of using the cell labels as covariates for gene expression, logistic regression incorporates transcript quantifications as covariates for cell labels”. It’s no surprise the “was already known” reaction reared it’s head in this case. It’s easy to convince oneself, after the fact, that the “obvious” idea was in one’s head all along.

The egg of Columbus is an apocryphal tale about ideas that seem trivial after the fact. The story originates from the book “History of the New World” by Girolamo Benzoni, who wrote that Columbus, upon upon being told that his journey to the West Indies was unremarkable and that Spain “would not have been devoid of a man who would have attempted the same” had he not undertaken the journey, replied

“Gentlemen, I will lay a wager with any of you, that you will not make this egg stand up as I will, naked and without anything at all.” They all tried, and no one succeeded in making it stand up. When the egg came round to the hands of Columbus, by beating it down on the table he fixed it, having thus crushed a little of one end”

The story makes a good point. Discovery of the Caribbean in the 6th millennium BC was certainly not a trivial accomplishment even if it was obvious after the fact. The egg trick, which Columbus would have learned from the Amerindians who first brought chickens to the Americas, is a good metaphor for the discovery.

There are many Amerindian eggs in mathematics, which has its own apocryphal story to make the point: A professor proving a theorem during a lecture pauses to remark that “it is obvious that…”, upon which she is interrupted by a student asking if that’s truly the case. The professor runs out of the classroom to a nearby office,  returning after several minutes with a notepad filled with equations to exclaim “Why yes, it is obvious!” But even first-rate mathematicians can struggle to accept Amerindian eggs as worthy contributions, frequently succumbing to the temptation of dismissing others’ work as obvious. One of my former graduate school mentors was G.W. Peck, a math professor who created a pseudonym for the express purpose of publishing his Ameridian eggs in a way that would reduce unintended embarrassment for those whose work he was improving on in in “trivial ways”. G.W. Peck has an impressive publication record.

Bioinformatics is not very different from mathematics; the literature is populated with many Amerindian eggs. My favorite example is the Smith-Waterman algorithm, an algorithm for local alignment published by Temple Smith and Michael Waterman in 1981. The Smith-Waterman algorithm is a simple modification of the Needleman-Wunsch algorithm:

The table above shows the differences. That’s it! This table made for a (highly cited) paper. Just initialize the Needleman-Wunsch algorithm with zeroes instead of a gap penalty, set negative scores to 0, trace back from the highest score. In fact, it’s such a minor modification that when I first learned the details of the algorithm I thought “This is obvious! After all, it’s just the Needleman-Wunsch algorithm. Why does it even have a name?! Smith and Waterman got a highly cited paper?! For this?!” My skepticism lasted only as long as it took me to discover and read Peter Sellers’ 1980 paper attempting to solve the same problem. It’s a lot more complicated, relying on the idea of “inductive steps”, and requires untangling mysterious diagrams such as:

The Smith-Waterman solution was clever, simple and obvious (after the fact). Such ideas are a hallmark of Michael Waterman’s distinguished career. Consider the Lander-Waterman model, which is a formula for the expected number of contigs in a shotgun sequencing experiment:

$E(contigs) = Ne^{-R}.$

Here N is the number of reads sequenced and R=NL/G is the “redundancy” (reads * fragment length / genome length). At first glance the Lander-Waterman “model” is just a formula arising from the Poisson distribution! It was obvious… immediately after they published it. The Pevzner-Tang-Waterman approach to DNA assembly is another good example. It is no coincidence that all of these foundational, important and impactful ideas have Waterman in their name.

Looking back at my own career, some of the most satisfying projects have been Amerindian eggs, projects where I was lucky to participate in collaborations leading to ideas that were obvious (after the fact). Nowadays I know I’ve hit the mark when I receive the most authentic of compliments: “your work is trivial!” or “was widely known in the field“, as I did recently after blogging about plagiarism of key ideas from kallisto. However I’m still waiting to hear the ultimate compliment: “everything you do is obvious and was already known!”

(Click “read the rest of this entry” to see the solution to the 9 dot problem.)

The development of microarray technology two decades ago heralded genome-wide comparative studies of gene expression in human, but it was the widespread adoption of RNA-Seq that has led to differential expression analysis becoming a staple of molecular biology studies. RNA-Seq provides measurements of transcript abundance, making possible not only gene-level analyses, but also differential analysis of isoforms of genes. As such, its use has necessitated refinements of the term “differential expression”, and new terms such as “differential transcript expression” have emerged alongside “differential gene expression”. A difficulty with these concepts is that they are used to describe biology, statistical hypotheses, and sometimes to describe types of methods. The aims of this post are to provide a unifying framework for thinking about the various concepts, to clarify their meaning, and to describe connections between them.

To illustrate the different concepts associated to differential expression, I’ll use the following example, consisting of a comparison of a single two-isoform gene in two conditions (the figure is Supplementary Figure 1 in Ntranos, Yi et al. Identification of transcriptional signatures for cell types from single-cell RNA-Seq, 2018):

The isoforms are labeled primary and secondary, and the two conditions are called “A” and “B”. The black dots labeled conditions A and B have x-coordinates $x_A$ and $x_B$ corresponding to the abundances of the primary isoform in the respective conditions, and y-coordinates $y_A$ and $y_B$ corresponding to the abundance of the secondary isoforms. In data from an experiment the black dots will represent the mean level of expression of the constituent isoforms as derived from replicates, and there will be uncertainty as to their exact location. In this example I’ll assume they represent the true abundances.

### Biology

Below is a list of terms used to characterize changes in expression:

Differential transcript expression (DTE) is change in one of the isoforms. In the figure, this is represented (conceptually) by the two red lines along the x- and y-axes respectively. Algebraically, one might compute the change in the primary isoform by $x_B-x_A$ and the change in the secondary isoform by $y_B-y_A$. However the term DTE is used to denote not only the extent of change, but also the event that a single isoform of a gene changes between conditions, i.e. when the two points lie on a horizontal or vertical line. DTE can be understood to occur as a result of transcriptional regulation if an isoform has a unique transcription start site, or post-transcriptional regulation if it is determined by a unique splicing event.

Differential gene expression (DGE) is the change in the overall output of the gene. Change in the overall output of a gene is change in the direction of  the line $y=x$, and the extent of change can be understood geometrically to be the distance between the projections of the two points onto the line $y=x$ (blue line labeled DGE). The distance will depend on the metric used. For example, the change in expression could be defined to be the total expression in condition B ($x_B+y_B$) minus the change in expression in condition A ($x_A+y_A$), which is $|x_B-x_A+y_B-y_A|$.  This is just the length of the blue line labeled “DGE” given by the $L_1$ norm. Alternatively, one could consider “DGE” to be the length of the blue line in the $L_2$ norm. As with DTE, DGE can also refer to a specific type of change in gene expression between conditions, one in which every isoform changes (relatively) by the same amount so that the line joining the two points has a slope of 1 (i.e. is angled at 45°). DGE can be understood to be the result of transcriptional regulation, driving overall gene expression up or down.

Differential transcript usage (DTU) is the change in relative expression between the primary and secondary isoforms. This can be interpreted geometrically as the angle between the two points, or alternatively as the length (as given by some norm) of the green line labeled DTU. As with DTE and DGE, DTU is also a term used to describe a certain kind of difference in expression between two conditions, one in which the line joining the two points has a slope of -1. DTU events are most likely controlled by post-transcriptional regulation.

Gene differential expression (GDE) is represented by the red line. It is the amount of change in expression along in the direction of line joining the two points. GDE is a notion that, for reasons explained below, is not typically tested for, and there are few methods that consider it. However GDE is biologically meaningful, in that it generalizes the notions of DGE, DTU and DTE, allowing for change in any direction. A gene that exhibits some change in expression between conditions is GDE regardless of the direction of change. GDE can represent complex changes in expression driven by a combination of transcriptional and post-transcriptional regulation. Note that DGE, DTU and DTE are all special cases of GDE.

If the $L_2$ norm is used to measure length and $DTE_1,DTE_2$ denote DTE in the primary and secondary isoforms respectively, then it is clear that DGE, DTU, DTE and GDE satisfy the relationship

$GDE^2 = DGE^2 + DTU^2 = DTE_1^2 + DTE_2^2.$

### Statistics

The terms DTE, DGE, DTU and GDE have an intuitive biological meaning, but they are also used in genomics as descriptors of certain null hypotheses for statistical testing of differential expression.

The differential transcript expression (DTE) null hypothesis for an isoform is that it did not change between conditions, i.e. $x_A=x_B$ for the primary isoform, or $y_A=y_B$ for the secondary isoform. In other words, in this example there are two DTE null hypotheses one could consider.

The differential gene expresión (DGE) null hypothesis is that there is no change in overall expression of the gene, i.e. $x_A+y_A = x_B+y_B$.

The differential transcript usage (DTU) null hypothesis is that there is no change in the difference in expression of isoforms, i.e. $x_A-y_A = x_B - y_B$.

The gene differential expression (GDE) null hypothesis is that there is no change in expression in any direction, i.e. for all constants $a,b$, $ax_A+by_A = ax_B+by_B$.

The union differential transcript expression (UDTE) null hypothesis is that there is no change in expression of any isoform. That is, that $x_A = y_A$ and $x_B = y_B$ (this null hypothesis is sometimes called DTE+G). The terminology is motivated by $\neg \cup_i DTE_i = \cap_i DTE_i$.

Not that $UDTE \Leftrightarrow GDE$, because if we assume GDE, and set $a=1,b=0$ we obtain DTE for the primary isoform and setting $a=0,b=1$ we obtain DTE for the secondary isoform. To be clear, by GDE or DTE in this case we mean the GDE (respectively DTE) null hypothesis. Furthermore, we have that

$UDTE,GDE \Rightarrow DTE,DGE,DTU$.

This is clear because if $x_A=y_A$ and $x_B=y_B$ then both DTE null hypotheses are satisfied by definition, and both DGE and DTU are trivially satisfied. However no other implications hold, i.e. $DTE \not \Rightarrow DGE,DTU$, similarly $DGE \not \Rightarrow DTE,DTU$, and $DTU \not \Rightarrow DGE, DTE$.

### Methods

The terms DGE, DTE, DTU and GDE also used to describe methods for differential analysis.

A differential gene expression method is one whose goal is to identify changes in overall gene expression. Because DGE depends on the projection of the points (representing gene abundances) to the line y=x, DGE methods typically take as input gene counts or abundances computed by summing transcript abundances $x_A+y_A$ and $x_B+y_B$. Examples of early DGE methods for RNA-Seq were DESeq (now DESeq2) and edgeR. One problem with DGE methods is that it is problematic to estimate gene abundance by adding up counts of the constituent isoforms. This issue was discussed extensively in Trapnell et al. 2013. On the other hand, if the biology of a gene is DGE, i.e. changes in expression are the same (relatively) in all isoforms, then DGE methods will be optimal, and the issue of summed counts not representing gene abundances accurately is moot.

differential transcript expression method is one whose goal is to identify individual transcripts that have undergone DTE. Early methods for DTE were Cufflinks (now Cuffdiff2) and MISO, and more recently sleuth, which improves DTE accuracy by modeling uncertainty in transcript quantifications. A key issue with DTE is that there are many more transcripts than genes, so that rejecting DTE null hypotheses is harder than rejecting DGE null hypotheses. On the other hand, DTE provides differential analysis at the highest resolution possible, pinpointing specific isoforms that change and opening a window to study post-transcriptional regulation. A number of recent examples highlight the importance of DTE in biomedicine (see, e.g., Vitting-Seerup and Sandelin 2017). Unfortunately DTE results do not always translate to testable hypotheses, as it is difficult to knock out individual isoforms of genes.

differential transcript usage method is one whose goal is to identify genes whose overall expression is constant, but where isoform switching leads to changes in relative isoform abundances. Cufflinks implemented a DTU test using Jensen-Shannon divergence, and more recently RATs is a method specialized for DTU.

As discussed in the previous section, none of null hypotheses DGE, DTE and DTU imply any other, so users have to choose, prior to performing an analysis, which type of test they will perform. There are differing opinions on the “right” approach to choosing between DGE, DTU and DTE. Sonseson et al. 2016 suggest that while DTE and DTU may be appropriate in certain niche applications, generally it’s better to choose DGE, and they therefore advise not to bother with transcript-level analysis. In Trapnell et al. 2010, an argument was made for focusing on DTE and DTU, with the conclusion to the paper speculating that “differential RNA level isoform regulation…suggests functional specialization of the isoforms in many genes.” Van den Berge et al. 2017 advocate for a middle ground: performing a gene-level analysis but saving some “FDR budget” for identifying DTE in genes for which the UDTE null hypothesis has been rejected.

There are two alternatives that have been proposed to get around the difficulty of having to choose, prior to analysis, whether to perform DGE, DTU or DTE:

differential transcript expression aggregation (DTE->G) method is a method that first performs DTE on all isoforms of every gene, and then aggregates the resulting p-values (by gene) to obtain gene-level p-values. The “aggregation” relies on the observation that under the null hypothesis, p-values are uniformly distributed. There are a number of different tests (e.g. Fisher’s method) for testing whether (independent) p-values are uniformly distributed. Applying such tests to isoform p-values per gene provides gene-level p-values and the ability to reject UDTE. A DTE->G method was tested in Soneson et al. 2016 (based on Šidák aggregation) and the stageR method (Van den Berge et al. 2017) uses the same method as a first step. Unfortunately, naïve DTE->G methods perform poorly when genes change by DGE, as shown in Yi et al. 2017. The same paper shows that Lancaster aggregation is a DTE->G method that achieves the best of both the DGE and DTU worlds. One major drawback of DTE->G methods is that they are non-constructive, i.e. the rejection of UDTE by a DTE->G method provides no information about which transcripts were differential and how. The stageR method averts this problem but requires sacrificing some power to reject UDTE in favor of the interpretability provided by subsequent DTE.

gene differential expression method is a method for gene-level analysis that tests for differences in the direction of change identified between conditions. For a GDE method to be successful, it must be able to identify the direction of change, and that is not possible with bulk RNA-Seq data. This is because of the one in ten rule that states that approximately one predictive variable can be estimated from ten events. In bulk RNA-Seq, the number of replicates in standard experiments is three, and the number of isoforms in multi-isoform genes is at least two, and sometimes much more than that.

In Ntranos, Yi et al. 2018, it is shown that single-cell RNA-Seq provides enough “replicates” in the form of cells, that logistic regression can be used to predict condition based on expression, effectively identifying the direction of change. As such, it provides an alternative to DTE->G for rejecting UDTE. The Ntranos and Yi GDE methods is extremely powerful: by identifying the direction of change it is a DGE methods when the change is DGE, it is a DTU method when the change is DTU, and it is a DTE method when the change is DTE. Interpretability is provided in the prediction step: it is the estimated direction of change.

### Remarks

The discussion in this post is based on an example consisting of a gene with two isoforms, however the concepts discussed are easy to generalize to multi-isoform genes with more than two transcripts. I have not discussed differential exon usage (DEU), which is the focus of the DEXSeq method because of the complexities arising in genes which don’t have well-defined shared exons. Nevertheless, the DEXSeq approach to rejecting UDTE is similar to DTE->G, with DTE replaced by DEU. There are many programs for DTE, DTU and (especially) DGE that I haven’t mentioned; the ones cited are intended merely to serve as illustrative examples. This is not a comprehensive review of RNA-Seq differential expression methods.

### Acknowledgments

The blog post was motivated by questions of Charlotte Soneson and Mark Robinson arising from an initial draft of the Ntranos, Yi et al. 2018 paper. The exposition was developed with Vasilis Ntranos and Lynn Yi. Valentine Svensson provided valuable comments and feedback.

In 1997 physicist Roger Penrose sued Kimberly-Clark Corporation for infringing on his “Penrose patent” with their Kleenex-Quilted toilet paper. He won the lawsuit but fortunately for lavaphiles the patent has expired leaving much room for aperiodic creativity in the bathroom.

Math is involved in many aspects of house design (two years ago I wrote about how math is even related to something as mundane as the roof), but it is especially important in the design of bathroom floors. The most examined floors in houses are those of bathrooms, as they are stared at for hours on end by pensive thinkers sitting on toilet seats. The best bathroom floors present beautiful tessellations not as mathematical artifact but mathematical artwork, and with this in mind I designed a three colored Penrose tiling for our bathroom a few years ago. This is its story:

Roger Penrose published a series of aperiodic tilings of the plane in the 1970s, famously describing a triplet of related tilings now termed P1, P2 and P3. These tilings turn out to be closely related to tilings in medieval islamic architecture and thus perhaps ought to be called “Iranian tilings” but to be consistent with convention I have decided to stick with the standard “Penrose tilings” in this post.

The tiling P3 is made from two types of rhombic tiles, matched together as desired according to the matching rules (indicated by the colors, or triangle/circle bumps) below:

The result is an aperiodic tiling of the plane, i.e. one without translation symmetry (for those interested, a formal definition is provided here). Such tilings have many interesting and beautiful properties, although a not so-well-known one is that they are 3-colorable. What this means is that each tile can be colored with one of three colors, so that any two adjacent tiles are always colored differently. The proof of the theorem, by Tom Sibley and Stan Wagon, doesn’t really have much to do with the aperiodicity of the rhombic Penrose tiling, but rather with the fact that it is constructed from parallelograms that are arranged so that any pair meet either at a point or along an edge (they call such tilings “tidy”). In fact, they prove that any tidy plane map whose countries are parallelograms is 3-colorable.

The theorem is illustrated below:

This photo is from our guest bathroom. I designed the tiling and the coloring to fit the bathroom space and sent a plan in the form of a figure to Hank Saxe from Saxe-Patterson Inc. in Taos New Mexico who cut and baked the tiles:

Hank mailed me the tiles in groups of “super-tiles”. These were groups of tiles glued together to an easily removable mat to simplify the installation. The tiles were then installed by my friend Robert Kertsman (at the time a general contractor) and his crew.

The final result is a bathroom for thought:

Three years ago Nicolas Bray and I published a post-publication review of the paper “Network link prediction by global silencing of indirect correlations” (Barzel and Barabási, Nature Biotechnology, 2013). Despite our less than positive review of the work, the paper has gone on to garner 95 citations since its publication (source: Google Scholar). In fact, in just this past year the paper has  paper has been cited 44 times, an impressive feat with the result that the paper has become the first author’s most cited work.

Ultimate impact

In another Barabási paper (with Wang and Song) titled Quantifying Long-Term Scientific Impact  (Science, 2013), the authors provide a formula for estimating the total number of citations a paper will acquire during its lifetime. The estimate is

$c^{\infty} = m(e^{\lambda_i-1}),$

where $m$ and $\lambda_i$ are parameters learned from a few years of citation data. The authors call $c^{\infty}$ the ultimate impact because, they explain, “the total number of citations a paper will ever acquire [is equivalent to] the discovery’s ultimate impact”. With 95 citations in 3 years, the Barzel-Barabási “discovery” is therefore on track for significant “ultimate impact” (I leave it as an exercise for the reader to calculate the estimate for $c^{\infty}$ from the citation data). The ultimate impactful destiny of the paper is perhaps no surprise…Barzel and Barabási knew as much when writing it, describing its implication for systems biology as “Overall this silencing method will help translate the abundant correlation data into insights about the system’s interactions” and stating in a companion press release that After silencing, what you are left with is the pre­cise wiring dia­gram of the system… In a sense we get a peek into the black box.”

Drive by citations

Now that three years had passed since the publication of the press release and with the ultimate impact revealed, I was curious to see inside the 95 black boxes opened with global silencing, and to examine the 95 wiring diagrams that were thus precisely figured out.

So I delved into the citation list and examined, paper-by-paper, to what end the global silencing method had been used. Strikingly, I did not find, as I expected, precise wiring diagrams, or even black boxes. A typical example of what I did find is illustrated in the paper Global and portioned reconstructions of undirected complex networks by Xu et al. (European Journal Of Physics B, 2016) where the authors mention the Barzel-Barabási paper only once, in the following sentence of the introduction:

“To address this inverse problem, many methods have been proposed and they usually show robust and high performance with appropriate observations [9,10,11, 12,13,14,15,16,17,18,19,20,21].”

(Barzel-Barabási is reference [16]).

Andrew Perrin has coined the term drive by citations for “references to a work that make a very quick appearance, extract a very small, specific point from the work, and move on without really considering the existence or depth of connection [to] the cited work.” While its tempting to characterize the Xu et al. reference of Barzel-Barabási as a drive by citation the term seems overly generous, as Xu et al. have literally extracted nothing from Barzel-Barabási at all. It turns out that almost all of the 95 citations of Barzel-Barabási are of this type. Or not even that. In some cases I found no logical connection at all to the paper. Consider, for example, the Ph.D. thesis Dysbiosis in Inflammatory Bowel Disease, where Barzel-Barabási, as well as the Feizi et al. paper which Nicolas Bray and I also reviewed, are cited as follows:

The Ribosomal Database Project (RDP) is a web resource of curated reference sequences of bacterial, archeal, and fungal rRNAs. This service also facilitates the data analysis by providing the tools to build rRNA-derived phylogenetic trees, as well as aligned and annotated rRNA sequences (Barzel and Barabasi 2013; Feizi, Marbach et al. 2013).

(Neither papers has anything to do with building rRNA-derived phylogenetic trees or aligning rRNA sequences).

While this was probably an accidental error, some of the drive by citations were more sinister. For example, WenJun Zhang is an author who has cited Barzel-Barabási as

We may use an incomplete network to predict missing interactions (links) (Clauset et al., 2008; Guimera and Sales-Pardo, 2009; Barzel and Barabási, 2013; Lü et al., 2015; Zhang, 2015d, 2016a, 2016d; Zhang and Li, 2015).

in exactly the same way in three papers titled Network Informatics: A new science, Network pharmacology: A further description and Network toxicology: a new science. In fact this author has cited the work in exactly the same way in several other papers which appear to be copies of each other for a total of 7 citations all of which are placed in dubious “papers”. I suppose one may call this sort of thing hit and run citation.

I also found among the 95 citations one paper strongly criticizing the Barzel-Barabási paper in a letter to Nature Biotechnology (the title is Silence on the relevant literature and errors in implementation) , as well as the (to me unintelligible) response by the authors.

In any case, after carefully examining each of the 95 references citing Barzel and Barabási I was able to find only one paper that actually applied global silencing to biological data, and two others that benchmarked it. There are other ways a paper could impact derivative work, for example by virtue of the models or mathematics developed, be of use, but I could not find any other instance where Barzel and Barabási’s work was used meaningfully other than the three citations just mentioned.

When a citation is a citation

As mentioned, two papers have benchmarked global silencing (and also network deconvolution, from Feizi et al.). One was a paper by Nie et al. on Minimum Partial Correlation: An Accurate and Parameter-Free Measure of Functional Connectivity in fMRI. Table 1 from the paper shows the results of global silencing, network deconvolution and other methods on a series of simulations using the measure of c-sensitivity for accuracy:

Table 1 from Nie et al. showing performance of methods for “network cleanup”.

EPC is the “Elastic PC-algorithm” developed by the authors, which they argue is the best method. Interestingly, however, global silencing (GS) is equal to or worse than simply choosing the top entries from the partial correlation matrix (FP) in 19/28 cases- that’s 67% of the time! This is consistent with the results we published in Bray & Pachter 2013. In these simulations network deconvolution performs better than partial correlation, but still only 2/3 of the time. However in another benchmark of global silencing and network deconvolution published by Izadi et al. 2016 (A comparative analytical assay of gene regulatory networks inferred using microarray and RNA-seq datasets) network deconvolution underperformed global silencing. Also network deconvolution was examined in the paper Graph reconstruction Using covariance-based methods by Sulaimanov and Koeppl 2016 who show, with specific examples, that the scaling parameter we criticized in Bray & Pachter 2013 is indeed problematic:

The scaling parameter α is introduced in [Feizi et al. 2013] to improve network deconvolution. However, we show with simple examples that particular choices for α can lead to unwanted elimination of direct edges.

It’s therefore difficult to decide which is worse, network deconvolution or global silencing, however in either case it’s fair to consider the two papers that actually tested global silencing as legitimately citing the paper the method was described in.

The single paper I found that used global silencing to analyze a biological network for biological purposes is A Transcriptional and Metabolic Framework for Secondary Wall Formation in Arabidopsis by Li et al. in Plant Physiology, 2016. In fact the paper combined the results of network deconvolution and global silencing as follows:

First, for the given data set, we calculated the Pearson correlation coefficients matrix Sg×g. Given g1 regulators and g2 nonregulators, with g = g1+g2, the correlation matrix can be modified as

where O denotes the zero matrix, to include biological roles (TF and non-TF genes). We extracted the regulatory genes (TFs) from different databases, such as AGRIS (Palaniswamy et al., 2006), PlnTFDB (Pérez-Rodríguez et al., 2010), and DATF (Guo et al., 2005). We then applied the network deconvolution and global silencing methods to the modified correlation matrix S′. However, global silencing depends on finding the inverse of the correlation matrix that is rank deficient in the case p » n, where p is the number of genes and n is the number of features, as with the data analyzed here. Since finding an inverse for a rank-deficient matrix is an ill-posed problem, we resolved it by adding a noise term that renders the matrix positive definite. We then selected the best result, with respect to a match with experimentally verified regulatory interactions, from 10 runs of the procedure as a final outcome. The resulting distribution of weighted matrices for the regulatory interactions obtained by each method was decomposed into the mixture of two Gaussian distributions, and the value at which the two distributions intersect was taken as a cutoff for filtering the resulting interaction weight matrices. The latter was conducted to avoid arbitrary selection of a threshold value and prompted by the bimodality of the regulatory interaction weight matrices resulting from these methods. Finally, the gene regulatory network is attained by taking the shared regulatory interactions between the resulting filtered regulatory interactions obtained by the two approaches. The edges were rescored based on the geometric mean of the scores obtained by the two approaches.

In light of the benchmarks of global silencing and network deconvolution, and in the absence of analysis of the ad hoc method combining their results, it is difficult to believe that this methodology resulted in a meaningful network. However its citation of the relevant papers is certainly legitimate. Still, the results of the paper, which constitute a crude analysis of the resulting networks, are a far cry from revealing the “precise wiring diagram of the system”. The authors acknowledge this writing

From the cluster-based networks, it is clear that a wide variety of ontology terms are associated with each network, and it is difficult to directly associate a distinct process with a certain transcript profile.

The factor of use correction

The analysis of the Barzel and Barabási citations suggests that, because a citation is not always a citation (thanks to Nicolas Bray for suggesting the title for the post), to reflect the ultimate impact of a paper the quantity $c^{\infty}$ needs to be adjusted. I propose adjustment by the factor

$f^u = \frac{C-d_b}{C},$

where C is the total number of citations of a paper and $d_b$ is the number of drive by citations. The fraction $\frac{d_b}{C}$ is essentially a factor of use correction. It should be possible (and interesting) to develop text analytics algorithms for estimating $d_b$ so as to be able to correct $c^{\infty}$ to  $f^u \cdot c^{\infty}$, and similarly adjusting citations counts, h-indices, impact factors of journals and related metrics. Explicit computation and publication of the factor of use correction for papers would also incentivize authors to reduce or eliminate gratuitous drive by citation practices.

For now I decided to compute the factor of use correction for the Barzel-Barabási paper by generously estimating that $d_b=92$. This yielded $f^u = \frac{3}{95} = 0.0315$. Barabási has an h-index of 117, but applying this factor of use correction to all of his published papers I obtained the result that Barabasi’s factor of use corrected h-index is 30.

What are confusion matrices?

In the 1904 book Mathematical Contributions to the Theory of Evolution, Karl Pearson introduced the notion of contingency tables. Sometime around the 1950s the term “confusion matrix” started to be used for such tables, specifically for 2×2 tables arising in the evaluation of algorithms for binary classification.

Example: Suppose there are 11 items labeled A,B,C,D,E,F,G,H,I,J,K four of which are of the category blue (also to be called “positive”) and eight of which are of the category red (also to be called “negative”). An algorithm called BEST receives as input the objects without their category labels, i.e. just A,B,C,D,E,F,G,H,I,J,K and must rank them so that the top of the ranking is enriched, as much as possible, for blue items. Suppose BEST produces as output the ranking:

1. C
2. E
3. A
4. B
5. F
6. K
7. G
8. D
9. I
10. H
11. J

How good is BEST at distinguishing the blue (positive) from the red (negative) items, i.e. how enriched are they at the top of the list?  The confusion matrix provides a way to organize the assessment. We explain by example: if we declare the top 6 predictions of BEST blue and classify the remainder as red

========
Blue

1. E
2. A
3. B
4. F
5. K
========
Red
6. G
7. D
8. I
9. H
10. J

then the confusion matrix is:

 Predicted category Blue Red True category Blue 3 1 Red 3 4

The matrix shows the number of predictions that are correct (3) as well as the number of blue items that are missing (1). Similarly, it shows the number of red items that were incorrectly predicted to be blue (3), as well as  the number of red items correctly predicted to be red (4). The four numbers have names (true positives, false negatives, false positives and true negatives) and various other numbers can be derived from them that shed light on the quality of the prediction:

The table and the terminology are trivial. All that is involved is simple counting of successes and failures in a binary classification experiment, and computation of derived measures that are obtained by elementary arithmetic. But here is a little secret that I’ve always felt embarrassed to admit… I can never remember the details. For example I find myself forgetting whether “positive predictive value” is really “precision” or how that relates to “specificity”. For many years I thought I was the only one. I had what you might call “confusion syndrome”… But I’ve slowly come to realize that I am not alone.

Confused people

I became confused about confusion matrices from the moment I first encountered them when I started studying computational biology in the mid 1990s. One of the first papers I read at the time was a landmark review by Burset and Guigó (1996) on the evaluation of gene structure prediction programs (the review has now been cited over 800 times). I have highlighted some text from the paper, where the authors explain what “specificity” means.

Not only are there two definitions for specificity on the same page, there is also a typo in Figure 1 (specificity is abbreviated Sn instead of Sp) adding to the confusion.

Burset and Guigó are of course not to blame for the non-standard use of specificity in gene finding. They were merely describing deviant behavior by computational biologists working on gene finding who had apparently decided to redefine a term widely used in statistics and computer science. However lacking context at the time when I was first learning of these terms, I remained confused for years about whether I was supposed to use specificity or the term “precision” for what the gene finding people were computing.

Simple confusion about terms is not the only problem with the confusion matrix. It is filled with (abbreviated) jargon that can be overwhelming. Who can remember TP,TN,FP,FN,TPR,FPR,SN,SP,FDR,PPV,FOR,NPV,FDR,FNR,TNR,SPC,ACC,LR+,LR- etc. etc. etc. ? Even harder is keeping track of the many formulas relating the quantities, e.g. FDR =  1 – PPV or FPR = 1 – SPC. And how is one supposed to gain intuition about a formula described as ACC = (TP + TN) / (TP + FP + FN + TN)?

Confusion matrix confusion remains a problem in computational biology. Consider the recent preprint Creating a universal SNP and small indel variant caller with deep neural networks by Poplin et al. published on December 14, 2016. The paper, from Google/Verily, shows how a deep convolutional neural network can be used to call genetic variation by learning from images of read pileups. This is fascinating and potentially deep innovation. However the authors, experts in machine learning, were confused in the caption of their Figure 2A, incorrectly describing the FDR-Sensitivity plot as a ROC curve. This in turn caused confusion for readers expecting a ROC curve to be a function.

While an FPR-Sensitivity plot is a function an FDR-Sensitivity plot doesn’t have to be. I was confused by this subtlety myself until an author of Poplin et al. kindly sorted me out.

Hence this post, a note for myself to avoid further confusion by utilizing elementary $L_1$ trigonometry to organize the information in confusion (matrices). As we will see shortly, the Poplin et al. Figure 2A is what is more commonly known as a precision-recall (PR) curve, except that the usual axes of a PR plot are flipped, in addition to a reflection which adds further confusion.

ROC curves

The quality of the BEST prediction in the example above can be encoded in a picture as follows:

Figure 1: A false-positive / true-positive plot.

The grid is 4 lines high and 8 lines wide because there are 4 blue objects and 7 red ones. The BEST ranking is a path from the lower left hand corner of the grid (0,0) to the top right hand corner (7,4). The BEST predictions can be “read off” by following the path. Starting with the top ranked object, the path moves one step up if the object is (in truth) blue, and one step to the right if it is (in truth) red. For example, the first prediction of BEST is C which is blue, and therefore the first step is up. Continuing in this way one eventually arrives at (7,4) because each object is somewhere (only once) on BEST’s list. A BEST confusion matrix corresponds to a point in the plot. The confusion matrix  discussed in the introduction thresholding after 6 predictions corresponds to the green point which is the location after 6 steps from (0,0).

The path corresponding to the BEST ranking, when scaled to the unit square, is called a ROC curve. This stands for receiver operator curve (the terminology originates from WWII). Again, each thresholding of the ranking, e.g. the top six as discussed above (green point), corresponds to a single point on the path:

Figure 2: A ROC plot.

In the ROC plot the x-axis is the false positive rate (FPR = # false positives / total number of true negatives) and the y-axis the sensitivity (Sn = #true positives / total number of true positives).

Taxicab geometry

At this point it is useful to think about taxicab geometry (formally known as $L_1$ geometry). In taxicab geometry, the distance between two points on the grid is the number of steps in the shortest path (using edges of the grid) connecting them. In the figure below, a false positive / true positive plot (as shown in Figure 1) is annotated to highlight the connection to trigonometry:

Figure 3: Taxicab confusion.

The terms associated to the confusion matrix can now be understood using taxicab trigonometry as follow:

$cos \theta = \frac{FP}{FP+TP} = FDR$ (false discovery rate),

$sin \theta = \frac{TP}{FP+TP} = PPV$ (positive predictive value),

$cos \varphi = \frac{TN}{TN+FN} = NPV$ (negative predictive value),

$sin \varphi = \frac{FN}{TN+FN} = FOR$ (false omission rate),

Fortunately trigonometry identities are easy in this geometry. Instead of the usual $sin^2 \theta + cos^2 \theta = 1$ we have $|sin \theta| + |cos \theta| = 1$. Thus we see that FDR + PPV = 1 (or PPV = 1 – FDR) and NPV + FOR = 1 (or NPV = 1 -FOR). The false positive rate and true positive rate are seen to just be rescaling of the FP and TP, i.e. $FPR = \frac{FP}{N}$ and $TPR = \frac{TP}{P}$, and the analogies to the identities just described are therefore FPR + SPC = 1 and TPR + FNR = 1.

Revisiting Figure 2A from Poplin et al. we see that FDR is best interpreted as 1 – PPV. PPV is also known as “precision”. The y-axis, sensitivity, is also called “recall”. Thus the curve is a precision-recall curve, although precision-recall curves are typically shown with recall on the x-axis.

Aside from helping to sort out identities, taxicab trigonometry is appealing in other ways. For example the angle $\theta$ in the example of Figure 3 is just $\theta = 1$. In fact in the regime in which the confusion trigonometry takes place $cos \theta = 1 - \frac{1}{2} \theta$ and $sin \theta = \frac{1}{2} \theta$. There is a pedagogical point here: it is useful to identify a quantity such as FDR with directly with the angle subtended by the points (0,0) and (FP,TP). In fact one sees immediately that a confusion matrix is succinctly described by just two quantities (in addition to N and P). These do not necessarily have to be FP and TP, a pair such as FDR and NPV can also suffice.

No longer confused by confusion matrices? Show that

$TP = \frac{N \cdot FDR - (N+P) \cdot FDR \cdot NPV}{FDR-NPV}$.

Probabilistic interpretation

While I think the geometric view of confusion matrices is useful for keeping track of the big picture, there is a probabilistic interpretation of many of the concepts involved that is useful to keep in mind.

For example, the precision associated to a confusion matrix can be interpreted as the probability that a prediction is correct. The identity FDR + PPV = 1 can therefore be thought of probabilistically, with FDR being the probability that a prediction is incorrect.

Most useful among such probabilistic interpretations is the area under a ROC curve (AUROC), which is frequently computed in biology papers. It is the probability that a randomly selected “positive” object is ranked higher than a randomly selected “negative” object by the classifier being evaluated. For example, in the red/blue/BEST example given above, the AUROC (= $\frac{21}{28} = \frac{3}{4}$) is the probability that a randomly selected blue letter is ranked higher than a randomly selected red letter by BEST.

Computational biologists do not all recognize the Kalman filter by name, but they know it in the form of the hidden Markov model (the Kalman filter is a hidden Markov model with continuous latent variables and Gaussian observed variables). I mention this because while hidden Markov models, and more generally graphical models, have had an extraordinary impact on the tools and techniques of high-throughput  biology, one of their primary conceptual sources, the Kalman filter, is rarely credited as such by computational biologists.

Illustration of the Kalman filter (from Wikipedia).

Where the Kalman filter has received high acclaim is in engineering, especially electrical and aeronautical engineering via its applications in control theory and where it has long been a mainstay of the fields. But it was not always so. The original papers, written in the early 1960s by Rudolf Kálmán and colleagues, were published in relatively obscure mechanical engineering journals rather than the mainstream electrical engineering journals of the time. This was because Kálmán’s ideas were initially scoffed at and rejected… literally. Kálmán second paper on the topic, New Results in Linear Filtering and Prediction Theory (with almost 6,000 citations), was rejected at first with a referee writing that “it cannot possibly be true”. The story is told in Grewal and Andrews’ book Kalman Filtering: Theory and Practice Using MATLAB. Of course not only was the Kalman filter theory correct, the underlying ideas were, in modern parlance, transformative and disruptive. In 2009 Rudolf Kálmán received the National Medal of Science from Barack Obama for his contribution. This is worth keeping in mind not only when receiving rejections for submitted papers, but also when writing reviews.

Rudolf Kálmán passed away at the age of 86 on Saturday July 2nd 2016.

In the 1930s the pseudonym Nicolas Bourbaki was adopted by a group of French mathematicians who sought to axiomatize and formalize mathematics in a rigorous manner. Their project greatly influenced mathematics in the 20th century and contributed to its present “definition -> theorem -> proof” format (Bourbaki went a bit overboard with abstraction and generality and eventually the movement petered out).

In biology definitions seem to be hardly worth the trouble. For example, the botanist Willhelm Johannsen coined the term “gene” in 1909 but there is still no universal agreement on a “definition”.  How could there be? Words associated with biological processes or phenomena that are being understood in a piecemeal fashion naturally change their meaning over time (the philosopher Joseph Woodger did try to axiomatize biology in his 1938 book “The Axiomatic Method in Biology” but his attempt, with 53 axioms, was ill fated).

Bioinformatics occupies a sort of murky middle-ground as far as definitions go. On the one hand, mathematics underlies many of the algorithms and statistical models that are used so there is a natural tendency to formalize concepts, but just when things seem neat and tidy and the lecture begins with “Let the biological sequence be a string on an alphabet $\Sigma$ of size 4 or 20″, biology comes along and rips the blanket off the bed (there are 22 proteinogenic amino acids, molecules may be single or double stranded, they could be circular, bases may be chemically modified, and so on and so forth…).

In some cases things have really gotten out of hand. The words “alignment” and “mapping” appear, superficially, to refer to precise objects and procedures, but they probably have more meanings than the verb “run” (which is in the hundreds). To just list the meanings would be a monumental (impossible?) task. I will never attempt it. However I recently came across some twitter requests for clarification of read alignment and mapping, and I thought explaining those terms was perhaps a more tractable problem.

Here is my attempt at a chronological clarification:

1998: The first use of the term “read alignment” in a paper (in the context of genomics) is, to my knowledge, in “Consed: A Graphical Tool for Sequence Finishing” by D. Gordon, C. Abajian and P. Green. The term “read alignment” in the paper was used in the context of the program phrap, and referred to the multiple alignment of reads to a sequence assembled from the reads, so as to generate a consensus sequence with reduced error. The output looked something like this:

Figure 1 from the Consed paper.

To obtain multiple alignments phrap required a “model”. Roughly speaking, a model is an objective scoring function along with parameters, and, although not discussed in the Consed paper, that model was crucial to the functionality and performance of phrap. The phrap model included not just mismatches of bases but also insertions and deletions. Although phrap was designed to work with Sanger reads, there is conceptually not much difference between the way “read alignment” was used in the Consed paper and the way it is sometimes used now with “next-generation” reads. In fact, the paper “Correcting errors in short reads by multiple alignments” by L. Salmela and J. Schröder, 2011 does much of the same thing (it would have been nice if they cited phrap, although in their defense phrap was never published; Consed was about the visualization tool).

For the purposes of defining “read alignment” it is worthwhile to consider what I think were the four key ingredients of phrap:

• An underlying notion of what a read alignment is. Without being overly formal I think its useful to consider the form of the phrap input/output . This is my own take on it. First, there was a set of reads $\mathcal{F}$ with each read $F \in \mathcal{F}$ consisting of a sequence so that $F$ is an ordered set $F =\{\sigma_1, \ldots, \sigma_l\}$ (for simplicity I make the assumption that the reads were single end, each read was of the same length l, and each sequence element $\sigma \in \{A,C,G,T\}$, and of course I’ve ignored the reverse strand; these assumptions and restrictions can and should be relaxed but I work with them here for simplicity). Next, there was a set of target sequences (contigs, to be precise) that I denote by  $\mathcal{T}$ with $T \in \mathcal{T}$ . These each consisted of a sequence, again an ordered set, $T = \{\sigma^T_1,\ldots,\sigma^T_{|T|}\}$. A “read alignment” in phrap consisted of a pair of maps: first $\psi:\mathcal{F} \rightarrow \mathcal{T} \cup \{\emptyset\}$ specifying for each read the contig it maps to (or the mapping to $\{\emptyset\}$ denoting an unaligned read), and also a set of additional maps $\mathcal{L}_F:F \rightarrow \psi(F) \cup \emptyset$ assigning to each base in each read a corresponding base (or gap in the case of $\{\emptyset\}$) in the contig the read mapped to.
• A model: Given the notion of read alignment, a model was needed to be able to choose between different alignments. In phrap, the multiple alignments depended on scores for mismatches and gaps (the defaults were a score of 11 for a match, -2 for a mismatch, -4 for initiating a gap, -3 for extending a gap). In general, a model provides an approach to distinguishing and choosing between the read alignments as (roughly) defined above.
• An algorithm for read alignment. Given a model there was a need for an algorithm to actually find high scoring alignments. In phrap the algorithm consisted of performing a banded Smith-Waterman alignment using the model (roughly) specified above. The Smith-Waterman algorithm produced alignments $\mathcal{L}_F$ that were global, i.e. order preserving between the elements of $F$ and $\psi(F)$.
• Data structures for enabling efficient alignment. With large numbers of targets and many reads, there was a need for phrap to efficiently prune the search for read alignments. Running the banded Smith-Waterman alignment algorithm on all possible places a read could align to would have been intractable. To address this problem,  phrap made use of a hash to quickly find exact matches from parts of the read to targets, and then ran the Smith-Waterman banded alignment only on the resulting credible locations. Statistically, this could be viewed as a filter for very low probability alignments.

The four parts of phrap are present in most read alignment programs today. But the details can be very different. More on this in a moment.

2004: Six years after Consed paper the term “read alignment” shows up again in the paper “Comparative genome assembly” by Pop et al. In the paper the term “read alignment” was not defined but rather described via a process. Reads from obtained via shotgun sequencing of the genome of one organism were “aligned” to a reference genome from another closely related organism using a program called MUMmer (the goal being to use proximity of alignments to assist in assembly). In other words, the “alignments” were the output of MUMmer, followed by a post-processing with a series of heuristics to deal with ambiguous alignments. It is instructive to compare the alignments to those of phrap in terms of the four parts discussed above. First, in the formulation of Pop et al. the alignments were between a single read and a genome, not multiple alignments of sets of reads to numerous contigs. In the language above the map $\psi$ was much simpler in Pop et al., with the range consisting only of chromosomes in a genome rather than the thousands of contigs in an assembly. Second, although MUMmer utilized an implicit model for alignment, the model was of a very different meaning. Reads from one organism were being aligned to sequences from another, a task fundamentally different than aligning reads back to sequences they originated from (albeit with error). In other words, “alignment” in the Pop et al. paper referred to identification of homology. This opens up a whole other bag of worms that I wrote about in a paper with Colin Dewey in 2006 and that I will not open here. In terms of data structures, MUMmer made use of a suffix tree rather than a hash, a distinction that is important in that suffix trees provide a much richer possibility for search (MUM stands for maximum unique match, the kind of thing suffix trees are useful for finding and that MUMmer took advantage of). A lot of the research in read alignment has centered on developing efficient and compact data structures for string search.

Another development in the year 2004 was the appearance of the term “read mapping”. In the paper “Pash: efficient genome-scale anchoring by positional hashing” by Kalafus et al., hashing of k-mers was used to “anchor” reads. The word “anchoring” was and still is somewhat ambiguous as to its meaning, but in the Pash context it referred to the identification of positions of exact matching k-mers within a read on a target sequence. In other words, Pash could be considered to provide only partial alignments, because some bases within the read would not be aligned at all. Formally, one would say that the functions $\mathcal{L}_F$ had as their domain not the full reads $\mathcal{F}$ but rather strict subsets $\mathcal{S} \subset F$ where $\mathcal{S}$ is nonempty. For this reason, in the Pash paper the authors suggest post-processing their output with base-pair level sequence alignment programs popular at the time (e.g. AVID, LAGAN) to fully “align” the reads (they could have also suggested performing a banded Smith-Waterman alignment like phrap).

2006: The first sequencer manufactured by Solexa, the Genome Analyzer was launched, and following it the ELAND program for “read alignment” was provided  to customers buying the machine. This was, in my opinion, a landmark even in “read alignment” because with next-generation sequencing and ELAND read alignment became mainstream, a task to be performed in the hands of individual users, rather than just a step among many in complex genome sequencing pipelines manufactured in large genome sequencing centers. In the initial release of ELAND, the program performed “ungapped read alignment”, which in the language of this post means that the maps $\mathcal{L}_F$  had the property that if  $\mathcal{L}_F(i) = \sigma^{\psi(F)}_{j}$ then $\mathcal{L}_F(i+1) = \sigma^{\psi(F)}_{j+1}$, i.e. there were no gaps in the read alignments.

2009: With two programs published back-to-back in the same year, BWA (“Fast and accurate alignment with the Burrows-Wheeler transform“, Heng Li and Richard Durbin)  and Bowtie (“Ultrafast and memory-efficient alignment of short DNA sequences to the human genome” by Ben Langmead, Cole Trapnell, Mihai Pop and Steven L Salzberg) read alignment took a big step forward with the use of more sophisticated data structures for rapid string matching.  BWA and Bowtie were based on the Burrows-Wheeler transform, and specifically innovations with the way it was used, to speed up read alignment. In the case of Bowtie, the alignments were ungapped just as originally with ELAND, although Bowtie2 published in 2012 incorporated gaps into the model.

Also of note is that by this time the terms “read alignment” and “read mapping” had become interchangeable. The BWA and Bowtie papers both used both terms, as did many other papers. Moreover, the terms started to take on a specific meaning in terms of the notion of “alignment”. They were referring to alignment of reads to a target genome, and moreover alignment of individual bases to specific positions in the target genome (i.e. the maps $\mathcal{L}_F$). This was in large part due to the publication of the sequence alignment format also in 2009 in Heng Li et al.‘s important paper on SAMtools.

Finally, 2009 was the year the notion that “spliced read alignment” was introduced. “Spliced alignment” was a concept around since the mid 90s (see, e.g. “Gene recognition by spliced alignment” by MS Gelfand, AA Mironov and PA Pevzner). The idea was to align cDNAs to a genome, requiring alignments allowing for long gaps (for the introns). In the paper “Optimal spliced alignments of short sequencing reads” by Fabio De Bona et al. the authors introduced the tool QPalma for this purpose, building on a spliced alignment tool from the group called Palma. Here the idea was to extend “spliced alignment” to reads, and while QPalma was never widely adopted, other programs such as TopHat, GSNAP, Star, etc. have becomes staples of RNA-Seq. Going back to the formalism I provided with the description of phrap, what spliced alignment did was to extend the models used for read alignment to allow for long gaps.

In the following years, there was what I think is fair to characterize as an explosion in the number of read alignment papers and tools. Models were tweaked and new indexing methods introduced, but the fundamental notion of “read alignment” (by now routinely also called “read mapping”) remained the same.

2014: In his Ph.D. thesis published in December of last year, my former student Nicolas Bray had the idea of entirely dropping the requirement for a read alignment to include the maps $\mathcal{L}_F$. In his setup, a read alignment would consist only of specifying for each read which target sequence it originated from, but not where in the target sequence it aligned to. The point, elaborated on in Chapter 2 of Nick’s thesis, was motivated by the observation that for many applications (e.g. RNA-Seq) the sufficient statistics for the relevant models are essentially the information contained in the map $\psi$ alone. He called this notion of read alignment “pseudoalignment”, as it is a fundamentally different concept than what was previously thought of as read alignment.

Of course there is not much to writing down a new definition, and there wouldn’t be much of a point to pseudoalignment it if not for two things: Nick’s realized that pseudoalignments could be obtained much faster than standard read alignments and second, that the large imprints of alignment files needed for the vast amounts of sequence that was being produced were becoming a major bottleneck in genomics. Working on RNA-Seq projects with collaborators, this is how we felt on a regular basis:

Pseudoalignment offered a reprieve from what was becoming an unbearable task, and (I believe) it is going to turn out to be a useful and widely applicable concept.

2015: Earlier this year we posted a preprint to the arXiv detailing a method for performing pseudoalignment based on the ideas in Nick’s thesis, with the additional idea of Páll Melsted to use a new type of indexing scheme based on a de Bruijn graph of the target sequences. The program implementing these ideas is called kallisto. Looking back at phrap, kallisto can be seen to consist of a different notion of read alignment (pseudoalignment), a novel data structure (the target sequence de Bruijn graph), and a new fast algorithm for finding pseudoalignments (based on intersecting sets of k-mer matches). Although we did not discuss the model explicitly in our paper, it is implicit via the algorithm used to find pseudoalignments. One useful thing to keep in mind is that the nature of the index of kallisto allows for extracting more than a pseudoalignment. If desired, one can obtain positions within the targets for some k-mers contained in a query read, or even strand information, and in fact such data is used in some parts of kallisto. When such information is extracted, what kallisto is performing is no longer a pure pseudoalignment, but rather what the Pash authors called a read mapping (although that term is unsuitable because as mentioned it has become synonyms with read alignment). It turns out there is now another word for exactly that concept.

Following the publication of kallisto on the arXiv another preprint was posted on the bioRxiv introducing a program called Salmon and a new term called “lightweight alignment” (Salmon: Accurate, Versatile and Ultrafast Quantification from RNA-Seq Data Using Lightweight-Alignment by Rob Patro, Geet Duggal and Carl Kingsford) . Lightweight alignment as a notion of alignment is exactly what was just referred to: the notion of read mapping in Pash. Utilizing an index and software by Heng Li called the FMD index, Salmon determines “lightweight alignment” chains of what are called maximal exact matches and super-maximal exact matches (not to be confused with the maximal unique matches, or MUMs discuss above).  These provide partial alignment information for the sequences in each read, i.e. just like with Pash one obtains maps $\mathcal{L}_F$ but with the domain sometimes restricted to a nonempty subset $\mathcal{S} \subset \mathcal{F}$ when the MEMs or SMEMs don’t cover the read. Again, I find it useful to refer back and compare to phrap: with Salmon the “read alignment” consists of a different type of index and an algorithm associated with it (making sure the MEMs/SMEMs are consistent). The notion of alignment is distinct from that of phrap by virtue of the restriction of the domain in the maps $\mathcal{L}_F$.  I think a new term for the latter concept since the one Pash used failed to stick, and “lightweight alignment” seems fine.

Recently (just last week) a new read alignment related term was introduced in a bioRxiv preprint. A paper describing a new program called RapMap uses the term quasi-mapping for the procedure implemented in the software. RapMap finds pseudoalignments in the sense described above, with the only difference being the type of index that is used (instead of the target de Bruijn graph, it utilizes a generalized suffix array together with a hash table related to the array). This time I think the introduction of new terminology was unfortunate. Just as with kallisto, RapMap can be used to extract more information than just a pseudoalignment from the index (i.e. a lightweight alignment), and indeed the program does so. But also like kallisto, the speed of RapMap is derived from the idea of efficiently finding pseudoalignments by intersecting sets of k-mer matches. In other words, quasi-mapping is a procedure and not a concept; what the procedure outputs is a lightweight alignment via a pseudoalignment.

tl;dr

Read alignment is what you think it means.

Lightweight alignment is “partial” read alignment (it should have been called read mapping but that was taken). Some, but not necessarily all bases in a read are aligned to specific bases among target sequences.

Pseudoalignment is read assignment to target sequences without any base-level sequence alignment.

Read alignment, lightweight alignment and pseudoalignment are concepts, not algorithms.

Quasi-mapping is a procedure and not a concept. The procedure outputs a lightweight alignment via a pseudoalignment.

Disclaimer: I’ve almost certainly gotten some dates wrong, missed some papers I should have cited, and omitted some crucial developments in read alignment that I should have included. Please let me know if you think that is the case by commenting and I’ll update the post as needed.

Last year I came across a wonderful post on the arXiv, a paper titled A new approach to enumerating statistics modulo n written by William Kuszmaul while he was a high school student participating in the MIT Primes Program for Research in Mathematics. Among other things, Kuszmaul solved the problem of counting the number of subsets of the n-element set $\{1,2,\ldots,n\}$ that sum to k mod m.

This counting problem is related to beautiful (elementary) number theory and combinatorics, and is connected to ideas in (error correcting) coding theory and even computational biology (the Burrows Wheeler transform). I’ll tell the tale, but first explain my personal connection to it, which is the story of  my first paper that wasn’t, and how I was admitted to graduate school:

In 1993 I was a junior (math major) at Caltech and one of my best friends was a senior, Nitu Kitchloo, now an algebraic topologist and professor of mathematics at Johns Hopkins University. I had a habit of asking Nitu math questions, and he had a habit of solving them. One of the questions I asked him that year was:

How many subsets of the n-element set $\{1,2,\ldots,n\}$ sum to 0 mod n?

I don’t remember exactly why I was thinking of the problem, but I do remember that Nitu immediately started looking at the generating function (the polynomial whose coefficients count the number of subsets for each n) and magic happened quickly thereafter. We eventually wrote up a manuscript whose main result was the enumeration of $N_n^k$, the number of subsets of $\{1,2,\ldots,n\}$ whose elements sum to k mod n. The main result was

$N_n^k = \frac{1}{n} \sum_{s|n; s \,odd}2^{\frac{n}{s}}\frac{\varphi(s)}{\varphi\left(\frac{s}{(k,s)}\right)}\mu\left(\frac{s}{(k,s)} \right)$.

We were particularly tickled that the formula contained both Euler’s totient function and the Möbius function. It seemed nice. So we decided to submit our little result to a journal.

By now months had passed and Nitu had already left Caltech for graduate school. He left me to submit the paper and I didn’t know where, so I consulted the resident combinatorics expert (Rick Wilson) who told me that he liked the result and hadn’t seen it before, but that before sending it off, just in case, I should should consult with this one professor at MIT who was known to be good at counting. I remember Wilson saying something along the lines of “Richard knows everything”. This was in the nascent days of the web, so I hand wrote a letter to Richard Stanley, enclosed a copy of the manuscript, and mailed it off.

A few weeks later I received a hand-written letter from MIT. Richard Stanley had written back to let us know that he very much liked the result… because it had brought back memories of his time as an undergraduate at Caltech, when he had worked on the same problem! He sent me the reference:

R.P Stanley and M.F. Yoder, A study of Varshamov codes for asymmetric channels, JPL Technical Report 32-1526, 1973.

Also included was a note that said “Please consider applying to MIT”.

Stanley and Yoder’s result concerned single-error-correcting codes for what are known as Z-channels. These are communication links where there is an asymmetry in the fidelity of transmission: 0 is reliably transmitted as 0 (probability 1), but 1 may be transmitted as either a 1 or a 0 (with some probability p). In a 1973 paper A class of codes for asymmetric channels and a problem from the additive theory of numbers published in the IEEE Transactions of Information Theory, R. Varshamov had proposed a single error correcting code for such channels, which was essentially to encode a message by a bit string corresponding to a subset of $\{1,\ldots,n\}$ whose elements sum to d mod (n+1). It’s not hard to see that since zeroes are transmitted faithfully, it would not be hard to detect a single error (and correct it) by summing the elements corresponding to the bit string. Stanley and Yoder’s paper addressed questions related to enumerating the number of codewords. In particular, they were basically working out the solution to the problem Nitu and I had considered. I guess we could have published our paper anyway as we had a few additional results, for example a theorem explaining how to enumerate zero summing subsets of finite Abelian groups, but somehow we never did. There is a link to the manuscript we had written on my website:

N. Kitchloo and L. Pachter, An interesting result about subset sums, 1993.

One generalization we had explored in our paper was the enumeration of what we called $N_{n,m}^k$the number of subsets of the n-element set $\{1,2,\ldots,n\}$ that sum to k mod m. We looked specifically at the problem where $m|n$ and proved that

$N_{n,m}^n = \frac{1}{m}\sum_{s|m; s \, odd} 2^{\frac{n}{s}}\varphi(s)$.

What Kuszmaul succeeded in doing is to extend this result to any m < n, which is very nice for two reasons: first, the work completes the investigation of the question of subset sums (to subsets summing to an arbitrary modulus). More importantly, the technique used is that of thinking more generally about “modular enumeration”, which is the problem of finding remainders of polynomials modulo $x^n-1$. This led him to numerous other applications, including results on q-multinomial and q-Catalan numbers, and to the combinatorics of lattice paths. This is the hallmark of excellent mathematics: a proof technique that sheds light on the problem at hand and many others.

One of the ideas that modular enumeration is connected to is that of the Burrows-Wheeler transform (BWT). The BWT was published as a DEC tech report in 1994 (based on earlier work of Wheeler in 1983), and is a transform of one string to another of the same length. To understand the transform, consider the example of a binary string of length n. The BWT consists of forming a matrix of all cyclic permutations of s (one row per permutation), then sorting the rows lexicographically, and finally outputting the last column of the matrix. For example, the string 001101 is transformed to 110010. It is obvious by virtue of the definition that any two strings that are the equivalent up to circular permutation will be transformed to the same string. For example, 110100 will also transform to 110010.

Circular binary strings are called necklaces in combinatorics. Their enumeration is a classic problem, solvable by Burnside’s lemma, and the answer is that the number of distinct necklaces $C_n$ of length n is given by

$C_n = \frac{1}{n} \sum_{s|n} 2^{\frac{n}{s}} \varphi(s)$.

For odd n this formula coincides with the subset sum problem (for subsets summing to 0 mod n). When n is prime it is easy to describe a bijection, but for general odd a simple combinatorial bisection is elusive (see Richard Stanley’s Enumerative Combinatorics Volume 1 Chapter 1 Problem 105b).

The Burrows-Wheeler transform is useful because it can be utilized for constant time string matching while requiring an index whose size is only linear in the size of the target. It has therefore become an indispensable tool for genomics. I’m not aware of an application of the elementary observation above, but as the Stanley-Yoder, Kitchloo-P., Kuszmaul timeline demonstrates (21 years in between publications)… math moves in decades. I do think there is some interesting combinatorics underlying the BWT, and that its elucidation may turn out to have practical implications. We’ll see.

A final point: it is fashionable to think that biology, unlike math, moves in years. After all, NIH R01 grants are funded for a period of 3–5 years, and researchers constantly argue with journals that publication times should be weeks and not months. But in fact, lots of basic research in biology moves in decades as well, just like in mathematics. A good example is the story of CRISP/Cas9, which began with the discovery of “genetic sandwiches” in 1987.  The follow-up identification and interpretation and of CRISRPs took decades, mirroring the slow development of mathematics. Today the utility of the CRISPR/Cas9 system depends on the efficient selection of guides and prediction of off-target binding, and as it turns out, tools developed for this purpose frequently use the Burrows-Wheeler transform. It appears that not only binary strings can form circles, but ideas as well…

William Kuszmaul  won 3rd place in the 2014 Intel Science Talent Search for his work on modular enumeration. Well deserved, and thank you!

When I was an undergraduate at Caltech I took a combinatorics course from Rick Wilson who taught from his then just published textbook A Course in Combinatorics (co-authored with J.H. van Lint). The course and the book emphasized design theory, a subject that is beautiful and fundamental to combinatorics, coding theory, and statistics, but that has sadly been in decline for some time. It was a fantastic course taught by a brilliant professor- an experience that had a profound impact on me. Though to be honest, I haven’t thought much about designs in recent years. Having kids changed that.

A few weeks ago I was playing the card game Colori with my three year old daughter. It’s one of her favorites.

The game consists of 15 cards, each displaying drawings of the same 15 items (beach ball, boat, butterfly, cap, car, drum, duck, fish, flower, kite, pencil, jersey, plane, teapot, teddy bear), with each item colored using two of the colors red, green, yellow and blue. Every pair of cards contains exactly one item that is colored exactly the same. For example, the two cards the teddy bear is holding in the picture above are shown below:

The only pair of items colored exactly the same are the two beach balls. The gameplay consists of shuffling the deck and then placing a pair of cards face-up. Players must find the matching pair, and the first player to do so keeps the cards. This is repeated seven times until there is only one card left in the deck, at which point the player with the most cards wins. When I play with my daughter “winning” consists of enjoying her laughter as she figures out the matching pair, and then proceeds to try to eat one of the cards.

An inspection of all 15 cards provided with the game reveals some interesting structure:

Every card contains exactly one of each type of item. Each item therefore occurs 15 times among the cards, with fourteen of the occurrences consisting of seven matched pairs, plus one extra. This is a type of partially balanced incomplete block design. Ignoring for a moment the extra item placed on each card, what we have is 15 items, each colored one of seven ways (v=15*7=105). The 105 items have been divided into 15 blocks (the cards), so that b=15. Each block contains 14 elements (the items) so that k=14, and each element appears in two blocks (r=2). If every pair of different (colored) items occurred in the same number of cards, we would have a balanced incomplete block design, but this is not the case in Colori. Each item occurs in the same block as 26 (=2*13) other items (we are ignoring the extra item that makes for 15 on each card), and therefore it is not the case that every pair of items occurs in the same number of blocks as would be the case in a balanced incomplete block design. Instead, there is an association scheme that provides extra structure among the 105 items, and in turn describes the way in which items do or do not appear together on cards. The association scheme can be understood as a graph whose nodes consist of the 105 items, with edges between items labeled either 0,1 or 2. An edge between two items of the same type is labeled 0, edges between different items that appear on the same card are labeled 1, and edges between different items that do not appear on the same card are labeled 2. This edge labeling is called an “association scheme” because it has a special property, namely the number of triangles with a base edge labeled k, and other two edges labeled i and respectively is  dependent only on i,j and k and not on the specific base edge selected. In other words, there is a special symmetry to the graph. Returning to the deck of cards, we see that every pair of items appears in the same card exactly 0 or 1 times, and the number depends only on the association class of the pair of objects. This is called a partially balanced incomplete block design.

The author of the game, Reinhard Staupe, made it a bit more difficult by adding an extra item to each card making the identification of the matching pair harder. The addition also ensures that each of the 15 items appears on each card. Moreover, the items are permuted in location on the cards, in an arrangement similar to a latin square, making it hard to pair up the items. And instead of using 8 different colors, he used only four, producing the eight different “colors” of each item on the cards by using pairwise combinations of the four.  The yellow-green two-colored beach balls are particularly difficult to tell apart from the green-yellow ones. Of course, much of this is exactly the kind of thing you would want to do if you were designing an RNA-Seq experiment!

Instead of 15 types of items, think of 15 different strains of mice.  Instead of colors for the items, think of different cellular conditions to be assayed. Instead of one pair for each of seven color combinations, think of one pair of replicates for each of seven cellular conditions. Instead of cards, think of different sequencing centers that will prepare the libraries and sequence the reads. An ideal experimental setup would involve distributing the replicates and different cellular conditions across the different sequencing centers so as to reduce batch effect. This is the essence of part of the paper Statistical Design and Analysis of RNA Sequencing Data by Paul Auer and Rebecca Doerge. For example, in their Figure 4 (shown below) they illustrate the advantage of balanced block designs to ameliorate lane effects:

Figure 4 from P. Auer and R.W. Doerge’s paper Statistical Design and Analysis of RNA Sequencing Data.

Of course the use of experimental designs for constructing controlled gene expression experiments is not new. Kerr and Churchill wrote about the use of combinatorial designs in Experimental Design for gene expression microarrays, and one can trace back a long chain of ideas originating with R.A. Fisher. But design theory seems to me to be a waning art insofar as molecular biology experiments are concerned, and it is frequently being replaced with biological intuition of what makes for a good control. The design of good controls is sometimes obvious, but not always. So next time you design an experiment, if you have young kids, first play a round of Colori. If the kids are older, play Set instead. And if you don’t have any kids, plan for an extra research project, because what else would you do with your time?

I’m a (50%) professor of mathematics and (50%) professor of molecular & cell biology at UC Berkeley. There have been plenty of days when I have spent the working hours with biologists and then gone off at night with some mathematicians. I mean that literally. I have had, of course, intimate friends among both biologists and mathematicians. I think it is through living among these groups and much more, I think, through moving regularly from one to the other and back again that I have become occupied with the problem that I’ve christened to myself as the ‘two cultures’. For constantly I feel that I am moving among two groups- comparable in intelligence, identical in race, not grossly different in social origin, earning about the same incomes, who have almost ceased to communicate at all, who in intellectual, moral and psychological climate have so little in common that instead of crossing the campus from Evans Hall to the Li Ka Shing building, I may as well have crossed an ocean.1

I try not to become preoccupied with the two cultures problem, but this holiday season I have not been able to escape it. First there was a blog post by David Mumford, a professor emeritus of applied mathematics at Brown University, published on December 14th. For those readers of the blog who do not follow mathematics, it is relevant to what I am about to write that David Mumford won the Fields Medal in 1974 for his work in algebraic geometry, and afterwards launched another successful career as an applied mathematician, building on Ulf Grenader’s Pattern Theory and making significant contributions to vision research. A lot of his work is connected to neuroscience and therefore biology. Among his many awards are the MacArthur Fellowship, the Shaw Prize, the Wolf Prize and the National Medal of Science. David Mumford is not Joe Schmo.

It therefore came as a surprise to me to read his post titled “Can one explain schemes to biologists?”  in which he describes the rejection by the journal Nature of an obituary he was asked to write. Now I have to say that I have heard of obituaries being retracted, but never of an obituary being rejected. The Mumford rejection is all the more disturbing because it happened after he was invited by Nature to write the obituary in the first place!

The obituary Mumford was asked to write was for Alexander Grothendieck, a leading and towering figure in 20th century mathematics who built many of the foundations for modern algebraic geometry. My colleague Edward Frenkel published a brief non-technical obituary about Grothendieck in the New York Times, and perhaps that is what Nature had in mind for its journal as well. But since Nature is bills itself as “An international journal, published weekly, with original, groundbreaking research spanning all of the scientific disciplines [emphasis mine]” Mumford assumed the readers of Nature would be interested not only in where Grothendieck was born and died, but in what he actually accomplished in his life, and why he is admired for his mathematics. Here is the beginning excerpt of Mumford’s blog post2 explaining why he and John Tate (his coauthor for the post) needed to talk about the concept of a scheme in their post:

John Tate and I were asked by Nature magazine to write an obituary for Alexander Grothendieck. Now he is a hero of mine, the person that I met most deserving of the adjective “genius”. I got to know him when he visited Harvard and John, Shurik (as he was known) and I ran a seminar on “Existence theorems”. His devotion to math, his disdain for formality and convention, his openness and what John and others call his naiveté struck a chord with me.

So John and I agreed and wrote the obituary below. Since the readership of Nature were more or less entirely made up of non-mathematicians, it seemed as though our challenge was to try to make some key parts of Grothendieck’s work accessible to such an audience. Obviously the very definition of a scheme is central to nearly all his work, and we also wanted to say something genuine about categories and cohomology.

What they came up with is a short but well-written obituary that is the best I have read about Grothendieck. It is non-technical yet accurate and meaningfully describes, at a high level, what he is revered for and why. Here it is (copied verbatim from David Mumford’s blog):

Alexander Grothendieck
David Mumford and John Tate

Although mathematics became more and more abstract and general throughout the 20th century, it was Alexander Grothendieck who was the greatest master of this trend. His unique skill was to eliminate all unnecessary hypotheses and burrow into an area so deeply that its inner patterns on the most abstract level revealed themselves — and then, like a magician, show how the solution of old problems fell out in straightforward ways now that their real nature had been revealed. His strength and intensity were legendary. He worked long hours, transforming totally the field of algebraic geometry and its connections with algebraic mber

mber theory. He was considered by many the greatest mathematician of the 20th century.

Grothendieck was born in Berlin on March 28, 1928 to an anarchist, politically activist couple — a Russian Jewish father, Alexander Shapiro, and a German Protestant mother Johanna (Hanka) Grothendieck, and had a turbulent childhood in Germany and France, evading the holocaust in the French village of Le Chambon, known for protecting refugees. It was here in the midst of the war, at the (secondary school) Collège Cévenol, that he seems to have first developed his fascination for mathematics. He lived as an adult in France but remained stateless (on a “Nansen passport”) his whole life, doing most of his revolutionary work in the period 1956 – 1970, at the Institut des Hautes Études Scientifique (IHES) in a suburb of Paris after it was founded in 1958. He received the Fields Medal in 1966.

His first work, stimulated by Laurent Schwartz and Jean Dieudonné, added major ideas to the theory of function spaces, but he came into his own when he took up algebraic geometry. This is the field where one studies the locus of solutions of sets of polynomial equations by combining the algebraic properties of the rings of polynomials with the geometric properties of this locus, known as a variety. Traditionally, this had meant complex solutions of polynomials with complex coefficients but just prior to Grothendieck’s work, Andre Weil and Oscar Zariski had realized that much more scope and insight was gained by considering solutions and polynomials over arbitrary fields, e.g. finite fields or algebraic number fields.

The proper foundations of the enlarged view of algebraic geometry were, however, unclear and this is how Grothendieck made his first, hugely significant, innovation: he invented a class of geometric structures generalizing varieties that he called schemes. In simplest terms, he proposed attaching to any commutative ring (any set of things for which addition, subtraction and a commutative multiplication are defined, like the set of integers, or the set of polynomials in variables x,y,z with complex number coefficients) a geometric object, called the Spec of the ring (short for spectrum) or an affine scheme, and patching or gluing together these objects to form the scheme. The ring is to be thought of as the set of functions on its affine scheme.

To illustrate how revolutionary this was, a ring can be formed by starting with a field, say the field of real numbers, and adjoining a quantity $\epsilon$ satisfying $\epsilon^2=0$. Think of $\epsilon$ this way: your instruments might allow you to measure a small number such as $\epsilon=0.001$ but then $\epsilon^2=0.000001$ might be too small to measure, so there’s no harm if we set it equal to zero. The numbers in this ring are $a+b \cdot \epsilon$ real a,b. The geometric object to which this ring corresponds is an infinitesimal vector, a point which can move infinitesimally but to second order only. In effect, he is going back to Leibniz and making infinitesimals into actual objects that can be manipulated. A related idea has recently been used in physics, for superstrings. To connect schemes to number theory, one takes the ring of integers. The corresponding Spec has one point for each prime, at which functions have values in the finite field of integers mod p and one classical point where functions have rational number values and that is ‘fatter’, having all the others in its closure. Once the machinery became familiar, very few doubted that he had found the right framework for algebraic geometry and it is now universally accepted.

Going further in abstraction, Grothendieck used the web of associated maps — called morphisms — from a variable scheme to a fixed one to describe schemes as functors and noted that many functors that were not obviously schemes at all arose in algebraic geometry. This is similar in science to having many experiments measuring some object from which the unknown real thing is pieced together or even finding something unexpected from its influence on known things. He applied this to construct new schemes, leading to new types of objects called stacks whose functors were precisely characterized later by Michael Artin.

His best known work is his attack on the geometry of schemes and varieties by finding ways to compute their most important topological invariant, their cohomology. A simple example is the topology of a plane minus its origin. Using complex coordinates (z,w), a plane has four real dimensions and taking out a point, what’s left is topologically a three dimensional sphere. Following the inspired suggestions of Grothendieck, Artin was able to show how with algebra alone that a suitably defined third cohomology group of this space has one generator, that is the sphere lives algebraically too. Together they developed what is called étale cohomology at a famous IHES seminar. Grothendieck went on to solve various deep conjectures of Weil, develop crystalline cohomology and a meta-theory of cohomologies called motives with a brilliant group of collaborators whom he drew in at this time.

In 1969, for reasons not entirely clear to anyone, he left the IHES where he had done all this work and plunged into an ecological/political campaign that he called Survivre. With a breathtakingly naive spririt (that had served him well doing math) he believed he could start a movement that would change the world. But when he saw this was not succeeding, he returned to math, teaching at the University of Montpellier. There he formulated remarkable visions of yet deeper structures connecting algebra and geometry, e.g. the symmetry group of the set of all algebraic numbers (known as its Galois group $Gal(\overline{\mathbb{Q}}/\mathbb{Q})$) and graphs drawn on compact surfaces that he called ‘dessin d’enfants’. Despite his writing thousand page treatises on this, still unpublished, his research program was only meagerly funded by the CNRS (Centre Nationale de Recherche Scientifique) and he accused the math world of being totally corrupt. For the last two decades of his life he broke with the whole world and sought total solitude in the small village of Lasserre in the foothills of the Pyrenees. Here he lived alone in his own mental and spiritual world, writing remarkable self-analytic works. He died nearby on Nov. 13, 2014.

As a friend, Grothendieck could be very warm, yet the nightmares of his childhood had left him a very complex person. He was unique in almost every way. His intensity and naivety enabled him to recast the foundations of large parts of 21st century math using unique insights that still amaze today. The power and beauty of Grothendieck’s work on schemes, functors, cohomology, etc. is such that these concepts have come to be the basis of much of math today. The dreams of his later work still stand as challenges to his successors.

Mumford goes on in his blog post to describe the reasons Nature gave for rejecting the obituary. He writes:

The sad thing is that this was rejected as much too technical for their readership. Their editor wrote me that ‘higher degree polynomials’, ‘infinitesimal vectors’ and ‘complex space’ (even complex numbers) were things at least half their readership had never come across. The gap between the world I have lived in and that even of scientists has never seemed larger. I am prepared for lawyers and business people to say they hated math and not to remember any math beyond arithmetic, but this!? Nature is read only by people belonging to the acronym ‘STEM’ (= Science, Technology, Engineering and Mathematics) and in the Common Core Standards, all such people are expected to learn a hell of a lot of math. Very depressing.

I don’t know if the Nature editor had biologists in mind when rejecting the Grothendieck obituary, but Mumford certainly thought so, as he sarcastically titled his post “Can one explain schemes to biologists?” Sadly, I think that Nature and Mumford both missed the point.

Exactly ten years ago Bernd Sturmfels and I published a book titled “Algebraic Statistics for Computational Biology“. From my perspective, the book developed three related ideas: 1. that the language, techniques and theorems of algebraic geometry both unify and provide tools for certain models in statistics, 2. that problems in computational biology are particularly prone to depend on inference with precisely the statistical models amenable to algebraic analysis and (most importantly) 3. mathematical thinking, by way of considering useful generalizations of seemingly unrelated ideas, is a powerful approach for organizing many concepts in (computational) biology, especially in genetics and genomics.

To give a concrete example of what 1,2 and 3 mean, I turn to Mumford’s definition of algebraic geometry in his obituary for Grothendieck. He writes that “This is the field where one studies the locus of solutions of sets of polynomial equations by combining the algebraic properties of the rings of polynomials with the geometric properties of this locus, known as a variety.” What is he talking about? The notion of “phylogenetic invariants”, provides a simple example for biologists by biologists. Phylogenetic invariants were first introduced to biology ca. 1987 by Joe Felsenstein (Professor of Genome Sciences and Biology at the University of Washington) and James Lake (Distinguished Professor of Molecular, Cell, and Developmental Biology and of Human Genetics at UCLA)3.

Given a phylogenetic tree describing the evolutionary relationship among n extant species, one can examine the evolution of a single nucleotide along the tree. At the leaves, a single nucleotide is then associated to each species, collectively forming a single selection from among the $4^n$ possible patterns for nucleotides at the leaves. Evolutionary models provide a way to formalize the intuitive notion that random mutations should be associated with branches of the tree and formally are described via (unknown) parameters that can be used to calculate a probability for any pattern at the leaves. It happens to be the case that for most phylogenetic evolutionary model have the property that the probabilities for leaf patterns are polynomials in the parameters. The simplest example to consider is the tree with an ancestral node and two leaves corresponding to two extant species, say “B” and “M”:

The molecular approach to evolution posits that multiple sites together should be used both to estimate parameters associated with evolution along the tree, and maybe even the tree itself. If one assumes that nucleotides mutate according to the 4-state general Markov model with independent processes on each branch, and one writes $p_{ij}$ for $\mathbb{P}(B=i,M=j)$ where i,j are one of A,C,G,T, then it must be the case that $p_{ij}p_{kl} = p_{il}p_{jk}$. In other words, the polynomial

$p_{ij}p_{kl} - p_{il}p_{jk}=0$.

In other words, for any parameters in the 4-state general Markov model, it has to be the case that when the pattern probabilities are plugged into the polynomial equation above, the result is zero. This equation is none other than the condition for two random variables to be independent; in this case the random variable corresponding to the nucleotide at B is independent of the random variable corresponding to the nucleotide at M.

The example is elementary, but it hints at a powerful tool for phylogenetics. It provides an equation that must be satisfied by the pattern probabilities that does not depend specifically on the parameters of the model (which can be intuitively understood as relating to branch length). If many sites are available so that pattern probabilities can be estimated empirically from data, then there is in principle a possibility for testing whether the data fits the topology of a specific tree regardless of what the branch lengths of the tree might be. Returning to Mumford’s description of algebraic geometry, the variety of interest is the geometric object in “pattern probability space” where points are precisely probabilities that can arise for a specific tree, and the “ring of polynomials with the geometric properties of the locus” are the phylogenetic invariants. The relevance of the ring lies in the fact that if and g are two phylogenetic invariants then that means that $f(P)=0$ and $g(P)=0$ for any pattern probabilities from the model, so therefore $f+g$ is also a phylogenetic invariant because $f(P)+g(P)=0$ for any pattern probabilities from the model (the same is true for $c \cdot f$ for any constant c). In other words, there is an algebra of phylogenetic invariants that is closely related to the geometry of pattern probabilities. As Mumford and Tate explain, Grothendieck figured out the right generalizations to construct a theory for any ring, not just the ring of polynomials, and therewith connected the fields of commutative algebra, algebraic geometry and number theory.

The use of phylogenetic invariants for testing tree topologies is conceptually elegantly illustrated in a wonderful book chapter on phylogenetic invariants  by mathematicians Elizabeth Allman and John Rhodes that starts with the simple example of the two taxa tree and delves deeply into the subject. Two surfaces (conceptually) represent the varieties for two trees, and the equations $f_1(P)=f_2(P)=\ldots=f_l(P)=0$ and $h_1(P)=h_2(P)=\ldots=h_k(P)=0$ are the phylogenetic invariants. The empirical pattern probability distribution is the point $\hat{P}$ and the goal is to find the surface it is close to:

Figure 4.2 from Allman and Rhodes chapter on phylogenetic invariants.

Of course for large trees there will be many different phylogenetic invariants, and the polynomials may be of high degree. Figuring out what the invariants are, how many of them there are, bounds for the degrees, understanding the geometry, and developing tests based on the invariants, is essentially a (difficult unsolved) challenge for algebraic geometers. I think it’s fair to say that our book spurred a lot of research on the subject, and helped to create interest among mathematicians who were unaware of the variety and complexity of problems arising from phylogenetics. Nick Eriksson, Kristian Ranestad, Bernd Sturmfels and Seth Sullivant wrote a short piece titled phylogenetic algebraic geometry which is an introduction for algebraic geometers to the subject. Here is where we come full circle to Mumford’s obituary… the notion of a scheme is obviously central to phylogenetic algebraic geometry. And the expository article just cited is just the beginning. There are too many exciting developments in phylogenetic geometry to summarize in this post, but Elizabeth Allman, Marta Casanellas, Joseph Landsberg, John Rhodes, Bernd Sturmfels and Seth Sullivant are just a few of many who have discovered beautiful new mathematics motivated by the biology, and also have had an impact on biology with algebro-geometric tools. There is both theory (see this recent example) and application (see this recent example) coming out of phylogenetic algebraic geometry. More generally, algebraic statistics for computational biology is now a legitimate “field”, complete with a journal, regular conferences, and a critical mass of mathematicians, statisticians, and even some biologists working in the area. Some of the results are truly beautiful and impressive. My favorite recent one is this paper by Caroline Uhler, Donald Richards and Piotr Zwiernik providing important guarantees for maximum likelihood estimation of parameters in Felstenstein’s continuous character model.

But that is not the point here. First, Mumford’s sarcasm was unwarranted. Biologists certainly didn’t discover schemes but as Felsenstein and Lake’s work shows, they did (re)discover algebraic geometry. Moreover, all of the people mentioned above can explain schemes to biologists, thereby answering Mumford’s question in the affirmative. Many of them have not only collaborated with biologists but written biology papers. And among them are some extraordinary expositors, notably Bernd Sturmfels. Still, even if there are mathematicians able and willing to explain schemes to biologists, and even if there are areas within biology where schemes arise (e.g. phylogenetic algebraic geometry), it is fair to ask whether biologists should care to understand them?

The answer to the question is: probably not. In any case I wouldn’t presume to opine on what biologists should and shouldn’t care about. Biology is enormous, and encompasses everything from the study of fecal transplants to the wood frogs of Alaska. However I do have an opinion about the area I work in, namely genomics. When it comes to genomics journalists write about revolutions, personalized precision medicine, curing cancer and data deluge. But the biology of genomics is for real, and it is indeed tremendously exciting as a result of dramatic improvements in underlying technologies (e.g. DNA sequencing and genome editing to name two). I also believe it is true that despite what is written about data deluge, experiments remain the primary and the best way, to elucidate the function of the genome. Data analysis is secondary. But it is true that statistics has become much more important to genomics than it was even to population genetics at the time of R.A. Fisher, computer science is playing an increasingly important role, and I believe that somewhere in the mix of “quantitative sciences for biology”, there is an important role for mathematics.

What biologists should appreciate, what was on offer in Mumford’s obituary, and what mathematicians can deliver to genomics that is special and unique, is the ability to not only generalize, but to do so “correctly”. The mathematician Raoul Bott once reminisced that “Grothendieck was extraordinary as he could play with concepts, and also was prepared to work very hard to make arguments almost tautological.” In other words, what made Grothendieck special was not that he generalized concepts in algebraic geometry to make them more abstract, but that he was able to do so in the right way. What made his insights seemingly tautological at the end of the day, was that he had the “right” way of viewing things and the “right” abstractions in mind. That is what mathematicians can contribute most of all to genomics. Of course sometimes theorems are important, or specific mathematical techniques solve problems and mathematicians are to thank for that. Phylogenetic invariants are important for phylogenetics which in turn is important for comparative genomics which in turn is important for functional genomics which in turn is important for medicine. But it is the the abstract thinking that I think matters most. In other words, I agree with Charles Darwin that mathematicians are endowed with an extra sense… I am not sure exactly what he meant, but it is clear to me that it is the sense that allows for understanding the difference between the “right” way and the “wrong” way to think about something.

There are so many examples of how the “right” thinking has mattered in genomics that they are too numerous to list here, but here are a few samples: At the heart of molecular biology, there is the “right” and the “wrong” way to think about genes: evidently the message to be gleaned from Gerstein et al.‘s in “What is a gene post ENCODE? History and Definition” is that “genes” are not really the “right” level of granularity but transcripts are. In a previous blog post I’ve discussed the “right” way to think about the Needleman-Wunsch algorithm (tropically). In metagenomics there is the “right” abstraction with which to understand UniFrac. One paper I’ve written (with Niko Beerenwinkel and Bernd Sturmfels) is ostensibly about fitness landscapes but really about what we think the “right” way is to look at epistasis. In systems biology there is the “right” way to think about stochasticity in expression (although I plan a blog post that digs a bit deeper). There are many many more examples… way too many to list here… because ultimately every problem in biology is just like in math… there is the “right’ and the “wrong” way to think about it, and figuring out the difference is truly an art that mathematicians, the type of mathematicians that work in math departments, are particularly good at.

Here is a current example from (computational) biology where it is not yet clear what “right” thinking should be despite the experts working hard at it, and that is useful to highlight because of the people involved: With the vast amount of human genomes being sequenced (some estimates are as high as 400,000 in the coming year), there is an increasingly pressing fundamental question about how the (human) genome should be represented and stored. This is ostensibly a computer science question: genomes should perhaps be compressed in ways that allow for efficient search and retrieval, but I’d argue that fundamentally it is a math question. This is because what the question is really asking, is how should one think about genome sequences related mostly via recombination and only slightly by mutation, and what are the “right” mathematical structures for this challenge? The answer matters not only for the technology (how to store genomes), but much more importantly for the foundations of population and statistical genetics. Without the right abstractions for genomes, the task of coherently organizing and interpreting genomic information is hopeless. David Haussler (with coauthors) and Richard Durbin have both written about this problem in papers that are hard to describe in any way other than as math papers; see Mapping to a Reference Genome Structure and Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (BPWT). Perhaps it is no coincidence that both David Haussler and Richard Durbin studied mathematics.

But neither David Haussler nor Richard Durbin are faculty in mathematics departments. In fact, there is a surprisingly long list of very successful (computational) biologists specifically working in genomics, many of whom even continue to do math, but not in math departments, i.e. they are former mathematicians (this is so common there is even a phrase for it “recovering mathematician” as if being one is akin to alcoholism– physicists use the same language). People include Richard Durbin, Phil Green, David Haussler, Eric Lander, Montgomery Slatkin and many others I am omitting; for example almost the entire assembly group at the Broad Institute consists of former mathematicians. Why are there so many “formers” and very few “currents”? And does it matter? After all, it is legitimate to ask whether successful work in genomics is better suited to departments, institutes and companies outside the realm of academic mathematics. It is certainly the case that to do mathematics, or to publish mathematical results, one does not need to be a faculty member in a mathematics department. I’ve thought a lot about these issues and questions, partly because they affect my daily life working between the worlds of mathematics and molecular biology in my own institution. I’ve also seen the consequences of the separation of the two cultures. To illustrate how far apart they are I’ve made a list of specific differences below:

How then can biology, specifically genomics (or genetics), exist and thrive within the mathematics community? And how can mathematics find a place within the culture of biology?

I don’t know. The relationship between biology and mathematics is on the rocks and prospects are grim. Yes, there are biologists who do mathematical work, and yes, there are mathematical biologists, especially in areas such as evolution or ecology who are in math departments. There are certainly applied mathematics departments with faculty working on biology problems involving modeling at the macroscopic level, where the math fits in well with classic applied math (e.g. PDEs, numerical analysis). But there is very little genomics or genetics related math going on in math departments. And conversely, mathematicians who leave math departments to work in biology departments or institutes face enormous pressure to not focus on the math, or when they do any math at all, to not publish it (work is usually relegated to the supplement and completely ignored). The result is that biology loses out due to the minimal real contact with math– the special opportunity of benefiting from the extra sense is lost, and conversely math loses the opportunity to engage biology– one of the most exciting scientific enterprises of the 21st century. The mathematician Gian-Carlo Rota said that “The lack of real contact between mathematics and biology is either a tragedy, a scandal, or a challenge, it is hard to decide which”. He was right.

The extent to which the two cultures have drifted apart is astonishing. For example, visiting other universities I see the word “mathematics” almost every time precision medicine is discussed in the context of a new initiative, but I never see mathematicians or the local math department involved. In the mathematics community, there has been almost no effort to engage and embrace genomics. For example the annual joint AMS-MAA meetings always boast a series of invited talks, many on applications of math, but genomics is never a represented area. Yet in my Junior level course last semester on mathematical biology (taught in the math department) there were 46 students, more than any other upper division elective class in the math department. Even though I am a 50% member of the mathematics department I have been advising three math graduate students this year, equivalent to six for a full time member, a statistic that probably ranks me among the most busy advisors in the department (these numbers do not even reflect the fact that I had to turn down a number of students). Anecdotally, the numbers illustrate how popular genomics is among math undergraduate and graduate students, and although hard data is difficult to come by my interactions with mathematicians everywhere convince me the trend I see at Berkeley is universal. So why is this popularity not reflected in support of genomics by the math community? And why don’t biology journals, conferences and departments embrace more mathematics? There is a hypocrisy of math for biology. People talk about it but when push comes to shove nobody wants to do anything real to foster it.

Examples abound. On December 16th UCLA announced the formation of a new Institute for Quantitative and Computational Biosciences. The announcement leads with a photograph of the director that is captioned “Alexander Hoffmann and his colleagues will collaborate with mathematicians to make sense of a tsunami of biological data.” Strangely though, the math department is not one of the 15 partner departments that will contribute to the Institute. That is not to say that mathematicians won’t interact with the Institute, or that mathematics won’t happen there. E.g., the Institute for Pure and Applied Mathematics is a partner as is the Biomathematics department (an interesting UCLA concoction), not to mention the fact that many of the affiliated faculty do work that is in part mathematical. But formal partnership with the mathematics department, and through it direct affiliation with the mathematics community, is missing. UCLA’s math department is among the top in the world, and boasts a particularly robust applied mathematics program many of whose members work on mathematical biology. More importantly, the “pure” mathematicians at UCLA are first rate and one of them, Terence Tao, is possibly the most talented mathematician alive. Wouldn’t it be great if he could be coaxed to think about some of the profound questions of biology? Wouldn’t it be awesome if mathematicians in the math department at UCLA worked hard with the biologists to tackle the extraordinary challenges of “precision medicine”? Wouldn’t it be wonderful if UCLA’s Quantitative and Computational biosciences Institute could benefit from the vast mathematics talent pool not only at UCLA but beyond: that of the entire mathematics community?

I don’t know if the omission of the math department was an accidental oversight of the Institute, a deliberate snub, or if it was the Institute that was rebuffed by the mathematics department. I don’t think it really matters. The point is that the UCLA situation is ubiquitous. Mathematics departments are almost never part of new initiatives in genomics; biologists are all too quick to glance the other way. Conversely, the mathematics community has shunned biologists. Despite two NSF Institutes dedicated to mathematical biology (the MBI and NIMBioS) almost no top math departments hire mathematicians working in genetics or genomics (see the mathematics jobs wiki). In the rooted tree in the figure above B can represent Biology and M can represent Mathematics and they truly, and sadly, are independent.

I get it. The laundry list of differences between biology and math that I aired above can be overwhelming. Real contact between the subjects will be difficult to foster, and it should be acknowledged that it is neither necessary nor sufficient for the science to progress. But wouldn’t it be better if mathematicians proved they are serious about biology and biologists truly experimented with mathematics?

Notes:

1. The opening paragraph is an edited copy of an excerpt (page 2, paragraph 2) from C.P. Snow’s “The Two Cultures and The Scientific Revolution” (The Rede Lecture 1959).
2. David Mumford’s content on his site is available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License, and I have incorporated it in my post (boxed text) unaltered according to the terms of the license.
3. The meaning of the word “invariant” in “phylogenetic invariants” differs from the standard meaning in mathematics, where invariant refers to a property of a class of objects that is unchanged under transformations. In the context of algebraic geometry classic invariant theory addresses the problem of determining polynomial functions that are invariant under transformations from a linear group. Mumford is known for his work on geometric invariant theory. An astute reader could therefore deduce from the term “phylogenetic invariants” that the term was coined by biologists.