I’m thrilled to announce that I will be moving to Caltech next year where I will be professor of computational biology!

Some people have asked me why I’m moving. First and foremost, we (my family) feel it is the right move for us as for a variety of reasons that I won’t get into here. For me personally, Caltech represents a unique, special, and extraordinary opportunity because it is an institution that fosters an environment facilitating research and teaching that, inasmuch as possible, is unencumbered by the minutiae of academia. In particular, Caltech is unintimidated by disciplinary boundaries, and enables a culture that I’ve yearned for my whole career. It doesn’t throw hundreds of millions of dollars at a football team (although the basketball team is doing pretty well). Its priorities are aligned with mine.

I’m leaving behind Berkeley, a university I started working at 17 years ago as a visiting assistant professor. I’ll miss Berkeley. I still remember the January 1999 phone call from Prof. Tsit Yuen Lam, announcing my appointment. I was honored to have been invited to conduct research and to teach at one of the world’s great institutions. Berkeley was, and still is, distinguished by it’s mission of providing world-class affordable public education. I can’t think of any university in the world that has done as well in pursuing this noble goal. Consider, for example, that UC Berkeley has almost as many Pell Grant recipients as all eight Ivy League schools combined. But with time, as I was allowed to drop the prefixes in my title, I found myself increasingly aware of the structure, organization and financing of the university. Two numbers that I learned have stuck in my mind: today, state funding comprises only 13% of the budget (likely even lower next year), less than half of what it was when I arrived. At the same time, tuition has increased by over a factor of three during the same time period. The squeeze has harmed the institution not just because of reductions in resources (though there have been many), but also because of the strain placed on the morale and mission of the university. Over time I started to question whether its world-class education was sustainable, and lamented that its affordability was becoming a myth. Over the past two years I’ve become increasingly aware that the reality of the university is at odds with my values. I’m sad for the University of California and for the citizens who are being harmed by the blows it is taking, and very much wish that the state will protect and nurture its education treasure. But I will be rooting for it from the sidelines.

I can’t wait to start at Caltech, and look forward to the next phase of my career!

My doppelgänger, Charlie Eppes, who developed algebraic statistics for computational biology at “CalSci” (Caltech).

My Caltech calculus professor, Tom Apostol, passed away yesterday May 8th 2016. When I arrived in his Math 1b class in the Fall of 1990 I thought, like most of my classmates, that I already knew calculus. I may have known it but I didn’t understand it. Apostol taught me the understanding part.

Apostol’s calculus books, affectionally called “Tommy I” and ‘Tommy II” were not just textbooks for students to memorize but rather mathematical wisdom and beauty condensed into a pair of books intended to transform grade-obsessed freshmen and sophomores into thinking human beings. Most of all, Apostol emphasized the idea that fundamental to mathematics is how one thinks about things, not just what one is thinking about. One of his iconic examples of this was the ice-cream-cone-proof that the focal property of an ellipse is a consequence of its definition as a section of a cone. Specifically, taking as the definition of an ellipse a plane curve obtained by intersecting an inclined plane with a cone


the goal is to both define the two foci, and then to derive the focal point property as illustrated below:


Apostol demonstrated the connection between conic sections and their foci via a proof and picture of Dandelin. His explanation, which I still remember from my freshman year in college, is beautiful (the excerpt below is from his linear algebra book):

Apostol1Apostol2Apostol didn’t invent Dandelin’s spheres but he knew they were “the right way” to think about conic sections, and he figured out “the right way” for each and every one of his explanations. His calculus books are exceptional for their introduction of integration before differentiation, his preference for axiomatic rather than mechanistic definition (e.g. of determinants) and his exercises that are “easy” when the material is understood “in the right way”. His course had a profound influence on my approach not only to mathematics, but to all of my learning in both the sciences and humanities.

One of Apostol’s signature traditions was his celebration of Gauss’ birthday. His classes were always filled with mathematical treats, but on April 30th every year he would hide a cake in the classroom before the students arrived and would serve an edible treat that day instead. Gauss turned 239 just last week. This seems to be a timely moment to take note of that prime number (Apostol was a number theorist) and to eat a slice of cake for Gauss, Apostol, and those who change our lives.



The Journal lmpact Factor (JIF) was first proposed by Eugene Garfield of Institute for Scientific Information (ISI) fame in 1955. It is a journal specific yearly citation measure, defined to be the average number of citations per paper of the papers published in the preceding two years. Obsession with the impact factor in the face of widespread recognition of its shortcomings as a tool for judging the value of science is an unfortunate example of “the tragedy of the commons”.

Leaving aside for a moment the flaws of the JIF, one may wonder whether journals do in fact have any impact? By “impact”one might imagine something along the lines of the simple definition in the Merriam-Webster Dictionary: “to have a strong and often bad effect on (something or someone)” and as an object for the impact one could study the researchers who publish, the scientific community as a whole, or the papers themselves. On the question of impact on papers, common sense suggests that publishing in a high profile journal helps a paper succeed and there is pseudoscience to support that case. However there is little in the way of direct measurement. Twitter to the rescue.

At the end of last year my twitter account was approaching 5,000 followers. Inspired by others, I found myself reflecting on this “milestone” and in anticipation of the event, I started to ponder the scientific utility of amassing such a large numbers of followers. There is, of course, a lot of work being done on natural language processing of twitter feeds, but it struck me that with 5,000 followers I was in a position to use twitter for proactive experimentation rather than just passive mining. Impact factors, followers, and twitter… it was just the right mix for a little experiment…

In my early tweeting days I encountered a minor technical issue with links to papers: it was unclear to me whether I should use link shorteners (and if so which service?) or include direct links to articles in my tweets. I initially thought that using link shorteners would save me characters but I quickly discovered that this was not the case. Eventually, following advice from fellow twitterati, I began tweeting articles only with direct links to the journal websites. Last year, when twitter launched free analytics for all registered users, I started occasionally examining the stats for article tweets, and I began to notice quantitatively  what I had always suspected intuitively: tweets of Cell, Nature and Science (CNS) articles were being circulated much more widely than those of other journals. Having use bit.ly, the natural question to ask was how do tweets of journal articles with the journal names compare to tweets with anonymized links?

Starting in August of 2015, I began occasionally tweeting articles about 5 minutes apart, using the exact same text (the article title or brief description) but doing it once with the article linked via the journal website so that the journal name was displayed in the link and once with an a bit.ly link that revealed nothing about the journal source. Twitter analytics allowed me to see, for each tweet, a number of (highly correlated) tweet statistics, and I settled on measuring the number of clicks on the link embedded in the tweet. By switching the order of named/anonymized tweets I figured I could control for a temporal effect in tweet appearance, e.g. it seemed likely that users would click on the most recent links on their feed resulting in more views/clicks etc. for later tweets identical except for link type . Ideally this control would have been performed by A/B testing but that was not a possibility (see Supplementary Materials and Notes). I did my tweeting manually, generally waiting a few weeks between batches of tweets so that nobody would catch on to what I was doing (and thereby ruin the experiment). I was eventually caught forcing me to end the experiment but not before I squeezed in enough tests to achieve a significant p-value for something.

I hypothesized that twitter users will click on articles when, and only when, the titles or topics reflect research of interest to them. Thus, I expected not to find a difference in analytics between tweets made with journal names as opposed to bit.ly links. Strikingly, tweets of articles from Cell, Nature and Science journals (CNS) all resulted in higher clicks on the journal title rather than the anonymized link (p-value 0.0078). The average effect was a ratio of 2.166 between clicks on links with the journal name in comparison to clicks on bit.ly links. I would say that this number is the real journal impact factor of what are now called the “glamour journals” (I’ve reported it to three decimal digits to be consistent with the practice of most journals in advertising their JIFs). To avoid confusion with the standard JIF, I call my measured impact factor the RIF (relative impact factor).

Untitled 3

One possible objection to the results reported above is that perhaps the RIF reflects an aversion to clicking on bit.ly links, rather than a preference for clicking on (glamour) journal links. I decided to test that by performing the same test (journal link vs. bit.ly link) with PLoS One articles:

Untitled 4

Strikingly, in three out of the four cases tested users displayed an aversion to clicking on PLoS One links. Does this mean that publishing in PLoS One is career suicide? Certainly not (I note that I have published PLoS One papers that I am very proud of, e.g. Disordered Microbial Communities in Asthmatic Airways), but the PLoS One RIF of 0.877 that I measured (average ratio of journal:bit.ly clicks, as explained above) is certainly not very encouraging for those who hope for science to be journal name blind. It also suggests that the RIF of glamour journals does not reflect an aversion to clicking on bit.ly links, but rather an affinity for.. what else to call it but.. glamour.

Academics frequently complain that administrators are at fault for driving researchers to  emphasize JIFs, but at the recent Gaming Metrics meeting I attend UC Davis University Librarian MacKenzie Smith pointed out something which my little experiment confirms: “It’s you!

Supplementary Material and Notes

The journal Nature Communications is not obviously a “glamour journal”, however I included it in that category because the journal link name began nature.com/… Removing the Nature Communications tweet from the glamour analysis increases the glamour journal RIF to 2.264.

The ideal platform for my experiment is an A/B testing setup, and as my former coauthor Dmitry Ryaboy , head of the experimentation team at twitter explains in a blog post, twitter does perform such testing on users for internal purposes. However I could not perform A/B testing directly from my account, hence the implementation of the design described above.

I tried to tweet the journal/bit.ly tweets exactly 5 minutes apart, but once or twice I got distracted reading nonsense on twitter and was delayed by a bit. Perhaps if I’d been more diligent (and been better at dragging out the experiment) I’d have gotten more and better data. I am comforted by the fact that my sample size was >1.

Twitter analytics provided multiple measures, e.g. number of retweets, impressions, total engagements etc., but I settled on link clicks because that data type gave the best results for the argument I wanted to make. The table with the full dataset is available for download from here (or in pdf). The full list of tweets is here.


In 2014 biologist Mike Snyder published 42 papers. Some of his contributions to those papers were collaborative, but many describe experiments performed in his lab. To make sense of the number 42, one might naïvely assume that Snyder dedicated 8 days and 17 hours (on average) to each paper. That time would include the hours (days?) needed to conceive of ideas, setup and perform experiments, analyze data and write text. It’s not much time, yet even so it is probably an upper bound. This is because Snyder performed the work that resulted in the 42 papers in his spare time. As the chair of the genetics department at Stanford, an administrative position carrying many responsibilities, he was burdened with numerous inter- and intra- and extra- departmental meetings, and as an internationally renowned figure in genomics he was surely called upon to assist in numerous reviews of grants and papers, to deliver invited talks at conferences, and to serve on numerous national and international panels and committees. Moreover, as a “principal investigator” (PI), Snyder was certainly tasked with writing reports about the funding he receives, and to spend time applying for future funding. So how does someone write 42 papers in a year?

Snyder manages 36 postdocs,  13 research assistants,  11 research scientists, 9 visiting scientists, and 8 graduate students. One might imagine that each of the graduate students supervises four postdocs, so that the students/postdocs operate under a structure that enables independent work to proceed in the lab with only occasional supervision and input from the PI. This so-called “PI model” for biology arguably makes sense. The focus of experienced investigators on supervision allows them to provide crucial input gained from years of experience on projects that require not only creative insight, but also large amounts of tedious rote work. Modern biology is also fraught with interdisciplinary challenges, and large groups benefit from the diversity of their members.  For these reasons biology papers nowadays tend to have large numbers of authors.

It is those authors, the unsung heroes of science, that are the “cast” in Eric Lander’s recent perspective on “The Heroes of CRISPR” where he writes that

“the narrative underscores that scientific breakthroughs are rarely eureka moments. They are typically ensemble acts, played out over a decade or more, in which the cast becomes part of something greater than what any one of them could do alone.”

All of the research papers referenced in the Lander perspective have multiple authors, on average about 7, and going up to 20. When discussing papers such as this, it is therefore customary to refer to “the PI and colleagues…” in lieu of crediting all individual authors. Indeed, this phrase appears throughout Lander’s perspective, where he writes “Moineau and colleagues..”, “Siksnys and colleagues..”, etc. It is understood that this means that the PIs guided projects and provided key insights, yet left most of the work to the first author(s) who were in turn aided by others.

There is a more cynical view of the PI model, namely that by running large labs PIs are able to benefit from a scientific roulette of their own making. PIs can claim credit for successes while blaming underlings for failures. Even one success can fund numerous failures, and so the wheel spins faster and faster…the PI always wins. Of course the mad rush to win leads to an overall deterioration of quality. For example, it appears that even the Lander perspective was written in haste (Lander may not be as prolific as Snyder but according to Thomson Reuters he did publish 21 “hot” papers in 2015, placing him fourth on the list of the world’s most influential scientific minds, only a few paces behind winner Stacey Gabriel). The result is what appears to be a fairly significant typo in Figure 2 of his perspective, which shows the geography of the CRISPR story. The circle labeled “9” should be colored in blue like circle “10”, because that color represents “the final step of biological engineering to enable genome editing”. The Jinek et al. 2012 paper of circle “9” clearly states that “We further show that the Cas9 endonuclease can be programmed with guide RNA engineered as a single transcript to target and cleave any dsDNA sequence of interest”. Moreover the Jinek et al. 2013 paper described editing in mammalian cells but was submitted in 2012 (the dates in Lander’s figure are by submission), so circle “9” should also be labeled with a “10” for “Genome editing in mammalian cells”.  In other words, the figure should have looked like this:


Lander acknowledges the reality of the PI model in the opening of his perspective, where he writes that “the Perspective describes an inspiring ensemble of a dozen or so scientists who—with their collaborators and other contributors whose stories are not elaborated here—discovered the CRISPR system, unraveled its molecular mechanisms, and repurposed it as a powerful tool for biological research and biomedicine. Together, they are the Heroes of CRISPR.”

But in choosing to highlight a “dozen or so” scientists, almost all of whom are established PIs at this point, Lander unfairly trivializes the contributions of dozens of graduate students and postdocs who may have made many of the discoveries he discusses, may have developed the key insights, and almost certainly did most of the work. For example, one of the papers mentioned in the Lander perspective is

F. Zhang*, L. Cong*, S. Lodato, S. Kosuri, G.M. Church and P. Arlotta, Efficient construction of sequence-specific TAL effectors for modulating mammalian transcription, 2011

I’ve placed a * next to the names of F. Zhang and L. Cong as is customary to do when “authors contributed equally to this work” (a quote from their paper). Yet in his perspective Lander writes that “when TALEs were deciphered, Zhang, with his collaborators Paola Arlotta and George Church…successfully repurposed them for mammals”. The omission of Cong is surprising not only because his contribution to the TALE work was equal to that of Zhang, but because he is the same Cong that is first author of Cong et al. 2013 that Lander lauds for its number of citations (Le Cong is a former graduate student of Zhang, and is now a postdoc at the Broad Institute). Another person who is absent from Lander’s perspective, yet was first author on two of the key papers mentioned is Martin Jinek. He and Charpentier’s postdoc Krzysztof Chylinski were deemed fundamental to the CRISPR story elsewhere.

A proper narrative of the history of CRISPR would include stories primarily about the graduate students and postdocs who did the heavy lifting. Still, one might imagine giving Lander the benefit of the doubt; perhaps it was simply not possible to exhaustively describe all the individuals who contributed to the 20 year CRISPR story. Maybe Lander’s intent was to draw readers into his piece by including interesting backstories in the style of the New Yorker. Mentions of cognac, choucroute garnie and Saddam Hussein certainly added spice to what would otherwise be just a litany of dates. Yet if that is the case, why are the backstories of the only two women mentioned in Lander’s perspective presented as so so so boring? The entirety of what he has to say about them is “Charpentier had earned her Ph.D in microbiology from Pasteur Institute in 1995” and “Doudna had received her Ph.D. at Harvard” (those who are interested in Charpentier and Doudna’s fascinating CRISPR story will have to forgo the journal Cell and read about The CRISPR Quandary in the New York Times). A possible answer to my question is provided here.

Lander concludes his perspective by writing that “the [CRISPR] narrative…is a wonderful lesson for a young person contemplating a life in science”. That may be true of the CRISPR narrative, but the lesson of his narrative is that young persons should not count on leaders in their field to recognize their work.

COI statement: Eric Lander is my former (informal) Ph.D. advisor. Jennifer Doudna is a current colleague and collaborator.

One of my favorite systems biology papers is the classic “Stochastic Gene Expression in a Single Cell” by Michael Elowitz, Arnold J. Levine, Eric D. Siggia and Peter S. Swain (in this post I refer to it as the ELSS paper).

What I particularly like about the paper is that it resulted from computational biology flipped. Unlike most genomics projects, where statistics and computation are servants of the data, in ELSS a statistical idea was implemented with biology, and the result was an elegant experiment that enabled a fundamental measurement to be made.

The fact the ELSS implemented a statistical idea with biology makes the statistics a natural starting point for understanding the paper. The statistical idea is what is known as the law of total variance. Given a random (response) variable C with a random covariate  Z, the law of total variance states that the variance of C can be decomposed as:

Var[C] = E_Z[Var[C|Z]] + Var_Z[E[C|Z]].

There is a biological interpretation of this law that also serves to explain it: If the random variable C denotes the expression of a gene in a single cell (C being a random variable means that the expression is stochastic), and Z denotes the (random) state of a cell, then the law of total variance describes the “total noise” Var[C] in terms of what can be considered “intrinsic” (also called “unexplained”) and “extrinsic” (also called “explained”) noise or variance.

To understand intrinsic noise, first one understands the expression Var[C|Z] to be the conditional variance, which is also a random variable; its possible values are the variance of the gene expression in different cell states. If Var[C] does not depend on Z then the expression of the gene is said to be homoscedastic, i.e., it does not depend on cell state (if it does then it is said to be heteroscedastic). Because Var[C|Z] is a random variable, the expression E_Z[Var[C|Z] makes sense, it is simply the average variance (of the gene expression in single cells) across cell states (weighted by their probability of occurrence), thus the term “intrinsic noise” to describe it.

The expression E[C|Z] is a random variable whose possible values are the average  of the gene expression in cells. Thus, Var_Z[E[C|Z]] is the variance of the averages; intuitively it can be understood to describe the noise arising from different cell state, thus the term “extrinsic noise” to describe it (see here for a useful interactive visualization for exploring the law of total variance).

The idea of ELSS was to design an experiment to measure the extent of intrinsic and extrinsic noise in gene expression by inserting two identically regulated reporter genes (cyan fluorescent protein and yellow fluorescent protein) into E. coli and measuring their expression in different cells. What this provides are measurements from the following model:

Random cell states are represented by random variables Z_1,\ldots,Z_n which are independent and identically distributed, one for each of n different cells, while random variables C_1,\ldots,C_n  and Y_1,\ldots,Y_n correspond to the gene expression of the cyan , respectively yellow, reporters in those cells. The ELSS experiment produces a single sample from each variable C_i and Y_i, i.e. a pair of measurements for each cell. A hierarchical model for the experiment, in which the marginal (unconditional) distributions C_i and Y_i are identical, allows for estimating the intrinsic and extrinsic noise from the reporter expression measurements.

The model above, on which ELSS is based, was not described in the original paper (more on that later). Instead,  in ELSS the following estimates for intrinsic, extrinsic and total noise were simply written down:

\eta^2_{int} = \frac{\frac{1}{n} \left( \sum_{i=1}^n \frac{1}{2} (c_i-y_i)^2\right) }{\overline{c} \cdot \overline{y}}, (intrinsic noise)

\eta^2_{ext} = \frac{\frac{1}{n}\sum_{i=1}^n c_i \cdot y_i - \overline{c} \cdot \overline{y}}{\overline{c} \cdot \overline{y}}, (extrinsic noise)

\eta^2_{tot} = \frac{\frac{1}{n}\sum_{i=1}^n \frac{1}{2} (c_i^2+y_i^2)-\overline{c}\cdot\overline{y}}{\overline{c} \cdot \overline{y}}. (total noise)

Here  c_1,\ldots,c_n and y_1,\ldots,y_n are the measurements of cyan respectively yellow reporter expression in each cell, \overline{c} = \frac{1}{n}\sum_{i=1}^n c_i and \overline{y} = \frac{1}{n}\sum_{i=1}^n y_i.

Last year, Audrey Fu, at the time a visiting scholar in Jonathan Pritchard’s lab and now assistant professor in statistical science at the University of Idaho,  studied the ELSS paper as part of a journal club. She noticed some inconsistencies with the proposed estimates in the paper, e.g. it seemed to her that some were biased, whereas others were not, and she proceeded to investigate in more detail the statistical basis for the estimates. There had been a few papers trying to provide statistical background, motivation and interpretation for the formulas in ELSS (e.g. A. Hilfinger and J. Paulsson, Separating intrinsic from extrinsic fluctuations in dynamic biological systems, 2011 ), but there had not been an analysis of bias, or for that matter other statistical properties of the estimates. Audrey started to embark on a post-publication peer review of the paper, and having seen such reviews on my blog contacted me to ask whether I would be interested to work with her. The project has been a fun hobby of ours for the past couple of months, eventually leading to a manuscript that we just posted on the arXiv:

Audrey Fu and Lior Pachter, Estimating intrinsic and extrinsic noise from single-cell gene expression measurements, arXiv 2016.

Our work provides what I think of as a “statistical companion” to the ELSS paper. First, we describe a formal hierarchical model which provides a useful starting point for thinking about estimators for intrinsic and extrinsic noise. With the model we re-examine the ELSS formulas, derive alternative estimators that either minimize bias or mean squared error, and revisit the intuition that underlies the extraction of intrinsic and extrinsic noise from data. Details are in the paper, but I briefly review some of the highlights here:

Figure 3a of the ELSS paper shows a scatterplot of data from two experiments, and provides a geometric interpretation of intrinsic and extrinsic noise that can guide intuition about them. We have redrawn their figure (albeit with a handful of points rather than with real data) in order to clarify the connections to the statistics:


The Elowitz et al. caption correctly stated that “Each point represents the mean fluorescence intensities from one cell. Spread of points perpendicular to the diagonal line on which CFP and YFP intensities are equal corresponds to intrinsic noise, whereas spread parallel to this line is increased by extrinsic noise”. While both statements are true, the one about intrinsic noise is precise whereas the one about extrinsic noise can be refined. In fact, the ELSS extrinsic noise estimate is the sample covariance (albeit biased due to a prefix of n in the denominator rather than n-1), an observation made by Hilfinger and Paulsson. The sample covariance has a  (well-known) geometric interpretation: Specifically, we explain that it is the average (signed) area of triangles formed by pairs of data points (one the blue one in the figure): green triangles in Q1 and Q3 (some not shown) represent a positive contribution to the covariance and magenta triangles in Q2 and Q4 represent a negative contribution. Since most data points lie in the 1st (Q1) and 3rd (Q3) quadrants relative to the blue point, most of the contribution involving the blue point is positive. Similarly, since most pairs of data points can be connected by a positively signed line, their positive contribution will result in a positive covariance. We also explain why naïve intuition of extrinsic noise as the variance of points along the line c=y is problematic.

The estimators we derive are summarized in the table below (Table 1 from our paper):


There is a bit of algebra that is required to derive formulas in the table (see the appendices of our paper). The take home messages are that:

  1. There is a subtle assumption underlying the ELSS intrinsic noise estimator that makes sense for the experiments in the ELSS paper, but not for every type of experiment in which the ELSS formulas are currently used. This has to do with the mean expression level of the two reporters, and we provide a rationale and criteria when to apply quantile normalization to normalize expression to the data.
  2. The ELSS intrinsic noise estimator is unbiased, whereas the ELSS extrinsic noise estimator is (slightly) biased. This asymmetry can be easily rectified with adjustments we derive.
  3. The standard unbiased estimator for variance (obtained via the Bessel correction) is frequently, and correctly, criticized for trading off mean squared error for bias. In practice, it can be more important to minimize mean squared error (MSE). For this reason we derive MSE minimizing estimators. While the MSE minimizing estimates converge quickly to the unbiased estimates (as a function of the number of cells), we believe there may be applications of the law of total variance to problems in systems biology where sample sizes are smaller, in which case our formulas may become useful.

The ELSS paper has been cited more than 3,000 times and with the recent emergence of large scale single-cell RNA-Seq the ideas in the paper are more relevant than ever. Hopefully our statistical companion to the paper will be useful to those seeking to obtain a better understanding of the methods, or those who plan to extend and apply them.

In the 1930s the pseudonym Nicolas Bourbaki was adopted by a group of French mathematicians who sought to axiomatize and formalize mathematics in a rigorous manner. Their project greatly influenced mathematics in the 20th century and contributed to its present “definition -> theorem -> proof” format (Bourbaki went a bit overboard with abstraction and generality and eventually the movement petered out).

In biology definitions seem to be hardly worth the trouble. For example, the botanist Willhelm Johannsen coined the term “gene” in 1909 but there is still no universal agreement on a “definition”.  How could there be? Words associated with biological processes or phenomena that are being understood in a piecemeal fashion naturally change their meaning over time (the philosopher Joseph Woodger did try to axiomatize biology in his 1938 book “The Axiomatic Method in Biology” but his attempt, with 53 axioms, was ill fated).

Bioinformatics occupies a sort of murky middle-ground as far as definitions go. On the one hand, mathematics underlies many of the algorithms and statistical models that are used so there is a natural tendency to formalize concepts, but just when things seem neat and tidy and the lecture begins with “Let the biological sequence be a string on an alphabet \Sigma of size 4 or 20″, biology comes along and rips the blanket off the bed (there are 22 proteinogenic amino acids, molecules may be single or double stranded, they could be circular, bases may be chemically modified, and so on and so forth…).

In some cases things have really gotten out of hand. The words “alignment” and “mapping” appear, superficially, to refer to precise objects and procedures, but they probably have more meanings than the verb “run” (which is in the hundreds). To just list the meanings would be a monumental (impossible?) task. I will never attempt it. However I recently came across some twitter requests for clarification of read alignment and mapping, and I thought explaining those terms was perhaps a more tractable problem.

Here is my attempt at a chronological clarification:

1998: The first use of the term “read alignment” in a paper (in the context of genomics) is, to my knowledge, in “Consed: A Graphical Tool for Sequence Finishing” by D. Gordon, C. Abajian and P. Green. The term “read alignment” in the paper was used in the context of the program phrap, and referred to the multiple alignment of reads to a sequence assembled from the reads, so as to generate a consensus sequence with reduced error. The output looked something like this:


Figure 1 from the Consed paper.

To obtain multiple alignments phrap required a “model”. Roughly speaking, a model is an objective scoring function along with parameters, and, although not discussed in the Consed paper, that model was crucial to the functionality and performance of phrap. The phrap model included not just mismatches of bases but also insertions and deletions. Although phrap was designed to work with Sanger reads, there is conceptually not much difference between the way “read alignment” was used in the Consed paper and the way it is sometimes used now with “next-generation” reads. In fact, the paper “Correcting errors in short reads by multiple alignments” by L. Salmela and J. Schröder, 2011 does much of the same thing (it would have been nice if they cited phrap, although in their defense phrap was never published; Consed was about the visualization tool).

For the purposes of defining “read alignment” it is worthwhile to consider what I think were the four key ingredients of phrap:

  • An underlying notion of what a read alignment is. Without being overly formal I think its useful to consider the form of the phrap input/output . This is my own take on it. First, there was a set of reads \mathcal{F} with each read F \in \mathcal{F} consisting of a sequence so that F is an ordered set F =\{\sigma_1, \ldots, \sigma_l\} (for simplicity I make the assumption that the reads were single end, each read was of the same length l, and each sequence element \sigma \in \{A,C,G,T\}, and of course I’ve ignored the reverse strand; these assumptions and restrictions can and should be relaxed but I work with them here for simplicity). Next, there was a set of target sequences (contigs, to be precise) that I denote by  \mathcal{T} with T \in \mathcal{T} . These each consisted of a sequence, again an ordered set, T = \{\sigma^T_1,\ldots,\sigma^T_{|T|}\}. A “read alignment” in phrap consisted of a pair of maps: first \psi:\mathcal{F} \rightarrow \mathcal{T} \cup \{\emptyset\} specifying for each read the contig it maps to (or the mapping to \{\emptyset\} denoting an unaligned read), and also a set of additional maps \mathcal{L}_F:F \rightarrow \psi(F) \cup \emptyset assigning to each base in each read a corresponding base (or gap in the case of \{\emptyset\} ) in the contig the read mapped to.
  • A model: Given the notion of read alignment, a model was needed to be able to choose between different alignments. In phrap, the multiple alignments depended on scores for mismatches and gaps (the defaults were a score of 11 for a match, -2 for a mismatch, -4 for initiating a gap, -3 for extending a gap). In general, a model provides an approach to distinguishing and choosing between the read alignments as (roughly) defined above.
  • An algorithm for read alignment. Given a model there was a need for an algorithm to actually find high scoring alignments. In phrap the algorithm consisted of performing a banded Smith-Waterman alignment using the model (roughly) specified above. The Smith-Waterman algorithm produced alignments \mathcal{L}_F that were global, i.e. order preserving between the elements of F and \psi(F).
  • Data structures for enabling efficient alignment. With large numbers of targets and many reads, there was a need for phrap to efficiently prune the search for read alignments. Running the banded Smith-Waterman alignment algorithm on all possible places a read could align to would have been intractable. To address this problem,  phrap made use of a hash to quickly find exact matches from parts of the read to targets, and then ran the Smith-Waterman banded alignment only on the resulting credible locations. Statistically, this could be viewed as a filter for very low probability alignments.

The four parts of phrap are present in most read alignment programs today. But the details can be very different. More on this in a moment.

2004: Six years after Consed paper the term “read alignment” shows up again in the paper “Comparative genome assembly” by Pop et al. In the paper the term “read alignment” was not defined but rather described via a process. Reads from obtained via shotgun sequencing of the genome of one organism were “aligned” to a reference genome from another closely related organism using a program called MUMmer (the goal being to use proximity of alignments to assist in assembly). In other words, the “alignments” were the output of MUMmer, followed by a post-processing with a series of heuristics to deal with ambiguous alignments. It is instructive to compare the alignments to those of phrap in terms of the four parts discussed above. First, in the formulation of Pop et al. the alignments were between a single read and a genome, not multiple alignments of sets of reads to numerous contigs. In the language above the map \psi was much simpler in Pop et al., with the range consisting only of chromosomes in a genome rather than the thousands of contigs in an assembly. Second, although MUMmer utilized an implicit model for alignment, the model was of a very different meaning. Reads from one organism were being aligned to sequences from another, a task fundamentally different than aligning reads back to sequences they originated from (albeit with error). In other words, “alignment” in the Pop et al. paper referred to identification of homology. This opens up a whole other bag of worms that I wrote about in a paper with Colin Dewey in 2006 and that I will not open here. In terms of data structures, MUMmer made use of a suffix tree rather than a hash, a distinction that is important in that suffix trees provide a much richer possibility for search (MUM stands for maximum unique match, the kind of thing suffix trees are useful for finding and that MUMmer took advantage of). A lot of the research in read alignment has centered on developing efficient and compact data structures for string search.

Another development in the year 2004 was the appearance of the term “read mapping”. In the paper “Pash: efficient genome-scale anchoring by positional hashing” by Kalafus et al., hashing of k-mers was used to “anchor” reads. The word “anchoring” was and still is somewhat ambiguous as to its meaning, but in the Pash context it referred to the identification of positions of exact matching k-mers within a read on a target sequence. In other words, Pash could be considered to provide only partial alignments, because some bases within the read would not be aligned at all. Formally, one would say that the functions \mathcal{L}_F had as their domain not the full reads \mathcal{F} but rather strict subsets \mathcal{S} \subset F where \mathcal{S} is nonempty. For this reason, in the Pash paper the authors suggest post-processing their output with base-pair level sequence alignment programs popular at the time (e.g. AVID, LAGAN) to fully “align” the reads (they could have also suggested performing a banded Smith-Waterman alignment like phrap).

2006: The first sequencer manufactured by Solexa, the Genome Analyzer was launched, and following it the ELAND program for “read alignment” was provided  to customers buying the machine. This was, in my opinion, a landmark even in “read alignment” because with next-generation sequencing and ELAND read alignment became mainstream, a task to be performed in the hands of individual users, rather than just a step among many in complex genome sequencing pipelines manufactured in large genome sequencing centers. In the initial release of ELAND, the program performed “ungapped read alignment”, which in the language of this post means that the maps \mathcal{L}_F  had the property that if  \mathcal{L}_F(i) = \sigma^{\psi(F)}_{j} then \mathcal{L}_F(i+1) = \sigma^{\psi(F)}_{j+1}, i.e. there were no gaps in the read alignments.

2009: With two programs published back-to-back in the same year, BWA (“Fast and accurate alignment with the Burrows-Wheeler transform“, Heng Li and Richard Durbin)  and Bowtie (“Ultrafast and memory-efficient alignment of short DNA sequences to the human genome” by Ben Langmead, Cole Trapnell, Mihai Pop and Steven L Salzberg) read alignment took a big step forward with the use of more sophisticated data structures for rapid string matching.  BWA and Bowtie were based on the Burrows-Wheeler transform, and specifically innovations with the way it was used, to speed up read alignment. In the case of Bowtie, the alignments were ungapped just as originally with ELAND, although Bowtie2 published in 2012 incorporated gaps into the model.

Also of note is that by this time the terms “read alignment” and “read mapping” had become interchangeable. The BWA and Bowtie papers both used both terms, as did many other papers. Moreover, the terms started to take on a specific meaning in terms of the notion of “alignment”. They were referring to alignment of reads to a target genome, and moreover alignment of individual bases to specific positions in the target genome (i.e. the maps \mathcal{L}_F). This was in large part due to the publication of the sequence alignment format also in 2009 in Heng Li et al.‘s important paper on SAMtools.

Finally, 2009 was the year the notion that “spliced read alignment” was introduced. “Spliced alignment” was a concept around since the mid 90s (see, e.g. “Gene recognition by spliced alignment” by MS Gelfand, AA Mironov and PA Pevzner). The idea was to align cDNAs to a genome, requiring alignments allowing for long gaps (for the introns). In the paper “Optimal spliced alignments of short sequencing reads” by Fabio De Bona et al. the authors introduced the tool QPalma for this purpose, building on a spliced alignment tool from the group called Palma. Here the idea was to extend “spliced alignment” to reads, and while QPalma was never widely adopted, other programs such as TopHat, GSNAP, Star, etc. have becomes staples of RNA-Seq. Going back to the formalism I provided with the description of phrap, what spliced alignment did was to extend the models used for read alignment to allow for long gaps.

In the following years, there was what I think is fair to characterize as an explosion in the number of read alignment papers and tools. Models were tweaked and new indexing methods introduced, but the fundamental notion of “read alignment” (by now routinely also called “read mapping”) remained the same.

2014: In his Ph.D. thesis published in December of last year, my former student Nicolas Bray had the idea of entirely dropping the requirement for a read alignment to include the maps \mathcal{L}_F. In his setup, a read alignment would consist only of specifying for each read which target sequence it originated from, but not where in the target sequence it aligned to. The point, elaborated on in Chapter 2 of Nick’s thesis, was motivated by the observation that for many applications (e.g. RNA-Seq) the sufficient statistics for the relevant models are essentially the information contained in the map \psi alone. He called this notion of read alignment “pseudoalignment”, as it is a fundamentally different concept than what was previously thought of as read alignment.

Of course there is not much to writing down a new definition, and there wouldn’t be much of a point to pseudoalignment it if not for two things: Nick’s realized that pseudoalignments could be obtained much faster than standard read alignments and second, that the large imprints of alignment files needed for the vast amounts of sequence that was being produced were becoming a major bottleneck in genomics. Working on RNA-Seq projects with collaborators, this is how we felt on a regular basis:


Pseudoalignment offered a reprieve from what was becoming an unbearable task, and (I believe) it is going to turn out to be a useful and widely applicable concept.

2015: Earlier this year we posted a preprint to the arXiv detailing a method for performing pseudoalignment based on the ideas in Nick’s thesis, with the additional idea of Páll Melsted to use a new type of indexing scheme based on a de Bruijn graph of the target sequences. The program implementing these ideas is called kallisto. Looking back at phrap, kallisto can be seen to consist of a different notion of read alignment (pseudoalignment), a novel data structure (the target sequence de Bruijn graph), and a new fast algorithm for finding pseudoalignments (based on intersecting sets of k-mer matches). Although we did not discuss the model explicitly in our paper, it is implicit via the algorithm used to find pseudoalignments. One useful thing to keep in mind is that the nature of the index of kallisto allows for extracting more than a pseudoalignment. If desired, one can obtain positions within the targets for some k-mers contained in a query read, or even strand information, and in fact such data is used in some parts of kallisto. When such information is extracted, what kallisto is performing is no longer a pure pseudoalignment, but rather what the Pash authors called a read mapping (although that term is unsuitable because as mentioned it has become synonyms with read alignment). It turns out there is now another word for exactly that concept.

Following the publication of kallisto on the arXiv another preprint was posted on the bioRxiv introducing a program called Salmon and a new term called “lightweight alignment” (Salmon: Accurate, Versatile and Ultrafast Quantification from RNA-Seq Data Using Lightweight-Alignment by Rob Patro, Geet Duggal and Carl Kingsford) . Lightweight alignment as a notion of alignment is exactly what was just referred to: the notion of read mapping in Pash. Utilizing an index and software by Heng Li called the FMD index, Salmon determines “lightweight alignment” chains of what are called maximal exact matches and super-maximal exact matches (not to be confused with the maximal unique matches, or MUMs discuss above).  These provide partial alignment information for the sequences in each read, i.e. just like with Pash one obtains maps \mathcal{L}_F but with the domain sometimes restricted to a nonempty subset \mathcal{S} \subset \mathcal{F} when the MEMs or SMEMs don’t cover the read. Again, I find it useful to refer back and compare to phrap: with Salmon the “read alignment” consists of a different type of index and an algorithm associated with it (making sure the MEMs/SMEMs are consistent). The notion of alignment is distinct from that of phrap by virtue of the restriction of the domain in the maps \mathcal{L}_F.  I think a new term for the latter concept since the one Pash used failed to stick, and “lightweight alignment” seems fine.

Recently (just last week) a new read alignment related term was introduced in a bioRxiv preprint. A paper describing a new program called RapMap uses the term quasi-mapping for the procedure implemented in the software. RapMap finds pseudoalignments in the sense described above, with the only difference being the type of index that is used (instead of the target de Bruijn graph, it utilizes a generalized suffix array together with a hash table related to the array). This time I think the introduction of new terminology was unfortunate. Just as with kallisto, RapMap can be used to extract more information than just a pseudoalignment from the index (i.e. a lightweight alignment), and indeed the program does so. But also like kallisto, the speed of RapMap is derived from the idea of efficiently finding pseudoalignments by intersecting sets of k-mer matches. In other words, quasi-mapping is a procedure and not a concept; what the procedure outputs is a lightweight alignment via a pseudoalignment.


Read alignment is what you think it means.

A read mapping is a read alignment.

Lightweight alignment is “partial” read alignment (it should have been called read mapping but that was taken). Some, but not necessarily all bases in a read are aligned to specific bases among target sequences.

Pseudoalignment is read assignment to target sequences without any base-level sequence alignment.

Read alignment, lightweight alignment and pseudoalignment are concepts, not algorithms.

Quasi-mapping is a procedure and not a concept. The procedure outputs a lightweight alignment via a pseudoalignment.

Disclaimer: I’ve almost certainly gotten some dates wrong, missed some papers I should have cited, and omitted some crucial developments in read alignment that I should have included. Please let me know if you think that is the case by commenting and I’ll update the post as needed.

The hierarchical classification of nature initiated by Carl Linnaeus today consists of eight major “ranks”, namely species, genus, family, order, class, phylum, kingdom and domain:


In the microbial world it makes sense to refine the standard taxonomy by subdividing species into strains. An important reason to do so is that bacterial taxonomy must reflect not only phylogeny but also pathogenicity, and small differences in genomes can translate to large pathogenic differences. This has implications for metagenomic analyses of microbial communities: for many biomedical applications it is desirable to characterize individuals strains.

Metagenomics has its roots in culture-independent retrieval and sequencing of 16S rRNA genes, and while variations in 16S can sometimes distinguish between strains, a single gene is not always sufficient. This limitation of 16S can be overcome with whole genome shotgun sequencing of microbial communities, an approach to metagenomics that became popular in the early 2000s and  that opened the door to higher resolution characterization of communities. In 2005 Kevin Chen and I wrote a review on the bioinformatics challenges that would have to be overcome to walk through the door. One of the things we did was to emphasize “problems and their connections to other areas of bioinformatics, such as… gene expression analysis”, and throughout the past decade I’ve always hoped for deeper connections to be established between metagenomics and gene expression bioinformatics. I’ve noticed interesting connections pop up from time to time (e.g. Paulson et al. 2013)  and have occasionally entertained the thought with my students and collaborators, especially as work in my group became more focused on RNA-Seq since the development of Cufflinks in 2008.

However connection modern transcriptome analysis methodology, specifically bioinformatics of RNA-Seq to metagenomics has been difficult to do until recently. One major reason is that until just a few years ago, there was no reference genome database for metagenomics analogous to the reference annotation databases available for use in transcriptomics. Another way to put this is that metagenomics has, until recently, been “de novo” bioinformatics. By this I mean that the analysis of communities from whole genome shotgun data had to largely proceed via de novo analyses of the data (e.g. de novo assembly of genomes), “binning” of reads according to sequence characteristics or hits to gene databases was required because it was impossible to compare sequences to references genomes. While de novo methods have also been developed for RNA-Seq, the scale of transcriptome analysis is much smaller than that of most metagenomic analyses, and as has been well documented, de novo transcriptomics is already very difficult (e.g. Amin et al. 2014).

The de novo state of metagenomics has changed in recent years, as (relatively) low-cost sequencing has been a boon for microbial genomics. The graph below, extracted from NCBI and published in a recent review, shows that in just the past few years thousands of bacterial genomes have been sequences, enabling, for the first time, reference based metagenomics:


This observation is reflected in the recent development of many methods for a variety of metagenomic applications that take advantage of reference genome databases.  Specifically, the problem of read assignment, which is fundamental for abundance estimation, has benefited from the possibility of metagenomic read alignment to reference databases.

The figure below, reproduced from the preprint “An evaluation of the accuracy and speed of metagenome analysis tools” by Stinus Lindgreen, Karen L. Adair and Paul Gardner, bioRxiv May 15, 2015 shows a benchmark of the accuracy and runtime of 14 programs developed for metagenomic read assignment for whole genome shotgun data:


The problem these methods are solving is really similar to the problem of read assignment in RNA-Seq. In RNA-Seq, instead of originating from strains, reads originate from transcripts. Just as strains are present in different abundances in a community, so are RNA transcripts in a cell (or in bulk). The analogy of taxonomy in metagenomics, i.e. the grouping of strains into species, genus etc. is also present in RNA-Seq, where transcripts are grouped into genes. The fragment (or read) assignment problem in RNA-Seq is closely related to the quantification problem in RNA-Seq and is a problem that has been thoroughly researched and for which many algorithms have been developed. I discussed the importance of the fragment assignment problem for RNA-Seq in my 2013 Genome Informatics Keynote.

In response to the development of reference-based bioinformatics possibilities for metagenomics, about three years ago my student Lorian Schaeffer started looking at the suitability of RNA-Seq tools for metagenomic read assignment. Although the metagenomic and RNA-Seq assignment problems are conceptually similar and methodologically related, there are various technical issues involved in applying RNA-Seq tools in the metagenomic setting (e.g. the need to carefully account for taxonomy in the metagenomics setting). After developing the computational infrastructure to benchmark RNA-Seq programs in the metagenomic setting, she proceeded to evaluate the accuracy of eXpress, a streaming algorithm for RNA-Seq quantification. Although the quantification of eXpress was specifically designed to be suitable for large numbers of reads, the program requires read alignments to a reference transcriptome (or in Lorian’s experiments a genome) database. In the metagenomic setting realistic databases are huge, and she found that it took days just to map the reads. Nevertheless, her initial benchmarks revealed that eXpress was significantly more accurate than the available metagenomic read assignment tools of the time.

When Kraken (Wood and Salzberg 2014), and later CLARK (Ounit et al. 2015) were published in 2014 and 2015 respectively, we took note because by circumventing the alignment step they dramatically altered the tractability of metagenomic read assignment. In parallel, in my group, Nicolas Bray and later Páll Melsted and Harold Pimentel were developing what is now kallisto (Bray et al. 2015). Like Kraken, kallisto avoided the need for aligning reads, but with the introduction of the concept of pseudoalignment, allowed for accurate read assignments based on joint analysis of exact k-mer matches. What we showed earlier this year is that unlike naïve k-mer based approaches to quantification, kallisto is as accurate as eXpress and other read alignment based quantification tools, and this observation led Lorian to immediately proceed to benchmark it on metagenomic data. The result of her work was just posted as a preprint:

Lorian Schaeffer, Harold Pimentel, Nicolas Bray, Páll Melsted and Lior Pachter, Pseudoalignment for metagenomic read assignment, arXiv 1510.07371, 2015.

With this paper we demonstrate a “technology transfer” from RNA-Seq bioinformatics to metagenomics, one that achieves dramatic improvements in read assignment accuracy in the metagenomics setting. The main result of her work is Table 1 in our preprint:


Using a published simulated Illumina dataset from Mende et al. 2012 (based on 100 genomes and containing 53.33 million reads), and augmenting it with another 2,308 genomes for the purpose of testing, she shows that kallisto significantly outperforms the best quantification methods (as benchmarked by Lindgreen et al., see figure above). “Significant” here refers to what I think is fair to characterize as an extraordinary improvement: at the genus level, a level that programs such as CLARK have been optimized for, kallisto’s RRMSE (relative root mean squared error)  is 0.13 compared to 17.05 for Kraken and 18.58 for CLARK. The improvement is based on two ideas: first, the results show that the model based approach for read assignment, the concept that underlies GASiC and eXpress, outperforms direct taxonomic read assignment as implemented by MEGAN and Kraken and CLARK (in the latter approach reads are aligned to the lowest rank to which they align unambigously). Second, pseudoalignment is not just faster than traditional alignment but also accurate.

The upshot: the accuracy and efficiency of kallisto make strain level analysis of metagenomes possible. In fact kallisto is more accurate at the strain level than other programs are at the genus level. Just as we have been advocating for transcript level analysis from RNA-Seq data, we believe that strain level analysis should become commonplace in metagenomics.

In digging deeply into the bioinformatics of metagenomics bioinformatics we noticed a few other areas that could benefit from RNA-Seq technology transfer. For example, the standard of RNA-Seq methods benchmarking appears to be higher than in metagenomics. Both the Kraken and CLARK papers benchmarked their programs on simulations with 10 genomes (the number ten is not a typo). CLARK did test on one dataset with 20 genomes, although using only 10,000 reads. To be fair to the authors of those papers, their standards were much higher than others in the field. The paper

Yu-Wei Wu and Yuzhen Ye, A novel abundance-based algorithm for binning metagenomic sequences using l-tuples, Journal of Computational Biology 2011.

benchmarked their method on simulations of reads from 2 (two!!) organisms. Biologists frequently complain that simulations of bioinformaticians are completely non-informative and unfortunately these cases provide fodder for such prejudice. Having said that, the RNA-Seq community also has much to learn from the metagenomics community. The previously mentioned paper by Paulson et al. 2013 addresses missing data in a way that should translate directly to missing data in single-cell RNA-Seq (the paper also makes performance comparisons with their comparative metagenomics approach to the RNA-Seq programs DESeq and edgeR) . One paper (McDavid et al. 2012) does take a look at modeling single-cell data with zero inflated distributions but I think this is a good example where metagenomics is ahead of RNA-Seq. Our immediate plans are to develop the kallisto application to metagenomics to include the ability to perform metagenome comparisons using sleuth. Conversely, inspired by the taxonomy hierarchy fundamental to metagenomics we’re going to explore RNA-Seq quantification with groups of transcripts that go beyond just genes.

Horizontal transfer is good.

The Common Core State Standards Initiative was intended to establish standards for the curriculum for K–12 students in order to universally elevate the the quality of education in the United States. Whether the initiative has succeeded, or not, is a matter of heated debate. In particular, the merits of the mathematics standards are a contentious matter to the extent that colleagues in my math department at UC Berkeley have penned opposing opinions on the pages of the Wall Street Journal (see Frenkel and Wu vs. Ratner). In this post I won’t opine on the merits of the standards, but rather wish to highlight a shortcoming in the almost universal perspective on education that the common core embodies:

The emphasis on what K–12 students ought to learn about what is known has sidelined an important discussion about what they should learn about what is not known.

To make the point, I’ve compiled a list of unsolved problems in mathematics to match the topics covered in the common core. The problems are all well-known to mathematicians, and my only contribution is to select problems that (a) are of interest to research mathematicians (b) provide a good balance among the different areas of mathematics and (c) are understandable by students studying to (the highlighted) Common Core Standards.  For each grade/high school topic, the header is a link to the Common Core Standards. Based on the standards, I have selected a single problem to associate to to the grade/topic (although it is worth noting there were always a large variety to choose from). For each problem, I begin by highlighting the relevant common core curriculum which the problem is related to, followed by a warm up exercise to help introduce students to the problem. The warm ups are exercises that should be solvable by students with knowledge of the Common Core Standards. I then state the unsolved problem, and finally I provide (in brief!) some math background, context and links for those who are interested in digging deeper into the questions.

Disclaimer: it’s possible, though highly unlikely that any of the questions on the list will yield to “elementary” methods accessible to K–12 students. It is quite likely that many of the problems will remain unsolved in our lifetimes. So why bother introducing students to such problems? The point is that the questions reveal a sliver of the vast scope of mathematics, they provide many teachable moments and context for the mathematics that does constitute the common core, and (at least in my opinion) are fun to explore (for kids and adults alike). Perhaps most importantly, the unsolved problems and conjectures reveal that the mathematics taught in K–12 is right at the edge of our knowledge: we are always walking right along the precipice of mystery. This is true for other subjects taught in K–12 as well, and in my view this reality is one of the important lessons children can and should learn in school.

One good thing about the Common Core Standards, is that their structure allows, in principle, for the incorporation of standards for unsolved problems the students ought to know about. Hopefully education policymakers will consider extending the Common Core Standards to include such content. And hopefully educators will not only teach about what is not known, but will also encourage students to ask new questions that don’t have answers. This is because  “there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know.”


Relevant common core: “describing shapes and space.”

Warm up: can you color the map of Africa with four colors so that no two countries that touch are filled in with the same color?


Can you color in the map with three colors so that no two countries that touch are filled in with the same color?

The unsolved problem: without trying all possibilities, can you tell when a map can be colored in with 3 colors so that no two countries that touch are filled in with the same color?

Background and context: The four color theorem states (informally) that “given any separation of a plane into contiguous regions, producing a figure called a map, no more than four colors are required to color the regions of the map so that no two adjacent regions have the same color.” (from wikipedia). The mathematics statement is that any planar graph can be colored with four colors. Thus, the first part of the “warm up” has a solution; in fact the world map can be colored with four colors. The four color theorem is deceivingly simple- it can be explored by a kindergartner, but it turns out to have a lengthy proof. In fact, the proof of the theorem requires extensive case checking by computer. Not every map can be colored with three colors (for an example illustrating why see here). It is therefore natural to ask for a characterization of which maps can be 3-colored. Of course any map can be tested for 3-colorability by trying all possibilities, but a “characterization” would involve criteria that could be tested by an algorithm that is polynomial in the number of countries. The 3-colorability of planar graphs is NP-complete.

First grade

Relevant common core: “developing understanding of whole number relationships”.

Warm up: Suppose that in a group of people, any pair of individuals are either strangers or acquaintances. Show that among three people there must be at either at least two pairs of strangers or else at least two pairs of acquaintances.

The unsolved problem: Is it true that among 45 people there must be 5 mutual strangers or 5 mutual acquaintances?

Background and context: This problem is formally known as the question of computing the Ramsey number R(5,5). It is an easier (although probably difficult for first graders) problem to show that R(3,3)=6, that is, that among six people there will be either three mutual strangers or else three mutual acquaintances. It is known that 43 \leq R(5,5) \leq 49. The difficulty of computing Ramsey numbers was summed up by mathematician Paul Erdös as follows:

“Imagine an alien force, vastly more powerful than us, landing on Earth and demanding the value of R(5, 5) or they will destroy our planet. In that case we should marshal all our computers and all our mathematicians and attempt to find the value. But suppose, instead, that they ask for R(6, 6). In that case we should attempt to destroy the aliens.” – from Ten Lectures on the Probabilistic Method.

Second grade

Relevant common core: “building fluency with addition and subtraction”.

Warm up: Pascal’s triangle is a triangular array of numbers where each entry in a row is computed by adding up the pair of numbers above it. For example, the first six rows of Pascal’s triangle are:


Write out the next row of Pascal’s triangle.

The unsolved problem: Is there a number (other than 1) that appears 10 times in the (infinite) Pascal’s triangle?

Background and context: The general problem of determining whether numbers can appear with arbitrary multiplicity in Pascal’s triangle is known as Singmaster’s conjecture. It is named after the mathematician David Singmaster who posed the problem in 1971. It is known that the number 3003 appears eight times, but it is not known whether any other number appears eight times, nor, for that matter, whether any other number appears more than eight times.

Third grade

Relevant common core: “(1) developing understanding of multiplication and division and strategies for multiplication and division within 100”.

Warm up: Practice dividing numbers by 2 and multiplying by 3.

The unsolved problem: Choose a natural number n. If n is even, divide it by 2 to get n\div 2. If n is odd, multiply it by 3 and add 1 to obtain 3\times n+1. Repeat the process. Show that for any initial choice n, the repeating process will eventually reach the number 1.

Background and context: The conjecture is called the Collatz conjecture, after Lothar Collatz who proposed it in 1937. It is deceptively simple, but despite much numeric evidence that it is true, has eluded proof. Mathematician Terence Tao has an interesting blog post explaining why (a) the conjecture is likely to be true and (b) why it is likely out of reach of current techniques known to mathematicians.

Fourth grade

Relevant common core: “Determine whether a given whole number in the range 1-100 is prime or composite”.

Warm up: Write the number 100 as the sum of two prime numbers.

The unsolved problem: Show that every even integer greater than 2 can be expressed as the sum of two primes.

Background and context:  This problem is known as the Goldbach conjecture. It was proposed by the mathematician Christian Goldbach in a letter to the mathematician Leonhard Euler in 1742 and has been unsolved since that time (!) In 2013 mathematician Harald Helfgott settled the “weak Goldbach conjecture“, proving that every odd integer greater than 5 is the sum of three primes.

Fifth grade

Relevant common core: “Graph points on the coordinate plane”.

Warm up: A set of points in the coordinate plane are in general position if no three of them lie on a line. A quadrilateral is convex if it does not intersect itself and the line between any two points on the boundary lies entirely within the quadrilateral. Show that any set of five points in the plane in general position has a subset of four points that form the vertices of a convex quadrilateral.

The unsolved problem: A polygon is convex  if it does not intersect itself and the line between any two points on the boundary lies entirely within the polygon. Find the smallest number N so that any N points in the coordinate plane in general position contain a subset of 7 points that form the vertices of a convex polygon.

Background and context: The warm up exercise was posed by mathematician Esther Klein in 1933. The question led to the unsolved problem, which remains unsolved in the general case, i.e. how many points are required so that no matter how they are placed (in general position) in the plane there is a subset that form the vertices of a convex n-gon. There are periodic improvements in the upper bound (the most recent one posted on September 10th 2015), but the best current upper bound is still far from the conjectured answer.


A set of 8 points in the plane containing no convex pentagon.

Sixth grade

Relevant common core: “Represent three-dimensional figures using nets made up of rectangles and triangles”.

Warm upnet

The dodecahedron is an example of a convex polyhedron. A convex polyhedron is a polyhedron that does not intersect itself and that has the property that any line joining two points on the surface lies entirely within the polyhedron. Cut out the net of the dodecahedron (shown above) and fold it into a dodecahedron.

The unsolved problem: Does every convex polyhedron have at least one net?

Background and context: Albrecht Dürer was an artist and mathematician of the German Renaissance. The unsolved problem above is often attributed to him, and is known as Dürer’s unfolding problem. It was formally posed by the mathematician Geoffrey Shephard in 1975.


One of my favorite math art pieces: Albrecht Dürer’s engraving Melencolia I.

Seventh grade

Relevant common core: “Analyze proportional relationships and use them to solve real-world and mathematical problems.”

Warm up: Two runners, running at two different speeds v_0 and v_1, run along a circular track of unit length. A runner is lonely at time t if she is at a distance of at least 1/2 from the other runner at time t. If both runners all start at the same place at t=0, show that the runners will both be lonely at time t=\frac{1}{2(v_1-v_0)}.

The unsolved problem: Eight runners, each running at a speed different from that of the others, run along a circular track of unit length. A runner is lonely at time t if she is at a distance of at least 1/8 from every other runner at time t. If the runners all start at the same place at t=0, show that each of the eight runners will be lonely at some time.


Background and context: This problem is known as the lonely runner conjecture and was proposed by mathematician J.M Wills in 1967. It has been proved for up to seven runners, albeit with lengthy arguments that involve lots of case checking. It is currently unsolved for eight or more runners.

Eighth grade

Relevant common core: “Know and apply the properties of integer exponents”.

Warm up: Show that 3^{3n}+[2(3^n)]^3 = 3^{3n+2} for any integer greater than or equal to 1.

The unsolved problem: If A^x+B^y=C^z where A,B,C,x,y and are positive integers with x,y,z>2 then A,B and C have a common prime factor.

Background and context: This problem is known as Beal’s conjecture (named after the billionaire Andrew Beal). The famous “Fermat’s Last Theorem“, proved via the modularity theorem for semistable elliptic curves,  is a special case of this conjecture, an observation that hints at the difficulty of the conjecture. Andrew Beal is currently offering a prize of $1 million for its solution.

High School: Number and Quantity

Relevant common core: “In high school, students will be exposed to yet another extension of number, when the real numbers are augmented by the imaginary numbers to form the complex numbers.”

Warm up: Gaussian integers are numbers of the form a+bi where a and b are integers. The norm of a Gaussian integer a+bi is a^2+b^2. A Gaussian prime is a Gaussian integer that cannot be factored into Gaussian integers of smaller norm. Show that 4+i is a Gaussian prime.

The unsolved problem: Consider a circle in R2 with centre at the origin and radius r \geq 0. How many points there are inside this circle of the form (m,n) where m and n are both integers. In particular, denoting this number by N(r), find bounds on E(r) where N(r) = \pi r^2 + E(r).


Background and context: The problem is known as Gauss’ circle problem, and while phrased in terms of integer points in the plane, it is equivalent to asking for the number of Gaussian integers with norm less than a given constant.

High School: Algebra

Relevant common core: “Solve systems of equations”.

Warm up: An Euler brick is a cuboid whose edges and face diagonals all have integer length:


Algebraically, an Euler brick requires finding integers a,b,c,d,e,f such that

a^2+b^2=d^2, a^2+c^2=e^2 and b^2+c^2=f^2. Verify that (a,b,c,d,e,f) = (85, 132, 720,157, 725, 732) is an Euler brick.

The unsolved problem: A perfect cuboid is an Euler brick whose space diagonal g (see below) also has integer length:


Is there a perfect cuboid?

Background and context: The existence of perfect cuboids requires solving the systems of equations for the Euler brick with the addition of the requirement that a^2+b^2+c^2=g^2 with g an integer. The solution of (non-linear) systems of equations in many unknowns is the subject matter of algebraic geometry, in which a bridge is developed between the algebra and a corresponding geometry. Generally speaking the equations are easiest to solve when the solutions can be complex, harder when they are required to be real numbers (real algebraic geometry) and hardest when they are constrained to be integers (Diophantine, or arithmetic algebraic geometry).

High School: Functions

Relevant common core: “Understand the concept of a function and use function notation”.

Warm up: Euler’s totient function denoted by \varphi(n) assigns to each positive integer n the number of positive integers that are less than or equal to n and relatively prime to n. What is \varphi(p^k) when p is a prime number?

The unsolved problem: Is it true that for every n there is at least one other integer m \neq n with the property that \varphi(m) = \varphi(n)?

Background and context: The question is known as Carmichael’s conjecture, who posited that the answer is affirmative. Curiously, it has been proved (in The Distribution of Totients by Kevin Ford, 1998) that any counterexample to the conjecture must be larger than 10^{10^{10}}. Yet the problem is unsolved.

High School: Modeling

Relevant common core: “Modeling is the process of choosing and using appropriate mathematics and statistics to analyze empirical situations, to understand them better, and to improve decisions.”

Warm up: Read the biology paper The Biological Underpinnings of Namib Desert Fairy Circles and the mathematical modeling paper Adopting a spatially explicit perspective to study the mysterious fairy circles of Namibia (some basic calculus is required).

The unsolved problem: Develop a biologically motivated mathematical model that explains the fairy circles in Namibia.


Background and context: Fairy circles occur in Southern Africa, mostly in Namibia but also in South Africa. The circles are barren patches of land surrounded by vegetation, and they appear to go through life cycles of dozens of years. There have been many theories about them but despite a number of plausible models the phenomenon is poorly understood and their presence is considered “one of nature’s great mysteries“.

High School: Geometry

Relevant common core: “An understanding of the attributes and relationships of geometric objects can be applied in diverse contexts”.

Warm up: Calculate the area of the handset shape shown moving in the figure below. It consists of two quarter-circles of radius 1 on either side of a 1 by 4/π rectangle from which a semicircle of radius \scriptstyle 2/\pi\, has been removed.

The unsolved problem: What is the rigid two-dimensional shape of largest area that can be maneuvered through an L-shaped planar region (as shown above) with legs of unit width.

Background and context: This problem was posed by Leo Moser in 1966 and is known as the the moving sofa problem. It is known that the largest area for a sofa is between 2.2195 and 2.8284. The problem should be familiar to college age students who have had to manouever furniture in and out of dorm rooms.

High School: Statistics & Probability

Relevant common core: “Describe events as subsets of a sample space (the set of outcomes) using characteristics (or categories) of the outcomes, or as unions, intersections, or complements of other events (‘or,’ ‘and,’ ‘not’).”

Warm up: A family of sets is said to be union-closed if the union of any two sets from the family remains in the family. Write down five different examples of families of sets that are union-closed.

The unsolved problem: Show that for any finite union-closed family of finite sets, other than the family consisting only of the empty set, there exists an element that belongs to at least half of the sets in the family.

Background and context: The conjecture is known as Frankl’s conjecture, named after the mathematician Péter Frankl, and is also called the union closed sets conjecture. It is deceptively simple, but is known to be true only in a few very special cases.

[Update July 15, 2016: A preprint describing sleuth is available on BioRxiv]

Today my student Harold Pimentel released the beta version of his new RNA-Seq analysis method and software program called sleuth. A sleuth for RNA-Seq begins with the quantification of samples with kallisto, and together a sleuth of kallistos can be used to analyze RNA-Seq data rigorously and rapidly.


Why do we need another RNA-Seq program?

A major challenge in transcriptome analysis is to determine the transcripts that have changed in their abundance across conditions.  This challenge is not entirely trivial because the stochasticity in transcription both within and between cells (biological variation), and the randomness in the experiment (RNA-Seq) that is used to determine transcript abundances (technical variation), make it difficult to determine what constitutes “significant” change. 

Technical variation can be assessed by performing technical replicates of experiments. In the case of RNA-Seq, this can be done by repeatedly sequencing from one cDNA library. Such replicates are fundamentally distinct from biological replicates designed to assess biological variation. Biological replicates are performed by sequencing different cDNA libraries that have been constructed from repeated biological experiments performed under the same (or in practice near-same) conditions. Because biological replicates require sequencing different cDNA libraries, a key point is that biological replicates include technical variation.

In the early days of RNA-Seq a few papers (e.g. Marioni et al. 2008, Bullard et al. 2010) described analyses of technical replicates and concluded that they were not really needed in practice, because technical variation could be predicted statistically from the properties of the Poisson distribution. The point is that in an idealized RNA-Seq experiment counts of reads are multinomial (according to abundances of the transcripts they originate from), and therefore approximately Poisson distributed. Their variance is therefore approximately equal to the mean, so that it is possible to predict the variance in counts across technical replicates based on the abundance of the transcripts they originate from. There is, however, one important subtlety here: “counts of reads” as used above refers to the number of reads originating from a transcript, but in many cases, especially in higher eukaryotes, reads are frequently ambiguous as to their transcript of origin because of the presence of multi-isoform genes and genes families. In other words, transcript counts cannot be precisely measured. However, the statement about the Poisson distribution of counts in technical replicates remain true when considering counts of reads by genomic features because then reads are no longer ambiguous. 

This is why, in so-called “count-based methods” for RNA-Seq analysis, there is an analysis only at the gene level. Programs such as DESeq/DESeq2, edgeR and a literally dozens of other count-based methods first require counting reads across genome features using tools such as HTSeq or featureCounts. By utilizing read counts to genomic features, technical replicates are unnecessary in lieu of the statistical assumption that they would reveal Poisson distributed data, and instead the methods focus on modeling biological variation. The issue of how to model biological variation is non-trivial because typically very few biological replicates are performed in experiments. Thus, there is a need for pooling information across genes to obtain reliable variance estimates via a statistical process called shrinkage. How and what to shrink is a matter of extensive debate among statisticians engaged in the development of count-based RNA-Seq methods, but one theme that has emerged is that shrinkage approaches can be compatible with general and generalized linear models, thus allowing for the analysis of complex experimental designs.

Despite these accomplishments,  count-based methods for RNA-Seq have two major (related) drawbacks: first, the use of counts to gene features prevents inference about the transcription of isoforms, and therefore with most count-based methods it is impossible to identify splicing switches and other isoform changes between conditions. Some methods have tried to address this issue by restricting genomic features to specific exons or splice junctions (e.g. DEXSeq) but this requires throwing out a lot of data, thereby reducing power for identifying statistically significant differences between conditions. Second, because of the fact that in general \frac{a}{b} + \frac{c}{d} \neq \frac{a+b}{c+d} it is mathematically incorrect to estimate gene abundances by adding up counts to their genomic region. One consequence of this, is that it is not possible to accurately measure fold change between conditions by using counts to gene features. In other words, count-based methods are problematic even at the gene-level and it is necessary to estimate transcript-level counts.

While reads might be ambiguous as to exactly which transcripts they originated from, it is possible to statistically infer an estimate of the number of reads from each transcript in an experiment. This kind of quantification has its origin in papers of Jiang and Wong, 2009 and Trapnell et al. 2010. However the process of estimating transcript-level counts introduces technical variation. That is to say, if multiple technical replicates were performed on a cDNA library and then transcript-level counts were to be inferred, those inferred counts would no longer be Poisson distributed. Thus, there appears to be a need for performing technical replicates after all. Furthermore, it becomes unclear how to work within the shrinkage frameworks of count-based methods. 

There have been a handful of attempts to develop methods that combine the uncertainty of count estimates at the transcript level with biological variation in the assessment of statistically significant changes in transcript abundances between conditions. For example, the Cuffdiff2 method generalizes DESeq while the bitSeq method relies on a Bayesian framework to simultaneously quantify abundances at the transcript level while modeling biological variability. Despite showing improved performance over count-based methods, they also have significant shortcomings. For example the methods are not as flexible as those of general(ized) linear models, and bitSeq is slow partly because it requires MCMC sampling.

Thus, despite intensive research on both statistical and computational methods for RNA-Seq over the past years, there has been no solution for analysis of experiments that allows biologists to take full advantage of the power and resolution of RNA-Seq.

The sleuth model

The main contribution of sleuth is an intuitive yet powerful model for RNA-Seq that bridges the gap between count-based methods and quantification algorithms in a way that fully exploits the advantages of both.

To understand sleuth, it is helpful to start with the general linear model:

Y_t = X_t\beta_t + \epsilon_t.

Here the subscript t refers to a specific transcript, Y_t is a vector describing transcript abundances (of length equal to the number of samples), X_t is a design matrix (of size number of samples x number of confounders), \beta_t is a parameter vector (of size the number of confounders) and \epsilon_t is a noise vector (of size the number of samples). In this model the abundances Y_t are normally distributed. For the purposes of RNA-Seq data, the Y_t may be assumed to be the logarithm of the counts (or normalized counts per million) from a transcript, and indeed this is the approach taken in a number of approaches to RNA-Seq modeling, e.g. in limma-voom. A common alternative to the general linear model is the generalized linear model, which postulates that some function of Y_t has a distribution with mean equal to g^{-1}(X_t \beta_t) where g is a link function, such as log, thereby allowing for distributions other than the normal to be used for the observed data. In the RNA-Seq context, where the negative binomial distribution may make sense because it is frequently a good distribution for modeling count data, the generalized model is sometimes preferred to the standard general model (e.g. by DESeq2). There is much debate about which approach is “better”.

In the sleuth model the Y_t in the general linear model are modeled as unobserved. They can be thought of us the unobserved logarithms of true counts for each transcript across samples and are assumed to be normally distributed. The observed data D_t is the logarithm of estimated counts for each transcript across samples, and is modeled as

D_t = Y_t + \zeta_t

where the \zeta_t vector parameterizes a perturbation to the unobserved Y_t. This can be understood as the technical noise due to the random sequencing of fragments from a cDNA library and the uncertainty introduced in estimating transcript counts.

The sleuth model incorporates the assumptions that the response error is additive, i.e. if  the variance of transcript in sample is V(D_{t,i}) then V(D_{t,i}) = \sigma^2_t + \tau^2_t where the variance V(\epsilon_{t,i}|y_{t,i}) = \sigma^2_t and the variance V(\zeta_{t,i}|y_{t,i}) = \tau^2_t. Intuitively, sleuth teases apart the two sources of variance by examining both technical and biological replicates, and in doing so directly estimates “true” biological variance, i.e. the variance in biological replicates that is not technical.  In lieu of actual technical replicates, sleuth takes advantage of the bootstraps of kallisto which serve as accurate proxies.

In a test of sleuth on data simulated according to the DESeq2 model we found that sleuth significantly outperforms other methods:


In this simulation transcript counts were simulated according to a negative binomial distribution, following closely the protocol of the DESeq2 paper simulations. Reference parameters for the simulation were first estimated by running DESeq2 on a the female Finnish population from the GEUVADIS dataset (59 individuals). In the simulation above size factors were set to be equal in accordance with typical experiments being performed, but we also tested sleuth with size factors drawn at random with geometric mean of 1 in accordance with the DESeq2 protocol (yielding factors of 1, 0.33, 3, 3, 0.33 and 1) and sleuth still outperformed other methods.

There are many details in the implementation of sleuth that are crucial to its performance, e.g. the approach to shrinkage to estimate the biological variance \sigma^2_t. A forthcoming preprint, together with Nicolas Bray and Páll Melsted that also contributed to the project along with myself, will provide the details.

Exploratory data analysis with sleuth

One of the design goals of sleuth was to create a simple and efficient workflow in line with the principles of kallisto. Working with the Shiny web application framework we have designed an html interface that allows users to interact with sleuth plots allowing for real time exploratory data analysis.

The sleuth Shiny interface is much more than just a GUI for making plots of kallisto processed data. First, it allows for the exploration of the sleuth fitted models; users can explore the technical variation of each transcript, see where statistically significant differential transcripts appear in relationship to others in terms of abundance and variance and much more. Particularly useful are interactive features in the plots. For example, when examining an MA plot, users can highlight a region of points (dynamically created box in upper panel) and see their variance breakdown of the transcripts the points correspond to, and the list of the transcripts in a table below:



The web interface contains diagnostics, summaries of the data, “maps” showing low-dimensional representations of the data and tools for analysis of differential transcripts. The interactivity via Shiny can be especially useful for diagnostics; for example, in the diagnostics users can examine scatterplots of any two samples, and then select outliers to examine their variance, including the breakdown of technical variance. This allows for a determination of whether outliers represent high variance transcripts, or specific samples gone awry. Users can of course make figures showing transcript abundances in all samples, including boxplots displaying the extent of technical variation. Interested in the differential transcribed isoform ENST00000349155 of the TBX3 gene shown in Figure 5d of the Cuffdiff2 paper? It’s trivial to examine using the transcript viewer:


One can immediately see visually that differences between conditions completely dominate both the technical and biological variation within conditions. The sleuth q-value for this transcript is 3*10^(-68).

Among the maps, users can examine PCA projections onto any pair of components, allowing for rapid exploration of the structure of the data. Thus, with kallisto and sleuth raw RNA-Seq reads can be converted into a complete analysis in a matter of minutes. Experts will be able to generate plots and analyses in R using the sleuth library as they would with any R package. We plan numerous improvements and developments to the sleuth interface in the near future that will further facilitate data exploration; in the meantime we welcome feedback from users.

How to try out sleuth

Since sleuth requires the bootstraps and quantifications output by kallisto we recommend starting by running kallisto on your samples. The kallisto program is very fast, processing 30 million reads on a laptop in a matter of minutes. You will have to run kallisto with bootstraps- we have been using 100 bootstraps per sample but it should be possible to work with many fewer. We have yet to fully investigate the minimum number of bootstraps required for sleuth to be accurate.

To learn how to use kallisto start here. If you have already run kallisto you can proceed to the tutorial for sleuth. If you’re really eager to see sleuth without first learning kallisto, you can skip ahead and try it out using pre-computed kallisto runs of the Cuffdiff2 data- the tutorial explains where to obtain the data for trying out sleuth.

For questions, suggestions or help see the program websites and also the kallisto-sleuth user group. We hope you enjoy the tools!

I recently read a “brainiac” column in the Boston Globe titled “For 40 years, computer scientists looked for a solution that doesn’t exist” that caused me to facepalm so violently I now have pain in my right knee.

The article is about the recent arXiv posting:

Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false) by Arturs Backurs and Piotr Indyk (first posted in December 2014 and most recently updated in April 2015).

In the preprint, which was presented at STOC, Backurs and Indyk show that computing edit distance sub-quadratically is hard in the sense that if it were possible to compute the edit distance between two strings of length n in time O(n^{2-\delta}) for some constant \delta > 0 then it would be possible to solve the satisfiability of conjunctive normal form formulas in time that would violate the Strong Exponential Time Hypothesis (SETH)This type of “lower bound” reduction is common practice in theoretical computer science. In this case the preprint is of interest because (a) it sheds light on the complexity of a computational problem (edit distance) that has been extensively studied by theoretical computer scientists and (b) it extends the relevance of SETH and sheds more light on what the hypothesis implies.

In the introduction to the preprint Backurs and Indyk motivate their study by writing that “Unfortunately, that [standard dynamic programming] algorithm runs in quadratic time, which is prohibitive for long sequences”  and note that “considerable effort has been invested into designing faster algorithms, either by assuming that the edit distance is bounded, by considering the average case or by resorting to approximation”. They mention that the fastest known exact algorithm is faster than quadratic time running in O(n^2/logn), but note that “only barely so”. This is all true, and such preamble is typical for theoretical computer science, setting the stage for the result. In one single parenthetical remark intended to relate their work to applications they note that “the human genome consists of roughly 3 billions base pairs” and that exact edit distance computation is currently prohibitive for such long sequences.  This is a true sentence, but also not very precise. More on that later in the post…

The parenthetical genome comment in the preprint was picked up on in an MIT press release describing the results and implications of the paper. While the press release does a very good job explaining the reduction of Backurs and Indyk in an accessible way, it also picks up on the parenthetical genome comment and goes a bit further. The word “genome” appears three times, and it is stated that “a computer running the existing algorithm would take 1,000 years to exhaustively compare two human genomes.” This is technically true, but two human genomes are very similar, so that exhaustive comparison is not necessary to compute the edit distance. In fact, two human genomes are 99.9% identical, and with strong prior upper bounds on the edit distance the computation of edit distance is tractable.

In terms of different species, the edit distance between complete genomes is completely irrelevant, because of the fact that organisms undergo large scale segmental duplications and rearrangements over timescales of speciation. The MIT press release did not claim the Backurs and Indyk result was relevant for comparing genomes of species, but… the Boston Globe column states that

“By calculating the edit distance between the genomes of two organisms, biologists can estimate how far back in evolutionary time the organisms diverged from each other.”


Actually, even for closely related genomes edit distance is not the relevant statistic in biology, but rather the alignment of the genomes. This is why biological sequence analysis is based on what is called the Neeldeman-Wunsch algorithm and not the Wagner-Fischer algorithm mentioned in the MIT press release. Actually, because the length of sequences for which it is relevant to compute alignments is bounded and rather small (usually a few tens or hundreds of thousands of base pairs at most), the real computational issue for biological sequence analysis is not running time but memory. The Needleman-Wunsch and Wagner-Fischer algorithms require O(n^2) space and the algorithmic advance that made edit distance and alignment computations for biological sequence comparison tractable was Hirschberg’s algorithm which requires only linear space at a cost of only a factor of 2 in time.

Thus, the parenthetical comment of Backurs and Indyk, intended to merely draw attention of the reader to the possible practical relevance of the edit distance problem, mutated into utter rubbish by the time the work was being discussed in the Boston Globe. If that was the only issue with the brainiac column I might have just hurt my hand when I facepalmed, but the following paragraph caused me the real pain:

Because it’s computationally infeasible to compute the edit distance between two human genomes, biologists have to rely on approximations. They’d love a faster method, but Indyk and his coauthor, MIT graduate student Arturs Backurs, recently demonstrated a faster method is impossible to create. And by impossible they don’t mean “very hard” or “our technology has to improve first.” They mean something like, “by the laws of the mathematics, it can’t be done.”

We’ve already gone through point #1 but amazingly nothing is true in this paragraph:

  1. No, it is not computationally infeasible to compute the edit distance between two human genomes and no, biologists do not have to rely on approximations. But then there is also the fact that…
  2. Backurs and Indyk did not, in fact, demonstrate that a “faster method is impossible to create”. What they showed is that reducing the exponent in the quadratic algorithm would require SETH to be false. It turns out there are some respectable computer scientists that believe that SETH is indeed false. So…
  3. No, Backurs and Indyk, who understand mathematics quite well… most certainly do not believe that “by the laws of the mathematics, it can’t be done”. Actually,
  4. I’d bet that Backurs and Indyk don’t really believe that mathematics has laws. Axioms yes.. but laws…

I find inaccurate reporting of science to be highly problematic. How can we expect the general public to believe scientists claims about serious issues, e.g. the prospects for climate change, when news reports are filled with hype and rubbish about less serious issues, e.g. the complexity of edit distance?

There is a another issue that this example highlights. While complexity theory is extremely important and fundamental for computer science, and while a lot of the work in complexity has practical applications, there has been a consistent hyping of implications for biology where complexity theory is simply not very relevant. When I started my career it was fashionable to talk about protein folding being NP-complete. But of course millions of proteins were probably being folded in my body while I was writing this blog post. Does that fact that I am alive imply that P=NP?

Of course, exploring the complexity of simplified models for biological systems, such as the HP-model for protein folding that the above reference links to, has great value. It helps to sharpen the questions, to better specify models, and to define the limits of computational tractability for computational biology problems. But it is worth keeping in mind that in biology n does not go to infinity.


Blog Stats

%d bloggers like this: