Okay Houston, we’ve got a problem. We need more power. Case in point: a recently published study Apollo Lunar Astronauts Show Higher Cardiovascular Disease Mortality by Michael Delp et al. was picked up by news outlets with headlines such as:

The headlines were based on a sentence in the paper stating that “the CVD mortality rate among Apollo lunar astronauts (43%) was 4–5 times higher than in non-flight and LEO [low earth orbit] astronauts.”

A reading of the paper reveals that the “5 times more likely to die” risk calculation comes from $43\% \approx 9\% \times 5 = \left\lceil \frac{3}{7} \right\rceil$. The number 9% is the rate of cardiovascular disease observed in 35 non-flight astronauts whereas the number 43% is rate of cardiovascular disease in Apollo lunar astronauts (3 out of 7). In other words, the grandiose claims of the paper are based on three Apollo astronauts dying of cardiovascular disease rather than an expected single astronaut.

The authors themselves must have realized how unfounded their claims were, because the paper evidently flirts with fraud. They used a p-value cutoff of 0.1 to declare the lunar astronaut result “significant”. This is in contrast to the standard cutoff 0.05 which they use for the remainder of the results in the paper, and they justified the strange exception by suggesting that others  “considered [Fisher’s exact test] extremely conservative.” In addition, Ed Mitchell who died at the age of 85 on February 4th 2016 three months before the paper was submitted was excluded from the analysis. His inclusion would have increased the dataset size by 14%! Then there is the fact that they failed to mention the three astronauts who visited the moon twice and are still alive. Or that the lunar astronauts died ten years older on average. Perhaps worst of all, the authors imply that they have experimental data on a mechanism for their statistical (non)result by describing a follow-up experiment examining vascular responses of resistance arteries in irradiated mice. The problem is, the dose given to the mice was 87 times what the astronauts received!  None of this is complicated stuff… and one wonders how only one of the reporters writing about the study picked up on any of this (Sarah Kaplan  from the Washington post headlined the story with  Studying heart disease in astronauts yields clues but not conclusive evidence and concluded correctly “that’s just three of seven people, which doesn’t give you a whole lot of statistical power”.)

One would hope that this kind of paper would be retracted by the journal but my previous attempts to get journals to do the right thing, even when the research was clearly flawed, have been futile. Then there is the funding. Learning nothing doesn’t come for free and the authors’ “work”  was supported by grants from the National Space and Biomedical Research Institute under the NASA Cooperative Agreement. Clearly PI Michael Delp (who is also first author, corresponding author and dean of the College of Human Sciences at Florida State University) would like even more funding, proclaiming in interviews that he wanted to take “a deeper look into the medical history of the Apollo astronauts”, “study future questions” and that he was “working with NASA to conduct additional studies”. My experience in genomics has been that funding agencies typically turn a blind eye to flawed research leaving the task of evaluating the science to “peer reviewers”. I’ve seen many cases where individuals who published complete malarkey and hogwash continue to receive funding.  But it seems NASA cares about the research it funds and may not be on the same page as Delp et al. In a statement published on July 28th, NASA wrote that:

The National Space Biomedical Research Institute, a non-governmental organization with funding from NASA’s Human Research Program, supported a recent study published in Scientific Reports that looked at the rate of cardiovascular disease among Apollo astronauts.

With the current limited astronaut data referenced in the study it is not possible to determine whether cosmic ray radiation affected the Apollo astronauts.

This is not the first time NASA has published statements distancing itself from studies it has supported (either directly or indirectly). Following reports that a NASA-funded study found that industrial civilization was headed for irreversible collapse, NASA published a statement making clear it did not support the results of the study.

Thank you NASA! You have set a great example in taking ownership of the published work your funding enabled. Hopefully others (NIH!!) will follow suit in publicly disavowing poorly designed underpowered studies that make grandiose claims.

Disclosure: I collaborate with NASA scientists, contribute to projects partially funded by NASA, and apply for NASA funding.

Computational biologists do not all recognize the Kalman filter by name, but they know it in the form of the hidden Markov model (the Kalman filter is a hidden Markov model with continuous latent variables and Gaussian observed variables). I mention this because while hidden Markov models, and more generally graphical models, have had an extraordinary impact on the tools and techniques of high-throughput  biology, one of their primary conceptual sources, the Kalman filter, is rarely credited as such by computational biologists.

Illustration of the Kalman filter (from Wikipedia).

Where the Kalman filter has received high acclaim is in engineering, especially electrical and aeronautical engineering via its applications in control theory and where it has long been a mainstay of the fields. But it was not always so. The original papers, written in the early 1960s by Rudolf Kálmán and colleagues, were published in relatively obscure mechanical engineering journals rather than the mainstream electrical engineering journals of the time. This was because Kálmán’s ideas were initially scoffed at and rejected… literally. Kálmán second paper on the topic, New Results in Linear Filtering and Prediction Theory (with almost 6,000 citations), was rejected at first with a referee writing that “it cannot possibly be true”. The story is told in Grewal and Andrews’ book Kalman Filtering: Theory and Practice Using MATLAB. Of course not only was the Kalman filter theory correct, the underlying ideas were, in modern parlance, transformative and disruptive. In 2009 Rudolf Kálmán received the National Medal of Science from Barack Obama for his contribution. This is worth keeping in mind not only when receiving rejections for submitted papers, but also when writing reviews.

Rudolf Kálmán passed away at the age of 86 on Saturday July 2nd 2016.

I’m thrilled to announce that I will be moving to Caltech next year where I will be professor of computational biology!

Some people have asked me why I’m moving. First and foremost, we (my family) feel it is the right move for us as for a variety of reasons that I won’t get into here. For me personally, Caltech represents a unique, special, and extraordinary opportunity because it is an institution that fosters an environment facilitating research and teaching that, inasmuch as possible, is unencumbered by the minutiae of academia. In particular, Caltech is unintimidated by disciplinary boundaries, and enables a culture that I’ve yearned for my whole career. It doesn’t throw hundreds of millions of dollars at a football team (although the basketball team is doing pretty well). Its priorities are aligned with mine.

I’m leaving behind Berkeley, a university I started working at 17 years ago as a visiting assistant professor. I’ll miss Berkeley. I still remember the January 1999 phone call from Prof. Tsit Yuen Lam, announcing my appointment. I was honored to have been invited to conduct research and to teach at one of the world’s great institutions. Berkeley was, and still is, distinguished by it’s mission of providing world-class affordable public education. I can’t think of any university in the world that has done as well in pursuing this noble goal. Consider, for example, that UC Berkeley has almost as many Pell Grant recipients as all eight Ivy League schools combined. But with time, as I was allowed to drop the prefixes in my title, I found myself increasingly aware of the structure, organization and financing of the university. Two numbers that I learned have stuck in my mind: today, state funding comprises only 13% of the budget (likely even lower next year), less than half of what it was when I arrived. At the same time, tuition has increased by over a factor of three during the same time period. The squeeze has harmed the institution not just because of reductions in resources (though there have been many), but also because of the strain placed on the morale and mission of the university. Over time I started to question whether its world-class education was sustainable, and lamented that its affordability was becoming a myth. Over the past two years I’ve become increasingly aware that the reality of the university is at odds with my values. I’m sad for the University of California and for the citizens who are being harmed by the blows it is taking, and very much wish that the state will protect and nurture its education treasure. But I will be rooting for it from the sidelines.

I can’t wait to start at Caltech, and look forward to the next phase of my career!

My doppelgänger, Charlie Eppes, who developed algebraic statistics for computational biology at “CalSci” (Caltech).

My Caltech calculus professor, Tom Apostol, passed away yesterday May 8th 2016. When I arrived in his Math 1b class in the Fall of 1990 I thought, like most of my classmates, that I already knew calculus. I may have known it but I didn’t understand it. Apostol taught me the understanding part.

Apostol’s calculus books, affectionally called “Tommy I” and ‘Tommy II” were not just textbooks for students to memorize but rather mathematical wisdom and beauty condensed into a pair of books intended to transform grade-obsessed freshmen and sophomores into thinking human beings. Most of all, Apostol emphasized the idea that fundamental to mathematics is how one thinks about things, not just what one is thinking about. One of his iconic examples of this was the ice-cream-cone-proof that the focal property of an ellipse is a consequence of its definition as a section of a cone. Specifically, taking as the definition of an ellipse a plane curve obtained by intersecting an inclined plane with a cone

the goal is to both define the two foci, and then to derive the focal point property as illustrated below:

Apostol demonstrated the connection between conic sections and their foci via a proof and picture of Dandelin. His explanation, which I still remember from my freshman year in college, is beautiful (the excerpt below is from his linear algebra book):

Apostol didn’t invent Dandelin’s spheres but he knew they were “the right way” to think about conic sections, and he figured out “the right way” for each and every one of his explanations. His calculus books are exceptional for their introduction of integration before differentiation, his preference for axiomatic rather than mechanistic definition (e.g. of determinants) and his exercises that are “easy” when the material is understood “in the right way”. His course had a profound influence on my approach not only to mathematics, but to all of my learning in both the sciences and humanities.

One of Apostol’s signature traditions was his celebration of Gauss’ birthday. His classes were always filled with mathematical treats, but on April 30th every year he would hide a cake in the classroom before the students arrived and would serve an edible treat that day instead. Gauss turned 239 just last week. This seems to be a timely moment to take note of that prime number (Apostol was a number theorist) and to eat a slice of cake for Gauss, Apostol, and those who change our lives.

The Journal lmpact Factor (JIF) was first proposed by Eugene Garfield of Institute for Scientific Information (ISI) fame in 1955. It is a journal specific yearly citation measure, defined to be the average number of citations per paper of the papers published in the preceding two years. Obsession with the impact factor in the face of widespread recognition of its shortcomings as a tool for judging the value of science is an unfortunate example of “the tragedy of the commons”.

Leaving aside for a moment the flaws of the JIF, one may wonder whether journals do in fact have any impact? By “impact”one might imagine something along the lines of the simple definition in the Merriam-Webster Dictionary: “to have a strong and often bad effect on (something or someone)” and as an object for the impact one could study the researchers who publish, the scientific community as a whole, or the papers themselves. On the question of impact on papers, common sense suggests that publishing in a high profile journal helps a paper succeed and there is pseudoscience to support that case. However there is little in the way of direct measurement. Twitter to the rescue.

At the end of last year my twitter account was approaching 5,000 followers. Inspired by others, I found myself reflecting on this “milestone” and in anticipation of the event, I started to ponder the scientific utility of amassing such a large numbers of followers. There is, of course, a lot of work being done on natural language processing of twitter feeds, but it struck me that with 5,000 followers I was in a position to use twitter for proactive experimentation rather than just passive mining. Impact factors, followers, and twitter… it was just the right mix for a little experiment…

Starting in August of 2015, I began occasionally tweeting articles about 5 minutes apart, using the exact same text (the article title or brief description) but doing it once with the article linked via the journal website so that the journal name was displayed in the link and once with an a bit.ly link that revealed nothing about the journal source. Twitter analytics allowed me to see, for each tweet, a number of (highly correlated) tweet statistics, and I settled on measuring the number of clicks on the link embedded in the tweet. By switching the order of named/anonymized tweets I figured I could control for a temporal effect in tweet appearance, e.g. it seemed likely that users would click on the most recent links on their feed resulting in more views/clicks etc. for later tweets identical except for link type . Ideally this control would have been performed by A/B testing but that was not a possibility (see Supplementary Materials and Notes). I did my tweeting manually, generally waiting a few weeks between batches of tweets so that nobody would catch on to what I was doing (and thereby ruin the experiment). I was eventually caught forcing me to end the experiment but not before I squeezed in enough tests to achieve a significant p-value for something.

I hypothesized that twitter users will click on articles when, and only when, the titles or topics reflect research of interest to them. Thus, I expected not to find a difference in analytics between tweets made with journal names as opposed to bit.ly links. Strikingly, tweets of articles from Cell, Nature and Science journals (CNS) all resulted in higher clicks on the journal title rather than the anonymized link (p-value 0.0078). The average effect was a ratio of 2.166 between clicks on links with the journal name in comparison to clicks on bit.ly links. I would say that this number is the real journal impact factor of what are now called the “glamour journals” (I’ve reported it to three decimal digits to be consistent with the practice of most journals in advertising their JIFs). To avoid confusion with the standard JIF, I call my measured impact factor the RIF (relative impact factor).

One possible objection to the results reported above is that perhaps the RIF reflects an aversion to clicking on bit.ly links, rather than a preference for clicking on (glamour) journal links. I decided to test that by performing the same test (journal link vs. bit.ly link) with PLoS One articles:

Strikingly, in three out of the four cases tested users displayed an aversion to clicking on PLoS One links. Does this mean that publishing in PLoS One is career suicide? Certainly not (I note that I have published PLoS One papers that I am very proud of, e.g. Disordered Microbial Communities in Asthmatic Airways), but the PLoS One RIF of 0.877 that I measured (average ratio of journal:bit.ly clicks, as explained above) is certainly not very encouraging for those who hope for science to be journal name blind. It also suggests that the RIF of glamour journals does not reflect an aversion to clicking on bit.ly links, but rather an affinity for.. what else to call it but.. glamour.

Academics frequently complain that administrators are at fault for driving researchers to  emphasize JIFs, but at the recent Gaming Metrics meeting I attend UC Davis University Librarian MacKenzie Smith pointed out something which my little experiment confirms: “It’s you!

Supplementary Material and Notes

The journal Nature Communications is not obviously a “glamour journal”, however I included it in that category because the journal link name began nature.com/… Removing the Nature Communications tweet from the glamour analysis increases the glamour journal RIF to 2.264.

The ideal platform for my experiment is an A/B testing setup, and as my former coauthor Dmitry Ryaboy , head of the experimentation team at twitter explains in a blog post, twitter does perform such testing on users for internal purposes. However I could not perform A/B testing directly from my account, hence the implementation of the design described above.

I tried to tweet the journal/bit.ly tweets exactly 5 minutes apart, but once or twice I got distracted reading nonsense on twitter and was delayed by a bit. Perhaps if I’d been more diligent (and been better at dragging out the experiment) I’d have gotten more and better data. I am comforted by the fact that my sample size was >1.

Twitter analytics provided multiple measures, e.g. number of retweets, impressions, total engagements etc., but I settled on link clicks because that data type gave the best results for the argument I wanted to make. The table with the full dataset is available for download from here (or in pdf). The full list of tweets is here.

In 2014 biologist Mike Snyder published 42 papers. Some of his contributions to those papers were collaborative, but many describe experiments performed in his lab. To make sense of the number 42, one might naïvely assume that Snyder dedicated 8 days and 17 hours (on average) to each paper. That time would include the hours (days?) needed to conceive of ideas, setup and perform experiments, analyze data and write text. It’s not much time, yet even so it is probably an upper bound. This is because Snyder performed the work that resulted in the 42 papers in his spare time. As the chair of the genetics department at Stanford, an administrative position carrying many responsibilities, he was burdened with numerous inter- and intra- and extra- departmental meetings, and as an internationally renowned figure in genomics he was surely called upon to assist in numerous reviews of grants and papers, to deliver invited talks at conferences, and to serve on numerous national and international panels and committees. Moreover, as a “principal investigator” (PI), Snyder was certainly tasked with writing reports about the funding he receives, and to spend time applying for future funding. So how does someone write 42 papers in a year?

Snyder manages 36 postdocs,  13 research assistants,  11 research scientists, 9 visiting scientists, and 8 graduate students. One might imagine that each of the graduate students supervises four postdocs, so that the students/postdocs operate under a structure that enables independent work to proceed in the lab with only occasional supervision and input from the PI. This so-called “PI model” for biology arguably makes sense. The focus of experienced investigators on supervision allows them to provide crucial input gained from years of experience on projects that require not only creative insight, but also large amounts of tedious rote work. Modern biology is also fraught with interdisciplinary challenges, and large groups benefit from the diversity of their members.  For these reasons biology papers nowadays tend to have large numbers of authors.

It is those authors, the unsung heroes of science, that are the “cast” in Eric Lander’s recent perspective on “The Heroes of CRISPR” where he writes that

“the narrative underscores that scientific breakthroughs are rarely eureka moments. They are typically ensemble acts, played out over a decade or more, in which the cast becomes part of something greater than what any one of them could do alone.”

All of the research papers referenced in the Lander perspective have multiple authors, on average about 7, and going up to 20. When discussing papers such as this, it is therefore customary to refer to “the PI and colleagues…” in lieu of crediting all individual authors. Indeed, this phrase appears throughout Lander’s perspective, where he writes “Moineau and colleagues..”, “Siksnys and colleagues..”, etc. It is understood that this means that the PIs guided projects and provided key insights, yet left most of the work to the first author(s) who were in turn aided by others.

There is a more cynical view of the PI model, namely that by running large labs PIs are able to benefit from a scientific roulette of their own making. PIs can claim credit for successes while blaming underlings for failures. Even one success can fund numerous failures, and so the wheel spins faster and faster…the PI always wins. Of course the mad rush to win leads to an overall deterioration of quality. For example, it appears that even the Lander perspective was written in haste (Lander may not be as prolific as Snyder but according to Thomson Reuters he did publish 21 “hot” papers in 2015, placing him fourth on the list of the world’s most influential scientific minds, only a few paces behind winner Stacey Gabriel). The result is what appears to be a fairly significant typo in Figure 2 of his perspective, which shows the geography of the CRISPR story. The circle labeled “9” should be colored in blue like circle “10”, because that color represents “the final step of biological engineering to enable genome editing”. The Jinek et al. 2012 paper of circle “9” clearly states that “We further show that the Cas9 endonuclease can be programmed with guide RNA engineered as a single transcript to target and cleave any dsDNA sequence of interest”. Moreover the Jinek et al. 2013 paper described editing in mammalian cells but was submitted in 2012 (the dates in Lander’s figure are by submission), so circle “9” should also be labeled with a “10” for “Genome editing in mammalian cells”.  In other words, the figure should have looked like this:

Lander acknowledges the reality of the PI model in the opening of his perspective, where he writes that “the Perspective describes an inspiring ensemble of a dozen or so scientists who—with their collaborators and other contributors whose stories are not elaborated here—discovered the CRISPR system, unraveled its molecular mechanisms, and repurposed it as a powerful tool for biological research and biomedicine. Together, they are the Heroes of CRISPR.”

But in choosing to highlight a “dozen or so” scientists, almost all of whom are established PIs at this point, Lander unfairly trivializes the contributions of dozens of graduate students and postdocs who may have made many of the discoveries he discusses, may have developed the key insights, and almost certainly did most of the work. For example, one of the papers mentioned in the Lander perspective is

F. Zhang*, L. Cong*, S. Lodato, S. Kosuri, G.M. Church and P. Arlotta, Efficient construction of sequence-specific TAL effectors for modulating mammalian transcription, 2011

I’ve placed a * next to the names of F. Zhang and L. Cong as is customary to do when “authors contributed equally to this work” (a quote from their paper). Yet in his perspective Lander writes that “when TALEs were deciphered, Zhang, with his collaborators Paola Arlotta and George Church…successfully repurposed them for mammals”. The omission of Cong is surprising not only because his contribution to the TALE work was equal to that of Zhang, but because he is the same Cong that is first author of Cong et al. 2013 that Lander lauds for its number of citations (Le Cong is a former graduate student of Zhang, and is now a postdoc at the Broad Institute). Another person who is absent from Lander’s perspective, yet was first author on two of the key papers mentioned is Martin Jinek. He and Charpentier’s postdoc Krzysztof Chylinski were deemed fundamental to the CRISPR story elsewhere.

Lander concludes his perspective by writing that “the [CRISPR] narrative…is a wonderful lesson for a young person contemplating a life in science”. That may be true of the CRISPR narrative, but the lesson of his narrative is that young persons should not count on leaders in their field to recognize their work.

COI statement: Eric Lander is my former (informal) Ph.D. advisor. Jennifer Doudna is a current colleague and collaborator.

One of my favorite systems biology papers is the classic “Stochastic Gene Expression in a Single Cell” by Michael Elowitz, Arnold J. Levine, Eric D. Siggia and Peter S. Swain (in this post I refer to it as the ELSS paper).

What I particularly like about the paper is that it resulted from computational biology flipped. Unlike most genomics projects, where statistics and computation are servants of the data, in ELSS a statistical idea was implemented with biology, and the result was an elegant experiment that enabled a fundamental measurement to be made.

The fact the ELSS implemented a statistical idea with biology makes the statistics a natural starting point for understanding the paper. The statistical idea is what is known as the law of total variance. Given a random (response) variable $C$ with a random covariate  $Z$, the law of total variance states that the variance of $C$ can be decomposed as:

$Var[C] = E_Z[Var[C|Z]] + Var_Z[E[C|Z]]$.

There is a biological interpretation of this law that also serves to explain it: If the random variable $C$ denotes the expression of a gene in a single cell ($C$ being a random variable means that the expression is stochastic), and $Z$ denotes the (random) state of a cell, then the law of total variance describes the “total noise” $Var[C]$ in terms of what can be considered “intrinsic” (also called “unexplained”) and “extrinsic” (also called “explained”) noise or variance.

To understand intrinsic noise, first one understands the expression $Var[C|Z]$ to be the conditional variance, which is also a random variable; its possible values are the variance of the gene expression in different cell states. If $Var[C]$ does not depend on $Z$ then the expression of the gene is said to be homoscedastic, i.e., it does not depend on cell state (if it does then it is said to be heteroscedastic). Because $Var[C|Z]$ is a random variable, the expression $E_Z[Var[C|Z]$ makes sense, it is simply the average variance (of the gene expression in single cells) across cell states (weighted by their probability of occurrence), thus the term “intrinsic noise” to describe it.

The expression $E[C|Z]$ is a random variable whose possible values are the average  of the gene expression in cells. Thus, $Var_Z[E[C|Z]]$ is the variance of the averages; intuitively it can be understood to describe the noise arising from different cell state, thus the term “extrinsic noise” to describe it (see here for a useful interactive visualization for exploring the law of total variance).

The idea of ELSS was to design an experiment to measure the extent of intrinsic and extrinsic noise in gene expression by inserting two identically regulated reporter genes (cyan fluorescent protein and yellow fluorescent protein) into E. coli and measuring their expression in different cells. What this provides are measurements from the following model:

Random cell states are represented by random variables $Z_1,\ldots,Z_n$ which are independent and identically distributed, one for each of n different cells, while random variables $C_1,\ldots,C_n$  and $Y_1,\ldots,Y_n$ correspond to the gene expression of the cyan , respectively yellow, reporters in those cells. The ELSS experiment produces a single sample from each variable $C_i$ and $Y_i$, i.e. a pair of measurements for each cell. A hierarchical model for the experiment, in which the marginal (unconditional) distributions $C_i$ and $Y_i$ are identical, allows for estimating the intrinsic and extrinsic noise from the reporter expression measurements.

The model above, on which ELSS is based, was not described in the original paper (more on that later). Instead,  in ELSS the following estimates for intrinsic, extrinsic and total noise were simply written down:

$\eta^2_{int} = \frac{\frac{1}{n} \left( \sum_{i=1}^n \frac{1}{2} (c_i-y_i)^2\right) }{\overline{c} \cdot \overline{y}},$ (intrinsic noise)

$\eta^2_{ext} = \frac{\frac{1}{n}\sum_{i=1}^n c_i \cdot y_i - \overline{c} \cdot \overline{y}}{\overline{c} \cdot \overline{y}},$ (extrinsic noise)

$\eta^2_{tot} = \frac{\frac{1}{n}\sum_{i=1}^n \frac{1}{2} (c_i^2+y_i^2)-\overline{c}\cdot\overline{y}}{\overline{c} \cdot \overline{y}}.$ (total noise)

Here  $c_1,\ldots,c_n$ and $y_1,\ldots,y_n$ are the measurements of cyan respectively yellow reporter expression in each cell, $\overline{c} = \frac{1}{n}\sum_{i=1}^n c_i$ and $\overline{y} = \frac{1}{n}\sum_{i=1}^n y_i$.

Last year, Audrey Fu, at the time a visiting scholar in Jonathan Pritchard’s lab and now assistant professor in statistical science at the University of Idaho,  studied the ELSS paper as part of a journal club. She noticed some inconsistencies with the proposed estimates in the paper, e.g. it seemed to her that some were biased, whereas others were not, and she proceeded to investigate in more detail the statistical basis for the estimates. There had been a few papers trying to provide statistical background, motivation and interpretation for the formulas in ELSS (e.g. A. Hilfinger and J. Paulsson, Separating intrinsic from extrinsic fluctuations in dynamic biological systems, 2011 ), but there had not been an analysis of bias, or for that matter other statistical properties of the estimates. Audrey started to embark on a post-publication peer review of the paper, and having seen such reviews on my blog contacted me to ask whether I would be interested to work with her. The project has been a fun hobby of ours for the past couple of months, eventually leading to a manuscript that we just posted on the arXiv:

Audrey Fu and Lior Pachter, Estimating intrinsic and extrinsic noise from single-cell gene expression measurements, arXiv 2016.

Our work provides what I think of as a “statistical companion” to the ELSS paper. First, we describe a formal hierarchical model which provides a useful starting point for thinking about estimators for intrinsic and extrinsic noise. With the model we re-examine the ELSS formulas, derive alternative estimators that either minimize bias or mean squared error, and revisit the intuition that underlies the extraction of intrinsic and extrinsic noise from data. Details are in the paper, but I briefly review some of the highlights here:

Figure 3a of the ELSS paper shows a scatterplot of data from two experiments, and provides a geometric interpretation of intrinsic and extrinsic noise that can guide intuition about them. We have redrawn their figure (albeit with a handful of points rather than with real data) in order to clarify the connections to the statistics:

The Elowitz et al. caption correctly stated that “Each point represents the mean fluorescence intensities from one cell. Spread of points perpendicular to the diagonal line on which CFP and YFP intensities are equal corresponds to intrinsic noise, whereas spread parallel to this line is increased by extrinsic noise”. While both statements are true, the one about intrinsic noise is precise whereas the one about extrinsic noise can be refined. In fact, the ELSS extrinsic noise estimate is the sample covariance (albeit biased due to a prefix of n in the denominator rather than n-1), an observation made by Hilfinger and Paulsson. The sample covariance has a  (well-known) geometric interpretation: Specifically, we explain that it is the average (signed) area of triangles formed by pairs of data points (one the blue one in the figure): green triangles in Q1 and Q3 (some not shown) represent a positive contribution to the covariance and magenta triangles in Q2 and Q4 represent a negative contribution. Since most data points lie in the 1st (Q1) and 3rd (Q3) quadrants relative to the blue point, most of the contribution involving the blue point is positive. Similarly, since most pairs of data points can be connected by a positively signed line, their positive contribution will result in a positive covariance. We also explain why naïve intuition of extrinsic noise as the variance of points along the line $c=y$ is problematic.

The estimators we derive are summarized in the table below (Table 1 from our paper):

There is a bit of algebra that is required to derive formulas in the table (see the appendices of our paper). The take home messages are that:

1. There is a subtle assumption underlying the ELSS intrinsic noise estimator that makes sense for the experiments in the ELSS paper, but not for every type of experiment in which the ELSS formulas are currently used. This has to do with the mean expression level of the two reporters, and we provide a rationale and criteria when to apply quantile normalization to normalize expression to the data.
2. The ELSS intrinsic noise estimator is unbiased, whereas the ELSS extrinsic noise estimator is (slightly) biased. This asymmetry can be easily rectified with adjustments we derive.
3. The standard unbiased estimator for variance (obtained via the Bessel correction) is frequently, and correctly, criticized for trading off mean squared error for bias. In practice, it can be more important to minimize mean squared error (MSE). For this reason we derive MSE minimizing estimators. While the MSE minimizing estimates converge quickly to the unbiased estimates (as a function of the number of cells), we believe there may be applications of the law of total variance to problems in systems biology where sample sizes are smaller, in which case our formulas may become useful.

The ELSS paper has been cited more than 3,000 times and with the recent emergence of large scale single-cell RNA-Seq the ideas in the paper are more relevant than ever. Hopefully our statistical companion to the paper will be useful to those seeking to obtain a better understanding of the methods, or those who plan to extend and apply them.

In the 1930s the pseudonym Nicolas Bourbaki was adopted by a group of French mathematicians who sought to axiomatize and formalize mathematics in a rigorous manner. Their project greatly influenced mathematics in the 20th century and contributed to its present “definition -> theorem -> proof” format (Bourbaki went a bit overboard with abstraction and generality and eventually the movement petered out).

In biology definitions seem to be hardly worth the trouble. For example, the botanist Willhelm Johannsen coined the term “gene” in 1909 but there is still no universal agreement on a “definition”.  How could there be? Words associated with biological processes or phenomena that are being understood in a piecemeal fashion naturally change their meaning over time (the philosopher Joseph Woodger did try to axiomatize biology in his 1938 book “The Axiomatic Method in Biology” but his attempt, with 53 axioms, was ill fated).

Bioinformatics occupies a sort of murky middle-ground as far as definitions go. On the one hand, mathematics underlies many of the algorithms and statistical models that are used so there is a natural tendency to formalize concepts, but just when things seem neat and tidy and the lecture begins with “Let the biological sequence be a string on an alphabet $\Sigma$ of size 4 or 20″, biology comes along and rips the blanket off the bed (there are 22 proteinogenic amino acids, molecules may be single or double stranded, they could be circular, bases may be chemically modified, and so on and so forth…).

In some cases things have really gotten out of hand. The words “alignment” and “mapping” appear, superficially, to refer to precise objects and procedures, but they probably have more meanings than the verb “run” (which is in the hundreds). To just list the meanings would be a monumental (impossible?) task. I will never attempt it. However I recently came across some twitter requests for clarification of read alignment and mapping, and I thought explaining those terms was perhaps a more tractable problem.

Here is my attempt at a chronological clarification:

1998: The first use of the term “read alignment” in a paper (in the context of genomics) is, to my knowledge, in “Consed: A Graphical Tool for Sequence Finishing” by D. Gordon, C. Abajian and P. Green. The term “read alignment” in the paper was used in the context of the program phrap, and referred to the multiple alignment of reads to a sequence assembled from the reads, so as to generate a consensus sequence with reduced error. The output looked something like this:

Figure 1 from the Consed paper.

To obtain multiple alignments phrap required a “model”. Roughly speaking, a model is an objective scoring function along with parameters, and, although not discussed in the Consed paper, that model was crucial to the functionality and performance of phrap. The phrap model included not just mismatches of bases but also insertions and deletions. Although phrap was designed to work with Sanger reads, there is conceptually not much difference between the way “read alignment” was used in the Consed paper and the way it is sometimes used now with “next-generation” reads. In fact, the paper “Correcting errors in short reads by multiple alignments” by L. Salmela and J. Schröder, 2011 does much of the same thing (it would have been nice if they cited phrap, although in their defense phrap was never published; Consed was about the visualization tool).

For the purposes of defining “read alignment” it is worthwhile to consider what I think were the four key ingredients of phrap:

• An underlying notion of what a read alignment is. Without being overly formal I think its useful to consider the form of the phrap input/output . This is my own take on it. First, there was a set of reads $\mathcal{F}$ with each read $F \in \mathcal{F}$ consisting of a sequence so that $F$ is an ordered set $F =\{\sigma_1, \ldots, \sigma_l\}$ (for simplicity I make the assumption that the reads were single end, each read was of the same length l, and each sequence element $\sigma \in \{A,C,G,T\}$, and of course I’ve ignored the reverse strand; these assumptions and restrictions can and should be relaxed but I work with them here for simplicity). Next, there was a set of target sequences (contigs, to be precise) that I denote by  $\mathcal{T}$ with $T \in \mathcal{T}$ . These each consisted of a sequence, again an ordered set, $T = \{\sigma^T_1,\ldots,\sigma^T_{|T|}\}$. A “read alignment” in phrap consisted of a pair of maps: first $\psi:\mathcal{F} \rightarrow \mathcal{T} \cup \{\emptyset\}$ specifying for each read the contig it maps to (or the mapping to $\{\emptyset\}$ denoting an unaligned read), and also a set of additional maps $\mathcal{L}_F:F \rightarrow \psi(F) \cup \emptyset$ assigning to each base in each read a corresponding base (or gap in the case of $\{\emptyset\}$) in the contig the read mapped to.
• A model: Given the notion of read alignment, a model was needed to be able to choose between different alignments. In phrap, the multiple alignments depended on scores for mismatches and gaps (the defaults were a score of 11 for a match, -2 for a mismatch, -4 for initiating a gap, -3 for extending a gap). In general, a model provides an approach to distinguishing and choosing between the read alignments as (roughly) defined above.
• An algorithm for read alignment. Given a model there was a need for an algorithm to actually find high scoring alignments. In phrap the algorithm consisted of performing a banded Smith-Waterman alignment using the model (roughly) specified above. The Smith-Waterman algorithm produced alignments $\mathcal{L}_F$ that were global, i.e. order preserving between the elements of $F$ and $\psi(F)$.
• Data structures for enabling efficient alignment. With large numbers of targets and many reads, there was a need for phrap to efficiently prune the search for read alignments. Running the banded Smith-Waterman alignment algorithm on all possible places a read could align to would have been intractable. To address this problem,  phrap made use of a hash to quickly find exact matches from parts of the read to targets, and then ran the Smith-Waterman banded alignment only on the resulting credible locations. Statistically, this could be viewed as a filter for very low probability alignments.

The four parts of phrap are present in most read alignment programs today. But the details can be very different. More on this in a moment.

2004: Six years after Consed paper the term “read alignment” shows up again in the paper “Comparative genome assembly” by Pop et al. In the paper the term “read alignment” was not defined but rather described via a process. Reads from obtained via shotgun sequencing of the genome of one organism were “aligned” to a reference genome from another closely related organism using a program called MUMmer (the goal being to use proximity of alignments to assist in assembly). In other words, the “alignments” were the output of MUMmer, followed by a post-processing with a series of heuristics to deal with ambiguous alignments. It is instructive to compare the alignments to those of phrap in terms of the four parts discussed above. First, in the formulation of Pop et al. the alignments were between a single read and a genome, not multiple alignments of sets of reads to numerous contigs. In the language above the map $\psi$ was much simpler in Pop et al., with the range consisting only of chromosomes in a genome rather than the thousands of contigs in an assembly. Second, although MUMmer utilized an implicit model for alignment, the model was of a very different meaning. Reads from one organism were being aligned to sequences from another, a task fundamentally different than aligning reads back to sequences they originated from (albeit with error). In other words, “alignment” in the Pop et al. paper referred to identification of homology. This opens up a whole other bag of worms that I wrote about in a paper with Colin Dewey in 2006 and that I will not open here. In terms of data structures, MUMmer made use of a suffix tree rather than a hash, a distinction that is important in that suffix trees provide a much richer possibility for search (MUM stands for maximum unique match, the kind of thing suffix trees are useful for finding and that MUMmer took advantage of). A lot of the research in read alignment has centered on developing efficient and compact data structures for string search.

Another development in the year 2004 was the appearance of the term “read mapping”. In the paper “Pash: efficient genome-scale anchoring by positional hashing” by Kalafus et al., hashing of k-mers was used to “anchor” reads. The word “anchoring” was and still is somewhat ambiguous as to its meaning, but in the Pash context it referred to the identification of positions of exact matching k-mers within a read on a target sequence. In other words, Pash could be considered to provide only partial alignments, because some bases within the read would not be aligned at all. Formally, one would say that the functions $\mathcal{L}_F$ had as their domain not the full reads $\mathcal{F}$ but rather strict subsets $\mathcal{S} \subset F$ where $\mathcal{S}$ is nonempty. For this reason, in the Pash paper the authors suggest post-processing their output with base-pair level sequence alignment programs popular at the time (e.g. AVID, LAGAN) to fully “align” the reads (they could have also suggested performing a banded Smith-Waterman alignment like phrap).

2006: The first sequencer manufactured by Solexa, the Genome Analyzer was launched, and following it the ELAND program for “read alignment” was provided  to customers buying the machine. This was, in my opinion, a landmark even in “read alignment” because with next-generation sequencing and ELAND read alignment became mainstream, a task to be performed in the hands of individual users, rather than just a step among many in complex genome sequencing pipelines manufactured in large genome sequencing centers. In the initial release of ELAND, the program performed “ungapped read alignment”, which in the language of this post means that the maps $\mathcal{L}_F$  had the property that if  $\mathcal{L}_F(i) = \sigma^{\psi(F)}_{j}$ then $\mathcal{L}_F(i+1) = \sigma^{\psi(F)}_{j+1}$, i.e. there were no gaps in the read alignments.

2009: With two programs published back-to-back in the same year, BWA (“Fast and accurate alignment with the Burrows-Wheeler transform“, Heng Li and Richard Durbin)  and Bowtie (“Ultrafast and memory-efficient alignment of short DNA sequences to the human genome” by Ben Langmead, Cole Trapnell, Mihai Pop and Steven L Salzberg) read alignment took a big step forward with the use of more sophisticated data structures for rapid string matching.  BWA and Bowtie were based on the Burrows-Wheeler transform, and specifically innovations with the way it was used, to speed up read alignment. In the case of Bowtie, the alignments were ungapped just as originally with ELAND, although Bowtie2 published in 2012 incorporated gaps into the model.

Also of note is that by this time the terms “read alignment” and “read mapping” had become interchangeable. The BWA and Bowtie papers both used both terms, as did many other papers. Moreover, the terms started to take on a specific meaning in terms of the notion of “alignment”. They were referring to alignment of reads to a target genome, and moreover alignment of individual bases to specific positions in the target genome (i.e. the maps $\mathcal{L}_F$). This was in large part due to the publication of the sequence alignment format also in 2009 in Heng Li et al.‘s important paper on SAMtools.

Finally, 2009 was the year the notion that “spliced read alignment” was introduced. “Spliced alignment” was a concept around since the mid 90s (see, e.g. “Gene recognition by spliced alignment” by MS Gelfand, AA Mironov and PA Pevzner). The idea was to align cDNAs to a genome, requiring alignments allowing for long gaps (for the introns). In the paper “Optimal spliced alignments of short sequencing reads” by Fabio De Bona et al. the authors introduced the tool QPalma for this purpose, building on a spliced alignment tool from the group called Palma. Here the idea was to extend “spliced alignment” to reads, and while QPalma was never widely adopted, other programs such as TopHat, GSNAP, Star, etc. have becomes staples of RNA-Seq. Going back to the formalism I provided with the description of phrap, what spliced alignment did was to extend the models used for read alignment to allow for long gaps.

In the following years, there was what I think is fair to characterize as an explosion in the number of read alignment papers and tools. Models were tweaked and new indexing methods introduced, but the fundamental notion of “read alignment” (by now routinely also called “read mapping”) remained the same.

2014: In his Ph.D. thesis published in December of last year, my former student Nicolas Bray had the idea of entirely dropping the requirement for a read alignment to include the maps $\mathcal{L}_F$. In his setup, a read alignment would consist only of specifying for each read which target sequence it originated from, but not where in the target sequence it aligned to. The point, elaborated on in Chapter 2 of Nick’s thesis, was motivated by the observation that for many applications (e.g. RNA-Seq) the sufficient statistics for the relevant models are essentially the information contained in the map $\psi$ alone. He called this notion of read alignment “pseudoalignment”, as it is a fundamentally different concept than what was previously thought of as read alignment.

Of course there is not much to writing down a new definition, and there wouldn’t be much of a point to pseudoalignment it if not for two things: Nick’s realized that pseudoalignments could be obtained much faster than standard read alignments and second, that the large imprints of alignment files needed for the vast amounts of sequence that was being produced were becoming a major bottleneck in genomics. Working on RNA-Seq projects with collaborators, this is how we felt on a regular basis:

Pseudoalignment offered a reprieve from what was becoming an unbearable task, and (I believe) it is going to turn out to be a useful and widely applicable concept.

2015: Earlier this year we posted a preprint to the arXiv detailing a method for performing pseudoalignment based on the ideas in Nick’s thesis, with the additional idea of Páll Melsted to use a new type of indexing scheme based on a de Bruijn graph of the target sequences. The program implementing these ideas is called kallisto. Looking back at phrap, kallisto can be seen to consist of a different notion of read alignment (pseudoalignment), a novel data structure (the target sequence de Bruijn graph), and a new fast algorithm for finding pseudoalignments (based on intersecting sets of k-mer matches). Although we did not discuss the model explicitly in our paper, it is implicit via the algorithm used to find pseudoalignments. One useful thing to keep in mind is that the nature of the index of kallisto allows for extracting more than a pseudoalignment. If desired, one can obtain positions within the targets for some k-mers contained in a query read, or even strand information, and in fact such data is used in some parts of kallisto. When such information is extracted, what kallisto is performing is no longer a pure pseudoalignment, but rather what the Pash authors called a read mapping (although that term is unsuitable because as mentioned it has become synonyms with read alignment). It turns out there is now another word for exactly that concept.

Following the publication of kallisto on the arXiv another preprint was posted on the bioRxiv introducing a program called Salmon and a new term called “lightweight alignment” (Salmon: Accurate, Versatile and Ultrafast Quantification from RNA-Seq Data Using Lightweight-Alignment by Rob Patro, Geet Duggal and Carl Kingsford) . Lightweight alignment as a notion of alignment is exactly what was just referred to: the notion of read mapping in Pash. Utilizing an index and software by Heng Li called the FMD index, Salmon determines “lightweight alignment” chains of what are called maximal exact matches and super-maximal exact matches (not to be confused with the maximal unique matches, or MUMs discuss above).  These provide partial alignment information for the sequences in each read, i.e. just like with Pash one obtains maps $\mathcal{L}_F$ but with the domain sometimes restricted to a nonempty subset $\mathcal{S} \subset \mathcal{F}$ when the MEMs or SMEMs don’t cover the read. Again, I find it useful to refer back and compare to phrap: with Salmon the “read alignment” consists of a different type of index and an algorithm associated with it (making sure the MEMs/SMEMs are consistent). The notion of alignment is distinct from that of phrap by virtue of the restriction of the domain in the maps $\mathcal{L}_F$.  I think a new term for the latter concept since the one Pash used failed to stick, and “lightweight alignment” seems fine.

Recently (just last week) a new read alignment related term was introduced in a bioRxiv preprint. A paper describing a new program called RapMap uses the term quasi-mapping for the procedure implemented in the software. RapMap finds pseudoalignments in the sense described above, with the only difference being the type of index that is used (instead of the target de Bruijn graph, it utilizes a generalized suffix array together with a hash table related to the array). This time I think the introduction of new terminology was unfortunate. Just as with kallisto, RapMap can be used to extract more information than just a pseudoalignment from the index (i.e. a lightweight alignment), and indeed the program does so. But also like kallisto, the speed of RapMap is derived from the idea of efficiently finding pseudoalignments by intersecting sets of k-mer matches. In other words, quasi-mapping is a procedure and not a concept; what the procedure outputs is a lightweight alignment via a pseudoalignment.

tl;dr

Read alignment is what you think it means.

Lightweight alignment is “partial” read alignment (it should have been called read mapping but that was taken). Some, but not necessarily all bases in a read are aligned to specific bases among target sequences.

Pseudoalignment is read assignment to target sequences without any base-level sequence alignment.

Read alignment, lightweight alignment and pseudoalignment are concepts, not algorithms.

Quasi-mapping is a procedure and not a concept. The procedure outputs a lightweight alignment via a pseudoalignment.

Disclaimer: I’ve almost certainly gotten some dates wrong, missed some papers I should have cited, and omitted some crucial developments in read alignment that I should have included. Please let me know if you think that is the case by commenting and I’ll update the post as needed.

The hierarchical classification of nature initiated by Carl Linnaeus today consists of eight major “ranks”, namely species, genus, family, order, class, phylum, kingdom and domain:

In the microbial world it makes sense to refine the standard taxonomy by subdividing species into strains. An important reason to do so is that bacterial taxonomy must reflect not only phylogeny but also pathogenicity, and small differences in genomes can translate to large pathogenic differences. This has implications for metagenomic analyses of microbial communities: for many biomedical applications it is desirable to characterize individuals strains.

Metagenomics has its roots in culture-independent retrieval and sequencing of 16S rRNA genes, and while variations in 16S can sometimes distinguish between strains, a single gene is not always sufficient. This limitation of 16S can be overcome with whole genome shotgun sequencing of microbial communities, an approach to metagenomics that became popular in the early 2000s and  that opened the door to higher resolution characterization of communities. In 2005 Kevin Chen and I wrote a review on the bioinformatics challenges that would have to be overcome to walk through the door. One of the things we did was to emphasize “problems and their connections to other areas of bioinformatics, such as… gene expression analysis”, and throughout the past decade I’ve always hoped for deeper connections to be established between metagenomics and gene expression bioinformatics. I’ve noticed interesting connections pop up from time to time (e.g. Paulson et al. 2013)  and have occasionally entertained the thought with my students and collaborators, especially as work in my group became more focused on RNA-Seq since the development of Cufflinks in 2008.

However connection modern transcriptome analysis methodology, specifically bioinformatics of RNA-Seq to metagenomics has been difficult to do until recently. One major reason is that until just a few years ago, there was no reference genome database for metagenomics analogous to the reference annotation databases available for use in transcriptomics. Another way to put this is that metagenomics has, until recently, been “de novo” bioinformatics. By this I mean that the analysis of communities from whole genome shotgun data had to largely proceed via de novo analyses of the data (e.g. de novo assembly of genomes), “binning” of reads according to sequence characteristics or hits to gene databases was required because it was impossible to compare sequences to references genomes. While de novo methods have also been developed for RNA-Seq, the scale of transcriptome analysis is much smaller than that of most metagenomic analyses, and as has been well documented, de novo transcriptomics is already very difficult (e.g. Amin et al. 2014).

The de novo state of metagenomics has changed in recent years, as (relatively) low-cost sequencing has been a boon for microbial genomics. The graph below, extracted from NCBI and published in a recent review, shows that in just the past few years thousands of bacterial genomes have been sequences, enabling, for the first time, reference based metagenomics:

This observation is reflected in the recent development of many methods for a variety of metagenomic applications that take advantage of reference genome databases.  Specifically, the problem of read assignment, which is fundamental for abundance estimation, has benefited from the possibility of metagenomic read alignment to reference databases.

The figure below, reproduced from the preprint “An evaluation of the accuracy and speed of metagenome analysis tools” by Stinus Lindgreen, Karen L. Adair and Paul Gardner, bioRxiv May 15, 2015 shows a benchmark of the accuracy and runtime of 14 programs developed for metagenomic read assignment for whole genome shotgun data:

The problem these methods are solving is really similar to the problem of read assignment in RNA-Seq. In RNA-Seq, instead of originating from strains, reads originate from transcripts. Just as strains are present in different abundances in a community, so are RNA transcripts in a cell (or in bulk). The analogy of taxonomy in metagenomics, i.e. the grouping of strains into species, genus etc. is also present in RNA-Seq, where transcripts are grouped into genes. The fragment (or read) assignment problem in RNA-Seq is closely related to the quantification problem in RNA-Seq and is a problem that has been thoroughly researched and for which many algorithms have been developed. I discussed the importance of the fragment assignment problem for RNA-Seq in my 2013 Genome Informatics Keynote.

In response to the development of reference-based bioinformatics possibilities for metagenomics, about three years ago my student Lorian Schaeffer started looking at the suitability of RNA-Seq tools for metagenomic read assignment. Although the metagenomic and RNA-Seq assignment problems are conceptually similar and methodologically related, there are various technical issues involved in applying RNA-Seq tools in the metagenomic setting (e.g. the need to carefully account for taxonomy in the metagenomics setting). After developing the computational infrastructure to benchmark RNA-Seq programs in the metagenomic setting, she proceeded to evaluate the accuracy of eXpress, a streaming algorithm for RNA-Seq quantification. Although the quantification of eXpress was specifically designed to be suitable for large numbers of reads, the program requires read alignments to a reference transcriptome (or in Lorian’s experiments a genome) database. In the metagenomic setting realistic databases are huge, and she found that it took days just to map the reads. Nevertheless, her initial benchmarks revealed that eXpress was significantly more accurate than the available metagenomic read assignment tools of the time.

When Kraken (Wood and Salzberg 2014), and later CLARK (Ounit et al. 2015) were published in 2014 and 2015 respectively, we took note because by circumventing the alignment step they dramatically altered the tractability of metagenomic read assignment. In parallel, in my group, Nicolas Bray and later Páll Melsted and Harold Pimentel were developing what is now kallisto (Bray et al. 2015). Like Kraken, kallisto avoided the need for aligning reads, but with the introduction of the concept of pseudoalignment, allowed for accurate read assignments based on joint analysis of exact k-mer matches. What we showed earlier this year is that unlike naïve k-mer based approaches to quantification, kallisto is as accurate as eXpress and other read alignment based quantification tools, and this observation led Lorian to immediately proceed to benchmark it on metagenomic data. The result of her work was just posted as a preprint:

Lorian Schaeffer, Harold Pimentel, Nicolas Bray, Páll Melsted and Lior Pachter, Pseudoalignment for metagenomic read assignment, arXiv 1510.07371, 2015.

With this paper we demonstrate a “technology transfer” from RNA-Seq bioinformatics to metagenomics, one that achieves dramatic improvements in read assignment accuracy in the metagenomics setting. The main result of her work is Table 1 in our preprint:

Using a published simulated Illumina dataset from Mende et al. 2012 (based on 100 genomes and containing 53.33 million reads), and augmenting it with another 2,308 genomes for the purpose of testing, she shows that kallisto significantly outperforms the best quantification methods (as benchmarked by Lindgreen et al., see figure above). “Significant” here refers to what I think is fair to characterize as an extraordinary improvement: at the genus level, a level that programs such as CLARK have been optimized for, kallisto’s RRMSE (relative root mean squared error)  is 0.13 compared to 17.05 for Kraken and 18.58 for CLARK. The improvement is based on two ideas: first, the results show that the model based approach for read assignment, the concept that underlies GASiC and eXpress, outperforms direct taxonomic read assignment as implemented by MEGAN and Kraken and CLARK (in the latter approach reads are aligned to the lowest rank to which they align unambigously). Second, pseudoalignment is not just faster than traditional alignment but also accurate.

The upshot: the accuracy and efficiency of kallisto make strain level analysis of metagenomes possible. In fact kallisto is more accurate at the strain level than other programs are at the genus level. Just as we have been advocating for transcript level analysis from RNA-Seq data, we believe that strain level analysis should become commonplace in metagenomics.

In digging deeply into the bioinformatics of metagenomics bioinformatics we noticed a few other areas that could benefit from RNA-Seq technology transfer. For example, the standard of RNA-Seq methods benchmarking appears to be higher than in metagenomics. Both the Kraken and CLARK papers benchmarked their programs on simulations with 10 genomes (the number ten is not a typo). CLARK did test on one dataset with 20 genomes, although using only 10,000 reads. To be fair to the authors of those papers, their standards were much higher than others in the field. The paper

Yu-Wei Wu and Yuzhen Ye, A novel abundance-based algorithm for binning metagenomic sequences using l-tuples, Journal of Computational Biology 2011.

benchmarked their method on simulations of reads from 2 (two!!) organisms. Biologists frequently complain that simulations of bioinformaticians are completely non-informative and unfortunately these cases provide fodder for such prejudice. Having said that, the RNA-Seq community also has much to learn from the metagenomics community. The previously mentioned paper by Paulson et al. 2013 addresses missing data in a way that should translate directly to missing data in single-cell RNA-Seq (the paper also makes performance comparisons with their comparative metagenomics approach to the RNA-Seq programs DESeq and edgeR) . One paper (McDavid et al. 2012) does take a look at modeling single-cell data with zero inflated distributions but I think this is a good example where metagenomics is ahead of RNA-Seq. Our immediate plans are to develop the kallisto application to metagenomics to include the ability to perform metagenome comparisons using sleuth. Conversely, inspired by the taxonomy hierarchy fundamental to metagenomics we’re going to explore RNA-Seq quantification with groups of transcripts that go beyond just genes.

Horizontal transfer is good.

The Common Core State Standards Initiative was intended to establish standards for the curriculum for K–12 students in order to universally elevate the the quality of education in the United States. Whether the initiative has succeeded, or not, is a matter of heated debate. In particular, the merits of the mathematics standards are a contentious matter to the extent that colleagues in my math department at UC Berkeley have penned opposing opinions on the pages of the Wall Street Journal (see Frenkel and Wu vs. Ratner). In this post I won’t opine on the merits of the standards, but rather wish to highlight a shortcoming in the almost universal perspective on education that the common core embodies:

The emphasis on what K–12 students ought to learn about what is known has sidelined an important discussion about what they should learn about what is not known.

To make the point, I’ve compiled a list of unsolved problems in mathematics to match the topics covered in the common core. The problems are all well-known to mathematicians, and my only contribution is to select problems that (a) are of interest to research mathematicians (b) provide a good balance among the different areas of mathematics and (c) are understandable by students studying to (the highlighted) Common Core Standards.  For each grade/high school topic, the header is a link to the Common Core Standards. Based on the standards, I have selected a single problem to associate to to the grade/topic (although it is worth noting there were always a large variety to choose from). For each problem, I begin by highlighting the relevant common core curriculum which the problem is related to, followed by a warm up exercise to help introduce students to the problem. The warm ups are exercises that should be solvable by students with knowledge of the Common Core Standards. I then state the unsolved problem, and finally I provide (in brief!) some math background, context and links for those who are interested in digging deeper into the questions.

Disclaimer: it’s possible, though highly unlikely that any of the questions on the list will yield to “elementary” methods accessible to K–12 students. It is quite likely that many of the problems will remain unsolved in our lifetimes. So why bother introducing students to such problems? The point is that the questions reveal a sliver of the vast scope of mathematics, they provide many teachable moments and context for the mathematics that does constitute the common core, and (at least in my opinion) are fun to explore (for kids and adults alike). Perhaps most importantly, the unsolved problems and conjectures reveal that the mathematics taught in K–12 is right at the edge of our knowledge: we are always walking right along the precipice of mystery. This is true for other subjects taught in K–12 as well, and in my view this reality is one of the important lessons children can and should learn in school.

One good thing about the Common Core Standards, is that their structure allows, in principle, for the incorporation of standards for unsolved problems the students ought to know about. Hopefully education policymakers will consider extending the Common Core Standards to include such content. And hopefully educators will not only teach about what is not known, but will also encourage students to ask new questions that don’t have answers. This is because  “there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know.”

Kindergarten

Relevant common core: “describing shapes and space.”

Warm up: can you color the map of Africa with four colors so that no two countries that touch are filled in with the same color?

Can you color in the map with three colors so that no two countries that touch are filled in with the same color?

The unsolved problem: without trying all possibilities, can you tell when a map can be colored in with 3 colors so that no two countries that touch are filled in with the same color?

Background and context: The four color theorem states (informally) that “given any separation of a plane into contiguous regions, producing a figure called a map, no more than four colors are required to color the regions of the map so that no two adjacent regions have the same color.” (from wikipedia). The mathematics statement is that any planar graph can be colored with four colors. Thus, the first part of the “warm up” has a solution; in fact the world map can be colored with four colors. The four color theorem is deceivingly simple- it can be explored by a kindergartner, but it turns out to have a lengthy proof. In fact, the proof of the theorem requires extensive case checking by computer. Not every map can be colored with three colors (for an example illustrating why see here). It is therefore natural to ask for a characterization of which maps can be 3-colored. Of course any map can be tested for 3-colorability by trying all possibilities, but a “characterization” would involve criteria that could be tested by an algorithm that is polynomial in the number of countries. The 3-colorability of planar graphs is NP-complete.

Relevant common core: “developing understanding of whole number relationships”.

Warm up: Suppose that in a group of people, any pair of individuals are either strangers or acquaintances. Show that among three people there must be at either at least two pairs of strangers or else at least two pairs of acquaintances.

The unsolved problem: Is it true that among 45 people there must be 5 mutual strangers or 5 mutual acquaintances?

Background and context: This problem is formally known as the question of computing the Ramsey number R(5,5). It is an easier (although probably difficult for first graders) problem to show that R(3,3)=6, that is, that among six people there will be either three mutual strangers or else three mutual acquaintances. It is known that $43 \leq R(5,5) \leq 49$. The difficulty of computing Ramsey numbers was summed up by mathematician Paul Erdös as follows:

“Imagine an alien force, vastly more powerful than us, landing on Earth and demanding the value of R(5, 5) or they will destroy our planet. In that case we should marshal all our computers and all our mathematicians and attempt to find the value. But suppose, instead, that they ask for R(6, 6). In that case we should attempt to destroy the aliens.” – from Ten Lectures on the Probabilistic Method.

Relevant common core: “building fluency with addition and subtraction”.

Warm up: Pascal’s triangle is a triangular array of numbers where each entry in a row is computed by adding up the pair of numbers above it. For example, the first six rows of Pascal’s triangle are:

Write out the next row of Pascal’s triangle.

The unsolved problem: Is there a number (other than 1) that appears 10 times in the (infinite) Pascal’s triangle?

Background and context: The general problem of determining whether numbers can appear with arbitrary multiplicity in Pascal’s triangle is known as Singmaster’s conjecture. It is named after the mathematician David Singmaster who posed the problem in 1971. It is known that the number 3003 appears eight times, but it is not known whether any other number appears eight times, nor, for that matter, whether any other number appears more than eight times.

Relevant common core: “(1) developing understanding of multiplication and division and strategies for multiplication and division within 100”.

Warm up: Practice dividing numbers by 2 and multiplying by 3.

The unsolved problem: Choose a natural number n. If n is even, divide it by 2 to get $n\div 2$. If n is odd, multiply it by 3 and add 1 to obtain $3\times n+1$. Repeat the process. Show that for any initial choice n, the repeating process will eventually reach the number 1.

Background and context: The conjecture is called the Collatz conjecture, after Lothar Collatz who proposed it in 1937. It is deceptively simple, but despite much numeric evidence that it is true, has eluded proof. Mathematician Terence Tao has an interesting blog post explaining why (a) the conjecture is likely to be true and (b) why it is likely out of reach of current techniques known to mathematicians.

Relevant common core: “Determine whether a given whole number in the range 1-100 is prime or composite”.

Warm up: Write the number 100 as the sum of two prime numbers.

The unsolved problem: Show that every even integer greater than 2 can be expressed as the sum of two primes.

Background and context:  This problem is known as the Goldbach conjecture. It was proposed by the mathematician Christian Goldbach in a letter to the mathematician Leonhard Euler in 1742 and has been unsolved since that time (!) In 2013 mathematician Harald Helfgott settled the “weak Goldbach conjecture“, proving that every odd integer greater than 5 is the sum of three primes.

Relevant common core: “Graph points on the coordinate plane”.

Warm up: A set of points in the coordinate plane are in general position if no three of them lie on a line. A quadrilateral is convex if it does not intersect itself and the line between any two points on the boundary lies entirely within the quadrilateral. Show that any set of five points in the plane in general position has a subset of four points that form the vertices of a convex quadrilateral.

The unsolved problem: A polygon is convex  if it does not intersect itself and the line between any two points on the boundary lies entirely within the polygon. Find the smallest number N so that any N points in the coordinate plane in general position contain a subset of 7 points that form the vertices of a convex polygon.

Background and context: The warm up exercise was posed by mathematician Esther Klein in 1933. The question led to the unsolved problem, which remains unsolved in the general case, i.e. how many points are required so that no matter how they are placed (in general position) in the plane there is a subset that form the vertices of a convex n-gon. There are periodic improvements in the upper bound (the most recent one posted on September 10th 2015), but the best current upper bound is still far from the conjectured answer.

A set of 8 points in the plane containing no convex pentagon.

Relevant common core: “Represent three-dimensional figures using nets made up of rectangles and triangles”.

Warm up

The dodecahedron is an example of a convex polyhedron. A convex polyhedron is a polyhedron that does not intersect itself and that has the property that any line joining two points on the surface lies entirely within the polyhedron. Cut out the net of the dodecahedron (shown above) and fold it into a dodecahedron.

The unsolved problem: Does every convex polyhedron have at least one net?

Background and context: Albrecht Dürer was an artist and mathematician of the German Renaissance. The unsolved problem above is often attributed to him, and is known as Dürer’s unfolding problem. It was formally posed by the mathematician Geoffrey Shephard in 1975.

One of my favorite math art pieces: Albrecht Dürer’s engraving Melencolia I.

Relevant common core: “Analyze proportional relationships and use them to solve real-world and mathematical problems.”

Warm up: Two runners, running at two different speeds $v_0$ and $v_1$, run along a circular track of unit length. A runner is lonely at time t if she is at a distance of at least 1/2 from the other runner at time t. If both runners all start at the same place at t=0, show that the runners will both be lonely at time $t=\frac{1}{2(v_1-v_0)}$.

The unsolved problem: Eight runners, each running at a speed different from that of the others, run along a circular track of unit length. A runner is lonely at time t if she is at a distance of at least 1/8 from every other runner at time t. If the runners all start at the same place at t=0, show that each of the eight runners will be lonely at some time.

Background and context: This problem is known as the lonely runner conjecture and was proposed by mathematician J.M Wills in 1967. It has been proved for up to seven runners, albeit with lengthy arguments that involve lots of case checking. It is currently unsolved for eight or more runners.

Relevant common core: “Know and apply the properties of integer exponents”.

Warm up: Show that $3^{3n}+[2(3^n)]^3 = 3^{3n+2}$ for any integer greater than or equal to 1.

The unsolved problem: If $A^x+B^y=C^z$ where A,B,C,x,y and are positive integers with $x,y,z>2$ then A,B and C have a common prime factor.

Background and context: This problem is known as Beal’s conjecture (named after the billionaire Andrew Beal). The famous “Fermat’s Last Theorem“, proved via the modularity theorem for semistable elliptic curves,  is a special case of this conjecture, an observation that hints at the difficulty of the conjecture. Andrew Beal is currently offering a prize of \$1 million for its solution.

High School: Number and Quantity

Relevant common core: “In high school, students will be exposed to yet another extension of number, when the real numbers are augmented by the imaginary numbers to form the complex numbers.”

Warm up: Gaussian integers are numbers of the form $a+bi$ where a and b are integers. The norm of a Gaussian integer $a+bi$ is $a^2+b^2$. A Gaussian prime is a Gaussian integer that cannot be factored into Gaussian integers of smaller norm. Show that $4+i$ is a Gaussian prime.

The unsolved problem: Consider a circle in R2 with centre at the origin and radius $r \geq 0$. How many points there are inside this circle of the form (m,n) where m and n are both integers. In particular, denoting this number by $N(r)$, find bounds on $E(r)$ where $N(r) = \pi r^2 + E(r)$.

Background and context: The problem is known as Gauss’ circle problem, and while phrased in terms of integer points in the plane, it is equivalent to asking for the number of Gaussian integers with norm less than a given constant.

High School: Algebra

Relevant common core: “Solve systems of equations”.

Warm up: An Euler brick is a cuboid whose edges and face diagonals all have integer length:

Algebraically, an Euler brick requires finding integers a,b,c,d,e,f such that

$a^2+b^2=d^2, a^2+c^2=e^2$ and $b^2+c^2=f^2$. Verify that (a,b,c,d,e,f) = (85, 132, 720,157, 725, 732) is an Euler brick.

The unsolved problem: A perfect cuboid is an Euler brick whose space diagonal g (see below) also has integer length:

Is there a perfect cuboid?

Background and context: The existence of perfect cuboids requires solving the systems of equations for the Euler brick with the addition of the requirement that $a^2+b^2+c^2=g^2$ with g an integer. The solution of (non-linear) systems of equations in many unknowns is the subject matter of algebraic geometry, in which a bridge is developed between the algebra and a corresponding geometry. Generally speaking the equations are easiest to solve when the solutions can be complex, harder when they are required to be real numbers (real algebraic geometry) and hardest when they are constrained to be integers (Diophantine, or arithmetic algebraic geometry).

High School: Functions

Relevant common core: “Understand the concept of a function and use function notation”.

Warm up: Euler’s totient function denoted by $\varphi(n)$ assigns to each positive integer n the number of positive integers that are less than or equal to n and relatively prime to n. What is $\varphi(p^k)$ when p is a prime number?

The unsolved problem: Is it true that for every n there is at least one other integer $m \neq n$ with the property that $\varphi(m) = \varphi(n)$?

Background and context: The question is known as Carmichael’s conjecture, who posited that the answer is affirmative. Curiously, it has been proved (in The Distribution of Totients by Kevin Ford, 1998) that any counterexample to the conjecture must be larger than $10^{10^{10}}$. Yet the problem is unsolved.

High School: Modeling

Relevant common core: “Modeling is the process of choosing and using appropriate mathematics and statistics to analyze empirical situations, to understand them better, and to improve decisions.”

Warm up: Read the biology paper The Biological Underpinnings of Namib Desert Fairy Circles and the mathematical modeling paper Adopting a spatially explicit perspective to study the mysterious fairy circles of Namibia (some basic calculus is required).

The unsolved problem: Develop a biologically motivated mathematical model that explains the fairy circles in Namibia.

Background and context: Fairy circles occur in Southern Africa, mostly in Namibia but also in South Africa. The circles are barren patches of land surrounded by vegetation, and they appear to go through life cycles of dozens of years. There have been many theories about them but despite a number of plausible models the phenomenon is poorly understood and their presence is considered “one of nature’s great mysteries“.

High School: Geometry

Relevant common core: “An understanding of the attributes and relationships of geometric objects can be applied in diverse contexts”.

Warm up: Calculate the area of the handset shape shown moving in the figure below. It consists of two quarter-circles of radius 1 on either side of a 1 by 4/π rectangle from which a semicircle of radius $\scriptstyle 2/\pi\,$ has been removed.

The unsolved problem: What is the rigid two-dimensional shape of largest area that can be maneuvered through an L-shaped planar region (as shown above) with legs of unit width.

Background and context: This problem was posed by Leo Moser in 1966 and is known as the the moving sofa problem. It is known that the largest area for a sofa is between 2.2195 and 2.8284. The problem should be familiar to college age students who have had to manouever furniture in and out of dorm rooms.

High School: Statistics & Probability

Relevant common core: “Describe events as subsets of a sample space (the set of outcomes) using characteristics (or categories) of the outcomes, or as unions, intersections, or complements of other events (‘or,’ ‘and,’ ‘not’).”

Warm up: A family of sets is said to be union-closed if the union of any two sets from the family remains in the family. Write down five different examples of families of sets that are union-closed.

The unsolved problem: Show that for any finite union-closed family of finite sets, other than the family consisting only of the empty set, there exists an element that belongs to at least half of the sets in the family.

Background and context: The conjecture is known as Frankl’s conjecture, named after the mathematician Péter Frankl, and is also called the union closed sets conjecture. It is deceptively simple, but is known to be true only in a few very special cases.