The Common Core State Standards Initiative was intended to establish standards for the curriculum for K–12 students in order to universally elevate the the quality of education in the United States. Whether the initiative has succeeded, or not, is a matter of heated debate. In particular, the merits of the mathematics standards are a contentious matter to the extent that colleagues in my math department at UC Berkeley have penned opposing opinions on the pages of the Wall Street Journal (see Frenkel and Wu vs. Ratner). In this post I won’t opine on the merits of the standards, but rather wish to highlight a shortcoming in the almost universal perspective on education that the common core embodies:

The emphasis on what K–12 students ought to learn about what is known has sidelined an important discussion about what they should learn about what is not known.

To make the point, I’ve compiled a list of unsolved problems in mathematics to match the topics covered in the common core. The problems are all well-known to mathematicians, and my only contribution is to select problems that (a) are of interest to research mathematicians (b) provide a good balance among the different areas of mathematics and (c) are understandable by students studying to (the highlighted) Common Core Standards.  For each grade/high school topic, the header is a link to the Common Core Standards. Based on the standards, I have selected a single problem to associate to to the grade/topic (although it is worth noting there were always a large variety to choose from). For each problem, I begin by highlighting the relevant common core curriculum which the problem is related to, followed by a warm up exercise to help introduce students to the problem. The warm ups are exercises that should be solvable by students with knowledge of the Common Core Standards. I then state the unsolved problem, and finally I provide (in brief!) some math background, context and links for those who are interested in digging deeper into the questions.

Disclaimer: it’s possible, though highly unlikely that any of the questions on the list will yield to “elementary” methods accessible to K–12 students. It is quite likely that many of the problems will remain unsolved in our lifetimes. So why bother introducing students to such problems? The point is that the questions reveal a sliver of the vast scope of mathematics, they provide many teachable moments and context for the mathematics that does constitute the common core, and (at least in my opinion) are fun to explore (for kids and adults alike). Perhaps most importantly, the unsolved problems and conjectures reveal that the mathematics taught in K–12 is right at the edge of our knowledge: we are always walking right along the precipice of mystery. This is true for other subjects taught in K–12 as well, and in my view this reality is one of the important lessons children can and should learn in school.

One good thing about the Common Core Standards, is that their structure allows, in principle, for the incorporation of standards for unsolved problems the students ought to know about. Hopefully education policymakers will consider extending the Common Core Standards to include such content. And hopefully educators will not only teach about what is not known, but will also encourage students to ask new questions that don’t have answers. This is because  “there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know.”

Kindergarten

Relevant common core: “describing shapes and space.”

Warm up: can you color the map of Africa with four colors so that no two countries that touch are filled in with the same color?

Can you color in the map with three colors so that no two countries that touch are filled in with the same color?

The unsolved problem: without trying all possibilities, can you tell when a map can be colored in with 3 colors so that no two countries that touch are filled in with the same color?

Background and context: The four color theorem states (informally) that “given any separation of a plane into contiguous regions, producing a figure called a map, no more than four colors are required to color the regions of the map so that no two adjacent regions have the same color.” (from wikipedia). The mathematics statement is that any planar graph can be colored with four colors. Thus, the first part of the “warm up” has a solution; in fact the world map can be colored with four colors. The four color theorem is deceivingly simple- it can be explored by a kindergartner, but it turns out to have a lengthy proof. In fact, the proof of the theorem requires extensive case checking by computer. Not every map can be colored with three colors (for an example illustrating why see here). It is therefore natural to ask for a characterization of which maps can be 3-colored. Of course any map can be tested for 3-colorability by trying all possibilities, but a “characterization” would involve criteria that could be tested by an algorithm that is polynomial in the number of countries. The 3-colorability of planar graphs is NP-complete.

Relevant common core: “developing understanding of whole number relationships”.

Warm up: Suppose that in a group of people, any pair of individuals are either strangers or acquaintances. Show that among three people there must be at either at least two pairs of strangers or else at least two pairs of acquaintances.

The unsolved problem: Is it true that among 45 people there must be 5 mutual strangers or 5 mutual acquaintances?

Background and context: This problem is formally known as the question of computing the Ramsey number R(5,5). It is an easier (although probably difficult for first graders) problem to show that R(3,3)=6, that is, that among six people there will be either three mutual strangers or else three mutual acquaintances. It is known that $43 \leq R(5,5) \leq 49$. The difficulty of computing Ramsey numbers was summed up by mathematician Paul Erdös as follows:

“Imagine an alien force, vastly more powerful than us, landing on Earth and demanding the value of R(5, 5) or they will destroy our planet. In that case we should marshal all our computers and all our mathematicians and attempt to find the value. But suppose, instead, that they ask for R(6, 6). In that case we should attempt to destroy the aliens.” – from Ten Lectures on the Probabilistic Method.

Relevant common core: “building fluency with addition and subtraction”.

Warm up: Pascal’s triangle is a triangular array of numbers where each entry in a row is computed by adding up the pair of numbers above it. For example, the first six rows of Pascal’s triangle are:

Write out the next row of Pascal’s triangle.

The unsolved problem: Is there a number (other than 1) that appears 10 times in the (infinite) Pascal’s triangle?

Background and context: The general problem of determining whether numbers can appear with arbitrary multiplicity in Pascal’s triangle is known as Singmaster’s conjecture. It is named after the mathematician David Singmaster who posed the problem in 1971. It is known that the number 3003 appears eight times, but it is not known whether any other number appears eight times, nor, for that matter, whether any other number appears more than eight times.

Relevant common core: “(1) developing understanding of multiplication and division and strategies for multiplication and division within 100”.

Warm up: Practice dividing numbers by 2 and multiplying by 3.

The unsolved problem: Choose a natural number n. If n is even, divide it by 2 to get $n\div 2$. If n is odd, multiply it by 3 and add 1 to obtain $3\times n+1$. Repeat the process. Show that for any initial choice n, the repeating process will eventually reach the number 1.

Background and context: The conjecture is called the Collatz conjecture, after Lothar Collatz who proposed it in 1937. It is deceptively simple, but despite much numeric evidence that it is true, has eluded proof. Mathematician Terence Tao has an interesting blog post explaining why (a) the conjecture is likely to be true and (b) why it is likely out of reach of current techniques known to mathematicians.

Relevant common core: “Determine whether a given whole number in the range 1-100 is prime or composite”.

Warm up: Write the number 100 as the sum of two prime numbers.

The unsolved problem: Show that every even integer greater than 2 can be expressed as the sum of two primes.

Background and context:  This problem is known as the Goldbach conjecture. It was proposed by the mathematician Christian Goldbach in a letter to the mathematician Leonhard Euler in 1742 and has been unsolved since that time (!) In 2013 mathematician Harald Helfgott settled the “weak Goldbach conjecture“, proving that every odd integer greater than 5 is the sum of three primes.

Relevant common core: “Graph points on the coordinate plane”.

Warm up: A set of points in the coordinate plane are in general position if no three of them lie on a line. A quadrilateral is convex if it does not intersect itself and the line between any two points on the boundary lies entirely within the quadrilateral. Show that any set of five points in the plane in general position has a subset of four points that form the vertices of a convex quadrilateral.

The unsolved problem: A polygon is convex  if it does not intersect itself and the line between any two points on the boundary lies entirely within the polygon. Find the smallest number N so that any N points in the coordinate plane in general position contain a subset of 7 points that form the vertices of a convex polygon.

Background and context: The warm up exercise was posed by mathematician Esther Klein in 1933. The question led to the unsolved problem, which remains unsolved in the general case, i.e. how many points are required so that no matter how they are placed (in general position) in the plane there is a subset that form the vertices of a convex n-gon. There are periodic improvements in the upper bound (the most recent one posted on September 10th 2015), but the best current upper bound is still far from the conjectured answer.

A set of 8 points in the plane containing no convex pentagon.

Relevant common core: “Represent three-dimensional figures using nets made up of rectangles and triangles”.

Warm up

The dodecahedron is an example of a convex polyhedron. A convex polyhedron is a polyhedron that does not intersect itself and that has the property that any line joining two points on the surface lies entirely within the polyhedron. Cut out the net of the dodecahedron (shown above) and fold it into a dodecahedron.

The unsolved problem: Does every convex polyhedron have at least one net?

Background and context: Albrecht Dürer was an artist and mathematician of the German Renaissance. The unsolved problem above is often attributed to him, and is known as Dürer’s unfolding problem. It was formally posed by the mathematician Geoffrey Shephard in 1975.

One of my favorite math art pieces: Albrecht Dürer’s engraving Melencolia I.

Relevant common core: “Analyze proportional relationships and use them to solve real-world and mathematical problems.”

Warm up: Two runners, running at two different speeds $v_0$ and $v_1$, run along a circular track of unit length. A runner is lonely at time t if she is at a distance of at least 1/2 from the other runner at time t. If both runners all start at the same place at t=0, show that the runners will both be lonely at time $t=\frac{1}{2(v_1-v_0)}$.

The unsolved problem: Eight runners, each running at a speed different from that of the others, run along a circular track of unit length. A runner is lonely at time t if she is at a distance of at least 1/8 from every other runner at time t. If the runners all start at the same place at t=0, show that each of the eight runners will be lonely at some time.

Background and context: This problem is known as the lonely runner conjecture and was proposed by mathematician J.M Wills in 1967. It has been proved for up to seven runners, albeit with lengthy arguments that involve lots of case checking. It is currently unsolved for eight or more runners.

Relevant common core: “Know and apply the properties of integer exponents”.

Warm up: Show that $3^{3n}+[2(3^n)]^3 = 3^{3n+2}$ for any integer greater than or equal to 1.

The unsolved problem: If $A^x+B^y=C^z$ where A,B,C,x,y and are positive integers with $x,y,z>2$ then A,B and C have a common prime factor.

Background and context: This problem is known as Beal’s conjecture (named after the billionaire Andrew Beal). The famous “Fermat’s Last Theorem“, proved via the modularity theorem for semistable elliptic curves,  is a special case of this conjecture, an observation that hints at the difficulty of the conjecture. Andrew Beal is currently offering a prize of 1 million for its solution. High School: Number and Quantity Relevant common core: “In high school, students will be exposed to yet another extension of number, when the real numbers are augmented by the imaginary numbers to form the complex numbers.” Warm up: Gaussian integers are numbers of the form $a+bi$ where a and b are integers. The norm of a Gaussian integer $a+bi$ is $a^2+b^2$. A Gaussian prime is a Gaussian integer that cannot be factored into Gaussian integers of smaller norm. Show that $4+i$ is a Gaussian prime. The unsolved problem: Consider a circle in R2 with centre at the origin and radius $r \geq 0$. How many points there are inside this circle of the form (m,n) where m and n are both integers. In particular, denoting this number by $N(r)$, find bounds on $E(r)$ where $N(r) = \pi r^2 + E(r)$. Background and context: The problem is known as Gauss’ circle problem, and while phrased in terms of integer points in the plane, it is equivalent to asking for the number of Gaussian integers with norm less than a given constant. High School: Algebra Relevant common core: “Solve systems of equations”. Warm up: An Euler brick is a cuboid whose edges and face diagonals all have integer length: Algebraically, an Euler brick requires finding integers a,b,c,d,e,f such that $a^2+b^2=d^2, a^2+c^2=e^2$ and $b^2+c^2=f^2$. Verify that (a,b,c,d,e,f) = (85, 132, 720,157, 725, 732) is an Euler brick. The unsolved problem: A perfect cuboid is an Euler brick whose space diagonal g (see below) also has integer length: Is there a perfect cuboid? Background and context: The existence of perfect cuboids requires solving the systems of equations for the Euler brick with the addition of the requirement that $a^2+b^2+c^2=g^2$ with g an integer. The solution of (non-linear) systems of equations in many unknowns is the subject matter of algebraic geometry, in which a bridge is developed between the algebra and a corresponding geometry. Generally speaking the equations are easiest to solve when the solutions can be complex, harder when they are required to be real numbers (real algebraic geometry) and hardest when they are constrained to be integers (Diophantine, or arithmetic algebraic geometry). High School: Functions Relevant common core: “Understand the concept of a function and use function notation”. Warm up: Euler’s totient function denoted by $\varphi(n)$ assigns to each positive integer n the number of positive integers that are less than or equal to n and relatively prime to n. What is $\varphi(p^k)$ when p is a prime number? The unsolved problem: Is it true that for every n there is at least one other integer $m \neq n$ with the property that $\varphi(m) = \varphi(n)$? Background and context: The question is known as Carmichael’s conjecture, who posited that the answer is affirmative. Curiously, it has been proved (in The Distribution of Totients by Kevin Ford, 1998) that any counterexample to the conjecture must be larger than $10^{10^{10}}$. Yet the problem is unsolved. High School: Modeling Relevant common core: “Modeling is the process of choosing and using appropriate mathematics and statistics to analyze empirical situations, to understand them better, and to improve decisions.” Warm up: Read the biology paper The Biological Underpinnings of Namib Desert Fairy Circles and the mathematical modeling paper Adopting a spatially explicit perspective to study the mysterious fairy circles of Namibia (some basic calculus is required). The unsolved problem: Develop a biologically motivated mathematical model that explains the fairy circles in Namibia. Background and context: Fairy circles occur in Southern Africa, mostly in Namibia but also in South Africa. The circles are barren patches of land surrounded by vegetation, and they appear to go through life cycles of dozens of years. There have been many theories about them but despite a number of plausible models the phenomenon is poorly understood and their presence is considered “one of nature’s great mysteries“. High School: Geometry Relevant common core: “An understanding of the attributes and relationships of geometric objects can be applied in diverse contexts”. Warm up: Calculate the area of the handset shape shown moving in the figure below. It consists of two quarter-circles of radius 1 on either side of a 1 by 4/π rectangle from which a semicircle of radius $\scriptstyle 2/\pi\,$ has been removed. The unsolved problem: What is the rigid two-dimensional shape of largest area that can be maneuvered through an L-shaped planar region (as shown above) with legs of unit width. Background and context: This problem was posed by Leo Moser in 1966 and is known as the the moving sofa problem. It is known that the largest area for a sofa is between 2.2195 and 2.8284. The problem should be familiar to college age students who have had to manouever furniture in and out of dorm rooms. High School: Statistics & Probability Relevant common core: “Describe events as subsets of a sample space (the set of outcomes) using characteristics (or categories) of the outcomes, or as unions, intersections, or complements of other events (‘or,’ ‘and,’ ‘not’).” Warm up: A family of sets is said to be union-closed if the union of any two sets from the family remains in the family. Write down five different examples of families of sets that are union-closed. The unsolved problem: Show that for any finite union-closed family of finite sets, other than the family consisting only of the empty set, there exists an element that belongs to at least half of the sets in the family. Background and context: The conjecture is known as Frankl’s conjecture, named after the mathematician Péter Frankl, and is also called the union closed sets conjecture. It is deceptively simple, but is known to be true only in a few very special cases. Today my student Harold Pimentel released the beta version of his new RNA-Seq analysis method and software program called sleuth. A sleuth for RNA-Seq begins with the quantification of samples with kallisto, and together a sleuth of kallistos can be used to analyze RNA-Seq data rigorously and rapidly. Why do we need another RNA-Seq program? A major challenge in transcriptome analysis is to determine the transcripts that have changed in their abundance across conditions. This challenge is not entirely trivial because the stochasticity in transcription both within and between cells (biological variation), and the randomness in the experiment (RNA-Seq) that is used to determine transcript abundances (technical variation), make it difficult to determine what constitutes “significant” change. Technical variation can be assessed by performing technical replicates of experiments. In the case of RNA-Seq, this can be done by repeatedly sequencing from one cDNA library. Such replicates are fundamentally distinct from biological replicates designed to assess biological variation. Biological replicates are performed by sequencing different cDNA libraries that have been constructed from repeated biological experiments performed under the same (or in practice near-same) conditions. Because biological replicates require sequencing different cDNA libraries, a key point is that biological replicates include technical variation. In the early days of RNA-Seq a few papers (e.g. Marioni et al. 2008, Bullard et al. 2010) described analyses of technical replicates and concluded that they were not really needed in practice, because technical variation could be predicted statistically from the properties of the Poisson distribution. The point is that in an idealized RNA-Seq experiment counts of reads are multinomial (according to abundances of the transcripts they originate from), and therefore approximately Poisson distributed. Their variance is therefore approximately equal to the mean, so that it is possible to predict the variance in counts across technical replicates based on the abundance of the transcripts they originate from. There is, however, one important subtlety here: “counts of reads” as used above refers to the number of reads originating from a transcript, but in many cases, especially in higher eukaryotes, reads are frequently ambiguous as to their transcript of origin because of the presence of multi-isoform genes and genes families. In other words, transcript counts cannot be precisely measured. However, the statement about the Poisson distribution of counts in technical replicates remain true when considering counts of reads by genomic features because then reads are no longer ambiguous. This is why, in so-called “count-based methods” for RNA-Seq analysis, there is an analysis only at the gene level. Programs such as DESeq/DESeq2, edgeR and a literally dozens of other count-based methods first require counting reads across genome features using tools such as HTSeq or featureCounts. By utilizing read counts to genomic features, technical replicates are unnecessary in lieu of the statistical assumption that they would reveal Poisson distributed data, and instead the methods focus on modeling biological variation. The issue of how to model biological variation is non-trivial because typically very few biological replicates are performed in experiments. Thus, there is a need for pooling information across genes to obtain reliable variance estimates via a statistical process called shrinkage. How and what to shrink is a matter of extensive debate among statisticians engaged in the development of count-based RNA-Seq methods, but one theme that has emerged is that shrinkage approaches can be compatible with general and generalized linear models, thus allowing for the analysis of complex experimental designs. Despite these accomplishments, count-based methods for RNA-Seq have two major (related) drawbacks: first, the use of counts to gene features prevents inference about the transcription of isoforms, and therefore with most count-based methods it is impossible to identify splicing switches and other isoform changes between conditions. Some methods have tried to address this issue by restricting genomic features to specific exons or splice junctions (e.g. DEXSeq) but this requires throwing out a lot of data, thereby reducing power for identifying statistically significant differences between conditions. Second, because of the fact that in general $\frac{a}{b} + \frac{c}{d} \neq \frac{a+b}{c+d}$ it is mathematically incorrect to estimate gene abundances by adding up counts to their genomic region. One consequence of this, is that it is not possible to accurately measure fold change between conditions by using counts to gene features. In other words, count-based methods are problematic even at the gene-level and it is necessary to estimate transcript-level counts. While reads might be ambiguous as to exactly which transcripts they originated from, it is possible to statistically infer an estimate of the number of reads from each transcript in an experiment. This kind of quantification has its origin in papers of Jiang and Wong, 2009 and Trapnell et al. 2010. However the process of estimating transcript-level counts introduces technical variation. That is to say, if multiple technical replicates were performed on a cDNA library and then transcript-level counts were to be inferred, those inferred counts would no longer be Poisson distributed. Thus, there appears to be a need for performing technical replicates after all. Furthermore, it becomes unclear how to work within the shrinkage frameworks of count-based methods. There have been a handful of attempts to develop methods that combine the uncertainty of count estimates at the transcript level with biological variation in the assessment of statistically significant changes in transcript abundances between conditions. For example, the Cuffdiff2 method generalizes DESeq while the bitSeq method relies on a Bayesian framework to simultaneously quantify abundances at the transcript level while modeling biological variability. Despite showing improved performance over count-based methods, they also have significant shortcomings. For example the methods are not as flexible as those of general(ized) linear models, and bitSeq is slow partly because it requires MCMC sampling. Thus, despite intensive research on both statistical and computational methods for RNA-Seq over the past years, there has been no solution for analysis of experiments that allows biologists to take full advantage of the power and resolution of RNA-Seq. The sleuth model The main contribution of sleuth is an intuitive yet powerful model for RNA-Seq that bridges the gap between count-based methods and quantification algorithms in a way that fully exploits the advantages of both. To understand sleuth, it is helpful to start with the general linear model: $Y_t = X_t\beta_t + \epsilon_t$. Here the subscript t refers to a specific transcript, $Y_t$ is a vector describing transcript abundances (of length equal to the number of samples), $X_t$ is a design matrix (of size number of samples x number of confounders), $\beta_t$ is a parameter vector (of size the number of confounders) and $\epsilon_t$ is a noise vector (of size the number of samples). In this model the abundances $Y_t$ are normally distributed. For the purposes of RNA-Seq data, the $Y_t$ may be assumed to be the logarithm of the counts (or normalized counts per million) from a transcript, and indeed this is the approach taken in a number of approaches to RNA-Seq modeling, e.g. in limma-voom. A common alternative to the general linear model is the generalized linear model, which postulates that some function of $Y_t$ has a distribution with mean equal to $g^{-1}(X_t \beta_t)$ where g is a link function, such as log, thereby allowing for distributions other than the normal to be used for the observed data. In the RNA-Seq context, where the negative binomial distribution may make sense because it is frequently a good distribution for modeling count data, the generalized model is sometimes preferred to the standard general model (e.g. by DESeq2). There is much debate about which approach is “better”. In the sleuth model the $Y_t$ in the general linear model are modeled as unobserved. They can be thought of us the unobserved logarithms of true counts for each transcript across samples and are assumed to be normally distributed. The observed data $D_t$ is the logarithm of estimated counts for each transcript across samples, and is modeled as $D_t = Y_t + \zeta_t$ where the $\zeta_t$ vector parameterizes a perturbation to the unobserved $Y_t$. This can be understood as the technical noise due to the random sequencing of fragments from a cDNA library and the uncertainty introduced in estimating transcript counts. The sleuth model incorporates the assumptions that the response error is additive, i.e. if the variance of transcript in sample is $V(D_{t,i})$ then $V(D_{t,i}) = \sigma^2_t + \tau^2_t$ where the variance $V(\epsilon_{t,i}|y_{t,i}) = \sigma^2_t$ and the variance $V(\zeta_{t,i}|y_{t,i}) = \tau^2_t$. Intuitively, sleuth teases apart the two sources of variance by examining both technical and biological replicates, and in doing so directly estimates “true” biological variance, i.e. the variance in biological replicates that is not technical. In lieu of actual technical replicates, sleuth takes advantage of the bootstraps of kallisto which serve as accurate proxies. In a test of sleuth on data simulated according to the DESeq2 model we found that sleuth significantly outperforms other methods: In this simulation transcript counts were simulated according to a negative binomial distribution, following closely the protocol of the DESeq2 paper simulations. Reference parameters for the simulation were first estimated by running DESeq2 on a the female Finnish population from the GEUVADIS dataset (59 individuals). In the simulation above size factors were set to be equal in accordance with typical experiments being performed, but we also tested sleuth with size factors drawn at random with geometric mean of 1 in accordance with the DESeq2 protocol (yielding factors of 1, 0.33, 3, 3, 0.33 and 1) and sleuth still outperformed other methods. There are many details in the implementation of sleuth that are crucial to its performance, e.g. the approach to shrinkage to estimate the biological variance $\sigma^2_t$. A forthcoming preprint, together with Nicolas Bray and Páll Melsted that also contributed to the project along with myself, will provide the details. Exploratory data analysis with sleuth One of the design goals of sleuth was to create a simple and efficient workflow in line with the principles of kallisto. Working with the Shiny web application framework we have designed an html interface that allows users to interact with sleuth plots allowing for real time exploratory data analysis. The sleuth Shiny interface is much more than just a GUI for making plots of kallisto processed data. First, it allows for the exploration of the sleuth fitted models; users can explore the technical variation of each transcript, see where statistically significant differential transcripts appear in relationship to others in terms of abundance and variance and much more. Particularly useful are interactive features in the plots. For example, when examining an MA plot, users can highlight a region of points (dynamically created box in upper panel) and see their variance breakdown of the transcripts the points correspond to, and the list of the transcripts in a table below: The web interface contains diagnostics, summaries of the data, “maps” showing low-dimensional representations of the data and tools for analysis of differential transcripts. The interactivity via Shiny can be especially useful for diagnostics; for example, in the diagnostics users can examine scatterplots of any two samples, and then select outliers to examine their variance, including the breakdown of technical variance. This allows for a determination of whether outliers represent high variance transcripts, or specific samples gone awry. Users can of course make figures showing transcript abundances in all samples, including boxplots displaying the extent of technical variation. Interested in the differential transcribed isoform ENST00000349155 of the TBX3 gene shown in Figure 5d of the Cuffdiff2 paper? It’s trivial to examine using the transcript viewer: One can immediately see visually that differences between conditions completely dominate both the technical and biological variation within conditions. The sleuth q-value for this transcript is 3*10^(-68). Among the maps, users can examine PCA projections onto any pair of components, allowing for rapid exploration of the structure of the data. Thus, with kallisto and sleuth raw RNA-Seq reads can be converted into a complete analysis in a matter of minutes. Experts will be able to generate plots and analyses in R using the sleuth library as they would with any R package. We plan numerous improvements and developments to the sleuth interface in the near future that will further facilitate data exploration; in the meantime we welcome feedback from users. How to try out sleuth Since sleuth requires the bootstraps and quantifications output by kallisto we recommend starting by running kallisto on your samples. The kallisto program is very fast, processing 30 million reads on a laptop in a matter of minutes. You will have to run kallisto with bootstraps- we have been using 100 bootstraps per sample but it should be possible to work with many fewer. We have yet to fully investigate the minimum number of bootstraps required for sleuth to be accurate. To learn how to use kallisto start here. If you have already run kallisto you can proceed to the tutorial for sleuth. If you’re really eager to see sleuth without first learning kallisto, you can skip ahead and try it out using pre-computed kallisto runs of the Cuffdiff2 data- the tutorial explains where to obtain the data for trying out sleuth. For questions, suggestions or help see the program websites and also the kallisto-sleuth user group. We hope you enjoy the tools! I recently read a “brainiac” column in the Boston Globe titled “For 40 years, computer scientists looked for a solution that doesn’t exist” that caused me to facepalm so violently I now have pain in my right knee. The article is about the recent arXiv posting: Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false) by Arturs Backurs and Piotr Indyk (first posted in December 2014 and most recently updated in April 2015). In the preprint, which was presented at STOC, Backurs and Indyk show that computing edit distance sub-quadratically is hard in the sense that if it were possible to compute the edit distance between two strings of length n in time $O(n^{2-\delta})$ for some constant $\delta > 0$ then it would be possible to solve the satisfiability of conjunctive normal form formulas in time that would violate the Strong Exponential Time Hypothesis (SETH)This type of “lower bound” reduction is common practice in theoretical computer science. In this case the preprint is of interest because (a) it sheds light on the complexity of a computational problem (edit distance) that has been extensively studied by theoretical computer scientists and (b) it extends the relevance of SETH and sheds more light on what the hypothesis implies. In the introduction to the preprint Backurs and Indyk motivate their study by writing that “Unfortunately, that [standard dynamic programming] algorithm runs in quadratic time, which is prohibitive for long sequences” and note that “considerable effort has been invested into designing faster algorithms, either by assuming that the edit distance is bounded, by considering the average case or by resorting to approximation”. They mention that the fastest known exact algorithm is faster than quadratic time running in $O(n^2/logn)$, but note that “only barely so”. This is all true, and such preamble is typical for theoretical computer science, setting the stage for the result. In one single parenthetical remark intended to relate their work to applications they note that “the human genome consists of roughly 3 billions base pairs” and that exact edit distance computation is currently prohibitive for such long sequences. This is a true sentence, but also not very precise. More on that later in the post… The parenthetical genome comment in the preprint was picked up on in an MIT press release describing the results and implications of the paper. While the press release does a very good job explaining the reduction of Backurs and Indyk in an accessible way, it also picks up on the parenthetical genome comment and goes a bit further. The word “genome” appears three times, and it is stated that “a computer running the existing algorithm would take 1,000 years to exhaustively compare two human genomes.” This is technically true, but two human genomes are very similar, so that exhaustive comparison is not necessary to compute the edit distance. In fact, two human genomes are 99.9% identical, and with strong prior upper bounds on the edit distance the computation of edit distance is tractable. In terms of different species, the edit distance between complete genomes is completely irrelevant, because of the fact that organisms undergo large scale segmental duplications and rearrangements over timescales of speciation. The MIT press release did not claim the Backurs and Indyk result was relevant for comparing genomes of species, but… the Boston Globe column states that “By calculating the edit distance between the genomes of two organisms, biologists can estimate how far back in evolutionary time the organisms diverged from each other.” No. Actually, even for closely related genomes edit distance is not the relevant statistic in biology, but rather the alignment of the genomes. This is why biological sequence analysis is based on what is called the Neeldeman-Wunsch algorithm and not the Wagner-Fischer algorithm mentioned in the MIT press release. Actually, because the length of sequences for which it is relevant to compute alignments is bounded and rather small (usually a few tens or hundreds of thousands of base pairs at most), the real computational issue for biological sequence analysis is not running time but memory. The Needleman-Wunsch and Wagner-Fischer algorithms require $O(n^2)$ space and the algorithmic advance that made edit distance and alignment computations for biological sequence comparison tractable was Hirschberg’s algorithm which requires only linear space at a cost of only a factor of 2 in time. Thus, the parenthetical comment of Backurs and Indyk, intended to merely draw attention of the reader to the possible practical relevance of the edit distance problem, mutated into utter rubbish by the time the work was being discussed in the Boston Globe. If that was the only issue with the brainiac column I might have just hurt my hand when I facepalmed, but the following paragraph caused me the real pain: Because it’s computationally infeasible to compute the edit distance between two human genomes, biologists have to rely on approximations. They’d love a faster method, but Indyk and his coauthor, MIT graduate student Arturs Backurs, recently demonstrated a faster method is impossible to create. And by impossible they don’t mean “very hard” or “our technology has to improve first.” They mean something like, “by the laws of the mathematics, it can’t be done.” We’ve already gone through point #1 but amazingly nothing is true in this paragraph: 1. No, it is not computationally infeasible to compute the edit distance between two human genomes and no, biologists do not have to rely on approximations. But then there is also the fact that… 2. Backurs and Indyk did not, in fact, demonstrate that a “faster method is impossible to create”. What they showed is that reducing the exponent in the quadratic algorithm would require SETH to be false. It turns out there are some respectable computer scientists that believe that SETH is indeed false. So… 3. No, Backurs and Indyk, who understand mathematics quite well… most certainly do not believe that “by the laws of the mathematics, it can’t be done”. Actually, 4. I’d bet that Backurs and Indyk don’t really believe that mathematics has laws. Axioms yes.. but laws… I find inaccurate reporting of science to be highly problematic. How can we expect the general public to believe scientists claims about serious issues, e.g. the prospects for climate change, when news reports are filled with hype and rubbish about less serious issues, e.g. the complexity of edit distance? There is a another issue that this example highlights. While complexity theory is extremely important and fundamental for computer science, and while a lot of the work in complexity has practical applications, there has been a consistent hyping of implications for biology where complexity theory is simply not very relevant. When I started my career it was fashionable to talk about protein folding being NP-complete. But of course millions of proteins were probably being folded in my body while I was writing this blog post. Does that fact that I am alive imply that P=NP? Of course, exploring the complexity of simplified models for biological systems, such as the HP-model for protein folding that the above reference links to, has great value. It helps to sharpen the questions, to better specify models, and to define the limits of computational tractability for computational biology problems. But it is worth keeping in mind that in biology n does not go to infinity. On the 27th March, 2012 the Foreign Currency Department of the Central Bank of Iceland raided the offices of Samherji hf., a fishing company in Iceland. The Office of the Special Prosecutor was concerned that Sammherji hf. might be shifting profits in Iceland to its overseas subsidiaries by underselling its fish. In doing so it would be recording losses in Iceland while sheltering profits abroad thereby violating currency exchange (and possibly other) laws. The raid was immediately addressed by the company in a press release; for a complementary perspective see the article by journalist Ingi Freyr Vilhjálmsson Two months later, in a sharply worded rebuttal, the company argued that the reported discrepancies in prices between Samherji hf. and the market were flawed. Specifically, they argued, the Central Bank had incorrectly averaged numbers (Google English translation) resulting in a Simpson’s paradox! Was this a valid argument? Truly an example of Simpson’s paradox? Or a sleight of hand? Simpson’s paradox (better referred to as Simpson’s reversal) is a mathematical inequality that can arise in the comparison of averages with raw data. It is known as a “paradox” because the inequality can appear to be unintuitive (to those who have poor statistical intuition, i.e. most of us). What makes it particularly perilous is that it can lurk disguised in the most unexpected places. Not only (allegedly) in the Central Bank of Iceland, but also, it turns out, in many comparative genome analyses… Original sin The mouse genome was published in December 2002, and in the publication was the following figure (Figure 25 from the mouse genome paper): In the review Sixty years of genome biology by Doolittle et al. (2013) highlighting key advances in genome biology since the publication of the structure of the double helix in 1953, Chris Ponting singles out panel (a) as being of historical significance describing it as “a wonderful visual guide to the most important features of mammalian genes” and explaining that “collapsing levels of sequence conservation between thousands of mouse and human orthologs into one metagene, .. showed how, from a common sequence over 90 million years ago, mutation has etched away intronic sequence whilst selection has greatly preserved the exons, particularly toward their boundaries”. He is right, of course, in that the figure demonstrated, for the first time, the power of “genome-wide” comparative molecular biology and led to concerted efforts to characterize functional regions in the genome by aggregation of genome-wide data. Figures analogous to the mouse genome paper 25a are now a fixture in many genomics papers, with % sequence identity being replaced by data such as % methylated, % accessible, etc. etc. In the recent paper M. Singer and L. Pachter, Controlling for conservation in genome-wide DNA methylation studiesBMC Genomics, 16:240 my former Ph.D. student Meromit Singer (now a postdoc in the Regev lab at the Broad Institute) and I explain why in some cases the type of “meta-analysis” underlying Figure 25a, now a fixture in many genomics papers, can be misleading. We focused on DNA methylation studies, which I will explain, but first some elementary algebra… Simpson’s paradox According to the English Oxford Dictionary, one of the definitions for the word paradox is “A statement or proposition which on the face of it seems self-contradictory, absurd, or at variance with common sense, though, on investigation or when explained, it may prove to be well-founded”, and it is this sense of the word that is relevant in what has become known as “Simpson’s paradox”. The “paradox”, is just the following fact: Given numbers $p_1,p_2,q_1,q_2,r_1,r_2,s_1,s_2$, it may happen that $\frac{p_1}{q_1}>\frac{p_2}{q_2}$ and $\frac{r_1}{s_1} > \frac{r_2}{s_2}$ but $\frac{p_1+r_1}{q_1+s_1} < \frac{p_2+r_2}{q_2+s_2}$. The geometry is that in the figure below (from wikipedia), even though each of the red vectors has a slope greater than its corresponding blue vector, the sum of the red vectors has a slope less than the sum of the blue vectors: This reversal, nothing more than a simple algebraic curiosity, can sometimes underlie qualitative changes in the results of data analysis after averaging. This is because ratios or rates can be recast as slopes of vectors while averages correspond to sums of vectors. A famous example is described in the classic paper on sex bias in graduate admission at UC Berkeley, Bickel et al. discuss the issue of pooling data, and how it can affect conclusions about bias. The main point of the paper, namely that Simpson’s paradox can emerge in analyzing admissions data, is illustrated in the following toy example (Figure S1b) from our paper, which was constructed to facilitate understanding the effect in the context of genomics: The figure shows (hypothetical) applications and admissions results for eight males and eight females at twelve departments in a university. A “0” indicates an applicant applied and was rejected, a “1” that the applicant applied and was accepted, and a grey box indicates that the applicant did not apply. In all departments the admissions rates are the same for males and females (0% for both males and females in departments 1–4 and 100% for both males and females in departments 5–12, best understood as the ratio, in each row, of the sum of the numbers divided by the number of non-grey squares). Yet the acceptance rate for all males is 66.7% vs. 33.3% for females (for each column, sum of the numbers divided by the number of non-grey squares as displayed in the plot underneath the table). This is a strange effect at first glance… somehow grouping males and females together leads to an appearance of bias, whereas each department is admitting students without regard to gender (the paper by Bickel et al. discusses real data from UC Berkeley admissions in 1973 where this effect resulted in statistically unjustified claims of bias). Looking at the table it is clear that the reason for the discrepancy between departmental admission rates and overall admissions rates is due to the extra grey boxes in the female table. In this example (and in the 1973 UC Berkeley admissions) females applied to departments that are harder to get in to. In the example above the male and female departmental admissions rates were equal, but it is not difficult to construct examples (again, they occurred in the 1973 UC Berkeley data) were a reversal happens, i.e. admissions rates at the departmental level are opposite to those when aggregated. This is what is known as “Simpson’s paradox”. That the issue is relevant in the genomics meta-analysis context can be seen by replacing “males” and “females” with by genomic features, e.g. “introns” and “exons”, and “departments” with different genomic instances of the features. For example, in the case of Figure 25a from the mouse genome paper, “departments” correspond to different instances of introns/exons in the genome, and “0”s to “not conserved” and “1”s to “conserved”. Interestingly, in the genomics meta-analysis context it is the plot (column averages in the table above) that is always the displayed “result”, and this makes the “Simpson’s effect” more subtle to detect or to identify than in standard settings where the table itself is revealed. What is the equivalent of “females apply to departments that are harder to get into”? Consider the human-mouse plot in Figure 25a of the mouse genome paper. In some genes some bases are not conserved between human and mouse, and this is far more likely to happen in introns than exons. In fact, this is shown in Figure 25a (light blue curve labeled “aligning”). Hence the grey boxes of missing data, reflecting the fact that introns are less conserved than exons, and therefore in alignments they have not only more mismatches, but also indels. In our paper, we considered the case of DNA methylation measurements, where “0” referred to unmethylated and “1” to methylated. Since DNA methylation can only be measured at CpGs, there is again, the issue of missing data. And as it turns out, the missing data is associated with feature type, so that genome-wide averaging can be a huge problem. DNA Methylation In the course of working on her Ph.D. on Statistical algorithms in the study of DNA methylation, Meromit worked on identifying and understanding DNA methylation inside genes, and in that context was interested in Figure 4 from the paper Dynamic changes in the human methylome during differentiation by L. Laurent et al., Genome Research, 20 (2010), p 320–331. The figure describes “average distribution of DNA methylation mapped onto a gene model”, and is reproduced below: Figure 4 from L. Laurent et al., Genome Research 2010. Figure 4c plays a major role in the paper, supporting the following sentence from the abstract: “Exons were more highly methylated than introns, and sharp transitions of methylation occurred at exon–intron boundaries, suggesting a role for differential methylation in transcript splicing.” Throughout the paper, there is extensive discussion of “sharp steps”, not only at exon-intron boundaries but also transcription start sites (TSS) and transcription termination sites (TTS), with speculation about the regulation such methylation differences might reflect and contribute to. Of particular note is the following paragraph about Figure 4c: A downward gradient was seen going across exons from 5′ to 3′, while an upward gradient of DNA methylation was seen traveling from 5′ to 3′ across introns (Fig. 4C). While this is the first report on the splice junction methylation spikes, recent reports show that the intron–exon boundaries also appear to be marked by gradients in chromatin features, including nucleosomes (Schwartz et al. 2009) and the H3K36me3 histone mark (Kolasinska-Zwierz et al. 2009). Taken together, our data suggest that coupling of transcription and splicing may be regulated by DNA methylation as well as by other epigenetic marks.” Meromit focused on the italicized sentence (emphasis mine). Looking at the figure, she noticed that the 5′ ends of exons seemed to be much more highly methylated than the 3′ ends. This, and the discussion in the paper, led her to think that indeed DNA methylation may be functional within transcripts. But given the sparsity of CpGs (for one thing, it turned out that the spikes at the intron-exon boundaries were due to an almost complete lack of CpGs there), and noting that CpGs must be present for DNA methylation to be measured, she decided to examine whether there was in fact stronger conservation of CpGs at the 5′ end of exons. However, when reanalyzing the data, she discovered that the reason for the reported difference in methylation was that because intragenic junctions were examined, the 3′ ends of exons were on average closer to the promoter than the 5′ ends, and therefore the difference in methylation was simply reflecting the fact that many first exons are unmethylated due to “leakage” from the promoter. This, in turn, got her thinking about the interplay between DNA methylation, conservation of sequence, and bias that may occur during the averaging of data. One of my favorite figures from our paper is our Figure 3, which illustrates what can and does go wrong in naïve genome-wide averaging of methylation data at functional boundaries: The leftmost column shows the result of naïve plotting of column averages for %methylated CpGs in simulations where the methylation states from real data was shuffled randomly across features (while maintaining the locations of missing data due to absent CpGs). The “sharp steps” reported in Laurent et al. are evident despite the fact that there is no actual difference in methylation. The next column (second from left) is the naïve plot generated from real data (4852 5′ UTR-coding junctions, 20,784 mid-gene intron-exon junctions and 21,408 CpG island junctions from genome-wide bisulfite data generated by Lister et al., 2008). Evidently the “sharp steps” look very similar to the simulated scenario where there is no actual methylation difference across the boundaries. The “Corrected via COMPARE” column shows what happens when the bias is removed via a correction procedure that I describe next. In the case of UTR-coding junctions, the “sharp step” disappears completely (this is confirmed by the analysis shown in the final column, which consists of simply throwing away rows in the data table where there is missing data, thereby removing the possible Simpson’s effect but at the cost of discarding data). Our analysis also shows that although there is some differential methylation (on average) at intron-exon boundaries, the effect is smaller than appears at first glance. Without going into too much detail, it is worth noting that the underlying problem in the methylation case is that deamination leads to methylated CpGs converting to TpGs, so that this is a correlation between methylation and missing data. This is analogous to the problem in conservation analysis, where missing data (due to indels) is correlated with less conservation (i.e. more mismatches). Our paper delves into the details. Although the analysis in Laurent et al. is problematic, the intention of this post is not to single out the paper; it just happens to be the publication that led us to the discovery of Simpson’s effect in the context of genome-wide averaging. In fact, in the preparation of our paper, we identified 40 published articles where naïve averaging of genome-wide methylation data was displayed in figures (the same issue undoubtedly affects other types of genome analysis as well, e.g. conservation analysis as in Figure 25a from the mouse paper, and we make this point in our paper but do not explore it in detail). In some cases the interpretation of the figures is central to the claims of the paper, whereas in other cases the figures are secondary to the main results. Also sometimes correction of Simpson’s effect may lead to a (small) quantitative rather than a qualitative change in interpretation; we have not examined each of the 40 cases in our table in detail (although are happy to send the table upon request). What we recommend is that future studies implement a correction when averaging genome-wide data across features. To assist in this regard, Meromit developed software called COMPARE: Imputation of missing data There are basically three ways to get around the problem discussed above: 1. Don’t average the data in your tables. 2. Discard rows with missing entries. 3. Impute the missing values. The problem with #1 is that summaries of data in the form of plots like Figure 25a from the mouse genome paper are a lot more fun to look at than tables with thousands of rows. The workflows for #2 and #3 above are shown in our Figure 4: However the problem with #2 is that at least in the case of DNA methylation, a lot of data must be discarded (this is evident in the large boxes in the final column of our Figure 3; see above). Imputation is interesting to consider. It is essentially the statistical problem of guessing what the methylation state in a specific position of a DNA sequence would be if there was a CpG at the site. The idea is to perform matrix completion (in a way analogous to the methods used for the Netflix prize); details are in our paper and implemented in software called COMPARE. It turns out that imputation of epigenetic state is possible (and can be surprisingly accurate): The plots show analysis with COMPARE, specifically ten-fold cross-validations for 5’UTR, coding, intronic and exonic regions, at the corresponding junctions analyzed in Figure 3. Smoothing was used to display the large number of data points. In each plot n is the number of data points in the matrix and the regression line is shown in dark blue where s is its slope, and v is the additional amount of variance in the data explained by the regression line relative to a random line. Our results are consistent with those recently published in Ernst and Kellis, 2015, where the tractability and application of imputation of epigenetic states is discussed. With COMPARE, it is possible to generate corrected plots (see column 3 in our Figure 3 above) that provide a clearer quantitative picture of the data. A reanalysis of intron-exon junctions with COMPARE is interesting. As shown in Figure 3 there is a substantially reduced effect in human (approximately a factor of two) in comparison to naïve averaging. What is interesting is that in Arabidopsis correction with COMPARE hardly changes the result at all (Figure S6): The upshot is that after correction in both human and Arabidopsis, a correction that affects human but not Arabidopsis, the differences in DNA methylation at intron-exon boundaries are revealed to be the same. Despite the obvious importance of imputation in this context, we discovered in the course of trying to publish our paper that some biologists are apparently extremely uncomfortable with the idea (even though SNP imputation has become standard for GWAS). Moreover, it seemed that a number of editors and reviewers felt that the reporting and characterization of abundant bias affecting a large number of publications was simply not of much interest. The publication process of our paper was one of the hardest of my career; we went through several journals and multiple review stages, including getting rejaccted (not a typo!). Rejacction is the simultaneous rejection and acceptance of a paper. To wit, an editor wrote to say that “unfortunately, we feel that as it stands the manuscript…cannot [be] consider[ed].. in its current form [but] we have taken the liberty of consulting the editors of [another journal], and they have informed us that if you decided to submit your manuscript to [them], they would only ask for a quick overview.” In other words, rejected, but accepted at our sister journal if you like. We were rejaccted twice (!) bringing to mind a (paraphrasing) of Oscar Wilde: to be rejaccted once may be regarded as misfortune; twice looks like carelessness. At the end, we were delighted to publish our paper in the open access journal BMC Genomics, and hope that people will find it interesting and useful. On fishing Returning to the claims of Samherji hf. about the raid, there is a point to discussing Simpson’s effect in the context of comparing fish prices. The argument being made, is that the Central Bank of Iceland averaged across different weight classes of fish in computing the ratios “price per kilogram” (true) and that it would be better to instead compute weighted averages according to the weight of the fish sold (unclear). Specifically, while Simpson’s reversal can occur in the setting Samherji hf. is referring to, it is not at all clear from the data presented that this is what happened (I would check, but it would require knowledge of prices both of Samherji hf. and of competitors, and also tables that are not in pdf as in the attachments provided). Instead, all that Samherji hf. is saying is that they prefer a weighted averaging in computing price per kilogram, a procedure that, a priori, is by no means clearly “better” than naïve averaging for issues the Central Bank was investigating. After all, just because small fish might sell for different prices than large fish, it is not clear that such fish should be discounted in computing average prices per kilogram. Moreover, it is not at all apparent why Simpson’s paradox is relevant if the matter under consideration is whether or not to weight results when averaging. This is all to say that Simpson’s paradox can not only be subtle to detect, but also subtle to interpret if and when it does occur. One of the notorious abuses of Simpson’s paradox is its use by Republicans in claiming that more of them voted for the 1964 civil rights act than Democrats. It turns out that while this is true, the opposite is true as well. And so it can be with genomics data. Be careful when you fish. Disclosures: I was an author on the mouse genome paper. I am employed by UC Berkeley. I am the brother-in-law of the journalist mentioned in the opening paragraph. I eat fish. So you’re an academic and you’ve written some bioinformatics software. You heard that: 1. Somebody will build on your code. Nope. Ok, maybe not never but almost certainly not. There are many reasons for this. The primary reason in my view is that most bioinformatics software is of very poor quality (more on why this is the case in #2). Who wants to read junk software, let alone try to edit it or build on it? Most bioinformatics software is also targeted at specific applications. Biologists who use application specific software are typically not interested in developing or improving software because methods development is not their main interest and software development is not their best skill. In the computational biology areas I have worked in during the past 20 years (and I have reviewed/tested/examined/used hundreds or even thousands of programs) I can count the software programs that have been extended or developed by someone other than the author on a single hand. Software that has been built on/extended is typically of the framework kind (e.g. SAMtools being a notable example) but even then development of code by anyone other than the author is rare. For example, for the FSA alignment project we used HMMoC, a convenient compiler for HMMs, but has anyone ever built on the codebase? Doesn’t look like it. You may have been told by your PI that your software will take on a life of its own, like Linux, but the data suggests that is simply not going to happen. No, Gnu is Not Unix and your spliced aligner is not the Linux kernel. Most likely you alone will be the only user of your software, so at least put in some comments, because otherwise the first time you have to fix your own bug you won’t remember what you were doing in the code, and that is just sad. 2. You should have assembled a team to build your software. Nope. Although most corporate software these days is built by large teams working collaboratively, scientific software is different. I agree with James Taylor, who in the anatomy of successful computational biology software paper stated that ” A lot of traditional software engineering is about how to build software effectively with large teams, whereas the way most scientific software is developed is (and should be) different. Scientific software is often developed by one or a handful of people.” In fact, I knew you were a graduate student because most bioinformatics software is written singlehandedly by graduate students (occasionally by postdocs). This is actually problem (although not your fault!) Students such as yourself graduate, move on to other projects and labs, and stop maintaining (see #5), let alone developing their code. Many PIs insist on “owning” software their students wrote, hoping that new graduate students in their lab will develop projects of graduated students. But new students are reluctant to work on projects of others because in academia improvement of existing work garners much less credit than new work. After all, isn’t that why you were writing new software in the first place? I absolve you of your solitude, and encourage you to think about how you will create the right incentive structure for yourself to improve your software over a period of time that transcends your Ph.D. degree. 3. If you choose the right license more people will use and build on your program. Nope. People have tried all sorts of licenses but the evidence suggests the success of software (as measured by usage, or development of the software by the computational biology community) is not correlated with any particular license. One of the most widely used software suites in bioinformatics (if not the most widely used) is the UCSC genome browser and its associated tools. The software is not free, in that even though it is free for academic, non-profit and personal use, it is sold commercially. It would be difficult to argue that this has impacted its use, or retarded its development outside of UCSC. To the contrary, it is almost inconceivable that anyone working in genetics, genomics or bioinformatics has not used the UCSC browser (on a regular basis). In fact, I have, during my entire career, heard of only one single person who claims not to use the browser; this person is apparently boycotting it due to the license. As far as development of the software, it has almost certainly been hacked/modified/developed by many academics and companies since its initial release (e.g. even within my own group). In anatomy of successful computational biology software published in Nature Biotechnology two years ago, a list of “software for the ages” consists of programs that utilize a wide variety of licenses, including Boost, BSD, and GPL/LGPL. If there is any pattern it is that the most common are GPL/LGPL, although I suspect that if one looks at bioinformatics software as a whole those licenses are also the most common in failed software. The key to successful software, it appears, is for it to be useful and usable. Worry more about that and less about the license, because ultimately helping biologists and addressing problems in biomedicine might be more ethical than hoisting the “right” software license flag. 4. Making your software free for commercial use shows you are not against companies. Nope. The opposite is true. If you make your software free for commercial use, you are effectively creating a subsidy for companies, one that is funded by your university / your grants. You are a corporate hero! Congratulations! You have found a loophole for transferring scarce public money to the private sector. If you’ve licensed your software with BSD you’ve added another subsidy: a company using your software doesn’t have any reason to share their work with the academic community. There are two reasons why you might want to reconsider offering such subsidies. First, by denying yourself potential profits from sale of your software to industry, you are definitively removing any incentive for future development/maintenance of the software by yourself or future graduate students. Most bioinformatics software, when sold commercially, costs a few thousand dollars. This is a rounding error for companies developing cancer or other drugs at the cost of a billion dollars per drug and a tractable proposition even for startups, yet the money will make a real difference to you three years out from your Ph.D. when you’re earning a postdoc salary. A voice from the future tells you that you’ll appreciate the money, and it will help you remember that you really ought to fix that bug reported on GitHub posted two months ago. You will be part of the global improvement of bioinformatics software. And there is another reason to sell your software to companies: help your university incentivize more and better development of bioinformatics software. At most universities a majority of the royalties from software sales go to the institution (at UC Berkeley, where I work, its 2/3). Most schools, especially public universities, are struggling these days and have been for some time. Help them out in return for their investment in you; you’ll help generate more bioinformatics hires, and increase appreciation for your field. In other words, although it is not always practical or necessary, when possible, please sell your software commercially. 5. You should maintain your software indefinitely. Nope. Someday you will die. Before that you will get a job, or not. Plan for your software to have a limited shelf-life, and act accordingly. 6. Your “stable URL” can exist forever. Nope. When I started out as a computational biologist in the late 1990s I worked on genome alignment. At the time I was excited about Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. This was a framework for specifying bioinformatics related dynamic programming algorithms, such as the Needleman-Wunsch or Smith-Waterman algorithms. The authors wrote that “A stable URL for Dynamite documentation, help and information is http://www.sanger.ac.uk/~birney/dynamite/” Of course the URL is long gone, and by no fault of the authors. The website hosting model of the late 1990s is long extinct. To his credit, Ewan now hosts the Dynamite code on GitHub, following a welcome trend that is likely to extend the life of bioinformatics programs in the future. Will GitHub be around forever? We’ll see. But more importantly, software becomes extinct (or ought to) for reasons other than just 404 errors. For example, returning to sequence alignment, the ClustalW program of 1994 was surpassed in accuracy and speed by many other multiple alignment programs developed in the 2000s. Yet people kept using ClustalW anyway, perhaps because it felt like a “safe bet” with its many citations (eventually in 2011 Clustalw was updated to Clustal Omega). The lag in improving ClustalW resulted in a lot of poor alignments being utilized in genomics studies for a decade (no fault of the authors of ClustalW, but harmful nonetheless). I’ve started the habit of retiring my programs, via announcement on my website and PubMed. Please do the same when the time comes. 7. You should make your software “idiot proof”. Nope. Your users, hopefully biologists (and not other bioinformatics programmers benchmarking your program to show that they beat it) are not idiots. Listen to them. Back in 2004 Nicolas Bray and I published a webserver for the alignment program MAVID. Users were required to input FASTA files. But can you guess why we had to write a script called checkfasta? (hint: the most popular word processor was and is Microsoft Word). We could have stopped there and laughed at our users, but after much feedback we realized the real issue was that FASTA was not necessarily the appropriate input to begin with. Our users wanted to be able to directly input Genbank accessions for alignment, and eventually Nicolas Bray wrote the perl scripts to allow them to do that (the feature lives on here). The take home message for you is that you should deal with user requests wisely, and your time will be needed not only to fix bugs but to address use cases and requested features you never thought of in the first place. Be prepared to treat your users respectfully, and you and your software will benefit enormously. 8. You used the right programming language for the task. Nope. First it was perl, now it’s python. First it was MATLAB, now it’s R. First it was C, then C++. First it was multithreading now it’s Spark. There is always something new coming, which is yet another reason that almost certainly nobody is going to build on your code. By the time someone gets around to having to do something you may have worked on, there will be better ways. Therefore, the main thing is that your software should be written in a way that makes it easy to find and fix bugs, fast, and efficient (in terms of memory usage). If you can do that in Fortran great. In fact, in some fields not very far from bioinformatics, people do exactly that. My advice: stay away from Fortran (but do adopt some of the best practice advice offered here). 9. You should have read Lior Pachter’s blog post about the myths of bioinformatics software before starting your project. Nope. Lior Pachter was an author on the Vista paper describing a program for which the source code was available only “upon request”. Why, sometimes I’ve believed as many as six impossible things before breakfast.” – The White Queen from Through the Looking Glass by Lewis Carroll. Two weeks ago in my post Pachter’s P-value Prize I offered ${\bf \frac{\100}{p}}$ for justifying a reasonable null model and a p-value (p) associated to the statement “”Strikingly, 95% of cases of accelerated evolution involve only one member of a gene pair, providing strong support for a specific model of evolution, and allowing us to distinguish ancestral and derived functions” in the paper M. Kellis, B.W. Birren and E.S. Lander, Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisaeNature 2004 (hereafter referred to as the KBL paper). Today I am happy to announce the winner of the prize. But first, I want to thank the many readers of the blog who offered comments (>135 in total) that are extraordinary in their breadth and depth, and that offer a glimpse of what scientific discourse can look like when not restricted to traditional publishing channels. You have provided a wonderful public example of what “peer review” should mean. Coincidentally, and to answer one of the questions posted, the blog surpassed one million views this past Saturday (the first post was on August 19th, 2013), a testament to the the fact that the collective peer reviewing taking place on these pages is not only of very high quality, but also having an impact. I particularly want to thank the students who had the courage to engage in the conversation, and also faculty who published comments using their name. In that regard, I admire and commend Joshua Plotkin and Hunter Fraser for deciding to deanonymize themselves by agreeing to let me announce here that they were the authors of the critique sent to the authors in April 2004 initially posted as an anonymous comment on the blog. The discussion on the blog was extensive, touching on many interesting issues and I only summarize a few of the threads of discussion here. I decided to touch on a number of key points made in order to provide context and justification for my post and selection of the prize winner. The value of post-publication review One of the comments made in response to my post that I’d like to respond to first was by an author of KBL who dismissed the entire premise of the my challenge writing “We can keep debating this after 11 years, but I’m sure we all have much more pressing things to do (grants? papers? family time? attacking 11-year-old papers by former classmates? guitar practice?)” This comment exemplifies the proclivity of some authors to view publication as the encasement of work in a casket, buried deeply so as to never be opened again lest the skeletons inside it escape. But is it really beneficial to science that much of the published literature has become, as Ferguson and Heene noted, a vast graveyard of undead theories? Surely the varied and interesting comments posted in response to my challenge (totaling >25,000 words and 50 pages in Arial 11 font), demonstrate the value of communal discussion of science after publication. For the record, this past month I did submit a paper and also a grant, and I did spend lots of time with my family. I didn’t practice the guitar but I did play the piano. Yet in terms of research, for me the highlight of the month was reading and understanding the issues raised in the comments to my blog post. Did I have many other things to do? Sure. But what is more pressing than understanding if the research one does is to be meaningful? The null model A few years ago I introduced a new two-semester freshman math course at UC Berkeley for intended biology majors called Math 10- Methods of Mathematics: Calculus, Statistics and Combinatorics“. One of the key ideas we focus on in the first semester is that of a p-value. The idea of measuring significance of a biological result via a statistical computation involving probabilities is somewhat unnatural, and feedback from the students confirms what one might expect: that the topic of p-values is among the hardest in the course. Math for biologists turns out to be much harder than calculus. I believe that at Berkeley we are progressive in emphasizing the importance of statistics for biology majors at the outset of their education (to be clear, this is a recent development). The prevailing state is that of statistical illiteracy, and the result is that p-values are frequently misunderstood, abused, and violated in just about every possible way imaginable (see, e.g., here, here and here). P-values require a null hypothesis and a test statistic, and of course one of the most common misconceptions about them is that when they are large they confirm that the null hypothesis is correct. No! And worse, a small p-value cannot be used to accept an alternative to the null, only to (confidently) reject the null itself. And rejection of the null comes with numerous subtle issues and caveats (see arguments against the p-value in the papers mentioned above). So what is the point? I think the KBL paper makes for an interesting case study of when p-values can be useful. For starters, the construction of a null model is already a useful exercise, because it is a thought experiment designed to test ones understanding of the problem at hand. The senior author of the KBL paper argues that “we were interested in seeing whether, for genes where duplication frees up at least one copy to evolve rapidly, the evidence better fits one model (“Ohno”: only one copy will evolve quickly) or an alternative model (both genes will evolve quickly).” While I accept this statement at face value, it is important to acknowledge that if there is any science to data science, it is the idea that when examining data one must think beyond the specific hypotheses being tested and consider alternative explanations. This is the essence of what my colleague Ian Holmes is saying in his comment. In data analysis, thinking outside of the box (by using statistics) is not optional. If one is lazy and resorts to intuition then, as Páll Melted points out, one is liable to end up with fantasy. The first author of KBL suggests that the “paper was quite explicit about the null model being tested.” But I was unsure of whether to assume that the one-gene-only-speeds-up model is the null based on”we sought to distinguish between the Ohno one-gene-only speeds-up (OS) model and the alternative both-genes speed-up (BS) model” or was the null the BS model because “the Ohno model is 10^87 times more likely, leading to significant rejection of the BS null”? Or was the paper being explicit about not having a null model at all, because “Two alternatives have been proposed for post-duplication”, or was it the opposite, i.e. two null models: “the OS and BS models are each claiming to be right 95% of the time”? I hope I can be forgiven for failing, despite trying very hard, to identify a null model in either the KBL paper, or the comments of the authors to my blog. There is however a reasonable null model, and it is the “independence model”, which to be clear, is the model where each gene after duplication “accelerates” independently with some small probability (80/914). The suggestions that “the independence model is not biologically rooted” or that it “would predict that only 75% of genes would be preserved in at least one copy, and that 26% would be preserved in both copies” are of course absurd, as explained by Erik van Nimwegen who explains why point clearly and carefully. The fact that many entries reached the same conclusion about the suitable null model in this case is reassuring. I think it qualifies as a “reasonable model” (thereby passing the threshold for my prize). The p-value One of my favorite missives about p-values is by Andrew Gelman, who in “P-values and statistical practice” discusses the subtleties inherent in the use and abuse of p-values. But as my blog post illustrates, subtlety is one thing, and ignorance is an entirely different matter. Consider for example, the entry by Manolis Kellis who submitted that $p = 10^{-87}$ thus claiming that I owe him 903,659,165 million billion trillion quadrillion quintillion sextillion dollars (even more than the debt of the United States of America). His entry will not win the prize, although the elementary statistics lesson that follows is arguably worth a few dollars (for him). First, while it is true that a p-value can be computed from the (log) likelihood ratio when the null hypothesis is a special case of the alternative hypothesis (using the chi^2 distribution), the ratio of two likelihoods is not a p-value! Probabilities of events are also not p-values! For example, the comment that “I calculated p-values for the exact count, but the integral/sum would have been slightly better” is a non-starter. Even though KBL was published in 2004, this is apparently the level of understanding of p-values of one of the authors, a senior computational biologist and professor of computer science at MIT in 2015. Wow. So what is “the correct” p-value? It depends, of course, on the test statistic. Here is where I will confess that like many professors, I had an answer in mind before asking the question. I was thinking specifically of the setting that leads to 0.74 (or 0.72/0.73, depending on roundoff and approximation). Many entries came up with the same answer I had in mind and therefore when I saw them I was relieved: I owed135, which is what I had budgeted for the exercise. I was wrong. The problem with the answer 0.74 is that it is the answer to the specific question: what is the probability of seeing 4 or less pairs accelerate out of 76 pairs in which at least one accelerated. A better test statistic was proposed by Pseudo in which he/she asked for the probability of seeing 5% or less of the pairs accelerate from among the pairs with at least one gene accelerating when examining data from the null model with 457 pairs. This is a subtle but important distinction, and provides a stronger result (albeit with a smaller p-value). The KBL result is not striking even forgoing the specific numbers of genes measured to have accelerated in at least one pair (of course just because p=0.64 also does not mean the independence model is correct). What it means is that the data as presented simply weren’t “striking”.

One caveat in the above analysis is that the arbitrary threshold used to declare “acceleration” is problematic. For example, one might imagine that other thresholds produce more convincing results, i.e. farther from the null, but of course even if that were true the use of an arbitrary cutoff was a poor approach to analysis of the data. Fortunately, regarding the specific question of its impact in terms of the analysis, we do not have to imagine. Thanks to the diligent work of Erik van Nimwegen, who went to the effort of downloading the data and reanalyzing it with different thresholds (from 0.4 to 1.6), we know that the null cannot be rejected even with other thresholds.

The award

There were many entries submitted and I read them all. My favorite was by Michael Eisen for his creative use of multiple testing correction, although I’m happier with the direction that yields $8.79. I will not be awarding him the prize though, because his submission fails the test of “reasonable”, although I will probably take him out to lunch sometime at Perdition Smokehouse. I can’t review every single entry or this post, which is already too long, would become unbearable, but I did think long and hard about the entry of K. It doesn’t directly answer the question of why the 95% number is striking, nor do I completely agree with some of the assumptions (e.g. if neither gene in a pair accelerates then the parent gene was not accelerated pre-WGD). But I’ll give the entry an honorable mention. The prize will be awarded to Pseudo for defining a reasonable null model and test statistic, and producing the smallest p-value within that framework. With a p-value of 0.64 I will be writing a check in the amount of$156.25. Congratulations Pseudo!!

The biology

One of the most interesting results of the blog post was, in my opinion, the extensive discussion about the truth. Leaving aside the flawed analysis of KBL, what is a reasonable model for evolution post-WGD? I am happy to see the finer technical details continue to be debated, and the intensity of the conversation is awesome! Pavel Pevzner’s cynical belief that “science fiction” is not a literary genre but rather a description of what is published in the journal Science may be realistic, but I hope the comments on my blog can change his mind about what the future can look like.

In lieu of trying to summarize the scientific conversation (frankly, I don’t think I could do justice to some of the intricate and excellent arguments posted by some of the population geneticists) I’ll just leave readers to enjoy the comment threads on their own. Comments are still being posted, and I expect the blog post to be a “living” post-publication review for some time. May the skeletons keep finding a way out!

The importance of wrong

Earlier in this post I admitted to being wrong. I have been wrong many times. Even though I’ve admitted some of my mistakes on this blog and elsewhere in talks, I would like to joke that I’m not going to make it easy for you to find other flaws in my work. That would be a terrible mistake. Saying “I was wrong” is important for science and essential for scientists. Without it people lose trust in both.

I have been particularly concerned with a lack of “I was wrong” in genomics. Unfortunately, there is a culture that has developed among “leaders” in the field where the three words admitting error or wrongdoing are taboo. The recent paper of Lin et al. critiqued by Gilad-Mizrahi is a good example. Leaving aside the question of whether the result in the paper is correct (there are strong indications that it isn’t), Mizrahi-Gilad began their critique on twitter by noting that the authors had completely failed to account for, or even discuss, batch effect. Nobody, and I mean nobody who works on RNA-Seq would imagine for even a femtosecond that this is ok. It was a major oversight and mistake. The authors, any of them really, could have just come out and said “I was wrong”. Instead, the last author on the paper, Mike Snyder, told reporters that “All of the sequencing runs were conducted by the same person using the same reagents, lowering the risk of unintentional bias”. Seriously?

Examples abound. The “ENCODE 80% kerfuffle” involved claims that “80% of the genome is functional”. Any self-respecting geneticist recognizes such headline grabbing as rubbish. Ewan Birney, a distinguished scientist who has had a major impact on genomics having being instrumental in the ENSEMBL project and many other high-profile bioinformatics programs defended the claim on BBC:

“EB: Ah, so, I don’t — It’s interesting to reflect back on this. For me, the big important thing of ENCODE is that we found that a lot of the genome had some kind of biochemical activity. And we do describe that as “biochemical function”, but that word “function” in the phrase “biochemical function”is the thing which gets confusing. If we use the phrase “biochemical activity”, that’s precisely what we did, we find that the different parts of the genome, [??] 80% have some specific biochemical event we can attach to it. I was often asked whether that 80% goes to 100%, and that’s what I believe it will do. So, in other words, that number is much more about the coverage of what we’ve assayed over the entire genome. In the paper, we say quite clearly that the majority of the genome is not under negative selection, and we say that most of the elements are not under pan-mammalian selection. So that’s negative selection we can detect between lots of different mammals. [??] really interesting question about what is precisely going on in the human population, but that’s — you know, I’m much closer to the instincts of this kind of 10% to 20% sort of range about what is under, sort of what evolution cares about under selection.”

This response, and others by members of the ENCODE consortium upset many people who may struggle to tell apart white and gold from blue and black, but certainly know that white is not black and black is not white. Likewise, I suspect the response of KBL to my post disappointed many as well. For Fisher’s sake, why not just acknowledge what is obvious and true?

The personal critique of professional conduct

A conversation topic that emerged as a result of the blog (mostly on other forums) is the role of style in online discussion of science. Specifically, the question of whether personal attacks are legitimate has come up previously on my blog pages and also in conversations I’ve had with people. Here is my opinion on the matter:

Science is practiced by human beings. Just like with any other human activity, some of the humans who practice it are ethical while others are not. Some are kind and generous while others are… not. Occasionally scientists are criminal. Frequently they are honorable. Of particular importance is the fact that most scientists’ behavior is not at any of these extremes, but rather a convex combination of the mentioned attributes and many others.

In science it is people who benefit, or are hurt, by the behavior of scientists. Preprints on the bioRxiv do not collect salaries, the people who write them do. Papers published in journals do not get awarded or rejected tenure, people do. Grants do not get jobs, people do. The behavior of people in science affects… people.

Some argue for a de facto ban on discussing the personal behavior of scientists. I agree that the personal life of scientists is off limits. But their professional life shouldn’t be. When Bernie Madoff fabricated gains of \$65 billion it was certainly legitimate to criticize him personally. Imagine if that was taboo, and instead only the technical aspects of his Ponzi scheme were acceptable material for public debate. That would be a terrible idea for the finance industry, and so it should be for science. Science is not special among the professions, and frankly, the people who practice it hold no primacy over others.

I therefore believe it is not only acceptable but imperative to critique the professional behavior of persons who are scientists. I also think that doing so will help eliminate the problematic devil–saint dichotomy that persists with the current system. Having developed a culture in which personal criticism is outlawed in scientific conversations while only science is fair fodder for public discourse, we now have a situation where scientists are all presumed to be living Gods, or else serious criminals to be outlawed and banished from the scientific community. Acknowledging that there ought to be a grey zone, and developing a healthy culture where critique of all aspects of science and scientists is possible and encouraged would relieve a lot of pressure within the current system. It would also be more fair and just.

A final wish

I wish the authors of the KBL paper would publish the reviews of their paper on this blog.

In this blog post I offer a cash prize for computing a p-value [update June 9th: the winner has been announced!]. For details about the competition you can skip directly to the challenge. But context is important:

Background

I’ve recently been reading a bioRxiv posting by X. Lan and J. Pritchard, Long-term survival of duplicate genes despite absence of subfunctionalized expression (2015) that examines the question of whether gene expression data (from human and mouse tissues) supports a model of duplicate preservation by subfunctionalization.

The term subfunctionalization is a hypothesis for explaining the ubiquity of persistence of gene duplicates in extant genomes. The idea is that gene pairs arising from a duplication event evolve, via neutral mutation, different functions that are distinct from their common ancestral gene, yet together recapitulate the original function. It was introduced in 1999 an alternative to the older hypothesis of neofunctionalization, which posits that novel gene functions arise by virtue of “retention” of one copy of a gene after duplication, while the other copy morphs into a new gene with a new function. Neofunctionalization was first floated as an idea to explain gene duplicates in the context of evolutionary theory by Haldane and Fisher in the 1930s, and was popularized by Ohno in his book Evolution by Gene Duplication published in 1970. The cartoon below helps to understand the difference between the *functionalization hypotheses (adapted from wikipedia):

Lan and Pritchard examine the credibility of the sub- and neofunctionalization hypotheses using modern high-throughput gene expression (RNA-Seq) data: in their own words “Based on theoretical models and previous literature, we expected that–aside from the youngest duplicates–most duplicate pairs would be functionally distinct, and that the primary mechanism for this would be through divergent expression profiles. In particular, the sub- and neofunctionalization models suggest that, for each duplicate gene, there should be at least one tissue where that gene is more highly expressed than its partner.”

What they found was that, in their words, that “surprisingly few duplicate pairs show any evidence of sub-/neofunctionalization of expression.” The went further, stating that “the prevailing model for the evolution of gene duplicates holds that, to survive, duplicates must achieve non-redundant functions, and that this usually occurs by partitioning the expression space. However, we report here that sub-/neofunctionalization of expression occurs extremely slowly, and generally does not happen until the duplicates are separated by genomic rearrangements. Thus, in most cases long-term survival must rely on other factors.” They propose instead that “following duplication the expression levels of a gene pair evolve so that their combined expression matches the optimal level. Subsequently, the relative expression levels of the two genes evolve as a random walk, but do so slowly (33) due to constraint on their combined expression. If expression happens to become asymmetric, this reduces functional constraint on the minor gene. Subsequent accumulation of missense mutations in the minor gene may provide weak selective pressure to eventually eliminate expression of this gene, or may free the minor gene to evolve new functions.”

The Lan and Pritchard paper is the latest in a series of works that examine high-browed evolutionary theories with hard data, and that are finding reality to be far more complicated than the intuitively appealing, yet clearly inadequate, hypotheses of neo- and subfunctionalization. One of the excellent papers in the area is

Dean et al. Pervasive and Persistent Redundancy among Duplicated Genes in Yeast, PLoS Genetics, 2008.

where the authors argue that in yeast “duplicate genes do not often evolve to behave like singleton genes even after very long periods of time.” I mention this paper, from the Petrov lab, because its results are fundamentally at odds with what is arguably the first paper to provide genome-wide evidence for neofunctionalization (also in yeast):

M. Kellis, B.W. Birren and E.S. Lander, Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisae, Nature 2004.

At the time, the Kellis-Birren-Lander paper was hailed as containing “work that may lead to better understanding of genetic diseases” and in the press release Kellis stated that “understanding the dynamics of genome duplication has implications in understanding disease. In certain types of cancer, for instance, cells have twice as many chromosomes as they should, and there are many other diseases linked to gene dosage and misregulation.” He added that “these processes are not much different from what happened in yeast.” and the author of the press releases added that “whole genome duplication may have allowed other organisms besides yeast to achieve evolutionary innovations in one giant leap instead of baby steps. It may account for up to 80 percent (seen this number before?) of flowering plant species and could explain why fish are the most diverse of all vertebrates.”

This all brings me to:

The challenge

In the abstract of their paper, Kellis, Birren and Lander wrote that:

Strikingly, 95% of cases of accelerated evolution involve only one member of a gene pair, providing strong support for a specific model of evolution, and allowing us to distinguish ancestral and derived functions.” [boldface by authors]
In the main text of the paper, the authors expanded on this claim, writing:

Strikingly, in nearly every case (95%), accelerated evolution was confined to only one of the two paralogues. This strongly supports the model in which one of the paralogues retained an ancestral function while the other, relieved of this selective constraint, was free to evolve more rapidly”.

The word “strikingly” suggests a result that is surprising in its statistical significance with respect to some null model the authors have in mind. The data is as follows:

The authors identified 457 duplicated gene pairs that arose by whole genome duplication (for a total of 914 genes) in yeast. Of the 457 pairs 76 showed accelerated (protein) evolution in S. cerevisiae. The term “accelerated” was defined to relate to amino acid substitution rates in S. cerevisiae, which were required to be 50% faster than those in another yeast species, K. waltii. Of the 76 genes, only four pairs were accelerated in both paralogs. Therefore 72 gene pairs showed acceleration in only one paralog (72/76 = 95%).

So, is it indeed “striking” that “in nearly every case (95%), accelerated evolution was confined to only one of the two praralogues”? Well, the authors don’t provide a pvalue in their paper, nor do they propose a null model with respect to which the question makes sense. So I am offering a prize to help crowdsource what should have been an exercise undertaken by the authors, or if not a requirement demanded by the referees. To incentivize people in the right direction,

I will award ${\bf \frac{\100}{p}}$

to the person who can best justify a reasonable null model, together with a p-value (p) for the phrase “Strikingly, 95% of cases of accelerated evolution involve only one member of a gene pair” in the abstract of the Kellis-Birren-Lander paper. Notice the smaller the (justifiable) p-value someone can come up with, the larger the prize will be.

Bonus: explain in your own words how you think the paper was accepted to Nature without the authors having to justify their use of the word “strikingly” for a main result of the paper, and in a timeframe consisting of submission on December 17th 2003 (just three days before Hanukkah and one week before Christmas) and acceptance January 19th 2004 (Martin Luther King Jr. day).

Rules

To be eligible for the prize entries must be submitted as comments on this blog post by 11:59pm EST on Sunday May 31st June 7th, 2015 and they must be submitted with a valid e-mail address. I will keep the name (and e-mail address) of the winner anonymous if they wish (this can be ensured by using a pseudonym when submitting the entry as a comment). The prize, if awarded, will go to the person submitting the most complete, best explained solution that has a pvalue calculation that is correct according to the model proposed. Preference will be given to submission from students, especially undergraduates, but individuals in any stage of their career, and from anywhere in the world, are encouraged to submit solutions. I reserve the right to interpret the phrase “reasonable null model” in a way that is consistent with its use in the scientific community and I reserve the right to not award the prize if no good/correct solutions are offered. Participants do not have to answer the bonus question to win.

About one and a half years ago I wrote a blog post titled “GTEx is throwing away 90% of their data“. The post was, shall we say, “direct”. For example, in reference to the RNA-Seq quantification program Flux Capacitor I wrote that

Using Flux Capacitor is equivalent to throwing out 90% of the data!

I added that “the methods description in the Online Methods of Montgomery et al. can only be (politely) described as word salad” (after explaining that the methods underlying the program were never published, except for a brief mention in that paper). I referred to the sole figure in Montgomery et al. as a “completely useless” description of the method  (and showed that it contained errors). I highlighted the fact that many aspects of Flux Capacitor, its description and documentation provided on its website were “incoherent”. Can we agree that this description is not flattering?

The claim about “throwing out 90% of the data” was based on benchmarking I reported on in the blog post. If I were to summarize the results (politely), I would say that the take home message was that Flux Capacitor is junk. Perhaps nobody had really noticed because nobody cared about the program. Flux Capacitor was literally being used only by the authors of the program  (and their affiliates, which turned out to include the ENCODE, GENCODE, GEUVADIS and GTEx consortiums). In fact, when I wrote the blog post, I don’t think the program had ever been benchmarked or compared to other tools. It was, after all, unpublished and besides, nobody reads consortium papers. However after I blogged a few others decided to include Flux Capacitor in their benchmarks and the conclusions reached were the same as mine: Flux Capacitor is junk and Flux Capacitor is junk. Of course some people objected to my blog post when it came out, so it’s fun to be right and have others say so in print. But true vindication has come in the form of a citation to the blog post in a published paper in a journal! Specifically, in

C. Iannone, A. Pohl, P. Papasaikas, D. Soronellas, G.P. Vincent, M. Beato and J. Valcárcel, Relationship between nucleosome positioning and progesterone-induced alternative splicing in breast cancer cells, RNA 21 (2015) 360–374

the authors cite my blog post. They write:

Ummm…. wait… WHAT THE FLUX? The authors actually used Flux Capacitor for their analysis, and are citing my blog at https://liorpachter.wordpress.com/tag/flux-capacitor/ as the definitive reference for the program. Wait, what again?? They used my blog post as a reference for the method??? This is like [[ readers are invited to leave a comment offering a suitable analogy ]].

I’m not really sure what the authors can do at this point. They could publish an erratum and replace the citation. But with what? Flux Capacitor still hasn’t been published (!) Then there is the journal. Does any journal really think it is acceptable to list my blog as the citation for an RNA-Seq quantification tool that is fundamental for the results in a paper? (I’m flattered, but still…) Speaking of the journal, where were the reviewers? How could they not catch this? And the readers? The paper has been out since January… I have to ask: has anybody read it? Of course the biggest embarrassment here is the fact that there is a citation for Flux Capacitor at all. Why on earth are the authors using this discredited program??? Well maybe one answer is to be found in the acknowledgments section, where the group of a PI from the GTEx project is thanked. Actually, this PI was the last author on one of the recently published GTEx companion papers, which, I am sad to say… used Flux Capacitor (albeit with some quantifications performed with Cufflinks as well to demonstrate “robustness”). Why would GTEx be pushing for Flux Capacitor and insist on its use? We’ve come full circle to my GTEx blog post. By now I don’t even know what I think is the most embarrassing part of this whole story. So I thought I’d host a poll:

Today I posted the preprint N. Bray, H. Pimentel, P. Melsted and L. Pachter, Near-optimal RNA-Seq quantification with kallisto to the arXiv. It describes the RNA-Seq quantification program kallisto.

The project began in August 2013 when I wrote my second blog post, about another arXiv preprint describing a program for RNA-Seq quantification called Sailfish (now a published paper). At the time, a few students and postdocs in my group read the paper and then discussed it in our weekly journal club. It advocated a philosophy of “lightweight algorithms, which make frugal use of data, respect constant factors and effectively use concurrent hardware by working with small units of data where possible”. Indeed, two themes emerged in the journal club discussion:

1. Sailfish was much faster than other methods by virtue of being simpler.

2. The simplicity was to replace approximate alignment of reads with exact alignment of k-mers. When reads are shredded into their constituent k-mer “mini-reads”, the difficult read -> reference alignment problem in the presence of errors becomes an exact matching problem efficiently solvable with a hash table.

We felt that the shredding of reads must lead to reduced accuracy, and we quickly checked and found that to be the case. In fact, in our simulations, we saw that Sailfish significantly underperformed methods such as RSEM. However the fact that simpler was so much faster led us to wonder whether the prevailing wisdom of seeking to improve RNA-Seq analysis by looking at increasingly complex models was ill-founded. Perhaps simpler could be not only fast, but also accurate, or at least close enough to best-in-class for practical purposes.

After thinking about the problem carefully, my (now former) student Nicolas Bray realized that the key is to abandon the idea that alignments are necessary for RNA-Seq quantification. Even Sailfish makes use of alignments (of k-mers rather than reads, but alignments nonetheless). In fact, thinking about all the tools available, Nick realized that every RNA-Seq analysis program was being developed in the context of a “pipeline” of first aligning reads or parts of them to a reference genome or transcriptome. Nick had the insight to ask: what can be gained if we let go of that paradigm?

By April 2014 we had formalized the notion of “pseudoalignment” and Nick had written, in Python, a prototype of a pseudoaligner. He called the program kallisto. The basic idea was to determine, for each read, not where in each transcript it aligns, but rather which transcripts it is compatible with. That is asking for a lot less, and as it turns out, pseudoalignment can be much faster than alignment. At the same time, the information in pseudoalignments is enough to quantify abundances using a simple model for RNA-Seq, a point made in the isoEM paper, and an idea that Sailfish made use of as well.

Just how fast is pseudoalignment? In January of this year Páll Melsted from the University of Iceland came to visit my group for a semester sabbatical. Páll had experience in exactly the kinds of computer science we needed to optimize kallisto; he has written about efficient k-mer counting using the bloom filter and de Bruijn graph construction. He translated the Python kallisto to C++, incorporating numerous clever optimizations and a few new ideas along the way. His work was done in collaboration with my student Harold Pimentel, Nick (now a postdoc with Jacob Corn and Jennifer Doudna at the Innovative Genomics Initiative) and myself.

The screenshot below shows kallisto being used on my 2012 iMac desktop to build an index of the human transcriptome (5 min 8 sec), and then quantify 78.6 million GEUVADIS human RNA-Seq reads (14 min). When we first saw these results we thought they were simply too good to be true. Let me repeat: The quantification of 78.6 million reads takes 14 minutes on a standard desktop using a single CPU core. In some tests we’ve gotten even faster running times, up to 15 million reads quantified per minute.

The results in our paper indicate that kallisto is not just fast, but also very accurate. This is not surprising: underlying RNA-Seq analysis are the alignments, and although kallisto is pseudoaligning instead, it is almost always only the compatibility information that is used in actual applications. As we show in our paper, from the point of view of compatibility, the pseudoalignments and alignments are almost the same.

Although accuracy is a primary concern with analysis, we realized in the course of working on kallisto that speed is also paramount, and not just as a  matter of convenience. The speed of kallisto has three major implications:

1. It allows for efficient bootstrapping. All that is required for the bootstrap are reruns of the EM algorithm, and those are particularly fast within kallisto. The result is that we can accurately estimate the uncertainty in abundance estimates. One of my favorite figures from our paper, made by Harold, is this one:

It is based on an analysis of 40 samples of 30 million reads subsampled from 275 million rat RNA-Seq reads. Each dot corresponds to a transcript and is colored by its abundance. The x-axis shows the variance estimated from kallisto bootstraps on a single subsample while the y-axis shows the variance computed from the different subsamples of the data. We see that the bootstrap recapitulates the empirical variance. This result is non-trivial: the standard dogma, that the technical variance in RNA-Seq is “Poisson” (i.e. proportional to the mean) is false, as shown in Supplementary Figure 3 of our paper (the correlation becomes 0.64). Thus, the bootstrap will be invaluable when incorporated in downstream application and we are already working on some ideas.

2. It is not just the kallisto quantification that is fast; the index building, and even compilation of the program are also easy and quick. The implication for biologists is that RNA-Seq analysis now becomes interactive. Instead of “freezing” an analysis that might take weeks or even months, data can be explored dynamically, e.g. easily quantified against different transcriptomes, or re-quantified as transcriptomes are updated. The ability to analyze data locally instead of requiring cloud computation means that analysis is portable, and also easily secure.

3. We have found the fast turnaround of analysis helpful in improving the program itself. With kallisto we can quickly check the effect of changes in the algorithms. This allows for much faster debugging of problems, and also better optimization. It also allows us to release improvements knowing that users will be able to test them without resorting to a major computation that might take months. For this reason we’re not afraid to say that some improvements to kallisto will be coming soon.

As someone who has worked on RNA-Seq since the time of 32bp reads, I have to say that kallisto has personally been extremely liberating. It offers freedom from the bioinformatics core facility, freedom from the cloud, freedom from the multi-core server, and in my case freedom from my graduate students– for the first time in years I’m analyzing tons of data on my own; because of the simplicity and speed I find I have the time for it. Enjoy!