Last year I came across a wonderful post on the arXiv, a paper titled A new approach to enumerating statistics modulo n written by William Kuszmaul while he was a high school student participating in the MIT Primes Program for Research in Mathematics. Among other things, Kuszmaul solved the problem of counting the number of subsets of the n-element set \{1,2,\ldots,n\} that sum to k mod m.

This counting problem is related to beautiful (elementary) number theory and combinatorics, and is connected to ideas in (error correcting) coding theory and even computational biology (the Burrows Wheeler transform). I’ll tell the tale, but first explain my personal connection to it, which is the story of  my first paper that wasn’t, and how I was admitted to graduate school:

In 1993 I was a junior (math major) at Caltech and one of my best friends was a senior, Nitu Kitchloo, now an algebraic topologist and professor of mathematics at Johns Hopkins University. I had a habit of asking Nitu math questions, and he had a habit of solving them. One of the questions I asked him that year was:

How many subsets of the n-element set \{1,2,\ldots,n\} sum to 0 mod n?

I don’t remember exactly why I was thinking of the problem, but I do remember that Nitu immediately started looking at the generating function (the polynomial whose coefficients count the number of subsets for each n) and magic happened quickly thereafter. We eventually wrote up a manuscript whose main result was the enumeration of N_n^k, the number of subsets of \{1,2,\ldots,n\} whose elements sum to k mod n. The main result was

N_n^k = \frac{1}{n} \sum_{s|n; s \,odd}2^{\frac{n}{s}}\frac{\varphi(s)}{\varphi\left(\frac{s}{(k,s)}\right)}\mu\left(\frac{s}{(k,s)} \right).

We were particularly tickled that the formula contained both Euler’s totient function and the Möbius function. It seemed nice. So we decided to submit our little result to a journal.

By now months had passed and Nitu had already left Caltech for graduate school. He left me to submit the paper and I didn’t know where, so I consulted the resident combinatorics expert (Rick Wilson) who told me that he liked the result and hadn’t seen it before, but that before sending it off, just in case, I should should consult with this one professor at MIT who was known to be good at counting. I remember Wilson saying something along the lines of “Richard knows everything”. This was in the nascent days of the web, so I hand wrote a letter to Richard Stanley, enclosed a copy of the manuscript, and mailed it off.

A few weeks later I received a hand-written letter from MIT. Richard Stanley had written back to let us know that he very much liked the result… because it had brought back memories of his time as an undergraduate at Caltech, when he had worked on the same problem! He sent me the reference:

R.P Stanley and M.F. Yoder, A study of Varshamov codes for asymmetric channels, JPL Technical Report 32-1526, 1973.

Also included was a note that said “Please consider applying to MIT”.

Stanley and Yoder’s result concerned single-error-correcting codes for what are known as Z-channels. These are communication links where there is an asymmetry in the fidelity of transmission: 0 is reliably transmitted as 0 (probability 1), but 1 may be transmitted as either a 1 or a 0 (with some probability p). In a 1973 paper A class of codes for asymmetric channels and a problem from the additive theory of numbers published in the IEEE Transactions of Information Theory, R. Varshamov had proposed a single error correcting code for such channels, which was essentially to encode a message by a bit string corresponding to a subset of \{1,\ldots,n\} whose elements sum to d mod (n+1). It’s not hard to see that since zeroes are transmitted faithfully, it would not be hard to detect a single error (and correct it) by summing the elements corresponding to the bit string. Stanley and Yoder’s paper addressed questions related to enumerating the number of codewords. In particular, they were basically working out the solution to the problem Nitu and I had considered. I guess we could have published our paper anyway as we had a few additional results, for example a theorem explaining how to enumerate zero summing subsets of finite Abelian groups, but somehow we never did. There is a link to the manuscript we had written on my website:

N. Kitchloo and L. Pachter, An interesting result about subset sums, 1993.

One generalization we had explored in our paper was the enumeration of what we called N_{n,m}^kthe number of subsets of the n-element set \{1,2,\ldots,n\} that sum to k mod m. We looked specifically at the problem where m|n and proved that

N_{n,m}^n = \frac{1}{m}\sum_{s|m; s \, odd} 2^{\frac{n}{s}}\varphi(s).

What Kuszmaul succeeded in doing is to extend this result to any m < n, which is very nice for two reasons: first, the work completes the investigation of the question of subset sums (to subsets summing to an arbitrary modulus). More importantly, the technique used is that of thinking more generally about “modular enumeration”, which is the problem of finding remainders of polynomials modulo x^n-1. This led him to numerous other applications, including results on q-multinomial and q-Catalan numbers, and to the combinatorics of lattice paths. This is the hallmark of excellent mathematics: a proof technique that sheds light on the problem at hand and many others.

One of the ideas that modular enumeration is connected to is that of the Burrows-Wheeler transform (BWT). The BWT was published as a DEC tech report in 1994 (based on earlier work of Wheeler in 1983), and is a transform of one string to another of the same length. To understand the transform, consider the example of a binary string of length n. The BWT consists of forming a matrix of all cyclic permutations of s (one row per permutation), then sorting the rows lexicographically, and finally outputting the last column of the matrix. For example, the string 001101 is transformed to 110010. It is obvious by virtue of the definition that any two strings that are the equivalent up to circular permutation will be transformed to the same string. For example, 110100 will also transform to 110010.

Circular binary strings are called necklaces in combinatorics. Their enumeration is a classic problem, solvable by Burnside’s lemma, and the answer is that the number of distinct necklaces C_n of length n is given by

C_n = \frac{1}{n} \sum_{s|n} 2^{\frac{n}{s}} \varphi(s).

For odd n this formula coincides with the subset sum problem (for subsets summing to 0 mod n). When n is prime it is easy to describe a bijection, but for general odd a simple combinatorial bisection is elusive (see Richard Stanley’s Enumerative Combinatorics Volume 1 Chapter 1 Problem 105b).

The Burrows-Wheeler transform is useful because it can be utilized for constant time string matching while requiring an index whose size is only linear in the size of the target. It has therefore become an indispensable tool for genomics. I’m not aware of an application of the elementary observation above, but as the Stanley-Yoder, Kitchloo-P., Kuszmaul timeline demonstrates (21 years in between publications)… math moves in decades. I do think there is some interesting combinatorics underlying the BWT, and that its elucidation may turn out to have practical implications. We’ll see.

A final point: it is fashionable to think that biology, unlike math, moves in years. After all, NIH R01 grants are funded for a period of 3–5 years, and researchers constantly argue with journals that publication times should be weeks and not months. But in fact, lots of basic research in biology moves in decades as well, just like in mathematics. A good example is the story of CRISP/Cas9, which began with the discovery of “genetic sandwiches” in 1987.  The follow-up identification and interpretation and of CRISRPs took decades, mirroring the slow development of mathematics. Today the utility of the CRISPR/Cas9 system depends on the efficient selection of guides and prediction of off-target binding, and as it turns out, tools developed for this purpose frequently use the Burrows-Wheeler transform. It appears that not only binary strings can form circles, but ideas as well…

William Kuszmaul  won 3rd place in the 2014 Intel Science Talent Search for his work on modular enumeration. Well deserved, and thank you!

When I was an undergraduate at Caltech I took a combinatorics course from Rick Wilson who taught from his then just published textbook A Course in Combinatorics (co-authored with J.H. van Lint). The course and the book emphasized design theory, a subject that is beautiful and fundamental to combinatorics, coding theory, and statistics, but that has sadly been in decline for some time. It was a fantastic course taught by a brilliant professor- an experience that had a profound impact on me. Though to be honest, I haven’t thought much about designs in recent years. Having kids changed that.

A few weeks ago I was playing the card game Colori with my three year old daughter. It’s one of her favorites.


The game consists of 15 cards, each displaying drawings of the same 15 items (beach ball, boat, butterfly, cap, car, drum, duck, fish, flower, kite, pencil, jersey, plane, teapot, teddy bear), with each item colored using two of the colors red, green, yellow and blue. Every pair of cards contains exactly one item that is colored exactly the same. For example, the two cards the teddy bear is holding in the picture above are shown below:


The only pair of items colored exactly the same are the two beach balls. The gameplay consists of shuffling the deck and then placing a pair of cards face-up. Players must find the matching pair, and the first player to do so keeps the cards. This is repeated seven times until there is only one card left in the deck, at which point the player with the most cards wins. When I play with my daughter “winning” consists of enjoying her laughter as she figures out the matching pair, and then proceeds to try to eat one of the cards.

An inspection of all 15 cards provided with the game reveals some interesting structure:


Every card contains exactly one of each type of item. Each item therefore occurs 15 times among the cards, with fourteen of the occurrences consisting of seven matched pairs, plus one extra. This is a type of partially balanced incomplete block design. Ignoring for a moment the extra item placed on each card, what we have is 15 items, each colored one of seven ways (v=15*7=105). The 105 items have been divided into 15 blocks (the cards), so that b=15. Each block contains 14 elements (the items) so that k=14, and each element appears in two blocks (r=2). If every pair of different (colored) items occurred in the same number of cards, we would have a balanced incomplete block design, but this is not the case in Colori. Each item occurs in the same block as 26 (=2*13) other items (we are ignoring the extra item that makes for 15 on each card), and therefore it is not the case that every pair of items occurs in the same number of blocks as would be the case in a balanced incomplete block design. Instead, there is an association scheme that provides extra structure among the 105 items, and in turn describes the way in which items do or do not appear together on cards. The association scheme can be understood as a graph whose nodes consist of the 105 items, with edges between items labeled either 0,1 or 2. An edge between two items of the same type is labeled 0, edges between different items that appear on the same card are labeled 1, and edges between different items that do not appear on the same card are labeled 2. This edge labeling is called an “association scheme” because it has a special property, namely the number of triangles with a base edge labeled k, and other two edges labeled i and respectively is  dependent only on i,j and k and not on the specific base edge selected. In other words, there is a special symmetry to the graph. Returning to the deck of cards, we see that every pair of items appears in the same card exactly 0 or 1 times, and the number depends only on the association class of the pair of objects. This is called a partially balanced incomplete block design.

The author of the game, Reinhard Staupe, made it a bit more difficult by adding an extra item to each card making the identification of the matching pair harder. The addition also ensures that each of the 15 items appears on each card. Moreover, the items are permuted in location on the cards, in an arrangement similar to a latin square, making it hard to pair up the items. And instead of using 8 different colors, he used only four, producing the eight different “colors” of each item on the cards by using pairwise combinations of the four.  The yellow-green two-colored beach balls are particularly difficult to tell apart from the green-yellow ones. Of course, much of this is exactly the kind of thing you would want to do if you were designing an RNA-Seq experiment!

Instead of 15 types of items, think of 15 different strains of mice.  Instead of colors for the items, think of different cellular conditions to be assayed. Instead of one pair for each of seven color combinations, think of one pair of replicates for each of seven cellular conditions. Instead of cards, think of different sequencing centers that will prepare the libraries and sequence the reads. An ideal experimental setup would involve distributing the replicates and different cellular conditions across the different sequencing centers so as to reduce batch effect. This is the essence of part of the paper Statistical Design and Analysis of RNA Sequencing Data by Paul Auer and Rebecca Doerge. For example, in their Figure 4 (shown below) they illustrate the advantage of balanced block designs to ameliorate lane effects:


Figure 4 from P. Auer and R.W. Doerge’s paper Statistical Design and Analysis of RNA Sequencing Data.

Of course the use of experimental designs for constructing controlled gene expression experiments is not new. Kerr and Churchill wrote about the use of combinatorial designs in Experimental Design for gene expression microarrays, and one can trace back a long chain of ideas originating with R.A. Fisher. But design theory seems to me to be a waning art insofar as molecular biology experiments are concerned, and it is frequently being replaced with biological intuition of what makes for a good control. The design of good controls is sometimes obvious, but not always. So next time you design an experiment, if you have young kids, first play a round of Colori. If the kids are older, play Set instead. And if you don’t have any kids, plan for an extra research project, because what else would you do with your time?

I’m a (50%) professor of mathematics and (50%) professor of molecular & cell biology at UC Berkeley. There have been plenty of days when I have spent the working hours with biologists and then gone off at night with some mathematicians. I mean that literally. I have had, of course, intimate friends among both biologists and mathematicians. I think it is through living among these groups and much more, I think, through moving regularly from one to the other and back again that I have become occupied with the problem that I’ve christened to myself as the ‘two cultures’. For constantly I feel that I am moving among two groups- comparable in intelligence, identical in race, not grossly different in social origin, earning about the same incomes, who have almost ceased to communicate at all, who in intellectual, moral and psychological climate have so little in common that instead of crossing the campus from Evans Hall to the Li Ka Shing building, I may as well have crossed an ocean.1

I try not to become preoccupied with the two cultures problem, but this holiday season I have not been able to escape it. First there was a blog post by David Mumford, a professor emeritus of applied mathematics at Brown University, published on December 14th. For those readers of the blog who do not follow mathematics, it is relevant to what I am about to write that David Mumford won the Fields Medal in 1974 for his work in algebraic geometry, and afterwards launched another successful career as an applied mathematician, building on Ulf Grenader’s Pattern Theory and making significant contributions to vision research. A lot of his work is connected to neuroscience and therefore biology. Among his many awards are the MacArthur Fellowship, the Shaw Prize, the Wolf Prize and the National Medal of Science. David Mumford is not Joe Schmo.

It therefore came as a surprise to me to read his post titled “Can one explain schemes to biologists?”  in which he describes the rejection by the journal Nature of an obituary he was asked to write. Now I have to say that I have heard of obituaries being retracted, but never of an obituary being rejected. The Mumford rejection is all the more disturbing because it happened after he was invited by Nature to write the obituary in the first place!

The obituary Mumford was asked to write was for Alexander Grothendieck, a leading and towering figure in 20th century mathematics who built many of the foundations for modern algebraic geometry. My colleague Edward Frenkel published a brief non-technical obituary about Grothendieck in the New York Times, and perhaps that is what Nature had in mind for its journal as well. But since Nature is bills itself as “An international journal, published weekly, with original, groundbreaking research spanning all of the scientific disciplines [emphasis mine]” Mumford assumed the readers of Nature would be interested not only in where Grothendieck was born and died, but in what he actually accomplished in his life, and why he is admired for his mathematics. Here is the beginning excerpt of Mumford’s blog post2 explaining why he and John Tate (his coauthor for the post) needed to talk about the concept of a scheme in their post:

John Tate and I were asked by Nature magazine to write an obituary for Alexander Grothendieck. Now he is a hero of mine, the person that I met most deserving of the adjective “genius”. I got to know him when he visited Harvard and John, Shurik (as he was known) and I ran a seminar on “Existence theorems”. His devotion to math, his disdain for formality and convention, his openness and what John and others call his naiveté struck a chord with me.

So John and I agreed and wrote the obituary below. Since the readership of Nature were more or less entirely made up of non-mathematicians, it seemed as though our challenge was to try to make some key parts of Grothendieck’s work accessible to such an audience. Obviously the very definition of a scheme is central to nearly all his work, and we also wanted to say something genuine about categories and cohomology.

What they came up with is a short but well-written obituary that is the best I have read about Grothendieck. It is non-technical yet accurate and meaningfully describes, at a high level, what he is revered for and why. Here it is (copied verbatim from David Mumford’s blog):

Alexander Grothendieck
David Mumford and John Tate

Although mathematics became more and more abstract and general throughout the 20th century, it was Alexander Grothendieck who was the greatest master of this trend. His unique skill was to eliminate all unnecessary hypotheses and burrow into an area so deeply that its inner patterns on the most abstract level revealed themselves — and then, like a magician, show how the solution of old problems fell out in straightforward ways now that their real nature had been revealed. His strength and intensity were legendary. He worked long hours, transforming totally the field of algebraic geometry and its connections with algebraic mber

mber theory. He was considered by many the greatest mathematician of the 20th century.

Grothendieck was born in Berlin on March 28, 1928 to an anarchist, politically activist couple — a Russian Jewish father, Alexander Shapiro, and a German Protestant mother Johanna (Hanka) Grothendieck, and had a turbulent childhood in Germany and France, evading the holocaust in the French village of Le Chambon, known for protecting refugees. It was here in the midst of the war, at the (secondary school) Collège Cévenol, that he seems to have first developed his fascination for mathematics. He lived as an adult in France but remained stateless (on a “Nansen passport”) his whole life, doing most of his revolutionary work in the period 1956 – 1970, at the Institut des Hautes Études Scientifique (IHES) in a suburb of Paris after it was founded in 1958. He received the Fields Medal in 1966.

His first work, stimulated by Laurent Schwartz and Jean Dieudonné, added major ideas to the theory of function spaces, but he came into his own when he took up algebraic geometry. This is the field where one studies the locus of solutions of sets of polynomial equations by combining the algebraic properties of the rings of polynomials with the geometric properties of this locus, known as a variety. Traditionally, this had meant complex solutions of polynomials with complex coefficients but just prior to Grothendieck’s work, Andre Weil and Oscar Zariski had realized that much more scope and insight was gained by considering solutions and polynomials over arbitrary fields, e.g. finite fields or algebraic number fields.

The proper foundations of the enlarged view of algebraic geometry were, however, unclear and this is how Grothendieck made his first, hugely significant, innovation: he invented a class of geometric structures generalizing varieties that he called schemes. In simplest terms, he proposed attaching to any commutative ring (any set of things for which addition, subtraction and a commutative multiplication are defined, like the set of integers, or the set of polynomials in variables x,y,z with complex number coefficients) a geometric object, called the Spec of the ring (short for spectrum) or an affine scheme, and patching or gluing together these objects to form the scheme. The ring is to be thought of as the set of functions on its affine scheme.

To illustrate how revolutionary this was, a ring can be formed by starting with a field, say the field of real numbers, and adjoining a quantity \epsilon satisfying \epsilon^2=0. Think of \epsilon this way: your instruments might allow you to measure a small number such as \epsilon=0.001 but then \epsilon^2=0.000001 might be too small to measure, so there’s no harm if we set it equal to zero. The numbers in this ring are a+b \cdot \epsilon real a,b. The geometric object to which this ring corresponds is an infinitesimal vector, a point which can move infinitesimally but to second order only. In effect, he is going back to Leibniz and making infinitesimals into actual objects that can be manipulated. A related idea has recently been used in physics, for superstrings. To connect schemes to number theory, one takes the ring of integers. The corresponding Spec has one point for each prime, at which functions have values in the finite field of integers mod p and one classical point where functions have rational number values and that is ‘fatter’, having all the others in its closure. Once the machinery became familiar, very few doubted that he had found the right framework for algebraic geometry and it is now universally accepted.

Going further in abstraction, Grothendieck used the web of associated maps — called morphisms — from a variable scheme to a fixed one to describe schemes as functors and noted that many functors that were not obviously schemes at all arose in algebraic geometry. This is similar in science to having many experiments measuring some object from which the unknown real thing is pieced together or even finding something unexpected from its influence on known things. He applied this to construct new schemes, leading to new types of objects called stacks whose functors were precisely characterized later by Michael Artin.

His best known work is his attack on the geometry of schemes and varieties by finding ways to compute their most important topological invariant, their cohomology. A simple example is the topology of a plane minus its origin. Using complex coordinates (z,w), a plane has four real dimensions and taking out a point, what’s left is topologically a three dimensional sphere. Following the inspired suggestions of Grothendieck, Artin was able to show how with algebra alone that a suitably defined third cohomology group of this space has one generator, that is the sphere lives algebraically too. Together they developed what is called étale cohomology at a famous IHES seminar. Grothendieck went on to solve various deep conjectures of Weil, develop crystalline cohomology and a meta-theory of cohomologies called motives with a brilliant group of collaborators whom he drew in at this time.

In 1969, for reasons not entirely clear to anyone, he left the IHES where he had done all this work and plunged into an ecological/political campaign that he called Survivre. With a breathtakingly naive spririt (that had served him well doing math) he believed he could start a movement that would change the world. But when he saw this was not succeeding, he returned to math, teaching at the University of Montpellier. There he formulated remarkable visions of yet deeper structures connecting algebra and geometry, e.g. the symmetry group of the set of all algebraic numbers (known as its Galois group Gal(\overline{\mathbb{Q}}/\mathbb{Q})) and graphs drawn on compact surfaces that he called ‘dessin d’enfants’. Despite his writing thousand page treatises on this, still unpublished, his research program was only meagerly funded by the CNRS (Centre Nationale de Recherche Scientifique) and he accused the math world of being totally corrupt. For the last two decades of his life he broke with the whole world and sought total solitude in the small village of Lasserre in the foothills of the Pyrenees. Here he lived alone in his own mental and spiritual world, writing remarkable self-analytic works. He died nearby on Nov. 13, 2014.

As a friend, Grothendieck could be very warm, yet the nightmares of his childhood had left him a very complex person. He was unique in almost every way. His intensity and naivety enabled him to recast the foundations of large parts of 21st century math using unique insights that still amaze today. The power and beauty of Grothendieck’s work on schemes, functors, cohomology, etc. is such that these concepts have come to be the basis of much of math today. The dreams of his later work still stand as challenges to his successors.

Mumford goes on in his blog post to describe the reasons Nature gave for rejecting the obituary. He writes:

The sad thing is that this was rejected as much too technical for their readership. Their editor wrote me that ‘higher degree polynomials’, ‘infinitesimal vectors’ and ‘complex space’ (even complex numbers) were things at least half their readership had never come across. The gap between the world I have lived in and that even of scientists has never seemed larger. I am prepared for lawyers and business people to say they hated math and not to remember any math beyond arithmetic, but this!? Nature is read only by people belonging to the acronym ‘STEM’ (= Science, Technology, Engineering and Mathematics) and in the Common Core Standards, all such people are expected to learn a hell of a lot of math. Very depressing.

I don’t know if the Nature editor had biologists in mind when rejecting the Grothendieck obituary, but Mumford certainly thought so, as he sarcastically titled his post “Can one explain schemes to biologists?” Sadly, I think that Nature and Mumford both missed the point.

Exactly ten years ago Bernd Sturmfels and I published a book titled “Algebraic Statistics for Computational Biology“. From my perspective, the book developed three related ideas: 1. that the language, techniques and theorems of algebraic geometry both unify and provide tools for certain models in statistics, 2. that problems in computational biology are particularly prone to depend on inference with precisely the statistical models amenable to algebraic analysis and (most importantly) 3. mathematical thinking, by way of considering useful generalizations of seemingly unrelated ideas, is a powerful approach for organizing many concepts in (computational) biology, especially in genetics and genomics.

To give a concrete example of what 1,2 and 3 mean, I turn to Mumford’s definition of algebraic geometry in his obituary for Grothendieck. He writes that “This is the field where one studies the locus of solutions of sets of polynomial equations by combining the algebraic properties of the rings of polynomials with the geometric properties of this locus, known as a variety.” What is he talking about? The notion of “phylogenetic invariants”, provides a simple example for biologists by biologists. Phylogenetic invariants were first introduced to biology ca. 1987 by Joe Felsenstein (Professor of Genome Sciences and Biology at the University of Washington) and James Lake (Distinguished Professor of Molecular, Cell, and Developmental Biology and of Human Genetics at UCLA)3.

Given a phylogenetic tree describing the evolutionary relationship among n extant species, one can examine the evolution of a single nucleotide along the tree. At the leaves, a single nucleotide is then associated to each species, collectively forming a single selection from among the 4^n possible patterns for nucleotides at the leaves. Evolutionary models provide a way to formalize the intuitive notion that random mutations should be associated with branches of the tree and formally are described via (unknown) parameters that can be used to calculate a probability for any pattern at the leaves. It happens to be the case that for most phylogenetic evolutionary model have the property that the probabilities for leaf patterns are polynomials in the parameters. The simplest example to consider is the tree with an ancestral node and two leaves corresponding to two extant species, say “B” and “M”:



The molecular approach to evolution posits that multiple sites together should be used both to estimate parameters associated with evolution along the tree, and maybe even the tree itself. If one assumes that nucleotides mutate according to the 4-state general Markov model with independent processes on each branch, and one writes p_{ij} for \mathbb{P}(B=i,M=j) where i,j are one of A,C,G,T, then it must be the case that p_{ij}p_{kl} = p_{il}p_{jk}. In other words, the polynomial

p_{ij}p_{kl} - p_{il}p_{jk}=0.

In other words, for any parameters in the 4-state general Markov model, it has to be the case that when the pattern probabilities are plugged into the polynomial equation above, the result is zero. This equation is none other than the condition for two random variables to be independent; in this case the random variable corresponding to the nucleotide at B is independent of the random variable corresponding to the nucleotide at M.

The example is elementary, but it hints at a powerful tool for phylogenetics. It provides an equation that must be satisfied by the pattern probabilities that does not depend specifically on the parameters of the model (which can be intuitively understood as relating to branch length). If many sites are available so that pattern probabilities can be estimated empirically from data, then there is in principle a possibility for testing whether the data fits the topology of a specific tree regardless of what the branch lengths of the tree might be. Returning to Mumford’s description of algebraic geometry, the variety of interest is the geometric object in “pattern probability space” where points are precisely probabilities that can arise for a specific tree, and the “ring of polynomials with the geometric properties of the locus” are the phylogenetic invariants. The relevance of the ring lies in the fact that if and g are two phylogenetic invariants then that means that f(P)=0 and g(P)=0 for any pattern probabilities from the model, so therefore f+g is also a phylogenetic invariant because f(P)+g(P)=0 for any pattern probabilities from the model (the same is true for c \cdot f for any constant c). In other words, there is an algebra of phylogenetic invariants that is closely related to the geometry of pattern probabilities. As Mumford and Tate explain, Grothendieck figured out the right generalizations to construct a theory for any ring, not just the ring of polynomials, and therewith connected the fields of commutative algebra, algebraic geometry and number theory.

The use of phylogenetic invariants for testing tree topologies is conceptually elegantly illustrated in a wonderful book chapter on phylogenetic invariants  by mathematicians Elizabeth Allman and John Rhodes that starts with the simple example of the two taxa tree and delves deeply into the subject. Two surfaces (conceptually) represent the varieties for two trees, and the equations f_1(P)=f_2(P)=\ldots=f_l(P)=0 and h_1(P)=h_2(P)=\ldots=h_k(P)=0 are the phylogenetic invariants. The empirical pattern probability distribution is the point \hat{P} and the goal is to find the surface it is close to:


Figure 4.2 from Allman and Rhodes chapter on phylogenetic invariants.

Of course for large trees there will be many different phylogenetic invariants, and the polynomials may be of high degree. Figuring out what the invariants are, how many of them there are, bounds for the degrees, understanding the geometry, and developing tests based on the invariants, is essentially a (difficult unsolved) challenge for algebraic geometers. I think it’s fair to say that our book spurred a lot of research on the subject, and helped to create interest among mathematicians who were unaware of the variety and complexity of problems arising from phylogenetics. Nick Eriksson, Kristian Ranestad, Bernd Sturmfels and Seth Sullivant wrote a short piece titled phylogenetic algebraic geometry which is an introduction for algebraic geometers to the subject. Here is where we come full circle to Mumford’s obituary… the notion of a scheme is obviously central to phylogenetic algebraic geometry. And the expository article just cited is just the beginning. There are too many exciting developments in phylogenetic geometry to summarize in this post, but Elizabeth Allman, Marta Casanellas, Joseph Landsberg, John Rhodes, Bernd Sturmfels and Seth Sullivant are just a few of many who have discovered beautiful new mathematics motivated by the biology, and also have had an impact on biology with algebro-geometric tools. There is both theory (see this recent example) and application (see this recent example) coming out of phylogenetic algebraic geometry. More generally, algebraic statistics for computational biology is now a legitimate “field”, complete with a journal, regular conferences, and a critical mass of mathematicians, statisticians, and even some biologists working in the area. Some of the results are truly beautiful and impressive. My favorite recent one is this paper by Caroline Uhler, Donald Richards and Piotr Zwiernik providing important guarantees for maximum likelihood estimation of parameters in Felstenstein’s continuous character model.

But that is not the point here. First, Mumford’s sarcasm was unwarranted. Biologists certainly didn’t discover schemes but as Felsenstein and Lake’s work shows, they did (re)discover algebraic geometry. Moreover, all of the people mentioned above can explain schemes to biologists, thereby answering Mumford’s question in the affirmative. Many of them have not only collaborated with biologists but written biology papers. And among them are some extraordinary expositors, notably Bernd Sturmfels. Still, even if there are mathematicians able and willing to explain schemes to biologists, and even if there are areas within biology where schemes arise (e.g. phylogenetic algebraic geometry), it is fair to ask whether biologists should care to understand them?

The answer to the question is: probably not. In any case I wouldn’t presume to opine on what biologists should and shouldn’t care about. Biology is enormous, and encompasses everything from the study of fecal transplants to the wood frogs of Alaska. However I do have an opinion about the area I work in, namely genomics. When it comes to genomics journalists write about revolutions, personalized precision medicine, curing cancer and data deluge. But the biology of genomics is for real, and it is indeed tremendously exciting as a result of dramatic improvements in underlying technologies (e.g. DNA sequencing and genome editing to name two). I also believe it is true that despite what is written about data deluge, experiments remain the primary and the best way, to elucidate the function of the genome. Data analysis is secondary. But it is true that statistics has become much more important to genomics than it was even to population genetics at the time of R.A. Fisher, computer science is playing an increasingly important role, and I believe that somewhere in the mix of “quantitative sciences for biology”, there is an important role for mathematics.

What biologists should appreciate, what was on offer in Mumford’s obituary, and what mathematicians can deliver to genomics that is special and unique, is the ability to not only generalize, but to do so “correctly”. The mathematician Raoul Bott once reminisced that “Grothendieck was extraordinary as he could play with concepts, and also was prepared to work very hard to make arguments almost tautological.” In other words, what made Grothendieck special was not that he generalized concepts in algebraic geometry to make them more abstract, but that he was able to do so in the right way. What made his insights seemingly tautological at the end of the day, was that he had the “right” way of viewing things and the “right” abstractions in mind. That is what mathematicians can contribute most of all to genomics. Of course sometimes theorems are important, or specific mathematical techniques solve problems and mathematicians are to thank for that. Phylogenetic invariants are important for phylogenetics which in turn is important for comparative genomics which in turn is important for functional genomics which in turn is important for medicine. But it is the the abstract thinking that I think matters most. In other words, I agree with Charles Darwin that mathematicians are endowed with an extra sense… I am not sure exactly what he meant, but it is clear to me that it is the sense that allows for understanding the difference between the “right” way and the “wrong” way to think about something.

There are so many examples of how the “right” thinking has mattered in genomics that they are too numerous to list here, but here are a few samples: At the heart of molecular biology, there is the “right” and the “wrong” way to think about genes: evidently the message to be gleaned from Gerstein et al.‘s in “What is a gene post ENCODE? History and Definition” is that “genes” are not really the “right” level of granularity but transcripts are. In a previous blog post I’ve discussed the “right” way to think about the Needleman-Wunsch algorithm (tropically). In metagenomics there is the “right” abstraction with which to understand UniFrac. One paper I’ve written (with Niko Beerenwinkel and Bernd Sturmfels) is ostensibly about fitness landscapes but really about what we think the “right” way is to look at epistasis. In systems biology there is the “right” way to think about stochasticity in expression (although I plan a blog post that digs a bit deeper). There are many many more examples… way too many to list here… because ultimately every problem in biology is just like in math… there is the “right’ and the “wrong” way to think about it, and figuring out the difference is truly an art that mathematicians, the type of mathematicians that work in math departments, are particularly good at.

Here is a current example from (computational) biology where it is not yet clear what “right” thinking should be despite the experts working hard at it, and that is useful to highlight because of the people involved: With the vast amount of human genomes being sequenced (some estimates are as high as 400,000 in the coming year), there is an increasingly pressing fundamental question about how the (human) genome should be represented and stored. This is ostensibly a computer science question: genomes should perhaps be compressed in ways that allow for efficient search and retrieval, but I’d argue that fundamentally it is a math question. This is because what the question is really asking, is how should one think about genome sequences related mostly via recombination and only slightly by mutation, and what are the “right” mathematical structures for this challenge? The answer matters not only for the technology (how to store genomes), but much more importantly for the foundations of population and statistical genetics. Without the right abstractions for genomes, the task of coherently organizing and interpreting genomic information is hopeless. David Haussler (with coauthors) and Richard Durbin have both written about this problem in papers that are hard to describe in any way other than as math papers; see Mapping to a Reference Genome Structure and Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (BPWT). Perhaps it is no coincidence that both David Haussler and Richard Durbin studied mathematics.

But neither David Haussler nor Richard Durbin are faculty in mathematics departments. In fact, there is a surprisingly long list of very successful (computational) biologists specifically working in genomics, many of whom even continue to do math, but not in math departments, i.e. they are former mathematicians (this is so common there is even a phrase for it “recovering mathematician” as if being one is akin to alcoholism– physicists use the same language). People include Richard Durbin, Phil Green, David Haussler, Eric Lander, Montgomery Slatkin and many others I am omitting; for example almost the entire assembly group at the Broad Institute consists of former mathematicians. Why are there so many “formers” and very few “currents”? And does it matter? After all, it is legitimate to ask whether successful work in genomics is better suited to departments, institutes and companies outside the realm of academic mathematics. It is certainly the case that to do mathematics, or to publish mathematical results, one does not need to be a faculty member in a mathematics department. I’ve thought a lot about these issues and questions, partly because they affect my daily life working between the worlds of mathematics and molecular biology in my own institution. I’ve also seen the consequences of the separation of the two cultures. To illustrate how far apart they are I’ve made a list of specific differences below:

Biologists publish in “glamour journals” such as Science, Nature and Cell where impact factors are high. Nature publishes its impact factor to three decimal digits accuracy (42.317). Mathematicians publish in journals whose names start with the word Annals, and they haven’t heard of impact factors. The impact factor of the Annals of Mathematics, perhaps the most prestigious journal in mathematics, is 3 (the journal with the highest impact factor is the Journal of the American Mathematical Society at 3.5). Mathematicians post all papers on the ArXiv preprint server prior to publications. Not only do biologists not do that, they are frequently subject to embargos prior to publication. Mathematicians write in LaTeX, biologists in Word (a recent paper argues that Word is better, but I’m not sure). Biologists draw figures and write papers about them. Mathematicians write papers and draw figures to explain them. Mathematicians order authors alphabetically, and authorship is awarded if a mathematical contribution was made. Biologists author lists have two gradients from each end, and authorship can be awarded for payment for the work. Biologists may review papers on two week deadlines. Mathematicians review papers on two year deadlines. Biologists have their papers cited by thousands, and their results have a real impact on society; in many cases diseases are cured as a result of basic research. Mathematicians are lucky if 10 other individuals on the planet have any idea what they are writing about. Impact time can be measured in centuries, and sometimes theorems turn out to simply not have been interesting at all. Biologists don’t teach much. Mathematicians do (at UC Berkeley my math teaching load is 5 times that of my biology teaching load). Biologists value grants during promotion cases and hiring. Mathematicians don’t. Biologists have chalk talks during job interviews. Mathematicians don’t. Mathematicians have a jobs wiki. Biologists don’t. Mathematicians write ten page recommendation letters. Biologists don’t. Biologists go to retreats to converse. Mathematicians retreat from conversations (my math department used to have a yearly retreat that was one day long and consisted of a faculty meeting around a table in the department; it has not been held the past few years). Mathematics graduate students teach. Biology graduate students rotate. Biology students take very little coursework after their first year. Mathematics graduate students take two years of classes (on this particular matter I’m certain mathematicians are right). Biologists pay their graduate students from grants. Mathematicians don’t (graduate students are paid for teaching sections of classes, usually calculus). Mathematics full professors that are female is a number (%) in the single digits. Biology full professors that are female is a number (%) in the double digits (although even added together the numbers are still much less than 50%). Mathematicians believe in God. Biologists don’t.

How then can biology, specifically genomics (or genetics), exist and thrive within the mathematics community? And how can mathematics find a place within the culture of biology?

I don’t know. The relationship between biology and mathematics is on the rocks and prospects are grim. Yes, there are biologists who do mathematical work, and yes, there are mathematical biologists, especially in areas such as evolution or ecology who are in math departments. There are certainly applied mathematics departments with faculty working on biology problems involving modeling at the macroscopic level, where the math fits in well with classic applied math (e.g. PDEs, numerical analysis). But there is very little genomics or genetics related math going on in math departments. And conversely, mathematicians who leave math departments to work in biology departments or institutes face enormous pressure to not focus on the math, or when they do any math at all, to not publish it (work is usually relegated to the supplement and completely ignored). The result is that biology loses out due to the minimal real contact with math– the special opportunity of benefiting from the extra sense is lost, and conversely math loses the opportunity to engage biology– one of the most exciting scientific enterprises of the 21st century. The mathematician Gian-Carlo Rota said that “The lack of real contact between mathematics and biology is either a tragedy, a scandal, or a challenge, it is hard to decide which”. He was right.

The extent to which the two cultures have drifted apart is astonishing. For example, visiting other universities I see the word “mathematics” almost every time precision medicine is discussed in the context of a new initiative, but I never see mathematicians or the local math department involved. In the mathematics community, there has been almost no effort to engage and embrace genomics. For example the annual joint AMS-MAA meetings always boast a series of invited talks, many on applications of math, but genomics is never a represented area. Yet in my Junior level course last semester on mathematical biology (taught in the math department) there were 46 students, more than any other upper division elective class in the math department. Even though I am a 50% member of the mathematics department I have been advising three math graduate students this year, equivalent to six for a full time member, a statistic that probably ranks me among the most busy advisors in the department (these numbers do not even reflect the fact that I had to turn down a number of students). Anecdotally, the numbers illustrate how popular genomics is among math undergraduate and graduate students, and although hard data is difficult to come by my interactions with mathematicians everywhere convince me the trend I see at Berkeley is universal. So why is this popularity not reflected in support of genomics by the math community? And why don’t biology journals, conferences and departments embrace more mathematics? There is a hypocrisy of math for biology. People talk about it but when push comes to shove nobody wants to do anything real to foster it.

Examples abound. On December 16th UCLA announced the formation of a new Institute for Quantitative and Computational Biosciences. The announcement leads with a photograph of the director that is captioned “Alexander Hoffmann and his colleagues will collaborate with mathematicians to make sense of a tsunami of biological data.” Strangely though, the math department is not one of the 15 partner departments that will contribute to the Institute. That is not to say that mathematicians won’t interact with the Institute, or that mathematics won’t happen there. E.g., the Institute for Pure and Applied Mathematics is a partner as is the Biomathematics department (an interesting UCLA concoction), not to mention the fact that many of the affiliated faculty do work that is in part mathematical. But formal partnership with the mathematics department, and through it direct affiliation with the mathematics community, is missing. UCLA’s math department is among the top in the world, and boasts a particularly robust applied mathematics program many of whose members work on mathematical biology. More importantly, the “pure” mathematicians at UCLA are first rate and one of them, Terence Tao, is possibly the most talented mathematician alive. Wouldn’t it be great if he could be coaxed to think about some of the profound questions of biology? Wouldn’t it be awesome if mathematicians in the math department at UCLA worked hard with the biologists to tackle the extraordinary challenges of “precision medicine”? Wouldn’t it be wonderful if UCLA’s Quantitative and Computational biosciences Institute could benefit from the vast mathematics talent pool not only at UCLA but beyond: that of the entire mathematics community?

I don’t know if the omission of the math department was an accidental oversight of the Institute, a deliberate snub, or if it was the Institute that was rebuffed by the mathematics department. I don’t think it really matters. The point is that the UCLA situation is ubiquitous. Mathematics departments are almost never part of new initiatives in genomics; biologists are all too quick to glance the other way. Conversely, the mathematics community has shunned biologists. Despite two NSF Institutes dedicated to mathematical biology (the MBI and NIMBioS) almost no top math departments hire mathematicians working in genetics or genomics (see the mathematics jobs wiki). In the rooted tree in the figure above B can represent Biology and M can represent Mathematics and they truly, and sadly, are independent.

I get it. The laundry list of differences between biology and math that I aired above can be overwhelming. Real contact between the subjects will be difficult to foster, and it should be acknowledged that it is neither necessary nor sufficient for the science to progress. But wouldn’t it be better if mathematicians proved they are serious about biology and biologists truly experimented with mathematics? 


1. The opening paragraph is an edited copy of an excerpt (page 2, paragraph 2) from C.P. Snow’s “The Two Cultures and The Scientific Revolution” (The Rede Lecture 1959).
2. David Mumford’s content on his site is available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License, and I have incorporated it in my post (boxed text) unaltered according to the terms of the license.
3. The meaning of the word “invariant” in “phylogenetic invariants” differs from the standard meaning in mathematics, where invariant refers to a property of a class of objects that is unchanged under transformations. In the context of algebraic geometry classic invariant theory addresses the problem of determining polynomial functions that are invariant under transformations from a linear group. Mumford is known for his work on geometric invariant theory. An astute reader could therefore deduce from the term “phylogenetic invariants” that the term was coined by biologists.

Recent news of James Watson’s auction of his Nobel Prize medal has unearthed a very unpleasant memory for me.

In March 2004 I attended an invitation only genomics meeting at the famed Banbury Center at Cold Spring Harbor Laboratory. I had heard legendary stories about Banbury, and have to admit I felt honored and excited when I received the invitation. There were rumors that sometimes James Watson himself would attend meetings. The emails I received explaining the secretive policies of the Center only added to the allure. I felt that I had received an invitation to the genomics equivalent of Skull and Bones.

Although Watson did not end up attending the meeting, my high expectations were met when he did decide to drop in on dinner one evening at Robertson house. Without warning he seated himself at my table. I was in awe. The table was round with seating for six, and Honest Jim sat down right across from me. He spoke incessantly throughout dinner and we listened. Sadly though, most of the time he was spewing racist and misogynistic hate. I remember him asking rhetorically “who would want to adopt an Irish kid?” (followed by a tirade against the Irish that I later saw repeated in the news) and he made a point to disparage Rosalind Franklin referring to her derogatorily as “that woman”. No one at the table (myself included) said a word. I deeply regret that.

One of Watson’s obsessions has been to “improve” the “imperfect human” via human germline engineering. This is disturbing on many many levels. First, there is the fact that for years Watson presided over Cold Spring Harbor Laboratory which actually has a history as a center for eugenics. Then there are the numerous disparaging remarks by Watson about everyone who is not exactly like him, leaving little doubt about who he imagines the “perfect human” to be. But leaving aside creepy feelings… could he be right? Is the “perfect human” an American from Chicago of mixed Scottish/Irish ancestry? Should we look forward to a world filled with Watsons? I have recently undertaken a thought experiment along these lines that I describe below. The result of the experiment is dedicated to James Watson on the occasion of his unbirthday today.


SNPedia is an open database of 59,593 SNPs and their associations. A SNP entry includes fields for “magnitude” (a subjective measure of significance on a scale of 0–10) and “repute” (good or bad), and allele classifications for many diseases and medical conditions. For example, the entry for a SNP (rs1799971) that associates with alcohol cravings describes the “normal” and “bad” alleles. In addition to associating with phenotypes, SNPs can also associate with populations. For example, as seen in the Geography of Genetic Variants Browser, rs1799971 allele frequencies vary greatly among Africans, Europeans and Asians. If the genotype of an individual is known at many SNPs, it is therefore possible to guess where they are from: in the case of rs1799971 someone who is A:A is a lot more likely to be African than Japanese, and with many SNPs the probabilities can narrow the location of an individual to a very specific geographic location. This is the principle behind the application of principal component analysis (PCA) to the study of populations. Together, SNPedia and PCA therefore provide a path to determining where a “perfect human” might be from:

  1. Create a “perfect human” in silico by setting the alleles at all SNPs so that they are “good”.
  2. Add the “perfect human” to a panel of genotyped individuals from across a variety of populations and perform PCA to reveal the location and population of origin of the individual.


After restricting the SNP set from SNPedia to those with green painted alleles, i.e. “good”, there are 4967 SNPs with which to construct the “perfect human” (available for download here).

A dataset of genotyped individuals can be obtain from 1000 genomes including Africans, (indigenous) Americans, East Asians and Europeans.

The PCA plot (1st and 2nd components) showing all the individuals together with the “perfect human” (in pink; see arrow) is shown below:


The nearest neighbor to the “perfect human” is HG00737, a female who isPuerto Rican. One might imagine that such a person already existed, maybe Yuiza, the only female Taino Cacique (chief) in Puerto Rico’s history:


Samuel Lind’s ‘Yuiza’

But as the 3rd principal component shows, reifying the “perfect human” is a misleading undertaking:


Here the “perfect human” is revealed to be decidedly non-human. This is not surprising, and it reflects the fact that the alleles of the “perfect human” place it as significant outlier to the human population. In fact, this is even more evident in the case of the “worst human”, namely the individual that has the “bad” alleles at every SNPs. A projection of that individual onto any combination of principal components shows them to be far removed from any actual human. The best visualization appears in the projection onto the 2nd and 3rd principal components, where they appear as a clear outlier (point labeled DYS), and diametrically opposite to Africans:


The fact that the “worst human” does not project well onto any of the principal components whereas the “perfect human” does is not hard to understand from basic population genetics principles. It is an interesting exercise that I leave to the reader.


The fact that the “perfect human” is Puerto Rican makes a lot of sense. Since many disease SNPs are population specific, it makes sense that an individual homozygous for all “good” alleles should be admixed. And that is exactly what Puerto Ricans are. In a “women in the diaspora” study, Puerto Rican women born on the island but living in the United States were shown to be 53.3±2.8% European, 29.1±2.3% West African, and 17.6±2.4% Native American. In other words, to collect all the “good” alleles it is necessary to be admixed, but admixture itself is not sufficient for perfection. On a personal note, I was happy to see population genetic evidence supporting my admiration for the perennial championship Puerto Rico All Stars team:

As for Watson, it seems fitting that he should donate the proceeds of his auction to the Caribbean Genome Center at the University of Puerto Rico.

[Update: Dec. 7/8: Taras Oleksyk from the Department of Biology at the University of Puerto Rico Mayaguez has written an excellent post-publication peer review of this blog post and Rafael Irizarry from the Harvard School of Public Health has written a similar piece, Genéticamente, no hay tal cosa como la raza puertorriqueña in Spanish. Both are essential reading.]

Earlier this week US News and World Report (USNWR) released, for the first time, a global ranking of universities including rankings by subject area. In mathematics, the top ten universities are:

1. Berkeley
2. Stanford
3. Princeton
5. University of Oxford
6. Harvard
7. King Abdulaziz University
8. Pierre and Marie Curie – Paris 6
9. University of Hong Kong
10. University of Cambridge

The past few days I’ve received a lot of email from colleagues and administrators about this ranking, and also the overall global ranking of USNWR in which Berkeley was #1. The emails generally say something to the effect of “of course rankings are not perfect, everybody knows… but look, we are amazing!”

BUT, one of the top math departments in the world, the math department at the Massachusetts Institute of Technology is ranked #11… they didn’t even make the top ten. Even more surprising is the entry at #7 that I have boldfaced: the math department at King Abdulaziz University (KAU) in Jeddah, Saudi Arabia. I’ve been in the math department at Berkeley for 15 years, and during this entire time I’ve never (to my knowledge) met a person from their math department and I don’t recall seeing a job application from any of their graduates… I honestly had never heard of the university in any scientific context. I’ve heard plenty about KAUST (the King Abdullah University of Science and Technology ) during the past few years, especially because it is the first mixed-gender university campus in Saudi Arabia, is developing a robust research program based on serious faculty hires from overseas, and in a high profile move hired former Caltech president Jean-Lou Chameau to run the school. But KAU is not KAUST.

A quick google searched reveals that although KAU is nearby in Jeddah, it is a very different type of institution. It has two separate campuses for men and women. Although it was established in 1967 (Osama Bin Laden was a student there in 1975) its math department started a Ph.D. program only two years ago. According to the math department website, the chair of the department, Prof. Abdullah Mathker Alotaibi, is a 2005 Ph.D. with zero publications [Update: Nov. 10: This initial claim was based on a Google Scholar Search of his full name; a reader commented below that he has published and that this claim was incorrect. Nevertheless, I do not believe it in any way materially affect the points made in this post.] This department beat MIT math in the USNWR global rankings! Seriously?

The USNWR rankings are based on 8 attributes:

– global research reputation
– regional research reputation
– publications
– normalized citation impact
– total citations
– number of highly cited papers
– percentage of highly cited papers
– international collaboration

Although KAU’s full time faculty are not very highly cited, it has amassed a large adjunct faculty that helped them greatly in these categories. In fact, in “normalized citation impact” KAU’s math department is the top ranked in the world. This amazing statistic is due to the fact that KAU employs (as adjunct faculty) more than a quarter of the highly cited mathematicians at Thomson Reuters. How did a single university assemble a group with such a large proportion of the world’s prolific (according to Thomson Reuters) mathematicians? (When I first heard this statistic from Iddo Friedberg via Twitter I didn’t believe it and had to go compute it myself from the data on the website. I guess I believe it now but I still can’t believe it!!)

In 2011 Yudhijit Bhattacharjee published an article in Science titled “Saudi Universities Offer Cash in Exchange for Academic Prestige” that describes how KAU is targeting highly cited professors for adjunct faculty positions. According to the article, professors are hired as adjunct professors at KAU for $72,000 per year in return for agreeing (apparently by contract) to add KAU as a secondary affiliation at and for adding KAU as an affiliation on their published papers. Annual visits to KAU are apparently also part of the “deal” although it is unclear from the Science article whether these actually happen regularly or not.

[UPDATE Oct 31, 12:14pm: A friend who was solicited by KAU sent me the invitation email with the contract that KAU sends to potential “Distinguished Adjunct Professors”. The details are exactly as described in the Bhattacharjee article:

From: "Dr. Mansour Almazroui" <>
Date: XXXX
Subject: Re: Invitation to Join “International Affiliation Program” at 
         King Abdulaziz University, Jeddah Saudi Arabia

Dear Prof. XXXX ,

Hope this email finds you in good health. Thank you for your interest. 
Please find below the information you requested to be a 
“Distinguished Adjunct Professor” at KAU.

1. Joining our program will put you on an annual contract initially 
   for one year but further renewable. However, either party can 
   terminate its association with one month prior notice.
2. The Salary per month is $ 6000 for the period of contract.
3. You will be required to work at KAU premises for three weeks in 
   each contract year. For this you will be accorded with expected 
   three visits to KAU.
4. Each visit will be at least for one week long but extendable as 
   suited for research needs.
5. Air tickets entitlement will be in Business-class and stay in Jeddah
   will be in a five star hotel. The KAU will cover all travel and living
   expenses of your visits.
6. You have to collaborate with KAU local researchers to work on KAU 
   funded (up to $100,000.00) projects.
7. It is highly recommended to work with KAU researchers to submit an 
   external funded project by different agencies in Saudi Arabia.
8. May submit an international patent.
9. It is expected to publish some papers in ISI journals with KAU 
10. You will be required to amend your ISI highly cited affiliation 
    details at the ISI 
    web site to include your employment and affiliation with KAU.

Kindly let me know your acceptance so that the official contract may
be preceded.




The publication of the Science article elicited a strong rebuttal from KAU on the comments section, where it was vociferously argued that the hiring of distinguished foreign scholars was aimed at creating legitimate research collaborations, and was not merely a gimmick for increasing citation counts. Moreover, some of the faculty who had signed on defended the decision in the article. For example, Neil Robertson, a distinguished graph theorist (of Robertson-Seymour graph minors fame) explained that “it’s just capitalism,” and “they have the capital and they want to build something out of it.” He added that “visibility is very important to them, but they also want to start a Ph.D. program in mathematics,” (they did do that in 2012) and he added that he felt that “this might be a breath of fresh air in a closed society.” It is interesting to note that despite his initial enthusiasm and optimism, Professor Robertson is no longer associated with KAU.

In light of the high math ranking of KAU in the current USNWR I decided to take a closer look at who KAU has been hiring, why, and for what purpose, i.e. I decided to conduct post-publication peer review of the Bhattacharjee Science paper. A web page at KAU lists current “Distinguished Scientists” and another page lists “Former Distinguished Adjunct Professors“. One immediate observation is that out of 118 names on these pages there is 1 woman (Cheryl Praeger from the University of Western Australia). Given that KAU has two separate campuses for men and women, it is perhaps not surprising that women are not rushing to sign on, and perhaps KAU is also not rushing to invite them (I don’t have any information one way or another, but the underrepresentation seems significant). Aside from these faculty, there is also a program aptly named the “Highly Cited Researcher Program” that is part of the Center for Excellence in Genomic Medicine Research. Fourteen faculty are listed there (all men, zero women). But guided by the Science article which described the contract requirement that researchers add KAU to their ISI affiliation, I checked for adjunct KAU faculty at Thomson-Reuters and there I found what appears to be the definitive list.

Although Neil Robertson has left KAU, he has been replaced by another distinguished graph theorist, namely Carsten Thomassen (no accident as his wikipedia page reveals that “He was included on the ISI Web of Knowledge list of the 250 most cited mathematicians.”) This is a name I immediately recognized due to my background in combinatorics; in fact I read a number of Thomassen’s papers as a graduate student. I decided to check whether it is true that adjunct faculty are adding KAU as an affiliation on their articles. Indeed, Thomassen has done exactly that in his latest publication Strongly 2-connected orientations of graphs published this year in the Journal of Combinatorial Theory Series B. At this point I started having serious reservations about the ethics of faculty who have agreed to be adjuncts at KAU. Regardless of the motivation of KAU in hiring adjunct highly cited foreign faculty, it seems highly inappropriate for a faculty member to list an affiliation on a paper to an institution to which they have no scientific connection whatsoever. I find it very hard to believe that serious graph theory is being researched at KAU, an institution that didn’t even have a Ph.D. program until 2012. It is inconceivable that Thomassen joined KAU in order to find collaborators there (he mostly publishes alone), or that he suddenly found a great urge to teach graph theory in Saudi Arabia (KAU had no Ph.D. program until 2012). The problem is also apparent when looking at the papers of researchers in genomics/computational biology that are adjuncts at KAU. I recognized a number of such faculty members, including high-profile names from my field such as Jun Wang, Manolis Dermitzakis and John Huelsenbeck. I was surprised to see their names (none of these faculty mention KAU on their websites) yet in each case I found multiple papers they have authored during the past year in which they list the KAU affiliation. I can only wonder whether their home institutions find this appropriate. Then again, maybe KAU is also paying the actual universities the faculty they are citation borrowing belong to? But assume for a moment that they aren’t, then why should institutions share the credit they deserve for supporting their faculty members by providing them space, infrastructure, staff and students with KAU? What exactly did KAU contribute to Kilpinen et al.  Coordinated effects of sequence variation on DNA binding, chromatin structure and transcription, Science, 2013? Or to Landis et al. Bayesian analysis of biogeography when the number of areas is large, Systematic Biology, 2013? These papers have no authors or apparent contribution from KAU. Just the joint affiliation of the adjunct faculty member. The limit of the question arises in the case of Jun Wang, director of the Beijing Genome Institute, whose affiliations are BGI (60%), University of Copenhagen (15%), King Abdulaziz University (15%), The University of Hong Kong (5%), Macau University of Science and Technology (5%). Should he also acknowledge the airlines he flies on? Should there not be some limit on the number of affiliations of an individual? Shouldn’t journals have a policy about when it is legitimate to list a university as an affiliation for an author? (e.g. the author must have in some significant way been working at the institution).

Another, bigger, disgrace that emerged in my examination of the KAU adjunct faculty is the issue of women. Aside from the complete lack of women in the “Highly Cited Researcher Program”, I found that most of the genomics adjunct faculty hired via the program will be attending an all-male conference in three weeks. The “Third International Conference on Genomic Medicine” will be held from November 17–20th at KAU. This conference has zero women. The same meeting last year… had zero women. I cannot understand how in 2014, at a time when many are speaking out strongly about the urgency of supporting females in STEM and in particular about balancing meetings, a bunch of men are willing to forgo all considerations of gender equality for the price of ~$3 per citation per year (a rough calculation using the figure of $72,000 per year from the Bhattacharjee paper and 24,000 citations for a highly cited researcher). To be clear I have no personal knowledge about whether the people I’ve mentioned in this article are actually being paid or how much, but even if they are being paid zero it is not ok to participate in such meetings. Maybe once (you didn’t know what you are getting into), but twice?!

As for KAU, it seems clear based on the name of the “Highly Cited Researcher Program” and the fact that they advertise their rankings that they are specifically targeting highly cited researchers much more for their delivery of their citations than for development of genuine collaborations (looking at the adjunct faculty I failed to see any theme or concentration of people in any single area as would be expected in building a coherent research program). However I do not fault KAU for the goal of increasing the ranking of their institution. I can see an argument for deliberately increasing rankings in order to attract better students, which in turn can attract faculty. I do think that three years after the publication of the Science article, it is worth taking a closer look at the effects of the program (rankings have increased considerably but it is not clear that research output from individuals based at KAU has increased), and whether this is indeed the most effective way to use money to improve the quality of research institutions. The existence of KAUST lends credence to the idea that the king of Saudi Arabia is genuinely interested in developing Science in the country, and there is a legitimate research question as to how to do so with the existing resources and infrastructure. Regardless of how things ought to be done, the current KAU emphasis on rankings is a reflection of the rankings, which USNWR has jumped into with its latest worldwide ranking. The story of KAU is just evidence of a bad problem getting worse. I have previously thought about the bad version of the problem:

A few years ago I wrote a short paper with my (now former) student Peter Huggins on university rankings:

P. Huggins and L.P., Selecting universities: personal preferences and rankings, arXiv, 2008.

It exists only as an arXiv preprint as we never found a suitable venue for publication (this is code for the paper was rejected upon peer review; no one seemed interested in finding out the extent to which the data behind rankings can produce a multitude of stories). The article addresses a simple question: given that various attributes have been measured for a bunch of universities, and assuming they are combined (linearly) into a score used to produce rankings, how do the rankings depend on the weightings of the individual attributes? The mathematics is that of polyhedral geometry, where the problem is to compute a normal fan of a polytope whose vertices encode all the possible rankings that can be obtained for all possible weightings of the attributes (an object we called the unitope). An example is shown below, indicating the possible rankings as determined by weightings chosen among three attributes measured by USNWR (freshman retention, selectivity, peer assessment). It is important to keep in mind this is data from 2007-2008.



Our paper had an obvious but important message: rankings can be very sensitive to the attribute weightings. Of course some schools such as Harvard came out on top regardless of attribute preferences, but some schools, even top ranked schools, could shift by over 50 positions. Our conclusion was that although the data collected by USNWR was useful, the specific weighting chosen and the ranking it produced were not. Worse than that, sticking to a single choice of weightings was misleading at best, dangerous at worse.

I was reminded of this paper when looking at the math department rankings just published by USNWR. When I saw that KAU was #7 I was immediately suspicious, and even Berkeley’s #1 position bothered me (even though I am a faculty member in the department). I immediately guessed that they must have weighted citations heavily, because our math department has applied math faculty, and KAU has their “highly cited researcher program”. Averaging citations across faculty from different (math) disciplines is inherently unfair. In the case of Berkeley, my applied math colleague James Sethian has a paper on level set methods with more than 10,000 (Google Scholar) citations. This reflects the importance and advance of the paper, but also the huge field of users of the method (many, if not most, of the disciplines in engineering). On the other hand, my topology colleague Ian Agol’s most cited paper has just over 200 citations. This is very respectable for a mathematics paper, but even so it doesn’t come close to reflecting his true stature in the field, namely the person who settled the Virtually Haken Conjecture thereby completing a long standing program of William Thurston that resulted in many of the central open problems in mathematics (Thurston was also incidentally an adjunct faculty member at KAU for some time). In other words, not only are citations not everything, they can also be not anything. By comparing citations across math departments that are diverse to very differing degrees USNWR rendered the math ranking meaningless. Some of the other data collected, e.g. reputation, may be useful or relevant to some, and for completeness I’m including it with this post (here) in a form that allows for it to be examined properly (USNWR does not release it in the form of a table, but rather piecemeal within individual html pages on their site), but collating the data for each university into one number is problematic. In my paper with Peter Huggins we show both how to evaluate the sensitivity of rankings to weightings and also how to infer bounds on the weightings by USNWR from the rankings. It would be great if USNWR included the ability to perform such computations with their data directly on their website but there is a reason USNWR focuses on citations.

The impact factor of a journal is a measure of the average amount of citation per article. It is computed by averaging the citations over all articles published during the preceding two years, and its advertisement by journals reflects a publishing business model where demand for the journal comes from the impact factor, profit from free peer reviewing, and sales from closed subscription based access.  Everyone knows the peer review system is broken, but it’s difficult to break free of when incentives are aligned to maintain it. Moreover, it leads to perverse focus of academic departments on the journals their faculty are publishing in and the citations they accumulate. Rankings such as those by USNWR reflect the emphasis on citations that originates with the journals, as so one cannot fault USNWR for including it as a factor and weighting it highly in their rankings. Having said that, USNWR should have known better than to publish the KAU math rankings; in fact it appears their publication might be a bug. The math department rankings are the only rankings that appear for KAU. They have been ommitted entirely from the global overall ranking and other departmental rankings (I wonder if this is because USNWR knows about the adjunct faculty purchase). In any case, the citation frenzy feeds departments that in aggregate form universities. Universities such as King Abdulaziz, that may reach the point where they feel compelled to enter into the market of citations to increase their overall profile…

I hope this post frightened you. It should. Happy Halloween!

[Update: Dec. 6: an article about KAU and citations has appeared in the Daily Cal, Jonathan Eisen posted his exchanges with KAU, and he has storified the tweets]

This year half of the Nobel prize in Physiology or Medicine was awarded to May-Britt Moser and Edvard Moser, who happen to be both a personal and professional couple. Interestingly, they are not the first but rather the fourth couple to win the prize jointly: In 1903 Marie Curie and Pierre Curie shared the Nobel prize in physics, in 1935 Frederic Joiliot and Irene Joliot-Curie shared the Nobel prize in chemistry and in 1947 Carl Cori and Gerty Cori also shared the Nobel prize in physiology or medicine. It seems working on science with a spouse or partner can be a formula for success. Why then, when partners apply together for academic jobs, do universities refer to them as “two body problems“?

The “two-body problem” is a question in physics about the motions of pairs of celestial bodies that interact with each other gravitationally. It is a special case of the difficult “N-body problem” but simple enough that is (completely) solved; in fact it was solved by Johann Bernoulli a few centuries ago. The use of the term in the context of academic job searches has always bothered me- it suggests that hiring in academia is an exercise in mathematical physics (it is certainly not!) and even if one presumes that it is, the term is an oxymoron because in physics the problem is solved whereas in academia it is used in a way that implies it is unsolvable. There are countless times I have heard my colleagues sigh “so and so would be great but there is a two body problem”. Semantics aside, the allusion to high brow physics problems in the process of academic hiring belies a complete lack of understanding of the basic mathematical notion of epistasis relevant in the consideration of joint applications, not to mention an undercurrent of sexism that plagues science and engineering departments everywhere.  The results are poor hiring decisions, great harm to the academic prospects of partners and couples, and imposition of stress and anxiety that harms the careers of those who are lucky enough to be hired by the flawed system.

I believe it was Aristotle who first noted used the phrase “the whole is greater than the sum of its parts”. The old adage remains true today: owning a pair of matching socks is more than twice as good as having just one sock. This is called positive epistasis, or synergy. Of course the opposite may be true as well: a pair of individuals trying to squeeze through a narrow doorway together will take more than twice as long than if they would just go through one at a time. This would be negative epistasis. There is a beautiful algebra and geometry associated to positive/negative epistasis this is useful to understand, because its generalizations reveal a complexity to epistasis that is very much at play in academia.

Formally, thinking of two “parts”, we can represent them as two bit strings: 01 for one part and 10 for the other. The string 00 represents the situation of having neither part, and 11 having both parts. A “fitness function” f:[0,1]^2 \rightarrow \mathbb{R}_+ assigns to each string a value. Epistasis is defined to be the sign of the linear form


That is, u>0 is positive epistasis, u<0 is negative epistasis and u=0 is no epistasis. In the case where f(00)=0, “the whole is greater than the sum of its parts” means that f(11)>f(10)+f(01) and “the whole is less than the sum of its parts” means f(11)<f(10)+f(01). There is an accompanying geometry that consists of drawing a square in the x-y plane whose corners are labeled by 00,01,10 and 11. At each corner,  the function f can be represented by a point on the z axis, as shown in the example below:


The black line dividing the square into two triangles comes about by imagining that there are poles at the corners of the square, of height equal to the fitness value, and then that a tablecloth is draped over the poles and stretched taught. The picture above then correspond to the leftmost panel below:


The crease is the resulting of projecting down onto the square the “fold” in the tablecloth (assuming there is a fold). In other words, positive and negative epistasis can be thought of as corresponding to one of the two triangulations of the square. This is the geometry of two parts but what about n parts? We can similarly represent them by bit strings 100 \cdots 0, 010 \cdots 0, 001 \cdots 0, \ldots, 000 \cdots 1 with the “whole” corresponding to 111 \cdots 1. Assuming that the parts can only be added up all together, the geometry now works out to be that of triangulations of the hyperbipyramid; the case n=3 is shown below:



“The whole is greater than the sum of its parts”: the superior-inferior slice.


“The whole is less than the sum of its parts”: the transverse slice.

With multiple parts epistasis can become more complicated if one allows for arbitrary combining of parts. In a paper written jointly with Niko Beerenwinkel and Bernd Sturmfels titled “Epistasis and shapes of fitness landscapes“, we developed the mathematics for the general case and showed that epistasis among objects allowed to combine in all possible ways corresponds to the different triangulations of a hypercube. For example, in the case of three objects, the square is replaced by the cube with eight corners corresponding to the eight bit strings of length 3. There are 74 triangulations of the cube, falling into 6 symmetry classes. The complete classification is shown below (for details on the meaning of the GKZ vectors and out-edges see the paper):



There is a beautiful geometry describing how the different epistatic shapes (or triangulations) are related, which is known as the secondary polytope. Its vertices correspond to the triangulations and two are connected by an edge when they are the same except for the “flip” of one pair of neighboring tetrahedra. The case of the cube is shown below:


The point of the geometry, and its connection to academic epistasis that I want to highlight in this post, is made clear when considering the case of n=4. In that case the number of different types of epistatic interactions is given by the number of triangulations of the 4-cube. There are 87,959,448 triangulations and 235,277 symmetry types! In other words, the intuition from two parts that “interaction” can be positive, negative or neutral is difficult to generalize without math, and the point is there are a myriad of ways a faculty in a large department can be interacting both to the benefit and the detriment of their overall scientific output.

In many searches I’ve been involved in the stated principle for hiring is “let’s hire the best person”. Sometimes the search may be restricted to a field, but it is not uncommon that the search is open. Such a hiring policy deliberately ignores epistasis, and I think it’s crazy, not to mention sexist, because the policy affects and hurts women applicants far more than it does men. Not because women are less likely to be “the best” in their field, in fact quite the opposite. It is very common for women in academia to be partnered with men who are also in academia, and inevitably they suffer for that fact because departments have a hard time reconciling that both could be “the best”. There are also many reasons for departments to think epistaticially that go beyond basic fairness principles. For example, in the case of partners that are applying together to a university, even if they are not working together on research, it is likely that each one will be far more productive if the other has a stable job at the same institution. It is difficult to manage a family if one partner needs to commute hours, or in some cases days, to work. I know of a number of couples in academia that have jobs in different states.

In the last few years there are a few couples that have been bold enough to openly declare themselves “positively epistatic”. What I mean is that they apply jointly as a single applicant, or “joint lab” in the case of biology. For example, there is the case of the Altschuler-Wu lab that has recently relocated to UCSF or the Eddy-Rivas lab that is relocating to Harvard. Still, such cases are far and few between, and for the most part hiring is inefficient, clumsy and unfair (it is also worth noting that there are many other epistatic factors that can and should be considered, for example the field someone is working in, collaborators, etc.)

Epistasis has been carefully studied for a long time in population and statistical genetics, where it is fundamental in understanding the effects of genotype on phenotype. The geometry described above can be derived for diploid genomes and this was done by Ingileif Hallgrímsdóttir and Debbie Yuster in the paper “A complete classification of epistatic two-locus models” from 2008. In the paper they examine a previous classification of epistasis among 30 pairs of loci in a QTL analysis of growth traits in chicken (Carlborg et al., Genome Research 2003). The (re)-classification is shown in the figure below:


If we can classify epistasis for chickens in order to understand them, we can certainly assess the epistasis outlook for our potential colleagues, and we should hire accordingly.

It’s time that the two body problem be recognized as the two body opportunity.

This is part (2/2) about my travel this past summer to Iceland and Israel:

In my previous blog post I discussed the genetics of Icelanders, and the fact that most Icelanders can trace their roots back dozens of generations, all the way to Vikings from ca. 900AD. The country is homogenous in many other ways as well (religion, income, etc.), and therefore presents a stark contrast to the other country I visited this summer: Israel. Even though I’ve been to Israel many times since I was a child, now that I am an adult the manifold ethnic, social and religious makeup of the society is much more evident to me. This was particularly true during my visit this past summer, during which political and military turmoil in the country served to accentuate differences. There are Armenians, Ashkenazi Jews, Bahai, Bedouin, Beta Israel, Christian Arabs, Circassians, Copts, Druze, Maronites, Muslim Arab, Sephardic Jews etc. etc. etc. , and additional “diversity” caused by political splits leading to West Bank Palestinians, Gaza Palestinians, Israelis inside vs. outside the Green Line, etc. etc. etc. (and of course many individuals fall into multiple categories). It’s fair to say that “it’s complicated”. Moreover, the complex fabric that makes up Israeli society is part of a larger web of intertwined threads in the Middle East. The “Arab countries” that neighbor Israel are also internally heterogeneous and complex, both in obvious ways (e.g. the Sunni vs. Shia division), but also in many more subtle ways (e.g. language).

The 2014 Israeli-Gaza conflict started on July 8th. Having been in Israel for 4 weeks I was interacting closely with many friends and colleagues who were deeply impacted by the events (e.g. their children were suddenly called up to a partake in a war), and among them I noticed almost immediately an extreme polarization that reflected a public relations battle being waged between Hamas and Israel that played out more intensely than in any previous conflict on news channels and social media. The polarization extended to friends and acquaintances outside of Israel. Everyone had a very strong opinion. One thing I noticed were graphic memes being passed around in which the conflict was projected onto a two-colored map. For example, the map below was passed around on Facebook showing the (“real democratic”) Israel surrounded by a sea of Arab green in the Middle East:

DemocracyI started noticing other bifurcating maps as other Middle East issues came to the fore later in the summer. Here is a map from a website depicting the Sunni-Shia divide:


In many cases the images being passed around were explicitly encouraging a “one-dimensional” view of the conflict(s), whereas in other cases the “us” vs. “them” factor was more subliminal. The feeling that I was being programmed how to think made me uncomfortable.

Moreover, the Middle East memes that were flooding my inbox were distracting me. I had visited Israel to nurture and establish connections and collaborations with the large number of computational biologists in the country. During my trip I was kindly hosted by Yael Mandel-Gutfreund at the Technion, and also had the honor of being an invited speaker at the annual Israeli Bioinformatics Society meeting. The visit was not supposed to be a bootcamp in salon politics. In any case, I found myself thinking about the situation in the Middle East with a computational biology mindset, and I was struck by the following “Middle East Friendship Chart” published in July that showed data about the relationships of the various entities/countries/organizations:


As a (computational) biologist I was keen to understand the data in a visual way that would reveal the connections more clearly, and as a computational (biologist) faced with ordinal data I thought immediately of non-metric multi-dimensional scaling as a way to depict the information in the matrix. I have discussed classic multi-dimensional scaling (or MDS) in a previous blog post, where I explained its connection to principal component analysis. In the case of ordinal data, non-metric MDS seeks to find points in a low-dimensional Euclidean space so that the ranks of distances correspond to the input ordinal matrix. It has been used in computational biology, for example in the analysis of gene expression matrices. The idea originates with a classic paper by Kruskal,that remains a good reference for understanding non-metric MDS. The key idea is summarized in his Figure 4:


Formally, in Kruskal’s notation, given a dissimilarity map \delta (symmetric matrix with zeroes on the diagonal and nonnegative entries), the goal is to find points x in R^k so that their pairwise distance match in rank. In Kruskal’s Figure 4, points on the plot correspond to pairs of points in R^k and \delta is shown on the y-axis, while the Euclidean distance between the points, represented by d, is shown on the x-axis. Monotonically increasing values \hat{d} are then chosen so that S=\sum_{ij} \left( d_{ij}-\hat{d}_{ij} \right)^2 is minimized. The function S is called the “stress” function and is further normalized so that the “stress” is invariant up to scaling of the points. An iterative procedure can then be used to optimize the points, although results depend on which starting configuration is chosen, and for this reason multiple starting positions are considered.

I converted the smiley/frowny faces into numbers 0,1 or 2 (for red, yellow and green faces respectively) and was able to easily experiment with non-metric MDS using an implementation in R. The results for a 2D scaling of the friendship matrix are shown in the figure below:



It is evident that, as expected from the friendship matrix, ISIS is an outlier. One also sees some of “the enemy of thine enemy is thy friend”. What is interesting is that in some cases the placements are clearly affected by shared allegiances and mutual dislikes that are complicated in nature. For example, the reason Saudi Arabia is placed between Israel and the United States is the friendship of the U.S. towards Iraq in contrast to Israel’s relationship to the country. One interesting question, that is not addressed by the non-metric MDS approach, is what the direct influences are. For example, it stands to reason that Israel is neutral to Saudi Arabia partly because of the U.S. friendship with the country- can this be inferred from the data in the same way that causative links are inferred for gene networks? In any case, I thought the scaling was illuminating and it seems like an interesting exercise to extend the analysis to more countries/organizations/entities but it may be necessary to deal with missing data and I don’t have the time to do it.

I did decide to look at the 1D non-metric MDS, to see whether there is a meaningful one-dimensional representation of the matrix, consistent with some of the maps I’d seen. As it turns out, this is not what the data suggests. The one-dimensional scaling described below places ISIS in the middle, i.e. as the “neutral” country!

Israel                -4.55606607
Saudi Arabia          -3.62249810
Turkey                -3.04579321
United States         -2.6429534
Egypt                 -1.12919328
Al-Qaida              -0.38125270
Hamas                  0.01629508
ISIS                   0.40101149
Palestinian Authority  1.55546030
Iraq                   2.23849150
Hezbollah              2.66933449
Iran                   3.29650784
Syria                  5.20065616

This failure of non-metric MDS is simply a reflection of the fact that the friendship matrix is not “one-dimensional”. The Middle East is not one-dimensional. The complex interplay of Sunni vs. Shia, terrorist vs. freedom fighter, muslim vs. infidel, and all the rest of what is going on make it incorrect to think of the conflict in terms of a single attribute. The complex pattern of alliances and conflicts is therefore not well explained by two-colored maps, and the computations described above provide some kind of a “proof” of this fact. The friendship matrix also explains why it’s difficult to have meaningful discussions about the Middle East in 140 characters, or in Facebook tirades, or with soundbites on cable news. But as complicated as the Middle East is, I have no doubt that the “friendship matrix” of my colleagues in computational biology would require even higher dimension…


This past summer I spent a few weeks in Israel and then in Iceland (with brief visits to the Oxford workshop on Biological Sequence Analysis and Probabilistic Models, and to IST Austria). This is the first of two posts about my travels.

I have been a regular visitor to Iceland during the past 12 years, and every visit is associated with unforgettable unique and extraordinary experiences. I have climbed volcanos  and I’ve descended into their depths. I have enjoyed geothermal heating- both in swimming pools and at the beach. And I have seen incredible Aurora Borealis. A couple of years ago I even got married there.

Iceland is indeed a beautiful place. But the most amazing thing about the country… is a truly remarkable and extraordinary… website. It is called Islendingabók, and is not to be confused with the book Islendingabók (Book of Icelanders) from which it borrowed its name. Islendingabók (the website) is a collaboration between the company deCODE Genetics and a computer scientist Friðrik Skúlason, who together set up a searchable genealogical database of Icelanders. The genealogy can only be browsed by registered users, and registration is currently limited to citizens and residents with an Icelandic kennitala (social security number). Many people have heard that Iceland has kept “good” records, but I don’t think the scope and power of the database is generally understood. Even geneticists I have talked to typically don’t have a clue about how detailed and thorough the database is. At the risk of hyperbole, I am blown away every time I look at it. There is nothing like it in the world.

As explained above, I am married to an Icelander (Ingileif Bryndis Hallgrímsdóttir), and she has kindly given me permission to peek at the data. Before getting to her family tree, just a word about the naming system in Iceland, because understanding it is helpful in parsing the genealogy. Surnames are patronymic, and contain the first name of the father with the appendage “son” for sons, and “dóttir” for daughters. Therefore husbands and wives don’t share surnames, but their first names point to their fathers. Below is Ingileif’s (Inga’s) complete family tree going back five generations:


Another naming convention is apparent in the repetition of names (modulo 2 generations) and its coupling to the patronymic naming system. Notice the switch from Ásgeir Jónsson -> Jón Ásgeirsson  -> Ásgeir Jónsson and so on. Traditions run deep. For example, my daughter Steinunn Liorsdóttir is named after her grandmother, Steinunn Jónsdóttir, who is named after her grandmother, Steinunn Guðmundsdóttir, who is named after her grandmother, Steinunn Hannesdóttir, who is named after her grandmother, Steinunn Eyjólfsdóttir (born 1788).

As impressive as this is, the tree is much deeper. Below is the tree for her great-great-great grandfather Ásgeir (on her mother’s side), who was born in 1821:


This tree also goes back five generations (!), and is complete with the exception of three ancestors, who are five generations back (10th generation from my wife). At this point, ten generations back from my wife, we are looking at her relatives born in the early part of the 17th century. There is a lot of annotated information about every individual, not only when and where they were born and died, but also where they lived, and frequently also their professions. Of course the genealogy starts thinning out as one goes back in time and records become more scarce. How far back does it go? For some lines ancestors can be traced back to the 10th century and beyond, with a few lineages reaching kings of Norway ca. 800 AD. Below is the direct line of descendants from Ingólfur Arnarson, first settler of Iceland, to my wife. He is literally one of the great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-grandfathers of my daughters:


Every time I look at Icelandic ancestry chains like this I find my mind forced to stretch to accommodate a broader perspective of history. I begin to think in terms of dozens or hundreds of generations, and what humans have been doing on those timescales. In comparison to Icelandic genetics, other population genetic studies, such as the interesting recent study on Ashkenazi Jews by Itsik Pe’er’s group, are the blur of Gerhard Richter compared to the hyperrealism of Denis Peterson.

The genealogy of Icelanders was partly the rationale for the founding of deCODE Genetics, with the idea that it would be a powerful tool for linkage analysis when coupled with the DNA of Icelanders (in retrospect the genotyping, and now sequencing of a large part of the population means that large swaths of the genealogy can be inferred based on reconstructed haplotype blocks). But another rationale for the formation of deCODE, and one that has turned out to be extremely useful, is the general availability and sharing of records. Of course Iceland has a centralized health care system, and deCODE has been successful in working together with the Ministry of Welfare to perform many GWAS studies for a  variety of diseases (it is worth noting that deCODE has by far the best publication record in GWAS in the world), but what is less well known outside of Iceland is the extent to which individuals are prepared to trade-off privacy for the sake of national equality, and the implications that has for genetics studies. To give one example, this summer during my visit the yearly estimates of salary for representative individuals from all professions were published. These are based on tax returns, which in Iceland are publicly available upon request. Here are the salaries of top executives (salaries are reported in thousands of Icelandic Krona per month; at this time 1 USD = 120 ISK):


Also, the income of a number of early deCODE employees were high this year as a result of the sale of the company to Amgen:


Below are some of the salaries for musicians and artists. Sadly, some things are the same in all countries:


Typically the salaries of about 1% of the population are estimated and published (this year approximately 3,000 people).

Along with publicly available tax records, many other databases are public, for example for many years school records of all students (i.e. grades) were published annually. One can start to imagine all sorts of creative GWAS…Again, as with the genealogy, I can think of no other country in the world with anything like this.

Iceland’s genealogy is embedded deeply into the public psyche, to the extent that I think it’s fair to say that the national identity is constructed around it. After all, it’s hard to argue with someone that they are Icelandic when their ancestry traces back to the first settler. At the same time, like many nations in Europe and around the world, the country is becoming increasingly cosmopolitan, and the genealogy tree is beginning to look a lot more like a trimmed rosebush. Like many other nations, Icelanders are having to confront the questions “Who is an Icelander? What is an Icelander?” Ultimately, the answer cannot and will not come from genetics.

Nature Publishing Group claims on its website that it is committed to publishing “original research” that is “of the highest quality and impact”. But when exactly is research “original”?  This is a question with a complicated answer. A recent blog post by senior editor Dorothy Clyde at Nature Protocols provides insight into the difficulties Nature faces in detecting plagiarism, and identifies the issue of self plagiarism as particularly problematic. The journal tries to avoid publishing the work of authors who have previously published the same work or a minor variant thereof. I imagine this is partly in the interests of fairness, a service to the scientific community to ensure that researchers don’t have to sift through numerous variants of a single research project in the literature, and a personal interest of the journal in its aim to publish only the highest level of scholarship.

On the other hand, there is also a rationale for individual researchers to revisit their own previously published work. Sometimes results can be recast in a way that makes them accessible to different communities, and rethinking of ideas frequently leads to a better understanding, and therefore a better exposition. The mathematician Gian-Carlo Rota made the case for enlightened self-plagiarism in one of his ten lessons he wished he had been taught when he was younger:

3. Publish the same result several times

After getting my degree, I worked for a few years in functional analysis. I bought a copy of Frederick Riesz’ Collected Papers as soon as the big thick heavy oversize volume was published. However, as I began to leaf through, I could not help but notice that the pages were extra thick, almost like cardboard. Strangely, each of Riesz’ publications had been reset in exceptionally large type. I was fond of Riesz’ papers, which were invariably beautifully written and gave the reader a feeling of definitiveness.

As I looked through his Collected Papers however, another picture emerged. The editors had gone out of their way to publish every little scrap Riesz had ever published. It was clear that Riesz’ publications were few. What is more surprising is that the papers had been published several times. Riesz would publish the first rough version of an idea in some obscure Hungarian journal. A few years later, he would send a series of notes to the French Academy’s Comptes Rendus in which the same material was further elaborated. A few more years would pass, and he would publish the definitive paper, either in French or in English. Adam Koranyi, who took courses with Frederick Riesz, told me that Riesz would lecture on the same subject year after year, while meditating on the definitive version to be written. No wonder the final version was perfect.

Riesz’ example is worth following. The mathematical community is split into small groups, each one with its own customs, notation and terminology. It may soon be indispensable to present the same result in several versions, each one accessible to a specific group; the price one might have to pay otherwise is to have our work rediscovered by someone who uses a different language and notation, and who will rightly claim it as his own.

The question is: where does one draw the line?

I was recently forced to confront this question when reading an interesting paper about a statistical approach to utilizing controls in large-scale genomics experiments:

J.A. Gagnon-Bartsch and T.P. Speed, Using control genes to corrected for unwanted variation in microarray dataBiostatistics, 2012.

A cornerstone in the logic and methodology of biology is the notion of a “control”. For example, when testing the result of a drug on patients, a subset of individuals will be given a placebo. This is done to literally control for effects that might be measured in patients taking the drug, but that are not inherent to the drug itself. By examining patients on the placebo, it is possible to essentially cancel out uninteresting effects that are not specific to the drug. In modern genomics experiments that involve thousands, or even hundreds of thousands of measurements, there is a biological question of how to design suitable controls, and a statistical question of how to exploit large numbers of controls to “normalize” (i.e. remove unwanted variation) from the high-dimensional measurements.

Formally, one framework for thinking about this is a linear model for gene expression. Using the notation of Gagnon-Bartsch & Speed, we have an expression matrix Y of size m \times n (samples and genes) modeled as

Y_{m \times n} = X_{m \times p}\beta_{p \times n} + Z_{m \times q}\gamma_{q \times n} + W_{m \times k} \alpha_{k \times n} + \epsilon_{m \times n}.

Here is a matrix describing various conditions (also called factors) and associated to it is the parameter matrix \beta that records the contribution, or influence, of each factor on each gene. \beta is the primary parameter of interest to be estimated from the data Y. The \epsilon are random noise, and finally  and are observed and unobserved covariates respectively. For example Z might encode factors for covariates such as gender, whereas W would encode factors that are hidden, or unobserved. A crucial point is that the number of hidden factors in W, namely k, is not known. The matrices \gamma and \alpha record the contributions of the Z and W factors on gene expression, and must also be estimated. It should be noted that X may be the logarithm of expression levels from a microarray experiment, or the analogous quantity from an RNA-Seq experiment (e.g. log of abundance in FPKM units).

Linear models have been applied to gene expression analysis for a very long time; I can think of papers going back 15 years. But They became central to all analysis about a decade ago, specifically popularized with the Limma package for microarray data analysis. In an important paper in 2007, Leek and Storey focused explicitly on the identification of hidden factors and estimation of their influence, using a method called SVA (Surrogate Variable Analysis). Mathematically, they described a procedure for estimating k and W and the parameters \alpha. I will not delve into the details of SVA in this post, except to say that the overall idea is to first perform linear regression (assuming no hidden factors) to identify the parameters \beta and to then perform singular value decomposition (SVD) on the residuals to identify hidden factors (details omitted here). The resulting identified hidden factors (and associated influence parameters) are then used in a more general model for gene expression in subsequent analysis.

Gagnon-Bartsch and Speed refine this idea by suggesting that it is better to infer W from controls. For example, house-keeping genes that are unlikely to correlate with the conditions being tested, can be used to first estimate W, and then subsequently all the parameters of the model can be estimated by linear regression. They term this two-step process RUV-2 (acronym for Remote Unwanted Variation) where the “2” designates that the procedure is a two-step procedure. As with SVA, the key to inferring W from the controls is to perform singular value decomposition (or more generally factor analysis). This is actually clear from the probabilistic interpretation of PCA and the observation that what it means to be a in the set of “control genes” C  in a setting where there are no observed factors Z, is that

Y_C = W \alpha_C + \epsilon_C.

That is, for such control genes the corresponding \beta parameters are zero. This is a simple but powerful observation, because the explicit designation of control genes in the procedure makes it clear how to estimate W, and therefore the procedure becomes conceptually compelling and practically simple to implement. Thus, even though the model being used is the same as that of Leek & Storey, there is a novel idea in the paper that makes the procedure “cleaner”. Indeed, Gagnon-Bartsch & Speed provide experimental results in their paper showing that RUV-2 outperforms SVA. Even more convincing, is the use of RUV-2 by others. For example, in a paper on “The functional consequences of variation in transcription factor binding” by Cusanovitch et al., PLoS Genetics 2014, RUV-2 is shown to work well, and the authors explain how it helps them to take advantage of the controls in experimental design they created.

There is a tech report and also a preprint that follow up on the Gagnon-Bartsch & Speed paper; the tech report extends RUV-2 to a four step method RUV-4 (it also provides a very clear exposition of the statistics), and separately the preprint describes an extension to RUV-2 for the case where the factor of interest is also unknown. Both of these papers build on the original paper in significant ways and are important work, that to return to the original question in the post, certainly are on the right side of “the line”

The wrong side of the line?

The development of RUV-2 and SVA occurred in the context of microarrays, and it is natural to ask whether the details are really different for RNA-Seq (spoiler: they aren’t).  In a book chapter published earlier this year:

D. Risso, J. Ngai, T.P. Speed, S. Dudoit, The role of spike-in standards in the normalization of RNA-Seq, in Statistical Analysis of Next Generation Sequencing Data (2014), 169-190.

the authors replace “log expression levels” from microarrays with “log counts” from RNA-Seq and the linear regression performed with Limma for RUV-2 with a Poisson regression (this involves one different R command). They call the new method RUV, which is the same as the previously published RUV, a naming convention that makes sense since the paper has no new method. In fact, the mathematical formulas describing the method are identical (and even in almost identical notation!) with the exception that the book chapter ignores altogether, and replaces \epsilon with O. 

To be fair, there is one added highlight in the book chapter, namely the observation that spike-ins can be used in lieu of housekeeping (or other control) genes. The method is unchanged, of course. It is just that the spike-ins are used to estimate W. Although spike-ins were not mentioned in the original Gagnon-Bartsch paper, there is no reason not to use them with arrays as well; they are standard with Affymetrix arrays.

My one critique of the chapter is that it doesn’t make sense to me that counts are used in the procedure. I think it would be better to use abundance estimates, and in fact I believe that Jeff Leek has already investigated the possibility in a preprint that appears to be an update to his original SVA work. That issue aside, the book chapter does provide concrete evidence using a Zebrafish experiment that RUV-2 is relevant and works for RNA-Seq data.

The story should end here (and this blog post would not have been written if it had) but two weeks ago, among five RNA-Seq papers published in Nature Biotechnology (I have yet to read the others), I found the following publication:

D. Risso, J. Ngai, T.P. Speed, S. Dudoit, Normalization of RNA-Seq data using factor analysis of control genes or samples, Nature Biotechnology 32 (2014), 896-902.

This paper has the same authors as the book chapter (with the exception that Sandrine Dudoit is now a co-corresponding author with Davide Risso, who was the sole corresponding author on the first publication), and, it turns out, it is basically the same paper… in fact in many parts it is the identical paper. It looks like the Nature Biotechnology paper is an edited and polished version of the book chapter, with a handful of additional figures (based on the same data) and better graphics. I thought that Nature journals publish original and reproducible research papers. I guess I didn’t realize that for some people “reproducible” means “reproduce your own previous research and republish it”.

At this point, before drawing attention to some comparisons between the papers, I’d like to point out that the book chapter was refereed. This is clear from the fact that it is described as such in both corresponding authors’ CVs (Davide Risso CV and Sandrine Dudoit CV).

How similar are the two papers?

Final paragraph of paper in the book:

Internal and external controls are essential for the analysis of high-throughput data and spike-in sequences have the potential to help researchers better adjust for unwanted technical effects. With the advent of single-cell sequencing [35], the role of spike-in standards should become even more important, both to account for technical variability [6] and to allow the move from relative to absolute RNA expression quantification. It is therefore essential to ensure that spike-in standards behave as expected and to develop a set of controls that are stable enough across replicate libraries and robust to both differences in library composition and library preparation protocols.

Final paragraph of paper in Nature Biotechnology:

Internal and external controls are essential for the analysis of high-throughput data and spike-in sequences have the potential to help researchers better adjust for unwanted technical factors. With the advent of single-cell sequencing27, the role of spike-in standards should become even more important, both to account for technical variability28 and to allow the move from relative to absolute RNA expression quantification. It is therefore essential to ensure that spike- in standards behave as expected and to develop a set of controls that are stable enough across replicate libraries and robust to both differences in library composition and library preparation protocols.

Abstract of paper in the book:

Normalization of RNA-seq data is essential to ensure accurate inference of expression levels, by adjusting for sequencing depth and other more complex nuisance effects, both within and between samples. Recently, the External RNA Control Consortium (ERCC) developed a set of 92 synthetic spike-in standards that are commercially available and relatively easy to add to a typical library preparation. In this chapter, we compare the performance of several state-of-the-art normalization methods, including adaptations that directly use spike-in sequences as controls. We show that although the ERCC spike-ins could in principle be valuable for assessing accuracy in RNA-seq experiments, their read counts are not stable enough to be used for normalization purposes. We propose a novel approach to normalization that can successfully make use of control sequences to remove unwanted effects and lead to accurate estimation of expression fold-changes and tests of differential expression.

Abstract of paper in Nature Biotechnology:

Normalization of RNA-sequencing (RNA-seq) data has proven essential to ensure accurate inference of expression levels. Here, we show that usual normalization approaches mostly account for sequencing depth and fail to correct for library preparation and other more complex unwanted technical effects. We evaluate the performance of the External RNA Control Consortium (ERCC) spike-in controls and investigate the possibility of using them directly for normalization. We show that the spike-ins are not reliable enough to be used in standard global-scaling or regression-based normalization procedures. We propose a normalization strategy, called remove unwanted variation (RUV), that adjusts for nuisance technical effects by performing factor analysis on suitable sets of control genes (e.g., ERCC spike-ins) or samples (e.g., replicate libraries). Our approach leads to more accurate estimates of expression fold-changes and tests of differential expression compared to state-of-the-art normalization methods. In particular, RUV promises to be valuable for large collaborative projects involving multiple laboratories, technicians, and/or sequencing platforms.

Abstract of Gagnon-Bartsch & Speed paper that already took credit for a “new” method called RUV:

Microarray expression studies suffer from the problem of batch effects and other unwanted variation. Many methods have been proposed to adjust microarray data to mitigate the problems of unwanted variation. Several of these methods rely on factor analysis to infer the unwanted variation from the data. A central problem with this approach is the difficulty in discerning the unwanted variation from the biological variation that is of interest to the researcher. We present a new method, intended for use in differential expression studies, that attempts to overcome this problem by restricting the factor analysis to negative control genes. Negative control genes are genes known a priori not to be differentially expressed with respect to the biological factor of interest. Variation in the expression levels of these genes can therefore be assumed to be unwanted variation. We name this method “Remove Unwanted Variation, 2-step” (RUV-2). We discuss various techniques for assessing the performance of an adjustment method and compare the performance of RUV-2 with that of other commonly used adjustment methods such as Combat and Surrogate Variable Analysis (SVA). We present several example studies, each concerning genes differentially expressed with respect to gender in the brain and find that RUV-2 performs as well or better than other methods. Finally, we discuss the possibility of adapting RUV-2 for use in studies not concerned with differential expression and conclude that there may be promise but substantial challenges remain.

Many figures are also the same (except one that appears to have been fixed in the Nature Biotechnology paper– I leave the discovery of the figure as an exercise to the reader). Here is Figure 9.2 in the book:


The two panels appears as (b) and (c) in Figure 4 in the Nature Biotechnology paper (albeit transformed via a 90 degree rotation and reflection from the dihedral group):


Basically the whole of the book chapter and the Nature Biotechnology paper are essentially the same, down to the math notation, which even two papers removed is just a rehashing of the RUV method of Gagnon-Bartsch & Speed. A complete diff of the papers is beyond the scope of this blog post and technically not trivial to perform, but examination by eye reveals one to be a draft of the other.

Although it is acceptable in the academic community to draw on material from published research articles for expository book chapters (with permission), and conversely to publish preprints, including conference proceedings, in journals, this case is different. (a) the book chapter was refereed, exactly like a journal publication (b) the material in the chapter is not expository; it is research, (c) it was published before the Nature Biotechnology article, and presumably prepared long before,  (d) the book chapter cites the Nature Biotechnology article but not vice versa and (e) the book chapter is not a particularly innovative piece of work to begin with. The method it describes and claims to be “novel”, namely RUV, was already published by Gagnon-Bartsch & Speed.

Below is a musical rendition of what has happened here:

“An entertaining freshness… Tic Tac!” This is Ferrero‘s tag line for its most successful product, the ubiquitous Tic Tac. And the line has stuck. As WikiHow points out in how to make your breath freshfirst buy some mints, then brush your teeth.

One of the amazing things about Tic Tacs is that they are sugar free. Well… almost not. As the label explains, a single serving (one single Tic Tac) contains 0g of sugar (to be precise, less than 0.5g, as explained in a footnote). In what could initially be assumed to be a mere coincidence, the weight of a single serving is 0.49g. It did not escape my attention that 0.50-0.49=0.01. Why?


To understand it helps to look at the labeling rules of the FDA. I’ve reproduced the relevant section (Title 21) below, and boldfaced the relevant parts:


(c) Sugar content claims –(1) Use of terms such as “sugar free,” “free of sugar,” “no sugar,” “zero sugar,” “without sugar,” “sugarless,” “trivial source of sugar,” “negligible source of sugar,” or “dietarily insignificant source of sugar.” Consumers may reasonably be expected to regard terms that represent that the food contains no sugars or sweeteners e.g., “sugar free,” or “no sugar,” as indicating a product which is low in calories or significantly reduced in calories. Consequently, except as provided in paragraph (c)(2) of this section, a food may not be labeled with such terms unless:

(i) The food contains less than 0.5 g of sugars, as defined in 101.9(c)(6)(ii), per reference amount customarily consumed and per labeled serving or, in the case of a meal product or main dish product, less than 0.5 g of sugars per labeled serving; and

(ii) The food contains no ingredient that is a sugar or that is generally understood by consumers to contain sugars unless the listing of the ingredient in the ingredient statement is followed by an asterisk that refers to the statement below the list of ingredients, which states “adds a trivial amount of sugar,” “adds a negligible amount of sugar,” or “adds a dietarily insignificant amount of sugar;” and

(iii)(A) It is labeled “low calorie” or “reduced calorie” or bears a relative claim of special dietary usefulness labeled in compliance with paragraphs (b)(2), (b)(3), (b)(4), or (b)(5) of this section, or, if a dietary supplement, it meets the definition in paragraph (b)(2) of this section for “low calorie” but is prohibited by 101.13(b)(5) and 101.60(a)(4) from bearing the claim; or

(B) Such term is immediately accompanied, each time it is used, by either the statement “not a reduced calorie food,” “not a low calorie food,” or “not for weight control.”

It turns out that Tic Tacs are in fact almost pure sugar. Its easy to figure this out by looking at the number of calories per serving (1.9) and multiplying  the number of calories per gram of sugar (3.8) by 0.49 => 1.862 calories. 98% sugar! Ferrero basically admits this in their FAQ. Acting completely within the bounds of the law, they have simply exploited an arbitrary threshold of the FDA. Arbitrary thresholds are always problematic; not only can they have unintended consequences, but they can be manipulated to engineer desired outcomes. In computational biology they have become ubiquitous, frequently being described as “filters” or “pre-processing steps”.  Regardless of how they are justified, thresholds are thresholds are thresholds. They can sometimes be beneficial, but they are dangerous when wielded indiscriminately.

There is one type of thresholding/filtering in used in RNA-Seq that my postdoc Bo Li and I have been thinking about a bit this year. It consists of removing duplicate reads, i.e. reads that map to the same position in a transcriptome. The motivation behind such filtering is to reduce or eliminate amplification bias, and it is based on the intuition that it is unlikely that lightning strikes the same spot multiple times. That is, it is improbable that many reads would map to the exact same location assuming a model for sequencing that posits selecting fragments from transcripts uniformly. The idea is also called de-duplication or digital normalization.

Digital normalization is obviously problematic for high abundance transcripts. Consider, for example, a transcripts that is so abundant that it is extremely likely that at least one read will start at every site (ignoring the ends, which for the purposes of the thought experiment are not relevant). This would also be the case if the transcript was twice as abundant, and so digital normalization would prevent the possibility for estimating the difference. This issue was noted in a paper published earlier this year by Zhou et al.  The authors investigate in some detail the implications of this problem, and quantify the bias it introduces in a number of data sets. But a key question not answered in the paper is what does digital normalization actually do?

To answer the question, it is helpful to consider how one might estimate the abundance of a transcript after digital normalization. One naive approach is to just count the number of reads after de-duplication, followed by normalization for the length of the transcript and the number of reads sequenced. Specifically if there are sites where a read might start, and of the sites had at least one read, then the naive approach would be to use the estimate \frac{k}{n} suitably normalized for the total number of reads in the experiment. This is exactly what is done in standard de-duplication pipelines, or in digital normalization as described in the preprint by Brown et al. However assuming a simple model for sequencing, namely that every read is selected by first choosing a transcript according to a multinomial distribution and then choosing a location on it uniformly at random from among the sites, a different formula emerges.

Let be a random variable that denotes the number of sites on a transcript of length n that are covered in a random sequencing experiment, where the number of reads starting at each site of the transcript is Poisson distributed with parameter c (i.e., the average coverage of the transcript is c). Note that

Pr(X \geq 1) = 1-Pr(X=0) = 1-e^{-c}.

The maximum likelihood estimate for can also be obtained by the method of moments, which is to set

\frac{k}{n} = 1-e^{-c}

from which it is easy to see that

c = -log(1-\frac{k}{n}).

This is the same as the (derivation of the) Jukes-Cantor correction in phylogenetics where the method of moments equation is replaced by \frac{4}{3}\frac{k}{n} = 1-e^{-\frac{4}{3}c} yielding D_{JC} = -\frac{3}{4}log(1-\frac{4}{3}\frac{k}{n}), but I’ll leave an extended discussion of the Jukes-Cantor model and correction for a future post.

The point here, as noticed by Bo Li, is that since log(1-x) \approx -x by Taylor approximation, it follows that the average coverage can be estimated by c \approx \frac{k}{n}. This is exactly the naive estimate of de-duplication or digital normalization, and the fact that \frac{k}{n} \rightarrow 1 as k \rightarrow n means that -log(1-\frac{k}{n}) blows up, at high coverage hence the results of Zhou et al.

Digital normalization as proposed by Brown et al. involves possibly thresholding at more than one read per site (for example choosing a threshold C and removing all but at most C reads at every site). But even this modified heuristic fails to adequately relate to a probabilistic model of sequencing. One interesting and easy exercise is to consider the second or higher order Taylor approximations. But a more interesting approach to dealing with amplification bias is to avoid thresholding per se,  and to instead identify outliers among duplicate reads and to adjust them according to an estimated distribution of coverage. This is the approach of Hashimoto et al. in a the paper “Universal count correction for high-throughput sequencing” published in March in PLoS One. There are undoubtedly other approaches as well, and in my opinion the issue will received renewed attention in the coming year as the removal of amplification biases in single-cell transcriptome experiments becomes a priority.

As mentioned above, digital normalization/de-duplication is just one of many thresholds applied in a typical RNA-Seq “pipeline”. To get a sense of the extent of thresholding, one need only scan the (supplementary?) methods section of any genomics paper. For example, the GEUVADIS RNA-Seq consortium describe their analysis pipeline as follows:

“We employed the JIP pipeline (T.G. & M.S., data not shown) to map mRNA-seq reads and to quantify mRNA transcripts. For alignment to the human reference genome sequence (GRCh37, autosomes + X + Y + M), we used the GEM mapping suite24 (v1.349 which corresponds to publicly available pre-release 2) to first map (max. mismatches = 4%, max. edit distance = 20%, min. decoded strata = 2 and strata after best = 1) and subsequently to split-map (max.mismatches = 4%, Gencode v12 and de novo junctions) all reads that did not map entirely. Both mapping steps are repeated for reads trimmed 20 nucleotides from their 3′-end, and then for reads trimmed 5 nucleotides from their 5′-end in addition to earlier 3′-trimming—each time considering exclusively reads that have not been mapped in earlier iterations. Finally, all read mappings were assessed with respect to the mate pair information: valid mapping pairs are formed up to a maximum insert size of 100,000 bp, extension trigger = 0.999 and minimum decoded strata = 1. The mapping pipeline and settings are described below and can also be found in, where the code as well as an example pipeline are hosted.”

This is not a bad pipeline- the paper shows it was carefully evaluated– and it may have been a practical approach to dealing with the large amount of RNA-Seq data in the project. But even the first and seemingly innocuous thresholding to trim low quality bases from the ends of reads is controversial and potentially problematic. In a careful analysis published earlier this year, Matthew MacManes looked carefully at the effect of trimming in RNA-Seq, and concluded that aggressive trimming of bases below Q20, a standard that is frequently employed in pipelines, is problematic. I think his Figure 3, which I’ve reproduced below, is very convincing:


It certainly appears that some mild trimming can be beneficial, but a threshold that is optimal (and more importantly not detrimental) depends on the specifics of the dataset and is difficult or impossible to determine a priori. MacManes’ view (for more see his blog post on the topic) is consistent with another paper by Del Fabbro et al. that while seemingly positive about trimming in the abstract, actually concludes that “In the specific case of RNA-Seq, the tradeoff between sensitivity (number of aligned reads) and specificity (number of correctly aligned reads) seems to be always detrimental when trimming the datasets (Figure S2); in such a case, the modern aligners, like Tophat, seem to be able to overcome low quality issues, therefore making trimming unnecessary.”

Alas, Tic Tac thresholds are everywhere. My advice is: brush your teeth first.

Posts in chronological order


Blog Stats

  • 918,861 hits

Get every new post delivered to your Inbox.

Join 4,287 other followers

%d bloggers like this: