In this blog post I offer a cash prize for computing a p-value. For details about the competition you can skip directly to the challenge. But context is important:

Background

I’ve recently been reading a bioRxiv posting by X. Lan and J. Pritchard, Long-term survival of duplicate genes despite absence of subfunctionalized expression (2015) that examines the question of whether gene expression data (from human and mouse tissues) supports a model of duplicate preservation by subfunctionalization.

Subfunctionalization is a hypothesis for explaining the ubiquitous persistence of gene duplicates in extant genomes. The idea is that gene pairs arising from a duplication event evolve, via neutral mutation, different functions that are distinct from their common ancestral gene, yet together recapitulate the original function. It was introduced in 1999 as an alternative to the older hypothesis of neofunctionalization, which posits that novel gene functions arise by virtue of “retention” of one copy of a gene after duplication, while the other copy morphs into a new gene with a new function. Neofunctionalization was first floated as an idea to explain gene duplicates in the context of evolutionary theory by Haldane and Fisher in the 1930s, and was popularized by Ohno in his book Evolution by Gene Duplication published in 1970. The cartoon below helps to understand the difference between the *functionalization hypotheses (adapted from Wikipedia):

Lan and Pritchard examine the credibility of the sub- and neofunctionalization hypotheses using modern high-throughput gene expression (RNA-Seq) data: in their own words “Based on theoretical models and previous literature, we expected that–aside from the youngest duplicates–most duplicate pairs would be functionally distinct, and that the primary mechanism for this would be through divergent expression profiles. In particular, the sub- and neofunctionalization models suggest that, for each duplicate gene, there should be at least one tissue where that gene is more highly expressed than its partner.”

What they found was, in their words, that “surprisingly few duplicate pairs show any evidence of sub-/neofunctionalization of expression.” They went further, stating that “the prevailing model for the evolution of gene duplicates holds that, to survive, duplicates must achieve non-redundant functions, and that this usually occurs by partitioning the expression space. However, we report here that sub-/neofunctionalization of expression occurs extremely slowly, and generally does not happen until the duplicates are separated by genomic rearrangements. Thus, in most cases long-term survival must rely on other factors.” They propose instead that “following duplication the expression levels of a gene pair evolve so that their combined expression matches the optimal level. Subsequently, the relative expression levels of the two genes evolve as a random walk, but do so slowly (33) due to constraint on their combined expression. If expression happens to become asymmetric, this reduces functional constraint on the minor gene. Subsequent accumulation of missense mutations in the minor gene may provide weak selective pressure to eventually eliminate expression of this gene, or may free the minor gene to evolve new functions.”

The Lan and Pritchard paper is the latest in a series of works that examine high-browed evolutionary theories with hard data, and that are finding reality to be far more complicated than the intuitively appealing, yet clearly inadequate, hypotheses of neo- and subfunctionalization. One of the excellent papers in the area is

Dean et al. Pervasive and Persistent Redundancy among Duplicated Genes in Yeast, PLoS Genetics, 2008.

where the authors argue that in yeast “duplicate genes do not often evolve to behave like singleton genes even after very long periods of time.” I mention this paper, from the Petrov lab, because its results are fundamentally at odds with what is arguably the first paper to provide genome-wide evidence for neofunctionalization (also in yeast):

M. Kellis, B.W. Birren and E.S. Lander, Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae, Nature 2004.

At the time, the Kellis-Birren-Lander paper was hailed as containing “work that may lead to better understanding of genetic diseases” and in the press release Kellis stated that “understanding the dynamics of genome duplication has implications in understanding disease. In certain types of cancer, for instance, cells have twice as many chromosomes as they should, and there are many other diseases linked to gene dosage and misregulation.” He added that “these processes are not much different from what happened in yeast,” and the author of the press release added that “whole genome duplication may have allowed other organisms besides yeast to achieve evolutionary innovations in one giant leap instead of baby steps. It may account for up to 80 percent (seen this number before?) of flowering plant species and could explain why fish are the most diverse of all vertebrates.”

This all brings me to:

The challenge

In the abstract of their paper, Kellis, Birren and Lander wrote that:

“Strikingly, 95% of cases of accelerated evolution involve only one member of a gene pair, providing strong support for a specific model of evolution, and allowing us to distinguish ancestral and derived functions.” [boldface by authors]

In the main text of the paper, the authors expanded on this claim, writing:

“Strikingly, in nearly every case (95%), accelerated evolution was confined to only one of the two paralogues. This strongly supports the model in which one of the paralogues retained an ancestral function while the other, relieved of this selective constraint, was free to evolve more rapidly”.

The word “strikingly” suggests a result that is surprising in its statistical significance with respect to some null model the authors have in mind. The data is as follows:

The authors identified 457 duplicated gene pairs that arose by whole genome duplication (for a total of 914 genes) in yeast. Of the 457 pairs, 76 showed accelerated (protein) evolution in S. cerevisiae. The term “accelerated” was defined to relate to amino acid substitution rates in S. cerevisiae, which were required to be 50% faster than those in another yeast species, K. waltii. Of the 76 pairs, only four were accelerated in both paralogs. Therefore 72 gene pairs showed acceleration in only one paralog (72/76 = 95%).

So, is it indeed “striking” that “in nearly every case (95%), accelerated evolution was confined to only one of the two paralogues”? Well, the authors don’t provide a p-value in their paper, nor do they propose a null model with respect to which the question makes sense. So I am offering a prize to help crowdsource what should have been an exercise undertaken by the authors or, failing that, a requirement demanded by the referees. To incentivize people in the right direction,

I will award ${\bf \frac{\$100}{p}}$

to the person who can best justify a reasonable null model, together with a p-value (p) for the phrase “Strikingly, 95% of cases of accelerated evolution involve only one member of a gene pair” in the abstract of the Kellis-Birren-Lander paper. Notice the smaller the (justifiable) p-value someone can come up with, the larger the prize will be.
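As a starting point for entries, here is a minimal sketch of one candidate null model. The model is an assumption of this sketch, not something proposed in the paper: take the 80 accelerated genes implied by the data above (72 pairs with one accelerated paralog plus 4 pairs with two) and scatter them uniformly at random over the 914 genes, ignoring pairing. The code computes the exact distribution of the number of pairs in which both paralogs are accelerated. The function name `pmf_both` and the choice of tails are mine.

```python
from fractions import Fraction
from math import comb

PAIRS = 457                                    # duplicated gene pairs (from the paper)
ACCEL_PAIRS = 76                               # pairs with at least one accelerated paralog
BOTH = 4                                       # pairs with both paralogs accelerated
ACCEL_GENES = (ACCEL_PAIRS - BOTH) + 2 * BOTH  # 72 + 8 = 80 accelerated genes


def pmf_both(j, pairs=PAIRS, genes=ACCEL_GENES):
    """P(exactly j pairs have both paralogs accelerated) when `genes`
    accelerated genes are placed uniformly at random among 2*pairs genes
    (the assumed null model)."""
    singles = genes - 2 * j                    # pairs contributing exactly one gene
    if j < 0 or singles < 0 or singles > pairs - j:
        return Fraction(0)
    # choose the doubly accelerated pairs, then the singly accelerated pairs,
    # then which of the two paralogs is accelerated in each of the latter
    ways = comb(pairs, j) * comb(pairs - j, singles) * 2 ** singles
    return Fraction(ways, comb(2 * pairs, genes))


max_j = ACCEL_GENES // 2
expected_both = sum(j * pmf_both(j) for j in range(max_j + 1))
p_at_most = sum(pmf_both(j) for j in range(0, BOTH + 1))       # P(X <= 4)
p_at_least = sum(pmf_both(j) for j in range(BOTH, max_j + 1))  # P(X >= 4)

# Under this null the expected number of doubly accelerated pairs is
# 457 * (80/914) * (79/913), which is roughly 3.5.
print(f"expected pairs with both accelerated: {float(expected_both):.2f}")
print(f"P(X <= {BOTH}) = {float(p_at_most):.3f}")
print(f"P(X >= {BOTH}) = {float(p_at_least):.3f}")
```

Whether this is a “reasonable null model” in the sense of the rules below is exactly what an entry would need to argue; other nulls (for example, ones that account for correlated rates within a pair) would give different numbers.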

Bonus: explain in your own words how you think the paper was accepted to Nature without the authors having to justify their use of the word “strikingly” for a main result of the paper, and in a timeframe consisting of submission on December 17th 2003 (just three days before Hanukkah and one week before Christmas) and acceptance January 19th 2004 (Martin Luther King Jr. day).

Rules

To be eligible for the prize entries must be submitted as comments on this blog post by 11:59pm EST on Sunday May 31st, 2015 and they must be submitted with a valid e-mail address. I will keep the name (and e-mail address) of the winner anonymous if they wish (this can be ensured by using a pseudonym when submitting the entry as a comment). The prize, if awarded, will go to the person submitting the most complete, best explained solution that has a p-value calculation that is correct according to the model proposed. Preference will be given to submissions from students, especially undergraduates, but individuals in any stage of their career, and from anywhere in the world, are encouraged to submit solutions. I reserve the right to interpret the phrase “reasonable null model” in a way that is consistent with its use in the scientific community and I reserve the right to not award the prize if no good/correct solutions are offered. Participants do not have to answer the bonus question to win.

About one and a half years ago I wrote a blog post titled “GTEx is throwing away 90% of their data”. The post was, shall we say, “direct”. For example, in reference to the RNA-Seq quantification program Flux Capacitor I wrote that

Using Flux Capacitor is equivalent to throwing out 90% of the data!

I added that “the methods description in the Online Methods of Montgomery et al. can only be (politely) described as word salad” (after explaining that the methods underlying the program were never published, except for a brief mention in that paper). I referred to the sole figure in Montgomery et al. as a “completely useless” description of the method  (and showed that it contained errors). I highlighted the fact that many aspects of Flux Capacitor, its description and documentation provided on its website were “incoherent”. Can we agree that this description is not flattering?

The claim about “throwing out 90% of the data” was based on benchmarking I reported on in the blog post. If I were to summarize the results (politely), I would say that the take home message was that Flux Capacitor is junk. Perhaps nobody had really noticed because nobody cared about the program. Flux Capacitor was literally being used only by the authors of the program  (and their affiliates, which turned out to include the ENCODE, GENCODE, GEUVADIS and GTEx consortiums). In fact, when I wrote the blog post, I don’t think the program had ever been benchmarked or compared to other tools. It was, after all, unpublished and besides, nobody reads consortium papers. However after I blogged a few others decided to include Flux Capacitor in their benchmarks and the conclusions reached were the same as mine: Flux Capacitor is junk and Flux Capacitor is junk. Of course some people objected to my blog post when it came out, so it’s fun to be right and have others say so in print. But true vindication has come in the form of a citation to the blog post in a published paper in a journal! Specifically, in

C. Iannone, A. Pohl, P. Papasaikas, D. Soronellas, G.P. Vincent, M. Beato and J. Valcárcel, Relationship between nucleosome positioning and progesterone-induced alternative splicing in breast cancer cells, RNA 21 (2015) 360–374

the authors cite my blog post. They write:

Ummm…. wait… WHAT THE FLUX? The authors actually used Flux Capacitor for their analysis, and are citing my blog at https://liorpachter.wordpress.com/tag/flux-capacitor/ as the definitive reference for the program. Wait, what again?? They used my blog post as a reference for the method??? This is like [[ readers are invited to leave a comment offering a suitable analogy ]].

I’m not really sure what the authors can do at this point. They could publish an erratum and replace the citation. But with what? Flux Capacitor still hasn’t been published (!) Then there is the journal. Does any journal really think it is acceptable to list my blog as the citation for an RNA-Seq quantification tool that is fundamental for the results in a paper? (I’m flattered, but still…) Speaking of the journal, where were the reviewers? How could they not catch this? And the readers? The paper has been out since January… I have to ask: has anybody read it? Of course the biggest embarrassment here is the fact that there is a citation for Flux Capacitor at all. Why on earth are the authors using this discredited program??? Well maybe one answer is to be found in the acknowledgments section, where the group of a PI from the GTEx project is thanked. Actually, this PI was the last author on one of the recently published GTEx companion papers, which, I am sad to say… used Flux Capacitor (albeit with some quantifications performed with Cufflinks as well to demonstrate “robustness”). Why would GTEx be pushing for Flux Capacitor and insist on its use? We’ve come full circle to my GTEx blog post. By now I don’t even know what I think is the most embarrassing part of this whole story. So I thought I’d host a poll:

Today I posted the preprint N. Bray, H. Pimentel, P. Melsted and L. Pachter, Near-optimal RNA-Seq quantification with kallisto to the arXiv. It describes the RNA-Seq quantification program kallisto.

The project began in August 2013 when I wrote my second blog post, about another arXiv preprint describing a program for RNA-Seq quantification called Sailfish (now a published paper). At the time, a few students and postdocs in my group read the paper and then discussed it in our weekly journal club. It advocated a philosophy of “lightweight algorithms, which make frugal use of data, respect constant factors and effectively use concurrent hardware by working with small units of data where possible”. Indeed, two themes emerged in the journal club discussion:

1. Sailfish was much faster than other methods by virtue of being simpler.

2. The simplicity was to replace approximate alignment of reads with exact alignment of k-mers. When reads are shredded into their constituent k-mer “mini-reads”, the difficult read -> reference alignment problem in the presence of errors becomes an exact matching problem efficiently solvable with a hash table.
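The second point can be illustrated with a toy sketch: shred each transcript into k-mers, store them in a hash table, and look up the k-mers of a read by exact match. This is only an illustration of the idea, not Sailfish's actual implementation; the sequences, the value of k and the function names are made up for the example.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all overlapping k-mers ("mini-reads") of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_index(transcripts, k):
    """Hash table mapping each k-mer to the set of transcripts containing it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


# Toy transcriptome (hypothetical sequences)
transcripts = {"t1": "ACGTACGGTCA", "t2": "ACGTACCTTCA"}
K = 5
index = build_kmer_index(transcripts, K)

# Each k-mer of the read is matched exactly with one hash lookup;
# no error-tolerant alignment is performed.
read = "GTACGGT"
hits = [sorted(index.get(kmer, ())) for kmer in kmers(read, K)]
print(hits)  # [['t1'], ['t1'], ['t1']]
```

A read with a sequencing error simply produces some k-mers that are absent from the table, which is where the loss of information relative to read alignment comes from.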

We felt that the shredding of reads must lead to reduced accuracy, and we quickly checked and found that to be the case. In fact, in our simulations, we saw that Sailfish significantly underperformed methods such as RSEM. However the fact that simpler was so much faster led us to wonder whether the prevailing wisdom of seeking to improve RNA-Seq analysis by looking at increasingly complex models was ill-founded. Perhaps simpler could be not only fast, but also accurate, or at least close enough to best-in-class for practical purposes.

After thinking about the problem carefully, my (now former) student Nicolas Bray realized that the key is to abandon the idea that alignments are necessary for RNA-Seq quantification. Even Sailfish makes use of alignments (of k-mers rather than reads, but alignments nonetheless). In fact, thinking about all the tools available, Nick realized that every RNA-Seq analysis program was being developed in the context of a “pipeline” of first aligning reads or parts of them to a reference genome or transcriptome. Nick had the insight to ask: what can be gained if we let go of that paradigm?

By April 2014 we had formalized the notion of “pseudoalignment” and Nick had written, in Python, a prototype of a pseudoaligner. He called the program kallisto. The basic idea was to determine, for each read, not where in each transcript it aligns, but rather which transcripts it is compatible with. That is asking for a lot less, and as it turns out, pseudoalignment can be much faster than alignment. At the same time, the information in pseudoalignments is enough to quantify abundances using a simple model for RNA-Seq, a point made in the isoEM paper, and an idea that Sailfish made use of as well.
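To make the distinction concrete, here is a toy sketch of the question pseudoalignment asks: which transcripts is a read compatible with? In this sketch a read's compatibility set is the intersection of the transcript sets of its k-mers. It is a deliberately naive illustration and should not be read as kallisto's actual algorithm or data structures; the sequences, names, and the choice to skip k-mers missing from the index are assumptions of the example.

```python
from collections import Counter, defaultdict


def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the set of transcripts compatible with every indexed k-mer of the read.

    No positions are reported, only compatibility. K-mers absent from the
    index are skipped here (an assumption of this sketch)."""
    compatible = None
    for i in range(len(read) - k + 1):
        hit = index.get(read[i:i + k])
        if hit is None:
            continue
        compatible = set(hit) if compatible is None else compatible & hit
    return frozenset(compatible or ())


# Toy transcriptome and reads (hypothetical sequences)
transcripts = {"t1": "ACGTACGGTCA", "t2": "ACGTACCTTCA", "t3": "GGACGTACGG"}
K = 5
index = build_index(transcripts, K)
reads = ["CGTACGG", "GTACCTT", "ACGTAC", "CGTACGG"]

# Reads with the same compatibility set are interchangeable for quantification,
# so only the count per set needs to be kept.
classes = Counter(pseudoalign(r, index, K) for r in reads)
for transcript_set, count in classes.items():
    print(sorted(transcript_set), count)
```

The counts per compatibility set are the only input needed by a simple quantification model, which is why so much less work than full alignment can suffice.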

Just how fast is pseudoalignment? In January of this year Páll Melsted from the University of Iceland came to visit my group for a semester sabbatical. Páll had experience in exactly the kinds of computer science we needed to optimize kallisto; he has written about efficient k-mer counting using the Bloom filter and de Bruijn graph construction. He translated the Python kallisto to C++, incorporating numerous clever optimizations and a few new ideas along the way. His work was done in collaboration with my student Harold Pimentel, Nick (now a postdoc with Jacob Corn and Jennifer Doudna at the Innovative Genomics Initiative) and myself.

The screenshot below shows kallisto being used on my 2012 iMac desktop to build an index of the human transcriptome (5 min 8 sec), and then quantify 78.6 million GEUVADIS human RNA-Seq reads (14 min). When we first saw these results we thought they were simply too good to be true. Let me repeat: The quantification of 78.6 million reads takes 14 minutes on a standard desktop using a single CPU core. In some tests we’ve gotten even faster running times, up to 15 million reads quantified per minute.

The results in our paper indicate that kallisto is not just fast, but also very accurate. This is not surprising: underlying RNA-Seq analysis are the alignments, and although kallisto is pseudoaligning instead, it is almost always only the compatibility information that is used in actual applications. As we show in our paper, from the point of view of compatibility, the pseudoalignments and alignments are almost the same.

Although accuracy is a primary concern with analysis, we realized in the course of working on kallisto that speed is also paramount, and not just as a  matter of convenience. The speed of kallisto has three major implications:

1. It allows for efficient bootstrapping. All that is required for the bootstrap are reruns of the EM algorithm, and those are particularly fast within kallisto. The result is that we can accurately estimate the uncertainty in abundance estimates. One of my favorite figures from our paper, made by Harold, is this one:

It is based on an analysis of 40 samples of 30 million reads subsampled from 275 million rat RNA-Seq reads. Each dot corresponds to a transcript and is colored by its abundance. The x-axis shows the variance estimated from kallisto bootstraps on a single subsample while the y-axis shows the variance computed from the different subsamples of the data. We see that the bootstrap recapitulates the empirical variance. This result is non-trivial: the standard dogma, that the technical variance in RNA-Seq is “Poisson” (i.e. proportional to the mean) is false, as shown in Supplementary Figure 3 of our paper (the correlation becomes 0.64). Thus, the bootstrap will be invaluable when incorporated in downstream applications and we are already working on some ideas.

2. It is not just the kallisto quantification that is fast; the index building, and even compilation of the program are also easy and quick. The implication for biologists is that RNA-Seq analysis now becomes interactive. Instead of “freezing” an analysis that might take weeks or even months, data can be explored dynamically, e.g. easily quantified against different transcriptomes, or re-quantified as transcriptomes are updated. The ability to analyze data locally instead of requiring cloud computation means that analysis is portable, and also easily secure.

3. We have found the fast turnaround of analysis helpful in improving the program itself. With kallisto we can quickly check the effect of changes in the algorithms. This allows for much faster debugging of problems, and also better optimization. It also allows us to release improvements knowing that users will be able to test them without resorting to a major computation that might take months. For this reason we’re not afraid to say that some improvements to kallisto will be coming soon.
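The bootstrap in the first point above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not kallisto's code: reads are summarized as counts per compatibility set, the EM ignores transcript lengths, and each bootstrap replicate resamples the reads with replacement and reruns the EM. All names and the toy counts are hypothetical.

```python
import random
from collections import Counter
from statistics import pvariance


def em(class_counts, transcripts, iters=200):
    """Estimate transcript proportions from counts per compatibility set.

    class_counts maps a frozenset of transcript names to a read count.
    Transcript lengths are ignored (a simplification of this sketch)."""
    alpha = {t: 1.0 / len(transcripts) for t in transcripts}
    total = sum(class_counts.values())
    for _ in range(iters):
        new = {t: 0.0 for t in transcripts}
        for ec, count in class_counts.items():
            if count == 0:
                continue
            denom = sum(alpha[t] for t in ec)
            for t in ec:
                # split the reads of this set in proportion to current estimates
                new[t] += count * alpha[t] / denom
        alpha = {t: new[t] / total for t in transcripts}
    return alpha


def bootstrap(class_counts, transcripts, n_boot=50, seed=0):
    """Resample reads with replacement and rerun the EM on each resample."""
    rng = random.Random(seed)
    classes = list(class_counts)
    weights = [class_counts[c] for c in classes]
    n_reads = sum(weights)
    estimates = []
    for _ in range(n_boot):
        resampled = Counter(rng.choices(classes, weights=weights, k=n_reads))
        estimates.append(em(resampled, transcripts))
    return estimates


transcripts = ["t1", "t2"]
class_counts = {
    frozenset({"t1"}): 300,        # reads compatible only with t1
    frozenset({"t2"}): 100,        # reads compatible only with t2
    frozenset({"t1", "t2"}): 600,  # ambiguous reads
}

estimate = em(class_counts, transcripts)
boot = bootstrap(class_counts, transcripts)
var_t1 = pvariance([b["t1"] for b in boot])
print(estimate)  # t1 converges to 0.75 and t2 to 0.25 for these counts
print(var_t1)    # bootstrap variance of the t1 estimate
```

Because each replicate only reruns the EM on resampled counts, the cost of the bootstrap is dominated by how fast the EM is, which is the point made above.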

As someone who has worked on RNA-Seq since the time of 32bp reads, I have to say that kallisto has personally been extremely liberating. It offers freedom from the bioinformatics core facility, freedom from the cloud, freedom from the multi-core server, and in my case freedom from my graduate students– for the first time in years I’m analyzing tons of data on my own; because of the simplicity and speed I find I have the time for it. Enjoy!

Last year I came across a wonderful post on the arXiv, a paper titled A new approach to enumerating statistics modulo n written by William Kuszmaul while he was a high school student participating in the MIT Primes Program for Research in Mathematics. Among other things, Kuszmaul solved the problem of counting the number of subsets of the n-element set $\{1,2,\ldots,n\}$ that sum to k mod m.

This counting problem is related to beautiful (elementary) number theory and combinatorics, and is connected to ideas in (error correcting) coding theory and even computational biology (the Burrows Wheeler transform). I’ll tell the tale, but first explain my personal connection to it, which is the story of  my first paper that wasn’t, and how I was admitted to graduate school:

In 1993 I was a junior (math major) at Caltech and one of my best friends was a senior, Nitu Kitchloo, now an algebraic topologist and professor of mathematics at Johns Hopkins University. I had a habit of asking Nitu math questions, and he had a habit of solving them. One of the questions I asked him that year was:

How many subsets of the n-element set $\{1,2,\ldots,n\}$ sum to 0 mod n?

I don’t remember exactly why I was thinking of the problem, but I do remember that Nitu immediately started looking at the generating function (the polynomial whose coefficients count the number of subsets with each given sum) and magic happened quickly thereafter. We eventually wrote up a manuscript whose main result was the enumeration of $N_n^k$, the number of subsets of $\{1,2,\ldots,n\}$ whose elements sum to k mod n. The main result was

$N_n^k = \frac{1}{n} \sum_{s|n; s \,odd}2^{\frac{n}{s}}\frac{\varphi(s)}{\varphi\left(\frac{s}{(k,s)}\right)}\mu\left(\frac{s}{(k,s)} \right)$.

We were particularly tickled that the formula contained both Euler’s totient function and the Möbius function. It seemed nice. So we decided to submit our little result to a journal.
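For readers who want to play with the formula, here is a short sketch that evaluates it and compares it with brute-force enumeration for small n. The helper names are mine, and the totient and Möbius functions are implemented naively, which is fine at this scale.

```python
from math import gcd


def totient(s):
    """Euler's totient function, computed naively."""
    return sum(1 for a in range(1, s + 1) if gcd(a, s) == 1)


def mobius(s):
    """Möbius function via trial division."""
    result, p = 1, 2
    while p * p <= s:
        if s % p == 0:
            s //= p
            if s % p == 0:
                return 0          # s has a squared prime factor
            result = -result
        p += 1
    return -result if s > 1 else result


def formula(n, k):
    """N_n^k from the closed form: sum over odd divisors s of n."""
    total = 0
    for s in range(1, n + 1, 2):
        if n % s == 0:
            r = s // gcd(k, s)    # gcd(0, s) == s in Python, so k = 0 gives r = 1
            total += 2 ** (n // s) * (totient(s) // totient(r)) * mobius(r)
    assert total % n == 0
    return total // n


def brute_force(n, k):
    """Count subsets of {1, ..., n} whose elements sum to k mod n."""
    count = 0
    for mask in range(1 << n):
        subset_sum = sum(i + 1 for i in range(n) if (mask >> i) & 1)
        if (subset_sum - k) % n == 0:
            count += 1
    return count


for n in range(1, 11):
    for k in range(n):
        assert formula(n, k) == brute_force(n, k)

print(formula(3, 0), formula(3, 1), formula(3, 2))  # 4 2 2
```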

By now months had passed and Nitu had already left Caltech for graduate school. He left me to submit the paper and I didn’t know where, so I consulted the resident combinatorics expert (Rick Wilson) who told me that he liked the result and hadn’t seen it before, but that before sending it off, just in case, I should consult with this one professor at MIT who was known to be good at counting. I remember Wilson saying something along the lines of “Richard knows everything”. This was in the nascent days of the web, so I hand wrote a letter to Richard Stanley, enclosed a copy of the manuscript, and mailed it off.

A few weeks later I received a hand-written letter from MIT. Richard Stanley had written back to let us know that he very much liked the result… because it had brought back memories of his time as an undergraduate at Caltech, when he had worked on the same problem! He sent me the reference:

R.P. Stanley and M.F. Yoder, A study of Varshamov codes for asymmetric channels, JPL Technical Report 32-1526, 1973.

Also included was a note that said “Please consider applying to MIT”.

Stanley and Yoder’s result concerned single-error-correcting codes for what are known as Z-channels. These are communication links where there is an asymmetry in the fidelity of transmission: 0 is reliably transmitted as 0 (probability 1), but 1 may be transmitted as either a 1 or a 0 (with some probability p). In a 1973 paper A class of codes for asymmetric channels and a problem from the additive theory of numbers published in the IEEE Transactions on Information Theory, R. Varshamov had proposed a single error correcting code for such channels, which was essentially to encode a message by a bit string corresponding to a subset of $\{1,\ldots,n\}$ whose elements sum to d mod (n+1). Since zeroes are transmitted faithfully, it is not hard to detect a single error (and correct it) by summing the elements corresponding to the received bit string. Stanley and Yoder’s paper addressed questions related to enumerating the number of codewords. In particular, they were basically working out the solution to the problem Nitu and I had considered. I guess we could have published our paper anyway as we had a few additional results, for example a theorem explaining how to enumerate zero summing subsets of finite Abelian groups, but somehow we never did. There is a link to the manuscript we had written on my website:

N. Kitchloo and L. Pachter, An interesting result about subset sums, 1993.
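The error correction for the Z-channel code described above is simple enough to sketch. If a codeword's positions sum to d mod (n+1) and a single 1 at position j is received as 0, the received sum drops by exactly j, so j can be read off as the deficit. The function names below are mine, and the sketch assumes at most one 1-to-0 error occurred.

```python
from itertools import product


def position_sum(bits):
    """Sum of the (1-based) positions holding a 1."""
    return sum(i for i, b in enumerate(bits, start=1) if b)


def is_codeword(bits, d):
    """A codeword is a bit string whose positions of 1s sum to d mod (n+1)."""
    n = len(bits)
    return position_sum(bits) % (n + 1) == d % (n + 1)


def correct_single_error(received, d):
    """Undo at most one 1 -> 0 error (assumption: no more than one occurred)."""
    n = len(received)
    deficit = (d - position_sum(received)) % (n + 1)
    fixed = list(received)
    if deficit != 0:
        fixed[deficit - 1] = 1   # the deficit is the position of the lost 1
    return fixed


# Exhaustive check for n = 6, d = 0: every single 1 -> 0 error is corrected.
n, d = 6, 0
codewords = [list(w) for w in product([0, 1], repeat=n) if is_codeword(w, d)]
for word in codewords:
    assert correct_single_error(word, d) == word          # no error
    for j, bit in enumerate(word):
        if bit:
            damaged = word[:j] + [0] + word[j + 1:]
            assert correct_single_error(damaged, d) == word

print(len(codewords))  # number of codewords for n = 6, d = 0
```

The final count is an instance of the enumeration problem discussed here: the number of subsets of $\{1,\ldots,n\}$ summing to d mod (n+1).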

One generalization we had explored in our paper was the enumeration of what we called $N_{n,m}^k$, the number of subsets of the n-element set $\{1,2,\ldots,n\}$ that sum to k mod m. We looked specifically at the problem where $m|n$ and proved that

$N_{n,m}^n = \frac{1}{m}\sum_{s|m; s \, odd} 2^{\frac{n}{s}}\varphi(s)$.

What Kuszmaul succeeded in doing is to extend this result to any m < n, which is very nice for two reasons: first, the work completes the investigation of the question of subset sums (to subsets summing to an arbitrary modulus). More importantly, the technique used is that of thinking more generally about “modular enumeration”, which is the problem of finding remainders of polynomials modulo $x^n-1$. This led him to numerous other applications, including results on q-multinomial and q-Catalan numbers, and to the combinatorics of lattice paths. This is the hallmark of excellent mathematics: a proof technique that sheds light on the problem at hand and many others.

One of the ideas that modular enumeration is connected to is that of the Burrows-Wheeler transform (BWT). The BWT was published as a DEC tech report in 1994 (based on earlier work of Wheeler in 1983), and is a transform of one string to another of the same length. To understand the transform, consider the example of a binary string s of length n. The BWT consists of forming a matrix of all cyclic permutations of s (one row per permutation), then sorting the rows lexicographically, and finally outputting the last column of the matrix. For example, the string 001101 is transformed to 110010. It is obvious by virtue of the definition that any two strings that are equivalent up to circular permutation will be transformed to the same string. For example, 110100 will also transform to 110010.
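The definition translates directly into a few lines of code. This is the naive version exactly as described (sort all rotations, take the last column), without the end-of-string marker and without the suffix-array machinery that practical implementations use.

```python
def bwt(s):
    """Burrows-Wheeler transform as defined above: last column of the
    lexicographically sorted matrix of all cyclic rotations of s."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


print(bwt("001101"))  # 110010
print(bwt("110100"))  # 110010, since 110100 is a rotation of 001101
```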

Circular binary strings are called necklaces in combinatorics. Their enumeration is a classic problem, solvable by Burnside’s lemma, and the answer is that the number of distinct necklaces $C_n$ of length n is given by

$C_n = \frac{1}{n} \sum_{s|n} 2^{\frac{n}{s}} \varphi(s)$.

For odd n this formula coincides with the subset sum problem (for subsets summing to 0 mod n). When n is prime it is easy to describe a bijection, but for general odd n a simple combinatorial bijection is elusive (see Richard Stanley’s Enumerative Combinatorics Volume 1 Chapter 1 Problem 105b).
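The coincidence is easy to check numerically for small n. The sketch below counts binary necklaces by brute force (one canonical rotation per necklace), compares the count with the formula for $C_n$, and for odd n compares it with a brute-force count of subsets of $\{1,\ldots,n\}$ summing to 0 mod n. The function names are mine.

```python
from math import gcd


def totient(s):
    """Euler's totient function, computed naively."""
    return sum(1 for a in range(1, s + 1) if gcd(a, s) == 1)


def necklace_formula(n):
    """C_n = (1/n) * sum over divisors s of n of 2^(n/s) * phi(s)."""
    total = sum(2 ** (n // s) * totient(s) for s in range(1, n + 1) if n % s == 0)
    return total // n


def necklaces_brute(n):
    """Count binary strings of length n up to rotation."""
    canonical = set()
    for value in range(1 << n):
        s = format(value, f"0{n}b")
        canonical.add(min(s[i:] + s[:i] for i in range(n)))
    return len(canonical)


def zero_sum_subsets(n):
    """Count subsets of {1, ..., n} whose elements sum to 0 mod n."""
    return sum(
        1
        for mask in range(1 << n)
        if sum(i + 1 for i in range(n) if (mask >> i) & 1) % n == 0
    )


for n in range(1, 12):
    assert necklace_formula(n) == necklaces_brute(n)
    if n % 2 == 1:
        assert necklace_formula(n) == zero_sum_subsets(n)

print([necklace_formula(n) for n in range(1, 8)])  # [2, 3, 4, 6, 8, 14, 20]
```

For even n the two counts differ, because the subset sum formula only runs over odd divisors.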

The Burrows-Wheeler transform is useful because it can be utilized for constant time string matching while requiring an index whose size is only linear in the size of the target. It has therefore become an indispensable tool for genomics. I’m not aware of an application of the elementary observation above, but as the Stanley-Yoder, Kitchloo-P., Kuszmaul timeline demonstrates (21 years in between publications)… math moves in decades. I do think there is some interesting combinatorics underlying the BWT, and that its elucidation may turn out to have practical implications. We’ll see.

A final point: it is fashionable to think that biology, unlike math, moves in years. After all, NIH R01 grants are funded for a period of 3–5 years, and researchers constantly argue with journals that publication times should be weeks and not months. But in fact, lots of basic research in biology moves in decades as well, just like in mathematics. A good example is the story of CRISPR/Cas9, which began with the discovery of “genetic sandwiches” in 1987. The follow-up identification and interpretation of CRISPRs took decades, mirroring the slow development of mathematics. Today the utility of the CRISPR/Cas9 system depends on the efficient selection of guides and prediction of off-target binding, and as it turns out, tools developed for this purpose frequently use the Burrows-Wheeler transform. It appears that not only binary strings can form circles, but ideas as well…

William Kuszmaul  won 3rd place in the 2014 Intel Science Talent Search for his work on modular enumeration. Well deserved, and thank you!

When I was an undergraduate at Caltech I took a combinatorics course from Rick Wilson who taught from his then just published textbook A Course in Combinatorics (co-authored with J.H. van Lint). The course and the book emphasized design theory, a subject that is beautiful and fundamental to combinatorics, coding theory, and statistics, but that has sadly been in decline for some time. It was a fantastic course taught by a brilliant professor- an experience that had a profound impact on me. Though to be honest, I haven’t thought much about designs in recent years. Having kids changed that.

A few weeks ago I was playing the card game Colori with my three year old daughter. It’s one of her favorites.

The game consists of 15 cards, each displaying drawings of the same 15 items (beach ball, boat, butterfly, cap, car, drum, duck, fish, flower, kite, pencil, jersey, plane, teapot, teddy bear), with each item colored using two of the colors red, green, yellow and blue. Every pair of cards contains exactly one item that is colored exactly the same. For example, the two cards the teddy bear is holding in the picture above are shown below:

The only pair of items colored exactly the same are the two beach balls. The gameplay consists of shuffling the deck and then placing a pair of cards face-up. Players must find the matching pair, and the first player to do so keeps the cards. This is repeated seven times until there is only one card left in the deck, at which point the player with the most cards wins. When I play with my daughter “winning” consists of enjoying her laughter as she figures out the matching pair, and then proceeds to try to eat one of the cards.

An inspection of all 15 cards provided with the game reveals some interesting structure:

Every card contains exactly one of each type of item. Each item therefore occurs 15 times among the cards, with fourteen of the occurrences consisting of seven matched pairs, plus one extra. This is a type of partially balanced incomplete block design. Ignoring for a moment the extra item placed on each card, what we have is 15 items, each colored one of seven ways (v=15*7=105). The 105 items have been divided into 15 blocks (the cards), so that b=15. Each block contains 14 elements (the items) so that k=14, and each element appears in two blocks (r=2). If every pair of different (colored) items occurred in the same number of cards, we would have a balanced incomplete block design, but this is not the case in Colori. Each item occurs in the same block as 26 (=2*13) other items (we are ignoring the extra item that makes for 15 on each card), and therefore it is not the case that every pair of items occurs in the same number of blocks as would be the case in a balanced incomplete block design. Instead, there is an association scheme that provides extra structure among the 105 items, and in turn describes the way in which items do or do not appear together on cards. The association scheme can be understood as a graph whose nodes consist of the 105 items, with edges between items labeled either 0, 1 or 2. An edge between two items of the same type is labeled 0, edges between different items that appear on the same card are labeled 1, and edges between different items that do not appear on the same card are labeled 2. This edge labeling is called an “association scheme” because it has a special property, namely that the number of triangles with a base edge labeled k, and other two edges labeled i and j respectively, is dependent only on i, j and k and not on the specific base edge selected. In other words, there is a special symmetry to the graph.
Returning to the deck of cards, we see that every pair of items appears in the same card exactly 0 or 1 times, and the number depends only on the association class of the pair of objects. This is called a partially balanced incomplete block design.
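The counts above can be checked with a small sketch. It rests on an assumption that is consistent with the numbers in the text but is not a description of the actual card artwork: since there are 105 = 15·14/2 colored items, each appearing on exactly two of the 15 cards, and no two items share more than one card, each colored item is modeled as the pair of cards it appears on. The extra item on each card and the grouping of items into types are ignored, and the function names are hypothetical.

```python
from itertools import combinations


def pair_design(num_cards=15):
    """Model each colored item as the (assumed) pair of cards it appears on."""
    elements = list(combinations(range(num_cards), 2))
    # A block (card) holds every element whose pair of cards includes it.
    blocks = {card: [e for e in elements if card in e] for card in range(num_cards)}
    return elements, blocks


def cooccurrence_counts(elements, blocks):
    """Count, for every pair of distinct elements, the blocks containing both."""
    counts = {pair: 0 for pair in combinations(elements, 2)}
    for members in blocks.values():
        for pair in combinations(members, 2):
            counts[pair] += 1
    return counts


elements, blocks = pair_design()
v, b = len(elements), len(blocks)
k = len(blocks[0])
r = sum(elements[0] in members for members in blocks.values())
counts = cooccurrence_counts(elements, blocks)
# Number of other elements that share a block with the first element.
neighbors = sum(1 for pair, c in counts.items() if c == 1 and elements[0] in pair)

print(v, b, k, r)                    # 105 15 14 2
print(b * k == v * r)                # True: both sides count 210 item placements
print(sorted(set(counts.values())))  # [0, 1]
print(neighbors)                     # 26
```

Under this model every pair of elements meets in 0 or 1 blocks rather than in a constant number of blocks, which is the sense in which the design is only partially balanced.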

The author of the game, Reinhard Staupe, made it a bit more difficult by adding an extra item to each card, making the identification of the matching pair harder. The addition also ensures that each of the 15 items appears on each card. Moreover, the items are permuted in location on the cards, in an arrangement similar to a Latin square, making it hard to pair up the items. And instead of using 8 different colors, he used only four, producing the eight different “colors” of each item on the cards by using pairwise combinations of the four. The yellow-green two-colored beach balls are particularly difficult to tell apart from the green-yellow ones. Of course, much of this is exactly the kind of thing you would want to do if you were designing an RNA-Seq experiment!

Instead of 15 types of items, think of 15 different strains of mice.  Instead of colors for the items, think of different cellular conditions to be assayed. Instead of one pair for each of seven color combinations, think of one pair of replicates for each of seven cellular conditions. Instead of cards, think of different sequencing centers that will prepare the libraries and sequence the reads. An ideal experimental setup would involve distributing the replicates and different cellular conditions across the different sequencing centers so as to reduce batch effect. This is the essence of part of the paper Statistical Design and Analysis of RNA Sequencing Data by Paul Auer and Rebecca Doerge. For example, in their Figure 4 (shown below) they illustrate the advantage of balanced block designs to ameliorate lane effects:

Figure 4 from P. Auer and R.W. Doerge’s paper Statistical Design and Analysis of RNA Sequencing Data.

Of course the use of experimental designs for constructing controlled gene expression experiments is not new. Kerr and Churchill wrote about the use of combinatorial designs in Experimental Design for gene expression microarrays, and one can trace back a long chain of ideas originating with R.A. Fisher. But design theory seems to me to be a waning art insofar as molecular biology experiments are concerned, and it is frequently being replaced with biological intuition of what makes for a good control. The design of good controls is sometimes obvious, but not always. So next time you design an experiment, if you have young kids, first play a round of Colori. If the kids are older, play Set instead. And if you don’t have any kids, plan for an extra research project, because what else would you do with your time?

I’m a (50%) professor of mathematics and (50%) professor of molecular & cell biology at UC Berkeley. There have been plenty of days when I have spent the working hours with biologists and then gone off at night with some mathematicians. I mean that literally. I have had, of course, intimate friends among both biologists and mathematicians. I think it is through living among these groups and much more, I think, through moving regularly from one to the other and back again that I have become occupied with the problem that I’ve christened to myself as the ‘two cultures’. For constantly I feel that I am moving among two groups- comparable in intelligence, identical in race, not grossly different in social origin, earning about the same incomes, who have almost ceased to communicate at all, who in intellectual, moral and psychological climate have so little in common that instead of crossing the campus from Evans Hall to the Li Ka Shing building, I may as well have crossed an ocean.1

I try not to become preoccupied with the two cultures problem, but this holiday season I have not been able to escape it. First there was a blog post by David Mumford, a professor emeritus of applied mathematics at Brown University, published on December 14th. For those readers of the blog who do not follow mathematics, it is relevant to what I am about to write that David Mumford won the Fields Medal in 1974 for his work in algebraic geometry, and afterwards launched another successful career as an applied mathematician, building on Ulf Grenander’s Pattern Theory and making significant contributions to vision research. A lot of his work is connected to neuroscience and therefore biology. Among his many awards are the MacArthur Fellowship, the Shaw Prize, the Wolf Prize and the National Medal of Science. David Mumford is not Joe Schmo.

It therefore came as a surprise to me to read his post titled “Can one explain schemes to biologists?”  in which he describes the rejection by the journal Nature of an obituary he was asked to write. Now I have to say that I have heard of obituaries being retracted, but never of an obituary being rejected. The Mumford rejection is all the more disturbing because it happened after he was invited by Nature to write the obituary in the first place!

The obituary Mumford was asked to write was for Alexander Grothendieck, a leading and towering figure in 20th century mathematics who built many of the foundations for modern algebraic geometry. My colleague Edward Frenkel published a brief non-technical obituary about Grothendieck in the New York Times, and perhaps that is what Nature had in mind for its journal as well. But since Nature bills itself as “An international journal, published weekly, with original, groundbreaking research spanning all of the scientific disciplines [emphasis mine]” Mumford assumed the readers of Nature would be interested not only in where Grothendieck was born and died, but in what he actually accomplished in his life, and why he is admired for his mathematics. Here is the beginning excerpt of Mumford’s blog post2 explaining why he and John Tate (his coauthor for the post) needed to talk about the concept of a scheme in their post:

John Tate and I were asked by Nature magazine to write an obituary for Alexander Grothendieck. Now he is a hero of mine, the person that I met most deserving of the adjective “genius”. I got to know him when he visited Harvard and John, Shurik (as he was known) and I ran a seminar on “Existence theorems”. His devotion to math, his disdain for formality and convention, his openness and what John and others call his naiveté struck a chord with me.

So John and I agreed and wrote the obituary below. Since the readership of Nature were more or less entirely made up of non-mathematicians, it seemed as though our challenge was to try to make some key parts of Grothendieck’s work accessible to such an audience. Obviously the very definition of a scheme is central to nearly all his work, and we also wanted to say something genuine about categories and cohomology.

What they came up with is a short but well-written obituary that is the best I have read about Grothendieck. It is non-technical yet accurate and meaningfully describes, at a high level, what he is revered for and why. Here it is (copied verbatim from David Mumford’s blog):

Alexander Grothendieck
David Mumford and John Tate

Although mathematics became more and more abstract and general throughout the 20th century, it was Alexander Grothendieck who was the greatest master of this trend. His unique skill was to eliminate all unnecessary hypotheses and burrow into an area so deeply that its inner patterns on the most abstract level revealed themselves — and then, like a magician, show how the solution of old problems fell out in straightforward ways now that their real nature had been revealed. His strength and intensity were legendary. He worked long hours, transforming totally the field of algebraic geometry and its connections with algebraic number theory. He was considered by many the greatest mathematician of the 20th century.

Grothendieck was born in Berlin on March 28, 1928 to an anarchist, politically activist couple — a Russian Jewish father, Alexander Shapiro, and a German Protestant mother Johanna (Hanka) Grothendieck, and had a turbulent childhood in Germany and France, evading the holocaust in the French village of Le Chambon, known for protecting refugees. It was here in the midst of the war, at the (secondary school) Collège Cévenol, that he seems to have first developed his fascination for mathematics. He lived as an adult in France but remained stateless (on a “Nansen passport”) his whole life, doing most of his revolutionary work in the period 1956 – 1970, at the Institut des Hautes Études Scientifique (IHES) in a suburb of Paris after it was founded in 1958. He received the Fields Medal in 1966.

His first work, stimulated by Laurent Schwartz and Jean Dieudonné, added major ideas to the theory of function spaces, but he came into his own when he took up algebraic geometry. This is the field where one studies the locus of solutions of sets of polynomial equations by combining the algebraic properties of the rings of polynomials with the geometric properties of this locus, known as a variety. Traditionally, this had meant complex solutions of polynomials with complex coefficients but just prior to Grothendieck’s work, Andre Weil and Oscar Zariski had realized that much more scope and insight was gained by considering solutions and polynomials over arbitrary fields, e.g. finite fields or algebraic number fields.

The proper foundations of the enlarged view of algebraic geometry were, however, unclear and this is how Grothendieck made his first, hugely significant, innovation: he invented a class of geometric structures generalizing varieties that he called schemes. In simplest terms, he proposed attaching to any commutative ring (any set of things for which addition, subtraction and a commutative multiplication are defined, like the set of integers, or the set of polynomials in variables x,y,z with complex number coefficients) a geometric object, called the Spec of the ring (short for spectrum) or an affine scheme, and patching or gluing together these objects to form the scheme. The ring is to be thought of as the set of functions on its affine scheme.

To illustrate how revolutionary this was, a ring can be formed by starting with a field, say the field of real numbers, and adjoining a quantity $\epsilon$ satisfying $\epsilon^2=0$. Think of $\epsilon$ this way: your instruments might allow you to measure a small number such as $\epsilon=0.001$ but then $\epsilon^2=0.000001$ might be too small to measure, so there’s no harm if we set it equal to zero. The numbers in this ring are $a+b \cdot \epsilon$ real a,b. The geometric object to which this ring corresponds is an infinitesimal vector, a point which can move infinitesimally but to second order only. In effect, he is going back to Leibniz and making infinitesimals into actual objects that can be manipulated. A related idea has recently been used in physics, for superstrings. To connect schemes to number theory, one takes the ring of integers. The corresponding Spec has one point for each prime, at which functions have values in the finite field of integers mod p and one classical point where functions have rational number values and that is ‘fatter’, having all the others in its closure. Once the machinery became familiar, very few doubted that he had found the right framework for algebraic geometry and it is now universally accepted.

Going further in abstraction, Grothendieck used the web of associated maps — called morphisms — from a variable scheme to a fixed one to describe schemes as functors and noted that many functors that were not obviously schemes at all arose in algebraic geometry. This is similar in science to having many experiments measuring some object from which the unknown real thing is pieced together or even finding something unexpected from its influence on known things. He applied this to construct new schemes, leading to new types of objects called stacks whose functors were precisely characterized later by Michael Artin.

His best known work is his attack on the geometry of schemes and varieties by finding ways to compute their most important topological invariant, their cohomology. A simple example is the topology of a plane minus its origin. Using complex coordinates (z,w), a plane has four real dimensions and taking out a point, what’s left is topologically a three dimensional sphere. Following the inspired suggestions of Grothendieck, Artin was able to show how with algebra alone that a suitably defined third cohomology group of this space has one generator, that is the sphere lives algebraically too. Together they developed what is called étale cohomology at a famous IHES seminar. Grothendieck went on to solve various deep conjectures of Weil, develop crystalline cohomology and a meta-theory of cohomologies called motives with a brilliant group of collaborators whom he drew in at this time.

In 1969, for reasons not entirely clear to anyone, he left the IHES where he had done all this work and plunged into an ecological/political campaign that he called Survivre. With a breathtakingly naive spririt (that had served him well doing math) he believed he could start a movement that would change the world. But when he saw this was not succeeding, he returned to math, teaching at the University of Montpellier. There he formulated remarkable visions of yet deeper structures connecting algebra and geometry, e.g. the symmetry group of the set of all algebraic numbers (known as its Galois group $Gal(\overline{\mathbb{Q}}/\mathbb{Q})$) and graphs drawn on compact surfaces that he called ‘dessin d’enfants’. Despite his writing thousand page treatises on this, still unpublished, his research program was only meagerly funded by the CNRS (Centre Nationale de Recherche Scientifique) and he accused the math world of being totally corrupt. For the last two decades of his life he broke with the whole world and sought total solitude in the small village of Lasserre in the foothills of the Pyrenees. Here he lived alone in his own mental and spiritual world, writing remarkable self-analytic works. He died nearby on Nov. 13, 2014.

As a friend, Grothendieck could be very warm, yet the nightmares of his childhood had left him a very complex person. He was unique in almost every way. His intensity and naivety enabled him to recast the foundations of large parts of 21st century math using unique insights that still amaze today. The power and beauty of Grothendieck’s work on schemes, functors, cohomology, etc. is such that these concepts have come to be the basis of much of math today. The dreams of his later work still stand as challenges to his successors.

Mumford goes on in his blog post to describe the reasons Nature gave for rejecting the obituary. He writes:

The sad thing is that this was rejected as much too technical for their readership. Their editor wrote me that ‘higher degree polynomials’, ‘infinitesimal vectors’ and ‘complex space’ (even complex numbers) were things at least half their readership had never come across. The gap between the world I have lived in and that even of scientists has never seemed larger. I am prepared for lawyers and business people to say they hated math and not to remember any math beyond arithmetic, but this!? Nature is read only by people belonging to the acronym ‘STEM’ (= Science, Technology, Engineering and Mathematics) and in the Common Core Standards, all such people are expected to learn a hell of a lot of math. Very depressing.

I don’t know if the Nature editor had biologists in mind when rejecting the Grothendieck obituary, but Mumford certainly thought so, as he sarcastically titled his post “Can one explain schemes to biologists?” Sadly, I think that Nature and Mumford both missed the point.

Exactly ten years ago Bernd Sturmfels and I published a book titled “Algebraic Statistics for Computational Biology“. From my perspective, the book developed three related ideas: 1. that the language, techniques and theorems of algebraic geometry both unify and provide tools for certain models in statistics, 2. that problems in computational biology are particularly prone to depend on inference with precisely the statistical models amenable to algebraic analysis and (most importantly) 3. mathematical thinking, by way of considering useful generalizations of seemingly unrelated ideas, is a powerful approach for organizing many concepts in (computational) biology, especially in genetics and genomics.

To give a concrete example of what 1, 2 and 3 mean, I turn to Mumford’s definition of algebraic geometry in his obituary for Grothendieck. He writes that “This is the field where one studies the locus of solutions of sets of polynomial equations by combining the algebraic properties of the rings of polynomials with the geometric properties of this locus, known as a variety.” What is he talking about? The notion of “phylogenetic invariants” provides a simple example for biologists by biologists. Phylogenetic invariants were first introduced to biology ca. 1987 by Joe Felsenstein (Professor of Genome Sciences and Biology at the University of Washington) and James Lake (Distinguished Professor of Molecular, Cell, and Developmental Biology and of Human Genetics at UCLA)3.

Given a phylogenetic tree describing the evolutionary relationship among n extant species, one can examine the evolution of a single nucleotide along the tree. At the leaves, a single nucleotide is then associated to each species, collectively forming a single selection from among the $4^n$ possible patterns for nucleotides at the leaves. Evolutionary models provide a way to formalize the intuitive notion that random mutations should be associated with branches of the tree and formally are described via (unknown) parameters that can be used to calculate a probability for any pattern at the leaves. It happens that most phylogenetic evolutionary models have the property that the probabilities for leaf patterns are polynomials in the parameters. The simplest example to consider is the tree with an ancestral node and two leaves corresponding to two extant species, say “B” and “M”:

The molecular approach to evolution posits that multiple sites together should be used both to estimate parameters associated with evolution along the tree, and maybe even the tree itself. If one assumes that nucleotides mutate according to the 4-state general Markov model with independent processes on each branch, and one writes $p_{ij}$ for $\mathbb{P}(B=i,M=j)$ where i,j are one of A,C,G,T, then it must be the case that $p_{ij}p_{kl} = p_{il}p_{kj}$. Equivalently, the pattern probabilities satisfy the polynomial equation

$p_{ij}p_{kl} - p_{il}p_{kj}=0$.

In other words, for any parameters in the 4-state general Markov model, it has to be the case that when the pattern probabilities are plugged into the polynomial equation above, the result is zero. This equation is none other than the condition for two random variables to be independent; in this case the random variable corresponding to the nucleotide at B is independent of the random variable corresponding to the nucleotide at M.
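The link between independence and the vanishing of such polynomials can be checked numerically. The sketch below is only an illustration with made-up marginal distributions (the values of `a` and `b` are arbitrary assumptions, and the function names are hypothetical): if the joint probabilities factor as a product of a probability for the nucleotide at B and a probability for the nucleotide at M, then every 2×2 minor of the 4×4 table of joint probabilities is zero, whereas a table with dependent entries has nonzero minors.

```python
from fractions import Fraction
from itertools import product

NUCLEOTIDES = "ACGT"


def product_distribution(a, b):
    """Joint probabilities p[i, j] = a[i] * b[j] for independent B and M."""
    return {(i, j): a[i] * b[j] for i in NUCLEOTIDES for j in NUCLEOTIDES}


def minors(p):
    """All 2x2 minors p[i,j]*p[k,l] - p[i,l]*p[k,j] of the joint table."""
    return {
        (i, j, k, l): p[i, j] * p[k, l] - p[i, l] * p[k, j]
        for i, j, k, l in product(NUCLEOTIDES, repeat=4)
    }


# Arbitrary illustrative marginals (assumptions, not estimates from data).
a = dict(zip(NUCLEOTIDES, [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]))
b = dict(zip(NUCLEOTIDES, [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]))
independent = product_distribution(a, b)

# A dependent example: B and M always carry the same nucleotide.
diagonal = {
    (i, j): Fraction(1, 4) if i == j else Fraction(0)
    for i in NUCLEOTIDES
    for j in NUCLEOTIDES
}

print(all(m == 0 for m in minors(independent).values()))  # True
print(minors(diagonal)["A", "A", "C", "C"])               # 1/16
```

Exact rational arithmetic is used so that the zeros are exact rather than subject to floating point rounding.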

The example is elementary, but it hints at a powerful tool for phylogenetics. It provides an equation that must be satisfied by the pattern probabilities that does not depend specifically on the parameters of the model (which can be intuitively understood as relating to branch length). If many sites are available so that pattern probabilities can be estimated empirically from data, then there is in principle a possibility for testing whether the data fits the topology of a specific tree regardless of what the branch lengths of the tree might be. Returning to Mumford’s description of algebraic geometry, the variety of interest is the geometric object in “pattern probability space” where points are precisely probabilities that can arise for a specific tree, and the “ring of polynomials with the geometric properties of the locus” are the phylogenetic invariants. The relevance of the ring lies in the fact that if $f$ and $g$ are two phylogenetic invariants then that means that $f(P)=0$ and $g(P)=0$ for any pattern probabilities from the model, so therefore $f+g$ is also a phylogenetic invariant because $f(P)+g(P)=0$ for any pattern probabilities from the model (the same is true for $c \cdot f$ for any constant c). In other words, there is an algebra of phylogenetic invariants that is closely related to the geometry of pattern probabilities. As Mumford and Tate explain, Grothendieck figured out the right generalizations to construct a theory for any ring, not just the ring of polynomials, and therewith connected the fields of commutative algebra, algebraic geometry and number theory.
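To make the idea of estimating pattern probabilities from many sites concrete, here is a minimal sketch for the two-leaf example. The sequences are toy data invented for illustration, the function names are hypothetical, and the polynomials evaluated are the 2×2 minors of the table of pattern probabilities (the independence condition). With real data the residual would not be exactly zero because of sampling noise, so an actual test would need a proper statistic; this only shows the mechanics.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

NUCLEOTIDES = "ACGT"


def empirical_pattern_probabilities(seq_b, seq_m):
    """Estimate p[i, j] as the fraction of aligned sites with i in B and j in M."""
    if len(seq_b) != len(seq_m) or not seq_b:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    counts = Counter(zip(seq_b, seq_m))
    n = len(seq_b)
    return {(i, j): Fraction(counts[i, j], n) for i, j in product(NUCLEOTIDES, repeat=2)}


def max_invariant_residual(p):
    """Largest absolute value of p[i,j]*p[k,l] - p[i,l]*p[k,j] over all indices."""
    return max(
        abs(p[i, j] * p[k, l] - p[i, l] * p[k, j])
        for i, j, k, l in product(NUCLEOTIDES, repeat=4)
    )


# Toy alignment in which the two sites vary independently of each other.
p_independent = empirical_pattern_probabilities("AACC", "ACAC")
# Toy alignment in which B and M are identical at every site.
p_identical = empirical_pattern_probabilities("ACGT" * 5, "ACGT" * 5)

print(max_invariant_residual(p_independent))  # 0
print(max_invariant_residual(p_identical))    # 1/16
```

The first alignment sits exactly on the locus cut out by the polynomials, while the second is far from it, which is the kind of distinction an invariant-based test would try to draw.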

The use of phylogenetic invariants for testing tree topologies is elegantly illustrated in a wonderful book chapter on phylogenetic invariants by mathematicians Elizabeth Allman and John Rhodes that starts with the simple example of the two taxa tree and delves deeply into the subject. Two surfaces (conceptually) represent the varieties for two trees, and the equations $f_1(P)=f_2(P)=\ldots=f_l(P)=0$ and $h_1(P)=h_2(P)=\ldots=h_k(P)=0$ are the phylogenetic invariants. The empirical pattern probability distribution is the point $\hat{P}$ and the goal is to find the surface it is close to:

Figure 4.2 from Allman and Rhodes chapter on phylogenetic invariants.

Of course for large trees there will be many different phylogenetic invariants, and the polynomials may be of high degree. Figuring out what the invariants are, how many of them there are, bounds for the degrees, understanding the geometry, and developing tests based on the invariants, is essentially a (difficult unsolved) challenge for algebraic geometers. I think it’s fair to say that our book spurred a lot of research on the subject, and helped to create interest among mathematicians who were unaware of the variety and complexity of problems arising from phylogenetics. Nick Eriksson, Kristian Ranestad, Bernd Sturmfels and Seth Sullivant wrote a short piece titled phylogenetic algebraic geometry which is an introduction for algebraic geometers to the subject. Here is where we come full circle to Mumford’s obituary… the notion of a scheme is obviously central to phylogenetic algebraic geometry. And the expository article just cited is just the beginning. There are too many exciting developments in phylogenetic geometry to summarize in this post, but Elizabeth Allman, Marta Casanellas, Joseph Landsberg, John Rhodes, Bernd Sturmfels and Seth Sullivant are just a few of many who have discovered beautiful new mathematics motivated by the biology, and also have had an impact on biology with algebro-geometric tools. There is both theory (see this recent example) and application (see this recent example) coming out of phylogenetic algebraic geometry. More generally, algebraic statistics for computational biology is now a legitimate “field”, complete with a journal, regular conferences, and a critical mass of mathematicians, statisticians, and even some biologists working in the area. Some of the results are truly beautiful and impressive. My favorite recent one is this paper by Caroline Uhler, Donald Richards and Piotr Zwiernik providing important guarantees for maximum likelihood estimation of parameters in Felsenstein’s continuous character model.

But that is not the point here. First, Mumford’s sarcasm was unwarranted. Biologists certainly didn’t discover schemes but as Felsenstein and Lake’s work shows, they did (re)discover algebraic geometry. Moreover, all of the people mentioned above can explain schemes to biologists, thereby answering Mumford’s question in the affirmative. Many of them have not only collaborated with biologists but written biology papers. And among them are some extraordinary expositors, notably Bernd Sturmfels. Still, even if there are mathematicians able and willing to explain schemes to biologists, and even if there are areas within biology where schemes arise (e.g. phylogenetic algebraic geometry), it is fair to ask whether biologists should care to understand them.

The answer to the question is: probably not. In any case I wouldn’t presume to opine on what biologists should and shouldn’t care about. Biology is enormous, and encompasses everything from the study of fecal transplants to the wood frogs of Alaska. However I do have an opinion about the area I work in, namely genomics. When it comes to genomics, journalists write about revolutions, personalized precision medicine, curing cancer and data deluge. But the biology of genomics is for real, and it is indeed tremendously exciting as a result of dramatic improvements in underlying technologies (e.g. DNA sequencing and genome editing to name two). I also believe it is true that despite what is written about data deluge, experiments remain the primary, and the best, way to elucidate the function of the genome. Data analysis is secondary. But it is true that statistics has become much more important to genomics than it was even to population genetics at the time of R.A. Fisher, computer science is playing an increasingly important role, and I believe that somewhere in the mix of “quantitative sciences for biology”, there is an important role for mathematics.

What biologists should appreciate, what was on offer in Mumford’s obituary, and what mathematicians can deliver to genomics that is special and unique, is the ability to not only generalize, but to do so “correctly”. The mathematician Raoul Bott once reminisced that “Grothendieck was extraordinary as he could play with concepts, and also was prepared to work very hard to make arguments almost tautological.” In other words, what made Grothendieck special was not that he generalized concepts in algebraic geometry to make them more abstract, but that he was able to do so in the right way. What made his insights seemingly tautological at the end of the day, was that he had the “right” way of viewing things and the “right” abstractions in mind. That is what mathematicians can contribute most of all to genomics. Of course sometimes theorems are important, or specific mathematical techniques solve problems and mathematicians are to thank for that. Phylogenetic invariants are important for phylogenetics which in turn is important for comparative genomics which in turn is important for functional genomics which in turn is important for medicine. But it is the abstract thinking that I think matters most. In other words, I agree with Charles Darwin that mathematicians are endowed with an extra sense… I am not sure exactly what he meant, but it is clear to me that it is the sense that allows for understanding the difference between the “right” way and the “wrong” way to think about something.

There are so many examples of how the “right” thinking has mattered in genomics that they are too numerous to list here, but here are a few samples: At the heart of molecular biology, there is the “right” and the “wrong” way to think about genes: evidently the message to be gleaned from Gerstein et al.’s “What is a gene post ENCODE? History and Definition” is that “genes” are not really the “right” level of granularity but transcripts are. In a previous blog post I’ve discussed the “right” way to think about the Needleman-Wunsch algorithm (tropically). In metagenomics there is the “right” abstraction with which to understand UniFrac. One paper I’ve written (with Niko Beerenwinkel and Bernd Sturmfels) is ostensibly about fitness landscapes but really about what we think the “right” way is to look at epistasis. In systems biology there is the “right” way to think about stochasticity in expression (although I plan a blog post that digs a bit deeper). There are many, many more examples… way too many to list here… because ultimately every problem in biology is just like in math… there is the “right” and the “wrong” way to think about it, and figuring out the difference is truly an art that mathematicians, the type of mathematicians that work in math departments, are particularly good at.

Here is a current example from (computational) biology where it is not yet clear what “right” thinking should be despite the experts working hard at it, and that is useful to highlight because of the people involved: With the vast amount of human genomes being sequenced (some estimates are as high as 400,000 in the coming year), there is an increasingly pressing fundamental question about how the (human) genome should be represented and stored. This is ostensibly a computer science question: genomes should perhaps be compressed in ways that allow for efficient search and retrieval, but I’d argue that fundamentally it is a math question. This is because what the question is really asking, is how should one think about genome sequences related mostly via recombination and only slightly by mutation, and what are the “right” mathematical structures for this challenge? The answer matters not only for the technology (how to store genomes), but much more importantly for the foundations of population and statistical genetics. Without the right abstractions for genomes, the task of coherently organizing and interpreting genomic information is hopeless. David Haussler (with coauthors) and Richard Durbin have both written about this problem in papers that are hard to describe in any way other than as math papers; see Mapping to a Reference Genome Structure and Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Perhaps it is no coincidence that both David Haussler and Richard Durbin studied mathematics.

But neither David Haussler nor Richard Durbin are faculty in mathematics departments. In fact, there is a surprisingly long list of very successful (computational) biologists specifically working in genomics, many of whom even continue to do math, but not in math departments, i.e. they are former mathematicians (this is so common there is even a phrase for it “recovering mathematician” as if being one is akin to alcoholism– physicists use the same language). People include Richard Durbin, Phil Green, David Haussler, Eric Lander, Montgomery Slatkin and many others I am omitting; for example almost the entire assembly group at the Broad Institute consists of former mathematicians. Why are there so many “formers” and very few “currents”? And does it matter? After all, it is legitimate to ask whether successful work in genomics is better suited to departments, institutes and companies outside the realm of academic mathematics. It is certainly the case that to do mathematics, or to publish mathematical results, one does not need to be a faculty member in a mathematics department. I’ve thought a lot about these issues and questions, partly because they affect my daily life working between the worlds of mathematics and molecular biology in my own institution. I’ve also seen the consequences of the separation of the two cultures. To illustrate how far apart they are I’ve made a list of specific differences below:

How then can biology, specifically genomics (or genetics), exist and thrive within the mathematics community? And how can mathematics find a place within the culture of biology?

I don’t know. The relationship between biology and mathematics is on the rocks and prospects are grim. Yes, there are biologists who do mathematical work, and yes, there are mathematical biologists, especially in areas such as evolution or ecology who are in math departments. There are certainly applied mathematics departments with faculty working on biology problems involving modeling at the macroscopic level, where the math fits in well with classic applied math (e.g. PDEs, numerical analysis). But there is very little genomics or genetics related math going on in math departments. And conversely, mathematicians who leave math departments to work in biology departments or institutes face enormous pressure to not focus on the math, or when they do any math at all, to not publish it (work is usually relegated to the supplement and completely ignored). The result is that biology loses out due to the minimal real contact with math– the special opportunity of benefiting from the extra sense is lost, and conversely math loses the opportunity to engage biology– one of the most exciting scientific enterprises of the 21st century. The mathematician Gian-Carlo Rota said that “The lack of real contact between mathematics and biology is either a tragedy, a scandal, or a challenge, it is hard to decide which”. He was right.

The extent to which the two cultures have drifted apart is astonishing. For example, visiting other universities I see the word “mathematics” almost every time precision medicine is discussed in the context of a new initiative, but I never see mathematicians or the local math department involved. In the mathematics community, there has been almost no effort to engage and embrace genomics. For example the annual joint AMS-MAA meetings always boast a series of invited talks, many on applications of math, but genomics is never a represented area. Yet in my Junior level course last semester on mathematical biology (taught in the math department) there were 46 students, more than any other upper division elective class in the math department. Even though I am a 50% member of the mathematics department I have been advising three math graduate students this year, equivalent to six for a full time member, a statistic that probably ranks me among the busiest advisors in the department (these numbers do not even reflect the fact that I had to turn down a number of students). Anecdotally, the numbers illustrate how popular genomics is among math undergraduate and graduate students, and although hard data is difficult to come by, my interactions with mathematicians everywhere convince me the trend I see at Berkeley is universal. So why is this popularity not reflected in support of genomics by the math community? And why don’t biology journals, conferences and departments embrace more mathematics? There is a hypocrisy of math for biology. People talk about it but when push comes to shove nobody wants to do anything real to foster it.

Examples abound. On December 16th UCLA announced the formation of a new Institute for Quantitative and Computational Biosciences. The announcement leads with a photograph of the director that is captioned “Alexander Hoffmann and his colleagues will collaborate with mathematicians to make sense of a tsunami of biological data.” Strangely though, the math department is not one of the 15 partner departments that will contribute to the Institute. That is not to say that mathematicians won’t interact with the Institute, or that mathematics won’t happen there. E.g., the Institute for Pure and Applied Mathematics is a partner as is the Biomathematics department (an interesting UCLA concoction), not to mention the fact that many of the affiliated faculty do work that is in part mathematical. But formal partnership with the mathematics department, and through it direct affiliation with the mathematics community, is missing. UCLA’s math department is among the top in the world, and boasts a particularly robust applied mathematics program many of whose members work on mathematical biology. More importantly, the “pure” mathematicians at UCLA are first rate and one of them, Terence Tao, is possibly the most talented mathematician alive. Wouldn’t it be great if he could be coaxed to think about some of the profound questions of biology? Wouldn’t it be awesome if mathematicians in the math department at UCLA worked hard with the biologists to tackle the extraordinary challenges of “precision medicine”? Wouldn’t it be wonderful if UCLA’s Quantitative and Computational biosciences Institute could benefit from the vast mathematics talent pool not only at UCLA but beyond: that of the entire mathematics community?

I don’t know if the omission of the math department was an accidental oversight of the Institute, a deliberate snub, or if it was the Institute that was rebuffed by the mathematics department. I don’t think it really matters. The point is that the UCLA situation is ubiquitous. Mathematics departments are almost never part of new initiatives in genomics; biologists are all too quick to glance the other way. Conversely, the mathematics community has shunned biologists. Despite two NSF Institutes dedicated to mathematical biology (the MBI and NIMBioS) almost no top math departments hire mathematicians working in genetics or genomics (see the mathematics jobs wiki). In the rooted tree in the figure above B can represent Biology and M can represent Mathematics and they truly, and sadly, are independent.

I get it. The laundry list of differences between biology and math that I aired above can be overwhelming. Real contact between the subjects will be difficult to foster, and it should be acknowledged that it is neither necessary nor sufficient for the science to progress. But wouldn’t it be better if mathematicians proved they are serious about biology and biologists truly experimented with mathematics?

Notes:

1. The opening paragraph is an edited copy of an excerpt (page 2, paragraph 2) from C.P. Snow’s “The Two Cultures and The Scientific Revolution” (The Rede Lecture 1959).
2. David Mumford’s content on his site is available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License, and I have incorporated it in my post (boxed text) unaltered according to the terms of the license.
3. The meaning of the word “invariant” in “phylogenetic invariants” differs from the standard meaning in mathematics, where invariant refers to a property of a class of objects that is unchanged under transformations. In the context of algebraic geometry classic invariant theory addresses the problem of determining polynomial functions that are invariant under transformations from a linear group. Mumford is known for his work on geometric invariant theory. An astute reader could therefore deduce from the term “phylogenetic invariants” that the term was coined by biologists.

Recent news of James Watson’s auction of his Nobel Prize medal has unearthed a very unpleasant memory for me.

In March 2004 I attended an invitation only genomics meeting at the famed Banbury Center at Cold Spring Harbor Laboratory. I had heard legendary stories about Banbury, and have to admit I felt honored and excited when I received the invitation. There were rumors that sometimes James Watson himself would attend meetings. The emails I received explaining the secretive policies of the Center only added to the allure. I felt that I had received an invitation to the genomics equivalent of Skull and Bones.

Although Watson did not end up attending the meeting, my high expectations were met when he did decide to drop in on dinner one evening at Robertson house. Without warning he seated himself at my table. I was in awe. The table was round with seating for six, and Honest Jim sat down right across from me. He spoke incessantly throughout dinner and we listened. Sadly though, most of the time he was spewing racist and misogynistic hate. I remember him asking rhetorically “who would want to adopt an Irish kid?” (followed by a tirade against the Irish that I later saw repeated in the news) and he made a point to disparage Rosalind Franklin referring to her derogatorily as “that woman”. No one at the table (myself included) said a word. I deeply regret that.

One of Watson’s obsessions has been to “improve” the “imperfect human” via human germline engineering. This is disturbing on many many levels. First, there is the fact that for years Watson presided over Cold Spring Harbor Laboratory which actually has a history as a center for eugenics. Then there are the numerous disparaging remarks by Watson about everyone who is not exactly like him, leaving little doubt about who he imagines the “perfect human” to be. But leaving aside creepy feelings… could he be right? Is the “perfect human” an American from Chicago of mixed Scottish/Irish ancestry? Should we look forward to a world filled with Watsons? I have recently undertaken a thought experiment along these lines that I describe below. The result of the experiment is dedicated to James Watson on the occasion of his unbirthday today.

Introduction

SNPedia is an open database of 59,593 SNPs and their associations. A SNP entry includes fields for “magnitude” (a subjective measure of significance on a scale of 0–10) and “repute” (good or bad), and allele classifications for many diseases and medical conditions. For example, the entry for a SNP (rs1799971) that associates with alcohol cravings describes the “normal” and “bad” alleles. In addition to associating with phenotypes, SNPs can also associate with populations. For example, as seen in the Geography of Genetic Variants Browser, rs1799971 allele frequencies vary greatly among Africans, Europeans and Asians. If the genotype of an individual is known at many SNPs, it is therefore possible to guess where they are from: in the case of rs1799971 someone who is A:A is a lot more likely to be African than Japanese, and with many SNPs the probabilities can narrow the location of an individual to a very specific geographic location. This is the principle behind the application of principal component analysis (PCA) to the study of populations. Together, SNPedia and PCA therefore provide a path to determining where a “perfect human” might be from:

1. Create a “perfect human” in silico by setting the alleles at all SNPs so that they are “good”.
2. Add the “perfect human” to a panel of genotyped individuals from across a variety of populations and perform PCA to reveal the location and population of origin of the individual.
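
The two steps above can be sketched in a few lines of Python. This is only an illustration on synthetic genotypes, not the analysis behind the plots in this post: the panel, the two populations, their allele frequencies, and the choice of which allele counts as “good” are all made up for the example, and the function names are hypothetical.

```python
import numpy as np

def pca_project(panel, extra, n_components=3):
    """Add one extra individual to the panel, run PCA, return all projections."""
    data = np.vstack([panel, extra])
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes of the centered genotype matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def nearest_neighbor(coords, query_index):
    """Index of the row closest (Euclidean distance) to row query_index."""
    dists = np.linalg.norm(coords - coords[query_index], axis=1)
    dists[query_index] = np.inf  # exclude the query itself
    return int(np.argmin(dists))

rng = np.random.default_rng(0)
n_snps = 500

# Assumption: two synthetic populations with independent allele frequencies.
freq_a = rng.uniform(0.05, 0.95, n_snps)
freq_b = rng.uniform(0.05, 0.95, n_snps)
pop_a = rng.binomial(2, freq_a, size=(40, n_snps))
pop_b = rng.binomial(2, freq_b, size=(40, n_snps))
panel = np.vstack([pop_a, pop_b]).astype(float)

# Step 1: the in silico "perfect human" is homozygous for the "good" allele
# at every SNP. Which allele is "good" is chosen at random here (assumption).
good_is_counted_allele = rng.integers(0, 2, n_snps)
perfect = 2.0 * good_is_counted_allele

# Step 2: add the individual to the panel, perform PCA, find the neighbor.
coords = pca_project(panel, perfect)
perfect_index = len(panel)
idx = nearest_neighbor(coords[:, :2], perfect_index)
print("nearest panel individual on PC1/PC2:", idx)
```

Restricting the neighbor search to the first two components mirrors the plot of the 1st and 2nd components; as the 3rd component shows later in the post, which components are used matters a great deal.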

Results

After restricting the SNP set from SNPedia to those with green painted alleles, i.e. “good”, there are 4967 SNPs with which to construct the “perfect human” (available for download here).

A dataset of genotyped individuals can be obtained from the 1000 Genomes Project, including Africans, (indigenous) Americans, East Asians and Europeans.

The PCA plot (1st and 2nd components) showing all the individuals together with the “perfect human” (in pink; see arrow) is shown below:

The nearest neighbor to the “perfect human” is HG00737, a female who is Puerto Rican. One might imagine that such a person already existed, maybe Yuiza, the only female Taino Cacique (chief) in Puerto Rico’s history:

Samuel Lind’s ‘Yuiza’

But as the 3rd principal component shows, reifying the “perfect human” is a misleading undertaking:

Here the “perfect human” is revealed to be decidedly non-human. This is not surprising, and it reflects the fact that the alleles of the “perfect human” place it as a significant outlier to the human population. In fact, this is even more evident in the case of the “worst human”, namely the individual that has the “bad” allele at every SNP. A projection of that individual onto any combination of principal components shows them to be far removed from any actual human. The best visualization appears in the projection onto the 2nd and 3rd principal components, where they appear as a clear outlier (point labeled DYS), and diametrically opposite to Africans:

The fact that the “worst human” does not project well onto any of the principal components whereas the “perfect human” does is not hard to understand from basic population genetics principles. It is an interesting exercise that I leave to the reader.

Conclusion

The fact that the “perfect human” is Puerto Rican makes a lot of sense. Since many disease SNPs are population specific, it makes sense that an individual homozygous for all “good” alleles should be admixed. And that is exactly what Puerto Ricans are. In a “women in the diaspora” study, Puerto Rican women born on the island but living in the United States were shown to be 53.3±2.8% European, 29.1±2.3% West African, and 17.6±2.4% Native American. In other words, to collect all the “good” alleles it is necessary to be admixed, but admixture itself is not sufficient for perfection. On a personal note, I was happy to see population genetic evidence supporting my admiration for the perennial championship Puerto Rico All Stars team:

As for Watson, it seems fitting that he should donate the proceeds of his auction to the Caribbean Genome Center at the University of Puerto Rico.

[Update: Dec. 7/8: Taras Oleksyk from the Department of Biology at the University of Puerto Rico Mayaguez has written an excellent post-publication peer review of this blog post and Rafael Irizarry from the Harvard School of Public Health has written a similar piece, Genéticamente, no hay tal cosa como la raza puertorriqueña in Spanish. Both are essential reading.]

Earlier this week US News and World Report (USNWR) released, for the first time, a global ranking of universities including rankings by subject area. In mathematics, the top ten universities are:

1. Berkeley
2. Stanford
3. Princeton
4. UCLA
5. University of Oxford
6. Harvard
7. King Abdulaziz University
8. Pierre and Marie Curie – Paris 6
9. University of Hong Kong
10. University of Cambridge

The past few days I’ve received a lot of email from colleagues and administrators about this ranking, and also the overall global ranking of USNWR in which Berkeley was #1. The emails generally say something to the effect of “of course rankings are not perfect, everybody knows… but look, we are amazing!”

BUT, one of the top math departments in the world, the math department at the Massachusetts Institute of Technology is ranked #11… they didn’t even make the top ten. Even more surprising is the entry at #7 that I have boldfaced: the math department at King Abdulaziz University (KAU) in Jeddah, Saudi Arabia. I’ve been in the math department at Berkeley for 15 years, and during this entire time I’ve never (to my knowledge) met a person from their math department and I don’t recall seeing a job application from any of their graduates… I honestly had never heard of the university in any scientific context. I’ve heard plenty about KAUST (the King Abdullah University of Science and Technology) during the past few years, especially because it is the first mixed-gender university campus in Saudi Arabia, is developing a robust research program based on serious faculty hires from overseas, and in a high profile move hired former Caltech president Jean-Lou Chameau to run the school. But KAU is not KAUST.

A quick Google search reveals that although KAU is nearby in Jeddah, it is a very different type of institution. It has two separate campuses for men and women. Although it was established in 1967 (Osama Bin Laden was a student there in 1975) its math department started a Ph.D. program only two years ago. According to the math department website, the chair of the department, Prof. Abdullah Mathker Alotaibi, is a 2005 Ph.D. with zero publications [Update: Nov. 10: This initial claim was based on a Google Scholar search of his full name; a reader commented below that he has published and that this claim was incorrect. Nevertheless, I do not believe it in any way materially affects the points made in this post.] This department beat MIT math in the USNWR global rankings! Seriously?

The USNWR rankings are based on 8 attributes:

– global research reputation
– regional research reputation
– publications
– normalized citation impact
– total citations
– number of highly cited papers
– percentage of highly cited papers
– international collaboration

Although KAU’s full time faculty are not very highly cited, it has amassed a large adjunct faculty that helped them greatly in these categories. In fact, in “normalized citation impact” KAU’s math department is the top ranked in the world. This amazing statistic is due to the fact that KAU employs (as adjunct faculty) more than a quarter of the highly cited mathematicians at Thomson Reuters. How did a single university assemble a group with such a large proportion of the world’s prolific (according to Thomson Reuters) mathematicians? (When I first heard this statistic from Iddo Friedberg via Twitter I didn’t believe it and had to go compute it myself from the data on the website. I guess I believe it now but I still can’t believe it!!)

In 2011 Yudhijit Bhattacharjee published an article in Science titled “Saudi Universities Offer Cash in Exchange for Academic Prestige” that describes how KAU is targeting highly cited professors for adjunct faculty positions. According to the article, professors are hired as adjunct professors at KAU for $72,000 per year in return for agreeing (apparently by contract) to add KAU as a secondary affiliation at ISIhighlycited.com and for adding KAU as an affiliation on their published papers. Annual visits to KAU are apparently also part of the “deal” although it is unclear from the Science article whether these actually happen regularly or not.

[UPDATE Oct 31, 12:14pm: A friend who was solicited by KAU sent me the invitation email with the contract that KAU sends to potential “Distinguished Adjunct Professors”. The details are exactly as described in the Bhattacharjee article:

From: "Dr. Mansour Almazroui" <ceccr@kau.edu.sa>
Date: XXXX
To: XXXX <XXXX>
Subject: Re: Invitation to Join “International Affiliation Program” at King Abdulaziz University, Jeddah Saudi Arabia

Dear Prof. XXXX ,

Hope this email finds you in good health. Thank you for your interest. Please find below the information you requested to be a “Distinguished Adjunct Professor” at KAU.

1. Joining our program will put you on an annual contract initially for one year but further renewable. However, either party can terminate its association with one month prior notice.
2. The Salary per month is $6000 for the period of contract.
3. You will be required to work at KAU premises for three weeks in each contract year. For this you will be accorded with expected three visits to KAU.
4. Each visit will be at least for one week long but extendable as suited for research needs.
5. Air tickets entitlement will be in Business-class and stay in Jeddah will be in a five star hotel. The KAU will cover all travel and living
6. You have to collaborate with KAU local researchers to work on KAU funded (up to $100,000.00) projects.
7. It is highly recommended to work with KAU researchers to submit an external funded project by different agencies in Saudi Arabia.
8. May submit an international patent.
9. It is expected to publish some papers in ISI journals with KAU affiliation.
10. You will be required to amend your ISI highly cited affiliation details at the ISI highlycited.com web site to include your employment and affiliation with KAU.

Kindly let me know your acceptance so that the official contract may be preceded.

Sincerely,

Mansour ]

The publication of the Science article elicited a strong rebuttal from KAU on the comments section, where it was vociferously argued that the hiring of distinguished foreign scholars was aimed at creating legitimate research collaborations, and was not merely a gimmick for increasing citation counts. Moreover, some of the faculty who had signed on defended the decision in the article. For example, Neil Robertson, a distinguished graph theorist (of Robertson-Seymour graph minors fame) explained that “it’s just capitalism,” and “they have the capital and they want to build something out of it.” He added that “visibility is very important to them, but they also want to start a Ph.D. program in mathematics,” (they did do that in 2012) and he added that he felt that “this might be a breath of fresh air in a closed society.” It is interesting to note that despite his initial enthusiasm and optimism, Professor Robertson is no longer associated with KAU.

In light of the high math ranking of KAU in the current USNWR I decided to take a closer look at who KAU has been hiring, why, and for what purpose, i.e. I decided to conduct post-publication peer review of the Bhattacharjee Science paper. A web page at KAU lists current “Distinguished Scientists” and another page lists “Former Distinguished Adjunct Professors”.
One immediate observation is that out of 118 names on these pages there is 1 woman (Cheryl Praeger from the University of Western Australia). Given that KAU has two separate campuses for men and women, it is perhaps not surprising that women are not rushing to sign on, and perhaps KAU is also not rushing to invite them (I don’t have any information one way or another, but the underrepresentation seems significant). Aside from these faculty, there is also a program aptly named the “Highly Cited Researcher Program” that is part of the Center for Excellence in Genomic Medicine Research. Fourteen faculty are listed there (all men, zero women). But guided by the Science article which described the contract requirement that researchers add KAU to their ISI affiliation, I checked for adjunct KAU faculty at Thomson-Reuters ResearcherID.com and there I found what appears to be the definitive list. Although Neil Robertson has left KAU, he has been replaced by another distinguished graph theorist, namely Carsten Thomassen (no accident as his wikipedia page reveals that “He was included on the ISI Web of Knowledge list of the 250 most cited mathematicians.”) This is a name I immediately recognized due to my background in combinatorics; in fact I read a number of Thomassen’s papers as a graduate student. I decided to check whether it is true that adjunct faculty are adding KAU as an affiliation on their articles. Indeed, Thomassen has done exactly that in his latest publication Strongly 2-connected orientations of graphs published this year in the Journal of Combinatorial Theory Series B. At this point I started having serious reservations about the ethics of faculty who have agreed to be adjuncts at KAU. Regardless of the motivation of KAU in hiring adjunct highly cited foreign faculty, it seems highly inappropriate for a faculty member to list an affiliation on a paper to an institution to which they have no scientific connection whatsoever. 
I find it very hard to believe that serious graph theory is being researched at KAU, an institution that didn’t even have a Ph.D. program until 2012. It is inconceivable that Thomassen joined KAU in order to find collaborators there (he mostly publishes alone), or that he suddenly found a great urge to teach graph theory in Saudi Arabia (KAU had no Ph.D. program until 2012). The problem is also apparent when looking at the papers of researchers in genomics/computational biology that are adjuncts at KAU. I recognized a number of such faculty members, including high-profile names from my field such as Jun Wang, Manolis Dermitzakis and John Huelsenbeck. I was surprised to see their names (none of these faculty mention KAU on their websites) yet in each case I found multiple papers they have authored during the past year in which they list the KAU affiliation. I can only wonder whether their home institutions find this appropriate. Then again, maybe KAU is also paying the actual universities the faculty they are citation borrowing belong to? But assume for a moment that they aren’t, then why should institutions share the credit they deserve for supporting their faculty members by providing them space, infrastructure, staff and students with KAU? What exactly did KAU contribute to Kilpinen et al. Coordinated effects of sequence variation on DNA binding, chromatin structure and transcription, Science, 2013? Or to Landis et al. Bayesian analysis of biogeography when the number of areas is large, Systematic Biology, 2013? These papers have no authors or apparent contribution from KAU. Just the joint affiliation of the adjunct faculty member. The limit of the question arises in the case of Jun Wang, director of the Beijing Genome Institute, whose affiliations are BGI (60%), University of Copenhagen (15%), King Abdulaziz University (15%), The University of Hong Kong (5%), Macau University of Science and Technology (5%). Should he also acknowledge the airlines he flies on? 
Should there not be some limit on the number of affiliations of an individual? Shouldn’t journals have a policy about when it is legitimate to list a university as an affiliation for an author? (e.g. the author must have in some significant way been working at the institution). Another, bigger, disgrace that emerged in my examination of the KAU adjunct faculty is the issue of women. Aside from the complete lack of women in the “Highly Cited Researcher Program”, I found that most of the genomics adjunct faculty hired via the program will be attending an all-male conference in three weeks. The “Third International Conference on Genomic Medicine” will be held from November 17–20th at KAU. This conference has zero women. The same meeting last year… had zero women. I cannot understand how in 2014, at a time when many are speaking out strongly about the urgency of supporting females in STEM and in particular about balancing meetings, a bunch of men are willing to forgo all considerations of gender equality for the price of ~$3 per citation per year (a rough calculation using the figure of $72,000 per year from the Bhattacharjee paper and 24,000 citations for a highly cited researcher). To be clear I have no personal knowledge about whether the people I’ve mentioned in this article are actually being paid or how much, but even if they are being paid zero it is not ok to participate in such meetings. Maybe once (you didn’t know what you are getting into), but twice?!

As for KAU, it seems clear based on the name of the “Highly Cited Researcher Program” and the fact that they advertise their rankings that they are specifically targeting highly cited researchers much more for their delivery of their citations than for development of genuine collaborations (looking at the adjunct faculty I failed to see any theme or concentration of people in any single area as would be expected in building a coherent research program). However I do not fault KAU for the goal of increasing the ranking of their institution. I can see an argument for deliberately increasing rankings in order to attract better students, which in turn can attract faculty. I do think that three years after the publication of the Science article, it is worth taking a closer look at the effects of the program (rankings have increased considerably but it is not clear that research output from individuals based at KAU has increased), and whether this is indeed the most effective way to use money to improve the quality of research institutions. The existence of KAUST lends credence to the idea that the king of Saudi Arabia is genuinely interested in developing Science in the country, and there is a legitimate research question as to how to do so with the existing resources and infrastructure. Regardless of how things ought to be done, the current KAU emphasis on rankings is a reflection of the rankings, which USNWR has jumped into with its latest worldwide ranking. The story of KAU is just evidence of a bad problem getting worse. I have previously thought about the bad version of the problem:

A few years ago I wrote a short paper with my (now former) student Peter Huggins on university rankings:

P. Huggins and L.P., Selecting universities: personal preferences and rankings, arXiv, 2008.

It exists only as an arXiv preprint as we never found a suitable venue for publication (this is code for the paper was rejected upon peer review; no one seemed interested in finding out the extent to which the data behind rankings can produce a multitude of stories). The article addresses a simple question: given that various attributes have been measured for a bunch of universities, and assuming they are combined (linearly) into a score used to produce rankings, how do the rankings depend on the weightings of the individual attributes? The mathematics is that of polyhedral geometry, where the problem is to compute a normal fan of a polytope whose vertices encode all the possible rankings that can be obtained for all possible weightings of the attributes (an object we called the unitope). An example is shown below, indicating the possible rankings as determined by weightings chosen among three attributes measured by USNWR (freshman retention, selectivity, peer assessment). It is important to keep in mind this is data from 2007-2008.

Our paper had an obvious but important message: rankings can be very sensitive to the attribute weightings. Of course some schools such as Harvard came out on top regardless of attribute preferences, but some schools, even top ranked schools, could shift by over 50 positions. Our conclusion was that although the data collected by USNWR was useful, the specific weighting chosen and the ranking it produced were not. Worse than that, sticking to a single choice of weightings was misleading at best, dangerous at worst.
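
To make the dependence on weightings concrete, here is a minimal sketch in Python. The three schools and their attribute scores are invented for illustration (they are not USNWR data), and the brute-force grid over weightings is a crude stand-in for the normal fan computation in the paper, not an implementation of it.

```python
from itertools import product

# Hypothetical scores for (freshman retention, selectivity, peer assessment).
scores = {
    "School A": (95, 60, 70),
    "School B": (70, 90, 65),
    "School C": (75, 70, 90),
}

def ranking(weights):
    """Rank schools by the weighted (linear) sum of their attribute scores."""
    def total(name):
        return sum(w * x for w, x in zip(weights, scores[name]))
    return tuple(sorted(scores, key=total, reverse=True))

def all_rankings(steps=20):
    """Collect the rankings seen over a grid of non-negative weights summing to 1."""
    seen = set()
    for i, j in product(range(steps + 1), repeat=2):
        if i + j <= steps:
            weights = (i / steps, j / steps, (steps - i - j) / steps)
            seen.add(ranking(weights))
    return seen

rankings = all_rankings()
for r in sorted(rankings):
    print(" > ".join(r))
```

Even with three schools and three attributes, different weightings put a different school on top, which is the sensitivity described above in miniature. A grid search like this scales poorly with the number of attributes, which is one reason to prefer the polyhedral description.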

I was reminded of this paper when looking at the math department rankings just published by USNWR. When I saw that KAU was #7 I was immediately suspicious, and even Berkeley’s #1 position bothered me (even though I am a faculty member in the department). I immediately guessed that they must have weighted citations heavily, because our math department has applied math faculty, and KAU has their “highly cited researcher program”. Averaging citations across faculty from different (math) disciplines is inherently unfair. In the case of Berkeley, my applied math colleague James Sethian has a paper on level set methods with more than 10,000 (Google Scholar) citations. This reflects the importance and advance of the paper, but also the huge field of users of the method (many, if not most, of the disciplines in engineering). On the other hand, my topology colleague Ian Agol’s most cited paper has just over 200 citations. This is very respectable for a mathematics paper, but even so it doesn’t come close to reflecting his true stature in the field, namely the person who settled the Virtually Haken Conjecture thereby completing a long standing program of William Thurston that resulted in many of the central open problems in mathematics (Thurston was also incidentally an adjunct faculty member at KAU for some time). In other words, not only are citations not everything, they can also be not anything. By comparing citations across math departments that are diverse to very differing degrees USNWR rendered the math ranking meaningless. Some of the other data collected, e.g. reputation, may be useful or relevant to some, and for completeness I’m including it with this post (here) in a form that allows for it to be examined properly (USNWR does not release it in the form of a table, but rather piecemeal within individual html pages on their site), but collating the data for each university into one number is problematic. 
In my paper with Peter Huggins we show both how to evaluate the sensitivity of rankings to weightings and also how to infer bounds on the weightings by USNWR from the rankings. It would be great if USNWR included the ability to perform such computations with their data directly on their website but there is a reason USNWR focuses on citations.

The impact factor of a journal is a measure of the average amount of citation per article. It is computed by averaging the citations over all articles published during the preceding two years, and its advertisement by journals reflects a publishing business model where demand for the journal comes from the impact factor, profit from free peer reviewing, and sales from closed subscription based access. Everyone knows the peer review system is broken, but it’s difficult to break free of when incentives are aligned to maintain it. Moreover, it leads to a perverse focus of academic departments on the journals their faculty are publishing in and the citations they accumulate. Rankings such as those by USNWR reflect the emphasis on citations that originates with the journals, and so one cannot fault USNWR for including it as a factor and weighting it highly in their rankings. Having said that, USNWR should have known better than to publish the KAU math rankings; in fact it appears their publication might be a bug. The math department rankings are the only rankings that appear for KAU. They have been omitted entirely from the global overall ranking and other departmental rankings (I wonder if this is because USNWR knows about the adjunct faculty purchase). In any case, the citation frenzy feeds departments that in aggregate form universities: universities such as King Abdulaziz that may reach the point where they feel compelled to enter into the market of citations to increase their overall profile…

I hope this post frightened you. It should. Happy Halloween!

[Update: Dec. 6: an article about KAU and citations has appeared in the Daily Cal, Jonathan Eisen posted his exchanges with KAU, and he has storified the tweets]

This year half of the Nobel prize in Physiology or Medicine was awarded to May-Britt Moser and Edvard Moser, who happen to be both a personal and professional couple. Interestingly, they are not the first but rather the fourth couple to win the prize jointly: in 1903 Marie Curie and Pierre Curie shared the Nobel prize in physics, in 1935 Frédéric Joliot and Irène Joliot-Curie shared the Nobel prize in chemistry, and in 1947 Carl Cori and Gerty Cori also shared the Nobel prize in physiology or medicine. It seems working on science with a spouse or partner can be a formula for success. Why then, when partners apply together for academic jobs, do universities refer to them as “two body problems”?

The “two-body problem” is a question in physics about the motions of pairs of celestial bodies that interact with each other gravitationally. It is a special case of the difficult “N-body problem” but simple enough that it is (completely) solved; in fact it was solved by Johann Bernoulli a few centuries ago. The use of the term in the context of academic job searches has always bothered me: it suggests that hiring in academia is an exercise in mathematical physics (it is certainly not!), and even if one presumes that it is, the term is an oxymoron because in physics the problem is solved whereas in academia it is used in a way that implies it is unsolvable. Countless times I have heard my colleagues sigh “so and so would be great but there is a two body problem”. Semantics aside, the allusion to highbrow physics problems in the process of academic hiring belies a complete lack of understanding of the basic mathematical notion of epistasis relevant in the consideration of joint applications, not to mention an undercurrent of sexism that plagues science and engineering departments everywhere. The results are poor hiring decisions, great harm to the academic prospects of partners and couples, and the imposition of stress and anxiety that harms the careers of those who are lucky enough to be hired by the flawed system.

I believe it was Aristotle who first used the phrase “the whole is greater than the sum of its parts”. The old adage remains true today: owning a pair of matching socks is more than twice as good as having just one sock. This is called positive epistasis, or synergy. Of course the opposite may be true as well: a pair of individuals trying to squeeze through a narrow doorway together will take more than twice as long as they would if they went through one at a time. This would be negative epistasis. There is a beautiful algebra and geometry associated with positive/negative epistasis that is useful to understand, because its generalizations reveal a complexity to epistasis that is very much at play in academia.

Formally, thinking of two “parts”, we can represent them as two bit strings: 01 for one part and 10 for the other. The string 00 represents the situation of having neither part, and 11 having both parts. A “fitness function” $f:\{0,1\}^2 \rightarrow \mathbb{R}_+$ assigns to each string a value. Epistasis is defined to be the sign of the linear form

$u=f(00)+f(11)-f(10)-f(01)$.

That is, $u>0$ is positive epistasis, $u<0$ is negative epistasis and $u=0$ is no epistasis. In the case where $f(00)=0$, “the whole is greater than the sum of its parts” means that $f(11) > f(10)+f(01)$ and “the whole is less than the sum of its parts” means $f(11) < f(10)+f(01)$. There is an accompanying geometry that consists of drawing a square in the x-y plane whose corners are labeled by $00,01,10$ and $11$. At each corner, the value of the function $f$ can be represented by a height along the z axis, as shown in the example below:

The black line dividing the square into two triangles comes about by imagining that there are poles at the corners of the square, of height equal to the fitness value, and then that a tablecloth is draped over the poles and stretched taut. The picture above then corresponds to the leftmost panel below:

The crease is the result of projecting down onto the square the “fold” in the tablecloth (assuming there is a fold). In other words, positive and negative epistasis can be thought of as corresponding to one of the two triangulations of the square. This is the geometry of two parts but what about n parts? We can similarly represent them by bit strings $100 \cdots 0, 010 \cdots 0, 001 \cdots 0, \ldots, 000 \cdots 1$ with the “whole” corresponding to $111 \cdots 1$. Assuming that the parts can only be added up all together, the geometry now works out to be that of triangulations of the hyperbipyramid; the case $n=3$ is shown below:

“The whole is greater than the sum of its parts”: the superior-inferior slice.

“The whole is less than the sum of its parts”: the transverse slice.
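As a minimal sketch of the definitions above, the linear form $u$, its sign, and the diagonal along which the tablecloth creases can be computed directly. The function names and the toy fitness values are hypothetical, chosen only for illustration. Two assumptions are baked in: the draped tablecloth is treated as the upper hull of the four poles, so the crease runs along whichever diagonal has the larger sum of fitness values; and the $n$-part form $f(1\cdots1) + (n-1)f(0\cdots0) - \sum_i f(e_i)$ is derived here from the affine relation among the corners of the hyperbipyramid rather than stated in the text (it reduces to $u$ when $n=2$).

```python
def epistasis(f):
    """Two-part epistasis u = f(00) + f(11) - f(10) - f(01).

    f is a dict mapping the bit strings "00", "01", "10", "11" to fitness values.
    """
    return f["00"] + f["11"] - f["10"] - f["01"]


def sign_label(u):
    """Name the sign of an epistasis value."""
    if u > 0:
        return "positive"
    if u < 0:
        return "negative"
    return "none"


def crease(f):
    """Diagonal of the square under the fold of the draped tablecloth.

    Assumes the tablecloth is the upper hull of the poles: the fold lies over
    the diagonal whose endpoints have the larger fitness sum. Returns None
    when u = 0 (the cloth is flat and there is no fold).
    """
    u = epistasis(f)
    if u > 0:
        return ("00", "11")
    if u < 0:
        return ("01", "10")
    return None


def whole_vs_parts(f, n):
    """Assumed n-part analogue: f(1...1) + (n-1) f(0...0) - sum_i f(e_i)."""
    zero = "0" * n
    whole = "1" * n
    parts = ["0" * i + "1" + "0" * (n - i - 1) for i in range(n)]
    return f[whole] + (n - 1) * f[zero] - sum(f[p] for p in parts)


# Hypothetical values: one sock is worth 1, a matching pair is worth 3.
socks = {"00": 0, "01": 1, "10": 1, "11": 3}
# Hypothetical values: two people squeezing through a doorway together.
doorway = {"00": 0, "01": 2, "10": 2, "11": 3}

for name, f in [("socks", socks), ("doorway", doorway)]:
    u = epistasis(f)
    print(name, u, sign_label(u), crease(f))
```

With these toy numbers the socks come out positively epistatic and the doorway negatively epistatic, matching the two adages.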

With multiple parts epistasis can become more complicated if one allows for arbitrary combinations of parts. In a paper written jointly with Niko Beerenwinkel and Bernd Sturmfels titled “Epistasis and shapes of fitness landscapes”, we developed the mathematics for the general case and showed that epistasis among objects allowed to combine in all possible ways corresponds to the different triangulations of a hypercube. For example, in the case of three objects, the square is replaced by the cube with eight corners corresponding to the eight bit strings of length 3. There are 74 triangulations of the cube, falling into 6 symmetry classes. The complete classification is shown below (for details on the meaning of the GKZ vectors and out-edges see the paper):

There is a beautiful geometry describing how the different epistatic shapes (or triangulations) are related, which is known as the secondary polytope. Its vertices correspond to the triangulations and two are connected by an edge when they are the same except for the “flip” of one pair of neighboring tetrahedra. The case of the cube is shown below:

The point of the geometry, and its connection to academic epistasis that I want to highlight in this post, is made clear when considering the case of $n=4$. In that case the number of different types of epistatic interactions is given by the number of triangulations of the 4-cube. There are 87,959,448 triangulations and 235,277 symmetry types! In other words, the intuition from two parts that “interaction” can be positive, negative or neutral is difficult to generalize without math, and the point is that there are myriad ways the faculty in a large department can be interacting, both to the benefit and the detriment of their overall scientific output.

In many searches I’ve been involved in the stated principle for hiring is “let’s hire the best person”. Sometimes the search may be restricted to a field, but it is not uncommon that the search is open. Such a hiring policy deliberately ignores epistasis, and I think it’s crazy, not to mention sexist, because the policy affects and hurts women applicants far more than it does men. This is not because women are less likely to be “the best” in their field; in fact quite the opposite. It is very common for women in academia to be partnered with men who are also in academia, and inevitably they suffer for that fact because departments have a hard time reconciling that both could be “the best”. There are also many reasons for departments to think epistatically that go beyond basic fairness principles. For example, in the case of partners that are applying together to a university, even if they are not working together on research, it is likely that each one will be far more productive if the other has a stable job at the same institution. It is difficult to manage a family if one partner needs to commute hours, or in some cases days, to work. I know of a number of couples in academia that have jobs in different states.

In the last few years a few couples have been bold enough to openly declare themselves “positively epistatic”. What I mean is that they apply jointly as a single applicant, or “joint lab” in the case of biology. For example, there is the case of the Altschuler-Wu lab that has recently relocated to UCSF or the Eddy-Rivas lab that is relocating to Harvard. Still, such cases are few and far between, and for the most part hiring is inefficient, clumsy and unfair (it is also worth noting that there are many other epistatic factors that can and should be considered, for example the field someone is working in, collaborators, etc.).

Epistasis has been carefully studied for a long time in population and statistical genetics, where it is fundamental in understanding the effects of genotype on phenotype. The geometry described above can be derived for diploid genomes, and this was done by Ingileif Hallgrímsdóttir and Debbie Yuster in the paper “A complete classification of epistatic two-locus models” from 2008. In the paper they examine a previous classification of epistasis among 30 pairs of loci in a QTL analysis of growth traits in chickens (Carlborg et al., Genome Research 2003). The (re)-classification is shown in the figure below:

If we can classify epistasis for chickens in order to understand them, we can certainly assess the epistasis outlook for our potential colleagues, and we should hire accordingly.

It’s time that the two body problem be recognized as the two body opportunity.

This is part (2/2) about my travel this past summer to Iceland and Israel:

In my previous blog post I discussed the genetics of Icelanders, and the fact that most Icelanders can trace their roots back dozens of generations, all the way to Vikings from ca. 900AD. The country is homogeneous in many other ways as well (religion, income, etc.), and therefore presents a stark contrast to the other country I visited this summer: Israel. Even though I’ve been to Israel many times since I was a child, now that I am an adult the manifold ethnic, social and religious makeup of the society is much more evident to me. This was particularly true during my visit this past summer, during which political and military turmoil in the country served to accentuate differences. There are Armenians, Ashkenazi Jews, Bahai, Bedouin, Beta Israel, Christian Arabs, Circassians, Copts, Druze, Maronites, Muslim Arabs, Sephardic Jews etc. etc. etc., and additional “diversity” caused by political splits leading to West Bank Palestinians, Gaza Palestinians, Israelis inside vs. outside the Green Line, etc. etc. etc. (and of course many individuals fall into multiple categories). It’s fair to say that “it’s complicated”. Moreover, the complex fabric that makes up Israeli society is part of a larger web of intertwined threads in the Middle East. The “Arab countries” that neighbor Israel are also internally heterogeneous and complex, both in obvious ways (e.g. the Sunni vs. Shia division), but also in many more subtle ways (e.g. language).

The 2014 Israel-Gaza conflict started on July 8th. Having been in Israel for 4 weeks I was interacting closely with many friends and colleagues who were deeply impacted by the events (e.g. their children were suddenly called up to partake in a war), and among them I noticed almost immediately an extreme polarization that reflected a public relations battle being waged between Hamas and Israel that played out more intensely than in any previous conflict on news channels and social media. The polarization extended to friends and acquaintances outside of Israel. Everyone had a very strong opinion. One thing I noticed was that graphic memes were being passed around in which the conflict was projected onto a two-colored map. For example, the map below was passed around on Facebook showing the (“real democratic”) Israel surrounded by a sea of Arab green in the Middle East:

I started noticing other bifurcating maps as other Middle East issues came to the fore later in the summer. Here is a map from a website depicting the Sunni-Shia divide:

In many cases the images being passed around were explicitly encouraging a “one-dimensional” view of the conflict(s), whereas in other cases the “us” vs. “them” factor was more subliminal. The feeling that I was being programmed how to think made me uncomfortable.

Moreover, the Middle East memes that were flooding my inbox were distracting me. I had visited Israel to nurture and establish connections and collaborations with the large number of computational biologists in the country. During my trip I was kindly hosted by Yael Mandel-Gutfreund at the Technion, and also had the honor of being an invited speaker at the annual Israeli Bioinformatics Society meeting. The visit was not supposed to be a bootcamp in salon politics. In any case, I found myself thinking about the situation in the Middle East with a computational biology mindset, and I was struck by the following “Middle East Friendship Chart” published in July that showed data about the relationships of the various entities/countries/organizations:

As a (computational) biologist I was keen to understand the data in a visual way that would reveal the connections more clearly, and as a computational (biologist) faced with ordinal data I thought immediately of non-metric multi-dimensional scaling as a way to depict the information in the matrix. I have discussed classic multi-dimensional scaling (or MDS) in a previous blog post, where I explained its connection to principal component analysis. In the case of ordinal data, non-metric MDS seeks to find points in a low-dimensional Euclidean space so that the rank order of the distances between them agrees with the rank order of the entries in the input ordinal matrix. It has been used in computational biology, for example in the analysis of gene expression matrices. The idea originates with a classic paper by Kruskal, which remains a good reference for understanding non-metric MDS. The key idea is summarized in his Figure 4:

Formally, in Kruskal’s notation, given a dissimilarity map $\delta$ (a symmetric matrix with zeroes on the diagonal and nonnegative entries), the goal is to find points in $\mathbb{R}^k$ whose pairwise distances match the entries of $\delta$ in rank. In Kruskal’s Figure 4, each point on the plot corresponds to a pair of points in $\mathbb{R}^k$; the dissimilarity $\delta$ is shown on the y-axis, while the Euclidean distance between the two points, denoted $d$, is shown on the x-axis. Values $\hat{d}$ that are monotone in $\delta$ (nondecreasing as $\delta$ increases) are then chosen so that $S=\sum_{ij} \left( d_{ij}-\hat{d}_{ij} \right)^2$ is minimized. The function $S$ is called the “stress” function and is further normalized so that the “stress” is invariant under scaling of the points. An iterative procedure can then be used to optimize the points, although the result depends on the starting configuration, and for this reason multiple starting positions are considered.
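To make the stress computation concrete, here is a minimal sketch in plain Python that evaluates the stress of a given configuration; it does not perform the iterative optimization of the points. Several things here are assumptions rather than details from the text: the monotone values $\hat{d}$ are found with the pool-adjacent-violators algorithm, the normalization divides $S$ by $\sum_{ij} d_{ij}^2$ and takes a square root (the form usually called stress-1), ties in $\delta$ are broken by distance, and the names `isotonic_fit`, `kruskal_stress` and the three-object toy matrix are hypothetical.

```python
import math
from itertools import combinations


def isotonic_fit(values):
    """Pool-adjacent-violators: least-squares nondecreasing fit to a sequence."""
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def kruskal_stress(delta, points):
    """Normalized stress of a configuration against a dissimilarity matrix.

    delta: symmetric list of lists of dissimilarities.
    points: list of coordinate tuples, one per object.
    """
    pairs = list(combinations(range(len(points)), 2))
    d = {p: math.dist(points[p[0]], points[p[1]]) for p in pairs}
    # Order pairs by dissimilarity; ties are broken by distance (an assumption).
    order = sorted(pairs, key=lambda p: (delta[p[0]][p[1]], d[p]))
    d_sorted = [d[p] for p in order]
    d_hat = isotonic_fit(d_sorted)
    raw = sum((a - b) ** 2 for a, b in zip(d_sorted, d_hat))
    norm = sum(a ** 2 for a in d_sorted)
    return math.sqrt(raw / norm)


# Hypothetical dissimilarities among three objects.
toy_delta = [[0, 1, 3],
             [1, 0, 2],
             [3, 2, 0]]

# A 1D configuration whose distances are in the same rank order as toy_delta.
good = kruskal_stress(toy_delta, [(0.0,), (1.0,), (3.0,)])
# A 1D configuration that reverses the rank order.
bad = kruskal_stress(toy_delta, [(0.0,), (3.0,), (1.0,)])
# Rescaling the points leaves the normalized stress unchanged.
bad_scaled = kruskal_stress(toy_delta, [(0.0,), (30.0,), (10.0,)])
print(good, bad, bad_scaled)
```

A full non-metric MDS would wrap this evaluation in an optimizer that moves the points to reduce the stress, restarting from several initial configurations.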

I converted the smiley/frowny faces into numbers 0,1 or 2 (for red, yellow and green faces respectively) and was able to easily experiment with non-metric MDS using an implementation in R. The results for a 2D scaling of the friendship matrix are shown in the figure below:

It is evident that, as expected from the friendship matrix, ISIS is an outlier. One also sees some of “the enemy of thine enemy is thy friend”. What is interesting is that in some cases the placements are clearly affected by shared allegiances and mutual dislikes that are complicated in nature. For example, the reason Saudi Arabia is placed between Israel and the United States is the friendship of the U.S. towards Iraq in contrast to Israel’s relationship to the country. One interesting question, which is not addressed by the non-metric MDS approach, is what the direct influences are. For example, it stands to reason that Israel is neutral to Saudi Arabia partly because of the U.S. friendship with the country. Can this be inferred from the data in the same way that causative links are inferred for gene networks? In any case, I thought the scaling was illuminating, and it seems like an interesting exercise to extend the analysis to more countries/organizations/entities, but it may be necessary to deal with missing data and I don’t have the time to do it.

I did decide to look at the 1D non-metric MDS, to see whether there is a meaningful one-dimensional representation of the matrix, consistent with some of the maps I’d seen. As it turns out, this is not what the data suggests. The one-dimensional scaling shown below places ISIS in the middle, i.e. as the “neutral” country!

Israel                -4.55606607
Saudi Arabia          -3.62249810
Turkey                -3.04579321
United States         -2.6429534
Egypt                 -1.12919328
Al-Qaida              -0.38125270
Hamas                  0.01629508
ISIS                   0.40101149
Palestinian Authority  1.55546030
Iraq                   2.23849150
Hezbollah              2.66933449
Iran                   3.29650784
Syria                  5.20065616


This failure of non-metric MDS is simply a reflection of the fact that the friendship matrix is not “one-dimensional”. The Middle East is not one-dimensional. The complex interplay of Sunni vs. Shia, terrorist vs. freedom fighter, Muslim vs. infidel, and all the rest of what is going on makes it incorrect to think of the conflict in terms of a single attribute. The complex pattern of alliances and conflicts is therefore not well explained by two-colored maps, and the computations described above provide some kind of a “proof” of this fact. The friendship matrix also explains why it’s difficult to have meaningful discussions about the Middle East in 140 characters, or in Facebook tirades, or with soundbites on cable news. But as complicated as the Middle East is, I have no doubt that the “friendship matrix” of my colleagues in computational biology would require even higher dimension…