You are currently browsing the monthly archive for December 2014.

I’m a (50%) professor of mathematics and (50%) professor of molecular & cell biology at UC Berkeley. There have been plenty of days when I have spent the working hours with biologists and then gone off at night with some mathematicians. I mean that literally. I have had, of course, intimate friends among both biologists and mathematicians. I think it is through living among these groups and much more, I think, through moving regularly from one to the other and back again that I have become occupied with the problem that I’ve christened to myself as the ‘two cultures’. For constantly I feel that I am moving among two groups- comparable in intelligence, identical in race, not grossly different in social origin, earning about the same incomes, who have almost ceased to communicate at all, who in intellectual, moral and psychological climate have so little in common that instead of crossing the campus from Evans Hall to the Li Ka Shing building, I may as well have crossed an ocean.1

I try not to become preoccupied with the two cultures problem, but this holiday season I have not been able to escape it. First there was a blog post by David Mumford, a professor emeritus of applied mathematics at Brown University, published on December 14th. For those readers of the blog who do not follow mathematics, it is relevant to what I am about to write that David Mumford won the Fields Medal in 1974 for his work in algebraic geometry, and afterwards launched another successful career as an applied mathematician, building on Ulf Grenader’s Pattern Theory and making significant contributions to vision research. A lot of his work is connected to neuroscience and therefore biology. Among his many awards are the MacArthur Fellowship, the Shaw Prize, the Wolf Prize and the National Medal of Science. David Mumford is not Joe Schmo.

It therefore came as a surprise to me to read his post titled “Can one explain schemes to biologists?”  in which he describes the rejection by the journal Nature of an obituary he was asked to write. Now I have to say that I have heard of obituaries being retracted, but never of an obituary being rejected. The Mumford rejection is all the more disturbing because it happened after he was invited by Nature to write the obituary in the first place!

The obituary Mumford was asked to write was for Alexander Grothendieck, a leading and towering figure in 20th century mathematics who built many of the foundations for modern algebraic geometry. My colleague Edward Frenkel published a brief non-technical obituary about Grothendieck in the New York Times, and perhaps that is what Nature had in mind for its journal as well. But since Nature is bills itself as “An international journal, published weekly, with original, groundbreaking research spanning all of the scientific disciplines [emphasis mine]” Mumford assumed the readers of Nature would be interested not only in where Grothendieck was born and died, but in what he actually accomplished in his life, and why he is admired for his mathematics. Here is the beginning excerpt of Mumford’s blog post2 explaining why he and John Tate (his coauthor for the post) needed to talk about the concept of a scheme in their post:

John Tate and I were asked by Nature magazine to write an obituary for Alexander Grothendieck. Now he is a hero of mine, the person that I met most deserving of the adjective “genius”. I got to know him when he visited Harvard and John, Shurik (as he was known) and I ran a seminar on “Existence theorems”. His devotion to math, his disdain for formality and convention, his openness and what John and others call his naiveté struck a chord with me.

So John and I agreed and wrote the obituary below. Since the readership of Nature were more or less entirely made up of non-mathematicians, it seemed as though our challenge was to try to make some key parts of Grothendieck’s work accessible to such an audience. Obviously the very definition of a scheme is central to nearly all his work, and we also wanted to say something genuine about categories and cohomology.

What they came up with is a short but well-written obituary that is the best I have read about Grothendieck. It is non-technical yet accurate and meaningfully describes, at a high level, what he is revered for and why. Here it is (copied verbatim from David Mumford’s blog):

Alexander Grothendieck
David Mumford and John Tate

Although mathematics became more and more abstract and general throughout the 20th century, it was Alexander Grothendieck who was the greatest master of this trend. His unique skill was to eliminate all unnecessary hypotheses and burrow into an area so deeply that its inner patterns on the most abstract level revealed themselves — and then, like a magician, show how the solution of old problems fell out in straightforward ways now that their real nature had been revealed. His strength and intensity were legendary. He worked long hours, transforming totally the field of algebraic geometry and its connections with algebraic mber

mber theory. He was considered by many the greatest mathematician of the 20th century.

Grothendieck was born in Berlin on March 28, 1928 to an anarchist, politically activist couple — a Russian Jewish father, Alexander Shapiro, and a German Protestant mother Johanna (Hanka) Grothendieck, and had a turbulent childhood in Germany and France, evading the holocaust in the French village of Le Chambon, known for protecting refugees. It was here in the midst of the war, at the (secondary school) Collège Cévenol, that he seems to have first developed his fascination for mathematics. He lived as an adult in France but remained stateless (on a “Nansen passport”) his whole life, doing most of his revolutionary work in the period 1956 – 1970, at the Institut des Hautes Études Scientifique (IHES) in a suburb of Paris after it was founded in 1958. He received the Fields Medal in 1966.

His first work, stimulated by Laurent Schwartz and Jean Dieudonné, added major ideas to the theory of function spaces, but he came into his own when he took up algebraic geometry. This is the field where one studies the locus of solutions of sets of polynomial equations by combining the algebraic properties of the rings of polynomials with the geometric properties of this locus, known as a variety. Traditionally, this had meant complex solutions of polynomials with complex coefficients but just prior to Grothendieck’s work, Andre Weil and Oscar Zariski had realized that much more scope and insight was gained by considering solutions and polynomials over arbitrary fields, e.g. finite fields or algebraic number fields.

The proper foundations of the enlarged view of algebraic geometry were, however, unclear and this is how Grothendieck made his first, hugely significant, innovation: he invented a class of geometric structures generalizing varieties that he called schemes. In simplest terms, he proposed attaching to any commutative ring (any set of things for which addition, subtraction and a commutative multiplication are defined, like the set of integers, or the set of polynomials in variables x,y,z with complex number coefficients) a geometric object, called the Spec of the ring (short for spectrum) or an affine scheme, and patching or gluing together these objects to form the scheme. The ring is to be thought of as the set of functions on its affine scheme.

To illustrate how revolutionary this was, a ring can be formed by starting with a field, say the field of real numbers, and adjoining a quantity \epsilon satisfying \epsilon^2=0. Think of \epsilon this way: your instruments might allow you to measure a small number such as \epsilon=0.001 but then \epsilon^2=0.000001 might be too small to measure, so there’s no harm if we set it equal to zero. The numbers in this ring are a+b \cdot \epsilon real a,b. The geometric object to which this ring corresponds is an infinitesimal vector, a point which can move infinitesimally but to second order only. In effect, he is going back to Leibniz and making infinitesimals into actual objects that can be manipulated. A related idea has recently been used in physics, for superstrings. To connect schemes to number theory, one takes the ring of integers. The corresponding Spec has one point for each prime, at which functions have values in the finite field of integers mod p and one classical point where functions have rational number values and that is ‘fatter’, having all the others in its closure. Once the machinery became familiar, very few doubted that he had found the right framework for algebraic geometry and it is now universally accepted.

Going further in abstraction, Grothendieck used the web of associated maps — called morphisms — from a variable scheme to a fixed one to describe schemes as functors and noted that many functors that were not obviously schemes at all arose in algebraic geometry. This is similar in science to having many experiments measuring some object from which the unknown real thing is pieced together or even finding something unexpected from its influence on known things. He applied this to construct new schemes, leading to new types of objects called stacks whose functors were precisely characterized later by Michael Artin.

His best known work is his attack on the geometry of schemes and varieties by finding ways to compute their most important topological invariant, their cohomology. A simple example is the topology of a plane minus its origin. Using complex coordinates (z,w), a plane has four real dimensions and taking out a point, what’s left is topologically a three dimensional sphere. Following the inspired suggestions of Grothendieck, Artin was able to show how with algebra alone that a suitably defined third cohomology group of this space has one generator, that is the sphere lives algebraically too. Together they developed what is called étale cohomology at a famous IHES seminar. Grothendieck went on to solve various deep conjectures of Weil, develop crystalline cohomology and a meta-theory of cohomologies called motives with a brilliant group of collaborators whom he drew in at this time.

In 1969, for reasons not entirely clear to anyone, he left the IHES where he had done all this work and plunged into an ecological/political campaign that he called Survivre. With a breathtakingly naive spririt (that had served him well doing math) he believed he could start a movement that would change the world. But when he saw this was not succeeding, he returned to math, teaching at the University of Montpellier. There he formulated remarkable visions of yet deeper structures connecting algebra and geometry, e.g. the symmetry group of the set of all algebraic numbers (known as its Galois group Gal(\overline{\mathbb{Q}}/\mathbb{Q})) and graphs drawn on compact surfaces that he called ‘dessin d’enfants’. Despite his writing thousand page treatises on this, still unpublished, his research program was only meagerly funded by the CNRS (Centre Nationale de Recherche Scientifique) and he accused the math world of being totally corrupt. For the last two decades of his life he broke with the whole world and sought total solitude in the small village of Lasserre in the foothills of the Pyrenees. Here he lived alone in his own mental and spiritual world, writing remarkable self-analytic works. He died nearby on Nov. 13, 2014.

As a friend, Grothendieck could be very warm, yet the nightmares of his childhood had left him a very complex person. He was unique in almost every way. His intensity and naivety enabled him to recast the foundations of large parts of 21st century math using unique insights that still amaze today. The power and beauty of Grothendieck’s work on schemes, functors, cohomology, etc. is such that these concepts have come to be the basis of much of math today. The dreams of his later work still stand as challenges to his successors.

Mumford goes on in his blog post to describe the reasons Nature gave for rejecting the obituary. He writes:

The sad thing is that this was rejected as much too technical for their readership. Their editor wrote me that ‘higher degree polynomials’, ‘infinitesimal vectors’ and ‘complex space’ (even complex numbers) were things at least half their readership had never come across. The gap between the world I have lived in and that even of scientists has never seemed larger. I am prepared for lawyers and business people to say they hated math and not to remember any math beyond arithmetic, but this!? Nature is read only by people belonging to the acronym ‘STEM’ (= Science, Technology, Engineering and Mathematics) and in the Common Core Standards, all such people are expected to learn a hell of a lot of math. Very depressing.

I don’t know if the Nature editor had biologists in mind when rejecting the Grothendieck obituary, but Mumford certainly thought so, as he sarcastically titled his post “Can one explain schemes to biologists?” Sadly, I think that Nature and Mumford both missed the point.

Exactly ten years ago Bernd Sturmfels and I published a book titled “Algebraic Statistics for Computational Biology“. From my perspective, the book developed three related ideas: 1. that the language, techniques and theorems of algebraic geometry both unify and provide tools for certain models in statistics, 2. that problems in computational biology are particularly prone to depend on inference with precisely the statistical models amenable to algebraic analysis and (most importantly) 3. mathematical thinking, by way of considering useful generalizations of seemingly unrelated ideas, is a powerful approach for organizing many concepts in (computational) biology, especially in genetics and genomics.

To give a concrete example of what 1,2 and 3 mean, I turn to Mumford’s definition of algebraic geometry in his obituary for Grothendieck. He writes that “This is the field where one studies the locus of solutions of sets of polynomial equations by combining the algebraic properties of the rings of polynomials with the geometric properties of this locus, known as a variety.” What is he talking about? The notion of “phylogenetic invariants”, provides a simple example for biologists by biologists. Phylogenetic invariants were first introduced to biology ca. 1987 by Joe Felsenstein (Professor of Genome Sciences and Biology at the University of Washington) and James Lake (Distinguished Professor of Molecular, Cell, and Developmental Biology and of Human Genetics at UCLA)3.

Given a phylogenetic tree describing the evolutionary relationship among n extant species, one can examine the evolution of a single nucleotide along the tree. At the leaves, a single nucleotide is then associated to each species, collectively forming a single selection from among the 4^n possible patterns for nucleotides at the leaves. Evolutionary models provide a way to formalize the intuitive notion that random mutations should be associated with branches of the tree and formally are described via (unknown) parameters that can be used to calculate a probability for any pattern at the leaves. It happens to be the case that for most phylogenetic evolutionary model have the property that the probabilities for leaf patterns are polynomials in the parameters. The simplest example to consider is the tree with an ancestral node and two leaves corresponding to two extant species, say “B” and “M”:



The molecular approach to evolution posits that multiple sites together should be used both to estimate parameters associated with evolution along the tree, and maybe even the tree itself. If one assumes that nucleotides mutate according to the 4-state general Markov model with independent processes on each branch, and one writes p_{ij} for \mathbb{P}(B=i,M=j) where i,j are one of A,C,G,T, then it must be the case that p_{ij}p_{kl} = p_{il}p_{jk}. In other words, the polynomial

p_{ij}p_{kl} - p_{il}p_{jk}=0.

In other words, for any parameters in the 4-state general Markov model, it has to be the case that when the pattern probabilities are plugged into the polynomial equation above, the result is zero. This equation is none other than the condition for two random variables to be independent; in this case the random variable corresponding to the nucleotide at B is independent of the random variable corresponding to the nucleotide at M.

The example is elementary, but it hints at a powerful tool for phylogenetics. It provides an equation that must be satisfied by the pattern probabilities that does not depend specifically on the parameters of the model (which can be intuitively understood as relating to branch length). If many sites are available so that pattern probabilities can be estimated empirically from data, then there is in principle a possibility for testing whether the data fits the topology of a specific tree regardless of what the branch lengths of the tree might be. Returning to Mumford’s description of algebraic geometry, the variety of interest is the geometric object in “pattern probability space” where points are precisely probabilities that can arise for a specific tree, and the “ring of polynomials with the geometric properties of the locus” are the phylogenetic invariants. The relevance of the ring lies in the fact that if and g are two phylogenetic invariants then that means that f(P)=0 and g(P)=0 for any pattern probabilities from the model, so therefore f+g is also a phylogenetic invariant because f(P)+g(P)=0 for any pattern probabilities from the model (the same is true for c \cdot f for any constant c). In other words, there is an algebra of phylogenetic invariants that is closely related to the geometry of pattern probabilities. As Mumford and Tate explain, Grothendieck figured out the right generalizations to construct a theory for any ring, not just the ring of polynomials, and therewith connected the fields of commutative algebra, algebraic geometry and number theory.

The use of phylogenetic invariants for testing tree topologies is conceptually elegantly illustrated in a wonderful book chapter on phylogenetic invariants  by mathematicians Elizabeth Allman and John Rhodes that starts with the simple example of the two taxa tree and delves deeply into the subject. Two surfaces (conceptually) represent the varieties for two trees, and the equations f_1(P)=f_2(P)=\ldots=f_l(P)=0 and h_1(P)=h_2(P)=\ldots=h_k(P)=0 are the phylogenetic invariants. The empirical pattern probability distribution is the point \hat{P} and the goal is to find the surface it is close to:


Figure 4.2 from Allman and Rhodes chapter on phylogenetic invariants.

Of course for large trees there will be many different phylogenetic invariants, and the polynomials may be of high degree. Figuring out what the invariants are, how many of them there are, bounds for the degrees, understanding the geometry, and developing tests based on the invariants, is essentially a (difficult unsolved) challenge for algebraic geometers. I think it’s fair to say that our book spurred a lot of research on the subject, and helped to create interest among mathematicians who were unaware of the variety and complexity of problems arising from phylogenetics. Nick Eriksson, Kristian Ranestad, Bernd Sturmfels and Seth Sullivant wrote a short piece titled phylogenetic algebraic geometry which is an introduction for algebraic geometers to the subject. Here is where we come full circle to Mumford’s obituary… the notion of a scheme is obviously central to phylogenetic algebraic geometry. And the expository article just cited is just the beginning. There are too many exciting developments in phylogenetic geometry to summarize in this post, but Elizabeth Allman, Marta Casanellas, Joseph Landsberg, John Rhodes, Bernd Sturmfels and Seth Sullivant are just a few of many who have discovered beautiful new mathematics motivated by the biology, and also have had an impact on biology with algebro-geometric tools. There is both theory (see this recent example) and application (see this recent example) coming out of phylogenetic algebraic geometry. More generally, algebraic statistics for computational biology is now a legitimate “field”, complete with a journal, regular conferences, and a critical mass of mathematicians, statisticians, and even some biologists working in the area. Some of the results are truly beautiful and impressive. My favorite recent one is this paper by Caroline Uhler, Donald Richards and Piotr Zwiernik providing important guarantees for maximum likelihood estimation of parameters in Felstenstein’s continuous character model.

But that is not the point here. First, Mumford’s sarcasm was unwarranted. Biologists certainly didn’t discover schemes but as Felsenstein and Lake’s work shows, they did (re)discover algebraic geometry. Moreover, all of the people mentioned above can explain schemes to biologists, thereby answering Mumford’s question in the affirmative. Many of them have not only collaborated with biologists but written biology papers. And among them are some extraordinary expositors, notably Bernd Sturmfels. Still, even if there are mathematicians able and willing to explain schemes to biologists, and even if there are areas within biology where schemes arise (e.g. phylogenetic algebraic geometry), it is fair to ask whether biologists should care to understand them?

The answer to the question is: probably not. In any case I wouldn’t presume to opine on what biologists should and shouldn’t care about. Biology is enormous, and encompasses everything from the study of fecal transplants to the wood frogs of Alaska. However I do have an opinion about the area I work in, namely genomics. When it comes to genomics journalists write about revolutions, personalized precision medicine, curing cancer and data deluge. But the biology of genomics is for real, and it is indeed tremendously exciting as a result of dramatic improvements in underlying technologies (e.g. DNA sequencing and genome editing to name two). I also believe it is true that despite what is written about data deluge, experiments remain the primary and the best way, to elucidate the function of the genome. Data analysis is secondary. But it is true that statistics has become much more important to genomics than it was even to population genetics at the time of R.A. Fisher, computer science is playing an increasingly important role, and I believe that somewhere in the mix of “quantitative sciences for biology”, there is an important role for mathematics.

What biologists should appreciate, what was on offer in Mumford’s obituary, and what mathematicians can deliver to genomics that is special and unique, is the ability to not only generalize, but to do so “correctly”. The mathematician Raoul Bott once reminisced that “Grothendieck was extraordinary as he could play with concepts, and also was prepared to work very hard to make arguments almost tautological.” In other words, what made Grothendieck special was not that he generalized concepts in algebraic geometry to make them more abstract, but that he was able to do so in the right way. What made his insights seemingly tautological at the end of the day, was that he had the “right” way of viewing things and the “right” abstractions in mind. That is what mathematicians can contribute most of all to genomics. Of course sometimes theorems are important, or specific mathematical techniques solve problems and mathematicians are to thank for that. Phylogenetic invariants are important for phylogenetics which in turn is important for comparative genomics which in turn is important for functional genomics which in turn is important for medicine. But it is the the abstract thinking that I think matters most. In other words, I agree with Charles Darwin that mathematicians are endowed with an extra sense… I am not sure exactly what he meant, but it is clear to me that it is the sense that allows for understanding the difference between the “right” way and the “wrong” way to think about something.

There are so many examples of how the “right” thinking has mattered in genomics that they are too numerous to list here, but here are a few samples: At the heart of molecular biology, there is the “right” and the “wrong” way to think about genes: evidently the message to be gleaned from Gerstein et al.‘s in “What is a gene post ENCODE? History and Definition” is that “genes” are not really the “right” level of granularity but transcripts are. In a previous blog post I’ve discussed the “right” way to think about the Needleman-Wunsch algorithm (tropically). In metagenomics there is the “right” abstraction with which to understand UniFrac. One paper I’ve written (with Niko Beerenwinkel and Bernd Sturmfels) is ostensibly about fitness landscapes but really about what we think the “right” way is to look at epistasis. In systems biology there is the “right” way to think about stochasticity in expression (although I plan a blog post that digs a bit deeper). There are many many more examples… way too many to list here… because ultimately every problem in biology is just like in math… there is the “right’ and the “wrong” way to think about it, and figuring out the difference is truly an art that mathematicians, the type of mathematicians that work in math departments, are particularly good at.

Here is a current example from (computational) biology where it is not yet clear what “right” thinking should be despite the experts working hard at it, and that is useful to highlight because of the people involved: With the vast amount of human genomes being sequenced (some estimates are as high as 400,000 in the coming year), there is an increasingly pressing fundamental question about how the (human) genome should be represented and stored. This is ostensibly a computer science question: genomes should perhaps be compressed in ways that allow for efficient search and retrieval, but I’d argue that fundamentally it is a math question. This is because what the question is really asking, is how should one think about genome sequences related mostly via recombination and only slightly by mutation, and what are the “right” mathematical structures for this challenge? The answer matters not only for the technology (how to store genomes), but much more importantly for the foundations of population and statistical genetics. Without the right abstractions for genomes, the task of coherently organizing and interpreting genomic information is hopeless. David Haussler (with coauthors) and Richard Durbin have both written about this problem in papers that are hard to describe in any way other than as math papers; see Mapping to a Reference Genome Structure and Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (BPWT). Perhaps it is no coincidence that both David Haussler and Richard Durbin studied mathematics.

But neither David Haussler nor Richard Durbin are faculty in mathematics departments. In fact, there is a surprisingly long list of very successful (computational) biologists specifically working in genomics, many of whom even continue to do math, but not in math departments, i.e. they are former mathematicians (this is so common there is even a phrase for it “recovering mathematician” as if being one is akin to alcoholism– physicists use the same language). People include Richard Durbin, Phil Green, David Haussler, Eric Lander, Montgomery Slatkin and many others I am omitting; for example almost the entire assembly group at the Broad Institute consists of former mathematicians. Why are there so many “formers” and very few “currents”? And does it matter? After all, it is legitimate to ask whether successful work in genomics is better suited to departments, institutes and companies outside the realm of academic mathematics. It is certainly the case that to do mathematics, or to publish mathematical results, one does not need to be a faculty member in a mathematics department. I’ve thought a lot about these issues and questions, partly because they affect my daily life working between the worlds of mathematics and molecular biology in my own institution. I’ve also seen the consequences of the separation of the two cultures. To illustrate how far apart they are I’ve made a list of specific differences below:

Biologists publish in “glamour journals” such as Science, Nature and Cell where impact factors are high. Nature publishes its impact factor to three decimal digits accuracy (42.317). Mathematicians publish in journals whose names start with the word Annals, and they haven’t heard of impact factors. The impact factor of the Annals of Mathematics, perhaps the most prestigious journal in mathematics, is 3 (the journal with the highest impact factor is the Journal of the American Mathematical Society at 3.5). Mathematicians post all papers on the ArXiv preprint server prior to publications. Not only do biologists not do that, they are frequently subject to embargos prior to publication. Mathematicians write in LaTeX, biologists in Word (a recent paper argues that Word is better, but I’m not sure). Biologists draw figures and write papers about them. Mathematicians write papers and draw figures to explain them. Mathematicians order authors alphabetically, and authorship is awarded if a mathematical contribution was made. Biologists author lists have two gradients from each end, and authorship can be awarded for payment for the work. Biologists may review papers on two week deadlines. Mathematicians review papers on two year deadlines. Biologists have their papers cited by thousands, and their results have a real impact on society; in many cases diseases are cured as a result of basic research. Mathematicians are lucky if 10 other individuals on the planet have any idea what they are writing about. Impact time can be measured in centuries, and sometimes theorems turn out to simply not have been interesting at all. Biologists don’t teach much. Mathematicians do (at UC Berkeley my math teaching load is 5 times that of my biology teaching load). Biologists value grants during promotion cases and hiring. Mathematicians don’t. Biologists have chalk talks during job interviews. Mathematicians don’t. Mathematicians have a jobs wiki. Biologists don’t. Mathematicians write ten page recommendation letters. Biologists don’t. Biologists go to retreats to converse. Mathematicians retreat from conversations (my math department used to have a yearly retreat that was one day long and consisted of a faculty meeting around a table in the department; it has not been held the past few years). Mathematics graduate students teach. Biology graduate students rotate. Biology students take very little coursework after their first year. Mathematics graduate students take two years of classes (on this particular matter I’m certain mathematicians are right). Biologists pay their graduate students from grants. Mathematicians don’t (graduate students are paid for teaching sections of classes, usually calculus). Mathematics full professors that are female is a number (%) in the single digits. Biology full professors that are female is a number (%) in the double digits (although even added together the numbers are still much less than 50%). Mathematicians believe in God. Biologists don’t.

How then can biology, specifically genomics (or genetics), exist and thrive within the mathematics community? And how can mathematics find a place within the culture of biology?

I don’t know. The relationship between biology and mathematics is on the rocks and prospects are grim. Yes, there are biologists who do mathematical work, and yes, there are mathematical biologists, especially in areas such as evolution or ecology who are in math departments. There are certainly applied mathematics departments with faculty working on biology problems involving modeling at the macroscopic level, where the math fits in well with classic applied math (e.g. PDEs, numerical analysis). But there is very little genomics or genetics related math going on in math departments. And conversely, mathematicians who leave math departments to work in biology departments or institutes face enormous pressure to not focus on the math, or when they do any math at all, to not publish it (work is usually relegated to the supplement and completely ignored). The result is that biology loses out due to the minimal real contact with math– the special opportunity of benefiting from the extra sense is lost, and conversely math loses the opportunity to engage biology– one of the most exciting scientific enterprises of the 21st century. The mathematician Gian-Carlo Rota said that “The lack of real contact between mathematics and biology is either a tragedy, a scandal, or a challenge, it is hard to decide which”. He was right.

The extent to which the two cultures have drifted apart is astonishing. For example, visiting other universities I see the word “mathematics” almost every time precision medicine is discussed in the context of a new initiative, but I never see mathematicians or the local math department involved. In the mathematics community, there has been almost no effort to engage and embrace genomics. For example the annual joint AMS-MAA meetings always boast a series of invited talks, many on applications of math, but genomics is never a represented area. Yet in my Junior level course last semester on mathematical biology (taught in the math department) there were 46 students, more than any other upper division elective class in the math department. Even though I am a 50% member of the mathematics department I have been advising three math graduate students this year, equivalent to six for a full time member, a statistic that probably ranks me among the most busy advisors in the department (these numbers do not even reflect the fact that I had to turn down a number of students). Anecdotally, the numbers illustrate how popular genomics is among math undergraduate and graduate students, and although hard data is difficult to come by my interactions with mathematicians everywhere convince me the trend I see at Berkeley is universal. So why is this popularity not reflected in support of genomics by the math community? And why don’t biology journals, conferences and departments embrace more mathematics? There is a hypocrisy of math for biology. People talk about it but when push comes to shove nobody wants to do anything real to foster it.

Examples abound. On December 16th UCLA announced the formation of a new Institute for Quantitative and Computational Biosciences. The announcement leads with a photograph of the director that is captioned “Alexander Hoffmann and his colleagues will collaborate with mathematicians to make sense of a tsunami of biological data.” Strangely though, the math department is not one of the 15 partner departments that will contribute to the Institute. That is not to say that mathematicians won’t interact with the Institute, or that mathematics won’t happen there. E.g., the Institute for Pure and Applied Mathematics is a partner as is the Biomathematics department (an interesting UCLA concoction), not to mention the fact that many of the affiliated faculty do work that is in part mathematical. But formal partnership with the mathematics department, and through it direct affiliation with the mathematics community, is missing. UCLA’s math department is among the top in the world, and boasts a particularly robust applied mathematics program many of whose members work on mathematical biology. More importantly, the “pure” mathematicians at UCLA are first rate and one of them, Terence Tao, is possibly the most talented mathematician alive. Wouldn’t it be great if he could be coaxed to think about some of the profound questions of biology? Wouldn’t it be awesome if mathematicians in the math department at UCLA worked hard with the biologists to tackle the extraordinary challenges of “precision medicine”? Wouldn’t it be wonderful if UCLA’s Quantitative and Computational biosciences Institute could benefit from the vast mathematics talent pool not only at UCLA but beyond: that of the entire mathematics community?

I don’t know if the omission of the math department was an accidental oversight of the Institute, a deliberate snub, or if it was the Institute that was rebuffed by the mathematics department. I don’t think it really matters. The point is that the UCLA situation is ubiquitous. Mathematics departments are almost never part of new initiatives in genomics; biologists are all too quick to glance the other way. Conversely, the mathematics community has shunned biologists. Despite two NSF Institutes dedicated to mathematical biology (the MBI and NIMBioS) almost no top math departments hire mathematicians working in genetics or genomics (see the mathematics jobs wiki). In the rooted tree in the figure above B can represent Biology and M can represent Mathematics and they truly, and sadly, are independent.

I get it. The laundry list of differences between biology and math that I aired above can be overwhelming. Real contact between the subjects will be difficult to foster, and it should be acknowledged that it is neither necessary nor sufficient for the science to progress. But wouldn’t it be better if mathematicians proved they are serious about biology and biologists truly experimented with mathematics? 


1. The opening paragraph is an edited copy of an excerpt (page 2, paragraph 2) from C.P. Snow’s “The Two Cultures and The Scientific Revolution” (The Rede Lecture 1959).
2. David Mumford’s content on his site is available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License, and I have incorporated it in my post (boxed text) unaltered according to the terms of the license.
3. The meaning of the word “invariant” in “phylogenetic invariants” differs from the standard meaning in mathematics, where invariant refers to a property of a class of objects that is unchanged under transformations. In the context of algebraic geometry classic invariant theory addresses the problem of determining polynomial functions that are invariant under transformations from a linear group. Mumford is known for his work on geometric invariant theory. An astute reader could therefore deduce from the term “phylogenetic invariants” that the term was coined by biologists.

Recent news of James Watson’s auction of his Nobel Prize medal has unearthed a very unpleasant memory for me.

In March 2004 I attended an invitation-only genomics meeting at the famed Banbury Center at Cold Spring Harbor Laboratory. I had heard legendary stories about Banbury, and have to admit I felt honored and excited when I received the invitation. There were rumors that sometimes James Watson himself would attend meetings. The emails I received explaining the secretive policies of the Center only added to the allure. I felt that I had received an invitation to the genomics equivalent of Skull and Bones.

Although Watson did not end up attending the meeting, my high expectations were met when he did decide to drop in on dinner one evening at Robertson house. Without warning he seated himself at my table. I was in awe. The table was round with seating for six, and Honest Jim sat down right across from me. He spoke incessantly throughout dinner and we listened. Sadly though, most of the time he was spewing racist and misogynistic hate. I remember him asking rhetorically “who would want to adopt an Irish kid?” (followed by a tirade against the Irish that I later saw repeated in the news) and he made a point to disparage Rosalind Franklin referring to her derogatorily as “that woman”. No one at the table (myself included) said a word. I deeply regret that.

One of Watson’s obsessions has been to “improve” the “imperfect human” via human germline engineering. This is disturbing on many many levels. First, there is the fact that for years Watson presided over Cold Spring Harbor Laboratory which actually has a history as a center for eugenics. Then there are the numerous disparaging remarks by Watson about everyone who is not exactly like him, leaving little doubt about who he imagines the “perfect human” to be. But leaving aside creepy feelings… could he be right? Is the “perfect human” an American from Chicago of mixed Scottish/Irish ancestry? Should we look forward to a world filled with Watsons? I have recently undertaken a thought experiment along these lines that I describe below. The result of the experiment is dedicated to James Watson on the occasion of his unbirthday today.


SNPedia is an open database of 59,593 SNPs and their associations. A SNP entry includes fields for “magnitude” (a subjective measure of significance on a scale of 0–10) and “repute” (good or bad), and allele classifications for many diseases and medical conditions. For example, the entry for a SNP (rs1799971) that associates with alcohol cravings describes the “normal” and “bad” alleles. In addition to associating with phenotypes, SNPs can also associate with populations. For example, as seen in the Geography of Genetic Variants Browser, rs1799971 allele frequencies vary greatly among Africans, Europeans and Asians. If the genotype of an individual is known at many SNPs, it is therefore possible to guess where they are from: in the case of rs1799971 someone who is A:A is a lot more likely to be African than Japanese, and with many SNPs the probabilities can narrow the location of an individual to a very specific geographic location. This is the principle behind the application of principal component analysis (PCA) to the study of populations. Together, SNPedia and PCA therefore provide a path to determining where a “perfect human” might be from:

  1. Create a “perfect human” in silico by setting the alleles at all SNPs so that they are “good”.
  2. Add the “perfect human” to a panel of genotyped individuals from across a variety of populations and perform PCA to reveal the location and population of origin of the individual.


After restricting the SNP set from SNPedia to those with green painted alleles, i.e. “good”, there are 4967 SNPs with which to construct the “perfect human” (available for download here).

A dataset of genotyped individuals can be obtain from 1000 genomes including Africans, (indigenous) Americans, East Asians and Europeans.

The PCA plot (1st and 2nd components) showing all the individuals together with the “perfect human” (in pink; see arrow) is shown below:


The nearest neighbor to the “perfect human” is HG00737, a female who isPuerto Rican. One might imagine that such a person already existed, maybe Yuiza, the only female Taino Cacique (chief) in Puerto Rico’s history:


Samuel Lind’s ‘Yuiza’

But as the 3rd principal component shows, reifying the “perfect human” is a misleading undertaking:


Here the “perfect human” is revealed to be decidedly non-human. This is not surprising, and it reflects the fact that the alleles of the “perfect human” place it as significant outlier to the human population. In fact, this is even more evident in the case of the “worst human”, namely the individual that has the “bad” alleles at every SNPs. A projection of that individual onto any combination of principal components shows them to be far removed from any actual human. The best visualization appears in the projection onto the 2nd and 3rd principal components, where they appear as a clear outlier (point labeled DYS), and diametrically opposite to Africans:


The fact that the “worst human” does not project well onto any of the principal components whereas the “perfect human” does is not hard to understand from basic population genetics principles. It is an interesting exercise that I leave to the reader.


The fact that the “perfect human” is Puerto Rican makes a lot of sense. Since many disease SNPs are population specific, it makes sense that an individual homozygous for all “good” alleles should be admixed. And that is exactly what Puerto Ricans are. In a “women in the diaspora” study, Puerto Rican women born on the island but living in the United States were shown to be 53.3±2.8% European, 29.1±2.3% West African, and 17.6±2.4% Native American. In other words, to collect all the “good” alleles it is necessary to be admixed, but admixture itself is not sufficient for perfection. On a personal note, I was happy to see population genetic evidence supporting my admiration for the perennial championship Puerto Rico All Stars team:

As for Watson, it seems fitting that he should donate the proceeds of his auction to the Caribbean Genome Center at the University of Puerto Rico.

[Update: Dec. 7/8: Taras Oleksyk from the Department of Biology at the University of Puerto Rico Mayaguez has written an excellent post-publication peer review of this blog post and Rafael Irizarry from the Harvard School of Public Health has written a similar piece, Genéticamente, no hay tal cosa como la raza puertorriqueña in Spanish. Both are essential reading.]

Blog Stats

%d bloggers like this: