I’m a (50%) professor of mathematics and (50%) professor of molecular & cell biology at UC Berkeley. There have been plenty of days when I have spent the working hours with biologists and then gone off at night with some mathematicians. I mean that literally. I have had, of course, intimate friends among both biologists and mathematicians. I think it is through living among these groups and much more, I think, through moving regularly from one to the other and back again that I have become occupied with the problem that I’ve christened to myself as the ‘two cultures’. For constantly I feel that I am moving among two groups- comparable in intelligence, identical in race, not grossly different in social origin, earning about the same incomes, who have almost ceased to communicate at all, who in intellectual, moral and psychological climate have so little in common that instead of crossing the campus from Evans Hall to the Li Ka Shing building, I may as well have crossed an ocean.1

I try not to become preoccupied with the two cultures problem, but this holiday season I have not been able to escape it. First there was a blog post by David Mumford, a professor emeritus of applied mathematics at Brown University, published on December 14th. For those readers of the blog who do not follow mathematics, it is relevant to what I am about to write that David Mumford won the Fields Medal in 1974 for his work in algebraic geometry, and afterwards launched another successful career as an applied mathematician, building on Ulf Grenander’s Pattern Theory and making significant contributions to vision research. A lot of his work is connected to neuroscience and therefore biology. Among his many awards are the MacArthur Fellowship, the Shaw Prize, the Wolf Prize and the National Medal of Science. David Mumford is not Joe Schmo.

It therefore came as a surprise to me to read his post titled “Can one explain schemes to biologists?” in which he describes the rejection by the journal Nature of an obituary he was asked to write. Now I have to say that I have heard of obituaries being retracted, but never of an obituary being rejected. The Mumford rejection is all the more disturbing because it happened after he was invited by Nature to write the obituary in the first place!

The obituary Mumford was asked to write was for Alexander Grothendieck, a leading and towering figure in 20th century mathematics who built many of the foundations for modern algebraic geometry. My colleague Edward Frenkel published a brief non-technical obituary about Grothendieck in the New York Times, and perhaps that is what Nature had in mind for its journal as well. But since Nature bills itself as “An international journal, published weekly, with original, groundbreaking research spanning all of the scientific disciplines [emphasis mine]” Mumford assumed the readers of Nature would be interested not only in where Grothendieck was born and died, but in what he actually accomplished in his life, and why he is admired for his mathematics. Here is the beginning excerpt of Mumford’s blog post2 explaining why he and John Tate (his coauthor for the post) needed to talk about the concept of a scheme in their post:

John Tate and I were asked by Nature magazine to write an obituary for Alexander Grothendieck. Now he is a hero of mine, the person that I met most deserving of the adjective “genius”. I got to know him when he visited Harvard and John, Shurik (as he was known) and I ran a seminar on “Existence theorems”. His devotion to math, his disdain for formality and convention, his openness and what John and others call his naiveté struck a chord with me.

So John and I agreed and wrote the obituary below. Since the readership of Nature were more or less entirely made up of non-mathematicians, it seemed as though our challenge was to try to make some key parts of Grothendieck’s work accessible to such an audience. Obviously the very definition of a scheme is central to nearly all his work, and we also wanted to say something genuine about categories and cohomology.

What they came up with is a short but well-written obituary that is the best I have read about Grothendieck. It is non-technical yet accurate and meaningfully describes, at a high level, what he is revered for and why. Here it is (copied verbatim from David Mumford’s blog):

Alexander Grothendieck
David Mumford and John Tate

Although mathematics became more and more abstract and general throughout the 20th century, it was Alexander Grothendieck who was the greatest master of this trend. His unique skill was to eliminate all unnecessary hypotheses and burrow into an area so deeply that its inner patterns on the most abstract level revealed themselves — and then, like a magician, show how the solution of old problems fell out in straightforward ways now that their real nature had been revealed. His strength and intensity were legendary. He worked long hours, transforming totally the field of algebraic geometry and its connections with algebraic number theory. He was considered by many the greatest mathematician of the 20th century.

Grothendieck was born in Berlin on March 28, 1928 to an anarchist, politically activist couple — a Russian Jewish father, Alexander Shapiro, and a German Protestant mother Johanna (Hanka) Grothendieck, and had a turbulent childhood in Germany and France, evading the holocaust in the French village of Le Chambon, known for protecting refugees. It was here in the midst of the war, at the (secondary school) Collège Cévenol, that he seems to have first developed his fascination for mathematics. He lived as an adult in France but remained stateless (on a “Nansen passport”) his whole life, doing most of his revolutionary work in the period 1956 – 1970, at the Institut des Hautes Études Scientifique (IHES) in a suburb of Paris after it was founded in 1958. He received the Fields Medal in 1966.

His first work, stimulated by Laurent Schwartz and Jean Dieudonné, added major ideas to the theory of function spaces, but he came into his own when he took up algebraic geometry. This is the field where one studies the locus of solutions of sets of polynomial equations by combining the algebraic properties of the rings of polynomials with the geometric properties of this locus, known as a variety. Traditionally, this had meant complex solutions of polynomials with complex coefficients but just prior to Grothendieck’s work, Andre Weil and Oscar Zariski had realized that much more scope and insight was gained by considering solutions and polynomials over arbitrary fields, e.g. finite fields or algebraic number fields.

The proper foundations of the enlarged view of algebraic geometry were, however, unclear and this is how Grothendieck made his first, hugely significant, innovation: he invented a class of geometric structures generalizing varieties that he called schemes. In simplest terms, he proposed attaching to any commutative ring (any set of things for which addition, subtraction and a commutative multiplication are defined, like the set of integers, or the set of polynomials in variables x,y,z with complex number coefficients) a geometric object, called the Spec of the ring (short for spectrum) or an affine scheme, and patching or gluing together these objects to form the scheme. The ring is to be thought of as the set of functions on its affine scheme.

To illustrate how revolutionary this was, a ring can be formed by starting with a field, say the field of real numbers, and adjoining a quantity \epsilon satisfying \epsilon^2=0. Think of \epsilon this way: your instruments might allow you to measure a small number such as \epsilon=0.001 but then \epsilon^2=0.000001 might be too small to measure, so there’s no harm if we set it equal to zero. The numbers in this ring are a+b \cdot \epsilon real a,b. The geometric object to which this ring corresponds is an infinitesimal vector, a point which can move infinitesimally but to second order only. In effect, he is going back to Leibniz and making infinitesimals into actual objects that can be manipulated. A related idea has recently been used in physics, for superstrings. To connect schemes to number theory, one takes the ring of integers. The corresponding Spec has one point for each prime, at which functions have values in the finite field of integers mod p and one classical point where functions have rational number values and that is ‘fatter’, having all the others in its closure. Once the machinery became familiar, very few doubted that he had found the right framework for algebraic geometry and it is now universally accepted.

Going further in abstraction, Grothendieck used the web of associated maps — called morphisms — from a variable scheme to a fixed one to describe schemes as functors and noted that many functors that were not obviously schemes at all arose in algebraic geometry. This is similar in science to having many experiments measuring some object from which the unknown real thing is pieced together or even finding something unexpected from its influence on known things. He applied this to construct new schemes, leading to new types of objects called stacks whose functors were precisely characterized later by Michael Artin.

His best known work is his attack on the geometry of schemes and varieties by finding ways to compute their most important topological invariant, their cohomology. A simple example is the topology of a plane minus its origin. Using complex coordinates (z,w), a plane has four real dimensions and taking out a point, what’s left is topologically a three dimensional sphere. Following the inspired suggestions of Grothendieck, Artin was able to show how with algebra alone that a suitably defined third cohomology group of this space has one generator, that is the sphere lives algebraically too. Together they developed what is called étale cohomology at a famous IHES seminar. Grothendieck went on to solve various deep conjectures of Weil, develop crystalline cohomology and a meta-theory of cohomologies called motives with a brilliant group of collaborators whom he drew in at this time.

In 1969, for reasons not entirely clear to anyone, he left the IHES where he had done all this work and plunged into an ecological/political campaign that he called Survivre. With a breathtakingly naive spirit (that had served him well doing math) he believed he could start a movement that would change the world. But when he saw this was not succeeding, he returned to math, teaching at the University of Montpellier. There he formulated remarkable visions of yet deeper structures connecting algebra and geometry, e.g. the symmetry group of the set of all algebraic numbers (known as its Galois group Gal(\overline{\mathbb{Q}}/\mathbb{Q})) and graphs drawn on compact surfaces that he called ‘dessin d’enfants’. Despite his writing thousand page treatises on this, still unpublished, his research program was only meagerly funded by the CNRS (Centre Nationale de Recherche Scientifique) and he accused the math world of being totally corrupt. For the last two decades of his life he broke with the whole world and sought total solitude in the small village of Lasserre in the foothills of the Pyrenees. Here he lived alone in his own mental and spiritual world, writing remarkable self-analytic works. He died nearby on Nov. 13, 2014.

As a friend, Grothendieck could be very warm, yet the nightmares of his childhood had left him a very complex person. He was unique in almost every way. His intensity and naivety enabled him to recast the foundations of large parts of 21st century math using unique insights that still amaze today. The power and beauty of Grothendieck’s work on schemes, functors, cohomology, etc. is such that these concepts have come to be the basis of much of math today. The dreams of his later work still stand as challenges to his successors.

Mumford goes on in his blog post to describe the reasons Nature gave for rejecting the obituary. He writes:

The sad thing is that this was rejected as much too technical for their readership. Their editor wrote me that ‘higher degree polynomials’, ‘infinitesimal vectors’ and ‘complex space’ (even complex numbers) were things at least half their readership had never come across. The gap between the world I have lived in and that even of scientists has never seemed larger. I am prepared for lawyers and business people to say they hated math and not to remember any math beyond arithmetic, but this!? Nature is read only by people belonging to the acronym ‘STEM’ (= Science, Technology, Engineering and Mathematics) and in the Common Core Standards, all such people are expected to learn a hell of a lot of math. Very depressing.

I don’t know if the Nature editor had biologists in mind when rejecting the Grothendieck obituary, but Mumford certainly thought so, as he sarcastically titled his post “Can one explain schemes to biologists?” Sadly, I think that Nature and Mumford both missed the point.

Exactly ten years ago Bernd Sturmfels and I published a book titled “Algebraic Statistics for Computational Biology”. From my perspective, the book developed three related ideas: 1. that the language, techniques and theorems of algebraic geometry both unify and provide tools for certain models in statistics, 2. that problems in computational biology are particularly prone to depend on inference with precisely the statistical models amenable to algebraic analysis and (most importantly) 3. mathematical thinking, by way of considering useful generalizations of seemingly unrelated ideas, is a powerful approach for organizing many concepts in (computational) biology, especially in genetics and genomics.

To give a concrete example of what 1, 2 and 3 mean, I turn to Mumford’s definition of algebraic geometry in his obituary for Grothendieck. He writes that “This is the field where one studies the locus of solutions of sets of polynomial equations by combining the algebraic properties of the rings of polynomials with the geometric properties of this locus, known as a variety.” What is he talking about? The notion of “phylogenetic invariants” provides a simple example for biologists by biologists. Phylogenetic invariants were first introduced to biology ca. 1987 by Joe Felsenstein (Professor of Genome Sciences and Biology at the University of Washington) and James Lake (Distinguished Professor of Molecular, Cell, and Developmental Biology and of Human Genetics at UCLA)3.

Given a phylogenetic tree describing the evolutionary relationship among n extant species, one can examine the evolution of a single nucleotide along the tree. At the leaves, a single nucleotide is then associated to each species, collectively forming a single selection from among the 4^n possible patterns for nucleotides at the leaves. Evolutionary models provide a way to formalize the intuitive notion that random mutations should be associated with branches of the tree and formally are described via (unknown) parameters that can be used to calculate a probability for any pattern at the leaves. It happens to be the case that most phylogenetic evolutionary models have the property that the probabilities for leaf patterns are polynomials in the parameters. The simplest example to consider is the tree with an ancestral node and two leaves corresponding to two extant species, say “B” and “M”:

[Figure: the tree with one ancestral node and two leaves, B and M.]

The molecular approach to evolution posits that multiple sites together should be used to estimate both the parameters associated with evolution along the tree, and maybe even the tree itself. If one assumes that nucleotides mutate according to the 4-state general Markov model with independent processes on each branch, and one writes p_{ij} for \mathbb{P}(B=i,M=j) where i,j are one of A,C,G,T, then it must be the case that p_{ij}p_{kl} = p_{il}p_{kj}. In other words, the pattern probabilities satisfy the polynomial equation

p_{ij}p_{kl} - p_{il}p_{kj}=0.

That is, for any parameters in the 4-state general Markov model, it has to be the case that when the pattern probabilities are plugged into the polynomial equation above, the result is zero. This equation is none other than the condition for two random variables to be independent; in this case the random variable corresponding to the nucleotide at B is independent of the random variable corresponding to the nucleotide at M.
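
For readers who like to see such a statement checked numerically, here is a minimal Python sketch. It is an illustration under stated assumptions, not code from any phylogenetics package: it assumes the pattern probabilities factor as p_{ij} = a_i b_j (the independence case just described), the distributions a and b are made-up numbers, and the function names are hypothetical. It evaluates every expression of the form p_{ij}p_{kl} - p_{il}p_{kj}, i.e. every 2×2 minor of the 4×4 table of pattern probabilities.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def joint_from_marginals(a, b):
    """Return p[i][j] = a[i] * b[j], the joint distribution of two independent leaves."""
    return {i: {j: a[i] * b[j] for j in NUCLEOTIDES} for i in NUCLEOTIDES}


def max_abs_minor(p):
    """Largest |p_ij * p_kl - p_il * p_kj| over all choices of i, j, k, l."""
    return max(
        abs(p[i][j] * p[k][l] - p[i][l] * p[k][j])
        for i, j, k, l in product(NUCLEOTIDES, repeat=4)
    )


# Hypothetical nucleotide distributions at the leaves B and M (each sums to 1).
a = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
b = {"A": 0.4, "C": 0.3, "G": 0.2, "T": 0.1}

independent = joint_from_marginals(a, b)

# A table that is NOT of this form: B and M always carry the same nucleotide.
identical = {
    i: {j: (0.25 if i == j else 0.0) for j in NUCLEOTIDES} for i in NUCLEOTIDES
}

print(max_abs_minor(independent))  # zero up to floating-point rounding
print(max_abs_minor(identical))    # 0.0625, so the equation is violated
```

The first number is zero up to rounding, while the second table, in which B and M always agree, gives 0.0625 and so fails the equation.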

The example is elementary, but it hints at a powerful tool for phylogenetics. It provides an equation that must be satisfied by the pattern probabilities that does not depend specifically on the parameters of the model (which can be intuitively understood as relating to branch length). If many sites are available so that pattern probabilities can be estimated empirically from data, then there is in principle a possibility for testing whether the data fits the topology of a specific tree regardless of what the branch lengths of the tree might be. Returning to Mumford’s description of algebraic geometry, the variety of interest is the geometric object in “pattern probability space” where points are precisely probabilities that can arise for a specific tree, and the “ring of polynomials with the geometric properties of the locus” are the phylogenetic invariants. The relevance of the ring lies in the fact that if f and g are two phylogenetic invariants then that means that f(P)=0 and g(P)=0 for any pattern probabilities from the model, so therefore f+g is also a phylogenetic invariant because f(P)+g(P)=0 for any pattern probabilities from the model (the same is true for c \cdot f for any constant c). In other words, there is an algebra of phylogenetic invariants that is closely related to the geometry of pattern probabilities. As Mumford and Tate explain, Grothendieck figured out the right generalizations to construct a theory for any ring, not just the ring of polynomials, and therewith connected the fields of commutative algebra, algebraic geometry and number theory.
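
The idea of testing data against an invariant can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the sites are simulated rather than real, the names are hypothetical, and the score (the largest absolute value of p_{ij}p_{kl} - p_{il}p_{kj} evaluated on the empirical frequencies) is only a crude stand-in for a proper statistical test.

```python
import random
from collections import Counter
from itertools import product

NUC = "ACGT"


def empirical_pattern_probabilities(sites):
    """Estimate p_ij from a list of (B, M) nucleotide pairs, one pair per site."""
    counts = Counter(sites)
    n = len(sites)
    return {(i, j): counts[(i, j)] / n for i, j in product(NUC, repeat=2)}


def invariant_score(p_hat):
    """Largest |p_ij * p_kl - p_il * p_kj|; near zero if the data fit the equation."""
    return max(
        abs(p_hat[i, j] * p_hat[k, l] - p_hat[i, l] * p_hat[k, j])
        for i, j, k, l in product(NUC, repeat=4)
    )


rng = random.Random(0)  # fixed seed so the simulation is repeatable
n_sites = 20000

# Simulated data set 1: B and M drawn independently (should satisfy the equation).
b_states = rng.choices(NUC, weights=[0.1, 0.2, 0.3, 0.4], k=n_sites)
m_states = rng.choices(NUC, weights=[0.4, 0.3, 0.2, 0.1], k=n_sites)
independent_sites = list(zip(b_states, m_states))

# Simulated data set 2: M copies B at every site (should violate the equation).
copied_sites = [(x, x) for x in b_states]

score_independent = invariant_score(empirical_pattern_probabilities(independent_sites))
score_copied = invariant_score(empirical_pattern_probabilities(copied_sites))
print(score_independent, score_copied)
```

With 20,000 simulated sites the independent data should score close to zero and the copied data far from it; how close is close enough is exactly the kind of statistical question a real test based on invariants has to answer.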

The use of phylogenetic invariants for testing tree topologies is elegantly illustrated in a wonderful book chapter on phylogenetic invariants by mathematicians Elizabeth Allman and John Rhodes that starts with the simple example of the two taxa tree and delves deeply into the subject. Two surfaces (conceptually) represent the varieties for two trees, and the equations f_1(P)=f_2(P)=\ldots=f_l(P)=0 and h_1(P)=h_2(P)=\ldots=h_k(P)=0 are the phylogenetic invariants. The empirical pattern probability distribution is the point \hat{P} and the goal is to find the surface it is close to:

Figure 4.2 from the Allman and Rhodes chapter on phylogenetic invariants.

Of course for large trees there will be many different phylogenetic invariants, and the polynomials may be of high degree. Figuring out what the invariants are, how many of them there are, bounds for the degrees, understanding the geometry, and developing tests based on the invariants, is essentially a (difficult unsolved) challenge for algebraic geometers. I think it’s fair to say that our book spurred a lot of research on the subject, and helped to create interest among mathematicians who were unaware of the variety and complexity of problems arising from phylogenetics. Nick Eriksson, Kristian Ranestad, Bernd Sturmfels and Seth Sullivant wrote a short piece titled phylogenetic algebraic geometry which is an introduction for algebraic geometers to the subject. Here is where we come full circle to Mumford’s obituary… the notion of a scheme is obviously central to phylogenetic algebraic geometry. And the expository article just cited is just the beginning. There are too many exciting developments in phylogenetic geometry to summarize in this post, but Elizabeth Allman, Marta Casanellas, Joseph Landsberg, John Rhodes, Bernd Sturmfels and Seth Sullivant are just a few of many who have discovered beautiful new mathematics motivated by the biology, and also have had an impact on biology with algebro-geometric tools. There is both theory (see this recent example) and application (see this recent example) coming out of phylogenetic algebraic geometry. More generally, algebraic statistics for computational biology is now a legitimate “field”, complete with a journal, regular conferences, and a critical mass of mathematicians, statisticians, and even some biologists working in the area. Some of the results are truly beautiful and impressive. My favorite recent one is this paper by Caroline Uhler, Donald Richards and Piotr Zwiernik providing important guarantees for maximum likelihood estimation of parameters in Felsenstein’s continuous character model.

But that is not the point here. First, Mumford’s sarcasm was unwarranted. Biologists certainly didn’t discover schemes but as Felsenstein and Lake’s work shows, they did (re)discover algebraic geometry. Moreover, all of the people mentioned above can explain schemes to biologists, thereby answering Mumford’s question in the affirmative. Many of them have not only collaborated with biologists but written biology papers. And among them are some extraordinary expositors, notably Bernd Sturmfels. Still, even if there are mathematicians able and willing to explain schemes to biologists, and even if there are areas within biology where schemes arise (e.g. phylogenetic algebraic geometry), it is fair to ask whether biologists should care to understand them.

The answer to the question is: probably not. In any case I wouldn’t presume to opine on what biologists should and shouldn’t care about. Biology is enormous, and encompasses everything from the study of fecal transplants to the wood frogs of Alaska. However, I do have an opinion about the area I work in, namely genomics. When it comes to genomics, journalists write about revolutions, personalized precision medicine, curing cancer and data deluge. But the biology of genomics is for real, and it is indeed tremendously exciting as a result of dramatic improvements in underlying technologies (e.g. DNA sequencing and genome editing to name two). I also believe it is true that despite what is written about data deluge, experiments remain the primary, and the best, way to elucidate the function of the genome. Data analysis is secondary. But it is true that statistics has become much more important to genomics than it was even to population genetics at the time of R.A. Fisher, computer science is playing an increasingly important role, and I believe that somewhere in the mix of “quantitative sciences for biology”, there is an important role for mathematics.

What biologists should appreciate, what was on offer in Mumford’s obituary, and what mathematicians can deliver to genomics that is special and unique, is the ability to not only generalize, but to do so “correctly”. The mathematician Raoul Bott once reminisced that “Grothendieck was extraordinary as he could play with concepts, and also was prepared to work very hard to make arguments almost tautological.” In other words, what made Grothendieck special was not that he generalized concepts in algebraic geometry to make them more abstract, but that he was able to do so in the right way. What made his insights seemingly tautological at the end of the day, was that he had the “right” way of viewing things and the “right” abstractions in mind. That is what mathematicians can contribute most of all to genomics. Of course sometimes theorems are important, or specific mathematical techniques solve problems and mathematicians are to thank for that. Phylogenetic invariants are important for phylogenetics which in turn is important for comparative genomics which in turn is important for functional genomics which in turn is important for medicine. But it is the abstract thinking that I think matters most. In other words, I agree with Charles Darwin that mathematicians are endowed with an extra sense… I am not sure exactly what he meant, but it is clear to me that it is the sense that allows for understanding the difference between the “right” way and the “wrong” way to think about something.

There are so many examples of how the “right” thinking has mattered in genomics that they are too numerous to list here, but here are a few samples: At the heart of molecular biology, there is the “right” and the “wrong” way to think about genes: evidently the message to be gleaned from Gerstein et al.’s “What is a gene post ENCODE? History and Definition” is that “genes” are not really the “right” level of granularity but transcripts are. In a previous blog post I’ve discussed the “right” way to think about the Needleman-Wunsch algorithm (tropically). In metagenomics there is the “right” abstraction with which to understand UniFrac. One paper I’ve written (with Niko Beerenwinkel and Bernd Sturmfels) is ostensibly about fitness landscapes but really about what we think the “right” way is to look at epistasis. In systems biology there is the “right” way to think about stochasticity in expression (although I plan a blog post that digs a bit deeper). There are many many more examples… way too many to list here… because ultimately every problem in biology is just like in math… there is the “right” and the “wrong” way to think about it, and figuring out the difference is truly an art that mathematicians, the type of mathematicians that work in math departments, are particularly good at.

Here is a current example from (computational) biology where it is not yet clear what “right” thinking should be despite the experts working hard at it, and that is useful to highlight because of the people involved: With the vast number of human genomes being sequenced (some estimates are as high as 400,000 in the coming year), there is an increasingly pressing fundamental question about how the (human) genome should be represented and stored. This is ostensibly a computer science question: genomes should perhaps be compressed in ways that allow for efficient search and retrieval, but I’d argue that fundamentally it is a math question. This is because what the question is really asking, is how should one think about genome sequences related mostly via recombination and only slightly by mutation, and what are the “right” mathematical structures for this challenge? The answer matters not only for the technology (how to store genomes), but much more importantly for the foundations of population and statistical genetics. Without the right abstractions for genomes, the task of coherently organizing and interpreting genomic information is hopeless. David Haussler (with coauthors) and Richard Durbin have both written about this problem in papers that are hard to describe in any way other than as math papers; see Mapping to a Reference Genome Structure and Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Perhaps it is no coincidence that both David Haussler and Richard Durbin studied mathematics.

But neither David Haussler nor Richard Durbin is a faculty member in a mathematics department. In fact, there is a surprisingly long list of very successful (computational) biologists specifically working in genomics, many of whom even continue to do math, but not in math departments, i.e. they are former mathematicians (this is so common there is even a phrase for it, “recovering mathematician”, as if being one were akin to alcoholism; physicists use the same language). People include Richard Durbin, Phil Green, David Haussler, Eric Lander, Montgomery Slatkin and many others I am omitting; for example almost the entire assembly group at the Broad Institute consists of former mathematicians. Why are there so many “formers” and very few “currents”? And does it matter? After all, it is legitimate to ask whether successful work in genomics is better suited to departments, institutes and companies outside the realm of academic mathematics. It is certainly the case that to do mathematics, or to publish mathematical results, one does not need to be a faculty member in a mathematics department. I’ve thought a lot about these issues and questions, partly because they affect my daily life working between the worlds of mathematics and molecular biology in my own institution. I’ve also seen the consequences of the separation of the two cultures. To illustrate how far apart they are I’ve made a list of specific differences below:

- Biologists publish in “glamour journals” such as Science, Nature and Cell where impact factors are high. Nature publishes its impact factor to three decimal digits accuracy (42.317). Mathematicians publish in journals whose names start with the word Annals, and they haven’t heard of impact factors. The impact factor of the Annals of Mathematics, perhaps the most prestigious journal in mathematics, is 3 (the journal with the highest impact factor is the Journal of the American Mathematical Society at 3.5).
- Mathematicians post all papers on the ArXiv preprint server prior to publication. Not only do biologists not do that, they are frequently subject to embargos prior to publication.
- Mathematicians write in LaTeX, biologists in Word (a recent paper argues that Word is better, but I’m not sure).
- Biologists draw figures and write papers about them. Mathematicians write papers and draw figures to explain them.
- Mathematicians order authors alphabetically, and authorship is awarded if a mathematical contribution was made. Biologists’ author lists have two gradients from each end, and authorship can be awarded for payment for the work.
- Biologists may review papers on two week deadlines. Mathematicians review papers on two year deadlines.
- Biologists have their papers cited by thousands, and their results have a real impact on society; in many cases diseases are cured as a result of basic research. Mathematicians are lucky if 10 other individuals on the planet have any idea what they are writing about. Impact time can be measured in centuries, and sometimes theorems turn out to simply not have been interesting at all.
- Biologists don’t teach much. Mathematicians do (at UC Berkeley my math teaching load is 5 times that of my biology teaching load).
- Biologists value grants during promotion cases and hiring. Mathematicians don’t.
- Biologists have chalk talks during job interviews. Mathematicians don’t.
- Mathematicians have a jobs wiki. Biologists don’t.
- Mathematicians write ten page recommendation letters. Biologists don’t.
- Biologists go to retreats to converse. Mathematicians retreat from conversations (my math department used to have a yearly retreat that was one day long and consisted of a faculty meeting around a table in the department; it has not been held the past few years).
- Mathematics graduate students teach. Biology graduate students rotate.
- Biology students take very little coursework after their first year. Mathematics graduate students take two years of classes (on this particular matter I’m certain mathematicians are right).
- Biologists pay their graduate students from grants. Mathematicians don’t (graduate students are paid for teaching sections of classes, usually calculus).
- Mathematics full professors that are female is a number (%) in the single digits. Biology full professors that are female is a number (%) in the double digits (although even added together the numbers are still much less than 50%).
- Mathematicians believe in God. Biologists don’t.

How then can biology, specifically genomics (or genetics), exist and thrive within the mathematics community? And how can mathematics find a place within the culture of biology?

I don’t know. The relationship between biology and mathematics is on the rocks and prospects are grim. Yes, there are biologists who do mathematical work, and yes, there are mathematical biologists, especially in areas such as evolution or ecology who are in math departments. There are certainly applied mathematics departments with faculty working on biology problems involving modeling at the macroscopic level, where the math fits in well with classic applied math (e.g. PDEs, numerical analysis). But there is very little genomics or genetics related math going on in math departments. And conversely, mathematicians who leave math departments to work in biology departments or institutes face enormous pressure to not focus on the math, or when they do any math at all, to not publish it (work is usually relegated to the supplement and completely ignored). The result is that biology loses out due to the minimal real contact with math– the special opportunity of benefiting from the extra sense is lost, and conversely math loses the opportunity to engage biology– one of the most exciting scientific enterprises of the 21st century. The mathematician Gian-Carlo Rota said that “The lack of real contact between mathematics and biology is either a tragedy, a scandal, or a challenge, it is hard to decide which”. He was right.

The extent to which the two cultures have drifted apart is astonishing. For example, visiting other universities I see the word “mathematics” almost every time precision medicine is discussed in the context of a new initiative, but I never see mathematicians or the local math department involved. In the mathematics community, there has been almost no effort to engage and embrace genomics. For example, the annual joint AMS-MAA meetings always boast a series of invited talks, many on applications of math, but genomics is never a represented area. Yet in my junior-level course last semester on mathematical biology (taught in the math department) there were 46 students, more than in any other upper-division elective class in the math department. Even though I am a 50% member of the mathematics department I have been advising three math graduate students this year, equivalent to six for a full-time member, a statistic that probably ranks me among the busiest advisors in the department (these numbers do not even reflect the fact that I had to turn down a number of students). Anecdotally, the numbers illustrate how popular genomics is among math undergraduate and graduate students, and although hard data is difficult to come by, my interactions with mathematicians everywhere convince me the trend I see at Berkeley is universal. So why is this popularity not reflected in support of genomics by the math community? And why don’t biology journals, conferences and departments embrace more mathematics? There is a hypocrisy of math for biology. People talk about it, but when push comes to shove nobody wants to do anything real to foster it.

Examples abound. On December 16th UCLA announced the formation of a new Institute for Quantitative and Computational Biosciences. The announcement leads with a photograph of the director that is captioned “Alexander Hoffmann and his colleagues will collaborate with mathematicians to make sense of a tsunami of biological data.” Strangely though, the math department is not one of the 15 partner departments that will contribute to the Institute. That is not to say that mathematicians won’t interact with the Institute, or that mathematics won’t happen there. E.g., the Institute for Pure and Applied Mathematics is a partner as is the Biomathematics department (an interesting UCLA concoction), not to mention the fact that many of the affiliated faculty do work that is in part mathematical. But formal partnership with the mathematics department, and through it direct affiliation with the mathematics community, is missing. UCLA’s math department is among the top in the world, and boasts a particularly robust applied mathematics program many of whose members work on mathematical biology. More importantly, the “pure” mathematicians at UCLA are first rate and one of them, Terence Tao, is possibly the most talented mathematician alive. Wouldn’t it be great if he could be coaxed to think about some of the profound questions of biology? Wouldn’t it be awesome if mathematicians in the math department at UCLA worked hard with the biologists to tackle the extraordinary challenges of “precision medicine”? Wouldn’t it be wonderful if UCLA’s Quantitative and Computational biosciences Institute could benefit from the vast mathematics talent pool not only at UCLA but beyond: that of the entire mathematics community?

I don’t know if the omission of the math department was an accidental oversight of the Institute, a deliberate snub, or if it was the Institute that was rebuffed by the mathematics department. I don’t think it really matters. The point is that the UCLA situation is ubiquitous. Mathematics departments are almost never part of new initiatives in genomics; biologists are all too quick to glance the other way. Conversely, the mathematics community has shunned biologists. Despite two NSF Institutes dedicated to mathematical biology (the MBI and NIMBioS) almost no top math departments hire mathematicians working in genetics or genomics (see the mathematics jobs wiki). In the rooted tree in the figure above B can represent Biology and M can represent Mathematics and they truly, and sadly, are independent.

I get it. The laundry list of differences between biology and math that I aired above can be overwhelming. Real contact between the subjects will be difficult to foster, and it should be acknowledged that it is neither necessary nor sufficient for the science to progress. But wouldn’t it be better if mathematicians proved they are serious about biology and biologists truly experimented with mathematics? 


Notes:

1. The opening paragraph is an edited copy of an excerpt (page 2, paragraph 2) from C.P. Snow’s “The Two Cultures and The Scientific Revolution” (The Rede Lecture 1959).
2. David Mumford’s content on his site is available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License, and I have incorporated it in my post (boxed text) unaltered according to the terms of the license.
3. The meaning of the word “invariant” in “phylogenetic invariants” differs from the standard meaning in mathematics, where invariant refers to a property of a class of objects that is unchanged under transformations. In the context of algebraic geometry classic invariant theory addresses the problem of determining polynomial functions that are invariant under transformations from a linear group. Mumford is known for his work on geometric invariant theory. An astute reader could therefore deduce from the term “phylogenetic invariants” that the term was coined by biologists.

I visited Duke’s mathematics department yesterday to give a talk in the mathematical biology seminar. After an interesting day meeting many mathematicians and (computational) biologists, I had an excellent dinner with Jonathan Mattingly, Sayan Mukherjee, Michael Reed and David Schaeffer. During dinner conversation, the topic of probability theory (and how to teach it) came up, and in particular Buffon’s needle problem.

The question was posed by Georges-Louis Leclerc, Comte de Buffon in the 18th century:

Suppose we have a floor made of parallel strips of wood, each the same width, and we drop a needle onto the floor. What is the probability that the needle will lie across a line between two strips?

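Suppose the strips are distance t apart and the needle has length l \leq t. Let x be the distance from the center of the needle to the nearest line (uniform on [0, t/2]) and let \theta be the acute angle between the needle and the lines (uniform on [0, \pi/2]). The needle crosses a line exactly when x \leq \frac{l}{2} \sin \theta, so the probability P is given by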

P = \int_{\theta =0}^{\frac{\pi}{2}} \int_{x = 0}^{\frac{l}{2} \sin \theta} \frac{4}{t \pi} dx d\theta = \frac{2l}{t \pi}.

The appearance of \pi in the denominator turns the problem into a Monte Carlo technique for estimating \pi: simply simulate random needle tosses and count crossings.
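As a rough illustration of that Monte Carlo idea, here is a minimal Python sketch. The function name `estimate_pi` and its parameters are my own choices for illustration, not taken from any reference. It samples the distance from the needle's center to the nearest line and the needle's angle, counts crossings, and inverts P = \frac{2l}{t \pi}.

```python
import math
import random


def estimate_pi(num_tosses, needle_len=1.0, strip_width=1.0, seed=0):
    """Estimate pi from simulated needle tosses (assumes needle_len <= strip_width)."""
    if needle_len > strip_width:
        raise ValueError("this sketch assumes needle_len <= strip_width")
    rng = random.Random(seed)
    crossings = 0
    for _ in range(num_tosses):
        # Distance from the needle's center to the nearest line: uniform on [0, t/2].
        x = rng.uniform(0.0, strip_width / 2.0)
        # Acute angle between needle and lines: uniform on [0, pi/2].
        # (Using math.pi here is circular; it keeps the sketch short.)
        theta = rng.uniform(0.0, math.pi / 2.0)
        if x <= (needle_len / 2.0) * math.sin(theta):
            crossings += 1
    if crossings == 0:
        raise ValueError("no crossings observed; use more tosses")
    # P = 2l / (t * pi), so pi is approximately 2l * N / (t * crossings).
    return 2.0 * needle_len * num_tosses / (strip_width * crossings)


if __name__ == "__main__":
    print(estimate_pi(200_000))
```

The estimate converges slowly (the error shrinks like one over the square root of the number of tosses), so this is a curiosity rather than a practical way to compute \pi.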

It turns out there is a much more elegant solution to the problem– one that does not require calculus. I learned of it from Gian-Carlo Rota when I was a graduate student at MIT. It appears in his book Introduction to Geometric Probability (with Dan Klain) that I have occasionally used when teaching Math 249. The argument relies on the linearity of expectation, and is as follows:

Let f(l) denote the expected number of crossings when a needle of length l is thrown on the floor. Now consider two needles, one of length l and the other m, attached to each other end to end (possibly at some angle). If X_1 is a random variable describing the number of crossings of the first needle, and X_2 of the second, it’s certainly the case that X_1 and X_2 are dependent, but because expectation is linear, it is the case that E(X_1+X_2) = E(X_1)+E(X_2). In other words, the total number of crossings is, in expectation, f(l)+f(m).

Buffon’s needle problem: what is the probability that a needle of length l \leq t crosses a line? (A) A short needle being thrown at random on a floor with parallel lines. (B) Two connected needles. The expected number of crossings is proportional to the sum of their lengths. (C) A circle of diameter t always crosses the lines exactly twice.

It follows that f is a linear function, and since f(0)=0, we have that f(l) = cl where c is some constant. Now consider a circle of diameter t. Such a circle, when thrown on the floor, always crosses the parallel lines exactly twice. If C is a regular polygon with vertices on the circle, and the total length of the polygon segments is l, then the expected number of crossings of C is f(l). Taking the limit as the number of segments in the polygon goes to infinity, we find that f(t \pi ) = 2. In other words,

f(t \pi) = c \cdot t \pi = 2 \Rightarrow c = \frac{2}{t \pi},

and the expected number of crossings of a needle of length l is \frac{2l}{t \pi}. If l < t, the number of crossings is either 0 or 1, so the expected number of crossings is, by definition of expectation, equal to the probability of a single crossing. This solves Buffon’s problem, no calculus required!
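The linearity argument can be checked numerically. The Python sketch below is my own illustration (the names `crossings` and `mean_crossings` are hypothetical): it throws a rigid chain of segments at random and averages the number of line crossings. If the argument is right, the average should be close to \frac{2L}{t \pi}, where L is the total length, whether the chain is straight, bent, or longer than t.

```python
import math
import random


def crossings(y_start, y_end, t):
    """Number of lines y = k*t crossed by a segment with these endpoint heights."""
    return abs(math.floor(y_end / t) - math.floor(y_start / t))


def mean_crossings(lengths, bend_angles, t=1.0, num_tosses=200_000, seed=0):
    """Average crossings for a rigid chain of segments thrown at random.

    lengths[i] is the length of segment i; bend_angles[i] is the fixed turn
    (in radians) between segment i and segment i + 1.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(num_tosses):
        # Random placement: height of the first endpoint and overall orientation.
        y = rng.uniform(0.0, t)
        heading = rng.uniform(0.0, 2.0 * math.pi)
        for i, seg in enumerate(lengths):
            y_next = y + seg * math.sin(heading)
            total += crossings(y, y_next, t)
            y = y_next
            if i < len(bend_angles):
                heading += bend_angles[i]
    return total / num_tosses


if __name__ == "__main__":
    # A straight needle longer than the strip width.
    print(mean_crossings([2.5], []), 2 * 2.5 / math.pi)
    # Two needles of lengths 0.7 and 0.5 joined at a fixed angle.
    print(mean_crossings([0.7, 0.5], [1.0]), 2 * 1.2 / math.pi)
```

Only the heights of the endpoints matter for counting crossings of horizontal lines, which is why the sketch tracks y and ignores the horizontal coordinate.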

The linearity of expectation appears elementary at first glance. The proof is simple, and it is one of the first “facts” learned in statistics– I taught it to my math 10 students last week. However the apparent simplicity masks its depth and utility; the above example is cute, and one of my favorites, but linearity of expectation is useful in many settings. For example I recently saw an interesting application in an arXiv preprint by Anand Bhaskar, Andy Clark and Yun Song on “Distortion of genealogical properties when the sample is very large”.

The paper addresses an important question, namely the suitability of the coalescent as an approximation to discrete time random mating models when sample sizes are large. The question matters because population sequencing is starting to involve hundreds of thousands, if not millions, of individuals.

The results of Bhaskar, Clark and Song are based on dynamic programming calculations of various genealogical quantities as inferred from the discrete time Wright-Fisher model. An example is the expected frequency spectrum for random samples of individuals from a population. By frequency spectrum, they mean, for each k, the expected number of polymorphic sites with k derived alleles and n-k ancestral alleles under an infinite-sites model of mutation in a sample of n individuals. Without going into details (see their equations (8), (9) and (10)), the point is that the quantities they compute are expectations, and it is the linearity of expectation that allows them to derive the dynamic programming recursions.

None of this has anything to do with my seminar, except for the fact that the expectation-maximization algorithm did make a brief appearance, as it frequently does in my lectures these days. I spoke mainly about some of the mathematics problems that arise in comparative transcriptomics, with a view towards a principled approach to comparing transcriptomes between cells, tissues, individuals and species.

The Duke Chapel. While I was inside someone was playing the organ, and as I stared at the ceiling, I could have sworn I was in Europe.
