Last Saturday I returned from Cold Spring Harbor Laboratories where I spoke at the Genome Informatics Meeting on Stories from the Supplement. On Monday I delivered the “Prestige Lecture” at a meeting of the Center for Science of Information on New Directions in the Science of Information and I started by talking about Cold Spring Harbor Laboratory (CSHL). That is because the Eugenics Record Office at CSHL is where Claude Shannon, famous father of information theory, wrapped up his Ph.D. in population genetics in 1939.
The fact that Shannon did his Ph.D. in population genetics– his Ph.D. was titled “An Algebra for Theoretical Genetics“– is unknown to most information theorists and population geneticists. It is his masters thesis that is famous (for good reason– it can be said to have started the digital revolution), and his paper in 1948 that founded information theory. But his Ph.D. thesis was impressive in its own right: its contents formed the beginning of my talk to the information theorists, and I summarize the interesting story below.
I learned about the details surrounding Shannon’s foray into biology from a wonderful final project paper written for the class The Structure of Engineering Revolutions in the Fall of 2001: Eugene Chiu, Jocelyn Lin, Brok Mcferron, Noshirwan Petigara, Satwiksai Seshasai, Mathematical Theory of Claude Shannon. In 1939, Shannon’s advisor, Vannevar Bush, sent him to study genetics with Barbara Burks at the Eugenics Record Office at Cold Spring Harbor. That’s right, the Eugenics office was located at Cold Spring Harbor from 1910 until 1939, when it was closed down as a result of Nazi eugenics. Fortunately, Shannon was not very interested in the practical aspects of eugenics, and more focused on the theoretical aspects of genetics.
His work in genetics was a result of direction from Vannevar Bush, who knew about genetics via his presidency of the Carnegie Institution of Washington that ran the Cold Spring Harbor research center. Apparently Bush remarked to a colleague that “It occurred to me that, just as a special algebra had worked well in his hands on the theory of relays, another special algebra might conceivably handle some of the aspects of Mendelian heredity”. The main result of his thesis is his Theorem 12:
The notation refers to genotype frequencies in a diploid population. The indices refer to alleles at three loci on one haplotype, and at the same loci on the other haplotype. The variables correspond to recombination crossover probabilities. is the probability of an even number of crossovers between both the 1st and 2nd loci, and the 2nd and 3rd loci. is the probability of an even number of crossovers between the 1st and 2nd loci but an odd number of crossovers between the 2nd and 3rd loci, and so on. Finally, the dot notation in the represents summation over the index (these days one might use a ). The result is a formula for the population genotype frequencies after generations. The derivation involves elementary combinatorics, specifically induction, but it is an interesting result and at the time was not something population geneticists had worked out. What I find impressive about it is that Shannon, apparently on his own, mastered the basic principles of (population) genetics of his time, and performed a calculation that is quite similar to many that are relevant in population genetics today. Bush wrote about Shannon “At the time that I suggested that he try his queer algebra on this subject, he did not even know what the words meant… “.
Why did Shannon not pursue a career in population genetics? The Eugenics Record Office closed shortly after he left and Bush discouraged him from continuing in the field, telling him that “few scientists are ever able to apply creatively a new and unconventional method furnished by some one else – at least of their own generation”. Thus, despite encouragement from a number of statisticians and geneticists that his work was novel and of interest, Shannon returned to electrical engineering. Shortly thereafter, the world got information theory.
Of course today population genetics has data, tons of it, and many interesting problems, including some that I think require insights and ideas from information theory. My Prestige Lecture was aimed at encouraging information theorists to return to their Shannon roots, and redirect their focus towards biology. I have been working with information theorist David Tse (academic grandson of Shannon) for the past year on de novo RNA-Seq assembly (a talk on our joint work with postdoc Sreeram Kannan was presented by Sreeram at the Genome Informatics meeting), and I believe the engagement of information theorists in biology would be of great benefit to both fields; in terms of biology, I see many applications of information theory beyond population genetics. Some back-and-forth has already started. Recently there have been some interesting papers using information theory to study genome signatures and compression, but I believe that there are many other fruitful avenues for collaboration. David and Sreeram were the only information theorists at CSHL last week (I think), but I hope that there will be many more at the 2014 meeting in Cambridge, UK!
The beach at Cold Spring Harbor. I took the photo on November 1st before my Genome Informatics keynote.