Last Saturday I returned from Cold Spring Harbor Laboratory, where I spoke at the Genome Informatics Meeting on "Stories from the Supplement". On Monday I delivered the "Prestige Lecture" at a meeting of the Center for Science of Information on New Directions in the Science of Information, and I started by talking about Cold Spring Harbor Laboratory (CSHL). That is because the Eugenics Record Office at CSHL is where Claude Shannon, the famous father of information theory, wrapped up his Ph.D. in population genetics in 1939.
The fact that Shannon did his Ph.D. in population genetics (his thesis was titled "An Algebra for Theoretical Genetics") is unknown to most information theorists and population geneticists. It is his master's thesis that is famous (for good reason: it can be said to have started the digital revolution), along with his 1948 paper that founded information theory. But his Ph.D. thesis was impressive in its own right: its contents formed the beginning of my talk to the information theorists, and I summarize the interesting story below.
I learned the details surrounding Shannon's foray into biology from a wonderful final project paper written for the class The Structure of Engineering Revolutions in the Fall of 2001: Eugene Chiu, Jocelyn Lin, Brok Mcferron, Noshirwan Petigara, Satwiksai Seshasai, Mathematical Theory of Claude Shannon. In 1939, Shannon's advisor, Vannevar Bush, sent him to study genetics with Barbara Burks at the Eugenics Record Office at Cold Spring Harbor. That's right: the Eugenics Record Office was located at Cold Spring Harbor from 1910 until 1939, when it was closed down as a result of its association with Nazi eugenics. Fortunately, Shannon was not very interested in the practical aspects of eugenics, and was more focused on the theoretical aspects of genetics.
His work in genetics was a result of direction from Vannevar Bush, who knew about genetics via his presidency of the Carnegie Institution of Washington that ran the Cold Spring Harbor research center. Apparently Bush remarked to a colleague that “It occurred to me that, just as a special algebra had worked well in his hands on the theory of relays, another special algebra might conceivably handle some of the aspects of Mendelian heredity”. The main result of his thesis is his Theorem 12:
[Theorem 12 of Shannon's thesis: an expression for the genotype frequencies after $n$ generations of random mating, in terms of the initial frequencies and the crossover probabilities.]

The notation $h_{ij\,kl\,mn}$ refers to genotype frequencies in a diploid population. The indices $i, k, m$ refer to alleles at three loci on one haplotype, and $j, l, n$ to alleles at the same loci on the other haplotype. The $\lambda$ variables correspond to recombination crossover probabilities: $\lambda_{ee}$ is the probability of an even number of crossovers between both the 1st and 2nd loci and the 2nd and 3rd loci, while $\lambda_{eo}$ is the probability of an even number of crossovers between the 1st and 2nd loci but an odd number of crossovers between the 2nd and 3rd loci, and so on. Finally, a dot in the $h$ subscripts, as in $h_{i\cdot\,k\cdot\,m\cdot}$, represents summation over the dotted index (these days one might use a $+$). The result is a formula for the population genotype frequencies $h^{(n)}_{ij\,kl\,mn}$ after $n$ generations. The derivation involves elementary combinatorics, specifically induction, but it is an interesting result and at the time was not something population geneticists had worked out. What I find impressive about it is that Shannon, apparently on his own, mastered the basic principles of (population) genetics of his time, and performed a calculation that is quite similar to many that are relevant in population genetics today. Bush wrote about Shannon: "At the time that I suggested that he try his queer algebra on this subject, he did not even know what the words meant… ".
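To give a feel for the kind of calculation involved, here is a minimal Python sketch (the function and variable names are my own, and the notation is not Shannon's) of the three-locus haplotype-frequency recursion under random mating. The example $\lambda$ values assume independent crossovers in the two intervals, with recombination fraction 0.1 in each:

```python
import itertools
from collections import defaultdict

def next_haplotype_freqs(h, lam):
    """One generation of random mating with recombination at three loci.

    h   : dict mapping a haplotype (a1, a2, a3) to its frequency
    lam : probabilities of even/odd crossover counts in the two
          inter-locus intervals, keyed 'ee', 'eo', 'oe', 'oo'
    """
    out = defaultdict(float)
    for (x, px), (y, py) in itertools.product(h.items(), repeat=2):
        # A parent carrying haplotypes x and y is drawn with probability
        # px * py (random mating). The crossover parity pattern determines
        # which parental haplotype contributes the allele at each locus.
        gametes = {
            'ee': (x[0], x[1], x[2]),  # no switch in either interval
            'eo': (x[0], x[1], y[2]),  # switch between the 2nd and 3rd loci
            'oe': (x[0], y[1], y[2]),  # switch between the 1st and 2nd loci
            'oo': (x[0], y[1], x[2]),  # switch in both intervals
        }
        for pattern, g in gametes.items():
            out[g] += px * py * lam[pattern]
    return dict(out)

# Start from a population with only the two "coupling" haplotypes, and
# assume 10% recombination per interval with no interference.
h = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
lam = {'ee': 0.81, 'eo': 0.09, 'oe': 0.09, 'oo': 0.01}
for _ in range(10):
    h = next_haplotype_freqs(h, lam)
# Under random union of gametes, the genotype frequency for the haplotype
# pair (x, y) after n generations is h[x] * h[y].
```

The sketch covers both orderings of each parental haplotype pair, which is why the gamete can always be written as starting from x; summing over ordered pairs makes the recursion symmetric.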
Why did Shannon not pursue a career in population genetics? The Eugenics Record Office closed shortly after he left and Bush discouraged him from continuing in the field, telling him that “few scientists are ever able to apply creatively a new and unconventional method furnished by some one else – at least of their own generation”. Thus, despite encouragement from a number of statisticians and geneticists that his work was novel and of interest, Shannon returned to electrical engineering. Shortly thereafter, the world got information theory.
Of course today population genetics has data, tons of it, and many interesting problems, including some that I think require insights and ideas from information theory. My Prestige Lecture was aimed at encouraging information theorists to return to their Shannon roots, and redirect their focus towards biology. I have been working with information theorist David Tse (academic grandson of Shannon) for the past year on de novo RNA-Seq assembly (a talk on our joint work with postdoc Sreeram Kannan was presented by Sreeram at the Genome Informatics meeting), and I believe the engagement of information theorists in biology would be of great benefit to both fields; in terms of biology, I see many applications of information theory beyond population genetics. Some back-and-forth has already started. Recently there have been some interesting papers using information theory to study genome signatures and compression, but I believe that there are many other fruitful avenues for collaboration. David and Sreeram were the only information theorists at CSHL last week (I think), but I hope that there will be many more at the 2014 meeting in Cambridge, UK!
The beach at Cold Spring Harbor. I took the photo on November 1st before my Genome Informatics keynote.
9 comments
November 6, 2013 at 5:36 am
Marcin Cieslik
Interesting post. A couple of years ago I got fascinated by a paper by Gavin Crooks and Steven Brenner (http://bioinformatics.oxfordjournals.org/content/21/7/975.full). They proposed a *very* elegant Bayesian model of amino acid substitution within columns of protein alignments. In the limit of an infinitely large alignment their substitution score reduces to the Jensen-Shannon divergence, a very useful "distance" for probability distributions.
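For readers who haven't run into it, here is a minimal sketch (plain NumPy, with a function name of my own choosing) of the Jensen-Shannon divergence between two discrete distributions:

```python
import numpy as np

def jensen_shannon_divergence(p, q):
    """JSD(p, q) = H(m) - (H(p) + H(q)) / 2, where m = (p + q) / 2.

    Symmetric, always finite, and bounded by 1 when entropy is in bits.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def H(d):                       # Shannon entropy in bits, 0 log 0 := 0
        d = d[d > 0]
        return -np.sum(d * np.log2(d))
    return H(m) - 0.5 * (H(p) + H(q))

print(jensen_shannon_divergence([1, 0], [0, 1]))  # 1.0: maximally different
```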
November 6, 2013 at 6:00 am
Lior Pachter
Thanks for the comment, and the link to the interesting paper. Gavin Crooks' papers were my starting point in thinking of the Jensen-Shannon metric (square root of the divergence) for measuring changes in relative isoform abundances for the Cufflinks paper. Specifically, the paper arXiv:0706.0559 is very cool, and Crooks-inequality.pdf is also good to know. His website and blog are at http://threeplusone.com/gec/
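To illustrate the usage described here with a toy example (hypothetical abundances for a three-isoform gene, not numbers from the Cufflinks paper), the metric is the square root of the divergence, computed with the jensen_shannon_divergence sketch from the previous comment:

```python
# Relative isoform abundances of one gene in two conditions (hypothetical).
before = [0.70, 0.20, 0.10]
after  = [0.10, 0.30, 0.60]

# The Jensen-Shannon *metric* is the square root of the divergence;
# unlike the divergence itself, it satisfies the triangle inequality.
js_distance = jensen_shannon_divergence(before, after) ** 0.5
print(js_distance)
```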
November 6, 2013 at 8:37 am
dntse
Thanks Lior for an inspiring talk and an interesting bit of history about Shannon. While Shannon is usually described in the popular press as a mathematician or a scientist, I believe he was an engineer at heart. His best works were done by taking a concrete engineering problem, whether it was digital switching design or communication, and finding an appropriate mathematical framework to guide optimal design, replacing the ad hoc methods prevalent in the field at the time. For example, he titled his information theory paper "A Mathematical Theory of Communication" and not "A Theory of Information". Perhaps that's another reason why he left biology after his Ph.D. thesis and went back to his engineering roots. So perhaps another reason why information theorists should take another look at biology is that, with technologies like high-throughput sequencing, there is much more interplay between engineering and biology nowadays than in 1939.
November 10, 2013 at 4:20 pm
Pramod Viswanath
Hi Lior and David,
I am curious to see the “prestige lecture” by Lior at the center for the science of information. Is there any video footage archived? Or at least the plain slides? Thanks,
Pramod (academic great-grandson of Shannon)
November 10, 2013 at 4:27 pm
Lior Pachter
My lecture should be available online shortly. When it is I’ll update the blog post with the link.
November 7, 2013 at 5:05 pm
Anna Carbone
Hi Lior,
Very interesting piece on the roots of Shannon's work; I didn't know about it either.
Since you wrote "My Prestige Lecture was aimed at encouraging information theorists to return to their Shannon roots, and redirect their focus towards biology," I was encouraged to send you this link to my recent work on a new approach to Shannon entropy estimation and its application to the 24 human chromosomes. I'm not a biologist or geneticist… just an applied mathematician:
“Information Measures for Long-Range Correlated Sequences: the Case of the 24 Human Chromosome Sequences”
Scientific Reports vol. 3, Article number: 2721 (2013)
http://arxiv.org/abs/1302.0784
I will try to attend the meeting next September in Cambridge.
November 11, 2013 at 12:31 am
someone
Information theory has been used quite successfully in computational biology for a long time already. The latest developments in information theory are happening around the Minimum Description Length (MDL) principle and the Normalized Maximum Likelihood (NML) principle. For example, one of the best DNA compression algorithms uses the NML principle (see: Tabus et al., "DNA sequence compression using the normalized maximum likelihood model for discrete regression", 2003, http://alturl.com/9rqnb ).
One way to see how the NML principle is used to compress a DNA sequence is as a better way to estimate entropy, that is, to lower the entropy estimate (compression code length = -log probability).
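The code length/probability correspondence mentioned above, in a minimal sketch (a memoryless model with fixed base probabilities, far simpler than NML, with names of my own choosing):

```python
import math

def code_length_bits(seq, probs):
    """Ideal code length of seq under a memoryless model:
    L(seq) = -sum_i log2 P(seq[i]).
    A model that assigns the sequence higher probability yields a
    shorter code, i.e. a lower entropy estimate for the sequence.
    """
    return -sum(math.log2(probs[base]) for base in seq)

uniform = {b: 0.25 for b in "ACGT"}      # 2 bits per base
skewed = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}

seq = "AATTAATTAATT"
print(code_length_bits(seq, uniform))    # 24.0 bits
print(code_length_bits(seq, skewed))     # fewer bits: the model fits better
```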
Here are more examples about information theory used in computational biology:
http://alturl.com/b76sw
http://alturl.com/j5h6d
Tabus et al., "Classification and feature gene selection using the normalized maximum likelihood model for discrete regression", 2003, http://www.sciencedirect.com/science/article/pii/S016516840200470X
Tabus et al., "Nonlinear modeling of protein expressions in protein arrays", 2006, http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1634842
P.S. I am not an author or co-author of any of these articles!
An interesting quote regarding information theory and statistics, from the book J. Rissanen, "Optimal Estimation of Parameters", 2012,
http://www.amazon.com/Optimal-Estimation-Parameters-Jorma-Rissanen/dp/1107004748
"Very few statisticians have been studying information theory, the result of which, I think, is the disarray of the present discipline of statistics." (p. 2)
Jorma Rissanen is an information theorist (e.g., he invented arithmetic coding). More about him here: http://en.wikipedia.org/wiki/Jorma_Rissanen
October 31, 2014 at 11:26 pm
Chris Aldrich
Lior, seemingly pure serendipity brought me to your blog via a post from a new "colleague" on Twitter, but after reading several other excellent articles you've written, I was quite pleased to have run into this particular post on Shannon!
Truly his Ph.D. work has long been ignored, though he did have an ensuing close association with Norbert Wiener, who became the father of the cybernetics movement. I've long been fascinated with the confluence of information theory and microbiology, and in the last several years there seems to be a rising tide of research in this area, though I often feel there are very few people who have spent enough time in both fields to be comfortable or capable in both, particularly as even the differing viewpoints on information theory (those of engineers, mathematicians, physicists, and computer scientists) don't always seem to cohere completely.
In any case, I thought it might be of some use to mention a few specific references here for those who come across your post and are interested in the two topics. First, I'll mention an excellent 5-day workshop that just finished up this week at the Banff International Research Station (BIRS) on Biological and Bio-Inspired Information Theory. Fortunately the majority of the talks were recorded and are available as downloads or streaming video within that link. A few of the participants there are also putting together another workshop in April 2015 at NIMBioS on Information and entropy in biological systems.
For those interested in more depth and breadth, I along with others maintain a free collaborative list of journal articles, books, and other references via Mendeley at http://www.mendeley.com/groups/2545131/itbio-information-theory-microbiology-evolution-and-complexity/. My own personal blog has a handful of posts on these two subjects as well as a constantly growing list of pointers to specific researchers at the forefront of information theory and biology.
I’m curious if you ran into any more information theorists at the 2014 meeting as you’d hoped?
November 1, 2014 at 8:00 am
Lior Pachter
Many thanks for the link to the BIRS workshop; I was not aware of it, and it is indeed really great that they recorded the talks. Thanks also for the pointer to your bibliography. I was not able to attend Genome Informatics this year, so I can't answer your question, but I was happy to present a tutorial at the annual Allerton meeting this year, and I do think many information theorists are interested, and will in turn find a rewarding set of problems in/from biology.