One of my favorite ideas in phylogenetics is that of a phylogenetic orange. The idea and the terminology come from a classic paper by Junhyong Kim: Slicing hyperdimensional oranges: the geometry of phylogenetic estimation, Molecular Phylogenetics and Evolution, 17 (2000), pp. 58–75 (pdf available here). In the words of the author (from the abstract):
A new view of phylogenetic estimation is presented where data sets, tree evolution models, and estimation methods are placed in a common geometric framework.
Prior to Kim’s paper the term “space of phylogenetic trees” was used only metaphorically. Moreover, even though there were many papers proposing definitions for the “distance” between trees (the most popular being the Robinson–Foulds distance), the ideas were disconnected from the models used to analyze data, and therefore there was little hope of building a coherent theory of phylogenetics. The phylogenetic orange that Kim defined served both to define a “tree space” and to provide a geometric framework for thinking about data and estimation in that space.
A few years ago Mike Steel wrote a wonderful, whimsical short story, “My friend and I catch a bus”. It describes a conversation between the narrator (a biologist?) and a friend (a probabilist?) as they take a bus to the movies.
Along the way the narrator learns some statistics: Bayes’ rule, expectation (of the exponential distribution) and the Poisson distribution. The latter comes up in response to a question about the probability that we are alone in the universe.
The “solution” is as follows: suppose that the probability that life evolves on any given planet is some tiny number p, and that there are a huge number N of planets in the universe. The probability that life evolved on exactly k planets is then approximately

$P(k) \approx \frac{e^{-Np}(Np)^k}{k!},$

based on the Poisson approximation to the binomial distribution. The question of interest is equivalent to asking for the probability that $k=1$ given that $k \geq 1$. Assuming that $Np=1$ (why? well, why not!?) the answer is

$P(k=1 \mid k \geq 1) = \frac{e^{-1}}{1-e^{-1}} = \frac{1}{e-1} \approx 0.58.$
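This back-of-the-envelope calculation is easy to check numerically. Here is a minimal Python sketch, assuming (as in the story's arbitrary choice) that the Poisson mean Np equals 1:

```python
import math

# Poisson pmf: probability that life evolved on exactly k planets,
# where lam = N*p is the mean (Poisson approximation to the binomial).
def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

# "Are we alone?" = P(k = 1 | k >= 1), assuming lam = N*p = 1.
lam = 1.0
p_alone = poisson_pmf(1, lam) / (1 - poisson_pmf(0, lam))
print(round(p_alone, 3))  # 1/(e - 1) ≈ 0.582
```

So under this (admittedly arbitrary) assumption, the odds are slightly better than even that there is life elsewhere.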
At UC Berkeley, the mathematicians and biologists have decided, proverbially, to catch a bus together. In 2008 then deans Mark Richards of natural sciences and Mark Schlissel of biological sciences took the bold step of forming a committee, which I chaired, charged with investigating the possibility of revamping the math education of biology majors. The goal was not blind reform but rather a plan for training that would prepare students for upper division biology courses with more math and statistics.
There is a new arXiv paper out with the title Sailfish: Alignment-free Isoform Quantification from RNA-Seq Reads using Lightweight Algorithms by Rob Patro, Stephen M. Mount and Carl Kingsford. It describes a new approach to RNA-Seq quantification that is based on directly estimating abundances from k-mers rather than read alignments. This is an interesting approach, because it avoids the time-intensive read alignment step that is rapidly becoming a bottleneck in RNA-Seq analysis. The idea of avoiding read alignments to reference genomes/transcriptome in *Seq experiments is being explored in other contexts as well, such as for mutant mapping (by the Korbinian Schneeberger group) and genotyping (by the Gil McVean group). I am particularly interested in these ideas as we have been exploring such methods for association mapping.
Patro, Mount and Kingsford work with a simplified model for RNA-Seq to first obtain approximate transcript abundance estimates. In the notation of my survey paper on RNA-Seq models (see equation 14, except with k replaced by j to avoid confusion), they are maximizing the likelihood
where the product is over k-mers instead of reads, so that N is the total number of k-mers in the data (where k is the k-mer size) rather than the total number of reads. The EM updates are therefore the same as those of other RNA-Seq quantification algorithms (see Figure 4 in my survey). They also implement an acceleration of the EM called SQUAREM (by Varadhan and Roland) in order to improve convergence.
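As a rough illustration of the idea (a toy sketch, not Sailfish's implementation: the counts and compatibility sets below are made up, and Sailfish additionally uses effective lengths, bias terms, and SQUAREM acceleration), an EM over k-mer counts looks like this:

```python
# Toy EM for transcript abundances from k-mer counts. counts[j] is the
# number of occurrences of k-mer j in the reads; compat[j] lists the
# transcripts containing k-mer j. alpha[t] is the estimated abundance.
def em_abundances(counts, compat, n_transcripts, iters=100):
    alpha = [1.0 / n_transcripts] * n_transcripts
    N = sum(counts)
    for _ in range(iters):
        new = [0.0] * n_transcripts
        for c, ts in zip(counts, compat):
            denom = sum(alpha[t] for t in ts)
            for t in ts:
                new[t] += c * alpha[t] / denom  # E-step: fractional assignment
        alpha = [x / N for x in new]            # M-step: renormalize
    return alpha

# Hypothetical example: 3 transcripts, 4 k-mers with ambiguous origins.
counts = [10, 5, 8, 2]
compat = [[0], [0, 1], [1, 2], [2]]
print(em_abundances(counts, compat, 3))
```

The update is exactly the fragment-assignment EM of read-based quantifiers, just applied to k-mer counts, which is why the alignment step can be skipped.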
The results of the paper are impressive. They compare speed and accuracy with RSEM, Cufflinks and eXpress and obtain comparable accuracy while avoiding the time-intensive alignment of reads to transcripts (or the genome in the case of Cufflinks). An interesting point is that bias can be corrected after fragment assignment (or, in the case of Sailfish, after k-mer assignment) without much loss in accuracy. We used a similar approximation in eXpress, namely stopping estimation of bias parameters after 5 million reads have been processed, but it seems that postponing the entire correction until fragment assignment is complete is acceptable.
Sailfish also appears to have been well engineered. The code (in C++) is well documented and available in both source and executable (for Linux and Mac OS X). I haven’t had a chance to test it yet but hope to do so soon.
My only concern with the manuscript is that the simulation results for eXpress appear to be unreasonable. The experiments conducted on “real data” (for which qPCR measurements appear to be available) suggest that the accuracy of Sailfish is similar to that of eXpress, RSEM and Cufflinks (with RSEM underperforming slightly, presumably due to the lack of bias correction), but the simulations, performed with the Flux Simulator, are inconsistent with these results. I suspect there is a (trivial) problem with the simulated data, and presumably the authors will address this before journal publication.

Update: The authors responded to my blog post within a day, and we quickly realized the problem was likely that Flux Simulator did not output reads in random order. Random ordering of reads is important for eXpress to function correctly, and when we wrote our paper we verified that Illumina sequencers do indeed output reads in random order. Rob Patro shuffled the Flux Simulator reads and verified that the performance of eXpress on simulated data is consistent with the results on real data (see attached figure). We’re grateful for his quick work in sorting out the issue, and we thank the authors of Sailfish for posting their paper on the arXiv (as others are starting to do), thereby enabling this exchange to occur prior to publication.
When the organizers of ISMB 2013 kindly invited me to give a keynote presentation this year, I decided to use the opportunity to survey “sequence census” methods. These are functional genomics assays based on high-throughput sequencing. It has become customary to append the suffix “-Seq” to such assays (e.g. RNA-Seq), and I therefore prefer the term *Seq, where the * denotes a wildcard.
The starting point for preparing the talk was a molecular biology seminar I organized in the Spring of 2010, where we discussed new high-throughput sequencing-based assays with a focus on the diverse range of applications being explored. At the time I had performed a brief literature search to find *Seq papers for students to present, and this was helpful as a starting point for building a more complete bibliography for my talk. Finding *Seq assays is not easy (there is no central repository for them), but after some work I put together a (likely incomplete) bibliography that is appended to the end of the post. Update: I’ve created a page for actively maintaining a bibliography of *Seq assays.
The goal for my talk was to distill what appears to be a plethora of complex and seemingly unrelated experiments (see, e.g., Drukier et al. on the *Seq list) into a descriptive framework useful for thinking about their commonalities and differences. I came up with this cartoonish figure that I briefly explain in the remainder of this post. In future posts I plan to review in more detail some of these papers and the research they have enabled.
A typical assay first involves thinking of a (molecular) measurement to be made. The problem of making the measurement is then “reduced” (in the computer science sense of the word) to sequencing. This means that the measurement will be inferred from sequencing bits of DNA from “target” sequences (created during the reduction), and then counting the resulting fragments. It is important to keep in mind that the small fragments of DNA are sampled randomly from the targets, but the sampling may not be uniform.
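The effect of non-uniform sampling can be illustrated with a toy simulation (everything here is hypothetical: the two targets, their abundances, and the bias weights are made up for illustration, and real bias models are far richer):

```python
import random

# Fragments are sampled from target sequences, possibly non-uniformly,
# and the measurement is inferred from the resulting counts.
random.seed(0)
targets = {"A": 0.7, "B": 0.3}   # true relative abundances (made up)
bias    = {"A": 1.0, "B": 2.0}   # non-uniform sampling efficiency (made up)

# Sampling weight of each target = abundance * bias.
weights = {t: targets[t] * bias[t] for t in targets}
reads = random.choices(list(weights), weights=list(weights.values()), k=100000)

# Naive estimate: just normalize the raw counts.
counts = {t: reads.count(t) for t in targets}
naive = {t: counts[t] / len(reads) for t in counts}
print(naive)  # B is over-represented relative to its true abundance of 0.3
```

This is why the inference step below has to model the sampling process rather than simply normalize counts.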
The inference step is represented in the “Solve inverse problem” box in the figure, and involves developing a model of the experiment, together with an algorithm for inferring the desired measurement from the data (the sequenced DNA reads). Finally, the measurement becomes a starting point for further (computational) biology inquiry.