You are currently browsing the monthly archive for August 2013.

RNA-Seq is the new kid on the block, but there is still something to be learned from the stodgy microarray. One of the lessons is hidden in a tech report by Daniela Witten and Robert Tibshirani from 2007: “A comparison of fold-change and the t-statistic for microarray data analysis”.

The tech report makes three main points. The first is that it is preferable to use a modified t-statistic rather than the ordinary t-statistic. This means that rather than comparing (normalized) means using

T_i = \frac{\bar{x}_i - \bar{y}_i}{s_i}

where s_i is the standard deviation of the replicates x_i (respectively y_i) of gene i in two different conditions, it is better to use

T'_i = \frac{\bar{x}_i - \bar{y}_i}{s_i+s_0}

where s_0 minimizes the coefficient of variation of T'_i.
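For concreteness, here is a minimal sketch (in Python with NumPy) of the modified t-statistic, with s_0 chosen by a grid search minimizing a global coefficient of variation of |T'_i| across genes. This is a deliberately simplified stand-in for the windowed procedure in the tech report, and the function names are hypothetical:

```python
import numpy as np

def modified_t(x, y, s0=0.0):
    """T'_i = (mean(x_i) - mean(y_i)) / (s_i + s0) for each gene i.
    x, y: (genes x replicates) arrays for the two conditions."""
    n, m = x.shape[1], y.shape[1]
    # pooled standard error of the difference in means (one value per gene)
    sp2 = ((n - 1) * x.var(axis=1, ddof=1) +
           (m - 1) * y.var(axis=1, ddof=1)) / (n + m - 2)
    s = np.sqrt(sp2 * (1.0 / n + 1.0 / m))
    return (x.mean(axis=1) - y.mean(axis=1)) / (s + s0)

def choose_s0(x, y, quantiles=np.arange(0, 101, 5)):
    """Grid search over percentiles of s_i for the s0 that minimizes the
    coefficient of variation of |T'_i| across genes (a crude, global
    version of the criterion in the Witten-Tibshirani report)."""
    n, m = x.shape[1], y.shape[1]
    sp2 = ((n - 1) * x.var(axis=1, ddof=1) +
           (m - 1) * y.var(axis=1, ddof=1)) / (n + m - 2)
    s = np.sqrt(sp2 * (1.0 / n + 1.0 / m))
    best_s0, best_cv = 0.0, np.inf
    for s0 in np.percentile(s, quantiles):
        t = np.abs(modified_t(x, y, s0))
        cv = t.std() / t.mean()
        if cv < best_cv:
            best_s0, best_cv = s0, cv
    return best_s0
```

Adding s_0 to the denominator shrinks the statistic for genes with tiny variance estimates, which is exactly the failure mode of the ordinary t-statistic with few replicates.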

The second point made is that the intuition that reproducibility implies accuracy is not correct (fold change had been proposed for use instead of a t-statistic because the results were more reproducible).

The third point, in my opinion the most important one, I quote directly from the report:

“A researcher should choose the measure of differential expression based on the biological system of interest. If large absolute changes in expression are relevant to the system, then fold-change should be used; on the other hand, if changes in expression relative to the underlying noise are important, then a modified t-statistic is preferable.”

How does this pertain to RNA-Seq? Microarray experiments and RNA-Seq both measure expression, but the translation of methods for the analysis of one platform to the other can be non-trivial. One reason is that in RNA-Seq experiments accurately measuring “fold-change” is difficult. Read counts accumulated across a gene cannot be used directly to estimate fold change because the transcripts making up the gene may have different lengths. For this reason, methods such as Cufflinks, RSEM or eXpress (and most recently Sailfish, reviewed on this blog) use the EM algorithm to “deconvolute” ambiguously mapped reads. The following thought experiment (Figure 1 in our paper describing Cufflinks/Cuffdiff 2) illustrates the issue:


Changes in fragment counts for a gene do not necessarily equal a change in expression. The “exon-union” method counts reads falling on any of a gene’s exons, whereas the “exon-intersection” method counts only reads on constitutive exons. Both the exon-union and exon-intersection counting schemes may incorrectly estimate a change in expression in genes with multiple isoforms, as shown in the table. It is important to note that the problem of fragment assignment described here in the context of RNA-Seq is crucial for accurate estimation of parameters in many other *Seq assays.
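The thought experiment can be made concrete with a toy calculation (Python; all lengths and abundances below are made-up numbers for illustration). Two isoforms of different lengths swap relative usage between conditions: total molar expression of the gene is unchanged, yet exon-union counts more than double, because each transcript contributes reads in proportion to abundance times length:

```python
lengths = [500, 2000]   # effective lengths of a short and a long isoform (made up)
cond_A  = [0.8, 0.2]    # relative molar abundances in condition A
cond_B  = [0.2, 0.8]    # condition B: usage shifts toward the long isoform

def union_count(abundances, lengths, depth=1000.0):
    """Expected exon-union fragment count: each transcript contributes
    reads in proportion to (abundance x length)."""
    return depth * sum(a * l for a, l in zip(abundances, lengths))

true_fc  = sum(cond_B) / sum(cond_A)   # 1.0: total expression is unchanged
count_fc = union_count(cond_B, lengths) / union_count(cond_A, lengths)
print(true_fc, count_fc)   # 1.0 2.125 -- counts alone suggest a >2-fold change
```

A count-based method applied to this gene would report a large fold change where there is none; fragment assignment to individual isoforms avoids the artifact.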

“Count-based” methods for differential expression, such as DESeq, work directly with accumulated gene counts and are based on the premise that even if estimated fold-change is wrong, statistical significance can be assessed based on differences between replicates.  In our recent paper describing Cuffdiff 2 (with a new method for differential abundance analysis) we examine DESeq (as a proxy for count-based methods) carefully and show using both simulation and real data that fold-change is not estimated accurately. In fact, even when DESeq and Cufflinks both deem a gene to be differentially expressed, and even when the effect is in the same direction (e.g. up-regulation), DESeq can (and many times does) estimate fold-change incorrectly. This problem is not specific to DESeq. All “count based” methods that employ naive heuristics for computing fold change will produce inaccurate estimates:


Comparison of fold-change estimated by Cufflinks (tail of arrows) vs. “intersection-count” (head of arrows) reproduced from Figure 5 of the supplementary material of the Cuffdiff 2 paper. “Intersection-count” consists of the accumulated read counts in the regions shared among transcripts in a gene. The x-axis shows array fold change vs. the estimated fold-change on the y-axis.  For more details on the experiment see the Cuffdiff 2 paper.

In other words,

it is essential to perform fragment assignment in a biological context where absolute expression differences are relevant to the system.

What might that biological context be? This is a subjective question, but in my experience users of microarrays or RNA-Seq (including myself) always examine fold-change in addition to p-values obtained from (modified) t-statistics or other model-based statistics, because the raw fold-change is more directly connected to the data from the experiment.

In many settings though, statistical significance remains the gold standard for discovery. In the recent epic “On the immortality of television sets: ‘function’ in the human genome according to the evolution-free gospel of ENCODE“, Dan Graur criticizes the ENCODE project for reaching an “absurd conclusion” through various means, among them the emphasis of “statistical significance rather than magnitude of effect”. Or, to paraphrase Samuel Johnson,

statistical significance is the last refuge from a poor analysis of data.

One of my favorite ideas in phylogenetics is that of a phylogenetic orange. The idea and the terminology come from a classic paper by Junhyong Kim, “Slicing hyperdimensional oranges: the geometry of phylogenetic estimation”, Molecular Phylogenetics and Evolution, 17 (2000), p. 58–75 (pdf available here). In the words of the author (from the abstract):

A new view of phylogenetic estimation is presented where data sets, tree evolution models, and estimation methods are placed in a common geometric framework.

Prior to Kim’s paper the term “space of phylogenetic trees” was used only metaphorically. Moreover, even though there were many papers proposing definitions for the “distance” between trees (the most popular being the Robinson-Foulds distance), the ideas were disconnected from the models used to analyze data, and therefore there was little hope of building a coherent theory of phylogenetics. The phylogenetic orange that Kim defined served both to define a “tree space” and to provide a geometric framework for thinking about data and estimation in that space.

A few years ago Mike Steel wrote a wonderful whimsical short story “My friend and I catch a bus“. It describes a conversation between the narrator (a biologist?) and a friend (a probabilist?) as they take a bus to the movies.

Along the way the narrator learns some statistics: Bayes rule, expectation (of the exponential distribution) and the Poisson distribution. The latter comes up in response to a question about the probability that we are alone in the universe.

The “solution” is as follows: suppose that the probability that life evolves on any given planet is some tiny number p, and that there are a huge number N of planets in the universe. The probability that life evolved on k planets is then approximately

P(k) \approx \frac{(Np)^k e^{-Np}}{k!}

based on the Poisson approximation to the binomial distribution. The question of interest is equivalent to asking for the probability that k \geq 2 given that k \geq 1. Assuming that Np=1 (why? well, why not!?) the answer is

\frac{(1-(e^{-1}+e^{-1}))}{1-e^{-1}} \approx 0.42 ,

i.e., the meaning of life, the universe and everything.
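The arithmetic is easy to check with a throwaway Python snippet: with \lambda = Np = 1, the conditional probability P(k \geq 2 \mid k \geq 1) is (1 - P(0) - P(1))/(1 - P(0)):

```python
import math

lam = 1.0                            # Np = 1
p0 = math.exp(-lam)                  # P(k = 0) = e^{-1}
p1 = lam * math.exp(-lam)            # P(k = 1) = e^{-1}
answer = (1 - p0 - p1) / (1 - p0)    # P(k >= 2 | k >= 1)
print(round(answer, 2))              # 0.42
```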

At UC Berkeley, the mathematicians and biologists have decided, proverbially, to catch a bus together. In 2008 then deans Mark Richards of natural sciences and Mark Schlissel of biological sciences took the bold step of forming a committee, which I chaired, charged with investigating the possibility of revamping the math education of biology majors. The goal was not blind reform but rather a plan for training that would prepare students for upper division biology courses with more math and statistics.

There is a new arXiv paper out with the title Sailfish: Alignment-free Isoform Quantification from RNA-Seq Reads using Lightweight Algorithms by Rob Patro, Stephen M. Mount and Carl Kingsford. It describes a new approach to RNA-Seq quantification that is based on directly estimating abundances from k-mers rather than read alignments. This is an interesting approach, because it avoids the time-intensive read alignment step that is rapidly becoming a bottleneck in RNA-Seq analysis. The idea of avoiding read alignments to reference genomes/transcriptome in *Seq experiments is being explored in other contexts as well, such as for mutant mapping (by the Korbinian Schneeberger group) and genotyping (by the Gil McVean group). I am particularly interested in these ideas as we have been exploring such methods for association mapping.

Patro, Mount and Kingsford work with a  simplified model for RNA-Seq to first obtain approximate transcript abundance estimates. In the notation of my survey paper on RNA-Seq models (see equation 14, except with k replaced by to avoid confusion), they are maximizing the likelihood

L(\rho) = \prod_{i=1}^N \left( \sum_{j=1}^K y_{i,j} \frac{\alpha_j}{l_j} \right)

where the product is over k-mers instead of reads, so that N=4^k (where k is the k-mer size) rather than the total number of reads. The EM updates are therefore the same as those of other RNA-Seq quantification algorithms (see Figure 4 in my survey). They also implement an acceleration of the EM called SQUAREM (by Varadhan and Roland) in order to improve convergence.
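For readers who want to see the updates spelled out, here is a minimal EM sketch for this likelihood (Python/NumPy). It is an illustration of the update equations only, not the Sailfish implementation; the inputs Y, counts and lengths are hypothetical:

```python
import numpy as np

def em_quantify(Y, counts, lengths, iters=200):
    """EM for the simplified likelihood L(rho) = prod_i (sum_j y_ij a_j / l_j).

    Y:       (num_kmers x num_transcripts) 0/1 compatibility matrix
    counts:  observed count of each k-mer
    lengths: effective transcript lengths
    Returns transcript abundance estimates alpha (summing to 1)."""
    K = Y.shape[1]
    alpha = np.full(K, 1.0 / K)                  # uniform start
    for _ in range(iters):
        w = Y * (alpha / lengths)                # unnormalized assignment weights
        denom = w.sum(axis=1, keepdims=True)
        denom[denom == 0] = 1.0                  # guard: k-mers matching nothing
        r = w / denom                            # E-step: responsibilities
        alpha = (counts[:, None] * r).sum(axis=0)  # M-step: reweighted totals
        alpha /= alpha.sum()
    return alpha
```

Each iteration fractionally assigns every k-mer's count to the transcripts containing it (in proportion to current length-normalized abundance), then re-estimates abundances from the assigned mass; this is the same fixed point that SQUAREM then accelerates.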

The results of the paper are impressive. They compare speed and accuracy with RSEM, Cufflinks and eXpress and obtain comparable accuracy while avoiding the time intensive alignment of reads to transcripts (or the genome in the case of Cufflinks). An interesting point made is that bias can be corrected after fragment assignment (or in the case of Sailfish after k-mer assignment) without much loss in accuracy. We used a similar approximation in eXpress, namely stopping estimation of bias parameters after 5 million reads have been processed, but it seems that postponing the entire correction until fragment assignment is complete is acceptable.

Sailfish also appears to have been well engineered. The code (in C++) is well documented and available in both source and executable (for Linux and Mac OS X). I haven’t had a chance to test it yet but hope to do so soon. My only concern with the manuscript is that the simulation results for eXpress appear to be unreasonable. The experiments conducted on “real data” (for which qPCR measurements appear to be available) suggest that the accuracy of Sailfish is similar to that of eXpress, RSEM and Cufflinks (with RSEM underperforming slightly, presumably due to the lack of bias correction), but the simulations, performed with the Flux Simulator, are inconsistent. I suspect there is a (trivial) problem with the simulated data and presumably the authors will address this before journal publication. Update: The authors responded to my blog post within a day and we quickly realized the problem was likely to have been that Flux Simulator did not output reads in random order. Random ordering of reads is important for eXpress to function correctly, and when we wrote our paper we verified that Illumina sequencers do indeed output reads in random order. Rob Patro shuffled the Flux Simulator reads and verified that the performance of eXpress on simulated data is consistent with the results on real data (see attached figure). We’re grateful for his quick work in sorting out the issue and thank the authors of Sailfish for posting their paper on the arXiv (as others are starting to do), thereby enabling this exchange to occur prior to publication.

When the organizers of ISMB 2013 kindly invited me to give a keynote presentation this year, I decided to use the opportunity to survey “sequence census” methods. These are functional genomics assays based on high throughput sequencing. It has become customary to append the suffix “-Seq” to such assays (e.g. RNA-Seq), and I therefore prefer the term *Seq where the * denotes a wildcard.

The starting point for preparing the talk was a molecular biology seminar I organized in the Spring of 2010, where we discussed new high-throughput sequencing based assays with a focus on the diverse range of applications being explored. At the time I had performed a brief literature search to find *Seq papers for students to present, and this was helpful as a starting point for building a more complete bibliography for my talk. Finding *Seq assays is not easy (there is no central repository for them), but after some work I put together a (likely incomplete) bibliography that is appended to the end of the post. Update: I’ve created a page for actively maintaining a bibliography of *Seq assays.

The goal for my talk was to distill what appears to be a plethora of complex and seemingly unrelated experiments (see, e.g., Drukier et al. on the *Seq list) into a descriptive framework useful for thinking about their commonalities and differences. I came up with this cartoonish figure that I briefly explain in the remainder of this post. In future posts I plan to review in more detail some of these papers and the research they have enabled.


A typical assay first involves thinking of a (molecular) measurement to be made. The problem of making the measurement is then “reduced” (in the computer science sense of the word) to sequencing. This means that the measurement will be inferred from sequencing bits of DNA from “target” sequences (created during the reduction), and then counting the resulting fragments.  It is important to keep in mind that the small fragments of DNA are sampled randomly from the targets, but the sampling may not be uniform.

The inference step is represented in the “Solve inverse problem” box in the figure, and involves developing a model of the experiment, together with an algorithm for inferring the desired measurement from the data (the sequenced DNA reads). Finally, the measurement becomes a starting point for further (computational) biology inquiry.
