The hierarchical classification of nature initiated by Carl Linnaeus today consists of eight major “ranks”, namely species, genus, family, order, class, phylum, kingdom and domain.

In the microbial world it makes sense to refine the standard taxonomy by subdividing species into strains. An important reason to do so is that bacterial taxonomy must reflect not only phylogeny but also pathogenicity, and small differences in genomes can translate to large pathogenic differences. This has implications for metagenomic analyses of microbial communities: for many biomedical applications it is desirable to characterize individual strains.

Metagenomics has its roots in culture-independent retrieval and sequencing of 16S rRNA genes, and while variations in 16S can sometimes distinguish between strains, a single gene is not always sufficient. This limitation of 16S can be overcome with whole genome shotgun sequencing of microbial communities, an approach to metagenomics that became popular in the early 2000s and  that opened the door to higher resolution characterization of communities. In 2005 Kevin Chen and I wrote a review on the bioinformatics challenges that would have to be overcome to walk through the door. One of the things we did was to emphasize “problems and their connections to other areas of bioinformatics, such as… gene expression analysis”, and throughout the past decade I’ve always hoped for deeper connections to be established between metagenomics and gene expression bioinformatics. I’ve noticed interesting connections pop up from time to time (e.g. Paulson et al. 2013)  and have occasionally entertained the thought with my students and collaborators, especially as work in my group became more focused on RNA-Seq since the development of Cufflinks in 2008.

However, connecting modern transcriptome analysis methodology, specifically the bioinformatics of RNA-Seq, to metagenomics has been difficult until recently. One major reason is that until just a few years ago, there was no reference genome database for metagenomics analogous to the reference annotation databases available for use in transcriptomics. Another way to put this is that metagenomics has, until recently, been “de novo” bioinformatics. By this I mean that the analysis of communities from whole genome shotgun data had to proceed largely via de novo analyses of the data (e.g. de novo assembly of genomes) or via “binning” of reads according to sequence characteristics or hits to gene databases, because it was impossible to compare sequences to reference genomes. While de novo methods have also been developed for RNA-Seq, the scale of transcriptome analysis is much smaller than that of most metagenomic analyses, and as has been well documented, de novo transcriptomics is already very difficult (e.g. Amin et al. 2014).

The de novo state of metagenomics has changed in recent years, as (relatively) low-cost sequencing has been a boon for microbial genomics. The graph below, extracted from NCBI and published in a recent review, shows that in just the past few years thousands of bacterial genomes have been sequenced, enabling, for the first time, reference-based metagenomics:

This observation is reflected in the recent development of many methods for a variety of metagenomic applications that take advantage of reference genome databases.  Specifically, the problem of read assignment, which is fundamental for abundance estimation, has benefited from the possibility of metagenomic read alignment to reference databases.

The figure below, reproduced from the preprint “An evaluation of the accuracy and speed of metagenome analysis tools” by Stinus Lindgreen, Karen L. Adair and Paul Gardner, bioRxiv May 15, 2015 shows a benchmark of the accuracy and runtime of 14 programs developed for metagenomic read assignment for whole genome shotgun data:

The problem these methods are solving is really similar to the problem of read assignment in RNA-Seq. In RNA-Seq, instead of originating from strains, reads originate from transcripts. Just as strains are present in different abundances in a community, so are RNA transcripts in a cell (or in bulk). The analogy of taxonomy in metagenomics, i.e. the grouping of strains into species, genus etc. is also present in RNA-Seq, where transcripts are grouped into genes. The fragment (or read) assignment problem in RNA-Seq is closely related to the quantification problem in RNA-Seq and is a problem that has been thoroughly researched and for which many algorithms have been developed. I discussed the importance of the fragment assignment problem for RNA-Seq in my 2013 Genome Informatics Keynote.

In response to the development of reference-based bioinformatics possibilities for metagenomics, about three years ago my student Lorian Schaeffer started looking at the suitability of RNA-Seq tools for metagenomic read assignment. Although the metagenomic and RNA-Seq assignment problems are conceptually similar and methodologically related, there are various technical issues involved in applying RNA-Seq tools in the metagenomic setting (e.g. the need to carefully account for taxonomy in the metagenomics setting). After developing the computational infrastructure to benchmark RNA-Seq programs in the metagenomic setting, she proceeded to evaluate the accuracy of eXpress, a streaming algorithm for RNA-Seq quantification. Although eXpress was specifically designed to handle large numbers of reads, the program requires read alignments to a reference transcriptome (or, in Lorian’s experiments, a genome) database. In the metagenomic setting realistic databases are huge, and she found that it took days just to map the reads. Nevertheless, her initial benchmarks revealed that eXpress was significantly more accurate than the available metagenomic read assignment tools of the time.

When Kraken (Wood and Salzberg 2014), and later CLARK (Ounit et al. 2015) were published in 2014 and 2015 respectively, we took note because by circumventing the alignment step they dramatically altered the tractability of metagenomic read assignment. In parallel, in my group, Nicolas Bray and later Páll Melsted and Harold Pimentel were developing what is now kallisto (Bray et al. 2015). Like Kraken, kallisto avoided the need for aligning reads, but with the introduction of the concept of pseudoalignment, allowed for accurate read assignments based on joint analysis of exact k-mer matches. What we showed earlier this year is that unlike naïve k-mer based approaches to quantification, kallisto is as accurate as eXpress and other read alignment based quantification tools, and this observation led Lorian to immediately proceed to benchmark it on metagenomic data. The result of her work was just posted as a preprint:

Lorian Schaeffer, Harold Pimentel, Nicolas Bray, Páll Melsted and Lior Pachter, Pseudoalignment for metagenomic read assignment, arXiv 1510.07371, 2015.

With this paper we demonstrate a “technology transfer” from RNA-Seq bioinformatics to metagenomics, one that achieves dramatic improvements in read assignment accuracy in the metagenomics setting. The main result of her work is Table 1 in our preprint:

Using a published simulated Illumina dataset from Mende et al. 2012 (based on 100 genomes and containing 53.33 million reads), and augmenting it with another 2,308 genomes for the purpose of testing, she shows that kallisto significantly outperforms the best quantification methods (as benchmarked by Lindgreen et al., see figure above). “Significant” here refers to what I think is fair to characterize as an extraordinary improvement: at the genus level, a level that programs such as CLARK have been optimized for, kallisto’s RRMSE (relative root mean squared error) is 0.13 compared to 17.05 for Kraken and 18.58 for CLARK. The improvement is based on two ideas: first, the results show that the model-based approach for read assignment, the concept that underlies GASiC and eXpress, outperforms direct taxonomic read assignment as implemented by MEGAN, Kraken and CLARK (in the latter approach reads are assigned to the lowest rank to which they align unambiguously). Second, pseudoalignment is not just faster than traditional alignment but also accurate.

The upshot: the accuracy and efficiency of kallisto make strain level analysis of metagenomes possible. In fact kallisto is more accurate at the strain level than other programs are at the genus level. Just as we have been advocating for transcript level analysis from RNA-Seq data, we believe that strain level analysis should become commonplace in metagenomics.

In digging deeply into the bioinformatics of metagenomics we noticed a few other areas that could benefit from RNA-Seq technology transfer. For example, the standard of methods benchmarking appears to be higher in RNA-Seq than in metagenomics. Both the Kraken and CLARK papers benchmarked their programs on simulations with 10 genomes (the number ten is not a typo). CLARK did test on one dataset with 20 genomes, although using only 10,000 reads. To be fair to the authors of those papers, their standards were much higher than those of others in the field. The paper

Yu-Wei Wu and Yuzhen Ye, A novel abundance-based algorithm for binning metagenomic sequences using l-tuples, Journal of Computational Biology 2011.

benchmarked their method on simulations of reads from 2 (two!!) organisms. Biologists frequently complain that the simulations of bioinformaticians are completely non-informative, and unfortunately these cases provide fodder for such prejudice. Having said that, the RNA-Seq community also has much to learn from the metagenomics community. The previously mentioned paper by Paulson et al. 2013 addresses missing data in a way that should translate directly to missing data in single-cell RNA-Seq (the paper also compares the performance of their comparative metagenomics approach to the RNA-Seq programs DESeq and edgeR). One paper (McDavid et al. 2012) does take a look at modeling single-cell data with zero-inflated distributions, but I think this is a good example where metagenomics is ahead of RNA-Seq. Our immediate plans are to extend the kallisto application to metagenomics to include the ability to perform metagenome comparisons using sleuth. Conversely, inspired by the taxonomy hierarchy fundamental to metagenomics, we’re going to explore RNA-Seq quantification with groups of transcripts that go beyond just genes.

Horizontal transfer is good.

[Update July 15, 2016: A preprint describing sleuth is available on BioRxiv]

Today my student Harold Pimentel released the beta version of his new RNA-Seq analysis method and software program called sleuth. A sleuth for RNA-Seq begins with the quantification of samples with kallisto, and together a sleuth of kallistos can be used to analyze RNA-Seq data rigorously and rapidly.

Why do we need another RNA-Seq program?

A major challenge in transcriptome analysis is to determine the transcripts that have changed in their abundance across conditions.  This challenge is not entirely trivial because the stochasticity in transcription both within and between cells (biological variation), and the randomness in the experiment (RNA-Seq) that is used to determine transcript abundances (technical variation), make it difficult to determine what constitutes “significant” change.

Technical variation can be assessed by performing technical replicates of experiments. In the case of RNA-Seq, this can be done by repeatedly sequencing from one cDNA library. Such replicates are fundamentally distinct from biological replicates designed to assess biological variation. Biological replicates are performed by sequencing different cDNA libraries that have been constructed from repeated biological experiments performed under the same (or in practice near-same) conditions. Because biological replicates require sequencing different cDNA libraries, a key point is that biological replicates include technical variation.

In the early days of RNA-Seq a few papers (e.g. Marioni et al. 2008, Bullard et al. 2010) described analyses of technical replicates and concluded that they were not really needed in practice, because technical variation could be predicted statistically from the properties of the Poisson distribution. The point is that in an idealized RNA-Seq experiment counts of reads are multinomial (according to abundances of the transcripts they originate from), and therefore approximately Poisson distributed. Their variance is therefore approximately equal to the mean, so that it is possible to predict the variance in counts across technical replicates based on the abundance of the transcripts they originate from. There is, however, one important subtlety here: “counts of reads” as used above refers to the number of reads originating from a transcript, but in many cases, especially in higher eukaryotes, reads are frequently ambiguous as to their transcript of origin because of the presence of multi-isoform genes and gene families. In other words, transcript counts cannot be precisely measured. However, the statement about the Poisson distribution of counts in technical replicates remains true when considering counts of reads by genomic features because then reads are no longer ambiguous.
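
The Poisson claim is easy to check in a small simulation. Below is a minimal sketch (an illustration, not taken from the papers cited): reads are drawn multinomially from a fixed library across technical replicates, and since each individual transcript abundance is small, the per-transcript counts have variance approximately equal to their mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# A transcriptome with many transcripts and a long-tailed abundance profile,
# so that each individual transcript abundance is small.
n_transcripts = 2000
weights = 1.0 / np.arange(1, n_transcripts + 1)
rho = weights / weights.sum()                  # relative abundances, sum to 1

n_reads = 2_000_000                            # reads per technical replicate
n_replicates = 100                             # repeated sequencing of one cDNA library

# Each technical replicate draws reads multinomially from the same library.
counts = rng.multinomial(n_reads, rho, size=n_replicates)
mean = counts.mean(axis=0)
var = counts.var(axis=0)

# For small per-transcript abundances, multinomial counts are ~Poisson, so the
# variance across technical replicates is approximately equal to the mean.
print("median variance/mean ratio:", round(float(np.median(var / mean)), 3))
```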

This is why, in so-called “count-based methods” for RNA-Seq analysis, the analysis is restricted to the gene level. Programs such as DESeq/DESeq2, edgeR and literally dozens of other count-based methods first require counting reads across genome features using tools such as HTSeq or featureCounts. By working with read counts to genomic features, these methods can dispense with technical replicates, relying instead on the statistical assumption that such replicates would reveal Poisson distributed data, and focus on modeling biological variation. The issue of how to model biological variation is non-trivial because typically very few biological replicates are performed in experiments. Thus, there is a need for pooling information across genes to obtain reliable variance estimates via a statistical process called shrinkage. How and what to shrink is a matter of extensive debate among statisticians engaged in the development of count-based RNA-Seq methods, but one theme that has emerged is that shrinkage approaches can be compatible with general and generalized linear models, thus allowing for the analysis of complex experimental designs.

Despite these accomplishments, count-based methods for RNA-Seq have two major (related) drawbacks: first, the use of counts to gene features prevents inference about the transcription of isoforms, and therefore with most count-based methods it is impossible to identify splicing switches and other isoform changes between conditions. Some methods have tried to address this issue by restricting genomic features to specific exons or splice junctions (e.g. DEXSeq), but this requires throwing out a lot of data, thereby reducing power for identifying statistically significant differences between conditions. Second, because in general $\frac{a}{b} + \frac{c}{d} \neq \frac{a+c}{b+d}$, it is mathematically incorrect to estimate gene abundances by adding up counts to their genomic region. One consequence of this is that it is not possible to accurately measure fold change between conditions by using counts to gene features. In other words, count-based methods are problematic even at the gene level, and it is necessary to estimate transcript-level counts.
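
A toy numeric example (with made-up lengths and copy numbers) illustrates the problem: when isoform usage shifts between conditions, the fold change computed from summed counts can even point in the wrong direction.

```python
# Made-up example: a gene with two isoforms of different lengths. Expected read
# counts scale with (copies x effective length), so summing counts over the gene
# can get the direction of the gene-level fold change wrong.
lengths = {"short_isoform": 1_000, "long_isoform": 3_000}

copies_A = {"short_isoform": 10, "long_isoform": 10}   # condition A
copies_B = {"short_isoform": 30, "long_isoform": 0}    # condition B: isoform switch

def gene_copies(copies):
    return sum(copies.values())

def gene_counts(copies):
    # expected read counts are proportional to copies * effective length
    return sum(copies[t] * lengths[t] for t in copies)

true_fc = gene_copies(copies_B) / gene_copies(copies_A)     # 30 / 20  = 1.5 (up)
count_fc = gene_counts(copies_B) / gene_counts(copies_A)    # 30k / 40k = 0.75 (apparently down)

print(f"fold change in transcript copies: {true_fc:.2f}")
print(f"fold change from summed counts:   {count_fc:.2f}")
```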

While reads might be ambiguous as to exactly which transcripts they originated from, it is possible to statistically infer an estimate of the number of reads from each transcript in an experiment. This kind of quantification has its origin in papers of Jiang and Wong, 2009 and Trapnell et al. 2010. However the process of estimating transcript-level counts introduces technical variation. That is to say, if multiple technical replicates were performed on a cDNA library and then transcript-level counts were to be inferred, those inferred counts would no longer be Poisson distributed. Thus, there appears to be a need for performing technical replicates after all. Furthermore, it becomes unclear how to work within the shrinkage frameworks of count-based methods.

There have been a handful of attempts to develop methods that combine the uncertainty of count estimates at the transcript level with biological variation in the assessment of statistically significant changes in transcript abundances between conditions. For example, the Cuffdiff2 method generalizes DESeq while the bitSeq method relies on a Bayesian framework to simultaneously quantify abundances at the transcript level while modeling biological variability. Despite showing improved performance over count-based methods, they also have significant shortcomings. For example the methods are not as flexible as those of general(ized) linear models, and bitSeq is slow partly because it requires MCMC sampling.

Thus, despite intensive research on both statistical and computational methods for RNA-Seq over the past years, there has been no solution for analysis of experiments that allows biologists to take full advantage of the power and resolution of RNA-Seq.

The sleuth model

The main contribution of sleuth is an intuitive yet powerful model for RNA-Seq that bridges the gap between count-based methods and quantification algorithms in a way that fully exploits the advantages of both.

$Y_t = X_t\beta_t + \epsilon_t$.

Here the subscript t refers to a specific transcript, $Y_t$ is a vector describing transcript abundances (of length equal to the number of samples), $X_t$ is a design matrix (of size number of samples x number of confounders), $\beta_t$ is a parameter vector (of size the number of confounders) and $\epsilon_t$ is a noise vector (of size the number of samples). In this model the abundances $Y_t$ are normally distributed. For the purposes of RNA-Seq data, the $Y_t$ may be assumed to be the logarithm of the counts (or normalized counts per million) from a transcript, and indeed this is the approach taken in a number of approaches to RNA-Seq modeling, e.g. in limma-voom. A common alternative to the general linear model is the generalized linear model, which postulates that some function of $Y_t$ has a distribution with mean equal to $g^{-1}(X_t \beta_t)$ where g is a link function, such as log, thereby allowing for distributions other than the normal to be used for the observed data. In the RNA-Seq context, where the negative binomial distribution may make sense because it is frequently a good distribution for modeling count data, the generalized model is sometimes preferred to the standard general model (e.g. by DESeq2). There is much debate about which approach is “better”.

In the sleuth model the $Y_t$ in the general linear model are modeled as unobserved. They can be thought of as the unobserved logarithms of true counts for each transcript across samples and are assumed to be normally distributed. The observed data $D_t$ is the logarithm of estimated counts for each transcript across samples, and is modeled as

$D_t = Y_t + \zeta_t$

where the $\zeta_t$ vector parameterizes a perturbation to the unobserved $Y_t$. This can be understood as the technical noise due to the random sequencing of fragments from a cDNA library and the uncertainty introduced in estimating transcript counts.

The sleuth model incorporates the assumption that the response error is additive, i.e. if the variance of transcript t in sample i is $V(D_{t,i})$ then $V(D_{t,i}) = \sigma^2_t + \tau^2_t$, where the variance $V(\epsilon_{t,i}|y_{t,i}) = \sigma^2_t$ and the variance $V(\zeta_{t,i}|y_{t,i}) = \tau^2_t$. Intuitively, sleuth teases apart the two sources of variance by examining both technical and biological replicates, and in doing so directly estimates “true” biological variance, i.e. the variance in biological replicates that is not technical. In lieu of actual technical replicates, sleuth takes advantage of the bootstraps of kallisto, which serve as accurate proxies.
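
The decomposition can be illustrated with a minimal simulation sketch (an illustration of the variance model only, not the sleuth implementation): observed log-counts are biological log-counts plus inferential noise, so subtracting an estimate of the technical variance $\tau^2_t$ (obtained in sleuth from the kallisto bootstraps) from the total variance recovers the biological variance $\sigma^2_t$.

```python
import numpy as np

rng = np.random.default_rng(1)

n_samples = 6
sigma2 = 0.30   # biological variance of one transcript (log scale)
tau2 = 0.10     # inferential/technical variance (log scale)
mu = 5.0        # X_t * beta_t for an intercept-only design

n_experiments = 5000
# Unobserved true log-counts Y_t across samples (biological variation)
Y = rng.normal(mu, np.sqrt(sigma2), size=(n_experiments, n_samples))
# Observed estimated log-counts D_t = Y_t + zeta_t (technical/inferential noise)
D = Y + rng.normal(0.0, np.sqrt(tau2), size=(n_experiments, n_samples))

total_var = D.var(axis=1, ddof=1).mean()   # approx sigma2 + tau2
tau2_hat = tau2                            # stand-in for the bootstrap-based estimate
sigma2_hat = total_var - tau2_hat          # "true" biological variance by subtraction

print(f"total variance of observed D:  {total_var:.3f}  (sigma2 + tau2 = {sigma2 + tau2:.2f})")
print(f"recovered biological variance: {sigma2_hat:.3f}  (sigma2 = {sigma2:.2f})")
```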

In a test of sleuth on data simulated according to the DESeq2 model we found that sleuth significantly outperforms other methods:

In this simulation transcript counts were simulated according to a negative binomial distribution, following closely the protocol of the DESeq2 paper simulations. Reference parameters for the simulation were first estimated by running DESeq2 on the female Finnish population from the GEUVADIS dataset (59 individuals). In the simulation above size factors were set to be equal in accordance with typical experiments being performed, but we also tested sleuth with size factors drawn at random with geometric mean of 1 in accordance with the DESeq2 protocol (yielding factors of 1, 0.33, 3, 3, 0.33 and 1) and sleuth still outperformed other methods.

There are many details in the implementation of sleuth that are crucial to its performance, e.g. the approach to shrinkage to estimate the biological variance $\sigma^2_t$. A forthcoming preprint with Nicolas Bray and Páll Melsted, who also contributed to the project, will provide the details.

Exploratory data analysis with sleuth

One of the design goals of sleuth was to create a simple and efficient workflow in line with the principles of kallisto. Working with the Shiny web application framework, we have designed an HTML interface that allows users to interact with sleuth plots in real time for exploratory data analysis.

The sleuth Shiny interface is much more than just a GUI for making plots of kallisto processed data. First, it allows for the exploration of the sleuth fitted models; users can explore the technical variation of each transcript, see where statistically significant differential transcripts appear in relationship to others in terms of abundance and variance, and much more. Particularly useful are interactive features in the plots. For example, when examining an MA plot, users can highlight a region of points (dynamically created box in upper panel) and see the variance breakdown of the transcripts the points correspond to, along with a list of those transcripts in a table below:

The web interface contains diagnostics, summaries of the data, “maps” showing low-dimensional representations of the data and tools for analysis of differential transcripts. The interactivity via Shiny can be especially useful for diagnostics; for example, in the diagnostics users can examine scatterplots of any two samples, and then select outliers to examine their variance, including the breakdown of technical variance. This allows for a determination of whether outliers represent high variance transcripts, or specific samples gone awry. Users can of course make figures showing transcript abundances in all samples, including boxplots displaying the extent of technical variation. Interested in the differentially transcribed isoform ENST00000349155 of the TBX3 gene shown in Figure 5d of the Cuffdiff2 paper? It’s trivial to examine using the transcript viewer:

One can see immediately that differences between conditions completely dominate both the technical and biological variation within conditions. The sleuth q-value for this transcript is $3 \times 10^{-68}$.

Among the maps, users can examine PCA projections onto any pair of components, allowing for rapid exploration of the structure of the data. Thus, with kallisto and sleuth raw RNA-Seq reads can be converted into a complete analysis in a matter of minutes. Experts will be able to generate plots and analyses in R using the sleuth library as they would with any R package. We plan numerous improvements and developments to the sleuth interface in the near future that will further facilitate data exploration; in the meantime we welcome feedback from users.

How to try out sleuth

Since sleuth requires the bootstraps and quantifications output by kallisto we recommend starting by running kallisto on your samples. The kallisto program is very fast, processing 30 million reads on a laptop in a matter of minutes. You will have to run kallisto with bootstraps- we have been using 100 bootstraps per sample but it should be possible to work with many fewer. We have yet to fully investigate the minimum number of bootstraps required for sleuth to be accurate.
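
Below is a minimal sketch of such a workflow, written in Python so that the per-sample output directories can be fed straight into sleuth. The file and sample names are hypothetical, and you should check the kallisto manual for current command-line options; the `index` and `quant -b` invocations shown here reflect kallisto's documented usage at the time of writing.

```python
# A minimal sketch of a kallisto workflow with bootstraps, driven from Python.
# File and sample names are hypothetical.
import subprocess
from pathlib import Path

transcriptome = "Homo_sapiens.GRCh38.cdna.fa.gz"    # hypothetical reference transcriptome
index = "transcripts.idx"

# Build the transcriptome index once.
subprocess.run(["kallisto", "index", "-i", index, transcriptome], check=True)

# Quantify each sample with 100 bootstraps (the number we have been using;
# many fewer may well suffice).
samples = ["sampleA", "sampleB", "sampleC"]          # hypothetical sample names
for s in samples:
    outdir = Path("results") / s
    outdir.mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "kallisto", "quant",
        "-i", index,
        "-o", str(outdir),
        "-b", "100",                                 # bootstraps, required by sleuth
        f"{s}_1.fastq.gz", f"{s}_2.fastq.gz",        # paired-end reads for the sample
    ], check=True)
```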

To learn how to use kallisto start here. If you have already run kallisto you can proceed to the tutorial for sleuth. If you’re really eager to see sleuth without first learning kallisto, you can skip ahead and try it out using pre-computed kallisto runs of the Cuffdiff2 data- the tutorial explains where to obtain the data for trying out sleuth.

For questions, suggestions or help see the program websites and also the kallisto-sleuth user group. We hope you enjoy the tools!

Today I posted the preprint N. Bray, H. Pimentel, P. Melsted and L. Pachter, Near-optimal RNA-Seq quantification with kallisto to the arXiv. It describes the RNA-Seq quantification program kallisto. [Update April 5, 2016: a revised version of the preprint has been published: Nicolas L. Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-optimal probabilistic RNA-Seq quantification, Nature Biotechnology (2016), doi:10.1038/nbt.3519 published online April 4, 2016.]

The project began in August 2013 when I wrote my second blog post, about another arXiv preprint describing a program for RNA-Seq quantification called Sailfish (now a published paper). At the time, a few students and postdocs in my group read the paper and then discussed it in our weekly journal club. It advocated a philosophy of “lightweight algorithms, which make frugal use of data, respect constant factors and effectively use concurrent hardware by working with small units of data where possible”. Indeed, two themes emerged in the journal club discussion:

1. Sailfish was much faster than other methods by virtue of being simpler.

2. The source of the simplicity was the replacement of approximate alignment of reads with exact matching of k-mers. When reads are shredded into their constituent k-mer “mini-reads”, the difficult read -> reference alignment problem in the presence of errors becomes an exact matching problem efficiently solvable with a hash table.
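
A toy sketch of that idea (an illustration, not the Sailfish implementation): index every k-mer of the reference transcripts in a hash table, then shred each read into k-mers and look them up exactly, even when the read itself contains an error. The sequences below are made up.

```python
from collections import defaultdict

K = 7   # toy k-mer size; real programs use k in the 20-31 range

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Hypothetical mini "transcriptome": hash every k-mer to the transcripts containing it.
transcripts = {"t1": "ACGTACGTTGCAGGCTTAAC",
               "t2": "ACGTACGTTGCAGGCTTGGC"}
kmer_index = defaultdict(set)
for name, seq in transcripts.items():
    for km in kmers(seq):
        kmer_index[km].add(name)

# Shred a read into its k-mer "mini-reads": each k-mer is an exact hash lookup,
# even though the read as a whole contains a sequencing error (the trailing X).
read = "CGTACGTTGCAGGCTTAAX"
for km in kmers(read):
    print(km, sorted(kmer_index.get(km, set())))
```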

We felt that the shredding of reads must lead to reduced accuracy, and we quickly checked and found that to be the case. In fact, in our simulations, we saw that Sailfish significantly underperformed methods such as RSEM. However the fact that simpler was so much faster led us to wonder whether the prevailing wisdom of seeking to improve RNA-Seq analysis by looking at increasingly complex models was ill-founded. Perhaps simpler could be not only fast, but also accurate, or at least close enough to best-in-class for practical purposes.

After thinking about the problem carefully, my (now former) student Nicolas Bray realized that the key is to abandon the idea that alignments are necessary for RNA-Seq quantification. Even Sailfish makes use of alignments (of k-mers rather than reads, but alignments nonetheless). In fact, thinking about all the tools available, Nick realized that every RNA-Seq analysis program was being developed in the context of a “pipeline” of first aligning reads or parts of them to a reference genome or transcriptome. Nick had the insight to ask: what can be gained if we let go of that paradigm?

By April 2014 we had formalized the notion of “pseudoalignment” and Nick had written, in Python, a prototype of a pseudoaligner. He called the program kallisto. The basic idea was to determine, for each read, not where in each transcript it aligns, but rather which transcripts it is compatible with. That is asking for a lot less, and as it turns out, pseudoalignment can be much faster than alignment. At the same time, the information in pseudoalignments is enough to quantify abundances using a simple model for RNA-Seq, a point made in the isoEM paper, and an idea that Sailfish made use of as well.
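
A toy sketch of these two steps (an illustration of the idea, not the kallisto implementation): each read's compatibility class is taken to be the intersection of the transcript sets of its k-mers, and a simple EM over the resulting class counts yields abundance estimates. Equal effective lengths are assumed for brevity, and the sequences and reads are made up.

```python
from collections import Counter, defaultdict
import numpy as np

K = 7
def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Hypothetical two-transcript "transcriptome" sharing a common prefix.
transcripts = {"t1": "ACGTACGTTGCAGGCTTAAC",
               "t2": "ACGTACGTTGCAGGCTTGGC"}
kmer_index = defaultdict(set)
for name, seq in transcripts.items():
    for km in kmers(seq):
        kmer_index[km].add(name)

def pseudoalign(read):
    # the compatibility class of a read: transcripts consistent with all its k-mers
    hits = [kmer_index[km] for km in kmers(read) if km in kmer_index]
    return frozenset(set.intersection(*hits)) if hits else frozenset()

reads = ["CGTACGTTGCAGG", "ACGTACGTTGC",   # compatible with both transcripts
         "GCAGGCTTAAC", "AGGCTTAAC",       # compatible with t1 only
         "GCAGGCTTGGC"]                    # compatible with t2 only
classes = Counter(pseudoalign(r) for r in reads)

# Simple EM on the compatibility-class counts (equal effective lengths assumed).
names = list(transcripts)
rho = np.full(len(names), 1.0 / len(names))
for _ in range(100):
    expected = np.zeros(len(names))
    for cls, n in classes.items():
        if not cls:
            continue
        idx = [names.index(t) for t in cls]
        expected[idx] += n * rho[idx] / rho[idx].sum()   # split reads by current rho
    rho = expected / expected.sum()

print(dict(zip(names, np.round(rho, 3))))   # converges to {'t1': ~0.667, 't2': ~0.333}
```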

Just how fast is pseudoalignment? In January of this year Páll Melsted from the University of Iceland came to visit my group for a semester sabbatical. Páll had experience in exactly the kinds of computer science we needed to optimize kallisto; he has written about efficient k-mer counting using the Bloom filter and de Bruijn graph construction. He translated the Python kallisto to C++, incorporating numerous clever optimizations and a few new ideas along the way. His work was done in collaboration with my student Harold Pimentel, Nick (now a postdoc with Jacob Corn and Jennifer Doudna at the Innovative Genomics Initiative) and myself.

The screenshot below shows kallisto being used on my 2012 iMac desktop to build an index of the human transcriptome (5 min 8 sec), and then quantify 78.6 million GEUVADIS human RNA-Seq reads (14 min). When we first saw these results we thought they were simply too good to be true. Let me repeat: The quantification of 78.6 million reads takes 14 minutes on a standard desktop using a single CPU core. In some tests we’ve gotten even faster running times, up to 15 million reads quantified per minute.

The results in our paper indicate that kallisto is not just fast, but also very accurate. This is not surprising: underlying RNA-Seq analysis are the alignments, and although kallisto is pseudoaligning instead, it is almost always only the compatibility information that is used in actual applications. As we show in our paper, from the point of view of compatibility, the pseudoalignments and alignments are almost the same.

Although accuracy is a primary concern with analysis, we realized in the course of working on kallisto that speed is also paramount, and not just as a  matter of convenience. The speed of kallisto has three major implications:

1. It allows for efficient bootstrapping. All that is required for the bootstrap are reruns of the EM algorithm, and those are particularly fast within kallisto. The result is that we can accurately estimate the uncertainty in abundance estimates. One of my favorite figures from our paper, made by Harold, is this one:

It is based on an analysis of 40 samples of 30 million reads subsampled from 275 million rat RNA-Seq reads. Each dot corresponds to a transcript and is colored by its abundance. The x-axis shows the variance estimated from kallisto bootstraps on a single subsample while the y-axis shows the variance computed from the different subsamples of the data. We see that the bootstrap recapitulates the empirical variance. This result is non-trivial: the standard dogma, that the technical variance in RNA-Seq is “Poisson” (i.e. proportional to the mean), is false, as shown in Supplementary Figure 3 of our paper (the correlation becomes 0.64). Thus, the bootstrap will be invaluable when incorporated in downstream applications, and we are already working on some ideas.

2. It is not just the kallisto quantification that is fast; the index building, and even compilation of the program are also easy and quick. The implication for biologists is that RNA-Seq analysis now becomes interactive. Instead of “freezing” an analysis that might take weeks or even months, data can be explored dynamically, e.g. easily quantified against different transcriptomes, or re-quantified as transcriptomes are updated. The ability to analyze data locally instead of requiring cloud computation means that analysis is portable, and also easily secure.

3. We have found the fast turnaround of analysis helpful in improving the program itself. With kallisto we can quickly check the effect of changes in the algorithms. This allows for much faster debugging of problems, and also better optimization. It also allows us to release improvements knowing that users will be able to test them without resorting to a major computation that might take months. For this reason we’re not afraid to say that some improvements to kallisto will be coming soon.

As someone who has worked on RNA-Seq since the time of 32bp reads, I have to say that kallisto has personally been extremely liberating. It offers freedom from the bioinformatics core facility, freedom from the cloud, freedom from the multi-core server, and in my case freedom from my graduate students– for the first time in years I’m analyzing tons of data on my own; because of the simplicity and speed I find I have the time for it. Enjoy!

When I was an undergraduate at Caltech I took a combinatorics course from Rick Wilson who taught from his then just published textbook A Course in Combinatorics (co-authored with J.H. van Lint). The course and the book emphasized design theory, a subject that is beautiful and fundamental to combinatorics, coding theory, and statistics, but that has sadly been in decline for some time. It was a fantastic course taught by a brilliant professor- an experience that had a profound impact on me. Though to be honest, I haven’t thought much about designs in recent years. Having kids changed that.

A few weeks ago I was playing the card game Colori with my three year old daughter. It’s one of her favorites.

The game consists of 15 cards, each displaying drawings of the same 15 items (beach ball, boat, butterfly, cap, car, drum, duck, fish, flower, kite, pencil, jersey, plane, teapot, teddy bear), with each item colored using two of the colors red, green, yellow and blue. Every pair of cards contains exactly one item that is colored exactly the same. For example, the two cards the teddy bear is holding in the picture above are shown below:

The only pair of items colored exactly the same are the two beach balls. The gameplay consists of shuffling the deck and then placing a pair of cards face-up. Players must find the matching pair, and the first player to do so keeps the cards. This is repeated seven times until there is only one card left in the deck, at which point the player with the most cards wins. When I play with my daughter “winning” consists of enjoying her laughter as she figures out the matching pair, and then proceeds to try to eat one of the cards.

An inspection of all 15 cards provided with the game reveals some interesting structure:

Every card contains exactly one of each type of item. Each item therefore occurs 15 times among the cards, with fourteen of the occurrences consisting of seven matched pairs, plus one extra. This is a type of partially balanced incomplete block design. Ignoring for a moment the extra item placed on each card, what we have is 15 items, each colored one of seven ways (v=15*7=105). The 105 items have been divided into 15 blocks (the cards), so that b=15. Each block contains 14 elements (the items) so that k=14, and each element appears in two blocks (r=2). If every pair of different (colored) items occurred in the same number of cards, we would have a balanced incomplete block design, but this is not the case in Colori. Each item occurs in the same block as 26 (=2*13) other items (we are ignoring the extra item that makes for 15 on each card), and therefore it is not the case that every pair of items occurs in the same number of blocks as would be the case in a balanced incomplete block design. Instead, there is an association scheme that provides extra structure among the 105 items, and in turn describes the way in which items do or do not appear together on cards. The association scheme can be understood as a graph whose nodes consist of the 105 items, with edges between items labeled either 0, 1 or 2. An edge between two items of the same type is labeled 0, edges between different items that appear on the same card are labeled 1, and edges between different items that do not appear on the same card are labeled 2. This edge labeling is called an “association scheme” because it has a special property, namely that the number of triangles with a base edge labeled k, and the other two edges labeled i and j respectively, is dependent only on i, j and k and not on the specific base edge selected. In other words, there is a special symmetry to the graph. Returning to the deck of cards, we see that every pair of items appears in the same card exactly 0 or 1 times, and the number depends only on the association class of the pair of objects. This is called a partially balanced incomplete block design.

The author of the game, Reinhard Staupe, made it a bit more difficult by adding an extra item to each card making the identification of the matching pair harder. The addition also ensures that each of the 15 items appears on each card. Moreover, the items are permuted in location on the cards, in an arrangement similar to a latin square, making it hard to pair up the items. And instead of using 8 different colors, he used only four, producing the eight different “colors” of each item on the cards by using pairwise combinations of the four.  The yellow-green two-colored beach balls are particularly difficult to tell apart from the green-yellow ones. Of course, much of this is exactly the kind of thing you would want to do if you were designing an RNA-Seq experiment!

Instead of 15 types of items, think of 15 different strains of mice.  Instead of colors for the items, think of different cellular conditions to be assayed. Instead of one pair for each of seven color combinations, think of one pair of replicates for each of seven cellular conditions. Instead of cards, think of different sequencing centers that will prepare the libraries and sequence the reads. An ideal experimental setup would involve distributing the replicates and different cellular conditions across the different sequencing centers so as to reduce batch effect. This is the essence of part of the paper Statistical Design and Analysis of RNA Sequencing Data by Paul Auer and Rebecca Doerge. For example, in their Figure 4 (shown below) they illustrate the advantage of balanced block designs to ameliorate lane effects:

Figure 4 from P. Auer and R.W. Doerge’s paper Statistical Design and Analysis of RNA Sequencing Data.
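
A minimal sketch of the kind of balanced assignment being advocated (an illustration, not Auer and Doerge's procedure; the conditions and lane counts are made up): rotate conditions across lanes in Latin-square fashion so that no condition is confounded with any single lane.

```python
# A toy balanced assignment: 4 hypothetical cellular conditions, 4 biological
# replicates each, distributed over 4 lanes so that every lane receives every
# condition exactly once.
conditions = ["A", "B", "C", "D"]
n_lanes = 4
n_reps = 4

# Cyclic (Latin-square style) rotation: replicate r of condition i goes to lane
# (i + r) mod n_lanes, so no condition is confounded with any single lane.
assignment = {}
for i, cond in enumerate(conditions):
    for r in range(n_reps):
        assignment[(cond, r)] = (i + r) % n_lanes

for lane in range(n_lanes):
    libraries = sorted(key for key, v in assignment.items() if v == lane)
    print(f"lane {lane}:", libraries)
```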

Of course the use of experimental designs for constructing controlled gene expression experiments is not new. Kerr and Churchill wrote about the use of combinatorial designs in Experimental Design for gene expression microarrays, and one can trace back a long chain of ideas originating with R.A. Fisher. But design theory seems to me to be a waning art insofar as molecular biology experiments are concerned, and it is frequently being replaced with biological intuition of what makes for a good control. The design of good controls is sometimes obvious, but not always. So next time you design an experiment, if you have young kids, first play a round of Colori. If the kids are older, play Set instead. And if you don’t have any kids, plan for an extra research project, because what else would you do with your time?

“An entertaining freshness… Tic Tac!” This is Ferrero’s tag line for its most successful product, the ubiquitous Tic Tac. And the line has stuck. As WikiHow points out in how to make your breath fresh: first buy some mints, then brush your teeth.

One of the amazing things about Tic Tacs is that they are sugar free. Well… almost not. As the label explains, a single serving (one single Tic Tac) contains 0g of sugar (to be precise, less than 0.5g, as explained in a footnote). In what could initially be assumed to be a mere coincidence, the weight of a single serving is 0.49g. It did not escape my attention that 0.50-0.49=0.01. Why?

To understand why, it helps to look at the labeling rules of the FDA. I’ve reproduced the relevant section (Title 21) below, and boldfaced the relevant parts:

 TITLE 21–FOOD AND DRUGS
 CHAPTER I–FOOD AND DRUG ADMINISTRATION DEPARTMENT OF HEALTH AND HUMAN SERVICES
 SUBCHAPTER B–FOOD FOR HUMAN CONSUMPTION

(c) Sugar content claims –(1) Use of terms such as “sugar free,” “free of sugar,” “no sugar,” “zero sugar,” “without sugar,” “sugarless,” “trivial source of sugar,” “negligible source of sugar,” or “dietarily insignificant source of sugar.” Consumers may reasonably be expected to regard terms that represent that the food contains no sugars or sweeteners e.g., “sugar free,” or “no sugar,” as indicating a product which is low in calories or significantly reduced in calories. Consequently, except as provided in paragraph (c)(2) of this section, a food may not be labeled with such terms unless:

(i) The food contains less than 0.5 g of sugars, as defined in 101.9(c)(6)(ii), per reference amount customarily consumed and per labeled serving or, in the case of a meal product or main dish product, less than 0.5 g of sugars per labeled serving; and

(ii) The food contains no ingredient that is a sugar or that is generally understood by consumers to contain sugars unless the listing of the ingredient in the ingredient statement is followed by an asterisk that refers to the statement below the list of ingredients, which states “adds a trivial amount of sugar,” “adds a negligible amount of sugar,” or “adds a dietarily insignificant amount of sugar;” and

(iii)(A) It is labeled “low calorie” or “reduced calorie” or bears a relative claim of special dietary usefulness labeled in compliance with paragraphs (b)(2), (b)(3), (b)(4), or (b)(5) of this section, or, if a dietary supplement, it meets the definition in paragraph (b)(2) of this section for “low calorie” but is prohibited by 101.13(b)(5) and 101.60(a)(4) from bearing the claim; or

(B) Such term is immediately accompanied, each time it is used, by either the statement “not a reduced calorie food,” “not a low calorie food,” or “not for weight control.”

It turns out that Tic Tacs are in fact almost pure sugar. It’s easy to figure this out by looking at the number of calories per serving (1.9) and multiplying the number of calories per gram of sugar (3.8) by the serving weight (0.49 g): 3.8 × 0.49 = 1.862 calories, i.e. essentially all of the 1.9 calories per serving. 98% sugar! Ferrero basically admits this in their FAQ. Acting completely within the bounds of the law, they have simply exploited an arbitrary threshold of the FDA. Arbitrary thresholds are always problematic; not only can they have unintended consequences, but they can be manipulated to engineer desired outcomes. In computational biology they have become ubiquitous, frequently being described as “filters” or “pre-processing steps”. Regardless of how they are justified, thresholds are thresholds are thresholds. They can sometimes be beneficial, but they are dangerous when wielded indiscriminately.

There is one type of thresholding/filtering in use in RNA-Seq that my postdoc Bo Li and I have been thinking about a bit this year. It consists of removing duplicate reads, i.e. reads that map to the same position in a transcriptome. The motivation behind such filtering is to reduce or eliminate amplification bias, and it is based on the intuition that it is unlikely that lightning strikes the same spot multiple times. That is, it is improbable that many reads would map to the exact same location assuming a model for sequencing that posits selecting fragments from transcripts uniformly. The idea is also called de-duplication or digital normalization.

Digital normalization is obviously problematic for high abundance transcripts. Consider, for example, a transcript that is so abundant that it is extremely likely that at least one read will start at every site (ignoring the ends, which for the purposes of the thought experiment are not relevant). This would also be the case if the transcript were twice as abundant, and so digital normalization would make it impossible to estimate the difference. This issue was noted in a paper published earlier this year by Zhou et al. The authors investigate in some detail the implications of this problem, and quantify the bias it introduces in a number of data sets. But a key question not answered in the paper is: what does digital normalization actually do?

To answer the question, it is helpful to consider how one might estimate the abundance of a transcript after digital normalization. One naive approach is to just count the number of reads after de-duplication, followed by normalization for the length of the transcript and the number of reads sequenced. Specifically, if there are n sites where a read might start, and k of the sites had at least one read, then the naive approach would be to use the estimate $\frac{k}{n}$, suitably normalized for the total number of reads in the experiment. This is exactly what is done in standard de-duplication pipelines, or in digital normalization as described in the preprint by Brown et al. However, assuming a simple model for sequencing, namely that every read is selected by first choosing a transcript according to a multinomial distribution and then choosing a location on it uniformly at random from among the n sites, a different formula emerges.

Under this model, the number of reads starting at each site of a transcript of length n is Poisson distributed with parameter c (i.e., the average coverage of the transcript is c). Letting X denote the number of reads starting at a given site, note that

$Pr(X \geq 1) = 1-Pr(X=0) = 1-e^{-c}$.

The maximum likelihood estimate for c can also be obtained by the method of moments, which is to set

$\frac{k}{n} = 1-e^{-c}$

from which it is easy to see that

$c = -log(1-\frac{k}{n})$.

This is the same as the (derivation of the) Jukes-Cantor correction in phylogenetics where the method of moments equation is replaced by $\frac{4}{3}\frac{k}{n} = 1-e^{-\frac{4}{3}c}$ yielding $D_{JC} = -\frac{3}{4}log(1-\frac{4}{3}\frac{k}{n})$, but I’ll leave an extended discussion of the Jukes-Cantor model and correction for a future post.

The point here, as noticed by Bo Li, is that since $log(1-x) \approx -x$ by Taylor approximation, it follows that at low coverage the average coverage can be estimated by $c \approx \frac{k}{n}$. This is exactly the naive estimate of de-duplication or digital normalization. However, the fact that $\frac{k}{n} \rightarrow 1$ as $k \rightarrow n$ means that $-log(1-\frac{k}{n})$ blows up at high coverage, hence the results of Zhou et al.
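
A small simulation sketch (with made-up parameters) shows both regimes: at low coverage the naive de-duplication estimate $\frac{k}{n}$ tracks the true coverage, while at high coverage it saturates and only the corrected estimate $-log(1-\frac{k}{n})$ continues to track c.

```python
import numpy as np

rng = np.random.default_rng(2)

n_sites = 100_000          # sites where a read can start on a (long) transcript
for c in [0.05, 0.5, 2.0, 5.0]:               # true average coverage
    reads_per_site = rng.poisson(c, size=n_sites)
    k = np.count_nonzero(reads_per_site)      # sites with at least one read
    naive = k / n_sites                       # what de-duplication effectively reports
    corrected = -np.log(1 - k / n_sites)      # method-of-moments estimate of c
    print(f"c = {c:4.2f}   k/n = {naive:.3f}   -log(1 - k/n) = {corrected:.3f}")
# At low coverage the naive estimate tracks c; at high coverage k/n saturates
# near 1 while the corrected estimate continues to track c (as long as k < n).
```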

Digital normalization as proposed by Brown et al. involves possibly thresholding at more than one read per site (for example choosing a threshold C and removing all but at most C reads at every site). But even this modified heuristic fails to adequately relate to a probabilistic model of sequencing. One interesting and easy exercise is to consider the second or higher order Taylor approximations. But a more interesting approach to dealing with amplification bias is to avoid thresholding per se, and to instead identify outliers among duplicate reads and to adjust them according to an estimated distribution of coverage. This is the approach of Hashimoto et al. in the paper “Universal count correction for high-throughput sequencing” published in March in PLoS One. There are undoubtedly other approaches as well, and in my opinion the issue will receive renewed attention in the coming year as the removal of amplification biases in single-cell transcriptome experiments becomes a priority.

As mentioned above, digital normalization/de-duplication is just one of many thresholds applied in a typical RNA-Seq “pipeline”. To get a sense of the extent of thresholding, one need only scan the (supplementary?) methods section of any genomics paper. For example, the GEUVADIS RNA-Seq consortium describe their analysis pipeline as follows:

“We employed the JIP pipeline (T.G. & M.S., data not shown) to map mRNA-seq reads and to quantify mRNA transcripts. For alignment to the human reference genome sequence (GRCh37, autosomes + X + Y + M), we used the GEM mapping suite24 (v1.349 which corresponds to publicly available pre-release 2) to first map (max. mismatches = 4%, max. edit distance = 20%, min. decoded strata = 2 and strata after best = 1) and subsequently to split-map (max.mismatches = 4%, Gencode v12 and de novo junctions) all reads that did not map entirely. Both mapping steps are repeated for reads trimmed 20 nucleotides from their 3′-end, and then for reads trimmed 5 nucleotides from their 5′-end in addition to earlier 3′-trimming—each time considering exclusively reads that have not been mapped in earlier iterations. Finally, all read mappings were assessed with respect to the mate pair information: valid mapping pairs are formed up to a maximum insert size of 100,000 bp, extension trigger = 0.999 and minimum decoded strata = 1. The mapping pipeline and settings are described below and can also be found in https://github.com/gemtools, where the code as well as an example pipeline are hosted.”

This is not a bad pipeline- the paper shows it was carefully evaluated– and it may have been a practical approach to dealing with the large amount of RNA-Seq data in the project. But even the first and seemingly innocuous thresholding to trim low quality bases from the ends of reads is controversial and potentially problematic. In a careful analysis published earlier this year, Matthew MacManes looked carefully at the effect of trimming in RNA-Seq, and concluded that aggressive trimming of bases below Q20, a standard that is frequently employed in pipelines, is problematic. I think his Figure 3, which I’ve reproduced below, is very convincing:

It certainly appears that some mild trimming can be beneficial, but a threshold that is optimal (and more importantly not detrimental) depends on the specifics of the dataset and is difficult or impossible to determine a priori. MacManes’ view (for more see his blog post on the topic) is consistent with another paper by Del Fabbro et al. that while seemingly positive about trimming in the abstract, actually concludes that “In the specific case of RNA-Seq, the tradeoff between sensitivity (number of aligned reads) and specificity (number of correctly aligned reads) seems to be always detrimental when trimming the datasets (Figure S2); in such a case, the modern aligners, like Tophat, seem to be able to overcome low quality issues, therefore making trimming unnecessary.”

Alas, Tic Tac thresholds are everywhere. My advice is: brush your teeth first.

I was recently reading the latest ENCODE paper published in PNAS when a sentence in the caption of Figure 2 caught my attention:

“Depending on the total amount of RNA in a cell, one transcript copy per cell corresponds to between 0.5 and 5 FPKM in PolyA+ whole-cell samples according to current estimates (with the upper end of that range corresponding to small cells with little RNA and vice versa).”

Although very few people actually care about ENCODE, many people do care about the interpretation of RNA-Seq FPKM measurements and to them this is likely to be a sentence of interest. In fact, there have been a number of attempts to provide intuitive meaning for RPKM (and FPKM) in terms of copy numbers of transcripts per cell. Even though the ENCODE PNAS paper provides no citation for the statement (or methods section explaining the derivation), I believe its source is the RNA-Seq paper by Mortazavi et al. In that paper, the authors write that

“…absolute transcript levels per cell can also be calculated. For example, on the basis of literature values for the mRNA content of a liver cell [Galau et al. 1977] and the RNA standards, we estimated that 3 RPKM corresponds to about one transcript per liver cell. For C2C12 tissue culture cells, for which we know the starting cell number and RNA preparation yields needed to make the calculation, a transcript of 1 RPKM corresponds to approximately one transcript per cell. “

This statement has been picked up on in a number of publications (e.g., Hebenstreit et al., 2011, van Bakel et al., 2011). However the inference of transcript copies per cell directly from RPKM or FPKM estimates is not possible and conversion factors such as 1 RPKM = 1 transcript per cell are incoherent. At the same time, the estimates of Mortazavi et al. and the range provided in the ENCODE PNAS paper are informative. The “incoherence” stems from a subtle issue in the normalization of RPKM/FPKM that I have discussed in a talk I gave at CSHL, and is the reason why TPM is a better unit for RNA abundance. Still, the estimates turn out to be “informative”, in the sense that the effect of (lack of) normalization appears to be smaller than variability in the amount of RNA per cell. I explain these issues below:

Why is the sentence incoherent?

RNA-Seq can be used to estimate transcript abundances in an RNA sample. Formally, a sample consists of n distinct types of transcripts, and each occurs with different multiplicity (copy number), so that transcript i appears $m_i$ times in the sample. By “abundance” we mean the relative amounts $\rho_1,\ldots,\rho_n$ where $\rho_i = \frac{m_i}{\sum_{i=1}^n m_i}$. Note that $0 \leq \rho_i \leq 1$ and $\sum_{i=1}^n \rho_i = 1$. Suppose that $m_j=1$ for some j. The corresponding $\rho_j$ is therefore $\rho_j = \frac{1}{M}$ where $M = \sum_{i=1}^n m_i$. The question is: what does this $\rho$ value correspond to in RPKM (or FPKM)?

RPKM stands for “reads per kilobase of transcript per million reads mapped” and FPKM is the same except with “fragment” replacing read (initially reads were not paired-end, but with the advent of paired-end sequencing it makes more sense to speak of fragments, and hence FPKM). As a unit of measurement for an estimate, what FPKM really refers to is the expected number of fragments per kilobase of transcript per million reads. Formally, if we let $l_i$ be the length of transcript i and define $\alpha_i = \frac{\rho_i l_i}{\sum_{j=1}^n \rho_j l_j}$, then the abundance in FPKM for transcript i is $FPKM_i = \frac{\alpha_i \cdot 10^{6}}{l_i/(10^3)}$. In terms of $\rho$, we obtain that

$FPKM_i = \frac{\rho_i \cdot 10^9}{\sum_{j=1}^n \rho_j l_j}$.

The term in the denominator can be considered a kind of normalization factor that, while identical for each transcript, depends on the abundances of all the transcripts (unless all lengths are equal). It is in essence an average of transcript lengths weighted by abundance. Moreover, the length of each transcript should be taken to be its “effective” length, i.e. the length with respect to fragment lengths, or equivalently, the number of positions where fragments can start.

The implication for finding a relationship between FPKM and relative abundance constituting one transcript copy per cell is that one cannot. Mathematically, the latter is equivalent to setting $\rho_i = \frac{1}{M}$ in the formula above and then trying to determine $FPKM_i$. Unfortunately, all the remaining $\rho$ are still in the formula, and must be known in order to calculate the corresponding FPKM value.

The argument above makes clear that it does not make sense to estimate transcript copy counts per cell in terms of RPKM or FPKM. Measurements in RPKM or FPKM units depend on the abundances of transcripts in the specific sample being considered, and therefore the connection to copy counts is incoherent. The obvious and correct solution is to work directly with the $\rho$. This is the rationale of TPM (transcripts per million) used by Bo Li and Colin Dewey in the RSEM paper (the argument for TPM is also made in Wagner et al. 2012).
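
A toy calculation (with made-up abundances and effective lengths) makes the point concrete: a transcript present at one copy per cell in two samples with the same total number of transcripts has the same TPM in both, but its FPKM differs because the abundance-weighted average length in the denominator differs.

```python
import numpy as np

def tpm(rho):
    return rho * 1e6

def fpkm(rho, eff_len):
    # FPKM_i = rho_i * 10^9 / sum_j rho_j * l_j, per the formula above
    return rho * 1e9 / np.sum(rho * eff_len)

# Two hypothetical samples, each with M = 100,000 transcripts in total, so the
# transcript of interest (one copy per cell) has rho = 1/M in both samples.
M = 100_000
eff_len = np.array([1_000, 500, 5_000])             # transcript of interest, short bg, long bg
rho_sample1 = np.array([1 / M, 0.9, 0.1 - 1 / M])   # background dominated by short transcripts
rho_sample2 = np.array([1 / M, 0.1, 0.9 - 1 / M])   # background dominated by long transcripts

print("TPM  of the single-copy transcript:", tpm(rho_sample1[0]), tpm(rho_sample2[0]))
print("FPKM of the single-copy transcript:",
      round(fpkm(rho_sample1, eff_len)[0], 2),
      round(fpkm(rho_sample2, eff_len)[0], 2))
# Same copy number and same TPM in both samples, but the FPKM differs several-fold
# because the abundance-weighted average length in the denominator differs.
```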

Why is the sentence informative?

Even though incoherent, it turns out there is some truth to the ranges and estimates of copy count per cell in terms of RPKM and FPKM that have been circulated. To understand why requires noting that there are in fact two factors that come into play in estimating the FPKM corresponding to abundance of one transcript copy per cell. First is M, defined above to be the total number of transcripts in a cell; this depends on the amount of RNA in the cell. Second are the relative abundances of all transcripts and their contribution to the denominator in the $FPKM_i$ formula.

The best paper to date on the connection between transcript copy numbers and RNA-Seq measurements is the careful work of Marinov et al. in “From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing” published in Genome Research earlier this year. First of all, the paper describes careful estimates of RNA quantities in different cells, and concludes that (at least for the cells studied in the paper) amounts vary by approximately one order of magnitude. Incidentally, the estimates in Marinov et al. confirm and are consistent with rough estimates of Galau et al. from 1977, of 300,000 transcripts per cell. Marinov et al. also use spike-in measurements to conclude that in “GM12878 single cells, one transcript copy corresponds to ∼10 FPKM on average.” The main value of the paper lies in its confirmation that RNA quantities can vary by an order of magnitude, and I am guessing this factor of 10 is the basis for the range provided in the ENCODE PNAS paper (0.5 to 5 FPKM).

In order to determine the relative importance of the denominator in $FPKM_i$ I looked at a few RNA-Seq datasets we are currently examining. In the GEUVADIS data, the weighted average can vary by as much as 20% between samples. In a rat RNA-Seq dataset we are analyzing, the difference is a factor of two (and interestingly very dependent on the exact annotation used for quantification). The point here is that even the denominator in $FPKM_i$ does vary, but less, it seems, than the variability in RNA quantity. In other words, the estimate of 0.5 to 5 FPKM corresponding to one transcript per cell is incoherent albeit probably not too far off.

One consequence of all of the above discussion is that while differential analysis of experiments can be performed based on FPKM units (as done for example in Cufflinks, where the normalization factors are appropriately accounted for), it does not make sense to compare raw FPKM values across experiments. This is precisely what is done in Figure 2 of the ENCODE PNAS paper. What the analysis above shows is that actual abundances may be off by amounts much larger than the differences shown in the figure. In other words, while the caption turns out to contain an interesting comment, the overall figure doesn't really make sense. Specifically, I'm not sure the relative RPKM values shown in the figure deliver the correct relative amounts, an issue that ENCODE can and should check. Which brings me to the last part of this post…

What is ENCODE doing?

Having realized the possible issue with RPKM comparisons in Figure 2, I took a look at Figure 3 to try to understand whether there were potential implications for it as well. That exercise took me to a whole other level of ENCODEness. To begin with, I was trying to make sense of the x-axis, which is labeled "biochemical signal strength (log10)", when I realized that the different curves on the plot all come from different, completely unrelated x-axes. If this sounds confusing, it is. The green curves are showing graphs of functions whose domain is in log 10 RPKM units. However the histone modification curves are in log(-10 log p), where p is a p-value that has been computed. I've never seen anyone plot log(log(p-values)); what does it mean?! Nor do I understand how such graphs can be placed on a common x-axis (?!). What is "biochemical signal strength" (?) Why in the bottom panel is the grey H3K9me3 showing % nucleotides conserved decreasing as "biochemical strength" is increasing (?!) Why are the green RNA curves showing conservation below genome average for low expressed transcripts (?!) and why in the top panel is the red H3K4me3 an "M" shape (?!) What does any of this mean (?!) What am I supposed to understand from it, or frankly, what is going on at all ??? I know many of the authors of this ENCODE PNAS paper and I simply cannot believe they saw and approved this figure. It is truly beyond belief… see below:

All of these figures are of course there to support the main point of the paper, which is that even though 80% of the genome is functional, it is also true that this is not what was meant to be said, and that what is true is that "survey of biochemical activity led to a significant increase in genome coverage and thus accentuated the discrepancy between biochemical and evolutionary estimates… where function is ascertained independently of cellular state but is dependent on environment and evolutionary niche therefore resulting in estimates that differ widely in their false-positive and false-negative rates and the resolution with which elements can be defined… [unlike] genetic approaches that rely on sequence alterations to establish the biological relevance of a DNA segment and are often considered a gold standard for defining function."

The ENCODE PNAS paper was first published behind a paywall. However after some public criticism, the authors relented and paid for it to be open access. This was a mistake. Had it remained behind a paywall not only would the consortium have saved money, I and others might have been spared the experience of reading the paper. I hope the consortium will afford me the courtesy of paywall next time.

Last Monday some biostatisticians/epidemiologists from Australia published a paper about a “visualization tool which may allow greater understanding of medical and epidemiological data”:

H. Wand et al., “Quilt Plots: A Simple Tool for the Visualisation of Large Epidemiological Data“, PLoS ONE 9(1): e85047.

A brief look at the “paper” reveals that the quilt plot they propose is a special case of what is commonly known as a heat map, a point the authors acknowledge, with the caveat that they claim that

” ‘heat maps’ require the specification of 21 arguments including hierarchical clustering, weights for reordering the row and columns dendrogram, which are not always easily understood unless one has an extensive programming knowledge and skills. One of the aims of our paper is to present ‘‘quilt plots’’ as a useful tool with simply formulated R-functions that can be easily understood by researchers from different scientific backgrounds without high-level programming skills.”

In other words, the quilt plot is a simplified heat map and the authors think it should be used because specifying parameters for a heat map (in R) would require a terrifying skill known as programming. This is of course all completely ridiculous. Not only does usage of R not require programming skill, there are also simplified heat map functions in many programming languages/computation environments that are as simple as the quilt plot function.

The fact that a paper like this was published in a journal is preposterous, and indeed the authors and editor of the paper have been ridiculed on social media, blogs and in comments to their paper on the PLoS One website.

BUT…

Wand et al. do have one point… those 21 parameters are not an entirely trivial matter. In fact, the majority of computational biologists (including many who have been ridiculing Wand) appear not to understand heat maps themselves, despite repeatedly (ab)using them in their own work.

What are heat maps?

In the simplest case, heat maps are just the conversion of a table of numbers into a grid with colored squares, where the colors represent the magnitude of the numbers. In the quilt plot paper that is the type of heat map considered. However in applications such as gene expression analysis, heat maps are used to visualize distances between experiments.
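As an aside, in this simplest sense a heat map really does take just one function call; the following sketch (using matplotlib, with made-up numbers) is about as complicated as a quilt plot:

import numpy as np
import matplotlib.pyplot as plt

table = np.random.rand(10, 6)                       # 10 "genes" x 6 "samples"
plt.imshow(table, aspect="auto", cmap="viridis")    # the table as colored squares
plt.colorbar(label="value")
plt.xlabel("sample")
plt.ylabel("gene")
plt.show()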

Heat maps have been popular for visualizing multiple gene expression datasets since the publication of the “Eisengram” (or the guilt plot?). So when my student Lorian Schaeffer and I recently needed to create a heat map from RNA-Seq abundance estimates in multiple samples we are analyzing with Ryan Forster and Dirk Hockemeyer, we assumed there would be a standard method (and software) we could use. However when starting to look at the literature we quickly found 3 papers with 4 different opinions about which similarity measure to use:

There are also the folks who don’t worry too much and just try anything and everything (for example using the heatmap.2 function in R) hoping that some distance measure produces the figure they need for their paper. There are certainly a plethora of distance measures for them to try out. And even if none of the distance measures provide the needed figure, there is always the opportunity to play with the colors and shading to “highlight” the desired result. In other words, heat maps are great for cheating with what appears to be statistics.

We wondered…  what is the “right” way to make a heat map?

Consider first the obvious choice for measuring similarity: Euclidean distance. Suppose that we are estimating the distance between abundance estimates from two RNA-Seq experiments, where for simplicity we assume that there are only three transcripts (A,B,C). The two abundance estimates can be represented by 3-tuples $(p_A,p_B,p_C)$ and $(q_A,q_B,q_C)$ such that both $p_A+p_B+p_C=1$ and $q_A+q_B+q_C=1$. If $p_A=1$ and $q_A=0$, then the Euclidean distance is given by $\sqrt{1+q_B^2+q_C^2}$. This obviously depends on $q_B$ and $q_C$, a dependence that is problematic. What has changed between the two RNA-Seq experiments is that transcript $A$ has gone from being the only one transcribed, to not being transcribed at all. It is difficult to justify a distance metric that depends on the relative changes in $q_B$ and $q_C$. Why, for example, should $(1,0,0)$ be closer to $(0,\frac{1}{2},\frac{1}{2})$ than to $(0,1,0)$?

The Jensen-Shannon divergence, defined for two distributions $P$ and $Q$ by

$JSD(P,Q) = \frac{1}{2}D(P\|M) + \frac{1}{2}D(Q\|M)$

where $M = \frac{1}{2}(P+Q)$ and $D(A\|B)$ is the Kullback-Leibler divergence, is an example of a distance measure that does not have this problem. For the example above the JSD is always $\log 2$ (regardless of $q_B$ and $q_C$). However the JSD is not a metric (hence the term divergence in its name). In particular, it does not satisfy the triangle inequality (which the Euclidean distance does). Interestingly, this defect can be rectified by replacing JSD with the square root of JSD (the JSD metric). Formal proofs that the square root of JSD is a metric were provided in "A new Metric for Probability Distributions" by Dominik Endres and Johannes Schindelin (2003), and separately (and independently) in "A new class of metric divergences on probability spaces and its applicability in statistics" by Ferdinand Österreicher and Igor Vajda (2003). The paper "Jensen-Shannon Divergence and Hilbert space embedding" by Bent Fuglede and Flemming Topsøe (2004) makes clear the mathematical origins for this result by showing that the square root of JSD can be isometrically embedded into Hilbert space (as a logarithmic spiral).

The 2-simplex with contour lines showing points equidistant from the probability distribution (1/3, 1/3, 1/3) for the JSD metric.
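For concreteness, here is a small sketch (my own helper functions, not code from any of the papers cited) computing both distances for the three-transcript example above. The Euclidean distance changes with $q_B$ and $q_C$, while the Jensen-Shannon divergence stays at $\log 2$:

import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence D(P||Q), with 0*log(0/q) treated as 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd_metric(p, q):
    # square root of the Jensen-Shannon divergence (a metric)
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

p = [1.0, 0.0, 0.0]
for q in ([0.0, 0.5, 0.5], [0.0, 1.0, 0.0]):
    print(np.linalg.norm(np.array(p) - np.array(q)),  # Euclidean: ~1.22 vs ~1.41
          jsd_metric(p, q) ** 2)                      # JSD: log 2 ~ 0.693 both times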

The meaning of the JSD metric is not immediately apparent based on its definition, but a number of results provide some insight. First, the JSD metric can be approximated by Pearson's $\chi^2$ distance (Equation (7) in Endres and Schindelin). This relationship is confirmed in the numerical experiments of Sung-Hyuk Cha (see Figure 3 in "Comprehensive survey on distance/similarity measures between probability density functions", in particular the close relationship between JSD and the probabilistic symmetric $\chi^2$). There are also information theoretic and physical interpretations for the JSD metric stemming from the definition of JSD in terms of Kullback-Leibler divergence.

In "Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation", Trapnell et al., Nature Biotechnology 28 (2010), we used the JSD metric to examine changes to relative isoform abundances in genes (see, for example, the Minard plot in Figure 2c). This application of the JSD metric makes sense; however, the JSD metric is not a panacea. Consider Figure 1 in the Merkin et al. paper mentioned above. It displays a heat map generated from 7713 genes (genes with singleton orthologs in the five species studied). Some of these genes will have higher expression, and therefore higher variance, than others. The nature of the JSD metric is such that those genes will dominate the distance function, so that the heat map is effectively generated only from the highly abundant genes. Since transcript abundances typically follow an (approximately) exponential distribution, this means that, in effect, very few genes dominate the analysis.

I started thinking about this issue with my student Nicolas Bray and we began by looking at the first obvious candidate for addressing the issue of domination by high variance genes: the Mahalanobis distance. Mahalanobis distance is an option in many heat map packages (e.g. in R), but has been used only rarely in publications (although there is some history of its use in the analysis of microarray data). Intuitively, Mahalanobis distance seeks to remedy the problem of genes with high variance among the samples dominating the distance calculation by appropriate normalization. This appears to have been the aim of the method in the Anders and Huber paper cited above, where the expression values are first normalized to obtain equal variance for each gene (the variance stabilization procedure). Mahalanobis distance goes a step further, normalizing using the entire covariance matrix (as opposed to just its diagonal).

Intuitively, given a set of points in some dimension, the Mahalanobis distance is the Euclidean distance between the points after they have been transformed via a linear transformation that maps an ellipsoid fitted to the points to a sphere.  Formally, I think it is best understood in the following slightly more general terms:

Given an $n \times m$ expression matrix $X$ (rows=transcripts, columns=experiments), let $P=PCA(X)$ be the matrix consisting of the projections of $X$ onto its principal components, and denote by $s^2_k(ij)$ the squared distance between the projections of points $i$ and $j$ onto the $k$th component, i.e. $s^2_k(ij) = (P_{ik}-P_{jk})^2$. Let $\lambda_1,\ldots,\lambda_n$ be the corresponding variances along the principal components (the eigenvalues of the sample covariance matrix). For some $1 \leq p \leq n$, define the distance

$D_{ij} = \frac{s^2_1(ij)}{\lambda_1} + \cdots + \frac{s^2_p(ij)}{\lambda_p}$

When $n \leq m$ and $p=n$, the distance $D$ defined above is the (squared) Mahalanobis distance.
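Here is a sketch of what this looks like in code (my own implementation of the definition above, with the $\lambda_k$ taken to be the variances along the principal components; the cutoff p is a parameter, as discussed below):

import numpy as np

def pca_distance(X, p=None):
    # X: n x m expression matrix (rows = transcripts, columns = experiments).
    # Returns the m x m matrix of distances D_ij between experiments using the
    # top p principal components; p = None keeps all components with nonzero
    # variance, in which case (for n <= m) D is the squared Mahalanobis distance.
    Y = X - X.mean(axis=1, keepdims=True)            # center each transcript
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    lam = (s ** 2) / (X.shape[1] - 1)                # variances along components
    P = Vt.T * s                                     # projections of the experiments
    keep = np.arange(p) if p is not None else np.flatnonzero(lam > 1e-12)
    m = X.shape[1]
    D = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            D[i, j] = np.sum((P[i, keep] - P[j, keep]) ** 2 / lam[keep])
    return D

# made-up example: 100 transcripts, 6 experiments, keep 3 components
rng = np.random.default_rng(1)
X = rng.poisson(20, size=(100, 6)).astype(float)
print(pca_distance(X, p=3))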

The Mahalanobis ellipses. In this figure the distance shown is from every point to the center (the mean of the points) rather than between pairs of points. Mahalanobis distance is sometimes defined in this way. The figure is reproduced from this website. Note that the Anders-Huber heat map produces distances looking only at the variance in each direction (in this case horizontal and vertical), which assumes that the gene expression values are independent, or equivalently that the ellipse is not rotated.

It is interesting to note that D is defined even when $n > m$, providing a generalization of Mahalanobis distance for high-dimensional data.

The cutoff p involves ignoring the last few principal components. The reason one might want to do this is that the last few principal components presumably correspond to noise in the data. Amplifying this noise and treating it the same as the signal is not desirable: as p increases the denominators $\lambda_p$ get smaller, and therefore have an increasing effect on the distance. So even though it makes sense to normalize by variance, thereby allowing all genes to count the same, it is important to keep in mind that it may be desirable to toss out the last few principal components. One way to choose the appropriate threshold is by examination of a scree plot.

We’re still not completely happy with Mahalanobis distance. For example, unlike the Jensen-Shannon metric, it does not provide a metric over probability distributions. In functional genomics, almost all *Seq assays produce an output which is a (discrete) probability distribution (for example in RNA-Seq the output after quantification is a probability distribution on the set of transcripts). So making heat maps for such data seems to not be entirely trivial…

Does any of this matter?

The landmark Michael Eisen et al. paper “Cluster analysis and the display of genome-wide expression patterns“, PNAS 95 (1998), 14863–14868 describing the “Eisengram” was based on correlation as the distance measure between expression vectors. This has a similar problem to the issues we discussed above, namely that  abundant genes are weighted more heavily in the distance measure, and therefore they define the characteristics of the heat map. Yet the Eisengram and its variants have proven to be extremely popular and useful. It is fair to ask whether any of the issues I’ve raised matter in practice.

Depends. In many papers the heat map is a visualization tool intended for a qualitative exploration of the data. The issues discussed here touch on quantitative aspects, and in some applications changing distance measures may not change the qualitative results. It's difficult to say without reanalyzing data sets and (re)creating the heat maps with different parameters. Regardless, as expression technology continues to transition from microarrays to RNA-Seq, the demand for quantitative results is increasing. So I think it does matter how heat maps are made. Of course it's easy to ridicule Handan Wand for her quilt plots, but I think those guilty of pasting ad-hoc heat maps based on arbitrary distance measures in their papers are really the ones that deserve a public spanking.

P.S. If you’re going to make your own heat map, after adhering to sound statistics, please use a colorblind-friendly palette.

P.P.S. In this post I have ignored the issue of clustering, namely how to order the rows and columns of heat maps so that similar expression profiles cluster together. This goes along with the problem of constructing meaningful dendrograms, a visualization that has been a major factor in the popularization of the Eisengram. The choice of clustering algorithm is just as important as the choice of similarity measure, but I leave this for a future post.

Hui Jiang and Julia Salzman have posted a new paper on the arXiv proposing a novel approach to correcting for non-uniform coverage of transcripts in RNA-Seq: “A penalized likelihood approach for robust estimation of isoform expression” (October 1, 2013).

Their paper addresses the issue of non-uniformity of read coverage across transcripts in RNA-Seq, an issue that is frustrating for the challenges it presents in analysis. The non-uniformity of read coverage in RNA-Seq was first noticed in A. Mortazavi et al., Mapping and quantifying mammalian transcriptomes, Nature Methods 5 (2008), 621–628. Figure 1 in the paper (see below) shows an example of non-uniform coverage, and the paper discusses ideas for library preparation that can reduce bias and improve uniformity.

Figure 1b from Mortazavi et al. (2008) showing (non-uniform) coverage of Myf6.

Supplementary Figure 1a from Mortazavi et al. (2008) describing uniformity of coverage achievable with different library preparations. “Deviation from uniformity” was assessed using the Kolmogorov-Smirnov test.

The experimental approach of modifying library preparation to reduce non-uniformity has been complemented by statistical approaches to the problem. Specifically, various models have been proposed for "correcting" for experimental artefacts that induce non-uniform coverage. To understand Jiang and Salzman's latest paper, it is helpful to review previous approaches that have been proposed.

Don’t believe the anti-hype. They are saying that RNA-Seq promises the discovery of new expression events, but it doesn’t deliver:

Is this true? There have been a few papers comparing microarrays to RNA-Seq technology (including one of my own) that I’ll discuss below, but first a break-down of the Affymetrix “evidence”. The first is this figure (the poor quality of the images is exactly what Affymetrix provides, and not due to a reduction in quality on this site; they are slightly enlarged when clicked on):

The content of this figure is an illustration of the gene LMNB1 (Lamin protein of type B), used to argue that microarrays can provide transcript level resolution whereas RNA-Seq can’t!! Really? Affymetrix is saying that RNA-Seq users would likely use the RefSeq annotation which only has three isoforms. But this is a ridiculous claim. It is well known that RefSeq is a conservative annotation and certainly RNA-Seq users have the same access to the multiple databases Affymetrix used to build their annotation (presumably, e.g. Ensembl). It therefore seems that what Affymetrix is saying with this figure is that RNA-Seq users are dumb.

The next figure shows the variability in abundance estimates as a function of expression level for RNA-Seq and the HTA 2.0, with the intended message being that microarrays are less noisy:

But there is a subtle trick going on here. And it's in the units. The x-axis is showing RPM, which is an abbreviation for Reads Per Million. This is not a commonly used unit, and there is a reason. First, it's helpful to review what is used. In his landmark paper on RNA-Seq, Ali Mortazavi introduced the unit RPKM (note the extra K), which stands for reads per kilobase of transcript per million mapped reads. Why the extra kilobase term? In my review on RNA-Seq quantification I explain that RPKM is proportional to a maximum likelihood estimate of transcript abundance (obtained from a simple RNA-Seq model). The complete derivation is on page 6 ending in Equation 13; I include a summary here:

The maximum likelihood (ML) abundances $\hat{\rho}_t$ are  given by

$\hat{\rho}_t = \frac{\frac{\hat{\alpha}_t}{l_t}}{\sum_{r \in T} \frac{\hat{\alpha}_r}{l_r}} \propto \frac{X_t}{\left( \frac{l_t}{10^3}\right) \left( \frac{N}{10^6}\right) }$

In these equations $l_t$ is the length of transcript $t$ (if reads are long it is necessary to modify the length due to edge effects, hence the tilde in the paper), the $\hat{\alpha}_t$ are the maximum likelihood estimates for the probabilities of selecting reads from transcripts (unnormalized by their length), and finally $X_t$ is the number of reads mapping to transcript $t$ while $N$ is the total number of mapped reads. The point is that RPKM (the rightmost formula for abundance) is really a unit for describing the maximum likelihood relative abundances ($\hat{\rho}$) scaled by some factors.

RPKM as a unit has two problems. The first is that in current RNA-Seq experiments reads are paired, so that the actual units being counted (in $X_t$) are fragments. For this reason we modified RPKM to FPKM in the Cufflinks paper (the "F" replacing "R" stands for fragment instead of read). A more serious problem, noted by Bo Li and Colin Dewey in their paper on RSEM, is that while FPKM is proportional to ML estimates of abundance, the proportionality constant may vary between experiments. For this reason they proposed TPM (transcripts per million), which is also proportional to the ML abundance estimates but with a proportionality constant (a million) that is the same between experiments. All of these units are used simply to avoid writing down the $\hat{\rho}_t$, which are in many cases tiny numbers since they must all sum to 1.
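To make the relationship between these units concrete, here is a small sketch with made-up counts and lengths (in practice the lengths should be effective lengths, and the counts would come from probabilistic read assignment rather than unique mapping):

import numpy as np

counts  = np.array([500.0, 1000.0, 200.0])     # reads/fragments per transcript
lengths = np.array([1000.0, 4000.0, 1000.0])   # (effective) transcript lengths
N = counts.sum()                               # total mapped reads

rpm  = counts / (N / 1e6)                      # reads per million: no length term
rpkm = counts / (lengths / 1e3) / (N / 1e6)    # reads per kilobase per million

rho = (counts / lengths) / np.sum(counts / lengths)   # ML relative abundances
tpm = rho * 1e6                                       # transcripts per million

print(rpm)                 # length-confounded, proportional to the alpha_t
print(rpkm)                # proportional to rho, but the constant varies between experiments
print(tpm, tpm.sum())      # proportional to rho, always sums to 10^6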

Returning to the Affymetrix figure, we see the strange RPM units. In essence, this is the rightmost term in the equation above, with the $l_t$ length terms removed from the denominators. Therefore RPM is proportional to the $\hat{\alpha}_t$. If a transcript is short, even if it is equally abundant to a longer transcript, it will produce fewer RNA-Seq reads and therefore its $\hat{\alpha}_t$ will be (possibly considerably) smaller. The effect of displaying RPM for RNA-Seq vs. expression level for the HTA 2.0 arrays is therefore to mix apples and oranges. Since what is being displayed is a coefficient of variation, there is a bias caused by the relationship between length and expression (e.g. many highly expressed housekeeping genes are short).

To be fair to Affymetrix, the conversion between the $\hat{\alpha}$ and the $\hat{\rho}$ can be confusing (it's explained in Lemma 14 in the Supplement of the Cufflinks paper). So maybe the discordant x-axes were unintentional… but then there is the third figure:

Here it's a bit hard to tell what is going on because not all the information needed to decipher the figure is provided. For example, it's not clear how the "expression of exons" was computed or measured for the RNA-Seq experiment. I suspect that as with the previous figure, read numbers were not normalized by length of exons, and moreover spliced reads (and other possibly informative reads from transcripts) were ignored. In other words, I don't really believe the result.

Having said this, it is true that expression arrays can have an advantage in measuring exon expression, because an array measurement is absolute (as opposed to the relative quantification that is all that is possible with RNA-Seq). Array signal is based on hybridization, and it is probably a reasonable assumption that some minimum amount of RNA triggers a signal, and that this amount is independent of the remainder of the RNA in an experiment. So arrays can (and in many cases probably do) have advantages over RNA-Seq.

There are a few papers that have looked into this, for example the paper “A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae ” by Nookaew et al., Nucleic Acids Research 40 (2012) who find high reproducibility in RNA-Seq and consistency between arrays and RNA-Seq.  Xu et al., in “Human transcriptome array for high-throughput clinical studies“, PNAS 108 (2011), 3707–3712 are more critical, agreeing with Affymetrix that arrays are more sensitive at the exon level. For disease studies, they recommend using RNA-Seq to identify transcripts relevant to the disease, and then screening for those transcripts on patients using arrays.

For the Cuffdiff2 paper describing our new statistical procedures for differential analysis of transcripts and genes, the Rinn lab performed deep RNA-Seq and array expression measurement on the same samples from a HOXA1 knockdown (the experiments included multiple replicates of both the RNA-Seq and the arrays). To my knowledge, it is the deepest and most comprehensive data currently available for comparing arrays and RNA-Seq. Admittedly, the arrays used were not Affymetrix but Agilent SurePrint G3, and the RNA-Seq coverage was deep; however we had two main findings very different from the Affymetrix claims. First, we found overall strong correlation between array expression values and RNA-Seq abundance estimates. The correlation remained strong for large regimes of expression even with very few reads (tested by sequencing fewer reads from a MiSeq). Second, we found that arrays were missing differentially expressed transcripts, especially at low abundance levels. In other words, we found RNA-Seq to have higher resolution. The following figure from our paper made the case (note the overall Spearman correlation was 0.86):

There are definitely continued applications for arrays, both in high-throughput screening applications (as suggested in the Xu et al. paper) and in the development of novel assays. For example Mercer et al., "Targeted RNA sequencing reveals the deep complexity of the human transcriptome", Nature Biotechnology 30 (2012) 99–104, show how to couple capture (with arrays) with RNA-Seq to provide ultra deep sequencing in subsets of the transcriptome. So it's not yet time to write off arrays. But RNA-Seq has many applications of its own. For example the ability to better detect allele-specific expression, the opportunity to identify RNA-DNA differences (and thereby study RNA editing), and the ability to study expression in non-model organisms where genome sequences are incomplete and annotations poor. For all these reasons I'm betting on RNA-Seq.

RNA-Seq is the new kid on the block, but there is still something to be learned from the stodgy microarray. One of the lessons is hidden in a tech report by Daniela Witten and Robert Tibshirani from 2007: “A comparison of fold-change  and the t-statistic for microarray data analysis“.

The tech report makes three main points. The first is that it is preferable to use a modified t-statistic rather than the ordinary t-statistic. This means that rather than comparing (normalized) means using

$T_i = \frac{\bar{x_i} - \bar{y_i}}{s_i}$

where $s_i$ is the standard deviation of the replicates $x_i$ (respectively $y_i$) of gene i in two different conditions, it is better to use

$T'_i = \frac{\bar{x_i} - \bar{y_i}}{s_i+s_0}$

where $s_0$ minimizes the coefficient of variation of $T'_i$.
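For concreteness, here is a toy sketch of the two statistics (my own code, not Witten and Tibshirani's; the choice of $s_0$ below uses a crude stand-in for their coefficient-of-variation criterion, namely a grid search over percentiles of the $s_i$ that makes the typical size of $T'_i$ as uniform as possible across bins of genes grouped by $s_i$):

import numpy as np

def t_stats(x, y, s0=0.0):
    # x, y: genes x replicates arrays for the two conditions
    diff = x.mean(axis=1) - y.mean(axis=1)
    s = np.sqrt(x.var(axis=1, ddof=1) / x.shape[1] +
                y.var(axis=1, ddof=1) / y.shape[1])   # per-gene standard error
    return diff / (s + s0), s

def choose_s0(x, y, n_bins=10, percentiles=np.arange(0, 101, 5)):
    _, s = t_stats(x, y)
    groups = np.array_split(np.argsort(s), n_bins)    # genes binned by s_i
    best_s0, best_cv = 0.0, np.inf
    for s0 in np.percentile(s, percentiles):
        T, _ = t_stats(x, y, s0)
        spread = np.array([np.median(np.abs(T[g])) for g in groups])
        cv = spread.std() / spread.mean()             # how much |T'| depends on s_i
        if cv < best_cv:
            best_s0, best_cv = s0, cv
    return best_s0

# made-up data: 2000 genes, 4 replicates per condition
rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 4))
y = rng.normal(size=(2000, 4))
T_mod, _ = t_stats(x, y, choose_s0(x, y))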

The second point made is that the intuition that reproducibility implies accuracy is not correct (fold change had been proposed for use instead of a t-statistic because the results were more reproducible).

The third point, in my opinion the most important one, I quote directly from the report:

“A researcher should choose the measure of differential expression based on the biological system of interest. If large absolute changes in expression are relevant to the system, then fold-change should be used; on the other hand, if changes in expression relative to the underlying noise are important, then a modified t-statistic is preferable.”

How does this pertain to RNA-Seq? Microarray experiments and RNA-Seq both measure expression, but the translation of methods for the analysis of one platform to the other can be non-trivial. One reason is that in RNA-Seq experiments accurately measuring "fold-change" is difficult. Read counts accumulated across a gene cannot be used directly to estimate fold change because the transcripts making up the gene may have different lengths. For this reason, methods such as Cufflinks, RSEM or eXpress (and most recently Sailfish, reviewed recently on this blog) use the EM algorithm to "deconvolute" ambiguously mapped reads (a minimal sketch of the idea appears after the figure below). The following thought experiment (Figure 1 in our paper describing Cufflinks/Cuffdiff 2) illustrates the issue:

Changes in fragment counts for a gene do not necessarily equal a change in expression. The "exon-union" method counts reads falling on any of a gene's exons, whereas the "exon-intersection" method counts only reads on constitutive exons. Both the exon-union and exon-intersection counting schemes may incorrectly estimate a change in expression in genes with multiple isoforms, as shown in the table. It is important to note that the problem of fragment assignment described here in the context of RNA-Seq is crucial for accurate estimation of parameters in many other *Seq assays.
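As promised above, here is a bare-bones sketch of the EM idea (assuming a uniform read model and ignoring fragment lengths and effective lengths; this is not the actual Cufflinks/RSEM/eXpress implementation). Each read carries the set of transcripts it is compatible with, and EM alternates between fractionally assigning reads and re-estimating abundances:

import numpy as np

def em_abundances(compatibility, n_transcripts, n_iter=100):
    # compatibility: list of arrays; compatibility[r] = transcripts read r maps to
    rho = np.full(n_transcripts, 1.0 / n_transcripts)    # start uniform
    for _ in range(n_iter):
        counts = np.zeros(n_transcripts)
        for tx in compatibility:                   # E-step: split each read
            counts[tx] += rho[tx] / rho[tx].sum()  # in proportion to current rho
        rho = counts / counts.sum()                # M-step: renormalize
    return rho

# toy example: 10 reads ambiguous between transcripts 0 and 1,
# 20 reads unique to transcript 1, 10 reads unique to transcript 2
reads = [[0, 1]] * 10 + [[1]] * 20 + [[2]] * 10
print(em_abundances([np.array(r) for r in reads], 3))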

“Count-based” methods for differential expression, such as DESeq, work directly with accumulated gene counts and are based on the premise that even if estimated fold-change is wrong, statistical significance can be assessed based on differences between replicates.  In our recent paper describing Cuffdiff 2 (with a new method for differential abundance analysis) we examine DESeq (as a proxy for count-based methods) carefully and show using both simulation and real data that fold-change is not estimated accurately. In fact, even when DESeq and Cufflinks both deem a gene to be differentially expressed, and even when the effect is in the same direction (e.g. up-regulation), DESeq can (and many times does) estimate fold-change incorrectly. This problem is not specific to DESeq. All “count based” methods that employ naive heuristics for computing fold change will produce inaccurate estimates:

Comparison of fold-change estimated by Cufflinks (tail of arrows) vs. “intersection-count” (head of arrows) reproduced from Figure 5 of the supplementary material of the Cuffdiff 2 paper. “Intersection-count” consists of the accumulated read counts in the regions shared among transcripts in a gene. The x-axis shows array fold change vs. the estimated fold-change on the y-axis.  For more details on the experiment see the Cuffdiff 2 paper.

In other words,

it is essential to perform fragment assignment in a biological context where absolute expression differences are relevant to the system.

What might that biological context be? This is a subjective question but in my experience users of microarrays or RNA-Seq (including myself) always examine fold-change in addition to p-values obtained from (modified) t-statistics or other model based statistics because the raw fold-change is more directly connected to the data from the experiment.

In many settings though, statistical significance remains the gold standard for discovery. In the recent epic “On the immortality of television sets: ‘function’ in the human genome according to the evolution-free gospel of ENCODE“, Dan Graur criticizes the ENCODE project for reaching an “absurd conclusion” through various means, among them the emphasis of “statistical significance rather than magnitude of effect”. Or, to paraphrase Samuel Johnson,

statistical significance is the last refuge from a poor analysis of data.