You are currently browsing the tag archive for the ‘RNA-Seq’ tag.

This post is the third in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

1. Near-optimal pre-processing of single-cell RNA-seq
2. Single-cell RNA-seq for dummies
3. How to solve an NP-complete problem in linear time
4. Rotating the knee (plot) and related yoga
5. High velocity RNA velocity

There is a million dollar prize on offer for a solution to the P vs. NP problem, so it’s understandable that one may wonder whether this blog post is an official entry. It is not.

The title for this post was inspired by a talk presented by David Tse at the CGSI 2017 meeting where he explained “How to solve NP-hard assembly problems in linear time“. The gist of the talk was summarized by Tse as follows:

“In computational genomics there’s been a lot of problems where the formulation is combinatorial optimization. Usually they come from some maximum likelihood formulation of some inference problem and those problems end up being mostly NP-hard. And the solution is typically to develop some heuristic way of solving the NP-hard problem. What I’m saying here is that actually there is a different way of approaching such problems. You can look at them from an information point of view.”

Of course thinking about NP-hard problems from an information point of view does not provide polynomial algorithms for them. But what Tse means is that information-theoretic insights can lead to efficient algorithms that squeeze the most out of the available information.

One of the computational genomics areas where an NP-complete formulation for a key problem was recently proposed is in single-cell RNA-seq pre-processing. After RNA molecules are captured from cells, they are amplified by PCR, and it is possible, in principle, to account for the PCR duplicates of the molecules by making use of unique molecular identifiers (UMIs). Since UMIs are (in theory) unique to each captured molecule, but identical among the PCR duplicates of that captured molecule, they can be used to identify and discard the PCR duplicates. In practice distinct captured molecules may share the same UMI causing a collision, so it can be challenging to decide when to “collapse” reads to account for PCR duplicates.

In the recent paper Srivastava et al. 2019, the authors developed a combinatorial optimization formulation for collapsing. They introduce the notion of “monochromatic arborescences” on a graph, where these objects correspond to what is, in the language of the previous post, elements of the set C. They explain that the combinatorial optimization formulation of UMI collapsing in this framework is to find a minimum cardinality covering of a certain graph by monochromatic arboresences. The authors then prove the following theorem, by reduction from the dominating set decision problem:

Theorem [Srivastava, Malik, Smith, Sudbery, Patro]: Minimum cardinality covering by monochromatic arborescences is NP-complete.

Following the standard practice David Tse described in his talk, the authors then apply a heuristic to the challenging NP-complete problem. It’s all good except for one small thing. The formulation is based on an assumption, articulated in Srivastava et al. 2019 (boldface and strikethrough is mine):

…gene-level deduplication provides a conservative approach and assumes that it is highly unlikely for molecules that are distinct transcripts of the same gene to be tagged with a similar UMI (within an edit distance of 1 from another UMI from the same gene). However, entirely discarding transcript-level information will mask true UMI collisions to some degree, even when there is direct evidence that similar UMIs must have arisen from distinct transcripts. For example, if similar UMIs appear in transcript-disjoint equivalence classes (even if all of the transcripts labeling both classes belong to the same gene), then they cannot have arisen from the same pre-PCR molecule. Accounting for such cases is especially true [important] when using an error-aware deduplication approach and as sequencing depth increases.

The one small thing? Well… the authors never checked whether the claim at the end, namely that “accounting for such cases is especially important”, is actually true. In our paper “Modular and efficient pre-processing of single-cell RNA-seq” we checked. The result is in our Figure 1d:

Each column in the figure corresponds to a dataset, and the y-axis shows the distribution (over cells) of the proportion of counts one can expect to lose if applying naïve collapsing to a gene. Naïve collapsing here means that two reads with the same UMI are considered to have come from the same molecule. The numbers are so small we had to include an inset in the top right. Basically, it almost never happens that there is “direct evidence that similar UMIs must have arisen from distinct transcripts”. If one does observe such an occurrence, it is almost certainly an artifact of missing annotation. In fact, this leads to an…

💡 Idea: prioritize genes with colliding UMIs for annotation correction. The UMIs directly highlight transcripts that are incomplete. Maybe for a future paper, but returning to the matter at hand…

Crucially, the information analysis shows that there is no point in solving an NP-complete problem in this setting. The naïve algorithm not only suffices, it is sensible to apply it. And the great thing about naïve collapsing is that it’s straightforward to implement and run; the algorithm is linear. The Srivastava et al. question of what is the “minimum number of UMIs, along with their counts, required to explain the set of mapped reads” is a precise, but wrong question. In the words of John Tukey: “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”

The math behind Figure 1d is elementary but interesting (see the Supplementary Note of our paper). We work with a simple binomial model which we justify based on the data. For related work see Petukhov et al. 2018. One interesting result that came out of our calculations (work done with Sina Booeshaghi), is an estimate for the effective number of UMIs on each bead in a cell. This resulted in Supplementary Figure 1:

The result is encouraging. While the number of UMIs on a bead is not quite $4^L$ where L is the length of the UMI (theoretical maximum shown by dashed red line for v2 chemistry and solid red line for v3 chemistry), it is nevertheless high. We don’t know whether the variation is a result of batch effect, model mis-specification, or other artifacts; that is an interesting question to explore with more data and analysis.

As for UMI collapsing, the naïve algorithm has been used for almost every experiment to date as it is the method that was implemented in the Cell Ranger software, and subsequently adopted in other software packages. This was done without any consideration of whether it is appropriate. As the Srivastava et al. paper shows, intuition is not to be relied upon, but fortunately, in this case, the naïve approach is the right one.

This post is the second in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

A few months ago, while working on the kallisto | bustools project, some of us in the lab were discussing various aspects of single-cell RNA-seq technology when the conversation veered into a debate over the meaning of some frequently used words and phrases in the art: “library complexity”, “library size”, “sensitivity”, “capture rate”, “saturation”, “number of UMIs”, “bork bork bork” etc. There was some sense of confusion. I felt like a dummy because even after working on RNA-seq for more than a decade, I was still lacking language and clarity about even the most basic concepts. This was perhaps not entirely my fault. Consider, for example, that the phrase “library size” is used to mean “the number of molecules in a cDNA library” by some authors, and the “number of reads sequenced” by others.

Since we were writing a paper on single-cell RNA-seq pre-processing that required some calculations related to the basic concepts (libraries, UMIs, and so on), we decided to write down notation for the key objects. After some back-and-forth, Sina Booeshaghi and I ended up drafting the diagram below that summarizes the sets of objects in a single-cell RNA-seq experiment, and the maps that relate them:

Structure of a single-cell RNA-seq experiment.

Each letter in this diagram is a set. The ensemble of RNA molecules contained within a single cell is denoted by R. To investigate R, a library (L) is constructed from the set of molecules captured from R (the set C). Typically, L is the result of of various fragmentation and amplification steps performed on C, meaning each element of C may be observed in L with some multiplicity. Thus, there is an inclusion map from C to L (arrow with curly tail), and an injection from C to R (arrows with head and tail). The library is interrogated via sequencing of some of the molecules in L, resulting in a set F of fragments. Subsequently, the set F is aligned or pseudoaligned to create a set B, which in our case is a BUS file. Not every fragment F is represented in B, hence the injection, rather than bijection, from B to F, and similarly from F to L. The set T consists of transcripts that correspond to molecules in C that were represented in B. Note that $|R| \geq |C| \geq |T|$. Separately, the set U consists of the unique molecular identifiers (UMIs) available to label molecules from the cell, and I is a multiset of UMIs associated with the molecules in T. Importantly, the data from an experiment consists of F, together with the support of I. The support of I means the number of distinct objects in I, and is denoted by |supp(I)|. The common term is “number of distinct UMIs”.

The diagram has three distinct parts. The sets on the top (L, F, B) are “lifted” from  and by PCR. Without PCR one would be in an the ideal situation of measuring C directly to produce T, which would then be used to directly draw inferences about R. This is the hope for direct RNA sequencing, a technology that is promising but that cannot yet be applied at the scale of cDNA based methods. The sets U and I are intended to be seen as orthogonal to the rest of the objects. They relate to the UMIs which, in droplet single-cell RNA-seq technology, are delivered via beads. While the figure was designed to describe single-cell RNA-seq, it is quite general and possibly a useful model for many sequence census assays.

So what is all this formality good for? Nothing in this setup is novel; any practitioner working with single-cell RNA-seq already knows what the ingredients for the technology are. However I do think there is some trouble with the language and meaning of words, and hopefully having names and labels for the relevant sets can help in communication.

The questions

With some notation at hand, it is possible to precisely articulate some of the key technical questions associated with a single-cell RNA-seq experiment:

• The alignment (or pseudoalignment) problem: compute B from F.
• The pre-processing problem: what is the set ?
• What is the library richness/complexity, i.e. what is |supp(L)|?
• What is the sensitivity, i.e. what is $\frac{|C|}{|R|}$?
• In droplet based experiments, what are the number of UMIs available to tag molecules in a cell, i.e. what is |U|?

These basic questions are sometimes confused with each other. For example, the capture rate refers to the proportion of cells from a sample that are captured in an experiment and should not be confused with sensitivity. The |supp(L)| is a concept that is natural to refer to when thinking about a cDNA library. Note that the “library size”, referred to in the beginning of this post, is used by molecular biologists to naturally mean |L|, and not |F| (this confusion was unfortunately disseminated by the highly influential RNA-seq papers Anders and Huber, 2010 and Robinson and Oshlack, 2010) . The support of another set, |supp(I)|, is one that is easy to measure but precisely because I is a multiset, $|I| \neq |supp(I)|$, and there is considerable confusion about this fact. The number of distinct UMIs, |supp(I)|, is frequently used in lieu of the set whose size is being referred to, namely |I| (this is the case when “knee plots” are made, a topic for the fourth blog post in this series). Similarly, |U| is usually not estimated, and the number $4^L$ where $L$ is the length of the UMIs is used in its stead. This is partly intellectual laziness but partly, I think, the lack of standard notation to refer to the objects in single-cell RNA-seq experiments.

This diagram in this post is just step 0 in discussing single-cell RNA-seq. There is a lot more subtlety and nuance in understanding and interpreting experiments (see Introduction to single-cell RNA-seq technologies). ∎

This post is the first in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

1. Near-optimal pre-processing of single-cell RNA-seq
2. Single-cell RNA-seq for dummies
3. How to solve an NP-complete problem in linear time
4. Rotating the knee (plot) and related yoga
5. High velocity RNA velocity

During the past few years computational biologists have expended enormous effort in developing tools for processing and analyzing single-cell RNA-seq. This post describes yet another: the kallisto|bustools workflow for pre-processing single-cell RNA-seq. A preprint describing the method (was recently posted on the bioRχiv.

Number of single-cell RNA-seq tools (from the scRNA-tools catalog).

Given that there are so many programs, a natural question is: why on earth would we write yet another software program for generating a count matrix from single-cell RNA-seq reads when there are already plenty of programs out there? There’s alevin, cell rangerdropseqpipedropseqtoolsindrops… I’ve been going in alphabetical order but have to jump in with starsolo because it’s got the coolest name…now back to optimus, scruff, scpipescumiumis, zumis,  and I’m probably forgetting a few other something-umis. So why another one?

The answer requires briefly venturing back to a time long, long ago when RNA-seq was a fledgling, exciting new technology (~2009). At the time the notion of an “equivalence class” was introduced to the field (see e.g. Jiang and Wong, 2009 or Nicolae et al., 2011). Briefly, there is a natural equivalence relation on the set of reads in an RNA-seq experiment, where two reads are related when they are compatible with (i.e. could have originated from) exactly the same set of transcripts. The equivalence relation partitions the reads into equivalence classes, and, in a slight abuse of notation, the term “equivalence class” in RNA-seq is used to denote the set of transcripts corresponding to an equivalence class of reads. Starting with the pseudoalignment program kallisto that we published in Bray et al. 2016, it became possible to rapidly obtain the (transcript) equivalence classes for reads from an RNA-seq experiment.

When single-cell RNA-seq started to scale it became apparent to those of us working with equivalence classes for bulk RNA-seq that rather than counting genes from single-cell RNA-seq data, it would be better to examine what we called transcript compatibility counts (TCCs), i.e. counts of the equivalence classes (the origin of the term TCC is discussed in a previous blog post of mine). This vision has borne out: we recently published a paper demonstrating the power of TCCs for differential analysis of single-cell data (Ntranos, Yi et al. 2019) and I believe TCCs are ideal for many different single-cell RNA-seq analyses. So back to the question: why another single-cell RNA-seq pre-processing workflow?

Already in 2016 we wanted to be able to produce TCC matrices from single-cell RNA-seq data but there was no program to do it. My postdoc at the time, Vasilis Ntranos, developed a workflow, but in the course of working on a program he started to realize that there were numerous non-trivial aspects to processing single-cell RNA-seq. Even basic questions, such as how to correct barcodes or collapse UMIs required careful thought and analysis. As more and more programs for single-cell RNA-seq pre-processing started to appear, we examined them carefully and noted two things: 1. Most were not able to output TCC matrices and 2. They were, for the most part, based on ad hoc heuristics and unvalidated methods. Many of the programs were not even released with a preprint or paper. So we started working on the problem.

A key insight was that we needed a new format to allow for modular pre-processing. So we developed such a format, which we called the Barcode, UMI, Set (BUS) format, and we published a paper about it earlier this year (Melsted, Ntranos et al., 2019). This allowed us to start investigating different algorithms for the key steps, and to rearrange them and plug them in to an overall workflow as needed. Finally, after careful consideration of each of the key steps, weighing tradeoffs between efficiency and accuracy, and extensive experimentation, we settled on a workflow that is faster than any other method and based on reason rather than intuition. The workflow uses two programs, kallisto and bustools, and we call it the kallisto|bustools workflow. Some highlights:

• kallisto|bustools can produce a TCC matrix. The matrix is compatible with the gene count matrix (it collapses to the latter), and of course gene count matrices can be output as well for use in existing downstream tools.
• The workflow is very very fast. With kallisto|bustools very large datasets can be processed in minutes. The title of this post refers to the workflow as “near-optimal” because it runs in time similar to the unix word count function. Maybe it’s possible to be a bit faster with some optimizations, but probably not by much:
• kallisto|bustools uses very little memory. We worked hard to achieve this feature, as we wanted it to be useful for large-scale analyses that are going to be performed by consortia such as the Human Cell Atlas project. The workflow currently uses ~3.5Gb of RAM for processing 10x v2 chemistry data, and ~11Gb RAM for 10x v3 chemistry data; both numbers are independent of the number of reads being processed. This means users can pre-process data on a laptop:
• The workflow is modular, thanks to its reliance on the flexible BUS format. It was straightforward to develop an RNA velocity workflow (more on this in a companion blog post). It will be easy to adapt the workflow to various technologies, to multiomics experiments, and to any custom analysis required:
• We tried to create comprehensive, yet succinct documentation to help make it easy to use the software (recommendations for improvements are welcome). We have online tutorials, as well as videos for novices:
Installation instructions (and video)
Getting started tutorial (and video).
– Manuals for kallisto and bustools.
– Complete code for reproducing all the results in the preprint
• We were not lazy. In our tests we found variability in performance on different datasets so we tested the program extensively and ran numerous benchmarks on 10x Genomics data to validate Cell Ranger with respect to kallisto|bustools (note that Cell Ranger’s methods have been neither validated nor published). We compiled a benchmark panel consisting of 20 datasets from a wide variety of species. This resulted in 20 supplementary figures, each with 8 panels showing: a) the number of genes detected, b) concordance in counts per gene, c) number of genes detected, d) correlation in gene counts by cell, e) spatial separation between corresponding cells vs. neighboring cells, f,g) t-SNE analysis, h) gene set analysis to detect systematic differences in gene abundance estimation (see example below for the dataset SRR8257100 from the paper Ryu et al., 2019). We also examined in detail results on a species mixing experiment, and confirmed that Cell Ranger is consistent with kallisto on that as well. One thing we did not do in this paper is describe workflows for different technologies but such workflows and accompanying tutorials will be available soon:
• In addition we ran a detailed analysis on the 10x Genomics 10k E18 mouse brain dataset to investigate whether Cell Ranger pre-processing produces different results than kallisto insofar as downstream analyses are concerned. We looked at dimensionality reduction, clustering, identification of marker genes, marker gene prevalence, and pseudotime. The results were all highly concordant. An example (the pseudotime analysis) is shown below:
• We did the math on some of the basic aspects of single-cell RNA-seq. We’re not the first to do this (see, e.g. Petukhov et al., 2018), but one new result we have is an estimation of the UMI diversity on beads. This should be useful for those developing new technologies, or trying to optimize existing protocols:

Note that this post is the first in a series of five that discuss in more detail various aspects of the paper (see links at the top). Finally, a note on reproducibility and usability:

The development of the kallisto|bustools workflow, research into the methods, compilation of the results, and execution of the project required a tremendous team effort, and in working on it I was thinking of the first bioinformatics tool I wrote about and posted to the arXiv (the bioRxiv didn’t exist yet). The paper was:

Nicolas Bray and Lior Pachter, MAVID: Constrained ancestral alignment of multiple sequences, arXiv, 2003.

At the time we posted the code on our own website (now defunct, but accessible via the Wayback machine). We did our best to make the results reproducible but we were limited in our options with the tools available at the time. Furthermore, posting the preprint was highly unusual; there was almost no biology preprinting at the time. Other things have stayed the same. Considerations of software portability, efficiency and documentation were relevant then and remain relevant now.

Still, there has been an incredible development in the tools and techniques available for reproducibility and usability since that time. A lot of the innovation has been made possible by cloud infrastructure, but much of the development has been the result of changes in community standards and requirements (see e.g., Weber et al., 2019). I thought I’d compile a list of the parts and pieces of the project; they are typical for what is needed for a bioinformatics project today and comparing them to the bar in 2003 is mind boggling:

Software: GitHub repositories (kallisto and bustools); releases of binaries for multiple operating systems (Mac, Linux, Windows, Rock64); portable source code with minimal dependencies; multithreading; memory optimization; user interface.

Paper: Preprint (along with extensive Supplement providing backup for every result and claim in the main text); GitHub repository with code to reproduce all the figures/results in the preprint (reproducibility code includes R markdown, python notebooks, snakemake, software versions for every program used, fixed seeds).

Documentation: Manuals for the software; Tutorials for learning to use the code; Explanatory videos (all required materials posted on Github or available on stable websites for download).

The totality of work required to do all of this was substantial. Páll Melsted was the primary developer of kallisto and he wrote and designed bustools, which has been the foundation of the project. The key insight to adopt the BUS format was work in collaboration with Vasilis Ntranos. This was followed by long conversations on the fundamentals of single-cell RNA-seq with Jase Gehring. Sina Booeshaghi carried the project. He was responsible for the crucial UMI collapsing analysis, and put together the paper. Fan Gao, director of the Caltech Bioinformatics Resource Center, set up and implemented the extensive benchmarking, and helped fine-tune the algorithms and converge to the final approach taken. Lambda Lu conducted what I believe to be the most in-depth and detailed analysis to date of the effect of pre-processing on results. Her framework should serve as a template for future development work in this area. Eduardo Beltrame designed the benchmark panels and had a key insight about how to present results that is described in more detail in the companion post on rotating the knee plot. He also helped in the complex task of designing and building the companion websites for the project. Kristján Eldjarn Hjörleifsson helped with the RNA velocity work and helped make custom indices that turned out to be fundamental in understanding the performance of pseudoalignment in the single-cell RNA-seq setting. Sina Booeshaghi spent a lot of time thinking about how to optimize the user experience, making the tutorials and videos, and working overall to make the results of the paper not just reproducible, but the the methods usable.

This post is a review of a recent preprint on an approach to testing for RNA-seq gene differential expression directly from transcript compatibility counts:

Marek Cmero, Nadia M Davidson and Alicia Oshlack, Fast and accurate differential transcript usage by testing equivalence class counts, bioRxiv 2018.

To understand the preprint two definitions are important. The first is of gene differential expression, which I wrote about in a previous blog post and is best understood, I think, with the following figure (reproduced from Supplementary Figure 1 of Ntranos, Yi, et al., 2018):

In this figure, two isoforms of a hypothetical gene are called primary and secondary, and two conditions in a hypothetical experiment are called “A” and “B”. The black dots labeled conditions A and B have x-coordinates $x_A$ and $x_B$ corresponding to the abundances of the primary isoform in the respective conditions, and y-coordinates $y_A$ and $y_B$ corresponding to the abundance of the secondary isoforms. In data from the hypothetical experiment, the black dots represent the mean level of expression of the constituent isoforms as derived from replicates. Differential transcript expression (DTE) refers to change in one of the isoforms. Differential gene expression (DGE) refers to change in overall gene expression (i.e. expression as the sum of the expression of the two isoforms). Differential transcript usage (DTU) refers to change in relative expression between the two isoform and gene differential expression (GDE) refers to change in expression along the red line. Note that DGE, DTU and DGE are special cases of GDE.

The Cmero et al. preprint describes a method for testing for GDE, and the method is based on comparison of equivalence classes of reads between conditions. There is a natural equivalence relation $\sim$ on the set of reads in an RNA-seq experiment, where two reads $r_1$ and $r_2$ are related by $\sim$ when $r_1$ and $r_2$ align (ambiguously) to exactly the same set of transcripts (see, e.g. Nicolae et al. 2011). The equivalence relation $\sim$ partitions the reads into equivalence classes, and, in a slight abuse of notation, the term “equivalence class” in RNA-seq is used to denote the set of transcripts corresponding to an equivalence class of reads. Starting with the pseudoalignment program kallisto published in Bray et al. 2016, it became possible to rapidly obtain the (transcript) equivalence classes for reads from an RNA-seq experiment.

In previous work (Ntranos et al. 2016) we introduced the term transcript compatibility counts to denote the cardinality of the (read) equivalence classes. We thought about this name carefully; due to the abuse of notation inherent in the term “equivalence class” in RNA-seq, we felt that using “equivalence class counts” would be confusing as it would be unclear whether it refers to the cardinalities of the (read) equivalence classes or the (transcript) “equivalence classes”.

With these definitions at hand, the Cmero et al.’s preprint can be understood to describe a method for identifying GDE between conditions by directly comparing transcript compatibility counts. The Cmero et al. method is to perform Šidák aggregation of p-values for equivalence classes, where the p-values are computed by comparing transcript compatibility counts for each equivalence class with the program DEXSeq (Anders et al. 2012). A different method that also identifies GDE by directly comparing transcript compatibility counts was previously published by my student Lynn Yi in Yi et al. 2018. I was curious to see how the Yi et al. method, which is based on Lancaster aggregation of p-values computed from transcript compatibility counts compares to the Cmero et al. method. Fortunately it was really easy to find out because Cmero et al. released code with their paper that can be used to make all of their figures.

I would like to note how much fun it is to reproduce someone else’s work. It is extremely empowering to know that all the methods of a paper are pliable at the press of a button. Below is the first results figure, Figure 2, from Cmero et al.’s paper:

It’s beautiful to see not only apples-to-apples, but the exact same apple! Reproducibility is obviously important to facilitate transparency in papers and to ensure correctness, but its real value lies in the fact that it allows for modifying and experimenting with methods in a paper. Below is the second results figure, Figure 3, from Cmero et al.’s paper:

The figure below is the reproduction, but with an added analysis in Figure 3a, namely the method of Yi et al. 2018 included (shown in orange as “Lancaster_equivalence_class”).

The additional code required for the extra analysis is just a few lines and can be downloaded from the Bits of DNA Github repository:

library(aggregation)
library(dplyr)
dm_dexseq_results <- as.data.frame(DEXSeqResults(dm_ec_results$dexseq_object)) dm_lancaster_results <- dm_dexseq_results %>% group_by(groupID) %>% summarize(pval = lancaster(pvalue, log(exonBaseMean))) dm_lancaster_results$gene_FDR <- p.adjust(dm_lancaster_results$pval, ‘BH’) dm_lancaster_results <- data.frame(gene = dm_lancaster_results$groupID,
FDR = dm_lancaster_results$gene_FDR) hs_dexseq_results <- as.data.frame(DEXSeqResults(hs_ec_results$dexseq_object))
hs_lancaster_results <- hs_dexseq_results %>% group_by(groupID) %>% summarize(pval = lancaster(pvalue, log(exonBaseMean)))
hs_lancaster_results$gene_FDR <- p.adjust(hs_lancaster_results$pval, ‘BH’)
hs_lancaster_results <- data.frame(gene = hs_lancaster_results$groupID, FDR = hs_lancaster_results$gene_FDR)

A zoom-in of Figure 3a below shows that the improvement of Yi et al.’s method in the hsapiens dataset over the method of Cmero et al. is as large as the improvement of aggregation (of any sort) over GDE based on transcript quantifications. Importantly, this is a true apples-to-apples comparison because Yi et al.’s method is being tested on exactly the data and with exactly the metrics that Cmero et al. chose:

The improvement is not surprising; an extensive comparison of Lancaster aggregation with Šidák aggregation is detailed in Yi et al. and there we noted that while Šidák aggregation performs well when transcripts are perturbed independently, it performs very poorly in the more common case of correlated effect. Furthermore, we also examined in detail DEXSeq’s aggregation (perGeneQvalue) which appears to be an attempt to perform Šidák aggregation but is not quite right, in a sense we explain in detail in Section 2 of the Yi et al. supplement. While DEXSeq’s implementation of Šidák aggregation does control the FDR, it will tend to report genes with many isoforms and consumes the “FDR budget” faster than Šidák aggregation. This is one reason why, for the purpose of comparing Lancaster and Šidák aggregation in Yi et al. 2018, we did not rely on DEXSeq’s implementation of Šidák aggregation. Needless to say, separately from this issue, as mentioned above we found that Lancaster aggregation substantially outperforms Šidák aggregation.

The figures below complete the reproduction of the results of Cmero et al. The reproduced figures are are very similar to Cmero et al.’s figures but not identical. The difference is likely due to the fact that the Cmero paper states that a full comparison of the “Bottomly data” (on which these results are based) is a comparison of 10 vs. 10 samples. The reproduced results are based on downloading the data which consists of 10 vs. 11 samples for a total of 21 samples (this is confirmed in the Bottomly et al. paper which states that they “generated single end RNA-Seq reads from 10 B6 and 11 D2 mice.”) I also noticed one other small difference in the Drosophila analysis shown in Figure 3a where one of the methods is different for reasons I don’t understand. As for the supplement, the Cmero et al. figures are shown on the left hand side below, and to their right are the reproduced figures:

The final supplementary figure is a comparison of kallisto to Salmon: the Cmero et al. paper shows that Salmon results are consistent with kallisto results shown in Figure 3a,  and reproduces the claim I made in a previous blog post, namely that Salmon results are near identical to kallisto:

The final paragraph in the discussion of Cmero et al. states that “[transcript compatibility counts] have the potential to be useful in a range of other expression analysis. In particular [transcript compatibility counts] could be used as the initial unit of measurement for many other types of analysis such as dimension reduction visualizations, clustering and differential expression.” In fact, transcript compatibility counts have already been used for all these applications and have been shown to have numerous advantages. See the papers:

Many of these papers were summarized in a talk I gave at Cold Spring Harbor in 2017 on “Post-Procrustean Bioinformatics”, where I emphasized that instead of fitting methods to the predominant data types (in the case of RNA-seq, gene counts), one should work with data types that can support powerful analysis methods (in the case of RNA-seq, transcript compatibility counts).

Three years ago, when my coauthors (Páll Melsted, Nicolas Bray, Harold Pimentel) and I published the “kallisto paper” on the arXiv (later Bray et al. “Near-optimal probabilistic RNA-seq quantification“, 2016), we claimed that kallisto removed a major computational bottleneck from RNA-seq analysis by virtue of being two orders of magnitude faster than other state-of-the-art quantification methods of the time, without compromising accuracy. With kallisto, computations that previously took days, could be performed as accurately in minutes. Even though the speedup was significant, its relevance was immediately questioned. Critics noted that experiments, library preparations and sequencing take at least months, if not years, and questioned the relevance of a speedup that would save only days.

One rebuttal we made to this legitimate point was that kallisto would be useful not only for rapid analysis of individual datasets, but that it would enable analyses at previously unimaginable scales. To make our point concrete, in a follow-up paper (Pimentel et al., “The Lair: a resource for exploratory analysis of published RNA-seq data”, 2016) we described a semi-automated framework for analysis of archived RNA-seq data that was possible thanks to the speed and accuracy of kallisto, and we articulated a vision for “holistic analysis of [short read archive] SRA data” that would facilitate “comparison of results across studies [by] use of the same tools to process diverse datasets.” A major challenge in realizing this vision was that although kallisto was fast enough to allow for low cost processing of all the RNA-seq in the short read archive (e.g. shortly after we published kallisto, Vivian et al., 2017 showed that kallisto reduced the cost of processing per sample from $1.30 to$0.19, and Tatlow and Piccolo, 2016 achieved \$0.09 per sample with it), an analysis of experiments consists of much more than just quantification. In Pimentel et al. 2016 we struggled with how to wrangle metadata of experiments (subsequently an entire paper was written by Bernstein et al. 2017 just on this problem), how to enable users to dynamically test distinct hypotheses for studies, and how to link results to existing databases and resources. As a result, Pimentel et al. 2016 was more of a proof-of-principle than a complete resource; ultimately we were able to set up analysis of only a few dozen datasets.

Now, the group of Avi Ma’ayan at the Icahn School of Medicine at Mount Sinai has surmounted the many challenges that must be overcome to enable automated analysis of RNA-seq projects on the short read archive, and has published a tool called BioJupies (Torre et al. 2018). To assess BioJupies I began by conducting a positive control in the form of analysis of data from the “Cuffdiff2” paper, Trapnell et al. 2013. The data is archived as GSE37704. This is the dataset I used to initially test the methods of Pimentel et al. 2016 and is also the dataset underlying the Getting Started Walkthrough for sleuth. I thought, given my familiarity with it, that it would be a good test case for BioJupies.

Briefly, in Trapnell et al. 2013, Trapnell and Hendrickson performed a differential analysis of lung fibroblasts in response to an siRNA knockdown of HOXA1 which is a developmental transcription factor. Analyzing the dataset with BioJupies is as simple as typing the Gene Expression Omnibus (GEO) accession on the BioJupies searchbox. I clicked “analyze”, clicked on “+” a few times to add all the possible plots that can be generated, and this opened a window asking for a description of the samples:

I selected “Perturbation” for the HOXA1 knockdown samples and “Control” for the samples that were treated with scrambled siRNA that did not target a specific gene. Finally, I  clicked on “generate notebook”…

and

BioJupies displayed a notebook (Trapnell et al. 2013 | BioJupies) with a complete analysis of the data. Much of the Trapnell et al. 2013 analysis was immediately evident in the notebook. For example, the following figure is Figure 5a in Trapnell et al. 2013. It is a gene set enrichment analysis (GSEA) of the knockdown:

BioJupies presents the same analysis:

It’s easy to match them up. Of course BioJupies presents a lot of other information and analysis, ranging from a useful PCA plot to an interesting L1000 connectivity map analysis (expression signatures from a large database of over 20,000 perturbations applied to various cell lines that match the signatures in the dataset).

One of the powerful applications of BioJupies is the presentation of ARCHS⁴ co-expression data. ARCHS⁴ is the kallisto computed database of expression for the whole and is the primary database that enables BioJupies. One of its features is a list of co-expressed genes (as ascertained via correlation across the whole short read archive). These are displayed in BioJupies making it possible to place the results of an experiment in the context of “global” transcriptome associations.

While the Trapnell et al. 2013 reanalysis was fun, the real power of BioJupies is clear when analyzing a dataset that has not yet been published. I examined the GEO database and found a series GSE60538 that appears to be a partial dataset from what looks like a paper in the works. The data is from an experiment designed to investigate the role of Sox5 and Sox6 in the mouse heart via two single knockout experiments, and a double knockout. The entry originates in 2014 (consistent with the single-end 50bp reads it contains), but was updated recently. There are a total of 8 samples: 4 controls and 4 from the double knockout (the single knockouts are not available yet). I could not find an associated paper, nor was one linked to on GEO, but the abstract of the paper has already been uploaded to the site. Just as I did with the Trapnell et al. 2013 dataset, I entered the accession in the BioJupies website and… four minutes later:

The abstract of the GSE60538 entry states that “We performed RNA deep sequencing in ventricles from DKO and control mice to identify potential Sox5/6 target genes and found altered expression of genes encoding regulators of calcium handling and cation transporters” and indeed, BioJupies verifies this result (see Beetz et al. GSE60538| BioJupies):

Of course, there is a lot more analysis than just this. The BioJupies page includes, in addition to basic QC and datasets statistics, the PCA analysis, a “clustergrammer” showing which genes drive similarity between samples, differentially expressed genes (with associated MA and volcano plots), gene ontology enrichment analysis, pathway enrichment analysis, transcription factor enrichment analysis, kinase enrichment analysis, microRNA enrichment analysis, and L1000 analysis. In a sense, one could say that with BioJupies, users can literally produce the analysis for a paper in four minutes via a website.

The Ma’ayan lab has been working towards BioJupies for some time. The service is essentially a combination of a number of tools, workflows and resources published previously by the lab, including:

With BioJupies, these tools become more than the sum of their parts. Yet while BioJupies is impressive, it is not complete. There is no isoform analysis, which is unfortunate; for example one of the key points of Trapnell et al. 2013 was how informative transcript-level analysis of RNA-seq data can be. However I see no reason why a future release of BioJupies can’t include detailed isoform analyses. Isoform quantifications are provided by kallisto and are already downloadable via ARCHS⁴. It would also be great if BioJupies were extended to organisms other than human and mouse, although some of the databases that are currently relied on are less complete in other model organisms. Still, it should even be possible to create a BioJupies for non-models. I expect the authors have thought of all of these ideas. I do have some other issues with BioJupies: e.g. the notebook should cite all the underlying programs and databases used to generate the results, and while it’s neat that there is an automatically generated methods section, it is far from complete and should include the actual calls made to the programs used so as to facilitate complete reproducibility. Then, there is my pet peeve: “library size” is not the number of reads in a sample. The number of reads sequenced is “sequencing depth”.  All of these issues can be easily fixed.

In summary, BioJupies represents an impressive breakthrough in RNA-seq analysis. It leverages a comprehensive analysis of all (human and mouse) publicly available RNA-seq data to enable rapid and detailed analyses that transcend what has been previously possible. Discoveries await.

The development of microarray technology two decades ago heralded genome-wide comparative studies of gene expression in human, but it was the widespread adoption of RNA-Seq that has led to differential expression analysis becoming a staple of molecular biology studies. RNA-Seq provides measurements of transcript abundance, making possible not only gene-level analyses, but also differential analysis of isoforms of genes. As such, its use has necessitated refinements of the term “differential expression”, and new terms such as “differential transcript expression” have emerged alongside “differential gene expression”. A difficulty with these concepts is that they are used to describe biology, statistical hypotheses, and sometimes to describe types of methods. The aims of this post are to provide a unifying framework for thinking about the various concepts, to clarify their meaning, and to describe connections between them.

To illustrate the different concepts associated to differential expression, I’ll use the following example, consisting of a comparison of a single two-isoform gene in two conditions (the figure is Supplementary Figure 1 in Ntranos, Yi et al. Identification of transcriptional signatures for cell types from single-cell RNA-Seq, 2018):

The isoforms are labeled primary and secondary, and the two conditions are called “A” and “B”. The black dots labeled conditions A and B have x-coordinates $x_A$ and $x_B$ corresponding to the abundances of the primary isoform in the respective conditions, and y-coordinates $y_A$ and $y_B$ corresponding to the abundance of the secondary isoforms. In data from an experiment the black dots will represent the mean level of expression of the constituent isoforms as derived from replicates, and there will be uncertainty as to their exact location. In this example I’ll assume they represent the true abundances.

Biology

Below is a list of terms used to characterize changes in expression:

Differential transcript expression (DTE) is change in one of the isoforms. In the figure, this is represented (conceptually) by the two red lines along the x- and y-axes respectively. Algebraically, one might compute the change in the primary isoform by $x_B-x_A$ and the change in the secondary isoform by $y_B-y_A$. However the term DTE is used to denote not only the extent of change, but also the event that a single isoform of a gene changes between conditions, i.e. when the two points lie on a horizontal or vertical line. DTE can be understood to occur as a result of transcriptional regulation if an isoform has a unique transcription start site, or post-transcriptional regulation if it is determined by a unique splicing event.

Differential gene expression (DGE) is the change in the overall output of the gene. Change in the overall output of a gene is change in the direction of  the line $y=x$, and the extent of change can be understood geometrically to be the distance between the projections of the two points onto the line $y=x$ (blue line labeled DGE). The distance will depend on the metric used. For example, the change in expression could be defined to be the total expression in condition B ($x_B+y_B$) minus the change in expression in condition A ($x_A+y_A$), which is $|x_B-x_A+y_B-y_A|$.  This is just the length of the blue line labeled “DGE” given by the $L_1$ norm. Alternatively, one could consider “DGE” to be the length of the blue line in the $L_2$ norm. As with DTE, DGE can also refer to a specific type of change in gene expression between conditions, one in which every isoform changes (relatively) by the same amount so that the line joining the two points has a slope of 1 (i.e. is angled at 45°). DGE can be understood to be the result of transcriptional regulation, driving overall gene expression up or down.

Differential transcript usage (DTU) is the change in relative expression between the primary and secondary isoforms. This can be interpreted geometrically as the angle between the two points, or alternatively as the length (as given by some norm) of the green line labeled DTU. As with DTE and DGE, DTU is also a term used to describe a certain kind of difference in expression between two conditions, one in which the line joining the two points has a slope of -1. DTU events are most likely controlled by post-transcriptional regulation.

Gene differential expression (GDE) is represented by the red line. It is the amount of change in expression along in the direction of line joining the two points. GDE is a notion that, for reasons explained below, is not typically tested for, and there are few methods that consider it. However GDE is biologically meaningful, in that it generalizes the notions of DGE, DTU and DTE, allowing for change in any direction. A gene that exhibits some change in expression between conditions is GDE regardless of the direction of change. GDE can represent complex changes in expression driven by a combination of transcriptional and post-transcriptional regulation. Note that DGE, DTU and DTE are all special cases of GDE.

If the $L_2$ norm is used to measure length and $DTE_1,DTE_2$ denote DTE in the primary and secondary isoforms respectively, then it is clear that DGE, DTU, DTE and GDE satisfy the relationship

$GDE^2 = DGE^2 + DTU^2 = DTE_1^2 + DTE_2^2.$

Statistics

The terms DTE, DGE, DTU and GDE have an intuitive biological meaning, but they are also used in genomics as descriptors of certain null hypotheses for statistical testing of differential expression.

The differential transcript expression (DTE) null hypothesis for an isoform is that it did not change between conditions, i.e. $x_A=x_B$ for the primary isoform, or $y_A=y_B$ for the secondary isoform. In other words, in this example there are two DTE null hypotheses one could consider.

The differential gene expresión (DGE) null hypothesis is that there is no change in overall expression of the gene, i.e. $x_A+y_A = x_B+y_B$.

The differential transcript usage (DTU) null hypothesis is that there is no change in the difference in expression of isoforms, i.e. $x_A-y_A = x_B - y_B$.

The gene differential expression (GDE) null hypothesis is that there is no change in expression in any direction, i.e. for all constants $a,b$, $ax_A+by_A = ax_B+by_B$.

The union differential transcript expression (UDTE) null hypothesis is that there is no change in expression of any isoform. That is, that $x_A = y_A$ and $x_B = y_B$ (this null hypothesis is sometimes called DTE+G). The terminology is motivated by $\neg \cup_i DTE_i = \cap_i DTE_i$.

Not that $UDTE \Leftrightarrow GDE$, because if we assume GDE, and set $a=1,b=0$ we obtain DTE for the primary isoform and setting $a=0,b=1$ we obtain DTE for the secondary isoform. To be clear, by GDE or DTE in this case we mean the GDE (respectively DTE) null hypothesis. Furthermore, we have that

$UDTE,GDE \Rightarrow DTE,DGE,DTU$.

This is clear because if $x_A=y_A$ and $x_B=y_B$ then both DTE null hypotheses are satisfied by definition, and both DGE and DTU are trivially satisfied. However no other implications hold, i.e. $DTE \not \Rightarrow DGE,DTU$, similarly $DGE \not \Rightarrow DTE,DTU$, and $DTU \not \Rightarrow DGE, DTE$.

Methods

The terms DGE, DTE, DTU and GDE also used to describe methods for differential analysis.

A differential gene expression method is one whose goal is to identify changes in overall gene expression. Because DGE depends on the projection of the points (representing gene abundances) to the line y=x, DGE methods typically take as input gene counts or abundances computed by summing transcript abundances $x_A+y_A$ and $x_B+y_B$. Examples of early DGE methods for RNA-Seq were DESeq (now DESeq2) and edgeR. One problem with DGE methods is that it is problematic to estimate gene abundance by adding up counts of the constituent isoforms. This issue was discussed extensively in Trapnell et al. 2013. On the other hand, if the biology of a gene is DGE, i.e. changes in expression are the same (relatively) in all isoforms, then DGE methods will be optimal, and the issue of summed counts not representing gene abundances accurately is moot.

differential transcript expression method is one whose goal is to identify individual transcripts that have undergone DTE. Early methods for DTE were Cufflinks (now Cuffdiff2) and MISO, and more recently sleuth, which improves DTE accuracy by modeling uncertainty in transcript quantifications. A key issue with DTE is that there are many more transcripts than genes, so that rejecting DTE null hypotheses is harder than rejecting DGE null hypotheses. On the other hand, DTE provides differential analysis at the highest resolution possible, pinpointing specific isoforms that change and opening a window to study post-transcriptional regulation. A number of recent examples highlight the importance of DTE in biomedicine (see, e.g., Vitting-Seerup and Sandelin 2017). Unfortunately DTE results do not always translate to testable hypotheses, as it is difficult to knock out individual isoforms of genes.

differential transcript usage method is one whose goal is to identify genes whose overall expression is constant, but where isoform switching leads to changes in relative isoform abundances. Cufflinks implemented a DTU test using Jensen-Shannon divergence, and more recently RATs is a method specialized for DTU.

As discussed in the previous section, none of null hypotheses DGE, DTE and DTU imply any other, so users have to choose, prior to performing an analysis, which type of test they will perform. There are differing opinions on the “right” approach to choosing between DGE, DTU and DTE. Sonseson et al. 2016 suggest that while DTE and DTU may be appropriate in certain niche applications, generally it’s better to choose DGE, and they therefore advise not to bother with transcript-level analysis. In Trapnell et al. 2010, an argument was made for focusing on DTE and DTU, with the conclusion to the paper speculating that “differential RNA level isoform regulation…suggests functional specialization of the isoforms in many genes.” Van den Berge et al. 2017 advocate for a middle ground: performing a gene-level analysis but saving some “FDR budget” for identifying DTE in genes for which the UDTE null hypothesis has been rejected.

There are two alternatives that have been proposed to get around the difficulty of having to choose, prior to analysis, whether to perform DGE, DTU or DTE:

differential transcript expression aggregation (DTE->G) method is a method that first performs DTE on all isoforms of every gene, and then aggregates the resulting p-values (by gene) to obtain gene-level p-values. The “aggregation” relies on the observation that under the null hypothesis, p-values are uniformly distributed. There are a number of different tests (e.g. Fisher’s method) for testing whether (independent) p-values are uniformly distributed. Applying such tests to isoform p-values per gene provides gene-level p-values and the ability to reject UDTE. A DTE->G method was tested in Soneson et al. 2016 (based on Šidák aggregation) and the stageR method (Van den Berge et al. 2017) uses the same method as a first step. Unfortunately, naïve DTE->G methods perform poorly when genes change by DGE, as shown in Yi et al. 2017. The same paper shows that Lancaster aggregation is a DTE->G method that achieves the best of both the DGE and DTU worlds. One major drawback of DTE->G methods is that they are non-constructive, i.e. the rejection of UDTE by a DTE->G method provides no information about which transcripts were differential and how. The stageR method averts this problem but requires sacrificing some power to reject UDTE in favor of the interpretability provided by subsequent DTE.

gene differential expression method is a method for gene-level analysis that tests for differences in the direction of change identified between conditions. For a GDE method to be successful, it must be able to identify the direction of change, and that is not possible with bulk RNA-Seq data. This is because of the one in ten rule that states that approximately one predictive variable can be estimated from ten events. In bulk RNA-Seq, the number of replicates in standard experiments is three, and the number of isoforms in multi-isoform genes is at least two, and sometimes much more than that.

In Ntranos, Yi et al. 2018, it is shown that single-cell RNA-Seq provides enough “replicates” in the form of cells, that logistic regression can be used to predict condition based on expression, effectively identifying the direction of change. As such, it provides an alternative to DTE->G for rejecting UDTE. The Ntranos and Yi GDE methods is extremely powerful: by identifying the direction of change it is a DGE methods when the change is DGE, it is a DTU method when the change is DTU, and it is a DTE method when the change is DTE. Interpretability is provided in the prediction step: it is the estimated direction of change.

Remarks

The discussion in this post is based on an example consisting of a gene with two isoforms, however the concepts discussed are easy to generalize to multi-isoform genes with more than two transcripts. I have not discussed differential exon usage (DEU), which is the focus of the DEXSeq method because of the complexities arising in genes which don’t have well-defined shared exons. Nevertheless, the DEXSeq approach to rejecting UDTE is similar to DTE->G, with DTE replaced by DEU. There are many programs for DTE, DTU and (especially) DGE that I haven’t mentioned; the ones cited are intended merely to serve as illustrative examples. This is not a comprehensive review of RNA-Seq differential expression methods.

Acknowledgments

The blog post was motivated by questions of Charlotte Soneson and Mark Robinson arising from an initial draft of the Ntranos, Yi et al. 2018 paper. The exposition was developed with Vasilis Ntranos and Lynn Yi. Valentine Svensson provided valuable comments and feedback.

[September 2, 2017: A response to this post has been posted by the authors of Patro et al. 2017, and I have replied to them with a rebuttal]

Spot the difference

One of the maxims of computational biology is that “no two programs ever give the same result.” This is perhaps not so surprising; after all, most journals seek papers that report a significant improvement to an existing method. As a result, when developing new methods, computational biologists ensure that the results of their tools are different, specifically better (by some metric), than those of previous methods. The maxim certainly holds for RNA-Seq tools. For example, the large symmetric differences displayed in the Venn diagram below (from Zhang et al. 2014) are typical for differential expression tool benchmarks:

In a comparison of RNA-Seq quantification methods, Hayer et al. 2015 showed that methods differ even at the level of summary statistics (in Figure 7 from the paper, shown below, Pearson correlation was calculated using ground truth from a simulation):

These sort of of results are the norm in computational genomics. Finding a pair of software programs that produce identical results is about as likely as finding someone who has won the lottery… twice…. in one week. Well, it turns out there has been such a person, and here I describe the computational genomics analog of that unlikely event. Below are a pair of plots made using two different RNA-Seq quantification programs:

The two volcano plots show the log-fold change in abundance estimated for samples sequenced by Boj et al. 2015, plotted against p-values obtained with the program limma-voom. I repeat: the plots were made with quantifications from two different RNA-Seq programs. Details are described in the next section, but before reading it first try playing spot the difference.

The reveal

The top plot is reproduced from Supplementary Figure 6 in Beaulieu-Jones and Greene, 2017. The quantification program used in that paper was kallisto, an RNA-Seq quantification program based on pseudoalignment that was published in

The bottom plot was made using the quantification program Salmon, and is reproduced from a GitHub repository belonging to the lead author of

Patro et al. 2017 claim that “[Salmon] achieves the same order-of-magnitude benefits in speed as kallisto and Sailfish but with greater accuracy”, however after being unable to spot any differences myself in the volcano plots shown above, I decided, with mixed feelings of amusement and annoyance, to check for myself whether the similarity between the programs was some sort of fluke. Or maybe I’d overlooked something obvious, e.g. the fact that programs may tend to give more similar results at the gene level than at the transcript level. Thus began this blog post.

In the figure below, made by quantifying RNA-Seq sample ERR188140 with the latest versions of the two programs, each point is a transcript and its coordinates are the estimated counts produced by kallisto and salmon respectively.

Strikingly, the Pearson correlation coefficient is 0.9996026. However astute readers will recognize a possible sleight of hand on my part. The correlation may be inflated by similar results for the very abundant transcripts, and the plot hides thousands of points in the lower left-hand corner. RNA-Seq analyses are notorious for such plots that appear sounds but can be misleading. However in this case I’m not hiding anything. The Pearson correlation computed with $log(counts+1)$ is still extremely high (0.9955965) and the Spearman correlation, which gives equal balance to transcripts irrespective of the magnitude of their counts is 0.991206. My observation is confirmed in Table 3 of Sarkar et al. 2017 (note that in this table “quasi-mapping” corresponds to Salmon):

For context, the Spearman correlation between kallisto and a truly different RNA-Seq quantification program, RSEM, is 0.8944941. At this point I have to say… I’ve been doing computational biology for more than 20 years and I have never seen a situation where two ostensibly different programs output such similar results.

Patro and I are not alone in finding that Salmon $\simeq$ kallisto (if kallisto and Salmon gave identical results I would write that Salmon = kallisto but in lieu of the missing 0.004 in correlation I use the symbol $\, \simeq \,$ to denote the very very strong similarity). Examples in the literature abound, e.g. Supplementary Figure 5 from Majoros et al. 2017 (shown later in the post), Figure 1 from Everaert et al. 2017

or Figure 3A from Jin et al. 2017:

Just a few weeks ago, Sahraeian et al. 2017 published a comprehensive analysis of 39 RNA-Seq analysis tools and performed hierarchical clusterings of methods according to the similarity of their output. Here is one example (their Supplementary Figure 24a):

Amazingly, kallisto and Salmon-Quasi (the latest version of Salmon) are the two closest programs to each other in the entire comparison, producing output even more similar than the same program, e.g. Cufflinks or StringTie run with different alignments!

This raises the question of how, with kallisto published in May 2016 and Salmon $\simeq$ kallisto, Patro et al. 2017 was published in one of the most respected scientific publications that advertises first and foremost that it “is a forum for the publication of novel methods and significant improvements to tried-and-tested basic research techniques in the life sciences.” ?

How not to perform a differential expression analysis

The Patro et al. 2017 paper presents a number of comparisons between kallisto and Salmon in which Salmon appears to dramatically improve on the performance of kallisto. For example Figure 1c from Patro et al. 2017 is a table showing an enormous performance difference between kallisto and Salmon:

Figure 1c from Patro et al. 2017.

At a false discovery rate of 0.01, the authors claim that in a simulation study where ground truth is known Salmon identifies 4.5 times more truly differential transcripts than kallisto!

This can explain how Salmon was published, namely the reviewers and editor believed Patro et al.’s claims that Salmon significantly improves on previous work. In one analysis Patro et al. provide a p-value to help the “significance” stick. They write that “we found that Salmon’s distribution of mean absolute relative differences was significantly smaller (Mann-Whitney U test, P=0.00017) than those of kallisto. But how can the result Salmon >> kallisto, be reconciled with the fact that everybody repeatedly finds that Salmon $\simeq$ kallisto?

A closer look reveals three things:

1. In a differential expression analysis billed as “a typical downstream analysis” Patro et al. did not examine differential expression results for a typical biological experiment with a handful of replicates. Instead they examined a simulation of two conditions with eight replicates in each.
2. The large number of replicates allowed them to apply the log-ratio t-test directly to abundance estimates based on transcript per million (TPM) units, rather than estimated counts which are required for methods such as their own DESeq2.
3. The simulation involved generation of GC bias in an approach compatible with the inference model, with one batch of eight samples exhibiting “weak GC content dependence” while the other batch of eight exhibiting “more severe fragment-level GC bias.” Salmon was run in a GC bias correction mode.

These were unusual choices by Patro et al. What they did was allow Patro et al. to showcase the performance of their method in a way that leveraged the match between one of their inference models and the procedure for simulating the reads. The showcasing was enabled by having a confounding variable (bias) that exactly matches their condition variable, the use of TPM units to magnify the impact of that effect on their inference, simulation with a large number of replicates to enable the use of TPM,  which was possible because with many replicates one could directly apply the log t-test. This complex chain of dependencies is unraveled below:

There is a reason why log-fold changes are not directly tested in standard RNA-Seq differential expression analyses. Variance estimation is challenging with few replicates and RNA-Seq methods developers understood this early on. That is why all competitive methods for differential expression analysis such as DESeq/DESeq2, edgeR, limma-voom, Cuffdiff, BitSeq, sleuth, etc. regularize variance estimates (i.e., perform shrinkage) by sharing information across transcripts/genes of similar abundance. In a careful benchmarking of differential expression tools, Shurch et al. 2016 show that log-ratio t-test is the worst method. See, e.g., their Figure 2:

Figure 2 from Schurch et al. 2016. The four vertical panels show FPR and TPR for programs using 3,6,12 and 20 biological replicates (in yeast). Details are in the Schurch et al. 2016 paper.

The log-ratio t-test performs poorly not only when the number of replicates is small and regularization of variance estimates is essential. Schurch et al. specifically recommend DESeq2 (or edgeR) when up to 12 replicates are performed. In fact, the log-ratio t-test was so bad that it didn’t even make it into their Table 2 “summary of recommendations”.

The authors of Patro et al. 2017 are certainly well-aware of the poor performance of the log-ratio t-test. After all, one of them was specifically thanked in the Schurch et al. 2016 paper “for his assistance in identifying and correcting a bug”. Moreover, the recommended program by Schurch etal. (DESeq2) was authored by one of the coauthors on the Patro et al. paper, who regularly and publicly advocates for the use of his programs (and not the log-ratio t-test):

This recommendation has been codified in a detailed RNA-Seq tutorial where M. Love et al. write that “This [Salmon + tximport] is our current recommended pipeline for users”.

In Soneson and Delorenzi, 2013, the authors wrote that “there is no general consensus regarding which [RNA-Seq differential expression] method performs best in a given situation” and despite the development of many methods and benchmarks since this influential review, the question of how to perform differential expression analysis continues to be debated. While it’s true that “best practices” are difficult to agree on, one thing I hope everyone can agree on is that in a “typical downstream analysis” with a handful of replicates

do not perform differential expression with a log-ratio t-test.

Turning to Patro et al.‘s choice of units, it is important to note that the requirement of shrinkage for RNA-Seq differential analysis is the reason most differential expression tools require abundances measured in counts as input, and do not use length normalized units such as Transcripts Per Million (TPM). In TPM units the abundance $\rho_t$ for a transcript t is $\rho_t \propto \frac{c_t}{N \cdot l_t}$ where $c_t$ are the estimated counts for transcript t, $l_t$ is the (effective) length of t and N the number of total reads. Whereas counts are approximately Poisson distributed (albeit with some over-dispersion), variance estimates of abundances in TPM units depend on the lengths used in normalization and therefore cannot be used directly for regularization of variance estimation. Furthermore, the dependency of TPM on effective lengths means that abundances reported in TPM are very sensitive to the estimates of effective length.

This is why, when comparing the quantification accuracy of different programs, it is important to compare abundances using estimated counts. This was highlighted in Bray et al. 2016: “Estimated counts were used rather than transcripts per million (TPM) because the latter is based on both the assignment of ambiguous reads and the estimation of effective lengths of transcripts, so a program might be penalized for having a differing notion of effective length despite accurately assigning reads.” Yet Patro et al. perform no comparisons of programs in terms of estimated counts.

A typical analysis

The choices of Patro et al. in designing their benchmarks are demystified when one examines what would have happened had they compared Salmon to kallisto on typical data with standard downstream differential analysis tools such as their own tximport and DESeq2. I took the definition of “typical” from one of the Patro et al. coauthors’ own papers (Soneson et al. 2016): “Currently, one of the most common approaches is to define a set of non-overlapping targets (typically, genes) and use the number of reads overlapping a target as a measure of its abundance, or expression level.”

The Venn diagram below shows the differences in transcripts detected as differentially expressed when kallisto and Salmon are compared using the workflow the authors recommend publicly (quantifications -> tximport -> DESeq2) on a typical biological dataset with three replicates in each of two conditions. The number of overlapping genes is shown for a false discovery rate of 0.05 on RNA-Seq data from Trapnell et al. 2014:

A Venn diagram showing the overlap in genes predicted to be differential expressed by kallisto (blue) and Salmon (pink). Differential expression was performed with DESeq2 using transcript-level counts estimated by kallisto and Salmon and imported to DESeq2 with tximport. Salmon was run with GC bias correction.

This example provides Salmon the benefit of the doubt- the dataset was chosen to be older (when bias was more prevalent) and Salmon was not run in default mode but rather with GC bias correction turned on (option –gcBias).

When I saw these numbers for the first time I gasped. Of course I shouldn’t have been surprised; they are consistent with repeated published experiments in which comparisons of kallisto and Salmon have revealed near identical results. And while I think it’s valuable to publish confirmation of previous work, I did wonder whether Nature Methods would have accepted the Patro et al. paper had the authors conducted an actual “typical downstream analysis”.

Patro et al. utilized TPM based comparisons for all the results in their paper, presumably to highlight the improvements in accuracy resulting from better effective length estimates. Numerous results in the paper suggest that Salmon is much more accurate than kallisto. However I had seen a figure in Majoros et al. 2017 that examined the (cumulative) distribution of both kallisto and Salmon abundances in TPM units (their Supplementary Figure 5) in which the curves literally overlapped at almost all thresholds:

The plot above was made with Salmon v0.7.2 so in fairness to Patro et al. I remade it using the ERR188140 dataset mentioned above with Salmon v0.8.2:

The distribution of abundances (in TPM units) as estimated by kallisto (blue circles) and Salmon (red stars).

The blue circles correspond to kallisto and the red stars inside to Salmon. With the latest version of Salmon the similarity is even higher than what Majoros et al. observed! The Spearman correlation between kallisto and Salmon with TPM units is 0.9899896.

It’s interesting to examine what this means for a (truly) typical TPM analysis. One way that TPMs are used is to filter transcripts (or genes) by some threshold, typically TPM >  1 (in another deviation from “typical”, a key table in Patro et al. 2017 – Figure 1d – is made by thresholding with TPM > 0.1). The Venn diagram below shows the differences between the programs at the typical TPM > 1  threshold:

A Venn diagram showing the overlap in transcripts predicted by kallisto and Salmon to have estimated abundance > 1 TPM.

The figures above were made with Salmon 0.8.2 run in default mode. The correlation between kallisto and Salmon (in TPM) units decreases a tiny amount, from 0.9989224 to 0.9974325 with the –gcBias option and even the Spearman correlation decreases by only 0.011 from 0.9899896 to 0.9786092.

I think it’s perfectly fine for authors to present their work in the best light possible. What is not ok is to deliberately hide important and relevant truth, which in this case is that Salmon $\, \simeq \,$ kallisto.

A note on speed

One of the claims in Patro et al. 2017 is that “[the speed of Salmon] roughly matches the speed of the recently introduced kallisto.” The Salmon claim is based on a benchmark of an experiment (details unknown) with 600 million 75bp paired-end reads using 30 threads. Below are the results of a similar benchmark of Salmon showing time to process 19 samples from Boj et al. 2015 with variable numbers of threads:

First, Salmon with –gcBias is considerably slower than default Salmon. Furthermore, there is a rapid decrease in performance gain with increasing number of threads, something that should come as no surprise. It is well known that quantification can be I/O bound which means that at some point, extra threads don’t provide any gain as the disk starts grinding limiting access from the CPUs. So why did Patro et al. choose to benchmark runtime with 30 threads?

The figure below provides a possible answer:

In other words, not only is Salmon $\simeq$ kallisto in accuracy, but contrary to the claims in Patro et al. 2017, kallisto is faster. This result is confirmed in Table 1 of Sarkar et al. 2017 who find that Salmon is slower by roughly the same factor as seen above (in the table “quasi-mapping” is Salmon).

Having said that, the speed differences between kallisto and Salmon should not matter much in practice and large scale projects made possible with kallisto (e.g. Vivian et al. 2017) are possible with Salmon as well. Why then did the authors not report their running time benchmarks honestly?

The first common notion

The Patro et al. 2017 paper uses the term “quasi-mapping” to describe an algorithm, published in Srivastava et al. 2016, for obtaining their (what turned out to be near identical to kallisto) results. I have written previously how “quasi-mapping” is the same as pseudoalignment as an alignment concept, even though Srivastava et al. 2016 initially implemented pseudoalignment differently than the way we described it originally in our preprint in Bray et al. 2015. However the reviewers of Patro et al. 2017 may be forgiven for assuming that “quasi-mapping” is a technical advance over pseudoalignment. The Srivastava et al. paper is dense and filled with complex technical detail. Even for an expert in alignment/RNA-Seq it is not easy to see from a superficial reading of the paper that “quasi-mapping” is an equivalent concept to kallisto’s pseudoalignment (albeit implemented with suffix arrays instead of a de Bruijn graph). Nevertheless, the key to the paper is a simple sentence: “Specifically, the algorithm [RapMap, which is now used in Salmon] reports the intersection of transcripts appearing in all hits” in the section 2.1 of the paper. That’s the essence of pseudoalignment right there. The paper acknowledges as much, “This lightweight consensus mechanism is inspired by Kallisto ( Bray et al. , 2016 ), though certain differences exist”. Well, as shown above, those differences appear to have made no difference in standard practice, except insofar as the Salmon implementation of pseudoalignment being slower than the one in Bray et al. 2016.

Srivastava et al. 2016 and Patro et al. 2017 make a fuss about the fact that their “quasi-mappings” take into account the starting positions of reads in transcripts, thereby including more information than a “pure” pseudoalignment. This is a pedantic distinction Patro et al. are trying to create. Already in the kallisto preprint (May 11, 2015),  it was made clear that this information was trivially accessible via a reasonable approach to pseudoalignment: “Once the graph and contigs have been constructed, kallisto stores a hash table mapping each k-mer to the contig it is contained in, along with the position within the contig.”

In other words, Salmon is not producing near identical results to kallisto due to an unprecedented cosmic coincidence. The underlying method is the same. I leave it to the reader to apply Euclid’s first common notion:

Things which equal the same thing are also equal to each other.

Convergence

While Salmon is now producing almost identical output to kallisto and is based on the same principles and methods, this was not the case when the program was first released. The history of the Salmon program is accessible via the GitHub repository, which recorded changes to the code, and also via the bioRxiv preprint server where the authors published three versions of the Salmon preprint prior to its publication in Nature Methods.

The first preprint was published on the BioRxiv on June 27, 2015. It followed shortly on the heels of the kallisto preprint which was published on May 11, 2015. However the first Salmon preprint described a program very different from kallisto. Instead of pseudoalignment, Salmon relied on chaining SMEMs (super-maximal exact matches) between reads and transcripts to identifying what the authors called “approximately consistent co-linear chains” as proxies for alignments of reads to transcripts. The authors then compared Salmon to kallisto writing that “We also compare with the recently released method of Kallisto which employs an idea similar in some respects to (but significantly different than) our lightweight-alignment algorithm and again find that Salmon tends to produce more accurate estimates in general, and in particular is better able [to] estimate abundances for multi-isoform genes.” In other words, in 2015 Patro et al. claimed that Salmon was “better” than kallisto. If so, why did the authors of Salmon later change the underlying method of their program to pseudoalignment from SMEM alignment?

Inspired by temporal ordering analysis of expression data and single-cell pseudotime analysis, I ran all the versions of kallisto and Salmon on ERR188140, and performed PCA on the resulting transcript abundance table to be able to visualize the progression of the programs over time. The figure below shows all the points with the exception of three: Sailfish 0.6.3, kallisto 0.42.0 and Salmon 0.32.0. I removed Sailfish 0.6.3 because it is such an outlier that it caused all the remaining points to cluster together on one side of the plot (the figure is below in the next section). In fairness I also removed one Salmon point (version 0.32.0) because it differed substantially from version 0.4.0 that was released a few weeks after 0.32.0 and fixed some bugs. Similarly, I removed kallisto 0.42.0, the first release of kallisto which had some bugs that were fixed 6 days later in version 0.42.1.

Evidently kallisto output has changed little since May 12, 2015. Although some small bugs were fixed and features added, the quantifications have been very similar. The quantifications have been stable because the algorithm has been the same.

On the other hand the Salmon trajectory shows a steady convergence towards kallisto. The result everyone is finding, namely that currently Salmon $\simeq$ kallisto is revealed by the clustering of recent versions of Salmon near kallisto. However the first releases of Salmon are very different from kallisto. This is also clear from the heatmap/hierarchical clustering of  Sahraeian et al. in which Salmon-SMEM was included (Salmon used SMEMs until version 0.5.1, sometimes labeled fmd, until “quasi-mapping” became the default). A question: if Salmon ca. 2015 was truly better than kallisto then is Salmon ca. 2017 worse than Salmon ca. 2015?

Convergence of Salmon and Sailfish to kallisto over the course of a year. The x-axis labels the time different versions of each program were released. The y-axis is PC1 from a PCA of transcript abundances of the programs.

Prestamping

The bioRxiv preprint server provides a feature by which a preprint can be linked to its final form in a journal. This feature is useful to readers of the bioRxiv, as final published papers are generally improved after preprint reader, reviewer, and editor comments have been addressed. Journal linking is also a mechanism for authors to time stamp their published work using the bioRxiv. However I’m sure the bioRxiv founders did not intend the linking feature to be abused as a “prestamping” mechanism, i.e. a way for authors to ex post facto receive a priority date for a published paper that had very little, or nothing, in common with the original preprint.

A comparison of the June 2015 preprint mentioning the Salmon program and the current Patro et al. paper reveals almost nothing in common. The initial method changed drastically in tandem with an update to the preprint on October 3, 2015 at which point the Salmon program was using “quasi mapping”, later published in Srivastava et al. 2016. Last year I met with Carl Kingsford (co-corresponding author of Patro et al. 2017) to discuss my concern that Salmon was changing from a method distinct from that of kallisto (SMEMs of May 2015) to one that was replicating all the innovations in kallisto, without properly disclosing that it was essentially a clone. Yet despite a promise that he would raise my concerns with the Salmon team, I never received a response.

At this point, the Salmon core algorithms have changed completely, the software program has changed completely, and the benchmarking has changed completely. The Salmon project of 2015 and the Salmon project of 2017 are two very different projects although the name of the program is the same. While some features have remained, for example the Salmon mode that processes transcriptome alignments (similar to eXpress) was present in 2015, and the approach to likelihood maximization has persisted, considering the programs the same is to descend into Theseus’ paradox.

Interestingly, Patro specifically asked to have the Salmon preprint linked to the journal:

The linking of preprints to journal articles is a feature that arXiv does not automate, and perhaps wisely so. If bioRxiv is to continue to automatically link preprints to journals it needs to focus not only on eliminating false negatives but also false positives, so that journal linking cannot be abused by authors seeking to use the preprint server to prestamp their work after the fact.

The fish always win?

The Sailfish program was the precursor of Salmon, and was published in Patro et al. 2014. At the time, a few students and postdocs in my group read the paper and then discussed it in our weekly journal club. It advocated a philosophy of “lightweight algorithms, which make frugal use of data, respect constant factors and effectively use concurrent hardware by working with small units of data where possible”. Indeed, two themes emerged in the journal club discussion:

1. Sailfish was much faster than other methods by virtue of being simpler.

2. The simplicity was to replace approximate alignment of reads with exact alignment of k-mers. When reads are shredded into their constituent k-mer “mini-reads”, the difficult read -> reference alignment problem in the presence of errors becomes an exact matching problem efficiently solvable with a hash table.

Despite the claim in the Sailfish abstract that “Sailfish provides quantification time…without loss of accuracy” and Figure 1 from the paper showing Sailfish to be more accurate than RSEM, we felt that the shredding of reads must lead to reduced accuracy, and we quickly checked and found that to be the case; this was later noted by others, e.g. Hensman et al. 2015, Lee et al. 2015).

After reflecting on the Sailfish paper and results, Nicolas Bray had the key idea of abandoning alignments as a requirement for RNA-Seq quantification, developed pseudoalignment, and later created kallisto (with Harold Pimentel and Páll Melsted).

I mention this because after the publication of kallisto, Sailfish started changing along with Salmon, and is now frequently discussed in the context of kallisto and Salmon as an equal. Indeed, the PCA plot above shows that (in its current form, v0.10.0) Sailfish is also nearly identical to kallisto. This is because with the release of Sailfish 0.7.0 in September 2015, Patro et al. started changing the Sailfish approach to use pseudoalignment in parallel with the conversion of Salmon to use pseudoalignment. To clarify the changes in Sailfish, I made the PCA plot below which shows where the original version of Sailfish that coincided with the publication of Patro et al. 2014 (version 0.6.3 March 2014) lies relative to the more recent versions and to Salmon:

In other words, despite a series of confusing statements on the Sailfish GitHub page and an out-of-date description of the program on its homepage, Sailfish in its published form was substantially less accurate and slower than kallisto, and in its current form Sailfish is kallisto.

In retrospect, the results in Figure 1 of Patro et al. 2014 seem to be as problematic as the results in Figure 1 of Patro et al. 2017.  Apparently crafting computational experiments via biased simulations and benchmarks to paint a distorted picture of performance is a habit of Patro et al.

In the post I wrote that “The history of the Salmon program is accessible via the GitHub repository, which recorded changes to the code, and also via the bioRxiv preprint server where the authors published three versions of the Salmon preprint prior to its publication in Nature Methods” Here are the details of how these support the claims I make (tl;dr https://twitter.com/yarbsalocin/status/893886707564662784):

Sailfish (current version) and Salmon implemented kallisto’s pseudoalignment algorithm using suffix arrays

First, both Sailfish and Salmon use RapMap (via SACollector) and call mergeLeftRightHits():
Sailfish:
https://github.com/kingsfordgroup/sailfish/blob/352f9001a442549370eb39924b06fa3140666a9e/src/SailfishQuantify.cpp#L192
Salmon:
https://github.com/COMBINE-lab/salmon/commit/234cb13d67a9a1b995c86c8669d4cefc919fbc87#diff-594b6c23e3bdd02a14cc1b861c812b10R2205

The RapMap code for “quasi mapping” executes an algorithm identical to psuedoalignment, down to the detail of what happens to the k-mers in a single read:

First, hitCollector() calls getSAHits_():
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/include/SACollector.hpp#L249

Here kmers are used hashed to SAintervals (Suffix Array intervals), that are then extended to see how far ahead to jump. This is the one of two key ideas in the kallisto paper, namely that not all the k-mers in a read need to be examined to pseudoalign the read. It’s much more than that though, it’s the actual exact same algorithm to the level of exactly the k-mers that are examined. kallisto performs this “skipping” using contig jumping with a different data structure (the transcriptome de Bruijn graph) but aside from data structure used what happens is identical:

https://github.com/COMBINE-lab/RapMap/blob/c1e3132a2e136615edbb91348781cb71ba4c22bd/include/SACollector.hpp#L652
makes a call to jumping and the code to compute MMP (skipping) is
https://github.com/COMBINE-lab/RapMap/blob/c1e3132a2e136615edbb91348781cb71ba4c22bd/include/SASearcher.hpp#L77

There is a different detail in the Sailfish/Salmon code which is that when skipping forward the suffix array is checked for exact matching on the skipped sequence. kallisto does not have this requirement (although it could). On error-free data these will obviously be identical; on error prone data this may make Salmon/Sailfish a bit more conservative and kallisto a bit more robust to error. Also due to the structure of suffix arrays there is a possible difference in behavior when a transcript contains a repeated k-mer. These differences affect a tiny proportion of reads, as is evident from the result that kallisto and Salmon produce near identical results.

The second key idea in kallisto of intersecting equivalence classes for a read. This exact procedure is in:
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/include/SACollector.hpp#L363
which calls:
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/src/HitManager.cpp#L599

There was a choice we had to make in kallisto of how to handle information from paired end reads (does one require consistent pseudoalignment in both? Just one suffices to pseudoalign a read?)
The code for intersection between left and right reads making the identical choices as kallisto is:
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/include/RapMapUtils.hpp#L810

In other words, stepping through what happens to the k-mers in a read shows that Sailfish/Salmon copied the algorithms of kallisto and implemented it with the only difference being a different data structure used to hash the kmers. This is why, when I did my run of Salmon vs. kallisto that led to this blog post I found that
vs
salmon 69,701,169.
That’s a difference of 79,000 out of ~70 million = 0.1%.

1.  Until the kallisto program and preprint was published Salmon used SMEMs. Only after kallisto does Salmon change to using kmer cached suffix array intervals.
2. The kallisto preprint did not discuss outputting position as part of pseudoalignment because it was not central to the idea. It’s trivial to report pseudoalignment positions with either data structure and in fact both kallisto and Salmon do.

I want to make very clear here that I think there can be great value in implementing an algorithm with a different data structure. It’s a form of reproducibility that one can learn from: how to optimize, where performance gains can be made, etc. Unfortunately most funding agencies don’t give grants for projects whose goal is solely to reproduce someone else’s work. Neither do most journal publish papers that set out to do that. That’s too bad. If Patro et al. had presented their work honestly, and explained that they were implementing pseudoalignment with a different data structure to see if it’s better, I’d be a champion of their work. That’s not how they presented their work.

Salmon copied details in the quantification

The idea of using the EM algorithm for quantification with RNA-Seq goes back to Jiang and Wong, 2009, arguably even to Xing et al. 2006. I wrote up the details of the history in a review in 2011 that is on the arXiv. kallisto runs the EM algorithm on equivalence classes, an idea that originates with Nicolae et al. 2011 (or perhaps even Jiang and Wong 2009) but whose significance we understood from the Sailfish paper (Patro et al. 2014). Therefore the fact that Salmon (now) and kallisto both use the EM algorithm, in the same way, makes sense.

However Salmon did not use the EM algorithm before the kallisto preprint and program were published. It used an online variational Bayes algorithm instead. In the May 18, 2015 release of Salmon there is no mention of EM. Then, with the version 0.4 release date Salmon suddenly switches to the EM. In implementing the EM algorithm there are details that must be addressed, for example setting thresholds for when to terminate rounds of inference based on changes in the (log) likelihood (i.e. determine convergence).

For example, kallisto sets parameters
const double alpha_limit = 1e-7;
const double alpha_change_limit = 1e-2;
const double alpha_change = 1e-2;

in EMalgorithm.h
https://github.com/pachterlab/kallisto/blob/90db56ee8e37a703c368e22d08b692275126900e/src/EMAlgorithm.h
The link above shows that these kallisto parameters were set and have not changed since the release of kallisto
Also they were not always this way, see e.g. the version of April 6, 2015:
https://github.com/pachterlab/kallisto/blob/2651317188330f7199db7989b6a4dc472f5d1669/src/EMAlgorithm.h
This is because one of the things we did is explore the effects of these thresholds, and understand how setting them affects performance. This can be seen also in a legacy redundancy, we have both alpha_change and alpha_change_limit which ended up being unnecessary because they are equal in the program and used on one line.

The first versions of Salmon post-kallisto switched to the EM, but didn’t even terminate it the same way as kallisto, adopting instead a maximum iteration of 1,000. See
https://github.com/COMBINE-lab/salmon/blob/59bb9b2e45c76137abce15222509e74424629662/include/CollapsedEMOptimizer.hpp
from May 30, 2015.
This changed later first with the introduction of minAlpha (= kallisto’s alpha_limit)
https://github.com/COMBINE-lab/salmon/blob/56120af782a126c673e68c8880926f1e59cf1427/src/CollapsedEMOptimizer.cpp
and then alphaCheckCutoff (kallisto’s alpha_change_limit)
https://github.com/COMBINE-lab/salmon/blob/a3bfcf72e85ebf8b10053767b8b506280a814d9e/src/CollapsedEMOptimizer.cpp

Here are the salmon thresholds:
double minAlpha = 1e-8;
double alphaCheckCutoff = 1e-2;
double cutoff = minAlpha;

Notice that they are identical except that minAlpha = 1e-8 and not kallisto’s alpha_limit = 1e-7. However in kallisto, from the outset, the way that alpha_limit has been used is:
if (alpha_[ec] < alpha_limit/10.0) {
alpha_[ec] = 0.0;
}

In other words, alpha_limit in kallisto is really 1e-8, and has been all along.

The copying of all the details of our program have consequences for performance. In the sample I ran kallisto performed 1216 EM rounds of EM vs. 1214 EM rounds in Salmon.

Sailfish (current version) copied our sequence specific bias method

One of the things we did in kallisto is implement a sequence specific bias correction along the lines of what was done previously in Roberts et al. 2011, and later in Roberts et al. 2013. Implementing sequence specific bias correction in kallisto required working things out from scratch because of the way equivalence classes were being used with the EM algorithm, and not reads. I worked this out together with Páll Melsted during conversations that lasted about a month in the Spring of 2015. We implemented it in the code although did not release details of how it worked with the initial preprint because it was an option and not default, and we thought we might want to still change it before submitting the journal paper.

Here Rob is stating that Salmon can account for biases that kallisto cannot:
https://www.biostars.org/p/143458/#143639
This was a random forest bias correction method different from kallisto’s.

Shortly thereafter, here is the source code in Sailfish deprecating the Salmon bias correction and switching to kallisto’s method:
https://github.com/kingsfordgroup/sailfish/commit/377f6d65fe5201f7816213097e82df69e4786714#diff-fe8a1774cd7c858907112e6c9fda1e9dR76

https://github.com/kingsfordgroup/sailfish/commit/be0760edce11f95377088baabf72112f920874f9#diff-3e922f9589567fee3b20671da9493c82R34

https://github.com/kingsfordgroup/sailfish/commit/be0760edce11f95377088baabf72112f920874f9#diff-b14c09a136906d1c5d8534afa3a51c4cR818

This is the update to effective length in kallisto:
https://github.com/pachterlab/kallisto/blob/e5957cf96f029be4e899e5746edcf2f63e390609/src/weights.cpp#L184
Here is the Sailfish code:

Notice that there has been a literal copying down to the variable names:

The code written by the student of Rob was:

The code written by us is

efflen *= 0.5*biasAlphaNorm/biasDataNorm;

The code rewritten by Rob (editing that of the student):

effLength *= 0.5 * (txomeNormFactor / readNormFactor);

Note that since our bias correction method was not reported in our preprint, this had to have been copied directly from our codebase and was done so without any attribution.

I raised this specific issue with Carl Kingsford by email prior to our meeting in April 13 2016. We then discussed it in person. The conversation and email were prompted by a change to the Sailfish README on April 7, 2016 specifically accusing us of comparing kallisto to a “ **very old** version of Sailfish”:
https://github.com/kingsfordgroup/sailfish/commit/550cd19f7de0ea526f512a5266f77bfe07148266

What was stated is “The benchmarks in the kallisto paper *are* made against a very old version of Sailfish” not “were made against”. By the time that was written, it might well have been true. But kallisto was published in May 2015, it benchmarked with the Sailfish program described in Patro et al. 2014, and by 2016 Sailfish had changed completely implementing the pseudoalignment of kallisto.

Another aspect of an RNA-Seq quantification program is effective length estimation. There is an attribution to kallisto in the Sailfish code now explaining that this is from kallisto:
“Computes (and returns) new effective lengths for the transcripts based on the current abundance estimates (alphas) and the current effective lengths (effLensIn). This approach is based on the one taken in Kallisto
This is from January 23rd, 2016, almost 9 months after kallisto was released, and 3 months before the Sailfish README accused us of not testing the latest version of Sailfish in May 2015.

The attribution for effective lengths is also in the Salmon code, from 6 months later June 2016:
https://github.com/COMBINE-lab/salmon/blob/335c34b196205c6aebe4ddcc12c380eb47f5043a/include/DistributionUtils.hpp

There is also an acknowledgement in the Salmon code that a machine floating point tolerance we use
https://github.com/pachterlab/kallisto/blob/master/src/EMAlgorithm.h#L19
was copied.
The acknowledgment in Salmon is here
https://github.com/COMBINE-lab/salmon/blob/a3bfcf72e85ebf8b10053767b8b506280a814d9e/src/CollapsedEMOptimizer.cpp
This is the same file where the kallisto thresholds for the EM were copied to.

So after copying our entire method, our core algorithm, many of our ideas, specific parameters, and numerous features… really just about everything that goes into an RNA-Seq quantification project, there is an acknowledgment that our machine tolerance threshold was “intelligently chosen”.

The hierarchical classification of nature initiated by Carl Linnaeus today consists of eight major “ranks”, namely species, genus, family, order, class, phylum, kingdom and domain:

In the microbial world it makes sense to refine the standard taxonomy by subdividing species into strains. An important reason to do so is that bacterial taxonomy must reflect not only phylogeny but also pathogenicity, and small differences in genomes can translate to large pathogenic differences. This has implications for metagenomic analyses of microbial communities: for many biomedical applications it is desirable to characterize individuals strains.

Metagenomics has its roots in culture-independent retrieval and sequencing of 16S rRNA genes, and while variations in 16S can sometimes distinguish between strains, a single gene is not always sufficient. This limitation of 16S can be overcome with whole genome shotgun sequencing of microbial communities, an approach to metagenomics that became popular in the early 2000s and  that opened the door to higher resolution characterization of communities. In 2005 Kevin Chen and I wrote a review on the bioinformatics challenges that would have to be overcome to walk through the door. One of the things we did was to emphasize “problems and their connections to other areas of bioinformatics, such as… gene expression analysis”, and throughout the past decade I’ve always hoped for deeper connections to be established between metagenomics and gene expression bioinformatics. I’ve noticed interesting connections pop up from time to time (e.g. Paulson et al. 2013)  and have occasionally entertained the thought with my students and collaborators, especially as work in my group became more focused on RNA-Seq since the development of Cufflinks in 2008.

However connection modern transcriptome analysis methodology, specifically bioinformatics of RNA-Seq to metagenomics has been difficult to do until recently. One major reason is that until just a few years ago, there was no reference genome database for metagenomics analogous to the reference annotation databases available for use in transcriptomics. Another way to put this is that metagenomics has, until recently, been “de novo” bioinformatics. By this I mean that the analysis of communities from whole genome shotgun data had to largely proceed via de novo analyses of the data (e.g. de novo assembly of genomes), “binning” of reads according to sequence characteristics or hits to gene databases was required because it was impossible to compare sequences to references genomes. While de novo methods have also been developed for RNA-Seq, the scale of transcriptome analysis is much smaller than that of most metagenomic analyses, and as has been well documented, de novo transcriptomics is already very difficult (e.g. Amin et al. 2014).

The de novo state of metagenomics has changed in recent years, as (relatively) low-cost sequencing has been a boon for microbial genomics. The graph below, extracted from NCBI and published in a recent review, shows that in just the past few years thousands of bacterial genomes have been sequences, enabling, for the first time, reference based metagenomics:

This observation is reflected in the recent development of many methods for a variety of metagenomic applications that take advantage of reference genome databases.  Specifically, the problem of read assignment, which is fundamental for abundance estimation, has benefited from the possibility of metagenomic read alignment to reference databases.

The figure below, reproduced from the preprint “An evaluation of the accuracy and speed of metagenome analysis tools” by Stinus Lindgreen, Karen L. Adair and Paul Gardner, bioRxiv May 15, 2015 shows a benchmark of the accuracy and runtime of 14 programs developed for metagenomic read assignment for whole genome shotgun data:

The problem these methods are solving is really similar to the problem of read assignment in RNA-Seq. In RNA-Seq, instead of originating from strains, reads originate from transcripts. Just as strains are present in different abundances in a community, so are RNA transcripts in a cell (or in bulk). The analogy of taxonomy in metagenomics, i.e. the grouping of strains into species, genus etc. is also present in RNA-Seq, where transcripts are grouped into genes. The fragment (or read) assignment problem in RNA-Seq is closely related to the quantification problem in RNA-Seq and is a problem that has been thoroughly researched and for which many algorithms have been developed. I discussed the importance of the fragment assignment problem for RNA-Seq in my 2013 Genome Informatics Keynote.

In response to the development of reference-based bioinformatics possibilities for metagenomics, about three years ago my student Lorian Schaeffer started looking at the suitability of RNA-Seq tools for metagenomic read assignment. Although the metagenomic and RNA-Seq assignment problems are conceptually similar and methodologically related, there are various technical issues involved in applying RNA-Seq tools in the metagenomic setting (e.g. the need to carefully account for taxonomy in the metagenomics setting). After developing the computational infrastructure to benchmark RNA-Seq programs in the metagenomic setting, she proceeded to evaluate the accuracy of eXpress, a streaming algorithm for RNA-Seq quantification. Although the quantification of eXpress was specifically designed to be suitable for large numbers of reads, the program requires read alignments to a reference transcriptome (or in Lorian’s experiments a genome) database. In the metagenomic setting realistic databases are huge, and she found that it took days just to map the reads. Nevertheless, her initial benchmarks revealed that eXpress was significantly more accurate than the available metagenomic read assignment tools of the time.

When Kraken (Wood and Salzberg 2014), and later CLARK (Ounit et al. 2015) were published in 2014 and 2015 respectively, we took note because by circumventing the alignment step they dramatically altered the tractability of metagenomic read assignment. In parallel, in my group, Nicolas Bray and later Páll Melsted and Harold Pimentel were developing what is now kallisto (Bray et al. 2015). Like Kraken, kallisto avoided the need for aligning reads, but with the introduction of the concept of pseudoalignment, allowed for accurate read assignments based on joint analysis of exact k-mer matches. What we showed earlier this year is that unlike naïve k-mer based approaches to quantification, kallisto is as accurate as eXpress and other read alignment based quantification tools, and this observation led Lorian to immediately proceed to benchmark it on metagenomic data. The result of her work was just posted as a preprint:

Lorian Schaeffer, Harold Pimentel, Nicolas Bray, Páll Melsted and Lior Pachter, Pseudoalignment for metagenomic read assignment, arXiv 1510.07371, 2015.

With this paper we demonstrate a “technology transfer” from RNA-Seq bioinformatics to metagenomics, one that achieves dramatic improvements in read assignment accuracy in the metagenomics setting. The main result of her work is Table 1 in our preprint:

Using a published simulated Illumina dataset from Mende et al. 2012 (based on 100 genomes and containing 53.33 million reads), and augmenting it with another 2,308 genomes for the purpose of testing, she shows that kallisto significantly outperforms the best quantification methods (as benchmarked by Lindgreen et al., see figure above). “Significant” here refers to what I think is fair to characterize as an extraordinary improvement: at the genus level, a level that programs such as CLARK have been optimized for, kallisto’s RRMSE (relative root mean squared error)  is 0.13 compared to 17.05 for Kraken and 18.58 for CLARK. The improvement is based on two ideas: first, the results show that the model based approach for read assignment, the concept that underlies GASiC and eXpress, outperforms direct taxonomic read assignment as implemented by MEGAN and Kraken and CLARK (in the latter approach reads are aligned to the lowest rank to which they align unambigously). Second, pseudoalignment is not just faster than traditional alignment but also accurate.

The upshot: the accuracy and efficiency of kallisto make strain level analysis of metagenomes possible. In fact kallisto is more accurate at the strain level than other programs are at the genus level. Just as we have been advocating for transcript level analysis from RNA-Seq data, we believe that strain level analysis should become commonplace in metagenomics.

In digging deeply into the bioinformatics of metagenomics bioinformatics we noticed a few other areas that could benefit from RNA-Seq technology transfer. For example, the standard of RNA-Seq methods benchmarking appears to be higher than in metagenomics. Both the Kraken and CLARK papers benchmarked their programs on simulations with 10 genomes (the number ten is not a typo). CLARK did test on one dataset with 20 genomes, although using only 10,000 reads. To be fair to the authors of those papers, their standards were much higher than others in the field. The paper

Yu-Wei Wu and Yuzhen Ye, A novel abundance-based algorithm for binning metagenomic sequences using l-tuples, Journal of Computational Biology 2011.

benchmarked their method on simulations of reads from 2 (two!!) organisms. Biologists frequently complain that simulations of bioinformaticians are completely non-informative and unfortunately these cases provide fodder for such prejudice. Having said that, the RNA-Seq community also has much to learn from the metagenomics community. The previously mentioned paper by Paulson et al. 2013 addresses missing data in a way that should translate directly to missing data in single-cell RNA-Seq (the paper also makes performance comparisons with their comparative metagenomics approach to the RNA-Seq programs DESeq and edgeR) . One paper (McDavid et al. 2012) does take a look at modeling single-cell data with zero inflated distributions but I think this is a good example where metagenomics is ahead of RNA-Seq. Our immediate plans are to develop the kallisto application to metagenomics to include the ability to perform metagenome comparisons using sleuth. Conversely, inspired by the taxonomy hierarchy fundamental to metagenomics we’re going to explore RNA-Seq quantification with groups of transcripts that go beyond just genes.

Horizontal transfer is good.

[Update July 15, 2016: A preprint describing sleuth is available on BioRxiv]

Today my student Harold Pimentel released the beta version of his new RNA-Seq analysis method and software program called sleuth. A sleuth for RNA-Seq begins with the quantification of samples with kallisto, and together a sleuth of kallistos can be used to analyze RNA-Seq data rigorously and rapidly.

Why do we need another RNA-Seq program?

A major challenge in transcriptome analysis is to determine the transcripts that have changed in their abundance across conditions.  This challenge is not entirely trivial because the stochasticity in transcription both within and between cells (biological variation), and the randomness in the experiment (RNA-Seq) that is used to determine transcript abundances (technical variation), make it difficult to determine what constitutes “significant” change.

Technical variation can be assessed by performing technical replicates of experiments. In the case of RNA-Seq, this can be done by repeatedly sequencing from one cDNA library. Such replicates are fundamentally distinct from biological replicates designed to assess biological variation. Biological replicates are performed by sequencing different cDNA libraries that have been constructed from repeated biological experiments performed under the same (or in practice near-same) conditions. Because biological replicates require sequencing different cDNA libraries, a key point is that biological replicates include technical variation.

In the early days of RNA-Seq a few papers (e.g. Marioni et al. 2008, Bullard et al. 2010) described analyses of technical replicates and concluded that they were not really needed in practice, because technical variation could be predicted statistically from the properties of the Poisson distribution. The point is that in an idealized RNA-Seq experiment counts of reads are multinomial (according to abundances of the transcripts they originate from), and therefore approximately Poisson distributed. Their variance is therefore approximately equal to the mean, so that it is possible to predict the variance in counts across technical replicates based on the abundance of the transcripts they originate from. There is, however, one important subtlety here: “counts of reads” as used above refers to the number of reads originating from a transcript, but in many cases, especially in higher eukaryotes, reads are frequently ambiguous as to their transcript of origin because of the presence of multi-isoform genes and genes families. In other words, transcript counts cannot be precisely measured. However, the statement about the Poisson distribution of counts in technical replicates remain true when considering counts of reads by genomic features because then reads are no longer ambiguous.

This is why, in so-called “count-based methods” for RNA-Seq analysis, there is an analysis only at the gene level. Programs such as DESeq/DESeq2, edgeR and a literally dozens of other count-based methods first require counting reads across genome features using tools such as HTSeq or featureCounts. By utilizing read counts to genomic features, technical replicates are unnecessary in lieu of the statistical assumption that they would reveal Poisson distributed data, and instead the methods focus on modeling biological variation. The issue of how to model biological variation is non-trivial because typically very few biological replicates are performed in experiments. Thus, there is a need for pooling information across genes to obtain reliable variance estimates via a statistical process called shrinkage. How and what to shrink is a matter of extensive debate among statisticians engaged in the development of count-based RNA-Seq methods, but one theme that has emerged is that shrinkage approaches can be compatible with general and generalized linear models, thus allowing for the analysis of complex experimental designs.

Despite these accomplishments,  count-based methods for RNA-Seq have two major (related) drawbacks: first, the use of counts to gene features prevents inference about the transcription of isoforms, and therefore with most count-based methods it is impossible to identify splicing switches and other isoform changes between conditions. Some methods have tried to address this issue by restricting genomic features to specific exons or splice junctions (e.g. DEXSeq) but this requires throwing out a lot of data, thereby reducing power for identifying statistically significant differences between conditions. Second, because of the fact that in general $\frac{a}{b} + \frac{c}{d} \neq \frac{a+b}{c+d}$ it is mathematically incorrect to estimate gene abundances by adding up counts to their genomic region. One consequence of this, is that it is not possible to accurately measure fold change between conditions by using counts to gene features. In other words, count-based methods are problematic even at the gene-level and it is necessary to estimate transcript-level counts.

While reads might be ambiguous as to exactly which transcripts they originated from, it is possible to statistically infer an estimate of the number of reads from each transcript in an experiment. This kind of quantification has its origin in papers of Jiang and Wong, 2009 and Trapnell et al. 2010. However the process of estimating transcript-level counts introduces technical variation. That is to say, if multiple technical replicates were performed on a cDNA library and then transcript-level counts were to be inferred, those inferred counts would no longer be Poisson distributed. Thus, there appears to be a need for performing technical replicates after all. Furthermore, it becomes unclear how to work within the shrinkage frameworks of count-based methods.

There have been a handful of attempts to develop methods that combine the uncertainty of count estimates at the transcript level with biological variation in the assessment of statistically significant changes in transcript abundances between conditions. For example, the Cuffdiff2 method generalizes DESeq while the bitSeq method relies on a Bayesian framework to simultaneously quantify abundances at the transcript level while modeling biological variability. Despite showing improved performance over count-based methods, they also have significant shortcomings. For example the methods are not as flexible as those of general(ized) linear models, and bitSeq is slow partly because it requires MCMC sampling.

Thus, despite intensive research on both statistical and computational methods for RNA-Seq over the past years, there has been no solution for analysis of experiments that allows biologists to take full advantage of the power and resolution of RNA-Seq.

The sleuth model

The main contribution of sleuth is an intuitive yet powerful model for RNA-Seq that bridges the gap between count-based methods and quantification algorithms in a way that fully exploits the advantages of both.

$Y_t = X_t\beta_t + \epsilon_t$.

Here the subscript t refers to a specific transcript, $Y_t$ is a vector describing transcript abundances (of length equal to the number of samples), $X_t$ is a design matrix (of size number of samples x number of confounders), $\beta_t$ is a parameter vector (of size the number of confounders) and $\epsilon_t$ is a noise vector (of size the number of samples). In this model the abundances $Y_t$ are normally distributed. For the purposes of RNA-Seq data, the $Y_t$ may be assumed to be the logarithm of the counts (or normalized counts per million) from a transcript, and indeed this is the approach taken in a number of approaches to RNA-Seq modeling, e.g. in limma-voom. A common alternative to the general linear model is the generalized linear model, which postulates that some function of $Y_t$ has a distribution with mean equal to $g^{-1}(X_t \beta_t)$ where g is a link function, such as log, thereby allowing for distributions other than the normal to be used for the observed data. In the RNA-Seq context, where the negative binomial distribution may make sense because it is frequently a good distribution for modeling count data, the generalized model is sometimes preferred to the standard general model (e.g. by DESeq2). There is much debate about which approach is “better”.

In the sleuth model the $Y_t$ in the general linear model are modeled as unobserved. They can be thought of us the unobserved logarithms of true counts for each transcript across samples and are assumed to be normally distributed. The observed data $D_t$ is the logarithm of estimated counts for each transcript across samples, and is modeled as

$D_t = Y_t + \zeta_t$

where the $\zeta_t$ vector parameterizes a perturbation to the unobserved $Y_t$. This can be understood as the technical noise due to the random sequencing of fragments from a cDNA library and the uncertainty introduced in estimating transcript counts.

The sleuth model incorporates the assumptions that the response error is additive, i.e. if  the variance of transcript in sample is $V(D_{t,i})$ then $V(D_{t,i}) = \sigma^2_t + \tau^2_t$ where the variance $V(\epsilon_{t,i}|y_{t,i}) = \sigma^2_t$ and the variance $V(\zeta_{t,i}|y_{t,i}) = \tau^2_t$. Intuitively, sleuth teases apart the two sources of variance by examining both technical and biological replicates, and in doing so directly estimates “true” biological variance, i.e. the variance in biological replicates that is not technical.  In lieu of actual technical replicates, sleuth takes advantage of the bootstraps of kallisto which serve as accurate proxies.

In a test of sleuth on data simulated according to the DESeq2 model we found that sleuth significantly outperforms other methods:

In this simulation transcript counts were simulated according to a negative binomial distribution, following closely the protocol of the DESeq2 paper simulations. Reference parameters for the simulation were first estimated by running DESeq2 on a the female Finnish population from the GEUVADIS dataset (59 individuals). In the simulation above size factors were set to be equal in accordance with typical experiments being performed, but we also tested sleuth with size factors drawn at random with geometric mean of 1 in accordance with the DESeq2 protocol (yielding factors of 1, 0.33, 3, 3, 0.33 and 1) and sleuth still outperformed other methods.

There are many details in the implementation of sleuth that are crucial to its performance, e.g. the approach to shrinkage to estimate the biological variance $\sigma^2_t$. A forthcoming preprint, together with Nicolas Bray and Páll Melsted that also contributed to the project along with myself, will provide the details.

Exploratory data analysis with sleuth

One of the design goals of sleuth was to create a simple and efficient workflow in line with the principles of kallisto. Working with the Shiny web application framework we have designed an html interface that allows users to interact with sleuth plots allowing for real time exploratory data analysis.

The sleuth Shiny interface is much more than just a GUI for making plots of kallisto processed data. First, it allows for the exploration of the sleuth fitted models; users can explore the technical variation of each transcript, see where statistically significant differential transcripts appear in relationship to others in terms of abundance and variance and much more. Particularly useful are interactive features in the plots. For example, when examining an MA plot, users can highlight a region of points (dynamically created box in upper panel) and see their variance breakdown of the transcripts the points correspond to, and the list of the transcripts in a table below:

The web interface contains diagnostics, summaries of the data, “maps” showing low-dimensional representations of the data and tools for analysis of differential transcripts. The interactivity via Shiny can be especially useful for diagnostics; for example, in the diagnostics users can examine scatterplots of any two samples, and then select outliers to examine their variance, including the breakdown of technical variance. This allows for a determination of whether outliers represent high variance transcripts, or specific samples gone awry. Users can of course make figures showing transcript abundances in all samples, including boxplots displaying the extent of technical variation. Interested in the differential transcribed isoform ENST00000349155 of the TBX3 gene shown in Figure 5d of the Cuffdiff2 paper? It’s trivial to examine using the transcript viewer:

One can immediately see visually that differences between conditions completely dominate both the technical and biological variation within conditions. The sleuth q-value for this transcript is 3*10^(-68).

Among the maps, users can examine PCA projections onto any pair of components, allowing for rapid exploration of the structure of the data. Thus, with kallisto and sleuth raw RNA-Seq reads can be converted into a complete analysis in a matter of minutes. Experts will be able to generate plots and analyses in R using the sleuth library as they would with any R package. We plan numerous improvements and developments to the sleuth interface in the near future that will further facilitate data exploration; in the meantime we welcome feedback from users.

How to try out sleuth

Since sleuth requires the bootstraps and quantifications output by kallisto we recommend starting by running kallisto on your samples. The kallisto program is very fast, processing 30 million reads on a laptop in a matter of minutes. You will have to run kallisto with bootstraps- we have been using 100 bootstraps per sample but it should be possible to work with many fewer. We have yet to fully investigate the minimum number of bootstraps required for sleuth to be accurate.

To learn how to use kallisto start here. If you have already run kallisto you can proceed to the tutorial for sleuth. If you’re really eager to see sleuth without first learning kallisto, you can skip ahead and try it out using pre-computed kallisto runs of the Cuffdiff2 data- the tutorial explains where to obtain the data for trying out sleuth.

For questions, suggestions or help see the program websites and also the kallisto-sleuth user group. We hope you enjoy the tools!

Today I posted the preprint N. Bray, H. Pimentel, P. Melsted and L. Pachter, Near-optimal RNA-Seq quantification with kallisto to the arXiv. It describes the RNA-Seq quantification program kallisto. [Update April 5, 2016: a revised version of the preprint has been published: Nicolas L. Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-optimal probabilistic RNA-Seq quantification, Nature Biotechnology (2016), doi:10.1038/nbt.3519 published online April 4, 2016.]

The project began in August 2013 when I wrote my second blog post, about another arXiv preprint describing a program for RNA-Seq quantification called Sailfish (now a published paper). At the time, a few students and postdocs in my group read the paper and then discussed it in our weekly journal club. It advocated a philosophy of “lightweight algorithms, which make frugal use of data, respect constant factors and effectively use concurrent hardware by working with small units of data where possible”. Indeed, two themes emerged in the journal club discussion:

1. Sailfish was much faster than other methods by virtue of being simpler.

2. The simplicity was to replace approximate alignment of reads with exact alignment of k-mers. When reads are shredded into their constituent k-mer “mini-reads”, the difficult read -> reference alignment problem in the presence of errors becomes an exact matching problem efficiently solvable with a hash table.

We felt that the shredding of reads must lead to reduced accuracy, and we quickly checked and found that to be the case. In fact, in our simulations, we saw that Sailfish significantly underperformed methods such as RSEM. However the fact that simpler was so much faster led us to wonder whether the prevailing wisdom of seeking to improve RNA-Seq analysis by looking at increasingly complex models was ill-founded. Perhaps simpler could be not only fast, but also accurate, or at least close enough to best-in-class for practical purposes.

After thinking about the problem carefully, my (now former) student Nicolas Bray realized that the key is to abandon the idea that alignments are necessary for RNA-Seq quantification. Even Sailfish makes use of alignments (of k-mers rather than reads, but alignments nonetheless). In fact, thinking about all the tools available, Nick realized that every RNA-Seq analysis program was being developed in the context of a “pipeline” of first aligning reads or parts of them to a reference genome or transcriptome. Nick had the insight to ask: what can be gained if we let go of that paradigm?

By April 2014 we had formalized the notion of “pseudoalignment” and Nick had written, in Python, a prototype of a pseudoaligner. He called the program kallisto. The basic idea was to determine, for each read, not where in each transcript it aligns, but rather which transcripts it is compatible with. That is asking for a lot less, and as it turns out, pseudoalignment can be much faster than alignment. At the same time, the information in pseudoalignments is enough to quantify abundances using a simple model for RNA-Seq, a point made in the isoEM paper, and an idea that Sailfish made use of as well.

Just how fast is pseudoalignment? In January of this year Páll Melsted from the University of Iceland came to visit my group for a semester sabbatical. Páll had experience in exactly the kinds of computer science we needed to optimize kallisto; he has written about efficient k-mer counting using the bloom filter and de Bruijn graph construction. He translated the Python kallisto to C++, incorporating numerous clever optimizations and a few new ideas along the way. His work was done in collaboration with my student Harold Pimentel, Nick (now a postdoc with Jacob Corn and Jennifer Doudna at the Innovative Genomics Initiative) and myself.

The screenshot below shows kallisto being used on my 2012 iMac desktop to build an index of the human transcriptome (5 min 8 sec), and then quantify 78.6 million GEUVADIS human RNA-Seq reads (14 min). When we first saw these results we thought they were simply too good to be true. Let me repeat: The quantification of 78.6 million reads takes 14 minutes on a standard desktop using a single CPU core. In some tests we’ve gotten even faster running times, up to 15 million reads quantified per minute.

The results in our paper indicate that kallisto is not just fast, but also very accurate. This is not surprising: underlying RNA-Seq analysis are the alignments, and although kallisto is pseudoaligning instead, it is almost always only the compatibility information that is used in actual applications. As we show in our paper, from the point of view of compatibility, the pseudoalignments and alignments are almost the same.

Although accuracy is a primary concern with analysis, we realized in the course of working on kallisto that speed is also paramount, and not just as a  matter of convenience. The speed of kallisto has three major implications:

1. It allows for efficient bootstrapping. All that is required for the bootstrap are reruns of the EM algorithm, and those are particularly fast within kallisto. The result is that we can accurately estimate the uncertainty in abundance estimates. One of my favorite figures from our paper, made by Harold, is this one:

It is based on an analysis of 40 samples of 30 million reads subsampled from 275 million rat RNA-Seq reads. Each dot corresponds to a transcript and is colored by its abundance. The x-axis shows the variance estimated from kallisto bootstraps on a single subsample while the y-axis shows the variance computed from the different subsamples of the data. We see that the bootstrap recapitulates the empirical variance. This result is non-trivial: the standard dogma, that the technical variance in RNA-Seq is “Poisson” (i.e. proportional to the mean) is false, as shown in Supplementary Figure 3 of our paper (the correlation becomes 0.64). Thus, the bootstrap will be invaluable when incorporated in downstream application and we are already working on some ideas.

2. It is not just the kallisto quantification that is fast; the index building, and even compilation of the program are also easy and quick. The implication for biologists is that RNA-Seq analysis now becomes interactive. Instead of “freezing” an analysis that might take weeks or even months, data can be explored dynamically, e.g. easily quantified against different transcriptomes, or re-quantified as transcriptomes are updated. The ability to analyze data locally instead of requiring cloud computation means that analysis is portable, and also easily secure.

3. We have found the fast turnaround of analysis helpful in improving the program itself. With kallisto we can quickly check the effect of changes in the algorithms. This allows for much faster debugging of problems, and also better optimization. It also allows us to release improvements knowing that users will be able to test them without resorting to a major computation that might take months. For this reason we’re not afraid to say that some improvements to kallisto will be coming soon.

As someone who has worked on RNA-Seq since the time of 32bp reads, I have to say that kallisto has personally been extremely liberating. It offers freedom from the bioinformatics core facility, freedom from the cloud, freedom from the multi-core server, and in my case freedom from my graduate students– for the first time in years I’m analyzing tons of data on my own; because of the simplicity and speed I find I have the time for it. Enjoy!