You are currently browsing the tag archive for the ‘pseudoalignment’ tag.

[September 2, 2017: A response to this post has been posted by the authors of Patro et al. 2017, and I have replied to them with a rebuttal]

Spot the difference

One of the maxims of computational biology is that “no two programs ever give the same result.” This is perhaps not so surprising; after all, most journals seek papers that report a significant improvement to an existing method. As a result, when developing new methods, computational biologists ensure that the results of their tools are different, specifically better (by some metric), than those of previous methods. The maxim certainly holds for RNA-Seq tools. For example, the large symmetric differences displayed in the Venn diagram below (from Zhang et al. 2014) are typical for differential expression tool benchmarks:

In a comparison of RNA-Seq quantification methods, Hayer et al. 2015 showed that methods differ even at the level of summary statistics (in Figure 7 from the paper, shown below, Pearson correlation was calculated using ground truth from a simulation):

These sort of of results are the norm in computational genomics. Finding a pair of software programs that produce identical results is about as likely as finding someone who has won the lottery… twice…. in one week. Well, it turns out there has been such a person, and here I describe the computational genomics analog of that unlikely event. Below are a pair of plots made using two different RNA-Seq quantification programs:

The two volcano plots show the log-fold change in abundance estimated for samples sequenced by Boj et al. 2015, plotted against p-values obtained with the program limma-voom. I repeat: the plots were made with quantifications from two different RNA-Seq programs. Details are described in the next section, but before reading it first try playing spot the difference.

The reveal

The top plot is reproduced from Supplementary Figure 6 in Beaulieu-Jones and Greene, 2017. The quantification program used in that paper was kallisto, an RNA-Seq quantification program based on pseudoalignment that was published in

The bottom plot was made using the quantification program Salmon, and is reproduced from a GitHub repository belonging to the lead author of

Patro et al. 2017 claim that “[Salmon] achieves the same order-of-magnitude benefits in speed as kallisto and Sailfish but with greater accuracy”, however after being unable to spot any differences myself in the volcano plots shown above, I decided, with mixed feelings of amusement and annoyance, to check for myself whether the similarity between the programs was some sort of fluke. Or maybe I’d overlooked something obvious, e.g. the fact that programs may tend to give more similar results at the gene level than at the transcript level. Thus began this blog post.

In the figure below, made by quantifying RNA-Seq sample ERR188140 with the latest versions of the two programs, each point is a transcript and its coordinates are the estimated counts produced by kallisto and salmon respectively.

Strikingly, the Pearson correlation coefficient is 0.9996026. However astute readers will recognize a possible sleight of hand on my part. The correlation may be inflated by similar results for the very abundant transcripts, and the plot hides thousands of points in the lower left-hand corner. RNA-Seq analyses are notorious for such plots that appear sounds but can be misleading. However in this case I’m not hiding anything. The Pearson correlation computed with $log(counts+1)$ is still extremely high (0.9955965) and the Spearman correlation, which gives equal balance to transcripts irrespective of the magnitude of their counts is 0.991206. My observation is confirmed in Table 3 of Sarkar et al. 2017 (note that in this table “quasi-mapping” corresponds to Salmon):

For context, the Spearman correlation between kallisto and a truly different RNA-Seq quantification program, RSEM, is 0.8944941. At this point I have to say… I’ve been doing computational biology for more than 20 years and I have never seen a situation where two ostensibly different programs output such similar results.

Patro and I are not alone in finding that Salmon $\simeq$ kallisto (if kallisto and Salmon gave identical results I would write that Salmon = kallisto but in lieu of the missing 0.004 in correlation I use the symbol $\, \simeq \,$ to denote the very very strong similarity). Examples in the literature abound, e.g. Supplementary Figure 5 from Majoros et al. 2017 (shown later in the post), Figure 1 from Everaert et al. 2017

or Figure 3A from Jin et al. 2017:

Just a few weeks ago, Sahraeian et al. 2017 published a comprehensive analysis of 39 RNA-Seq analysis tools and performed hierarchical clusterings of methods according to the similarity of their output. Here is one example (their Supplementary Figure 24a):

Amazingly, kallisto and Salmon-Quasi (the latest version of Salmon) are the two closest programs to each other in the entire comparison, producing output even more similar than the same program, e.g. Cufflinks or StringTie run with different alignments!

This raises the question of how, with kallisto published in May 2016 and Salmon $\simeq$ kallisto, Patro et al. 2017 was published in one of the most respected scientific publications that advertises first and foremost that it “is a forum for the publication of novel methods and significant improvements to tried-and-tested basic research techniques in the life sciences.” ?

How not to perform a differential expression analysis

The Patro et al. 2017 paper presents a number of comparisons between kallisto and Salmon in which Salmon appears to dramatically improve on the performance of kallisto. For example Figure 1c from Patro et al. 2017 is a table showing an enormous performance difference between kallisto and Salmon:

Figure 1c from Patro et al. 2017.

At a false discovery rate of 0.01, the authors claim that in a simulation study where ground truth is known Salmon identifies 4.5 times more truly differential transcripts than kallisto!

This can explain how Salmon was published, namely the reviewers and editor believed Patro et al.’s claims that Salmon significantly improves on previous work. In one analysis Patro et al. provide a p-value to help the “significance” stick. They write that “we found that Salmon’s distribution of mean absolute relative differences was significantly smaller (Mann-Whitney U test, P=0.00017) than those of kallisto. But how can the result Salmon >> kallisto, be reconciled with the fact that everybody repeatedly finds that Salmon $\simeq$ kallisto?

A closer look reveals three things:

1. In a differential expression analysis billed as “a typical downstream analysis” Patro et al. did not examine differential expression results for a typical biological experiment with a handful of replicates. Instead they examined a simulation of two conditions with eight replicates in each.
2. The large number of replicates allowed them to apply the log-ratio t-test directly to abundance estimates based on transcript per million (TPM) units, rather than estimated counts which are required for methods such as their own DESeq2.
3. The simulation involved generation of GC bias in an approach compatible with the inference model, with one batch of eight samples exhibiting “weak GC content dependence” while the other batch of eight exhibiting “more severe fragment-level GC bias.” Salmon was run in a GC bias correction mode.

These were unusual choices by Patro et al. What they did was allow Patro et al. to showcase the performance of their method in a way that leveraged the match between one of their inference models and the procedure for simulating the reads. The showcasing was enabled by having a confounding variable (bias) that exactly matches their condition variable, the use of TPM units to magnify the impact of that effect on their inference, simulation with a large number of replicates to enable the use of TPM,  which was possible because with many replicates one could directly apply the log t-test. This complex chain of dependencies is unraveled below:

There is a reason why log-fold changes are not directly tested in standard RNA-Seq differential expression analyses. Variance estimation is challenging with few replicates and RNA-Seq methods developers understood this early on. That is why all competitive methods for differential expression analysis such as DESeq/DESeq2, edgeR, limma-voom, Cuffdiff, BitSeq, sleuth, etc. regularize variance estimates (i.e., perform shrinkage) by sharing information across transcripts/genes of similar abundance. In a careful benchmarking of differential expression tools, Shurch et al. 2016 show that log-ratio t-test is the worst method. See, e.g., their Figure 2:

Figure 2 from Schurch et al. 2016. The four vertical panels show FPR and TPR for programs using 3,6,12 and 20 biological replicates (in yeast). Details are in the Schurch et al. 2016 paper.

The log-ratio t-test performs poorly not only when the number of replicates is small and regularization of variance estimates is essential. Schurch et al. specifically recommend DESeq2 (or edgeR) when up to 12 replicates are performed. In fact, the log-ratio t-test was so bad that it didn’t even make it into their Table 2 “summary of recommendations”.

The authors of Patro et al. 2017 are certainly well-aware of the poor performance of the log-ratio t-test. After all, one of them was specifically thanked in the Schurch et al. 2016 paper “for his assistance in identifying and correcting a bug”. Moreover, the recommended program by Schurch etal. (DESeq2) was authored by one of the coauthors on the Patro et al. paper, who regularly and publicly advocates for the use of his programs (and not the log-ratio t-test):

This recommendation has been codified in a detailed RNA-Seq tutorial where M. Love et al. write that “This [Salmon + tximport] is our current recommended pipeline for users”.

In Soneson and Delorenzi, 2013, the authors wrote that “there is no general consensus regarding which [RNA-Seq differential expression] method performs best in a given situation” and despite the development of many methods and benchmarks since this influential review, the question of how to perform differential expression analysis continues to be debated. While it’s true that “best practices” are difficult to agree on, one thing I hope everyone can agree on is that in a “typical downstream analysis” with a handful of replicates

do not perform differential expression with a log-ratio t-test.

Turning to Patro et al.‘s choice of units, it is important to note that the requirement of shrinkage for RNA-Seq differential analysis is the reason most differential expression tools require abundances measured in counts as input, and do not use length normalized units such as Transcripts Per Million (TPM). In TPM units the abundance $\rho_t$ for a transcript t is $\rho_t \propto \frac{c_t}{N \cdot l_t}$ where $c_t$ are the estimated counts for transcript t, $l_t$ is the (effective) length of t and N the number of total reads. Whereas counts are approximately Poisson distributed (albeit with some over-dispersion), variance estimates of abundances in TPM units depend on the lengths used in normalization and therefore cannot be used directly for regularization of variance estimation. Furthermore, the dependency of TPM on effective lengths means that abundances reported in TPM are very sensitive to the estimates of effective length.

This is why, when comparing the quantification accuracy of different programs, it is important to compare abundances using estimated counts. This was highlighted in Bray et al. 2016: “Estimated counts were used rather than transcripts per million (TPM) because the latter is based on both the assignment of ambiguous reads and the estimation of effective lengths of transcripts, so a program might be penalized for having a differing notion of effective length despite accurately assigning reads.” Yet Patro et al. perform no comparisons of programs in terms of estimated counts.

A typical analysis

The choices of Patro et al. in designing their benchmarks are demystified when one examines what would have happened had they compared Salmon to kallisto on typical data with standard downstream differential analysis tools such as their own tximport and DESeq2. I took the definition of “typical” from one of the Patro et al. coauthors’ own papers (Soneson et al. 2016): “Currently, one of the most common approaches is to define a set of non-overlapping targets (typically, genes) and use the number of reads overlapping a target as a measure of its abundance, or expression level.”

The Venn diagram below shows the differences in transcripts detected as differentially expressed when kallisto and Salmon are compared using the workflow the authors recommend publicly (quantifications -> tximport -> DESeq2) on a typical biological dataset with three replicates in each of two conditions. The number of overlapping genes is shown for a false discovery rate of 0.05 on RNA-Seq data from Trapnell et al. 2014:

A Venn diagram showing the overlap in genes predicted to be differential expressed by kallisto (blue) and Salmon (pink). Differential expression was performed with DESeq2 using transcript-level counts estimated by kallisto and Salmon and imported to DESeq2 with tximport. Salmon was run with GC bias correction.

This example provides Salmon the benefit of the doubt- the dataset was chosen to be older (when bias was more prevalent) and Salmon was not run in default mode but rather with GC bias correction turned on (option –gcBias).

When I saw these numbers for the first time I gasped. Of course I shouldn’t have been surprised; they are consistent with repeated published experiments in which comparisons of kallisto and Salmon have revealed near identical results. And while I think it’s valuable to publish confirmation of previous work, I did wonder whether Nature Methods would have accepted the Patro et al. paper had the authors conducted an actual “typical downstream analysis”.

Patro et al. utilized TPM based comparisons for all the results in their paper, presumably to highlight the improvements in accuracy resulting from better effective length estimates. Numerous results in the paper suggest that Salmon is much more accurate than kallisto. However I had seen a figure in Majoros et al. 2017 that examined the (cumulative) distribution of both kallisto and Salmon abundances in TPM units (their Supplementary Figure 5) in which the curves literally overlapped at almost all thresholds:

The plot above was made with Salmon v0.7.2 so in fairness to Patro et al. I remade it using the ERR188140 dataset mentioned above with Salmon v0.8.2:

The distribution of abundances (in TPM units) as estimated by kallisto (blue circles) and Salmon (red stars).

The blue circles correspond to kallisto and the red stars inside to Salmon. With the latest version of Salmon the similarity is even higher than what Majoros et al. observed! The Spearman correlation between kallisto and Salmon with TPM units is 0.9899896.

It’s interesting to examine what this means for a (truly) typical TPM analysis. One way that TPMs are used is to filter transcripts (or genes) by some threshold, typically TPM >  1 (in another deviation from “typical”, a key table in Patro et al. 2017 – Figure 1d – is made by thresholding with TPM > 0.1). The Venn diagram below shows the differences between the programs at the typical TPM > 1  threshold:

A Venn diagram showing the overlap in transcripts predicted by kallisto and Salmon to have estimated abundance > 1 TPM.

The figures above were made with Salmon 0.8.2 run in default mode. The correlation between kallisto and Salmon (in TPM) units decreases a tiny amount, from 0.9989224 to 0.9974325 with the –gcBias option and even the Spearman correlation decreases by only 0.011 from 0.9899896 to 0.9786092.

I think it’s perfectly fine for authors to present their work in the best light possible. What is not ok is to deliberately hide important and relevant truth, which in this case is that Salmon $\, \simeq \,$ kallisto.

A note on speed

One of the claims in Patro et al. 2017 is that “[the speed of Salmon] roughly matches the speed of the recently introduced kallisto.” The Salmon claim is based on a benchmark of an experiment (details unknown) with 600 million 75bp paired-end reads using 30 threads. Below are the results of a similar benchmark of Salmon showing time to process 19 samples from Boj et al. 2015 with variable numbers of threads:

First, Salmon with –gcBias is considerably slower than default Salmon. Furthermore, there is a rapid decrease in performance gain with increasing number of threads, something that should come as no surprise. It is well known that quantification can be I/O bound which means that at some point, extra threads don’t provide any gain as the disk starts grinding limiting access from the CPUs. So why did Patro et al. choose to benchmark runtime with 30 threads?

The figure below provides a possible answer:

In other words, not only is Salmon $\simeq$ kallisto in accuracy, but contrary to the claims in Patro et al. 2017, kallisto is faster. This result is confirmed in Table 1 of Sarkar et al. 2017 who find that Salmon is slower by roughly the same factor as seen above (in the table “quasi-mapping” is Salmon).

Having said that, the speed differences between kallisto and Salmon should not matter much in practice and large scale projects made possible with kallisto (e.g. Vivian et al. 2017) are possible with Salmon as well. Why then did the authors not report their running time benchmarks honestly?

The first common notion

The Patro et al. 2017 paper uses the term “quasi-mapping” to describe an algorithm, published in Srivastava et al. 2016, for obtaining their (what turned out to be near identical to kallisto) results. I have written previously how “quasi-mapping” is the same as pseudoalignment as an alignment concept, even though Srivastava et al. 2016 initially implemented pseudoalignment differently than the way we described it originally in our preprint in Bray et al. 2015. However the reviewers of Patro et al. 2017 may be forgiven for assuming that “quasi-mapping” is a technical advance over pseudoalignment. The Srivastava et al. paper is dense and filled with complex technical detail. Even for an expert in alignment/RNA-Seq it is not easy to see from a superficial reading of the paper that “quasi-mapping” is an equivalent concept to kallisto’s pseudoalignment (albeit implemented with suffix arrays instead of a de Bruijn graph). Nevertheless, the key to the paper is a simple sentence: “Specifically, the algorithm [RapMap, which is now used in Salmon] reports the intersection of transcripts appearing in all hits” in the section 2.1 of the paper. That’s the essence of pseudoalignment right there. The paper acknowledges as much, “This lightweight consensus mechanism is inspired by Kallisto ( Bray et al. , 2016 ), though certain differences exist”. Well, as shown above, those differences appear to have made no difference in standard practice, except insofar as the Salmon implementation of pseudoalignment being slower than the one in Bray et al. 2016.

Srivastava et al. 2016 and Patro et al. 2017 make a fuss about the fact that their “quasi-mappings” take into account the starting positions of reads in transcripts, thereby including more information than a “pure” pseudoalignment. This is a pedantic distinction Patro et al. are trying to create. Already in the kallisto preprint (May 11, 2015),  it was made clear that this information was trivially accessible via a reasonable approach to pseudoalignment: “Once the graph and contigs have been constructed, kallisto stores a hash table mapping each k-mer to the contig it is contained in, along with the position within the contig.”

In other words, Salmon is not producing near identical results to kallisto due to an unprecedented cosmic coincidence. The underlying method is the same. I leave it to the reader to apply Euclid’s first common notion:

Things which equal the same thing are also equal to each other.

Convergence

While Salmon is now producing almost identical output to kallisto and is based on the same principles and methods, this was not the case when the program was first released. The history of the Salmon program is accessible via the GitHub repository, which recorded changes to the code, and also via the bioRxiv preprint server where the authors published three versions of the Salmon preprint prior to its publication in Nature Methods.

The first preprint was published on the BioRxiv on June 27, 2015. It followed shortly on the heels of the kallisto preprint which was published on May 11, 2015. However the first Salmon preprint described a program very different from kallisto. Instead of pseudoalignment, Salmon relied on chaining SMEMs (super-maximal exact matches) between reads and transcripts to identifying what the authors called “approximately consistent co-linear chains” as proxies for alignments of reads to transcripts. The authors then compared Salmon to kallisto writing that “We also compare with the recently released method of Kallisto which employs an idea similar in some respects to (but significantly different than) our lightweight-alignment algorithm and again find that Salmon tends to produce more accurate estimates in general, and in particular is better able [to] estimate abundances for multi-isoform genes.” In other words, in 2015 Patro et al. claimed that Salmon was “better” than kallisto. If so, why did the authors of Salmon later change the underlying method of their program to pseudoalignment from SMEM alignment?

Inspired by temporal ordering analysis of expression data and single-cell pseudotime analysis, I ran all the versions of kallisto and Salmon on ERR188140, and performed PCA on the resulting transcript abundance table to be able to visualize the progression of the programs over time. The figure below shows all the points with the exception of three: Sailfish 0.6.3, kallisto 0.42.0 and Salmon 0.32.0. I removed Sailfish 0.6.3 because it is such an outlier that it caused all the remaining points to cluster together on one side of the plot (the figure is below in the next section). In fairness I also removed one Salmon point (version 0.32.0) because it differed substantially from version 0.4.0 that was released a few weeks after 0.32.0 and fixed some bugs. Similarly, I removed kallisto 0.42.0, the first release of kallisto which had some bugs that were fixed 6 days later in version 0.42.1.

Evidently kallisto output has changed little since May 12, 2015. Although some small bugs were fixed and features added, the quantifications have been very similar. The quantifications have been stable because the algorithm has been the same.

On the other hand the Salmon trajectory shows a steady convergence towards kallisto. The result everyone is finding, namely that currently Salmon $\simeq$ kallisto is revealed by the clustering of recent versions of Salmon near kallisto. However the first releases of Salmon are very different from kallisto. This is also clear from the heatmap/hierarchical clustering of  Sahraeian et al. in which Salmon-SMEM was included (Salmon used SMEMs until version 0.5.1, sometimes labeled fmd, until “quasi-mapping” became the default). A question: if Salmon ca. 2015 was truly better than kallisto then is Salmon ca. 2017 worse than Salmon ca. 2015?

Convergence of Salmon and Sailfish to kallisto over the course of a year. The x-axis labels the time different versions of each program were released. The y-axis is PC1 from a PCA of transcript abundances of the programs.

Prestamping

The bioRxiv preprint server provides a feature by which a preprint can be linked to its final form in a journal. This feature is useful to readers of the bioRxiv, as final published papers are generally improved after preprint reader, reviewer, and editor comments have been addressed. Journal linking is also a mechanism for authors to time stamp their published work using the bioRxiv. However I’m sure the bioRxiv founders did not intend the linking feature to be abused as a “prestamping” mechanism, i.e. a way for authors to ex post facto receive a priority date for a published paper that had very little, or nothing, in common with the original preprint.

A comparison of the June 2015 preprint mentioning the Salmon program and the current Patro et al. paper reveals almost nothing in common. The initial method changed drastically in tandem with an update to the preprint on October 3, 2015 at which point the Salmon program was using “quasi mapping”, later published in Srivastava et al. 2016. Last year I met with Carl Kingsford (co-corresponding author of Patro et al. 2017) to discuss my concern that Salmon was changing from a method distinct from that of kallisto (SMEMs of May 2015) to one that was replicating all the innovations in kallisto, without properly disclosing that it was essentially a clone. Yet despite a promise that he would raise my concerns with the Salmon team, I never received a response.

At this point, the Salmon core algorithms have changed completely, the software program has changed completely, and the benchmarking has changed completely. The Salmon project of 2015 and the Salmon project of 2017 are two very different projects although the name of the program is the same. While some features have remained, for example the Salmon mode that processes transcriptome alignments (similar to eXpress) was present in 2015, and the approach to likelihood maximization has persisted, considering the programs the same is to descend into Theseus’ paradox.

Interestingly, Patro specifically asked to have the Salmon preprint linked to the journal:

The linking of preprints to journal articles is a feature that arXiv does not automate, and perhaps wisely so. If bioRxiv is to continue to automatically link preprints to journals it needs to focus not only on eliminating false negatives but also false positives, so that journal linking cannot be abused by authors seeking to use the preprint server to prestamp their work after the fact.

The fish always win?

The Sailfish program was the precursor of Salmon, and was published in Patro et al. 2014. At the time, a few students and postdocs in my group read the paper and then discussed it in our weekly journal club. It advocated a philosophy of “lightweight algorithms, which make frugal use of data, respect constant factors and effectively use concurrent hardware by working with small units of data where possible”. Indeed, two themes emerged in the journal club discussion:

1. Sailfish was much faster than other methods by virtue of being simpler.

2. The simplicity was to replace approximate alignment of reads with exact alignment of k-mers. When reads are shredded into their constituent k-mer “mini-reads”, the difficult read -> reference alignment problem in the presence of errors becomes an exact matching problem efficiently solvable with a hash table.

Despite the claim in the Sailfish abstract that “Sailfish provides quantification time…without loss of accuracy” and Figure 1 from the paper showing Sailfish to be more accurate than RSEM, we felt that the shredding of reads must lead to reduced accuracy, and we quickly checked and found that to be the case; this was later noted by others, e.g. Hensman et al. 2015, Lee et al. 2015).

After reflecting on the Sailfish paper and results, Nicolas Bray had the key idea of abandoning alignments as a requirement for RNA-Seq quantification, developed pseudoalignment, and later created kallisto (with Harold Pimentel and Páll Melsted).

I mention this because after the publication of kallisto, Sailfish started changing along with Salmon, and is now frequently discussed in the context of kallisto and Salmon as an equal. Indeed, the PCA plot above shows that (in its current form, v0.10.0) Sailfish is also nearly identical to kallisto. This is because with the release of Sailfish 0.7.0 in September 2015, Patro et al. started changing the Sailfish approach to use pseudoalignment in parallel with the conversion of Salmon to use pseudoalignment. To clarify the changes in Sailfish, I made the PCA plot below which shows where the original version of Sailfish that coincided with the publication of Patro et al. 2014 (version 0.6.3 March 2014) lies relative to the more recent versions and to Salmon:

In other words, despite a series of confusing statements on the Sailfish GitHub page and an out-of-date description of the program on its homepage, Sailfish in its published form was substantially less accurate and slower than kallisto, and in its current form Sailfish is kallisto.

In retrospect, the results in Figure 1 of Patro et al. 2014 seem to be as problematic as the results in Figure 1 of Patro et al. 2017.  Apparently crafting computational experiments via biased simulations and benchmarks to paint a distorted picture of performance is a habit of Patro et al.

In the post I wrote that “The history of the Salmon program is accessible via the GitHub repository, which recorded changes to the code, and also via the bioRxiv preprint server where the authors published three versions of the Salmon preprint prior to its publication in Nature Methods” Here are the details of how these support the claims I make (tl;dr https://twitter.com/yarbsalocin/status/893886707564662784):

Sailfish (current version) and Salmon implemented kallisto’s pseudoalignment algorithm using suffix arrays

First, both Sailfish and Salmon use RapMap (via SACollector) and call mergeLeftRightHits():
Sailfish:
https://github.com/kingsfordgroup/sailfish/blob/352f9001a442549370eb39924b06fa3140666a9e/src/SailfishQuantify.cpp#L192
Salmon:
https://github.com/COMBINE-lab/salmon/commit/234cb13d67a9a1b995c86c8669d4cefc919fbc87#diff-594b6c23e3bdd02a14cc1b861c812b10R2205

The RapMap code for “quasi mapping” executes an algorithm identical to psuedoalignment, down to the detail of what happens to the k-mers in a single read:

First, hitCollector() calls getSAHits_():
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/include/SACollector.hpp#L249

Here kmers are used hashed to SAintervals (Suffix Array intervals), that are then extended to see how far ahead to jump. This is the one of two key ideas in the kallisto paper, namely that not all the k-mers in a read need to be examined to pseudoalign the read. It’s much more than that though, it’s the actual exact same algorithm to the level of exactly the k-mers that are examined. kallisto performs this “skipping” using contig jumping with a different data structure (the transcriptome de Bruijn graph) but aside from data structure used what happens is identical:

https://github.com/COMBINE-lab/RapMap/blob/c1e3132a2e136615edbb91348781cb71ba4c22bd/include/SACollector.hpp#L652
makes a call to jumping and the code to compute MMP (skipping) is
https://github.com/COMBINE-lab/RapMap/blob/c1e3132a2e136615edbb91348781cb71ba4c22bd/include/SASearcher.hpp#L77

There is a different detail in the Sailfish/Salmon code which is that when skipping forward the suffix array is checked for exact matching on the skipped sequence. kallisto does not have this requirement (although it could). On error-free data these will obviously be identical; on error prone data this may make Salmon/Sailfish a bit more conservative and kallisto a bit more robust to error. Also due to the structure of suffix arrays there is a possible difference in behavior when a transcript contains a repeated k-mer. These differences affect a tiny proportion of reads, as is evident from the result that kallisto and Salmon produce near identical results.

The second key idea in kallisto of intersecting equivalence classes for a read. This exact procedure is in:
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/include/SACollector.hpp#L363
which calls:
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/src/HitManager.cpp#L599

There was a choice we had to make in kallisto of how to handle information from paired end reads (does one require consistent pseudoalignment in both? Just one suffices to pseudoalign a read?)
The code for intersection between left and right reads making the identical choices as kallisto is:
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/include/RapMapUtils.hpp#L810

In other words, stepping through what happens to the k-mers in a read shows that Sailfish/Salmon copied the algorithms of kallisto and implemented it with the only difference being a different data structure used to hash the kmers. This is why, when I did my run of Salmon vs. kallisto that led to this blog post I found that
vs
salmon 69,701,169.
That’s a difference of 79,000 out of ~70 million = 0.1%.

1.  Until the kallisto program and preprint was published Salmon used SMEMs. Only after kallisto does Salmon change to using kmer cached suffix array intervals.
2. The kallisto preprint did not discuss outputting position as part of pseudoalignment because it was not central to the idea. It’s trivial to report pseudoalignment positions with either data structure and in fact both kallisto and Salmon do.

I want to make very clear here that I think there can be great value in implementing an algorithm with a different data structure. It’s a form of reproducibility that one can learn from: how to optimize, where performance gains can be made, etc. Unfortunately most funding agencies don’t give grants for projects whose goal is solely to reproduce someone else’s work. Neither do most journal publish papers that set out to do that. That’s too bad. If Patro et al. had presented their work honestly, and explained that they were implementing pseudoalignment with a different data structure to see if it’s better, I’d be a champion of their work. That’s not how they presented their work.

Salmon copied details in the quantification

The idea of using the EM algorithm for quantification with RNA-Seq goes back to Jiang and Wong, 2009, arguably even to Xing et al. 2006. I wrote up the details of the history in a review in 2011 that is on the arXiv. kallisto runs the EM algorithm on equivalence classes, an idea that originates with Nicolae et al. 2011 (or perhaps even Jiang and Wong 2009) but whose significance we understood from the Sailfish paper (Patro et al. 2014). Therefore the fact that Salmon (now) and kallisto both use the EM algorithm, in the same way, makes sense.

However Salmon did not use the EM algorithm before the kallisto preprint and program were published. It used an online variational Bayes algorithm instead. In the May 18, 2015 release of Salmon there is no mention of EM. Then, with the version 0.4 release date Salmon suddenly switches to the EM. In implementing the EM algorithm there are details that must be addressed, for example setting thresholds for when to terminate rounds of inference based on changes in the (log) likelihood (i.e. determine convergence).

For example, kallisto sets parameters
const double alpha_limit = 1e-7;
const double alpha_change_limit = 1e-2;
const double alpha_change = 1e-2;

in EMalgorithm.h
https://github.com/pachterlab/kallisto/blob/90db56ee8e37a703c368e22d08b692275126900e/src/EMAlgorithm.h
The link above shows that these kallisto parameters were set and have not changed since the release of kallisto
Also they were not always this way, see e.g. the version of April 6, 2015:
https://github.com/pachterlab/kallisto/blob/2651317188330f7199db7989b6a4dc472f5d1669/src/EMAlgorithm.h
This is because one of the things we did is explore the effects of these thresholds, and understand how setting them affects performance. This can be seen also in a legacy redundancy, we have both alpha_change and alpha_change_limit which ended up being unnecessary because they are equal in the program and used on one line.

The first versions of Salmon post-kallisto switched to the EM, but didn’t even terminate it the same way as kallisto, adopting instead a maximum iteration of 1,000. See
https://github.com/COMBINE-lab/salmon/blob/59bb9b2e45c76137abce15222509e74424629662/include/CollapsedEMOptimizer.hpp
from May 30, 2015.
This changed later first with the introduction of minAlpha (= kallisto’s alpha_limit)
https://github.com/COMBINE-lab/salmon/blob/56120af782a126c673e68c8880926f1e59cf1427/src/CollapsedEMOptimizer.cpp
and then alphaCheckCutoff (kallisto’s alpha_change_limit)
https://github.com/COMBINE-lab/salmon/blob/a3bfcf72e85ebf8b10053767b8b506280a814d9e/src/CollapsedEMOptimizer.cpp

Here are the salmon thresholds:
double minAlpha = 1e-8;
double alphaCheckCutoff = 1e-2;
double cutoff = minAlpha;

Notice that they are identical except that minAlpha = 1e-8 and not kallisto’s alpha_limit = 1e-7. However in kallisto, from the outset, the way that alpha_limit has been used is:
if (alpha_[ec] < alpha_limit/10.0) {
alpha_[ec] = 0.0;
}

In other words, alpha_limit in kallisto is really 1e-8, and has been all along.

The copying of all the details of our program have consequences for performance. In the sample I ran kallisto performed 1216 EM rounds of EM vs. 1214 EM rounds in Salmon.

Sailfish (current version) copied our sequence specific bias method

One of the things we did in kallisto is implement a sequence specific bias correction along the lines of what was done previously in Roberts et al. 2011, and later in Roberts et al. 2013. Implementing sequence specific bias correction in kallisto required working things out from scratch because of the way equivalence classes were being used with the EM algorithm, and not reads. I worked this out together with Páll Melsted during conversations that lasted about a month in the Spring of 2015. We implemented it in the code although did not release details of how it worked with the initial preprint because it was an option and not default, and we thought we might want to still change it before submitting the journal paper.

Here Rob is stating that Salmon can account for biases that kallisto cannot:
https://www.biostars.org/p/143458/#143639
This was a random forest bias correction method different from kallisto’s.

Shortly thereafter, here is the source code in Sailfish deprecating the Salmon bias correction and switching to kallisto’s method:
https://github.com/kingsfordgroup/sailfish/commit/377f6d65fe5201f7816213097e82df69e4786714#diff-fe8a1774cd7c858907112e6c9fda1e9dR76

https://github.com/kingsfordgroup/sailfish/commit/be0760edce11f95377088baabf72112f920874f9#diff-3e922f9589567fee3b20671da9493c82R34

https://github.com/kingsfordgroup/sailfish/commit/be0760edce11f95377088baabf72112f920874f9#diff-b14c09a136906d1c5d8534afa3a51c4cR818

This is the update to effective length in kallisto:
https://github.com/pachterlab/kallisto/blob/e5957cf96f029be4e899e5746edcf2f63e390609/src/weights.cpp#L184
Here is the Sailfish code:

Notice that there has been a literal copying down to the variable names:

The code written by the student of Rob was:

The code written by us is

efflen *= 0.5*biasAlphaNorm/biasDataNorm;

The code rewritten by Rob (editing that of the student):

effLength *= 0.5 * (txomeNormFactor / readNormFactor);

Note that since our bias correction method was not reported in our preprint, this had to have been copied directly from our codebase and was done so without any attribution.

I raised this specific issue with Carl Kingsford by email prior to our meeting in April 13 2016. We then discussed it in person. The conversation and email were prompted by a change to the Sailfish README on April 7, 2016 specifically accusing us of comparing kallisto to a “ **very old** version of Sailfish”:
https://github.com/kingsfordgroup/sailfish/commit/550cd19f7de0ea526f512a5266f77bfe07148266

What was stated is “The benchmarks in the kallisto paper *are* made against a very old version of Sailfish” not “were made against”. By the time that was written, it might well have been true. But kallisto was published in May 2015, it benchmarked with the Sailfish program described in Patro et al. 2014, and by 2016 Sailfish had changed completely implementing the pseudoalignment of kallisto.

Another aspect of an RNA-Seq quantification program is effective length estimation. There is an attribution to kallisto in the Sailfish code now explaining that this is from kallisto:
“Computes (and returns) new effective lengths for the transcripts based on the current abundance estimates (alphas) and the current effective lengths (effLensIn). This approach is based on the one taken in Kallisto
This is from January 23rd, 2016, almost 9 months after kallisto was released, and 3 months before the Sailfish README accused us of not testing the latest version of Sailfish in May 2015.

The attribution for effective lengths is also in the Salmon code, from 6 months later June 2016:
https://github.com/COMBINE-lab/salmon/blob/335c34b196205c6aebe4ddcc12c380eb47f5043a/include/DistributionUtils.hpp

There is also an acknowledgement in the Salmon code that a machine floating point tolerance we use
https://github.com/pachterlab/kallisto/blob/master/src/EMAlgorithm.h#L19
was copied.
The acknowledgment in Salmon is here
https://github.com/COMBINE-lab/salmon/blob/a3bfcf72e85ebf8b10053767b8b506280a814d9e/src/CollapsedEMOptimizer.cpp
This is the same file where the kallisto thresholds for the EM were copied to.

So after copying our entire method, our core algorithm, many of our ideas, specific parameters, and numerous features… really just about everything that goes into an RNA-Seq quantification project, there is an acknowledgment that our machine tolerance threshold was “intelligently chosen”.

In the 1930s the pseudonym Nicolas Bourbaki was adopted by a group of French mathematicians who sought to axiomatize and formalize mathematics in a rigorous manner. Their project greatly influenced mathematics in the 20th century and contributed to its present “definition -> theorem -> proof” format (Bourbaki went a bit overboard with abstraction and generality and eventually the movement petered out).

In biology definitions seem to be hardly worth the trouble. For example, the botanist Willhelm Johannsen coined the term “gene” in 1909 but there is still no universal agreement on a “definition”.  How could there be? Words associated with biological processes or phenomena that are being understood in a piecemeal fashion naturally change their meaning over time (the philosopher Joseph Woodger did try to axiomatize biology in his 1938 book “The Axiomatic Method in Biology” but his attempt, with 53 axioms, was ill fated).

Bioinformatics occupies a sort of murky middle-ground as far as definitions go. On the one hand, mathematics underlies many of the algorithms and statistical models that are used so there is a natural tendency to formalize concepts, but just when things seem neat and tidy and the lecture begins with “Let the biological sequence be a string on an alphabet $\Sigma$ of size 4 or 20″, biology comes along and rips the blanket off the bed (there are 22 proteinogenic amino acids, molecules may be single or double stranded, they could be circular, bases may be chemically modified, and so on and so forth…).

In some cases things have really gotten out of hand. The words “alignment” and “mapping” appear, superficially, to refer to precise objects and procedures, but they probably have more meanings than the verb “run” (which is in the hundreds). To just list the meanings would be a monumental (impossible?) task. I will never attempt it. However I recently came across some twitter requests for clarification of read alignment and mapping, and I thought explaining those terms was perhaps a more tractable problem.

Here is my attempt at a chronological clarification:

1998: The first use of the term “read alignment” in a paper (in the context of genomics) is, to my knowledge, in “Consed: A Graphical Tool for Sequence Finishing” by D. Gordon, C. Abajian and P. Green. The term “read alignment” in the paper was used in the context of the program phrap, and referred to the multiple alignment of reads to a sequence assembled from the reads, so as to generate a consensus sequence with reduced error. The output looked something like this:

Figure 1 from the Consed paper.

To obtain multiple alignments phrap required a “model”. Roughly speaking, a model is an objective scoring function along with parameters, and, although not discussed in the Consed paper, that model was crucial to the functionality and performance of phrap. The phrap model included not just mismatches of bases but also insertions and deletions. Although phrap was designed to work with Sanger reads, there is conceptually not much difference between the way “read alignment” was used in the Consed paper and the way it is sometimes used now with “next-generation” reads. In fact, the paper “Correcting errors in short reads by multiple alignments” by L. Salmela and J. Schröder, 2011 does much of the same thing (it would have been nice if they cited phrap, although in their defense phrap was never published; Consed was about the visualization tool).

For the purposes of defining “read alignment” it is worthwhile to consider what I think were the four key ingredients of phrap:

• An underlying notion of what a read alignment is. Without being overly formal I think its useful to consider the form of the phrap input/output . This is my own take on it. First, there was a set of reads $\mathcal{F}$ with each read $F \in \mathcal{F}$ consisting of a sequence so that $F$ is an ordered set $F =\{\sigma_1, \ldots, \sigma_l\}$ (for simplicity I make the assumption that the reads were single end, each read was of the same length l, and each sequence element $\sigma \in \{A,C,G,T\}$, and of course I’ve ignored the reverse strand; these assumptions and restrictions can and should be relaxed but I work with them here for simplicity). Next, there was a set of target sequences (contigs, to be precise) that I denote by  $\mathcal{T}$ with $T \in \mathcal{T}$ . These each consisted of a sequence, again an ordered set, $T = \{\sigma^T_1,\ldots,\sigma^T_{|T|}\}$. A “read alignment” in phrap consisted of a pair of maps: first $\psi:\mathcal{F} \rightarrow \mathcal{T} \cup \{\emptyset\}$ specifying for each read the contig it maps to (or the mapping to $\{\emptyset\}$ denoting an unaligned read), and also a set of additional maps $\mathcal{L}_F:F \rightarrow \psi(F) \cup \emptyset$ assigning to each base in each read a corresponding base (or gap in the case of $\{\emptyset\}$) in the contig the read mapped to.
• A model: Given the notion of read alignment, a model was needed to be able to choose between different alignments. In phrap, the multiple alignments depended on scores for mismatches and gaps (the defaults were a score of 11 for a match, -2 for a mismatch, -4 for initiating a gap, -3 for extending a gap). In general, a model provides an approach to distinguishing and choosing between the read alignments as (roughly) defined above.
• An algorithm for read alignment. Given a model there was a need for an algorithm to actually find high scoring alignments. In phrap the algorithm consisted of performing a banded Smith-Waterman alignment using the model (roughly) specified above. The Smith-Waterman algorithm produced alignments $\mathcal{L}_F$ that were global, i.e. order preserving between the elements of $F$ and $\psi(F)$.
• Data structures for enabling efficient alignment. With large numbers of targets and many reads, there was a need for phrap to efficiently prune the search for read alignments. Running the banded Smith-Waterman alignment algorithm on all possible places a read could align to would have been intractable. To address this problem,  phrap made use of a hash to quickly find exact matches from parts of the read to targets, and then ran the Smith-Waterman banded alignment only on the resulting credible locations. Statistically, this could be viewed as a filter for very low probability alignments.

The four parts of phrap are present in most read alignment programs today. But the details can be very different. More on this in a moment.

2004: Six years after Consed paper the term “read alignment” shows up again in the paper “Comparative genome assembly” by Pop et al. In the paper the term “read alignment” was not defined but rather described via a process. Reads from obtained via shotgun sequencing of the genome of one organism were “aligned” to a reference genome from another closely related organism using a program called MUMmer (the goal being to use proximity of alignments to assist in assembly). In other words, the “alignments” were the output of MUMmer, followed by a post-processing with a series of heuristics to deal with ambiguous alignments. It is instructive to compare the alignments to those of phrap in terms of the four parts discussed above. First, in the formulation of Pop et al. the alignments were between a single read and a genome, not multiple alignments of sets of reads to numerous contigs. In the language above the map $\psi$ was much simpler in Pop et al., with the range consisting only of chromosomes in a genome rather than the thousands of contigs in an assembly. Second, although MUMmer utilized an implicit model for alignment, the model was of a very different meaning. Reads from one organism were being aligned to sequences from another, a task fundamentally different than aligning reads back to sequences they originated from (albeit with error). In other words, “alignment” in the Pop et al. paper referred to identification of homology. This opens up a whole other bag of worms that I wrote about in a paper with Colin Dewey in 2006 and that I will not open here. In terms of data structures, MUMmer made use of a suffix tree rather than a hash, a distinction that is important in that suffix trees provide a much richer possibility for search (MUM stands for maximum unique match, the kind of thing suffix trees are useful for finding and that MUMmer took advantage of). A lot of the research in read alignment has centered on developing efficient and compact data structures for string search.

Another development in the year 2004 was the appearance of the term “read mapping”. In the paper “Pash: efficient genome-scale anchoring by positional hashing” by Kalafus et al., hashing of k-mers was used to “anchor” reads. The word “anchoring” was and still is somewhat ambiguous as to its meaning, but in the Pash context it referred to the identification of positions of exact matching k-mers within a read on a target sequence. In other words, Pash could be considered to provide only partial alignments, because some bases within the read would not be aligned at all. Formally, one would say that the functions $\mathcal{L}_F$ had as their domain not the full reads $\mathcal{F}$ but rather strict subsets $\mathcal{S} \subset F$ where $\mathcal{S}$ is nonempty. For this reason, in the Pash paper the authors suggest post-processing their output with base-pair level sequence alignment programs popular at the time (e.g. AVID, LAGAN) to fully “align” the reads (they could have also suggested performing a banded Smith-Waterman alignment like phrap).

2006: The first sequencer manufactured by Solexa, the Genome Analyzer was launched, and following it the ELAND program for “read alignment” was provided  to customers buying the machine. This was, in my opinion, a landmark even in “read alignment” because with next-generation sequencing and ELAND read alignment became mainstream, a task to be performed in the hands of individual users, rather than just a step among many in complex genome sequencing pipelines manufactured in large genome sequencing centers. In the initial release of ELAND, the program performed “ungapped read alignment”, which in the language of this post means that the maps $\mathcal{L}_F$  had the property that if  $\mathcal{L}_F(i) = \sigma^{\psi(F)}_{j}$ then $\mathcal{L}_F(i+1) = \sigma^{\psi(F)}_{j+1}$, i.e. there were no gaps in the read alignments.

2009: With two programs published back-to-back in the same year, BWA (“Fast and accurate alignment with the Burrows-Wheeler transform“, Heng Li and Richard Durbin)  and Bowtie (“Ultrafast and memory-efficient alignment of short DNA sequences to the human genome” by Ben Langmead, Cole Trapnell, Mihai Pop and Steven L Salzberg) read alignment took a big step forward with the use of more sophisticated data structures for rapid string matching.  BWA and Bowtie were based on the Burrows-Wheeler transform, and specifically innovations with the way it was used, to speed up read alignment. In the case of Bowtie, the alignments were ungapped just as originally with ELAND, although Bowtie2 published in 2012 incorporated gaps into the model.

Also of note is that by this time the terms “read alignment” and “read mapping” had become interchangeable. The BWA and Bowtie papers both used both terms, as did many other papers. Moreover, the terms started to take on a specific meaning in terms of the notion of “alignment”. They were referring to alignment of reads to a target genome, and moreover alignment of individual bases to specific positions in the target genome (i.e. the maps $\mathcal{L}_F$). This was in large part due to the publication of the sequence alignment format also in 2009 in Heng Li et al.‘s important paper on SAMtools.

Finally, 2009 was the year the notion that “spliced read alignment” was introduced. “Spliced alignment” was a concept around since the mid 90s (see, e.g. “Gene recognition by spliced alignment” by MS Gelfand, AA Mironov and PA Pevzner). The idea was to align cDNAs to a genome, requiring alignments allowing for long gaps (for the introns). In the paper “Optimal spliced alignments of short sequencing reads” by Fabio De Bona et al. the authors introduced the tool QPalma for this purpose, building on a spliced alignment tool from the group called Palma. Here the idea was to extend “spliced alignment” to reads, and while QPalma was never widely adopted, other programs such as TopHat, GSNAP, Star, etc. have becomes staples of RNA-Seq. Going back to the formalism I provided with the description of phrap, what spliced alignment did was to extend the models used for read alignment to allow for long gaps.

In the following years, there was what I think is fair to characterize as an explosion in the number of read alignment papers and tools. Models were tweaked and new indexing methods introduced, but the fundamental notion of “read alignment” (by now routinely also called “read mapping”) remained the same.

2014: In his Ph.D. thesis published in December of last year, my former student Nicolas Bray had the idea of entirely dropping the requirement for a read alignment to include the maps $\mathcal{L}_F$. In his setup, a read alignment would consist only of specifying for each read which target sequence it originated from, but not where in the target sequence it aligned to. The point, elaborated on in Chapter 2 of Nick’s thesis, was motivated by the observation that for many applications (e.g. RNA-Seq) the sufficient statistics for the relevant models are essentially the information contained in the map $\psi$ alone. He called this notion of read alignment “pseudoalignment”, as it is a fundamentally different concept than what was previously thought of as read alignment.

Of course there is not much to writing down a new definition, and there wouldn’t be much of a point to pseudoalignment it if not for two things: Nick’s realized that pseudoalignments could be obtained much faster than standard read alignments and second, that the large imprints of alignment files needed for the vast amounts of sequence that was being produced were becoming a major bottleneck in genomics. Working on RNA-Seq projects with collaborators, this is how we felt on a regular basis:

Pseudoalignment offered a reprieve from what was becoming an unbearable task, and (I believe) it is going to turn out to be a useful and widely applicable concept.

2015: Earlier this year we posted a preprint to the arXiv detailing a method for performing pseudoalignment based on the ideas in Nick’s thesis, with the additional idea of Páll Melsted to use a new type of indexing scheme based on a de Bruijn graph of the target sequences. The program implementing these ideas is called kallisto. Looking back at phrap, kallisto can be seen to consist of a different notion of read alignment (pseudoalignment), a novel data structure (the target sequence de Bruijn graph), and a new fast algorithm for finding pseudoalignments (based on intersecting sets of k-mer matches). Although we did not discuss the model explicitly in our paper, it is implicit via the algorithm used to find pseudoalignments. One useful thing to keep in mind is that the nature of the index of kallisto allows for extracting more than a pseudoalignment. If desired, one can obtain positions within the targets for some k-mers contained in a query read, or even strand information, and in fact such data is used in some parts of kallisto. When such information is extracted, what kallisto is performing is no longer a pure pseudoalignment, but rather what the Pash authors called a read mapping (although that term is unsuitable because as mentioned it has become synonyms with read alignment). It turns out there is now another word for exactly that concept.

Following the publication of kallisto on the arXiv another preprint was posted on the bioRxiv introducing a program called Salmon and a new term called “lightweight alignment” (Salmon: Accurate, Versatile and Ultrafast Quantification from RNA-Seq Data Using Lightweight-Alignment by Rob Patro, Geet Duggal and Carl Kingsford) . Lightweight alignment as a notion of alignment is exactly what was just referred to: the notion of read mapping in Pash. Utilizing an index and software by Heng Li called the FMD index, Salmon determines “lightweight alignment” chains of what are called maximal exact matches and super-maximal exact matches (not to be confused with the maximal unique matches, or MUMs discuss above).  These provide partial alignment information for the sequences in each read, i.e. just like with Pash one obtains maps $\mathcal{L}_F$ but with the domain sometimes restricted to a nonempty subset $\mathcal{S} \subset \mathcal{F}$ when the MEMs or SMEMs don’t cover the read. Again, I find it useful to refer back and compare to phrap: with Salmon the “read alignment” consists of a different type of index and an algorithm associated with it (making sure the MEMs/SMEMs are consistent). The notion of alignment is distinct from that of phrap by virtue of the restriction of the domain in the maps $\mathcal{L}_F$.  I think a new term for the latter concept since the one Pash used failed to stick, and “lightweight alignment” seems fine.

Recently (just last week) a new read alignment related term was introduced in a bioRxiv preprint. A paper describing a new program called RapMap uses the term quasi-mapping for the procedure implemented in the software. RapMap finds pseudoalignments in the sense described above, with the only difference being the type of index that is used (instead of the target de Bruijn graph, it utilizes a generalized suffix array together with a hash table related to the array). This time I think the introduction of new terminology was unfortunate. Just as with kallisto, RapMap can be used to extract more information than just a pseudoalignment from the index (i.e. a lightweight alignment), and indeed the program does so. But also like kallisto, the speed of RapMap is derived from the idea of efficiently finding pseudoalignments by intersecting sets of k-mer matches. In other words, quasi-mapping is a procedure and not a concept; what the procedure outputs is a lightweight alignment via a pseudoalignment.

tl;dr

Read alignment is what you think it means.

Lightweight alignment is “partial” read alignment (it should have been called read mapping but that was taken). Some, but not necessarily all bases in a read are aligned to specific bases among target sequences.

Pseudoalignment is read assignment to target sequences without any base-level sequence alignment.

Read alignment, lightweight alignment and pseudoalignment are concepts, not algorithms.

Quasi-mapping is a procedure and not a concept. The procedure outputs a lightweight alignment via a pseudoalignment.

Disclaimer: I’ve almost certainly gotten some dates wrong, missed some papers I should have cited, and omitted some crucial developments in read alignment that I should have included. Please let me know if you think that is the case by commenting and I’ll update the post as needed.

The hierarchical classification of nature initiated by Carl Linnaeus today consists of eight major “ranks”, namely species, genus, family, order, class, phylum, kingdom and domain:

In the microbial world it makes sense to refine the standard taxonomy by subdividing species into strains. An important reason to do so is that bacterial taxonomy must reflect not only phylogeny but also pathogenicity, and small differences in genomes can translate to large pathogenic differences. This has implications for metagenomic analyses of microbial communities: for many biomedical applications it is desirable to characterize individuals strains.

Metagenomics has its roots in culture-independent retrieval and sequencing of 16S rRNA genes, and while variations in 16S can sometimes distinguish between strains, a single gene is not always sufficient. This limitation of 16S can be overcome with whole genome shotgun sequencing of microbial communities, an approach to metagenomics that became popular in the early 2000s and  that opened the door to higher resolution characterization of communities. In 2005 Kevin Chen and I wrote a review on the bioinformatics challenges that would have to be overcome to walk through the door. One of the things we did was to emphasize “problems and their connections to other areas of bioinformatics, such as… gene expression analysis”, and throughout the past decade I’ve always hoped for deeper connections to be established between metagenomics and gene expression bioinformatics. I’ve noticed interesting connections pop up from time to time (e.g. Paulson et al. 2013)  and have occasionally entertained the thought with my students and collaborators, especially as work in my group became more focused on RNA-Seq since the development of Cufflinks in 2008.

However connection modern transcriptome analysis methodology, specifically bioinformatics of RNA-Seq to metagenomics has been difficult to do until recently. One major reason is that until just a few years ago, there was no reference genome database for metagenomics analogous to the reference annotation databases available for use in transcriptomics. Another way to put this is that metagenomics has, until recently, been “de novo” bioinformatics. By this I mean that the analysis of communities from whole genome shotgun data had to largely proceed via de novo analyses of the data (e.g. de novo assembly of genomes), “binning” of reads according to sequence characteristics or hits to gene databases was required because it was impossible to compare sequences to references genomes. While de novo methods have also been developed for RNA-Seq, the scale of transcriptome analysis is much smaller than that of most metagenomic analyses, and as has been well documented, de novo transcriptomics is already very difficult (e.g. Amin et al. 2014).

The de novo state of metagenomics has changed in recent years, as (relatively) low-cost sequencing has been a boon for microbial genomics. The graph below, extracted from NCBI and published in a recent review, shows that in just the past few years thousands of bacterial genomes have been sequences, enabling, for the first time, reference based metagenomics:

This observation is reflected in the recent development of many methods for a variety of metagenomic applications that take advantage of reference genome databases.  Specifically, the problem of read assignment, which is fundamental for abundance estimation, has benefited from the possibility of metagenomic read alignment to reference databases.

The figure below, reproduced from the preprint “An evaluation of the accuracy and speed of metagenome analysis tools” by Stinus Lindgreen, Karen L. Adair and Paul Gardner, bioRxiv May 15, 2015 shows a benchmark of the accuracy and runtime of 14 programs developed for metagenomic read assignment for whole genome shotgun data:

The problem these methods are solving is really similar to the problem of read assignment in RNA-Seq. In RNA-Seq, instead of originating from strains, reads originate from transcripts. Just as strains are present in different abundances in a community, so are RNA transcripts in a cell (or in bulk). The analogy of taxonomy in metagenomics, i.e. the grouping of strains into species, genus etc. is also present in RNA-Seq, where transcripts are grouped into genes. The fragment (or read) assignment problem in RNA-Seq is closely related to the quantification problem in RNA-Seq and is a problem that has been thoroughly researched and for which many algorithms have been developed. I discussed the importance of the fragment assignment problem for RNA-Seq in my 2013 Genome Informatics Keynote.

In response to the development of reference-based bioinformatics possibilities for metagenomics, about three years ago my student Lorian Schaeffer started looking at the suitability of RNA-Seq tools for metagenomic read assignment. Although the metagenomic and RNA-Seq assignment problems are conceptually similar and methodologically related, there are various technical issues involved in applying RNA-Seq tools in the metagenomic setting (e.g. the need to carefully account for taxonomy in the metagenomics setting). After developing the computational infrastructure to benchmark RNA-Seq programs in the metagenomic setting, she proceeded to evaluate the accuracy of eXpress, a streaming algorithm for RNA-Seq quantification. Although the quantification of eXpress was specifically designed to be suitable for large numbers of reads, the program requires read alignments to a reference transcriptome (or in Lorian’s experiments a genome) database. In the metagenomic setting realistic databases are huge, and she found that it took days just to map the reads. Nevertheless, her initial benchmarks revealed that eXpress was significantly more accurate than the available metagenomic read assignment tools of the time.

When Kraken (Wood and Salzberg 2014), and later CLARK (Ounit et al. 2015) were published in 2014 and 2015 respectively, we took note because by circumventing the alignment step they dramatically altered the tractability of metagenomic read assignment. In parallel, in my group, Nicolas Bray and later Páll Melsted and Harold Pimentel were developing what is now kallisto (Bray et al. 2015). Like Kraken, kallisto avoided the need for aligning reads, but with the introduction of the concept of pseudoalignment, allowed for accurate read assignments based on joint analysis of exact k-mer matches. What we showed earlier this year is that unlike naïve k-mer based approaches to quantification, kallisto is as accurate as eXpress and other read alignment based quantification tools, and this observation led Lorian to immediately proceed to benchmark it on metagenomic data. The result of her work was just posted as a preprint:

Lorian Schaeffer, Harold Pimentel, Nicolas Bray, Páll Melsted and Lior Pachter, Pseudoalignment for metagenomic read assignment, arXiv 1510.07371, 2015.

With this paper we demonstrate a “technology transfer” from RNA-Seq bioinformatics to metagenomics, one that achieves dramatic improvements in read assignment accuracy in the metagenomics setting. The main result of her work is Table 1 in our preprint:

Using a published simulated Illumina dataset from Mende et al. 2012 (based on 100 genomes and containing 53.33 million reads), and augmenting it with another 2,308 genomes for the purpose of testing, she shows that kallisto significantly outperforms the best quantification methods (as benchmarked by Lindgreen et al., see figure above). “Significant” here refers to what I think is fair to characterize as an extraordinary improvement: at the genus level, a level that programs such as CLARK have been optimized for, kallisto’s RRMSE (relative root mean squared error)  is 0.13 compared to 17.05 for Kraken and 18.58 for CLARK. The improvement is based on two ideas: first, the results show that the model based approach for read assignment, the concept that underlies GASiC and eXpress, outperforms direct taxonomic read assignment as implemented by MEGAN and Kraken and CLARK (in the latter approach reads are aligned to the lowest rank to which they align unambigously). Second, pseudoalignment is not just faster than traditional alignment but also accurate.

The upshot: the accuracy and efficiency of kallisto make strain level analysis of metagenomes possible. In fact kallisto is more accurate at the strain level than other programs are at the genus level. Just as we have been advocating for transcript level analysis from RNA-Seq data, we believe that strain level analysis should become commonplace in metagenomics.

In digging deeply into the bioinformatics of metagenomics bioinformatics we noticed a few other areas that could benefit from RNA-Seq technology transfer. For example, the standard of RNA-Seq methods benchmarking appears to be higher than in metagenomics. Both the Kraken and CLARK papers benchmarked their programs on simulations with 10 genomes (the number ten is not a typo). CLARK did test on one dataset with 20 genomes, although using only 10,000 reads. To be fair to the authors of those papers, their standards were much higher than others in the field. The paper

Yu-Wei Wu and Yuzhen Ye, A novel abundance-based algorithm for binning metagenomic sequences using l-tuples, Journal of Computational Biology 2011.

benchmarked their method on simulations of reads from 2 (two!!) organisms. Biologists frequently complain that simulations of bioinformaticians are completely non-informative and unfortunately these cases provide fodder for such prejudice. Having said that, the RNA-Seq community also has much to learn from the metagenomics community. The previously mentioned paper by Paulson et al. 2013 addresses missing data in a way that should translate directly to missing data in single-cell RNA-Seq (the paper also makes performance comparisons with their comparative metagenomics approach to the RNA-Seq programs DESeq and edgeR) . One paper (McDavid et al. 2012) does take a look at modeling single-cell data with zero inflated distributions but I think this is a good example where metagenomics is ahead of RNA-Seq. Our immediate plans are to develop the kallisto application to metagenomics to include the ability to perform metagenome comparisons using sleuth. Conversely, inspired by the taxonomy hierarchy fundamental to metagenomics we’re going to explore RNA-Seq quantification with groups of transcripts that go beyond just genes.

Horizontal transfer is good.

Today I posted the preprint N. Bray, H. Pimentel, P. Melsted and L. Pachter, Near-optimal RNA-Seq quantification with kallisto to the arXiv. It describes the RNA-Seq quantification program kallisto. [Update April 5, 2016: a revised version of the preprint has been published: Nicolas L. Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-optimal probabilistic RNA-Seq quantification, Nature Biotechnology (2016), doi:10.1038/nbt.3519 published online April 4, 2016.]

The project began in August 2013 when I wrote my second blog post, about another arXiv preprint describing a program for RNA-Seq quantification called Sailfish (now a published paper). At the time, a few students and postdocs in my group read the paper and then discussed it in our weekly journal club. It advocated a philosophy of “lightweight algorithms, which make frugal use of data, respect constant factors and effectively use concurrent hardware by working with small units of data where possible”. Indeed, two themes emerged in the journal club discussion:

1. Sailfish was much faster than other methods by virtue of being simpler.

2. The simplicity was to replace approximate alignment of reads with exact alignment of k-mers. When reads are shredded into their constituent k-mer “mini-reads”, the difficult read -> reference alignment problem in the presence of errors becomes an exact matching problem efficiently solvable with a hash table.

We felt that the shredding of reads must lead to reduced accuracy, and we quickly checked and found that to be the case. In fact, in our simulations, we saw that Sailfish significantly underperformed methods such as RSEM. However the fact that simpler was so much faster led us to wonder whether the prevailing wisdom of seeking to improve RNA-Seq analysis by looking at increasingly complex models was ill-founded. Perhaps simpler could be not only fast, but also accurate, or at least close enough to best-in-class for practical purposes.

After thinking about the problem carefully, my (now former) student Nicolas Bray realized that the key is to abandon the idea that alignments are necessary for RNA-Seq quantification. Even Sailfish makes use of alignments (of k-mers rather than reads, but alignments nonetheless). In fact, thinking about all the tools available, Nick realized that every RNA-Seq analysis program was being developed in the context of a “pipeline” of first aligning reads or parts of them to a reference genome or transcriptome. Nick had the insight to ask: what can be gained if we let go of that paradigm?

By April 2014 we had formalized the notion of “pseudoalignment” and Nick had written, in Python, a prototype of a pseudoaligner. He called the program kallisto. The basic idea was to determine, for each read, not where in each transcript it aligns, but rather which transcripts it is compatible with. That is asking for a lot less, and as it turns out, pseudoalignment can be much faster than alignment. At the same time, the information in pseudoalignments is enough to quantify abundances using a simple model for RNA-Seq, a point made in the isoEM paper, and an idea that Sailfish made use of as well.

Just how fast is pseudoalignment? In January of this year Páll Melsted from the University of Iceland came to visit my group for a semester sabbatical. Páll had experience in exactly the kinds of computer science we needed to optimize kallisto; he has written about efficient k-mer counting using the bloom filter and de Bruijn graph construction. He translated the Python kallisto to C++, incorporating numerous clever optimizations and a few new ideas along the way. His work was done in collaboration with my student Harold Pimentel, Nick (now a postdoc with Jacob Corn and Jennifer Doudna at the Innovative Genomics Initiative) and myself.

The screenshot below shows kallisto being used on my 2012 iMac desktop to build an index of the human transcriptome (5 min 8 sec), and then quantify 78.6 million GEUVADIS human RNA-Seq reads (14 min). When we first saw these results we thought they were simply too good to be true. Let me repeat: The quantification of 78.6 million reads takes 14 minutes on a standard desktop using a single CPU core. In some tests we’ve gotten even faster running times, up to 15 million reads quantified per minute.

The results in our paper indicate that kallisto is not just fast, but also very accurate. This is not surprising: underlying RNA-Seq analysis are the alignments, and although kallisto is pseudoaligning instead, it is almost always only the compatibility information that is used in actual applications. As we show in our paper, from the point of view of compatibility, the pseudoalignments and alignments are almost the same.

Although accuracy is a primary concern with analysis, we realized in the course of working on kallisto that speed is also paramount, and not just as a  matter of convenience. The speed of kallisto has three major implications:

1. It allows for efficient bootstrapping. All that is required for the bootstrap are reruns of the EM algorithm, and those are particularly fast within kallisto. The result is that we can accurately estimate the uncertainty in abundance estimates. One of my favorite figures from our paper, made by Harold, is this one:

It is based on an analysis of 40 samples of 30 million reads subsampled from 275 million rat RNA-Seq reads. Each dot corresponds to a transcript and is colored by its abundance. The x-axis shows the variance estimated from kallisto bootstraps on a single subsample while the y-axis shows the variance computed from the different subsamples of the data. We see that the bootstrap recapitulates the empirical variance. This result is non-trivial: the standard dogma, that the technical variance in RNA-Seq is “Poisson” (i.e. proportional to the mean) is false, as shown in Supplementary Figure 3 of our paper (the correlation becomes 0.64). Thus, the bootstrap will be invaluable when incorporated in downstream application and we are already working on some ideas.

2. It is not just the kallisto quantification that is fast; the index building, and even compilation of the program are also easy and quick. The implication for biologists is that RNA-Seq analysis now becomes interactive. Instead of “freezing” an analysis that might take weeks or even months, data can be explored dynamically, e.g. easily quantified against different transcriptomes, or re-quantified as transcriptomes are updated. The ability to analyze data locally instead of requiring cloud computation means that analysis is portable, and also easily secure.

3. We have found the fast turnaround of analysis helpful in improving the program itself. With kallisto we can quickly check the effect of changes in the algorithms. This allows for much faster debugging of problems, and also better optimization. It also allows us to release improvements knowing that users will be able to test them without resorting to a major computation that might take months. For this reason we’re not afraid to say that some improvements to kallisto will be coming soon.

As someone who has worked on RNA-Seq since the time of 32bp reads, I have to say that kallisto has personally been extremely liberating. It offers freedom from the bioinformatics core facility, freedom from the cloud, freedom from the multi-core server, and in my case freedom from my graduate students– for the first time in years I’m analyzing tons of data on my own; because of the simplicity and speed I find I have the time for it. Enjoy!