You are currently browsing the tag archive for the ‘RNA velocity’ tag.

A few months ago, in July 2019, I wrote a series of five posts about the Melsted, Booeshaghi et al. 2019 preprint on this blog, including a post focusing on a new fast workflow for RNA velocity based on kallisto | bustools.  This new workflow replaces the velocyto software published with the RNA velocity paper (La Manno et al. 2018), in the sense that the kallisto | bustools is more than an order of magnitude faster than velocyto (and Cell Ranger which it builds on), while producing near identical results:


In my blogpost, I showed how we were able to utilize this much faster workflow to easily produce RNA velocity for a large dataset of 113,917 cells from Clark et al. 2019, a dataset that was intractable with Cell Ranger + velocyto.

The kallisto | bustools RNA velocity workflow makes use of two novel developments: a new mode in kallisto called kallisto bus that produces BUS files from single-cell RNA-seq data, and a new collection of new C++ programs forming “bustools” that can be used to process BUS files to produce count matrices. The RNA velocity workflow utilizes the bustools sort, correct, capture and count commands. With these tools an RNA velocity analysis that previously took a day now takes minutes. The workflow, associated tools and validations are described in the Melsted, Booeshaghi et al. 2019 preprint.

Now, in an interesting exercise presented at the Single Cell Genomics 2019 meeting, Sten Linnarsson revealed that he reimplemented the kallisto | bustools workflow in Loompy. Loompy, which previously consisted of some Python tools for reading and writing Loom files, now has a function that runs kallisto bus. It also consists of Python functions that are used to manipulate BUS files; these are Python reimplementations of the bustools functions needed for RNA velocity and produce the same output as kallisto | bustools. It is therefore possible to now answer a question I know has been on many minds… one that has been asked before but not to my knowledge, in the single-cell RNA-seq setting… is Python really faster than C++ ?

To answer this question we (this is an analysis performed with Sina Booeshaghi), performed an apples-to-apples comparison running kallisto | bustools and Loompy on exactly the same data, with the same hardware. We pre-processed both the human forebrain data from La Manno et al. 2018,  and data from Oetjen et al. 2018 consisting of 498,303,099 single-cell RNA-seq reads sequenced from a cDNA library of human bone marrow (SRA accession SRR7881412; see also PanglaoDB).

First, we examined the correspondence between Loopy and bustools on the human forebrain data. As expected, given that the Loompy code first runs the same kallisto as in the kallisto | bustools workflow, and then reimplements bustools, the results are the near identical. In the first plot every dot is a cell (as defined by the velocyto output from La Manno et al. 2018) and the number of counts produced by each method is shown. In the second, the correlation between gene counts in each cell are plotted:

The figures above are produced from the “spliced” RNA velocity matrix. We also examined the “unspliced” matrix, with similar results:

In runtime benchmarks on the Oetjen et al. 2018 data we found that kallisto | bustools runs 3.75 faster than Loompy (note that by default Loompy runs kallisto with all available threads, so we modified the Loompy source code to create a fair comparison). Furthermore, kallisto | bustools requires 1.8 times less RAM. In other words, despite rumors to the contrary, Python is indeed slower than C++ ! 

Of course, sometimes there is utility in reimplementing software in another language, even a slower one. For example, a reimplementation of C++ code could lead to a simpler workflow in a higher level language. That’s not the case here. The memory efficiency of kallisto | bustools makes possible the simplest user interface imaginable: a kallisto | bustools based Google Colab notebook allows for single-cell RNA-seq pre-processing in the cloud on a web browser without a personal computer.

At the recent Single Cell Genomics 2019 meeting, Linnarsson’s noted that Cell Ranger + veloctyto has been replaced by kallisto | bustools:

image (9).png

Indeed, as I wrote on my blog shortly after the publication of Melsted, Booeshaghi et al., 2019, RNA velocity calculations that were previously intractable on large datasets are now straightforward. Linnarsson is right. Users should always adopt best-in-class tools in favor of methods that underperform in accuracy, efficiency, or both. #methodsmatter

This post is the fifth in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

  1. Near-optimal pre-processing of single-cell RNA-seq
  2. Single-cell RNA-seq for dummies
  3. How to solve an NP-complete problem in linear time
  4. Rotating the knee (plot) and related yoga
  5. High velocity RNA velocity

The following passage about Beethoven’s fifth symphony was written by one of my favorite musicologists:

“No great music has ever been built from an initial figure of four notes. As I have said elsewhere, you might as well say that every piece of music is built from an initial figure of one note. You may profitably say that the highest living creatures have begun from a single nucleated cell. But no ultra-microscope has yet unraveled the complexities of the single living cell; nor, if the spectroscope is to be believed, are we yet very full informed of the complexities of a single atom of iron : and it is quite absurd to suppose that the evolution of a piece of music can proceed from a ‘simple figure of four notes’ on lines in the least resembling those of nature.” – Donald Francis Tovey writing about Beethoven’s Fifth Symphony in Essays in Musical Analysis Volume I, 1935.

This passage conveys something true about Beethoven’s fifth symphony: an understanding of it cannot arise from a limited fixation on the famous four note motif. As far as single-cell biology goes, I don’t know whether Tovey was familiar with Theodor Boveri‘s sea urchin experiments, but he certainly hit upon a scientific truth as well: single cells cannot be understood in isolation. Key to understanding them is context (Eberwine et al., 2013).

RNA velocity, with roots in the work of Zeisel et al., 2011, has been recently adapted for single-cell RNA-seq by La Manno et al. 2018, and provides much needed context for interpreting the transcriptomes of single-cells in the form of a dynamics overlay. Since writing a review about the idea last year (Svensson and Pachter, 2019), I’ve become increasingly convinced that the method, despite relying on sparse data, numerous very strong model assumptions, and lots of averaging, is providing meaningful biological insight. For example, in a recent study of spermatogonial stem cells (Guo et al. 2018), the authors describe two “unexpected” transitions between distinct states of cells that are revealed by RNA velocity analysis (panel a from their Figure 6, see below):


Producing an RNA velocity analysis currently requires running the programs Cell Ranger followed by velocyto. These programs are both very slow. Cell Ranger’s running time scales at about 3 hours per hundred million reads (see Supplementary Table 1 Melsted, Booeshaghi et al., 2019). The subsequent velocyto run is also slow. The authors describe it as taking “approximately 3 hours” but anecdotally the running time can be much longer on large datasets. The programs also require lots of memory.

To facilitate rapid and large-scale RNA velocity analysis, in Melsted, Booeshaghi et al., 2019  we describe a kallisto|bustools workflow that makes possible efficient RNA velocity computations at least an order of magnitude faster than with Cell Ranger and velocyto. The work, a tour-de-force of development, testing and validation, was primarily that of Sina Booeshaghi. Páll Melsted implemented the bustools capture command and Kristján Hjörleifsson assisted with identifying and optimizing the indices for pseudoalignment. We present analysis on two datasets in the paper. The first is single-cell RNA-seq from retinal development recently published in Clark et al. 2019. This is a beautiful paper- and I don’t mean just in terms of the results. Their data and results are extremely well organized making their paper reproducible. This is so important it merits a shout out 👏🏾

See Clark et al. 2019‘s  GEO GSE 118614 for a well-organized and useful data share.

The figure below shows RNA velocity vectors overlaid on UMAP coordinates for Clark et al.’s 10 stage time series of retinal development (see cell [8] in our python notebook):


An overlap on the same UMAP with cells colored by type is shown below:


Clark et al. performed a detailed pseudotime analysis in their paper, which successfully identified genes associated with cell changes during development. This is a reproduction of their figure 2:


We examined the six genes from their panel C from a velocity point of view using the scvelo package and the results are beautiful:


What can be seen with RNA velocity is not only the changes in expression that are extracted from pseudotime analysis (Clark et al. 2019 Figure 2 panel C), but also changes in their velocity, i.e. their acceleration (middle column above). RNA velocity adds an interesting dimension to the analysis.

To validate that our kallisto|bustools RNA velocity workflow provides results consistent with velocyto, we performed a direct comparison with the developing human forebrain dataset published by La Manno et al. in the original RNA velocity paper (La Manno et al. 2018 Figure 4).


The results are concordant, not only in terms of the displayed vectors, but also, crucially, in the estimation of the underlying phase diagrams (the figure below shows a comparison for the same dataset; kallisto on the left, Cell Ranger + velocyto on the right):


Digging deeper into the data, one difference we found between the workflows (other than speed) is the number of reads counts. We implemented a simple strategy to estimate the required spliced and unspliced matrices that attempts to follow the one described in the La Manno et al. paper, where the authors describe the rules for characterizing reads as spliced vs. unspliced as follows:

1. A molecule was annotated as spliced if all of the reads in the set supporting a given molecule map only to the exonic regions of the compatible transcripts.
2. A molecule was annotated as unspliced if all of the compatible transcript models had at least one read among the supporting set of reads for this molecule mapping that i) spanned exon-intron boundary, or ii) mapped to the intron of that transcript.

In the kallisto|bustools workflow this logic was implemented via the bustools capture command which was first use to identify all reads that were compatible only with exons (i.e. there was no pseudoalignment to any intron) and then all reads that were compatible only with introns  (i.e. there was no pseudoalignment completely within an exon). While our “spliced matrices” had similar numbers of counts, our “unspliced matrices” had considerably more (see Melsted, Booeshaghi et al. 2019 Supplementary Figure 10A and B):


To understand the discrepancy better we investigated the La Manno et al. code, and we believe that differences arise from the velocyto package code in which the same count function

def count(self, molitem: vcy.Molitem, cell_bcidx: int, dict_layers_columns: Dict[str, np.ndarray], geneid2ix: Dict[str, int])

appears 8 times and each version appears to implement a slightly different “logic” than described in the methods section.

A tutorial showing how to efficiently perform RNA velocity is available on the kallisto|bustools website. There is no excuse not to examine cells in context.


Blog Stats

%d bloggers like this: