You are currently browsing the category archive for the ‘single cell RNA-Seq’ category.

A recent opinion piece titled “A decade of molecular cell atlases” by Stephen Quake narrates the incredible single-cell genomics technology advances that have taken place over the last decade, and how they have translated to increasingly resolved cell atlases. While the article tells some fascinating stories (apparently when hearing a report about the CZI mouse cell atlas Priscilla Chan remarked “why don’t we just do human?” and thus the idea for a human cell atlas was born), it contains several errors and omissions. I have summarized some of them below, and have sent a copy to Trends in Genetics, where the opinion piece was published, requesting corrections [Update April 5, 2022: Trends in Genetics rejected my submission; it is posted on the arXiv]:

  • Quake writes that “The year 2017 marked the release of our Tabula Muris data set and preprint” and cites the 2nd version of a preprint on the bioRxiv posted on March 29, 2018, that later became an article in Nature on October 3, 2018. While it is true that the preprint was first posted on December 20, 2017, at that time the data was not released. The data was only released in the 2nd version of the preprint on March 29, 2018 (the data was made publicly available on GEO at accession GSE109774 on March 19, 2018). Without the data, namely the reads which were processed, it is not possible to verify or reproduce results from a paper, nor, in the case of single-cell transcriptomics, is it possible to build on work by uniformly processing it together with a new dataset for joint analysis. Notably, Quake’s false claim that the preprint and data was released in 2017 is a repeat of what he stated in a lecture titled “The Cell is a Bag of RNA” (an apt analogy of David Lilley). In the lecture Quake specifically said that “in this whole gap [from December 20, 2017 until the paper was published in October 2018] where normally you wouldn’t have access to the paper or the data, the whole world did because we put it on the [bio]arXiv… not just the manuscript but all the data.”
  • The error in describing the date when the Tabula Muris data was shared is significant in light of Quake’s narrative that Tabula Muris was “the first mammalian whole-organism cell atlas”. Quake describes another single-cell RNA-seq based mouse cell atlas published by Guoji Guo‘s group at Zhejiang University on February 22, 2018 (Han et al., Mapping the Mouse Cell Atlas by Microwell-Seq, Cell), as “further work”. Guo’s data was publicly available on GEO on February 14, 2018 (GSE 108097) along with the publication, a date that preceded the release of the Tabula Muris data. In fact, in the Tabula Muris preprint update on March 29, 2018, the Han et al. 2018 paper data is analyzed in conjunction with the Tabula Muris data, with the authors concluding that “independent datasets generated from various atlases that are beginning to arise can be combined and collectively analyzed…”. Thus, it was the Tabula Muris paper that was “further work” following the Han et al. 2018 paper, not the other way around. The timeline that Quake presents is shown on the left (screenshot from his “The Cell is a Bag of RNA” talk); the actual (edits by me in red) timeline is on the right:
  • Quake mischaracterizes another paper from the Guo group, namely Han et al., Construction of a human cell landscape at single-cell level, Nature, 2020. He refers to this paper as one of several that represent “a distinct strategy of compiling cell atlases one tissue at a time.” However the Han et al. 2020 paper analyzed samples of both fetal and adult tissue, and covered 60 human tissue types which were assayed in (2-4) replicates. Han et al. 2020 also examined several types of cell culture, including induced pluripotent stem cells, embryoid body cells, haematopoietic cells derived from co-cultures of human H9 and mouse OP9 cells, and pancreatic beta cells derived from H9 cells. The scope of the Han et al. 2020 paper is apparent in Figure 1 of their paper:
figure 1
  • Quake similarly misrepresents two other human cell atlas papers, namely He et al. Single-cell transcriptome profiling of an adult human cell atlas of 15 major organs, Genome Biology, 2020. This paper, whose title makes clear it mapped cell types in 15 organs, is also described by Quake as “representing a distinct strategy of compiling cell atlases one tissue at a time” implying that only one tissue or organ was assayed. The same is the case for Cao et al., A human cell atlas of fetal gene expression, Science 2020, which derived cell types using single-cell gene expression and chromatin landscape data from 15 organs.
  • When highlighting “his”Tabula Sapiens which was first preprinted in 2021 , Quake fails to mention the human cell atlas papers above, and instead mentions only preprints from the Broad and Sanger which were also published in 2021, and a paper from what he calls a “Swedish consortium” (the work was conceived and designed by Mathias Uhlén and Cecilia Lindskog). This omission makes it seem that Tabula Sapiens was the first human cell atlas to be published, along with a handful of others preprinted at the same time and one published concurrently, when in fact that was not the case.
  • Quake characterizes the Tabula Muris as “representing the first mammalian whole-organism cell atlas.” As noted above, it was not “first”, but priority claims aside, the description as a “whole-organism cell atlas” needs to be qualified. Here is how the project is characterized in the published paper: “Although these data are by no means a complete representation of all mouse organs and cell types, they provide a first draft attempt to create an organism-wide representation of cellular diversity.”
  • In reviewing the technology developments that led to high-throughput single-cell RNA-seq, Quake omits several important advances. There is a large body of work to refer to and cite, including several key advances in barcoding of beads to identify cells and barcoding for distinguishing molecules. For the latter, see, e.g., Shiroguchi et al., 2012 (from the lab of Sunney Xie).
  • The paper declares 2011-2012 to be “seminal years” in conceptualizing the notion of a transcriptomic cell atlas. While it’s true that those were “seminal” years for Quake when he published his own sperm (Wang et al., 2012), the timeline seems arbitrary and possibly self-serving. The Tang et al. paper in 2009 could just as well be taken as the starting point for “conceptualizing the notion of a transcriptomic cell atlas. Tang et al. write specifically that “For example, mouse embryonic stem cells, probably the most thoroughly analyzed type of stem cells, contain multiple subpopulations with strong differences in both gene expression and physiological function. Therefore, a more sensitive mRNA-Seq assay, ideally an assay capable of working at single cell resolution, is needed to meaningfully study crucial developmental processes and stem cell biology.” Similarly, Long et al., in “A 3D digital atlas of C. elegans and its application to single-cell analyses” published in 2009, were anticipating the notion of a transcriptomic cell atlas”, noting that their technology would be particularly useful for “high-throughput analysis of cellular information such as gene expression at single-cell resolution.” Alternatively, a reasonable starting point for consideration could be 2014, when cells actually started to be assayed en masse:
Figure from Svensson et al., 2020.
  • The Long et al. paper brings to the fore the field of spatial transcriptomics, which Quake ignores entirely in his review. However, conceptualization of the notion of a transcriptomic cell atlas was happening by scientists in that field; in fact spatial transcriptomics was arguably the domain where most of the ideas pervasive in single-cell genomics today originated (see, the Museum of Spatial Transcriptomics for a detailed history and review).
  • Another important omission in the Quake opinion is the discussion of the computational biology technologies crucial for cell atlases. None of the Tabula papers, or for that matter any of the single-cell transcriptomics papers that have been written during the past few years would have been possible without the Seurat and Scanpy programs from the Satija and Theis labs respectively. More importantly, the atlases themselves are, fundamentally, a product of the computational tools used to analyze the data. For example, in the Tabula Microcebus the annotated cell types were obtained by analyzing the 10X Genomics single-cell RNA-seq data “through dimensionality reduction by principal component analysis, visualization in 2-dimensional projections with t-Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), and clustering by the Louvain method.” These methods are tuned with numerous parameters and even evaluating when they are accurate is challenging (Kiselev et al., 2019). In regards to dimensionality reduction, there are numerous problems that have been documented with the t-SNE and UMAP leading to questions about results based on interpreting them (see, e.g., Chari et al., 2021, Cooley et al., 2022). The only mention of computational technology in the article is a comment that “Similarly, the development of new algorithms and computational approaches was also a powerful enabler of the field as it now exists.”
  • The omission of computational technology is chalked up to space constraints, yet there was apparently enough space for the Quake to narrate an origin story of the Human Cell Atlas project in which he centers himself, instead of Aviv Regev and Sarah Teichmann whose contribution was much more than to have “asked whether various efforts should be merged into an international collaboration”. They were early champions of a collaborative human cell atlas project and have co-chaired the organizing committee from the outset. Teichmann co-founded the Wellcome Trust Sanger Institute Single Cell Genomics Centre in 2013, and by 2015 had been awarded the EMBO Gold Medal in part for her contributions to, and vision for, single-cell transcriptomics. Regev pioneered many of the single-cell RNA-seq technology developments that enabled single-cell genomics, including single-cell studies of immune cells in 2013 and spatial single-cell RNA-seq in 2015. By the time of the inaugural Human Cell Atlas meeting in London in 2016, Regev had been widely publicizing a vision for a “periodic table of cells” and Teichmann had joined forces with her to develop a joint vision to accomplish the task (see article in the Pacific Standard, 2018).
  • There are a few minor errors in the paper. Quake writes that “These [microfluidics automation technologies] were eventually commercialized by a company I founded called Fluidigm..”. In fact, Quake did not found Fluidigm by himself; the company was co-founded with Gajus Worthington. Miriam Merad’s name is incorrectly spelled as “Meriam” and Christophe Benoist’s name is incorrectly spelled as “Benoiste”. A recurring typo is the misspelling of Sarah Teichmann’s name. It is incorrectly spelled “Teichman” three times throughout the manuscript, including in the Acknowledgment section where she is thanked for specific comments on the manuscript.

A broader point regarding cell atlases is that defining cell types, distinguishing cell types from cell states, and comprehensively organizing cells in any species in a meaningful framework, is a monumental task that we are only beginning to tackle. There are no definitive human or mouse cell atlases yet, and there won’t be for some time. Among the “atlases” published so far there is little consensus. The Tabula Muris, cell atlas annotates far fewer cell types than Han et al., 2018, perhaps because the latter assayed many more cells. Similarly, the fly cell atlas by Li et al., 2021 lists ~250 cell types in comparison to the Tabula Sapiens that finds ~400 in human. Perhaps these similar numbers do not reflect fundamental shared biology or a universal organizing principle for cells, but rather the fact that both projects sequenced similar numbers of cells (~580k vs. 500k respectively). Unsurprisingly, the number of annotated cell types in publications is strongly correlated with the number of cells assayed:

Cluster and cell numbers. The number of cells studied versus the number of clusters or cell types reported in a study. Red curves correspond to linear regression stratified to five quantiles of ‘Reported cells total’.
Figure from Svensson et al., 2020.

The brain presents an especially daunting challenge. An entire recent issue of Nature was devoted to only one region: the primary motor cortex. Frankly, opinion pieces elbowing for priority claims are neither appropriate nor interesting. To the extent that the human cell atlas will ever become a meaningful accomplishment it will have been a project without a single winner. Instead, it will have been a collaborative effort of thousands of scientists from across the world who will have deepened our understanding of biology to the benefit of all.

The arXiv preprint server has its roots in an “e-print” email list curated by astrophysicist Joanne Cohn, who in 1989 had the idea of organizing the sharing of preprints among her physics colleagues. In her recollections of the dawn of the arXiv,  she mentions that “at one point two people put out papers on the list on the same topic within a few days of each other” and that her “impression was that because of the worldwide reach of [her] distribution list, people realized it could be a way to establish precedence for research.” In biology, where many areas are crowded and competitive, the ability to time stamp research before a possibly lengthy journal review and publication process is almost certainly one of the driving forces behind the rapid growth of the bioRxiv (ASAPbio 2016 and Vale & Hyman, 2016).

However the ability to establish priority with preprints is not, in my opinion, what makes them important for science. Rather, the value of preprints is in their ability to accelerate research via the rapid dissemination of methods and discoveries. This point was eloquently made by Stephen Quake, co-president of the Chan Zuckerberg Biohub, at a Caltech Kavli Nanonscience Institute Distinguished Seminar Series talk earlier this year. He demonstrated the impact of preprints and of sharing data prior to journal publication by way of example, noting that posting of the CZ Biohub “Tabula Muris” preprint along with the data directly accelerated two different unrelated projects: Cusanovich et al. 2018 and La Manno et al. 2018. In fact, in the case of La Manno et al. 2018, Quake revealed that one of the corresponding authors of the paper, Sten Linnarsson, had told him that “[he] couldn’t get the paper past the referees without using all of the [Tabula Muris] data”:

Moreover, Quake made clear that the open science principles practiced with the Tabula Muris preprint were not just a one-off experiment, but fundamental Chan Zuckerberg Initiative (CZI) values that are required for all CZI internal research and for publications arising from work the CZI supports: “[the CZI has] taken a pretty aggressive policy about publication… people have to agree to use biorXiv or a preprint server to share results… and the hope is that this is going to accelerate science because you’ll learn about things sooner and be able to work on them”:

Indeed, on its website the CZI lists four values that guide its mission and one of them is “Open Science”:

Open Science
The velocity of science and pace of discovery increase as scientists build on each others’ discoveries. Sharing results, open-source software, experimental methods, and biological resources as early as possible will accelerate progress in every area.

This is a strong and direct rebuttal to Dan Longo and Jeffrey Drazen’s “research parasite” fear mongering in The New England Journal of Medicine.



I was therefore disappointed with the CZI after failing, for the past two months, to obtain the code and data for the preprint “A molecular cell atlas of the human lung from single cell RNA sequencing” by Travaglini, Nabhan et al. (the preprint was posted on the bioRxiv on August 27th 2019). The interesting preprint describes an atlas of 58 cell populations in the human lung, which include 41 of 45 previously characterized cell types or subtypes and the discovery of 14 new ones. Of particular interest to me, in light of some ongoing projects in my lab, is a comparative analysis examining cell type concordance between human and mouse. Travaglini, Nabhan et al. note that 17 molecular types have been gained or lost since the divergence of human and mouse. The results are based on large-scale single-cell RNA-seq (using two technologies) of ~70,000 human and lung peripheral blood cells.

The comparative analysis is detailed in Extended Data Figure S5 (reproduced below), which shows scatter plots of (log) gene counts for homologous human and mouse cell types. For each pair of cell types, a sub-figures also shows the correlation between gene expression and divergent genes are highlighted:


I wanted to understand the details behind this figure: how exactly were cell types defined and homologous cell types identified? What was the precise thresholding for “divergent” genes? How were the ln(CPM+1) expression units computed? Some aspects of these questions have answers in the Methods section of the preprint, but I wanted to know exactly; I needed to see the code. For example, the manuscript describes the cluster selection procedure as follows: “Clusters of similar cells were detected using the Louvain method for community detection including only biologically meaningful principle [sic] components (see below)” and looking “below” for the definition of “biologically meaningful” I only found a descriptive explanation illustrated with an example, but with no precise specification provided. I also wanted to explore the data. We have been examining some approaches for cross-species single-cell analysis and this preprint describes an exceptionally useful dataset for this purpose. Thus, access to the software and data used for the preprint would accelerate the research in my lab.

But while the preprint has a sentence with a link to the software (“Code for demultiplexing counts/UMI tables, clustering, annotation, and other downstream analyses are available on GitHub (”) clicking on the link merely sends one to the Github Octocat.

Screen Shot 2019-10-16 at 4.01.58 PM

The Travaglini, Nabhan et al. Github repository that is supposed to contain the analysis code is nowhere to be found. The data is also not available in any form. The preprint states that “Raw sequencing data, alignments, counts/UMI tables, and cellular metadata are available on GEO (accession GEOXX),” The only data a search for GEOXX turns up is a list of prices on a shoe website.

I wrote to the authors of Travaglini, Nabhan et al. right after their preprint appeared noting the absence of code and data and asking for both. I was told by one of the first co-authors that they were in the midst of uploading the materials, but that the decision of whether to share them would have to be made by the corresponding authors. Almost two months later, after repeated requests, I have yet to receive anything. My initial excitement for the Travaglini, Nabhan et al. single-cell RNA-seq has turned into disappointment at their zero-data RNA-seq.

🦗 🦗 🦗 🦗 🦗 

This state of affairs, namely the posting of bioRxiv preprints without data or code, is far too commonplace. I was first struck with the extent of the problem last year when the Gupta, Collier et al. 2018 preprint was posted without a Methods section (let alone with data or code). Also problematic was that the preprint was posted just three months before publication while the journal submission was under review. I say problematic because not sharing code, not sharing software, not sharing methods, and not posting the preprint at the time of submission to a journal does not accelerate progress in science (see the CZI Open Science values statement above).

The Gupta, Collier et al. preprint was not a CZI related preprint but the Travaglini, Nabhan et al. preprint is. Specifically, Travaglini, Nabhan et al. 2019 is a collaboration between CZ Biohub and Stanford University researchers, and the preprint appears on the Chan Zuckerberg Biohub bioRxiv channel:

Screen Shot 2019-10-18 at 10.57.52 PM

The Travaglini, Nabhan et al. 2019 preprint is also not an isolated example; another recent CZ Biohub preprint from the same lab, Horns et al. 2019,  states explicitly that “Sequence data, preprocessed data, and code will be made freely available [only] at the time of [journal] publication.” These are cases where instead of putting its money where its mouth is, the mouth took the money, ate it, and spat out a 404 error.

angry meryl streep GIF

To be fair, sharing data, software and methods is difficult. Human data must sometimes be protected due to confidentiality constraints, thus requiring controlled access with firewalls such as dbGaP that can be difficult to set up. Even with unrestricted data, sharing can be cumbersome. For example, the SRA upload process is notoriously difficult to manage, and the lack of metadata standards can make organizing experimental data, especially sequencing data, complicated and time consuming. The sharing of experimental protocols can be challenging when they are in flux and still being optimized while work is being finalized. And when it comes to software, ensuring reproducibility and usability can take months of work in the form of wrangling Snakemake and other workflows, not to mention the writing of documentation. Practicing Open Science, I mean really doing it, is difficult work. There is a lot more to it than just dumping an advertisement on the bioRxiv to collect a timestamp. By not sharing their data or software, preprints such as Travaglini, Nabhan et al. 2019 and Horns et al. 2019 appear to be little more than a cynical attempt to claim priority.

It would be great if the CZI, an initiative backed by billions of dollars with hundreds of employees, would truly champion Open Science. The Tabula Muris preprint is a great example of how preprints that are released with data and software can accelerate progress in science. But Tabula Muris seems to be an exception for CZ Biohub researchers rather than the rule, and actions speak louder than a website with a statement about Open Science values.

A few months ago, in July 2019, I wrote a series of five posts about the Melsted, Booeshaghi et al. 2019 preprint on this blog, including a post focusing on a new fast workflow for RNA velocity based on kallisto | bustools.  This new workflow replaces the velocyto software published with the RNA velocity paper (La Manno et al. 2018), in the sense that the kallisto | bustools is more than an order of magnitude faster than velocyto (and Cell Ranger which it builds on), while producing near identical results:


In my blogpost, I showed how we were able to utilize this much faster workflow to easily produce RNA velocity for a large dataset of 113,917 cells from Clark et al. 2019, a dataset that was intractable with Cell Ranger + velocyto.

The kallisto | bustools RNA velocity workflow makes use of two novel developments: a new mode in kallisto called kallisto bus that produces BUS files from single-cell RNA-seq data, and a new collection of new C++ programs forming “bustools” that can be used to process BUS files to produce count matrices. The RNA velocity workflow utilizes the bustools sort, correct, capture and count commands. With these tools an RNA velocity analysis that previously took a day now takes minutes. The workflow, associated tools and validations are described in the Melsted, Booeshaghi et al. 2019 preprint.

Now, in an interesting exercise presented at the Single Cell Genomics 2019 meeting, Sten Linnarsson revealed that he reimplemented the kallisto | bustools workflow in Loompy. Loompy, which previously consisted of some Python tools for reading and writing Loom files, now has a function that runs kallisto bus. It also consists of Python functions that are used to manipulate BUS files; these are Python reimplementations of the bustools functions needed for RNA velocity and produce the same output as kallisto | bustools. It is therefore possible to now answer a question I know has been on many minds… one that has been asked before but not to my knowledge, in the single-cell RNA-seq setting… is Python really faster than C++ ?

To answer this question we (this is an analysis performed with Sina Booeshaghi), performed an apples-to-apples comparison running kallisto | bustools and Loompy on exactly the same data, with the same hardware. We pre-processed both the human forebrain data from La Manno et al. 2018,  and data from Oetjen et al. 2018 consisting of 498,303,099 single-cell RNA-seq reads sequenced from a cDNA library of human bone marrow (SRA accession SRR7881412; see also PanglaoDB).

First, we examined the correspondence between Loopy and bustools on the human forebrain data. As expected, given that the Loompy code first runs the same kallisto as in the kallisto | bustools workflow, and then reimplements bustools, the results are the near identical. In the first plot every dot is a cell (as defined by the velocyto output from La Manno et al. 2018) and the number of counts produced by each method is shown. In the second, the correlation between gene counts in each cell are plotted:

The figures above are produced from the “spliced” RNA velocity matrix. We also examined the “unspliced” matrix, with similar results:

In runtime benchmarks on the Oetjen et al. 2018 data we found that kallisto | bustools runs 3.75 faster than Loompy (note that by default Loompy runs kallisto with all available threads, so we modified the Loompy source code to create a fair comparison). Furthermore, kallisto | bustools requires 1.8 times less RAM. In other words, despite rumors to the contrary, Python is indeed slower than C++ ! 

Of course, sometimes there is utility in reimplementing software in another language, even a slower one. For example, a reimplementation of C++ code could lead to a simpler workflow in a higher level language. That’s not the case here. The memory efficiency of kallisto | bustools makes possible the simplest user interface imaginable: a kallisto | bustools based Google Colab notebook allows for single-cell RNA-seq pre-processing in the cloud on a web browser without a personal computer.

At the recent Single Cell Genomics 2019 meeting, Linnarsson’s noted that Cell Ranger + veloctyto has been replaced by kallisto | bustools:

image (9).png

Indeed, as I wrote on my blog shortly after the publication of Melsted, Booeshaghi et al., 2019, RNA velocity calculations that were previously intractable on large datasets are now straightforward. Linnarsson is right. Users should always adopt best-in-class tools in favor of methods that underperform in accuracy, efficiency, or both. #methodsmatter

This post is the fifth in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

  1. Near-optimal pre-processing of single-cell RNA-seq
  2. Single-cell RNA-seq for dummies
  3. How to solve an NP-complete problem in linear time
  4. Rotating the knee (plot) and related yoga
  5. High velocity RNA velocity

The following passage about Beethoven’s fifth symphony was written by one of my favorite musicologists:

“No great music has ever been built from an initial figure of four notes. As I have said elsewhere, you might as well say that every piece of music is built from an initial figure of one note. You may profitably say that the highest living creatures have begun from a single nucleated cell. But no ultra-microscope has yet unraveled the complexities of the single living cell; nor, if the spectroscope is to be believed, are we yet very full informed of the complexities of a single atom of iron : and it is quite absurd to suppose that the evolution of a piece of music can proceed from a ‘simple figure of four notes’ on lines in the least resembling those of nature.” – Donald Francis Tovey writing about Beethoven’s Fifth Symphony in Essays in Musical Analysis Volume I, 1935.

This passage conveys something true about Beethoven’s fifth symphony: an understanding of it cannot arise from a limited fixation on the famous four note motif. As far as single-cell biology goes, I don’t know whether Tovey was familiar with Theodor Boveri‘s sea urchin experiments, but he certainly hit upon a scientific truth as well: single cells cannot be understood in isolation. Key to understanding them is context (Eberwine et al., 2013).

RNA velocity, with roots in the work of Zeisel et al., 2011, has been recently adapted for single-cell RNA-seq by La Manno et al. 2018, and provides much needed context for interpreting the transcriptomes of single-cells in the form of a dynamics overlay. Since writing a review about the idea last year (Svensson and Pachter, 2019), I’ve become increasingly convinced that the method, despite relying on sparse data, numerous very strong model assumptions, and lots of averaging, is providing meaningful biological insight. For example, in a recent study of spermatogonial stem cells (Guo et al. 2018), the authors describe two “unexpected” transitions between distinct states of cells that are revealed by RNA velocity analysis (panel a from their Figure 6, see below):


Producing an RNA velocity analysis currently requires running the programs Cell Ranger followed by velocyto. These programs are both very slow. Cell Ranger’s running time scales at about 3 hours per hundred million reads (see Supplementary Table 1 Melsted, Booeshaghi et al., 2019). The subsequent velocyto run is also slow. The authors describe it as taking “approximately 3 hours” but anecdotally the running time can be much longer on large datasets. The programs also require lots of memory.

To facilitate rapid and large-scale RNA velocity analysis, in Melsted, Booeshaghi et al., 2019  we describe a kallisto|bustools workflow that makes possible efficient RNA velocity computations at least an order of magnitude faster than with Cell Ranger and velocyto. The work, a tour-de-force of development, testing and validation, was primarily that of Sina Booeshaghi. Páll Melsted implemented the bustools capture command and Kristján Hjörleifsson assisted with identifying and optimizing the indices for pseudoalignment. We present analysis on two datasets in the paper. The first is single-cell RNA-seq from retinal development recently published in Clark et al. 2019. This is a beautiful paper- and I don’t mean just in terms of the results. Their data and results are extremely well organized making their paper reproducible. This is so important it merits a shout out 👏🏾

See Clark et al. 2019‘s  GEO GSE 118614 for a well-organized and useful data share.

The figure below shows RNA velocity vectors overlaid on UMAP coordinates for Clark et al.’s 10 stage time series of retinal development (see cell [8] in our python notebook):


An overlap on the same UMAP with cells colored by type is shown below:


Clark et al. performed a detailed pseudotime analysis in their paper, which successfully identified genes associated with cell changes during development. This is a reproduction of their figure 2:


We examined the six genes from their panel C from a velocity point of view using the scvelo package and the results are beautiful:


What can be seen with RNA velocity is not only the changes in expression that are extracted from pseudotime analysis (Clark et al. 2019 Figure 2 panel C), but also changes in their velocity, i.e. their acceleration (middle column above). RNA velocity adds an interesting dimension to the analysis.

To validate that our kallisto|bustools RNA velocity workflow provides results consistent with velocyto, we performed a direct comparison with the developing human forebrain dataset published by La Manno et al. in the original RNA velocity paper (La Manno et al. 2018 Figure 4).


The results are concordant, not only in terms of the displayed vectors, but also, crucially, in the estimation of the underlying phase diagrams (the figure below shows a comparison for the same dataset; kallisto on the left, Cell Ranger + velocyto on the right):


Digging deeper into the data, one difference we found between the workflows (other than speed) is the number of reads counts. We implemented a simple strategy to estimate the required spliced and unspliced matrices that attempts to follow the one described in the La Manno et al. paper, where the authors describe the rules for characterizing reads as spliced vs. unspliced as follows:

1. A molecule was annotated as spliced if all of the reads in the set supporting a given molecule map only to the exonic regions of the compatible transcripts.
2. A molecule was annotated as unspliced if all of the compatible transcript models had at least one read among the supporting set of reads for this molecule mapping that i) spanned exon-intron boundary, or ii) mapped to the intron of that transcript.

In the kallisto|bustools workflow this logic was implemented via the bustools capture command which was first use to identify all reads that were compatible only with exons (i.e. there was no pseudoalignment to any intron) and then all reads that were compatible only with introns  (i.e. there was no pseudoalignment completely within an exon). While our “spliced matrices” had similar numbers of counts, our “unspliced matrices” had considerably more (see Melsted, Booeshaghi et al. 2019 Supplementary Figure 10A and B):


To understand the discrepancy better we investigated the La Manno et al. code, and we believe that differences arise from the velocyto package code in which the same count function

def count(self, molitem: vcy.Molitem, cell_bcidx: int, dict_layers_columns: Dict[str, np.ndarray], geneid2ix: Dict[str, int])

appears 8 times and each version appears to implement a slightly different “logic” than described in the methods section.

A tutorial showing how to efficiently perform RNA velocity is available on the kallisto|bustools website. There is no excuse not to examine cells in context.


Blog Stats

%d bloggers like this: