You are currently browsing the category archive for the ‘*Seq’ category.

The arXiv preprint server has its roots in an “e-print” email list curated by astrophysicist Joanne Cohn, who in 1989 had the idea of organizing the sharing of preprints among her physics colleagues. In her recollections of the dawn of the arXiv,  she mentions that “at one point two people put out papers on the list on the same topic within a few days of each other” and that her “impression was that because of the worldwide reach of [her] distribution list, people realized it could be a way to establish precedence for research.” In biology, where many areas are crowded and competitive, the ability to time stamp research before a possibly lengthy journal review and publication process is almost certainly one of the driving forces behind the rapid growth of the bioRxiv (ASAPbio 2016 and Vale & Hyman, 2016).

However the ability to establish priority with preprints is not, in my opinion, what makes them important for science. Rather, the value of preprints is in their ability to accelerate research via the rapid dissemination of methods and discoveries. This point was eloquently made by Stephen Quake, co-president of the Chan Zuckerberg Biohub, at a Caltech Kavli Nanonscience Institute Distinguished Seminar Series talk earlier this year. He demonstrated the impact of preprints and of sharing data prior to journal publication by way of example, noting that posting of the CZ Biohub “Tabula Muris” preprint along with the data directly accelerated two different unrelated projects: Cusanovich et al. 2018 and La Manno et al. 2018. In fact, in the case of La Manno et al. 2018, Quake revealed that one of the corresponding authors of the paper, Sten Linnarsson, had told him that “[he] couldn’t get the paper past the referees without using all of the [Tabula Muris] data”:

Moreover, Quake made clear that the open science principles practiced with the Tabula Muris preprint were not just a one-off experiment, but fundamental Chan Zuckerberg Initiative (CZI) values that are required for all CZI internal research and for publications arising from work the CZI supports: “[the CZI has] taken a pretty aggressive policy about publication… people have to agree to use biorXiv or a preprint server to share results… and the hope is that this is going to accelerate science because you’ll learn about things sooner and be able to work on them”:

Indeed, on its website the CZI lists four values that guide its mission and one of them is “Open Science”:

Open Science
The velocity of science and pace of discovery increase as scientists build on each others’ discoveries. Sharing results, open-source software, experimental methods, and biological resources as early as possible will accelerate progress in every area.

This is a strong and direct rebuttal to Dan Longo and Jeffrey Drazen’s “research parasite” fear mongering in The New England Journal of Medicine.

I was therefore disappointed with the CZI after failing, for the past two months, to obtain the code and data for the preprint “A molecular cell atlas of the human lung from single cell RNA sequencing” by Travaglini, Nabhan et al. (the preprint was posted on the bioRxiv on August 27th 2019). The interesting preprint describes an atlas of 58 cell populations in the human lung, which include 41 of 45 previously characterized cell types or subtypes and the discovery of 14 new ones. Of particular interest to me, in light of some ongoing projects in my lab, is a comparative analysis examining cell type concordance between human and mouse. Travaglini, Nabhan et al. note that 17 molecular types have been gained or lost since the divergence of human and mouse. The results are based on large-scale single-cell RNA-seq (using two technologies) of ~70,000 human and lung peripheral blood cells.

The comparative analysis is detailed in Extended Data Figure S5 (reproduced below), which shows scatter plots of (log) gene counts for homologous human and mouse cell types. For each pair of cell types, a sub-figures also shows the correlation between gene expression and divergent genes are highlighted:

I wanted to understand the details behind this figure: how exactly were cell types defined and homologous cell types identified? What was the precise thresholding for “divergent” genes? How were the ln(CPM+1) expression units computed? Some aspects of these questions have answers in the Methods section of the preprint, but I wanted to know exactly; I needed to see the code. For example, the manuscript describes the cluster selection procedure as follows: “Clusters of similar cells were detected using the Louvain method for community detection including only biologically meaningful principle [sic] components (see below)” and looking “below” for the definition of “biologically meaningful” I only found a descriptive explanation illustrated with an example, but with no precise specification provided. I also wanted to explore the data. We have been examining some approaches for cross-species single-cell analysis and this preprint describes an exceptionally useful dataset for this purpose. Thus, access to the software and data used for the preprint would accelerate the research in my lab.

But while the preprint has a sentence with a link to the software (“Code for demultiplexing counts/UMI tables, clustering, annotation, and other downstream analyses are available on GitHub (https://github.com/krasnowlab/HLCA)”) clicking on the link merely sends one to the Github Octocat.

The Travaglini, Nabhan et al. Github repository that is supposed to contain the analysis code is nowhere to be found. The data is also not available in any form. The preprint states that “Raw sequencing data, alignments, counts/UMI tables, and cellular metadata are available on GEO (accession GEOXX),” The only data a search for GEOXX turns up is a list of prices on a shoe website.

I wrote to the authors of Travaglini, Nabhan et al. right after their preprint appeared noting the absence of code and data and asking for both. I was told by one of the first co-authors that they were in the midst of uploading the materials, but that the decision of whether to share them would have to be made by the corresponding authors. Almost two months later, after repeated requests, I have yet to receive anything. My initial excitement for the Travaglini, Nabhan et al. single-cell RNA-seq has turned into disappointment at their zero-data RNA-seq.

🦗 🦗 🦗 🦗 🦗

This state of affairs, namely the posting of bioRxiv preprints without data or code, is far too commonplace. I was first struck with the extent of the problem last year when the Gupta, Collier et al. 2018 preprint was posted without a Methods section (let alone with data or code). Also problematic was that the preprint was posted just three months before publication while the journal submission was under review. I say problematic because not sharing code, not sharing software, not sharing methods, and not posting the preprint at the time of submission to a journal does not accelerate progress in science (see the CZI Open Science values statement above).

The Gupta, Collier et al. preprint was not a CZI related preprint but the Travaglini, Nabhan et al. preprint is. Specifically, Travaglini, Nabhan et al. 2019 is a collaboration between CZ Biohub and Stanford University researchers, and the preprint appears on the Chan Zuckerberg Biohub bioRxiv channel:

The Travaglini, Nabhan et al. 2019 preprint is also not an isolated example; another recent CZ Biohub preprint from the same lab, Horns et al. 2019,  states explicitly that “Sequence data, preprocessed data, and code will be made freely available [only] at the time of [journal] publication.” These are cases where instead of putting its money where its mouth is, the mouth took the money, ate it, and spat out a 404 error.

To be fair, sharing data, software and methods is difficult. Human data must sometimes be protected due to confidentiality constraints, thus requiring controlled access with firewalls such as dbGaP that can be difficult to set up. Even with unrestricted data, sharing can be cumbersome. For example, the SRA upload process is notoriously difficult to manage, and the lack of metadata standards can make organizing experimental data, especially sequencing data, complicated and time consuming. The sharing of experimental protocols can be challenging when they are in flux and still being optimized while work is being finalized. And when it comes to software, ensuring reproducibility and usability can take months of work in the form of wrangling Snakemake and other workflows, not to mention the writing of documentation. Practicing Open Science, I mean really doing it, is difficult work. There is a lot more to it than just dumping an advertisement on the bioRxiv to collect a timestamp. By not sharing their data or software, preprints such as Travaglini, Nabhan et al. 2019 and Horns et al. 2019 appear to be little more than a cynical attempt to claim priority.

It would be great if the CZI, an initiative backed by billions of dollars with hundreds of employees, would truly champion Open Science. The Tabula Muris preprint is a great example of how preprints that are released with data and software can accelerate progress in science. But Tabula Muris seems to be an exception for CZ Biohub researchers rather than the rule, and actions speak louder than a website with a statement about Open Science values.

A few months ago, in July 2019, I wrote a series of five posts about the Melsted, Booeshaghi et al. 2019 preprint on this blog, including a post focusing on a new fast workflow for RNA velocity based on kallisto | bustools.  This new workflow replaces the velocyto software published with the RNA velocity paper (La Manno et al. 2018), in the sense that the kallisto | bustools is more than an order of magnitude faster than velocyto (and Cell Ranger which it builds on), while producing near identical results:

In my blogpost, I showed how we were able to utilize this much faster workflow to easily produce RNA velocity for a large dataset of 113,917 cells from Clark et al. 2019, a dataset that was intractable with Cell Ranger + velocyto.

The kallisto | bustools RNA velocity workflow makes use of two novel developments: a new mode in kallisto called kallisto bus that produces BUS files from single-cell RNA-seq data, and a new collection of new C++ programs forming “bustools” that can be used to process BUS files to produce count matrices. The RNA velocity workflow utilizes the bustools sort, correct, capture and count commands. With these tools an RNA velocity analysis that previously took a day now takes minutes. The workflow, associated tools and validations are described in the Melsted, Booeshaghi et al. 2019 preprint.

Now, in an interesting exercise presented at the Single Cell Genomics 2019 meeting, Sten Linnarsson revealed that he reimplemented the kallisto | bustools workflow in Loompy. Loompy, which previously consisted of some Python tools for reading and writing Loom files, now has a function that runs kallisto bus. It also consists of Python functions that are used to manipulate BUS files; these are Python reimplementations of the bustools functions needed for RNA velocity and produce the same output as kallisto | bustools. It is therefore possible to now answer a question I know has been on many minds… one that has been asked before but not to my knowledge, in the single-cell RNA-seq setting… is Python really faster than C++ ?

To answer this question we (this is an analysis performed with Sina Booeshaghi), performed an apples-to-apples comparison running kallisto | bustools and Loompy on exactly the same data, with the same hardware. We pre-processed both the human forebrain data from La Manno et al. 2018,  and data from Oetjen et al. 2018 consisting of 498,303,099 single-cell RNA-seq reads sequenced from a cDNA library of human bone marrow (SRA accession SRR7881412; see also PanglaoDB).

First, we examined the correspondence between Loopy and bustools on the human forebrain data. As expected, given that the Loompy code first runs the same kallisto as in the kallisto | bustools workflow, and then reimplements bustools, the results are the near identical. In the first plot every dot is a cell (as defined by the velocyto output from La Manno et al. 2018) and the number of counts produced by each method is shown. In the second, the correlation between gene counts in each cell are plotted:

The figures above are produced from the “spliced” RNA velocity matrix. We also examined the “unspliced” matrix, with similar results:

In runtime benchmarks on the Oetjen et al. 2018 data we found that kallisto | bustools runs 3.75 faster than Loompy (note that by default Loompy runs kallisto with all available threads, so we modified the Loompy source code to create a fair comparison). Furthermore, kallisto | bustools requires 1.8 times less RAM. In other words, despite rumors to the contrary, Python is indeed slower than C++ !

Of course, sometimes there is utility in reimplementing software in another language, even a slower one. For example, a reimplementation of C++ code could lead to a simpler workflow in a higher level language. That’s not the case here. The memory efficiency of kallisto | bustools makes possible the simplest user interface imaginable: a kallisto | bustools based Google Colab notebook allows for single-cell RNA-seq pre-processing in the cloud on a web browser without a personal computer.

At the recent Single Cell Genomics 2019 meeting, Linnarsson’s noted that Cell Ranger + veloctyto has been replaced by kallisto | bustools:

Indeed, as I wrote on my blog shortly after the publication of Melsted, Booeshaghi et al., 2019, RNA velocity calculations that were previously intractable on large datasets are now straightforward. Linnarsson is right. Users should always adopt best-in-class tools in favor of methods that underperform in accuracy, efficiency, or both. #methodsmatter

This post is the fifth in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

The following passage about Beethoven’s fifth symphony was written by one of my favorite musicologists:

“No great music has ever been built from an initial figure of four notes. As I have said elsewhere, you might as well say that every piece of music is built from an initial figure of one note. You may profitably say that the highest living creatures have begun from a single nucleated cell. But no ultra-microscope has yet unraveled the complexities of the single living cell; nor, if the spectroscope is to be believed, are we yet very full informed of the complexities of a single atom of iron : and it is quite absurd to suppose that the evolution of a piece of music can proceed from a ‘simple figure of four notes’ on lines in the least resembling those of nature.” – Donald Francis Tovey writing about Beethoven’s Fifth Symphony in Essays in Musical Analysis Volume I, 1935.

This passage conveys something true about Beethoven’s fifth symphony: an understanding of it cannot arise from a limited fixation on the famous four note motif. As far as single-cell biology goes, I don’t know whether Tovey was familiar with Theodor Boveri‘s sea urchin experiments, but he certainly hit upon a scientific truth as well: single cells cannot be understood in isolation. Key to understanding them is context (Eberwine et al., 2013).

RNA velocity, with roots in the work of Zeisel et al., 2011, has been recently adapted for single-cell RNA-seq by La Manno et al. 2018, and provides much needed context for interpreting the transcriptomes of single-cells in the form of a dynamics overlay. Since writing a review about the idea last year (Svensson and Pachter, 2019), I’ve become increasingly convinced that the method, despite relying on sparse data, numerous very strong model assumptions, and lots of averaging, is providing meaningful biological insight. For example, in a recent study of spermatogonial stem cells (Guo et al. 2018), the authors describe two “unexpected” transitions between distinct states of cells that are revealed by RNA velocity analysis (panel a from their Figure 6, see below):

Producing an RNA velocity analysis currently requires running the programs Cell Ranger followed by velocyto. These programs are both very slow. Cell Ranger’s running time scales at about 3 hours per hundred million reads (see Supplementary Table 1 Melsted, Booeshaghi et al., 2019). The subsequent velocyto run is also slow. The authors describe it as taking “approximately 3 hours” but anecdotally the running time can be much longer on large datasets. The programs also require lots of memory.

To facilitate rapid and large-scale RNA velocity analysis, in Melsted, Booeshaghi et al., 2019  we describe a kallisto|bustools workflow that makes possible efficient RNA velocity computations at least an order of magnitude faster than with Cell Ranger and velocyto. The work, a tour-de-force of development, testing and validation, was primarily that of Sina Booeshaghi. Páll Melsted implemented the bustools capture command and Kristján Hjörleifsson assisted with identifying and optimizing the indices for pseudoalignment. We present analysis on two datasets in the paper. The first is single-cell RNA-seq from retinal development recently published in Clark et al. 2019. This is a beautiful paper- and I don’t mean just in terms of the results. Their data and results are extremely well organized making their paper reproducible. This is so important it merits a shout out 👏🏾

See Clark et al. 2019‘s  GEO GSE 118614 for a well-organized and useful data share.

The figure below shows RNA velocity vectors overlaid on UMAP coordinates for Clark et al.’s 10 stage time series of retinal development (see cell [8] in our python notebook):

An overlap on the same UMAP with cells colored by type is shown below:

Clark et al. performed a detailed pseudotime analysis in their paper, which successfully identified genes associated with cell changes during development. This is a reproduction of their figure 2:

We examined the six genes from their panel C from a velocity point of view using the scvelo package and the results are beautiful:

What can be seen with RNA velocity is not only the changes in expression that are extracted from pseudotime analysis (Clark et al. 2019 Figure 2 panel C), but also changes in their velocity, i.e. their acceleration (middle column above). RNA velocity adds an interesting dimension to the analysis.

To validate that our RNA velocity workflow provides results consistent with velocyto, we performed a direct comparison with the developing human forebrain dataset published by La Manno et al. in the original RNA velocity paper (La Manno et al. 2018 Figure 4).

The results are concordant, not only in terms of the displayed vectors, but also, crucially, in the estimation of the underlying phase diagrams (the figure below shows a comparison for the same dataset; kallisto on the left, Cell Ranger + velocyto on the right):

Digging deeper into the data, one difference we found between the workflows (other than speed) is the number of reads counts. We implemented a simple strategy to estimate the required spliced and unspliced matrices that attempts to follow the one described in the La Manno et al. paper, where the authors describe the rules for characterizing reads as spliced vs. unspliced as follows:

1. A molecule was annotated as spliced if all of the reads in the set supporting a given molecule map only to the exonic regions of the compatible transcripts.
2. A molecule was annotated as unspliced if all of the compatible transcript models had at least one read among the supporting set of reads for this molecule mapping that i) spanned exon-intron boundary, or ii) mapped to the intron of that transcript.

In the kallisto|bustools workflow this logic was implemented via the bustools capture command which was first use to identify all reads that were compatible only with exons (i.e. there was no pseudoalignment to any intron) and then all reads that were compatible only with introns  (i.e. there was no pseudoalignment completely within an exon). While our “spliced matrices” had similar numbers of counts, our “unspliced matrices” had considerably more (see Melsted, Booeshaghi et al. 2019 Supplementary Figure 10A and B):

To understand the discrepancy better we investigated the La Manno et al. code, and we believe that differences arise from the velocyto package logic.py code in which the same count function

###### def count(self, molitem: vcy.Molitem, cell_bcidx: int, dict_layers_columns: Dict[str, np.ndarray], geneid2ix: Dict[str, int])

appears 8 times and each version appears to implement a slightly different “logic” than described in the methods section.

A tutorial showing how to efficiently perform RNA velocity is available on the kallisto|bustools website. There is no excuse not to examine cells in context.

This post is the fourth in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

1. Near-optimal pre-processing of single-cell RNA-seq
2. Single-cell RNA-seq for dummies
3. How to solve an NP-complete problem in linear time
4. Rotating the knee (plot) and related yoga
5. High velocity RNA velocity

The “knee plot” is a standard single-cell RNA-seq quality control that is also used to determine a threshold for considering cells valid for analysis in an experiment. To make the plot, cells are ordered on the x-axis according to the number of distinct UMIs observed. The y-axis displays the number of distinct UMIs for each barcode (here barcodes are proxies for cells). The following example is from Aaron Lun’s DropletUtils vignette:

A single-cell RNA-seq knee plot.

High quality barcodes are located on the left hand side of the plot, and thresholding is performed by identifying the “knee” on the curve. On the right hand side, past the inflection point, are barcodes which have relatively low numbers of reads, and are therefore considered to have had failure in capture and to be too noisy for further analysis.

In Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019, we display a series of plots for a benchmark panel of 20 datasets, and the first plot in each panel (subplot A)is a knee plot. The following example is from an Arabidopsis thaliana dataset (Ryu et al., 2019; SRR8257100)

Careful examination of our plots shows that unlike the typical knee plot made for single-cell RNA-seq , ours has the x- and y- axes transposed. In our plot the x-axis displays the number of distinct UMI counts, and the y-axis corresponds to the barcodes, ordered from those with the most UMIs (bottom) to the least (top). The figure below shows both versions of a knee plot for the same data (the “standard” one in blue, our transposed plot in red):

Why bother transposing a plot?

We begin by observing that if one ranks barcodes according to the number of distinct UMIs associated with them (from highest to lowest), then the rank of a barcode with x distinct UMIs is given by f(x) where

$f(x) = |\{c:\# \mbox{UMIs} \in c \geq x\}|$.

In other words, the rank of a barcode is interpretable as the size of a certain set. Now suppose that instead of only measurements of RNA molecules in cells, there is another measurement. This could be measurement of surface protein abundances (e.g. CITE-seq or REAP-seq), or measurements of sample tags from a multiplexing technology (e.g. ClickTags). The natural interpretation of #distinct UMIs as the independent variable and  the rank of a barcode as the dependent variable is now clearly preferable. We can now define a bivariate function f(x,y) which informs on the number of barcodes with at least x RNA observations and tag observations:

$f(x,y) = |\{c:\# \mbox{UMIs} \in c \geq x \mbox{ and} \# \mbox{tags} \in c \geq y \}|$.

Nadia Volovich, with whom I’ve worked on this, has examined this function for the 8 sample species mixing experiment from Gehring et al. 2018. The function is shown below:

Here the x-axis corresponds to the #UMIs in a barcode, and the y-axis to the number of tags. The z-axis, or height of the surface, is the f(x,y) as defined above.  Instead of thresholding on either #UMIs or #tags, this “3D knee plot” makes possible thresholding using both (note that the red curve shown above corresponds to one projection of this surface).

Separately from the issue described above, there is another subtle issue with the knee plot. The x-axis (dependent) variable really ought to display the number of molecules assayed rather than the number of distinct UMIs. In the notation of Melsted, Booeshaghi et al., 2019 (see also the blog post on single-cell RNA-seq for dummies), what is currently being plotted is |supp(I)|, instead of |I|. While |I| cannot be directly measured, it can be inferred (see the Supplementary Note of Melsted, Booeshaghi et al., 2019), where the cardinality of I is denoted by k (see also Grün et al,, 2014). If d denotes the number of distinct UMIs for a barcode and n the effective number of UMIs , then k can be estimated by

$\hat{k} = \frac{log(1-\frac{d}{n})}{log(1-\frac{1}{n})}$.

The function estimating k is monotonic so for the purpose of thresholding with the knee plot it doesn’t matter much whether the correction is applied, but it is worth noting that the correction can be applied without much difficulty.

This post is the third in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

1. Near-optimal pre-processing of single-cell RNA-seq
2. Single-cell RNA-seq for dummies
3. How to solve an NP-complete problem in linear time
4. Rotating the knee (plot) and related yoga
5. High velocity RNA velocity

There is a million dollar prize on offer for a solution to the P vs. NP problem, so it’s understandable that one may wonder whether this blog post is an official entry. It is not.

The title for this post was inspired by a talk presented by David Tse at the CGSI 2017 meeting where he explained “How to solve NP-hard assembly problems in linear time“. The gist of the talk was summarized by Tse as follows:

“In computational genomics there’s been a lot of problems where the formulation is combinatorial optimization. Usually they come from some maximum likelihood formulation of some inference problem and those problems end up being mostly NP-hard. And the solution is typically to develop some heuristic way of solving the NP-hard problem. What I’m saying here is that actually there is a different way of approaching such problems. You can look at them from an information point of view.”

Of course thinking about NP-hard problems from an information point of view does not provide polynomial algorithms for them. But what Tse means is that information-theoretic insights can lead to efficient algorithms that squeeze the most out of the available information.

One of the computational genomics areas where an NP-complete formulation for a key problem was recently proposed is in single-cell RNA-seq pre-processing. After RNA molecules are captured from cells, they are amplified by PCR, and it is possible, in principle, to account for the PCR duplicates of the molecules by making use of unique molecular identifiers (UMIs). Since UMIs are (in theory) unique to each captured molecule, but identical among the PCR duplicates of that captured molecule, they can be used to identify and discard the PCR duplicates. In practice distinct captured molecules may share the same UMI causing a collision, so it can be challenging to decide when to “collapse” reads to account for PCR duplicates.

In the recent paper Srivastava et al. 2019, the authors developed a combinatorial optimization formulation for collapsing. They introduce the notion of “monochromatic arborescences” on a graph, where these objects correspond to what is, in the language of the previous post, elements of the set C. They explain that the combinatorial optimization formulation of UMI collapsing in this framework is to find a minimum cardinality covering of a certain graph by monochromatic arboresences. The authors then prove the following theorem, by reduction from the dominating set decision problem:

Theorem [Srivastava, Malik, Smith, Sudbery, Patro]: Minimum cardinality covering by monochromatic arborescences is NP-complete.

Following the standard practice David Tse described in his talk, the authors then apply a heuristic to the challenging NP-complete problem. It’s all good except for one small thing. The formulation is based on an assumption, articulated in Srivastava et al. 2019 (boldface and strikethrough is mine):

…gene-level deduplication provides a conservative approach and assumes that it is highly unlikely for molecules that are distinct transcripts of the same gene to be tagged with a similar UMI (within an edit distance of 1 from another UMI from the same gene). However, entirely discarding transcript-level information will mask true UMI collisions to some degree, even when there is direct evidence that similar UMIs must have arisen from distinct transcripts. For example, if similar UMIs appear in transcript-disjoint equivalence classes (even if all of the transcripts labeling both classes belong to the same gene), then they cannot have arisen from the same pre-PCR molecule. Accounting for such cases is especially true [important] when using an error-aware deduplication approach and as sequencing depth increases.

The one small thing? Well… the authors never checked whether the claim at the end, namely that “accounting for such cases is especially important”, is actually true. In our paper “Modular and efficient pre-processing of single-cell RNA-seq” we checked. The result is in our Figure 1d:

Each column in the figure corresponds to a dataset, and the y-axis shows the distribution (over cells) of the proportion of counts one can expect to lose if applying naïve collapsing to a gene. Naïve collapsing here means that two reads with the same UMI are considered to have come from the same molecule. The numbers are so small we had to include an inset in the top right. Basically, it almost never happens that there is “direct evidence that similar UMIs must have arisen from distinct transcripts”. If one does observe such an occurrence, it is almost certainly an artifact of missing annotation. In fact, this leads to an…

💡 Idea: prioritize genes with colliding UMIs for annotation correction. The UMIs directly highlight transcripts that are incomplete. Maybe for a future paper, but returning to the matter at hand…

Crucially, the information analysis shows that there is no point in solving an NP-complete problem in this setting. The naïve algorithm not only suffices, it is sensible to apply it. And the great thing about naïve collapsing is that it’s straightforward to implement and run; the algorithm is linear. The Srivastava et al. question of what is the “minimum number of UMIs, along with their counts, required to explain the set of mapped reads” is a precise, but wrong question. In the words of John Tukey: “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”

The math behind Figure 1d is elementary but interesting (see the Supplementary Note of our paper). We work with a simple binomial model which we justify based on the data. For related work see Petukhov et al. 2018. One interesting result that came out of our calculations (work done with Sina Booeshaghi), is an estimate for the effective number of UMIs on each bead in a cell. This resulted in Supplementary Figure 1:

The result is encouraging. While the number of UMIs on a bead is not quite $4^L$ where L is the length of the UMI (theoretical maximum shown by dashed red line for v2 chemistry and solid red line for v3 chemistry), it is nevertheless high. We don’t know whether the variation is a result of batch effect, model mis-specification, or other artifacts; that is an interesting question to explore with more data and analysis.

As for UMI collapsing, the naïve algorithm has been used for almost every experiment to date as it is the method that was implemented in the Cell Ranger software, and subsequently adopted in other software packages. This was done without any consideration of whether it is appropriate. As the Srivastava et al. paper shows, intuition is not to be relied upon, but fortunately, in this case, the naïve approach is the right one.

This post is the second in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

A few months ago, while working on the kallisto | bustools project, some of us in the lab were discussing various aspects of single-cell RNA-seq technology when the conversation veered into a debate over the meaning of some frequently used words and phrases in the art: “library complexity”, “library size”, “sensitivity”, “capture rate”, “saturation”, “number of UMIs”, “bork bork bork” etc. There was some sense of confusion. I felt like a dummy because even after working on RNA-seq for more than a decade, I was still lacking language and clarity about even the most basic concepts. This was perhaps not entirely my fault. Consider, for example, that the phrase “library size” is used to mean “the number of molecules in a cDNA library” by some authors, and the “number of reads sequenced” by others.

Since we were writing a paper on single-cell RNA-seq pre-processing that required some calculations related to the basic concepts (libraries, UMIs, and so on), we decided to write down notation for the key objects. After some back-and-forth, Sina Booeshaghi and I ended up drafting the diagram below that summarizes the sets of objects in a single-cell RNA-seq experiment, and the maps that relate them:

Structure of a single-cell RNA-seq experiment.

Each letter in this diagram is a set. The ensemble of RNA molecules contained within a single cell is denoted by R. To investigate R, a library (L) is constructed from the set of molecules captured from R (the set C). Typically, L is the result of of various fragmentation and amplification steps performed on C, meaning each element of C may be observed in L with some multiplicity. Thus, there is an inclusion map from C to L (arrow with curly tail), and an injection from C to R (arrows with head and tail). The library is interrogated via sequencing of some of the molecules in L, resulting in a set F of fragments. Subsequently, the set F is aligned or pseudoaligned to create a set B, which in our case is a BUS file. Not every fragment F is represented in B, hence the injection, rather than bijection, from B to F, and similarly from F to L. The set T consists of transcripts that correspond to molecules in C that were represented in B. Note that $|R| \geq |C| \geq |T|$. Separately, the set U consists of the unique molecular identifiers (UMIs) available to label molecules from the cell, and I is a multiset of UMIs associated with the molecules in T. Importantly, the data from an experiment consists of F, together with the support of I. The support of I means the number of distinct objects in I, and is denoted by |supp(I)|. The common term is “number of distinct UMIs”.

The diagram has three distinct parts. The sets on the top (L, F, B) are “lifted” from  and by PCR. Without PCR one would be in an the ideal situation of measuring C directly to produce T, which would then be used to directly draw inferences about R. This is the hope for direct RNA sequencing, a technology that is promising but that cannot yet be applied at the scale of cDNA based methods. The sets U and I are intended to be seen as orthogonal to the rest of the objects. They relate to the UMIs which, in droplet single-cell RNA-seq technology, are delivered via beads. While the figure was designed to describe single-cell RNA-seq, it is quite general and possibly a useful model for many sequence census assays.

So what is all this formality good for? Nothing in this setup is novel; any practitioner working with single-cell RNA-seq already knows what the ingredients for the technology are. However I do think there is some trouble with the language and meaning of words, and hopefully having names and labels for the relevant sets can help in communication.

The questions

With some notation at hand, it is possible to precisely articulate some of the key technical questions associated with a single-cell RNA-seq experiment:

• The alignment (or pseudoalignment) problem: compute B from F.
• The pre-processing problem: what is the set ?
• What is the library richness/complexity, i.e. what is |supp(L)|?
• What is the sensitivity, i.e. what is $\frac{|C|}{|R|}$?
• In droplet based experiments, what are the number of UMIs available to tag molecules in a cell, i.e. what is |U|?

These basic questions are sometimes confused with each other. For example, the capture rate refers to the proportion of cells from a sample that are captured in an experiment and should not be confused with sensitivity. The |supp(L)| is a concept that is natural to refer to when thinking about a cDNA library. Note that the “library size”, referred to in the beginning of this post, is used by molecular biologists to naturally mean |L|, and not |F| (this confusion was unfortunately disseminated by the highly influential RNA-seq papers Anders and Huber, 2010 and Robinson and Oshlack, 2010) . The support of another set, |supp(I)|, is one that is easy to measure but precisely because I is a multiset, $|I| \neq |supp(I)|$, and there is considerable confusion about this fact. The number of distinct UMIs, |supp(I)|, is frequently used in lieu of the set whose size is being referred to, namely |I| (this is the case when “knee plots” are made, a topic for the fourth blog post in this series). Similarly, |U| is usually not estimated, and the number $4^L$ where $L$ is the length of the UMIs is used in its stead. This is partly intellectual laziness but partly, I think, the lack of standard notation to refer to the objects in single-cell RNA-seq experiments.

This diagram in this post is just step 0 in discussing single-cell RNA-seq. There is a lot more subtlety and nuance in understanding and interpreting experiments (see Introduction to single-cell RNA-seq technologies). ∎

This post is the first in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

1. Near-optimal pre-processing of single-cell RNA-seq
2. Single-cell RNA-seq for dummies
3. How to solve an NP-complete problem in linear time
4. Rotating the knee (plot) and related yoga
5. High velocity RNA velocity

During the past few years computational biologists have expended enormous effort in developing tools for processing and analyzing single-cell RNA-seq. This post describes yet another: the kallisto|bustools workflow for pre-processing single-cell RNA-seq. A preprint describing the method (was recently posted on the bioRχiv.

Number of single-cell RNA-seq tools (from the scRNA-tools catalog).

Given that there are so many programs, a natural question is: why on earth would we write yet another software program for generating a count matrix from single-cell RNA-seq reads when there are already plenty of programs out there? There’s alevin, cell rangerdropseqpipedropseqtoolsindrops… I’ve been going in alphabetical order but have to jump in with starsolo because it’s got the coolest name…now back to optimus, scruff, scpipescumiumis, zumis,  and I’m probably forgetting a few other something-umis. So why another one?

The answer requires briefly venturing back to a time long, long ago when RNA-seq was a fledgling, exciting new technology (~2009). At the time the notion of an “equivalence class” was introduced to the field (see e.g. Jiang and Wong, 2009 or Nicolae et al., 2011). Briefly, there is a natural equivalence relation on the set of reads in an RNA-seq experiment, where two reads are related when they are compatible with (i.e. could have originated from) exactly the same set of transcripts. The equivalence relation partitions the reads into equivalence classes, and, in a slight abuse of notation, the term “equivalence class” in RNA-seq is used to denote the set of transcripts corresponding to an equivalence class of reads. Starting with the pseudoalignment program kallisto that we published in Bray et al. 2016, it became possible to rapidly obtain the (transcript) equivalence classes for reads from an RNA-seq experiment.

When single-cell RNA-seq started to scale it became apparent to those of us working with equivalence classes for bulk RNA-seq that rather than counting genes from single-cell RNA-seq data, it would be better to examine what we called transcript compatibility counts (TCCs), i.e. counts of the equivalence classes (the origin of the term TCC is discussed in a previous blog post of mine). This vision has borne out: we recently published a paper demonstrating the power of TCCs for differential analysis of single-cell data (Ntranos, Yi et al. 2019) and I believe TCCs are ideal for many different single-cell RNA-seq analyses. So back to the question: why another single-cell RNA-seq pre-processing workflow?

Already in 2016 we wanted to be able to produce TCC matrices from single-cell RNA-seq data but there was no program to do it. My postdoc at the time, Vasilis Ntranos, developed a workflow, but in the course of working on a program he started to realize that there were numerous non-trivial aspects to processing single-cell RNA-seq. Even basic questions, such as how to correct barcodes or collapse UMIs required careful thought and analysis. As more and more programs for single-cell RNA-seq pre-processing started to appear, we examined them carefully and noted two things: 1. Most were not able to output TCC matrices and 2. They were, for the most part, based on ad hoc heuristics and unvalidated methods. Many of the programs were not even released with a preprint or paper. So we started working on the problem.

A key insight was that we needed a new format to allow for modular pre-processing. So we developed such a format, which we called the Barcode, UMI, Set (BUS) format, and we published a paper about it earlier this year (Melsted, Ntranos et al., 2019). This allowed us to start investigating different algorithms for the key steps, and to rearrange them and plug them in to an overall workflow as needed. Finally, after careful consideration of each of the key steps, weighing tradeoffs between efficiency and accuracy, and extensive experimentation, we settled on a workflow that is faster than any other method and based on reason rather than intuition. The workflow uses two programs, kallisto and bustools, and we call it the kallisto|bustools workflow. Some highlights:

• kallisto|bustools can produce a TCC matrix. The matrix is compatible with the gene count matrix (it collapses to the latter), and of course gene count matrices can be output as well for use in existing downstream tools.
• The workflow is very very fast. With kallisto|bustools very large datasets can be processed in minutes. The title of this post refers to the workflow as “near-optimal” because it runs in time similar to the unix word count function. Maybe it’s possible to be a bit faster with some optimizations, but probably not by much:
• kallisto|bustools uses very little memory. We worked hard to achieve this feature, as we wanted it to be useful for large-scale analyses that are going to be performed by consortia such as the Human Cell Atlas project. The workflow currently uses ~3.5Gb of RAM for processing 10x v2 chemistry data, and ~11Gb RAM for 10x v3 chemistry data; both numbers are independent of the number of reads being processed. This means users can pre-process data on a laptop:
• The workflow is modular, thanks to its reliance on the flexible BUS format. It was straightforward to develop an RNA velocity workflow (more on this in a companion blog post). It will be easy to adapt the workflow to various technologies, to multiomics experiments, and to any custom analysis required:
• We tried to create comprehensive, yet succinct documentation to help make it easy to use the software (recommendations for improvements are welcome). We have online tutorials, as well as videos for novices:
Installation instructions (and video)
Getting started tutorial (and video).
– Manuals for kallisto and bustools.
– Complete code for reproducing all the results in the preprint
• We were not lazy. In our tests we found variability in performance on different datasets so we tested the program extensively and ran numerous benchmarks on 10x Genomics data to validate Cell Ranger with respect to kallisto|bustools (note that Cell Ranger’s methods have been neither validated nor published). We compiled a benchmark panel consisting of 20 datasets from a wide variety of species. This resulted in 20 supplementary figures, each with 8 panels showing: a) the number of genes detected, b) concordance in counts per gene, c) number of genes detected, d) correlation in gene counts by cell, e) spatial separation between corresponding cells vs. neighboring cells, f,g) t-SNE analysis, h) gene set analysis to detect systematic differences in gene abundance estimation (see example below for the dataset SRR8257100 from the paper Ryu et al., 2019). We also examined in detail results on a species mixing experiment, and confirmed that Cell Ranger is consistent with kallisto on that as well. One thing we did not do in this paper is describe workflows for different technologies but such workflows and accompanying tutorials will be available soon:
• In addition we ran a detailed analysis on the 10x Genomics 10k E18 mouse brain dataset to investigate whether Cell Ranger pre-processing produces different results than kallisto insofar as downstream analyses are concerned. We looked at dimensionality reduction, clustering, identification of marker genes, marker gene prevalence, and pseudotime. The results were all highly concordant. An example (the pseudotime analysis) is shown below:
• We did the math on some of the basic aspects of single-cell RNA-seq. We’re not the first to do this (see, e.g. Petukhov et al., 2018), but one new result we have is an estimation of the UMI diversity on beads. This should be useful for those developing new technologies, or trying to optimize existing protocols:

Note that this post is the first in a series of five that discuss in more detail various aspects of the paper (see links at the top). Finally, a note on reproducibility and usability:

The development of the kallisto|bustools workflow, research into the methods, compilation of the results, and execution of the project required a tremendous team effort, and in working on it I was thinking of the first bioinformatics tool I wrote about and posted to the arXiv (the bioRxiv didn’t exist yet). The paper was:

Nicolas Bray and Lior Pachter, MAVID: Constrained ancestral alignment of multiple sequences, arXiv, 2003.

At the time we posted the code on our own website (now defunct, but accessible via the Wayback machine). We did our best to make the results reproducible but we were limited in our options with the tools available at the time. Furthermore, posting the preprint was highly unusual; there was almost no biology preprinting at the time. Other things have stayed the same. Considerations of software portability, efficiency and documentation were relevant then and remain relevant now.

Still, there has been an incredible development in the tools and techniques available for reproducibility and usability since that time. A lot of the innovation has been made possible by cloud infrastructure, but much of the development has been the result of changes in community standards and requirements (see e.g., Weber et al., 2019). I thought I’d compile a list of the parts and pieces of the project; they are typical for what is needed for a bioinformatics project today and comparing them to the bar in 2003 is mind boggling:

Software: GitHub repositories (kallisto and bustools); releases of binaries for multiple operating systems (Mac, Linux, Windows, Rock64); portable source code with minimal dependencies; multithreading; memory optimization; user interface.

Paper: Preprint (along with extensive Supplement providing backup for every result and claim in the main text); GitHub repository with code to reproduce all the figures/results in the preprint (reproducibility code includes R markdown, python notebooks, snakemake, software versions for every program used, fixed seeds).

Documentation: Manuals for the software; Tutorials for learning to use the code; Explanatory videos (all required materials posted on Github or available on stable websites for download).

The totality of work required to do all of this was substantial. Páll Melsted was the primary developer of kallisto and he wrote and designed bustools, which has been the foundation of the project. The key insight to adopt the BUS format was work in collaboration with Vasilis Ntranos. This was followed by long conversations on the fundamentals of single-cell RNA-seq with Jase Gehring. Sina Booeshaghi carried the project. He was responsible for the crucial UMI collapsing analysis, and put together the paper. Fan Gao, director of the Caltech Bioinformatics Resource Center, set up and implemented the extensive benchmarking, and helped fine-tune the algorithms and converge to the final approach taken. Lambda Lu conducted what I believe to be the most in-depth and detailed analysis to date of the effect of pre-processing on results. Her framework should serve as a template for future development work in this area. Eduardo Beltrame designed the benchmark panels and had a key insight about how to present results that is described in more detail in the companion post on rotating the knee plot. He also helped in the complex task of designing and building the companion websites for the project. Kristján Eldjarn Hjörleifsson helped with the RNA velocity work and helped make custom indices that turned out to be fundamental in understanding the performance of pseudoalignment in the single-cell RNA-seq setting. Sina Booeshaghi spent a lot of time thinking about how to optimize the user experience, making the tutorials and videos, and working overall to make the results of the paper not just reproducible, but the the methods usable.

Bi/BE/CS183 is a computational biology class at Caltech with a mix of undergraduate and graduate students. Matt Thomson and I are co-teaching the class this quarter with help from teaching assistants Eduardo Beltrame, Dongyi (Lambda) Lu and Jialong Jiang. The class has a focus on the computational biology of single-cell RNA-seq analysis, and as such we recently taught an introduction to single-cell RNA-seq technologies. We thought the slides used would be useful to others so we have published them on figshare:

Eduardo Beltrame, Jase Gehring, Valentine Svensson, Dongyi Lu, Jialong Jiang, Matt Thomson and Lior Pachter, Introduction to single-cell RNA-seq technologies, 2019. doi.org/10.6084/m9.figshare.7704659.v1

Thanks to Eduardo Beltrame, Jase Gehring and Valentine Svensson for many extensive and helpful discussions that clarified many of the key concepts. Eduardo Beltrame and Valentine Svensson performed new analysis (slide 28) and Jase Gehring resolved the tangle of “doublet” literature (slides 17–25). The 31 slides were presented in a 1.5 hour lecture. Some accompanying notes that might be helpful to anyone interested in using them are included for reference (numbered according to slide) below:

1. The first (title) slide makes the point that single-cell RNA-seq is sufficiently complicated that a deep understanding of the details of the technologies, and methods for analysis, is required. The #methodsmatter.
2. The second slide presents an overview of attributes associated with what one might imagine would constitute an ideal single-cell RNA-seq experiment. We tried to be careful about terminology and jargon, and therefore widely used terms are italicized and boldfaced.
3. This slide presents Figure 1 from Svensson et al. 2018. This is an excellent perspective that highlights key technological developments that have spurred the growth in popularity of single-cell RNA-seq. At this time (February 2019) the largest single-cell RNA-seq dataset that has been published consists of 690,000 Drop-seq adult mouse brain cells (Saunders, Macosko et al. 2018). Notably, the size and complexity of this dataset rivals that of a large-scale genome project that until recently would be undertaken by hundreds of researchers. The rapid adoption of single-cell RNA-seq is evident in the growth of records in public sequence databases.
4. The Chen et al. 2018 review on single-cell RNA-seq is an exceptionally useful and thorough review that is essential reading in the field. The slide shows Figure 2 which is rich in information and summarizes some of the technical aspects of single-cell RNA-seq technologies. Understanding of the details of individual protocols is essential to evaluating and assessing the strengths and weaknesses of different technologies for specific applications.
5. Current single-cell RNA-seq technologies can be broadly classified into two groups: well-based and droplet-based technologies. The Papalexi and Satija 2017 review provides a useful high-level overview and this slide shows a part of Figure 1 from the review.
6. The details of the SMART-Seq2 protocol are crucial for understanding the technology. SMART is a clever acronym for Switching Mechanism At the 5′ end of the RNA Transcript. It allows the addition of an arbitrary primer sequence at the 5′ end of a cDNA strand, and thus makes full length cDNA PCR possible. It relies on the peculiar properties of the reverse transcriptase from the Moloney murine leukemia virus (MMLV), which, upon reaching the 5’ end of the template, will add a few extra nucleotides (usually Cytosines). The resultant overhang is a binding site for the “template switch oligo”, which contains three riboguanines (rGrGrG). Upon annealing, the reverse transcriptase “switches” templates, and continues transcribing the DNA oligo, thus adding a constant sequence to the 5’ end of the cDNA. After a few cycles of PCR, the full length cDNA generated is too long for Illumina sequencing (where a maximum length of 800bp is desirable). To chop it up into smaller fragments of appropriate size while simultaneously adding the necessary Illumina adapter sequences, one can can use the Illumina tagmentation Nextera™ kits based on Tn5 tagmentation. The SMART template switching idea is also used in the Drop-seq and 10x genomics technologies.
7. While it is difficult to rate technologies exactly on specific metrics, it is possible to identify strengths and weaknesses of distinct approaches. The SMART-Seq2 technology has a major advantage in that it produces reads from across transcripts, thereby providing “full-length” information that can be used to quantify individual isoforms of genes. However this superior isoform resolution requires more sequencing, and as a result makes the method less cost effective. Well-based methods, in general, are not as scalable as droplet methods in terms of numbers of cells assayed. Nevertheless, the tradeoffs are complex. For example robotics technologies can be used to parallelize well-based technologies, thereby increasing throughput.
8. The cost of a single-cell technology is difficult to quantify. Costs depend on number of cells assayed as well as number of reads sequenced, and different technologies have differing needs in terms of reagents and library preparation costs. Ziegenhain et al. 2017 provide an in-depth discussion of how to assess cost in terms of accuracy and power, and the table shown in the slide is reproduced from part of Table 1 in the paper.
9. A major determinant of single-cell RNA-seq cost is sequencing cost. This slide shows sequencing costs at the UC Davis Genome Center and its purpose is to illustrate numerous tradeoffs, relating to throughput, cost per base, and cost per fragment that must be considered when selecting which sequencing machine to use. In addition, sequencing time frequently depends on core facilities or 3rd party providers multiplexing samples on machines, and some sequencing choices are likely to face more delay than others.
10. Turning to droplet technologies based on microfluidics, two key papers are the Drop-seq and inDrops papers which were published in the same issue of a single journal in 2015. The papers went to great lengths to document the respective technologies, and to provide numerous text and video tutorials to facilitate adoption by biologists. Arguably, this emphasis on usability (and not just reproducibility) played a major role in the rapid adoption of single-cell RNA-seq by many labs over the past three years. Two other references on the slide point to pioneering microfluidics work by Rustem Ismagilov, David Weitz and their collaborators that made possible the numerous microfluidic single-cell applications that have since been developed.
11.  This slide displays a figure showing a monodispersed emulsion from the recently posted preprint “Design principles for open source bioinstrumentation: the poseidon syringe pump system as an example” by Booeshaghi et al., 2019. The generation of such emulsions is a prerequisite for successful droplet-based single-cell RNA-seq. In droplet based single-cell RNA-seq, emulsions act as “parallelizing agents”, essentially making use of droplets to parallelize the biochemical reactions needed to capture transcriptomic (or other) information from single-cells.
12. The three objects that are central to droplet based single-cell RNA-seq are beads, cells and droplets. The relevance of emulsions in connection to these objects is that the basis of droplet methods for single-cell RNA-seq is the encapsulation of single cells together with single beads in the emulsion droplest. The beads are “barcode” delivery vectors. Barcodes are DNA sequences that are associated to transcripts, which are accessible after cell lysis in droplets. Therefore, beads must be manufactured in a way that ensures that each bead is coated with the same barcodes, but that the barcodes associated with two distinct beads are different from each other.
13. The inDrops approach to single-cell RNA-seq serves as a useful model for droplet based single-cell RNA-seq methods. The figure in the slide is from a protocol paper by Zilionis et al. 2017 and provides a useful overview of inDrops. In panel (a) one sees a zoom-in of droplets being generated in a microfluidic device, with channels delivering cells and beads highlighted. Panel (b) illustrates how inDrops hydrogel beads are used once inside droplets: barcodes (DNA oligos together with appropriate priming sequences) are released from the hydrogel beads and allow for cell barcoded cDNA synthesis. Finally, panel (c) shows the sequence construct of oligos on the beads.
14. This slide is analogous to slide 6, and shows an overview of the protocols that need to be followed both to make the hydrogel beads used for inDrops, and the inDrops protocol itself.  In a clever dual use of microfluidics, inDrops makes the hydrogel beads in an emulsion. Of note in the inDrops protocol itself is the fact that it is what is termed a “3′ protocol”. This means that the library, in addition to containing barcode and other auxiliary sequence, contains sequence only from 3′ ends of transcripts (seen in grey in the figure). This is the case also with other droplet based single-cell RNA-seq technologies such as Drop-seq or 10X Genomics technology.
15. The significance of 3′ protocols it is difficult to quantify individual isoforms of genes from the data they produce. This is because many transcripts, while differing in internal exon structure, will share a 3′ UTR. Nevertheless, in exploratory work aimed at investigating the information content delivered by 3′ protocols, Ntranos et al. 2019 show that there is a much greater diversity of 3′ UTRs in the genome than is currently annotated, and this can be taken advantage of to (sometimes) measure isoform dynamics with 3′ protocols.
16. To analyze the various performance metrics of a technology such as inDrops it is necessary to understand some of the underlying statistics and combinatorics of beads, cells and drops. Two simple modeling assumptions that can be made is that the number of cells and beads in droplets are each Poisson distributed (albeit with different rate parameters). Specifically, we assume that
$\mathbb{P}(\mbox{droplet has } k \mbox{ cells}) = \frac{e^{-\lambda}\lambda^k}{k!}$ and $\mathbb{P}(\mbox{droplet has } k \mbox{ beads}) = \frac{e^{-\mu}\mu^j}{j!}$. These assumptions are reasonable for the Drop-seq technology. Adjustment of concentrations and flow rates of beads and cells and oil allows for controlling the rate parameters of these distributions and as a result allow for controlling numerous tradeoffs which are discussed next.
17. The cell capture rate of a technology is the fraction of input cells that are assayed in an experiment. Droplets that contain one or more cells but no beads will result in a lost cells whose transcriptome is not measured. The probability that a droplet has no beads is $e^{-\mu}$ and therefore the probability that a droplet has at least one bead is $1-e^{-\mu}$. To raise the capture rate it is therefore desirable to increase the Poisson rate $\mu$ which is equal to the average number of beads in a droplet. However increasing $\mu$ leads to duplication, i.e. cases where a single droplet has more than one bead, thus leading .a single cell transcriptome to appear as two or more cells. The duplication rate is the fraction of assayed cells which were captured with more than one bead. The duplication rate can be calculated as $\frac{\mathbb{P}(\mbox{droplet has 2 or more beads})}{\mathbb{P}(\mbox{droplet has 1 or more beads})}$ (which happens to be equivalent to a calculation of the probability that we are alone in the universe). The tradeoff, shown quantitatively as capture rate vs. duplication rate, is shown in a figure I made for the slide.
19. This slide shows a comparison of bead technologies. In addition to inDrops, 10x Genomics single-cell RNA-seq technology also uses hydrogel beads. There are numerous technical details associated with beads including whether or not barcodes are released (and how).
20. Barcode collisions arise when two cells are separately encapsulated with beads that contain identical barcodes. The slide shows the barcode collision rate formula, which is $1-\left( 1-\frac{1}{M} \right)^{N-1}$. This formula is derived as follows: Let $p=\frac{1}{M}$. The probability that a barcodes is associated with k cells is given by the binomial formula ${N \choose k}p^k(1-p)^{N-k}$. Thus, the probability that a barcode is associated to exactly one cell is $Np(1-p)^{N-1} = \frac{N}{M}\left(1-\frac{1}{M}\right)^{N-1}$. Therefore the expected number of cells with a unique barcode is $N\left(1-\frac{1}{M}\right)^{N-1}$ and the barcode collision rate is $\left(1-\frac{1}{M}\right)^{N-1}$. This is approximately $1-\left( \frac{1}{e} \right)^{\frac{N}{M}}$. The term synthetic doublet is used to refer to the situation when two or more different cells appear to be a single cell due to barcode collision.
21. In the notation of the previous slide, the barcode diversity is $\frac{N}{M}$, which is an important number in that it determines the barcode collision rate. Since barcodes are encoded in sequence, a natural question is what sequence length is needed to ensure a desired level of barcode diversity. This slide provides a lower bound on the sequence length.
22. Technical doublets are multiple cells that are captured in a single droplet, barcoded with the same sequence, and thus the transcripts that are recorded from them appear to originate from a single-cell.  The technical doublet rate can be estimated using a calculation analogous to the one used for the cell duplication rate (slide 17), except that it is a function of the Poisson rate $\lambda$ and not $\mu$. In the single-cell RNA-seq literature the term “doublet” generally refers to technical doublets, although it is useful to distinguish such doublets from synthetic doublets and biological doublets (slide 25).
23. One way to estimate the technical doublet rate is via pooled experiments of cells from different species. The resulting data can be viewed in what has become known as a “Barnyard” plot, a term originating from the Macosko et al. 2015 Drop-seq paper. Despite the authors’ original intention to pool mouse, chicken, cow and pig cells, typical validation of single-cell technology now proceeds with a mixture of human and mouse cells. Doublets are readily identifiable in barnyard plots as points (corresponding to droplets) that display transcript sequence from two different species. The only way for this to happen is via the capture of a doublet of cells (one from each species) in a single droplet. Thus, validation of single-cell RNA-seq rigs via a pooled experiment, and assessment of the resultant barnyard plot, has become standard for ensuring that the data is indeed single cell.
24.  It is important to note that while points on the diagonal of barnyard plots do correspond to technical doublets, pooled experiments, say of human and mouse cells, will also result in human-human and mouse-mouse doublets that are not evident in barnyard plots. To estimate such within species technical doublets from a “barnyard analysis”, Bloom 2018 developed a simple procedure that is described in this slide. The “Bloom correction” is important to perform in order to estimate doublet rate from mixed species experiments.
25. Biological doublets arise when two cells form a discrete unit that does not break apart during disruption to form a suspension. Such doublets are by definition from the same species, and therefore will not be detected in barnyard plots. One way to avoid biological doublets is via single nucleus RNA-seq (see, Habib et al. 2017). Single nuclear RNA-seq has proved to be important for experiments involving cells that are difficult to dissociate, e.g. human brain cells. In fact, the formation of suspensions is a major bottleneck in the adoption of single-cell RNA-seq technology, as techniques vary by tissue and organism, and there is no general strategy. On the other hand, in a recent interesting paper, Halpern et al. 2018 show that biological doublets can sometimes be considered a feature rather than a bug. In any case, doublets are more complicated than initially appears, and we have by now seen that there are three types of doublets: synthetic doublets, technical doublets, and biological doublets .
26. Unique molecular identifiers (UMIs) are part of the oligo constructs for beads in droplet single-cell RNA-seq technologies. They are used to identify reads that originated from the same molecule, so that double counting of such molecules can be avoided. UMIs are generally much shorter than cell barcodes, and estimates of required numbers (and corresponding sequence lengths) proceed analogously to the calculations for cell barcodes.
27. The sensitivity of a single-cell RNA-seq technology (also called transcript capture rate) is the fraction of transcripts captured per cell. Crucially, the sensitivity of a technology is dependent on the amount sequenced, and the plot in this slide (made by Eduardo Beltrame and Valentine Svensson by analysis of data from Svensson et al. 2017) shows that dependency.  The dots in the figure are cells from different datasets that included spike-ins, whose capture is being measured (y-axis). Every technology displays an approximately linear relationship between number of reads sequenced and transcripts captured, however each line is describe by two parameters: a slope and an intercept. In other words, specification of sensitivity for a technology requires reporting two numbers, not one.
28. Having surveyed the different attributes of droplet technologies, this slide summarizes some of the pros and cons of inDrops similarly to slide 7 for SMART-Seq2.
29. While individual aspects of single-cell RNA-seq technologies can be readily understood via careful modeling coupled to straightforward quality control experiments, a comprehensive assessment of whether a technology is “good” or “bad” is meaningless. The figure in this slide, from Zhang et al. 2019, provides a useful overview of some of the important technical attributes.
30. In terms of the practical question “which technology should I choose for my experiment?”, the best that can be done is to offer very general decision workflows. The flowchart shown on this slide, also from Zhang et al. 2019, is not very specific but does provide some simple to understand checkpoints to think about. A major challenge in comparing and contrasting technologies for single-cell RNA-seq is that they are all changing very rapidly as numerous ideas and improvements are now published weekly.
31. The technologies reviewed in the slides are strictly transcriptomics methods (single-cell RNA-seq). During the past few years there has been a proliferation of novel multi-modal technologies that simultaneously measures the transcriptome of single cells along with other modalities. Such technologies are reviewed in Packer and Trapnell 2018 and the slide mentions a few of them.

This post is a review of a recent preprint on an approach to testing for RNA-seq gene differential expression directly from transcript compatibility counts:

Marek Cmero, Nadia M Davidson and Alicia Oshlack, Fast and accurate differential transcript usage by testing equivalence class counts, bioRxiv 2018.

To understand the preprint two definitions are important. The first is of gene differential expression, which I wrote about in a previous blog post and is best understood, I think, with the following figure (reproduced from Supplementary Figure 1 of Ntranos, Yi, et al., 2018):

In this figure, two isoforms of a hypothetical gene are called primary and secondary, and two conditions in a hypothetical experiment are called “A” and “B”. The black dots labeled conditions A and B have x-coordinates $x_A$ and $x_B$ corresponding to the abundances of the primary isoform in the respective conditions, and y-coordinates $y_A$ and $y_B$ corresponding to the abundance of the secondary isoforms. In data from the hypothetical experiment, the black dots represent the mean level of expression of the constituent isoforms as derived from replicates. Differential transcript expression (DTE) refers to change in one of the isoforms. Differential gene expression (DGE) refers to change in overall gene expression (i.e. expression as the sum of the expression of the two isoforms). Differential transcript usage (DTU) refers to change in relative expression between the two isoform and gene differential expression (GDE) refers to change in expression along the red line. Note that DGE, DTU and DGE are special cases of GDE.

The Cmero et al. preprint describes a method for testing for GDE, and the method is based on comparison of equivalence classes of reads between conditions. There is a natural equivalence relation $\sim$ on the set of reads in an RNA-seq experiment, where two reads $r_1$ and $r_2$ are related by $\sim$ when $r_1$ and $r_2$ align (ambiguously) to exactly the same set of transcripts (see, e.g. Nicolae et al. 2011). The equivalence relation $\sim$ partitions the reads into equivalence classes, and, in a slight abuse of notation, the term “equivalence class” in RNA-seq is used to denote the set of transcripts corresponding to an equivalence class of reads. Starting with the pseudoalignment program kallisto published in Bray et al. 2016, it became possible to rapidly obtain the (transcript) equivalence classes for reads from an RNA-seq experiment.

In previous work (Ntranos et al. 2016) we introduced the term transcript compatibility counts to denote the cardinality of the (read) equivalence classes. We thought about this name carefully; due to the abuse of notation inherent in the term “equivalence class” in RNA-seq, we felt that using “equivalence class counts” would be confusing as it would be unclear whether it refers to the cardinalities of the (read) equivalence classes or the (transcript) “equivalence classes”.

With these definitions at hand, the Cmero et al.’s preprint can be understood to describe a method for identifying GDE between conditions by directly comparing transcript compatibility counts. The Cmero et al. method is to perform Šidák aggregation of p-values for equivalence classes, where the p-values are computed by comparing transcript compatibility counts for each equivalence class with the program DEXSeq (Anders et al. 2012). A different method that also identifies GDE by directly comparing transcript compatibility counts was previously published by my student Lynn Yi in Yi et al. 2018. I was curious to see how the Yi et al. method, which is based on Lancaster aggregation of p-values computed from transcript compatibility counts compares to the Cmero et al. method. Fortunately it was really easy to find out because Cmero et al. released code with their paper that can be used to make all of their figures.

I would like to note how much fun it is to reproduce someone else’s work. It is extremely empowering to know that all the methods of a paper are pliable at the press of a button. Below is the first results figure, Figure 2, from Cmero et al.’s paper:

It’s beautiful to see not only apples-to-apples, but the exact same apple! Reproducibility is obviously important to facilitate transparency in papers and to ensure correctness, but its real value lies in the fact that it allows for modifying and experimenting with methods in a paper. Below is the second results figure, Figure 3, from Cmero et al.’s paper:

The figure below is the reproduction, but with an added analysis in Figure 3a, namely the method of Yi et al. 2018 included (shown in orange as “Lancaster_equivalence_class”).

The additional code required for the extra analysis is just a few lines and can be downloaded from the Bits of DNA Github repository:

library(aggregation)
library(dplyr)
dm_dexseq_results <- as.data.frame(DEXSeqResults(dm_ec_results$dexseq_object)) dm_lancaster_results <- dm_dexseq_results %>% group_by(groupID) %>% summarize(pval = lancaster(pvalue, log(exonBaseMean))) dm_lancaster_results$gene_FDR <- p.adjust(dm_lancaster_results$pval, ‘BH’) dm_lancaster_results <- data.frame(gene = dm_lancaster_results$groupID,
FDR = dm_lancaster_results$gene_FDR) hs_dexseq_results <- as.data.frame(DEXSeqResults(hs_ec_results$dexseq_object))
hs_lancaster_results <- hs_dexseq_results %>% group_by(groupID) %>% summarize(pval = lancaster(pvalue, log(exonBaseMean)))
hs_lancaster_results$gene_FDR <- p.adjust(hs_lancaster_results$pval, ‘BH’)
hs_lancaster_results <- data.frame(gene = hs_lancaster_results$groupID, FDR = hs_lancaster_results$gene_FDR)

A zoom-in of Figure 3a below shows that the improvement of Yi et al.’s method in the hsapiens dataset over the method of Cmero et al. is as large as the improvement of aggregation (of any sort) over GDE based on transcript quantifications. Importantly, this is a true apples-to-apples comparison because Yi et al.’s method is being tested on exactly the data and with exactly the metrics that Cmero et al. chose:

The improvement is not surprising; an extensive comparison of Lancaster aggregation with Šidák aggregation is detailed in Yi et al. and there we noted that while Šidák aggregation performs well when transcripts are perturbed independently, it performs very poorly in the more common case of correlated effect. Furthermore, we also examined in detail DEXSeq’s aggregation (perGeneQvalue) which appears to be an attempt to perform Šidák aggregation but is not quite right, in a sense we explain in detail in Section 2 of the Yi et al. supplement. While DEXSeq’s implementation of Šidák aggregation does control the FDR, it will tend to report genes with many isoforms and consumes the “FDR budget” faster than Šidák aggregation. This is one reason why, for the purpose of comparing Lancaster and Šidák aggregation in Yi et al. 2018, we did not rely on DEXSeq’s implementation of Šidák aggregation. Needless to say, separately from this issue, as mentioned above we found that Lancaster aggregation substantially outperforms Šidák aggregation.

The figures below complete the reproduction of the results of Cmero et al. The reproduced figures are are very similar to Cmero et al.’s figures but not identical. The difference is likely due to the fact that the Cmero paper states that a full comparison of the “Bottomly data” (on which these results are based) is a comparison of 10 vs. 10 samples. The reproduced results are based on downloading the data which consists of 10 vs. 11 samples for a total of 21 samples (this is confirmed in the Bottomly et al. paper which states that they “generated single end RNA-Seq reads from 10 B6 and 11 D2 mice.”) I also noticed one other small difference in the Drosophila analysis shown in Figure 3a where one of the methods is different for reasons I don’t understand. As for the supplement, the Cmero et al. figures are shown on the left hand side below, and to their right are the reproduced figures:

The final supplementary figure is a comparison of kallisto to Salmon: the Cmero et al. paper shows that Salmon results are consistent with kallisto results shown in Figure 3a,  and reproduces the claim I made in a previous blog post, namely that Salmon results are near identical to kallisto:

The final paragraph in the discussion of Cmero et al. states that “[transcript compatibility counts] have the potential to be useful in a range of other expression analysis. In particular [transcript compatibility counts] could be used as the initial unit of measurement for many other types of analysis such as dimension reduction visualizations, clustering and differential expression.” In fact, transcript compatibility counts have already been used for all these applications and have been shown to have numerous advantages. See the papers:

Many of these papers were summarized in a talk I gave at Cold Spring Harbor in 2017 on “Post-Procrustean Bioinformatics”, where I emphasized that instead of fitting methods to the predominant data types (in the case of RNA-seq, gene counts), one should work with data types that can support powerful analysis methods (in the case of RNA-seq, transcript compatibility counts).

Three years ago, when my coauthors (Páll Melsted, Nicolas Bray, Harold Pimentel) and I published the “kallisto paper” on the arXiv (later Bray et al. “Near-optimal probabilistic RNA-seq quantification“, 2016), we claimed that kallisto removed a major computational bottleneck from RNA-seq analysis by virtue of being two orders of magnitude faster than other state-of-the-art quantification methods of the time, without compromising accuracy. With kallisto, computations that previously took days, could be performed as accurately in minutes. Even though the speedup was significant, its relevance was immediately questioned. Critics noted that experiments, library preparations and sequencing take at least months, if not years, and questioned the relevance of a speedup that would save only days.

One rebuttal we made to this legitimate point was that kallisto would be useful not only for rapid analysis of individual datasets, but that it would enable analyses at previously unimaginable scales. To make our point concrete, in a follow-up paper (Pimentel et al., “The Lair: a resource for exploratory analysis of published RNA-seq data”, 2016) we described a semi-automated framework for analysis of archived RNA-seq data that was possible thanks to the speed and accuracy of kallisto, and we articulated a vision for “holistic analysis of [short read archive] SRA data” that would facilitate “comparison of results across studies [by] use of the same tools to process diverse datasets.” A major challenge in realizing this vision was that although kallisto was fast enough to allow for low cost processing of all the RNA-seq in the short read archive (e.g. shortly after we published kallisto, Vivian et al., 2017 showed that kallisto reduced the cost of processing per sample from $1.30 to$0.19, and Tatlow and Piccolo, 2016 achieved \$0.09 per sample with it), an analysis of experiments consists of much more than just quantification. In Pimentel et al. 2016 we struggled with how to wrangle metadata of experiments (subsequently an entire paper was written by Bernstein et al. 2017 just on this problem), how to enable users to dynamically test distinct hypotheses for studies, and how to link results to existing databases and resources. As a result, Pimentel et al. 2016 was more of a proof-of-principle than a complete resource; ultimately we were able to set up analysis of only a few dozen datasets.

Now, the group of Avi Ma’ayan at the Icahn School of Medicine at Mount Sinai has surmounted the many challenges that must be overcome to enable automated analysis of RNA-seq projects on the short read archive, and has published a tool called BioJupies (Torre et al. 2018). To assess BioJupies I began by conducting a positive control in the form of analysis of data from the “Cuffdiff2” paper, Trapnell et al. 2013. The data is archived as GSE37704. This is the dataset I used to initially test the methods of Pimentel et al. 2016 and is also the dataset underlying the Getting Started Walkthrough for sleuth. I thought, given my familiarity with it, that it would be a good test case for BioJupies.

Briefly, in Trapnell et al. 2013, Trapnell and Hendrickson performed a differential analysis of lung fibroblasts in response to an siRNA knockdown of HOXA1 which is a developmental transcription factor. Analyzing the dataset with BioJupies is as simple as typing the Gene Expression Omnibus (GEO) accession on the BioJupies searchbox. I clicked “analyze”, clicked on “+” a few times to add all the possible plots that can be generated, and this opened a window asking for a description of the samples:

I selected “Perturbation” for the HOXA1 knockdown samples and “Control” for the samples that were treated with scrambled siRNA that did not target a specific gene. Finally, I  clicked on “generate notebook”…

and

BioJupies displayed a notebook (Trapnell et al. 2013 | BioJupies) with a complete analysis of the data. Much of the Trapnell et al. 2013 analysis was immediately evident in the notebook. For example, the following figure is Figure 5a in Trapnell et al. 2013. It is a gene set enrichment analysis (GSEA) of the knockdown:

BioJupies presents the same analysis:

It’s easy to match them up. Of course BioJupies presents a lot of other information and analysis, ranging from a useful PCA plot to an interesting L1000 connectivity map analysis (expression signatures from a large database of over 20,000 perturbations applied to various cell lines that match the signatures in the dataset).

One of the powerful applications of BioJupies is the presentation of ARCHS⁴ co-expression data. ARCHS⁴ is the kallisto computed database of expression for the whole and is the primary database that enables BioJupies. One of its features is a list of co-expressed genes (as ascertained via correlation across the whole short read archive). These are displayed in BioJupies making it possible to place the results of an experiment in the context of “global” transcriptome associations.

While the Trapnell et al. 2013 reanalysis was fun, the real power of BioJupies is clear when analyzing a dataset that has not yet been published. I examined the GEO database and found a series GSE60538 that appears to be a partial dataset from what looks like a paper in the works. The data is from an experiment designed to investigate the role of Sox5 and Sox6 in the mouse heart via two single knockout experiments, and a double knockout. The entry originates in 2014 (consistent with the single-end 50bp reads it contains), but was updated recently. There are a total of 8 samples: 4 controls and 4 from the double knockout (the single knockouts are not available yet). I could not find an associated paper, nor was one linked to on GEO, but the abstract of the paper has already been uploaded to the site. Just as I did with the Trapnell et al. 2013 dataset, I entered the accession in the BioJupies website and… four minutes later:

The abstract of the GSE60538 entry states that “We performed RNA deep sequencing in ventricles from DKO and control mice to identify potential Sox5/6 target genes and found altered expression of genes encoding regulators of calcium handling and cation transporters” and indeed, BioJupies verifies this result (see Beetz et al. GSE60538| BioJupies):

Of course, there is a lot more analysis than just this. The BioJupies page includes, in addition to basic QC and datasets statistics, the PCA analysis, a “clustergrammer” showing which genes drive similarity between samples, differentially expressed genes (with associated MA and volcano plots), gene ontology enrichment analysis, pathway enrichment analysis, transcription factor enrichment analysis, kinase enrichment analysis, microRNA enrichment analysis, and L1000 analysis. In a sense, one could say that with BioJupies, users can literally produce the analysis for a paper in four minutes via a website.

The Ma’ayan lab has been working towards BioJupies for some time. The service is essentially a combination of a number of tools, workflows and resources published previously by the lab, including:

With BioJupies, these tools become more than the sum of their parts. Yet while BioJupies is impressive, it is not complete. There is no isoform analysis, which is unfortunate; for example one of the key points of Trapnell et al. 2013 was how informative transcript-level analysis of RNA-seq data can be. However I see no reason why a future release of BioJupies can’t include detailed isoform analyses. Isoform quantifications are provided by kallisto and are already downloadable via ARCHS⁴. It would also be great if BioJupies were extended to organisms other than human and mouse, although some of the databases that are currently relied on are less complete in other model organisms. Still, it should even be possible to create a BioJupies for non-models. I expect the authors have thought of all of these ideas. I do have some other issues with BioJupies: e.g. the notebook should cite all the underlying programs and databases used to generate the results, and while it’s neat that there is an automatically generated methods section, it is far from complete and should include the actual calls made to the programs used so as to facilitate complete reproducibility. Then, there is my pet peeve: “library size” is not the number of reads in a sample. The number of reads sequenced is “sequencing depth”.  All of these issues can be easily fixed.

In summary, BioJupies represents an impressive breakthrough in RNA-seq analysis. It leverages a comprehensive analysis of all (human and mouse) publicly available RNA-seq data to enable rapid and detailed analyses that transcend what has been previously possible. Discoveries await.