You are currently browsing the monthly archive for June 2019.

This post is the fourth in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

1. Near-optimal pre-processing of single-cell RNA-seq
2. Single-cell RNA-seq for dummies
3. How to solve an NP-complete problem in linear time
4. Rotating the knee (plot) and related yoga
5. High velocity RNA velocity

The “knee plot” is a standard single-cell RNA-seq quality control that is also used to determine a threshold for considering cells valid for analysis in an experiment. To make the plot, cells are ordered on the x-axis according to the number of distinct UMIs observed. The y-axis displays the number of distinct UMIs for each barcode (here barcodes are proxies for cells). The following example is from Aaron Lun’s DropletUtils vignette:

A single-cell RNA-seq knee plot.

High quality barcodes are located on the left hand side of the plot, and thresholding is performed by identifying the “knee” on the curve. On the right hand side, past the inflection point, are barcodes which have relatively low numbers of reads, and are therefore considered to have had failure in capture and to be too noisy for further analysis.

In Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019, we display a series of plots for a benchmark panel of 20 datasets, and the first plot in each panel (subplot A)is a knee plot. The following example is from an Arabidopsis thaliana dataset (Ryu et al., 2019; SRR8257100)

Careful examination of our plots shows that unlike the typical knee plot made for single-cell RNA-seq , ours has the x- and y- axes transposed. In our plot the x-axis displays the number of distinct UMI counts, and the y-axis corresponds to the barcodes, ordered from those with the most UMIs (bottom) to the least (top). The figure below shows both versions of a knee plot for the same data (the “standard” one in blue, our transposed plot in red):

Why bother transposing a plot?

We begin by observing that if one ranks barcodes according to the number of distinct UMIs associated with them (from highest to lowest), then the rank of a barcode with x distinct UMIs is given by f(x) where

$f(x) = |\{c:\# \mbox{UMIs} \in c \geq x\}|$.

In other words, the rank of a barcode is interpretable as the size of a certain set. Now suppose that instead of only measurements of RNA molecules in cells, there is another measurement. This could be measurement of surface protein abundances (e.g. CITE-seq or REAP-seq), or measurements of sample tags from a multiplexing technology (e.g. ClickTags). The natural interpretation of #distinct UMIs as the independent variable and  the rank of a barcode as the dependent variable is now clearly preferable. We can now define a bivariate function f(x,y) which informs on the number of barcodes with at least x RNA observations and tag observations:

$f(x,y) = |\{c:\# \mbox{UMIs} \in c \geq x \mbox{ and} \# \mbox{tags} \in c \geq y \}|$.

Nadia Volovich, with whom I’ve worked on this, has examined this function for the 8 sample species mixing experiment from Gehring et al. 2018. The function is shown below:

Here the x-axis corresponds to the #UMIs in a barcode, and the y-axis to the number of tags. The z-axis, or height of the surface, is the f(x,y) as defined above.  Instead of thresholding on either #UMIs or #tags, this “3D knee plot” makes possible thresholding using both (note that the red curve shown above corresponds to one projection of this surface).

Separately from the issue described above, there is another subtle issue with the knee plot. The x-axis (dependent) variable really ought to display the number of molecules assayed rather than the number of distinct UMIs. In the notation of Melsted, Booeshaghi et al., 2019 (see also the blog post on single-cell RNA-seq for dummies), what is currently being plotted is |supp(I)|, instead of |I|. While |I| cannot be directly measured, it can be inferred (see the Supplementary Note of Melsted, Booeshaghi et al., 2019), where the cardinality of I is denoted by k (see also Grün et al,, 2014). If d denotes the number of distinct UMIs for a barcode and n the effective number of UMIs , then k can be estimated by

$\hat{k} = \frac{log(1-\frac{d}{n})}{log(1-\frac{1}{n})}$.

The function estimating k is monotonic so for the purpose of thresholding with the knee plot it doesn’t matter much whether the correction is applied, but it is worth noting that the correction can be applied without much difficulty.

This post is the third in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

1. Near-optimal pre-processing of single-cell RNA-seq
2. Single-cell RNA-seq for dummies
3. How to solve an NP-complete problem in linear time
4. Rotating the knee (plot) and related yoga
5. High velocity RNA velocity

There is a million dollar prize on offer for a solution to the P vs. NP problem, so it’s understandable that one may wonder whether this blog post is an official entry. It is not.

The title for this post was inspired by a talk presented by David Tse at the CGSI 2017 meeting where he explained “How to solve NP-hard assembly problems in linear time“. The gist of the talk was summarized by Tse as follows:

“In computational genomics there’s been a lot of problems where the formulation is combinatorial optimization. Usually they come from some maximum likelihood formulation of some inference problem and those problems end up being mostly NP-hard. And the solution is typically to develop some heuristic way of solving the NP-hard problem. What I’m saying here is that actually there is a different way of approaching such problems. You can look at them from an information point of view.”

Of course thinking about NP-hard problems from an information point of view does not provide polynomial algorithms for them. But what Tse means is that information-theoretic insights can lead to efficient algorithms that squeeze the most out of the available information.

One of the computational genomics areas where an NP-complete formulation for a key problem was recently proposed is in single-cell RNA-seq pre-processing. After RNA molecules are captured from cells, they are amplified by PCR, and it is possible, in principle, to account for the PCR duplicates of the molecules by making use of unique molecular identifiers (UMIs). Since UMIs are (in theory) unique to each captured molecule, but identical among the PCR duplicates of that captured molecule, they can be used to identify and discard the PCR duplicates. In practice distinct captured molecules may share the same UMI causing a collision, so it can be challenging to decide when to “collapse” reads to account for PCR duplicates.

In the recent paper Srivastava et al. 2019, the authors developed a combinatorial optimization formulation for collapsing. They introduce the notion of “monochromatic arborescences” on a graph, where these objects correspond to what is, in the language of the previous post, elements of the set C. They explain that the combinatorial optimization formulation of UMI collapsing in this framework is to find a minimum cardinality covering of a certain graph by monochromatic arboresences. The authors then prove the following theorem, by reduction from the dominating set decision problem:

Theorem [Srivastava, Malik, Smith, Sudbery, Patro]: Minimum cardinality covering by monochromatic arborescences is NP-complete.

Following the standard practice David Tse described in his talk, the authors then apply a heuristic to the challenging NP-complete problem. It’s all good except for one small thing. The formulation is based on an assumption, articulated in Srivastava et al. 2019 (boldface and strikethrough is mine):

…gene-level deduplication provides a conservative approach and assumes that it is highly unlikely for molecules that are distinct transcripts of the same gene to be tagged with a similar UMI (within an edit distance of 1 from another UMI from the same gene). However, entirely discarding transcript-level information will mask true UMI collisions to some degree, even when there is direct evidence that similar UMIs must have arisen from distinct transcripts. For example, if similar UMIs appear in transcript-disjoint equivalence classes (even if all of the transcripts labeling both classes belong to the same gene), then they cannot have arisen from the same pre-PCR molecule. Accounting for such cases is especially true [important] when using an error-aware deduplication approach and as sequencing depth increases.

The one small thing? Well… the authors never checked whether the claim at the end, namely that “accounting for such cases is especially important”, is actually true. In our paper “Modular and efficient pre-processing of single-cell RNA-seq” we checked. The result is in our Figure 1d:

Each column in the figure corresponds to a dataset, and the y-axis shows the distribution (over cells) of the proportion of counts one can expect to lose if applying naïve collapsing to a gene. Naïve collapsing here means that two reads with the same UMI are considered to have come from the same molecule. The numbers are so small we had to include an inset in the top right. Basically, it almost never happens that there is “direct evidence that similar UMIs must have arisen from distinct transcripts”. If one does observe such an occurrence, it is almost certainly an artifact of missing annotation. In fact, this leads to an…

💡 Idea: prioritize genes with colliding UMIs for annotation correction. The UMIs directly highlight transcripts that are incomplete. Maybe for a future paper, but returning to the matter at hand…

Crucially, the information analysis shows that there is no point in solving an NP-complete problem in this setting. The naïve algorithm not only suffices, it is sensible to apply it. And the great thing about naïve collapsing is that it’s straightforward to implement and run; the algorithm is linear. The Srivastava et al. question of what is the “minimum number of UMIs, along with their counts, required to explain the set of mapped reads” is a precise, but wrong question. In the words of John Tukey: “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”

The math behind Figure 1d is elementary but interesting (see the Supplementary Note of our paper). We work with a simple binomial model which we justify based on the data. For related work see Petukhov et al. 2018. One interesting result that came out of our calculations (work done with Sina Booeshaghi), is an estimate for the effective number of UMIs on each bead in a cell. This resulted in Supplementary Figure 1:

The result is encouraging. While the number of UMIs on a bead is not quite $4^L$ where L is the length of the UMI (theoretical maximum shown by dashed red line for v2 chemistry and solid red line for v3 chemistry), it is nevertheless high. We don’t know whether the variation is a result of batch effect, model mis-specification, or other artifacts; that is an interesting question to explore with more data and analysis.

As for UMI collapsing, the naïve algorithm has been used for almost every experiment to date as it is the method that was implemented in the Cell Ranger software, and subsequently adopted in other software packages. This was done without any consideration of whether it is appropriate. As the Srivastava et al. paper shows, intuition is not to be relied upon, but fortunately, in this case, the naïve approach is the right one.

This post is the second in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

A few months ago, while working on the kallisto | bustools project, some of us in the lab were discussing various aspects of single-cell RNA-seq technology when the conversation veered into a debate over the meaning of some frequently used words and phrases in the art: “library complexity”, “library size”, “sensitivity”, “capture rate”, “saturation”, “number of UMIs”, “bork bork bork” etc. There was some sense of confusion. I felt like a dummy because even after working on RNA-seq for more than a decade, I was still lacking language and clarity about even the most basic concepts. This was perhaps not entirely my fault. Consider, for example, that the phrase “library size” is used to mean “the number of molecules in a cDNA library” by some authors, and the “number of reads sequenced” by others.

Since we were writing a paper on single-cell RNA-seq pre-processing that required some calculations related to the basic concepts (libraries, UMIs, and so on), we decided to write down notation for the key objects. After some back-and-forth, Sina Booeshaghi and I ended up drafting the diagram below that summarizes the sets of objects in a single-cell RNA-seq experiment, and the maps that relate them:

Structure of a single-cell RNA-seq experiment.

Each letter in this diagram is a set. The ensemble of RNA molecules contained within a single cell is denoted by R. To investigate R, a library (L) is constructed from the set of molecules captured from R (the set C). Typically, L is the result of of various fragmentation and amplification steps performed on C, meaning each element of C may be observed in L with some multiplicity. Thus, there is an inclusion map from C to L (arrow with curly tail), and an injection from C to R (arrows with head and tail). The library is interrogated via sequencing of some of the molecules in L, resulting in a set F of fragments. Subsequently, the set F is aligned or pseudoaligned to create a set B, which in our case is a BUS file. Not every fragment F is represented in B, hence the injection, rather than bijection, from B to F, and similarly from F to L. The set T consists of transcripts that correspond to molecules in C that were represented in B. Note that $|R| \geq |C| \geq |T|$. Separately, the set U consists of the unique molecular identifiers (UMIs) available to label molecules from the cell, and I is a multiset of UMIs associated with the molecules in T. Importantly, the data from an experiment consists of F, together with the support of I. The support of I means the number of distinct objects in I, and is denoted by |supp(I)|. The common term is “number of distinct UMIs”.

The diagram has three distinct parts. The sets on the top (L, F, B) are “lifted” from  and by PCR. Without PCR one would be in an the ideal situation of measuring C directly to produce T, which would then be used to directly draw inferences about R. This is the hope for direct RNA sequencing, a technology that is promising but that cannot yet be applied at the scale of cDNA based methods. The sets U and I are intended to be seen as orthogonal to the rest of the objects. They relate to the UMIs which, in droplet single-cell RNA-seq technology, are delivered via beads. While the figure was designed to describe single-cell RNA-seq, it is quite general and possibly a useful model for many sequence census assays.

So what is all this formality good for? Nothing in this setup is novel; any practitioner working with single-cell RNA-seq already knows what the ingredients for the technology are. However I do think there is some trouble with the language and meaning of words, and hopefully having names and labels for the relevant sets can help in communication.

The questions

With some notation at hand, it is possible to precisely articulate some of the key technical questions associated with a single-cell RNA-seq experiment:

• The alignment (or pseudoalignment) problem: compute B from F.
• The pre-processing problem: what is the set ?
• What is the library richness/complexity, i.e. what is |supp(L)|?
• What is the sensitivity, i.e. what is $\frac{|C|}{|R|}$?
• In droplet based experiments, what are the number of UMIs available to tag molecules in a cell, i.e. what is |U|?

These basic questions are sometimes confused with each other. For example, the capture rate refers to the proportion of cells from a sample that are captured in an experiment and should not be confused with sensitivity. The |supp(L)| is a concept that is natural to refer to when thinking about a cDNA library. Note that the “library size”, referred to in the beginning of this post, is used by molecular biologists to naturally mean |L|, and not |F| (this confusion was unfortunately disseminated by the highly influential RNA-seq papers Anders and Huber, 2010 and Robinson and Oshlack, 2010) . The support of another set, |supp(I)|, is one that is easy to measure but precisely because I is a multiset, $|I| \neq |supp(I)|$, and there is considerable confusion about this fact. The number of distinct UMIs, |supp(I)|, is frequently used in lieu of the set whose size is being referred to, namely |I| (this is the case when “knee plots” are made, a topic for the fourth blog post in this series). Similarly, |U| is usually not estimated, and the number $4^L$ where $L$ is the length of the UMIs is used in its stead. This is partly intellectual laziness but partly, I think, the lack of standard notation to refer to the objects in single-cell RNA-seq experiments.

This diagram in this post is just step 0 in discussing single-cell RNA-seq. There is a lot more subtlety and nuance in understanding and interpreting experiments (see Introduction to single-cell RNA-seq technologies). ∎

This post is the first in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

1. Near-optimal pre-processing of single-cell RNA-seq
2. Single-cell RNA-seq for dummies
3. How to solve an NP-complete problem in linear time
4. Rotating the knee (plot) and related yoga
5. High velocity RNA velocity

During the past few years computational biologists have expended enormous effort in developing tools for processing and analyzing single-cell RNA-seq. This post describes yet another: the kallisto|bustools workflow for pre-processing single-cell RNA-seq. A preprint describing the method (was recently posted on the bioRχiv.

Number of single-cell RNA-seq tools (from the scRNA-tools catalog).

Given that there are so many programs, a natural question is: why on earth would we write yet another software program for generating a count matrix from single-cell RNA-seq reads when there are already plenty of programs out there? There’s alevin, cell rangerdropseqpipedropseqtoolsindrops… I’ve been going in alphabetical order but have to jump in with starsolo because it’s got the coolest name…now back to optimus, scruff, scpipescumiumis, zumis,  and I’m probably forgetting a few other something-umis. So why another one?

The answer requires briefly venturing back to a time long, long ago when RNA-seq was a fledgling, exciting new technology (~2009). At the time the notion of an “equivalence class” was introduced to the field (see e.g. Jiang and Wong, 2009 or Nicolae et al., 2011). Briefly, there is a natural equivalence relation on the set of reads in an RNA-seq experiment, where two reads are related when they are compatible with (i.e. could have originated from) exactly the same set of transcripts. The equivalence relation partitions the reads into equivalence classes, and, in a slight abuse of notation, the term “equivalence class” in RNA-seq is used to denote the set of transcripts corresponding to an equivalence class of reads. Starting with the pseudoalignment program kallisto that we published in Bray et al. 2016, it became possible to rapidly obtain the (transcript) equivalence classes for reads from an RNA-seq experiment.

When single-cell RNA-seq started to scale it became apparent to those of us working with equivalence classes for bulk RNA-seq that rather than counting genes from single-cell RNA-seq data, it would be better to examine what we called transcript compatibility counts (TCCs), i.e. counts of the equivalence classes (the origin of the term TCC is discussed in a previous blog post of mine). This vision has borne out: we recently published a paper demonstrating the power of TCCs for differential analysis of single-cell data (Ntranos, Yi et al. 2019) and I believe TCCs are ideal for many different single-cell RNA-seq analyses. So back to the question: why another single-cell RNA-seq pre-processing workflow?

Already in 2016 we wanted to be able to produce TCC matrices from single-cell RNA-seq data but there was no program to do it. My postdoc at the time, Vasilis Ntranos, developed a workflow, but in the course of working on a program he started to realize that there were numerous non-trivial aspects to processing single-cell RNA-seq. Even basic questions, such as how to correct barcodes or collapse UMIs required careful thought and analysis. As more and more programs for single-cell RNA-seq pre-processing started to appear, we examined them carefully and noted two things: 1. Most were not able to output TCC matrices and 2. They were, for the most part, based on ad hoc heuristics and unvalidated methods. Many of the programs were not even released with a preprint or paper. So we started working on the problem.

A key insight was that we needed a new format to allow for modular pre-processing. So we developed such a format, which we called the Barcode, UMI, Set (BUS) format, and we published a paper about it earlier this year (Melsted, Ntranos et al., 2019). This allowed us to start investigating different algorithms for the key steps, and to rearrange them and plug them in to an overall workflow as needed. Finally, after careful consideration of each of the key steps, weighing tradeoffs between efficiency and accuracy, and extensive experimentation, we settled on a workflow that is faster than any other method and based on reason rather than intuition. The workflow uses two programs, kallisto and bustools, and we call it the kallisto|bustools workflow. Some highlights:

• kallisto|bustools can produce a TCC matrix. The matrix is compatible with the gene count matrix (it collapses to the latter), and of course gene count matrices can be output as well for use in existing downstream tools.
• The workflow is very very fast. With kallisto|bustools very large datasets can be processed in minutes. The title of this post refers to the workflow as “near-optimal” because it runs in time similar to the unix word count function. Maybe it’s possible to be a bit faster with some optimizations, but probably not by much:
• kallisto|bustools uses very little memory. We worked hard to achieve this feature, as we wanted it to be useful for large-scale analyses that are going to be performed by consortia such as the Human Cell Atlas project. The workflow currently uses ~3.5Gb of RAM for processing 10x v2 chemistry data, and ~11Gb RAM for 10x v3 chemistry data; both numbers are independent of the number of reads being processed. This means users can pre-process data on a laptop:
• The workflow is modular, thanks to its reliance on the flexible BUS format. It was straightforward to develop an RNA velocity workflow (more on this in a companion blog post). It will be easy to adapt the workflow to various technologies, to multiomics experiments, and to any custom analysis required:
• We tried to create comprehensive, yet succinct documentation to help make it easy to use the software (recommendations for improvements are welcome). We have online tutorials, as well as videos for novices:
Installation instructions (and video)
Getting started tutorial (and video).
– Manuals for kallisto and bustools.
– Complete code for reproducing all the results in the preprint
• We were not lazy. In our tests we found variability in performance on different datasets so we tested the program extensively and ran numerous benchmarks on 10x Genomics data to validate Cell Ranger with respect to kallisto|bustools (note that Cell Ranger’s methods have been neither validated nor published). We compiled a benchmark panel consisting of 20 datasets from a wide variety of species. This resulted in 20 supplementary figures, each with 8 panels showing: a) the number of genes detected, b) concordance in counts per gene, c) number of genes detected, d) correlation in gene counts by cell, e) spatial separation between corresponding cells vs. neighboring cells, f,g) t-SNE analysis, h) gene set analysis to detect systematic differences in gene abundance estimation (see example below for the dataset SRR8257100 from the paper Ryu et al., 2019). We also examined in detail results on a species mixing experiment, and confirmed that Cell Ranger is consistent with kallisto on that as well. One thing we did not do in this paper is describe workflows for different technologies but such workflows and accompanying tutorials will be available soon:
• In addition we ran a detailed analysis on the 10x Genomics 10k E18 mouse brain dataset to investigate whether Cell Ranger pre-processing produces different results than kallisto insofar as downstream analyses are concerned. We looked at dimensionality reduction, clustering, identification of marker genes, marker gene prevalence, and pseudotime. The results were all highly concordant. An example (the pseudotime analysis) is shown below:
• We did the math on some of the basic aspects of single-cell RNA-seq. We’re not the first to do this (see, e.g. Petukhov et al., 2018), but one new result we have is an estimation of the UMI diversity on beads. This should be useful for those developing new technologies, or trying to optimize existing protocols:

Note that this post is the first in a series of five that discuss in more detail various aspects of the paper (see links at the top). Finally, a note on reproducibility and usability:

The development of the kallisto|bustools workflow, research into the methods, compilation of the results, and execution of the project required a tremendous team effort, and in working on it I was thinking of the first bioinformatics tool I wrote about and posted to the arXiv (the bioRxiv didn’t exist yet). The paper was:

Nicolas Bray and Lior Pachter, MAVID: Constrained ancestral alignment of multiple sequences, arXiv, 2003.

At the time we posted the code on our own website (now defunct, but accessible via the Wayback machine). We did our best to make the results reproducible but we were limited in our options with the tools available at the time. Furthermore, posting the preprint was highly unusual; there was almost no biology preprinting at the time. Other things have stayed the same. Considerations of software portability, efficiency and documentation were relevant then and remain relevant now.

Still, there has been an incredible development in the tools and techniques available for reproducibility and usability since that time. A lot of the innovation has been made possible by cloud infrastructure, but much of the development has been the result of changes in community standards and requirements (see e.g., Weber et al., 2019). I thought I’d compile a list of the parts and pieces of the project; they are typical for what is needed for a bioinformatics project today and comparing them to the bar in 2003 is mind boggling:

Software: GitHub repositories (kallisto and bustools); releases of binaries for multiple operating systems (Mac, Linux, Windows, Rock64); portable source code with minimal dependencies; multithreading; memory optimization; user interface.

Paper: Preprint (along with extensive Supplement providing backup for every result and claim in the main text); GitHub repository with code to reproduce all the figures/results in the preprint (reproducibility code includes R markdown, python notebooks, snakemake, software versions for every program used, fixed seeds).

Documentation: Manuals for the software; Tutorials for learning to use the code; Explanatory videos (all required materials posted on Github or available on stable websites for download).

The totality of work required to do all of this was substantial. Páll Melsted was the primary developer of kallisto and he wrote and designed bustools, which has been the foundation of the project. The key insight to adopt the BUS format was work in collaboration with Vasilis Ntranos. This was followed by long conversations on the fundamentals of single-cell RNA-seq with Jase Gehring. Sina Booeshaghi carried the project. He was responsible for the crucial UMI collapsing analysis, and put together the paper. Fan Gao, director of the Caltech Bioinformatics Resource Center, set up and implemented the extensive benchmarking, and helped fine-tune the algorithms and converge to the final approach taken. Lambda Lu conducted what I believe to be the most in-depth and detailed analysis to date of the effect of pre-processing on results. Her framework should serve as a template for future development work in this area. Eduardo Beltrame designed the benchmark panels and had a key insight about how to present results that is described in more detail in the companion post on rotating the knee plot. He also helped in the complex task of designing and building the companion websites for the project. Kristján Eldjarn Hjörleifsson helped with the RNA velocity work and helped make custom indices that turned out to be fundamental in understanding the performance of pseudoalignment in the single-cell RNA-seq setting. Sina Booeshaghi spent a lot of time thinking about how to optimize the user experience, making the tutorials and videos, and working overall to make the results of the paper not just reproducible, but the the methods usable.