You are currently browsing the tag archive for the ‘UMI’ tag.

This post is the fourth in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

1. Near-optimal pre-processing of single-cell RNA-seq
2. Single-cell RNA-seq for dummies
3. How to solve an NP-complete problem in linear time
4. Rotating the knee (plot) and related yoga
5. High velocity RNA velocity

The “knee plot” is a standard single-cell RNA-seq quality control that is also used to determine a threshold for considering cells valid for analysis in an experiment. To make the plot, cells are ordered on the x-axis according to the number of distinct UMIs observed. The y-axis displays the number of distinct UMIs for each barcode (here barcodes are proxies for cells). The following example is from Aaron Lun’s DropletUtils vignette:

A single-cell RNA-seq knee plot.

High quality barcodes are located on the left hand side of the plot, and thresholding is performed by identifying the “knee” on the curve. On the right hand side, past the inflection point, are barcodes which have relatively low numbers of reads, and are therefore considered to have had failure in capture and to be too noisy for further analysis.

In Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019, we display a series of plots for a benchmark panel of 20 datasets, and the first plot in each panel (subplot A)is a knee plot. The following example is from an Arabidopsis thaliana dataset (Ryu et al., 2019; SRR8257100)

Careful examination of our plots shows that unlike the typical knee plot made for single-cell RNA-seq , ours has the x- and y- axes transposed. In our plot the x-axis displays the number of distinct UMI counts, and the y-axis corresponds to the barcodes, ordered from those with the most UMIs (bottom) to the least (top). The figure below shows both versions of a knee plot for the same data (the “standard” one in blue, our transposed plot in red):

Why bother transposing a plot?

We begin by observing that if one ranks barcodes according to the number of distinct UMIs associated with them (from highest to lowest), then the rank of a barcode with x distinct UMIs is given by f(x) where

$f(x) = |\{c:\# \mbox{UMIs} \in c \geq x\}|$.

In other words, the rank of a barcode is interpretable as the size of a certain set. Now suppose that instead of only measurements of RNA molecules in cells, there is another measurement. This could be measurement of surface protein abundances (e.g. CITE-seq or REAP-seq), or measurements of sample tags from a multiplexing technology (e.g. ClickTags). The natural interpretation of #distinct UMIs as the independent variable and  the rank of a barcode as the dependent variable is now clearly preferable. We can now define a bivariate function f(x,y) which informs on the number of barcodes with at least x RNA observations and tag observations:

$f(x,y) = |\{c:\# \mbox{UMIs} \in c \geq x \mbox{ and} \# \mbox{tags} \in c \geq y \}|$.

Nadia Volovich, with whom I’ve worked on this, has examined this function for the 8 sample species mixing experiment from Gehring et al. 2018. The function is shown below:

Here the x-axis corresponds to the #UMIs in a barcode, and the y-axis to the number of tags. The z-axis, or height of the surface, is the f(x,y) as defined above.  Instead of thresholding on either #UMIs or #tags, this “3D knee plot” makes possible thresholding using both (note that the red curve shown above corresponds to one projection of this surface).

Separately from the issue described above, there is another subtle issue with the knee plot. The x-axis (dependent) variable really ought to display the number of molecules assayed rather than the number of distinct UMIs. In the notation of Melsted, Booeshaghi et al., 2019 (see also the blog post on single-cell RNA-seq for dummies), what is currently being plotted is |supp(I)|, instead of |I|. While |I| cannot be directly measured, it can be inferred (see the Supplementary Note of Melsted, Booeshaghi et al., 2019), where the cardinality of I is denoted by k (see also Grün et al,, 2014). If d denotes the number of distinct UMIs for a barcode and n the effective number of UMIs , then k can be estimated by

$\hat{k} = \frac{log(1-\frac{d}{n})}{log(1-\frac{1}{n})}$.

The function estimating k is monotonic so for the purpose of thresholding with the knee plot it doesn’t matter much whether the correction is applied, but it is worth noting that the correction can be applied without much difficulty.

This post is the third in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

1. Near-optimal pre-processing of single-cell RNA-seq
2. Single-cell RNA-seq for dummies
3. How to solve an NP-complete problem in linear time
4. Rotating the knee (plot) and related yoga
5. High velocity RNA velocity

There is a million dollar prize on offer for a solution to the P vs. NP problem, so it’s understandable that one may wonder whether this blog post is an official entry. It is not.

The title for this post was inspired by a talk presented by David Tse at the CGSI 2017 meeting where he explained “How to solve NP-hard assembly problems in linear time“. The gist of the talk was summarized by Tse as follows:

“In computational genomics there’s been a lot of problems where the formulation is combinatorial optimization. Usually they come from some maximum likelihood formulation of some inference problem and those problems end up being mostly NP-hard. And the solution is typically to develop some heuristic way of solving the NP-hard problem. What I’m saying here is that actually there is a different way of approaching such problems. You can look at them from an information point of view.”

Of course thinking about NP-hard problems from an information point of view does not provide polynomial algorithms for them. But what Tse means is that information-theoretic insights can lead to efficient algorithms that squeeze the most out of the available information.

One of the computational genomics areas where an NP-complete formulation for a key problem was recently proposed is in single-cell RNA-seq pre-processing. After RNA molecules are captured from cells, they are amplified by PCR, and it is possible, in principle, to account for the PCR duplicates of the molecules by making use of unique molecular identifiers (UMIs). Since UMIs are (in theory) unique to each captured molecule, but identical among the PCR duplicates of that captured molecule, they can be used to identify and discard the PCR duplicates. In practice distinct captured molecules may share the same UMI causing a collision, so it can be challenging to decide when to “collapse” reads to account for PCR duplicates.

In the recent paper Srivastava et al. 2019, the authors developed a combinatorial optimization formulation for collapsing. They introduce the notion of “monochromatic arborescences” on a graph, where these objects correspond to what is, in the language of the previous post, elements of the set C. They explain that the combinatorial optimization formulation of UMI collapsing in this framework is to find a minimum cardinality covering of a certain graph by monochromatic arboresences. The authors then prove the following theorem, by reduction from the dominating set decision problem:

Theorem [Srivastava, Malik, Smith, Sudbery, Patro]: Minimum cardinality covering by monochromatic arborescences is NP-complete.

Following the standard practice David Tse described in his talk, the authors then apply a heuristic to the challenging NP-complete problem. It’s all good except for one small thing. The formulation is based on an assumption, articulated in Srivastava et al. 2019 (boldface and strikethrough is mine):

…gene-level deduplication provides a conservative approach and assumes that it is highly unlikely for molecules that are distinct transcripts of the same gene to be tagged with a similar UMI (within an edit distance of 1 from another UMI from the same gene). However, entirely discarding transcript-level information will mask true UMI collisions to some degree, even when there is direct evidence that similar UMIs must have arisen from distinct transcripts. For example, if similar UMIs appear in transcript-disjoint equivalence classes (even if all of the transcripts labeling both classes belong to the same gene), then they cannot have arisen from the same pre-PCR molecule. Accounting for such cases is especially true [important] when using an error-aware deduplication approach and as sequencing depth increases.

The one small thing? Well… the authors never checked whether the claim at the end, namely that “accounting for such cases is especially important”, is actually true. In our paper “Modular and efficient pre-processing of single-cell RNA-seq” we checked. The result is in our Figure 1d:

Each column in the figure corresponds to a dataset, and the y-axis shows the distribution (over cells) of the proportion of counts one can expect to lose if applying naïve collapsing to a gene. Naïve collapsing here means that two reads with the same UMI are considered to have come from the same molecule. The numbers are so small we had to include an inset in the top right. Basically, it almost never happens that there is “direct evidence that similar UMIs must have arisen from distinct transcripts”. If one does observe such an occurrence, it is almost certainly an artifact of missing annotation. In fact, this leads to an…

💡 Idea: prioritize genes with colliding UMIs for annotation correction. The UMIs directly highlight transcripts that are incomplete. Maybe for a future paper, but returning to the matter at hand…

Crucially, the information analysis shows that there is no point in solving an NP-complete problem in this setting. The naïve algorithm not only suffices, it is sensible to apply it. And the great thing about naïve collapsing is that it’s straightforward to implement and run; the algorithm is linear. The Srivastava et al. question of what is the “minimum number of UMIs, along with their counts, required to explain the set of mapped reads” is a precise, but wrong question. In the words of John Tukey: “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”

The math behind Figure 1d is elementary but interesting (see the Supplementary Note of our paper). We work with a simple binomial model which we justify based on the data. For related work see Petukhov et al. 2018. One interesting result that came out of our calculations (work done with Sina Booeshaghi), is an estimate for the effective number of UMIs on each bead in a cell. This resulted in Supplementary Figure 1:

The result is encouraging. While the number of UMIs on a bead is not quite $4^L$ where L is the length of the UMI (theoretical maximum shown by dashed red line for v2 chemistry and solid red line for v3 chemistry), it is nevertheless high. We don’t know whether the variation is a result of batch effect, model mis-specification, or other artifacts; that is an interesting question to explore with more data and analysis.

As for UMI collapsing, the naïve algorithm has been used for almost every experiment to date as it is the method that was implemented in the Cell Ranger software, and subsequently adopted in other software packages. This was done without any consideration of whether it is appropriate. As the Srivastava et al. paper shows, intuition is not to be relied upon, but fortunately, in this case, the naïve approach is the right one.

This post is the second in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

A few months ago, while working on the kallisto | bustools project, some of us in the lab were discussing various aspects of single-cell RNA-seq technology when the conversation veered into a debate over the meaning of some frequently used words and phrases in the art: “library complexity”, “library size”, “sensitivity”, “capture rate”, “saturation”, “number of UMIs”, “bork bork bork” etc. There was some sense of confusion. I felt like a dummy because even after working on RNA-seq for more than a decade, I was still lacking language and clarity about even the most basic concepts. This was perhaps not entirely my fault. Consider, for example, that the phrase “library size” is used to mean “the number of molecules in a cDNA library” by some authors, and the “number of reads sequenced” by others.

Since we were writing a paper on single-cell RNA-seq pre-processing that required some calculations related to the basic concepts (libraries, UMIs, and so on), we decided to write down notation for the key objects. After some back-and-forth, Sina Booeshaghi and I ended up drafting the diagram below that summarizes the sets of objects in a single-cell RNA-seq experiment, and the maps that relate them:

Structure of a single-cell RNA-seq experiment.

Each letter in this diagram is a set. The ensemble of RNA molecules contained within a single cell is denoted by R. To investigate R, a library (L) is constructed from the set of molecules captured from R (the set C). Typically, L is the result of of various fragmentation and amplification steps performed on C, meaning each element of C may be observed in L with some multiplicity. Thus, there is an inclusion map from C to L (arrow with curly tail), and an injection from C to R (arrows with head and tail). The library is interrogated via sequencing of some of the molecules in L, resulting in a set F of fragments. Subsequently, the set F is aligned or pseudoaligned to create a set B, which in our case is a BUS file. Not every fragment F is represented in B, hence the injection, rather than bijection, from B to F, and similarly from F to L. The set T consists of transcripts that correspond to molecules in C that were represented in B. Note that $|R| \geq |C| \geq |T|$. Separately, the set U consists of the unique molecular identifiers (UMIs) available to label molecules from the cell, and I is a multiset of UMIs associated with the molecules in T. Importantly, the data from an experiment consists of F, together with the support of I. The support of I means the number of distinct objects in I, and is denoted by |supp(I)|. The common term is “number of distinct UMIs”.

The diagram has three distinct parts. The sets on the top (L, F, B) are “lifted” from  and by PCR. Without PCR one would be in an the ideal situation of measuring C directly to produce T, which would then be used to directly draw inferences about R. This is the hope for direct RNA sequencing, a technology that is promising but that cannot yet be applied at the scale of cDNA based methods. The sets U and I are intended to be seen as orthogonal to the rest of the objects. They relate to the UMIs which, in droplet single-cell RNA-seq technology, are delivered via beads. While the figure was designed to describe single-cell RNA-seq, it is quite general and possibly a useful model for many sequence census assays.

So what is all this formality good for? Nothing in this setup is novel; any practitioner working with single-cell RNA-seq already knows what the ingredients for the technology are. However I do think there is some trouble with the language and meaning of words, and hopefully having names and labels for the relevant sets can help in communication.

The questions

With some notation at hand, it is possible to precisely articulate some of the key technical questions associated with a single-cell RNA-seq experiment:

• The alignment (or pseudoalignment) problem: compute B from F.
• The pre-processing problem: what is the set ?
• What is the library richness/complexity, i.e. what is |supp(L)|?
• What is the sensitivity, i.e. what is $\frac{|C|}{|R|}$?
• In droplet based experiments, what are the number of UMIs available to tag molecules in a cell, i.e. what is |U|?

These basic questions are sometimes confused with each other. For example, the capture rate refers to the proportion of cells from a sample that are captured in an experiment and should not be confused with sensitivity. The |supp(L)| is a concept that is natural to refer to when thinking about a cDNA library. Note that the “library size”, referred to in the beginning of this post, is used by molecular biologists to naturally mean |L|, and not |F| (this confusion was unfortunately disseminated by the highly influential RNA-seq papers Anders and Huber, 2010 and Robinson and Oshlack, 2010) . The support of another set, |supp(I)|, is one that is easy to measure but precisely because I is a multiset, $|I| \neq |supp(I)|$, and there is considerable confusion about this fact. The number of distinct UMIs, |supp(I)|, is frequently used in lieu of the set whose size is being referred to, namely |I| (this is the case when “knee plots” are made, a topic for the fourth blog post in this series). Similarly, |U| is usually not estimated, and the number $4^L$ where $L$ is the length of the UMIs is used in its stead. This is partly intellectual laziness but partly, I think, the lack of standard notation to refer to the objects in single-cell RNA-seq experiments.

This diagram in this post is just step 0 in discussing single-cell RNA-seq. There is a lot more subtlety and nuance in understanding and interpreting experiments (see Introduction to single-cell RNA-seq technologies). ∎