This post is the third in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

- Near-optimal pre-processing of single-cell RNA-seq
- Single-cell RNA-seq for dummies
- How to solve an NP-complete problem in linear time
- Rotating the knee (plot) and related yoga
- High velocity RNA velocity

There is a million dollar prize on offer for a solution to the P vs. NP problem, so it’s understandable that one may wonder whether this blog post is an official entry. **It is not**.

The title for this post was inspired by a talk presented by David Tse at the CGSI 2017 meeting where he explained “How to solve NP-hard assembly problems in linear time“. The gist of the talk was summarized by Tse as follows:

“In computational genomics there’s been a lot of problems where the formulation is combinatorial optimization. Usually they come from some maximum likelihood formulation of some inference problem and those problems end up being mostly NP-hard. And the solution is typically to develop some heuristic way of solving the NP-hard problem. What I’m saying here is that actually there is a different way of approaching such problems. You can look at them from an information point of view.”

Of course thinking about NP-hard problems from an information point of view does not provide polynomial algorithms for them. But what Tse means is that information-theoretic insights can lead to efficient algorithms that squeeze the most out of the available information.

One of the computational genomics areas where an NP-complete formulation for a key problem was recently proposed is in single-cell RNA-seq pre-processing. After RNA molecules are captured from cells, they are amplified by PCR, and it is possible, in principle, to account for the PCR duplicates of the molecules by making use of unique molecular identifiers (UMIs). Since UMIs are (in theory) unique to each captured molecule, but identical among the PCR duplicates of that captured molecule, they can be used to identify and discard the PCR duplicates. In practice distinct captured molecules may share the same UMI causing a *collision*, so it can be challenging to decide when to “collapse” reads to account for PCR duplicates.

In the recent paper Srivastava et al. 2019, the authors developed a combinatorial optimization formulation for collapsing. They introduce the notion of “monochromatic arborescences” on a graph, where these objects correspond to what is, in the language of the previous post, elements of the set *C*. They explain that the combinatorial optimization formulation of UMI collapsing in this framework is to find a minimum cardinality covering of a certain graph by monochromatic arboresences. The authors then prove the following theorem, by reduction from the dominating set decision problem:

*Theorem* [Srivastava, Malik, Smith, Sudbery, Patro]: Minimum cardinality covering by monochromatic arborescences is NP-complete.

Following the standard practice David Tse described in his talk, the authors then apply a heuristic to the challenging NP-complete problem. It’s all good except for one small thing. The formulation is based on an assumption, articulated in Srivastava et al. 2019 (boldface and strikethrough is mine):

…gene-level deduplication provides a conservative approach and assumes that it is highly unlikely for molecules that are distinct transcripts of the same gene to be tagged with a similar UMI (within an edit distance of 1 from another UMI from the same gene). However, entirely discarding transcript-level information will mask true UMI collisions to some degree, even when there is direct evidence that similar UMIs must have arisen from distinct transcripts. For example, if similar UMIs appear in transcript-disjoint equivalence classes (even if all of the transcripts labeling both classes belong to the same gene), then they

cannothave arisen from the same pre-PCR molecule.Accounting for such cases is especially.~~true~~[important] when using an error-aware deduplication approach and as sequencing depth increases

The one small thing? Well… the authors never checked whether the claim at the end, namely that “accounting for such cases is especially important”, is actually true. In our paper “Modular and efficient pre-processing of single-cell RNA-seq” we checked. The result is in our Figure 1d:

Each column in the figure corresponds to a dataset, and the y-axis shows the distribution (over cells) of the proportion of counts one can expect to lose if applying naïve collapsing to a gene. Naïve collapsing here means that two reads with the same UMI are considered to have come from the same molecule. The numbers are so small we had to include an inset in the top right. Basically, it almost *never* happens that there is “direct evidence that similar UMIs must have arisen from distinct transcripts”. If one does observe such an occurrence, it is almost certainly an artifact of missing annotation. In fact, this leads to an…

💡 Idea: prioritize genes with colliding UMIs for annotation correction. The UMIs directly highlight transcripts that are incomplete. Maybe for a future paper, but returning to the matter at hand…

Crucially, the *information* analysis shows that there is no point in solving an NP-complete problem in this setting. The naïve algorithm not only suffices, it is sensible to apply it. And the great thing about naïve collapsing is that it’s straightforward to implement and run; **the algorithm is linear**. The Srivastava et al. question of what is the “minimum number of UMIs, along with their counts, required to explain the set of mapped reads” is a precise, but wrong question. In the words of John Tukey: “**Far better an approximate answer to the ***right* question, which is often vague, than an *exact*** answer to the wrong question, which can always be made precise**.”

The math behind Figure 1d is elementary but interesting (see the Supplementary Note of our paper). We work with a simple binomial model which we justify based on the data. For related work see Petukhov et al. 2018. One interesting result that came out of our calculations (work done with Sina Booeshaghi), is an estimate for the effective number of UMIs on each bead in a cell. This resulted in Supplementary Figure 1:

The result is encouraging. While the number of UMIs on a bead is not quite where *L* is the length of the UMI (theoretical maximum shown by dashed red line for v2 chemistry and solid red line for v3 chemistry), it is nevertheless high. We don’t know whether the variation is a result of batch effect, model mis-specification, or other artifacts; that is an interesting question to explore with more data and analysis.

As for UMI collapsing, the naïve algorithm has been used for almost every experiment to date as it is the method that was implemented in the Cell Ranger software, and subsequently adopted in other software packages. This was done without any consideration of whether it is appropriate. As the Srivastava et al. paper shows, intuition is not to be relied upon, but fortunately, in this case, the naïve approach is the right one.

## 1 comment

Comments feed for this article

June 21, 2019 at 7:50 pm

Allen KnutsonBack when Moore’s Law was the norm, there was a linear algorithm for all exponential problems: wait for computers to improve, then run the program.