You are currently browsing the tag archive for the ‘sensitivity’ tag.

This post is the second in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

- Near-optimal pre-processing of single-cell RNA-seq
- Single-cell RNA-seq for dummies
- How to solve an NP-complete problem in linear time
- Rotating the knee (plot) and related yoga
- High velocity RNA velocity

A few months ago, while working on the kallisto | bustools project, some of us in the lab were discussing various aspects of single-cell RNA-seq technology when the conversation veered into a debate over the meaning of some frequently used words and phrases in the art: “library complexity”, “library size”, “sensitivity”, “capture rate”, “saturation”, “number of UMIs”, “bork bork bork” etc. There was some sense of confusion. I felt like a dummy because even after working on RNA-seq for more than a decade, I was still lacking language and clarity about even the most basic concepts. This was perhaps not entirely my fault. Consider, for example, that the phrase “library size” is used to mean “the number of molecules in a cDNA library” by some authors, and the “number of reads sequenced” by others.

Since we were writing a paper on single-cell RNA-seq pre-processing that required some calculations related to the basic concepts (libraries, UMIs, and so on), we decided to write down notation for the key objects. After some back-and-forth, Sina Booeshaghi and I ended up drafting the diagram below that summarizes the sets of objects in a single-cell RNA-seq experiment, and the maps that relate them:

Structure of a single-cell RNA-seq experiment.

Each letter in this diagram is a set. The ensemble of RNA molecules contained within a single cell is denoted by *R*. To investigate *R,* a library (*L*) is constructed from the set of molecules captured from *R* (the set *C*). Typically, *L* is the result of of various fragmentation and amplification steps performed on *C*, meaning each element of *C* may be observed in *L* with some multiplicity. Thus, there is an inclusion map from *C* to *L *(arrow with curly tail), and an injection from *C* to *R *(arrows with head and tail). The library is interrogated via sequencing of some of the molecules in *L*, resulting in a set *F* of fragments. Subsequently, the set *F* is aligned or pseudoaligned to create a set *B*, which in our case is a BUS file. Not every fragment *F* is represented in *B*, hence the injection, rather than bijection, from *B* to *F*, and similarly from *F* to *L. *The set* T* consists of transcripts that correspond to molecules in *C* that were represented in *B*. Note that . Separately, the set *U* consists of the unique molecular identifiers (UMIs) available to label molecules from the cell, and *I* is a multiset of UMIs associated with the molecules in *T. *Importantly, the data from an experiment consists of *F*, together with the support of *I. *The support of *I* means the number of distinct objects in *I, *and is denoted by *|supp(I)|*. The common term is “number of distinct UMIs”.

The diagram has three distinct parts. The sets on the top (*L, F, B*) are “lifted” from *C * and *T *by PCR. Without PCR one would be in an the ideal situation of measuring *C* directly to produce *T,* which would then be used to directly draw inferences about *R. *This is the hope for direct RNA sequencing, a technology that is promising but that cannot yet be applied at the scale of cDNA based methods. The sets *U* and *I *are intended to be seen as orthogonal to the rest of the objects. They relate to the UMIs which, in droplet single-cell RNA-seq technology, are delivered via beads. While the figure was designed to describe single-cell RNA-seq, it is quite general and possibly a useful model for many sequence census assays.

So what is all this formality good for? Nothing in this setup is novel; any practitioner working with single-cell RNA-seq already knows what the ingredients for the technology are. However I do think there is some trouble with the language and meaning of words, and hopefully having names and labels for the relevant sets can help in communication.

**The questions**

With some notation at hand, it is possible to precisely articulate some of the key technical questions associated with a single-cell RNA-seq experiment:

- The
**alignment**(or pseudoalignment) problem: compute*B*from*F*. - The
**pre-processing**problem: what is the set*T*? - What is the
**library richness/complexity**, i.e. what is*|supp(L)|*? - What is the
**sensitivity**, i.e. what is ? - In droplet based experiments, what are the
**number of UMIs available**to tag molecules in a cell, i.e. what is*|U|*?

These basic questions are sometimes confused with each other. For example, the *capture rate *refers to the proportion of cells from a sample that are captured in an experiment and should not be confused with sensitivity. The *|supp(L)| *is a concept that is natural to refer to when thinking about a cDNA library. Note that the “library size”, referred to in the beginning of this post, is used by molecular biologists to naturally mean *|L|, *and not *|F| *(this confusion was unfortunately disseminated by the highly influential RNA-seq papers Anders and Huber, 2010 and Robinson and Oshlack, 2010) . The support of another set, |*supp(I)|, *is one that is easy to measure but precisely because *I* is a multiset, , and there is considerable confusion about this fact. The number of distinct UMIs, |*supp(I)|,* is frequently used in lieu of the set whose size is being referred to, namely |I| (this is the case when “knee plots” are made, a topic for the fourth blog post in this series). Similarly, *|U|* is usually not estimated, and the number where is the length of the UMIs is used in its stead. This is partly intellectual laziness but partly, I think, the lack of standard notation to refer to the objects in single-cell RNA-seq experiments.

This diagram in this post is just step 0 in discussing single-cell RNA-seq. There is a lot more subtlety and nuance in understanding and interpreting experiments (see Introduction to single-cell RNA-seq technologies). ∎

**What are confusion matrices?**

In the 1904 book Mathematical Contributions to the Theory of Evolution, Karl Pearson introduced the notion of contingency tables. Sometime around the 1950s the term “confusion matrix” started to be used for such tables, specifically for 2×2 tables arising in the evaluation of algorithms for binary classification.

**Example**: Suppose there are 11 items labeled **A,B,C,D,E,F,G,H,I,J,K** four of which are of the category blue (also to be called “positive”) and eight of which are of the category red (also to be called “negative”). An algorithm called BEST receives as input the objects *without* their category labels, i.e. just **A,B,C,D,E,F,G,H,I,J,K **and must rank them so that the top of the ranking is enriched, as much as possible, for blue items. Suppose BEST produces as output the ranking:

**C****E****A****B****F****K****G****D****I****H****J**

How good is BEST at distinguishing the blue (positive) from the red (negative) items, i.e. how enriched are they at the top of the list? The confusion matrix provides a way to organize the assessment. We explain by example: if we declare the top 6 predictions of BEST blue and classify the remainder as red

========

Blue

**C****E****A****B****F****K**========

Red**G****D****I****H****J**

then the confusion matrix is:

Predicted category |
|||

Blue | Red | ||

True category |
Blue |
3 |
1 |

Red |
3 |
4 |

The matrix shows the number of predictions that are correct (3) as well as the number of blue items that are missing (1). Similarly, it shows the number of red items that were incorrectly predicted to be blue (3), as well as the number of red items correctly predicted to be red (4). The four numbers have names (true positives, false negatives, false positives and true negatives) and various other numbers can be derived from them that shed light on the quality of the prediction:

The table and the terminology are trivial. All that is involved is simple counting of successes and failures in a binary classification experiment, and computation of derived measures that are obtained by elementary arithmetic. **But here is a little secret that I’ve always felt embarrassed to admit**… I can never remember the details. For example I find myself forgetting whether “positive predictive value” is really “precision” or how that relates to “specificity”. For many years I thought I was the only one. I had what you might call “confusion syndrome”… But I’ve slowly come to realize that I am not alone.

**Confused people**

I became confused about confusion matrices from the moment I first encountered them when I started studying computational biology in the mid 1990s. One of the first papers I read at the time was a landmark review by Burset and Guigó (1996) on the evaluation of gene structure prediction programs (the review has now been cited over 800 times). I have highlighted some text from the paper, where the authors explain what “specificity” means.

Not only are there two definitions for specificity on the same page, there is also a typo in Figure 1 (specificity is abbreviated Sn instead of Sp) adding to the confusion.

Burset and Guigó are of course not to blame for the non-standard use of specificity in gene finding. They were merely describing deviant behavior by computational biologists working on gene finding who had apparently decided to redefine a term widely used in statistics and computer science. However lacking context at the time when I was first learning of these terms, I remained confused for years about whether I was supposed to use specificity or the term “precision” for what the gene finding people were computing.

Simple confusion about terms is not the only problem with the confusion matrix. It is filled with (abbreviated) jargon that can be overwhelming. Who can remember TP,TN,FP,FN,TPR,FPR,SN,SP,FDR,PPV,FOR,NPV,FDR,FNR,TNR,SPC,ACC,LR+,LR- etc. etc. etc. ? Even harder is keeping track of the many formulas relating the quantities, e.g. FDR = 1 – PPV or FPR = 1 – SPC. And how is one supposed to gain intuition about a formula described as ACC = (TP + TN) / (TP + FP + FN + TN)?

Confusion matrix confusion remains a problem in computational biology. Consider the recent preprint Creating a universal SNP and small indel variant caller with deep neural networks by Poplin *et al.* published on December 14, 2016. The paper, from Google/Verily, shows how a deep convolutional neural network can be used to call genetic variation by learning from *images* of read pileups. This is fascinating and potentially deep innovation. However the authors, experts in machine learning, were confused in the caption of their Figure 2A, incorrectly describing the FDR-Sensitivity plot as a ROC curve. This in turn caused confusion for readers expecting a ROC curve to be a function.

While an FPR-Sensitivity plot is a function an FDR-Sensitivity plot doesn’t have to be. I was confused by this subtlety myself until an author of Poplin *et al.* kindly sorted me out.

Hence this post, a note for myself to avoid further confusion by utilizing elementary trigonometry to organize the information in confusion (matrices). As we will see shortly, the Poplin *et al. *Figure 2A is what is more commonly known as a precision-recall (PR) curve, except that the usual axes of a PR plot are flipped, in addition to a reflection which adds further confusion.

**ROC curves**

The quality of the BEST prediction in the example above can be encoded in a picture as follows:

Figure 1: A false-positive / true-positive plot.

The grid is 4 lines high and 8 lines wide because there are 4 blue objects and 7 red ones. The BEST ranking is a path from the lower left hand corner of the grid (0,0) to the top right hand corner (7,4). The BEST predictions can be “read off” by following the path. Starting with the top ranked object, the path moves one step up if the object is (in truth) blue, and one step to the right if it is (in truth) red. For example, the first prediction of BEST is C which is blue, and therefore the first step is up. Continuing in this way one eventually arrives at (7,4) because each object is somewhere (only once) on BEST’s list. A BEST confusion matrix corresponds to a point in the plot. The confusion matrix discussed in the introduction thresholding after 6 predictions corresponds to the green point which is the location after 6 steps from (0,0).

The path corresponding to the BEST ranking, when scaled to the unit square, is called a **ROC curve**. This stands for receiver operator curve (the terminology originates from WWII). Again, each thresholding of the ranking, e.g. the top six as discussed above (green point), corresponds to a single point on the path:

Figure 2: A ROC plot.

In the ROC plot the x-axis is the false positive rate (FPR = # false positives / total number of true negatives) and the y-axis the sensitivity (Sn = #true positives / total number of true positives).

**Taxicab geometry**

At this point it is useful to think about taxicab geometry (formally known as geometry). In taxicab geometry, the distance between two points on the grid is the number of steps in the shortest path (using edges of the grid) connecting them. In the figure below, a false positive / true positive plot (as shown in Figure 1) is annotated to highlight the connection to trigonometry:

Figure 3: Taxicab confusion.

The terms associated to the confusion matrix can now be understood using taxicab trigonometry as follow:

(false discovery rate),

(positive predictive value),

(negative predictive value),

(false omission rate),

Fortunately trigonometry identities are easy in this geometry. Instead of the usual we have . Thus we see that FDR + PPV = 1 (or PPV = 1 – FDR) and NPV + FOR = 1 (or NPV = 1 -FOR). The false positive rate and true positive rate are seen to just be rescaling of the FP and TP, i.e. and , and the analogies to the identities just described are therefore FPR + SPC = 1 and TPR + FNR = 1.

Revisiting Figure 2A from Poplin *et al. *we see that FDR is best interpreted as 1 – PPV. PPV is also known as “precision”. The y-axis, sensitivity, is also called “recall”. Thus the curve is a precision-recall curve, although precision-recall curves are typically shown with recall on the x-axis.

Aside from helping to sort out identities, taxicab trigonometry is appealing in other ways. For example the angle in the example of Figure 3 is just . In fact in the regime in which the confusion trigonometry takes place and . There is a pedagogical point here: it is useful to identify a quantity such as FDR with directly with the angle subtended by the points (0,0) and (FP,TP). In fact one sees immediately that a confusion matrix is succinctly described by just two quantities (in addition to N and P). These do not necessarily have to be FP and TP, a pair such as FDR and NPV can also suffice.

No longer confused by confusion matrices? Show that

.

**Probabilistic**** interpretation**

While I think the geometric view of confusion matrices is useful for keeping track of the big picture, there is a probabilistic interpretation of many of the concepts involved that is useful to keep in mind.

For example, the precision associated to a confusion matrix can be interpreted as the probability that a prediction is correct. The identity FDR + PPV = 1 can therefore be thought of probabilistically, with FDR being the probability that a prediction is incorrect.

Most useful among such probabilistic interpretations is the area under a ROC curve (AUROC), which is frequently computed in biology papers. It is the probability that a randomly selected “positive” object is ranked higher than a randomly selected “negative” object by the classifier being evaluated. For example, in the red/blue/BEST example given above, the AUROC (= ) is the probability that a randomly selected blue letter is ranked higher than a randomly selected red letter by BEST.

## Recent Comments