You are currently browsing the monthly archive for December 2016.

The idea that 2016 was the worst year in history was already floated back in July, but despite a lot of negativity, there were definitely good things that happened in 2016. This was certainly true in terms of scientific discoveries and breakthroughs. One of my favorites was a significant improvement to the upper bound of the **Happy Ending **problem **after 81 years**!

The Happy Ending problem was the brainchild of mathematician Esther Klein, who proved that any five points in general position in the plane must contain a convex quadrilateral.

The extension of the problem asks, for each *n*, for the minimum number so that any points in general position in the plane contain a convex *n*-gon. Paul Erdös and George Szekeres established an upper bound for via Ramsey Theory in 1935, a link that highlighted the importance and application of Ramsey Theory to extremal combinatorics and instantly made the problem a classic. It even has connections to computational biology, via it’s link to the Erdös-Szekeres monotone subsequence theorem (published in the same 1935 paper), which in turn is related to the Needleman-Wunsch algorithm.

The “happy ending” associated to the problem comes from the story of Esther Klein and George Szekeres, who fell in love while working on the problem and ended up marrying in 1937, a happy ending that lasted for 68 years (they died within an hour of each other in 2005).

In their 1935 paper, Erdös-Szekeres proved an upper bound for using elementary combinatorics, namely

.

The difficulty of improving the upper bound for can be appreciated by noting that it took 63 years for there to be *any* improvement to the Erdös-Szekeres result, and when an improvement did arrive in 1998 it was to remove the “+1” (Fan Chung and Ron Graham, Discrete Geometry 1998) reducing the upper bound to for .

Considering that the best lower bound for , conjectured to be optimal, is (proved by Erdös and Szekeres in 1960), the +1 improvement left a very long way to go. A few other improvements after the Chung-Graham paper was published reduced the upper bound some more, but at most by constant factors.

Then, earlier this year, Andrew Suk announced an astonishing improvement. Building on a theorem of Attila Pór and Pavel Valtr, he proved that

.

The result doesn’t exactly match the conjectured lower bound, so one cannot say the happy ending to the Happy Ending Problem arrived in 2016, but it’s so *so* ** so **close (i.e. it solves the problem asymptotically) that one can only be left with optimism and hope for 2017.

Happy new year!

**What are confusion matrices?**

In the 1904 book Mathematical Contributions to the Theory of Evolution, Karl Pearson introduced the notion of contingency tables. Sometime around the 1950s the term “confusion matrix” started to be used for such tables, specifically for 2×2 tables arising in the evaluation of algorithms for binary classification.

**Example**: Suppose there are 11 items labeled **A,B,C,D,E,F,G,H,I,J,K** four of which are of the category blue (also to be called “positive”) and eight of which are of the category red (also to be called “negative”). An algorithm called BEST receives as input the objects *without* their category labels, i.e. just **A,B,C,D,E,F,G,H,I,J,K **and must rank them so that the top of the ranking is enriched, as much as possible, for blue items. Suppose BEST produces as output the ranking:

**C****E****A****B****F****K****G****D****I****H****J**

How good is BEST at distinguishing the blue (positive) from the red (negative) items, i.e. how enriched are they at the top of the list? The confusion matrix provides a way to organize the assessment. We explain by example: if we declare the top 6 predictions of BEST blue and classify the remainder as red

========

Blue

**C****E****A****B****F****K**========

Red**G****D****I****H****J**

then the confusion matrix is:

Predicted category |
|||

Blue | Red | ||

True category |
Blue |
3 |
1 |

Red |
3 |
4 |

The matrix shows the number of predictions that are correct (3) as well as the number of blue items that are missing (1). Similarly, it shows the number of red items that were incorrectly predicted to be blue (3), as well as the number of red items correctly predicted to be red (4). The four numbers have names (true positives, false negatives, false positives and true negatives) and various other numbers can be derived from them that shed light on the quality of the prediction:

The table and the terminology are trivial. All that is involved is simple counting of successes and failures in a binary classification experiment, and computation of derived measures that are obtained by elementary arithmetic. **But here is a little secret that I’ve always felt embarrassed to admit**… I can never remember the details. For example I find myself forgetting whether “positive predictive value” is really “precision” or how that relates to “specificity”. For many years I thought I was the only one. I had what you might call “confusion syndrome”… But I’ve slowly come to realize that I am not alone.

**Confused people**

I became confused about confusion matrices from the moment I first encountered them when I started studying computational biology in the mid 1990s. One of the first papers I read at the time was a landmark review by Burset and Guigó (1996) on the evaluation of gene structure prediction programs (the review has now been cited over 800 times). I have highlighted some text from the paper, where the authors explain what “specificity” means.

Not only are there two definitions for specificity on the same page, there is also a typo in Figure 1 (specificity is abbreviated Sn instead of Sp) adding to the confusion.

Burset and Guigó are of course not to blame for the non-standard use of specificity in gene finding. They were merely describing deviant behavior by computational biologists working on gene finding who had apparently decided to redefine a term widely used in statistics and computer science. However lacking context at the time when I was first learning of these terms, I remained confused for years about whether I was supposed to use specificity or the term “precision” for what the gene finding people were computing.

Simple confusion about terms is not the only problem with the confusion matrix. It is filled with (abbreviated) jargon that can be overwhelming. Who can remember TP,TN,FP,FN,TPR,FPR,SN,SP,FDR,PPV,FOR,NPV,FDR,FNR,TNR,SPC,ACC,LR+,LR- etc. etc. etc. ? Even harder is keeping track of the many formulas relating the quantities, e.g. FDR = 1 – PPV or FPR = 1 – SPC. And how is one supposed to gain intuition about a formula described as ACC = (TP + TN) / (TP + FP + FN + TN)?

Confusion matrix confusion remains a problem in computational biology. Consider the recent preprint Creating a universal SNP and small indel variant caller with deep neural networks by Poplin *et al.* published on December 14, 2016. The paper, from Google/Verily, shows how a deep convolutional neural network can be used to call genetic variation by learning from *images* of read pileups. This is fascinating and potentially deep innovation. However the authors, experts in machine learning, were confused in the caption of their Figure 2A, incorrectly describing the FDR-Sensitivity plot as a ROC curve. This in turn caused confusion for readers expecting a ROC curve to be a function.

While an FPR-Sensitivity plot is a function an FDR-Sensitivity plot doesn’t have to be. I was confused by this subtlety myself until an author of Poplin *et al.* kindly sorted me out.

Hence this post, a note for myself to avoid further confusion by utilizing elementary trigonometry to organize the information in confusion (matrices). As we will see shortly, the Poplin *et al. *Figure 2A is what is more commonly known as a precision-recall (PR) curve, except that the usual axes of a PR plot are flipped, in addition to a reflection which adds further confusion.

**ROC curves**

The quality of the BEST prediction in the example above can be encoded in a picture as follows:

Figure 1: A false-positive / true-positive plot.

The grid is 4 lines high and 8 lines wide because there are 4 blue objects and 7 red ones. The BEST ranking is a path from the lower left hand corner of the grid (0,0) to the top right hand corner (7,4). The BEST predictions can be “read off” by following the path. Starting with the top ranked object, the path moves one step up if the object is (in truth) blue, and one step to the right if it is (in truth) red. For example, the first prediction of BEST is C which is blue, and therefore the first step is up. Continuing in this way one eventually arrives at (7,4) because each object is somewhere (only once) on BEST’s list. A BEST confusion matrix corresponds to a point in the plot. The confusion matrix discussed in the introduction thresholding after 6 predictions corresponds to the green point which is the location after 6 steps from (0,0).

The path corresponding to the BEST ranking, when scaled to the unit square, is called a **ROC curve**. This stands for receiver operator curve (the terminology originates from WWII). Again, each thresholding of the ranking, e.g. the top six as discussed above (green point), corresponds to a single point on the path:

Figure 2: A ROC plot.

In the ROC plot the x-axis is the false positive rate (FPR = # false positives / total number of true negatives) and the y-axis the sensitivity (Sn = #true positives / total number of true positives).

**Taxicab geometry**

At this point it is useful to think about taxicab geometry (formally known as geometry). In taxicab geometry, the distance between two points on the grid is the number of steps in the shortest path (using edges of the grid) connecting them. In the figure below, a false positive / true positive plot (as shown in Figure 1) is annotated to highlight the connection to trigonometry:

Figure 3: Taxicab confusion.

The terms associated to the confusion matrix can now be understood using taxicab trigonometry as follow:

(false discovery rate),

(positive predictive value),

(negative predictive value),

(false omission rate),

Fortunately trigonometry identities are easy in this geometry. Instead of the usual we have . Thus we see that FDR + PPV = 1 (or PPV = 1 – FDR) and NPV + FOR = 1 (or NPV = 1 -FOR). The false positive rate and true positive rate are seen to just be rescaling of the FP and TP, i.e. and , and the analogies to the identities just described are therefore FPR + SPC = 1 and TPR + FNR = 1.

Revisiting Figure 2A from Poplin *et al. *we see that FDR is best interpreted as 1 – PPV. PPV is also known as “precision”. The y-axis, sensitivity, is also called “recall”. Thus the curve is a precision-recall curve, although precision-recall curves are typically shown with recall on the x-axis.

Aside from helping to sort out identities, taxicab trigonometry is appealing in other ways. For example the angle in the example of Figure 3 is just . In fact in the regime in which the confusion trigonometry takes place and . There is a pedagogical point here: it is useful to identify a quantity such as FDR with directly with the angle subtended by the points (0,0) and (FP,TP). In fact one sees immediately that a confusion matrix is succinctly described by just two quantities (in addition to N and P). These do not necessarily have to be FP and TP, a pair such as FDR and NPV can also suffice.

No longer confused by confusion matrices? Show that

.

**Probabilistic**** interpretation**

While I think the geometric view of confusion matrices is useful for keeping track of the big picture, there is a probabilistic interpretation of many of the concepts involved that is useful to keep in mind.

For example, the precision associated to a confusion matrix can be interpreted as the probability that a prediction is correct. The identity FDR + PPV = 1 can therefore be thought of probabilistically, with FDR being the probability that a prediction is incorrect.

Most useful among such probabilistic interpretations is the area under a ROC curve (AUROC), which is frequently computed in biology papers. It is the probability that a randomly selected “positive” object is ranked higher than a randomly selected “negative” object by the classifier being evaluated. For example, in the red/blue/BEST example given above, the AUROC (= ) is the probability that a randomly selected blue letter is ranked higher than a randomly selected red letter by BEST.

## Recent Comments