You are currently browsing the tag archive for the ‘FDR’ tag.
What are confusion matrices?
In the 1904 book Mathematical Contributions to the Theory of Evolution, Karl Pearson introduced the notion of contingency tables. Sometime around the 1950s the term “confusion matrix” started to be used for such tables, specifically for 2×2 tables arising in the evaluation of algorithms for binary classification.
Example: Suppose there are 11 items labeled A,B,C,D,E,F,G,H,I,J,K four of which are of the category blue (also to be called “positive”) and eight of which are of the category red (also to be called “negative”). An algorithm called BEST receives as input the objects without their category labels, i.e. just A,B,C,D,E,F,G,H,I,J,K and must rank them so that the top of the ranking is enriched, as much as possible, for blue items. Suppose BEST produces as output the ranking:
How good is BEST at distinguishing the blue (positive) from the red (negative) items, i.e. how enriched are they at the top of the list? The confusion matrix provides a way to organize the assessment. We explain by example: if we declare the top 6 predictions of BEST blue and classify the remainder as red
then the confusion matrix is:
The matrix shows the number of predictions that are correct (3) as well as the number of blue items that are missing (1). Similarly, it shows the number of red items that were incorrectly predicted to be blue (3), as well as the number of red items correctly predicted to be red (4). The four numbers have names (true positives, false negatives, false positives and true negatives) and various other numbers can be derived from them that shed light on the quality of the prediction:
The table and the terminology are trivial. All that is involved is simple counting of successes and failures in a binary classification experiment, and computation of derived measures that are obtained by elementary arithmetic. But here is a little secret that I’ve always felt embarrassed to admit… I can never remember the details. For example I find myself forgetting whether “positive predictive value” is really “precision” or how that relates to “specificity”. For many years I thought I was the only one. I had what you might call “confusion syndrome”… But I’ve slowly come to realize that I am not alone.
I became confused about confusion matrices from the moment I first encountered them when I started studying computational biology in the mid 1990s. One of the first papers I read at the time was a landmark review by Burset and Guigó (1996) on the evaluation of gene structure prediction programs (the review has now been cited over 800 times). I have highlighted some text from the paper, where the authors explain what “specificity” means.
Not only are there two definitions for specificity on the same page, there is also a typo in Figure 1 (specificity is abbreviated Sn instead of Sp) adding to the confusion.
Burset and Guigó are of course not to blame for the non-standard use of specificity in gene finding. They were merely describing deviant behavior by computational biologists working on gene finding who had apparently decided to redefine a term widely used in statistics and computer science. However lacking context at the time when I was first learning of these terms, I remained confused for years about whether I was supposed to use specificity or the term “precision” for what the gene finding people were computing.
Simple confusion about terms is not the only problem with the confusion matrix. It is filled with (abbreviated) jargon that can be overwhelming. Who can remember TP,TN,FP,FN,TPR,FPR,SN,SP,FDR,PPV,FOR,NPV,FDR,FNR,TNR,SPC,ACC,LR+,LR- etc. etc. etc. ? Even harder is keeping track of the many formulas relating the quantities, e.g. FDR = 1 – PPV or FPR = 1 – SPC. And how is one supposed to gain intuition about a formula described as ACC = (TP + TN) / (TP + FP + FN + TN)?
Confusion matrix confusion remains a problem in computational biology. Consider the recent preprint Creating a universal SNP and small indel variant caller with deep neural networks by Poplin et al. published on December 14, 2016. The paper, from Google/Verily, shows how a deep convolutional neural network can be used to call genetic variation by learning from images of read pileups. This is fascinating and potentially deep innovation. However the authors, experts in machine learning, were confused in the caption of their Figure 2A, incorrectly describing the FDR-Sensitivity plot as a ROC curve. This in turn caused confusion for readers expecting a ROC curve to be a function.
While an FPR-Sensitivity plot is a function an FDR-Sensitivity plot doesn’t have to be. I was confused by this subtlety myself until an author of Poplin et al. kindly sorted me out.
Hence this post, a note for myself to avoid further confusion by utilizing elementary trigonometry to organize the information in confusion (matrices). As we will see shortly, the Poplin et al. Figure 2A is what is more commonly known as a precision-recall (PR) curve, except that the usual axes of a PR plot are flipped, in addition to a reflection which adds further confusion.
The quality of the BEST prediction in the example above can be encoded in a picture as follows:
Figure 1: A false-positive / true-positive plot.
The grid is 4 lines high and 8 lines wide because there are 4 blue objects and 7 red ones. The BEST ranking is a path from the lower left hand corner of the grid (0,0) to the top right hand corner (7,4). The BEST predictions can be “read off” by following the path. Starting with the top ranked object, the path moves one step up if the object is (in truth) blue, and one step to the right if it is (in truth) red. For example, the first prediction of BEST is C which is blue, and therefore the first step is up. Continuing in this way one eventually arrives at (7,4) because each object is somewhere (only once) on BEST’s list. A BEST confusion matrix corresponds to a point in the plot. The confusion matrix discussed in the introduction thresholding after 6 predictions corresponds to the green point which is the location after 6 steps from (0,0).
The path corresponding to the BEST ranking, when scaled to the unit square, is called a ROC curve. This stands for receiver operator curve (the terminology originates from WWII). Again, each thresholding of the ranking, e.g. the top six as discussed above (green point), corresponds to a single point on the path:
Figure 2: A ROC plot.
In the ROC plot the x-axis is the false positive rate (FPR = # false positives / total number of true negatives) and the y-axis the sensitivity (Sn = #true positives / total number of true positives).
At this point it is useful to think about taxicab geometry (formally known as geometry). In taxicab geometry, the distance between two points on the grid is the number of steps in the shortest path (using edges of the grid) connecting them. In the figure below, a false positive / true positive plot (as shown in Figure 1) is annotated to highlight the connection to trigonometry:
Figure 3: Taxicab confusion.
The terms associated to the confusion matrix can now be understood using taxicab trigonometry as follow:
(false discovery rate),
(positive predictive value),
(negative predictive value),
(false omission rate),
Fortunately trigonometry identities are easy in this geometry. Instead of the usual we have . Thus we see that FDR + PPV = 1 (or PPV = 1 – FDR) and NPV + FOR = 1 (or NPV = 1 -FOR). The false positive rate and true positive rate are seen to just be rescaling of the FP and TP, i.e. and , and the analogies to the identities just described are therefore FPR + SPC = 1 and TPR + FNR = 1.
Revisiting Figure 2A from Poplin et al. we see that FDR is best interpreted as 1 – PPV. PPV is also known as “precision”. The y-axis, sensitivity, is also called “recall”. Thus the curve is a precision-recall curve, although precision-recall curves are typically shown with recall on the x-axis.
Aside from helping to sort out identities, taxicab trigonometry is appealing in other ways. For example the angle in the example of Figure 3 is just . In fact in the regime in which the confusion trigonometry takes place and . There is a pedagogical point here: it is useful to identify a quantity such as FDR with directly with the angle subtended by the points (0,0) and (FP,TP). In fact one sees immediately that a confusion matrix is succinctly described by just two quantities (in addition to N and P). These do not necessarily have to be FP and TP, a pair such as FDR and NPV can also suffice.
No longer confused by confusion matrices? Show that
While I think the geometric view of confusion matrices is useful for keeping track of the big picture, there is a probabilistic interpretation of many of the concepts involved that is useful to keep in mind.
For example, the precision associated to a confusion matrix can be interpreted as the probability that a prediction is correct. The identity FDR + PPV = 1 can therefore be thought of probabilistically, with FDR being the probability that a prediction is incorrect.
Most useful among such probabilistic interpretations is the area under a ROC curve (AUROC), which is frequently computed in biology papers. It is the probability that a randomly selected “positive” object is ranked higher than a randomly selected “negative” object by the classifier being evaluated. For example, in the red/blue/BEST example given above, the AUROC (= ) is the probability that a randomly selected blue letter is ranked higher than a randomly selected red letter by BEST.
- Baruch Barzel & Albert-László Barabási, Network link prediction by global silencing of indirect correlations, Nature Biotechnology 31(8), 2013, p 720–725. doi:10.1038/nbt.2601
- Soheil Feizi, Daniel Marbach, Muriel Médard & Manolis Kellis, Network deconvolution as a general method to distinguish direct dependencies in networks, Nature Biotechnology 31(8), 2013, p 726–733. doi:10.1038/nbt.2635
An inconvenient request
One of the great things about conferences is that there is time to chat in person with distant friends and collaborators. Last July, at the ISMB conference in Berlin, I found myself doing just that during one of the coffee breaks. Suddenly, Manolis Kellis approached me and asked to talk in private. The reason for his interruption: he came to request that I remove an arXiv post of mine, namely “Comment on ‘Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions“, a response to a paper by Ward and Kellis. Why? He pointed out that my arXiv post was ranking highly on Google. This was inconvenient for him, he explained, while insisting that it was wrong of me to post a criticism of his work on a forum where he could not directly respond. He suggested that it would be best to work out any issues I might have with his paper offline. Unfortunately, there was the inconvenient truth that arXiv postings cannot be removed. Unlike some journals, where, say, a supplement can be revised while having the original removed (see the figure switching of Feizi et al.), arXiv preprints are permanent.
My initial confusion quickly turned to anger. After all, my arXiv comment had been rejected from Science where I had submitted it as a technical comment on the Ward-Kellis paper. I had then put it on the arXiv as a last resort measure to at least have some record of my concerns publicly accessible. How is this wrong? Can one not critique the work of Manolis Kellis?
Network nonsense begins
My first review of a Manolis Kellis paper was in the fall of 2006, in my capacity as a program committee member of the Research in Computational Molecular Biology (RECOMB) conference held in Oakland, CA in 2007. Because Oakland is right next to Berkeley, a number of Berkeley professors were involved in organizing and running the conference. Terry Speed was chair of the program committee. I was out of the country that year on sabbatical at the University of Oxford, so I could not participate, or even attend, the conference, but I volunteered to serve on the program committee. For those not familiar with the RECOMB review process, it is modeled after the standard Computer Science conferences. The program committee chair forms the program committee, who are then assigned a handful of papers to review and score. Reviews are entered on a website, and after a brief period of online discussion about borderline papers, scores are revised and accept/reject decisions are made. Authors can revise their manuscripts, but the reviewers never see the papers again before publication in the proceedings.
One of the papers I was assigned to review was by a student named Joshua Grochow and his advisor Manolis Kellis. The paper was titled “Network Motif Discovery Using Subgraph Enumeration and Symmetry-Breaking“. Although networks were not my research focus at the time, and “symmetry-breaking” evoked in me nightmares from the physics underworld, I agreed to the review. The paper seemed to contain some interesting algorithms, appeared to have a combinatorial flavor, and potentially important applications- a good mix for RECOMB.
The problem addressed by Grochow & Kellis was that of identifying “network motifs” in biological networks. “Motifs” can be defined in a variety of ways, and the Grochow-Kellis objective was simple. In graph theoretic terms, given a graph G, the goal was to find subgraphs occurring with high multiplicity to an extent unlikely in a random graph. There are many models for random graphs, and the one that the results in Grochow-Kellis are based on is the Erdös-Renyi model (each edge chosen independently with some fixed probability). The reason this definition might be of biological interest, is that recurrent motifs interspersed in a graph are likely to represent evolutionarily conserved interaction modules.
The paper begins with a description of the method. I won’t go into the details here, except to say that everything seemed well until I read the caption of Figure 3. There the number 27,720 caught my eye.
During my first few years of graduate school I took many courses on enumerative and algebraic combinatorics. There are some numbers that combinatorialists just “know”. For example, seeing 42 emerge as the answer to a counting problem does not bring to mind Douglas Adams, but rather the vast literature on Catalan numbers and their connections to dozens of well-known counting problems. Similarly, a number such as 126 brings to mind binomial coefficients (), and with them the idea of counting the number of subsets of fixed size from a larger set. When I saw the number 27,720 I had a hunch that somehow some canonical combinatorial set had been enumerated. The idea may have entered my mind because of the picture of the “motif” in which I immediately recognized a clique (all vertices mutually connected) and a stable set (no pair of vertices connected). In any case, I realized that
The significance of this is that the “motif” on the left-hand side of Figure 3 had appeared many times because of a type of double- or rather thousandfold- counting. Instead of representing statistically significant recurring independent motifs, this “motif” arises because of a combinatorial artifact. In the specific example of Figure 3, the motif occurred once for any choice of 4 nodes from the clique of size 9, and any choice of 3 nodes from the stable set of size 12.
The point is that in a graph, any subgraph attached to a large clique (or stable set) will occur many times. This simple observation is a result of the fact that there are many subgraphs of a clique (or stable set) that are identical. I realized that this meant that the Grochow-Kellis method was essentially a heuristic for finding cliques and stable sets in graphs. The particular “network motifs” they were pulling out were just subgraphs that happened to be connected to some large cliques and stable sets. There are two problems with this: first, a clique or a stable set can hardly be considered an interesting “network motif”. Moreover, the fact that they appear in biological networks much more than in Erdös-Renyi random graphs is not surprising. Second, there is a large literature on finding cliques in graphs, none of which Grochow-Kellis cited or seemed to be familiar with.
The question of the performance of the Grochow-Kellis algorithm is answered in their Figure 3 as well. There is a slightly larger motif consisting of 6 nodes from the stable set of size 12, instead of 3. That motif occurs in all subsets of the stable set instead of subsets which means that there is a motif that occurs 116,424 times! Grochow and Kellis’s algorithm did not even achieve its stated goal. It really ought to have outputted the left hand side figure with six nodes in the stable set on the left, and not three. In other words, this was a paper providing uninteresting solutions from a biological point of view, and doing so poorly to boot.
I wrote up a detailed report on the paper, and posted it on the RECOMB review website together with poor scores reflecting my opinion that the paper had to be rejected. How could RECOMB, ostensibly the premier computer science conference on computational and algorithmic biology, publish a paper with neither a computational nor biological result? Not to mention an algorithm that demonstratably did not find the most frequently occurring motif.
As you might already guess, my rejection was subsequently overruled. I don’t know who made the final decision to accept the Grochow & Kellis paper to the RECOMB conference, although presumably it was the program committee chair. The decision jarred with my sense of scientific integrity. I had put considerable effort into reviewing the paper and understanding it, and I felt that I had provided a compelling objective argument for why the paper was fundamentally flawed- the fact that the results were trivial (and incorrect!) was not a subjective statement. At this point I need to point out that the RECOMB conference is quite difficult to get into. The acceptance rate for papers in 2007, consistent with other years, was 21.8%. I knew this meant that even a single very negative review, especially one with a compelling argument against the paper, almost certainly would lead to rejection of the paper. So I couldn’t understand then, nor do I still understand now, on what basis the decision was made to accept the paper. This bothered me greatly, and after much deliberation I started boycotting the conference. Despite publishing five RECOMB papers from 2000 to 2006 and regularly attending the meeting during that time, the continued poor decisions and haphazard standards for papers selected have led me to not return in almost 8 years.
Grochow and Kellis obviously received my review and considered how to “deal with it”. They added a section titled “The role of combinatorial effects”, in which they explained the origins of the number 27,720 that they gleaned from my report, but then spun the bad news they had received as “resulting from combinatorial connectivity patterns prevalent in larger network structures.” They then added that “…this combinatorial clustering effect brings into question the current definition of network motif” and proposed that “additional statistics…might well be suited to identify larger meaningful networks.” This is a lot like someone claiming to discover a bacteria whose DNA is arsenic-based and upon being told by others that the “discovery” is incorrect – in fact, that very bacteria seeks out phosphorous – responding that this is “really helpful” and that it “raises lots of new interesting open questions” about how arsenate gets into cells. Chutzpah. When you discover your work is flawed, the correct response is to retract it.
I don’t think people read papers very carefully. Joshua Grochow went on to win the MIT Charles and Jennifer Johnson Outstanding M. Eng. Thesis Award for his RECOMB work on network motif discovery. [Added February 18: Grochow and Kellis have posted a reply here].
The nature of man
I have to admit that after the Grochow-Kellis paper I was a bit skeptical of Kellis’ work. Not because of the paper itself (everyone makes mistakes), but because of the way he responded to my review. So a year and a half ago, when Manolis Kellis published a paper in an area I care about and am involved in, I may have had a negative prior. The paper was Luke Ward and Manolis Kellis “Evidence for Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions”, Science 337 (2012) . Having been involved with the ENCODE pilot, where I contributed to the multiple alignment sub-project, I was curious what comparative genomics insights the full-scale $130 million dollar project revealed. The press releases accompanying the Ward-Kellis paper (e.g. The Nature of Man, The Economist) were suggesting that Ward and Kellis had figured out what makes a human a human; my curiosity was understandably piqued.
Ward and Kellis combined population genomic data from the 1000 Genomes Project with biochemical data from the ENCODE project to look for signatures of human constraint in regulatory elements. Their analysis was based on measuring three different proxies for constraint: SNP density, heterozygosity and derived allele frequency. To identify specific classes of regulatory regions under constraint, aggregated regions associated with specific gene ontology (GO) categories were tested for significance. Reading the paper I was amazed to discover they found precisely two categories: retinal cone cell development and nerve growth factor receptor signaling. It was only upon reading the supplement that I discovered that their tests had produced 53 other GO categories as well (Table S5).
Despite the fact that the listed categories were required to pass a false discovery rate (FDR) threshold for both the heterozygosity and derived allele frequency (DAF) measures, it was statistically invalid for them to highlight any specific GO category. FDR control merely guarantees a low false discovery rate among the entries in the entire list. Moreover, there was no obvious explanation for why categories such as chromatin binding (which had a smaller DAF than nerve growth) or protein binding (with the smallest p-value) appeared to be under purifying selection. As with the Feizi et al. paper, the supplement produced a story much less clean than the one presented in the main body of the paper. In fact, retinal cone cell development and nerve growth factor were 33 and 34 out of the 55 listed GO categories when sorted by the DAF p-value (42 and 54 when sorted by heterozygosity p-value). In other words, the story being sold in the paper was based on blatant statistically invalid cherry picking.
The other result of the paper was an estimate that in addition to the 5% of the human genome conserved across mammalian genomes, at least another 4% has been subject to lineage-specific constraint. This result was based on adding up the estimates of constrained nucleotides from their Table S6 (using the derived allele frequency measure). These were calculated using a statistic that was computed as follows: for each one of ten bins determined according to estimated background selection strength, and for every feature F, the average DAF value DF was rescaled to
where DCNDC and DNCNE were the bin-specific average DAFs of conserved non-degenerate coding regions and non-conserved non-ENCODE regions respectively. One problem with the statistic is that the non-conserved regions contain nucleotides not conserved in all mammals, which is not the same as nucleotides not conserved in any mammals. The latter would have been needed in order to identify human specific constraint. Second, the statistic PUCF was used as a proxy for the proportion under constraint even though, as defined, it could be less than zero or greater than one. Indeed, in Table S6 there were four values among the confidence intervals for the estimated proportions using DAF that included values less than 0% or above 100%:
Ward and Kellis were therefore proposing that some features might have a negative number of nucleotides under constraint. Moreover, while it is possible that after further rescaling PUCF might have correlated with the true proportion of nucleotides under constraint, there was no argument provided in the paper. Thus, while Ward and Kellis claimed to have estimated the proportion of nucleotides under constraint, they had only computed a statistic named “proportion under constraint”.
Nicolas Bray and I wrote up these points in a short technical comment and submitted it to the journal Science early in November 2012. The comment was summarily rejected with a curt reply by senior editor Laura Zahn stating that “relative to other Technical Comments we have recently received we feel that the scope and focus of your comment make it more suitable for the Online Comments facility at Science, rather than as a candidate for publication as a Technical Comment.” It is worth noting that Science did decide to publish another comment: Phil Green and Brent Ewing’s, “Comment on’Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions‘”, Science 10 (2013). Green and Ewing’s comment is biological in nature. Their concern is that “… the polymorphism trends are primarily attributable to mutational variation and technical artifacts rather than selection.” Its fine that Science decided to host a debate on a biology question on its pages, but how can one debate the interpretation of results from a method, when the method is fundamentally flawed to begin with? After all, our problem with PUC was much deeper than a “technical flaw”.
We decided at the end to place the comment in the arXiv. After doing so, it became apparent that it had little impact. Indeed, I have never received any feedback about it from anyone. Apparently even this was too much for Manolis Kellis.
By the time I noticed the Feizi et al. paper in the journal Nature Biotechnology early in August 2013, my experiences reading Kellis’ papers had subtly altered the dynamic between myself and the printed word. Usually, when I read a paper and I don’t understand something, I assume the fault lies with me. I think most people are like this. But now, when the Feizi et al. paper started to not make sense, I didn’t presume the problem was with me. I tried hard to give the paper a fair reading, but after a few paragraphs the spell of the authors was already broken. And so it is that Nicolas Bray and I came to figure out what was really going on in Feizi et al., a project that eventually led us to also look at Barzel-Barabási.
Speaking frankly, it was difficult work to write the blog posts about these articles. In addition to the time it took, it was exhausting and exasperating to discover the flaws, fallacies and frauds. Both Nick and I prefer to do research. But we felt a responsibility to spell out in detail what had happened here. Manolis Kellis is not just any scientist. He has, and continues to play leading roles in major consortium projects such as mod-ENCODE and ENCODE, and he has served on numerous advisory committees for the NHGRI. He is a member of the GCAT (Genomics, Computational Biology and Technology) study section until 2018. That any person would swap out a key figure in a published paper without publishing a correction, and without informing the editor is astonishing. That a person with great responsibility towards scientists is an abuser of science is unacceptable.
Manolis Kellis’ behavior is part of a systemic problem in computational biology. The cross-fertilization of ideas between mathematics, statistics, computer science and biology is both an opportunity and a danger. It is not hard to peddle incoherent math to biologists, many of whom are literally math phobic. For example, a number of responses I’ve received to the Feizi et al. blog post have started with comments such as
“I don’t have the expertise to judge the math, …”
Similarly, it isn’t hard to fool mathematicians into believing biological fables. Many mathematicians throughout the country were recently convinced by Jonathan Rothberg to donate samples of their DNA so that they might find out “what makes them a genius”. Such mathematicians, and their colleagues in computer science and statistics, take at face value statements such as “we have figured out what makes a human human”. In the midst of such confusion, it is easy for an enterprising “computational person” to take advantage of the situation, and Kellis has.
I believe the solution for this problem is for computational biologists to start taking themselves more seriously. Whether serving as reviewers for journals, as panel members for funding agencies, on hiring/tenure committees, or writing articles, all of us have to tone down the hype and pay closer attention to the science. There are many examples of what this means: a review of a math/stats containing paper cannot be a single paragraph long and based on a hunch, and similarly computational biologists shouldn’t claim, as have many of the authors of papers I’ve reviewed in these posts, pathways to cure disease and explanations for what makes humans human. Don’t fool the biologists. Don’t fool the computer scientists, statisticians, and mathematicians.
The possibilities for computational methods in biology are unlimited. The future is exciting, and there are possibilities for significant advances in areas ranging from molecular and evolutionary biology to medicine. But money, citations and fame cannot rule the day. The details of the #methodsmatter.