You are currently browsing Lior Pachter’s articles.
During my third year of undergraduate study I took a course taught by David Gabai in which tests were negatively graded. This meant that points were subtracted for every incorrect statement made in a proof. As proofs could, in principle, contain an unbounded number of false statements, there was essentially no lower bound on the grade one could receive on a test (course grades was bounded below by “F”). Although painful on occasion, I grew to love the class and the transcendent lessons that it taught. These memories came flooding back this past week, when a colleague brought to my attention the paper A simple and fast algorithm for K-medoids clustering by Hae-Sang Park and Chi-Hyuck Jun.
The Park-Jun paper introduces a K-means like method for K-medoid clustering. K-medoid clustering is a clustering formulation based on a generalization of (a relative of) the median rather than the mean, and is useful for the same robustness reasons that make the median preferable to the mean. The medoid is defined as follows: given n points in , the medoid is a point among them with the property that it minimizes the average distance to the other points, i.e. minimizes . In the case of , when n is odd, this corresponds to the median (the direct generalization of the median is to find a point not necessarily among the minimizing the average distance to the other points, and this is called the geometric median).
The K-medoid problem is to partition the points into k disjoint subsets so that if are the respective medoids of the subsets (clusters) then the average distance of each medoid to the points of the cluster it belongs to is minimized. In other words, the K-medoids problem is to find
For comparison, K-means clustering is the problem of finding
where the are the centroids of the points in each partition . The K-means and K-medoids problems are instances of partitional clustering formulations that are NP-hard.
The most widely used approach for finding a (hopefully good) solution to the K-medoids problem has been a greedy clustering algorithm called PAM by Kaufman and Rousseeuw (partitioning around medoids). To understand the Park-Jun contribution it is helpful to briefly review PAM first. The method works as follows:
- Initialize by selecting k distinguished points from the set of points.
- Identify, for each point, the distinguished point it is closest to. This identification partitions the points into k sets and a “cost” can be associated to the partition, namely the sum of the distances from each point to its associated distinguished point (note that the distinguished points may not yet be the medoids of their respective partitions).
- For each distinguished point d, and each non-distingsuished point x repeat the assignment and cost calculation of step 2 with d and x swapped so that x becomes distinguished and d returns to undistinguished status. Choose the swap that minimizes the total cost. If no swap reduces the cost then output the partition and the distinguished points.
- Repeat step 3 until termination.
It’s easy to see that when the PAM algorithm terminates the distinguished points must be medoids for the partitions associated to them.
The running time of PAM is . This is because in step 3, for each of the k distinguished points, there are n-k swaps to consider for a total of swaps, and the cost computation for each swap requires n-k assignments and additions. Thus, PAM is quadratic in the number of points.
To speed-up the PAM method, Park and Jun introduced in their paper an analogy of the K-means algorithm, with mean replaced by median:
- Initialize by selecting k distinguished points from the set of points.
- Identify, for each point, the distinguished point it is closest to. This identification partitions the points into k sets in the same way as the PAM algorithm.
- Compute the medoid for each partition. Repeat step 2 until the cost no longer decreases.
Park-Jun claim that their method has run time complexity “ which is equivalent to K-means clustering”. Yet the pseudocode in the paper begins with the clearly quadratic requirement “calculate the distance between every pair of all objects..”
Step 3 of the algorithm is also quadratic. The medoid of a set of m points is computed in time . Given m points the medoid must be one of them, so it suffices to compute, for each point, the sum of the distances to the others ( additions) and a medoid is then identified by taking the minimum. Furthermore, without assumptions on the structure of the distance matrix there cannot be a faster algorithm (with the triangle inequality the medoid can be computed in ).
Quadratic functions are not linear. If they were, it would mean that, with , . If that were the case then
for all x.
Assuming that a is positive and plugging in one would obtain that and it would follow that
When reading the paper, it was upon noticing this “result” that negative grading came to mind. With a proof that 0=1 we are at, say, a score of -10 for the paper. Turning to the discussion of Fig. 3 of the paper we read that “…the proposed method takes about constant time near zero regardless of the number of objects.”
I suppose a generous reading of this statement is that it is an allusion to Taylor expansion of the polynomial around . A less generous interpretation is that the figure is deliberately misleading, intended to show linear growth with slope close to zero for what is a quadratic function. I decided to be generous and grade this a -5, leading to a running total of -15.
It seems the authors may have truly believed that their algorithm was linear because in the abstract they begin with “This paper proposes a new algorithm for K-medoids clustering which runs like the K-means algorithm”. It seems possible that the authors thought they had bypassed what they viewed as the underlying cause for quadratic running time of the PAM algorithm, namely the quadratic swap. The K-means algorithm (Lloyd’s algorithm) is indeed linear, because the computation of the mean of a set of points is linear in the number of points (the values for each coordinate can be averaged separately). However the computation of a medoid is not the same as computation of the mean. -5, now a running total of –20. The actual running time of the Park-Jun algorithm is shown below:
Replicates on different instances are crucial, because absent from running time complexity calculations are the stopping times which are data dependent. A replicate of Figure 3 is shown below (using the parameters in Table 5 and implementations in MATLAB). The Park-Jun method is called as “small” in MATLAB.
Interestingly, the MATLAB implementation of PAM has been sped up considerably. Also, the “constant time near zero” behavior described by Park and Jun is clearly no such thing. For this lack of replication of Figure 3 another deduction of 5 points for a total of -25.
There is not much more to the Park and Jun paper, but unfortunately there is some. Table 6 shows a comparison of K-means, PAM and the Park-Jun method with true clusters based on a simulation:
The Rand index is a measure of concordance between clusters, and the higher the Rand index the greater the concordance. It seems that the Park-Jun method is slightly better than PAM, which are both superior to K-means in the experiment. However the performance of K-means is a lot better in this experiment than suggested by Figure 2 which was clearly (rotten) cherry picked (Fig2b):
For this I deduct yet another 5 points for a total of -30.
Amazingly, the Park and Jun paper has been cited 479 times. Many of the citations propagate the false claims in the paper, e.g. in referring to Park and Jun, Zadegan et al. write “in each iteration the new set of medoids is selected with running time O(N), where N is the number of objects in the dataset”. The question is, how many negative points is the repetition of false claims from a paper that is filled with false claims worth?
According to Bach’s first biographer, when the great composer was asked to explain the secret of his success he replied “I was obliged to be industrious. Whoever is equally industrious will succeed equally well”. I respectfully disagree.
First, Bach didn’t have to drive the twenty kids he fathered over his lifetime to soccer practices, arrange their playdates, prepare and cook dinners, and do the laundry. To his credit he did teach his sons composition (though notably only his sons). But even that act was time for composition, an activity which he was seemingly able to devote himself to completely. For example, the fifteen two part inventions were composed in Köthen between 1720 and 1723 as exercises for his 9 year old son Wilhelm Friedemann, at the same time he was composing the first volume of the Well Tempered Clavier. In other words Bach had the freedom and time to pursue a task which was valued (certainly after Mendelssohn’s revival) by others. That is not the case for many, whose hard work, even when important, is trivialized and diminished when they are alive and forgotten when they are dead.
The second reason I disagree with Bach’s suggestion that those who work as hard as he did will succeed equally well is that his talent transcended that which is achievable by hard work alone. Although there is some debate as to whether Bach’s genius was in his revealing of beauty or his discovery of truth (I tend to agree with Richard Taruskin, whose view is discussed eloquently by Bernard Chazelle in an interview from 2014), regardless of how one thinks of Bach genius, his music was extraordinary. Much was made recently of the fact that Chuck Berry’s Johnny B. Goode was included the Voyager spacecraft record, but, as Lewis Thomas said, if we really wanted to boast we would have sent the complete works of Bach. As it is, three Bach pieces were recorded, the most of any composer or musician. The mortal Chuck Berry and Wolfgang Amadeus Mozart appear just once. In other words, even the aliens won’t believe Bach’s claim that ” whoever is equally industrious will succeed equally well”.
Perhaps nitpicking Bach’s words is unfair, because he was a musician and not an orator or writer. But what about his music? Others have blogged him, but inevitably an analysis of his music reveals that even though he may have occasionally broken his “rules”, the fact that he did so, and how he did so, was the very genius he is celebrated for. Still, it’s fun to probe the greatest composer of all time, and I recently found myself doing just that, with none other than one of his “simplest” works, one of the inventions written as an exercise for his 9 year old child. The way this happened is that after a 30 year hiatus, I recently found myself revisiting some of the two part inventions. My nostalgia started during a power outage a few weeks ago, when I realized the piano was the only interesting non-electronic instrument that was working in the house, and I therefore decided to use the opportunity to learn a fugue that I hadn’t tackled before. A quick attempt revealed that I would be wise to brush up on the inventions first. My favorite invention is BWV 784 in A minor, and after some practice it slowly returned to my fingers, albeit in a cacophony of uneven, rheumatic dissonance. But I persevered, and not without reward. Not a musical reward- my practice hardly improved the sound- but a music theory reward. I realized, after playing some of the passages dozens of times, that I was playing a certain measure differently because it sounded better to me the “wrong” way. How could that be?
The specific measure was number 16:This measure is the third in a four measure section that starts in measure 14, in which a sequence of broken chords (specifically diminished seventh chords) connect a section ending in E minor with one beginning in A minor. It’s an interesting mathomusical problem to consider the ways in which those keys can be bridged in a four measure sequence (according to the “rules of Bach”), and in listening to the section while playing it I had noticed that the E♮notes at the end of measure 15 and in measure 16 were ok… but somehow a bit strange to my ear. To understand what was going on I had to dust off my First Year Harmony book by William Lovelock for some review (Lovelock’s text is an introduction to harmony and counterpoint and is a favorite of mine, perhaps because it is written in an axiomatic style). The opening broken diminished seventh chord in measure 14 is really the first inversion of the leading note (C♯) for D minor. It is followed by a dominant seventh (A) leading into the key of C major. We see the same pattern repeated there (leading diminished seventh in first inversion followed by the dominant seventh) and two more times until measure 18. In other words, Bach’s solution to the problem of arriving at A minor from E minor was to descend via D -> C -> B♭ -> A with alternating diminished sevenths with dominant sevenths to facilitate the modulations.
Except that is not what Bach did! Measure 16 is an outlier in the key of E minor instead of the expected B♭. What I was hearing while playing was that it was more natural for me to play the E ♭notes (marked in red in the manuscript above) than the annotated E♮notes. My 9 year old daughter recorded my (sloppy, but please ignore that!) version below:
Was Bach wrong? I tried to figure it out but I couldn’t think of a good reason why he would diverge from the natural harmonic sequence for the section. I found the answer in a booklet formally analyzing the two part inventions. In An Analytical Survey of the Fifteen Two-Part Inventions by J.S. Bach by Theodore O. Johnson the author explains that the use of E♮instead of E ♭at the end of measure 15 and in the beginning of measure 16 preserves the sequence of thirds each three semitones apart. My preference for E♭breaks the sequence- I chose musical harmony of mathematical harmony. Bach obviously faced the dilemma but refused to compromise. The composed sequence, even though not strictly the expected harmonic one, sounds better the more one listens to it. Upon understanding the explanation, I’ve grown to like it just as much and I’m fine with acknowledging that E minor works perfectly well in measure 16. It’s amazing to think how deeply Bach thought through every note in every composition. The attention to detail, even in a two part invention intended as an exercise for a 9 year old kid, is incredible.
The way Bach wrote the piece, with the E♮can be heard in Glenn Gould‘s classical recording:
Gould performs the invention in 45 seconds, an astonishing technical feat that has been panned by some but that I will admit I am quite fond of listening to. The particular passage in question is harmonically complex- Johnson notes that the three diminished seventh chords account for all twelve tones- and I find that Gould’s rendition highlights the complexity rather than diminishing it. In any case, what’s remarkable about Gould’s performance is that he too deviates from Bach’s score! Gould plays measure 18 the same way as the first measure instead of with the inversion which Bach annotated. I marked the difference in the score (see oval in battleship gray, which was Gould’s favorite color).
Life is short, break the rules.
In 1997 physicist Roger Penrose sued Kimberly-Clark Corporation for infringing on his “Penrose patent” with their Kleenex-Quilted toilet paper. He won the lawsuit but fortunately for lavaphiles the patent has expired leaving much room for aperiodic creativity in the bathroom.
Math is involved in many aspects of house design (two years ago I wrote about how math is even related to something as mundane as the roof), but it is especially important in the design of bathroom floors. The most examined floors in houses are those of bathrooms, as they are stared at for hours on end by pensive thinkers sitting on toilet seats. The best bathroom floors present beautiful tessellations not as mathematical artifact but mathematical artwork, and with this in mind I designed a three colored Penrose tiling for our bathroom a few years ago. This is its story:
Roger Penrose published a series of aperiodic tilings of the plane in the 1970s, famously describing a triplet of related tilings now termed P1, P2 and P3. These tilings turn out to be closely related to tilings in medieval islamic architecture and thus perhaps ought to be called “Iranian tilings” but to be consistent with convention I have decided to stick with the standard “Penrose tilings” in this post.
The tiling P3 is made from two types of rhombic tiles, matched together as desired according to the matching rules (indicated by the colors, or triangle/circle bumps) below:
The result is an aperiodic tiling of the plane, i.e. one without translation symmetry (for those interested, a formal definition is provided here). Such tilings have many interesting and beautiful properties, although a not so-well-known one is that they are 3-colorable. What this means is that each tile can be colored with one of three colors, so that any two adjacent tiles are always colored differently. The proof of the theorem, by Tom Sibley and Stan Wagon, doesn’t really have much to do with the aperiodicity of the rhombic Penrose tiling, but rather with the fact that it is constructed from parallelograms that are arranged so that any pair meet either at a point or along an edge (they call such tilings “tidy”). In fact, they prove that any tidy plane map whose countries are parallelograms is 3-colorable.
The theorem is illustrated below:
This photo is from our guest bathroom. I designed the tiling and the coloring to fit the bathroom space and sent a plan in the form of a figure to Hank Saxe from Saxe-Patterson Inc. in Taos New Mexico who cut and baked the tiles:
Hank mailed me the tiles in groups of “super-tiles”. These were groups of tiles glued together to an easily removable mat to simplify the installation. The tiles were then installed by my friend Robert Kertsman (at the time a general contractor) and his crew.
The final result is a bathroom for thought:
Three years ago Nicolas Bray and I published a post-publication review of the paper “Network link prediction by global silencing of indirect correlations” (Barzel and Barabási, Nature Biotechnology, 2013). Despite our less than positive review of the work, the paper has gone on to garner 95 citations since its publication (source: Google Scholar). In fact, in just this past year the paper has paper has been cited 44 times, an impressive feat with the result that the paper has become the first author’s most cited work.
In another Barabási paper (with Wang and Song) titled Quantifying Long-Term Scientific Impact (Science, 2013), the authors provide a formula for estimating the total number of citations a paper will acquire during its lifetime. The estimate is
where and are parameters learned from a few years of citation data. The authors call the ultimate impact because, they explain, “the total number of citations a paper will ever acquire [is equivalent to] the discovery’s ultimate impact”. With 95 citations in 3 years, the Barzel-Barabási “discovery” is therefore on track for significant “ultimate impact” (I leave it as an exercise for the reader to calculate the estimate for from the citation data). The ultimate impactful destiny of the paper is perhaps no surprise…Barzel and Barabási knew as much when writing it, describing its implication for systems biology as “Overall this silencing method will help translate the abundant correlation data into insights about the system’s interactions” and stating in a companion press release that “After silencing, what you are left with is the precise wiring diagram of the system… In a sense we get a peek into the black box.”
Drive by citations
Now that three years had passed since the publication of the press release and with the ultimate impact revealed, I was curious to see inside the 95 black boxes opened with global silencing, and to examine the 95 wiring diagrams that were thus precisely figured out.
So I delved into the citation list and examined, paper-by-paper, to what end the global silencing method had been used. Strikingly, I did not find, as I expected, precise wiring diagrams, or even black boxes. A typical example of what I did find is illustrated in the paper Global and portioned reconstructions of undirected complex networks by Xu et al. (European Journal Of Physics B, 2016) where the authors mention the Barzel-Barabási paper only once, in the following sentence of the introduction:
“To address this inverse problem, many methods have been proposed and they usually show robust and high performance with appropriate observations [9,10,11, 12,13,14,15,16,17,18,19,20,21].”
(Barzel-Barabási is reference ).
Andrew Perrin has coined the term drive by citations for “references to a work that make a very quick appearance, extract a very small, specific point from the work, and move on without really considering the existence or depth of connection [to] the cited work.” While its tempting to characterize the Xu et al. reference of Barzel-Barabási as a drive by citation the term seems overly generous, as Xu et al. have literally extracted nothing from Barzel-Barabási at all. It turns out that almost all of the 95 citations of Barzel-Barabási are of this type. Or not even that. In some cases I found no logical connection at all to the paper. Consider, for example, the Ph.D. thesis Dysbiosis in Inflammatory Bowel Disease, where Barzel-Barabási, as well as the Feizi et al. paper which Nicolas Bray and I also reviewed, are cited as follows:
The Ribosomal Database Project (RDP) is a web resource of curated reference sequences of bacterial, archeal, and fungal rRNAs. This service also facilitates the data analysis by providing the tools to build rRNA-derived phylogenetic trees, as well as aligned and annotated rRNA sequences (Barzel and Barabasi 2013; Feizi, Marbach et al. 2013).
(Neither papers has anything to do with building rRNA-derived phylogenetic trees or aligning rRNA sequences).
While this was probably an accidental error, some of the drive by citations were more sinister. For example, WenJun Zhang is an author who has cited Barzel-Barabási as
We may use an incomplete network to predict missing interactions (links) (Clauset et al., 2008; Guimera and Sales-Pardo, 2009; Barzel and Barabási, 2013; Lü et al., 2015; Zhang, 2015d, 2016a, 2016d; Zhang and Li, 2015).
in exactly the same way in three papers titled Network Informatics: A new science, Network pharmacology: A further description and Network toxicology: a new science. In fact this author has cited the work in exactly the same way in several other papers which appear to be copies of each other for a total of 7 citations all of which are placed in dubious “papers”. I suppose one may call this sort of thing hit and run citation.
I also found among the 95 citations one paper strongly criticizing the Barzel-Barabási paper in a letter to Nature Biotechnology (the title is Silence on the relevant literature and errors in implementation) , as well as the (to me unintelligible) response by the authors.
In any case, after carefully examining each of the 95 references citing Barzel and Barabási I was able to find only one paper that actually applied global silencing to biological data, and two others that benchmarked it. There are other ways a paper could impact derivative work, for example by virtue of the models or mathematics developed, be of use, but I could not find any other instance where Barzel and Barabási’s work was used meaningfully other than the three citations just mentioned.
When a citation is a citation
As mentioned, two papers have benchmarked global silencing (and also network deconvolution, from Feizi et al.). One was a paper by Nie et al. on Minimum Partial Correlation: An Accurate and Parameter-Free Measure of Functional Connectivity in fMRI. Table 1 from the paper shows the results of global silencing, network deconvolution and other methods on a series of simulations using the measure of c-sensitivity for accuracy:
Table 1 from Nie et al. showing performance of methods for “network cleanup”.
EPC is the “Elastic PC-algorithm” developed by the authors, which they argue is the best method. Interestingly, however, global silencing (GS) is equal to or worse than simply choosing the top entries from the partial correlation matrix (FP) in 19/28 cases- that’s 67% of the time! This is consistent with the results we published in Bray & Pachter 2013. In these simulations network deconvolution performs better than partial correlation, but still only 2/3 of the time. However in another benchmark of global silencing and network deconvolution published by Izadi et al. 2016 (A comparative analytical assay of gene regulatory networks inferred using microarray and RNA-seq datasets) network deconvolution underperformed global silencing. Also network deconvolution was examined in the paper Graph reconstruction Using covariance-based methods by Sulaimanov and Koeppl 2016 who show, with specific examples, that the scaling parameter we criticized in Bray & Pachter 2013 is indeed problematic:
The scaling parameter α is introduced in [Feizi et al. 2013] to improve network deconvolution. However, we show with simple examples that particular choices for α can lead to unwanted elimination of direct edges.
It’s therefore difficult to decide which is worse, network deconvolution or global silencing, however in either case it’s fair to consider the two papers that actually tested global silencing as legitimately citing the paper the method was described in.
The single paper I found that used global silencing to analyze a biological network for biological purposes is A Transcriptional and Metabolic Framework for Secondary Wall Formation in Arabidopsis by Li et al. in Plant Physiology, 2016. In fact the paper combined the results of network deconvolution and global silencing as follows:
First, for the given data set, we calculated the Pearson correlation coefficients matrix Sg×g. Given g1 regulators and g2 nonregulators, with g = g1+g2, the correlation matrix can be modified as
where O denotes the zero matrix, to include biological roles (TF and non-TF genes). We extracted the regulatory genes (TFs) from different databases, such as AGRIS (Palaniswamy et al., 2006), PlnTFDB (Pérez-Rodríguez et al., 2010), and DATF (Guo et al., 2005). We then applied the network deconvolution and global silencing methods to the modified correlation matrix S′. However, global silencing depends on finding the inverse of the correlation matrix that is rank deficient in the case p » n, where p is the number of genes and n is the number of features, as with the data analyzed here. Since finding an inverse for a rank-deficient matrix is an ill-posed problem, we resolved it by adding a noise term that renders the matrix positive definite. We then selected the best result, with respect to a match with experimentally verified regulatory interactions, from 10 runs of the procedure as a final outcome. The resulting distribution of weighted matrices for the regulatory interactions obtained by each method was decomposed into the mixture of two Gaussian distributions, and the value at which the two distributions intersect was taken as a cutoff for filtering the resulting interaction weight matrices. The latter was conducted to avoid arbitrary selection of a threshold value and prompted by the bimodality of the regulatory interaction weight matrices resulting from these methods. Finally, the gene regulatory network is attained by taking the shared regulatory interactions between the resulting filtered regulatory interactions obtained by the two approaches. The edges were rescored based on the geometric mean of the scores obtained by the two approaches.
In light of the benchmarks of global silencing and network deconvolution, and in the absence of analysis of the ad hoc method combining their results, it is difficult to believe that this methodology resulted in a meaningful network. However its citation of the relevant papers is certainly legitimate. Still, the results of the paper, which constitute a crude analysis of the resulting networks, are a far cry from revealing the “precise wiring diagram of the system”. The authors acknowledge this writing
From the cluster-based networks, it is clear that a wide variety of ontology terms are associated with each network, and it is difficult to directly associate a distinct process with a certain transcript profile.
The factor of use correction
The analysis of the Barzel and Barabási citations suggests that, because a citation is not always a citation (thanks to Nicolas Bray for suggesting the title for the post), to reflect the ultimate impact of a paper the quantity needs to be adjusted. I propose adjustment by the factor
where C is the total number of citations of a paper and is the number of drive by citations. The fraction is essentially a factor of use correction. It should be possible (and interesting) to develop text analytics algorithms for estimating so as to be able to correct to , and similarly adjusting citations counts, h-indices, impact factors of journals and related metrics. Explicit computation and publication of the factor of use correction for papers would also incentivize authors to reduce or eliminate gratuitous drive by citation practices.
For now I decided to compute the factor of use correction for the Barzel-Barabási paper by generously estimating that . This yielded . Barabási has an h-index of 117, but applying this factor of use correction to all of his published papers I obtained the result that Barabasi’s factor of use corrected h-index is 30.
The idea that 2016 was the worst year in history was already floated back in July, but despite a lot of negativity, there were definitely good things that happened in 2016. This was certainly true in terms of scientific discoveries and breakthroughs. One of my favorites was a significant improvement to the upper bound of the Happy Ending problem after 81 years!
The Happy Ending problem was the brainchild of mathematician Esther Klein, who proved that any five points in general position in the plane must contain a convex quadrilateral.
The extension of the problem asks, for each n, for the minimum number so that any points in general position in the plane contain a convex n-gon. Paul Erdös and George Szekeres established an upper bound for via Ramsey Theory in 1935, a link that highlighted the importance and application of Ramsey Theory to extremal combinatorics and instantly made the problem a classic. It even has connections to computational biology, via it’s link to the Erdös-Szekeres monotone subsequence theorem (published in the same 1935 paper), which in turn is related to the Needleman-Wunsch algorithm.
The “happy ending” associated to the problem comes from the story of Esther Klein and George Szekeres, who fell in love while working on the problem and ended up marrying in 1937, a happy ending that lasted for 68 years (they died within an hour of each other in 2005).
In their 1935 paper, Erdös-Szekeres proved an upper bound for using elementary combinatorics, namely
The difficulty of improving the upper bound for can be appreciated by noting that it took 63 years for there to be any improvement to the Erdös-Szekeres result, and when an improvement did arrive in 1998 it was to remove the “+1” (Fan Chung and Ron Graham, Discrete Geometry 1998) reducing the upper bound to for .
Considering that the best lower bound for , conjectured to be optimal, is (proved by Erdös and Szekeres in 1960), the +1 improvement left a very long way to go. A few other improvements after the Chung-Graham paper was published reduced the upper bound some more, but at most by constant factors.
The result doesn’t exactly match the conjectured lower bound, so one cannot say the happy ending to the Happy Ending Problem arrived in 2016, but it’s so so so close (i.e. it solves the problem asymptotically) that one can only be left with optimism and hope for 2017.
Happy new year!
What are confusion matrices?
In the 1904 book Mathematical Contributions to the Theory of Evolution, Karl Pearson introduced the notion of contingency tables. Sometime around the 1950s the term “confusion matrix” started to be used for such tables, specifically for 2×2 tables arising in the evaluation of algorithms for binary classification.
Example: Suppose there are 11 items labeled A,B,C,D,E,F,G,H,I,J,K four of which are of the category blue (also to be called “positive”) and eight of which are of the category red (also to be called “negative”). An algorithm called BEST receives as input the objects without their category labels, i.e. just A,B,C,D,E,F,G,H,I,J,K and must rank them so that the top of the ranking is enriched, as much as possible, for blue items. Suppose BEST produces as output the ranking:
How good is BEST at distinguishing the blue (positive) from the red (negative) items, i.e. how enriched are they at the top of the list? The confusion matrix provides a way to organize the assessment. We explain by example: if we declare the top 6 predictions of BEST blue and classify the remainder as red
then the confusion matrix is:
The matrix shows the number of predictions that are correct (3) as well as the number of blue items that are missing (1). Similarly, it shows the number of red items that were incorrectly predicted to be blue (3), as well as the number of red items correctly predicted to be red (4). The four numbers have names (true positives, false negatives, false positives and true negatives) and various other numbers can be derived from them that shed light on the quality of the prediction:
The table and the terminology are trivial. All that is involved is simple counting of successes and failures in a binary classification experiment, and computation of derived measures that are obtained by elementary arithmetic. But here is a little secret that I’ve always felt embarrassed to admit… I can never remember the details. For example I find myself forgetting whether “positive predictive value” is really “precision” or how that relates to “specificity”. For many years I thought I was the only one. I had what you might call “confusion syndrome”… But I’ve slowly come to realize that I am not alone.
I became confused about confusion matrices from the moment I first encountered them when I started studying computational biology in the mid 1990s. One of the first papers I read at the time was a landmark review by Burset and Guigó (1996) on the evaluation of gene structure prediction programs (the review has now been cited over 800 times). I have highlighted some text from the paper, where the authors explain what “specificity” means.
Not only are there two definitions for specificity on the same page, there is also a typo in Figure 1 (specificity is abbreviated Sn instead of Sp) adding to the confusion.
Burset and Guigó are of course not to blame for the non-standard use of specificity in gene finding. They were merely describing deviant behavior by computational biologists working on gene finding who had apparently decided to redefine a term widely used in statistics and computer science. However lacking context at the time when I was first learning of these terms, I remained confused for years about whether I was supposed to use specificity or the term “precision” for what the gene finding people were computing.
Simple confusion about terms is not the only problem with the confusion matrix. It is filled with (abbreviated) jargon that can be overwhelming. Who can remember TP,TN,FP,FN,TPR,FPR,SN,SP,FDR,PPV,FOR,NPV,FDR,FNR,TNR,SPC,ACC,LR+,LR- etc. etc. etc. ? Even harder is keeping track of the many formulas relating the quantities, e.g. FDR = 1 – PPV or FPR = 1 – SPC. And how is one supposed to gain intuition about a formula described as ACC = (TP + TN) / (TP + FP + FN + TN)?
Confusion matrix confusion remains a problem in computational biology. Consider the recent preprint Creating a universal SNP and small indel variant caller with deep neural networks by Poplin et al. published on December 14, 2016. The paper, from Google/Verily, shows how a deep convolutional neural network can be used to call genetic variation by learning from images of read pileups. This is fascinating and potentially deep innovation. However the authors, experts in machine learning, were confused in the caption of their Figure 2A, incorrectly describing the FDR-Sensitivity plot as a ROC curve. This in turn caused confusion for readers expecting a ROC curve to be a function.
While an FPR-Sensitivity plot is a function an FDR-Sensitivity plot doesn’t have to be. I was confused by this subtlety myself until an author of Poplin et al. kindly sorted me out.
Hence this post, a note for myself to avoid further confusion by utilizing elementary trigonometry to organize the information in confusion (matrices). As we will see shortly, the Poplin et al. Figure 2A is what is more commonly known as a precision-recall (PR) curve, except that the usual axes of a PR plot are flipped, in addition to a reflection which adds further confusion.
The quality of the BEST prediction in the example above can be encoded in a picture as follows:
Figure 1: A false-positive / true-positive plot.
The grid is 4 lines high and 8 lines wide because there are 4 blue objects and 7 red ones. The BEST ranking is a path from the lower left hand corner of the grid (0,0) to the top right hand corner (7,4). The BEST predictions can be “read off” by following the path. Starting with the top ranked object, the path moves one step up if the object is (in truth) blue, and one step to the right if it is (in truth) red. For example, the first prediction of BEST is C which is blue, and therefore the first step is up. Continuing in this way one eventually arrives at (7,4) because each object is somewhere (only once) on BEST’s list. A BEST confusion matrix corresponds to a point in the plot. The confusion matrix discussed in the introduction thresholding after 6 predictions corresponds to the green point which is the location after 6 steps from (0,0).
The path corresponding to the BEST ranking, when scaled to the unit square, is called a ROC curve. This stands for receiver operator curve (the terminology originates from WWII). Again, each thresholding of the ranking, e.g. the top six as discussed above (green point), corresponds to a single point on the path:
Figure 2: A ROC plot.
In the ROC plot the x-axis is the false positive rate (FPR = # false positives / total number of true negatives) and the y-axis the sensitivity (Sn = #true positives / total number of true positives).
At this point it is useful to think about taxicab geometry (formally known as geometry). In taxicab geometry, the distance between two points on the grid is the number of steps in the shortest path (using edges of the grid) connecting them. In the figure below, a false positive / true positive plot (as shown in Figure 1) is annotated to highlight the connection to trigonometry:
Figure 3: Taxicab confusion.
The terms associated to the confusion matrix can now be understood using taxicab trigonometry as follow:
(false discovery rate),
(positive predictive value),
(negative predictive value),
(false omission rate),
Fortunately trigonometry identities are easy in this geometry. Instead of the usual we have . Thus we see that FDR + PPV = 1 (or PPV = 1 – FDR) and NPV + FOR = 1 (or NPV = 1 -FOR). The false positive rate and true positive rate are seen to just be rescaling of the FP and TP, i.e. and , and the analogies to the identities just described are therefore FPR + SPC = 1 and TPR + FNR = 1.
Revisiting Figure 2A from Poplin et al. we see that FDR is best interpreted as 1 – PPV. PPV is also known as “precision”. The y-axis, sensitivity, is also called “recall”. Thus the curve is a precision-recall curve, although precision-recall curves are typically shown with recall on the x-axis.
Aside from helping to sort out identities, taxicab trigonometry is appealing in other ways. For example the angle in the example of Figure 3 is just . In fact in the regime in which the confusion trigonometry takes place and . There is a pedagogical point here: it is useful to identify a quantity such as FDR with directly with the angle subtended by the points (0,0) and (FP,TP). In fact one sees immediately that a confusion matrix is succinctly described by just two quantities (in addition to N and P). These do not necessarily have to be FP and TP, a pair such as FDR and NPV can also suffice.
No longer confused by confusion matrices? Show that
While I think the geometric view of confusion matrices is useful for keeping track of the big picture, there is a probabilistic interpretation of many of the concepts involved that is useful to keep in mind.
For example, the precision associated to a confusion matrix can be interpreted as the probability that a prediction is correct. The identity FDR + PPV = 1 can therefore be thought of probabilistically, with FDR being the probability that a prediction is incorrect.
Most useful among such probabilistic interpretations is the area under a ROC curve (AUROC), which is frequently computed in biology papers. It is the probability that a randomly selected “positive” object is ranked higher than a randomly selected “negative” object by the classifier being evaluated. For example, in the red/blue/BEST example given above, the AUROC (= ) is the probability that a randomly selected blue letter is ranked higher than a randomly selected red letter by BEST.
Okay Houston, we’ve got a problem. We need more power. Case in point: a recently published study Apollo Lunar Astronauts Show Higher Cardiovascular Disease Mortality by Michael Delp et al. was picked up by news outlets with headlines such as:
- Study finds cosmic rays increased heart risks among Apollo astronauts (Reuters)
- Deep space radiation caused heart problems for Apollo astronauts (NBC News)
- Apollo deep space astronauts five times more likely to die from heart disease (The Guardian)
- Space radiation took a toll on Apollo astronauts, study says (Fox News)
- Apollo astronauts much more likely to die from heart disease (Science)
- … etc., etc., …
The headlines were based on a sentence in the paper stating that “the CVD mortality rate among Apollo lunar astronauts (43%) was 4–5 times higher than in non-flight and LEO [low earth orbit] astronauts.”
A reading of the paper reveals that the “5 times more likely to die” risk calculation comes from . The number 9% is the rate of cardiovascular disease observed in 35 non-flight astronauts whereas the number 43% is rate of cardiovascular disease in Apollo lunar astronauts (3 out of 7). In other words, the grandiose claims of the paper are based on three Apollo astronauts dying of cardiovascular disease rather than an expected single astronaut.
The authors themselves must have realized how unfounded their claims were, because the paper evidently flirts with fraud. They used a p-value cutoff of 0.1 to declare the lunar astronaut result “significant”. This is in contrast to the standard cutoff 0.05 which they use for the remainder of the results in the paper, and they justified the strange exception by suggesting that others “considered [Fisher’s exact test] extremely conservative.” In addition, Ed Mitchell who died at the age of 85 on February 4th 2016 three months before the paper was submitted was excluded from the analysis. His inclusion would have increased the dataset size by 14%! Then there is the fact that they failed to mention the three astronauts who visited the moon twice and are still alive. Or that the lunar astronauts died ten years older on average. Perhaps worst of all, the authors imply that they have experimental data on a mechanism for their statistical (non)result by describing a follow-up experiment examining vascular responses of resistance arteries in irradiated mice. The problem is, the dose given to the mice was 87 times what the astronauts received! None of this is complicated stuff… and one wonders how only one of the reporters writing about the study picked up on any of this (Sarah Kaplan from the Washington post headlined the story with Studying heart disease in astronauts yields clues but not conclusive evidence and concluded correctly “that’s just three of seven people, which doesn’t give you a whole lot of statistical power”.)
One would hope that this kind of paper would be retracted by the journal but my previous attempts to get journals to do the right thing, even when the research was clearly flawed, have been futile. Then there is the funding. Learning nothing doesn’t come for free and the authors’ “work” was supported by grants from the National Space and Biomedical Research Institute under the NASA Cooperative Agreement. Clearly PI Michael Delp (who is also first author, corresponding author and dean of the College of Human Sciences at Florida State University) would like even more funding, proclaiming in interviews that he wanted to take “a deeper look into the medical history of the Apollo astronauts”, “study future questions” and that he was “working with NASA to conduct additional studies”. My experience in genomics has been that funding agencies typically turn a blind eye to flawed research leaving the task of evaluating the science to “peer reviewers”. I’ve seen many cases where individuals who published complete malarkey and hogwash continue to receive funding. But it seems NASA cares about the research it funds and may not be on the same page as Delp et al. In a statement published on July 28th, NASA wrote that:
The National Space Biomedical Research Institute, a non-governmental organization with funding from NASA’s Human Research Program, supported a recent study published in Scientific Reports that looked at the rate of cardiovascular disease among Apollo astronauts.
With the current limited astronaut data referenced in the study it is not possible to determine whether cosmic ray radiation affected the Apollo astronauts.
This is not the first time NASA has published statements distancing itself from studies it has supported (either directly or indirectly). Following reports that a NASA-funded study found that industrial civilization was headed for irreversible collapse, NASA published a statement making clear it did not support the results of the study.
Thank you NASA! You have set a great example in taking ownership of the published work your funding enabled. Hopefully others (NIH!!) will follow suit in publicly disavowing poorly designed underpowered studies that make grandiose claims.
Disclosure: I collaborate with NASA scientists, contribute to projects partially funded by NASA, and apply for NASA funding.
Computational biologists do not all recognize the Kalman filter by name, but they know it in the form of the hidden Markov model (the Kalman filter is a hidden Markov model with continuous latent variables and Gaussian observed variables). I mention this because while hidden Markov models, and more generally graphical models, have had an extraordinary impact on the tools and techniques of high-throughput biology, one of their primary conceptual sources, the Kalman filter, is rarely credited as such by computational biologists.
Illustration of the Kalman filter (from Wikipedia).
Where the Kalman filter has received high acclaim is in engineering, especially electrical and aeronautical engineering via its applications in control theory and where it has long been a mainstay of the fields. But it was not always so. The original papers, written in the early 1960s by Rudolf Kálmán and colleagues, were published in relatively obscure mechanical engineering journals rather than the mainstream electrical engineering journals of the time. This was because Kálmán’s ideas were initially scoffed at and rejected… literally. Kálmán second paper on the topic, New Results in Linear Filtering and Prediction Theory (with almost 6,000 citations), was rejected at first with a referee writing that “it cannot possibly be true”. The story is told in Grewal and Andrews’ book Kalman Filtering: Theory and Practice Using MATLAB. Of course not only was the Kalman filter theory correct, the underlying ideas were, in modern parlance, transformative and disruptive. In 2009 Rudolf Kálmán received the National Medal of Science from Barack Obama for his contribution. This is worth keeping in mind not only when receiving rejections for submitted papers, but also when writing reviews.
Rudolf Kálmán passed away at the age of 86 on Saturday July 2nd 2016.
I’m thrilled to announce that I will be moving to Caltech next year where I will be professor of computational biology!
Some people have asked me why I’m moving. First and foremost, we (my family) feel it is the right move for us as for a variety of reasons that I won’t get into here. For me personally, Caltech represents a unique, special, and extraordinary opportunity because it is an institution that fosters an environment facilitating research and teaching that, inasmuch as possible, is unencumbered by the minutiae of academia. In particular, Caltech is unintimidated by disciplinary boundaries, and enables a culture that I’ve yearned for my whole career. It doesn’t throw hundreds of millions of dollars at a football team (although the basketball team is doing pretty well). Its priorities are aligned with mine.
I’m leaving behind Berkeley, a university I started working at 17 years ago as a visiting assistant professor. I’ll miss Berkeley. I still remember the January 1999 phone call from Prof. Tsit Yuen Lam, announcing my appointment. I was honored to have been invited to conduct research and to teach at one of the world’s great institutions. Berkeley was, and still is, distinguished by it’s mission of providing world-class affordable public education. I can’t think of any university in the world that has done as well in pursuing this noble goal. Consider, for example, that UC Berkeley has almost as many Pell Grant recipients as all eight Ivy League schools combined. But with time, as I was allowed to drop the prefixes in my title, I found myself increasingly aware of the structure, organization and financing of the university. Two numbers that I learned have stuck in my mind: today, state funding comprises only 13% of the budget (likely even lower next year), less than half of what it was when I arrived. At the same time, tuition has increased by over a factor of three during the same time period. The squeeze has harmed the institution not just because of reductions in resources (though there have been many), but also because of the strain placed on the morale and mission of the university. Over time I started to question whether its world-class education was sustainable, and lamented that its affordability was becoming a myth. Over the past two years I’ve become increasingly aware that the reality of the university is at odds with my values. I’m sad for the University of California and for the citizens who are being harmed by the blows it is taking, and very much wish that the state will protect and nurture its education treasure. But I will be rooting for it from the sidelines.
I can’t wait to start at Caltech, and look forward to the next phase of my career!
My Caltech calculus professor, Tom Apostol, passed away yesterday May 8th 2016. When I arrived in his Math 1b class in the Fall of 1990 I thought, like most of my classmates, that I already knew calculus. I may have known it but I didn’t understand it. Apostol taught me the understanding part.
Apostol’s calculus books, affectionally called “Tommy I” and ‘Tommy II” were not just textbooks for students to memorize but rather mathematical wisdom and beauty condensed into a pair of books intended to transform grade-obsessed freshmen and sophomores into thinking human beings. Most of all, Apostol emphasized the idea that fundamental to mathematics is how one thinks about things, not just what one is thinking about. One of his iconic examples of this was the ice-cream-cone-proof that the focal property of an ellipse is a consequence of its definition as a section of a cone. Specifically, taking as the definition of an ellipse a plane curve obtained by intersecting an inclined plane with a cone
the goal is to both define the two foci, and then to derive the focal point property as illustrated below:
Apostol demonstrated the connection between conic sections and their foci via a proof and picture of Dandelin. His explanation, which I still remember from my freshman year in college, is beautiful (the excerpt below is from his linear algebra book):
Apostol didn’t invent Dandelin’s spheres but he knew they were “the right way” to think about conic sections, and he figured out “the right way” for each and every one of his explanations. His calculus books are exceptional for their introduction of integration before differentiation, his preference for axiomatic rather than mechanistic definition (e.g. of determinants) and his exercises that are “easy” when the material is understood “in the right way”. His course had a profound influence on my approach not only to mathematics, but to all of my learning in both the sciences and humanities.
One of Apostol’s signature traditions was his celebration of Gauss’ birthday. His classes were always filled with mathematical treats, but on April 30th every year he would hide a cake in the classroom before the students arrived and would serve an edible treat that day instead. Gauss turned 239 just last week. This seems to be a timely moment to take note of that prime number (Apostol was a number theorist) and to eat a slice of cake for Gauss, Apostol, and those who change our lives.