
During my third year of undergraduate study I took a course taught by David Gabai in which tests were negatively graded. This meant that points were subtracted for every incorrect statement made in a proof. As proofs could, in principle, contain an unbounded number of false statements, there was essentially no lower bound on the grade one could receive on a test (course grades were bounded below by “F”). Although painful on occasion, I grew to love the class and the transcendent lessons that it taught. These memories came flooding back this past week, when a colleague brought to my attention the paper A simple and fast algorithm for K-medoids clustering by Hae-Sang Park and Chi-Hyuck Jun.

The Park-Jun paper introduces a K-means like method for K-medoid clustering. K-medoid clustering is a clustering formulation based on a generalization of (a relative of) the median rather than the mean, and is useful for the same robustness reasons that make the median preferable to the mean. The medoid is defined as follows: given n points $x_1,\ldots,x_n$ in $\mathbb{R}^d$, the medoid is a point $x$ among them with the property that it minimizes the average distance to the other points, i.e. $x \in \{x_1,\ldots,x_n\}$ minimizes $\sum_{i=1}^n ||x-x_i||$. In the case of $d=1$, when n is odd, this corresponds to the median (the direct generalization of the median is to find a point $x$ not necessarily among the $x_i$ minimizing the average distance to the other points, and this is called the geometric median).

The K-medoid problem is to partition the points into k disjoint subsets $S_1,\ldots,S_k$ so that if $m_1,\ldots,m_k$ are the respective medoids of the subsets (clusters) then the average distance of each medoid to the points of the cluster it belongs to is minimized. In other words, the K-medoids problem is to find

$\mathrm{argmin}_{S=\{S_1,\ldots,S_k\}} \sum_{j=1}^k \sum_{i \in S_j} ||m_j-x_i||$.

For comparison, K-means clustering is the problem of finding

$\mathrm{argmin}_{S=\{S_1,\ldots,S_k\}} \sum_{j=1}^k \sum_{i \in S_j} ||\mu_j - x_i||^2$,

where the $\mu_j$ are the centroids of the points in each partition $S_j$. The K-means and K-medoids problems are instances of partitional clustering formulations that are NP-hard.

The most widely used approach for finding a (hopefully good) solution to the K-medoids problem has been a greedy clustering algorithm called PAM by Kaufman and Rousseeuw (partitioning around medoids). To understand the Park-Jun contribution it is helpful to briefly review PAM first.  The method works as follows:

1. Initialize by selecting k distinguished points from the set of points.
2. Identify, for each point, the distinguished point it is closest to. This identification partitions the points into sets and a “cost” can be associated to the partition, namely the sum of the distances from each point to its associated distinguished point (note that the distinguished points may not yet be the medoids of their respective partitions).
3. For each distinguished point d, and each non-distinguished point x, repeat the assignment and cost calculation of step 2 with d and x swapped, so that x becomes distinguished and d returns to undistinguished status. Choose the swap that minimizes the total cost. If no swap reduces the cost then output the partition and the distinguished points.
4. Repeat step 3 until termination.

It’s easy to see that when the PAM algorithm terminates the distinguished points must be medoids for the partitions associated to them.

The running time of PAM is $O(k(n-k)^2)$. This is because in step 3, for each of the k distinguished points, there are n-k swaps to consider for a total of $k(n-k)$ swaps, and the cost computation for each swap requires n-k assignments and additions. Thus, PAM is quadratic in the number of points.
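A minimal sketch of the swap search described above, assuming Euclidean distance and points given as tuples. The function names and the naive initialization (the first k points) are illustrative assumptions, not taken from Kaufman and Rousseeuw. This version also recomputes the full cost for every candidate swap, so it does more work per swap than the count given above; it is meant to show the logic, not the bookkeeping of an efficient implementation.

```python
import math

def total_cost(points, medoid_idx):
    """Sum, over all points, of the distance to the nearest distinguished point."""
    return sum(min(math.dist(p, points[m]) for m in medoid_idx) for p in points)

def pam(points, k):
    """Greedy swap search; returns (sorted distinguished indices, total cost)."""
    medoids = list(range(k))  # assumption: initialize with the first k points
    best = total_cost(points, medoids)
    while True:
        best_swap = None
        # Try every (distinguished d, non-distinguished x) swap: k * (n - k) candidates.
        for pos in range(k):
            for x in range(len(points)):
                if x in medoids:
                    continue
                trial = medoids[:pos] + [x] + medoids[pos + 1:]
                c = total_cost(points, trial)
                if c < best:
                    best, best_swap = c, trial
        if best_swap is None:  # no swap reduces the cost: terminate
            return sorted(medoids), best
        medoids = best_swap

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
print(pam(points, 2))
```

On this toy input of two well-separated groups of three collinear points, the search ends with the middle point of each group as the distinguished point.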

To speed up the PAM method, Park and Jun introduced in their paper an analog of the K-means algorithm, with the mean replaced by the medoid:

1. Initialize by selecting k distinguished points from the set of points.
2. Identify, for each point, the distinguished point it is closest to. This identification partitions the points into sets in the same way as the PAM algorithm.
3. Compute the medoid for each partition. Repeat step 2 until the cost no longer decreases.
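The three steps above can be sketched as follows. This is an illustration under stated assumptions (Euclidean distance, distinct points, caller-supplied initial indices), with hypothetical function names; it is not the pseudocode from the Park-Jun paper. Note that `medoid_of` contains a double loop over the members of a cluster.

```python
import math

def assign(points, medoids):
    """Step 2: group point indices by their nearest distinguished point."""
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
        clusters[nearest].append(i)
    return clusters

def medoid_of(points, members):
    """Step 3: the member with the smallest summed distance to the other members."""
    return min(members, key=lambda c: sum(math.dist(points[c], points[i]) for i in members))

def cost(points, clusters):
    """Sum of distances from each point to the distinguished point of its cluster."""
    return sum(math.dist(points[m], points[i]) for m, members in clusters.items() for i in members)

def alternate(points, medoids):
    """Alternate steps 2 and 3 until the cost no longer decreases.

    Assumes the points are distinct, so no cluster ever becomes empty.
    """
    clusters = assign(points, medoids)
    best = cost(points, clusters)
    while True:
        new_medoids = [medoid_of(points, members) for members in clusters.values()]
        new_clusters = assign(points, new_medoids)
        new_cost = cost(points, new_clusters)
        if new_cost >= best:
            return sorted(clusters), best
        clusters, best = new_clusters, new_cost

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
print(alternate(points, [0, 1]))
```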

Park-Jun claim that their method has run time complexity “$O(nk)$ which is equivalent to K-means clustering”. Yet the pseudocode in the paper begins with the clearly quadratic requirement “calculate the distance between every pair of all objects..”

Step 3 of the algorithm is also quadratic. The medoid of a set of m points is computed in time $O(m^2)$. Given m points the medoid must be one of them, so it suffices to compute, for each point, the sum of the distances to the others ($m \times m$ additions) and a medoid is then identified by taking the minimum. Furthermore, without assumptions on the structure of the distance matrix there cannot be a faster algorithm (with the triangle inequality the medoid can be computed in $O(n^{\frac{3}{2}})$).
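To make the $m \times m$ count concrete, here is a small brute-force sketch that tallies distance evaluations while searching for a medoid. The function name and the test points are assumptions made for illustration only.

```python
import math

def medoid_with_count(points):
    """Brute-force medoid search; returns (medoid, number of distance evaluations)."""
    evaluations = 0
    best_point, best_sum = None, math.inf
    for candidate in points:
        total = 0.0
        for other in points:  # one distance per (candidate, other) pair
            total += math.dist(candidate, other)
            evaluations += 1
        if total < best_sum:
            best_point, best_sum = candidate, total
    return best_point, evaluations

# Doubling the number of points quadruples the work.
for m in (10, 20, 40):
    pts = [(float(i), 0.0) for i in range(m)]
    print(m, medoid_with_count(pts)[1])
```

The printed counts are 100, 400 and 1600: exactly $m^2$ evaluations for each m.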

Quadratic functions are not linear. If they were, it would mean that, with $a,b \neq 0$, $ax^2=bx$ for all $x$. If that were the case then

$ax^2=bx$

$\Rightarrow ax^2-bx=0$ for all x.

Assuming that a is positive and plugging in $x=\frac{b-\sqrt{b^2+4a}}{2a}$ one would obtain that $ax^2-bx=1$ and it would follow that

$0=1$.

When reading the paper, it was upon noticing this “result” that negative grading came to mind. With a proof that 0=1 we are at, say, a score of -10 for the paper. Turning to the discussion of Fig. 3 of the paper we read that “…the proposed method takes about constant time near zero regardless of the number of objects.”

I suppose a generous reading of this statement is that it is an allusion to Taylor expansion of the polynomial $f(x)=x^2$ around $x=0$. A less generous interpretation is that the figure is deliberately misleading, intended to show linear growth with slope close to zero for what is a quadratic function. I decided to be generous and grade this a -5, leading to a running total of -15.

It seems the authors may have truly believed that their algorithm was linear because in the abstract they begin with “This paper proposes a new algorithm for K-medoids clustering which runs like the K-means algorithm”. It seems possible that the authors thought they had bypassed what they viewed as the underlying cause for quadratic running time of the PAM algorithm, namely the quadratic swap. The K-means algorithm (Lloyd’s algorithm) is indeed linear, because the computation of the mean of a set of points is linear in the number of points (the values for each coordinate can be averaged separately). However the computation of a medoid is not the same as computation of the mean. Another -5, for a running total of -20. The actual running time of the Park-Jun algorithm is shown below:

Replicates on different instances are crucial, because absent from running time complexity calculations are the stopping times, which are data dependent. A replicate of Figure 3 is shown below (using the parameters in Table 5 and implementations in MATLAB). The Park-Jun method is called “small” in MATLAB.

Interestingly, the MATLAB implementation of PAM has been sped up considerably. Also, the “constant time near zero” behavior described by Park and Jun is clearly no such thing. For this lack of replication of Figure 3 another deduction of 5 points for a total of -25.

There is not much more to the Park and Jun paper, but unfortunately there is some. Table 6 shows a comparison of K-means, PAM and the Park-Jun method with true clusters based on a simulation:

The Rand index is a measure of concordance between clusters, and the higher the Rand index the greater the concordance. It seems that the Park-Jun method is slightly better than PAM, and that both are superior to K-means in the experiment. However the performance of K-means is a lot better in this experiment than suggested by Figure 2, which was clearly (rotten) cherry picked (Fig. 2b):

For this I deduct yet another 5 points for a total of -30.
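For readers unfamiliar with the Rand index used in Table 6, the following is a minimal sketch of its standard definition: the fraction of point pairs on which two clusterings agree, meaning the pair is either together in both or separated in both. The function name and the toy labelings are illustrative and are not the paper's data.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Identical partitions with renamed labels agree on every pair.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# A partition that cuts across the first agrees on only 2 of the 6 pairs.
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```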

Amazingly, the Park and Jun paper has been cited 479 times. Many of the citations propagate the false claims in the paper, e.g. in referring to Park and Jun, Zadegan et al. write “in each iteration the new set of medoids is selected with running time O(N), where N is the number of objects in the dataset”.  The question is, how many negative points is the repetition of false claims from a paper that is filled with false claims worth?

Okay Houston, we’ve got a problem. We need more power. Case in point: a recently published study Apollo Lunar Astronauts Show Higher Cardiovascular Disease Mortality by Michael Delp et al. was picked up by news outlets with headlines such as:

The headlines were based on a sentence in the paper stating that “the CVD mortality rate among Apollo lunar astronauts (43%) was 4–5 times higher than in non-flight and LEO [low earth orbit] astronauts.”

A reading of the paper reveals that the “5 times more likely to die” risk calculation comes from $\frac{3}{7} \approx 43\% \approx 5 \times 9\%$. The number 9% is the rate of cardiovascular disease observed in 35 non-flight astronauts whereas the number 43% is the rate of cardiovascular disease in Apollo lunar astronauts (3 out of 7). In other words, the grandiose claims of the paper are based on three Apollo astronauts dying of cardiovascular disease rather than an expected single astronaut.

The authors themselves must have realized how unfounded their claims were, because the paper evidently flirts with fraud. They used a p-value cutoff of 0.1 to declare the lunar astronaut result “significant”. This is in contrast to the standard cutoff 0.05 which they use for the remainder of the results in the paper, and they justified the strange exception by suggesting that others “considered [Fisher’s exact test] extremely conservative.” In addition, Ed Mitchell, who died at the age of 85 on February 4th 2016, three months before the paper was submitted, was excluded from the analysis. His inclusion would have increased the dataset size by 14%! Then there is the fact that they failed to mention the three astronauts who visited the moon twice and are still alive. Or that the lunar astronauts died ten years older on average. Perhaps worst of all, the authors imply that they have experimental data on a mechanism for their statistical (non)result by describing a follow-up experiment examining vascular responses of resistance arteries in irradiated mice. The problem is, the dose given to the mice was 87 times what the astronauts received! None of this is complicated stuff… and one wonders how only one of the reporters writing about the study picked up on any of this (Sarah Kaplan from the Washington Post headlined the story with Studying heart disease in astronauts yields clues but not conclusive evidence and concluded correctly “that’s just three of seven people, which doesn’t give you a whole lot of statistical power”.)

One would hope that this kind of paper would be retracted by the journal but my previous attempts to get journals to do the right thing, even when the research was clearly flawed, have been futile. Then there is the funding. Learning nothing doesn’t come for free and the authors’ “work” was supported by grants from the National Space Biomedical Research Institute under the NASA Cooperative Agreement. Clearly PI Michael Delp (who is also first author, corresponding author and dean of the College of Human Sciences at Florida State University) would like even more funding, proclaiming in interviews that he wanted to take “a deeper look into the medical history of the Apollo astronauts”, “study future questions” and that he was “working with NASA to conduct additional studies”. My experience in genomics has been that funding agencies typically turn a blind eye to flawed research, leaving the task of evaluating the science to “peer reviewers”. I’ve seen many cases where individuals who published complete malarkey and hogwash continue to receive funding. But it seems NASA cares about the research it funds and may not be on the same page as Delp et al. In a statement published on July 28th, NASA wrote that:

The National Space Biomedical Research Institute, a non-governmental organization with funding from NASA’s Human Research Program, supported a recent study published in Scientific Reports that looked at the rate of cardiovascular disease among Apollo astronauts.

With the current limited astronaut data referenced in the study it is not possible to determine whether cosmic ray radiation affected the Apollo astronauts.

This is not the first time NASA has published statements distancing itself from studies it has supported (either directly or indirectly). Following reports that a NASA-funded study found that industrial civilization was headed for irreversible collapse, NASA published a statement making clear it did not support the results of the study.

Thank you NASA! You have set a great example in taking ownership of the published work your funding enabled. Hopefully others (NIH!!) will follow suit in publicly disavowing poorly designed underpowered studies that make grandiose claims.

Disclosure: I collaborate with NASA scientists, contribute to projects partially funded by NASA, and apply for NASA funding.

Two weeks ago in my post Pachter’s P-value Prize I offered $\frac{\$100}{p}$ for justifying a reasonable null model and a p-value (p) associated to the statement “Strikingly, 95% of cases of accelerated evolution involve only one member of a gene pair, providing strong support for a specific model of evolution, and allowing us to distinguish ancestral and derived functions” in the paper

M. Kellis, B.W. Birren and E.S. Lander, Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae, Nature 2004 (hereafter referred to as the KBL paper).

Today I am happy to announce the winner of the prize. But first, I want to thank the many readers of the blog who offered comments (>135 in total) that are extraordinary in their breadth and depth, and that offer a glimpse of what scientific discourse can look like when not restricted to traditional publishing channels. You have provided a wonderful public example of what “peer review” should mean. Coincidentally, and to answer one of the questions posted, the blog surpassed one million views this past Saturday (the first post was on August 19th, 2013), a testament to the fact that the collective peer reviewing taking place on these pages is not only of very high quality, but also having an impact.

I particularly want to thank the students who had the courage to engage in the conversation, and also faculty who published comments using their name. In that regard, I admire and commend Joshua Plotkin and Hunter Fraser for deciding to deanonymize themselves by agreeing to let me announce here that they were the authors of the critique sent to the authors in April 2004, initially posted as an anonymous comment on the blog.

The discussion on the blog was extensive, touching on many interesting issues and I only summarize a few of the threads of discussion here. I decided to touch on a number of key points made in order to provide context and justification for my post and selection of the prize winner.

The value of post-publication review

One of the comments made in response to my post that I’d like to respond to first was by an author of KBL who dismissed the entire premise of my challenge, writing “We can keep debating this after 11 years, but I’m sure we all have much more pressing things to do (grants? papers? family time? attacking 11-year-old papers by former classmates? guitar practice?)”

This comment exemplifies the proclivity of some authors to view publication as the encasement of work in a casket, buried deeply so as to never be opened again lest the skeletons inside it escape. But is it really beneficial to science that much of the published literature has become, as Ferguson and Heene noted, a vast graveyard of undead theories? Surely the varied and interesting comments posted in response to my challenge (totaling >25,000 words and 50 pages in Arial 11 font), demonstrate the value of communal discussion of science after publication.

For the record, this past month I did submit a paper and also a grant, and I did spend lots of time with my family. I didn’t practice the guitar but I did play the piano. Yet in terms of research, for me the highlight of the month was reading and understanding the issues raised in the comments to my blog post. Did I have many other things to do? Sure. But what is more pressing than understanding if the research one does is to be meaningful?

The null model

A few years ago I introduced a new two-semester freshman math course at UC Berkeley for intended biology majors called “Math 10: Methods of Mathematics: Calculus, Statistics and Combinatorics”. One of the key ideas we focus on in the first semester is that of a p-value. The idea of measuring significance of a biological result via a statistical computation involving probabilities is somewhat unnatural, and feedback from the students confirms what one might expect: that the topic of p-values is among the hardest in the course. Math for biologists turns out to be much harder than calculus. I believe that at Berkeley we are progressive in emphasizing the importance of statistics for biology majors at the outset of their education (to be clear, this is a recent development). The prevailing state is that of statistical illiteracy, and the result is that p-values are frequently misunderstood, abused, and violated in just about every possible way imaginable (see, e.g., here, here and here).

P-values require a null hypothesis and a test statistic, and of course one of the most common misconceptions about them is that when they are large they confirm that the null hypothesis is correct. No! And worse, a small p-value cannot be used to accept an alternative to the null, only to (confidently) reject the null itself. And rejection of the null comes with numerous subtle issues and caveats (see arguments against the p-value in the papers mentioned above). So what is the point?

I think the KBL paper makes for an interesting case study of when p-values can be useful. For starters, the construction of a null model is already a useful exercise, because it is a thought experiment designed to test one’s understanding of the problem at hand. The senior author of the KBL paper argues that “we were interested in seeing whether, for genes where duplication frees up at least one copy to evolve rapidly, the evidence better fits one model (“Ohno”: only one copy will evolve quickly) or an alternative model (both genes will evolve quickly).” While I accept this statement at face value, it is important to acknowledge that if there is any science to data science, it is the idea that when examining data one must think beyond the specific hypotheses being tested and consider alternative explanations. This is the essence of what my colleague Ian Holmes is saying in his comment. In data analysis, thinking outside of the box (by using statistics) is not optional. If one is lazy and resorts to intuition then, as Páll Melsted points out, one is liable to end up with fantasy.

The first author of KBL suggests that the “paper was quite explicit about the null model being tested.” But I was unsure of whether to assume that the one-gene-only-speeds-up model is the null based on “we sought to distinguish between the Ohno one-gene-only speeds-up (OS) model and the alternative both-genes speed-up (BS) model” or was the null the BS model because “the Ohno model is 10^87 times more likely, leading to significant rejection of the BS null”? Or was the paper being explicit about not having a null model at all, because “Two alternatives have been proposed for post-duplication”, or was it the opposite, i.e. two null models: “the OS and BS models are each claiming to be right 95% of the time”? I hope I can be forgiven for failing, despite trying very hard, to identify a null model in either the KBL paper, or the comments of the authors to my blog.

There is however a reasonable null model, and it is the “independence model”, which to be clear, is the model where each gene after duplication “accelerates” independently with some small probability (80/914). The suggestions that “the independence model is not biologically rooted” or that it “would predict that only 75% of genes would be preserved in at least one copy, and that 26% would be preserved in both copies” are of course absurd, as Erik van Nimwegen explains clearly and carefully. The fact that many entries reached the same conclusion about the suitable null model in this case is reassuring. I think it qualifies as a “reasonable model” (thereby passing the threshold for my prize).

The p-value

One of my favorite missives about p-values is by Andrew Gelman, who in “P-values and statistical practice” discusses the subtleties inherent in the use and abuse of p-values. But as my blog post illustrates, subtlety is one thing, and ignorance is an entirely different matter. Consider for example, the entry by Manolis Kellis who submitted that $p = 10^{-87}$ thus claiming that I owe him 903,659,165 million billion trillion quadrillion quintillion sextillion dollars (even more than the debt of the United States of America). His entry will not win the prize, although the elementary statistics lesson that follows is arguably worth a few dollars (for him). First, while it is true that a p-value can be computed from the (log) likelihood ratio when the null hypothesis is a special case of the alternative hypothesis (using the $\chi^2$ distribution), the ratio of two likelihoods is not a p-value! Probabilities of events are also not p-values! For example, the comment that “I calculated p-values for the exact count, but the integral/sum would have been slightly better” is a non-starter. Even though KBL was published in 2004, this is apparently the level of understanding of p-values of one of the authors, a senior computational biologist and professor of computer science at MIT in 2015. Wow.

So what is “the correct” p-value? It depends, of course, on the test statistic. Here is where I will confess that like many professors, I had an answer in mind before asking the question. I was thinking specifically of the setting that leads to 0.74 (or 0.72/0.73, depending on roundoff and approximation). Many entries came up with the same answer I had in mind and therefore when I saw them I was relieved: I owed $135, which is what I had budgeted for the exercise. I was wrong. The problem with the answer 0.74 is that it is the answer to the specific question: what is the probability of seeing 4 or fewer pairs accelerate out of 76 pairs in which at least one accelerated. A better test statistic was proposed by Pseudo in which he/she asked for the probability of seeing 5% or less of the pairs accelerate from among the pairs with at least one gene accelerating when examining data from the null model with 457 pairs. This is a subtle but important distinction, and provides a stronger result (albeit with a smaller p-value). The KBL result is not striking even forgoing the specific numbers of genes measured to have accelerated in at least one pair (of course just because p=0.64 also does not mean the independence model is correct). What it means is that the data as presented simply weren’t “striking”. One caveat in the above analysis is that the arbitrary threshold used to declare “acceleration” is problematic. For example, one might imagine that other thresholds produce more convincing results, i.e. farther from the null, but of course even if that were true the use of an arbitrary cutoff was a poor approach to analysis of the data. Fortunately, regarding the specific question of its impact in terms of the analysis, we do not have to imagine. Thanks to the diligent work of Erik van Nimwegen, who went to the effort of downloading the data and reanalyzing it with different thresholds (from 0.4 to 1.6), we know that the null cannot be rejected even with other thresholds.
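
The first calculation above (the one leading to roughly 0.74) can be sketched as a binomial tail under the independence model. This is one reading of the setup, stated as an assumption: each of the 914 genes accelerates independently with probability 80/914, and “4 or fewer pairs accelerate” is taken to mean pairs in which both copies accelerate, conditional on at least one copy accelerating. The variable and function names are illustrative, and this is not Pseudo's test statistic.

```python
from math import comb

genes, accelerated = 914, 80            # figures quoted in the text
pairs_with_any, pairs_with_both = 76, 4  # assumed reading of "4 out of 76"

q = accelerated / genes                  # per-gene acceleration probability
# P(both copies accelerate | at least one copy accelerates) under independence
p_both = q * q / (1 - (1 - q) ** 2)

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

p_value = binom_cdf(pairs_with_both, pairs_with_any, p_both)
print(round(p_value, 2))      # falls in the 0.72 to 0.74 range quoted above
print(round(100 / 0.64, 2))   # the prize formula $100/p at p = 0.64
```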

The award

There were many entries submitted and I read them all. My favorite was by Michael Eisen for his creative use of multiple testing correction, although I’m happier with the direction that yields $8.79. I will not be awarding him the prize though, because his submission fails the test of “reasonable”, although I will probably take him out to lunch sometime at Perdition Smokehouse.

I can’t review every single entry or this post, which is already too long, would become unbearable, but I did think long and hard about the entry of K. It doesn’t directly answer the question of why the 95% number is striking, nor do I completely agree with some of the assumptions (e.g. if neither gene in a pair accelerates then the parent gene was not accelerated pre-WGD). But I’ll give the entry an honorable mention.

The prize will be awarded to Pseudo for defining a reasonable null model and test statistic, and producing the smallest p-value within that framework. With a p-value of 0.64 I will be writing a check in the amount of $156.25. Congratulations Pseudo!!

The biology

One of the most interesting results of the blog post was, in my opinion, the extensive discussion about the truth. Leaving aside the flawed analysis of KBL, what is a reasonable model for evolution post-WGD? I am happy to see the finer technical details continue to be debated, and the intensity of the conversation is awesome! Pavel Pevzner’s cynical belief that “science fiction” is not a literary genre but rather a description of what is published in the journal Science may be realistic, but I hope the comments on my blog can change his mind about what the future can look like. In lieu of trying to summarize the scientific conversation (frankly, I don’t think I could do justice to some of the intricate and excellent arguments posted by some of the population geneticists) I’ll just leave readers to enjoy the comment threads on their own. Comments are still being posted, and I expect the blog post to be a “living” post-publication review for some time. May the skeletons keep finding a way out!

The importance of wrong

Earlier in this post I admitted to being wrong. I have been wrong many times. Even though I’ve admitted some of my mistakes on this blog and elsewhere in talks, I would like to joke that I’m not going to make it easy for you to find other flaws in my work. That would be a terrible mistake. Saying “I was wrong” is important for science and essential for scientists. Without it people lose trust in both.

I have been particularly concerned with a lack of “I was wrong” in genomics. Unfortunately, there is a culture that has developed among “leaders” in the field where the three words admitting error or wrongdoing are taboo. The recent paper of Lin et al. critiqued by Gilad-Mizrahi is a good example. Leaving aside the question of whether the result in the paper is correct (there are strong indications that it isn’t), Gilad-Mizrahi began their critique on twitter by noting that the authors had completely failed to account for, or even discuss, batch effect. Nobody, and I mean nobody who works on RNA-Seq would imagine for even a femtosecond that this is ok. It was a major oversight and mistake. The authors, any of them really, could have just come out and said “I was wrong”. Instead, the last author on the paper, Mike Snyder, told reporters that “All of the sequencing runs were conducted by the same person using the same reagents, lowering the risk of unintentional bias”. Seriously?

Examples abound. The “ENCODE 80% kerfuffle” involved claims that “80% of the genome is functional”. Any self-respecting geneticist recognizes such headline grabbing as rubbish. Ewan Birney, a distinguished scientist who has had a major impact on genomics, having been instrumental in the ENSEMBL project and many other high-profile bioinformatics programs, defended the claim on BBC:

“EB: Ah, so, I don’t — It’s interesting to reflect back on this. For me, the big important thing of ENCODE is that we found that a lot of the genome had some kind of biochemical activity. And we do describe that as “biochemical function”, but that word “function” in the phrase “biochemical function” is the thing which gets confusing. If we use the phrase “biochemical activity”, that’s precisely what we did, we find that the different parts of the genome, [??] 80% have some specific biochemical event we can attach to it. I was often asked whether that 80% goes to 100%, and that’s what I believe it will do. So, in other words, that number is much more about the coverage of what we’ve assayed over the entire genome. In the paper, we say quite clearly that the majority of the genome is not under negative selection, and we say that most of the elements are not under pan-mammalian selection. So that’s negative selection we can detect between lots of different mammals. [??] really interesting question about what is precisely going on in the human population, but that’s — you know, I’m much closer to the instincts of this kind of 10% to 20% sort of range about what is under, sort of what evolution cares about under selection.”

This response, and others by members of the ENCODE consortium, upset many people who may struggle to tell apart white and gold from blue and black, but certainly know that white is not black and black is not white. Likewise, I suspect the response of KBL to my post disappointed many as well. For Fisher’s sake, why not just acknowledge what is obvious and true?

The personal critique of professional conduct

A conversation topic that emerged as a result of the blog (mostly on other forums) is the role of style in online discussion of science. Specifically, the question of whether personal attacks are legitimate has come up previously on my blog pages and also in conversations I’ve had with people. Here is my opinion on the matter:

Science is practiced by human beings. Just like with any other human activity, some of the humans who practice it are ethical while others are not. Some are kind and generous while others are… not. Occasionally scientists are criminal. Frequently they are honorable. Of particular importance is the fact that most scientists’ behavior is not at any of these extremes, but rather a convex combination of the mentioned attributes and many others.

In science it is people who benefit, or are hurt, by the behavior of scientists. Preprints on the bioRxiv do not collect salaries, the people who write them do. Papers published in journals do not get awarded or rejected tenure, people do. Grants do not get jobs, people do. The behavior of people in science affects… people.

Some argue for a de facto ban on discussing the personal behavior of scientists. I agree that the personal life of scientists is off limits. But their professional life shouldn’t be. When Bernie Madoff fabricated gains of $65 billion it was certainly legitimate to criticize him personally. Imagine if that was taboo, and instead only the technical aspects of his Ponzi scheme were acceptable material for public debate. That would be a terrible idea for the finance industry, and so it should be for science. Science is not special among the professions, and frankly, the people who practice it hold no primacy over others.

I therefore believe it is not only acceptable but imperative to critique the professional behavior of persons who are scientists. I also think that doing so will help eliminate the problematic devil–saint dichotomy that persists with the current system. Having developed a culture in which personal criticism is outlawed in scientific conversations while only science is fair fodder for public discourse, we now have a situation where scientists are all presumed to be living Gods, or else serious criminals to be outlawed and banished from the scientific community. Acknowledging that there ought to be a grey zone, and developing a healthy culture where critique of all aspects of science and scientists is possible and encouraged would relieve a lot of pressure within the current system. It would also be more fair and just.

A final wish

I wish the authors of the KBL paper would publish the reviews of their paper on this blog.

About one and a half years ago I wrote a blog post titled “GTEx is throwing away 90% of their data“. The post was, shall we say, “direct”. For example, in reference to the RNA-Seq quantification program Flux Capacitor I wrote that

Using Flux Capacitor is equivalent to throwing out 90% of the data!

I added that “the methods description in the Online Methods of Montgomery et al. can only be (politely) described as word salad” (after explaining that the methods underlying the program were never published, except for a brief mention in that paper). I referred to the sole figure in Montgomery et al. as a “completely useless” description of the method  (and showed that it contained errors). I highlighted the fact that many aspects of Flux Capacitor, its description and documentation provided on its website were “incoherent”. Can we agree that this description is not flattering?

The claim about “throwing out 90% of the data” was based on benchmarking I reported on in the blog post. If I were to summarize the results (politely), I would say that the take home message was that Flux Capacitor is junk. Perhaps nobody had really noticed because nobody cared about the program. Flux Capacitor was literally being used only by the authors of the program  (and their affiliates, which turned out to include the ENCODE, GENCODE, GEUVADIS and GTEx consortiums). In fact, when I wrote the blog post, I don’t think the program had ever been benchmarked or compared to other tools. It was, after all, unpublished and besides, nobody reads consortium papers. However after I blogged a few others decided to include Flux Capacitor in their benchmarks and the conclusions reached were the same as mine: Flux Capacitor is junk and Flux Capacitor is junk. Of course some people objected to my blog post when it came out, so it’s fun to be right and have others say so in print. But true vindication has come in the form of a citation to the blog post in a published paper in a journal! Specifically, in

C. Iannone, A. Pohl, P. Papasaikas, D. Soronellas, G.P. Vincent, M. Beato and J. Valcárcel, Relationship between nucleosome positioning and progesterone-induced alternative splicing in breast cancer cells, RNA 21 (2015) 360–374

the authors cite my blog post. They write:

Ummm…. wait… WHAT THE FLUX? The authors actually used Flux Capacitor for their analysis, and are citing my blog at https://liorpachter.wordpress.com/tag/flux-capacitor/ as the definitive reference for the program. Wait, what again?? They used my blog post as a reference for the method??? This is like [[ readers are invited to leave a comment offering a suitable analogy ]].

I’m not really sure what the authors can do at this point. They could publish an erratum and replace the citation. But with what? Flux Capacitor still hasn’t been published (!) Then there is the journal. Does any journal really think it is acceptable to list my blog as the citation for an RNA-Seq quantification tool that is fundamental for the results in a paper? (I’m flattered, but still…) Speaking of the journal, where were the reviewers? How could they not catch this? And the readers? The paper has been out since January… I have to ask: has anybody read it? Of course the biggest embarrassment here is the fact that there is a citation for Flux Capacitor at all. Why on earth are the authors using this discredited program??? Well maybe one answer is to be found in the acknowledgments section, where the group of a PI from the GTEx project is thanked. Actually, this PI was the last author on one of the recently published GTEx companion papers, which, I am sad to say… used Flux Capacitor (albeit with some quantifications performed with Cufflinks as well to demonstrate “robustness”). Why would GTEx be pushing for Flux Capacitor and insist on its use? We’ve come full circle to my GTEx blog post. By now I don’t even know what I think is the most embarrassing part of this whole story. So I thought I’d host a poll:

Earlier this week US News and World Report (USNWR) released, for the first time, a global ranking of universities including rankings by subject area. In mathematics, the top ten universities are:

1. Berkeley
2. Stanford
3. Princeton
4. UCLA
5. University of Oxford
6. Harvard
7. King Abdulaziz University
8. Pierre and Marie Curie – Paris 6
9. University of Hong Kong
10. University of Cambridge

The past few days I’ve received a lot of email from colleagues and administrators about this ranking, and also the overall global ranking of USNWR in which Berkeley was #1. The emails generally say something to the effect of “of course rankings are not perfect, everybody knows… but look, we are amazing!”

BUT, one of the top math departments in the world, the math department at the Massachusetts Institute of Technology, is ranked #11… they didn’t even make the top ten. Even more surprising is the entry at #7 that I have boldfaced: the math department at King Abdulaziz University (KAU) in Jeddah, Saudi Arabia. I’ve been in the math department at Berkeley for 15 years, and during this entire time I’ve never (to my knowledge) met a person from their math department and I don’t recall seeing a job application from any of their graduates… I honestly had never heard of the university in any scientific context. I’ve heard plenty about KAUST (the King Abdullah University of Science and Technology) during the past few years, especially because it is the first mixed-gender university campus in Saudi Arabia, is developing a robust research program based on serious faculty hires from overseas, and in a high profile move hired former Caltech president Jean-Lou Chameau to run the school. But KAU is not KAUST.

A quick Google search reveals that although KAU is nearby in Jeddah, it is a very different type of institution. It has two separate campuses for men and women. Although it was established in 1967 (Osama Bin Laden was a student there in 1975) its math department started a Ph.D. program only two years ago. According to the math department website, the chair of the department, Prof. Abdullah Mathker Alotaibi, is a 2005 Ph.D. with zero publications [Update: Nov. 10: This initial claim was based on a Google Scholar search of his full name; a reader commented below that he has published and that this claim was incorrect. Nevertheless, I do not believe it in any way materially affects the points made in this post.] This department beat MIT math in the USNWR global rankings! Seriously?

The USNWR rankings are based on 8 attributes:

– global research reputation
– regional research reputation
– publications
– normalized citation impact
– total citations
– number of highly cited papers
– percentage of highly cited papers
– international collaboration

Although KAU’s full time faculty are not very highly cited, it has amassed a large adjunct faculty that helped them greatly in these categories. In fact, in “normalized citation impact” KAU’s math department is the top ranked in the world. This amazing statistic is due to the fact that KAU employs (as adjunct faculty) more than a quarter of the highly cited mathematicians at Thomson Reuters. How did a single university assemble a group with such a large proportion of the world’s prolific (according to Thomson Reuters) mathematicians? (When I first heard this statistic from Iddo Friedberg via Twitter I didn’t believe it and had to go compute it myself from the data on the website. I guess I believe it now but I still can’t believe it!!)

In 2011 Yudhijit Bhattacharjee published an article in Science titled “Saudi Universities Offer Cash in Exchange for Academic Prestige” that describes how KAU is targeting highly cited professors for adjunct faculty positions. According to the article, professors are hired as adjunct professors at KAU for $72,000 per year in return for agreeing (apparently by contract) to add KAU as a secondary affiliation at ISIhighlycited.com and for adding KAU as an affiliation on their published papers. Annual visits to KAU are apparently also part of the “deal” although it is unclear from the Science article whether these actually happen regularly or not.

[UPDATE Oct 31, 12:14pm: A friend who was solicited by KAU sent me the invitation email with the contract that KAU sends to potential “Distinguished Adjunct Professors”. The details are exactly as described in the Bhattacharjee article:

From: "Dr. Mansour Almazroui" <ceccr@kau.edu.sa>
Date: XXXX
To: XXXX <XXXX>
Subject: Re: Invitation to Join “International Affiliation Program” at King Abdulaziz University, Jeddah Saudi Arabia

Dear Prof. XXXX,

Hope this email finds you in good health. Thank you for your interest. Please find below the information you requested to be a “Distinguished Adjunct Professor” at KAU.

1. Joining our program will put you on an annual contract initially for one year but further renewable. However, either party can terminate its association with one month prior notice.
2. The Salary per month is $6000 for the period of contract.
3. You will be required to work at KAU premises for three weeks in each contract year. For this you will be accorded with expected three visits to KAU.
4. Each visit will be at least for one week long but extendable as suited for research needs.
5. Air tickets entitlement will be in Business-class and stay in Jeddah will be in a five star hotel. The KAU will cover all travel and living
6. You have to collaborate with KAU local researchers to work on KAU funded (up to $100,000.00) projects.
7. It is highly recommended to work with KAU researchers to submit an external funded project by different agencies in Saudi Arabia.
8. May submit an international patent.
9. It is expected to publish some papers in ISI journals with KAU affiliation.
10. You will be required to amend your ISI highly cited affiliation details at the ISI highlycited.com web site to include your employment and affiliation with KAU.

Kindly let me know your acceptance so that the official contract may be preceded.

Sincerely,
Mansour ]

The publication of the Science article elicited a strong rebuttal from KAU on the comments section, where it was vociferously argued that the hiring of distinguished foreign scholars was aimed at creating legitimate research collaborations, and was not merely a gimmick for increasing citation counts. Moreover, some of the faculty who had signed on defended the decision in the article. For example, Neil Robertson, a distinguished graph theorist (of Robertson-Seymour graph minors fame) explained that “it’s just capitalism,” and “they have the capital and they want to build something out of it.” He added that “visibility is very important to them, but they also want to start a Ph.D. program in mathematics,” (they did do that in 2012) and he added that he felt that “this might be a breath of fresh air in a closed society.” It is interesting to note that despite his initial enthusiasm and optimism, Professor Robertson is no longer associated with KAU.

In light of the high math ranking of KAU in the current USNWR I decided to take a closer look at who KAU has been hiring, why, and for what purpose, i.e. I decided to conduct post-publication peer review of the Bhattacharjee Science paper. A web page at KAU lists current “Distinguished Scientists” and another page lists “Former Distinguished Adjunct Professors”.
One immediate observation is that out of 118 names on these pages there is 1 woman (Cheryl Praeger from the University of Western Australia). Given that KAU has two separate campuses for men and women, it is perhaps not surprising that women are not rushing to sign on, and perhaps KAU is also not rushing to invite them (I don’t have any information one way or another, but the underrepresentation seems significant). Aside from these faculty, there is also a program aptly named the “Highly Cited Researcher Program” that is part of the Center for Excellence in Genomic Medicine Research. Fourteen faculty are listed there (all men, zero women). But guided by the Science article which described the contract requirement that researchers add KAU to their ISI affiliation, I checked for adjunct KAU faculty at Thomson-Reuters ResearcherID.com and there I found what appears to be the definitive list. Although Neil Robertson has left KAU, he has been replaced by another distinguished graph theorist, namely Carsten Thomassen (no accident as his wikipedia page reveals that “He was included on the ISI Web of Knowledge list of the 250 most cited mathematicians.”) This is a name I immediately recognized due to my background in combinatorics; in fact I read a number of Thomassen’s papers as a graduate student. I decided to check whether it is true that adjunct faculty are adding KAU as an affiliation on their articles. Indeed, Thomassen has done exactly that in his latest publication Strongly 2-connected orientations of graphs published this year in the Journal of Combinatorial Theory Series B. At this point I started having serious reservations about the ethics of faculty who have agreed to be adjuncts at KAU. Regardless of the motivation of KAU in hiring adjunct highly cited foreign faculty, it seems highly inappropriate for a faculty member to list an affiliation on a paper to an institution to which they have no scientific connection whatsoever. 
I find it very hard to believe that serious graph theory is being researched at KAU, an institution that didn’t even have a Ph.D. program until 2012. It is inconceivable that Thomassen joined KAU in order to find collaborators there (he mostly publishes alone), or that he suddenly found a great urge to teach graph theory in Saudi Arabia (KAU had no Ph.D. program until 2012). The problem is also apparent when looking at the papers of researchers in genomics/computational biology that are adjuncts at KAU. I recognized a number of such faculty members, including high-profile names from my field such as Jun Wang, Manolis Dermitzakis and John Huelsenbeck. I was surprised to see their names (none of these faculty mention KAU on their websites) yet in each case I found multiple papers they have authored during the past year in which they list the KAU affiliation. I can only wonder whether their home institutions find this appropriate. Then again, maybe KAU is also paying the actual universities the faculty they are citation borrowing belong to? But assume for a moment that they aren’t, then why should institutions share the credit they deserve for supporting their faculty members by providing them space, infrastructure, staff and students with KAU? What exactly did KAU contribute to Kilpinen et al. Coordinated effects of sequence variation on DNA binding, chromatin structure and transcription, Science, 2013? Or to Landis et al. Bayesian analysis of biogeography when the number of areas is large, Systematic Biology, 2013? These papers have no authors or apparent contribution from KAU. Just the joint affiliation of the adjunct faculty member. The limit of the question arises in the case of Jun Wang, director of the Beijing Genome Institute, whose affiliations are BGI (60%), University of Copenhagen (15%), King Abdulaziz University (15%), The University of Hong Kong (5%), Macau University of Science and Technology (5%). Should he also acknowledge the airlines he flies on? 
Should there not be some limit on the number of affiliations of an individual? Shouldn’t journals have a policy about when it is legitimate to list a university as an affiliation for an author? (e.g. the author must have in some significant way been working at the institution). Another, bigger, disgrace that emerged in my examination of the KAU adjunct faculty is the issue of women. Aside from the complete lack of women in the “Highly Cited Researcher Program”, I found that most of the genomics adjunct faculty hired via the program will be attending an all-male conference in three weeks. The “Third International Conference on Genomic Medicine” will be held from November 17–20th at KAU. This conference has zero women. The same meeting last year… had zero women. I cannot understand how in 2014, at a time when many are speaking out strongly about the urgency of supporting females in STEM and in particular about balancing meetings, a bunch of men are willing to forgo all considerations of gender equality for the price of ~$3 per citation per year (a rough calculation using the figure of 72,000 per year from the Bhattacharjee paper and 24,000 citations for a highly cited researcher). To be clear I have no personal knowledge about whether the people I’ve mentioned in this article are actually being paid or how much, but even if they are being paid zero it is not ok to participate in such meetings. Maybe once (you didn’t know what you are getting into), but twice?! As for KAU, it seems clear based on the name of the “Highly Cited Researcher Program” and the fact that they advertise their rankings that they are specifically targeting highly cited researchers much more for their delivery of their citations than for development of genuine collaborations (looking at the adjunct faculty I failed to see any theme or concentration of people in any single area as would be expected in building a coherent research program). 
However I do not fault KAU for the goal of increasing the ranking of their institution. I can see an argument for deliberately increasing rankings in order to attract better students, which in turn can attract faculty. I do think that three years after the publication of the Science article, it is worth taking a closer look at the effects of the program (rankings have increased considerably but it is not clear that research output from individuals based at KAU has increased), and whether this is indeed the most effective way to use money to improve the quality of research institutions. The existence of KAUST lends credence to the idea that the king of Saudi Arabia is genuinely interested in developing Science in the country, and there is a legitimate research question as to how to do so with the existing resources and infrastructure. Regardless of how things ought to be done, the current KAU emphasis on rankings is a reflection of the rankings, which USNWR has jumped into with its latest worldwide ranking. The story of KAU is just evidence of a bad problem getting worse. I have previously thought about the bad version of the problem: A few years ago I wrote a short paper with my (now former) student Peter Huggins on university rankings: P. Huggins and L.P., Selecting universities: personal preferences and rankings, arXiv, 2008. It exists only as an arXiv preprint as we never found a suitable venue for publication (this is code for the paper was rejected upon peer review; no one seemed interested in finding out the extent to which the data behind rankings can produce a multitude of stories). The article addresses a simple question: given that various attributes have been measured for a bunch of universities, and assuming they are combined (linearly) into a score used to produce rankings, how do the rankings depend on the weightings of the individual attributes? 
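To make the question concrete, here is a minimal sketch that brute-forces a grid of weightings for a handful of made-up schools and records which rankings show up. It is only an illustration and not the method of the paper: the school names, the attribute scores, the grid resolution and the tie-breaking rule are all hypothetical, and enumeration over a grid only approximates the set of achievable rankings.

```python
from itertools import product

def weight_grid(n_attrs, steps):
    """Yield weight vectors on a regular grid over the simplex (non-negative, summing to 1)."""
    for combo in product(range(steps + 1), repeat=n_attrs):
        if sum(combo) == steps:
            yield tuple(c / steps for c in combo)

def ranking(scores, weights):
    """Rank items by their weighted linear score, best first; ties are broken by name."""
    total = {name: sum(w * a for w, a in zip(weights, attrs))
             for name, attrs in scores.items()}
    return tuple(sorted(total, key=lambda name: (-total[name], name)))

def achievable_rankings(scores, steps=10):
    """Collect every distinct ranking produced by some weighting on the grid."""
    n_attrs = len(next(iter(scores.values())))
    return {ranking(scores, w) for w in weight_grid(n_attrs, steps)}

# Hypothetical scores on three attributes per school, made up for illustration only.
toy_scores = {
    "A": (0.9, 0.9, 0.9),
    "B": (0.8, 0.2, 0.5),
    "C": (0.3, 0.7, 0.6),
    "D": (0.5, 0.5, 0.4),
}

rankings = achievable_rankings(toy_scores)
# For each school, the set of positions (1 = top) it can reach under some weighting.
positions = {name: sorted({r.index(name) + 1 for r in rankings}) for name in toy_scores}
print(len(rankings), "distinct rankings on the grid")
print(positions)
```

In this toy data school A is best on every attribute, so it is ranked first under every weighting, while the order of the remaining schools depends on which attribute is emphasized.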
The mathematics is that of polyhedral geometry, where the problem is to compute a normal fan of a polytope whose vertices encode all the possible rankings that can be obtained for all possible weightings of the attributes (an object we called the unitope). An example is shown below, indicating the possible rankings as determined by weightings chosen among three attributes measured by USNWR (freshman retention, selectivity, peer assessment). It is important to keep in mind this is data from 2007-2008.

Our paper had an obvious but important message: rankings can be very sensitive to the attribute weightings. Of course some schools such as Harvard came out on top regardless of attribute preferences, but some schools, even top ranked schools, could shift by over 50 positions. Our conclusion was that although the data collected by USNWR was useful, the specific weighting chosen and the ranking it produced were not. Worse than that, sticking to a single choice of weightings was misleading at best, dangerous at worst.

I was reminded of this paper when looking at the math department rankings just published by USNWR. When I saw that KAU was #7 I was immediately suspicious, and even Berkeley’s #1 position bothered me (even though I am a faculty member in the department). I immediately guessed that they must have weighted citations heavily, because our math department has applied math faculty, and KAU has their “highly cited researcher program”. Averaging citations across faculty from different (math) disciplines is inherently unfair. In the case of Berkeley, my applied math colleague James Sethian has a paper on level set methods with more than 10,000 (Google Scholar) citations. This reflects the importance and advance of the paper, but also the huge field of users of the method (many, if not most, of the disciplines in engineering). On the other hand, my topology colleague Ian Agol’s most cited paper has just over 200 citations.
This is very respectable for a mathematics paper, but even so it doesn’t come close to reflecting his true stature in the field, namely the person who settled the Virtually Haken Conjecture thereby completing a long standing program of William Thurston that resulted in many of the central open problems in mathematics (Thurston was also incidentally an adjunct faculty member at KAU for some time). In other words, not only are citations not everything, they can also be not anything. By comparing citations across math departments that are diverse to very differing degrees USNWR rendered the math ranking meaningless. Some of the other data collected, e.g. reputation, may be useful or relevant to some, and for completeness I’m including it with this post (here) in a form that allows for it to be examined properly (USNWR does not release it in the form of a table, but rather piecemeal within individual html pages on their site), but collating the data for each university into one number is problematic. In my paper with Peter Huggins we show both how to evaluate the sensitivity of rankings to weightings and also how to infer bounds on the weightings by USNWR from the rankings. It would be great if USNWR included the ability to perform such computations with their data directly on their website but there is a reason USNWR focuses on citations. The impact factor of a journal is a measure of the average amount of citation per article. It is computed by averaging the citations over all articles published during the preceding two years, and its advertisement by journals reflects a publishing business model where demand for the journal comes from the impact factor, profit from free peer reviewing, and sales from closed subscription based access. Everyone knows the peer review system is broken, but it’s difficult to break free of when incentives are aligned to maintain it. 
Moreover, it leads to a perverse focus of academic departments on the journals their faculty are publishing in and the citations they accumulate. Rankings such as those by USNWR reflect the emphasis on citations that originates with the journals, and so one cannot fault USNWR for including it as a factor and weighting it highly in their rankings. Having said that, USNWR should have known better than to publish the KAU math rankings; in fact it appears their publication might be a bug. The math department rankings are the only rankings that appear for KAU. They have been omitted entirely from the global overall ranking and other departmental rankings (I wonder if this is because USNWR knows about the adjunct faculty purchase). In any case, the citation frenzy feeds departments that in aggregate form universities. Universities such as King Abdulaziz, that may reach the point where they feel compelled to enter into the market of citations to increase their overall profile…

I hope this post frightened you. It should. Happy Halloween!

[Update: Dec. 6: an article about KAU and citations has appeared in the Daily Cal, Jonathan Eisen posted his exchanges with KAU, and he has storified the tweets]

Nature Publishing Group claims on its website that it is committed to publishing “original research” that is “of the highest quality and impact”. But when exactly is research “original”? This is a question with a complicated answer. A recent blog post by senior editor Dorothy Clyde at Nature Protocols provides insight into the difficulties Nature faces in detecting plagiarism, and identifies the issue of self plagiarism as particularly problematic. The journal tries to avoid publishing the work of authors who have previously published the same work or a minor variant thereof.
I imagine this is partly in the interests of fairness, a service to the scientific community to ensure that researchers don’t have to sift through numerous variants of a single research project in the literature, and a personal interest of the journal in its aim to publish only the highest level of scholarship.

On the other hand, there is also a rationale for individual researchers to revisit their own previously published work. Sometimes results can be recast in a way that makes them accessible to different communities, and rethinking of ideas frequently leads to a better understanding, and therefore a better exposition. The mathematician Gian-Carlo Rota made the case for enlightened self-plagiarism in one of his ten lessons he wished he had been taught when he was younger:

3. Publish the same result several times

After getting my degree, I worked for a few years in functional analysis. I bought a copy of Frederick Riesz’ Collected Papers as soon as the big thick heavy oversize volume was published. However, as I began to leaf through, I could not help but notice that the pages were extra thick, almost like cardboard. Strangely, each of Riesz’ publications had been reset in exceptionally large type. I was fond of Riesz’ papers, which were invariably beautifully written and gave the reader a feeling of definitiveness. As I looked through his Collected Papers however, another picture emerged. The editors had gone out of their way to publish every little scrap Riesz had ever published. It was clear that Riesz’ publications were few. What is more surprising is that the papers had been published several times. Riesz would publish the first rough version of an idea in some obscure Hungarian journal. A few years later, he would send a series of notes to the French Academy’s Comptes Rendus in which the same material was further elaborated. A few more years would pass, and he would publish the definitive paper, either in French or in English. Adam Koranyi, who took courses with Frederick Riesz, told me that Riesz would lecture on the same subject year after year, while meditating on the definitive version to be written. No wonder the final version was perfect. Riesz’ example is worth following. The mathematical community is split into small groups, each one with its own customs, notation and terminology. It may soon be indispensable to present the same result in several versions, each one accessible to a specific group; the price one might have to pay otherwise is to have our work rediscovered by someone who uses a different language and notation, and who will rightly claim it as his own.

The question is: where does one draw the line?

I was recently forced to confront this question when reading an interesting paper about a statistical approach to utilizing controls in large-scale genomics experiments:

J.A. Gagnon-Bartsch and T.P. Speed, Using control genes to correct for unwanted variation in microarray data, Biostatistics, 2012.

A cornerstone in the logic and methodology of biology is the notion of a “control”. For example, when testing the result of a drug on patients, a subset of individuals will be given a placebo. This is done to literally control for effects that might be measured in patients taking the drug, but that are not inherent to the drug itself. By examining patients on the placebo, it is possible to essentially cancel out uninteresting effects that are not specific to the drug. In modern genomics experiments that involve thousands, or even hundreds of thousands of measurements, there is a biological question of how to design suitable controls, and a statistical question of how to exploit large numbers of controls to “normalize” (i.e. remove unwanted variation from) the high-dimensional measurements. Formally, one framework for thinking about this is a linear model for gene expression.
Using the notation of Gagnon-Bartsch & Speed, we have an expression matrix $Y$ of size $m \times n$ (samples and genes) modeled as

$Y_{m \times n} = X_{m \times p}\beta_{p \times n} + Z_{m \times q}\gamma_{q \times n} + W_{m \times k} \alpha_{k \times n} + \epsilon_{m \times n}$.

Here $X$ is a matrix describing various conditions (also called factors) and associated to it is the parameter matrix $\beta$ that records the contribution, or influence, of each factor on each gene. $\beta$ is the primary parameter of interest to be estimated from the data $Y$. The $\epsilon$ are random noise, and finally $Z$ and $W$ are observed and unobserved covariates respectively. For example $Z$ might encode factors for covariates such as gender, whereas $W$ would encode factors that are hidden, or unobserved. A crucial point is that the number of hidden factors in $W$, namely $k$, is not known. The matrices $\gamma$ and $\alpha$ record the contributions of the $Z$ and $W$ factors on gene expression, and must also be estimated. It should be noted that $Y$ may be the logarithm of expression levels from a microarray experiment, or the analogous quantity from an RNA-Seq experiment (e.g. log of abundance in FPKM units).

Linear models have been applied to gene expression analysis for a very long time; I can think of papers going back 15 years. But they became central to all analysis about a decade ago, specifically popularized with the Limma package for microarray data analysis. In an important paper in 2007, Leek and Storey focused explicitly on the identification of hidden factors and estimation of their influence, using a method called SVA (Surrogate Variable Analysis). Mathematically, they described a procedure for estimating $k$ and $W$ and the parameters $\alpha$.
I will not delve into the details of SVA in this post, except to say that the overall idea is to first perform linear regression (assuming no hidden factors) to identify the parameters $\beta$ and to then perform singular value decomposition (SVD) on the residuals to identify hidden factors (details omitted here). The resulting identified hidden factors (and associated influence parameters) are then used in a more general model for gene expression in subsequent analysis.

Gagnon-Bartsch and Speed refine this idea by suggesting that it is better to infer W from controls. For example, house-keeping genes that are unlikely to correlate with the conditions being tested can be used to first estimate W, and then subsequently all the parameters of the model can be estimated by linear regression. They term this two-step process RUV-2 (acronym for Remove Unwanted Variation) where the “2” designates that the procedure is a two-step procedure. As with SVA, the key to inferring W from the controls is to perform singular value decomposition (or more generally factor analysis). This is actually clear from the probabilistic interpretation of PCA and the observation that what it means to be in the set of “control genes” C, in a setting where there are no observed factors Z, is that $Y_C = W \alpha_C + \epsilon_C$. That is, for such control genes the corresponding $\beta$ parameters are zero. This is a simple but powerful observation, because the explicit designation of control genes in the procedure makes it clear how to estimate W, and therefore the procedure becomes conceptually compelling and practically simple to implement. Thus, even though the model being used is the same as that of Leek & Storey, there is a novel idea in the paper that makes the procedure “cleaner”. Indeed, Gagnon-Bartsch & Speed provide experimental results in their paper showing that RUV-2 outperforms SVA. Even more convincing is the use of RUV-2 by others.
For example, in a paper on “The functional consequences of variation in transcription factor binding” by Cusanovich et al., PLoS Genetics 2014, RUV-2 is shown to work well, and the authors explain how it helps them to take advantage of the controls in the experimental design they created. There is a tech report and also a preprint that follow up on the Gagnon-Bartsch & Speed paper; the tech report extends RUV-2 to a four-step method RUV-4 (it also provides a very clear exposition of the statistics), and separately the preprint describes an extension to RUV-2 for the case where the factor of interest is also unknown. Both of these papers build on the original paper in significant ways and are important work that, to return to the original question in the post, is certainly on the right side of “the line”. The wrong side of the line? The development of RUV-2 and SVA occurred in the context of microarrays, and it is natural to ask whether the details are really different for RNA-Seq (spoiler: they aren’t). In a book chapter published earlier this year (D. Risso, J. Ngai, T.P. Speed, S. Dudoit, The role of spike-in standards in the normalization of RNA-Seq, in Statistical Analysis of Next Generation Sequencing Data (2014), 169-190), the authors replace “log expression levels” from microarrays with “log counts” from RNA-Seq and the linear regression performed with Limma for RUV-2 with a Poisson regression (this involves one different R command). They call the new method RUV, which is the same as the previously published RUV, a naming convention that makes sense since the paper has no new method. In fact, the mathematical formulas describing the method are identical (and even in almost identical notation!) with the exception that the book chapter ignores $Z$ altogether, and replaces $\epsilon$ with $O$. To be fair, there is one added highlight in the book chapter, namely the observation that spike-ins can be used in lieu of housekeeping (or other control) genes. 
The method is unchanged, of course. It is just that the spike-ins are used to estimate W. Although spike-ins were not mentioned in the original Gagnon-Bartsch paper, there is no reason not to use them with arrays as well; they are standard with Affymetrix arrays. My one critique of the chapter is that it doesn’t make sense to me that counts are used in the procedure. I think it would be better to use abundance estimates, and in fact I believe that Jeff Leek has already investigated the possibility in a preprint that appears to be an update to his original SVA work. That issue aside, the book chapter does provide concrete evidence using a Zebrafish experiment that RUV-2 is relevant and works for RNA-Seq data. The story should end here (and this blog post would not have been written if it had) but two weeks ago, among five RNA-Seq papers published in Nature Biotechnology (I have yet to read the others), I found the following publication: D. Risso, J. Ngai, T.P. Speed, S. Dudoit, Normalization of RNA-Seq data using factor analysis of control genes or samples, Nature Biotechnology 32 (2014), 896-902. This paper has the same authors as the book chapter (with the exception that Sandrine Dudoit is now a co-corresponding author with Davide Risso, who was the sole corresponding author on the first publication), and, it turns out, it is basically the same paper… in fact in many parts it is the identical paper. It looks like the Nature Biotechnology paper is an edited and polished version of the book chapter, with a handful of additional figures (based on the same data) and better graphics. I thought that Nature journals publish original and reproducible research papers. I guess I didn’t realize that for some people “reproducible” means “reproduce your own previous research and republish it”. At this point, before drawing attention to some comparisons between the papers, I’d like to point out that the book chapter was refereed. 
This is clear from the fact that it is described as such in both corresponding authors’ CVs. How similar are the two papers? Final paragraph of paper in the book: Internal and external controls are essential for the analysis of high-throughput data and spike-in sequences have the potential to help researchers better adjust for unwanted technical effects. With the advent of single-cell sequencing [35], the role of spike-in standards should become even more important, both to account for technical variability [6] and to allow the move from relative to absolute RNA expression quantification. It is therefore essential to ensure that spike-in standards behave as expected and to develop a set of controls that are stable enough across replicate libraries and robust to both differences in library composition and library preparation protocols. Final paragraph of paper in Nature Biotechnology: Internal and external controls are essential for the analysis of high-throughput data and spike-in sequences have the potential to help researchers better adjust for unwanted technical factors. With the advent of single-cell sequencing27, the role of spike-in standards should become even more important, both to account for technical variability28 and to allow the move from relative to absolute RNA expression quantification. It is therefore essential to ensure that spike- in standards behave as expected and to develop a set of controls that are stable enough across replicate libraries and robust to both differences in library composition and library preparation protocols. Abstract of paper in the book: Normalization of RNA-seq data is essential to ensure accurate inference of expression levels, by adjusting for sequencing depth and other more complex nuisance effects, both within and between samples. 
Recently, the External RNA Control Consortium (ERCC) developed a set of 92 synthetic spike-in standards that are commercially available and relatively easy to add to a typical library preparation. In this chapter, we compare the performance of several state-of-the-art normalization methods, including adaptations that directly use spike-in sequences as controls. We show that although the ERCC spike-ins could in principle be valuable for assessing accuracy in RNA-seq experiments, their read counts are not stable enough to be used for normalization purposes. We propose a novel approach to normalization that can successfully make use of control sequences to remove unwanted effects and lead to accurate estimation of expression fold-changes and tests of differential expression. Abstract of paper in Nature Biotechnology: Normalization of RNA-sequencing (RNA-seq) data has proven essential to ensure accurate inference of expression levels. Here, we show that usual normalization approaches mostly account for sequencing depth and fail to correct for library preparation and other more complex unwanted technical effects. We evaluate the performance of the External RNA Control Consortium (ERCC) spike-in controls and investigate the possibility of using them directly for normalization. We show that the spike-ins are not reliable enough to be used in standard global-scaling or regression-based normalization procedures. We propose a normalization strategy, called remove unwanted variation (RUV), that adjusts for nuisance technical effects by performing factor analysis on suitable sets of control genes (e.g., ERCC spike-ins) or samples (e.g., replicate libraries). Our approach leads to more accurate estimates of expression fold-changes and tests of differential expression compared to state-of-the-art normalization methods. In particular, RUV promises to be valuable for large collaborative projects involving multiple laboratories, technicians, and/or sequencing platforms. 
Abstract of Gagnon-Bartsch & Speed paper that already took credit for a “new” method called RUV: Microarray expression studies suffer from the problem of batch effects and other unwanted variation. Many methods have been proposed to adjust microarray data to mitigate the problems of unwanted variation. Several of these methods rely on factor analysis to infer the unwanted variation from the data. A central problem with this approach is the difficulty in discerning the unwanted variation from the biological variation that is of interest to the researcher. We present a new method, intended for use in differential expression studies, that attempts to overcome this problem by restricting the factor analysis to negative control genes. Negative control genes are genes known a priori not to be differentially expressed with respect to the biological factor of interest. Variation in the expression levels of these genes can therefore be assumed to be unwanted variation. We name this method “Remove Unwanted Variation, 2-step” (RUV-2). We discuss various techniques for assessing the performance of an adjustment method and compare the performance of RUV-2 with that of other commonly used adjustment methods such as Combat and Surrogate Variable Analysis (SVA). We present several example studies, each concerning genes differentially expressed with respect to gender in the brain and find that RUV-2 performs as well or better than other methods. Finally, we discuss the possibility of adapting RUV-2 for use in studies not concerned with differential expression and conclude that there may be promise but substantial challenges remain. Many figures are also the same (except one that appears to have been fixed in the Nature Biotechnology paper– I leave the discovery of the figure as an exercise to the reader). 
Here is Figure 9.2 in the book: The two panels appear as (b) and (c) in Figure 4 in the Nature Biotechnology paper (albeit transformed via a 90 degree rotation and reflection from the dihedral group): Basically the whole of the book chapter and the Nature Biotechnology paper are essentially the same, down to the math notation, which even two papers removed is just a rehashing of the RUV method of Gagnon-Bartsch & Speed. A complete diff of the papers is beyond the scope of this blog post and technically not trivial to perform, but examination by eye reveals one to be a draft of the other. Although it is acceptable in the academic community to draw on material from published research articles for expository book chapters (with permission), and conversely to publish preprints, including conference proceedings, in journals, this case is different: (a) the book chapter was refereed, exactly like a journal publication, (b) the material in the chapter is not expository; it is research, (c) it was published before the Nature Biotechnology article, and presumably prepared long before, (d) the book chapter cites the Nature Biotechnology article but not vice versa, and (e) the book chapter is not a particularly innovative piece of work to begin with. The method it describes and claims to be “novel”, namely RUV, was already published by Gagnon-Bartsch & Speed. Below is a musical rendition of what has happened here:

When I was a teenager I broke all the rules on Friday night. After dinner I would watch Louis Rukeyser’s Wall Street Week at 8:30pm, and I would be in bed an hour later. On New Year’s Eve, he had a special “year-end review”, during which he hosted “financial experts” who would opine on the stock market and make predictions for the coming year. What I learned from Louis Rukeyser was: 1. Never trust men in suits (or tuxedos). 2. It’s easier to perpetrate the 1024 scam than one might think! 
Here are the experts in 1999 all predicting increases for the stock market in 2000: As it turned out, the NASDAQ peaked on March 10, 2000, and within a week and a half had dropped 10%. By the end of the year the dot-com bubble had completely burst and a few years later the market had lost almost 80% of its value. Predictions on the last day of the 20th century represented a spectacular failure for the “pundits”, but by then I had already witnessed many failures on the show. I’d also noted that almost all the invited “experts” were men. Of course correlation does not imply causation, but I remember having a hard time dispelling the notion that the guests were wrong because they were men. I never wanted to be sexist, but Louis Rukeyser made it very difficult for me! Gender issues aside, the main lesson I learned from Louis Rukeyser’s show is that it’s easy to perpetrate the 1024 scam. The scam goes something like this: a scammer sends out 1024 emails to individuals that are unlikely to know each other, with each email making a prediction about the performance of the stock market in the coming week. For half the people (512), she predicts the stock market will go up, and for the other half, that it will go down. The next week, she has obviously sent a correct prediction of the market to half the people (this assumes the market is never unchanged after a week). She ignores the 512 people who have received an incorrect prediction, dividing those who received the correct prediction into two halves (256 each). Again, she predicts the performance of the market in the coming week, sending 256 individuals a prediction that the market will go up, and the other 256 a prediction that it will go down. She continues this divide-and-conquer for 10 weeks, at which time there is one individual that has received correct predictions about the movement of the stock market for 2.5 months! 
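The halving procedure just described can be sketched in a few lines. This is only an illustration: the function name `run_scam` is hypothetical, and the market is modeled as a fair coin flip each week, which is an assumption made for the simulation rather than a claim about real markets.

```python
import random

def run_scam(n_recipients=1024, weeks=10, seed=0):
    """Return the recipients who received a correct prediction every week."""
    rng = random.Random(seed)
    recipients = list(range(n_recipients))
    for _ in range(weeks):
        half = len(recipients) // 2
        told_up, told_down = recipients[:half], recipients[half:]
        market_went_up = rng.random() < 0.5   # assumed fair coin
        # Keep only the half that was sent the correct prediction.
        recipients = told_up if market_went_up else told_down
    return recipients

if __name__ == "__main__":
    survivors = run_scam()
    print("recipients with a perfect record:", len(survivors))
    print("chance of 10 correct coin-flip guesses:", 0.5 ** 10)
```

Whatever the market does, exactly one of the 1024 recipients ends up with a perfect ten-week record, because $1024 = 2^{10}$ and each week discards half of the remaining list.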
This person may believe that the scammer has the ability to predict the market; after all, $(\frac{1}{2})^{10} \approx 0.00098$ which looks like a very significant p-value. This is when the scammer asks for a “large investment”. Of course what is missing is knowledge of the other prediction emails sent out, or in other words the multiple testing problem. The Wall Street Week guest panels essentially provided a perfect setting in which to perpetrate this scam. “Experts” that would err would be unlikely to be invited back, whereas regular winners would be back for another chance at guessing. This is a situation very similar to the mutual fund management market, where managers are sacked when they have a bad year, only to have large firms with hundreds of funds on the books highlight funds that have performed well for 10 years in a row in their annual glossy brochures. But that is not the subject matter of this blog post. Rather, it’s the blog itself. I wrote and posted my first blog entry (Genesis of *Seq) exactly a year ago. I began writing it for two reasons. First, I thought it could be a convenient and useful forum for discussion of technical developments in computational biology. I was motivated partly by the seqanswers website, which allows users to share information and experience in dealing with high-throughput sequence data. But I was also inspired by the What’s New Blog that has created numerous bridges in the mathematics community via highly technical yet accessible posts that have democratized mathematics. Second, I had noticed an extraordinary abuse of multiple testing in computational biology, and I was desperate for a forum where I could bring the issue to people’s attention. My initial frustration with outlandish claims in papers based on weak statistics had also grown over time to encompass a general concern for lack of rigor in computational biology papers. None of us are perfect but there is a wide gap between perfect and wrong. 
Computational biology is a field that is now an amalgamation of many subjects and I hoped that a blog would be able to reach the different silos more effectively than publications. And thus this blog was born on August 19th 2013. I started without a preconception of how it would turn out over time, and I’m happy to say I’ve been surprised by its impact, most notably on myself. I’ve learned an enormous amount from reader feedback, in part via comments on individual posts, but also from private emails to me and in personal conversations. For this (selfish) reason alone, I will keep blogging. I have also been asked by many of you to keep posting, and I’m listening. When I have nothing left to say, I promise I will quit. But for now I have a backlog of posts, and after a break this summer, I am ready to return to the keyboard. Besides, since starting to blog I still haven’t been to Las Vegas. There has recently been something of an uproar over the new book A Troublesome Inheritance by Nicholas Wade, with much of the criticism centering on Wade’s claim that race is a meaningful biological category. This subject is one with which I1 have some personal connection since as a child growing up in South Africa in the 1980s, I was myself categorized very neatly by the Office for Race Classification: 10. A simple pair of digits that conferred on me numerous rights and privileges denied to the majority of the population. Explanation of identity numbers assigned to citizens by the South African government during apartheid. And yet the system behind those digits was anything but simple. 
The group to which an individual was assigned could be based on not only their skin color but also their employment, eating and drinking habits, and indeed explicitly social factors as related by Muriel Horrell of the South African Institute of Race Relations: “Should a man who is initially classified white have a number of coloured friends and spend many of his leisure hours in their company, he stands to risk being re-classified as coloured.” With these memories in mind, I found Wade’s concept of race as a biological category quite confusing, a confusion which only deepened when I discovered that he identifies not the eight races designated by the South African Population Registration Act of 1950, but rather five, none of which was the Griqua! With the full force of modern science on his side2, it seemed unlikely that these disparities represented an error on Wade’s part. And so I was left with a perplexing question: how could it be that the South African apartheid regime — racists par excellence — had failed to institutionalize their racism correctly? How had Wade gotten it right when Hendrik Verwoerd had gone awry? Eventually I realized that A Troublesome Inheritance itself might contain the answer to this conundrum. Institutions, Wade explains, are genetic: “they grow out of instinctual social behaviors” and “one indication of such a genetic effect is that, if institutions were purely cultural, it should be easy to transfer an institution from one society to another.”3 So perhaps it is Wade’s genetic instincts as a Briton that explain how he has navigated these waters more skillfully than the Dutch-descended Afrikaners who designed the institutions of apartheid. One might initially be inclined to scoff at such a suggestion or even to find it offensive. However, we must press boldly on in the name of truth and try to explain why this hypothesis might be true. Again, A Troublesome Inheritance comes to our aid. 
There, Wade discusses the decline in English interest rates between 1400 and 1850. This is the result, we learn, of rich English people producing more children than the poor and thereby genetically propagating those qualities which the rich are so famous for possessing: “less impulsive, more patient, and more willing to save.”4 However this period of time saw not only falling interest rates but also the rise of the British Empire. It was a period when Englishmen not only built steam engines and textile mills, but also trafficked in slaves by the millions and colonized countries whose people lacked their imperial genes. These latter activities, with an obvious appeal to the more racially minded among England’s population, could bring great wealth to those who engaged in them and so perhaps the greater reproductive fitness of England’s economic elite propagated not only patience but a predisposition to racism. This would explain, for example, the ability of John Hanning Speke to sniff out “the best blood of Abyssinia” when distinguishing the Tutsi from their Hutu neighbors. Some might be tempted to speculate that Wade is himself a racist. While Wade — who freely speculates about billions of human beings — would no doubt support such an activity, those who engage in such speculation should perhaps not judge him too harshly. After all, racism may simply be Wade’s own troublesome inheritance. #### Footnotes 1. In the spirit of authorship designation as discussed in this post, we describe the author contributions as follows: the recollections of South Africa are those of Lior Pachter, who distinctly remembers his classification as “white”. Nicolas Bray conceived and composed the post with input from LP. LP discloses no conflicts of interest. NB discloses being of British ancestry. 2. Perhaps not quite the full force, given the reception his book has received from actual scientists. 3. 
While this post is satirical, it should be noted for clarity that, improbably, this is an actual quote from Wade’s book. 4. Again, not satire.

In reading the news yesterday I came across multiple reports claiming that even casually smoking marijuana can change your brain. I usually don’t pay much attention to such articles; I’ve never smoked a joint in my life. In fact, I’ve never even smoked a cigarette. So even though as a scientist I’ve been interested in cannabis from the molecular biology point of view, and as a citizen from a legal point of view, the issues have not been personal. However, reading a USA Today article about the paper, I noticed that the principal investigator Hans Breiter was claiming to be a psychiatrist and mathematician. That is an unusual combination so I decided to take a closer look. I immediately found out the claim was a lie. In fact, the totality of math credentials of Hans Breiter consists of some logic/philosophy courses during a year abroad at St. Andrews while he was a pre-med student at Northwestern. Even being an undergraduate major in mathematics does not make one a mathematician, just as being an undergraduate major in biology does not make one a doctor. Thus, with his outlandish claim, Hans Breiter had succeeded in personally offending me! So, I decided to take a look at his paper underlying the multiple news reports: This is quite possibly the worst paper I’ve read all year (as some of my previous blog posts show, I am saying something with this statement). Here is a breakdown of some of the issues with the paper: ### 1. Study design First of all, the study has a very small sample size, with only 20 “cases” (marijuana users), a fact that is important to keep in mind in what follows. The title uses the term “recreational users” to describe them, and in the press release accompanying the article Breiter says that “Some of these people only used marijuana to get high once or twice a week. 
People think a little recreational use shouldn’t cause a problem, if someone is doing OK with work or school. Our data directly says this is not the case.” In fact, the majority of users in the study were smoking more than 10 joints per week. There is even a person in the study smoking more than 30 joints per week (as disclosed above, I’m not an expert on this stuff but if 30 joints per week is “recreation” then it seems to me that person is having a lot of fun). More importantly, Breiter’s statement in the press release is a lie. There is no evidence in the paper whatsoever, not even a tiny shred, that the users who were getting high once or twice a week were having any problems. There are also other issues with the study design. For example, the paper claims the users are not “abusing” other drugs, but it is quite possible that they are getting high on cocaine, heroin, or ??? as well, an issue that could quite possibly affect the study. The experiment consisted of an MRI scan of each user/control, but only a single scan was done. Given the variability in MRI scans this also seems problematic. ### 2. Multiple testing The study looked at three aspects of brain morphometry in the study participants: gray matter density, volume and shape. Each of these morphometric analyses constituted multiple tests. In the case of gray matter density, estimates were based on small clusters of voxels, resulting in 123 tests (association of each voxel cluster with marijuana use). Volumes were estimated for four regions: left and right nucleus accumbens and amygdala. Shape was also tested in the same four regions. What the authors should have done is to correct the p-values computed for each of these tests by accounting for the total number of tests performed. Instead, (Bonferroni) corrections were performed separately for each type of analysis. For example, in the volume analysis p-values were required to be less than 0.0125 = 0.05/4. 
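The difference between correcting within each analysis and correcting across all of them is simple arithmetic, sketched below. The test counts come from the description above (123 voxel-cluster tests, four volume tests, four shape tests); treating shape as exactly four tests, and ignoring the separate drug-use measures, are simplifying assumptions. The function name `bonferroni_thresholds` is hypothetical.

```python
def bonferroni_thresholds(tests_per_analysis, alpha=0.05):
    """Return (per-analysis thresholds, single threshold over all tests)."""
    per_analysis = {name: alpha / n for name, n in tests_per_analysis.items()}
    total_tests = sum(tests_per_analysis.values())
    return per_analysis, alpha / total_tests

if __name__ == "__main__":
    # Counts taken from the text; "shape" = 4 is an assumption.
    tests = {"density": 123, "volume": 4, "shape": 4}
    per_analysis, overall = bonferroni_thresholds(tests)
    for name, threshold in per_analysis.items():
        print(f"{name}: p < {threshold:.6f}")
    print(f"all {sum(tests.values())} tests together: p < {overall:.6f}")
```

Under these assumptions, a single Bonferroni correction over all 131 tests would require p-values below roughly 0.00038, far stricter than the 0.0125 used for the volume analysis on its own.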
In other words, the extent of testing was not properly accounted for. Even so, many of the results were not significant. For example, the volume analysis showed no significant association for any of the four tested regions. The best case was the left nucleus accumbens (Figure 1C) with a corrected p-value of 0.015, which is over the authors’ own stated required threshold of 0.0125 (see caption). They use the language “The association with drug use, after correcting for 4 comparisons, was determined to be a trend toward significance” to describe this non-effect. It is worth noting that the removal of the outlier at a volume of over $800\ \mathrm{mm}^3$ would almost certainly flatten the line altogether and remove even the slight effect. It would have been nice to test this hypothesis but the authors did not release any of their data. Figure 1c. In the Fox News article about the paper, Breiter is quoted as follows: “For the NAC [nucleus accumbens], all three measures were abnormal, and they were abnormal in a dose-dependent way, meaning the changes were greater with the amount of marijuana used,” Breiter said. “The amygdala had abnormalities for shape and density, and only volume correlated with use. But if you looked at all three types of measures, it showed the relationships between them were quite abnormal in the marijuana users, compared to the normal controls.” The result above shows this to be a lie. Volume did not significantly correlate with use. This is all very bad, but things get uglier the more one looks at the paper. In the tables reporting the p-values, the authors do something I have never seen before in a published paper. They report the uncorrected p-values, indicating those that are significant (prior to correction) in boldface, and then put an asterisk next to those that are significant after their (incomplete) correction. I realize my own use of boldface is controversial… but what they are doing is truly insane. 
The fact that they put an asterisk next to the values significant after correction indicates they are aware that multiple testing correction is required. So why bother boldfacing p-values that they know are not significant? The overall effect is an impression that more tests are significant than is actually the case. See for yourself in their Table 4: Table 4. The fact that there are multiple columns is also problematic. Separate tests were performed for smoking occasions per day, joints per occasion, joints per week and smoking days per week. These measures are highly correlated, but even so multiply testing them requires multiple test correction. The authors simply didn’t perform it. They say “We did not correct for the number of drug use measures because these measures tend not be independent of each other”. In other words, they multiplied the number of tests by four, and chose to not worry about that. Unbelievable. Then there is Table 5, where the authors did not report the p-values at all, only whether they were significant or not… without correction: Table 5. ### 3. Correlation vs. causation This issue is one of the oldest in the book. There is even a Wikipedia entry about it: Correlation does not imply causation. Yet despite the fact that every result in the paper is directed at testing for association, in the last sentence of the abstract they say “These data suggest that marijuana exposure, even in young recreational users, is associated with exposure-dependent alterations of the neural matrix of core reward structures and is consistent with animal studies of changes in dendritic arborization.” At a minimum, such a result would require doing a longitudinal study. Breiter takes this language to an extreme in the press release accompanying the article. I repeat the statement he made that I quoted above where I boldface the causal claim: “Some of these people only used marijuana to get high once or twice a week. 
People think a little recreational use shouldn’t cause a problem, if someone is doing OK with work or school. Our data directly says this is not the case.” I believe that scientists should be sanctioned for making public statements that directly contradict the content of their papers, as appears to be the case here. There is precedent for this.

This is the third and final post in a series (part 1, part 2) of posts on two back-to-back papers published in Nature Biotechnology in August 2013: 1. Baruch Barzel & Albert-László Barabási, Network link prediction by global silencing of indirect correlations, Nature Biotechnology 31(8), 2013, p 720–725. doi:10.1038/nbt.2601 2. Soheil Feizi, Daniel Marbach, Muriel Médard & Manolis Kellis, Network deconvolution as a general method to distinguish direct dependencies in networks, Nature Biotechnology 31(8), 2013, p 726–733. doi:10.1038/nbt.2635 An inconvenient request One of the great things about conferences is that there is time to chat in person with distant friends and collaborators. Last July, at the ISMB conference in Berlin, I found myself doing just that during one of the coffee breaks. Suddenly, Manolis Kellis approached me and asked to talk in private. The reason for his interruption: he came to request that I remove an arXiv post of mine, namely “Comment on ‘Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions’”, a response to a paper by Ward and Kellis. Why? He pointed out that my arXiv post was ranking highly on Google. This was inconvenient for him, he explained, while insisting that it was wrong of me to post a criticism of his work on a forum where he could not directly respond. He suggested that it would be best to work out any issues I might have with his paper offline. Unfortunately, there was the inconvenient truth that arXiv postings cannot be removed. 
Unlike some journals, where, say, a supplement can be revised while having the original removed (see the figure switching of Feizi et al.), arXiv preprints are permanent. My initial confusion quickly turned to anger. After all, my arXiv comment had been rejected from Science where I had submitted it as a technical comment on the Ward-Kellis paper. I had then put it on the arXiv as a last resort measure to at least have some record of my concerns publicly accessible. How is this wrong? Can one not critique the work of Manolis Kellis? Network nonsense begins My first review of a Manolis Kellis paper was in the fall of 2006, in my capacity as a program committee member of the Research in Computational Molecular Biology (RECOMB) conference held in Oakland, CA in 2007. Because Oakland is right next to Berkeley, a number of Berkeley professors were involved in organizing and running the conference. Terry Speed was chair of the program committee. I was out of the country that year on sabbatical at the University of Oxford, so I could not participate, or even attend, the conference, but I volunteered to serve on the program committee. For those not familiar with the RECOMB review process, it is modeled after the standard Computer Science conferences. The program committee chair forms the program committee, who are then assigned a handful of papers to review and score. Reviews are entered on a website, and after a brief period of online discussion about borderline papers, scores are revised and accept/reject decisions are made. Authors can revise their manuscripts, but the reviewers never see the papers again before publication in the proceedings. One of the papers I was assigned to review was by a student named Joshua Grochow and his advisor Manolis Kellis. The paper was titled “Network Motif Discovery Using Subgraph Enumeration and Symmetry-Breaking“. 
Although networks were not my research focus at the time, and “symmetry-breaking” evoked in me nightmares from the physics underworld, I agreed to the review. The paper seemed to contain some interesting algorithms, appeared to have a combinatorial flavor, and promised potentially important applications: a good mix for RECOMB.

The problem addressed by Grochow & Kellis was that of identifying “network motifs” in biological networks. “Motifs” can be defined in a variety of ways, and the Grochow-Kellis objective was simple. In graph theoretic terms, given a graph G, the goal was to find subgraphs occurring with high multiplicity to an extent unlikely in a random graph. There are many models for random graphs, and the one that the results in Grochow-Kellis are based on is the Erdős-Rényi model (each edge chosen independently with some fixed probability). The reason this definition might be of biological interest is that recurrent motifs interspersed in a graph are likely to represent evolutionarily conserved interaction modules.

The paper begins with a description of the method. I won’t go into the details here, except to say that everything seemed well until I read the caption of Figure 3. There the number 27,720 caught my eye. During my first few years of graduate school I took many courses on enumerative and algebraic combinatorics. There are some numbers that combinatorialists just “know”. For example, seeing 42 emerge as the answer to a counting problem does not bring to mind Douglas Adams, but rather the vast literature on Catalan numbers and their connections to dozens of well-known counting problems. Similarly, a number such as 126 brings to mind binomial coefficients ($126={9 \choose 4}$), and with them the idea of counting the number of subsets of fixed size from a larger set. When I saw the number 27,720 I had a hunch that somehow some canonical combinatorial set had been enumerated.
The idea may have entered my mind because of the picture of the “motif” in which I immediately recognized a clique (all vertices mutually connected) and a stable set (no pair of vertices connected). In any case, I realized that $27,720 = 220 \cdot 126 = {12 \choose 3} \cdot {9 \choose 4}$.

The significance of this is that the “motif” on the left-hand side of Figure 3 had appeared many times because of a type of double- or rather thousandfold- counting. Instead of representing statistically significant recurring independent motifs, this “motif” arises because of a combinatorial artifact. In the specific example of Figure 3, the motif occurred once for any choice of 4 nodes from the clique of size 9, and any choice of 3 nodes from the stable set of size 12. The point is that in a graph, any subgraph attached to a large clique (or stable set) will occur many times. This simple observation is a result of the fact that there are many subgraphs of a clique (or stable set) that are identical.

I realized that this meant that the Grochow-Kellis method was essentially a heuristic for finding cliques and stable sets in graphs. The particular “network motifs” they were pulling out were just subgraphs that happened to be connected to some large cliques and stable sets. There are two problems with this: first, a clique or a stable set can hardly be considered an interesting “network motif”. Moreover, the fact that they appear in biological networks much more than in Erdős-Rényi random graphs is not surprising. Second, there is a large literature on finding cliques in graphs, none of which Grochow-Kellis cited or seemed to be familiar with.

The question of the performance of the Grochow-Kellis algorithm is answered in their Figure 3 as well. There is a slightly larger motif consisting of 6 nodes from the stable set of size 12, instead of 3.
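The counting behind these numbers is easy to check. The following is a minimal sketch, not code from the paper: the function name `motif_occurrences` is made up for illustration, and it assumes a motif copy is determined solely by which clique nodes and which stable-set nodes are chosen.

```python
from math import comb  # available in Python 3.8+


def motif_occurrences(clique_size, clique_pick, stable_size, stable_pick):
    """Count copies of a motif built from `clique_pick` nodes of a clique
    and `stable_pick` nodes of a stable set.

    Assumption (illustrative): every choice of nodes yields one copy,
    so the count is a product of two binomial coefficients.
    """
    return comb(clique_size, clique_pick) * comb(stable_size, stable_pick)


# Figure 3 motif: 4 nodes from the clique of size 9, 3 from the stable set of size 12.
figure3_count = motif_occurrences(9, 4, 12, 3)

# Same clique part, but 6 nodes from the stable set instead of 3.
larger_count = motif_occurrences(9, 4, 12, 6)

print(figure3_count)  # 27720
print(larger_count)   # 116424
```

Nothing about the network enters the calculation beyond the sizes of the clique and the stable set, which is exactly why the count is an artifact.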
That motif occurs in all ${12 \choose 6}$ subsets of the stable set instead of ${12 \choose 3}$ subsets which means that there is a motif that occurs 116,424 times! Grochow and Kellis’s algorithm did not even achieve its stated goal. It really ought to have output the left-hand side figure with six nodes in the stable set on the left, and not three. In other words, this was a paper providing uninteresting solutions from a biological point of view, and doing so poorly to boot.

I wrote up a detailed report on the paper, and posted it on the RECOMB review website together with poor scores reflecting my opinion that the paper had to be rejected. How could RECOMB, ostensibly the premier computer science conference on computational and algorithmic biology, publish a paper with neither a computational nor biological result? Not to mention an algorithm that demonstrably did not find the most frequently occurring motif.

As you might already guess, my rejection was subsequently overruled. I don’t know who made the final decision to accept the Grochow & Kellis paper to the RECOMB conference, although presumably it was the program committee chair. The decision jarred with my sense of scientific integrity. I had put considerable effort into reviewing the paper and understanding it, and I felt that I had provided a compelling objective argument for why the paper was fundamentally flawed: the fact that the results were trivial (and incorrect!) was not a subjective statement. At this point I need to point out that the RECOMB conference is quite difficult to get into. The acceptance rate for papers in 2007, consistent with other years, was 21.8%. I knew this meant that even a single very negative review, especially one with a compelling argument against the paper, almost certainly would lead to rejection of the paper. So I couldn’t understand then, nor do I still understand now, on what basis the decision was made to accept the paper.
This bothered me greatly, and after much deliberation I started boycotting the conference. Although I published five RECOMB papers from 2000 to 2006 and regularly attended the meeting during that time, the continued poor decisions and haphazard standards for papers selected have led me not to return in almost 8 years.

Grochow and Kellis obviously received my review and considered how to “deal with it”. They added a section titled “The role of combinatorial effects”, in which they explained the origins of the number 27,720 that they gleaned from my report, but then spun the bad news they had received as “resulting from combinatorial connectivity patterns prevalent in larger network structures.” They then added that “…this combinatorial clustering effect brings into question the current definition of network motif” and proposed that “additional statistics…might well be suited to identify larger meaningful networks.” This is a lot like someone claiming to discover a bacterium whose DNA is arsenic-based and upon being told by others that the “discovery” is incorrect – in fact, that very bacterium seeks out phosphorus – responding that this is “really helpful” and that it “raises lots of new interesting open questions” about how arsenate gets into cells. Chutzpah. When you discover your work is flawed, the correct response is to retract it.

I don’t think people read papers very carefully. Joshua Grochow went on to win the MIT Charles and Jennifer Johnson Outstanding M. Eng. Thesis Award for his RECOMB work on network motif discovery.

[Added February 18: Grochow and Kellis have posted a reply here].

The nature of man

I have to admit that after the Grochow-Kellis paper I was a bit skeptical of Kellis’ work. Not because of the paper itself (everyone makes mistakes), but because of the way he responded to my review. So a year and a half ago, when Manolis Kellis published a paper in an area I care about and am involved in, I may have had a negative prior.
The paper was Luke Ward and Manolis Kellis “Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions”, Science 337 (2012). Having been involved with the ENCODE pilot, where I contributed to the multiple alignment sub-project, I was curious what comparative genomics insights the full-scale 130 million dollar project revealed. The press releases accompanying the Ward-Kellis paper (e.g. The Nature of Man, The Economist) were suggesting that Ward and Kellis had figured out what makes a human a human; my curiosity was understandably piqued.

Ward and Kellis combined population genomic data from the 1000 Genomes Project with biochemical data from the ENCODE project to look for signatures of human constraint in regulatory elements. Their analysis was based on measuring three different proxies for constraint: SNP density, heterozygosity and derived allele frequency. To identify specific classes of regulatory regions under constraint, aggregated regions associated with specific gene ontology (GO) categories were tested for significance. Reading the paper I was amazed to discover they found precisely two categories: retinal cone cell development and nerve growth factor receptor signaling. It was only upon reading the supplement that I discovered that their tests had produced 53 other GO categories as well (Table S5).

Despite the fact that the listed categories were required to pass a false discovery rate (FDR) threshold for both the heterozygosity and derived allele frequency (DAF) measures, it was statistically invalid for them to highlight any specific GO category. FDR control merely guarantees a low false discovery rate among the entries in the entire list. Moreover, there was no obvious explanation for why categories such as chromatin binding (which had a smaller DAF than nerve growth) or protein binding (with the smallest p-value) appeared to be under purifying selection. As with the Feizi et al. paper, the supplement produced a story much less clean than the one presented in the main body of the paper. In fact, retinal cone cell development and nerve growth factor were 33 and 34 out of the 55 listed GO categories when sorted by the DAF p-value (42 and 54 when sorted by heterozygosity p-value). In other words, the story being sold in the paper was based on blatant statistically invalid cherry picking.
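To make the point about FDR concrete, here is a minimal sketch of one common FDR procedure, Benjamini-Hochberg. This is an assumption for illustration only: the procedure Ward and Kellis actually used is not specified here, and the p-values below are invented toy numbers, not values from Table S5.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses selected by Benjamini-Hochberg at level q.

    The guarantee concerns the selected list as a whole (the expected
    fraction of false discoveries in it), not any single entry.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank k whose p-value satisfies p_(k) <= k * q / m.
        if p_values[idx] <= rank * q / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Toy p-values (hypothetical, for illustration only).
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
selected = benjamini_hochberg(toy_p, q=0.05)
print(selected)  # [0, 1]
```

The output is a set of indices with a collective error guarantee. Nothing in the procedure licenses singling out particular members of that set as the ones that are real.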

The other result of the paper was an estimate that in addition to the 5% of the human genome conserved across mammalian genomes, at least another 4% has been subject to lineage-specific constraint. This result was based on adding up the estimates of constrained nucleotides from their Table S6 (using the derived allele frequency measure). These were calculated using a statistic that was computed as follows: for each one of ten bins determined according to estimated background selection strength, and for every feature F, the average DAF value $D_F$ was rescaled to

$PUC_F = \frac{(D_F - D_{CNDC})}{(D_{NCNE}-D_{CNDC})}$,

where $D_{CNDC}$ and $D_{NCNE}$ were the bin-specific average DAFs of conserved non-degenerate coding regions and non-conserved non-ENCODE regions respectively. One problem with the statistic is that the non-conserved regions contain nucleotides not conserved in all mammals, which is not the same as nucleotides not conserved in any mammals. The latter would have been needed in order to identify human specific constraint. Second, the statistic $PUC_F$ was used as a proxy for the proportion under constraint even though, as defined, it could be less than zero or greater than one. Indeed, in Table S6 there were four values among the confidence intervals for the estimated proportions using DAF that included values less than 0% or above 100%:

Ward and Kellis were therefore proposing that some features might have a negative number of nucleotides under constraint. Moreover, while it is possible that after further rescaling PUC might have correlated with the true proportion of nucleotides under constraint, there was no argument provided in the paper. Thus, while Ward and Kellis claimed to have estimated the proportion of nucleotides under constraint, they had only computed a statistic named “proportion under constraint”.
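A small numerical sketch shows why the rescaled statistic is not confined to the interval from 0 to 1. The function below simply evaluates the formula for $PUC_F$ given above. The DAF values are hypothetical, chosen only to illustrate the behavior, and are not taken from Table S6.

```python
def puc(d_feature, d_cndc, d_ncne):
    """Evaluate PUC_F = (D_F - D_CNDC) / (D_NCNE - D_CNDC) for one bin.

    d_feature: average DAF of the feature F
    d_cndc:    average DAF of conserved non-degenerate coding regions
    d_ncne:    average DAF of non-conserved non-ENCODE regions
    """
    return (d_feature - d_cndc) / (d_ncne - d_cndc)


# Hypothetical bin-specific reference values (illustrative only).
D_CNDC = 0.10
D_NCNE = 0.20

inside = puc(0.15, D_CNDC, D_NCNE)  # feature DAF between the two references
below = puc(0.08, D_CNDC, D_NCNE)   # feature DAF lower than D_CNDC
above = puc(0.23, D_CNDC, D_NCNE)   # feature DAF higher than D_NCNE

print(inside, below, above)
```

Whenever a feature's average DAF falls outside the range spanned by the two reference classes, the statistic leaves the unit interval, so nothing in its definition makes it a proportion.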

Nicolas Bray and I wrote up these points in a short technical comment and submitted it to the journal Science early in November 2012. The comment was summarily rejected with a curt reply by senior editor Laura Zahn stating that “relative to other Technical Comments we have recently received we feel that the scope and focus of your comment make it more suitable for the Online Comments facility at Science, rather than as a candidate for publication as a Technical Comment.” It is worth noting that Science did decide to publish another comment: Phil Green and Brent Ewing’s “Comment on ‘Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions’”, Science 10 (2013). Green and Ewing’s comment is biological in nature. Their concern is that “… the polymorphism trends are primarily attributable to mutational variation and technical artifacts rather than selection.” It’s fine that Science decided to host a debate on a biology question on its pages, but how can one debate the interpretation of results from a method, when the method is fundamentally flawed to begin with? After all, our problem with PUC was much deeper than a “technical flaw”.

We decided at the end to place the comment in the arXiv. After doing so, it became apparent that it had little impact. Indeed, I have never received any feedback about it from anyone. Apparently even this was too much for Manolis Kellis.

Methods matter

By the time I noticed the Feizi et al. paper in the journal Nature Biotechnology early in August 2013, my experiences reading Kellis’ papers had subtly altered the dynamic between myself and the printed word. Usually, when I read a paper and I don’t understand something, I assume the fault lies with me. I think most people are like this. But now, when the Feizi et al. paper started to not make sense, I didn’t presume the problem was with me. I tried hard to give the paper a fair reading, but after a few paragraphs the spell of the authors was already broken. And so it is that Nicolas Bray and I came to figure out what was really going on in Feizi et al., a project that eventually led us to also look at Barzel-Barabási.

Speaking frankly, it was difficult work to write the blog posts about these articles. In addition to the time it took, it was exhausting and exasperating to discover the flaws, fallacies and frauds. Both Nick and I prefer to do research. But we felt a responsibility to spell out in detail what had happened here. Manolis Kellis is not just any scientist. He has played, and continues to play, leading roles in major consortium projects such as mod-ENCODE and ENCODE, and he has served on numerous advisory committees for the NHGRI. He is a member of the GCAT (Genomics, Computational Biology and Technology) study section until 2018. That any person would swap out a key figure in a published paper without publishing a correction, and without informing the editor, is astonishing. That a person with great responsibility towards scientists is an abuser of science is unacceptable.

Manolis Kellis’ behavior is part of a systemic problem in computational biology. The cross-fertilization of ideas between mathematics, statistics, computer science and biology is both an opportunity and a danger. It is not hard to peddle incoherent math to biologists, many of whom are literally math phobic. For example, a number of responses I’ve received to the Feizi et al. blog post have started with comments such as

“I don’t have the expertise to judge the math, …”

Similarly, it isn’t hard to fool mathematicians into believing biological fables. Many mathematicians throughout the country were recently convinced by Jonathan Rothberg to donate samples of their DNA so that they might find out “what makes them a genius”. Such mathematicians, and their colleagues in computer science and statistics, take at face value statements such as “we have figured out what makes a human human”. In the midst of such confusion, it is easy for an enterprising “computational person” to take advantage of the situation, and Kellis has.

I believe the solution for this problem is for computational biologists to start taking themselves more seriously. Whether serving as reviewers for journals, as panel members for funding agencies, on hiring/tenure committees, or writing articles, all of us have to tone down the hype and pay closer attention to the science. There are many examples of what this means: a review of a math/stats containing paper cannot be a single paragraph long and based on a hunch, and similarly computational biologists shouldn’t claim, as have many of the authors of papers I’ve reviewed in these posts, pathways to cure disease and explanations for what makes humans human. Don’t fool the biologists. Don’t fool the computer scientists, statisticians, and mathematicians.

The possibilities for computational methods in biology are unlimited. The future is exciting, and there are possibilities for significant advances in areas ranging from molecular and evolutionary biology to medicine. But money, citations and fame cannot rule the day. The details of the #methodsmatter.