You are currently browsing the category archive for the ‘sophistry’ category.

In a previous post I wrote about How not to perform a differential expression analysis. In response to my post, Rob Patro, Geet Duggal, Michael I Love, Rafael Irizarry and Carl Kingsford wrote a detailed response. Below is my point-by-point rebuttal to their response (the figures and results in this blog post can be generated using the scripts in the Bits of DNA GitHub repository):

1. In Figure 1 of their response, Patro et al. show an MA plot and state that “if it were true that these methods are ‘very very’ similar one would see most log-ratios close to 0 (within the red lines).” This is true. Below is the MA plot for kallisto with default parameters and Salmon with the –gcBias flag:

96.6% of the points lie within the red lines. Since this constitutes most of the points, it seems reasonable to conclude that the methods are indeed very very similar. When both programs are run in default mode, as I did in my blog post, 98.9% of the points lie within the red lines. Thus, using the criterion of Patro et al., the programs have very very similar, or near identical, output. These numbers are conservative, computed by omitting transcripts where both kallisto and Salmon determine that a transcript has zero abundance.

2. Furthermore, Patro et al. explain that their MA plot in Figure 1 “demonstrate[s] how deceiving count scatter plots can be in this particular context.” There is, superficially, some merit to this claim. The MA plot above looks like a smudge of points and seems at odds with the fact that 96.6% of the points lie within the red lines. However the plot displays 198,457 points corresponding to 198,457 quantified transcripts, and as a result many points obfuscate each other. The alpha parameter in ggplot2 sets the opacity/transparency of points, and should be used in such a case to reveal the density of points (see, e.g. Supplementary Figure 19 of Love et al. 2016). Below is a plot of the exact same points with alpha=0.01:

An R animation that interpolates between the two MA plots above shows the same points, with varying opacity parameters (alpha=1 -> 0.01) and helps to demonstrate how deceiving MA plots can be in this particular context:

3. The Patro et al. response fails to distinguish between two different comparisons I made in my blog post: (1) comparisons of default kallisto to default Salmon, and (2) default kallisto to Salmon with the –gcBias option. Comparisons of the programs with default options is important because with those options their output is near identical, and, as I explain in my blog post, this is not some cosmic coincidence but a result of Salmon directly implementing the key ideas of pseudoalignment. The Patro et al. 2017 paper is also not just about GC bias correction, as the authors claim in their response, but rather it is also “the Salmon paper” a descriptor that Patro et al. use 24 times in their response. Furthermore, when Patro et al. are asked about how to run Salmon they recommend running it with default options (see e.g. the epilogue below or the way Patro  et al. run Salmon for analysis of the Bealieau-Jones-Greene described in #5) so that a comparison of the programs in default mode is of direct relevance to users.

In regards to the GC bias correction, Patro et al. 2017 claim in their abstract that “[GC bias correction] substantially improves the accuracy of abundance estimates and the sensitivity of subsequence differential expression analysis”. This is a general statement, not one about the sort of niche use-cases they describe in their response. The question then is whether Patro et al. provides support for this general statement and my argument has always been that it does not.

4. Patro et al. criticize my use of the ERR188140 sample to demonstrate how similar Salmon is to kallisto. They write that “the blog post author selected a single sample…”(boldface theirs) to claim that Salmon and kallisto produce output with “very very strong similarity (≃)” and raise the possibility that it was cherry picked, noting that “this particular sample has less GC-content bias” and marking it in a plot. I used ERR188140 because it was our sample of choice for many of the demonstrative analyses in the Bray et al. 2016 paper (see the kallisto paper analysis Github repository where the sample is mentioned since February 2016) and for that paper we had already generated the RSEM quantifications (and the alignments required for running the program), thus saving time in making the PCA analysis for my blogpost. ERR188140 was chosen for Bray et al. 2016 because it was the most deeply sequenced sample in the GEUVADIS dataset.

5. Contrary to the claim by Patro et al. in their response that I examined only one dataset, I also included in my post links with references to specific figures from four other papers that independently found that kallisto is near identical to Salmon. The fairest example for consideration is the additional analysis I mentioned of Beaulieu-Jones and Greene, and separately Patro, of the RNA-Seq dataset from Boj et al. 2015. With that analysis, there can be no claims of cherry-picking. The dataset was chosen by the authors of Beaulieu-Jones and Greene 2017, kallisto quantifications were produced by Beaulieu-Jones and Greene, and Salmon quantifications were prepared by Patro. Presumably the main author of the Salmon program ran Salmon with the best settings possible for the experiment. The fact that different individuals ran the programs is highlighted by the fact that they are not even based on identical annotations. They used different versions of RefSeq: Beaulieu-Jones and Greene quantified with 35,026 transcripts and Patro, who quantified later, used an annotation with 35,882 transcripts. There are eight samples in the analysis and MA plots, made by restricting the analysis to the transcripts in common, all look alike. As an example, the MA plot for SRR1654626  is:

The fraction of points within the red lines, calculated as before by omitting points at (0,0), is 98.6%.  The Patro analysis of Bealieau-Greene was performed on March 8, 2017 with version 0.8.1 of Salmon, well after the –gcBias option was implemented, the Salmon (version 3 preprint describing the GC correction) published, and the paper submitted. The dates are verifiable in the GitHub repository with the Salmon results.

6. In arguing that kallisto and Salmon are different Patro et al. provide an interesting formula for the correlation for two random variables X and X+Y where X and Y are independent but its use in this context is a sleight of hand. The formula, which is a simple exercise for the reader to derive from the definition of correlation, is

$cor(X,X+Y)=\sqrt{\frac{1}{1+Var(Y)/Var(X)}}$.

It follows by Taylor series expansion that this is approximately

$cor(X,X+Y) \approx 1-\frac{1}{2}\frac{Var(Y)}{Var(X)}$.

and if sd(X) is about 3.4 and sd(Y) about 0.5 (Patro et al.‘s numbers), then by inspection cor(X,X+Y) will be 0.99. In sample SRR1654626 shown above, when ignoring transcripts where both programs output 0, sd(X)=3.5 and sd(Y) = 0.43 which are fairly close to Patro et al.‘s numbers. However Patro et al. proceed with a non sequitur, writing that “this means that a substantial difference of 25% between reported counts is typical”. While the correlation formula makes no distributional assumptions, the 25% difference seems to be based on an assumption that Y is normally distributed. Specifically, if is normally distributed with mean 0 and standard deviation 0.5 then |Y| is half-normally distributed and a typical percent difference based on the median is

$(2^{0.5 \cdot \sqrt{2}\cdot \mbox{erf}^{-1}(0.5)}-1)\cdot 100 = 26.3\% \approx 25\%.$

However the differences between kallisto and Salmon quantifications are far from normally distributed. The plot below shows the distribution of the differences between log2 counts of kallisto and salmon (again excluding cases where both programs output 0):

The blue vertical line is positioned at the median, which is at 0.001433093. This means that the typical difference between reported counts is not 25% but rather 0.1%.

7.  In their response, Patro et al. highlight the recent Zhang et al. 2017 paper that benchmarked a number of RNA-Seq programs, including kallisto and Salmon. Patro et al. comment on a high correlation between a mode of Salmon that quantifies based with transcriptome alignments and RSEM. First, the correlations reported by Zhang et al. are Pearson correlations, and not Spearman correlations that I focused on in my blog post. Second, the alignment mode of Salmon has nothing to do with pseudoalignment, in that read alignments (in the case of Zhang et al. 2017 produced with STAR) are quantified directly, in a workflow the same as that of RSEM. Investigation of the similarities between alignment Salmon and RSEM that led to the high correlation is beyond the scope of this post. Finally, in discussing the similarities between programs the authors (Zhang et al.) write “Salmon, Sailfish and Kallisto, cluster tightly together with R 2  > 0.96.”

8. In regards to the EM algorithm, Patro et al. acknowledge that Salmon uses kallisto’s termination criteria and have updated their code to reflect this fact. I thank them for doing so, however this portion of their response is bizarre:

“What if Salmon executed more iterations of its offline phase and outperformed kallisto? Then its improvement could be attributed to the extra iterations instead of the different model, bias correction, or online phase. By using the same termination criteria for the offline phase of Salmon, we eliminate a confounding variable in the analysis.”

If Salmon could perform better by executing more iterations of the EM algorithm it should certainly do so. This is because parameters hard-wired in the code should be set in a way that provides users with the best possible performance.

9. At one point in their response Patro et al. write that “It is expected that Salmon, without the GC bias correction feature, will be similar to kallisto”, essentially conceding that default Salmon $\simeq$ default kallisto, a main point of my blog post. However Patro et al. continue to insist that Salmon with GC bias correction significantly improves on kallisto. Patro et al. have repeated a key experiment (the GEUVADIS based simulation) in their paper, replacing the t-test with a workflow they describe as “the pipeline suggested by the post’s author”. To be clear, this is the workflow preferred by Patro et al.:

As explained in my post on How not to perform a differential expression analysis the reason that Love et al. recommend a DESeq2 workflow instead of a t-test for differential expression is because of the importance of regularizing variance estimates. This is made clear by repeating Patro et al.‘s GEUVADIS experiment with a typical three replicates per condition instead of eight:

With the t-test of transcripts Salmon cannot even achieve an FDR of less than 0.05.

10. Patro et al. find that switching to their recommended workflow (i.e. replacing the t-test with their own DESeq2) alters the difference between kallisto and Salmon at an FDR of 0.01 from 353% to 32%. Patro et al. describe this difference, in boldface, as “The results remain similar to the original published results when run using the accuser’s suggested pipeline.”  Note that Patro et al. refer to a typical difference of 0.1% between counts generated by kallisto and Salmon as “not very very similar” (point #6) while insisting that 353% and 32% aresimilar.

11. The reanalysis of the GEUVADIS differential expression experiment by Patro et al. also fails to address one of the most important critiques in my blogpost, namely that a typical experimental design will not deliberately confound bias with conditionThe plot below shows the difference between kallisto and Salmon in a typical experiment (3 replicates in each condition) followed by Love et al.’s recommended workflow (tximport -> DESeq2):

There is no apparent difference between kallisto and Salmon. Note that the samples in this experiment have the same GC bias as in Patro et al. 2017, the only difference being that samples are chosen randomly in a way that they are not confounded by batch. The lack of any observed difference in results between default kallisto and Salmon with the –gcBias option are the same with an 8×8 analysis:

There is no apparent difference between kallisto and Salmon, even though the simulation includes the same GC bias levels as in Patro et al. 2017 (just not confounded with condition) .

12. It is interesting to compare the 8×8 unconfounded experiment with the 8×8 confounded experiment.

While Salmon does improve on kallisto (although as discussed in point #10 the improvement is not 353% but rather 32% at an FDR of 0.01), the improvement in accuracy when performing an unconfounded experiment highlights why confounded experiments should not be performed in the first place.

13. Patro et al. claim that despite best intentions, “confounding of technical artifacts such as GC dependence with the biological comparison of interest does occur” and cite Gilad and Mizrahi-Man 2015. However the message of the Gilad and Mizrahi-Man paper is not that we must do our best to analyze confounded experiments. Rather, it is that with confounded experiments one may learn nothing at all. What they say is “In summary, we believe that our reanalysis indicates that the conclusions of the Mouse ENCODE Consortium papers pertaining to the clustering of the comparative gene expression data are unwarranted.” In other words, confounding of batch effect with variables of interest can render experiments worthless.

14. In response to my claim that GC bias has been reduced during the past 5 years, Patro et al. state:

A more informed assumption is that GC bias in sequencing data originates with PCR amplification and depends on thermocycler ramp speed (see, for example, Aird (2011) or t’ Hoen (2013)), and not from sequencing machines or reverse transcription protocols which may have improved in the past 5 years.

This statement is curious in that it seems to assume that, unlike sequencing machines or reverse transcription protocols, PCR amplification and thermocycler technology could not have improved in the past 5 years. As an example to the contrary, consider that just months before the publication of the GEUVADIS data, New England Biolabs released a new polymerase which claimed to address this very issue. GC bias is a ubiquitous issue in molecular biology and of course there are ongoing efforts to address it in the wet lab. Furthermore, continued research and benchmarking aimed at reducing GC bias, (see e.g. Thorner et al. 2014have led to marked improvements in library quality and standardization of experiments across labs. Anyone who performs bulk RNA-Seq, as we do in my lab, knows that RNA-Seq is no longer an ad hoc experiment.

15. Patro et al. write that

The point of the simulation was to demonstrate that, while modeling fragment sequence bias reduces gross mis-estimation (false reports of isoform switching across labs in real data — see for example Salmon Supplementary Figure 5 showing GEUVADIS data), the bias modeling does not lead to overall loss of signal. Consider that one could reduce false positives simply by attenuating signal or adding noise to all transcript abundances.

However none of the simulations or results in Patro et al. 2017 address the question of whether bias modeling leads to overall loss of signal. To answer it would require examining the true and false positives in a comparison of default Salmon and Salmon with –gcBias. Not only did Patro et al. not do the relevant intra-program comparisons, they did inter-program comparisons instead which clearly bear no relevance to the point they now claim they were making.

16. I want to make very clear that I believe that GC bias correction during RNA-Seq quantification is valuable and I agree with Patro et al. that it can be important for meta-analyses, especially of the kind that take place by large genome consortia. One of the interesting results in Patro et al. is the SEQC analysis (Supplementary Figure 4) which shows that that Salmon is more consistent in intra-center quantification in one sample (HBRR). However in a second experiment (UHRR) the programs are near identical in their quantification differences within and between centers and based on the results shown above I don’t believe that Patro et al. 2017 achieves its stated aim of showing that GC correction has an effect on typical differential analyses experiments that utilize typical downstream analyses.

17. I showed the results of running kallisto in default mode and Salmon with GC bias correction on a well-studied dataset from Trapnell et al. 2013. Patro et al. claim they were unable to reproduce my results, but that is because they performed a transcript level analysis despite the fact that I made it very clear in my post that I performed a gene level analysis. I chose to show results at the gene level to draw a contrast with Figure 3c of Trapnell et al. 2013. The results of Patro et al. at the transcript level show that even then the extent of overlap is remarkable. These results are consistent with the simulation results (see point #11).

18. The Salmon authors double down on their runtime analysis by claiming that “The running time discussion presented in the Salmon paper is accurate.” This is difficult to reconcile with two facts

(a) According to Patro et al.’s rebuttal “kallisto is faster when using a small number of threads” yet this was not presented in Patro et al. 2017.

(b) According to Patro et al. (see, e.g., the Salmon program GitHub), when running kallisto or Salmon with 30 threads what is being benchmarked is disk I/O and not the runtime of the programs.

If Patro et al. agree that to benchmark the speed of a program one must use a small number of threads, and Patro et al. agree that with a small number of threads kallisto is faster, then the only possible conclusion is that the running time discussion presented in the Salmon paper is not accurate.

19. The Patro et al. response has an entire section (3.1) devoted to explaining why quasimapping (used by Salmon) is distinct from pseudoalignment (introduced in the kallisto paper). Patro et al. describe quasimapping as “a different algorithm, different data structure, and computes different results.” Furthermore, in a blog post, Patro explained that RapMap (on which Salmon is based) implements both quasimapping and pseudoalignment, and that these are distinct concepts. He writes specifically that in contrast to the first algorithm provided by RapMap (pseudoalignment), “the second algorithm provided by RapMap — quasi-mapping — is a novel one”.

One of the reviewers of the Salmon paper recently published his review, which begins with the sentence “The authors present salmon, a new RNAseq quantification tool that uses pseudoalignment…” This directly contradicts the assertion of Patro et al. that quasimapping is “a different algorithm, different data structure” or that quasimapping is novel. In my blog post I provided a detailed walk-through that affirms that the reviewer is right. I showed how the quasimapping underlying Salmon is literally acting in identical ways on the k-mers in reads. Moreover, the results above show that Salmon, using quasimapping, does not “compute different results”. Unsurprisingly, its output is near identical to kallisto.

20. Patro et al. write that “The title of Sailfish paper contains the words ‘alignment-free’, which indicates that it was Sailfish that first presented the key idea of abandoning alignment. The term alignment-free has a long history in genomics and is used to describe methods in which the information inherent in a complete read is discarded in favor of the direct use of it’s substrings. Sailfish is indeed an alignment-free method because it shreds reads into constituent k-mers, and those are then operated on without regard to which read they originated from. The paper is aptly titled. The concept of pseudoalignment is distinct in that complete reads are associated to targets, even if base-pair alignments are not described.

21. Patro et al. write that “Salmon, including many of its main ideas, was widely known in the field prior to the kallisto preprint.” and mention that Zhang et al. 2015 included a brief description of Salmon. Zhang et al. 2015 was published on June 5, 2015, a month after the kallisto preprint was published, and its description of Salmon, though brief, was the first available for the program. Nowhere else, prior to the Zhang et al. publication, was there any description of what Salmon does or how it works, even at a high-level.

Notably, the paragraph on Salmon of Zhang et al. shows that Salmon, in its initial form, had nothing to do with pseudoalignment:

“Salmon is based on a novel lightweight alignment model that uses chains of maximal exact matches between sequencing fragments and reference transcripts to determine the potential origin of RNA‐seq reads.”

This is consistent with the PCA plot of my blog post which shows that initial versions of Salmon were very different from kallisto, and that Salmon $\simeq$ kallisto only after Salmon switched to the use of pseudoalignment.

22. My blogpost elicited an intense discussion in the comments and on social media of whether Patro et al. adequately attributed key ideas of Salmon to kallisto. Patro et al. They did not.

Patro et al. reference numerous citations to kallisto in Patro et al. 2017 which I’ve reproduced below

Only two of these references attribute any aspect of Salmon to kallisto. One of them, the Salmon bootstrap, is described as “inspired by kallisto” (in fact it is identical to that of kallisto). There is only one citation in Salmon to the key idea that has made it near identical to kallisto, namely the use of pseudoalignment, and that is to the RapMap paper from the Patro group (Srivastava et al. 2016).

Despite boasting of a commitment to open source principles and embracing preprints, Patro et al. conveniently ignore the RapMap preprints (Srivastava et al. 2015). Despite many mentions of kallisto, none of the four versions of the preprint acknowledge the direct use of the ideas in Bray et al. 2016 in any way, shape or form. The intent of Srivastava et al. is very clear. In the journal version the authors still do not acknowledge that “quasi mapping” is just pseudoalignment implemented with a suffix array, instead using words such as “inspired” and “motivated” to obfuscate the truth. Wording matters.

Epilogue

Discussion of the Zhang et al. 2017 paper by Patro et al., along with a tweet by Lappalainen about programs not giving identical results lead me to look more deeply into the Zhang et al. 2017 paper.

The exploration turned out to be interesting. On the one hand, some figures in Zhang et al. 2017 contradict Lappalainen’s claim that “none of the methods seem to give identical results…”.  For example, Figure S4 from the paper shows quantifications for four genes where kallisto and Salmon produce near identical results.

On the other hand, Figure 7 from the paper is an example from a simulation on a single gene where kallisto performed very differently from Salmon:

I contacted the authors to find out how they ran kallisto and Salmon. It turns out that for all the results in the paper with the exception of Figure 7, the programs were run as follows:

·       kallisto quant -i $KAL_INDEX –fr-stranded –plaintext$DATADIR/${f}_1.fq$DATADIR/${f}_2.fq -t 8 -o ./kallisto/ · salmon quant -i$SALMON_INDEX -l ISF -1 $DATADIR/${f}_1.fq -2 $DATADIR/${f}_2.fq -p 8 -o salmon_em  –incompatPrior 0

We then exchanged some further emails, after which they sent the data (reads) for the figure, we ran kallisto on our end and found discordant results with what was reported in the paper, they re-ran kallisto on their end, and after these exchanges we converged to an updated (and corrected) figure which shows Sailfish $\simeq$ kallisto but not Salmon $\simeq$ kallisto. The updated figure, shown below, was made by Zhang et al. using the default mode of kallisto version 0.43.1:

Note that kallisto is near identical in performance to Sailfish, which I explained in my blog post about Salmon has also converged to kallisto. However Salmon is different.

It turned out that for this one figure, Salmon was run with a non-standard set of options, specifically with the additional option –numPreAuxModelSamples 0 (although notably Patro did not recommend using the –gcBias option). The recommendation to run with this option was made by Patro to Zhang et al. after they contacted him early in January 2017 to ask for the best way to run Salmon for the experiment. What the flag does is turn off the online phase of Salmon (hence the 0 in –numPreAuxModelSamples 0) that is used to initially estimate the fragment length distribution. There is a good rationale for using the flag, namely the very small number of reads in the simulation makes it impossible to accurately learn auxiliary parameters as one might with a full dataset. However on January 13th, Patro changed the behavior of the option in a way that allowed Salmon to optimize for the specific experiment at hand. The default fragment length distribution in Salmon had been set the same as that in Cufflinks (mean 200, standard deviation 80). These settings match typical experimental data, and were chosen by Cole Trapnell and myself after examining numerous biological datasets. Setting –numPreAuxModelSamples to 0 forced Salmon to use those parameters. However on January 13th Patro changed the defaults in Salmon to mean = 250 and standard deviation = 25The numbers 250 and 25 are precisely the defaults for the polyester program that simulates reads. Polyester (with default parameters) is what Zhang et al. 2017 used to simulate reads for Figure 7.

Zhang et al. also contacted me on January 9th and I did not reply to their email. I had just moved institutions (from UC Berkeley to Caltech) on January 1st, and did not have the time to investigate in detail the issues they raised. I thank them for being forthcoming and helpful in reviewing Figure 7 post publication.

Returning to Lappalainen’s comment, it is true that Salmon results are different from kallisto in Figure 7 and one reason may be that Patro hard wired parameters for a flag that was used to match the parameters of the simulation. With the exception of that figure, throughout the paper Salmon $\simeq$ kallisto, providing yet another example of an independent publication confirming the claims of my blog post.

[September 2, 2017: A response to this post has been posted by the authors of Patro et al. 2017, and I have replied to them with a rebuttal]

Spot the difference

One of the maxims of computational biology is that “no two programs ever give the same result.” This is perhaps not so surprising; after all, most journals seek papers that report a significant improvement to an existing method. As a result, when developing new methods, computational biologists ensure that the results of their tools are different, specifically better (by some metric), than those of previous methods. The maxim certainly holds for RNA-Seq tools. For example, the large symmetric differences displayed in the Venn diagram below (from Zhang et al. 2014) are typical for differential expression tool benchmarks:

In a comparison of RNA-Seq quantification methods, Hayer et al. 2015 showed that methods differ even at the level of summary statistics (in Figure 7 from the paper, shown below, Pearson correlation was calculated using ground truth from a simulation):

These sort of of results are the norm in computational genomics. Finding a pair of software programs that produce identical results is about as likely as finding someone who has won the lottery… twice…. in one week. Well, it turns out there has been such a person, and here I describe the computational genomics analog of that unlikely event. Below are a pair of plots made using two different RNA-Seq quantification programs:

The two volcano plots show the log-fold change in abundance estimated for samples sequenced by Boj et al. 2015, plotted against p-values obtained with the program limma-voom. I repeat: the plots were made with quantifications from two different RNA-Seq programs. Details are described in the next section, but before reading it first try playing spot the difference.

The reveal

The top plot is reproduced from Supplementary Figure 6 in Beaulieu-Jones and Greene, 2017. The quantification program used in that paper was kallisto, an RNA-Seq quantification program based on pseudoalignment that was published in

The bottom plot was made using the quantification program Salmon, and is reproduced from a GitHub repository belonging to the lead author of

Patro et al. 2017 claim that “[Salmon] achieves the same order-of-magnitude benefits in speed as kallisto and Sailfish but with greater accuracy”, however after being unable to spot any differences myself in the volcano plots shown above, I decided, with mixed feelings of amusement and annoyance, to check for myself whether the similarity between the programs was some sort of fluke. Or maybe I’d overlooked something obvious, e.g. the fact that programs may tend to give more similar results at the gene level than at the transcript level. Thus began this blog post.

In the figure below, made by quantifying RNA-Seq sample ERR188140 with the latest versions of the two programs, each point is a transcript and its coordinates are the estimated counts produced by kallisto and salmon respectively.

Strikingly, the Pearson correlation coefficient is 0.9996026. However astute readers will recognize a possible sleight of hand on my part. The correlation may be inflated by similar results for the very abundant transcripts, and the plot hides thousands of points in the lower left-hand corner. RNA-Seq analyses are notorious for such plots that appear sounds but can be misleading. However in this case I’m not hiding anything. The Pearson correlation computed with $log(counts+1)$ is still extremely high (0.9955965) and the Spearman correlation, which gives equal balance to transcripts irrespective of the magnitude of their counts is 0.991206. My observation is confirmed in Table 3 of Sarkar et al. 2017 (note that in this table “quasi-mapping” corresponds to Salmon):

For context, the Spearman correlation between kallisto and a truly different RNA-Seq quantification program, RSEM, is 0.8944941. At this point I have to say… I’ve been doing computational biology for more than 20 years and I have never seen a situation where two ostensibly different programs output such similar results.

Patro and I are not alone in finding that Salmon $\simeq$ kallisto (if kallisto and Salmon gave identical results I would write that Salmon = kallisto but in lieu of the missing 0.004 in correlation I use the symbol $\, \simeq \,$ to denote the very very strong similarity). Examples in the literature abound, e.g. Supplementary Figure 5 from Majoros et al. 2017 (shown later in the post), Figure 1 from Everaert et al. 2017

or Figure 3A from Jin et al. 2017:

Just a few weeks ago, Sahraeian et al. 2017 published a comprehensive analysis of 39 RNA-Seq analysis tools and performed hierarchical clusterings of methods according to the similarity of their output. Here is one example (their Supplementary Figure 24a):

Amazingly, kallisto and Salmon-Quasi (the latest version of Salmon) are the two closest programs to each other in the entire comparison, producing output even more similar than the same program, e.g. Cufflinks or StringTie run with different alignments!

This raises the question of how, with kallisto published in May 2016 and Salmon $\simeq$ kallisto, Patro et al. 2017 was published in one of the most respected scientific publications that advertises first and foremost that it “is a forum for the publication of novel methods and significant improvements to tried-and-tested basic research techniques in the life sciences.” ?

How not to perform a differential expression analysis

The Patro et al. 2017 paper presents a number of comparisons between kallisto and Salmon in which Salmon appears to dramatically improve on the performance of kallisto. For example Figure 1c from Patro et al. 2017 is a table showing an enormous performance difference between kallisto and Salmon:

Figure 1c from Patro et al. 2017.

At a false discovery rate of 0.01, the authors claim that in a simulation study where ground truth is known Salmon identifies 4.5 times more truly differential transcripts than kallisto!

This can explain how Salmon was published, namely the reviewers and editor believed Patro et al.’s claims that Salmon significantly improves on previous work. In one analysis Patro et al. provide a p-value to help the “significance” stick. They write that “we found that Salmon’s distribution of mean absolute relative differences was significantly smaller (Mann-Whitney U test, P=0.00017) than those of kallisto. But how can the result Salmon >> kallisto, be reconciled with the fact that everybody repeatedly finds that Salmon $\simeq$ kallisto?

A closer look reveals three things:

1. In a differential expression analysis billed as “a typical downstream analysis” Patro et al. did not examine differential expression results for a typical biological experiment with a handful of replicates. Instead they examined a simulation of two conditions with eight replicates in each.
2. The large number of replicates allowed them to apply the log-ratio t-test directly to abundance estimates based on transcript per million (TPM) units, rather than estimated counts which are required for methods such as their own DESeq2.
3. The simulation involved generation of GC bias in an approach compatible with the inference model, with one batch of eight samples exhibiting “weak GC content dependence” while the other batch of eight exhibiting “more severe fragment-level GC bias.” Salmon was run in a GC bias correction mode.

These were unusual choices by Patro et al. What they did was allow Patro et al. to showcase the performance of their method in a way that leveraged the match between one of their inference models and the procedure for simulating the reads. The showcasing was enabled by having a confounding variable (bias) that exactly matches their condition variable, the use of TPM units to magnify the impact of that effect on their inference, simulation with a large number of replicates to enable the use of TPM,  which was possible because with many replicates one could directly apply the log t-test. This complex chain of dependencies is unraveled below:

There is a reason why log-fold changes are not directly tested in standard RNA-Seq differential expression analyses. Variance estimation is challenging with few replicates and RNA-Seq methods developers understood this early on. That is why all competitive methods for differential expression analysis such as DESeq/DESeq2, edgeR, limma-voom, Cuffdiff, BitSeq, sleuth, etc. regularize variance estimates (i.e., perform shrinkage) by sharing information across transcripts/genes of similar abundance. In a careful benchmarking of differential expression tools, Shurch et al. 2016 show that log-ratio t-test is the worst method. See, e.g., their Figure 2:

Figure 2 from Schurch et al. 2016. The four vertical panels show FPR and TPR for programs using 3,6,12 and 20 biological replicates (in yeast). Details are in the Schurch et al. 2016 paper.

The log-ratio t-test performs poorly not only when the number of replicates is small and regularization of variance estimates is essential. Schurch et al. specifically recommend DESeq2 (or edgeR) when up to 12 replicates are performed. In fact, the log-ratio t-test was so bad that it didn’t even make it into their Table 2 “summary of recommendations”.

The authors of Patro et al. 2017 are certainly well-aware of the poor performance of the log-ratio t-test. After all, one of them was specifically thanked in the Schurch et al. 2016 paper “for his assistance in identifying and correcting a bug”. Moreover, the recommended program by Schurch etal. (DESeq2) was authored by one of the coauthors on the Patro et al. paper, who regularly and publicly advocates for the use of his programs (and not the log-ratio t-test):

This recommendation has been codified in a detailed RNA-Seq tutorial where M. Love et al. write that “This [Salmon + tximport] is our current recommended pipeline for users”.

In Soneson and Delorenzi, 2013, the authors wrote that “there is no general consensus regarding which [RNA-Seq differential expression] method performs best in a given situation” and despite the development of many methods and benchmarks since this influential review, the question of how to perform differential expression analysis continues to be debated. While it’s true that “best practices” are difficult to agree on, one thing I hope everyone can agree on is that in a “typical downstream analysis” with a handful of replicates

do not perform differential expression with a log-ratio t-test.

Turning to Patro et al.‘s choice of units, it is important to note that the requirement of shrinkage for RNA-Seq differential analysis is the reason most differential expression tools require abundances measured in counts as input, and do not use length normalized units such as Transcripts Per Million (TPM). In TPM units the abundance $\rho_t$ for a transcript t is $\rho_t \propto \frac{c_t}{N \cdot l_t}$ where $c_t$ are the estimated counts for transcript t, $l_t$ is the (effective) length of t and N the number of total reads. Whereas counts are approximately Poisson distributed (albeit with some over-dispersion), variance estimates of abundances in TPM units depend on the lengths used in normalization and therefore cannot be used directly for regularization of variance estimation. Furthermore, the dependency of TPM on effective lengths means that abundances reported in TPM are very sensitive to the estimates of effective length.

This is why, when comparing the quantification accuracy of different programs, it is important to compare abundances using estimated counts. This was highlighted in Bray et al. 2016: “Estimated counts were used rather than transcripts per million (TPM) because the latter is based on both the assignment of ambiguous reads and the estimation of effective lengths of transcripts, so a program might be penalized for having a differing notion of effective length despite accurately assigning reads.” Yet Patro et al. perform no comparisons of programs in terms of estimated counts.

A typical analysis

The choices of Patro et al. in designing their benchmarks are demystified when one examines what would have happened had they compared Salmon to kallisto on typical data with standard downstream differential analysis tools such as their own tximport and DESeq2. I took the definition of “typical” from one of the Patro et al. coauthors’ own papers (Soneson et al. 2016): “Currently, one of the most common approaches is to define a set of non-overlapping targets (typically, genes) and use the number of reads overlapping a target as a measure of its abundance, or expression level.”

The Venn diagram below shows the differences in transcripts detected as differentially expressed when kallisto and Salmon are compared using the workflow the authors recommend publicly (quantifications -> tximport -> DESeq2) on a typical biological dataset with three replicates in each of two conditions. The number of overlapping genes is shown for a false discovery rate of 0.05 on RNA-Seq data from Trapnell et al. 2014:

A Venn diagram showing the overlap in genes predicted to be differential expressed by kallisto (blue) and Salmon (pink). Differential expression was performed with DESeq2 using transcript-level counts estimated by kallisto and Salmon and imported to DESeq2 with tximport. Salmon was run with GC bias correction.

This example provides Salmon the benefit of the doubt- the dataset was chosen to be older (when bias was more prevalent) and Salmon was not run in default mode but rather with GC bias correction turned on (option –gcBias).

When I saw these numbers for the first time I gasped. Of course I shouldn’t have been surprised; they are consistent with repeated published experiments in which comparisons of kallisto and Salmon have revealed near identical results. And while I think it’s valuable to publish confirmation of previous work, I did wonder whether Nature Methods would have accepted the Patro et al. paper had the authors conducted an actual “typical downstream analysis”.

Patro et al. utilized TPM based comparisons for all the results in their paper, presumably to highlight the improvements in accuracy resulting from better effective length estimates. Numerous results in the paper suggest that Salmon is much more accurate than kallisto. However I had seen a figure in Majoros et al. 2017 that examined the (cumulative) distribution of both kallisto and Salmon abundances in TPM units (their Supplementary Figure 5) in which the curves literally overlapped at almost all thresholds:

The plot above was made with Salmon v0.7.2 so in fairness to Patro et al. I remade it using the ERR188140 dataset mentioned above with Salmon v0.8.2:

The distribution of abundances (in TPM units) as estimated by kallisto (blue circles) and Salmon (red stars).

The blue circles correspond to kallisto and the red stars inside to Salmon. With the latest version of Salmon the similarity is even higher than what Majoros et al. observed! The Spearman correlation between kallisto and Salmon with TPM units is 0.9899896.

It’s interesting to examine what this means for a (truly) typical TPM analysis. One way that TPMs are used is to filter transcripts (or genes) by some threshold, typically TPM >  1 (in another deviation from “typical”, a key table in Patro et al. 2017 – Figure 1d – is made by thresholding with TPM > 0.1). The Venn diagram below shows the differences between the programs at the typical TPM > 1  threshold:

A Venn diagram showing the overlap in transcripts predicted by kallisto and Salmon to have estimated abundance > 1 TPM.

The figures above were made with Salmon 0.8.2 run in default mode. The correlation between kallisto and Salmon (in TPM) units decreases a tiny amount, from 0.9989224 to 0.9974325 with the –gcBias option and even the Spearman correlation decreases by only 0.011 from 0.9899896 to 0.9786092.

I think it’s perfectly fine for authors to present their work in the best light possible. What is not ok is to deliberately hide important and relevant truth, which in this case is that Salmon $\, \simeq \,$ kallisto.

A note on speed

One of the claims in Patro et al. 2017 is that “[the speed of Salmon] roughly matches the speed of the recently introduced kallisto.” The Salmon claim is based on a benchmark of an experiment (details unknown) with 600 million 75bp paired-end reads using 30 threads. Below are the results of a similar benchmark of Salmon showing time to process 19 samples from Boj et al. 2015 with variable numbers of threads:

First, Salmon with –gcBias is considerably slower than default Salmon. Furthermore, there is a rapid decrease in performance gain with increasing number of threads, something that should come as no surprise. It is well known that quantification can be I/O bound which means that at some point, extra threads don’t provide any gain as the disk starts grinding limiting access from the CPUs. So why did Patro et al. choose to benchmark runtime with 30 threads?

The figure below provides a possible answer:

In other words, not only is Salmon $\simeq$ kallisto in accuracy, but contrary to the claims in Patro et al. 2017, kallisto is faster. This result is confirmed in Table 1 of Sarkar et al. 2017 who find that Salmon is slower by roughly the same factor as seen above (in the table “quasi-mapping” is Salmon).

Having said that, the speed differences between kallisto and Salmon should not matter much in practice and large scale projects made possible with kallisto (e.g. Vivian et al. 2017) are possible with Salmon as well. Why then did the authors not report their running time benchmarks honestly?

The first common notion

The Patro et al. 2017 paper uses the term “quasi-mapping” to describe an algorithm, published in Srivastava et al. 2016, for obtaining their (what turned out to be near identical to kallisto) results. I have written previously how “quasi-mapping” is the same as pseudoalignment as an alignment concept, even though Srivastava et al. 2016 initially implemented pseudoalignment differently than the way we described it originally in our preprint in Bray et al. 2015. However the reviewers of Patro et al. 2017 may be forgiven for assuming that “quasi-mapping” is a technical advance over pseudoalignment. The Srivastava et al. paper is dense and filled with complex technical detail. Even for an expert in alignment/RNA-Seq it is not easy to see from a superficial reading of the paper that “quasi-mapping” is an equivalent concept to kallisto’s pseudoalignment (albeit implemented with suffix arrays instead of a de Bruijn graph). Nevertheless, the key to the paper is a simple sentence: “Specifically, the algorithm [RapMap, which is now used in Salmon] reports the intersection of transcripts appearing in all hits” in the section 2.1 of the paper. That’s the essence of pseudoalignment right there. The paper acknowledges as much, “This lightweight consensus mechanism is inspired by Kallisto ( Bray et al. , 2016 ), though certain differences exist”. Well, as shown above, those differences appear to have made no difference in standard practice, except insofar as the Salmon implementation of pseudoalignment being slower than the one in Bray et al. 2016.

Srivastava et al. 2016 and Patro et al. 2017 make a fuss about the fact that their “quasi-mappings” take into account the starting positions of reads in transcripts, thereby including more information than a “pure” pseudoalignment. This is a pedantic distinction Patro et al. are trying to create. Already in the kallisto preprint (May 11, 2015),  it was made clear that this information was trivially accessible via a reasonable approach to pseudoalignment: “Once the graph and contigs have been constructed, kallisto stores a hash table mapping each k-mer to the contig it is contained in, along with the position within the contig.”

In other words, Salmon is not producing near identical results to kallisto due to an unprecedented cosmic coincidence. The underlying method is the same. I leave it to the reader to apply Euclid’s first common notion:

Things which equal the same thing are also equal to each other.

Convergence

While Salmon is now producing almost identical output to kallisto and is based on the same principles and methods, this was not the case when the program was first released. The history of the Salmon program is accessible via the GitHub repository, which recorded changes to the code, and also via the bioRxiv preprint server where the authors published three versions of the Salmon preprint prior to its publication in Nature Methods.

The first preprint was published on the BioRxiv on June 27, 2015. It followed shortly on the heels of the kallisto preprint which was published on May 11, 2015. However the first Salmon preprint described a program very different from kallisto. Instead of pseudoalignment, Salmon relied on chaining SMEMs (super-maximal exact matches) between reads and transcripts to identifying what the authors called “approximately consistent co-linear chains” as proxies for alignments of reads to transcripts. The authors then compared Salmon to kallisto writing that “We also compare with the recently released method of Kallisto which employs an idea similar in some respects to (but significantly different than) our lightweight-alignment algorithm and again find that Salmon tends to produce more accurate estimates in general, and in particular is better able [to] estimate abundances for multi-isoform genes.” In other words, in 2015 Patro et al. claimed that Salmon was “better” than kallisto. If so, why did the authors of Salmon later change the underlying method of their program to pseudoalignment from SMEM alignment?

Inspired by temporal ordering analysis of expression data and single-cell pseudotime analysis, I ran all the versions of kallisto and Salmon on ERR188140, and performed PCA on the resulting transcript abundance table to be able to visualize the progression of the programs over time. The figure below shows all the points with the exception of three: Sailfish 0.6.3, kallisto 0.42.0 and Salmon 0.32.0. I removed Sailfish 0.6.3 because it is such an outlier that it caused all the remaining points to cluster together on one side of the plot (the figure is below in the next section). In fairness I also removed one Salmon point (version 0.32.0) because it differed substantially from version 0.4.0 that was released a few weeks after 0.32.0 and fixed some bugs. Similarly, I removed kallisto 0.42.0, the first release of kallisto which had some bugs that were fixed 6 days later in version 0.42.1.

Evidently kallisto output has changed little since May 12, 2015. Although some small bugs were fixed and features added, the quantifications have been very similar. The quantifications have been stable because the algorithm has been the same.

On the other hand the Salmon trajectory shows a steady convergence towards kallisto. The result everyone is finding, namely that currently Salmon $\simeq$ kallisto is revealed by the clustering of recent versions of Salmon near kallisto. However the first releases of Salmon are very different from kallisto. This is also clear from the heatmap/hierarchical clustering of  Sahraeian et al. in which Salmon-SMEM was included (Salmon used SMEMs until version 0.5.1, sometimes labeled fmd, until “quasi-mapping” became the default). A question: if Salmon ca. 2015 was truly better than kallisto then is Salmon ca. 2017 worse than Salmon ca. 2015?

Convergence of Salmon and Sailfish to kallisto over the course of a year. The x-axis labels the time different versions of each program were released. The y-axis is PC1 from a PCA of transcript abundances of the programs.

Prestamping

The bioRxiv preprint server provides a feature by which a preprint can be linked to its final form in a journal. This feature is useful to readers of the bioRxiv, as final published papers are generally improved after preprint reader, reviewer, and editor comments have been addressed. Journal linking is also a mechanism for authors to time stamp their published work using the bioRxiv. However I’m sure the bioRxiv founders did not intend the linking feature to be abused as a “prestamping” mechanism, i.e. a way for authors to ex post facto receive a priority date for a published paper that had very little, or nothing, in common with the original preprint.

A comparison of the June 2015 preprint mentioning the Salmon program and the current Patro et al. paper reveals almost nothing in common. The initial method changed drastically in tandem with an update to the preprint on October 3, 2015 at which point the Salmon program was using “quasi mapping”, later published in Srivastava et al. 2016. Last year I met with Carl Kingsford (co-corresponding author of Patro et al. 2017) to discuss my concern that Salmon was changing from a method distinct from that of kallisto (SMEMs of May 2015) to one that was replicating all the innovations in kallisto, without properly disclosing that it was essentially a clone. Yet despite a promise that he would raise my concerns with the Salmon team, I never received a response.

At this point, the Salmon core algorithms have changed completely, the software program has changed completely, and the benchmarking has changed completely. The Salmon project of 2015 and the Salmon project of 2017 are two very different projects although the name of the program is the same. While some features have remained, for example the Salmon mode that processes transcriptome alignments (similar to eXpress) was present in 2015, and the approach to likelihood maximization has persisted, considering the programs the same is to descend into Theseus’ paradox.

Interestingly, Patro specifically asked to have the Salmon preprint linked to the journal:

The linking of preprints to journal articles is a feature that arXiv does not automate, and perhaps wisely so. If bioRxiv is to continue to automatically link preprints to journals it needs to focus not only on eliminating false negatives but also false positives, so that journal linking cannot be abused by authors seeking to use the preprint server to prestamp their work after the fact.

The fish always win?

The Sailfish program was the precursor of Salmon, and was published in Patro et al. 2014. At the time, a few students and postdocs in my group read the paper and then discussed it in our weekly journal club. It advocated a philosophy of “lightweight algorithms, which make frugal use of data, respect constant factors and effectively use concurrent hardware by working with small units of data where possible”. Indeed, two themes emerged in the journal club discussion:

1. Sailfish was much faster than other methods by virtue of being simpler.

2. The simplicity was to replace approximate alignment of reads with exact alignment of k-mers. When reads are shredded into their constituent k-mer “mini-reads”, the difficult read -> reference alignment problem in the presence of errors becomes an exact matching problem efficiently solvable with a hash table.

Despite the claim in the Sailfish abstract that “Sailfish provides quantification time…without loss of accuracy” and Figure 1 from the paper showing Sailfish to be more accurate than RSEM, we felt that the shredding of reads must lead to reduced accuracy, and we quickly checked and found that to be the case; this was later noted by others, e.g. Hensman et al. 2015, Lee et al. 2015).

After reflecting on the Sailfish paper and results, Nicolas Bray had the key idea of abandoning alignments as a requirement for RNA-Seq quantification, developed pseudoalignment, and later created kallisto (with Harold Pimentel and Páll Melsted).

I mention this because after the publication of kallisto, Sailfish started changing along with Salmon, and is now frequently discussed in the context of kallisto and Salmon as an equal. Indeed, the PCA plot above shows that (in its current form, v0.10.0) Sailfish is also nearly identical to kallisto. This is because with the release of Sailfish 0.7.0 in September 2015, Patro et al. started changing the Sailfish approach to use pseudoalignment in parallel with the conversion of Salmon to use pseudoalignment. To clarify the changes in Sailfish, I made the PCA plot below which shows where the original version of Sailfish that coincided with the publication of Patro et al. 2014 (version 0.6.3 March 2014) lies relative to the more recent versions and to Salmon:

In other words, despite a series of confusing statements on the Sailfish GitHub page and an out-of-date description of the program on its homepage, Sailfish in its published form was substantially less accurate and slower than kallisto, and in its current form Sailfish is kallisto.

In retrospect, the results in Figure 1 of Patro et al. 2014 seem to be as problematic as the results in Figure 1 of Patro et al. 2017.  Apparently crafting computational experiments via biased simulations and benchmarks to paint a distorted picture of performance is a habit of Patro et al.

In the post I wrote that “The history of the Salmon program is accessible via the GitHub repository, which recorded changes to the code, and also via the bioRxiv preprint server where the authors published three versions of the Salmon preprint prior to its publication in Nature Methods” Here are the details of how these support the claims I make (tl;dr https://twitter.com/yarbsalocin/status/893886707564662784):

Sailfish (current version) and Salmon implemented kallisto’s pseudoalignment algorithm using suffix arrays

First, both Sailfish and Salmon use RapMap (via SACollector) and call mergeLeftRightHits():
Sailfish:
https://github.com/kingsfordgroup/sailfish/blob/352f9001a442549370eb39924b06fa3140666a9e/src/SailfishQuantify.cpp#L192
Salmon:
https://github.com/COMBINE-lab/salmon/commit/234cb13d67a9a1b995c86c8669d4cefc919fbc87#diff-594b6c23e3bdd02a14cc1b861c812b10R2205

The RapMap code for “quasi mapping” executes an algorithm identical to psuedoalignment, down to the detail of what happens to the k-mers in a single read:

First, hitCollector() calls getSAHits_():
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/include/SACollector.hpp#L249

Here kmers are used hashed to SAintervals (Suffix Array intervals), that are then extended to see how far ahead to jump. This is the one of two key ideas in the kallisto paper, namely that not all the k-mers in a read need to be examined to pseudoalign the read. It’s much more than that though, it’s the actual exact same algorithm to the level of exactly the k-mers that are examined. kallisto performs this “skipping” using contig jumping with a different data structure (the transcriptome de Bruijn graph) but aside from data structure used what happens is identical:

https://github.com/COMBINE-lab/RapMap/blob/c1e3132a2e136615edbb91348781cb71ba4c22bd/include/SACollector.hpp#L652
makes a call to jumping and the code to compute MMP (skipping) is
https://github.com/COMBINE-lab/RapMap/blob/c1e3132a2e136615edbb91348781cb71ba4c22bd/include/SASearcher.hpp#L77

There is a different detail in the Sailfish/Salmon code which is that when skipping forward the suffix array is checked for exact matching on the skipped sequence. kallisto does not have this requirement (although it could). On error-free data these will obviously be identical; on error prone data this may make Salmon/Sailfish a bit more conservative and kallisto a bit more robust to error. Also due to the structure of suffix arrays there is a possible difference in behavior when a transcript contains a repeated k-mer. These differences affect a tiny proportion of reads, as is evident from the result that kallisto and Salmon produce near identical results.

The second key idea in kallisto of intersecting equivalence classes for a read. This exact procedure is in:
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/include/SACollector.hpp#L363
which calls:
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/src/HitManager.cpp#L599

There was a choice we had to make in kallisto of how to handle information from paired end reads (does one require consistent pseudoalignment in both? Just one suffices to pseudoalign a read?)
The code for intersection between left and right reads making the identical choices as kallisto is:
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/include/RapMapUtils.hpp#L810

In other words, stepping through what happens to the k-mers in a read shows that Sailfish/Salmon copied the algorithms of kallisto and implemented it with the only difference being a different data structure used to hash the kmers. This is why, when I did my run of Salmon vs. kallisto that led to this blog post I found that
vs
salmon 69,701,169.
That’s a difference of 79,000 out of ~70 million = 0.1%.

1.  Until the kallisto program and preprint was published Salmon used SMEMs. Only after kallisto does Salmon change to using kmer cached suffix array intervals.
2. The kallisto preprint did not discuss outputting position as part of pseudoalignment because it was not central to the idea. It’s trivial to report pseudoalignment positions with either data structure and in fact both kallisto and Salmon do.

I want to make very clear here that I think there can be great value in implementing an algorithm with a different data structure. It’s a form of reproducibility that one can learn from: how to optimize, where performance gains can be made, etc. Unfortunately most funding agencies don’t give grants for projects whose goal is solely to reproduce someone else’s work. Neither do most journal publish papers that set out to do that. That’s too bad. If Patro et al. had presented their work honestly, and explained that they were implementing pseudoalignment with a different data structure to see if it’s better, I’d be a champion of their work. That’s not how they presented their work.

Salmon copied details in the quantification

The idea of using the EM algorithm for quantification with RNA-Seq goes back to Jiang and Wong, 2009, arguably even to Xing et al. 2006. I wrote up the details of the history in a review in 2011 that is on the arXiv. kallisto runs the EM algorithm on equivalence classes, an idea that originates with Nicolae et al. 2011 (or perhaps even Jiang and Wong 2009) but whose significance we understood from the Sailfish paper (Patro et al. 2014). Therefore the fact that Salmon (now) and kallisto both use the EM algorithm, in the same way, makes sense.

However Salmon did not use the EM algorithm before the kallisto preprint and program were published. It used an online variational Bayes algorithm instead. In the May 18, 2015 release of Salmon there is no mention of EM. Then, with the version 0.4 release date Salmon suddenly switches to the EM. In implementing the EM algorithm there are details that must be addressed, for example setting thresholds for when to terminate rounds of inference based on changes in the (log) likelihood (i.e. determine convergence).

For example, kallisto sets parameters
const double alpha_limit = 1e-7;
const double alpha_change_limit = 1e-2;
const double alpha_change = 1e-2;

in EMalgorithm.h
https://github.com/pachterlab/kallisto/blob/90db56ee8e37a703c368e22d08b692275126900e/src/EMAlgorithm.h
The link above shows that these kallisto parameters were set and have not changed since the release of kallisto
Also they were not always this way, see e.g. the version of April 6, 2015:
https://github.com/pachterlab/kallisto/blob/2651317188330f7199db7989b6a4dc472f5d1669/src/EMAlgorithm.h
This is because one of the things we did is explore the effects of these thresholds, and understand how setting them affects performance. This can be seen also in a legacy redundancy, we have both alpha_change and alpha_change_limit which ended up being unnecessary because they are equal in the program and used on one line.

The first versions of Salmon post-kallisto switched to the EM, but didn’t even terminate it the same way as kallisto, adopting instead a maximum iteration of 1,000. See
https://github.com/COMBINE-lab/salmon/blob/59bb9b2e45c76137abce15222509e74424629662/include/CollapsedEMOptimizer.hpp
from May 30, 2015.
This changed later first with the introduction of minAlpha (= kallisto’s alpha_limit)
https://github.com/COMBINE-lab/salmon/blob/56120af782a126c673e68c8880926f1e59cf1427/src/CollapsedEMOptimizer.cpp
and then alphaCheckCutoff (kallisto’s alpha_change_limit)
https://github.com/COMBINE-lab/salmon/blob/a3bfcf72e85ebf8b10053767b8b506280a814d9e/src/CollapsedEMOptimizer.cpp

Here are the salmon thresholds:
double minAlpha = 1e-8;
double alphaCheckCutoff = 1e-2;
double cutoff = minAlpha;

Notice that they are identical except that minAlpha = 1e-8 and not kallisto’s alpha_limit = 1e-7. However in kallisto, from the outset, the way that alpha_limit has been used is:
if (alpha_[ec] < alpha_limit/10.0) {
alpha_[ec] = 0.0;
}

In other words, alpha_limit in kallisto is really 1e-8, and has been all along.

The copying of all the details of our program have consequences for performance. In the sample I ran kallisto performed 1216 EM rounds of EM vs. 1214 EM rounds in Salmon.

Sailfish (current version) copied our sequence specific bias method

One of the things we did in kallisto is implement a sequence specific bias correction along the lines of what was done previously in Roberts et al. 2011, and later in Roberts et al. 2013. Implementing sequence specific bias correction in kallisto required working things out from scratch because of the way equivalence classes were being used with the EM algorithm, and not reads. I worked this out together with Páll Melsted during conversations that lasted about a month in the Spring of 2015. We implemented it in the code although did not release details of how it worked with the initial preprint because it was an option and not default, and we thought we might want to still change it before submitting the journal paper.

Here Rob is stating that Salmon can account for biases that kallisto cannot:
https://www.biostars.org/p/143458/#143639
This was a random forest bias correction method different from kallisto’s.

Shortly thereafter, here is the source code in Sailfish deprecating the Salmon bias correction and switching to kallisto’s method:
https://github.com/kingsfordgroup/sailfish/commit/377f6d65fe5201f7816213097e82df69e4786714#diff-fe8a1774cd7c858907112e6c9fda1e9dR76

https://github.com/kingsfordgroup/sailfish/commit/be0760edce11f95377088baabf72112f920874f9#diff-3e922f9589567fee3b20671da9493c82R34

https://github.com/kingsfordgroup/sailfish/commit/be0760edce11f95377088baabf72112f920874f9#diff-b14c09a136906d1c5d8534afa3a51c4cR818

This is the update to effective length in kallisto:
https://github.com/pachterlab/kallisto/blob/e5957cf96f029be4e899e5746edcf2f63e390609/src/weights.cpp#L184
Here is the Sailfish code:

Notice that there has been a literal copying down to the variable names:

The code written by the student of Rob was:

The code written by us is

efflen *= 0.5*biasAlphaNorm/biasDataNorm;

The code rewritten by Rob (editing that of the student):

effLength *= 0.5 * (txomeNormFactor / readNormFactor);

Note that since our bias correction method was not reported in our preprint, this had to have been copied directly from our codebase and was done so without any attribution.

I raised this specific issue with Carl Kingsford by email prior to our meeting in April 13 2016. We then discussed it in person. The conversation and email were prompted by a change to the Sailfish README on April 7, 2016 specifically accusing us of comparing kallisto to a “ **very old** version of Sailfish”:
https://github.com/kingsfordgroup/sailfish/commit/550cd19f7de0ea526f512a5266f77bfe07148266

What was stated is “The benchmarks in the kallisto paper *are* made against a very old version of Sailfish” not “were made against”. By the time that was written, it might well have been true. But kallisto was published in May 2015, it benchmarked with the Sailfish program described in Patro et al. 2014, and by 2016 Sailfish had changed completely implementing the pseudoalignment of kallisto.

Another aspect of an RNA-Seq quantification program is effective length estimation. There is an attribution to kallisto in the Sailfish code now explaining that this is from kallisto:
“Computes (and returns) new effective lengths for the transcripts based on the current abundance estimates (alphas) and the current effective lengths (effLensIn). This approach is based on the one taken in Kallisto
This is from January 23rd, 2016, almost 9 months after kallisto was released, and 3 months before the Sailfish README accused us of not testing the latest version of Sailfish in May 2015.

The attribution for effective lengths is also in the Salmon code, from 6 months later June 2016:
https://github.com/COMBINE-lab/salmon/blob/335c34b196205c6aebe4ddcc12c380eb47f5043a/include/DistributionUtils.hpp

There is also an acknowledgement in the Salmon code that a machine floating point tolerance we use
https://github.com/pachterlab/kallisto/blob/master/src/EMAlgorithm.h#L19
was copied.
The acknowledgment in Salmon is here
https://github.com/COMBINE-lab/salmon/blob/a3bfcf72e85ebf8b10053767b8b506280a814d9e/src/CollapsedEMOptimizer.cpp
This is the same file where the kallisto thresholds for the EM were copied to.

So after copying our entire method, our core algorithm, many of our ideas, specific parameters, and numerous features… really just about everything that goes into an RNA-Seq quantification project, there is an acknowledgment that our machine tolerance threshold was “intelligently chosen”.

During my third year of undergraduate study I took a course taught by David Gabai in which tests were negatively graded. This meant that points were subtracted for every incorrect statement made in a proof. As proofs could, in principle, contain an unbounded number of false statements, there was essentially no lower bound on the grade one could receive on a test (course grades was bounded below by “F”). Although painful on occasion, I grew to love the class and the transcendent lessons that it taught. These memories came flooding back this past week, when a colleague brought to my attention the paper A simple and fast algorithm for K-medoids clustering by Hae-Sang Park and Chi-Hyuck Jun.

The Park-Jun paper introduces a K-means like method for K-medoid clustering. K-medoid clustering is a clustering formulation based on a generalization of (a relative of) the median rather than the mean, and is useful for the same robustness reasons that make the median preferable to the mean. The medoid is defined as follows: given n points $x_1,\ldots,x_n$ in $\mathbb{R}^d$, the medoid is a point $x$ among them with the property that it minimizes the average distance to the other points, i.e. $x \in \{x_1,\ldots,x_n\}$ minimizes $\sum_{i=1}^n ||x-x_i||$. In the case of $d=1$, when n is odd, this corresponds to the median (the direct generalization of the median is to find a point $x$ not necessarily among the $x_i$ minimizing the average distance to the other points, and this is called the geometric median).

The K-medoid problem is to partition the points into k disjoint subsets $S_1,\ldots,S_n$ so that if $m_1,\ldots,m_k$ are the respective medoids of the subsets (clusters) then the average distance of each medoid to the points of the cluster it belongs to  is minimized. In other words, the K-medoids problem is to find

$argmin_{S=\{S_1,\ldots,S_k\}} \sum_{j=1}^k \sum_{i \in S_k} ||m_j-x_i||$.

For comparison, K-means clustering is the problem of finding

$argmin_{S=\{S_1,\ldots,S_k\}} \sum_{j=1}^k \sum_{i \in S_k} ||\mu_j - x_i||^2$,

where the $\mu_j$ are the centroids of the points in each partition $S_i$. The K-means and K-medoids problems are instances of partitional clustering formulations that are NP-hard.

The most widely used approach for finding a (hopefully good) solution to the K-medoids problem has been a greedy clustering algorithm called PAM by Kaufman and Rousseeuw (partitioning around medoids). To understand the Park-Jun contribution it is helpful to briefly review PAM first.  The method works as follows:

1. Initialize by selecting k distinguished points from the set of points.
2. Identify, for each point, the distinguished point it is closest to. This identification partitions the points into sets and a “cost” can be associated to the partition, namely the sum of the distances from each point to its associated distinguished point (note that the distinguished points may not yet be the medoids of their respective partitions).
3. For each distinguished point d, and each non-distingsuished point repeat the assignment and cost calculation of step  2 with and x swapped so that x becomes distinguished and returns to undistinguished status. Choose the swap that minimizes the total cost. If no swap reduces the cost then output the partition and the distinguished points.
4. Repeat step 3 until termination.

It’s easy to see that when the PAM algorithm terminates the distinguished points must be medoids for the partitions associated to them.

The running time of PAM is $O(k(n-k)^2)$. This is because in step 3, for each of the k distinguished points, there are n-k swaps to consider for a total of $k(n-k)$ swaps, and the cost computation for each swap requires n-k assignments and additions. Thus, PAM is quadratic in the number of points.

To speed-up the PAM method, Park and Jun introduced in their paper an analogy of the K-means algorithm, with mean replaced by median:

1. Initialize by selecting k distinguished points from the set of points.
2. Identify, for each point, the distinguished point it is closest to. This identification partitions the points into sets in the same way as the PAM algorithm.
3. Compute the medoid for each partition. Repeat step 2 until the cost no longer decreases.

Park-Jun claim that their method has run time complexity “$O(nk)$ which is equivalent to K-means clustering”. Yet the pseudocode in the paper begins with the clearly quadratic requirement “calculate the distance between every pair of all objects..”

Step 3 of the algorithm is also quadratic. The medoid of a set of m points is computed in time $O(m^2)$. Given m points the medoid must be one of them, so it suffices to compute, for each point, the sum of the distances to the others ($m \times m$ additions) and a medoid is then identified by taking the minimum. Furthermore, without assumptions on the structure of the distance matrix there cannot be a faster algorithm (with the triangle inequality the medoid can be computed in $O(n^{\frac{3}{2}}$).

Quadratic functions are not linear. If they were, it would mean that, with $a,b \neq 0$$ax^2=bx \mbox{ for all } x$. If that were the case then

$ax^2=bx$

$\Rightarrow ax^2-bx=0$ for all x.

Assuming that a is positive and plugging in $x=\frac{b-\sqrt{b^2+4a}}{2a}$ one would obtain that $ax^2-bx=1$ and it would follow that

$0=1$.

When reading the paper, it was upon noticing this “result” that negative grading came to mind. With a proof that 0=1 we are at, say, a score of -10 for the paper. Turning to the discussion of Fig. 3 of the paper we read that “…the proposed method takes about constant time near zero regardless of the number of objects.”

I suppose a generous reading of this statement is that it is an allusion to Taylor expansion of the polynomial $f(x)=x^2$ around $x=0$. A less generous interpretation is that the figure is deliberately misleading, intended to show linear growth with slope close to zero for what is a quadratic function. I decided to be generous and grade this a -5, leading to a running total of -15.

It seems the authors may have truly believed that their algorithm was linear because in the abstract they begin with “This paper proposes a new algorithm for K-medoids clustering which runs like the K-means algorithm”. It seems possible that the authors thought they had bypassed what they viewed as the underlying cause for quadratic running time of the PAM algorithm, namely the quadratic swap. The K-means algorithm (Lloyd’s algorithm) is indeed linear, because the computation of the mean of a set of points is linear in the number of points (the values for each coordinate can be averaged separately). However the computation of a medoid is not the same as computation of the mean. -5, now a running total of20. The actual running time of the Park-Jun algorithm is shown below:

Replicates on different instances are crucial, because absent from running time complexity calculations are the stopping times which are data dependent. A replicate of Figure 3 is shown below (using the parameters in Table 5 and implementations in MATLAB). The Park-Jun method is called as “small” in MATLAB.

Interestingly, the MATLAB implementation of PAM has been sped up considerably. Also, the “constant time near zero” behavior described by Park and Jun is clearly no such thing. For this lack of replication of Figure 3 another deduction of 5 points for a total of -25.

There is not much more to the Park and Jun paper, but unfortunately there is some. Table 6 shows a comparison of K-means, PAM and the Park-Jun method with true clusters based on a simulation:

The Rand index is a measure of concordance between clusters, and the higher the Rand index the greater the concordance. It seems that the Park-Jun method is slightly better than PAM, which are both superior to K-means in the experiment. However the performance of K-means is a lot better in this experiment than suggested by Figure 2 which was clearly (rotten) cherry picked (Fig2b):

For this I deduct yet another 5 points for a total of -30.

Amazingly, the Park and Jun paper has been cited 479 times. Many of the citations propagate the false claims in the paper, e.g. in referring to Park and Jun, Zadegan et al. write “in each iteration the new set of medoids is selected with running time O(N), where N is the number of objects in the dataset”.  The question is, how many negative points is the repetition of false claims from a paper that is filled with false claims worth?

Okay Houston, we’ve got a problem. We need more power. Case in point: a recently published study Apollo Lunar Astronauts Show Higher Cardiovascular Disease Mortality by Michael Delp et al. was picked up by news outlets with headlines such as:

The headlines were based on a sentence in the paper stating that “the CVD mortality rate among Apollo lunar astronauts (43%) was 4–5 times higher than in non-flight and LEO [low earth orbit] astronauts.”

A reading of the paper reveals that the “5 times more likely to die” risk calculation comes from $43\% \approx 9\% \times 5 = \left\lceil \frac{3}{7} \right\rceil$. The number 9% is the rate of cardiovascular disease observed in 35 non-flight astronauts whereas the number 43% is rate of cardiovascular disease in Apollo lunar astronauts (3 out of 7). In other words, the grandiose claims of the paper are based on three Apollo astronauts dying of cardiovascular disease rather than an expected single astronaut.

The authors themselves must have realized how unfounded their claims were, because the paper evidently flirts with fraud. They used a p-value cutoff of 0.1 to declare the lunar astronaut result “significant”. This is in contrast to the standard cutoff 0.05 which they use for the remainder of the results in the paper, and they justified the strange exception by suggesting that others  “considered [Fisher’s exact test] extremely conservative.” In addition, Ed Mitchell who died at the age of 85 on February 4th 2016 three months before the paper was submitted was excluded from the analysis. His inclusion would have increased the dataset size by 14%! Then there is the fact that they failed to mention the three astronauts who visited the moon twice and are still alive. Or that the lunar astronauts died ten years older on average. Perhaps worst of all, the authors imply that they have experimental data on a mechanism for their statistical (non)result by describing a follow-up experiment examining vascular responses of resistance arteries in irradiated mice. The problem is, the dose given to the mice was 87 times what the astronauts received!  None of this is complicated stuff… and one wonders how only one of the reporters writing about the study picked up on any of this (Sarah Kaplan  from the Washington post headlined the story with  Studying heart disease in astronauts yields clues but not conclusive evidence and concluded correctly “that’s just three of seven people, which doesn’t give you a whole lot of statistical power”.)

One would hope that this kind of paper would be retracted by the journal but my previous attempts to get journals to do the right thing, even when the research was clearly flawed, have been futile. Then there is the funding. Learning nothing doesn’t come for free and the authors’ “work”  was supported by grants from the National Space and Biomedical Research Institute under the NASA Cooperative Agreement. Clearly PI Michael Delp (who is also first author, corresponding author and dean of the College of Human Sciences at Florida State University) would like even more funding, proclaiming in interviews that he wanted to take “a deeper look into the medical history of the Apollo astronauts”, “study future questions” and that he was “working with NASA to conduct additional studies”. My experience in genomics has been that funding agencies typically turn a blind eye to flawed research leaving the task of evaluating the science to “peer reviewers”. I’ve seen many cases where individuals who published complete malarkey and hogwash continue to receive funding.  But it seems NASA cares about the research it funds and may not be on the same page as Delp et al. In a statement published on July 28th, NASA wrote that:

The National Space Biomedical Research Institute, a non-governmental organization with funding from NASA’s Human Research Program, supported a recent study published in Scientific Reports that looked at the rate of cardiovascular disease among Apollo astronauts.

With the current limited astronaut data referenced in the study it is not possible to determine whether cosmic ray radiation affected the Apollo astronauts.

This is not the first time NASA has published statements distancing itself from studies it has supported (either directly or indirectly). Following reports that a NASA-funded study found that industrial civilization was headed for irreversible collapse, NASA published a statement making clear it did not support the results of the study.

Thank you NASA! You have set a great example in taking ownership of the published work your funding enabled. Hopefully others (NIH!!) will follow suit in publicly disavowing poorly designed underpowered studies that make grandiose claims.

Disclosure: I collaborate with NASA scientists, contribute to projects partially funded by NASA, and apply for NASA funding.

Two weeks ago in my post Pachter’s P-value Prize I offered ${\bf \frac{\100}{p}}$ for justifying a reasonable null model and a p-value (p) associated to the statement “”Strikingly, 95% of cases of accelerated evolution involve only one member of a gene pair, providing strong support for a specific model of evolution, and allowing us to distinguish ancestral and derived functions” in the paper

M. Kellis, B.W. Birren and E.S. Lander, Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisaeNature 2004 (hereafter referred to as the KBL paper).

Today I am happy to announce the winner of the prize. But first, I want to thank the many readers of the blog who offered comments (>135 in total) that are extraordinary in their breadth and depth, and that offer a glimpse of what scientific discourse can look like when not restricted to traditional publishing channels. You have provided a wonderful public example of what “peer review” should mean. Coincidentally, and to answer one of the questions posted, the blog surpassed one million views this past Saturday (the first post was on August 19th, 2013), a testament to the the fact that the collective peer reviewing taking place on these pages is not only of very high quality, but also having an impact.

I particularly want to thank the students who had the courage to engage in the conversation, and also faculty who published comments using their name. In that regard, I admire and commend Joshua Plotkin and Hunter Fraser for deciding to deanonymize themselves by agreeing to let me announce here that they were the authors of the critique sent to the authors in April 2004  initially posted as an anonymous comment on the blog.

The discussion on the blog was extensive, touching on many interesting issues and I only summarize a few of the threads of discussion here. I decided to touch on a number of key points made in order to provide context and justification for my post and selection of the prize winner.

The value of post-publication review

One of the comments  made in response to my post that I’d like to respond to first was by an author of KBL who dismissed the entire premise of the my challenge writing “We can keep debating this after 11 years, but I’m sure we all have much more pressing things to do (grants? papers? family time? attacking 11-year-old papers by former classmates? guitar practice?)”

This comment exemplifies the proclivity of some authors to view publication as the encasement of work in a casket, buried deeply so as to never be opened again lest the skeletons inside it escape. But is it really beneficial to science that much of the published literature has become, as Ferguson and Heene noted, a vast graveyard of undead theories? Surely the varied and interesting comments posted in response to my challenge (totaling >25,000 words and 50 pages in Arial 11 font), demonstrate the value of communal discussion of science after publication.

For the record, this past month I did submit a paper and also a grant, and I did spend lots of time with my family. I didn’t practice the guitar but I did play the piano. Yet in terms of research, for me the highlight of the month was reading and understanding the issues raised in the comments to my blog post. Did I have many other things to do? Sure. But what is more pressing than understanding if the research one does is to be meaningful?

The null model

A few years ago I introduced a new two-semester freshman math course at UC Berkeley for intended biology majors called Math 10- Methods of Mathematics: Calculus, Statistics and Combinatorics“. One of the key ideas we focus on in the first semester is that of a p-value. The idea of measuring significance of a biological result via a statistical computation involving probabilities is somewhat unnatural, and feedback from the students confirms what one might expect: that the topic of p-values is among the hardest in the course. Math for biologists turns out to be much harder than calculus. I believe that at Berkeley we are progressive in emphasizing the importance of statistics for biology majors at the outset of their education (to be clear, this is a recent development). The prevailing state is that of statistical illiteracy, and the result is that p-values are frequently misunderstood, abused, and violated in just about every possible way imaginable (see, e.g., here, here and here).

P-values require a null hypothesis and a test statistic, and of course one of the most common misconceptions about them is that when they are large they confirm that the null hypothesis is correct. No! And worse, a small p-value cannot be used to accept an alternative to the null, only to (confidently) reject the null itself. And rejection of the null comes with numerous subtle issues and caveats (see arguments against the p-value in the papers mentioned above). So what is the point?

I think the KBL paper makes for an interesting case study of when p-values can be useful. For starters, the construction of a null model is already a useful exercise, because it is a thought experiment designed to test ones understanding of the problem at hand. The senior author of the KBL paper argues that “we were interested in seeing whether, for genes where duplication frees up at least one copy to evolve rapidly, the evidence better fits one model (“Ohno”: only one copy will evolve quickly) or an alternative model (both genes will evolve quickly).” While I accept this statement at face value, it is important to acknowledge that if there is any science to data science, it is the idea that when examining data one must think beyond the specific hypotheses being tested and consider alternative explanations. This is the essence of what my colleague Ian Holmes is saying in his comment. In data analysis, thinking outside of the box (by using statistics) is not optional. If one is lazy and resorts to intuition then, as Páll Melted points out, one is liable to end up with fantasy.

The first author of KBL suggests that the “paper was quite explicit about the null model being tested.” But I was unsure of whether to assume that the one-gene-only-speeds-up model is the null based on”we sought to distinguish between the Ohno one-gene-only speeds-up (OS) model and the alternative both-genes speed-up (BS) model” or was the null the BS model because “the Ohno model is 10^87 times more likely, leading to significant rejection of the BS null”?  Or was the paper being explicit about not having a null model at all, because  “Two alternatives have been proposed for post-duplication”, or was it the opposite, i.e. two null models: “the OS and BS models are each claiming to be right 95% of the time”? I hope I can be forgiven for failing, despite trying very hard, to identify a null model in either the KBL paper, or the comments of the authors to my blog.

There is however a reasonable null model, and it is the “independence model”, which to be clear, is the model where each gene after duplication “accelerates” independently with some small probability (80/914). The suggestions that “the independence model is not biologically rooted” or that it “would predict that only 75% of genes would be preserved in at least one copy, and that 26% would be preserved in both copies” are of course absurd, as explained by Erik van Nimwegen who explains why point clearly and carefully. The fact that many entries reached the same conclusion about the suitable null model in this case is reassuring. I think it qualifies as a “reasonable model” (thereby passing the threshold for my prize).

The p-value

One of my favorite missives about p-values is by Andrew Gelman, who in “P-values and statistical practice” discusses the subtleties inherent in the use and abuse of p-values. But as my blog post illustrates, subtlety is one thing, and ignorance is an entirely different matter. Consider for example, the entry by Manolis Kellis who submitted that $p = 10^{-87}$ thus claiming that I owe him 903,659,165 million billion trillion quadrillion quintillion sextillion dollars (even more than the debt of the United States of America). His entry will not win the prize, although the elementary statistics lesson that follows is arguably worth a few dollars (for him). First, while it is true that a p-value can be computed from the (log) likelihood ratio when the null hypothesis is a special case of the alternative hypothesis (using the chi^2 distribution), the ratio of two likelihoods is not a p-value! Probabilities of events are also not p-values! For example, the comment that “I calculated p-values for the exact count, but the integral/sum would have been slightly better” is a non-starter. Even though KBL was published in 2004, this is apparently the level of understanding of p-values of one of the authors, a senior computational biologist and professor of computer science at MIT in 2015. Wow.

So what is “the correct” p-value? It depends, of course, on the test statistic. Here is where I will confess that like many professors, I had an answer in mind before asking the question. I was thinking specifically of the setting that leads to 0.74 (or 0.72/0.73, depending on roundoff and approximation). Many entries came up with the same answer I had in mind and therefore when I saw them I was relieved: I owed $135, which is what I had budgeted for the exercise. I was wrong. The problem with the answer 0.74 is that it is the answer to the specific question: what is the probability of seeing 4 or less pairs accelerate out of 76 pairs in which at least one accelerated. A better test statistic was proposed by Pseudo in which he/she asked for the probability of seeing 5% or less of the pairs accelerate from among the pairs with at least one gene accelerating when examining data from the null model with 457 pairs. This is a subtle but important distinction, and provides a stronger result (albeit with a smaller p-value). The KBL result is not striking even forgoing the specific numbers of genes measured to have accelerated in at least one pair (of course just because p=0.64 also does not mean the independence model is correct). What it means is that the data as presented simply weren’t “striking”. One caveat in the above analysis is that the arbitrary threshold used to declare “acceleration” is problematic. For example, one might imagine that other thresholds produce more convincing results, i.e. farther from the null, but of course even if that were true the use of an arbitrary cutoff was a poor approach to analysis of the data. Fortunately, regarding the specific question of its impact in terms of the analysis, we do not have to imagine. Thanks to the diligent work of Erik van Nimwegen, who went to the effort of downloading the data and reanalyzing it with different thresholds (from 0.4 to 1.6), we know that the null cannot be rejected even with other thresholds. The award There were many entries submitted and I read them all. My favorite was by Michael Eisen for his creative use of multiple testing correction, although I’m happier with the direction that yields$8.79. I will not be awarding him the prize though, because his submission fails the test of “reasonable”, although I will probably take him out to lunch sometime at Perdition Smokehouse.

I can’t review every single entry or this post, which is already too long, would become unbearable, but I did think long and hard about the entry of K. It doesn’t directly answer the question of why the 95% number is striking, nor do I completely agree with some of the assumptions (e.g. if neither gene in a pair accelerates then the parent gene was not accelerated pre-WGD). But I’ll give the entry an honorable mention.

The prize will be awarded to Pseudo for defining a reasonable null model and test statistic, and producing the smallest p-value within that framework. With a p-value of 0.64 I will be writing a check in the amount of $156.25. Congratulations Pseudo!! The biology One of the most interesting results of the blog post was, in my opinion, the extensive discussion about the truth. Leaving aside the flawed analysis of KBL, what is a reasonable model for evolution post-WGD? I am happy to see the finer technical details continue to be debated, and the intensity of the conversation is awesome! Pavel Pevzner’s cynical belief that “science fiction” is not a literary genre but rather a description of what is published in the journal Science may be realistic, but I hope the comments on my blog can change his mind about what the future can look like. In lieu of trying to summarize the scientific conversation (frankly, I don’t think I could do justice to some of the intricate and excellent arguments posted by some of the population geneticists) I’ll just leave readers to enjoy the comment threads on their own. Comments are still being posted, and I expect the blog post to be a “living” post-publication review for some time. May the skeletons keep finding a way out! The importance of wrong Earlier in this post I admitted to being wrong. I have been wrong many times. Even though I’ve admitted some of my mistakes on this blog and elsewhere in talks, I would like to joke that I’m not going to make it easy for you to find other flaws in my work. That would be a terrible mistake. Saying “I was wrong” is important for science and essential for scientists. Without it people lose trust in both. I have been particularly concerned with a lack of “I was wrong” in genomics. Unfortunately, there is a culture that has developed among “leaders” in the field where the three words admitting error or wrongdoing are taboo. The recent paper of Lin et al. critiqued by Gilad-Mizrahi is a good example. Leaving aside the question of whether the result in the paper is correct (there are strong indications that it isn’t), Mizrahi-Gilad began their critique on twitter by noting that the authors had completely failed to account for, or even discuss, batch effect. Nobody, and I mean nobody who works on RNA-Seq would imagine for even a femtosecond that this is ok. It was a major oversight and mistake. The authors, any of them really, could have just come out and said “I was wrong”. Instead, the last author on the paper, Mike Snyder, told reporters that “All of the sequencing runs were conducted by the same person using the same reagents, lowering the risk of unintentional bias”. Seriously? Examples abound. The “ENCODE 80% kerfuffle” involved claims that “80% of the genome is functional”. Any self-respecting geneticist recognizes such headline grabbing as rubbish. Ewan Birney, a distinguished scientist who has had a major impact on genomics having being instrumental in the ENSEMBL project and many other high-profile bioinformatics programs defended the claim on BBC: “EB: Ah, so, I don’t — It’s interesting to reflect back on this. For me, the big important thing of ENCODE is that we found that a lot of the genome had some kind of biochemical activity. And we do describe that as “biochemical function”, but that word “function” in the phrase “biochemical function”is the thing which gets confusing. If we use the phrase “biochemical activity”, that’s precisely what we did, we find that the different parts of the genome, [??] 80% have some specific biochemical event we can attach to it. I was often asked whether that 80% goes to 100%, and that’s what I believe it will do. So, in other words, that number is much more about the coverage of what we’ve assayed over the entire genome. In the paper, we say quite clearly that the majority of the genome is not under negative selection, and we say that most of the elements are not under pan-mammalian selection. So that’s negative selection we can detect between lots of different mammals. [??] really interesting question about what is precisely going on in the human population, but that’s — you know, I’m much closer to the instincts of this kind of 10% to 20% sort of range about what is under, sort of what evolution cares about under selection.” This response, and others by members of the ENCODE consortium upset many people who may struggle to tell apart white and gold from blue and black, but certainly know that white is not black and black is not white. Likewise, I suspect the response of KBL to my post disappointed many as well. For Fisher’s sake, why not just acknowledge what is obvious and true? The personal critique of professional conduct A conversation topic that emerged as a result of the blog (mostly on other forums) is the role of style in online discussion of science. Specifically, the question of whether personal attacks are legitimate has come up previously on my blog pages and also in conversations I’ve had with people. Here is my opinion on the matter: Science is practiced by human beings. Just like with any other human activity, some of the humans who practice it are ethical while others are not. Some are kind and generous while others are… not. Occasionally scientists are criminal. Frequently they are honorable. Of particular importance is the fact that most scientists’ behavior is not at any of these extremes, but rather a convex combination of the mentioned attributes and many others. In science it is people who benefit, or are hurt, by the behavior of scientists. Preprints on the bioRxiv do not collect salaries, the people who write them do. Papers published in journals do not get awarded or rejected tenure, people do. Grants do not get jobs, people do. The behavior of people in science affects… people. Some argue for a de facto ban on discussing the personal behavior of scientists. I agree that the personal life of scientists is off limits. But their professional life shouldn’t be. When Bernie Madoff fabricated gains of$65 billion it was certainly legitimate to criticize him personally. Imagine if that was taboo, and instead only the technical aspects of his Ponzi scheme were acceptable material for public debate. That would be a terrible idea for the finance industry, and so it should be for science. Science is not special among the professions, and frankly, the people who practice it hold no primacy over others.

I therefore believe it is not only acceptable but imperative to critique the professional behavior of persons who are scientists. I also think that doing so will help eliminate the problematic devil–saint dichotomy that persists with the current system. Having developed a culture in which personal criticism is outlawed in scientific conversations while only science is fair fodder for public discourse, we now have a situation where scientists are all presumed to be living Gods, or else serious criminals to be outlawed and banished from the scientific community. Acknowledging that there ought to be a grey zone, and developing a healthy culture where critique of all aspects of science and scientists is possible and encouraged would relieve a lot of pressure within the current system. It would also be more fair and just.

A final wish

I wish the authors of the KBL paper would publish the reviews of their paper on this blog.

About one and a half years ago I wrote a blog post titled “GTEx is throwing away 90% of their data“. The post was, shall we say, “direct”. For example, in reference to the RNA-Seq quantification program Flux Capacitor I wrote that

Using Flux Capacitor is equivalent to throwing out 90% of the data!

I added that “the methods description in the Online Methods of Montgomery et al. can only be (politely) described as word salad” (after explaining that the methods underlying the program were never published, except for a brief mention in that paper). I referred to the sole figure in Montgomery et al. as a “completely useless” description of the method  (and showed that it contained errors). I highlighted the fact that many aspects of Flux Capacitor, its description and documentation provided on its website were “incoherent”. Can we agree that this description is not flattering?

The claim about “throwing out 90% of the data” was based on benchmarking I reported on in the blog post. If I were to summarize the results (politely), I would say that the take home message was that Flux Capacitor is junk. Perhaps nobody had really noticed because nobody cared about the program. Flux Capacitor was literally being used only by the authors of the program  (and their affiliates, which turned out to include the ENCODE, GENCODE, GEUVADIS and GTEx consortiums). In fact, when I wrote the blog post, I don’t think the program had ever been benchmarked or compared to other tools. It was, after all, unpublished and besides, nobody reads consortium papers. However after I blogged a few others decided to include Flux Capacitor in their benchmarks and the conclusions reached were the same as mine: Flux Capacitor is junk and Flux Capacitor is junk. Of course some people objected to my blog post when it came out, so it’s fun to be right and have others say so in print. But true vindication has come in the form of a citation to the blog post in a published paper in a journal! Specifically, in

C. Iannone, A. Pohl, P. Papasaikas, D. Soronellas, G.P. Vincent, M. Beato and J. Valcárcel, Relationship between nucleosome positioning and progesterone-induced alternative splicing in breast cancer cells, RNA 21 (2015) 360–374

the authors cite my blog post. They write:

Ummm…. wait… WHAT THE FLUX? The authors actually used Flux Capacitor for their analysis, and are citing my blog at https://liorpachter.wordpress.com/tag/flux-capacitor/ as the definitive reference for the program. Wait, what again?? They used my blog post as a reference for the method??? This is like [[ readers are invited to leave a comment offering a suitable analogy ]].

I’m not really sure what the authors can do at this point. They could publish an erratum and replace the citation. But with what? Flux Capacitor still hasn’t been published (!) Then there is the journal. Does any journal really think it is acceptable to list my blog as the citation for an RNA-Seq quantification tool that is fundamental for the results in a paper? (I’m flattered, but still…) Speaking of the journal, where were the reviewers? How could they not catch this? And the readers? The paper has been out since January… I have to ask: has anybody read it? Of course the biggest embarrassment here is the fact that there is a citation for Flux Capacitor at all. Why on earth are the authors using this discredited program??? Well maybe one answer is to be found in the acknowledgments section, where the group of a PI from the GTEx project is thanked. Actually, this PI was the last author on one of the recently published GTEx companion papers, which, I am sad to say… used Flux Capacitor (albeit with some quantifications performed with Cufflinks as well to demonstrate “robustness”). Why would GTEx be pushing for Flux Capacitor and insist on its use? We’ve come full circle to my GTEx blog post. By now I don’t even know what I think is the most embarrassing part of this whole story. So I thought I’d host a poll:

Earlier this week US News and World Report (USNWR) released, for the first time, a global ranking of universities including rankings by subject area. In mathematics, the top ten universities are:

1. Berkeley
2. Stanford
3. Princeton
4. UCLA
5. University of Oxford
6. Harvard
7. King Abdulaziz University
8. Pierre and Marie Curie – Paris 6
9. University of Hong Kong
10. University of Cambridge

The past few days I’ve received a lot of email from colleagues and administrators about this ranking, and also the overall global ranking of USNWR in which Berkeley was #1. The emails generally say something to the effect of “of course rankings are not perfect, everybody knows… but look, we are amazing!”

BUT, one of the top math departments in the world, the math department at the Massachusetts Institute of Technology is ranked #11… they didn’t even make the top ten. Even more surprising is the entry at #7 that I have boldfaced: the math department at King Abdulaziz University (KAU) in Jeddah, Saudi Arabia. I’ve been in the math department at Berkeley for 15 years, and during this entire time I’ve never (to my knowledge) met a person from their math department and I don’t recall seeing a job application from any of their graduates… I honestly had never heard of the university in any scientific context. I’ve heard plenty about KAUST (the King Abdullah University of Science and Technology ) during the past few years, especially because it is the first mixed-gender university campus in Saudi Arabia, is developing a robust research program based on serious faculty hires from overseas, and in a high profile move hired former Caltech president Jean-Lou Chameau to run the school. But KAU is not KAUST.

A quick google searched reveals that although KAU is nearby in Jeddah, it is a very different type of institution. It has two separate campuses for men and women. Although it was established in 1967 (Osama Bin Laden was a student there in 1975) its math department started a Ph.D. program only two years ago. According to the math department website, the chair of the department, Prof. Abdullah Mathker Alotaibi, is a 2005 Ph.D. with zero publications [Update: Nov. 10: This initial claim was based on a Google Scholar Search of his full name; a reader commented below that he has published and that this claim was incorrect. Nevertheless, I do not believe it in any way materially affect the points made in this post.] This department beat MIT math in the USNWR global rankings! Seriously?

The USNWR rankings are based on 8 attributes:

– global research reputation
– regional research reputation
– publications
– normalized citation impact
– total citations
– number of highly cited papers
– percentage of highly cited papers
– international collaboration

Although KAU’s full time faculty are not very highly cited, it has amassed a large adjunct faculty that helped them greatly in these categories. In fact, in “normalized citation impact” KAU’s math department is the top ranked in the world. This amazing statistic is due to the fact that KAU employs (as adjunct faculty) more than a quarter of the highly cited mathematicians at Thomson Reuters. How did a single university assemble a group with such a large proportion of the world’s prolific (according to Thomson Reuters) mathematicians? (When I first heard this statistic from Iddo Friedberg via Twitter I didn’t believe it and had to go compute it myself from the data on the website. I guess I believe it now but I still can’t believe it!!)

In 2011 Yudhijit Bhattacharjee published an article in Science titled “Saudi Universities Offer Cash in Exchange for Academic Prestige” that describes how KAU is targeting highly cited professors for adjunct faculty positions. According to the article, professors are hired as adjunct professors at KAU for $72,000 per year in return for agreeing (apparently by contract) to add KAU as a secondary affiliation at ISIhighlycited.com and for adding KAU as an affiliation on their published papers. Annual visits to KAU are apparently also part of the “deal” although it is unclear from the Science article whether these actually happen regularly or not. [UPDATE Oct 31, 12:14pm: A friend who was solicited by KAU sent me the invitation email with the contract that KAU sends to potential “Distinguished Adjunct Professors”. The details are exactly as described in the Bhattacharjee article: From: "Dr. Mansour Almazroui" <ceccr@kau.edu.sa> Date: XXXX To: XXXX <XXXX> Subject: Re: Invitation to Join “International Affiliation Program” at King Abdulaziz University, Jeddah Saudi Arabia Dear Prof. XXXX , Hope this email finds you in good health. Thank you for your interest. Please find below the information you requested to be a “Distinguished Adjunct Professor” at KAU. 1. Joining our program will put you on an annual contract initially for one year but further renewable. However, either party can terminate its association with one month prior notice. 2. The Salary per month is$ 6000 for the period of contract.
3. You will be required to work at KAU premises for three weeks in
each contract year. For this you will be accorded with expected
three visits to KAU.
4. Each visit will be at least for one week long but extendable as
suited for research needs.
5. Air tickets entitlement will be in Business-class and stay in Jeddah
will be in a five star hotel. The KAU will cover all travel and living
6. You have to collaborate with KAU local researchers to work on KAU
funded (up to $100,000.00) projects. 7. It is highly recommended to work with KAU researchers to submit an external funded project by different agencies in Saudi Arabia. 8. May submit an international patent. 9. It is expected to publish some papers in ISI journals with KAU affiliation. 10. You will be required to amend your ISI highly cited affiliation details at the ISI highlycited.com web site to include your employment and affiliation with KAU. Kindly let me know your acceptance so that the official contract may be preceded. Sincerely, Mansour ] The publication of the Science article elicited a strong rebuttal from KAU on the comments section, where it was vociferously argued that the hiring of distinguished foreign scholars was aimed at creating legitimate research collaborations, and was not merely a gimmick for increasing citation counts. Moreover, some of the faculty who had signed on defended the decision in the article. For example, Neil Robertson, a distinguished graph theorist (of Robertson-Seymour graph minors fame) explained that “it’s just capitalism,” and “they have the capital and they want to build something out of it.” He added that “visibility is very important to them, but they also want to start a Ph.D. program in mathematics,” (they did do that in 2012) and he added that he felt that “this might be a breath of fresh air in a closed society.” It is interesting to note that despite his initial enthusiasm and optimism, Professor Robertson is no longer associated with KAU. In light of the high math ranking of KAU in the current USNWR I decided to take a closer look at who KAU has been hiring, why, and for what purpose, i.e. I decided to conduct post-publication peer review of the Bhattacharjee Science paper. A web page at KAU lists current “Distinguished Scientists” and another page lists “Former Distinguished Adjunct Professors“. One immediate observation is that out of 118 names on these pages there is 1 woman (Cheryl Praeger from the University of Western Australia). Given that KAU has two separate campuses for men and women, it is perhaps not surprising that women are not rushing to sign on, and perhaps KAU is also not rushing to invite them (I don’t have any information one way or another, but the underrepresentation seems significant). Aside from these faculty, there is also a program aptly named the “Highly Cited Researcher Program” that is part of the Center for Excellence in Genomic Medicine Research. Fourteen faculty are listed there (all men, zero women). But guided by the Science article which described the contract requirement that researchers add KAU to their ISI affiliation, I checked for adjunct KAU faculty at Thomson-Reuters ResearcherID.com and there I found what appears to be the definitive list. Although Neil Robertson has left KAU, he has been replaced by another distinguished graph theorist, namely Carsten Thomassen (no accident as his wikipedia page reveals that “He was included on the ISI Web of Knowledge list of the 250 most cited mathematicians.”) This is a name I immediately recognized due to my background in combinatorics; in fact I read a number of Thomassen’s papers as a graduate student. I decided to check whether it is true that adjunct faculty are adding KAU as an affiliation on their articles. Indeed, Thomassen has done exactly that in his latest publication Strongly 2-connected orientations of graphs published this year in the Journal of Combinatorial Theory Series B. At this point I started having serious reservations about the ethics of faculty who have agreed to be adjuncts at KAU. Regardless of the motivation of KAU in hiring adjunct highly cited foreign faculty, it seems highly inappropriate for a faculty member to list an affiliation on a paper to an institution to which they have no scientific connection whatsoever. I find it very hard to believe that serious graph theory is being researched at KAU, an institution that didn’t even have a Ph.D. program until 2012. It is inconceivable that Thomassen joined KAU in order to find collaborators there (he mostly publishes alone), or that he suddenly found a great urge to teach graph theory in Saudi Arabia (KAU had no Ph.D. program until 2012). The problem is also apparent when looking at the papers of researchers in genomics/computational biology that are adjuncts at KAU. I recognized a number of such faculty members, including high-profile names from my field such as Jun Wang, Manolis Dermitzakis and John Huelsenbeck. I was surprised to see their names (none of these faculty mention KAU on their websites) yet in each case I found multiple papers they have authored during the past year in which they list the KAU affiliation. I can only wonder whether their home institutions find this appropriate. Then again, maybe KAU is also paying the actual universities the faculty they are citation borrowing belong to? But assume for a moment that they aren’t, then why should institutions share the credit they deserve for supporting their faculty members by providing them space, infrastructure, staff and students with KAU? What exactly did KAU contribute to Kilpinen et al. Coordinated effects of sequence variation on DNA binding, chromatin structure and transcription, Science, 2013? Or to Landis et al. Bayesian analysis of biogeography when the number of areas is large, Systematic Biology, 2013? These papers have no authors or apparent contribution from KAU. Just the joint affiliation of the adjunct faculty member. The limit of the question arises in the case of Jun Wang, director of the Beijing Genome Institute, whose affiliations are BGI (60%), University of Copenhagen (15%), King Abdulaziz University (15%), The University of Hong Kong (5%), Macau University of Science and Technology (5%). Should he also acknowledge the airlines he flies on? Should there not be some limit on the number of affiliations of an individual? Shouldn’t journals have a policy about when it is legitimate to list a university as an affiliation for an author? (e.g. the author must have in some significant way been working at the institution). Another, bigger, disgrace that emerged in my examination of the KAU adjunct faculty is the issue of women. Aside from the complete lack of women in the “Highly Cited Researcher Program”, I found that most of the genomics adjunct faculty hired via the program will be attending an all-male conference in three weeks. The “Third International Conference on Genomic Medicine” will be held from November 17–20th at KAU. This conference has zero women. The same meeting last year… had zero women. I cannot understand how in 2014, at a time when many are speaking out strongly about the urgency of supporting females in STEM and in particular about balancing meetings, a bunch of men are willing to forgo all considerations of gender equality for the price of ~$3 per citation per year (a rough calculation using the figure of \$72,000 per year from the Bhattacharjee paper and 24,000 citations for a highly cited researcher). To be clear I have no personal knowledge about whether the people I’ve mentioned in this article are actually being paid or how much, but even if they are being paid zero it is not ok to participate in such meetings. Maybe once (you didn’t know what you are getting into), but twice?!

As for KAU, it seems clear based on the name of the “Highly Cited Researcher Program” and the fact that they advertise their rankings that they are specifically targeting highly cited researchers much more for their delivery of their citations than for development of genuine collaborations (looking at the adjunct faculty I failed to see any theme or concentration of people in any single area as would be expected in building a coherent research program). However I do not fault KAU for the goal of increasing the ranking of their institution. I can see an argument for deliberately increasing rankings in order to attract better students, which in turn can attract faculty. I do think that three years after the publication of the Science article, it is worth taking a closer look at the effects of the program (rankings have increased considerably but it is not clear that research output from individuals based at KAU has increased), and whether this is indeed the most effective way to use money to improve the quality of research institutions. The existence of KAUST lends credence to the idea that the king of Saudi Arabia is genuinely interested in developing Science in the country, and there is a legitimate research question as to how to do so with the existing resources and infrastructure. Regardless of how things ought to be done, the current KAU emphasis on rankings is a reflection of the rankings, which USNWR has jumped into with its latest worldwide ranking. The story of KAU is just evidence of a bad problem getting worse. I have previously thought about the bad version of the problem:

A few years ago I wrote a short paper with my (now former) student Peter Huggins on university rankings:

P. Huggins and L.P., Selecting universities: personal preferences and rankings, arXiv, 2008.

It exists only as an arXiv preprint as we never found a suitable venue for publication (this is code for the paper was rejected upon peer review; no one seemed interested in finding out the extent to which the data behind rankings can produce a multitude of stories). The article addresses a simple question: given that various attributes have been measured for a bunch of universities, and assuming they are combined (linearly) into a score used to produce rankings, how do the rankings depend on the weightings of the individual attributes? The mathematics is that of polyhedral geometry, where the problem is to compute a normal fan of a polytope whose vertices encode all the possible rankings that can be obtained for all possible weightings of the attributes (an object we called the unitope). An example is shown below, indicating the possible rankings as determined by weightings chosen among three attributes measured by USNWR (freshman retention, selectivity, peer assessment). It is important to keep in mind this is data from 2007-2008.

Our paper had an obvious but important message: rankings can be very sensitive to the attribute weightings. Of course some schools such as Harvard came out on top regardless of attribute preferences, but some schools, even top ranked schools, could shift by over 50 positions. Our conclusion was that although the data collected by USNWR was useful, the specific weighting chosen and the ranking it produced were not. Worse than that, sticking to a single choice of weightings was misleading at best, dangerous at worse.

I was reminded of this paper when looking at the math department rankings just published by USNWR. When I saw that KAU was #7 I was immediately suspicious, and even Berkeley’s #1 position bothered me (even though I am a faculty member in the department). I immediately guessed that they must have weighted citations heavily, because our math department has applied math faculty, and KAU has their “highly cited researcher program”. Averaging citations across faculty from different (math) disciplines is inherently unfair. In the case of Berkeley, my applied math colleague James Sethian has a paper on level set methods with more than 10,000 (Google Scholar) citations. This reflects the importance and advance of the paper, but also the huge field of users of the method (many, if not most, of the disciplines in engineering). On the other hand, my topology colleague Ian Agol’s most cited paper has just over 200 citations. This is very respectable for a mathematics paper, but even so it doesn’t come close to reflecting his true stature in the field, namely the person who settled the Virtually Haken Conjecture thereby completing a long standing program of William Thurston that resulted in many of the central open problems in mathematics (Thurston was also incidentally an adjunct faculty member at KAU for some time). In other words, not only are citations not everything, they can also be not anything. By comparing citations across math departments that are diverse to very differing degrees USNWR rendered the math ranking meaningless. Some of the other data collected, e.g. reputation, may be useful or relevant to some, and for completeness I’m including it with this post (here) in a form that allows for it to be examined properly (USNWR does not release it in the form of a table, but rather piecemeal within individual html pages on their site), but collating the data for each university into one number is problematic. In my paper with Peter Huggins we show both how to evaluate the sensitivity of rankings to weightings and also how to infer bounds on the weightings by USNWR from the rankings. It would be great if USNWR included the ability to perform such computations with their data directly on their website but there is a reason USNWR focuses on citations.

The impact factor of a journal is a measure of the average amount of citation per article. It is computed by averaging the citations over all articles published during the preceding two years, and its advertisement by journals reflects a publishing business model where demand for the journal comes from the impact factor, profit from free peer reviewing, and sales from closed subscription based access.  Everyone knows the peer review system is broken, but it’s difficult to break free of when incentives are aligned to maintain it. Moreover, it leads to perverse focus of academic departments on the journals their faculty are publishing in and the citations they accumulate. Rankings such as those by USNWR reflect the emphasis on citations that originates with the journals, as so one cannot fault USNWR for including it as a factor and weighting it highly in their rankings. Having said that, USNWR should have known better than to publish the KAU math rankings; in fact it appears their publication might be a bug. The math department rankings are the only rankings that appear for KAU. They have been ommitted entirely from the global overall ranking and other departmental rankings (I wonder if this is because USNWR knows about the adjunct faculty purchase). In any case, the citation frenzy feeds departments that in aggregate form universities. Universities such as King Abdulaziz, that may reach the point where they feel compelled to enter into the market of citations to increase their overall profile…

I hope this post frightened you. It should. Happy Halloween!

[Update: Dec. 6: an article about KAU and citations has appeared in the Daily Cal, Jonathan Eisen posted his exchanges with KAU, and he has storified the tweets]

Nature Publishing Group claims on its website that it is committed to publishing “original research” that is “of the highest quality and impact”. But when exactly is research “original”?  This is a question with a complicated answer. A recent blog post by senior editor Dorothy Clyde at Nature Protocols provides insight into the difficulties Nature faces in detecting plagiarism, and identifies the issue of self plagiarism as particularly problematic. The journal tries to avoid publishing the work of authors who have previously published the same work or a minor variant thereof. I imagine this is partly in the interests of fairness, a service to the scientific community to ensure that researchers don’t have to sift through numerous variants of a single research project in the literature, and a personal interest of the journal in its aim to publish only the highest level of scholarship.

On the other hand, there is also a rationale for individual researchers to revisit their own previously published work. Sometimes results can be recast in a way that makes them accessible to different communities, and rethinking of ideas frequently leads to a better understanding, and therefore a better exposition. The mathematician Gian-Carlo Rota made the case for enlightened self-plagiarism in one of his ten lessons he wished he had been taught when he was younger:

3. Publish the same result several times

After getting my degree, I worked for a few years in functional analysis. I bought a copy of Frederick Riesz’ Collected Papers as soon as the big thick heavy oversize volume was published. However, as I began to leaf through, I could not help but notice that the pages were extra thick, almost like cardboard. Strangely, each of Riesz’ publications had been reset in exceptionally large type. I was fond of Riesz’ papers, which were invariably beautifully written and gave the reader a feeling of definitiveness.

As I looked through his Collected Papers however, another picture emerged. The editors had gone out of their way to publish every little scrap Riesz had ever published. It was clear that Riesz’ publications were few. What is more surprising is that the papers had been published several times. Riesz would publish the first rough version of an idea in some obscure Hungarian journal. A few years later, he would send a series of notes to the French Academy’s Comptes Rendus in which the same material was further elaborated. A few more years would pass, and he would publish the definitive paper, either in French or in English. Adam Koranyi, who took courses with Frederick Riesz, told me that Riesz would lecture on the same subject year after year, while meditating on the definitive version to be written. No wonder the final version was perfect.

Riesz’ example is worth following. The mathematical community is split into small groups, each one with its own customs, notation and terminology. It may soon be indispensable to present the same result in several versions, each one accessible to a specific group; the price one might have to pay otherwise is to have our work rediscovered by someone who uses a different language and notation, and who will rightly claim it as his own.

The question is: where does one draw the line?

I was recently forced to confront this question when reading an interesting paper about a statistical approach to utilizing controls in large-scale genomics experiments:

J.A. Gagnon-Bartsch and T.P. Speed, Using control genes to corrected for unwanted variation in microarray dataBiostatistics, 2012.

A cornerstone in the logic and methodology of biology is the notion of a “control”. For example, when testing the result of a drug on patients, a subset of individuals will be given a placebo. This is done to literally control for effects that might be measured in patients taking the drug, but that are not inherent to the drug itself. By examining patients on the placebo, it is possible to essentially cancel out uninteresting effects that are not specific to the drug. In modern genomics experiments that involve thousands, or even hundreds of thousands of measurements, there is a biological question of how to design suitable controls, and a statistical question of how to exploit large numbers of controls to “normalize” (i.e. remove unwanted variation) from the high-dimensional measurements.

Formally, one framework for thinking about this is a linear model for gene expression. Using the notation of Gagnon-Bartsch & Speed, we have an expression matrix $Y$ of size $m \times n$ (samples and genes) modeled as

$Y_{m \times n} = X_{m \times p}\beta_{p \times n} + Z_{m \times q}\gamma_{q \times n} + W_{m \times k} \alpha_{k \times n} + \epsilon_{m \times n}$.

Here is a matrix describing various conditions (also called factors) and associated to it is the parameter matrix $\beta$ that records the contribution, or influence, of each factor on each gene. $\beta$ is the primary parameter of interest to be estimated from the data Y. The $\epsilon$ are random noise, and finally  and are observed and unobserved covariates respectively. For example Z might encode factors for covariates such as gender, whereas W would encode factors that are hidden, or unobserved. A crucial point is that the number of hidden factors in W, namely k, is not known. The matrices $\gamma$ and $\alpha$ record the contributions of the Z and W factors on gene expression, and must also be estimated. It should be noted that X may be the logarithm of expression levels from a microarray experiment, or the analogous quantity from an RNA-Seq experiment (e.g. log of abundance in FPKM units).

Linear models have been applied to gene expression analysis for a very long time; I can think of papers going back 15 years. But They became central to all analysis about a decade ago, specifically popularized with the Limma package for microarray data analysis. In an important paper in 2007, Leek and Storey focused explicitly on the identification of hidden factors and estimation of their influence, using a method called SVA (Surrogate Variable Analysis). Mathematically, they described a procedure for estimating k and W and the parameters $\alpha$. I will not delve into the details of SVA in this post, except to say that the overall idea is to first perform linear regression (assuming no hidden factors) to identify the parameters $\beta$ and to then perform singular value decomposition (SVD) on the residuals to identify hidden factors (details omitted here). The resulting identified hidden factors (and associated influence parameters) are then used in a more general model for gene expression in subsequent analysis.

Gagnon-Bartsch and Speed refine this idea by suggesting that it is better to infer W from controls. For example, house-keeping genes that are unlikely to correlate with the conditions being tested, can be used to first estimate W, and then subsequently all the parameters of the model can be estimated by linear regression. They term this two-step process RUV-2 (acronym for Remote Unwanted Variation) where the “2” designates that the procedure is a two-step procedure. As with SVA, the key to inferring W from the controls is to perform singular value decomposition (or more generally factor analysis). This is actually clear from the probabilistic interpretation of PCA and the observation that what it means to be a in the set of “control genes” C  in a setting where there are no observed factors Z, is that

$Y_C = W \alpha_C + \epsilon_C$.

That is, for such control genes the corresponding $\beta$ parameters are zero. This is a simple but powerful observation, because the explicit designation of control genes in the procedure makes it clear how to estimate W, and therefore the procedure becomes conceptually compelling and practically simple to implement. Thus, even though the model being used is the same as that of Leek & Storey, there is a novel idea in the paper that makes the procedure “cleaner”. Indeed, Gagnon-Bartsch & Speed provide experimental results in their paper showing that RUV-2 outperforms SVA. Even more convincing, is the use of RUV-2 by others. For example, in a paper on “The functional consequences of variation in transcription factor binding” by Cusanovitch et al., PLoS Genetics 2014, RUV-2 is shown to work well, and the authors explain how it helps them to take advantage of the controls in experimental design they created.

There is a tech report and also a preprint that follow up on the Gagnon-Bartsch & Speed paper; the tech report extends RUV-2 to a four step method RUV-4 (it also provides a very clear exposition of the statistics), and separately the preprint describes an extension to RUV-2 for the case where the factor of interest is also unknown. Both of these papers build on the original paper in significant ways and are important work, that to return to the original question in the post, certainly are on the right side of “the line”

The wrong side of the line?

The development of RUV-2 and SVA occurred in the context of microarrays, and it is natural to ask whether the details are really different for RNA-Seq (spoiler: they aren’t).  In a book chapter published earlier this year:

D. Risso, J. Ngai, T.P. Speed, S. Dudoit, The role of spike-in standards in the normalization of RNA-Seq, in Statistical Analysis of Next Generation Sequencing Data (2014), 169-190.

the authors replace “log expression levels” from microarrays with “log counts” from RNA-Seq and the linear regression performed with Limma for RUV-2 with a Poisson regression (this involves one different R command). They call the new method RUV, which is the same as the previously published RUV, a naming convention that makes sense since the paper has no new method. In fact, the mathematical formulas describing the method are identical (and even in almost identical notation!) with the exception that the book chapter ignores altogether, and replaces $\epsilon$ with O.

To be fair, there is one added highlight in the book chapter, namely the observation that spike-ins can be used in lieu of housekeeping (or other control) genes. The method is unchanged, of course. It is just that the spike-ins are used to estimate W. Although spike-ins were not mentioned in the original Gagnon-Bartsch paper, there is no reason not to use them with arrays as well; they are standard with Affymetrix arrays.

My one critique of the chapter is that it doesn’t make sense to me that counts are used in the procedure. I think it would be better to use abundance estimates, and in fact I believe that Jeff Leek has already investigated the possibility in a preprint that appears to be an update to his original SVA work. That issue aside, the book chapter does provide concrete evidence using a Zebrafish experiment that RUV-2 is relevant and works for RNA-Seq data.

The story should end here (and this blog post would not have been written if it had) but two weeks ago, among five RNA-Seq papers published in Nature Biotechnology (I have yet to read the others), I found the following publication:

D. Risso, J. Ngai, T.P. Speed, S. Dudoit, Normalization of RNA-Seq data using factor analysis of control genes or samples, Nature Biotechnology 32 (2014), 896-902.

This paper has the same authors as the book chapter (with the exception that Sandrine Dudoit is now a co-corresponding author with Davide Risso, who was the sole corresponding author on the first publication), and, it turns out, it is basically the same paper… in fact in many parts it is the identical paper. It looks like the Nature Biotechnology paper is an edited and polished version of the book chapter, with a handful of additional figures (based on the same data) and better graphics. I thought that Nature journals publish original and reproducible research papers. I guess I didn’t realize that for some people “reproducible” means “reproduce your own previous research and republish it”.

At this point, before drawing attention to some comparisons between the papers, I’d like to point out that the book chapter was refereed. This is clear from the fact that it is described as such in both corresponding authors’ CVs.

How similar are the two papers?

Final paragraph of paper in the book:

Internal and external controls are essential for the analysis of high-throughput data and spike-in sequences have the potential to help researchers better adjust for unwanted technical effects. With the advent of single-cell sequencing [35], the role of spike-in standards should become even more important, both to account for technical variability [6] and to allow the move from relative to absolute RNA expression quantification. It is therefore essential to ensure that spike-in standards behave as expected and to develop a set of controls that are stable enough across replicate libraries and robust to both differences in library composition and library preparation protocols.

Final paragraph of paper in Nature Biotechnology:

Internal and external controls are essential for the analysis of high-throughput data and spike-in sequences have the potential to help researchers better adjust for unwanted technical factors. With the advent of single-cell sequencing27, the role of spike-in standards should become even more important, both to account for technical variability28 and to allow the move from relative to absolute RNA expression quantification. It is therefore essential to ensure that spike- in standards behave as expected and to develop a set of controls that are stable enough across replicate libraries and robust to both differences in library composition and library preparation protocols.

Abstract of paper in the book:

Normalization of RNA-seq data is essential to ensure accurate inference of expression levels, by adjusting for sequencing depth and other more complex nuisance effects, both within and between samples. Recently, the External RNA Control Consortium (ERCC) developed a set of 92 synthetic spike-in standards that are commercially available and relatively easy to add to a typical library preparation. In this chapter, we compare the performance of several state-of-the-art normalization methods, including adaptations that directly use spike-in sequences as controls. We show that although the ERCC spike-ins could in principle be valuable for assessing accuracy in RNA-seq experiments, their read counts are not stable enough to be used for normalization purposes. We propose a novel approach to normalization that can successfully make use of control sequences to remove unwanted effects and lead to accurate estimation of expression fold-changes and tests of differential expression.

Abstract of paper in Nature Biotechnology:

Normalization of RNA-sequencing (RNA-seq) data has proven essential to ensure accurate inference of expression levels. Here, we show that usual normalization approaches mostly account for sequencing depth and fail to correct for library preparation and other more complex unwanted technical effects. We evaluate the performance of the External RNA Control Consortium (ERCC) spike-in controls and investigate the possibility of using them directly for normalization. We show that the spike-ins are not reliable enough to be used in standard global-scaling or regression-based normalization procedures. We propose a normalization strategy, called remove unwanted variation (RUV), that adjusts for nuisance technical effects by performing factor analysis on suitable sets of control genes (e.g., ERCC spike-ins) or samples (e.g., replicate libraries). Our approach leads to more accurate estimates of expression fold-changes and tests of differential expression compared to state-of-the-art normalization methods. In particular, RUV promises to be valuable for large collaborative projects involving multiple laboratories, technicians, and/or sequencing platforms.

Abstract of Gagnon-Bartsch & Speed paper that already took credit for a “new” method called RUV:

Microarray expression studies suffer from the problem of batch effects and other unwanted variation. Many methods have been proposed to adjust microarray data to mitigate the problems of unwanted variation. Several of these methods rely on factor analysis to infer the unwanted variation from the data. A central problem with this approach is the difficulty in discerning the unwanted variation from the biological variation that is of interest to the researcher. We present a new method, intended for use in differential expression studies, that attempts to overcome this problem by restricting the factor analysis to negative control genes. Negative control genes are genes known a priori not to be differentially expressed with respect to the biological factor of interest. Variation in the expression levels of these genes can therefore be assumed to be unwanted variation. We name this method “Remove Unwanted Variation, 2-step” (RUV-2). We discuss various techniques for assessing the performance of an adjustment method and compare the performance of RUV-2 with that of other commonly used adjustment methods such as Combat and Surrogate Variable Analysis (SVA). We present several example studies, each concerning genes differentially expressed with respect to gender in the brain and find that RUV-2 performs as well or better than other methods. Finally, we discuss the possibility of adapting RUV-2 for use in studies not concerned with differential expression and conclude that there may be promise but substantial challenges remain.

Many figures are also the same (except one that appears to have been fixed in the Nature Biotechnology paper– I leave the discovery of the figure as an exercise to the reader). Here is Figure 9.2 in the book:

The two panels appears as (b) and (c) in Figure 4 in the Nature Biotechnology paper (albeit transformed via a 90 degree rotation and reflection from the dihedral group):

Basically the whole of the book chapter and the Nature Biotechnology paper are essentially the same, down to the math notation, which even two papers removed is just a rehashing of the RUV method of Gagnon-Bartsch & Speed. A complete diff of the papers is beyond the scope of this blog post and technically not trivial to perform, but examination by eye reveals one to be a draft of the other.

Although it is acceptable in the academic community to draw on material from published research articles for expository book chapters (with permission), and conversely to publish preprints, including conference proceedings, in journals, this case is different. (a) the book chapter was refereed, exactly like a journal publication (b) the material in the chapter is not expository; it is research, (c) it was published before the Nature Biotechnology article, and presumably prepared long before,  (d) the book chapter cites the Nature Biotechnology article but not vice versa and (e) the book chapter is not a particularly innovative piece of work to begin with. The method it describes and claims to be “novel”, namely RUV, was already published by Gagnon-Bartsch & Speed.

Below is a musical rendition of what has happened here:

When I was a teenager I broke all the rules on Friday night. After dinner I would watch Louis Rukeyser’s Wall Street Week at 8:30pm, and I would be in bed an hour later. On new year’s eve, he had a special “year-end review”, during which he hosted “financial experts” who would opine on the stock market and make predictions for the coming year.

What I learned from Louis Rukeyser was:

1. Never trust men in suits (or tuxedos).

2. It’s easier to perpetrate the 1024 scam than one might think!

Here are the experts in 1999 all predicting increases for the stock market in 2000:

As it turned out, the NASDAQ peaked on March 10, 2000, and within a week and a half had dropped 10%. By the end of the year the dot-com bubble had completely burst and a few years later the market had lost almost 80% of its value.

Predictions on the last day of the 20th century represented a spectacular failure for the “pundits”, but by then I had already witnessed many failures on the show. I’d also noted that almost all the invited “experts” were men. Of course correlation does not imply causation, but I remember having a hard time dispelling the notion that the guests were wrong because they were men. I never wanted to be sexist, but Louis Rukeyser made it very difficult for me!

Gender issues aside, the main lesson I learned from Louis Rukeyser’s show is that it’s easy to perpetrate the 1024 scam. The scam goes something like this: a scammer sends out 1024 emails to individuals that are unlikely to know each other, with each email making a prediction about the performance of the stock market in the coming week. For half the people (512), she predicts the stock market will go up, and for the other half, that it will go down. The next week, she has obviously sent a correct prediction of the market to half the people (this assumes the market is never unchanged after a week). She ignores the 512 people who have received an incorrect prediction, dividing those who received the correct prediction into two halves (256 each). Again, she predicts the performance of the market in the coming week, sending 256 individuals a prediction that the market will go up, and the other 256 a prediction that it will go down. She continues this divide-and-conquer for 10 weeks, at which time there is one individual that has received correct predictions about the movement of the stock market for 2.5 months! This person may believe that the scammer has the ability to predict the market; after all, $(\frac{1}{2})^{10} = 0.00098$ which looks like a very significant p-value. This is when the scammer asks for a “large investment”. Of course what is missing is knowledge of the other prediction emails sent out, or in other words the multiple testing problem.

The Wall Street Week guest panels essentially provided a perfect setting in which to perpetrate this scam. “Experts” that would err would be unlikely to be invited back. Whereas regular winners would be back for another chance at guessing. This is a situation very similar to the mutual fund management market, where managers are sacked when they have a bad year, only to have large firms with hundreds of funds on the books highlight funds that have performed well for 10 years in a row in their annual glossy brochures. But that is not the subject matter of this blog post. Rather, it’s the blog itself.

I wrote and posted my first blog entry (Genesis of *Seq) exactly a year ago. I began writing it for two reasons. First, I thought it could be a convenient and useful forum for discussion of technical developments in computational biology. I was motivated partly by the seqanswers website, which allows users to share information and experience in dealing with high-throughput sequence data. But I was also inspired by the What’s New Blog that has created numerous bridges in the mathematics community via highly technical yet accessible posts that have democratized mathematics. Second, I had noticed an extraordinary abuse of multiple testing in computational biology, and I was desperate for a forum where I could bring the issue to peoples attention. My initial frustration with outlandish claims in papers based on weak statistics had also grown over time to encompass a general concern for lack of rigor in computational biology papers. None of us are perfect but there is a wide gap between perfect and wrong. Computational biology is a field that is now an amalgamation of many subjects and I hoped that a blog would be able to reach the different silos more effectively than publications.

And thus this blog was born on August 19th 2013. I started without a preconception of how it would turn out over time, and I’m happy to say I’ve been surprised by its impact, most notably on myself. I’ve learned an enormous amount from reader feedback, in part via comments on individual posts, but also from private emails to me and in personal conversations. For this (selfish) reason alone, I will keep blogging. I have also been asked by many of you to keep posting, and I’m listening. When I have nothing left to say, I promise I will quit. But for now I have a backlog of posts, and after a break this summer, I am ready to return to the keyboard. Besides, since starting to blog I still haven’t been to Las Vegas.

There has recently been something of an uproar over the new book A Troublesome Inheritance by Nicholas Wade, with much of the criticism centering on Wade’s claim that race is a meaningful biological category. This subject is one with which I1 have some personal connection since as a child growing up in South Africa in the 1980s, I was myself categorized very neatly by the Office for Race Classification: 10. A simple pair of digits that conferred on me numerous rights and privileges denied to the majority of the population.

Explanation of identity numbers assigned to citizens by the South African government during apartheid.

And yet the system behind those digits was anything but simple. The group to which an individual was assigned could be based on not only their skin color but also their employment, eating and drinking habits, and indeed explicitly social factors as related by Muriel Horrell of the South African Institute of Race Relations: “Should a man who is initially classified white have a number of coloured friends and spend many of his leisure hours in their company, he stands to risk being re-classified as coloured.”

With these memories in mind, I found Wade’s concept of race as a biological category quite confusing, a confusion which only deepened when I discovered that he identifies not the eight races designated by the South African Population Registration Act of 1950, but rather five, none of which was the Griqua! With the full force of modern science on his side2, it seemed unlikely that these disparities represented an error on Wade’s part. And so I was left with a perplexing question: how could it be that the South African apartheid regime — racists par excellence — had failed to institutionalize their racism correctly? How had Wade gotten it right when Hendrik Verwoerd had gone awry?

Eventually I realized that A Troublesome Inheritance itself might contain the answer to this conundrum. Institutions, Wade explains, are genetic: “they grow out of instinctual social behaviors” and “one indication of such a genetic effect is that, if institutions were purely cultural, it should be easy to transfer an institution from one society to another.”3 So perhaps it is Wade’s genetic instincts as a Briton that explain how he has navigated these waters more skillfully than the Dutch-descended Afrikaners who designed the institutions of apartheid.

One might initially be inclined to scoff at such a suggestion or even to find it offensive. However, we must press boldly on in the name of truth and try to explain why this hypothesis might be true. Again, A Troublesome Inheritance comes to our aid. There, Wade discusses the decline in English interest rates between 1400 and 1850. This is the result, we learn, of rich English people producing more children than the poor and thereby genetically propagating those qualities which the rich are so famous for possessing: “less impulsive, more patient, and more willing to save.”4 However this period of time saw not only falling interest rates but also the rise of the British Empire. It was a period when Englishmen not only built steam engines and textile mills, but also trafficked in slaves by the millions and colonized countries whose people lacked their imperial genes. These latter activities, with an obvious appeal to the more racially minded among England’s population, could bring great wealth to those who engaged in them and so perhaps the greater reproductive fitness of England’s economic elite propagated not only patience but a predisposition to racism. This would explain, for example, the ability of John Hanning Speke to sniff out “the best blood of Abyssinia” when distinguishing the Tutsi from their Hutu neighbors.

Some might be tempted to speculate that Wade is himself a racist. While Wade — who freely speculates about billions of human beings — would no doubt support such an activity, those who engage in such speculation should perhaps not judge him too harshly. After all, racism may simply be Wade’s own troublesome inheritance.

#### Footnotes

1.  In the spirit of authorship designation as discussed in this post, we describe the author contributions as follows: the recollections of South Africa are those of Lior Pachter, who distinctly remembers his classification as “white”. Nicolas Bray conceived and composed the post with input from LP. LP discloses no conflicts of interest. NB discloses being of British ancestry.
2. Perhaps not quite the full force, given the reception his book has received from actual scientists.
3. While this post is satirical, it should be noted for clarity that, improbably, this is an actual quote from Wade’s book.
4. Again, not satire.