
Spot the difference

One of the maxims of computational biology is that “no two programs ever give the same result.” This is perhaps not so surprising; after all, most journals seek papers that report a significant improvement to an existing method. As a result, when developing new methods, computational biologists ensure that the results of their tools are different, specifically better (by some metric), than those of previous methods. The maxim certainly holds for RNA-Seq tools. For example, the large symmetric differences displayed in the Venn diagram below (from Zhang et al. 2014) are typical for differential expression tool benchmarks:

Venn

In a comparison of RNA-Seq quantification methods, Hayer et al. 2015 showed that methods differ even at the level of summary statistics (in Figure 7 from the paper, shown below, Pearson correlation was calculated using ground truth from a simulation):

quant_corr

These sorts of results are the norm in computational genomics. Finding a pair of software programs that produce identical results is about as likely as finding someone who has won the lottery… twice… in one week. Well, it turns out there has been such a person, and here I describe the computational genomics analog of that unlikely event. Below are a pair of plots made using two different RNA-Seq quantification programs:

Casey_volcano

greensalmon3

The two volcano plots show the log-fold change in abundance estimated for samples sequenced by Boj et al. 2015, plotted against p-values obtained with the program limma-voom. I repeat: the plots were made with quantifications from two different RNA-Seq programs. Details are described in the next section, but before reading it first try playing spot the difference.

The reveal

The top plot is reproduced from Supplementary Figure 6 in Beaulieu-Jones and Greene, 2017. The quantification program used in that paper was kallisto, an RNA-Seq quantification program based on pseudoalignment that was published in Bray et al. 2016.

The bottom plot was made using the quantification program Salmon, and is reproduced from a GitHub repository belonging to the lead author of Patro et al. 2017.

Patro et al. 2017 claim that “[Salmon] achieves the same order-of-magnitude benefits in speed as kallisto and Sailfish but with greater accuracy”. However, after being unable to spot any differences myself in the volcano plots shown above, I decided, with mixed feelings of amusement and annoyance, to check for myself whether the similarity between the programs was some sort of fluke. Or maybe I’d overlooked something obvious, e.g. the fact that programs may tend to give more similar results at the gene level than at the transcript level. Thus began this blog post.

In the figure below, made by quantifying RNA-Seq sample ERR188140 with the latest versions of the two programs, each point is a transcript and its coordinates are the estimated counts produced by kallisto and Salmon respectively.

cor

Strikingly, the Pearson correlation coefficient is 0.9996026. However, astute readers will recognize a possible sleight of hand on my part. The correlation may be inflated by similar results for the very abundant transcripts, and the plot hides thousands of points in the lower left-hand corner. RNA-Seq analyses are notorious for such plots, which appear sound but can be misleading. However in this case I’m not hiding anything. The Pearson correlation computed with log(counts+1) is still extremely high (0.9955965) and the Spearman correlation, which gives equal weight to transcripts irrespective of the magnitude of their counts, is 0.991206. My observation is confirmed in Table 3 of Sarkar et al. 2017 (note that in this table “quasi-mapping” corresponds to Salmon):

Table3_Sarkar

For context, the Spearman correlation between kallisto and a truly different RNA-Seq quantification program, RSEM, is 0.8944941. At this point I have to say… I’ve been doing computational biology for more than 20 years and I have never seen a situation where two ostensibly different programs output such similar results.
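These correlations are straightforward to check on any sample. Below is a minimal sketch of the computation, assuming the default kallisto abundance.tsv and Salmon quant.sf output files for the same sample; the column names follow the defaults of the versions I used, but verify them against your own installations.

import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# placeholder paths to one sample quantified by both programs
kallisto = pd.read_csv("kallisto_out/abundance.tsv", sep="\t", index_col="target_id")
salmon = pd.read_csv("salmon_out/quant.sf", sep="\t", index_col="Name")

# align the two tables on the transcript identifiers present in both
common = kallisto.index.intersection(salmon.index)
k = kallisto.loc[common, "est_counts"].to_numpy()
s = salmon.loc[common, "NumReads"].to_numpy()

print("Pearson (counts):        %.7f" % pearsonr(k, s)[0])
print("Pearson (log(counts+1)): %.7f" % pearsonr(np.log(k + 1), np.log(s + 1))[0])
print("Spearman:                %.7f" % spearmanr(k, s)[0])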

Patro and I are not alone in finding that Salmon \simeq kallisto (if kallisto and Salmon gave identical results I would write that Salmon = kallisto but in lieu of the missing 0.004 in correlation I use the symbol \, \simeq \, to denote the very very strong similarity). Examples in the literature abound, e.g. Supplementary Figure 5 from Majoros et al. 2017 (shown later in the post), Figure 1 from Everaert et al. 2017

Everaert

or Figure 3A from Jin et al. 2017:

Jin

Just a few weeks ago, Sahraeian et al. 2017 published a comprehensive analysis of 39 RNA-Seq analysis tools and performed hierarchical clusterings of methods according to the similarity of their output. Here is one example (their Supplementary Figure 24a):

heatmap

Amazingly, kallisto and Salmon-Quasi (the latest version of Salmon) are the two closest programs to each other in the entire comparison, producing output even more similar than the same program, e.g. Cufflinks or StringTie, run with different alignments!

This raises the question of how, with kallisto published in May 2016 and Salmon \simeq kallisto, Patro et al. 2017 was published in one of the most respected scientific publications, which advertises first and foremost that it “is a forum for the publication of novel methods and significant improvements to tried-and-tested basic research techniques in the life sciences”?

How not to perform a differential expression analysis

The Patro et al. 2017 paper presents a number of comparisons between kallisto and Salmon in which Salmon appears to dramatically improve on the performance of kallisto. For example Figure 1c from Patro et al. 2017 is a table showing an enormous performance difference between kallisto and Salmon:

Figure 1c from Patro et al. 2017

Figure 1c from Patro et al. 2017.

At a false discovery rate of 0.01, the authors claim that in a simulation study where ground truth is known Salmon identifies 4.5 times more truly differential transcripts than kallisto!

This can explain how Salmon was published, namely the reviewers and editor believed Patro et al.’s claims that Salmon significantly improves on previous work. In one analysis Patro et al. provide a p-value to help the “significance” stick. They write that “we found that Salmon’s distribution of mean absolute relative differences was significantly smaller (Mann-Whitney U test, P=0.00017) than those of kallisto.” But how can the result Salmon >> kallisto be reconciled with the fact that everybody repeatedly finds that Salmon \simeq kallisto?

A closer look reveals three things:

  1. In a differential expression analysis billed as “a typical downstream analysis” Patro et al. did not examine differential expression results for a typical biological experiment with a handful of replicates. Instead they examined a simulation of two conditions with eight replicates in each.
  2. The large number of replicates allowed them to apply the log-ratio t-test directly to abundance estimates based on transcript per million (TPM) units, rather than estimated counts which are required for methods such as their own DESeq2.
  3. The simulation involved generation of GC bias in an approach compatible with the inference model, with one batch of eight samples exhibiting “weak GC content dependence” and the other batch of eight exhibiting “more severe fragment-level GC bias.” Salmon was run in a GC bias correction mode.

These were unusual choices by Patro et al. that allowed them to showcase the performance of their method in a way that leveraged the match between one of their inference models and the procedure for simulating the reads. The showcasing was enabled by a confounding variable (bias) that exactly matches their condition variable, by the use of TPM units to magnify the impact of that effect on their inference, and by simulation with a large number of replicates, which enabled the use of TPM because with many replicates one can directly apply the log-ratio t-test. This complex chain of dependencies is unraveled below:

There is a reason why log-fold changes are not directly tested in standard RNA-Seq differential expression analyses. Variance estimation is challenging with few replicates and RNA-Seq methods developers understood this early on. That is why all competitive methods for differential expression analysis such as DESeq/DESeq2, edgeR, limma-voom, Cuffdiff, BitSeq, sleuth, etc. regularize variance estimates (i.e., perform shrinkage) by sharing information across transcripts/genes of similar abundance. In a careful benchmarking of differential expression tools, Schurch et al. 2016 show that the log-ratio t-test is the worst method. See, e.g., their Figure 2:

Figure 2 from Schurch et al. 2016

Figure 2 from Schurch et al. 2016. The four vertical panels show FPR and TPR for programs using 3, 6, 12 and 20 biological replicates (in yeast). Details are in the Schurch et al. 2016 paper.

The log-ratio t-test performs poorly not only when the number of replicates is small and regularization of variance estimates is essential: Schurch et al. specifically recommend DESeq2 (or edgeR) even for experiments with up to 12 replicates. In fact, the log-ratio t-test was so bad that it didn’t even make it into their Table 2 “summary of recommendations”.

The authors of Patro et al. 2017 are certainly well aware of the poor performance of the log-ratio t-test. After all, one of them was specifically thanked in the Schurch et al. 2016 paper “for his assistance in identifying and correcting a bug”. Moreover, the program recommended by Schurch et al. (DESeq2) was authored by one of the coauthors of the Patro et al. paper, who regularly and publicly advocates for the use of his programs (and not the log-ratio t-test):

This recommendation has been codified in a detailed RNA-Seq tutorial where M. Love et al. write that “This [Salmon + tximport] is our current recommended pipeline for users”.

In Soneson and Delorenzi, 2013, the authors wrote that “there is no general consensus regarding which [RNA-Seq differential expression] method performs best in a given situation” and despite the development of many methods and benchmarks since this influential review, the question of how to perform differential expression analysis continues to be debated. While it’s true that “best practices” are difficult to agree on, one thing I hope everyone can agree on is that in a “typical downstream analysis” with a handful of replicates

do not perform differential expression with a log-ratio t-test.

Turning to Patro et al.‘s choice of units, it is important to note that the requirement of shrinkage for RNA-Seq differential analysis is the reason most differential expression tools require abundances measured in counts as input, and do not use length normalized units such as Transcripts Per Million (TPM). In TPM units the abundance \rho_t for a transcript t is \rho_t \propto \frac{c_t}{N \cdot l_t} where c_t are the estimated counts for transcript t, l_t is the (effective) length of t and N the number of total reads. Whereas counts are approximately Poisson distributed (albeit with some over-dispersion), variance estimates of abundances in TPM units depend on the lengths used in normalization and therefore cannot be used directly for regularization of variance estimation. Furthermore, the dependency of TPM on effective lengths means that abundances reported in TPM are very sensitive to the estimates of effective length.
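To make the dependence on effective lengths concrete, here is a small sketch of the counts-to-TPM conversion implied by the formula above. The function name and toy numbers are mine, purely for illustration.

import numpy as np

def counts_to_tpm(est_counts, eff_lengths):
    # rho_t is proportional to c_t / l_t; rescaling to sum to 10^6 removes the
    # dependence on N, but the dependence on the effective lengths remains
    rate = np.asarray(est_counts, dtype=float) / np.asarray(eff_lengths, dtype=float)
    return 1e6 * rate / rate.sum()

counts = np.array([100.0, 200.0, 300.0])
print(counts_to_tpm(counts, [1000.0, 2000.0, 1500.0]))
print(counts_to_tpm(counts, [900.0, 2000.0, 1500.0]))  # same counts, slightly different effective lengths

Identical estimated counts with slightly different effective length estimates already yield different TPM values, which is why a program can be penalized in a TPM comparison for having a different notion of effective length.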

This is why, when comparing the quantification accuracy of different programs, it is important to compare abundances using estimated counts. This was highlighted in Bray et al. 2016: “Estimated counts were used rather than transcripts per million (TPM) because the latter is based on both the assignment of ambiguous reads and the estimation of effective lengths of transcripts, so a program might be penalized for having a differing notion of effective length despite accurately assigning reads.” Yet Patro et al. perform no comparisons of programs in terms of estimated counts.

A typical analysis

The choices of Patro et al. in designing their benchmarks are demystified when one examines what would have happened had they compared Salmon to kallisto on typical data with standard downstream differential analysis tools such as their own tximport and DESeq2. I took the definition of “typical” from one of the Patro et al. coauthors’ own papers (Soneson et al. 2016): “Currently, one of the most common approaches is to define a set of non-overlapping targets (typically, genes) and use the number of reads overlapping a target as a measure of its abundance, or expression level.”

The Venn diagram below shows the differences in genes detected as differentially expressed when kallisto and Salmon are compared using the workflow the authors recommend publicly (quantifications -> tximport -> DESeq2) on a typical biological dataset with three replicates in each of two conditions. The number of overlapping genes is shown for a false discovery rate of 0.05 on RNA-Seq data from Trapnell et al. 2014:

venn_DESeq2_0.05

A Venn diagram showing the overlap in genes predicted to be differentially expressed by kallisto (blue) and Salmon (pink). Differential expression was performed with DESeq2 using transcript-level counts estimated by kallisto and Salmon and imported to DESeq2 with tximport. Salmon was run with GC bias correction.

This example provides Salmon the benefit of the doubt: the dataset was chosen to be older (when bias was more prevalent) and Salmon was not run in default mode but rather with GC bias correction turned on (option --gcBias).

When I saw these numbers for the first time I gasped. Of course I shouldn’t have been surprised; they are consistent with repeated published experiments in which comparisons of kallisto and Salmon have revealed near identical results. And while I think it’s valuable to publish confirmation of previous work, I did wonder whether Nature Methods would have accepted the Patro et al. paper had the authors conducted an actual “typical downstream analysis”.
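Tabulating such an overlap is simple once DESeq2 has been run separately on the kallisto-based and Salmon-based quantifications. A minimal sketch is below; the padj column is standard in DESeq2 results, but the CSV file names are placeholders for wherever you export the two results tables.

import pandas as pd

FDR = 0.05

def significant_genes(path, alpha=FDR):
    # gene IDs with adjusted p-value below alpha in an exported DESeq2 results table
    res = pd.read_csv(path, index_col=0)
    return set(res.index[res["padj"] < alpha])

kallisto_de = significant_genes("deseq2_kallisto_results.csv")  # placeholder file names
salmon_de = significant_genes("deseq2_salmon_results.csv")

print("kallisto only:", len(kallisto_de - salmon_de))
print("Salmon only:  ", len(salmon_de - kallisto_de))
print("both:         ", len(kallisto_de & salmon_de))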

What about the TPM?

Patro et al. utilized TPM based comparisons for all the results in their paper, presumably to highlight the improvements in accuracy resulting from better effective length estimates. Numerous results in the paper suggest that Salmon is much more accurate than kallisto. However I had seen a figure in Majoros et al. 2017 that examined the (cumulative) distribution of both kallisto and Salmon abundances in TPM units (their Supplementary Figure 5) in which the curves literally overlapped at almost all thresholds:

Majoros

The plot above was made with Salmon v0.7.2 so in fairness to Patro et al. I remade it using the ERR188140 dataset mentioned above with Salmon v0.8.2:

cumu

The distribution of abundances (in TPM units) as estimated by kallisto (blue circles) and Salmon (red stars).

The blue circles correspond to kallisto and the red stars to Salmon. With the latest version of Salmon the similarity is even higher than what Majoros et al. observed! The Spearman correlation between kallisto and Salmon in TPM units is 0.9899896.

It’s interesting to examine what this means for a (truly) typical TPM analysis. One way that TPMs are used is to filter transcripts (or genes) by some threshold, typically TPM > 1 (in another deviation from “typical”, a key table in Patro et al. 2017 – Figure 1d – is made by thresholding with TPM > 0.1). The Venn diagram below shows the differences between the programs at the typical TPM > 1 threshold:

venn

A Venn diagram showing the overlap in transcripts predicted by kallisto and Salmon to have estimated abundance > 1 TPM.

The figures above were made with Salmon 0.8.2 run in default mode. The correlation between kallisto and Salmon (in TPM units) decreases a tiny amount, from 0.9989224 to 0.9974325, with the --gcBias option, and even the Spearman correlation decreases by only 0.011, from 0.9899896 to 0.9786092.

I think it’s perfectly fine for authors to present their work in the best light possible. What is not ok is to deliberately hide important and relevant truth, which in this case is that Salmon \, \simeq \, kallisto.

A note on speed

One of the claims in Patro et al. 2017 is that “[the speed of Salmon] roughly matches the speed of the recently introduced kallisto.” The Salmon claim is based on a benchmark of an experiment (details unknown) with 600 million 75bp paired-end reads using 30 threads. Below are the results of a similar benchmark of Salmon showing time to process 19 samples from Boj et al. 2015 with variable numbers of threads:

Salmon_only_timings

First, Salmon with --gcBias is considerably slower than default Salmon. Furthermore, there is a rapid decrease in performance gain with increasing numbers of threads, something that should come as no surprise. It is well known that quantification can be I/O bound, which means that at some point extra threads don’t provide any gain as the disk starts grinding, limiting access from the CPUs. So why did Patro et al. choose to benchmark runtime with 30 threads?

The figure below provides a possible answer:

running_time

In other words, not only is Salmon \simeq kallisto in accuracy, but contrary to the claims in Patro et al. 2017, kallisto is faster. This result is confirmed in Table 1 of Sarkar et al. 2017 who find that Salmon is slower by roughly the same factor as seen above (in the table “quasi-mapping” is Salmon).

 

Table1_Sarkar.jpeg

Having said that, the speed differences between kallisto and Salmon should not matter much in practice and large scale projects made possible with kallisto (e.g. Vivian et al. 2017) are possible with Salmon as well. Why then did the authors not report their running time benchmarks honestly?
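For readers who want to run their own timing comparison, here is a minimal sketch using placeholder index and FASTQ paths. The thread flags are -t for kallisto quant and -p for salmon quant in the versions I used, but verify them against your installations before relying on the numbers.

import subprocess
import time

R1, R2 = "ERR188140_1.fastq.gz", "ERR188140_2.fastq.gz"   # placeholder read files

def timed(cmd):
    # wall-clock time for one quantification run
    start = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - start

for t in [1, 2, 4, 8, 16, 30]:
    k = timed(["kallisto", "quant", "-i", "kallisto.idx", "-o", "k_%d" % t,
               "-t", str(t), R1, R2])
    s = timed(["salmon", "quant", "-i", "salmon_index", "-l", "A", "-1", R1, "-2", R2,
               "-p", str(t), "-o", "s_%d" % t])
    print("%2d threads: kallisto %7.1fs   Salmon %7.1fs" % (t, k, s))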

 

 

 

The first common notion

The Patro et al. 2017 paper uses the term “quasi-mapping” to describe an algorithm, published in Srivastava et al. 2016, for obtaining their (what turned out to be near identical to kallisto) results. I have written previously about how “quasi-mapping” is the same as pseudoalignment as an alignment concept, even though Srivastava et al. 2016 initially implemented pseudoalignment differently than the way we described it originally in our preprint, Bray et al. 2015. However the reviewers of Patro et al. 2017 may be forgiven for assuming that “quasi-mapping” is a technical advance over pseudoalignment. The Srivastava et al. paper is dense and filled with complex technical detail. Even for an expert in alignment/RNA-Seq it is not easy to see from a superficial reading of the paper that “quasi-mapping” is an equivalent concept to kallisto’s pseudoalignment (albeit implemented with suffix arrays instead of a de Bruijn graph). Nevertheless, the key to the paper is a simple sentence in Section 2.1: “Specifically, the algorithm [RapMap, which is now used in Salmon] reports the intersection of transcripts appearing in all hits”. That’s the essence of pseudoalignment right there. The paper acknowledges as much: “This lightweight consensus mechanism is inspired by Kallisto (Bray et al., 2016), though certain differences exist”. Well, as shown above, those differences appear to have made no difference in standard practice, except insofar as the Salmon implementation of pseudoalignment is slower than the one in Bray et al. 2016.
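To make the “intersection of transcripts appearing in all hits” concrete, here is a toy sketch of that consensus step. It is a deliberate simplification for illustration only (it looks up every k-mer instead of skipping, and it ignores positions); it is not the kallisto or RapMap implementation.

from functools import reduce

def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def build_index(transcriptome, k):
    # map each k-mer to the set of transcripts containing it
    index = {}
    for name, seq in transcriptome.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(name)
    return index

def pseudoalign(read, index, k):
    # intersect the transcript sets of all k-mers in the read that hit the index
    hits = [index[km] for km in kmers(read, k) if km in index]
    return reduce(set.intersection, hits) if hits else set()

# toy example with two transcripts that share a prefix
transcriptome = {"t1": "ACGTACGTGGA", "t2": "ACGTACGTCCA"}
index = build_index(transcriptome, k=5)
print(pseudoalign("ACGTACGTGG", index, k=5))   # {'t1'}
print(pseudoalign("ACGTACGT", index, k=5))     # {'t1', 't2'}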

Srivastava et al. 2016 and Patro et al. 2017 make a fuss about the fact that their “quasi-mappings” take into account the starting positions of reads in transcripts, thereby including more information than a “pure” pseudoalignment. This is a pedantic distinction that Patro et al. are trying to create. Already in the kallisto preprint (May 11, 2015), it was made clear that this information was trivially accessible via a reasonable approach to pseudoalignment: “Once the graph and contigs have been constructed, kallisto stores a hash table mapping each k-mer to the contig it is contained in, along with the position within the contig.”

In other words, Salmon is not producing near identical results to kallisto due to an unprecedented cosmic coincidence. The underlying method is the same. I leave it to the reader to apply Euclid’s first common notion:

Things which equal the same thing are also equal to each other.

Convergence

While Salmon is now producing almost identical output to kallisto and is based on the same principles and methods, this was not the case when the program was first released. The history of the Salmon program is accessible via the GitHub repository, which recorded changes to the code, and also via the bioRxiv preprint server where the authors published three versions of the Salmon preprint prior to its publication in Nature Methods.

The first preprint was published on the bioRxiv on June 27, 2015. It followed shortly on the heels of the kallisto preprint, which was published on May 11, 2015. However the first Salmon preprint described a program very different from kallisto. Instead of pseudoalignment, Salmon relied on chaining SMEMs (super-maximal exact matches) between reads and transcripts to identify what the authors called “approximately consistent co-linear chains” as proxies for alignments of reads to transcripts. The authors then compared Salmon to kallisto writing that “We also compare with the recently released method of Kallisto which employs an idea similar in some respects to (but significantly different than) our lightweight-alignment algorithm and again find that Salmon tends to produce more accurate estimates in general, and in particular is better able [to] estimate abundances for multi-isoform genes.” In other words, in 2015 Patro et al. claimed that Salmon was “better” than kallisto. If so, why did the authors of Salmon later change the underlying method of their program from SMEM alignment to pseudoalignment?

Inspired by temporal ordering analysis of expression data and single-cell pseudotime analysis, I ran all the versions of kallisto and Salmon on ERR188140, and performed PCA on the resulting transcript abundance table to be able to visualize the progression of the programs over time. The figure below shows all the points with the exception of three: Sailfish 0.6.3, kallisto 0.42.0 and Salmon 0.32.0. I removed Sailfish 0.6.3 because it is such an outlier that it caused all the remaining points to cluster together on one side of the plot (the figure is below in the next section). In fairness I also removed one Salmon point (version 0.32.0) because it differed substantially from version 0.4.0 that was released a few weeks after 0.32.0 and fixed some bugs. Similarly, I removed kallisto 0.42.0, the first release of kallisto which had some bugs that were fixed 6 days later in version 0.42.1.

pca_final

Evidently kallisto output has changed little since May 12, 2015. Although some small bugs were fixed and features added, the quantifications have been very similar. The quantifications have been stable because the algorithm has been the same.

On the other hand the Salmon trajectory shows a steady convergence towards kallisto. The result everyone is finding, namely that currently Salmon \simeq kallisto, is revealed by the clustering of recent versions of Salmon near kallisto. However the first releases of Salmon are very different from kallisto. This is also clear from the heatmap/hierarchical clustering of Sahraeian et al. in which Salmon-SMEM was included (Salmon used SMEMs, sometimes labeled fmd, until version 0.5.1, when “quasi-mapping” became the default). A question: if Salmon ca. 2015 was truly better than kallisto then is Salmon ca. 2017 worse than Salmon ca. 2015?

Time vs. PC1

Convergence of Salmon and Sailfish to kallisto over the course of a year. The x-axis labels the time different versions of each program were released. The y-axis is PC1 from a PCA of transcript abundances of the programs.
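The PCA itself is easy to redo. Below is a sketch of the computation, assuming a hypothetical table of transcript abundances with one column per program version (for example assembled from the ERR188140 runs described above); the file name and layout are placeholders.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# rows = transcripts, columns = one program version each (e.g. "kallisto_0.43.1",
# "salmon_0.8.2", ...); placeholder file assembled from the individual runs
abundances = pd.read_csv("version_abundances.tsv", sep="\t", index_col=0)

X = np.log(abundances.to_numpy().T + 1.0)      # one row per program version
coords = PCA(n_components=2).fit_transform(X)

for version, (pc1, pc2) in zip(abundances.columns, coords):
    print("%25s  PC1 = %9.2f  PC2 = %9.2f" % (version, pc1, pc2))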

Prestamping

The bioRxiv preprint server provides a feature by which a preprint can be linked to its final form in a journal. This feature is useful to readers of the bioRxiv, as final published papers are generally improved after preprint reader, reviewer, and editor comments have been addressed. Journal linking is also a mechanism for authors to time stamp their published work using the bioRxiv. However I’m sure the bioRxiv founders did not intend the linking feature to be abused as a “prestamping” mechanism, i.e. a way for authors to ex post facto receive a priority date for a published paper that had very little, or nothing, in common with the original preprint.

A comparison of the June 2015 preprint mentioning the Salmon program and the current Patro et al. paper reveals almost nothing in common. The initial method changed drastically in tandem with an update to the preprint on October 3, 2015 at which point the Salmon program was using “quasi mapping”, later published in Srivastava et al. 2016. Last year I met with Carl Kingsford (co-corresponding author of Patro et al. 2017) to discuss my concern that Salmon was changing from a method distinct from that of kallisto (SMEMs of May 2015) to one that was replicating all the innovations in kallisto, without properly disclosing that it was essentially a clone. Yet despite a promise that he would raise my concerns with the Salmon team, I never received a response.

At this point, the Salmon core algorithms have changed completely, the software program has changed completely, and the benchmarking has changed completely. The Salmon project of 2015 and the Salmon project of 2017 are two very different projects although the name of the program is the same. While some features have remained, for example the Salmon mode that processes transcriptome alignments (similar to eXpress) was present in 2015, and the approach to likelihood maximization has persisted, considering the programs the same is to descend into Theseus’ paradox.

Interestingly, Patro specifically asked to have the Salmon preprint linked to the journal:

The linking of preprints to journal articles is a feature that arXiv does not automate, and perhaps wisely so. If bioRxiv is to continue to automatically link preprints to journals it needs to focus not only on eliminating false negatives but also false positives, so that journal linking cannot be abused by authors seeking to use the preprint server to prestamp their work after the fact.

The fish always win?

The Sailfish program was the precursor of Salmon, and was published in Patro et al. 2014. At the time, a few students and postdocs in my group read the paper and then discussed it in our weekly journal club. It advocated a philosophy of “lightweight algorithms, which make frugal use of data, respect constant factors and effectively use concurrent hardware by working with small units of data where possible”. Indeed, two themes emerged in the journal club discussion:

1. Sailfish was much faster than other methods by virtue of being simpler.

2. The simplicity was to replace approximate alignment of reads with exact alignment of k-mers. When reads are shredded into their constituent k-mer “mini-reads”, the difficult read -> reference alignment problem in the presence of errors becomes an exact matching problem efficiently solvable with a hash table.

Despite the claim in the Sailfish abstract that “Sailfish provides quantification time…without loss of accuracy” and Figure 1 from the paper showing Sailfish to be more accurate than RSEM, we felt that the shredding of reads must lead to reduced accuracy, and we quickly checked and found that to be the case; this was later noted by others (e.g. Hensman et al. 2015, Lee et al. 2015).

After reflecting on the Sailfish paper and results, Nicolas Bray had the key idea of abandoning alignments as a requirement for RNA-Seq quantification, developed pseudoalignment, and later created kallisto (with Harold Pimentel and Páll Melsted).

I mention this because after the publication of kallisto, Sailfish started changing along with Salmon, and is now frequently discussed in the context of kallisto and Salmon as an equal. Indeed, the PCA plot above shows that (in its current form, v0.10.0) Sailfish is also nearly identical to kallisto. This is because with the release of Sailfish 0.7.0 in September 2015, Patro et al. started changing the Sailfish approach to use pseudoalignment in parallel with the conversion of Salmon to use pseudoalignment. To clarify the changes in Sailfish, I made the PCA plot below which shows where the original version of Sailfish that coincided with the publication of Patro et al. 2014 (version 0.6.3 March 2014) lies relative to the more recent versions and to Salmon:

pca_final_all

In other words, despite a series of confusing statements on the Sailfish GitHub page and an out-of-date description of the program on its homepage, Sailfish in its published form was substantially less accurate and slower than kallisto, and in its current form Sailfish is kallisto.

In retrospect, the results in Figure 1 of Patro et al. 2014 seem to be as problematic as the results in Figure 1 of Patro et al. 2017.  Apparently crafting computational experiments via biased simulations and benchmarks to paint a distorted picture of performance is a habit of Patro et al.

Addendum [August 5, 2017]

In the post I wrote that “The history of the Salmon program is accessible via the GitHub repository, which recorded changes to the code, and also via the bioRxiv preprint server where the authors published three versions of the Salmon preprint prior to its publication in Nature Methods.” Here are the details of how these support the claims I make (tl;dr https://twitter.com/yarbsalocin/status/893886707564662784):

Sailfish (current version) and Salmon implemented kallisto’s pseudoalignment algorithm using suffix arrays

First, both Sailfish and Salmon use RapMap (via `SACollector`) and call `mergeLeftRightHits()`:
Sailfish:
https://github.com/kingsfordgroup/sailfish/blob/352f9001a442549370eb39924b06fa3140666a9e/src/SailfishQuantify.cpp#L192
Salmon:
https://github.com/COMBINE-lab/salmon/commit/234cb13d67a9a1b995c86c8669d4cefc919fbc87#diff-594b6c23e3bdd02a14cc1b861c812b10R2205

The RapMap code for “quasi mapping” executes an algorithm identical to pseudoalignment, down to the detail of what happens to the k-mers in a single read:

First, `hitCollector()` calls `getSAHits_()`:
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/include/SACollector.hpp#L249

Here k-mers are hashed to SA intervals (suffix array intervals), which are then extended to see how far ahead to jump. This is one of the two key ideas in the kallisto paper, namely that not all the k-mers in a read need to be examined to pseudoalign the read. It’s much more than that though: it’s the exact same algorithm, down to the level of exactly which k-mers are examined. kallisto performs this “skipping” using contig jumping with a different data structure (the transcriptome de Bruijn graph), but aside from the data structure used what happens is identical:

https://github.com/COMBINE-lab/RapMap/blob/c1e3132a2e136615edbb91348781cb71ba4c22bd/include/SACollector.hpp#L652
makes a call to jumping and the code to compute MMP (skipping) is
https://github.com/COMBINE-lab/RapMap/blob/c1e3132a2e136615edbb91348781cb71ba4c22bd/include/SASearcher.hpp#L77

There is a different detail in the Sailfish/Salmon code, which is that when skipping forward the suffix array is checked for exact matching on the skipped sequence. kallisto does not have this requirement (although it could). On error-free data these will obviously be identical; on error-prone data this may make Salmon/Sailfish a bit more conservative and kallisto a bit more robust to error. Also, due to the structure of suffix arrays there is a possible difference in behavior when a transcript contains a repeated k-mer. These differences affect a tiny proportion of reads, as is evident from the result that kallisto and Salmon produce near identical results.

The second key idea in kallisto is the intersection of equivalence classes for a read. This exact procedure is in:
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/include/SACollector.hpp#L363
which calls:
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/src/HitManager.cpp#L599

There was a choice we had to make in kallisto of how to handle information from paired-end reads (does one require consistent pseudoalignment of both reads, or does just one suffice to pseudoalign the pair?).
The code for intersection between left and right reads, making the identical choices as kallisto, is:
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/include/RapMapUtils.hpp#L810

In other words, stepping through what happens to the k-mers in a read shows that Sailfish/Salmon copied the algorithms of kallisto and implemented them with the only difference being the data structure used to hash the k-mers. This is why, when I did my run of Salmon vs. kallisto that led to this blog post, I found that
kallisto pseudoaligned 69,780,930 reads
vs
salmon 69,701,169.
That’s a difference of 79,000 out of ~70 million = 0.1%.

Two additional points:

  1. Until the kallisto program and preprint were published, Salmon used SMEMs. Only after kallisto did Salmon change to using k-mer-cached suffix array intervals.
  2. The kallisto preprint did not discuss outputting position as part of pseudoalignment because it was not central to the idea. It’s trivial to report pseudoalignment positions with either data structure and in fact both kallisto and Salmon do.

I want to make very clear here that I think there can be great value in implementing an algorithm with a different data structure. It’s a form of reproducibility that one can learn from: how to optimize, where performance gains can be made, etc. Unfortunately most funding agencies don’t give grants for projects whose goal is solely to reproduce someone else’s work. Neither do most journals publish papers that set out to do that. That’s too bad. If Patro et al. had presented their work honestly, and explained that they were implementing pseudoalignment with a different data structure to see if it’s better, I’d be a champion of their work. That’s not how they presented their work.

Salmon copied details in the quantification

The idea of using the EM algorithm for quantification with RNA-Seq goes back to Jiang and Wong, 2009, arguably even to Xing et al. 2006. I wrote up the details of the history in a review in 2011 that is on the arXiv. kallisto runs the EM algorithm on equivalence classes, an idea that originates with Nicolae et al. 2011 (or perhaps even Jiang and Wong 2009) but whose significance we understood from the Sailfish paper (Patro et al. 2014). Therefore the fact that Salmon (now) and kallisto both use the EM algorithm, in the same way, makes sense.

However Salmon did not use the EM algorithm before the kallisto preprint and program were published. It used an online variational Bayes algorithm instead. In the May 18, 2015 release of Salmon there is no mention of EM. Then, with the version 0.4 release, Salmon suddenly switched to the EM. In implementing the EM algorithm there are details that must be addressed, for example setting thresholds for when to terminate rounds of inference based on changes in the (log) likelihood (i.e., determining convergence).

For example, kallisto sets parameters
const double alpha_limit = 1e-7;
const double alpha_change_limit = 1e-2;
const double alpha_change = 1e-2;

in EMalgorithm.h
https://github.com/pachterlab/kallisto/blob/90db56ee8e37a703c368e22d08b692275126900e/src/EMAlgorithm.h
The link above shows that these kallisto parameters were set and have not changed since the release of kallisto. Also, they were not always this way; see e.g. the version of April 6, 2015:
https://github.com/pachterlab/kallisto/blob/2651317188330f7199db7989b6a4dc472f5d1669/src/EMAlgorithm.h
This is because one of the things we did is explore the effects of these thresholds and understand how setting them affects performance. This can also be seen in a legacy redundancy: we have both alpha_change and alpha_change_limit, which ended up being unnecessary because they are equal in the program and used on only one line.

The first versions of Salmon post-kallisto switched to the EM, but didn’t even terminate it the same way as kallisto, adopting instead a maximum of 1,000 iterations. See
https://github.com/COMBINE-lab/salmon/blob/59bb9b2e45c76137abce15222509e74424629662/include/CollapsedEMOptimizer.hpp
from May 30, 2015.
This changed later first with the introduction of minAlpha (= kallisto’s alpha_limit)
https://github.com/COMBINE-lab/salmon/blob/56120af782a126c673e68c8880926f1e59cf1427/src/CollapsedEMOptimizer.cpp
and then alphaCheckCutoff (kallisto’s alpha_change_limit)
https://github.com/COMBINE-lab/salmon/blob/a3bfcf72e85ebf8b10053767b8b506280a814d9e/src/CollapsedEMOptimizer.cpp

Here are the salmon thresholds:
double minAlpha = 1e-8;
double alphaCheckCutoff = 1e-2;
double cutoff = minAlpha;

Notice that they are identical except that minAlpha = 1e-8 and not kallisto’s alpha_limit = 1e-7. However in kallisto, from the outset, the way that alpha_limit has been used is:
if (alpha_[ec] < alpha_limit/10.0) {
alpha_[ec] = 0.0;
}

In other words, alpha_limit in kallisto is really 1e-8, and has been all along.
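To see where such thresholds enter, here is a bare-bones sketch of an EM iteration over equivalence classes with a kallisto-style convergence check. It is a simplified illustration with variable names of my own choosing; effective-length weighting and all other refinements of the real implementations are omitted.

import numpy as np

ALPHA_LIMIT = 1e-7          # abundances below ALPHA_LIMIT/10 are zeroed out
ALPHA_CHANGE_LIMIT = 1e-2   # stop when the relative change of the surviving alphas is below this

def em(eq_classes, counts, num_transcripts, max_iter=10000):
    # eq_classes: list of arrays of transcript indices; counts: reads per equivalence class
    alpha = np.full(num_transcripts, 1.0 / num_transcripts)
    for _ in range(max_iter):
        new_alpha = np.zeros(num_transcripts)
        for ec, c in zip(eq_classes, counts):
            w = alpha[ec]
            if w.sum() > 0:
                new_alpha[ec] += c * w / w.sum()   # split each class's reads by current alphas
        new_alpha /= new_alpha.sum()               # renormalize
        keep = new_alpha > ALPHA_LIMIT
        rel_change = np.abs(new_alpha[keep] - alpha[keep]) / new_alpha[keep]
        alpha = np.where(new_alpha > ALPHA_LIMIT / 10.0, new_alpha, 0.0)
        if rel_change.max() < ALPHA_CHANGE_LIMIT:
            break
    return alpha

# toy example: 3 transcripts, reads compatible with {0,1}, {1,2} and {2}
print(em([np.array([0, 1]), np.array([1, 2]), np.array([2])], [10, 10, 10], 3))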

The copying of all the details of our program has consequences for performance. In the sample I ran, kallisto performed 1216 rounds of EM vs. 1214 rounds in Salmon.

Sailfish (current version) copied our sequence specific bias method

One of the things we did in kallisto is implement a sequence-specific bias correction along the lines of what was done previously in Roberts et al. 2011, and later in Roberts et al. 2013. Implementing sequence-specific bias correction in kallisto required working things out from scratch because of the way equivalence classes, and not reads, were being used with the EM algorithm. I worked this out together with Páll Melsted during conversations that lasted about a month in the Spring of 2015. We implemented it in the code although we did not release details of how it worked with the initial preprint, because it was an option and not the default, and we thought we might still want to change it before submitting the journal paper.

Here Rob is stating that Salmon can account for biases that kallisto cannot:
https://www.biostars.org/p/143458/#143639
This was a random forest bias correction method different from kallisto’s.

Shortly thereafter, here is the source code in Sailfish deprecating the Salmon bias correction and switching to kallisto’s method:
https://github.com/kingsfordgroup/sailfish/commit/377f6d65fe5201f7816213097e82df69e4786714#diff-fe8a1774cd7c858907112e6c9fda1e9dR76

https://github.com/kingsfordgroup/sailfish/commit/be0760edce11f95377088baabf72112f920874f9#diff-3e922f9589567fee3b20671da9493c82R34

https://github.com/kingsfordgroup/sailfish/commit/be0760edce11f95377088baabf72112f920874f9#diff-b14c09a136906d1c5d8534afa3a51c4cR818

This is the update to effective length in kallisto:
https://github.com/pachterlab/kallisto/blob/e5957cf96f029be4e899e5746edcf2f63e390609/src/weights.cpp#L184
Here is the Sailfish code:
https://github.com/kingsfordgroup/sailfish/commit/be0760edce11f95377088baabf72112f920874f9#diff-8341ac749ad4ac5cfcc8bfef0d6f1efaR796

Notice that there has been a literal copying down to the variable names:
https://github.com/kingsfordgroup/sailfish/commit/be0760edce11f95377088baabf72112f920874f9#diff-8341ac749ad4ac5cfcc8bfef0d6f1efaR796

The code written by the student of Rob was:

effLength *=alphaNormFactor/readNormFactor;

The code written by us is

efflen *= 0.5*biasAlphaNorm/biasDataNorm;

The code rewritten by Rob (editing that of the student):

effLength *= 0.5 * (txomeNormFactor / readNormFactor);

Note that since our bias correction method was not reported in our preprint, this had to have been copied directly from our codebase and was done so without any attribution.

I raised this specific issue with Carl Kingsford by email prior to our meeting on April 13, 2016. We then discussed it in person. The conversation and email were prompted by a change to the Sailfish README on April 7, 2016 specifically accusing us of comparing kallisto to a “ **very old** version of Sailfish”:
https://github.com/kingsfordgroup/sailfish/commit/550cd19f7de0ea526f512a5266f77bfe07148266

What was stated is “The benchmarks in the kallisto paper *are* made against a very old version of Sailfish”, not “were made against”. By the time that was written, it might well have been true. But kallisto was published in May 2015, it was benchmarked against the Sailfish program described in Patro et al. 2014, and by 2016 Sailfish had changed completely, implementing the pseudoalignment of kallisto.

Token attribution

Another aspect of an RNA-Seq quantification program is effective length estimation. There is now an attribution to kallisto in the Sailfish code explaining that this is from kallisto:
“Computes (and returns) new effective lengths for the transcripts based on the current abundance estimates (alphas) and the current effective lengths (effLensIn). This approach is based on the one taken in Kallisto”
https://github.com/kingsfordgroup/sailfish/blob/b1657b3e8929584b13ad82aa06060ce1d5b52aed/src/SailfishUtils.cpp
This is from January 23rd, 2016, almost 9 months after kallisto was released, and 3 months before the Sailfish README accused us of not testing the latest version of Sailfish in May 2015.

The attribution for effective lengths is also in the Salmon code, from 6 months later, in June 2016:
https://github.com/COMBINE-lab/salmon/blob/335c34b196205c6aebe4ddcc12c380eb47f5043a/include/DistributionUtils.hpp

There is also an acknowledgement in the Salmon code that a machine floating point tolerance we use
https://github.com/pachterlab/kallisto/blob/master/src/EMAlgorithm.h#L19
was copied.
The acknowledgment in Salmon is here
https://github.com/COMBINE-lab/salmon/blob/a3bfcf72e85ebf8b10053767b8b506280a814d9e/src/CollapsedEMOptimizer.cpp
This is the same file where the kallisto thresholds for the EM were copied to.

So after copying our entire method, our core algorithm, many of our ideas, specific parameters, and numerous features… really just about everything that goes into an RNA-Seq quantification project, there is an acknowledgment that our machine tolerance threshold was “intelligently chosen”.

 

During my third year of undergraduate study I took a course taught by David Gabai in which tests were negatively graded. This meant that points were subtracted for every incorrect statement made in a proof. As proofs could, in principle, contain an unbounded number of false statements, there was essentially no lower bound on the grade one could receive on a test (course grades were bounded below by “F”). Although painful on occasion, I grew to love the class and the transcendent lessons that it taught. These memories came flooding back this past week, when a colleague brought to my attention the paper A simple and fast algorithm for K-medoids clustering by Hae-Sang Park and Chi-Hyuck Jun.

The Park-Jun paper introduces a K-means like method for K-medoid clustering. K-medoid clustering is a clustering formulation based on a generalization of (a relative of) the median rather than the mean, and is useful for the same robustness reasons that make the median preferable to the mean. The medoid is defined as follows: given n points x_1,\ldots,x_n in \mathbb{R}^d, the medoid is a point x among them with the property that it minimizes the average distance to the other points, i.e. x \in \{x_1,\ldots,x_n\} minimizes \sum_{i=1}^n ||x-x_i||. In the case of d=1, when n is odd, this corresponds to the median (the direct generalization of the median is to find a point x not necessarily among the x_i minimizing the average distance to the other points, and this is called the geometric median).
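In code, the definition translates directly into a quadratic-time computation; a minimal sketch:

import numpy as np

def medoid(points):
    # index of the point minimizing the sum of distances to all the others: O(m^2)
    points = np.asarray(points, dtype=float)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return int(np.argmin(dists.sum(axis=1)))

pts = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]]
print(medoid(pts))   # 1: the middle point minimizes the total distance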

The K-medoid problem is to partition the points into k disjoint subsets S_1,\ldots,S_k so that if m_1,\ldots,m_k are the respective medoids of the subsets (clusters) then the total distance from each medoid to the points of the cluster it belongs to is minimized. In other words, the K-medoids problem is to find

argmin_{S=\{S_1,\ldots,S_k\}} \sum_{j=1}^k \sum_{i \in S_j} ||m_j-x_i||.

For comparison, K-means clustering is the problem of finding

argmin_{S=\{S_1,\ldots,S_k\}} \sum_{j=1}^k \sum_{i \in S_j} ||\mu_j - x_i||^2,

where the \mu_j are the centroids of the points in each partition S_j. The K-means and K-medoids problems are instances of partitional clustering formulations that are NP-hard.

The most widely used approach for finding a (hopefully good) solution to the K-medoids problem has been a greedy clustering algorithm called PAM (partitioning around medoids) by Kaufman and Rousseeuw. To understand the Park-Jun contribution it is helpful to briefly review PAM first. The method works as follows:

  1. Initialize by selecting k distinguished points from the set of points.
  2. Identify, for each point, the distinguished point it is closest to. This identification partitions the points into sets and a “cost” can be associated to the partition, namely the sum of the distances from each point to its associated distinguished point (note that the distinguished points may not yet be the medoids of their respective partitions).
  3. For each distinguished point d and each non-distinguished point x, repeat the assignment and cost calculation of step 2 with d and x swapped, so that x becomes distinguished and d returns to undistinguished status. Choose the swap that minimizes the total cost. If no swap reduces the cost then output the partition and the distinguished points.
  4. Repeat step 3 until termination.

It’s easy to see that when the PAM algorithm terminates the distinguished points must be medoids for the partitions associated to them.

The running time of PAM is O(k(n-k)^2). This is because in step 3, for each of the k distinguished points, there are n-k swaps to consider for a total of k(n-k) swaps, and the cost computation for each swap requires n-k assignments and additions. Thus, PAM is quadratic in the number of points.
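A compact sketch of PAM, written to mirror the four steps above, makes the O(k(n-k)^2) cost visible: each pass considers k(n-k) swaps, and each swap requires an O(n) cost computation. This is illustrative code, not the Kaufman-Rousseeuw implementation.

import numpy as np

def assignment_cost(dist, meds):
    # total distance from each point to its nearest distinguished point
    return dist[:, meds].min(axis=1).sum()

def pam(points, k, seed=0):
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    rng = np.random.default_rng(seed)
    meds = list(rng.choice(n, size=k, replace=False))       # step 1: initialize
    cost = assignment_cost(dist, meds)                       # step 2: assign and compute cost
    while True:
        best = (cost, None)
        for i in range(k):                                   # step 3: try every swap
            for x in range(n):
                if x in meds:
                    continue
                trial = meds[:i] + [x] + meds[i + 1:]
                c = assignment_cost(dist, trial)
                if c < best[0]:
                    best = (c, trial)
        if best[1] is None:                                  # step 4: stop when no swap helps
            return meds, cost
        cost, meds = best

data = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
print(pam(data, k=2))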

To speed-up the PAM method, Park and Jun introduced in their paper an analogy of the K-means algorithm, with mean replaced by median:

  1. Initialize by selecting k distinguished points from the set of points.
  2. Identify, for each point, the distinguished point it is closest to. This identification partitions the points into sets in the same way as the PAM algorithm.
  3. Compute the medoid for each partition. Repeat step 2 until the cost no longer decreases.

Park-Jun claim that their method has run time complexity “O(nk) which is equivalent to K-means clustering”. Yet the pseudocode in the paper begins with the clearly quadratic requirement “calculate the distance between every pair of all objects..”

alldist

Step 3 of the algorithm is also quadratic. The medoid of a set of m points is computed in time O(m^2). Given m points the medoid must be one of them, so it suffices to compute, for each point, the sum of the distances to the others (m \times m additions) and a medoid is then identified by taking the minimum. Furthermore, without assumptions on the structure of the distance matrix there cannot be a faster algorithm (with the triangle inequality the medoid can be computed in O(n^{\frac{3}{2}})).
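A sketch of the Park-Jun iteration, written directly from the three steps above, shows both costs: the O(nk) assignment of step 2, and the quadratic all-pairs distance matrix and medoid updates of step 3. Again, this is illustrative code written from the description in the paper, not the authors' implementation.

import numpy as np

def park_jun(points, k, seed=0, max_iter=100):
    points = np.asarray(points, dtype=float)
    n = len(points)
    # "calculate the distance between every pair of all objects": O(n^2) up front
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    rng = np.random.default_rng(seed)
    meds = rng.choice(n, size=k, replace=False)
    prev_cost = np.inf
    for _ in range(max_iter):
        labels = dist[:, meds].argmin(axis=1)               # step 2: O(nk) assignment
        cost = dist[np.arange(n), meds[labels]].sum()
        if cost >= prev_cost:                               # repeat until the cost no longer decreases
            break
        prev_cost = cost
        for j in range(k):                                  # step 3: medoid of each cluster,
            members = np.where(labels == j)[0]              # O(m_j^2) per cluster
            if members.size == 0:
                continue
            within = dist[np.ix_(members, members)]
            meds[j] = members[within.sum(axis=1).argmin()]
    return meds, labels

data = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 5])
print(park_jun(data, k=2))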

Quadratic functions are not linear. If they were, it would mean that, with a,b \neq 0, ax^2=bx \mbox{ for all } x. If that were the case then

ax^2=bx

\Rightarrow ax^2-bx=0 for all x.

Assuming that a is positive and plugging in x=\frac{b-\sqrt{b^2+4a}}{2a} one would obtain that ax^2-bx=1 and it would follow that

0=1.

When reading the paper, it was upon noticing this “result” that negative grading came to mind. With a proof that 0=1 we are at, say, a score of -10 for the paper. Turning to the discussion of Fig. 3 of the paper we read that “…the proposed method takes about constant time near zero regardless of the number of objects.”

const_timefig

I suppose a generous reading of this statement is that it is an allusion to Taylor expansion of the polynomial f(x)=x^2 around x=0. A less generous interpretation is that the figure is deliberately misleading, intended to show linear growth with slope close to zero for what is a quadratic function. I decided to be generous and grade this a -5, leading to a running total of -15.

It seems the authors may have truly believed that their algorithm was linear, because in the abstract they begin with “This paper proposes a new algorithm for K-medoids clustering which runs like the K-means algorithm”. It seems possible that the authors thought they had bypassed what they viewed as the underlying cause for the quadratic running time of the PAM algorithm, namely the quadratic swap. The K-means algorithm (Lloyd’s algorithm) is indeed linear, because the computation of the mean of a set of points is linear in the number of points (the values for each coordinate can be averaged separately). However the computation of a medoid is not the same as the computation of the mean. -5, now a running total of -20. The actual running time of the Park-Jun algorithm is shown below:

kmed-largeN

Replicates on different instances are crucial, because the stopping times, which are data dependent, are absent from running time complexity calculations. A replicate of Figure 3 is shown below (using the parameters in Table 5 and implementations in MATLAB). The Park-Jun method is the one called “small” in MATLAB.

kmedoids

Interestingly, the MATLAB implementation of PAM has been sped up considerably. Also, the “constant time near zero” behavior described by Park and Jun is clearly no such thing. For this lack of replication of Figure 3 another deduction of 5 points for a total of -25.

There is not much more to the Park and Jun paper, but unfortunately there is some. Table 6 shows a comparison of K-means, PAM and the Park-Jun method with true clusters based on a simulation:

Table6

The Rand index is a measure of concordance between clusterings, and the higher the Rand index the greater the concordance. It seems that the Park-Jun method is slightly better than PAM, both of which are superior to K-means in the experiment. However the performance of K-means is a lot better in this experiment than suggested by Figure 2, which was clearly (rotten) cherry picked (Fig 2b):

Fig2.jpeg

For this I deduct yet another 5 points for a total of -30.

Amazingly, the Park and Jun paper has been cited 479 times. Many of the citations propagate the false claims in the paper, e.g. in referring to Park and Jun, Zadegan et al. write “in each iteration the new set of medoids is selected with running time O(N), where N is the number of objects in the dataset”.  The question is, how many negative points is the repetition of false claims from a paper that is filled with false claims worth?

According to Bach’s first biographer, when the great composer was asked to explain the secret of his success he replied “I was obliged to be industrious. Whoever is equally industrious will succeed equally well”. I respectfully disagree.

First, Bach didn’t have to drive the twenty kids he fathered over his lifetime to soccer practices, arrange their playdates, prepare and cook dinners, and do the laundry. To his credit he did teach his sons composition (though notably only his sons). But even that act was time spent on composition, an activity to which he was seemingly able to devote himself completely. For example, the fifteen two part inventions were composed in Köthen between 1720 and 1723 as exercises for his 9 year old son Wilhelm Friedemann, at the same time he was composing the first volume of the Well Tempered Clavier. In other words Bach had the freedom and time to pursue a task which was valued (certainly after Mendelssohn’s revival) by others. That is not the case for many, whose hard work, even when important, is trivialized and diminished when they are alive and forgotten when they are dead.

The second reason I disagree with Bach’s suggestion that those who work as hard as he did will succeed equally well is that his talent transcended that which is achievable by hard work alone. Although there is some debate as to whether Bach’s genius was in his revealing of beauty or his discovery of truth (I tend to agree with Richard Taruskin, whose view is discussed eloquently by Bernard Chazelle in an interview from 2014), regardless of how one thinks of Bach’s genius, his music was extraordinary. Much was made recently of the fact that Chuck Berry’s Johnny B. Goode was included on the Voyager spacecraft record, but, as Lewis Thomas said, if we really wanted to boast we would have sent the complete works of Bach. As it is, three Bach pieces were recorded, the most of any composer or musician. The mortal Chuck Berry and Wolfgang Amadeus Mozart appear just once. In other words, even the aliens won’t believe Bach’s claim that “whoever is equally industrious will succeed equally well”.

Perhaps nitpicking Bach’s words is unfair, because he was a musician and not an orator or writer. But what about his music? Others have blogged him, but inevitably an analysis of his music reveals that even though he may have occasionally broken his “rules”, the fact that he did so, and how he did so, was the very genius he is celebrated for. Still, it’s fun to probe the greatest composer of all time, and I recently found myself doing just that, with none other than one of his “simplest” works, one of the inventions written as an exercise for his 9 year old child. The way this happened is that after a 30 year hiatus, I recently found myself revisiting some of the two part inventions. My nostalgia started during a power outage a few weeks ago, when I realized the piano was the only interesting non-electronic instrument that was working in the house, and I therefore decided to use the opportunity to learn a fugue that I hadn’t tackled before. A quick attempt revealed that I would be wise to brush up on the inventions first. My favorite invention is BWV 784 in A minor, and after some practice it slowly returned to my fingers, albeit in a cacophony of uneven, rheumatic dissonance. But I persevered, and not without reward. Not a musical reward- my practice hardly improved the sound- but a music theory reward. I realized, after playing some of the passages dozens of times, that I was playing a certain measure differently because it sounded better to me the “wrong” way. How could that be?

The specific measure was number 16:

bach1

bach2

This measure is the third in a four measure section that starts in measure 14, in which a sequence of broken chords (specifically diminished seventh chords) connect a section ending in E minor with one beginning in A minor. It’s an interesting mathomusical problem to consider the ways in which those keys can be bridged in a four measure sequence (according to the “rules of Bach”), and in listening to the section while playing it I had noticed that the E♮ notes at the end of measure 15 and in measure 16 were ok… but somehow a bit strange to my ear. To understand what was going on I had to dust off my First Year Harmony book by William Lovelock for some review (Lovelock’s text is an introduction to harmony and counterpoint and is a favorite of mine, perhaps because it is written in an axiomatic style). The opening broken diminished seventh chord in measure 14 is really the first inversion of the leading note (C♯) for D minor. It is followed by a dominant seventh (A) leading into the key of C major. We see the same pattern repeated there (leading diminished seventh in first inversion followed by the dominant seventh) and two more times until measure 18. In other words, Bach’s solution to the problem of arriving at A minor from E minor was to descend via D -> C -> B♭ -> A, alternating diminished sevenths with dominant sevenths to facilitate the modulations.

Except that is not what Bach did! Measure 16 is an outlier in the key of E minor instead of the expected B♭. What I was hearing while playing was that it was more natural for me to play the E♭ notes (marked in red in the manuscript above) than the annotated E♮ notes. My 9 year old daughter recorded my (sloppy, but please ignore that!) version below:

Was Bach wrong? I tried to figure it out but I couldn’t think of a good reason why he would diverge from the natural harmonic sequence for the section. I found the answer in a booklet formally analyzing the two part inventions. In An Analytical Survey of the Fifteen Two-Part Inventions by J.S. Bach by Theodore O. Johnson, the author explains that the use of E♮ instead of E♭ at the end of measure 15 and in the beginning of measure 16 preserves the sequence of thirds each three semitones apart. My preference for E♭ breaks the sequence: I chose musical harmony over mathematical harmony. Bach obviously faced the dilemma but refused to compromise. The composed sequence, even though not strictly the expected harmonic one, sounds better the more one listens to it. Upon understanding the explanation, I’ve grown to like it just as much and I’m fine with acknowledging that E minor works perfectly well in measure 16. It’s amazing to think how deeply Bach thought through every note in every composition. The attention to detail, even in a two part invention intended as an exercise for a 9 year old kid, is incredible.

The way Bach wrote the piece, with the E♮, can be heard in Glenn Gould's classic recording:

Gould performs the invention in 45 seconds, an astonishing technical feat that has been panned by some but that I will admit I am quite fond of listening to. The particular passage in question is harmonically complex- Johnson notes that the three diminished seventh chords account for all twelve tones- and I find that Gould’s rendition highlights the complexity rather than diminishing it. In any case, what’s remarkable about Gould’s performance is that he too deviates from Bach’s score! Gould plays measure 18 the same way as the first measure instead of with the inversion which Bach annotated. I marked the difference in the score (see oval in battleship gray, which was Gould’s favorite color).

Life is short, break the rules.

This post is dedicated to the National Endowment for the Arts and the National Endowment for the Humanities, which enrich our lives by supporting the arts and the humanities.

 

In 2014 biologist Mike Snyder published 42 papers. Some of his contributions to those papers were collaborative, but many describe experiments performed in his lab. To make sense of the number 42, one might naïvely assume that Snyder dedicated 8 days and 17 hours (on average) to each paper. That time would include the hours (days?) needed to conceive of ideas, set up and perform experiments, analyze data and write text. It's not much time, yet even so it is probably an upper bound. This is because Snyder performed the work that resulted in the 42 papers in his spare time. As the chair of the genetics department at Stanford, an administrative position carrying many responsibilities, he was burdened with numerous inter-, intra- and extra-departmental meetings, and as an internationally renowned figure in genomics he was surely called upon to assist in numerous reviews of grants and papers, to deliver invited talks at conferences, and to serve on numerous national and international panels and committees. Moreover, as a "principal investigator" (PI), Snyder was certainly tasked with writing reports about the funding he received, and with spending time applying for future funding. So how does someone write 42 papers in a year?

Snyder manages 36 postdocs,  13 research assistants,  11 research scientists, 9 visiting scientists, and 8 graduate students. One might imagine that each of the graduate students supervises four postdocs, so that the students/postdocs operate under a structure that enables independent work to proceed in the lab with only occasional supervision and input from the PI. This so-called “PI model” for biology arguably makes sense. The focus of experienced investigators on supervision allows them to provide crucial input gained from years of experience on projects that require not only creative insight, but also large amounts of tedious rote work. Modern biology is also fraught with interdisciplinary challenges, and large groups benefit from the diversity of their members.  For these reasons biology papers nowadays tend to have large numbers of authors.

It is those authors, the unsung heroes of science, that are the “cast” in Eric Lander’s recent perspective on “The Heroes of CRISPR” where he writes that

“the narrative underscores that scientific breakthroughs are rarely eureka moments. They are typically ensemble acts, played out over a decade or more, in which the cast becomes part of something greater than what any one of them could do alone.”

All of the research papers referenced in the Lander perspective have multiple authors, on average about 7, and going up to 20. When discussing papers such as this, it is therefore customary to refer to “the PI and colleagues…” in lieu of crediting all individual authors. Indeed, this phrase appears throughout Lander’s perspective, where he writes “Moineau and colleagues..”, “Siksnys and colleagues..”, etc. It is understood that this means that the PIs guided projects and provided key insights, yet left most of the work to the first author(s) who were in turn aided by others.

There is a more cynical view of the PI model, namely that by running large labs PIs are able to benefit from a scientific roulette of their own making. PIs can claim credit for successes while blaming underlings for failures. Even one success can fund numerous failures, and so the wheel spins faster and faster…the PI always wins. Of course the mad rush to win leads to an overall deterioration of quality. For example, it appears that even the Lander perspective was written in haste (Lander may not be as prolific as Snyder but according to Thomson Reuters he did publish 21 “hot” papers in 2015, placing him fourth on the list of the world’s most influential scientific minds, only a few paces behind winner Stacey Gabriel). The result is what appears to be a fairly significant typo in Figure 2 of his perspective, which shows the geography of the CRISPR story. The circle labeled “9” should be colored in blue like circle “10”, because that color represents “the final step of biological engineering to enable genome editing”. The Jinek et al. 2012 paper of circle “9” clearly states that “We further show that the Cas9 endonuclease can be programmed with guide RNA engineered as a single transcript to target and cleave any dsDNA sequence of interest”. Moreover the Jinek et al. 2013 paper described editing in mammalian cells but was submitted in 2012 (the dates in Lander’s figure are by submission), so circle “9” should also be labeled with a “10” for “Genome editing in mammalian cells”.  In other words, the figure should have looked like this:

Fig2_fixed.jpg

Lander acknowledges the reality of the PI model in the opening of his perspective, where he writes that “the Perspective describes an inspiring ensemble of a dozen or so scientists who—with their collaborators and other contributors whose stories are not elaborated here—discovered the CRISPR system, unraveled its molecular mechanisms, and repurposed it as a powerful tool for biological research and biomedicine. Together, they are the Heroes of CRISPR.”

But in choosing to highlight a “dozen or so” scientists, almost all of whom are established PIs at this point, Lander unfairly trivializes the contributions of dozens of graduate students and postdocs who may have made many of the discoveries he discusses, may have developed the key insights, and almost certainly did most of the work. For example, one of the papers mentioned in the Lander perspective is

F. Zhang*, L. Cong*, S. Lodato, S. Kosuri, G.M. Church and P. Arlotta, Efficient construction of sequence-specific TAL effectors for modulating mammalian transcription, 2011

I’ve placed a * next to the names of F. Zhang and L. Cong as is customary to do when “authors contributed equally to this work” (a quote from their paper). Yet in his perspective Lander writes that “when TALEs were deciphered, Zhang, with his collaborators Paola Arlotta and George Church…successfully repurposed them for mammals”. The omission of Cong is surprising not only because his contribution to the TALE work was equal to that of Zhang, but because he is the same Cong that is first author of Cong et al. 2013 that Lander lauds for its number of citations (Le Cong is a former graduate student of Zhang, and is now a postdoc at the Broad Institute). Another person who is absent from Lander’s perspective, yet was first author on two of the key papers mentioned is Martin Jinek. He and Charpentier’s postdoc Krzysztof Chylinski were deemed fundamental to the CRISPR story elsewhere.

A proper narrative of the history of CRISPR would include stories primarily about the graduate students and postdocs who did the heavy lifting. Still, one might imagine giving Lander the benefit of the doubt; perhaps it was simply not possible to exhaustively describe all the individuals who contributed to the 20 year CRISPR story. Maybe Lander’s intent was to draw readers into his piece by including interesting backstories in the style of the New Yorker. Mentions of cognac, choucroute garnie and Saddam Hussein certainly added spice to what would otherwise be just a litany of dates. Yet if that is the case, why are the backstories of the only two women mentioned in Lander’s perspective presented as so so so boring? The entirety of what he has to say about them is “Charpentier had earned her Ph.D in microbiology from Pasteur Institute in 1995” and “Doudna had received her Ph.D. at Harvard” (those who are interested in Charpentier and Doudna’s fascinating CRISPR story will have to forgo the journal Cell and read about The CRISPR Quandary in the New York Times). A possible answer to my question is provided here.

Lander concludes his perspective by writing that “the [CRISPR] narrative…is a wonderful lesson for a young person contemplating a life in science”. That may be true of the CRISPR narrative, but the lesson of his narrative is that young persons should not count on leaders in their field to recognize their work.

COI statement: Eric Lander is my former (informal) Ph.D. advisor. Jennifer Doudna is a current colleague and collaborator.

One of my favorite systems biology papers is the classic “Stochastic Gene Expression in a Single Cell” by Michael Elowitz, Arnold J. Levine, Eric D. Siggia and Peter S. Swain (in this post I refer to it as the ELSS paper).

What I particularly like about the paper is that it resulted from computational biology flipped. Unlike most genomics projects, where statistics and computation are servants of the data, in ELSS a statistical idea was implemented with biology, and the result was an elegant experiment that enabled a fundamental measurement to be made.

The fact that ELSS implemented a statistical idea with biology makes the statistics a natural starting point for understanding the paper. The statistical idea is what is known as the law of total variance. Given a random (response) variable C with a random covariate Z, the law of total variance states that the variance of C can be decomposed as:

Var[C] = E_Z[Var[C|Z]] + Var_Z[E[C|Z]].

There is a biological interpretation of this law that also serves to explain it: If the random variable C denotes the expression of a gene in a single cell (C being a random variable means that the expression is stochastic), and Z denotes the (random) state of a cell, then the law of total variance describes the “total noise” Var[C] in terms of what can be considered “intrinsic” (also called “unexplained”) and “extrinsic” (also called “explained”) noise or variance.

To understand intrinsic noise, first note that the expression Var[C|Z], the conditional variance, is itself a random variable; its possible values are the variances of the gene expression in the different cell states. If Var[C|Z] does not depend on Z then the expression of the gene is said to be homoscedastic, i.e., it does not depend on cell state (if it does then it is said to be heteroscedastic). Because Var[C|Z] is a random variable, the expression E_Z[Var[C|Z]] makes sense: it is simply the average variance (of the gene expression in single cells) across cell states (weighted by their probability of occurrence), hence the term "intrinsic noise" to describe it.

The expression E[C|Z] is a random variable whose possible values are the averages of the gene expression in the different cell states. Thus, Var_Z[E[C|Z]] is the variance of these averages; intuitively it can be understood to describe the noise arising from differences between cell states, hence the term "extrinsic noise" to describe it (see here for a useful interactive visualization for exploring the law of total variance).
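
To make the decomposition concrete, here is a minimal simulation sketch (not from the ELSS paper or from our manuscript; the two-state setup, function names and parameter values are arbitrary illustrations) in which the cell state Z takes one of two values and expression C is Gaussian with a state-dependent mean and spread. The point is only that the empirical total variance splits exactly into the two terms.

```python
import random

# A minimal sketch (not from the ELSS paper): check the law of total variance
#   Var[C] = E_Z[Var[C|Z]] + Var_Z[E[C|Z]]
# on simulated data. Hypothetical model: cell state Z is 0 or 1 with equal
# probability; expression C given Z is Gaussian with a state-dependent mean/sd.

random.seed(1)
N = 100000
params = {0: (10.0, 2.0), 1: (40.0, 6.0)}   # state -> (mean, sd); arbitrary values

zs = [random.randint(0, 1) for _ in range(N)]
cs = [random.gauss(*params[z]) for z in zs]

def mean(xs): return sum(xs) / len(xs)
def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

grand_mean, total = mean(cs), var(cs)
by_state = {s: [c for c, z in zip(cs, zs) if z == s] for s in (0, 1)}
weight = {s: len(by_state[s]) / N for s in (0, 1)}

intrinsic = sum(weight[s] * var(by_state[s]) for s in (0, 1))                       # E_Z[Var[C|Z]]
extrinsic = sum(weight[s] * (mean(by_state[s]) - grand_mean) ** 2 for s in (0, 1))  # Var_Z[E[C|Z]]

print(round(total, 6), round(intrinsic + extrinsic, 6))  # the two agree up to float error
```

The decomposition is an identity for the empirical distribution grouped by Z, so the two printed numbers agree to floating point precision; with the parameters above the intrinsic term is about 20 and the extrinsic term about 225.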

The idea of ELSS was to design an experiment to measure the extent of intrinsic and extrinsic noise in gene expression by inserting two identically regulated reporter genes (cyan fluorescent protein and yellow fluorescent protein) into E. coli and measuring their expression in different cells. What this provides are measurements from the following model:

Random cell states are represented by random variables Z_1,\ldots,Z_n which are independent and identically distributed, one for each of n different cells, while random variables C_1,\ldots,C_n and Y_1,\ldots,Y_n correspond to the gene expression of the cyan, respectively yellow, reporters in those cells. The ELSS experiment produces a single sample from each variable C_i and Y_i, i.e. a pair of measurements for each cell. A hierarchical model for the experiment, in which the marginal (unconditional) distributions of the C_i and Y_i are identical, allows for estimating the intrinsic and extrinsic noise from the reporter expression measurements.

The model above, on which ELSS is based, was not described in the original paper (more on that later). Instead,  in ELSS the following estimates for intrinsic, extrinsic and total noise were simply written down:

\eta^2_{int} = \frac{\frac{1}{n} \left( \sum_{i=1}^n \frac{1}{2} (c_i-y_i)^2\right) }{\overline{c} \cdot \overline{y}}, (intrinsic noise)

\eta^2_{ext} = \frac{\frac{1}{n}\sum_{i=1}^n c_i \cdot y_i - \overline{c} \cdot \overline{y}}{\overline{c} \cdot \overline{y}}, (extrinsic noise)

\eta^2_{tot} = \frac{\frac{1}{n}\sum_{i=1}^n \frac{1}{2} (c_i^2+y_i^2)-\overline{c}\cdot\overline{y}}{\overline{c} \cdot \overline{y}}. (total noise)

Here c_1,\ldots,c_n and y_1,\ldots,y_n are the measurements of cyan and yellow reporter expression, respectively, in each cell, \overline{c} = \frac{1}{n}\sum_{i=1}^n c_i and \overline{y} = \frac{1}{n}\sum_{i=1}^n y_i.
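
For readers who want to play with the formulas, here is a small sketch (my own code, not from either paper; the simulated data is a hypothetical toy model in which a shared per-cell factor supplies the extrinsic variation and independent jitter on each reporter supplies the intrinsic variation). Note that \eta^2_{int} + \eta^2_{ext} = \eta^2_{tot} identically, directly from the formulas.

```python
import random

# Sketch: the ELSS noise estimates computed from paired (cyan, yellow) measurements.
def elss_noise(c, y):
    n = len(c)
    cbar, ybar = sum(c) / n, sum(y) / n
    denom = cbar * ybar
    eta2_int = sum(0.5 * (ci - yi) ** 2 for ci, yi in zip(c, y)) / n / denom
    eta2_ext = (sum(ci * yi for ci, yi in zip(c, y)) / n - cbar * ybar) / denom
    eta2_tot = (sum(0.5 * (ci * ci + yi * yi) for ci, yi in zip(c, y)) / n - cbar * ybar) / denom
    return eta2_int, eta2_ext, eta2_tot

# Hypothetical simulated data: a shared per-cell state drives both reporters
# (extrinsic variation) and each reporter gets independent jitter (intrinsic).
random.seed(2)
n = 5000
state = [random.gauss(100.0, 20.0) for _ in range(n)]
c = [s + random.gauss(0.0, 10.0) for s in state]
y = [s + random.gauss(0.0, 10.0) for s in state]

i, e, t = elss_noise(c, y)
print(f"intrinsic={i:.4f} extrinsic={e:.4f} total={t:.4f}")  # i + e equals t by construction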

Last year, Audrey Fu, at the time a visiting scholar in Jonathan Pritchard's lab and now assistant professor in statistical science at the University of Idaho, studied the ELSS paper as part of a journal club. She noticed some inconsistencies with the proposed estimates in the paper, e.g. it seemed to her that some were biased whereas others were not, and she proceeded to investigate in more detail the statistical basis for the estimates. There had been a few papers trying to provide statistical background, motivation and interpretation for the formulas in ELSS (e.g. A. Hilfinger and J. Paulsson, Separating intrinsic from extrinsic fluctuations in dynamic biological systems, 2011), but there had not been an analysis of bias, or for that matter other statistical properties of the estimates. Audrey embarked on a post-publication peer review of the paper, and having seen such reviews on my blog contacted me to ask whether I would be interested in working with her. The project has been a fun hobby of ours for the past couple of months, eventually leading to a manuscript that we just posted on the arXiv:

Audrey Fu and Lior Pachter, Estimating intrinsic and extrinsic noise from single-cell gene expression measurements, arXiv 2016.

Our work provides what I think of as a "statistical companion" to the ELSS paper. First, we describe a formal hierarchical model which provides a useful starting point for thinking about estimators for intrinsic and extrinsic noise. With the model we re-examine the ELSS formulas, derive alternative estimators that minimize either bias or mean squared error, and revisit the intuition that underlies the extraction of intrinsic and extrinsic noise from data. Details are in the paper, but I briefly review some of the highlights here:

Figure 3a of the ELSS paper shows a scatterplot of data from two experiments, and provides a geometric interpretation of intrinsic and extrinsic noise that can guide intuition about them. We have redrawn their figure (albeit with a handful of points rather than with real data) in order to clarify the connections to the statistics:

geometry_fig

The Elowitz et al. caption correctly stated that "Each point represents the mean fluorescence intensities from one cell. Spread of points perpendicular to the diagonal line on which CFP and YFP intensities are equal corresponds to intrinsic noise, whereas spread parallel to this line is increased by extrinsic noise". While both statements are true, the one about intrinsic noise is precise whereas the one about extrinsic noise can be refined. In fact, the ELSS extrinsic noise estimate is the sample covariance (albeit biased due to a divisor of n rather than n-1), an observation made by Hilfinger and Paulsson. The sample covariance has a (well-known) geometric interpretation: specifically, we explain that it is the average (signed) area of triangles formed by pairs of data points (one of them the blue point in the figure): green triangles in Q1 and Q3 (some not shown) represent a positive contribution to the covariance and magenta triangles in Q2 and Q4 represent a negative contribution. Since most data points lie in the 1st (Q1) and 3rd (Q3) quadrants relative to the blue point, most of the contribution involving the blue point is positive. Similarly, since most pairs of data points can be connected by a line of positive slope, their positive contributions will result in a positive covariance. We also explain why the naïve intuition of extrinsic noise as the variance of points along the line c=y is problematic.
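
As a sanity check of this geometric picture, here is a tiny sketch under the following reading (my own paraphrase, so treat the conventions as assumptions): for each ordered pair of points, take the axis-parallel right triangle they span, with the area given a positive sign when the connecting line has positive slope and a negative sign otherwise; the average of these signed areas over all ordered pairs (pairs with i = j contributing zero) equals the biased, divide-by-n sample covariance.

```python
import random

# Sketch: the biased (1/n) sample covariance equals the average signed area of the
# axis-parallel right triangles spanned by ordered pairs of points, where the sign
# is that of the slope of the line through the pair.

random.seed(3)
n = 300
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [0.8 * x + random.gauss(0, 0.5) for x in xs]   # correlated toy data

xbar, ybar = sum(xs) / n, sum(ys) / n
cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n

signed_areas = [0.5 * (xs[i] - xs[j]) * (ys[i] - ys[j]) for i in range(n) for j in range(n)]
avg_area = sum(signed_areas) / (n * n)

print(round(cov, 6), round(avg_area, 6))   # identical up to floating point error
```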

The estimators we derive are summarized in the table below (Table 1 from our paper):

table_results

There is a bit of algebra required to derive the formulas in the table (see the appendices of our paper). The take home messages are that:

  1. There is a subtle assumption underlying the ELSS intrinsic noise estimator that makes sense for the experiments in the ELSS paper, but not for every type of experiment in which the ELSS formulas are currently used. This has to do with the mean expression levels of the two reporters, and we provide a rationale and criteria for when quantile normalization should be applied to the data.
  2. The ELSS intrinsic noise estimator is unbiased, whereas the ELSS extrinsic noise estimator is (slightly) biased. This asymmetry can be easily rectified with adjustments we derive.
  3. The standard unbiased estimator for variance (obtained via the Bessel correction) is frequently, and correctly, criticized for trading off mean squared error for bias. In practice, it can be more important to minimize mean squared error (MSE). For this reason we derive MSE minimizing estimators. While the MSE minimizing estimates converge quickly to the unbiased estimates (as a function of the number of cells), we believe there may be applications of the law of total variance to problems in systems biology where sample sizes are smaller, in which case our formulas may become useful.

The ELSS paper has been cited more than 3,000 times and with the recent emergence of large scale single-cell RNA-Seq the ideas in the paper are more relevant than ever. Hopefully our statistical companion to the paper will be useful to those seeking to obtain a better understanding of the methods, or those who plan to extend and apply them.

I recently read a “brainiac” column in the Boston Globe titled “For 40 years, computer scientists looked for a solution that doesn’t exist” that caused me to facepalm so violently I now have pain in my right knee.

The article is about the recent arXiv posting:

Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false) by Arturs Backurs and Piotr Indyk (first posted in December 2014 and most recently updated in April 2015).

In the preprint, which was presented at STOC, Backurs and Indyk show that computing edit distance sub-quadratically is hard in the sense that if it were possible to compute the edit distance between two strings of length n in time O(n^{2-\delta}) for some constant \delta > 0 then it would be possible to solve the satisfiability of conjunctive normal form formulas in time that would violate the Strong Exponential Time Hypothesis (SETH). This type of "lower bound" reduction is common practice in theoretical computer science. In this case the preprint is of interest because (a) it sheds light on the complexity of a computational problem (edit distance) that has been extensively studied by theoretical computer scientists and (b) it extends the relevance of SETH and sheds more light on what the hypothesis implies.

In the introduction to the preprint Backurs and Indyk motivate their study by writing that "Unfortunately, that [standard dynamic programming] algorithm runs in quadratic time, which is prohibitive for long sequences"  and note that "considerable effort has been invested into designing faster algorithms, either by assuming that the edit distance is bounded, by considering the average case or by resorting to approximation". They mention that the fastest known exact algorithm is faster than quadratic time, running in O(n^2/\log n), but note that "only barely so". This is all true, and such preamble is typical for theoretical computer science, setting the stage for the result. In one single parenthetical remark intended to relate their work to applications they note that "the human genome consists of roughly 3 billions base pairs" and that exact edit distance computation is currently prohibitive for such long sequences.  This is a true sentence, but also not very precise. More on that later in the post…

The parenthetical genome comment in the preprint was picked up on in an MIT press release describing the results and implications of the paper. While the press release does a very good job explaining the reduction of Backurs and Indyk in an accessible way, it also picks up on the parenthetical genome comment and goes a bit further. The word “genome” appears three times, and it is stated that “a computer running the existing algorithm would take 1,000 years to exhaustively compare two human genomes.” This is technically true, but two human genomes are very similar, so that exhaustive comparison is not necessary to compute the edit distance. In fact, two human genomes are 99.9% identical, and with strong prior upper bounds on the edit distance the computation of edit distance is tractable.
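
To make the "strong prior upper bound" point concrete, here is a sketch (standard banded dynamic programming; not code from the preprint or the press release, and the function name and parameters are mine) that computes the edit distance exactly whenever it is at most k, by filling only the diagonal band of width 2k+1 in the Wagner-Fischer table. For two sequences of length n that are 99.9% identical a small k suffices, so the running time is O(nk) rather than O(n²).

```python
import random

def banded_edit_distance(a, b, k):
    """Exact edit distance between a and b if it is at most k, otherwise None.
    Only cells with |i - j| <= k of the DP table are filled: O(len(a)*k) time."""
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None
    INF = k + 1
    prev = [j if j <= k else INF for j in range(m + 1)]   # row 0, restricted to the band
    for i in range(1, n + 1):
        lo, hi = max(1, i - k), min(m, i + k)
        curr = [INF] * (m + 1)
        if i <= k:
            curr[0] = i                                   # first column, still inside the band
        for j in range(lo, hi + 1):
            curr[j] = min(prev[j] + 1,                    # deletion
                          curr[j - 1] + 1,                # insertion
                          prev[j - 1] + (a[i - 1] != b[j - 1]))  # match / mismatch
        prev = curr
    return prev[m] if prev[m] <= k else None

# Toy demo: a 100,000-letter sequence with a handful of substitutions.
random.seed(4)
a = "".join(random.choice("ACGT") for _ in range(100000))
b = list(a)
for pos in random.sample(range(len(b)), 5):
    b[pos] = random.choice("ACGT")
print(banded_edit_distance(a, "".join(b), k=20))   # a small number (at most 5)
```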

In terms of different species, the edit distance between complete genomes is completely irrelevant, because organisms undergo large-scale segmental duplications and rearrangements over the timescales of speciation. The MIT press release did not claim the Backurs and Indyk result was relevant for comparing genomes of species, but… the Boston Globe column states that

“By calculating the edit distance between the genomes of two organisms, biologists can estimate how far back in evolutionary time the organisms diverged from each other.”

No.

Actually, even for closely related genomes edit distance is not the relevant statistic in biology, but rather the alignment of the genomes. This is why biological sequence analysis is based on what is called the Needleman-Wunsch algorithm and not the Wagner-Fischer algorithm mentioned in the MIT press release. Moreover, because the length of sequences for which it is relevant to compute alignments is bounded and rather small (usually a few tens or hundreds of thousands of base pairs at most), the real computational issue for biological sequence analysis is not running time but memory. The Needleman-Wunsch and Wagner-Fischer algorithms require O(n^2) space, and the algorithmic advance that made edit distance and alignment computations for biological sequence comparison tractable was Hirschberg's algorithm, which requires only linear space at a cost of only a factor of 2 in time.
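
The memory point is easy to see in code: the Wagner-Fischer recurrence only ever looks one row back, so the score can be computed keeping just two rows. The sketch below (my own, and only the distance computation; Hirschberg's contribution was to combine such forward and reverse passes with divide-and-conquer so that the alignment itself, not just the score, can be recovered in linear space) illustrates this.

```python
def edit_distance_two_rows(a, b):
    """Wagner-Fischer recurrence keeping only two rows of the DP table:
    O(len(a)*len(b)) time but O(min(len(a), len(b))) space."""
    if len(b) > len(a):
        a, b = b, a                                     # keep the shorter string along the rows
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            curr[j] = min(prev[j] + 1,                  # deletion
                          curr[j - 1] + 1,              # insertion
                          prev[j - 1] + (ca != cb))     # match / mismatch
        prev = curr
    return prev[-1]

print(edit_distance_two_rows("kitten", "sitting"))   # 3
```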

Thus, the parenthetical comment of Backurs and Indyk, intended to merely draw attention of the reader to the possible practical relevance of the edit distance problem, mutated into utter rubbish by the time the work was being discussed in the Boston Globe. If that was the only issue with the brainiac column I might have just hurt my hand when I facepalmed, but the following paragraph caused me the real pain:

Because it’s computationally infeasible to compute the edit distance between two human genomes, biologists have to rely on approximations. They’d love a faster method, but Indyk and his coauthor, MIT graduate student Arturs Backurs, recently demonstrated a faster method is impossible to create. And by impossible they don’t mean “very hard” or “our technology has to improve first.” They mean something like, “by the laws of the mathematics, it can’t be done.”

We’ve already gone through point #1 but amazingly nothing is true in this paragraph:

  1. No, it is not computationally infeasible to compute the edit distance between two human genomes and no, biologists do not have to rely on approximations. But then there is also the fact that…
  2. Backurs and Indyk did not, in fact, demonstrate that a “faster method is impossible to create”. What they showed is that reducing the exponent in the quadratic algorithm would require SETH to be false. It turns out there are some respectable computer scientists that believe that SETH is indeed false. So…
  3. No, Backurs and Indyk, who understand mathematics quite well… most certainly do not believe that “by the laws of the mathematics, it can’t be done”. Actually,
  4. I’d bet that Backurs and Indyk don’t really believe that mathematics has laws. Axioms yes.. but laws…

I find inaccurate reporting of science to be highly problematic. How can we expect the general public to believe scientists' claims about serious issues, e.g. the prospects for climate change, when news reports are filled with hype and rubbish about less serious issues, e.g. the complexity of edit distance?

There is another issue that this example highlights. While complexity theory is extremely important and fundamental for computer science, and while a lot of the work in complexity has practical applications, there has been a consistent hyping of implications for biology where complexity theory is simply not very relevant. When I started my career it was fashionable to talk about protein folding being NP-complete. But of course millions of proteins were probably being folded in my body while I was writing this blog post. Does the fact that I am alive imply that P=NP?

Of course, exploring the complexity of simplified models for biological systems, such as the HP-model for protein folding that the above reference links to, has great value. It helps to sharpen the questions, to better specify models, and to define the limits of computational tractability for computational biology problems. But it is worth keeping in mind that in biology n does not go to infinity.

IMG_7693

"Why, sometimes I've believed as many as six impossible things before breakfast." – The White Queen from Through the Looking Glass by Lewis Carroll.

p0


p1

Two weeks ago in my post Pachter's P-value Prize I offered {\bf \frac{\$100}{p}} for justifying a reasonable null model and a p-value (p) associated to the statement "Strikingly, 95% of cases of accelerated evolution involve only one member of a gene pair, providing strong support for a specific model of evolution, and allowing us to distinguish ancestral and derived functions" in the paper

M. Kellis, B.W. Birren and E.S. Lander, Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae, Nature 2004 (hereafter referred to as the KBL paper).

Today I am happy to announce the winner of the prize. But first, I want to thank the many readers of the blog who offered comments (>135 in total) that are extraordinary in their breadth and depth, and that offer a glimpse of what scientific discourse can look like when not restricted to traditional publishing channels. You have provided a wonderful public example of what "peer review" should mean. Coincidentally, and to answer one of the questions posted, the blog surpassed one million views this past Saturday (the first post was on August 19th, 2013), a testament to the fact that the collective peer reviewing taking place on these pages is not only of very high quality, but is also having an impact.

I particularly want to thank the students who had the courage to engage in the conversation, and also the faculty who published comments using their names. In that regard, I admire and commend Joshua Plotkin and Hunter Fraser for deciding to deanonymize themselves by agreeing to let me announce here that they were the authors of the critique sent to the authors in April 2004 and initially posted as an anonymous comment on the blog.

The discussion on the blog was extensive, touching on many interesting issues, and I summarize only a few of the threads of discussion here. I decided to address a number of the key points made in order to provide context and justification for my post and for the selection of the prize winner.

The value of post-publication review

One of the comments made in response to my post that I'd like to respond to first was by an author of KBL, who dismissed the entire premise of my challenge, writing "We can keep debating this after 11 years, but I'm sure we all have much more pressing things to do (grants? papers? family time? attacking 11-year-old papers by former classmates? guitar practice?)"

This comment exemplifies the proclivity of some authors to view publication as the encasement of work in a casket, buried deeply so as to never be opened again lest the skeletons inside it escape. But is it really beneficial to science that much of the published literature has become, as Ferguson and Heene noted, a vast graveyard of undead theories? Surely the varied and interesting comments posted in response to my challenge (totaling >25,000 words and 50 pages in Arial 11 font) demonstrate the value of communal discussion of science after publication.

For the record, this past month I did submit a paper and also a grant, and I did spend lots of time with my family. I didn’t practice the guitar but I did play the piano. Yet in terms of research, for me the highlight of the month was reading and understanding the issues raised in the comments to my blog post. Did I have many other things to do? Sure. But what is more pressing than understanding if the research one does is to be meaningful?

The null model

A few years ago I introduced a new two-semester freshman math course at UC Berkeley for intended biology majors called "Math 10 - Methods of Mathematics: Calculus, Statistics and Combinatorics". One of the key ideas we focus on in the first semester is that of a p-value. The idea of measuring the significance of a biological result via a statistical computation involving probabilities is somewhat unnatural, and feedback from the students confirms what one might expect: that the topic of p-values is among the hardest in the course. Math for biologists turns out to be much harder than calculus. I believe that at Berkeley we are progressive in emphasizing the importance of statistics for biology majors at the outset of their education (to be clear, this is a recent development). The prevailing state is that of statistical illiteracy, and the result is that p-values are frequently misunderstood, abused, and violated in just about every possible way imaginable (see, e.g., here, here and here).

P-values require a null hypothesis and a test statistic, and of course one of the most common misconceptions about them is that when they are large they confirm that the null hypothesis is correct. No! And worse, a small p-value cannot be used to accept an alternative to the null, only to (confidently) reject the null itself. And rejection of the null comes with numerous subtle issues and caveats (see arguments against the p-value in the papers mentioned above). So what is the point?

I think the KBL paper makes for an interesting case study of when p-values can be useful. For starters, the construction of a null model is already a useful exercise, because it is a thought experiment designed to test one's understanding of the problem at hand. The senior author of the KBL paper argues that "we were interested in seeing whether, for genes where duplication frees up at least one copy to evolve rapidly, the evidence better fits one model ("Ohno": only one copy will evolve quickly) or an alternative model (both genes will evolve quickly)." While I accept this statement at face value, it is important to acknowledge that if there is any science to data science, it is the idea that when examining data one must think beyond the specific hypotheses being tested and consider alternative explanations. This is the essence of what my colleague Ian Holmes is saying in his comment. In data analysis, thinking outside of the box (by using statistics) is not optional. If one is lazy and resorts to intuition then, as Páll Melsted points out, one is liable to end up with fantasy.

The first author of KBL suggests that the "paper was quite explicit about the null model being tested." But I was unsure of whether to assume that the one-gene-only-speeds-up model is the null, based on "we sought to distinguish between the Ohno one-gene-only speeds-up (OS) model and the alternative both-genes speed-up (BS) model", or whether the null was the BS model because "the Ohno model is 10^87 times more likely, leading to significant rejection of the BS null"? Or was the paper being explicit about not having a null model at all, because "Two alternatives have been proposed for post-duplication", or was it the opposite, i.e. two null models: "the OS and BS models are each claiming to be right 95% of the time"? I hope I can be forgiven for failing, despite trying very hard, to identify a null model in either the KBL paper, or the comments of the authors on my blog.

There is however a reasonable null model, and it is the "independence model", which, to be clear, is the model where each gene after duplication "accelerates" independently with some small probability (80/914). The suggestions that "the independence model is not biologically rooted" or that it "would predict that only 75% of genes would be preserved in at least one copy, and that 26% would be preserved in both copies" are of course absurd, as explained by Erik van Nimwegen, who makes the point clearly and carefully. The fact that many entries reached the same conclusion about the suitable null model in this case is reassuring. I think it qualifies as a "reasonable model" (thereby passing the threshold for my prize).

The p-value

One of my favorite missives about p-values is by Andrew Gelman, who in “P-values and statistical practice” discusses the subtleties inherent in the use and abuse of p-values. But as my blog post illustrates, subtlety is one thing, and ignorance is an entirely different matter. Consider for example, the entry by Manolis Kellis who submitted that p = 10^{-87} thus claiming that I owe him 903,659,165 million billion trillion quadrillion quintillion sextillion dollars (even more than the debt of the United States of America). His entry will not win the prize, although the elementary statistics lesson that follows is arguably worth a few dollars (for him). First, while it is true that a p-value can be computed from the (log) likelihood ratio when the null hypothesis is a special case of the alternative hypothesis (using the chi^2 distribution), the ratio of two likelihoods is not a p-value! Probabilities of events are also not p-values! For example, the comment that “I calculated p-values for the exact count, but the integral/sum would have been slightly better” is a non-starter. Even though KBL was published in 2004, this is apparently the level of understanding of p-values of one of the authors, a senior computational biologist and professor of computer science at MIT in 2015. Wow.

So what is "the correct" p-value? It depends, of course, on the test statistic. Here is where I will confess that, like many professors, I had an answer in mind before asking the question. I was thinking specifically of the setting that leads to 0.74 (or 0.72/0.73, depending on roundoff and approximation). Many entries came up with the same answer I had in mind and therefore when I saw them I was relieved: I owed $135, which is what I had budgeted for the exercise. I was wrong. The problem with the answer 0.74 is that it is the answer to the specific question: what is the probability of seeing 4 or fewer pairs accelerate in both copies out of the 76 pairs in which at least one copy accelerated. A better test statistic was proposed by Pseudo, who asked for the probability that 5% or fewer of the pairs with at least one accelerated gene show acceleration in both, when examining data generated from the null model with 457 pairs. This is a subtle but important distinction, and provides a stronger result (albeit with a smaller p-value). The KBL result is not striking even forgoing the specific numbers of genes measured to have accelerated in at least one pair (of course, just because p=0.64 does not mean the independence model is correct). What it means is that the data as presented simply weren't "striking".
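
For readers who want to check the numbers, here is a sketch of both calculations under the independence null in which each of the 914 genes accelerates with probability 80/914 (my own reconstruction; in particular the Monte Carlo is only an approximation of Pseudo's entry, and the exact value it lands on depends on details of the conditioning that I have had to guess):

```python
import math
import random

# Independence null: each of the 914 genes "accelerates" independently with
# probability q = 80/914. Conditional on a pair having at least one accelerated
# gene, the probability that both are accelerated is r = q^2/(1-(1-q)^2) = q/(2-q),
# so the expected "one member only" fraction is 1 - r, which is already about 95%.
q = 80 / 914
r = q / (2 - q)

# Test statistic 1: among 76 pairs with at least one accelerated gene, the
# probability of seeing 4 or fewer pairs with both accelerated (exact binomial tail).
p1 = sum(math.comb(76, j) * r**j * (1 - r) ** (76 - j) for j in range(5))
print(round(p1, 3))   # about 0.73, i.e. the 0.72-0.74 range discussed above

# Test statistic 2 (an approximation of Pseudo's): simulate 457 pairs under the null
# and ask how often 95% or more of the pairs with at least one accelerated gene
# show acceleration in only one member.
random.seed(5)
trials, hits = 10000, 0
for _ in range(trials):
    both = one = 0
    for _ in range(457):
        g1, g2 = random.random() < q, random.random() < q
        if g1 and g2:
            both += 1
        elif g1 or g2:
            one += 1
    if both + one > 0 and one / (both + one) >= 0.95:
        hits += 1
print(hits / trials)   # in the same ballpark as the 0.64 quoted above
```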

One caveat in the above analysis is that the arbitrary threshold used to declare “acceleration” is problematic. For example, one might imagine that other thresholds produce more convincing results, i.e. farther from the null, but of course even if that were true the use of an arbitrary cutoff was a poor approach to analysis of the data. Fortunately, regarding the specific question of its impact in terms of the analysis, we do not have to imagine. Thanks to the diligent work of Erik van Nimwegen, who went to the effort of downloading the data and reanalyzing it with different thresholds (from 0.4 to 1.6), we know that the null cannot be rejected even with other thresholds.

The award

There were many entries submitted and I read them all. My favorite was by Michael Eisen for his creative use of multiple testing correction, although I’m happier with the direction that yields $8.79. I will not be awarding him the prize though, because his submission fails the test of “reasonable”, although I will probably take him out to lunch sometime at Perdition Smokehouse.

I can't review every single entry, or this post, which is already too long, would become unbearable, but I did think long and hard about the entry of K. It doesn't directly answer the question of whether the 95% number is striking, nor do I completely agree with some of the assumptions (e.g. if neither gene in a pair accelerates then the parent gene was not accelerated pre-WGD). But I'll give the entry an honorable mention.

The prize will be awarded to Pseudo for defining a reasonable null model and test statistic, and producing the smallest p-value within that framework. With a p-value of 0.64 I will be writing a check in the amount of $156.25. Congratulations Pseudo!!

The biology

One of the most interesting results of the blog post was, in my opinion, the extensive discussion about the truth. Leaving aside the flawed analysis of KBL, what is a reasonable model for evolution post-WGD? I am happy to see the finer technical details continue to be debated, and the intensity of the conversation is awesome! Pavel Pevzner’s cynical belief that “science fiction” is not a literary genre but rather a description of what is published in the journal Science may be realistic, but I hope the comments on my blog can change his mind about what the future can look like.

In lieu of trying to summarize the scientific conversation (frankly, I don’t think I could do justice to some of the intricate and excellent arguments posted by some of the population geneticists) I’ll just leave readers to enjoy the comment threads on their own. Comments are still being posted, and I expect the blog post to be a “living” post-publication review for some time. May the skeletons keep finding a way out!

The importance of wrong

Earlier in this post I admitted to being wrong. I have been wrong many times. Even though I’ve admitted some of my mistakes on this blog and elsewhere in talks, I would like to joke that I’m not going to make it easy for you to find other flaws in my work. That would be a terrible mistake. Saying “I was wrong” is important for science and essential for scientists. Without it people lose trust in both.

I have been particularly concerned with a lack of "I was wrong" in genomics. Unfortunately, there is a culture that has developed among "leaders" in the field in which the three words admitting error or wrongdoing are taboo. The recent paper of Lin et al. critiqued by Gilad-Mizrahi is a good example. Leaving aside the question of whether the result in the paper is correct (there are strong indications that it isn't), Gilad and Mizrahi began their critique on twitter by noting that the authors had completely failed to account for, or even discuss, batch effects. Nobody, and I mean nobody who works on RNA-Seq, would imagine for even a femtosecond that this is ok. It was a major oversight and mistake. The authors, any of them really, could have just come out and said "I was wrong". Instead, the last author on the paper, Mike Snyder, told reporters that "All of the sequencing runs were conducted by the same person using the same reagents, lowering the risk of unintentional bias". Seriously?

Examples abound. The "ENCODE 80% kerfuffle" involved claims that "80% of the genome is functional". Any self-respecting geneticist recognizes such headline grabbing as rubbish. Ewan Birney, a distinguished scientist who has had a major impact on genomics, having been instrumental in the ENSEMBL project and many other high-profile bioinformatics programs, defended the claim on the BBC:

“EB: Ah, so, I don’t — It’s interesting to reflect back on this. For me, the big important thing of ENCODE is that we found that a lot of the genome had some kind of biochemical activity. And we do describe that as “biochemical function”, but that word “function” in the phrase “biochemical function”is the thing which gets confusing. If we use the phrase “biochemical activity”, that’s precisely what we did, we find that the different parts of the genome, [??] 80% have some specific biochemical event we can attach to it. I was often asked whether that 80% goes to 100%, and that’s what I believe it will do. So, in other words, that number is much more about the coverage of what we’ve assayed over the entire genome. In the paper, we say quite clearly that the majority of the genome is not under negative selection, and we say that most of the elements are not under pan-mammalian selection. So that’s negative selection we can detect between lots of different mammals. [??] really interesting question about what is precisely going on in the human population, but that’s — you know, I’m much closer to the instincts of this kind of 10% to 20% sort of range about what is under, sort of what evolution cares about under selection.”

This response, and others by members of the ENCODE consortium, upset many people who may struggle to tell apart white and gold from blue and black, but certainly know that white is not black and black is not white. Likewise, I suspect the response of KBL to my post disappointed many as well. For Fisher's sake, why not just acknowledge what is obvious and true?

The personal critique of professional conduct

A conversation topic that emerged as a result of the blog (mostly on other forums) is the role of style in online discussion of science. Specifically, the question of whether personal attacks are legitimate has come up previously on my blog pages and also in conversations I’ve had with people. Here is my opinion on the matter:

Science is practiced by human beings. Just like with any other human activity, some of the humans who practice it are ethical while others are not. Some are kind and generous while others are… not. Occasionally scientists are criminal. Frequently they are honorable. Of particular importance is the fact that most scientists’ behavior is not at any of these extremes, but rather a convex combination of the mentioned attributes and many others.

In science it is people who benefit, or are hurt, by the behavior of scientists. Preprints on the bioRxiv do not collect salaries, the people who write them do. Papers published in journals do not get awarded or rejected tenure, people do. Grants do not get jobs, people do. The behavior of people in science affects… people.

Some argue for a de facto ban on discussing the personal behavior of scientists. I agree that the personal life of scientists is off limits. But their professional life shouldn’t be. When Bernie Madoff fabricated gains of $65 billion it was certainly legitimate to criticize him personally. Imagine if that was taboo, and instead only the technical aspects of his Ponzi scheme were acceptable material for public debate. That would be a terrible idea for the finance industry, and so it should be for science. Science is not special among the professions, and frankly, the people who practice it hold no primacy over others.

I therefore believe it is not only acceptable but imperative to critique the professional behavior of persons who are scientists. I also think that doing so will help eliminate the problematic devil–saint dichotomy that persists with the current system. Having developed a culture in which personal criticism is outlawed in scientific conversations while only science is fair fodder for public discourse, we now have a situation where scientists are all presumed to be living Gods, or else serious criminals to be outlawed and banished from the scientific community. Acknowledging that there ought to be a grey zone, and developing a healthy culture where critique of all aspects of science and scientists is possible and encouraged would relieve a lot of pressure within the current system. It would also be more fair and just.

A final wish

I wish the authors of the KBL paper would publish the reviews of their paper on this blog.

In this blog post I offer a cash prize for computing a p-value [update June 9th: the winner has been announced!]. For details about the competition you can skip directly to the challenge. But context is important:

Background

I’ve recently been reading a bioRxiv posting by X. Lan and J. Pritchard, Long-term survival of duplicate genes despite absence of subfunctionalized expression (2015) that examines the question of whether gene expression data (from human and mouse tissues) supports a model of duplicate preservation by subfunctionalization.

The term subfunctionalization refers to a hypothesis for explaining the ubiquity of persistence of gene duplicates in extant genomes. The idea is that gene pairs arising from a duplication event evolve, via neutral mutation, different functions that are distinct from their common ancestral gene, yet together recapitulate the original function. It was introduced in 1999 as an alternative to the older hypothesis of neofunctionalization, which posits that novel gene functions arise by virtue of "retention" of one copy of a gene after duplication, while the other copy morphs into a new gene with a new function. Neofunctionalization was first floated as an idea to explain gene duplicates in the context of evolutionary theory by Haldane and Fisher in the 1930s, and was popularized by Ohno in his book Evolution by Gene Duplication published in 1970. The cartoon below helps to understand the difference between the *functionalization hypotheses (adapted from wikipedia):

Neo- and subfunctionalization
Lan and Pritchard examine the credibility of the sub- and neofunctionalization hypotheses using modern high-throughput gene expression (RNA-Seq) data: in their own words “Based on theoretical models and previous literature, we expected that–aside from the youngest duplicates–most duplicate pairs would be functionally distinct, and that the primary mechanism for this would be through divergent expression profiles. In particular, the sub- and neofunctionalization models suggest that, for each duplicate gene, there should be at least one tissue where that gene is more highly expressed than its partner.”

What they found was, in their words, that "surprisingly few duplicate pairs show any evidence of sub-/neofunctionalization of expression." They went further, stating that "the prevailing model for the evolution of gene duplicates holds that, to survive, duplicates must achieve non-redundant functions, and that this usually occurs by partitioning the expression space. However, we report here that sub-/neofunctionalization of expression occurs extremely slowly, and generally does not happen until the duplicates are separated by genomic rearrangements. Thus, in most cases long-term survival must rely on other factors." They propose instead that "following duplication the expression levels of a gene pair evolve so that their combined expression matches the optimal level. Subsequently, the relative expression levels of the two genes evolve as a random walk, but do so slowly (33) due to constraint on their combined expression. If expression happens to become asymmetric, this reduces functional constraint on the minor gene. Subsequent accumulation of missense mutations in the minor gene may provide weak selective pressure to eventually eliminate expression of this gene, or may free the minor gene to evolve new functions."

The Lan and Pritchard paper is the latest in a series of works that examine high-browed evolutionary theories with hard data, and that are finding reality to be far more complicated than the intuitively appealing, yet clearly inadequate, hypotheses of neo- and subfunctionalization. One of the excellent papers in the area is

Dean et al. Pervasive and Persistent Redundancy among Duplicated Genes in Yeast, PLoS Genetics, 2008.

where the authors argue that in yeast “duplicate genes do not often evolve to behave like singleton genes even after very long periods of time.” I mention this paper, from the Petrov lab, because its results are fundamentally at odds with what is arguably the first paper to provide genome-wide evidence for neofunctionalization (also in yeast):

M. Kellis, B.W. Birren and E.S. Lander, Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae, Nature 2004.

At the time, the Kellis-Birren-Lander paper was hailed as containing "work that may lead to better understanding of genetic diseases", and in the press release Kellis stated that "understanding the dynamics of genome duplication has implications in understanding disease. In certain types of cancer, for instance, cells have twice as many chromosomes as they should, and there are many other diseases linked to gene dosage and misregulation." He added that "these processes are not much different from what happened in yeast," and the author of the press release added that "whole genome duplication may have allowed other organisms besides yeast to achieve evolutionary innovations in one giant leap instead of baby steps. It may account for up to 80 percent (seen this number before?) of flowering plant species and could explain why fish are the most diverse of all vertebrates."

This all brings me to:

The challenge

In the abstract of their paper, Kellis, Birren and Lander wrote that:

Strikingly, 95% of cases of accelerated evolution involve only one member of a gene pair, providing strong support for a specific model of evolution, and allowing us to distinguish ancestral and derived functions.” [boldface by authors]
In the main text of the paper, the authors expanded on this claim, writing:

Strikingly, in nearly every case (95%), accelerated evolution was confined to only one of the two paralogues. This strongly supports the model in which one of the paralogues retained an ancestral function while the other, relieved of this selective constraint, was free to evolve more rapidly”.

The word “strikingly” suggests a result that is surprising in its statistical significance with respect to some null model the authors have in mind. The data is as follows:

The authors identified 457 duplicated gene pairs that arose by whole genome duplication (for a total of 914 genes) in yeast. Of the 457 pairs, 76 showed accelerated (protein) evolution in S. cerevisiae. The term "accelerated" was defined to relate to amino acid substitution rates in S. cerevisiae, which were required to be 50% faster than those in another yeast species, K. waltii. Of the 76 pairs, only four were accelerated in both paralogs. Therefore 72 gene pairs showed acceleration in only one paralog (72/76 = 95%).

So, is it indeed "striking" that "in nearly every case (95%), accelerated evolution was confined to only one of the two paralogues"? Well, the authors don't provide a p-value in their paper, nor do they propose a null model with respect to which the question makes sense. So I am offering a prize to help crowdsource what should have been an exercise undertaken by the authors, or if not, a requirement demanded by the referees. To incentivize people in the right direction,

I will award {\bf \frac{\$100}{p}}

to the person who can best justify a reasonable null model, together with a p-value (p) for the phrase "Strikingly, 95% of cases of accelerated evolution involve only one member of a gene pair" in the abstract of the Kellis-Birren-Lander paper. Notice that the smaller the (justifiable) p-value someone can come up with, the larger the prize will be.

Bonus: explain in your own words how you think the paper was accepted to Nature without the authors having to justify their use of the word “strikingly” for a main result of the paper, and in a timeframe consisting of submission on December 17th 2003 (just three days before Hanukkah and one week before Christmas) and acceptance January 19th 2004 (Martin Luther King Jr. day).

Rules

To be eligible for the prize, entries must be submitted as comments on this blog post by 11:59pm EST on Sunday, June 7th, 2015 (extended from the original deadline of May 31st), and they must be submitted with a valid e-mail address. I will keep the name (and e-mail address) of the winner anonymous if they wish (this can be ensured by using a pseudonym when submitting the entry as a comment). The prize, if awarded, will go to the person submitting the most complete, best explained solution that has a p-value calculation that is correct according to the model proposed. Preference will be given to submissions from students, especially undergraduates, but individuals in any stage of their career, and from anywhere in the world, are encouraged to submit solutions. I reserve the right to interpret the phrase "reasonable null model" in a way that is consistent with its use in the scientific community, and I reserve the right to not award the prize if no good/correct solutions are offered. Participants do not have to answer the bonus question to win.


Last year I came across a wonderful post on the arXiv, a paper titled A new approach to enumerating statistics modulo n written by William Kuszmaul while he was a high school student participating in the MIT Primes Program for Research in Mathematics. Among other things, Kuszmaul solved the problem of counting the number of subsets of the n-element set \{1,2,\ldots,n\} that sum to k mod m.

This counting problem is related to beautiful (elementary) number theory and combinatorics, and is connected to ideas in (error correcting) coding theory and even computational biology (the Burrows-Wheeler transform). I'll tell the tale, but first explain my personal connection to it, which is the story of my first paper that wasn't, and of how I was admitted to graduate school:

In 1993 I was a junior (math major) at Caltech and one of my best friends was a senior, Nitu Kitchloo, now an algebraic topologist and professor of mathematics at Johns Hopkins University. I had a habit of asking Nitu math questions, and he had a habit of solving them. One of the questions I asked him that year was:

How many subsets of the n-element set \{1,2,\ldots,n\} sum to 0 mod n?

I don't remember exactly why I was thinking of the problem, but I do remember that Nitu immediately started looking at the generating function (the polynomial whose coefficients count the number of subsets with each given sum) and magic happened quickly thereafter. We eventually wrote up a manuscript whose main result was the enumeration of N_n^k, the number of subsets of \{1,2,\ldots,n\} whose elements sum to k mod n. The formula we found was

N_n^k = \frac{1}{n} \sum_{s|n; s \,odd}2^{\frac{n}{s}}\frac{\varphi(s)}{\varphi\left(\frac{s}{(k,s)}\right)}\mu\left(\frac{s}{(k,s)} \right).

We were particularly tickled that the formula contained both Euler’s totient function and the Möbius function. It seemed nice. So we decided to submit our little result to a journal.
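
The formula is easy to check by brute force for small n. Here is a sketch (with hand-rolled totient and Möbius functions so that nothing beyond the standard library is needed; function names are mine) comparing it to direct enumeration of all 2^n subsets:

```python
from math import gcd

# Brute-force check of the closed formula for N_n^k (the number of subsets of
# {1,...,n} summing to k mod n) against direct enumeration of all 2^n subsets.

def phi(n):
    # Euler's totient function, by trial division
    result, d, m = n, 2, n
    while d * d <= m:
        if m % d == 0:
            while m % d == 0:
                m //= d
            result -= result // d
        d += 1
    if m > 1:
        result -= result // m
    return result

def mobius(n):
    # Moebius function: 0 if n has a squared prime factor, else (-1)^(#prime factors)
    count, d, m = 0, 2, n
    while d * d <= m:
        if m % d == 0:
            m //= d
            if m % d == 0:
                return 0
            count += 1
        d += 1
    if m > 1:
        count += 1
    return -1 if count % 2 else 1

def N_formula(n, k):
    total = 0
    for s in range(1, n + 1):
        if n % s == 0 and s % 2 == 1:          # odd divisors s of n
            t = s // gcd(k, s)
            total += 2 ** (n // s) * phi(s) // phi(t) * mobius(t)
    assert total % n == 0
    return total // n

def N_bruteforce(n, k):
    count = 0
    for mask in range(2 ** n):
        subset_sum = sum(i + 1 for i in range(n) if mask >> i & 1)
        count += subset_sum % n == k
    return count

for n in range(1, 13):
    for k in range(n):
        assert N_formula(n, k) == N_bruteforce(n, k)
print("formula matches direct enumeration for all n up to 12 and all k")
```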

By now months had passed and Nitu had already left Caltech for graduate school. He left me to submit the paper and I didn't know where, so I consulted the resident combinatorics expert (Rick Wilson), who told me that he liked the result and hadn't seen it before, but that before sending it off, just in case, I should consult with this one professor at MIT who was known to be good at counting. I remember Wilson saying something along the lines of "Richard knows everything". This was in the nascent days of the web, so I hand wrote a letter to Richard Stanley, enclosed a copy of the manuscript, and mailed it off.

A few weeks later I received a hand-written letter from MIT. Richard Stanley had written back to let us know that he very much liked the result… because it had brought back memories of his time as an undergraduate at Caltech, when he had worked on the same problem! He sent me the reference:

R.P. Stanley and M.F. Yoder, A study of Varshamov codes for asymmetric channels, JPL Technical Report 32-1526, 1973.

Also included was a note that said “Please consider applying to MIT”.

Stanley and Yoder’s result concerned single-error-correcting codes for what are known as Z-channels. These are communication links where there is an asymmetry in the fidelity of transmission: 0 is reliably transmitted as 0 (probability 1), but 1 may be transmitted as either a 1 or a 0 (with some probability p). In a 1973 paper, A class of codes for asymmetric channels and a problem from the additive theory of numbers, published in the IEEE Transactions on Information Theory, R. Varshamov had proposed a single-error-correcting code for such channels, which was essentially to encode a message by a bit string corresponding to a subset of \{1,\ldots,n\} whose elements sum to d mod (n+1). Since zeroes are transmitted faithfully, a single error can be detected (and corrected) by summing the elements corresponding to the bit string. Stanley and Yoder’s paper addressed questions related to enumerating the number of codewords. In particular, they were basically working out the solution to the problem Nitu and I had considered. I guess we could have published our paper anyway, as we had a few additional results, for example a theorem explaining how to enumerate zero-summing subsets of finite Abelian groups, but somehow we never did. There is a link to the manuscript we had written on my website:

N. Kitchloo and L. Pachter, An interesting result about subset sums, 1993.

One generalization we had explored in our paper was the enumeration of what we called N_{n,m}^k, the number of subsets of the n-element set \{1,2,\ldots,n\} that sum to k mod m. We looked specifically at the problem where m|n and proved that

N_{n,m}^n = \frac{1}{m}\sum_{s|m; s \, odd} 2^{\frac{n}{s}}\varphi(s).

What Kuszmaul succeeded in doing was to extend this result to any m < n, which is very nice for two reasons: first, the work completes the investigation of the question of subset sums (to subsets summing to an arbitrary modulus). Second, and more importantly, the technique used is that of thinking more generally about “modular enumeration”, which is the problem of finding remainders of polynomials modulo x^n-1. This led him to numerous other applications, including results on q-multinomial and q-Catalan numbers, and to the combinatorics of lattice paths. This is the hallmark of excellent mathematics: a proof technique that sheds light on the problem at hand and many others.
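
To make the modular enumeration idea concrete, here is a short sketch (mine, not Kuszmaul’s) that computes all of the counts N_{n,m}^k at once by expanding the generating function \prod_{i=1}^{n}(1+x^i) with exponents reduced mod m, i.e. working modulo x^m-1, and then checks the m|n special case against the closed form above (since m|n, the superscript n in the formula is the residue 0):

```python
from math import gcd

def phi(s):
    # Euler's totient, by direct count (fine for the small moduli used here)
    return sum(1 for a in range(1, s + 1) if gcd(a, s) == 1)

def subset_sums_mod(n, m):
    """Return a list whose k-th entry is the number of subsets of {1,...,n}
    summing to k mod m, computed by expanding prod_{i=1..n} (1 + x^i)
    with exponents reduced mod m (i.e. modulo x^m - 1)."""
    coeffs = [0] * m
    coeffs[0] = 1  # the empty product: one subset (the empty set) with sum 0
    for i in range(1, n + 1):
        new = coeffs[:]  # subsets not containing i
        for e in range(m):
            new[(e + i) % m] += coeffs[e]  # subsets containing i
        coeffs = new
    return coeffs

def closed_form_zero(n, m):
    """The m | n special case: (1/m) * sum over odd s | m of 2^(n/s) * phi(s)."""
    return sum(2 ** (n // s) * phi(s)
               for s in range(1, m + 1) if m % s == 0 and s % 2 == 1) // m

for n in range(1, 16):
    for m in range(1, n + 1):
        if n % m == 0:
            assert subset_sums_mod(n, m)[0] == closed_form_zero(n, m)
print("generating function expansion agrees with the closed form whenever m | n")
```

Reducing modulo x^m-1 is what keeps the computation small: only m coefficients are ever tracked, rather than the full polynomial of degree n(n+1)/2.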

One of the ideas that modular enumeration is connected to is that of the Burrows-Wheeler transform (BWT). The BWT was published as a DEC tech report in 1994 (based on earlier work of Wheeler in 1983), and is a transform of one string to another of the same length. To understand the transform, consider the example of a binary string s of length n. The BWT consists of forming the matrix of all cyclic rotations of s (one row per rotation), sorting the rows lexicographically, and outputting the last column of the matrix. For example, the string 001101 is transformed to 110010. It is immediate from the definition that any two strings that are equivalent up to cyclic rotation will be transformed to the same string. For example, 110100 also transforms to 110010.
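
A few lines of Python reproduce this example (a toy implementation that builds the rotation matrix explicitly; practical BWT code is based on suffix arrays instead):

```python
def bwt(s):
    """Burrows-Wheeler transform: sort all cyclic rotations of s and
    read off the last column of the sorted rotation matrix."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(bwt("001101"))  # prints 110010
print(bwt("110100"))  # also prints 110010 -- 110100 is a rotation of 001101
```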

Circular binary strings are called necklaces in combinatorics. Their enumeration is a classic problem, solvable by Burnside’s lemma, and the answer is that the number of distinct necklaces C_n of length n is given by

C_n = \frac{1}{n} \sum_{s|n} 2^{\frac{n}{s}} \varphi(s).

For odd n this formula coincides with the subset sum formula (for subsets summing to 0 mod n). When n is prime it is easy to describe a bijection, but for general odd n a simple combinatorial bijection is elusive (see Richard Stanley’s Enumerative Combinatorics, Volume 1, Chapter 1, Problem 105b).
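
For readers who want to see the coincidence numerically, here is a quick check (my own sketch): brute-force enumeration on the subset-sum side against Burnside’s formula on the necklace side, for odd n:

```python
from itertools import combinations
from math import gcd

def necklaces(n):
    """Number of binary necklaces of length n (Burnside's lemma):
    C_n = (1/n) * sum over d | n of 2^(n/d) * phi(d)."""
    phi = lambda s: sum(1 for a in range(1, s + 1) if gcd(a, s) == 1)
    return sum(2 ** (n // d) * phi(d) for d in range(1, n + 1) if n % d == 0) // n

def zero_sum_subsets(n):
    """Number of subsets of {1,...,n} summing to 0 mod n, by brute force."""
    return sum(1 for r in range(n + 1)
                 for c in combinations(range(1, n + 1), r)
                 if sum(c) % n == 0)

for n in range(1, 16, 2):  # odd n only
    assert necklaces(n) == zero_sum_subsets(n)
print("C_n equals the number of zero-sum subsets of {1,...,n} for odd n up to 15")
```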

The Burrows-Wheeler transform is useful because it can be utilized for string matching in time proportional to the length of the query (independent of the length of the target), while requiring an index whose size is only linear in the size of the target. It has therefore become an indispensable tool for genomics. I’m not aware of an application of the elementary observation above, but as the Stanley-Yoder, Kitchloo-P., Kuszmaul timeline demonstrates (21 years in between publications)… math moves in decades. I do think there is some interesting combinatorics underlying the BWT, and that its elucidation may turn out to have practical implications. We’ll see.
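
For the curious, the sketch below illustrates how matching with the BWT works, via the standard backward search. It is a toy version: the occ lookups are naive scans, whereas real indexes precompute rank structures so that each pattern character costs constant time, and the example text and function names here are mine.

```python
def bwt_index(text):
    """Build a toy index: BWT of text + '$' plus the C table
    (C[c] = number of characters in the text smaller than c)."""
    t = text + "$"  # '$' is a sentinel, lexicographically smaller than A, C, G, T
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    last = "".join(r[-1] for r in rotations)  # the BWT
    C, total = {}, 0
    for c in sorted(set(last)):
        C[c] = total
        total += last.count(c)
    return last, C

def count_matches(last, C, pattern):
    """Backward search: count occurrences of pattern in the indexed text.
    One interval update per pattern character; with precomputed rank
    structures the occ() calls are O(1), so the search is O(|pattern|)."""
    occ = lambda c, i: last[:i].count(c)  # naive rank; real indexes precompute this
    lo, hi = 0, len(last)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo, hi = C[c] + occ(c, lo), C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

last, C = bwt_index("GATTACA")
print(count_matches(last, C, "TA"))  # 1
print(count_matches(last, C, "A"))   # 3
```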

A final point: it is fashionable to think that biology, unlike math, moves in years. After all, NIH R01 grants are funded for a period of 3–5 years, and researchers constantly argue with journals that publication times should be weeks and not months. But in fact, lots of basic research in biology moves in decades as well, just like in mathematics. A good example is the story of CRISPR/Cas9, which began with the discovery of “genetic sandwiches” in 1987. The follow-up identification and interpretation of CRISPRs took decades, mirroring the slow development of mathematics. Today the utility of the CRISPR/Cas9 system depends on the efficient selection of guides and prediction of off-target binding, and as it turns out, tools developed for this purpose frequently use the Burrows-Wheeler transform. It appears that not only binary strings can form circles, but ideas as well…

William Kuszmaul won 3rd place in the 2014 Intel Science Talent Search for his work on modular enumeration. Well deserved, and thank you!
