I recently published a paper on the bioRxiv together with Vasilis Ntranos, Lynn Yi and Páll Melsted on Identification of transcriptional signatures for cell types from single-cell RNA-Seq. The contributions of the paper can be summed up as:

1. The simple technique of logistic regression, by taking advantage of the large number of cells assayed in single-cell RNA-Seq experiments, is much more effective than current approaches at identifying marker genes for clusters of cells.
2. The simplest single-cell RNA-Seq data, namely 3′ single-end reads produced by technologies such as Drop-Seq or 10X, can distinguish isoforms of genes.
3. The simple idea of GDE provides a unified perspective on DGE, DTU and DTE.

These simple, simple and simple ideas are so obvious that of course anyone could have discovered them, and one might be tempted to go so far as to say that even if people didn’t explicitly write them down, they were basically already known. After all, logistic regression was published by David Cox in 1958, and who didn’t know that there are many 3′ unannotated UTRs in the human genome? As for DGE, DTU and DTE (and DTE->G and DTE+G) I mean who doesn’t get these basic concepts? Indeed, after reading our paper someone remarked that one of the key results “was already known“, presumably because the successful application of logistic regression as a gene differential expression method for single-cell RNA-Seq follows from the fact that Šidák aggregation fails for differential gene expression in bulk RNA-Seq.

The “was already known” comment reminded me of a recent blog post about the dirty secret of mathematics. In the post, the author begins with the following math problem: Without taking your pencil off the paper/screen, can you draw four straight lines that go through the middle of all of the dots?

The problem may not yield immediately (try it!) but the solution is obvious once presented. This is a case of the solution requiring a bit of out-of-the-box thinking, leading to a perspective on the problem that is obvious in retrospect. In the Ntranos, Yi et al. paper, the change in perspective was the realization that “Instead of the traditional approach of using the cell labels as covariates for gene expression, logistic regression incorporates transcript quantifications as covariates for cell labels”. It’s no surprise the “was already known” reaction reared it’s head in this case. It’s easy to convince oneself, after the fact, that the “obvious” idea was in one’s head all along.

The egg of Columbus is an apocryphal tale about ideas that seem trivial after the fact. The story originates from the book “History of the New World” by Girolamo Benzoni, who wrote that Columbus, upon upon being told that his journey to the West Indies was unremarkable and that Spain “would not have been devoid of a man who would have attempted the same” had he not undertaken the journey, replied

“Gentlemen, I will lay a wager with any of you, that you will not make this egg stand up as I will, naked and without anything at all.” They all tried, and no one succeeded in making it stand up. When the egg came round to the hands of Columbus, by beating it down on the table he fixed it, having thus crushed a little of one end”

The story makes a good point. Discovery of the Caribbean in the 6th millennium BC was certainly not a trivial accomplishment even if it was obvious after the fact. The egg trick, which Columbus would have learned from the Amerindians who first brought chickens to the Americas, is a good metaphor for the discovery.

There are many Amerindian eggs in mathematics, which has its own apocryphal story to make the point: A professor proving a theorem during a lecture pauses to remark that “it is obvious that…”, upon which she is interrupted by a student asking if that’s truly the case. The professor runs out of the classroom to a nearby office,  returning after several minutes with a notepad filled with equations to exclaim “Why yes, it is obvious!” But even first-rate mathematicians can struggle to accept Amerindian eggs as worthy contributions, frequently succumbing to the temptation of dismissing others’ work as obvious. One of my former graduate school mentors was G.W. Peck, a math professor who created a pseudonym for the express purpose of publishing his Ameridian eggs in a way that would reduce unintended embarrassment for those whose work he was improving on in in “trivial ways”. G.W. Peck has an impressive publication record.

Bioinformatics is not very different from mathematics; the literature is populated with many Amerindian eggs. My favorite example is the Smith-Waterman algorithm, an algorithm for local alignment published by Temple Smith and Michael Waterman in 1981. The Smith-Waterman algorithm is a simple modification of the Needleman-Wunsch algorithm:

The table above shows the differences. That’s it! This table made for a (highly cited) paper. Just initialize the Needleman-Wunsch algorithm with zeroes instead of a gap penalty, set negative scores to 0, trace back from the highest score. In fact, it’s such a minor modification that when I first learned the details of the algorithm I thought “This is obvious! After all, it’s just the Needleman-Wunsch algorithm. Why does it even have a name?! Smith and Waterman got a highly cited paper?! For this?!” My skepticism lasted only as long as it took me to discover and read Peter Sellers’ 1980 paper attempting to solve the same problem. It’s a lot more complicated, relying on the idea of “inductive steps”, and requires untangling mysterious diagrams such as:

The Smith-Waterman solution was clever, simple and obvious (after the fact). Such ideas are a hallmark of Michael Waterman’s distinguished career. Consider the Lander-Waterman model, which is a formula for the expected number of contigs in a shotgun sequencing experiment:

$E(contigs) = Ne^{-R}.$

Here N is the number of reads sequenced and R=NL/G is the “redundancy” (reads * fragment length / genome length). At first glance the Lander-Waterman “model” is just a formula arising from the Poisson distribution! It was obvious… immediately after they published it. The Pevzner-Tang-Waterman approach to DNA assembly is another good example. It is no coincidence that all of these foundational, important and impactful ideas have Waterman in their name.

Looking back at my own career, some of the most satisfying projects have been Amerindian eggs, projects where I was lucky to participate in collaborations leading to ideas that were obvious (after the fact). Nowadays I know I’ve hit the mark when I receive the most authentic of compliments: “your work is trivial!” or “was widely known in the field“, as I did recently after blogging about plagiarism of key ideas from kallisto. However I’m still waiting to hear the ultimate compliment: “everything you do is obvious and was already known!”

(Click “read the rest of this entry” to see the solution to the 9 dot problem.)

The development of microarray technology two decades ago heralded genome-wide comparative studies of gene expression in human, but it was the widespread adoption of RNA-Seq that has led to differential expression analysis becoming a staple of molecular biology studies. RNA-Seq provides measurements of transcript abundance, making possible not only gene-level analyses, but also differential analysis of isoforms of genes. As such, its use has necessitated refinements of the term “differential expression”, and new terms such as “differential transcript expression” have emerged alongside “differential gene expression”. A difficulty with these concepts is that they are used to describe biology, statistical hypotheses, and sometimes to describe types of methods. The aims of this post are to provide a unifying framework for thinking about the various concepts, to clarify their meaning, and to describe connections between them.

To illustrate the different concepts associated to differential expression, I’ll use the following example, consisting of a comparison of a single two-isoform gene in two conditions (the figure is Supplementary Figure 1 in Ntranos, Yi et al. Identification of transcriptional signatures for cell types from single-cell RNA-Seq, 2018):

The isoforms are labeled primary and secondary, and the two conditions are called “A” and “B”. The black dots labeled conditions A and B have x-coordinates $x_A$ and $x_B$ corresponding to the abundances of the primary isoform in the respective conditions, and y-coordinates $y_A$ and $y_B$ corresponding to the abundance of the secondary isoforms. In data from an experiment the black dots will represent the mean level of expression of the constituent isoforms as derived from replicates, and there will be uncertainty as to their exact location. In this example I’ll assume they represent the true abundances.

### Biology

Below is a list of terms used to characterize changes in expression:

Differential transcript expression (DTE) is change in one of the isoforms. In the figure, this is represented (conceptually) by the two red lines along the x- and y-axes respectively. Algebraically, one might compute the change in the primary isoform by $x_B-x_A$ and the change in the secondary isoform by $y_B-y_A$. However the term DTE is used to denote not only the extent of change, but also the event that a single isoform of a gene changes between conditions, i.e. when the two points lie on a horizontal or vertical line. DTE can be understood to occur as a result of transcriptional regulation if an isoform has a unique transcription start site, or post-transcriptional regulation if it is determined by a unique splicing event.

Differential gene expression (DGE) is the change in the overall output of the gene. Change in the overall output of a gene is change in the direction of  the line $y=x$, and the extent of change can be understood geometrically to be the distance between the projections of the two points onto the line $y=x$ (blue line labeled DGE). The distance will depend on the metric used. For example, the change in expression could be defined to be the total expression in condition B ($x_B+y_B$) minus the change in expression in condition A ($x_A+y_A$), which is $|x_B-x_A+y_B-y_A|$.  This is just the length of the blue line labeled “DGE” given by the $L_1$ norm. Alternatively, one could consider “DGE” to be the length of the blue line in the $L_2$ norm. As with DTE, DGE can also refer to a specific type of change in gene expression between conditions, one in which every isoform changes (relatively) by the same amount so that the line joining the two points has a slope of 1 (i.e. is angled at 45°). DGE can be understood to be the result of transcriptional regulation, driving overall gene expression up or down.

Differential transcript usage (DTU) is the change in relative expression between the primary and secondary isoforms. This can be interpreted geometrically as the angle between the two points, or alternatively as the length (as given by some norm) of the green line labeled DTU. As with DTE and DGE, DTU is also a term used to describe a certain kind of difference in expression between two conditions, one in which the line joining the two points has a slope of -1. DTU events are most likely controlled by post-transcriptional regulation.

Gene differential expression (GDE) is represented by the red line. It is the amount of change in expression along in the direction of line joining the two points. GDE is a notion that, for reasons explained below, is not typically tested for, and there are few methods that consider it. However GDE is biologically meaningful, in that it generalizes the notions of DGE, DTU and DTE, allowing for change in any direction. A gene that exhibits some change in expression between conditions is GDE regardless of the direction of change. GDE can represent complex changes in expression driven by a combination of transcriptional and post-transcriptional regulation. Note that DGE, DTU and DTE are all special cases of GDE.

If the $L_2$ norm is used to measure length and $DTE_1,DTE_2$ denote DTE in the primary and secondary isoforms respectively, then it is clear that DGE, DTU, DTE and GDE satisfy the relationship

$GDE^2 = DGE^2 + DTU^2 = DTE_1^2 + DTE_2^2.$

### Statistics

The terms DTE, DGE, DTU and GDE have an intuitive biological meaning, but they are also used in genomics as descriptors of certain null hypotheses for statistical testing of differential expression.

The differential transcript expression (DTE) null hypothesis for an isoform is that it did not change between conditions, i.e. $x_A=x_B$ for the primary isoform, or $y_A=y_B$ for the secondary isoform. In other words, in this example there are two DTE null hypotheses one could consider.

The differential gene expresión (DGE) null hypothesis is that there is no change in overall expression of the gene, i.e. $x_A+y_A = x_B+y_B$.

The differential transcript usage (DTU) null hypothesis is that there is no change in the difference in expression of isoforms, i.e. $x_A-y_A = x_B - y_B$.

The gene differential expression (GDE) null hypothesis is that there is no change in expression in any direction, i.e. for all constants $a,b$, $ax_A+by_A = ax_B+by_B$.

The union differential transcript expression (UDTE) null hypothesis is that there is no change in expression of any isoform. That is, that $x_A = y_A$ and $x_B = y_B$ (this null hypothesis is sometimes called DTE+G). The terminology is motivated by $\neg \cup_i DTE_i = \cap_i DTE_i$.

Not that $UDTE \Leftrightarrow GDE$, because if we assume GDE, and set $a=1,b=0$ we obtain DTE for the primary isoform and setting $a=0,b=1$ we obtain DTE for the secondary isoform. To be clear, by GDE or DTE in this case we mean the GDE (respectively DTE) null hypothesis. Furthermore, we have that

$UDTE,GDE \Rightarrow DTE,DGE,DTU$.

This is clear because if $x_A=y_A$ and $x_B=y_B$ then both DTE null hypotheses are satisfied by definition, and both DGE and DTU are trivially satisfied. However no other implications hold, i.e. $DTE \not \Rightarrow DGE,DTU$, similarly $DGE \not \Rightarrow DTE,DTU$, and $DTU \not \Rightarrow DGE, DTE$.

### Methods

The terms DGE, DTE, DTU and GDE also used to describe methods for differential analysis.

A differential gene expression method is one whose goal is to identify changes in overall gene expression. Because DGE depends on the projection of the points (representing gene abundances) to the line y=x, DGE methods typically take as input gene counts or abundances computed by summing transcript abundances $x_A+y_A$ and $x_B+y_B$. Examples of early DGE methods for RNA-Seq were DESeq (now DESeq2) and edgeR. One problem with DGE methods is that it is problematic to estimate gene abundance by adding up counts of the constituent isoforms. This issue was discussed extensively in Trapnell et al. 2013. On the other hand, if the biology of a gene is DGE, i.e. changes in expression are the same (relatively) in all isoforms, then DGE methods will be optimal, and the issue of summed counts not representing gene abundances accurately is moot.

differential transcript expression method is one whose goal is to identify individual transcripts that have undergone DTE. Early methods for DTE were Cufflinks (now Cuffdiff2) and MISO, and more recently sleuth, which improves DTE accuracy by modeling uncertainty in transcript quantifications. A key issue with DTE is that there are many more transcripts than genes, so that rejecting DTE null hypotheses is harder than rejecting DGE null hypotheses. On the other hand, DTE provides differential analysis at the highest resolution possible, pinpointing specific isoforms that change and opening a window to study post-transcriptional regulation. A number of recent examples highlight the importance of DTE in biomedicine (see, e.g., Vitting-Seerup and Sandelin 2017). Unfortunately DTE results do not always translate to testable hypotheses, as it is difficult to knock out individual isoforms of genes.

differential transcript usage method is one whose goal is to identify genes whose overall expression is constant, but where isoform switching leads to changes in relative isoform abundances. Cufflinks implemented a DTU test using Jensen-Shannon divergence, and more recently RATs is a method specialized for DTU.

As discussed in the previous section, none of null hypotheses DGE, DTE and DTU imply any other, so users have to choose, prior to performing an analysis, which type of test they will perform. There are differing opinions on the “right” approach to choosing between DGE, DTU and DTE. Sonseson et al. 2016 suggest that while DTE and DTU may be appropriate in certain niche applications, generally it’s better to choose DGE, and they therefore advise not to bother with transcript-level analysis. In Trapnell et al. 2010, an argument was made for focusing on DTE and DTU, with the conclusion to the paper speculating that “differential RNA level isoform regulation…suggests functional specialization of the isoforms in many genes.” Van den Berge et al. 2017 advocate for a middle ground: performing a gene-level analysis but saving some “FDR budget” for identifying DTE in genes for which the UDTE null hypothesis has been rejected.

There are two alternatives that have been proposed to get around the difficulty of having to choose, prior to analysis, whether to perform DGE, DTU or DTE:

differential transcript expression aggregation (DTE->G) method is a method that first performs DTE on all isoforms of every gene, and then aggregates the resulting p-values (by gene) to obtain gene-level p-values. The “aggregation” relies on the observation that under the null hypothesis, p-values are uniformly distributed. There are a number of different tests (e.g. Fisher’s method) for testing whether (independent) p-values are uniformly distributed. Applying such tests to isoform p-values per gene provides gene-level p-values and the ability to reject UDTE. A DTE->G method was tested in Soneson et al. 2016 (based on Šidák aggregation) and the stageR method (Van den Berge et al. 2017) uses the same method as a first step. Unfortunately, naïve DTE->G methods perform poorly when genes change by DGE, as shown in Yi et al. 2017. The same paper shows that Lancaster aggregation is a DTE->G method that achieves the best of both the DGE and DTU worlds. One major drawback of DTE->G methods is that they are non-constructive, i.e. the rejection of UDTE by a DTE->G method provides no information about which transcripts were differential and how. The stageR method averts this problem but requires sacrificing some power to reject UDTE in favor of the interpretability provided by subsequent DTE.

gene differential expression method is a method for gene-level analysis that tests for differences in the direction of change identified between conditions. For a GDE method to be successful, it must be able to identify the direction of change, and that is not possible with bulk RNA-Seq data. This is because of the one in ten rule that states that approximately one predictive variable can be estimated from ten events. In bulk RNA-Seq, the number of replicates in standard experiments is three, and the number of isoforms in multi-isoform genes is at least two, and sometimes much more than that.

In Ntranos, Yi et al. 2018, it is shown that single-cell RNA-Seq provides enough “replicates” in the form of cells, that logistic regression can be used to predict condition based on expression, effectively identifying the direction of change. As such, it provides an alternative to DTE->G for rejecting UDTE. The Ntranos and Yi GDE methods is extremely powerful: by identifying the direction of change it is a DGE methods when the change is DGE, it is a DTU method when the change is DTU, and it is a DTE method when the change is DTE. Interpretability is provided in the prediction step: it is the estimated direction of change.

### Remarks

The discussion in this post is based on an example consisting of a gene with two isoforms, however the concepts discussed are easy to generalize to multi-isoform genes with more than two transcripts. I have not discussed differential exon usage (DEU), which is the focus of the DEXSeq method because of the complexities arising in genes which don’t have well-defined shared exons. Nevertheless, the DEXSeq approach to rejecting UDTE is similar to DTE->G, with DTE replaced by DEU. There are many programs for DTE, DTU and (especially) DGE that I haven’t mentioned; the ones cited are intended merely to serve as illustrative examples. This is not a comprehensive review of RNA-Seq differential expression methods.

### Acknowledgments

The blog post was motivated by questions of Charlotte Soneson and Mark Robinson arising from an initial draft of the Ntranos, Yi et al. 2018 paper. The exposition was developed with Vasilis Ntranos and Lynn Yi. Valentine Svensson provided valuable comments and feedback.

I have been fascinated with mini computers for some time, and have wondered when they will become suitable for bioinformatics. The 4273π project, which is an online course that is distributed as a 32Gb SD card image for the Raspberry Pi, has been around for a few years and demonstrated the utility of mini computers for training. The course is a proof of principle that bioinformatics software can work on a mini computer; the distributed software includes some comparative genomics and phylogenetics programs. However there is not much one can do with 1Gb RAM. The data in 4273π are small FASTA files, and while the Raspberry Pi is powerful enough to allow for experimentation and exploration of such datasets, even the new Raspberry Pi 3, with ten times the performance of the original 2012 model, still only has 1Gb of RAM and is not powerful enough for handling the current primary data type of genomics: high-throughput sequencing data.

Enter the Rock64.

The Rock64 is a new single-board computer from Pine64 that competes with the Raspberry Pi 3:

The Rock64 is evidence of the rapid and impressive development in single-board computers over the past few years, and Pine64 crosses a major threshold by offering a model with 4Gb RAM. The machine is also cheap. A 4Gb RAM Rock64, which is a 64-bit, quad core 1.5GHz machine, costs $44.95 (the 1Gb model is just$24.95). An enclosure is $7.95, a power supply$6.99, and a 64Gb SSD drive is only $31.95 (the 16Gb drive is$15.95). When my student Jase Gehring found out the specs of the machine last summer, he immediately realized that it was powerful enough to run kallisto for RNA-Seq analyses, and we preordered a handful of the boards for the lab. These arrived in the fall and we have been testing the machines for a while. One of them is hooked up to a monitor, and together with a bluetooth mouse and keyboard is serving as a general desktop computer in the wet lab. They are extraordinary versatile mini computers that, in my opinion, portend a future of mobile, low-cost, and light-weight computing for clinical and field genomics applications.

Unfortunately ARM is not an architecture known to most computational biologists, and my initial enthusiasm for the Rock64 was dampened when I found out that most genomics software does not work on ARM architecture. However I managed to install R, and Páll Melsted compiled kallisto on the Rock64 for the new release of version 0.44 (the release introduces an ARM binary, along with pseudobam for visualization of pseudoalignments). With these programs in place on Gibraltar (our first Rock64 with 4Gb of RAM, a 64Gb SSD drive, and a quad-core 1.5GHz processor), there was ample processing power to quantify RNA-Seq datasets.

For example, I was able to build the Saccharomyces cerevisae release 81 transcriptome index in one minute. A complete quantification of 6 samples from Ellahi, Thurtle and Rine, 2015 using two cores (with 30 bootstraps per sample) took 21 minutes. The quantification consisted of processing 47,744,312 paired-end reads. Amazingly, the Rock64 can quantify human RNA-Seq, which requires pseudoalignment of reads to a much larger transcriptome than yeast. A human 15,117,833 paired-end read sample (SRR493366) took less than 11 minutes to quantify using a single core. These results show that the Rock64 is not a toy; it can be used for the analysis of high-throughput sequencing data from substantial biological experiments.

It’s mind boggling to consider just how amazing it is to be able to quantify RNA-Seq on such a machine. When we developed kallisto we knew that the two orders of magnitude speedup was a game-changer, but I never thought we would literally be able to run it on what is not much more than a phone. We’re not going to switch over all of our RNA-Seq analyses to the Rock64s quite yet, but cluster assemblies such as the Pico5S have piqued my interest.

I imagine that it won’t be long before mini computers are even more powerful, and provide ultra low-cost portable alternatives to current server and cloud computing solutions. Having said that, I still miss my Commodore 64. Fortunately the mini revolution isn’t leaving me behind: a mini version of the C64 is slated for release early this year.

On April 11th 2016, I contacted the Office for Prevention of Harassment and Discrimination at UC Berkeley to report that Professor Terry Speed had sexually harassed a postdoctoral researcher in the UC Berkeley statistics department in the period 2000–2002. Two specific allegations were subsequently investigated:

Allegation One: Respondent, a professor in the Statistics Department, sexually harassed Complainant One, a post-doctoral student in the same department, from 2000-2002 by making sexual advances toward her, asking her for dates, telling her he had a “crush” on her, giving her hugs, and communicating with her, including by email, in an intimate or romantic manner, when such behavior was not welcome.

Allegation Two: Respondent, a professor in the Statistics Department, created a hostile work environment for Complainant Two, an Assistant Professor in the Mathematics Department, in 2002, through Respondent’s persistent discussions and emails regarding his romantic interest in Complainant One and by pressuring Complainant Two to persuade Complainant One to interact with Respondent.

The investigation took 14 months to complete, and the result was a 47 page report along with 89 pages of supporting evidence based on interviews, hundreds of pages of emails that I disclosed at the outset of the investigation, and letters and emails provided by Respondent. The report concludes as follows:

CONCLUSION
For the reasons stated above, I conclude that the preponderance of the evidence substantiates that Respondent violated the 1992 Sexual Harassment Policy in that he engaged in unwelcome conduct of a sexual nature that created a hostile environment for Complainant One and Complainant Two, and conditioned an academic or personnel decision on Complainant One’s submission to his conduct. This report is being submitted to the Vice Provost for Faculty for review under the Faculty Code of Conduct.

I have waited since June of last year to hear from the Vice Provost for Faculty at UC Berkeley what action the university will take in light of the findings, however despite multiple requests for information the university has yet to respond as to whether it will enact any sanctions on Respondent.

My close-up encounter with sexual harassment was devastating. I never expected, when I arrived in Berkeley in 1999, that Terry Speed, a senior professor in my field who I admired and thought of as a mentor would end up as Respondent and myself as Complainant Two. However much more serious and significant than my ordeal were the devastating consequences his sexual harassment had on the life and well being of Complainant One. The sexual harassment that took place was not an isolated event. Despite repeated verbal and written requests by Complainant One that Speed stop, his sexual harassment continued unabated for months. The case was not reported at the time the sexual harassment happened because of the structure of Title IX. Complainant One knew that Speed would be informed if a complaint was made, and Complainant One was terrified of reprisal. Her fear was not hypothetical; after months of asking Speed to stop sexually harassing her, he communicated to her that, unless she was willing to reconcile with him as he wished, she could not count on his recommendation.

Speed has been an advocate for women in academia in recent years. However no amount of advocacy on behalf of women can cancel out the physical and mental harm caused by prolonged sexual harassment. Speed’s self-proclamation that he is a “male feminist” rings hollow.

The GTEx consortium has just published a collection of papers in a special issue of Nature that together provide an unprecedented view of the human transcriptome across dozens of tissues. The work is based on a large-scale RNA-Seq experiment of postmortem tissue from hundreds of human donors, illustrated in Figure 1 of the overview by Ward and Gilad 2017:

The data provide a powerful new opportunity for several analyses, highlighted (at least for me) by the discovery of 673 trans-eQTLs at 10% genome-wide FDR. Undoubtedly more discoveries will be published when the sequencing data, available via dbGAP, is analyzed in future studies. As a result, the GTEx project is likely to garner many citations, both for specific results, but also drive-by-citations that highlight the scope and innovation of the project. Hopefully, these citations will include the key GTEx paper:

Carithers, Latarsha J, Ardlie, Kristin, Barcus, Mary, Branton, Philip A, Britton, Angela, Buia, Stephen A, Compton, Carolyn C, DeLuca, David S, Peter-Demchok, Joanne, Gelfand, Ellen T, Guan, Ping, Korzeniewski, Greg E, Lockhart, Nicole C, Rabiner, Chana A, Rao, Abhi K, Robinson, Karna L, Roche, Nancy V, Sawyer, Sherilyn J, Segrè, Ayellet V, Shive, Charles E, Smith, Anna M, Sobin, Leslie H, Undale, Anita H, Valentino, Kimberly M, Vaught, Jim, Young, Taylor R, Moore, Helen M, on behalf of the GTEx consortium, A Novel Approach to High-Quality Postmortem Tissue Procurement: The GTEx Project, Biopreservation and Biobanking 13(5), 2015, p 311–319.

The paper by Latarsha Carithers et al. provides an overview of the consent and laboratory procedures that GTEx developed and applied to obtain tissues from hundreds of deceased donors. The monumental effort is, to my knowledge, unprecedented in scale and scope, and it relied on the kindness and generosity of hundreds of family members and next-of-kin of donors, who consented to donate their loved ones to science.

To develop effective and appropriate consent procedures, the GTEx project organized a sub-study to determine how best to approach, interact and explain the project to family members. Ultimately consent was obtained either in person or over the phone, and one can only imagine the courage of families to agree to donate, especially during times of grief and for a project whose goals could only be explained in terms of the long-term benefits of basic science.

The consent procedures for GTEx were complicated by a need to rapidly place tissue in preservative postmortem. RNA degrades rapidly after the time of death, and there is a window of only a few hours before expression can no longer be effectively measured. The RNA Integrity Number (RIN) measures the extent of degradation of RNA. It used to be measured with gel electrophoresis by examining the ratio of 28S:18S rRNA; more recently RIN is computed using more sophisticated analyses with, e.g. the Agilent bioanalyzer (see Schroeder et al. 2006 for details). GTEx conducted extensive studies to determine the correspondence between postmortem interval (time taken to preserve tissue) and RIN, and also examined the RIN necessary for effective RNA-Seq library construction.

The effect of ischemic time time on RIN values (Fig 6 from Carithers et al. 2015).

These studies were used to deploy standard operating procedures across multiple source sites (an obvious necessity given the number of donors needed). All of this research was not only crucial for GTEx, but will be extremely valuable for studies relying on postmortem RNA-Seq in the future.

The collection of specimens from each source site required training of individuals at that site, and one of GTEx’s achievements is the gathering of knowledge of how to orchestrate such a complex distributed sample collection and preparation enterprise. The workflow shown below (Figure 2 from Carithers et al. 2015) hints at the complexities involved (e.g. the need for separate treatment of brain due to the requirement of proper sectioning).

A meeting discussing the findings of Carithers et al. was held on May 20-21 2015 and I encourage all users of GTEx data/results to view the recording of it (Day 1, Day 2).

It is truly moving and humbling to consider the generosity of the hundreds of families, in many cases of donors in their twenties or thirties, who enabled GTEx. The scale of their contribution, and the suffering that preceded collection of the data cannot be captured in cartoons that describe the experiment. The families can also never be fully acknowledged, not matter how many times they are thanked in acknowledgment sections. But at a minimum, I think that reading Carithers et al. 2015 is the least one can do to honor them, and those who turned their good-will into science.

Acknowledgment: the idea for this blog post originated during a conversation with Roderic Guigó.

In a previous post I wrote about How not to perform a differential expression analysis. In response to my post, Rob Patro, Geet Duggal, Michael I Love, Rafael Irizarry and Carl Kingsford wrote a detailed response. Below is my point-by-point rebuttal to their response (the figures and results in this blog post can be generated using the scripts in the Bits of DNA GitHub repository):

1. In Figure 1 of their response, Patro et al. show an MA plot and state that “if it were true that these methods are ‘very very’ similar one would see most log-ratios close to 0 (within the red lines).” This is true. Below is the MA plot for kallisto with default parameters and Salmon with the –gcBias flag:

96.6% of the points lie within the red lines. Since this constitutes most of the points, it seems reasonable to conclude that the methods are indeed very very similar. When both programs are run in default mode, as I did in my blog post, 98.9% of the points lie within the red lines. Thus, using the criterion of Patro et al., the programs have very very similar, or near identical, output. These numbers are conservative, computed by omitting transcripts where both kallisto and Salmon determine that a transcript has zero abundance.

2. Furthermore, Patro et al. explain that their MA plot in Figure 1 “demonstrate[s] how deceiving count scatter plots can be in this particular context.” There is, superficially, some merit to this claim. The MA plot above looks like a smudge of points and seems at odds with the fact that 96.6% of the points lie within the red lines. However the plot displays 198,457 points corresponding to 198,457 quantified transcripts, and as a result many points obfuscate each other. The alpha parameter in ggplot2 sets the opacity/transparency of points, and should be used in such a case to reveal the density of points (see, e.g. Supplementary Figure 19 of Love et al. 2016). Below is a plot of the exact same points with alpha=0.01:

An R animation that interpolates between the two MA plots above shows the same points, with varying opacity parameters (alpha=1 -> 0.01) and helps to demonstrate how deceiving MA plots can be in this particular context:

3. The Patro et al. response fails to distinguish between two different comparisons I made in my blog post: (1) comparisons of default kallisto to default Salmon, and (2) default kallisto to Salmon with the –gcBias option. Comparisons of the programs with default options is important because with those options their output is near identical, and, as I explain in my blog post, this is not some cosmic coincidence but a result of Salmon directly implementing the key ideas of pseudoalignment. The Patro et al. 2017 paper is also not just about GC bias correction, as the authors claim in their response, but rather it is also “the Salmon paper” a descriptor that Patro et al. use 24 times in their response. Furthermore, when Patro et al. are asked about how to run Salmon they recommend running it with default options (see e.g. the epilogue below or the way Patro  et al. run Salmon for analysis of the Bealieau-Jones-Greene described in #5) so that a comparison of the programs in default mode is of direct relevance to users.

In regards to the GC bias correction, Patro et al. 2017 claim in their abstract that “[GC bias correction] substantially improves the accuracy of abundance estimates and the sensitivity of subsequence differential expression analysis”. This is a general statement, not one about the sort of niche use-cases they describe in their response. The question then is whether Patro et al. provides support for this general statement and my argument has always been that it does not.

4. Patro et al. criticize my use of the ERR188140 sample to demonstrate how similar Salmon is to kallisto. They write that “the blog post author selected a single sample…”(boldface theirs) to claim that Salmon and kallisto produce output with “very very strong similarity (≃)” and raise the possibility that it was cherry picked, noting that “this particular sample has less GC-content bias” and marking it in a plot. I used ERR188140 because it was our sample of choice for many of the demonstrative analyses in the Bray et al. 2016 paper (see the kallisto paper analysis Github repository where the sample is mentioned since February 2016) and for that paper we had already generated the RSEM quantifications (and the alignments required for running the program), thus saving time in making the PCA analysis for my blogpost. ERR188140 was chosen for Bray et al. 2016 because it was the most deeply sequenced sample in the GEUVADIS dataset.

5. Contrary to the claim by Patro et al. in their response that I examined only one dataset, I also included in my post links with references to specific figures from four other papers that independently found that kallisto is near identical to Salmon. The fairest example for consideration is the additional analysis I mentioned of Beaulieu-Jones and Greene, and separately Patro, of the RNA-Seq dataset from Boj et al. 2015. With that analysis, there can be no claims of cherry-picking. The dataset was chosen by the authors of Beaulieu-Jones and Greene 2017, kallisto quantifications were produced by Beaulieu-Jones and Greene, and Salmon quantifications were prepared by Patro. Presumably the main author of the Salmon program ran Salmon with the best settings possible for the experiment. The fact that different individuals ran the programs is highlighted by the fact that they are not even based on identical annotations. They used different versions of RefSeq: Beaulieu-Jones and Greene quantified with 35,026 transcripts and Patro, who quantified later, used an annotation with 35,882 transcripts. There are eight samples in the analysis and MA plots, made by restricting the analysis to the transcripts in common, all look alike. As an example, the MA plot for SRR1654626  is:

The fraction of points within the red lines, calculated as before by omitting points at (0,0), is 98.6%.  The Patro analysis of Bealieau-Greene was performed on March 8, 2017 with version 0.8.1 of Salmon, well after the –gcBias option was implemented, the Salmon (version 3 preprint describing the GC correction) published, and the paper submitted. The dates are verifiable in the GitHub repository with the Salmon results.

6. In arguing that kallisto and Salmon are different Patro et al. provide an interesting formula for the correlation for two random variables X and X+Y where X and Y are independent but its use in this context is a sleight of hand. The formula, which is a simple exercise for the reader to derive from the definition of correlation, is

$cor(X,X+Y)=\sqrt{\frac{1}{1+Var(Y)/Var(X)}}$.

It follows by Taylor series expansion that this is approximately

$cor(X,X+Y) \approx 1-\frac{1}{2}\frac{Var(Y)}{Var(X)}$.

and if sd(X) is about 3.4 and sd(Y) about 0.5 (Patro et al.‘s numbers), then by inspection cor(X,X+Y) will be 0.99. In sample SRR1654626 shown above, when ignoring transcripts where both programs output 0, sd(X)=3.5 and sd(Y) = 0.43 which are fairly close to Patro et al.‘s numbers. However Patro et al. proceed with a non sequitur, writing that “this means that a substantial difference of 25% between reported counts is typical”. While the correlation formula makes no distributional assumptions, the 25% difference seems to be based on an assumption that Y is normally distributed. Specifically, if is normally distributed with mean 0 and standard deviation 0.5 then |Y| is half-normally distributed and a typical percent difference based on the median is

$(2^{0.5 \cdot \sqrt{2}\cdot \mbox{erf}^{-1}(0.5)}-1)\cdot 100 = 26.3\% \approx 25\%.$

However the differences between kallisto and Salmon quantifications are far from normally distributed. The plot below shows the distribution of the differences between log2 counts of kallisto and salmon (again excluding cases where both programs output 0):

The blue vertical line is positioned at the median, which is at 0.001433093. This means that the typical difference between reported counts is not 25% but rather 0.1%.

7.  In their response, Patro et al. highlight the recent Zhang et al. 2017 paper that benchmarked a number of RNA-Seq programs, including kallisto and Salmon. Patro et al. comment on a high correlation between a mode of Salmon that quantifies based with transcriptome alignments and RSEM. First, the correlations reported by Zhang et al. are Pearson correlations, and not Spearman correlations that I focused on in my blog post. Second, the alignment mode of Salmon has nothing to do with pseudoalignment, in that read alignments (in the case of Zhang et al. 2017 produced with STAR) are quantified directly, in a workflow the same as that of RSEM. Investigation of the similarities between alignment Salmon and RSEM that led to the high correlation is beyond the scope of this post. Finally, in discussing the similarities between programs the authors (Zhang et al.) write “Salmon, Sailfish and Kallisto, cluster tightly together with R 2  > 0.96.”

8. In regards to the EM algorithm, Patro et al. acknowledge that Salmon uses kallisto’s termination criteria and have updated their code to reflect this fact. I thank them for doing so, however this portion of their response is bizarre:

“What if Salmon executed more iterations of its offline phase and outperformed kallisto? Then its improvement could be attributed to the extra iterations instead of the different model, bias correction, or online phase. By using the same termination criteria for the offline phase of Salmon, we eliminate a confounding variable in the analysis.”

If Salmon could perform better by executing more iterations of the EM algorithm it should certainly do so. This is because parameters hard-wired in the code should be set in a way that provides users with the best possible performance.

9. At one point in their response Patro et al. write that “It is expected that Salmon, without the GC bias correction feature, will be similar to kallisto”, essentially conceding that default Salmon $\simeq$ default kallisto, a main point of my blog post. However Patro et al. continue to insist that Salmon with GC bias correction significantly improves on kallisto. Patro et al. have repeated a key experiment (the GEUVADIS based simulation) in their paper, replacing the t-test with a workflow they describe as “the pipeline suggested by the post’s author”. To be clear, this is the workflow preferred by Patro et al.:

As explained in my post on How not to perform a differential expression analysis the reason that Love et al. recommend a DESeq2 workflow instead of a t-test for differential expression is because of the importance of regularizing variance estimates. This is made clear by repeating Patro et al.‘s GEUVADIS experiment with a typical three replicates per condition instead of eight:

With the t-test of transcripts Salmon cannot even achieve an FDR of less than 0.05.

10. Patro et al. find that switching to their recommended workflow (i.e. replacing the t-test with their own DESeq2) alters the difference between kallisto and Salmon at an FDR of 0.01 from 353% to 32%. Patro et al. describe this difference, in boldface, as “The results remain similar to the original published results when run using the accuser’s suggested pipeline.”  Note that Patro et al. refer to a typical difference of 0.1% between counts generated by kallisto and Salmon as “not very very similar” (point #6) while insisting that 353% and 32% aresimilar.

11. The reanalysis of the GEUVADIS differential expression experiment by Patro et al. also fails to address one of the most important critiques in my blogpost, namely that a typical experimental design will not deliberately confound bias with conditionThe plot below shows the difference between kallisto and Salmon in a typical experiment (3 replicates in each condition) followed by Love et al.’s recommended workflow (tximport -> DESeq2):

There is no apparent difference between kallisto and Salmon. Note that the samples in this experiment have the same GC bias as in Patro et al. 2017, the only difference being that samples are chosen randomly in a way that they are not confounded by batch. The lack of any observed difference in results between default kallisto and Salmon with the –gcBias option are the same with an 8×8 analysis:

There is no apparent difference between kallisto and Salmon, even though the simulation includes the same GC bias levels as in Patro et al. 2017 (just not confounded with condition) .

12. It is interesting to compare the 8×8 unconfounded experiment with the 8×8 confounded experiment.

While Salmon does improve on kallisto (although as discussed in point #10 the improvement is not 353% but rather 32% at an FDR of 0.01), the improvement in accuracy when performing an unconfounded experiment highlights why confounded experiments should not be performed in the first place.

13. Patro et al. claim that despite best intentions, “confounding of technical artifacts such as GC dependence with the biological comparison of interest does occur” and cite Gilad and Mizrahi-Man 2015. However the message of the Gilad and Mizrahi-Man paper is not that we must do our best to analyze confounded experiments. Rather, it is that with confounded experiments one may learn nothing at all. What they say is “In summary, we believe that our reanalysis indicates that the conclusions of the Mouse ENCODE Consortium papers pertaining to the clustering of the comparative gene expression data are unwarranted.” In other words, confounding of batch effect with variables of interest can render experiments worthless.

14. In response to my claim that GC bias has been reduced during the past 5 years, Patro et al. state:

A more informed assumption is that GC bias in sequencing data originates with PCR amplification and depends on thermocycler ramp speed (see, for example, Aird (2011) or t’ Hoen (2013)), and not from sequencing machines or reverse transcription protocols which may have improved in the past 5 years.

This statement is curious in that it seems to assume that, unlike sequencing machines or reverse transcription protocols, PCR amplification and thermocycler technology could not have improved in the past 5 years. As an example to the contrary, consider that just months before the publication of the GEUVADIS data, New England Biolabs released a new polymerase which claimed to address this very issue. GC bias is a ubiquitous issue in molecular biology and of course there are ongoing efforts to address it in the wet lab. Furthermore, continued research and benchmarking aimed at reducing GC bias, (see e.g. Thorner et al. 2014have led to marked improvements in library quality and standardization of experiments across labs. Anyone who performs bulk RNA-Seq, as we do in my lab, knows that RNA-Seq is no longer an ad hoc experiment.

15. Patro et al. write that

The point of the simulation was to demonstrate that, while modeling fragment sequence bias reduces gross mis-estimation (false reports of isoform switching across labs in real data — see for example Salmon Supplementary Figure 5 showing GEUVADIS data), the bias modeling does not lead to overall loss of signal. Consider that one could reduce false positives simply by attenuating signal or adding noise to all transcript abundances.

However none of the simulations or results in Patro et al. 2017 address the question of whether bias modeling leads to overall loss of signal. To answer it would require examining the true and false positives in a comparison of default Salmon and Salmon with –gcBias. Not only did Patro et al. not do the relevant intra-program comparisons, they did inter-program comparisons instead which clearly bear no relevance to the point they now claim they were making.

16. I want to make very clear that I believe that GC bias correction during RNA-Seq quantification is valuable and I agree with Patro et al. that it can be important for meta-analyses, especially of the kind that take place by large genome consortia. One of the interesting results in Patro et al. is the SEQC analysis (Supplementary Figure 4) which shows that that Salmon is more consistent in intra-center quantification in one sample (HBRR). However in a second experiment (UHRR) the programs are near identical in their quantification differences within and between centers and based on the results shown above I don’t believe that Patro et al. 2017 achieves its stated aim of showing that GC correction has an effect on typical differential analyses experiments that utilize typical downstream analyses.

17. I showed the results of running kallisto in default mode and Salmon with GC bias correction on a well-studied dataset from Trapnell et al. 2013. Patro et al. claim they were unable to reproduce my results, but that is because they performed a transcript level analysis despite the fact that I made it very clear in my post that I performed a gene level analysis. I chose to show results at the gene level to draw a contrast with Figure 3c of Trapnell et al. 2013. The results of Patro et al. at the transcript level show that even then the extent of overlap is remarkable. These results are consistent with the simulation results (see point #11).

18. The Salmon authors double down on their runtime analysis by claiming that “The running time discussion presented in the Salmon paper is accurate.” This is difficult to reconcile with two facts

(a) According to Patro et al.’s rebuttal “kallisto is faster when using a small number of threads” yet this was not presented in Patro et al. 2017.

(b) According to Patro et al. (see, e.g., the Salmon program GitHub), when running kallisto or Salmon with 30 threads what is being benchmarked is disk I/O and not the runtime of the programs.

If Patro et al. agree that to benchmark the speed of a program one must use a small number of threads, and Patro et al. agree that with a small number of threads kallisto is faster, then the only possible conclusion is that the running time discussion presented in the Salmon paper is not accurate.

19. The Patro et al. response has an entire section (3.1) devoted to explaining why quasimapping (used by Salmon) is distinct from pseudoalignment (introduced in the kallisto paper). Patro et al. describe quasimapping as “a different algorithm, different data structure, and computes different results.” Furthermore, in a blog post, Patro explained that RapMap (on which Salmon is based) implements both quasimapping and pseudoalignment, and that these are distinct concepts. He writes specifically that in contrast to the first algorithm provided by RapMap (pseudoalignment), “the second algorithm provided by RapMap — quasi-mapping — is a novel one”.

One of the reviewers of the Salmon paper recently published his review, which begins with the sentence “The authors present salmon, a new RNAseq quantification tool that uses pseudoalignment…” This directly contradicts the assertion of Patro et al. that quasimapping is “a different algorithm, different data structure” or that quasimapping is novel. In my blog post I provided a detailed walk-through that affirms that the reviewer is right. I showed how the quasimapping underlying Salmon is literally acting in identical ways on the k-mers in reads. Moreover, the results above show that Salmon, using quasimapping, does not “compute different results”. Unsurprisingly, its output is near identical to kallisto.

20. Patro et al. write that “The title of Sailfish paper contains the words ‘alignment-free’, which indicates that it was Sailfish that first presented the key idea of abandoning alignment. The term alignment-free has a long history in genomics and is used to describe methods in which the information inherent in a complete read is discarded in favor of the direct use of it’s substrings. Sailfish is indeed an alignment-free method because it shreds reads into constituent k-mers, and those are then operated on without regard to which read they originated from. The paper is aptly titled. The concept of pseudoalignment is distinct in that complete reads are associated to targets, even if base-pair alignments are not described.

21. Patro et al. write that “Salmon, including many of its main ideas, was widely known in the field prior to the kallisto preprint.” and mention that Zhang et al. 2015 included a brief description of Salmon. Zhang et al. 2015 was published on June 5, 2015, a month after the kallisto preprint was published, and its description of Salmon, though brief, was the first available for the program. Nowhere else, prior to the Zhang et al. publication, was there any description of what Salmon does or how it works, even at a high-level.

Notably, the paragraph on Salmon of Zhang et al. shows that Salmon, in its initial form, had nothing to do with pseudoalignment:

“Salmon is based on a novel lightweight alignment model that uses chains of maximal exact matches between sequencing fragments and reference transcripts to determine the potential origin of RNA‐seq reads.”

This is consistent with the PCA plot of my blog post which shows that initial versions of Salmon were very different from kallisto, and that Salmon $\simeq$ kallisto only after Salmon switched to the use of pseudoalignment.

22. My blogpost elicited an intense discussion in the comments and on social media of whether Patro et al. adequately attributed key ideas of Salmon to kallisto. Patro et al. They did not.

Patro et al. reference numerous citations to kallisto in Patro et al. 2017 which I’ve reproduced below

Only two of these references attribute any aspect of Salmon to kallisto. One of them, the Salmon bootstrap, is described as “inspired by kallisto” (in fact it is identical to that of kallisto). There is only one citation in Salmon to the key idea that has made it near identical to kallisto, namely the use of pseudoalignment, and that is to the RapMap paper from the Patro group (Srivastava et al. 2016).

Despite boasting of a commitment to open source principles and embracing preprints, Patro et al. conveniently ignore the RapMap preprints (Srivastava et al. 2015). Despite many mentions of kallisto, none of the four versions of the preprint acknowledge the direct use of the ideas in Bray et al. 2016 in any way, shape or form. The intent of Srivastava et al. is very clear. In the journal version the authors still do not acknowledge that “quasi mapping” is just pseudoalignment implemented with a suffix array, instead using words such as “inspired” and “motivated” to obfuscate the truth. Wording matters.

Epilogue

Discussion of the Zhang et al. 2017 paper by Patro et al., along with a tweet by Lappalainen about programs not giving identical results lead me to look more deeply into the Zhang et al. 2017 paper.

The exploration turned out to be interesting. On the one hand, some figures in Zhang et al. 2017 contradict Lappalainen’s claim that “none of the methods seem to give identical results…”.  For example, Figure S4 from the paper shows quantifications for four genes where kallisto and Salmon produce near identical results.

On the other hand, Figure 7 from the paper is an example from a simulation on a single gene where kallisto performed very differently from Salmon:

I contacted the authors to find out how they ran kallisto and Salmon. It turns out that for all the results in the paper with the exception of Figure 7, the programs were run as follows:

·       kallisto quant -i $KAL_INDEX –fr-stranded –plaintext$DATADIR/${f}_1.fq$DATADIR/${f}_2.fq -t 8 -o ./kallisto/ · salmon quant -i$SALMON_INDEX -l ISF -1 $DATADIR/${f}_1.fq -2 $DATADIR/${f}_2.fq -p 8 -o salmon_em  –incompatPrior 0

We then exchanged some further emails, after which they sent the data (reads) for the figure, we ran kallisto on our end and found discordant results with what was reported in the paper, they re-ran kallisto on their end, and after these exchanges we converged to an updated (and corrected) figure which shows Sailfish $\simeq$ kallisto but not Salmon $\simeq$ kallisto. The updated figure, shown below, was made by Zhang et al. using the default mode of kallisto version 0.43.1:

Note that kallisto is near identical in performance to Sailfish, which I explained in my blog post about Salmon has also converged to kallisto. However Salmon is different.

It turned out that for this one figure, Salmon was run with a non-standard set of options, specifically with the additional option –numPreAuxModelSamples 0 (although notably Patro did not recommend using the –gcBias option). The recommendation to run with this option was made by Patro to Zhang et al. after they contacted him early in January 2017 to ask for the best way to run Salmon for the experiment. What the flag does is turn off the online phase of Salmon (hence the 0 in –numPreAuxModelSamples 0) that is used to initially estimate the fragment length distribution. There is a good rationale for using the flag, namely the very small number of reads in the simulation makes it impossible to accurately learn auxiliary parameters as one might with a full dataset. However on January 13th, Patro changed the behavior of the option in a way that allowed Salmon to optimize for the specific experiment at hand. The default fragment length distribution in Salmon had been set the same as that in Cufflinks (mean 200, standard deviation 80). These settings match typical experimental data, and were chosen by Cole Trapnell and myself after examining numerous biological datasets. Setting –numPreAuxModelSamples to 0 forced Salmon to use those parameters. However on January 13th Patro changed the defaults in Salmon to mean = 250 and standard deviation = 25The numbers 250 and 25 are precisely the defaults for the polyester program that simulates reads. Polyester (with default parameters) is what Zhang et al. 2017 used to simulate reads for Figure 7.

Zhang et al. also contacted me on January 9th and I did not reply to their email. I had just moved institutions (from UC Berkeley to Caltech) on January 1st, and did not have the time to investigate in detail the issues they raised. I thank them for being forthcoming and helpful in reviewing Figure 7 post publication.

Returning to Lappalainen’s comment, it is true that Salmon results are different from kallisto in Figure 7 and one reason may be that Patro hard wired parameters for a flag that was used to match the parameters of the simulation. With the exception of that figure, throughout the paper Salmon $\simeq$ kallisto, providing yet another example of an independent publication confirming the claims of my blog post.

Two years ago I wrote a blog post on being wrong. It’s not fun to admit being wrong, but sometimes it’s necessary. I have to admit to being wrong again.

In May 2015 my coauthors and I released software called kallisto for RNA-Seq and we published a preprint concurrently. Several months before that, in February 2015, when initial results with kallisto showed that its accuracy was competitive with state-of-the-art programs but that it could quantify a hundred or more times faster, I went to seek advice from a licensing officer at UC Berkeley about licensing options. Even though most software in bioinformatics is freely available to both academia and industry, I felt, for reasons I outlined in a previous blog post, that it was right that commercial users should pay a fee to use kallisto. I believed then, and still do now, that it’s right that institutions that support software development should benefit from its commercial use (UC Berkeley receives 2/3 of the royalties for commercially licensed products), that students are entitled to renumeration for software engineering work that does not directly support of their own research goals, and that funds are needed to support specialized personnel who can maintain/improve code and service user requests.

After some discussion with UC Berkeley staff, who were helpful every step of the way, I finally licensed kallisto using standard language from a UC Berkeley license, which provided free access to academics and non-profit institutions while requiring companies to contact UC Berkeley for a commercial license. I wouldn’t call the decision an experiment. I truly believed it was the right thing to do and convinced my coauthors on the project to go along with the decision. Unfortunately,

I was wrong.

Shortly after kallisto was released Titus Brown wrote a blog post titled “On licensing in bioinformatics software: use the BSD, Luke“. One critique he made against the type of licensing arrangement I secured can be paraphrased as “such licensing necessitates conversations with lawyers”. I scoffed at this comment at the time, thinking to myself ok, so what if lawyers need to be involved to do the right thing? I also scoffed at a tag he associated with his post: “lior-is-wrong”.

Prior to licensing kallisto, with the exception of one program, my software was released freely to academia and industry. The exception was the AVID alignment program, a project that launched shortly I arrived at UC Berkeley as a postdoc, and whose licensing terms were decided in large part by collaborators at LBNL. They had first licensed visualization software (for AVID alignments) called VISTA the same way, and I think they did that because they came from the protein folding community where such licensing is common. In any case, there hadn’t been much lawyering going on (as far as I was aware) with the AVID/VISTA projects and I also didn’t think too much about the broader issues surrounding licensing choices. Open source free licensing also seemed good and throughout the years I always let my students decide what license they wanted for the software they’d written. With kallisto standing to have a huge impact on companies, enabling them to directly profit (in \$) from its speed, I decided it was timely to think about the pros and cons of different licenses again (it was later shown that kallisto does, indeed, greatly reduce costs for RNA-Seq analysis). This is the process that led up to the kallisto licensing, and the associated blog post I wrote.

However Titus Brown was right. Despite what I perceived to be best intentions from the UC Berkeley licensing staff and their counterparts at companies, what should have been simple licensing agreements signed and completed in a day, sometimes bogged down in lawyer infused negotiations. There were questions about indemnity clauses (prior to licensing kallisto I didn’t know what those were), payment terms, etc. etc. etc.  Some licenses were signed, but in some cases companies where I knew certain researchers wanted to use kallisto withdrew their requests due to term disagreements.

[September 2, 2017: A response to this post has been posted by the authors of Patro et al. 2017, and I have replied to them with a rebuttal]

Spot the difference

One of the maxims of computational biology is that “no two programs ever give the same result.” This is perhaps not so surprising; after all, most journals seek papers that report a significant improvement to an existing method. As a result, when developing new methods, computational biologists ensure that the results of their tools are different, specifically better (by some metric), than those of previous methods. The maxim certainly holds for RNA-Seq tools. For example, the large symmetric differences displayed in the Venn diagram below (from Zhang et al. 2014) are typical for differential expression tool benchmarks:

In a comparison of RNA-Seq quantification methods, Hayer et al. 2015 showed that methods differ even at the level of summary statistics (in Figure 7 from the paper, shown below, Pearson correlation was calculated using ground truth from a simulation):

These sort of of results are the norm in computational genomics. Finding a pair of software programs that produce identical results is about as likely as finding someone who has won the lottery… twice…. in one week. Well, it turns out there has been such a person, and here I describe the computational genomics analog of that unlikely event. Below are a pair of plots made using two different RNA-Seq quantification programs:

The two volcano plots show the log-fold change in abundance estimated for samples sequenced by Boj et al. 2015, plotted against p-values obtained with the program limma-voom. I repeat: the plots were made with quantifications from two different RNA-Seq programs. Details are described in the next section, but before reading it first try playing spot the difference.

The reveal

The top plot is reproduced from Supplementary Figure 6 in Beaulieu-Jones and Greene, 2017. The quantification program used in that paper was kallisto, an RNA-Seq quantification program based on pseudoalignment that was published in

The bottom plot was made using the quantification program Salmon, and is reproduced from a GitHub repository belonging to the lead author of

Patro et al. 2017 claim that “[Salmon] achieves the same order-of-magnitude benefits in speed as kallisto and Sailfish but with greater accuracy”, however after being unable to spot any differences myself in the volcano plots shown above, I decided, with mixed feelings of amusement and annoyance, to check for myself whether the similarity between the programs was some sort of fluke. Or maybe I’d overlooked something obvious, e.g. the fact that programs may tend to give more similar results at the gene level than at the transcript level. Thus began this blog post.

In the figure below, made by quantifying RNA-Seq sample ERR188140 with the latest versions of the two programs, each point is a transcript and its coordinates are the estimated counts produced by kallisto and salmon respectively.

Strikingly, the Pearson correlation coefficient is 0.9996026. However astute readers will recognize a possible sleight of hand on my part. The correlation may be inflated by similar results for the very abundant transcripts, and the plot hides thousands of points in the lower left-hand corner. RNA-Seq analyses are notorious for such plots that appear sounds but can be misleading. However in this case I’m not hiding anything. The Pearson correlation computed with $log(counts+1)$ is still extremely high (0.9955965) and the Spearman correlation, which gives equal balance to transcripts irrespective of the magnitude of their counts is 0.991206. My observation is confirmed in Table 3 of Sarkar et al. 2017 (note that in this table “quasi-mapping” corresponds to Salmon):

For context, the Spearman correlation between kallisto and a truly different RNA-Seq quantification program, RSEM, is 0.8944941. At this point I have to say… I’ve been doing computational biology for more than 20 years and I have never seen a situation where two ostensibly different programs output such similar results.

Patro and I are not alone in finding that Salmon $\simeq$ kallisto (if kallisto and Salmon gave identical results I would write that Salmon = kallisto but in lieu of the missing 0.004 in correlation I use the symbol $\, \simeq \,$ to denote the very very strong similarity). Examples in the literature abound, e.g. Supplementary Figure 5 from Majoros et al. 2017 (shown later in the post), Figure 1 from Everaert et al. 2017

or Figure 3A from Jin et al. 2017:

Just a few weeks ago, Sahraeian et al. 2017 published a comprehensive analysis of 39 RNA-Seq analysis tools and performed hierarchical clusterings of methods according to the similarity of their output. Here is one example (their Supplementary Figure 24a):

Amazingly, kallisto and Salmon-Quasi (the latest version of Salmon) are the two closest programs to each other in the entire comparison, producing output even more similar than the same program, e.g. Cufflinks or StringTie run with different alignments!

This raises the question of how, with kallisto published in May 2016 and Salmon $\simeq$ kallisto, Patro et al. 2017 was published in one of the most respected scientific publications that advertises first and foremost that it “is a forum for the publication of novel methods and significant improvements to tried-and-tested basic research techniques in the life sciences.” ?

How not to perform a differential expression analysis

The Patro et al. 2017 paper presents a number of comparisons between kallisto and Salmon in which Salmon appears to dramatically improve on the performance of kallisto. For example Figure 1c from Patro et al. 2017 is a table showing an enormous performance difference between kallisto and Salmon:

Figure 1c from Patro et al. 2017.

At a false discovery rate of 0.01, the authors claim that in a simulation study where ground truth is known Salmon identifies 4.5 times more truly differential transcripts than kallisto!

This can explain how Salmon was published, namely the reviewers and editor believed Patro et al.’s claims that Salmon significantly improves on previous work. In one analysis Patro et al. provide a p-value to help the “significance” stick. They write that “we found that Salmon’s distribution of mean absolute relative differences was significantly smaller (Mann-Whitney U test, P=0.00017) than those of kallisto. But how can the result Salmon >> kallisto, be reconciled with the fact that everybody repeatedly finds that Salmon $\simeq$ kallisto?

A closer look reveals three things:

1. In a differential expression analysis billed as “a typical downstream analysis” Patro et al. did not examine differential expression results for a typical biological experiment with a handful of replicates. Instead they examined a simulation of two conditions with eight replicates in each.
2. The large number of replicates allowed them to apply the log-ratio t-test directly to abundance estimates based on transcript per million (TPM) units, rather than estimated counts which are required for methods such as their own DESeq2.
3. The simulation involved generation of GC bias in an approach compatible with the inference model, with one batch of eight samples exhibiting “weak GC content dependence” while the other batch of eight exhibiting “more severe fragment-level GC bias.” Salmon was run in a GC bias correction mode.

These were unusual choices by Patro et al. What they did was allow Patro et al. to showcase the performance of their method in a way that leveraged the match between one of their inference models and the procedure for simulating the reads. The showcasing was enabled by having a confounding variable (bias) that exactly matches their condition variable, the use of TPM units to magnify the impact of that effect on their inference, simulation with a large number of replicates to enable the use of TPM,  which was possible because with many replicates one could directly apply the log t-test. This complex chain of dependencies is unraveled below:

There is a reason why log-fold changes are not directly tested in standard RNA-Seq differential expression analyses. Variance estimation is challenging with few replicates and RNA-Seq methods developers understood this early on. That is why all competitive methods for differential expression analysis such as DESeq/DESeq2, edgeR, limma-voom, Cuffdiff, BitSeq, sleuth, etc. regularize variance estimates (i.e., perform shrinkage) by sharing information across transcripts/genes of similar abundance. In a careful benchmarking of differential expression tools, Shurch et al. 2016 show that log-ratio t-test is the worst method. See, e.g., their Figure 2:

Figure 2 from Schurch et al. 2016. The four vertical panels show FPR and TPR for programs using 3,6,12 and 20 biological replicates (in yeast). Details are in the Schurch et al. 2016 paper.

The log-ratio t-test performs poorly not only when the number of replicates is small and regularization of variance estimates is essential. Schurch et al. specifically recommend DESeq2 (or edgeR) when up to 12 replicates are performed. In fact, the log-ratio t-test was so bad that it didn’t even make it into their Table 2 “summary of recommendations”.

The authors of Patro et al. 2017 are certainly well-aware of the poor performance of the log-ratio t-test. After all, one of them was specifically thanked in the Schurch et al. 2016 paper “for his assistance in identifying and correcting a bug”. Moreover, the recommended program by Schurch etal. (DESeq2) was authored by one of the coauthors on the Patro et al. paper, who regularly and publicly advocates for the use of his programs (and not the log-ratio t-test):

This recommendation has been codified in a detailed RNA-Seq tutorial where M. Love et al. write that “This [Salmon + tximport] is our current recommended pipeline for users”.

In Soneson and Delorenzi, 2013, the authors wrote that “there is no general consensus regarding which [RNA-Seq differential expression] method performs best in a given situation” and despite the development of many methods and benchmarks since this influential review, the question of how to perform differential expression analysis continues to be debated. While it’s true that “best practices” are difficult to agree on, one thing I hope everyone can agree on is that in a “typical downstream analysis” with a handful of replicates

do not perform differential expression with a log-ratio t-test.

Turning to Patro et al.‘s choice of units, it is important to note that the requirement of shrinkage for RNA-Seq differential analysis is the reason most differential expression tools require abundances measured in counts as input, and do not use length normalized units such as Transcripts Per Million (TPM). In TPM units the abundance $\rho_t$ for a transcript t is $\rho_t \propto \frac{c_t}{N \cdot l_t}$ where $c_t$ are the estimated counts for transcript t, $l_t$ is the (effective) length of t and N the number of total reads. Whereas counts are approximately Poisson distributed (albeit with some over-dispersion), variance estimates of abundances in TPM units depend on the lengths used in normalization and therefore cannot be used directly for regularization of variance estimation. Furthermore, the dependency of TPM on effective lengths means that abundances reported in TPM are very sensitive to the estimates of effective length.

This is why, when comparing the quantification accuracy of different programs, it is important to compare abundances using estimated counts. This was highlighted in Bray et al. 2016: “Estimated counts were used rather than transcripts per million (TPM) because the latter is based on both the assignment of ambiguous reads and the estimation of effective lengths of transcripts, so a program might be penalized for having a differing notion of effective length despite accurately assigning reads.” Yet Patro et al. perform no comparisons of programs in terms of estimated counts.

A typical analysis

The choices of Patro et al. in designing their benchmarks are demystified when one examines what would have happened had they compared Salmon to kallisto on typical data with standard downstream differential analysis tools such as their own tximport and DESeq2. I took the definition of “typical” from one of the Patro et al. coauthors’ own papers (Soneson et al. 2016): “Currently, one of the most common approaches is to define a set of non-overlapping targets (typically, genes) and use the number of reads overlapping a target as a measure of its abundance, or expression level.”

The Venn diagram below shows the differences in transcripts detected as differentially expressed when kallisto and Salmon are compared using the workflow the authors recommend publicly (quantifications -> tximport -> DESeq2) on a typical biological dataset with three replicates in each of two conditions. The number of overlapping genes is shown for a false discovery rate of 0.05 on RNA-Seq data from Trapnell et al. 2014:

A Venn diagram showing the overlap in genes predicted to be differential expressed by kallisto (blue) and Salmon (pink). Differential expression was performed with DESeq2 using transcript-level counts estimated by kallisto and Salmon and imported to DESeq2 with tximport. Salmon was run with GC bias correction.

This example provides Salmon the benefit of the doubt- the dataset was chosen to be older (when bias was more prevalent) and Salmon was not run in default mode but rather with GC bias correction turned on (option –gcBias).

When I saw these numbers for the first time I gasped. Of course I shouldn’t have been surprised; they are consistent with repeated published experiments in which comparisons of kallisto and Salmon have revealed near identical results. And while I think it’s valuable to publish confirmation of previous work, I did wonder whether Nature Methods would have accepted the Patro et al. paper had the authors conducted an actual “typical downstream analysis”.

Patro et al. utilized TPM based comparisons for all the results in their paper, presumably to highlight the improvements in accuracy resulting from better effective length estimates. Numerous results in the paper suggest that Salmon is much more accurate than kallisto. However I had seen a figure in Majoros et al. 2017 that examined the (cumulative) distribution of both kallisto and Salmon abundances in TPM units (their Supplementary Figure 5) in which the curves literally overlapped at almost all thresholds:

The plot above was made with Salmon v0.7.2 so in fairness to Patro et al. I remade it using the ERR188140 dataset mentioned above with Salmon v0.8.2:

The distribution of abundances (in TPM units) as estimated by kallisto (blue circles) and Salmon (red stars).

The blue circles correspond to kallisto and the red stars inside to Salmon. With the latest version of Salmon the similarity is even higher than what Majoros et al. observed! The Spearman correlation between kallisto and Salmon with TPM units is 0.9899896.

It’s interesting to examine what this means for a (truly) typical TPM analysis. One way that TPMs are used is to filter transcripts (or genes) by some threshold, typically TPM >  1 (in another deviation from “typical”, a key table in Patro et al. 2017 – Figure 1d – is made by thresholding with TPM > 0.1). The Venn diagram below shows the differences between the programs at the typical TPM > 1  threshold:

A Venn diagram showing the overlap in transcripts predicted by kallisto and Salmon to have estimated abundance > 1 TPM.

The figures above were made with Salmon 0.8.2 run in default mode. The correlation between kallisto and Salmon (in TPM) units decreases a tiny amount, from 0.9989224 to 0.9974325 with the –gcBias option and even the Spearman correlation decreases by only 0.011 from 0.9899896 to 0.9786092.

I think it’s perfectly fine for authors to present their work in the best light possible. What is not ok is to deliberately hide important and relevant truth, which in this case is that Salmon $\, \simeq \,$ kallisto.

A note on speed

One of the claims in Patro et al. 2017 is that “[the speed of Salmon] roughly matches the speed of the recently introduced kallisto.” The Salmon claim is based on a benchmark of an experiment (details unknown) with 600 million 75bp paired-end reads using 30 threads. Below are the results of a similar benchmark of Salmon showing time to process 19 samples from Boj et al. 2015 with variable numbers of threads:

First, Salmon with –gcBias is considerably slower than default Salmon. Furthermore, there is a rapid decrease in performance gain with increasing number of threads, something that should come as no surprise. It is well known that quantification can be I/O bound which means that at some point, extra threads don’t provide any gain as the disk starts grinding limiting access from the CPUs. So why did Patro et al. choose to benchmark runtime with 30 threads?

The figure below provides a possible answer:

In other words, not only is Salmon $\simeq$ kallisto in accuracy, but contrary to the claims in Patro et al. 2017, kallisto is faster. This result is confirmed in Table 1 of Sarkar et al. 2017 who find that Salmon is slower by roughly the same factor as seen above (in the table “quasi-mapping” is Salmon).

Having said that, the speed differences between kallisto and Salmon should not matter much in practice and large scale projects made possible with kallisto (e.g. Vivian et al. 2017) are possible with Salmon as well. Why then did the authors not report their running time benchmarks honestly?

The first common notion

The Patro et al. 2017 paper uses the term “quasi-mapping” to describe an algorithm, published in Srivastava et al. 2016, for obtaining their (what turned out to be near identical to kallisto) results. I have written previously how “quasi-mapping” is the same as pseudoalignment as an alignment concept, even though Srivastava et al. 2016 initially implemented pseudoalignment differently than the way we described it originally in our preprint in Bray et al. 2015. However the reviewers of Patro et al. 2017 may be forgiven for assuming that “quasi-mapping” is a technical advance over pseudoalignment. The Srivastava et al. paper is dense and filled with complex technical detail. Even for an expert in alignment/RNA-Seq it is not easy to see from a superficial reading of the paper that “quasi-mapping” is an equivalent concept to kallisto’s pseudoalignment (albeit implemented with suffix arrays instead of a de Bruijn graph). Nevertheless, the key to the paper is a simple sentence: “Specifically, the algorithm [RapMap, which is now used in Salmon] reports the intersection of transcripts appearing in all hits” in the section 2.1 of the paper. That’s the essence of pseudoalignment right there. The paper acknowledges as much, “This lightweight consensus mechanism is inspired by Kallisto ( Bray et al. , 2016 ), though certain differences exist”. Well, as shown above, those differences appear to have made no difference in standard practice, except insofar as the Salmon implementation of pseudoalignment being slower than the one in Bray et al. 2016.

Srivastava et al. 2016 and Patro et al. 2017 make a fuss about the fact that their “quasi-mappings” take into account the starting positions of reads in transcripts, thereby including more information than a “pure” pseudoalignment. This is a pedantic distinction Patro et al. are trying to create. Already in the kallisto preprint (May 11, 2015),  it was made clear that this information was trivially accessible via a reasonable approach to pseudoalignment: “Once the graph and contigs have been constructed, kallisto stores a hash table mapping each k-mer to the contig it is contained in, along with the position within the contig.”

In other words, Salmon is not producing near identical results to kallisto due to an unprecedented cosmic coincidence. The underlying method is the same. I leave it to the reader to apply Euclid’s first common notion:

Things which equal the same thing are also equal to each other.

Convergence

While Salmon is now producing almost identical output to kallisto and is based on the same principles and methods, this was not the case when the program was first released. The history of the Salmon program is accessible via the GitHub repository, which recorded changes to the code, and also via the bioRxiv preprint server where the authors published three versions of the Salmon preprint prior to its publication in Nature Methods.

The first preprint was published on the BioRxiv on June 27, 2015. It followed shortly on the heels of the kallisto preprint which was published on May 11, 2015. However the first Salmon preprint described a program very different from kallisto. Instead of pseudoalignment, Salmon relied on chaining SMEMs (super-maximal exact matches) between reads and transcripts to identifying what the authors called “approximately consistent co-linear chains” as proxies for alignments of reads to transcripts. The authors then compared Salmon to kallisto writing that “We also compare with the recently released method of Kallisto which employs an idea similar in some respects to (but significantly different than) our lightweight-alignment algorithm and again find that Salmon tends to produce more accurate estimates in general, and in particular is better able [to] estimate abundances for multi-isoform genes.” In other words, in 2015 Patro et al. claimed that Salmon was “better” than kallisto. If so, why did the authors of Salmon later change the underlying method of their program to pseudoalignment from SMEM alignment?

Inspired by temporal ordering analysis of expression data and single-cell pseudotime analysis, I ran all the versions of kallisto and Salmon on ERR188140, and performed PCA on the resulting transcript abundance table to be able to visualize the progression of the programs over time. The figure below shows all the points with the exception of three: Sailfish 0.6.3, kallisto 0.42.0 and Salmon 0.32.0. I removed Sailfish 0.6.3 because it is such an outlier that it caused all the remaining points to cluster together on one side of the plot (the figure is below in the next section). In fairness I also removed one Salmon point (version 0.32.0) because it differed substantially from version 0.4.0 that was released a few weeks after 0.32.0 and fixed some bugs. Similarly, I removed kallisto 0.42.0, the first release of kallisto which had some bugs that were fixed 6 days later in version 0.42.1.

Evidently kallisto output has changed little since May 12, 2015. Although some small bugs were fixed and features added, the quantifications have been very similar. The quantifications have been stable because the algorithm has been the same.

On the other hand the Salmon trajectory shows a steady convergence towards kallisto. The result everyone is finding, namely that currently Salmon $\simeq$ kallisto is revealed by the clustering of recent versions of Salmon near kallisto. However the first releases of Salmon are very different from kallisto. This is also clear from the heatmap/hierarchical clustering of  Sahraeian et al. in which Salmon-SMEM was included (Salmon used SMEMs until version 0.5.1, sometimes labeled fmd, until “quasi-mapping” became the default). A question: if Salmon ca. 2015 was truly better than kallisto then is Salmon ca. 2017 worse than Salmon ca. 2015?

Convergence of Salmon and Sailfish to kallisto over the course of a year. The x-axis labels the time different versions of each program were released. The y-axis is PC1 from a PCA of transcript abundances of the programs.

Prestamping

The bioRxiv preprint server provides a feature by which a preprint can be linked to its final form in a journal. This feature is useful to readers of the bioRxiv, as final published papers are generally improved after preprint reader, reviewer, and editor comments have been addressed. Journal linking is also a mechanism for authors to time stamp their published work using the bioRxiv. However I’m sure the bioRxiv founders did not intend the linking feature to be abused as a “prestamping” mechanism, i.e. a way for authors to ex post facto receive a priority date for a published paper that had very little, or nothing, in common with the original preprint.

A comparison of the June 2015 preprint mentioning the Salmon program and the current Patro et al. paper reveals almost nothing in common. The initial method changed drastically in tandem with an update to the preprint on October 3, 2015 at which point the Salmon program was using “quasi mapping”, later published in Srivastava et al. 2016. Last year I met with Carl Kingsford (co-corresponding author of Patro et al. 2017) to discuss my concern that Salmon was changing from a method distinct from that of kallisto (SMEMs of May 2015) to one that was replicating all the innovations in kallisto, without properly disclosing that it was essentially a clone. Yet despite a promise that he would raise my concerns with the Salmon team, I never received a response.

At this point, the Salmon core algorithms have changed completely, the software program has changed completely, and the benchmarking has changed completely. The Salmon project of 2015 and the Salmon project of 2017 are two very different projects although the name of the program is the same. While some features have remained, for example the Salmon mode that processes transcriptome alignments (similar to eXpress) was present in 2015, and the approach to likelihood maximization has persisted, considering the programs the same is to descend into Theseus’ paradox.

Interestingly, Patro specifically asked to have the Salmon preprint linked to the journal:

The linking of preprints to journal articles is a feature that arXiv does not automate, and perhaps wisely so. If bioRxiv is to continue to automatically link preprints to journals it needs to focus not only on eliminating false negatives but also false positives, so that journal linking cannot be abused by authors seeking to use the preprint server to prestamp their work after the fact.

The fish always win?

The Sailfish program was the precursor of Salmon, and was published in Patro et al. 2014. At the time, a few students and postdocs in my group read the paper and then discussed it in our weekly journal club. It advocated a philosophy of “lightweight algorithms, which make frugal use of data, respect constant factors and effectively use concurrent hardware by working with small units of data where possible”. Indeed, two themes emerged in the journal club discussion:

1. Sailfish was much faster than other methods by virtue of being simpler.

2. The simplicity was to replace approximate alignment of reads with exact alignment of k-mers. When reads are shredded into their constituent k-mer “mini-reads”, the difficult read -> reference alignment problem in the presence of errors becomes an exact matching problem efficiently solvable with a hash table.

Despite the claim in the Sailfish abstract that “Sailfish provides quantification time…without loss of accuracy” and Figure 1 from the paper showing Sailfish to be more accurate than RSEM, we felt that the shredding of reads must lead to reduced accuracy, and we quickly checked and found that to be the case; this was later noted by others, e.g. Hensman et al. 2015, Lee et al. 2015).

After reflecting on the Sailfish paper and results, Nicolas Bray had the key idea of abandoning alignments as a requirement for RNA-Seq quantification, developed pseudoalignment, and later created kallisto (with Harold Pimentel and Páll Melsted).

I mention this because after the publication of kallisto, Sailfish started changing along with Salmon, and is now frequently discussed in the context of kallisto and Salmon as an equal. Indeed, the PCA plot above shows that (in its current form, v0.10.0) Sailfish is also nearly identical to kallisto. This is because with the release of Sailfish 0.7.0 in September 2015, Patro et al. started changing the Sailfish approach to use pseudoalignment in parallel with the conversion of Salmon to use pseudoalignment. To clarify the changes in Sailfish, I made the PCA plot below which shows where the original version of Sailfish that coincided with the publication of Patro et al. 2014 (version 0.6.3 March 2014) lies relative to the more recent versions and to Salmon:

In other words, despite a series of confusing statements on the Sailfish GitHub page and an out-of-date description of the program on its homepage, Sailfish in its published form was substantially less accurate and slower than kallisto, and in its current form Sailfish is kallisto.

In retrospect, the results in Figure 1 of Patro et al. 2014 seem to be as problematic as the results in Figure 1 of Patro et al. 2017.  Apparently crafting computational experiments via biased simulations and benchmarks to paint a distorted picture of performance is a habit of Patro et al.

In the post I wrote that “The history of the Salmon program is accessible via the GitHub repository, which recorded changes to the code, and also via the bioRxiv preprint server where the authors published three versions of the Salmon preprint prior to its publication in Nature Methods” Here are the details of how these support the claims I make (tl;dr https://twitter.com/yarbsalocin/status/893886707564662784):

Sailfish (current version) and Salmon implemented kallisto’s pseudoalignment algorithm using suffix arrays

First, both Sailfish and Salmon use RapMap (via SACollector) and call mergeLeftRightHits():
Sailfish:
https://github.com/kingsfordgroup/sailfish/blob/352f9001a442549370eb39924b06fa3140666a9e/src/SailfishQuantify.cpp#L192
Salmon:
https://github.com/COMBINE-lab/salmon/commit/234cb13d67a9a1b995c86c8669d4cefc919fbc87#diff-594b6c23e3bdd02a14cc1b861c812b10R2205

The RapMap code for “quasi mapping” executes an algorithm identical to psuedoalignment, down to the detail of what happens to the k-mers in a single read:

First, hitCollector() calls getSAHits_():
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/include/SACollector.hpp#L249

Here kmers are used hashed to SAintervals (Suffix Array intervals), that are then extended to see how far ahead to jump. This is the one of two key ideas in the kallisto paper, namely that not all the k-mers in a read need to be examined to pseudoalign the read. It’s much more than that though, it’s the actual exact same algorithm to the level of exactly the k-mers that are examined. kallisto performs this “skipping” using contig jumping with a different data structure (the transcriptome de Bruijn graph) but aside from data structure used what happens is identical:

https://github.com/COMBINE-lab/RapMap/blob/c1e3132a2e136615edbb91348781cb71ba4c22bd/include/SACollector.hpp#L652
makes a call to jumping and the code to compute MMP (skipping) is
https://github.com/COMBINE-lab/RapMap/blob/c1e3132a2e136615edbb91348781cb71ba4c22bd/include/SASearcher.hpp#L77

There is a different detail in the Sailfish/Salmon code which is that when skipping forward the suffix array is checked for exact matching on the skipped sequence. kallisto does not have this requirement (although it could). On error-free data these will obviously be identical; on error prone data this may make Salmon/Sailfish a bit more conservative and kallisto a bit more robust to error. Also due to the structure of suffix arrays there is a possible difference in behavior when a transcript contains a repeated k-mer. These differences affect a tiny proportion of reads, as is evident from the result that kallisto and Salmon produce near identical results.

The second key idea in kallisto of intersecting equivalence classes for a read. This exact procedure is in:
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/include/SACollector.hpp#L363
which calls:
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/src/HitManager.cpp#L599

There was a choice we had to make in kallisto of how to handle information from paired end reads (does one require consistent pseudoalignment in both? Just one suffices to pseudoalign a read?)
The code for intersection between left and right reads making the identical choices as kallisto is:
https://github.com/COMBINE-lab/RapMap/blob/bd76ec5c37bc178fd93c4d28b3dd029885dbe598/include/RapMapUtils.hpp#L810

In other words, stepping through what happens to the k-mers in a read shows that Sailfish/Salmon copied the algorithms of kallisto and implemented it with the only difference being a different data structure used to hash the kmers. This is why, when I did my run of Salmon vs. kallisto that led to this blog post I found that
vs
salmon 69,701,169.
That’s a difference of 79,000 out of ~70 million = 0.1%.

1.  Until the kallisto program and preprint was published Salmon used SMEMs. Only after kallisto does Salmon change to using kmer cached suffix array intervals.
2. The kallisto preprint did not discuss outputting position as part of pseudoalignment because it was not central to the idea. It’s trivial to report pseudoalignment positions with either data structure and in fact both kallisto and Salmon do.

I want to make very clear here that I think there can be great value in implementing an algorithm with a different data structure. It’s a form of reproducibility that one can learn from: how to optimize, where performance gains can be made, etc. Unfortunately most funding agencies don’t give grants for projects whose goal is solely to reproduce someone else’s work. Neither do most journal publish papers that set out to do that. That’s too bad. If Patro et al. had presented their work honestly, and explained that they were implementing pseudoalignment with a different data structure to see if it’s better, I’d be a champion of their work. That’s not how they presented their work.

Salmon copied details in the quantification

The idea of using the EM algorithm for quantification with RNA-Seq goes back to Jiang and Wong, 2009, arguably even to Xing et al. 2006. I wrote up the details of the history in a review in 2011 that is on the arXiv. kallisto runs the EM algorithm on equivalence classes, an idea that originates with Nicolae et al. 2011 (or perhaps even Jiang and Wong 2009) but whose significance we understood from the Sailfish paper (Patro et al. 2014). Therefore the fact that Salmon (now) and kallisto both use the EM algorithm, in the same way, makes sense.

However Salmon did not use the EM algorithm before the kallisto preprint and program were published. It used an online variational Bayes algorithm instead. In the May 18, 2015 release of Salmon there is no mention of EM. Then, with the version 0.4 release date Salmon suddenly switches to the EM. In implementing the EM algorithm there are details that must be addressed, for example setting thresholds for when to terminate rounds of inference based on changes in the (log) likelihood (i.e. determine convergence).

For example, kallisto sets parameters
const double alpha_limit = 1e-7;
const double alpha_change_limit = 1e-2;
const double alpha_change = 1e-2;

in EMalgorithm.h
https://github.com/pachterlab/kallisto/blob/90db56ee8e37a703c368e22d08b692275126900e/src/EMAlgorithm.h
The link above shows that these kallisto parameters were set and have not changed since the release of kallisto
Also they were not always this way, see e.g. the version of April 6, 2015:
https://github.com/pachterlab/kallisto/blob/2651317188330f7199db7989b6a4dc472f5d1669/src/EMAlgorithm.h
This is because one of the things we did is explore the effects of these thresholds, and understand how setting them affects performance. This can be seen also in a legacy redundancy, we have both alpha_change and alpha_change_limit which ended up being unnecessary because they are equal in the program and used on one line.

The first versions of Salmon post-kallisto switched to the EM, but didn’t even terminate it the same way as kallisto, adopting instead a maximum iteration of 1,000. See
https://github.com/COMBINE-lab/salmon/blob/59bb9b2e45c76137abce15222509e74424629662/include/CollapsedEMOptimizer.hpp
from May 30, 2015.
This changed later first with the introduction of minAlpha (= kallisto’s alpha_limit)
https://github.com/COMBINE-lab/salmon/blob/56120af782a126c673e68c8880926f1e59cf1427/src/CollapsedEMOptimizer.cpp
and then alphaCheckCutoff (kallisto’s alpha_change_limit)
https://github.com/COMBINE-lab/salmon/blob/a3bfcf72e85ebf8b10053767b8b506280a814d9e/src/CollapsedEMOptimizer.cpp

Here are the salmon thresholds:
double minAlpha = 1e-8;
double alphaCheckCutoff = 1e-2;
double cutoff = minAlpha;

Notice that they are identical except that minAlpha = 1e-8 and not kallisto’s alpha_limit = 1e-7. However in kallisto, from the outset, the way that alpha_limit has been used is:
if (alpha_[ec] < alpha_limit/10.0) {
alpha_[ec] = 0.0;
}

In other words, alpha_limit in kallisto is really 1e-8, and has been all along.

The copying of all the details of our program have consequences for performance. In the sample I ran kallisto performed 1216 EM rounds of EM vs. 1214 EM rounds in Salmon.

Sailfish (current version) copied our sequence specific bias method

One of the things we did in kallisto is implement a sequence specific bias correction along the lines of what was done previously in Roberts et al. 2011, and later in Roberts et al. 2013. Implementing sequence specific bias correction in kallisto required working things out from scratch because of the way equivalence classes were being used with the EM algorithm, and not reads. I worked this out together with Páll Melsted during conversations that lasted about a month in the Spring of 2015. We implemented it in the code although did not release details of how it worked with the initial preprint because it was an option and not default, and we thought we might want to still change it before submitting the journal paper.

Here Rob is stating that Salmon can account for biases that kallisto cannot:
https://www.biostars.org/p/143458/#143639
This was a random forest bias correction method different from kallisto’s.

Shortly thereafter, here is the source code in Sailfish deprecating the Salmon bias correction and switching to kallisto’s method:
https://github.com/kingsfordgroup/sailfish/commit/377f6d65fe5201f7816213097e82df69e4786714#diff-fe8a1774cd7c858907112e6c9fda1e9dR76

https://github.com/kingsfordgroup/sailfish/commit/be0760edce11f95377088baabf72112f920874f9#diff-3e922f9589567fee3b20671da9493c82R34

https://github.com/kingsfordgroup/sailfish/commit/be0760edce11f95377088baabf72112f920874f9#diff-b14c09a136906d1c5d8534afa3a51c4cR818

This is the update to effective length in kallisto:
https://github.com/pachterlab/kallisto/blob/e5957cf96f029be4e899e5746edcf2f63e390609/src/weights.cpp#L184
Here is the Sailfish code:

Notice that there has been a literal copying down to the variable names:

The code written by the student of Rob was:

The code written by us is

efflen *= 0.5*biasAlphaNorm/biasDataNorm;

The code rewritten by Rob (editing that of the student):

effLength *= 0.5 * (txomeNormFactor / readNormFactor);

Note that since our bias correction method was not reported in our preprint, this had to have been copied directly from our codebase and was done so without any attribution.

I raised this specific issue with Carl Kingsford by email prior to our meeting in April 13 2016. We then discussed it in person. The conversation and email were prompted by a change to the Sailfish README on April 7, 2016 specifically accusing us of comparing kallisto to a “ **very old** version of Sailfish”:
https://github.com/kingsfordgroup/sailfish/commit/550cd19f7de0ea526f512a5266f77bfe07148266

What was stated is “The benchmarks in the kallisto paper *are* made against a very old version of Sailfish” not “were made against”. By the time that was written, it might well have been true. But kallisto was published in May 2015, it benchmarked with the Sailfish program described in Patro et al. 2014, and by 2016 Sailfish had changed completely implementing the pseudoalignment of kallisto.

Another aspect of an RNA-Seq quantification program is effective length estimation. There is an attribution to kallisto in the Sailfish code now explaining that this is from kallisto:
“Computes (and returns) new effective lengths for the transcripts based on the current abundance estimates (alphas) and the current effective lengths (effLensIn). This approach is based on the one taken in Kallisto
This is from January 23rd, 2016, almost 9 months after kallisto was released, and 3 months before the Sailfish README accused us of not testing the latest version of Sailfish in May 2015.

The attribution for effective lengths is also in the Salmon code, from 6 months later June 2016:
https://github.com/COMBINE-lab/salmon/blob/335c34b196205c6aebe4ddcc12c380eb47f5043a/include/DistributionUtils.hpp

There is also an acknowledgement in the Salmon code that a machine floating point tolerance we use
https://github.com/pachterlab/kallisto/blob/master/src/EMAlgorithm.h#L19
was copied.
The acknowledgment in Salmon is here
https://github.com/COMBINE-lab/salmon/blob/a3bfcf72e85ebf8b10053767b8b506280a814d9e/src/CollapsedEMOptimizer.cpp
This is the same file where the kallisto thresholds for the EM were copied to.

So after copying our entire method, our core algorithm, many of our ideas, specific parameters, and numerous features… really just about everything that goes into an RNA-Seq quantification project, there is an acknowledgment that our machine tolerance threshold was “intelligently chosen”.

During my third year of undergraduate study I took a course taught by David Gabai in which tests were negatively graded. This meant that points were subtracted for every incorrect statement made in a proof. As proofs could, in principle, contain an unbounded number of false statements, there was essentially no lower bound on the grade one could receive on a test (course grades was bounded below by “F”). Although painful on occasion, I grew to love the class and the transcendent lessons that it taught. These memories came flooding back this past week, when a colleague brought to my attention the paper A simple and fast algorithm for K-medoids clustering by Hae-Sang Park and Chi-Hyuck Jun.

The Park-Jun paper introduces a K-means like method for K-medoid clustering. K-medoid clustering is a clustering formulation based on a generalization of (a relative of) the median rather than the mean, and is useful for the same robustness reasons that make the median preferable to the mean. The medoid is defined as follows: given n points $x_1,\ldots,x_n$ in $\mathbb{R}^d$, the medoid is a point $x$ among them with the property that it minimizes the average distance to the other points, i.e. $x \in \{x_1,\ldots,x_n\}$ minimizes $\sum_{i=1}^n ||x-x_i||$. In the case of $d=1$, when n is odd, this corresponds to the median (the direct generalization of the median is to find a point $x$ not necessarily among the $x_i$ minimizing the average distance to the other points, and this is called the geometric median).

The K-medoid problem is to partition the points into k disjoint subsets $S_1,\ldots,S_n$ so that if $m_1,\ldots,m_k$ are the respective medoids of the subsets (clusters) then the average distance of each medoid to the points of the cluster it belongs to  is minimized. In other words, the K-medoids problem is to find

$argmin_{S=\{S_1,\ldots,S_k\}} \sum_{j=1}^k \sum_{i \in S_k} ||m_j-x_i||$.

For comparison, K-means clustering is the problem of finding

$argmin_{S=\{S_1,\ldots,S_k\}} \sum_{j=1}^k \sum_{i \in S_k} ||\mu_j - x_i||^2$,

where the $\mu_j$ are the centroids of the points in each partition $S_i$. The K-means and K-medoids problems are instances of partitional clustering formulations that are NP-hard.

The most widely used approach for finding a (hopefully good) solution to the K-medoids problem has been a greedy clustering algorithm called PAM by Kaufman and Rousseeuw (partitioning around medoids). To understand the Park-Jun contribution it is helpful to briefly review PAM first.  The method works as follows:

1. Initialize by selecting k distinguished points from the set of points.
2. Identify, for each point, the distinguished point it is closest to. This identification partitions the points into sets and a “cost” can be associated to the partition, namely the sum of the distances from each point to its associated distinguished point (note that the distinguished points may not yet be the medoids of their respective partitions).
3. For each distinguished point d, and each non-distingsuished point repeat the assignment and cost calculation of step  2 with and x swapped so that x becomes distinguished and returns to undistinguished status. Choose the swap that minimizes the total cost. If no swap reduces the cost then output the partition and the distinguished points.
4. Repeat step 3 until termination.

It’s easy to see that when the PAM algorithm terminates the distinguished points must be medoids for the partitions associated to them.

The running time of PAM is $O(k(n-k)^2)$. This is because in step 3, for each of the k distinguished points, there are n-k swaps to consider for a total of $k(n-k)$ swaps, and the cost computation for each swap requires n-k assignments and additions. Thus, PAM is quadratic in the number of points.

To speed-up the PAM method, Park and Jun introduced in their paper an analogy of the K-means algorithm, with mean replaced by median:

1. Initialize by selecting k distinguished points from the set of points.
2. Identify, for each point, the distinguished point it is closest to. This identification partitions the points into sets in the same way as the PAM algorithm.
3. Compute the medoid for each partition. Repeat step 2 until the cost no longer decreases.

Park-Jun claim that their method has run time complexity “$O(nk)$ which is equivalent to K-means clustering”. Yet the pseudocode in the paper begins with the clearly quadratic requirement “calculate the distance between every pair of all objects..”

Step 3 of the algorithm is also quadratic. The medoid of a set of m points is computed in time $O(m^2)$. Given m points the medoid must be one of them, so it suffices to compute, for each point, the sum of the distances to the others ($m \times m$ additions) and a medoid is then identified by taking the minimum. Furthermore, without assumptions on the structure of the distance matrix there cannot be a faster algorithm (with the triangle inequality the medoid can be computed in $O(n^{\frac{3}{2}}$).

Quadratic functions are not linear. If they were, it would mean that, with $a,b \neq 0$$ax^2=bx \mbox{ for all } x$. If that were the case then

$ax^2=bx$

$\Rightarrow ax^2-bx=0$ for all x.

Assuming that a is positive and plugging in $x=\frac{b-\sqrt{b^2+4a}}{2a}$ one would obtain that $ax^2-bx=1$ and it would follow that

$0=1$.

When reading the paper, it was upon noticing this “result” that negative grading came to mind. With a proof that 0=1 we are at, say, a score of -10 for the paper. Turning to the discussion of Fig. 3 of the paper we read that “…the proposed method takes about constant time near zero regardless of the number of objects.”

I suppose a generous reading of this statement is that it is an allusion to Taylor expansion of the polynomial $f(x)=x^2$ around $x=0$. A less generous interpretation is that the figure is deliberately misleading, intended to show linear growth with slope close to zero for what is a quadratic function. I decided to be generous and grade this a -5, leading to a running total of -15.

It seems the authors may have truly believed that their algorithm was linear because in the abstract they begin with “This paper proposes a new algorithm for K-medoids clustering which runs like the K-means algorithm”. It seems possible that the authors thought they had bypassed what they viewed as the underlying cause for quadratic running time of the PAM algorithm, namely the quadratic swap. The K-means algorithm (Lloyd’s algorithm) is indeed linear, because the computation of the mean of a set of points is linear in the number of points (the values for each coordinate can be averaged separately). However the computation of a medoid is not the same as computation of the mean. -5, now a running total of20. The actual running time of the Park-Jun algorithm is shown below:

Replicates on different instances are crucial, because absent from running time complexity calculations are the stopping times which are data dependent. A replicate of Figure 3 is shown below (using the parameters in Table 5 and implementations in MATLAB). The Park-Jun method is called as “small” in MATLAB.

Interestingly, the MATLAB implementation of PAM has been sped up considerably. Also, the “constant time near zero” behavior described by Park and Jun is clearly no such thing. For this lack of replication of Figure 3 another deduction of 5 points for a total of -25.

There is not much more to the Park and Jun paper, but unfortunately there is some. Table 6 shows a comparison of K-means, PAM and the Park-Jun method with true clusters based on a simulation:

The Rand index is a measure of concordance between clusters, and the higher the Rand index the greater the concordance. It seems that the Park-Jun method is slightly better than PAM, which are both superior to K-means in the experiment. However the performance of K-means is a lot better in this experiment than suggested by Figure 2 which was clearly (rotten) cherry picked (Fig2b):

For this I deduct yet another 5 points for a total of -30.

Amazingly, the Park and Jun paper has been cited 479 times. Many of the citations propagate the false claims in the paper, e.g. in referring to Park and Jun, Zadegan et al. write “in each iteration the new set of medoids is selected with running time O(N), where N is the number of objects in the dataset”.  The question is, how many negative points is the repetition of false claims from a paper that is filled with false claims worth?

According to Bach’s first biographer, when the great composer was asked to explain the secret of his success he replied “I was obliged to be industrious. Whoever is equally industrious will succeed equally well”. I respectfully disagree.

First, Bach didn’t have to drive the twenty kids he fathered over his lifetime to soccer practices, arrange their playdates, prepare and cook dinners, and do the laundry. To his credit he did teach his sons composition (though notably only his sons). But even that act was time for composition, an activity which he was seemingly able to devote himself to completely. For example, the fifteen two part inventions were composed in Köthen between 1720 and 1723 as exercises for his 9 year old son Wilhelm Friedemann, at the same time he was composing the first volume of the Well Tempered Clavier. In other words Bach had the freedom and time to pursue a task which was valued (certainly after Mendelssohn’s revival) by others. That is not the case for many, whose hard work, even when important, is trivialized and diminished when they are alive and forgotten when they are dead.

The second reason I disagree with Bach’s suggestion that those who work as hard as he did will succeed equally well is that his talent transcended that which is achievable by hard work alone. Although there is some debate as to whether Bach’s genius was in his revealing of beauty or his discovery of truth (I tend to agree with Richard Taruskin, whose view is discussed eloquently by Bernard Chazelle in an interview from 2014), regardless of how one thinks of Bach genius, his music was extraordinary. Much was made recently of the fact that Chuck Berry’s Johnny B. Goode was included the Voyager spacecraft record, but, as Lewis Thomas said,  if we really wanted to boast we would have sent the complete works of Bach. As it is, three Bach pieces were recorded, the most of any composer or musician. The mortal Chuck Berry and Wolfgang Amadeus Mozart appear just once. In other words, even the aliens won’t believe Bach’s claim that ” whoever is equally industrious will succeed equally well”.

Perhaps nitpicking Bach’s words is unfair, because he was a musician and not an orator or writer. But what about his music? Others have blogged him, but inevitably an analysis of his music reveals that even though he may have occasionally broken his “rules”, the fact that he did so, and how he did so, was the very genius he is celebrated for. Still, it’s fun to probe the greatest composer of all time, and I recently found myself doing just that, with none other than one of his “simplest” works, one of the inventions written as an exercise for his 9 year old child. The way this happened is that after a 30 year hiatus, I recently found myself revisiting some of the two part inventions. My nostalgia started during a power outage a few weeks ago, when I realized the piano was the only interesting non-electronic instrument that was working in the house, and I therefore decided to use the opportunity to learn a fugue that I hadn’t tackled before. A quick attempt revealed that I would be wise to brush up on the inventions first. My favorite invention is BWV 784 in A minor, and after some practice it slowly returned to my fingers, albeit in a cacophony of uneven, rheumatic dissonance. But I persevered, and not without reward. Not a musical reward- my practice hardly improved the sound- but a music theory reward. I realized, after playing some of the passages dozens of times, that I was playing a certain measure differently because it sounded better to me the “wrong” way. How could that be?

The specific measure was number 16:This measure is the third in a four measure section that starts in measure 14, in which a sequence of broken chords (specifically diminished seventh chords) connect a section ending in E minor with one beginning in A minor. It’s an interesting mathomusical problem to consider the ways in which those keys can be bridged in a four measure sequence (according to the “rules of Bach”), and in listening to the section while playing it I had noticed that the E♮notes at the end of measure 15 and in measure 16 were ok… but somehow a bit strange to my ear. To understand what was going on I had to dust off my First Year Harmony book by William Lovelock for some review (Lovelock’s text is an introduction to harmony and counterpoint and is a favorite of mine, perhaps because it is written in an axiomatic style). The opening broken diminished seventh chord in measure 14 is really the first inversion of the leading note (C♯) for D minor. It is followed by a dominant seventh (A) leading into the key of C major. We see the same pattern repeated there (leading diminished seventh in first inversion followed by the dominant seventh) and two more times until measure 18. In other words, Bach’s solution to the problem of arriving at A minor from E minor was to descend via D -> C -> B♭ -> A with alternating diminished sevenths with dominant sevenths to facilitate the modulations.

Except that is not what Bach did! Measure 16 is an outlier in the key of E minor instead of the expected B♭. What I was hearing while playing was that it was more natural for me to play the E ♭notes (marked in red in the manuscript above) than the annotated E♮notes. My 9 year old daughter recorded my (sloppy, but please ignore that!) version below:

Was Bach wrong? I tried to figure it out but I couldn’t think of a good reason why he would diverge from the natural harmonic sequence for the section. I found the answer in a booklet formally analyzing the two part inventions. In An Analytical Survey of the Fifteen Two-Part Inventions by J.S. Bach by Theodore O. Johnson the author explains that the use of E♮instead of E ♭at the end of measure 15 and in the beginning of measure 16 preserves the sequence of thirds each three semitones apart. My preference for E♭breaks the sequence- I chose musical harmony of mathematical harmony. Bach obviously faced the dilemma but refused to compromise. The composed sequence, even though not strictly the expected harmonic one, sounds better the more one listens to it. Upon understanding the explanation, I’ve grown to like it just as much and I’m fine with acknowledging that E minor works perfectly well in measure 16. It’s amazing to think how deeply Bach thought through every note in every composition. The attention to detail, even in a two part invention intended as an exercise for a 9 year old kid, is incredible.

The way Bach wrote the piece, with the E♮can be heard in Glenn Gould‘s classical recording:

Gould performs the invention in 45 seconds, an astonishing technical feat that has been panned by some but that I will admit I am quite fond of listening to. The particular passage in question is harmonically complex- Johnson notes that the three diminished seventh chords account for all twelve tones- and I find that Gould’s rendition highlights the complexity rather than diminishing it. In any case, what’s remarkable about Gould’s performance is that he too deviates from Bach’s score! Gould plays measure 18 the same way as the first measure instead of with the inversion which Bach annotated. I marked the difference in the score (see oval in battleship gray, which was Gould’s favorite color).

Life is short, break the rules.

This post is dedicated to the  National Endowment for the Arts and
the National Endowment for the Humanities who enrich our lives by supporting the arts and the humanities.

 Lior Pachter on GDE² = DGE² + DTU² = DTE₁² +… Sahin Naqvi on GDE² = DGE² + DTU² = DTE₁² +… rishabh on Why I read the network nonsens… rishabh on The network nonsense of Albert… rishabh on The network nonsense of Manoli… yufreegravatar on Bioinformatics on a Rock6… Anonymous on Terry Speed: a “male… Nick Halmagyi on The network nonsense of Albert… fedster9 on Waterman’s egg Leonardo de Oliveira… on Waterman’s egg

### Blog Stats

• 1,750,858 views