You are currently browsing the monthly archive for May 2015.

In this blog post I offer a cash prize for computing a p-value [update June 9th: the winner has been announced!]. For details about the competition you can skip directly to the challenge. But context is important:

Background

I’ve recently been reading a bioRxiv posting by X. Lan and J. Pritchard, Long-term survival of duplicate genes despite absence of subfunctionalized expression (2015) that examines the question of whether gene expression data (from human and mouse tissues) supports a model of duplicate preservation by subfunctionalization.

The term subfunctionalization is a hypothesis for explaining the ubiquity of persistence of gene duplicates in extant genomes. The idea is that gene pairs arising from a duplication event evolve, via neutral mutation, different functions that are distinct from their common ancestral gene, yet together recapitulate the original function. It was introduced in 1999 an alternative to the older hypothesis of neofunctionalization, which posits that novel gene functions arise by virtue of “retention” of one copy of a gene after duplication, while the other copy morphs into a new gene with a new function. Neofunctionalization was first floated as an idea to explain gene duplicates in the context of evolutionary theory by Haldane and Fisher in the 1930s, and was popularized by Ohno in his book Evolution by Gene Duplication published in 1970. The cartoon below helps to understand the difference between the *functionalization hypotheses (adapted from wikipedia):

Lan and Pritchard examine the credibility of the sub- and neofunctionalization hypotheses using modern high-throughput gene expression (RNA-Seq) data: in their own words “Based on theoretical models and previous literature, we expected that–aside from the youngest duplicates–most duplicate pairs would be functionally distinct, and that the primary mechanism for this would be through divergent expression profiles. In particular, the sub- and neofunctionalization models suggest that, for each duplicate gene, there should be at least one tissue where that gene is more highly expressed than its partner.”

What they found was that, in their words, that “surprisingly few duplicate pairs show any evidence of sub-/neofunctionalization of expression.” The went further, stating that “the prevailing model for the evolution of gene duplicates holds that, to survive, duplicates must achieve non-redundant functions, and that this usually occurs by partitioning the expression space. However, we report here that sub-/neofunctionalization of expression occurs extremely slowly, and generally does not happen until the duplicates are separated by genomic rearrangements. Thus, in most cases long-term survival must rely on other factors.” They propose instead that “following duplication the expression levels of a gene pair evolve so that their combined expression matches the optimal level. Subsequently, the relative expression levels of the two genes evolve as a random walk, but do so slowly (33) due to constraint on their combined expression. If expression happens to become asymmetric, this reduces functional constraint on the minor gene. Subsequent accumulation of missense mutations in the minor gene may provide weak selective pressure to eventually eliminate expression of this gene, or may free the minor gene to evolve new functions.”

The Lan and Pritchard paper is the latest in a series of works that examine high-browed evolutionary theories with hard data, and that are finding reality to be far more complicated than the intuitively appealing, yet clearly inadequate, hypotheses of neo- and subfunctionalization. One of the excellent papers in the area is

Dean et al. Pervasive and Persistent Redundancy among Duplicated Genes in Yeast, PLoS Genetics, 2008.

where the authors argue that in yeast “duplicate genes do not often evolve to behave like singleton genes even after very long periods of time.” I mention this paper, from the Petrov lab, because its results are fundamentally at odds with what is arguably the first paper to provide genome-wide evidence for neofunctionalization (also in yeast):

M. Kellis, B.W. Birren and E.S. Lander, Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisae, Nature 2004.

At the time, the Kellis-Birren-Lander paper was hailed as containing “work that may lead to better understanding of genetic diseases” and in the press release Kellis stated that “understanding the dynamics of genome duplication has implications in understanding disease. In certain types of cancer, for instance, cells have twice as many chromosomes as they should, and there are many other diseases linked to gene dosage and misregulation.” He added that “these processes are not much different from what happened in yeast.” and the author of the press releases added that “whole genome duplication may have allowed other organisms besides yeast to achieve evolutionary innovations in one giant leap instead of baby steps. It may account for up to 80 percent (seen this number before?) of flowering plant species and could explain why fish are the most diverse of all vertebrates.”

This all brings me to:

The challenge

In the abstract of their paper, Kellis, Birren and Lander wrote that:

Strikingly, 95% of cases of accelerated evolution involve only one member of a gene pair, providing strong support for a specific model of evolution, and allowing us to distinguish ancestral and derived functions.” [boldface by authors]
In the main text of the paper, the authors expanded on this claim, writing:

Strikingly, in nearly every case (95%), accelerated evolution was confined to only one of the two paralogues. This strongly supports the model in which one of the paralogues retained an ancestral function while the other, relieved of this selective constraint, was free to evolve more rapidly”.

The word “strikingly” suggests a result that is surprising in its statistical significance with respect to some null model the authors have in mind. The data is as follows:

The authors identified 457 duplicated gene pairs that arose by whole genome duplication (for a total of 914 genes) in yeast. Of the 457 pairs 76 showed accelerated (protein) evolution in S. cerevisiae. The term “accelerated” was defined to relate to amino acid substitution rates in S. cerevisiae, which were required to be 50% faster than those in another yeast species, K. waltii. Of the 76 genes, only four pairs were accelerated in both paralogs. Therefore 72 gene pairs showed acceleration in only one paralog (72/76 = 95%).

So, is it indeed “striking” that “in nearly every case (95%), accelerated evolution was confined to only one of the two praralogues”? Well, the authors don’t provide a pvalue in their paper, nor do they propose a null model with respect to which the question makes sense. So I am offering a prize to help crowdsource what should have been an exercise undertaken by the authors, or if not a requirement demanded by the referees. To incentivize people in the right direction,

I will award ${\bf \frac{\100}{p}}$

to the person who can best justify a reasonable null model, together with a p-value (p) for the phrase “Strikingly, 95% of cases of accelerated evolution involve only one member of a gene pair” in the abstract of the Kellis-Birren-Lander paper. Notice the smaller the (justifiable) p-value someone can come up with, the larger the prize will be.

Bonus: explain in your own words how you think the paper was accepted to Nature without the authors having to justify their use of the word “strikingly” for a main result of the paper, and in a timeframe consisting of submission on December 17th 2003 (just three days before Hanukkah and one week before Christmas) and acceptance January 19th 2004 (Martin Luther King Jr. day).

Rules

To be eligible for the prize entries must be submitted as comments on this blog post by 11:59pm EST on Sunday May 31st June 7th, 2015 and they must be submitted with a valid e-mail address. I will keep the name (and e-mail address) of the winner anonymous if they wish (this can be ensured by using a pseudonym when submitting the entry as a comment). The prize, if awarded, will go to the person submitting the most complete, best explained solution that has a pvalue calculation that is correct according to the model proposed. Preference will be given to submission from students, especially undergraduates, but individuals in any stage of their career, and from anywhere in the world, are encouraged to submit solutions. I reserve the right to interpret the phrase “reasonable null model” in a way that is consistent with its use in the scientific community and I reserve the right to not award the prize if no good/correct solutions are offered. Participants do not have to answer the bonus question to win.

About one and a half years ago I wrote a blog post titled “GTEx is throwing away 90% of their data“. The post was, shall we say, “direct”. For example, in reference to the RNA-Seq quantification program Flux Capacitor I wrote that

Using Flux Capacitor is equivalent to throwing out 90% of the data!

I added that “the methods description in the Online Methods of Montgomery et al. can only be (politely) described as word salad” (after explaining that the methods underlying the program were never published, except for a brief mention in that paper). I referred to the sole figure in Montgomery et al. as a “completely useless” description of the method  (and showed that it contained errors). I highlighted the fact that many aspects of Flux Capacitor, its description and documentation provided on its website were “incoherent”. Can we agree that this description is not flattering?

The claim about “throwing out 90% of the data” was based on benchmarking I reported on in the blog post. If I were to summarize the results (politely), I would say that the take home message was that Flux Capacitor is junk. Perhaps nobody had really noticed because nobody cared about the program. Flux Capacitor was literally being used only by the authors of the program  (and their affiliates, which turned out to include the ENCODE, GENCODE, GEUVADIS and GTEx consortiums). In fact, when I wrote the blog post, I don’t think the program had ever been benchmarked or compared to other tools. It was, after all, unpublished and besides, nobody reads consortium papers. However after I blogged a few others decided to include Flux Capacitor in their benchmarks and the conclusions reached were the same as mine: Flux Capacitor is junk and Flux Capacitor is junk. Of course some people objected to my blog post when it came out, so it’s fun to be right and have others say so in print. But true vindication has come in the form of a citation to the blog post in a published paper in a journal! Specifically, in

C. Iannone, A. Pohl, P. Papasaikas, D. Soronellas, G.P. Vincent, M. Beato and J. Valcárcel, Relationship between nucleosome positioning and progesterone-induced alternative splicing in breast cancer cells, RNA 21 (2015) 360–374

the authors cite my blog post. They write:

Ummm…. wait… WHAT THE FLUX? The authors actually used Flux Capacitor for their analysis, and are citing my blog at https://liorpachter.wordpress.com/tag/flux-capacitor/ as the definitive reference for the program. Wait, what again?? They used my blog post as a reference for the method??? This is like [[ readers are invited to leave a comment offering a suitable analogy ]].

I’m not really sure what the authors can do at this point. They could publish an erratum and replace the citation. But with what? Flux Capacitor still hasn’t been published (!) Then there is the journal. Does any journal really think it is acceptable to list my blog as the citation for an RNA-Seq quantification tool that is fundamental for the results in a paper? (I’m flattered, but still…) Speaking of the journal, where were the reviewers? How could they not catch this? And the readers? The paper has been out since January… I have to ask: has anybody read it? Of course the biggest embarrassment here is the fact that there is a citation for Flux Capacitor at all. Why on earth are the authors using this discredited program??? Well maybe one answer is to be found in the acknowledgments section, where the group of a PI from the GTEx project is thanked. Actually, this PI was the last author on one of the recently published GTEx companion papers, which, I am sad to say… used Flux Capacitor (albeit with some quantifications performed with Cufflinks as well to demonstrate “robustness”). Why would GTEx be pushing for Flux Capacitor and insist on its use? We’ve come full circle to my GTEx blog post. By now I don’t even know what I think is the most embarrassing part of this whole story. So I thought I’d host a poll:

Today I posted the preprint N. Bray, H. Pimentel, P. Melsted and L. Pachter, Near-optimal RNA-Seq quantification with kallisto to the arXiv. It describes the RNA-Seq quantification program kallisto. [Update April 5, 2016: a revised version of the preprint has been published: Nicolas L. Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-optimal probabilistic RNA-Seq quantification, Nature Biotechnology (2016), doi:10.1038/nbt.3519 published online April 4, 2016.]

The project began in August 2013 when I wrote my second blog post, about another arXiv preprint describing a program for RNA-Seq quantification called Sailfish (now a published paper). At the time, a few students and postdocs in my group read the paper and then discussed it in our weekly journal club. It advocated a philosophy of “lightweight algorithms, which make frugal use of data, respect constant factors and effectively use concurrent hardware by working with small units of data where possible”. Indeed, two themes emerged in the journal club discussion:

1. Sailfish was much faster than other methods by virtue of being simpler.

2. The simplicity was to replace approximate alignment of reads with exact alignment of k-mers. When reads are shredded into their constituent k-mer “mini-reads”, the difficult read -> reference alignment problem in the presence of errors becomes an exact matching problem efficiently solvable with a hash table.

We felt that the shredding of reads must lead to reduced accuracy, and we quickly checked and found that to be the case. In fact, in our simulations, we saw that Sailfish significantly underperformed methods such as RSEM. However the fact that simpler was so much faster led us to wonder whether the prevailing wisdom of seeking to improve RNA-Seq analysis by looking at increasingly complex models was ill-founded. Perhaps simpler could be not only fast, but also accurate, or at least close enough to best-in-class for practical purposes.

After thinking about the problem carefully, my (now former) student Nicolas Bray realized that the key is to abandon the idea that alignments are necessary for RNA-Seq quantification. Even Sailfish makes use of alignments (of k-mers rather than reads, but alignments nonetheless). In fact, thinking about all the tools available, Nick realized that every RNA-Seq analysis program was being developed in the context of a “pipeline” of first aligning reads or parts of them to a reference genome or transcriptome. Nick had the insight to ask: what can be gained if we let go of that paradigm?

By April 2014 we had formalized the notion of “pseudoalignment” and Nick had written, in Python, a prototype of a pseudoaligner. He called the program kallisto. The basic idea was to determine, for each read, not where in each transcript it aligns, but rather which transcripts it is compatible with. That is asking for a lot less, and as it turns out, pseudoalignment can be much faster than alignment. At the same time, the information in pseudoalignments is enough to quantify abundances using a simple model for RNA-Seq, a point made in the isoEM paper, and an idea that Sailfish made use of as well.

Just how fast is pseudoalignment? In January of this year Páll Melsted from the University of Iceland came to visit my group for a semester sabbatical. Páll had experience in exactly the kinds of computer science we needed to optimize kallisto; he has written about efficient k-mer counting using the bloom filter and de Bruijn graph construction. He translated the Python kallisto to C++, incorporating numerous clever optimizations and a few new ideas along the way. His work was done in collaboration with my student Harold Pimentel, Nick (now a postdoc with Jacob Corn and Jennifer Doudna at the Innovative Genomics Initiative) and myself.

The screenshot below shows kallisto being used on my 2012 iMac desktop to build an index of the human transcriptome (5 min 8 sec), and then quantify 78.6 million GEUVADIS human RNA-Seq reads (14 min). When we first saw these results we thought they were simply too good to be true. Let me repeat: The quantification of 78.6 million reads takes 14 minutes on a standard desktop using a single CPU core. In some tests we’ve gotten even faster running times, up to 15 million reads quantified per minute.

The results in our paper indicate that kallisto is not just fast, but also very accurate. This is not surprising: underlying RNA-Seq analysis are the alignments, and although kallisto is pseudoaligning instead, it is almost always only the compatibility information that is used in actual applications. As we show in our paper, from the point of view of compatibility, the pseudoalignments and alignments are almost the same.

Although accuracy is a primary concern with analysis, we realized in the course of working on kallisto that speed is also paramount, and not just as a  matter of convenience. The speed of kallisto has three major implications:

1. It allows for efficient bootstrapping. All that is required for the bootstrap are reruns of the EM algorithm, and those are particularly fast within kallisto. The result is that we can accurately estimate the uncertainty in abundance estimates. One of my favorite figures from our paper, made by Harold, is this one:

It is based on an analysis of 40 samples of 30 million reads subsampled from 275 million rat RNA-Seq reads. Each dot corresponds to a transcript and is colored by its abundance. The x-axis shows the variance estimated from kallisto bootstraps on a single subsample while the y-axis shows the variance computed from the different subsamples of the data. We see that the bootstrap recapitulates the empirical variance. This result is non-trivial: the standard dogma, that the technical variance in RNA-Seq is “Poisson” (i.e. proportional to the mean) is false, as shown in Supplementary Figure 3 of our paper (the correlation becomes 0.64). Thus, the bootstrap will be invaluable when incorporated in downstream application and we are already working on some ideas.

2. It is not just the kallisto quantification that is fast; the index building, and even compilation of the program are also easy and quick. The implication for biologists is that RNA-Seq analysis now becomes interactive. Instead of “freezing” an analysis that might take weeks or even months, data can be explored dynamically, e.g. easily quantified against different transcriptomes, or re-quantified as transcriptomes are updated. The ability to analyze data locally instead of requiring cloud computation means that analysis is portable, and also easily secure.

3. We have found the fast turnaround of analysis helpful in improving the program itself. With kallisto we can quickly check the effect of changes in the algorithms. This allows for much faster debugging of problems, and also better optimization. It also allows us to release improvements knowing that users will be able to test them without resorting to a major computation that might take months. For this reason we’re not afraid to say that some improvements to kallisto will be coming soon.

As someone who has worked on RNA-Seq since the time of 32bp reads, I have to say that kallisto has personally been extremely liberating. It offers freedom from the bioinformatics core facility, freedom from the cloud, freedom from the multi-core server, and in my case freedom from my graduate students– for the first time in years I’m analyzing tons of data on my own; because of the simplicity and speed I find I have the time for it. Enjoy!