I have just returned from the Genome Informatics 2013 meeting at CSHL. Jennifer Harrow, Michael Schatz and James Taylor organized a fantastic event that I thoroughly enjoyed.
The purpose of this post is to provide a short summary of my keynote talk, which was titled “Stories from the Supplement” (the talk can be viewed here and the presentation downloaded here). The idea of talking about what goes on in the supplements of papers was triggered by a specific event towards the end of the reviewing/editing process for the paper:
A. Roberts and L. Pachter, Streaming algorithms for fragment assignment, Nature Methods 10 (2013), p 71–73.
After a thorough and productive review process, deftly handled by editor Tal Nawy, whose work on our behalf greatly improved the quality of the paper, we were sent an email from the journal shortly before publication stating that “If the Online Methods section contains more than 10 equations, move the equation-heavy portions to a separate Supplementary Note”. This requirement made it essentially impossible to properly explain our model and method in the paper. After publishing lengthy supplements for the Cufflinks papers (see below) that I feel were poorly reviewed, to the detriment of the papers, I decided it was time to talk about this issue in public.
C. Trapnell, B.A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M.J. van Baren, S.L. Salzberg, B.J. Wold and L. Pachter,
Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation, Nature Biotechnology 28, (2010), p 511–515.
Supplementary Material: 42 pages.
C. Trapnell, D.G. Hendrickson, M. Sauvageau, L. Goff, J.L. Rinn and L. Pachter,
Differential analysis of gene regulation at transcript resolution with RNA-seq, Nature Biotechnology, 31 (2012), p 46–53.
Supplementary Material: 70 pages.
My talk contained three examples selected to make a number of points:
- Methods in the supplement frequently contain ideas that transcend the specifics of the paper. These ideas can be valuable in the long run, but when they are in the supplement it is harder to identify what they are and to appreciate their significance.
- Supplements frequently contain errors (my own included). These errors make it difficult for others to understand the methods and implement them independently.
- In RNA-Seq specifically, there are a number of methodological issues buried in the supplements of various papers that have caused confusion in the field.
- The constant push of methods to supplements is part of a general trend to overemphasize the importance of data while minimizing the relevance of methods.
The examples were as follows:
- Fragment assignment: The idea of probabilistically assigning ambiguously mapped fragments in RNA-Seq is present in many papers, but at least for me, it was the math worked out in the supplements of those papers (and many conversations with my collaborators, especially Cole Trapnell and Adam Roberts) that made me realize the importance of fragment assignment for *Seq. I went on to explain how Nicolas Bray used these insights to develop a fragment assignment model for joint analysis of RNA-Seq. The result is the ability to magnify the effective coverage of individual samples from multiple samples, as shown in my talk using the GEUVADIS data:
In this plot each point represents the accuracy for the samples when quantified independently (black) or by our method (red/blue). The difference between red and blue has to do with a technical choice in the method that I explained in the talk.
- I talked about the problem of using raw counts for RNA-Seq analysis. Returning to a theme I have discussed in talks and on my blog previously, I explained that even when the goal is differential analysis, raw counts are flawed because “wrong does not cancel out wrong”. The idea of using raw count quantification knowing it is inaccurate, but arguing that it doesn’t matter because the bias cancels during comparisons (e.g. in differential expression or eQTL analysis), is mathematically equivalent to the following:
Acknowledging that $\frac{1}{2} + \frac{1}{3} \neq \frac{2}{5}$ (the right-hand side obtained by summing numerators and then dividing by the sum of denominators), but arguing that it is ok to say that
$\frac{\frac{1}{2}+\frac{1}{3}}{\frac{1}{4}+\frac{1}{5}} = \frac{2/5}{2/9} = \frac{9}{5}$, which is obviously not correct (the answer is $\frac{50}{27}$).
A key point I made is that even though it might seem that the wrong answer is at least close to the correct answer, in practice, on real data, the differences can be significant. I showed an analysis done by Cole Trapnell using an extensive dataset generated in the Rinn lab for the Cufflinks 2 paper that makes this point.
- I talked about the different units currently being used for RNA-Seq quantification, such as CPM, RPKM, FPKM and TPM (all of them appeared in various talks during the meeting). I discussed the history of the units and why they were chosen, and argued in favor of simply using the relative abundance estimates (perhaps normalized by a constant factor, as in TPM). This point of view was first advocated by Bo Li and Colin Dewey in their RSEM paper, and I hope the community will adopt their point of view.
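The failure of “cancellation” can be checked directly with exact rational arithmetic; a minimal sketch, using illustrative fractions of my own choosing (not numbers from the talk):

```python
from fractions import Fraction

# Correct sums of two pairs of fractions
a = Fraction(1, 2) + Fraction(1, 3)   # 5/6
b = Fraction(1, 4) + Fraction(1, 5)   # 9/20

# "Wrong" sums: numerators summed, then divided by summed denominators
a_wrong = Fraction(1 + 1, 2 + 3)      # 2/5
b_wrong = Fraction(1 + 1, 4 + 5)      # 2/9

# The errors do not cancel when taking the ratio
print(a / b)              # 50/27, about 1.852 (correct)
print(a_wrong / b_wrong)  # 9/5, exactly 1.8 (close, but wrong)
```

Note that the wrong ratio lands near the correct one here, which is exactly why the flaw is easy to overlook, and why it only becomes visible on real data at scale.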
My penultimate slide showed this world map of high-throughput sequencers. I think this is a very cool map, as it shows (by proxy) the extraordinary extent of sequencing going on worldwide. However, it is yet another example of the focus on data, and data generation, in genomics. Data is, of course, very important, but I showed another map for methods that illustrates a very different thing: the extent of computational biology going on around the world. The methods map is made from visits to the Cufflinks website. I mashed it up with the sequencer map to make the case that data and methods go hand in hand.
Sequencers of the world and users of Cufflinks.
5 comments
November 2, 2013 at 6:46 pm
Lax
Glad that you bring up the importance of the methods section. It gets buried in most papers published in the “glam” journals. Btw, Colin Dewey had emphasized the usage of TPM in private correspondence with me. I am glad he did that. It is also the responsibility of developers to make sure that they support the user base by helping users understand the method via documentation, mailing lists, etc.
November 22, 2013 at 5:50 am
Kristoffer Vitting-Seerup
I’ve just watched the video of your lecture from CSHL ( http://www.youtube.com/watch?v=5NiFibnbE8o ) and I find the part about the inherent problems in FPKM values very interesting and at the same time problematic.
But won’t most Tuxedo workflows, where you use Cuffmerge and afterwards Cuffdiff, be resistant to this problem, since Cuffmerge creates a “common” transcriptome for all replicates/conditions, whereby the mean transcript length is the same for all experiments?
November 22, 2013 at 7:09 am
Lior Pachter
The problem with FPKM is that the denominator in calculating it depends on the abundances (35:18 in my talk). So even though the transcripts may be identical (i.e. the l_r are equal), the proportionality constant may be different.
However FPKM is not a problem in tuxedo workflows, because in comparing experiments with Cuffdiff this normalization issue is taken care of. The reason to prefer TPM at this point is for the ability to compare abundances across papers more easily. I should add however that TPM units for raw relative abundances is still problematic, because what is really of interest in cross-experiment comparisons is absolute abundance, so that further normalization, e.g. quantile normalization, may be needed to arrive closer to that goal.
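The difference between the two normalizations can be made concrete in a few lines; a sketch with hypothetical transcript lengths and counts (all numbers made up for illustration):

```python
# Hypothetical transcripts: effective lengths (bp) and assigned fragment counts
lengths = {"t1": 1000, "t2": 2000, "t3": 3000}
counts  = {"t1": 500,  "t2": 300,  "t3": 200}

N = sum(counts.values())  # total mapped fragments in the sample

# FPKM: fragments per kilobase of transcript per million mapped fragments
fpkm = {t: counts[t] / (lengths[t] / 1e3) / (N / 1e6) for t in counts}

# TPM: normalize counts by length first, then rescale to sum to one million;
# this makes TPM a constant multiple of the relative abundance estimates
rate = {t: counts[t] / lengths[t] for t in counts}
denom = sum(rate.values())
tpm = {t: rate[t] / denom * 1e6 for t in counts}

print(fpkm)  # sums to denom * 1e9 / N, a constant that depends on the abundances
print(tpm)   # always sums to 1e6, so values are comparable across samples
```

The sum of the FPKM values involves the length-normalized rates themselves, so the constant of proportionality between FPKM and relative abundance varies from sample to sample, which is the issue described above.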
November 26, 2013 at 7:13 am
Damian Kao
Hi Lior,
Thanks for the great lecture from CSHL (I watched the youtube video). I still don’t quite understand what the problem with using raw counts for DE analysis is. Are you talking about possible flaws in how the raw counts are generated (the fragment assignment problems you talked about in your previous blog post) that can result in compounded errors in downstream analysis? Or are you talking about something inherently flawed about using raw counts as a unit of measurement?
November 26, 2013 at 7:45 pm
Lior Pachter
Hi Damian,
The issue I’m talking about with raw counts is that of fragment assignment of ambiguously mapped reads. Such reads are common in RNA-Seq (in eukaryotes) because many genes are composed of multiple similar isoforms. In a typical experiment ~75% of reads map ambiguously (to transcripts).
What people mean by “raw counts” is usually one of the following: (1) restricting to uniquely mapping reads, (2) counting the total number of reads mapping to a transcript, (3) counting the total number of reads mapping to a genomic region, or (4) counting the number of reads common to a set of transcripts. As I explain in my talk and previous blog posts, such approaches are problematic in that they can lead to inaccurate abundance estimates. Moreover, they remain problematic even for the purposes of differential expression analysis.
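The fragment assignment alternative to these counting schemes can be sketched as a toy EM iteration; this is a simplified illustration of the general idea (uniform transcript lengths, made-up compatibility sets), not the actual implementation in eXpress or Cufflinks:

```python
# Toy EM for assigning ambiguously mapped fragments to transcripts.
# Each fragment is compatible with a set of transcripts; relative
# abundances rho are re-estimated iteratively from the soft assignments.

fragments = [
    {"t1"},          # uniquely mapped
    {"t1", "t2"},    # ambiguous between two isoforms
    {"t1", "t2"},
    {"t2", "t3"},
    {"t3"},          # uniquely mapped
]
transcripts = ["t1", "t2", "t3"]

rho = {t: 1.0 / len(transcripts) for t in transcripts}  # uniform start

for _ in range(100):
    # E-step: split each fragment among its compatible transcripts
    # in proportion to the current abundance estimates
    expected = {t: 0.0 for t in transcripts}
    for frag in fragments:
        total = sum(rho[t] for t in frag)
        for t in frag:
            expected[t] += rho[t] / total
    # M-step: new abundances are the normalized expected counts
    n = sum(expected.values())
    rho = {t: expected[t] / n for t in transcripts}

print(rho)  # converged estimates; ambiguous fragments are split probabilistically
```

Restricting to unique reads would throw away three of the five fragments here, while the EM estimate uses all of them, weighting each compatible transcript by its estimated abundance.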