The purpose of this post is to provide a short summary of my keynote talk, which was titled “Stories from the Supplement” (the talk can be viewed here and the presentation downloaded here). The idea of talking about what goes on in the supplements of papers was triggered by a specific event towards the end of the reviewing/editing process for the paper:
A. Roberts and L. Pachter, Streaming algorithms for fragment assignment, Nature Methods 10 (2013), p 71–73.
After a thorough and productive review process, deftly handled by editor Tal Nawy, whose work on our behalf greatly improved the quality of the paper, we were sent an email from the journal shortly before publication stating that “If the Online Methods section contains more than 10 equations, move the equation-heavy portions to a separate Supplementary Note”. This requirement made it essentially impossible to properly explain our model and method in the paper. After publishing lengthy supplements for the Cufflinks papers (see below) that I feel were poorly reviewed, to the detriment of the papers, I decided it was time to talk about this issue in public.
C. Trapnell, B.A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M.J. van Baren, S.L. Salzberg, B.J. Wold and L. Pachter,
Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation, Nature Biotechnology 28, (2010), p 511–515.
Supplementary Material: 42 pages.
C. Trapnell, D.G. Hendrickson, M. Sauvageau, L. Goff, J.L. Rinn and L. Pachter,
Differential analysis of gene regulation at transcript resolution with RNA-seq, Nature Biotechnology 31 (2013), p 46–53.
Supplementary Material: 70 pages.
My talk contained three examples selected to make a number of points:
- Methods in the supplement frequently contain ideas that transcend the specifics of the paper. These ideas can be valuable in the long run, but when they are in the supplement it is harder to identify what they are and to appreciate their significance.
- Supplements frequently contain errors (my own included). These errors make it difficult for others to understand the methods and implement them independently.
- In RNA-Seq specifically, there are a number of methodological issues buried in the supplements of various papers that have caused confusion in the field.
- The constant push of methods to supplements is part of a general trend to overemphasize the importance of data while minimizing the relevance of methods.
The examples were as follows:
- Fragment assignment: The idea of probabilistically assigning ambiguously mapped fragments in RNA-Seq is present in many papers, but at least for me, it was the math worked out in the supplements of those papers (and many conversations with my collaborators, especially Cole Trapnell and Adam Roberts) that made me realize the importance of fragment assignment for *Seq. I went on to explain how Nicolas Bray used these insights to develop a fragment assignment model for joint analysis of RNA-Seq. The result is the ability to magnify the effective coverage of individual samples by borrowing information across multiple samples, as shown in my talk using the GEUVADIS data:
In this plot each point represents the accuracy for a sample when quantified independently (black) or jointly by our method (red/blue). The difference between red and blue has to do with a technical choice in the method that I explained in the talk.
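To make the idea of probabilistic fragment assignment concrete, here is a minimal EM sketch in the spirit of the approach, with toy data I made up; it is not the actual model or implementation from the papers above. Fragments that map ambiguously are assigned fractionally to transcripts in proportion to the current abundance estimates, and the abundances are then re-estimated from those fractional counts:

```python
# Toy EM for probabilistic fragment assignment (illustrative sketch only,
# not the model from the papers discussed above).
# Each fragment is represented by the set of transcripts it maps to.

def em_fragment_assignment(fragments, num_transcripts, iterations=100):
    # Start from a uniform abundance estimate.
    rho = [1.0 / num_transcripts] * num_transcripts
    for _ in range(iterations):
        counts = [0.0] * num_transcripts
        for compat in fragments:
            # E-step: split the fragment among compatible transcripts
            # in proportion to their current estimated abundances.
            total = sum(rho[t] for t in compat)
            for t in compat:
                counts[t] += rho[t] / total
        # M-step: re-estimate abundances from the fractional counts.
        n = sum(counts)
        rho = [c / n for c in counts]
    return rho

# Hypothetical data: transcript 0 has a uniquely mapping fragment and
# shares two more; transcript 1 is only ever ambiguously covered.
frags = [{0}, {0, 1}, {0, 1}, {1, 2}, {2}]
print(em_fragment_assignment(frags, 3))
```

The unique fragments anchor the estimates, and the ambiguous ones are then resolved toward the transcripts the unique evidence supports.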
- I talked about the problem of using raw counts for RNA-Seq analysis. Returning to a theme I have discussed in talks and on my blog previously, I explained that even when the goal is differential analysis, raw counts are flawed because “wrong does not cancel out wrong”. The idea of using raw count quantification knowing it is inaccurate, but arguing that it doesn’t matter because the bias cancels during comparisons (e.g. in differential expression or eQTL analysis) is mathematically equivalent to the following:
Acknowledging that

$\frac{a}{b}+\frac{c}{d} \neq \frac{a+c}{b+d}$

(the right-hand side obtained by summing numerators and then dividing by the sum of denominators), but arguing that it is ok to say that

$\frac{a}{b}+\frac{c}{d} - \left(\frac{e}{b}+\frac{f}{d}\right) = \frac{a+c}{b+d} - \frac{e+f}{b+d}$,

which is obviously not correct (the answer is $\frac{a-e}{b}+\frac{c-f}{d}$).
A key point I made is that even though it might seem that the wrong answer is at least close to the correct answer, in practice, on real data, the differences can be significant. I showed an analysis done by Cole Trapnell using an extensive dataset generated in the Rinn lab for the Cufflinks 2 paper that makes this point.
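The point that a systematic error in each estimate need not cancel in a comparison can be checked directly with toy fractions (the numbers below are my own illustration, not data from the talk). Summing numerators and dividing by the sum of denominators misestimates each condition, and the difference between the two wrong values differs from the true difference:

```python
from fractions import Fraction

# "Wrong does not cancel out wrong": compare the true difference of two
# sums of fractions against the difference of the (incorrect) values
# obtained by summing numerators and denominators. Toy numbers only.

a, b, c, d = 1, 2, 1, 3   # condition 1: a/b + c/d = 1/2 + 1/3
e, f = 1, 2               # condition 2: e/b + f/d = 1/2 + 2/3 (same denominators)

true_diff = (Fraction(a, b) + Fraction(c, d)) - (Fraction(e, b) + Fraction(f, d))
wrong_diff = Fraction(a + c, b + d) - Fraction(e + f, b + d)

print(true_diff)   # -1/3
print(wrong_diff)  # -1/5
```

Even though the denominators (the analogue of the shared bias) are identical across the two conditions, the difference computed from the incorrect sums is not the true difference.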
- I talked about the different units currently being used for RNA-Seq quantification, such as CPM, RPKM, FPKM and TPM (all of them appeared in various talks during the meeting). I discussed the history of the units, and why they were chosen, and argued in favor of simply using the relative abundance estimates (perhaps normalized by a constant factor, as in TPM). This point of view was first advocated by Bo Li and Colin Dewey in their RSEM paper, and I hope the community will adopt their point of view.
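To illustrate why TPM is essentially a scaled relative-abundance estimate, here is a small sketch of the standard TPM computation (counts divided by effective transcript length, renormalized to sum to one million); the counts and lengths below are hypothetical:

```python
# Sketch of the standard TPM computation, showing that TPM is just the
# relative abundance estimate scaled by 10^6. Toy inputs.

def tpm(counts, eff_lengths):
    # Reads per base of each transcript.
    rates = [c / l for c, l in zip(counts, eff_lengths)]
    total = sum(rates)
    # Relative abundances: nonnegative, sum to 1.
    rho = [r / total for r in rates]
    # TPM is the relative abundance times one million.
    return [1e6 * r for r in rho]

counts = [500, 1000, 1500]          # hypothetical estimated counts
eff_lengths = [1000.0, 2000.0, 3000.0]
print(tpm(counts, eff_lengths))     # all equal: each has the same rate per base
```

Unlike raw counts or RPKM/FPKM, the TPM values of a sample always sum to the same constant, which is what makes them interpretable as (scaled) relative abundances.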
My penultimate slide showed this world map of high-throughput sequencers. I think this is a very cool map, as it shows (by proxy) the extraordinary extent of sequencing going on worldwide. However, it is yet another example of the focus on data, and data generation, in genomics. Data is, of course, very important, but I showed another map for methods that illustrates a very different thing: the extent of computational biology going on around the world. The methods map is made from visits to the Cufflinks website. I mashed it up with the sequencer map to make the case that data and methods go hand-in-hand.
Sequencers of the world and users of Cufflinks.