You are currently browsing the monthly archive for November 2013.

[Update April 6, 2014: The initial title of this post was “23andme genotypes are all wrong”. While that was and remains a technically correct statement, I have changed it because the readership of my blog, and this post in particular, has changed. Initially, when I made this post, the readers of the blog were (computational) biologists with extensive knowledge of genotyping and association mapping, and they could understand the point I was trying to make with the title. However in the past few months the readership of my blog has grown greatly, and the post is now reaching a wide public audience. The revised title clarifies that the content of this post relates to the point that low error rates in genotyping can be problematic in the context of genome-wide association reports because of multiple-testing.]

I have been reading the flurry of news articles and blog posts written this week about 23andme and the FDA with some interest. In my research talks, I am fond of displaying 23andme results, and have found that people always respond with interest. On the teaching side, I have subsidized 23andme testing for volunteer students in Math127 who were interested in genetics so that they could learn about personalized genomics first-hand. Finally, a number of my former and current students have worked at 23andme, and some are current employees.

Despite lots of opinions being expressed about the 23andme vs. FDA kerfuffle, I believe that two key points have been ignored in the discussions:

1. All 23andme genotypes that have ever been reported to customers are wrong. This is the case despite very accurate genotyping technology used by 23andme.
2. The interpretation of 23andme results involves examining a large number of odds ratios. The presence of errors leads to a huge multiple-testing problem.

Together, these issues lead to an interesting conundrum for the company, for customers, and for the FDA.

I always find it useful to think about problems concretely. In the case of 23andme, it means examining actual genotypes. Fortunately, you don’t have to pay the company \$99 dollars to get your own- numerous helpful volunteers have posted their 23andme genotypes online. They can be viewed at openSNP.org where “customers of direct-to-customer genetic tests [can] publish their test results, find others with similar genetic variations, learn more about their results, get the latest primary literature on their variations and help scientists find new associations”. There are a total of 624 genotypes available at openSNP, many of them from 23andme. As an example, consider “samantha“, who in addition to providing her 23andme genotype, also provides lots of phenotypic information. Here is the initial part of her genotype file:

# This data file generated by 23andMe at: Wed Jul 20 20:37:11 2011
#
# Below is a text version of your data. Fields are TAB-separated
# Each line corresponds to a single SNP.  For each SNP, we provide its identifier
# (an rsid or an internal id), its location on the reference human genome, and the
# genotype call oriented with respect to the plus strand on the human reference
# sequence.     We are using reference human assembly build 36.  Note that it is possible
# that data downloaded at different times may be different due to ongoing improvements
#
# http://www.ncbi.nlm.nih.gov/projects/mapview/map_search.cgi?taxid=9606&build=36
#
# rsid	chromosome	position	genotype
rs4477212	1	72017	AA
rs3094315	1	742429	AG
rs3131972	1	742584	AG
rs12124819	1	766409	AA
rs11240777	1	788822	AA
rs6681049	1	789870	CC
rs4970383	1	828418	CC
rs4475691	1	836671	CC
rs7537756	1	844113	AA
rs13302982	1	851671	GG
rs1110052	1	863421	GT
...

Anyone who has been genotyped by 23andme can get this file for themselves from the website (by clicking on their name, then on “Browse Raw Data” from the pull-down menu, and then clicking on “Download” in the top-right corner of the browser window). The SNPs are labeled with rsid labels (e.g. rs3094315) and correspond to specific locations on chromosomes (e.g. chr1:742429). Since every human is diploid, two bases are shown for every SNP; one came from mom and one from dad. The 23andme genotype is not phased, which means that you can’t tell in the case of rs3094315 whether the A was from mom and the G from dad, or vice versa (it turns out paternal origin can be important, but that is a topic for another post).

A key question the FDA has asked, as it does for any diagnostic test, is whether the SNP calls are accurate. The answer is already out there. First, someone has performed a 23andme replicate experiment precisely to assess the error rate. In an experiment in 2010 with two replicates, 85 SNPs out of about 600,000 were different. Today, Illumina types around 1 million SNPs, so one would expect even more errors. Furthermore, a replicate analysis provides only a lower bound, since systematic errors will not be detected. Another way to examine the error rate is to look at genotypes of siblings. That was written about in this blog post which concluded there were 87 errors. 23andme currently uses the Illumina Omni Express for genotyping, and the Illumina spec sheet claims a similar error rate to those inferred in the blog posts mentioned above. The bottom line is that even though the error rate for any individual SNP call is very very low (<0.01% error), with a million SNPs being called there is (almost) certainly at least one error somewhere in the genotype. In fact, assuming a conservative error rate leading to an average of 100 errors per genotype, the probability that a 23andme genotype has no errors is less than 10^(-40).

The fact that 23andme genotypes are wrong (i.e. at least one error in some SNP) wouldn’t matter if one was only interested in a single SNP. With very high probability, it would be some other SNPs that are the wrong ones. But the way people use 23andme is not to look at a single SNP of interest, but rather to scan the results from all SNPs to find out whether there is some genetic variant with large (negative) effect. The good news is that there isn’t much information available for the majority of the 1 million SNPs being tested. But there are, nevertheless, lots of SNPs (thousands) to look at. Whereas a comprehensive exam at a doctor’s office might currently constitute a handful of tests– a dozen or a few dozen at most– a 23andme test assessing thousands of SNPs and hundreds of diseases/traits constitutes more diagnostic tests on an individual at one time than have previously been performed in a lifetime.

To understand how many tests are being performed in a 23andme experiment, it is helpful to look at the Interpretome website. The website allows a user to examine information on SNPs without paying, and without uploading the data. I took a look at Samantha, and the Interpretome gave information about 2829 SNPs. These are SNPs for which there is a research article that has identified the SNP as significant in some association study (the website conveniently provides direct links to the articles). For example, here are two rows from the phenotype table describing something about Samantha’s genetic predisposition for large head circumference:

Head circumference (infant) 11655470  CC  T  .05    4E-6    22504419
Head circumference (infant) 1042725    CC  T  .07    3E-10  22504419

Samantha’s genotype at the locus is CC, the “risk” allele is T, the odds ratios  are very small (0.05,0.07) and the p-values are apparently significant. Interpretome’s results differ from those of 23andme, but looking at the diversity of phenotypes reported on gives one a sense for the possibilities that currently exist in genetics, and the scope of 23andme’s reports.

From the estimates of error rates provided above, and using the back of an envelope, it stands to reason that about 1/3 of 23andme tested individuals have an error at one of their “interesting” SNPs. Not all of SNPs arising in association studies are related to diseases, but many of them are. I don’t think its unreasonable to postulate that a significant percentage of 23andme customers have some error in a SNP that is medically important. Whether such errors are typically false positives or false negatives is unclear, and the extent to which they may lead to significant odds ratios is another interesting question. In other words, its not good enough to know how frequently warfarin sensitivity is being called incorrectly. The question is how frequently some medically significant result is incorrect.

Of course, the issue of multiple testing as it pertains to interpreting genotypes is probably a secondary issue with 23andme. As many bloggers have pointed out, it is not even clear that many of 23andme’s odds ratios are accurate or meaningful. A major issue, for example, is the population background of an individual examining his/her genotype and how close it is to the population on which the GWAS were performed. Furthermore, there are serious questions about the meaning of the GWAS odds ratios in the case of complex traits. However I think the issue of multiple testing is a deeper one, and a problem that will only be exacerbated as more disease SNPs are identified. Having said that, there are also approaches that could mitigate errors and improve fidelity of the tests. As DECODE genetics has demonstrated, imputation and phasing can in principle be used to infer population haplotypes, which not only are useful for GWAS analyses, but can also be used to identify erroneous SNP calls. 23andme’s problem is that although they have many genotypes, they are from diverse populations that will be harder to impute and phase.

The issue of multiple testing arising in the context of 23andme and the contrast with classic diagnostics reminds me of the dichotomy between whole-genome analysis and classic single gene molecular biology. The way in which customers are looking at their 23andme results is precisely to look for the largest effects, i.e. phenotypes where they appear to have high odds of contracting a disease, or being sensitive to some drug. This is the equivalent of genome scientists picking the “low hanging fruit” out of genome-wide experiments such as those performed in ENCODE. In genomics, scientists have learned (with some exceptions)  how to interpret genome-wide analyses after correcting for multiple-hypothesis testing by controlling for false discovery rate. But are the customers of 23andme doing so? Is the company helping them do it? Should it? Will the FDA require it? Can looking at ones own genotype constitute too much testing?

There are certainly many precedents for superfluous harmful testing in medicine. For example, the American Academy of Family Physicians has concluded that prostate cancer PSA tests and digital rectal exams have marginal benefits that are outweighed by the harm caused by following up on positive results. Similar arguments have been made for mammography screening. I therefore think that there are serious issues to consider about the implications of direct-to-consumer genetic testing and although I support the democratization of genomics, I’m glad the FDA is paying attention.

Samantha’s type 2 diabetes risk as estimated from her genotype by Interpretome. She appears to have a lower risk than an average person. Does this make it ok for her to have another cookie?

My students run a journal club that meets once a week and last Friday the paper we discussed was M. Imakaev et al., Iterative correction of Hi-C data reveals hallmarks of chromosome organization, Nature Methods 9 (2012), p 999–1003. The paper, from Leonid Mirny’s group, describes an approach to analyzing Hi-C data, and is one of only a few papers that have been published focusing on methods for this *Seq assay. Present at journal club were my students Nicolas Bray, Shannon Hateley, Harold Pimentel, Atif Rahman, Lorian Schaeffer, Akshay Tambe, Faraz Tavakoli, postdocs Sreeram Kannan and Shannon McCurdy, and guest Emily Brown from Doris Bachtrog’s group. They all read the paper prior to our meeting, and the notes below are a summary of our discussion (although I have taken some more time to flesh out the math for the interested reader). For those who are impatient, the bottom line is that I think the title of the paper by Imakaev et al. should have had a few extra words (mine in italics) so that it would be “Iterative proportional fitting of a log-linear model for correction of Hi-C data reveals hallmarks of chromosome organization”. Read the rest of this entry »

I visited Duke’s mathematics department yesterday to give a talk in the mathematical biology seminar. After an interesting day meeting many mathematicians and (computational) biologists, I had an excellent dinner with Jonathan Mattingly, Sayan MukherjeeMichael Reed and David Schaeffer. During dinner conversation, the topic of probability theory (and how to teach it) came up, and in particular Buffon’s needle problem.

The question was posed by Georges-Louis Leclerc, Comte de Buffon in the 18th century:

Suppose we have a floor made of parallel strips of wood, each the same width, and we drop a needle onto the floor. What is the probability that the needle will lie across a line between two strips?

If the strips are distance $t$ apart, and $l \leq t$, then it is easy to see that the probability $P$ is given by

$P = \int_{\theta =0}^{\frac{\pi}{2}} \int_{x = 0}^{\frac{l}{2}sin \theta} \frac{4}{t \pi} dx d\theta = \frac{2l}{t \pi}$.

The appearance of $\pi$ in the denominator turns the problem into a Monte Carlo technique for estimating $\pi$: simply simulate random needle tosses and count crossings.

It turns out there is a much more elegant solution to the problem– one that does not require calculus. I learned of it from Gian-Carlo Rota when I was a graduate student at MIT. It appears in his book Introduction to Geometric Probability (with Dan Klain) that I have occasionally used when teaching Math 249. The argument relies on the linearity of expectation, and is as follows:

Let $f(l)$ denote the expected number of crossings when a needle of length $l$ is thrown on the floor. Now consider two needles, one of length $l$ and the other $m$, attached to each other end to end (possibly at some angle). If $X_1$ is a random variable describing the number of crossings of the first needle, and $X_2$ of the second, its certainly the case that $X_1$ and $X_2$ are dependent, but because expectation is linear, it is the case that $E(X_1+X_2) = E(X_1)+E(X_2)$. In other words, the total number of crossings is, in expectation, $f(l)+f(m)$.

Buffon’s needle problem: what is the probability that a needle of length $l \leq t$ crosses a line? (A) A short needle being thrown at random on a floor with parallel lines. (B) Two connected needles. The expected number of crossings is proportional to the sum of their lengths. (C) A circle of diameter always crosses exactly two lines.

It follows that $f$ is a linear function, and since $f(0)=0$, we have that $f(l) = cl$ where $c$ is some constant. Now consider a circle of diameter $t$. Such a circle, when thrown on the floor, always crosses the parallel lines exactly twice. If $C$ is a regular polygon with vertices on the circle, and the total length of the polygon segments is $l$, then the total number of crossings is $f(l)$. Taking the limit as the number of segments in the polygon goes to infinity, we find that $f(t \pi ) = 2$. In other words,

$f(t \pi) = c \cdot t \pi = 2 \Rightarrow c = \frac{2}{t \pi}$,

and the expected number of crossings of a needle of length l is $\frac{2l}{t \pi}$. If $l < t$, the number of crossings is either 0 or 1, so the expected number of crossings is, by definition of expectation, equal to the probability of a single crossing. This solves Buffon’s problem no calculus required!

The linearity of expectation appears elementary at first glance. The proof is simple, and it is one of the first “facts” learned in statistics– I taught it to my math 10 students last week. However the apparent simplicity masks its depth and utility; the above example is cute, and one of my favorites, but linearity of expectation is useful in many settings. For example I recently saw an interesting application in an arXiv preprint by Anand Bhaskar, Andy Clark and Yun Song on “Distortion of genealogical properties when the sample is very large“.

The paper addresses an important question, namely the suitability of the coalescent as an approximation to discrete time random mating models, when sample sizes are large. This is an important question, because population sequencing is starting to involve hundreds of thousands, if not millions of individuals.

The results of Bhaskar, Clark and Song are based on dynamic programming calculations of various genealogical quantities as inferred from the discrete time Wright-Fisher model. An example is the expected frequency spectrum for random samples of individuals from a population. By frequency spectrum, they mean, for each k, the expected number of polymorphic sites with k derived alleles and n-k ancestral alleles under an infinite-sites model of mutation in a sample of n individuals. Without going into details (see their equations (8),(9) and (10)), the point is that they are able to derive dynamic programming recursions because they are computing the expected frequencies, and the linearity of expectation is what allows for the derivation of the dynamic programming recursions.

None of this has anything to do with my seminar, except for the fact that the expectation-maximization algorithm did make a brief appearance, as it frequently does in my lectures these days. I spoke mainly about some of the mathematics problems that arise in comparative transcriptomics, with a view towards a principled approach to comparing transcriptomes between cells, tissues, individuals and species.

The Duke Chapel. While I was inside someone was playing the organ, and as I stared at the ceiling, I could have sworn I was in Europe.

In a first with RNA-Seq technology, scientists at Stanford University have broken through the single-cell barrier. In a recently published paper,  A.R. Wu et al., Quantitative assessment of single-cell RNA-sequencing methodsthe results of sequencing RNA from zero human cells. How was this accomplished? The gist of it is that an empty tube was filled with spike-in, and then submitted for RNA-Seq… The details are as follows: In order to assess the quality of single-cell RNA-Seq, Wu et al. performed numerous single-cell RNA-Seq experiments on HCT116 cells, as summarized in the figure below (Figure 1a from their paper).

Figure 1a from the Wu et al. paper showing the experimental design.

was interested in this study because for the regularized pooling project I’m working on with Nicolas Bray (see recent post),  it would be useful to demonstrate improvements in quantification accuracy by joint analysis of single-cell RNA-Seq. I asked Nick to look at the Wu et al. data when it came out two weeks ago, and as a first step he aligned the reads to the human transcriptome. Strangely, he found very low alignment rates, and in some cases literally almost no reads aligned at all. At first we thought there was some trimming issue, so we went to look at the Cufflinks output of the authors. The figure below, made by Nick, shows the percent spike-in (assessed by examining the abundance of ERCC-*) for each of the SMARTer based 96 samples:

The worst sample is C70 (GSM1241223) for which only 252 human transcripts have non-zero abundance. It is 99.828339% spike-in! The fact that the results of RNA-Seq on an empty test tube were published is in and of itself just a minor (?) embarrassment; more interesting is the range of quality obtained as measured by the amount of spike-in sequenced– a plot that we have made above and that seems crucial to the paper, but that was not produced by the authors. In fact, what the authors do show is slightly suspect: reproduced below is their Figure S2 from the Supplement:

Why would the authors show correlations for just four randomly picked samples? Why not show results for all of the data? We dug a bit deeper into this, and noticed that 93/96 of the FPKM file names look like [GEO accession]_CXX_ILXXXX. But the remaining three look like GSM1241223_C70_NTC_tube_ctrl_IL3196.sorted.genes.fpkm_tracking.txt.gz (which is the apparently empty tube), GSM1241245_C92_cell_tube_ctrl_IL3198.sorted.genes.fpkm_tracking.txt.gz and GSM1241195_C42_100ng_RNA_ctrl_IL3198.sorted.genes.fpkm_tracking.txt.gz. Therefore, these were presumably intended controls, but they were not published as such. There is the separate issue, that aside from the controls, the experiment in general appears to have some failure rate that is not clearly presented. This is evident in the following plot which Nick made, showing the average log-correlation of each experiment with the others after removing zeroes (the bottom one is C09 and the runner up is C70):

This figure is showing the honest truth of the paper. It is what it is; everyone I’ve talkedto that has actually performed single-cell RNA-Seq tells me that it is difficult and there is a non-trivial failure rate, on top of variable quality across cells. In fact, there is subtle evidence of failure in other papers. In the single-cell RNA-Seq technology race, the paper preceding Wu et al was A.K. Shalek et al., Single-cell transcriptomics reveals bimodality in expression and splicing in immune cellsNature (2013). In Shalek et al., the authors describe 18 single-cell experiments. Specifically, they claim to have constructed DNA libraries “from 18 single BMDCs (S1–S18), three replicate populations of 10,000 cells, and two negative controls (empty wells), and sequenced each to an average depth of 27 million read pairs.” However a close inspection of the GEO reveals the following IDs and descriptors:

 GSM1012777 Single cell S1 GSM1012778 Single cell S2 GSM1012779 Single cell S3 GSM1012780 Single cell S4 GSM1012781 Single cell S5 GSM1012782 Single cell S6 GSM1012783 Single cell S7 GSM1012784 Single cell S8 GSM1012785 Single cell S9 GSM1012786 Single cell S10 GSM1012787 Single cell S11 GSM1012788 Single cell S13 GSM1012789 Single cell S14 GSM1012790 Single cell S15 GSM1012791 Single cell S16 GSM1012792 Single cell S22 GSM1012793 Single cell S23 GSM1012794 Single cell S24

While there are 18 consecutive IDs, the cell labels range from 1–24. Where are the 6 missing cells? I can’t be sure, but they were probably failures. Update: the authors of the Shalek et al. paper explained to me after seeing the post that two of the missing labels were negative controls, and 3 were population replicates (the names of these were altered in GEO). $6-5=1$ which was indeed a failure (S12); it gave no signal on the BioAnalyzer and was therefore not sequenced. I was told that the authors are working on fixing the GEO sample names to clarify the reason for missing labels of samples. As such, it turns out the experiment was extremely successful with a success rate of 18/19.

Returning to Wu et al., they should be commended for releasing all their data (to their credit they also release the R code they used for analysis). The problem with the paper is that instead of reporting the failures and discarding them before analysis, they instead use all of the data when performing comparisons between single-cell and bulk RNA-Seq. This is is evident in some of the strange techniques they end up using. For example, the method for generating the crucial Figure 4a is described as:

“(a) Correlation between the merged
single cells (“ensemble”) and the bulk RNA-seq measurement of gene
expression. The ensemble was created by computationally pooling all
the raw reads obtained from the 96 single-cell transcriptomes
generated using the C1 system and then sampling 30 million reads
randomly. The bulk and ensemble libraries were depth matched before
alignment was performed. For each gene, the log2-transformed median
FPKM values from the ensemble and bulk were plotted against each
other. “

I’m guessing that the odd idea of sampling and then taking the median is precisely to throw out outliers coming from the control tubes. Yes, the data were tortured, and yes, the FPKMs confessed. The paper has some other issues that suggest it was not carefully reviewed by the authors (let alone the reviewers). In the Methods I found the statement “FPKM values used for analyses were generated by TopHat”. I, of all people, can attest to the fact that it is Cufflinks, not TopHat, that estimates (not generates!) FPKM values. Thankfully, in the GEO entries Cufflinks is correctly cited together with the version used.

In summary, in the last two high profile publications on single-cell RNA-Seq, there were failures in the experiment and they were not reported clearly by the authors. Neither committed an egregious offense, but I wish they had fully reported the number of experiments attempted and the number that succeeded. That seems to me to be important data in papers describing new technology. I believe that fear of rejection from the journal, or fear of embarrassment of the state of single-cell RNA-Seq is what drove Wu et al. to spin the results positively. All part of the fear of failure, that seems to hold back a lot of science. But single-cell RNA-Seq has a bright future and these papers would both be better if they were more open about failure. The only thing we have to fear is fear itself.

Last Saturday I returned from Cold Spring Harbor Laboratories where I spoke at the Genome Informatics Meeting on Stories from the Supplement. On Monday I delivered the “Prestige Lecture” at a meeting of the Center for Science of Information on New Directions in the Science of Information and I started by talking about Cold Spring Harbor Laboratory (CSHL). That is because the Eugenics Record Office at CSHL is where Claude Shannon, famous father of information theory, wrapped up his Ph.D. in population genetics in 1939.

The fact that Shannon did his Ph.D. in population genetics– his Ph.D. was titled “An Algebra for Theoretical Genetics“– is unknown to most information theorists and population geneticists. It is his masters thesis that is famous (for good reason– it can be said to have started the digital revolution), and his paper in 1948 that founded information theory. But his Ph.D. thesis was impressive in its own right: its contents formed the beginning of my talk to the information theorists, and I summarize the interesting story below.

I learned about the details surrounding Shannon’s foray into biology from a wonderful final project paper written for the class The Structure of Engineering Revolutions in the Fall of 2001: Eugene Chiu, Jocelyn Lin, Brok Mcferron, Noshirwan Petigara, Satwiksai Seshasai, Mathematical Theory of Claude Shannon. In 1939, Shannon’s advisor, Vannevar Bush, sent him to study genetics with Barbara Burks at the Eugenics Record Office at Cold Spring Harbor. That’s right, the Eugenics office was located at Cold Spring Harbor from 1910 until 1939, when it was closed down as a result of Nazi eugenics. Fortunately, Shannon was not very interested in the practical aspects of eugenics, and more focused on the theoretical aspects of genetics.

His work in genetics was a result of direction from Vannevar Bush, who knew about genetics via his presidency of the Carnegie Institution of Washington that ran the Cold Spring Harbor research center. Apparently Bush remarked to a colleague that “It occurred to me that, just as a special algebra had worked well in his hands on the theory of relays, another special algebra might conceivably handle some of the aspects of Mendelian heredity”. The main result of his thesis is his Theorem 12:

The notation $\lambda^{hij}_{klm}$ refers to genotype frequencies in a diploid population. The indices $h,i,j$ refer to alleles at three loci on one haplotype, and $k,l,m$ at the same loci on the other haplotype. The $p$ variables correspond to recombination crossover probabilities. $p_{00}$ is the probability of an even number of crossovers between both the 1st and 2nd loci, and the 2nd and 3rd loci. $p_{01}$ is the probability of an even number of crossovers between the 1st and 2nd loci but an odd number of crossovers between the 2nd and 3rd loci, and so on. Finally, the dot notation in the $\lambda$ represents summation over the index (these days one might use a $+$). The result is a formula for the population genotype frequencies $\mu^{hij}_{ilm}$ after $n$ generations. The derivation involves elementary combinatorics, specifically induction, but it is an interesting result and at the time was not something population geneticists had worked out. What I find impressive about it is that Shannon, apparently on his own, mastered the basic principles of (population) genetics of his time, and performed a calculation that is quite similar to many that are relevant in population genetics today. Bush wrote about Shannon “At the time that I suggested that he try his queer algebra on this subject, he did not even know what the words meant… “.

Why did Shannon not pursue a career in population genetics? The Eugenics Record Office closed shortly after he left and Bush discouraged him from continuing in the field, telling him that “few scientists are ever able to apply creatively a new and unconventional method furnished by some one else – at least of their own generation”. Thus, despite encouragement from a number of statisticians and geneticists that his work was novel and of interest, Shannon returned to electrical engineering. Shortly thereafter, the world got information theory.

Of course today population genetics has data, tons of it, and many interesting problems, including some that I think require insights and ideas from information theory. My Prestige Lecture was aimed at encouraging information theorists to return to their Shannon roots, and redirect their focus towards biology. I have been working with information theorist David Tse (academic grandson of Shannon) for the past year on de novo RNA-Seq assembly (a talk on our joint work with postdoc Sreeram Kannan was presented by Sreeram at the Genome Informatics meeting), and I believe the engagement of information theorists in biology would be of great benefit to both fields; in terms of biology, I see many applications of information theory beyond population genetics. Some back-and-forth has already started. Recently there have been some interesting papers using information theory to study genome signatures and compression, but I believe that there are many other fruitful avenues for collaboration. David and Sreeram were the only information theorists at CSHL last week (I think), but I hope that there will be many more at the 2014 meeting in Cambridge, UK!

The beach at Cold Spring Harbor. I took the photo on November 1st before my Genome Informatics keynote.

I have just returned from the Genome Informatics 2013 meeting at CSHL. Jennifer Harrow, Michael Schatz and James Taylor organized a fantastic event that I thoroughly enjoyed.

The purpose of this post is to provide a short summary of my keynote talk, which was titled “Stories from the Supplement” (the talk can be viewed here and the presentation downloaded here). The idea for talking about what goes on in the supplement of papers, was triggered by a specific event towards the end of the reviewing/editing process for the paper:

A. Roberts and L. Pachter, Streaming algorithms for fragment assignment, Nature Methods 10 (2013), p 71—73.

After a thorough and productive review process, deftly handled by editor Tal Nawy whose work on our behalf greatly improved the quality of the paper, we were sent an email from the journal shortly before publication stating that “If the Online Methods section contains more than 10 equations, move the equation-heavy portions to a separate Supplementary Note“. This requirement made it essentially impossible to properly explain our model and method in the paper. After publishing lengthy supplements for the Cufflinks papers (see below) that I feel were poorly reviewed to the detriment of the paper, I decided it was time to talk about this issue in public.

C. Trapnell, B.A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M.J. van Baren, S.L. Salzberg, B.J. Wold and L. Pachter,
Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation, Nature Biotechnology 28, (2010), p 511–515.
Supplementary Material: 42 pages.

C. Trapnell, D.G. Hendrickson, M. Sauvageau, L. Goff, J.L. Rinn and L. Pachter,
Differential analysis of gene regulation at transcript resolution with RNA-seq, Nature Biotechnology, 31 (2012), p 46–53.
Supplementary Material: 70 pages.

My talk contained three examples selected to make a number of points:

• Methods in the supplement frequently contain ideas that transcend the specifics of the paper. These ideas can be valuable in the long run, but when they are in the supplement it is harder to identify what they are and to appreciate their significance.
• Supplements frequently contain errors (my own included). These errors make it difficult for others to understand the methods and implement them independently.
• In RNA-Seq specifically, there are a number of methodological issues buried in the supplements of various papers that have caused confusion in the field.
• The constant push of methods to supplements is part of a general trend to overemphasize the importance of data while minimizing the relevance of methods.

The examples were as follows:

1. Fragment assignment: The idea of probabilistically assigning ambiguously mapped fragments in RNA-Seq is present in many papers, but at least for me, it was the math worked out in the supplements of those papers (and many conversations with my collaborators, especially Cole Trapnell and Adam Roberts) that made me realize the importance of fragment assignment for *Seq. I went on to explain how Nicolas Bray used these insights to develop a fragment assignment model for joint analysis of RNA-Seq. The result is the ability to magnify the effective coverage of individual samples from multiple samples, as shown in my talk using the GEUVADIS data:
In this plot each point represents the accuracy for the samples when quantified independently (black), or by our method (red/blue). The difference between red and blue has to do with a technical choice in the method that I explained in the talk.
2. I talked about the problem of using raw counts for RNA-Seq analysis. Returning to a theme I have discussed in talks and on my blog previously, I explained that even when the goal is differential analysis, raw counts are flawed because “wrong does not cancel out wrong”. The idea of using raw count quantification knowing it is inaccurate, but arguing that it doesn’t matter because the bias cancels during comparisons (e.g. in differential expression or eQTL analysis) is mathematically equivalent to the following:
Acknowledging that $\frac{1}{2}+\frac{2}{3} \neq \frac{3}{5}$ (obtained by summing numerators and then dividing by the sum of denominators) but arguing that it is ok to say that
$\frac{\frac{1}{2}+\frac{2}{3}}{\frac{3}{4}+\frac{5}{6}} = \frac{\frac{3}{5}}{\frac{8}{10}} = \frac{3}{4}$, which is obviously not correct (the answer is $\frac{14}{19}$).
A key point I made is that even though it might seem that the wrong answer is at least close to the correct answer, in practice, on real data, the differences can be significant. I showed an analysis done by Cole Trapnell using an extensive dataset generated in the Rinn lab for the Cufflinks 2 paper that makes this point.
3. I talked about the different units currently being used for RNA-Seq quantification, such as CPM, RPKM, FPKM and TPM (all of them appeared in various talks during the meeting). I discussed the history of the units, and why they were chosen, and argued in favor of simply using the relative abundance estimates (perhaps normalized by a constant factor, as in TPM). This point of view was first advocated by Bo Li and Colin Dewey in their RSEM paper, and I hope the community will adopt their point of view.

My penultimate slide showed this world map of high-throughput sequencers. I think this is a very cool map, as it shows (by proxy) the extraordinary extent of sequencing going on worldwide. However it is yet another example of a focus on data, and data generation, in genomics. Data is of course, very important, but I showed another map for methodsthat illustrates a very different thing: the extent of computational biology going on around the world. The methods map is made from visits to the Cufflinks website. I mashed it with the sequencer map to make the case that data and methods go hand-in-hand.

Sequencers of the world and users of Cufflinks.