In 1963, a post-doctoral fellow at Caltech by the name of Emile Zuckerkandl, who was working with Linus Pauling, published a seminal paper that launched the field of comparative genomics. The full citation is:

L. Pauling and E. Zuckerkandl, “Chemical Paleogenetics: Molecular ‘restoration studies’ of extinct forms of life”, Acta Chem. Scand. 17 (1963), S9–S16.

The paper not only introduces the concept of comparative genomics, but also its generalization to phylogenomics. It followed on the heels of pioneering work by Zuckerkandl, undertaken at Pauling’s suggestion, to sequence hemoglobin proteins in various primates and mammals. In related work, Zuckerkandl and Pauling (it is clear that Zuckerkandl was the creative and driving force) also introduced the concept of a molecular clock and proposed the concept of sequence alignment (although Ingram published first).

Zuckerkandl’s name and work are not as well known as they ought to be. Perhaps this is partly because his work was overshadowed by Watson and Crick’s famous paper published a decade earlier, but maybe also because computational methods take second place to experiments and Zuckerkandl’s legacy lies in his ideas about phylogenomics, not his contribution to protein sequencing.

The paper after which this blog post is titled is extraordinary in its originality, and the main figure in the paper is a classic that every computational biologist should instantly recognize:

The caption (from the paper) is as follows: Tentative partial structure of two chain-ancestors of the human hemoglobin polypeptide chains. The numbering of the residues is the one usually applied to the human $\alpha$-chain. The abbreviations for the amino acids are those commonly employed, except for asparagin (asg) and glutamine (glm). The other abbreviations: E=early epoch; M=medium epoch; L=late epoch; abs=absent. “None”, in column (e), means: probably no evolutionarily effective mutation occurred at the site under consideration in the line of descent leading from the ancestral genes to the human genes. The residues and comments are placed in parentheses when the conclusion reached is partly based on the consideration of human and sperm whale myoglobins.

(a) Residue number

(b) Partial sequence of the $II^{\beta}--III^{\gamma}--IV^{\delta}$-chain (late form)

(c) Partial sequence of the $I^{\alpha}--II^{\beta}--III^{\gamma}--IV^{\delta}$-chain (late form)

(d) Chain(s) in whose direct ancestry the mutation(s) seem to have occurred

(e) Nature of the substitution

(f) Quantitative evaluation of the time of mutation

(g) From evidence relating to a number of different animals (cf. Gratzer and Allison)

(h) Not after the time of the ancestor common to horse and man

(j) Amino-acid residue x present at a homologous site in the two myoglobins

(k) Polypeptide chain $I^{\alpha}$ or $II^{\beta}--III^{\gamma}--IV^{\delta}$

The inference of ancestry is based on the hemoglobin gene phylogeny

The columns (d)–(f) basically tell the story of the paper. Zuckerkandl and Pauling show that with sequence, alignment and phylogeny one can infer the history of mutation and substitution and its effects on extant sequence. Right there, we have the genesis of phylogenomics. In the discussion of the paper there is a detailed description of how phylogenomic techniques can be used to infer functional relationships among proteins, with the conclusion that “even apparently unrelated proteins can indeed have a common molecular ancestor”.

The fields of genomics and phylogenetics have a common ancestor as well but sadly this ancestor, Emile Zuckerkandl, passed away on November 9, 2013. May his ideas continue to evolve.

[Update Dec. 12]: Upon reading the initial version of this post, Ewan Birney asked me how the Zuckerkandl-Pauling alignment looks with modern tools, an interesting question now that 50 years have passed since their paper was published. I decided to look at this and started by downloading the sequences and aligning them with three programs: 1) Clustal-Omega, the latest (last) version of the popular and widely used Clustal alignment programs, 2) Muscle, a sequence alignment tool billed by its author as “one of the best-performing multiple alignment programs”, and 3) FSA, a sequence alignment program developed in my group by Robert Bradley. The results are interesting:

The Clustal-Omega alignment.

The Muscle alignment.

The FSA alignment.

It’s a bit complicated to compare these multiple alignments with the Z-P alignment. The latter is a pairwise alignment of the two ancestrally inferred sequences, whereas the multiple alignments above are of the extant sequences. This is already interesting; almost no alignment programs explicitly align by ancestral inference the way Z-P were doing it. The only exception, as far as I know, is MAVID, written by Nicolas Bray, although as implemented it was only suitable for DNA alignment. Nevertheless, the alignment is small enough that I was able to compare Z-P’s alignment to the Clustal-Omega, Muscle and FSA alignments by hand.
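Such comparisons can also be automated. Here is a minimal sketch (in Python, with made-up toy sequences) of one way to compare two alignments programmatically: extract the set of aligned residue pairs from each alignment and intersect them.

from itertools import combinations

def aligned_pairs(rows):
    # rows maps a sequence name to its aligned (gapped) string;
    # returns the set of aligned residue pairs ((name_a, pos_a), (name_b, pos_b))
    cols = {}
    for name, gapped in rows.items():
        idx = -1
        cols[name] = []
        for ch in gapped:
            if ch == "-":
                cols[name].append(None)
            else:
                idx += 1
                cols[name].append(idx)
    pairs = set()
    length = len(next(iter(rows.values())))
    for a, b in combinations(rows, 2):
        for c in range(length):
            if cols[a][c] is not None and cols[b][c] is not None:
                pairs.add(((a, cols[a][c]), (b, cols[b][c])))
    return pairs

# two hypothetical alignments of the same two sequences, differing in gap placement
aln1 = {"alpha": "VLS-PADKTN", "beta": "VHLTPEEKSA"}
aln2 = {"alpha": "V-LSPADKTN", "beta": "VHLTPEEKSA"}
shared = aligned_pairs(aln1) & aligned_pairs(aln2)
print(len(shared), "aligned residue pairs in common")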

There are three areas of the alignment with indels. The first is right at the beginning, where Z-P annotate an indel right after the Valine (V). Both Clustal-Omega and Muscle moved this indel to the beginning of the sequence, probably because of heuristics that push gaps to the ends of sequences in case they are incomplete. In this case, I removed the Methionine because Z-P had not included it in their alignment. Nevertheless, FSA aligned this correctly (I think the Z-P alignment here is correct).

The second indel is two residues long, and its placement differs between all three programs and the Z-P alignment. It’s a bit hard to say who is correct. One would have to look at more sequences, but I think that FSA looks good here (based on the posterior probabilities shown; see below).

The third indel involves the placement of the amino acids DLSH. In this case only Muscle agrees with the Z-P alignment; however, I think the FSA answer is the most informative (and only informative) answer in this case. FSA not only shows a single multiple alignment, but also colors it according to the posterior probabilities of the individual pairwise alignments. This is explained in the FSA paper; the visualization is interactive and available via a Java applet. What the FSA alignment shows is that there are really two solutions of almost equal accuracy: the one it produced and the Z-P solution. The latter requires two separate indel events. These are presumably much less common than mutations, so it’s not clear which is correct (this could be checked by examining a larger globin alignment including outlier species).

In summary, what the alignments with modern tools show is that multiple alignment is really hard. Much harder than most people think!

A final analysis I did is to look at the tree inferred according to an evolutionary model (the FSA alignment server does this automatically). The inferred tree is unrooted and shown below:

It agrees with the Z-P tree, except for the fact that the beta-delta ancestor is much closer than what is shown in their figure. It should be noted that at the time of their paper, (Markov) models for amino acid changes along a tree had not yet been developed; that only started happening with Jukes and Cantor in 1969.
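For readers who want to try something similar themselves, here is a rough sketch using Biopython: neighbor-joining on simple identity distances computed from the multiple alignment. This is not what the FSA server does (which fits an evolutionary model), and the file name is a placeholder for a Clustal-format alignment of the four chains.

from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("globins.aln", "clustal")          # placeholder file name
distances = DistanceCalculator("identity").get_distance(alignment)
tree = DistanceTreeConstructor().nj(distances)              # neighbor-joining tree
Phylo.draw_ascii(tree)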

Overall, Zuckerkandl and Pauling did a remarkable job with their alignment, ancestral inference and tree estimation. What I learned from my analysis is that their paper still provides a case study for multiple alignment today.

I have been reading the flurry of news articles and blog posts written this week about 23andme and the FDA with some interest. In my research talks, I am fond of displaying 23andme results, and have found that people always respond with interest. On the teaching side, I have subsidized 23andme testing for volunteer students in Math127 who were interested in genetics so that they could learn about personalized genomics first-hand. Finally, a number of my former and current students have worked at 23andme, and some are current employees.

Despite lots of opinions being expressed about the 23andme vs. FDA kerfuffle, I believe that two key points have been ignored in the discussions:

1. All 23andme genotypes that have ever been reported to customers are wrong. This is the case despite the very accurate genotyping technology used by 23andme.
2. The interpretation of 23andme results involves examining a large number of odds ratios. The presence of errors leads to a huge multiple-testing problem.

Together, these issues lead to an interesting conundrum for the company, for customers, and for the FDA.

I always find it useful to think about problems concretely. In the case of 23andme, it means examining actual genotypes. Fortunately, you don’t have to pay the company $99 to get your own: numerous helpful volunteers have posted their 23andme genotypes online. They can be viewed at openSNP.org where “customers of direct-to-customer genetic tests [can] publish their test results, find others with similar genetic variations, learn more about their results, get the latest primary literature on their variations and help scientists find new associations”. There are a total of 624 genotypes available at openSNP, many of them from 23andme. As an example, consider “samantha”, who in addition to providing her 23andme genotype, also provides lots of phenotypic information. Here is the initial part of her genotype file:

# This data file generated by 23andMe at: Wed Jul 20 20:37:11 2011
#
# Below is a text version of your data. Fields are TAB-separated
# Each line corresponds to a single SNP. For each SNP, we provide its identifier
# (an rsid or an internal id), its location on the reference human genome, and the
# genotype call oriented with respect to the plus strand on the human reference
# sequence. We are using reference human assembly build 36. Note that it is possible
# that data downloaded at different times may be different due to ongoing improvements
# in our ability to call genotypes. More information about these changes can be found at:
# https://www.23andme.com/you/download/revisions/
#
# More information on reference human assembly build 36:
# http://www.ncbi.nlm.nih.gov/projects/mapview/map_search.cgi?taxid=9606&build=36
#
# rsid        chromosome  position  genotype
rs4477212     1           72017     AA
rs3094315     1           742429    AG
rs3131972     1           742584    AG
rs12124819    1           766409    AA
rs11240777    1           788822    AA
rs6681049     1           789870    CC
rs4970383     1           828418    CC
rs4475691     1           836671    CC
rs7537756     1           844113    AA
rs13302982    1           851671    GG
rs1110052     1           863421    GT
...

Anyone who has been genotyped by 23andme can get this file for themselves from the website (by clicking on their name, then on “Browse Raw Data” from the pull-down menu, and then clicking on “Download” in the top-right corner of the browser window). The SNPs are labeled with rsid labels (e.g. rs3094315) and correspond to specific locations on chromosomes (e.g. chr1:742429). Since every human is diploid, two bases are shown for every SNP; one came from mom and one from dad. The 23andme genotype is not phased, which means that you can’t tell in the case of rs3094315 whether the A was from mom and the G from dad, or vice versa (it turns out parental origin can be important, but that is a topic for another post).

A key question the FDA has asked, as it does for any diagnostic test, is whether the SNP calls are accurate. The answer is already out there. First, someone has performed a 23andme replicate experiment precisely to assess the error rate. In an experiment in 2010 with two replicates, 85 SNPs out of about 600,000 were different. Today, Illumina types around 1 million SNPs, so one would expect even more errors. Furthermore, a replicate analysis provides only a lower bound, since systematic errors will not be detected. Another way to examine the error rate is to look at genotypes of siblings. That was written about in this blog post which concluded there were 87 errors. 23andme currently uses the Illumina Omni Express for genotyping, and the Illumina spec sheet claims a similar error rate to those inferred in the blog posts mentioned above.
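To make the replicate comparison concrete, here is a minimal sketch (in Python; not the code used in the analyses cited above, and the file names are hypothetical) that counts discordant calls between two raw-data downloads in the format shown above:

def read_genotypes(path):
    # parse a 23andme raw data file: comment lines start with '#',
    # data lines are tab-separated rsid, chromosome, position, genotype
    calls = {}
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")
            calls[rsid] = genotype
    return calls

rep1 = read_genotypes("replicate1.txt")   # hypothetical file names
rep2 = read_genotypes("replicate2.txt")
shared = set(rep1) & set(rep2)
# calls are unphased, so AG and GA count as the same genotype
discordant = [r for r in shared if sorted(rep1[r]) != sorted(rep2[r])]
print(len(discordant), "discordant calls out of", len(shared), "shared SNPs")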
The bottom line is that even though the error rate for any individual SNP call is very, very low (<0.01% error), with a million SNPs being called there is (almost) certainly at least one error somewhere in the genotype. In fact, assuming a conservative error rate leading to an average of 100 errors per genotype, the probability that a 23andme genotype has no errors is less than $10^{-40}$.

The fact that 23andme genotypes are wrong (i.e. contain at least one erroneous SNP call) wouldn’t matter if one was only interested in a single SNP. With very high probability, it would be some other SNPs that are the wrong ones. But the way people use 23andme is not to look at a single SNP of interest, but rather to scan the results from all SNPs to find out whether there is some genetic variant with a large (negative) effect. The good news is that there isn’t much information available for the majority of the 1 million SNPs being tested. But there are, nevertheless, lots of SNPs (thousands) to look at. Whereas a comprehensive exam at a doctor’s office might currently constitute a handful of tests (a dozen, or a few dozen at most), a 23andme test assessing thousands of SNPs and hundreds of diseases/traits constitutes more diagnostic tests on an individual at one time than have previously been performed in a lifetime.

To understand how many tests are being performed in a 23andme experiment, it is helpful to look at the Interpretome website. The website allows a user to examine information on SNPs without paying, and without uploading the data. I took a look at Samantha, and the Interpretome gave information about 2829 SNPs. These are SNPs for which there is a research article that has identified the SNP as significant in some association study (the website conveniently provides direct links to the articles). For example, here are two rows from the phenotype table describing something about Samantha’s genetic predisposition for large head circumference:

Phenotype                     SNP        Genotype  Risk allele  Odds ratio  p-value  PubMed ID
Head circumference (infant)   11655470   CC        T            .05         4E-6     22504419
Head circumference (infant)   1042725    CC        T            .07         3E-10    22504419

Samantha’s genotype at both loci is CC, the “risk” allele is T, the odds ratios are very small (0.05, 0.07) and the p-values are apparently significant. Interpretome’s results differ from those of 23andme, but looking at the diversity of phenotypes reported on gives one a sense for the possibilities that currently exist in genetics, and the scope of 23andme’s reports.

From the estimates of error rates provided above, and using the back of an envelope, it stands to reason that about 1/3 of 23andme tested individuals have an error at one of their “interesting” SNPs. Not all of the SNPs arising in association studies are related to diseases, but many of them are. I don’t think it’s unreasonable to postulate that a significant percentage of 23andme customers have some error in a SNP that is medically important. Whether such errors are typically false positives or false negatives is unclear, and the extent to which they may lead to significant odds ratios is another interesting question. In other words, it’s not good enough to know how frequently warfarin sensitivity is being called incorrectly. The question is how frequently some medically significant result is incorrect.
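For what it’s worth, the back-of-the-envelope numbers above are easy to reproduce. A minimal sketch, where the per-SNP error rate is an assumption chosen to give about 100 expected errors per genotype:

n_snps = 1_000_000          # order of magnitude of SNPs on the current array
per_snp_error = 1e-4        # assumed rate, i.e. ~100 expected errors per genotype

# probability that an entire genotype is error-free
p_clean = (1 - per_snp_error) ** n_snps
print(p_clean)              # ~3.7e-44, i.e. less than 10^-40

# probability of at least one error among the ~2829 "interesting" SNPs
n_interesting = 2829
p_hit = 1 - (1 - per_snp_error) ** n_interesting
print(p_hit)                # ~0.25 with these numbers, the ballpark of the 1/3 figure above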
Of course, the issue of multiple testing as it pertains to interpreting genotypes is probably a secondary issue with 23andme. As many bloggers have pointed out, it is not even clear that many of 23andme’s odds ratios are accurate or meaningful. A major issue, for example, is the population background of an individual examining his/her genotype and how close it is to the population on which the GWAS were performed. Furthermore, there are serious questions about the meaning of the GWAS odds ratios in the case of complex traits. However, I think the issue of multiple testing is a deeper one, and a problem that will only be exacerbated as more disease SNPs are identified.

Having said that, there are also approaches that could mitigate errors and improve the fidelity of the tests. As deCODE genetics has demonstrated, imputation and phasing can in principle be used to infer population haplotypes, which are not only useful for GWAS analyses, but can also be used to identify erroneous SNP calls. 23andme’s problem is that although they have many genotypes, they are from diverse populations that will be harder to impute and phase.

The issue of multiple testing arising in the context of 23andme, and the contrast with classic diagnostics, reminds me of the dichotomy between whole-genome analysis and classic single-gene molecular biology. The way in which customers are looking at their 23andme results is precisely to look for the largest effects, i.e. phenotypes where they appear to have high odds of contracting a disease, or being sensitive to some drug. This is the equivalent of genome scientists picking the “low hanging fruit” out of genome-wide experiments such as those performed in ENCODE. In genomics, scientists have learned (with some exceptions) how to interpret genome-wide analyses after correcting for multiple-hypothesis testing by controlling the false discovery rate. But are the customers of 23andme doing so? Is the company helping them do it? Should it? Will the FDA require it?

Can looking at one’s own genotype constitute too much testing? There are certainly many precedents for superfluous harmful testing in medicine. For example, the American Academy of Family Physicians has concluded that prostate cancer PSA tests and digital rectal exams have marginal benefits that are outweighed by the harm caused by following up on positive results. Similar arguments have been made for mammography screening. I therefore think that there are serious issues to consider about the implications of direct-to-consumer genetic testing, and although I support the democratization of genomics, I’m glad the FDA is paying attention.

Samantha’s type 2 diabetes risk as estimated from her genotype by Interpretome. She appears to have a lower risk than an average person. Does this make it ok for her to have another cookie?

My students run a journal club that meets once a week, and last Friday the paper we discussed was M. Imakaev et al., Iterative correction of Hi-C data reveals hallmarks of chromosome organization, Nature Methods 9 (2012), p 999–1003. The paper, from Leonid Mirny’s group, describes an approach to analyzing Hi-C data, and is one of only a few papers that have been published focusing on methods for this *Seq assay. Present at journal club were my students Nicolas Bray, Shannon Hateley, Harold Pimentel, Atif Rahman, Lorian Schaeffer, Akshay Tambe, Faraz Tavakoli, postdocs Sreeram Kannan and Shannon McCurdy, and guest Emily Brown from Doris Bachtrog’s group. They all read the paper prior to our meeting, and the notes below are a summary of our discussion (although I have taken some more time to flesh out the math for the interested reader). For those who are impatient, the bottom line is that I think the title of the paper by Imakaev et al. should have had a few extra words (mine in italics) so that it would be “Iterative proportional fitting of a log-linear model for correction of Hi-C data reveals hallmarks of chromosome organization”.
I visited Duke’s mathematics department yesterday to give a talk in the mathematical biology seminar. After an interesting day meeting many mathematicians and (computational) biologists, I had an excellent dinner with Jonathan Mattingly, Sayan Mukherjee, Michael Reed and David Schaeffer. During dinner conversation, the topic of probability theory (and how to teach it) came up, and in particular Buffon’s needle problem. The question was posed by Georges-Louis Leclerc, Comte de Buffon in the 18th century: suppose we have a floor made of parallel strips of wood, each the same width, and we drop a needle onto the floor. What is the probability that the needle will lie across a line between two strips?

If the strips are distance $t$ apart, and $l \leq t$, then it is easy to see that the probability $P$ is given by

$P = \int_{\theta =0}^{\frac{\pi}{2}} \int_{x = 0}^{\frac{l}{2}\sin \theta} \frac{4}{t \pi}\, dx\, d\theta = \frac{2l}{t \pi}$.

The appearance of $\pi$ in the denominator turns the problem into a Monte Carlo technique for estimating $\pi$: simply simulate random needle tosses and count crossings.

It turns out there is a much more elegant solution to the problem, one that does not require calculus. I learned of it from Gian-Carlo Rota when I was a graduate student at MIT. It appears in his book Introduction to Geometric Probability (with Dan Klain) that I have occasionally used when teaching Math 249. The argument relies on the linearity of expectation, and is as follows: let $f(l)$ denote the expected number of crossings when a needle of length $l$ is thrown on the floor. Now consider two needles, one of length $l$ and the other $m$, attached to each other end to end (possibly at some angle). If $X_1$ is a random variable describing the number of crossings of the first needle, and $X_2$ of the second, it’s certainly the case that $X_1$ and $X_2$ are dependent, but because expectation is linear, it is the case that $E(X_1+X_2) = E(X_1)+E(X_2)$. In other words, the total number of crossings is, in expectation, $f(l)+f(m)$.

Buffon’s needle problem: what is the probability that a needle of length $l \leq t$ crosses a line? (A) A short needle being thrown at random on a floor with parallel lines. (B) Two connected needles. The expected number of crossings is proportional to the sum of their lengths. (C) A circle of diameter $t$ always crosses exactly two lines.

It follows that $f$ is a linear function, and since $f(0)=0$, we have that $f(l) = cl$ where $c$ is some constant. Now consider a circle of diameter $t$. Such a circle, when thrown on the floor, always crosses the parallel lines exactly twice. If $C$ is a regular polygon with vertices on the circle, and the total length of the polygon segments is $l$, then the total number of crossings is $f(l)$. Taking the limit as the number of segments in the polygon goes to infinity, we find that $f(t \pi ) = 2$. In other words, $f(t \pi) = c \cdot t \pi = 2 \Rightarrow c = \frac{2}{t \pi}$, and the expected number of crossings of a needle of length $l$ is $\frac{2l}{t \pi}$. If $l < t$, the number of crossings is either 0 or 1, so the expected number of crossings is, by definition of expectation, equal to the probability of a single crossing. This solves Buffon’s problem, no calculus required!
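As an aside, the Monte Carlo estimate of $\pi$ mentioned above takes only a few lines. A minimal sketch in Python (the parameter choices are arbitrary):

import math
import random

def buffon_pi_estimate(n_tosses, needle=1.0, spacing=1.0, seed=0):
    # toss a short needle (length <= line spacing) onto a floor with parallel lines
    rng = random.Random(seed)
    crossings = 0
    for _ in range(n_tosses):
        center = rng.uniform(0, spacing / 2)    # distance from needle center to nearest line
        theta = rng.uniform(0, math.pi / 2)     # acute angle between needle and the lines
        if center <= (needle / 2) * math.sin(theta):
            crossings += 1
    # P(crossing) = 2l/(t*pi), so pi is approximately 2*l*n/(t*crossings)
    return 2 * needle * n_tosses / (spacing * crossings)

print(buffon_pi_estimate(1_000_000))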
The linearity of expectation appears elementary at first glance. The proof is simple, and it is one of the first “facts” learned in statistics; I taught it to my Math 10 students last week. However, the apparent simplicity masks its depth and utility; the above example is cute, and one of my favorites, but linearity of expectation is useful in many settings. For example, I recently saw an interesting application in an arXiv preprint by Anand Bhaskar, Andy Clark and Yun Song on “Distortion of genealogical properties when the sample is very large”. The paper addresses an important question, namely the suitability of the coalescent as an approximation to discrete time random mating models when sample sizes are large. This is an important question, because population sequencing is starting to involve hundreds of thousands, if not millions, of individuals. The results of Bhaskar, Clark and Song are based on dynamic programming calculations of various genealogical quantities as inferred from the discrete time Wright-Fisher model. An example is the expected frequency spectrum for random samples of individuals from a population. By frequency spectrum, they mean, for each $k$, the expected number of polymorphic sites with $k$ derived alleles and $n-k$ ancestral alleles under an infinite-sites model of mutation in a sample of $n$ individuals. Without going into details (see their equations (8), (9) and (10)), the point is that they are able to derive dynamic programming recursions because they are computing the expected frequencies, and the linearity of expectation is what allows for the derivation of the dynamic programming recursions.

None of this has anything to do with my seminar, except for the fact that the expectation-maximization algorithm did make a brief appearance, as it frequently does in my lectures these days. I spoke mainly about some of the mathematics problems that arise in comparative transcriptomics, with a view towards a principled approach to comparing transcriptomes between cells, tissues, individuals and species.

The Duke Chapel. While I was inside someone was playing the organ, and as I stared at the ceiling, I could have sworn I was in Europe.

In a first with RNA-Seq technology, scientists at Stanford University have broken through the single-cell barrier. In a recently published paper, A.R. Wu et al., Quantitative assessment of single-cell RNA-sequencing methods, we are presented with the results of sequencing RNA from zero human cells. How was this accomplished? The gist of it is that an empty tube was filled with spike-in, and then submitted for RNA-Seq… The details are as follows:

In order to assess the quality of single-cell RNA-Seq, Wu et al. performed numerous single-cell RNA-Seq experiments on HCT116 cells, as summarized in the figure below (Figure 1a from their paper).

Figure 1a from the Wu et al. paper showing the experimental design.

I was interested in this study because for the regularized pooling project I’m working on with Nicolas Bray (see recent post), it would be useful to demonstrate improvements in quantification accuracy by joint analysis of single-cell RNA-Seq. I asked Nick to look at the Wu et al. data when it came out two weeks ago, and as a first step he aligned the reads to the human transcriptome. Strangely, he found very low alignment rates, and in some cases literally almost no reads aligned at all. At first we thought there was some trimming issue, so we went to look at the Cufflinks output of the authors.
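To quantify how much of a library is spike-in, one can simply sum FPKM values over the ERCC transcripts in the Cufflinks output. A minimal sketch (this is not Nick’s actual script, and the column names and the ERCC- prefix are assumptions about the released fpkm_tracking files):

import csv

def spike_in_fraction(path):
    # fraction of total FPKM assigned to ERCC spike-in transcripts
    total = ercc = 0.0
    with open(path) as f:
        for row in csv.DictReader(f, delimiter="\t"):
            fpkm = float(row["FPKM"])
            total += fpkm
            if row["tracking_id"].startswith("ERCC-"):
                ercc += fpkm
    return ercc / total if total > 0 else float("nan")

print(spike_in_fraction("GSM1241223_C70.genes.fpkm_tracking"))   # hypothetical local file name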
The figure below, made by Nick, shows the percent spike-in (assessed by examining the abundance of ERCC-*) for each of the 96 SMARTer-based samples:

The worst sample is C70 (GSM1241223), for which only 252 human transcripts have non-zero abundance. It is 99.828339% spike-in! The fact that the results of RNA-Seq on an empty test tube were published is in and of itself just a minor (?) embarrassment; more interesting is the range of quality obtained as measured by the amount of spike-in sequenced, a plot that we have made above and that seems crucial to the paper, but that was not produced by the authors. In fact, what the authors do show is slightly suspect: reproduced below is their Figure S2 from the Supplement:

Why would the authors show correlations for just four randomly picked samples? Why not show results for all of the data? We dug a bit deeper into this, and noticed that 93/96 of the FPKM file names look like [GEO accession]_CXX_ILXXXX. But the remaining three look like GSM1241223_C70_NTC_tube_ctrl_IL3196.sorted.genes.fpkm_tracking.txt.gz (which is the apparently empty tube), GSM1241245_C92_cell_tube_ctrl_IL3198.sorted.genes.fpkm_tracking.txt.gz and GSM1241195_C42_100ng_RNA_ctrl_IL3198.sorted.genes.fpkm_tracking.txt.gz. Therefore, these were presumably intended controls, but they were not published as such.

There is the separate issue that, aside from the controls, the experiment in general appears to have some failure rate that is not clearly presented. This is evident in the following plot which Nick made, showing the average log-correlation of each experiment with the others after removing zeroes (the bottom one is C09 and the runner-up is C70):

This figure is showing the honest truth of the paper. It is what it is; everyone I’ve talked to that has actually performed single-cell RNA-Seq tells me that it is difficult and there is a non-trivial failure rate, on top of variable quality across cells. In fact, there is subtle evidence of failure in other papers. In the single-cell RNA-Seq technology race, the paper preceding Wu et al. was A.K. Shalek et al., Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells, Nature (2013). In Shalek et al., the authors describe 18 single-cell experiments. Specifically, they claim to have constructed DNA libraries “from 18 single BMDCs (S1–S18), three replicate populations of 10,000 cells, and two negative controls (empty wells), and sequenced each to an average depth of 27 million read pairs.” However, a close inspection of the GEO reveals the following IDs and descriptors:

GSM1012777 Single cell S1
GSM1012778 Single cell S2
GSM1012779 Single cell S3
GSM1012780 Single cell S4
GSM1012781 Single cell S5
GSM1012782 Single cell S6
GSM1012783 Single cell S7
GSM1012784 Single cell S8
GSM1012785 Single cell S9
GSM1012786 Single cell S10
GSM1012787 Single cell S11
GSM1012788 Single cell S13
GSM1012789 Single cell S14
GSM1012790 Single cell S15
GSM1012791 Single cell S16
GSM1012792 Single cell S22
GSM1012793 Single cell S23
GSM1012794 Single cell S24

While there are 18 consecutive IDs, the cell labels range from 1–24. Where are the 6 missing cells? I can’t be sure, but they were probably failures.

Update: the authors of the Shalek et al. paper explained to me after seeing the post that two of the missing labels were negative controls, and 3 were population replicates (the names of these were altered in GEO). $6-5=1$, which was indeed a failure (S12); it gave no signal on the BioAnalyzer and was therefore not sequenced.
I was told that the authors are working on fixing the GEO sample names to clarify the reason for the missing labels. As such, it turns out the experiment was extremely successful, with a success rate of 18/19.

Returning to Wu et al., they should be commended for releasing all their data (to their credit, they also release the R code they used for analysis). The problem with the paper is that instead of reporting the failures and discarding them before analysis, they instead use all of the data when performing comparisons between single-cell and bulk RNA-Seq. This is evident in some of the strange techniques they end up using. For example, the method for generating the crucial Figure 4a is described as:

“(a) Correlation between the merged single cells (“ensemble”) and the bulk RNA-seq measurement of gene expression. The ensemble was created by computationally pooling all the raw reads obtained from the 96 single-cell transcriptomes generated using the C1 system and then sampling 30 million reads randomly. The bulk and ensemble libraries were depth matched before alignment was performed. For each gene, the log2-transformed median FPKM values from the ensemble and bulk were plotted against each other.”

I’m guessing that the odd idea of sampling and then taking the median is precisely to throw out outliers coming from the control tubes. Yes, the data were tortured, and yes, the FPKMs confessed.

The paper has some other issues that suggest it was not carefully reviewed by the authors (let alone the reviewers). In the Methods I found the statement “FPKM values used for analyses were generated by TopHat”. I, of all people, can attest to the fact that it is Cufflinks, not TopHat, that estimates (not generates!) FPKM values. Thankfully, in the GEO entries Cufflinks is correctly cited together with the version used.

In summary, in the last two high profile publications on single-cell RNA-Seq, there were failures in the experiments and they were not reported clearly by the authors. Neither committed an egregious offense, but I wish they had fully reported the number of experiments attempted and the number that succeeded. That seems to me to be important data in papers describing new technology. I believe that fear of rejection from the journal, or fear of embarrassment about the state of single-cell RNA-Seq, is what drove Wu et al. to spin the results positively. All part of the fear of failure that seems to hold back a lot of science. But single-cell RNA-Seq has a bright future, and these papers would both be better if they were more open about failure. The only thing we have to fear is fear itself.

Last Saturday I returned from Cold Spring Harbor Laboratory, where I spoke at the Genome Informatics Meeting on Stories from the Supplement. On Monday I delivered the “Prestige Lecture” at a meeting of the Center for Science of Information on New Directions in the Science of Information, and I started by talking about Cold Spring Harbor Laboratory (CSHL). That is because the Eugenics Record Office at CSHL is where Claude Shannon, famous father of information theory, wrapped up his Ph.D. in population genetics in 1939. The fact that Shannon did his Ph.D. in population genetics (his Ph.D. was titled “An Algebra for Theoretical Genetics”) is unknown to most information theorists and population geneticists. It is his master’s thesis that is famous (for good reason: it can be said to have started the digital revolution), and his paper in 1948 that founded information theory.
But his Ph.D. thesis was impressive in its own right: its contents formed the beginning of my talk to the information theorists, and I summarize the interesting story below.

I learned about the details surrounding Shannon’s foray into biology from a wonderful final project paper written for the class The Structure of Engineering Revolutions in the Fall of 2001: Eugene Chiu, Jocelyn Lin, Brok Mcferron, Noshirwan Petigara, Satwiksai Seshasai, Mathematical Theory of Claude Shannon. In 1939, Shannon’s advisor, Vannevar Bush, sent him to study genetics with Barbara Burks at the Eugenics Record Office at Cold Spring Harbor. That’s right, the Eugenics office was located at Cold Spring Harbor from 1910 until 1939, when it was closed down as a result of Nazi eugenics. Fortunately, Shannon was not very interested in the practical aspects of eugenics, and was more focused on the theoretical aspects of genetics. His work in genetics was a result of direction from Vannevar Bush, who knew about genetics via his presidency of the Carnegie Institution of Washington, which ran the Cold Spring Harbor research center. Apparently Bush remarked to a colleague that “It occurred to me that, just as a special algebra had worked well in his hands on the theory of relays, another special algebra might conceivably handle some of the aspects of Mendelian heredity”.

The main result of his thesis is his Theorem 12:

The notation $\lambda^{hij}_{klm}$ refers to genotype frequencies in a diploid population. The indices $h,i,j$ refer to alleles at three loci on one haplotype, and $k,l,m$ at the same loci on the other haplotype. The $p$ variables correspond to recombination crossover probabilities: $p_{00}$ is the probability of an even number of crossovers between both the 1st and 2nd loci, and the 2nd and 3rd loci; $p_{01}$ is the probability of an even number of crossovers between the 1st and 2nd loci but an odd number of crossovers between the 2nd and 3rd loci, and so on. Finally, the dot notation in the $\lambda$ represents summation over the index (these days one might use a $+$). The result is a formula for the population genotype frequencies $\mu^{hij}_{klm}$ after $n$ generations. The derivation involves elementary combinatorics, specifically induction, but it is an interesting result and at the time was not something population geneticists had worked out. What I find impressive about it is that Shannon, apparently on his own, mastered the basic principles of (population) genetics of his time, and performed a calculation that is quite similar to many that are relevant in population genetics today. Bush wrote about Shannon: “At the time that I suggested that he try his queer algebra on this subject, he did not even know what the words meant…”.

Why did Shannon not pursue a career in population genetics? The Eugenics Record Office closed shortly after he left, and Bush discouraged him from continuing in the field, telling him that “few scientists are ever able to apply creatively a new and unconventional method furnished by some one else – at least of their own generation”. Thus, despite encouragement from a number of statisticians and geneticists that his work was novel and of interest, Shannon returned to electrical engineering. Shortly thereafter, the world got information theory. Of course, today population genetics has data, tons of it, and many interesting problems, including some that I think require insights and ideas from information theory.
My Prestige Lecture was aimed at encouraging information theorists to return to their Shannon roots, and to redirect their focus towards biology. I have been working with information theorist David Tse (academic grandson of Shannon) for the past year on de novo RNA-Seq assembly (a talk on our joint work with postdoc Sreeram Kannan was presented by Sreeram at the Genome Informatics meeting), and I believe the engagement of information theorists in biology would be of great benefit to both fields; in terms of biology, I see many applications of information theory beyond population genetics. Some back-and-forth has already started. Recently there have been some interesting papers using information theory to study genome signatures and compression, but I believe that there are many other fruitful avenues for collaboration. David and Sreeram were the only information theorists at CSHL last week (I think), but I hope that there will be many more at the 2014 meeting in Cambridge, UK!

The beach at Cold Spring Harbor. I took the photo on November 1st before my Genome Informatics keynote.

I have just returned from the Genome Informatics 2013 meeting at CSHL. Jennifer Harrow, Michael Schatz and James Taylor organized a fantastic event that I thoroughly enjoyed. The purpose of this post is to provide a short summary of my keynote talk, which was titled “Stories from the Supplement” (the talk can be viewed here and the presentation downloaded here). The idea of talking about what goes on in the supplement of papers was triggered by a specific event towards the end of the reviewing/editing process for the paper:

A. Roberts and L. Pachter, Streaming algorithms for fragment assignment, Nature Methods 10 (2013), p 71–73.

After a thorough and productive review process, deftly handled by editor Tal Nawy whose work on our behalf greatly improved the quality of the paper, we were sent an email from the journal shortly before publication stating that “If the Online Methods section contains more than 10 equations, move the equation-heavy portions to a separate Supplementary Note”. This requirement made it essentially impossible to properly explain our model and method in the paper. After publishing lengthy supplements for the Cufflinks papers (see below) that I feel were poorly reviewed, to the detriment of the papers, I decided it was time to talk about this issue in public.

C. Trapnell, B.A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M.J. van Baren, S.L. Salzberg, B.J. Wold and L. Pachter, Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation, Nature Biotechnology 28 (2010), p 511–515. Supplementary Material: 42 pages.

C. Trapnell, D.G. Hendrickson, M. Sauvageau, L. Goff, J.L. Rinn and L. Pachter, Differential analysis of gene regulation at transcript resolution with RNA-seq, Nature Biotechnology 31 (2012), p 46–53. Supplementary Material: 70 pages.

My talk contained three examples selected to make a number of points:

• Methods in the supplement frequently contain ideas that transcend the specifics of the paper. These ideas can be valuable in the long run, but when they are in the supplement it is harder to identify what they are and to appreciate their significance.

• Supplements frequently contain errors (my own included). These errors make it difficult for others to understand the methods and implement them independently.
• In RNA-Seq specifically, there are a number of methodological issues buried in the supplements of various papers that have caused confusion in the field.

• The constant push of methods to supplements is part of a general trend to overemphasize the importance of data while minimizing the relevance of methods.

The examples were as follows:

1. Fragment assignment: The idea of probabilistically assigning ambiguously mapped fragments in RNA-Seq is present in many papers, but at least for me, it was the math worked out in the supplements of those papers (and many conversations with my collaborators, especially Cole Trapnell and Adam Roberts) that made me realize the importance of fragment assignment for *Seq (a toy sketch of the idea appears at the end of this post). I went on to explain how Nicolas Bray used these insights to develop a fragment assignment model for joint analysis of RNA-Seq. The result is the ability to magnify the effective coverage of individual samples from multiple samples, as shown in my talk using the GEUVADIS data: in this plot each point represents the accuracy for the samples when quantified independently (black), or by our method (red/blue). The difference between red and blue has to do with a technical choice in the method that I explained in the talk.

2. I talked about the problem of using raw counts for RNA-Seq analysis. Returning to a theme I have discussed in talks and on my blog previously, I explained that even when the goal is differential analysis, raw counts are flawed because “wrong does not cancel out wrong”. The idea of using raw count quantification knowing it is inaccurate, but arguing that it doesn’t matter because the bias cancels during comparisons (e.g. in differential expression or eQTL analysis), is mathematically equivalent to the following: acknowledging that $\frac{1}{2}+\frac{2}{3} \neq \frac{3}{5}$ (obtained by summing numerators and then dividing by the sum of denominators) but arguing that it is ok to say that $\frac{\frac{1}{2}+\frac{2}{3}}{\frac{3}{4}+\frac{5}{6}} = \frac{\frac{3}{5}}{\frac{8}{10}} = \frac{3}{4}$, which is obviously not correct (the answer is $\frac{14}{19}$). A key point I made is that even though it might seem that the wrong answer is at least close to the correct answer, in practice, on real data, the differences can be significant. I showed an analysis done by Cole Trapnell, using an extensive dataset generated in the Rinn lab for the Cufflinks 2 paper, that makes this point.

3. I talked about the different units currently being used for RNA-Seq quantification, such as CPM, RPKM, FPKM and TPM (all of them appeared in various talks during the meeting). I discussed the history of the units, and why they were chosen, and argued in favor of simply using the relative abundance estimates (perhaps normalized by a constant factor, as in TPM; see the sketch at the end of this post). This point of view was first advocated by Bo Li and Colin Dewey in their RSEM paper, and I hope the community will adopt their point of view.

My penultimate slide showed this world map of high-throughput sequencers. I think this is a very cool map, as it shows (by proxy) the extraordinary extent of sequencing going on worldwide. However, it is yet another example of a focus on data, and data generation, in genomics. Data is of course very important, but I showed another map for methods that illustrates a very different thing: the extent of computational biology going on around the world. The methods map is made from visits to the Cufflinks website. I mashed it with the sequencer map to make the case that data and methods go hand-in-hand.

Sequencers of the world and users of Cufflinks.
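As promised in points 1 and 3 above, here is a toy sketch of probabilistic fragment assignment by EM, with the final estimates reported as relative abundances (in TPM units). The compatibility matrix and effective lengths are made up, and the model is a simplification of those worked out in the supplements discussed in this post:

import numpy as np

# compat[r, t] = 1 if fragment r is compatible with transcript t (toy data)
compat = np.array([[1, 1, 0],
                   [1, 0, 1],
                   [0, 1, 1],
                   [1, 0, 0],
                   [0, 0, 1]], dtype=float)
eff_len = np.array([1000.0, 1500.0, 2000.0])      # effective transcript lengths (assumed)

n_frags, n_tx = compat.shape
alpha = np.full(n_tx, 1.0 / n_tx)                 # P(a fragment originates from transcript t)
for _ in range(200):
    # E-step: fractionally assign each fragment; a uniform position model gives weight alpha_t / l_t
    w = compat * (alpha / eff_len)
    w /= w.sum(axis=1, keepdims=True)
    # M-step: expected fragment counts become the new origin probabilities
    alpha = w.sum(axis=0) / n_frags

rho = (alpha / eff_len) / (alpha / eff_len).sum() # relative transcript abundances
print("TPM:", rho * 1e6)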
A blogpost by Lior Pachter at http://liorpachter.wordpress.com/2013/10/21/gtex/ suggested that GTEx uses a suboptimal transcript quantification method. He showed, using simulated data, that the method used by GTEx obtained a quality score that can be reached by a different method using 10-fold less data. This result led to the sensational subject line: “GTEx is throwing away 90% of their data”. We take such concerns very seriously and intend to be transparent to the community regarding our decisions. We stand by the processes and decisions we have taken to date given the data and time constraints, but of course always remain open to community suggestions for improvement. We have answered the questions raised by the blogpost below. We thank Lior Pachter for hosting this in his blog.

- Is GTEx LITERALLY throwing away 90% of the data?

Absolutely not! The GTEx project is not throwing away any data at all. In fact, it provides all data, in both raw and processed forms, to the scientific community at the dbGAP site (http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v3.p1) or on the GTEx portal (http://www.broadinstitute.org/gtex/). GTEx is running multiple tools to perform various analytical steps. The initial analysis of the GTEx pilot project data is completed and frozen, and the consortium is in the process of writing a manuscript describing the results. For transcript abundance quantification, which comprises a small fraction of the overall analyses of the pilot data, the consortium used FluxCapacitor and provided the outputs on the GTEx portal. Specifically, these results were used for analyses of splice- and isoform-QTLs and for assessing tissue-specificity of loss-of-function variants. Tools used in the pilot phase of GTEx will be re-evaluated, based on benchmarking results and large-scale production considerations (see below), and alternative methods, including for transcript quantification, may be employed in the future.

- Is the FluxCapacitor (FC) a published method?

Yes, although not as a standalone methodological paper but as part of a large study where RNAseq data were used to perform eQTL analysis (Montgomery, Nature 2010). That paper was peer reviewed and includes a 1.5 page description of the methodology in the methods section and a supplementary figure (Supplementary Figure 23) describing the method’s outline (http://www.nature.com/nature/journal/v464/n7289/full/nature08903.html). In addition, there is extensive documentation on the web at http://sammeth.net/confluence/display/FLUX/Home supporting the publication and offering download of the software. Indeed, the level of detail of the description of the method in the paper is not as comprehensive as in some standalone methods papers. We thank Lior for pointing this out and note that the documentation has been updated and will continue to be improved by the authors of FC.

- Why did GTEx decide to use the FluxCapacitor method in its pilot analyses?

In order to understand the decision to use FluxCapacitor (FC) in GTEx, one first needs to understand how such decisions are made in GTEx (see below). Initially we used Cufflinks (CL), which is the most commonly used tool in the field for quantifying isoforms. However, when using it at large scale (1000s of samples) we hit technical problems of large memory use and long compute times.
We attempted to overcome these difficulties, and investigated the possibility of parallelizing CL and contacted the CL developers for help. However, the developers advised us that CL could not be parallelized at that point. Due to project timelines, we started investigating alternative methods. The Guigo group had already produced transcript quantifications with FC on GTEx data, demonstrated biologically meaningful, high quality results (http://www.gtexportal.org/home?page=transcriptQuant), and provided the tool and support to test it and install it as part of the GTEx production analysis pipeline. Our current experience with FC is that it scales well and can be used in a production setting.

- Are exon level or gene level read counts appropriate for the purposes of GTEx and Geuvadis eQTL analyses?

Another criticism raised in the blogpost was regarding the use of exon-level and gene-level read counts, as opposed to transcript abundance levels, for calculating eQTLs in both the GTEx pilot project as well as in the recently published GEUVADIS paper (Lappalainen et al. Nature, 2013). The eQTL analyses for both projects mainly used exon-level and gene-level expression values but did not use simple read counts; rather, they used carefully normalized values after correcting for multiple covariates (as described in Lappalainen et al.). Transcript (or isoform) abundance levels, which are the subject of Lior Pachter’s simulations, require more sophisticated estimation methods (such as FC, CL or others), since one needs to deconvolute the contribution of each isoform to the exon-level signals (as different isoforms often share exons). We deliberately did not use isoform-abundance levels for the major part of the analysis since this is still a very active area of research and new tools continue to be developed. The analyses in Lappalainen et al. using transcript quantifications, such as population differences, were replicated with other approaches with highly consistent results (Fig. 1c, Fig. S14). Our results suggest that exon-level and gene-level quantifications are much more robust in cis-eQTL analysis. In the GEUVADIS paper we discovered ~7800 genes with exon-level eQTLs, and ~3800 genes with gene-level eQTLs. Initial tests for eQTL discovery using transcript quantifications from FC found <1000 genes with transcript-level eQTLs, and led to our decision to use the much more powerful exon and gene quantifications in that paper. Given the relatively small difference in correlation coefficients described in Lior Pachter’s post between FC and CL (~83% vs. ~92% for 20M fragments), and our previous experience mentioned above, we find it very unlikely that CL or any other transcript quantification method would discover a number of genes with transcript eQTLs anywhere near the 3800 or the 7800 figure. This suggests that transcript quantification methods do not capture biological variation as well as more straightforward exon and gene quantifications. The issue that was raised in the blog is regarding estimating relative expression levels of isoforms within a single sample, and is measured using the Spearman correlation metric. However, for single-gene eQTL analysis, which is the main goal of the GTEx pilot analysis, one searches for significant correlations between genotypes and expression levels of a single gene when compared across subjects. This correlation is robust against shifts (i.e. adding or subtracting a constant value) and changes in scale (applying a constant factor) of the expression levels.
Therefore, normalizing to the alignable territory of the gene (which will introduce the same factor across samples), and ignoring ambiguously mapped reads (which likely introduces a constant shift), will have little or no effect on the resulting eQTLs. We are constantly evaluating further quantification methods, but we believe that the eQTL results calculated as part of the GTEx pilot and the GEUVADIS paper are solid and robust.

- How does GTEx decide on methodologies to be used?

GTEx strives to provide to the community the highest quality raw and derived data as well as the highest quality scientific results. GTEx also operates with clear and rigid timelines and deliverables. Therefore, we prefer to use methods that already have been vetted by the community and can be used in a large-scale production setting. In this particular case the experience with FC as part of the GEUVADIS project was part of the consideration. Systematic benchmarking of tools is very important and we encourage the community to conduct such benchmarks. Proper benchmarking is not a simple task: one needs to carefully define the benchmarking metrics (which depend on the particular downstream use of the data), often there are not sufficient ground truth data, and simulations can often be deceiving since they don’t reflect real biology or experimental data. Therefore, whenever there are no published benchmarks that use metrics that are relevant to the GTEx project, we have to perform them within the project (prioritized by the impact on the results). One example is a recent comparison of different alignment methods (including TopHat and GEM), which was recently presented at the ASHG conference in Boston. Although isoform-level quantification is only a minor part of our current analyses, we continue to evaluate several methods (including FC and CL) and welcome any constructive input on this evaluation from the community. The results obtained by Pachter’s simulations suggest that CL outperforms FC (based on the Spearman correlation metric) but this seems to be at odds with our experience (perhaps due to unrealistic assumptions in the simulations). Clearly, further benchmarking is required to better understand the differences between tools and their effect on the final results (http://www.gtexportal.org/home?page=transcriptQuant).

- Is GTEx open to feedback and what are the mechanisms in place?

GTEx welcomes and encourages feedback and interaction with outside investigators to improve the analysis and data production. We offer several mechanisms for interaction: (i) the GTEx datasets are available for download (some require applying for access to protect donor privacy) and have already been shared with over 100 different research groups that carry out their own analyses of the data; (ii) we had a widely-attended international community meeting in June and plan to hold such meetings yearly, giving the opportunity for external groups to share their results; and (iii) we welcome e-mails to GTEXInfo@mail.nih.gov and comments in the GTEx portal (http://www.broadinstitute.org/gtex/). Finally, we are interested in facilitating systematic benchmarking of tools, and investigators that are interested in participating in such benchmarks or in defining the evaluation metrics are welcome to contact us.

To summarize, GTEx strives to produce, and make publicly available in a timely manner, the best possible data and analysis results, within data release and practical limitations.
By no means do we feel that all our analyses are the best possible in all aspects, or that we will perform all the different types of analysis one can do with these data. We are open to constructive feedback regarding the tools we use and the analyses we perform. Finally, all data are available to any investigator who desires to perform novel analyses with their own methods, and we anticipate that much improved and innovative analyses of the data will emerge with time.

Manolis Dermitzakis, Gad Getz, Kristin Ardlie, Roderic Guigo for the GTEx consortium

This week PLoS Computational Biology published a guide titled Ten Simple Rules for Reproducible Computational Research, by G.K. Sandve, A. Nekrutenko, J. Taylor and E. Hovig. The guide lists ten rules, including

Rule 6: For Analyses that Include Randomness, Note Underlying Random Seeds

This is somewhat akin to the biological practice of storing cDNA libraries at -20C. For computational biologists the rule might seem a bit excessive at first glance, and not quite at the same level of importance as Rule 1: for every result, keep track of how it was produced. Indeed, I doubt that any of my colleagues keep track of their random seeds; I certainly haven’t. But paying attention to (and recording) seeds used in random number generation is extremely important, and I thought I’d share an anecdote from one of my recent projects to make the point.

My student Atif Rahman has been working on a project for which a basic validation of our method required simulating multiple sets of sequencing reads with errors after inducing mutations into a reference genome. He started by using wgsim from SAMtools (a point of note is that wgsim only simulates reads with uniform sequencing error, but that is another matter). Initially he was running this command:

for i in {1..40}
do
  wgsim -d 200 -N 300000 -h ${genome}.fna ${genome}_${i}_1.fastq ${genome}_${i}_2.fastq
done

Somewhat to his surprise, he found that some of the read sets were eerily similar. Upon closer inspection, he realized what was going on. Multiple datasets were being generated in the same (literal) second, and since he was not setting the seeds, they were being chosen according to the internal (wall) clock, a practice that is common in many programs. As a result, those read sets were completely identical. This is because for a fixed seed, a pseudo-random number generator (which is how the computer is generating random numbers) computes the “random” numbers in a deterministic way (for more on random number generation see http://www.random.org/randomness/).

One way to circumvent the problem is to first generate a sequence of pseudo-random numbers (of course remembering to record the seed used!) and then to use the resulting numbers as seeds in wgsim using the “-S” option. The quick hack that Atif used was to insert a sleep() between iterations.
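A minimal sketch of that first approach in Python (the wgsim flags mirror the command above, the -S seed option is the one just mentioned, and the master seed value is arbitrary):

import random
import subprocess

random.seed(20131111)                      # master seed, recorded once
seeds = [random.randrange(1, 2**31) for _ in range(40)]

genome = "genome"                          # placeholder for the reference prefix used above
for i, s in enumerate(seeds, start=1):
    # same command as before, with an explicit per-iteration seed passed via -S
    subprocess.run(
        ["wgsim", "-S", str(s), "-d", "200", "-N", "300000", "-h",
         f"{genome}.fna", f"{genome}_{i}_1.fastq", f"{genome}_{i}_2.fastq"],
        check=True,
    )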

In summary, random number generation should not be done randomly.

The Genotype-Tissue Expression (GTEx) project is an NIH initiative to catalog human tissue-specific expression patterns in order to better understand gene regulation (see initial press release). The project is an RNA-Seq tour-de-force: RNA extracted from multiple tissues from more than 900 individuals is being quantified in more than 1,800 RNA-Seq experiments. An initial paper describing the experiments was published in Nature Genetics earlier this year and the full dataset is currently being analyzed by a large consortium of scientists.

I have been thinking recently about how to analyze genotype-tissue expression data, and have been looking forward to testing some ideas. But I have not yet become involved directly with the data, and in fact have not even submitted a request to analyze it. Given the number of samples, I’d been hoping that some basic mapping/quantification had already been done so that I could build on the work of the consortium. But, alas, this past week I got some bad news.

In a recent Twitter conversation, I discovered that the program being used by several key GTEx consortium members to quantify the data is Flux Capacitor, developed by Michael Sammeth while he was in Roderic Guigó's group at the CRG in Barcelona.

What is Flux Capacitor?

Strangely, the method has never been published, despite the fact that it has been used in ten publications over the course of four years, including high-profile papers from consortia such as ENCODE, GENCODE, GEUVADIS and GTEx. There is no manuscript on the author's website or in a preprint archive. There is a website for the program, but it is incomplete and unfinished, and contains no coherent explanation of what the program does. Papers using the method point to the article S. B. Montgomery, … , E. T. Dermitzakis, Transcriptome genetics using second generation sequencing in a Caucasian population, Nature 464 (2010) and/or the website http://sammeth.net/confluence/display/FLUX/Home for a description of the method. Here is what these citations amount to:

The Montgomery et al. paper contains one figure providing the “FluxCapacitor outline”. It is completely useless in actually providing insight into what Flux Capacitor does:

Modification of the top half of Supplementary Figure 23 from Montgomery et al. (2010), titled "Flux Capacitor Outline" (although it actually shows a splice graph if one corrects the errors as I have done in red).

The methods description in the Online Methods of Montgomery et al. can only be (politely) described as word salad. Consider for example the sentence:

In our approach we estimate the biases characteristic of each experiment by collecting read distribution profiles in non-overlapping transcripts, binned by several transcript lengths and expression levels. From these profiles, we estimate for each edge and transcript a flux correction factor $b^j_i$ that following the language of hydro-dynamic flow networks, we denote as the capacity of the edge, as the area under the transcript profile between the edge boundaries (Supplementary Fig. 23).

The indices $i$ and $j$ for $b^j_i$ are never defined, but more importantly it is completely unclear what the correction factor actually is, how it is estimated, and how it is used (this should be compared to the current sophistication of other methods). On the program website there is no coherent information either. Here is an example:

The resulting graph with edges labelled by the number of reads can be interpreted as a flow network where each transcript representing a transportation path from its start to its end and consequently each edge a possibly shared segment of transportation along which a certain number of reads per nucleotide — i.e., a flux — is observed.

I downloaded the code and it is undocumented, even to the extent that it is not clear what the input needs to be or what the output means. There is no example provided with the software to test the program.

I therefore became curious why GTEx chose Flux Capacitor instead of one of the many other freely available tools for RNA-Seq (e.g. ALEXA-Seq, CLIIQ, Cufflinks, eXpress, iReckon, IsoEM, IsoformEx, MISO, NEUMA, RSEM, rSeq, rQuant, SLIDE, TIGAR, …). Although many of these programs are not suitable for production-scale analysis, Cufflinks and RSEM certainly are, and eXpress was specifically designed for efficient quantification (linear in the number of mapped reads and constant memory). I looked around and no benchmark of Flux Capacitor has ever been performed; there is literally not even a mention of it in any paper other than in manuscripts by Sammeth, Guigó or Dermitzakis. So I thought that after four years of repeated use of the program in high-profile projects, I would take a look for myself:

After fumbling about with the barely usable Flux Capacitor software, I finally managed to run it on simulated data generated for my paper: Adam Roberts and Lior Pachter, Streaming fragment assignment for real time analysis of sequencing experiments, Nature Methods 10 (2013), 71–73. One example of the state of the software is the example page (the required sorted file is posted there, but downloading it requires the realization that it is linked from a non-obviously placed paperclip). Fortunately, I was using my own reads and the UCSC annotation. The Roberts-Pachter simulation is explained in the Online Methods of our paper (section "Simulation RNA-Seq study"). It consists of 75bp paired-end reads simulated according to parameters mimicking real data from an ENCODE embryonic stem cell line. I tested Flux Capacitor with both 10 million and 100 million simulated reads; the results are shown in the figure below:

Flux Capacitor accuracy on simulations with 10 million and 100 million reads. The top panels show scatterplots of estimated transcript abundance vs. true transcript abundance. The lower panels show the same data with both axes logged.

For comparison, the next figure shows the results of RSEM, Cufflinks and eXpress on a range of simulations (up to a billion reads) from the Roberts-Pachter paper (Figure 2a):

Modification of Figure 2a from A. Roberts and L. Pachter, Nature Methods (2013) showing the performance of Flux Capacitor in context.

Flux Capacitor has very poor performance. With 100 million reads, its performance is equivalent to that of the other programs at 10 million reads, and similarly, with 10 million reads it has the performance of the other programs at 1 million reads. I think it's fair to say that

Using Flux Capacitor is equivalent to throwing out 90% of the data!

The simulation is a best-case scenario. It adheres to the standard model for RNA-Seq, in which fragments are generated uniformly at random with lengths chosen from a distribution, and with errors (a rough sketch of this model is given after this paragraph). As explained above, all these parameters were set according to an actual ENCODE dataset, so that the difficulty of the problem corresponds to realistic RNA-Seq data. I can't explain the poor performance of Flux Capacitor because I don't understand the method. However, my best guess is that it is somehow solving min-flow using linear programming, along the lines of the properly formulated ideas in E. Bernard, L. Jacob, J. Mairal and J.-P. Vert, Efficient RNA isoform identification and quantification from RNA-seq data with network flows, Technical Report HAL-00803134, March 2013. If this is the case, the poor performance might be a result of difficulties stemming from the minimization of isoforms, reflected in the (incorrectly estimated) stripes on the left and bottom of the log-log plots. That is not to say the conclusions of the papers where Flux Capacitor is used are wrong. As can be seen from our benchmark, although performance is degraded with Flux Capacitor, the quantifications are not all wrong. For example, abundant transcripts are less likely to be affected by Flux Capacitor's obviously poor quantification. Still, the use of Flux Capacitor greatly reduces resolution of lowly expressed genes and, as mentioned previously, is effectively equivalent to throwing out 90% of the data.
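For readers unfamiliar with it, here is a rough sketch of the standard generative model (in commonly used notation, not necessarily the exact parameterization of our simulation): a fragment is generated by first picking a transcript $t$ with probability $P(t) = \frac{\rho_t \tilde{\ell}_t}{\sum_r \rho_r \tilde{\ell}_r}$, where $\rho_t$ is the relative abundance of transcript $t$ and $\tilde{\ell}_t$ its effective length (roughly, the number of positions at which a fragment can start). The fragment's start position is then chosen uniformly within the transcript, its length is drawn from the fragment length distribution, and sequencing errors are added to the resulting reads. Quantification amounts to estimating the $\rho_t$ from the observed fragments.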

As far as GTEx is concerned, I’ve been told that a significant amount of the analysis is based on raw counts obtained from reads uniquely mapping to the genome (this approach appears to have also been used in many of the other papers where Flux Capacitor was used). Adam Roberts and I examined the performance of raw counts in the eXpress paper (Figure S8, reproduced below):

Figure S8 from A. Roberts and L. Pachter, Nature Methods (2013) showing the limits of quantification when ignoring ambiguous reads. NEUMA (Normalization by Expected Uniquely Mappable Areas) calculates an effective length for each transcript in order to normalize counts based on uniquely mappable areas of transcripts. We modified NEUMA to allow for errors, thereby increasing the accuracy of the method considerably, but its accuracy remains inferior to eXpress, which does consider ambiguous reads. Furthermore, NEUMA is unable to produce abundance estimates for targets without sufficient amounts of unique sequence. The EM algorithm is superior because it can take advantage of different combinations of shared sequence among multiple targets to produce estimates. The accuracy was calculated using only the subset of transcripts (77% of total) that NEUMA quantifies.

Quantification with raw counts is even worse than Flux Capacitor. It is not even possible to quantify 23% of transcripts (due to insufficient uniquely mapping reads). This is why in the figure above the eXpress results are better than on the entire transcriptome (third figure of this post). The solid line shows that on the (raw count) quantifiable part of the transcriptome, quantification by raw counting is again equivalent to throwing out about 90% of the data. The dashed line is our own improvement of NEUMA (which required modifying the source code) to allow for errors in the reads. This leads to an improvement in performance, but the results still don't match eXpress (or RSEM and Cufflinks), and are worse than even Flux Capacitor if the unquantifiable transcripts are taken into account. In the recent Cufflinks 2 paper, we show that raw counts also cannot be used for differential analysis (as "wrong does not cancel out wrong"; see my previous post on this).

One criticism of my simulation study could be that I am not impartial. After all, Cufflinks and eXpress were developed in my group, and the primary developer of RSEM, Bo Li, is now my postdoc. I agree with this criticism! This study should have been undertaken a long time ago and subjected to peer review by the author(s?) of Flux Capacitor and not by me. The fact that I have had to do it is a failure on their part, not mine. Moreover, it is outrageous that multiple journals and consortia have published work based on a method that is essentially a black box. This degrades the quality of the science and undermines scientists who do work hard to diligently validate, benchmark and publish their methods. Open source (the Flux Capacitor source code is, in fact, available for download) is not open science. Methods matter.
