You are currently browsing the monthly archive for October 2013.

A blogpost by Lior Pachter at https://liorpachter.wordpress.com/2013/10/21/gtex/ suggested that GTEx uses a suboptimal transcript quantification method. He showed using simulated data, that the method used by GTEx obtained a quality score that can be reached by a different method using 10-fold less data. This result led to the sensational subject line: “GTEx is throwing away 90% of their data”.  We take such concerns very seriously and intend to be transparent to the community regarding our decisions. We stand by the processes and decisions we have taken to date given the data and time constraints but of course always remain open to community suggestions for improvement. We have answered the questions raised by the blogpost below. We thank Lior Pachter for hosting this in his blog.

– Is GTEx LITERALLY throwing away 90% of the data?

Absolutely not! The GTEx project is not throwing away any data at all. In fact, it provides all data, in both raw and processed forms, to the scientific community at dbGAP site (http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v3.p1) or on the GTEx portal (http://www.broadinstitute.org/gtex/). GTEx is running multiple tools to perform various analytical steps. The initial analysis of the GTEx pilot project data is completed and frozen and the consortium is in the process of writing a manuscript describing the results.  For transcript abundance quantification, which comprises a small fraction of the overall analyses of the pilot data, the consortium used FluxCapacitor and provided the outputs on the GTEx portal. Specifically, these results were used for analyses of splice- and isoform-QTLs and for assessing tissue-specificity of loss-of-function variants.). Tools used in the pilot phase of GTEx will be re-evaluated, based on benchmarking results and large-scale production consideration (see below), and alternative methods, including for transcript quantification, may be employed in the future.

– Is the FluxCapacitor (FC) a published method?

Yes, although not as a standalone methodological paper but as part of a large study where RNAseq data were used to perform eQTL analysis (Montgomery, Nature 2010). That paper was peer reviewed and includes a 1.5 page description of the methodology in the methods section and a supplementary figure (Supplementary Figure 23) describing the method’s outline (http://www.nature.com/nature/journal/v464/n7289/full/nature08903.html). In addition, there is extensive documentation on the web at http://sammeth.net/confluence/display/FLUX/Home supporting the publication and offering download of the software. Indeed, the level of detail of the description of the method in the paper is not as comprehensive as in some standalone methods papers. We thank Lior for pointing this out and note that the documentation has been updated and will continue to be improved by the authors of FC.

– Why did GTEx decide to use the FluxCapacitor method in it’s pilot analyses?

In order to understand the decision to use FluxCapacitor (FC) in GTex, one first needs to understand how such decisions are made in GTEx (see below). Initially we used Cufflinks (CL), which is the most commonly used tool in the field for quantifying isoforms. However, when using it at large scale (1000s of samples) we hit technical problems of large memory use and long compute times. We attempted to overcome these difficulties, and investigated the possibility of parallelizing CL and contacted the CL developers for help. However, the developers advised us that CL could not be parallelized at that point. Due to project timelines, we started investigating alternative methods. The Guigo group has already produced transcript quantifications with the FC on GTEx data and demonstrated biologically meaningful high quality results (http://www.gtexportal.org/home?page=transcriptQuant) and provided the tool and support to test it and install as part of the GTEx production analysis pipeline. Our current experience with FC is that it scales well and can be used in a production setting.

– Are exon level or gene level read counts appropriate for the purposes of GTEx and Geuvadis eQTL analyses?

Another criticism raised in the blogpost was regarding the use of exon-level and gene-level read counts, as opposed to transcript abundance levels, for calculating eQTLs in both the GTEx pilot project as well as in the recently published GEUVADIS paper (Lappalainen et al. Nature, 2013). The eQTL analyses for both projects mainly used exon-level and gene-level expression values but did not use simple read counts, rather they used carefully normalized values after correcting for multiple covariates (as described in Lappalainen et al.). Transcript (or isoform) abundance levels, which are the subject of Lior Pachter’s simulations, require more sophisticated estimation methods (such as FC, CL or others) since one needs to deconvolute the contribution of each isoform to the exon-level signals (as different isoforms often share exons). We deliberately did not use isoform-abundance levels for the major part of the analysis since this is still a very active area of research and new tools continue to be developed. The analyses in Lappalainen et al. using transcript quantifications, such as population differences, were replicated with other approaches with highly consistent results (Fig. 1c, Fig. S14).

Our results suggest that exon-level and gene-level quantifications are much more robust in cis-eQTL analysis. In the GEUVADIS paper we discovered ~7800 genes with exon-level eQTLs, and ~3800 genes with gene-level eQTLs. Initial tests for eQTL discovery using transcript quantifications from FC found <1000 genes with transcript-level eQTLs, and led to or decision to use the much more powerful exon and gene quantifications in that paper. Given the relatively small difference in correlation coefficients described in Lior Pachter’s post between FC and CL (~83% vs. ~92% for 20M fragments), and our previous experience mentioned above, we find it very unlikely that CL or any other transcript quantification method would discover a number of genes with transcript eQTLs anywhere near the 3800 or the 7800 figure. This suggests that transcript quantification methods do not capture biological variation as well as more straightforward exon and gene quantifications.

The issue that was raised in the blog is regarding estimating relative expression levels of isoforms within a single sample and is measured using the Spearman correlation metric. However, for single-gene eQTL, which is the main goal of GTEx pilot analysis, one searches for significant correlations between genotypes and expression levels of a single gene when compared across subjects.  This correlation is robust against shifts (ie. adding or subtracting a constant value) and changes in scale (applying a constant factor) of the expression levels. Therefore, normalizing to the alignable territory of the gene (which will introduce the same factor across samples), and ignoring ambiguously mapped reads (which likely introduces a constant shift) will have little or no effect on the resulting eQTLs. We are constantly evaluating further quantification methods, but we believe that the eQTL results calculated as part of the GTEx pilot and the GEUVADIS paper are solid and robust.

– How does GTEx decide on methodologies to be used?

GTEx strives to provide to the community the highest quality raw and derived data as well as the highest quality scientific results. GTEx also operates with clear and rigid timelines and deliverables. Therefore, we prefer to use methods that already have been vetted by the community and can be used in a large-scale production setting. In this particular case the experience with FC as part of the GEUVADIS project was part of the consideration.

Systematic benchmarking of tools is very important and we encourage the community to conduct such benchmarks. Proper benchmarking is not a simple task — one needs to carefully define the benchmarking metrics (which depend on the particular downstream use of the data), often there are not sufficient ground truth data, and simulations can often be deceiving since they don’t reflect real biology or experimental data. Therefore, whenever there are no published benchmarks that use metrics that are relevant to the GTEx project, we have to perform them within the project (prioritized by the impact on the results). One example is a recent comparison of different alignment methods (including TopHat and GEM), which was recently presented at the ASHG conference in Boston.

Although isoform-level quantification is only a minor part of our current analyses, we continue to evaluate several methods (including FC and CL) and welcome any constructive input on this evaluation from the community. The results obtained by Pachter’s simulations suggest that CL outperforms FC (based on the Spearman correlation metric) but this seems to be at odds with our experience (perhaps due to unrealistic assumptions in the simulations). Clearly, further benchmarking is required to better understand the differences between tools and their effect on the final results (http://www.gtexportal.org/home?page=transcriptQuant).

– Is GTEx open to feedback and what are the mechanisms in place?

GTEx welcomes and encourages feedback and interaction with outside investigators to improve the analysis and data production. We offer several mechanisms for interaction: (i) The GTEx datasets are available for download (some require applying for access to protect donor privacy) and have already been shared with over 100 different research groups that carry out their own analyses of the data; (ii) We had a widely-attended international community meeting in June and plan to hold such meetings yearly, giving the opportunity for external groups to share their results; and (iii) we welcome e-mails to GTEXInfo@mail.nih.gov and comments in the GTEx portal (http://www.broadinstitute.org/gtex/). Finally, we are interested in facilitating systematic benchmarking of tools and investigators that are interested in participating in such benchmarks or in defining the evaluation metrics are welcome to contact us.

To summarize, GTEx strives to produce, and make publicly-available in a timely manner, the best possible data and analysis results, within data release and practical limitations. By no means do we feel that all our analyses are the best possible in all aspects, or that we will perform all the different types of analysis one can do with these data. We are open to constructive feedback regarding the tools we use and the analyses we perform. Finally, all data are available to any investigator who desires to perform novel analyses with their own methods and we anticipate that much improved and innovative analyses of the data will emerge with time.

Manolis Dermitzakis, Gad Getz, Krisitn Ardlie, Roderic Guigo for the GTEx consortium

This week PLoS Computational Biology published a guide titled Ten Simple Rules for Reproducible Computational Research, by G.K. Sandve, A. Nekrutenko, J. Taylor and E. Hovig. The guide lists ten rules, including

Rule 6: For Analyses that Include Randomness, Note Underlying Random Seeds

This is somewhat akin to the biological practice of storing cDNA libraries at -20C. For computational biologists the rule might seem a bit excessive at first glance, and not quite at the same level of importance as

Rule 1: for every result, keep track of how it was produced.

Indeed, I doubt that any of my colleagues keep track of their random seeds; I certainly haven’t. But paying attention to (and recording) seeds used in random number generation is extremely important, and I thought I’d share an anecdote from one of my recent projects to make the point.

My student Atif Rahman has been working on a project for which a basic validation of our method required simulating multiple sets of sequencing reads with error after inducing mutations into a reference genome. He started by using wgsim from SAMtools (a point of note is that wgsim only simulates reads with uniform sequencing error but that is another matter). Initially he was running this command

```for i in {1..40} do
wgsim -d 200 -N 300000 -h \${genome}.fna \${genome}_\${i}_1.fastq
\${genome}_\${i}_2.fastq
done```

Somewhat to his surprise, he found that some of the read sets were eerily similar. Upon closer inspection, he realized what was going on. Multiple datasets were being generated in the same (literal) second, and since he was not setting the seeds, they were being chosen according to the internal (wall) clock– a practice that is common in many programs. As a result those read sets were completely identical. This is because for a fixed seed, a pseudo-random number generator (which is how the computer is generating random numbers) computes the “random” numbers in a deterministic way (for more on random number generation see http://www.random.org/randomness/ ).

One way to circumvent the problem is to first generate a sequence of pseudo-random numbers (of course remembering to record the seed used!) and then to use the resulting numbers as seeds in wgsim using the “-S” option. The quick hack that Atif used was to insert a sleep() between iterations.

In summary, random number generation should not be done randomly.

The Genotype-Tissue Expression (GTEx) project is an NIH initiative to catalog human tissue-specific expression patterns in order to better understand gene regulation (see initial press release). The project is an RNA-Seq tour-de-force: RNA extracted from multiple tissues from more than 900 individuals is been quantified with more than 1,800 RNA-Seq experiments. An initial paper describing the experiments was published in Nature Genetics earlier this year and the full dataset is currently being analyzed by a large consortium of scientists.

I have been thinking recently about how to analyze genotype-tissue expression data, and have been looking forward to testing some ideas. But I have not yet become involved directly with the data, and in fact have not even submitted a request to analyze it. Given the number of samples, I’d been hoping that some basic mapping/quantification had already been done so that I could build on the work of the consortium. But, alas, this past week I got some bad news.

In a recent twitter conversation, I discovered that the program that is being used by several key GTEx consortium members to quantify the data is Flux Capacitor developed by Michael Sammeth while he was in Roderic Guigós group at the CRG in Barcelona.

What is Flux Capacitor?

Strangely, the method has never been published, despite the fact that it has been used in ten publications over the course of four years, including high profile papers from consortia such as ENCODE, GENCODE, GEUVADIS and GTEx. There is no manuscript on the author’s website or in a preprint archive. There is a website for the program but it is incomplete and unfinished, and contains no coherent explanation of what the program does. Papers using the method point to the article S. B. Montgomery, … , E. T. DermitzakisTranscriptome genetics using second generation sequencing in a Caucasian population, Nature 464 (2010) and/or the website http://sammeth.net/confluence/display/FLUX/Home for a description of the method. Here is what these citations amount to:

The Montgomery et al. paper contains one figure providing the “FluxCapacitor outline”. It is completely useless in actually providing insight into what Flux Capacitor does:

Modification of the top half of Supplementary Figure 23 from Montgomery et al (2010) titled “Flux Capacitor Outline” (although it actually shows a splice graph if one corrects the errors as I have done in red).

The methods description in the Online Methods of Montgomery et al. can only be (politely) described as word salad. Consider for example the sentence:

In our approach we estimate the biases characteristic of each experiment by collecting read distribution profiles in non-overlapping transcripts, binned by several transcript lengths and expression levels. From these profiles, we estimate for each edge and transcript a flux correction factor $b^j_i$ that following the language of hydro-dynamic flow networks, we denote as the capacity of the edge, as the area under the transcript profile between the edge boundaries (Supplementary Fig. 23).

The indices and j for $b^j_i$ are never defined, but more importantly its completely unclear what the the correction factor actually is, how it is estimated, and how it is used (this should be compared to the current sophistication of other methods). On the program website there is no coherent information either. Here is an example:

The resulting graph with edges labelled by the number of reads can be interpreted as a flow network where each transcript representing a transportation path from its start to its end and consequently each edge a possibly shared segment of transportation along which a certain number of reads per nucleotide — i.e., a flux — is observed.

I downloaded the code and it is undocumented- even to the extent that it is not clear what the input needs to be or what the output means. There is no example provided with the software to test the program.

I therefore became curious why GTEx chose Flux Capacitor instead of many other freely available tools for RNA-Seq (e.g. ALEXA-SeqCLIIQCufflinks, eXpress, iReckon IsoEM, IsoformExMISO, NEUMARSEM, rSEQrQuantSLIDE, TIGAR, …). Although many of these programs are not suitable for production-scale analysis, Cufflinks and RSEM certainly are, and eXpress was specifically designed for efficient quantification (linear in the number of mapped reads and constant memory). I looked around and no benchmark of Flux Capacitor has ever been performed–there is literally not even a mention of it in any paper other than in manuscripts by Sammeth, Guigó or Dermitzakis. So I thought that after four years of repeated use of the program in high profile projects, I would take a look for myself:

After fumbling about with the barely usable Flux Capacitor software, I finally managed to run it on simulated data generated for my paper: Adam Roberts and Lior Pachter, Streaming fragment assignment for real time analysis of sequencing experiments, Nature Methods 10 (2013), 71–73. One example of the state of the software is the example page (the required sorted file is posted there but its download requires the realization that is is linked to from the non-obviously placed paperclip). Fortunately, I was using my own reads and the UCSC annotation. The Roberts-Pachter simulation is explained in the Online Methods of our paper (section “Simulation RNA-Seq study”). It consists of 75bp paired-end reads simulated according to parameters mimicking real data from an ENCODE embryonic stem cell line. I tested Flux Capacitor with both 10 million and 100 million simulated reads; the results are shown in the figure below:

Flux Capacitor accuracy on simulations with 10 million and 100 million reads. The top panels show scatterplots of estimated transcript abundance vs. true transcript abundance. The lower panels show the same data with both axes logged.

For comparison, the next figure shows the results of RSEM, Cufflinks and eXpress on a range of simulations (up to a billion reads) from the Roberts-Pachter paper (Figure 2a):

Modification of Figure 2a from A. Roberts and L. Pachter, Nature Methods (2013) showing the performance of Flux Capacitor in context.

Flux Capacitor has very poor performance. With 100 million reads, its performance is equivalent to other software programs at 10 million reads, and similarly, with 10 million reads, it has the performance of other programs at 1 million reads. I think its fair to say that

Using Flux Capacitor is equivalent to throwing out 90% of the data!

The simulation is a best case scenario. It adheres to the standard model for RNA-Seq in which fragments are generated uniformly at random with lengths chosen from a distribution, and with errors. As explained above, all these parameters were set according to an actual ENCODE dataset, so that the difficulty of the problem corresponds to realistic RNA-Seq data. I can’t explain the poor performance of Flux Capacitor because I don’t understand the method. However my best guess is that it is somehow solving min-flow using linear programming along the lines of the properly fomulated ideas in E. Bernard, L. Jacob, J. Mairal and J.-P. VertEfficient RNA isoform identification and quantification from RNA-seq data with network flows, Technical Report HAL-00803134, March 2013. If this is the case, the poor performance might be a result of some difficulties resulting from the minimization of isoforms and reflected in the (incorrectly estimated) stripes on the left and bottom of the log-log plots. That is not to say the conclusions of the papers where Flux Capacitor is used are wrong. As can be seen from our benchmark, although performance is degraded with Flux Capacitor, the quantifications are not all wrong. For example, abundant transcripts are less likely to be affected by Flux Capacitor’s obviously poor quantification. Still, the use of Flux Capacitor greatly reduces resolution of low-expressed genes and, as mentioned previously, is effectively equivalent to throwing out 90% of the data.

As far as GTEx is concerned, I’ve been told that a significant amount of the analysis is based on raw counts obtained from reads uniquely mapping to the genome (this approach appears to have also been used in many of the other papers where Flux Capacitor was used). Adam Roberts and I examined the performance of raw counts in the eXpress paper (Figure S8, reproduced below):

Figure S8 from A. Roberts and L. Pachter, Nature Methods (2013) showing the limits of quantification when ignoring ambiguous reads. NEUMA (Normalization by Expected Uniquely Mappable Areas) calculates an effective length for each transcript in order to normalize counts based on uniquely mappable areas of transcripts. We modified NEUMA to allow for errors, thereby increasing the accuracy of the method considerably, but its accuracy remains inferior to eXpress, which does consider ambiguous reads. Furthermore, NEUMA is unable to produce abundance estimates for targets without sufficient amounts of unique sequence. The EM algorithm is superior because it can take advantage of different combinations of shared sequence among multiple targets to produce estimates. The accuracy was calculated using only the subset of transcripts (77% of total) that NEUMA quantifies.

Quantification with raw counts is even worse than Flux Capacitor. It is not even possible to quantify 23% of transcripts  (due to insufficient uniquely mapping reads). This is why in the figure above the eXpress results are better than on the entire transcriptome (third figure of this post). The solid line shows that on the (raw count) quantifiable part of the transcriptome, quantification by raw counting is again equivalent to throwing out about 90% of the data. The dashed line is our own improvement of NEUMA (which required modifying the source code) to allow for errors in the reads. This leads to an improvement in performance, but results still don’t match eXpress (and RSEM and Cufflinks), and are worse than even Flux Capacitor if the unquantifiable transcripts are taken into account. In the recent Cufflinks 2 paper, we show that raw counts also cannot be used for differential analysis (as “wrong does not cancel out wrong”–  see my previous post on this).

One criticism of my simulation study could be that I am not impartial. After all, Cufflinks and eXpress were developed in my group, and the primary developer of RSEM, Bo Li, is now my postdoc. I agree with this criticism! This study should have been undertaken a long time ago and subjected to peer review by the author(s?) of Flux Capacitor and not by me. The fact that I have had to do it is a failure on their part, not mine. Moreover, it is outrageous that multiple journals and consortia have published work based on a method that is essentially a black box. This degrades the quality of the science and undermines scientists who do work hard to diligently validate, benchmark and publish their methods. Open source (the Flux Capacitor source code is, in fact, available for download) is not open science. Methods matter.

Hui Jiang and Julia Salzman have posted a new paper on the arXiv proposing a novel approach to correcting for non-uniform coverage of transcripts in RNA-Seq: “A penalized likelihood approach for robust estimation of isoform expression” (October 1, 2013).

Their paper addresses the issue of non-uniformity of read coverage across transcripts in RNA-Seq, an issue that is frustrating for the challenges it presents in analysis. The non-uniformity of read coverage in RNA-Seq was first noticed in A. Mortazavi et al., Mapping and quantifying mammalian transcriptomes, Nature Methods 5 (2008), 621–628. Figure 1 in the paper (see below) shows an example of non-uniform coverage, and the paper discusses ideas for library preparation that can reduce bias and improve uniformity.

Figure 1b from Mortazavi et al. (2008) showing (non-uniform) coverage of Myf6.

Supplementary Figure 1a from Mortazavi et al. (2008) describing uniformity of coverage achievable with different library preparations. “Deviation from uniformity” was assessed using the Kolmogorov-Smirnov test.

The experimental approach of modifying library preparation to reduce non-uniformity has been complemented by statistical approaches to the problem. Specifically, various models have been proposed for “correcting” for experimental artefacts that induce non-uniform coverage. To understand Jiang and Salzman’s latest paper, it is helpful to review previous approaches that have been proposed. Read the rest of this entry »

Today I came across an arXiv posting from yesterday:

Theoretical bounds on mate-pair information for accurate genome assembly by Henry Lin and Yufeng Shen (arxiv.org/abs/1310.1653, October 7 2013).

It purports to provide a sufficient condition for genome assembly, when in fact it is a rediscovery of Ukkonen’s condition for assembly (and a poor rediscovery at that). I’ve seen this happen before so I thought I’d make a short post about this.

Ukkonen’s paper is

E. Ukkonen, Approximate string matching with q-grams and maximal matches, Theoretical Computer Science 92 (1992), no. 1, 191–211.

The connection to genome assembly is explained well in a recent paper by Guy Bresler, Ma’ayan Bresler and David Tse: Optimal assembly for high-throughput shotgun sequencing (arxiv.org/abs/1301.0068, 18 February 2013).

The following figure from the Bresler et al. paper illustrates Ukkonen’s condition:

The condition is that the read length L must be at least one greater than the maximum length of any interleaved repeat or triple repeat. In the figure above (Figure 4 in the Bresler et al. paper), it is impossible to distinguish the two different assemblies (that merely interchange the sequence between the interleaved repeats) with reads of length L. There is a similar example of switching for triple repeats.

The lower bound of Lin and Shen essentially recapitulates the above condition. Moreover, the upper bound is trivial: it merely describes a sufficient library design that if sequenced completely would allow for assembly. Bresler et al. go much farther than this, and describe an approach to optimal assembly under a random model of (shotgun) sequencing that in particular, requires shotgun generalizations of Ukkonen’s combinatorial condition. In a future post I will describe in much more detail the Bresler et al. results. For now, it suffices to say, Ukkonen should be essential (starting) reading for anyone working on assembly.

Update (October 8): the authors have graciously contacted me and acknowledged their oversight in not referring to Ukkonen’s work. They will update their manuscript and repost to the arXiv. They also pointed out that their paper discusses specifically paired-end reads, and that this requires a slightly different analysis than Ukkonen’s initial work (a good point that I should have mentioned in the original post). I applaud them for (a) posting the initial result on the arXiv thereby allowing the community to look at their work prior to publication and (b) for responding immediately to improve it given the feedback.