You are currently browsing the category archive for the ‘papers’ category.

Earlier this month I posted a new paper on the bioRxiv:

Jase Gehring, Jeff Park, Sisi Chen, Matt Thomson, and Lior Pachter, Highly Multiplexed Single-Cell RNA-seq for Defining Cell Population and Transciptional Spaces, bioRxiv, 2018.

The paper offers some insights into the benefits of multiplex single-cell RNA-Seq, a molecular implementation of information multiplexing. The paper also reflects the benefits of a multiplex lab, and the project came about thanks to Jase Gehring, a multiplex molecular biologist/computational biologist in my lab.

mult·i·plex
/`məltəˌpleks/
– consisting of many elements in a complex relationship.
– involving simultaneous transmission of several messages along a single channel of communication.

Conceptually, Jase’s work presents a method for chemically labeling cells from multiple samples with DNA nucleotides so that samples can be pooled prior to single-cell RNA-Seq, yet cells can subsequently be associated with their samples of origin after sequencing. This is achieved by labeling all cells from a sample with DNA that is unique to that sample; in the figure below colors are used to represent the different DNA tags that are used for each sample:

This is analogous to the barcoding of transcripts in single-cell RNA-Seq, that allows for transcripts from the same cell of origin to be associated with each other, yet in this framework there is an additional layer of barcoding of cells.

The tagging mechanism is a click chemistry one-pot, two-step reaction in which cell samples are exposed to methyltetrazine-activated DNA (MTZ-DNA) oligos as well as the amine-reactive cross-linker NHS-trans-cyclooctene (NHS-TCO). The NHS functionalized oligos are formed in situ by reaction of methyltetrazine with trans-cyclooctene (the inverse-election demand Diels-Alder (IEDDA) reaction). Nucleophilic amines present on all proteins, but not nucleic acids, attack the in situ-formed NHS-DNA, chemoprecipitating the functionalized oligos directly onto the cells:

MTZ-DNAs are made by activating 5′-amine modified oligos with NHS-MTZ for the IEDDA reaction, and they are designed with a PCR primer, a cell tag (a unique “barcode” sequence) and a poly-A tract so that they can be captured by poly-T during single-cell RNA-Seq:

Such oligos can be readily ordered from IDT. We are careful to refer to the identifying sequences in these oligos as cell tags rather than barcodes so as not to confuse them with cell barcodes which are used in single-cell RNA-Seq to associate transcripts with cells.

The process of sample tagging for single-cell RNA-Seq is illustrated in the figure below. It shows how the tags, appearing as synthetic “transcripts” in cells, are captured during 3′ based microfluidic single-cell RNA-Seq and are subsequently deciphered by sequencing a tag library alongside the cDNA library:

This significance of multiplexing is manifold. First, by labeling cells prior to performing single-cell RNA-Seq, multiplexing allows for controlling a trade off between the number of cells assayed per sample, and the total number of samples analyzed. This allows for leveraging the large number of cells that can be assayed with current technologies to enable complex experimental designs based on many samples. In our paper we demonstrate this by performing an experiment consisting of single-cell RNA-Seq of neural stem cells (NSCs) exposed to 96 different combinations of growth factors. The experiment was conducted in collaboration with the Thomson lab that is interested in performing large-scale perturbation experiments to understand cell fate decisions in response to developmental signals. We examined NSCs subjected to different concentrations of Scriptaid/Decitabine, epidermal growth factor/basic fibroblast growth factor, retinoid acid, and bone morphogenic protein 4. In other words, our experiment corresponded to a 4x4x6 table of conditions, and for each condition we performed a single-cell RNA-Seq experiment (in multiplex).

This is one of the largest (in terms of samples) single-cell RNA-Seq experiments to date: a 100-fold decrease in the number of cells we collected per sample allowed us to perform an experiment with 100x more samples. Without multiplexing, an experiment that cost us ~\$7,000 would cost a few hundred thousand dollars, well outside the scope of what is possible in a typical lab. We certainly would have not been able to perform the experiment without multiplexing. Although the cost tradeoff is impactful, there are many other important implications of multiplexing as well:

• Whereas simplex single-cell RNA-Seq is descriptive, focusing on what is in a single sample, multiplex single-cell RNA-Seq allows for asking how? For example how do cell states change in response to perturbations? How does disease affect cell state and type?
• Simplex single-cell RNA-Seq leads to systematics arguments about clustering: when do cells that cluster together constitute a “cell type”? How many clusters are real? How should clustering be performed? Multiplex single-cell RNA-Seq provides an approach to assigning significance to clusters via their association with samples. In our paper, we specifically utilized sample identification to determine the parameters/thresholds for the clustering algorithm:On the left hand side is a t-SNE plot labeled by different samples, and on the right hand side de novo clusters. The experiment allowed us to confirm the functional significance of a cluster as a cell state resulting from a specific range of perturbation conditions.
• Multiplexing reduces batch effect, and also makes possible the procurement of more replicates in experiments, an important aspect of single-cell RNA-Seq as noted by Hicks et al. 2017.
• Multiplexing has numerous other benefits, e.g. allowing for the detection of doublets and their removal prior to analysis. This useful observation of Stoeckius et al. makes possible higher-throughput single-cell RNA-Seq. We also found an intriguing relationship between tag abundance and cell size. Both of these phenomena are illustrated in one supplementary figure of our paper that I’m particularly fond of:

It shows a multiplexing experiment in which 8 different samples have been pooled together. Two of these samples are human-only samples, and two are mouse-only. The remaining four are samples in which human and mouse cells have been mixed together (with 2,3,4 and 5 tags being used for each sample respectively). The t-SNE plot is made from the tag counts, which is why the samples are neatly separated into 8 clusters. However in Panel b, the cells are colored by their cDNA content (human, mouse, or both). The pure samples are readily identifiable, as are the mixed samples. Cell doublets (purple) can be easily identified and therefore removed from analysis. The relationship between cell size and tag abundance is shown in Panel d. For a given sample with both human and mouse cells (bottom row), human cells give consistently higher sample tag counts. Along with all of this, the figure shows we are able to label a sample with 5 tags, which means that using only 20 oligos (this is how many we worked with for all of our experiments) it is possible to label ${20 \choose 5} = 15,504$ samples.

• Thinking about hundreds (and soon thousands) of single-cell experiments is going to be complicated. The cell-gene matrix that is the fundamental object of study in single-cell RNA-Seq extends to a cell-gene-sample tensor. While more complicated, there is an opportunity for novel analysis paradigms to be developed. A hint of this is evident in our visualization of the samples by projecting the sample-cluster matrix. Specifically, the matrix below shows which clusters are represented within each sample, and the matrix is quantitative in the sense that the magnitude of each entry represents the relative abundance of cells in a sample occupying a given cluster:
A three-dimensional PCA of this matrix reveals interesting structure in the experiment. Here each point is an entire sample, not a cell, and one can see how changes in factors move samples in “experiment space”:

As experiments become even more complicated, and single-cell assays become increasingly multimodal (including not only RNA-Seq but also protein measurements, methylation data, etc.) development of a coherent mathematical framework for single-cell genomics will be central to interpreting the data. As Dueck et al. 2015 point out, such analysis is likely to not only be mathematically interesting, but also functionally important.

We aren’t the only group thinking about sample multiplexing for single-cell RNA-Seq. The “demuxlet” method by Kang et al., 2017 is an in silico approach based on multiplexing from genomic variation. Kang et al. show that if pooled samples are genetically heterogeneous, genotype data can be used to separate samples providing an effective solution for multiplexing single-cell RNA-Seq in large human studies. However demuxlet has limitations, for example it cannot be used for samples from a homogenous genetic background. Two papers at the end of last year develop an epitope labeling strategy for multiplexing: Stoeckius et al. 2017 and Peterson et al. 2017. While epitope labeling provides additional information that can be of interest, our method is more universal in that it can be used to multiplex any kind of samples, even from different organisms (a point we make with the species mixing multiplex experiment I described above). The approaches are also not exclusive, epitope labeling could be coupled to a live cell DNA tagging multiplex experiment allowing for the same epitopes to be assayed together in different samples. Finally, our click chemistry approach is fast, cheap and convenient, immediately providing multiplex capability for thousands, or even hundreds of thousands of samples.

One interesting aspect of Jase’s multiplexing paper is that the project it describes was itself a multiplexing experiment of sorts. The origins of the experiment date to 2005 when I was awarded tenure in the mathematics department at UC Berkeley. As is customary after tenure trauma, I went on sabbatical for a year, and I used that time to ponder career related questions that one is typically too busy for. Questions I remember thinking about: Why exactly did I become a computational biologist? Was a mathematics department the ideal home for me? Should I be more deeply engaged with biologists? Were the computational biology papers I’d been writing meaningful? What is computational biology anyway?

In 2008, partly as a result of my sabbatical rumination but mostly thanks to the encouragement and support of Jasper Rine, I changed the structure of my appointment and joined the UC Berkeley Molecular and Cell Biology (MCB) department (50%). A year later, I responded to a call by then Dean Mark Schlissel and requested wet lab space in what was to become the Li Ka Shing Center at UC Berkeley. This was not a rash decision. After working with Cole Trapnell on RNA-Seq I’d come to the conclusion that a small wet lab would be ideal for our group to better learn the details of the technologies we were working on, and I felt that practicing them ourselves would ultimately be the best way to arrive at meaningful (computational) methods contributions. I’d also visited David Haussler‘s wet lab where I met Jason Underwood who was working on FragSeq at the time. I was impressed with his work and what I saw were important benefits of real contact between wet and dry, experiment and computation.

In 2011 I was delighted to move into my new wet lab. The decision to give me a few benches was a bold and unexpected one, spearheaded by Mark Schlissel, but also supported by a committee he formed to decide on the make up of the building. I am especially grateful to John Ngai, Art Reingold and Randy Scheckman for their help. However I was in a strange position starting a wet lab as a tenured professor. On the one hand the security of tenure provided some reassurance that a failure in the wet lab would not immediately translate to a failure of career. On the other hand, I had no startup funds to buy all the basic infrastructure necessary to run a lab. CIRM, Mark Schlissel, and later other senior faculty in Molecular & Cell Biology at UC Berkeley, stepped in to provide me with the basics: a -80 and -20, access to a shared cold room, a Bioanalyzer (to be shared with others in the building), and a thermocycler. I bought some other basic equipment but the most important piece was the recruitment of my first MCB graduate student: Shannon Hateley. Shannon and I agreed that she would set up the lab and also be lab manager, while I would supervise purchasing and other organization lab matters. I obtained informed consent from Shannon prior to her joining my lab, for what would be a monumental effort requested of her. We also agreed she would be co-advised by another molecular biologist “just in case”.

With Shannon’s work and then my second molecular biology student, Lorian Schaeffer, the lab officially became multiplexed. Jase, who initiated and developed not only the molecular biology but also the computational biology of Gehring et al. 2018 is the latest experimentalist to multiplex in our group. However some of the mathematicians now multiplex as well. This has been a boon to the research of the group and I see Jase’s paper as fruit that has grown from the diversity in the lab. Moving forward, I see increasing use of mathematics ideas in the development of novel molecular biology. For example, current single-cell RNA-Seq multiplexing is a form of information multiplexing that is trivial in comparison to the multiplexing ideas from information theory; the achievements are in the molecular molecular implementations, but in the future I foresee much more of a blur between wet and dry and increasingly sophisticated mathematical ideas being implemented with molecular biology.

Hedy Lamarr, the mother of multiplexing.

The idea that 2016 was the worst year in history was already floated back in July, but despite a lot of negativity, there were definitely good things that happened in 2016. This was certainly true in terms of scientific discoveries and breakthroughs. One of my favorites was a significant improvement to the upper bound of the Happy Ending problem after 81 years!

The Happy Ending problem was the brainchild of mathematician Esther Klein, who proved that any five points in general position in the plane must contain a convex quadrilateral.

The extension of the problem asks, for each n, for the minimum number $f(n)$ so that any $f(n)$ points in general position in the plane contain a convex n-gon. Paul Erdös and George Szekeres established an upper bound for $f(n)$ via Ramsey Theory in 1935, a link that highlighted the importance and application of Ramsey Theory to extremal combinatorics and instantly made the problem a classic. It even has connections to computational biology, via it’s link to the Erdös-Szekeres monotone subsequence theorem (published in the same 1935 paper), which in turn is related to the Needleman-Wunsch algorithm.

The “happy ending” associated to the problem comes from the story of Esther Klein and George Szekeres, who fell in love while working on the problem and ended up marrying in 1937, a happy ending that lasted for 68 years (they died within an hour of each other in 2005).

In their 1935 paper, Erdös-Szekeres proved an upper bound for $f(n)$ using elementary combinatorics, namely

$f(n) \leq {2n -4 \choose n-2}+1 = 4^{n-o(n)}$.

The difficulty of improving the upper bound for $f(n)$ can be appreciated by noting that it took 63 years for there to be any improvement to the Erdös-Szekeres result, and when an improvement did arrive in 1998 it was to remove the “+1” (Fan Chung and Ron Graham, Discrete Geometry 1998) reducing the upper bound to $f(n) \leq {2n-4 \choose n-2}$ for $n \geq 4$.

Considering that the best lower bound for $f(n)$, conjectured to be optimal, is $f(n) \geq 2^{n-2}+1$ (proved by Erdös and Szekeres in 1960), the +1 improvement left a very long way to go. A few other improvements after the Chung-Graham paper was published reduced the upper bound some more, but at most by constant factors.

Then, earlier this year, Andrew Suk announced an astonishing improvement. Building on a theorem of Attila Pór and Pavel Valtr, he proved that

$f(n) = 2^{n+o(n)}$.

The result doesn’t exactly match the conjectured lower bound, so one cannot say the happy ending to the Happy Ending Problem arrived in 2016, but it’s so so so close (i.e. it solves the problem asymptotically) that one can only be left with optimism and hope for 2017.

Happy new year!

One of my favorite systems biology papers is the classic “Stochastic Gene Expression in a Single Cell” by Michael Elowitz, Arnold J. Levine, Eric D. Siggia and Peter S. Swain (in this post I refer to it as the ELSS paper).

What I particularly like about the paper is that it resulted from computational biology flipped. Unlike most genomics projects, where statistics and computation are servants of the data, in ELSS a statistical idea was implemented with biology, and the result was an elegant experiment that enabled a fundamental measurement to be made.

The fact the ELSS implemented a statistical idea with biology makes the statistics a natural starting point for understanding the paper. The statistical idea is what is known as the law of total variance. Given a random (response) variable $C$ with a random covariate  $Z$, the law of total variance states that the variance of $C$ can be decomposed as:

$Var[C] = E_Z[Var[C|Z]] + Var_Z[E[C|Z]]$.

There is a biological interpretation of this law that also serves to explain it: If the random variable $C$ denotes the expression of a gene in a single cell ($C$ being a random variable means that the expression is stochastic), and $Z$ denotes the (random) state of a cell, then the law of total variance describes the “total noise” $Var[C]$ in terms of what can be considered “intrinsic” (also called “unexplained”) and “extrinsic” (also called “explained”) noise or variance.

To understand intrinsic noise, first one understands the expression $Var[C|Z]$ to be the conditional variance, which is also a random variable; its possible values are the variance of the gene expression in different cell states. If $Var[C]$ does not depend on $Z$ then the expression of the gene is said to be homoscedastic, i.e., it does not depend on cell state (if it does then it is said to be heteroscedastic). Because $Var[C|Z]$ is a random variable, the expression $E_Z[Var[C|Z]$ makes sense, it is simply the average variance (of the gene expression in single cells) across cell states (weighted by their probability of occurrence), thus the term “intrinsic noise” to describe it.

The expression $E[C|Z]$ is a random variable whose possible values are the average  of the gene expression in cells. Thus, $Var_Z[E[C|Z]]$ is the variance of the averages; intuitively it can be understood to describe the noise arising from different cell state, thus the term “extrinsic noise” to describe it (see here for a useful interactive visualization for exploring the law of total variance).

The idea of ELSS was to design an experiment to measure the extent of intrinsic and extrinsic noise in gene expression by inserting two identically regulated reporter genes (cyan fluorescent protein and yellow fluorescent protein) into E. coli and measuring their expression in different cells. What this provides are measurements from the following model:

Random cell states are represented by random variables $Z_1,\ldots,Z_n$ which are independent and identically distributed, one for each of n different cells, while random variables $C_1,\ldots,C_n$  and $Y_1,\ldots,Y_n$ correspond to the gene expression of the cyan , respectively yellow, reporters in those cells. The ELSS experiment produces a single sample from each variable $C_i$ and $Y_i$, i.e. a pair of measurements for each cell. A hierarchical model for the experiment, in which the marginal (unconditional) distributions $C_i$ and $Y_i$ are identical, allows for estimating the intrinsic and extrinsic noise from the reporter expression measurements.

The model above, on which ELSS is based, was not described in the original paper (more on that later). Instead,  in ELSS the following estimates for intrinsic, extrinsic and total noise were simply written down:

$\eta^2_{int} = \frac{\frac{1}{n} \left( \sum_{i=1}^n \frac{1}{2} (c_i-y_i)^2\right) }{\overline{c} \cdot \overline{y}},$ (intrinsic noise)

$\eta^2_{ext} = \frac{\frac{1}{n}\sum_{i=1}^n c_i \cdot y_i - \overline{c} \cdot \overline{y}}{\overline{c} \cdot \overline{y}},$ (extrinsic noise)

$\eta^2_{tot} = \frac{\frac{1}{n}\sum_{i=1}^n \frac{1}{2} (c_i^2+y_i^2)-\overline{c}\cdot\overline{y}}{\overline{c} \cdot \overline{y}}.$ (total noise)

Here  $c_1,\ldots,c_n$ and $y_1,\ldots,y_n$ are the measurements of cyan respectively yellow reporter expression in each cell, $\overline{c} = \frac{1}{n}\sum_{i=1}^n c_i$ and $\overline{y} = \frac{1}{n}\sum_{i=1}^n y_i$.

Last year, Audrey Fu, at the time a visiting scholar in Jonathan Pritchard’s lab and now assistant professor in statistical science at the University of Idaho,  studied the ELSS paper as part of a journal club. She noticed some inconsistencies with the proposed estimates in the paper, e.g. it seemed to her that some were biased, whereas others were not, and she proceeded to investigate in more detail the statistical basis for the estimates. There had been a few papers trying to provide statistical background, motivation and interpretation for the formulas in ELSS (e.g. A. Hilfinger and J. Paulsson, Separating intrinsic from extrinsic fluctuations in dynamic biological systems, 2011 ), but there had not been an analysis of bias, or for that matter other statistical properties of the estimates. Audrey started to embark on a post-publication peer review of the paper, and having seen such reviews on my blog contacted me to ask whether I would be interested to work with her. The project has been a fun hobby of ours for the past couple of months, eventually leading to a manuscript that we just posted on the arXiv:

Audrey Fu and Lior Pachter, Estimating intrinsic and extrinsic noise from single-cell gene expression measurements, arXiv 2016.

Our work provides what I think of as a “statistical companion” to the ELSS paper. First, we describe a formal hierarchical model which provides a useful starting point for thinking about estimators for intrinsic and extrinsic noise. With the model we re-examine the ELSS formulas, derive alternative estimators that either minimize bias or mean squared error, and revisit the intuition that underlies the extraction of intrinsic and extrinsic noise from data. Details are in the paper, but I briefly review some of the highlights here:

Figure 3a of the ELSS paper shows a scatterplot of data from two experiments, and provides a geometric interpretation of intrinsic and extrinsic noise that can guide intuition about them. We have redrawn their figure (albeit with a handful of points rather than with real data) in order to clarify the connections to the statistics:

The Elowitz et al. caption correctly stated that “Each point represents the mean fluorescence intensities from one cell. Spread of points perpendicular to the diagonal line on which CFP and YFP intensities are equal corresponds to intrinsic noise, whereas spread parallel to this line is increased by extrinsic noise”. While both statements are true, the one about intrinsic noise is precise whereas the one about extrinsic noise can be refined. In fact, the ELSS extrinsic noise estimate is the sample covariance (albeit biased due to a prefix of n in the denominator rather than n-1), an observation made by Hilfinger and Paulsson. The sample covariance has a  (well-known) geometric interpretation: Specifically, we explain that it is the average (signed) area of triangles formed by pairs of data points (one the blue one in the figure): green triangles in Q1 and Q3 (some not shown) represent a positive contribution to the covariance and magenta triangles in Q2 and Q4 represent a negative contribution. Since most data points lie in the 1st (Q1) and 3rd (Q3) quadrants relative to the blue point, most of the contribution involving the blue point is positive. Similarly, since most pairs of data points can be connected by a positively signed line, their positive contribution will result in a positive covariance. We also explain why naïve intuition of extrinsic noise as the variance of points along the line $c=y$ is problematic.

The estimators we derive are summarized in the table below (Table 1 from our paper):

There is a bit of algebra that is required to derive formulas in the table (see the appendices of our paper). The take home messages are that:

1. There is a subtle assumption underlying the ELSS intrinsic noise estimator that makes sense for the experiments in the ELSS paper, but not for every type of experiment in which the ELSS formulas are currently used. This has to do with the mean expression level of the two reporters, and we provide a rationale and criteria when to apply quantile normalization to normalize expression to the data.
2. The ELSS intrinsic noise estimator is unbiased, whereas the ELSS extrinsic noise estimator is (slightly) biased. This asymmetry can be easily rectified with adjustments we derive.
3. The standard unbiased estimator for variance (obtained via the Bessel correction) is frequently, and correctly, criticized for trading off mean squared error for bias. In practice, it can be more important to minimize mean squared error (MSE). For this reason we derive MSE minimizing estimators. While the MSE minimizing estimates converge quickly to the unbiased estimates (as a function of the number of cells), we believe there may be applications of the law of total variance to problems in systems biology where sample sizes are smaller, in which case our formulas may become useful.

The ELSS paper has been cited more than 3,000 times and with the recent emergence of large scale single-cell RNA-Seq the ideas in the paper are more relevant than ever. Hopefully our statistical companion to the paper will be useful to those seeking to obtain a better understanding of the methods, or those who plan to extend and apply them.

The hierarchical classification of nature initiated by Carl Linnaeus today consists of eight major “ranks”, namely species, genus, family, order, class, phylum, kingdom and domain:

In the microbial world it makes sense to refine the standard taxonomy by subdividing species into strains. An important reason to do so is that bacterial taxonomy must reflect not only phylogeny but also pathogenicity, and small differences in genomes can translate to large pathogenic differences. This has implications for metagenomic analyses of microbial communities: for many biomedical applications it is desirable to characterize individuals strains.

Metagenomics has its roots in culture-independent retrieval and sequencing of 16S rRNA genes, and while variations in 16S can sometimes distinguish between strains, a single gene is not always sufficient. This limitation of 16S can be overcome with whole genome shotgun sequencing of microbial communities, an approach to metagenomics that became popular in the early 2000s and  that opened the door to higher resolution characterization of communities. In 2005 Kevin Chen and I wrote a review on the bioinformatics challenges that would have to be overcome to walk through the door. One of the things we did was to emphasize “problems and their connections to other areas of bioinformatics, such as… gene expression analysis”, and throughout the past decade I’ve always hoped for deeper connections to be established between metagenomics and gene expression bioinformatics. I’ve noticed interesting connections pop up from time to time (e.g. Paulson et al. 2013)  and have occasionally entertained the thought with my students and collaborators, especially as work in my group became more focused on RNA-Seq since the development of Cufflinks in 2008.

However connection modern transcriptome analysis methodology, specifically bioinformatics of RNA-Seq to metagenomics has been difficult to do until recently. One major reason is that until just a few years ago, there was no reference genome database for metagenomics analogous to the reference annotation databases available for use in transcriptomics. Another way to put this is that metagenomics has, until recently, been “de novo” bioinformatics. By this I mean that the analysis of communities from whole genome shotgun data had to largely proceed via de novo analyses of the data (e.g. de novo assembly of genomes), “binning” of reads according to sequence characteristics or hits to gene databases was required because it was impossible to compare sequences to references genomes. While de novo methods have also been developed for RNA-Seq, the scale of transcriptome analysis is much smaller than that of most metagenomic analyses, and as has been well documented, de novo transcriptomics is already very difficult (e.g. Amin et al. 2014).

The de novo state of metagenomics has changed in recent years, as (relatively) low-cost sequencing has been a boon for microbial genomics. The graph below, extracted from NCBI and published in a recent review, shows that in just the past few years thousands of bacterial genomes have been sequences, enabling, for the first time, reference based metagenomics:

This observation is reflected in the recent development of many methods for a variety of metagenomic applications that take advantage of reference genome databases.  Specifically, the problem of read assignment, which is fundamental for abundance estimation, has benefited from the possibility of metagenomic read alignment to reference databases.

The figure below, reproduced from the preprint “An evaluation of the accuracy and speed of metagenome analysis tools” by Stinus Lindgreen, Karen L. Adair and Paul Gardner, bioRxiv May 15, 2015 shows a benchmark of the accuracy and runtime of 14 programs developed for metagenomic read assignment for whole genome shotgun data:

The problem these methods are solving is really similar to the problem of read assignment in RNA-Seq. In RNA-Seq, instead of originating from strains, reads originate from transcripts. Just as strains are present in different abundances in a community, so are RNA transcripts in a cell (or in bulk). The analogy of taxonomy in metagenomics, i.e. the grouping of strains into species, genus etc. is also present in RNA-Seq, where transcripts are grouped into genes. The fragment (or read) assignment problem in RNA-Seq is closely related to the quantification problem in RNA-Seq and is a problem that has been thoroughly researched and for which many algorithms have been developed. I discussed the importance of the fragment assignment problem for RNA-Seq in my 2013 Genome Informatics Keynote.

In response to the development of reference-based bioinformatics possibilities for metagenomics, about three years ago my student Lorian Schaeffer started looking at the suitability of RNA-Seq tools for metagenomic read assignment. Although the metagenomic and RNA-Seq assignment problems are conceptually similar and methodologically related, there are various technical issues involved in applying RNA-Seq tools in the metagenomic setting (e.g. the need to carefully account for taxonomy in the metagenomics setting). After developing the computational infrastructure to benchmark RNA-Seq programs in the metagenomic setting, she proceeded to evaluate the accuracy of eXpress, a streaming algorithm for RNA-Seq quantification. Although the quantification of eXpress was specifically designed to be suitable for large numbers of reads, the program requires read alignments to a reference transcriptome (or in Lorian’s experiments a genome) database. In the metagenomic setting realistic databases are huge, and she found that it took days just to map the reads. Nevertheless, her initial benchmarks revealed that eXpress was significantly more accurate than the available metagenomic read assignment tools of the time.

When Kraken (Wood and Salzberg 2014), and later CLARK (Ounit et al. 2015) were published in 2014 and 2015 respectively, we took note because by circumventing the alignment step they dramatically altered the tractability of metagenomic read assignment. In parallel, in my group, Nicolas Bray and later Páll Melsted and Harold Pimentel were developing what is now kallisto (Bray et al. 2015). Like Kraken, kallisto avoided the need for aligning reads, but with the introduction of the concept of pseudoalignment, allowed for accurate read assignments based on joint analysis of exact k-mer matches. What we showed earlier this year is that unlike naïve k-mer based approaches to quantification, kallisto is as accurate as eXpress and other read alignment based quantification tools, and this observation led Lorian to immediately proceed to benchmark it on metagenomic data. The result of her work was just posted as a preprint:

Lorian Schaeffer, Harold Pimentel, Nicolas Bray, Páll Melsted and Lior Pachter, Pseudoalignment for metagenomic read assignment, arXiv 1510.07371, 2015.

With this paper we demonstrate a “technology transfer” from RNA-Seq bioinformatics to metagenomics, one that achieves dramatic improvements in read assignment accuracy in the metagenomics setting. The main result of her work is Table 1 in our preprint:

Using a published simulated Illumina dataset from Mende et al. 2012 (based on 100 genomes and containing 53.33 million reads), and augmenting it with another 2,308 genomes for the purpose of testing, she shows that kallisto significantly outperforms the best quantification methods (as benchmarked by Lindgreen et al., see figure above). “Significant” here refers to what I think is fair to characterize as an extraordinary improvement: at the genus level, a level that programs such as CLARK have been optimized for, kallisto’s RRMSE (relative root mean squared error)  is 0.13 compared to 17.05 for Kraken and 18.58 for CLARK. The improvement is based on two ideas: first, the results show that the model based approach for read assignment, the concept that underlies GASiC and eXpress, outperforms direct taxonomic read assignment as implemented by MEGAN and Kraken and CLARK (in the latter approach reads are aligned to the lowest rank to which they align unambigously). Second, pseudoalignment is not just faster than traditional alignment but also accurate.

The upshot: the accuracy and efficiency of kallisto make strain level analysis of metagenomes possible. In fact kallisto is more accurate at the strain level than other programs are at the genus level. Just as we have been advocating for transcript level analysis from RNA-Seq data, we believe that strain level analysis should become commonplace in metagenomics.

In digging deeply into the bioinformatics of metagenomics bioinformatics we noticed a few other areas that could benefit from RNA-Seq technology transfer. For example, the standard of RNA-Seq methods benchmarking appears to be higher than in metagenomics. Both the Kraken and CLARK papers benchmarked their programs on simulations with 10 genomes (the number ten is not a typo). CLARK did test on one dataset with 20 genomes, although using only 10,000 reads. To be fair to the authors of those papers, their standards were much higher than others in the field. The paper

Yu-Wei Wu and Yuzhen Ye, A novel abundance-based algorithm for binning metagenomic sequences using l-tuples, Journal of Computational Biology 2011.

benchmarked their method on simulations of reads from 2 (two!!) organisms. Biologists frequently complain that simulations of bioinformaticians are completely non-informative and unfortunately these cases provide fodder for such prejudice. Having said that, the RNA-Seq community also has much to learn from the metagenomics community. The previously mentioned paper by Paulson et al. 2013 addresses missing data in a way that should translate directly to missing data in single-cell RNA-Seq (the paper also makes performance comparisons with their comparative metagenomics approach to the RNA-Seq programs DESeq and edgeR) . One paper (McDavid et al. 2012) does take a look at modeling single-cell data with zero inflated distributions but I think this is a good example where metagenomics is ahead of RNA-Seq. Our immediate plans are to develop the kallisto application to metagenomics to include the ability to perform metagenome comparisons using sleuth. Conversely, inspired by the taxonomy hierarchy fundamental to metagenomics we’re going to explore RNA-Seq quantification with groups of transcripts that go beyond just genes.

Horizontal transfer is good.

[Update July 15, 2016: A preprint describing sleuth is available on BioRxiv]

Today my student Harold Pimentel released the beta version of his new RNA-Seq analysis method and software program called sleuth. A sleuth for RNA-Seq begins with the quantification of samples with kallisto, and together a sleuth of kallistos can be used to analyze RNA-Seq data rigorously and rapidly.

Why do we need another RNA-Seq program?

A major challenge in transcriptome analysis is to determine the transcripts that have changed in their abundance across conditions.  This challenge is not entirely trivial because the stochasticity in transcription both within and between cells (biological variation), and the randomness in the experiment (RNA-Seq) that is used to determine transcript abundances (technical variation), make it difficult to determine what constitutes “significant” change.

Technical variation can be assessed by performing technical replicates of experiments. In the case of RNA-Seq, this can be done by repeatedly sequencing from one cDNA library. Such replicates are fundamentally distinct from biological replicates designed to assess biological variation. Biological replicates are performed by sequencing different cDNA libraries that have been constructed from repeated biological experiments performed under the same (or in practice near-same) conditions. Because biological replicates require sequencing different cDNA libraries, a key point is that biological replicates include technical variation.

In the early days of RNA-Seq a few papers (e.g. Marioni et al. 2008, Bullard et al. 2010) described analyses of technical replicates and concluded that they were not really needed in practice, because technical variation could be predicted statistically from the properties of the Poisson distribution. The point is that in an idealized RNA-Seq experiment counts of reads are multinomial (according to abundances of the transcripts they originate from), and therefore approximately Poisson distributed. Their variance is therefore approximately equal to the mean, so that it is possible to predict the variance in counts across technical replicates based on the abundance of the transcripts they originate from. There is, however, one important subtlety here: “counts of reads” as used above refers to the number of reads originating from a transcript, but in many cases, especially in higher eukaryotes, reads are frequently ambiguous as to their transcript of origin because of the presence of multi-isoform genes and genes families. In other words, transcript counts cannot be precisely measured. However, the statement about the Poisson distribution of counts in technical replicates remain true when considering counts of reads by genomic features because then reads are no longer ambiguous.

This is why, in so-called “count-based methods” for RNA-Seq analysis, there is an analysis only at the gene level. Programs such as DESeq/DESeq2, edgeR and a literally dozens of other count-based methods first require counting reads across genome features using tools such as HTSeq or featureCounts. By utilizing read counts to genomic features, technical replicates are unnecessary in lieu of the statistical assumption that they would reveal Poisson distributed data, and instead the methods focus on modeling biological variation. The issue of how to model biological variation is non-trivial because typically very few biological replicates are performed in experiments. Thus, there is a need for pooling information across genes to obtain reliable variance estimates via a statistical process called shrinkage. How and what to shrink is a matter of extensive debate among statisticians engaged in the development of count-based RNA-Seq methods, but one theme that has emerged is that shrinkage approaches can be compatible with general and generalized linear models, thus allowing for the analysis of complex experimental designs.

Despite these accomplishments,  count-based methods for RNA-Seq have two major (related) drawbacks: first, the use of counts to gene features prevents inference about the transcription of isoforms, and therefore with most count-based methods it is impossible to identify splicing switches and other isoform changes between conditions. Some methods have tried to address this issue by restricting genomic features to specific exons or splice junctions (e.g. DEXSeq) but this requires throwing out a lot of data, thereby reducing power for identifying statistically significant differences between conditions. Second, because of the fact that in general $\frac{a}{b} + \frac{c}{d} \neq \frac{a+b}{c+d}$ it is mathematically incorrect to estimate gene abundances by adding up counts to their genomic region. One consequence of this, is that it is not possible to accurately measure fold change between conditions by using counts to gene features. In other words, count-based methods are problematic even at the gene-level and it is necessary to estimate transcript-level counts.

While reads might be ambiguous as to exactly which transcripts they originated from, it is possible to statistically infer an estimate of the number of reads from each transcript in an experiment. This kind of quantification has its origin in papers of Jiang and Wong, 2009 and Trapnell et al. 2010. However the process of estimating transcript-level counts introduces technical variation. That is to say, if multiple technical replicates were performed on a cDNA library and then transcript-level counts were to be inferred, those inferred counts would no longer be Poisson distributed. Thus, there appears to be a need for performing technical replicates after all. Furthermore, it becomes unclear how to work within the shrinkage frameworks of count-based methods.

There have been a handful of attempts to develop methods that combine the uncertainty of count estimates at the transcript level with biological variation in the assessment of statistically significant changes in transcript abundances between conditions. For example, the Cuffdiff2 method generalizes DESeq while the bitSeq method relies on a Bayesian framework to simultaneously quantify abundances at the transcript level while modeling biological variability. Despite showing improved performance over count-based methods, they also have significant shortcomings. For example the methods are not as flexible as those of general(ized) linear models, and bitSeq is slow partly because it requires MCMC sampling.

Thus, despite intensive research on both statistical and computational methods for RNA-Seq over the past years, there has been no solution for analysis of experiments that allows biologists to take full advantage of the power and resolution of RNA-Seq.

The sleuth model

The main contribution of sleuth is an intuitive yet powerful model for RNA-Seq that bridges the gap between count-based methods and quantification algorithms in a way that fully exploits the advantages of both.

$Y_t = X_t\beta_t + \epsilon_t$.

Here the subscript t refers to a specific transcript, $Y_t$ is a vector describing transcript abundances (of length equal to the number of samples), $X_t$ is a design matrix (of size number of samples x number of confounders), $\beta_t$ is a parameter vector (of size the number of confounders) and $\epsilon_t$ is a noise vector (of size the number of samples). In this model the abundances $Y_t$ are normally distributed. For the purposes of RNA-Seq data, the $Y_t$ may be assumed to be the logarithm of the counts (or normalized counts per million) from a transcript, and indeed this is the approach taken in a number of approaches to RNA-Seq modeling, e.g. in limma-voom. A common alternative to the general linear model is the generalized linear model, which postulates that some function of $Y_t$ has a distribution with mean equal to $g^{-1}(X_t \beta_t)$ where g is a link function, such as log, thereby allowing for distributions other than the normal to be used for the observed data. In the RNA-Seq context, where the negative binomial distribution may make sense because it is frequently a good distribution for modeling count data, the generalized model is sometimes preferred to the standard general model (e.g. by DESeq2). There is much debate about which approach is “better”.

In the sleuth model the $Y_t$ in the general linear model are modeled as unobserved. They can be thought of us the unobserved logarithms of true counts for each transcript across samples and are assumed to be normally distributed. The observed data $D_t$ is the logarithm of estimated counts for each transcript across samples, and is modeled as

$D_t = Y_t + \zeta_t$

where the $\zeta_t$ vector parameterizes a perturbation to the unobserved $Y_t$. This can be understood as the technical noise due to the random sequencing of fragments from a cDNA library and the uncertainty introduced in estimating transcript counts.

The sleuth model incorporates the assumptions that the response error is additive, i.e. if  the variance of transcript in sample is $V(D_{t,i})$ then $V(D_{t,i}) = \sigma^2_t + \tau^2_t$ where the variance $V(\epsilon_{t,i}|y_{t,i}) = \sigma^2_t$ and the variance $V(\zeta_{t,i}|y_{t,i}) = \tau^2_t$. Intuitively, sleuth teases apart the two sources of variance by examining both technical and biological replicates, and in doing so directly estimates “true” biological variance, i.e. the variance in biological replicates that is not technical.  In lieu of actual technical replicates, sleuth takes advantage of the bootstraps of kallisto which serve as accurate proxies.

In a test of sleuth on data simulated according to the DESeq2 model we found that sleuth significantly outperforms other methods:

In this simulation transcript counts were simulated according to a negative binomial distribution, following closely the protocol of the DESeq2 paper simulations. Reference parameters for the simulation were first estimated by running DESeq2 on a the female Finnish population from the GEUVADIS dataset (59 individuals). In the simulation above size factors were set to be equal in accordance with typical experiments being performed, but we also tested sleuth with size factors drawn at random with geometric mean of 1 in accordance with the DESeq2 protocol (yielding factors of 1, 0.33, 3, 3, 0.33 and 1) and sleuth still outperformed other methods.

There are many details in the implementation of sleuth that are crucial to its performance, e.g. the approach to shrinkage to estimate the biological variance $\sigma^2_t$. A forthcoming preprint, together with Nicolas Bray and Páll Melsted that also contributed to the project along with myself, will provide the details.

Exploratory data analysis with sleuth

One of the design goals of sleuth was to create a simple and efficient workflow in line with the principles of kallisto. Working with the Shiny web application framework we have designed an html interface that allows users to interact with sleuth plots allowing for real time exploratory data analysis.

The sleuth Shiny interface is much more than just a GUI for making plots of kallisto processed data. First, it allows for the exploration of the sleuth fitted models; users can explore the technical variation of each transcript, see where statistically significant differential transcripts appear in relationship to others in terms of abundance and variance and much more. Particularly useful are interactive features in the plots. For example, when examining an MA plot, users can highlight a region of points (dynamically created box in upper panel) and see their variance breakdown of the transcripts the points correspond to, and the list of the transcripts in a table below:

The web interface contains diagnostics, summaries of the data, “maps” showing low-dimensional representations of the data and tools for analysis of differential transcripts. The interactivity via Shiny can be especially useful for diagnostics; for example, in the diagnostics users can examine scatterplots of any two samples, and then select outliers to examine their variance, including the breakdown of technical variance. This allows for a determination of whether outliers represent high variance transcripts, or specific samples gone awry. Users can of course make figures showing transcript abundances in all samples, including boxplots displaying the extent of technical variation. Interested in the differential transcribed isoform ENST00000349155 of the TBX3 gene shown in Figure 5d of the Cuffdiff2 paper? It’s trivial to examine using the transcript viewer:

One can immediately see visually that differences between conditions completely dominate both the technical and biological variation within conditions. The sleuth q-value for this transcript is 3*10^(-68).

Among the maps, users can examine PCA projections onto any pair of components, allowing for rapid exploration of the structure of the data. Thus, with kallisto and sleuth raw RNA-Seq reads can be converted into a complete analysis in a matter of minutes. Experts will be able to generate plots and analyses in R using the sleuth library as they would with any R package. We plan numerous improvements and developments to the sleuth interface in the near future that will further facilitate data exploration; in the meantime we welcome feedback from users.

How to try out sleuth

Since sleuth requires the bootstraps and quantifications output by kallisto we recommend starting by running kallisto on your samples. The kallisto program is very fast, processing 30 million reads on a laptop in a matter of minutes. You will have to run kallisto with bootstraps- we have been using 100 bootstraps per sample but it should be possible to work with many fewer. We have yet to fully investigate the minimum number of bootstraps required for sleuth to be accurate.

To learn how to use kallisto start here. If you have already run kallisto you can proceed to the tutorial for sleuth. If you’re really eager to see sleuth without first learning kallisto, you can skip ahead and try it out using pre-computed kallisto runs of the Cuffdiff2 data- the tutorial explains where to obtain the data for trying out sleuth.

For questions, suggestions or help see the program websites and also the kallisto-sleuth user group. We hope you enjoy the tools!

On the 27th March, 2012 the Foreign Currency Department of the Central Bank of Iceland raided the offices of Samherji hf., a fishing company in Iceland. The Office of the Special Prosecutor was concerned that Sammherji hf. might be shifting profits in Iceland to its overseas subsidiaries by underselling its fish. In doing so it would be recording losses in Iceland while sheltering profits abroad thereby violating currency exchange (and possibly other) laws. The raid was immediately addressed by the company in a press release;  for a complementary perspective see the article by journalist Ingi Freyr Vilhjálmsson

Two months later, in a sharply worded rebuttal, the company argued that the reported discrepancies in prices between Samherji hf. and the market were flawed. Specifically, they argued, the Central Bank had incorrectly averaged numbers (Google English translation) resulting in a Simpson’s paradox! Was this a valid argument? Truly an example of Simpson’s paradox? Or a sleight of hand?

Simpson’s paradox (better referred to as Simpson’s reversal) is a mathematical inequality that can arise in the comparison of averages with raw data. It is known as a “paradox” because the inequality can appear to be unintuitive (to those who have poor statistical intuition, i.e. most of us). What makes it particularly perilous is that it can lurk disguised in the most unexpected places. Not only (allegedly) in the Central Bank of Iceland, but also, it turns out, in many comparative genome analyses…

Original sin

The mouse genome was published in December 2002, and in the publication was the following figure (Figure 25 from the mouse genome paper):

In the review Sixty years of genome biology by Doolittle et al. (2013) highlighting key advances in genome biology since the publication of the structure of the double helix in 1953, Chris Ponting singles out panel (a) as being of historical significance describing it as “a wonderful visual guide to the most important features of mammalian genes” and explaining that “collapsing levels of sequence conservation between thousands of mouse and human orthologs into one metagene, .. showed how, from a common sequence over 90 million years ago, mutation has etched away intronic sequence whilst selection has greatly preserved the exons, particularly toward their boundaries”. He is right, of course, in that the figure demonstrated, for the first time, the power of “genome-wide” comparative molecular biology and led to concerted efforts to characterize functional regions in the genome by aggregation of genome-wide data. Figures analogous to the mouse genome paper 25a are now a fixture in many genomics papers, with % sequence identity being replaced by data such as % methylated, % accessible, etc. etc.

In the recent paper

M. Singer and L. Pachter, Controlling for conservation in genome-wide DNA methylation studiesBMC Genomics, 16:240

my former Ph.D. student Meromit Singer (now a postdoc in the Regev lab at the Broad Institute) and I explain why in some cases the type of “meta-analysis” underlying Figure 25a, now a fixture in many genomics papers, can be misleading. We focused on DNA methylation studies, which I will explain, but first some elementary algebra…

According to the English Oxford Dictionary, one of the definitions for the word paradox is “A statement or proposition which on the face of it seems self-contradictory, absurd, or at variance with common sense, though, on investigation or when explained, it may prove to be well-founded”, and it is this sense of the word that is relevant in what has become known as “Simpson’s paradox”. The “paradox”, is just the following fact:

Given numbers $p_1,p_2,q_1,q_2,r_1,r_2,s_1,s_2$, it may happen that

$\frac{p_1}{q_1}>\frac{p_2}{q_2}$ and $\frac{r_1}{s_1} > \frac{r_2}{s_2}$ but $\frac{p_1+r_1}{q_1+s_1} < \frac{p_2+r_2}{q_2+s_2}$.

The geometry is that in the figure below (from wikipedia), even though each of the red vectors has a slope greater than its corresponding blue vector, the sum of the red vectors has a slope less than the sum of the blue vectors:

This reversal, nothing more than a simple algebraic curiosity, can sometimes underlie qualitative changes in the results of data analysis after averaging. This is because ratios or rates can be recast as slopes of vectors while averages correspond to sums of vectors. A famous example is described in the classic paper on sex bias in graduate admission at UC Berkeley, Bickel et al. discuss the issue of pooling data, and how it can affect conclusions about bias. The main point of the paper, namely that Simpson’s paradox can emerge in analyzing admissions data, is illustrated in the following toy example (Figure S1b) from our paper, which was constructed to facilitate understanding the effect in the context of genomics:

The figure shows (hypothetical) applications and admissions results for eight males and eight females at twelve departments in a university. A “0” indicates an applicant applied and was rejected, a “1” that the applicant applied and was accepted, and a grey box indicates that the applicant did not apply. In all departments the admissions rates are the same for males and females (0% for both males and females in departments 1–4 and 100% for both males and females in departments 5–12, best understood as the ratio, in each row, of the sum of the numbers divided by the number of non-grey squares). Yet the acceptance rate for all males is 66.7% vs. 33.3% for females (for each column, sum of the numbers divided by the number of non-grey squares as displayed in the plot underneath the table). This is a strange effect at first glance… somehow grouping males and females together leads to an appearance of bias, whereas each department is admitting students without regard to gender (the paper by Bickel et al. discusses real data from UC Berkeley admissions in 1973 where this effect resulted in statistically unjustified claims of bias).

Looking at the table it is clear that the reason for the discrepancy between departmental admission rates and overall admissions rates is due to the extra grey boxes in the female table. In this example (and in the 1973 UC Berkeley admissions) females applied to departments that are harder to get in to. In the example above the male and female departmental admissions rates were equal, but it is not difficult to construct examples (again, they occurred in the 1973 UC Berkeley data) were a reversal happens, i.e. admissions rates at the departmental level are opposite to those when aggregated. This is what is known as “Simpson’s paradox”.

That the issue is relevant in the genomics meta-analysis context can be seen by replacing “males” and “females” with by genomic features, e.g. “introns” and “exons”, and “departments” with different genomic instances of the features. For example, in the case of Figure 25a from the mouse genome paper, “departments” correspond to different instances of introns/exons in the genome, and “0”s to “not conserved” and “1”s to “conserved”. Interestingly, in the genomics meta-analysis context it is the plot (column averages in the table above) that is always the displayed “result”, and this makes the “Simpson’s effect” more subtle to detect or to identify than in standard settings where the table itself is revealed.

What is the equivalent of “females apply to departments that are harder to get into”? Consider the human-mouse plot in Figure 25a of the mouse genome paper. In some genes some bases are not conserved between human and mouse, and this is far more likely to happen in introns than exons. In fact, this is shown in Figure 25a (light blue curve labeled “aligning”). Hence the grey boxes of missing data, reflecting the fact that introns are less conserved than exons, and therefore in alignments they have not only more mismatches, but also indels.

In our paper, we considered the case of DNA methylation measurements, where “0” referred to unmethylated and “1” to methylated. Since DNA methylation can only be measured at CpGs, there is again, the issue of missing data. And as it turns out, the missing data is associated with feature type, so that genome-wide averaging can be a huge problem.

DNA Methylation

In the course of working on her Ph.D. on Statistical algorithms in the study of DNA methylation, Meromit worked on identifying and understanding DNA methylation inside genes, and in that context was interested in Figure 4 from the paper Dynamic changes in the human methylome during differentiation by L. Laurent et al., Genome Research, 20 (2010), p 320–331. The figure describes “average distribution of DNA methylation mapped onto a gene model”, and is reproduced below:

Figure 4 from L. Laurent et al., Genome Research 2010.

Figure 4c plays a major role in the paper, supporting the following sentence from the abstract: “Exons were more highly methylated than introns, and sharp transitions of methylation occurred at exon–intron boundaries, suggesting a role for differential methylation in transcript splicing.” Throughout the paper, there is extensive discussion of “sharp steps”, not only at exon-intron boundaries but also transcription start sites (TSS) and transcription termination sites (TTS), with speculation about the regulation such methylation differences might reflect and contribute to. Of particular note is the following paragraph about Figure 4c:

A downward gradient was seen going across exons from 5′ to 3′, while an upward gradient of DNA methylation was seen traveling from 5′ to 3′ across introns (Fig. 4C). While this is the first report on the splice junction methylation spikes, recent reports show that the intron–exon boundaries also appear to be marked by gradients in chromatin features, including nucleosomes (Schwartz et al. 2009) and the H3K36me3 histone mark (Kolasinska-Zwierz et al. 2009). Taken together, our data suggest that coupling of transcription and splicing may be regulated by DNA methylation as well as by other epigenetic marks.”

Meromit focused on the italicized sentence (emphasis mine). Looking at the figure, she noticed that the 5′ ends of exons seemed to be much more highly methylated than the 3′ ends. This, and the discussion in the paper, led her to think that indeed DNA methylation may be functional within transcripts. But given the sparsity of CpGs (for one thing, it turned out that the spikes at the intron-exon boundaries were due to an almost complete lack of CpGs there), and noting that CpGs must be present for DNA methylation to be measured, she decided to examine whether there was in fact stronger conservation of CpGs at the 5′ end of exons. However, when reanalyzing the data, she discovered that the reason for the reported difference in methylation was that because intragenic junctions were examined, the 3′ ends of exons were on average closer to the promoter than the 5′ ends, and therefore the difference in methylation was simply reflecting the fact that many first exons are unmethylated due to “leakage” from the promoter. This, in turn, got her thinking about the interplay between DNA methylation, conservation of sequence, and bias that may occur during the averaging of data.

One of my favorite figures from our paper is our Figure 3, which illustrates what can and does go wrong in naïve genome-wide averaging of methylation data at functional boundaries:

The leftmost column shows the result of naïve plotting of column averages for %methylated CpGs in simulations where the methylation states from real data was shuffled randomly across features (while maintaining the locations of missing data due to absent CpGs). The “sharp steps” reported in Laurent et al. are evident despite the fact that there is no actual difference in methylation. The next column (second from left) is the naïve plot generated from real data (4852 5′ UTR-coding junctions, 20,784 mid-gene intron-exon junctions and 21,408 CpG island junctions from genome-wide bisulfite data generated by Lister et al., 2008). Evidently the “sharp steps” look very similar to the simulated scenario where there is no actual methylation difference across the boundaries. The “Corrected via COMPARE” column shows what happens when the bias is removed via a correction procedure that I describe next. In the case of UTR-coding junctions, the “sharp step” disappears completely (this is confirmed by the analysis shown in the final column, which consists of simply throwing away rows in the data table where there is missing data, thereby removing the possible Simpson’s effect but at the cost of discarding data). Our analysis also shows that although there is some differential methylation (on average) at intron-exon boundaries, the effect is smaller than appears at first glance.

Without going into too much detail, it is worth noting that the underlying problem in the methylation case is that deamination leads to methylated CpGs converting to TpGs, so that this is a correlation between methylation and missing data. This is analogous to the problem in conservation analysis, where missing data (due to indels) is correlated with less conservation (i.e. more mismatches). Our paper delves into the details.

Although the analysis in Laurent et al. is problematic, the intention of this post is not to single out the paper; it just happens to be the publication that led us to the discovery of Simpson’s effect in the context of genome-wide averaging. In fact, in the preparation of our paper, we identified 40 published articles where naïve averaging of genome-wide methylation data was displayed in figures (the same issue undoubtedly affects other types of genome analysis as well, e.g. conservation analysis as in Figure 25a from the mouse paper, and we make this point in our paper but do not explore it in detail). In some cases the interpretation of the figures is central to the claims of the paper, whereas in other cases the figures are secondary to the main results. Also sometimes correction of Simpson’s effect may lead to a (small) quantitative rather than a qualitative change in interpretation; we have not examined each of the 40 cases in our table in detail (although are happy to send the table upon request). What we recommend is that future studies implement a correction when averaging genome-wide data across features. To assist in this regard, Meromit developed software called COMPARE:

Imputation of missing data

There are basically three ways to get around the problem discussed above:

1. Don’t average the data in your tables.
2. Discard rows with missing entries.
3. Impute the missing values.

The problem with #1 is that summaries of data in the form of plots like Figure 25a from the mouse genome paper are a lot more fun to look at than tables with thousands of rows. The workflows for #2 and #3 above are shown in our Figure 4:

However the problem with #2 is that at least in the case of DNA methylation, a lot of data must be discarded (this is evident in the large boxes in the final column of our Figure 3; see above). Imputation is interesting to consider. It is essentially the statistical problem of guessing what the methylation state in a specific position of a DNA sequence would be if there was a CpG at the site. The idea is to perform matrix completion (in a way analogous to the methods used for the Netflix prize); details are in our paper and implemented in software called COMPARE. It turns out that imputation of epigenetic state is possible (and can be surprisingly accurate):

The plots show analysis with COMPARE, specifically ten-fold cross-validations for 5’UTR, coding, intronic and exonic regions, at the corresponding junctions analyzed in Figure 3. Smoothing was used to display the large number of data points. In each plot n is the number of data points in the matrix and the regression line is shown in dark blue where s is its slope, and v is the additional amount of variance in the data explained by the regression line relative to a random line. Our results are consistent with those recently published in Ernst and Kellis, 2015, where the tractability and application of imputation of epigenetic states is discussed.

With COMPARE, it is possible to generate corrected plots (see column 3 in our Figure 3 above) that provide a clearer quantitative picture of the data. A reanalysis of intron-exon junctions with COMPARE is interesting. As shown in Figure 3 there is a substantially reduced effect in human (approximately a factor of two) in comparison to naïve averaging. What is interesting is that in Arabidopsis correction with COMPARE hardly changes the result at all (Figure S6):

The upshot is that after correction in both human and Arabidopsis, a correction that affects human but not Arabidopsis, the differences in DNA methylation at intron-exon boundaries are revealed to be the same.

Despite the obvious importance of imputation in this context, we discovered in the course of trying to publish our paper that some biologists are apparently extremely uncomfortable with the idea (even though SNP imputation has become standard for GWAS). Moreover, it seemed that a number of editors and reviewers felt that the reporting and characterization of abundant bias affecting a large number of publications was simply not of much interest. The publication process of our paper was one of the hardest of my career; we went through several journals and multiple review stages, including getting rejaccted (not a typo!). Rejacction is the simultaneous rejection and acceptance of a paper. To wit, an editor wrote to say that “unfortunately, we feel that as it stands the manuscript…cannot [be] consider[ed].. in its current form [but] we have taken the liberty of consulting the editors of [another journal], and they have informed us that if you decided to submit your manuscript to [them], they would only ask for a quick overview.” In other words, rejected, but accepted at our sister journal if you like.  We were rejaccted twice (!) bringing to mind a (paraphrasing) of Oscar Wilde: to be rejaccted once may be regarded as misfortune; twice looks like carelessness. At the end, we were delighted to publish our paper in the open access journal BMC Genomics, and hope that people will find it interesting and useful.

On fishing

Returning to the claims of Samherji hf. about the raid, there is a point to discussing Simpson’s effect in the context of comparing fish prices. The argument being made, is that the Central Bank of Iceland averaged across different weight classes of fish in computing the ratios “price per kilogram” (true) and that it would be better to instead compute weighted averages according to the weight of the fish sold (unclear). Specifically, while Simpson’s reversal can occur in the setting Samherji hf. is referring to, it is not at all clear from the data presented that this is what happened (I would check, but it would require knowledge of prices both of Samherji hf. and of competitors, and also tables that are not in pdf as in the attachments provided). Instead, all that Samherji hf. is saying is that they prefer a weighted averaging in computing price per kilogram, a procedure that, a priori, is by no means clearly “better” than naïve averaging for issues the Central Bank was investigating. After all, just because small fish might sell for different prices than large fish, it is not clear that such fish should be discounted in computing average prices per kilogram. Moreover, it is not at all apparent why Simpson’s paradox is relevant if the matter under consideration is whether or not to weight results when averaging. This is all to say that Simpson’s paradox can not only be subtle to detect, but also subtle to interpret if and when it does occur. One of the notorious abuses of Simpson’s paradox is its use by Republicans in claiming that more of them voted for the 1964 civil rights act than Democrats. It turns out that while this is true, the opposite is true as well.

And so it can be with genomics data. Be careful when you fish.

Disclosures: I was an author on the mouse genome paper. I am employed by UC Berkeley. I am the brother-in-law of the journalist mentioned in the opening paragraph. I eat fish.

Today I posted the preprint N. Bray, H. Pimentel, P. Melsted and L. Pachter, Near-optimal RNA-Seq quantification with kallisto to the arXiv. It describes the RNA-Seq quantification program kallisto. [Update April 5, 2016: a revised version of the preprint has been published: Nicolas L. Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-optimal probabilistic RNA-Seq quantification, Nature Biotechnology (2016), doi:10.1038/nbt.3519 published online April 4, 2016.]

The project began in August 2013 when I wrote my second blog post, about another arXiv preprint describing a program for RNA-Seq quantification called Sailfish (now a published paper). At the time, a few students and postdocs in my group read the paper and then discussed it in our weekly journal club. It advocated a philosophy of “lightweight algorithms, which make frugal use of data, respect constant factors and effectively use concurrent hardware by working with small units of data where possible”. Indeed, two themes emerged in the journal club discussion:

1. Sailfish was much faster than other methods by virtue of being simpler.

2. The simplicity was to replace approximate alignment of reads with exact alignment of k-mers. When reads are shredded into their constituent k-mer “mini-reads”, the difficult read -> reference alignment problem in the presence of errors becomes an exact matching problem efficiently solvable with a hash table.

We felt that the shredding of reads must lead to reduced accuracy, and we quickly checked and found that to be the case. In fact, in our simulations, we saw that Sailfish significantly underperformed methods such as RSEM. However the fact that simpler was so much faster led us to wonder whether the prevailing wisdom of seeking to improve RNA-Seq analysis by looking at increasingly complex models was ill-founded. Perhaps simpler could be not only fast, but also accurate, or at least close enough to best-in-class for practical purposes.

After thinking about the problem carefully, my (now former) student Nicolas Bray realized that the key is to abandon the idea that alignments are necessary for RNA-Seq quantification. Even Sailfish makes use of alignments (of k-mers rather than reads, but alignments nonetheless). In fact, thinking about all the tools available, Nick realized that every RNA-Seq analysis program was being developed in the context of a “pipeline” of first aligning reads or parts of them to a reference genome or transcriptome. Nick had the insight to ask: what can be gained if we let go of that paradigm?

By April 2014 we had formalized the notion of “pseudoalignment” and Nick had written, in Python, a prototype of a pseudoaligner. He called the program kallisto. The basic idea was to determine, for each read, not where in each transcript it aligns, but rather which transcripts it is compatible with. That is asking for a lot less, and as it turns out, pseudoalignment can be much faster than alignment. At the same time, the information in pseudoalignments is enough to quantify abundances using a simple model for RNA-Seq, a point made in the isoEM paper, and an idea that Sailfish made use of as well.

Just how fast is pseudoalignment? In January of this year Páll Melsted from the University of Iceland came to visit my group for a semester sabbatical. Páll had experience in exactly the kinds of computer science we needed to optimize kallisto; he has written about efficient k-mer counting using the bloom filter and de Bruijn graph construction. He translated the Python kallisto to C++, incorporating numerous clever optimizations and a few new ideas along the way. His work was done in collaboration with my student Harold Pimentel, Nick (now a postdoc with Jacob Corn and Jennifer Doudna at the Innovative Genomics Initiative) and myself.

The screenshot below shows kallisto being used on my 2012 iMac desktop to build an index of the human transcriptome (5 min 8 sec), and then quantify 78.6 million GEUVADIS human RNA-Seq reads (14 min). When we first saw these results we thought they were simply too good to be true. Let me repeat: The quantification of 78.6 million reads takes 14 minutes on a standard desktop using a single CPU core. In some tests we’ve gotten even faster running times, up to 15 million reads quantified per minute.

The results in our paper indicate that kallisto is not just fast, but also very accurate. This is not surprising: underlying RNA-Seq analysis are the alignments, and although kallisto is pseudoaligning instead, it is almost always only the compatibility information that is used in actual applications. As we show in our paper, from the point of view of compatibility, the pseudoalignments and alignments are almost the same.

Although accuracy is a primary concern with analysis, we realized in the course of working on kallisto that speed is also paramount, and not just as a  matter of convenience. The speed of kallisto has three major implications:

1. It allows for efficient bootstrapping. All that is required for the bootstrap are reruns of the EM algorithm, and those are particularly fast within kallisto. The result is that we can accurately estimate the uncertainty in abundance estimates. One of my favorite figures from our paper, made by Harold, is this one:

It is based on an analysis of 40 samples of 30 million reads subsampled from 275 million rat RNA-Seq reads. Each dot corresponds to a transcript and is colored by its abundance. The x-axis shows the variance estimated from kallisto bootstraps on a single subsample while the y-axis shows the variance computed from the different subsamples of the data. We see that the bootstrap recapitulates the empirical variance. This result is non-trivial: the standard dogma, that the technical variance in RNA-Seq is “Poisson” (i.e. proportional to the mean) is false, as shown in Supplementary Figure 3 of our paper (the correlation becomes 0.64). Thus, the bootstrap will be invaluable when incorporated in downstream application and we are already working on some ideas.

2. It is not just the kallisto quantification that is fast; the index building, and even compilation of the program are also easy and quick. The implication for biologists is that RNA-Seq analysis now becomes interactive. Instead of “freezing” an analysis that might take weeks or even months, data can be explored dynamically, e.g. easily quantified against different transcriptomes, or re-quantified as transcriptomes are updated. The ability to analyze data locally instead of requiring cloud computation means that analysis is portable, and also easily secure.

3. We have found the fast turnaround of analysis helpful in improving the program itself. With kallisto we can quickly check the effect of changes in the algorithms. This allows for much faster debugging of problems, and also better optimization. It also allows us to release improvements knowing that users will be able to test them without resorting to a major computation that might take months. For this reason we’re not afraid to say that some improvements to kallisto will be coming soon.

As someone who has worked on RNA-Seq since the time of 32bp reads, I have to say that kallisto has personally been extremely liberating. It offers freedom from the bioinformatics core facility, freedom from the cloud, freedom from the multi-core server, and in my case freedom from my graduate students– for the first time in years I’m analyzing tons of data on my own; because of the simplicity and speed I find I have the time for it. Enjoy!

RNA-Seq is the new kid on the block, but there is still something to be learned from the stodgy microarray. One of the lessons is hidden in a tech report by Daniela Witten and Robert Tibshirani from 2007: “A comparison of fold-change  and the t-statistic for microarray data analysis“.

The tech report makes three main points. The first is that it is preferable to use a modified t-statistic rather than the ordinary t-statistic. This means that rather than comparing (normalized) means using

$T_i = \frac{\bar{x_i} - \bar{y_i}}{s_i}$

where $s_i$ is the standard deviation of the replicates $x_i$ (respectively $y_i$) of gene i in two different conditions, it is better to use

$T'_i = \frac{\bar{x_i} - \bar{y_i}}{s_i+s_0}$

where $s_0$ minimizes the coefficient of variation of $T'_i$.

The second point made is that the intuition that reproducibility implies accuracy is not correct (fold change had been proposed for use instead of a t-statistic because the results were more reproducible).

The third point, in my opinion the most important one, I quote directly from the report:

“A researcher should choose the measure of differential expression based on the biological system of interest. If large absolute changes in expression are relevant to the system, then fold-change should be used; on the other hand, if changes in expression relative to the underlying noise are important, then a modified t-statistic is preferable.”

How does this pertain to RNA-Seq? Microarray experiments and RNA-Seq both measure expression but the translation of methods for the analysis of one platform to the other can be non-trivial. One reason is that in RNA-Seq experiments accurately measuring “fold-change” is difficult. Read counts accumulated across a gene cannot be used directly to estimate fold change because the transcripts making up the gene may have different lengths. For this reason, methods such as Cufflinks, RSEM or eXpress (and most recently Sailfish recently reviewed on this blog) use the EM algorithm to “deconvolute” ambiguously mapped reads. The following thought experiment (Figure 1 in our paper describing Cufflinks/Cuffdiff 2) illustrates the issue:

Changes in fragment counts for a gene do not necessarily equal a change in expression. The “exon-union” method counts reads falling on any of a gene’s exons, whereas the “exon-intersection” method counts only reads
on constitutive exons. Both of the exon-union and exon-intersection counting schemes may incorrectly estimate a change in expression in genes with multiple isoforms as shown in the table. It is important to note that the problem of fragment assignment described here in the context of RNA-Seq is crucial for accurate estimation of parameters in many other *Seq assays.

“Count-based” methods for differential expression, such as DESeq, work directly with accumulated gene counts and are based on the premise that even if estimated fold-change is wrong, statistical significance can be assessed based on differences between replicates.  In our recent paper describing Cuffdiff 2 (with a new method for differential abundance analysis) we examine DESeq (as a proxy for count-based methods) carefully and show using both simulation and real data that fold-change is not estimated accurately. In fact, even when DESeq and Cufflinks both deem a gene to be differentially expressed, and even when the effect is in the same direction (e.g. up-regulation), DESeq can (and many times does) estimate fold-change incorrectly. This problem is not specific to DESeq. All “count based” methods that employ naive heuristics for computing fold change will produce inaccurate estimates:

Comparison of fold-change estimated by Cufflinks (tail of arrows) vs. “intersection-count” (head of arrows) reproduced from Figure 5 of the supplementary material of the Cuffdiff 2 paper. “Intersection-count” consists of the accumulated read counts in the regions shared among transcripts in a gene. The x-axis shows array fold change vs. the estimated fold-change on the y-axis.  For more details on the experiment see the Cuffdiff 2 paper.

In other words,

it is essential to perform fragment assignment in a biological context where absolute expression differences are relevant to the system.

What might that biological context be? This is a subjective question but in my experience users of microarrays or RNA-Seq (including myself) always examine fold-change in addition to p-values obtained from (modified) t-statistics or other model based statistics because the raw fold-change is more directly connected to the data from the experiment.

In many settings though, statistical significance remains the gold standard for discovery. In the recent epic “On the immortality of television sets: ‘function’ in the human genome according to the evolution-free gospel of ENCODE“, Dan Graur criticizes the ENCODE project for reaching an “absurd conclusion” through various means, among them the emphasis of “statistical significance rather than magnitude of effect”. Or, to paraphrase Samuel Johnson,

statistical significance is the last refuge from a poor analysis of data.