
[**Update July 15, 2016**: A preprint describing sleuth is available on bioRxiv]

Today my student Harold Pimentel released the beta version of his new RNA-Seq analysis method and software program called **sleuth**. A sleuth for RNA-Seq begins with the quantification of samples with **kallisto**, and together a sleuth of kallistos can be used to analyze RNA-Seq data rigorously and rapidly.

**Why do we need another RNA-Seq program?**

A major challenge in transcriptome analysis is to determine the transcripts that have changed in their abundance across conditions. This challenge is not entirely trivial because the stochasticity in transcription both within and between cells (*biological variation*), and the randomness in the experiment (RNA-Seq) that is used to determine transcript abundances (*technical variation*), make it difficult to determine what constitutes “significant” change.

Technical variation can be assessed by performing *technical replicates* of experiments. In the case of RNA-Seq, this can be done by repeatedly sequencing from one cDNA library. Such replicates are fundamentally distinct from *biological replicates* designed to assess biological variation. Biological replicates are performed by sequencing different cDNA libraries that have been constructed from repeated biological experiments performed under the same (or in practice near-same) conditions. Because biological replicates require sequencing different cDNA libraries, **a key point is that biological replicates include technical variation**.

In the early days of RNA-Seq a few papers (e.g. Marioni *et al.* 2008, Bullard *et al.* 2010) described analyses of technical replicates and concluded that they were not really needed in practice, because technical variation could be predicted statistically from the properties of the Poisson distribution. The point is that in an idealized RNA-Seq experiment counts of reads are multinomial (according to abundances of the transcripts they originate from), and therefore approximately Poisson distributed. Their variance is therefore approximately equal to the mean, so that it is possible to predict the variance in counts across technical replicates based on the abundance of the transcripts they originate from. There is, however, one important subtlety here: “counts of reads” as used above refers to the number of reads originating from a transcript, but in many cases, especially in higher eukaryotes, **reads are frequently ambiguous as to their transcript of origin** because of the presence of multi-isoform genes and gene families. In other words, **transcript counts cannot be precisely measured**. However, the statement about the Poisson distribution of counts in technical replicates remains true when considering counts of reads by *genomic* features, because then reads are no longer ambiguous.
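To make this concrete, here is a minimal R sketch (with invented abundances) of why counts across technical replicates are approximately Poisson: reads are multinomially distributed across features, and for each feature the variance of its count across replicates is approximately equal to its mean:

```r
# Toy illustration (invented numbers): in an idealized RNA-Seq experiment,
# counts across technical replicates are multinomial, hence ~Poisson per feature.
set.seed(42)
abundances <- c(0.5, 0.3, 0.15, 0.05)  # assumed relative feature abundances
n_reads    <- 1e6                      # reads per technical replicate
n_reps     <- 200                      # simulated technical replicates

# Each column is one technical replicate's counts over the four features
counts <- rmultinom(n_reps, size = n_reads, prob = abundances)

# Variance across replicates is approximately equal to the mean (Poisson property)
cbind(mean = rowMeans(counts), var = apply(counts, 1, var))
```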

This is why, in so-called “count-based methods” for RNA-Seq analysis, there is an analysis only at the gene level. Programs such as DESeq/DESeq2, edgeR and literally dozens of other count-based methods first require counting reads across genomic features using tools such as HTSeq or featureCounts. By working with counts of reads to genomic features, these methods can dispense with technical replicates: the statistical assumption is that such replicates would simply reveal Poisson distributed data, so the methods focus instead on modeling biological variation. The issue of how to model biological variation is non-trivial because typically very few biological replicates are performed in experiments. Thus, there is a need for pooling information across genes to obtain reliable variance estimates via a statistical process called shrinkage. How and what to shrink is a matter of extensive debate among statisticians engaged in the development of count-based RNA-Seq methods, but one theme that has emerged is that shrinkage approaches can be compatible with general and generalized linear models, thus allowing for the analysis of complex experimental designs.
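For readers unfamiliar with this workflow, the following is a minimal sketch of the count-based approach using DESeq2; the count matrix here is a simulated placeholder, whereas a real analysis would start from read counts to genes produced by HTSeq or featureCounts:

```r
# A minimal sketch of a count-based analysis (placeholder data, not a real dataset)
library(DESeq2)

# Placeholder gene-level count matrix: 100 genes x 6 samples
gene_counts <- matrix(rpois(600, lambda = 50), nrow = 100,
                      dimnames = list(paste0("gene", 1:100), paste0("s", 1:6)))
coldata <- data.frame(condition = factor(rep(c("A", "B"), each = 3)),
                      row.names = colnames(gene_counts))

dds <- DESeqDataSetFromMatrix(countData = gene_counts,
                              colData   = coldata,
                              design    = ~ condition)
dds <- DESeq(dds)   # fits a negative binomial GLM with shrunken dispersion estimates
results(dds)        # per-gene tests for differential expression
```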

Despite these accomplishments, count-based methods for RNA-Seq have two major (related) drawbacks. First, the use of counts to gene features prevents inference about the transcription of isoforms, and therefore with most count-based methods it is impossible to identify splicing switches and other isoform changes between conditions. Some methods have tried to address this issue by restricting genomic features to specific exons or splice junctions (e.g. DEXSeq), but this requires throwing out a lot of data, thereby reducing power for identifying statistically significant differences between conditions. Second, it is in general mathematically incorrect to estimate gene abundances by adding up the counts of reads to their genomic region. One consequence of this is that it is not possible to accurately measure fold change between conditions by using counts to gene features. In other words, count-based methods are problematic even at the gene level, and it is necessary to estimate transcript-level counts.

While reads might be ambiguous as to exactly which transcripts they originated from, it is possible to statistically infer an estimate of the number of reads from each transcript in an experiment. This kind of quantification has its origin in the papers of Jiang and Wong (2009) and Trapnell *et al.* (2010). **However, the process of estimating transcript-level counts introduces technical variation**. That is to say, if multiple technical replicates were performed on a cDNA library and then transcript-level counts were to be inferred, **those inferred counts would no longer be Poisson distributed.** Thus, there appears to be a need for performing technical replicates after all. Furthermore, it becomes unclear how to work within the shrinkage frameworks of count-based methods.
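The statistical inference alluded to here is typically performed with an EM algorithm. The following toy R sketch (an illustration of the general idea, not the implementation of any particular tool) shows how ambiguous reads can be fractionally assigned to transcripts:

```r
# Toy EM for transcript abundance estimation from ambiguous reads.
# Each read is represented by the set of transcripts it is compatible with.
compat <- list(c(1), c(1, 2), c(1, 2), c(2, 3), c(3))  # five reads, three transcripts
n_txp  <- 3
alpha  <- rep(1 / n_txp, n_txp)  # initial abundance estimates

for (iter in 1:100) {
  counts <- numeric(n_txp)
  for (s in compat) {                                    # E-step: split each read
    counts[s] <- counts[s] + alpha[s] / sum(alpha[s])    # among compatible transcripts
  }
  alpha <- counts / sum(counts)                          # M-step: renormalize abundances
}
round(alpha, 3)   # estimated relative abundances
```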

There have been a handful of attempts to develop methods that combine the uncertainty of count estimates at the transcript level with biological variation in the assessment of statistically significant changes in transcript abundances between conditions. For example, the Cuffdiff2 method generalizes DESeq, while the bitSeq method relies on a Bayesian framework to simultaneously quantify abundances at the transcript level while modeling biological variability. Despite showing improved performance over count-based methods, these approaches also have significant shortcomings. For example, they are not as flexible as general(ized) linear models, and bitSeq is slow partly because it requires MCMC sampling.

Thus, despite intensive research on both statistical and computational methods for RNA-Seq over the past years, there has been no solution for analysis of experiments that allows biologists to take full advantage of the power and resolution of RNA-Seq.

**The sleuth model**

The main contribution of **sleuth** is an intuitive yet powerful model for RNA-Seq that bridges the gap between count-based methods and quantification algorithms in a way that fully exploits the advantages of both.

To understand **sleuth**, it is helpful to start with the *general linear model*:

$$Y_t = X_t \beta_t + \epsilon_t.$$

Here the subscript *t* refers to a specific transcript, $Y_t$ is a vector describing transcript abundances (of length equal to the number of samples), $X_t$ is a design matrix (of size number of samples x number of confounders), $\beta_t$ is a parameter vector (of size the number of confounders) and $\epsilon_t$ is a noise vector (of size the number of samples). In this model the abundances $Y_t$ are normally distributed. For the purposes of RNA-Seq data, the $Y_t$ may be assumed to be the logarithm of the counts (or normalized counts per million) from a transcript, and indeed this is the approach taken in a number of approaches to RNA-Seq modeling, e.g. in limma-voom. A common alternative to the general linear model is the *generalized linear model*, which postulates that some function of $Y_t$ has a distribution with mean equal to $g^{-1}(X_t \beta_t)$, where *g* is a link function such as log, thereby allowing for distributions other than the normal to be used for the observed data. In the RNA-Seq context, where the negative binomial distribution may make sense because it is frequently a good distribution for modeling count data, the *generalized* model is sometimes preferred to the standard general model (e.g. by DESeq2). There is much debate about which approach is “better”.
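To make the distinction concrete, here is a minimal R sketch (with invented counts for a single hypothetical transcript across six samples) contrasting a general linear model fit to log counts with a negative binomial generalized linear model:

```r
# General vs. generalized linear model for one transcript (invented data)
library(MASS)   # for glm.nb

condition <- factor(rep(c("A", "B"), each = 3))
counts    <- c(50, 60, 55, 110, 95, 120)  # hypothetical counts across six samples

# General linear model on log counts (the limma-voom style response)
fit_lm  <- lm(log(counts + 0.5) ~ condition)

# Generalized linear model: negative binomial response with a log link
# (the style of model used by DESeq2)
fit_glm <- glm.nb(counts ~ condition)

summary(fit_lm)
summary(fit_glm)
```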

In the **sleuth** model the $Y_t$ in the general linear model are modeled as *unobserved*. They can be thought of as the unobserved logarithms of true counts for each transcript across samples and are assumed to be normally distributed. The *observed data* $D_t$ is the logarithm of estimated counts for each transcript across samples, and is modeled as

$$D_t = Y_t + \zeta_t,$$

where the vector $\zeta_t$ parameterizes a perturbation to the unobserved $Y_t$. This can be understood as the technical noise due to the random sequencing of fragments from a cDNA library and the uncertainty introduced in estimating transcript counts.

The **sleuth** model incorporates the assumption that the response error is *additive*, i.e. if the variance of transcript *t* in sample *i* is $V(D_{t,i})$ then $V(D_{t,i}) = \sigma_t^2 + \tau_t^2$, where the biological variance is $V(\epsilon_{t,i}) = \sigma_t^2$ and the technical variance is $V(\zeta_{t,i}) = \tau_t^2$. Intuitively, **sleuth** teases apart the two sources of variance by examining both technical and biological replicates, and in doing so directly estimates “true” biological variance, i.e. the variance in biological replicates that is *not* technical. In lieu of actual technical replicates, **sleuth** takes advantage of the bootstraps of **kallisto**, which serve as accurate proxies.
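Schematically, and with made-up numbers, the decomposition looks as follows; the variance across **kallisto** bootstraps stands in for the technical variance $\tau_t^2$:

```r
# Illustrative decomposition V(D_{t,i}) = sigma_t^2 + tau_t^2 (invented numbers)
obs_var  <- 0.90   # total variance of log estimated counts across biological replicates
tech_var <- 0.35   # mean variance across kallisto bootstraps: proxy for tau_t^2

# The "true" biological variance is what remains after subtracting technical
# noise; it is floored at zero since the raw difference can be negative
bio_var <- max(obs_var - tech_var, 0)   # estimate of sigma_t^2
bio_var
```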

In a test of **sleuth** on *data simulated according to the DESeq2 model* we found that **sleuth** significantly outperforms other methods:

In this simulation transcript counts were simulated according to a negative binomial distribution, following closely the protocol of the DESeq2 paper simulations. Reference parameters for the simulation were first estimated by running DESeq2 on the female Finnish population from the GEUVADIS dataset (59 individuals). In the simulation above, size factors were set to be equal in accordance with typical experiments being performed, but we also tested **sleuth** with size factors drawn at random with geometric mean of 1 in accordance with the DESeq2 protocol (yielding factors of 1, 0.33, 3, 3, 0.33 and 1) and **sleuth** still outperformed other methods.
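For readers who want to experiment, the following R sketch illustrates the style of this simulation; the parameters here are invented for illustration and are not those estimated from GEUVADIS:

```r
# Sketch of a negative binomial simulation with size factors (invented parameters)
set.seed(1)
n_txp        <- 1000
base_mean    <- rlnorm(n_txp, meanlog = 4, sdlog = 2)  # per-transcript mean counts
dispersion   <- 0.2                                    # NB dispersion
size_factors <- c(1, 0.33, 3, 3, 0.33, 1)              # per-sample scaling factors

# Each column is one sample; means are scaled by that sample's size factor
counts <- sapply(size_factors, function(sf)
  rnbinom(n_txp, mu = sf * base_mean, size = 1 / dispersion))
dim(counts)   # transcripts x samples
```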

There are many details in the implementation of **sleuth** that are crucial to its performance, e.g. the approach to shrinkage used to estimate the biological variance $\sigma_t^2$. A forthcoming preprint, written together with Nicolas Bray and Páll Melsted, who also contributed to the project along with myself, will provide the details.

**Exploratory data analysis with sleuth**

One of the design goals of **sleuth** was to create a simple and efficient workflow in line with the principles of **kallisto**. Working with the Shiny web application framework, we have designed an HTML interface that allows users to interact with **sleuth** plots, allowing for real-time exploratory data analysis.

The **sleuth** Shiny interface is much more than just a GUI for making plots of **kallisto** processed data. First, it allows for the exploration of the **sleuth** fitted models; users can explore the technical variation of each transcript, see where statistically significant differential transcripts appear in relation to others in terms of abundance and variance, and much more. Particularly useful are interactive features in the plots. For example, when examining an MA plot, users can highlight a region of points (dynamically created box in upper panel) and see the variance breakdown of the transcripts those points correspond to, along with a list of the transcripts in a table below:

The web interface contains diagnostics, summaries of the data, “maps” showing low-dimensional representations of the data, and tools for analysis of differential transcripts. The interactivity via Shiny can be especially useful for diagnostics; for example, users can examine scatterplots of any two samples, and then select outliers to examine their variance, including the breakdown of technical variance. This allows for a determination of whether outliers represent high variance transcripts, or specific samples gone awry. Users can of course make figures showing transcript abundances in all samples, including boxplots displaying the extent of technical variation. Interested in the differentially transcribed isoform ENST00000349155 of the TBX3 gene shown in Figure 5d of the Cuffdiff2 paper? It’s trivial to examine using the transcript viewer:

One can immediately see visually that differences between conditions completely dominate both the technical and biological variation within conditions. The **sleuth** q-value for this transcript is $3 \times 10^{-68}$.

Among the maps, users can examine PCA projections onto any pair of components, allowing for rapid exploration of the structure of the data. Thus, with **kallisto** and **sleuth**, raw RNA-Seq reads can be converted into a complete analysis in a matter of minutes. Experts will be able to generate plots and analyses in R using the **sleuth** library as they would with any R package. We plan numerous improvements and developments to the **sleuth** interface in the near future that will further facilitate data exploration; in the meantime we welcome feedback from users.
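As a pointer for R users, the sketch below shows how the interactive interface and the transcript viewer might be invoked from a fitted **sleuth** object; here `so` is a placeholder for an object built with `sleuth_prep` and `sleuth_fit` as described in the tutorial:

```r
# Launching the interactive interface from a fitted sleuth object 'so'
# ('so' is a placeholder; see the sleuth tutorial for constructing it)
library(sleuth)

sleuth_live(so)   # opens the Shiny interface in the browser

# The transcript viewer shown above can also be reproduced as a static plot
plot_bootstrap(so, "ENST00000349155", units = "est_counts")
```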

**How to try out sleuth**

Since **sleuth** requires the bootstraps and quantifications output by **kallisto**, we recommend starting by running **kallisto** on your samples. The **kallisto** program is very fast, processing 30 million reads on a laptop in a matter of minutes. You will have to run **kallisto** with bootstraps; we have been using 100 bootstraps per sample, but it should be possible to work with many fewer. We have yet to fully investigate the minimum number of bootstraps required for **sleuth** to be accurate.
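A minimal sketch of this step, with placeholder file names; **kallisto** is a command-line program, and here it is invoked from R (the same commands can equally be run directly in a shell):

```r
# Build the transcriptome index, then quantify one sample with 100 bootstraps
# (file names are placeholders)
system2("kallisto", c("index", "-i", "transcripts.idx", "transcripts.fa"))
system2("kallisto", c("quant",
                      "-i", "transcripts.idx",
                      "-o", "sample1",        # output directory for this sample
                      "-b", "100",            # bootstraps per sample
                      "reads_1.fastq.gz", "reads_2.fastq.gz"))
```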

To learn how to use **kallisto** start here. If you have already run **kallisto** you can proceed to the tutorial for **sleuth**. If you’re really eager to see **sleuth** without first learning **kallisto**, you can skip ahead and try it out using pre-computed **kallisto** runs of the Cuffdiff2 data; the tutorial explains where to obtain the data for trying out **sleuth**.

For questions, suggestions or help see the program websites and also the kallisto-sleuth user group. We hope you enjoy the tools!

Today I posted the preprint N. Bray, H. Pimentel, P. Melsted and L. Pachter, Near-optimal RNA-Seq quantification with kallisto to the arXiv. It describes the RNA-Seq quantification program **kallisto**. [Update April 5, 2016: a revised version of the preprint has been published: Nicolas L. Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-optimal probabilistic RNA-Seq quantification, Nature Biotechnology (2016), doi:10.1038/nbt.3519 published online April 4, 2016.]

The project began in August 2013 when I wrote my second blog post, about another arXiv preprint describing a program for RNA-Seq quantification called Sailfish (now a published paper). At the time, a few students and postdocs in my group read the paper and then discussed it in our weekly journal club. It advocated a philosophy of “lightweight algorithms, which make frugal use of data, respect constant factors and effectively use concurrent hardware by working with small units of data where possible”. Indeed, two themes emerged in the journal club discussion:

1. Sailfish was much faster than other methods by virtue of being *simpler*.

2. The simplicity was to replace *approximate* alignment of *reads* with *exact* alignment of *k-mers*. When reads are shredded into their constituent *k*-mer “mini-reads”, the difficult problem of aligning reads to a reference in the presence of errors becomes an exact matching problem efficiently solvable with a hash table.
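The following toy R sketch illustrates the idea in point 2, with a tiny invented reference; real tools use far more efficient data structures, but the principle of exact k-mer lookup in a hash table is the same:

```r
# Toy k-mer hash table: k-mer -> names of transcripts containing it
k <- 5
reference <- c(t1 = "ACGTACGTAGGC", t2 = "TTGCAACGTACG")  # invented sequences

kmer_table <- new.env(hash = TRUE)
for (txp in names(reference)) {
  s <- reference[[txp]]
  for (i in 1:(nchar(s) - k + 1)) {
    kmer <- substr(s, i, i + k - 1)
    prev <- if (exists(kmer, envir = kmer_table, inherits = FALSE))
      get(kmer, envir = kmer_table) else character(0)
    assign(kmer, union(prev, txp), envir = kmer_table)
  }
}

# Shred a read into k-mer "mini-reads" and look each one up exactly
read  <- "ACGTACGT"
kmers <- sapply(1:(nchar(read) - k + 1), function(i) substr(read, i, i + k - 1))
sapply(kmers, function(km)
  if (exists(km, envir = kmer_table, inherits = FALSE))
    paste(get(km, envir = kmer_table), collapse = ",") else NA)
```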

We felt that the shredding of reads must lead to reduced accuracy, and we quickly checked and found that to be the case. In fact, in our simulations, we saw that Sailfish significantly underperformed methods such as RSEM. However, the fact that simpler was **so** much faster led us to wonder whether the prevailing wisdom of seeking to improve RNA-Seq analysis by looking at increasingly complex models was ill-founded. Perhaps simpler could be not only fast, but also accurate, or at least close enough to best-in-class for practical purposes.

After thinking about the problem carefully, my (now former) student Nicolas Bray realized that the key is to abandon the idea that alignments are necessary for RNA-Seq quantification. Even Sailfish makes use of alignments (of *k*-mers rather than reads, but alignments nonetheless). In fact, thinking about all the tools available, Nick realized that every RNA-Seq analysis program was being developed in the context of a “pipeline” of first aligning reads or parts of them to a reference genome or transcriptome. Nick had the insight to ask: what can be gained if we let go of that paradigm?

By April 2014 we had formalized the notion of “pseudoalignment” and Nick had written, in Python, a prototype of a pseudoaligner. He called the program kallisto. The basic idea was to determine, for each read, not *where* in each transcript it aligns, but rather which transcripts it is *compatible* with. That is asking for a lot less, and as it turns out, pseudoalignment can be *much* faster than alignment. At the same time, the information in pseudoalignments is enough to quantify abundances using a simple model for RNA-Seq, a point made in the isoEM paper, and an idea that Sailfish made use of as well.
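To make this concrete, here is a toy R sketch (an invented k-mer table, not kallisto's actual index structure) of computing a read's compatibility class: the intersection, over the read's k-mers, of the sets of transcripts containing each k-mer:

```r
# Hypothetical k-mer -> transcript-set table for a tiny reference
kmer_to_txps <- list(
  ACGTA = c("t1", "t2", "t3"),
  CGTAC = c("t1", "t2"),
  GTACG = c("t1", "t2"),
  TACGT = c("t2")
)

# Pseudoalignment of a read: intersect the transcript sets of its k-mers.
# No positions are computed, only compatibility.
pseudoalign <- function(read_kmers, table) {
  Reduce(intersect, table[read_kmers])
}

pseudoalign(c("ACGTA", "CGTAC", "GTACG"), kmer_to_txps)  # -> "t1" "t2"
```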

Just how fast is pseudoalignment? In January of this year Páll Melsted from the University of Iceland came to visit my group for a semester sabbatical. Páll had experience in exactly the kinds of computer science we needed to optimize kallisto; he has written about efficient *k*-mer counting using the Bloom filter and about de Bruijn graph construction. He translated the Python kallisto to C++, incorporating numerous clever optimizations and a few new ideas along the way. His work was done in collaboration with my student Harold Pimentel, Nick (now a postdoc with Jacob Corn and Jennifer Doudna at the Innovative Genomics Initiative) and myself.

The screenshot below shows kallisto being used on my 2012 iMac desktop to build an index of the human transcriptome (5 min 8 sec), and then quantify 78.6 million GEUVADIS human RNA-Seq reads (14 min). When we first saw these results we thought they were simply too good to be true. Let me repeat: **The quantification of 78.6 million reads takes 14 minutes on a standard desktop using a single CPU core. **In some tests we’ve gotten even faster running times, up to 15 million reads quantified per minute.

The results in our paper indicate that kallisto is not just fast, but also very accurate. This is not surprising: underlying RNA-Seq analysis are the alignments, and although kallisto is pseudoaligning instead, it is almost always only the compatibility information that is used in actual applications. As we show in our paper, from the point of view of compatibility, the pseudoalignments and alignments are almost the same.

Although accuracy is a primary concern with analysis, we realized in the course of working on kallisto that speed is also paramount, and not just as a matter of convenience. The speed of kallisto has three major implications:

1. It allows for efficient bootstrapping. All that is required for the bootstrap are reruns of the EM algorithm, and those are particularly fast within kallisto. The result is that we can accurately estimate the uncertainty in abundance estimates. One of my favorite figures from our paper, made by Harold, is this one:

It is based on an analysis of 40 samples of 30 million reads subsampled from 275 million rat RNA-Seq reads. Each dot corresponds to a transcript and is colored by its abundance. The *x*-axis shows the variance estimated from kallisto bootstraps on a single subsample, while the *y*-axis shows the variance computed from the different subsamples of the data. We see that the bootstrap recapitulates the empirical variance. This result is non-trivial: the standard dogma, that the technical variance in RNA-Seq is “Poisson” (i.e. proportional to the mean), is false, as shown in Supplementary Figure 3 of our paper (the correlation becomes 0.64). Thus, the bootstrap will be invaluable when incorporated in downstream applications, and we are already working on some ideas (a toy sketch of this bootstrap idea appears after this list).

2. It is not just the kallisto quantification that is fast; the index building, and even compilation of the program are also easy and quick. The implication for biologists is that RNA-Seq analysis now becomes interactive. Instead of “freezing” an analysis that might take weeks or even months, data can be explored dynamically, e.g. easily quantified against different transcriptomes, or re-quantified as transcriptomes are updated. The ability to analyze data locally instead of requiring cloud computation means that analysis is portable, and also easily secure.

3. We have found the fast turnaround of analysis helpful in improving the program itself. With kallisto we can quickly check the effect of changes in the algorithms. This allows for much faster debugging of problems, and also better optimization. It also allows us to release improvements knowing that users will be able to test them without resorting to a major computation that might take months. For this reason we’re not afraid to say that some improvements to kallisto will be coming soon.
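Returning to point 1, here is a toy R sketch of the bootstrap idea; it reuses the simple EM shown earlier in this post and is purely illustrative, not kallisto's implementation:

```r
# Toy bootstrap: resample reads (their compatibility sets) with replacement,
# rerun the EM, and use the spread of estimates as a measure of uncertainty
run_em <- function(compat, n_txp, iters = 100) {
  alpha <- rep(1 / n_txp, n_txp)
  for (iter in 1:iters) {
    counts <- numeric(n_txp)
    for (s in compat) counts[s] <- counts[s] + alpha[s] / sum(alpha[s])
    alpha <- counts / sum(counts)
  }
  alpha
}

compat <- list(c(1), c(1, 2), c(1, 2), c(2, 3), c(3))  # five reads, three transcripts
boot <- replicate(100, run_em(sample(compat, replace = TRUE), n_txp = 3))
apply(boot, 1, var)   # per-transcript technical variance estimates
```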

As someone who has worked on RNA-Seq since the time of 32bp reads, I have to say that kallisto has personally been extremely liberating. It offers freedom from the bioinformatics core facility, freedom from the cloud, freedom from the multi-core server, and in my case freedom from my graduate students: for the first time in years I’m analyzing tons of data on my own, and because of the simplicity and speed I find I have the time for it. Enjoy!
