You are currently browsing the tag archive for the ‘single-cell’ tag.

This post is the third in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

- Near-optimal pre-processing of single-cell RNA-seq
- Single-cell RNA-seq for dummies
- How to solve an NP-complete problem in linear time
- Rotating the knee (plot) and related yoga
- High velocity RNA velocity

There is a million dollar prize on offer for a solution to the P vs. NP problem, so it’s understandable that one may wonder whether this blog post is an official entry. **It is not**.

The title for this post was inspired by a talk presented by David Tse at the CGSI 2017 meeting where he explained “How to solve NP-hard assembly problems in linear time“. The gist of the talk was summarized by Tse as follows:

“In computational genomics there’s been a lot of problems where the formulation is combinatorial optimization. Usually they come from some maximum likelihood formulation of some inference problem and those problems end up being mostly NP-hard. And the solution is typically to develop some heuristic way of solving the NP-hard problem. What I’m saying here is that actually there is a different way of approaching such problems. You can look at them from an information point of view.”

Of course thinking about NP-hard problems from an information point of view does not provide polynomial algorithms for them. But what Tse means is that information-theoretic insights can lead to efficient algorithms that squeeze the most out of the available information.

One of the computational genomics areas where an NP-complete formulation for a key problem was recently proposed is in single-cell RNA-seq pre-processing. After RNA molecules are captured from cells, they are amplified by PCR, and it is possible, in principle, to account for the PCR duplicates of the molecules by making use of unique molecular identifiers (UMIs). Since UMIs are (in theory) unique to each captured molecule, but identical among the PCR duplicates of that captured molecule, they can be used to identify and discard the PCR duplicates. In practice distinct captured molecules may share the same UMI causing a *collision*, so it can be challenging to decide when to “collapse” reads to account for PCR duplicates.

In the recent paper Srivastava et al. 2019, the authors developed a combinatorial optimization formulation for collapsing. They introduce the notion of “monochromatic arborescences” on a graph, where these objects correspond to what is, in the language of the previous post, elements of the set *C*. They explain that the combinatorial optimization formulation of UMI collapsing in this framework is to find a minimum cardinality covering of a certain graph by monochromatic arboresences. The authors then prove the following theorem, by reduction from the dominating set decision problem:

*Theorem* [Srivastava, Malik, Smith, Sudbery, Patro]: Minimum cardinality covering by monochromatic arborescences is NP-complete.

Following the standard practice David Tse described in his talk, the authors then apply a heuristic to the challenging NP-complete problem. It’s all good except for one small thing. The formulation is based on an assumption, articulated in Srivastava et al. 2019 (boldface and strikethrough is mine):

…gene-level deduplication provides a conservative approach and assumes that it is highly unlikely for molecules that are distinct transcripts of the same gene to be tagged with a similar UMI (within an edit distance of 1 from another UMI from the same gene). However, entirely discarding transcript-level information will mask true UMI collisions to some degree, even when there is direct evidence that similar UMIs must have arisen from distinct transcripts. For example, if similar UMIs appear in transcript-disjoint equivalence classes (even if all of the transcripts labeling both classes belong to the same gene), then they

cannothave arisen from the same pre-PCR molecule.Accounting for such cases is especially.~~true~~[important] when using an error-aware deduplication approach and as sequencing depth increases

The one small thing? Well… the authors never checked whether the claim at the end, namely that “accounting for such cases is especially important”, is actually true. In our paper “Modular and efficient pre-processing of single-cell RNA-seq” we checked. The result is in our Figure 1d:

Each column in the figure corresponds to a dataset, and the y-axis shows the distribution (over cells) of the proportion of counts one can expect to lose if applying naïve collapsing to a gene. Naïve collapsing here means that two reads with the same UMI are considered to have come from the same molecule. The numbers are so small we had to include an inset in the top right. Basically, it almost *never* happens that there is “direct evidence that similar UMIs must have arisen from distinct transcripts”. If one does observe such an occurrence, it is almost certainly an artifact of missing annotation. In fact, this leads to an…

💡 Idea: prioritize genes with colliding UMIs for annotation correction. The UMIs directly highlight transcripts that are incomplete. Maybe for a future paper, but returning to the matter at hand…

Crucially, the *information* analysis shows that there is no point in solving an NP-complete problem in this setting. The naïve algorithm not only suffices, it is sensible to apply it. And the great thing about naïve collapsing is that it’s straightforward to implement and run; **the algorithm is linear**. The Srivastava et al. question of what is the “minimum number of UMIs, along with their counts, required to explain the set of mapped reads” is a precise, but wrong question. In the words of John Tukey: “**Far better an approximate answer to the ***right* question, which is often vague, than an *exact*** answer to the wrong question, which can always be made precise**.”

The math behind Figure 1d is elementary but interesting (see the Supplementary Note of our paper). We work with a simple binomial model which we justify based on the data. For related work see Petukhov et al. 2018. One interesting result that came out of our calculations (work done with Sina Booeshaghi), is an estimate for the effective number of UMIs on each bead in a cell. This resulted in Supplementary Figure 1:

The result is encouraging. While the number of UMIs on a bead is not quite where *L* is the length of the UMI (theoretical maximum shown by dashed red line for v2 chemistry and solid red line for v3 chemistry), it is nevertheless high. We don’t know whether the variation is a result of batch effect, model mis-specification, or other artifacts; that is an interesting question to explore with more data and analysis.

As for UMI collapsing, the naïve algorithm has been used for almost every experiment to date as it is the method that was implemented in the Cell Ranger software, and subsequently adopted in other software packages. This was done without any consideration of whether it is appropriate. As the Srivastava et al. paper shows, intuition is not to be relied upon, but fortunately, in this case, the naïve approach is the right one.

This post is the second in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

- Near-optimal pre-processing of single-cell RNA-seq
- Single-cell RNA-seq for dummies
- How to solve an NP-complete problem in linear time
- Rotating the knee (plot) and related yoga
- High velocity RNA velocity

A few months ago, while working on the kallisto | bustools project, some of us in the lab were discussing various aspects of single-cell RNA-seq technology when the conversation veered into a debate over the meaning of some frequently used words and phrases in the art: “library complexity”, “library size”, “sensitivity”, “capture rate”, “saturation”, “number of UMIs”, “bork bork bork” etc. There was some sense of confusion. I felt like a dummy because even after working on RNA-seq for more than a decade, I was still lacking language and clarity about even the most basic concepts. This was perhaps not entirely my fault. Consider, for example, that the phrase “library size” is used to mean “the number of molecules in a cDNA library” by some authors, and the “number of reads sequenced” by others.

Since we were writing a paper on single-cell RNA-seq pre-processing that required some calculations related to the basic concepts (libraries, UMIs, and so on), we decided to write down notation for the key objects. After some back-and-forth, Sina Booeshaghi and I ended up drafting the diagram below that summarizes the sets of objects in a single-cell RNA-seq experiment, and the maps that relate them:

Structure of a single-cell RNA-seq experiment.

Each letter in this diagram is a set. The ensemble of RNA molecules contained within a single cell is denoted by *R*. To investigate *R,* a library (*L*) is constructed from the set of molecules captured from *R* (the set *C*). Typically, *L* is the result of of various fragmentation and amplification steps performed on *C*, meaning each element of *C* may be observed in *L* with some multiplicity. Thus, there is an inclusion map from *C* to *L *(arrow with curly tail), and an injection from *C* to *R *(arrows with head and tail). The library is interrogated via sequencing of some of the molecules in *L*, resulting in a set *F* of fragments. Subsequently, the set *F* is aligned or pseudoaligned to create a set *B*, which in our case is a BUS file. Not every fragment *F* is represented in *B*, hence the injection, rather than bijection, from *B* to *F*, and similarly from *F* to *L. *The set* T* consists of transcripts that correspond to molecules in *C* that were represented in *B*. Note that . Separately, the set *U* consists of the unique molecular identifiers (UMIs) available to label molecules from the cell, and *I* is a multiset of UMIs associated with the molecules in *T. *Importantly, the data from an experiment consists of *F*, together with the support of *I. *The support of *I* means the number of distinct objects in *I, *and is denoted by *|supp(I)|*. The common term is “number of distinct UMIs”.

The diagram has three distinct parts. The sets on the top (*L, F, B*) are “lifted” from *C * and *T *by PCR. Without PCR one would be in an the ideal situation of measuring *C* directly to produce *T,* which would then be used to directly draw inferences about *R. *This is the hope for direct RNA sequencing, a technology that is promising but that cannot yet be applied at the scale of cDNA based methods. The sets *U* and *I *are intended to be seen as orthogonal to the rest of the objects. They relate to the UMIs which, in droplet single-cell RNA-seq technology, are delivered via beads. While the figure was designed to describe single-cell RNA-seq, it is quite general and possibly a useful model for many sequence census assays.

So what is all this formality good for? Nothing in this setup is novel; any practitioner working with single-cell RNA-seq already knows what the ingredients for the technology are. However I do think there is some trouble with the language and meaning of words, and hopefully having names and labels for the relevant sets can help in communication.

**The questions**

With some notation at hand, it is possible to precisely articulate some of the key technical questions associated with a single-cell RNA-seq experiment:

- The
**alignment**(or pseudoalignment) problem: compute*B*from*F*. - The
**pre-processing**problem: what is the set*T*? - What is the
**library richness/complexity**, i.e. what is*|supp(L)|*? - What is the
**sensitivity**, i.e. what is ? - In droplet based experiments, what are the
**number of UMIs available**to tag molecules in a cell, i.e. what is*|U|*?

These basic questions are sometimes confused with each other. For example, the *capture rate *refers to the proportion of cells from a sample that are captured in an experiment and should not be confused with sensitivity. The *|supp(L)| *is a concept that is natural to refer to when thinking about a cDNA library. Note that the “library size”, referred to in the beginning of this post, is used by molecular biologists to naturally mean *|L|, *and not *|F| *(this confusion was unfortunately disseminated by the highly influential RNA-seq papers Anders and Huber, 2010 and Robinson and Oshlack, 2010) . The support of another set, |*supp(I)|, *is one that is easy to measure but precisely because *I* is a multiset, , and there is considerable confusion about this fact. The number of distinct UMIs, |*supp(I)|,* is frequently used in lieu of the set whose size is being referred to, namely |I| (this is the case when “knee plots” are made, a topic for the fourth blog post in this series). Similarly, *|U|* is usually not estimated, and the number where is the length of the UMIs is used in its stead. This is partly intellectual laziness but partly, I think, the lack of standard notation to refer to the objects in single-cell RNA-seq experiments.

This diagram in this post is just step 0 in discussing single-cell RNA-seq. There is a lot more subtlety and nuance in understanding and interpreting experiments (see Introduction to single-cell RNA-seq technologies). ∎

This post is the first in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

- Near-optimal pre-processing of single-cell RNA-seq
- Single-cell RNA-seq for dummies
- How to solve an NP-complete problem in linear time
- Rotating the knee (plot) and related yoga
- High velocity RNA velocity

During the past few years computational biologists have expended enormous effort in developing tools for processing and analyzing single-cell RNA-seq. This post describes yet another: the kallisto|bustools workflow for pre-processing single-cell RNA-seq. A preprint describing the method (Melsted, Booeshaghi et al. 2019) was recently posted on the bioRχiv.

Number of single-cell RNA-seq tools (from the scRNA-tools catalog).

Given that there are * so* many programs, a natural question is:

**why on earth would we write yet another software program for generating a count matrix from single-cell RNA-seq reads when there are already plenty of programs out there?**There’s alevin, cell ranger, dropseqpipe, dropseqtools, indrops… I’ve been going in alphabetical order but have to jump in with starsolo because it’s got the coolest name…now back to optimus, scruff, scpipe, scumi, umis, zumis, and I’m probably forgetting a few other something-umis. So

**w**

**hy another one?**

The answer requires briefly venturing back to a time long, long ago when RNA-seq was a fledgling, exciting new technology (~2009). At the time the notion of an “equivalence class*” *was introduced to the field (see e.g. Jiang and Wong, 2009 or Nicolae et al., 2011). Briefly, there is a natural equivalence relation on the set of reads in an RNA-seq experiment, where two reads are related when they are compatible with (i.e. could have originated from) exactly the same set of transcripts. The equivalence relation partitions the reads into equivalence classes, and, in a slight abuse of notation, the term “equivalence class” in RNA-seq is used to denote the set of transcripts corresponding to an equivalence class of reads. Starting with the pseudoalignment program kallisto that we published in Bray et al. 2016, it became possible to rapidly obtain the (transcript) equivalence classes for reads from an RNA-seq experiment.

When single-cell RNA-seq started to scale it became apparent to those of us working with equivalence classes for bulk RNA-seq that rather than counting genes from single-cell RNA-seq data, it would be better to examine what we called transcript compatibility counts (TCCs), i.e. counts of the equivalence classes (the origin of the term TCC is discussed in a previous blog post of mine). This vision has borne out: we recently published a paper demonstrating the power of TCCs for differential analysis of single-cell data (Ntranos, Yi et al. 2019) and I believe TCCs are ideal for many different single-cell RNA-seq analyses. So back to the question: **why another single-cell RNA-seq pre-processing workflow?**

Already in 2016 we wanted to be able to produce TCC matrices from single-cell RNA-seq data but there was no program to do it. My postdoc at the time, Vasilis Ntranos, developed a workflow, but in the course of working on a program he started to realize that there were numerous non-trivial aspects to processing single-cell RNA-seq. Even basic questions, such as how to correct barcodes or collapse UMIs required careful thought and analysis. As more and more programs for single-cell RNA-seq pre-processing started to appear, we examined them carefully and noted two things: 1. Most were not able to output TCC matrices and 2. They were, for the most part, based on ad hoc heuristics and unvalidated methods. Many of the programs were not even released with a preprint or paper. So we started working on the problem.

A key insight was that we needed a new format to allow for modular pre-processing. So we developed such a format, which we called the **Barcode, UMI, Set (BUS) format,** and we published a paper about it earlier this year (Melsted, Ntranos et al., 2019). This allowed us to start investigating different algorithms for the key steps, and to rearrange them and plug them in to an overall workflow as needed. Finally, after careful consideration of each of the key steps, weighing tradeoffs between efficiency and accuracy, and extensive experimentation, we settled on a workflow that is faster than any other method and based on reason rather than intuition. The workflow uses two programs, kallisto and bustools, and we call it the kallisto|bustools workflow. Some highlights:

- kallisto|bustools can produce a TCC matrix. The matrix is compatible with the gene count matrix (it collapses to the latter), and of course gene count matrices can be output as well for use in existing downstream tools.
- The workflow is very
**very**fast. With kallisto|bustools very large datasets can be processed in minutes. The title of this post refers to the workflow as “near-optimal” because it runs in time similar to the unix word count function. Maybe it’s possible to be a bit faster with some optimizations, but probably not by much:

- kallisto|bustools uses very little memory. We worked hard to achieve this feature, as we wanted it to be useful for large-scale analyses that are going to be performed by consortia such as the Human Cell Atlas project. The workflow currently uses ~3.5Gb of RAM for processing 10x v2 chemistry data, and ~11Gb RAM for 10x v3 chemistry data; both numbers are independent of the number of reads being processed. This means users can pre-process data on a laptop:

- The workflow is modular, thanks to its reliance on the flexible BUS format. It was straightforward to develop an RNA velocity workflow (more on this in a companion blog post). It will be easy to adapt the workflow to various technologies, to multiomics experiments, and to any custom analysis required:

- We tried to create comprehensive, yet succinct documentation to help make it easy to use the software (recommendations for improvements are welcome). We have online tutorials, as well as videos for novices:

– Installation instructions (and video)

– Getting started tutorial (and video).

– Manuals for kallisto and bustools.

– Complete code for reproducing all the results in the preprint - We were not lazy. In our tests we found variability in performance on different datasets so we tested the program extensively and ran numerous benchmarks on 10x Genomics data to validate Cell Ranger with respect to kallisto|bustools (note that Cell Ranger’s methods have been neither validated nor published). We compiled a benchmark panel consisting of 20 datasets from a wide variety of species. This resulted in 20 supplementary figures, each with 8 panels showing: a) the number of genes detected, b) concordance in counts per gene, c) number of genes detected, d) correlation in gene counts by cell, e) spatial separation between corresponding cells vs. neighboring cells, f,g) t-SNE analysis, h) gene set analysis to detect systematic differences in gene abundance estimation (see example below for the dataset SRR8257100 from the paper Ryu et al., 2019). We also examined in detail results on a species mixing experiment, and confirmed that Cell Ranger is consistent with kallisto on that as well. One thing we did not do in this paper is describe workflows for different technologies but such workflows and accompanying tutorials will be available soon:

- In addition we ran a detailed analysis on the 10x Genomics 10k E18 mouse brain dataset to investigate whether Cell Ranger pre-processing produces different results than kallisto insofar as downstream analyses are concerned. We looked at dimensionality reduction, clustering, identification of marker genes, marker gene prevalence, and pseudotime. The results were all highly concordant. An example (the pseudotime analysis) is shown below:

- We did the math on some of the basic aspects of single-cell RNA-seq. We’re not the first to do this (see, e.g. Petukhov et al., 2018), but one new result we have is an estimation of the UMI diversity on beads. This should be useful for those developing new technologies, or trying to optimize existing protocols:

Note that this post is the first in a series of five that discuss in more detail various aspects of the paper (see links at the top). Finally, a note on** reproducibility and usability:**

The development of the kallisto|bustools workflow, research into the methods, compilation of the results, and execution of the project required a tremendous team effort, and in working on it I was thinking of the first bioinformatics tool I wrote about and posted to the arXiv (the bioRxiv didn’t exist yet). The paper was:

Nicolas Bray and Lior Pachter, MAVID: Constrained ancestral alignment of multiple sequences, arXiv, 2003.

At the time we posted the code on our own website (now defunct, but accessible via the Wayback machine). We did our best to make the results reproducible but we were limited in our options with the tools available at the time. Furthermore, posting the preprint was highly unusual; there was almost no biology preprinting at the time. Other things have stayed the same. Considerations of software portability, efficiency and documentation were relevant then and remain relevant now.

Still, there has been an incredible development in the tools and techniques available for reproducibility and usability since that time. A lot of the innovation has been made possible by cloud infrastructure, but much of the development has been the result of changes in community standards and requirements (see e.g., Weber et al., 2019). I thought I’d compile a list of the parts and pieces of the project; they are typical for what is needed for a bioinformatics project today and comparing them to the bar in 2003 is mind boggling:

**Software**: GitHub repositories (kallisto and bustools); releases of binaries for multiple operating systems (Mac, Linux, Windows, Rock64); portable source code with minimal dependencies; multithreading; memory optimization; user interface.

**Paper**: Preprint (along with extensive Supplement providing backup for every result and claim in the main text); GitHub repository with code to reproduce all the figures/results in the preprint (reproducibility code includes R markdown, python notebooks, snakemake, software versions for every program used, fixed seeds).

**Documentation**: Manuals for the software; Tutorials for learning to use the code; Explanatory videos (all required materials posted on Github or available on stable websites for download).

The totality of work required to do all of this was substantial. Páll Melsted was the primary developer of kallisto and he wrote and designed bustools, which has been the foundation of the project. The key insight to adopt the BUS format was work in collaboration with Vasilis Ntranos. This was followed by long conversations on the fundamentals of single-cell RNA-seq with Jase Gehring. Sina Booeshaghi carried the project. He was responsible for the crucial UMI collapsing analysis, and put together the paper. Fan Gao, director of the Caltech Bioinformatics Resource Center, set up and implemented the extensive benchmarking, and helped fine-tune the algorithms and converge to the final approach taken. Lambda Lu conducted what I believe to be the most in-depth and detailed analysis to date of the effect of pre-processing on results. Her framework should serve as a template for future development work in this area. Eduardo Beltrame designed the benchmark panels and had a key insight about how to present results that is described in more detail in the companion post on rotating the knee plot. He also helped in the complex task of designing and building the companion websites for the project. Kristján Eldjarn Hjörleifsson helped with the RNA velocity work and helped make custom indices that turned out to be fundamental in understanding the performance of pseudoalignment in the single-cell RNA-seq setting. Sina Booeshaghi spent a lot of time thinking about how to optimize the user experience, making the tutorials and videos, and working overall to make the results of the paper not just reproducible, but the the methods usable.

The development of microarray technology two decades ago heralded genome-wide comparative studies of gene expression in human, but it was the widespread adoption of RNA-Seq that has led to differential expression analysis becoming a staple of molecular biology studies. RNA-Seq provides measurements of transcript abundance, making possible not only gene-level analyses, but also differential analysis of isoforms of genes. As such, its use has necessitated refinements of the term “differential expression”, and new terms such as “differential transcript expression” have emerged alongside “differential gene expression”. A difficulty with these concepts is that they are used to describe biology, statistical hypotheses, and sometimes to describe types of methods. The aims of this post are to provide a unifying framework for thinking about the various concepts, to clarify their meaning, and to describe connections between them.

To illustrate the different concepts associated to differential expression, I’ll use the following example, consisting of a comparison of a single two-isoform gene in two conditions (the figure is Supplementary Figure 1 in Ntranos, Yi *et al.* Identification of transcriptional signatures for cell types from single-cell RNA-Seq, 2018):

The isoforms are labeled *primary* and *secondary*, and the two conditions are called “A” and “B”. The black dots labeled conditions A and B have x-coordinates and corresponding to the abundances of the primary isoform in the respective conditions, and y-coordinates and corresponding to the abundance of the secondary isoforms. In data from an experiment the black dots will represent the mean level of expression of the constituent isoforms as derived from replicates, and there will be uncertainty as to their exact location. In this example I’ll assume they represent the true abundances.

**Biology**

Below is a list of terms used to characterize changes in expression:

**Differential transcript expression (DTE) **is change in one of the isoforms. In the figure, this is represented (conceptually) by the two red lines along the x- and y-axes respectively. Algebraically, one might compute the change in the primary isoform by and the change in the secondary isoform by . However the term DTE is used to denote not only the extent of change, but also the event that a single isoform of a gene changes between conditions, i.e. when the two points lie on a horizontal or vertical line. DTE can be understood to occur as a result of transcriptional regulation if an isoform has a unique transcription start site, or post-transcriptional regulation if it is determined by a unique splicing event.

**Differential gene expression (DGE) **is the change in the overall output of the gene. Change in the overall output of a gene is change in the direction of the line , and the extent of change can be understood geometrically to be the distance between the projections of the two points onto the line (blue line labeled DGE). The distance will depend on the metric used. For example, the change in expression could be defined to be the total expression in condition B () minus the change in expression in condition A (), which is . This is just the length of the blue line labeled “DGE” given by the norm. Alternatively, one could consider “DGE” to be the length of the blue line in the norm. As with DTE, DGE can also refer to a specific type of change in gene expression between conditions, one in which every isoform changes (relatively) by the same amount so that the line joining the two points has a slope of 1 (i.e. is angled at 45°). DGE can be understood to be the result of transcriptional regulation, driving overall gene expression up or down.

**Differential transcript usage (DTU) **is the change in *relative* expression between the primary and secondary isoforms. This can be interpreted geometrically as the angle between the two points, or alternatively as the length (as given by some norm) of the green line labeled DTU. As with DTE and DGE, DTU is also a term used to describe a certain kind of difference in expression between two conditions, one in which the line joining the two points has a slope of -1. DTU events are most likely controlled by post-transcriptional regulation.

**Gene differential expression ****(GDE)** is represented by the red line. It is the amount of change in expression along in the direction of line joining the two points. GDE is a notion that, for reasons explained below, is not typically tested for, and there are few methods that consider it. However GDE is biologically meaningful, in that it generalizes the notions of DGE, DTU and DTE, allowing for change in *any *direction. A gene that exhibits *some* change in expression between conditions is GDE regardless of the direction of change. GDE can represent complex changes in expression driven by a combination of transcriptional and post-transcriptional regulation. Note that DGE, DTU and DTE are all special cases of GDE.

If the norm is used to measure length and denote DTE in the primary and secondary isoforms respectively, then it is clear that DGE, DTU, DTE and GDE satisfy the relationship

**Statistics**

The terms DTE, DGE, DTU and GDE have an intuitive biological meaning, but they are also used in genomics as descriptors of certain null hypotheses for statistical testing of differential expression.

The **differential transcript expression (DTE)** null hypothesis for an isoform is that it did not change between conditions, i.e. for the primary isoform, or for the secondary isoform. In other words, in this example there are two DTE null hypotheses one could consider.

The **differential gene expresión (DGE)** null hypothesis is that there is no change in overall expression of the gene, i.e. .

The **differential transcript usage ****(DTU)** null hypothesis is that there is no change in the difference in expression of isoforms, i.e. .

The **gene differential expression (GDE)** null hypothesis is that there is no change in expression in *any* direction, i.e. for all constants , .

The **union differential transcript expression (UDTE) **null hypothesis is that there is no change in expression of *any* *isoform. *That is, that *and* (this null hypothesis is sometimes called DTE+G). The terminology is motivated by .

Not that , because if we assume GDE, and set we obtain DTE for the primary isoform and setting we obtain DTE for the secondary isoform. To be clear, by GDE or DTE in this case we mean the GDE (respectively DTE)* null hypothesis. *Furthermore, we have that

.

This is clear because if and then both DTE null hypotheses are satisfied by definition, and both DGE and DTU are trivially satisfied. However no other implications hold, i.e. , similarly , and .

**Methods**

The terms DGE, DTE, DTU and GDE also used to describe methods for differential analysis.

A **differential gene expression method** is one whose goal is to identify changes in overall gene expression. Because DGE depends on the projection of the points (representing gene abundances) to the line y=x, DGE methods typically take as input gene counts or abundances computed by summing transcript abundances and . Examples of early DGE methods for RNA-Seq were DESeq (now DESeq2) and edgeR. One problem with DGE methods is that it is problematic to estimate gene abundance by adding up counts of the constituent isoforms. This issue was discussed extensively in Trapnell *et al.* 2013. On the other hand, if the biology of a gene is DGE, i.e. changes in expression are the same (relatively) in all isoforms, then DGE methods will be optimal, and the issue of summed counts not representing gene abundances accurately is moot.

A **differential transcript expression method **is one whose goal is to identify individual transcripts that have undergone DTE. Early methods for DTE were Cufflinks (now Cuffdiff2) and MISO, and more recently sleuth, which improves DTE accuracy by modeling uncertainty in transcript quantifications. A key issue with DTE is that there are many more transcripts than genes, so that rejecting DTE null hypotheses is harder than rejecting DGE null hypotheses. On the other hand, DTE provides differential analysis at the highest resolution possible, pinpointing specific isoforms that change and opening a window to study post-transcriptional regulation. A number of recent examples highlight the importance of DTE in biomedicine (see, e.g., Vitting-Seerup and Sandelin 2017). Unfortunately DTE results do not always translate to testable hypotheses, as it is difficult to knock out individual isoforms of genes.

A **differential transcript usage **method is one whose goal is to identify genes whose overall expression is constant, but where isoform switching leads to changes in relative isoform abundances. Cufflinks implemented a DTU test using Jensen-Shannon divergence, and more recently RATs is a method specialized for DTU.

As discussed in the previous section, none of null hypotheses DGE, DTE and DTU imply any other, so users have to choose, prior to performing an analysis, which type of test they will perform. There are differing opinions on the “right” approach to choosing between DGE, DTU and DTE. Sonseson *et al.* 2016 suggest that while DTE and DTU may be appropriate in certain niche applications, generally it’s better to choose DGE, and they therefore advise not to bother with transcript-level analysis. In Trapnell *et al.* 2010, an argument was made for focusing on DTE and DTU, with the conclusion to the paper speculating that “differential RNA level isoform regulation…suggests functional specialization of the isoforms in many genes.” Van den Berge *et al. *2017 advocate for a middle ground: performing a gene-level analysis but saving some “FDR budget” for identifying DTE in genes for which the UDTE null hypothesis has been rejected.

There are two alternatives that have been proposed to get around the difficulty of having to choose, prior to analysis, whether to perform DGE, DTU or DTE:

A **differential transcript expression aggregation (DTE->G) **method is a method that first performs DTE on all isoforms of every gene, and then aggregates the resulting p-values (by gene) to obtain gene-level p-values. The “aggregation” relies on the observation that under the null hypothesis, p-values are uniformly distributed. There are a number of different tests (e.g. Fisher’s method) for testing whether (independent) p-values are uniformly distributed. Applying such tests to isoform p-values per gene provides gene-level p-values and the ability to reject UDTE. A DTE->G method was tested in Soneson *et al.* 2016 (based on Šidák aggregation) and the stageR method (Van den Berge *et al. *2017) uses the same method as a first step. Unfortunately, naïve DTE->G methods perform poorly when genes change by DGE, as shown in Yi *et al.* 2017. The same paper shows that Lancaster aggregation is a DTE->G method that achieves the best of both the DGE and DTU worlds. One major drawback of DTE->G methods is that they are non-constructive, i.e. the rejection of UDTE by a DTE->G method provides no information about *which* transcripts were differential and how. The stageR method averts this problem but requires sacrificing some power to reject UDTE in favor of the interpretability provided by subsequent DTE.

A **gene differential expression method **is a method for gene-level analysis that tests for differences *in the direction of change *identified between conditions. For a GDE method to be successful, it must be able to identify the direction of change, and that is not possible with bulk RNA-Seq data. This is because of the one in ten rule that states that approximately one predictive variable can be estimated from ten events. In bulk RNA-Seq, the number of replicates in standard experiments is three, and the number of isoforms in multi-isoform genes is at least two, and sometimes much more than that.

In Ntranos, Yi *et al.* 2018, it is shown that single-cell RNA-Seq provides enough “replicates” in the form of cells, that logistic regression can be used to predict condition based on expression, effectively identifying the direction of change. As such, it provides an alternative to DTE->G for rejecting UDTE. The Ntranos and Yi GDE methods is extremely powerful: by identifying the direction of change it is a DGE methods when the change is DGE, it is a DTU method when the change is DTU, and it is a DTE method when the change is DTE. Interpretability is provided in the prediction step: it is the estimated direction of change.

### Remarks

The discussion in this post is based on an example consisting of a gene with two isoforms, however the concepts discussed are easy to generalize to multi-isoform genes with more than two transcripts. I have not discussed differential exon usage (DEU), which is the focus of the DEXSeq method because of the complexities arising in genes which don’t have well-defined shared exons. Nevertheless, the DEXSeq approach to rejecting UDTE is similar to DTE->G, with DTE replaced by DEU. There are many programs for DTE, DTU and (especially) DGE that I haven’t mentioned; the ones cited are intended merely to serve as illustrative examples. This is not a comprehensive review of RNA-Seq differential expression methods.

**Acknowledgments**

The blog post was motivated by questions of Charlotte Soneson and Mark Robinson arising from an initial draft of the Ntranos, Yi *et al.* 2018 paper. The exposition was developed with Vasilis Ntranos and Lynn Yi. Valentine Svensson provided valuable comments and feedback.

One of my favorite systems biology papers is the classic “Stochastic Gene Expression in a Single Cell” by Michael Elowitz, Arnold J. Levine, Eric D. Siggia and Peter S. Swain (in this post I refer to it as the ELSS paper).

What I particularly like about the paper is that it resulted from computational biology flipped. Unlike most genomics projects, where statistics and computation are servants of the data, in ELSS a statistical idea was implemented with biology, and the result was an elegant experiment that enabled a fundamental measurement to be made.

The fact the ELSS implemented a statistical idea with biology makes the statistics a natural starting point for understanding the paper. The statistical idea is what is known as *the law of total variance*. Given a random (response) variable with a random covariate , the law of total variance states that the variance of can be decomposed as:

.

There is a biological interpretation of this law that also serves to explain it: If the random variable denotes the expression of a gene in a single cell ( being a random variable means that the expression is stochastic), and denotes the (random) state of a cell, then the law of total variance describes the “total noise” in terms of what can be considered “intrinsic” (also called “unexplained”) and “extrinsic” (also called “explained”) noise or variance.

To understand intrinsic noise, first one understands the expression to be the conditional variance, which is also a random variable; its possible values are the *variance* of the gene expression in different cell states. If does not depend on then the expression of the gene is said to be *homoscedastic*, i.e., it does not depend on cell state (if it does then it is said to be *heteroscedastic*). Because is a random variable, the expression makes sense, it is simply the average variance (of the gene expression in single cells) across cell states (weighted by their probability of occurrence), thus the term “intrinsic noise” to describe it.

The expression is a random variable whose possible values are the average of the gene expression in cells. Thus, is the variance of the averages; intuitively it can be understood to describe the noise arising from different cell state, thus the term “extrinsic noise” to describe it (see here for a useful interactive visualization for exploring the law of total variance).

The idea of ELSS was to design an experiment to measure the extent of intrinsic and extrinsic noise in gene expression by inserting two identically regulated reporter genes (cyan fluorescent protein and yellow fluorescent protein) into *E. coli *and measuring their expression in different cells. What this provides are measurements from the following model:

Random cell states are represented by random variables which are independent and identically distributed, one for each of *n* different cells, while random variables and correspond to the gene expression of the cyan , respectively yellow, reporters in those cells. The ELSS experiment produces a single sample from each variable and , i.e. a pair of measurements for each cell*. *A hierarchical model for the experiment, in which the marginal (unconditional) distributions and are identical, allows for estimating the intrinsic and extrinsic noise from the reporter expression measurements.

The model above, on which ELSS is based, was not described in the original paper (more on that later). Instead, in ELSS the following estimates for intrinsic, extrinsic and total noise were simply written down:

(intrinsic noise)

(extrinsic noise)

(total noise)

Here and are the measurements of cyan respectively yellow reporter expression in each cell, and .

Last year, Audrey Fu, at the time a visiting scholar in Jonathan Pritchard’s lab and now assistant professor in statistical science at the University of Idaho, studied the ELSS paper as part of a journal club. She noticed some inconsistencies with the proposed estimates in the paper, e.g. it seemed to her that some were biased, whereas others were not, and she proceeded to investigate in more detail the statistical basis for the estimates. There had been a few papers trying to provide statistical background, motivation and interpretation for the formulas in ELSS (e.g. A. Hilfinger and J. Paulsson, Separating intrinsic from extrinsic fluctuations in dynamic biological systems, 2011 ), but there had not been an analysis of bias, or for that matter other statistical properties of the estimates. Audrey started to embark on a post-publication peer review of the paper, and having seen such reviews on my blog contacted me to ask whether I would be interested to work with her. The project has been a fun hobby of ours for the past couple of months, eventually leading to a manuscript that we just posted on the arXiv:

Audrey Fu and Lior Pachter, Estimating intrinsic and extrinsic noise from single-cell gene expression measurements, arXiv 2016.

Our work provides what I think of as a “statistical companion” to the ELSS paper. First, we describe a formal hierarchical model which provides a useful starting point for thinking about estimators for intrinsic and extrinsic noise. With the model we re-examine the ELSS formulas, derive alternative estimators that either minimize bias or mean squared error, and revisit the intuition that underlies the extraction of intrinsic and extrinsic noise from data. Details are in the paper, but I briefly review some of the highlights here:

Figure 3a of the ELSS paper shows a scatterplot of data from two experiments, and provides a geometric interpretation of intrinsic and extrinsic noise that can guide intuition about them. We have redrawn their figure (albeit with a handful of points rather than with real data) in order to clarify the connections to the statistics:

The Elowitz *et al. *caption correctly stated that “Each point represents the mean fluorescence intensities from one cell. Spread of points perpendicular to the diagonal line on which CFP and YFP intensities are equal corresponds to intrinsic noise, whereas spread parallel to this line is increased by extrinsic noise”. While both statements are true, the one about intrinsic noise is precise whereas the one about extrinsic noise can be refined. In fact, the ELSS extrinsic noise estimate is the sample covariance (albeit biased due to a prefix of *n* in the denominator rather than *n-1*), an observation made by Hilfinger and Paulsson. The sample covariance has a (well-known) geometric interpretation: Specifically, we explain that it is the average (signed) area of triangles formed by pairs of data points (one the blue one in the figure): green triangles in Q1 and Q3 (some not shown) represent a positive contribution to the covariance and magenta triangles in Q2 and Q4 represent a negative contribution. Since most data points lie in the 1st (Q1) and 3rd (Q3) quadrants relative to the blue point, most of the contribution involving the blue point is positive. Similarly, since most pairs of data points can be connected by a positively signed line, their positive contribution will result in a positive covariance. We also explain why naïve intuition of extrinsic noise as the variance of points along the line is problematic.

The estimators we derive are summarized in the table below (Table 1 from our paper):

There is a bit of algebra that is required to derive formulas in the table (see the appendices of our paper). The take home messages are that:

- There is a subtle assumption underlying the ELSS intrinsic noise estimator that makes sense for the experiments in the ELSS paper, but not for every type of experiment in which the ELSS formulas are currently used. This has to do with the mean expression level of the two reporters, and we provide a rationale and criteria when to apply quantile normalization to normalize expression to the data.
- The ELSS intrinsic noise estimator is unbiased, whereas the ELSS extrinsic noise estimator is (slightly) biased. This asymmetry can be easily rectified with adjustments we derive.
- The standard unbiased estimator for variance (obtained via the Bessel correction) is frequently, and correctly, criticized for trading off mean squared error for bias. In practice, it can be more important to minimize mean squared error (MSE). For this reason we derive MSE minimizing estimators. While the MSE minimizing estimates converge quickly to the unbiased estimates (as a function of the number of cells), we believe there may be applications of the law of total variance to problems in systems biology where sample sizes are smaller, in which case our formulas may become useful.

The ELSS paper has been cited more than 3,000 times and with the recent emergence of large scale single-cell RNA-Seq the ideas in the paper are more relevant than ever. Hopefully our statistical companion to the paper will be useful to those seeking to obtain a better understanding of the methods, or those who plan to extend and apply them.

## Recent Comments