You are currently browsing the monthly archive for February 2019.

Bi/BE/CS183 is a computational biology class at Caltech with a mix of undergraduate and graduate students. Matt Thomson and I are co-teaching the class this quarter with help from teaching assistants Eduardo Beltrame, Dongyi (Lambda) Lu and Jialong Jiang. The class has a focus on the computational biology of single-cell RNA-seq analysis, and as such we recently taught an introduction to single-cell RNA-seq technologies. We thought the slides used would be useful to others so we have published them on figshare:

Eduardo Beltrame, Jase Gehring, Valentine Svensson, Dongyi Lu, Jialong Jiang, Matt Thomson and Lior Pachter, Introduction to single-cell RNA-seq technologies, 2019. doi.org/10.6084/m9.figshare.7704659.v1

Thanks to Eduardo Beltrame, Jase Gehring and Valentine Svensson for many extensive and helpful discussions that clarified many of the key concepts. Eduardo Beltrame and Valentine Svensson performed new analysis (slide 28) and Jase Gehring resolved the tangle of “doublet” literature (slides 17–25). The 31 slides were presented in a 1.5 hour lecture. Some accompanying notes that might be helpful to anyone interested in using them are included for reference (numbered according to slide) below:

1. The first (title) slide makes the point that single-cell RNA-seq is sufficiently complicated that a deep understanding of the details of the technologies, and methods for analysis, is required. The #methodsmatter.
2. The second slide presents an overview of attributes associated with what one might imagine would constitute an ideal single-cell RNA-seq experiment. We tried to be careful about terminology and jargon, and therefore widely used terms are italicized and boldfaced.
3. This slide presents Figure 1 from Svensson et al. 2018. This is an excellent perspective that highlights key technological developments that have spurred the growth in popularity of single-cell RNA-seq. At this time (February 2019) the largest single-cell RNA-seq dataset that has been published consists of 690,000 Drop-seq adult mouse brain cells (Saunders, Macosko et al. 2018). Notably, the size and complexity of this dataset rivals that of a large-scale genome project that until recently would be undertaken by hundreds of researchers. The rapid adoption of single-cell RNA-seq is evident in the growth of records in public sequence databases.
4. The Chen et al. 2018 review on single-cell RNA-seq is an exceptionally useful and thorough review that is essential reading in the field. The slide shows Figure 2 which is rich in information and summarizes some of the technical aspects of single-cell RNA-seq technologies. Understanding of the details of individual protocols is essential to evaluating and assessing the strengths and weaknesses of different technologies for specific applications.
5. Current single-cell RNA-seq technologies can be broadly classified into two groups: well-based and droplet-based technologies. The Papalexi and Satija 2017 review provides a useful high-level overview and this slide shows a part of Figure 1 from the review.
6. The details of the SMART-Seq2 protocol are crucial for understanding the technology. SMART is a clever acronym for Switching Mechanism At the 5′ end of the RNA Transcript. It allows the addition of an arbitrary primer sequence at the 5′ end of a cDNA strand, and thus makes full length cDNA PCR possible. It relies on the peculiar properties of the reverse transcriptase from the Moloney murine leukemia virus (MMLV), which, upon reaching the 5’ end of the template, will add a few extra nucleotides (usually Cytosines). The resultant overhang is a binding site for the “template switch oligo”, which contains three riboguanines (rGrGrG). Upon annealing, the reverse transcriptase “switches” templates, and continues transcribing the DNA oligo, thus adding a constant sequence to the 5’ end of the cDNA. After a few cycles of PCR, the full length cDNA generated is too long for Illumina sequencing (where a maximum length of 800bp is desirable). To chop it up into smaller fragments of appropriate size while simultaneously adding the necessary Illumina adapter sequences, one can can use the Illumina tagmentation Nextera™ kits based on Tn5 tagmentation. The SMART template switching idea is also used in the Drop-seq and 10x genomics technologies.
7. While it is difficult to rate technologies exactly on specific metrics, it is possible to identify strengths and weaknesses of distinct approaches. The SMART-Seq2 technology has a major advantage in that it produces reads from across transcripts, thereby providing “full-length” information that can be used to quantify individual isoforms of genes. However this superior isoform resolution requires more sequencing, and as a result makes the method less cost effective. Well-based methods, in general, are not as scalable as droplet methods in terms of numbers of cells assayed. Nevertheless, the tradeoffs are complex. For example robotics technologies can be used to parallelize well-based technologies, thereby increasing throughput.
8. The cost of a single-cell technology is difficult to quantify. Costs depend on number of cells assayed as well as number of reads sequenced, and different technologies have differing needs in terms of reagents and library preparation costs. Ziegenhain et al. 2017 provide an in-depth discussion of how to assess cost in terms of accuracy and power, and the table shown in the slide is reproduced from part of Table 1 in the paper.
9. A major determinant of single-cell RNA-seq cost is sequencing cost. This slide shows sequencing costs at the UC Davis Genome Center and its purpose is to illustrate numerous tradeoffs, relating to throughput, cost per base, and cost per fragment that must be considered when selecting which sequencing machine to use. In addition, sequencing time frequently depends on core facilities or 3rd party providers multiplexing samples on machines, and some sequencing choices are likely to face more delay than others.
10. Turning to droplet technologies based on microfluidics, two key papers are the Drop-seq and inDrops papers which were published in the same issue of a single journal in 2015. The papers went to great lengths to document the respective technologies, and to provide numerous text and video tutorials to facilitate adoption by biologists. Arguably, this emphasis on usability (and not just reproducibility) played a major role in the rapid adoption of single-cell RNA-seq by many labs over the past three years. Two other references on the slide point to pioneering microfluidics work by Rustem Ismagilov, David Weitz and their collaborators that made possible the numerous microfluidic single-cell applications that have since been developed.
11.  This slide displays a figure showing a monodispersed emulsion from the recently posted preprint “Design principles for open source bioinstrumentation: the poseidon syringe pump system as an example” by Booeshaghi et al., 2019. The generation of such emulsions is a prerequisite for successful droplet-based single-cell RNA-seq. In droplet based single-cell RNA-seq, emulsions act as “parallelizing agents”, essentially making use of droplets to parallelize the biochemical reactions needed to capture transcriptomic (or other) information from single-cells.
12. The three objects that are central to droplet based single-cell RNA-seq are beads, cells and droplets. The relevance of emulsions in connection to these objects is that the basis of droplet methods for single-cell RNA-seq is the encapsulation of single cells together with single beads in the emulsion droplest. The beads are “barcode” delivery vectors. Barcodes are DNA sequences that are associated to transcripts, which are accessible after cell lysis in droplets. Therefore, beads must be manufactured in a way that ensures that each bead is coated with the same barcodes, but that the barcodes associated with two distinct beads are different from each other.
13. The inDrops approach to single-cell RNA-seq serves as a useful model for droplet based single-cell RNA-seq methods. The figure in the slide is from a protocol paper by Zilionis et al. 2017 and provides a useful overview of inDrops. In panel (a) one sees a zoom-in of droplets being generated in a microfluidic device, with channels delivering cells and beads highlighted. Panel (b) illustrates how inDrops hydrogel beads are used once inside droplets: barcodes (DNA oligos together with appropriate priming sequences) are released from the hydrogel beads and allow for cell barcoded cDNA synthesis. Finally, panel (c) shows the sequence construct of oligos on the beads.
14. This slide is analogous to slide 6, and shows an overview of the protocols that need to be followed both to make the hydrogel beads used for inDrops, and the inDrops protocol itself.  In a clever dual use of microfluidics, inDrops makes the hydrogel beads in an emulsion. Of note in the inDrops protocol itself is the fact that it is what is termed a “3′ protocol”. This means that the library, in addition to containing barcode and other auxiliary sequence, contains sequence only from 3′ ends of transcripts (seen in grey in the figure). This is the case also with other droplet based single-cell RNA-seq technologies such as Drop-seq or 10X Genomics technology.
15. The significance of 3′ protocols it is difficult to quantify individual isoforms of genes from the data they produce. This is because many transcripts, while differing in internal exon structure, will share a 3′ UTR. Nevertheless, in exploratory work aimed at investigating the information content delivered by 3′ protocols, Ntranos et al. 2019 show that there is a much greater diversity of 3′ UTRs in the genome than is currently annotated, and this can be taken advantage of to (sometimes) measure isoform dynamics with 3′ protocols.
16. To analyze the various performance metrics of a technology such as inDrops it is necessary to understand some of the underlying statistics and combinatorics of beads, cells and drops. Two simple modeling assumptions that can be made is that the number of cells and beads in droplets are each Poisson distributed (albeit with different rate parameters). Specifically, we assume that
$\mathbb{P}(\mbox{droplet has } k \mbox{ cells}) = \frac{e^{-\lambda}\lambda^k}{k!}$ and $\mathbb{P}(\mbox{droplet has } k \mbox{ beads}) = \frac{e^{-\mu}\mu^j}{j!}$. These assumptions are reasonable for the Drop-seq technology. Adjustment of concentrations and flow rates of beads and cells and oil allows for controlling the rate parameters of these distributions and as a result allow for controlling numerous tradeoffs which are discussed next.
17. The cell capture rate of a technology is the fraction of input cells that are assayed in an experiment. Droplets that contain one or more cells but no beads will result in a lost cells whose transcriptome is not measured. The probability that a droplet has no beads is $e^{-\mu}$ and therefore the probability that a droplet has at least one bead is $1-e^{-\mu}$. To raise the capture rate it is therefore desirable to increase the Poisson rate $\mu$ which is equal to the average number of beads in a droplet. However increasing $\mu$ leads to duplication, i.e. cases where a single droplet has more than one bead, thus leading .a single cell transcriptome to appear as two or more cells. The duplication rate is the fraction of assayed cells which were captured with more than one bead. The duplication rate can be calculated as $\frac{\mathbb{P}(\mbox{droplet has 2 or more beads})}{\mathbb{P}(\mbox{droplet has 1 or more beads})}$ (which happens to be equivalent to a calculation of the probability that we are alone in the universe). The tradeoff, shown quantitatively as capture rate vs. duplication rate, is shown in a figure I made for the slide.
19. This slide shows a comparison of bead technologies. In addition to inDrops, 10x Genomics single-cell RNA-seq technology also uses hydrogel beads. There are numerous technical details associated with beads including whether or not barcodes are released (and how).
20. Barcode collisions arise when two cells are separately encapsulated with beads that contain identical barcodes. The slide shows the barcode collision rate formula, which is $1-\left( 1-\frac{1}{M} \right)^{N-1}$. This formula is derived as follows: Let $p=\frac{1}{M}$. The probability that a barcodes is associated with k cells is given by the binomial formula ${N \choose k}p^k(1-p)^{N-k}$. Thus, the probability that a barcode is associated to exactly one cell is $Np(1-p)^{N-1} = \frac{N}{M}\left(1-\frac{1}{M}\right)^{N-1}$. Therefore the expected number of cells with a unique barcode is $N\left(1-\frac{1}{M}\right)^{N-1}$ and the barcode collision rate is $\left(1-\frac{1}{M}\right)^{N-1}$. This is approximately $1-\left( \frac{1}{e} \right)^{\frac{N}{M}}$. The term synthetic doublet is used to refer to the situation when two or more different cells appear to be a single cell due to barcode collision.
21. In the notation of the previous slide, the barcode diversity is $\frac{N}{M}$, which is an important number in that it determines the barcode collision rate. Since barcodes are encoded in sequence, a natural question is what sequence length is needed to ensure a desired level of barcode diversity. This slide provides a lower bound on the sequence length.
22. Technical doublets are multiple cells that are captured in a single droplet, barcoded with the same sequence, and thus the transcripts that are recorded from them appear to originate from a single-cell.  The technical doublet rate can be estimated using a calculation analogous to the one used for the cell duplication rate (slide 17), except that it is a function of the Poisson rate $\lambda$ and not $\mu$. In the single-cell RNA-seq literature the term “doublet” generally refers to technical doublets, although it is useful to distinguish such doublets from synthetic doublets and biological doublets (slide 25).
23. One way to estimate the technical doublet rate is via pooled experiments of cells from different species. The resulting data can be viewed in what has become known as a “Barnyard” plot, a term originating from the Macosko et al. 2015 Drop-seq paper. Despite the authors’ original intention to pool mouse, chicken, cow and pig cells, typical validation of single-cell technology now proceeds with a mixture of human and mouse cells. Doublets are readily identifiable in barnyard plots as points (corresponding to droplets) that display transcript sequence from two different species. The only way for this to happen is via the capture of a doublet of cells (one from each species) in a single droplet. Thus, validation of single-cell RNA-seq rigs via a pooled experiment, and assessment of the resultant barnyard plot, has become standard for ensuring that the data is indeed single cell.
24.  It is important to note that while points on the diagonal of barnyard plots do correspond to technical doublets, pooled experiments, say of human and mouse cells, will also result in human-human and mouse-mouse doublets that are not evident in barnyard plots. To estimate such within species technical doublets from a “barnyard analysis”, Bloom 2018 developed a simple procedure that is described in this slide. The “Bloom correction” is important to perform in order to estimate doublet rate from mixed species experiments.
25. Biological doublets arise when two cells form a discrete unit that does not break apart during disruption to form a suspension. Such doublets are by definition from the same species, and therefore will not be detected in barnyard plots. One way to avoid biological doublets is via single nucleus RNA-seq (see, Habib et al. 2017). Single nuclear RNA-seq has proved to be important for experiments involving cells that are difficult to dissociate, e.g. human brain cells. In fact, the formation of suspensions is a major bottleneck in the adoption of single-cell RNA-seq technology, as techniques vary by tissue and organism, and there is no general strategy. On the other hand, in a recent interesting paper, Halpern et al. 2018 show that biological doublets can sometimes be considered a feature rather than a bug. In any case, doublets are more complicated than initially appears, and we have by now seen that there are three types of doublets: synthetic doublets, technical doublets, and biological doublets .
26. Unique molecular identifiers (UMIs) are part of the oligo constructs for beads in droplet single-cell RNA-seq technologies. They are used to identify reads that originated from the same molecule, so that double counting of such molecules can be avoided. UMIs are generally much shorter than cell barcodes, and estimates of required numbers (and corresponding sequence lengths) proceed analogously to the calculations for cell barcodes.
27. The sensitivity of a single-cell RNA-seq technology (also called transcript capture rate) is the fraction of transcripts captured per cell. Crucially, the sensitivity of a technology is dependent on the amount sequenced, and the plot in this slide (made by Eduardo Beltrame and Valentine Svensson by analysis of data from Svensson et al. 2017) shows that dependency.  The dots in the figure are cells from different datasets that included spike-ins, whose capture is being measured (y-axis). Every technology displays an approximately linear relationship between number of reads sequenced and transcripts captured, however each line is describe by two parameters: a slope and an intercept. In other words, specification of sensitivity for a technology requires reporting two numbers, not one.
28. Having surveyed the different attributes of droplet technologies, this slide summarizes some of the pros and cons of inDrops similarly to slide 7 for SMART-Seq2.
29. While individual aspects of single-cell RNA-seq technologies can be readily understood via careful modeling coupled to straightforward quality control experiments, a comprehensive assessment of whether a technology is “good” or “bad” is meaningless. The figure in this slide, from Zhang et al. 2019, provides a useful overview of some of the important technical attributes.
30. In terms of the practical question “which technology should I choose for my experiment?”, the best that can be done is to offer very general decision workflows. The flowchart shown on this slide, also from Zhang et al. 2019, is not very specific but does provide some simple to understand checkpoints to think about. A major challenge in comparing and contrasting technologies for single-cell RNA-seq is that they are all changing very rapidly as numerous ideas and improvements are now published weekly.
31. The technologies reviewed in the slides are strictly transcriptomics methods (single-cell RNA-seq). During the past few years there has been a proliferation of novel multi-modal technologies that simultaneously measures the transcriptome of single cells along with other modalities. Such technologies are reviewed in Packer and Trapnell 2018 and the slide mentions a few of them.

Five years ago on this day, Nicolas Bray and I wrote a blog post on The network nonsense of Manolis Kellis in which we described the paper Feizi et al. 2013 from the Kellis lab as dishonest and fraudulent. Specifically, we explained that:

“Feizi et al. have written a paper that appears to be about inference of edges in networks based on a theoretically justifiable model but

1. the method used to obtain the results in the paper is completely different than the idealized version sold in the main text of the paper and
2. the method actually used has parameters that need to be set, yet no approach to setting them is provided. Even worse,
3. the authors appear to have deliberately tried to hide the existence of the parameters. It looks like
4. the reason for covering up the existence of parameters is that the parameters were tuned to obtain the results. Moreover,
5. the results are not reproducible. The provided data and software is not enough to replicate even a single figure in the paper. This is disturbing because
6. the performance of the method on the simplest of all examples, a correlation matrix arising from a Gaussian graphical model, is poor.”

A second point we made is that the justification for the method, which the authors called “network deconvolution” was nonsense. For example, the authors wrote that “The model assumes that networks are “linear time-invariant flow-preserving operators.” Perhaps I take things too literally when I read papers but I have to admit that five years later I still don’t understand the sentence. However just because a method is ad-hoc, heuristic, or perhaps poorly explained, doesn’t mean it won’t work well in practice. In the blog post we compared network deconvolution to regularized partial correlation on simulated data, and found network deconvolution performed poorly. But in a responding comment, Kellis noted that “in our experience, partial correlation performed very poorly in practice.” He added that “We have heard very positive feedback from many other scientists using our software successfully in diverse applications.”

Fortunately we can now evaluate Kellis’ claims in light of an independent analysis in Wang, Pourshafeie, Zitnik et al. 2018, a paper from the groups of Serafim Batzoglou and Jure Leskovec (in collaboration with Carlos Bustamante) at Stanford University. There are three main results presented in Wang, Pourshafeie and Zitnik et al. 2018 that summarize the benchmarking of network deconvolution and other methods, and I reproduce figures showing the results below. The first shows the performance of network deconvolution and some other network denoising methods on a problem of butterfly species identification (network deconvolution is abbreviated ND and is shown in green). RAW (in blue) is the original unprocessed network. Network deconvolution is much worse than RAW:

The second illustrates the performance of network denoising methods on Hi-C data. The performance metric in this case is normalized mutual information (NMI) which Wang, Pourshafeie, Zitnik et al. described as “a fair representation of overall performance”. Network deconvolution (ND, dark green) is again worse than RAW (dark blue):

Finally, in an analysis of gene function from tissue-specific gene interaction networks, ND (blue) does perform better than RAW (pink) although barely. In four cases out of eight shown it is the worst of the four methods benchmarked:

Network deconvolution was claimed to be applicable to any network when it was published. At the time, Feizi stated that “We applied it to gene networks, protein folding, and co-authorship social networks, but our method is general and applicable to many other network science problems.” A promising claim, but in reality it is difficult to beat the nonsense law: Nonsense methods tend to produce nonsense results.

The Feizi et al. 2013 paper now has 178 citations, most of them drive by citations. Interestingly this number, 178 is exactly the number of citations of the Barzel et al. 2013 network nonsense paper, which was published in the same issue of Nature Biotechnology. Presumably this reflects the fact that authors citing one paper feel obliged to cite the other. These pair of papers were thus an impact factor win for the journal. For the first authors on the papers, the network deconvolution/silencing work is their most highly cited first author papers respectively. Barzel is an assistant professor at Bar-Ilan University where he links to an article about his network nonsense on his “media page”. Feizi is an assistant professor at the University of Maryland where he lists Feizi et al. 2013 among his “selected publications“. Kellis teaches the “network deconvolution” and its associated nonsense in his computational biology course at MIT. And why not? These days truth seems to matter less and less in every domain. A statement doesn’t have to be true, it just has to work well on YouTube, Twitter, Facebook, or some webpage, and as long as some people believe it long enough, say until the next grant cycle, promotion evaluation, or election, then what harm is done? A win-win for everyone. Except science.

The encapsulation of beads together with cells in droplets is the basis of microfluidic based single-cell RNA-seq technologies. Ideally droplets contain exactly one bead and one cell, however in practice the number of beads and cells in droplets is stochastic and encapsulation of cells in droplets produces an approximately Poisson distribution of number of cells per droplet:

Specifically, the probability of observing k cells in a droplet is approximated by

$\mathbb{P}(\mbox{k cells in a droplet}) = \frac{e^{-\lambda}\lambda^k}{k!}$.

The rate parameter $\lambda$ can be controlled and the average number of cells per droplet is equal to it. Therefore, setting $\lambda$ to be much less than 1 ensures that two or more cells are rarely encapsulated in a single droplet. A consequence of this is that the number of empty droplets, given by $e^{-\lambda}$, is large. Importantly, one of the properties of the Poisson distribution is that variance is equal to the mean so the number of cells per droplet is also equal to $\lambda$.

Along with cells, beads must also be captured in droplets, and when plastic beads are used the occupancy statistics follow a Poisson distribution as well. This means that with technologies such as Drop-seq (Macosko et al. 2015), which uses polystyrene beads, many droplets are either empty, contain a bead and no cell, or a cell and no bead. The latter situation (cell and no bead) leads to a low “capture rate”, i.e. not many of the cells are assayed in an experiment.

One of the advantages of the inDrops method (Klein et al. 2015) over other single-cell RNA-seq methods is that it uses hydrogel beads which allow for a reduction in the variance of the number of beads per cell. In an important paper Abate et al. 2009 showed that close packing of hydrogel beads allows for an almost degenerate distribution where the number of beads per droplet is exactly one 98% of the time. The video below shows how close to degeneracy the distribution can be squeezed (in the example two beads are being encapsulated per droplet):

A discrete distribution defined over the non-negative integers with variance less than the mean is called sub-Poisson. Similarly, a discrete distribution defined over the non-negative integers with variance greater than the mean is called super-Poisson. This terminology dates back to at least the 1940s (e.g., Berkson et al. 1942) and is standard in many fields from physics (e.g. Rodionov and Cherkin 2004) to biology (e.g. Pitchiaya et al. 2014 ).

Figure 5.26 from Adrian Jeantet, Cavity quantum electrodynamics with carbon nanotubes, 2017.

Using this terminology, the close packing of hydrogel beads can be said to enable sub-Poisson loading of beads into droplets because the variance of beads per droplet is reduced in comparison to the Poisson statistics of plastic beads.

Unfortunately, in a 2015 paper, Bose et al. used the term “super-Poisson” instead of “sub-Poisson” in discussing an approach to reducing bead occupancy variance in the single-cell RNA-seq context. This sign error in terminology has subsequently been propagated and recently appeared in a single- cell RNA-seq review (Zhang et al. 2018) and in 10x Genomics advertising materials.

When it comes to single-cell RNA-seq we already have people referring to the number of reads sequenced as “the library size” and calling trees “one-dimensional manifolds“. Now sub-Poisson is mistaken for super-Poisson. Before you know it we’ll have professors teaching students that cell clusters obtained by k-means clustering are “cell types“…

Supper poisson (not to be confused with super-Poisson (not to be confused with sub-Poisson)).