Bi/BE/CS183 is a computational biology class at Caltech with a mix of undergraduate and graduate students. Matt Thomson and I are co-teaching the class this quarter with help from teaching assistants Eduardo Beltrame, Dongyi (Lambda) Lu and Jialong Jiang. The class has a focus on the computational biology of single-cell RNA-seq analysis, and as such we recently taught an introduction to single-cell RNA-seq technologies. We thought the slides used would be useful to others so we have published them on figshare:

Eduardo Beltrame, Jase Gehring, Valentine Svensson, Dongyi Lu, Jialong Jiang, Matt Thomson and Lior Pachter, Introduction to single-cell RNA-seq technologies, 2019.

Thanks to Eduardo Beltrame, Jase Gehring and Valentine Svensson for many extensive and helpful discussions that clarified many of the key concepts. Eduardo Beltrame and Valentine Svensson performed new analysis (slide 28) and Jase Gehring resolved the tangle of “doublet” literature (slides 17–25). The 31 slides were presented in a 1.5 hour lecture. Some accompanying notes that might be helpful to anyone interested in using them are included for reference (numbered according to slide) below:

  1. The first (title) slide makes the point that single-cell RNA-seq is sufficiently complicated that a deep understanding of the details of the technologies, and methods for analysis, is required. The #methodsmatter.
  2. The second slide presents an overview of attributes associated with what one might imagine would constitute an ideal single-cell RNA-seq experiment. We tried to be careful about terminology and jargon, and therefore widely used terms are italicized and boldfaced.
  3. This slide presents Figure 1 from Svensson et al. 2018. This is an excellent perspective that highlights key technological developments that have spurred the growth in popularity of single-cell RNA-seq. At this time (February 2019) the largest single-cell RNA-seq dataset that has been published consists of 690,000 Drop-seq adult mouse brain cells (Saunders, Macosko et al. 2018). Notably, the size and complexity of this dataset rivals that of a large-scale genome project that until recently would be undertaken by hundreds of researchers. The rapid adoption of single-cell RNA-seq is evident in the growth of records in public sequence databases.
  4. The Chen et al. 2018 review on single-cell RNA-seq is an exceptionally useful and thorough review that is essential reading in the field. The slide shows Figure 2 which is rich in information and summarizes some of the technical aspects of single-cell RNA-seq technologies. Understanding of the details of individual protocols is essential to evaluating and assessing the strengths and weaknesses of different technologies for specific applications.
  5. Current single-cell RNA-seq technologies can be broadly classified into two groups: well-based and droplet-based technologies. The Papalexi and Satija 2017 review provides a useful high-level overview and this slide shows a part of Figure 1 from the review.
  6. The details of the SMART-Seq2 protocol are crucial for understanding the technology. SMART is a clever acronym for Switching Mechanism At the 5′ end of the RNA Transcript. It allows the addition of an arbitrary primer sequence at the 5′ end of a cDNA strand, and thus makes full length cDNA PCR possible. It relies on the peculiar properties of the reverse transcriptase from the Moloney murine leukemia virus (MMLV), which, upon reaching the 5’ end of the template, will add a few extra nucleotides (usually Cytosines). The resultant overhang is a binding site for the “template switch oligo”, which contains three riboguanines (rGrGrG). Upon annealing, the reverse transcriptase “switches” templates, and continues transcribing the DNA oligo, thus adding a constant sequence to the 5’ end of the cDNA. After a few cycles of PCR, the full length cDNA generated is too long for Illumina sequencing (where a maximum length of 800bp is desirable). To chop it up into smaller fragments of appropriate size while simultaneously adding the necessary Illumina adapter sequences, one can can use the Illumina tagmentation Nextera™ kits based on Tn5 tagmentation. The SMART template switching idea is also used in the Drop-seq and 10x genomics technologies.
  7. While it is difficult to rate technologies exactly on specific metrics, it is possible to identify strengths and weaknesses of distinct approaches. The SMART-Seq2 technology has a major advantage in that it produces reads from across transcripts, thereby providing “full-length” information that can be used to quantify individual isoforms of genes. However this superior isoform resolution requires more sequencing, and as a result makes the method less cost effective. Well-based methods, in general, are not as scalable as droplet methods in terms of numbers of cells assayed. Nevertheless, the tradeoffs are complex. For example robotics technologies can be used to parallelize well-based technologies, thereby increasing throughput.
  8. The cost of a single-cell technology is difficult to quantify. Costs depend on number of cells assayed as well as number of reads sequenced, and different technologies have differing needs in terms of reagents and library preparation costs. Ziegenhain et al. 2017 provide an in-depth discussion of how to assess cost in terms of accuracy and power, and the table shown in the slide is reproduced from part of Table 1 in the paper.
  9. A major determinant of single-cell RNA-seq cost is sequencing cost. This slide shows sequencing costs at the UC Davis Genome Center and its purpose is to illustrate numerous tradeoffs, relating to throughput, cost per base, and cost per fragment that must be considered when selecting which sequencing machine to use. In addition, sequencing time frequently depends on core facilities or 3rd party providers multiplexing samples on machines, and some sequencing choices are likely to face more delay than others.
  10. Turning to droplet technologies based on microfluidics, two key papers are the Drop-seq and inDrops papers which were published in the same issue of a single journal in 2015. The papers went to great lengths to document the respective technologies, and to provide numerous text and video tutorials to facilitate adoption by biologists. Arguably, this emphasis on usability (and not just reproducibility) played a major role in the rapid adoption of single-cell RNA-seq by many labs over the past three years. Two other references on the slide point to pioneering microfluidics work by Rustem Ismagilov, David Weitz and their collaborators that made possible the numerous microfluidic single-cell applications that have since been developed.
  11.  This slide displays a figure showing a monodispersed emulsion from the recently posted preprint “Design principles for open source bioinstrumentation: the poseidon syringe pump system as an example” by Booeshaghi et al., 2019. The generation of such emulsions is a prerequisite for successful droplet-based single-cell RNA-seq. In droplet based single-cell RNA-seq, emulsions act as “parallelizing agents”, essentially making use of droplets to parallelize the biochemical reactions needed to capture transcriptomic (or other) information from single-cells.
  12. The three objects that are central to droplet based single-cell RNA-seq are beads, cells and droplets. The relevance of emulsions in connection to these objects is that the basis of droplet methods for single-cell RNA-seq is the encapsulation of single cells together with single beads in the emulsion droplest. The beads are “barcode” delivery vectors. Barcodes are DNA sequences that are associated to transcripts, which are accessible after cell lysis in droplets. Therefore, beads must be manufactured in a way that ensures that each bead is coated with the same barcodes, but that the barcodes associated with two distinct beads are different from each other.
  13. The inDrops approach to single-cell RNA-seq serves as a useful model for droplet based single-cell RNA-seq methods. The figure in the slide is from a protocol paper by Zilionis et al. 2017 and provides a useful overview of inDrops. In panel (a) one sees a zoom-in of droplets being generated in a microfluidic device, with channels delivering cells and beads highlighted. Panel (b) illustrates how inDrops hydrogel beads are used once inside droplets: barcodes (DNA oligos together with appropriate priming sequences) are released from the hydrogel beads and allow for cell barcoded cDNA synthesis. Finally, panel (c) shows the sequence construct of oligos on the beads.
  14. This slide is analogous to slide 6, and shows an overview of the protocols that need to be followed both to make the hydrogel beads used for inDrops, and the inDrops protocol itself.  In a clever dual use of microfluidics, inDrops makes the hydrogel beads in an emulsion. Of note in the inDrops protocol itself is the fact that it is what is termed a “3′ protocol”. This means that the library, in addition to containing barcode and other auxiliary sequence, contains sequence only from 3′ ends of transcripts (seen in grey in the figure). This is the case also with other droplet based single-cell RNA-seq technologies such as Drop-seq or 10X Genomics technology.
  15. The significance of 3′ protocols it is difficult to quantify individual isoforms of genes from the data they produce. This is because many transcripts, while differing in internal exon structure, will share a 3′ UTR. Nevertheless, in exploratory work aimed at investigating the information content delivered by 3′ protocols, Ntranos et al. 2019 show that there is a much greater diversity of 3′ UTRs in the genome than is currently annotated, and this can be taken advantage of to (sometimes) measure isoform dynamics with 3′ protocols.
  16. To analyze the various performance metrics of a technology such as inDrops it is necessary to understand some of the underlying statistics and combinatorics of beads, cells and drops. Two simple modeling assumptions that can be made is that the number of cells and beads in droplets are each Poisson distributed (albeit with different rate parameters). Specifically, we assume that
    \mathbb{P}(\mbox{droplet has } k \mbox{ cells}) = \frac{e^{-\lambda}\lambda^k}{k!} and \mathbb{P}(\mbox{droplet has } k  \mbox{ beads}) = \frac{e^{-\mu}\mu^j}{j!}. These assumptions are reasonable for the Drop-seq technology. Adjustment of concentrations and flow rates of beads and cells and oil allows for controlling the rate parameters of these distributions and as a result allow for controlling numerous tradeoffs which are discussed next.
  17. The cell capture rate of a technology is the fraction of input cells that are assayed in an experiment. Droplets that contain one or more cells but no beads will result in a lost cells whose transcriptome is not measured. The probability that a droplet has no beads is e^{-\mu} and therefore the probability that a droplet has at least one bead is 1-e^{-\mu}. To raise the capture rate it is therefore desirable to increase the Poisson rate \mu which is equal to the average number of beads in a droplet. However increasing \mu leads to duplication, i.e. cases where a single droplet has more than one bead, thus leading .a single cell transcriptome to appear as two or more cells. The duplication rate is the fraction of assayed cells which were captured with more than one bead. The duplication rate can be calculated as \frac{\mathbb{P}(\mbox{droplet has 2 or more beads})}{\mathbb{P}(\mbox{droplet has 1 or more beads})} (which happens to be equivalent to a calculation of the probability that we are alone in the universe). The tradeoff, shown quantitatively as capture rate vs. duplication rate, is shown in a figure I made for the slide.
  18. One way to circumvent the capture-duplication tradeoff is to load beads into droplets in a way that reduces the variance of beads per droplet. One way to do this is to use hydrogel beads instead of polystyrene beads, which are used in Drop-seq. The slide shows hydrogel beads being being captured in droplets at a reduced variance due to the ability to pack hydrogel beads tightly together. Hydrogel beads are used with inDrops. The term Sub-poisson loading refers generally to the goal of placing exactly one bead in each droplet in a droplet based single-cell RNA-seq method.
  19. This slide shows a comparison of bead technologies. In addition to inDrops, 10x Genomics single-cell RNA-seq technology also uses hydrogel beads. There are numerous technical details associated with beads including whether or not barcodes are released (and how).
  20. Barcode collisions arise when two cells are separately encapsulated with beads that contain identical barcodes. The slide shows the barcode collision rate formula, which is 1-\left( 1-\frac{1}{M} \right)^{N-1}. This formula is derived as follows: Let p=\frac{1}{M}. The probability that a barcodes is associated with k cells is given by the binomial formula {N \choose k}p^k(1-p)^{N-k}. Thus, the probability that a barcode is associated to exactly one cell is Np(1-p)^{N-1} = \frac{N}{M}\left(1-\frac{1}{M}\right)^{N-1}. Therefore the expected number of cells with a unique barcode is N\left(1-\frac{1}{M}\right)^{N-1} and the barcode collision rate is \left(1-\frac{1}{M}\right)^{N-1}. This is approximately 1-\left( \frac{1}{e} \right)^{\frac{N}{M}}. The term synthetic doublet is used to refer to the situation when two or more different cells appear to be a single cell due to barcode collision.
  21. In the notation of the previous slide, the barcode diversity is \frac{N}{M}, which is an important number in that it determines the barcode collision rate. Since barcodes are encoded in sequence, a natural question is what sequence length is needed to ensure a desired level of barcode diversity. This slide provides a lower bound on the sequence length.
  22. Technical doublets are multiple cells that are captured in a single droplet, barcoded with the same sequence, and thus the transcripts that are recorded from them appear to originate from a single-cell.  The technical doublet rate can be estimated using a calculation analogous to the one used for the cell duplication rate (slide 17), except that it is a function of the Poisson rate \lambda and not \mu. In the single-cell RNA-seq literature the term “doublet” generally refers to technical doublets, although it is useful to distinguish such doublets from synthetic doublets and biological doublets (slide 25).
  23. One way to estimate the technical doublet rate is via pooled experiments of cells from different species. The resulting data can be viewed in what has become known as a “Barnyard” plot, a term originating from the Macosko et al. 2015 Drop-seq paper. Despite the authors’ original intention to pool mouse, chicken, cow and pig cells, typical validation of single-cell technology now proceeds with a mixture of human and mouse cells. Doublets are readily identifiable in barnyard plots as points (corresponding to droplets) that display transcript sequence from two different species. The only way for this to happen is via the capture of a doublet of cells (one from each species) in a single droplet. Thus, validation of single-cell RNA-seq rigs via a pooled experiment, and assessment of the resultant barnyard plot, has become standard for ensuring that the data is indeed single cell.
  24.  It is important to note that while points on the diagonal of barnyard plots do correspond to technical doublets, pooled experiments, say of human and mouse cells, will also result in human-human and mouse-mouse doublets that are not evident in barnyard plots. To estimate such within species technical doublets from a “barnyard analysis”, Bloom 2018 developed a simple procedure that is described in this slide. The “Bloom correction” is important to perform in order to estimate doublet rate from mixed species experiments.
  25. Biological doublets arise when two cells form a discrete unit that does not break apart during disruption to form a suspension. Such doublets are by definition from the same species, and therefore will not be detected in barnyard plots. One way to avoid biological doublets is via single nucleus RNA-seq (see, Habib et al. 2017). Single nuclear RNA-seq has proved to be important for experiments involving cells that are difficult to dissociate, e.g. human brain cells. In fact, the formation of suspensions is a major bottleneck in the adoption of single-cell RNA-seq technology, as techniques vary by tissue and organism, and there is no general strategy. On the other hand, in a recent interesting paper, Halpern et al. 2018 show that biological doublets can sometimes be considered a feature rather than a bug. In any case, doublets are more complicated than initially appears, and we have by now seen that there are three types of doublets: synthetic doublets, technical doublets, and biological doublets .
  26. Unique molecular identifiers (UMIs) are part of the oligo constructs for beads in droplet single-cell RNA-seq technologies. They are used to identify reads that originated from the same molecule, so that double counting of such molecules can be avoided. UMIs are generally much shorter than cell barcodes, and estimates of required numbers (and corresponding sequence lengths) proceed analogously to the calculations for cell barcodes.
  27. The sensitivity of a single-cell RNA-seq technology (also called transcript capture rate) is the fraction of transcripts captured per cell. Crucially, the sensitivity of a technology is dependent on the amount sequenced, and the plot in this slide (made by Eduardo Beltrame and Valentine Svensson by analysis of data from Svensson et al. 2017) shows that dependency.  The dots in the figure are cells from different datasets that included spike-ins, whose capture is being measured (y-axis). Every technology displays an approximately linear relationship between number of reads sequenced and transcripts captured, however each line is describe by two parameters: a slope and an intercept. In other words, specification of sensitivity for a technology requires reporting two numbers, not one.
  28. Having surveyed the different attributes of droplet technologies, this slide summarizes some of the pros and cons of inDrops similarly to slide 7 for SMART-Seq2.
  29. While individual aspects of single-cell RNA-seq technologies can be readily understood via careful modeling coupled to straightforward quality control experiments, a comprehensive assessment of whether a technology is “good” or “bad” is meaningless. The figure in this slide, from Zhang et al. 2019, provides a useful overview of some of the important technical attributes.
  30. In terms of the practical question “which technology should I choose for my experiment?”, the best that can be done is to offer very general decision workflows. The flowchart shown on this slide, also from Zhang et al. 2019, is not very specific but does provide some simple to understand checkpoints to think about. A major challenge in comparing and contrasting technologies for single-cell RNA-seq is that they are all changing very rapidly as numerous ideas and improvements are now published weekly.
  31. The technologies reviewed in the slides are strictly transcriptomics methods (single-cell RNA-seq). During the past few years there has been a proliferation of novel multi-modal technologies that simultaneously measures the transcriptome of single cells along with other modalities. Such technologies are reviewed in Packer and Trapnell 2018 and the slide mentions a few of them.