You are currently browsing the tag archive for the ‘single cell RNA-Seq’ tag.

The arXiv preprint server has its roots in an “e-print” email list curated by astrophysicist Joanne Cohn, who in 1989 had the idea of organizing the sharing of preprints among her physics colleagues. In her recollections of the dawn of the arXiv,  she mentions that “at one point two people put out papers on the list on the same topic within a few days of each other” and that her “impression was that because of the worldwide reach of [her] distribution list, people realized it could be a way to establish precedence for research.” In biology, where many areas are crowded and competitive, the ability to time stamp research before a possibly lengthy journal review and publication process is almost certainly one of the driving forces behind the rapid growth of the bioRxiv (ASAPbio 2016 and Vale & Hyman, 2016).

However the ability to establish priority with preprints is not, in my opinion, what makes them important for science. Rather, the value of preprints is in their ability to accelerate research via the rapid dissemination of methods and discoveries. This point was eloquently made by Stephen Quake, co-president of the Chan Zuckerberg Biohub, at a Caltech Kavli Nanonscience Institute Distinguished Seminar Series talk earlier this year. He demonstrated the impact of preprints and of sharing data prior to journal publication by way of example, noting that posting of the CZ Biohub “Tabula Muris” preprint along with the data directly accelerated two different unrelated projects: Cusanovich et al. 2018 and La Manno et al. 2018. In fact, in the case of La Manno et al. 2018, Quake revealed that one of the corresponding authors of the paper, Sten Linnarsson, had told him that “[he] couldn’t get the paper past the referees without using all of the [Tabula Muris] data”:

Moreover, Quake made clear that the open science principles practiced with the Tabula Muris preprint were not just a one-off experiment, but fundamental Chan Zuckerberg Initiative (CZI) values that are required for all CZI internal research and for publications arising from work the CZI supports: “[the CZI has] taken a pretty aggressive policy about publication… people have to agree to use biorXiv or a preprint server to share results… and the hope is that this is going to accelerate science because you’ll learn about things sooner and be able to work on them”:

Indeed, on its website the CZI lists four values that guide its mission and one of them is “Open Science”:

Open Science
The velocity of science and pace of discovery increase as scientists build on each others’ discoveries. Sharing results, open-source software, experimental methods, and biological resources as early as possible will accelerate progress in every area.

This is a strong and direct rebuttal to Dan Longo and Jeffrey Drazen’s “research parasite” fear mongering in The New England Journal of Medicine.

I was therefore disappointed with the CZI after failing, for the past two months, to obtain the code and data for the preprint “A molecular cell atlas of the human lung from single cell RNA sequencing” by Travaglini, Nabhan et al. (the preprint was posted on the bioRxiv on August 27th 2019). The interesting preprint describes an atlas of 58 cell populations in the human lung, which include 41 of 45 previously characterized cell types or subtypes and the discovery of 14 new ones. Of particular interest to me, in light of some ongoing projects in my lab, is a comparative analysis examining cell type concordance between human and mouse. Travaglini, Nabhan et al. note that 17 molecular types have been gained or lost since the divergence of human and mouse. The results are based on large-scale single-cell RNA-seq (using two technologies) of ~70,000 human and lung peripheral blood cells.

The comparative analysis is detailed in Extended Data Figure S5 (reproduced below), which shows scatter plots of (log) gene counts for homologous human and mouse cell types. For each pair of cell types, a sub-figures also shows the correlation between gene expression and divergent genes are highlighted:

I wanted to understand the details behind this figure: how exactly were cell types defined and homologous cell types identified? What was the precise thresholding for “divergent” genes? How were the ln(CPM+1) expression units computed? Some aspects of these questions have answers in the Methods section of the preprint, but I wanted to know exactly; I needed to see the code. For example, the manuscript describes the cluster selection procedure as follows: “Clusters of similar cells were detected using the Louvain method for community detection including only biologically meaningful principle [sic] components (see below)” and looking “below” for the definition of “biologically meaningful” I only found a descriptive explanation illustrated with an example, but with no precise specification provided. I also wanted to explore the data. We have been examining some approaches for cross-species single-cell analysis and this preprint describes an exceptionally useful dataset for this purpose. Thus, access to the software and data used for the preprint would accelerate the research in my lab.

But while the preprint has a sentence with a link to the software (“Code for demultiplexing counts/UMI tables, clustering, annotation, and other downstream analyses are available on GitHub (https://github.com/krasnowlab/HLCA)”) clicking on the link merely sends one to the Github Octocat.

The Travaglini, Nabhan et al. Github repository that is supposed to contain the analysis code is nowhere to be found. The data is also not available in any form. The preprint states that “Raw sequencing data, alignments, counts/UMI tables, and cellular metadata are available on GEO (accession GEOXX),” The only data a search for GEOXX turns up is a list of prices on a shoe website.

I wrote to the authors of Travaglini, Nabhan et al. right after their preprint appeared noting the absence of code and data and asking for both. I was told by one of the first co-authors that they were in the midst of uploading the materials, but that the decision of whether to share them would have to be made by the corresponding authors. Almost two months later, after repeated requests, I have yet to receive anything. My initial excitement for the Travaglini, Nabhan et al. single-cell RNA-seq has turned into disappointment at their zero-data RNA-seq.

🦗 🦗 🦗 🦗 🦗

This state of affairs, namely the posting of bioRxiv preprints without data or code, is far too commonplace. I was first struck with the extent of the problem last year when the Gupta, Collier et al. 2018 preprint was posted without a Methods section (let alone with data or code). Also problematic was that the preprint was posted just three months before publication while the journal submission was under review. I say problematic because not sharing code, not sharing software, not sharing methods, and not posting the preprint at the time of submission to a journal does not accelerate progress in science (see the CZI Open Science values statement above).

The Gupta, Collier et al. preprint was not a CZI related preprint but the Travaglini, Nabhan et al. preprint is. Specifically, Travaglini, Nabhan et al. 2019 is a collaboration between CZ Biohub and Stanford University researchers, and the preprint appears on the Chan Zuckerberg Biohub bioRxiv channel:

The Travaglini, Nabhan et al. 2019 preprint is also not an isolated example; another recent CZ Biohub preprint from the same lab, Horns et al. 2019,  states explicitly that “Sequence data, preprocessed data, and code will be made freely available [only] at the time of [journal] publication.” These are cases where instead of putting its money where its mouth is, the mouth took the money, ate it, and spat out a 404 error.

To be fair, sharing data, software and methods is difficult. Human data must sometimes be protected due to confidentiality constraints, thus requiring controlled access with firewalls such as dbGaP that can be difficult to set up. Even with unrestricted data, sharing can be cumbersome. For example, the SRA upload process is notoriously difficult to manage, and the lack of metadata standards can make organizing experimental data, especially sequencing data, complicated and time consuming. The sharing of experimental protocols can be challenging when they are in flux and still being optimized while work is being finalized. And when it comes to software, ensuring reproducibility and usability can take months of work in the form of wrangling Snakemake and other workflows, not to mention the writing of documentation. Practicing Open Science, I mean really doing it, is difficult work. There is a lot more to it than just dumping an advertisement on the bioRxiv to collect a timestamp. By not sharing their data or software, preprints such as Travaglini, Nabhan et al. 2019 and Horns et al. 2019 appear to be little more than a cynical attempt to claim priority.

It would be great if the CZI, an initiative backed by billions of dollars with hundreds of employees, would truly champion Open Science. The Tabula Muris preprint is a great example of how preprints that are released with data and software can accelerate progress in science. But Tabula Muris seems to be an exception for CZ Biohub researchers rather than the rule, and actions speak louder than a website with a statement about Open Science values.

A few months ago, in July 2019, I wrote a series of five posts about the Melsted, Booeshaghi et al. 2019 preprint on this blog, including a post focusing on a new fast workflow for RNA velocity based on kallisto | bustools.  This new workflow replaces the velocyto software published with the RNA velocity paper (La Manno et al. 2018), in the sense that the kallisto | bustools is more than an order of magnitude faster than velocyto (and Cell Ranger which it builds on), while producing near identical results:

In my blogpost, I showed how we were able to utilize this much faster workflow to easily produce RNA velocity for a large dataset of 113,917 cells from Clark et al. 2019, a dataset that was intractable with Cell Ranger + velocyto.

The kallisto | bustools RNA velocity workflow makes use of two novel developments: a new mode in kallisto called kallisto bus that produces BUS files from single-cell RNA-seq data, and a new collection of new C++ programs forming “bustools” that can be used to process BUS files to produce count matrices. The RNA velocity workflow utilizes the bustools sort, correct, capture and count commands. With these tools an RNA velocity analysis that previously took a day now takes minutes. The workflow, associated tools and validations are described in the Melsted, Booeshaghi et al. 2019 preprint.

Now, in an interesting exercise presented at the Single Cell Genomics 2019 meeting, Sten Linnarsson revealed that he reimplemented the kallisto | bustools workflow in Loompy. Loompy, which previously consisted of some Python tools for reading and writing Loom files, now has a function that runs kallisto bus. It also consists of Python functions that are used to manipulate BUS files; these are Python reimplementations of the bustools functions needed for RNA velocity and produce the same output as kallisto | bustools. It is therefore possible to now answer a question I know has been on many minds… one that has been asked before but not to my knowledge, in the single-cell RNA-seq setting… is Python really faster than C++ ?

To answer this question we (this is an analysis performed with Sina Booeshaghi), performed an apples-to-apples comparison running kallisto | bustools and Loompy on exactly the same data, with the same hardware. We pre-processed both the human forebrain data from La Manno et al. 2018,  and data from Oetjen et al. 2018 consisting of 498,303,099 single-cell RNA-seq reads sequenced from a cDNA library of human bone marrow (SRA accession SRR7881412; see also PanglaoDB).

First, we examined the correspondence between Loopy and bustools on the human forebrain data. As expected, given that the Loompy code first runs the same kallisto as in the kallisto | bustools workflow, and then reimplements bustools, the results are the near identical. In the first plot every dot is a cell (as defined by the velocyto output from La Manno et al. 2018) and the number of counts produced by each method is shown. In the second, the correlation between gene counts in each cell are plotted:

The figures above are produced from the “spliced” RNA velocity matrix. We also examined the “unspliced” matrix, with similar results:

In runtime benchmarks on the Oetjen et al. 2018 data we found that kallisto | bustools runs 3.75 faster than Loompy (note that by default Loompy runs kallisto with all available threads, so we modified the Loompy source code to create a fair comparison). Furthermore, kallisto | bustools requires 1.8 times less RAM. In other words, despite rumors to the contrary, Python is indeed slower than C++ !

Of course, sometimes there is utility in reimplementing software in another language, even a slower one. For example, a reimplementation of C++ code could lead to a simpler workflow in a higher level language. That’s not the case here. The memory efficiency of kallisto | bustools makes possible the simplest user interface imaginable: a kallisto | bustools based Google Colab notebook allows for single-cell RNA-seq pre-processing in the cloud on a web browser without a personal computer.

At the recent Single Cell Genomics 2019 meeting, Linnarsson’s noted that Cell Ranger + veloctyto has been replaced by kallisto | bustools:

Indeed, as I wrote on my blog shortly after the publication of Melsted, Booeshaghi et al., 2019, RNA velocity calculations that were previously intractable on large datasets are now straightforward. Linnarsson is right. Users should always adopt best-in-class tools in favor of methods that underperform in accuracy, efficiency, or both. #methodsmatter

This post is the fifth in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

The following passage about Beethoven’s fifth symphony was written by one of my favorite musicologists:

“No great music has ever been built from an initial figure of four notes. As I have said elsewhere, you might as well say that every piece of music is built from an initial figure of one note. You may profitably say that the highest living creatures have begun from a single nucleated cell. But no ultra-microscope has yet unraveled the complexities of the single living cell; nor, if the spectroscope is to be believed, are we yet very full informed of the complexities of a single atom of iron : and it is quite absurd to suppose that the evolution of a piece of music can proceed from a ‘simple figure of four notes’ on lines in the least resembling those of nature.” – Donald Francis Tovey writing about Beethoven’s Fifth Symphony in Essays in Musical Analysis Volume I, 1935.

This passage conveys something true about Beethoven’s fifth symphony: an understanding of it cannot arise from a limited fixation on the famous four note motif. As far as single-cell biology goes, I don’t know whether Tovey was familiar with Theodor Boveri‘s sea urchin experiments, but he certainly hit upon a scientific truth as well: single cells cannot be understood in isolation. Key to understanding them is context (Eberwine et al., 2013).

RNA velocity, with roots in the work of Zeisel et al., 2011, has been recently adapted for single-cell RNA-seq by La Manno et al. 2018, and provides much needed context for interpreting the transcriptomes of single-cells in the form of a dynamics overlay. Since writing a review about the idea last year (Svensson and Pachter, 2019), I’ve become increasingly convinced that the method, despite relying on sparse data, numerous very strong model assumptions, and lots of averaging, is providing meaningful biological insight. For example, in a recent study of spermatogonial stem cells (Guo et al. 2018), the authors describe two “unexpected” transitions between distinct states of cells that are revealed by RNA velocity analysis (panel a from their Figure 6, see below):

Producing an RNA velocity analysis currently requires running the programs Cell Ranger followed by velocyto. These programs are both very slow. Cell Ranger’s running time scales at about 3 hours per hundred million reads (see Supplementary Table 1 Melsted, Booeshaghi et al., 2019). The subsequent velocyto run is also slow. The authors describe it as taking “approximately 3 hours” but anecdotally the running time can be much longer on large datasets. The programs also require lots of memory.

To facilitate rapid and large-scale RNA velocity analysis, in Melsted, Booeshaghi et al., 2019  we describe a kallisto|bustools workflow that makes possible efficient RNA velocity computations at least an order of magnitude faster than with Cell Ranger and velocyto. The work, a tour-de-force of development, testing and validation, was primarily that of Sina Booeshaghi. Páll Melsted implemented the bustools capture command and Kristján Hjörleifsson assisted with identifying and optimizing the indices for pseudoalignment. We present analysis on two datasets in the paper. The first is single-cell RNA-seq from retinal development recently published in Clark et al. 2019. This is a beautiful paper- and I don’t mean just in terms of the results. Their data and results are extremely well organized making their paper reproducible. This is so important it merits a shout out 👏🏾

See Clark et al. 2019‘s  GEO GSE 118614 for a well-organized and useful data share.

The figure below shows RNA velocity vectors overlaid on UMAP coordinates for Clark et al.’s 10 stage time series of retinal development (see cell [8] in our python notebook):

An overlap on the same UMAP with cells colored by type is shown below:

Clark et al. performed a detailed pseudotime analysis in their paper, which successfully identified genes associated with cell changes during development. This is a reproduction of their figure 2:

We examined the six genes from their panel C from a velocity point of view using the scvelo package and the results are beautiful:

What can be seen with RNA velocity is not only the changes in expression that are extracted from pseudotime analysis (Clark et al. 2019 Figure 2 panel C), but also changes in their velocity, i.e. their acceleration (middle column above). RNA velocity adds an interesting dimension to the analysis.

To validate that our RNA velocity workflow provides results consistent with velocyto, we performed a direct comparison with the developing human forebrain dataset published by La Manno et al. in the original RNA velocity paper (La Manno et al. 2018 Figure 4).

The results are concordant, not only in terms of the displayed vectors, but also, crucially, in the estimation of the underlying phase diagrams (the figure below shows a comparison for the same dataset; kallisto on the left, Cell Ranger + velocyto on the right):

Digging deeper into the data, one difference we found between the workflows (other than speed) is the number of reads counts. We implemented a simple strategy to estimate the required spliced and unspliced matrices that attempts to follow the one described in the La Manno et al. paper, where the authors describe the rules for characterizing reads as spliced vs. unspliced as follows:

1. A molecule was annotated as spliced if all of the reads in the set supporting a given molecule map only to the exonic regions of the compatible transcripts.
2. A molecule was annotated as unspliced if all of the compatible transcript models had at least one read among the supporting set of reads for this molecule mapping that i) spanned exon-intron boundary, or ii) mapped to the intron of that transcript.

In the kallisto|bustools workflow this logic was implemented via the bustools capture command which was first use to identify all reads that were compatible only with exons (i.e. there was no pseudoalignment to any intron) and then all reads that were compatible only with introns  (i.e. there was no pseudoalignment completely within an exon). While our “spliced matrices” had similar numbers of counts, our “unspliced matrices” had considerably more (see Melsted, Booeshaghi et al. 2019 Supplementary Figure 10A and B):

To understand the discrepancy better we investigated the La Manno et al. code, and we believe that differences arise from the velocyto package logic.py code in which the same count function

###### def count(self, molitem: vcy.Molitem, cell_bcidx: int, dict_layers_columns: Dict[str, np.ndarray], geneid2ix: Dict[str, int])

appears 8 times and each version appears to implement a slightly different “logic” than described in the methods section.

A tutorial showing how to efficiently perform RNA velocity is available on the kallisto|bustools website. There is no excuse not to examine cells in context.

This post is the fourth in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

1. Near-optimal pre-processing of single-cell RNA-seq
2. Single-cell RNA-seq for dummies
3. How to solve an NP-complete problem in linear time
4. Rotating the knee (plot) and related yoga
5. High velocity RNA velocity

The “knee plot” is a standard single-cell RNA-seq quality control that is also used to determine a threshold for considering cells valid for analysis in an experiment. To make the plot, cells are ordered on the x-axis according to the number of distinct UMIs observed. The y-axis displays the number of distinct UMIs for each barcode (here barcodes are proxies for cells). The following example is from Aaron Lun’s DropletUtils vignette:

A single-cell RNA-seq knee plot.

High quality barcodes are located on the left hand side of the plot, and thresholding is performed by identifying the “knee” on the curve. On the right hand side, past the inflection point, are barcodes which have relatively low numbers of reads, and are therefore considered to have had failure in capture and to be too noisy for further analysis.

In Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019, we display a series of plots for a benchmark panel of 20 datasets, and the first plot in each panel (subplot A)is a knee plot. The following example is from an Arabidopsis thaliana dataset (Ryu et al., 2019; SRR8257100)

Careful examination of our plots shows that unlike the typical knee plot made for single-cell RNA-seq , ours has the x- and y- axes transposed. In our plot the x-axis displays the number of distinct UMI counts, and the y-axis corresponds to the barcodes, ordered from those with the most UMIs (bottom) to the least (top). The figure below shows both versions of a knee plot for the same data (the “standard” one in blue, our transposed plot in red):

Why bother transposing a plot?

We begin by observing that if one ranks barcodes according to the number of distinct UMIs associated with them (from highest to lowest), then the rank of a barcode with x distinct UMIs is given by f(x) where

$f(x) = |\{c:\# \mbox{UMIs} \in c \geq x\}|$.

In other words, the rank of a barcode is interpretable as the size of a certain set. Now suppose that instead of only measurements of RNA molecules in cells, there is another measurement. This could be measurement of surface protein abundances (e.g. CITE-seq or REAP-seq), or measurements of sample tags from a multiplexing technology (e.g. ClickTags). The natural interpretation of #distinct UMIs as the independent variable and  the rank of a barcode as the dependent variable is now clearly preferable. We can now define a bivariate function f(x,y) which informs on the number of barcodes with at least x RNA observations and tag observations:

$f(x,y) = |\{c:\# \mbox{UMIs} \in c \geq x \mbox{ and} \# \mbox{tags} \in c \geq y \}|$.

Nadia Volovich, with whom I’ve worked on this, has examined this function for the 8 sample species mixing experiment from Gehring et al. 2018. The function is shown below:

Here the x-axis corresponds to the #UMIs in a barcode, and the y-axis to the number of tags. The z-axis, or height of the surface, is the f(x,y) as defined above.  Instead of thresholding on either #UMIs or #tags, this “3D knee plot” makes possible thresholding using both (note that the red curve shown above corresponds to one projection of this surface).

Separately from the issue described above, there is another subtle issue with the knee plot. The x-axis (dependent) variable really ought to display the number of molecules assayed rather than the number of distinct UMIs. In the notation of Melsted, Booeshaghi et al., 2019 (see also the blog post on single-cell RNA-seq for dummies), what is currently being plotted is |supp(I)|, instead of |I|. While |I| cannot be directly measured, it can be inferred (see the Supplementary Note of Melsted, Booeshaghi et al., 2019), where the cardinality of I is denoted by k (see also Grün et al,, 2014). If d denotes the number of distinct UMIs for a barcode and n the effective number of UMIs , then k can be estimated by

$\hat{k} = \frac{log(1-\frac{d}{n})}{log(1-\frac{1}{n})}$.

The function estimating k is monotonic so for the purpose of thresholding with the knee plot it doesn’t matter much whether the correction is applied, but it is worth noting that the correction can be applied without much difficulty.

Bi/BE/CS183 is a computational biology class at Caltech with a mix of undergraduate and graduate students. Matt Thomson and I are co-teaching the class this quarter with help from teaching assistants Eduardo Beltrame, Dongyi (Lambda) Lu and Jialong Jiang. The class has a focus on the computational biology of single-cell RNA-seq analysis, and as such we recently taught an introduction to single-cell RNA-seq technologies. We thought the slides used would be useful to others so we have published them on figshare:

Eduardo Beltrame, Jase Gehring, Valentine Svensson, Dongyi Lu, Jialong Jiang, Matt Thomson and Lior Pachter, Introduction to single-cell RNA-seq technologies, 2019. doi.org/10.6084/m9.figshare.7704659.v1

Thanks to Eduardo Beltrame, Jase Gehring and Valentine Svensson for many extensive and helpful discussions that clarified many of the key concepts. Eduardo Beltrame and Valentine Svensson performed new analysis (slide 28) and Jase Gehring resolved the tangle of “doublet” literature (slides 17–25). The 31 slides were presented in a 1.5 hour lecture. Some accompanying notes that might be helpful to anyone interested in using them are included for reference (numbered according to slide) below:

1. The first (title) slide makes the point that single-cell RNA-seq is sufficiently complicated that a deep understanding of the details of the technologies, and methods for analysis, is required. The #methodsmatter.
2. The second slide presents an overview of attributes associated with what one might imagine would constitute an ideal single-cell RNA-seq experiment. We tried to be careful about terminology and jargon, and therefore widely used terms are italicized and boldfaced.
3. This slide presents Figure 1 from Svensson et al. 2018. This is an excellent perspective that highlights key technological developments that have spurred the growth in popularity of single-cell RNA-seq. At this time (February 2019) the largest single-cell RNA-seq dataset that has been published consists of 690,000 Drop-seq adult mouse brain cells (Saunders, Macosko et al. 2018). Notably, the size and complexity of this dataset rivals that of a large-scale genome project that until recently would be undertaken by hundreds of researchers. The rapid adoption of single-cell RNA-seq is evident in the growth of records in public sequence databases.
4. The Chen et al. 2018 review on single-cell RNA-seq is an exceptionally useful and thorough review that is essential reading in the field. The slide shows Figure 2 which is rich in information and summarizes some of the technical aspects of single-cell RNA-seq technologies. Understanding of the details of individual protocols is essential to evaluating and assessing the strengths and weaknesses of different technologies for specific applications.
5. Current single-cell RNA-seq technologies can be broadly classified into two groups: well-based and droplet-based technologies. The Papalexi and Satija 2017 review provides a useful high-level overview and this slide shows a part of Figure 1 from the review.
6. The details of the SMART-Seq2 protocol are crucial for understanding the technology. SMART is a clever acronym for Switching Mechanism At the 5′ end of the RNA Transcript. It allows the addition of an arbitrary primer sequence at the 5′ end of a cDNA strand, and thus makes full length cDNA PCR possible. It relies on the peculiar properties of the reverse transcriptase from the Moloney murine leukemia virus (MMLV), which, upon reaching the 5’ end of the template, will add a few extra nucleotides (usually Cytosines). The resultant overhang is a binding site for the “template switch oligo”, which contains three riboguanines (rGrGrG). Upon annealing, the reverse transcriptase “switches” templates, and continues transcribing the DNA oligo, thus adding a constant sequence to the 5’ end of the cDNA. After a few cycles of PCR, the full length cDNA generated is too long for Illumina sequencing (where a maximum length of 800bp is desirable). To chop it up into smaller fragments of appropriate size while simultaneously adding the necessary Illumina adapter sequences, one can can use the Illumina tagmentation Nextera™ kits based on Tn5 tagmentation. The SMART template switching idea is also used in the Drop-seq and 10x genomics technologies.
7. While it is difficult to rate technologies exactly on specific metrics, it is possible to identify strengths and weaknesses of distinct approaches. The SMART-Seq2 technology has a major advantage in that it produces reads from across transcripts, thereby providing “full-length” information that can be used to quantify individual isoforms of genes. However this superior isoform resolution requires more sequencing, and as a result makes the method less cost effective. Well-based methods, in general, are not as scalable as droplet methods in terms of numbers of cells assayed. Nevertheless, the tradeoffs are complex. For example robotics technologies can be used to parallelize well-based technologies, thereby increasing throughput.
8. The cost of a single-cell technology is difficult to quantify. Costs depend on number of cells assayed as well as number of reads sequenced, and different technologies have differing needs in terms of reagents and library preparation costs. Ziegenhain et al. 2017 provide an in-depth discussion of how to assess cost in terms of accuracy and power, and the table shown in the slide is reproduced from part of Table 1 in the paper.
9. A major determinant of single-cell RNA-seq cost is sequencing cost. This slide shows sequencing costs at the UC Davis Genome Center and its purpose is to illustrate numerous tradeoffs, relating to throughput, cost per base, and cost per fragment that must be considered when selecting which sequencing machine to use. In addition, sequencing time frequently depends on core facilities or 3rd party providers multiplexing samples on machines, and some sequencing choices are likely to face more delay than others.
10. Turning to droplet technologies based on microfluidics, two key papers are the Drop-seq and inDrops papers which were published in the same issue of a single journal in 2015. The papers went to great lengths to document the respective technologies, and to provide numerous text and video tutorials to facilitate adoption by biologists. Arguably, this emphasis on usability (and not just reproducibility) played a major role in the rapid adoption of single-cell RNA-seq by many labs over the past three years. Two other references on the slide point to pioneering microfluidics work by Rustem Ismagilov, David Weitz and their collaborators that made possible the numerous microfluidic single-cell applications that have since been developed.
11.  This slide displays a figure showing a monodispersed emulsion from the recently posted preprint “Design principles for open source bioinstrumentation: the poseidon syringe pump system as an example” by Booeshaghi et al., 2019. The generation of such emulsions is a prerequisite for successful droplet-based single-cell RNA-seq. In droplet based single-cell RNA-seq, emulsions act as “parallelizing agents”, essentially making use of droplets to parallelize the biochemical reactions needed to capture transcriptomic (or other) information from single-cells.
12. The three objects that are central to droplet based single-cell RNA-seq are beads, cells and droplets. The relevance of emulsions in connection to these objects is that the basis of droplet methods for single-cell RNA-seq is the encapsulation of single cells together with single beads in the emulsion droplest. The beads are “barcode” delivery vectors. Barcodes are DNA sequences that are associated to transcripts, which are accessible after cell lysis in droplets. Therefore, beads must be manufactured in a way that ensures that each bead is coated with the same barcodes, but that the barcodes associated with two distinct beads are different from each other.
13. The inDrops approach to single-cell RNA-seq serves as a useful model for droplet based single-cell RNA-seq methods. The figure in the slide is from a protocol paper by Zilionis et al. 2017 and provides a useful overview of inDrops. In panel (a) one sees a zoom-in of droplets being generated in a microfluidic device, with channels delivering cells and beads highlighted. Panel (b) illustrates how inDrops hydrogel beads are used once inside droplets: barcodes (DNA oligos together with appropriate priming sequences) are released from the hydrogel beads and allow for cell barcoded cDNA synthesis. Finally, panel (c) shows the sequence construct of oligos on the beads.
14. This slide is analogous to slide 6, and shows an overview of the protocols that need to be followed both to make the hydrogel beads used for inDrops, and the inDrops protocol itself.  In a clever dual use of microfluidics, inDrops makes the hydrogel beads in an emulsion. Of note in the inDrops protocol itself is the fact that it is what is termed a “3′ protocol”. This means that the library, in addition to containing barcode and other auxiliary sequence, contains sequence only from 3′ ends of transcripts (seen in grey in the figure). This is the case also with other droplet based single-cell RNA-seq technologies such as Drop-seq or 10X Genomics technology.
15. The significance of 3′ protocols it is difficult to quantify individual isoforms of genes from the data they produce. This is because many transcripts, while differing in internal exon structure, will share a 3′ UTR. Nevertheless, in exploratory work aimed at investigating the information content delivered by 3′ protocols, Ntranos et al. 2019 show that there is a much greater diversity of 3′ UTRs in the genome than is currently annotated, and this can be taken advantage of to (sometimes) measure isoform dynamics with 3′ protocols.
16. To analyze the various performance metrics of a technology such as inDrops it is necessary to understand some of the underlying statistics and combinatorics of beads, cells and drops. Two simple modeling assumptions that can be made is that the number of cells and beads in droplets are each Poisson distributed (albeit with different rate parameters). Specifically, we assume that
$\mathbb{P}(\mbox{droplet has } k \mbox{ cells}) = \frac{e^{-\lambda}\lambda^k}{k!}$ and $\mathbb{P}(\mbox{droplet has } k \mbox{ beads}) = \frac{e^{-\mu}\mu^j}{j!}$. These assumptions are reasonable for the Drop-seq technology. Adjustment of concentrations and flow rates of beads and cells and oil allows for controlling the rate parameters of these distributions and as a result allow for controlling numerous tradeoffs which are discussed next.
17. The cell capture rate of a technology is the fraction of input cells that are assayed in an experiment. Droplets that contain one or more cells but no beads will result in a lost cells whose transcriptome is not measured. The probability that a droplet has no beads is $e^{-\mu}$ and therefore the probability that a droplet has at least one bead is $1-e^{-\mu}$. To raise the capture rate it is therefore desirable to increase the Poisson rate $\mu$ which is equal to the average number of beads in a droplet. However increasing $\mu$ leads to duplication, i.e. cases where a single droplet has more than one bead, thus leading .a single cell transcriptome to appear as two or more cells. The duplication rate is the fraction of assayed cells which were captured with more than one bead. The duplication rate can be calculated as $\frac{\mathbb{P}(\mbox{droplet has 2 or more beads})}{\mathbb{P}(\mbox{droplet has 1 or more beads})}$ (which happens to be equivalent to a calculation of the probability that we are alone in the universe). The tradeoff, shown quantitatively as capture rate vs. duplication rate, is shown in a figure I made for the slide.
18. One way to circumvent the capture-duplication tradeoff is to load beads into droplets in a way that reduces the variance of beads per droplet. One way to do this is to use hydrogel beads instead of polystyrene beads, which are used in Drop-seq. The slide shows hydrogel beads being being captured in droplets at a reduced variance due to the ability to pack hydrogel beads tightly together. Hydrogel beads are used with inDrops. The term Sub-poisson loading refers generally to the goal of placing exactly one bead in each droplet in a droplet based single-cell RNA-seq method.
19. This slide shows a comparison of bead technologies. In addition to inDrops, 10x Genomics single-cell RNA-seq technology also uses hydrogel beads. There are numerous technical details associated with beads including whether or not barcodes are released (and how).
20. Barcode collisions arise when two cells are separately encapsulated with beads that contain identical barcodes. The slide shows the barcode collision rate formula, which is $1-\left( 1-\frac{1}{M} \right)^{N-1}$. This formula is derived as follows: Let $p=\frac{1}{M}$. The probability that a barcodes is associated with k cells is given by the binomial formula ${N \choose k}p^k(1-p)^{N-k}$. Thus, the probability that a barcode is associated to exactly one cell is $Np(1-p)^{N-1} = \frac{N}{M}\left(1-\frac{1}{M}\right)^{N-1}$. Therefore the expected number of cells with a unique barcode is $N\left(1-\frac{1}{M}\right)^{N-1}$ and the barcode collision rate is $\left(1-\frac{1}{M}\right)^{N-1}$. This is approximately $1-\left( \frac{1}{e} \right)^{\frac{N}{M}}$. The term synthetic doublet is used to refer to the situation when two or more different cells appear to be a single cell due to barcode collision.
21. In the notation of the previous slide, the barcode diversity is $\frac{N}{M}$, which is an important number in that it determines the barcode collision rate. Since barcodes are encoded in sequence, a natural question is what sequence length is needed to ensure a desired level of barcode diversity. This slide provides a lower bound on the sequence length.
22. Technical doublets are multiple cells that are captured in a single droplet, barcoded with the same sequence, and thus the transcripts that are recorded from them appear to originate from a single-cell.  The technical doublet rate can be estimated using a calculation analogous to the one used for the cell duplication rate (slide 17), except that it is a function of the Poisson rate $\lambda$ and not $\mu$. In the single-cell RNA-seq literature the term “doublet” generally refers to technical doublets, although it is useful to distinguish such doublets from synthetic doublets and biological doublets (slide 25).
23. One way to estimate the technical doublet rate is via pooled experiments of cells from different species. The resulting data can be viewed in what has become known as a “Barnyard” plot, a term originating from the Macosko et al. 2015 Drop-seq paper. Despite the authors’ original intention to pool mouse, chicken, cow and pig cells, typical validation of single-cell technology now proceeds with a mixture of human and mouse cells. Doublets are readily identifiable in barnyard plots as points (corresponding to droplets) that display transcript sequence from two different species. The only way for this to happen is via the capture of a doublet of cells (one from each species) in a single droplet. Thus, validation of single-cell RNA-seq rigs via a pooled experiment, and assessment of the resultant barnyard plot, has become standard for ensuring that the data is indeed single cell.
24.  It is important to note that while points on the diagonal of barnyard plots do correspond to technical doublets, pooled experiments, say of human and mouse cells, will also result in human-human and mouse-mouse doublets that are not evident in barnyard plots. To estimate such within species technical doublets from a “barnyard analysis”, Bloom 2018 developed a simple procedure that is described in this slide. The “Bloom correction” is important to perform in order to estimate doublet rate from mixed species experiments.
25. Biological doublets arise when two cells form a discrete unit that does not break apart during disruption to form a suspension. Such doublets are by definition from the same species, and therefore will not be detected in barnyard plots. One way to avoid biological doublets is via single nucleus RNA-seq (see, Habib et al. 2017). Single nuclear RNA-seq has proved to be important for experiments involving cells that are difficult to dissociate, e.g. human brain cells. In fact, the formation of suspensions is a major bottleneck in the adoption of single-cell RNA-seq technology, as techniques vary by tissue and organism, and there is no general strategy. On the other hand, in a recent interesting paper, Halpern et al. 2018 show that biological doublets can sometimes be considered a feature rather than a bug. In any case, doublets are more complicated than initially appears, and we have by now seen that there are three types of doublets: synthetic doublets, technical doublets, and biological doublets .
26. Unique molecular identifiers (UMIs) are part of the oligo constructs for beads in droplet single-cell RNA-seq technologies. They are used to identify reads that originated from the same molecule, so that double counting of such molecules can be avoided. UMIs are generally much shorter than cell barcodes, and estimates of required numbers (and corresponding sequence lengths) proceed analogously to the calculations for cell barcodes.
27. The sensitivity of a single-cell RNA-seq technology (also called transcript capture rate) is the fraction of transcripts captured per cell. Crucially, the sensitivity of a technology is dependent on the amount sequenced, and the plot in this slide (made by Eduardo Beltrame and Valentine Svensson by analysis of data from Svensson et al. 2017) shows that dependency.  The dots in the figure are cells from different datasets that included spike-ins, whose capture is being measured (y-axis). Every technology displays an approximately linear relationship between number of reads sequenced and transcripts captured, however each line is describe by two parameters: a slope and an intercept. In other words, specification of sensitivity for a technology requires reporting two numbers, not one.
28. Having surveyed the different attributes of droplet technologies, this slide summarizes some of the pros and cons of inDrops similarly to slide 7 for SMART-Seq2.
29. While individual aspects of single-cell RNA-seq technologies can be readily understood via careful modeling coupled to straightforward quality control experiments, a comprehensive assessment of whether a technology is “good” or “bad” is meaningless. The figure in this slide, from Zhang et al. 2019, provides a useful overview of some of the important technical attributes.
30. In terms of the practical question “which technology should I choose for my experiment?”, the best that can be done is to offer very general decision workflows. The flowchart shown on this slide, also from Zhang et al. 2019, is not very specific but does provide some simple to understand checkpoints to think about. A major challenge in comparing and contrasting technologies for single-cell RNA-seq is that they are all changing very rapidly as numerous ideas and improvements are now published weekly.
31. The technologies reviewed in the slides are strictly transcriptomics methods (single-cell RNA-seq). During the past few years there has been a proliferation of novel multi-modal technologies that simultaneously measures the transcriptome of single cells along with other modalities. Such technologies are reviewed in Packer and Trapnell 2018 and the slide mentions a few of them.

The encapsulation of beads together with cells in droplets is the basis of microfluidic based single-cell RNA-seq technologies. Ideally droplets contain exactly one bead and one cell, however in practice the number of beads and cells in droplets is stochastic and encapsulation of cells in droplets produces an approximately Poisson distribution of number of cells per droplet:

Specifically, the probability of observing k cells in a droplet is approximated by

$\mathbb{P}(\mbox{k cells in a droplet}) = \frac{e^{-\lambda}\lambda^k}{k!}$.

The rate parameter $\lambda$ can be controlled and the average number of cells per droplet is equal to it. Therefore, setting $\lambda$ to be much less than 1 ensures that two or more cells are rarely encapsulated in a single droplet. A consequence of this is that the number of empty droplets, given by $e^{-\lambda}$, is large. Importantly, one of the properties of the Poisson distribution is that variance is equal to the mean so the number of cells per droplet is also equal to $\lambda$.

Along with cells, beads must also be captured in droplets, and when plastic beads are used the occupancy statistics follow a Poisson distribution as well. This means that with technologies such as Drop-seq (Macosko et al. 2015), which uses polystyrene beads, many droplets are either empty, contain a bead and no cell, or a cell and no bead. The latter situation (cell and no bead) leads to a low “capture rate”, i.e. not many of the cells are assayed in an experiment.

One of the advantages of the inDrops method (Klein et al. 2015) over other single-cell RNA-seq methods is that it uses hydrogel beads which allow for a reduction in the variance of the number of beads per cell. In an important paper Abate et al. 2009 showed that close packing of hydrogel beads allows for an almost degenerate distribution where the number of beads per droplet is exactly one 98% of the time. The video below shows how close to degeneracy the distribution can be squeezed (in the example two beads are being encapsulated per droplet):

A discrete distribution defined over the non-negative integers with variance less than the mean is called sub-Poisson. Similarly, a discrete distribution defined over the non-negative integers with variance greater than the mean is called super-Poisson. This terminology dates back to at least the 1940s (e.g., Berkson et al. 1942) and is standard in many fields from physics (e.g. Rodionov and Cherkin 2004) to biology (e.g. Pitchiaya et al. 2014 ).

Figure 5.26 from Adrian Jeantet, Cavity quantum electrodynamics with carbon nanotubes, 2017.

Using this terminology, the close packing of hydrogel beads can be said to enable sub-Poisson loading of beads into droplets because the variance of beads per droplet is reduced in comparison to the Poisson statistics of plastic beads.

Unfortunately, in a 2015 paper, Bose et al. used the term “super-Poisson” instead of “sub-Poisson” in discussing an approach to reducing bead occupancy variance in the single-cell RNA-seq context. This sign error in terminology has subsequently been propagated and recently appeared in a single- cell RNA-seq review (Zhang et al. 2018) and in 10x Genomics advertising materials.

When it comes to single-cell RNA-seq we already have people referring to the number of reads sequenced as “the library size” and calling trees “one-dimensional manifolds“. Now sub-Poisson is mistaken for super-Poisson. Before you know it we’ll have professors teaching students that cell clusters obtained by k-means clustering are “cell types“…

Supper poisson (not to be confused with super-Poisson (not to be confused with sub-Poisson)).

Earlier this month I posted a new paper on the bioRxiv:

Jase Gehring, Jeff Park, Sisi Chen, Matt Thomson, and Lior Pachter, Highly Multiplexed Single-Cell RNA-seq for Defining Cell Population and Transciptional Spaces, bioRxiv, 2018.

The paper offers some insights into the benefits of multiplex single-cell RNA-Seq, a molecular implementation of information multiplexing. The paper also reflects the benefits of a multiplex lab, and the project came about thanks to Jase Gehring, a multiplex molecular biologist/computational biologist in my lab.

mult·i·plex
/`məltəˌpleks/
adjective
– consisting of many elements in a complex relationship.
– involving simultaneous transmission of several messages along a single channel of communication.

Conceptually, Jase’s work presents a method for chemically labeling cells from multiple samples with DNA nucleotides so that samples can be pooled prior to single-cell RNA-Seq, yet cells can subsequently be associated with their samples of origin after sequencing. This is achieved by labeling all cells from a sample with DNA that is unique to that sample; in the figure below colors are used to represent the different DNA tags that are used for each sample:

This is analogous to the barcoding of transcripts in single-cell RNA-Seq, that allows for transcripts from the same cell of origin to be associated with each other, yet in this framework there is an additional layer of barcoding of cells.

The tagging mechanism is a click chemistry one-pot, two-step reaction in which cell samples are exposed to methyltetrazine-activated DNA (MTZ-DNA) oligos as well as the amine-reactive cross-linker NHS-trans-cyclooctene (NHS-TCO). The NHS functionalized oligos are formed in situ by reaction of methyltetrazine with trans-cyclooctene (the inverse-electron demand Diels-Alder (IEDDA) reaction). Nucleophilic amines present on all proteins, but not nucleic acids, attack the in situ-formed NHS-DNA, chemoprecipitating the functionalized oligos directly onto the cells:

MTZ-DNAs are made by activating 5′-amine modified oligos with NHS-MTZ for the IEDDA reaction, and they are designed with a PCR primer, a cell tag (a unique “barcode” sequence) and a poly-A tract so that they can be captured by poly-T during single-cell RNA-Seq:

Such oligos can be readily ordered from IDT. We are careful to refer to the identifying sequences in these oligos as cell tags rather than barcodes so as not to confuse them with cell barcodes which are used in single-cell RNA-Seq to associate transcripts with cells.

The process of sample tagging for single-cell RNA-Seq is illustrated in the figure below. It shows how the tags, appearing as synthetic “transcripts” in cells, are captured during 3′ based microfluidic single-cell RNA-Seq and are subsequently deciphered by sequencing a tag library alongside the cDNA library:

This significance of multiplexing is manifold. First, by labeling cells prior to performing single-cell RNA-Seq, multiplexing allows for controlling a trade off between the number of cells assayed per sample, and the total number of samples analyzed. This allows for leveraging the large number of cells that can be assayed with current technologies to enable complex experimental designs based on many samples. In our paper we demonstrate this by performing an experiment consisting of single-cell RNA-Seq of neural stem cells (NSCs) exposed to 96 different combinations of growth factors. The experiment was conducted in collaboration with the Thomson lab that is interested in performing large-scale perturbation experiments to understand cell fate decisions in response to developmental signals. We examined NSCs subjected to different concentrations of Scriptaid/Decitabine, epidermal growth factor/basic fibroblast growth factor, retinoid acid, and bone morphogenic protein 4. In other words, our experiment corresponded to a 4x4x6 table of conditions, and for each condition we performed a single-cell RNA-Seq experiment (in multiplex).

This is one of the largest (in terms of samples) single-cell RNA-Seq experiments to date: a 100-fold decrease in the number of cells we collected per sample allowed us to perform an experiment with 100x more samples. Without multiplexing, an experiment that cost us ~\$7,000 would cost a few hundred thousand dollars, well outside the scope of what is possible in a typical lab. We certainly would have not been able to perform the experiment without multiplexing. Although the cost tradeoff is impactful, there are many other important implications of multiplexing as well:

• Whereas simplex single-cell RNA-Seq is descriptive, focusing on what is in a single sample, multiplex single-cell RNA-Seq allows for asking how? For example how do cell states change in response to perturbations? How does disease affect cell state and type?
• Simplex single-cell RNA-Seq leads to systematics arguments about clustering: when do cells that cluster together constitute a “cell type”? How many clusters are real? How should clustering be performed? Multiplex single-cell RNA-Seq provides an approach to assigning significance to clusters via their association with samples. In our paper, we specifically utilized sample identification to determine the parameters/thresholds for the clustering algorithm:On the left hand side is a t-SNE plot labeled by different samples, and on the right hand side de novo clusters. The experiment allowed us to confirm the functional significance of a cluster as a cell state resulting from a specific range of perturbation conditions.
• Multiplexing reduces batch effect, and also makes possible the procurement of more replicates in experiments, an important aspect of single-cell RNA-Seq as noted by Hicks et al. 2017.
• Multiplexing has numerous other benefits, e.g. allowing for the detection of doublets and their removal prior to analysis. This useful observation of Stoeckius et al. makes possible higher-throughput single-cell RNA-Seq. We also found an intriguing relationship between tag abundance and cell size. Both of these phenomena are illustrated in one supplementary figure of our paper that I’m particularly fond of:

It shows a multiplexing experiment in which 8 different samples have been pooled together. Two of these samples are human-only samples, and two are mouse-only. The remaining four are samples in which human and mouse cells have been mixed together (with 2,3,4 and 5 tags being used for each sample respectively). The t-SNE plot is made from the tag counts, which is why the samples are neatly separated into 8 clusters. However in Panel b, the cells are colored by their cDNA content (human, mouse, or both). The pure samples are readily identifiable, as are the mixed samples. Cell doublets (purple) can be easily identified and therefore removed from analysis. The relationship between cell size and tag abundance is shown in Panel d. For a given sample with both human and mouse cells (bottom row), human cells give consistently higher sample tag counts. Along with all of this, the figure shows we are able to label a sample with 5 tags, which means that using only 20 oligos (this is how many we worked with for all of our experiments) it is possible to label ${20 \choose 5} = 15,504$ samples.

• Thinking about hundreds (and soon thousands) of single-cell experiments is going to be complicated. The cell-gene matrix that is the fundamental object of study in single-cell RNA-Seq extends to a cell-gene-sample tensor. While more complicated, there is an opportunity for novel analysis paradigms to be developed. A hint of this is evident in our visualization of the samples by projecting the sample-cluster matrix. Specifically, the matrix below shows which clusters are represented within each sample, and the matrix is quantitative in the sense that the magnitude of each entry represents the relative abundance of cells in a sample occupying a given cluster:
A three-dimensional PCA of this matrix reveals interesting structure in the experiment. Here each point is an entire sample, not a cell, and one can see how changes in factors move samples in “experiment space”:

As experiments become even more complicated, and single-cell assays become increasingly multimodal (including not only RNA-Seq but also protein measurements, methylation data, etc.) development of a coherent mathematical framework for single-cell genomics will be central to interpreting the data. As Dueck et al. 2015 point out, such analysis is likely to not only be mathematically interesting, but also functionally important.

We aren’t the only group thinking about sample multiplexing for single-cell RNA-Seq. The “demuxlet” method by Kang et al., 2017 is an in silico approach based on multiplexing from genomic variation. Kang et al. show that if pooled samples are genetically heterogeneous, genotype data can be used to separate samples providing an effective solution for multiplexing single-cell RNA-Seq in large human studies. However demuxlet has limitations, for example it cannot be used for samples from a homogenous genetic background. Two papers at the end of last year develop an epitope labeling strategy for multiplexing: Stoeckius et al. 2017 and Peterson et al. 2017. While epitope labeling provides additional information that can be of interest, our method is more universal in that it can be used to multiplex any kind of samples, even from different organisms (a point we make with the species mixing multiplex experiment I described above). The approaches are also not exclusive, epitope labeling could be coupled to a live cell DNA tagging multiplex experiment allowing for the same epitopes to be assayed together in different samples. Finally, our click chemistry approach is fast, cheap and convenient, immediately providing multiplex capability for thousands, or even hundreds of thousands of samples.

One interesting aspect of Jase’s multiplexing paper is that the project it describes was itself a multiplexing experiment of sorts. The origins of the experiment date to 2005 when I was awarded tenure in the mathematics department at UC Berkeley. As is customary after tenure trauma, I went on sabbatical for a year, and I used that time to ponder career related questions that one is typically too busy for. Questions I remember thinking about: Why exactly did I become a computational biologist? Was a mathematics department the ideal home for me? Should I be more deeply engaged with biologists? Were the computational biology papers I’d been writing meaningful? What is computational biology anyway?

In 2008, partly as a result of my sabbatical rumination but mostly thanks to the encouragement and support of Jasper Rine, I changed the structure of my appointment and joined the UC Berkeley Molecular and Cell Biology (MCB) department (50%). A year later, I responded to a call by then Dean Mark Schlissel and requested wet lab space in what was to become the Li Ka Shing Center at UC Berkeley. This was not a rash decision. After working with Cole Trapnell on RNA-Seq I’d come to the conclusion that a small wet lab would be ideal for our group to better learn the details of the technologies we were working on, and I felt that practicing them ourselves would ultimately be the best way to arrive at meaningful (computational) methods contributions. I’d also visited David Haussler‘s wet lab where I met Jason Underwood who was working on FragSeq at the time. I was impressed with his work and what I saw were important benefits of real contact between wet and dry, experiment and computation.

In 2011 I was delighted to move into my new wet lab. The decision to give me a few benches was a bold and unexpected one, spearheaded by Mark Schlissel, but also supported by a committee he formed to decide on the make up of the building. I am especially grateful to John Ngai, Art Reingold and Randy Scheckman for their help. However I was in a strange position starting a wet lab as a tenured professor. On the one hand the security of tenure provided some reassurance that a failure in the wet lab would not immediately translate to a failure of career. On the other hand, I had no startup funds to buy all the basic infrastructure necessary to run a lab. CIRM, Mark Schlissel, and later other senior faculty in Molecular & Cell Biology at UC Berkeley, stepped in to provide me with the basics: a -80 and -20, access to a shared cold room, a Bioanalyzer (to be shared with others in the building), and a thermocycler. I bought some other basic equipment but the most important piece was the recruitment of my first MCB graduate student: Shannon Hateley. Shannon and I agreed that she would set up the lab and also be lab manager, while I would supervise purchasing and other organization lab matters. I obtained informed consent from Shannon prior to her joining my lab, for what would be a monumental effort requested of her. We also agreed she would be co-advised by another molecular biologist “just in case”.

With Shannon’s work and then my second molecular biology student, Lorian Schaeffer, the lab officially became multiplexed. Jase, who initiated and developed not only the molecular biology but also the computational biology of Gehring et al. 2018 is the latest experimentalist to multiplex in our group. However some of the mathematicians now multiplex as well. This has been a boon to the research of the group and I see Jase’s paper as fruit that has grown from the diversity in the lab. Moving forward, I see increasing use of mathematics ideas in the development of novel molecular biology. For example, current single-cell RNA-Seq multiplexing is a form of information multiplexing that is trivial in comparison to the multiplexing ideas from information theory; the achievements are in the molecular implementations, but in the future I foresee much more of a blur between wet and dry and increasingly sophisticated mathematical ideas being implemented with molecular biology.

Hedy Lamarr, the mother of multiplexing.

In a first with RNA-Seq technology, scientists at Stanford University have broken through the single-cell barrier. In a recently published paper,  A.R. Wu et al., Quantitative assessment of single-cell RNA-sequencing methodsthe results of sequencing RNA from zero human cells. How was this accomplished? The gist of it is that an empty tube was filled with spike-in, and then submitted for RNA-Seq… The details are as follows: In order to assess the quality of single-cell RNA-Seq, Wu et al. performed numerous single-cell RNA-Seq experiments on HCT116 cells, as summarized in the figure below (Figure 1a from their paper).

Figure 1a from the Wu et al. paper showing the experimental design.

was interested in this study because for the regularized pooling project I’m working on with Nicolas Bray (see recent post),  it would be useful to demonstrate improvements in quantification accuracy by joint analysis of single-cell RNA-Seq. I asked Nick to look at the Wu et al. data when it came out two weeks ago, and as a first step he aligned the reads to the human transcriptome. Strangely, he found very low alignment rates, and in some cases literally almost no reads aligned at all. At first we thought there was some trimming issue, so we went to look at the Cufflinks output of the authors. The figure below, made by Nick, shows the percent spike-in (assessed by examining the abundance of ERCC-*) for each of the SMARTer based 96 samples:

The worst sample is C70 (GSM1241223) for which only 252 human transcripts have non-zero abundance. It is 99.828339% spike-in! The fact that the results of RNA-Seq on an empty test tube were published is in and of itself just a minor (?) embarrassment; more interesting is the range of quality obtained as measured by the amount of spike-in sequenced– a plot that we have made above and that seems crucial to the paper, but that was not produced by the authors. In fact, what the authors do show is slightly suspect: reproduced below is their Figure S2 from the Supplement:

Why would the authors show correlations for just four randomly picked samples? Why not show results for all of the data? We dug a bit deeper into this, and noticed that 93/96 of the FPKM file names look like [GEO accession]_CXX_ILXXXX. But the remaining three look like GSM1241223_C70_NTC_tube_ctrl_IL3196.sorted.genes.fpkm_tracking.txt.gz (which is the apparently empty tube), GSM1241245_C92_cell_tube_ctrl_IL3198.sorted.genes.fpkm_tracking.txt.gz and GSM1241195_C42_100ng_RNA_ctrl_IL3198.sorted.genes.fpkm_tracking.txt.gz. Therefore, these were presumably intended controls, but they were not published as such. There is the separate issue, that aside from the controls, the experiment in general appears to have some failure rate that is not clearly presented. This is evident in the following plot which Nick made, showing the average log-correlation of each experiment with the others after removing zeroes (the bottom one is C09 and the runner up is C70):

This figure is showing the honest truth of the paper. It is what it is; everyone I’ve talkedto that has actually performed single-cell RNA-Seq tells me that it is difficult and there is a non-trivial failure rate, on top of variable quality across cells. In fact, there is subtle evidence of failure in other papers. In the single-cell RNA-Seq technology race, the paper preceding Wu et al was A.K. Shalek et al., Single-cell transcriptomics reveals bimodality in expression and splicing in immune cellsNature (2013). In Shalek et al., the authors describe 18 single-cell experiments. Specifically, they claim to have constructed DNA libraries “from 18 single BMDCs (S1–S18), three replicate populations of 10,000 cells, and two negative controls (empty wells), and sequenced each to an average depth of 27 million read pairs.” However a close inspection of the GEO reveals the following IDs and descriptors:

 GSM1012777 Single cell S1 GSM1012778 Single cell S2 GSM1012779 Single cell S3 GSM1012780 Single cell S4 GSM1012781 Single cell S5 GSM1012782 Single cell S6 GSM1012783 Single cell S7 GSM1012784 Single cell S8 GSM1012785 Single cell S9 GSM1012786 Single cell S10 GSM1012787 Single cell S11 GSM1012788 Single cell S13 GSM1012789 Single cell S14 GSM1012790 Single cell S15 GSM1012791 Single cell S16 GSM1012792 Single cell S22 GSM1012793 Single cell S23 GSM1012794 Single cell S24

While there are 18 consecutive IDs, the cell labels range from 1–24. Where are the 6 missing cells? I can’t be sure, but they were probably failures. Update: the authors of the Shalek et al. paper explained to me after seeing the post that two of the missing labels were negative controls, and 3 were population replicates (the names of these were altered in GEO). $6-5=1$ which was indeed a failure (S12); it gave no signal on the BioAnalyzer and was therefore not sequenced. I was told that the authors are working on fixing the GEO sample names to clarify the reason for missing labels of samples. As such, it turns out the experiment was extremely successful with a success rate of 18/19.

Returning to Wu et al., they should be commended for releasing all their data (to their credit they also release the R code they used for analysis). The problem with the paper is that instead of reporting the failures and discarding them before analysis, they instead use all of the data when performing comparisons between single-cell and bulk RNA-Seq. This is is evident in some of the strange techniques they end up using. For example, the method for generating the crucial Figure 4a is described as:

“(a) Correlation between the merged
single cells (“ensemble”) and the bulk RNA-seq measurement of gene
expression. The ensemble was created by computationally pooling all
the raw reads obtained from the 96 single-cell transcriptomes
generated using the C1 system and then sampling 30 million reads
randomly. The bulk and ensemble libraries were depth matched before
alignment was performed. For each gene, the log2-transformed median
FPKM values from the ensemble and bulk were plotted against each
other. “

I’m guessing that the odd idea of sampling and then taking the median is precisely to throw out outliers coming from the control tubes. Yes, the data were tortured, and yes, the FPKMs confessed. The paper has some other issues that suggest it was not carefully reviewed by the authors (let alone the reviewers). In the Methods I found the statement “FPKM values used for analyses were generated by TopHat”. I, of all people, can attest to the fact that it is Cufflinks, not TopHat, that estimates (not generates!) FPKM values. Thankfully, in the GEO entries Cufflinks is correctly cited together with the version used.

In summary, in the last two high profile publications on single-cell RNA-Seq, there were failures in the experiment and they were not reported clearly by the authors. Neither committed an egregious offense, but I wish they had fully reported the number of experiments attempted and the number that succeeded. That seems to me to be important data in papers describing new technology. I believe that fear of rejection from the journal, or fear of embarrassment of the state of single-cell RNA-Seq is what drove Wu et al. to spin the results positively. All part of the fear of failure, that seems to hold back a lot of science. But single-cell RNA-Seq has a bright future and these papers would both be better if they were more open about failure. The only thing we have to fear is fear itself.

### Recent Comments

 Qun Anrikar on Sexual harassment case number… Dmitry Kondrashov on Time to end letter grades Joel Miller on The lethal nonsense of Michael… U. Ploy on The amoral nonsense of Orchid… JC JC on Williams math professor invest… Anonymous on The amoral nonsense of Orchid… John D. Brown Junior on The two cultures of mathematic… ziphon on The amoral nonsense of Orchid… Lior Pachter on The amoral nonsense of Orchid… Craken on The amoral nonsense of Orchid…

### Blog Stats

• 2,657,685 views