Earlier this month I posted a new paper on the bioRxiv:

Jase Gehring, Jeff Park, Sisi Chen, Matt Thomson, and Lior Pachter, Highly Multiplexed Single-Cell RNA-seq for Defining Cell Population and Transcriptional Spaces, bioRxiv, 2018.

The paper offers some insights into the benefits of multiplex single-cell RNA-Seq, a molecular implementation of information multiplexing. The paper also reflects the benefits of a multiplex lab, and the project came about thanks to Jase Gehring, a multiplex molecular biologist/computational biologist in my lab.

mult·i·plex
/ˈməltəˌpleks/
– consisting of many elements in a complex relationship.
– involving simultaneous transmission of several messages along a single channel of communication.

Conceptually, Jase’s work presents a method for chemically labeling cells from multiple samples with DNA nucleotides so that samples can be pooled prior to single-cell RNA-Seq, yet cells can subsequently be associated with their samples of origin after sequencing. This is achieved by labeling all cells from a sample with DNA that is unique to that sample; in the figure below colors are used to represent the different DNA tags that are used for each sample:

This is analogous to the barcoding of transcripts in single-cell RNA-Seq, which allows transcripts from the same cell of origin to be associated with each other, yet in this framework there is an additional layer of barcoding of cells.

The tagging mechanism is a click chemistry one-pot, two-step reaction in which cell samples are exposed to methyltetrazine-activated DNA (MTZ-DNA) oligos as well as the amine-reactive cross-linker NHS-trans-cyclooctene (NHS-TCO). The NHS-functionalized oligos are formed in situ by reaction of methyltetrazine with trans-cyclooctene (the inverse-electron demand Diels-Alder (IEDDA) reaction). Nucleophilic amines present on all proteins, but not nucleic acids, attack the in situ-formed NHS-DNA, chemoprecipitating the functionalized oligos directly onto the cells:

MTZ-DNAs are made by activating 5′-amine modified oligos with NHS-MTZ for the IEDDA reaction, and they are designed with a PCR primer, a cell tag (a unique “barcode” sequence) and a poly-A tract so that they can be captured by poly-T during single-cell RNA-Seq:

Such oligos can be readily ordered from IDT. We are careful to refer to the identifying sequences in these oligos as cell tags rather than barcodes so as not to confuse them with cell barcodes which are used in single-cell RNA-Seq to associate transcripts with cells.

The process of sample tagging for single-cell RNA-Seq is illustrated in the figure below. It shows how the tags, appearing as synthetic “transcripts” in cells, are captured during 3′ based microfluidic single-cell RNA-Seq and are subsequently deciphered by sequencing a tag library alongside the cDNA library:
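To make the deciphering step concrete, here is a minimal sketch of how cells might be assigned to samples from a table of tag counts. It assumes one tag per sample; the function name `assign_samples`, the `min_fraction` cutoff, and the toy counts are all hypothetical, and this is not the procedure used in the paper.

```python
def assign_samples(tag_counts, tag_to_sample, min_fraction=0.8):
    """Assign each cell barcode to a sample from its sample-tag counts.

    tag_counts: dict mapping cell barcode -> dict of tag -> read count.
    tag_to_sample: dict mapping tag -> sample name (one tag per sample here).
    min_fraction: hypothetical cutoff; the top tag must account for at least
    this fraction of a cell's tag reads, otherwise the cell is called
    "ambiguous" (for example, a doublet carrying tags from two samples).
    """
    calls = {}
    for cell, counts in tag_counts.items():
        total = sum(counts.values())
        if total == 0:
            calls[cell] = "unlabeled"
            continue
        top_tag = max(counts, key=counts.get)
        if counts[top_tag] / total >= min_fraction:
            calls[cell] = tag_to_sample[top_tag]
        else:
            calls[cell] = "ambiguous"
    return calls


# Toy data: three cell barcodes, two sample tags (all values invented).
toy_counts = {
    "cell_A": {"tag1": 95, "tag2": 5},
    "cell_B": {"tag1": 3, "tag2": 120},
    "cell_C": {"tag1": 60, "tag2": 55},
}
toy_map = {"tag1": "sample_1", "tag2": "sample_2"}
calls = assign_samples(toy_counts, toy_map)
```

In this toy example `cell_C` carries comparable counts of both tags, so it is flagged rather than assigned. A real analysis would need to choose the cutoff from the data.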

The significance of multiplexing is manifold. First, by labeling cells prior to performing single-cell RNA-Seq, multiplexing allows for controlling a tradeoff between the number of cells assayed per sample and the total number of samples analyzed. This allows for leveraging the large number of cells that can be assayed with current technologies to enable complex experimental designs based on many samples. In our paper we demonstrate this by performing an experiment consisting of single-cell RNA-Seq of neural stem cells (NSCs) exposed to 96 different combinations of growth factors. The experiment was conducted in collaboration with the Thomson lab, which is interested in performing large-scale perturbation experiments to understand cell fate decisions in response to developmental signals. We examined NSCs subjected to different concentrations of Scriptaid/Decitabine, epidermal growth factor/basic fibroblast growth factor, retinoic acid, and bone morphogenetic protein 4. In other words, our experiment corresponded to a 4x4x6 table of conditions, and for each condition we performed a single-cell RNA-Seq experiment (in multiplex).

This is one of the largest (in terms of samples) single-cell RNA-Seq experiments to date: a 100-fold decrease in the number of cells we collected per sample allowed us to perform an experiment with 100x more samples. Without multiplexing, an experiment that cost us ~\$7,000 would cost a few hundred thousand dollars, well outside the scope of what is possible in a typical lab. We certainly would not have been able to perform the experiment without multiplexing. Although the cost tradeoff is impactful, there are many other important implications of multiplexing as well:

• Whereas simplex single-cell RNA-Seq is descriptive, focusing on what is in a single sample, multiplex single-cell RNA-Seq allows for asking how? For example, how do cell states change in response to perturbations? How does disease affect cell state and type?
• Simplex single-cell RNA-Seq leads to systematics arguments about clustering: when do cells that cluster together constitute a “cell type”? How many clusters are real? How should clustering be performed? Multiplex single-cell RNA-Seq provides an approach to assigning significance to clusters via their association with samples. In our paper, we specifically utilized sample identification to determine the parameters/thresholds for the clustering algorithm: On the left-hand side is a t-SNE plot labeled by different samples, and on the right-hand side de novo clusters. The experiment allowed us to confirm the functional significance of a cluster as a cell state resulting from a specific range of perturbation conditions.
• Multiplexing reduces batch effect, and also makes possible the procurement of more replicates in experiments, an important aspect of single-cell RNA-Seq as noted by Hicks et al. 2017.
• Multiplexing has numerous other benefits, e.g. allowing for the detection of doublets and their removal prior to analysis. This useful observation of Stoeckius et al. makes possible higher-throughput single-cell RNA-Seq. We also found an intriguing relationship between tag abundance and cell size. Both of these phenomena are illustrated in one supplementary figure of our paper that I’m particularly fond of:

It shows a multiplexing experiment in which 8 different samples have been pooled together. Two of these samples are human-only samples, and two are mouse-only. The remaining four are samples in which human and mouse cells have been mixed together (with 2, 3, 4 and 5 tags being used for each sample respectively). The t-SNE plot is made from the tag counts, which is why the samples are neatly separated into 8 clusters. However, in Panel b, the cells are colored by their cDNA content (human, mouse, or both). The pure samples are readily identifiable, as are the mixed samples. Cell doublets (purple) can be easily identified and therefore removed from analysis. The relationship between cell size and tag abundance is shown in Panel d. For a given sample with both human and mouse cells (bottom row), human cells give consistently higher sample tag counts. Along with all of this, the figure shows we are able to label a sample with 5 tags, which means that using only 20 oligos (this is how many we worked with for all of our experiments) it is possible to label ${20 \choose 5} = 15,504$ samples.
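The count of 5-tag labels from 20 oligos can be checked directly, and the label sets themselves enumerated. The sketch below uses only the Python standard library; the oligo names are hypothetical placeholders, and whether every subset would be distinguishable in practice is a separate question not addressed here.

```python
from itertools import combinations, islice
from math import comb

n_oligos = 20        # number of distinct tag oligos, as stated in the text
tags_per_sample = 5  # number of tags used to label one sample

# Number of distinct 5-tag subsets of 20 oligos.
n_labels = comb(n_oligos, tags_per_sample)

# Hypothetical oligo names; each sample would receive one distinct subset.
oligos = [f"tag{i:02d}" for i in range(1, n_oligos + 1)]
first_three = list(islice(combinations(oligos, tags_per_sample), 3))
```

Here `n_labels` evaluates to 15504, matching the binomial coefficient in the text.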

• Thinking about hundreds (and soon thousands) of single-cell experiments is going to be complicated. The cell-gene matrix that is the fundamental object of study in single-cell RNA-Seq extends to a cell-gene-sample tensor. While more complicated, there is an opportunity for novel analysis paradigms to be developed. A hint of this is evident in our visualization of the samples by projecting the sample-cluster matrix. Specifically, the matrix below shows which clusters are represented within each sample, and the matrix is quantitative in the sense that the magnitude of each entry represents the relative abundance of cells in a sample occupying a given cluster:
A three-dimensional PCA of this matrix reveals interesting structure in the experiment. Here each point is an entire sample, not a cell, and one can see how changes in factors move samples in “experiment space”:
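As a rough illustration of the sample-cluster matrix described above, the following sketch builds it from per-cell sample and cluster labels. The function name and the toy labels are hypothetical; the rows of the resulting matrix (one per sample) are what a PCA would then be run on, which is not shown here.

```python
from collections import Counter


def sample_cluster_matrix(cell_samples, cell_clusters):
    """Return (samples, clusters, matrix), where matrix[i][j] is the
    fraction of cells from samples[i] that fall in clusters[j]."""
    samples = sorted(set(cell_samples))
    clusters = sorted(set(cell_clusters))
    pair_counts = Counter(zip(cell_samples, cell_clusters))
    sample_totals = Counter(cell_samples)
    matrix = [
        [pair_counts[(s, c)] / sample_totals[s] for c in clusters]
        for s in samples
    ]
    return samples, clusters, matrix


# Toy input: one sample label and one cluster label per cell (invented).
cell_samples = ["s1", "s1", "s1", "s1", "s2", "s2"]
cell_clusters = [0, 0, 0, 1, 1, 1]
samples, clusters, matrix = sample_cluster_matrix(cell_samples, cell_clusters)
```

Each row sums to one, so samples with very different numbers of cells remain comparable.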

As experiments become even more complicated, and single-cell assays become increasingly multimodal (including not only RNA-Seq but also protein measurements, methylation data, etc.) development of a coherent mathematical framework for single-cell genomics will be central to interpreting the data. As Dueck et al. 2015 point out, such analysis is likely to not only be mathematically interesting, but also functionally important.

We aren’t the only group thinking about sample multiplexing for single-cell RNA-Seq. The “demuxlet” method by Kang et al., 2017 is an in silico approach based on multiplexing from genomic variation. Kang et al. show that if pooled samples are genetically heterogeneous, genotype data can be used to separate samples, providing an effective solution for multiplexing single-cell RNA-Seq in large human studies. However, demuxlet has limitations; for example, it cannot be used for samples from a homogeneous genetic background. Two papers at the end of last year develop an epitope labeling strategy for multiplexing: Stoeckius et al. 2017 and Peterson et al. 2017. While epitope labeling provides additional information that can be of interest, our method is more universal in that it can be used to multiplex any kind of samples, even from different organisms (a point we make with the species mixing multiplex experiment I described above). The approaches are also not exclusive; epitope labeling could be coupled to a live cell DNA tagging multiplex experiment, allowing for the same epitopes to be assayed together in different samples. Finally, our click chemistry approach is fast, cheap and convenient, immediately providing multiplex capability for thousands, or even hundreds of thousands of samples.

One interesting aspect of Jase’s multiplexing paper is that the project it describes was itself a multiplexing experiment of sorts. The origins of the experiment date to 2005 when I was awarded tenure in the mathematics department at UC Berkeley. As is customary after tenure trauma, I went on sabbatical for a year, and I used that time to ponder career related questions that one is typically too busy for. Questions I remember thinking about: Why exactly did I become a computational biologist? Was a mathematics department the ideal home for me? Should I be more deeply engaged with biologists? Were the computational biology papers I’d been writing meaningful? What is computational biology anyway?

In 2008, partly as a result of my sabbatical rumination but mostly thanks to the encouragement and support of Jasper Rine, I changed the structure of my appointment and joined the UC Berkeley Molecular and Cell Biology (MCB) department (50%). A year later, I responded to a call by then Dean Mark Schlissel and requested wet lab space in what was to become the Li Ka Shing Center at UC Berkeley. This was not a rash decision. After working with Cole Trapnell on RNA-Seq I’d come to the conclusion that a small wet lab would be ideal for our group to better learn the details of the technologies we were working on, and I felt that practicing them ourselves would ultimately be the best way to arrive at meaningful (computational) methods contributions. I’d also visited David Haussler’s wet lab where I met Jason Underwood who was working on FragSeq at the time. I was impressed with his work and by what I saw as important benefits of real contact between wet and dry, experiment and computation.

In 2011 I was delighted to move into my new wet lab. The decision to give me a few benches was a bold and unexpected one, spearheaded by Mark Schlissel, but also supported by a committee he formed to decide on the makeup of the building. I am especially grateful to John Ngai, Art Reingold and Randy Schekman for their help. However, I was in a strange position starting a wet lab as a tenured professor. On the one hand, the security of tenure provided some reassurance that a failure in the wet lab would not immediately translate to a failure of career. On the other hand, I had no startup funds to buy all the basic infrastructure necessary to run a lab. CIRM, Mark Schlissel, and later other senior faculty in Molecular & Cell Biology at UC Berkeley, stepped in to provide me with the basics: a -80 and -20, access to a shared cold room, a Bioanalyzer (to be shared with others in the building), and a thermocycler. I bought some other basic equipment but the most important piece was the recruitment of my first MCB graduate student: Shannon Hateley. Shannon and I agreed that she would set up the lab and also be lab manager, while I would supervise purchasing and other organizational lab matters. I obtained informed consent from Shannon prior to her joining my lab, for what would be a monumental effort requested of her. We also agreed she would be co-advised by another molecular biologist “just in case”.

With Shannon’s work and then my second molecular biology student, Lorian Schaeffer, the lab officially became multiplexed. Jase, who initiated and developed not only the molecular biology but also the computational biology of Gehring et al. 2018, is the latest experimentalist to multiplex in our group. However, some of the mathematicians now multiplex as well. This has been a boon to the research of the group and I see Jase’s paper as fruit that has grown from the diversity in the lab. Moving forward, I see increasing use of mathematical ideas in the development of novel molecular biology. For example, current single-cell RNA-Seq multiplexing is a form of information multiplexing that is trivial in comparison to the multiplexing ideas from information theory; the achievements are in the molecular implementations, but in the future I foresee much more of a blur between wet and dry and increasingly sophisticated mathematical ideas being implemented with molecular biology.

Hedy Lamarr, the mother of multiplexing.

In a first with RNA-Seq technology, scientists at Stanford University have broken through the single-cell barrier. In a recently published paper, A.R. Wu et al., Quantitative assessment of single-cell RNA-sequencing methods, the authors report the results of sequencing RNA from zero human cells. How was this accomplished? The gist of it is that an empty tube was filled with spike-in, and then submitted for RNA-Seq… The details are as follows: In order to assess the quality of single-cell RNA-Seq, Wu et al. performed numerous single-cell RNA-Seq experiments on HCT116 cells, as summarized in the figure below (Figure 1a from their paper).

Figure 1a from the Wu et al. paper showing the experimental design.

I was interested in this study because for the regularized pooling project I’m working on with Nicolas Bray (see recent post), it would be useful to demonstrate improvements in quantification accuracy by joint analysis of single-cell RNA-Seq. I asked Nick to look at the Wu et al. data when it came out two weeks ago, and as a first step he aligned the reads to the human transcriptome. Strangely, he found very low alignment rates, and in some cases almost no reads aligned at all. At first we thought there was some trimming issue, so we went to look at the Cufflinks output of the authors. The figure below, made by Nick, shows the percent spike-in (assessed by examining the abundance of ERCC-*) for each of the 96 SMARTer-based samples:
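A percent spike-in figure of this kind could be computed roughly as follows. This is a sketch under assumptions: it takes a per-sample table of abundances keyed by identifier and treats every identifier starting with `ERCC-` as spike-in; the function name and toy values are invented, and the exact calculation behind Nick's plot is not stated here.

```python
def percent_spike_in(abundances, prefix="ERCC-"):
    """Percentage of total abundance attributed to spike-in sequences.

    abundances: dict mapping identifier -> abundance (for example, FPKM).
    prefix: identifiers starting with this string are counted as spike-in.
    """
    total = sum(abundances.values())
    if total == 0:
        raise ValueError("sample has no abundance at all")
    spike = sum(v for k, v in abundances.items() if k.startswith(prefix))
    return 100.0 * spike / total


# Toy sample dominated by spike-in (all abundance values are invented).
toy = {"ERCC-00002": 900.0, "ERCC-00004": 90.0, "GAPDH": 8.0, "ACTB": 2.0}
pct = percent_spike_in(toy)
```

A sample in which nearly all of the abundance sits on `ERCC-*` identifiers, as in the toy input, is the signature of a tube with little or no cellular RNA.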

The worst sample is C70 (GSM1241223) for which only 252 human transcripts have non-zero abundance. It is 99.828339% spike-in! The fact that the results of RNA-Seq on an empty test tube were published is in and of itself just a minor (?) embarrassment; more interesting is the range of quality obtained as measured by the amount of spike-in sequenced, shown in a plot that we have made above and that seems crucial to the paper, but that was not produced by the authors. In fact, what the authors do show is slightly suspect: reproduced below is their Figure S2 from the Supplement:

Why would the authors show correlations for just four randomly picked samples? Why not show results for all of the data? We dug a bit deeper into this, and noticed that 93/96 of the FPKM file names look like [GEO accession]_CXX_ILXXXX. But the remaining three look like GSM1241223_C70_NTC_tube_ctrl_IL3196.sorted.genes.fpkm_tracking.txt.gz (which is the apparently empty tube), GSM1241245_C92_cell_tube_ctrl_IL3198.sorted.genes.fpkm_tracking.txt.gz and GSM1241195_C42_100ng_RNA_ctrl_IL3198.sorted.genes.fpkm_tracking.txt.gz. Therefore, these were presumably intended controls, but they were not published as such. There is the separate issue that, aside from the controls, the experiment in general appears to have some failure rate that is not clearly presented. This is evident in the following plot which Nick made, showing the average log-correlation of each experiment with the others after removing zeroes (the bottom one is C09 and the runner up is C70):
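One way to compute an average log-correlation of this kind is sketched below. It assumes Pearson correlation of natural-log abundances over genes that are non-zero in both samples; the choice of correlation, the log base, the function names, and the toy data are all assumptions, since the details of Nick's calculation are not given here.

```python
from math import log, sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (needs variance > 0)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def log_correlation(a, b):
    """Correlation of log abundances over genes non-zero in both samples."""
    pairs = [(log(u), log(v)) for u, v in zip(a, b) if u > 0 and v > 0]
    xs, ys = zip(*pairs)  # assumes at least two shared non-zero genes
    return pearson(xs, ys)


def average_log_correlation(samples):
    """samples: dict of name -> abundances, with genes in the same order."""
    names = list(samples)
    return {
        s: sum(log_correlation(samples[s], samples[t]) for t in names if t != s)
        / (len(names) - 1)
        for s in names
    }


# Toy data: three concordant samples and one discordant one (values invented).
toy = {
    "c1": [1, 10, 100, 1000, 0],
    "c2": [2, 20, 200, 2000, 5],
    "c3": [1, 12, 90, 1100, 0],
    "failed": [50, 0, 3, 40, 7],
}
avg = average_log_correlation(toy)
worst = min(avg, key=avg.get)
```

Sorting samples by this average puts discordant experiments at the bottom, which is how a plot like the one described would expose failures.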

This figure is showing the honest truth of the paper. It is what it is; everyone I’ve talked to who has actually performed single-cell RNA-Seq tells me that it is difficult and there is a non-trivial failure rate, on top of variable quality across cells. In fact, there is subtle evidence of failure in other papers. In the single-cell RNA-Seq technology race, the paper preceding Wu et al. was A.K. Shalek et al., Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells, Nature (2013). In Shalek et al., the authors describe 18 single-cell experiments. Specifically, they claim to have constructed DNA libraries “from 18 single BMDCs (S1–S18), three replicate populations of 10,000 cells, and two negative controls (empty wells), and sequenced each to an average depth of 27 million read pairs.” However, a close inspection of the GEO reveals the following IDs and descriptors:

GSM1012777 Single cell S1
GSM1012778 Single cell S2
GSM1012779 Single cell S3
GSM1012780 Single cell S4
GSM1012781 Single cell S5
GSM1012782 Single cell S6
GSM1012783 Single cell S7
GSM1012784 Single cell S8
GSM1012785 Single cell S9
GSM1012786 Single cell S10
GSM1012787 Single cell S11
GSM1012788 Single cell S13
GSM1012789 Single cell S14
GSM1012790 Single cell S15
GSM1012791 Single cell S16
GSM1012792 Single cell S22
GSM1012793 Single cell S23
GSM1012794 Single cell S24

While there are 18 consecutive IDs, the cell labels range from 1–24. Where are the 6 missing cells? I can’t be sure, but they were probably failures. Update: the authors of the Shalek et al. paper explained to me after seeing the post that two of the missing labels were negative controls, and 3 were population replicates (the names of these were altered in GEO). $6-5=1$ which was indeed a failure (S12); it gave no signal on the BioAnalyzer and was therefore not sequenced. I was told that the authors are working on fixing the GEO sample names to clarify the reason for missing labels of samples. As such, it turns out the experiment was extremely successful with a success rate of 18/19.
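The bookkeeping above is easy to check. The sketch below takes the cell labels from the GEO list, finds which labels in the range 1 to 24 are absent, and then applies the accounting given in the update (two negative controls and three population replicates among the missing labels); the variable names are my own.

```python
# Cell labels present in the GEO list: S1-S11, S13-S16, S22-S24.
present = list(range(1, 12)) + [13, 14, 15, 16, 22, 23, 24]

# Labels in 1..24 with no corresponding GEO entry.
missing = sorted(set(range(1, 25)) - set(present))

# Per the authors' explanation: 2 negative controls and 3 population
# replicates account for 5 of the missing labels; the rest are failures.
n_controls, n_replicates = 2, 3
n_failures = len(missing) - n_controls - n_replicates

attempted = len(present) + n_failures
```

This gives 18 sequenced cells out of 19 attempted, consistent with the 18/19 success rate.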

Returning to Wu et al., they should be commended for releasing all their data (to their credit they also release the R code they used for analysis). The problem with the paper is that, instead of reporting the failures and discarding them before analysis, they use all of the data when performing comparisons between single-cell and bulk RNA-Seq. This is evident in some of the strange techniques they end up using. For example, the method for generating the crucial Figure 4a is described as:

“(a) Correlation between the merged single cells (“ensemble”) and the bulk RNA-seq measurement of gene expression. The ensemble was created by computationally pooling all the raw reads obtained from the 96 single-cell transcriptomes generated using the C1 system and then sampling 30 million reads randomly. The bulk and ensemble libraries were depth matched before alignment was performed. For each gene, the log2-transformed median FPKM values from the ensemble and bulk were plotted against each other.”

I’m guessing that the odd idea of sampling and then taking the median is precisely to throw out outliers coming from the control tubes. Yes, the data were tortured, and yes, the FPKMs confessed. The paper has some other issues that suggest it was not carefully reviewed by the authors (let alone the reviewers). In the Methods I found the statement “FPKM values used for analyses were generated by TopHat”. I, of all people, can attest to the fact that it is Cufflinks, not TopHat, that estimates (not generates!) FPKM values. Thankfully, in the GEO entries Cufflinks is correctly cited together with the version used.
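For readers unfamiliar with the pooling-and-subsampling step in the quoted Figure 4a caption, it amounts to something like the sketch below. This is only an illustration of that one step, with a hypothetical function name, toy read identifiers, and a fixed seed; how the median enters the authors' pipeline is not clear from the caption and is not modeled here.

```python
import random


def pooled_subsample(reads_per_cell, n_reads, seed=0):
    """Pool reads from all libraries and draw n_reads without replacement.

    reads_per_cell: dict mapping library name -> list of read identifiers.
    seed: fixed only so that the toy example is reproducible.
    """
    pooled = [r for reads in reads_per_cell.values() for r in reads]
    if n_reads > len(pooled):
        raise ValueError("cannot sample more reads than were pooled")
    rng = random.Random(seed)
    return rng.sample(pooled, n_reads)


# Toy libraries of different sizes (read identifiers are invented).
toy_reads = {
    f"C{i:02d}": [f"C{i:02d}_read{j}" for j in range(size)]
    for i, size in [(1, 8), (2, 5), (3, 12)]
}
ensemble = pooled_subsample(toy_reads, n_reads=10)
```

Note that uniform sampling from the pool keeps reads from every library in proportion to its size, including any control tubes that were pooled.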

In summary, in the last two high profile publications on single-cell RNA-Seq, there were failures in the experiment and they were not reported clearly by the authors. Neither committed an egregious offense, but I wish they had fully reported the number of experiments attempted and the number that succeeded. That seems to me to be important data in papers describing new technology. I believe that fear of rejection from the journal, or fear of embarrassment about the state of single-cell RNA-Seq, is what drove Wu et al. to spin the results positively. It is all part of the fear of failure that seems to hold back a lot of science. But single-cell RNA-Seq has a bright future and these papers would both be better if they were more open about failure. The only thing we have to fear is fear itself.