You are currently browsing the category archive for the ‘technology’ category.

The long-standing practice of data sharing in genomics can be traced to the Bermuda principles, which were formulated during the human genome project (Contreras, 2010). While the Bermuda principles focused on open sharing of DNA sequence data, they heralded the adoption of other open source standards in the genomics community. For example, unlike many other scientific disciplines, most genomics software is open source and this has been the case for a long time (Stajich and Lapp, 2006). The open principles of genomics have arguably greatly accelerated progress and facilitated discovery.

While open sourcing has become de rigueur in genomics dry labs, wet labs remain beholden to commercial instrument providers that rarely open source hardware or software, and impose draconian restrictions on instrument use and modification. With a view towards joining others who are working to change this state of affairs, we’ve posted a new preprint in which we describe an open source syringe pump and microscope system called poseidon:

A. Sina Booeshaghi, Eduardo da Veiga Beltrame, Dylan Bannon, Jase Gehring and Lior Pachter,

The poseidon system consists of

• A syringe pump that can operate at a wide range of flow rates. The bulk cost per pump is $37.91. A system of three pumps that can be used for droplet based single-cell RNA-seq experiments can be assembled for$174.87
• A microscope system that can be used to evaluate the quality of emulsions produced using the syringe pumps. The cost is $211.69. • Open source software that can be used to operate four pumps simultaneously, either via a Raspberry Pi (that is part of the microscope system) or directly via a laptop/desktop. Together, these components can be used to build a Drop-seq rig for under$400, or they can be used piecemeal for a wide variety of tasks. Along with describing benchmarks of poseidon, the preprint presents design guidelines that we hope can accelerate both development and adoption of open source bioinstruments. These were developed while working on the project; some were borrowed from our experience with bioinformatics software, while others emerged as we worked out kinks during development. As is the case with software, open source is not,  in and of itself, enough to make an application usable.  We had to optimize many steps during the development of poseidon, and in the preprint we illustrate the design principles we converged on with specific examples from poseidon.

The complete hardware/software package consists of the following components:

We benchmarked the system thoroughly and it has similar performance to a commercial Harvard Apparatus syringe pump; see panel (a) below. The software driving the pumps can be used for infusion or withdrawl, and is easily customizable. In fact, the ability to easily program arbitrary schedules and flow rates without depending on vendors who frequently charge money and require firmware upgrades for basic tasks, was a major motivation for undertaking the project. The microscope is basic but usable for setting up emulsions. Shown in panel (b) below is a microfluidic droplet generation chip imaged with the microscope. Panel (c) shows that we have no trouble generating uniform emulsions with it.

Together, the system constitutes a Drop-seq rig (3 pumps + microscope) that can be built for under $400: We did not start the poseidon project from scratch. First of all, we were fortunate to have some experience with 3D printing. Shortly after I started setting up a wet lab, Shannon Hateley, a former student in the lab, encouraged me to buy a 3D printer to reduce costs for basic lab supplies. The original MakerGear M2 we purchased has served us well saving us enormous amounts of money just as Shannon predicted, and in fact we now have added a Prusa printer: The printer Shannon introduced to the lab came in handy when, some time later, after starting to perform Drop-seq in the lab, Jase Gehring became frustrated with the rigidity commercial syringe pumps he was using. With a 3D printer available in-house, he was able to print and assemble a published open source syringe pump design. What started as a hobby project became more serious when two students joined the lab: Sina Booeshaghi, a mechanical engineer, and Eduardo Beltrame, an expert in 3D printing. We were also encouraged by the publication of a complete Drop-seq do-it-yourself design from the Satija lab. Starting with the microscope device from the Stephenson et al. paper, and the syringe pump from the Wijnen et al. paper, we worked our way through numerous hardware design optimizations and software prototypes. The photo below shows the published work we started with at the bottom, the final designs we ended up with at the top, and intermediate versions as we converged on design choices: In the course of design we realized that despite a lot of experience developing open source software in the lab, there were new lessons we were learning regarding open-source hardware development, and hardware-software integration. We ended up formulating six design principles that we explain in detail in the preprint via example of how they pertained to the poseidon design: We are hopeful that these principles we adhered to will serve as useful guidelines for others interested in undertaking open source bioinstrumentation projects. Earlier this month I posted a new paper on the bioRxiv: Jase Gehring, Jeff Park, Sisi Chen, Matt Thomson, and Lior Pachter, Highly Multiplexed Single-Cell RNA-seq for Defining Cell Population and Transciptional Spaces, bioRxiv, 2018. The paper offers some insights into the benefits of multiplex single-cell RNA-Seq, a molecular implementation of information multiplexing. The paper also reflects the benefits of a multiplex lab, and the project came about thanks to Jase Gehring, a multiplex molecular biologist/computational biologist in my lab. mult·i·plex /məltəˌpleks/ adjective – consisting of many elements in a complex relationship. – involving simultaneous transmission of several messages along a single channel of communication. Conceptually, Jase’s work presents a method for chemically labeling cells from multiple samples with DNA nucleotides so that samples can be pooled prior to single-cell RNA-Seq, yet cells can subsequently be associated with their samples of origin after sequencing. This is achieved by labeling all cells from a sample with DNA that is unique to that sample; in the figure below colors are used to represent the different DNA tags that are used for each sample: This is analogous to the barcoding of transcripts in single-cell RNA-Seq, that allows for transcripts from the same cell of origin to be associated with each other, yet in this framework there is an additional layer of barcoding of cells. The tagging mechanism is a click chemistry one-pot, two-step reaction in which cell samples are exposed to methyltetrazine-activated DNA (MTZ-DNA) oligos as well as the amine-reactive cross-linker NHS-trans-cyclooctene (NHS-TCO). The NHS functionalized oligos are formed in situ by reaction of methyltetrazine with trans-cyclooctene (the inverse-electron demand Diels-Alder (IEDDA) reaction). Nucleophilic amines present on all proteins, but not nucleic acids, attack the in situ-formed NHS-DNA, chemoprecipitating the functionalized oligos directly onto the cells: MTZ-DNAs are made by activating 5′-amine modified oligos with NHS-MTZ for the IEDDA reaction, and they are designed with a PCR primer, a cell tag (a unique “barcode” sequence) and a poly-A tract so that they can be captured by poly-T during single-cell RNA-Seq: Such oligos can be readily ordered from IDT. We are careful to refer to the identifying sequences in these oligos as cell tags rather than barcodes so as not to confuse them with cell barcodes which are used in single-cell RNA-Seq to associate transcripts with cells. The process of sample tagging for single-cell RNA-Seq is illustrated in the figure below. It shows how the tags, appearing as synthetic “transcripts” in cells, are captured during 3′ based microfluidic single-cell RNA-Seq and are subsequently deciphered by sequencing a tag library alongside the cDNA library: This significance of multiplexing is manifold. First, by labeling cells prior to performing single-cell RNA-Seq, multiplexing allows for controlling a trade off between the number of cells assayed per sample, and the total number of samples analyzed. This allows for leveraging the large number of cells that can be assayed with current technologies to enable complex experimental designs based on many samples. In our paper we demonstrate this by performing an experiment consisting of single-cell RNA-Seq of neural stem cells (NSCs) exposed to 96 different combinations of growth factors. The experiment was conducted in collaboration with the Thomson lab that is interested in performing large-scale perturbation experiments to understand cell fate decisions in response to developmental signals. We examined NSCs subjected to different concentrations of Scriptaid/Decitabine, epidermal growth factor/basic fibroblast growth factor, retinoid acid, and bone morphogenic protein 4. In other words, our experiment corresponded to a 4x4x6 table of conditions, and for each condition we performed a single-cell RNA-Seq experiment (in multiplex). This is one of the largest (in terms of samples) single-cell RNA-Seq experiments to date: a 100-fold decrease in the number of cells we collected per sample allowed us to perform an experiment with 100x more samples. Without multiplexing, an experiment that cost us ~$7,000 would cost a few hundred thousand dollars, well outside the scope of what is possible in a typical lab. We certainly would have not been able to perform the experiment without multiplexing. Although the cost tradeoff is impactful, there are many other important implications of multiplexing as well:

• Whereas simplex single-cell RNA-Seq is descriptive, focusing on what is in a single sample, multiplex single-cell RNA-Seq allows for asking how? For example how do cell states change in response to perturbations? How does disease affect cell state and type?
• Simplex single-cell RNA-Seq leads to systematics arguments about clustering: when do cells that cluster together constitute a “cell type”? How many clusters are real? How should clustering be performed? Multiplex single-cell RNA-Seq provides an approach to assigning significance to clusters via their association with samples. In our paper, we specifically utilized sample identification to determine the parameters/thresholds for the clustering algorithm:On the left hand side is a t-SNE plot labeled by different samples, and on the right hand side de novo clusters. The experiment allowed us to confirm the functional significance of a cluster as a cell state resulting from a specific range of perturbation conditions.
• Multiplexing reduces batch effect, and also makes possible the procurement of more replicates in experiments, an important aspect of single-cell RNA-Seq as noted by Hicks et al. 2017.
• Multiplexing has numerous other benefits, e.g. allowing for the detection of doublets and their removal prior to analysis. This useful observation of Stoeckius et al. makes possible higher-throughput single-cell RNA-Seq. We also found an intriguing relationship between tag abundance and cell size. Both of these phenomena are illustrated in one supplementary figure of our paper that I’m particularly fond of:

It shows a multiplexing experiment in which 8 different samples have been pooled together. Two of these samples are human-only samples, and two are mouse-only. The remaining four are samples in which human and mouse cells have been mixed together (with 2,3,4 and 5 tags being used for each sample respectively). The t-SNE plot is made from the tag counts, which is why the samples are neatly separated into 8 clusters. However in Panel b, the cells are colored by their cDNA content (human, mouse, or both). The pure samples are readily identifiable, as are the mixed samples. Cell doublets (purple) can be easily identified and therefore removed from analysis. The relationship between cell size and tag abundance is shown in Panel d. For a given sample with both human and mouse cells (bottom row), human cells give consistently higher sample tag counts. Along with all of this, the figure shows we are able to label a sample with 5 tags, which means that using only 20 oligos (this is how many we worked with for all of our experiments) it is possible to label ${20 \choose 5} = 15,504$ samples.

• Thinking about hundreds (and soon thousands) of single-cell experiments is going to be complicated. The cell-gene matrix that is the fundamental object of study in single-cell RNA-Seq extends to a cell-gene-sample tensor. While more complicated, there is an opportunity for novel analysis paradigms to be developed. A hint of this is evident in our visualization of the samples by projecting the sample-cluster matrix. Specifically, the matrix below shows which clusters are represented within each sample, and the matrix is quantitative in the sense that the magnitude of each entry represents the relative abundance of cells in a sample occupying a given cluster:
A three-dimensional PCA of this matrix reveals interesting structure in the experiment. Here each point is an entire sample, not a cell, and one can see how changes in factors move samples in “experiment space”:

As experiments become even more complicated, and single-cell assays become increasingly multimodal (including not only RNA-Seq but also protein measurements, methylation data, etc.) development of a coherent mathematical framework for single-cell genomics will be central to interpreting the data. As Dueck et al. 2015 point out, such analysis is likely to not only be mathematically interesting, but also functionally important.

We aren’t the only group thinking about sample multiplexing for single-cell RNA-Seq. The “demuxlet” method by Kang et al., 2017 is an in silico approach based on multiplexing from genomic variation. Kang et al. show that if pooled samples are genetically heterogeneous, genotype data can be used to separate samples providing an effective solution for multiplexing single-cell RNA-Seq in large human studies. However demuxlet has limitations, for example it cannot be used for samples from a homogenous genetic background. Two papers at the end of last year develop an epitope labeling strategy for multiplexing: Stoeckius et al. 2017 and Peterson et al. 2017. While epitope labeling provides additional information that can be of interest, our method is more universal in that it can be used to multiplex any kind of samples, even from different organisms (a point we make with the species mixing multiplex experiment I described above). The approaches are also not exclusive, epitope labeling could be coupled to a live cell DNA tagging multiplex experiment allowing for the same epitopes to be assayed together in different samples. Finally, our click chemistry approach is fast, cheap and convenient, immediately providing multiplex capability for thousands, or even hundreds of thousands of samples.

One interesting aspect of Jase’s multiplexing paper is that the project it describes was itself a multiplexing experiment of sorts. The origins of the experiment date to 2005 when I was awarded tenure in the mathematics department at UC Berkeley. As is customary after tenure trauma, I went on sabbatical for a year, and I used that time to ponder career related questions that one is typically too busy for. Questions I remember thinking about: Why exactly did I become a computational biologist? Was a mathematics department the ideal home for me? Should I be more deeply engaged with biologists? Were the computational biology papers I’d been writing meaningful? What is computational biology anyway?

In 2008, partly as a result of my sabbatical rumination but mostly thanks to the encouragement and support of Jasper Rine, I changed the structure of my appointment and joined the UC Berkeley Molecular and Cell Biology (MCB) department (50%). A year later, I responded to a call by then Dean Mark Schlissel and requested wet lab space in what was to become the Li Ka Shing Center at UC Berkeley. This was not a rash decision. After working with Cole Trapnell on RNA-Seq I’d come to the conclusion that a small wet lab would be ideal for our group to better learn the details of the technologies we were working on, and I felt that practicing them ourselves would ultimately be the best way to arrive at meaningful (computational) methods contributions. I’d also visited David Haussler‘s wet lab where I met Jason Underwood who was working on FragSeq at the time. I was impressed with his work and what I saw were important benefits of real contact between wet and dry, experiment and computation.

In 2011 I was delighted to move into my new wet lab. The decision to give me a few benches was a bold and unexpected one, spearheaded by Mark Schlissel, but also supported by a committee he formed to decide on the make up of the building. I am especially grateful to John Ngai, Art Reingold and Randy Scheckman for their help. However I was in a strange position starting a wet lab as a tenured professor. On the one hand the security of tenure provided some reassurance that a failure in the wet lab would not immediately translate to a failure of career. On the other hand, I had no startup funds to buy all the basic infrastructure necessary to run a lab. CIRM, Mark Schlissel, and later other senior faculty in Molecular & Cell Biology at UC Berkeley, stepped in to provide me with the basics: a -80 and -20, access to a shared cold room, a Bioanalyzer (to be shared with others in the building), and a thermocycler. I bought some other basic equipment but the most important piece was the recruitment of my first MCB graduate student: Shannon Hateley. Shannon and I agreed that she would set up the lab and also be lab manager, while I would supervise purchasing and other organization lab matters. I obtained informed consent from Shannon prior to her joining my lab, for what would be a monumental effort requested of her. We also agreed she would be co-advised by another molecular biologist “just in case”.

With Shannon’s work and then my second molecular biology student, Lorian Schaeffer, the lab officially became multiplexed. Jase, who initiated and developed not only the molecular biology but also the computational biology of Gehring et al. 2018 is the latest experimentalist to multiplex in our group. However some of the mathematicians now multiplex as well. This has been a boon to the research of the group and I see Jase’s paper as fruit that has grown from the diversity in the lab. Moving forward, I see increasing use of mathematics ideas in the development of novel molecular biology. For example, current single-cell RNA-Seq multiplexing is a form of information multiplexing that is trivial in comparison to the multiplexing ideas from information theory; the achievements are in the molecular implementations, but in the future I foresee much more of a blur between wet and dry and increasingly sophisticated mathematical ideas being implemented with molecular biology.

Hedy Lamarr, the mother of multiplexing.

I have been fascinated with mini computers for some time, and have wondered when they will become suitable for bioinformatics. The 4273π project, which is an online course that is distributed as a 32Gb SD card image for the Raspberry Pi, has been around for a few years and demonstrated the utility of mini computers for training. The course is a proof of principle that bioinformatics software can work on a mini computer; the distributed software includes some comparative genomics and phylogenetics programs. However there is not much one can do with 1Gb RAM. The data in 4273π are small FASTA files, and while the Raspberry Pi is powerful enough to allow for experimentation and exploration of such datasets, even the new Raspberry Pi 3, with ten times the performance of the original 2012 model, still only has 1Gb of RAM and is not powerful enough for handling the current primary data type of genomics: high-throughput sequencing data.

Enter the Rock64.

The Rock64 is a new single-board computer from Pine64 that competes with the Raspberry Pi 3:

The Rock64 is evidence of the rapid and impressive development in single-board computers over the past few years, and Pine64 crosses a major threshold by offering a model with 4Gb RAM. The machine is also cheap. A 4Gb RAM Rock64, which is a 64-bit, quad core 1.5GHz machine, costs $44.95 (the 1Gb model is just$24.95). An enclosure is $7.95, a power supply$6.99, and a 64Gb SSD drive is only $31.95 (the 16Gb drive is$15.95). When my student Jase Gehring found out the specs of the machine last summer, he immediately realized that it was powerful enough to run kallisto for RNA-Seq analyses, and we preordered a handful of the boards for the lab. These arrived in the fall and we have been testing the machines for a while. One of them is hooked up to a monitor, and together with a bluetooth mouse and keyboard is serving as a general desktop computer in the wet lab. They are extraordinary versatile mini computers that, in my opinion, portend a future of mobile, low-cost, and light-weight computing for clinical and field genomics applications.

Unfortunately ARM is not an architecture known to most computational biologists, and my initial enthusiasm for the Rock64 was dampened when I found out that most genomics software does not work on ARM architecture. However I managed to install R, and Páll Melsted compiled kallisto on the Rock64 for the new release of version 0.44 (the release introduces an ARM binary, along with pseudobam for visualization of pseudoalignments). With these programs in place on Gibraltar (our first Rock64 with 4Gb of RAM, a 64Gb SSD drive, and a quad-core 1.5GHz processor), there was ample processing power to quantify RNA-Seq datasets.

For example, I was able to build the Saccharomyces cerevisae release 81 transcriptome index in one minute. A complete quantification of 6 samples from Ellahi, Thurtle and Rine, 2015 using two cores (with 30 bootstraps per sample) took 21 minutes. The quantification consisted of processing 47,744,312 paired-end reads. Amazingly, the Rock64 can quantify human RNA-Seq, which requires pseudoalignment of reads to a much larger transcriptome than yeast. A human 15,117,833 paired-end read sample (SRR493366) took less than 11 minutes to quantify using a single core. These results show that the Rock64 is not a toy; it can be used for the analysis of high-throughput sequencing data from substantial biological experiments.

It’s mind boggling to consider just how amazing it is to be able to quantify RNA-Seq on such a machine. When we developed kallisto we knew that the two orders of magnitude speedup was a game-changer, but I never thought we would literally be able to run it on what is not much more than a phone. We’re not going to switch over all of our RNA-Seq analyses to the Rock64s quite yet, but cluster assemblies such as the Pico5S have piqued my interest.

I imagine that it won’t be long before mini computers are even more powerful, and provide ultra low-cost portable alternatives to current server and cloud computing solutions. Having said that, I still miss my Commodore 64. Fortunately the mini revolution isn’t leaving me behind: a mini version of the C64 is slated for release early this year.

The paper “Genomic-scale capture and sequencing of endogenous DNA from feces” by George H. Perry, John C. Marioni, Páll Melsted and Yoav Gilad is literally full of feces. The word ‘fecal’ appears 100 times.

Poop jokes aside, the paper presents an interesting idea that has legs. Perry et al. show that clever use of Agilent’s SureSelect allows for capturing nuclear genomic regions from fecal DNA. Intellectually, it is the predecessor of T. Mercer et al.‘s “Targeted RNA sequencing reveals deep complexity of the human transcriptome“, Nature Biotechnology, 2011 (another paper I like and for which I wrote a research highlight). Even though the Perry et al. paper does not have many citations, the Mercer et al. paper does (although unfortunately the authors forgot to cite Perry et al., which I think they should have). In other words, the Perry et al. paper is not as well known as it ought to be, and this post is an attempt to rectify that.

The ‘*’ in sh*t in my title is for *Seq. At a high level,  the Perry et al. paper shows how high-throughput sequencing technology can be leveraged to sequence deeply a single genome from among a community of metagenomes. For this reason, and for convenience, I henceforth will refer to the Perry et al. paper as the Sh*t-Seq paper. The “*” is inserted in lieu of the “i” not as censorship, but to highlight the point that the method is general and applies not only to sequencing nuclear genome from fecal DNA, but also as Mercer et al. shows, for targeted transcriptome sequencing (one can imagine also many other applications).

The Sh*t-Seq protocol is conceptually simple yet complicated in practice. DNA was captured using the Agilent SureSelect target enrichment system coupled to the Illumina single-end sequencing platform library prep protocol (of note is that the paper is from 2010 and the kits are from 2009). However because of the very small amount of DNA targeted (the authors claim ~1.8%), a number of adjustments to standard SureSelect capture / Illumina library prep had to be implemented. To the authors credit, the paragraphs in the section “DNA Capture” are exemplary in their level of detail and presumably greatly facilitate replicability. I won’t repeat all the detail here. However there are two steps that caught my attention as possibly problematic. First, the the authors performed substantial PCR amplification of the  adapter-ligated fecal DNA. This affects the computational analysis they discuss later and leads to a computational step they implemented that I have some issues with (more on this later). Second, they performed two rounds of capture as one round was insufficient for capturing the needed material for sequencing. This necessitated additional PCR, which also is possibly problematic.

The samples collected were from six chimpanzees. This is a fairly small n but the paper is a proof of principle and I think this is sufficient. Both fecal and blood samples were collected allowing for comparison of the fecally derived nuclear DNA to the actual genome of the primates. In what is clearly an attempt to channel James Bond, they collected fecal samples (2 g of stool) within 1 hour of defecation in tubes containing RNALater and these were then “shaken vigorously” (not stirred).

The next part of the paper is devoted to computational analyses to confirm that the Sh*t-Seq protocol can in fact be used to target nuclear endogenous DNA. As a sanity check, mtDNA was considered first. They noted too much diversity to align reads to a reference genome with BWA, and opted instead for de novo assembly using ABySS. This is certainly overkill, possible only because of the high copy count of mitochondrion reads. But I suppose it worked (after filtering out all the low coverage ABySS sequence, which was presumably junk). One interesting idea given more modern RNA-Seq assembly tools would be to assemble the resulting reads with an RNA-Seq de novo assembler that allows for different abundances of sequences. Ideally, such an assembly should indicate naturally the sought after enrichment.

Next, nuclear DNA was investigated, specifically the X chromosome and chromosome 21. Here the analyses is very pre-2014. First, all multi-mapping reads were removed. This is not a good idea for many reasons, and I am quite certain that with the new GRCh38 (with alternate sequence representation for variant regions) it is a practice that will rapidly be phased out. I’d like to give Perry et al. the benefit of the doubt for making this mistake since they published in 2010, but their paper appeared 6 months after the Cufflinks paper so they could have, in principle, known better. Having said that, while I do think multi-mapping would have allowed them to obtain much stronger results as to the accuracy and extent of their enrichment by avoiding the tossing of a large number of reads, their paper does manage to prove their principle so its not a big deal.

The removal of multi-mapping reads was just the first step in a series of “filters” designed to narrow down the nuclear DNA reads to regions of the chimpanzee genome that could be argued to be unambiguously representative of the target. I won’t go into details, although they are all in the paper. As with the experimental methods, I applaud the authors on publishing reproducible methods, especially computational methods, with all details included. But there was a final red flag for me in the computational methods: namely the selection of a single unique fragment (at random) for each genomic (start) site for the purposes of calling SNPs. This was done to eliminate problems due to amplification biases, which is indeed a serious concern, but if heterozygous sites appear due to the PCR steps, then there ought to have been telltale signatures. For example, a PCR “SNP’ would, I think, appear only in reads specific to a single position, but not in other overlapping reads of the site. It would have been very helpful if they would have done a detailed analysis of this issue, rather than just pick a single read at random for each genomic (start) site. They kicked the can down the road.

Having removed multiple mapping reads, repetitive regions, low coverage regions, etc. etc. Dayenu, they ended up estimating a false positive rate for heterozygous sites (using the X chromosome in males) at 0.0007% for fecal DNA and 0.0010% for blood DNA. This led them to conclude that incorrectly-identified heterozygous sites in their study were 0.8%, 2.0%, 1.1%, and 2.7% for fecal DNA chromosome 21, fecal DNA chromosome X in females, blood DNA chromosome 21, and blood DNA chromosome X in females, respectively. Such good news is certainly the result of their extraordinarily stringent filtering, but I think it does prove that they were able to target effectively. They give further proof using PCR and Sanger sequencing of 20 regions.

I have a final nitpick and it relates to Figure 4. It is a a companion to Figure 3 which shows the Chimpanzee phylogeny for their samples based on the mtDNA. As expected in that figure, the fecal and blood samples cluster together. Figure 4 shows two phylogenies, one based on chr 21, the other on chr X. My issue here is with the way that distances were constructed. Its a technical point, but it looks like they used hamming distance, and I don’t think that makes a lot of sense, not to mention the fact that neighbor-joining does not seem like the appropriate algorithm for building a tree in this setting (I plan to blog about neighbor-joining shortly). But this is a methodological point not really relevant to the main result of the paper, namely proof of principle for targeted sequencing of endogenous DNA from fecal matter.

I think Sh*t-Seq has a future. The idea of targeted capture coupled to high-throughput sequencing has more than an economic rationale. It provides the possibility to probe the “deep field” as discussed in the previously mentioned review on targeted RNA-Seq. This is a general principle that should be more widely recognized.

And of course, dung is just cool. Happy new year!

My new year ‘s resolution.

[Update April 6, 2014: The initial title of this post was “23andme genotypes are all wrong”. While that was and remains a technically correct statement, I have changed it because the readership of my blog, and this post in particular, has changed. Initially, when I made this post, the readers of the blog were (computational) biologists with extensive knowledge of genotyping and association mapping, and they could understand the point I was trying to make with the title. However in the past few months the readership of my blog has grown greatly, and the post is now reaching a wide public audience. The revised title clarifies that the content of this post relates to the point that low error rates in genotyping can be problematic in the context of genome-wide association reports because of multiple-testing.]

I have been reading the flurry of news articles and blog posts written this week about 23andme and the FDA with some interest. In my research talks, I am fond of displaying 23andme results, and have found that people always respond with interest. On the teaching side, I have subsidized 23andme testing for volunteer students in Math127 who were interested in genetics so that they could learn about personalized genomics first-hand. Finally, a number of my former and current students have worked at 23andme, and some are current employees.

Despite lots of opinions being expressed about the 23andme vs. FDA kerfuffle, I believe that two key points have been ignored in the discussions:

1. All 23andme genotypes that have ever been reported to customers are wrong. This is the case despite very accurate genotyping technology used by 23andme.
2. The interpretation of 23andme results involves examining a large number of odds ratios. The presence of errors leads to a huge multiple-testing problem.

Together, these issues lead to an interesting conundrum for the company, for customers, and for the FDA.

I always find it useful to think about problems concretely. In the case of 23andme, it means examining actual genotypes. Fortunately, you don’t have to pay the company \$99 dollars to get your own- numerous helpful volunteers have posted their 23andme genotypes online. They can be viewed at openSNP.org where “customers of direct-to-customer genetic tests [can] publish their test results, find others with similar genetic variations, learn more about their results, get the latest primary literature on their variations and help scientists find new associations”. There are a total of 624 genotypes available at openSNP, many of them from 23andme. As an example, consider “samantha“, who in addition to providing her 23andme genotype, also provides lots of phenotypic information. Here is the initial part of her genotype file:

# This data file generated by 23andMe at: Wed Jul 20 20:37:11 2011
#
# Below is a text version of your data. Fields are TAB-separated
# Each line corresponds to a single SNP.  For each SNP, we provide its identifier
# (an rsid or an internal id), its location on the reference human genome, and the
# genotype call oriented with respect to the plus strand on the human reference
# sequence.     We are using reference human assembly build 36.  Note that it is possible
# that data downloaded at different times may be different due to ongoing improvements
#
# http://www.ncbi.nlm.nih.gov/projects/mapview/map_search.cgi?taxid=9606&build=36
#
# rsid	chromosome	position	genotype
rs4477212	1	72017	AA
rs3094315	1	742429	AG
rs3131972	1	742584	AG
rs12124819	1	766409	AA
rs11240777	1	788822	AA
rs6681049	1	789870	CC
rs4970383	1	828418	CC
rs4475691	1	836671	CC
rs7537756	1	844113	AA
rs13302982	1	851671	GG
rs1110052	1	863421	GT
...`

Anyone who has been genotyped by 23andme can get this file for themselves from the website (by clicking on their name, then on “Browse Raw Data” from the pull-down menu, and then clicking on “Download” in the top-right corner of the browser window). The SNPs are labeled with rsid labels (e.g. rs3094315) and correspond to specific locations on chromosomes (e.g. chr1:742429). Since every human is diploid, two bases are shown for every SNP; one came from mom and one from dad. The 23andme genotype is not phased, which means that you can’t tell in the case of rs3094315 whether the A was from mom and the G from dad, or vice versa (it turns out paternal origin can be important, but that is a topic for another post).

A key question the FDA has asked, as it does for any diagnostic test, is whether the SNP calls are accurate. The answer is already out there. First, someone has performed a 23andme replicate experiment precisely to assess the error rate. In an experiment in 2010 with two replicates, 85 SNPs out of about 600,000 were different. Today, Illumina types around 1 million SNPs, so one would expect even more errors. Furthermore, a replicate analysis provides only a lower bound, since systematic errors will not be detected. Another way to examine the error rate is to look at genotypes of siblings. That was written about in this blog post which concluded there were 87 errors. 23andme currently uses the Illumina Omni Express for genotyping, and the Illumina spec sheet claims a similar error rate to those inferred in the blog posts mentioned above. The bottom line is that even though the error rate for any individual SNP call is very very low (<0.01% error), with a million SNPs being called there is (almost) certainly at least one error somewhere in the genotype. In fact, assuming a conservative error rate leading to an average of 100 errors per genotype, the probability that a 23andme genotype has no errors is less than 10^(-40).

The fact that 23andme genotypes are wrong (i.e. at least one error in some SNP) wouldn’t matter if one was only interested in a single SNP. With very high probability, it would be some other SNPs that are the wrong ones. But the way people use 23andme is not to look at a single SNP of interest, but rather to scan the results from all SNPs to find out whether there is some genetic variant with large (negative) effect. The good news is that there isn’t much information available for the majority of the 1 million SNPs being tested. But there are, nevertheless, lots of SNPs (thousands) to look at. Whereas a comprehensive exam at a doctor’s office might currently constitute a handful of tests– a dozen or a few dozen at most– a 23andme test assessing thousands of SNPs and hundreds of diseases/traits constitutes more diagnostic tests on an individual at one time than have previously been performed in a lifetime.

To understand how many tests are being performed in a 23andme experiment, it is helpful to look at the Interpretome website. The website allows a user to examine information on SNPs without paying, and without uploading the data. I took a look at Samantha, and the Interpretome gave information about 2829 SNPs. These are SNPs for which there is a research article that has identified the SNP as significant in some association study (the website conveniently provides direct links to the articles). For example, here are two rows from the phenotype table describing something about Samantha’s genetic predisposition for large head circumference:

Head circumference (infant) 11655470  CC  T  .05    4E-6    22504419
Head circumference (infant) 1042725    CC  T  .07    3E-10  22504419

Samantha’s genotype at the locus is CC, the “risk” allele is T, the odds ratios  are very small (0.05,0.07) and the p-values are apparently significant. Interpretome’s results differ from those of 23andme, but looking at the diversity of phenotypes reported on gives one a sense for the possibilities that currently exist in genetics, and the scope of 23andme’s reports.

From the estimates of error rates provided above, and using the back of an envelope, it stands to reason that about 1/3 of 23andme tested individuals have an error at one of their “interesting” SNPs. Not all of SNPs arising in association studies are related to diseases, but many of them are. I don’t think its unreasonable to postulate that a significant percentage of 23andme customers have some error in a SNP that is medically important. Whether such errors are typically false positives or false negatives is unclear, and the extent to which they may lead to significant odds ratios is another interesting question. In other words, its not good enough to know how frequently warfarin sensitivity is being called incorrectly. The question is how frequently some medically significant result is incorrect.

Of course, the issue of multiple testing as it pertains to interpreting genotypes is probably a secondary issue with 23andme. As many bloggers have pointed out, it is not even clear that many of 23andme’s odds ratios are accurate or meaningful. A major issue, for example, is the population background of an individual examining his/her genotype and how close it is to the population on which the GWAS were performed. Furthermore, there are serious questions about the meaning of the GWAS odds ratios in the case of complex traits. However I think the issue of multiple testing is a deeper one, and a problem that will only be exacerbated as more disease SNPs are identified. Having said that, there are also approaches that could mitigate errors and improve fidelity of the tests. As DECODE genetics has demonstrated, imputation and phasing can in principle be used to infer population haplotypes, which not only are useful for GWAS analyses, but can also be used to identify erroneous SNP calls. 23andme’s problem is that although they have many genotypes, they are from diverse populations that will be harder to impute and phase.

The issue of multiple testing arising in the context of 23andme and the contrast with classic diagnostics reminds me of the dichotomy between whole-genome analysis and classic single gene molecular biology. The way in which customers are looking at their 23andme results is precisely to look for the largest effects, i.e. phenotypes where they appear to have high odds of contracting a disease, or being sensitive to some drug. This is the equivalent of genome scientists picking the “low hanging fruit” out of genome-wide experiments such as those performed in ENCODE. In genomics, scientists have learned (with some exceptions)  how to interpret genome-wide analyses after correcting for multiple-hypothesis testing by controlling for false discovery rate. But are the customers of 23andme doing so? Is the company helping them do it? Should it? Will the FDA require it? Can looking at ones own genotype constitute too much testing?

There are certainly many precedents for superfluous harmful testing in medicine. For example, the American Academy of Family Physicians has concluded that prostate cancer PSA tests and digital rectal exams have marginal benefits that are outweighed by the harm caused by following up on positive results. Similar arguments have been made for mammography screening. I therefore think that there are serious issues to consider about the implications of direct-to-consumer genetic testing and although I support the democratization of genomics, I’m glad the FDA is paying attention.

Samantha’s type 2 diabetes risk as estimated from her genotype by Interpretome. She appears to have a lower risk than an average person. Does this make it ok for her to have another cookie?

Don’t believe the anti-hype. They are saying that RNA-Seq promises the discovery of new expression events, but it doesn’t deliver:

Is this true? There have been a few papers comparing microarrays to RNA-Seq technology (including one of my own) that I’ll discuss below, but first a break-down of the Affymetrix “evidence”. The first is this figure (the poor quality of the images is exactly what Affymetrix provides, and not due to a reduction in quality on this site; they are slightly enlarged when clicked on):

The content of this figure is an illustration of the gene LMNB1 (Lamin protein of type B), used to argue that microarrays can provide transcript level resolution whereas RNA-Seq can’t!! Really? Affymetrix is saying that RNA-Seq users would likely use the RefSeq annotation which only has three isoforms. But this is a ridiculous claim. It is well known that RefSeq is a conservative annotation and certainly RNA-Seq users have the same access to the multiple databases Affymetrix used to build their annotation (presumably, e.g. Ensembl). It therefore seems that what Affymetrix is saying with this figure is that RNA-Seq users are dumb.

The next figure is showing the variability in abundance estimates as a function of expression level for RNA-SEq and the HTA 2.0, with the intended message being that microarrays are less noisy:

But there is a subtle trick going on here. And its in the units. The x-axis is showing RPM, which is an abbreviation for Reads Per Million. This is not a commonly used unit, and there is a reason. First, its helpful to review what is used. In his landmark paper on RNA-Seq, Ali Mortazavi introduced the units RPKM (note the extra K) that stands for reads per kilobase of transcript per million mapped. Why the extra kilobase term? In my review on RNA-Seq quantification I explain that RPKM is proportional to a maximum likelihood estimate of transcript abundance (obtained from a simple RNA-Seq model). The complete derivation is on page 6 ending in Equation 13; I include a summary here:

The maximum likelihood (ML) abundances $\hat{\rho}_t$ are  given by

$\hat{\rho}_t = \frac{\frac{\hat{\alpha}_t}{l_t}}{\sum_{r \in T} \frac{\hat{\alpha}_r}{l_r}} \propto \frac{X_t}{\left( \frac{l_t}{10^3}\right) \left( \frac{N}{10^6}\right) }$

In these equations $l_t$ is the length of transcript (if reads are long it is necessary to modify the length due to edge effects, hence the tilde in the paper), the $\hat{\alpha}_t$ are the maximum likelihood estimates for the probabilities of selecting reads from transcripts (unnormalized by their length) and finally $X_t$ is the number of reads mapping to transcript t while N is the total number of mapped reads. The point is that RPKM (the rightmost formula for abundance) is really a unit for describing the maximum likelihood relative abundances ($\hat{\rho}$) scaled by some factors.

RPKM as a unit has two problems. The first is that in current RNA-Seq experiments reads are paired so that the actual units being counted (in $X_t$) are fragments. For this reason we modified RPKM to FPKM in the Cufflinks paper (the “F” replaced “R” for fragment instead of read). A more serious problem, noted by Bo Li and Colin Dewey in their paper on RSEM, is that while FPKM is proportional to ML estimates of abundance, the proportionality constants may vary between experiments. For this reason they proposed TPM (transcripts per million) which is also proportional to the ML abundance estimates but with a proportionality constant (a million) that is the same between experiments. All of these units are used simply to avoid writing down the $\hat{\rho}_t$ which are in many cases tiny numbers since they must all sum to 1.

Returning to the Affymetrix figure, we see the strange RPM units. In essence, this is the rightmost term in the equation above, with the $l_t$ length terms removed from the denominators. Therefore RPM is proportional to the $\hat{\alpha}_t$. If a transcript is short, even if it is equally abundant to a longer transcript ,it will produce less RNA-Seq reads and therefore its $\hat{\alpha}_t$ will be (possibly considerably) smaller. The effect of displaying RPM for RNA-Seq vs. expression level for the HTA 2.0 arrays is therefore to mix apples and oranges. Since what is being displayed is a coefficient of variation, there is a bias caused by the relationship between length and expression (e.g. many highly expressed housekeeping genes are short).

To be fair to Affymetrix the conversion between the $\hat{\alpha}$ and the $\hat{\rho}$ can be confusing (its explained in Lemma 14 in the Supplement of the Cufflinks paper). So maybe the discordant x-axes were unintentional…but then there is the third figure:

Here its a bit hard to tell what is going on because not all the information needed to decipher the figure is provided. For example, its not clear how the “expression of exons” was computed or measured for the RNA-Seq experiment. I suspect that as with the previous figure, read numbers were not normalized by length of exons, and moreover spliced reads (and other possibly informative reads from transcripts) were ignored. In other words, I don’t really believe the result.

Having said this, it is true that expression arrays can have an advantage in measuring exon expression, because an array measurement is absolute (as opposed to the relative quantification that is all that is possible with RNA-Seq). Array signal is based on hybridization, and it is probably a reasonable assumption that some minimum amount of RNA triggers a signal, and that this amount is independent of the remainder of the RNA in an experiment. So arrays can (and in many cases probably do) have advantages over RNA-Seq.

There are a few papers that have looked into this, for example the paper “A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae ” by Nookaew et al., Nucleic Acids Research 40 (2012) who find high reproducibility in RNA-Seq and consistency between arrays and RNA-Seq.  Xu et al., in “Human transcriptome array for high-throughput clinical studies“, PNAS 108 (2011), 3707–3712 are more critical, agreeing with Affymetrix that arrays are more sensitive at the exon level. For disease studies, they recommend using RNA-Seq to identify transcripts relevant to the disease, and then screening for those transcripts on patients using arrays.

For the Cuffdiff2 paper describing our new statistical procedures for differential analysis of transcripts and genes, the Rinn lab performed deep RNA-Seq and array expression measurement on the same samples from a HOXA1 knowdown (the experiments included multiple replicates of both the RNA-Seq and the arrays). To my knowledge, it is the deepest and most comprehensive data currently available for comparing arrays and RNA-Seq. Admittedly, the arrays used were not Affymetrix but Agilent SurePrint G3, and the RNA-Seq coverage was deep, however we had two main findings very different from the Affymetrix claims. First, we found overall strong correlation between array expression values and RNA-Seq abundance estimates. The correlation remained strong for large regimes of expression even with very few reads (tested by sequencing fewer reads from a MiSeq). Second, we found that arrays were missing differentially expressed transcripts, especially at low abundance levels. In other words, we found RNA-Seq to have higher resolution. The following figure from our paper made the case (note the overall Spearman Correlation was 0.86):

There are definitely continued applications for arrays. Both in high-throughput screening applications (as suggested in the Xu et al. paper), and also in the development of novel assays. For example Mercer et al. “Targeted rNA sequencing reveals the deep complexity of the human transcriptome“, Nature Biotechnology 30 (2012) 99–104  show how to couple capture (with arrays) with RNA-Seq to provide ultra deep sequencing in subsets of the transcriptome. So its not yet the time to write off arrays. But RNA-Seq has many applications of its own. For example the ability to better detect allele-specific expression, the opportunity to identify RNA-DNA differences (and thereby study RNA editing), and the ability to study expression in non-model organisms where genomes sequences are incomplete and annotations poor. For all these reasons I’m betting on RNA-Seq.