You are currently browsing the monthly archive for January 2019.

The long-standing practice of data sharing in genomics can be traced to the Bermuda principles, which were formulated during the human genome project (Contreras, 2010). While the Bermuda principles focused on open sharing of DNA sequence data, they heralded the adoption of other open source standards in the genomics community. For example, unlike many other scientific disciplines, most genomics software is open source and this has been the case for a long time (Stajich and Lapp, 2006). The open principles of genomics have arguably greatly accelerated progress and facilitated discovery.

While open sourcing has become de rigueur in genomics dry labs, wet labs remain beholden to commercial instrument providers that rarely open source hardware or software, and impose draconian restrictions on instrument use and modification. With a view towards joining others who are working to change this state of affairs, we’ve posted a new preprint in which we describe an open source syringe pump and microscope system called poseidon:

A. Sina Booeshaghi, Eduardo da Veiga Beltrame, Dylan Bannon, Jase Gehring and Lior Pachter,

The poseidon system consists of

• A syringe pump that can operate at a wide range of flow rates. The bulk cost per pump is \$37.91. A system of three pumps that can be used for droplet based single-cell RNA-seq experiments can be assembled for \$174.87
• A microscope system that can be used to evaluate the quality of emulsions produced using the syringe pumps. The cost is \$211.69.
• Open source software that can be used to operate four pumps simultaneously, either via a Raspberry Pi (that is part of the microscope system) or directly via a laptop/desktop.

Together, these components can be used to build a Drop-seq rig for under \$400, or they can be used piecemeal for a wide variety of tasks. Along with describing benchmarks of poseidon, the preprint presents design guidelines that we hope can accelerate both development and adoption of open source bioinstruments. These were developed while working on the project; some were borrowed from our experience with bioinformatics software, while others emerged as we worked out kinks during development. As is the case with software, open source is not,  in and of itself, enough to make an application usable.  We had to optimize many steps during the development of poseidon, and in the preprint we illustrate the design principles we converged on with specific examples from poseidon.

The complete hardware/software package consists of the following components:

We benchmarked the system thoroughly and it has similar performance to a commercial Harvard Apparatus syringe pump; see panel (a) below. The software driving the pumps can be used for infusion or withdrawl, and is easily customizable. In fact, the ability to easily program arbitrary schedules and flow rates without depending on vendors who frequently charge money and require firmware upgrades for basic tasks, was a major motivation for undertaking the project. The microscope is basic but usable for setting up emulsions. Shown in panel (b) below is a microfluidic droplet generation chip imaged with the microscope. Panel (c) shows that we have no trouble generating uniform emulsions with it.

Together, the system constitutes a Drop-seq rig (3 pumps + microscope) that can be built for under \$400:

We did not start the poseidon project from scratch. First of all, we were fortunate to have some experience with 3D printing. Shortly after I started setting up a wet lab, Shannon Hateley, a former student in the lab, encouraged me to buy a 3D printer to reduce costs for basic lab supplies. The original MakerGear M2 we purchased has served us well saving us enormous amounts of money just as Shannon predicted, and in fact we now have added a Prusa printer:

The printer Shannon introduced to the lab came in handy when, some time later, after starting to perform Drop-seq in the lab, Jase Gehring became frustrated with the rigidity commercial syringe pumps he was using. With a 3D printer available in-house, he was able to print and assemble a published open source syringe pump design. What started as a hobby project became more serious when two students joined the lab: Sina Booeshaghi, a mechanical engineer, and  Eduardo Beltrame, an expert in 3D printing. We were also encouraged by the publication of a complete Drop-seq do-it-yourself design from the Satija lab. Starting with the microscope device from the Stephenson et al. paper, and the syringe pump from the Wijnen et al. paper, we worked our way through numerous hardware design optimizations and software prototypes. The photo below shows the published work we started with at the bottom, the final designs we ended up with at the top, and intermediate versions as we converged on design choices:

In the course of design we realized that despite a lot of experience developing open source software in the lab, there were new lessons we were learning regarding open-source hardware development, and hardware-software integration. We ended up formulating six design principles that we explain in detail in the preprint via example of how they pertained to the poseidon design:

We are hopeful that these principles we adhered to will serve as useful guidelines for others interested in undertaking open source bioinstrumentation projects.

This post is a review of a recent preprint on an approach to testing for RNA-seq gene differential expression directly from transcript compatibility counts:

Marek Cmero, Nadia M Davidson and Alicia Oshlack, Fast and accurate differential transcript usage by testing equivalence class counts, bioRxiv 2018.

To understand the preprint two definitions are important. The first is of gene differential expression, which I wrote about in a previous blog post and is best understood, I think, with the following figure (reproduced from Supplementary Figure 1 of Ntranos, Yi, et al., 2018):

In this figure, two isoforms of a hypothetical gene are called primary and secondary, and two conditions in a hypothetical experiment are called “A” and “B”. The black dots labeled conditions A and B have x-coordinates $x_A$ and $x_B$ corresponding to the abundances of the primary isoform in the respective conditions, and y-coordinates $y_A$ and $y_B$ corresponding to the abundance of the secondary isoforms. In data from the hypothetical experiment, the black dots represent the mean level of expression of the constituent isoforms as derived from replicates. Differential transcript expression (DTE) refers to change in one of the isoforms. Differential gene expression (DGE) refers to change in overall gene expression (i.e. expression as the sum of the expression of the two isoforms). Differential transcript usage (DTU) refers to change in relative expression between the two isoform and gene differential expression (GDE) refers to change in expression along the red line. Note that DGE, DTU and DGE are special cases of GDE.

The Cmero et al. preprint describes a method for testing for GDE, and the method is based on comparison of equivalence classes of reads between conditions. There is a natural equivalence relation $\sim$ on the set of reads in an RNA-seq experiment, where two reads $r_1$ and $r_2$ are related by $\sim$ when $r_1$ and $r_2$ align (ambiguously) to exactly the same set of transcripts (see, e.g. Nicolae et al. 2011). The equivalence relation $\sim$ partitions the reads into equivalence classes, and, in a slight abuse of notation, the term “equivalence class” in RNA-seq is used to denote the set of transcripts corresponding to an equivalence class of reads. Starting with the pseudoalignment program kallisto published in Bray et al. 2016, it became possible to rapidly obtain the (transcript) equivalence classes for reads from an RNA-seq experiment.

In previous work (Ntranos et al. 2016) we introduced the term transcript compatibility counts to denote the cardinality of the (read) equivalence classes. We thought about this name carefully; due to the abuse of notation inherent in the term “equivalence class” in RNA-seq, we felt that using “equivalence class counts” would be confusing as it would be unclear whether it refers to the cardinalities of the (read) equivalence classes or the (transcript) “equivalence classes”.

With these definitions at hand, the Cmero et al.’s preprint can be understood to describe a method for identifying GDE between conditions by directly comparing transcript compatibility counts. The Cmero et al. method is to perform Šidák aggregation of p-values for equivalence classes, where the p-values are computed by comparing transcript compatibility counts for each equivalence class with the program DEXSeq (Anders et al. 2012). A different method that also identifies GDE by directly comparing transcript compatibility counts was previously published by my student Lynn Yi in Yi et al. 2018. I was curious to see how the Yi et al. method, which is based on Lancaster aggregation of p-values computed from transcript compatibility counts compares to the Cmero et al. method. Fortunately it was really easy to find out because Cmero et al. released code with their paper that can be used to make all of their figures.

I would like to note how much fun it is to reproduce someone else’s work. It is extremely empowering to know that all the methods of a paper are pliable at the press of a button. Below is the first results figure, Figure 2, from Cmero et al.’s paper:

It’s beautiful to see not only apples-to-apples, but the exact same apple! Reproducibility is obviously important to facilitate transparency in papers and to ensure correctness, but its real value lies in the fact that it allows for modifying and experimenting with methods in a paper. Below is the second results figure, Figure 3, from Cmero et al.’s paper:

The figure below is the reproduction, but with an added analysis in Figure 3a, namely the method of Yi et al. 2018 included (shown in orange as “Lancaster_equivalence_class”).

The additional code required for the extra analysis is just a few lines and can be downloaded from the Bits of DNA Github repository:

library(aggregation)
library(dplyr)
dm_dexseq_results <- as.data.frame(DEXSeqResults(dm_ec_results\$dexseq_object))
dm_lancaster_results <- dm_dexseq_results %>% group_by(groupID) %>% summarize(pval = lancaster(pvalue, log(exonBaseMean)))
dm_lancaster_results <- data.frame(gene = dm_lancaster_results\$groupID,
FDR = dm_lancaster_results\$gene_FDR)

hs_dexseq_results <- as.data.frame(DEXSeqResults(hs_ec_results\$dexseq_object))
hs_lancaster_results <- hs_dexseq_results %>% group_by(groupID) %>% summarize(pval = lancaster(pvalue, log(exonBaseMean)))
hs_lancaster_results <- data.frame(gene = hs_lancaster_results\$groupID,
FDR = hs_lancaster_results\$gene_FDR)

A zoom-in of Figure 3a below shows that the improvement of Yi et al.’s method in the hsapiens dataset over the method of Cmero et al. is as large as the improvement of aggregation (of any sort) over GDE based on transcript quantifications. Importantly, this is a true apples-to-apples comparison because Yi et al.’s method is being tested on exactly the data and with exactly the metrics that Cmero et al. chose:

The improvement is not surprising; an extensive comparison of Lancaster aggregation with Šidák aggregation is detailed in Yi et al. and there we noted that while Šidák aggregation performs well when transcripts are perturbed independently, it performs very poorly in the more common case of correlated effect. Furthermore, we also examined in detail DEXSeq’s aggregation (perGeneQvalue) which appears to be an attempt to perform Šidák aggregation but is not quite right, in a sense we explain in detail in Section 2 of the Yi et al. supplement. While DEXSeq’s implementation of Šidák aggregation does control the FDR, it will tend to report genes with many isoforms and consumes the “FDR budget” faster than Šidák aggregation. This is one reason why, for the purpose of comparing Lancaster and Šidák aggregation in Yi et al. 2018, we did not rely on DEXSeq’s implementation of Šidák aggregation. Needless to say, separately from this issue, as mentioned above we found that Lancaster aggregation substantially outperforms Šidák aggregation.

The figures below complete the reproduction of the results of Cmero et al. The reproduced figures are are very similar to Cmero et al.’s figures but not identical. The difference is likely due to the fact that the Cmero paper states that a full comparison of the “Bottomly data” (on which these results are based) is a comparison of 10 vs. 10 samples. The reproduced results are based on downloading the data which consists of 10 vs. 11 samples for a total of 21 samples (this is confirmed in the Bottomly et al. paper which states that they “generated single end RNA-Seq reads from 10 B6 and 11 D2 mice.”) I also noticed one other small difference in the Drosophila analysis shown in Figure 3a where one of the methods is different for reasons I don’t understand. As for the supplement, the Cmero et al. figures are shown on the left hand side below, and to their right are the reproduced figures:

The final supplementary figure is a comparison of kallisto to Salmon: the Cmero et al. paper shows that Salmon results are consistent with kallisto results shown in Figure 3a,  and reproduces the claim I made in a previous blog post, namely that Salmon results are near identical to kallisto:

The final paragraph in the discussion of Cmero et al. states that “[transcript compatibility counts] have the potential to be useful in a range of other expression analysis. In particular [transcript compatibility counts] could be used as the initial unit of measurement for many other types of analysis such as dimension reduction visualizations, clustering and differential expression.” In fact, transcript compatibility counts have already been used for all these applications and have been shown to have numerous advantages. See the papers:

Many of these papers were summarized in a talk I gave at Cold Spring Harbor in 2017 on “Post-Procrustean Bioinformatics”, where I emphasized that instead of fitting methods to the predominant data types (in the case of RNA-seq, gene counts), one should work with data types that can support powerful analysis methods (in the case of RNA-seq, transcript compatibility counts).