The long-standing practice of data sharing in genomics can be traced to the Bermuda principles, which were formulated during the human genome project (Contreras, 2010). While the Bermuda principles focused on open sharing of DNA sequence data, they heralded the adoption of other open source standards in the genomics community. For example, unlike many other scientific disciplines, most genomics software is open source and this has been the case for a long time (Stajich and Lapp, 2006). The open principles of genomics have arguably greatly accelerated progress and facilitated discovery.

While open sourcing has become de rigueur in genomics dry labs, wet labs remain beholden to commercial instrument providers that rarely open source hardware or software, and impose draconian restrictions on instrument use and modification. With a view towards joining others who are working to change this state of affairs, we’ve posted a new preprint in which we describe an open source syringe pump and microscope system called poseidon:

A. Sina Booeshaghi, Eduardo da Veiga Beltrame, Dylan Bannon, Jase Gehring and Lior Pachter,

The poseidon system consists of

• A syringe pump that can operate at a wide range of flow rates. The bulk cost per pump is $37.91. A system of three pumps that can be used for droplet-based single-cell RNA-seq experiments can be assembled for $174.87.
• A microscope system that can be used to evaluate the quality of emulsions produced using the syringe pumps. The cost is $211.69.
• Open source software that can be used to operate four pumps simultaneously, either via a Raspberry Pi (that is part of the microscope system) or directly via a laptop/desktop.

Together, these components can be used to build a Drop-seq rig for under $400, or they can be used piecemeal for a wide variety of tasks.

Along with describing benchmarks of poseidon, the preprint presents design guidelines that we hope can accelerate both development and adoption of open source bioinstruments. These were developed while working on the project; some were borrowed from our experience with bioinformatics software, while others emerged as we worked out kinks during development. As is the case with software, open source is not, in and of itself, enough to make an application usable. We had to optimize many steps during the development of poseidon, and in the preprint we illustrate the design principles we converged on with specific examples from poseidon.
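As a quick arithmetic check of the prices quoted above, the cost of a rig built from the three-pump system and the microscope can be added up directly. This is only a sketch; the constant and function names are made up for illustration and do not come from the preprint.

```python
from decimal import Decimal

# Prices quoted in the post, in US dollars.
THREE_PUMP_SYSTEM = Decimal("174.87")  # three pumps for droplet-based scRNA-seq
MICROSCOPE = Decimal("211.69")         # microscope system


def drop_seq_rig_cost() -> Decimal:
    """Cost of a rig made of the three-pump system plus the microscope."""
    return THREE_PUMP_SYSTEM + MICROSCOPE


if __name__ == "__main__":
    total = drop_seq_rig_cost()
    print(f"Drop-seq rig: ${total}")
    print("Under $400:", total < Decimal("400"))
```

The sum is $386.56, which is consistent with the "under $400" figure.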

The complete hardware/software package consists of the following components:

We benchmarked the system thoroughly and it has similar performance to a commercial Harvard Apparatus syringe pump; see panel (a) below. The software driving the pumps can be used for infusion or withdrawal, and is easily customizable. In fact, the ability to easily program arbitrary schedules and flow rates, without depending on vendors who frequently charge money and require firmware upgrades for basic tasks, was a major motivation for undertaking the project. The microscope is basic but usable for setting up emulsions. Shown in panel (b) below is a microfluidic droplet generation chip imaged with the microscope. Panel (c) shows that we have no trouble generating uniform emulsions with it.
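To give a feel for what "programming a schedule of flow rates" involves for a stepper-driven syringe pump, here is a minimal sketch of the underlying conversion from volumetric flow rate to motor step rate. It is not the poseidon code: the function names, the schedule format, and the hardware values (syringe diameter, lead screw travel per revolution, steps per revolution) are all hypothetical and chosen only for illustration.

```python
import math


def steps_per_second(flow_ul_per_min, syringe_diameter_mm,
                     lead_mm_per_rev, steps_per_rev):
    """Stepper step rate needed to deliver a given volumetric flow rate.

    1 uL equals 1 mm^3, so dividing the flow rate by the cross-sectional
    area of the syringe barrel gives the plunger speed in mm/min.
    """
    area_mm2 = math.pi * (syringe_diameter_mm / 2) ** 2
    plunger_mm_per_min = flow_ul_per_min / area_mm2
    revs_per_min = plunger_mm_per_min / lead_mm_per_rev
    return revs_per_min * steps_per_rev / 60.0


def schedule_to_step_rates(schedule, **hardware):
    """Map [(duration_s, flow_ul_per_min), ...] to [(duration_s, steps_per_s), ...].

    A negative flow rate yields a negative step rate, read here as withdrawal.
    """
    return [(duration, steps_per_second(flow, **hardware))
            for duration, flow in schedule]


# Hypothetical hardware values, for illustration only.
EXAMPLE_HARDWARE = dict(syringe_diameter_mm=14.5,
                        lead_mm_per_rev=2.0,
                        steps_per_rev=3200)

if __name__ == "__main__":
    # Infuse at 50 uL/min for 60 s, then withdraw at 10 uL/min for 30 s.
    for duration, rate in schedule_to_step_rates([(60, 50.0), (30, -10.0)],
                                                 **EXAMPLE_HARDWARE):
        print(f"{duration:>3} s at {rate:8.2f} steps/s")
```

Because the conversion is linear in the flow rate, any schedule reduces to a list of (duration, step rate) pairs once the hardware constants are known.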

Together, the system constitutes a Drop-seq rig (3 pumps + microscope) that can be built for under $400:

We did not start the poseidon project from scratch. First of all, we were fortunate to have some experience with 3D printing. Shortly after I started setting up a wet lab, Shannon Hateley, a former student in the lab, encouraged me to buy a 3D printer to reduce costs for basic lab supplies. The original MakerGear M2 we purchased has served us well, saving us enormous amounts of money just as Shannon predicted, and in fact we have now added a Prusa printer:

The printer Shannon introduced to the lab came in handy when, some time later, after starting to perform Drop-seq in the lab, Jase Gehring became frustrated with the rigidity of the commercial syringe pumps he was using. With a 3D printer available in-house, he was able to print and assemble a published open source syringe pump design. What started as a hobby project became more serious when two students joined the lab: Sina Booeshaghi, a mechanical engineer, and Eduardo Beltrame, an expert in 3D printing. We were also encouraged by the publication of a complete Drop-seq do-it-yourself design from the Satija lab. Starting with the microscope device from the Stephenson et al. paper, and the syringe pump from the Wijnen et al. paper, we worked our way through numerous hardware design optimizations and software prototypes. The photo below shows the published work we started with at the bottom, the final designs we ended up with at the top, and intermediate versions as we converged on design choices:

In the course of design we realized that despite a lot of experience developing open source software in the lab, there were new lessons we were learning regarding open-source hardware development, and hardware-software integration.
We ended up formulating six design principles that we explain in detail in the preprint via examples of how they pertained to the poseidon design:

We are hopeful that these principles we adhered to will serve as useful guidelines for others interested in undertaking open source bioinstrumentation projects.

This post is a review of a recent preprint on an approach to testing for RNA-seq gene differential expression directly from transcript compatibility counts:

Marek Cmero, Nadia M Davidson and Alicia Oshlack, Fast and accurate differential transcript usage by testing equivalence class counts, bioRxiv 2018.

To understand the preprint two definitions are important. The first is of gene differential expression, which I wrote about in a previous blog post and is best understood, I think, with the following figure (reproduced from Supplementary Figure 1 of Ntranos, Yi, et al., 2018):

In this figure, two isoforms of a hypothetical gene are called primary and secondary, and two conditions in a hypothetical experiment are called “A” and “B”. The black dots labeled conditions A and B have x-coordinates $x_A$ and $x_B$ corresponding to the abundances of the primary isoform in the respective conditions, and y-coordinates $y_A$ and $y_B$ corresponding to the abundances of the secondary isoform. In data from the hypothetical experiment, the black dots represent the mean level of expression of the constituent isoforms as derived from replicates. Differential transcript expression (DTE) refers to change in one of the isoforms. Differential gene expression (DGE) refers to change in overall gene expression (i.e. expression as the sum of the expression of the two isoforms). Differential transcript usage (DTU) refers to change in relative expression between the two isoforms, and gene differential expression (GDE) refers to change in expression along the red line. Note that DTE, DTU and DGE are special cases of GDE.

The Cmero et al.
preprint describes a method for testing for GDE, and the method is based on comparison of equivalence classes of reads between conditions. There is a natural equivalence relation $\sim$ on the set of reads in an RNA-seq experiment, where two reads $r_1$ and $r_2$ are related by $\sim$ when $r_1$ and $r_2$ align (ambiguously) to exactly the same set of transcripts (see, e.g. Nicolae et al. 2011). The equivalence relation $\sim$ partitions the reads into equivalence classes, and, in a slight abuse of notation, the term “equivalence class” in RNA-seq is used to denote the set of transcripts corresponding to an equivalence class of reads. Starting with the pseudoalignment program kallisto published in Bray et al. 2016, it became possible to rapidly obtain the (transcript) equivalence classes for reads from an RNA-seq experiment.

In previous work (Ntranos et al. 2016) we introduced the term transcript compatibility counts to denote the cardinality of the (read) equivalence classes. We thought about this name carefully; due to the abuse of notation inherent in the term “equivalence class” in RNA-seq, we felt that using “equivalence class counts” would be confusing as it would be unclear whether it refers to the cardinalities of the (read) equivalence classes or the (transcript) “equivalence classes”.

With these definitions at hand, the Cmero et al. preprint can be understood to describe a method for identifying GDE between conditions by directly comparing transcript compatibility counts. The Cmero et al. method is to perform Šidák aggregation of p-values for equivalence classes, where the p-values are computed by comparing transcript compatibility counts for each equivalence class with the program DEXSeq (Anders et al. 2012). A different method that also identifies GDE by directly comparing transcript compatibility counts was previously published by my student Lynn Yi in Yi et al. 2018. I was curious to see how the Yi et al.
method, which is based on Lancaster aggregation of p-values computed from transcript compatibility counts, compares to the Cmero et al. method. Fortunately it was really easy to find out because Cmero et al. released code with their paper that can be used to make all of their figures. I would like to note how much fun it is to reproduce someone else’s work. It is extremely empowering to know that all the methods of a paper are pliable at the press of a button. Below is the first results figure, Figure 2, from Cmero et al.’s paper:

Below is the same figure reproduced independently using their code (and downloading the relevant data):

It’s beautiful to see not only apples-to-apples, but the exact same apple! Reproducibility is obviously important to facilitate transparency in papers and to ensure correctness, but its real value lies in the fact that it allows for modifying and experimenting with methods in a paper. Below is the second results figure, Figure 3, from Cmero et al.’s paper:

The figure below is the reproduction, but with an added analysis in Figure 3a, namely the method of Yi et al. 2018 included (shown in orange as “Lancaster_equivalence_class”). The additional code required for the extra analysis is just a few lines and can be downloaded from the Bits of DNA GitHub repository:

library(aggregation)
library(dplyr)

dm_dexseq_results <- as.data.frame(DEXSeqResults(dm_ec_results$dexseq_object))
dm_lancaster_results <- dm_dexseq_results %>% group_by(groupID) %>% summarize(pval = lancaster(pvalue, log(exonBaseMean)))
dm_lancaster_results$gene_FDR <- p.adjust(dm_lancaster_results$pval, 'BH')
dm_lancaster_results <- data.frame(gene = dm_lancaster_results$groupID, FDR = dm_lancaster_results$gene_FDR)

hs_dexseq_results <- as.data.frame(DEXSeqResults(hs_ec_results$dexseq_object))
hs_lancaster_results <- hs_dexseq_results %>% group_by(groupID) %>% summarize(pval = lancaster(pvalue, log(exonBaseMean)))
hs_lancaster_results$gene_FDR <- p.adjust(hs_lancaster_results$pval, 'BH')
hs_lancaster_results <- data.frame(gene = hs_lancaster_results$groupID,
                                   FDR = hs_lancaster_results$gene_FDR)

A zoom-in of Figure 3a below shows that the improvement of Yi et al.’s method in the hsapiens dataset over the method of Cmero et al. is as large as the improvement of aggregation (of any sort) over GDE based on transcript quantifications. Importantly, this is a true apples-to-apples comparison because Yi et al.’s method is being tested on exactly the data and with exactly the metrics that Cmero et al. chose:

The improvement is not surprising; an extensive comparison of Lancaster aggregation with Šidák aggregation is detailed in Yi et al. and there we noted that while Šidák aggregation performs well when transcripts are perturbed independently, it performs very poorly in the more common case of correlated effect. Furthermore, we also examined in detail DEXSeq’s aggregation (perGeneQvalue) which appears to be an attempt to perform Šidák aggregation but is not quite right, in a sense we explain in detail in Section 2 of the Yi et al. supplement. While DEXSeq’s implementation of Šidák aggregation does control the FDR, it will tend to report genes with many isoforms and consume the “FDR budget” faster than Šidák aggregation. This is one reason why, for the purpose of comparing Lancaster and Šidák aggregation in Yi et al. 2018, we did not rely on DEXSeq’s implementation of Šidák aggregation. Needless to say, separately from this issue, as mentioned above we found that Lancaster aggregation substantially outperforms Šidák aggregation.

The figures below complete the reproduction of the results of Cmero et al. The reproduced figures are very similar to Cmero et al.’s figures but not identical. The difference is likely due to the fact that the Cmero paper states that a full comparison of the “Bottomly data” (on which these results are based) is a comparison of 10 vs. 10 samples. The reproduced results are based on downloading the data which consists of 10 vs.
11 samples for a total of 21 samples (this is confirmed in the Bottomly et al. paper which states that they “generated single end RNA-Seq reads from 10 B6 and 11 D2 mice”). I also noticed one other small difference in the Drosophila analysis shown in Figure 3a where one of the methods is different for reasons I don’t understand. As for the supplement, the Cmero et al. figures are shown on the left hand side below, and to their right are the reproduced figures:

The final supplementary figure is a comparison of kallisto to Salmon: the Cmero et al. paper shows that Salmon results are consistent with kallisto results shown in Figure 3a, and reproduces the claim I made in a previous blog post, namely that Salmon results are near identical to kallisto:

The final paragraph in the discussion of Cmero et al. states that “[transcript compatibility counts] have the potential to be useful in a range of other expression analysis. In particular [transcript compatibility counts] could be used as the initial unit of measurement for many other types of analysis such as dimension reduction visualizations, clustering and differential expression.” In fact, transcript compatibility counts have already been used for all these applications and have been shown to have numerous advantages. See the papers:

Many of these papers were summarized in a talk I gave at Cold Spring Harbor in 2017 on “Post-Procrustean Bioinformatics”, where I emphasized that instead of fitting methods to the predominant data types (in the case of RNA-seq, gene counts), one should work with data types that can support powerful analysis methods (in the case of RNA-seq, transcript compatibility counts).

Three years ago, when my coauthors (Páll Melsted, Nicolas Bray, Harold Pimentel) and I published the “kallisto paper” on the arXiv (later Bray et al.
“Near-optimal probabilistic RNA-seq quantification”, 2016), we claimed that kallisto removed a major computational bottleneck from RNA-seq analysis by virtue of being two orders of magnitude faster than other state-of-the-art quantification methods of the time, without compromising accuracy. With kallisto, computations that previously took days could be performed as accurately in minutes. Even though the speedup was significant, its relevance was immediately questioned. Critics noted that experiments, library preparations and sequencing take at least months, if not years, and questioned the relevance of a speedup that would save only days.

One rebuttal we made to this legitimate point was that kallisto would be useful not only for rapid analysis of individual datasets, but that it would enable analyses at previously unimaginable scales. To make our point concrete, in a follow-up paper (Pimentel et al., “The Lair: a resource for exploratory analysis of published RNA-seq data”, 2016) we described a semi-automated framework for analysis of archived RNA-seq data that was possible thanks to the speed and accuracy of kallisto, and we articulated a vision for “holistic analysis of [short read archive] SRA data” that would facilitate “comparison of results across studies [by] use of the same tools to process diverse datasets.”

A major challenge in realizing this vision was that although kallisto was fast enough to allow for low cost processing of all the RNA-seq in the short read archive (e.g. shortly after we published kallisto, Vivian et al., 2017 showed that kallisto reduced the cost of processing per sample from $1.30 to $0.19, and Tatlow and Piccolo, 2016 achieved $0.09 per sample with it), an analysis of experiments consists of much more than just quantification. In Pimentel et al. 2016 we struggled with how to wrangle metadata of experiments (subsequently an entire paper was written by Bernstein et al.
2017 just on this problem), how to enable users to dynamically test distinct hypotheses for studies, and how to link results to existing databases and resources. As a result, Pimentel et al. 2016 was more of a proof-of-principle than a complete resource; ultimately we were able to set up analysis of only a few dozen datasets.

Now, the group of Avi Ma’ayan at the Icahn School of Medicine at Mount Sinai has surmounted the many challenges that must be overcome to enable automated analysis of RNA-seq projects on the short read archive, and has published a tool called BioJupies (Torre et al. 2018). To assess BioJupies I began by conducting a positive control in the form of analysis of data from the “Cuffdiff2” paper, Trapnell et al. 2013. The data is archived as GSE37704. This is the dataset I used to initially test the methods of Pimentel et al. 2016 and is also the dataset underlying the Getting Started Walkthrough for sleuth. I thought, given my familiarity with it, that it would be a good test case for BioJupies.

Briefly, in Trapnell et al. 2013, Trapnell and Hendrickson performed a differential analysis of lung fibroblasts in response to an siRNA knockdown of HOXA1, which is a developmental transcription factor. Analyzing the dataset with BioJupies is as simple as typing the Gene Expression Omnibus (GEO) accession into the BioJupies search box. I clicked “analyze”, clicked on “+” a few times to add all the possible plots that can be generated, and this opened a window asking for a description of the samples:

I selected “Perturbation” for the HOXA1 knockdown samples and “Control” for the samples that were treated with scrambled siRNA that did not target a specific gene. Finally, I clicked on “generate notebook”…

and

BioJupies displayed a notebook (Trapnell et al. 2013 | BioJupies) with a complete analysis of the data. Much of the Trapnell et al. 2013 analysis was immediately evident in the notebook. For example, the following figure is Figure 5a in Trapnell et al. 2013. It is a gene set enrichment analysis (GSEA) of the knockdown:

BioJupies presents the same analysis:

It’s easy to match them up. Of course BioJupies presents a lot of other information and analysis, ranging from a useful PCA plot to an interesting L1000 connectivity map analysis (expression signatures from a large database of over 20,000 perturbations applied to various cell lines that match the signatures in the dataset).

One of the powerful applications of BioJupies is the presentation of ARCHS⁴ co-expression data. ARCHS⁴ is the kallisto-computed database of expression for the whole short read archive and is the primary database that enables BioJupies. One of its features is a list of co-expressed genes (as ascertained via correlation across the whole short read archive). These are displayed in BioJupies, making it possible to place the results of an experiment in the context of “global” transcriptome associations.
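The idea of ranking co-expressed genes by correlation across samples can be sketched in a few lines. The example below is an illustration only: it uses Pearson correlation on a toy table of made-up genes and values, and the function names are hypothetical; it is not the ARCHS⁴ or BioJupies implementation, whose exact procedure may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # correlation is undefined for a constant profile; use 0 here
    return cov / (sd_x * sd_y)


def top_coexpressed(expression, query, k=3):
    """Rank the other genes by their correlation with `query` across samples.

    `expression` maps a gene name to a list of values, one per sample.
    """
    scores = [(gene, pearson(expression[query], values))
              for gene, values in expression.items() if gene != query]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:k]


# Toy data: four made-up genes measured in five samples.
toy = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],  # tracks geneA closely
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],  # anti-correlated with geneA
    "geneD": [3.0, 3.0, 3.0, 3.0, 3.0],  # constant across samples
}

if __name__ == "__main__":
    for gene, r in top_coexpressed(toy, "geneA"):
        print(f"{gene}: r = {r:+.3f}")
```

With thousands of samples per gene instead of five, the same ranking is what places a single experiment in the context of “global” associations.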

While the Trapnell et al. 2013 reanalysis was fun, the real power of BioJupies is clear when analyzing a dataset that has not yet been published. I examined the GEO database and found a series GSE60538 that appears to be a partial dataset from what looks like a paper in the works. The data is from an experiment designed to investigate the role of Sox5 and Sox6 in the mouse heart via two single knockout experiments, and a double knockout. The entry originates in 2014 (consistent with the single-end 50bp reads it contains), but was updated recently. There are a total of 8 samples: 4 controls and 4 from the double knockout (the single knockouts are not available yet). I could not find an associated paper, nor was one linked to on GEO, but the abstract of the paper has already been uploaded to the site. Just as I did with the Trapnell et al. 2013 dataset, I entered the accession in the BioJupies website and… four minutes later:

The abstract of the GSE60538 entry states that “We performed RNA deep sequencing in ventricles from DKO and control mice to identify potential Sox5/6 target genes and found altered expression of genes encoding regulators of calcium handling and cation transporters” and indeed, BioJupies verifies this result (see Beetz et al. GSE60538 | BioJupies):

Of course, there is a lot more analysis than just this. The BioJupies page includes, in addition to basic QC and dataset statistics, the PCA analysis, a “clustergrammer” showing which genes drive similarity between samples, differentially expressed genes (with associated MA and volcano plots), gene ontology enrichment analysis, pathway enrichment analysis, transcription factor enrichment analysis, kinase enrichment analysis, microRNA enrichment analysis, and L1000 analysis. In a sense, one could say that with BioJupies, users can literally produce the analysis for a paper in four minutes via a website.
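For readers unfamiliar with the MA and volcano plots mentioned above, the coordinates they display can be computed as follows. This is a sketch using the standard definitions; the function names and the pseudocount default are assumptions for illustration, and the BioJupies notebooks may compute or scale these quantities differently.

```python
import math


def ma_point(mean_control, mean_perturbed, pseudocount=1.0):
    """Return (A, M) for one gene.

    A is the average log2 expression over the two conditions and
    M is the log2 fold change (perturbed over control).
    """
    log_control = math.log2(mean_control + pseudocount)
    log_perturbed = math.log2(mean_perturbed + pseudocount)
    return (0.5 * (log_control + log_perturbed), log_perturbed - log_control)


def volcano_point(log2_fold_change, p_value):
    """Return (x, y) for a volcano plot: log2 fold change against -log10(p)."""
    return (log2_fold_change, -math.log10(p_value))


if __name__ == "__main__":
    a, m = ma_point(4.0, 16.0, pseudocount=0.0)
    print(f"A = {a}, M = {m}")
    print(volcano_point(m, 0.01))
```

An MA plot shows M against A for every gene, while a volcano plot shows statistical significance against effect size.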

The Ma’ayan lab has been working towards BioJupies for some time. The service is essentially a combination of a number of tools, workflows and resources published previously by the lab, including:

With BioJupies, these tools become more than the sum of their parts. Yet while BioJupies is impressive, it is not complete. There is no isoform analysis, which is unfortunate; for example, one of the key points of Trapnell et al. 2013 was how informative transcript-level analysis of RNA-seq data can be. However, I see no reason why a future release of BioJupies can’t include detailed isoform analyses. Isoform quantifications are provided by kallisto and are already downloadable via ARCHS⁴. It would also be great if BioJupies were extended to organisms other than human and mouse, although some of the databases that are currently relied on are less complete in other model organisms. Still, it should even be possible to create a BioJupies for non-models. I expect the authors have thought of all of these ideas. I do have some other issues with BioJupies: e.g. the notebook should cite all the underlying programs and databases used to generate the results, and while it’s neat that there is an automatically generated methods section, it is far from complete and should include the actual calls made to the programs used so as to facilitate complete reproducibility. Then, there is my pet peeve: “library size” is not the number of reads in a sample. The number of reads sequenced is “sequencing depth”. All of these issues can be easily fixed.

In summary, BioJupies represents an impressive breakthrough in RNA-seq analysis. It leverages a comprehensive analysis of all (human and mouse) publicly available RNA-seq data to enable rapid and detailed analyses that transcend what has been previously possible. Discoveries await.

The post concerns Yuval Peres, a principal researcher in the Microsoft Theory Group [update Dec. 26, 2018: YP is no longer employed at Microsoft] and a former colleague of mine at UC Berkeley. Below is a copy of an email sent yesterday to numerous theory of computer science professors worldwide, and published on the Stanford Theory Seminar List. It corroborates information I heard about Yuval Peres a number of years ago when I was a mathematics professor at UC Berkeley. At the time I was asked to keep the information I heard confidential, and I did so because the person who discussed it with me was, understandably, afraid of retaliation. Now I wonder to what extent my silence allowed his harassment of women to continue unabated. I also wonder when the leaders of the statistics department at UC Berkeley, where Peres used to work, and where Terry Speed was a professor emeritus before I reported him, will end their culture of silence.

Hello all,

This is an email composed by Irit Dinur, Oded Goldreich and me. The purpose of this email is to share with you concerns that we had regarding the unethical behavior of Yuval Peres. The behavior we are referring to includes several recent incidents from the past few years, on top of the two “big” cases of sexual harassment that led to severe sanctions against him by his employer, Microsoft, and to the termination of his connections with the University of Washington. Together with two colleagues who are highly regarded and trusted by us, we have first and second-hand testimonies (by people we trust without a shed of doubt) of at least five additional cases of him approaching junior female scientists, some of them students, with offers of intimate nature, behavior that has caused its victims quite a bit of distress since these offers were “insistent”. While the examples that we encountered from the last few years do not fall under the category of sexual harassment from a legal point of view, they certainly caused great discomfort to the victims, who were young female scientists, putting them in a highly awkward situation, and creating an atmosphere that they’d rather avoid (i.e., they would rather miss a conference or a lecture than risk being subjected to repeated intimate offers by him). We wish to stress that his aggressive advances toward young women, usually with no previous friendly connections with him, puts them in a vulnerable position of fearing to cross a senior scientist who might have an impact on their career, which is at a fragile stage. We believe that the questions of whether or not Yuval Peres intended to make them uncomfortable, and whether or not he would or could actually harm their scientific status are irrelevant; the fact is that the victims felt very stressed to a point that they’d rather miss professional events than risk encountering the same situation again. 
Needless to say, it is the responsibility of senior members of our community to avoid putting less senior members in such a position.

Our current involvement with this issue was triggered by an invitation Yuval Peres received to give a plenary talk at an international conference next year. We felt that this invitation sends a highly undesirable message to our community in general, and to the women he harassed in particular, as if his transgressions are considered unimportant.

We sent an email conveying our concern to the organizers of the conference, suggesting that they disinvite him. With our permission, they forwarded a version of our letter (in which we made changes in order to protect the identity of the women involved) to Yuval Peres. They did not reveal our identity, rather they told him that this is a letter from “senior members of the community”. In our letter we included a paragraph describing a general principle that should be followed. The principle is:

A senior researcher should not approach a junior researcher with an invitation that may be viewed as intimate or personal unless such an invitation was issued in the past by this specific junior to that specific senior. The point being that even if the senior researcher has no intimate/personal intentions, such intentions may be read by the junior researcher, placing the junior in an awkward situation and possibly causing them great distress. Examples for such an invitation include any invitation to a personal event in which only the senior and the junior will be present (e.g., a two-person dinner, a meeting in a private home, etc).

Yuval’s reply was rather laconic, in particular, he did not address the issue of his behavior in the past couple of years. However, he did write:

“I certainly embrace the principle described in boldface in the letter. This seems to be the right approach for any senior scientist these days.”

The reason we are copying this to all of you (as opposed, for example, to using bcc) is related to the islanders’ paradox: we believe that the fact that everyone knows that everyone knows is a significant boost to holding Yuval Peres accountable for his future actions. We’re also bcc’ing several young women who already aware of Yuval Peres’s actions, in order to keep them in the know too.

We understand that sending this out to a large number of people without offering Yuval Peres the chance to respond may be considered unfair. However, after weighing the pros and cons carefully we believe this is a good course of action. First of all, because it is clear that the victims did not invent his offers and their ensuing feelings of anxiety and stress. Secondly, we know that Yuval Peres has been confronted in a face to face conversation by a senior colleague, and it did not end his behavior, so we think it’s important to stay vigilant in protecting the younger members of our community. Thirdly, the information in this letter will reach (or has already reached) almost all of you in any case, so the main effect of the letter is making what everyone knows into public knowledge. Finally, although his response to the organizers did include the minimum of declaring he accepts the guiding principle that we stated, it did not include any reference to the ongoing behavior we described- neither regret nor concern nor denial. So it’s not easy to assume that he truly intends to mend his ways.

We hope that our actions will contribute to the future of our community as an environment that offers all a pleasant and non-threatening atmosphere.

Sincerely,
Irit Dinur, Ehud Friedgut, Oded Goldreich

Last year I wrote a blog post on being wrong. I also wrote a blog post about being wrong three years ago. It’s not fun to admit being wrong, but sometimes it’s necessary. I have to admit to being wrong again.

To place this admission in context I need to start with Mordell’s finite basis theorem, which has been on my mind this past week. The theorem, proved in 1922, states that the rational points on an elliptic curve defined over the rational numbers form a finitely generated abelian group. There is quite a bit of math jargon in this statement that makes it seem somewhat esoteric, but it’s actually a beautiful, fundamental, and accessible result at the crossroads of number theory and algebraic geometry.

First, the phrase elliptic curve is just a fancy name for a polynomial equation of the form y² = x³ + ax + b (subject to some technical conditions). “Defined over the rationals” just means that a and b are rational numbers. For example a=-36, b=0 or a=0, b=-26 would each produce an elliptic curve. A “rational point on the curve” refers to a solution to the equation whose coordinates are rational numbers. For example, if we’re looking at the case where a=0 and b=-26 then the elliptic curve is y² = x³ – 26 and one rational solution would be the point (35,-207). This solution also happens to be an integer solution; try to find some others! Elliptic curves are pretty and one can easily explore them in WolframAlpha. For example, the curve y² = x³ – 36x looks like this:

WolframAlpha does more than just provide a picture. It finds integer solutions to the equation. In this case just typing the equation for the elliptic curve into the WolframAlpha box produces:

One of the cool things about elliptic curves is that the points on them form the structure of an abelian group. That is to say, there is a way to “add” points on the curves. I’m not going to go through how this works here but there is a very good introduction to this connection between elliptic curves and groups in an exposition by Tanuj Nayak, an undergrad at Carnegie Mellon University.
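For the curious, the “addition” of points can be sketched with exact rational arithmetic. The code below is a minimal illustration of the standard chord-and-tangent formulas for a curve y² = x³ + ax + b; the function names are made up for this example, and it is a sketch for experimenting with small examples rather than a robust implementation.

```python
from fractions import Fraction


def on_curve(P, a, b):
    """True if P = (x, y) satisfies y^2 = x^3 + a*x + b."""
    x, y = P
    return y * y == x ** 3 + a * x + b


def add(P, Q, a):
    """Chord-and-tangent addition of points on y^2 = x^3 + a*x + b.

    None stands for the identity element (the point at infinity).
    """
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and y1 + y2 == 0:
        return None  # P + (-P) is the identity
    if P == Q:
        slope = (3 * x1 * x1 + a) / (2 * y1)  # slope of the tangent at P
    else:
        slope = (y2 - y1) / (x2 - x1)         # slope of the chord through P and Q
    x3 = slope * slope - x1 - x2
    y3 = slope * (x1 - x3) - y1
    return (x3, y3)


if __name__ == "__main__":
    # The curve y^2 = x^3 - 26 (a = 0, b = -26) from the text.
    print(on_curve((35, -207), 0, -26))
    P = (Fraction(3), Fraction(1))       # 1 = 27 - 26, so P is on the curve
    print(add(P, P, a=0))                # "doubling" P gives a new rational point
```

Doubling the integer point (3, 1) produces a point with fractional coordinates, which illustrates why the rational points, and not just the integer ones, are the natural setting for the group structure.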

Interestingly, even just the rational points on an elliptic curve form a group, and Mordell’s theorem says that for an elliptic curve defined over the rational numbers this group is finitely generated. That means that for such an elliptic curve one can describe all rational points on the curve as finite combinations of some finite set of points. In other words, we (humankind) have been interested in studying Diophantine equations since the time of Diophantus (3rd century). Trying to solve arbitrary polynomial equations is very difficult, so we restrict our attention to easier problems (elliptic curves). Working with integers is difficult, so we relax that requirement a bit and work with rational numbers. And here is a theorem that gives us hope, namely the hope that we can find all solutions to such problems because at least the description of the solutions can be finite.

The idea of looking for all solutions to a problem, and not just one solution, is fundamental to mathematics. I recently had the pleasure of attending a lesson for 1st and 2nd graders by Oleg Gleizer, an exceptional mathematician who takes time not only to teach children mathematics, but to develop mathematics (not arithmetic!) curriculum that is accessible to them. The first thing Oleg asks young children is what they see when looking at this picture:

Children are quick to find the answer and reply either “rabbit” or “duck”. But the lesson they learn is that the answer to his question is that there is no single answer! Saying “rabbit” or “duck” is not a complete answer. In mathematics we seek all solutions to a problem. From this point of view, WolframAlpha’s “integer solutions” section is not satisfactory (it omits x=6, y=0), but while in principle one might worry that one would have to search forever, Mordell’s finite basis theorem provides some peace of mind for an important class of questions in number theory. It also guides mathematicians: if interested in a specific elliptic curve, think about how to find the (finite) generators for the associated group. Now the proof of Mordell’s theorem, or its natural generalization, the Mordell-Weil theorem, is not simple and requires some knowledge of algebraic geometry, but the statement of Mordell’s theorem and its meaning can be explained to kids via simple examples.
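The integer solutions mentioned above are easy to look for by brute force. The following is a minimal sketch, not a replacement for WolframAlpha or for the theory: the helper name `integer_points` and the search window are illustrative assumptions, and a finite search can only confirm solutions inside the window, never prove that the list is complete.

```python
from math import isqrt

def integer_points(a, b, x_min, x_max):
    """Integer solutions (x, y) of y^2 = x^3 + a*x + b with x_min <= x <= x_max."""
    points = []
    for x in range(x_min, x_max + 1):
        rhs = x**3 + a * x + b
        if rhs < 0:
            continue  # a square cannot be negative
        y = isqrt(rhs)
        if y * y == rhs:
            points.append((x, y))
            if y != 0:
                points.append((x, -y))  # the curve is symmetric in y
    return points

# The two curves used as examples in the text.
curve_26 = integer_points(0, -26, -50, 50)    # y^2 = x^3 - 26
curve_36x = integer_points(-36, 0, -10, 20)   # y^2 = x^3 - 36x
print(curve_26)
print(curve_36x)
```

Running this confirms that (35, -207) lies on y² = x³ – 26 and that (6, 0) lies on y² = x³ – 36x.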

I don’t recall exactly when I learned Mordell’s theorem but I think it was while preparing for my qualifying exam in graduate school, when I studied Silverman’s book on elliptic curves for the cryptography section on my qualifying exam- yes, this math is even related to some very powerful schemes for cryptography! But I do remember when a few years later a (mathematician) friend mentioned to me “the coolest paper ever”, a paper related to generalizations of Mordell’s theorem, the very theorem that I had studied for my exam. The paper was by two mathematicians, Steven Zucker and David Cox, and it was titled Intersection Number of Sections of Elliptic Surfaces. The paper described an algorithm for determining whether some sections form a basis for the Mordell-Weil group for certain elliptic surfaces. The content was not why my friend thought this paper was cool, and in fact I don’t think he ever read it. The excitement was because of the juxtaposition of author names. Apparently David Cox had realized that if he could coauthor a paper with his colleague Steven Zucker, they could publish a theorem, which when named after the authors, would produce a misogynistic and homophobic slur. Cox sought out Zucker for this purpose, and their mission was a “success”. Another mathematician, Charles Schwartz, wrote a paper in which he built on this “joke”. From his paper:

So now, in the mathematics literature, in an interesting part of number theory, you have the Cox-Zucker machine. Many mathematicians think this is hilarious. I thought this was hilarious. In fact, when I was younger I frequently boasted about this “joke”, and how cool mathematicians are for coming up with clever stuff like this.

I was wrong.

I first started to wonder about the Zucker and Cox stunt when a friend pointed out to me, after I had used the term C-S to demean someone, that I had just spouted a misogynistic and homophobic slur. I started to notice the use of the C-S phrase all around me and it made me increasingly uncomfortable. I stopped using it. I stopped thinking that the Zucker-Cox stunt was funny (while noticing the irony that the sexual innuendo they constructed was much more cited than their math), and I started to think about the implications of this sort of thing for my profession. How would one explain the Zucker-Cox result to kids? How would undergraduates write a term paper about it without sexual innuendo distracting from the math? How would one discuss the result, the actual math, with colleagues? What kind of environment emerges when misogynistic and homophobic language is not only tolerated in a field, but is a source of pride by the men who dominate it?

These questions have been on my mind this past week as I’ve considered the result of the NIPS conference naming deliberation. This conference was named in 1987 by founders who, as far as I understand, did not consider the sexual connotations (they dismissed the fact that the abbreviation is a racial slur since they considered it all but extinct). Regardless of original intentions I write this post to lend my voice to those who are insisting that the conference change its name. I do so for many reasons. I hear from many of my colleagues that they are deeply offended by the name. That is already reason enough. I do so because the phrase NIPS has been weaponized and is being used to demean and degrade women at one of the main annual machine learning conferences. I don’t make this claim lightly. Consider, for example, TITS 2017 (the (un)official sister event to NIPS). I’ve thought about this specific aggression a lot because in mathematics there is a mathematician by the name of Tits who has many important objects named after him (e.g. Tits buildings). So I have worked through the thought experiment of trying to understand why I think it’s wrong to name a conference NIPS but I’m fine talking about the mathematician Tits. I remember when I first learned of Tits buildings I was taken aback for a moment. But I learned to understand the name Tits as French and I pronounce it as such in my mind and with my voice when I use it. There is no problem there, nor is there a problem with many names that clash across cultures and languages. TITS 2017 is something completely different. It is a deliberate use of NIPS and TITS in a way that can and will make many women uncomfortable. As for NIPS itself perhaps there is a “solution” to interpreting the name that doesn’t involve a racial slur or sexual innuendo (Neural Information Processing Systems). Maybe some people see a rabbit. But others see a duck. All the “solutions” matter. 
The fact is many women are uncomfortable because instead of being respected as scientists, their bodies and looks have become a subtext for the science that is being discussed. This is a longstanding problem at NIPS (see e.g., Lenna). Furthermore, it’s not only women who are uncomfortable. I am uncomfortable with the NIPS name for the reasons I gave above, and I know many other men are as well. I’m not at ease at conferences where racial slurs and sexual innuendo are featured prominently, and if there are men who are (cf. NIPS poll data) then they should be ignored.

I think this is an extremely important issue not only for computer science, but for all of science. It’s about much more than a name of some conference. This is about recognizing centuries of discriminatory and exclusionary practices against women and minorities, and about eliminating such practices when they occur now rather than encouraging them. The NIPS conference must change its name. #protestNIPS

A few years ago I wrote a post arguing that it is time to end ordered authorship. However that time has not yet arrived, and it appears that it is unlikely to arrive anytime soon. In the meantime, if one is writing a paper with 10 authors, a choice for authorship ordering and equal contribution designation must be made from among the almost 2 billion possibilities (1857945600 to be exact). No wonder authorship arguments are commonplace! The purpose of this short post is to explain the number 1857945600.

At first glance the enumeration of authorship orderings seems to be straightforward, namely that in a paper with n authors there are n! ways to order the authors. However this solution fails to account for designation of authors as “equal contributors”. For example, in the four author paper Structural origin of slow diffusion in protein folding, the first two authors contributed equally, and separately from that, so did the last two (as articulated via a designation of “co-corresponding” authorship). Another such example is the paper PRDM/Blimp1 downregulates expression of germinal center genes LMO2 and HGAL. Equal contribution designations can be more complex. In the recent preprint Connect-seq to superimpose molecular on anatomical neural circuit maps the first and second authors contributed equally, as did the third and fourth (though the equal contributions of the first and second authors were distinct from those of the third and fourth). Sometimes there are also more than two authors who contributed equally. In SeqVis: Visualization of compositional heterogeneity in large alignments of nucleotides the first eight authors contributed equally. A study on “equal contribution” designation in biomedical papers found that this type of designation is becoming increasingly common and can be associated with nearly every position in the byline.

To account for “equal contribution” groupings, I make the assumption that a set of authors who contributed equally must be consecutive in the authorship ordering. This assumption is certainly reasonable in the biological sciences given that there are two gradients of “contribution” (one from the front and one from the end of the authorship list), and that contributions for those in the end gradient are fundamentally distinct from those in the front. An authorship designation for a paper with n authors therefore consists of two separate parts: the n! ways to order the authors, and then the $2^{n-1}$ ways of designating groups of equal contribution for consecutive authors. The latter enumeration is simple: designation of equal authorship is in one-to-one correspondence with placement of dividers in the n-1 gaps between the authors in the authorship list. In the extreme case of placement of no dividers the corresponding designation is that all authors contributed equally. Similarly, the placement of dividers between all consecutive pairs of authors corresponds to all contributions being distinct. Thus, the total number of authorship orderings/designations is given by $n! \cdot 2^{n-1}$. These numbers also enumerate the number of ways to lace a shoe. Other examples of objects whose enumeration results in these numbers are given in the Online Encyclopedia of Integer Sequences entry for this sequence (A002866). The first twenty numbers are:

1, 4, 24, 192, 1920, 23040, 322560, 5160960, 92897280, 1857945600, 40874803200, 980995276800, 25505877196800, 714164561510400, 21424936845312000, 685597979049984000, 23310331287699456000, 839171926357180416000, 31888533201572855808000, 1275541328062914232320000.

In the case of a paper with 60 authors, the number of ways to order authors and designate equal contribution is much larger than the number of atoms in the universe. Good luck with your next consortium project!
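The count $n! \cdot 2^{n-1}$ is easy to check by machine. The sketch below computes the formula and, for small n, verifies it by explicitly enumerating every ordering together with every placement of dividers in the n-1 gaps. The function names are illustrative, and the figure of 10^80 atoms in the universe is a commonly quoted rough estimate that is assumed here rather than taken from the post.

```python
from itertools import permutations, product
from math import factorial

def authorship_designations(n):
    """Orderings of n authors times equal-contribution groupings of consecutive authors."""
    return factorial(n) * 2 ** (n - 1)

def brute_force_count(n):
    """Enumerate each (ordering, divider placement) pair; only feasible for small n."""
    count = 0
    for _order in permutations(range(n)):
        # True means a divider sits in that gap, False means the neighbors are equal.
        for _dividers in product((False, True), repeat=n - 1):
            count += 1
    return count

first_twenty = [authorship_designations(n) for n in range(1, 21)]
print(first_twenty[9])  # the ten-author case

ATOMS_IN_UNIVERSE = 10 ** 80  # rough, commonly quoted estimate (assumption)
print(authorship_designations(60) > ATOMS_IN_UNIVERSE)
```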

Continuous-time Markov chain models for DNA mutations on a phylogenetic tree (e.g. the Jukes-Cantor model, the Kimura models, and more generally models of the Felsenstein hierarchy) have the simple and convenient property of multiplicativity. Specifically, if Q is a rate matrix then the associated substitution matrices are multiplicative in the following sense:

$e^{Q(t_1+t_2)} = e^{Qt_1}e^{Qt_2}$.

This follows directly from the fact that the matrices $Qt_1$ and $Qt_2$ commute, because for any two commuting matrices A and B

$e^{A+B} = e^{A}e^{B}$.

This means that substitutions over a time period 2t are equivalently described as substitutions occurring over a time period t, followed by substitutions occurring afterwards over another time period t.
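Multiplicativity can be seen concretely in the Jukes-Cantor model, where the substitution matrix has a well-known closed form. The sketch below assumes the parameterization in which every off-diagonal entry of Q is alpha and every diagonal entry is -3·alpha; the values of alpha, t1 and t2 are arbitrary choices for illustration.

```python
from math import exp

def jukes_cantor(alpha, t):
    """Closed-form e^{Qt} for a Jukes-Cantor rate matrix Q
    (assumed: off-diagonal entries alpha, diagonal entries -3*alpha)."""
    decay = exp(-4 * alpha * t)
    same = 0.25 + 0.75 * decay   # probability of ending on the starting nucleotide
    diff = 0.25 - 0.25 * decay   # probability of ending on one specific other nucleotide
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def max_abs_diff(A, B):
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))

alpha, t1, t2 = 0.1, 0.5, 1.25
direct = jukes_cantor(alpha, t1 + t2)
composed = mat_mul(jukes_cantor(alpha, t1), jukes_cantor(alpha, t2))
gap = max_abs_diff(direct, composed)
print(gap)  # zero up to floating point rounding
```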

But what if over the course of time the rate matrix changes? For example, suppose that for a period of time t mutations proceed according to a rate matrix Q, and following that, for another period of time t,  mutations proceed according to a rate matrix R? Is it true that the substitutions after time 2t will behave as if mutations occurred for a time 2t according to the (average) rate matrix $\frac{Q+R}{2}$?

If Q and R commute the answer will be yes, as Qt and Rt will also be commutative and the multiplicativity property will hold. But what if Q and R don’t commute? Is there any relationship at all between $e^{\frac{Q+R}{2}2t}$ and the matrices $e^{Qt}$ and $e^{Rt}$?

This week I visited Yale University to give a talk in the Center for Biomedical Data Science seminar series.  I was invited by Smita Krishnaswamy, who organized a wonderful visit that included many interesting conversations not only in computational biology, but also applied math, computer science and statistics (Yale has strong programs in applied mathematics, statistics and data science, computer science and biostatistics). At dinner I learned from Dan Spielman of the Golden-Thompson inequality which provides a beautiful answer to the question above in the case where Q and R are symmetric. The theorem is a trace inequality for Hermitian matrices A and B:

$tr(e^{A+B}) \leq tr(e^Ae^B)$.

This inequality is well known in statistical mechanics and random matrix theory but I don’t believe it is known in the phylogenetics community, hence this post. The phylogenetic interpretation of the pieces of the Golden-Thompson inequality (replacing A with Qt and B with Rt) is straightforward:

• The matrices $e^{Qt}$ and $e^{Rt}$ are substitution matrices for the rate matrices Q and R respectively.
• The product $e^{Qt}e^{Rt}$ is the substitution matrix corresponding to mutations occurring with rate matrix Q for time t followed by rate matrix R for time t.
• The matrix $e^{Qt+Rt} = e^{\frac{Q+R}{2} \cdot 2t}$ is the substitution matrix for mutations occurring with rate $\frac{Q+R}{2}$ for time 2t.
• Since the trace of a substitution matrix is (after dividing by the number of states, for a uniformly chosen starting nucleotide) the probability that there is no transition, or equivalently the probability that a change in nucleotide does not occur, the Golden-Thompson inequality states that for two symmetric rate matrices Q and R, the probability of a substitution after time 2t is lower when mutations occur first at rate Q for time t and then at rate R for time t, than if they occur at rate $\frac{Q+R}{2}$ for time 2t.

In other words, rate changes decrease the expected number of substitutions in comparison to what one would see if rates are constant.
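The inequality $tr(e^{Qt+Rt}) \leq tr(e^{Qt}e^{Rt})$ can be checked numerically on a small example. The sketch below makes several assumptions for illustration: Q is a Kimura two-parameter style matrix with arbitrarily chosen rates, R is a symmetric rate matrix that exchanges only A and C (chosen so that Q and R do not commute), and the matrix exponential is a truncated Taylor series, which is adequate for small matrices like these but is not a general-purpose routine.

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_scale(A, c):
    return [[c * x for x in row] for row in A]

def mat_exp(A, terms=60):
    """Truncated Taylor series sum_k A^k / k! (fine for small-norm matrices)."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = mat_scale(mat_mul(term, A), 1.0 / k)
        result = mat_add(result, term)
    return result

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def golden_thompson_traces(Q, R, t):
    """Return (tr e^{(Q+R)t}, tr e^{Qt} e^{Rt})."""
    averaged = trace(mat_exp(mat_scale(mat_add(Q, R), t)))
    piecewise = trace(mat_mul(mat_exp(mat_scale(Q, t)), mat_exp(mat_scale(R, t))))
    return averaged, piecewise

# Nucleotide order A, G, C, T; transitions at rate a, transversions at rate b.
a, b = 0.3, 0.1
d = -(a + 2 * b)
Q = [[d, a, b, b],
     [a, d, b, b],
     [b, b, d, a],
     [b, b, a, d]]
# Symmetric rate matrix with a single exchange A <-> C at rate 1.
R = [[-1.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 0.0],
     [1.0, 0.0, -1.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]

averaged, piecewise = golden_thompson_traces(Q, R, 1.0)
# Dividing by 4 gives the probability of no net change from a uniform start.
print(averaged / 4, piecewise / 4)
```

When the two rate matrices commute (for example when R is replaced by Q) the two traces agree, which recovers the multiplicative case.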

The Golden-Thompson inequality was discovered independently by Sidney Golden and Colin Thompson in 1965. A proof is explained in an expository blog post by Terence Tao who heard of the Golden-Thompson inequality only eight years ago, which makes me feel a little bit better about not having heard of it until this week! It would be nice if there was a really simple proof but that appears not to be the case (there is a purported one page proof in a paper titled Golden-Thompson from Davis, however what is proved there is the different inequality $tr(e^{A+B}) \leq tr(e^A)tr(e^B)$, which can be shown, by virtue of another matrix trace inequality, to be a weaker inequality).

There is considerable interest in evolutionary biology in models that allow for time-varying rates of mutation, as there is substantial evidence of such variation. The Golden-Thompson inequality provides an additional insight for how mutation rate changes over time can affect naïve estimates based on homogeneity assumptions.

The Felsenstein hierarchy (from Algebraic Statistics for Computational Biology).

Six years ago I received an email from a colleague in the mathematics department at UC Berkeley asking me whether he should participate in a study that involved “collecting DNA from the brightest minds in the fields of theoretical physics and mathematics.”  I later learned that the codename for the study was “Project Einstein“, an initiative of entrepreneur Jonathan Rothberg with the goal of finding the genetic basis for “math genius”. After replying to my colleague I received an inquiry from another professor in the department, and then another and another… All were clearly flattered that they were selected for their “brightest mind”, and curious to understand the genetic secret of their brilliance.

I counseled my colleagues not to participate in this ill-advised genome-wide association study. The phenotype was ill-defined and in any case the study would be underpowered (only 400 “geniuses” were solicited), but I believe many of them sent in their samples. As far as I know their DNA now languishes in one of Jonathan Rothberg’s freezers. No result has ever emerged from “Project Einstein”, and I’d pretty much forgotten about the ego-driven inquiries I had received years ago. Then, last week, I remembered them when reading a series of blog posts and associated commentary on evolutionary biology by some of the most distinguished mathematicians in the world.

1. Sir Timothy Gowers is blogging about evolutionary biology?

It turns out that mathematicians such as Timothy Gowers and Terence Tao are hosting discussions about evolutionary biology (see On the recently removed paper from the New York Journal of Mathematics, Has an uncomfortable truth been suppressed, Additional thoughts on the Ted Hill paper) because some mathematician wrote a paper titled “An Evolutionary Theory for the Variability Hypothesis“, and an ensuing publication kerfuffle has the mathematics community up in arms. I’ll get to that in a moment, but first I want to focus on the scientific discourse in these elite math blogs. If you scroll to the bottom of the blog posts you’ll see hundreds of comments, many written by eminent mathematicians who are engaged in pseudoscientific speculation littered with sexist tropes. The number of inane comments is astonishing. For example, in a comment on Timothy Gowers’ blog, Gabriel Nivasch, a lecturer at Ariel University writes

“It’s also ironic that what causes so much controversy is not humans having descended from apes, which since Darwin people sort-of managed to swallow, but rather the relatively minor issue of differences between the sexes.”

This person’s understanding of the theory of evolution is where the Victorian public was at in England ca. 1871:

In mathematics, just a year later in 1872, Karl Weierstrass published what at the time was considered another monstrosity, one that threw the entire mathematics community into disarray. The result was just as counterintuitive for mathematics as Darwin’s theory of evolution was for biology. Weierstrass had constructed a function that is uniformly continuous on the real line, but not differentiable on any interval:

$f(x) = \sum_{n=0}^{\infty} \left( \frac{1}{2} \right)^n \cos(11^n \pi x)$.

Not only does this construction remain valid today as it was back then, but lots of mathematics has been developed in its wake. What is certain is that if one doesn’t understand the first thing about Weierstrass’ construction, e.g. one doesn’t know what a derivative is, one won’t be able to contribute meaningfully to modern research in analysis. With that in mind consider the level of ignorance of someone who does not even understand the notion of common ancestor in evolutionary biology, and who presumes that biologists have been idle and have learned nothing during the last 150 years. Imagine the hubris of mathematicians spewing incoherent theories about sexual selection when they literally don’t know anything about human genetics or evolutionary biology, and haven’t read any of the relevant scientific literature about the subject they are rambling about. You don’t have to imagine. Just go and read the Tao and Gowers blogs and the hundreds of comments they have accrued over the past few days.

2. Hijacking a journal

To understand what is going on requires an introduction to Igor Rivin, a professor of mathematics at Temple University and, of relevance in this mathematics matter, an editor of  the New York Journal of Mathematics (NYJM) [Update November 21, 2018: Igor Rivin is no longer an editor of NYJM]. Last year Rivin invited the author of a paper on the variability hypothesis to submit his work to NYJM. He solicited two reviews and published it in the journal. For a mathematics paper such a process is standard practice at NYJM,  but in this case the facts point to Igor Rivin hijacking the editorial process to advance a sexist agenda. To wit:

• The paper in question, “An Evolutionary Theory for the Variability Hypothesis” is not a mathematics or biology paper but rather a sexist opinion piece. As such it was not suitable for publication in any mathematics or biology journal, let alone in the NYJM which is a venue for publication of pure mathematics.
• Editor Igor Rivin did not understand the topic and therefore had no business soliciting or handling review of the paper.
• The “reviewers” of the paper were not experts in the relevant mathematics or biology.

To elaborate on these points I begin with a brief history of the variability hypothesis. Its origin is Darwin’s 1871 book on “The Descent of Man and Selection in Relation to Sex” which was ostensibly the beginning of the study of sexual selection. However as explained in Stephanie Shields’ excellent review, while the variability hypothesis started out as a hypothesis about variance in physical and intellectual traits, at the turn of the 20th century it morphed to a specific statement about sex differences in intelligence. I will not, in this blog post, attempt to review the entire field of sexual selection nor will I discuss in detail the breadth of work on the variability hypothesis. But there are three important points to glean from the Shields review: 1. The variability hypothesis is about intellectual differences between men and women and in fact this is what “An evolutionary theory for the variability hypothesis” tries really hard to get across. Specifically, that the best mathematicians are males because of biology. 2. There has been dispute for over a century about the extent of differences, should they even exist, and 3. Naïve attempts at modeling sexual selection are seriously flawed and completely unrealistic. For example naïve models that assume the same genetic mechanism produces both high IQ and mental deficits are ignoring ample evidence to the contrary.

Insofar as modeling of sexual selection is concerned, there was already statistical work in the area by Karl Pearson in 1895 (see “Note on regression and inheritance in the case of two parents“). In the paper Pearson explicitly considers the sex-specific variance of traits and the relationship of said variance to heritability. However as with much of population genetics, it was Ronald Fisher who significantly advanced the theory of sexual selection, first in the 1930s (Fisher’s principle) and then later in important work from 1958 on what is now referred to as Darwin-Fisher theory (see, e.g. Kirkpatrick, Price and Arnold 1990). Amazingly, despite including 51 citations in the final arXiv version of “An Evolutionary Theory for the Variability Hypothesis”, there isn’t a single reference to prior work in the area. I believe the author was completely unaware of the 150 years of work by biologists, statisticians, and mathematical biologists in the field.

What is cited in “An Evolutionary Theory for the Variability Hypothesis”? There is an inordinate amount of cherry picking of quotes from papers to bolster the message the author is intent on getting across: that there are sex-differences in variance of intelligence (whatever that means), specifically males are more variable. The arXiv posting has undergone eight revisions, and somewhere among these revisions there is even a brief cameo by Lawrence Summers and a regurgitation of his infamous sexist remarks. One of the thorough papers reviewing evidence for such claims is “The science of sex differences in science and mathematics” by Halpern et al. 2007. The author cherry picks a quote from the abstract of that paper, namely that “the reasons why males are often more variable remain elusive.” and follows it with a question posed by statistician Howard Wainer that implicitly makes a claim: “Why was our genetic structure built to yield greater variation among males than females?” An actual reading of the Halpern et al. paper reveals that the excess of males in the top tail of the distribution of quantitative reasoning has dramatically decreased during the last few decades, an observation that cannot be explained by genetics. Furthermore, females have a greater variability in reading and writing than males. They point out that these findings “run counter to the usual conclusion that males are more variable in all cognitive ability domains”. The author of “An Evolutionary Theory for the Variability Hypothesis” conveniently omits this from a very short section titled “Primary Analyses Inconsistent with the Greater Male Variability Hypothesis.” This is serious amateur time.

One of the commenters on Terence Tao’s blog explained that the mathematical theory in “An Evolutionary Theory for the Variability Hypothesis” is “obviously true”, and explained its premise for the layman:

It’s assumed that women only pick the “best” – according to some quantity X percent of men as partners where X is (much) smaller than 50, let’s assume. On the contrary, men are OK to date women from the best Y percent where Y is above 50 or at least greater than X.

Let’s go with this for a second, but think about how this premise would have to change to be consistent with results for reading and writing (where variance is higher in females). Then we must go with the following premise for everything to work out:

It’s assumed that men only pick the “best” – according to some quantity X percent of women as partners where X is (much) smaller than 50, let’s assume. On the contrary, women are OK to date men from the best Y percent where Y is above 50 or at least greater than X.

Perhaps I should write this up (citing only studies on reading and writing) and send it to Igor Rivin, editor at the New York Journal of Mathematics as my explanation for my greater variability hypothesis?

Actually, I hope that will not be possible. Igor Rivin should be immediately removed from the editorial board of the New York Journal of Mathematics. I looked up Rivin’s credentials in terms of handling a paper in mathematical biology. Rivin has an impressive publication list, mostly in geometry but also a handful of publications in other areas. He, and separately Mary Rees, are known for showing that the number of simple closed geodesics of length at most L grows polynomially in L (this result was the beginning of some of the impressive results of Maryam Mirzakhani who went much further and subsequently won the Fields Medal for her work). Nowhere among Rivin’s publications, or in many of his talks which are online, or in his extensive online writings (on Twitter, Facebook etc.) is there any evidence that he has a shred of knowledge about evolutionary biology. The fact that he accepted a paper that is completely untethered from the field in which it purports to make an advance is further evidence of his ignorance.

Ignorance is one thing but hijacking a journal for a sexist agenda is another. Last year I encountered a Facebook thread on which Rivin had commented in response to a BuzzFeed article titled A Former Student Says UC Berkeley’s Star Philosophy Professor Groped Her and Watched Porn at Work. It discussed a lawsuit alleging that John Searle had sexually harassed, assaulted and retaliated against a former student and employee. While working for Searle the student was paid $1,000 a month with an additional $3,000 for being his assistant. On the Facebook thread Igor Rivin wrote

Here is an editor of the NYJM suggesting that a student should have effectively known that if she was paid $36K/year for work as an assistant of a professor (not a high salary for such work), she ought to expect sexual harassment and sexual assault as part of her job. Her LinkedIn profile (which he linked to) showed her to have worked a summer in litigation. So he was essentially saying that this victim prostituted herself with the intent of benefiting financially via suing John Searle. Below is, thankfully, a quick and stern rebuke from a professor of mathematics at Indiana University:

I mention this because it shows that Igor Rivin has a documented history of misogyny. Thus his acceptance of a paper providing a “theory” for “higher general intelligence” in males, a paper in an area he knows nothing about to a journal in pure mathematics is nothing other than hijacking the editorial process of the journal to further a sexist agenda.

How did he actually do it? He solicited a paper that had been rejected elsewhere, and sent it out for review to two reviewers who turned it around in 3 weeks. I mentioned above that the “reviewers” of the paper were not experts in the relevant mathematics or biology. This is clear from an examination of the version of the paper that the NYJM accepted. The 51 references were reduced to 11 (one of them is to the author’s preprint). None of the remaining 10 references cite any relevant prior work in evolutionary biology on sexual selection. The fundamental flaws of the paper remain unaddressed. The entire content of the reviews was presumably something along the lines of “please tone down some of the blatant sexism in the paper by removing 40 gratuitous references”.

In defending the three week turnaround Rivin wrote (on Gowers’ blog) “Three weeks: I assume you have read the paper, if so, you will have found that it is quite short and does not require a huge amount of background.” Since when does a mathematician judge the complexity of reviewing a paper by its length? I took a look at Rivin’s publications; many of them are very short. Consider for example “On geometry of convex ideal polyhedra in hyperbolic 3-space”. The paper is 5 pages with 3 references. It was received 15 October 1990 and in revised form 27 January 1992. Also excuse me, but if one thinks that a mathematical biology paper “does not require a huge amount of background” then one simply doesn’t know any mathematical biology.

3. Time for mathematicians to wet their paws

The irony of mathematicians who believe they are in the high end tail of some ill-specified distribution of intelligence demonstrating en masse that they are idiots is not lost on those of us who actually work in mathematics and biology. Gian-Carlo Rota’s ghost can be heard screaming from Vigevano “The lack of real contact between mathematics and biology is either a tragedy, a scandal, or a challenge, it is hard to decide which!!”

I’ve spent the past 15 years of my career focusing on Rota’s call to address the challenge of making more contacts between mathematics and biology. The two cultures are sometimes far apart but the potential for both fields, if there is real contact, is tremendous. Not only can mathematics lead to breakthroughs in biology, biology can also lead to new theorems in mathematics. In response to incoherent rambling about genetics on Gowers’ blog, Noah Snyder, a math professor at Indiana University gave sage advice:

I really wish you wouldn’t do this. A bunch of mathematicians speculating about stuff they know nothing about is not a good way to get to the truth. If you really want to do some modeling of evolutionary biology, then find some experts to collaborate or at least spend a year learning some background.

What he is saying is די קאַץ האָט ליב פֿיש אָבער זי װיל ניט די פֿיס אײַננעצן (the cat likes fish but she doesn’t want to wet her paws). If you’re a mathematician who is interested in questions of evolutionary biology, great! But first you must get your paws wet. If you refuse to do so then you can do real harm.

It might be tempting to imagine that mathematics is divorced from reality and has no impact or influence on the world, but nothing could be farther from the truth. Mathematics matters. In the case discussed in this blog post, the underlying subtext is pervasive sexism and misogyny in the mathematics profession, and if this sham paper on the variance hypothesis had gotten the stamp of approval of a journal as respected as NYJM, real harm to women in mathematics and women who in the future may have chosen to study mathematics could have been done. It’s no different than the case of Andrew Wakefield‘s paper in The Lancet implying a link between vaccinations and autism. By the time of the retraction (twelve years after publication of the article, in 2010), the paper had significantly damaged public health, and even today its effects, namely death as a result of reduced vaccination, continue to be felt. It’s not good enough to say: “Once the rockets are up, who cares where they come down? That’s not my department,” says Wernher von Braun.

Here are two IQ test questions for you:

1. Fill in the blank in the sequence 1, 4, 9, 16, 25, __ , 49, 64, 81.
2. What number comes next in the sequence 1, 1, 2, 3, 5, 8, 13, .. ?

Please stop and think about these questions before proceeding. Spoiler alert: the blog post reveals the answers.

Earlier this month I posted a new paper on the bioRxiv:

Jase Gehring, Jeff Park, Sisi Chen, Matt Thomson, and Lior Pachter, Highly Multiplexed Single-Cell RNA-seq for Defining Cell Population and Transcriptional Spaces, bioRxiv, 2018.

The paper offers some insights into the benefits of multiplex single-cell RNA-Seq, a molecular implementation of information multiplexing. The paper also reflects the benefits of a multiplex lab, and the project came about thanks to Jase Gehring, a multiplex molecular biologist/computational biologist in my lab.

mult·i·plex /`məltəˌpleks/ adjective
– consisting of many elements in a complex relationship.
– involving simultaneous transmission of several messages along a single channel of communication.

Conceptually, Jase’s work presents a method for chemically labeling cells from multiple samples with DNA oligonucleotides so that samples can be pooled prior to single-cell RNA-Seq, yet cells can subsequently be associated with their samples of origin after sequencing. This is achieved by labeling all cells from a sample with DNA that is unique to that sample; in the figure below colors are used to represent the different DNA tags that are used for each sample:

This is analogous to the barcoding of transcripts in single-cell RNA-Seq, which allows transcripts from the same cell of origin to be associated with each other, yet in this framework there is an additional layer of barcoding of cells.

The tagging mechanism is a click chemistry one-pot, two-step reaction in which cell samples are exposed to methyltetrazine-activated DNA (MTZ-DNA) oligos as well as the amine-reactive cross-linker NHS-trans-cyclooctene (NHS-TCO). The NHS functionalized oligos are formed in situ by reaction of methyltetrazine with trans-cyclooctene (the inverse-electron-demand Diels-Alder (IEDDA) reaction). Nucleophilic amines present on all proteins, but not nucleic acids, attack the in situ-formed NHS-DNA, chemoprecipitating the functionalized oligos directly onto the cells.

MTZ-DNAs are made by activating 5′-amine modified oligos with NHS-MTZ for the IEDDA reaction, and they are designed with a PCR primer, a cell tag (a unique “barcode” sequence) and a poly-A tract so that they can be captured by poly-T during single-cell RNA-Seq. Such oligos can be readily ordered from IDT. We are careful to refer to the identifying sequences in these oligos as cell tags rather than barcodes so as not to confuse them with cell barcodes, which are used in single-cell RNA-Seq to associate transcripts with cells.

The process of sample tagging for single-cell RNA-Seq is illustrated in the figure below. It shows how the tags, appearing as synthetic “transcripts” in cells, are captured during 3′ based microfluidic single-cell RNA-Seq and are subsequently deciphered by sequencing a tag library alongside the cDNA library:

The significance of multiplexing is manifold. First, by labeling cells prior to performing single-cell RNA-Seq, multiplexing allows for controlling a trade-off between the number of cells assayed per sample and the total number of samples analyzed. This allows for leveraging the large number of cells that can be assayed with current technologies to enable complex experimental designs based on many samples. In our paper we demonstrate this by performing an experiment consisting of single-cell RNA-Seq of neural stem cells (NSCs) exposed to 96 different combinations of growth factors. The experiment was conducted in collaboration with the Thomson lab, which is interested in performing large-scale perturbation experiments to understand cell fate decisions in response to developmental signals. We examined NSCs subjected to different concentrations of Scriptaid/Decitabine, epidermal growth factor/basic fibroblast growth factor, retinoic acid, and bone morphogenetic protein 4.
In other words, our experiment corresponded to a 4x4x6 table of conditions (96 in total), and for each condition we performed a single-cell RNA-Seq experiment (in multiplex). This is one of the largest (in terms of samples) single-cell RNA-Seq experiments to date: a 100-fold decrease in the number of cells we collected per sample allowed us to perform an experiment with 100x more samples. Without multiplexing, an experiment that cost us ~$7,000 would cost a few hundred thousand dollars, well outside the scope of what is possible in a typical lab. We certainly would not have been able to perform the experiment without multiplexing. Although the cost trade-off is impactful, there are many other important implications of multiplexing as well:

• Whereas simplex single-cell RNA-Seq is descriptive, focusing on what is in a single sample, multiplex single-cell RNA-Seq allows for asking “how?” For example, how do cell states change in response to perturbations? How does disease affect cell state and type?
• Simplex single-cell RNA-Seq leads to systematics arguments about clustering: when do cells that cluster together constitute a “cell type”? How many clusters are real? How should clustering be performed? Multiplex single-cell RNA-Seq provides an approach to assigning significance to clusters via their association with samples. In our paper, we specifically utilized sample identification to determine the parameters/thresholds for the clustering algorithm. In the accompanying figure, the left hand side is a t-SNE plot labeled by sample, and the right hand side shows de novo clusters. The experiment allowed us to confirm the functional significance of a cluster as a cell state resulting from a specific range of perturbation conditions.
• Multiplexing reduces batch effects, and also makes possible the procurement of more replicates in experiments, an important aspect of single-cell RNA-Seq as noted by Hicks et al. 2017.
• Multiplexing has numerous other benefits, e.g. allowing for the detection of doublets and their removal prior to analysis. This useful observation of Stoeckius et al. makes possible higher-throughput single-cell RNA-Seq. We also found an intriguing relationship between tag abundance and cell size. Both of these phenomena are illustrated in one supplementary figure of our paper that I’m particularly fond of:

It shows a multiplexing experiment in which 8 different samples have been pooled together. Two of these samples are human-only samples, and two are mouse-only. The remaining four are samples in which human and mouse cells have been mixed together (with 2, 3, 4 and 5 tags being used for each sample respectively). The t-SNE plot is made from the tag counts, which is why the samples are neatly separated into 8 clusters. However, in Panel b, the cells are colored by their cDNA content (human, mouse, or both). The pure samples are readily identifiable, as are the mixed samples. Cell doublets (purple) can be easily identified and therefore removed from analysis. The relationship between cell size and tag abundance is shown in Panel d. For a given sample with both human and mouse cells (bottom row), human cells give consistently higher sample tag counts. Along with all of this, the figure shows we are able to label a sample with 5 tags, which means that using only 20 oligos (this is how many we worked with for all of our experiments) it is possible to label ${20 \choose 5} = 15,504$ samples.
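The counting argument above can be checked in a few lines of Python. The sketch below also illustrates, under assumptions of my own, how a combinatorial codebook might be used: each sample gets a distinct 5-subset of the 20 oligos, and a cell whose detected tags match no single sample's subset is flagged as ambiguous (for example, a possible doublet). The function names, the lexicographic assignment of subsets, and the `min_count` threshold are all hypothetical and are not taken from the paper.

```python
from itertools import combinations, islice
from math import comb

N_OLIGOS = 20        # distinct tag oligos, as stated in the post
TAGS_PER_SAMPLE = 5  # tags used to label one sample


def assign_tag_sets(n_samples, n_oligos=N_OLIGOS, k=TAGS_PER_SAMPLE):
    """Map a distinct k-subset of oligo indices to each sample index.

    Subsets are taken in lexicographic order purely for illustration.
    """
    if n_samples > comb(n_oligos, k):
        raise ValueError("more samples than available tag combinations")
    subsets = islice(combinations(range(n_oligos), k), n_samples)
    return {frozenset(s): i for i, s in enumerate(subsets)}


def call_sample(tag_counts, codebook, min_count=10):
    """Return the sample index for a cell, or None if its tags are ambiguous.

    min_count is a made-up detection threshold, not a value from the paper.
    """
    detected = frozenset(t for t, n in enumerate(tag_counts) if n >= min_count)
    return codebook.get(detected)


# Number of 5-tag subsets of 20 oligos.
print(comb(N_OLIGOS, TAGS_PER_SAMPLE))  # 15504

codebook = assign_tag_sets(8)

# A cell carrying exactly the five tags of sample 0.
singlet = [0] * N_OLIGOS
for t in (0, 1, 2, 3, 4):
    singlet[t] = 50

# The same cell with one extra tag that belongs to sample 1's subset.
doublet = list(singlet)
doublet[5] = 40

print(call_sample(singlet, codebook))  # 0
print(call_sample(doublet, codebook))  # None
```

Lexicographically adjacent subsets share four of their five tags, so a real design would presumably choose subsets that differ in more positions to tolerate tag dropout; the sketch ignores this.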

• Thinking about hundreds (and soon thousands) of single-cell experiments is going to be complicated. The cell-gene matrix that is the fundamental object of study in single-cell RNA-Seq extends to a cell-gene-sample tensor. While more complicated, there is an opportunity for novel analysis paradigms to be developed. A hint of this is evident in our visualization of the samples by projecting the sample-cluster matrix. Specifically, the matrix below shows which clusters are represented within each sample, and the matrix is quantitative in the sense that the magnitude of each entry represents the relative abundance of cells in a sample occupying a given cluster:
A three-dimensional PCA of this matrix reveals interesting structure in the experiment. Here each point is an entire sample, not a cell, and one can see how changes in factors move samples in “experiment space”:

As experiments become even more complicated, and single-cell assays become increasingly multimodal (including not only RNA-Seq but also protein measurements, methylation data, etc.), development of a coherent mathematical framework for single-cell genomics will be central to interpreting the data. As Dueck et al. 2015 point out, such analysis is likely to not only be mathematically interesting, but also functionally important.
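The sample-cluster matrix described above can be built directly from per-cell labels. Here is a minimal sketch in plain Python, assuming each cell has already been assigned a sample and a cluster; the function name and the toy labels are hypothetical, and this is not the code used for the paper.

```python
from collections import Counter


def sample_cluster_matrix(sample_ids, cluster_ids):
    """Build a samples x clusters matrix of relative abundances.

    Entry (s, c) is the fraction of cells from sample s that fall in
    cluster c, so every row sums to 1.
    """
    samples = sorted(set(sample_ids))
    clusters = sorted(set(cluster_ids))
    pair_counts = Counter(zip(sample_ids, cluster_ids))
    cells_per_sample = Counter(sample_ids)
    matrix = [
        [pair_counts[(s, c)] / cells_per_sample[s] for c in clusters]
        for s in samples
    ]
    return samples, clusters, matrix


# Toy data: one (sample, cluster) label pair per cell.
sample_ids = ["s1", "s1", "s1", "s1", "s2", "s2", "s2", "s2"]
cluster_ids = ["A", "A", "A", "B", "A", "B", "B", "B"]

samples, clusters, matrix = sample_cluster_matrix(sample_ids, cluster_ids)
for s, row in zip(samples, matrix):
    print(s, row)
```

Each row of this matrix is one sample's coordinates in cluster space, which is what a PCA would take as input to place whole samples, rather than cells, in “experiment space”.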

We aren’t the only group thinking about sample multiplexing for single-cell RNA-Seq. The “demuxlet” method by Kang et al., 2017 is an in silico approach based on multiplexing from genomic variation. Kang et al. show that if pooled samples are genetically heterogeneous, genotype data can be used to separate samples, providing an effective solution for multiplexing single-cell RNA-Seq in large human studies. However, demuxlet has limitations; for example, it cannot be used for samples from a homogeneous genetic background. Two papers at the end of last year developed an epitope labeling strategy for multiplexing: Stoeckius et al. 2017 and Peterson et al. 2017. While epitope labeling provides additional information that can be of interest, our method is more universal in that it can be used to multiplex any kind of samples, even from different organisms (a point we make with the species mixing multiplex experiment I described above). The approaches are also not exclusive: epitope labeling could be coupled to a live cell DNA tagging multiplex experiment, allowing for the same epitopes to be assayed together in different samples. Finally, our click chemistry approach is fast, cheap and convenient, immediately providing multiplex capability for thousands, or even hundreds of thousands of samples.

One interesting aspect of Jase’s multiplexing paper is that the project it describes was itself a multiplexing experiment of sorts. The origins of the experiment date to 2005 when I was awarded tenure in the mathematics department at UC Berkeley. As is customary after tenure trauma, I went on sabbatical for a year, and I used that time to ponder career related questions that one is typically too busy for. Questions I remember thinking about: Why exactly did I become a computational biologist? Was a mathematics department the ideal home for me? Should I be more deeply engaged with biologists? Were the computational biology papers I’d been writing meaningful? What is computational biology anyway?

In 2008, partly as a result of my sabbatical rumination but mostly thanks to the encouragement and support of Jasper Rine, I changed the structure of my appointment and joined the UC Berkeley Molecular and Cell Biology (MCB) department (50%). A year later, I responded to a call by then Dean Mark Schlissel and requested wet lab space in what was to become the Li Ka Shing Center at UC Berkeley. This was not a rash decision. After working with Cole Trapnell on RNA-Seq I’d come to the conclusion that a small wet lab would be ideal for our group to better learn the details of the technologies we were working on, and I felt that practicing them ourselves would ultimately be the best way to arrive at meaningful (computational) methods contributions. I’d also visited David Haussler’s wet lab where I met Jason Underwood, who was working on FragSeq at the time. I was impressed with his work and with what I saw as important benefits of real contact between wet and dry, experiment and computation.

In 2011 I was delighted to move into my new wet lab. The decision to give me a few benches was a bold and unexpected one, spearheaded by Mark Schlissel, but also supported by a committee he formed to decide on the makeup of the building. I am especially grateful to John Ngai, Art Reingold and Randy Schekman for their help. However, I was in a strange position starting a wet lab as a tenured professor. On the one hand the security of tenure provided some reassurance that a failure in the wet lab would not immediately translate to a failure of career. On the other hand, I had no startup funds to buy all the basic infrastructure necessary to run a lab. CIRM, Mark Schlissel, and later other senior faculty in Molecular & Cell Biology at UC Berkeley, stepped in to provide me with the basics: a -80 and -20, access to a shared cold room, a Bioanalyzer (to be shared with others in the building), and a thermocycler. I bought some other basic equipment but the most important piece was the recruitment of my first MCB graduate student: Shannon Hateley. Shannon and I agreed that she would set up the lab and also be lab manager, while I would supervise purchasing and other organizational lab matters. I obtained informed consent from Shannon prior to her joining my lab, for what would be a monumental effort requested of her. We also agreed she would be co-advised by another molecular biologist “just in case”.

With Shannon’s work and then my second molecular biology student, Lorian Schaeffer, the lab officially became multiplexed. Jase, who initiated and developed not only the molecular biology but also the computational biology of Gehring et al. 2018, is the latest experimentalist to multiplex in our group. However, some of the mathematicians now multiplex as well. This has been a boon to the research of the group and I see Jase’s paper as fruit that has grown from the diversity in the lab. Moving forward, I see increasing use of mathematical ideas in the development of novel molecular biology. For example, current single-cell RNA-Seq multiplexing is a form of information multiplexing that is trivial in comparison to the multiplexing ideas from information theory; the achievements are in the molecular implementations, but in the future I foresee much more of a blur between wet and dry and increasingly sophisticated mathematical ideas being implemented with molecular biology.

Hedy Lamarr, the mother of multiplexing.