To place this admission in context I need to start with Mordell’s finite basis theorem, which has been on my mind this past week. The theorem, proved in 1922, states that the rational points on an elliptic curve defined over the rational numbers form a finitely generated abelian group. There is quite a bit of math jargon in this statement that makes it seem somewhat esoteric, but it’s actually a beautiful, fundamental, and accessible result at the crossroads of number theory and algebraic geometry.

First, the phrase *elliptic curve* is just a fancy name for a polynomial equation of the form *y² = x³ + ax + b* (subject to some technical conditions). “Defined over the rationals” just means that *a* and *b* are rational numbers. For example, *a=-36, b=0* or *a=0, b=-26* would each produce an elliptic curve. A “rational point on the curve” refers to a solution to the equation whose coordinates are rational numbers. For example, if we’re looking at the case where *a=0* and *b=-26* then the elliptic curve is *y² = x³ – 26* and one rational solution would be the point (35,-207). This solution also happens to be an integer solution; try to find some others! Elliptic curves are pretty and one can easily explore them in WolframAlpha. For example, the curve *y² = x³ – 36x* looks like this:

WolframAlpha does more than just provide a picture. It finds integer solutions to the equation. In this case just typing the equation for the elliptic curve into the WolframAlpha box produces:

One of the cool things about elliptic curves is that the points on them form the structure of an *abelian* *group*. That is to say, there is a way to “add” points on the curves. I’m not going to go through how this works here but there is a very good introduction to this connection between elliptic curves and groups in an exposition by Tanuj Nayak, an undergrad at Carnegie Mellon University.

Interestingly, even just the rational points on an elliptic curve form a group, and Mordell’s theorem says that for an elliptic curve defined over the rational numbers this group is *finitely generated*. That means that for such an elliptic curve one can describe *all* rational points on the curve as finite combinations of some finite set of points. To put this in context, we (humankind) have been interested in studying Diophantine equations since the time of Diophantus (3rd century). Trying to solve arbitrary polynomial equations is very difficult, so we restrict our attention to easier problems (elliptic curves). Working with integers is difficult, so we relax that requirement a bit and work with rational numbers. And here is a theorem that gives us hope, namely the hope that we can find *all* solutions to such problems because at least the description of the solutions can be finite.
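The “addition” of points is not spelled out in this post, but for readers who want to experiment, here is a minimal sketch of the standard chord-and-tangent rule for a curve *y² = x³ + ax + b*, using exact rational arithmetic. The function names `add` and `on_curve`, and the choice of the starting points (-3, 9) and (-2, 8) on *y² = x³ – 36x*, are illustrative assumptions rather than anything taken from the exposition linked above.

```python
from fractions import Fraction

# Coefficients of y^2 = x^3 + A*x + B; here the example curve y^2 = x^3 - 36x.
A, B = -36, 0

def on_curve(P):
    """Return True if P lies on the curve; None stands for the point at infinity."""
    if P is None:
        return True
    x, y = P
    return y * y == x**3 + A * x + B

def add(P, Q):
    """Add two points with the chord-and-tangent rule (None is the identity)."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and y1 == -y2:
        return None  # P and Q are mirror images, so their sum is the identity
    if P == Q:
        lam = (3 * x1 * x1 + A) / (2 * y1)  # slope of the tangent line at P
    else:
        lam = (y2 - y1) / (x2 - x1)         # slope of the chord through P and Q
    x3 = lam * lam - x1 - x2
    y3 = lam * (x1 - x3) - y1
    return (x3, y3)

P = (Fraction(-3), Fraction(9))
Q = (Fraction(-2), Fraction(8))

print(add(P, Q))  # the sum of two integer points
print(add(P, P))  # doubling P gives a rational, non-integer point
```

Adding (-3, 9) and (-2, 8) lands on (6, 0), and doubling (-3, 9) gives (25/4, -35/8), so new rational points can be produced from known ones without any searching.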

The idea of looking for all solutions to a problem, and not just one solution, is fundamental to mathematics. I recently had the pleasure of attending a lesson for 1st and 2nd graders by Oleg Gleizer, an exceptional mathematician who takes time not only to teach children mathematics, but to develop *mathematics* (not arithmetic!) curriculum that is accessible to them. The first thing Oleg asks young children is what they see when looking at this picture:

Children are quick to find *the* answer and reply either “rabbit” or “duck”. But the lesson they learn is that the answer to his question is that there is no single answer! Saying “rabbit” or “duck” is not a complete answer. In mathematics we seek *all* solutions to a problem. From this point of view, WolframAlpha’s “integer solutions” section is not satisfactory (it omits *x=6, y=0*), but while in principle one might worry that one would have to search forever, Mordell’s finite basis theorem provides some peace of mind for an important class of questions in number theory. It also guides mathematicians: if interested in a specific elliptic curve, think about how to find the (finite) generators for the associated group. Now the proof of Mordell’s theorem, or its natural generalization, the Mordell-Weil theorem, is not simple and requires some knowledge of algebraic geometry, but the statement of Mordell’s theorem and its meaning can be explained to kids via simple examples.
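To see why a list of “integer solutions” can be incomplete, it helps to try a naive search. The following is a rough sketch, assuming a hypothetical helper `integer_points` and an arbitrary search window; a bounded search like this can find points, but it can never certify that none are missing, which is exactly the worry that Mordell’s theorem helps to address.

```python
from math import isqrt

def integer_points(a, b, x_min, x_max):
    """Brute-force integer solutions of y^2 = x^3 + a*x + b with x_min <= x <= x_max."""
    points = []
    for x in range(x_min, x_max + 1):
        rhs = x**3 + a * x + b
        if rhs < 0:
            continue  # a negative right-hand side cannot be a square
        y = isqrt(rhs)
        if y * y == rhs:
            points.append((x, y))
            if y != 0:
                points.append((x, -y))  # the mirror image is also a solution
    return points

# Search y^2 = x^3 - 36x in an arbitrarily chosen window of x values.
found = integer_points(-36, 0, -10, 300)
print(found)
```

The point (6, 0) shows up in this search, as does the much larger point (294, 5040), which a smaller window would have missed.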

I don’t recall exactly when I learned Mordell’s theorem but I think it was while preparing for my qualifying exam in graduate school, when I studied Silverman’s book on elliptic curves for the cryptography section of my qualifying exam (yes, this math is even related to some very powerful schemes for cryptography!). But I do remember when a few years later a (mathematician) friend mentioned to me “the coolest paper ever”, a paper related to generalizations of Mordell’s theorem, the very theorem that I had studied for my exam. The paper was by two mathematicians, Steven Zucker and David Cox, and it was titled Intersection Number of Sections of Elliptic Surfaces. The paper described an algorithm for determining whether some sections form a basis for the Mordell-Weil group for certain elliptic surfaces. The content was not why my friend thought this paper was cool, and in fact I don’t think he ever read it. The excitement was because of the juxtaposition of author names. Apparently David Cox had realized that if he could coauthor a paper with his colleague Steven Zucker, they could publish a theorem, which when named after the authors, would produce a misogynistic and homophobic slur. Cox sought out Zucker for this purpose, and their mission was a “success”. Another mathematician, Charles Schwartz, wrote a paper in which he built on this “joke”. From his paper:

So now, in the mathematics literature, in an interesting part of number theory, you have the Cox-Zucker machine. Many mathematicians think this is hilarious. I thought this was hilarious. In fact, when I was younger I frequently boasted about this “joke”, and how cool mathematicians are for coming up with clever stuff like this.

I was wrong.

I first started to wonder about the Zucker and Cox stunt when a friend pointed out to me, after I had used the term C-S to demean someone, that I had just spouted a misogynistic and homophobic slur. I started to notice the use of the C-S phrase all around me and it made me increasingly uncomfortable. I stopped using it. I stopped thinking that the Zucker-Cox stunt was funny (while noticing the irony that the sexual innuendo they constructed was much more cited than their math), and I started to think about the implications of this sort of thing for my profession. How would one explain the Zucker-Cox result to kids? How would undergraduates write a term paper about it without sexual innuendo distracting from the math? How would one discuss the result, the actual math, with colleagues? What kind of environment emerges when misogynistic and homophobic language is not only tolerated in a field, but is a source of pride for the men who dominate it?

These questions have been on my mind this past week as I’ve considered the result of the NIPS conference naming deliberation. This conference was named in 1987 by founders who, as far as I understand, did not consider the sexual connotations (they dismissed the fact that the abbreviation is a racial slur since they considered it all but extinct). Regardless of original intentions **I write this post to lend my voice to those who are insisting that the conference change its name**. I do so for many reasons. I hear from many of my colleagues that they are deeply offended by the name. That is already reason enough. I do so because the phrase NIPS has been weaponized and is being used to demean and degrade women at one of the main annual machine learning conferences. I don’t make this claim lightly. Consider, for example, TITS 2017 (the (un)official sister event to NIPS). I’ve thought about this specific aggression a lot because in mathematics there is a mathematician by the name of Tits who has many important objects named after him (e.g. Tits buildings). So I have worked through the thought experiment of trying to understand why I think it’s wrong to name a conference NIPS but I’m fine talking about the mathematician Tits. I remember when I first learned of Tits buildings I was taken aback for a moment. But I learned to understand the name Tits as French and I pronounce it as such in my mind and with my voice when I use it. There is no problem there, nor is there a problem with many names that clash across cultures and languages. TITS 2017 is something completely different. It is a deliberate use of NIPS and TITS in a way that can and will make many women uncomfortable. As for NIPS itself perhaps there is *a* “solution” to interpreting the name that doesn’t involve a racial slur or sexual innuendo (Neural Information Processing Systems). Maybe some people see a rabbit. But others see a duck. All the “solutions” matter. 
The fact is many women are uncomfortable because instead of being respected as scientists, their bodies and looks have become a subtext for the science that is being discussed. This is a longstanding problem at NIPS (see e.g., Lenna). Furthermore, it’s not only women who are uncomfortable. I am uncomfortable with the NIPS name for the reasons I gave above, and I know many other men are as well. I’m not at ease at conferences where racial slurs and sexual innuendo are featured prominently, and if there are men who are (cf. NIPS poll data) then they should be ignored.

I think this is an extremely important issue not only for computer science, but for all of science. It’s about much more than a name of some conference. This is about recognizing centuries of discriminatory and exclusionary practices against women and minorities, and about eliminating such practices when they occur now rather than encouraging them. The NIPS conference must change its name. **#protestNIPS**

At first glance the enumeration of authorship orderings seems to be straightforward, namely that in a paper with *n* authors there are *n!* ways to order the authors. However this solution fails to account for designation of authors as “equal contributors”. For example, in the four author paper Structural origin of slow diffusion in protein folding, the first two authors contributed equally, and separately from that, so did the last two (as articulated via a designation of “co-corresponding” authorship). Another such example is the paper PRDM/Blimp1 downregulates expression of germinal center genes LMO2 and HGAL. Equal contribution designations can be more complex. In the recent preprint Connect-seq to superimpose molecular on anatomical neural circuit maps the first and second authors contributed equally, as did the third and fourth (though the equal contributions of the first and second authors was distinct from that of the third and fourth). Sometimes there are also more than two authors who contributed equally. In SeqVis: Visualization of compositional heterogeneity in large alignments of nucleotides the first eight authors contributed equally. A study on “equal contribution” designation in biomedical papers found that this type of designation is becoming increasingly common and can be associated with nearly every position in the byline.

To account for “equal contribution” groupings, I make the assumption that a set of authors who contributed equally must be consecutive in the authorship ordering. This assumption is certainly reasonable in the biological sciences given that there are two gradients of “contribution” (one from the front and one from the end of the authorship list), and that contributions for those in the end gradient are fundamentally distinct from those in the front. An authorship designation for a paper with *n* authors therefore consists of two separate parts: the *n!* ways to order the authors, and then the ways of designating groups of equal contribution for consecutive authors. The latter enumeration is simple: designation of equal authorship is in one-to-one correspondence with placement of dividers in the *n-1* gaps between the authors in the authorship list. In the extreme case of placement of no dividers the corresponding designation is that all authors contributed equally. Similarly, the placement of dividers between all consecutive pairs of authors corresponds to all contributions being distinct. Thus, the total number of authorship orderings/designations is given by $n! \cdot 2^{n-1}$. These numbers also enumerate **the number of ways to lace a shoe**. Other examples of objects whose enumeration results in these numbers are given in the Online Encyclopedia of Integer Sequences entry for this sequence (A002866). The first twenty numbers are:

1, 4, 24, 192, 1920, 23040, 322560, 5160960, 92897280, 1857945600, 40874803200, 980995276800, 25505877196800, 714164561510400, 21424936845312000, 685597979049984000, 23310331287699456000, 839171926357180416000, 31888533201572855808000, 1275541328062914232320000.

In the case of a paper with 60 authors, the number of ways to order authors and designate equal contribution is much larger than the number of atoms in the universe. Good luck with your next consortium project!
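The count is easy to reproduce. Below is a small sketch that multiplies the *n!* orderings by the 2^(n-1) ways of placing dividers in the gaps; the function name `authorship_designations` is made up for illustration, and the figure of 10^80 for the number of atoms in the universe is a commonly quoted rough estimate used here as an assumption.

```python
from math import factorial

def authorship_designations(n):
    """Orderings of n authors times the ways to place dividers in the n-1 gaps."""
    return factorial(n) * 2 ** (n - 1)

# Reproduce the first twenty terms listed above.
first_twenty = [authorship_designations(n) for n in range(1, 21)]
print(first_twenty)

# Compare the 60-author case with a rough estimate of the number of atoms
# in the universe (10**80 is an assumed order of magnitude, not a precise value).
sixty = authorship_designations(60)
print(sixty > 10**80)
```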

$e^{Q(2t)} = e^{Qt} e^{Qt}$.

This follows directly from the fact that the matrix $Qt$ commutes with itself, because for any two commuting matrices A and B

$e^{A+B} = e^{A} e^{B}$.

This means that substitutions over a time period *2t* are equivalently described as substitutions occurring over a time period *t*, followed by substitutions occurring afterwards over another time period *t*.

But what if over the course of time the rate matrix changes? For example, suppose that for a period of time *t* mutations proceed according to a rate matrix *Q*, and following that, for another period of time *t*, mutations proceed according to a rate matrix *R*? Is it true that the substitutions after time *2t* will behave as if mutations occurred for a time *2t* according to the (average) rate matrix $\frac{1}{2}(Q+R)$?

If *Q* and *R* commute the answer will be yes, as *Qt* and *Rt* will also be commutative and the multiplicativity property will hold. But what if *Q* and *R* don’t commute? Is there any relationship at all between $e^{(Q+R)t}$ and the matrices $e^{Qt}$ and $e^{Rt}$?

This week I visited Yale University to give a talk in the Center for Biomedical Data Science seminar series. I was invited by Smita Krishnaswamy, who organized a wonderful visit that included many interesting conversations not only in computational biology, but also applied math, computer science and statistics (Yale has strong programs in applied mathematics, statistics and data science, computer science and biostatistics). At dinner I learned from Dan Spielman of the Golden-Thompson inequality which provides a beautiful answer to the question above in the case where *Q* and *R* are symmetric. The theorem is a trace inequality for Hermitian matrices *A* and *B*:

$\mathrm{tr}\, e^{A+B} \leq \mathrm{tr}\left(e^{A} e^{B}\right)$.

This inequality is well known in statistical mechanics and random matrix theory but I don’t believe it is known in the phylogenetics community, hence this post. The phylogenetic interpretation of the pieces of the Golden-Thompson inequality (replacing *A* with *Qt* and *B* with *Rt*) is straightforward:

- The matrices $e^{Qt}$ and $e^{Rt}$ are substitution matrices for the rate matrices *Q* and *R* respectively.
- The product $e^{Qt} e^{Rt}$ is the substitution matrix corresponding to mutations occurring with rate matrix *Q* for time *t* followed by rate matrix *R* for time *t*.
- The matrix $e^{(Q+R)t}$ is the substitution matrix for mutations occurring with rate $\frac{1}{2}(Q+R)$ for time *2t*.
- Since the trace of a substitution matrix is (up to normalization by the number of states) the probability that there is no transition, or equivalently the probability that a change in nucleotide does not occur, the Golden-Thompson inequality states that for two symmetric rate matrices *Q* and *R*, the probability of a substitution after time *2t* is lower when mutations occur first at rate *Q* for time *t* and then at rate *R* for time *t*, than if they occur at rate $\frac{1}{2}(Q+R)$ for time *2t*.

In other words, **rate changes decrease the expected number of substitutions in comparison to what one would see if rates are constant**.
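The inequality can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the two symmetric 4×4 rate matrices (one fast exchange between a single pair of nucleotides, slow rates elsewhere) are invented for the purpose and chosen so that they do not commute, and the matrix exponential is a plain scaled Taylor series written out to keep the example dependency-free (in practice one would more likely call a library routine such as `scipy.linalg.expm`).

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_scale(A, c):
    return [[c * a for a in row] for row in A]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def expm(A, squarings=8, terms=20):
    """Matrix exponential: Taylor series of A / 2**squarings, then repeated squaring."""
    S = mat_scale(A, 1.0 / 2 ** squarings)
    result = identity(len(A))
    term = identity(len(A))
    for k in range(1, terms + 1):
        term = mat_scale(mat_mul(term, S), 1.0 / k)  # term is now S**k / k!
        result = mat_add(result, term)
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def symmetric_rate_matrix(fast_pair, fast=1.0, slow=0.1, n=4):
    """Hypothetical symmetric rate matrix: one fast exchange, slow rates elsewhere."""
    Q = [[slow] * n for _ in range(n)]
    i, j = fast_pair
    Q[i][j] = Q[j][i] = fast
    for k in range(n):
        Q[k][k] = 0.0
        Q[k][k] = -sum(Q[k])  # rows of a rate matrix sum to zero
    return Q

t = 1.0
Q = symmetric_rate_matrix((0, 1))
R = symmetric_rate_matrix((0, 2))

P_Q = expm(mat_scale(Q, t))  # substitution matrix for rate Q over time t
P_R = expm(mat_scale(R, t))  # substitution matrix for rate R over time t

lhs = trace(expm(mat_scale(mat_add(Q, R), t)))  # constant rate (Q+R)/2 for time 2t
rhs = trace(mat_mul(P_Q, P_R))                  # rate Q for time t, then R for time t
print(lhs, rhs)
```

For this pair the trace under the constant average rate comes out smaller than the trace under the rate change, as the inequality predicts; replacing *R* with *Q* recovers the multiplicativity property, where the two sides agree.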

The Golden-Thompson inequality was discovered independently by Sidney Golden and Colin Thompson in 1965. A proof is explained in an expository blog post by Terence Tao who heard of the Golden-Thompson inequality only eight years ago, which makes me feel a little bit better about not having heard of it until this week! It would be nice if there were a really simple proof but that appears not to be the case (there is a purported one page proof in a paper titled Golden-Thompson from Davis, however what is proved there is a different inequality, which can be shown, by virtue of another matrix trace inequality, to be a *weaker* inequality).

There is considerable interest in evolutionary biology in models that allow for time-varying rates of mutation, as there is substantial evidence of such variation. The Golden-Thompson inequality provides an additional insight for how mutation rate changes over time can affect naïve estimates based on homogeneity assumptions.

The Felsenstein hierarchy (from Algebraic Statistics for Computational Biology).

I counseled my colleagues not to participate in this ill-advised genome-wide association study. The phenotype was ill-defined and in any case the study would be underpowered (only 400 “geniuses” were solicited), but I believe many of them sent in their samples. As far as I know their DNA now languishes in one of Jonathan Rothberg’s freezers. No result has ever emerged from “Project Einstein”, and I’d pretty much forgotten about the ego-driven inquiries I had received years ago. Then, last week, I remembered them when reading a series of blog posts and associated commentary on *evolutionary biology* by some of the most distinguished *mathematicians* in the world.

**1. Sir Timothy Gowers is blogging about evolutionary biology?**

It turns out that mathematicians such as Timothy Gowers and Terence Tao are hosting discussions about evolutionary biology (see On the recently removed paper from the New York Journal of Mathematics, Has an uncomfortable truth been suppressed, Additional thoughts on the Ted Hill paper) because some mathematician wrote a paper titled “An Evolutionary Theory for the Variability Hypothesis”, and an ensuing publication kerfuffle has the mathematics community up in arms. I’ll get to that in a moment, but **first I want to focus on the scientific discourse in these elite math blogs.** If you scroll to the bottom of the blog posts you’ll see hundreds of comments, many written by eminent mathematicians who are engaged in pseudoscientific speculation littered with sexist tropes. The number of inane comments is astonishing. For example, in a comment on Timothy Gowers’ blog, Gabriel Nivasch, a lecturer at Ariel University, writes

“It’s also ironic that what causes so much controversy is not humans having descended from apes, which since Darwin people sort-of managed to swallow, but rather the relatively minor issue of differences between the sexes.”

This person’s understanding of the theory of evolution is where the Victorian public was at in England ca. 1871:

In mathematics, just a year later in 1872, Karl Weierstrass published what at the time was considered another monstrosity, one that threw the entire mathematics community into disarray. The result was just as counterintuitive for mathematics as Darwin’s theory of evolution was for biology. Weierstrass had constructed a function that is uniformly continuous on the real line, but not differentiable on any interval:

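$f(x) = \sum_{n=0}^{\infty} a^{n} \cos(b^{n} \pi x)$, where $0 < a < 1$, $b$ is a positive odd integer, and $ab > 1 + \frac{3}{2}\pi$.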

Not only does this construction remain as valid today as it was back then, but lots of mathematics has been developed in its wake. What is certain is that if one doesn’t understand the first thing about Weierstrass’ construction, e.g. one doesn’t know what a derivative is, one won’t be able to contribute meaningfully to modern research in analysis. With that in mind consider the level of ignorance of someone who does not even understand the notion of common ancestor in evolutionary biology, and who presumes that biologists have been idle and have learned nothing during the last 150 years. **Imagine the hubris of mathematicians spewing incoherent theories about sexual selection when they literally don’t know anything about human genetics or evolutionary biology, and haven’t read any of the relevant scientific literature about the subject they are rambling about**. You don’t have to imagine. Just go and read the Tao and Gowers blogs and the hundreds of comments they have accrued over the past few days.

**2. Hijacking a journal**

To understand what is going on requires an introduction to Igor Rivin, a professor of mathematics at Temple University and, of relevance in this mathematics matter, an editor of the New York Journal of Mathematics (NYJM) **[Update November 21, 2018: Igor Rivin is no longer an editor of NYJM]**. Last year Rivin invited the author of a paper on the variability hypothesis to submit his work to NYJM. He solicited two reviews and published it in the journal. For a mathematics paper such a process is standard practice at NYJM, but in this case **the facts point to** **Igor Rivin hijacking the editorial process to advance a sexist agenda.** To wit:

- The paper in question, “An Evolutionary Theory for the Variability Hypothesis” is not a mathematics or biology paper but rather a sexist opinion piece. As such it was not suitable for publication in any mathematics or biology journal, let alone in the NYJM which is a venue for publication of pure mathematics.
- Editor Igor Rivin did not understand the topic and therefore had no business soliciting or handling review of the paper.
- The “reviewers” of the paper were not experts in the relevant mathematics or biology.

To elaborate on these points I begin with a brief history of the variability hypothesis. Its origin is Darwin’s 1871 book “The Descent of Man and Selection in Relation to Sex” which was ostensibly the beginning of the study of sexual selection. However as explained in Stephanie Shields’ excellent review, while the variability hypothesis started out as a hypothesis about variance in physical and intellectual traits, at the turn of the 20th century it morphed to a specific statement about sex differences in intelligence. I will not, in this blog post, attempt to review the entire field of sexual selection nor will I discuss in detail the breadth of work on the variability hypothesis. But there are three important points to glean from the Shields review: 1. The variability hypothesis is about intellectual differences between men and women and in fact this is what “An evolutionary theory for the variability hypothesis” tries really hard to get across. Specifically, that the best mathematicians are males because of biology. 2. There has been dispute for over a century about the extent of differences, should they even exist, and 3. Naïve attempts at modeling sexual selection are seriously flawed and completely unrealistic. For example naïve models that assume the same genetic mechanism produces both high IQ and mental deficits are ignoring ample evidence to the contrary.

Insofar as *modeling* of sexual selection is concerned, there was already statistical work in the area by Karl Pearson in 1895 (see “Note on regression and inheritance in the case of two parents”). In the paper Pearson explicitly considers the sex-specific variance of traits and the relationship of said variance to heritability. However as with much of population genetics, it was Ronald Fisher, first in the 1930s (Fisher’s principle) and then later in important work from 1958 on what is now referred to as Darwin-Fisher theory (see, e.g. Kirkpatrick, Price and Arnold 1990), who significantly advanced the theory of sexual selection. Amazingly, despite including 51 citations in the final arXiv version of “An Evolutionary Theory for the Variability Hypothesis”, there isn’t a single reference to prior work in the area. I believe the author was completely unaware of the 150 years of work by biologists, statisticians, and mathematical biologists in the field.

What is cited in “An Evolutionary Theory for the Variability Hypothesis”? There is an inordinate amount of cherry picking of quotes from papers to bolster the message the author is intent on getting across: that there are sex-differences in variance of intelligence (whatever that means), specifically males are more variable. The arXiv posting has undergone eight revisions, and somewhere among these revisions there is even a brief cameo by Lawrence Summers and a regurgitation of his infamous sexist remarks. One of the thorough papers reviewing evidence for such claims is “The science of sex differences in science and mathematics” by Halpern *et al.* 2007. The author cherry picks a quote from the abstract of that paper, namely that “the reasons why males are often more variable remain elusive.” and follows it with a question posed by statistician Howard Wainer that implicitly makes a claim: “Why was our genetic structure built to yield greater variation among males than females?” An actual reading of the Halpern *et al.* paper reveals that the excess of males in the top tail of the distribution of quantitative reasoning has dramatically decreased during the last few decades, an observation that cannot be explained by genetics. Furthermore, *females *have a *greater* variability in reading and writing than males*.* They point out that these findings “run counter to the usual conclusion that males are more variable in all cognitive ability domains”. The author of “An Evolutionary Theory for the Variability Hypothesis” conveniently omits this from a **very** short section titled “Primary Analyses Inconsistent with the Greater Male Variability Hypothesis.” This is serious amateur time.

One of the commenters on Terence Tao’s blog explained that the mathematical theory in “An Evolutionary Theory for the Variability Hypothesis” is “obviously true”, and explained its premise for the layman:

It’s assumed that women only pick the “best” – according to some quantity X percent of men as partners where X is (much) smaller than 50, let’s assume. On the contrary, men are OK to date women from the best Y percent where Y is above 50 or at least greater than X.

Let’s go with this for a second, but think about how this premise would have to change to be consistent with results for *reading and writing *(where variance is higher in females). Then we must go with the following premise for everything to work out:

It’s assumed that *men* only pick the “best” – according to some quantity X percent of *women* as partners where X is (much) smaller than 50, let’s assume. On the contrary, *women* are OK to date *men* from the best Y percent where Y is above 50 or at least greater than X.

Perhaps I should write this up (citing only studies on reading and writing) and send it to Igor Rivin, editor at the New York Journal of Mathematics, as *my* explanation for *my* greater variability hypothesis?

Actually, I hope that will not be possible. **Igor Rivin should be immediately removed from the editorial board of the New York Journal of Mathematics**. I looked up Rivin’s credentials in terms of handling a paper in mathematical biology. Rivin has an impressive publication list, mostly in geometry but also a handful of publications in other areas. He, and separately Mary Rees, are known for showing that the number of simple closed geodesics of length at most *L* grows polynomially in *L* (this result was the beginning of some of the impressive results of Maryam Mirzakhani who went much further and subsequently won the Fields Medal for her work). Nowhere among Rivin’s publications, or in many of his talks which are online, or in his extensive online writings (on Twitter, Facebook etc.) is there any evidence that he has a shred of knowledge about evolutionary biology. The fact that he accepted a paper that is completely untethered from the field in which it purports to make an advance is further evidence of his ignorance.

Ignorance is one thing but hijacking a journal for a sexist agenda is another. Last year I encountered a Facebook thread on which Rivin had commented in response to a BuzzFeed article titled A Former Student Says UC Berkeley’s Star Philosophy Professor Groped Her and Watched Porn at Work. It discussed a lawsuit alleging that John Searle had sexually harassed, assaulted and retaliated against a former student and employee. While working for Searle the student was paid $1,000 a month with an additional $3,000 for being his assistant. On the Facebook thread Igor Rivin wrote

Here is an editor of the NYJM suggesting that a student should have effectively known that if she was paid $36K/year for work as an assistant of a professor (not a high salary for such work), she ought to expect sexual harassment and sexual assault as part of her job. Her LinkedIn profile (which he linked to) showed her to have worked a summer in litigation. So he was essentially saying that this victim prostituted herself with the intent of benefiting financially via suing John Searle. Below is, thankfully, a quick and stern rebuke from a professor of mathematics at Indiana University:

I mention this because it shows that Igor Rivin has a documented history of misogyny. Thus his acceptance of a paper providing a “theory” for “higher general intelligence” in males, a paper in an area he knows nothing about to a journal in pure mathematics is nothing other than **hijacking the editorial process of the journal to further a sexist agenda.**

How did he actually do it? He solicited a paper that had been rejected elsewhere, and sent it out for review to two reviewers who turned it around in 3 weeks. I mentioned above that the “reviewers” of the paper were not experts in the relevant mathematics or biology. This is clear from an examination of the version of the paper that the NYJM accepted. The 51 references were reduced to 11 (one of them is to the author’s preprint). None of the remaining 10 references cite any relevant prior work in evolutionary biology on sexual selection. The fundamental flaws of the paper remain unaddressed. The entire content of the reviews was presumably something along the lines of “please tone down some of the blatant sexism in the paper by removing 40 gratuitous references”. In defending the three week turnaround Rivin wrote (on Gowers’ blog) “Three weeks: I assume you have read the paper, if so, you will have found that it is quite short and does not require a huge amount of background.” Since when does a mathematician judge the complexity of reviewing a paper by its length? I took a look at Rivin’s publications; many of them are very short. Consider for example “On geometry of convex ideal polyhedra in hyperbolic 3-space”. The paper is 5 pages with 3 references. It was received 15 October 1990 and in revised form 27 January 1992. Also excuse me, but if one thinks that a mathematical biology paper “does not require a huge amount of background” then one simply doesn’t know any mathematical biology.

**3. Time for mathematicians to wet their paws**

The irony of mathematicians who believe they are in the high end tail of some ill-specified distribution of intelligence demonstrating en masse that they are idiots is not lost on those of us who actually work in mathematics and biology. Gian-Carlo Rota’s ghost can be heard screaming from Vigevano “**The lack of real contact between mathematics and biology is either a tragedy, a scandal, or a challenge, it is hard to decide which!!**” I’ve spent the past 15 years of my career focusing on Rota’s call to address the challenge of making more contacts between mathematics and biology. The two cultures are sometimes far apart but the potential for both fields, if there is real contact, is tremendous. Not only can mathematics lead to breakthroughs in biology, biology can also lead to new theorems in mathematics. In response to incoherent rambling about genetics on Gowers’ blog, Noah Snyder, a math professor at Indiana University gave sage advice:

I really wish you wouldn’t do this. A bunch of mathematicians speculating about stuff they know nothing about is not a good way to get to the truth. If you really want to do some modeling of evolutionary biology, then find some experts to collaborate or at least spend a year learning some background.

What he is saying is די קאַץ האָט ליב פֿיש אָבער זי װיל ניט די פֿיס אײַננעצן (the cat likes fish but she doesn’t want to wet her paws). If you’re a mathematician who is interested in questions of evolutionary biology, **great!** But first you *must* get your paws wet. If you refuse to do so then you can do real harm. It might be tempting to imagine that mathematics is divorced from reality and has no impact or influence on the world, but nothing could be farther from the truth. **Mathematics matters**. In the case discussed in this blog post, the underlying subtext is pervasive sexism and misogyny in the mathematics profession, and if this sham paper on the variance hypothesis had gotten the stamp of approval of a journal as respected as NYJM, real harm could have been done to women in mathematics and to women who in the future may have chosen to study mathematics. It’s no different than the case of Andrew Wakefield’s paper in *The Lancet* implying a link between vaccinations and autism. By the time of the retraction (twelve years after publication of the article, in 2010), the paper had significantly damaged public health, and even today its effects, namely death as a result of reduced vaccination, continue to be felt. It’s not good enough to say:

*“Once the rockets are up,
who cares where they come down?
That’s not my department,”
says Wernher von Braun.*

- Fill in the blank in the sequence 1, 4, 9, 16, 25, __ , 49, 64, 81.
- What number comes next in the sequence 1, 1, 2, 3, 5, 8, 13, .. ?

Please stop and think about these questions before proceeding. **Spoiler alert**: the blog post reveals the answers.

First, don’t feel bad if it took you a while to answer these questions. In fact, if you didn’t answer them at all that’s a very good thing and I commend you (more on this later). But if you do have answers in mind, you can check them now. The answer key:

- **72**
- **12**

The pattern that explains why the answer to the first question is 72 may be subtle, but it’s simple. If the *n*th number in the sequence is *f(n)*, then it’s clear that *f(n)* is just the period of the sequence *b(m)* defined by .

Now I know what you’re thinking. You’re a so-called high intelligence quotient person and you *know* the answer is different. You think the pattern is just the squares 1², 2², 3², 4², 5², 6², 7², 8², 9² and that the blank should therefore be filled in with the number 6² = 36. You think that it makes more sense because it’s a “simpler” pattern that explains the data. You are planning to comment on this post and you will invoke Occam’s razor. You will say this problem has appeared on many IQ tests with the answer 36 so that is what is *expected* in terms of an answer. You will explain that therefore it *is* the answer.

But what is a “simple” pattern? Let’s look at the second question. Here the *simplest* pattern that can explain the data is just that each number is the sum of the digits of the previous two numbers. So the next number in the sequence is **12** (=1+3+8). Yes, twelve. You shouldn’t be surprised that you got this wrong as well, and I know you did. Addition of digits of numbers is a common ingredient in sequence patterns on IQ tests. For example the problem of finding the next number in the sequence 58, 26, 16, 14, .. is a similar, albeit more difficult, IQ test question that is analogous to my #2. But I know that you might have something else in mind. You’re thinking Fibonacci. You’re thinking there was a typo and 12 should have been 21. But I’m sorry.. no, no, no, **no**. An IQ test is not a math test. It’s about finding patterns, it is a test of *raw* intelligence.
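The digit-sum rule is easy to check by machine. The sketch below is only an illustration (the function names are made up for this example): it regenerates the listed terms from the seed 1, 1 and shows where the rule parts ways with Fibonacci.

```python
def digit_sum(n):
    """Sum of the decimal digits of a non-negative integer."""
    return sum(int(d) for d in str(n))

def digit_sum_sequence(length, first=1, second=1):
    """Each term is the sum of the digits of the previous two terms."""
    seq = [first, second]
    while len(seq) < length:
        seq.append(digit_sum(seq[-2]) + digit_sum(seq[-1]))
    return seq[:length]

terms = digit_sum_sequence(8)
print(terms)  # the first seven terms agree with Fibonacci, the eighth does not
```

The two rules agree for as long as every term has a single digit, which is exactly why seven terms cannot tell them apart.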

What has happened here is that by merely asking you these two questions, **I’ve forced you to overfit**. There simply isn’t a meaningful way to choose a pattern from an enormous set of possibilities using only a handful of numbers. Yet this is exactly what I asked you to do, and what IQ tests ask one to do. They force one to overfit. **You know what one should call a test that encourages poor statistics hygiene? A “statistics deficit test” (SD test) instead of an “intelligence quotient test”.**

The mathematician Richard Guy uses many alliterations to describe the horror of overfitting from a handful of numbers:

Superficial similarities spawn spurious statements.

Capricious coincidences cause careless conjectures.

Early exceptions eclipse eventual essentials.

Initial irregularities inhibit incisive intuition.

These all capture the point that, as Guy says, “there aren’t enough small numbers to meet the many demands made of them”. For many good examples see his paper:

Richard K. Guy, The strong law of small numbers, The American Mathematical Monthly, 1988.

My favorite example of a pattern that is not what it appears to be is the sequence 1,1,1,1,1,1,1,1,1,1,1,.. continued for a total of 8424432925592889329288197322308900672459420460792432 times, and then followed by the number 8936582237915716659950962253358945635793453256935559. To understand this sequence one only has to identify what is an extremely obvious pattern. The *n*th term of the sequence is the greatest common divisor of the values of two polynomials: *n*^{17}+9 and (*n*+1)^{17}+9. The greatest common divisor is one for *n* up to 8424432925592889329288197322308900672459420460792432. Then the 8424432925592889329288197322308900672459420460792433rd greatest common divisor is 8936582237915716659950962253358945635793453256935559. Ok, maybe not so obvious and a somewhat unusual construction for a pattern. However this example makes another point. There cannot be certainty to the “solutions” on an IQ test, so that the term “answer” is a misnomer and its use is problematic. If a test asked for the 8424432925592889329288197322308900672459420460792433rd term of the sequence of ones above it is *likely* that it’s a one, but one cannot know for sure. Another way to say this is that IQ tests are implicitly asking test takers to accept null hypotheses instead of asking, at most, for a rejection.
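Python integers have arbitrary precision, so the claim can be probed directly. This is a minimal sketch: `term` is a name invented here, and the large index is simply copied from the paragraph above.

```python
from math import gcd

def term(n):
    """n-th term of the sequence: gcd(n^17 + 9, (n + 1)^17 + 9)."""
    return gcd(n**17 + 9, (n + 1)**17 + 9)

# The first few thousand terms are all equal to one.
small_terms = {term(n) for n in range(1, 5000)}
print(small_terms)

# Index quoted in the text as the first term that is not one.
FIRST_EXCEPTION = 8424432925592889329288197322308900672459420460792433
print(term(FIRST_EXCEPTION))
```

No amount of checking small cases would ever reveal the exception; only evaluating the term at the quoted index does.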

There has been much debate on what IQ tests really measure and what they predict. But it seems to me that the answer is obvious. In order to score well on IQ tests one must be willing to overfit and to practice the p-value fallacy. Since such practices are synonymous with poor data science, I hypothesize that

**low IQ scores predict excellence in data science.**


Jase Gehring, Jeff Park, Sisi Chen, Matt Thomson, and Lior Pachter, Highly Multiplexed Single-Cell RNA-seq for Defining Cell Population and Transcriptional Spaces, bioRxiv, 2018.

The paper offers some insights into the benefits of multiplex single-cell RNA-Seq, a molecular implementation of information multiplexing. The paper also reflects the benefits of a multiplex lab, and the project came about thanks to Jase Gehring, a multiplex molecular biologist/computational biologist in my lab.

mult·i·plex

/ˈməltəˌpleks/

adjective – consisting of many elements in a complex relationship.

– involving simultaneous transmission of several messages along a single channel of communication.

**Conceptually**, Jase’s work presents a method for chemically labeling cells from multiple samples with DNA nucleotides so that samples can be pooled prior to single-cell RNA-Seq, yet cells can subsequently be associated with their samples of origin after sequencing. This is achieved by labeling all cells from a sample with DNA that is unique to that sample; in the figure below colors are used to represent the different DNA tags that are used for each sample:

This is analogous to the barcoding of *transcripts* in single-cell RNA-Seq, that allows for transcripts from the same cell of origin to be associated with each other, yet in this framework there is an additional layer of barcoding of *cells*.

The tagging **mechanism** is a click chemistry one-pot, two-step reaction in which cell samples are exposed to methyltetrazine-activated DNA (MTZ-DNA) oligos as well as the amine-reactive cross-linker NHS-*trans*-cyclooctene (NHS-TCO). The NHS functionalized oligos are formed *in situ* by reaction of methyltetrazine with *trans*-cyclooctene (the inverse-electron demand Diels-Alder (IEDDA) reaction). Nucleophilic amines present on all proteins, but not nucleic acids, attack the *in situ*-formed NHS-DNA, chemoprecipitating the functionalized oligos directly onto the cells:

MTZ-DNAs are made by activating 5′-amine modified oligos with NHS-MTZ for the IEDDA reaction, and they are designed with a PCR primer, a cell tag (a unique “barcode” sequence) and a poly-A tract so that they can be captured by poly-T during single-cell RNA-Seq:

Such oligos can be readily ordered from IDT. We are careful to refer to the identifying sequences in these oligos as cell *tags* rather than barcodes so as not to confuse them with cell *barcodes*, which are used in single-cell RNA-Seq to associate transcripts with cells.

The **process** of sample tagging for single-cell RNA-Seq is illustrated in the figure below. It shows how the tags, appearing as synthetic “transcripts” in cells, are captured during 3′ based microfluidic single-cell RNA-Seq and are subsequently deciphered by sequencing a tag library alongside the cDNA library:

The **significance** of multiplexing is manifold. First, by labeling cells prior to performing single-cell RNA-Seq,

This is one of the largest (in terms of samples) single-cell RNA-Seq experiments to date: a 100-fold decrease in the number of cells we collected per sample allowed us to perform an experiment with 100x more samples. Without multiplexing, an experiment that cost us ~$7,000 would cost a few hundred thousand dollars, well outside the scope of what is possible in a typical lab. We certainly would not have been able to perform the experiment without multiplexing. Although the cost tradeoff is impactful, there are many other important implications of multiplexing as well:

- Whereas simplex single-cell RNA-Seq is descriptive, focusing on *what* is in a single sample, multiplex single-cell RNA-Seq allows for asking *how*? For example how do cell states change in response to perturbations? How does disease affect cell state and type?
- Simplex single-cell RNA-Seq leads to systematics arguments about clustering: when do cells that cluster together constitute a “cell type”? How many clusters are real? How should clustering be performed? Multiplex single-cell RNA-Seq provides an approach to assigning significance to clusters via their association with samples. In our paper, we specifically utilized sample identification to determine the parameters/thresholds for the clustering algorithm: On the left hand side is a t-SNE plot labeled by different samples, and on the right hand side *de novo* clusters. The experiment allowed us to confirm the functional significance of a cluster as a cell state resulting from a specific range of perturbation conditions.
- Multiplexing reduces batch effect, and also makes possible the procurement of more replicates in experiments, an important aspect of single-cell RNA-Seq as noted by Hicks *et al.* 2017.
- Multiplexing has numerous other benefits, e.g. allowing for the detection of doublets and their removal prior to analysis. This useful observation of Stoeckius *et al.* makes possible higher-throughput single-cell RNA-Seq. We also found an intriguing relationship between tag abundance and cell size. Both of these phenomena are illustrated in one supplementary figure of our paper that I’m particularly fond of:

It shows a multiplexing experiment in which 8 different samples have been pooled together. Two of these samples are human-only samples, and two are mouse-only. The remaining four are samples in which human and mouse cells have been mixed together (with 2, 3, 4 and 5 tags being used for each sample respectively). The t-SNE plot is made from the *tag counts*, which is why the samples are neatly separated into 8 clusters. However in Panel b, the cells are colored by their cDNA content (human, mouse, or both). The pure samples are readily identifiable, as are the mixed samples. Cell doublets (purple) can be easily identified and therefore removed from analysis. The relationship between cell size and tag abundance is shown in Panel d. For a given sample with both human and mouse cells (bottom row), human cells give consistently higher sample tag counts. Along with all of this, the figure shows we are able to label a sample with 5 tags, which means that using only 20 oligos (this is how many we worked with for all of our experiments) it is possible to label $\binom{20}{5} = 15{,}504$ samples.
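The combinatorics behind labeling with several tags at once can be sketched in a couple of lines. This assumes that each sample is identified by a distinct unordered set of exactly `tags_per_sample` oligos drawn from the pool, which is my reading of the scheme rather than a statement of the paper's protocol.

```python
from math import comb

def labelable_samples(num_oligos, tags_per_sample):
    """Number of distinct unordered tag sets of a fixed size."""
    return comb(num_oligos, tags_per_sample)

# 20 oligos, 5 tags per sample
print(labelable_samples(20, 5))
# for comparison, one tag per sample gives only as many labels as oligos
print(labelable_samples(20, 1))
```

Under this assumption the number of labels grows combinatorially with the pool size, while the number of oligos that have to be ordered grows only linearly.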

- Thinking about hundreds (and soon thousands) of single-cell experiments is going to be complicated. The cell-gene matrix that is the fundamental object of study in single-cell RNA-Seq extends to a cell-gene-sample tensor. While more complicated, there is an opportunity for novel analysis paradigms to be developed. A hint of this is evident in our visualization of the *samples* by projecting the sample-cluster matrix. Specifically, the matrix below shows which *clusters* are represented within each *sample*, and the matrix is quantitative in the sense that the magnitude of each entry represents the relative abundance of cells in a sample occupying a given cluster:

A three-dimensional PCA of this matrix reveals interesting structure in the experiment. Here each point is an entire *sample*, not a cell, and one can see how changes in factors move samples in “experiment space”:
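A sample-cluster matrix of the kind described here is simple to build from per-cell assignments. The following is a toy sketch with hypothetical sample and cluster names, not the code used for the paper: each row is a sample, each column a cluster, and each entry is the fraction of that sample's cells falling in that cluster.

```python
from collections import Counter, defaultdict

def sample_cluster_matrix(assignments):
    """Build row-normalized cluster abundances per sample.

    assignments: iterable of (sample, cluster) pairs, one pair per cell.
    Returns (cluster_order, {sample: [fraction of cells in each cluster]}).
    """
    counts = defaultdict(Counter)
    clusters = set()
    for sample, cluster in assignments:
        counts[sample][cluster] += 1
        clusters.add(cluster)
    cluster_order = sorted(clusters)
    matrix = {}
    for sample, cluster_counts in counts.items():
        total = sum(cluster_counts.values())
        matrix[sample] = [cluster_counts[c] / total for c in cluster_order]
    return cluster_order, matrix

# Hypothetical cells: (sample of origin from the tags, de novo cluster)
cells = [("s1", 0), ("s1", 0), ("s1", 1), ("s2", 1)]
order, matrix = sample_cluster_matrix(cells)
print(order, matrix)
```

The rows of such a matrix are the points that a PCA would then project, so that each point represents a sample rather than a cell.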

As experiments become even more complicated, and single-cell assays become increasingly multimodal (including not only RNA-Seq but also protein measurements, methylation data, etc.) development of a coherent mathematical framework for single-cell genomics will be central to interpreting the data. As Dueck *et al.* 2015 point out, such analysis is likely to not only be mathematically interesting, but also *functionally* important.

We aren’t the only group thinking about sample multiplexing for single-cell RNA-Seq. The “demuxlet” method by Kang *et al.* 2017 is an *in silico* approach based on multiplexing from genomic variation. Kang *et al.* show that if pooled samples are genetically heterogeneous, genotype data can be used to separate samples, providing an effective solution for multiplexing single-cell RNA-Seq in large human studies. However demuxlet has limitations, for example it cannot be used for samples from a homogeneous genetic background. Two papers at the end of last year develop an epitope labeling strategy for multiplexing: Stoeckius *et al.* 2017 and Peterson *et al.* 2017. While epitope labeling provides additional information that can be of interest, our method is more universal in that it can be used to multiplex any kind of samples, even from different organisms (a point we make with the species mixing multiplex experiment I described above). The approaches are also not exclusive; epitope labeling could be coupled to a live cell DNA tagging multiplex experiment, allowing for the same epitopes to be assayed together in different samples. Finally, our click chemistry approach is fast, cheap and convenient, immediately providing multiplex capability for thousands, or even hundreds of thousands of samples.

One interesting aspect of Jase’s multiplexing paper is that the project it describes was itself a multiplexing experiment of sorts. The origins of the experiment date to 2005 when I was awarded tenure in the mathematics department at UC Berkeley. As is customary after tenure trauma, I went on sabbatical for a year, and I used that time to ponder career related questions that one is typically too busy for. Questions I remember thinking about: Why exactly did I become a computational biologist? Was a mathematics department the ideal home for me? Should I be more deeply engaged with biologists? Were the computational biology papers I’d been writing meaningful? What is computational biology anyway?

In 2008, partly as a result of my sabbatical rumination but mostly thanks to the encouragement and support of Jasper Rine, I changed the structure of my appointment and joined the UC Berkeley Molecular and Cell Biology (MCB) department (50%). A year later, I responded to a call by then Dean Mark Schlissel and requested wet lab space in what was to become the Li Ka Shing Center at UC Berkeley. This was not a rash decision. After working with Cole Trapnell on RNA-Seq I’d come to the conclusion that a small wet lab would be ideal for our group to better learn the details of the technologies we were working on, and I felt that practicing them ourselves would ultimately be the best way to arrive at meaningful (computational) methods contributions. I’d also visited David Haussler’s wet lab where I met Jason Underwood who was working on FragSeq at the time. I was impressed with his work and what I saw were important benefits of real contact between wet and dry, experiment and computation.

In 2011 I was delighted to move into my new wet lab. The decision to give me a few benches was a bold and unexpected one, spearheaded by Mark Schlissel, but also supported by a committee he formed to decide on the make up of the building. I am especially grateful to John Ngai, Art Reingold and Randy Schekman for their help. However I was in a strange position starting a wet lab as a tenured professor. On the one hand the security of tenure provided some reassurance that a failure in the wet lab would not immediately translate to a failure of career. On the other hand, I had no startup funds to buy all the basic infrastructure necessary to run a lab. CIRM, Mark Schlissel, and later other senior faculty in Molecular & Cell Biology at UC Berkeley, stepped in to provide me with the basics: a -80 and -20, access to a shared cold room, a Bioanalyzer (to be shared with others in the building), and a thermocycler. I bought some other basic equipment but **the most important piece** was the recruitment of my first MCB graduate student: Shannon Hateley. Shannon and I agreed that she would set up the lab and also be lab manager, while I would supervise purchasing and other organizational lab matters. I obtained informed consent from Shannon prior to her joining my lab, for what would be a monumental effort requested of her. We also agreed she would be co-advised by another molecular biologist “just in case”.

With Shannon’s work and then my second molecular biology student, Lorian Schaeffer, the lab officially became multiplexed. Jase, who initiated and developed not only the molecular biology but also the computational biology of Gehring *et al.* 2018, is the latest experimentalist to multiplex in our group. However some of the mathematicians now multiplex as well. This has been a boon to the research of the group and I see Jase’s paper as fruit that has grown from the diversity in the lab. Moving forward, I see increasing use of mathematics ideas in the development of novel molecular biology. For example, current single-cell RNA-Seq multiplexing is a form of information multiplexing that is trivial in comparison to the multiplexing ideas from information theory; the achievements are in the molecular implementations, but in the future I foresee much more of a blur between wet and dry and increasingly sophisticated mathematical ideas being implemented with molecular biology.

Hedy Lamarr, the mother of multiplexing.

“Whenever you interview fat people, you feel bad, because you know you’re not going to hire them”

“Japan should be bombed for dragging its feet on supporting the Human Genome Project”

“I’m not a racist in a conventional way”

“[The] historic curse of the Irish.. is not alcohol, it’s not stupidity.. it’s ignorance”

“Women are supposedly bad at three dimensions”

“[Rosalind Franklin] couldn’t think in three dimensions very well”

“[Rosalind Franklin] had Aspergers”

“People ask about [Rosalind Franklin] and I always say ‘autism’”

“[Francis Crick] may have been a bit autistic”

“Men are a bit strange and their strangest quality is their ability to understand mathematics”

“[Rosalind] Franklin couldn’t do maths”

“Indians in [my] experience [are] servile.. because of selection under the caste system”

“People who have to deal with black employees find [that they are equal] not true”

“[As a female scientist] you won’t be taken seriously if you have children”

“[Linus Pauling] was probably always half-insane”

“Anyone who would hire an ecologist is out of his mind”

“[Rosalind Franklin] was a loser”

“Disabled individuals are genetic losers”

“[With IVF] all hell will break loose, politically and morally, all over the world”

“If we knew our son would develop schizophrenia, we wouldn’t have had him”

“My former colleagues are pinkos and shits”

“[X University]- it used to be such a wonderful place. And then they started admitting women!”

“Catholics are more likely to forgive than Jews”


- The simple technique of logistic regression, by taking advantage of the large number of cells assayed in single-cell RNA-Seq experiments, is much more effective than current approaches at identifying marker genes for clusters of cells.
- The simplest single-cell RNA-Seq data, namely 3′ single-end reads produced by technologies such as Drop-Seq or 10X, can distinguish isoforms of genes.
- The simple idea of GDE provides a unified perspective on DGE, DTU and DTE.

These simple, simple and simple ideas are so obvious that *of course* anyone could have discovered them, and one might be tempted to go so far as to say that even if people didn’t explicitly write them down, they were *basically* already known. After all, logistic regression was published by David Cox in 1958, and who didn’t know that there are many 3′ unannotated UTRs in the human genome? As for DGE, DTU and DTE (and DTE->G and DTE+G) I mean who *doesn’t* get these basic concepts? Indeed, after reading our paper someone remarked that one of the key results “was already known”, presumably because the successful application of logistic regression as a gene differential expression method for single-cell RNA-Seq follows from the fact that Šidák aggregation fails for differential gene expression in bulk RNA-Seq.

The “was already known” comment reminded me of a recent blog post about the dirty secret of mathematics. In the post, the author begins with the following math problem: Without taking your pencil off the paper/screen, can you draw four straight lines that go through the middle of all of the dots?

The problem may not yield immediately (try it!) but the solution is obvious once presented. This is a case of the solution requiring a bit of out-of-the-box thinking, leading to a perspective on the problem that is obvious in retrospect. In the Ntranos, Yi *et al.* paper, the change in perspective was the realization that “Instead of the traditional approach of using the cell labels as covariates for gene expression, logistic regression incorporates transcript quantifications as covariates for cell labels”. It’s no surprise the “was already known” reaction reared its head in this case. It’s easy to convince oneself, after the fact, that the “obvious” idea was in one’s head all along.
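To make the change in perspective concrete, here is a toy sketch in which the transcript quantifications of one two-isoform gene are the covariates and the cell label is the response. It is an illustration under my own assumptions (made-up abundances, plain gradient ascent, a likelihood ratio statistic against an intercept-only model), not the implementation from the paper.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, step=0.05, iterations=5000):
    """Fit weights and intercept by gradient ascent on the log-likelihood."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(iterations):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = yi - sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b += step * grad_b / n
        w = [wj + step * gj / n for wj, gj in zip(w, grad_w)]
    return w, b

def log_likelihood(X, y, w, b):
    """Bernoulli log-likelihood of the labels under the fitted model."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

# Hypothetical abundances of (isoform 1, isoform 2) in 8 cells,
# and a 0/1 label saying whether each cell is in the cluster of interest.
X = [[5, 1], [6, 2], [4, 1], [5, 3], [2, 4], [1, 5], [2, 6], [3, 4]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = fit_logistic(X, y)
ll_full = log_likelihood(X, y, w, b)

# Intercept-only null model: its maximum likelihood estimate is the label mean.
p0 = sum(y) / len(y)
ll_null = sum(math.log(p0) if yi else math.log(1.0 - p0) for yi in y)

lr_statistic = 2.0 * (ll_full - ll_null)
print(w, b, lr_statistic)
```

Because all isoforms of the gene enter the model together, a single test asks whether the gene, through any combination of its transcripts, helps predict the label.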

The egg of Columbus is an apocryphal tale about ideas that seem trivial after the fact. The story originates from the book “History of the New World” by Girolamo Benzoni, who wrote that Columbus, upon being told that his journey to the West Indies was unremarkable and that Spain “would not have been devoid of a man who would have attempted the same” had he not undertaken the journey, replied

“Gentlemen, I will lay a wager with any of you, that you will not make this egg stand up as I will, naked and without anything at all.” They all tried, and no one succeeded in making it stand up. When the egg came round to the hands of Columbus, by beating it down on the table he fixed it, having thus crushed a little of one end.

The story makes a good point. Discovery of the Caribbean in the 6th millennium BC was certainly not a trivial accomplishment even if it was obvious after the fact. The egg trick, which Columbus would have learned from the Amerindians who first brought chickens to the Americas, is a good metaphor for the discovery.

There are many Amerindian eggs in mathematics, which has its own apocryphal story to make the point: A professor proving a theorem during a lecture pauses to remark that “it is obvious that…”, upon which she is interrupted by a student asking if that’s truly the case. The professor runs out of the classroom to a nearby office, returning after several minutes with a notepad filled with equations to exclaim “Why *yes*, it *is* obvious!” But even first-rate mathematicians can struggle to accept Amerindian eggs as worthy contributions, frequently succumbing to the temptation of dismissing others’ work as obvious. One of my former graduate school mentors was G.W. Peck, a math professor who created a pseudonym for the express purpose of publishing his Amerindian eggs in a way that would reduce unintended embarrassment for those whose work he was improving on in “trivial ways”. G.W. Peck has an impressive publication record.

Bioinformatics is not very different from mathematics; the literature is populated with many Amerindian eggs. My favorite example is the Smith-Waterman algorithm, an algorithm for local alignment published by Temple Smith and Michael Waterman in 1981. The Smith-Waterman algorithm is a simple modification of the Needleman-Wunsch algorithm:

The table above shows the differences. **That’s it!** This table made for a (highly cited) paper. Just initialize the Needleman-Wunsch algorithm with zeroes instead of a gap penalty, set negative scores to 0, trace back from the highest score. In fact, it’s such a minor modification that when I first learned the details of the algorithm I thought “This is obvious! After all, it’s *just* the Needleman-Wunsch algorithm. Why does it even have a name?! Smith and Waterman got a highly cited paper?! For *this?!*” My skepticism lasted only as long as it took me to discover and read Peter Sellers’ 1980 paper attempting to solve the same problem. It’s a lot more complicated, relying on the idea of “inductive steps”, and requires untangling mysterious diagrams such as:
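How small the modification is becomes clear when both algorithms are written as one function with a flag. This is a minimal score-only sketch (no traceback), with a linear gap penalty and scoring values that I chose for illustration.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Needleman-Wunsch (global) score, or Smith-Waterman (local) if local=True."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: the first row and column carry gap penalties.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                        H[i - 1][j] + gap,     # gap in b
                        H[i][j - 1] + gap)     # gap in a
            if local:
                score = max(score, 0)          # negative scores reset to zero
                best = max(best, score)        # local score is the best cell
            H[i][j] = score
    return best if local else H[-1][-1]

print(alignment_score("GGACGTCC", "TTACGTAA"))              # global score
print(alignment_score("GGACGTCC", "TTACGTAA", local=True))  # local score
```

The three changes (zero initialization, clamping at zero, and reading off the maximum cell rather than the corner) are confined to the lines guarded by `local`.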

The Smith-Waterman solution was clever, simple and obvious (after the fact). Such ideas are a hallmark of Michael Waterman’s distinguished career. Consider the Lander-Waterman model, which is a formula for the expected number of contigs in a shotgun sequencing experiment:

Here *N* is the number of reads sequenced and *R=NL/G* is the “redundancy” (reads × fragment length / genome length). At first glance the Lander-Waterman “model” is *just* a formula arising from the Poisson distribution! It was *obvious*… immediately after they published it. The Pevzner-Tang-Waterman approach to DNA assembly is another good example. It is no coincidence that all of these foundational, important and impactful ideas have Waterman in their name.
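As a quick numerical illustration, the sketch below assumes that the formula in question is the standard Lander-Waterman expectation with the minimum-overlap requirement ignored, namely *N*·e^(−*R*) expected contigs. The function name and the example numbers are mine.

```python
import math

def expected_contigs(num_reads, read_length, genome_length):
    """Expected number of contigs, N * exp(-R) with R = N * L / G.

    Assumes reads land uniformly at random and that any overlap,
    however short, is enough to merge two reads.
    """
    redundancy = num_reads * read_length / genome_length
    return num_reads * math.exp(-redundancy)

# Hypothetical experiment: 100 bp reads on a 100 kb genome.
for n in (500, 1000, 5000, 10000):
    print(n, expected_contigs(n, 100, 100_000))
```

The count first rises with the number of reads and then collapses as coverage closes the gaps, which is the qualitative behavior that makes the formula useful for planning a sequencing experiment.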

Looking back at my own career, some of the most satisfying projects have been Amerindian eggs, projects where I was lucky to participate in collaborations leading to ideas that were obvious (after the fact). Nowadays I know I’ve hit the mark when I receive the most authentic of compliments: “your work is trivial!” or “was widely known in the field”, as I did recently after blogging about plagiarism of key ideas from kallisto. However I’m still waiting to hear the ultimate compliment: “*everything* you do is obvious and was already known!”

(Click “read the rest of this entry” to see the solution to the 9 dot problem.)


To illustrate the different concepts associated to differential expression, I’ll use the following example, consisting of a comparison of a single two-isoform gene in two conditions (the figure is Supplementary Figure 1 in Ntranos, Yi *et al.* Identification of transcriptional signatures for cell types from single-cell RNA-Seq, 2018):

The isoforms are labeled *primary* and *secondary*, and the two conditions are called “A” and “B”. The black dots labeled conditions A and B have x-coordinates and corresponding to the abundances of the primary isoform in the respective conditions, and y-coordinates and corresponding to the abundance of the secondary isoforms. In data from an experiment the black dots will represent the mean level of expression of the constituent isoforms as derived from replicates, and there will be uncertainty as to their exact location. In this example I’ll assume they represent the true abundances.

Below is a list of terms used to characterize changes in expression:

**Differential transcript expression (DTE) **is change in one of the isoforms. In the figure, this is represented (conceptually) by the two red lines along the x- and y-axes respectively. Algebraically, one might compute the change in the primary isoform by and the change in the secondary isoform by . However the term DTE is used to denote not only the extent of change, but also the event that a single isoform of a gene changes between conditions, i.e. when the two points lie on a horizontal or vertical line. DTE can be understood to occur as a result of transcriptional regulation if an isoform has a unique transcription start site, or post-transcriptional regulation if it is determined by a unique splicing event.

**Differential gene expression (DGE) **is the change in the overall output of the gene. Change in the overall output of a gene is change in the direction of the line , and the extent of change can be understood geometrically to be the distance between the projections of the two points onto the line (blue line labeled DGE). The distance will depend on the metric used. For example, the change in expression could be defined to be the total expression in condition B () minus the change in expression in condition A (), which is . This is just the length of the blue line labeled “DGE” given by the norm. Alternatively, one could consider “DGE” to be the length of the blue line in the norm. As with DTE, DGE can also refer to a specific type of change in gene expression between conditions, one in which every isoform changes (relatively) by the same amount so that the line joining the two points has a slope of 1 (i.e. is angled at 45°). DGE can be understood to be the result of transcriptional regulation, driving overall gene expression up or down.

**Differential transcript usage (DTU) **is the change in *relative* expression between the primary and secondary isoforms. This can be interpreted geometrically as the angle between the two points, or alternatively as the length (as given by some norm) of the green line labeled DTU. As with DTE and DGE, DTU is also a term used to describe a certain kind of difference in expression between two conditions, one in which the line joining the two points has a slope of -1. DTU events are most likely controlled by post-transcriptional regulation.

**Gene differential expression ****(GDE)** is represented by the red line. It is the amount of change in expression along in the direction of line joining the two points. GDE is a notion that, for reasons explained below, is not typically tested for, and there are few methods that consider it. However GDE is biologically meaningful, in that it generalizes the notions of DGE, DTU and DTE, allowing for change in *any *direction. A gene that exhibits *some* change in expression between conditions is GDE regardless of the direction of change. GDE can represent complex changes in expression driven by a combination of transcriptional and post-transcriptional regulation. Note that DGE, DTU and DTE are all special cases of GDE.

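If the $\ell_2$ norm is used to measure length and $\text{DTE}_1$, $\text{DTE}_2$ denote DTE in the primary and secondary isoforms respectively, then it is clear that DGE, DTU, DTE and GDE satisfy the relationship

$$\text{GDE}^2 = \text{DGE}^2 + \text{DTU}^2 = \text{DTE}_1^2 + \text{DTE}_2^2.$$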
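A quick numerical check of this Pythagorean relationship may help. The sketch below assumes the two-isoform picture, with each condition represented as a (primary, secondary) abundance pair and Euclidean length throughout; the function name and the example abundances are made up for illustration.

```python
import math

def differential_components(a, b):
    """Decompose the change from condition a to condition b.

    a and b are (primary, secondary) isoform abundances.
    Returned values are signed; lengths are their absolute values.
    """
    dx = b[0] - a[0]  # DTE in the primary isoform
    dy = b[1] - a[1]  # DTE in the secondary isoform
    # Projections of the change vector onto unit vectors along y = x and y = -x
    dge = (dx + dy) / math.sqrt(2)
    dtu = (dx - dy) / math.sqrt(2)
    gde = math.hypot(dx, dy)  # length of the line joining the two points
    return {"DTE1": dx, "DTE2": dy, "DGE": dge, "DTU": dtu, "GDE": gde}

# Hypothetical abundances in conditions A and B
c = differential_components((3.0, 1.0), (5.0, 2.0))

# GDE^2 = DGE^2 + DTU^2 = DTE1^2 + DTE2^2
assert math.isclose(c["GDE"] ** 2, c["DGE"] ** 2 + c["DTU"] ** 2)
assert math.isclose(c["GDE"] ** 2, c["DTE1"] ** 2 + c["DTE2"] ** 2)
```

Both decompositions are just the same change vector written in two different orthogonal coordinate systems, which is why the squared lengths agree.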

The terms DTE, DGE, DTU and GDE have an intuitive biological meaning, but they are also used in genomics as descriptors of certain null hypotheses for statistical testing of differential expression.

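The **differential transcript expression (DTE)** null hypothesis for an isoform is that it did not change between conditions, i.e. $x_A = x_B$ for the primary isoform, or $y_A = y_B$ for the secondary isoform, where $(x_A, y_A)$ and $(x_B, y_B)$ are the (primary, secondary) isoform abundances in conditions A and B. In other words, in this example there are two DTE null hypotheses one could consider.

The **differential gene expression (DGE)** null hypothesis is that there is no change in overall expression of the gene, i.e. $x_A + y_A = x_B + y_B$.

The **differential transcript usage (DTU)** null hypothesis is that there is no change in the difference in expression of isoforms, i.e. $x_A - y_A = x_B - y_B$.

The **gene differential expression (GDE)** null hypothesis is that there is no change in expression in *any* direction, i.e. for all constants $a, b$, $a x_A + b y_A = a x_B + b y_B$.

The **union differential transcript expression (UDTE)** null hypothesis is that there is no change in expression of *any* *isoform*. That is, that $x_A = x_B$ *and* $y_A = y_B$ (this null hypothesis is sometimes called DTE+G). The terminology reflects the fact that rejecting this null hypothesis means that at least one isoform changed, so the alternative is the union of the isoform-level DTE alternatives.

Note that $\text{GDE} \Leftrightarrow \text{UDTE}$, because if we assume GDE, and set $a = 1, b = 0$ we obtain DTE for the primary isoform and setting $a = 0, b = 1$ we obtain DTE for the secondary isoform; the converse is immediate. To be clear, by GDE or DTE in this case we mean the GDE (respectively DTE) *null hypothesis*. Furthermore, we have that

$$\text{UDTE} \Rightarrow \text{DTE}, \text{DGE}, \text{DTU}.$$

This is clear because if $x_A = x_B$ and $y_A = y_B$ then both DTE null hypotheses are satisfied by definition, and both DGE and DTU are trivially satisfied. However, no other implications hold, i.e. $\text{DTE} \nRightarrow \text{DGE}, \text{DTU}$, similarly $\text{DGE} \nRightarrow \text{DTE}, \text{DTU}$, and $\text{DTU} \nRightarrow \text{DGE}, \text{DTE}$.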
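The lack of implications between the DTE, DGE and DTU null hypotheses is easy to see with small counterexamples. The following sketch assumes the two-isoform setting, with each condition given as a (primary, secondary) abundance pair; the function name and the example values are hypothetical.

```python
def null_hypotheses(a, b, tol=1e-9):
    """Report which null hypotheses hold for conditions a and b.

    a and b are (primary, secondary) isoform abundances.
    """
    (xa, ya), (xb, yb) = a, b
    dte1 = abs(xb - xa) <= tol  # primary isoform unchanged
    dte2 = abs(yb - ya) <= tol  # secondary isoform unchanged
    return {
        "DTE1": dte1,
        "DTE2": dte2,
        "DGE": abs((xb + yb) - (xa + ya)) <= tol,  # total unchanged
        "DTU": abs((xb - yb) - (xa - ya)) <= tol,  # difference unchanged
        "UDTE": dte1 and dte2,                     # no isoform changed
    }

# Isoform switch: the DGE null holds, but the DTU and DTE nulls do not
switch = null_hypotheses((4, 2), (2, 4))
# Both isoforms shift by the same amount: the DTU null holds, DGE and DTE do not
shift = null_hypotheses((4, 2), (6, 4))
# Only the secondary isoform changes: one DTE null holds, DGE and DTU do not
single = null_hypotheses((4, 2), (4, 5))
```

In each of the three cases UDTE is false, consistent with the fact that UDTE implies all of the other null hypotheses.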

The terms DGE, DTE, DTU and GDE are also used to describe methods for differential analysis.

A **differential gene expression method** is one whose goal is to identify changes in overall gene expression. Because DGE depends on the projection of the points (representing gene abundances) to the line y=x, DGE methods typically take as input gene counts or abundances computed by summing the transcript abundances of the constituent isoforms. Examples of early DGE methods for RNA-Seq were DESeq (now DESeq2) and edgeR. One problem with DGE methods is that adding up the counts of the constituent isoforms does not accurately estimate gene abundance. This issue was discussed extensively in Trapnell *et al.* 2013. On the other hand, if the biology of a gene is DGE, i.e. changes in expression are the same (relatively) in all isoforms, then DGE methods will be optimal, and the issue of summed counts not representing gene abundances accurately is moot.

A **differential transcript expression method** is one whose goal is to identify individual transcripts that have undergone DTE. Early methods for DTE were Cufflinks (now Cuffdiff2) and MISO, and more recently sleuth, which improves DTE accuracy by modeling uncertainty in transcript quantifications. A key issue with DTE is that there are many more transcripts than genes, so that rejecting DTE null hypotheses is harder than rejecting DGE null hypotheses. On the other hand, DTE provides differential analysis at the highest resolution possible, pinpointing specific isoforms that change and opening a window to study post-transcriptional regulation. A number of recent examples highlight the importance of DTE in biomedicine (see, e.g., Vitting-Seerup and Sandelin 2017). Unfortunately, DTE results do not always translate to testable hypotheses, as it is difficult to knock out individual isoforms of genes.

A **differential transcript usage method** is one whose goal is to identify genes whose overall expression is constant, but where isoform switching leads to changes in relative isoform abundances. Cufflinks implemented a DTU test using Jensen-Shannon divergence, and more recently RATs is a method specialized for DTU.

As discussed in the previous section, none of the null hypotheses DGE, DTE and DTU implies any other, so users have to choose, prior to performing an analysis, which type of test they will perform. There are differing opinions on the “right” approach to choosing between DGE, DTU and DTE. Soneson *et al.* 2016 suggest that while DTE and DTU may be appropriate in certain niche applications, generally it’s better to choose DGE, and they therefore advise not to bother with transcript-level analysis. In Trapnell *et al.* 2010, an argument was made for focusing on DTE and DTU, with the conclusion to the paper speculating that “differential RNA level isoform regulation…suggests functional specialization of the isoforms in many genes.” Van den Berge *et al.* 2017 advocate for a middle ground: performing a gene-level analysis but saving some “FDR budget” for identifying DTE in genes for which the UDTE null hypothesis has been rejected.

There are two alternatives that have been proposed to get around the difficulty of having to choose, prior to analysis, whether to perform DGE, DTU or DTE:

A **differential transcript expression aggregation (DTE->G) method** is a method that first performs DTE on all isoforms of every gene, and then aggregates the resulting p-values (by gene) to obtain gene-level p-values. The “aggregation” relies on the observation that under the null hypothesis, p-values are uniformly distributed. There are a number of different tests (e.g. Fisher’s method) for testing whether (independent) p-values are uniformly distributed. Applying such tests to isoform p-values per gene provides gene-level p-values and the ability to reject UDTE. A DTE->G method was tested in Soneson *et al.* 2016 (based on Šidák aggregation) and the stageR method (Van den Berge *et al.* 2017) uses the same method as a first step. Unfortunately, naïve DTE->G methods perform poorly when genes change by DGE, as shown in Yi *et al.* 2017. The same paper shows that Lancaster aggregation is a DTE->G method that achieves the best of both the DGE and DTU worlds. One major drawback of DTE->G methods is that they are non-constructive, i.e. the rejection of UDTE by a DTE->G method provides no information about *which* transcripts were differential and how. The stageR method averts this problem but requires sacrificing some power to reject UDTE in favor of the interpretability provided by subsequent DTE.
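To make the aggregation step concrete, here is a minimal sketch of two of the aggregation rules mentioned above, Fisher’s method and the Šidák-corrected minimum p-value, applied to the isoform p-values of a single gene. It uses only the standard library, assumes independent p-values that are strictly greater than zero, and is not the implementation used in any of the cited papers; the function names and example p-values are hypothetical.

```python
import math

def fisher_aggregate(pvalues):
    """Fisher's method: -2 * sum(log p) is chi-square with 2k d.o.f. under the null."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # test statistic divided by 2
    # Closed-form chi-square survival function for even degrees of freedom 2k:
    # exp(-s) * sum_{i=0}^{k-1} s^i / i!, with s = statistic / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

def sidak_aggregate(pvalues):
    """Sidak-corrected minimum p-value over the isoforms of a gene."""
    return 1.0 - (1.0 - min(pvalues)) ** len(pvalues)

# Hypothetical isoform-level p-values for one gene
isoform_pvalues = [0.01, 0.5, 0.9]
gene_p_fisher = fisher_aggregate(isoform_pvalues)
gene_p_sidak = sidak_aggregate(isoform_pvalues)
```

Lancaster aggregation generalizes Fisher’s method by weighting the p-values, which requires a gamma rather than a chi-square distribution and is omitted here.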

A **gene differential expression method** is a method for gene-level analysis that tests for differences *in the direction of change* identified between conditions. For a GDE method to be successful, it must be able to identify the direction of change, and that is not possible with bulk RNA-Seq data. This is because of the one in ten rule, which states that approximately one predictive variable can be estimated from ten events. In bulk RNA-Seq, the number of replicates in standard experiments is three, and the number of isoforms in multi-isoform genes is at least two, and sometimes many more than that.

In Ntranos, Yi *et al.* 2018, it is shown that single-cell RNA-Seq provides enough “replicates” in the form of cells that logistic regression can be used to predict condition based on expression, effectively identifying the direction of change. As such, it provides an alternative to DTE->G for rejecting UDTE. The Ntranos and Yi GDE method is extremely powerful: by identifying the direction of change it is a DGE method when the change is DGE, it is a DTU method when the change is DTU, and it is a DTE method when the change is DTE. Interpretability is provided in the prediction step: it is the estimated direction of change.
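The idea of recovering the direction of change from a classifier can be illustrated with a toy example. The sketch below is not the implementation from the paper: it fits a two-feature logistic regression by plain gradient ascent on made-up, deterministic “cells” that exhibit an isoform switch, and reads off the direction of change from the fitted weights. The function name, the data and the optimization settings are all assumptions.

```python
import math

def fit_logistic(features, labels, steps=500, lr=0.1):
    """Gradient ascent on the logistic log-likelihood.

    Features are centered first, so no intercept term is fitted.
    """
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in features]
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for row, label in zip(centered, labels):
            z = sum(wj * xj for wj, xj in zip(w, row))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of condition B
            for j in range(d):
                grad[j] += (label - p) * row[j]
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Hypothetical (primary, secondary) abundances per cell
cells_a = [(4, 2), (5, 1), (3, 3), (4, 3), (5, 2)]  # condition A, label 0
cells_b = [(2, 4), (1, 5), (3, 3), (3, 4), (2, 5)]  # condition B, label 1
w = fit_logistic(cells_a + cells_b, [0] * 5 + [1] * 5)

# Normalize the weights to obtain an estimated direction of change
norm = math.hypot(w[0], w[1])
direction = (w[0] / norm, w[1] / norm)
```

In this toy data the primary isoform goes down and the secondary goes up by the same amount, so the estimated direction lies along the line of slope -1, i.e. the change is of DTU type.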

The discussion in this post is based on an example consisting of a gene with two isoforms, but the concepts discussed are easy to generalize to multi-isoform genes with more than two transcripts. I have not discussed differential exon usage (DEU), which is the focus of the DEXSeq method, because of the complexities arising in genes which don’t have well-defined shared exons. Nevertheless, the DEXSeq approach to rejecting UDTE is similar to DTE->G, with DTE replaced by DEU. There are many programs for DTE, DTU and (especially) DGE that I haven’t mentioned; the ones cited are intended merely to serve as illustrative examples. This is not a comprehensive review of RNA-Seq differential expression methods.

The blog post was motivated by questions of Charlotte Soneson and Mark Robinson arising from an initial draft of the Ntranos, Yi *et al.* 2018 paper. The exposition was developed with Vasilis Ntranos and Lynn Yi. Valentine Svensson provided valuable comments and feedback.

**Enter the Rock64**.

The Rock64 is a new single-board computer from Pine64 that competes with the Raspberry Pi 3.

The Rock64 is evidence of the rapid and impressive development in single-board computers over the past few years, and Pine64 crosses a major threshold by offering a model with 4GB RAM. The machine is also cheap. A 4GB RAM Rock64, which is a 64-bit, quad-core 1.5GHz machine, costs $44.95 (the 1GB model is just $24.95). An enclosure is $7.95, a power supply $6.99, and a 64GB SSD drive is only $31.95 (the 16GB drive is $15.95). When my student Jase Gehring found out the specs of the machine last summer, he immediately realized that it was powerful enough to run kallisto for RNA-Seq analyses, and we preordered a handful of the boards for the lab. These arrived in the fall and we have been testing the machines for a while. One of them is hooked up to a monitor, and together with a bluetooth mouse and keyboard is serving as a general desktop computer in the wet lab. They are extraordinarily versatile mini computers that, in my opinion, portend a future of mobile, low-cost, and light-weight computing for clinical and field genomics applications.

Unfortunately, ARM is not an architecture known to most computational biologists, and my initial enthusiasm for the Rock64 was dampened when I found out that most genomics software does not work on ARM architecture. However, I managed to install R, and Páll Melsted compiled kallisto on the Rock64 for the new release of version 0.44 (the release introduces an ARM binary, along with pseudobam for visualization of pseudoalignments). With these programs in place on Gibraltar (our first Rock64 with 4GB of RAM, a 64GB SSD drive, and a quad-core 1.5GHz processor), there was ample processing power to quantify RNA-Seq datasets.

For example, I was able to build the *Saccharomyces cerevisiae* release 81 transcriptome index in one minute. **A complete quantification of 6 samples from Ellahi, Thurtle and Rine, 2015 using two cores (with 30 bootstraps per sample) took 21 minutes.** The quantification consisted of processing 47,744,312 paired-end reads. Amazingly, the Rock64 can quantify human RNA-Seq, which requires pseudoalignment of reads to a much larger transcriptome than yeast.
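For readers who want to try something similar, the sketch below assembles the kind of `kallisto index` and `kallisto quant` command lines that such a run involves (two threads, 30 bootstraps, paired-end reads) and works out the implied throughput. The file names are placeholders rather than the ones I used, and the helper functions are hypothetical; the commands are only built here, not executed.

```python
def kallisto_index_cmd(fasta, index):
    """Command line for building a kallisto index from a transcriptome FASTA."""
    return ["kallisto", "index", "-i", index, fasta]

def kallisto_quant_cmd(index, out_dir, reads_1, reads_2, threads=2, bootstraps=30):
    """Command line for paired-end quantification with bootstraps."""
    return [
        "kallisto", "quant",
        "-i", index,
        "-o", out_dir,
        "-b", str(bootstraps),  # number of bootstrap samples
        "-t", str(threads),     # number of threads
        reads_1, reads_2,
    ]

# Placeholder file names
index_cmd = kallisto_index_cmd("yeast_transcripts.fa", "yeast.idx")
quant_cmd = kallisto_quant_cmd(
    "yeast.idx", "sample1_out", "sample1_1.fastq.gz", "sample1_2.fastq.gz"
)
# Each command could be run with subprocess.run(cmd, check=True)

# Throughput implied by the numbers above: all 6 samples in 21 minutes
total_reads = 47_744_312
reads_per_second = total_reads / (21 * 60)
```

That works out to tens of thousands of paired-end reads per second on two cores, bootstraps included.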

It’s mind-boggling to consider just how amazing it is to be able to quantify RNA-Seq on such a machine. When we developed kallisto we knew that the two orders of magnitude speedup was a game-changer, but I never thought we would literally be able to run it on what is not much more than a phone. We’re not going to switch over all of our RNA-Seq analyses to the Rock64s quite yet, but cluster assemblies such as the Pico5S have piqued my interest.

I imagine that it won’t be long before mini computers are even more powerful, and provide ultra low-cost portable alternatives to current server and cloud computing solutions. Having said that, I still miss my Commodore 64. Fortunately the mini revolution isn’t leaving me behind: a mini version of the C64 is slated for release early this year.

]]>