This year half of the Nobel prize in Physiology or Medicine was awarded to May-Britt Moser and Edvard Moser, who happen to be both a personal and professional couple. Interestingly, they are not the first but rather the fourth couple to win the prize jointly: In 1903 Marie Curie and Pierre Curie shared the Nobel prize in physics, in 1935 Frederic Joiliot and Irene Joliot-Curie shared the Nobel prize in chemistry and in 1947 Carl Cori and Gerty Cori also shared the Nobel prize in physiology or medicine. It seems working on science with a spouse or partner can be a formula for success. Why then, when partners apply together for academic jobs, do universities refer to them as “two body problems“?

The “two-body problem” is a question in physics about the motions of pairs of celestial bodies that interact with each other gravitationally. It is a special case of the difficult “N-body problem” but simple enough that is (completely) solved; in fact it was solved by Johann Bernoulli a few centuries ago. The use of the term in the context of academic job searches has always bothered me- it suggests that hiring in academia is an exercise in mathematical physics (it is certainly not!) and even if one presumes that it is, the term is an oxymoron because in physics the problem is solved whereas in academia it is used in a way that implies it is unsolvable. There are countless times I have heard my colleagues sigh “so and so would be great but there is a two body problem”. Semantics aside, the allusion to high brow physics problems in the process of academic hiring belies a complete lack of understanding of the basic mathematical notion of epistasis relevant in the consideration of joint applications, not to mention an undercurrent of sexism that plagues science and engineering departments everywhere.  The results are poor hiring decisions, great harm to the academic prospects of partners and couples, and imposition of stress and anxiety that harms the careers of those who are lucky enough to be hired by the flawed system.

I believe it was Aristotle who first noted used the phrase “the whole is greater than the sum of its parts”. The old adage remains true today: owning a pair of matching socks is more than twice as good as having just one sock. This is called positive epistasis, or synergy. Of course the opposite may be true as well: a pair of individuals trying to squeeze through a narrow doorway together will take more than twice as long than if they would just go through one at a time. This would be negative epistasis. There is a beautiful algebra and geometry associated to positive/negative epistasis this is useful to understand, because its generalizations reveal a complexity to epistasis that is very much at play in academia.

Formally, thinking of two “parts”, we can represent them as two bit strings: 01 for one part and 10 for the other. The string 00 represents the situation of having neither part, and 11 having both parts. A “fitness function” $f:[0,1]^2 \rightarrow \mathbb{R}_+$ assigns to each string a value. Epistasis is defined to be the sign of the linear form

$u=f(00)+f(11)-f(10)-f(01)$.

That is, $u>0$ is positive epistasis, $u<0$ is negative epistasis and $u=0$ is no epistasis. In the case where $f(00)=0$, “the whole is greater than the sum of its parts” means that $f(11)>f(10)+f(01)$ and “the whole is less than the sum of its parts” means $f(11). There is an accompanying geometry that consists of drawing a square in the x-y plane whose corners are labeled by $00,01,10$ and $11$. At each corner,  the function $f$ can be represented by a point on the z axis, as shown in the example below:

The black line dividing the square into two triangles comes about by imagining that there are poles at the corners of the square, of height equal to the fitness value, and then that a tablecloth is draped over the poles and stretched taught. The picture above then correspond to the leftmost panel below:

The crease is the resulting of projecting down onto the square the “fold” in the tablecloth (assuming there is a fold). In other words, positive and negative epistasis can be thought of as corresponding to one of the two triangulations of the square. This is the geometry of two parts but what about n parts? We can similarly represent them by bit strings $100 \cdots 0, 010 \cdots 0, 001 \cdots 0, \ldots, 000 \cdots 1$ with the “whole” corresponding to $111 \cdots 1$. Assuming that the parts can only be added up all together, the geometry now works out to be that of triangulations of the hyperbipyramid; the case $n=3$ is shown below:

“The whole is greater than the sum of its parts”: the superior-inferior slice.

“The whole is less than the sum of its parts”: the transverse slice.

With multiple parts epistasis can become more complicated if one allows for arbitrary combining of parts. In a paper written jointly with Niko Beerenwinkel and Bernd Sturmfels titled “Epistasis and shapes of fitness landscapes“, we developed the mathematics for the general case and showed that epistasis among objects allowed to combine in all possible ways corresponds to the different triangulations of a hypercube. For example, in the case of three objects, the square is replaced by the cube with eight corners corresponding to the eight bit strings of length 3. There are 74 triangulations of the cube, falling into 6 symmetry classes. The complete classification is shown below (for details on the meaning of the GKZ vectors and out-edges see the paper):

There is a beautiful geometry describing how the different epistatic shapes (or triangulations) are related, which is known as the secondary polytope. Its vertices correspond to the triangulations and two are connected by an edge when they are the same except for the “flip” of one pair of neighboring tetrahedra. The case of the cube is shown below:

The point of the geometry, and its connection to academic epistasis that I want to highlight in this post, is made clear when considering the case of $n=4$. In that case the number of different types of epistatic interactions is given by the number of triangulations of the 4-cube. There are 87,959,448 triangulations and 235,277 symmetry types! In other words, the intuition from two parts that “interaction” can be positive, negative or neutral is difficult to generalize without math, and the point is there are a myriad of ways a faculty in a large department can be interacting both to the benefit and the detriment of their overall scientific output.

In many searches I’ve been involved in the stated principle for hiring is “let’s hire the best person”. Sometimes the search may be restricted to a field, but it is not uncommon that the search is open. Such a hiring policy deliberately ignores epistasis, and I think it’s crazy, not to mention sexist, because the policy affects and hurts women applicants far more than it does men. Not because women are less likely to be “the best” in their field, in fact quite the opposite. It is very common for women in academia to be partnered with men who are also in academia, and inevitably they suffer for that fact because departments have a hard time reconciling that both could be “the best”. There are also many reasons for departments to think epistaticially that go beyond basic fairness principles. For example, in the case of partners that are applying together to a university, even if they are not working together on research, it is likely that each one will be far more productive if the other has a stable job at the same institution. It is difficult to manage a family if one partner needs to commute hours, or in some cases days, to work. I know of a number of couples in academia that have jobs in different states.

In the last few years there are a few couples that have been bold enough to openly declare themselves “positively epistatic”. What I mean is that they apply jointly as a single applicant, or “joint lab” in the case of biology. For example, there is the case of the Altschuler-Wu lab that has recently relocated to UCSF or the Eddy-Rivas lab that is relocating to Harvard. Still, such cases are far and few between, and for the most part hiring is inefficient, clumsy and unfair (it is also worth noting that there are many other epistatic factors that can and should be considered, for example the field someone is working in, collaborators, etc.)

Epistasis has been carefully studied for a long time in population and statistical genetics, where it is fundamental in understanding the effects of genotype on phenotype. The geometry described above can be derived for diploid genomes and this was done by Ingileif Hallgrímsdóttir and Debbie Yuster in the paper “A complete classification of epistatic two-locus models” from 2008. In the paper they examine a previous classification of epistasis among 30 pairs of loci in a QTL analysis of growth traits in chicken (Carlborg et al., Genome Research 2003). The (re)-classification is shown in the figure below:

If we can classify epistasis for chickens in order to understand them, we can certainly assess the epistasis outlook for our potential colleagues, and we should hire accordingly.

It’s time that the two body problem be recognized as the two body opportunity.

This is part (2/2) about my travel this past summer to Iceland and Israel:

In my previous blog post I discussed the genetics of Icelanders, and the fact that most Icelanders can trace their roots back dozens of generations, all the way to Vikings from ca. 900AD. The country is homogenous in many other ways as well (religion, income, etc.), and therefore presents a stark contrast to the other country I visited this summer: Israel. Even though I’ve been to Israel many times since I was a child, now that I am an adult the manifold ethnic, social and religious makeup of the society is much more evident to me. This was particularly true during my visit this past summer, during which political and military turmoil in the country served to accentuate differences. There are Armenians, Ashkenazi Jews, Bahai, Bedouin, Beta Israel, Christian Arabs, Circassians, Copts, Druze, Maronites, Muslim Arab, Sephardic Jews etc. etc. etc. , and additional “diversity” caused by political splits leading to West Bank Palestinians, Gaza Palestinians, Israelis inside vs. outside the Green Line, etc. etc. etc. (and of course many individuals fall into multiple categories). It’s fair to say that “it’s complicated”. Moreover, the complex fabric that makes up Israeli society is part of a larger web of intertwined threads in the Middle East. The “Arab countries” that neighbor Israel are also internally heterogeneous and complex, both in obvious ways (e.g. the Sunni vs. Shia division), but also in many more subtle ways (e.g. language).

The 2014 Israeli-Gaza conflict started on July 8th. Having been in Israel for 4 weeks I was interacting closely with many friends and colleagues who were deeply impacted by the events (e.g. their children were suddenly called up to a partake in a war), and among them I noticed almost immediately an extreme polarization that reflected a public relations battle being waged between Hamas and Israel that played out more intensely than in any previous conflict on news channels and social media. The polarization extended to friends and acquaintances outside of Israel. Everyone had a very strong opinion. One thing I noticed were graphic memes being passed around in which the conflict was projected onto a two-colored map. For example, the map below was passed around on Facebook showing the (“real democratic”) Israel surrounded by a sea of Arab green in the Middle East:

I started noticing other bifurcating maps as other Middle East issues came to the fore later in the summer. Here is a map from a website depicting the Sunni-Shia divide:

In many cases the images being passed around were explicitly encouraging a “one-dimensional” view of the conflict(s), whereas in other cases the “us” vs. “them” factor was more subliminal. The feeling that I was being programmed how to think made me uncomfortable.

Moreover, the Middle East memes that were flooding my inbox were distracting me. I had visited Israel to nurture and establish connections and collaborations with the large number of computational biologists in the country. During my trip I was kindly hosted by Yael Mandel-Gutfreund at the Technion, and also had the honor of being an invited speaker at the annual Israeli Bioinformatics Society meeting. The visit was not supposed to be a bootcamp in salon politics. In any case, I found myself thinking about the situation in the Middle East with a computational biology mindset, and I was struck by the following “Middle East Friendship Chart” published in July that showed data about the relationships of the various entities/countries/organizations:

As a (computational) biologist I was keen to understand the data in a visual way that would reveal the connections more clearly, and as a computational (biologist) faced with ordinal data I thought immediately of non-metric multi-dimensional scaling as a way to depict the information in the matrix. I have discussed classic multi-dimensional scaling (or MDS) in a previous blog post, where I explained its connection to principal component analysis. In the case of ordinal data, non-metric MDS seeks to find points in a low-dimensional Euclidean space so that the ranks of distances correspond to the input ordinal matrix. It has been used in computational biology, for example in the analysis of gene expression matrices. The idea originates with a classic paper by Kruskal,that remains a good reference for understanding non-metric MDS. The key idea is summarized in his Figure 4:

Formally, in Kruskal’s notation, given a dissimilarity map $\delta$ (symmetric matrix with zeroes on the diagonal and nonnegative entries), the goal is to find points x in $R^k$ so that their pairwise distance match in rank. In Kruskal’s Figure 4, points on the plot correspond to pairs of points in $R^k$ and $\delta$ is shown on the y-axis, while the Euclidean distance between the points, represented by $d$, is shown on the x-axis. Monotonically increasing values $\hat{d}$ are then chosen so that $S=\sum_{ij} \left( d_{ij}-\hat{d}_{ij} \right)^2$ is minimized. The function S is called the “stress” function and is further normalized so that the “stress” is invariant up to scaling of the points. An iterative procedure can then be used to optimize the points, although results depend on which starting configuration is chosen, and for this reason multiple starting positions are considered.

I converted the smiley/frowny faces into numbers 0,1 or 2 (for red, yellow and green faces respectively) and was able to easily experiment with non-metric MDS using an implementation in R. The results for a 2D scaling of the friendship matrix are shown in the figure below:

It is evident that, as expected from the friendship matrix, ISIS is an outlier. One also sees some of “the enemy of thine enemy is thy friend”. What is interesting is that in some cases the placements are clearly affected by shared allegiances and mutual dislikes that are complicated in nature. For example, the reason Saudi Arabia is placed between Israel and the United States is the friendship of the U.S. towards Iraq in contrast to Israel’s relationship to the country. One interesting question, that is not addressed by the non-metric MDS approach, is what the direct influences are. For example, it stands to reason that Israel is neutral to Saudi Arabia partly because of the U.S. friendship with the country- can this be inferred from the data in the same way that causative links are inferred for gene networks? In any case, I thought the scaling was illuminating and it seems like an interesting exercise to extend the analysis to more countries/organizations/entities but it may be necessary to deal with missing data and I don’t have the time to do it.

I did decide to look at the 1D non-metric MDS, to see whether there is a meaningful one-dimensional representation of the matrix, consistent with some of the maps I’d seen. As it turns out, this is not what the data suggests. The one-dimensional scaling described below places ISIS in the middle, i.e. as the “neutral” country!

Israel                -4.55606607
Saudi Arabia          -3.62249810
Turkey                -3.04579321
United States         -2.6429534
Egypt                 -1.12919328
Al-Qaida              -0.38125270
Hamas                  0.01629508
ISIS                   0.40101149
Palestinian Authority  1.55546030
Iraq                   2.23849150
Hezbollah              2.66933449
Iran                   3.29650784
Syria                  5.20065616


This failure of non-metric MDS is simply a reflection of the fact that the friendship matrix is not “one-dimensional”. The Middle East is not one-dimensional. The complex interplay of Sunni vs. Shia, terrorist vs. freedom fighter, muslim vs. infidel, and all the rest of what is going on make it incorrect to think of the conflict in terms of a single attribute. The complex pattern of alliances and conflicts is therefore not well explained by two-colored maps, and the computations described above provide some kind of a “proof” of this fact. The friendship matrix also explains why it’s difficult to have meaningful discussions about the Middle East in 140 characters, or in Facebook tirades, or with soundbites on cable news. But as complicated as the Middle East is, I have no doubt that the “friendship matrix” of my colleagues in computational biology would require even higher dimension…

This past summer I spent a few weeks in Israel and then in Iceland (with brief visits to the Oxford workshop on Biological Sequence Analysis and Probabilistic Models, and to IST Austria). This is the first of two posts about my travels.

I have been a regular visitor to Iceland during the past 12 years, and every visit is associated with unforgettable unique and extraordinary experiences. I have climbed volcanos  and I’ve descended into their depths. I have enjoyed geothermal heating- both in swimming pools and at the beach. And I have seen incredible Aurora Borealis. A couple of years ago I even got married there.

Iceland is indeed a beautiful place. But the most amazing thing about the country… is a truly remarkable and extraordinary… website. It is called Islendingabók, and is not to be confused with the book Islendingabók (Book of Icelanders) from which it borrowed its name. Islendingabók (the website) is a collaboration between the company deCODE Genetics and a computer scientist Friðrik Skúlason, who together set up a searchable genealogical database of Icelanders. The genealogy can only be browsed by registered users, and registration is currently limited to citizens and residents with an Icelandic kennitala (social security number). Many people have heard that Iceland has kept “good” records, but I don’t think the scope and power of the database is generally understood. Even geneticists I have talked to typically don’t have a clue about how detailed and thorough the database is. At the risk of hyperbole, I am blown away every time I look at it. There is nothing like it in the world.

As explained above, I am married to an Icelander (Ingileif Bryndis Hallgrímsdóttir), and she has kindly given me permission to peek at the data. Before getting to her family tree, just a word about the naming system in Iceland, because understanding it is helpful in parsing the genealogy. Surnames are patronymic, and contain the first name of the father with the appendage “son” for sons, and “dóttir” for daughters. Therefore husbands and wives don’t share surnames, but their first names point to their fathers. Below is Ingileif’s (Inga’s) complete family tree going back five generations:

Another naming convention is apparent in the repetition of names (modulo 2 generations) and its coupling to the patronymic naming system. Notice the switch from Ásgeir Jónsson -> Jón Ásgeirsson  -> Ásgeir Jónsson and so on. Traditions run deep. For example, my daughter Steinunn Liorsdóttir is named after her grandmother, Steinunn Jónsdóttir, who is named after her grandmother, Steinunn Guðmundsdóttir, who is named after her grandmother, Steinunn Hannesdóttir, who is named after her grandmother, Steinunn Eyjólfsdóttir (born 1788).

As impressive as this is, the tree is much deeper. Below is the tree for her great-great-great grandfather Ásgeir (on her mother’s side), who was born in 1821:

This tree also goes back five generations (!), and is complete with the exception of three ancestors, who are five generations back (10th generation from my wife). At this point, ten generations back from my wife, we are looking at her relatives born in the early part of the 17th century. There is a lot of annotated information about every individual, not only when and where they were born and died, but also where they lived, and frequently also their professions. Of course the genealogy starts thinning out as one goes back in time and records become more scarce. How far back does it go? For some lines ancestors can be traced back to the 10th century and beyond, with a few lineages reaching kings of Norway ca. 800 AD. Below is the direct line of descendants from Ingólfur Arnarson, first settler of Iceland, to my wife. He is literally one of the great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-grandfathers of my daughters:

Every time I look at Icelandic ancestry chains like this I find my mind forced to stretch to accommodate a broader perspective of history. I begin to think in terms of dozens or hundreds of generations, and what humans have been doing on those timescales. In comparison to Icelandic genetics, other population genetic studies, such as the interesting recent study on Ashkenazi Jews by Itsik Pe’er’s group, are the blur of Gerhard Richter compared to the hyperrealism of Denis Peterson.

The genealogy of Icelanders was partly the rationale for the founding of deCODE Genetics, with the idea that it would be a powerful tool for linkage analysis when coupled with the DNA of Icelanders (in retrospect the genotyping, and now sequencing of a large part of the population means that large swaths of the genealogy can be inferred based on reconstructed haplotype blocks). But another rationale for the formation of deCODE, and one that has turned out to be extremely useful, is the general availability and sharing of records. Of course Iceland has a centralized health care system, and deCODE has been successful in working together with the Ministry of Welfare to perform many GWAS studies for a  variety of diseases (it is worth noting that deCODE has by far the best publication record in GWAS in the world), but what is less well known outside of Iceland is the extent to which individuals are prepared to trade-off privacy for the sake of national equality, and the implications that has for genetics studies. To give one example, this summer during my visit the yearly estimates of salary for representative individuals from all professions were published. These are based on tax returns, which in Iceland are publicly available upon request. Here are the salaries of top executives (salaries are reported in thousands of Icelandic Krona per month; at this time 1 USD = 120 ISK):

Also, the income of a number of early deCODE employees were high this year as a result of the sale of the company to Amgen:

Below are some of the salaries for musicians and artists. Sadly, some things are the same in all countries:

Typically the salaries of about 1% of the population are estimated and published (this year approximately 3,000 people).

Along with publicly available tax records, many other databases are public, for example for many years school records of all students (i.e. grades) were published annually. One can start to imagine all sorts of creative GWAS…Again, as with the genealogy, I can think of no other country in the world with anything like this.

Iceland’s genealogy is embedded deeply into the public psyche, to the extent that I think it’s fair to say that the national identity is constructed around it. After all, it’s hard to argue with someone that they are Icelandic when their ancestry traces back to the first settler. At the same time, like many nations in Europe and around the world, the country is becoming increasingly cosmopolitan, and the genealogy tree is beginning to look a lot more like a trimmed rosebush. Like many other nations, Icelanders are having to confront the questions “Who is an Icelander? What is an Icelander?” Ultimately, the answer cannot and will not come from genetics.

Nature Publishing Group claims on its website that it is committed to publishing “original research” that is “of the highest quality and impact”. But when exactly is research “original”?  This is a question with a complicated answer. A recent blog post by senior editor Dorothy Clyde at Nature Protocols provides insight into the difficulties Nature faces in detecting plagiarism, and identifies the issue of self plagiarism as particularly problematic. The journal tries to avoid publishing the work of authors who have previously published the same work or a minor variant thereof. I imagine this is partly in the interests of fairness, a service to the scientific community to ensure that researchers don’t have to sift through numerous variants of a single research project in the literature, and a personal interest of the journal in its aim to publish only the highest level of scholarship.

On the other hand, there is also a rationale for individual researchers to revisit their own previously published work. Sometimes results can be recast in a way that makes them accessible to different communities, and rethinking of ideas frequently leads to a better understanding, and therefore a better exposition. The mathematician Gian-Carlo Rota made the case for enlightened self-plagiarism in one of his ten lessons he wished he had been taught when he was younger:

3. Publish the same result several times

After getting my degree, I worked for a few years in functional analysis. I bought a copy of Frederick Riesz’ Collected Papers as soon as the big thick heavy oversize volume was published. However, as I began to leaf through, I could not help but notice that the pages were extra thick, almost like cardboard. Strangely, each of Riesz’ publications had been reset in exceptionally large type. I was fond of Riesz’ papers, which were invariably beautifully written and gave the reader a feeling of definitiveness.

As I looked through his Collected Papers however, another picture emerged. The editors had gone out of their way to publish every little scrap Riesz had ever published. It was clear that Riesz’ publications were few. What is more surprising is that the papers had been published several times. Riesz would publish the first rough version of an idea in some obscure Hungarian journal. A few years later, he would send a series of notes to the French Academy’s Comptes Rendus in which the same material was further elaborated. A few more years would pass, and he would publish the definitive paper, either in French or in English. Adam Koranyi, who took courses with Frederick Riesz, told me that Riesz would lecture on the same subject year after year, while meditating on the definitive version to be written. No wonder the final version was perfect.

Riesz’ example is worth following. The mathematical community is split into small groups, each one with its own customs, notation and terminology. It may soon be indispensable to present the same result in several versions, each one accessible to a specific group; the price one might have to pay otherwise is to have our work rediscovered by someone who uses a different language and notation, and who will rightly claim it as his own.

The question is: where does one draw the line?

I was recently forced to confront this question when reading an interesting paper about a statistical approach to utilizing controls in large-scale genomics experiments:

J.A. Gagnon-Bartsch and T.P. Speed, Using control genes to corrected for unwanted variation in microarray dataBiostatistics, 2012.

A cornerstone in the logic and methodology of biology is the notion of a “control”. For example, when testing the result of a drug on patients, a subset of individuals will be given a placebo. This is done to literally control for effects that might be measured in patients taking the drug, but that are not inherent to the drug itself. By examining patients on the placebo, it is possible to essentially cancel out uninteresting effects that are not specific to the drug. In modern genomics experiments that involve thousands, or even hundreds of thousands of measurements, there is a biological question of how to design suitable controls, and a statistical question of how to exploit large numbers of controls to “normalize” (i.e. remove unwanted variation) from the high-dimensional measurements.

Formally, one framework for thinking about this is a linear model for gene expression. Using the notation of Gagnon-Bartsch & Speed, we have an expression matrix $Y$ of size $m \times n$ (samples and genes) modeled as

$Y_{m \times n} = X_{m \times p}\beta_{p \times n} + Z_{m \times q}\gamma_{q \times n} + W_{m \times k} \alpha_{k \times n} + \epsilon_{m \times n}$.

Here is a matrix describing various conditions (also called factors) and associated to it is the parameter matrix $\beta$ that records the contribution, or influence, of each factor on each gene. $\beta$ is the primary parameter of interest to be estimated from the data Y. The $\epsilon$ are random noise, and finally  and are observed and unobserved covariates respectively. For example Z might encode factors for covariates such as gender, whereas W would encode factors that are hidden, or unobserved. A crucial point is that the number of hidden factors in W, namely k, is not known. The matrices $\gamma$ and $\alpha$ record the contributions of the Z and W factors on gene expression, and must also be estimated. It should be noted that X may be the logarithm of expression levels from a microarray experiment, or the analogous quantity from an RNA-Seq experiment (e.g. log of abundance in FPKM units).

Linear models have been applied to gene expression analysis for a very long time; I can think of papers going back 15 years. But They became central to all analysis about a decade ago, specifically popularized with the Limma package for microarray data analysis. In an important paper in 2007, Leek and Storey focused explicitly on the identification of hidden factors and estimation of their influence, using a method called SVA (Surrogate Variable Analysis). Mathematically, they described a procedure for estimating k and W and the parameters $\alpha$. I will not delve into the details of SVA in this post, except to say that the overall idea is to first perform linear regression (assuming no hidden factors) to identify the parameters $\beta$ and to then perform singular value decomposition (SVD) on the residuals to identify hidden factors (details omitted here). The resulting identified hidden factors (and associated influence parameters) are then used in a more general model for gene expression in subsequent analysis.

Gagnon-Bartsch and Speed refine this idea by suggesting that it is better to infer W from controls. For example, house-keeping genes that are unlikely to correlate with the conditions being tested, can be used to first estimate W, and then subsequently all the parameters of the model can be estimated by linear regression. They term this two-step process RUV-2 (acronym for Remote Unwanted Variation) where the “2” designates that the procedure is a two-step procedure. As with SVA, the key to inferring W from the controls is to perform singular value decomposition (or more generally factor analysis). This is actually clear from the probabilistic interpretation of PCA and the observation that what it means to be a in the set of “control genes” C  in a setting where there are no observed factors Z, is that

$Y_C = W \alpha_C + \epsilon_C$.

That is, for such control genes the corresponding $\beta$ parameters are zero. This is a simple but powerful observation, because the explicit designation of control genes in the procedure makes it clear how to estimate W, and therefore the procedure becomes conceptually compelling and practically simple to implement. Thus, even though the model being used is the same as that of Leek & Storey, there is a novel idea in the paper that makes the procedure “cleaner”. Indeed, Gagnon-Bartsch & Speed provide experimental results in their paper showing that RUV-2 outperforms SVA. Even more convincing, is the use of RUV-2 by others. For example, in a paper on “The functional consequences of variation in transcription factor binding” by Cusanovitch et al., PLoS Genetics 2014, RUV-2 is shown to work well, and the authors explain how it helps them to take advantage of the controls in experimental design they created.

There is a tech report and also a preprint that follow up on the Gagnon-Bartsch & Speed paper; the tech report extends RUV-2 to a four step method RUV-4 (it also provides a very clear exposition of the statistics), and separately the preprint describes an extension to RUV-2 for the case where the factor of interest is also unknown. Both of these papers build on the original paper in significant ways and are important work, that to return to the original question in the post, certainly are on the right side of “the line”

The wrong side of the line?

The development of RUV-2 and SVA occurred in the context of microarrays, and it is natural to ask whether the details are really different for RNA-Seq (spoiler: they aren’t).  In a book chapter published earlier this year:

D. Risso, J. Ngai, T.P. Speed, S. Dudoit, The role of spike-in standards in the normalization of RNA-Seq, in Statistical Analysis of Next Generation Sequencing Data (2014), 169-190.

the authors replace “log expression levels” from microarrays with “log counts” from RNA-Seq and the linear regression performed with Limma for RUV-2 with a Poisson regression (this involves one different R command). They call the new method RUV, which is the same as the previously published RUV, a naming convention that makes sense since the paper has no new method. In fact, the mathematical formulas describing the method are identical (and even in almost identical notation!) with the exception that the book chapter ignores altogether, and replaces $\epsilon$ with O.

To be fair, there is one added highlight in the book chapter, namely the observation that spike-ins can be used in lieu of housekeeping (or other control) genes. The method is unchanged, of course. It is just that the spike-ins are used to estimate W. Although spike-ins were not mentioned in the original Gagnon-Bartsch paper, there is no reason not to use them with arrays as well; they are standard with Affymetrix arrays.

My one critique of the chapter is that it doesn’t make sense to me that counts are used in the procedure. I think it would be better to use abundance estimates, and in fact I believe that Jeff Leek has already investigated the possibility in a preprint that appears to be an update to his original SVA work. That issue aside, the book chapter does provide concrete evidence using a Zebrafish experiment that RUV-2 is relevant and works for RNA-Seq data.

The story should end here (and this blog post would not have been written if it had) but two weeks ago, among five RNA-Seq papers published in Nature Biotechnology (I have yet to read the others), I found the following publication:

D. Risso, J. Ngai, T.P. Speed, S. Dudoit, Normalization of RNA-Seq data using factor analysis of control genes or samples, Nature Biotechnology 32 (2014), 896-902.

This paper has the same authors as the book chapter (with the exception that Sandrine Dudoit is now a co-corresponding author with Davide Risso, who was the sole corresponding author on the first publication), and, it turns out, it is basically the same paper… in fact in many parts it is the identical paper. It looks like the Nature Biotechnology paper is an edited and polished version of the book chapter, with a handful of additional figures (based on the same data) and better graphics. I thought that Nature journals publish original and reproducible research papers. I guess I didn’t realize that for some people “reproducible” means “reproduce your own previous research and republish it”.

At this point, before drawing attention to some comparisons between the papers, I’d like to point out that the book chapter was refereed. This is clear from the fact that it is described as such in both corresponding authors’ CVs (Davide Risso CV and Sandrine Dudoit CV).

How similar are the two papers?

Final paragraph of paper in the book:

Internal and external controls are essential for the analysis of high-throughput data and spike-in sequences have the potential to help researchers better adjust for unwanted technical effects. With the advent of single-cell sequencing [35], the role of spike-in standards should become even more important, both to account for technical variability [6] and to allow the move from relative to absolute RNA expression quantification. It is therefore essential to ensure that spike-in standards behave as expected and to develop a set of controls that are stable enough across replicate libraries and robust to both differences in library composition and library preparation protocols.

Final paragraph of paper in Nature Biotechnology:

Internal and external controls are essential for the analysis of high-throughput data and spike-in sequences have the potential to help researchers better adjust for unwanted technical factors. With the advent of single-cell sequencing27, the role of spike-in standards should become even more important, both to account for technical variability28 and to allow the move from relative to absolute RNA expression quantification. It is therefore essential to ensure that spike- in standards behave as expected and to develop a set of controls that are stable enough across replicate libraries and robust to both differences in library composition and library preparation protocols.

Abstract of paper in the book:

Normalization of RNA-seq data is essential to ensure accurate inference of expression levels, by adjusting for sequencing depth and other more complex nuisance effects, both within and between samples. Recently, the External RNA Control Consortium (ERCC) developed a set of 92 synthetic spike-in standards that are commercially available and relatively easy to add to a typical library preparation. In this chapter, we compare the performance of several state-of-the-art normalization methods, including adaptations that directly use spike-in sequences as controls. We show that although the ERCC spike-ins could in principle be valuable for assessing accuracy in RNA-seq experiments, their read counts are not stable enough to be used for normalization purposes. We propose a novel approach to normalization that can successfully make use of control sequences to remove unwanted effects and lead to accurate estimation of expression fold-changes and tests of differential expression.

Abstract of paper in Nature Biotechnology:

Normalization of RNA-sequencing (RNA-seq) data has proven essential to ensure accurate inference of expression levels. Here, we show that usual normalization approaches mostly account for sequencing depth and fail to correct for library preparation and other more complex unwanted technical effects. We evaluate the performance of the External RNA Control Consortium (ERCC) spike-in controls and investigate the possibility of using them directly for normalization. We show that the spike-ins are not reliable enough to be used in standard global-scaling or regression-based normalization procedures. We propose a normalization strategy, called remove unwanted variation (RUV), that adjusts for nuisance technical effects by performing factor analysis on suitable sets of control genes (e.g., ERCC spike-ins) or samples (e.g., replicate libraries). Our approach leads to more accurate estimates of expression fold-changes and tests of differential expression compared to state-of-the-art normalization methods. In particular, RUV promises to be valuable for large collaborative projects involving multiple laboratories, technicians, and/or sequencing platforms.

Abstract of Gagnon-Bartsch & Speed paper that already took credit for a “new” method called RUV:

Microarray expression studies suffer from the problem of batch effects and other unwanted variation. Many methods have been proposed to adjust microarray data to mitigate the problems of unwanted variation. Several of these methods rely on factor analysis to infer the unwanted variation from the data. A central problem with this approach is the difficulty in discerning the unwanted variation from the biological variation that is of interest to the researcher. We present a new method, intended for use in differential expression studies, that attempts to overcome this problem by restricting the factor analysis to negative control genes. Negative control genes are genes known a priori not to be differentially expressed with respect to the biological factor of interest. Variation in the expression levels of these genes can therefore be assumed to be unwanted variation. We name this method “Remove Unwanted Variation, 2-step” (RUV-2). We discuss various techniques for assessing the performance of an adjustment method and compare the performance of RUV-2 with that of other commonly used adjustment methods such as Combat and Surrogate Variable Analysis (SVA). We present several example studies, each concerning genes differentially expressed with respect to gender in the brain and find that RUV-2 performs as well or better than other methods. Finally, we discuss the possibility of adapting RUV-2 for use in studies not concerned with differential expression and conclude that there may be promise but substantial challenges remain.

Many figures are also the same (except one that appears to have been fixed in the Nature Biotechnology paper– I leave the discovery of the figure as an exercise to the reader). Here is Figure 9.2 in the book:

The two panels appears as (b) and (c) in Figure 4 in the Nature Biotechnology paper (albeit transformed via a 90 degree rotation and reflection from the dihedral group):

Basically the whole of the book chapter and the Nature Biotechnology paper are essentially the same, down to the math notation, which even two papers removed is just a rehashing of the RUV method of Gagnon-Bartsch & Speed. A complete diff of the papers is beyond the scope of this blog post and technically not trivial to perform, but examination by eye reveals one to be a draft of the other.

Although it is acceptable in the academic community to draw on material from published research articles for expository book chapters (with permission), and conversely to publish preprints, including conference proceedings, in journals, this case is different. (a) the book chapter was refereed, exactly like a journal publication (b) the material in the chapter is not expository; it is research, (c) it was published before the Nature Biotechnology article, and presumably prepared long before,  (d) the book chapter cites the Nature Biotechnology article but not vice versa and (e) the book chapter is not a particularly innovative piece of work to begin with. The method it describes and claims to be “novel”, namely RUV, was already published by Gagnon-Bartsch & Speed.

Below is a musical rendition of what has happened here:

“An entertaining freshness… Tic Tac!” This is Ferrero‘s tag line for its most successful product, the ubiquitous Tic Tac. And the line has stuck. As WikiHow points out in how to make your breath freshfirst buy some mints, then brush your teeth.

One of the amazing things about Tic Tacs is that they are sugar free. Well… almost not. As the label explains, a single serving (one single Tic Tac) contains 0g of sugar (to be precise, less than 0.5g, as explained in a footnote). In what could initially be assumed to be a mere coincidence, the weight of a single serving is 0.49g. It did not escape my attention that 0.50-0.49=0.01. Why?

To understand it helps to look at the labeling rules of the FDA. I’ve reproduced the relevant section (Title 21) below, and boldfaced the relevant parts:

 TITLE 21–FOOD AND DRUGS
 CHAPTER I–FOOD AND DRUG ADMINISTRATION DEPARTMENT OF HEALTH AND HUMAN SERVICES
 SUBCHAPTER B–FOOD FOR HUMAN CONSUMPTION

(c) Sugar content claims –(1) Use of terms such as “sugar free,” “free of sugar,” “no sugar,” “zero sugar,” “without sugar,” “sugarless,” “trivial source of sugar,” “negligible source of sugar,” or “dietarily insignificant source of sugar.” Consumers may reasonably be expected to regard terms that represent that the food contains no sugars or sweeteners e.g., “sugar free,” or “no sugar,” as indicating a product which is low in calories or significantly reduced in calories. Consequently, except as provided in paragraph (c)(2) of this section, a food may not be labeled with such terms unless:

(i) The food contains less than 0.5 g of sugars, as defined in 101.9(c)(6)(ii), per reference amount customarily consumed and per labeled serving or, in the case of a meal product or main dish product, less than 0.5 g of sugars per labeled serving; and

(ii) The food contains no ingredient that is a sugar or that is generally understood by consumers to contain sugars unless the listing of the ingredient in the ingredient statement is followed by an asterisk that refers to the statement below the list of ingredients, which states “adds a trivial amount of sugar,” “adds a negligible amount of sugar,” or “adds a dietarily insignificant amount of sugar;” and

(iii)(A) It is labeled “low calorie” or “reduced calorie” or bears a relative claim of special dietary usefulness labeled in compliance with paragraphs (b)(2), (b)(3), (b)(4), or (b)(5) of this section, or, if a dietary supplement, it meets the definition in paragraph (b)(2) of this section for “low calorie” but is prohibited by 101.13(b)(5) and 101.60(a)(4) from bearing the claim; or

(B) Such term is immediately accompanied, each time it is used, by either the statement “not a reduced calorie food,” “not a low calorie food,” or “not for weight control.”

It turns out that Tic Tacs are in fact almost pure sugar. Its easy to figure this out by looking at the number of calories per serving (1.9) and multiplying  the number of calories per gram of sugar (3.8) by 0.49 => 1.862 calories. 98% sugar! Ferrero basically admits this in their FAQ. Acting completely within the bounds of the law, they have simply exploited an arbitrary threshold of the FDA. Arbitrary thresholds are always problematic; not only can they have unintended consequences, but they can be manipulated to engineer desired outcomes. In computational biology they have become ubiquitous, frequently being described as “filters” or “pre-processing steps”.  Regardless of how they are justified, thresholds are thresholds are thresholds. They can sometimes be beneficial, but they are dangerous when wielded indiscriminately.

There is one type of thresholding/filtering in used in RNA-Seq that my postdoc Bo Li and I have been thinking about a bit this year. It consists of removing duplicate reads, i.e. reads that map to the same position in a transcriptome. The motivation behind such filtering is to reduce or eliminate amplification bias, and it is based on the intuition that it is unlikely that lightning strikes the same spot multiple times. That is, it is improbable that many reads would map to the exact same location assuming a model for sequencing that posits selecting fragments from transcripts uniformly. The idea is also called de-duplication or digital normalization.

Digital normalization is obviously problematic for high abundance transcripts. Consider, for example, a transcripts that is so abundant that it is extremely likely that at least one read will start at every site (ignoring the ends, which for the purposes of the thought experiment are not relevant). This would also be the case if the transcript was twice as abundant, and so digital normalization would prevent the possibility for estimating the difference. This issue was noted in a paper published earlier this year by Zhou et al.  The authors investigate in some detail the implications of this problem, and quantify the bias it introduces in a number of data sets. But a key question not answered in the paper is what does digital normalization actually do?

To answer the question, it is helpful to consider how one might estimate the abundance of a transcript after digital normalization. One naive approach is to just count the number of reads after de-duplication, followed by normalization for the length of the transcript and the number of reads sequenced. Specifically if there are sites where a read might start, and of the sites had at least one read, then the naive approach would be to use the estimate $\frac{k}{n}$ suitably normalized for the total number of reads in the experiment. This is exactly what is done in standard de-duplication pipelines, or in digital normalization as described in the preprint by Brown et al. However assuming a simple model for sequencing, namely that every read is selected by first choosing a transcript according to a multinomial distribution and then choosing a location on it uniformly at random from among the sites, a different formula emerges.

Let be a random variable that denotes the number of sites on a transcript of length n that are covered in a random sequencing experiment, where the number of reads starting at each site of the transcript is Poisson distributed with parameter c (i.e., the average coverage of the transcript is c). Note that

$Pr(X \geq 1) = 1-Pr(X=0) = 1-e^{-c}$.

The maximum likelihood estimate for can also be obtained by the method of moments, which is to set

$\frac{k}{n} = 1-e^{-c}$

from which it is easy to see that

$c = -log(1-\frac{k}{n})$.

This is the same as the (derivation of the) Jukes-Cantor correction in phylogenetics where the method of moments equation is replaced by $\frac{4}{3}\frac{k}{n} = 1-e^{-\frac{4}{3}c}$ yielding $D_{JC} = -\frac{3}{4}log(1-\frac{4}{3}\frac{k}{n})$, but I’ll leave an extended discussion of the Jukes-Cantor model and correction for a future post.

The point here, as noticed by Bo Li, is that since $log(1-x) \approx -x$ by Taylor approximation, it follows that the average coverage can be estimated by $c \approx \frac{k}{n}$. This is exactly the naive estimate of de-duplication or digital normalization, and the fact that $\frac{k}{n} \rightarrow 1$ as $k \rightarrow n$ means that $-log(1-\frac{k}{n})$ blows up, at high coverage hence the results of Zhou et al.

Digital normalization as proposed by Brown et al. involves possibly thresholding at more than one read per site (for example choosing a threshold C and removing all but at most C reads at every site). But even this modified heuristic fails to adequately relate to a probabilistic model of sequencing. One interesting and easy exercise is to consider the second or higher order Taylor approximations. But a more interesting approach to dealing with amplification bias is to avoid thresholding per se,  and to instead identify outliers among duplicate reads and to adjust them according to an estimated distribution of coverage. This is the approach of Hashimoto et al. in a the paper “Universal count correction for high-throughput sequencing” published in March in PLoS One. There are undoubtedly other approaches as well, and in my opinion the issue will received renewed attention in the coming year as the removal of amplification biases in single-cell transcriptome experiments becomes a priority.

As mentioned above, digital normalization/de-duplication is just one of many thresholds applied in a typical RNA-Seq “pipeline”. To get a sense of the extent of thresholding, one need only scan the (supplementary?) methods section of any genomics paper. For example, the GEUVADIS RNA-Seq consortium describe their analysis pipeline as follows:

“We employed the JIP pipeline (T.G. & M.S., data not shown) to map mRNA-seq reads and to quantify mRNA transcripts. For alignment to the human reference genome sequence (GRCh37, autosomes + X + Y + M), we used the GEM mapping suite24 (v1.349 which corresponds to publicly available pre-release 2) to first map (max. mismatches = 4%, max. edit distance = 20%, min. decoded strata = 2 and strata after best = 1) and subsequently to split-map (max.mismatches = 4%, Gencode v12 and de novo junctions) all reads that did not map entirely. Both mapping steps are repeated for reads trimmed 20 nucleotides from their 3′-end, and then for reads trimmed 5 nucleotides from their 5′-end in addition to earlier 3′-trimming—each time considering exclusively reads that have not been mapped in earlier iterations. Finally, all read mappings were assessed with respect to the mate pair information: valid mapping pairs are formed up to a maximum insert size of 100,000 bp, extension trigger = 0.999 and minimum decoded strata = 1. The mapping pipeline and settings are described below and can also be found in https://github.com/gemtools, where the code as well as an example pipeline are hosted.”

This is not a bad pipeline- the paper shows it was carefully evaluated– and it may have been a practical approach to dealing with the large amount of RNA-Seq data in the project. But even the first and seemingly innocuous thresholding to trim low quality bases from the ends of reads is controversial and potentially problematic. In a careful analysis published earlier this year, Matthew MacManes looked carefully at the effect of trimming in RNA-Seq, and concluded that aggressive trimming of bases below Q20, a standard that is frequently employed in pipelines, is problematic. I think his Figure 3, which I’ve reproduced below, is very convincing:

It certainly appears that some mild trimming can be beneficial, but a threshold that is optimal (and more importantly not detrimental) depends on the specifics of the dataset and is difficult or impossible to determine a priori. MacManes’ view (for more see his blog post on the topic) is consistent with another paper by Del Fabbro et al. that while seemingly positive about trimming in the abstract, actually concludes that “In the specific case of RNA-Seq, the tradeoff between sensitivity (number of aligned reads) and specificity (number of correctly aligned reads) seems to be always detrimental when trimming the datasets (Figure S2); in such a case, the modern aligners, like Tophat, seem to be able to overcome low quality issues, therefore making trimming unnecessary.”

Alas, Tic Tac thresholds are everywhere. My advice is: brush your teeth first.

When I was a teenager I broke all the rules on Friday night. After dinner I would watch Louis Rukeyser’s Wall Street Week at 8:30pm, and I would be in bed an hour later. On new year’s eve, he had a special “year-end review”, during which he hosted “financial experts” who would opine on the stock market and make predictions for the coming year.

What I learned from Louis Rukeyser was:

1. Never trust men in suits (or tuxedos).

2. It’s easier to perpetrate the 1024 scam than one might think!

Here are the experts in 1999 all predicting increases for the stock market in 2000:

As it turned out, the NASDAQ peaked on March 10, 2000, and within a week and a half had dropped 10%. By the end of the year the dot-com bubble had completely burst and a few years later the market had lost almost 80% of its value.

Predictions on the last day of the 20th century represented a spectacular failure for the “pundits”, but by then I had already witnessed many failures on the show. I’d also noted that almost all the invited “experts” were men. Of course correlation does not imply causation, but I remember having a hard time dispelling the notion that the guests were wrong because they were men. I never wanted to be sexist, but Louis Rukeyser made it very difficult for me!

Gender issues aside, the main lesson I learned from Louis Rukeyser’s show is that it’s easy to perpetrate the 1024 scam. The scam goes something like this: a scammer sends out 1024 emails to individuals that are unlikely to know each other, with each email making a prediction about the performance of the stock market in the coming week. For half the people (512), she predicts the stock market will go up, and for the other half, that it will go down. The next week, she has obviously sent a correct prediction of the market to half the people (this assumes the market is never unchanged after a week). She ignores the 512 people who have received an incorrect prediction, dividing those who received the correct prediction into two halves (256 each). Again, she predicts the performance of the market in the coming week, sending 256 individuals a prediction that the market will go up, and the other 256 a prediction that it will go down. She continues this divide-and-conquer for 10 weeks, at which time there is one individual that has received correct predictions about the movement of the stock market for 2.5 months! This person may believe that the scammer has the ability to predict the market; after all, $(\frac{1}{2})^{10} = 0.00098$ which looks like a very significant p-value. This is when the scammer asks for a “large investment”. Of course what is missing is knowledge of the other prediction emails sent out, or in other words the multiple testing problem.

The Wall Street Week guest panels essentially provided a perfect setting in which to perpetrate this scam. “Experts” that would err would be unlikely to be invited back. Whereas regular winners would be back for another chance at guessing. This is a situation very similar to the mutual fund management market, where managers are sacked when they have a bad year, only to have large firms with hundreds of funds on the books highlight funds that have performed well for 10 years in a row in their annual glossy brochures. But that is not the subject matter of this blog post. Rather, it’s the blog itself.

I wrote and posted my first blog entry (Genesis of *Seq) exactly a year ago. I began writing it for two reasons. First, I thought it could be a convenient and useful forum for discussion of technical developments in computational biology. I was motivated partly by the seqanswers website, which allows users to share information and experience in dealing with high-throughput sequence data. But I was also inspired by the What’s New Blog that has created numerous bridges in the mathematics community via highly technical yet accessible posts that have democratized mathematics. Second, I had noticed an extraordinary abuse of multiple testing in computational biology, and I was desperate for a forum where I could bring the issue to peoples attention. My initial frustration with outlandish claims in papers based on weak statistics had also grown over time to encompass a general concern for lack of rigor in computational biology papers. None of us are perfect but there is a wide gap between perfect and wrong. Computational biology is a field that is now an amalgamation of many subjects and I hoped that a blog would be able to reach the different silos more effectively than publications.

And thus this blog was born on August 19th 2013. I started without a preconception of how it would turn out over time, and I’m happy to say I’ve been surprised by its impact, most notably on myself. I’ve learned an enormous amount from reader feedback, in part via comments on individual posts, but also from private emails to me and in personal conversations. For this (selfish) reason alone, I will keep blogging. I have also been asked by many of you to keep posting, and I’m listening. When I have nothing left to say, I promise I will quit. But for now I have a backlog of posts, and after a break this summer, I am ready to return to the keyboard. Besides, since starting to blog I still haven’t been to Las Vegas.

There has recently been something of an uproar over the new book A Troublesome Inheritance by Nicholas Wade, with much of the criticism centering on Wade’s claim that race is a meaningful biological category. This subject is one with which I1 have some personal connection since as a child growing up in South Africa in the 1980s, I was myself categorized very neatly by the Office for Race Classification: 10. A simple pair of digits that conferred on me numerous rights and privileges denied to the majority of the population.

Explanation of identity numbers assigned to citizens by the South African government during apartheid.

And yet the system behind those digits was anything but simple. The group to which an individual was assigned could be based on not only their skin color but also their employment, eating and drinking habits, and indeed explicitly social factors as related by Muriel Horrell of the South African Institute of Race Relations: “Should a man who is initially classified white have a number of coloured friends and spend many of his leisure hours in their company, he stands to risk being re-classified as coloured.”

With these memories in mind, I found Wade’s concept of race as a biological category quite confusing, a confusion which only deepened when I discovered that he identifies not the eight races designated by the South African Population Registration Act of 1950, but rather five, none of which was the Griqua! With the full force of modern science on his side2, it seemed unlikely that these disparities represented an error on Wade’s part. And so I was left with a perplexing question: how could it be that the South African apartheid regime — racists par excellence — had failed to institutionalize their racism correctly? How had Wade gotten it right when Hendrik Verwoerd had gone awry?

Eventually I realized that A Troublesome Inheritance itself might contain the answer to this conundrum. Institutions, Wade explains, are genetic: “they grow out of instinctual social behaviors” and “one indication of such a genetic effect is that, if institutions were purely cultural, it should be easy to transfer an institution from one society to another.”3 So perhaps it is Wade’s genetic instincts as a Briton that explain how he has navigated these waters more skillfully than the Dutch-descended Afrikaners who designed the institutions of apartheid.

One might initially be inclined to scoff at such a suggestion or even to find it offensive. However, we must press boldly on in the name of truth and try to explain why this hypothesis might be true. Again, A Troublesome Inheritance comes to our aid. There, Wade discusses the decline in English interest rates between 1400 and 1850. This is the result, we learn, of rich English people producing more children than the poor and thereby genetically propagating those qualities which the rich are so famous for possessing: “less impulsive, more patient, and more willing to save.”4 However this period of time saw not only falling interest rates but also the rise of the British Empire. It was a period when Englishmen not only built steam engines and textile mills, but also trafficked in slaves by the millions and colonized countries whose people lacked their imperial genes. These latter activities, with an obvious appeal to the more racially minded among England’s population, could bring great wealth to those who engaged in them and so perhaps the greater reproductive fitness of England’s economic elite propagated not only patience but a predisposition to racism. This would explain, for example, the ability of John Hanning Speke to sniff out “the best blood of Abyssinia” when distinguishing the Tutsi from their Hutu neighbors.

Some might be tempted to speculate that Wade is himself a racist. While Wade — who freely speculates about billions of human beings — would no doubt support such an activity, those who engage in such speculation should perhaps not judge him too harshly. After all, racism may simply be Wade’s own troublesome inheritance.

#### Footnotes

1.  In the spirit of authorship designation as discussed in this post, we describe the author contributions as follows: the recollections of South Africa are those of Lior Pachter, who distinctly remembers his classification as “white”. Nicolas Bray conceived and composed the post with input from LP. LP discloses no conflicts of interest. NB discloses being of British ancestry.
2. Perhaps not quite the full force, given the reception his book has received from actual scientists.
3. While this post is satirical, it should be noted for clarity that, improbably, this is an actual quote from Wade’s book.
4. Again, not satire.

In the Jeopardy! game show contestants are presented with questions formulated as answers that require answers in the form questions. For example, if a contestant selects “Normality for $200″ she might be shown the following clue: “The average $\frac{x_1+x_2+\cdots + x_n}{n}$,” to which she would reply “What is the maximum likelihood estimate for the mean of independent identically distributed Gaussian random variables from which samples $x_1,x_2,\ldots,x_n$ have been obtained?” Host Alex Trebek would immediately exclaim “That is the correct answer for$200!”

The process of doing mathematics involves repeatedly playing Jeopardy! with oneself in an unending quest to understand everything just a little bit better. The purpose of this blog post is to provide an exposition of how this works for understanding principal component analysis (PCA): I present four Jeopardy clues in the “Normality” category that all share the same answer: “What is principal component analysis?” The post was motivated by a conversation I recently had with a well-known population geneticist at a conference I was attending. I mentioned to him that I would be saying something about PCA in my talk, and that he might find what I have to say interesting because I knew he had used the method in many of his papers. Without hesitation he replied that he was well aware that PCA was not a statistical method and merely a heuristic visualization tool.

The problem, of course, is that PCA does have a statistical interpretation and is not at all an ad-hoc heuristic. Unfortunately, the previously mentioned population geneticist is not alone; there is a lot of confusion about what PCA is really about. For example, in one textbook it is stated that “PCA is not a statistical method to infer parameters or test hypotheses. Instead, it provides a method to reduce a complex dataset to lower dimension to reveal sometimes hidden, simplified structure that often underlie it.” In another one finds out that “PCA is a statistical method routinely used to analyze interrelationships among large numbers of objects.” In a highly cited review on gene expression analysis PCA is described as “more useful as a visualization technique than as an analytical method” but then in a paper by Markus Ringnér  titled the same as this post, i.e. What is principal component analysis? in Nature Biotechnology, 2008, the author writes that “Principal component analysis (PCA) is a mathematical algorithm that reduces the dimensionality of the data while retaining most of the variation in the data set” (the author then avoids going into the details because “understanding the details underlying PCA requires knowledge of linear algebra”). All of these statements are both correct and incorrect and confusing. A major issue is that the description by Ringnér of PCA in terms of the procedure for computing it (singular value decomposition) is common and unfortunately does not shed light on when it should be used. But knowing when to use a method is far more important than knowing how to do it.

I therefore offer four Jeopardy! clues for principal component analysis that I think help to understand both when and how to use the method:

1. An affine subspace closest to a set of points.

Suppose we are given numbers $x_1,\ldots,x_n$ as in the initial example above. We are interested in finding the “closest” number to these numbers. By “closest” we mean in the sense of total squared difference. That is, we are looking for a number $m$ such that $\sum_{i=1}^n (m-x_i)^2$ is minimized.

This is a (straightforward) calculus problem, solved by taking the derivative of the function above and setting it equal to zero. If we let $f(m) = \sum_{i=1}^n (m-x_i)^2$ then $f'(m) = 2 \cdot \sum_{i=1}^n (m-x_i)$ and setting $f'(m)=0$ we can solve for $m$ to obtain $m = \frac{1}{n} \sum_{i=1}^n x_i$.

The right hand side of the equation is just the average of the n numbers and the optimization problem provides an interpretation of it as the number minimizing the total squared difference with the given numbers (note that one can replace squared difference by absolute value, i.e. minimization of $\sum_{i=1}^n |m-x_i|$, in which case the solution for is the median; we return to this point and its implications for PCA later).

Suppose that instead of n numbers, one is given n points in $\mathbb{R}^p$. That is, point is ${\bf x}^i = (x^i_1,\ldots,x^i_p)$. We can now ask for a point ${\bf m}$ with the property that the squared distance of ${\bf m}$ to the n points is minimized. This is asking for $min_{\bf m} \sum_{i=1}^n ||{\bf m}-{\bf x}^i||_2$.

The solution for $m$ can be obtained by minimizing each coordinate independently, thereby reducing the problem to the simpler version of numbers above, and it follows that ${\bf m} = \frac{1}{n} \sum_{i=1}^n {\bf x}^i$.

This is 0-dimensional PCA, i.e., PCA of a set of points onto a single point, and it is the centroid of the points. The generalization of this concept provides a definition for PCA:

Definition: Given n points in $\mathbb{R}^p$, principal components analysis consists of choosing a dimension $k < p$ and then finding the affine space of dimension with the property that the squared distance of the points to their orthogonal projection onto the space is minimized.

This definition can be thought of as a generalization of the centroid (or average) of the points. To understand this generalization, it is useful to think of the simplest case that is not 0-dimensional PCA, namely 1-dimensional PCA of a set of points in two dimensions:

In this case the 1-dimensional PCA subspace can be thought of as the line that best represents the average of the points. The blue points are the orthogonal projections of the points onto the “average line” (see, e.g., the red point projected orthogonally), which minimizes the squared lengths of the dashed lines. In higher dimensions line is replaced by affine subspace, and the orthogonal projections are to points on that subspace. There are a few properties of the PCA affine subspaces that are worth noting:

1. The set of PCA subspaces (translated to the origin) form a flagThis means that the PCA subspace of dimension k is contained in the PCA subspace of dimension k+1. For example, all PCA subspaces contain the centroid of the points (in the figure above the centroid is the green point). This follows from the fact that the PCA subspaces can be incrementally constructed by building a basis from eigenvectors of a single matrix, a point we will return to later.
2. The PCA subspaces are not scale invariant. For example, if the points are scaled by multiplying one of the coordinates by a constant, then the PCA subspaces change. This is obvious because the centroid of the points will change. For this reason, when PCA is applied to data obtained from heterogeneous measurements, the units matter. One can form a “common” set of units by scaling the values in each coordinate to have the same variance.
3. If the data points are represented in matrix form as an $n \times p$ matrix $X$, and the points orthogonally projected onto the PCA subspace of dimension are represented as in the ambient p dimensional space by a matrix $\tilde{X}$, then $\tilde{X} = argmin_{M:rk(M)=k} ||X-M||_2$. That is, $\tilde{X}$ is the matrix of rank k with the property that the Frobenius norm $||X-\tilde{X}||_2$ is minimized. This is just a rephrasing in linear algebra of the definition of PCA given above.

At this point it is useful to mention some terminology confusion associated with PCA. Unfortunately there is no standard for describing the various parts of an analysis. What I have called the “PCA subspaces” are also sometimes called “principal axes”. The orthogonal vectors forming the flag mentioned above are called “weight vectors”, or “loadings”. Sometimes they are called “principal components”, although that term is sometimes used to refer to points projected onto a principal axis. In this post I stick to “PCA subspaces” and “PCA points” to avoid confusion.

Returning to Jeopardy!, we have “Normality for $400″ with the answer “An affine subspace closest to a set of points” and the question “What is PCA?”. One question at this point is why the Jeopardy! question just asked is in the category “Normality”. After all, the normal distribution does not seem to be related to the optimization problem just discussed. The connection is as follows: 2. A generalization of linear regression in which the Gaussian noise is isotropic. PCA has an interpretation as the maximum likelihood parameter of a linear Gaussian model, a point that is crucial in understanding the scope of its application. To explain this point of view, we begin by elaborating on the opening Jeopardy! question about Normality for$200:

The point of the question was that the average of n numbers can be interpreted as a maximum likelihood estimation of the mean of a Gaussian. The Gaussian distribution is

$f(x,\mu,\sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. Given the numbers $x_1,\ldots,x_n$, the likelihood function is therefore

$L(\mu,\sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$. The maximum of this function is the same as the maximum of its logarithm, which is

$log L(\mu,\sigma) = \sum_{i=1}^n \left( log \frac{1}{\sqrt{2 \pi \sigma^2}} -\frac{(x_i-\mu)^2}{2\sigma^2} \right)$. Therefore the problem of finding the maximum likelihood estimate for the mean is equivalent to that of finding the minimum of the function

$S(\mu) = \sum_{i=1}^n (x_i-\mu)^2$. This is exactly the optimization problem solved by 0-dimensional PCA, as we saw above. With this calculation at hand, we turn to the statistical interpretation of least squares:

Given n points $\{(x_i,y_i)\}_{i=1}^n$ in the plane (see figure above),  the least squares line $y=mx+b$ (purple in figure) is the one that minimizes the sum of the squares $\sum_{i=1}^n \left( (mx_i+b) - y_i \right)^2$. That is, the least squares line is the one minimizing the sum of the squared vertical distances to the points. As with the average of numbers, the least squares line has a statistical interpretation: Suppose that there is some line $y=m^{*}x+b^{*}$ (black line in figure) that is unknown, but that “generated” the observed points, in the sense that each observed point was obtained by perturbing the point $m^{*}x_i +b^{*}$ vertically by a random amount from a single Gaussian distribution with mean 0 and variance $\sigma^2$. In the figure, an example is shown where the blue point on the unknown line “generates” the observed red point; the Gaussian is indicated with the blue streak around the point. Note that the model specified so far is not fully generative, as it depends on the hidden points $m^{*}x_i +b^{*}$ and there is no procedure given to generate the $x_i$. This can be done by positing that the $x_i$ are generated from a Gaussian distribution along the line $y=m^{*}x+b$ (followed by the points $y_i$ generated by Gaussian perturbation of the y coordinate on the line). The coordinates $x_i$ can then be deduced directly from the observed points as the Gaussian perturbations are all vertical. The relationship between the statistical model just described and least squares is made precise by a theorem (which we state informally, but is a special case of the Gauss-Markov theorem):

Theorem (Gauss-Markov): The maximum likelihood estimate for the line (the parameters and b) in the model described above correspond to the least squares line.

The proof is analogous to the argument given for the average of numbers above so we omit it. It can be generalized to higher dimensions where it forms the basis of what is known as linear regression. In regression, the $x_i$ are known as independent variables and $y$ the dependent variable. The generative model provides an interpretation of the independent variables as fixed measured quantities, whereas the dependent variable is a linear combination of the independent variables with added noise. It is important to note that the origins of linear regression are in physics, specifically in work of Legendre (1805) and Gauss (1809) who applied least squares to the astronomical problem of calculating the orbits of comets around the sun. In their application, the independent variables were time (for which accurate measurements were possible with clocks; by 1800 clocks were accurate to less than 0.15 seconds per day) and the (noisy) dependent variable the measurement of location. Linear regression has become one of the most (if not the most) widely used statistical tools but as we now explain, PCA (and its generalization factor analysis), with a statistical interpretation that includes noise in the $x_i$ variables, seems better suited for biological data.

The statistical interpretation of least squares can be extended to a similar framework for PCA. Recall that we first considered a statistical interpretation for least squares where an unknown line $y=m^{*}x+b^{*}$ “generated” the observed points, in the sense that each observed point was obtained by perturbing the point $m^{*}x_i +b^{*}$ vertically by a random amount from a single Gaussian distribution with mean 0 and variance $\sigma^2$. PCA can be understood analogously by replacing vertically by orthogonally (this is the probabilistic model of Collins et al., NIPS 2001 for PCA). However this approach is not completely satisfactory as the orthogonality of the perturbation is is not readily interpretable. Stated differently, it is not obvious what physical processes would generate points orthogonal to a linear affine subspace by perturbations that are always orthogonal to the subspace. In the case of least squares, the “vertical” perturbation corresponds to noise in one measurement (represented by one coordinate). The problem is in naturally interpreting orthogonal perturbations in terms of a noise model for measurements. This difficulty is resolved by a model called probabilistic PCA (pPCA), first proposed by Tipping and Bishop in a Tech Report in 1997, and published in the J. of the Royal Statistical Society B 2002,  and independently by Sam Roweis, NIPS 1998, that is illustrated visually in the figure below, and that we now explain:

In the pPCA model there is an (unknown) line (affine space in higher dimension) on which (hidden) points (blue) are generated at random according to a Gaussian distribution (represented by gray streak in the figure above, where the mean of the Gaussian is the green point). Observed points (red) are then generated from the hidden points by addition of isotropic Gaussian noise (blue smear), meaning that the Gaussian has a diagonal covariance matrix with equal entries. Formally, in the notation of Tipping and Bishop, this is a linear Gaussian model described as follows:

Observed random variables are given by $t = Wx + \mu + \epsilon$ where x are latent (hidden) random variables, W is a matrix describing a subspace and $Wx+\mu$ are the latent points on an affine subspace ($\mu$ corresponds to a translation). Finally, $\epsilon$ is an error term, given by a Gaussian random variable with mean 0 and covariance matrix $\sigma^2 I$. The parameters of the model are $W,\mu$ and $\sigma^2$. Equivalently, the observed random variables are themselves Gaussian, described by the distribution $t \sim \mathcal{N}(\mu,WW^T + \psi)$ where $\psi \sim \mathcal{N}(0,\sigma^2I)$. Tipping and Bishop prove an analogy of the Gauss-Markov theorem, namely that the affine subspace given by the maximum likelihood estimates of $W$ and $\mu$ is the PCA subspace (the proof is not difficult but I omit it and refer interested readers to their paper, or Bishop’s Pattern Recognition and Machine Learning book).

It is important to note that although the maximum likelihood estimates of $W,\mu$ in the pPCA model correspond to the PCA subspace, only posterior distributions can be obtained for the latent data (points on the subspace). Neither the mode nor the mean of those distributions corresponds to the PCA points (orthogonal projections of the observations onto the subspace). However what is true, is that the posterior distributions converge to the PCA points as $\sigma^2 \rightarrow 0$. In other words, the relationship between pPCA and PCA is a bit more subtle than that between least squares and regression.

The relationship between regression and (p)PCA is shown in the figure below:

In the figure, points have been generated randomly according to the pPCA model. the black smear shows the affine space on which the points were generated, with the smear indicating the Gaussian distribution used. Subsequently the latent points (light blue on the gray line) were used to make observed points (red) by the addition of isotropic Gaussian noise. The green line is the maximum likelihood estimate for the space, or equivalently by the theorem of Tipping and Bishop the PCA subspace. The projection of the observed points onto the PCA subspace (blue) are the PCA points. The purple line is the least squares line, or equivalently the affine space obtained by regression (y observed as a noisy function of x). The pink line is also a regression line, except where is observed as a noisy function of y.

A natural question to ask is why the probabilistic interpretation of PCA (pPCA) is useful or necessary? One reason it is beneficial is that maximum likelihood inference for pPCA involves hidden random variables, and therefore the EM algorithm immediately comes to mind as a solution (the strategy was suggested by both Tipping & Bishop and Roweis). I have not yet discussed how to find the PCA subspace, and the EM algorithm provides an intuitive and direct way to see how it can be done, without the need for writing down any linear algebra:

The exact version of the EM shown above is due to Roweis. In it, one begins with a random affine subspace passing through the centroid of the points. The “E” step (expectation) consists of projecting the points to the subspace. The projected points are considered fixed to the subspace. The “M” step (maximization) then consists of rotating the space so that the total squared distance of the fixed points on the subspace to the observed points is minimized. This is repeated until convergence. Roweis points out that this approach to finding the PCA subspace is equivalent to power iteration for (efficiently) finding eigenvalues of the the sample covariance matrix without computing it directly. This is our first use of the word eigenvalue in describing PCA, and we elaborate on it, and the linear algebra of computing PCA subspaces later in the post.

Another point of note is that pPCA can be viewed as a special case of factor analysis, and this connection provides an immediate starting point for thinking about generalizations of PCA. Specifically, factor analysis corresponds to the model $t \sim \mathcal{N}(\mu,WW^T + \psi)$ where the covariance matrix $\psi$ is less constrained, and only required to be diagonal. This is connected to a comment made above about when the PCA subspace might be more useful as a linear fit to data than regression. To reiterate, unlike physics, where some coordinate measurements have very little noise in comparison to others, biological measurements are frequently noisy in all coordinates. In such settings factor analysis is preferable, as the variance in each coordinate is estimated as part of the model. PCA is perhaps a good compromise, as PCA subspaces are easier to find than parameters for factor analysis, yet PCA, via its pPCA interpretation, accounts for noise in all coordinates.

A final comment about pPCA is that it provides a natural framework for thinking about hypothesis testing. The book Statistical Methods: A Geometric Approach by Saville and Wood is essentially about (the geometry of) pPCA and its connection to hypothesis testing. The authors do not use the term pPCA but their starting point is exactly the linear Gaussian model of Tipping and Bishop. The idea is to consider single samples from n independent identically distributed independent Gaussian random variables as one single sample from a high-dimensional multivariate linear Gaussian model with isotropic noise. From that point of view pPCA provides an interpretation for Bessel’s correction. The details are interesting but tangential to our focus on PCA.

We are therefore ready to return to Jeopardy!, where we have “Normality for $600″ with the answer “A generalization of linear regression in which the Gaussian noise is isotropic” and the question “What is PCA?” 3. An orthogonal projection of points onto an affine space that maximizes the retained sample variance. In the previous two interpretations of PCA, the focus was on the PCA affine subspace. However in many uses of PCA the output of interest is the projection of the given points onto the PCA affine space. The projected points have three useful related interpretations: 1. As seen in in section 1, the (orthogonally) projected points (red -> blue) are those whose total squared distance to the observed points is minimized. 2. What we focus on in this section, is the interpretation that the PCA subspace is the one onto which the (orthogonally) projected points maximize the retained sample variance. 3. The topic of the next section, namely that the squared distances between the (orthogonally) projected points are on average (in the $l_2$ metric) closest to the original distances between the points. The sample variance of a set of points is the average squared distance from each point to the centroid. Mathematically, if the observed points are translated so that their centroid is at zero (known as zero-centering), and then represented by an $n \times p$ matrix X, then the sample covariance matrix is given by $\frac{1}{n-1}X^tX$ and the sample variance is given by the trace of the matrix. The point is that the jth diagonal entry of $\frac{1}{n-1}X^tX$ is just $\frac{1}{n-1}\sum_{i=1}^n (x^i_j)^2$, which is the sample variance of the jth variable. The PCA subspace can be viewed as that subspace with the property that the sample variance of the projections of the observed points onto the subspace is maximized. This is easy to see from the figure above. For each point (blue), Pythagoras’ theorem implies that $d(red,blue)^2+d(blue,green)^2 = d(red,green)^2$. Since the PCA subspace is the one minimizing the total squared red-blue distances, and since the solid black lines (red-green distances) are fixed, it follows that the PCA subspace also maximizes the total squared green-blue distances. In other words, PCA maximizes the retained sample variance. The explanation above is informal, and uses a 1-dimensional PCA subspace in dimension 2 to make the argument. However the argument extends easily to higher dimension, which is typically the setting where PCA is used. In fact, PCA is typically used to “visualize” high dimensional points by projection into dimensions two or three, precisely because of the interpretation provided above, namely that it retains the sample variance. I put visualize in quotes because intuition in two or three dimensions does not always hold in high dimensions. However PCA can be useful for visualization, and one of my favorite examples is the evidence for genes mirroring geography in humans. This was first alluded to by Cavalli-Sforza, but definitively shown by Lao et al., 2008, who analyzed 2541 individuals and showed that PCA of the SNP matrix (approximately) recapitulates geography: Genes mirror geography from Lao et al. 2008: (Left) PCA of the SNP matrix (2541 individuals x 309,790 SNPs) showing a density map of projected points. (Right) Map of Europe showing locations of the populations . In the picture above, it is useful to keep in mind that the emergence of geography is occurring in that projection in which the sample variance is maximized. As far as interpretation goes, it is useful to look back at Cavalli-Sforza’s work. He and collaborators who worked on the problem in the 1970s, were unable to obtain a dense SNP matrix due to limited technology of the time. Instead, in Menozzi et al., 1978 they performed PCA of an allele-frequency matrix, i.e. a matrix indexed by populations and allele frequencies instead of individuals and genotypes. Unfortunately they fell into the trap of misinterpreting the biological meaning of the eigenvectors in PCA. Specifically, they inferred migration patterns from contour plots in geographic space obtained by plotting the relative contributions from the eigenvectors, but the effects they observed turned out to be an artifact of PCA. However as we discussed above, PCA can be used quantitatively via the stochastic process for which it solves maximum likelihood inference. It just has to be properly understood. To conclude this section in Jeopardy! language, we have “Normality for$800″ with the answer “A set of points in an affine space obtained via projection of a set of given points so that the sample variance of the projected points is maximized” and the question “What is PCA?”

4. Principal component analysis of Euclidean distance matrices.

In the preceding interpretations of PCA, I have focused on what happens to individual points when projected to a lower dimensional subspace, but it is also interesting to consider what happens to pairs of points. One thing that is clear is that if a pair of points are projected orthogonally onto a low-dimensional affine subspace then the distance between the points in the projection is smaller than the original distance between the points. This is clear because of Pythagoras’ theorem, which implies that the squared distance will shrink unless the points are parallel to the subspace in which case the distance remains the same. An interesting observation is that in fact the PCA subspace is the one with the property where the average (or total) squared distances between the points is maximized. To see this it again suffices to consider only projections onto one dimension (the general case follows by Pythagoras’ theorem). The following lemma, discussed in my previous blog post, makes the connection to the previous discussion:

Lemma: Let $x_1,\ldots,x_n$ be numbers with mean $\overline{x} = \frac{1}{n}\sum_i x_i$. If the average squared distance between pairs of points is denoted $D = \frac{1}{n^2}\sum_{i,j} (x_i-x_j)^2$ and the variance is denoted $V=\frac{1}{n}\sum_i (x_i-\overline{x})^2$ then $V=\frac{1}{2}D$.

What the lemma says is that the sample variance is equal to the average squared difference between the numbers (i.e. it is a scalar multiple that does not depend on the numbers). I have already discussed that the PCA subspace maximizes the retained variance, and it therefore follows that it also maximizes the average (or total) projected squared distance between the points. Alternately, PCA can be interpreted as minimizing the total (squared) distance that is lost, i,e. if the original distances between the points are given by a distance matrix $D$ and the projected distances are given by $\tilde{D}$, then the PCA subspace minimizes $\sum_{ij} (D^2_{ij} - \tilde{D}^2_{ij})$, where each term in the sum is positive as discussed above.

This interpretation of PCA leads to an interesting application of the method to (Euclidean) distance matrices rather than points. The idea is based on a theorem of Isaac Schoenberg that characterizes Euclidean distance matrices and provides a method for realizing them. The theorem is well-known to structural biologists who work with NMR, because it is one of the foundations used to reconstruct coordinates of structures from distance measurements. It requires a bit of notation: is a distance matrix with entries $d_{ij}$ and $\Delta$ is the matrix with entries $-\frac{1}{2}d^2_{ij}$. ${\bf 1}$ denotes the vector of all ones, and ${\bf s}$ denotes a vector.

Theorem (Schoenberg, 1938): A matrix D is a Euclidean distance matrix if and only if the matrix $B=(I-{\bf 1}{\bf s}')\Delta(I-{\bf s}{\bf 1}')$ is positive semi-definite where ${\bf s}'{\bf 1} = 1$.

For the case when ${\bf s}$ is chosen to be a unit vector, i.e. all entries are zero except one of them equal to 1, the matrix B can be viewed as the Gromov transform (known as the Farris transform in phylogenetics) of the matrix with entries $d^2_{ij}$. Since the matrix is positive semidefinite it can be written as $B=XX^t$, where the matrix X provides coordinates for points that realize D. At this point PCA can be applied resulting in a principal subspace and points on it (the orthogonal projections of X). A point of note is that eigenvectors of $XX^t$ can be computed directly, avoiding the need to compute $X^tX$ which may be a larger matrix if $n < p$.

The procedure just described is called classic multidimensional scaling (MDS) and it returns a set of points on a Euclidean subspace with distance matrix $\tilde{D}$ that best represent the original distance matrix D in the sense that $\sum_{ij} (D^2_{ij} - \tilde{D}^2_{ij})$ is minimized. The term multidimensional scaling without the “classic” has taken on an expanded meaning, namely it encapsulates all methods that seek to approximately realize a distance matrix by points in a low dimensional Euclidean space. Such methods are generally not related to PCA, but classic multidimensional scaling is PCA. This is a general source of confusion and error on the internet. In fact, most articles and course notes I found online describing the connection between MDS and PCA are incorrect. In any case classic multidimensional scaling is a very useful instance of PCA, because it extends the utility of the method to cases where points are not available but distances between them are.

Now we return to Jeopardy! one final time with the final question in the category: “Normality for $1000″. The answer is “Principal component analysis of Euclidean distance matrices” and the question is “What is classic multidimensional scaling?” An example To illustrate the interpretations of PCA I have highlighted, I’m including an example in R inspired by an example from another blog post (all commands can be directly pasted into an R console). I’m also providing the example because missing in the discussion above is a description of how to compute PCA subspaces and the projections of points onto them. I therefore explain some of this math in the course of working out the example: First, I generate a set of points (in $\mathbb{R}^2$). I’ve chosen a low dimension so that pictures can be drawn that are compatible with some of the examples above. Comments following commands appear after the # character.  set.seed(2) #sets the seed for random number generation. x <- 1:100 #creates a vector x with numbers from 1 to 100 ex <- rnorm(100, 0, 30) #100 normally distributed rand. nos. w/ mean=0, s.d.=30 ey <- rnorm(100, 0, 30) # " " y <- 30 + 2 * x #sets y to be a vector that is a linear function of x x_obs <- x + ex #adds "noise" to x y_obs <- y + ey #adds "noise" to y P <- cbind(x_obs,y_obs) #places points in matrix plot(P,asp=1,col=1) #plot points points(mean(x_obs),mean(y_obs),col=3, pch=19) #show center At this point a full PCA analysis can be undertaken in R using the command “prcomp”, but in order to illustrate the algorithm I show all the steps below:  M <- cbind(x_obs-mean(x_obs),y_obs-mean(y_obs))#centered matrix MCov <- cov(M) #creates covariance matrix Note that the covariance matrix is proportional to the matrix$M^tM$. Next I turn to computation of the principal axes:  eigenValues <- eigen(MCov)$values       #compute eigenvalues
eigenVectors <- eigen(MCov)$vectors #compute eigenvectors The eigenvectors of the covariance matrix provide the principal axes, and the eigenvalues quantify the fraction of variance explained in each component. This math is explained in many papers and books so we omit it here, except to say that the fact that eigenvalues of the sample covariance matrix are the principal axes follows from recasting the PCA optimization problem as maximization of the Raleigh quotient. A key point is that although I’ve computed the sample covariance matrix explicitly in this example, it is not necessary to do so in practice in order to obtain its eigenvectors. In fact, it is inadvisable to do so. Instead, it is computationally more efficient, and also more stable to directly compute the singular value decomposition of M. The singular value decomposition of M decomposes it into $M=UDV^t$ where is a diagonal matrix and both $U$ and $V^t$ are orthogonal matrices. I will also not explain in detail the linear algebra of singular value decomposition and its relationship to eigenvectors of the sample covariance matrix (there is plenty of material elsewhere), and only show how to compute it in R:  d <- svd(M)$d          #the singular values
v <- svd(M)\$v          #the right singular vectors

The right singular vectors are the eigenvectors of $M^tM$.  Next I plot the principal axes:

 lines(x_obs,eigenVectors[2,1]/eigenVectors[1,1]*M[x]+mean(y_obs),col=8)

This shows the first principal axis. Note that it passes through the mean as expected. The ratio of the eigenvectors gives the slope of the axis. Next

 lines(x_obs,eigenVectors[2,2]/eigenVectors[1,2]*M[x]+mean(y_obs),col=8)

shows the second principal axis, which is orthogonal to the first (recall that the matrix $V^t$ in the singular value decomposition is orthogonal). This can be checked by noting that the second principal axis is also

 lines(x_obs,-1/(eigenVectors[2,1]/eigenVectors[1,1])*M[x]+mean(y_obs),col=8)

as the product of orthogonal slopes is -1. Next, I plot the projections of the points onto the first principal component:

 trans <- (M%*%v[,1])%*%v[,1] #compute projections of points
P_proj <- scale(trans, center=-cbind(mean(x_obs),mean(y_obs)), scale=FALSE)
points(P_proj, col=4,pch=19,cex=0.5) #plot projections
segments(x_obs,y_obs,P_proj[,1],P_proj[,2],col=4,lty=2) #connect to points

The linear algebra of the projection is simply a rotation followed by a projection (and an extra step to recenter to the coordinates of the original points). Formally, the matrix M of points is rotated by the matrix of eigenvectors W to produce $T=MW$. This is the rotation that has all the optimality properties described above. The matrix T is sometimes called the PCA score matrix. All of the above code produces the following figure, which should be compared to those shown above:

There are many generalizations and modifications to PCA that go far beyond what has been presented here. The first step in generalizing probabilistic PCA is factor analysis, which includes estimation of variance parameters in each coordinate. Since it is rare that “noise” in data will be the same in each coordinate, factor analysis is almost always a better idea than PCA (although the numerical algorithms are more complicated). In other words, I just explained PCA in detail, now I’m saying don’t use it! There are other aspects that have been generalized and extended. For example the Gaussian assumption can be relaxed to other members of the exponential family, an important idea if the data is discrete (as in genetics). Yang et al. 2012 exploit this idea by replacing PCA with logistic PCA for analysis of genotypes. There are also many constrained and regularized versions of PCA, all improving on the basic algorithm to deal with numerous issues and difficulties. Perhaps more importantly, there are issues in using PCA that I have not discussed. A big one is how to choose the PCA dimension to project to in analysis of high-dimensional data. But I am stopping here as I am certain no one is reading at this far into the post anyway…

The take-home message about PCA? Always be thinking when using it!

Acknowledgment: The exposition of PCA in this post began with notes I compiled for my course MCB/Math 239: 14 Lessons in Computational Genomics taught in the Spring of 2013. I thank students in that class for their questions and feedback. None of the material presented in class was new, but the exposition was intended to clarify when PCA ought to be used, and how. I was inspired by the papers of Tipping, Bishop and Roweis on probabilistic PCA in the late 1990s that provided the needed statistical framework for its understanding. Following the class I taught, I benefited greatly from conversations with Nicolas Bray, Brielin Brown, Isaac Joseph and Shannon McCurdy who helped me to further frame PCA in the way presented in this post.

The Habsburg rulership of Spain ended with an inbreeding coefficient of F=0.254. The last king, Charles II (1661-1700), suffered an unenviable life. He was unable to chew. His tongue was so large he could not speak clearly, and he constantly drooled. Sadly, his mouth was the least of his problems. He suffered seizures, had intellectual disabilities, and was frequently vomiting. He was also impotent and infertile, which meant that even his death was a curse in that his lack of heirs led to a war.

None of these problems prevented him from being married (twice). His first wife, princess Henrietta of England, died at age 26 after becoming deeply depressed having being married to the man for a decade. Only a year later, he married another princess, 23 year old Maria Anna of Neuberg. To put it mildly, his wives did not end up living the charmed life of Disney princesses, nor were they presumably smitten by young Charles II who apparently aged prematurely and looked the part of his horrific homozygosity. The princesses married Charles II because they were forced to. Royals organized marriages to protect and expand their power, money and influence. Coupled to this were primogeniture rules which ensured that the sons of kings, their own flesh and blood and therefore presumably the best-suited to be in power, would indeed have the opportunity to succeed their fathers. The family tree of Charles II shows how this worked in Spain:

It is believed that the inbreeding in Charles II’s family led to two genetic disorders, combined pituitary hormone deficiency and distal renal tubular acidosis, that explained many of his physical and mental problems. In other words, genetic diversity is important, and the point of this blog post is to highlight the fact that diversity is important in education as well.

The problem of inbreeding in academia has been studied previously, albeit to a limited extent. One interesting article is Navel Grazing: Academic Inbreeding and Scientific Productivity by Horta et al published in 2010 (my own experience with an inbred academic from a department where 39% of the faculty are self-hires anecdotally confirms the claims made in the paper). But here I focus on the downsides of inbreeding of ideas rather than of faculty. For example home-schooling, the educational equivalent of primogeniture, can be fantastic if the parents happen to be good teachers, but can fail miserably if they are not. One thing that is guaranteed in a school or university setting is that learning happens by exposure to many teachers (different faculty, students, tutors, the internet, etc.) Students frequently complain when there is high variance in teaching quality, but one thing such variance ensures is that is is very unlikely that any student is exposed only to bad teachers. Diversity in teaching also helps to foster the development of new ideas. Different teachers, by virtue of insight or error, will occasionally “mutate” ideas or concepts for better or for worse. In other words, one does not have to fully embrace the theory of memes to acknowledge that there are benefits to variance in teaching styles, methods and pedagogy. Conversely, there is danger in homogeneity.

This brings me to MOOCs. One of the great things about MOOCs is that they reach millions of people. Udacity claims it has 1.6 million “users” (students?). Coursera claims 7.1 million. These companies are greatly expanding the accessibility of education. Starving children in India can now take courses in mathematical methods for quantitative finance, and for the first time in history, a president of the United States can discreetly take a freshman course on economics together with its high school algebra prerequisites (highly recommended). But when I am asked whether I would be interested in offering a MOOC I hesitate, paralyzed at the thought that any error I make would immediately be embedded in the brains of millions of innocent victims. My concern is this: MOOCs can greatly reduce the variance in education. For example, Coursera currently offers 641 courses, which means that each courses is or has been taught to over 11,000 students. Many college courses may have less than a few dozen students, and even large college courses rarely have more than a few hundred students. This means that on average, through MOOCs, individual professors reach many more (2 orders of magnitude!) students. A great lecture can end up positively impacting a large number of individuals, but at the same time, a MOOC can be a vehicle for infecting the brains of millions of people with nonsense. If that nonsense is then propagated and reaffirmed via the interactions of the people who have learned it from the same source, then the inbreeding of ideas has occurred.

I mention MOOCs because I was recently thinking about intuition behind Bessel’s correction replacing n with n-1 in the formula for sample variance. Formally, Bessel’s correction replaces the biased formula

$s^2_n = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^2$

for estimating the variance of a random variable from samples $x_1,\ldots,x_n$ with

$s^2_{n-1} = \frac{1}{n-1} \sum_{i=1}^n (x_i-\overline{x})^2$.

The switch from to n-1 is a bit mysterious and surprising, and in introductory statistics classes it is frequently just presented as a “fact”. When an explanation is provided, it is usually in the form of algebraic manipulation that establishes the result. The issue came up as a result of a blog post I’m writing about principal components analysis (PCA), and I thought I would check for an intuitive explanation online. I googled “intuition sample variance” and the top link was a MOOC from the Khan Academy:

The video has over 51,000 views with over 100 “likes” and only 6 “dislikes”. Unfortunately, in this case, popularity is not a good proxy for quality. Despite the title promising “review” and “intuition” for “why we divide by n-1 for the unbiased sample variance” there is no specific reason given why is replaced by n-1 (as opposed to another correction). Furthermore, the intuition provided has to do with the fact that $x_i-\overline{x}$ underestimates $x_i-\mu$ (where $\mu$ is the mean of the random variable and $\overline{x}$ is the sample mean) but the explanation is confusing and not quantitative (which it can easily be). In fact, the wikipedia page for Bessel’s correction provides three different mathematical explanations for the correction together with the intuition that motivates them, but it is difficult to find with Google unless one knows that the correction is called “Bessel’s correction”.

Wikipedia is also not perfect, and this example is a good one for why teaching by humans is important. Among the three alternative derivations, I think that one stands out as “better” but one would not know by just looking at the wikipedia page. Specifically, I refer to “Alternate 1″ on the wikipedia page, that is essentially explaining that variance can be rewritten as a double sum corresponding to the average squared distance between points and the diagonal terms of the sum are zero in expectation. An explanation of why this fact leads to the n-1 in the unbiased estimator is as follows:

The first step is to notice that the variance of a random variable is equal to half of the expected squared difference of two independent identically distributed random variables of that type. Specifically, the definition of variance is:

$var(X) = \mathbb{E}(X - \mu)^2$ where $\mu = \mathbb{E}(X)$. Equivalently, $var(X) = \mathbb{E}(X^2) -\mu^2$. Now suppose that Y is another random variable identically distributed to X and with X,Y independent. Then $\mathbb{E}(X-Y)^2 = 2 var(X)$. This is easy to see by using the fact that

$\mathbb{E}(X-Y)^2 = \mathbb{E}(X^2) + \mathbb{E}(Y^2) - 2\mathbb{E}(X)\mathbb{E}(Y) = 2\mathbb{E}(X^2)-2\mu^2$.

This identity motivates a rewriting of the (uncorrected) sample variance $s_n$ in a way that is computationally less efficient, but mathematically more insightful:

$s_n = \frac{1}{2n^2} \sum_{i,j=1}^n (x_i-x_j)^2$.

Of note is that in this summation exactly n of the terms are zero, namely the terms when i=j. These terms are zero independently of the original distribution, and remain so in expectation thereby biasing the estimate of the variance, specifically leading to an underestimate. Removing them fixes the estimate and produces

$s_{n-1}^2 = \frac{1}{2n(n-1)} \sum_{i,j=1, i \neq j}^n (x_i-x_j)^2$.

It is easy to see that this is indeed Bessel’s correction. In other words, the correction boils down to the fact that $n^2-n = n(n-1)$, hence the appearance of n-1.

Why do I like this particular derivation of Bessel’s correction? There are two reasons: first, n-1 emerges naturally and obviously from the derivation. The denominator in $s_{n-1}^2$ matches exactly the number of terms being summed, so that it can be understood as a true average (this is not apparent in its standard form as $s_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-\overline{x})^2$. There is really nothing mysterious anymore, its just that some terms having been omitted from the sum because they were non-inofrmative. Second, as I will show in my forthcoming blog post on PCA, the fact that the variance of a random variable is half of the expectation of the squared difference of two instances, is key to understanding the connection between multi-dimensional scaling (MDS) and PCA. In other words, as my student Nicolas Bray is fond of saying, although most people think a proof is either right or wrong, in fact some proofs are more right than others. The connection between Bessel’s correction and PCA goes even deeper: as explained by Saville and Wood in their book Statistical Methods: A Geometric Approach n-1 can be understood to be a reduction in one dimension from the point of view of probabilistic PCA (Saville and Wood do not explicitly use the term probabilistic PCA but as I will explain in my PCA post it is implicit in their book). Finally, there are many subtleties to Bessel’s correction, for example it is an unbiased estimator for variance and not standard deviation. These issues ought to be mentioned in a good lecture about the topic. In other words, the Khan lecture is neither necessary nor sufficient, but unlike a standard lecture where the damage is limited to a small audience of students, it has been viewed more than 50,000 times and those views cannot be unviewed.

In writing this blog post I pondered the irony of my call for added diversity in teaching while I preach my own idea (this post) to a large number of readers via a medium designed for maximal outreach. I can only ask that others blog as well to offer alternative points of view :) and that readers inform themselves on the issues I raise by fact-checking elsewhere. As far as the statistics goes, if someone finds the post confusing, they should go and register for one of the many fantastic MOOCs on statistics! But I reiterate that in the rush to MOOCdom, providers must offer diversity in their offerings (even multiple lectures on the same topic) to ensure a healthy population of memes. This is especially true in Spain, where already inbred faculty are now inbreeding what they teach by MOOCing via Miriada X. Half of the MOOCs being offered in Spain originate from just 3 universities, while the number of potential viewers is enormous as Spanish is now the second most spoken language in the world (thanks to Charles II’s great-great-grandfather, Charles I).

May Charles II rest in peace.

One of my distinct memories from elementary school is going to “library class” to learn about the Dewey decimal classification and how to use a card catalog to find books. Searching for books efficiently was possible because cards in the catalog were sorted lexicographically.

It didn’t occur to me at the time, but the system required authors of books to be totally ordered. Without an ordering of authors in a book with multiple authors, there would be no way to decide where to place the card for the book in a catalog searchable by author. The practice of ordering authors on publications is evident in the oldest printed texts and has persisted to this day. I have never thought that it could be any other way.

However this past Wednesday I was visiting the University of Washington to deliver a seminar, and among the highlights of the visit was my meeting with the graduate students. I met 12 for lunch and two more came for dinner. Meeting with students is always my favorite part of a visit to a university. They have original and creative ideas, and most importantly, are not bound in their thought by archaic tradition. They frequently don’t know what one is supposed to think and how one is supposed to say it. They just think and speak!

One of the students I met on Wednesday was Vanessa Gray, a student of Doug Fowler, who in a conversation on authorship practices suggested to me the radical and brilliant idea that papers should be published without an ordering of authors.

Many journals now have a section called “Author contributions” where roles of individuals in collaborative projects can be described (many journals now require such descriptions). So why bother ordering the authors for a list underneath the title? As far as indexing and searching goes, Google and other search engines require only a set of authors, and not a specific ordering.

I agree with Vanessa that ending author ordering on publications would greatly improve fairness in the biological sciences, where many current projects involve complex assemblies of teams with complementary skills. “First authorship” is not well-defined when one author performed a large number of difficult experiments, and another developed novel algorithms and wrote complex software for analyzing the experiments. Similarly, “last authorship” fails as a concept when students are co-advised, or one principal investigator provides substantial funding on a project, while another is participating in doing the work. And recently, large consortium projects have completely destroyed any meaning of “author” by having hundreds, or even thousands of authors on projects. Even when there are relatively few authors people rarely credit anyone except the first and last authors, even if others did substantial work. In the recent ENCODE paper published in PNAS with 30 authors, it appears to me from the responses to my previous blog post about the paper that the 5th and 6th authors did a lot (majority?) of the work in putting together figures and results, yet I suspect the “credit” for the paper will go to the first author (the flip side in that case is that the first author is where blame is assigned as well).

There is also a theoretical justification for not ordering authors. Ordering of authors on a publication can be thought of as a ranking produced by “votes” of the participants in the project. Of course in practice not all votes are equal. In what is called dictatorship in social choice theory, PIs frequently make the decisions independently of how specific authors feel they may have contributed. This may work on a paper where there is a single PI (although it may be considered unfair by the graduate students and postdocs), however dictatorship as a system for determining authorship certainly breaks down when multiple PIs collaborate on a project. Arrow’s impossibility theorem explains that in the absence of dictatorship, there is a problem in producing a single ordering satisfying two other seemingly basic and essential fairness criteria. Informally, the theorem states that there is no authorship ordering system based on voting of contributing authors that can satisfy the following three criteria:

• If every author thinks that X should be ordered before Y, then the author list should have X placed before Y.
• For a fixed list of voting preferences regarding the ordering of X vs. Y, the ordering between X and Y in the author list will remain unchanged even if does not depend on the ordering of other pairs such as X and Z, Y and Z, or Z and W.
• There is no “dictator”, i.e. no single author possesses the power to determine the author ordering.

Authors frequently have differing opinions about the impact of their own contribution to a publication, and therefore their preferences (votes) for author ordering are discordant. This means that any system for ordering authors will not satisfy everyone’s preferences, and in the sense of Arrow’s impossibility theorem will be unfair. One way around Arrow’s impossibility theorem is to specify authorship order without regard to authors’ preferences, for example by always ordering authors alphabetically (the Hardy-Littlewood rule). This method, usually the one used in the mathematical sciences, is also fraught with problems. Of course, listing author contributions for what they are is not entirely trivial. For example, different authors may have conflicting views about what it means to have “written the text of the paper”. But using words to describe contributions allow for much more detail about what each author did, and allows for nuanced contributions to be described (e.g., John and Jane were in the room when the initial idea for the project was discussed, but did not contribute anything afterwards).

To summarize, in the modern era of electronic publishing ordering of authors is unnecessary, and if it is unnecessary, then why confront Arrow’s theorem and inevitably produce orderings unfairly? Publications should just explain the author contributions. Time to end ordered authorship.

The card catalog at Yale University’s Sterling Memorial Library (from Wikipedia).