When I was a teenager I broke all the rules on Friday night. After dinner I would watch Louis Rukeyser’s Wall Street Week at 8:30pm, and I would be in bed an hour later. On new year’s eve, he had a special “year-end review”, during which he hosted “financial experts” who would opine on the stock market and make predictions for the coming year.

What I learned from Louis Rukeyser was:

1. Never trust men in suits (or tuxedos).

2. It’s easier to perpetrate the 1024 scam than one might think!

Here are the experts in 1999 all predicting increases for the stock market in 2000:

As it turned out, the NASDAQ peaked on March 10, 2000, and within a week and a half had dropped 10%. By the end of the year the dot-com bubble had completely burst and a few years later the market had lost almost 80% of its value.

Predictions on the last day of the 20th century represented a spectacular failure for the “pundits”, but by then I had already witnessed many failures on the show. I’d also noted that almost all the invited “experts” were men. Of course correlation does not imply causation, but I remember having a hard time dispelling the notion that the guests were wrong because they were men. I never wanted to be sexist, but Louis Rukeyser made it very difficult for me!

Gender issues aside, the main lesson I learned from Louis Rukeyser’s show is that it’s easy to perpetrate the 1024 scam. The scam goes something like this: a scammer sends out 1024 emails to individuals that are unlikely to know each other, with each email making a prediction about the performance of the stock market in the coming week. For half the people (512), she predicts the stock market will go up, and for the other half, that it will go down. The next week, she has obviously sent a correct prediction of the market to half the people (this assumes the market is never unchanged after a week). She ignores the 512 people who have received an incorrect prediction, dividing those who received the correct prediction into two halves (256 each). Again, she predicts the performance of the market in the coming week, sending 256 individuals a prediction that the market will go up, and the other 256 a prediction that it will go down. She continues this divide-and-conquer for 10 weeks, at which time there is one individual that has received correct predictions about the movement of the stock market for 2.5 months! This person may believe that the scammer has the ability to predict the market; after all, $(\frac{1}{2})^{10} = 0.00098$ which looks like a very significant p-value. This is when the scammer asks for a “large investment”. Of course what is missing is knowledge of the other prediction emails sent out, or in other words the multiple testing problem.

The Wall Street Week guest panels essentially provided a perfect setting in which to perpetrate this scam. “Experts” that would err would be unlikely to be invited back. Whereas regular winners would be back for another chance at guessing. This is a situation very similar to the mutual fund management market, where managers are sacked when they have a bad year, only to have large firms with hundreds of funds on the books highlight funds that have performed well for 10 years in a row in their annual glossy brochures. But that is not the subject matter of this blog post. Rather, it’s the blog itself.

I wrote and posted my first blog entry (Genesis of *Seq) exactly a year ago. I began writing it for two reasons. First, I thought it could be a convenient and useful forum for discussion of technical developments in computational biology. I was motivated partly by the seqanswers website, which allows users to share information and experience in dealing with high-throughput sequence data. But I was also inspired by the What’s New Blog that has created numerous bridges in the mathematics community via highly technical yet accessible posts that have democratized mathematics. Second, I had noticed an extraordinary abuse of multiple testing in computational biology, and I was desperate for a forum where I could bring the issue to peoples attention. My initial frustration with outlandish claims in papers based on weak statistics had also grown over time to encompass a general concern for lack of rigor in computational biology papers. None of us are perfect but there is a wide gap between perfect and wrong. Computational biology is a field that is now an amalgamation of many subjects and I hoped that a blog would be able to reach the different silos more effectively than publications.

And thus this blog was born on August 19th 2013. I started without a preconception of how it would turn out over time, and I’m happy to say I’ve been surprised by its impact, most notably on myself. I’ve learned an enormous amount from reader feedback, in part via comments on individual posts, but also from private emails to me and in personal conversations. For this (selfish) reason alone, I will keep blogging. I have also been asked by many of you to keep posting, and I’m listening. When I have nothing left to say, I promise I will quit. But for now I have a backlog of posts, and after a break this summer, I am ready to return to the keyboard. Besides, since starting to blog I still haven’t been to Las Vegas.

There has recently been something of an uproar over the new book A Troublesome Inheritance by Nicholas Wade, with much of the criticism centering on Wade’s claim that race is a meaningful biological category. This subject is one with which I1 have some personal connection since as a child growing up in South Africa in the 1980s, I was myself categorized very neatly by the Office for Race Classification: 10. A simple pair of digits that conferred on me numerous rights and privileges denied to the majority of the population.

Explanation of identity numbers assigned to citizens by the South African government during apartheid.

And yet the system behind those digits was anything but simple. The group to which an individual was assigned could be based on not only their skin color but also their employment, eating and drinking habits, and indeed explicitly social factors as related by Muriel Horrell of the South African Institute of Race Relations: “Should a man who is initially classified white have a number of coloured friends and spend many of his leisure hours in their company, he stands to risk being re-classified as coloured.”

With these memories in mind, I found Wade’s concept of race as a biological category quite confusing, a confusion which only deepened when I discovered that he identifies not the eight races designated by the South African Population Registration Act of 1950, but rather five, none of which was the Griqua! With the full force of modern science on his side2, it seemed unlikely that these disparities represented an error on Wade’s part. And so I was left with a perplexing question: how could it be that the South African apartheid regime — racists par excellence — had failed to institutionalize their racism correctly? How had Wade gotten it right when Hendrik Verwoerd had gone awry?

Eventually I realized that A Troublesome Inheritance itself might contain the answer to this conundrum. Institutions, Wade explains, are genetic: “they grow out of instinctual social behaviors” and “one indication of such a genetic effect is that, if institutions were purely cultural, it should be easy to transfer an institution from one society to another.”3 So perhaps it is Wade’s genetic instincts as a Briton that explain how he has navigated these waters more skillfully than the Dutch-descended Afrikaners who designed the institutions of apartheid.

One might initially be inclined to scoff at such a suggestion or even to find it offensive. However, we must press boldly on in the name of truth and try to explain why this hypothesis might be true. Again, A Troublesome Inheritance comes to our aid. There, Wade discusses the decline in English interest rates between 1400 and 1850. This is the result, we learn, of rich English people producing more children than the poor and thereby genetically propagating those qualities which the rich are so famous for possessing: “less impulsive, more patient, and more willing to save.”4 However this period of time saw not only falling interest rates but also the rise of the British Empire. It was a period when Englishmen not only built steam engines and textile mills, but also trafficked in slaves by the millions and colonized countries whose people lacked their imperial genes. These latter activities, with an obvious appeal to the more racially minded among England’s population, could bring great wealth to those who engaged in them and so perhaps the greater reproductive fitness of England’s economic elite propagated not only patience but a predisposition to racism. This would explain, for example, the ability of John Hanning Speke to sniff out “the best blood of Abyssinia” when distinguishing the Tutsi from their Hutu neighbors.

Some might be tempted to speculate that Wade is himself a racist. While Wade — who freely speculates about billions of human beings — would no doubt support such an activity, those who engage in such speculation should perhaps not judge him too harshly. After all, racism may simply be Wade’s own troublesome inheritance.

#### Footnotes

1.  In the spirit of authorship designation as discussed in this post, we describe the author contributions as follows: the recollections of South Africa are those of Lior Pachter, who distinctly remembers his classification as “white”. Nicolas Bray conceived and composed the post with input from LP. LP discloses no conflicts of interest. NB discloses being of British ancestry.
2. Perhaps not quite the full force, given the reception his book has received from actual scientists.
3. While this post is satirical, it should be noted for clarity that, improbably, this is an actual quote from Wade’s book.
4. Again, not satire.

In the Jeopardy! game show contestants are presented with questions formulated as answers that require answers in the form questions. For example, if a contestant selects “Normality for $200″ she might be shown the following clue: “The average $\frac{x_1+x_2+\cdots + x_n}{n}$,” to which she would reply “What is the maximum likelihood estimate for the mean of independent identically distributed Gaussian random variables from which samples $x_1,x_2,\ldots,x_n$ have been obtained?” Host Alex Trebek would immediately exclaim “That is the correct answer for$200!”

The process of doing mathematics involves repeatedly playing Jeopardy! with oneself in an unending quest to understand everything just a little bit better. The purpose of this blog post is to provide an exposition of how this works for understanding principal component analysis (PCA): I present four Jeopardy clues in the “Normality” category that all share the same answer: “What is principal component analysis?” The post was motivated by a conversation I recently had with a well-known population geneticist at a conference I was attending. I mentioned to him that I would be saying something about PCA in my talk, and that he might find what I have to say interesting because I knew he had used the method in many of his papers. Without hesitation he replied that he was well aware that PCA was not a statistical method and merely a heuristic visualization tool.

The problem, of course, is that PCA does have a statistical interpretation and is not at all an ad-hoc heuristic. Unfortunately, the previously mentioned population geneticist is not alone; there is a lot of confusion about what PCA is really about. For example, in one textbook it is stated that “PCA is not a statistical method to infer parameters or test hypotheses. Instead, it provides a method to reduce a complex dataset to lower dimension to reveal sometimes hidden, simplified structure that often underlie it.” In another one finds out that “PCA is a statistical method routinely used to analyze interrelationships among large numbers of objects.” In a highly cited review on gene expression analysis PCA is described as “more useful as a visualization technique than as an analytical method” but then in a paper by Markus Ringnér  titled the same as this post, i.e. What is principal component analysis? in Nature Biotechnology, 2008, the author writes that “Principal component analysis (PCA) is a mathematical algorithm that reduces the dimensionality of the data while retaining most of the variation in the data set” (the author then avoids going into the details because “understanding the details underlying PCA requires knowledge of linear algebra”). All of these statements are both correct and incorrect and confusing. A major issue is that the description by Ringnér of PCA in terms of the procedure for computing it (singular value decomposition) is common and unfortunately does not shed light on when it should be used. But knowing when to use a method is far more important than knowing how to do it.

I therefore offer four Jeopardy! clues for principal component analysis that I think help to understand both when and how to use the method:

1. An affine subspace closest to a set of points.

Suppose we are given numbers $x_1,\ldots,x_n$ as in the initial example above. We are interested in finding the “closest” number to these numbers. By “closest” we mean in the sense of total squared difference. That is, we are looking for a number $m$ such that $\sum_{i=1}^n (m-x_i)^2$ is minimized.

This is a (straightforward) calculus problem, solved by taking the derivative of the function above and setting it equal to zero. If we let $f(m) = \sum_{i=1}^n (m-x_i)^2$ then $f'(m) = 2 \cdot \sum_{i=1}^n (m-x_i)$ and setting $f'(m)=0$ we can solve for $m$ to obtain $m = \frac{1}{n} \sum_{i=1}^n x_i$.

The right hand side of the equation is just the average of the n numbers and the optimization problem provides an interpretation of it as the number minimizing the total squared difference with the given numbers (note that one can replace squared difference by absolute value, i.e. minimization of $\sum_{i=1}^n |m-x_i|$, in which case the solution for is the median; we return to this point and its implications for PCA later).

Suppose that instead of n numbers, one is given n points in $\mathbb{R}^p$. That is, point is ${\bf x}^i = (x^i_1,\ldots,x^i_p)$. We can now ask for a point ${\bf m}$ with the property that the squared distance of ${\bf m}$ to the n points is minimized. This is asking for $min_{\bf m} \sum_{i=1}^n ||{\bf m}-{\bf x}^i||_2$.

The solution for $m$ can be obtained by minimizing each coordinate independently, thereby reducing the problem to the simpler version of numbers above, and it follows that ${\bf m} = \frac{1}{n} \sum_{i=1}^n {\bf x}^i$.

This is 0-dimensional PCA, i.e., PCA of a set of points onto a single point, and it is the centroid of the points. The generalization of this concept provides a definition for PCA:

Definition: Given n points in $\mathbb{R}^p$, principal components analysis consists of choosing a dimension $k < p$ and then finding the affine space of dimension with the property that the squared distance of the points to their orthogonal projection onto the space is minimized.

This definition can be thought of as a generalization of the centroid (or average) of the points. To understand this generalization, it is useful to think of the simplest case that is not 0-dimensional PCA, namely 1-dimensional PCA of a set of points in two dimensions:

In this case the 1-dimensional PCA subspace can be thought of as the line that best represents the average of the points. The blue points are the orthogonal projections of the points onto the “average line” (see, e.g., the red point projected orthogonally), which minimizes the squared lengths of the dashed lines. In higher dimensions line is replaced by affine subspace, and the orthogonal projections are to points on that subspace. There are a few properties of the PCA affine subspaces that are worth noting:

1. The set of PCA subspaces (translated to the origin) form a flagThis means that the PCA subspace of dimension k is contained in the PCA subspace of dimension k+1. For example, all PCA subspaces contain the centroid of the points (in the figure above the centroid is the green point). This follows from the fact that the PCA subspaces can be incrementally constructed by building a basis from eigenvectors of a single matrix, a point we will return to later.
2. The PCA subspaces are not scale invariant. For example, if the points are scaled by multiplying one of the coordinates by a constant, then the PCA subspaces change. This is obvious because the centroid of the points will change. For this reason, when PCA is applied to data obtained from heterogeneous measurements, the units matter. One can form a “common” set of units by scaling the values in each coordinate to have the same variance.
3. If the data points are represented in matrix form as an $n \times p$ matrix $X$, and the points orthogonally projected onto the PCA subspace of dimension are represented as in the ambient p dimensional space by a matrix $\tilde{X}$, then $\tilde{X} = argmin_{M:rk(M)=k} ||X-M||_2$. That is, $\tilde{X}$ is the matrix of rank k with the property that the Frobenius norm $||X-\tilde{X}||_2$ is minimized. This is just a rephrasing in linear algebra of the definition of PCA given above.

At this point it is useful to mention some terminology confusion associated with PCA. Unfortunately there is no standard for describing the various parts of an analysis. What I have called the “PCA subspaces” are also sometimes called “principal axes”. The orthogonal vectors forming the flag mentioned above are called “weight vectors”, or “loadings”. Sometimes they are called “principal components”, although that term is sometimes used to refer to points projected onto a principal axis. In this post I stick to “PCA subspaces” and “PCA points” to avoid confusion.

Returning to Jeopardy!, we have “Normality for $400″ with the answer “An affine subspace closest to a set of points” and the question “What is PCA?”. One question at this point is why the Jeopardy! question just asked is in the category “Normality”. After all, the normal distribution does not seem to be related to the optimization problem just discussed. The connection is as follows: 2. A generalization of linear regression in which the Gaussian noise is isotropic. PCA has an interpretation as the maximum likelihood parameter of a linear Gaussian model, a point that is crucial in understanding the scope of its application. To explain this point of view, we begin by elaborating on the opening Jeopardy! question about Normality for$200:

The point of the question was that the average of n numbers can be interpreted as a maximum likelihood estimation of the mean of a Gaussian. The Gaussian distribution is

$f(x,\mu,\sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. Given the numbers $x_1,\ldots,x_n$, the likelihood function is therefore

$L(\mu,\sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$. The maximum of this function is the same as the maximum of its logarithm, which is

$log L(\mu,\sigma) = \sum_{i=1}^n \left( log \frac{1}{\sqrt{2 \pi \sigma^2}} -\frac{(x_i-\mu)^2}{2\sigma^2} \right)$. Therefore the problem of finding the maximum likelihood estimate for the mean is equivalent to that of finding the minimum of the function

$S(\mu) = \sum_{i=1}^n (x_i-\mu)^2$. This is exactly the optimization problem solved by 0-dimensional PCA, as we saw above. With this calculation at hand, we turn to the statistical interpretation of least squares:

Given n points $\{(x_i,y_i)\}_{i=1}^n$ in the plane (see figure above),  the least squares line $y=mx+b$ (purple in figure) is the one that minimizes the sum of the squares $\sum_{i=1}^n \left( (mx_i+b) - y_i \right)^2$. That is, the least squares line is the one minimizing the sum of the squared vertical distances to the points. As with the average of numbers, the least squares line has a statistical interpretation: Suppose that there is some line $y=m^{*}x+b^{*}$ (black line in figure) that is unknown, but that “generated” the observed points, in the sense that each observed point was obtained by perturbing the point $m^{*}x_i +b^{*}$ vertically by a random amount from a single Gaussian distribution with mean 0 and variance $\sigma^2$. In the figure, an example is shown where the blue point on the unknown line “generates” the observed red point; the Gaussian is indicated with the blue streak around the point. Note that the model specified so far is not fully generative, as it depends on the hidden points $m^{*}x_i +b^{*}$ and there is no procedure given to generate the $x_i$. This can be done by positing that the $x_i$ are generated from a Gaussian distribution along the line $y=m^{*}x+b$ (followed by the points $y_i$ generated by Gaussian perturbation of the y coordinate on the line). The coordinates $x_i$ can then be deduced directly from the observed points as the Gaussian perturbations are all vertical. The relationship between the statistical model just described and least squares is made precise by a theorem (which we state informally, but is a special case of the Gauss-Markov theorem):

Theorem (Gauss-Markov): The maximum likelihood estimate for the line (the parameters and b) in the model described above correspond to the least squares line.

The proof is analogous to the argument given for the average of numbers above so we omit it. It can be generalized to higher dimensions where it forms the basis of what is known as linear regression. In regression, the $x_i$ are known as independent variables and $y$ the dependent variable. The generative model provides an interpretation of the independent variables as fixed measured quantities, whereas the dependent variable is a linear combination of the independent variables with added noise. It is important to note that the origins of linear regression are in physics, specifically in work of Legendre (1805) and Gauss (1809) who applied least squares to the astronomical problem of calculating the orbits of comets around the sun. In their application, the independent variables were time (for which accurate measurements were possible with clocks; by 1800 clocks were accurate to less than 0.15 seconds per day) and the (noisy) dependent variable the measurement of location. Linear regression has become one of the most (if not the most) widely used statistical tools but as we now explain, PCA (and its generalization factor analysis), with a statistical interpretation that includes noise in the $x_i$ variables, seems better suited for biological data.

The statistical interpretation of least squares can be extended to a similar framework for PCA. Recall that we first considered a statistical interpretation for least squares where an unknown line $y=m^{*}x+b^{*}$ “generated” the observed points, in the sense that each observed point was obtained by perturbing the point $m^{*}x_i +b^{*}$ vertically by a random amount from a single Gaussian distribution with mean 0 and variance $\sigma^2$. PCA can be understood analogously by replacing vertically by orthogonally (this is the probabilistic model of Collins et al., NIPS 2001 for PCA). However this approach is not completely satisfactory as the orthogonality of the perturbation is is not readily interpretable. Stated differently, it is not obvious what physical processes would generate points orthogonal to a linear affine subspace by perturbations that are always orthogonal to the subspace. In the case of least squares, the “vertical” perturbation corresponds to noise in one measurement (represented by one coordinate). The problem is in naturally interpreting orthogonal perturbations in terms of a noise model for measurements. This difficulty is resolved by a model called probabilistic PCA (pPCA), first proposed by Tipping and Bishop in a Tech Report in 1997, and published in the J. of the Royal Statistical Society B 2002,  and independently by Sam Roweis, NIPS 1998, that is illustrated visually in the figure below, and that we now explain:

In the pPCA model there is an (unknown) line (affine space in higher dimension) on which (hidden) points (blue) are generated at random according to a Gaussian distribution (represented by gray streak in the figure above, where the mean of the Gaussian is the green point). Observed points (red) are then generated from the hidden points by addition of isotropic Gaussian noise (blue smear), meaning that the Gaussian has a diagonal covariance matrix with equal entries. Formally, in the notation of Tipping and Bishop, this is a linear Gaussian model described as follows:

Observed random variables are given by $t = Wx + \mu + \epsilon$ where x are latent (hidden) random variables, W is a matrix describing a subspace and $Wx+\mu$ are the latent points on an affine subspace ($\mu$ corresponds to a translation). Finally, $\epsilon$ is an error term, given by a Gaussian random variable with mean 0 and covariance matrix $\sigma^2 I$. The parameters of the model are $W,\mu$ and $\sigma^2$. Equivalently, the observed random variables are themselves Gaussian, described by the distribution $t \sim \mathcal{N}(\mu,WW^T + \psi)$ where $\psi \sim \mathcal{N}(0,\sigma^2I)$. Tipping and Bishop prove an analogy of the Gauss-Markov theorem, namely that the affine subspace given by the maximum likelihood estimates of $W$ and $\mu$ is the PCA subspace (the proof is not difficult but I omit it and refer interested readers to their paper, or Bishop’s Pattern Recognition and Machine Learning book).

It is important to note that although the maximum likelihood estimates of $W,\mu$ in the pPCA model correspond to the PCA subspace, only posterior distributions can be obtained for the latent data (points on the subspace). Neither the mode nor the mean of those distributions corresponds to the PCA points (orthogonal projections of the observations onto the subspace). However what is true, is that the posterior distributions converge to the PCA points as $\sigma^2 \rightarrow 0$. In other words, the relationship between pPCA and PCA is a bit more subtle than that between least squares and regression.

The relationship between regression and (p)PCA is shown in the figure below:

In the figure, points have been generated randomly according to the pPCA model. the black smear shows the affine space on which the points were generated, with the smear indicating the Gaussian distribution used. Subsequently the latent points (light blue on the gray line) were used to make observed points (red) by the addition of isotropic Gaussian noise. The green line is the maximum likelihood estimate for the space, or equivalently by the theorem of Tipping and Bishop the PCA subspace. The projection of the observed points onto the PCA subspace (blue) are the PCA points. The purple line is the least squares line, or equivalently the affine space obtained by regression (y observed as a noisy function of x). The pink line is also a regression line, except where is observed as a noisy function of y.

A natural question to ask is why the probabilistic interpretation of PCA (pPCA) is useful or necessary? One reason it is beneficial is that maximum likelihood inference for pPCA involves hidden random variables, and therefore the EM algorithm immediately comes to mind as a solution (the strategy was suggested by both Tipping & Bishop and Roweis). I have not yet discussed how to find the PCA subspace, and the EM algorithm provides an intuitive and direct way to see how it can be done, without the need for writing down any linear algebra:

The exact version of the EM shown above is due to Roweis. In it, one begins with a random affine subspace passing through the centroid of the points. The “E” step (expectation) consists of projecting the points to the subspace. The projected points are considered fixed to the subspace. The “M” step (maximization) then consists of rotating the space so that the total squared distance of the fixed points on the subspace to the observed points is minimized. This is repeated until convergence. Roweis points out that this approach to finding the PCA subspace is equivalent to power iteration for (efficiently) finding eigenvalues of the the sample covariance matrix without computing it directly. This is our first use of the word eigenvalue in describing PCA, and we elaborate on it, and the linear algebra of computing PCA subspaces later in the post.

Another point of note is that pPCA can be viewed as a special case of factor analysis, and this connection provides an immediate starting point for thinking about generalizations of PCA. Specifically, factor analysis corresponds to the model $t \sim \mathcal{N}(\mu,WW^T + \psi)$ where the covariance matrix $\psi$ is less constrained, and only required to be diagonal. This is connected to a comment made above about when the PCA subspace might be more useful as a linear fit to data than regression. To reiterate, unlike physics, where some coordinate measurements have very little noise in comparison to others, biological measurements are frequently noisy in all coordinates. In such settings factor analysis is preferable, as the variance in each coordinate is estimated as part of the model. PCA is perhaps a good compromise, as PCA subspaces are easier to find than parameters for factor analysis, yet PCA, via its pPCA interpretation, accounts for noise in all coordinates.

A final comment about pPCA is that it provides a natural framework for thinking about hypothesis testing. The book Statistical Methods: A Geometric Approach by Saville and Wood is essentially about (the geometry of) pPCA and its connection to hypothesis testing. The authors do not use the term pPCA but their starting point is exactly the linear Gaussian model of Tipping and Bishop. The idea is to consider single samples from n independent identically distributed independent Gaussian random variables as one single sample from a high-dimensional multivariate linear Gaussian model with isotropic noise. From that point of view pPCA provides an interpretation for Bessel’s correction. The details are interesting but tangential to our focus on PCA.

We are therefore ready to return to Jeopardy!, where we have “Normality for $600″ with the answer “A generalization of linear regression in which the Gaussian noise is isotropic” and the question “What is PCA?” 3. An orthogonal projection of points onto an affine space that maximizes the retained sample variance. In the previous two interpretations of PCA, the focus was on the PCA affine subspace. However in many uses of PCA the output of interest is the projection of the given points onto the PCA affine space. The projected points have three useful related interpretations: 1. As seen in in section 1, the (orthogonally) projected points (red -> blue) are those whose total squared distance to the observed points is minimized. 2. What we focus on in this section, is the interpretation that the PCA subspace is the one onto which the (orthogonally) projected points maximize the retained sample variance. 3. The topic of the next section, namely that the squared distances between the (orthogonally) projected points are on average (in the $l_2$ metric) closest to the original distances between the points. The sample variance of a set of points is the average squared distance from each point to the centroid. Mathematically, if the observed points are translated so that their centroid is at zero (known as zero-centering), and then represented by an $n \times p$ matrix X, then the sample covariance matrix is given by $\frac{1}{n-1}X^tX$ and the sample variance is given by the trace of the matrix. The point is that the jth diagonal entry of $\frac{1}{n-1}X^tX$ is just $\frac{1}{n-1}\sum_{i=1}^n (x^i_j)^2$, which is the sample variance of the jth variable. The PCA subspace can be viewed as that subspace with the property that the sample variance of the projections of the observed points onto the subspace is maximized. This is easy to see from the figure above. For each point (blue), Pythagoras’ theorem implies that $d(red,blue)^2+d(blue,green)^2 = d(red,green)^2$. Since the PCA subspace is the one minimizing the total squared red-blue distances, and since the solid black lines (red-green distances) are fixed, it follows that the PCA subspace also maximizes the total squared green-blue distances. In other words, PCA maximizes the retained sample variance. The explanation above is informal, and uses a 1-dimensional PCA subspace in dimension 2 to make the argument. However the argument extends easily to higher dimension, which is typically the setting where PCA is used. In fact, PCA is typically used to “visualize” high dimensional points by projection into dimensions two or three, precisely because of the interpretation provided above, namely that it retains the sample variance. I put visualize in quotes because intuition in two or three dimensions does not always hold in high dimensions. However PCA can be useful for visualization, and one of my favorite examples is the evidence for genes mirroring geography in humans. This was first alluded to by Cavalli-Sforza, but definitively shown by Lao et al., 2008, who analyzed 2541 individuals and showed that PCA of the SNP matrix (approximately) recapitulates geography: Genes mirror geography from Lao et al. 2008: (Left) PCA of the SNP matrix (2541 individuals x 309,790 SNPs) showing a density map of projected points. (Right) Map of Europe showing locations of the populations . In the picture above, it is useful to keep in mind that the emergence of geography is occurring in that projection in which the sample variance is maximized. As far as interpretation goes, it is useful to look back at Cavalli-Sforza’s work. He and collaborators who worked on the problem in the 1970s, were unable to obtain a dense SNP matrix due to limited technology of the time. Instead, in Menozzi et al., 1978 they performed PCA of an allele-frequency matrix, i.e. a matrix indexed by populations and allele frequencies instead of individuals and genotypes. Unfortunately they fell into the trap of misinterpreting the biological meaning of the eigenvectors in PCA. Specifically, they inferred migration patterns from contour plots in geographic space obtained by plotting the relative contributions from the eigenvectors, but the effects they observed turned out to be an artifact of PCA. However as we discussed above, PCA can be used quantitatively via the stochastic process for which it solves maximum likelihood inference. It just has to be properly understood. To conclude this section in Jeopardy! language, we have “Normality for$800″ with the answer “A set of points in an affine space obtained via projection of a set of given points so that the sample variance of the projected points is maximized” and the question “What is PCA?”

4. Principal component analysis of Euclidean distance matrices.

In the preceding interpretations of PCA, I have focused on what happens to individual points when projected to a lower dimensional subspace, but it is also interesting to consider what happens to pairs of points. One thing that is clear is that if a pair of points are projected orthogonally onto a low-dimensional affine subspace then the distance between the points in the projection is smaller than the original distance between the points. This is clear because of Pythagoras’ theorem, which implies that the squared distance will shrink unless the points are parallel to the subspace in which case the distance remains the same. An interesting observation is that in fact the PCA subspace is the one with the property where the average (or total) squared distances between the points is maximized. To see this it again suffices to consider only projections onto one dimension (the general case follows by Pythagoras’ theorem). The following lemma, discussed in my previous blog post, makes the connection to the previous discussion:

Lemma: Let $x_1,\ldots,x_n$ be numbers with mean $\overline{x} = \frac{1}{n}\sum_i x_i$. If the average squared distance between pairs of points is denoted $D = \frac{1}{n^2}\sum_{i,j} (x_i-x_j)^2$ and the variance is denoted $V=\frac{1}{n}\sum_i (x_i-\overline{x})^2$ then $V=\frac{1}{2}D$.

What the lemma says is that the sample variance is equal to the average squared difference between the numbers (i.e. it is a scalar multiple that does not depend on the numbers). I have already discussed that the PCA subspace maximizes the retained variance, and it therefore follows that it also maximizes the average (or total) projected squared distance between the points. Alternately, PCA can be interpreted as minimizing the total (squared) distance that is lost, i,e. if the original distances between the points are given by a distance matrix $D$ and the projected distances are given by $\tilde{D}$, then the PCA subspace minimizes $\sum_{ij} (D^2_{ij} - \tilde{D}^2_{ij})$, where each term in the sum is positive as discussed above.

This interpretation of PCA leads to an interesting application of the method to (Euclidean) distance matrices rather than points. The idea is based on a theorem of Isaac Schoenberg that characterizes Euclidean distance matrices and provides a method for realizing them. The theorem is well-known to structural biologists who work with NMR, because it is one of the foundations used to reconstruct coordinates of structures from distance measurements. It requires a bit of notation: is a distance matrix with entries $d_{ij}$ and $\Delta$ is the matrix with entries $-\frac{1}{2}d^2_{ij}$. ${\bf 1}$ denotes the vector of all ones, and ${\bf s}$ denotes a vector.

Theorem (Schoenberg, 1938): A matrix D is a Euclidean distance matrix if and only if the matrix $B=(I-{\bf 1}{\bf s}')\Delta(I-{\bf s}{\bf 1}')$ is positive semi-definite where ${\bf s}'{\bf 1} = 1$.

For the case when ${\bf s}$ is chosen to be a unit vector, i.e. all entries are zero except one of them equal to 1, the matrix B can be viewed as the Gromov transform (known as the Farris transform in phylogenetics) of the matrix with entries $d^2_{ij}$. Since the matrix is positive semidefinite it can be written as $B=XX^t$, where the matrix X provides coordinates for points that realize D. At this point PCA can be applied resulting in a principal subspace and points on it (the orthogonal projections of X). A point of note is that eigenvectors of $XX^t$ can be computed directly, avoiding the need to compute $X^tX$ which may be a larger matrix if $n < p$.

The procedure just described is called classic multidimensional scaling (MDS) and it returns a set of points on a Euclidean subspace with distance matrix $\tilde{D}$ that best represent the original distance matrix D in the sense that $\sum_{ij} (D^2_{ij} - \tilde{D}^2_{ij})$ is minimized. The term multidimensional scaling without the “classic” has taken on an expanded meaning, namely it encapsulates all methods that seek to approximately realize a distance matrix by points in a low dimensional Euclidean space. Such methods are generally not related to PCA, but classic multidimensional scaling is PCA. This is a general source of confusion and error on the internet. In fact, most articles and course notes I found online describing the connection between MDS and PCA are incorrect. In any case classic multidimensional scaling is a very useful instance of PCA, because it extends the utility of the method to cases where points are not available but distances between them are.

Now we return to Jeopardy! one final time with the final question in the category: “Normality for $1000″. The answer is “Principal component analysis of Euclidean distance matrices” and the question is “What is classic multidimensional scaling?” An example To illustrate the interpretations of PCA I have highlighted, I’m including an example in R inspired by an example from another blog post (all commands can be directly pasted into an R console). I’m also providing the example because missing in the discussion above is a description of how to compute PCA subspaces and the projections of points onto them. I therefore explain some of this math in the course of working out the example: First, I generate a set of points (in $\mathbb{R}^2$). I’ve chosen a low dimension so that pictures can be drawn that are compatible with some of the examples above. Comments following commands appear after the # character.  set.seed(2) #sets the seed for random number generation. x <- 1:100 #creates a vector x with numbers from 1 to 100 ex <- rnorm(100, 0, 30) #100 normally distributed rand. nos. w/ mean=0, s.d.=30 ey <- rnorm(100, 0, 30) # " " y <- 30 + 2 * x #sets y to be a vector that is a linear function of x x_obs <- x + ex #adds "noise" to x y_obs <- y + ey #adds "noise" to y P <- cbind(x_obs,y_obs) #places points in matrix plot(P,asp=1,col=1) #plot points points(mean(x_obs),mean(y_obs),col=3, pch=19) #show center At this point a full PCA analysis can be undertaken in R using the command “prcomp”, but in order to illustrate the algorithm I show all the steps below:  M <- cbind(x_obs-mean(x_obs),y_obs-mean(y_obs))#centered matrix MCov <- cov(M) #creates covariance matrix Note that the covariance matrix is proportional to the matrix$M^tM$. Next I turn to computation of the principal axes:  eigenValues <- eigen(MCov)$values       #compute eigenvalues
eigenVectors <- eigen(MCov)$vectors #compute eigenvectors The eigenvectors of the covariance matrix provide the principal axes, and the eigenvalues quantify the fraction of variance explained in each component. This math is explained in many papers and books so we omit it here, except to say that the fact that eigenvalues of the sample covariance matrix are the principal axes follows from recasting the PCA optimization problem as maximization of the Raleigh quotient. A key point is that although I’ve computed the sample covariance matrix explicitly in this example, it is not necessary to do so in practice in order to obtain its eigenvectors. In fact, it is inadvisable to do so. Instead, it is computationally more efficient, and also more stable to directly compute the singular value decomposition of M. The singular value decomposition of M decomposes it into $M=UDV^t$ where is a diagonal matrix and both $U$ and $V^t$ are orthogonal matrices. I will also not explain in detail the linear algebra of singular value decomposition and its relationship to eigenvectors of the sample covariance matrix (there is plenty of material elsewhere), and only show how to compute it in R:  d <- svd(M)$d          #the singular values
v <- svd(M)$v #the right singular vectors The right singular vectors are the eigenvectors of $M^tM$. Next I plot the principal axes:  lines(x_obs,eigenVectors[2,1]/eigenVectors[1,1]*M[x]+mean(y_obs),col=8) This shows the first principal axis. Note that it passes through the mean as expected. The ratio of the eigenvectors gives the slope of the axis. Next  lines(x_obs,eigenVectors[2,2]/eigenVectors[1,2]*M[x]+mean(y_obs),col=8) shows the second principal axis, which is orthogonal to the first (recall that the matrix $V^t$ in the singular value decomposition is orthogonal). This can be checked by noting that the second principal axis is also  lines(x_obs,-1/(eigenVectors[2,1]/eigenVectors[1,1])*M[x]+mean(y_obs),col=8) as the product of orthogonal slopes is -1. Next, I plot the projections of the points onto the first principal component:  trans <- (M%*%v[,1])%*%v[,1] #compute projections of points P_proj <- scale(trans, center=-cbind(mean(x_obs),mean(y_obs)), scale=FALSE) points(P_proj, col=4,pch=19,cex=0.5) #plot projections segments(x_obs,y_obs,P_proj[,1],P_proj[,2],col=4,lty=2) #connect to points The linear algebra of the projection is simply a rotation followed by a projection (and an extra step to recenter to the coordinates of the original points). Formally, the matrix M of points is rotated by the matrix of eigenvectors W to produce $T=MW$. This is the rotation that has all the optimality properties described above. The matrix T is sometimes called the PCA score matrix. All of the above code produces the following figure, which should be compared to those shown above: There are many generalizations and modifications to PCA that go far beyond what has been presented here. The first step in generalizing probabilistic PCA is factor analysis, which includes estimation of variance parameters in each coordinate. Since it is rare that “noise” in data will be the same in each coordinate, factor analysis is almost always a better idea than PCA (although the numerical algorithms are more complicated). In other words, I just explained PCA in detail, now I’m saying don’t use it! There are other aspects that have been generalized and extended. For example the Gaussian assumption can be relaxed to other members of the exponential family, an important idea if the data is discrete (as in genetics). Yang et al. 2012 exploit this idea by replacing PCA with logistic PCA for analysis of genotypes. There are also many constrained and regularized versions of PCA, all improving on the basic algorithm to deal with numerous issues and difficulties. Perhaps more importantly, there are issues in using PCA that I have not discussed. A big one is how to choose the PCA dimension to project to in analysis of high-dimensional data. But I am stopping here as I am certain no one is reading at this far into the post anyway… The take-home message about PCA? Always be thinking when using it! Acknowledgment: The exposition of PCA in this post began with notes I compiled for my course MCB/Math 239: 14 Lessons in Computational Genomics taught in the Spring of 2013. I thank students in that class for their questions and feedback. None of the material presented in class was new, but the exposition was intended to clarify when PCA ought to be used, and how. I was inspired by the papers of Tipping, Bishop and Roweis on probabilistic PCA in the late 1990s that provided the needed statistical framework for its understanding. Following the class I taught, I benefited greatly from conversations with Nicolas Bray, Brielin Brown, Isaac Joseph and Shannon McCurdy who helped me to further frame PCA in the way presented in this post. The Habsburg rulership of Spain ended with an inbreeding coefficient of F=0.254. The last king, Charles II (1661-1700), suffered an unenviable life. He was unable to chew. His tongue was so large he could not speak clearly, and he constantly drooled. Sadly, his mouth was the least of his problems. He suffered seizures, had intellectual disabilities, and was frequently vomiting. He was also impotent and infertile, which meant that even his death was a curse in that his lack of heirs led to a war. None of these problems prevented him from being married (twice). His first wife, princess Henrietta of England, died at age 26 after becoming deeply depressed having being married to the man for a decade. Only a year later, he married another princess, 23 year old Maria Anna of Neuberg. To put it mildly, his wives did not end up living the charmed life of Disney princesses, nor were they presumably smitten by young Charles II who apparently aged prematurely and looked the part of his horrific homozygosity. The princesses married Charles II because they were forced to. Royals organized marriages to protect and expand their power, money and influence. Coupled to this were primogeniture rules which ensured that the sons of kings, their own flesh and blood and therefore presumably the best-suited to be in power, would indeed have the opportunity to succeed their fathers. The family tree of Charles II shows how this worked in Spain: It is believed that the inbreeding in Charles II’s family led to two genetic disorders, combined pituitary hormone deficiency and distal renal tubular acidosis, that explained many of his physical and mental problems. In other words, genetic diversity is important, and the point of this blog post is to highlight the fact that diversity is important in education as well. The problem of inbreeding in academia has been studied previously, albeit to a limited extent. One interesting article is Navel Grazing: Academic Inbreeding and Scientific Productivity by Horta et al published in 2010 (my own experience with an inbred academic from a department where 39% of the faculty are self-hires anecdotally confirms the claims made in the paper). But here I focus on the downsides of inbreeding of ideas rather than of faculty. For example home-schooling, the educational equivalent of primogeniture, can be fantastic if the parents happen to be good teachers, but can fail miserably if they are not. One thing that is guaranteed in a school or university setting is that learning happens by exposure to many teachers (different faculty, students, tutors, the internet, etc.) Students frequently complain when there is high variance in teaching quality, but one thing such variance ensures is that is is very unlikely that any student is exposed only to bad teachers. Diversity in teaching also helps to foster the development of new ideas. Different teachers, by virtue of insight or error, will occasionally “mutate” ideas or concepts for better or for worse. In other words, one does not have to fully embrace the theory of memes to acknowledge that there are benefits to variance in teaching styles, methods and pedagogy. Conversely, there is danger in homogeneity. This brings me to MOOCs. One of the great things about MOOCs is that they reach millions of people. Udacity claims it has 1.6 million “users” (students?). Coursera claims 7.1 million. These companies are greatly expanding the accessibility of education. Starving children in India can now take courses in mathematical methods for quantitative finance, and for the first time in history, a president of the United States can discreetly take a freshman course on economics together with its high school algebra prerequisites (highly recommended). But when I am asked whether I would be interested in offering a MOOC I hesitate, paralyzed at the thought that any error I make would immediately be embedded in the brains of millions of innocent victims. My concern is this: MOOCs can greatly reduce the variance in education. For example, Coursera currently offers 641 courses, which means that each courses is or has been taught to over 11,000 students. Many college courses may have less than a few dozen students, and even large college courses rarely have more than a few hundred students. This means that on average, through MOOCs, individual professors reach many more (2 orders of magnitude!) students. A great lecture can end up positively impacting a large number of individuals, but at the same time, a MOOC can be a vehicle for infecting the brains of millions of people with nonsense. If that nonsense is then propagated and reaffirmed via the interactions of the people who have learned it from the same source, then the inbreeding of ideas has occurred. I mention MOOCs because I was recently thinking about intuition behind Bessel’s correction replacing n with n-1 in the formula for sample variance. Formally, Bessel’s correction replaces the biased formula $s^2_n = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^2$ for estimating the variance of a random variable from samples $x_1,\ldots,x_n$ with $s^2_{n-1} = \frac{1}{n-1} \sum_{i=1}^n (x_i-\overline{x})^2$. The switch from to n-1 is a bit mysterious and surprising, and in introductory statistics classes it is frequently just presented as a “fact”. When an explanation is provided, it is usually in the form of algebraic manipulation that establishes the result. The issue came up as a result of a blog post I’m writing about principal components analysis (PCA), and I thought I would check for an intuitive explanation online. I googled “intuition sample variance” and the top link was a MOOC from the Khan Academy: The video has over 51,000 views with over 100 “likes” and only 6 “dislikes”. Unfortunately, in this case, popularity is not a good proxy for quality. Despite the title promising “review” and “intuition” for “why we divide by n-1 for the unbiased sample variance” there is no specific reason given why is replaced by n-1 (as opposed to another correction). Furthermore, the intuition provided has to do with the fact that $x_i-\overline{x}$ underestimates $x_i-\mu$ (where $\mu$ is the mean of the random variable and $\overline{x}$ is the sample mean) but the explanation is confusing and not quantitative (which it can easily be). In fact, the wikipedia page for Bessel’s correction provides three different mathematical explanations for the correction together with the intuition that motivates them, but it is difficult to find with Google unless one knows that the correction is called “Bessel’s correction”. Wikipedia is also not perfect, and this example is a good one for why teaching by humans is important. Among the three alternative derivations, I think that one stands out as “better” but one would not know by just looking at the wikipedia page. Specifically, I refer to “Alternate 1″ on the wikipedia page, that is essentially explaining that variance can be rewritten as a double sum corresponding to the average squared distance between points and the diagonal terms of the sum are zero in expectation. An explanation of why this fact leads to the n-1 in the unbiased estimator is as follows: The first step is to notice that the variance of a random variable is equal to half of the expected squared difference of two independent identically distributed random variables of that type. Specifically, the definition of variance is: $var(X) = \mathbb{E}(X - \mu)^2$ where $\mu = \mathbb{E}(X)$. Equivalently, $var(X) = \mathbb{E}(X^2) -\mu^2$. Now suppose that Y is another random variable identically distributed to X and with X,Y independent. Then $\mathbb{E}(X-Y)^2 = 2 var(X)$. This is easy to see by using the fact that $\mathbb{E}(X-Y)^2 = \mathbb{E}(X^2) + \mathbb{E}(Y^2) - 2\mathbb{E}(X)\mathbb{E}(Y) = 2\mathbb{E}(X^2)-2\mu^2$. This identity motivates a rewriting of the (uncorrected) sample variance $s_n$ in a way that is computationally less efficient, but mathematically more insightful: $s_n = \frac{1}{2n^2} \sum_{i,j=1}^n (x_i-x_j)^2$. Of note is that in this summation exactly n of the terms are zero, namely the terms when i=j. These terms are zero independently of the original distribution, and remain so in expectation thereby biasing the estimate of the variance, specifically leading to an underestimate. Removing them fixes the estimate and produces $s_{n-1}^2 = \frac{1}{2n(n-1)} \sum_{i,j=1, i \neq j}^n (x_i-x_j)^2$. It is easy to see that this is indeed Bessel’s correction. In other words, the correction boils down to the fact that $n^2-n = n(n-1)$, hence the appearance of n-1. Why do I like this particular derivation of Bessel’s correction? There are two reasons: first, n-1 emerges naturally and obviously from the derivation. The denominator in $s_{n-1}^2$ matches exactly the number of terms being summed, so that it can be understood as a true average (this is not apparent in its standard form as $s_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-\overline{x})^2$. There is really nothing mysterious anymore, its just that some terms having been omitted from the sum because they were non-inofrmative. Second, as I will show in my forthcoming blog post on PCA, the fact that the variance of a random variable is half of the expectation of the squared difference of two instances, is key to understanding the connection between multi-dimensional scaling (MDS) and PCA. In other words, as my student Nicolas Bray is fond of saying, although most people think a proof is either right or wrong, in fact some proofs are more right than others. The connection between Bessel’s correction and PCA goes even deeper: as explained by Saville and Wood in their book Statistical Methods: A Geometric Approach n-1 can be understood to be a reduction in one dimension from the point of view of probabilistic PCA (Saville and Wood do not explicitly use the term probabilistic PCA but as I will explain in my PCA post it is implicit in their book). Finally, there are many subtleties to Bessel’s correction, for example it is an unbiased estimator for variance and not standard deviation. These issues ought to be mentioned in a good lecture about the topic. In other words, the Khan lecture is neither necessary nor sufficient, but unlike a standard lecture where the damage is limited to a small audience of students, it has been viewed more than 50,000 times and those views cannot be unviewed. In writing this blog post I pondered the irony of my call for added diversity in teaching while I preach my own idea (this post) to a large number of readers via a medium designed for maximal outreach. I can only ask that others blog as well to offer alternative points of view :) and that readers inform themselves on the issues I raise by fact-checking elsewhere. As far as the statistics goes, if someone finds the post confusing, they should go and register for one of the many fantastic MOOCs on statistics! But I reiterate that in the rush to MOOCdom, providers must offer diversity in their offerings (even multiple lectures on the same topic) to ensure a healthy population of memes. This is especially true in Spain, where already inbred faculty are now inbreeding what they teach by MOOCing via Miriada X. Half of the MOOCs being offered in Spain originate from just 3 universities, while the number of potential viewers is enormous as Spanish is now the second most spoken language in the world (thanks to Charles II’s great-great-grandfather, Charles I). May Charles II rest in peace. One of my distinct memories from elementary school is going to “library class” to learn about the Dewey decimal classification and how to use a card catalog to find books. Searching for books efficiently was possible because cards in the catalog were sorted lexicographically. It didn’t occur to me at the time, but the system required authors of books to be totally ordered. Without an ordering of authors in a book with multiple authors, there would be no way to decide where to place the card for the book in a catalog searchable by author. The practice of ordering authors on publications is evident in the oldest printed texts and has persisted to this day. I have never thought that it could be any other way. However this past Wednesday I was visiting the University of Washington to deliver a seminar, and among the highlights of the visit was my meeting with the graduate students. I met 12 for lunch and two more came for dinner. Meeting with students is always my favorite part of a visit to a university. They have original and creative ideas, and most importantly, are not bound in their thought by archaic tradition. They frequently don’t know what one is supposed to think and how one is supposed to say it. They just think and speak! One of the students I met on Wednesday was Vanessa Gray, a student of Doug Fowler, who in a conversation on authorship practices suggested to me the radical and brilliant idea that papers should be published without an ordering of authors. Many journals now have a section called “Author contributions” where roles of individuals in collaborative projects can be described (many journals now require such descriptions). So why bother ordering the authors for a list underneath the title? As far as indexing and searching goes, Google and other search engines require only a set of authors, and not a specific ordering. I agree with Vanessa that ending author ordering on publications would greatly improve fairness in the biological sciences, where many current projects involve complex assemblies of teams with complementary skills. “First authorship” is not well-defined when one author performed a large number of difficult experiments, and another developed novel algorithms and wrote complex software for analyzing the experiments. Similarly, “last authorship” fails as a concept when students are co-advised, or one principal investigator provides substantial funding on a project, while another is participating in doing the work. And recently, large consortium projects have completely destroyed any meaning of “author” by having hundreds, or even thousands of authors on projects. Even when there are relatively few authors people rarely credit anyone except the first and last authors, even if others did substantial work. In the recent ENCODE paper published in PNAS with 30 authors, it appears to me from the responses to my previous blog post about the paper that the 5th and 6th authors did a lot (majority?) of the work in putting together figures and results, yet I suspect the “credit” for the paper will go to the first author (the flip side in that case is that the first author is where blame is assigned as well). There is also a theoretical justification for not ordering authors. Ordering of authors on a publication can be thought of as a ranking produced by “votes” of the participants in the project. Of course in practice not all votes are equal. In what is called dictatorship in social choice theory, PIs frequently make the decisions independently of how specific authors feel they may have contributed. This may work on a paper where there is a single PI (although it may be considered unfair by the graduate students and postdocs), however dictatorship as a system for determining authorship certainly breaks down when multiple PIs collaborate on a project. Arrow’s impossibility theorem explains that in the absence of dictatorship, there is a problem in producing a single ordering satisfying two other seemingly basic and essential fairness criteria. Informally, the theorem states that there is no authorship ordering system based on voting of contributing authors that can satisfy the following three criteria: • If every author thinks that X should be ordered before Y, then the author list should have X placed before Y. • For a fixed list of voting preferences regarding the ordering of X vs. Y, the ordering between X and Y in the author list will remain unchanged even if does not depend on the ordering of other pairs such as X and Z, Y and Z, or Z and W. • There is no “dictator”, i.e. no single author possesses the power to determine the author ordering. Authors frequently have differing opinions about the impact of their own contribution to a publication, and therefore their preferences (votes) for author ordering are discordant. This means that any system for ordering authors will not satisfy everyone’s preferences, and in the sense of Arrow’s impossibility theorem will be unfair. One way around Arrow’s impossibility theorem is to specify authorship order without regard to authors’ preferences, for example by always ordering authors alphabetically (the Hardy-Littlewood rule). This method, usually the one used in the mathematical sciences, is also fraught with problems. Of course, listing author contributions for what they are is not entirely trivial. For example, different authors may have conflicting views about what it means to have “written the text of the paper”. But using words to describe contributions allow for much more detail about what each author did, and allows for nuanced contributions to be described (e.g., John and Jane were in the room when the initial idea for the project was discussed, but did not contribute anything afterwards). To summarize, in the modern era of electronic publishing ordering of authors is unnecessary, and if it is unnecessary, then why confront Arrow’s theorem and inevitably produce orderings unfairly? Publications should just explain the author contributions. Time to end ordered authorship. The card catalog at Yale University’s Sterling Memorial Library (from Wikipedia). I was recently reading the latest ENCODE paper published in PNAS when a sentence in the caption of Figure 2 caught my attention: “Depending on the total amount of RNA in a cell, one transcript copy per cell corresponds to between 0.5 and 5 FPKM in PolyA+ whole-cell samples according to current estimates (with the upper end of that range corresponding to small cells with little RNA and vice versa).” Although very few people actually care about ENCODE, many people do care about the interpretation of RNA-Seq FPKM measurements and to them this is likely to be a sentence of interest. In fact, there have been a number of attempts to provide intuitive meaning for RPKM (and FPKM) in terms of copy numbers of transcripts per cell. Even though the ENCODE PNAS paper provides no citation for the statement (or methods section explaining the derivation), I believe its source is the RNA-Seq paper by Mortazavi et al. In that paper, the authors write that “…absolute transcript levels per cell can also be calculated. For example, on the basis of literature values for the mRNA content of a liver cell [Galau et al. 1977] and the RNA standards, we estimated that 3 RPKM corresponds to about one transcript per liver cell. For C2C12 tissue culture cells, for which we know the starting cell number and RNA preparation yields needed to make the calculation, a transcript of 1 RPKM corresponds to approximately one transcript per cell. “ This statement has been picked up on in a number of publications (e.g., Hebenstreit et al., 2011, van Bakel et al., 2011). However the inference of transcript copies per cell directly from RPKM or FPKM estimates is not possible and conversion factors such as 1 RPKM = 1 transcript per cell are incoherent. At the same time, the estimates of Mortazavi et al. and the range provided in the ENCODE PNAS paper are informative. The “incoherence” stems from a subtle issue in the normalization of RPKM/FPKM that I have discussed in a talk I gave at CSHL, and is the reason why TPM is a better unit for RNA abundance. Still, the estimates turn out to be “informative”, in the sense that the effect of (lack of) normalization appears to be smaller than variability in the amount of RNA per cell. I explain these issues below: Why is the sentence incoherent? RNA-Seq can be used to estimate transcript abundances in an RNA sample. Formally, a sample consists of n distinct types of transcripts, and each occurs with different multiplicity (copy number), so that transcript appears $m_i$ times in the sample. By “abundance” we mean the relative amounts $\rho_1,\ldots,\rho_n$ where $\rho_i = \frac{m_i}{\sum_{i=1}^n m_i}$. Note that $0 \leq \rho_i \leq 1$ and $\sum_{i=1}^n \rho_i = 1$. Suppose that $m_j=1$ for some j. The corresponding $\rho_j$ is therefore $\rho_j = \frac{1}{M}$ where $M = \sum_{i=1}^n m_i$. The question is what does this $\rho$ value correspond to in RPKM (or FPKM). RPKM stands for “reads per kilobase of transcript per million reads mapped” and FPKM is the same except with “fragment” replacing read (initially reads were not paired-end, but with the advent of paired-end sequencing it makes more sense to speak of fragments, and hence FPKM). As a unit of measurement for an estimate, what FPKM really refers to is the expected number of fragments per kilboase of transcript per million reads. Formally, if we let $l_i$ be the length of transcript and define $\alpha_i = \frac{\rho_i l_i}{\sum_{j=1}^n \rho_j l_j}$ then abundance in FPKM for transcript is abundance measured as $FPKM_i = \frac{\alpha_i \cdot 10^{6}}{l_i/(10^3)}$. In terms of $\rho$, we obtain that $FPKM_i = \frac{\rho_i \cdot 10^9}{\sum_{j=1}^n \rho_j l_j}$. The term in the denominator can be considered a kind of normalization factor, that while identical for each transcript, depends on the abundances of each transcript (unless all lengths are equal). It is in essence an average of lengths of transcripts weighted by abundance. Moreover, the length of each transcript should be taken to be taken to be its “effective” length, i.e. the length with respect to fragment lengths, or equivalently, the number of positions where fragments can start. The implication for finding a relationship between FPKM and relative abundance constituting one transcript copy per cell is that one cannot. Mathematically, the latter is equivalent to setting $\rho_i = \frac{1}{M}$ in the formula above and then trying to determine $FPKM_i$. Unfortunately, all the remaining $\rho$ are still in the formula, and must be known in order to calculate the corresponding FPKM value. The argument above makes clear that it does not make sense to estimate transcript copy counts per cell in terms of RPKM or FPKM. Measurements in RPKM or FPKM units depend on the abundances of transcripts in the specific sample being considered, and therefore the connection to copy counts is incoherent. The obvious and correct solution is to work directly with the $\rho$. This is the rationale of TPM (transcripts per million) used by Bo Li and Colin Dewey in the RSEM paper (the argument for TPM is also made in Wagner et al. 2012). Why is the sentence informative? Even though incoherent, it turns out there is some truth to the ranges and estimates of copy count per cell in terms of RPKM and FPKM that have been circulated. To understand why requires noting that there are in fact two factors that come into play in estimating the FPKM corresponding to abundance of one transcript copy per cell. First, is M as defined above to be the total number of transcripts in a cell. This depends on the amount of RNA in a cell. Second are the relative abundances of all transcripts and their contribution to the denominator in the $FPKM_i$ formula. The best paper to date on the connection between transcript copy numbers and RNA-Seq measurements is the careful work of Marinov et al. in “From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing” published in Genome Research earlier this year. First of all, the paper describes careful estimates of RNA quantities in different cells, and concludes that (at least for the cells studied in the paper) amounts vary by approximately one order of magnitude. Incidentally, the estimates in Marinov et al. confirm and are consistent with rough estimates of Galau et al. from 1977, of 300,000 transcripts per cell. Marinov et al. also use spike-in measurements are used to conclude that in “GM12878 single cells, one transcript copy corresponds to ∼10 FPKM on average.”. The main value of the paper lies in its confirmation that RNA quantities can vary by an order of magnitude, and I am guessing this factor of 10 is the basis for the range provided in the ENCODE PNAS paper (0.5 to 5 FPKM). In order to determine the relative importance of the denominator in $FPKM_i$ I looked at a few RNA-Seq datasets we are currently examining. In the GEUVADIS data, the weighted average can vary by as much as 20% between samples. In a rat RNA-Seq dataset we are analyzing, the difference is a factor of two (and interestingly very dependent on the exact annotation used for quantification). The point here is that even the denominator in $FPKM_i$ does vary, but less, it seems, than the variability in RNA quantity. In other words, the estimate of 0.5 to 5 FPKM corresponding to one transcript per cell is incoherent albeit probably not too far off. One consequence of all of the above discussion is that while differential analysis of experiments can be performed based on FPKM units (as done for example in Cufflinks, where the normalization factors are appropriately accounted for), it does not make sense to compare raw FPKM values across experiments. This is precisely what is done in Figure 2 of the ENCODE PNAS paper. What the analysis above shows, is that actual abundances may be off by amounts much larger than the differences shown in the figure. In other words, while the caption turns out to to containing an interesting comment the overall figure doesn’t really make sense. Specifically, I’m not sure the relative RPKM values shown in the figure deliver the correct relative amounts, an issue that ENCODE can and should check. Which brings me to the last part of this post… What is ENCODE doing? Having realized the possible issue with RPKM comparisons in Figure 2, I took a look at Figure 3 to try to understand whether there were potential implications for it as well. That exercise took me to a whole other level of ENCODEness. To begin with, I was trying to make sense of the x-axis, which is labeled “biochemical signal strength (log10)” when I realized that the different curves on the plot all come from different, completely unrelated x-axes. If this sounds confusing, it is. The green curves are showing graphs of functions whose domain is in log 10 RPKM units. However the histone modification curves are in log (-10 log p), where p is a p-value that has been computed. I’ve never seen anyone plot log(log(p-values)); what does it mean?! Nor do I understand how such graphs can be placed on a common x-axis (?!). What is “biochemical signal strength” (?) Why in the bottom panel is the grey H3K9me3 showing %nucleotides conserved decreasing as “biochemical strength” is increasing (?!) Why is the green RNA curves showing conservation below genome average for low expressed transcripts (?!) and why in the top panel is the red H3K4me3 an “M” shape (?!) What does any of this mean (?!) What I’m supposed to understand from it, or frankly, what is going on at all ??? I know many of the authors of this ENCODE PNAS paper and I simply cannot believe they saw and approved this figure. It is truly beyond belief… see below: All of these figures are of course to support the main point of the paper. Which is that even though 80% of the genome is functional it is also true that this is not what was meant to be said , and that what is true is that “survey of biochemical activity led to a significant increase in genome coverage and thus accentuated the discrepancy between biochemical and evolutionary estimates… where function is ascertained independently of cellular state but is dependent on environment and evolutionary niche therefore resulting in estimates that differ widely in their false-positive and false-negative rates and the resolution with which elements can be defined… [unlike] genetic approaches that rely on sequence alterations to establish the biological relevance of a DNA segment and are often considered a gold standard for defining function.” The ENCODE PNAS paper was first published behind a paywall. However after some public criticism, the authors relented and paid for it to be open access. This was a mistake. Had it remained behind a paywall not only would the consortium have saved money, I and others might have been spared the experience of reading the paper. I hope the consortium will afford me the courtesy of paywall next time. In reading the news yesterday I came across multiple reports claiming that even casually smoking marijuana can change your brain. I usually don’t pay much attention to such articles; I’ve never smoked a joint in my life. In fact, I’ve never even smoked a cigarette. So even though as a scientist I’ve been interested in cannabis from the molecular biology point of view, and as a citizen from a legal point of view, the issues have not been personal. However reading a USA Today article about the paper, I noticed that the principal investigator Hans Breiter was claiming to be a psychiatrist and mathematician. That is an unusual combination so I decided to take a closer look. I immediately found out the claim was a lie. In fact, the totality of math credentials of Hans Breiter consist of some logic/philosophy courses during a year abroad at St. Andrews while he was a pre-med student at Northwestern. Even being an undergraduate major in mathematics does not make one a mathematician, just as being an undergraduate major in biology does not makes one a doctor. Thus, with his outlandish claim, Hans Breiter had succeeded in personally offending me! So, I decided to take a look at his paper underlying the multiple news reports: This is quite possibly the worst paper I’ve read all year (as some of my previous blog posts show I am saying something with this statement). Here is a breakdown of some of the issues with the paper: ### 1. Study design First of all, the study has a very small sample size, with only 20 “cases” (marijuana users), a fact that is important to keep in mind in what follows. The title uses the term “recreational users” to describe them, and in the press release accompanying the article Breiter says that “Some of these people only used marijuana to get high once or twice a week. People think a little recreational use shouldn’t cause a problem, if someone is doing OK with work or school. Our data directly says this is not the case.” In fact, the majority of users in the study were smoking more than 10 joints per week. There is even a person in the study smoking more than 30 joints per week (as disclosed above, I’m not an expert on this stuff but if 30 joints per week is “recreation” then it seems to me that person is having a lot of fun). More importantly, Breiter’s statement in the press release is a lie. There is no evidence in the paper whatsoever, not even a tiny shred, that the users who were getting high once or twice a week were having any problems. There are also other issues with the study design. For example, the paper claims the users are not “abusing” other drugs, but it is quite possible that they are getting high on cocaine, heroin, or ??? as well, an issue that could quite possibly affect the study. The experiment consisted of an MRI scan of each user/control, but only a single scan was done. Given the variability in MRI scans this also seems problematic. ### 2. Multiple testing The study looked at three aspects of brain morphometry in the study participants: gray matter density, volume and shape. Each of these morphometric analyses constituted multiple tests. In the case of gray matter density, estimates were based on small clusters of voxels, resulting in 123 tests (association of each voxel cluster with marijuana use). Volumes were estimated for four regions: left and right nucleus accumbens and amygdala. Shape was also tested in the same four regions. What the authors should have done is to correct the p-values computed for each of these tests by accounting for the total number of tests performed. Instead, (Bonferroni) corrections were performed separately for each type of analysis. For example, in the volume analysis p-values were required to be less than 0.0125 = 0.05/4. In other words, the extent of testing was not properly accounted for. Even so, many of the results were not significant. For example, the volume analysis showed no significant association for any of the four tested regions. The best case was the left nucleus accumbens (Figure 1C) with a corrected p-value of 0.015 which is over the authors’ own stated required threshold of 0.0125 (see caption). They use the language “The association with drug use, after correcting for 4 comparisons, was determined to be a trend toward significance” to describe this non-effect. It is worth noting that the removal of the outlier at a volume of over $800 mm^3$ would almost certainly flatten the line altogether and remove even the slight effect. It would have been nice to test this hypothesis but the authors did not release any of their data. Figure 1c. In the Fox News article about the paper, Breiter is quoted saying ““For the NAC [nucleus accumbens], all three measures were abnormal, and they were abnormal in a dose-dependent way, meaning the changes were greater with the amount of marijuana used,” Breiter said. “The amygdala had abnormalities for shape and density, and only volume correlated with use. But if you looked at all three types of measures, it showed the relationships between them were quite abnormal in the marijuana users, compared to the normal controls.” The result above shows this to be a lie. Volume did not significantly correlate with use. This is all very bad, but things get uglier the more one looks at the paper. In the tables reporting the p-values, the authors do something I have never seen before in a published paper. They report the uncorrected p-values, indicating those that are significant (prior to correction) in boldface, and then put an asterisk next to those that are significant after their (incomplete) correction. I realize my own use of boldface is controversial… but what they are doing is truly insane. The fact that they put an asterisk next to the values significant after correction indicates they are aware that multiple testing is required. So why bother boldfacing p-values that they know are not significant? The overall effect is an impression that more tests are significant than is actually the case. See for yourself in their Table 4: Table 4. The fact that there are multiple columns is also problematic. Separate tests were performed for smoking occasions per day, joints per occasion, joints per week and smoking days per week. These measures are highly correlated, but even so multiply testing them requires multiple test correction. The authors simply didn’t perform it. They say “We did not correct for the number of drug use measures because these measures tend not be independent of each other”. In other words, they multiplied the number of tests by four, and chose to not worry about that. Unbelievable. Then there is Table 5, where the authors did not report the p-values at all, only whether they were significant or not… without correction: Table 5. ### 3. Correlation vs. causation This issue is one of the oldest in the book. There is even a wikipedia entry about itCorrelation does not imply causation. Yet despite the fact the every result in the paper is directed at testing for association, in the last sentence of the abstract they say “These data suggest that marijuana exposure, even in young recreational users, is associated with exposure-dependent alterations of the neural matrix of core reward structures and is consistent with animal studies of changes in dendritic arborization.” At a minimum, such a result would require doing a longitudinal study. Breiter takes this language to an extreme in the press release accompanying the article. I repeat the statement he made that I quoted above where I boldface the causal claim: “”Some of these people only used marijuana to get high once or twice a week. People think a little recreational use shouldn’t cause a problem, if someone is doing OK with work or school. Our data directly says this is not the case.” I believe that scientists should be sanctioned for making public statements that directly contradict the content of their papers, as appears to be the case here. There is precedent for this. A few years ago after the birth of our second daughter and in anticipation of our third, I started designing a one-room addition for our house. One of the problems I faced was figuring out the shape of the roof. I learned of the concept of the straight skeleton of a polygon, first defined by Oswin Aichholzer and Franz Aurenhammer in a book chapter “Straight Skeletons for General Polygonal Figures in the Plane” in 1996. Wikipedia provides an intuitive definition: “The straight skeleton of a polygon is defined by a continuous shrinking process in which the edges of the polygon are moved inwards parallel to themselves at a constant speed. As the edges move in this way, the vertices where pairs of edges meet also move, at speeds that depend on the angle of the vertex. If one of these moving vertices collides with a nonadjacent edge, the polygon is split in two by the collision, and the process continues in each part. The straight skeleton is the set of curves traced out by the moving vertices in this process. In the illustration the top figure shows the shrinking process and the middle figure depicts the straight skeleton in blue.” but the concept is best understood by picture: The fact that straight skeletons fit “symmetrically” into the polygons that generated them, made me think about whether they could constitute aesthetic representations of phylogenetic trees. So I asked the inverse question: given a phylogenetic tree, i.e. a graph that is a tree with weighted edges, together with a cyclic orientation on its vertices, is there a convex polygon such that the tree is the straight skeleton of that polygon? A few google searches didn’t reveal anything, but fortunately and coincidentally, Satyan Devadoss, who is a topologist and computational geometer was visiting my group on his sabbatical (2009–2010). Now, a few years later, Satyan and coauthors have written a paper providing a partial answer to my question. Their paper is about to appear in the next issue of Discrete Applied Mathematics: The main theorem is formally about ribbon trees: Definition. A ribbon tree is a tree (a connected graph with no cycles) for which each edge is assigned a nonnegative length, each internal vertex has degree at least three, and the edges incident to each vertex are cyclically ordered. The authors prove the interesting result that there exists only a finite set of planar embeddings of a tree appearing as straight skeletons of convex polygons. Specifically, they show that: Theorem. A ribbon tree with n leaves has at most 2n−5 suitable convex polygons. Its fun to work out by hand the case of a star tree with three leaves: The algebra works out to solving a cubic system of three equations that can be seen to have one unique positive solution (Lemma 5 in the paper). The proof of the main theorem relies on some elementary trigonometry and algebra, as well as an interesting analogy of Cauchy’s “arm lemma” (if one increases one of the angles of a convex polygonal chain, then the distance between the endpoints will only increase). Furthermore, a few interesting cases and connections are pointed out along the way. For example, some ribbons, even caterpillar ribbons, are not realized by any polygon. There is also a related conference publication by many of the same authors, although including Aichholzer himself, that provides some interesting constructions in special cases: • Oswin Aichholzer, Howard Cheng, Satyan L. Devadoss, Thomas Hackl, Stefan Huber, Brian Li, Andrej Risteski, What makes a tree a straight skeleton?, Canadian Conference on Computational Geometry, 2012. Straight skeletons may yet find application in drawing phylogenetic trees, but for now the best out there are radial or circular representations optimizing various layout considerations. My addition is now complete and the roof is absolutely beautiful. The roof of the addition Reproducibility has become a major issue in scientific publication, it is under scrutiny by many, making headlines in the news, is on the minds of journals, and there are guidelines for achieving it. Reproducibility is certainly important for scientific research, but I think that lost in the debate is the notion of usability. Reproducibility only ensures that the results of a paper can be recreated by others, but usability ensures that researchers can build on scientific work, and explore hypotheses and ideas unforeseen by authors of the original papers. Here I describe a case study in reproducibility and usability, that emerged from previous post I wrote about the paper Feizi et al. describe a method called network deconvolution, that they claim improves the inference results for 8 out of 9 network inference methods, out of the 35 that were tested in the DREAM5 challenge. In DREAM5, participants were asked to examine four chip-based gene x expression matrices, and were also provided a list of transcription factors for each. They were then asked to provide ranked lists of transcription factor – gene interactions for each of the four datasets. The four datasets consisted of one computer simulation (the “in silico” dataset) and expression measurements in E. coli, S. cerevisiae and S. aureus. The consortium received submission from 29 different groups and ran 6 other “off-the-shelf” methods, while also developing its own “community” method (for a total of 36=29+6+1). The community method consisted of applying the Borda count to the 35 methods being tested, to produce a new consensus, or community, network. Nicolas Bray and I tried to replicate the results of Feizi et al. so that we could test for ourselves the performance of network deconvolution with different parameters and on other DREAM5 methods ( Feizi et al. tested only 9 methods; there were 36 in total). But despite contacting the authors for help we were unable to do so. In desperation, I even offered$100 for someone to replicate all of the figures in the paper. Perhaps as a result of my blogging efforts, or possibly due to a spontaneous change of heart, the authors finally released some of the code and data needed to reproduce some of the figures in their paper. In particular, I am pleased to say that the released material is sufficient to almost replicate Figure 2 of their paper which describes their results on a portion of the DREAM5 data. I say almost because the results for one of the networks is off, but to the authors credit it does appear that the distributed data and code are close to what was used to produce the figure in the paper (note: there is still not enough disclosure to replicate all of the figures of the paper, including the suspicious Figure S4 before and after revision of the supplement, and I am therefore not yet willing to concede the \$100). What Feizi et al. did accomplish was to make their methods usable. That is to say, with the distributed code and data I was able to test the method with different parameters and on new datasets. In other words, Feizi et al. is still not completely reproducible, but it is usable. In this post, I’ll demonstrate why usability is important, and make the case that it is too frequently overlooked or confused with reproducibility. With usable network deconvolution code in hand, I was finally able to test some of the claims of Feizi et al.  First, I identify the relationship between the DREAM methods and the methods Feizi et al. applied network deconvolution to. In the figure below, I have reproduced Figure 2 from Feizi et al. together with Figure 2 from Marbach et al.:

Figure 2 from Feizi et al. aligned to Figure 2 from Marbach et al.

The mapping is more complex than appears at first sight. For example, in the case of Spearman correlation (method Corr #2 in Marbach et al., method #5 in Feizi et al.), Feizi et al. ran network deconvolution on the method after taking absolute values. This makes no sense, as throwing away the sign is to throw away a significant amount of information, not to mention it destroys any hope of connecting the approach to the intuition of inferring directed interactions from the observed via the idealized “model” described in the paper. On the other hand, Marbach et al. evaluated Spearman correlation with sign. Without taking the absolute value before evaluation negative edges, strong (negative) interactions, are ignored. This is the reason for the very poor performance of Spearman correlation and the reason for the discrepancy in bar heights between Marbach et al. and Feizi et al. for that method. The caption of Figure 2 in Feizi et al. begins “Network deconvolution applied to the inferred networks of top-scoring methods [1] from DREAM5..” This is obviously not true. Moreover, one network they did not test on was the community network of Marbach et al. which was the best method and the point of the whole paper. However the methods they did test on were ranked 2,3,4,6,8,12,14,16,28 (out of 36 methods). The 10th “community” method of Feizi et al. is actually the result of applying the community approach to the ND output from all the methods, so it is not in and of itself a test of ND. Of the nine tested methods, arguably only a handful were “top” methods. I do think its sensible to consider “top” to be the best methods for each category (although Correlation is so poor I would discard it altogether). That leaves four top methods. So instead of creating the illusion that network deconvolution improves 9/10 top scoring methods, what Feizi et al. should have reported is that 3 out of 4 of the top methods that were tested were improved by network deconvolution. That is the result of running network deconvolution with the default parameters. I was curious what happens when using the parameters that Feizi et al. applied to the protein interaction data (alpha=1, beta=0.99). Fortunately, because they have made the code usable, I was able to test this. The overall result as well as the scores on the individual datasets are shown below: The Feizi et al. results on gene regulatory networks using parameters different from the default. The results are very different. Despite the claims of Feizi et al. that network deconvolution is robust to choice of parameters, now only 1 out of 4 of the top methods are improved by network deconvolution. Strikingly, the top three methods tested have their quality degraded. In fact, the top method in two out of the three datasets tested is made worse by network deconvolution. Network deconvolution is certainly not robust to parameter choice. What was surprising to me was the improved performance of network deconvolution on the S. cerevisae dataset, especially for the mutual information and correlation methods. In fact, the improvement of network deconvolution over the methods is appears extraordinary. At this point I started to wonder about what the improvements really mean, i.e. what is the “score” that is being measured. The y-axis, called the “score” by Feizi et al. and Marbach et al. seemed to be changing drastically between runs. I wondered… what exactly is the score? What do the improvements mean? It turns out that “score” is defined as follows:

$score = \frac{1}{2} ( \frac{1}{3} \sum_{i=1}^3 -log_{10} p_{AUROC,i} + \frac{1}{3} \sum_{i=1}^3 -log_{10} p_{AUPR,i})$.

This formula requires some untangling: First of all, AUROC is shorthand for area under the ROC (receiver operator curve), and AUPR for area under the PR (precision recall) curve. For context, ROC is a standard concept in engineering/statistics. Precision and recall are used frequently, but the PR curve is used much less than ROC . Both are measures for judging the quality of a binary classifier. In the DREAM5 setting, this means the following: there is a gold standard of “positives”, namely a set of edges in a network that should be predicted by a method, and the remainder of the edges will be considered “negatives”, i.e. they should not be predicted. A method generates a list of edges, sorted (ranked) in some way. As one proceeds through the list, one can measure the fraction of positives and false positives predicted. The ROC and PR curves measure the performance. A ROC is simply a plot showing the true positive rate for a method as a function of the false positive rate. Suppose that there are positives in the gold standard out of a goal of edges. If one examines the top k predictions of a method, then among them there will be t “true” positives as well as k-t “false” positives. This will result in a single point on the ROC, i.e. the point ($\frac{k-t}{n-m},\frac{t}{m}$). This can be confusing at first glance for a number of reasons. First, the points do not necessarily form a function, e.g. there can be points with the same x-coordinate. Second, as one varies one obtains a set of points, not a curve. The ROC is a curve, and is obtained by taking the envelope of all of the points for $k \in \{1,\ldots,n\}$. The following intuition is helpful in understanding ROC:

1. The x coordinate in the ROC is the false positive rate. If one doesn’t make any predictions of edges at all, then the false positive rate is 0 (in the notation above k=0, t=0). On the other hand, if all edges are considered to be “true”, then the false positive rate is 1 and the corresponding point on the ROC is (1,1), which corresponds to k=n, t=m.
2. If a method has no predictive power, i.e. the ranking of the edges tells you nothing about which edges really are true, then the ROC is the line y=x. This is because lack of predictive power means that truncating the list at any k, results in the same proportion of true positives above and below the kth edge. And a simple calculation shows that this will correspond to the point ($\frac{k}{n},{k}{n}$) on the ROC curve.
3. ROC curves can be summarized by a single number that has meaning: the area under the ROC (AUROC). The observation above means that a method without any predictive power will have an AUROC of 1/2. Similarly, a “perfect” method, where he true edges are all ranked at the top will have an AUROC of 1. AUROC is widely used to summarize the content of a ROC curve because it has an intuitive meaning: the AUROC is the probability that if a positive and a negative edge are each picked at random from the list of edges, the positive will rank higher than the negative.

An alternative to ROC is the precision-recall curve. Precision, in the mathematics notation above, is the value $\frac{t}{k}$, i.e., the number of true positives divided by the number of true positives plus false positives. Recall is the same as sensitivity, or true positive rate: it is $\frac{t}{m}$. In other words, the PR curve contains the points ($\frac{t}{m},\frac{t}{k}$), as recall is usually plotted on the x-axis. The area under the precision-recall curve (AUPR) has an intuitive meaning just like AUROC. It is the average of precision across all recall values, or alternatively, the probability that if a “positive” edge is selected from the ranked list of the method, then an edge above it on the list will be “positive”. Neither precision-recall curves, nor AUPR are widely used. There is one problem with AUPR, which is that its value is dependent on the number of positive examples in the dataset. For this reason, it doesn’t make sense to average AUPR across datasets (while it does make sense for AUROC). For all of these reasons, I’m slightly uncomfortable with AUPR but that is not the main issue in the DREAM5 analysis. I have included an example of ROC and PR curves below. I generated them for the method “GENIE3″ tested by Feizi et al.. This was the method with the best overall score. The figure below is for the S. cerevisiae dataset:

The ROC and a PR curves before (top) and after (bottom) applying network deconvolution to the GENIE3 network.

The red curve in the ROC plots is what one would see for a method without any predictive power (point #2 above). In this case, what the plot shows is that GENIE3 is effectively ranking the edges of the network randomly. The PR curve is showing that at all recall levels there is very little precision. The difference between GENIE3 before and after network deconvolution is so small, that it is indistinguishable in the plots. I had to create separate plots before and after network deconvolution because the curves literally overlapped and were not visible together. The conclusion from plots such as these, should not be that there is statistically significance (in the difference between methods with/without network deconvolution, or in comparison to random), but rather that there is negligible effect. There is a final ingredient that is needed to constitute “score”. Instead of just averaging AUROC and AUPR, both are first converted into p-values that measure the statistical significance of the method being different from random. The way this was done was to create random networks from the edges of the 35 methods, and then to measure their quality (by AUROC or AUPR) to obtain a distribution. The p-value for a given method was then taken to be the area under the probability density function to the right of the method’s value. The graph below shows the pdf for AUROC from the S. cerevisae DREAM5 data that was used by Feizi et al. to generate the scores:

Distribution of AUROC for random methods generated from the S. cerevisiae submissions in Marbach et al.

In other words, almost all random methods had an AUROC of around 0.51, so any slight deviation from that was magnified in the computation of p-value, and then by taking the (negative) logarithm of that number a very high “score” was produced. The scores were then taken to be the average of the AUROC and AUPR scores. I can understand why Feizi et al. might be curious whether the difference between a method’s performance (before and after network deconvolution) is significantly different from random, but to replace magnitude of effect with statistical significance in this setting, with such small effect sizes, is to completely mask the fact that the methods are hardly distinguishable from random in the first place. To make concrete the implication of reporting the statistical significance instead of effect size, I examined the “significant” improvement of network deconvolution on the S. cerevisae and other datasets when run with the protein parameters rather than the default (second figure above). Below I show the AUROC and AUPR plots for the dataset.

The Feizi et al. results before and after network deconvolution using alpha=1, beta=0.99 (shown with AUROC).

The Feizi et al. results before and after network deconvolution using alpha=1, beta=0.99 (shown with AUPR).

My conclusion was that the use of “score” was basically a red herringWhat looked like major differences between methods disappears into tiny effects in the experimental datasets, and even the in silico variations are greatly diminished. The differences in AUROC of one part in 1000 hardly seem reasonable for concluding that network deconvolution works. Biologically, both results are that the methods cannot reliably predict edges in the network. With usable network deconvolution code at hand, I was curious about one final question. The main result of the DREAM5 paper

was that the community method was best. So I wondered whether network deconvolution would improve it. In particular, the community result shown in Feizi et al. was not a test of network deconvolution, it was simply a construction of the community from the 9 methods tested (two communities were constructed, one before and one after network deconvolution). To perform the test, I went to examine the DREAM5 data, available as supplementary material with the paper. I was extremely impressed with reproducibility. The participant submissions are all available, together with scripts that can be used to quickly obtain the results of the paper. However the data is not very usable. For example, what is provided is the top 100,000 edges that each method produced. But if one wants to use the full prediction of a method, it is not available. The implication of this in the context of network deconvolution is that it is not possible to test network deconvolution on the DREAM5 data without thresholding. Furthermore, in order to evaluate edges absolute value was applied to all the edge weights. Again, this makes the data much less useful for further experiments one may wish to conduct. In other words, DREAM5 is reproducible but not very usable. But since Feizi et al. suggest that network deconvolution can literally be run on anything with “indirect effect”, I decided to give it a spin. I did have to threshold the input (although fortunately, Feizi et al. have assured us that this is a fine way to run network deconvolution), so actually the experiment is entirely reasonable in terms of their paper. The figure is below (produced with the default network deconvolution parameters),  but before looking at it, please accept my apology for making it. I really think its the most incoherent, illogical, meaningless and misleading figure I’ve ever made. But it does abide by the spirit of network deconvolution:

The DREAM5 results before and after network deconvolution.

Alas, network deconvolution decreases the quality of the best method, namely the community methodThe wise crowds have been dumbed down. In fact, 19/36 methods become worse, 4 stay the same, and only 13 improve. Moreover, network deconvolution decreases the quality of the top method in each dataset. The only methods with consistent improvements when network deconvolution is applied are the mutual information and correlation methods, poor performers overall, that Feizi et al. ended up focusing on. I will acknowledge that one complaint (of the many possible) about my plot is that the overall results are dominated by the in silico dataset. True- and I’ve tried to emphasize that by setting the y-axis to be the same in each dataset (unlike Feizi et al.) But I think its clear that no matter how the datasets are combined into an overall score, the result is that network deconvolution is not consistently improving methods. All of the analyses I’ve done were made possible thanks to the improved usability of network deconvolution. It is unfortunate that the result of the analyses is that network deconvolution should not be used. Still, I think this examples makes a good case for the fact that reproducibility is essential, but usability is more important.

Jingyi Jessica Li and Mark D. Biggin

We published a paper titled “System wide analyses have underestimated protein abundances and the importance of transcription in mammals” in PeerJ on Feb 27, 2014 (https://peerj.com/articles/270/). In our paper we use statistical methods to reanalyze the data of several proteomics papers to assess the relative importance that each step in gene expression plays in determining the variance in protein amounts expressed by each gene. Historically transcription was viewed as the dominant step. More recently, though, system wide analyses have claimed that translation plays the dominant role and that differences in mRNA expression between genes explain only 10-40% of the differences in protein levels. We find that when measurement errors in mRNA and protein abundance data is taken into account, transcription again appears to be the dominant step.

Our study was initially motivated by our observation that the system wide label-free mass spectrometry data of 61 housekeeping proteins in Schwanhäusser et al (2011) have lower expression estimates than their corresponding individual protein measurements based on SILAC mass spectrometry or western blot data. The underestimation bias is especially obvious for proteins with expression levels lower than 106 molecules per cell. We therefore corrected this non linear bias to determine how more accurately scaled data impacts the relationship between protein and mRNA abundance data. We found that a two-part spline model fits well on the 61 housing keep protein data and applied this model to correct the system-wide protein abundance estimates in Schwanhäusser et al (2011). After this correction, our corrected protein abundance estimates show a significantly higher correlation with mRNA abundances than do the uncorrected protein data.

We then investigated if other sources of experimental error could further explain the relatively poor correlation between protein and mRNA levels. We employed two strategies that both use Analysis of Variance (ANOVA) to determine the percent of the variation in measured protein expression levels that is due to each of the four steps: transcription, mRNA degradation, translation, and protein degradation, as well as estimating the measurement errors in each step. ANOVA is a classic statistical method developed by RA Fisher in the 1920s. Despite the fact that this is a well-regarded and standard approach in some fields, its usefulness has not been widely appreciated in genomics and proteomics. In our first strategy, we estimated the variances of errors in mRNA and protein abundances using direct experimental measurements provided by control experiments in the Schwanhäusser et al. paper. Plugging these variances into ANOVA, we found that mRNA levels explain at least 56% of the differences in protein abundance for the 4,212 genes detected by Schwänhausser et al (2011). However,  because one major source of error—systematic error of protein measurements—could not be estimated, the true percent contribution of mRNA to protein expression should be higher. We also employed a second, independent strategy to determine the contribution of mRNA levels to protein expression. We show that the variance in translation rates directly measured by ribosome profiling is only 12% of that inferred by Schwanhäusser et al (2011), and that the measured and inferred translation rates correlate poorly. Based on this, our second strategy suggests that mRNA levels explain ∼81% of the variance in protein levels. While the magnitudes of our two estimates vary, they both suggest that transcription plays a more important role than the earlier studies implied and translation a much smaller role.

Finally, we noted that all of the published estimates, as welll as ours given above, only apply to those genes whose mRNA and protein expression was detected. Based on a detailed analysis by Hebenstreit et al. (2012), we estimate that approximately 40% of genes in a given cell within a population express no mRNA. Since there can be no translation in the absence of mRNA, we argue that differences in translation rates can play no role in determining the expression levels for the ∼40% of genes that are non-expressed.

• 351,784 hits