Jingyi Jessica Li and Mark D. Biggin

We published a paper titled “System wide analyses have underestimated protein abundances and the importance of transcription in mammals” in PeerJ on Feb 27, 2014 (https://peerj.com/articles/270/). In our paper we use statistical methods to reanalyze the data of several proteomics papers to assess the relative importance that each step in gene expression plays in determining the variance in protein amounts expressed by each gene. Historically transcription was viewed as the dominant step. More recently, though, system wide analyses have claimed that translation plays the dominant role and that differences in mRNA expression between genes explain only 10-40% of the differences in protein levels. We find that when measurement errors in mRNA and protein abundance data is taken into account, transcription again appears to be the dominant step.

Our study was initially motivated by our observation that the system wide label-free mass spectrometry data of 61 housekeeping proteins in Schwanhäusser et al (2011) have lower expression estimates than their corresponding individual protein measurements based on SILAC mass spectrometry or western blot data. The underestimation bias is especially obvious for proteins with expression levels lower than 106 molecules per cell. We therefore corrected this non linear bias to determine how more accurately scaled data impacts the relationship between protein and mRNA abundance data. We found that a two-part spline model fits well on the 61 housing keep protein data and applied this model to correct the system-wide protein abundance estimates in Schwanhäusser et al (2011). After this correction, our corrected protein abundance estimates show a significantly higher correlation with mRNA abundances than do the uncorrected protein data.

We then investigated if other sources of experimental error could further explain the relatively poor correlation between protein and mRNA levels. We employed two strategies that both use Analysis of Variance (ANOVA) to determine the percent of the variation in measured protein expression levels that is due to each of the four steps: transcription, mRNA degradation, translation, and protein degradation, as well as estimating the measurement errors in each step. ANOVA is a classic statistical method developed by RA Fisher in the 1920s. Despite the fact that this is a well-regarded and standard approach in some fields, its usefulness has not been widely appreciated in genomics and proteomics. In our first strategy, we estimated the variances of errors in mRNA and protein abundances using direct experimental measurements provided by control experiments in the Schwanhäusser et al. paper. Plugging these variances into ANOVA, we found that mRNA levels explain at least 56% of the differences in protein abundance for the 4,212 genes detected by Schwänhausser et al (2011). However,  because one major source of error—systematic error of protein measurements—could not be estimated, the true percent contribution of mRNA to protein expression should be higher. We also employed a second, independent strategy to determine the contribution of mRNA levels to protein expression. We show that the variance in translation rates directly measured by ribosome profiling is only 12% of that inferred by Schwanhäusser et al (2011), and that the measured and inferred translation rates correlate poorly. Based on this, our second strategy suggests that mRNA levels explain ∼81% of the variance in protein levels. While the magnitudes of our two estimates vary, they both suggest that transcription plays a more important role than the earlier studies implied and translation a much smaller role.

Finally, we noted that all of the published estimates, as welll as ours given above, only apply to those genes whose mRNA and protein expression was detected. Based on a detailed analysis by Hebenstreit et al. (2012), we estimate that approximately 40% of genes in a given cell within a population express no mRNA. Since there can be no translation in the absence of mRNA, we argue that differences in translation rates can play no role in determining the expression levels for the ∼40% of genes that are non-expressed.

Given the reaction to our use of the word “fraud” in our blog post on Feizi et al., we would like to remind readers how it was actually used:

In academia the word fraudulent is usually reserved for outright forgery. However given what appears to be deliberate hiding, twisting and torturing of the facts by Feizi et al., we think that fraud (“deception deliberately practiced in order to secure unfair or unlawful gain”) is a reasonable characterization. If the paper had been presented in a straightforward manner, would it have been accepted by Nature Biotechnology?

While the reaction has largely focused on reproducibility and the swapping of figures, we reiterate our stance: misleading one’s readers (and reviewers) is itself a form of scientific fraud. Regarding the other questions raised, the response from Feizi, Marbach, Médard and Kellis falls short. On their new website their code has changed, their explanations are in many cases incoherent, self-contradictory, and make false claims, and the newly added correction to Figure S4 turns out not to explain the difference between the figures in the revisions. We explain all of this below. And yet for us the claim of fraud can stand on the basis of deception alone. The distance between the image created by the main text of the paper and the truth of their method is simply too great.

However, one unfortunate fact is that that judging that distance requires understanding some of the mathematics in the paper, and another common reaction has been “I don’t understand the math.”  For such readers, we explain network deconvolution in simple terms by providing an analogy using numbers instead of matrices. Understanding this requires no more than high school algebra. We follow that with a response to Feizi et al.‘s rebuttal.

In what follows, number deconvolution, a simplified form of network deconvolution, is explained in red. Network deconvolution is explained in blue.

The main concept to understand is that of a geometric series. This is a series with a constant ratio between successive terms, for example

$\displaystyle \frac{1}{2}+\frac{1}{4}+\frac{1}{8} + \cdots \ \ \ \ \ (1)$

In this case, each term is one half the previous term, so that the ratio between successive terms is always ${\frac{1}{2}}$. This sum converges to ${1}$, an intuitive notion that is formalized by defining the infinite sum to be

$\displaystyle \sum_{i=1}^{\infty} \left( \frac{1}{2}\right)^i = lim_{n \rightarrow \infty} \sum_{i=1}^n \left( \frac{1}{2} \right)^i = 1. \ \ \ \ \ (2)$

This is the familiar math of the

It is important to note that not every geometric series converges. If the number ${\frac{1}{2}}$ in the sum above is replaced by ${2}$, then the series ${2+4+8+16+\cdots}$ is said to diverge. The are two basic questions about geometric series: (1) for which numbers do they converge and (2) when they do converge, what do they converge to. These two questions are answered in the following:

Let ${a}$ be a real number. The geometric series converges if, and only if, ${-1 < a < 1}$, in which case

$\displaystyle m=\sum_{i=1}^{\infty} a^i = a(1-a)^{-1}. \ \ \ \ \ (3)$

It is not hard to see why this is true:

$\displaystyle m=a+a^2+\cdots = a(1+a+a^2+\cdots) = a(1+m) \Rightarrow m=a(1-a)^{-1}.$

Returning to the example of ${a=\frac{1}{2}}$, we see that it is a special case of this result. It converges because ${-1 < \frac{1}{2} < 1}$, and furthermore, ${\frac{\frac{1}{2}}{1-\frac{1}{2}} = 1}$.

Matrices behave like numbers in some ways, for example they can be added and multiplied, yet quite differently in others. The generalization of geometric series as in Feizi et al. specifically deals with diagonalizable matrices, a class of matrices which has many convenient properties. Let ${A}$ be such a matrix. The geometric series converges if, and only if, the eigenvalues of ${A}$ (numbers associated to the matrix) lie strictly between ${-1}$ and ${1}$, in which case

$\displaystyle M = \sum_{i=1}^{\infty} A^i = A \cdot (I-A)^{-1} \ \ \ \ \ (4)$

Notice that what has changed is that the condition that ${-1 has been replaced by an eigenvalue restriction on m and the formula  $A(I-A)^{-1}$ is just like $a(1-a)^{-1}$ except the operation of matrix inversion is required instead of number inversion. The result is obvious from elementary linear algebra because the assumption that ${A}$ is diagonalizable means that ${A=UDU^{-1}}$ for some matrix ${U}$, where ${D}$ is a diagonal matrix with the eigenvalues of ${A}$ on the diagonal. Therefore ${A^k=UD^kU^{-1}}$ and the series converges if, and only if, the geometric series formed by summing the powers of each eigenvalue converge.

In following Feizi et al., we call the number ${\frac{1}{2}}$ in the example above the “direct number”, and the remainder of the sum the “indirect number”. One should think of the indirect number as consisting of echoes of the direct number. Furthermore, following the language in Feizi et al., we call the ratio between terms, again the number ${\frac{1}{2}}$ in the example above, the indirect flow. When it is strong, the terms decrease slowly. When it is weak, the terms decrease rapidly.

In Feizi et al., the matrix A is called the “direct effect”, and the remainder of the sum the “indirect effect”. One should think of the indirect effect as consisting of echoes of the direct effect. Furthermore, following the language in Feizi et al., we call the rate at which the terms in the indirect effect decay the indirect flow. When it is strong, the terms decrease slowly. When it is weak, the terms decrease rapidly.

In number deconvolution the goal is to infer ${a}$ from ${m}$. That is, one would like to remove the indirect effect, the echoes, out of ${m}$. If ${m=a(1-a)^{-1}}$, then solving for ${a}$ we obtain

$\displaystyle a=m(1+m)^{-1}. \ \ \ \ \ (5)$

That is the formula for number deconvolution.

In network deconvolution the goal is to infer  ${A}$ from ${M}$. That is, one would like to remove the indirect effect, the echoes, out of ${M}$ as follows: If ${M=A(I-A)^{-1}}$ then solving for ${A}$ we obtain

$\displaystyle A = M \cdot (I+M)^{-1} \ \ \ \ \ (6)$

That is the formula for network deconvolution.

Everything looks great but there is a problem. Even though one can plug any number ${m}$ (except ${m=-1}$) into the formula and get out an ${a}$, the formula only gives an ${a}$ satisfying ${\sum_{i=1}^{\infty} a^i = m}$ when ${m > -\frac{1}{2}}$. For example, starting with ${m=-2}$ and then plugging it into number deconvolution one gets ${a=2}$. But ${2+4+ 8+\cdots}$ diverges.

The restriction on ${m}$ in number deconvolution, namely that it has to be bigger than ${-\frac{1}{2}}$, can be translated into the same restriction on the eigenvalues of ${M}$ for network deconvolution. This follows from the fact that ${M \, = \, \sum_{i=1}^{\infty}A^i \, = \, \sum_{i=1}^{\infty}UD^iU^{-1}}$ means that if ${\lambda}$ is an eigenvalue of ${A}$ then ${\mu :=\frac{\lambda}{1-\lambda}}$ is an eigenvalue of ${M}$ and all the eigenvalues of ${M}$ arise in this way. Therefore, ${\lambda = \frac{\mu}{1+\mu}}$ and the condition ${-1 < \lambda < 1}$ holds if, and only if, ${\mu>-\frac{1}{2}}$

But what if we wanted number deconvolution to work for all negative numbers? One idea is to follow Feizi et al.’s approach and introduce a scaling factor called ${\gamma}$ to be applied to ${m}$, so that the product ${\gamma m}$ can be deconvolved. For example, suppose we start with ${m=-12}$. We’d like to find ${a}$. But we can’t just apply number deconvolution. So we multiply ${m}$ by ${-\frac{1}{12}}$ to get ${1}$, and then deconvolve that to get ${a=\frac{1}{2}}$. We could have multiplied ${m}$ by something else, say ${\frac{-1}{6}}$ in which case we’d get ${a=\frac{2}{3}}$. In fact, there are infinitely many scaling factors, giving infinitely many solutions ${a}$. In fact, we can get any ${a}$ between ${-1}$ and ${1}$.

When ${M}$ is not writable as a geometric sum (decomposable), there exist scaling factors ${\gamma}$ such that the scaled matrix ${\gamma M}$ is decomposable. This is obvious because if ${\mu_{-} =\mu_1 \leq \ldots \leq \mu_n = \mu_{+}}$ are the eigenvalues of ${M}$, then ${\gamma \mu_1,\ldots,\gamma \mu_n}$ are the eigenvalues of ${\gamma M}$ so by choosing ${0 \leq \gamma < |\frac{1}{2\mu_{-}}|}$ we are guaranteed that ${\gamma M}$ is decomposable. Let ${M}$ be a real symmetric matrix with minimum/maximum eigenvalues ${\mu_{-}}$ and ${\mu_{+}}$ respectively and ${0<\beta<1}$. If

$\displaystyle \gamma = \frac{\beta}{max\left((1-\beta)\mu_{+},(-1-\beta)\mu_{-}\right)} \ \ \ \ \ (7)$

then the matrix ${\gamma M}$ is decomposable as ${\gamma M = \sum_{i=1}^{\infty}A^i}$. Furthermore, if ${\lambda_{-},\lambda_{+}}$ are the minimum/maximum eigenvalues of ${A}$ respectively then ${max(|\lambda_{-}|,|\lambda_{+}|) = \beta}$ (we omit the derivation, but it is straightforward).

Matrices do behave differently than numbers, and so it is not true that network deconvolution can produce any matrix as an output. However, as with number deconvolution, it remains true that network deconvolution can produce an infinite number of possible matrices.

Not only do Feizi et al. not mention this anywhere in the main text, instead they create the impression that there is a single output from any input.

Feizi et al.’s response

We now turn to the response of Feizi et al. to our post. First, upon initial inspection of the “Try It Out” site, we were astounded to find that the code being distributed there is not the same as the code distributed with the paper (in particular, step 1 has been removed) and so it seems unlikely that it could be used to replicate the paper’s results. Unless, that is, the code originally distributed with the paper was not the code used to generate its results. Second, the “one click reproduction”s that the authors have now added do not actually start with the original data but instead merely produce plots from some results without showing how those results were generated. This is not what reproducibility means. Third, while the one matrix of results from a “one click reproduction” that we looked at (DI_ND_1bkr.mat) was very close to the matrix originally distributed with the paper, it was slightly different. It was close and hopefully it does generate basically the same figure, but as we explain below, we’ve spent a good bit of time on this and have no desire to spend any more. This is why we have tried to incentivize someone to simply reproduce the results.

In retrospect, we regret not explaining exactly how we came to be so skeptical about the reproducibility of the results in the Feizi et al. paper. While we very quickly (yet not instantly) recognized how misleading the paper is, our initial goal in looking at their results was not to verify reproducibility (which we assumed) but rather to explore how changing the parameters affected the results. Verifying that we could recover the results from the paper was only supposed to be the first step in this process. We downloaded the file of datasets and code provided by the authors and began examining the protein contact dataset.

After writing scripts that we verified exactly regenerated some figures from the paper from the output matrices distributed by the authors, we then checked whether we arrived at the same results when running the ND code on what we assumed was the input data (files with names like “MI_1bkr.txt” while the output was named “MI_ND_1bkr.txt”). We were surprised when the output did not match, however we were then informed by the authors that they had not distributed the input data but rather thresholded versions of the input. When we asked the authors to provide us with the actual data, we were told that it would violate scientific etiquette to send us another scientist’s dataset and that we would have to regenerate it ourselves. We had never heard this claimed point of etiquette before, but acquiesced and attempted to do so. However, the matrices produced from the original data were actually a different size than those used by the authors (suggesting that they had used a different alignment).

Stymied on that front, we turned to the DREAM5 dataset instead. While the actual scoring that goes into the DREAM analysis is fairly complicated, we decided that we would start by merely checking that we could regenerate the output provided by the authors. We eventually received step-by-step instructions for doing so and, because those steps did not produce the provided output, asked a simple question suggested by our experience with protein contact dataset: to generate the provided results, should we use as input the provided non-ND matrices? We asked this question four times in total without receiving a reply. We sent our script attempting to implement their steps and received no response.

At some point, we also discovered that the authors had been using different parameter values for different datasets without disclosing that to their readers. They could provide no coherent explanation for doing so. And sometime after this, we found that the authors had removed the data files from their website: readers were directed instead to acquire the datasets elsewhere and the results from the paper were no longer provided there.

At this point, we expect it is obvious why we were skeptical about the reproducibility of the results. Having said that, we never wanted to believe that the results of the paper were not reproducible. It was not our initial assumption and we still hope to be proven wrong. However, we continue to wait, and the fact that the code has been changed yet again, and that the authors’ explanation in regards to figure S4 does not appear to explain its changes (see below) do not fill us with confidence.

We repeat that while most of the discussion above has focused on replicability, we will quote again the definition of “fraud” used in the original post: “deception deliberately practiced in order to secure unfair or unlawful gain”. Our main point is this: would any honest reader of the main text of the original paper recognize the actual method implemented? We don’t think so, hence deception. And we believe the authors deliberately wrote the paper in this way to unfairly gain acceptance at Nature Biotechnology.

Now, in response to the authors rebuttal, we offer the following (Feizi et al. remarks in italics):

Point-by-point rebuttal

Appendix A

The pre-processing step that Bray and Pachter criticize (step 1 in their description of our work) has no effect on the performance of ND. Mathematically, because our matrices are non-negative with min zero, a linear scaling of the network results in a similar scaling of its eigenvalues, which are normalized during eigenvalue scaling, canceling out the linear scaling of step 1. Practically, removing step 1 from the code has little effect on the performance of ND on the DREAM5 regulatory networks, as we show in Figure 1 below.

We cannot imagine how the authors came to make the statement “our matrices are non-negative with min zero”. After all, one of the inputs they used in their paper was correlation matrices and they surely know that correlations can be negative. If the map in step 1 were linear rather than – as we correctly stated in the document they are responding to – affine (for those unfamiliar with the terms, “linear” here means only scaling values while “affine” means scaling and shifting as well), then this explanation would be correct and it would have no effect. This just makes it even stranger that immediately after giving this explanation, the authors produced a graph showing the (sometimes quite significant) effect of what we assume was the affine mapping.

Stranger still is that after this defense, Step 1 has now been removed from the code.

Appendix B

The eigenvalue scaling (step 5) is essential both theoretically, to guarantee the convergence of the Taylor series, and practically, as performance decreases without it in practice. This step is clearly stated in both the main text of our paper and in the supplement, despite the claims by Bray and Pachter that it somehow was maliciously concealed from our description of the method (just search the manuscript for the word ‘scaling’). Our empirical results confirm that this step is also necessary in practice, as ND without eigenvalue scaling is consistently performing worse, as we show in Figure 2 below:

We do not understand why the author’s felt the need to defend the use of eigenvalue scaling when we have never suggested it should be removed. Our objection was rather to how it was presented (or not) to the reader. Yes, there is a mention of the word “scaling”. But it is done so in a way that makes it sound as though it were some trivial issue: there’s an assumption, you scale, assumption satisfied, done. And this mention occurs in the context of the rest of the main text of the paper which presents a clear image of a simple, parameterless, 100% theoretically justified method: you have a matrix, you put it into the method, you get the output, done (and it is “globally optimal”). In other words, it is the math we discuss above prior to the arrival of the parameter $\gamma$.

Yes, scaling is dealt with at length in the supplement. Perhaps Feizi et al. think it is fine for the main text of a paper to convey such a different image from the truth contained in its supplement and in the code. We do not and we would hope the rest of the scientific community agrees with us.

Appendix C: Robustness to input parameters.

Bray and Pachter claim that “the reason for covering up the existence of parameters is that the parameters were tuned to obtain the results”. Once again the claims are incorrect and unfounded. (1) First, we did not cover up the existence of the parameters. (2) Second, we used exactly the exact same set of parameters for all 27 tested regulatory networks in the DREAM challenge. (3) Third, we show here that our results are robust to the parameter values, using β = 0.5 and β = 0.99 for the DREAM5 network.

(1) Again: in the main text of the paper, there is not even a hint that the method could have parameters. This is highly relevant information that a reader who is looking at its performance on datasets should have access to. That reader could then ask, e.g. what parameters were used and how they were selected. They would even have the ability to wonder if the parameters were tuned to improve apparent performance. It is worth noting that when the paper was peer reviewed, the reviewers did not have access to the actual values of the parameters used.

(2) It is, of course, perfectly possible to tune parameters while using those same parameters on multiple datasets.

(3) The authors seem to have forgotten that there are two parameters to the method and that they gave the other parameter (alpha) different values on different datasets as well. They also have omitted their performance on the “Community” dataset for some reason.

Appendix D: Network Deconvolution vs. Partial Correlation.

Bray and Pachter compare network deconvolution to partial correlation using a test dataset built using a partial correlation model. In this very artificial setting, it is thus not surprising that partial correlation performs better, as it exactly matches the process that generated the data in the first place. To demonstrate the superiority of partial correlation, Bray and Pachter should test it on real datasets, such as the ones provided as part of the DREAM5 benchmarks . In our experience, partial correlation performed very poorly in practice, but we encourage Bray and Pachter to try it for themselves.

We looked at partial correlations here for a very simple reason: the authors originally claimed that their method reduced to it in the context we considered here. Thus, comparing them seemed a natural thing to do.

The idea that we should look at the performance of partial correlations in other contexts makes no sense: we have never claimed that it is the right tool to solve every problem. Indeed, it was the authors who claimed that their tool was “widely applicable”. By arguing for domain specific tools, they seem to be making our point for us.

Claim 1: “the method used to obtain the results in the paper is completely different than the idealized version sold in the main text of the paper”.

The paper clearly describes both the key matrix operation (“step 6″ in Lior Pachter’s blog post) which is shown in figure 1, and all the pre- and post-processing steps, which are all part of our method. There is nothing mischievous about including these pre- and post-processing steps that were clearly defined, well described, and implemented in the provided code.

The statement that the paper describes all steps is simply false. For example, the affine mapping appears nowhere.

Claim 2: “the method actually used has parameters that need to be set, yet no approach to setting them is provided”.

It is unfortunate that so many methods in our field have parameters associated with them, but we are not the first method to have them. However, we do provide guidelines for setting these in the supplement.

It is strange to suggest that it is unfortunate that methods have parameters. It is also strange to describe their guidelines as such when they did not even follow them. Also, there is no guideline for setting alpha in the supplement. In their FAQ, they say “we used beta=0.5 as we expected indirect flows to be propagating more slowly”. We wonder where one derives these expectations from.

Dr. Pachter also points out that a correction to Supplement Figure S4 on August 26th 2013 was not fully documented. We apologize for the omission and provide additional details in an updated correction notice dated February 12, 2014.

The correction notice states that the original Figure S4 was plotted with the incorrect x-axis. In particular it states that the maximum eigenvalue of the observed network was used, instead of the maximum eigenvalue of the direct network. We checked this correction, and have produced below the old curve and the new curve plotted on the same x-axis:

The new and old Figure S4. The old curve, remapped into the correct x-axis coordinates is shown in red. The new curve is shown in blue. Raw data was extracted from the supplement PDFs using WebPlotDigitizer.

As can be seen, the two curves are not the same. While it is expected that due to the transformation the red curve should not cover the whole x-axis, if the only difference was the choice of coordinates, the curves should overlap exactly. We cannot explain the discrepancy.

This is the third and final post in a series (part1, part2) of posts on two back-to-back papers published in Nature Biotechnology in August 2013:

1. Baruch Barzel & Albert-László Barabási, Network link prediction by global silencing of indirect correlationsNature Biotechnology 31(8), 2013, p 720–725. doi:10.1038/nbt.2601
2. Soheil Feizi, Daniel Marbach, Muriel Médard & Manolis Kellis, Network deconvolution as a general method to distinguish direct dependencies in networksNature Biotechnology 31(8), 2013, p 726–733. doi:10.1038/nbt.2635

An inconvenient request

One of the great things about conferences is that there is time to chat in person with distant friends and collaborators. Last July, at the ISMB conference in Berlin, I found myself doing just that during one of the coffee breaks. Suddenly, Manolis Kellis approached me and asked to talk in private. The reason for his interruption: he came to request that I remove an arXiv post of mine, namely ”Comment on ‘Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions“, a response to a paper by Ward and Kellis. Why? He pointed out that my arXiv post was ranking highly on Google. This was inconvenient for him, he explained, while insisting that it was wrong of me to post a criticism of his work on a forum where he could not directly respond. He suggested that it would be best to work out any issues I might have with his paper offline. Unfortunately,  there was the inconvenient truth that arXiv postings cannot be removed. Unlike some journals, where, say, a supplement can be revised while having the original removed (see the figure switching of Feizi et al.), arXiv preprints are permanent.

My initial confusion quickly turned to anger. After all, my arXiv comment had been rejected from Science where I had submitted it as a technical comment on the Ward-Kellis paper. I had then put it on the arXiv as a last resort measure to at least have some record of my concerns publicly accessible. How is this wrong? Can one not critique the work of Manolis Kellis?

Network nonsense begins

My first review of a Manolis Kellis paper was in the fall of 2006, in my capacity as a program committee member of the  Research in Computational Molecular Biology (RECOMB) conference held in Oakland, CA in 2007. Because Oakland is right next to Berkeley, a number of Berkeley professors were involved in organizing and running the conference. Terry Speed was chair of the program committee. I was out of the country that year on sabbatical at the University of Oxford, so I could not participate, or even attend, the conference, but I volunteered to serve on the program committee. For those not familiar with the RECOMB review process, it is modeled after the standard Computer Science conferences. The program committee chair forms the program committee, who are then assigned a handful of papers to review and score. Reviews are entered on a website, and after a brief period of online discussion about borderline papers, scores are revised and accept/reject decisions are made. Authors can revise their manuscripts, but the reviewers never see the papers again before publication in the proceedings.

One of the papers I was assigned to review was by a student named Joshua Grochow and his advisor Manolis Kellis. The paper was titled “Network Motif Discovery Using Subgraph Enumeration and Symmetry-Breaking“. Although networks were not my research focus at the time, and “symmetry-breaking” evoked in me nightmares from the physics underworld, I agreed to the review. The paper seemed to contain some interesting algorithms, appeared to have a combinatorial flavor, and potentially important applications- a good mix for RECOMB.

The problem addressed by Grochow & Kellis was that of identifying “network motifs” in biological networks. “Motifs” can be defined in a variety of ways, and the Grochow-Kellis objective was simple. In graph theoretic terms, given a graph G, the goal was to find subgraphs occurring with high multiplicity to an extent unlikely in a random graph. There are many models for random graphs, and the one that the results in Grochow-Kellis are based on is the Erdös-Renyi model (each edge chosen independently with some fixed probability). The reason this definition might be of biological interest, is that recurrent motifs interspersed in a graph are likely to represent evolutionarily conserved interaction modules.

The paper begins with a description of the method. I won’t go into the details here, except to say that everything seemed well until I read the caption of Figure 3. There the number 27,720 caught my eye.

During my first few years of graduate school I took many courses on enumerative and algebraic combinatorics. There are some numbers that combinatorialists just “know”. For example, seeing 42 emerge as the answer to a counting problem does not bring to mind Douglas Adams, but rather the vast literature on Catalan numbers and their connections to dozens of well-known counting problems. Similarly, a number such as 126 brings to mind binomial coefficients ($126={9 \choose 4}$), and with them the idea of counting the number of subsets of fixed size from a larger set. When I saw the number 27,720 I had a hunch that somehow some canonical combinatorial set had been enumerated. The idea may have entered my mind because of the picture of the “motif” in which I immediately recognized a clique (all vertices mutually connected) and a stable set (no pair of vertices connected). In any case, I realized that

$27,720 = 220 \cdot 126 = {12 \choose 3} \cdot {9 \choose 4}$.

The significance of this is that the “motif” on the left-hand side of Figure 3 had appeared many times  because of a type of double- or rather thousandfold- counting. Instead of representing statistically significant recurring independent motifs, this “motif” arises because of a combinatorial artifact. In the specific example of Figure 3, the motif occurred once for any choice of 4 nodes from the clique of size 9, and any choice of 3 nodes from the stable set of size 12.

The point is that in a graph, any subgraph attached to a large clique (or stable set) will occur many times. This simple observation is a result of the fact that there are many subgraphs of a clique (or stable set) that are identical. I realized that this meant that the Grochow-Kellis method was essentially a heuristic for finding cliques and stable sets in graphs. The particular “network motifs” they were pulling out were just subgraphs that happened to be connected to some large cliques and stable sets. There are two problems with this: first, a clique or a stable set can hardly be considered an interesting “network motif”. Moreover, the fact that they appear in biological networks much more than in Erdös-Renyi random graphs is not surprising. Second, there is a large literature on finding cliques in graphs, none of which Grochow-Kellis cited or seemed to be familiar with.

The question of the performance of the Grochow-Kellis algorithm is answered in their Figure 3 as well. There is a slightly larger motif consisting of nodes from the stable set of size 12, instead of 3. That motif occurs in all ${12 \choose 6}$ subsets of the stable set instead of ${12 \choose 3}$ subsets which means that there is a motif that occurs 116,424 times! Grochow and Kellis’s algorithm did not even achieve its stated goal. It really ought to have outputted the left hand side figure with six nodes in the stable set on the left, and not three. In other words, this was a paper providing uninteresting solutions from a biological point of view, and doing so poorly to boot.

I wrote up a detailed report on the paper, and posted it on the RECOMB review website together with poor scores reflecting my opinion that the paper had to be rejected. How could RECOMB, ostensibly the premier computer science conference on computational and algorithmic biology, publish a paper with neither a computational nor biological result? Not to mention an algorithm that demonstratably did not find the most frequently occurring motif.

As you might already guess, my rejection was subsequently overruled. I don’t know who made the final decision to accept the Grochow & Kellis paper to the RECOMB conference, although presumably it was the program committee chair. The decision jarred with my sense of scientific integrity. I had put considerable effort into reviewing the paper and understanding it, and I felt that I had provided a compelling objective argument for why the paper was fundamentally flawed- the fact that the results were trivial (and incorrect!) was not a subjective statement. At this point I need to point out that the RECOMB conference is quite difficult to get into. The acceptance rate for papers in 2007, consistent with other years, was 21.8%. I knew this meant that even a single very negative review, especially one with a compelling argument against the paper, almost certainly would lead to rejection of the paper. So I couldn’t understand then, nor do I still understand now, on what basis the decision was made to accept the paper. This bothered me greatly, and after much deliberation I started boycotting the conference. Despite publishing five RECOMB papers from 2000 to 2006 and regularly attending the meeting during that time, the continued poor decisions and haphazard standards for papers selected have led me to not return in almost 8 years.

Grochow and Kellis obviously received my review and considered how to “deal with it”. They added a section titled “The role of combinatorial effects”, in which they explained the origins of the number 27,720 that they gleaned from my report, but then spun the bad news they had received as “resulting from combinatorial connectivity patterns prevalent in larger network structures.” They then added that “…this combinatorial clustering effect brings into question the current definition of network motif” and proposed that “additional statistics…might well be suited to identify larger meaningful networks.” This is a lot like someone claiming to discover a bacteria whose DNA is arsenic-based and upon being told by others that the “discovery” is incorrect – in fact, that very bacteria seeks out phosphorous – responding that this is “really helpful” and that it “raises lots of new interesting open questions” about how arsenate gets into cells. Chutzpah. When you discover your work is flawed, the correct response is to retract it.

I don’t think people read papers very carefully. Joshua Grochow went on to win the MIT Charles and Jennifer Johnson Outstanding M. Eng. Thesis Award for his RECOMB work on network motif discovery. [Added February 18: Grochow and Kellis have posted a reply here].

The nature of man

I have to admit that after the Grochow-Kellis paper I was a bit skeptical of Kellis’ work. Not because of the paper itself (everyone makes mistakes), but because of the way he responded to my review. So a year and a half ago, when Manolis Kellis published a paper in an area I care about and am involved in, I may have had a negative prior. The paper was Luke Ward and Manolis Kellis “Evidence for Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions”, Science 337 (2012) . Having been involved with the ENCODE pilot, where I contributed to the multiple alignment sub-project, I was curious what comparative genomics insights the full-scale $130 million dollar project revealed. The press releases accompanying the Ward-Kellis paper (e.g. The Nature of Man, The Economist) were suggesting that Ward and Kellis had figured out what makes a human a human; my curiosity was understandably piqued. Ward and Kellis combined population genomic data from the 1000 Genomes Project with biochemical data from the ENCODE project to look for signatures of human constraint in regulatory elements. Their analysis was based on measuring three different proxies for constraint: SNP density, heterozygosity and derived allele frequency. To identify specific classes of regulatory regions under constraint, aggregated regions associated with specific gene ontology (GO) categories were tested for significance. Reading the paper I was amazed to discover they found precisely two categories: retinal cone cell development and nerve growth factor receptor signaling. It was only upon reading the supplement that I discovered that their tests had produced 53 other GO categories as well (Table S5). Despite the fact that the listed categories were required to pass a false discovery rate (FDR) threshold for both the heterozygosity and derived allele frequency (DAF) measures, it was statistically invalid for them to highlight any specific GO category. FDR control merely guarantees a low false discovery rate among the entries in the entire list. Moreover, there was no obvious explanation for why categories such as chromatin binding (which had a smaller DAF than nerve growth) or protein binding (with the smallest p-value) appeared to be under purifying selection. As with the Feizi et al. paper, the supplement produced a story much less clean than the one presented in the main body of the paper. In fact, retinal cone cell development and nerve growth factor were 33 and 34 out of the 55 listed GO categories when sorted by the DAF p-value (42 and 54 when sorted by heterozygosity p-value). In other words, the story being sold in the paper was based on blatant statistically invalid cherry picking. The other result of the paper was an estimate that in addition to the 5% of the human genome conserved across mammalian genomes, at least another 4% has been subject to lineage-specific constraint. This result was based on adding up the estimates of constrained nucleotides from their Table S6 (using the derived allele frequency measure). These were calculated using a statistic that was computed as follows: for each one of ten bins determined according to estimated background selection strength, and for every feature F, the average DAF value DF was rescaled to $PUC_F = \frac{(D_F - D_{CNDC})}{(D_{NCNE}-D_{CNDC})}$, where DCNDC and DNCNE were the bin-specific average DAFs of conserved non-degenerate coding regions and non-conserved non-ENCODE regions respectively. One problem with the statistic is that the non-conserved regions contain nucleotides not conserved in all mammals, which is not the same as nucleotides not conserved in any mammals. The latter would have been needed in order to identify human specific constraint. Second, the statistic PUCF was used as a proxy for the proportion under constraint even though, as defined, it could be less than zero or greater than one. Indeed, in Table S6 there were four values among the confidence intervals for the estimated proportions using DAF that included values less than 0% or above 100%: Ward and Kellis were therefore proposing that some features might have a negative number of nucleotides under constraint. Moreover, while it is possible that after further rescaling PUCmight have correlated with the true proportion of nucleotides under constraint, there was no argument provided in the paper. Thus, while Ward and Kellis claimed to have estimated the proportion of nucleotides under constraint, they had only computed a statistic named “proportion under constraint”. Nicolas Bray and I wrote up these points in a short technical comment and submitted it to the journal Science early in November 2012. The comment was summarily rejected with a curt reply by senior editor Laura Zahn stating that “relative to other Technical Comments we have recently received we feel that the scope and focus of your comment make it more suitable for the Online Comments facility at Science, rather than as a candidate for publication as a Technical Comment.” It is worth noting that Science did decide to publish another comment: Phil Green and Brent Ewing’s, “Comment on’Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions‘”, Science 10 (2013). Green and Ewing’s comment is biological in nature. Their concern is that “… the polymorphism trends are primarily attributable to mutational variation and technical artifacts rather than selection.” Its fine that Science decided to host a debate on a biology question on its pages, but how can one debate the interpretation of results from a method, when the method is fundamentally flawed to begin with? After all, our problem with PUC was much deeper than a “technical flaw”. We decided at the end to place the comment in the arXiv. After doing so, it became apparent that it had little impact. Indeed, I have never received any feedback about it from anyone. Apparently even this was too much for Manolis Kellis. Methods matter By the time I noticed the Feizi et al. paper in the journal Nature Biotechnology early in August 2013, my experiences reading Kellis’ papers had subtly altered the dynamic between myself and the printed word. Usually, when I read a paper and I don’t understand something, I assume the fault lies with me. I think most people are like this. But now, when the Feizi et al. paper started to not make sense, I didn’t presume the problem was with me. I tried hard to give the paper a fair reading, but after a few paragraphs the spell of the authors was already broken. And so it is that Nicolas Bray and I came to figure out what was really going on in Feizi et al., a project that eventually led us to also look at Barzel-Barabási. Speaking frankly, it was difficult work to write the blog posts about these articles. In addition to the time it took, it was exhausting and exasperating to discover the flaws, fallacies and frauds. Both Nick and I prefer to do research. But we felt a responsibility to spell out in detail what had happened here. Manolis Kellis is not just any scientist. He has, and continues to play leading roles in major consortium projects such as mod-ENCODE and ENCODE, and he has served on numerous advisory committees for the NHGRI. He is a member of the GCAT (Genomics, Computational Biology and Technology) study section until 2018. That any person would swap out a key figure in a published paper without publishing a correction, and without informing the editor is astonishing. That a person with great responsibility towards scientists is an abuser of science is unacceptable. Manolis Kellis’ behavior is part of a systemic problem in computational biology. The cross-fertilization of ideas between mathematics, statistics, computer science and biology is both an opportunity and a danger. It is not hard to peddle incoherent math to biologists, many of whom are literally math phobic. For example, a number of responses I’ve received to the Feizi et al. blog post have started with comments such as “I don’t have the expertise to judge the math, …” Similarly, it isn’t hard to fool mathematicians into believing biological fables. Many mathematicians throughout the country were recently convinced by Jonathan Rothberg to donate samples of their DNA so that they might find out “what makes them a genius”. Such mathematicians, and their colleagues in computer science and statistics, take at face value statements such as “we have figured out what makes a human human”. In the midst of such confusion, it is easy for an enterprising “computational person” to take advantage of the situation, and Kellis has. I believe the solution for this problem is for computational biologists to start taking themselves more seriously. Whether serving as reviewers for journals, as panel members for funding agencies, on hiring/tenure committees, or writing articles, all of us have to tone down the hype and pay closer attention to the science. There are many examples of what this means: a review of a math/stats containing paper cannot be a single paragraph long and based on a hunch, and similarly computational biologists shouldn’t claim, as have many of the authors of papers I’ve reviewed in these posts, pathways to cure disease and explanations for what makes humans human. Don’t fool the biologists. Don’t fool the computer scientists, statisticians, and mathematicians. The possibilities for computational methods in biology are unlimited. The future is exciting, and there are possibilities for significant advances in areas ranging from molecular and evolutionary biology to medicine. But money, citations and fame cannot rule the day. The details of the #methodsmatter. This is the second in a series of three posts (part1, part3) on two back-to-back papers published in Nature Biotechnology in August 2013: 1. Baruch Barzel & Albert-László Barabási, Network link prediction by global silencing of indirect correlationsNature Biotechnology 31(8), 2013, p 720–725. doi:10.1038/nbt.2601 2. Soheil Feizi, Daniel Marbach, Muriel Médard & Manolis Kellis, Network deconvolution as a general method to distinguish direct dependencies in networksNature Biotechnology 31(8), 2013, p 726–733. doi:10.1038/nbt.2635 The Barzel & Barabási paper we discussed in yesterday’s blog post is embarrassing for its math, shoddy “validations” and lack of results. However of the two papers above it is arguably the “better” one. That is because the paper by Soheil Feizi, Daniel Marbach, Muriel Médard & Manolis Kellis, in addition to having similar types of problems to the Barzel & Barabási paper, is also dishonest and fraudulent. For reasons that I will explain in the third and final post in this series, we (Nicolas Bray and I) started looking at the Feizi et al. paper shortly after it was published early in August 2013. This is the story: Feizi et al.‘s paper describes a method called network deconvolution that in their own words provides “…a systematic method for inferring the direct dependencies in a network, corresponding to true interactions, and removing the effects of transitive relationships that result from indirect effects.” They claim that the method is a “foundational graph theoretic tool” and that it “is widely applicable for computing dependencies in network science across diverse disciplines.” This high brow language makes network deconvolution sounds very impressive, but what exactly is the method? Feizi et al. would like you to believe that the method is what is described in their Figure 1, part of which is shown below: This is a model for networks represented as matrices. In this model an unknown matrix $G_{dir}$ with eigenvalues between -1 and 1 is to be inferred from an observed matrix $G_{obs}$ that is related to $G_{dir}$ via $G_{obs} = G_{dir}+G_{dir}^2+G_{dir}^3+\cdots$ (1) The matrix $G_{dir}$ represents “direct effects” and the sum of its powers “indirect effects”. It is probably worth noting at this point that there is no particular reason to believe that effects will behave in this manner in any particular system of interest, but we return to this later. The eigenvalue assumption on $G_{dir}$ is required for the specification of the model, because without it the infinite sum generating $G_{obs}$ does not converge. An elementary manipulation of (1) shows that the matrix $G_{dir}$ can be recovered from $G_{obs}$ by the formula $G_{dir} = G_{obs}(I+G_{obs})^{-1}$. (2) Unfortunately, many matrices cannot be decomposed as an infinite sum of powers of some matrix as in equation (1), so equation (2) cannot be applied directly to arbitrary data matrices. The only mention of this issue in the main text of the paper is the statement that ”This [eigenvalue] assumption can be achieved for any matrix by scaling the observed input network by a function of the magnitude of its eigenvalues.” This is true but incoherent. There are an infinite number of scalings that will satisfy the assumption, and while the authors claim in their FAQ that ”the effect of linear scaling on the input matrix is that … it does not have an effect” this is obviously false (also if it has no effect why do they do it?). For example, as the scaling goes to zero, $G_{dir}$ converges to $G_{obs}$. What the authors have actually done with their seemingly innocuous aside is fundamentally change their model: now instead of (1), $G_{obs}$ is given by $G_{obs} = \gamma \cdot (G_{dir}+G_{dir}^2+G_{dir}^3+\cdots)$. (3) The problem with this model is that given $G_{obs}$ there are an infinite number of solutions for $\gamma$ and $G_{dir}$. Feizi et al.‘s approach to dealing with this is to introduce a scaling parameter that must be chosen a priori. They do not even mention the existence of this parameter anywhere in the main text. Insteadthey choose to make a false statement in the caption of Figure 1 when they write “When these assumptions hold, network deconvolution removes all indirect flow effects and infers all direct interactions and weights exactly.” Even when $G_{obs}$ satisfies the eigenvalue constraint, once it is scaled before applying equation (2) the matrix $G_{dir}$ has probability zero of being recovered. In the video below, recorded at the Banff workshop on Statistical Data Integration Challenges in Computational Biology: Regulatory Networks and Personalized Medicine (August 11–16, 2013), you can see an explanation of network deconvolution by the corresponding author Kellis himself. The first part of the clip is worth watching just to see Kellis describe inverting a matrix as a challenge and then explain the wrong way to do it. But the main point is at the end (best viewed full screen with HD): Manolis Kellis received his B.S., M.S. and Ph.D degrees in computer science and electrical engineering from MIT, so it is hard to believe that he really thinks that solving (2), which requires nothing more than a matrix inversion, must be accomplished via eigendecomposition. In fact, inverting a 2000 x 2000 matrix in R is 50% slower using that approach. What is going on is that Kellis is hiding the fact that computation of the eigenvalues is used in Feizi et al. in order to set the scaling parameter, i.e. that he is actually solving (3) and not (2). Indeed, there is no mention of scaling in the video except for the mysteriously appearing footnote in the lower left-hand corner of the slide starting at 0:36 that is flashed for 2 seconds. Did you have time to think through all the implications of the footnote in 2 seconds, or were you fooled? While it does not appear in the main text, the key scaling parameter in network deconvolution is acknowledged in the supplementary material of the paper (published with the main paper online July 14, 2013). In supplementary Figure 4, Feizi et al. show the performance of network deconvolution with different scaling parameters ($\beta$) on the three different datasets examined in the paper: Figure S4, July 14, 2013. In the words of the authors, the point of this figure is that “…choosing $\beta$ close to one (i.e., considering higher order indirect interactions) leads to the best performance in all considered network deconvolution applications.” However, while the supplement revealed the existence of $\beta$, it did not disclose the values used for the results in the paper. We inquired with the authors and were surprised to discover that while $\beta = 0.95$ was used for the protein networks and $\beta = 0.99$ for the co-authorship network, $\beta=0.5$ was used for the DREAM5 regulatory networks violating their own advice. What rationale could there be for such a different choice, especially one very far away from the apparently optimal choice “near 1″? We asked the authors, who initially replied that the parameter setting didn’t make a difference. We then asked why the parameter would be set differently if its value didn’t make a difference; we never got an answer to this question. Although it may be hard to believe, this is where the story gets even murkier. Perhaps as a result of our queries, the scaling parameters were publicly disclosed by Feizi et al. in a correction to the original supplement posted on the Nature Biotechnology website on August 26, 2013. The correction states “Clarification has been made to Supplementary Notes 1.1, 1.3, 1.6 and Supplementary Figure 4 about the practical implementation of network deconvolution and parameter selection for application to the examples used in the paper. “ But Supplementary Figure 4 was not clarified, it was changed, a change not even acknowledged by the authors, let alone explained. Below is Figure S4 from the updated supplement for comparison with the one from the original supplement shown above. Figure S4, August 26, 2013. The revised figure is qualitatively and quantitatively different from its predecessor. A key point of the paper was changed- the scaling parameter $\beta$ is now seen to not be optimal near 1, but rather dataset dependent (this is reflected in a corresponding change in the text: in reference to choosing $\beta$ close to 1 “best performance” was changed to “high performance”). Specifically, the regulatory network dataset now has an optimal value of $\beta$ close to 0.5, perhaps providing an answer to our question above about why $\beta=0.5$ was used for that dataset alone. However that explanation directly implies that the authors knew the original figure was incorrect. We are left with two questions: 1. If the authors believed their original figure and their original statement that $\beta$ close to 1 “leads to the best performance in all considered network deconvolution applications”, then why did they use $\beta=0.5$ for one of the applications? 2. When and how was the July 16th Figure S4 made? Manolis Kellis declined to answer question 1 and by not acknowledging the figure change in the correction did not permit the readers of Nature Biotechnology to ask question 2. Clearly we have fallen a long way from the clean and canonical-seeming Figure 1. We wish we could say we have reached the bottom, but there is still a ways to go. It turns out that “network deconvolution” is actually neither of the two models (2) and (3) above, a fact revealed only upon reading the code distributed by the authors. Here is what is in the code: 1. affinely map the entries of the matrix $G_{obs}$ to lie between 0 and 1, 2. set the diagonal of the matrix to 0, 3. threshold the matrix, keeping only the largest $\alpha$ fraction of entries, 4. symmetrize the matrix, 5. scale the matrix so that $G_{dir}$ inferred in the next step will have maximum eigenvalue $\beta$, 6. apply formula (2), 7. affinely map between 0 and 1 again. The parameter $\beta$ is the scaling parameter we’ve already discussed, but there is a second parameter $\alpha$. When the Feizi et al. paper was first published, the parameter $\alpha$ appeared in the code with default value 1, and was literally not mentioned in the paper. When asked, the authors revealed that it also takes on different values in the different experiments in the paper. In particular, the regulatory networks DREAM5 experiments used $\alpha=0.1$. The only guidance for how to set $\alpha$, now provided with the new version of the supplement, is “In practice, the threshold chosen for visualization and subsequent analysis will depend on the specific application.” It is hard to say what steps 1–7 actually do. While it is impossible to give a simple analytic formula for this procedure as a whole, using the Sherman-Morrison formula we found that when applied to a correlation matrix C, steps 1, 2, 4, 6 and 7 produce a matrix whose ijth entry is (up to an affine mapping) $P_{ij} + \Sigma_{kl}P_{ij}P_{kl}+m\Sigma_{kl}P_{ik}P_{jl}$ where $P=C^{-1}$ and m is the minimum entry of C. Omitting step 1 results in Pii , the inverse correlation matrix, so the effect of the mapping in this case is the addition of the final two terms whose possible meaning escapes us. We are left to wonder why the authors chose to add this step which lacks any apparent theoretical justification. This bring us to question the meaning of the contribution of author Muriel Médard, a professor of information theory at MIT. According to the contributions section she is the one who “contributed to correctness proof and robustness analysis”. Presumably “correctness” meant describing the two steps needed to show that the solution to (1) is (2). But what would actually be relevant to the paper is a theorem about steps 1–7. Unfortunately such a theorem is hard to imagine. When Feizi et al. was published in August 2013, they set up a companion website at http://compbio.mit.edu/nd/ where the code and data was made available (the data is no longer available at the site). On December 4, 2013, the authors updated the website. While there used to be one version of the program, there are now two different versions, one specifically for use on regulatory networks where the default values for the parameters are $\beta=0.5, \, \alpha =0.1$ and the other with defaults $\beta=0.9, \, \alpha=1$“Network deconvolution” is supposedly a universal method applicable to any kind of network. Will the authors continue to distribute different domain-specific versions of the program, each with their own “default” parameters somehow divined by themselves? To summarize, “network deconvolution”, as best as we can currently tell, consists of steps 1–7 above with the parameters: • DREAM5 challenge: $\beta = 0.5$ and $\alpha=0.1$. • Protein network: $\beta = 0.99$ and $\alpha=1$. • Co-authorship: $\beta=0.95$ and $\alpha=1$. The description of the method in the main text of the paper is different from what is in the online methods, which in turn is different from what is in the supplement, which in turn is different from what is actually in each of the (two) released programs, namely the ad-hoc heuristic steps we have described above. Having said all of that, one might be curious whether the “method” actually works in practice. Sometimes heuristics work well for reasons that are not well understood (although they are usually not published in journals such as Nature Biotechnology). Unfortunately, we have to disclose that we don’t know how the method performs on the datasets in the paper. We tried really hard to replicate the results of the paper, running the software on the provided input matrices and comparing them to the distributed output matrices (available on the NBT website). Despite our best attempts, and occasionally getting very close results, we have been unable to replicate the results of the paper. This bothers us to such an extent that I, Lior Pachter, am hereby offering$100 to anyone (including Feizi et al.) who can provide me with the code and input to run network deconvolution and produce exactly the figures in the paper (including both versions of Figure S4) and the output matrices in the supplementary data.

Because we couldn’t replicate the results of the paper, we decided to examine how FK performs in one case for which Feizi et al. originally claimed (Supplement version July 14 2013) that their method is optimal. They stated “if the observed network is a covariance matrix of jointly Gaussian variables, ND infers direct interactions using global partial correlations”. This claim was deleted without mention in the updated supplement, however even if network deconvolution is not optimal for inferring a network from a correlation matrix (presumably what they meant in the original statement), the claimed universality suggested in the paper implies that it should be very good on any input.

The figure above shows a comparison of the Feizi et al. method (FK) with the various beta parameters used in their paper to regularized partial correlation, the “geometric root” (equation (2)) and the naïve method of simply selecting the top entries of the correlation matrix. In order to test the method we performed 40 simulations of sampling 500 observations from a random Gaussian graphical model with 1000 variables and an edge density of 5% to ensure the graph was connected yet sparse. Performance was assessed by comparing the ranking of the edges to the true edges present in the model and computing the area under the corresponding ROC curve. As a benchmark we applied partial correlation structural inference based on a James-Stein shrinkage estimation of the covariance matrix (a standard approach to gene regulatory network inference, described in Schäfer, J. & Strimmer, K. A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics. Statistical Applications in Genetics and Molecular Biology 4, (2005)).

That FK does better than the most naïve method and yet worse than a method actually designed for this task is perhaps to be expected for a heuristic method that is based on a metaphor (one that dates back to Seawall Wright’s work on path coefficients in the 1920s- although Wright used (1) in the restricted setting of directed acyclic graphs where it makes sense). It is possible that in some contexts, FK will make things somewhat better. However what such contexts might be is unclear. The only guidance that Feizi et al. provide on the assumptions needed to run their method is that they state in the supplement that the model assumes that networks are “linear time-invariant flow-preserving operators”. While it is true that individual parts of that phrase mean something in certain contexts, the complete phrase is word salad. We wonder: is co-authorship flow preserving?

To conclude: Feizi et al. have written a paper that appears to be about inference of edges in networks based on a theoretically justifiable model but

1. the method used to obtain the results in the paper is completely different than the idealized version sold in the main text of the paper and
2. the method actually used has parameters that need to be set, yet no approach to setting them is provided. Even worse,
3. the authors appear to have deliberately tried to hide the existence of the parameters. It looks like
4. the reason for covering up the existence of parameters is that the parameters were tuned to obtain the results. Moreover,
5. the results are not reproducible. The provided data and software is not enough to replicate even a single figure in the paper. This is disturbing because
6. the performance of the method on the simplest of all examples, a correlation matrix arising from a Gaussian graphical model, is poor.

In academia the word fraudulent is usually reserved for outright forgery. However given what appears to be deliberate hiding, twisting and torturing of the facts by Feizi et al., we think that fraud (“deception deliberately practiced in order to secure unfair or unlawful gain”) is a reasonable characterization. If the paper had been presented in a straightforward manner, would it have been accepted by Nature Biotechnology?

Post scriptum. After their paper was published, Feizi et al. issued some press releases. They explain how original and amazing their work is. Kellis compliments himself by explaining that “Introducing such a foundational operation on networks seems surprising in this day and age” and Médard adds that “the tool can be applied to networks of arbitrary dimension.” They describe the method itself as a way to have “…expressed the matrix representing all element correlations as a function of its principal components, and corresponding weights for each component” even though principal components has nothing to do with anything in their paper. As with Barzel-Barabási, this seems worthy of nomination for the Pressies.

Post post scriptum: In September 2013, we submitted a short commentary to Nature Biotechnology with many of the points made above. While it was rejected four months later, we really really wish we could share the reviews here. Unfortunately they were sent to us with a confidentiality restriction, though having said that, any of the reviewers are hereby invited to guest post on network science on this blog. We were disappointed that Nature Biotechnology would not publish the commentary, but not surprised. Frankly, it would have been too much to expect them and their reviewers to unravel the layers of deception in Feizi et al. in the initial review process; it took us an extraordinary amount of time and effort. Having published the paper, and without a clear path to unpublication (retraction is usually reserved for faking of experimental data), there were presumably not many options. The question is whether anything will change. Feizi et al. apparently have more Nature Biotechnology papers on the way as evidenced by an extraordinary declaration on Soheil Feizi’s publication page (this is a pdf of the version when this blog was posted; Feizi changed his site an hour later) where the publication venue of their next paper appears to have been preordained:

S. Feizi, G. Quon , M. Mendoza, M. Medard, M. Kellis
Comparative Analysis of Modular Integrative Regulatory Networks across Fly, Worm and Human
in preparation for Nature Biotechnology.

The term “in preparation” is common on academic websites, especially on sites of graduate students or postdocs who are eager to advertise forthcoming work, but “in preparation for” really requires some hubris. We have to wonder, why do Feizi et al. assume that peer review is just a formality in the journal Nature Biotechnology?

[Addendum 2/23: The authors posted a rebuttal to this blog post, which we replied to here.]

In the August 2013 issue of Nature Biotechnology there were two back-to-back methods papers published in the area of network theory:

1. Baruch Barzel & Albert-László Barabási, Network link prediction by global silencing of indirect correlationsNature Biotechnology 31(8), 2013, p 720–725. doi:10.1038/nbt.2601.
2. Soheil Feizi, Daniel Marbach, Muriel Médard & Manolis Kellis, Network deconvolution as a general method to distinguish direct dependencies in networksNature Biotechnology 31(8), 2013, p 726–733. doi:10.1038/nbt.2635.

This post is the first of a trilogy (part2, part3) in which my student Nicolas Bray and I tell the story of these papers and why we took the time to read them and critique them.

We start with the Barzel-Barabási paper that is about the applications of a model proposed by Barzel and his Ph.D. advisor, Ofer Biham (although all last names start with a B, Biham is not to be confused with Barabási):

In order to quantify connectivity in biological networks, Barzel and Biham proposed an experimental perturbation model in the paper Baruch Barzel & Ofer Biham, Quantifying the connectivity of a network: The network correlation function methodPhys. Rev. E 80, 046104 (2009) that forms the basis for network link prediction in Barzel-Barabási. In the context of biology, link prediction refers to the problem of identifying functional links between genes from data that may be confounded by indirect effects. For example, if gene A inhibits the expression of gene B, and also gene B inhibits the expression of gene C, then if the expression of A increases, it will decrease the expression of B, which in turn increase C. Therefore one might observe correlation in the expression levels of gene A and C, even though there is no direct interaction between them. The Barzel-Biham model is based on perturbation experiments. Assuming that a system of genes is in equilibrium, it is a model for the change in expression of one gene in response to a small perturbation in another.

The parameters in the Barzel-Biham model are entries in what they call a “local response matrix” S (any matrix with $S_{ii}=0$ for all i). Physical arguments pertaining to perturbations at equilibrium lead to the equations

$G=SG \,$ (off the diagonal) and $G_{ii}=1$ for all i.                                  (1)

for a “global response matrix” G that can, in principle, be observed and used to infer the matrix S. The innovation of Barzel and Barabási is to provide an approximate formula for recovering S from G, specifically the formula

$S \approx (G-I+D((G-I)G))^{-1}$                                                       (2)

where D(M) denotes the operation setting off-diagonal elements of M to zero. A significant part of the paper is devoted to showing that the approximation (2) is good. Then they suggest that (2) can be used to infer direct causal links in regulatory networks from collections of expression experiments. Barzel and Barabási claim that the approximation formula (2) is necessary because exact inference of S from G requires solving the intractable system of $n^2$ equations

$G=SG \,$ (off the diagonal) and $S_{ii}=0$ for all i.                                  (3)

The assertion of intractability is based on the claim that the equations are coupled. They reason that since the naïve matrix inversion algorithm requires $O(m^3)$ operations for m equations, the solution of (3) would require time $O(n^6)$. When we looked at this system, our first thought was that while it is large, it is also structured. We sat down and started examining it by writing down the equations for a simple case: a matrix S for a graph on 3 nodes. We immediately noticed the equations decoupled into n systems of n equations where system is given by $S_{ii}=0$ and $G_{ij} = \sum_k S_{ik} G_{kj}, \, (j \neq i)$, with the n unknowns $S_{i1},\ldots,S_{in}$. This immediately reduces the complexity to $O(n^4)$, or even $O(n^3)$ by simple parallelization. In other words, the system is trivially tractable.

But there is more: while looking at the paper I had to take a quick bathroom break, and by the time I returned Nick had realized he could apply the Sherman-Morrison formula to obtain the following formula for the exact solution:

$S=I-D(1/G^{-1})G^{-1}$.                                                                    (4)

Here the operator “/” denotes element-wise division, a simple operation to execute, so that inferring S from G requires no more than inverting G and scaling it, a formula that is also much simpler and more efficient to compute than (2). [Added 2/23: Jordan Ellenberg  pointed out the obvious fact that $G=SG$ off the diagonal means that $SG=G-D$ for some diagonal matrix $D$, and therefore $S=I-DG^{-1}$ and since the diagonal entries of $S$ must be zero it follows that $D=D(1/G^{-1})$. In other words, the Sherman-Morrison formula is not even needed]. While it would be nice for us to claim that our managing to quickly supersede the main result of a paper published in Nature Biotechnology was due to some sort of genius, in fact the entire exercise would be suitable for an undergraduate linear algebra homework problem. Barabási likes to compare himself to the great physicist and nobel laureate Subrahmanyan Chandrasekhar, but it is difficult to imagine the great Chandrasekhar having similar difficulties.

The approach to solving (4) has an implication that is even more important than the solution itself. It provides a dual formula for computing G from S according to (1), i.e. to simulate from the model. Using the same ideas as above, one finds that

$G=TD(1/T)$ where $T=(S-I)^{-1}$.                                              (5)

Unlike Barzel & Barabási that resorted to simulating with Michaelis-Menten dynamics in their study of performance of their approximation, using (4) we can efficiently simulate data directly from the model. One issue with Michaelis-Menten dynamics is that they make more sense for enzymatic networks as opposed to regulatory networks (for more on this see Karlebach, G. & Shamir, R. Modelling and analysis of gene regulatory networksNat Rev Mol Cell Biol 9, 770–780 (2008)), but in any case performance on such dynamics is hardly a validation of (2) since its mixing apples and oranges. So what happens when one simulates data from the Barzel-Biham model and then tries to recover the parameters?

A comparison of the standard method of regularized partial correlations with exact inference for the Barzel-Biham model. Random sparse graphs were generated according to the Erdös-Renyi graph model G(5000,p) where was varied to assess performance at different graph densities (shown on x-axis) The y-axis shows the average AUROC obtained from 75 random trials at each density.

When examining simulations from the Barzel-Biham model with graphs on 5,000 nodes (see Figure above), we were surprised to discover that when adding even small amounts of noise, the exact algorithm (4) failed to recover the local response matrix from G (we also analyzed the approximation (3) and observed that it always resulted in performance inferior to (4), and that 5% of the time the correlation with the exact solution was negative). This sensitivity to noise is due to the term $D(1/G^{-1})$ in the exact formula which becomes problematic if the diagonal entries of $G^{-1}$ are close to zero.

Some intuition for the behavior of $G^{-1}$ may be gained from noting that if S is such that its geometric sum converges, the diagonal of $G^{-1}$ is equal to that of $I+S+S^2+S^3+\cdots$. If S has mixed signs and there is significant feedback within the network, the diagonal of $G^{-1}$ may be close to zero and any noise in the measurement of G could create very large fluctuations in the inferred S. This means that the results in Figure 1 are not dependent on the graph model chosen (Erdös-Renyi) and will occur for any reasonable model of gene regulatory networks including the modeling of both enhancers and repressors. From Figure 2a in their paper, it appears that Barzel and Barabási used in their simulation an S with only positive entries that would preclude such effects. Such an assumption is biologically unrealistic.

However, the difficulties with noise for the Barzel-Biham model go much deeper. While a constant signal-to-noise ratio, as assumed by Barzel and Barabasi, is a commonly used model for errors in experiments, it is important to remember that there is no experiment for directly measuring the elements of G. Obtaining $G_{ij}$ from an experiment is done by making a small perturbation of size e to gene i, observing the change in gene j, and then dividing that change by e. This last step increases the noise on the estimate of $G_{ij}$ by a factor of 1/e (a large number, for a perturbative experiment) above the noise already present in the measurements. Increasing e acts to remove the system from the perturbative regime and thereby increases the intrinsic error in estimating G.  It is therefore the case that attaining reasonable error on G will require very low noise in the original measurements. In this case of biological networks this would mean performing many replicates of the experiments. However, as Barzel & Barabási acknowledge in their paper, even a single replicate of a perturbation experiment is not currently feasible.

While the exact algorithm (4) for inverting the Barzel-Biham model performs poorly, we found that a widely used shrinkage method based on partial correlation (Schäfer, J. & Strimmer, K. A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional GenomicsStatistical Applications in Genetics and Molecular Biology 4, (2005)) outperforms the exact algorithm (blue curve in Figure above). This suggests that there is no input for which (4) might be useful. The method is not even ideal for inference from data generated by the model it is based on.

This brings us to the “results” section of the paper. To demonstrate their method called Silencer, Barzel & Barabási ran it on only one of three datasets from the DREAM5 data. They then compared the performance of Silencer to three out of thirty five methods benchmarked in DREAM5. The Barzel-Biham model is for perturbation experiments, but Barzel & Barabási just threw in data from another universe (e.g. mutual information matrices). But lets just go with that for a moment. Their results are shown in the figure below:

Figure 3 from Barzel-Barabási.

The three methods tested potential improvements on are Pearson, Spearman and Mutual Information. Pearson and Spearman rank 16/35 and 18/35 respectively in the DREAM5 benchmarks. There may be some reason why Silencer applied on top of these methods improves performance: in the case where G is a correlation matrix, the path interpretation given by Barzel and Barabási connects the inference procedure to Seawall Wright’s path coefficients (ca. 1920), which in turn suggests an interpretation in terms of partial correlation. However in the case of mutual information, a method that is ranked 19/35 in the DREAM5 benchmarks, there is no statistically significant improvement at all. The improvement is from an AUROC of 0.67 to 0.68. Amazingly, Barzel and Barabási characterize these results by remarking that they ”improve upon the top-performing inference methods” (emphasis on top is ours).  Considering that the best of these rank 16/35 the use of the  word “top” seems, shall we say, unconventional.

We have to ask: how did Barzel & Barabási get to publish a paper in the journal Nature Biotechnology on regulatory network inference without improvement or testing on anything but a handful of mediocre DREAM5 methods from a single dataset? To put the Barzel-Barabási results in context, it is worth considering the standards the Feizi et al. paper were held to. In that paper the authors compared their method to DREAM5 data as well, except they tested on all 3 datasets and 9 methods (and even on a community based method). We think its fair to conclude that significantly more testing would have to be done to argue that Silencer improves on existing methods for biological network inference.

We therefore don’t see any current practical utility for the Barzel-Biham model, except possibly for perturbation experiments in small sub-networks. Even then, we don’t believe it is practical to perform the number of experiments that would be necessary to overcome signal to noise problems.

Unfortunately the problems in Barzel-Barabási spill over into a follow up article published by the duo: Barzel, Baruch, and Albert-László Barabási, “Universality in Network Dynamics.” Nature Physics 9 (2013). In the paper they assume that the local response matrix has entries that are all positive, i.e. they do not allow for inhibitory interactions. Such a restriction immediately renders the results of the paper, if they are to be believed, moot in terms of biological significance. Moreover, the restrictions on S appear to be imposed in order to provide approximations to G that are unnecessary in light of (5). Given these immediate issues, we suspect that were we to read the Universality paper carefully, it is quite likely this post would have to be lengthened considerably.

These are not the first of Barabási’s papers to package meaningless and incoherent results in Nature/Science publications. In fact, there is a long history of Barabási publishing with fanfare in top journals only to have others respond by publishing technical comments on his papers, in many cases refuting completely the claims he makes. Unfortunately many of the critiques are not well known because they are rejected from the journals where Barabási is successful, and instead find their way to preprint servers or more specialized publications. Here is a partial list of Barabási finest and the response(s):

1. Barabasi is famous for the “BA model”, proposed in Barabási and Albert ‘Emergence of Scaling in Random Networks“, Science, Vol. 286 15 October 1999, pp. 509-512. Lada Adamic and Bernardo Huberman immediately refuted the practical applications of the model. Moreover, as pointed out by Willinger, Alderson and Doyle, while it is true that scale-free networks exhibit some interesting mathematical properties (specifically they are resilient to random attack yet vulnerable to worst-case), even the math was not done by Barabási but by the combinatorialists Bollobás and Riordan.
2. Barabási has has repeatedly claimed metabolic networks as prime examples of scale-free networks, starting with the paper Jeong et al., “The large-scale organization of metabolic networks“, Nature 407 (2000). This fact has been disputed and refuted in the paper Scale Rich Metabolic Networks by Reiko Tanaka.
3. The issue of attack tolerance was the focus of Error and attack tolerance of complex networks by Réka Albert, Hawoong Jeong & Albert-László Barabási in Nature 406 (2000). John Doyle refuted the paper completely in this paper.
4. In the paper “The origin of bursts and heavy tails in human dynamics” published in Nature 435 (2005) Barabási pretends to offer insights into the “bursty nature of human behavior” (by analyzing e-mail). In a follow up comment Daniel Stouffer, Dean Malmgren and Luis Amaral demonstrate that the reported power-law distributions are solely an artifact of the analysis of the empirical data and the proposed model is not representative of e-mail communication patterns.
5. Venturing into the field of control theory, the paper “Controllability of complex networks” by Liu, Slotine and Barabási, Nature 473 (2011) argues that “sparse inhomogeneous networks, which emerge in many real complex systems, are the most difficult to control, but that dense and homogeneous networks can be controlled using a few driver nodes.” Not so. In a beautiful and strong rebuttal, Carl Bergstrom and colleagues show that a single control input applied to the power dominating set is all that is required for structural controllability of most, if not all networks. I have also blogged about this particular paper previously, explaining the Bergstrom result and why it reveals the Barabási paper to be a control theory embarrassment.

In other words, Barabási’s “work” is a regular feature in the journals Nature and Science despite the fact that many eminent scientists keep demonstrating that the network emperor has no clothes.

Post scriptum. After their paper was published, Barzel and Barabási issued a press release claiming that “their research moves the team a step closer in its quest to under­stand, pre­dict, and con­trol human disease.” The advertisment seems like an excellent candidate for Michael Eisen’s pressies.

Last Monday some biostatisticians/epidemiologists from Australia published a paper about a “visualization tool which may allow greater understanding of medical and epidemiological data”:

H. Wand et al., “Quilt Plots: A Simple Tool for the Visualisation of Large Epidemiological Data“, PLoS ONE 9(1): e85047.

A brief look at the “paper” reveals that the quilt plot they propose is a special case of what is commonly known as a heat map, a point the authors acknowledge, with the caveat that they claim that

” ‘heat maps’ require the specification of 21 arguments including hierarchical clustering, weights for reordering the row and columns dendrogram, which are not always easily understood unless one has an extensive programming knowledge and skills. One of the aims of our paper is to present ‘‘quilt plots’’ as a useful tool with simply formulated R-functions that can be easily understood by researchers from different scientific backgrounds without high-level programming skills.”

In other words, the quilt plot is a simplified heat map and the authors think it should be used because specifying parameters for a heat map (in R) would require a terrifying skill known as programming. This is of course all completely ridiculous. Not only does usage of R not require programming skill, there are also simplified heat map functions in many programming languages/computation environments that are as simple as the quilt plot function.

The fact that a paper like this was published in a journal is preposterous, and indeed the authors and editor of the paper have been ridiculed on social media, blogs and in comments to their paper on the PLoS One website.

BUT…

Wand et al. do have one point… those 21 parameters are not an entirely trivial matter. In fact, the majority of computational biologists (including many who have been ridiculing Wand) appear not to understand heat maps themselves, despite repeatedly (ab)using them in their own work.

What are heat maps?

In the simplest case, heat maps are just the conversion of a table of numbers into a grid with colored squares, where the colors represent the magnitude of the numbers. In the quilt plot paper that is the type of heat map considered. However in applications such as gene expression analysis, heat maps are used to visualize distances between experiments.

Heat maps have been popular for visualizing multiple gene expression datasets since the publication of the “Eisengram” (or the guilt plot?). So when my student Lorian Schaeffer and I recently needed to create a heat map from RNA-Seq abundance estimates in multiple samples we are analyzing with Ryan Forster and Dirk Hockemeyer, we assumed there would be a standard method (and software) we could use. However when starting to look at the literature we quickly found 3 papers with 4 different opinions about which similarity measure to use:

There are also the folks who don’t worry too much and just try anything and everything (for example using the heatmap.2 function in R) hoping that some distance measure produces the figure they need for their paper. There are certainly a plethora of distance measures for them to try out. And even if none of the distance measures provide the needed figure, there is always the opportunity to play with the colors and shading to “highlight” the desired result. In other words, heat maps are great for cheating with what appears to be statistics.

We wondered…  what is the “right” way to make a heat map?

Consider first the obvious choice for measuring similarity: Euclidean distance. Suppose that we are estimating the distance between abundance estimates from two RNA-Seq experiments, where for simplicity we assume that there are only three transcripts (A,B,C). The two abundance estimates can be represented by 3-tuples $(p_A,p_B,p_C)$ and $(q_A,q_B,q_C)$such that both $p_A+p_B+p_C=1$ and $q_A+q_B+q_C=1$. If  $p_A=1$ and $q_A=0$, then the Euclidean distance is given by $\sqrt{1+q_B^2+q_C^2}$. This obviously depends on $q_B$ and $q_C$, a dependence  that is problematic. What has changed between the two RNA-Seq experiments is that transcript $A$ has gone from being the only one transcribed, to not being transcribed at all. It is difficult to justify a distance metric that depends on the relative changes in $q_B$ and $q_C$. Why, for example, should $(1,0,0)$ be closer to $(1,\frac{1}{2},\frac{1}{2})$ than to $(1,1,0)$?

The Jensen-Shannon divergence, defined for two distributions $P$ and $Q$ by

$JSD(P,Q) = \frac{1}{2}D(P\|M) + \frac{1}{2}D(Q\|M)$

where $M = \frac{1}{2}(P+Q)$ and $D(A\|B)$ is the Kullback-Leibler divergence, is an example of a distance measure that does not have this problem. For the example above the JSD is always $log2$ (regardless of $q_B$ and $q_C$). However the JSD is not a metric (hence the term divergence in its name). In particular, it does not satisfy the triangle inequality (which the Euclidean distance does). Interestingly, this defect can be rectified by replacing JSD with the square root of JSD (the JSD metric). Formal proofs that the square root of JSD is a metric were provided in “A new Metric for Probability Distributions” by Dominik Endres and Johannes Schindelin (2003), and separately (and independently) in “A new class of metric divergences on probability spaces and its applicability in statistics” by Ferdinand Österreicher and Igor Vajda (2003). The paper “Jensen-Shannon Divergence and Hilbert space embedding” by Bent Fuglede and Flemming Topsøe (2004) makes clear the mathematical origins for this result by showing that the square root of JSD can be isometrically embedded into Hilbert space (as a logarithmic spiral)

The 2-simplex with contour lines showing points equidistant
from the probability distribution (1/3, 1/3, 1/3) for the JSD metric.

The meaning of the JSD metric is not immediately apparent based on its definition, but a number of results provide some insight. First, the JSD metric can be approximated by Pearson’s $\chi^2$ distance  (Equation (7) in Endres and Schindelin). This relationship is confirmed in the numerical experiments of Sung-Hyuk Cha (see Figure 3 in “Comprehensive survey on distance/similarity measures between probability distance functions“, in particular the close relationship between JSD and the probabilistic symmetric $\chi^2$). There are also information theoretic and physical interpretations for the JSD metric stemming from the definition of JSD in terms of Kullback-Leibler divergence.

In “Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation“, Trapnell et al., Nature Biotechnology 28 (2010), we used the JSD metric to examine changes to relative isoform abundances in genes (see, for example, the Minard plot in Figure 2c). This application of the JSD metric makes sense, however the JSD metric  is not a panacea. Consider Figure 1 in the Merkin et al. paper mentioned above. It displays a heat map generated from 7713 genes (genes with singleton orthologs in the five species studied). Some of these genes will have higher expression, and therefore higher variance, than others. The nature of the JSD metric is such that those genes will dominate the distance function, so that the heat map is effectively generated only from the highly abundant genes. Since there is typically an (approximately) exponential distribution of transcript abundance this means that, in effect, very few genes dominate the analysis.

I started thinking about this issue with my student Nicolas Bray and we began by looking at the first obvious candidate for addressing the issue of domination by high variance genes: the Mahalanobis distance. Mahalanobis distance is an option in many heat map packages (e.g. in R), but has been used only rarely in publications (although there is some history of its use in the analyses of microarray data). Intuitively, Mahalanobis distance seeks to remedy the problem of genes with high variance among the samples dominating the distance calculation by appropriate normalization. This appears to have been the aim of the method in the Anders and Huber paper cited above, where the expression values are first normalized to obtain equal variance for each gene (the variance stabilization procedure). Mahalanobis distance goes a step further and better, by normalizing using the entire covariance matrix (as opposed to just its diagonal).

Intuitively, given a set of points in some dimension, the Mahalanobis distance is the Euclidean distance between the points after they have been transformed via a linear transformation that maps an ellipsoid fitted to the points to a sphere.  Formally, I think it is best understood in the following slightly more general terms:

Given an $n \times m$ expression matrix $X$ (rows=transcripts, columns=experiments), let $P=PCA(X)$ be the matrix consisting of projections of $X$ onto its principal components, and denote by $s^2_k(ij)$ the distance between projection of points i and j onto the kth component, i.e. $s^2_k(ij) = (P_{ik}-P_{jk})^2$. Let $\lambda_1,\ldots,\lambda_n$ be the singular values. For some $1 \leq p \leq n$, define the distance

$D_{ij} = \frac{s^2_1(ij)}{\lambda_1} + \cdots + \frac{s^2_p(ij)}{\lambda_p}$

When $n \leq m$ and $p=n$ then the distance D defined above is the Mahalanobis distance.

The Mahalanobis ellipses. In this figure the distance shown is from every point to the center (mean of the point) rather than between pairs of points. Mahalanobis distance is sometimes defined in this way. The figure is reproduced from this website. Note that the Anders-Huber heat map produces distances looking only at the variance in each direction (in this case horizontal and vertical) which assumes that the gene expression values are independent, or equivalently that the ellipse is not rotated.

It is interesting to note that D is defined even when $n > m$, providing a generalization of Mahalanobis distance for high-dimensional data.

The cutoff p involves ignoring the last few principal components. The reason one might want to do this is that the last few principal components presumably correspond to noise in the data. Amplifying this noise and treating it the same as the signal is not desirable. This is because as p increases the denominators $\lambda_p$ get smaller, and therefore have an increasing effect on the distance. So even though it makes sense to normalize by variance thereby allowing all genes to count the same, it is important to keep in mind that the last few principal components may be desirable to toss out. One way one could choose the appropriate threshold is by examination of a scree plot.

We’re still not completely happy with Mahalanobis distance. For example, unlike the Jensen-Shannon metric, it does not provide a metric over probability distributions. In functional genomics, almost all *Seq assays produce an output which is a (discrete) probability distribution (for example in RNA-Seq the output after quantification is a probability distribution on the set of transcripts). So making heat maps for such data seems to not be entirely trivial…

Does any of this matter?

The landmark Michael Eisen et al. paper “Cluster analysis and the display of genome-wide expression patterns“, PNAS 95 (1998), 14863–14868 describing the “Eisengram” was based on correlation as the distance measure between expression vectors. This has a similar problem to the issues we discussed above, namely that  abundant genes are weighted more heavily in the distance measure, and therefore they define the characteristics of the heat map. Yet the Eisengram and its variants have proven to be extremely popular and useful. It is fair to ask whether any of the issues I’ve raised matter in practice.

Depends. In many papers the heat map is a visualization tool intended for a qualitative exploration of the data. The issues discussed here touch on quantitative aspects, and in some applications changing distance measures may not change the qualitative results. Its difficult to say without reanalyzing data sets and (re)creating the heat maps with different parameters. Regardless, as expression technology continues to transition from microarrays to RNA-Seq, the demand for quantitative results is increasing. So I think it does matters how heat maps are made. Of course its easy to ridicule Handan Wand for her quilt plots, but I think those guilty of pasting ad-hoc heat maps based on arbitrary distance measures in their papers are really the ones that deserve a public spanking.

P.S. If you’re going to make your own heat map, after adhering to sound statistics, please use a colorblind-friendly palette.

P.P.S. In this post I have ignored the issue of clustering, namely how to order the rows and columns of heat maps so that similar expression profiles cluster together. This goes along with the problem of constructing meaningful dendograms, a visualization that has been a major factor in the popularization of the Eisengram. The choice of clustering algorithm is just as important as the choice of similarity measure, but I leave this for a future post.

The paper “Genomic-scale capture and sequencing of endogenous DNA from feces” by George H. Perry, John C. Marioni, Páll Melsted and Yoav Gilad is literally full of feces. The word ‘fecal’ appears 100 times.

Poop jokes aside, the paper presents an interesting idea that has legs. Perry et al. show that clever use of Agilent’s SureSelect allows for capturing nuclear genomic regions from fecal DNA. Intellectually, it is the predecessor of T. Mercer et al.‘s “Targeted RNA sequencing reveals deep complexity of the human transcriptome“, Nature Biotechnology, 2011 (another paper I like and for which I wrote a research highlight). Even though the Perry et al. paper does not have many citations, the Mercer et al. paper does (although unfortunately the authors forgot to cite Perry et al., which I think they should have). In other words, the Perry et al. paper is not as well known as it ought to be, and this post is an attempt to rectify that.

The ‘*’ in sh*t in my title is for *Seq. At a high level,  the Perry et al. paper shows how high-throughput sequencing technology can be leveraged to sequence deeply a single genome from among a community of metagenomes. For this reason, and for convenience, I henceforth will refer to the Perry et al. paper as the Sh*t-Seq paper. The “*” is inserted in lieu of the “i” not as censorship, but to highlight the point that the method is general and applies not only to sequencing nuclear genome from fecal DNA, but also as Mercer et al. shows, for targeted transcriptome sequencing (one can imagine also many other applications).

The Sh*t-Seq protocol is conceptually simple yet complicated in practice. DNA was captured using the Agilent SureSelect target enrichment system coupled to the Illumina single-end sequencing platform library prep protocol (of note is that the paper is from 2010 and the kits are from 2009). However because of the very small amount of DNA targeted (the authors claim ~1.8%), a number of adjustments to standard SureSelect capture / Illumina library prep had to be implemented. To the authors credit, the paragraphs in the section “DNA Capture” are exemplary in their level of detail and presumably greatly facilitate replicability. I won’t repeat all the detail here. However there are two steps that caught my attention as possibly problematic. First, the the authors performed substantial PCR amplification of the  adapter-ligated fecal DNA. This affects the computational analysis they discuss later and leads to a computational step they implemented that I have some issues with (more on this later). Second, they performed two rounds of capture as one round was insufficient for capturing the needed material for sequencing. This necessitated additional PCR, which also is possibly problematic.

The samples collected were from six chimpanzees. This is a fairly small n but the paper is a proof of principle and I think this is sufficient. Both fecal and blood samples were collected allowing for comparison of the fecally derived nuclear DNA to the actual genome of the primates. In what is clearly an attempt to channel James Bond, they collected fecal samples (2 g of stool) within 1 hour of defecation in tubes containing RNALater and these were then “shaken vigorously” (not stirred).

The next part of the paper is devoted to computational analyses to confirm that the Sh*t-Seq protocol can in fact be used to target nuclear endogenous DNA. As a sanity check, mtDNA was considered first. They noted too much diversity to align reads to a reference genome with BWA, and opted instead for de novo assembly using ABySS. This is certainly overkill, possible only because of the high copy count of mitochondrion reads. But I suppose it worked (after filtering out all the low coverage ABySS sequence, which was presumably junk). One interesting idea given more modern RNA-Seq assembly tools would be to assemble the resulting reads with an RNA-Seq de novo assembler that allows for different abundances of sequences. Ideally, such an assembly should indicate naturally the sought after enrichment.

Next, nuclear DNA was investigated, specifically the X chromosome and chromosome 21. Here the analyses is very pre-2014. First, all multi-mapping reads were removed. This is not a good idea for many reasons, and I am quite certain that with the new GRCh38 (with alternate sequence representation for variant regions) it is a practice that will rapidly be phased out. I’d like to give Perry et al. the benefit of the doubt for making this mistake since they published in 2010, but their paper appeared 6 months after the Cufflinks paper so they could have, in principle, known better. Having said that, while I do think multi-mapping would have allowed them to obtain much stronger results as to the accuracy and extent of their enrichment by avoiding the tossing of a large number of reads, their paper does manage to prove their principle so its not a big deal.

The removal of multi-mapping reads was just the first step in a series of “filters” designed to narrow down the nuclear DNA reads to regions of the chimpanzee genome that could be argued to be unambiguously representative of the target. I won’t go into details, although they are all in the paper. As with the experimental methods, I applaud the authors on publishing reproducible methods, especially computational methods, with all details included. But there was a final red flag for me in the computational methods: namely the selection of a single unique fragment (at random) for each genomic (start) site for the purposes of calling SNPs. This was done to eliminate problems due to amplification biases, which is indeed a serious concern, but if heterozygous sites appear due to the PCR steps, then there ought to have been telltale signatures. For example, a PCR “SNP’ would, I think, appear only in reads specific to a single position, but not in other overlapping reads of the site. It would have been very helpful if they would have done a detailed analysis of this issue, rather than just pick a single read at random for each genomic (start) site. They kicked the can down the road.

Having removed multiple mapping reads, repetitive regions, low coverage regions, etc. etc. Dayenu, they ended up estimating a false positive rate for heterozygous sites (using the X chromosome in males) at 0.0007% for fecal DNA and 0.0010% for blood DNA. This led them to conclude that incorrectly-identified heterozygous sites in their study were 0.8%, 2.0%, 1.1%, and 2.7% for fecal DNA chromosome 21, fecal DNA chromosome X in females, blood DNA chromosome 21, and blood DNA chromosome X in females, respectively. Such good news is certainly the result of their extraordinarily stringent filtering, but I think it does prove that they were able to target effectively. They give further proof using PCR and Sanger sequencing of 20 regions.

I have a final nitpick and it relates to Figure 4. It is a a companion to Figure 3 which shows the Chimpanzee phylogeny for their samples based on the mtDNA. As expected in that figure, the fecal and blood samples cluster together. Figure 4 shows two phylogenies, one based on chr 21, the other on chr X. My issue here is with the way that distances were constructed. Its a technical point, but it looks like they used hamming distance, and I don’t think that makes a lot of sense, not to mention the fact that neighbor-joining does not seem like the appropriate algorithm for building a tree in this setting (I plan to blog about neighbor-joining shortly). But this is a methodological point not really relevant to the main result of the paper, namely proof of principle for targeted sequencing of endogenous DNA from fecal matter.

I think Sh*t-Seq has a future. The idea of targeted capture coupled to high-throughput sequencing has more than an economic rationale. It provides the possibility to probe the “deep field” as discussed in the previously mentioned review on targeted RNA-Seq. This is a general principle that should be more widely recognized.

And of course, dung is just cool. Happy new year!

My new year ‘s resolution.

In a letter to a German princess on “the twelve tones of the harpsichord” penned by Leonhard Euler on the 3rd of May 1760 (coincidentally exactly 213 years before my birth day (coincidentally 213 is the smallest number that is part of a consecutive triple, each a product of a pair of distinct prime numbers; 213=3×71, 214=2×107, 215=5×43)), the tension between equal temperament and harmony is highlighted as a central issue in music. Euler wrote “It is evident, therefore, that in fact all semitones are not equal, whatever efforts may be made by musicians to render them such; because true harmony resists the execution of a design contrary to its nature”.

Music is relative. With the exception of a handful of individuals who possess perfect pitch (as an aside, the genetics are interesting), we perceive music relatively. This is very similar to an RNA-Seq experiment. Transcript abundances are measured only relatively to each other (by comparing read counts), although of course a very relatively abundant transcript can also be inferred to be absolutely abundant, as one has prior knowledge about the abundances of housekeeping and other yardstick genes. So it is with music. We know when we are listening to high vs. low pitch, but within a register we hear combinations of notes as concordant or discordant, as melodic or not, based on the relative frequencies of the pitches.

Harmony is the use of simultaneous sounds in music, and is closely associated with the notion of consonance. Specifically, the perfect fifth is the consonant sound of two frequencies at a ratio of 3/2 (the question of why this ratio sounds “good” is an interesting one, and I recommend Helmholtz‘s “On the sensations of tone” as an excellent starting point for exploration). Harmony leads to a “rational” requirement of musical scales: for example, the ability to produce a perfect fifth requires that among notes corresponding to sounds at pre-specified frequencies, there are a pair at a frequency ratio of 3/2.  Other consonant harmonies correspond to other rational numbers with small denominators (e.g. 4/3). Equal temperament is a relative requirement: in a musical scale composed of N notes, it is desirable that any pair of consecutive notes are at the same frequency ratio, say r. This allows for a music to begin on any note.

The tension Euler described between harmony and equal temperament is the mathematical observation that if $r^N = 2$ and $r^{k}=\frac{3}{2}$, then k should be chosen so that $\frac{k}{N} = log_2 \frac{3}{2}$. This is because the first equation leads to $Nlog_2 r = log_2 2 = 1$ so that $log_2 r = \frac{1}{N}$, and the second equation then leads to $k log_2 r = k \cdot \left( \frac{1}{N} \right) = log_2 \frac{3}{2}$. However $log_2 \frac{3}{2}$ is irrational. This is because if it were rational then $log_2 3$ would be rational, and that would mean that $log_2 3 = \frac{a}{b}$ for some positive integers $a,b$, which in turn would be mean that $2^a = 3^b$, which is impossible because powers of two are always even, and powers of three always odd. Ergo a problem.

The mathematics above shows that harmony is fundamental incompatible with equal temperament, but so what? If a very good rational approximation can be found for $log_2 \frac{3}{2}$ then music might not be perfect, but perfect should not be the enemy of the good. For example, in a 12 note scale,  we find that the seventh note (e.g. G if we are in C major) produces a frequency ratio with the first of $2^{\frac{7}{12}} = 1.4983...$, which is awfully close to 1.5. Does the difference in the third decimal really matter when one is listening to Rammstein?

Is 12 special? What about an 11 note scale? The following math provides the answer:

A (simple) continued fraction representation for an irrational number is an infinite expression of the form

$x=a_0+\frac{1}{a_1+\frac{1}{a_2+\frac{1}{a_3+\cdots}}}$

The convergents of are the rational numbers obtained by truncating the continued fraction expression at successively increasing finite points, i.e. the sequence $\frac{a_0}{1}, \frac{a_0a_1+1}{a_1},\frac{a_2(a_1a_0+1)}{a_2a_1+1},\ldots$

Theorem: Let $r_i=\frac{n_i}{d_i}$ be the convergents of a number x. Then the rational number $r_i$ is the best rational approximation to x with denominator $\leq d_i$.

The theorem is saying that if one approximates an irrational number with its continued fractions, then the denominators of those fractions are optimal in the sense that they provide rational approximations that improve on all lower denominators (an elementary proof is written up here). For example, the continued fraction approximation of $log_2 \frac{3}{2}$ is given by the sequence $[0,1,1,2,2,3,1,5,2,23,2,2,1,\ldots ]$ which translates to the expansion

$log_2 \frac{3}{2} = 0+\frac{1}{1+\frac{1}{1+\frac{1}{2+\frac{1}{2+ \cdots}}}}$

and leads to to the rational approximations $0,1,\frac{1}{2},\frac{3}{5},\frac{7}{12},\frac{24}{41},\frac{21}{53},\ldots$

The fourth denominator is 5, and its interpretation in light of the theorem is that no rational number with 1,2,3 or in the denominator can provide an approximation as good as 3/5 to $log_2 \frac{3}{2}$. Similarly, with 12 being the next denominator, the theorem implies that no rational number with a denominator between and 11 can provide an approximation to $log_2 \frac{3}{2}$ as good as 7/12. In fact, to improve on the accuracy of the perfect fifth a 29 note scale is required, a number which is much less practical than 12.There are of course other “rational” considerations in choosing a scale. Not only the perfect 5th must be approximated well, but also thirds and fourths. It turns out 12 provides a reasonably good approximation for all of these. In other words, the 12 note scale is not just a historical accident, it is a mathematical inevitability.

The question is did Princess Friederike Charlotte of Brandenburg-Schwedt realize that underlying the principle Euler was writing to her about was the key to understanding not only the theory of music, but also a fundamental design in nature?

The number of petals in flowers

An obvious question not addressed by the theorem stated above is how good are the rational approximations of irrational numbers? Obviously a rational number can always be chosen to be arbitrarily close to any irrational number (e.g. by truncating the decimal expansion arbitrarily far along). But the question is interesting when considered, in light of the theorem above, in terms of the accuracy of approximation as a function of its denominator. Hurwitz‘s theorem provides the answer:

Theorem [Hurwitz, 1891]: For every irrational number x, there are infinitely many rational approximations n/d such that

$\left| x - \frac{n}{d} \right| < \frac{1}{\sqrt{5}d^2}$.

Moreover, the constant $\frac{1}{\sqrt{5}}$ is best possible.

In other words, the approximation can be as good as the inverse of the square of the denominator (this should be contrasted with using the decimal expansion to approximate, which provides accuracy only proportional to the inverse of the denominator). The fact that $\frac{1}{\sqrt{5}}$ is the best possible constant means that there is some irrational number such that if one sets the constant to be smaller, then there are only finitely many rational approximations achieving the improved accuracy. One cannot get more accurate than $\frac{1}{\sqrt{5}d^2}$ infinitely often, and it turns out that the irrational number that is the extremal example is the golden ratio $\varphi = \frac{1+\sqrt{5}}{2}$. In a sense made precise by the theorem above, the golden ratio is the hardest irrational number to approximate by rationals.

Who would care about irrational numbers that are hard to approximate by rational numbers? Mothers of newborn twins?! However as whimsical as Ehud Friedgut‘s Murphy’s law sounds, evolution appears to have appropriated the principle of offsetting cycles to optimize branching patterns in plants and trees.

Tree branches around a trunk, or of flower petals around a stem, demand sunlight and ideally do not line up with each other. Suppose that they are equally spaced, separated by a fixed angle $\theta$. If $\theta$ is rational, then the branches or petals will line up perfectly creating a pattern, viewed from above, of lines emanating from the trunk or main stem. An irrational angle ensures that branches never line up perfectly, however if the irrational number is well approximated by a rational number then the overlap will be extensive. That is, the branches will almost line up. The solution? Choose an irrational angle that is as hard as possible to approximate by rationals. The golden ratio! The angle turns out to be 137.5 degrees.

With the golden ratio, the branching pattern when viewed from above looks like a series of spirals. Depending on the time between the appearance of branches (or seeds at the center of a flower), the number of spirals varies. The numbers though will always correspond to one of the convergents of the (continued fraction expansion of the) golden ratio. These are

$\frac{2}{3}, \frac{3}{5}, \frac{5}{8}, \frac{8}{13}, \frac{13}{21}, \ldots$.

Note that the denominators are the Fibonacci numbers. Since flower petals typically form at the tips of the spirals, one frequently finds a Fibonacci number of petals on flowers. Fibonacci numbers are all over plants. Nature is irrational.

Georgia O’Keeffe: Music Pink and Blue No. 2 (1918). Next time you are pondering what her flowers truly represent, think also of the tension between the rational and the irrational in music, and in nature.

On Satuday I submitted the final grades for Math10A, the new UC Berkeley freshman math class for intended biology majors that I taught this semester. In assigning students their grades, I had a chance to reflect again on the system we use and its substantial shortcomings.

The system is broken, and my grade assignment procedure illustrates why. Math 10a had 223 students this semester, and they were graded according to the following policy: homework 10%, quizzes 10%, midterms 20% each (there were two midterms) and the final 40%. If midterm 1 was skipped then midterm 2 counted 40%. Similarly, if midterm 2 was skipped then the final counted 60%. This produced a raw score for each student and the final distribution is shown below (zeroes not shown):

The distribution seems fairly “reasonable”. One student didn’t do any work or show up and got a 5/100. At the other end of the spectrum some students aced the class. The average score was 74.48 and  the standard deviation 15.06. An optimal raw score distribution should allow for detailed discrimination between students (e.g. if everyone gets the same score thats not helpful). I think my distribution could have been a bit better but I overall I am satisfied with it.  The problem comes with the next step: after obtaining raw scores in a class, the professor has to set cutoffs for A+/A/A-/B+/B/B-/C+/C/C-/D+/D/D-/F. Depending on how the cutoffs are set, the grade distribution can change dramatically. In fact, it is easy to see that any discrete distribution on letter grades is achievable from any raw score distribution. One approach to letter grades would be to fix an A at, say, any raw score greater than or equal 90%, i.e., no curving. I found that threshold on wikipedia. But that is rarely how grades are set, partly because of large variability in the difficulty of exams. Almost every professor I know “curves” to some extent. At Berkeley one can examine grade distributions here.

It turns out that Roger Purves from statistics used to aim for a uniform distribution:

Roger Purves’ Stat 2 grade distribution over the past 6 years.

The increase in C- grades is explained by an artifact of the grading system at Berkeley.  If a student fails the class they can take it again and record the passing grade for their GPA (although the F remains on the transcript). A grade of D is not only devastating for the GPA, but also permanent. It cannot be improved by retaking the class. Therefore many students try to fail when they are doing poorly in a class, and many professors simply avoid assigning Ds. In other words, Purves’ C- includes his Ds. Another issue is that an A+ vs. A does not affect GPA, but an A vs. A- does; the latter is obviously a very subjective difference that varies widely between classes and professors. Note that Roger Purves just didn’t assign A+ grades, presumably because they have no GPA significance (although they do arguably have a psychological impact).

Marina Ratner from math was fond of failing students:

Marina Ratner’s Math 1B, Spring 2009.

In the same semester, in a parallel section, her colleague Richard Borcherds gave the following grades:

Richard Borcherd’s Math 1B, Spring 2009.

Unlike Ratner, Borcherds appears to be averse to failing students. Only 7 students failed out of 441 who were enrolled in his two sections that semester. Fair?

And then there are those who believe in the exponential distribution, for example Heino Nitsche who teaches Chem 1A:

Heino Nitsche’s Chem 1A, Spring 2011.

The variability in grade assignment is astonishing. As can be seen above, curving is prevalent and arbitrary, and the idea that grades have an absolute meaning is not credible. It is statistically highly unlikely that Ratner’s students were always terrible at learning math (whereas Borcherds “luckily” got the good students). Is chemistry inherently easy, to the point where an average student taking the class deserves an A?

This messed up system is different, yet similar in other schools. Sadly, many schools have used letter grading to manipulate GPAs via continual grade inflation. Just three weeks ago on December 3rd, the dean of undergraduate education at Harvard confirmed that the median grade at Harvard is an A- and the most common grade an A. The reasons for grade inflation are manifold. But I can understand it on a personal level. It is tempting for a faculty member to assign As because those are likely to immediately translate to better course evaluations (both internal, and public on sites such as Ninja Courses and ratemyprofessor). Local grade inflation can quickly lead to global inflation as professors, and at a higher level their universities, are competing with each other for the happiest students.

How did I assign letter grades for Math 10A?

After grading the final exams together, my five GSIs started the process of setting letter grade thresholds by examining the grades of “yardstick students”. These were students for which the GSIs felt confident in declaring their absolute knowledge of the material to be at the A,B,C or F levels. We then proceeded to refine the thresholds adding +/- cutoffs by simultaneously trying to respect the yardsticks, while also keeping in mind the overall grade distribution. Finally, I asked the GSIs to consider moving students upward across thresholds if they had shown consistent improvement and commitment throughout the semester (a policy I had informed the students of in class). The result was that about 40% of the students ended with an A. Students failed the class at a threshold where we believed they had not learned enough of the material to proceed to math 10B. I have to say that as far as my job goes, assigning letter grades for courses is the least scientific endeavor I participate in.

What should be done?

Until recently grades were recorded on paper, making it difficult to perform anything but trivial computations on the raw scores or letter grades. But electronic recording of grades allows for more sophisticated analysis. This should be taken advantage of. Suppose that instead of a letter grade, each student’s raw scores were recorded, along with the distribution of class scores. A single letter would immediately be replaced by a meaningful number in context.

I do think it is unfair to grade students only relatively, especially with respect to cohorts that can range in quality. But it should be possible to compute a meaningful custom raw score distribution specific to individual students based on the classes they have taken. The raw data is a 3-way table whose dimensions consist of professors x classes x raw scores. This table is sparse, as professors typically only teach a handful of different courses throughout their career. But by properly averaging the needed distributions as gleaned from this table, it should be possible to produce for each student an overall GPA score, together with a variance of the (student specific) distribution it came from averaged over the courses the student took. The resulting distribution and score could be renormalized to produce a single meaningful number. That way, taking “difficult” classes with professors who grade harshly would not penalize the GPA. Similarly, aiming for easy As wouldn’t help the GPA. And manipulative grade inflation on the part of professors and institutions would be much more difficult.

Its time to level the playing field for students, eliminate the possibility for manipulative grade inflation, and to stop hypocrisy. We need to not only preach statistical and computational thinking to our students, we need to practice it in every aspect of their education.

In 1963, a post-doctoral fellow at Caltech by the name of Emile Zuckerkandl, who was working with Linus Pauling, published a seminal paper that launched the field of comparative genomics. The full citation is:

L. Pauling and E. Zuckerkandl, “Chemical Paleogenetics: Molecular ‘restoration studies’ of extinct forms of life“, Acta Chem. Scand. 17 (1963) , S9–S16.

The paper not only introduces the concept of comparative genomics, but also its generalization to phylogenomics. It followed on the heels of pioneering work by Zuckerkandl, following the advice of Pauling, to sequence hemoglobin proteins in various primates and mammals. In related work, Zuckerkandl and Pauling (it is clear that Zuckerkandl was the creative and driving force) also introduced the concept of a molecular clock and proposed the concept of sequence alignment (although Ingram published first).

Zuckerkandl’s name and work are not as well known as they ought to be. Perhaps this is partly because his work was overshadowed by Watson and Crick’s famous paper published a decade earlier, but maybe also because computational methods take second place to experiments and Zuckerkandl’s legacy lies in his ideas about phylogenomics, not his contribution to protein sequencing.

The paper after which this blog post is titled is extraordinary in its originality, and the main figure in the paper is a classic that every computational biologist should instantly recognize:

The caption (from the paper), is as follows: Tentative partial structure of two chain-ancestors of the human hemoglobin polypeptide chains. The numbering of the residues is the one usually applied to the human $\alpha$-chain. The abbreviations for the amino acids are those commonly employed, except for asparagin (asg) and glutamine (glm). The other abbreviations: E=early epoch; M=medium epoch; L=late epoch; abs=absent. “None”, in column (e), means: probably no evolutionarily effective mutation occurred at the site under consideration in the line of descent leading from the ancestral genes to the human genes. The residues and comments are placed in parentheses when the conclusion reached is partly based on the consideration of human and sperm whale myglobins.

(a) Residue number

(b) Partial sequence of the $II^{\beta}--III^{\gamma}--IV^{\delta}$-chain (late form)

(c) Partial sequence of the $I-^{\alpha} II^{\beta}--III^{\gamma}--IV^{\delta}$-chain (late form)

(d) Chain(s) in whose direct ancestry the mutation(s) seem to have occurred

(e) Nature of the substitution

(f) Quantitative evaluation of the time of mutation

(g) From evidence relating to a number of different animals (cf. Gratzer and Allison)

(h) Not after the time of the ancestor common to horse and man

(j) Amino-acid residue x present at a homologous site in the two myoglobins

(k) Polypeptide chain $I^{\alpha}$ or $II^{\beta}--III^{\gamma}--IV^{\delta}$

The inference of ancestry is based on the hemoglobin gene phylogeny

The rows (d)–(f) basically tell the story of the paper. Zuckerkandl and Pauling show that with sequence, alignment and phylogeny one can infer the history of mutation and substitution and its effects on extant sequence. Right there, we have the genesis of phylogenomics. In the discussion of the paper there is a detailed description of how phylogenomic techniques can be used to infer functional relationships among proteins, with the conclusion that “even apparently unrelated proteins can indeed have a common molecular ancestor”.

The fields of genomics and phylogenetics have a common ancestor as well but sadly this ancestor, Emile Zuckerkandl, passed away on November 9, 2013. May his ideas continue to evolve.

[Update Dec. 12]: Upon reading the initial version of this post, Ewan Birney asked me how the Zuckerkandl-Pauling alignment looks with modern tools, an interesting question now that 50 years have passed since their paper was publishedI decided to look at this and started by downloaded the sequences and aligning them with three programs: 1) Clustal-Omega, the latest (last) version of the popular and widely used Clustal alignment programs, 2) Muscle, a sequence alignment tool billed by its author as “one of the best-performing multiple alignment programs” and 3) FSA, a sequence alignment program developed in my group by Robert Bradley. The results are interesting:

The Clustal-Omega alignment.

The muscle alignment.

The FSA alignment.

Its a bit complicated to compare these multiple alignments with the Z-P alignment. The latter is a pairwise alignment of the two ancestrally inferred sequences, whereas the multiple alignments above are of the extant sequences. This is already interesting; almost no alignment programs explicitly align by ancestral inference the way Z-P were doing it. The only exception, as far as I know, is MAVID, written by Nicolas Bray, although as implemented it was only suitable for DNA alignment. Nevertheless, the alignment is small enough that I was able to compare Z-P’s alignment to the Clustal-Omega, Muscle and FSA alignments by hand.

There are three areas of the alignment with indels. The first is right in the beginning, where Z-P annotate an indel right after the Valine (V). Both Clustal-Omega and Muscle moved this indel to the beginning of the sequence. That is probably because of heuristics setting gaps to the ends of sequences in case they are incomplete. In this case, I removed the Methionine because Z-P had not included it in their alignment. Nevertheless, FSA aligned this correctly (I think the Z-P alignment here is correct).

The second indel is two bases and different between all three programs and the Z-P alignment. Its a bit hard to say who is correct. One would have to look at more sequences, but I think that FSA looks good here (based on the posterior probabilities shown, see below).

The third indel involves the placement of the amino acids DLSH. In this case only muscle agrees with the Z-P alignment, however I think the FSA answer is the most informative (and only informative) answer in this case. FSA not only shows a single multiple alignment, but also colors it according to the posterior probabilities of the individual pairwise alignments. This is explained in the paper– the visualization is interactive and available via a java applet. What the FSA alignment shows is that there are really two solutions of almost equal accuracy: the one it produced and the Z-P solution. The latter requires two separate indel events. These are presumably much less common than mutations, so its not clear which is correct (this could be checked by examining a larger globin alignment including outlier species).

In summary, what the alignments with modern tools show, is that the multiple alignment is really hard. Much harder than most people think!

A final analysis I did is to look at the tree inferred according to an evolutionary model (the FSA alignment server does this automatically). The inferred tree is unrooted and shown below:

It agrees with the Z-P tree, except for the fact that the beta-delta ancestor is much closer than what is shown in their figure. It should be noted that at the time of their paper, (Markov) models for amino acid changes along a tree had not yet been developed; that only started happening with Jukes and Cantor in 1967.

Overall, Zuckerkandl and Pauling did a remarkable job with their alignment, ancestral inference and tree estimation. What I learned from my analysis is that their paper still provides a case study for multiple alignment today.