In reading the news yesterday I came across multiple reports claiming that even casually smoking marijuana can change your brain. I usually don’t pay much attention to such articles; I’ve never smoked a joint in my life. In fact, I’ve never even smoked a cigarette. So even though as a scientist I’ve been interested in cannabis from the molecular biology point of view, and as a citizen from a legal point of view, the issues have not been personal. However reading a USA Today article about the paper, I noticed that the principal investigator Hans Breiter was claiming to be a psychiatrist and mathematician. That is an unusual combination so I decided to take a closer look. I immediately found out the claim was a lie. In fact, the totality of math credentials of Hans Breiter consist of some logic/philosophy courses during a year abroad at St. Andrews while he was a pre-med student at Northwestern. Even being an undergraduate major in mathematics does not make one a mathematician, just as being an undergraduate major in biology does not makes one a doctor. Thus, with his outlandish claim, Hans Breiter had succeeded in personally offending me! So, I decided to take a look at his paper underlying the multiple news reports:

This is quite possibly the worst paper I’ve read all year (as some of my previous blog posts show I am saying something with this statement). Here is a breakdown of some of the issues with the paper:

### 1. Study design

First of all, the study has a very small sample size, with only 20 “cases” (marijuana users), a fact that is important to keep in mind in what follows. The title uses the term “recreational users” to describe them, and in the press release accompanying the article Breiter says that “Some of these people only used marijuana to get high once or twice a week. People think a little recreational use shouldn’t cause a problem, if someone is doing OK with work or school. Our data directly says this is not the case.” In fact, the majority of users in the study were smoking more than 10 joints per week. There is even a person in the study smoking more than 30 joints per week (as disclosed above, I’m not an expert on this stuff but if 30 joints per week is “recreation” then it seems to me that person is having a lot of fun). More importantly, Breiter’s statement in the press release is a lie. There is no evidence in the paper whatsoever, not even a tiny shred, that the users who were getting high once or twice a week were having any problems. There are also other issues with the study design. For example, the paper claims the users are not “abusing” other drugs, but it is quite possible that they are getting high on cocaine, heroin, or ??? as well, an issue that could quite possibly affect the study. The experiment consisted of an MRI scan of each user/control, but only a single scan was done. Given the variability in MRI scans this also seems problematic.

### 2. Multiple testing

The study looked at three aspects of brain morphometry in the study participants: gray matter density, volume and shape. Each of these morphometric analyses constituted multiple tests. In the case of gray matter density, estimates were based on small clusters of voxels, resulting in 123 tests (association of each voxel cluster with marijuana use). Volumes were estimated for four regions: left and right nucleus accumbens and amygdala. Shape was also tested in the same four regions. What the authors should have done is to correct the p-values computed for each of these tests by accounting for the total number of tests performed. Instead, (Bonferroni) corrections were performed separately for each type of analysis. For example, in the volume analysis p-values were required to be less than 0.0125 = 0.05/4. In other words, the extent of testing was not properly accounted for. Even so, many of the results were not significant. For example, the volume analysis showed no significant association for any of the four tested regions. The best case was the left nucleus accumbens (Figure 1C) with a corrected p-value of 0.015 which is over the authors’ own stated required threshold of 0.0125 (see caption). They use the language “The association with drug use, after correcting for 4 comparisons, was determined to be a trend toward significance” to describe this non-effect. It is worth noting that the removal of the outlier at a volume of over $800 mm^3$ would almost certainly flatten the line altogether and remove even the slight effect. It would have been nice to test this hypothesis but the authors did not release any of their data.

Figure 1c.

In the Fox News article about the paper, Breiter is quoted saying ““For the NAC [nucleus accumbens], all three measures were abnormal, and they were abnormal in a dose-dependent way, meaning the changes were greater with the amount of marijuana used,” Breiter said.  “The amygdala had abnormalities for shape and density, and only volume correlated with use.  But if you looked at all three types of measures, it showed the relationships between them were quite abnormal in the marijuana users, compared to the normal controls.” The result above shows this to be a lie. Volume did not significantly correlate with use.

This is all very bad, but things get uglier the more one looks at the paper. In the tables reporting the p-values, the authors do something I have never seen before in a published paper. They report the uncorrected p-values, indicating those that are significant (prior to correction) in boldface, and then put an asterisk next to those that are significant after their (incomplete) correction. I realize my own use of boldface is controversial… but what they are doing is truly insane. The fact that they put an asterisk next to the values significant after correction indicates they are aware that multiple testing is required. So why bother boldfacing p-values that they know are not significant? The overall effect is an impression that more tests are significant that is actually the case. See for yourself in their Table 4:

Table 4.

The fact that there are multiple columns is also problematic. Separate tests were performed for smoking occasions per day, joints per occasion, joints per week and smoking days per week. These measures are highly correlated, but even so multiply testing them requires multiple test correction. The authors simply didn’t perform it. They say “We did not correct for the number of drug use measures because these measures tend not be independent of each other”. In other words, they multiplied the number of tests by four, and chose to not worry about that. Unbelievable.

Then there is Table 5, where the authors did not report the p-values at all, only whether they were significant or not… without correction:

Table 5.

### 3. Correlation vs. causation

This issue is one of the oldest in the book. There is even a wikipedia entry about itCorrelation does not imply causation. Yet despite the fact the every result in the paper is directed at testing for association, in the last sentence of the abstract they say “These data suggest that marijuana exposure, even in young recreational users, is associated with exposure-dependent alterations of the neural matrix of core reward structures and is consistent with animal studies of changes in dendritic arborization.” At a minimum, such a result would require doing a longitudinal study. Breiter takes this language to an extreme in the press release accompanying the article. I repeat the statement he made that I quoted above where I boldface the causal claim: “”Some of these people only used marijuana to get high once or twice a week. People think a little recreational use shouldn’t cause a problem, if someone is doing OK with work or school. Our data directly says this is not the case.” I believe that scientists should be sanctioned for making public statements that directly contradict the content of their papers, as appears to be the case here. There is precedent for this.

A few years ago after the birth of our second daughter and in anticipation of our third, I started designing a one-room addition for our house. One of the problems I faced was figuring out the shape of the roof. I learned of the concept of the straight skeleton of a polygon, first defined by Oswin Aichholzer and Franz Aurenhammer in a book chapter “Straight Skeletons for General Polygonal Figures in the Plane” in 1996. Wikipedia provides an intuitive definition:

“The straight skeleton of a polygon is defined by a continuous shrinking process in which the edges of the polygon are moved inwards parallel to themselves at a constant speed. As the edges move in this way, the vertices where pairs of edges meet also move, at speeds that depend on the angle of the vertex. If one of these moving vertices collides with a nonadjacent edge, the polygon is split in two by the collision, and the process continues in each part. The straight skeleton is the set of curves traced out by the moving vertices in this process. In the illustration the top figure shows the shrinking process and the middle figure depicts the straight skeleton in blue.”

but the concept is best understood by picture:

The fact that straight skeletons fit “symmetrically” into the polygons that generated them, made me think about whether they could constitute aesthetic representations of phylogenetic trees. So I asked the inverse question: given a phylogenetic tree, i.e. a graph that is a tree with weighted edges, together with a cyclic orientation on its vertices, is there a convex polygon such that the tree is the straight skeleton of that polygon? A few google searches didn’t reveal anything, but fortunately and coincidentally, Satyan Devadoss, who is a topologist and computational geometer was visiting my group on his sabbatical (2009–2010).

Now, a few years later, Satyan and coauthors have written a paper providing a partial answer to my question. Their paper is about to appear in the next issue of Discrete Applied Mathematics:

The main theorem is formally about ribbon trees:

Definition. A ribbon tree is a tree (a connected graph with no cycles) for which each edge is assigned a nonnegative length, each internal vertex has degree at least three, and the edges incident to each vertex are cyclically ordered.

The authors prove the interesting result that there exists only a finite set of planar embeddings of a tree appearing as straight skeletons of convex polygons. Specifically, they show that:

Theorem. A ribbon tree with n leaves has at most 2n−5 suitable convex polygons.

Its fun to work out by hand the case of a star tree with three leaves:

The algebra works out to solving a cubic system of three equations that can be seen to have one unique positive solution (Lemma 5 in the paper).

The proof of the main theorem relies on some elementary trigonometry and algebra, as well as an interesting analogy of Cauchy’s “arm lemma” (if one increases one of the angles of a convex polygonal chain, then the distance between the endpoints will only increase). Furthermore, a few interesting cases and connections are pointed out along the way. For example, some ribbons, even caterpillar ribbons, are not realized by any polygon. There is also a related conference publication by many of the same authors, although including Aichholzer himself, that provides some interesting constructions in special cases:

• Oswin Aichholzer, Howard Cheng, Satyan L. Devadoss, Thomas Hackl, Stefan Huber, Brian Li, Andrej Risteski, What makes a tree a straight skeleton?, Canadian Conference on Computational Geometry, 2012.

Straight skeletons may yet find application in drawing phylogenetic trees, but for now the best out there are radial or circular representations optimizing various layout considerations.

My addition is now complete and the roof is absolutely beautiful.

The roof of the addition

Reproducibility has become a major issue in scientific publication, it is under scrutiny by many, making headlines in the news, is on the minds of journals, and there are guidelines for achieving it. Reproducibility is certainly important for scientific research, but I think that lost in the debate is the notion of usability. Reproducibility only ensures that the results of a paper can be recreated by others, but usability ensures that researchers can build on scientific work, and explore hypotheses and ideas unforeseen by authors of the original papers. Here I describe a case study in reproducibility and usability, that emerged from previous post I wrote about the paper

Feizi et al. describe a method called network deconvolution, that they claim improves the inference results for 8 out of 9 network inference methods, out of the 35 that were tested in the DREAM5 challenge. In DREAM5, participants were asked to examine four chip-based gene x expression matrices, and were also provided a list of transcription factors for each. They were then asked to provide ranked lists of transcription factor – gene interactions for each of the four datasets. The four datasets consisted of one computer simulation (the “in silico” dataset) and expression measurements in  E. coli, S. cerevisiae and S. aureus. The consortium received submission from 29 different groups and ran 6 other “off-the-shelf” methods, while also developing its own “community” method (for a total of 36=29+6+1). The community method consisted of applying the Borda count to the 35 methods being tested, to produce a new consensus, or community, network. Nicolas Bray and I tried to replicate the results of Feizi et al. so that we could test for ourselves the performance of network deconvolution with different parameters and on other DREAM5 methods ( Feizi et al. tested only 9 methods; there were 36 in total).  But despite contacting the authors for help we were unable to do so. In desperation, I even offered $100 for someone to replicate all of the figures in the paper. Perhaps as a result of my blogging efforts, or possibly due to a spontaneous change of heart, the authors finally released some of the code and data needed to reproduce some of the figures in their paper. In particular, I am pleased to say that the released material is sufficient to almost replicate Figure 2 of their paper which describes their results on a portion of the DREAM5 data. I say almost because the results for one of the networks is off, but to the authors credit it does appear that the distributed data and code are close to what was used to produce the figure in the paper (note: there is still not enough disclosure to replicate all of the figures of the paper, including the suspicious Figure S4 before and after revision of the supplement, and I am therefore not yet willing to concede the$100). What Feizi et al. did accomplish was to make their methods usable. That is to say, with the distributed code and data I was able to test the method with different parameters and on new datasets. In other words, Feizi et al. is still not completely reproducible, but it is usable. In this post, I’ll demonstrate why usability is important, and make the case that it is too frequently overlooked or confused with reproducibility. With usable network deconvolution code in hand, I was finally able to test some of the claims of Feizi et al.  First, I identify the relationship between the DREAM methods and the methods Feizi et al. applied network deconvolution to. In the figure below, I have reproduced Figure 2 from Feizi et al. together with Figure 2 from Marbach et al.:

Figure 2 from Feizi et al. aligned to Figure 2 from Marbach et al.

The mapping is more complex than appears at first sight. For example, in the case of Spearman correlation (method Corr #2 in Marbach et al., method #5 in Feizi et al.), Feizi et al. ran network deconvolution on the method after taking absolute values. This makes no sense, as throwing away the sign is to throw away a significant amount of information, not to mention it destroys any hope of connecting the approach to the intuition of inferring directed interactions from the observed via the idealized “model” described in the paper. On the other hand, Marbach et al. evaluated Spearman correlation with sign. Without taking the absolute value before evaluation negative edges, strong (negative) interactions, are ignored. This is the reason for the very poor performance of Spearman correlation and the reason for the discrepancy in bar heights between Marbach et al. and Feizi et al. for that method. The caption of Figure 2 in Feizi et al. begins “Network deconvolution applied to the inferred networks of top-scoring methods [1] from DREAM5..” This is obviously not true. Moreover, one network they did not test on was the community network of Marbach et al. which was the best method and the point of the whole paper. However the methods they did test on were ranked 2,3,4,6,8,12,14,16,28 (out of 36 methods). The 10th “community” method of Feizi et al. is actually the result of applying the community approach to the ND output from all the methods, so it is not in and of itself a test of ND. Of the nine tested methods, arguably only a handful were “top” methods. I do think its sensible to consider “top” to be the best methods for each category (although Correlation is so poor I would discard it altogether). That leaves four top methods. So instead of creating the illusion that network deconvolution improves 9/10 top scoring methods, what Feizi et al. should have reported is that 3 out of 4 of the top methods that were tested were improved by network deconvolution. That is the result of running network deconvolution with the default parameters. I was curious what happens when using the parameters that Feizi et al. applied to the protein interaction data (alpha=1, beta=0.99). Fortunately, because they have made the code usable, I was able to test this. The overall result as well as the scores on the individual datasets are shown below: The Feizi et al. results on gene regulatory networks using parameters different from the default. The results are very different. Despite the claims of Feizi et al. that network deconvolution is robust to choice of parameters, now only 1 out of 4 of the top methods are improved by network deconvolution. Strikingly, the top three methods tested have their quality degraded. In fact, the top method in two out of the three datasets tested is made worse by network deconvolution. Network deconvolution is certainly not robust to parameter choice. What was surprising to me was the improved performance of network deconvolution on the S. cerevisae dataset, especially for the mutual information and correlation methods. In fact, the improvement of network deconvolution over the methods is appears extraordinary. At this point I started to wonder about what the improvements really mean, i.e. what is the “score” that is being measured. The y-axis, called the “score” by Feizi et al. and Marbach et al. seemed to be changing drastically between runs. I wondered… what exactly is the score? What do the improvements mean? It turns out that “score” is defined as follows:

$score = \frac{1}{2} ( \frac{1}{3} \sum_{i=1}^3 -log_{10} p_{AUROC,i} + \frac{1}{3} \sum_{i=1}^3 -log_{10} p_{AUPR,i})$.

This formula requires some untangling: First of all, AUROC is shorthand for area under the ROC (receiver operator curve), and AUPR for area under the PR (precision recall) curve. For context, ROC is a standard concept in engineering/statistics. Precision and recall are used frequently, but the PR curve is used much less than ROC . Both are measures for judging the quality of a binary classifier. In the DREAM5 setting, this means the following: there is a gold standard of “positives”, namely a set of edges in a network that should be predicted by a method, and the remainder of the edges will be considered “negatives”, i.e. they should not be predicted. A method generates a list of edges, sorted (ranked) in some way. As one proceeds through the list, one can measure the fraction of positives and false positives predicted. The ROC and PR curves measure the performance. A ROC is simply a plot showing the true positive rate for a method as a function of the false positive rate. Suppose that there are positives in the gold standard out of a goal of edges. If one examines the top k predictions of a method, then among them there will be t “true” positives as well as k-t “false” positives. This will result in a single point on the ROC, i.e. the point ($\frac{k-t}{n-m},\frac{t}{m}$). This can be confusing at first glance for a number of reasons. First, the points do not necessarily form a function, e.g. there can be points with the same x-coordinate. Second, as one varies one obtains a set of points, not a curve. The ROC is a curve, and is obtained by taking the envelope of all of the points for $k \in \{1,\ldots,n\}$. The following intuition is helpful in understanding ROC:

1. The x coordinate in the ROC is the false positive rate. If one doesn’t make any predictions of edges at all, then the false positive rate is 0 (in the notation above k=0, t=0). On the other hand, if all edges are considered to be “true”, then the false positive rate is 1 and the corresponding point on the ROC is (1,1), which corresponds to k=n, t=m.
2. If a method has no predictive power, i.e. the ranking of the edges tells you nothing about which edges really are true, then the ROC is the line y=x. This is because lack of predictive power means that truncating the list at any k, results in the same proportion of true positives above and below the kth edge. And a simple calculation shows that this will correspond to the point ($\frac{k}{n},{k}{n}$) on the ROC curve.
3. ROC curves can be summarized by a single number that has meaning: the area under the ROC (AUROC). The observation above means that a method without any predictive power will have an AUROC of 1/2. Similarly, a “perfect” method, where he true edges are all ranked at the top will have an AUROC of 1. AUROC is widely used to summarize the content of a ROC curve because it has an intuitive meaning: the AUROC is the probability that if a positive and a negative edge are each picked at random from the list of edges, the positive will rank higher than the negative.

An alternative to ROC is the precision-recall curve. Precision, in the mathematics notation above, is the value $\frac{t}{k}$, i.e., the number of true positives divided by the number of true positives plus false positives. Recall is the same as sensitivity, or true positive rate: it is $\frac{t}{m}$. In other words, the PR curve contains the points ($\frac{t}{m},\frac{t}{k}$), as recall is usually plotted on the x-axis. The area under the precision-recall curve (AUPR) has an intuitive meaning just like AUROC. It is the average of precision across all recall values, or alternatively, the probability that if a “positive” edge is selected from the ranked list of the method, then an edge above it on the list will be “positive”. Neither precision-recall curves, nor AUPR are widely used. There is one problem with AUPR, which is that its value is dependent on the number of positive examples in the dataset. For this reason, it doesn’t make sense to average AUPR across datasets (while it does make sense for AUROC). For all of these reasons, I’m slightly uncomfortable with AUPR but that is not the main issue in the DREAM5 analysis. I have included an example of ROC and PR curves below. I generated them for the method “GENIE3″ tested by Feizi et al.. This was the method with the best overall score. The figure below is for the S. cerevisiae dataset:

The ROC and a PR curves before (top) and after (bottom) applying network deconvolution to the GENIE3 network.

The red curve in the ROC plots is what one would see for a method without any predictive power (point #2 above). In this case, what the plot shows is that GENIE3 is effectively ranking the edges of the network randomly. The PR curve is showing that at all recall levels there is very little precision. The difference between GENIE3 before and after network deconvolution is so small, that it is indistinguishable in the plots. I had to create separate plots before and after network deconvolution because the curves literally overlapped and were not visible together. The conclusion from plots such as these, should not be that there is statistically significance (in the difference between methods with/without network deconvolution, or in comparison to random), but rather that there is negligible effect. There is a final ingredient that is needed to constitute “score”. Instead of just averaging AUROC and AUPR, both are first converted into p-values that measure the statistical significance of the method being different from random. The way this was done was to create random networks from the edges of the 35 methods, and then to measure their quality (by AUROC or AUPR) to obtain a distribution. The p-value for a given method was then taken to be the area under the probability density function to the right of the method’s value. The graph below shows the pdf for AUROC from the S. cerevisae DREAM5 data that was used by Feizi et al. to generate the scores:

Distribution of AUROC for random methods generated from the S. cerevisiae submissions in Marbach et al.

In other words, almost all random methods had an AUROC of around 0.51, so any slight deviation from that was magnified in the computation of p-value, and then by taking the (negative) logarithm of that number a very high “score” was produced. The scores were then taken to be the average of the AUROC and AUPR scores. I can understand why Feizi et al. might be curious whether the difference between a method’s performance (before and after network deconvolution) is significantly different from random, but to replace magnitude of effect with statistical significance in this setting, with such small effect sizes, is to completely mask the fact that the methods are hardly distinguishable from random in the first place. To make concrete the implication of reporting the statistical significance instead of effect size, I examined the “significant” improvement of network deconvolution on the S. cerevisae and other datasets when run with the protein parameters rather than the default (second figure above). Below I show the AUROC and AUPR plots for the dataset.

The Feizi et al. results before and after network deconvolution using alpha=1, beta=0.99 (shown with AUROC).

The Feizi et al. results before and after network deconvolution using alpha=1, beta=0.99 (shown with AUPR).

My conclusion was that the use of “score” was basically a red herringWhat looked like major differences between methods disappears into tiny effects in the experimental datasets, and even the in silico variations are greatly diminished. The differences in AUROC of one part in 1000 hardly seem reasonable for concluding that network deconvolution works. Biologically, both results are that the methods cannot reliably predict edges in the network. With usable network deconvolution code at hand, I was curious about one final question. The main result of the DREAM5 paper

was that the community method was best. So I wondered whether network deconvolution would improve it. In particular, the community result shown in Feizi et al. was not a test of network deconvolution, it was simply a construction of the community from the 9 methods tested (two communities were constructed, one before and one after network deconvolution). To perform the test, I went to examine the DREAM5 data, available as supplementary material with the paper. I was extremely impressed with reproducibility. The participant submissions are all available, together with scripts that can be used to quickly obtain the results of the paper. However the data is not very usable. For example, what is provided is the top 100,000 edges that each method produced. But if one wants to use the full prediction of a method, it is not available. The implication of this in the context of network deconvolution is that it is not possible to test network deconvolution on the DREAM5 data without thresholding. Furthermore, in order to evaluate edges absolute value was applied to all the edge weights. Again, this makes the data much less useful for further experiments one may wish to conduct. In other words, DREAM5 is reproducible but not very usable. But since Feizi et al. suggest that network deconvolution can literally be run on anything with “indirect effect”, I decided to give it a spin. I did have to threshold the input (although fortunately, Feizi et al. have assured us that this is a fine way to run network deconvolution), so actually the experiment is entirely reasonable in terms of their paper. The figure is below (produced with the default network deconvolution parameters),  but before looking at it, please accept my apology for making it. I really think its the most incoherent, illogical, meaningless and misleading figure I’ve ever made. But it does abide by the spirit of network deconvolution:

The DREAM5 results before and after network deconvolution.

Alas, network deconvolution decreases the quality of the best method, namely the community methodThe wise crowds have been dumbed down. In fact, 19/36 methods become worse, 4 stay the same, and only 13 improve. Moreover, network deconvolution decreases the quality of the top method in each dataset. The only methods with consistent improvements when network deconvolution is applied are the mutual information and correlation methods, poor performers overall, that Feizi et al. ended up focusing on. I will acknowledge that one complaint (of the many possible) about my plot is that the overall results are dominated by the in silico dataset. True- and I’ve tried to emphasize that by setting the y-axis to be the same in each dataset (unlike Feizi et al.) But I think its clear that no matter how the datasets are combined into an overall score, the result is that network deconvolution is not consistently improving methods. All of the analyses I’ve done were made possible thanks to the improved usability of network deconvolution. It is unfortunate that the result of the analyses is that network deconvolution should not be used. Still, I think this examples makes a good case for the fact that reproducibility is essential, but usability is more important.

Jingyi Jessica Li and Mark D. Biggin

We published a paper titled “System wide analyses have underestimated protein abundances and the importance of transcription in mammals” in PeerJ on Feb 27, 2014 (https://peerj.com/articles/270/). In our paper we use statistical methods to reanalyze the data of several proteomics papers to assess the relative importance that each step in gene expression plays in determining the variance in protein amounts expressed by each gene. Historically transcription was viewed as the dominant step. More recently, though, system wide analyses have claimed that translation plays the dominant role and that differences in mRNA expression between genes explain only 10-40% of the differences in protein levels. We find that when measurement errors in mRNA and protein abundance data is taken into account, transcription again appears to be the dominant step.

Our study was initially motivated by our observation that the system wide label-free mass spectrometry data of 61 housekeeping proteins in Schwanhäusser et al (2011) have lower expression estimates than their corresponding individual protein measurements based on SILAC mass spectrometry or western blot data. The underestimation bias is especially obvious for proteins with expression levels lower than 106 molecules per cell. We therefore corrected this non linear bias to determine how more accurately scaled data impacts the relationship between protein and mRNA abundance data. We found that a two-part spline model fits well on the 61 housing keep protein data and applied this model to correct the system-wide protein abundance estimates in Schwanhäusser et al (2011). After this correction, our corrected protein abundance estimates show a significantly higher correlation with mRNA abundances than do the uncorrected protein data.

We then investigated if other sources of experimental error could further explain the relatively poor correlation between protein and mRNA levels. We employed two strategies that both use Analysis of Variance (ANOVA) to determine the percent of the variation in measured protein expression levels that is due to each of the four steps: transcription, mRNA degradation, translation, and protein degradation, as well as estimating the measurement errors in each step. ANOVA is a classic statistical method developed by RA Fisher in the 1920s. Despite the fact that this is a well-regarded and standard approach in some fields, its usefulness has not been widely appreciated in genomics and proteomics. In our first strategy, we estimated the variances of errors in mRNA and protein abundances using direct experimental measurements provided by control experiments in the Schwanhäusser et al. paper. Plugging these variances into ANOVA, we found that mRNA levels explain at least 56% of the differences in protein abundance for the 4,212 genes detected by Schwänhausser et al (2011). However,  because one major source of error—systematic error of protein measurements—could not be estimated, the true percent contribution of mRNA to protein expression should be higher. We also employed a second, independent strategy to determine the contribution of mRNA levels to protein expression. We show that the variance in translation rates directly measured by ribosome profiling is only 12% of that inferred by Schwanhäusser et al (2011), and that the measured and inferred translation rates correlate poorly. Based on this, our second strategy suggests that mRNA levels explain ∼81% of the variance in protein levels. While the magnitudes of our two estimates vary, they both suggest that transcription plays a more important role than the earlier studies implied and translation a much smaller role.

Finally, we noted that all of the published estimates, as welll as ours given above, only apply to those genes whose mRNA and protein expression was detected. Based on a detailed analysis by Hebenstreit et al. (2012), we estimate that approximately 40% of genes in a given cell within a population express no mRNA. Since there can be no translation in the absence of mRNA, we argue that differences in translation rates can play no role in determining the expression levels for the ∼40% of genes that are non-expressed.

Given the reaction to our use of the word “fraud” in our blog post on Feizi et al., we would like to remind readers how it was actually used:

In academia the word fraudulent is usually reserved for outright forgery. However given what appears to be deliberate hiding, twisting and torturing of the facts by Feizi et al., we think that fraud (“deception deliberately practiced in order to secure unfair or unlawful gain”) is a reasonable characterization. If the paper had been presented in a straightforward manner, would it have been accepted by Nature Biotechnology?

While the reaction has largely focused on reproducibility and the swapping of figures, we reiterate our stance: misleading one’s readers (and reviewers) is itself a form of scientific fraud. Regarding the other questions raised, the response from Feizi, Marbach, Médard and Kellis falls short. On their new website their code has changed, their explanations are in many cases incoherent, self-contradictory, and make false claims, and the newly added correction to Figure S4 turns out not to explain the difference between the figures in the revisions. We explain all of this below. And yet for us the claim of fraud can stand on the basis of deception alone. The distance between the image created by the main text of the paper and the truth of their method is simply too great.

However, one unfortunate fact is that that judging that distance requires understanding some of the mathematics in the paper, and another common reaction has been “I don’t understand the math.”  For such readers, we explain network deconvolution in simple terms by providing an analogy using numbers instead of matrices. Understanding this requires no more than high school algebra. We follow that with a response to Feizi et al.‘s rebuttal.

Network deconvolution math made simple

In what follows, number deconvolution, a simplified form of network deconvolution, is explained in red. Network deconvolution is explained in blue.

The main concept to understand is that of a geometric series. This is a series with a constant ratio between successive terms, for example

$\displaystyle \frac{1}{2}+\frac{1}{4}+\frac{1}{8} + \cdots \ \ \ \ \ (1)$

In this case, each term is one half the previous term, so that the ratio between successive terms is always ${\frac{1}{2}}$. This sum converges to ${1}$, an intuitive notion that is formalized by defining the infinite sum to be

$\displaystyle \sum_{i=1}^{\infty} \left( \frac{1}{2}\right)^i = lim_{n \rightarrow \infty} \sum_{i=1}^n \left( \frac{1}{2} \right)^i = 1. \ \ \ \ \ (2)$

This is the familiar math of the

It is important to note that not every geometric series converges. If the number ${\frac{1}{2}}$ in the sum above is replaced by ${2}$, then the series ${2+4+8+16+\cdots}$ is said to diverge. The are two basic questions about geometric series: (1) for which numbers do they converge and (2) when they do converge, what do they converge to. These two questions are answered in the following:

Let ${a}$ be a real number. The geometric series converges if, and only if, ${-1 < a < 1}$, in which case

$\displaystyle m=\sum_{i=1}^{\infty} a^i = a(1-a)^{-1}. \ \ \ \ \ (3)$

It is not hard to see why this is true:

$\displaystyle m=a+a^2+\cdots = a(1+a+a^2+\cdots) = a(1+m) \Rightarrow m=a(1-a)^{-1}.$

Returning to the example of ${a=\frac{1}{2}}$, we see that it is a special case of this result. It converges because ${-1 < \frac{1}{2} < 1}$, and furthermore, ${\frac{\frac{1}{2}}{1-\frac{1}{2}} = 1}$.

Matrices behave like numbers in some ways, for example they can be added and multiplied, yet quite differently in others. The generalization of geometric series as in Feizi et al. specifically deals with diagonalizable matrices, a class of matrices which has many convenient properties. Let ${A}$ be such a matrix. The geometric series converges if, and only if, the eigenvalues of ${A}$ (numbers associated to the matrix) lie strictly between ${-1}$ and ${1}$, in which case

$\displaystyle M = \sum_{i=1}^{\infty} A^i = A \cdot (I-A)^{-1} \ \ \ \ \ (4)$

Notice that what has changed is that the condition that ${-1 has been replaced by an eigenvalue restriction on m and the formula  $A(I-A)^{-1}$ is just like $a(1-a)^{-1}$ except the operation of matrix inversion is required instead of number inversion. The result is obvious from elementary linear algebra because the assumption that ${A}$ is diagonalizable means that ${A=UDU^{-1}}$ for some matrix ${U}$, where ${D}$ is a diagonal matrix with the eigenvalues of ${A}$ on the diagonal. Therefore ${A^k=UD^kU^{-1}}$ and the series converges if, and only if, the geometric series formed by summing the powers of each eigenvalue converge.

In following Feizi et al., we call the number ${\frac{1}{2}}$ in the example above the “direct number”, and the remainder of the sum the “indirect number”. One should think of the indirect number as consisting of echoes of the direct number. Furthermore, following the language in Feizi et al., we call the ratio between terms, again the number ${\frac{1}{2}}$ in the example above, the indirect flow. When it is strong, the terms decrease slowly. When it is weak, the terms decrease rapidly.

In Feizi et al., the matrix A is called the “direct effect”, and the remainder of the sum the “indirect effect”. One should think of the indirect effect as consisting of echoes of the direct effect. Furthermore, following the language in Feizi et al., we call the rate at which the terms in the indirect effect decay the indirect flow. When it is strong, the terms decrease slowly. When it is weak, the terms decrease rapidly.

In number deconvolution the goal is to infer ${a}$ from ${m}$. That is, one would like to remove the indirect effect, the echoes, out of ${m}$. If ${m=a(1-a)^{-1}}$, then solving for ${a}$ we obtain

$\displaystyle a=m(1+m)^{-1}. \ \ \ \ \ (5)$

That is the formula for number deconvolution.

In network deconvolution the goal is to infer  ${A}$ from ${M}$. That is, one would like to remove the indirect effect, the echoes, out of ${M}$ as follows: If ${M=A(I-A)^{-1}}$ then solving for ${A}$ we obtain

$\displaystyle A = M \cdot (I+M)^{-1} \ \ \ \ \ (6)$

That is the formula for network deconvolution.

Everything looks great but there is a problem. Even though one can plug any number ${m}$ (except ${m=-1}$) into the formula and get out an ${a}$, the formula only gives an ${a}$ satisfying ${\sum_{i=1}^{\infty} a^i = m}$ when ${m > -\frac{1}{2}}$. For example, starting with ${m=-2}$ and then plugging it into number deconvolution one gets ${a=2}$. But ${2+4+ 8+\cdots}$ diverges.

The restriction on ${m}$ in number deconvolution, namely that it has to be bigger than ${-\frac{1}{2}}$, can be translated into the same restriction on the eigenvalues of ${M}$ for network deconvolution. This follows from the fact that ${M \, = \, \sum_{i=1}^{\infty}A^i \, = \, \sum_{i=1}^{\infty}UD^iU^{-1}}$ means that if ${\lambda}$ is an eigenvalue of ${A}$ then ${\mu :=\frac{\lambda}{1-\lambda}}$ is an eigenvalue of ${M}$ and all the eigenvalues of ${M}$ arise in this way. Therefore, ${\lambda = \frac{\mu}{1+\mu}}$ and the condition ${-1 < \lambda < 1}$ holds if, and only if, ${\mu>-\frac{1}{2}}$

But what if we wanted number deconvolution to work for all negative numbers? One idea is to follow Feizi et al.’s approach and introduce a scaling factor called ${\gamma}$ to be applied to ${m}$, so that the product ${\gamma m}$ can be deconvolved. For example, suppose we start with ${m=-12}$. We’d like to find ${a}$. But we can’t just apply number deconvolution. So we multiply ${m}$ by ${-\frac{1}{12}}$ to get ${1}$, and then deconvolve that to get ${a=\frac{1}{2}}$. We could have multiplied ${m}$ by something else, say ${\frac{-1}{6}}$ in which case we’d get ${a=\frac{2}{3}}$. In fact, there are infinitely many scaling factors, giving infinitely many solutions ${a}$. In fact, we can get any ${a}$ between ${-1}$ and ${1}$.

When ${M}$ is not writable as a geometric sum (decomposable), there exist scaling factors ${\gamma}$ such that the scaled matrix ${\gamma M}$ is decomposable. This is obvious because if ${\mu_{-} =\mu_1 \leq \ldots \leq \mu_n = \mu_{+}}$ are the eigenvalues of ${M}$, then ${\gamma \mu_1,\ldots,\gamma \mu_n}$ are the eigenvalues of ${\gamma M}$ so by choosing ${0 \leq \gamma < |\frac{1}{2\mu_{-}}|}$ we are guaranteed that ${\gamma M}$ is decomposable. Let ${M}$ be a real symmetric matrix with minimum/maximum eigenvalues ${\mu_{-}}$ and ${\mu_{+}}$ respectively and ${0<\beta<1}$. If

$\displaystyle \gamma = \frac{\beta}{max\left((1-\beta)\mu_{+},(-1-\beta)\mu_{-}\right)} \ \ \ \ \ (7)$

then the matrix ${\gamma M}$ is decomposable as ${\gamma M = \sum_{i=1}^{\infty}A^i}$. Furthermore, if ${\lambda_{-},\lambda_{+}}$ are the minimum/maximum eigenvalues of ${A}$ respectively then ${max(|\lambda_{-}|,|\lambda_{+}|) = \beta}$ (we omit the derivation, but it is straightforward).

Matrices do behave differently than numbers, and so it is not true that network deconvolution can produce any matrix as an output. However, as with number deconvolution, it remains true that network deconvolution can produce an infinite number of possible matrices.

Not only do Feizi et al. not mention this anywhere in the main text, instead they create the impression that there is a single output from any input.

Feizi et al.’s response

We now turn to the response of Feizi et al. to our post. First, upon initial inspection of the “Try It Out” site, we were astounded to find that the code being distributed there is not the same as the code distributed with the paper (in particular, step 1 has been removed) and so it seems unlikely that it could be used to replicate the paper’s results. Unless, that is, the code originally distributed with the paper was not the code used to generate its results. Second, the “one click reproduction”s that the authors have now added do not actually start with the original data but instead merely produce plots from some results without showing how those results were generated. This is not what reproducibility means. Third, while the one matrix of results from a “one click reproduction” that we looked at (DI_ND_1bkr.mat) was very close to the matrix originally distributed with the paper, it was slightly different. It was close and hopefully it does generate basically the same figure, but as we explain below, we’ve spent a good bit of time on this and have no desire to spend any more. This is why we have tried to incentivize someone to simply reproduce the results.

In retrospect, we regret not explaining exactly how we came to be so skeptical about the reproducibility of the results in the Feizi et al. paper. While we very quickly (yet not instantly) recognized how misleading the paper is, our initial goal in looking at their results was not to verify reproducibility (which we assumed) but rather to explore how changing the parameters affected the results. Verifying that we could recover the results from the paper was only supposed to be the first step in this process. We downloaded the file of datasets and code provided by the authors and began examining the protein contact dataset.

After writing scripts that we verified exactly regenerated some figures from the paper from the output matrices distributed by the authors, we then checked whether we arrived at the same results when running the ND code on what we assumed was the input data (files with names like “MI_1bkr.txt” while the output was named “MI_ND_1bkr.txt”). We were surprised when the output did not match, however we were then informed by the authors that they had not distributed the input data but rather thresholded versions of the input. When we asked the authors to provide us with the actual data, we were told that it would violate scientific etiquette to send us another scientist’s dataset and that we would have to regenerate it ourselves. We had never heard this claimed point of etiquette before, but acquiesced and attempted to do so. However, the matrices produced from the original data were actually a different size than those used by the authors (suggesting that they had used a different alignment).

Stymied on that front, we turned to the DREAM5 dataset instead. While the actual scoring that goes into the DREAM analysis is fairly complicated, we decided that we would start by merely checking that we could regenerate the output provided by the authors. We eventually received step-by-step instructions for doing so and, because those steps did not produce the provided output, asked a simple question suggested by our experience with protein contact dataset: to generate the provided results, should we use as input the provided non-ND matrices? We asked this question four times in total without receiving a reply. We sent our script attempting to implement their steps and received no response.

At some point, we also discovered that the authors had been using different parameter values for different datasets without disclosing that to their readers. They could provide no coherent explanation for doing so. And sometime after this, we found that the authors had removed the data files from their website: readers were directed instead to acquire the datasets elsewhere and the results from the paper were no longer provided there.

At this point, we expect it is obvious why we were skeptical about the reproducibility of the results. Having said that, we never wanted to believe that the results of the paper were not reproducible. It was not our initial assumption and we still hope to be proven wrong. However, we continue to wait, and the fact that the code has been changed yet again, and that the authors’ explanation in regards to figure S4 does not appear to explain its changes (see below) do not fill us with confidence.

We repeat that while most of the discussion above has focused on replicability, we will quote again the definition of “fraud” used in the original post: “deception deliberately practiced in order to secure unfair or unlawful gain”. Our main point is this: would any honest reader of the main text of the original paper recognize the actual method implemented? We don’t think so, hence deception. And we believe the authors deliberately wrote the paper in this way to unfairly gain acceptance at Nature Biotechnology.

Now, in response to the authors rebuttal, we offer the following (Feizi et al. remarks in italics):

Point-by-point rebuttal

Appendix A

The pre-processing step that Bray and Pachter criticize (step 1 in their description of our work) has no effect on the performance of ND. Mathematically, because our matrices are non-negative with min zero, a linear scaling of the network results in a similar scaling of its eigenvalues, which are normalized during eigenvalue scaling, canceling out the linear scaling of step 1. Practically, removing step 1 from the code has little effect on the performance of ND on the DREAM5 regulatory networks, as we show in Figure 1 below.

We cannot imagine how the authors came to make the statement “our matrices are non-negative with min zero”. After all, one of the inputs they used in their paper was correlation matrices and they surely know that correlations can be negative. If the map in step 1 were linear rather than – as we correctly stated in the document they are responding to – affine (for those unfamiliar with the terms, “linear” here means only scaling values while “affine” means scaling and shifting as well), then this explanation would be correct and it would have no effect. This just makes it even stranger that immediately after giving this explanation, the authors produced a graph showing the (sometimes quite significant) effect of what we assume was the affine mapping.

Stranger still is that after this defense, Step 1 has now been removed from the code.

Appendix B

The eigenvalue scaling (step 5) is essential both theoretically, to guarantee the convergence of the Taylor series, and practically, as performance decreases without it in practice. This step is clearly stated in both the main text of our paper and in the supplement, despite the claims by Bray and Pachter that it somehow was maliciously concealed from our description of the method (just search the manuscript for the word ‘scaling’). Our empirical results confirm that this step is also necessary in practice, as ND without eigenvalue scaling is consistently performing worse, as we show in Figure 2 below:

We do not understand why the author’s felt the need to defend the use of eigenvalue scaling when we have never suggested it should be removed. Our objection was rather to how it was presented (or not) to the reader. Yes, there is a mention of the word “scaling”. But it is done so in a way that makes it sound as though it were some trivial issue: there’s an assumption, you scale, assumption satisfied, done. And this mention occurs in the context of the rest of the main text of the paper which presents a clear image of a simple, parameterless, 100% theoretically justified method: you have a matrix, you put it into the method, you get the output, done (and it is “globally optimal”). In other words, it is the math we discuss above prior to the arrival of the parameter $\gamma$.

Yes, scaling is dealt with at length in the supplement. Perhaps Feizi et al. think it is fine for the main text of a paper to convey such a different image from the truth contained in its supplement and in the code. We do not and we would hope the rest of the scientific community agrees with us.

Appendix C: Robustness to input parameters.

Bray and Pachter claim that “the reason for covering up the existence of parameters is that the parameters were tuned to obtain the results”. Once again the claims are incorrect and unfounded. (1) First, we did not cover up the existence of the parameters. (2) Second, we used exactly the exact same set of parameters for all 27 tested regulatory networks in the DREAM challenge. (3) Third, we show here that our results are robust to the parameter values, using β = 0.5 and β = 0.99 for the DREAM5 network.

(1) Again: in the main text of the paper, there is not even a hint that the method could have parameters. This is highly relevant information that a reader who is looking at its performance on datasets should have access to. That reader could then ask, e.g. what parameters were used and how they were selected. They would even have the ability to wonder if the parameters were tuned to improve apparent performance. It is worth noting that when the paper was peer reviewed, the reviewers did not have access to the actual values of the parameters used.

(2) It is, of course, perfectly possible to tune parameters while using those same parameters on multiple datasets.

(3) The authors seem to have forgotten that there are two parameters to the method and that they gave the other parameter (alpha) different values on different datasets as well. They also have omitted their performance on the “Community” dataset for some reason.

Appendix D: Network Deconvolution vs. Partial Correlation.

Bray and Pachter compare network deconvolution to partial correlation using a test dataset built using a partial correlation model. In this very artificial setting, it is thus not surprising that partial correlation performs better, as it exactly matches the process that generated the data in the first place. To demonstrate the superiority of partial correlation, Bray and Pachter should test it on real datasets, such as the ones provided as part of the DREAM5 benchmarks . In our experience, partial correlation performed very poorly in practice, but we encourage Bray and Pachter to try it for themselves.

We looked at partial correlations here for a very simple reason: the authors originally claimed that their method reduced to it in the context we considered here. Thus, comparing them seemed a natural thing to do.

The idea that we should look at the performance of partial correlations in other contexts makes no sense: we have never claimed that it is the right tool to solve every problem. Indeed, it was the authors who claimed that their tool was “widely applicable”. By arguing for domain specific tools, they seem to be making our point for us.

Claim 1: “the method used to obtain the results in the paper is completely different than the idealized version sold in the main text of the paper”.

The paper clearly describes both the key matrix operation (“step 6″ in Lior Pachter’s blog post) which is shown in figure 1, and all the pre- and post-processing steps, which are all part of our method. There is nothing mischievous about including these pre- and post-processing steps that were clearly defined, well described, and implemented in the provided code.

The statement that the paper describes all steps is simply false. For example, the affine mapping appears nowhere.

Claim 2: “the method actually used has parameters that need to be set, yet no approach to setting them is provided”.

It is unfortunate that so many methods in our field have parameters associated with them, but we are not the first method to have them. However, we do provide guidelines for setting these in the supplement.

It is strange to suggest that it is unfortunate that methods have parameters. It is also strange to describe their guidelines as such when they did not even follow them. Also, there is no guideline for setting alpha in the supplement. In their FAQ, they say “we used beta=0.5 as we expected indirect flows to be propagating more slowly”. We wonder where one derives these expectations from.

Dr. Pachter also points out that a correction to Supplement Figure S4 on August 26th 2013 was not fully documented. We apologize for the omission and provide additional details in an updated correction notice dated February 12, 2014.

The correction notice states that the original Figure S4 was plotted with the incorrect x-axis. In particular it states that the maximum eigenvalue of the observed network was used, instead of the maximum eigenvalue of the direct network. We checked this correction, and have produced below the old curve and the new curve plotted on the same x-axis:

The new and old Figure S4. The old curve, remapped into the correct x-axis coordinates is shown in red. The new curve is shown in blue. Raw data was extracted from the supplement PDFs using WebPlotDigitizer.

As can be seen, the two curves are not the same. While it is expected that due to the transformation the red curve should not cover the whole x-axis, if the only difference was the choice of coordinates, the curves should overlap exactly. We cannot explain the discrepancy.

This is the third and final post in a series (part1, part2) of posts on two back-to-back papers published in Nature Biotechnology in August 2013:

1. Baruch Barzel & Albert-László Barabási, Network link prediction by global silencing of indirect correlationsNature Biotechnology 31(8), 2013, p 720–725. doi:10.1038/nbt.2601
2. Soheil Feizi, Daniel Marbach, Muriel Médard & Manolis Kellis, Network deconvolution as a general method to distinguish direct dependencies in networksNature Biotechnology 31(8), 2013, p 726–733. doi:10.1038/nbt.2635

An inconvenient request

One of the great things about conferences is that there is time to chat in person with distant friends and collaborators. Last July, at the ISMB conference in Berlin, I found myself doing just that during one of the coffee breaks. Suddenly, Manolis Kellis approached me and asked to talk in private. The reason for his interruption: he came to request that I remove an arXiv post of mine, namely “Comment on ‘Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions“, a response to a paper by Ward and Kellis. Why? He pointed out that my arXiv post was ranking highly on Google. This was inconvenient for him, he explained, while insisting that it was wrong of me to post a criticism of his work on a forum where he could not directly respond. He suggested that it would be best to work out any issues I might have with his paper offline. Unfortunately,  there was the inconvenient truth that arXiv postings cannot be removed. Unlike some journals, where, say, a supplement can be revised while having the original removed (see the figure switching of Feizi et al.), arXiv preprints are permanent.

My initial confusion quickly turned to anger. After all, my arXiv comment had been rejected from Science where I had submitted it as a technical comment on the Ward-Kellis paper. I had then put it on the arXiv as a last resort measure to at least have some record of my concerns publicly accessible. How is this wrong? Can one not critique the work of Manolis Kellis?

Network nonsense begins

My first review of a Manolis Kellis paper was in the fall of 2006, in my capacity as a program committee member of the  Research in Computational Molecular Biology (RECOMB) conference held in Oakland, CA in 2007. Because Oakland is right next to Berkeley, a number of Berkeley professors were involved in organizing and running the conference. Terry Speed was chair of the program committee. I was out of the country that year on sabbatical at the University of Oxford, so I could not participate, or even attend, the conference, but I volunteered to serve on the program committee. For those not familiar with the RECOMB review process, it is modeled after the standard Computer Science conferences. The program committee chair forms the program committee, who are then assigned a handful of papers to review and score. Reviews are entered on a website, and after a brief period of online discussion about borderline papers, scores are revised and accept/reject decisions are made. Authors can revise their manuscripts, but the reviewers never see the papers again before publication in the proceedings.

One of the papers I was assigned to review was by a student named Joshua Grochow and his advisor Manolis Kellis. The paper was titled “Network Motif Discovery Using Subgraph Enumeration and Symmetry-Breaking“. Although networks were not my research focus at the time, and “symmetry-breaking” evoked in me nightmares from the physics underworld, I agreed to the review. The paper seemed to contain some interesting algorithms, appeared to have a combinatorial flavor, and potentially important applications- a good mix for RECOMB.

The problem addressed by Grochow & Kellis was that of identifying “network motifs” in biological networks. “Motifs” can be defined in a variety of ways, and the Grochow-Kellis objective was simple. In graph theoretic terms, given a graph G, the goal was to find subgraphs occurring with high multiplicity to an extent unlikely in a random graph. There are many models for random graphs, and the one that the results in Grochow-Kellis are based on is the Erdös-Renyi model (each edge chosen independently with some fixed probability). The reason this definition might be of biological interest, is that recurrent motifs interspersed in a graph are likely to represent evolutionarily conserved interaction modules.

The paper begins with a description of the method. I won’t go into the details here, except to say that everything seemed well until I read the caption of Figure 3. There the number 27,720 caught my eye.

During my first few years of graduate school I took many courses on enumerative and algebraic combinatorics. There are some numbers that combinatorialists just “know”. For example, seeing 42 emerge as the answer to a counting problem does not bring to mind Douglas Adams, but rather the vast literature on Catalan numbers and their connections to dozens of well-known counting problems. Similarly, a number such as 126 brings to mind binomial coefficients ($126={9 \choose 4}$), and with them the idea of counting the number of subsets of fixed size from a larger set. When I saw the number 27,720 I had a hunch that somehow some canonical combinatorial set had been enumerated. The idea may have entered my mind because of the picture of the “motif” in which I immediately recognized a clique (all vertices mutually connected) and a stable set (no pair of vertices connected). In any case, I realized that

$27,720 = 220 \cdot 126 = {12 \choose 3} \cdot {9 \choose 4}$.

The significance of this is that the “motif” on the left-hand side of Figure 3 had appeared many times  because of a type of double- or rather thousandfold- counting. Instead of representing statistically significant recurring independent motifs, this “motif” arises because of a combinatorial artifact. In the specific example of Figure 3, the motif occurred once for any choice of 4 nodes from the clique of size 9, and any choice of 3 nodes from the stable set of size 12.

The point is that in a graph, any subgraph attached to a large clique (or stable set) will occur many times. This simple observation is a result of the fact that there are many subgraphs of a clique (or stable set) that are identical. I realized that this meant that the Grochow-Kellis method was essentially a heuristic for finding cliques and stable sets in graphs. The particular “network motifs” they were pulling out were just subgraphs that happened to be connected to some large cliques and stable sets. There are two problems with this: first, a clique or a stable set can hardly be considered an interesting “network motif”. Moreover, the fact that they appear in biological networks much more than in Erdös-Renyi random graphs is not surprising. Second, there is a large literature on finding cliques in graphs, none of which Grochow-Kellis cited or seemed to be familiar with.

The question of the performance of the Grochow-Kellis algorithm is answered in their Figure 3 as well. There is a slightly larger motif consisting of nodes from the stable set of size 12, instead of 3. That motif occurs in all ${12 \choose 6}$ subsets of the stable set instead of ${12 \choose 3}$ subsets which means that there is a motif that occurs 116,424 times! Grochow and Kellis’s algorithm did not even achieve its stated goal. It really ought to have outputted the left hand side figure with six nodes in the stable set on the left, and not three. In other words, this was a paper providing uninteresting solutions from a biological point of view, and doing so poorly to boot.

I wrote up a detailed report on the paper, and posted it on the RECOMB review website together with poor scores reflecting my opinion that the paper had to be rejected. How could RECOMB, ostensibly the premier computer science conference on computational and algorithmic biology, publish a paper with neither a computational nor biological result? Not to mention an algorithm that demonstratably did not find the most frequently occurring motif.

As you might already guess, my rejection was subsequently overruled. I don’t know who made the final decision to accept the Grochow & Kellis paper to the RECOMB conference, although presumably it was the program committee chair. The decision jarred with my sense of scientific integrity. I had put considerable effort into reviewing the paper and understanding it, and I felt that I had provided a compelling objective argument for why the paper was fundamentally flawed- the fact that the results were trivial (and incorrect!) was not a subjective statement. At this point I need to point out that the RECOMB conference is quite difficult to get into. The acceptance rate for papers in 2007, consistent with other years, was 21.8%. I knew this meant that even a single very negative review, especially one with a compelling argument against the paper, almost certainly would lead to rejection of the paper. So I couldn’t understand then, nor do I still understand now, on what basis the decision was made to accept the paper. This bothered me greatly, and after much deliberation I started boycotting the conference. Despite publishing five RECOMB papers from 2000 to 2006 and regularly attending the meeting during that time, the continued poor decisions and haphazard standards for papers selected have led me to not return in almost 8 years.

Grochow and Kellis obviously received my review and considered how to “deal with it”. They added a section titled “The role of combinatorial effects”, in which they explained the origins of the number 27,720 that they gleaned from my report, but then spun the bad news they had received as “resulting from combinatorial connectivity patterns prevalent in larger network structures.” They then added that “…this combinatorial clustering effect brings into question the current definition of network motif” and proposed that “additional statistics…might well be suited to identify larger meaningful networks.” This is a lot like someone claiming to discover a bacteria whose DNA is arsenic-based and upon being told by others that the “discovery” is incorrect – in fact, that very bacteria seeks out phosphorous – responding that this is “really helpful” and that it “raises lots of new interesting open questions” about how arsenate gets into cells. Chutzpah. When you discover your work is flawed, the correct response is to retract it.

I don’t think people read papers very carefully. Joshua Grochow went on to win the MIT Charles and Jennifer Johnson Outstanding M. Eng. Thesis Award for his RECOMB work on network motif discovery. [Added February 18: Grochow and Kellis have posted a reply here].

The nature of man

I have to admit that after the Grochow-Kellis paper I was a bit skeptical of Kellis’ work. Not because of the paper itself (everyone makes mistakes), but because of the way he responded to my review. So a year and a half ago, when Manolis Kellis published a paper in an area I care about and am involved in, I may have had a negative prior. The paper was Luke Ward and Manolis Kellis “Evidence for Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions”, Science 337 (2012) . Having been involved with the ENCODE pilot, where I contributed to the multiple alignment sub-project, I was curious what comparative genomics insights the full-scale $130 million dollar project revealed. The press releases accompanying the Ward-Kellis paper (e.g. The Nature of Man, The Economist) were suggesting that Ward and Kellis had figured out what makes a human a human; my curiosity was understandably piqued. Ward and Kellis combined population genomic data from the 1000 Genomes Project with biochemical data from the ENCODE project to look for signatures of human constraint in regulatory elements. Their analysis was based on measuring three different proxies for constraint: SNP density, heterozygosity and derived allele frequency. To identify specific classes of regulatory regions under constraint, aggregated regions associated with specific gene ontology (GO) categories were tested for significance. Reading the paper I was amazed to discover they found precisely two categories: retinal cone cell development and nerve growth factor receptor signaling. It was only upon reading the supplement that I discovered that their tests had produced 53 other GO categories as well (Table S5). Despite the fact that the listed categories were required to pass a false discovery rate (FDR) threshold for both the heterozygosity and derived allele frequency (DAF) measures, it was statistically invalid for them to highlight any specific GO category. FDR control merely guarantees a low false discovery rate among the entries in the entire list. Moreover, there was no obvious explanation for why categories such as chromatin binding (which had a smaller DAF than nerve growth) or protein binding (with the smallest p-value) appeared to be under purifying selection. As with the Feizi et al. paper, the supplement produced a story much less clean than the one presented in the main body of the paper. In fact, retinal cone cell development and nerve growth factor were 33 and 34 out of the 55 listed GO categories when sorted by the DAF p-value (42 and 54 when sorted by heterozygosity p-value). In other words, the story being sold in the paper was based on blatant statistically invalid cherry picking. The other result of the paper was an estimate that in addition to the 5% of the human genome conserved across mammalian genomes, at least another 4% has been subject to lineage-specific constraint. This result was based on adding up the estimates of constrained nucleotides from their Table S6 (using the derived allele frequency measure). These were calculated using a statistic that was computed as follows: for each one of ten bins determined according to estimated background selection strength, and for every feature F, the average DAF value DF was rescaled to $PUC_F = \frac{(D_F - D_{CNDC})}{(D_{NCNE}-D_{CNDC})}$, where DCNDC and DNCNE were the bin-specific average DAFs of conserved non-degenerate coding regions and non-conserved non-ENCODE regions respectively. One problem with the statistic is that the non-conserved regions contain nucleotides not conserved in all mammals, which is not the same as nucleotides not conserved in any mammals. The latter would have been needed in order to identify human specific constraint. Second, the statistic PUCF was used as a proxy for the proportion under constraint even though, as defined, it could be less than zero or greater than one. Indeed, in Table S6 there were four values among the confidence intervals for the estimated proportions using DAF that included values less than 0% or above 100%: Ward and Kellis were therefore proposing that some features might have a negative number of nucleotides under constraint. Moreover, while it is possible that after further rescaling PUCmight have correlated with the true proportion of nucleotides under constraint, there was no argument provided in the paper. Thus, while Ward and Kellis claimed to have estimated the proportion of nucleotides under constraint, they had only computed a statistic named “proportion under constraint”. Nicolas Bray and I wrote up these points in a short technical comment and submitted it to the journal Science early in November 2012. The comment was summarily rejected with a curt reply by senior editor Laura Zahn stating that “relative to other Technical Comments we have recently received we feel that the scope and focus of your comment make it more suitable for the Online Comments facility at Science, rather than as a candidate for publication as a Technical Comment.” It is worth noting that Science did decide to publish another comment: Phil Green and Brent Ewing’s, “Comment on’Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions‘”, Science 10 (2013). Green and Ewing’s comment is biological in nature. Their concern is that “… the polymorphism trends are primarily attributable to mutational variation and technical artifacts rather than selection.” Its fine that Science decided to host a debate on a biology question on its pages, but how can one debate the interpretation of results from a method, when the method is fundamentally flawed to begin with? After all, our problem with PUC was much deeper than a “technical flaw”. We decided at the end to place the comment in the arXiv. After doing so, it became apparent that it had little impact. Indeed, I have never received any feedback about it from anyone. Apparently even this was too much for Manolis Kellis. Methods matter By the time I noticed the Feizi et al. paper in the journal Nature Biotechnology early in August 2013, my experiences reading Kellis’ papers had subtly altered the dynamic between myself and the printed word. Usually, when I read a paper and I don’t understand something, I assume the fault lies with me. I think most people are like this. But now, when the Feizi et al. paper started to not make sense, I didn’t presume the problem was with me. I tried hard to give the paper a fair reading, but after a few paragraphs the spell of the authors was already broken. And so it is that Nicolas Bray and I came to figure out what was really going on in Feizi et al., a project that eventually led us to also look at Barzel-Barabási. Speaking frankly, it was difficult work to write the blog posts about these articles. In addition to the time it took, it was exhausting and exasperating to discover the flaws, fallacies and frauds. Both Nick and I prefer to do research. But we felt a responsibility to spell out in detail what had happened here. Manolis Kellis is not just any scientist. He has, and continues to play leading roles in major consortium projects such as mod-ENCODE and ENCODE, and he has served on numerous advisory committees for the NHGRI. He is a member of the GCAT (Genomics, Computational Biology and Technology) study section until 2018. That any person would swap out a key figure in a published paper without publishing a correction, and without informing the editor is astonishing. That a person with great responsibility towards scientists is an abuser of science is unacceptable. Manolis Kellis’ behavior is part of a systemic problem in computational biology. The cross-fertilization of ideas between mathematics, statistics, computer science and biology is both an opportunity and a danger. It is not hard to peddle incoherent math to biologists, many of whom are literally math phobic. For example, a number of responses I’ve received to the Feizi et al. blog post have started with comments such as “I don’t have the expertise to judge the math, …” Similarly, it isn’t hard to fool mathematicians into believing biological fables. Many mathematicians throughout the country were recently convinced by Jonathan Rothberg to donate samples of their DNA so that they might find out “what makes them a genius”. Such mathematicians, and their colleagues in computer science and statistics, take at face value statements such as “we have figured out what makes a human human”. In the midst of such confusion, it is easy for an enterprising “computational person” to take advantage of the situation, and Kellis has. I believe the solution for this problem is for computational biologists to start taking themselves more seriously. Whether serving as reviewers for journals, as panel members for funding agencies, on hiring/tenure committees, or writing articles, all of us have to tone down the hype and pay closer attention to the science. There are many examples of what this means: a review of a math/stats containing paper cannot be a single paragraph long and based on a hunch, and similarly computational biologists shouldn’t claim, as have many of the authors of papers I’ve reviewed in these posts, pathways to cure disease and explanations for what makes humans human. Don’t fool the biologists. Don’t fool the computer scientists, statisticians, and mathematicians. The possibilities for computational methods in biology are unlimited. The future is exciting, and there are possibilities for significant advances in areas ranging from molecular and evolutionary biology to medicine. But money, citations and fame cannot rule the day. The details of the #methodsmatter. This is the second in a series of three posts (part1, part3) on two back-to-back papers published in Nature Biotechnology in August 2013: 1. Baruch Barzel & Albert-László Barabási, Network link prediction by global silencing of indirect correlationsNature Biotechnology 31(8), 2013, p 720–725. doi:10.1038/nbt.2601 2. Soheil Feizi, Daniel Marbach, Muriel Médard & Manolis Kellis, Network deconvolution as a general method to distinguish direct dependencies in networksNature Biotechnology 31(8), 2013, p 726–733. doi:10.1038/nbt.2635 The Barzel & Barabási paper we discussed in yesterday’s blog post is embarrassing for its math, shoddy “validations” and lack of results. However of the two papers above it is arguably the “better” one. That is because the paper by Soheil Feizi, Daniel Marbach, Muriel Médard & Manolis Kellis, in addition to having similar types of problems to the Barzel & Barabási paper, is also dishonest and fraudulent. For reasons that I will explain in the third and final post in this series, we (Nicolas Bray and I) started looking at the Feizi et al. paper shortly after it was published early in August 2013. This is the story: Feizi et al.‘s paper describes a method called network deconvolution that in their own words provides “…a systematic method for inferring the direct dependencies in a network, corresponding to true interactions, and removing the effects of transitive relationships that result from indirect effects.” They claim that the method is a “foundational graph theoretic tool” and that it “is widely applicable for computing dependencies in network science across diverse disciplines.” This high brow language makes network deconvolution sounds very impressive, but what exactly is the method? Feizi et al. would like you to believe that the method is what is described in their Figure 1, part of which is shown below: This is a model for networks represented as matrices. In this model an unknown matrix $G_{dir}$ with eigenvalues between -1 and 1 is to be inferred from an observed matrix $G_{obs}$ that is related to $G_{dir}$ via $G_{obs} = G_{dir}+G_{dir}^2+G_{dir}^3+\cdots$ (1) The matrix $G_{dir}$ represents “direct effects” and the sum of its powers “indirect effects”. It is probably worth noting at this point that there is no particular reason to believe that effects will behave in this manner in any particular system of interest, but we return to this later. The eigenvalue assumption on $G_{dir}$ is required for the specification of the model, because without it the infinite sum generating $G_{obs}$ does not converge. An elementary manipulation of (1) shows that the matrix $G_{dir}$ can be recovered from $G_{obs}$ by the formula $G_{dir} = G_{obs}(I+G_{obs})^{-1}$. (2) Unfortunately, many matrices cannot be decomposed as an infinite sum of powers of some matrix as in equation (1), so equation (2) cannot be applied directly to arbitrary data matrices. The only mention of this issue in the main text of the paper is the statement that “This [eigenvalue] assumption can be achieved for any matrix by scaling the observed input network by a function of the magnitude of its eigenvalues.” This is true but incoherent. There are an infinite number of scalings that will satisfy the assumption, and while the authors claim in their FAQ that “the effect of linear scaling on the input matrix is that … it does not have an effect” this is obviously false (also if it has no effect why do they do it?). For example, as the scaling goes to zero, $G_{dir}$ converges to $G_{obs}$. What the authors have actually done with their seemingly innocuous aside is fundamentally change their model: now instead of (1), $G_{obs}$ is given by $G_{obs} = \gamma \cdot (G_{dir}+G_{dir}^2+G_{dir}^3+\cdots)$. (3) The problem with this model is that given $G_{obs}$ there are an infinite number of solutions for $\gamma$ and $G_{dir}$. Feizi et al.‘s approach to dealing with this is to introduce a scaling parameter that must be chosen a priori. They do not even mention the existence of this parameter anywhere in the main text. Insteadthey choose to make a false statement in the caption of Figure 1 when they write “When these assumptions hold, network deconvolution removes all indirect flow effects and infers all direct interactions and weights exactly.” Even when $G_{obs}$ satisfies the eigenvalue constraint, once it is scaled before applying equation (2) the matrix $G_{dir}$ has probability zero of being recovered. In the video below, recorded at the Banff workshop on Statistical Data Integration Challenges in Computational Biology: Regulatory Networks and Personalized Medicine (August 11–16, 2013), you can see an explanation of network deconvolution by the corresponding author Kellis himself. The first part of the clip is worth watching just to see Kellis describe inverting a matrix as a challenge and then explain the wrong way to do it. But the main point is at the end (best viewed full screen with HD): Manolis Kellis received his B.S., M.S. and Ph.D degrees in computer science and electrical engineering from MIT, so it is hard to believe that he really thinks that solving (2), which requires nothing more than a matrix inversion, must be accomplished via eigendecomposition. In fact, inverting a 2000 x 2000 matrix in R is 50% slower using that approach. What is going on is that Kellis is hiding the fact that computation of the eigenvalues is used in Feizi et al. in order to set the scaling parameter, i.e. that he is actually solving (3) and not (2). Indeed, there is no mention of scaling in the video except for the mysteriously appearing footnote in the lower left-hand corner of the slide starting at 0:36 that is flashed for 2 seconds. Did you have time to think through all the implications of the footnote in 2 seconds, or were you fooled? While it does not appear in the main text, the key scaling parameter in network deconvolution is acknowledged in the supplementary material of the paper (published with the main paper online July 14, 2013). In supplementary Figure 4, Feizi et al. show the performance of network deconvolution with different scaling parameters ($\beta$) on the three different datasets examined in the paper: Figure S4, July 14, 2013. In the words of the authors, the point of this figure is that “…choosing $\beta$ close to one (i.e., considering higher order indirect interactions) leads to the best performance in all considered network deconvolution applications.” However, while the supplement revealed the existence of $\beta$, it did not disclose the values used for the results in the paper. We inquired with the authors and were surprised to discover that while $\beta = 0.95$ was used for the protein networks and $\beta = 0.99$ for the co-authorship network, $\beta=0.5$ was used for the DREAM5 regulatory networks violating their own advice. What rationale could there be for such a different choice, especially one very far away from the apparently optimal choice “near 1″? We asked the authors, who initially replied that the parameter setting didn’t make a difference. We then asked why the parameter would be set differently if its value didn’t make a difference; we never got an answer to this question. Although it may be hard to believe, this is where the story gets even murkier. Perhaps as a result of our queries, the scaling parameters were publicly disclosed by Feizi et al. in a correction to the original supplement posted on the Nature Biotechnology website on August 26, 2013. The correction states “Clarification has been made to Supplementary Notes 1.1, 1.3, 1.6 and Supplementary Figure 4 about the practical implementation of network deconvolution and parameter selection for application to the examples used in the paper. ” But Supplementary Figure 4 was not clarified, it was changed, a change not even acknowledged by the authors, let alone explained. Below is Figure S4 from the updated supplement for comparison with the one from the original supplement shown above. Figure S4, August 26, 2013. The revised figure is qualitatively and quantitatively different from its predecessor. A key point of the paper was changed- the scaling parameter $\beta$ is now seen to not be optimal near 1, but rather dataset dependent (this is reflected in a corresponding change in the text: in reference to choosing $\beta$ close to 1 “best performance” was changed to “high performance”). Specifically, the regulatory network dataset now has an optimal value of $\beta$ close to 0.5, perhaps providing an answer to our question above about why $\beta=0.5$ was used for that dataset alone. However that explanation directly implies that the authors knew the original figure was incorrect. We are left with two questions: 1. If the authors believed their original figure and their original statement that $\beta$ close to 1 “leads to the best performance in all considered network deconvolution applications”, then why did they use $\beta=0.5$ for one of the applications? 2. When and how was the July 16th Figure S4 made? Manolis Kellis declined to answer question 1 and by not acknowledging the figure change in the correction did not permit the readers of Nature Biotechnology to ask question 2. Clearly we have fallen a long way from the clean and canonical-seeming Figure 1. We wish we could say we have reached the bottom, but there is still a ways to go. It turns out that “network deconvolution” is actually neither of the two models (2) and (3) above, a fact revealed only upon reading the code distributed by the authors. Here is what is in the code: 1. affinely map the entries of the matrix $G_{obs}$ to lie between 0 and 1, 2. set the diagonal of the matrix to 0, 3. threshold the matrix, keeping only the largest $\alpha$ fraction of entries, 4. symmetrize the matrix, 5. scale the matrix so that $G_{dir}$ inferred in the next step will have maximum eigenvalue $\beta$, 6. apply formula (2), 7. affinely map between 0 and 1 again. The parameter $\beta$ is the scaling parameter we’ve already discussed, but there is a second parameter $\alpha$. When the Feizi et al. paper was first published, the parameter $\alpha$ appeared in the code with default value 1, and was literally not mentioned in the paper. When asked, the authors revealed that it also takes on different values in the different experiments in the paper. In particular, the regulatory networks DREAM5 experiments used $\alpha=0.1$. The only guidance for how to set $\alpha$, now provided with the new version of the supplement, is “In practice, the threshold chosen for visualization and subsequent analysis will depend on the specific application.” It is hard to say what steps 1–7 actually do. While it is impossible to give a simple analytic formula for this procedure as a whole, using the Sherman-Morrison formula we found that when applied to a correlation matrix C, steps 1, 2, 4, 6 and 7 produce a matrix whose ijth entry is (up to an affine mapping) $P_{ij} + \Sigma_{kl}P_{ij}P_{kl}+m\Sigma_{kl}P_{ik}P_{jl}$ where $P=C^{-1}$ and m is the minimum entry of C. Omitting step 1 results in Pii , the inverse correlation matrix, so the effect of the mapping in this case is the addition of the final two terms whose possible meaning escapes us. We are left to wonder why the authors chose to add this step which lacks any apparent theoretical justification. This bring us to question the meaning of the contribution of author Muriel Médard, a professor of information theory at MIT. According to the contributions section she is the one who “contributed to correctness proof and robustness analysis”. Presumably “correctness” meant describing the two steps needed to show that the solution to (1) is (2). But what would actually be relevant to the paper is a theorem about steps 1–7. Unfortunately such a theorem is hard to imagine. When Feizi et al. was published in August 2013, they set up a companion website at http://compbio.mit.edu/nd/ where the code and data was made available (the data is no longer available at the site). On December 4, 2013, the authors updated the website. While there used to be one version of the program, there are now two different versions, one specifically for use on regulatory networks where the default values for the parameters are $\beta=0.5, \, \alpha =0.1$ and the other with defaults $\beta=0.9, \, \alpha=1$“Network deconvolution” is supposedly a universal method applicable to any kind of network. Will the authors continue to distribute different domain-specific versions of the program, each with their own “default” parameters somehow divined by themselves? To summarize, “network deconvolution”, as best as we can currently tell, consists of steps 1–7 above with the parameters: • DREAM5 challenge: $\beta = 0.5$ and $\alpha=0.1$. • Protein network: $\beta = 0.99$ and $\alpha=1$. • Co-authorship: $\beta=0.95$ and $\alpha=1$. The description of the method in the main text of the paper is different from what is in the online methods, which in turn is different from what is in the supplement, which in turn is different from what is actually in each of the (two) released programs, namely the ad-hoc heuristic steps we have described above. Having said all of that, one might be curious whether the “method” actually works in practice. Sometimes heuristics work well for reasons that are not well understood (although they are usually not published in journals such as Nature Biotechnology). Unfortunately, we have to disclose that we don’t know how the method performs on the datasets in the paper. We tried really hard to replicate the results of the paper, running the software on the provided input matrices and comparing them to the distributed output matrices (available on the NBT website). Despite our best attempts, and occasionally getting very close results, we have been unable to replicate the results of the paper. This bothers us to such an extent that I, Lior Pachter, am hereby offering$100 to anyone (including Feizi et al.) who can provide me with the code and input to run network deconvolution and produce exactly the figures in the paper (including both versions of Figure S4) and the output matrices in the supplementary data.

Because we couldn’t replicate the results of the paper, we decided to examine how FK performs in one case for which Feizi et al. originally claimed (Supplement version July 14 2013) that their method is optimal. They stated “if the observed network is a covariance matrix of jointly Gaussian variables, ND infers direct interactions using global partial correlations”. This claim was deleted without mention in the updated supplement, however even if network deconvolution is not optimal for inferring a network from a correlation matrix (presumably what they meant in the original statement), the claimed universality suggested in the paper implies that it should be very good on any input.

The figure above shows a comparison of the Feizi et al. method (FK) with the various beta parameters used in their paper to regularized partial correlation, the “geometric root” (equation (2)) and the naïve method of simply selecting the top entries of the correlation matrix. In order to test the method we performed 40 simulations of sampling 500 observations from a random Gaussian graphical model with 1000 variables and an edge density of 5% to ensure the graph was connected yet sparse. Performance was assessed by comparing the ranking of the edges to the true edges present in the model and computing the area under the corresponding ROC curve. As a benchmark we applied partial correlation structural inference based on a James-Stein shrinkage estimation of the covariance matrix (a standard approach to gene regulatory network inference, described in Schäfer, J. & Strimmer, K. A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics. Statistical Applications in Genetics and Molecular Biology 4, (2005)).

That FK does better than the most naïve method and yet worse than a method actually designed for this task is perhaps to be expected for a heuristic method that is based on a metaphor (one that dates back to Seawall Wright’s work on path coefficients in the 1920s- although Wright used (1) in the restricted setting of directed acyclic graphs where it makes sense). It is possible that in some contexts, FK will make things somewhat better. However what such contexts might be is unclear. The only guidance that Feizi et al. provide on the assumptions needed to run their method is that they state in the supplement that the model assumes that networks are “linear time-invariant flow-preserving operators”. While it is true that individual parts of that phrase mean something in certain contexts, the complete phrase is word salad. We wonder: is co-authorship flow preserving?

To conclude: Feizi et al. have written a paper that appears to be about inference of edges in networks based on a theoretically justifiable model but

1. the method used to obtain the results in the paper is completely different than the idealized version sold in the main text of the paper and
2. the method actually used has parameters that need to be set, yet no approach to setting them is provided. Even worse,
3. the authors appear to have deliberately tried to hide the existence of the parameters. It looks like
4. the reason for covering up the existence of parameters is that the parameters were tuned to obtain the results. Moreover,
5. the results are not reproducible. The provided data and software is not enough to replicate even a single figure in the paper. This is disturbing because
6. the performance of the method on the simplest of all examples, a correlation matrix arising from a Gaussian graphical model, is poor.

In academia the word fraudulent is usually reserved for outright forgery. However given what appears to be deliberate hiding, twisting and torturing of the facts by Feizi et al., we think that fraud (“deception deliberately practiced in order to secure unfair or unlawful gain”) is a reasonable characterization. If the paper had been presented in a straightforward manner, would it have been accepted by Nature Biotechnology?

Post scriptum. After their paper was published, Feizi et al. issued some press releases. They explain how original and amazing their work is. Kellis compliments himself by explaining that “Introducing such a foundational operation on networks seems surprising in this day and age” and Médard adds that “the tool can be applied to networks of arbitrary dimension.” They describe the method itself as a way to have “…expressed the matrix representing all element correlations as a function of its principal components, and corresponding weights for each component” even though principal components has nothing to do with anything in their paper. As with Barzel-Barabási, this seems worthy of nomination for the Pressies.

Post post scriptum: In September 2013, we submitted a short commentary to Nature Biotechnology with many of the points made above. While it was rejected four months later, we really really wish we could share the reviews here. Unfortunately they were sent to us with a confidentiality restriction, though having said that, any of the reviewers are hereby invited to guest post on network science on this blog. We were disappointed that Nature Biotechnology would not publish the commentary, but not surprised. Frankly, it would have been too much to expect them and their reviewers to unravel the layers of deception in Feizi et al. in the initial review process; it took us an extraordinary amount of time and effort. Having published the paper, and without a clear path to unpublication (retraction is usually reserved for faking of experimental data), there were presumably not many options. The question is whether anything will change. Feizi et al. apparently have more Nature Biotechnology papers on the way as evidenced by an extraordinary declaration on Soheil Feizi’s publication page (this is a pdf of the version when this blog was posted; Feizi changed his site an hour later) where the publication venue of their next paper appears to have been preordained:

S. Feizi, G. Quon , M. Mendoza, M. Medard, M. Kellis
Comparative Analysis of Modular Integrative Regulatory Networks across Fly, Worm and Human
in preparation for Nature Biotechnology.

The term “in preparation” is common on academic websites, especially on sites of graduate students or postdocs who are eager to advertise forthcoming work, but “in preparation for” really requires some hubris. We have to wonder, why do Feizi et al. assume that peer review is just a formality in the journal Nature Biotechnology?

[Addendum 2/23: The authors posted a rebuttal to this blog post, which we replied to here.]

In the August 2013 issue of Nature Biotechnology there were two back-to-back methods papers published in the area of network theory:

1. Baruch Barzel & Albert-László Barabási, Network link prediction by global silencing of indirect correlationsNature Biotechnology 31(8), 2013, p 720–725. doi:10.1038/nbt.2601.
2. Soheil Feizi, Daniel Marbach, Muriel Médard & Manolis Kellis, Network deconvolution as a general method to distinguish direct dependencies in networksNature Biotechnology 31(8), 2013, p 726–733. doi:10.1038/nbt.2635.

This post is the first of a trilogy (part2, part3) in which my student Nicolas Bray and I tell the story of these papers and why we took the time to read them and critique them.

We start with the Barzel-Barabási paper that is about the applications of a model proposed by Barzel and his Ph.D. advisor, Ofer Biham (although all last names start with a B, Biham is not to be confused with Barabási):

In order to quantify connectivity in biological networks, Barzel and Biham proposed an experimental perturbation model in the paper Baruch Barzel & Ofer Biham, Quantifying the connectivity of a network: The network correlation function methodPhys. Rev. E 80, 046104 (2009) that forms the basis for network link prediction in Barzel-Barabási. In the context of biology, link prediction refers to the problem of identifying functional links between genes from data that may be confounded by indirect effects. For example, if gene A inhibits the expression of gene B, and also gene B inhibits the expression of gene C, then if the expression of A increases, it will decrease the expression of B, which in turn increase C. Therefore one might observe correlation in the expression levels of gene A and C, even though there is no direct interaction between them. The Barzel-Biham model is based on perturbation experiments. Assuming that a system of genes is in equilibrium, it is a model for the change in expression of one gene in response to a small perturbation in another.

The parameters in the Barzel-Biham model are entries in what they call a “local response matrix” S (any matrix with $S_{ii}=0$ for all i). Physical arguments pertaining to perturbations at equilibrium lead to the equations

$G=SG \,$ (off the diagonal) and $G_{ii}=1$ for all i.                                  (1)

for a “global response matrix” G that can, in principle, be observed and used to infer the matrix S. The innovation of Barzel and Barabási is to provide an approximate formula for recovering S from G, specifically the formula

$S \approx (G-I+D((G-I)G))^{-1}$                                                       (2)

where D(M) denotes the operation setting off-diagonal elements of M to zero. A significant part of the paper is devoted to showing that the approximation (2) is good. Then they suggest that (2) can be used to infer direct causal links in regulatory networks from collections of expression experiments. Barzel and Barabási claim that the approximation formula (2) is necessary because exact inference of S from G requires solving the intractable system of $n^2$ equations

$G=SG \,$ (off the diagonal) and $S_{ii}=0$ for all i.                                  (3)

The assertion of intractability is based on the claim that the equations are coupled. They reason that since the naïve matrix inversion algorithm requires $O(m^3)$ operations for m equations, the solution of (3) would require time $O(n^6)$. When we looked at this system, our first thought was that while it is large, it is also structured. We sat down and started examining it by writing down the equations for a simple case: a matrix S for a graph on 3 nodes. We immediately noticed the equations decoupled into n systems of n equations where system is given by $S_{ii}=0$ and $G_{ij} = \sum_k S_{ik} G_{kj}, \, (j \neq i)$, with the n unknowns $S_{i1},\ldots,S_{in}$. This immediately reduces the complexity to $O(n^4)$, or even $O(n^3)$ by simple parallelization. In other words, the system is trivially tractable.

But there is more: while looking at the paper I had to take a quick bathroom break, and by the time I returned Nick had realized he could apply the Sherman-Morrison formula to obtain the following formula for the exact solution:

$S=I-D(1/G^{-1})G^{-1}$.                                                                    (4)

Here the operator “/” denotes element-wise division, a simple operation to execute, so that inferring S from G requires no more than inverting G and scaling it, a formula that is also much simpler and more efficient to compute than (2). [Added 2/23: Jordan Ellenberg  pointed out the obvious fact that $G=SG$ off the diagonal means that $SG=G-D$ for some diagonal matrix $D$, and therefore $S=I-DG^{-1}$ and since the diagonal entries of $S$ must be zero it follows that $D=D(1/G^{-1})$. In other words, the Sherman-Morrison formula is not even needed]. While it would be nice for us to claim that our managing to quickly supersede the main result of a paper published in Nature Biotechnology was due to some sort of genius, in fact the entire exercise would be suitable for an undergraduate linear algebra homework problem. Barabási likes to compare himself to the great physicist and nobel laureate Subrahmanyan Chandrasekhar, but it is difficult to imagine the great Chandrasekhar having similar difficulties.

The approach to solving (4) has an implication that is even more important than the solution itself. It provides a dual formula for computing G from S according to (1), i.e. to simulate from the model. Using the same ideas as above, one finds that

$G=TD(1/T)$ where $T=(S-I)^{-1}$.                                              (5)

Unlike Barzel & Barabási that resorted to simulating with Michaelis-Menten dynamics in their study of performance of their approximation, using (4) we can efficiently simulate data directly from the model. One issue with Michaelis-Menten dynamics is that they make more sense for enzymatic networks as opposed to regulatory networks (for more on this see Karlebach, G. & Shamir, R. Modelling and analysis of gene regulatory networksNat Rev Mol Cell Biol 9, 770–780 (2008)), but in any case performance on such dynamics is hardly a validation of (2) since its mixing apples and oranges. So what happens when one simulates data from the Barzel-Biham model and then tries to recover the parameters?

A comparison of the standard method of regularized partial correlations with exact inference for the Barzel-Biham model. Random sparse graphs were generated according to the Erdös-Renyi graph model G(5000,p) where was varied to assess performance at different graph densities (shown on x-axis) The y-axis shows the average AUROC obtained from 75 random trials at each density.

When examining simulations from the Barzel-Biham model with graphs on 5,000 nodes (see Figure above), we were surprised to discover that when adding even small amounts of noise, the exact algorithm (4) failed to recover the local response matrix from G (we also analyzed the approximation (3) and observed that it always resulted in performance inferior to (4), and that 5% of the time the correlation with the exact solution was negative). This sensitivity to noise is due to the term $D(1/G^{-1})$ in the exact formula which becomes problematic if the diagonal entries of $G^{-1}$ are close to zero.

Some intuition for the behavior of $G^{-1}$ may be gained from noting that if S is such that its geometric sum converges, the diagonal of $G^{-1}$ is equal to that of $I+S+S^2+S^3+\cdots$. If S has mixed signs and there is significant feedback within the network, the diagonal of $G^{-1}$ may be close to zero and any noise in the measurement of G could create very large fluctuations in the inferred S. This means that the results in Figure 1 are not dependent on the graph model chosen (Erdös-Renyi) and will occur for any reasonable model of gene regulatory networks including the modeling of both enhancers and repressors. From Figure 2a in their paper, it appears that Barzel and Barabási used in their simulation an S with only positive entries that would preclude such effects. Such an assumption is biologically unrealistic.

However, the difficulties with noise for the Barzel-Biham model go much deeper. While a constant signal-to-noise ratio, as assumed by Barzel and Barabasi, is a commonly used model for errors in experiments, it is important to remember that there is no experiment for directly measuring the elements of G. Obtaining $G_{ij}$ from an experiment is done by making a small perturbation of size e to gene i, observing the change in gene j, and then dividing that change by e. This last step increases the noise on the estimate of $G_{ij}$ by a factor of 1/e (a large number, for a perturbative experiment) above the noise already present in the measurements. Increasing e acts to remove the system from the perturbative regime and thereby increases the intrinsic error in estimating G.  It is therefore the case that attaining reasonable error on G will require very low noise in the original measurements. In this case of biological networks this would mean performing many replicates of the experiments. However, as Barzel & Barabási acknowledge in their paper, even a single replicate of a perturbation experiment is not currently feasible.

While the exact algorithm (4) for inverting the Barzel-Biham model performs poorly, we found that a widely used shrinkage method based on partial correlation (Schäfer, J. & Strimmer, K. A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional GenomicsStatistical Applications in Genetics and Molecular Biology 4, (2005)) outperforms the exact algorithm (blue curve in Figure above). This suggests that there is no input for which (4) might be useful. The method is not even ideal for inference from data generated by the model it is based on.

This brings us to the “results” section of the paper. To demonstrate their method called Silencer, Barzel & Barabási ran it on only one of three datasets from the DREAM5 data. They then compared the performance of Silencer to three out of thirty five methods benchmarked in DREAM5. The Barzel-Biham model is for perturbation experiments, but Barzel & Barabási just threw in data from another universe (e.g. mutual information matrices). But lets just go with that for a moment. Their results are shown in the figure below:

Figure 3 from Barzel-Barabási.

The three methods tested potential improvements on are Pearson, Spearman and Mutual Information. Pearson and Spearman rank 16/35 and 18/35 respectively in the DREAM5 benchmarks. There may be some reason why Silencer applied on top of these methods improves performance: in the case where G is a correlation matrix, the path interpretation given by Barzel and Barabási connects the inference procedure to Seawall Wright’s path coefficients (ca. 1920), which in turn suggests an interpretation in terms of partial correlation. However in the case of mutual information, a method that is ranked 19/35 in the DREAM5 benchmarks, there is no statistically significant improvement at all. The improvement is from an AUROC of 0.67 to 0.68. Amazingly, Barzel and Barabási characterize these results by remarking that they “improve upon the top-performing inference methods” (emphasis on top is ours).  Considering that the best of these rank 16/35 the use of the  word “top” seems, shall we say, unconventional.

We have to ask: how did Barzel & Barabási get to publish a paper in the journal Nature Biotechnology on regulatory network inference without improvement or testing on anything but a handful of mediocre DREAM5 methods from a single dataset? To put the Barzel-Barabási results in context, it is worth considering the standards the Feizi et al. paper were held to. In that paper the authors compared their method to DREAM5 data as well, except they tested on all 3 datasets and 9 methods (and even on a community based method). We think its fair to conclude that significantly more testing would have to be done to argue that Silencer improves on existing methods for biological network inference.

We therefore don’t see any current practical utility for the Barzel-Biham model, except possibly for perturbation experiments in small sub-networks. Even then, we don’t believe it is practical to perform the number of experiments that would be necessary to overcome signal to noise problems.

Unfortunately the problems in Barzel-Barabási spill over into a follow up article published by the duo: Barzel, Baruch, and Albert-László Barabási, “Universality in Network Dynamics.” Nature Physics 9 (2013). In the paper they assume that the local response matrix has entries that are all positive, i.e. they do not allow for inhibitory interactions. Such a restriction immediately renders the results of the paper, if they are to be believed, moot in terms of biological significance. Moreover, the restrictions on S appear to be imposed in order to provide approximations to G that are unnecessary in light of (5). Given these immediate issues, we suspect that were we to read the Universality paper carefully, it is quite likely this post would have to be lengthened considerably.

These are not the first of Barabási’s papers to package meaningless and incoherent results in Nature/Science publications. In fact, there is a long history of Barabási publishing with fanfare in top journals only to have others respond by publishing technical comments on his papers, in many cases refuting completely the claims he makes. Unfortunately many of the critiques are not well known because they are rejected from the journals where Barabási is successful, and instead find their way to preprint servers or more specialized publications. Here is a partial list of Barabási finest and the response(s):

1. Barabasi is famous for the “BA model”, proposed in Barabási and Albert ‘Emergence of Scaling in Random Networks“, Science, Vol. 286 15 October 1999, pp. 509-512. Lada Adamic and Bernardo Huberman immediately refuted the practical applications of the model. Moreover, as pointed out by Willinger, Alderson and Doyle, while it is true that scale-free networks exhibit some interesting mathematical properties (specifically they are resilient to random attack yet vulnerable to worst-case), even the math was not done by Barabási but by the combinatorialists Bollobás and Riordan.
2. Barabási has has repeatedly claimed metabolic networks as prime examples of scale-free networks, starting with the paper Jeong et al., “The large-scale organization of metabolic networks“, Nature 407 (2000). This fact has been disputed and refuted in the paper Scale Rich Metabolic Networks by Reiko Tanaka.
3. The issue of attack tolerance was the focus of Error and attack tolerance of complex networks by Réka Albert, Hawoong Jeong & Albert-László Barabási in Nature 406 (2000). John Doyle refuted the paper completely in this paper.
4. In the paper “The origin of bursts and heavy tails in human dynamics” published in Nature 435 (2005) Barabási pretends to offer insights into the “bursty nature of human behavior” (by analyzing e-mail). In a follow up comment Daniel Stouffer, Dean Malmgren and Luis Amaral demonstrate that the reported power-law distributions are solely an artifact of the analysis of the empirical data and the proposed model is not representative of e-mail communication patterns.
5. Venturing into the field of control theory, the paper “Controllability of complex networks” by Liu, Slotine and Barabási, Nature 473 (2011) argues that “sparse inhomogeneous networks, which emerge in many real complex systems, are the most difficult to control, but that dense and homogeneous networks can be controlled using a few driver nodes.” Not so. In a beautiful and strong rebuttal, Carl Bergstrom and colleagues show that a single control input applied to the power dominating set is all that is required for structural controllability of most, if not all networks. I have also blogged about this particular paper previously, explaining the Bergstrom result and why it reveals the Barabási paper to be a control theory embarrassment.

In other words, Barabási’s “work” is a regular feature in the journals Nature and Science despite the fact that many eminent scientists keep demonstrating that the network emperor has no clothes.

Post scriptum. After their paper was published, Barzel and Barabási issued a press release claiming that “their research moves the team a step closer in its quest to under­stand, pre­dict, and con­trol human disease.” The advertisment seems like an excellent candidate for Michael Eisen’s pressies.

Last Monday some biostatisticians/epidemiologists from Australia published a paper about a “visualization tool which may allow greater understanding of medical and epidemiological data”:

H. Wand et al., “Quilt Plots: A Simple Tool for the Visualisation of Large Epidemiological Data“, PLoS ONE 9(1): e85047.

A brief look at the “paper” reveals that the quilt plot they propose is a special case of what is commonly known as a heat map, a point the authors acknowledge, with the caveat that they claim that

” ‘heat maps’ require the specification of 21 arguments including hierarchical clustering, weights for reordering the row and columns dendrogram, which are not always easily understood unless one has an extensive programming knowledge and skills. One of the aims of our paper is to present ‘‘quilt plots’’ as a useful tool with simply formulated R-functions that can be easily understood by researchers from different scientific backgrounds without high-level programming skills.”

In other words, the quilt plot is a simplified heat map and the authors think it should be used because specifying parameters for a heat map (in R) would require a terrifying skill known as programming. This is of course all completely ridiculous. Not only does usage of R not require programming skill, there are also simplified heat map functions in many programming languages/computation environments that are as simple as the quilt plot function.

The fact that a paper like this was published in a journal is preposterous, and indeed the authors and editor of the paper have been ridiculed on social media, blogs and in comments to their paper on the PLoS One website.

BUT…

Wand et al. do have one point… those 21 parameters are not an entirely trivial matter. In fact, the majority of computational biologists (including many who have been ridiculing Wand) appear not to understand heat maps themselves, despite repeatedly (ab)using them in their own work.

What are heat maps?

In the simplest case, heat maps are just the conversion of a table of numbers into a grid with colored squares, where the colors represent the magnitude of the numbers. In the quilt plot paper that is the type of heat map considered. However in applications such as gene expression analysis, heat maps are used to visualize distances between experiments.

Heat maps have been popular for visualizing multiple gene expression datasets since the publication of the “Eisengram” (or the guilt plot?). So when my student Lorian Schaeffer and I recently needed to create a heat map from RNA-Seq abundance estimates in multiple samples we are analyzing with Ryan Forster and Dirk Hockemeyer, we assumed there would be a standard method (and software) we could use. However when starting to look at the literature we quickly found 3 papers with 4 different opinions about which similarity measure to use:

There are also the folks who don’t worry too much and just try anything and everything (for example using the heatmap.2 function in R) hoping that some distance measure produces the figure they need for their paper. There are certainly a plethora of distance measures for them to try out. And even if none of the distance measures provide the needed figure, there is always the opportunity to play with the colors and shading to “highlight” the desired result. In other words, heat maps are great for cheating with what appears to be statistics.

We wondered…  what is the “right” way to make a heat map?

Consider first the obvious choice for measuring similarity: Euclidean distance. Suppose that we are estimating the distance between abundance estimates from two RNA-Seq experiments, where for simplicity we assume that there are only three transcripts (A,B,C). The two abundance estimates can be represented by 3-tuples $(p_A,p_B,p_C)$ and $(q_A,q_B,q_C)$such that both $p_A+p_B+p_C=1$ and $q_A+q_B+q_C=1$. If  $p_A=1$ and $q_A=0$, then the Euclidean distance is given by $\sqrt{1+q_B^2+q_C^2}$. This obviously depends on $q_B$ and $q_C$, a dependence  that is problematic. What has changed between the two RNA-Seq experiments is that transcript $A$ has gone from being the only one transcribed, to not being transcribed at all. It is difficult to justify a distance metric that depends on the relative changes in $q_B$ and $q_C$. Why, for example, should $(1,0,0)$ be closer to $(1,\frac{1}{2},\frac{1}{2})$ than to $(1,1,0)$?

The Jensen-Shannon divergence, defined for two distributions $P$ and $Q$ by

$JSD(P,Q) = \frac{1}{2}D(P\|M) + \frac{1}{2}D(Q\|M)$

where $M = \frac{1}{2}(P+Q)$ and $D(A\|B)$ is the Kullback-Leibler divergence, is an example of a distance measure that does not have this problem. For the example above the JSD is always $log2$ (regardless of $q_B$ and $q_C$). However the JSD is not a metric (hence the term divergence in its name). In particular, it does not satisfy the triangle inequality (which the Euclidean distance does). Interestingly, this defect can be rectified by replacing JSD with the square root of JSD (the JSD metric). Formal proofs that the square root of JSD is a metric were provided in “A new Metric for Probability Distributions” by Dominik Endres and Johannes Schindelin (2003), and separately (and independently) in “A new class of metric divergences on probability spaces and its applicability in statistics” by Ferdinand Österreicher and Igor Vajda (2003). The paper “Jensen-Shannon Divergence and Hilbert space embedding” by Bent Fuglede and Flemming Topsøe (2004) makes clear the mathematical origins for this result by showing that the square root of JSD can be isometrically embedded into Hilbert space (as a logarithmic spiral)

The 2-simplex with contour lines showing points equidistant
from the probability distribution (1/3, 1/3, 1/3) for the JSD metric.

The meaning of the JSD metric is not immediately apparent based on its definition, but a number of results provide some insight. First, the JSD metric can be approximated by Pearson’s $\chi^2$ distance  (Equation (7) in Endres and Schindelin). This relationship is confirmed in the numerical experiments of Sung-Hyuk Cha (see Figure 3 in “Comprehensive survey on distance/similarity measures between probability distance functions“, in particular the close relationship between JSD and the probabilistic symmetric $\chi^2$). There are also information theoretic and physical interpretations for the JSD metric stemming from the definition of JSD in terms of Kullback-Leibler divergence.

In “Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation“, Trapnell et al., Nature Biotechnology 28 (2010), we used the JSD metric to examine changes to relative isoform abundances in genes (see, for example, the Minard plot in Figure 2c). This application of the JSD metric makes sense, however the JSD metric  is not a panacea. Consider Figure 1 in the Merkin et al. paper mentioned above. It displays a heat map generated from 7713 genes (genes with singleton orthologs in the five species studied). Some of these genes will have higher expression, and therefore higher variance, than others. The nature of the JSD metric is such that those genes will dominate the distance function, so that the heat map is effectively generated only from the highly abundant genes. Since there is typically an (approximately) exponential distribution of transcript abundance this means that, in effect, very few genes dominate the analysis.

I started thinking about this issue with my student Nicolas Bray and we began by looking at the first obvious candidate for addressing the issue of domination by high variance genes: the Mahalanobis distance. Mahalanobis distance is an option in many heat map packages (e.g. in R), but has been used only rarely in publications (although there is some history of its use in the analyses of microarray data). Intuitively, Mahalanobis distance seeks to remedy the problem of genes with high variance among the samples dominating the distance calculation by appropriate normalization. This appears to have been the aim of the method in the Anders and Huber paper cited above, where the expression values are first normalized to obtain equal variance for each gene (the variance stabilization procedure). Mahalanobis distance goes a step further and better, by normalizing using the entire covariance matrix (as opposed to just its diagonal).

Intuitively, given a set of points in some dimension, the Mahalanobis distance is the Euclidean distance between the points after they have been transformed via a linear transformation that maps an ellipsoid fitted to the points to a sphere.  Formally, I think it is best understood in the following slightly more general terms:

Given an $n \times m$ expression matrix $X$ (rows=transcripts, columns=experiments), let $P=PCA(X)$ be the matrix consisting of projections of $X$ onto its principal components, and denote by $s^2_k(ij)$ the distance between projection of points i and j onto the kth component, i.e. $s^2_k(ij) = (P_{ik}-P_{jk})^2$. Let $\lambda_1,\ldots,\lambda_n$ be the singular values. For some $1 \leq p \leq n$, define the distance

$D_{ij} = \frac{s^2_1(ij)}{\lambda_1} + \cdots + \frac{s^2_p(ij)}{\lambda_p}$

When $n \leq m$ and $p=n$ then the distance D defined above is the Mahalanobis distance.

The Mahalanobis ellipses. In this figure the distance shown is from every point to the center (mean of the point) rather than between pairs of points. Mahalanobis distance is sometimes defined in this way. The figure is reproduced from this website. Note that the Anders-Huber heat map produces distances looking only at the variance in each direction (in this case horizontal and vertical) which assumes that the gene expression values are independent, or equivalently that the ellipse is not rotated.

It is interesting to note that D is defined even when $n > m$, providing a generalization of Mahalanobis distance for high-dimensional data.

The cutoff p involves ignoring the last few principal components. The reason one might want to do this is that the last few principal components presumably correspond to noise in the data. Amplifying this noise and treating it the same as the signal is not desirable. This is because as p increases the denominators $\lambda_p$ get smaller, and therefore have an increasing effect on the distance. So even though it makes sense to normalize by variance thereby allowing all genes to count the same, it is important to keep in mind that the last few principal components may be desirable to toss out. One way one could choose the appropriate threshold is by examination of a scree plot.

We’re still not completely happy with Mahalanobis distance. For example, unlike the Jensen-Shannon metric, it does not provide a metric over probability distributions. In functional genomics, almost all *Seq assays produce an output which is a (discrete) probability distribution (for example in RNA-Seq the output after quantification is a probability distribution on the set of transcripts). So making heat maps for such data seems to not be entirely trivial…

Does any of this matter?

The landmark Michael Eisen et al. paper “Cluster analysis and the display of genome-wide expression patterns“, PNAS 95 (1998), 14863–14868 describing the “Eisengram” was based on correlation as the distance measure between expression vectors. This has a similar problem to the issues we discussed above, namely that  abundant genes are weighted more heavily in the distance measure, and therefore they define the characteristics of the heat map. Yet the Eisengram and its variants have proven to be extremely popular and useful. It is fair to ask whether any of the issues I’ve raised matter in practice.

Depends. In many papers the heat map is a visualization tool intended for a qualitative exploration of the data. The issues discussed here touch on quantitative aspects, and in some applications changing distance measures may not change the qualitative results. Its difficult to say without reanalyzing data sets and (re)creating the heat maps with different parameters. Regardless, as expression technology continues to transition from microarrays to RNA-Seq, the demand for quantitative results is increasing. So I think it does matters how heat maps are made. Of course its easy to ridicule Handan Wand for her quilt plots, but I think those guilty of pasting ad-hoc heat maps based on arbitrary distance measures in their papers are really the ones that deserve a public spanking.

P.S. If you’re going to make your own heat map, after adhering to sound statistics, please use a colorblind-friendly palette.

P.P.S. In this post I have ignored the issue of clustering, namely how to order the rows and columns of heat maps so that similar expression profiles cluster together. This goes along with the problem of constructing meaningful dendograms, a visualization that has been a major factor in the popularization of the Eisengram. The choice of clustering algorithm is just as important as the choice of similarity measure, but I leave this for a future post.

The paper “Genomic-scale capture and sequencing of endogenous DNA from feces” by George H. Perry, John C. Marioni, Páll Melsted and Yoav Gilad is literally full of feces. The word ‘fecal’ appears 100 times.

Poop jokes aside, the paper presents an interesting idea that has legs. Perry et al. show that clever use of Agilent’s SureSelect allows for capturing nuclear genomic regions from fecal DNA. Intellectually, it is the predecessor of T. Mercer et al.‘s “Targeted RNA sequencing reveals deep complexity of the human transcriptome“, Nature Biotechnology, 2011 (another paper I like and for which I wrote a research highlight). Even though the Perry et al. paper does not have many citations, the Mercer et al. paper does (although unfortunately the authors forgot to cite Perry et al., which I think they should have). In other words, the Perry et al. paper is not as well known as it ought to be, and this post is an attempt to rectify that.

The ‘*’ in sh*t in my title is for *Seq. At a high level,  the Perry et al. paper shows how high-throughput sequencing technology can be leveraged to sequence deeply a single genome from among a community of metagenomes. For this reason, and for convenience, I henceforth will refer to the Perry et al. paper as the Sh*t-Seq paper. The “*” is inserted in lieu of the “i” not as censorship, but to highlight the point that the method is general and applies not only to sequencing nuclear genome from fecal DNA, but also as Mercer et al. shows, for targeted transcriptome sequencing (one can imagine also many other applications).

The Sh*t-Seq protocol is conceptually simple yet complicated in practice. DNA was captured using the Agilent SureSelect target enrichment system coupled to the Illumina single-end sequencing platform library prep protocol (of note is that the paper is from 2010 and the kits are from 2009). However because of the very small amount of DNA targeted (the authors claim ~1.8%), a number of adjustments to standard SureSelect capture / Illumina library prep had to be implemented. To the authors credit, the paragraphs in the section “DNA Capture” are exemplary in their level of detail and presumably greatly facilitate replicability. I won’t repeat all the detail here. However there are two steps that caught my attention as possibly problematic. First, the the authors performed substantial PCR amplification of the  adapter-ligated fecal DNA. This affects the computational analysis they discuss later and leads to a computational step they implemented that I have some issues with (more on this later). Second, they performed two rounds of capture as one round was insufficient for capturing the needed material for sequencing. This necessitated additional PCR, which also is possibly problematic.

The samples collected were from six chimpanzees. This is a fairly small n but the paper is a proof of principle and I think this is sufficient. Both fecal and blood samples were collected allowing for comparison of the fecally derived nuclear DNA to the actual genome of the primates. In what is clearly an attempt to channel James Bond, they collected fecal samples (2 g of stool) within 1 hour of defecation in tubes containing RNALater and these were then “shaken vigorously” (not stirred).

The next part of the paper is devoted to computational analyses to confirm that the Sh*t-Seq protocol can in fact be used to target nuclear endogenous DNA. As a sanity check, mtDNA was considered first. They noted too much diversity to align reads to a reference genome with BWA, and opted instead for de novo assembly using ABySS. This is certainly overkill, possible only because of the high copy count of mitochondrion reads. But I suppose it worked (after filtering out all the low coverage ABySS sequence, which was presumably junk). One interesting idea given more modern RNA-Seq assembly tools would be to assemble the resulting reads with an RNA-Seq de novo assembler that allows for different abundances of sequences. Ideally, such an assembly should indicate naturally the sought after enrichment.

Next, nuclear DNA was investigated, specifically the X chromosome and chromosome 21. Here the analyses is very pre-2014. First, all multi-mapping reads were removed. This is not a good idea for many reasons, and I am quite certain that with the new GRCh38 (with alternate sequence representation for variant regions) it is a practice that will rapidly be phased out. I’d like to give Perry et al. the benefit of the doubt for making this mistake since they published in 2010, but their paper appeared 6 months after the Cufflinks paper so they could have, in principle, known better. Having said that, while I do think multi-mapping would have allowed them to obtain much stronger results as to the accuracy and extent of their enrichment by avoiding the tossing of a large number of reads, their paper does manage to prove their principle so its not a big deal.

The removal of multi-mapping reads was just the first step in a series of “filters” designed to narrow down the nuclear DNA reads to regions of the chimpanzee genome that could be argued to be unambiguously representative of the target. I won’t go into details, although they are all in the paper. As with the experimental methods, I applaud the authors on publishing reproducible methods, especially computational methods, with all details included. But there was a final red flag for me in the computational methods: namely the selection of a single unique fragment (at random) for each genomic (start) site for the purposes of calling SNPs. This was done to eliminate problems due to amplification biases, which is indeed a serious concern, but if heterozygous sites appear due to the PCR steps, then there ought to have been telltale signatures. For example, a PCR “SNP’ would, I think, appear only in reads specific to a single position, but not in other overlapping reads of the site. It would have been very helpful if they would have done a detailed analysis of this issue, rather than just pick a single read at random for each genomic (start) site. They kicked the can down the road.

Having removed multiple mapping reads, repetitive regions, low coverage regions, etc. etc. Dayenu, they ended up estimating a false positive rate for heterozygous sites (using the X chromosome in males) at 0.0007% for fecal DNA and 0.0010% for blood DNA. This led them to conclude that incorrectly-identified heterozygous sites in their study were 0.8%, 2.0%, 1.1%, and 2.7% for fecal DNA chromosome 21, fecal DNA chromosome X in females, blood DNA chromosome 21, and blood DNA chromosome X in females, respectively. Such good news is certainly the result of their extraordinarily stringent filtering, but I think it does prove that they were able to target effectively. They give further proof using PCR and Sanger sequencing of 20 regions.

I have a final nitpick and it relates to Figure 4. It is a a companion to Figure 3 which shows the Chimpanzee phylogeny for their samples based on the mtDNA. As expected in that figure, the fecal and blood samples cluster together. Figure 4 shows two phylogenies, one based on chr 21, the other on chr X. My issue here is with the way that distances were constructed. Its a technical point, but it looks like they used hamming distance, and I don’t think that makes a lot of sense, not to mention the fact that neighbor-joining does not seem like the appropriate algorithm for building a tree in this setting (I plan to blog about neighbor-joining shortly). But this is a methodological point not really relevant to the main result of the paper, namely proof of principle for targeted sequencing of endogenous DNA from fecal matter.

I think Sh*t-Seq has a future. The idea of targeted capture coupled to high-throughput sequencing has more than an economic rationale. It provides the possibility to probe the “deep field” as discussed in the previously mentioned review on targeted RNA-Seq. This is a general principle that should be more widely recognized.

And of course, dung is just cool. Happy new year!

My new year ‘s resolution.