You are currently browsing the tag archive for the ‘DREAM5’ tag.

Reproducibility has become a major issue in scientific publication, it is under scrutiny by many, making headlines in the news, is on the minds of journals, and there are guidelines for achieving it. Reproducibility is certainly important for scientific research, but I think that lost in the debate is the notion of usability. Reproducibility only ensures that the results of a paper can be recreated by others, but usability ensures that researchers can build on scientific work, and explore hypotheses and ideas unforeseen by authors of the original papers. Here I describe a case study in reproducibility and usability, that emerged from previous post I wrote about the paper

Feizi et al. describe a method called network deconvolution, that they claim improves the inference results for 8 out of 9 network inference methods, out of the 35 that were tested in the DREAM5 challenge. In DREAM5, participants were asked to examine four chip-based gene x expression matrices, and were also provided a list of transcription factors for each. They were then asked to provide ranked lists of transcription factor – gene interactions for each of the four datasets. The four datasets consisted of one computer simulation (the “in silico” dataset) and expression measurements in  E. coli, S. cerevisiae and S. aureus. The consortium received submission from 29 different groups and ran 6 other “off-the-shelf” methods, while also developing its own “community” method (for a total of 36=29+6+1). The community method consisted of applying the Borda count to the 35 methods being tested, to produce a new consensus, or community, network. Nicolas Bray and I tried to replicate the results of Feizi et al. so that we could test for ourselves the performance of network deconvolution with different parameters and on other DREAM5 methods ( Feizi et al. tested only 9 methods; there were 36 in total).  But despite contacting the authors for help we were unable to do so. In desperation, I even offered $100 for someone to replicate all of the figures in the paper. Perhaps as a result of my blogging efforts, or possibly due to a spontaneous change of heart, the authors finally released some of the code and data needed to reproduce some of the figures in their paper. In particular, I am pleased to say that the released material is sufficient to almost replicate Figure 2 of their paper which describes their results on a portion of the DREAM5 data. I say almost because the results for one of the networks is off, but to the authors credit it does appear that the distributed data and code are close to what was used to produce the figure in the paper (note: there is still not enough disclosure to replicate all of the figures of the paper, including the suspicious Figure S4 before and after revision of the supplement, and I am therefore not yet willing to concede the$100). What Feizi et al. did accomplish was to make their methods usable. That is to say, with the distributed code and data I was able to test the method with different parameters and on new datasets. In other words, Feizi et al. is still not completely reproducible, but it is usable. In this post, I’ll demonstrate why usability is important, and make the case that it is too frequently overlooked or confused with reproducibility. With usable network deconvolution code in hand, I was finally able to test some of the claims of Feizi et al.  First, I identify the relationship between the DREAM methods and the methods Feizi et al. applied network deconvolution to. In the figure below, I have reproduced Figure 2 from Feizi et al. together with Figure 2 from Marbach et al.:

Figure 2 from Feizi et al. aligned to Figure 2 from Marbach et al.

The mapping is more complex than appears at first sight. For example, in the case of Spearman correlation (method Corr #2 in Marbach et al., method #5 in Feizi et al.), Feizi et al. ran network deconvolution on the method after taking absolute values. This makes no sense, as throwing away the sign is to throw away a significant amount of information, not to mention it destroys any hope of connecting the approach to the intuition of inferring directed interactions from the observed via the idealized “model” described in the paper. On the other hand, Marbach et al. evaluated Spearman correlation with sign. Without taking the absolute value before evaluation negative edges, strong (negative) interactions, are ignored. This is the reason for the very poor performance of Spearman correlation and the reason for the discrepancy in bar heights between Marbach et al. and Feizi et al. for that method. The caption of Figure 2 in Feizi et al. begins “Network deconvolution applied to the inferred networks of top-scoring methods [1] from DREAM5..” This is obviously not true. Moreover, one network they did not test on was the community network of Marbach et al. which was the best method and the point of the whole paper. However the methods they did test on were ranked 2,3,4,6,8,12,14,16,28 (out of 36 methods). The 10th “community” method of Feizi et al. is actually the result of applying the community approach to the ND output from all the methods, so it is not in and of itself a test of ND. Of the nine tested methods, arguably only a handful were “top” methods. I do think its sensible to consider “top” to be the best methods for each category (although Correlation is so poor I would discard it altogether). That leaves four top methods. So instead of creating the illusion that network deconvolution improves 9/10 top scoring methods, what Feizi et al. should have reported is that 3 out of 4 of the top methods that were tested were improved by network deconvolution. That is the result of running network deconvolution with the default parameters. I was curious what happens when using the parameters that Feizi et al. applied to the protein interaction data (alpha=1, beta=0.99). Fortunately, because they have made the code usable, I was able to test this. The overall result as well as the scores on the individual datasets are shown below: The Feizi et al. results on gene regulatory networks using parameters different from the default. The results are very different. Despite the claims of Feizi et al. that network deconvolution is robust to choice of parameters, now only 1 out of 4 of the top methods are improved by network deconvolution. Strikingly, the top three methods tested have their quality degraded. In fact, the top method in two out of the three datasets tested is made worse by network deconvolution. Network deconvolution is certainly not robust to parameter choice. What was surprising to me was the improved performance of network deconvolution on the S. cerevisae dataset, especially for the mutual information and correlation methods. In fact, the improvement of network deconvolution over the methods is appears extraordinary. At this point I started to wonder about what the improvements really mean, i.e. what is the “score” that is being measured. The y-axis, called the “score” by Feizi et al. and Marbach et al. seemed to be changing drastically between runs. I wondered… what exactly is the score? What do the improvements mean? It turns out that “score” is defined as follows:

$score = \frac{1}{2} ( \frac{1}{3} \sum_{i=1}^3 -log_{10} p_{AUROC,i} + \frac{1}{3} \sum_{i=1}^3 -log_{10} p_{AUPR,i})$.

This formula requires some untangling: First of all, AUROC is shorthand for area under the ROC (receiver operator curve), and AUPR for area under the PR (precision recall) curve. For context, ROC is a standard concept in engineering/statistics. Precision and recall are used frequently, but the PR curve is used much less than ROC . Both are measures for judging the quality of a binary classifier. In the DREAM5 setting, this means the following: there is a gold standard of “positives”, namely a set of edges in a network that should be predicted by a method, and the remainder of the edges will be considered “negatives”, i.e. they should not be predicted. A method generates a list of edges, sorted (ranked) in some way. As one proceeds through the list, one can measure the fraction of positives and false positives predicted. The ROC and PR curves measure the performance. A ROC is simply a plot showing the true positive rate for a method as a function of the false positive rate. Suppose that there are positives in the gold standard out of a goal of edges. If one examines the top k predictions of a method, then among them there will be t “true” positives as well as k-t “false” positives. This will result in a single point on the ROC, i.e. the point ($\frac{k-t}{n-m},\frac{t}{m}$). This can be confusing at first glance for a number of reasons. First, the points do not necessarily form a function, e.g. there can be points with the same x-coordinate. Second, as one varies one obtains a set of points, not a curve. The ROC is a curve, and is obtained by taking the envelope of all of the points for $k \in \{1,\ldots,n\}$. The following intuition is helpful in understanding ROC:

1. The x coordinate in the ROC is the false positive rate. If one doesn’t make any predictions of edges at all, then the false positive rate is 0 (in the notation above k=0, t=0). On the other hand, if all edges are considered to be “true”, then the false positive rate is 1 and the corresponding point on the ROC is (1,1), which corresponds to k=n, t=m.
2. If a method has no predictive power, i.e. the ranking of the edges tells you nothing about which edges really are true, then the ROC is the line y=x. This is because lack of predictive power means that truncating the list at any k, results in the same proportion of true positives above and below the kth edge. And a simple calculation shows that this will correspond to the point ($\frac{k}{n},{k}{n}$) on the ROC curve.
3. ROC curves can be summarized by a single number that has meaning: the area under the ROC (AUROC). The observation above means that a method without any predictive power will have an AUROC of 1/2. Similarly, a “perfect” method, where he true edges are all ranked at the top will have an AUROC of 1. AUROC is widely used to summarize the content of a ROC curve because it has an intuitive meaning: the AUROC is the probability that if a positive and a negative edge are each picked at random from the list of edges, the positive will rank higher than the negative.

An alternative to ROC is the precision-recall curve. Precision, in the mathematics notation above, is the value $\frac{t}{k}$, i.e., the number of true positives divided by the number of true positives plus false positives. Recall is the same as sensitivity, or true positive rate: it is $\frac{t}{m}$. In other words, the PR curve contains the points ($\frac{t}{m},\frac{t}{k}$), as recall is usually plotted on the x-axis. The area under the precision-recall curve (AUPR) has an intuitive meaning just like AUROC. It is the average of precision across all recall values, or alternatively, the probability that if a “positive” edge is selected from the ranked list of the method, then an edge above it on the list will be “positive”. Neither precision-recall curves, nor AUPR are widely used. There is one problem with AUPR, which is that its value is dependent on the number of positive examples in the dataset. For this reason, it doesn’t make sense to average AUPR across datasets (while it does make sense for AUROC). For all of these reasons, I’m slightly uncomfortable with AUPR but that is not the main issue in the DREAM5 analysis. I have included an example of ROC and PR curves below. I generated them for the method “GENIE3” tested by Feizi et al.. This was the method with the best overall score. The figure below is for the S. cerevisiae dataset:

The ROC and a PR curves before (top) and after (bottom) applying network deconvolution to the GENIE3 network.

The red curve in the ROC plots is what one would see for a method without any predictive power (point #2 above). In this case, what the plot shows is that GENIE3 is effectively ranking the edges of the network randomly. The PR curve is showing that at all recall levels there is very little precision. The difference between GENIE3 before and after network deconvolution is so small, that it is indistinguishable in the plots. I had to create separate plots before and after network deconvolution because the curves literally overlapped and were not visible together. The conclusion from plots such as these, should not be that there is statistically significance (in the difference between methods with/without network deconvolution, or in comparison to random), but rather that there is negligible effect. There is a final ingredient that is needed to constitute “score”. Instead of just averaging AUROC and AUPR, both are first converted into p-values that measure the statistical significance of the method being different from random. The way this was done was to create random networks from the edges of the 35 methods, and then to measure their quality (by AUROC or AUPR) to obtain a distribution. The p-value for a given method was then taken to be the area under the probability density function to the right of the method’s value. The graph below shows the pdf for AUROC from the S. cerevisae DREAM5 data that was used by Feizi et al. to generate the scores:

Distribution of AUROC for random methods generated from the S. cerevisiae submissions in Marbach et al.

In other words, almost all random methods had an AUROC of around 0.51, so any slight deviation from that was magnified in the computation of p-value, and then by taking the (negative) logarithm of that number a very high “score” was produced. The scores were then taken to be the average of the AUROC and AUPR scores. I can understand why Feizi et al. might be curious whether the difference between a method’s performance (before and after network deconvolution) is significantly different from random, but to replace magnitude of effect with statistical significance in this setting, with such small effect sizes, is to completely mask the fact that the methods are hardly distinguishable from random in the first place. To make concrete the implication of reporting the statistical significance instead of effect size, I examined the “significant” improvement of network deconvolution on the S. cerevisae and other datasets when run with the protein parameters rather than the default (second figure above). Below I show the AUROC and AUPR plots for the dataset.

The Feizi et al. results before and after network deconvolution using alpha=1, beta=0.99 (shown with AUROC).

The Feizi et al. results before and after network deconvolution using alpha=1, beta=0.99 (shown with AUPR).

My conclusion was that the use of “score” was basically a red herringWhat looked like major differences between methods disappears into tiny effects in the experimental datasets, and even the in silico variations are greatly diminished. The differences in AUROC of one part in 1000 hardly seem reasonable for concluding that network deconvolution works. Biologically, both results are that the methods cannot reliably predict edges in the network. With usable network deconvolution code at hand, I was curious about one final question. The main result of the DREAM5 paper

was that the community method was best. So I wondered whether network deconvolution would improve it. In particular, the community result shown in Feizi et al. was not a test of network deconvolution, it was simply a construction of the community from the 9 methods tested (two communities were constructed, one before and one after network deconvolution). To perform the test, I went to examine the DREAM5 data, available as supplementary material with the paper. I was extremely impressed with reproducibility. The participant submissions are all available, together with scripts that can be used to quickly obtain the results of the paper. However the data is not very usable. For example, what is provided is the top 100,000 edges that each method produced. But if one wants to use the full prediction of a method, it is not available. The implication of this in the context of network deconvolution is that it is not possible to test network deconvolution on the DREAM5 data without thresholding. Furthermore, in order to evaluate edges absolute value was applied to all the edge weights. Again, this makes the data much less useful for further experiments one may wish to conduct. In other words, DREAM5 is reproducible but not very usable. But since Feizi et al. suggest that network deconvolution can literally be run on anything with “indirect effect”, I decided to give it a spin. I did have to threshold the input (although fortunately, Feizi et al. have assured us that this is a fine way to run network deconvolution), so actually the experiment is entirely reasonable in terms of their paper. The figure is below (produced with the default network deconvolution parameters),  but before looking at it, please accept my apology for making it. I really think its the most incoherent, illogical, meaningless and misleading figure I’ve ever made. But it does abide by the spirit of network deconvolution:

The DREAM5 results before and after network deconvolution.

Alas, network deconvolution decreases the quality of the best method, namely the community methodThe wise crowds have been dumbed down. In fact, 19/36 methods become worse, 4 stay the same, and only 13 improve. Moreover, network deconvolution decreases the quality of the top method in each dataset. The only methods with consistent improvements when network deconvolution is applied are the mutual information and correlation methods, poor performers overall, that Feizi et al. ended up focusing on. I will acknowledge that one complaint (of the many possible) about my plot is that the overall results are dominated by the in silico dataset. True- and I’ve tried to emphasize that by setting the y-axis to be the same in each dataset (unlike Feizi et al.) But I think its clear that no matter how the datasets are combined into an overall score, the result is that network deconvolution is not consistently improving methods. All of the analyses I’ve done were made possible thanks to the improved usability of network deconvolution. It is unfortunate that the result of the analyses is that network deconvolution should not be used. Still, I think this examples makes a good case for the fact that reproducibility is essential, but usability is more important.

This is the second in a series of three posts (part1, part3) on two back-to-back papers published in Nature Biotechnology in August 2013:

1. Baruch Barzel & Albert-László Barabási, Network link prediction by global silencing of indirect correlationsNature Biotechnology 31(8), 2013, p 720–725. doi:10.1038/nbt.2601
2. Soheil Feizi, Daniel Marbach, Muriel Médard & Manolis Kellis, Network deconvolution as a general method to distinguish direct dependencies in networksNature Biotechnology 31(8), 2013, p 726–733. doi:10.1038/nbt.2635

The Barzel & Barabási paper we discussed in yesterday’s blog post is embarrassing for its math, shoddy “validations” and lack of results. However of the two papers above it is arguably the “better” one. That is because the paper by Soheil Feizi, Daniel Marbach, Muriel Médard & Manolis Kellis, in addition to having similar types of problems to the Barzel & Barabási paper, is also dishonest and fraudulent. For reasons that I will explain in the third and final post in this series, we (Nicolas Bray and I) started looking at the Feizi et al. paper shortly after it was published early in August 2013. This is the story:

Feizi et al.‘s paper describes a method called network deconvolution that in their own words provides “…a systematic method for inferring the direct dependencies in a network, corresponding to true interactions, and removing the effects of transitive relationships that result from indirect effects.” They claim that the method is a “foundational graph theoretic tool” and that it “is widely applicable for computing dependencies in network science across diverse disciplines.” This high brow language makes network deconvolution sounds very impressive, but what exactly is the method?

Feizi et al. would like you to believe that the method is what is described in their Figure 1, part of which is shown below:

This is a model for networks represented as matrices. In this model an unknown matrix $G_{dir}$ with eigenvalues between -1 and 1 is to be inferred from an observed matrix $G_{obs}$ that is related to $G_{dir}$ via

$G_{obs} = G_{dir}+G_{dir}^2+G_{dir}^3+\cdots$                         (1)

The matrix $G_{dir}$ represents “direct effects” and the sum of its powers “indirect effects”. It is probably worth noting at this point that there is no particular reason to believe that effects will behave in this manner in any particular system of interest, but we return to this later. The eigenvalue assumption on $G_{dir}$ is required for the specification of the model, because without it the infinite sum generating $G_{obs}$ does not converge. An elementary manipulation of (1) shows that the matrix $G_{dir}$ can be recovered from $G_{obs}$ by the formula

$G_{dir} = G_{obs}(I+G_{obs})^{-1}$.                                    (2)

Unfortunately, many matrices cannot be decomposed as an infinite sum of powers of some matrix as in equation (1), so equation (2) cannot be applied directly to arbitrary data matrices. The only mention of this issue in the main text of the paper is the statement that “This [eigenvalue] assumption can be achieved for any matrix by scaling the observed input network by a function of the magnitude of its eigenvalues.” This is true but incoherent. There are an infinite number of scalings that will satisfy the assumption, and while the authors claim in their FAQ that “the effect of linear scaling on the input matrix is that … it does not have an effect” this is obviously false (also if it has no effect why do they do it?). For example, as the scaling goes to zero, $G_{dir}$ converges to $G_{obs}$.

What the authors have actually done with their seemingly innocuous aside is fundamentally change their model: now instead of (1), $G_{obs}$ is given by

$G_{obs} = \gamma \cdot (G_{dir}+G_{dir}^2+G_{dir}^3+\cdots)$.             (3)

The problem with this model is that given $G_{obs}$ there are an infinite number of solutions for $\gamma$ and $G_{dir}$. Feizi et al.‘s approach to dealing with this is to introduce a scaling parameter that must be chosen a priori. They do not even mention the existence of this parameter anywhere in the main text. Insteadthey choose to make a false statement in the caption of Figure 1 when they write “When these assumptions hold, network deconvolution removes all indirect flow effects and infers all direct interactions and weights exactly.” Even when $G_{obs}$ satisfies the eigenvalue constraint, once it is scaled before applying equation (2) the matrix $G_{dir}$ has probability zero of being recovered.

In the video below, recorded at the Banff workshop on Statistical Data Integration Challenges in Computational Biology: Regulatory Networks and Personalized Medicine (August 11–16, 2013), you can see an explanation of network deconvolution by the corresponding author Kellis himself. The first part of the clip is worth watching just to see Kellis describe inverting a matrix as a challenge and then explain the wrong way to do it. But the main point is at the end (best viewed full screen with HD):

Manolis Kellis received his B.S., M.S. and Ph.D degrees in computer science and electrical engineering from MIT, so it is hard to believe that he really thinks that solving (2), which requires nothing more than a matrix inversion, must be accomplished via eigendecomposition. In fact, inverting a 2000 x 2000 matrix in R is 50% slower using that approach. What is going on is that Kellis is hiding the fact that computation of the eigenvalues is used in Feizi et al. in order to set the scaling parameter, i.e. that he is actually solving (3) and not (2). Indeed, there is no mention of scaling in the video except for the mysteriously appearing footnote in the lower left-hand corner of the slide starting at 0:36 that is flashed for 2 seconds. Did you have time to think through all the implications of the footnote in 2 seconds, or were you fooled?

While it does not appear in the main text, the key scaling parameter in network deconvolution is acknowledged in the supplementary material of the paper (published with the main paper online July 14, 2013). In supplementary Figure 4, Feizi et al. show the performance of network deconvolution with different scaling parameters ($\beta$) on the three different datasets examined in the paper:

Figure S4, July 14, 2013.

In the words of the authors, the point of this figure is that “…choosing $\beta$ close to one (i.e., considering higher order indirect interactions) leads to the best performance in all considered network deconvolution applications.” However, while the supplement revealed the existence of $\beta$, it did not disclose the values used for the results in the paper.

We inquired with the authors and were surprised to discover that while $\beta = 0.95$ was used for the protein networks and $\beta = 0.99$ for the co-authorship network, $\beta=0.5$ was used for the DREAM5 regulatory networks violating their own advice. What rationale could there be for such a different choice, especially one very far away from the apparently optimal choice “near 1”? We asked the authors, who initially replied that the parameter setting didn’t make a difference. We then asked why the parameter would be set differently if its value didn’t make a difference; we never got an answer to this question.

Although it may be hard to believe, this is where the story gets even murkier.

Perhaps as a result of our queries, the scaling parameters were publicly disclosed by Feizi et al. in a correction to the original supplement posted on the Nature Biotechnology website on August 26, 2013. The correction states “Clarification has been made to Supplementary Notes 1.1, 1.3, 1.6 and Supplementary Figure 4 about the practical implementation of network deconvolution and parameter selection for application to the examples used in the paper. ” But Supplementary Figure 4 was not clarified, it was changed, a change not even acknowledged by the authors, let alone explained. Below is Figure S4 from the updated supplement for comparison with the one from the original supplement shown above.

Figure S4, August 26, 2013.

The revised figure is qualitatively and quantitatively different from its predecessor. A key point of the paper was changed- the scaling parameter $\beta$ is now seen to not be optimal near 1, but rather dataset dependent (this is reflected in a corresponding change in the text: in reference to choosing $\beta$ close to 1 “best performance” was changed to “high performance”). Specifically, the regulatory network dataset now has an optimal value of $\beta$ close to 0.5, perhaps providing an answer to our question above about why $\beta=0.5$ was used for that dataset alone. However that explanation directly implies that the authors knew the original figure was incorrect. We are left with two questions:

1. If the authors believed their original figure and their original statement that $\beta$ close to 1 “leads to the best performance in all considered network deconvolution applications”, then why did they use $\beta=0.5$ for one of the applications?

2. When and how was the July 16th Figure S4 made?

Manolis Kellis declined to answer question 1 and by not acknowledging the figure change in the correction did not permit the readers of Nature Biotechnology to ask question 2.

Clearly we have fallen a long way from the clean and canonical-seeming Figure 1. We wish we could say we have reached the bottom, but there is still a ways to go. It turns out that “network deconvolution” is actually neither of the two models (2) and (3) above, a fact revealed only upon reading the code distributed by the authors. Here is what is in the code:

1. affinely map the entries of the matrix $G_{obs}$ to lie between 0 and 1,
2. set the diagonal of the matrix to 0,
3. threshold the matrix, keeping only the largest $\alpha$ fraction of entries,
4. symmetrize the matrix,
5. scale the matrix so that $G_{dir}$ inferred in the next step will have maximum eigenvalue $\beta$,
6. apply formula (2),
7. affinely map between 0 and 1 again.

The parameter $\beta$ is the scaling parameter we’ve already discussed, but there is a second parameter $\alpha$. When the Feizi et al. paper was first published, the parameter $\alpha$ appeared in the code with default value 1, and was literally not mentioned in the paper. When asked, the authors revealed that it also takes on different values in the different experiments in the paper. In particular, the regulatory networks DREAM5 experiments used $\alpha=0.1$. The only guidance for how to set $\alpha$, now provided with the new version of the supplement, is “In practice, the threshold chosen for visualization and subsequent analysis will depend on the specific application.”

It is hard to say what steps 1–7 actually do. While it is impossible to give a simple analytic formula for this procedure as a whole, using the Sherman-Morrison formula we found that when applied to a correlation matrix C, steps 1, 2, 4, 6 and 7 produce a matrix whose ijth entry is (up to an affine mapping) $P_{ij} + \Sigma_{kl}P_{ij}P_{kl}+m\Sigma_{kl}P_{ik}P_{jl}$ where $P=C^{-1}$ and m is the minimum entry of C. Omitting step 1 results in Pii , the inverse correlation matrix, so the effect of the mapping in this case is the addition of the final two terms whose possible meaning escapes us. We are left to wonder why the authors chose to add this step which lacks any apparent theoretical justification. This bring us to question the meaning of the contribution of author Muriel Médard, a professor of information theory at MIT. According to the contributions section she is the one who “contributed to correctness proof and robustness analysis”. Presumably “correctness” meant describing the two steps needed to show that the solution to (1) is (2). But what would actually be relevant to the paper is a theorem about steps 1–7. Unfortunately such a theorem is hard to imagine.

When Feizi et al. was published in August 2013, they set up a companion website at http://compbio.mit.edu/nd/ where the code and data was made available (the data is no longer available at the site). On December 4, 2013, the authors updated the website. While there used to be one version of the program, there are now two different versions, one specifically for use on regulatory networks where the default values for the parameters are $\beta=0.5, \, \alpha =0.1$ and the other with defaults $\beta=0.9, \, \alpha=1$“Network deconvolution” is supposedly a universal method applicable to any kind of network. Will the authors continue to distribute different domain-specific versions of the program, each with their own “default” parameters somehow divined by themselves?

To summarize, “network deconvolution”, as best as we can currently tell, consists of steps 1–7 above with the parameters:

• DREAM5 challenge: $\beta = 0.5$ and $\alpha=0.1$.
• Protein network: $\beta = 0.99$ and $\alpha=1$.
• Co-authorship: $\beta=0.95$ and $\alpha=1$.

The description of the method in the main text of the paper is different from what is in the online methods, which in turn is different from what is in the supplement, which in turn is different from what is actually in each of the (two) released programs, namely the ad-hoc heuristic steps we have described above.

Having said all of that, one might be curious whether the “method” actually works in practice. Sometimes heuristics work well for reasons that are not well understood (although they are usually not published in journals such as Nature Biotechnology). Unfortunately, we have to disclose that we don’t know how the method performs on the datasets in the paper. We tried really hard to replicate the results of the paper, running the software on the provided input matrices and comparing them to the distributed output matrices (available on the NBT website). Despite our best attempts, and occasionally getting very close results, we have been unable to replicate the results of the paper. This bothers us to such an extent that I, Lior Pachter, am hereby offering \$100 to anyone (including Feizi et al.) who can provide me with the code and input to run network deconvolution and produce exactly the figures in the paper (including both versions of Figure S4) and the output matrices in the supplementary data.

Because we couldn’t replicate the results of the paper, we decided to examine how FK performs in one case for which Feizi et al. originally claimed (Supplement version July 14 2013) that their method is optimal. They stated “if the observed network is a covariance matrix of jointly Gaussian variables, ND infers direct interactions using global partial correlations”. This claim was deleted without mention in the updated supplement, however even if network deconvolution is not optimal for inferring a network from a correlation matrix (presumably what they meant in the original statement), the claimed universality suggested in the paper implies that it should be very good on any input.

The figure above shows a comparison of the Feizi et al. method (FK) with the various beta parameters used in their paper to regularized partial correlation, the “geometric root” (equation (2)) and the naïve method of simply selecting the top entries of the correlation matrix. In order to test the method we performed 40 simulations of sampling 500 observations from a random Gaussian graphical model with 1000 variables and an edge density of 5% to ensure the graph was connected yet sparse. Performance was assessed by comparing the ranking of the edges to the true edges present in the model and computing the area under the corresponding ROC curve. As a benchmark we applied partial correlation structural inference based on a James-Stein shrinkage estimation of the covariance matrix (a standard approach to gene regulatory network inference, described in Schäfer, J. & Strimmer, K. A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics. Statistical Applications in Genetics and Molecular Biology 4, (2005)).

That FK does better than the most naïve method and yet worse than a method actually designed for this task is perhaps to be expected for a heuristic method that is based on a metaphor (one that dates back to Seawall Wright’s work on path coefficients in the 1920s- although Wright used (1) in the restricted setting of directed acyclic graphs where it makes sense). It is possible that in some contexts, FK will make things somewhat better. However what such contexts might be is unclear. The only guidance that Feizi et al. provide on the assumptions needed to run their method is that they state in the supplement that the model assumes that networks are “linear time-invariant flow-preserving operators”. While it is true that individual parts of that phrase mean something in certain contexts, the complete phrase is word salad. We wonder: is co-authorship flow preserving?

To conclude: Feizi et al. have written a paper that appears to be about inference of edges in networks based on a theoretically justifiable model but

1. the method used to obtain the results in the paper is completely different than the idealized version sold in the main text of the paper and
2. the method actually used has parameters that need to be set, yet no approach to setting them is provided. Even worse,
3. the authors appear to have deliberately tried to hide the existence of the parameters. It looks like
4. the reason for covering up the existence of parameters is that the parameters were tuned to obtain the results. Moreover,
5. the results are not reproducible. The provided data and software is not enough to replicate even a single figure in the paper. This is disturbing because
6. the performance of the method on the simplest of all examples, a correlation matrix arising from a Gaussian graphical model, is poor.

In academia the word fraudulent is usually reserved for outright forgery. However given what appears to be deliberate hiding, twisting and torturing of the facts by Feizi et al., we think that fraud (“deception deliberately practiced in order to secure unfair or unlawful gain”) is a reasonable characterization. If the paper had been presented in a straightforward manner, would it have been accepted by Nature Biotechnology?

Post scriptum. After their paper was published, Feizi et al. issued some press releases. They explain how original and amazing their work is. Kellis compliments himself by explaining that “Introducing such a foundational operation on networks seems surprising in this day and age” and Médard adds that “the tool can be applied to networks of arbitrary dimension.” They describe the method itself as a way to have “…expressed the matrix representing all element correlations as a function of its principal components, and corresponding weights for each component” even though principal components has nothing to do with anything in their paper. As with Barzel-Barabási, this seems worthy of nomination for the Pressies.

Post post scriptum: In September 2013, we submitted a short commentary to Nature Biotechnology with many of the points made above. While it was rejected four months later, we really really wish we could share the reviews here. Unfortunately they were sent to us with a confidentiality restriction, though having said that, any of the reviewers are hereby invited to guest post on network science on this blog. We were disappointed that Nature Biotechnology would not publish the commentary, but not surprised. Frankly, it would have been too much to expect them and their reviewers to unravel the layers of deception in Feizi et al. in the initial review process; it took us an extraordinary amount of time and effort. Having published the paper, and without a clear path to unpublication (retraction is usually reserved for faking of experimental data), there were presumably not many options. The question is whether anything will change. Feizi et al. apparently have more Nature Biotechnology papers on the way as evidenced by an extraordinary declaration on Soheil Feizi’s publication page (this is a pdf of the version when this blog was posted; Feizi changed his site an hour later) where the publication venue of their next paper appears to have been preordained:

S. Feizi, G. Quon , M. Mendoza, M. Medard, M. Kellis
Comparative Analysis of Modular Integrative Regulatory Networks across Fly, Worm and Human
in preparation for Nature Biotechnology.

The term “in preparation” is common on academic websites, especially on sites of graduate students or postdocs who are eager to advertise forthcoming work, but “in preparation for” really requires some hubris. We have to wonder, why do Feizi et al. assume that peer review is just a formality in the journal Nature Biotechnology?

[Addendum 2/23: The authors posted a rebuttal to this blog post, which we replied to here.]

In the August 2013 issue of Nature Biotechnology there were two back-to-back methods papers published in the area of network theory:

1. Baruch Barzel & Albert-László Barabási, Network link prediction by global silencing of indirect correlationsNature Biotechnology 31(8), 2013, p 720–725. doi:10.1038/nbt.2601.
2. Soheil Feizi, Daniel Marbach, Muriel Médard & Manolis Kellis, Network deconvolution as a general method to distinguish direct dependencies in networksNature Biotechnology 31(8), 2013, p 726–733. doi:10.1038/nbt.2635.

This post is the first of a trilogy (part2, part3) in which my student Nicolas Bray and I tell the story of these papers and why we took the time to read them and critique them.

We start with the Barzel-Barabási paper that is about the applications of a model proposed by Barzel and his Ph.D. advisor, Ofer Biham (although all last names start with a B, Biham is not to be confused with Barabási):

In order to quantify connectivity in biological networks, Barzel and Biham proposed an experimental perturbation model in the paper Baruch Barzel & Ofer Biham, Quantifying the connectivity of a network: The network correlation function methodPhys. Rev. E 80, 046104 (2009) that forms the basis for network link prediction in Barzel-Barabási. In the context of biology, link prediction refers to the problem of identifying functional links between genes from data that may be confounded by indirect effects. For example, if gene A inhibits the expression of gene B, and also gene B inhibits the expression of gene C, then if the expression of A increases, it will decrease the expression of B, which in turn increase C. Therefore one might observe correlation in the expression levels of gene A and C, even though there is no direct interaction between them. The Barzel-Biham model is based on perturbation experiments. Assuming that a system of genes is in equilibrium, it is a model for the change in expression of one gene in response to a small perturbation in another.

The parameters in the Barzel-Biham model are entries in what they call a “local response matrix” S (any matrix with $S_{ii}=0$ for all i). Physical arguments pertaining to perturbations at equilibrium lead to the equations

$G=SG \,$ (off the diagonal) and $G_{ii}=1$ for all i.                                  (1)

for a “global response matrix” G that can, in principle, be observed and used to infer the matrix S. The innovation of Barzel and Barabási is to provide an approximate formula for recovering S from G, specifically the formula

$S \approx (G-I+D((G-I)G))^{-1}$                                                       (2)

where D(M) denotes the operation setting off-diagonal elements of M to zero. A significant part of the paper is devoted to showing that the approximation (2) is good. Then they suggest that (2) can be used to infer direct causal links in regulatory networks from collections of expression experiments. Barzel and Barabási claim that the approximation formula (2) is necessary because exact inference of S from G requires solving the intractable system of $n^2$ equations

$G=SG \,$ (off the diagonal) and $S_{ii}=0$ for all i.                                  (3)

The assertion of intractability is based on the claim that the equations are coupled. They reason that since the naïve matrix inversion algorithm requires $O(m^3)$ operations for m equations, the solution of (3) would require time $O(n^6)$. When we looked at this system, our first thought was that while it is large, it is also structured. We sat down and started examining it by writing down the equations for a simple case: a matrix S for a graph on 3 nodes. We immediately noticed the equations decoupled into n systems of n equations where system is given by $S_{ii}=0$ and $G_{ij} = \sum_k S_{ik} G_{kj}, \, (j \neq i)$, with the n unknowns $S_{i1},\ldots,S_{in}$. This immediately reduces the complexity to $O(n^4)$, or even $O(n^3)$ by simple parallelization. In other words, the system is trivially tractable.

But there is more: while looking at the paper I had to take a quick bathroom break, and by the time I returned Nick had realized he could apply the Sherman-Morrison formula to obtain the following formula for the exact solution:

$S=I-D(1/G^{-1})G^{-1}$.                                                                    (4)

Here the operator “/” denotes element-wise division, a simple operation to execute, so that inferring S from G requires no more than inverting G and scaling it, a formula that is also much simpler and more efficient to compute than (2). [Added 2/23: Jordan Ellenberg  pointed out the obvious fact that $G=SG$ off the diagonal means that $SG=G-D$ for some diagonal matrix $D$, and therefore $S=I-DG^{-1}$ and since the diagonal entries of $S$ must be zero it follows that $D=D(1/G^{-1})$. In other words, the Sherman-Morrison formula is not even needed]. While it would be nice for us to claim that our managing to quickly supersede the main result of a paper published in Nature Biotechnology was due to some sort of genius, in fact the entire exercise would be suitable for an undergraduate linear algebra homework problem. Barabási likes to compare himself to the great physicist and nobel laureate Subrahmanyan Chandrasekhar, but it is difficult to imagine the great Chandrasekhar having similar difficulties.

The approach to solving (4) has an implication that is even more important than the solution itself. It provides a dual formula for computing G from S according to (1), i.e. to simulate from the model. Using the same ideas as above, one finds that

$G=TD(1/T)$ where $T=(S-I)^{-1}$.                                              (5)

Unlike Barzel & Barabási that resorted to simulating with Michaelis-Menten dynamics in their study of performance of their approximation, using (4) we can efficiently simulate data directly from the model. One issue with Michaelis-Menten dynamics is that they make more sense for enzymatic networks as opposed to regulatory networks (for more on this see Karlebach, G. & Shamir, R. Modelling and analysis of gene regulatory networksNat Rev Mol Cell Biol 9, 770–780 (2008)), but in any case performance on such dynamics is hardly a validation of (2) since its mixing apples and oranges. So what happens when one simulates data from the Barzel-Biham model and then tries to recover the parameters?

A comparison of the standard method of regularized partial correlations with exact inference for the Barzel-Biham model. Random sparse graphs were generated according to the Erdös-Renyi graph model G(5000,p) where was varied to assess performance at different graph densities (shown on x-axis) The y-axis shows the average AUROC obtained from 75 random trials at each density.

When examining simulations from the Barzel-Biham model with graphs on 5,000 nodes (see Figure above), we were surprised to discover that when adding even small amounts of noise, the exact algorithm (4) failed to recover the local response matrix from G (we also analyzed the approximation (3) and observed that it always resulted in performance inferior to (4), and that 5% of the time the correlation with the exact solution was negative). This sensitivity to noise is due to the term $D(1/G^{-1})$ in the exact formula which becomes problematic if the diagonal entries of $G^{-1}$ are close to zero.

Some intuition for the behavior of $G^{-1}$ may be gained from noting that if S is such that its geometric sum converges, the diagonal of $G^{-1}$ is equal to that of $I+S+S^2+S^3+\cdots$. If S has mixed signs and there is significant feedback within the network, the diagonal of $G^{-1}$ may be close to zero and any noise in the measurement of G could create very large fluctuations in the inferred S. This means that the results in Figure 1 are not dependent on the graph model chosen (Erdös-Renyi) and will occur for any reasonable model of gene regulatory networks including the modeling of both enhancers and repressors. From Figure 2a in their paper, it appears that Barzel and Barabási used in their simulation an S with only positive entries that would preclude such effects. Such an assumption is biologically unrealistic.

However, the difficulties with noise for the Barzel-Biham model go much deeper. While a constant signal-to-noise ratio, as assumed by Barzel and Barabasi, is a commonly used model for errors in experiments, it is important to remember that there is no experiment for directly measuring the elements of G. Obtaining $G_{ij}$ from an experiment is done by making a small perturbation of size e to gene i, observing the change in gene j, and then dividing that change by e. This last step increases the noise on the estimate of $G_{ij}$ by a factor of 1/e (a large number, for a perturbative experiment) above the noise already present in the measurements. Increasing e acts to remove the system from the perturbative regime and thereby increases the intrinsic error in estimating G.  It is therefore the case that attaining reasonable error on G will require very low noise in the original measurements. In this case of biological networks this would mean performing many replicates of the experiments. However, as Barzel & Barabási acknowledge in their paper, even a single replicate of a perturbation experiment is not currently feasible.

While the exact algorithm (4) for inverting the Barzel-Biham model performs poorly, we found that a widely used shrinkage method based on partial correlation (Schäfer, J. & Strimmer, K. A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional GenomicsStatistical Applications in Genetics and Molecular Biology 4, (2005)) outperforms the exact algorithm (blue curve in Figure above). This suggests that there is no input for which (4) might be useful. The method is not even ideal for inference from data generated by the model it is based on.

This brings us to the “results” section of the paper. To demonstrate their method called Silencer, Barzel & Barabási ran it on only one of three datasets from the DREAM5 data. They then compared the performance of Silencer to three out of thirty five methods benchmarked in DREAM5. The Barzel-Biham model is for perturbation experiments, but Barzel & Barabási just threw in data from another universe (e.g. mutual information matrices). But lets just go with that for a moment. Their results are shown in the figure below:

Figure 3 from Barzel-Barabási.

The three methods tested potential improvements on are Pearson, Spearman and Mutual Information. Pearson and Spearman rank 16/35 and 18/35 respectively in the DREAM5 benchmarks. There may be some reason why Silencer applied on top of these methods improves performance: in the case where G is a correlation matrix, the path interpretation given by Barzel and Barabási connects the inference procedure to Seawall Wright’s path coefficients (ca. 1920), which in turn suggests an interpretation in terms of partial correlation. However in the case of mutual information, a method that is ranked 19/35 in the DREAM5 benchmarks, there is no statistically significant improvement at all. The improvement is from an AUROC of 0.67 to 0.68. Amazingly, Barzel and Barabási characterize these results by remarking that they “improve upon the top-performing inference methods” (emphasis on top is ours).  Considering that the best of these rank 16/35 the use of the  word “top” seems, shall we say, unconventional.

We have to ask: how did Barzel & Barabási get to publish a paper in the journal Nature Biotechnology on regulatory network inference without improvement or testing on anything but a handful of mediocre DREAM5 methods from a single dataset? To put the Barzel-Barabási results in context, it is worth considering the standards the Feizi et al. paper were held to. In that paper the authors compared their method to DREAM5 data as well, except they tested on all 3 datasets and 9 methods (and even on a community based method). We think its fair to conclude that significantly more testing would have to be done to argue that Silencer improves on existing methods for biological network inference.

We therefore don’t see any current practical utility for the Barzel-Biham model, except possibly for perturbation experiments in small sub-networks. Even then, we don’t believe it is practical to perform the number of experiments that would be necessary to overcome signal to noise problems.

Unfortunately the problems in Barzel-Barabási spill over into a follow up article published by the duo: Barzel, Baruch, and Albert-László Barabási, “Universality in Network Dynamics.” Nature Physics 9 (2013). In the paper they assume that the local response matrix has entries that are all positive, i.e. they do not allow for inhibitory interactions. Such a restriction immediately renders the results of the paper, if they are to be believed, moot in terms of biological significance. Moreover, the restrictions on S appear to be imposed in order to provide approximations to G that are unnecessary in light of (5). Given these immediate issues, we suspect that were we to read the Universality paper carefully, it is quite likely this post would have to be lengthened considerably.

These are not the first of Barabási’s papers to package meaningless and incoherent results in Nature/Science publications. In fact, there is a long history of Barabási publishing with fanfare in top journals only to have others respond by publishing technical comments on his papers, in many cases refuting completely the claims he makes. Unfortunately many of the critiques are not well known because they are rejected from the journals where Barabási is successful, and instead find their way to preprint servers or more specialized publications. Here is a partial list of Barabási finest and the response(s):

1. Barabasi is famous for the “BA model”, proposed in Barabási and Albert ‘Emergence of Scaling in Random Networks“, Science, Vol. 286 15 October 1999, pp. 509-512. Lada Adamic and Bernardo Huberman immediately refuted the practical applications of the model. Moreover, as pointed out by Willinger, Alderson and Doyle, while it is true that scale-free networks exhibit some interesting mathematical properties (specifically they are resilient to random attack yet vulnerable to worst-case), even the math was not done by Barabási but by the combinatorialists Bollobás and Riordan.
2. Barabási has has repeatedly claimed metabolic networks as prime examples of scale-free networks, starting with the paper Jeong et al., “The large-scale organization of metabolic networks“, Nature 407 (2000). This fact has been disputed and refuted in the paper Scale Rich Metabolic Networks by Reiko Tanaka.
3. The issue of attack tolerance was the focus of Error and attack tolerance of complex networks by Réka Albert, Hawoong Jeong & Albert-László Barabási in Nature 406 (2000). John Doyle refuted the paper completely in this paper.
4. In the paper “The origin of bursts and heavy tails in human dynamics” published in Nature 435 (2005) Barabási pretends to offer insights into the “bursty nature of human behavior” (by analyzing e-mail). In a follow up comment Daniel Stouffer, Dean Malmgren and Luis Amaral demonstrate that the reported power-law distributions are solely an artifact of the analysis of the empirical data and the proposed model is not representative of e-mail communication patterns.
5. Venturing into the field of control theory, the paper “Controllability of complex networks” by Liu, Slotine and Barabási, Nature 473 (2011) argues that “sparse inhomogeneous networks, which emerge in many real complex systems, are the most difficult to control, but that dense and homogeneous networks can be controlled using a few driver nodes.” Not so. In a beautiful and strong rebuttal, Carl Bergstrom and colleagues show that a single control input applied to the power dominating set is all that is required for structural controllability of most, if not all networks. I have also blogged about this particular paper previously, explaining the Bergstrom result and why it reveals the Barabási paper to be a control theory embarrassment.

In other words, Barabási’s “work” is a regular feature in the journals Nature and Science despite the fact that many eminent scientists keep demonstrating that the network emperor has no clothes.

Post scriptum. After their paper was published, Barzel and Barabási issued a press release claiming that “their research moves the team a step closer in its quest to under­stand, pre­dict, and con­trol human disease.” The advertisment seems like an excellent candidate for Michael Eisen’s pressies.