Given the reaction to our use of the word “fraud” in our blog post on Feizi et al., we would like to remind readers how it was actually used:
In academia the word fraudulent is usually reserved for outright forgery. However given what appears to be deliberate hiding, twisting and torturing of the facts by Feizi et al., we think that fraud (“deception deliberately practiced in order to secure unfair or unlawful gain”) is a reasonable characterization. If the paper had been presented in a straightforward manner, would it have been accepted by Nature Biotechnology?
While the reaction has largely focused on reproducibility and the swapping of figures, we reiterate our stance: misleading one’s readers (and reviewers) is itself a form of scientific fraud. Regarding the other questions raised, the response from Feizi, Marbach, Médard and Kellis falls short. On their new website their code has changed, their explanations are in many cases incoherent, self-contradictory, and make false claims, and the newly added correction to Figure S4 turns out not to explain the difference between the figures in the revisions. We explain all of this below. And yet for us the claim of fraud can stand on the basis of deception alone. The distance between the image created by the main text of the paper and the truth of their method is simply too great.
However, one unfortunate fact is that that judging that distance requires understanding some of the mathematics in the paper, and another common reaction has been “I don’t understand the math.” For such readers, we explain network deconvolution in simple terms by providing an analogy using numbers instead of matrices. Understanding this requires no more than high school algebra. We follow that with a response to Feizi et al.‘s rebuttal.
Network deconvolution math made simple
In what follows, number deconvolution, a simplified form of network deconvolution, is explained in red. Network deconvolution is explained in blue.
The main concept to understand is that of a geometric series. This is a series with a constant ratio between successive terms, for example
In this case, each term is one half the previous term, so that the ratio between successive terms is always . This sum converges to , an intuitive notion that is formalized by defining the infinite sum to be
This is the familiar math of the Dichotomy paradox from Zeno’s paradoxes.
It is important to note that not every geometric series converges. If the number in the sum above is replaced by , then the series is said to diverge. The are two basic questions about geometric series: (1) for which numbers do they converge and (2) when they do converge, what do they converge to. These two questions are answered in the following:
Let be a real number. The geometric series converges if, and only if, , in which case
It is not hard to see why this is true:
Returning to the example of , we see that it is a special case of this result. It converges because , and furthermore, .
Matrices behave like numbers in some ways, for example they can be added and multiplied, yet quite differently in others. The generalization of geometric series as in Feizi et al. specifically deals with diagonalizable matrices, a class of matrices which has many convenient properties. Let be such a matrix. The geometric series converges if, and only if, the eigenvalues of (numbers associated to the matrix) lie strictly between and , in which case
Notice that what has changed is that the condition that has been replaced by an eigenvalue restriction on m and the formula is just like except the operation of matrix inversion is required instead of number inversion. The result is obvious from elementary linear algebra because the assumption that is diagonalizable means that for some matrix , where is a diagonal matrix with the eigenvalues of on the diagonal. Therefore and the series converges if, and only if, the geometric series formed by summing the powers of each eigenvalue converge.
In following Feizi et al., we call the number in the example above the “direct number”, and the remainder of the sum the “indirect number”. One should think of the indirect number as consisting of echoes of the direct number. Furthermore, following the language in Feizi et al., we call the ratio between terms, again the number in the example above, the indirect flow. When it is strong, the terms decrease slowly. When it is weak, the terms decrease rapidly.
In Feizi et al., the matrix A is called the “direct effect”, and the remainder of the sum the “indirect effect”. One should think of the indirect effect as consisting of echoes of the direct effect. Furthermore, following the language in Feizi et al., we call the rate at which the terms in the indirect effect decay the indirect flow. When it is strong, the terms decrease slowly. When it is weak, the terms decrease rapidly.
In number deconvolution the goal is to infer from . That is, one would like to remove the indirect effect, the echoes, out of . If , then solving for we obtain
That is the formula for number deconvolution.
In network deconvolution the goal is to infer from . That is, one would like to remove the indirect effect, the echoes, out of as follows: If then solving for we obtain
That is the formula for network deconvolution.
Everything looks great but there is a problem. Even though one can plug any number (except ) into the formula and get out an , the formula only gives an satisfying when . For example, starting with and then plugging it into number deconvolution one gets . But diverges.
The restriction on in number deconvolution, namely that it has to be bigger than , can be translated into the same restriction on the eigenvalues of for network deconvolution. This follows from the fact that means that if is an eigenvalue of then is an eigenvalue of and all the eigenvalues of arise in this way. Therefore, and the condition holds if, and only if, .
But what if we wanted number deconvolution to work for all negative numbers? One idea is to follow Feizi et al.’s approach and introduce a scaling factor called to be applied to , so that the product can be deconvolved. For example, suppose we start with . We’d like to find . But we can’t just apply number deconvolution. So we multiply by to get , and then deconvolve that to get . We could have multiplied by something else, say in which case we’d get . In fact, there are infinitely many scaling factors, giving infinitely many solutions . In fact, we can get any between and .
When is not writable as a geometric sum (decomposable), there exist scaling factors such that the scaled matrix is decomposable. This is obvious because if are the eigenvalues of , then are the eigenvalues of so by choosing we are guaranteed that is decomposable. Let be a real symmetric matrix with minimum/maximum eigenvalues and respectively and . If
then the matrix is decomposable as . Furthermore, if are the minimum/maximum eigenvalues of respectively then (we omit the derivation, but it is straightforward).
Matrices do behave differently than numbers, and so it is not true that network deconvolution can produce any matrix as an output. However, as with number deconvolution, it remains true that network deconvolution can produce an infinite number of possible matrices.
Not only do Feizi et al. not mention this anywhere in the main text, instead they create the impression that there is a single output from any input.
Feizi et al.’s response
We now turn to the response of Feizi et al. to our post. First, upon initial inspection of the “Try It Out” site, we were astounded to find that the code being distributed there is not the same as the code distributed with the paper (in particular, step 1 has been removed) and so it seems unlikely that it could be used to replicate the paper’s results. Unless, that is, the code originally distributed with the paper was not the code used to generate its results. Second, the “one click reproduction”s that the authors have now added do not actually start with the original data but instead merely produce plots from some results without showing how those results were generated. This is not what reproducibility means. Third, while the one matrix of results from a “one click reproduction” that we looked at (DI_ND_1bkr.mat) was very close to the matrix originally distributed with the paper, it was slightly different. It was close and hopefully it does generate basically the same figure, but as we explain below, we’ve spent a good bit of time on this and have no desire to spend any more. This is why we have tried to incentivize someone to simply reproduce the results.
In retrospect, we regret not explaining exactly how we came to be so skeptical about the reproducibility of the results in the Feizi et al. paper. While we very quickly (yet not instantly) recognized how misleading the paper is, our initial goal in looking at their results was not to verify reproducibility (which we assumed) but rather to explore how changing the parameters affected the results. Verifying that we could recover the results from the paper was only supposed to be the first step in this process. We downloaded the file of datasets and code provided by the authors and began examining the protein contact dataset.
After writing scripts that we verified exactly regenerated some figures from the paper from the output matrices distributed by the authors, we then checked whether we arrived at the same results when running the ND code on what we assumed was the input data (files with names like “MI_1bkr.txt” while the output was named “MI_ND_1bkr.txt”). We were surprised when the output did not match, however we were then informed by the authors that they had not distributed the input data but rather thresholded versions of the input. When we asked the authors to provide us with the actual data, we were told that it would violate scientific etiquette to send us another scientist’s dataset and that we would have to regenerate it ourselves. We had never heard this claimed point of etiquette before, but acquiesced and attempted to do so. However, the matrices produced from the original data were actually a different size than those used by the authors (suggesting that they had used a different alignment).
Stymied on that front, we turned to the DREAM5 dataset instead. While the actual scoring that goes into the DREAM analysis is fairly complicated, we decided that we would start by merely checking that we could regenerate the output provided by the authors. We eventually received step-by-step instructions for doing so and, because those steps did not produce the provided output, asked a simple question suggested by our experience with protein contact dataset: to generate the provided results, should we use as input the provided non-ND matrices? We asked this question four times in total without receiving a reply. We sent our script attempting to implement their steps and received no response.
At some point, we also discovered that the authors had been using different parameter values for different datasets without disclosing that to their readers. They could provide no coherent explanation for doing so. And sometime after this, we found that the authors had removed the data files from their website: readers were directed instead to acquire the datasets elsewhere and the results from the paper were no longer provided there.
At this point, we expect it is obvious why we were skeptical about the reproducibility of the results. Having said that, we never wanted to believe that the results of the paper were not reproducible. It was not our initial assumption and we still hope to be proven wrong. However, we continue to wait, and the fact that the code has been changed yet again, and that the authors’ explanation in regards to figure S4 does not appear to explain its changes (see below) do not fill us with confidence.
We repeat that while most of the discussion above has focused on replicability, we will quote again the definition of “fraud” used in the original post: “deception deliberately practiced in order to secure unfair or unlawful gain”. Our main point is this: would any honest reader of the main text of the original paper recognize the actual method implemented? We don’t think so, hence deception. And we believe the authors deliberately wrote the paper in this way to unfairly gain acceptance at Nature Biotechnology.
Now, in response to the authors rebuttal, we offer the following (Feizi et al. remarks in italics):
The pre-processing step that Bray and Pachter criticize (step 1 in their description of our work) has no effect on the performance of ND. Mathematically, because our matrices are non-negative with min zero, a linear scaling of the network results in a similar scaling of its eigenvalues, which are normalized during eigenvalue scaling, canceling out the linear scaling of step 1. Practically, removing step 1 from the code has little effect on the performance of ND on the DREAM5 regulatory networks, as we show in Figure 1 below.
We cannot imagine how the authors came to make the statement “our matrices are non-negative with min zero”. After all, one of the inputs they used in their paper was correlation matrices and they surely know that correlations can be negative. If the map in step 1 were linear rather than – as we correctly stated in the document they are responding to – affine (for those unfamiliar with the terms, “linear” here means only scaling values while “affine” means scaling and shifting as well), then this explanation would be correct and it would have no effect. This just makes it even stranger that immediately after giving this explanation, the authors produced a graph showing the (sometimes quite significant) effect of what we assume was the affine mapping.
Stranger still is that after this defense, Step 1 has now been removed from the code.
The eigenvalue scaling (step 5) is essential both theoretically, to guarantee the convergence of the Taylor series, and practically, as performance decreases without it in practice. This step is clearly stated in both the main text of our paper and in the supplement, despite the claims by Bray and Pachter that it somehow was maliciously concealed from our description of the method (just search the manuscript for the word ‘scaling’). Our empirical results confirm that this step is also necessary in practice, as ND without eigenvalue scaling is consistently performing worse, as we show in Figure 2 below:
We do not understand why the author’s felt the need to defend the use of eigenvalue scaling when we have never suggested it should be removed. Our objection was rather to how it was presented (or not) to the reader. Yes, there is a mention of the word “scaling”. But it is done so in a way that makes it sound as though it were some trivial issue: there’s an assumption, you scale, assumption satisfied, done. And this mention occurs in the context of the rest of the main text of the paper which presents a clear image of a simple, parameterless, 100% theoretically justified method: you have a matrix, you put it into the method, you get the output, done (and it is “globally optimal”). In other words, it is the math we discuss above prior to the arrival of the parameter .
Yes, scaling is dealt with at length in the supplement. Perhaps Feizi et al. think it is fine for the main text of a paper to convey such a different image from the truth contained in its supplement and in the code. We do not and we would hope the rest of the scientific community agrees with us.
Appendix C: Robustness to input parameters.
Bray and Pachter claim that “the reason for covering up the existence of parameters is that the parameters were tuned to obtain the results”. Once again the claims are incorrect and unfounded. (1) First, we did not cover up the existence of the parameters. (2) Second, we used exactly the exact same set of parameters for all 27 tested regulatory networks in the DREAM challenge. (3) Third, we show here that our results are robust to the parameter values, using β = 0.5 and β = 0.99 for the DREAM5 network.
(1) Again: in the main text of the paper, there is not even a hint that the method could have parameters. This is highly relevant information that a reader who is looking at its performance on datasets should have access to. That reader could then ask, e.g. what parameters were used and how they were selected. They would even have the ability to wonder if the parameters were tuned to improve apparent performance. It is worth noting that when the paper was peer reviewed, the reviewers did not have access to the actual values of the parameters used.
(2) It is, of course, perfectly possible to tune parameters while using those same parameters on multiple datasets.
(3) The authors seem to have forgotten that there are two parameters to the method and that they gave the other parameter (alpha) different values on different datasets as well. They also have omitted their performance on the “Community” dataset for some reason.
Appendix D: Network Deconvolution vs. Partial Correlation.
Bray and Pachter compare network deconvolution to partial correlation using a test dataset built using a partial correlation model. In this very artificial setting, it is thus not surprising that partial correlation performs better, as it exactly matches the process that generated the data in the first place. To demonstrate the superiority of partial correlation, Bray and Pachter should test it on real datasets, such as the ones provided as part of the DREAM5 benchmarks . In our experience, partial correlation performed very poorly in practice, but we encourage Bray and Pachter to try it for themselves.
We looked at partial correlations here for a very simple reason: the authors originally claimed that their method reduced to it in the context we considered here. Thus, comparing them seemed a natural thing to do.
The idea that we should look at the performance of partial correlations in other contexts makes no sense: we have never claimed that it is the right tool to solve every problem. Indeed, it was the authors who claimed that their tool was “widely applicable”. By arguing for domain specific tools, they seem to be making our point for us.
Claim 1: “the method used to obtain the results in the paper is completely different than the idealized version sold in the main text of the paper”.
The paper clearly describes both the key matrix operation (“step 6″ in Lior Pachter’s blog post) which is shown in figure 1, and all the pre- and post-processing steps, which are all part of our method. There is nothing mischievous about including these pre- and post-processing steps that were clearly defined, well described, and implemented in the provided code.
The statement that the paper describes all steps is simply false. For example, the affine mapping appears nowhere.
Claim 2: “the method actually used has parameters that need to be set, yet no approach to setting them is provided”.
It is unfortunate that so many methods in our field have parameters associated with them, but we are not the first method to have them. However, we do provide guidelines for setting these in the supplement.
It is strange to suggest that it is unfortunate that methods have parameters. It is also strange to describe their guidelines as such when they did not even follow them. Also, there is no guideline for setting alpha in the supplement. In their FAQ, they say “we used beta=0.5 as we expected indirect flows to be propagating more slowly”. We wonder where one derives these expectations from.
Dr. Pachter also points out that a correction to Supplement Figure S4 on August 26th 2013 was not fully documented. We apologize for the omission and provide additional details in an updated correction notice dated February 12, 2014.
The correction notice states that the original Figure S4 was plotted with the incorrect x-axis. In particular it states that the maximum eigenvalue of the observed network was used, instead of the maximum eigenvalue of the direct network. We checked this correction, and have produced below the old curve and the new curve plotted on the same x-axis:
The new and old Figure S4. The old curve, remapped into the correct x-axis coordinates is shown in red. The new curve is shown in blue. Raw data was extracted from the supplement PDFs using WebPlotDigitizer.
As can be seen, the two curves are not the same. While it is expected that due to the transformation the red curve should not cover the whole x-axis, if the only difference was the choice of coordinates, the curves should overlap exactly. We cannot explain the discrepancy.