Reproducibility has become a major issue in scientific publication: it is under scrutiny by many, it is making headlines in the news, it is on the minds of journals, and there are guidelines for achieving it. Reproducibility is certainly important for scientific research, but I think that lost in the debate is the notion of usability. Reproducibility only ensures that the results of a paper can be recreated by others, but usability ensures that researchers can build on scientific work, and explore hypotheses and ideas unforeseen by the authors of the original papers. Here I describe a case study in reproducibility and usability that emerged from a previous post I wrote about the paper
- Soheil Feizi, Daniel Marbach, Muriel Médard & Manolis Kellis, Network deconvolution as a general method to distinguish direct dependencies in networks, Nature Biotechnology 31(8), 2013, pp. 726–733.
Feizi et al. describe a method called network deconvolution, which they claim improves the inference results for 8 out of 9 network inference methods, out of the 35 that were tested in the DREAM5 challenge. In DREAM5, participants were asked to examine four chip-based gene expression matrices, and were also provided a list of transcription factors for each. They were then asked to provide ranked lists of transcription factor–gene interactions for each of the four datasets. The four datasets consisted of one computer simulation (the “in silico” dataset) and expression measurements in E. coli, S. cerevisiae and S. aureus. The consortium received submissions from 29 different groups and ran 6 other “off-the-shelf” methods, while also developing its own “community” method (for a total of 36 = 29 + 6 + 1). The community method consisted of applying the Borda count to the 35 methods being tested, to produce a new consensus, or community, network (a minimal sketch of this kind of rank aggregation appears right after the figure below).

Nicolas Bray and I tried to replicate the results of Feizi et al. so that we could test for ourselves the performance of network deconvolution with different parameters and on other DREAM5 methods (Feizi et al. tested only 9 methods; there were 36 in total). But despite contacting the authors for help we were unable to do so. In desperation, I even offered $100 for someone to replicate all of the figures in the paper. Perhaps as a result of my blogging efforts, or possibly due to a spontaneous change of heart, the authors finally released some of the code and data needed to reproduce some of the figures in their paper. In particular, I am pleased to say that the released material is sufficient to almost replicate Figure 2 of their paper, which describes their results on a portion of the DREAM5 data. I say almost because the results for one of the networks are off, but to the authors’ credit it does appear that the distributed data and code are close to what was used to produce the figure in the paper (note: there is still not enough disclosure to replicate all of the figures of the paper, including the suspicious Figure S4 before and after revision of the supplement, and I am therefore not yet willing to concede the $100).

What Feizi et al. did accomplish was to make their method usable. That is to say, with the distributed code and data I was able to test the method with different parameters and on new datasets. In other words, Feizi et al. is still not completely reproducible, but it is usable. In this post, I’ll demonstrate why usability is important, and make the case that it is too frequently overlooked or confused with reproducibility.

With usable network deconvolution code in hand, I was finally able to test some of the claims of Feizi et al. First, I identified the relationship between the DREAM methods and the methods Feizi et al. applied network deconvolution to. In the figure below, I have reproduced Figure 2 from Feizi et al. together with Figure 2 from Marbach et al.:
Figure 2 from Feizi et al. aligned to Figure 2 from Marbach et al.
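As promised above, here is a minimal sketch of Borda-count rank aggregation, the idea behind the DREAM5 “community” network. This is a hypothetical illustration written in Python for this post, not the consortium’s actual scripts; the function name borda_aggregate and the toy edges are made up.

# Hypothetical illustration of Borda-count rank aggregation (not the DREAM5 code).
# Each method contributes a list of edges ranked best-first; an edge earns more
# points the higher it is ranked, and the consensus ranks edges by total points.

from collections import defaultdict

def borda_aggregate(ranked_lists):
    """Each input is a list of edges ranked best-first; return the consensus ranking."""
    points = defaultdict(float)
    for ranking in ranked_lists:
        n = len(ranking)
        for position, edge in enumerate(ranking):
            points[edge] += n - position   # top edge gets n points, bottom edge gets 1
    return sorted(points, key=points.get, reverse=True)

# Toy example: three "methods" ranking the same four TF-gene edges.
m1 = [("TF1", "gA"), ("TF2", "gB"), ("TF1", "gC"), ("TF3", "gD")]
m2 = [("TF2", "gB"), ("TF1", "gA"), ("TF3", "gD"), ("TF1", "gC")]
m3 = [("TF1", "gA"), ("TF3", "gD"), ("TF2", "gB"), ("TF1", "gC")]
print(borda_aggregate([m1, m2, m3]))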
The mapping is more complex than it appears at first sight. For example, in the case of Spearman correlation (method Corr #2 in Marbach et al., method #5 in Feizi et al.), Feizi et al. ran network deconvolution on the method after taking absolute values. This makes no sense, as throwing away the sign is to throw away a significant amount of information, not to mention that it destroys any hope of connecting the approach to the intuition of inferring direct interactions from the observed ones via the idealized “model” described in the paper. On the other hand, Marbach et al. evaluated Spearman correlation with sign. Without taking the absolute value before evaluation, negative edges, i.e. strong (negative) interactions, are ignored. This is the reason for the very poor performance of Spearman correlation and the reason for the discrepancy in bar heights between Marbach et al. and Feizi et al. for that method.

The caption of Figure 2 in Feizi et al. begins “Network deconvolution applied to the inferred networks of top-scoring methods [1] from DREAM5..” This is obviously not true. Moreover, one network they did not test on was the community network of Marbach et al., which was the best method and the point of the whole paper. However, the methods they did test on were ranked 2, 3, 4, 6, 8, 12, 14, 16, 28 (out of 36 methods). The 10th “community” method of Feizi et al. is actually the result of applying the community approach to the ND output from all the methods, so it is not in and of itself a test of ND. Of the nine tested methods, arguably only a handful were “top” methods. I do think it’s sensible to consider “top” to be the best methods for each category (although Correlation is so poor I would discard it altogether). That leaves four top methods. So instead of creating the illusion that network deconvolution improves 9/10 top-scoring methods, what Feizi et al. should have reported is that 3 out of 4 of the top methods that were tested were improved by network deconvolution.

That is the result of running network deconvolution with the default parameters. I was curious what happens when using the parameters that Feizi et al. applied to the protein interaction data (alpha=1, beta=0.99). Fortunately, because they have made the code usable, I was able to test this. The overall result as well as the scores on the individual datasets are shown below:

The Feizi et al. results on gene regulatory networks using parameters different from the default.

The results are very different. Despite the claims of Feizi et al. that network deconvolution is robust to the choice of parameters, now only 1 out of 4 of the top methods is improved by network deconvolution. Strikingly, the top three methods tested have their quality degraded. In fact, the top method in two out of the three datasets tested is made worse by network deconvolution. Network deconvolution is certainly not robust to parameter choice.

What was surprising to me was the improved performance of network deconvolution on the S. cerevisiae dataset, especially for the mutual information and correlation methods. In fact, the improvement of network deconvolution over the methods appears extraordinary. At this point I started to wonder what the improvements really mean, i.e. what is the “score” that is being measured. The y-axis, called the “score” by Feizi et al. and Marbach et al., seemed to be changing drastically between runs. I wondered… what exactly is the score? What do the improvements mean? It turns out that “score” is defined as follows:
$\text{score} = -\tfrac{1}{2}\left(\log_{10} p_{\text{AUROC}} + \log_{10} p_{\text{AUPR}}\right)$
This formula requires some untangling. First of all, AUROC is shorthand for area under the ROC (receiver operating characteristic) curve, and AUPR for area under the PR (precision-recall) curve. For context, ROC is a standard concept in engineering and statistics. Precision and recall are used frequently, but the PR curve is used much less than the ROC. Both are measures for judging the quality of a binary classifier. In the DREAM5 setting, this means the following: there is a gold standard of “positives”, namely a set of edges in a network that should be predicted by a method, and the remainder of the edges are considered “negatives”, i.e. they should not be predicted. A method generates a list of edges, sorted (ranked) in some way. As one proceeds through the list, one can measure the fraction of positives and false positives predicted. The ROC and PR curves measure this performance.

A ROC is simply a plot showing the true positive rate for a method as a function of the false positive rate. Suppose that there are $m$ positives in the gold standard out of a total of $n$ edges. If one examines the top $k$ predictions of a method, then among them there will be $t$ “true” positives as well as $k-t$ “false” positives. This will result in a single point on the ROC, namely the point $\left(\frac{k-t}{n-m}, \frac{t}{m}\right)$. This can be confusing at first glance for a number of reasons. First, the points do not necessarily form a function, e.g. there can be points with the same x-coordinate. Second, as one varies $k$ one obtains a set of points, not a curve. The ROC is a curve, and is obtained by taking the envelope of all of the points for $0 \leq k \leq n$. The following intuition is helpful in understanding ROC (a short code sketch of these computations follows the list below):
- The x coordinate in the ROC is the false positive rate. If one doesn’t make any predictions of edges at all, then the false positive rate is 0 (in the notation above k=0, t=0). On the other hand, if all edges are considered to be “true”, then the false positive rate is 1 and the corresponding point on the ROC is (1,1), which corresponds to k=n, t=m.
- If a method has no predictive power, i.e. the ranking of the edges tells you nothing about which edges really are true, then the ROC is the line $y=x$. This is because lack of predictive power means that truncating the list at any $k$ results in the same proportion of true positives above and below the $k$th edge. A simple calculation shows that this corresponds to the point $\left(\frac{k}{n}, \frac{k}{n}\right)$ on the ROC curve.
- ROC curves can be summarized by a single number that has meaning: the area under the ROC (AUROC). The observation above means that a method without any predictive power will have an AUROC of 1/2. Similarly, a “perfect” method, where the true edges are all ranked at the top, will have an AUROC of 1. AUROC is widely used to summarize the content of a ROC curve because it has an intuitive meaning: the AUROC is the probability that if a positive and a negative edge are each picked at random from the list of edges, the positive will rank higher than the negative.
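Here is the promised sketch, in Python, of how the ROC points and AUROC can be computed from a ranked edge list, following the definitions above. The helper names roc_points and auroc are hypothetical, not part of the DREAM5 evaluation scripts or the network deconvolution code.

# Minimal sketch (assumed helpers, not the DREAM5 evaluation scripts): ROC points
# and AUROC for a ranked edge list against a gold standard set of positive edges.

def roc_points(ranked_edges, positives):
    """Return the ROC points ((k-t)/(n-m), t/m) for k = 0, ..., n."""
    n, m = len(ranked_edges), len(positives)
    points, t = [(0.0, 0.0)], 0
    for k, edge in enumerate(ranked_edges, start=1):
        if edge in positives:
            t += 1
        points.append(((k - t) / (n - m), t / m))  # (false positive rate, true positive rate)
    return points

def auroc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# A method with no predictive power traces the diagonal y = x and has AUROC ~ 0.5;
# a perfect ranking (all positives first) has AUROC = 1.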
An alternative to ROC is the precision-recall curve. Precision, in the notation above, is the value $\frac{t}{k}$, i.e. the number of true positives divided by the number of true positives plus false positives. Recall is the same as sensitivity, or the true positive rate: it is $\frac{t}{m}$. In other words, the PR curve contains the points $\left(\frac{t}{m}, \frac{t}{k}\right)$, as recall is usually plotted on the x-axis. The area under the precision-recall curve (AUPR) has an intuitive meaning just like AUROC. It is the average of precision across all recall values, or alternatively, the probability that if a “positive” edge is selected from the ranked list of the method, then an edge above it on the list will be “positive”. Neither precision-recall curves nor AUPR are widely used. There is one problem with AUPR, which is that its value is dependent on the number of positive examples in the dataset. For this reason, it doesn’t make sense to average AUPR across datasets (while it does make sense for AUROC). For all of these reasons, I’m slightly uncomfortable with AUPR, but that is not the main issue in the DREAM5 analysis. A code sketch of the precision-recall computation, in the same style as the ROC sketch above, appears after the figure below.

I have included an example of ROC and PR curves below. I generated them for the method “GENIE3” tested by Feizi et al. This was the method with the best overall score. The figure below is for the S. cerevisiae dataset:
The ROC and PR curves before (top) and after (bottom) applying network deconvolution to the GENIE3 network.
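As mentioned above, here is the companion sketch for the precision-recall curve and AUPR, in the same style as the ROC sketch; again, the helper names (pr_points, aupr) are hypothetical and the trapezoid approximation is one of several conventions.

# Companion sketch (same caveats as the ROC sketch): precision t/k and recall t/m
# at each cutoff k, with AUPR approximated by the trapezoid rule.

def pr_points(ranked_edges, positives):
    """Return the PR points (recall, precision) = (t/m, t/k) for k = 1, ..., n."""
    m = len(positives)
    points, t = [], 0
    for k, edge in enumerate(ranked_edges, start=1):
        if edge in positives:
            t += 1
        points.append((t / m, t / k))
    return points

def aupr(points):
    """Trapezoidal area under the precision-recall curve."""
    area, prev_r, prev_p = 0.0, 0.0, 1.0  # conventionally start at recall 0, precision 1
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area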
The red curve in the ROC plots is what one would see for a method without any predictive power (point #2 above). In this case, what the plot shows is that GENIE3 is effectively ranking the edges of the network randomly. The PR curve shows that at all recall levels there is very little precision. The difference between GENIE3 before and after network deconvolution is so small that it is indistinguishable in the plots. I had to create separate plots before and after network deconvolution because the curves literally overlapped and were not visible together. The conclusion from plots such as these should not be that there is statistical significance (in the difference between methods with/without network deconvolution, or in comparison to random), but rather that there is a negligible effect.

There is a final ingredient that is needed to constitute “score”. Instead of just averaging AUROC and AUPR, both are first converted into p-values that measure the statistical significance of the method being different from random. The way this was done was to create random networks from the edges of the 35 methods, and then to measure their quality (by AUROC or AUPR) to obtain a distribution. The p-value for a given method was then taken to be the area under the probability density function to the right of the method’s value. The graph below shows the pdf for AUROC from the S. cerevisiae DREAM5 data that was used by Feizi et al. to generate the scores:
Distribution of AUROC for random methods generated from the S. cerevisiae submissions in Marbach et al.
In other words, almost all random methods had an AUROC of around 0.51, so any slight deviation from that was magnified in the computation of the p-value, and then by taking the (negative) logarithm of that number a very high “score” was produced. The scores were then taken to be the average of the AUROC and AUPR scores (a rough sketch of this computation appears after the plots below). I can understand why Feizi et al. might be curious whether the difference between a method’s performance (before and after network deconvolution) is significantly different from random, but to replace magnitude of effect with statistical significance in this setting, with such small effect sizes, is to completely mask the fact that the methods are hardly distinguishable from random in the first place.

To make concrete the implication of reporting the statistical significance instead of effect size, I examined the “significant” improvement of network deconvolution on the S. cerevisiae and other datasets when run with the protein parameters rather than the default (second figure above). Below I show the AUROC and AUPR plots for the dataset.
The Feizi et al. results before and after network deconvolution using alpha=1, beta=0.99 (shown with AUROC).
The Feizi et al. results before and after network deconvolution using alpha=1, beta=0.99 (shown with AUPR).
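For concreteness, the following is a rough sketch of my reading of how such a “score” can be computed: random rankings of the same edges give a null distribution of AUROCs (or AUPRs), a method’s value is converted to an empirical p-value, and the score averages the two −log10 p-values. It reuses the roc_points and auroc helpers sketched above; all names are hypothetical and this is not the DREAM5 evaluation code.

# Rough sketch of the "score" computation as described in the text above
# (my reading, not the DREAM5 code). Reuses roc_points/auroc from the earlier sketch.

import math
import random

def null_aurocs(ranked_edges, positives, n_random=1000):
    """Null distribution of AUROC values from random rankings of the same edges."""
    nulls = []
    for _ in range(n_random):
        shuffled = list(ranked_edges)
        random.shuffle(shuffled)                       # a "method" with no predictive power
        nulls.append(auroc(roc_points(shuffled, positives)))
    return nulls

def empirical_pvalue(value, null_values):
    """Fraction of the null distribution at or above the observed value."""
    hits = sum(1 for v in null_values if v >= value)
    return max(hits, 1) / len(null_values)             # avoid p = 0

def score(p_auroc, p_aupr):
    """Average of the -log10 p-values for AUROC and AUPR."""
    return 0.5 * (-math.log10(p_auroc) - math.log10(p_aupr))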
My conclusion was that the use of “score” was basically a red herring. What looked like major differences between methods disappear into tiny effects in the experimental datasets, and even the in silico variations are greatly diminished. Differences in AUROC of one part in 1,000 hardly seem a reasonable basis for concluding that network deconvolution works. Biologically, both results say the same thing: the methods cannot reliably predict edges in the network.

With usable network deconvolution code at hand, I was curious about one final question. The main result of the DREAM5 paper
- D. Marbach et al., Wisdom of Crowds for Robust Gene Network Inference, Nature Methods 9 (2012), 796–804.
was that the community method was best. So I wondered whether network deconvolution would improve it. In particular, the community result shown in Feizi et al. was not a test of network deconvolution; it was simply a construction of the community from the 9 methods tested (two communities were constructed, one before and one after network deconvolution).

To perform the test, I went to examine the DREAM5 data, available as supplementary material with the paper. I was extremely impressed with its reproducibility. The participant submissions are all available, together with scripts that can be used to quickly obtain the results of the paper. However, the data is not very usable. For example, what is provided is the top 100,000 edges that each method produced. But if one wants to use the full prediction of a method, it is not available. The implication of this in the context of network deconvolution is that it is not possible to test network deconvolution on the DREAM5 data without thresholding. Furthermore, in order to evaluate edges, the absolute value was applied to all the edge weights. Again, this makes the data much less useful for further experiments one may wish to conduct. In other words, DREAM5 is reproducible but not very usable.

But since Feizi et al. suggest that network deconvolution can literally be run on anything with “indirect effect”, I decided to give it a spin. I did have to threshold the input (although fortunately, Feizi et al. have assured us that this is a fine way to run network deconvolution), so the experiment is entirely reasonable in terms of their paper. The figure is below (produced with the default network deconvolution parameters), but before looking at it, please accept my apology for making it. I really think it’s the most incoherent, illogical, meaningless and misleading figure I’ve ever made. But it does abide by the spirit of network deconvolution:
The DREAM5 results before and after network deconvolution.
Alas, network deconvolution decreases the quality of the best method, namely the community method. The wise crowds have been dumbed down. In fact, 19/36 methods become worse, 4 stay the same, and only 13 improve. Moreover, network deconvolution decreases the quality of the top method in each dataset. The only methods with consistent improvements when network deconvolution is applied are the mutual information and correlation methods, poor performers overall, which Feizi et al. ended up focusing on.

I will acknowledge that one complaint (of the many possible) about my plot is that the overall results are dominated by the in silico dataset. True, and I’ve tried to emphasize that by setting the y-axis to be the same in each dataset (unlike Feizi et al.). But I think it’s clear that no matter how the datasets are combined into an overall score, the result is that network deconvolution is not consistently improving methods.

All of the analyses I’ve done were made possible thanks to the improved usability of network deconvolution. It is unfortunate that the result of the analyses is that network deconvolution should not be used. Still, I think this example makes a good case for the fact that reproducibility is essential, but usability is more important.
20 comments
March 18, 2014 at 4:07 pm
Jonathan Eisen
Reblogged this on Jonathan Eisen's Lab.
March 18, 2014 at 6:03 pm
homolog.us
Jonathan, it says –
“Apologies, but the page you requested could not be found. Perhaps searching will help.”
March 19, 2014 at 4:04 am
Florian Leitner
> There is one problem with AUPR, which is that its value is dependent on the number of positive examples in the dataset. For this reason, it doesn’t make sense to average AUPR across datasets (while it does make sense for AUROC). For all of these reasons, I’m slightly uncomfortable with AUPR but that is not the main issue in the DREAM5 analysis.
This is not quite true: AUPR can be compared between datasets, but only if the ratio of true:false examples is equal. Admittedly, this is a huge limitation, but the stated generalization [that cross-dataset PR results are incomparable] can probably be seen either as generally true (i.e., independent of the measure), as a problem inherent to the different nature of any dataset, or it has to be hedged on the above condition.
Additionally, the ROC has the huge shortcoming that it performs very poorly on highly unbalanced sets – such as the DREAM interaction challenges, where most of all possible interactions are wrong. This issue can be compared to the problem of using Accuracy vs. F_1-Measure when evaluating classification performance, where the former produces very “optimistic” results on imbalanced sets, too, and should not be used in those cases (instead, MCC would be the right choice – see below). This has been shown by Davis & Goadrich, The Relationship Between Precision-Recall and ROC Curves, ICML 2006, where they demonstrated that a curve that dominates in PR space is guaranteed to dominate in ROC space, but not vice versa. The only fair, uniquely comparable score I know of – but one that does not take ranking into account – is MCC (the Matthews correlation coefficient), used, e.g., in the CASP challenges. All this implies that using AUROC is not a very good choice for making comparisons between DREAM challenge results/teams, and it explains why the DREAM organizers report AUPR (and not AUROC) [in Marbach’s Nature Methods paper], while the choice of [Feizi et al.’s Nature Biotechnology paper] to use AUROC seems questionable to me, particularly given the above proof by Davis & Goadrich.
However, overall, and in defense of ND, I think the focus of this new method should not have been to try to demonstrate that ND is superior to existing methods. Rather, its true beauty is the fact that it can be implemented as a computationally inexpensive operation, and presumably much less expensive than any existing method. Although, by now, I am not sure what the computational cost of calculating the initial normalization needed for the Taylor series constraint is, so it would be interesting to learn if there are any [exponential?] computational costs in the initial normalization that have been looked into in this blog series. So unless there is an issue, in this sense at least, their approach might indeed fulfill “usability” when compared to existing methods.
March 19, 2014 at 4:12 am
Lior Pachter
Regarding ND, I think you misunderstand the idea. It requires first running another method, and then applying ND to the output of that method. Therefore by definition, it cannot be more efficient than any existing method.
The point about AUPR, that cross-dataset comparison requires the true:false ratio be the same, is exactly what I claimed. So I think we agree on that. I also agree that ROC and PR are not independent, which is why I can’t really provide an explanation of why it makes sense to average them, let alone average the p-values computed as in DREAM5 or by Feizi et al. In any case, the Feizi et al. results are terrible, in absolute terms, regardless of whether one looks at AUROC or AUPR (I show both in my post).
A final point of correction: Marbach et al. do report AUPR but the ranking of methods is based on the weird averaging I just mentioned and that is explained in my post. So it is not true that they are using AUPR.
March 19, 2014 at 8:47 am
Florian Leitner
By computationally less expensive, I was referring to Big O notation; so it does not matter if you have to run one method and then another. The only question is which of those methods behaves linearly [O(n)], log-linearly [O(n log n)], polynomially [O(n^c)], or exponentially [O(c^n)]. From a computational standpoint, chaining two linear operations is much cheaper (for n > 2, at least) than, say, even just one quadratic operation, and for a reasonably sized n, this holds even for chaining two approaches with log-linear behavior. For example, a cubic operation is implied when you have a matrix factorization like N N^-1 in the method (and can’t work around it).
As for Marbach’s paper, I was referring to Figure 2a in the paper. There, the score axis is annotated with “AUPR” and there is no mention of this peculiar “mix”, so I assumed those were plain AUPR values, which I also assumed to be the source of the first figure you show in your blog post. But then, you are right that they have all this fancy stuff with averaging AUPR+AUROC and deducing a p-value in the Supplementary Material, so maybe Figure 2a is just poorly annotated…
March 19, 2014 at 5:58 am
Gustavo Stolovitzky
Hi Florian, just a clarification to your interesting comment. Theorem 3.2 in Davis and Goadrich is an if and only if statement, and it is a result about dominance. However, it is true that if one ROC curve gives a higher AUROC than a second ROC curve does, it does not necessarily mean that the same order will be maintained in PR space. This is because for a curve to be better, it does not have to dominate. I guess something similar occurs in life…
March 19, 2014 at 9:04 am
Florian Leitner
Gustavo, yes, you are absolutely right! Sorry for my poor writing, I’ve probably confused a few readers by only mentioning a single “curve” rather than referring to the ordering of two different results that only holds when going from PR space to ROC space, but not vice versa. Thank you for pointing out this clarification.
March 19, 2014 at 8:18 am
Manolis Kellis
We provided a point-by-point response to Dr. Pachter’s previous two blog posts about our paper here: http://compbio.mit.edu/nd/Response_to_Nonsense_Blog_Post.pdf and we provide a point-by-point response to the third one here: http://compbio.mit.edu/nd/Response_to_Usability_Blog_Post.pdf
March 19, 2014 at 11:17 am
Curious observer
Lior, I am curious how you are setting the threshold on usability as defined in this and previous posts. That is, when do code or results become usable or not usable? When reading through your posts, the definition of usability appears to mean that you (or anyone else) are able to take code or results and apply them to some other purpose, but it seems to be limited to continued development of computational methods. I think usability of code is a bit easier to define than usability of results. What stood out to me was the calling of the predictions from the DREAM5 paper not usable. From a biological perspective, I would say these results are very usable and directly actionable, and this was shown in the paper through follow-up validation experiments in E. coli. I would have a hard time seeing any molecular biologist willing to invest time in testing the 100,001st, 100,002nd, and 100,003rd ranked predictions in the list. I think the distinction between repeatable and usable is quite important, but I also think the definition of usable can be different for different researchers and would be interested in your thoughts. Maybe another post on this topic??
March 19, 2014 at 2:57 pm
Lior Pachter
That is a good question. I was indeed thinking of computational methods development, but I think usability also applies to biological discovery. In the case of DREAM5, while it’s true that the 100,001st prediction may not be of interest to test, availability of the methods’ code would be very helpful. For example, it would allow a biologist to use a method on a different dataset (DREAM5 did not distribute this). Moreover, the full matrices are useful as well. In choosing a method for themselves, a biologist may wish to apply an evaluation criterion that requires all the entries, and not only the top ones. So I think that distribution of the complete predictions and the code of the methods would have indeed constituted usability in this case. I would add that I’m not sure the results were demonstrated to be “usable” by virtue of the experimental validation. The E. coli test did show good results, but it’s unclear to me on what basis the transcription factors were chosen.
March 19, 2014 at 1:05 pm
DOpey
This “score” seems like a rather odd metric to use.
Another drawback of precision is the question of what one does when no predictions have been made, i.e. when true positives + false positives = 0.
Presumably this situation can be avoided by merely starting at the highest-scoring prediction, but in cases where methods output scores that can be interpreted as probabilities (in the range 0 to 1), one may want to ask “at a particular threshold, how does method A compare to method B?”.
These division-by-zero situations are not ones one encounters when using sensitivity and specificity, although I agree with the comments regarding AUC on unbalanced datasets.
March 19, 2014 at 3:09 pm
Anon
Building on Curious observer’s remarks, I wonder what would constitute a “usable” contribution for a mostly “wet” paper? An incredibly detailed protocol? One worry about setting that as the threshold would be that protocols are often heavily optimized towards a particular application, meaning that they wouldn’t fit the bill for new or novel applications. Any thoughts?
March 19, 2014 at 8:10 pm
David Quigley
One meaning for usability in the context of biological results is a physical reagent produced during the investigation and shared with the community. Transgenic mice and cell lines are often used by many investigators for experiments quite unrelated to their original purpose. I had never thought that a knockout mouse and an R library had much in common, but from this point of view they might.
March 20, 2014 at 8:29 pm
ML
Just landed here … I think that M. Kellis and his students should, perhaps, consider taking legal actions against this bully (Lior Pachter) and his mouthpiece (Bray with his continual echoes), who, in the frankly false pretense of defending the integrity of science, continue to defame their targets and issue ad hominem attacks. If indeed the purpose is to conduct post peer review effectively, this bully has done more damage to the process than he can perhaps appreciate (driven apparently by the rejections they have received from the various journals), as his innuendos frankly distract from the critical work of understanding the real issues. Tenured academics who get away with this type of nonsense, in the name of scholarship, end up giving scholarship a bad name, IMO. How one communicates one’s claims has everything to do with one’s scholarship and cannot be separated from it. (Why not pursue publishing in a peer reviewed venue? And why not try other journals, like the rest of us who may be sent away by these high-impact journals?) Finally, this reflects more on the bully and his mouthpiece than on their targets.
March 21, 2014 at 9:09 am
Truth in advertising
From this (like the ones before, very interesting) post, I conclude the following:
A) The very serious accusation of fraud is off the table. This, I think, should be a relief for the whole community.
B) While the metric used in the DREAM5 case may not be the best, one cannot fault Feizi et al. for using the same metric as was used in DREAM5.
However,
C) ND is not robust to parameter choice (there may be some heuristics how to set \alpha or \beta, but it’s not clear they would be general).
D) ND doesn’t really perform any better than methods designed for a particular problem (that was already clear in the protein structure case, and it’s cleared up more for the DREAM5 case here).
March 21, 2014 at 4:49 pm
Nicolas Bray
“The very serious accusation of fraud is off the table. This, I think, should be a relief for the whole community.”
To make this statement again for the third time:
“In academia the word fraudulent is usually reserved for outright forgery. However given what appears to be deliberate hiding, twisting and torturing of the facts by Feizi et al., we think that fraud (‘deception deliberately practiced in order to secure unfair or unlawful gain’) is a reasonable characterization. If the paper had been presented in a straightforward manner, would it have been accepted by Nature Biotechnology?”
Just a brief reminder of (just some of) what we’re talking about here: they took the name “transitive closure” – truly a foundational operation on networks – and applied it to an operation different from anything that term has ever referred to. They then claimed to be inverting this new “transitive closure” when in fact they weren’t. Subsequently, they went around crowing about how they had thereby introduced a new foundational operation on networks.
If computational biology is a place where it’s acceptable behaviour to deceive other people about the significance and meaning of your work, then, sure, people should take relief in the fact that Soheil Feizi, Daniel Marbach, Muriel Médard, and Manolis Kellis have (possibly) lived up to the depressingly low standards of our field.
Perhaps other computational biologists look at this situation and think, to quote your earlier comment, “Thus, they managed to ‘get a major paper’, rather than publishing this in a technical journal (where it probably belongs). Good for them.” But the fact that no one who is not directly associated with the authors has stepped forward (publicly or privately) to defend this paper at least makes me hope that few actually think that way.
Whatever the case, should we not expect computational biology to be more than a game in which people try to figure out how best to deceive reviewers and readers in a quest for “major papers” and grant money?
March 22, 2014 at 4:32 pm
Truth in advertising
The statement in the first post that had me (and many others) the most worried was “Despite our best attempts … we have been unable to replicate the results of the paper”. You guys being highly competent people, the most parsimonious explanation for this appeared to be that the authors may have resorted to some sort of actual fraud (outright forgery by your definition). This would have been very shocking indeed!
So, I and the whole community are relieved that this is off the table.
March 21, 2014 at 8:10 pm
Andy Jerkins
Hey Lior & Nicolas,
I’m curious as to whether or not the current version of the code on the Supplementary website matches more closely with the description of the method in either the main text or the Supplementary methods? I unfortunately did not download the earlier version of their script.
March 22, 2014 at 8:28 am
Lior Pachter
The code has never resembled anything like what is described in the main text. If it had, it would have been a program with a single line of code, namely the MATLAB command
>> D=G*inv(eye(size(G))-G);
Everything else described below has to do, at best, with matching things in the (ever changing) supplement. The history of the code is as follows:
July 2013: The original version distributed with the paper consisted of a single program, ND.m. It contained two parameters, called “delta” and “alpha”. Delta corresponded to what was “Beta” in the supplement. “Alpha” was never mentioned in the supplement (there is an alpha in the supplement but it refers to something other than this parameter). Beta was also not mentioned in the main text, except, as I have explained in the original blog post on ND, in passing in a single sentence where it was made to seem irrelevant. The default for delta (beta) was 1, which was a value inconsistent with the premise of the paper, namely that beta should always be strictly less than 1. The default value for alpha was 1. Although not described in the supplement, it turned out that the parameter values used for the three datasets were (delta,alpha) = (0.5,0.1), (0.99,1) and (0.95,1). The code could be used for the protein and coauthorship networks, but did not match the required steps for the DREAM analysis, and in fact we were unable to get it to work despite repeated attempts and a number of requests for help to the authors. Finally, an initial affine transformation of the input matrix mapping it to values between 0 and 1 was implemented in the code but not described anywhere in the supplement; neither was there in the supplement a mention of the fact that diagonal entries were set to zero (which was in the code). In other words, in terms of the original post I made on the paper where I described 7 steps in the code, only steps 4,5,6 were described in the supplement.
August 2013: After we contacted the authors asking why delta (beta) was set to 1 when that made no sense in terms of the paper, they released a new version of the code on the NBT website with delta (beta) now set to 0.9. This is still the version that one gets from the NBT supplementary website. It was one of the reasons we became suspicious of the paper, because in the supplement it had said that beta should be set as close as possible to 1, yet now the fix for the incorrect beta=1 was to set it to 0.9.
February 2014: After my blog post came out, the authors created a new companion website for the paper where they posted a new version of the code. The code is now in the form of two different programs: “ND code for symmetric networks” and “ND code for regulatory networks”. The latter is the code to be used for the DREAM analysis, and the former for the protein and coauthorship network analysis. The symmetric ND.m has default values alpha = 1 (matching what is now disclosed in the updated supplement as the parameter for the protein and coauthorship networks) and beta = 0.99 (matching the parameter used for the protein network, but not the coauthorship network). Beta is now called by its correct name, but it’s still the case that alpha is not mentioned in the supplement (nor does it correspond to the alpha that is mentioned there). I should note that the authors have been fiddling with the software on the website without including release notes, including changing the default beta once or twice, but I have not been tracking the code day by day (as an aside, they have been insisting over and over that the method is robust to the choice of parameters, yet they have been changing the parameters constantly, and used different ones for the different datasets – why?). In addition, the affine mapping step is commented out of the code, i.e. it has been removed, but instead has been inserted into the scripts distributed for replicating results. I guess this is so the authors can call it a “pre-processing step”. It is still not described in (even the updated) supplement.
The regulatory ND.m has default beta=0.5 and alpha=0.1, matching the parameters used for the DREAM analysis. This program also contains some new code for symmetrizing the input matrix and perturbing it if the result is not diagonalizable. These turn out to be required steps not mentioned in the original supplement. It should also be noted that in addition to the code changes, the authors released the complete data required to run ND for the first time in February 2014.
I have to add a final comment, namely that the software is a very simple piece of code, because even though it’s not one line, the additional steps are straightforward. Leaving aside comments in the code, the whole thing is about 80 lines. In fact, Nicolas Bray re-implemented it in R back in August in a few minutes (the whole thing is, annoyingly, in MATLAB, which requires a license). But my point is that I have never, in my entire career, seen anything like the sorts of things above happening, especially with such a simple piece of code. Splitting of the software into two pieces… constant changes in the default parameters… missing pieces added after the fact… removal of parts… the whole thing is highly irregular.
March 6, 2015 at 8:24 am
george
I am new to the field but old in life (being 54 and all), so to me it sounds like the money is too much! Really, science today has gone wrong (and I know this is generalising, but if it is true, how else can one describe it?). The investors are the main target, not the science. As for the paper in question, I can only say from my experience that there are lots of such scientists (pretenders, really) around who got through the system somehow and obtained a title as a means to achieve riches and fame.
And unfortunately, there are countless examples of these “scientists” backing all kinds of claims that lead to some smart guy making profits without even having to leave home. Not to mention all those who advise and create laws based on their own false science (remember the IMF paper that was the cause for implementing artificial poverty in Greece and Europe? Peer review was nowhere to be seen, but still these guys advised on global policies…). So big up to you for uncovering a few of these… guys.