You are currently browsing the tag archive for the ‘network deconvolution’ tag.

Three years ago Nicolas Bray and I published a post-publication review of the paper “Network link prediction by global silencing of indirect correlations” (Barzel and Barabási, Nature Biotechnology, 2013). Despite our less than positive review of the work, the paper has gone on to garner 95 citations since its publication (source: Google Scholar). In fact, in just this past year the paper has  paper has been cited 44 times, an impressive feat with the result that the paper has become the first author’s most cited work.

Ultimate impact

In another Barabási paper (with Wang and Song) titled Quantifying Long-Term Scientific Impact  (Science, 2013), the authors provide a formula for estimating the total number of citations a paper will acquire during its lifetime. The estimate is

$c^{\infty} = m(e^{\lambda_i-1}),$

where $m$ and $\lambda_i$ are parameters learned from a few years of citation data. The authors call $c^{\infty}$ the ultimate impact because, they explain, “the total number of citations a paper will ever acquire [is equivalent to] the discovery’s ultimate impact”. With 95 citations in 3 years, the Barzel-Barabási “discovery” is therefore on track for significant “ultimate impact” (I leave it as an exercise for the reader to calculate the estimate for $c^{\infty}$ from the citation data). The ultimate impactful destiny of the paper is perhaps no surprise…Barzel and Barabási knew as much when writing it, describing its implication for systems biology as “Overall this silencing method will help translate the abundant correlation data into insights about the system’s interactions” and stating in a companion press release that After silencing, what you are left with is the pre­cise wiring dia­gram of the system… In a sense we get a peek into the black box.”

Drive by citations

Now that three years had passed since the publication of the press release and with the ultimate impact revealed, I was curious to see inside the 95 black boxes opened with global silencing, and to examine the 95 wiring diagrams that were thus precisely figured out.

So I delved into the citation list and examined, paper-by-paper, to what end the global silencing method had been used. Strikingly, I did not find, as I expected, precise wiring diagrams, or even black boxes. A typical example of what I did find is illustrated in the paper Global and portioned reconstructions of undirected complex networks by Xu et al. (European Journal Of Physics B, 2016) where the authors mention the Barzel-Barabási paper only once, in the following sentence of the introduction:

“To address this inverse problem, many methods have been proposed and they usually show robust and high performance with appropriate observations [9,10,11, 12,13,14,15,16,17,18,19,20,21].”

(Barzel-Barabási is reference [16]).

Andrew Perrin has coined the term drive by citations for “references to a work that make a very quick appearance, extract a very small, specific point from the work, and move on without really considering the existence or depth of connection [to] the cited work.” While its tempting to characterize the Xu et al. reference of Barzel-Barabási as a drive by citation the term seems overly generous, as Xu et al. have literally extracted nothing from Barzel-Barabási at all. It turns out that almost all of the 95 citations of Barzel-Barabási are of this type. Or not even that. In some cases I found no logical connection at all to the paper. Consider, for example, the Ph.D. thesis Dysbiosis in Inflammatory Bowel Disease, where Barzel-Barabási, as well as the Feizi et al. paper which Nicolas Bray and I also reviewed, are cited as follows:

The Ribosomal Database Project (RDP) is a web resource of curated reference sequences of bacterial, archeal, and fungal rRNAs. This service also facilitates the data analysis by providing the tools to build rRNA-derived phylogenetic trees, as well as aligned and annotated rRNA sequences (Barzel and Barabasi 2013; Feizi, Marbach et al. 2013).

(Neither papers has anything to do with building rRNA-derived phylogenetic trees or aligning rRNA sequences).

While this was probably an accidental error, some of the drive by citations were more sinister. For example, WenJun Zhang is an author who has cited Barzel-Barabási as

We may use an incomplete network to predict missing interactions (links) (Clauset et al., 2008; Guimera and Sales-Pardo, 2009; Barzel and Barabási, 2013; Lü et al., 2015; Zhang, 2015d, 2016a, 2016d; Zhang and Li, 2015).

in exactly the same way in three papers titled Network Informatics: A new science, Network pharmacology: A further description and Network toxicology: a new science. In fact this author has cited the work in exactly the same way in several other papers which appear to be copies of each other for a total of 7 citations all of which are placed in dubious “papers”. I suppose one may call this sort of thing hit and run citation.

I also found among the 95 citations one paper strongly criticizing the Barzel-Barabási paper in a letter to Nature Biotechnology (the title is Silence on the relevant literature and errors in implementation) , as well as the (to me unintelligible) response by the authors.

In any case, after carefully examining each of the 95 references citing Barzel and Barabási I was able to find only one paper that actually applied global silencing to biological data, and two others that benchmarked it. There are other ways a paper could impact derivative work, for example by virtue of the models or mathematics developed, be of use, but I could not find any other instance where Barzel and Barabási’s work was used meaningfully other than the three citations just mentioned.

When a citation is a citation

As mentioned, two papers have benchmarked global silencing (and also network deconvolution, from Feizi et al.). One was a paper by Nie et al. on Minimum Partial Correlation: An Accurate and Parameter-Free Measure of Functional Connectivity in fMRI. Table 1 from the paper shows the results of global silencing, network deconvolution and other methods on a series of simulations using the measure of c-sensitivity for accuracy:

Table 1 from Nie et al. showing performance of methods for “network cleanup”.

EPC is the “Elastic PC-algorithm” developed by the authors, which they argue is the best method. Interestingly, however, global silencing (GS) is equal to or worse than simply choosing the top entries from the partial correlation matrix (FP) in 19/28 cases- that’s 67% of the time! This is consistent with the results we published in Bray & Pachter 2013. In these simulations network deconvolution performs better than partial correlation, but still only 2/3 of the time. However in another benchmark of global silencing and network deconvolution published by Izadi et al. 2016 (A comparative analytical assay of gene regulatory networks inferred using microarray and RNA-seq datasets) network deconvolution underperformed global silencing. Also network deconvolution was examined in the paper Graph reconstruction Using covariance-based methods by Sulaimanov and Koeppl 2016 who show, with specific examples, that the scaling parameter we criticized in Bray & Pachter 2013 is indeed problematic:

The scaling parameter α is introduced in [Feizi et al. 2013] to improve network deconvolution. However, we show with simple examples that particular choices for α can lead to unwanted elimination of direct edges.

It’s therefore difficult to decide which is worse, network deconvolution or global silencing, however in either case it’s fair to consider the two papers that actually tested global silencing as legitimately citing the paper the method was described in.

The single paper I found that used global silencing to analyze a biological network for biological purposes is A Transcriptional and Metabolic Framework for Secondary Wall Formation in Arabidopsis by Li et al. in Plant Physiology, 2016. In fact the paper combined the results of network deconvolution and global silencing as follows:

First, for the given data set, we calculated the Pearson correlation coefficients matrix Sg×g. Given g1 regulators and g2 nonregulators, with g = g1+g2, the correlation matrix can be modified as

where O denotes the zero matrix, to include biological roles (TF and non-TF genes). We extracted the regulatory genes (TFs) from different databases, such as AGRIS (Palaniswamy et al., 2006), PlnTFDB (Pérez-Rodríguez et al., 2010), and DATF (Guo et al., 2005). We then applied the network deconvolution and global silencing methods to the modified correlation matrix S′. However, global silencing depends on finding the inverse of the correlation matrix that is rank deficient in the case p » n, where p is the number of genes and n is the number of features, as with the data analyzed here. Since finding an inverse for a rank-deficient matrix is an ill-posed problem, we resolved it by adding a noise term that renders the matrix positive definite. We then selected the best result, with respect to a match with experimentally verified regulatory interactions, from 10 runs of the procedure as a final outcome. The resulting distribution of weighted matrices for the regulatory interactions obtained by each method was decomposed into the mixture of two Gaussian distributions, and the value at which the two distributions intersect was taken as a cutoff for filtering the resulting interaction weight matrices. The latter was conducted to avoid arbitrary selection of a threshold value and prompted by the bimodality of the regulatory interaction weight matrices resulting from these methods. Finally, the gene regulatory network is attained by taking the shared regulatory interactions between the resulting filtered regulatory interactions obtained by the two approaches. The edges were rescored based on the geometric mean of the scores obtained by the two approaches.

In light of the benchmarks of global silencing and network deconvolution, and in the absence of analysis of the ad hoc method combining their results, it is difficult to believe that this methodology resulted in a meaningful network. However its citation of the relevant papers is certainly legitimate. Still, the results of the paper, which constitute a crude analysis of the resulting networks, are a far cry from revealing the “precise wiring diagram of the system”. The authors acknowledge this writing

From the cluster-based networks, it is clear that a wide variety of ontology terms are associated with each network, and it is difficult to directly associate a distinct process with a certain transcript profile.

The factor of use correction

The analysis of the Barzel and Barabási citations suggests that, because a citation is not always a citation (thanks to Nicolas Bray for suggesting the title for the post), to reflect the ultimate impact of a paper the quantity $c^{\infty}$ needs to be adjusted. I propose adjustment by the factor

$f^u = \frac{C-d_b}{C},$

where C is the total number of citations of a paper and $d_b$ is the number of drive by citations. The fraction $\frac{d_b}{C}$ is essentially a factor of use correction. It should be possible (and interesting) to develop text analytics algorithms for estimating $d_b$ so as to be able to correct $c^{\infty}$ to  $f^u \cdot c^{\infty}$, and similarly adjusting citations counts, h-indices, impact factors of journals and related metrics. Explicit computation and publication of the factor of use correction for papers would also incentivize authors to reduce or eliminate gratuitous drive by citation practices.

For now I decided to compute the factor of use correction for the Barzel-Barabási paper by generously estimating that $d_b=92$. This yielded $f^u = \frac{3}{95} = 0.0315$. Barabási has an h-index of 117, but applying this factor of use correction to all of his published papers I obtained the result that Barabasi’s factor of use corrected h-index is 30.

Reproducibility has become a major issue in scientific publication, it is under scrutiny by many, making headlines in the news, is on the minds of journals, and there are guidelines for achieving it. Reproducibility is certainly important for scientific research, but I think that lost in the debate is the notion of usability. Reproducibility only ensures that the results of a paper can be recreated by others, but usability ensures that researchers can build on scientific work, and explore hypotheses and ideas unforeseen by authors of the original papers. Here I describe a case study in reproducibility and usability, that emerged from previous post I wrote about the paper

Feizi et al. describe a method called network deconvolution, that they claim improves the inference results for 8 out of 9 network inference methods, out of the 35 that were tested in the DREAM5 challenge. In DREAM5, participants were asked to examine four chip-based gene x expression matrices, and were also provided a list of transcription factors for each. They were then asked to provide ranked lists of transcription factor – gene interactions for each of the four datasets. The four datasets consisted of one computer simulation (the “in silico” dataset) and expression measurements in  E. coli, S. cerevisiae and S. aureus. The consortium received submission from 29 different groups and ran 6 other “off-the-shelf” methods, while also developing its own “community” method (for a total of 36=29+6+1). The community method consisted of applying the Borda count to the 35 methods being tested, to produce a new consensus, or community, network. Nicolas Bray and I tried to replicate the results of Feizi et al. so that we could test for ourselves the performance of network deconvolution with different parameters and on other DREAM5 methods ( Feizi et al. tested only 9 methods; there were 36 in total).  But despite contacting the authors for help we were unable to do so. In desperation, I even offered $100 for someone to replicate all of the figures in the paper. Perhaps as a result of my blogging efforts, or possibly due to a spontaneous change of heart, the authors finally released some of the code and data needed to reproduce some of the figures in their paper. In particular, I am pleased to say that the released material is sufficient to almost replicate Figure 2 of their paper which describes their results on a portion of the DREAM5 data. I say almost because the results for one of the networks is off, but to the authors credit it does appear that the distributed data and code are close to what was used to produce the figure in the paper (note: there is still not enough disclosure to replicate all of the figures of the paper, including the suspicious Figure S4 before and after revision of the supplement, and I am therefore not yet willing to concede the$100). What Feizi et al. did accomplish was to make their methods usable. That is to say, with the distributed code and data I was able to test the method with different parameters and on new datasets. In other words, Feizi et al. is still not completely reproducible, but it is usable. In this post, I’ll demonstrate why usability is important, and make the case that it is too frequently overlooked or confused with reproducibility. With usable network deconvolution code in hand, I was finally able to test some of the claims of Feizi et al.  First, I identify the relationship between the DREAM methods and the methods Feizi et al. applied network deconvolution to. In the figure below, I have reproduced Figure 2 from Feizi et al. together with Figure 2 from Marbach et al.:

Figure 2 from Feizi et al. aligned to Figure 2 from Marbach et al.

The mapping is more complex than appears at first sight. For example, in the case of Spearman correlation (method Corr #2 in Marbach et al., method #5 in Feizi et al.), Feizi et al. ran network deconvolution on the method after taking absolute values. This makes no sense, as throwing away the sign is to throw away a significant amount of information, not to mention it destroys any hope of connecting the approach to the intuition of inferring directed interactions from the observed via the idealized “model” described in the paper. On the other hand, Marbach et al. evaluated Spearman correlation with sign. Without taking the absolute value before evaluation negative edges, strong (negative) interactions, are ignored. This is the reason for the very poor performance of Spearman correlation and the reason for the discrepancy in bar heights between Marbach et al. and Feizi et al. for that method. The caption of Figure 2 in Feizi et al. begins “Network deconvolution applied to the inferred networks of top-scoring methods [1] from DREAM5..” This is obviously not true. Moreover, one network they did not test on was the community network of Marbach et al. which was the best method and the point of the whole paper. However the methods they did test on were ranked 2,3,4,6,8,12,14,16,28 (out of 36 methods). The 10th “community” method of Feizi et al. is actually the result of applying the community approach to the ND output from all the methods, so it is not in and of itself a test of ND. Of the nine tested methods, arguably only a handful were “top” methods. I do think its sensible to consider “top” to be the best methods for each category (although Correlation is so poor I would discard it altogether). That leaves four top methods. So instead of creating the illusion that network deconvolution improves 9/10 top scoring methods, what Feizi et al. should have reported is that 3 out of 4 of the top methods that were tested were improved by network deconvolution. That is the result of running network deconvolution with the default parameters. I was curious what happens when using the parameters that Feizi et al. applied to the protein interaction data (alpha=1, beta=0.99). Fortunately, because they have made the code usable, I was able to test this. The overall result as well as the scores on the individual datasets are shown below: The Feizi et al. results on gene regulatory networks using parameters different from the default. The results are very different. Despite the claims of Feizi et al. that network deconvolution is robust to choice of parameters, now only 1 out of 4 of the top methods are improved by network deconvolution. Strikingly, the top three methods tested have their quality degraded. In fact, the top method in two out of the three datasets tested is made worse by network deconvolution. Network deconvolution is certainly not robust to parameter choice. What was surprising to me was the improved performance of network deconvolution on the S. cerevisae dataset, especially for the mutual information and correlation methods. In fact, the improvement of network deconvolution over the methods is appears extraordinary. At this point I started to wonder about what the improvements really mean, i.e. what is the “score” that is being measured. The y-axis, called the “score” by Feizi et al. and Marbach et al. seemed to be changing drastically between runs. I wondered… what exactly is the score? What do the improvements mean? It turns out that “score” is defined as follows:

$score = \frac{1}{2} ( \frac{1}{3} \sum_{i=1}^3 -log_{10} p_{AUROC,i} + \frac{1}{3} \sum_{i=1}^3 -log_{10} p_{AUPR,i})$.

This formula requires some untangling: First of all, AUROC is shorthand for area under the ROC (receiver operator curve), and AUPR for area under the PR (precision recall) curve. For context, ROC is a standard concept in engineering/statistics. Precision and recall are used frequently, but the PR curve is used much less than ROC . Both are measures for judging the quality of a binary classifier. In the DREAM5 setting, this means the following: there is a gold standard of “positives”, namely a set of edges in a network that should be predicted by a method, and the remainder of the edges will be considered “negatives”, i.e. they should not be predicted. A method generates a list of edges, sorted (ranked) in some way. As one proceeds through the list, one can measure the fraction of positives and false positives predicted. The ROC and PR curves measure the performance. A ROC is simply a plot showing the true positive rate for a method as a function of the false positive rate. Suppose that there are positives in the gold standard out of a goal of edges. If one examines the top k predictions of a method, then among them there will be t “true” positives as well as k-t “false” positives. This will result in a single point on the ROC, i.e. the point ($\frac{k-t}{n-m},\frac{t}{m}$). This can be confusing at first glance for a number of reasons. First, the points do not necessarily form a function, e.g. there can be points with the same x-coordinate. Second, as one varies one obtains a set of points, not a curve. The ROC is a curve, and is obtained by taking the envelope of all of the points for $k \in \{1,\ldots,n\}$. The following intuition is helpful in understanding ROC:

1. The x coordinate in the ROC is the false positive rate. If one doesn’t make any predictions of edges at all, then the false positive rate is 0 (in the notation above k=0, t=0). On the other hand, if all edges are considered to be “true”, then the false positive rate is 1 and the corresponding point on the ROC is (1,1), which corresponds to k=n, t=m.
2. If a method has no predictive power, i.e. the ranking of the edges tells you nothing about which edges really are true, then the ROC is the line y=x. This is because lack of predictive power means that truncating the list at any k, results in the same proportion of true positives above and below the kth edge. And a simple calculation shows that this will correspond to the point ($\frac{k}{n},{k}{n}$) on the ROC curve.
3. ROC curves can be summarized by a single number that has meaning: the area under the ROC (AUROC). The observation above means that a method without any predictive power will have an AUROC of 1/2. Similarly, a “perfect” method, where he true edges are all ranked at the top will have an AUROC of 1. AUROC is widely used to summarize the content of a ROC curve because it has an intuitive meaning: the AUROC is the probability that if a positive and a negative edge are each picked at random from the list of edges, the positive will rank higher than the negative.

An alternative to ROC is the precision-recall curve. Precision, in the mathematics notation above, is the value $\frac{t}{k}$, i.e., the number of true positives divided by the number of true positives plus false positives. Recall is the same as sensitivity, or true positive rate: it is $\frac{t}{m}$. In other words, the PR curve contains the points ($\frac{t}{m},\frac{t}{k}$), as recall is usually plotted on the x-axis. The area under the precision-recall curve (AUPR) has an intuitive meaning just like AUROC. It is the average of precision across all recall values, or alternatively, the probability that if a “positive” edge is selected from the ranked list of the method, then an edge above it on the list will be “positive”. Neither precision-recall curves, nor AUPR are widely used. There is one problem with AUPR, which is that its value is dependent on the number of positive examples in the dataset. For this reason, it doesn’t make sense to average AUPR across datasets (while it does make sense for AUROC). For all of these reasons, I’m slightly uncomfortable with AUPR but that is not the main issue in the DREAM5 analysis. I have included an example of ROC and PR curves below. I generated them for the method “GENIE3” tested by Feizi et al.. This was the method with the best overall score. The figure below is for the S. cerevisiae dataset:

The ROC and a PR curves before (top) and after (bottom) applying network deconvolution to the GENIE3 network.

The red curve in the ROC plots is what one would see for a method without any predictive power (point #2 above). In this case, what the plot shows is that GENIE3 is effectively ranking the edges of the network randomly. The PR curve is showing that at all recall levels there is very little precision. The difference between GENIE3 before and after network deconvolution is so small, that it is indistinguishable in the plots. I had to create separate plots before and after network deconvolution because the curves literally overlapped and were not visible together. The conclusion from plots such as these, should not be that there is statistically significance (in the difference between methods with/without network deconvolution, or in comparison to random), but rather that there is negligible effect. There is a final ingredient that is needed to constitute “score”. Instead of just averaging AUROC and AUPR, both are first converted into p-values that measure the statistical significance of the method being different from random. The way this was done was to create random networks from the edges of the 35 methods, and then to measure their quality (by AUROC or AUPR) to obtain a distribution. The p-value for a given method was then taken to be the area under the probability density function to the right of the method’s value. The graph below shows the pdf for AUROC from the S. cerevisae DREAM5 data that was used by Feizi et al. to generate the scores:

Distribution of AUROC for random methods generated from the S. cerevisiae submissions in Marbach et al.

In other words, almost all random methods had an AUROC of around 0.51, so any slight deviation from that was magnified in the computation of p-value, and then by taking the (negative) logarithm of that number a very high “score” was produced. The scores were then taken to be the average of the AUROC and AUPR scores. I can understand why Feizi et al. might be curious whether the difference between a method’s performance (before and after network deconvolution) is significantly different from random, but to replace magnitude of effect with statistical significance in this setting, with such small effect sizes, is to completely mask the fact that the methods are hardly distinguishable from random in the first place. To make concrete the implication of reporting the statistical significance instead of effect size, I examined the “significant” improvement of network deconvolution on the S. cerevisae and other datasets when run with the protein parameters rather than the default (second figure above). Below I show the AUROC and AUPR plots for the dataset.

The Feizi et al. results before and after network deconvolution using alpha=1, beta=0.99 (shown with AUROC).

The Feizi et al. results before and after network deconvolution using alpha=1, beta=0.99 (shown with AUPR).

My conclusion was that the use of “score” was basically a red herringWhat looked like major differences between methods disappears into tiny effects in the experimental datasets, and even the in silico variations are greatly diminished. The differences in AUROC of one part in 1000 hardly seem reasonable for concluding that network deconvolution works. Biologically, both results are that the methods cannot reliably predict edges in the network. With usable network deconvolution code at hand, I was curious about one final question. The main result of the DREAM5 paper

was that the community method was best. So I wondered whether network deconvolution would improve it. In particular, the community result shown in Feizi et al. was not a test of network deconvolution, it was simply a construction of the community from the 9 methods tested (two communities were constructed, one before and one after network deconvolution). To perform the test, I went to examine the DREAM5 data, available as supplementary material with the paper. I was extremely impressed with reproducibility. The participant submissions are all available, together with scripts that can be used to quickly obtain the results of the paper. However the data is not very usable. For example, what is provided is the top 100,000 edges that each method produced. But if one wants to use the full prediction of a method, it is not available. The implication of this in the context of network deconvolution is that it is not possible to test network deconvolution on the DREAM5 data without thresholding. Furthermore, in order to evaluate edges absolute value was applied to all the edge weights. Again, this makes the data much less useful for further experiments one may wish to conduct. In other words, DREAM5 is reproducible but not very usable. But since Feizi et al. suggest that network deconvolution can literally be run on anything with “indirect effect”, I decided to give it a spin. I did have to threshold the input (although fortunately, Feizi et al. have assured us that this is a fine way to run network deconvolution), so actually the experiment is entirely reasonable in terms of their paper. The figure is below (produced with the default network deconvolution parameters),  but before looking at it, please accept my apology for making it. I really think its the most incoherent, illogical, meaningless and misleading figure I’ve ever made. But it does abide by the spirit of network deconvolution:

The DREAM5 results before and after network deconvolution.

Alas, network deconvolution decreases the quality of the best method, namely the community methodThe wise crowds have been dumbed down. In fact, 19/36 methods become worse, 4 stay the same, and only 13 improve. Moreover, network deconvolution decreases the quality of the top method in each dataset. The only methods with consistent improvements when network deconvolution is applied are the mutual information and correlation methods, poor performers overall, that Feizi et al. ended up focusing on. I will acknowledge that one complaint (of the many possible) about my plot is that the overall results are dominated by the in silico dataset. True- and I’ve tried to emphasize that by setting the y-axis to be the same in each dataset (unlike Feizi et al.) But I think its clear that no matter how the datasets are combined into an overall score, the result is that network deconvolution is not consistently improving methods. All of the analyses I’ve done were made possible thanks to the improved usability of network deconvolution. It is unfortunate that the result of the analyses is that network deconvolution should not be used. Still, I think this examples makes a good case for the fact that reproducibility is essential, but usability is more important.