“Basically, I’m not interested in doing research and I never have been. I’m interested in understanding, which is quite a different thing. And often to understand something you have to work it out yourself because no one else has done it.” – David Blackwell

The importance of understanding is unfortunately frequently downplayed in computational biology, where methods are the slums of Results, and are at best afforded a few sentences in supplementary materials of research articles.

Nevertheless, there are a handful of examples to the contrary, an important one being the paper “Neighbor-joining revealed” by Olivier Gascuel and Mike Steel, Molecular Biology and Evolution 23 (2006), p 1997–2000, which I have valued for many years and inspired me to write this post about a similar type of paper. In the neighbor-joining case, Gascuel and Steel review a series of results showing that neighbor-joining, first published in the form of an ad-hoc recipe with only metaphorical motivation, is actually a greedy algorithm for optimizing a least squares loss function (“Theoretical foundations of the balanced minimum evolution method of phylogenetic inference and its relationship to weighted least-squares tree fitting“, R Desper and O Gascuel, Molecular Biology and Evolution 21 (2004), p 587–598). The particular form of the minimization criterion emerges from a formula of Pauplin  (“Direct calculation of a tree length using a distance matrix“, Y Paulin, Journal of Molecular Evolution 10 (1993), p 1073–1095) that in turn has deeper roots. This is a story me and my students have played a small role in, providing some insight into the robustness of the neighbor-joining algorithm in “Why neighbor joining works” and an explanation for the surprisingly simple Pauplin’s formula in “Combinatorics of least squares trees“. These series of insights into neighbor-joining turn out to have practical significance. Results in the “Combinatorics of least squares trees” paper were recently extended by Fabio Pardi and Olivier Gascuel in “Combinatorics of distance-based tree inference“, where they synthesize all of the previous results on neighbor-joining into a blueprint for improving distance based methods. The details of this story will be the subject of a future blog post; here I discuss another beautiful example of work that involves understanding as opposed to research:

I have a long-standing interest in the computational problems that arise in metagenomics, starting with work I did with my former student Kevin Chen on “Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities“.  Recently, while drawing parallels between RNA-Seq and metagenomics (with Lorian Schaeffer and Kevin McLoughlin), I was thinking again about problems encountered in the paper “Disordered microbial communities in asthmatic airways” in which I played a minor role assisting William Cookson. Via a long chain of citation chasing, I eventually stumbled upon the elegant article “The phylogenetic Kantorovich–Rubinstein metric for environmental sequence samples” by Steven Evans and Erick Matsen, Journal of the Royal Statistical Society 74 (2012), p 569–592, in which the authors explain the meaning of the (weighted) UniFrac distance for metagenomic samples.

The weighted UniFrac distance is defined as follows: Consider a (rooted) tree T to which a total of n+m reads from two (metagenomics) experiments, A (reads)  and B (reads) have been assigned. By “assigned” we mean that every read is associated with a single leaf of the tree. If T has edges, and $l_i$ is the “length” of edge with $A_i, B_i$ the number of descendants of edge i in the tree, the weighted UniFrac distance is defined to be:

$\sum_{i=1}^E l_i \left| \frac{A_i}{m} - \frac{B_i}{n} \right|$.

Note that this formula is equation (1) in the Evans-Matsen paper (with the exception of a typo in the paper for the indexing of the summation where n has been used twice).

The intuitive meaning of UniFrac is illustrated in an example on a website of Rob Knight (figure shown below), who together with Catherine Lozupone popularized UniFrac in metagenomics analyses by providing easy to use software for its computation.

In the figure edge lengths are weighted by the relative abundance of reads from the square and circle experiments. In the latter each read counts half as much because there are twice as many. The width of the edges is proportional to the extent to which each edge is weighted in the formula for UniFrac (grey edges have no weight, e.g. edge 3 does not contribute to the distance because the relative abundances below it are equal). Edges 1 and 2 contribute large weights because their descendants are strongly biased towards squares (1) vs. circles (2)

The essence of the Evans-Matsen result is an interpretation of UniFrac as a special case of the family of Wasserstein metrics. Wasserstein metrics have many aliases (e.g. in computer science they are known as the earth mover’s distance). Evans-Matsen use the names of Kantorovich and Rubinstein, who were the first to highlight the importance of Wasserstein metrics in probability theory, and provided an important duality theorem that is the path towards the main result of the paper. Specifically, the Wasserstein metric is given by

$Z(P,Q) = \mbox{inf} \left\{ \int_{S \times S} r(x,y) R(dx,dy):R \in \mathcal{R}(P,Q) \right\}$

where in the formula $(S,r)$ is a metric space, P and Q are Borel probability distributions, and $\mathcal{R}(P,Q)$ denotes the set of probability distributions R on the product space $S \times S$ with the probability that the marginal distributions of R are P and Q. The dual form described in the theorem of Kantorovich and Rubinstein is

$Z(P,Q) = \mbox{sup} \left\{ \int_{S} f(x)P(dx) - \int_{S} f(y)Q(dy):f \in \mathcal{L} \right\}$

where $\mathcal{L}$ is the set of functions $f:S \rightarrow \mathbb{R}$ with the Lipschitz property $|f(x)-f(y)| \leq r(x,y)$ for all $x,y \in S$. This formula can be interpreted in the case of trees, by first noting that for an edge weighted tree T, the distance on edges provides a natural (unique) Borel measure $\lambda$ on T. Specifically, $\lambda([x,y]) = d(x,y)$ for all pairs of points $x,y \in T$ where $d(x,y)$ is the length of the (unique) arc between them. Then, given a root $\rho$ in the tree, if $\tau (x)$ denotes the subtree below a point x, i.e. $\tau (x) = \{ y \in T:x \in [\rho,y] \}$, it can be shown that

$Z(P,Q) = \int_{T} |P \{ \tau (y) \} - Q \{ \tau(y) \} | \lambda (dy)$

In the special case that P assigns a point mass $\frac{1}{m}$ to each of the leaves in A, and Q assigns a point mass $\frac{1}{n}$ to each of the leaves in B, the Wasserstein metric above reduces to the weighted UniFrac distance.

What is the point of this explanation?

Understanding UniFrac as a Wasserstein metric immediately suggests various generalizations. First, the absolute value of the difference between P and can be replaced by any pseudometric f on $[0,1]$ so that

$\hat{Z}(P,Q) = \int_{T} f[P \{ \tau (y) \}, Q \{ \tau(y) \} ] \lambda (dy)$.

Unweighted UniFrac corresponds to the choice of $f(x,y) = 1$ when exactly one of x or y is greater than zero. Weighted UniFrac comes from $f(x,y) = |x-y|$. The fact that weighted UniFrac distance is invariant with respect to the choice of root in the tree can now be seen to emerge from the property that $f(x,y) = f(1-x,1-y)$. These generalizations can also be extended from the 1st Wasserstein distance to the pth Wasserstein distance. Evans and Matsen argue that the case $p=2$

$Z_2^2(P,Q) = \int_{T} |P \{ \tau (y) \} - Q \{ \tau(y) \} |^2 \lambda (dy)$

is particularly appealing due to its interpretation as an average of the expected difference between the inter and intra distances for pairs of samples from and Q. That is, they show that

$Z_2^2(P,Q) = \frac{1}{2} \left( \mathbb{E} [d (X',Y') - d (X',X'')] + \mathbb{E} [d (X',Y') - d (Y',Y'')] \right)$

where $X',X''$ have distribution P on T and $Y',Y''$ have distribution on T.

The paper also discusses other issues of importance in using UniFrac, namely how to normalize distances so that they are comparable across trees, and how to test for significance of “clustering” between two samples. For the latter the main result is a faster (approximate) computation that may be more effective in practice than the current permutation testing that is used.

The main point of the paper though, is that the interpretation of UniFrac as a Wasserstein metric, provides not only insight into the properties of the distance measure, but also generalizations that may extend the possibilities for comparing and contrasting metagenomic communities. A triumph for understanding.