“Basically, I’m not interested in doing research and I never have been. I’m interested in understanding, which is quite a different thing. And often to understand something you have to work it out yourself because no one else has done it.” – David Blackwell

The importance of *understanding* is, unfortunately, frequently downplayed in computational biology, where methods are relegated to the slums of Results, at best afforded a few sentences in the supplementary materials of research articles.

Nevertheless, there are a handful of examples to the contrary, an important one being the paper “Neighbor-joining revealed” by Olivier Gascuel and Mike Steel, Molecular Biology and Evolution **23** (2006), p 1997–2000, which I have valued for many years and which inspired me to write this post about a similar type of paper. In the neighbor-joining case, Gascuel and Steel review a series of results showing that neighbor-joining, first published in the form of an ad-hoc recipe with only metaphorical motivation, is actually a greedy algorithm for optimizing a least-squares loss function (“Theoretical foundations of the balanced minimum evolution method of phylogenetic inference and its relationship to weighted least-squares tree fitting”, R Desper and O Gascuel, Molecular Biology and Evolution **21** (2004), p 587–598). The particular form of the minimization criterion emerges from a formula of Pauplin (“Direct calculation of a tree length using a distance matrix”, Y Pauplin, Journal of Molecular Evolution **51** (2000), p 41–47) that in turn has deeper roots. This is a story in which my students and I have played a small role, providing some insight into the robustness of the neighbor-joining algorithm in “Why neighbor joining works” and an explanation for Pauplin’s surprisingly simple formula in “Combinatorics of least squares trees”. This series of insights into neighbor-joining turns out to have practical significance. Results in the “Combinatorics of least squares trees” paper were recently extended by Fabio Pardi and Olivier Gascuel in “Combinatorics of distance-based tree inference”, where they synthesize all of the previous results on neighbor-joining into a blueprint for improving distance-based methods. The details of this story will be the subject of a future blog post; here I discuss another beautiful example of work that involves *understanding* as opposed to *research*:

I have a long-standing interest in the computational problems that arise in metagenomics, starting with work I did with my former student Kevin Chen on “Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities”. Recently, while drawing parallels between RNA-Seq and metagenomics (with Lorian Schaeffer and Kevin McLoughlin), I was thinking again about problems encountered in the paper “Disordered microbial communities in asthmatic airways”, in which I played a minor role assisting William Cookson. Via a long chain of citation chasing, I eventually stumbled upon the elegant article “The phylogenetic Kantorovich–Rubinstein metric for environmental sequence samples” by Steven Evans and Erick Matsen, Journal of the Royal Statistical Society, Series B **74** (2012), p 569–592, in which the authors explain the meaning of the (weighted) UniFrac distance for metagenomic samples.

The weighted UniFrac distance is defined as follows: Consider a (rooted) tree *T* to which a total of *n+m* reads from two (metagenomic) experiments, *A* (*m* reads) and *B* (*n* reads), have been assigned. By “assigned” we mean that every read is associated with a single leaf of the tree. If *T* has *E* edges, $l_i$ is the “length” of edge *i*, and $A_i$ (respectively $B_i$) is the number of reads from *A* (respectively *B*) assigned to leaves descending from edge *i*, the weighted UniFrac distance is defined to be:

$$d(A,B) \;=\; \sum_{i=1}^{E} l_i \left| \frac{A_i}{m} - \frac{B_i}{n} \right|.$$

Note that this formula is equation (1) in the Evans-Matsen paper (modulo a typo in the paper, where *n* is used twice in the indexing of the summation).
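To make the definition concrete, here is a minimal Python sketch of the formula above. The per-edge encoding and the read counts are hypothetical, made up for illustration: each edge is summarized by its length together with the number of reads from each sample assigned to leaves below it.

```python
# Minimal sketch: weighted UniFrac computed directly from the formula
# d(A,B) = sum_i l_i * |A_i/m - B_i/n|.  Toy data; encoding is hypothetical.

def weighted_unifrac(edges, m, n):
    """Each entry of `edges` is (l_i, A_i, B_i): edge length, reads from
    sample A below the edge, and reads from sample B below the edge."""
    return sum(l * abs(a / m - b / n) for (l, a, b) in edges)

# A toy rooted tree with four edges; counts below each edge are made up.
edges = [
    (1.0, 3, 0),  # edge whose descendants hold only A-reads
    (1.0, 0, 6),  # edge whose descendants hold only B-reads
    (0.5, 3, 6),  # equal relative abundances below: contributes nothing
    (2.0, 1, 2),  # 1/4 == 2/8: also contributes nothing
]
m, n = 4, 8  # total reads in samples A and B

print(weighted_unifrac(edges, m, n))  # 1.5
```

Note how the third edge drops out of the sum because the relative abundances below it agree ($3/4 = 6/8$), just like the grey edges in the figure below.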

The intuitive meaning of UniFrac is illustrated in an example on a website of Rob Knight (figure shown below), who together with Catherine Lozupone popularized UniFrac in metagenomics analyses by providing easy to use software for its computation.

In the figure, edge lengths are weighted by the relative abundance of reads from the square and circle experiments. In the latter, each read counts half as much because there are twice as many. The width of each edge is proportional to the extent to which it is weighted in the formula for UniFrac (grey edges have no weight; e.g. edge 3 does not contribute to the distance because the relative abundances below it are equal). Edges 1 and 2 contribute large weights because their descendants are strongly biased towards squares (1) vs. circles (2).

The essence of the Evans-Matsen result is an interpretation of UniFrac as a special case of the family of Wasserstein metrics. Wasserstein metrics have many aliases (e.g. in computer science they are known as the earth mover’s distance). Evans-Matsen use the names of Kantorovich and Rubinstein, who were the first to highlight the importance of Wasserstein metrics in probability theory, and who provided an important duality theorem that is the path towards the main result of the paper. Specifically, the (first) Wasserstein metric is given by

$$W(P,Q) \;=\; \inf_{R \in \mathcal{R}(P,Q)} \int_{X \times X} d(x,y)\, dR(x,y),$$

where in the formula $(X,d)$ is a metric space, *P* and *Q* are Borel probability distributions on *X*, and $\mathcal{R}(P,Q)$ denotes the set of probability distributions *R* on the product space $X \times X$ with the property that the marginal distributions of *R* are *P* and *Q*. The dual form described in the theorem of Kantorovich and Rubinstein is

$$W(P,Q) \;=\; \sup_{f \in \mathcal{L}} \left| \int_X f\, dP \;-\; \int_X f\, dQ \right|,$$

where $\mathcal{L}$ is the set of functions $f: X \rightarrow \mathbb{R}$ with the Lipschitz property $|f(x) - f(y)| \leq d(x,y)$ for all $x,y \in X$. This formula can be interpreted in the case of trees by first noting that for an edge-weighted tree *T*, the distance along edges provides a natural (unique) Borel measure $\lambda$ on *T*. Specifically, $\lambda(\{z : z \in [x,y]\}) = d(x,y)$ for all pairs of points $x,y \in T$, where $d(x,y)$ is the length of the (unique) arc between them. Then, given a root $\rho$ in the tree, if $\tau(x)$ denotes the subtree below a point *x*, i.e. $\tau(x) = \{y \in T : x \in [\rho, y]\}$, it can be shown that

$$W(P,Q) \;=\; \int_T \big| P(\tau(x)) - Q(\tau(x)) \big|\, \lambda(dx).$$
In the special case that *P* assigns a point mass of $\frac{1}{m}$ to the leaf of each of the *m* reads in *A*, and *Q* assigns a point mass of $\frac{1}{n}$ to the leaf of each of the *n* reads in *B*, the Wasserstein metric above reduces to the weighted UniFrac distance.
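As a sanity check on the subtree formula above, here is a small Python sketch that evaluates it on a toy tree. Since $\lambda$ is concentrated on the edges, the integral becomes a sum over edges of edge length times $|P(\tau) - Q(\tau)|$ for the subtree below that edge. The tree encoding (parent pointers with edge lengths) and the numbers are hypothetical; for point masses at two leaves the result should equal the path distance between them, as one expects of a transport distance.

```python
# Sketch: W(P,Q) on a tree as a sum over edges of
# length(e) * |P(subtree below e) - Q(subtree below e)|.
# Hypothetical tree: child -> (parent, edge length); node 0 is the root.
parent = {1: (0, 1.0), 2: (0, 2.0), 3: (1, 0.5), 4: (1, 0.5)}

def subtree_mass(node, masses):
    """Total mass of a leaf distribution at or below `node`."""
    total = masses.get(node, 0.0)
    for child, (par, _) in parent.items():
        if par == node:
            total += subtree_mass(child, masses)
    return total

def tree_wasserstein(P, Q):
    """First Wasserstein distance between two leaf distributions on the tree."""
    return sum(length * abs(subtree_mass(child, P) - subtree_mass(child, Q))
               for child, (par, length) in parent.items())

# Point masses at leaves 3 and 2: the path 3 -> 1 -> 0 -> 2
# has length 0.5 + 1.0 + 2.0, so we expect 3.5.
print(tree_wasserstein({3: 1.0}, {2: 1.0}))
```

Replacing the point masses by the empirical read distributions of two samples recovers the weighted UniFrac distance, per the special case described above.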

What is the point of this explanation?

Understanding UniFrac as a Wasserstein metric immediately suggests various generalizations. First, the absolute value of the difference between $P(\tau(x))$ and $Q(\tau(x))$ can be replaced by any pseudometric *f* on $[0,1]$, so that

$$d_f(P,Q) \;=\; \int_T f\big(P(\tau(x)), Q(\tau(x))\big)\, \lambda(dx).$$

Unweighted UniFrac corresponds to the choice $f(x,y) = 1$ when exactly one of *x* or *y* is greater than zero (and $f(x,y) = 0$ otherwise). Weighted UniFrac comes from $f(x,y) = |x - y|$. The fact that the weighted UniFrac distance is invariant with respect to the choice of root in the tree can now be seen to emerge from the property that $f(x,y) = f(1-x, 1-y)$. These generalizations can also be extended from the 1st Wasserstein distance to the *p*th Wasserstein distance. Evans and Matsen argue that the case $p = 2$, namely

$$Z_2(P,Q) \;=\; \left( \int_T \big(P(\tau(x)) - Q(\tau(x))\big)^2\, \lambda(dx) \right)^{1/2},$$
is particularly appealing due to its interpretation as an average of the expected difference between the inter- and intra-distances for pairs of samples from *P* and *Q*. That is, they show that

$$Z_2(P,Q)^2 \;=\; \mathbb{E}\, d(X,Y) \;-\; \tfrac{1}{2}\,\mathbb{E}\, d(X,X') \;-\; \tfrac{1}{2}\,\mathbb{E}\, d(Y,Y'),$$

where $X, X'$ are independent with distribution *P* on *T* and $Y, Y'$ are independent with distribution *Q* on *T*.
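The different choices of the pseudometric *f* can be compared on toy data. The following Python sketch (per-edge encoding and numbers are hypothetical, analogous to the earlier example) evaluates the generalized distance for the unweighted choice, the weighted choice, and the *p* = 2 variant:

```python
# Sketch of the generalized family: sum over edges of
# length * f(P-mass below edge, Q-mass below edge), for pluggable f.
import math

# Hypothetical per-edge summaries: (edge length, P-mass below, Q-mass below).
edges = [(1.0, 0.75, 0.0), (1.0, 0.0, 0.75), (0.5, 0.75, 0.75), (2.0, 0.25, 0.25)]

def d_f(edges, f):
    return sum(l * f(p, q) for (l, p, q) in edges)

weighted = lambda x, y: abs(x - y)                    # weighted UniFrac
unweighted = lambda x, y: float((x > 0) != (y > 0))   # unweighted UniFrac
z2 = math.sqrt(d_f(edges, lambda x, y: (x - y) ** 2)) # the p = 2 case

print(d_f(edges, weighted))    # 1.5
print(d_f(edges, unweighted))  # 2.0
print(z2)
```

The unweighted variant charges an edge whenever exactly one sample has mass below it, regardless of how much, which is why it exceeds the weighted distance on this toy example.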

The paper also discusses other issues of importance in using UniFrac, namely how to normalize distances so that they are comparable across trees, and how to test for the significance of “clustering” between two samples. For the latter, the main result is a faster (approximate) computation that may be more effective in practice than the permutation testing currently in use.

The main point of the paper, though, is that the interpretation of UniFrac as a Wasserstein metric provides not only insight into the properties of the distance measure, but also generalizations that may extend the possibilities for comparing and contrasting metagenomic communities. A triumph for *understanding*.

## 5 comments


September 18, 2013 at 7:26 am

Titus Brown (@ctitusbrown): I feel like I’ve been waiting for the information in this post for 8 years. Thanks!

September 19, 2013 at 4:31 pm

(Fred)erick Matsen (@ematsen): Hey, thanks, Lior (and Titus)! I just happened to see this on G+.

The assessment of significance is tricky, and there’s still some important work to be done there. Note that the randomization test (which seems to be very commonly used in association with UniFrac, and which we describe too) does not have wonderful properties in the setting of incomplete sampling with non-independent observations, as in metagenomic sampling. This problem is common whenever the identities of reads are not independent. Imagine, for example, that we have a random observation process on the tree equipped with some collection of “base observations.” Each process takes a random subset of those base observations and then throws down some number of reads for each observation in that set, the number of which has mean >> 1. If the set of base observations is large compared to the number of sample observations, then two draws will always appear significantly different even though they are from the same underlying process.

I didn’t mean to ramble on like this, but I hope someone takes it up. I would think that the folks doing modeling of these ecosystems should have the perspective to decide what we should call the “same” and “different”.

October 27, 2013 at 9:24 pm

Sébastien Boisvert: I learned a lot while reading this article. Thank you!

Typos:

“that every reads is associated” -> “every read”

October 27, 2013 at 9:26 pm

Lior Pachter: Thanks! Typo fixed.

May 20, 2015 at 10:26 am

Julian: Hi Lior,

Quick question (I’m not a mathematician, but I’m trying to understand how UniFrac works): does it actually use the branch lengths to determine how different the species are in the final calculation?