“Basically, I’m not interested in doing research and I never have been. I’m interested in understanding, which is quite a different thing. And often to understand something you have to work it out yourself because no one else has done it.” – David Blackwell

The importance of *understanding* is unfortunately frequently downplayed in computational biology, where methods are the slums of Results, and are at best afforded a few sentences in supplementary materials of research articles.

Nevertheless, there are a handful of examples to the contrary, an important one being the paper “Neighbor-joining revealed” by Olivier Gascuel and Mike Steel, Molecular Biology and Evolution **23** (2006), p 1997–2000, which I have valued for many years and inspired me to write this post about a similar type of paper. In the neighbor-joining case, Gascuel and Steel review a series of results showing that neighbor-joining, first published in the form of an ad-hoc recipe with only metaphorical motivation, is actually a greedy algorithm for optimizing a least squares loss function (“Theoretical foundations of the balanced minimum evolution method of phylogenetic inference and its relationship to weighted least-squares tree fitting“, R Desper and O Gascuel, Molecular Biology and Evolution **21** (2004), p 587–598). The particular form of the minimization criterion emerges from a formula of Pauplin (“Direct calculation of a tree length using a distance matrix“, Y Paulin, Journal of Molecular Evolution **10** (1993), p 1073–1095) that in turn has deeper roots. This is a story me and my students have played a small role in, providing some insight into the robustness of the neighbor-joining algorithm in “Why neighbor joining works” and an explanation for the surprisingly simple Pauplin’s formula in “Combinatorics of least squares trees“. These series of insights into neighbor-joining turn out to have practical significance. Results in the “Combinatorics of least squares trees” paper were recently extended by Fabio Pardi and Olivier Gascuel in “Combinatorics of distance-based tree inference“, where they synthesize all of the previous results on neighbor-joining into a blueprint for improving distance based methods. The details of this story will be the subject of a future blog post; here I discuss another beautiful example of work that involves *understanding* as opposed to *research:*

I have a long-standing interest in the computational problems that arise in metagenomics, starting with work I did with my former student Kevin Chen on “Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities“. Recently, while drawing parallels between RNA-Seq and metagenomics (with Lorian Schaeffer and Kevin McLoughlin), I was thinking again about problems encountered in the paper “Disordered microbial communities in asthmatic airways” in which I played a minor role assisting William Cookson. Via a long chain of citation chasing, I eventually stumbled upon the elegant article “The phylogenetic Kantorovich–Rubinstein metric for environmental sequence samples” by Steven Evans and Erick Matsen, Journal of the Royal Statistical Society **74** (2012), p 569–592, in which the authors explain the meaning of the (weighted) UniFrac distance for metagenomic samples. Read the rest of this entry »

### Like this:

Like Loading...

## Recent Comments