One of my favorite ideas in phylogenetics is that of a phylogenetic orange. The idea and the terminology come from a classic paper by Junhyong Kim, “Slicing hyperdimensional oranges: the geometry of phylogenetic estimation”, Molecular Phylogenetics and Evolution, 17 (2000), p 58–75. In the words of the author (from the abstract):

A new view of phylogenetic estimation is presented where data sets, tree evolution models, and estimation methods are placed in a common geometric framework.

Prior to Kim’s paper the term “space of phylogenetic trees” was used only metaphorically. Moreover, even though there were many papers proposing definitions for the “distance” between trees (the most popular being the Robinson-Foulds distance), the ideas were disconnected from the models used to analyze data, and therefore there was little hope of building a coherent theory of phylogenetics. The phylogenetic orange that Kim defined served both to define a “tree space” and to provide a geometric framework for thinking about data and estimation in that space.

Central to Kim’s space are the (Markov) tree models that form the basis of statistical phylogenetics. In the notation of Semple and Steel, a tree model specifies a family of distributions on the leaves of a (rooted) phylogenetic X-tree, parameterized by “rate” matrices on the edges of the tree. Rate matrices, usually denoted by Q, are $k \times k$ matrices where k is the number of characters in the model (e.g. k=4 for DNA or k=20 for amino acids). They have row sums equal to zero, with non-negative entries off the diagonal and negative entries on the diagonal. A Markov model is associated to the tree where the transition matrices $\{T_e\}_{e \in E}$ for the edges E of the tree are given by matrix exponentials $T_e = e^{Q_{e}t}$, where the $Q_e$ are the rate matrices associated to edges.

For a fixed tree T, the Markov model specified above can be used to calculate the probability of all of the $k^n$ (where $n=|X|$) patterns at the leaves, and this set of probabilities forms a vector of length $k^n$ that is a point in a simplex (since the probabilities sum to 1). As the parameters are varied, the points form a “space” (that Kim refers to informally as a “manifold”), and the different manifolds corresponding to the different tree topologies fit together in the simplex to form a phylogenetic orange.

The term orange arises from the observation that the points in the simplex corresponding to the different tree topologies approach each other as the branch lengths go to zero and as the branch lengths go to infinity. To see this, consider that as branch lengths grow to infinity the Markov model becomes equivalent to the independence model, as the probabilities of the character patterns all converge to the equilibrium distribution. Similarly, as branch lengths shrink to zero the transition matrices approach the identity, so every topology concentrates its probability on the patterns in which all leaves share the same character. This is shown in the following adaptation of Figure 5 in Kim’s paper:

In his paper Kim discusses at length the geometric interpretation of data and estimation in a phylogenetic orange. There are many results in the paper, which I will not review here, and also many prescient remarks that have taken years to fully understand and explore. One example is the paper “Phylogenetic mixtures on a single tree can mimic a tree of another topology” by Frederick A. Matsen and Mike Steel, which provides a key insight: “the combined data from two genes whose phylogenetic trees differ only in terms of branch lengths can perfectly fit a tree of a different topology”.
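To make the orange picture concrete, here is a minimal sketch (not taken from Kim’s paper) that builds transition matrices as matrix exponentials and computes the full vector of leaf-pattern probabilities for two quartet topologies. It assumes the Jukes-Cantor model with every off-diagonal rate equal to 1, a uniform distribution at an internal node, and all five edges sharing the same length t. The function names (`jc_rate_matrix`, `mat_exp`, `quartet_distribution`, `topology_gap`) are hypothetical helpers introduced only for illustration, and the matrix exponential is a simple truncated Taylor series with repeated squaring rather than a production routine.

```python
import itertools
import math

K = 4  # number of characters (DNA); an assumption for this sketch


def jc_rate_matrix(alpha=1.0):
    """Jukes-Cantor rate matrix: alpha off the diagonal, rows summing to zero."""
    return [[-(K - 1) * alpha if i == j else alpha for j in range(K)]
            for i in range(K)]


def mat_mul(A, B):
    """Plain product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(Q, t, squarings=10, terms=20):
    """Approximate exp(Q t): Taylor series of exp(Q t / 2**s), then s squarings."""
    n = len(Q)
    scale = t / 2 ** squarings
    A = [[q * scale for q in row] for row in Q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for m in range(1, terms + 1):
        term = [[x / m for x in row] for row in mat_mul(term, A)]
        result = [[r + x for r, x in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def quartet_distribution(split, t, Q):
    """Leaf-pattern distribution of a quartet whose five edges all have length t.

    split = ((a, b), (c, d)) names the leaf indices forming each cherry.
    Returns a dict mapping each of the K**4 patterns to its probability.
    """
    (a, b), (c, d) = split
    T = mat_exp(Q, t)
    dist = {}
    for pattern in itertools.product(range(K), repeat=4):
        p = 0.0
        for u in range(K):      # state at the internal node joined to leaves a, b
            for v in range(K):  # state at the internal node joined to leaves c, d
                p += ((1.0 / K) * T[u][pattern[a]] * T[u][pattern[b]]
                      * T[u][v] * T[v][pattern[c]] * T[v][pattern[d]])
        dist[pattern] = p
    return dist


def topology_gap(t):
    """L1 distance between the pattern distributions of 12|34 and 13|24."""
    Q = jc_rate_matrix()
    p = quartet_distribution(((0, 1), (2, 3)), t, Q)
    q = quartet_distribution(((0, 2), (1, 3)), t, Q)
    return sum(abs(p[x] - q[x]) for x in p)


if __name__ == "__main__":
    for t in (1e-4, 0.2, 10.0):
        print(t, topology_gap(t))
```

Under these assumptions the gap between the two topologies should be small for very short and very long edges and larger in between, which is the “orange” shape described above. The L1 distance is used here only as a convenient way to compare two points in the simplex.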

One missing piece from Kim’s paper is a discussion of metrics. From the point of view of the paper, it seems natural to determine distances between phylogenetic trees using metrics on the leaf pattern probability distributions, such as the square root of the Jensen-Shannon divergence, but such ideas have not been carefully explored. Instead, a different notion of tree space with a natural metric (but one that is not quite right for phylogenetics) has become a popular object of study.
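As a rough illustration of what such a metric could look like, the sketch below computes the square root of the Jensen-Shannon divergence between two leaf-pattern distributions. The example distributions come from a two-leaf tree under the Jukes-Cantor model with off-diagonal rate 1, which is an assumption made for brevity; `two_leaf_distribution` and `js_distance` are hypothetical names, not functions from any phylogenetics package.

```python
import math

K = 4  # number of characters (DNA); an assumption for this sketch


def two_leaf_distribution(t):
    """Pattern distribution for two leaves separated by Jukes-Cantor path length t."""
    e = math.exp(-4.0 * t)
    same = (0.25 + 0.75 * e) / K  # probability of a pattern (a, a)
    diff = (0.25 - 0.25 * e) / K  # probability of a pattern (a, b) with a != b
    return [same if a == b else diff for a in range(K) for b in range(K)]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding error


if __name__ == "__main__":
    short_tree = two_leaf_distribution(0.05)
    long_tree = two_leaf_distribution(0.5)
    print(js_distance(short_tree, long_tree))
```

Because the mixture m is positive wherever p or q is, this quantity is always finite, unlike the KL divergence, and it is symmetric in its two arguments.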

In 2001, shortly after Kim’s paper appeared, Lou Billera, Susan Holmes and Karen Vogtmann published “The geometry of the space of phylogenetic trees”, Advances in Applied Mathematics, 27 (2001), p 733–767. The Billera-Holmes-Vogtmann space (abbreviated BHV) is what is known as a CAT(0) space, that is, a space whose natural metric has non-positive curvature. An important property of the space, one that has been exploited in subsequent work, is that any two trees are joined by a geodesic (a unique shortest path). A formal definition of the BHV space is beyond the scope of this post, but intuitively it can be understood to be a tree space where one “moves” from one tree to another by growing and shrinking edges (see, e.g., Figure 18 in the BHV paper).

A natural question that emerged from the BHV paper was how to (efficiently) compute distances in the space. Almost 10 years after the publication of BHV, Megan Owen and Scott Provan answered the question by providing an $O(n^4)$ algorithm. Since that time, Owen and collaborators have extended the distance computation algorithm to the computation of Fréchet means, and recently have provided a final crucial ingredient for statistics in the BHV tree space, namely a central limit theorem for Fréchet means.

However, there is a fundamental dichotomy between the phylogenetic orange and the BHV space that is problematic: consider two (initially) different trees $T_1,T_2$ where the branch lengths of both grow towards infinity. In the phylogenetic orange the distance (computed, for example, as KL divergence) between the trees converges to zero, whereas in the BHV space the distance approaches infinity.

The figure below shows a simple experiment for two trees, each with two leaves, where one is fixed and the branch length of the other grows to infinity. The x axis shows the branch length difference and the y axis the KL divergence:

As expected, the distance between the trees converges to a constant value. In other words, for large branch lengths the intrinsic metric associated to the BHV space is problematic.
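An experiment of this kind can be sketched in a few lines. The version below is not the code behind the figure; it assumes the Jukes-Cantor model with off-diagonal rate 1, treats each two-leaf tree as a single path of length t, and uses KL(fixed || growing), since the post does not say which direction was used. The names `two_leaf_distribution` and `kl_curve` are hypothetical.

```python
import math

K = 4  # number of characters (DNA); an assumption for this sketch


def two_leaf_distribution(t):
    """Pattern distribution for two leaves separated by Jukes-Cantor path length t."""
    e = math.exp(-4.0 * t)
    same = (0.25 + 0.75 * e) / K  # probability of a pattern (a, a)
    diff = (0.25 - 0.25 * e) / K  # probability of a pattern (a, b) with a != b
    return [same if a == b else diff for a in range(K) for b in range(K)]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_curve(t_fixed, deltas):
    """KL divergence from the fixed tree to a tree longer by each delta."""
    p = two_leaf_distribution(t_fixed)
    return [kl_divergence(p, two_leaf_distribution(t_fixed + d)) for d in deltas]


if __name__ == "__main__":
    deltas = [0.0, 0.1, 0.5, 1.0, 5.0, 50.0]
    for d, kl in zip(deltas, kl_curve(0.1, deltas)):
        # d plays the role of the branch length difference on the x axis
        print(d, kl)
```

Under these assumptions the divergence rises with the branch length difference and then flattens out at the KL divergence between the fixed tree and the uniform distribution on patterns, while the branch length difference itself keeps growing without bound.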

The silver lining is the approximately linear relationship between distance in the phylogenetic orange and the BHV space at short branch lengths. In that regime it may make sense to work with geodesic distance in the BHV space.

In summary, the phylogeny of orange (see Figure 1) is a point in a phylogenetic orange.

Additional notes: Both spaces discussed in this post have an algebraic description. The BHV space is a tropical Grassmannian (although the tropical Grassmannian is equipped with a different, extrinsic, metric) and a phylogenetic orange (also known as an edge-product space) is a special case of a toric cube. Some of these connections are explored in the “Algebraic Statistics for Computational Biology” book and an excellent introduction will be the forthcoming “Introduction to Tropical Geometry”.