My students run a journal club that meets once a week and last Friday the paper we discussed was M. Imakaev et al., Iterative correction of Hi-C data reveals hallmarks of chromosome organization, Nature Methods 9 (2012), p 999–1003. The paper, from Leonid Mirny’s group, describes an approach to analyzing Hi-C data, and is one of only a few papers that have been published focusing on methods for this *Seq assay. Present at journal club were my students Nicolas Bray, Shannon Hateley, Harold Pimentel, Atif Rahman, Lorian Schaeffer, Akshay Tambe, Faraz Tavakoli, postdocs Sreeram Kannan and Shannon McCurdy, and guest Emily Brown from Doris Bachtrog’s group. They all read the paper prior to our meeting, and the notes below are a summary of our discussion (although I have taken some more time to flesh out the math for the interested reader). For those who are impatient, the bottom line is that I think the title of the paper by Imakaev et al. should have had a few extra words (mine in italics) so that it would be “Iterative proportional fitting of a log-linear model for correction of Hi-C data reveals hallmarks of chromosome organization”.

The main (new) result in the Imakaev et al. paper is an iterative procedure for “correcting” Hi-C data. Following the correction, an eigendecomposition is proposed for the resulting “relative contact probability” matrix. The latter step is essentially the analysis idea from the original Hi-C paper by Erez Lieberman-Aiden et al. Together, these steps form a pipeline for analysis of Hi-C data that the authors term Iterative Correction and Eigendecomposition (ICE). Although we discussed the “E” of ICE at group meeting (and I have some questions about it), I don’t get into it in this post. I focus on the iterative correction, which in the notation of the paper is as follows:

First, the genome is divided into n bins (or regions) from which an nxn matrix O of contact counts is obtained. In this matrix, the ij-th entry $O_{ij}$ is the number of “double-sided” (DS) reads that mapped respectively to the ith and jth bins. The counts are obtained by mapping the reads to the bins, which are defined to be regions of size either 1-Mb or 200-kb. The paper also considers “single-sided” reads, which arises from contacts between mappable and unmappable parts of the genome, but for simplicity we don’t discuss their treatment here. A key point is that the matrix O has its diagonal and one-off diagonal elements zeroed out. The rationale, according to the paper, is that “within a bin and with adjacent bins, … contacts occur at scales finer than the given resolution”. In other words, because the matrix O is symmetric, the data consists of $\frac{n^2-n}{2} - (n-1) = {n-1 \choose 2}$ numbers.

Given O, the goal is now to find another nxn matrix T with zeroes on the diagonal and off-diagonal satisfying the n constraints $\sum_{i=1}^n T_{ij} = 1$, together with a vector of length n called B, so that the ${n-1 \choose 2}$ constraints $O_{ij} = B_iB_jT_{ij}$                                            (1)

are satisfied. In other words, there are ${n-1 \choose 2} + n$ indeterminates to solve for using the same number of constraints. The intuition provided in the paper is that the vector B captures bias of various kinds, leaving the interaction terms T as the “corrected” data. More on this in a moment.

A procedure is provided for solving for and T: Begin by setting a matrix $W^0_{ij} = O_{ij}$ and the vector $B^0$ to be 1. Then perform the iterations:

1. Calculate $S_i=\sum_j W^k_{ij}$ and let $\overline{S}_i = \frac{1}{n}\sum_{i=1}^n S_i$.
2. Set $\Delta B^k_i = \frac{S_i}{\overline{S}_i}$.
3. Set $W^{k+1}_{ij} = \frac{W^k_{ij}}{\Delta B_i \Delta B_j}$.
4. Set $B^{k+1}_i = B^k_i \cdot \Delta B^k_i$.

The hope is that the $B^k$ converge to B and the $W^k$ to T.

We were initially a bit perplexed by this method. First, the authors never prove convergence of the iterative algorithm. They  merely state that “for many maps, iterative correction converges in around 10 iterations. For certain cases, however, further iterations are required to pull in outliers” so they use 50 iterations because it is “twice the number required for the worst case encountered”. Second, although its easy to verify that a solution of equation (1) is a fixed point of the iteration, it is not clear that there is a unique solution. Third, neither the equation (1), nor the description of the iterative procedure provide an answer to the question why this method? The authors provide some explanation for what the iterative algorithm might be accomplishing in the Supplement (page 21) but as I discuss below, it is incomplete. The motivation for the exact form of equation (1) is basically the empirical “validation” that it appears to do good things in practice and makes intuitive sense. Notably, the claim in the abstract that the method yields”genome-wide maps of relative contact probabilities” is true only in definition. That is to say, the authors have defined the matrix T by equation (1) and then called it the “map of relative contact probabilities”. Moreover, as explained above, it is completely unclear that the iterative procedure even recovers a T satisfying equation (1).

So what are they really doing, and what is going on?

As a warmup, it helps to think about a model that leads to a simplification of equation (1). Suppose that instead of asking for a decomposition of $O_{ij}$ as $B_iB_j T_{ij}$, we instead adopt the following mindset: we imagine that the counts $O_{ij}$ are the result of observing Poisson (or multinomially) distributed random variables $X_{ij}$ where $E[X_{ij}] = NB_i B_j$ for some vector B whose elements sum to 1 and constant N This model is called the model of independence because the observed counts O can be viewed as being generated by repeating N times the process of choosing twice (independently) random indices according to the distribution B. It is a special case of a log-linear model due to the following description: a count in cell i,j can be modeled as having occurred with probability $log p_{ij} = c + u_i + u_j - Z$  where $c = log N$, $u_i = log B_i$ and Z ( $= c$) is a normalization factor ensuring that the probabilities p sum to 1.

Estimation of the maximum likelihood parameters of this model has been studied extensively in the statistics literature. For starters, if the row and column sums of O are positive, and O can not be permuted into block-diagonal shape, then there is a unique maximum to the likelihood function. In my book Algebraic Statistics for Computational Biology, coauthored with Bernd Sturmfels, we provided an introduction to log-linear models, also called toric models, with a presentation of Birch’s theorem (our Theorem 1.10) which generalizes the result above.

Assuming the MLE is unique, the following algorithm will converge to it: Set $\epsilon^0_{ij} = 1$ and then iterate $\epsilon^{k+1}_{ij} = \sum_j O_{ij} \frac{\epsilon^k_{ij}}{\sum_{j=1}^n \epsilon^k_{ij}}$.

This algorithm is known as iterative proportional fitting. It will produce a matrix $\epsilon$ of rank 1 with the same row and column sums as O, i.e. it will factor as $\epsilon=B_i B_j$.

Iterative proportional fitting has a long history starting with its introduction by W. Edwards Deming and Frederick F. Stephan in 1940 in the paper “On a least squares adjustment of a sampled frequency table when the expected marginal totals are known“, published in the Annals of Mathematical Statistics. The algorithm is motivated by computing the roots of the partial derivatives of the log likelihood function. In the setting above, assuming a Poisson model, we have that $L(B) = \sum_{i,j} O_{ij} \cdot (logB_i+logB_j)-NB_iB_j+C$

where C is a constant. From this we obtain that $\frac{\sum_{j}O_{ij}}{B_i} = N \sum_{j} B_j = N$ so that $N \cdot B_i = \sum_{j} \epsilon_{ij} = \sum_j O_{ij}$. In other words, at the maximum likelihood setting of the parameters, it must be the case that the model margins equal the observed margins. This is the observation of Imakaev et al. on page 21 of the supplement where the partial derivatives are computed. However this doesn’t explain how to solve for the maximum likelihood parameters. To get an algorithm, we multiply both sides of the margin identity by $\epsilon_{ij}$ to obtain $\epsilon_{ij} \cdot \sum_j \epsilon_{ij} = \epsilon_{ij} \cdot \sum_j O_{ij}$,

or $\epsilon_{ij} = \frac{\epsilon_{ij} \cdot \sum_j O_{ij}}{\sum_j \epsilon_{ij}}$.

Fixing the right hand side, while solving for the left hand sides, leads to the recursion $\epsilon^{k+1}_{ij} = \sum_j O_{ij} \frac{\epsilon^k_{ij}}{\sum_{j=1}^n \epsilon^k_{ij}}$.

which is exactly iterative proportional fitting (IPF).

A standard extension of the independence model above is to add an interaction term, so that $log p_{ij} = c + u_i + u_j + z_{ij} - Z$. This model is still log-linear. Upon exponentiation, and setting $T_{ij} = e^{z_{ij}}$, we obtain the likelihood function that appears in the Supplement of Imaev et al. (page 21, equation starting LL), or equivalently the constraints $\epsilon_{ij} = B_i B_j T_{ij}$.

Applying the principle explained above for deriving IPF to the partial derivatives, one arrives at the iterative correction of Imakaev et al. that I described in Steps 1–4 above.

The understanding that Imakaev et al. have rediscovered IPF in the context of a specific log-linear model is useful. First, there is a large (complicated and non-trivial) literature on the convergence properties of IPF, as well as a lot of work “explaining” the algorithm. For example it has the interpretation of minimizing the KL divergence from the observed distribution to the model distribution, which leads to an interpretation of IPF as coordinate ascent in the likelihood. Furthermore, there is work on regularization which could very well be useful in the context of Hi-C data. The work on IPF also shows that the convergence claims of Imakaev et al. depend on assumptions about the non-zero entries and row/column sums in the observation matrix, a fact that is glossed over in the paper and that might come into play as Hi-C is applied in different settings. Sometimes ignorance is not bliss!

Having said that, the take home message is that the Imakaev et al. iterative correction makes sense, and (fortunately) there is an established theory underpinning the algorithm. For these reasons I like the paper, and I also appreciate the open source distribution of the ICE software by Mirny’s group. However I can’t help but wonder what reviewer #3 would have written if the paper had been submitted with the extended title I have proposed, an acknowledgment that underlying the iterative correction in the ICE pipeline is an application of a standard inference algorithm (IPF) to a standard log-linear model for interaction, and references to the large body of literature on IPF.